I adjusted the number of layers in the main decoder to equalize the parameter count across models at roughly 40M.
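(For context, the comparison is simply over total trainable parameters; `count_params` below is a hypothetical helper, not our actual training code:)

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    # Total trainable parameters; decoder depth is adjusted until
    # this lands at roughly 40M for every model being compared.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```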
I wrote the "experiment setup" section.
Our model is grappling with excessive recursive generation, where it repeats the same tokens over and over. To address this, we could either train with Focal Loss or use a sampling method to shape the outputs at decoding time.
Given that our vocabulary is small compared to typical NLP vocabularies, and that Conti tokens occupy a large portion of each tune, relatively easy samples dominate training. Using Focal Loss down-weights these easy samples so the model does not over-learn them, and this has proven effective in preventing degeneration.
FL(p_t) = -(1 - p_t)^γ log(p_t) (from "Focal Loss for Dense Object Detection", Tsung-Yi Lin et al.)
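As a reference, here is a minimal PyTorch sketch of this loss applied at the token level (the function name, the default gamma, and the ignore_index handling are illustrative assumptions, not our exact training code):

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, ignore_index=-100):
    """Token-level focal loss: down-weights easy (high-confidence) tokens.

    logits:  (batch * seq_len, vocab_size) raw scores
    targets: (batch * seq_len,) token ids
    gamma is a placeholder; it has to be searched per setup.
    """
    # Per-token cross-entropy, kept unreduced so we can reweight it.
    ce = F.cross_entropy(logits, targets, reduction="none",
                         ignore_index=ignore_index)
    # p_t: the model's probability of the correct token.
    p_t = torch.exp(-ce)
    # FL(p_t) = -(1 - p_t)^gamma * log(p_t) = (1 - p_t)^gamma * ce
    loss = (1.0 - p_t) ** gamma * ce
    mask = targets != ignore_index
    return loss[mask].mean()
```

With gamma = 0 this reduces to plain cross-entropy, so the strength of the down-weighting is controlled entirely by gamma.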
some samples become shorter
some samples show instruments that are uncommon in classical music
However, Focal Loss has its drawbacks. Unlike the sampling method, it is not versatile: it requires a parameter search to obtain the desired outputs from our generation model, and we remain uncertain about its side effects. For instance, if the encoding method is altered, the focal loss parameters would also need to be re-tuned; we observed that output quality was compromised when we added Conti to the instrument feature.
no Focal + no Conti + sampling method -> better quality?
Nevertheless, we can identify sampling settings (top-p and temperature) that yield high-quality outputs without recursive generation.
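For illustration, a minimal sketch of one decoding step with temperature plus top-p (nucleus) filtering (the function name and the default values are placeholders, not our tuned settings):

```python
import torch

def sample_top_p(logits, top_p=0.9, temperature=1.0):
    """Sample one next-token id using temperature + nucleus (top-p) filtering.

    logits: (vocab_size,) raw scores for the next token.
    top_p and temperature are placeholders; the useful ranges have to be
    found empirically per model.
    """
    probs = torch.softmax(logits / temperature, dim=-1)
    # Sort tokens by probability; keep the smallest set whose mass >= top_p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cum_probs = torch.cumsum(sorted_probs, dim=-1)
    # Mask tokens outside the nucleus, shifted so the first token that
    # crosses the threshold is still kept.
    cutoff = cum_probs > top_p
    cutoff[1:] = cutoff[:-1].clone()
    cutoff[0] = False
    sorted_probs[cutoff] = 0.0
    sorted_probs /= sorted_probs.sum()
    # Sample within the nucleus and map back to the original vocabulary id.
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice]
```

Raising temperature or top_p increases diversity; lowering them makes the output more deterministic, which is where the repetition tends to come back.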
Check samples!