by Shangda Wu, Xiaobing Li, Feng Yu and Maosong Sun
Introduction
A Transformer-based dual-decoder model that combines bar patching and control codes to efficiently generate expressive Irish music in ABC notation.
from TunesFormer, Shangda Wu et al.
Contribution
- As a dual-decoder model based on bar patching, TunesFormer significantly accelerates
generation speed while maintaining the quality of the generated music.
- TunesFormer enables users to generate melodies with diverse musical forms, providing
flexibility and alignment with artistic vision through control codes.
- To support future research, we release the Irish Massive ABC Notation (IrishMAN) dataset, an open-source collection of 216,284 Irish tunes in the ABC notation format.
Methodology
Given $L$ as the sequence length and $P$ as the patch size, bar patching reduces the patch-level decoder complexity from $O(L^2)$ to $O(L^2/P^2)$. Meanwhile, the character-level decoder handles $L/P$ patches at $O(P^2)$ each, so its complexity becomes $O(P^2) \cdot (L/P) = O(LP)$. With $M$ and $N$ as the parameter sizes of the patch-level and character-level decoders respectively, the computational cost shifts from $(M+N) \cdot L^2$ to $M \cdot (L^2/P^2) + N \cdot LP$.
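To get a feel for the size of the saving, the two cost expressions can be evaluated for concrete numbers. This is only a rough sketch: the values of L, P, M and N below are illustrative assumptions rather than the paper's configuration, and constant factors are ignored.

```python
def flat_cost(L, M, N):
    """Cost of one decoder with M + N parameters attending over L characters."""
    return (M + N) * L ** 2


def patched_cost(L, P, M, N):
    """Cost with bar patching: patch-level plus character-level decoder."""
    patch_level = M * (L ** 2 / P ** 2)  # attention over L/P patches
    char_level = N * L * P               # L/P patches, each costing P^2
    return patch_level + char_level


# Hypothetical values, chosen only for illustration.
L, P = 1024, 32
M, N = 3, 1  # relative parameter sizes of the two decoders

speedup = flat_cost(L, M, N) / patched_cost(L, P, M, N)
print(f"estimated reduction in cost: {speedup:.1f}x")
```

With these numbers the character-level term N∗LP dominates the patched cost, which suggests that the choice of patch size trades off the two decoders against each other.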
Control Codes
from CTRL: A Conditional Transformer Language Model for Controllable Generation, Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher
• S:number of sections - Sets the number of sections in the melody, ranging from 1 to 8 (e.g., S:1 for a single-section melody, and S:8 for a melody with eight sections). Sections are counted from the symbols that mark section boundaries: [|, ||, |], |:, ::, and :|.
• B:number of bars - Sets the number of bars within a section, counted from the bar symbol |. The range is 1 to 32 (e.g., B:1 for a one-bar section, and B:32 for a section with 32 bars).
• E:edit distance similarity - Controls the similarity between the current section $c$ and a previous section $p$. It is derived from the Levenshtein distance [16] $lev(c, p)$ and measures how much the two sections differ:
$$eds(c, p) = 1 - \frac{lev(c, p)}{\max(|c|, |p|)}$$
where $|c|$ and $|p|$ are the string lengths of the two sections. It is discretized into 11 levels, ranging from 0 to 10 (e.g., E:0 for no similarity, and E:10 for an exact match). For the $N$-th section, there are $N - 1$ previous sections to compare with.
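The edit distance similarity can be sketched in a few lines of plain Python. The rounding rule used to map the score onto the 11 levels is an assumption (the notes only say the value is discretized into levels 0 to 10), and the function names `levenshtein`, `eds` and `eds_level` are made up for this illustration.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def eds(c, p):
    """eds(c, p) = 1 - lev(c, p) / max(|c|, |p|)."""
    longest = max(len(c), len(p))
    if longest == 0:
        return 1.0  # assumption: two empty sections count as identical
    return 1 - levenshtein(c, p) / longest


def eds_level(c, p):
    """Map the similarity onto levels 0..10 (rounding is an assumption)."""
    return round(eds(c, p) * 10)


current = "ABcd|efga|"
previous = "ABcd|efgf|"
print(f"E:{eds_level(current, previous)}")
```

Here the two sections differ by a single character out of ten, so the similarity is 0.9 and the resulting control code is E:9.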
Dataset
216,284 Irish ABC tunes sourced from thesession.org and abcnotation.com
Uniformity is maintained by converting the tunes to XML and back using scripts (https://wim.vree.org/svgParse/index.html).
Training code
The training code looks easy to follow; Hugging Face is a good tool for this.
data for feeding
patch to embedding
This is the tricky part. The authors encode each character of a patch as a one-hot vector, flatten the result, and apply a linear layer on top.
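A dependency-free sketch of that idea follows. It is not the authors' code: the vocabulary size, patch size, embedding width and NUL padding are all assumptions picked to keep the example small, and the helper names are hypothetical. In PyTorch the same steps would typically be written with `torch.nn.functional.one_hot` followed by `torch.nn.Linear`.

```python
import random

VOCAB_SIZE = 128  # assumption: one slot per ASCII character
PATCH_SIZE = 4    # hypothetical, kept small for readability
EMBED_DIM = 8     # hypothetical


def one_hot_patch(patch):
    """Encode a patch as a flat vector of PATCH_SIZE * VOCAB_SIZE zeros and ones."""
    padded = patch[:PATCH_SIZE].ljust(PATCH_SIZE, "\0")  # NUL padding (assumption)
    flat = [0.0] * (PATCH_SIZE * VOCAB_SIZE)
    for pos, ch in enumerate(padded):
        if ord(ch) >= VOCAB_SIZE:
            raise ValueError(f"non-ASCII character: {ch!r}")
        flat[pos * VOCAB_SIZE + ord(ch)] = 1.0
    return flat


def linear(x, weight, bias):
    """y = W x + b, with W stored as a list of EMBED_DIM rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


rng = random.Random(0)
weight = [[rng.uniform(-0.1, 0.1) for _ in range(PATCH_SIZE * VOCAB_SIZE)]
          for _ in range(EMBED_DIM)]
bias = [0.0] * EMBED_DIM

patch_embedding = linear(one_hot_patch("|AB2"), weight, bias)
print(len(patch_embedding))
```

Because the flat vector has exactly one non-zero entry per character position, the linear layer effectively picks one weight column per position and sums them, so each position inside the patch gets its own set of character weights.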
scheduler
update step
Experiment
RWKV
from Efficient Transformers: A Survey, Yi Tay et al.
Metric
two objective metrics
- Efficiency: The number of tokens generated per second on an RTX 2080 Ti.
- Controllability: Quantifying control precision by comparing edit distance between generated and actual control codes.
comparative evaluations
Thirteen Irish musicians compared pairs of melodies: one from thesession.org with chord symbols, and a model-generated continuation from the initial two bars.
- Engagement: Captivating to the ear, evokes emotional resonance, and maintains the listener’s interest.
- Authenticity: Representing the distinctive characteristics of Irish traditional music.
- Harmoniousness: Creating a natural flow that unifies melody and harmony into a
cohesive and pleasing musical experience.
- Playability: Well-suited for performance and offers a wide range of playing techniques.
Question
validity in data split
making embedding
the difference between using nn.Embedding and one-hot+linear
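As a starting point for this question, here is a small sketch for a single character. It suggests that a one-hot vector passed through a bias-free linear layer returns the same values as a plain table lookup, which is what `nn.Embedding` does. The sizes and function names are hypothetical, and the code uses plain Python lists rather than PyTorch so it runs anywhere.

```python
import random

VOCAB_SIZE, EMBED_DIM = 6, 3  # tiny hypothetical sizes

rng = random.Random(0)
# table[i] holds the embedding of token i, as a lookup table would store it.
table = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
         for _ in range(VOCAB_SIZE)]


def embed_lookup(idx):
    """Embedding as a direct row lookup."""
    return table[idx]


def embed_one_hot(idx):
    """Embedding as one-hot vector times weight matrix, without bias."""
    one_hot = [1.0 if i == idx else 0.0 for i in range(VOCAB_SIZE)]
    return [sum(one_hot[i] * table[i][d] for i in range(VOCAB_SIZE))
            for d in range(EMBED_DIM)]


same = all(embed_lookup(i) == embed_one_hot(i) for i in range(VOCAB_SIZE))
print(same)
```

For a single character the two approaches therefore differ mainly in cost: the lookup avoids multiplying by zeros. The practical differences appear at patch level, where a linear layer over a flattened one-hot patch learns separate weights for each position in the patch (plus a bias, which `nn.Linear` includes by default), while a single `nn.Embedding` table is shared across positions.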