[Symbolic-Music-Encoding]#5.5 paper review, TunesFormer: Forming Irish Tunes with Control Codes by Bar Patching

Jude's Sound Lab · January 30, 2024

TunesFormer: Forming Irish Tunes with Control Codes by Bar Patching

by Shangda Wu, Xiaobing Li, Feng Yu and Maosong Sun

Introduction

a Transformer-based dual-decoder model that combines bar patching and control codes to efficiently generate expressive Irish music in ABC notation

from TunesFormer, Shangda Wu et al.

Contribution

• As a dual-decoder model based on bar patching, TunesFormer significantly accelerates generation speed while maintaining the quality of the generated music.
• TunesFormer enables users to generate melodies with diverse musical forms, providing flexibility and alignment with artistic vision through control codes.
• To support future research, we release the Irish Massive ABC Notation (IrishMAN) dataset, an open-source collection of 216,284 Irish tunes in the ABC notation format.

Methodology

TunesFormer

Given $L$ as the sequence length and $P$ as the patch size, bar patching reduces the patch-level decoder's complexity from $O(L^2)$ to $O(L^2 / P^2)$, because it attends over $L / P$ patches instead of $L$ characters. The character-level decoder handles each of the $L / P$ patches separately at a cost of $O(P^2)$ per patch, so its total complexity is $O(P^2) \cdot (L / P) = O(LP)$. With $M$ and $N$ as the parameter sizes of the patch-level and character-level decoders respectively, the computational cost shifts from $(M + N) \cdot L^2$ to $M \cdot (L^2 / P^2) + N \cdot LP$.
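To get a feel for the size of this saving, the two cost expressions can be plugged into a few lines of Python. This is only a back-of-the-envelope sketch: the function names are mine, and the values of L, P, M and N are illustrative assumptions, not numbers taken from the paper.

```python
def baseline_cost(L, M, N):
    """Relative cost when both decoders work on all L characters: (M + N) * L^2."""
    return (M + N) * L ** 2


def patched_cost(L, P, M, N):
    """Relative cost with bar patching: M * L^2 / P^2 + N * L * P."""
    num_patches = L // P                     # assumes P divides L, for simplicity
    patch_level = M * num_patches ** 2       # M * (L / P)^2
    char_level = N * num_patches * P ** 2    # N * (L / P) * P^2 = N * L * P
    return patch_level + char_level


# Illustrative values only (assumptions, not from the paper).
L, P, M, N = 4096, 32, 3, 1
speedup = baseline_cost(L, M, N) / patched_cost(L, P, M, N)
print(f"estimated cost ratio: {speedup:.1f}x")  # prints "estimated cost ratio: 372.4x"
```

With these numbers the character-level term dominates the patched cost, which suggests that the choice of $P$ is a trade-off between the two decoders rather than "bigger is always better".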

Control Codes

from CTRL: A Conditional Transformer Language Model for Controllable Generation, Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher

• S:number of sections - Sets the number of sections in the melody, ranging from 1 to 8 (e.g., S:1 for a single-section melody and S:8 for a melody with eight sections). Sections are delimited by the symbols [|, ||, |], |:, ::, and :|.
• B:number of bars - Sets the number of bars within a section, counted from the bar symbol |. The range is 1 to 32 (e.g., B:1 for a one-bar section and B:32 for a section with 32 bars).
• E:edit distance similarity - Controls the similarity between the current section $c$ and a previous section $p$. It is derived from the Levenshtein distance [16] $lev(c, p)$ and measures how much the two sections differ:
$eds(c, p) = 1 - \frac{lev(c, p)}{\max(|c|, |p|)}$
where $|c|$ and $|p|$ are the string lengths of the two sections. It is discretized into 11 levels, ranging from 0 to 10 (e.g., E:0 for no similarity and E:10 for an exact match). For the $N$-th section, there are $N - 1$ previous sections to compare with.
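The three codes can be derived from an ABC tune body roughly as follows. This is a minimal sketch, not the authors' preprocessing script: the function names are hypothetical, and the rounding used for discretization, the way bars are counted inside a section, and the order in which the codes are emitted are all my assumptions.

```python
import re

# Section boundary symbols listed above: [|  ||  |]  |:  ::  :|
SECTION_BOUNDARY = re.compile(r"\[\||\|\||\|\]|\|:|::|:\|")


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def eds_level(c: str, p: str) -> int:
    """eds(c, p) = 1 - lev(c, p) / max(|c|, |p|), mapped to the levels 0..10."""
    longest = max(len(c), len(p))
    if longest == 0:
        return 10
    eds = 1 - levenshtein(c, p) / longest
    return round(eds * 10)  # assumption: discretize by rounding to the nearest level


def control_codes(body: str) -> list:
    """Return S, B and E codes for a tune body (emission order is an assumption)."""
    sections = [s for s in SECTION_BOUNDARY.split(body) if s.strip()]
    codes = [f"S:{len(sections)}"]
    for i, section in enumerate(sections):
        # Assumption: a bar is any non-empty chunk between "|" symbols.
        bars = [b for b in section.split("|") if b.strip()]
        codes.append(f"B:{len(bars)}")
        for previous in sections[:i]:  # the N-th section has N - 1 comparisons
            codes.append(f"E:{eds_level(section, previous)}")
    return codes


print(control_codes("|:GABc|dedB::GABc|dedG:|"))
# prints ['S:2', 'B:2', 'B:2', 'E:9']
```

In the example the two sections differ by a single character out of nine, so $eds = 1 - 1/9 \approx 0.89$, which lands on level 9.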

Dataset

216,284 Irish ABC tunes sourced from thesession.org and abcnotation.com

Uniformity is maintained by converting the tunes to XML and back using scripts (https://wim.vree.org/svgParse/index.html).

Training code

Hugging Face Transformers

The training code looks easy to carry over to other projects. Hugging Face is a good tool.

data for feeding

patch to embedding

This is the tricky part. The authors turn each character of a patch into a one-hot vector, flatten the result into a single vector, and apply a linear layer on top of it.
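To make the "one-hot, flatten, linear" step concrete, here is a dependency-free sketch in plain Python. It is not the authors' PyTorch code: the sizes `VOCAB_SIZE`, `PATCH_SIZE` and `HIDDEN` are hypothetical and kept tiny, the weights are random, and the linear layer has no bias. It also shows that the result equals a sum of per-position row lookups, which is how I currently read the difference from a single shared embedding table: here each position in the patch gets its own rows.

```python
import random

VOCAB_SIZE = 128  # assumption: one slot per ASCII code
PATCH_SIZE = 4    # hypothetical, tiny for readability
HIDDEN = 3        # hypothetical embedding width


def one_hot_patch(patch):
    """Flatten a patch into one vector of length PATCH_SIZE * VOCAB_SIZE."""
    vec = [0.0] * (PATCH_SIZE * VOCAB_SIZE)
    for pos, ch in enumerate(patch):
        vec[pos * VOCAB_SIZE + ord(ch)] = 1.0
    return vec


def linear(vec, weight):
    """Bias-free linear layer; weight holds one row of HIDDEN values per input slot."""
    out = [0.0] * HIDDEN
    for x, row in zip(vec, weight):
        if x:  # skip the zeros of the one-hot vector
            for k in range(HIDDEN):
                out[k] += x * row[k]
    return out


rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(HIDDEN)]
          for _ in range(PATCH_SIZE * VOCAB_SIZE)]

patch = "GAB|"  # must contain exactly PATCH_SIZE ASCII characters
embedding = linear(one_hot_patch(patch), weight)

# Equivalent view: pick one weight row per (position, character) and sum them.
lookup = [sum(weight[pos * VOCAB_SIZE + ord(ch)][k] for pos, ch in enumerate(patch))
          for k in range(HIDDEN)]

print(all(abs(a - b) < 1e-12 for a, b in zip(embedding, lookup)))  # prints True
```

Seen this way, the linear layer acts like `PATCH_SIZE` separate embedding tables whose outputs are summed, so the same character contributes differently depending on where it sits in the bar.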

scheduler

update step

Experiment

RWKV

from Efficient Transformers: A Survey, Yi Tay et al.

Metric

two objective metrics

Efficiency: The number of tokens generated per second on an RTX 2080 Ti.
Controllability: Control precision, quantified by the edit distance between the generated and the actual control codes.

comparative evaluations

Thirteen Irish musicians compared pairs of melodies: one from thesession.org with chord symbols, and one model-generated continuation from the initial two bars.

Engagement: Captivating to the ear, evokes emotional resonance, and maintains the listener’s interest.
Authenticity: Representing the distinctive characteristics of Irish traditional music.
Harmoniousness: Creating a natural flow that unifies melody and harmony into a cohesive and pleasing musical experience.
Playability: Well-suited for performance and offers a wide range of playing techniques.

Question

validity in data split

making embedding

the difference between using nn.Embedding and one-hot+linear
