[Collecting Pansori Dataset]#2 Collecting Method

Clay Ryu's sound lab · November 7, 2023

Introduction

Comparable datasets

  • GTZAN Dataset: 10 genres with 100 audio files each, every file 30 seconds long. -> too small
  • MagnaTagATune dataset: 25,863 music clips, each a 29-second excerpt from one of 5,223 songs across 445 albums and 230 artists, for roughly 12,500 minutes in total. -> contains various genres and instrumental music
  • AudioSet: an audio event dataset of over 2M human-annotated 10-second video clips. The clips are collected from YouTube, so many are of poor quality and contain multiple sound sources. -> not only music, but many types of sound
  • CompMusic Dataset: traditional music from various cultures, including Indian (Hindustani, Carnatic), Turkish (makam), Persian (dastgah), Arabian (maqam), and Chinese (Jingju, also known as Beijing opera). The audio comes from commercial CDs, radio broadcasts, and other publicly available sources, for a total of approximately 150 hours of raw audio.

The entire process will proceed as follows:

Crawling (collecting) -> Cleaning (matching) -> Post-processing (OCR, slicing using audio event detection)

Expected size of the data

500 songs (100 from each of the 5 Madang), about 3,500 minutes in total, all traditional Pansori. That works out to roughly 7 minutes per song on average.

Category in metadata

Metadata is stored in JSON format.

Genre

For now we only consider traditional Pansori performances:
1. There may be several singers, but each one must be a main singer required by that Daemok, not a chorus.
2. The only accompaniment should be a single Gosu (drummer).

Singer & gosu


Yupa(Style)

This information is rarely available.

Madang & Daemok

The 5 Madang and the Daemok that belong to each
Chunhyangga (춘향가): 적성가, 천자풀이, 긴 사랑가, 이별가, 신연맞이, 옥중가, 군노사령, 박석틔, 어사와 장모, etc.
Jeokbyeokga (적벽가): 삼고초려, 부모설움, 처자설움, 아내설움, 공명기풍, 자룡탄궁, 적벽대전, 새타령, 장승타령, 군사점고, etc.
Simcheongga (심청가): 곽씨 부인 유언, 상여소리, 주과포혜, 시비따라 가는 대목, 범피중류, 화초타령, 추월만정, 방아타령, etc.
Heungboga (흥보가): 놀보 심술, 흥보 애원하는 대목, 놀보가 흥보 때리는 대목, 가난타령, 집터 잡는 대목, 제비노정기, 돈타령, 밥타령, 비단타령, 흥보 집짓는 대목, 화초장 타령, 놀보 제비 몰러 나가는 대목, 놀보 제비 노정기, 박타는 대목, etc.
Sugungga (수궁가): 용왕이 득병하는 대목, 토끼 화상, 날짐승들 상좌 다투는 대목, 길짐승들 상좌 다투는 대목, 호랑이 나이 자랑하는 대목, 토끼 용왕을 속이는 대목, etc.
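
To make the categories above concrete, here is a rough sketch of what one JSON metadata entry could look like. The field names (`video_id`, `singers`, `gosu`, `yupa`, and so on) and the helper `make_record` are assumptions for illustration, not the schema we actually settled on, and the values are placeholders rather than a real recording.

```python
import json


def make_record(video_id, madang, daemok, singers, gosu=None, yupa=None):
    """Build one metadata entry as a plain dict that serializes to JSON.

    All field names here are hypothetical; only the categories themselves
    (genre, singer, gosu, Yupa, Madang, Daemok) come from the post.
    """
    return {
        "video_id": video_id,            # YouTube video ID, used as the key
        "genre": "traditional_pansori",  # the only genre considered for now
        "singers": list(singers),        # main singers only, no chorus
        "gosu": gosu,                    # the single drummer, if known
        "yupa": yupa,                    # rarely available, so None is allowed
        "madang": madang,
        "daemok": daemok,
    }


# Placeholder values, not a real entry.
record = make_record("xxxxxxxxxxx", "춘향가", "이별가", ["(singer name)"])

# ensure_ascii=False keeps the Korean titles readable in the JSON file.
text = json.dumps(record, ensure_ascii=False, indent=2)
print(text)
```

Keeping `yupa` and `gosu` optional means an entry can still be saved when that information is missing and filled in later during cleaning.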

Considerations in collecting

Storage

For security reasons we avoid local server storage and instead use a 2 TB Google Drive that the collectors can share.
Since Google Colab can access a shared Drive, we can easily download and upload crawled data from within Colab.

Prevent duplication

  • Five collectors each take responsibility for one of the five Madang
  • Share metadata and check YouTube video IDs against it, so that videos that have already been collected are not downloaded again
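
The video ID check can be sketched roughly as below. The function names `extract_video_id` and `filter_new` are hypothetical, and the sketch assumes that the shared metadata gives us a collection of video IDs that have already been downloaded. It uses only the standard library to parse the URL.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url):
    """Return the YouTube video ID from a few common URL shapes, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links: https://youtu.be/<id>
        return parsed.path.lstrip("/").split("/")[0] or None
    if host == "youtube.com" or host.endswith(".youtube.com"):
        if parsed.path == "/watch":
            # Standard links: https://www.youtube.com/watch?v=<id>
            ids = parse_qs(parsed.query).get("v")
            return ids[0] if ids else None
        parts = parsed.path.split("/")
        # /shorts/<id> and /embed/<id>
        if len(parts) >= 3 and parts[1] in ("shorts", "embed"):
            return parts[2] or None
    return None


def filter_new(urls, known_ids):
    """Keep only URLs whose video ID is not yet in the shared metadata."""
    seen = set(known_ids)
    fresh = []
    for url in urls:
        vid = extract_video_id(url)
        if vid is None or vid in seen:
            continue  # unparseable or already collected
        seen.add(vid)  # also drops duplicates within this batch
        fresh.append(url)
    return fresh
```

Comparing IDs rather than raw URLs matters because the same video can appear as a `youtu.be` short link in one collector's list and as a `watch?v=` link in another's.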

Crawling demo video

https://youtu.be/NwdZvEJW_FA

Subsequent Objectives

Fixing issues

pytube fails in many different ways. We share these error cases with each other so that we can fix our crawling code and keep it up to date as new cases appear.
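
One way to collect these error cases is to wrap each download so that a failure is recorded instead of stopping the whole crawl. The sketch below is an assumption about how this could be done, not our actual crawling code: `crawl` is a hypothetical helper, and `download` stands for any callable that takes a video ID (in our case it would wrap the pytube call, which is not shown here).

```python
import json


def crawl(video_ids, download, error_log_path=None):
    """Try each download and record failures instead of aborting.

    `download` is any callable taking a video ID. Failures are returned
    and, if `error_log_path` is given, also written out as JSON so that
    collectors can share them.
    """
    done, failed = [], []
    for vid in video_ids:
        try:
            download(vid)
            done.append(vid)
        except Exception as exc:  # broad on purpose: we want every case logged
            failed.append({
                "video_id": vid,
                "error_type": type(exc).__name__,
                "message": str(exc),
            })
    if error_log_path is not None:
        with open(error_log_path, "w", encoding="utf-8") as f:
            json.dump(failed, f, ensure_ascii=False, indent=2)
    return done, failed
```

Grouping the shared log by `error_type` then shows which failure is most common and should be fixed first.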

Downstream task

  • Classification: Yupa, Madang, Daemok

Cleaning & Post-processing

  • Unify the metadata
  • Slice the audio via audio event detection
  • Extract more information via OCR
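
To illustrate the slicing step, here is a deliberately simple stand-in: instead of a real audio event detection model, it thresholds per-frame energy values and returns the ranges where sound is present. The function names, the threshold, and the frame hop are all assumptions for illustration. In practice the energies would be replaced by the output of whatever detector we end up using.

```python
def find_segments(energies, threshold, min_frames=1):
    """Return (start, end) frame ranges, end exclusive, where energy >= threshold.

    Runs shorter than `min_frames` are dropped as likely noise.
    """
    segments = []
    start = None
    for i, energy in enumerate(energies):
        if energy >= threshold and start is None:
            start = i  # an active run begins
        elif energy < threshold and start is not None:
            if i - start >= min_frames:
                segments.append((start, i))
            start = None  # the run ended
    # Close a run that lasts until the end of the recording.
    if start is not None and len(energies) - start >= min_frames:
        segments.append((start, len(energies)))
    return segments


def frames_to_seconds(segments, hop_seconds):
    """Convert frame ranges to (start, end) times in seconds."""
    return [(s * hop_seconds, e * hop_seconds) for s, e in segments]
```

The resulting time ranges can then be used as cut points when slicing a long recording into shorter clips.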

Develop evaluation criteria
