[DL/Audio] AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss

๊ตฌ๋งยท2024๋…„ 9์›” 10์ผ
0

[Paper Review]

๋ชฉ๋ก ๋ณด๊ธฐ
4/8

๐Ÿ“„ [์›๋ฌธ]

Abstract

non-parallel์˜ ๋‹ค๋Œ€๋‹ค vc์™€ zero-shot vc๋ฅผ ์ˆ˜ํ–‰ํ•˜๊ธฐ ์œ„ํ•ด GAN ๊ทธ๋ฆฌ๊ณ  CVAE๊ฐ€ ์ƒˆ๋กœ์šด ํ•ด๊ฒฐ์ฑ…์œผ๋กœ ๋“ฑ์žฅํ–ˆ์—ˆ๋‹ค. ํ•˜์ง€๋งŒ GAN์˜ training์€ ๋ณต์žกํ•˜๊ณ  CVAE์˜ training์€ ๊ฐ„๋‹จํ•˜์ง€๋งŒ GAN๋งŒํผ์˜ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋‚ด์ง€ ๋ชปํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ์ด๋…ผ๋ฌธ์—์„œ๋Š” AutoVC๋ผ๋Š” autoencoder๋ฅผ ํฌํ•จํ•˜๋Š” ์Šคํ‚ค๋งˆ๋ฅผ ๊ฐ€์ง„ ์ƒˆ๋กœ์šด style transfer๋ฅผ ์†Œ๊ฐœํ•œ๋‹ค.

Introduction

[ ๊ธฐ์กด ๋ฐฉ์‹๋“ค์˜ ๋ฌธ์ œ์  ]

  1. ๋Œ€๋ถ€๋ถ„์ด parallelํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ •ํ•œ vc system
  2. ์†Œ์ˆ˜์˜ non-parallelํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ์‚ฌ์šฉ๊ฐ€๋Šฅํ•œ ์—ฐ๊ตฌ ์ค‘์—์„œ ๋‹ค๋Œ€๋‹ค conversion์€ ๋” ์†Œ์ˆ˜
  3. zero-shot conversion์ด ๊ฐ€๋Šฅํ•œ vc๋Š” ์—†์Œ

์ตœ๊ทผ์—” deep style transfer๋กœ GAN, CVAE๋“ฑ์ด vc์—์„œ ์ธ๊ธฐ๋ฅผ ์–ป๊ณ  ์žˆ๋‹ค. ํ•˜์ง€๋งŒ GAN์€ ํ›ˆ๋ จ์‹œํ‚ค๊ธฐ ๋งค์šฐ ์–ด๋ ต๊ณ  ์ˆ˜๋ ด์ด ์ž˜ ์•ˆ๋œ๋‹ค. ์ƒ์„ฑํ•œ speech์˜ ํ’ˆ์งˆ๋„ ๊ทธ๋ ‡๊ฒŒ ์ข‹์ง€ ์•Š๋‹ค. ๋ฐ˜๋ฉด์— CVAE๋Š” ํ›ˆ๋ จ์ด ๋น„๊ต์  ์‰ฝ๋‹ค. ํ•˜์ง€๋งŒ CVAE๋Š” ์•Œ๋งž์€ distribution matching์„ ๋ณด์žฅํ•˜์ง€ ์•Š๊ณ  conversion output์˜ over-smoothing์œผ๋กœ๋ถ€ํ„ฐ ์–ด๋ ค์›€์„ ๊ฒช๋Š”๋‹ค.

์ ์ ˆํ•œ style transfer ์•Œ๊ณ ๋ฆฌ์ฆ˜์€ GAN์ฒ˜๋Ÿผ distribution์„ ์ผ์น˜์‹œํ‚ค๊ณ  ํ›ˆ๋ จ์€ CVAE์ฒ˜๋Ÿผ ๊ฐ„๋‹จํ•˜๋ฉด ๋˜์ง€ ์•Š์„๊นŒ๋ผ๋Š” ์˜๋ฌธ์— ๋™๊ธฐ๋ฅผ ๋ถ€์—ฌ๋ฐ›์•„ ์ด๋…ผ๋ฌธ์€ AutoVC๋ผ๋Š” ์ƒˆ๋กœ์šด style transfer scheme์„ ์ฃผ์žฅํ•œ๋‹ค.

AutoVC๋Š” parallel data์—†์ด๋„ ๋‹ค๋Œ€๋‹ค vc๊ฐ€ ๊ฐ€๋Šฅํ•œ ์•Œ๊ณ ๋ฆฌ์ฆ˜์ด๋‹ค. autoencoder ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๋”ฐ๋ฅด๊ณ  ์˜ค์ง autoencoder loss๋กœ ํ›ˆ๋ จ๋œ๋‹ค. ํ•˜์ง€๋งŒ ์ด๋Š” ๋ฏธ์„ธํ•˜๊ฒŒ ์กฐ์ •๋œ ์ฐจ์› ์ถ•์†Œ์™€ ์ผ์‹œ์  downsampling์„ ๋„์ž…ํ•œ๋‹ค.

๊ทธ๋ฆฌ๊ณ  ์ตœ์ดˆ๋กœ zero-shot vc๊ฐ€ ๊ฐ€๋Šฅํ•œ ๋ชจ๋ธ์ด๋‹ค.

Style Transfer Autoencoder

์•ž์œผ๋กœ ๋“ฑ์žฅํ•  ์ˆ˜์‹์—์„œ์˜ ์ •์˜

  • ๋Œ€๋ฌธ์ž ๊ธ€์ž(ex. X) : ๋žœ๋ค ๋ณ€์ˆ˜,๋ฒกํ„ฐ๋“ค
  • ์†Œ๋ฌธ์ž ๊ธ€์ž(ex. x) : ๋žœ๋ค ๋ณ€์ˆ˜๋“ค์˜ ๊ฒฐ์ • ๋ณ€์ˆ˜ ๋˜๋Š” ์ธ์Šคํ„ด์Šค
  • H(โ‹…)H(\cdot) : entropy
  • H(โ‹…โˆฃโ‹…)H(\cdot | \cdot) : ์กฐ๊ฑด๋ถ€ entropy

Problem Formulation

  • $U$: speaker identity
  • $Z$: content vector
  • $X(t)$: a sample of the speech waveform, or a frame of the speech spectrogram
  • Each speaker is assumed to produce the same amount of gross information:
    • [ Eq.(1) ] $H(X \mid U = u) = h_{speech} = \text{constant}$
  • source speaker: $(U_1, Z_1, X_1)$
  • target speaker: $(U_2, Z_2, X_2)$
  • Goal
    • a speech converter that produces a converted output $\hat X_{1 \to 2}$ preserving the content of $X_1$
    • while matching the voice characteristics of speaker $U_2$
  • An ideal speech converter satisfies the following:
    • [ Eq.(2) ] $p_{\hat X_{1 \to 2}}(\cdot \mid U_2 = u_2, Z_1 = z_1) = p_X(\cdot \mid U = u_2, Z = z_1)$
    • i.e., given the target speaker identity $U_2 = u_2$ and the source content $Z_1 = z_1$, the converted speech should sound as if $u_2$ had uttered it

The Autoencoder Framework


Autovc๋Š” ์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์€ ๋‹จ์ˆœํ•œ autoencoder ํ”„๋ ˆ์ž„์›Œํฌ์—์„œ vc๋ฅผ ์ˆ˜ํ–‰ํ•œ๋‹ค.

[ 3๊ฐ€์ง€ ๋ชจ๋“ˆ ๊ตฌ์„ฑ ]

  • content encoder Ec(โ‹…)E_c(\cdot)
    • speech๋กœ๋ถ€ํ„ฐ content embedding ์ƒ์„ฑ
  • speaker encoder Es(โ‹…)E_s(\cdot)
    • speech๋กœ๋ถ€ํ„ฐ speaker embedding ์ƒ์„ฑ
  • decoder D(โ‹…,โ‹…)D(\cdot , \cdot)
    • content์™€ speaker embedding์œผ๋กœ๋ถ€ํ„ฐ speech ์ƒ์„ฑ

์ด ๋ชจ๋“ˆ๋“ค์—์„œ์˜ input์€ conversion๊ณผ training๊ณผ์ •์—์„œ ๊ฐ๊ธฐ ๋‹ค๋ฅด๋‹ค.

Conversion

๋ณ€ํ™˜ ๊ณผ์ •์—์„œ source speech X1X_1์€ content ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜๊ธฐ ์œ„ํ•ด content encoder EcE_c๋กœ ์ž…๋ ฅ๋œ๋‹ค

The target speech $X_2$ is fed into the speaker encoder $E_s$ to provide the target speaker information.

๋””์ฝ”๋”๋Š” source speech์—์„œ content ์ •๋ณด์™€ target speech์—์„œ target ํ™”์ž ์ •๋ณด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ณ€ํ™˜๋œ speech๋ฅผ ์ƒ์„ฑํ•œ๋‹ค.

  • [ Eq(3) ] $C_1 = E_c(X_1),\; S_2 = E_s(X_2),\; \hat X_{1 \to 2} = D(C_1, S_2)$
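As a concrete illustration, here is a minimal sketch of the conversion step in Eq. (3), assuming PyTorch-style modules (`Ec`, `Es`, `D` stand in for the encoders and decoder detailed in the architecture section; in the full model $E_c$ also receives the source speaker embedding, a detail elided here):

```python
def convert(Ec, Es, D, X1, X2):
    """Eq. (3): combine source content with target speaker identity.

    X1, X2: mel-spectrograms of shape (batch, T, 80) for the source
    and target utterances, respectively.
    """
    C1 = Ec(X1)             # content embedding of the source utterance
    S2 = Es(X2)             # speaker embedding of the target utterance
    X_hat_1to2 = D(C1, S2)  # converted speech: content of X1, voice of X2
    return X_hat_1to2
```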

Training

์ด ๋…ผ๋ฌธ์—์„œ๋Š” speaker encoder๊ฐ€ ์ด๋ฏธ pre-trained๋๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๊ธฐ์—, contetn encoder์™€ decoder๋งŒ ํ›ˆ๋ จ์„ ์ง„ํ–‰ํ•œ๋‹ค.

Because no parallel data are assumed, training only requires self-reconstruction.

๋” ์ž์„ธํ•˜๊ฒŒ๋Š”, content encoder์˜ input์€ ์—ฌ์ „ํžˆ X1X_1์ด์ง€๋งŒ, style encoder์—์„œ input์€ X1โ€™X_1^โ€™๋กœ ๋‚˜ํƒ€๋‚˜๋Š” ๊ฐ™์€ ํ™”์ž U1U_1์˜ ๋ฐœํ™”๊ฐ€ ๋œ๋‹ค.

  • [ Eq(4) ] $C_1 = E_c(X_1),\; S_1 = E_s(X_1'),\; \hat X_{1 \to 1} = D(C_1, S_1)$

loss function์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค:

[ Eq(5) & Eq(6) ]
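A minimal training-step sketch under the same assumptions (PyTorch; `lambda_c` is a hypothetical name for the content-loss weight; the speaker encoder is frozen because the paper assumes it is pre-trained):

```python
import torch
import torch.nn.functional as F

def train_step(Ec, Es, D, X1, X1_prime, optimizer, lambda_c=1.0):
    """Self-reconstruction training, Eq. (4)-(6): only Ec and D are updated."""
    with torch.no_grad():            # speaker encoder is pre-trained and frozen
        S1 = Es(X1_prime)            # another utterance of the same speaker U1
    C1 = Ec(X1)
    X_hat = D(C1, S1)                # self-reconstruction X_hat_{1->1}

    loss_recon = F.mse_loss(X_hat, X1)         # Eq. (5)
    loss_content = F.l1_loss(Ec(X_hat), C1)    # Eq. (6)
    loss = loss_recon + lambda_c * loss_content

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```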

Why does it work?

autoencoder๊ธฐ๋ฐ˜ ํ›ˆ๋ จ ์Šคํ‚ค๋งˆ๊ฐ€ ์ด์ƒ์ ์ธ vc๋ฅผ ๊ฐ€๋Šฅ์ผ€ํ•˜๋Š” ํ•ต์‹ฌ ์ด์œ ๋Š” ์ ์ ˆํ•œ information bottelneck๋ฅผ ๊ฐ–๋Š” ๊ฒƒ์— ์žˆ๋‹ค.

AutoVC์˜ ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ๋‹ค์Œ์˜ ์ด๋ก  ๋ฐ ๊ฐ€์ •์„ ๋”ฐ๋ฅธ๋‹ค.

[ Theorem 1 ]

Considering Eq. (3) and Eq. (4):

  1. Speaker embeddings of different utterances by the same speaker are identical: if $U_1 = U_2$, then $E_s(X_1) = E_s(X_2)$.
  2. Speaker embeddings of different speakers differ: if $U_1 \ne U_2$, then $E_s(X_1) \ne E_s(X_2)$.

์œ„ ์‚ฌ์ง„์—์„œ ๋ณด์ด๋“ฏ์ด speech๋Š” ๋‘๊ฐ€์ง€ ์ข…๋ฅ˜์˜ ์ •๋ณด๋ฅผ ํฌํ•จํ•œ๋‹ค : ํ™”์ž ์ •๋ณด & ํ™”์ž์™€ ๋…๋ฆฝ์ ์ธ ์ •๋ณด(=content ์ •๋ณด)

bottleneck์ด ๋„ˆ๋ฌด ๋„“์œผ๋ฉด, content embedding์ธ C1C_1์˜ ์ฐจ์›์ด ์ถ•์†Œ๋˜๊ธฐ ๋•Œ๋ฌธ์— C1C_1์€ ์ •๋ณด๋ฅผ ์†์‹คํ•˜๊ฒŒ ๋œ๋‹ค.

๋ฐ˜๋ฉด์— bottleneck์ด ๋„ˆ๋ฌด ์ข์œผ๋ฉด, content encoder๋Š” ํ™”์ž ์ •๋ณด ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ content ์ •๋ณด๊นŒ์ง€ ๋„ˆ๋ฌด ๋งŽ์€ ์ •๋ณด๋ฅผ ์žƒ๊ฒŒ ๋œ๋‹ค. ์ด ๊ฒฝ์šฐ ์™„๋ฒฝํ•œ reconstruction์€ ๋ถˆ๊ฐ€๋Šฅํ•˜๋‹ค.

๊ทธ๋Ÿฌ๋ฏ€๋กœ ์œ„์‚ฌ์ง„์—์„œ (c)์ฒ˜๋Ÿผ C1C_1์˜ ์ฐจ์›์€ content ์ •๋ณด๋ฅผ ์†์ƒ์‹œํ‚ค์ง€ ์•Š์œผ๋ฉด์„œ ๋ชจ๋“  ํ™”์ž ์ •๋ณด๋ฅผ ์ œ๊ฑฐ๊ฐ€๋Šฅํ•œ ๋งŒํผ ์ถฉ๋ถ„ํ•œ ์ •๋„๋กœ ์ฐจ์›์ด ์ถ•์†Œ๋˜์–ด์•ผ ํ•œ๋‹ค.

์ ์ ˆํ•œ Bottleneck ๋ฒ”์œ„๋ฅผ ์ •ํ•  ๊ฒฝ์šฐ, ๋‹ค์Œ์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

  1. ์™„๋ฒฝํ•œ reconstruction์ด ๊ฐ€๋Šฅํ•˜๋‹ค
  2. content embedding์ธ C1C_1์€ source speaker U1U_1์— ๊ด€ํ•œ ์–ด๋–ค ์ •๋ณด๋„ ํฌํ•จํ•˜์ง€ ์•Š๋Š”๋‹ค(= speaker disentaglement)

AutoVC Architecture

Speaker Encoder

speaker encoder์˜ ๋ชฉํ‘œ๋Š” ๊ฐ™์€ ํ™”์ž์˜ ๋‹ค๋ฅธ ๋ฐœํ™”๋“ค์—์„œ๋„ ๊ฐ™์€ ์ž„๋ฒ ๋”ฉ์„ ๋งŒ๋“œ๋Š” ๊ฒƒ์ด๋‹ค.

zero-shot cv๋ฅผ ์œ„ํ•ด์„œ๋Š” unseenํ•œ ํ™”์ž๋“ค์—์„œ๋„ ์ผ๋ฐ˜ํ™”๊ฐ€๋Šฅํ•œ ์ž„๋ฒ ๋”ฉ์„ ์ ์šฉํ•˜๋Š” ๊ฒƒ์ด ํ•„์š”ํ•˜๋‹ค. ๊ทธ๋Ÿฌ๋ฏ€๋กœ ์œ„ ๊ทธ๋ฆผ์—์„œ (3)(b)์ฒ˜๋Ÿผ speaker encoder๋Š” cell ํฌ๊ธฐ 768์ธ 2๊ฐœ์˜ LSTM ๋ ˆ์ด์–ด๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค. output์—์„œ ๋งˆ์ง€๋ง‰ ์‹œ๊ฐ„๋งŒ fc ๋ ˆ์ด์–ด๋กœ 256์ฐจ์›์œผ๋กœ ์„ ํƒ๋˜์—ˆ๋‹ค.

speaker embedding์˜ ๊ฒฐ๊ณผ๋Š” 256์˜ 1๋ฒกํ„ฐ์ด๋‹ค.

์ด๋Š” GE2E loss๋กœ pre-trained๋˜์—ˆ๋‹ค.

  • GE2E loss๋Š” ๊ฐ™์€ ํ™”์ž์˜ ๋‹ค๋ฅธ ๋ฐœํ™”๋“ค ์‚ฌ์ด์—์„œ ์ž„๋ฒ ๋”ฉ ์œ ์‚ฌ์„ฑ์„ ์ตœ๋Œ€ํ™”ํ•˜๊ณ , ๋‹ค๋ฅธ ํ™”์ž๋“ค ์‚ฌ์ด์—์„œ ์ž„๋ฒ ๋”ฉ ์œ ์‚ฌ์„ฑ์„ ์ตœ์†Œํ™”ํ•œ๋‹ค.

Content Encoder

์œ„ ์‚ฌ์ง„์—์„œ 3(a)์ฒ˜๋Ÿผ content encoder์—์„œ input์€ ํ™”์ž ์ž„๋ฒ ๋”ฉ Es(X1)E_s(X_1)๊ณผ ํ•ฉ์ณ์ง„ X1X_1์˜ 80์ฐจ์›์˜ mel-spectrogram์ด๋‹ค

ํ•ฉ์ณ์ง„ ํ”ผ์ณ๋“ค์€ 3๊ฐœ์˜ 5x1 conv ๋ ˆ์ด์–ด๋“ค์— ์ž…๋ ฅ๋œ๋‹ค.

์ฑ„๋„์˜ ์ˆ˜๋Š” 512์ด๊ธฐ์— output์€ 2๊ฐœ์˜ bidirectional LSTM ๋ ˆ์ด์–ด๋“ค์€ ํ†ต๊ณผํ•œ๋‹ค.

information bottleneck์„ ๊ตฌ์„ฑํ•˜๋Š”๋ฐ ์ค‘์š”ํ•œ ๋‹จ๊ณ„๋Š”, BLSTM์˜ foward์™€ backward ouput๋“ค์ด 32๋กœ ๋‹ค์šด์ƒ˜ํ”Œ๋ง๋˜๋Š” ๊ฒƒ์ด๋‹ค.

content embedding์˜ ๊ฒฐ๊ณผ๋ฌผ์€ 2๊ฐœ์˜ 32-by-T/32 matrices์ด๋‹ค.

  • ๊ฐ๊ฐ C1โ†’,C1โ†C_{1โ†’}, C_{1โ†}๋กœ ํ‘œํ˜„๋œ๋‹ค.

Decoder

์œ„ ์‚ฌ์ง„์—์„œ 3(c)์— ํ•ด๋‹นํ•˜๋Š” ๋””์ฝ”๋”

๋จผ์ € content, speaker ์ž„๋ฒ ๋”ฉ์€ ๋‘˜๋‹ค ์›๋ž˜์˜ ์‹œ๊ฐ„ ํ•ด์ƒ๋„๋ฅผ ๋ณต๊ตฌํ•˜๊ธฐ ์œ„ํ•ด copyingํ•จ์œผ๋กœ์จ ์—…์ƒ˜ํ”Œ๋ง๋œ๋‹ค.

  • ๊ฐ๊ฐ Uโ†’,Uโ†U_โ†’ , U_โ†๋กœ ํ‘œํ˜„๋œ๋‹ค.

์—…์ƒ˜ํ”Œ๋ง๋œ ์ž„๋ฒ ๋”ฉ๋“ค์€ ํ•ฉ์ณ์ง€๊ณ  3๊ฐœ์˜ 512์ฑ„๋„์„ ๊ฐ€์ง„ 5x1 conv ๋ ˆ์ด์–ด๋“ค์— ์ž…๋ ฅ๋œ๋‹ค. ์ด ์ดํ›„์—๋Š” ๋ฐฐ์น˜ ์ •๊ทœํ™”, ReLU ๊ทธ๋ฆฌ๊ณ  cell dimension 1024๋ฅผ ๊ฐ€์ง„ 3๊ฐœ์˜ LSTM๋ ˆ์ด์–ด๋“ค์„ ํ†ต๊ณผํ•œ๋‹ค. LSTM์˜ output๋“ค์€ ์ฐจ์› 80์˜ 1x1 conv ๋ ˆ์ด์–ด๋“ค์— ํˆฌ์˜๋œ๋‹ค. ์ด ํˆฌ์˜์˜ output์€ X~1โ†’2\tilde X_{1โ†’2}๋กœ ๋‚˜ํƒ€๋‚˜๋Š” ๋ณ€ํ™˜๋œ speech์˜ ์ดˆ๊ธฐ ์ถ”์ •์น˜์ด๋‹ค.

๋งˆ์ง€๋ง‰ ๋ณ€ํ™˜ ๊ฒฐ๊ณผ๋Š” ์ดˆ๊ธฐ ์ถ”์ •์น˜์— residual๋ฅผ ๋”ํ•จ์œผ๋กœ์จ ์ƒ์„ฑ๋œ๋‹ค.

  • [ Eq(10) ] $\hat X_{1 \to 2} = \tilde X_{1 \to 2} + R_{1 \to 2}$
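A sketch of the decoder and the residual stage of Eq. (10), under the same assumptions (PyTorch; the post-network producing the residual $R_{1 \to 2}$ is abstracted as a small conv stack here, since the review does not detail its shape):

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """3 conv (5x1, 512 ch) + BN + ReLU -> 3 LSTM (1024) -> 1x1 conv to 80
    mel bins, then a residual post-network implementing Eq. (10)."""
    def __init__(self, dim_in=32 + 32 + 256, n_mels=80):
        super().__init__()
        layers, ch = [], dim_in
        for _ in range(3):
            layers += [nn.Conv1d(ch, 512, kernel_size=5, padding=2),
                       nn.BatchNorm1d(512), nn.ReLU()]
            ch = 512
        self.convs = nn.Sequential(*layers)
        self.lstm = nn.LSTM(512, 1024, num_layers=3, batch_first=True)
        self.proj = nn.Conv1d(1024, n_mels, kernel_size=1)
        # Post-network producing the residual R; its exact shape is assumed.
        self.postnet = nn.Sequential(
            nn.Conv1d(n_mels, 512, kernel_size=5, padding=2), nn.Tanh(),
            nn.Conv1d(512, n_mels, kernel_size=5, padding=2))

    def forward(self, U_fwd, U_bwd, spk_emb):
        T = U_fwd.size(1)
        x = torch.cat([U_fwd, U_bwd,
                       spk_emb.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        x = self.convs(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.lstm(x)
        X_tilde = self.proj(x.transpose(1, 2))    # initial estimate
        X_hat = X_tilde + self.postnet(X_tilde)   # Eq. (10): add residual R
        # Both estimates are returned, since training applies the
        # reconstruction loss to each of them (see below).
        return X_tilde.transpose(1, 2), X_hat.transpose(1, 2)
```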

ํ›ˆ๋ จ๋™์•ˆ reconstruction loss๋Š” ์ดˆ๊ธฐ์™€ ๋งˆ์ง€๋ง‰ reconstruction ๊ฒฐ๊ณผ๋“ค ๋ชจ๋‘์— ์ ์šฉ๋œ๋‹ค.

Total loss

  • [ Eq(12) ] $L = L_{recon} + \mu L_{recon0} + \lambda L_{content}$, where $L_{recon0} = \mathbb{E}\big[\lVert \tilde X_{1 \to 1} - X_1 \rVert_2^2\big]$ is the reconstruction loss on the initial estimate

Spectrogram inverter

autoVC๋Š” 4๊ฐœ์˜ deconv ๋ ˆ์ด์–ด๋“ค๋กœ ๊ตฌ์„ฑ๋œ WaveNet ๋ณด์ฝ”๋”๋ฅผ ์ ์šฉํ•œ๋‹ค.

๊ณต์‹์ ์ธ ๊ตฌํ˜„์—์„œ๋Š” mel-spectrogram์˜ frame rate๋Š” 62.5Hz์ด๊ณ  speech waveform์˜ sampling rate๋Š” 16kHz์ด๋‹ค.

pre-trained๋œ WaveNet vocoder๋ฅผ ์‚ฌ์šฉํ–ˆ๋‹ค.
