🤣 DialogueRNN: An Attentive RNN for Emotion Detection in Conversations

ukkikkiai · April 1, 2024

Euron Paper Review

๋ชฉ๋ก ๋ณด๊ธฐ
4/13

ABSTRACT

Emotion detection in conversations is an essential step in understanding feedback. Existing systems do not treat each speaker individually. This paper uses an RNN-based model that tracks the state of each individual speaker throughout the conversation and uses that information for emotion classification.

1. INTRODUCTION

1) The speaker
2) The context of the preceding utterances
3) The emotion of the preceding utterances

  • The paper assumes that these three aspects are related to emotion and models each of them separately.

DialogueRNN

๋ฌธ๋งฅ์„ ์ถ”์ถœํ•จ์— ์žˆ์–ด์„œ ๋ฐœํ™”์ž + ์ฒญ์ทจ์ž์˜ ์ด์ „ ๋ฐœํ™”๋ฅผ ๋ชจ๋‘ ๊ณ ๋ คํ•จ. 3๊ฐœ์˜ Gated Recurrent Unit์„ ํ™œ์šฉํ•จ

  • global GRU, party GRU: update the context and the party state, respectively. While encoding an utterance, the global GRU also encodes the information of the party who spoke it.

=> Through this GRU we obtain a contextual representation that covers the information in the preceding utterances.

  • emotion GRU: the updated speaker state is fed in and decoded into an emotion representation, which is then used for emotion classification.

=> The emotion GRU and the global GRU together play a key role in modeling the relations between parties, whereas the party GRU models the relation between successive states of the same party.

DialogueRNN์€ ์œ„์˜ 3๊ฐ€์ง€ GRU๋“ค์ด recursiveํ•˜๊ฒŒ ์—ฐ๊ฒฐ๋˜์–ด ์žˆ์Œ.

3. Methodology

3.1 Problem Definition

M ๊ฐœ์˜ party, participant๊ฐ€ ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ–ˆ์„ ๋•Œ, ์ฃผ์–ด์ง„ ๊ณผ์ œ๋Š” emotion label์„ ๋ฐœํ™”์˜ ์š”์†Œ์— ๋Œ€ํ•ด ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์ž„.

3.2 Unimodal Feature Extraction

Textual Feature Extraction: a CNN is used to extract the textual features. The network is trained on the utterances together with their emotion labels.

Audio and Visual Feature Extraction: visual and audio features are also extracted, using a 3D-CNN and openSMILE respectively.

3.3 Model

๋ณธ ๋…ผ๋ฌธ์€ ๋ฐœํ™”์˜ ๊ฐ์ •์ด 3๊ฐ€์ง€ ์ฃผ์š” ์š”์†Œ์— ์˜์กดํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•จ.

1) the speaker ๋ฐœํ™”์ž
2) the context given by preceding utterances ์ด์ „ ๋ฐœํ™”์— ์˜ํ•œ ๋งฅ๋ฝ
3) the emotion behind the preceding utterances ์ด์ „ ๋ฐœํ™”์— ๊น”๋ ค์žˆ๋Š” ๊ฐ์ •

GRU cell์˜ ํ™œ์šฉ

1) Global State(Global GRU)

  • ๋ฐœํ™”์ž์™€ ๋ฐœํ™”์ž์˜ state๋ฅผ ํ•จ๊ป˜ encodingํ•˜์—ฌ ๋งฅ๋ฝ์„ ํฌ์ฐฉํ•จ. ๋ฐœํ™”์ž์˜ state๋Š” qt-1 -> qt๋กœ ๋ณ€ํ™”ํ•˜๋ฉฐ, ์ด ๋ณ€ํ™”๋ฅผ GRU cell์— ๋‹ด์•„๋‘ .

2) Party State(Party GRU)

  • ๋ฐœํ™”์ž ๊ฐœ๋ณ„์ ์œผ๋กœ ๊ณ ์ •๋œ ๋ฒกํ„ฐ๋กœ state๋ฅผ ์ง€์†์ ์œผ๋กœ ์ถ”์ ํ•จ. ํ•ด๋‹น state ๋ฒกํ„ฐ๋“ค์€ null๋กœ ์ดˆ๊ธฐํ™”๋˜์—ˆ๋‹ค๊ฐ€, ๋ชจ๋ธ์ด ๋ฐœํ™”์ž ๊ฐ๊ฐ์— ๋Œ€ํ•˜์—ฌ ์ฃผ์˜๋ฅผ ๊ธฐ์šธ์—ฌ ๋‹ค๋ฃฐ ์ˆ˜ ์žˆ๋„๋ก ์ ์šฉํ•จ.

3) Speaker Update(Speaker GRU)

  • Captures the context c_t for utterance u_t in relation to the preceding utterances. Attention scores are computed over the previous global states, which represent the preceding utterances.
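The attention step above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function name `attention_context`, the weight matrix `W_alpha` (stored as D_u x D_g), and the choice to return a zero context when there are no previous global states are all assumptions made here. The score for each previous global state g_i is u_t^T W_alpha g_i, the scores go through a softmax, and c_t is the weighted sum of the g_i.

```python
import math


def attention_context(u_t, W_alpha, global_states):
    """Return (alpha, c_t) for utterance u_t over previous global states g_1..g_{t-1}."""
    d_g = len(W_alpha[0])
    if not global_states:
        # Assumption: with no history yet, use a zero context vector.
        return [], [0.0] * d_g
    # key = W_alpha^T u_t, so that key . g_i equals u_t^T W_alpha g_i
    key = [sum(u * row[d] for u, row in zip(u_t, W_alpha)) for d in range(d_g)]
    scores = [sum(k * g for k, g in zip(key, g_i)) for g_i in global_states]
    # Numerically stable softmax over the scores
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # c_t is the attention-weighted sum of the previous global states
    c_t = [sum(a * g_i[d] for a, g_i in zip(alpha, global_states)) for d in range(d_g)]
    return alpha, c_t


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]          # toy identity weights
    history = [[1.0, 0.0], [0.0, 1.0]]    # two previous global states
    alpha, c_t = attention_context([1.0, 0.0], W, history)
    print(alpha, c_t)
```

With the toy identity weights, the utterance [1, 0] scores the first global state higher than the second, so the context vector leans toward the first state.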

4) Listener Update

  • To also capture changes in the listener, such as changes in facial expression and state, a GRU is introduced for the listener as well.

5) Emotional Representation(Emotion GRU)

  • ์ตœ์ข…์ ์œผ๋กœ ์ด์ „ ๋ฐœํ™”์˜ emotion et-1์— ๋Œ€ํ•˜์—ฌ ์œ„์˜ ๋งฅ๋ฝ์„ ๋ชจ๋‘ ๋”ํ•˜์—ฌ et๋ฅผ ๊ตฌํ•จ. ๋ฐœํ™”์ž์˜ state๋Š” global state๋กœ๋ถ€ํ„ฐ ์ •๋ณด๋ฅผ ๋ฐ›์œผ๋ฏ€๋กœ, ๋ชจ๋ธ์€ ์ด๋ฏธ ๋‹ค๋ฅธ party์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ๋‹ด๊ณ  ์žˆ์Œ.

Emotion Classification

  • A two-layer perceptron predicts one of the 6 emotion classes. Training uses categorical cross-entropy as the loss.
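A minimal sketch of this classification head is shown below. The names `init_params`, `classify` and `cross_entropy`, the hidden size, the ReLU activation and the random weights are assumptions for illustration; the label list is the six-class set recalled from the paper's problem definition. Any regularization term is left out.

```python
import math
import random

EMOTIONS = ["happy", "sad", "neutral", "angry", "excited", "frustrated"]


def init_params(d_e, d_h, n_classes, seed=0):
    """Random placeholder weights for a two-layer perceptron (not trained)."""
    rng = random.Random(seed)
    mat = lambda r, c: [[rng.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(r)]
    return mat(d_h, d_e), [0.0] * d_h, mat(n_classes, d_h), [0.0] * n_classes


def linear(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def classify(e_t, params):
    """Hidden ReLU layer, then a softmax over the emotion classes."""
    W1, b1, W2, b2 = params
    hidden = [max(0.0, v) for v in linear(W1, b1, e_t)]
    logits = linear(W2, b2, hidden)
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def cross_entropy(probs, label):
    """Categorical cross-entropy for a single utterance with gold class index `label`."""
    return -math.log(probs[label])


if __name__ == "__main__":
    params = init_params(d_e=4, d_h=8, n_classes=len(EMOTIONS))
    probs = classify([0.2, -0.1, 0.5, 0.3], params)
    predicted = EMOTIONS[probs.index(max(probs))]
    print(predicted, cross_entropy(probs, EMOTIONS.index("neutral")))
```

During training the per-utterance losses would be averaged over all utterances in all dialogues; at inference time the predicted label is the class with the highest probability.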

4. Experimental Setting

์œ„์™€ ๊ฐ™์€ ๋ฐ์ดํ„ฐ์…‹์„ ์‚ฌ์šฉํ•จ.

๋ฒ ์ด์Šค๋ผ์ธ ๋ชจ๋ธ๊ณผ์˜ ๋น„๊ต ๊ฒฐ๊ณผ๋Š” ์œ„์™€ ๊ฐ™์Œ.

5. Conclusion

  • ๋Œ€ํ™”์—์„œ์˜ ๊ฐ์ • ๊ฐ์ง€๋ฅผ ์œ„ํ•œ RNN ๊ธฐ๋ฐ˜ ์‹ ๊ฒฝ ์•„ํ‚คํ…์ฒ˜๋ฅผ ์ œ์‹œํ•จ. ๊ฐ ์ž…๋ ฅ ๋ฐœํ™”์ž์˜ ํŠน์„ฑ์„ ๊ณ ๋ คํ•˜์—ฌ ์ฒ˜๋ฆฌํ•จ์œผ๋กœ์„œ ๋ฐœํ™”์— ๋” ์„ธ๋ฐ€ํ•œ ๋ฌธ๋งฅ์„ ์ œ๊ณตํ•˜๋„๋ก ํ•จ. Multi Modal ์„ค์ •์— ์žˆ์–ด์„œ ๋‘ ๊ฐ€์ง€ ๋‹ค๋ฅธ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์ตœ์‹  ๊ธฐ์ˆ ์„ ๋Šฅ๊ฐ€ํ•˜๋Š” ์„ฑ๋Šฅ์„ ๋ณด์ž„.
유정민

0๊ฐœ์˜ ๋Œ“๊ธ€