[Papers] Efficient Intent Detection with Dual Encoders 🔫

KwanHong · December 22, 2020

🎊 Overview

❔ Introduction

Intent detection in task-oriented conversational systems

  • A dialogue system uses an intent detector to classify the user's utterances so that it can understand the user's current goal.

  • ❗ Extending an intent detector to support new domains and tasks is a difficult, resource-intensive process.

    • It requires domain experts and domain-specific datasets, which makes it hard to deploy intent detectors quickly and widely.
    • The detector must recognize intents effectively in low-data scenarios, where only a few examples per intent are available.

Pretraining methods in few-shot scenarios

  • To address the shortage of in-domain data, transfer learning with pretrained encoders has become the dominant approach.

  • Applying a general-purpose sentence encoder such as BERT as-is may not be optimal.

    • For conversational tasks, generic language-modeling pretraining can be less effective than conversational pretraining, which is based on the response selection task.
    • Fine-tuning BERT or its variants adapts the entire model to the domain, which consumes a lot of resources.
      • Moreover, this approach can overfit in few-shot scenarios.
    • Together, these properties lead to very slow, complex, and expensive development cycles.

Dual sentence encoders

  • The paper proposes intent detectors built on dual sentence encoders, that is, neural architectures that model sentence pairs, such as USE (Universal Sentence Encoder) and ConveRT.

    Advantages

    • Intent detectors based on USE and ConveRT outperform BERT-based ones, including in few-shot scenarios.
    • The models are relatively small and inexpensive to train (compactness).
    • Performance varies little when hyperparameters change, which reduces the cost of hyperparameter tuning.

🎣 Methodology: Intent Detection with Dual Sentence Encoders

Pretrained Sentence Encoders

  • These encoders need a fine-tuning step that adapts the entire model to a specific task or domain.
  • Fine-tuning is costly and, in few-shot scenarios, may overfit or produce suboptimal results.

Dual Sentence Encoders and Conversational Pretraining

  • Conversational pretraining is better suited than conventional language-model pretraining to dialogue tasks such as dialogue act prediction and next utterance generation.
  • Dual models use a dual-encoder architecture that learns the relationship between an input sentence/context and its corresponding response.
  • This work focuses on USE (Universal Sentence Encoder) and ConveRT, both trained on the response selection task.
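
As a rough illustration of what response selection with a dual encoder looks like, the sketch below scores one context against several candidate responses and turns the scores into probabilities. It is a minimal sketch under assumptions not stated in this summary: the dot-product scoring, the softmax over candidates, and the toy vectors are all illustrative stand-ins, not the actual USE or ConveRT training setup.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def response_selection_probs(context_vec, response_vecs):
    """Softmax over dot-product scores between one context and its candidates."""
    scores = [dot(context_vec, r) for r in response_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy, made-up encodings standing in for the outputs of the two encoders.
context = [0.9, 0.1, 0.0]
candidates = [
    [1.0, 0.0, 0.0],  # the true response (by construction)
    [0.0, 1.0, 0.0],  # negative candidate
    [0.0, 0.0, 1.0],  # negative candidate
]

probs = response_selection_probs(context, candidates)
loss = -math.log(probs[0])  # cross-entropy against the true response
```

Training on this kind of objective pushes the encoding of a context toward the encoding of its real response, which is one plausible reason the resulting sentence vectors transfer well to intent detection.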

Intent Detection with Dual Encoders

  • The classifier uses fixed sentence representations encoded by USE and ConveRT.
  • A softmax layer for multi-class classification is stacked on top of a Multi-Layer Perceptron (MLP) with a single hidden layer and ReLU activation.
  • The sentence vectors from the individual dual encoders can be concatenated and used as the input.
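
The classifier described above can be sketched as a forward pass in plain Python. This is only a sketch under assumptions: the dimensions, the random weights, and the names `use_vec` and `convert_vec` are hypothetical placeholders for real encoder outputs, and a practical implementation would use a tensor library and train the weights.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, weights, biases):
    """Affine transform: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_intent_probs(use_vec, convert_vec, hidden, output):
    x = list(use_vec) + list(convert_vec)         # concatenate both sentence vectors
    h = [max(0.0, z) for z in dense(x, *hidden)]  # single hidden layer with ReLU
    return softmax(dense(h, *output))             # distribution over intents


rng = random.Random(0)
USE_DIM, CONVERT_DIM, HIDDEN_DIM, NUM_INTENTS = 4, 6, 8, 3  # toy sizes, not the real ones
hidden = init_layer(USE_DIM + CONVERT_DIM, HIDDEN_DIM, rng)
output = init_layer(HIDDEN_DIM, NUM_INTENTS, rng)

# Random vectors standing in for fixed encoder outputs of one utterance.
use_vec = [rng.gauss(0, 1) for _ in range(USE_DIM)]
convert_vec = [rng.gauss(0, 1) for _ in range(CONVERT_DIM)]

probs = predict_intent_probs(use_vec, convert_vec, hidden, output)
```

Because the encoders stay fixed, only the two small dense layers would need training, which is consistent with the compactness argument made in the introduction.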

🔬 Results and discussion

  • Combining the two dual models (USE+ConveRT) captures complementary information and yields higher performance.
  • Because BERT is pretrained with a different objective, it reaches meaningful performance in its fine-tuned variant, BERT-TUNED.

Few-Shot Scenarios

  • When few data samples are available (few-shot scenario), models using dual encoders perform better than BERT-TUNED.

  • Ideally, an intent detector for few-shot scenarios should be usable off the shelf, without hyperparameter tuning on a validation set.

    • To keep the intent detector reliable and to prevent overfitting, the paper uses aggressive dropout (i.e., a dropout rate of 0.75) and many training iterations (500).
  • Performance was tested while varying the hyperparameter settings step by step.

    • Dual-encoder-based models show little variation in performance when hyperparameters change (robust).
    • In the few-shot scenario, outliers were also observed for BERT-FIXED, with a large gap between its best and average performance.
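
To make the dropout rate of 0.75 concrete, here is a minimal sketch of dropout applied to a vector of hidden activations. The summary only states the rate and the number of iterations, so the "inverted dropout" rescaling used here is an assumption (it is the common variant in deep learning libraries), and the all-ones activation vector is a toy input.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    # Survivors are divided by the keep probability so the expected sum is unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)
hidden = [1.0] * 1000  # toy activations
dropped = dropout(hidden, rate=0.75, rng=rng)

# Roughly a quarter of the units survive, each scaled by 1 / 0.25 = 4.
kept_fraction = sum(1 for v in dropped if v != 0.0) / len(dropped)
```

With three out of four hidden units silenced at every step, the classifier cannot rely on any small group of features, which fits the stated goal of avoiding overfitting when only a handful of examples per intent exist.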

Resource Efficiency

  • Time required for training and evaluation in the 10-sample few-shot scenario
  • Instead of a large model that needs GPU or TPU resources, an effective dual-encoder-based intent detector can be built that trains even on a CPU.

🎉 Conclusion

  • Dual encoder models such as USE and ConveRT achieve high performance on the intent classification task.
  • When only a small set of annotated samples is available, as is typical in real business settings, the paper's approach offers a greater benefit than adapting a BERT-based classifier every time.