BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2019)

J. · July 18, 2025

Text & Speech Papers


📌 I'll be writing these up as I find the time... they'll probably get posted in batches... starting with the BERT paper I finished reviewing a little while ago.
📌 Original Paper: https://arxiv.org/abs/1810.04805

Abstract

BERT (Bidirectional Encoder Representations from Transformers) is, as the name suggests, a model built from the encoder stack of the Transformer architecture. It is pre-trained on unlabeled data and then fine-tuned on a specific downstream task (labeled data).

It is the methodology that cemented the pre-training + fine-tuning approach in NLP: the paper shows that a pre-trained BERT model can handle a variety of concrete NLP tasks (Question Answering, Natural Language Inference, etc.) with just one additional output layer.

At the time of publication, this method achieved SOTA on 11 major NLP tasks.

Looking at the methods that preceded BERT, there were broadly two ways to apply pre-trained language representations to a specific task: the feature-based approach and the fine-tuning approach.

Feature-Based Approach

Obtain fixed word-level vector representations, then feed those vectors into a task-specific model for training. The representative example is ELMo.

  • Input sentence → generate an embedding for each word
  • Feed the embeddings into the model:
    • Forward LSTM: left-to-right context
    • Backward LSTM: right-to-left context
  • Concatenate each layer's forward and backward hidden states
  • Take a weighted sum over the layers' concatenated outputs to produce the final ELMo vector (a sketch follows this list)
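
Below is a minimal PyTorch sketch of that combination step, assuming we already have per-layer forward/backward hidden states; all sizes and tensors here are illustrative stand-ins, not ELMo's actual code.

```python
import torch

# Illustrative sizes: 3 biLSTM layers, a 5-token sentence,
# 256-dim hidden state per direction (512 after concatenation).
num_layers, seq_len, dim = 3, 5, 256

# Stand-ins for the per-layer forward / backward LSTM hidden states.
forward_states = [torch.randn(seq_len, dim) for _ in range(num_layers)]
backward_states = [torch.randn(seq_len, dim) for _ in range(num_layers)]

# 1) Concatenate forward and backward hidden states at each layer.
layer_reprs = torch.stack([
    torch.cat([f, b], dim=-1)
    for f, b in zip(forward_states, backward_states)
])  # (num_layers, seq_len, 2 * dim)

# 2) Softmax-normalized layer weights and a scalar gamma; in the real
#    model both are learned by the downstream task.
s = torch.softmax(torch.randn(num_layers), dim=0)
gamma = torch.tensor(1.0)

# 3) Weighted sum across layers -> one ELMo vector per token.
elmo_vectors = gamma * (s.view(-1, 1, 1) * layer_reprs).sum(dim=0)
print(elmo_vectors.shape)  # torch.Size([5, 512])
```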

Fine-tuning Approach

ํ•˜๋‚˜์˜ pre-trained ๋ชจ๋ธ์„ ํ†ต์งธ๋กœ ๋ถˆ๋Ÿฌ์™€์„œ, downstream task์—์„œ ์ „์ฒด ๋ชจ๋ธ์„ ํ•จ๊ป˜ ํ•™์Šต์‹œํ‚ค๋Š” ๋ฐฉ๋ฒ•
๋Œ€ํ‘œ์ ์ธ ์˜ˆ์‹œ๊ฐ€ (๋…ผ๋ฌธ ๋‚˜์˜จ ๋‹น์‹œ ๊ธฐ์ค€) GPT1

"Left-to-Right" ์–ธ์–ด ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์œผ๋ฉฐ, ์•ž์— ๋‚˜์˜จ ๋‹จ์–ด ๋ณด๊ณ  ํ˜„์žฌ ๋‹จ์–ด ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์„ ๋”ฐ๋ฅธ๋‹ค.

However, both of these existing approaches rely on unidirectional language models during pre-training, conditioning only on the tokens that came before.

This limits how effective the pre-trained representations can be, and the drawback is especially pronounced in the fine-tuning approach.

  • GPT: a unidirectional model that only looks left-to-right
  • ELMo: a bidirectional BiLSTM, but the forward and backward models are trained independently (no jointly conditioned context)
  • As a result:
    - It is hard to reflect both sides of a word's context simultaneously
    - Representational power remains limited even at fine-tuning time
    Fixing these shortcomings → BERT

BERT consists of pre-training + fine-tuning: pre-training comprises a Masked Language Model objective + Next Sentence Prediction, after which the model is fine-tuned for each specific task.

BERT

๋ชจ๋ธ ์•„์นดํ…์ฒ˜

  • BERT uses only the encoder blocks of the Transformer.
  • L: number of layers (how many Transformer blocks are stacked)
  • H: hidden size
  • A: number of self-attention heads

BERT Base: 12 layers (L=12), hidden size 768 (H=768), 12 attention heads (A=12)
BERT Large: 24 layers, hidden size 1024, 16 attention heads (about 340M parameters)

BERT Base has about 110M parameters in total.

GPT ํฌ๊ธฐ์™€ ์œ ์‚ฌํ•˜๊ฒŒ ๋งŒ๋“ค์–ด ์—ฐ๊ตฌ ์ง„ํ–‰

Tokens

The figure below (from the paper) depicts the two pre-training objectives, MLM + NSP, together.

  • Input: two masked sentences, A and B
  • [CLS]: token marking the start of the input sequence
  • [SEP]: token separating the two sentences
  • Tn: the final hidden vector corresponding to each input position
  • C: the final hidden state of the [CLS] token, used for classification → used for NSP (Next Sentence Prediction) during pre-training, or for a specific task during fine-tuning

Representations

Token Embedding: a distinct embedding for each subword token produced by the WordPiece tokenizer

ex. "playing" is split into "play" + "##ing", and each piece gets its own embedding

Segment Embedding: since BERT takes two sentences (sentence A, sentence B) as input at once, this embedding marks which sentence each token belongs to

Position Embedding: since the Transformer does not model order by itself, this learned embedding injects each token's position information
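
In the paper, the input representation is simply the element-wise sum of these three embeddings (followed by LayerNorm and dropout). A minimal sketch, where the token IDs and the toy input are illustrative:

```python
import torch
import torch.nn as nn

vocab_size, max_len, hidden = 30522, 512, 768

token_emb = nn.Embedding(vocab_size, hidden)   # WordPiece token embeddings
segment_emb = nn.Embedding(2, hidden)          # sentence A (0) / sentence B (1)
position_emb = nn.Embedding(max_len, hidden)   # learned positions, not sinusoidal

# Toy input: [CLS] sentence-A tokens [SEP] sentence-B tokens [SEP]
input_ids = torch.tensor([[101, 2023, 2003, 102, 2008, 2205, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1]])
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

# BERT's input representation: element-wise sum of the three embeddings.
x = token_emb(input_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(x.shape)  # torch.Size([1, 7, 768])
```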

Pre-Training BERT: Masked LM


ex) word w4 is masked → encoder → a single classification layer produces w4′, and the model is trained so that w4′ matches the original w4; this is the masked language model procedure.

The goal of the masked language model is to train bidirectionally while having the model predict the hidden word.

Of the tokens chosen for prediction, some are replaced with [MASK], some with a wrong (random) token, and some are left untouched as the answer; having to recover the original word in every case strengthens contextual learning (the paper uses an 80% / 10% / 10% split).
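
A minimal sketch of that corruption rule, assuming integer token IDs; mask_id and the special-token handling are simplified (real preprocessing skips [CLS]/[SEP] when sampling):

```python
import torch

def mask_tokens(input_ids, vocab_size, mask_id, mlm_prob=0.15):
    """Apply BERT's MLM corruption; returns (corrupted ids, labels)."""
    labels = input_ids.clone()
    # Sample 15% of positions as prediction targets.
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100                 # ignore non-targets in the loss

    rand = torch.rand(input_ids.shape)
    corrupted = input_ids.clone()
    corrupted[selected & (rand < 0.8)] = mask_id          # 80% -> [MASK]
    random_tok = torch.randint(vocab_size, input_ids.shape)
    swap = selected & (rand >= 0.8) & (rand < 0.9)        # 10% -> random token
    corrupted[swap] = random_tok[swap]
    return corrupted, labels                 # remaining 10% stay unchanged

ids = torch.randint(5, 30000, (1, 12))
corrupted, labels = mask_tokens(ids, vocab_size=30522, mask_id=103)
```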

Pre-Training BERT: Next Sentence Prediction (NSP)

๋งŽ์€ NLP์˜ downstream task(QA, NLI ๋“ฑ)๋Š” ๋‘ ๋ฌธ์žฅ ์‚ฌ์ด์˜ ๊ด€๊ณ„๋ฅผ ์ดํ•ดํ•˜๋Š”๊ฒƒ์ด ํ•ต์‹ฌ

BERT์˜ pre-training ๊ณผ์ •์—์„œ๋Š” NSP(Next Sentence Prediction)์ด๋ผ๋Š” ํƒœ์Šคํฌ๋ฅผ ํ†ตํ•ด ๋‘ ๋ฌธ์žฅ A์™€ B๊ฐ€ ์ฃผ์–ด์กŒ์„ ๋•Œ, 50%๋Š” ์‹ค์ œ ์—ฐ์† ๋ฌธ์žฅ ์Œ(IsNext), 50%๋Š” ๋ฌด์ž‘์œ„ ๋ฌธ์žฅ ์Œ(NotNext)์œผ๋กœ ๊ตฌ์„ฑํ•˜์—ฌ ํ•™์Šต์‹œํ‚จ๋‹ค.

์ž…๋ ฅ์˜ ์ฒซ ํ† ํฐ์ธ [CLS]์˜ ์ตœ์ข… hidden state๋ฅผ NSP classifier์— ํ†ต๊ณผ์‹œ์ผœ, B ๋ฌธ์žฅ์ด A์˜ ๋‹ค์Œ ๋ฌธ์žฅ์ธ์ง€ ์•„๋‹Œ์ง€๋ฅผ ์ด์ง„ ๋ถ„๋ฅ˜๋กœ ์˜ˆ์ธกํ•˜๊ฒŒ ํ•œ๋‹ค.

โ†’ ์ด๋ฅผ ํ†ตํ•ด ๋ชจ๋ธ์€ ๋ฌธ์žฅ ๊ฐ„ ๋ฌธ๋งฅ ๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋‹ค.

Fine-tuning BERT

a. ๋‘ ๋ฌธ์žฅ์ด ๊ฐ™์€ ์˜๋ฏธ์ธ์ง€, ๋ชจ์ˆœ์ธ์ง€, ์ด์–ด์ง€๋Š”์ง€ ๋“ฑ ๋‘ ๋ฌธ์žฅ ๊ฐ„ ๊ด€๊ณ„ ํŒ๋‹จ
b. ๊ฐ์ •, ๋ฌธ๋ฒ•์„ฑ ๋“ฑ ๋ฌธ์žฅ ํ•˜๋‚˜์— ๋Œ€ํ•œ ์†์„ฑ ๋ถ„๋ฅ˜
c. ์งˆ๋ฌธ์— ๋Œ€ํ•œ ์ •๋‹ต์ด ๋ฌธ๋‹จ์˜ ์–ด๋””์— ์žˆ๋Š”์ง€ ์ฐพ๊ธฐ (Q&A)
d. ๋ฌธ์žฅ ์† ๋‹จ์–ด๋งˆ๋‹ค ์‚ฌ๋žŒ/์žฅ์†Œ/๊ธฐ๊ด€ ๋“ฑ ํ‘œ์‹œํ•˜๊ธฐ (NER, ์ด๋ฆ„ ์ธ์‹ ๋“ฑ)
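
A minimal fine-tuning sketch for case (b), single-sentence classification, using the Hugging Face transformers library; the checkpoint name, toy batch, and label count are illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pre-trained BERT plus a freshly initialized classification head on [CLS].
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

batch = tokenizer(["a great movie", "a dull movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# One fine-tuning step: the entire network, not just the new head, is updated.
outputs = model(**batch, labels=labels)
outputs.loss.backward()
print(outputs.loss.item(), outputs.logits.shape)  # scalar loss, (2, 2) logits
```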

Experiment

Looking at the experimental results, even BERT Base shows meaningful gains over the earlier methods discussed above, and BERT Large shows a strikingly large improvement across every task.

  • Comparing full BERT (MLM + NSP) against a no-NSP ablation (masking only), SQuAD, a QA task that requires understanding more than one sentence, shows a clear gain from NSP.
  • BERT Large shows that scaling up the model improves performance.


์œ„์—๋Š” ์ „์ฒด BERT ํ•™์Šต๋œ๊ฑฐ, ๋ฐ‘์—๋Š” fine tuning ์ ์šฉ ์•ˆํ•˜๊ณ  pre-training๊นŒ์ง€๋งŒ ์ˆ˜ํ–‰ํ•˜๊ณ  ๊ณ ์ •(freeze) ํ•œ ์ฑ„, ํŠน์ • ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ๋งŒ ๊บผ๋‚ด์–ด ํ•™์Šต์‹œ์ผœ ์„ฑ๋Šฅ ๋น„๊ตํ•œ ๊ฒƒ โ†’ pre-training ๋„ ๊ฝค ์„ฑ๋Šฅ ์ข‹๋‹ค ์˜๋ฏธํ•จ

📌 Significance of the Paper

Standardizing the pre-training + fine-tuning paradigm
- Previously, each NLP task needed its own architecture and training recipe; BERT showed that a single pre-trained model can be fine-tuned consistently across diverse tasks
- Bidirectional context learning via the Masked Language Model (MLM) yields stronger representations across a wide range of language-understanding tasks
- Adding just a simple task-specific layer is enough to adapt it to sentence classification, named entity recognition, question answering, and more
- Established the practical standard for transfer learning in NLP and became the template for later language-model design (e.g., RoBERTa, ALBERT, T5)
