[Review] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

YSL · July 17, 2023

๐Ÿ“ BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding
(์›๋ฌธ)

โ—๏ธ๊ฐœ๋…์„ ์ •๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ž‘์„ฑํ•œ ๊ธ€๋กœ, ๋‚ด์šฉ์ƒ ์ž˜๋ชป๋œ ๋ถ€๋ถ„์ด ์žˆ์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ์  ์ฐธ๊ณ  ๋ฐ”๋ž๋‹ˆ๋‹ค.

The paper itself is organized as follows:

  • Introduction
  • Related Work
    • Unsupervised Feature-based Approaches
    • Unsupervised Fine-tuning Approaches
    • Transfer Learning from Supervised Data
  • BERT
    • Model Architecture
    • Input/Output Representation
      • Pre-training BERT
      • Fine-tuning BERT
  • Experiments
    • GLUE
    • SQuAD v1.1
    • SQuAD v2.0
    • SWAG
  • Ablation Studies
    • Effect of Pre-training Tasks
    • Effect of Model Size
    • Feature-based Approach with BERT
  • Conclusion

Rather than following the paper's order exactly, this post follows the flow that was easiest for me to understand while studying.


Introduction

BERT is pre-trained on unlabeled data with MLM (Masked Language Model) and NSP (Next Sentence Prediction); the resulting pre-trained parameters are then updated by fine-tuning on labeled data for the target task.


Pre-training has been shown to improve performance on many NLP problems, including both token-level and sentence-level tasks.

Such pre-trained representations can be applied to downstream tasks in two ways:

1. Feature-based Approaches

: A task-specific model architecture is used, and the pre-trained representations are plugged into it as additional features; the pre-trained weights themselves are not updated for the downstream task.

2. Fine-tuning Approaches โœ“

: The number of task-specific parameters is kept to a minimum; the generally pre-trained parameters are fine-tuned on the target task, so that all parameters are updated.

The paper classifies ELMo as a feature-based approach, and GPT and BERT as fine-tuning approaches.

feature-based vs. fine-tuning
I initially found the difference between the two hard to grasp. Roughly: a feature-based approach keeps the pre-trained representations fixed and feeds them into a separate task-specific model, while a fine-tuning approach re-trains (updates) the pre-trained parameters themselves on the target task. Transfer learning also comes up later; I plan to write up the meaningful differences between the three.

During pre-training, both approaches share the same objective function:
they use unidirectional language models to learn general language representations.

This unidirectional training limits the power of the pre-trained representations, and the paper argues this is especially problematic for the fine-tuning approach.

e.g., GPT: a left-to-right architecture, so each token can only attend to the tokens that came before it.
This is sub-optimal for sentence-level tasks and especially harmful for token-level tasks such as question answering, where context from both directions is crucial.
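
To make the directionality difference concrete, here is a minimal NumPy sketch (my own illustration, not something from the paper) of the attention patterns the two settings allow:

```python
import numpy as np

seq_len = 5

# Bidirectional self-attention (BERT encoder): every token may attend to every position.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

# Left-to-right attention (GPT-style decoder): token i may only attend to positions <= i.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]
```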

BERT

: Bidirectional Encoder Representations from Transformers
BERT is built as a stack of Transformer encoder layers.

Why use only the encoder?
The answer follows from BERT's purpose.
Unlike GPT, which is built to generate text, or the original Transformer, which is built for machine translation,
BERT is not a model for one specific task; it focuses on capturing the overall context of language.
BERT itself is therefore meant to produce good contextual embeddings of the input, so it only needs the encoder, whose job is to condense information about the input.

BERT is structured as follows (Figure 1 in the paper).

BERT์˜ framework๋Š” ๋‘ ๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑ๋˜๋Š”๋ฐ
์™ผ์ชฝ ๋ถ€๋ถ„์˜ pre-training step๊ณผ ์˜ค๋ฅธ์ชฝ ๋ถ€๋ถ„์˜ fine-tuning step์„ ๊ฑฐ์นœ๋‹ค.
๋”ฐ๋ผ์„œ ๋™์ผํ•œ pre-trained ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ์ดˆ๊ธฐํ™”๋œ ๋ชจ๋ธ์—์„œ ์‹œ์ž‘ํ•˜์—ฌ task์— ๋”ฐ๋ผ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์—…๋ฐ์ดํŠธ๋˜์–ด ๊ฒฐ๊ณผ์ ์œผ๋กœ ์„œ๋กœ ๋‹ค๋ฅธ fine-tuned ๋ชจ๋ธ์„ ๊ฐ–๊ฒŒ ๋œ๋‹ค.

Input / Output Representation

BERT์˜ ์ž…๋ ฅ์€ ์ด 3๊ฐœ์˜ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ๊ฐ€ ํ•ฉ์ณ์ง„ ํ˜•ํƒœ์ด๋‹ค.

Sequence : a single sentence or a sentence pair (e.g., <Question, Answer>)

  • Single sentence : mainly used for classification tasks, e.g., spam detection
  • Sentence pair : mainly used for tasks over pairs of sentences, e.g., Question-Answering

When text comes in:
step 1) split it into WordPiece tokens, each of which is mapped to a token embedding vector
step 2) prepend the [CLS] token as the first token of every sequence
step 3) if a text pair comes in, join the two sentences with a [SEP] token so they form a single sequence

⇒ token embedding of each word + segment embedding indicating whether the token belongs to sentence A or B + position embedding encoding the token's position in the sequence
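
For illustration, here is a hand-written sketch (not the paper's code) of how a sentence pair is packed into one sequence with [CLS], [SEP], and segment ids; the example pair follows Figure 2 of the paper:

```python
def build_bert_input(tokens_a, tokens_b=None):
    """Pack one or two WordPiece-tokenized sentences into a single BERT input sequence."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)                 # sentence A -> segment 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)    # sentence B -> segment 1
    return tokens, segment_ids

tokens, segments = build_bert_input(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)
# tokens:   [CLS] my dog is cute [SEP] he likes play ##ing [SEP]
# segments:   0    0   0   0   0   0    1    1     1    1    1
```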

[CLS] : the final hidden state of this token is used as the aggregate sequence representation for classification tasks.

Positional Encoding vs. Positional Embedding

  • Positional Encoding - the method used by the Transformer encoder
    : a fixed function (e.g., the sinusoidal functions) produces a unique vector for each position; this positional vector describes where each word sits and is added to the word embedding

  • Positional Embedding - the method used by BERT
    : an additional, learnable embedding layer is used to represent position information
    → as the model trains, it learns for itself how best to embed positions
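
A minimal PyTorch sketch of how the three embeddings are combined (module names and sizes are my own illustration; the real model also applies LayerNorm and dropout, omitted here):

```python
import torch
import torch.nn as nn

class BertInputEmbeddings(nn.Module):
    """Sum of token, segment, and learned position embeddings."""
    def __init__(self, vocab_size=30522, hidden=768, max_len=512, n_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden)
        self.segment = nn.Embedding(n_segments, hidden)
        # Learnable position embedding, unlike the fixed sinusoidal encoding of the Transformer.
        self.position = nn.Embedding(max_len, hidden)

    def forward(self, input_ids, segment_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return (self.token(input_ids)
                + self.segment(segment_ids)
                + self.position(positions))
```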

Pre-training BERT

BERT learns bidirectionally by pre-training on two unsupervised tasks.

1. Masked Language Model

A standard conditional language model can only be trained left-to-right or right-to-left. If it were conditioned on both directions at once, each word could indirectly "see itself": in a multi-layer architecture, the output of the previous layer already carries information about every token in the sentence, so the model could trivially predict the target word.

⇒ MLM instead masks some of the tokens in the sentence and predicts the original value of each masked token from the surrounding context on both sides.

The paper masks 15% of all tokens. However, the [MASK] token never appears in the actual fine-tuning tasks, which creates a mismatch between pre-training and fine-tuning.
To mitigate this, of the 15% of selected tokens,

  • 80% are replaced with the [MASK] token
  • 10% are replaced with a random token
  • 10% are kept as the original token

Based on these masked positions, the model is trained to predict the original tokens using a cross-entropy loss.
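
A rough Python sketch of this 15% / 80-10-10 selection rule (my own illustration; the vocabulary handling and special-token names are assumptions, not the paper's code):

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """BERT-style masking: of the ~15% selected tokens, 80% -> [MASK], 10% -> random, 10% -> unchanged."""
    masked = list(tokens)
    labels = [None] * len(tokens)            # only masked positions contribute to the loss
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]") or random.random() >= mask_prob:
            continue
        labels[i] = tok                      # the original token is the prediction target
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"
        elif r < 0.9:
            masked[i] = random.choice(vocab)
        # else: keep the original token unchanged
    return masked, labels
```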

2. Next Sentence Prediction

With MLM alone, the model can learn relationships between tokens, but not relationships between sentences. NSP therefore teaches the masked language model text-pair representations.
Sentence A and sentence B are fed into the model, where

  • 50% of the time, B is the sentence that actually follows A (→ IsNext)
  • 50% of the time, B is a random sentence from the corpus (→ NotNext)

From this data, the model learns whether or not the two sentences are consecutive.
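
A rough sketch of how such IsNext / NotNext pairs could be built from a corpus of documents (illustrative only; a real implementation would also avoid drawing the random sentence from the same document):

```python
import random

def make_nsp_example(doc, corpus, idx):
    """Return (sentence_a, sentence_b, label) with a 50/50 IsNext / NotNext split."""
    sentence_a = doc[idx]
    if random.random() < 0.5 and idx + 1 < len(doc):
        sentence_b, label = doc[idx + 1], "IsNext"                 # the actual next sentence
    else:
        random_doc = random.choice(corpus)
        sentence_b, label = random.choice(random_doc), "NotNext"   # a random sentence
    return sentence_a, sentence_b, label
```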

Fine-tuning BERT

After pre-training, BERT carries information about general language representations: relationships between tokens and relationships between sentences. To use it, you simply train it on labeled data for the target task, updating the pre-trained parameters.
This step is relatively cheap compared to pre-training.
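
As a minimal sketch of what this looks like in practice (assuming the Hugging Face transformers library, which is not part of the paper or this post), a classification head is placed on top of the pre-trained encoder and all parameters are updated:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("this movie was great", return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**inputs, labels=labels)   # the classification head sits on the [CLS] representation
outputs.loss.backward()                    # all pre-trained parameters receive gradients
```

The same pre-trained weights can be re-loaded with a different head for a different task, which is exactly the "one pre-trained model, many fine-tuned models" picture from the framework figure above.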

Conclusion

BERT learns its parameters bidirectionally from large amounts of unlabeled data. These parameters serve as a better starting point for further training. They are then updated with labeled data for the target task.

With this pre-training → fine-tuning recipe, BERT achieved state-of-the-art results on 11 NLP tasks. +) Larger model sizes gave consistent accuracy gains, even on small-scale tasks.

BERT vs. GPT


Both are Transformer-based models, but their purposes and architectures differ.
The Transformer is a self-attention model built for machine translation;
taking Korean-to-English translation as an example,

  • the encoder learns a representation of the Korean sentence, and
  • the decoder learns to produce the English sentence,

and the two parts together perform the final translation.

BERT and GPT, in contrast, each use only a part of the Transformer.

BERT

  • learns bidirectionally through self-attention
  • can attend to context on both the left and the right of the current token
  • uses only the Transformer encoder
  • specialized for filling in masked (blank) tokens

GPT

  • trained left-to-right
  • can only attend to context on the left of the current token
  • uses only the Transformer decoder
  • specialized for predicting the next word


References

๐Ÿ“ The Bidirectional Language Model
๐Ÿ“ [์ตœ๋Œ€ํ•œ ์ž์„ธํ•˜๊ฒŒ ์„ค๋ช…ํ•œ ๋…ผ๋ฌธ๋ฆฌ๋ทฐ] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (1)
๐Ÿ“ [NLP] ์ตœ๋Œ€ํ•œ ์‰ฝ๊ฒŒ ์„ค๋ช…ํ•œ Transformer
๐Ÿ“ [NLP | ๋…ผ๋ฌธ๋ฆฌ๋ทฐ] BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding ์ƒํŽธ
๐Ÿ“ [๋”ฅ๋Ÿฌ๋‹์„ ์ด์šฉํ•œ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ์ž…๋ฌธ] 12-02 ์–‘๋ฐฉํ–ฅ LSTM๊ณผ CRF(Bidirectional LSTM + CRF)
๐Ÿ“ Paper Dissected: โ€œBERT: Pre-training of Deep Bidirectional Transformers for Language Understandingโ€ Explained
๐Ÿ“ Feature-based Transfer Learning vs Fine Tuning?
๐Ÿ“ Large Language Models (LLM): Difference between GPT-3 & BERT
