[Review] Improving Language Understanding by Generative Pre-Training

YSL · July 21, 2023

๐Ÿ“ Improving Language Understanding by Generative Pre-Training

โ—๏ธ๊ฐœ๋…์„ ์ •๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ž‘์„ฑํ•œ ๊ธ€๋กœ, ๋‚ด์šฉ์ƒ ์ž˜๋ชป๋œ ๋ถ€๋ถ„์ด ์žˆ์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ์  ์ฐธ๊ณ  ๋ฐ”๋ž๋‹ˆ๋‹ค.

The paper itself is organized as follows:

  • Introduction
  • Related Work
  • Framework
    • Unsupervised pre-training
    • Supervised fine-tuning
  • Experiments
    • Setup
    • Supervised fine-tuning
  • Analysis
  • Conclusion

Rather than following the paper's order exactly, this post follows the flow I found easiest to understand while studying.


Introduction

Because labeled data is scarce in NLP, it is important to learn effectively from raw text.
(∵ manual labeling is difficult, and languages differ across countries)
To reduce dependence on supervised learning, research therefore began to focus on unsupervised learning.

However, it was hard to obtain more than word-level information with unsupervised learning, for the following two reasons:

  1. When pre-training text representations, it is unclear which optimization objective works best.
  2. There is no established most-effective way to transfer the learned text representations to the target task.

These two ambiguities made unsupervised learning difficult, but
recent work showed that, for the first problem, objectives such as language modeling, machine translation, and discourse coherence perform well.

⇒ GPT therefore takes a semi-supervised approach: it pre-trains on a language modeling objective over a large amount of unlabeled text, then fine-tunes the pre-trained parameters on labeled data.

Language Modeling
= being trained to predict the next word in a sequence of words
= the next-word prediction task
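Spelled out as a formula (a standard identity, not something specific to the paper), a language model factorizes a sequence's probability into next-word conditionals, and training maximizes each conditional:

$$
P(u_1, \ldots, u_T) = \prod_{t=1}^{T} P(u_t \mid u_1, \ldots, u_{t-1})
$$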

Framework

Because GPT has no separate (encoder-side) input, the model is built using only the Decoder part of the Transformer.

Compared to LSTMs and earlier RNNs, the Transformer performed better in the following respects:

  • Its structured memory imposes fewer constraints on long-term dependencies (i.e., it can reference a wider range of context)
  • It transfers robustly to diverse tasks

Unsupervised pre-training

: build a language model from a large unlabeled corpus
→ next-word prediction

$$
h_0 = U W_e + W_p \\
h_l = \text{transformer\_block}(h_{l-1}) \quad \forall l \in [1, n] \\
P(u) = \text{softmax}(h_n W_e^T)
$$
  • $U$ : the matrix of context vectors for all tokens preceding the current token
  • $W_e$, $W_p$ : the token embedding matrix and the position embedding matrix

→ Via self-attention, the context matrix $U$ is passed through the $n$ transformer blocks to obtain a probability distribution over the token being predicted.

The resulting pre-training objective is:

$$
L_1(U) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \theta)
$$

⇒ During pre-training, the model is trained to maximize the probability of the current word given the preceding $k$ words (from the word $k$ positions back up to the immediately preceding one).
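To make this concrete, here is a minimal PyTorch sketch of a decoder-only language model and the $L_1$ loss. The class name MiniGPT, the vocabulary size, and the use of nn.TransformerEncoder plus a causal mask are my own illustrative choices, not the paper's actual implementation (which uses BPE tokens, GELU activations, and is trained on BooksCorpus); the layer sizes loosely follow the paper's 12-layer, 768-dimensional setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniGPT(nn.Module):
    """Minimal decoder-only LM sketch; hyperparameters are illustrative."""
    def __init__(self, vocab_size=10000, d_model=768, n_layers=12, n_heads=12, ctx_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)       # W_e (token embeddings)
        self.pos_emb = nn.Embedding(ctx_len, d_model)          # W_p (position embeddings)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        # "decoder-only" = encoder-style blocks + a causal mask (no cross-attention)
        self.blocks = nn.TransformerEncoder(layer, n_layers)

    def hidden(self, tokens):                                  # tokens: (batch, seq) of token ids
        t = tokens.size(1)
        pos = torch.arange(t, device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(pos)           # h_0 = U W_e + W_p
        causal = torch.full((t, t), float("-inf"), device=tokens.device).triu(1)
        return self.blocks(h, mask=causal)                     # h_l = transformer_block(h_{l-1})

    def forward(self, tokens):
        # logits = h_n W_e^T (weights tied to the token embedding); softmax lives inside the loss
        return self.hidden(tokens) @ self.tok_emb.weight.T

def lm_loss(model, tokens):
    """L_1: maximize sum_i log P(u_i | u_{i-k}, ..., u_{i-1})  ==  next-token cross-entropy."""
    logits = model(tokens[:, :-1])                             # predict token i from the tokens before it
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```

Minimizing `lm_loss(model, batch)` over batches of the unlabeled corpus is the whole of unsupervised pre-training in this sketch.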

Supervised fine-tuning

: adapt the model to a specific downstream task using labeled data

$$
P(y \mid x_1, \ldots, x_m) = \text{softmax}(h_l^m W_y)
$$
  • $x_1, \ldots, x_m$ : input tokens
  • $y$ : label
  • $C$ : labeled dataset

The resulting fine-tuning objective is:

$$
L_2(C) = \sum_{(x, y)} \log P(y \mid x_1, \ldots, x_m)
$$

⇒ The model is trained so that, given the input tokens, the probability assigned to the correct label $y$ is maximized.

The only extra parameters required for fine-tuning are the label weights $W_y$ (plus embeddings for the delimiter tokens), so the model architecture barely changes between pre-training and fine-tuning.

$$
L_3(C) = L_2(C) + \lambda \cdot L_1(C)
$$
Language modeling can also be kept as an auxiliary objective during fine-tuning: the labeled data $C$ is additionally run through the pre-training (language modeling) objective, giving the combined loss $L_3$ above.
This improves the generalization of the supervised model and also speeds up convergence.
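Continuing the hypothetical MiniGPT sketch above, the snippet below shows how little fine-tuning adds: a single linear head $W_y$, and a combined loss $L_3 = L_2 + \lambda L_1$. The class and function names are my own, and the default $\lambda = 0.5$ is just an illustrative choice.

```python
class GPTClassifier(nn.Module):
    """Fine-tuning sketch: the pre-trained body plus one new linear head W_y."""
    def __init__(self, pretrained: MiniGPT, n_classes: int, d_model=768):
        super().__init__()
        self.gpt = pretrained
        self.W_y = nn.Linear(d_model, n_classes, bias=False)   # the only newly added weights

    def forward(self, tokens):
        h = self.gpt.hidden(tokens)                            # reuse the pre-trained transformer
        return self.W_y(h[:, -1])                              # h_l^m: final hidden state of the last token

def fine_tune_loss(clf, tokens, labels, lam=0.5):
    l2 = F.cross_entropy(clf(tokens), labels)                  # L_2(C): supervised task loss
    l1 = lm_loss(clf.gpt, tokens)                              # L_1(C): auxiliary LM loss on the same inputs
    return l2 + lam * l1                                       # L_3(C) = L_2(C) + λ·L_1(C)
```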


+) generalization
: doing a different task on the same data (X)
: doing the same task on new, unseen data (O)

⇒ When language modeling is used as an auxiliary objective for fine-tuning like this, the model still performs well on the same task even when given completely new data.

Task-specific input transformations


Each task's structured input must be converted into a token sequence the model can process.
Input : structured inputs → token sequences

The exact transformation differs slightly from task to task,
but in every case a start token <s> and an end token <e> are added at the beginning and end of the sequence, with a delimiter token placed between text segments for tasks that take multiple inputs (see the sketch below).
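The sketch below illustrates these transformations for the four task families in the paper (classification, entailment, similarity, and question answering / multiple choice). The `encode` function is a stand-in tokenizer and the literal token strings are placeholders; GPT-1 actually uses BPE tokens and learned embeddings for the special tokens.

```python
START, EXTRACT, DELIM = "<s>", "<e>", "$"   # start, extract(end), and delimiter tokens

def encode(text):
    # placeholder tokenizer; GPT-1 actually uses byte-pair encoding (BPE)
    return text.split()

def classification_input(text):
    # single text:  <s> text <e>
    return [START] + encode(text) + [EXTRACT]

def entailment_input(premise, hypothesis):
    # two texts joined by a delimiter:  <s> premise $ hypothesis <e>
    return [START] + encode(premise) + [DELIM] + encode(hypothesis) + [EXTRACT]

def similarity_inputs(text_a, text_b):
    # similarity has no natural ordering, so both orderings are fed through the
    # model and their final hidden states are combined before the task head
    return ([START] + encode(text_a) + [DELIM] + encode(text_b) + [EXTRACT],
            [START] + encode(text_b) + [DELIM] + encode(text_a) + [EXTRACT])

def multiple_choice_inputs(context, question, answers):
    # one sequence per candidate answer; each is scored separately and the
    # scores are normalized with a softmax
    return [[START] + encode(context + " " + question) + [DELIM] + encode(a) + [EXTRACT]
            for a in answers]
```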

Conclusion

GPT vs. Transformer vs. BERT

Both GPT and BERT are models derived from the Transformer, but their purposes and architectures are clearly different.

Transformer

A model created for machine translation, composed of
『an Encoder that captures the features of language A + a Decoder that captures the features of language B and combines them with the Encoder's features to translate A → B』.

BERT

A model created to perform word embedding effectively with deep learning;
it consists of 『the Encoder's self-attention + masked inputs』, so it can capture context through bidirectional attention.

GPT

์ƒ์„ฑ์„ ์œ„ํ•œ ๋ชจ๋ธ๋กœ,
ใ€ŽLinear, softmax layer๊ฐ€ ํฌํ•จ๋œ Decoderใ€๋งŒ์„ ์‚ฌ์šฉํ•ด ๋‹ค์Œ ๋‹จ์–ด๊ฐ€ ๋‚˜์˜ฌ ํ™•๋ฅ ์„ ์ž˜ ์˜ˆ์ธกํ•˜๋„๋ก ์„ค๊ณ„๋˜์—ˆ๋‹ค.


References

๐Ÿ“ Visualizing A Neural Machine Translation Model (Mechanics of Seq2seq Models With Attention)
ใ„ด GPT ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ „๋ฐ˜์ ์ธ NLP์— ๋Œ€ํ•ด ์ •๋ง ์„ค๋ช…์ด ์ž˜ ๋˜์–ด์žˆ๋‹ค. ์ฝ์–ด๋ณด๋ฉด์„œ ๋งŽ์€ ๋„์›€์ด ๋˜์—ˆ๋‹ค ๐Ÿ‘๐Ÿป
๐Ÿ“ Improving Language Understanding by Generative Pre-Training (GPT1)
๐Ÿ“ ํŠธ๋žœ์Šคํฌ๋จธ ๊ธฐ๋ฐ˜ ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ๋ชจ๋ธ ๊ฐ„๋žตํ•˜๊ฒŒ ํ›‘์–ด๋ณด๊ธฐ
๐Ÿ“ Transformer (Attention Is All You Need) ๊ตฌํ˜„ํ•˜๊ธฐ (1/3)
