GPT-1 : Improving Language Understanding by Generative Pre-Training (2018)

kellsie · March 22, 2025

Paper Review

๋ชฉ๋ก ๋ณด๊ธฐ
5/12

Original Paper : GPT-1 (https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)


📥 Background

Transformer

For a primer on the Transformer, see my earlier post: https://velog.io/@angel5893/Transformer-Attention-Is-All-You-Need-2017

Previously, attention had been used as a supplement to RNNs; the Transformer keeps only this concept, drops the RNN entirely, and builds both the Encoder and the Decoder out of self-attention.

(1) Positional Embedding

Because no RNN or CNN is used, extra positional information must be injected to represent the order of the sequence. To this end, a "positional embedding" is added to the input embeddings at the bottom of the encoder and decoder stacks.

A sine-cosine based encoding is used, applying a different frequency to each dimension.

PE_{(pos, 2i)} = \sin \left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}} \right)

PE_{(pos, 2i+1)} = \cos \left(\frac{pos}{10000^{\frac{2i}{d_{\text{model}}}}} \right)
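As a quick sanity check, the sinusoidal encoding can be computed in a few lines of NumPy (a sketch; the function name and shapes are my own, not from the post):

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding from 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angle = pos / np.power(10000, (2 * i) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                     # even dimensions: sin
    pe[:, 1::2] = np.cos(angle)                     # odd dimensions: cos
    return pe

pe = positional_encoding(seq_len=50, d_model=16)    # one row per position
```

Each dimension pair gets its own frequency, so every position receives a distinct, bounded vector that can simply be added to the token embedding.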

(2) Attention

self-attention

\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^\top}{\sqrt{d_k}} \right) V

[Figure 1] the self-attention computation process (source: Introduction to NLP Using Deep Learning, ch. 16 [https://wikidocs.net/35596])


[Figure 2] the kinds of self-attention used in the encoder and decoder (source: Introduction to NLP Using Deep Learning, ch. 16 [https://wikidocs.net/35596])


1. encoder self-attention : the attention used in the encoder

  • Role : learns the relationships between words within the source sentence
    (figures out which other words each word is most closely related to)
  • Directionality : bidirectional — every word can attend to every other word
  • Source of Q, K, V : query = key = value (all come from the same original vectors)

2. masked decoder self-attention : the first sub-layer of the decoder

  • Role : masking so that only earlier words in the target sentence can be attended to
    (words that come after the word currently being predicted cannot be seen)
    ← unlike an RNN, the whole sentence is fed in at once, so looking ahead would mean peeking at the answer
  • Source of Q, K, V : query = key = value (all come from the same original vectors)

3. encoder-decoder attention : attention that takes the encoder's vectors as input

  • Role : connects the context vectors produced by the encoder so the decoder can use them
  • Compares the decoder's query against the encoder's keys and values to find the most relevant information
  • Source of Q, K, V
    • query : from decoder input
    • key = value : from encoder output


📄 Paper Review

0. Abstract

NLP spans many tasks — textual entailment, QA, semantic similarity assessment, and so on — and for each task, labeled data (data with ground-truth answers) is very scarce.

This paper therefore proposes pre-training on unlabeled text and then fine-tuning on each task.

The model achieved the following results:

  • New state of the art on 9 of the 12 tasks studied.
  • Commonsense reasoning : +8.9% / QA : +5.7% / textual entailment : +1.5% (absolute improvements)


1. Introduction

(1) The importance of unsupervised learning from unlabeled data

The ability to learn effectively from raw data is crucial for reducing NLP's dependence on supervised learning. Yet most deep learning methods require vast amounts of labeled data, which is an obstacle when applying models to domains where labels are scarce.

Models that can exploit linguistic information from unlabeled data therefore offer a valuable alternative that avoids extra labeling work. Moreover, even when plenty of supervised data exists, learning good representations in an unsupervised way still provides a significant performance boost.


(2) Why it is hard to extract more-than-word-level information (context, etc.) with unsupervised learning

  1. It is unclear which objective function is most suitable. Recent research has found that different tasks favor different optimization objectives.

  2. There is no consensus on the most effective way to transfer representations learned in an unsupervised fashion. Currently, transfer is done through architecture changes, complicated training schemes, or added auxiliary objectives, which makes it hard to establish an effective semi-supervised recipe.

This paper proposes a semi-supervised approach built from the following combination:

"unsupervised pre-training + supervised fine-tuning"

1. Pre-training : learn initial parameters from universal unlabeled data, rather than data from any single target domain
2. Fine-tuning : adjust those parameters with supervised learning for each task


(3) Using the Transformer
The Transformer handles long-range dependencies far better than RNNs and has achieved strong results on machine translation, document generation, and more, which allows it to attain robust transfer performance across a wide range of tasks.

During transfer, traversal-style approaches are used, which process structured text inputs as a single contiguous sequence. This makes fine-tuning effective with only minimal changes to the pre-trained model.


(4) Experimental results : performance improvements

  • Commonsense reasoning : +8.9% (Stories Cloze Test)
  • QA : +5.7% (RACE)
  • Textual entailment : +1.5% (MultiNLI)
  • GLUE benchmark : +5.5%


2. Related Work

(1) Semi-supervised learning for NLP

Early semi-supervised learning computed word-level or phrase-level statistics from unlabeled data and then fed them into a supervised model as features.

Word embeddings, widely used over the past several years, only transfer word-level information — the meaning of each individual word.

Recent work has succeeded in learning phrase- and sentence-level semantic representations from unlabeled data and converting them into vector representations usable across a variety of tasks.


(2) Unsupervised pre-training

์ค€์ง€๋„ ํ•™์Šต์˜ ํŠน์ด ์ผ€์ด์Šฌ, ์ง€๋„ ํ•™์Šต ๋ชฉ์  ํ•จ์ˆ˜๋ฅผ ์ˆ˜์ •ํ•˜๋Š” ๋Œ€์‹ ์— ์ข‹์€ ์ดˆ๊ธฐํ™” ๊ฐ’์„ ์ฐพ๋Š” ๊ฒƒ์ด ๋ชฉํ‘œ์ด๋‹ค.

์‚ฌ์ „ํ•™์Šต ํ›„ fine-tuning์„ ์ง„ํ–‰ํ•˜๋Š” ๋ฐฉ์‹์€ ์ด๋ฏธ ์„ ํ–‰ ์—ฐ๊ตฌ์—์„œ ์ œ์•ˆ๋˜์—ˆ์œผ๋‚˜, ํ•ด๋‹น ๋ชจ๋ธ์—์„œ๋Š” LSTM์„ ์‚ฌ์šฉํ•˜์˜€๊ธฐ ๋•Œ๋ฌธ์— ์งง์€ ๋ฒ”์ฃผ์˜ ์˜ˆ์ธก๋งŒ ๊ฐ€๋Šฅํ•˜๋‹ค๋Š” ๋ฌธ์ œ์ ์ด ์žˆ์—ˆ๋‹ค. ๋ณธ ๋ชจ๋ธ์€ transformer๋ฅผ ํ†ตํ•ด ์ด๋Ÿฌํ•œ ์žฅ๊ธฐ ๊ธฐ์–ต ์†์‹ค์˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ–ˆ๋‹ค.


(3) Auxiliary training objectives

Auxiliary training objectives — objective functions trained in addition to the main goal — improve generalization and representational power, and can be viewed as a form of semi-supervised learning.

Recent work has improved performance with various auxiliary objectives such as POS tagging, chunking, and an added language-modeling objective. This paper, however, shows that strong unsupervised pre-training by itself already endows the model with a great deal of linguistic information. In other words, the auxiliary objective is optional, and pre-training alone yields sufficiently good performance.



3. Framework

3.1 Unsupervised pre-training

[Figure 3] unsupervised pre-training architecture (source: excerpted from the paper)


(1) Objective function (the parameters are optimized with SGD)

\mathcal{L}_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)

( ※ where \mathcal{U} = \{u_1, \ldots, u_n\} is the unlabeled token corpus and k is the context window size )
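To make the objective concrete, here is a small hedged sketch: `probs` is a hypothetical stand-in for the Transformer decoder (any model mapping a context of at most k previous tokens to a distribution over the vocabulary), and we simply sum the log-probabilities:

```python
import numpy as np

def lm_log_likelihood(tokens, probs, k):
    """L1(U) = sum_i log P(u_i | u_{i-k}, ..., u_{i-1}; Theta)."""
    total = 0.0
    for i in range(len(tokens)):
        context = tuple(tokens[max(0, i - k):i])   # at most k previous tokens
        total += np.log(probs(context)[tokens[i]])  # log-prob of the actual next token
    return total

# toy stand-in model: uniform over a 4-token vocabulary, ignoring context
uniform = lambda context: [0.25, 0.25, 0.25, 0.25]
ll = lm_log_likelihood([0, 1, 2, 3], uniform, k=2)  # equals 4 * log(0.25)
```

Training maximizes this sum (equivalently, minimizes its negation as the loss).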


(2) Transformer decoder

The paper uses a stack of 12 Transformer decoder blocks. Each step works as follows.

(※ The walkthrough below is not taken from the paper; it is my own summary based on searching around, so it may contain mistakes.)

1. Input embedding : h_0 = UW_e + W_p

  • U : the tokens of the unsupervised corpus
    tokenized with BPE, each token carrying its index number in the vocabulary
  • W_e : token embedding matrix
  • W_p : positional embedding matrix

โ“ ์œ„ ์ˆ˜์‹์—์„œ ๋“ค์—ˆ๋˜ ์˜๋ฌธ

์ˆ˜์‹์—์„œ UU์™€ WeW_e๊ฐ€ ๊ณฑํ•ด์ง„๋‹ค๊ณ  ๋ผ์žˆ๋Š”๋ฐ, ๋ฃฉ์—…์ด ๋˜๋ ค๋ฉด UU๊ฐ€ ์›-ํ•ซ ๋ฒกํ„ฐ์—ฌ์•ผ ๊ฐ€๋Šฅํ•œ ๊ฒƒ ์•„๋‹Œ๊ฐ€? ๋‚ด๊ฐ€ ์ธํ„ฐ๋„ท ์ฐพ์•„๋ดค์„ ๋•Œ, UU๋Š” ์‚ฌ์ „์—์„œ์˜ ๊ฐ ์ˆœ์„œ๊ฐ€ ๋ฒˆํ˜ธ๋กœ ๋‚˜์—ด๋ผ์žˆ๋Š” ๋ฒกํ„ฐ๋ผ๊ณ  ํ–ˆ๋Š”๋ฐ ๋ญ์ง€?

โ†’ ์ˆ˜์‹์—์„œ๋งŒ ์ €๋ ‡๊ฒŒ ๊ณฑํ•œ๋‹ค๊ณ  ๋‚˜ํƒ€๋‚œ ๊ฒƒ์ด๊ณ , ์‹ค์ œ๋กœ๋Š” UU์˜ ๊ฐ ์š”์†Œ๋ฅผ ์ธ๋ฑ์Šค ์‚ผ์•„ WeW_e์—์„œ ์ธ๋ฑ์‹ฑํ•˜์—ฌ ์‚ฌ์šฉ
e.g. UU[1] = 32 โ†’ WeW_e[32] = [0.1, ... , 0.08] ์— WpW_p ๋”ํ•˜๊ธฐ
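In code the two views coincide exactly — the one-hot matrix product in the formula and the row-indexing lookup used in practice give the same result. A minimal NumPy sketch (all sizes and values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, seq_len, d_model = 100, 4, 8

W_e = rng.normal(size=(vocab_size, d_model))   # token embedding matrix
W_p = rng.normal(size=(seq_len, d_model))      # positional embedding matrix
U = np.array([32, 5, 77, 2])                   # BPE token indices

# What implementations actually do: row indexing (lookup)
h0 = W_e[U] + W_p

# What the formula h0 = U We + Wp literally says, with U as one-hot rows
U_onehot = np.eye(vocab_size)[U]
h0_formula = U_onehot @ W_e + W_p
```

Multiplying a one-hot row by W_e just selects one row of W_e, which is why indexing is the standard (and much cheaper) implementation.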


2. Masked Multi-Head Self-Attention

\text{Attention}(Q, K, V) = \text{softmax} \left( \frac{QK^\top}{\sqrt{d_k}} + \text{mask} \right) V

(1) Using the embedding vectors computed in step 1 as the source, generate the Q, K, V vectors
(2) Compute \text{attention\_score} = \frac{QK^\top}{\sqrt{d_k}}, then add an upper-triangular mask whose entries above the diagonal are -\infty
→ shape : (\text{seq\_len}, \text{seq\_len})
(3) Pass the masked \text{attention\_score} through softmax = the \text{attention weight}
(4) Multiply by V, concatenate the resulting matrices across all heads, then apply the output projection matrix W^O as a final linear transformation

Since prose alone will be hard to follow when rereading, let's walk through the process once more with an example.


Example sentence : "I am in Paris"

(0) Tokenize into the 4 tokens → apply token embedding and positional embedding

(1) Compute QK^T = \text{scores}

\begin{bmatrix} 1.3 & 1.1 & 0.9 & 0.6 \\ 1.2 & 1.5 & 1.0 & 0.5 \\ 0.7 & 0.8 & 1.3 & 0.4 \\ 0.6 & 0.7 & 1.1 & 1.4 \\ \end{bmatrix}

(2) Masking

\text{Mask} = \begin{bmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \\ \end{bmatrix}

\text{Masked Scores} = \begin{bmatrix} 1.3 & -\infty & -\infty & -\infty \\ 1.2 & 1.5 & -\infty & -\infty \\ 0.7 & 0.8 & 1.3 & -\infty \\ 0.6 & 0.7 & 1.1 & 1.4 \\ \end{bmatrix}

(3) Pass through softmax (values rounded)
\text{Attention Weights} = \begin{bmatrix} 1.00 & 0.00 & 0.00 & 0.00 \\ 0.43 & 0.57 & 0.00 & 0.00 \\ 0.25 & 0.28 & 0.46 & 0.00 \\ 0.17 & 0.18 & 0.28 & 0.37 \\ \end{bmatrix}

(4) Multiply by V to get the final attention output (multi-head concat omitted)

V = \begin{bmatrix} 0.1 & 0.0 \\ 0.2 & 0.5 \\ 0.4 & 0.2 \\ 0.9 & 0.7 \\ \end{bmatrix}

\text{Output} = \text{Attention Weights} \times V = \begin{bmatrix} 0.1 & 0.0 \\ 0.157 & 0.287 \\ 0.267 & 0.233 \\ 0.499 & 0.408 \\ \end{bmatrix}
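The steps above can be reproduced mechanically with NumPy, using the scores and V matrices from the example (a verification sketch, not code from the post):

```python
import numpy as np

# QK^T / sqrt(d_k) scores and V from the worked example above
scores = np.array([[1.3, 1.1, 0.9, 0.6],
                   [1.2, 1.5, 1.0, 0.5],
                   [0.7, 0.8, 1.3, 0.4],
                   [0.6, 0.7, 1.1, 1.4]])
V = np.array([[0.1, 0.0],
              [0.2, 0.5],
              [0.4, 0.2],
              [0.9, 0.7]])

# (2) causal mask: -inf strictly above the diagonal, 0 elsewhere
mask = np.triu(np.full_like(scores, -np.inf), k=1)
masked = scores + mask

# (3) row-wise softmax over the masked scores (max-subtraction for stability)
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# (4) weighted sum of the value vectors
output = weights @ V
```

Because exp(-inf) = 0, the masked positions get exactly zero weight, so each row attends only to itself and earlier tokens.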


3. Residual Connection

  • GPT-1 also uses the residual connections introduced with the Transformer
  • Residual connections wrap the \text{Masked Multi-Head Self-Attention} and \text{Feed Forward} sub-layers
  • Purpose of residual connections
    (1) Prevent vanishing gradients
    \frac{\partial h_2}{\partial h_1} = \frac{\partial \left( h_1 + F(h_1) \right)}{\partial h_1} = \mathbf{I} + \frac{\partial F(h_1)}{\partial h_1}
    → the identity matrix guarantees a direct gradient path with derivative 1, so the gradient cannot fully vanish

    (2) Improve training stability (NOT reduced computation)
    The operation adds the previous value back in, so computation increases slightly if anything, but the input can be carried forward stably through later layers. (Reducing computation is not the purpose of residual connections.)

โ“ ResNet ๋•Œ๋ถ€ํ„ฐ ๋“ค์—ˆ๋˜ ์˜๋ฌธ

์ž”์ฐจ ํ•™์Šต์ด๋ฉด ์ด์ „ ๊ฐ’์„ ๋บ€ ์ž”์ฐจ๊ฐ’์„ ํ•™์Šตํ•˜๋Š”๊ฑด๊ฐ€? ๊ทธ๋Ÿผ ๊ณ„์‚ฐ์ด ๋” ๋ณต์žกํ•ด์ง€์ง€ ์•Š๋‚˜?

โ†’ ๋ง์ด "์ž”์ฐจ" ํ•™์Šต์ด์ง€, ์‚ฌ์‹ค์€ ๊ทธ๋ƒฅ ์ด์ „ ๊ฐ’์— ์ƒˆ๋กญ๊ฒŒ ๊ณ„์‚ฐ๋œ ๊ฐ’์„ ๋”ํ•˜๋Š” ๊ฒƒ์ž„!
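The identity-plus-Jacobian claim above can be checked numerically. Below, F is an arbitrary small linear sub-layer (a toy choice of mine, not from the post), and a finite-difference Jacobian of h2 = h1 + F(h1) is compared against I + ∂F/∂h1:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = 0.1 * rng.normal(size=(d, d))      # toy linear sub-layer: F(h) = h @ A

def F(h):
    return h @ A

h1 = rng.normal(size=d)
eps = 1e-6
base = h1 + F(h1)

# finite-difference Jacobian of h2 = h1 + F(h1) with respect to h1
J = np.zeros((d, d))
for j in range(d):
    e = np.zeros(d)
    e[j] = eps
    J[:, j] = ((h1 + e) + F(h1 + e) - base) / eps

# analytic Jacobian: identity plus the Jacobian of F (A.T for F(h) = h @ A)
J_analytic = np.eye(d) + A.T
```

The identity term is present no matter how small F's own gradient becomes, which is exactly the anti-vanishing argument.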

4. Layer Norm

  • Post-LN : the normalization placed after the residual connection
  • Role of normalization : keeps the scale of the values tidy and makes training faster and more stable

5. Feed Forward : adds non-linearity
(1) \text{Linear}(d_{model}, d_{ff}) ← typically d_{ff} = 4 \cdot d_{model}
(2) \text{GELU} (the original Transformer used ReLU here; GPT-1 uses the GELU activation)
(3) \text{Linear}(d_{ff}, d_{model})
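A hedged sketch of the position-wise feed-forward block (all sizes hypothetical; the tanh approximation of GELU is used):

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation used in GPT-1."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: Linear(d_model, d_ff) -> GELU -> Linear(d_ff, d_model)."""
    return gelu(x @ W1 + b1) @ W2 + b2

d_model, d_ff = 8, 32                   # d_ff = 4 * d_model
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

x = rng.normal(size=(4, d_model))       # (seq_len, d_model)
y = feed_forward(x, W1, b1, W2, b2)     # shape preserved: (seq_len, d_model)
```

The same two-layer MLP is applied independently at every position, expanding to d_ff and projecting back to d_model.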

6. Predict the words that were masked out (i.e., the future words), then compute the loss and backpropagate

  • Since this is unsupervised pre-training, no separate labels exist — each position's target is simply the next token of the corpus itself.
  • Predict the words that were masked out in step 2 above

3.2 Supervised fine-tuning

1. Inputs and labels
A labeled dataset \mathcal{C} is used

  • input sequence : x_1, x_2, ..., x_m
  • ground-truth label : y

2. Computing the output probabilities
(1) Pass the input sequence through the pre-trained GPT model and take the last Transformer block's output for the final token, h^{(l)}_m
(2) Feed that vector into a newly added linear classifier W_y to compute the prediction probabilities

P(y \mid x_1, ..., x_m) = \text{softmax}(h^{(l)}_m W_y)

3. Fine-tuning loss function

\mathcal{L}_2(\mathcal{C}) = \sum_{(x, y)} \log P(y \mid x)

4. The language-modeling objective \mathcal{L}_1 is added as an auxiliary training objective
To improve fine-tuning, the \mathcal{L}_1 objective from pre-training is optimized jointly as an auxiliary loss
→ better generalization and faster convergence

\mathcal{L}_3(\mathcal{C}) = \mathcal{L}_2(\mathcal{C}) + \lambda \cdot \mathcal{L}_1(\mathcal{C})
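A minimal numeric sketch of steps 2–4, written as losses (negated log-likelihoods); all shapes and values are hypothetical, and the paper sets λ = 0.5:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_classes = 8, 3

h_m = rng.normal(size=(1, d_model))    # last-token output of the last block
W_y = rng.normal(size=(d_model, n_classes))

p = softmax(h_m @ W_y)                 # P(y | x_1, ..., x_m)
y = 2                                  # gold label for this one example
l2 = -np.log(p[0, y])                  # fine-tuning (classification) loss

l1 = 1.7                               # stand-in for the LM loss over C
lam = 0.5                              # lambda from the paper
l3 = l2 + lam * l1                     # combined objective: L3 = L2 + lambda * L1
```

Only W_y is new at fine-tuning time; everything producing h_m comes from pre-training, which is why the adaptation needs so few extra parameters.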

3.3 Task-specific input transformations

1. Sentence classification : add layers on top of the pre-trained model
(1) linear : project down to as many dimensions as there are classes
(2) softmax : compute the probability of each class
(3) compute the loss and backpropagate


2. Textual entailment, similarity, QA / multi-turn QA, commonsense reasoning : transform the input

  • Prior work required extra architecture customization or entirely new architectures per task — a complicated process.

  • The traversal-style approach instead processes the structured text input as a single contiguous sequence

    (QA example)

    context : Whenever Minseong wants chicken, he tells his dad, who lives in Japan, and his dad orders the chicken for him on Yogiyo.
    Question : How does Minseong's dad order the chicken?

    => concatenate the context and the question into a single input!


\quadโ“context์— ๋Œ€ํ•œ ๊ถ๊ธˆ์ฆ
\quadcontext๋Š” ์–ด๋–ป๊ฒŒ ๋ถ™์—ฌ์ฃผ๋Š”๊ฐ€? ์‚ฌ์šฉ์ž๊ฐ€ ์ผ์ผ์ด ์ž…๋ ฅํ•˜์ง€๋Š” ์•Š์„ ๊ฒƒ ๊ฐ™์€๋ฐ

\quadโ†’ fine-tuning : SQuAD ๋“ฑ์˜ ๋ฐ์ดํ„ฐ์…‹ ํ™œ์šฉ
\quadโ†’ ์‹ค์ „ : RAG ๋ชจ๋ธ์ด context ํ›„๋ณด๋ฅผ ์ฐพ์•„์˜ด
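A sketch of the traversal-style transformation for QA. The scheme follows the paper's randomly initialized start, delimiter, and extract tokens; the string forms `<s>`, `$`, `<e>` and the example tokens are illustrative:

```python
# Special tokens: the paper adds start, delimiter, and extract tokens
# whose embeddings are learned during fine-tuning.
START, DELIM, EXTRACT = "<s>", "$", "<e>"

def qa_input(context_tokens, question_tokens):
    """Join structured (context, question) input into one contiguous sequence."""
    return [START] + context_tokens + [DELIM] + question_tokens + [EXTRACT]

seq = qa_input(["the", "dad", "orders", "chicken", "on", "Yogiyo"],
               ["how", "is", "the", "chicken", "ordered", "?"])
```

Because the task structure is encoded purely in the input sequence, the pre-trained Transformer itself needs no architectural changes across tasks.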




🤔 My Thoughts

  • The more new models I study, the more new questions come up that never occurred to me while studying earlier models, and working through them was fun.
  • Both the Transformer and GPT-1 describe borrowing only specific pieces from prior work — it's remarkable how people arrive at these ideas.