[Review] Language Models are Unsupervised Multitask Learners

YSL · July 31, 2023

Following up on GPT-1 last time, this post covers the GPT-2 paper.
๐Ÿ“ Language Models are Unsupervised Multitask Learners

โ—๏ธ๊ฐœ๋…์„ ์ •๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ž‘์„ฑํ•œ ๊ธ€๋กœ, ๋‚ด์šฉ์ƒ ์ž˜๋ชป๋œ ๋ถ€๋ถ„์ด ์žˆ์„ ์ˆ˜ ์žˆ๋‹ค๋Š” ์  ์ฐธ๊ณ  ๋ฐ”๋ž๋‹ˆ๋‹ค.

The paper itself is organized as follows:

  • Introduction
  • Approach
    • Training Dataset
    • Input Representation
    • Model
  • Experiments
    • Language Modeling
    • Children's Book Test
    • LAMBADA
    • Winograd Schema Challenge
    • Reading Comprehension

์ด ๊ธ€์€ ๋…ผ๋ฌธ ์ˆœ์„œ๋ฅผ ๊ทธ๋Œ€๋กœ ๋”ฐ๋ผ๊ฐ€๊ธฐ๋ณด๋‹ค๋Š” ๋‚ด๊ฐ€ ๊ณต๋ถ€ํ•  ๋•Œ ์ดํ•ดํ•˜๊ธฐ ํŽธํ–ˆ๋˜ ํ๋ฆ„๋Œ€๋กœ ์ž‘์„ฑํ•˜๋ ค๊ณ  ํ•œ๋‹ค.


Introduction

GPT-1์„ ํฌํ•จํ•˜์—ฌ ๊ธฐ์กด ์—ฐ๊ตฌ๋“ค์€ unsupervised pre-training๊ณผ supervised fine-tuning 2๋‹จ๊ณ„๋ฅผ ๊ฑฐ์ณ ๋ชจ๋ธ์„ ํ•™์Šต์‹œ์ผฐ๋‹ค. ์ด ๋ฐฉ๋ฒ•์€ ํŠน์ • task์—์„œ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€๋งŒ 'narrow expert'๋ผ๋Š” ์ , ๋ฐ์ดํ„ฐ์˜ ๋ถ„ํฌ๊ฐ€ ์กฐ๊ธˆ์ด๋ผ๋„ ๋ฐ”๋€Œ๋ฉด ๊ฒฐ๊ณผ๊ฐ€ ๋ถˆ์•ˆ์ •ํ•ด์ง€๋Š” ์ ์—์„œ ํ•œ๊ณ„๋ฅผ ๋ณด์˜€๋‹ค.

๋”ฐ๋ผ์„œ GPT-2์—์„œ๋Š” ๋” ๋ฒ”์šฉ์ ์ธ Language Model(LM)์„ ๋งŒ๋“ค๊ณ ์ž ํ•˜์˜€๋‹ค. ์ด๋ฅผ ์œ„ํ•ด fine-tuning ๊ณผ์ •์„ ์—†์• ๊ณ  pre-train๋œ ๋ชจ๋ธ์„ ๋ฐ”๋กœ task์— ์ ์šฉํ•˜๋Š” zero-shot ๋ฐฉ์‹์„ ์ฑ„ํƒํ•˜์˜€๋‹ค.

As a result, the model's generality could be improved regardless of the task, without any supervised training step.

zero-shot

⇒ The model is given not only the input data but also a description of the task. Without updating its parameters or architecture, the model produces output that matches the described task.

Approach

์–ธ์–ด ๋ชจ๋ธ์€ ์•„๋ž˜์™€ ๊ฐ™์€ ์ˆ˜์‹์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ณ ์•ˆ๋˜์—ˆ๋‹ค.

$p(x) = \prod_{i=1}^{n} p(s_i \mid s_1, \ldots, s_{i-1})$

⇔ Based on the tokens given so far, the model estimates the conditional distribution of the next token, and repeats this process to generate a sentence.
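In log space the product above is simply a sum of next-token log-probabilities. Below is a minimal sketch of that factorization, assuming a hypothetical causal LM callable that maps a (1, t) tensor of token ids to (1, t, vocab) next-token logits; it is an illustration of the formula, not GPT-2's code.

```python
import torch
import torch.nn.functional as F

def sequence_log_prob(model, token_ids):
    """log p(x) = sum_i log p(s_i | s_1, ..., s_{i-1}).

    `model` is an assumed interface: it maps a (1, t) LongTensor of token ids
    to (1, t, vocab_size) next-token logits.
    """
    log_p = 0.0
    with torch.no_grad():
        for i in range(1, len(token_ids)):
            context = torch.tensor(token_ids[:i]).unsqueeze(0)  # s_1 ... s_{i-1}
            logits = model(context)[0, -1]                       # scores for the next token
            log_p += F.log_softmax(logits, dim=-1)[token_ids[i]].item()
    return log_p
```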


๊ธฐ์กด์˜ single task์˜ ๊ฒฝ์šฐ ์ด๋ฏธ task๊ฐ€ ์ •ํ•ด์ ธ ์žˆ๊ณ  ์ด์— ๋งž๋Š” ๋ฐ์ดํ„ฐ๋ฅผ ์ž…๋ ฅํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋ธ์—๋Š” ๊ทธ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋งŒ ์ œ๊ณตํ•ด์ฃผ๋ฉด ๋˜์—ˆ๋‹ค.

$p(\text{output} \mid \text{input})$

In a multi-task setting, however, the model must also be given information about the task in order to estimate the output appropriate to that task:

$p(\text{output} \mid \text{input}, \text{task})$

๋”ฐ๋ผ์„œ GPT-2๋„ ๋ชจ๋ธ์„ ํ•™์Šต์‹œํ‚ฌ ๋•Œ ์–ด๋–ค task๋ฅผ ์ˆ˜ํ–‰ํ•  ์ง€์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์™€ ํ•จ๊ป˜ ์ œ๊ณตํ–ˆ๋‹ค.

Examples:
translation : (translate to french, english text, french text)
reading comprehension : (answer the question, document, question, answer)
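Since the model only ever consumes text, the task specification in these tuples is itself just more text. Here is a minimal sketch of flattening such a tuple into a single prompt; the separators and wording are illustrative assumptions, not the paper's exact format.

```python
def to_prompt(task: str, *fields: str) -> str:
    """Flatten a (task, input, ...) tuple into one plain-text sequence.

    The separators used here are illustrative, not GPT-2's actual training format.
    """
    return task + ": " + " ".join(fields)

# translation: (translate to french, english text, french text)
prompt = to_prompt("translate to french", "the cat sat on the mat.")
# the language model is then asked to continue this prompt with the French text
```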

Training Dataset

๊ธฐ์กด์˜ ์—ฐ๊ตฌ์—์„œ๋Š” news article, text book ๋“ฑ์—์„œ ๋ฐ์ดํ„ฐ์…‹์„ ๊ตฌ์ถ•ํ–ˆ๋Š”๋ฐ,
์ด ๊ฒฝ์šฐ ํŠน์ • ๋„๋ฉ”์ธ์— ๋ฐ์ดํ„ฐ๊ฐ€ ํŽธํ–ฅ๋˜๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์—ˆ๋‹ค.

๋”ฐ๋ผ์„œ GPT-2๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ํ•˜๋‚˜์˜ ๋„๋ฉ”์ธ์— ์น˜์šฐ์น˜์ง€ ์•Š๊ณ  ์ตœ๋Œ€ํ•œ ๋‹ค์–‘ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ชจ์œผ๊ณ ์ž ํ•˜์˜€๋‹ค. Common Crawl๊ณผ ๊ฐ™์€ ์›น์Šคํฌ๋žฉ ๋ฐ์ดํ„ฐ์…‹์ด ์žˆ์—ˆ์ง€๋งŒ data quality ์ด์Šˆ๊ฐ€ ์žˆ์–ด ์ง์ ‘ WebText๋ผ๋Š” ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ์…‹์„ ๊ตฌ์ถ•ํ•˜์˜€๋‹ค.

Input Representation

GPT-2 tokenizes its input with Byte Pair Encoding (BPE), which is effective at mitigating the out-of-vocabulary (OOV) problem.

๊ธฐ์กด์˜ BPE๋Š” base character๊ฐ€ ์œ ๋‹ˆ์ฝ”๋“œ ๋‹จ์œ„์˜€๊ธฐ ๋•Œ๋ฌธ์— ์•ŒํŒŒ๋ฒณ๋งŒ์„ ํฌํ•จํ•˜๋Š” base vocabulary๋งŒ์œผ๋กœ๋„ ํฌ๊ธฐ๊ฐ€ ์ƒ๋‹นํ•ด์ง€๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์—ˆ๋‹ค.

๐Ÿ“ byte์™€ unicode์— ๋Œ€ํ•œ ์ดํ•ด

๋”ฐ๋ผ์„œ base-character๋ฅผ byte ๋‹จ์œ„๋กœ ์ง€์ •ํ•˜๋Š” byte-level BPE๋ฅผ ์„ ํƒํ•˜์˜€๊ณ 
ํ•œ์ •๋œ ์‚ฌ์ด์ฆˆ์˜ vocabulary๋ฅผ ํšจ์œจ์ ์œผ๋กœ ์‚ฌ์šฉํ•˜๊ณ ์ž ๋ฐ˜๋“œ์‹œ ์ผ์ • ์ˆ˜์ค€ ์ด์ƒ์˜ ๋‹จ์œ„๋กœ ๋ณ‘ํ•ฉํ•˜์—ฌ ์ถ”๊ฐ€ํ•˜๋Š” ๊ณผ์ •์„ ํฌํ•จ์‹œ์ผฐ๋‹ค.

Model

(Figure: GPT-2 architecture and model sizes)

GPT-2 uses almost exactly the same architecture as GPT-1; mainly the number of parameters and the overall model size were increased.
The Layer Normalization blocks were moved to the front of each Attention block and FFN block (pre-LN),
which helps keep gradients from vanishing or exploding. In addition, the weights of the residual layers are scaled by $\frac{1}{\sqrt{N}}$ (where $N$ is the number of residual layers), so that gradient vanishing is mitigated even as the network gets deeper.
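A rough PyTorch sketch of these two changes, assuming a standard multi-head attention and a 4x FFN; this illustrates the idea, not OpenAI's released code.

```python
import math
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Sketch of a GPT-2-style block: LayerNorm *before* attention/FFN, and the
    residual-path output projections scaled at init by 1/sqrt(N), where N is the
    number of residual layers (two per block, hence 2 * n_layer)."""

    def __init__(self, d_model: int, n_head: int, n_layer: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        with torch.no_grad():
            for w in (self.attn.out_proj.weight, self.ffn[2].weight):
                w.mul_(1.0 / math.sqrt(2 * n_layer))    # residual scaling at init

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)                                  # pre-LN before attention
        x = x + self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)[0]
        x = x + self.ffn(self.ln2(x))                    # pre-LN before the FFN
        return x
```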

Experiments

Even in this zero-shot setting, GPT-2 matched or outperformed existing SOTA models on a variety of NLP tasks.

ํ•˜์ง€๋งŒ ์ผ๋ถ€ task์—์„œ๋Š” ์ข‹์ง€ ์•Š์€ ์„ฑ๋Šฅ์„ ๋ณด์˜€๋Š”๋ฐ ๊ทธ ์˜ˆ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.

  1. Summarization
    When the 'TL;DR:' hint token was not appended for the task, the generated summaries were even worse than simply picking 3 random sentences from the article.

  2. Translation
    Given example pairs of the form 'english sentence = french sentence' and then prompted with 'english sentence = ', GPT-2 scored a much lower BLEU than dedicated translation models.
    Still, the experiment was meaningful in that the model could perform EN-FR and FR-EN translation at all, despite being trained on WebText, which consists almost entirely of English text.

Generalization vs Memorization

WebText ๋ฐ์ดํ„ฐ์…‹์˜ ํฌ๊ธฐ๊ฐ€ ์›Œ๋‚™ ํฌ๊ธฐ ๋•Œ๋ฌธ์— GPT-2๊ฐ€ ์ด ๋ฐ์ดํ„ฐ๋ฅผ ์™ธ์›Œ์„œ ๋‹ต์„ ๋ฑ‰์–ด๋‚ด๋Š” ๊ฒƒ์ด ์•„๋‹Œ์ง€์— ๋Œ€ํ•œ ์‹คํ—˜์„ ์ง„ํ–‰ํ•˜์˜€๋‹ค.

Using Bloom filters built from 8-grams of the WebText training data,
they measured how much the evaluation data overlaps with WebText. The overlap turned out to be small, and in fact smaller than the overlap those benchmark test sets already have with their own training sets.

From this, the authors argued that GPT-2 is not memorizing the data and spitting answers back out (Memorization), but generating answers through inference (Generalization).

๋˜, ๋ชจ๋ธ์˜ ํฌ๊ธฐ๊ฐ€ ์ฆ๊ฐ€ํ•จ์— ๋”ฐ๋ผ ์„ฑ๋Šฅ์ด ๊พธ์ค€ํžˆ ํ–ฅ์ƒ๋˜๋Š” ๋ชจ์Šต์„ ํ†ตํ•ด Generalzation์ž„์„ ์ถ”๊ฐ€์ ์œผ๋กœ ์–ธ๊ธ‰ํ–ˆ๋‹ค.
(โˆต Memorization ๋ฐฉ์‹์ด๋ผ๋ฉด ๋ชจ๋ธ์˜ ํฌ๊ธฐ๊ฐ€ ์•„๋ฌด๋ฆฌ ์ฆ๊ฐ€ํ•ด๋„ ์ผ์ • ์ˆ˜์ค€ ์ด์ƒ๋ถ€ํ„ฐ๋Š” ์„ฑ๋Šฅ์ด ํ–ฅ์ƒ๋˜์ง€ ์•Š๋Š” ํ˜•ํƒœ๋ฅผ ๋ณด์ผ ๊ฒƒ)

Conclusion

GPT-2 set out to build a general-purpose language model that can perform downstream tasks zero-shot, using only unsupervised pre-training and no fine-tuning.

GPT-1 vs GPT-2

Why does GPT use only a decoder?
The cases when we use encoder-decoder architectures are typically when we are mapping one type of sequence to another type of sequence, e.g. translating French to English or in the case of a chatbot taking a dialogue context and producing a response. In these cases, there are qualitative differences between the inputs and outputs so that it makes sense to use different weights for them.

In the case of GPT-2, which is trained on continuous text such as Wikipedia articles, if we wanted to use an encoder-decoder architecture, we would have to make arbitrary cutoffs to determine which part will be dealt with by the encoder and which part by the decoder. In these cases therefore, it is more common to just use the decoder by itself.

๋ชจ๋ธ์˜ ๊ตฌ์กฐ๋Š” '๋ชจ๋ธ์˜ ๋ชฉ์ '์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์ง„๋‹ค.
์˜ˆ๋ฅผ ๋“ค์–ด Transformer๋Š” '๊ธฐ๊ณ„ ๋ฒˆ์—ญ'์„ ๋ชฉํ‘œ๋กœ ๋งŒ๋“ค์–ด์กŒ๊ณ , ๋”ฐ๋ผ์„œ ์–ธ์–ดA์— ๋Œ€ํ•œ ๋ฌธ๋งฅ์„ ํ•™์Šตํ•˜๋Š” Encoder์™€ ์–ธ์–ดB์— ๋Œ€ํ•œ ์ƒ์„ฑ์„ ํ•™์Šตํ•˜๋Š” Decoder๊ฐ€ ๊ฒฐํ•ฉํ•œ ํ˜•ํƒœ๋ฅผ ๊ฐ€์ง„๋‹ค. Encoder๋Š” ํ•˜๋‚˜์˜ sequence๊ฐ€ ๋“ค์–ด์™”์„ ๋•Œ, '์ „์ฒด๋ฅผ ์ฐธ์กฐ'ํ•˜์—ฌ ์–ด๋–ค ๋‹จ์–ด๊ฐ€ ๋ฌธ์žฅ ๋‚ด ๋‹ค๋ฅธ ์–ด๋–ค ๋‹จ์–ด์™€ ์—ฐ๊ด€์ด ์žˆ๋Š”์ง€๋ฅผ ํŒŒ์•…ํ•œ๋‹ค. Decoder๋Š” (Encoder์—์„œ ์–ป์€ context vecotr)์™€ (์–ธ์–ดB์˜ ์†Œ์Šค ๋ฌธ์žฅ ์ค‘ ์ƒ์„ฑํ•  ํ† ํฐ ์ด์ „์— ๋‚˜ํƒ€๋Š” ํ† ํฐ)์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ด์šฉํ•ด ๋‹ค์Œ์— ์˜ฌ ํ† ํฐ์„ ์˜ˆ์ธกํ•˜๋ฉด์„œ ์–ธ์–ดB์— ๋Œ€ํ•œ ๋ฌธ์žฅ์„ ์ƒ์„ฑํ•œ๋‹ค. ๋˜ ์–ธ์–ดA์™€ ์–ธ์–ดB์— ๋Œ€ํ•ด ๊ฐ€์ค‘์น˜๊ฐ€ ๊ฐ๊ฐ ๋‹ฌ๋ผ์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ์„œ๋กœ ๋‹ค๋ฅธ ๋‘ ๊ตฌ์กฐ๋ฅผ ๊ฒฐํ•ฉํ•˜๋Š” ๋ฐฉ์‹์„ ์„ ํƒํ–ˆ๋‹ค.

GPT, by contrast, was designed for text generation, on the premise that every NLP task can be cast as text generation. When predicting a token, the model must not look at future tokens (that would be cheating), so it is trained to attend only to the previous tokens. Because the attention information about those previous tokens is already accumulated inside the Decoder as it predicts each next token (masking is applied, but the stack effectively plays the encoder's role as well), next-token prediction is possible without a separate Encoder.
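The "attend only to previous tokens" rule is implemented with a causal (lower-triangular) attention mask; here is a tiny illustration.

```python
import torch

seq_len = 5
# position i may attend to positions <= i, never to future positions
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask)
# In masked self-attention, scores at the False positions are set to -inf
# before the softmax, so future tokens receive zero attention weight.
```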

Code

🔧 Reference GitHub implementation
Given a start_token or a context, the model predicts the next token from what it has seen so far and keeps appending the predicted tokens until it has generated a sequence of length `length`. Each predicted token is sampled from the k most probable tokens.

[Python syntax notes]

  • torch.where(condition, x, y)
    : takes values from x where condition is True and from y where it is False.
    In the code above, logits smaller than the k-th largest logit are replaced with $-\infty$ while the rest keep their original values, so that only the top-k logits survive.

  • assert {condition}, "error message"
    : an exception-raising check similar to raise; if the condition does not hold, an AssertionError with the error message is raised.
    In the code above, it enforces that exactly one of start_token and context is provided:
    if both are given, or neither is given,
    "Specify exactly one of start_token and context!" is raised.

  • torch.full(size, value)
    : returns a tensor of the given size filled with value.
    In the code above, start_token is replicated into a (batch_size, 1) tensor so that every sample in the batch is initialized to start_token before predicting the next token.

  • torch.no_grad()
    : disables gradient tracking, avoiding unnecessary gradient computation during the forward pass.

  • torch.multinomial(input, num_samples)
    : returns a tensor of num_samples indices sampled according to the probability weights in the input tensor.
    For example, torch.multinomial(torch.tensor([0.2, 0.8]), 1) picks the first word with probability 20% and the second with probability 80%.
    In the code above, it samples the next token according to the probability distribution over the top-k logits.

  • temperature
    (see also the blog post [NLP] Temperature)
    : rescales the logits to flatten or sharpen the sampling distribution.
    It is a hyperparameter the user can set freely; a short numeric demo follows this list.
    • temperature > 1: shrinks the logits (flatter distribution), so the model makes more diverse choices
    • temperature < 1: enlarges the logits (sharper distribution), so the model makes more consistent choices
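A quick numeric check of the temperature bullet above; the logit values are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.1])
for t in (0.5, 1.0, 2.0):
    print(t, F.softmax(logits / t, dim=-1))
# t = 0.5 -> sharper distribution (more consistent choices)
# t = 2.0 -> flatter distribution (more diverse choices)
```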

References

๐Ÿ“ The Illustrated GPT-2 (Visualizing Transformer Language Models)
๐Ÿ“ [๋ฒˆ์—ญ] ๊ทธ๋ฆผ์œผ๋กœ ์„ค๋ช…ํ•˜๋Š” GPT-2 (Transformer Language Model ์‹œ๊ฐํ™”)
๐Ÿ“ Tokenization algorithms in Natural Language Processing (NLP)
๐Ÿ“ Too long, didnโ€™t read: AI for Text Summarization and Generation of tldrs
๐Ÿ“ Byte pair encoding ์„ค๋ช… (BPE tokenizer, BPE ์„ค๋ช…, BPE ์˜ˆ์‹œ)
๐Ÿ“ Language Models are Unsupervised Multitask Learners (GPT-2)
Written on May 29th, 2021 by taekyoon.choi

๐Ÿ“ Step-by-Step Illustrated Explanations of Transformer
๐Ÿ“ Decoder-only Transformer model
๐Ÿ“ N_2. GPT-2 from scratch - Model Only
๐Ÿ“ Language Models: GPT and GPT-2
๐Ÿ“ GPT-1๋ถ€ํ„ฐ ChatGPT๊นŒ์ง€โ€ฆ ๊ทธ๋ฆฌ๊ณ  GPT-4์— ๋Œ€ํ•œ ์ „๋ง
๐Ÿ“ ํ† ํฌ๋‚˜์ด์ € ์ •๋ฆฌ(BPE,WordPiece,SentencePiece)
๐Ÿ“ Text generation with GPT-2

1๊ฐœ์˜ ๋Œ“๊ธ€

comment-user-thumbnail
2023๋…„ 7์›” 31์ผ

์œ ์ตํ•œ ๊ธ€์ด์—ˆ์Šต๋‹ˆ๋‹ค.

๋‹ต๊ธ€ ๋‹ฌ๊ธฐ