GPT-1: Improving Language Understanding by Generative Pre-Training (2018)

J. · July 23, 2025

Text & Speech Papers


📌 Improving Language Understanding by Generative Pre-Training (Radford et al., 2018)
📌 Original Paper: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf

How It Differs from BERT

| Item | GPT-1 | BERT |
|---|---|---|
| 📚 Paper | Improving Language Understanding by Generative Pre-Training (Radford et al., 2018) | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Devlin et al., 2018) |
| 🏗 Transformer architecture | Decoder-only (GPT-style, left-to-right) | Encoder-only (fully bidirectional) |
| 🔄 Pre-training direction | Left-to-right (autoregressive) | Bidirectional (Masked Language Model) |
| 🎯 Pre-training objective | LM objective (next-token prediction) | Masked LM (MLM) + Next Sentence Prediction (NSP) |
| 🛠 Fine-tuning approach | Add a simple classification head, then fine-tune | Add a linear head or partially modify the structure per task |
| 🔍 Input structure | Single sentences (generation-centric) | Sentence pairs also supported ([CLS] sentence1 [SEP] sentence2) |
| 🧠 Nature of objective | Generative (generation-based language model) | Discriminative (understanding-based language model) |
| 🧪 Zero-shot | Partially possible (varies by task; weak on reasoning-based tasks) | Not possible (BERT requires fine-tuning) |
| ⛳️ Main use | Text generation; exploring zero/few-shot potential | Understanding-centric NLP tasks: sentence classification, relation judgment, QA, etc. |

Introduction

📌 Limitations of Previous Approaches

  • ์•ผ์ƒ์—์„œ ์ˆ˜์ง‘ํ•œ ์›์‹œ ํ…์ŠคํŠธ(raw text)๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ํ•™์Šตํ•˜๋Š” ๋Šฅ๋ ฅ์€ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ(NLP)์—์„œ ๋งค์šฐ ์ค‘์š”ํ•˜๋‹ค.

  • ํ•˜์ง€๋งŒ ๋ผ๋ฒจ์ด ์žˆ๋Š” ๋ฐ์ดํ„ฐ(labeled data)๋Š” ๋ถ€์กฑํ•˜๊ธฐ ๋•Œ๋ฌธ์—, ๋น„๋ผ๋ฒจ ๋ฐ์ดํ„ฐ(unlabeled data)๋ฅผ ํ™œ์šฉํ•ด ์ง€๋„ ํ•™์Šต์— ๋Œ€ํ•œ ์˜์กด๋„๋ฅผ ์ค„์ผ ํ•„์š”๊ฐ€ ์žˆ๋‹ค.

This drew attention to approaches that use unlabeled data for pre-training.

However, existing unsupervised pre-training methods had limitations:

  • ๊ธฐ์กด ๋ฐฉ์‹์€ ์ฃผ๋กœ ๋‹จ์–ด ์ˆ˜์ค€์˜ ํ†ต๊ณ„ ์ •๋ณด(word-level statistics)๋งŒ ํ•™์Šตํ•˜๋ฉฐ, ๋ฌธ์žฅ์ด๋‚˜ ๋ฌธ๋งฅ ์ˆ˜์ค€์˜ ํ‘œํ˜„๋ ฅ์„ ์ถฉ๋ถ„ํžˆ ํ•™์Šตํ•˜์ง€ ๋ชปํ–ˆ๋‹ค.

  • ๋‹จ์ˆœํžˆ unlabeled data๋งŒ์œผ๋กœ ํ•™์Šตํ•  ๊ฒฝ์šฐ, ์–ด๋–ค ๋ชฉ์ ํ•จ์ˆ˜(objective function)๊ฐ€ downstream task์— ํšจ๊ณผ์ ์ธ์ง€ ์•Œ๊ธฐ ์–ด๋ ต๊ณ , pretrained model์„ fine-tuningํ•  ๋•Œ task-specific ํ•™์Šต ์ „๋žต์ด ๋ช…ํ™•ํ•˜์ง€ ์•Š์•„ ์„ฑ๋Šฅ์ด ๋ถˆ์•ˆ์ •ํ–ˆ๋‹ค.

  • ๋˜ํ•œ, ๊ธฐ์กด ๋ฐฉ์‹์€ task๋งˆ๋‹ค ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ๋ณ€๊ฒฝํ•˜๊ฑฐ๋‚˜ ๋ณต์žกํ•˜๊ฒŒ ์„ค๊ณ„ํ•ด์•ผ ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋งŽ์•„, semi-supervised learning ํ™˜๊ฒฝ์—์„œ๋Š” ํ™•์žฅ์„ฑ์— ์–ด๋ ค์›€์ด ์žˆ์—ˆ๋‹ค.

🎯 GPT-1's Solution: Universal Pre-training + Fine-tuning

"Pre-learn universal language representations applicable to a variety of NLP tasks, and perform task-specific fine-tuning with only minimal structural changes."

To this end, the paper proposes a two-stage training procedure:

A. Pre-training:

Learn general-purpose language representations from large-scale unlabeled text via a language modeling objective (next-token prediction).
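Concretely, the paper's language-modeling objective maximizes the left-to-right log-likelihood over an unlabeled token corpus $\mathcal{U} = \{u_1, \dots, u_n\}$ with context window size $k$:

$$
L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \Theta)
$$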

B. Fine-tuning:

๊ฐ downstream task์— ๋งž๊ฒŒ ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์„ ์ง€๋„ ํ•™์Šต ๋ฐ์ดํ„ฐ(labeled data)๋กœ ๋ฏธ์„ธ ์กฐ์ •ํ•œ๋‹ค.

This approach transfers easily to a wide range of tasks without changes to the model architecture and substantially improves transfer-learning accuracy.

GPT-1

Architecture


Transformer ์˜ ๋””์ฝ”๋”๋งŒ ๊ฐ€์ ธ์™€ ์‚ฌ์šฉํ•จ. ์ธ์ฝ”๋” ์‚ฌ์šฉ ์•ˆํ•˜๋‹ˆ๊นŒ cross self attention ๋ถ€๋ถ„์€ ์—†์–ด์ง€๋Š” ๊ฒƒ
๋””์ฝ”๋”๋งŒ ์‚ฌ์šฉํ•˜๋Š” ์ด์œ :

  1. ๋ชจ๋ธ์˜ ํ•™์Šต ๋ชฉํ‘œ๊ฐ€ โ€œ๋‹ค์Œ ๋‹จ์–ด ์˜ˆ์ธกโ€ (Autoregressive Language Modeling)์ด๊ธฐ ๋•Œ๋ฌธ.

  2. ๊ฐ„๊ฒฐ์„ฑ์œผ๋กœ ์—ฐ์‚ฐ๋Ÿ‰์ด ์ค„์–ด๋“ ๋‹ค.
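A minimal PyTorch sketch of one such block, assuming GPT-1-scale defaults (768-dim states, 12 heads; the class and argument names are illustrative, not from the paper). Note that the only attention sublayer is masked self-attention; there is no cross-attention:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One GPT-style decoder block: masked self-attention + FFN, no cross-attention."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, h):                      # h: (batch, seq_len, d_model)
        T = h.size(1)
        # Causal mask: True marks positions a query may NOT attend to,
        # so position i only sees positions <= i.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=h.device), 1)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        h = self.ln1(h + attn_out)             # residual connection + LayerNorm
        return self.ln2(h + self.ffn(h))
```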

Unsupervised Pre-Training



๋ฐ์ดํ„ฐ๋ฅผ ๋ฌธ์žฅ๋ณ„๋กœ ๋‚˜๋ˆ” -> ๊ฐ ๋ฌธ์žฅ์„ ์ž…๋ ฅํ•ด์ฃผ๋Š” ๊ตฌ์กฐ.
ํ•œ ๋ฌธ์žฅ๋„ ๋‚˜๋ˆ ์„œ ๋„ฃ๊ณ  ๋‹ค์Œ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ์ง„ํ–‰.

Effects of pretraining:
1. Learns linguistic structure
2. Improves contextual understanding
3. Learns diverse language patterns
4. Effective at building general language-understanding ability for transfer learning

  • h0๋Š” "๋‹จ์–ด ์˜๋ฏธ + ์œ„์น˜ ์ •๋ณด"
  • hlโˆ’1: ์ด์ „ ๋ ˆ์ด์–ด์˜ ์ถœ๋ ฅ
    transformer_block: GPT ๋””์ฝ”๋” ๋ธ”๋ก (Masked Multi-head Attention + FFN)
    ์ด n๊ฐœ์˜ Transformer ๋ธ”๋ก์„ ํ†ต๊ณผํ•˜๋ฉฐ,
    ๊ฐ ๋ธ”๋ก์€ ๋ฌธ๋งฅ ์ •๋ณด๋ฅผ ์ ์  ๋” ํ’๋ถ€ํ•˜๊ฒŒ ๋งŒ๋“ค์–ด ์คŒ

๋งˆ์Šคํ‚น๋œ self-attention์„ ์‚ฌ์šฉํ•˜๋ฏ€๋กœ ํ˜„์žฌ ์‹œ์ ๊นŒ์ง€์˜ ๋‹จ์–ด ์ •๋ณด๋งŒ ๋ฐ˜์˜

  • ๋งˆ์ง€๋ง‰ Transformer block ์ถœ๋ ฅ: hnh_n
  • ์ž„๋ฒ ๋”ฉ ํ–‰๋ ฌ ์ „์น˜: WeTW_e^T

์•„๋ž˜ ๊ณผ์ •์œผ๋กœ ์ตœ์ข… ์ถœ๋ ฅ ํ˜•์„ฑ
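A compact sketch of this forward pass, reusing the DecoderBlock above (tok_emb, pos_emb, and blocks are illustrative names; tying the output projection to the embedding matrix implements the $W_e^{T}$ step):

```python
import torch
import torch.nn as nn

class GPT1LM(nn.Module):
    def __init__(self, vocab_size, max_len=512, d_model=768, n_layers=12):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # W_e
        self.pos_emb = nn.Embedding(max_len, d_model)      # W_p (learned positions)
        self.blocks = nn.ModuleList(DecoderBlock(d_model) for _ in range(n_layers))

    def forward(self, idx):                                # idx: (batch, T) token ids
        pos = torch.arange(idx.size(1), device=idx.device)
        h = self.tok_emb(idx) + self.pos_emb(pos)          # h_0 = U W_e + W_p
        for block in self.blocks:
            h = block(h)                                   # h_l = transformer_block(h_{l-1})
        return h @ self.tok_emb.weight.T                   # logits; softmax gives P(u)
```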

Supervised fine-tuning

Given a labeled dataset C, the loss function trains the model parameters θ so that the label y is predicted from the input x as accurately as possible.
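In the paper's formulation, the final block's activation $h_l^m$ for input tokens $x^1, \dots, x^m$ is fed into an added linear output layer with parameters $W_y$, and the objective maximizes:

$$
P(y \mid x^1, \dots, x^m) = \text{softmax}(h_l^m W_y)
$$
$$
L_2(\mathcal{C}) = \sum_{(x,\, y)} \log P(y \mid x^1, \dots, x^m)
$$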

Furthermore, the paper reports that adding the language-modeling loss $L_1$ to the fine-tuning loss $L_2$ as an auxiliary objective, i.e. $L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \cdot L_1(\mathcal{C})$, improved generalization of the supervised model and accelerated convergence.
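A sketch of that combined loss (the paper uses $\lambda = 0.5$; the function and argument names here are illustrative):

```python
import torch.nn.functional as F

def finetune_loss(cls_logits, lm_logits, label, tokens, lam=0.5):
    """L3 = L2 (task loss) + lambda * L1 (auxiliary language-modeling loss)."""
    # L2: cross-entropy on the task label == -log P(y | x^1 ... x^m)
    task_loss = F.cross_entropy(cls_logits, label)
    # L1: next-token prediction over the same input sequence
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),  # predictions for u_1..u_{T-1}
        tokens[:, 1:].reshape(-1),                          # shifted targets u_2..u_T
    )
    return task_loss + lam * lm_loss
```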

Task Specific Input Transformations

Classification

A text-classification task (e.g., spam mail detection): the entire input text is fed to the model and classified.

Textual Entailment

Does one sentence semantically entail another?
The premise and hypothesis sentences are both given as input, separated by a delimiter token in the middle (see the sketch below).
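A sketch of the input transformation (the paper brackets inputs with randomly initialized start and extract tokens and joins sentence pairs with a delimiter token; the literal token strings below are placeholders):

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"  # placeholder special tokens

def entailment_input(premise_tokens, hypothesis_tokens):
    # [start] premise [delim] hypothesis [extract]
    return [START] + premise_tokens + [DELIM] + hypothesis_tokens + [EXTRACT]

def classification_input(text_tokens):
    # [start] text [extract] -- a single text needs no delimiter
    return [START] + text_tokens + [EXTRACT]
```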

Similarity

Judges how similar two sentences are.
The model takes two sentences as input and produces a final similarity score between 0 and 1.

text1 + text2 / text2 + text1
Both orderings are processed; the two resulting representations are added element-wise, then passed through a linear layer and an activation to produce the final similarity score, as sketched below.
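A sketch of this order-independent similarity head (the sigmoid for a 0-1 score follows the description above and is an assumption; names are illustrative):

```python
import torch
import torch.nn as nn

class SimilarityHead(nn.Module):
    """Score both orderings, add the representations, then map to [0, 1]."""
    def __init__(self, d_model=768):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)

    def forward(self, h_ab, h_ba):
        # h_ab / h_ba: final-layer state at the extract token for
        # [start] a [delim] b [extract] and [start] b [delim] a [extract]
        return torch.sigmoid(self.linear(h_ab + h_ba)).squeeze(-1)
```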

Question Answering & Commonsense Reasoning

Each example consists of a context and candidate answers.
A separate context-answer pair is built for each candidate, each pair is fed through the Transformer, and the pair with the highest probability is selected as the output (see the sketch below).

Experiments

Natural Language Inference

Outperforms prior methods on every dataset.

Question Answering & Commonsense Reasoning


Again, the best performance across all tasks.

Semantic Similarity & Classification


Strong performance on all but a few benchmarks.

Analysis

Impact of Number of Layers Transferred & Zero Shot Behaviors


The left graph analyzes the performance difference when varying how many pretrained layers are transferred for fine-tuning.

Across the 0-12 range, transferring more layers for fine-tuning yields better performance on both RACE (question answering) and MultiNLI (sentence-relation judgment).

This demonstrates the effectiveness of GPT's fine-tuning.

The right graph shows that, across various tasks, zero-shot performance (i.e., how well the model produces correct answers without any fine-tuning) improves as the number of pre-training updates increases.

This demonstrates the effectiveness of GPT's pre-training.

Summary

✅ 1. Introduced the pretraining + fine-tuning paradigm
✅ 2. Used only the Transformer decoder
✅ 3. Showed the potential of few-shot/zero-shot learning
✅ 4. Proposed a general-purpose language model (BERT shares this point, though it is encoder-based) → no architecture changes needed per task

<Terms>

✅ Representation: a numeric vector (or tensor) that encodes a linguistic unit (a word, sentence, document, etc.) in a form a computer can understand and process.
