LLM : GPT3

Jiyun · July 1, 2024

Introduction to Artificial Intelligence


Shifting Paradigms in NLP

Pre-training ➔ Fine-Tuning

With BERT, zero-shot did not work well

Limitations of Pre-training ➔Fine-Tuning

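[Efficiency] Practical Issues
- Fine-tuning requires building a large task-specific dataset for every task
- Collect data for Task A → fine-tune a model on Task A → do the same for Task B → do the same for Task C → repeat indefinitely… (cost ↑)
- End up with many “copies” of the same model
→ Ideally the model would handle new tasks on its own, without fine-tuning
[Effectiveness] Avoiding spurious correlations (overfitting)
- Large models fine-tuned on very narrow task distributions
- The model merely overfits the training distribution and does not work properly on out-of-distribution samples
- In other words, there is a high chance of training a model that does not generalize
- Even with a high benchmark score, the suspicion remains that the model solved that dataset, not the task
→ A smart, general model is needed
Humans do not need supervised learning with large amounts of data
- Humans can learn from simple directives
- Humans easily combine several skills/tasks or switch between them
- The research community's ultimate goal is such a human-like agent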

Addressing These Limitations

Scaling up
In-context Learning

Scaling up

Increasing the number of model parameters

LM Landscape pre GPT-3

GPT-3 is about 100 times the size of GPT-2!

Why Scale?
- Empirical studies by OpenAI → discovery of Scaling Laws
  - Performance depends strongly on scale
    The model's shape (architecture) matters comparatively little
  - Smooth power laws ( $y=ax^k$ ) between empirical performance and N (parameters), D (dataset size), C (compute)
  - Transfer improves with test performance
  - Larger models are more sample efficient
Bigger is Better!
- A larger model learns more from the same amount of data
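
A power law $y=ax^k$ is a straight line in log-log space, which is how such a trend can be fitted. The following is a minimal sketch on synthetic numbers (the values of `params` and `losses` are made up for illustration, not taken from the scaling-laws experiments):

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**k in log-log space; returns (a, k)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(lx) / n, sum(ly) / n
    # Slope of the regression line log(y) = log(a) + k * log(x)
    k = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum((u - mx) ** 2 for u in lx)
    a = math.exp(my - k * mx)
    return a, k

# Synthetic data generated from a = 10, k = -0.05 (illustrative only)
params = [1e6, 1e7, 1e8, 1e9]
losses = [10 * n ** -0.05 for n in params]
a, k = fit_power_law(params, losses)
print(f"a = {a:.3f}, k = {k:.3f}")
```

A negative exponent `k` means the loss keeps falling smoothly as the model grows, which is the "bigger is better" reading of the curve.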

In-context Learning

In-context learning is meta-learning
- Learning how to learn
  - The model develops pattern-recognition abilities during training and uses them at test time
  - In-context learning → giving a pre-trained language model a descriptive statement of the task as its input
  - Already tried with GPT-2 (Radford et al., 2019):
    - 4% on Natural Questions
    - 55 F1 on CoQA, 35 points below the SOTA
    - Overall, performance was not impressive
  - Its potential was confirmed with very large language models

What to Pick?

(↑Stronger, task-specific performance)
(↓More convenient, general, less data)

Fine-tuning (FT)
- Strong performance (at least on paper)
- Training data must be collected/built (typically 1K~100K+ examples)
- Poor generalization; training is contaminated by spurious correlations
Few-shot (FS)
- Far less dependent on training data
- Lower risk of spurious correlations (though still present)
- Hard for the model
One-shot (1S)
- Most natural (e.g., giving humans instructions)
- Hard for the model
Zero-shot (0S)
- Most convenient
- Hard for the model
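
The three prompting settings differ only in how many demonstrations are placed in the context. A minimal sketch, assuming a simple `Q:`/`A:` template (the template, the helper name `build_prompt`, and the demonstration pairs are illustrative choices, not a format prescribed by the post):

```python
def build_prompt(instruction, demos, query, sep="\n\n"):
    """Assemble an in-context prompt: optional instruction, k demonstrations, then the query."""
    parts = [instruction] if instruction else []
    parts += [f"Q: {q}\nA: {a}" for q, a in demos]
    parts.append(f"Q: {query}\nA:")  # the model is expected to continue after "A:"
    return sep.join(parts)

# Hypothetical demonstration pairs
demos = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
instruction = "Translate English to French."

zero_shot = build_prompt(instruction, [], "peppermint")        # no demonstrations
one_shot = build_prompt(instruction, demos[:1], "peppermint")  # one demonstration
few_shot = build_prompt(instruction, demos, "peppermint")      # several demonstrations
print(few_shot)
```

No gradient step happens in any of the three cases; only the text given to the model changes.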

The Prompting Zoo

CoQA : few-shot
WSC : Instruction prompting

Larger Models Learn Better In-Context

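Performance improves as model size grows
The 1.3B and 13B models do not understand the prompt well, so their gains are small
Very large language models can understand instructions, so prompting is effective
With many examples, the examples alone yield high performance even without an instruction
- As the number of examples grows, the performance gap between having and not having a prompt disappears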

In-context Learning vs. Earlier Adaptation

Adaptation: similar to fine-tuning (adaptation is the slightly broader concept)

In-context learning is the process of learning diverse skills and subtasks during pre-training that can subsequently be leveraged by prompting the model at inference time with natural language instructions and/or demonstrations (“shots”)
- From the instructions and demonstrations, the model works out which problem it is being asked to solve
Unlike fine-tuning, the model is trained only once for all downstream tasks
Weights are frozen, NOT trained! ⭐
- Pre-training and fine-tuning update the weights during training. In in-context learning, the model instead picks up the task from the context in the prompt and generates an answer, with no weight update

GPT-2 → GPT-3

The architecture is not different; the model is simply far larger

More layers, more parameters
More training data
Longer training
Larger embedding size
Larger context window (max number of tokens) → makes few-shot possible

GPT-3 is MASSIVE!

96 decoder blocks (2x GPT-2)
2048-token context size (2x GPT-2)
12288-dimensional embeddings (~8x GPT-2)
175B (175 billion) parameters (~117x GPT-2)
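
The headline parameter count can be roughly recovered from the layer count and embedding size. The sketch below uses the common approximation of about 12 · n_layer · d_model² weights for the Transformer blocks plus the embedding tables. The approximation itself, the vocabulary size of 50257, the context lengths of 2048 and 1024, and GPT-2's 48 layers and 1600 dimensions are assumptions brought in for illustration and are not stated in the post:

```python
def approx_params(n_layer, d_model, vocab_size=50257, n_ctx=2048):
    """Rough decoder-only Transformer parameter count (biases and layer norms ignored)."""
    blocks = 12 * n_layer * d_model ** 2         # ~4*d^2 for attention + ~8*d^2 for the MLP, per block
    embeddings = (vocab_size + n_ctx) * d_model  # token + position embedding tables
    return blocks + embeddings

gpt3 = approx_params(n_layer=96, d_model=12288)
gpt2 = approx_params(n_layer=48, d_model=1600, n_ctx=1024)
print(f"GPT-3 ~ {gpt3 / 1e9:.1f}B parameters, GPT-2 ~ {gpt2 / 1e9:.2f}B parameters")
```

The estimate lands close to 175B for GPT-3 and close to 1.5B for GPT-2, which is consistent with the "same architecture, just bigger" point above.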

All models were trained on 300B tokens
The results follow a power law
“GPT-3” → GPT-3 175B

Datasets

A cleaned version of Common Crawl (web-crawled data) is used
- Filtered based on similarity to well-known corpora (45TB → 570GB)
- Fuzzy deduplication at the document level: duplicate documents are removed
- Augmented with well-known corpora to increase diversity
- Epoch: the number of times every example in the dataset has passed through the model, i.e., the number of full passes over the training set
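
To make "fuzzy deduplication at the document level" concrete, here is a small sketch that drops near-duplicate documents using Jaccard similarity over word 3-grams. This is a simple stand-in for illustration: the shingle size, the 0.8 threshold, and the quadratic comparison loop are assumptions, and a web-scale pipeline would need an approximate method instead:

```python
def shingles(text, n=3):
    """Set of word n-grams in a document."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 if both are empty)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def fuzzy_dedup(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any document kept so far."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog today",
    "the quick brown fox jumps over the lazy dog today !",  # near-duplicate of the first
    "scaling laws relate loss to model size",
]
unique_docs = fuzzy_dedup(docs)
print(len(unique_docs))
```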

Even OpenAI makes mistakes

The goal was to cleanse the training data
The aim was to keep the dev sets (= validation sets) of all downstream task datasets out of the training data
Bug in the code → some dev sets were included in training ☹️
This was discovered only later

Results

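Benchmark: in computing, the act of running a computer program, typically a suite of standard tests and trials, against a particular object (hardware, software, etc.) in order to measure its relative performance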

All 8 GPT-3 models → evaluated on datasets across 9 categories:

Traditional LM-based tasks
Closed-book QA: question answering without a supporting document (tests how well knowledge is memorized)
Translation: machine translation
Winograd-Schema: a kind of commonsense reasoning
Commonsense Reasoning and Question Answering
Reading Comprehension: reading a given document well
SuperGLUE
Natural Language Inference
Additional tasks to probe “in-context learning”

1_1. Language Modelling

Language Modelling (Metric: Perplexity)

Zero-shot perplexity on Penn Tree Bank (PTB, a text corpus)
PTB → only compatible w/ zero-shot setting
PTB → built from 2,499 stories from the WSJ (Wall Street Journal)
WSJ: predates the modern internet → not in training corpora
New SOTA on PTB by 15 points with a perplexity of 20.5
SOTA: State Of The Art (the best result to date)
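
Perplexity is the exponential of the average negative log-likelihood per token, so lower is better. A minimal sketch, assuming the per-token natural-log probabilities are already available from some language model (the input values below are made up):

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-probability per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 1/20.5 to every token has perplexity 20.5:
# it is as uncertain as a uniform choice among 20.5 options.
uniform = [math.log(1 / 20.5)] * 4
ppl = perplexity(uniform)
print(round(ppl, 1))
```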

1_2. Cloze & Completion (fill in the blank)

LAMBADA (Metric: Accuracy)

LAnguage Modeling Broadened to Account for Discourse Aspects
Predict last word after context → long range dependencies
Task framed as a cloze-test, eg:
Alice was friends with Bob. Alice went to visit her friend ____.
➔ Bob
George bought some baseball equipment, a ball, a glove, and a ____. ➔ ?

GPT-3 achieves accuracy of 86.4% in few-shot setting
18% increase from previous SOTA

HellaSwag (Metric: Accuracy)

Pick best ending to story / set of instructions (multiple choice)
GPT-3: 78.9% (0-shot) | 78.1% (1-shot) | 79.3% (few-shot)
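
One common way to answer a multiple-choice task with a plain language model is to score each candidate ending by its likelihood given the context and pick the best one. The sketch below assumes length-normalized scoring and uses a hypothetical `token_log_probs(context, ending)` callable in place of a real model; `toy_log_probs` is a made-up stand-in so the example runs, not an actual scorer:

```python
def pick_ending(context, endings, token_log_probs):
    """Return the index of the ending with the highest average per-token log-probability."""
    def score(ending):
        lps = token_log_probs(context, ending)
        return sum(lps) / len(lps)  # length normalization so long endings are not penalized
    return max(range(len(endings)), key=lambda i: score(endings[i]))

def toy_log_probs(context, ending):
    """Hypothetical stand-in for a language model: words already seen in the context are 'likelier'."""
    seen = set(context.lower().split())
    return [-1.0 if w in seen else -3.0 for w in ending.lower().split()]

choice = pick_ending(
    "She poured water into the kettle",
    ["the kettle boiled", "the piano exploded loudly"],
    toy_log_probs,
)
print(choice)
```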

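Little difference → in-context learning does not help much here
Still, few-shot is the best of the three, so there is potential
Worse than SOTA (ALUM model : 85.6%)

Missing SOTA does not mean the performance is poor: GPT-3 was not fine-tuned. Reaching 79.3% by applying the model directly, without building any training data, is the impressive part.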

StoryCloze (Metric: Accuracy)

Choose correct ending for a five-sentence story
GPT-3: 83.2% (0-shot) | 84.7% (1-shot) | 87.7% (few-shot)

In-context learning helps here
Worse than SOTA (BERT based model : 91.8%)

The BERT-based model was also fine-tuned.

2. Closed Book QA

LM answers questions w/o conditioning on auxiliary information
(no supporting document is given)
3 Datasets (Metrics: Exact Match + F1)
- Natural Questions (2019)
  - Eg: "how many episodes in season 2 breaking bad?"
- Web Questions (2013)
  - Eg: "Where did Edgar Allan Poe die?"
- TriviaQA (2017)
  - Eg: "Miami Beach in Florida borders which ocean?"
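
Exact Match and F1 can be computed directly from the predicted and gold answer strings. The following is a simplified sketch: the normalization here only lowercases and strips punctuation, which is an assumption for illustration (official evaluation scripts for these datasets typically normalize further, for example by removing articles):

```python
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation, and split into tokens."""
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()

def exact_match(pred, gold):
    """True if the normalized prediction equals the normalized gold answer."""
    return normalize(pred) == normalize(gold)

def f1(pred, gold):
    """Token-level F1 between prediction and gold answer."""
    p, g = normalize(pred), normalize(gold)
    overlap = sum((Counter(p) & Counter(g)).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Atlantic Ocean.", "atlantic ocean"))
print(f1("the Atlantic Ocean", "Atlantic Ocean"))
```

F1 gives partial credit for an answer such as "the Atlantic Ocean", while Exact Match does not.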

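In-context learning helps here
The SOTA systems are fine-tuned models, e.g. with SSM (salient span masking, a pre-training objective aimed at knowledge-intensive QA)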

3. Translation Task

Q2. How does GPT-3 handle non-English data? Why do you think it works on translation?

7% of training data is from other languages
Existing NMT(Neural Machine Translation) frameworks:
pre-training on a pair of monolingual datasets with back-translation (translating the output back in the reverse direction)
GPT-3 learns from a mix of data that is blended in a natural way, combined on a word, sentence, and document level
Metric: BLEU score

Performance rises toward few-shot → a good fit for in-context learning

Few-shot GPT-3 is > unsupervised NMT work by 5 BLEU when translating into English
Not as good as supervised SOTA
Performance improves when model is scaled up
Translation into English > from English
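
BLEU combines clipped n-gram precisions with a brevity penalty. The sketch below is a simplified sentence-level version with a single reference, whitespace tokenization, and no smoothing, all of which are assumptions made for illustration. Reported BLEU scores are computed over a whole corpus and usually scaled to 0-100, so the numbers are not directly comparable:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU in [0, 1] (single reference, no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        clipped = sum((c & r).values())  # candidate n-grams, clipped by reference counts
        total = sum(c.values())
        if clipped == 0 or total == 0:
            return 0.0                   # without smoothing, one empty precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
print(bleu("the cat sat on the mat", reference))
print(bleu("the cat sat on the", reference))
```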

4. Winograd-Style Tasks

Winograd Schemas Challenge (Metric: Accuracy)
- Reading comprehension test
- which word a pronoun refers to
- Eg. The trophy doesn't fit in the brown suitcase because it's too big. What is too big?
  Options: 0 ➔ the trophy | 1 ➔ the suitcase
WinoGrande Schemas Challenge
A harder version of the Turing test: problems that are easy for humans but hard for machines
- Greater scale + hardness

GPT-3 comes close to SOTA for Winograd
GPT-3 much lower than SOTA for WinoGrande
- ➔ 45% of test-set in training, clean subset ➔ -2.6%

5. Common Sense Reasoning

Mostly multiple choice

3 Datasets (Metrics: Accuracy):
1. PIQA (physical commonsense)
  - PhysIcal QA
2. ARC (commonsense)
  - MCQs from 3rd - 9th grade science exams
  - Challenge → questions harder for statistical / info retrieval methods
3. OpenBookQA
  - Modelled after open book exams

New SOTA on PIQA
Much worse than SOTA on ARC and OpenBookQA

Models trained on task-specific data are hard to apply to other problems. GPT is free of this constraint, which is what makes these results meaningful

*➔ 29% of PIQA test-set seen at training, clean subset → -3%

6. Reading Comprehension (machine reading comprehension)

5 Datasets (Metrics: F1, RACE: Accuracy)

CoQA : Conversational QA dataset (2019)
QuAC : QA in Context (2018)
DROP : Discrete Reasoning Over the content of Paragraphs (2019)
RACE : ReAding Comprehension dataset from Examinations (2017)
- Large scale → very long context
SQuADv2 (2018)
GPT-3 is decent on CoQA
Much worse than SOTA on DROP and QuAC, SQuADv2 & RACE

Not good at multi-turn dialogue…

7. SuperGLUE (2020)

GPT-3 few-shot → 32 examples within the context
Performance improves w/ model size & #examples in context

8. Natural Language Inference

Natural Language Inference → model's ability to understand the relationship between two sentences

2 Datasets

RTE Dataset (SuperGLUE) (Metric: Accuracy)
Adversarial NLI
- Difficult dataset, inference done in 3 rounds

Low performance…

RTE ➔ GPT-3 comparable to BERT but far below SOTA (see slide 57)
Adversarial NLI Dataset ➔ GPT-3 is no better than chance in most scenarios

→ GPT is not good at reasoning

Preventing Memorization of Benchmarks

Because of the huge dataset ➔ GPT-3 doesn't overfit on test data it has seen before
Performance drop when seen samples are removed from test set is small

Some scores go up, some stay about the same, and some go down… on average there is no difference
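
The "clean subset" comparison relies on flagging test examples that overlap with the training text. A minimal sketch of that idea using word n-gram overlap; the helper names and the default `n=13` are assumptions for illustration, and the toy call uses `n=3` so the short strings can overlap:

```python
def ngram_set(text, n):
    """Set of word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def split_clean_dirty(test_examples, training_text, n=13):
    """Split test examples into those sharing no n-gram with the training text (clean) and the rest (dirty)."""
    train_ngrams = ngram_set(training_text, n)
    clean, dirty = [], []
    for ex in test_examples:
        (dirty if ngram_set(ex, n) & train_ngrams else clean).append(ex)
    return clean, dirty

training_text = "the trophy does not fit in the brown suitcase"
test_examples = ["does not fit in the box", "a glove and a bat"]
clean, dirty = split_clean_dirty(test_examples, training_text, n=3)
print(clean, dirty)
```

Evaluating once on all examples and once on `clean` only gives the kind of performance difference discussed above.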

GPT-3, the good, the meh, the ugly

GOOD
- LM, Cloze & Completion
- Closed Book QA
- NMT
MEH
- Commonsense Reasoning
- SuperGLUE
UGLY
- Reading Comprehension
- NLI

Common Trends

Bigger is better
→ performance improved across all tasks w/ scaling
More demonstrations = better
→ few-shot > one-shot > zero-shot (usually)
No limit in sight of performance being bottlenecked by model size

Pushing GPT-3 Further

Arithmetic

Handles 2~3-digit numbers well but struggles from 4 digits on
Models with few parameters cannot solve these problems

Observations:
1. Scale is important!
2. >1 operation or >3 digit numbers are much harder
Is GPT-3 Just Memorizing Tables?
→ No! (or at least not directly!)
1. Sampled training data to look for text of the form "[X] +[Y] =" and found matches for only 0.1-0.8% of the correctly answered problems
2. Evidence of making intermediate mistakes such as not carrying
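
The memorization check in item 1 amounts to searching the training text for strings of the form "[X] + [Y] =". A small sketch with a regular expression; the pattern, the helper name `seen_fraction`, and the sample text are illustrative assumptions rather than the original search procedure:

```python
import re

# Matches "<digits> + <digits> =" with optional spaces around the operators
PATTERN = re.compile(r"(\d+)\s*\+\s*(\d+)\s*=")

def seen_fraction(problems, training_text):
    """Fraction of (x, y) addition problems that literally appear in the training text."""
    seen = {(int(x), int(y)) for x, y in PATTERN.findall(training_text)}
    hits = sum((x, y) in seen for x, y in problems)
    return hits / len(problems)

training_text = "we know 12 + 30 = 42 and 7+8=15"
problems = [(12, 30), (99, 1), (7, 8), (5, 5)]
fraction = seen_fraction(problems, training_text)
print(fraction)
```

A very small fraction on real data would suggest that correct answers are not simply copied from memorized tables.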

Word Manipulation

Scrambling words and recovering the original

Cycle letters in word (CL)
rageave ➔ average

Anagrams of all but the first and last k letters (A1, A2 for k=1,2)
aregave ➔ average
avraege ➔ average

Random insertion in word (RI)
a;v'e;r_a g'e ➔ average

Reversed words (RW)
egareva ➔ average

Generated 10,000 examples using most frequent words, 4 <= len <= 15
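
The four scrambling schemes are easy to generate programmatically. The sketch below is one possible implementation: the function names, the symbol set used for random insertion, and the fixed random seed are assumptions for illustration:

```python
import random

def cycle_letters(word, shift):
    """CL: rotate the word left by `shift` letters."""
    return word[shift:] + word[:shift]

def anagram_inner(word, k, rng):
    """A1/A2: shuffle all letters except the first k and last k."""
    middle = list(word[k:len(word) - k])
    rng.shuffle(middle)
    return word[:k] + "".join(middle) + word[len(word) - k:]

def random_insertion(word, rng, symbols=";'_ .!/"):
    """RI: insert one random symbol (punctuation or space) between each pair of letters."""
    return "".join(c + rng.choice(symbols) for c in word[:-1]) + word[-1]

def reverse_word(word):
    """RW: reverse the word."""
    return word[::-1]

rng = random.Random(0)  # fixed seed so runs are repeatable
word = "average"
print(cycle_letters(word, 3))     # rageave
print(anagram_inner(word, 1, rng))
print(anagram_inner(word, 2, rng))
print(random_insertion(word, rng))
print(reverse_word(word))         # egareva
```

In each case the task given to the model is the inverse mapping: recover `average` from the scrambled form.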

Performance increases with more parameters

Qualitative Tasks

News article generation
- 25 random newser.com article titles + subtitles
Humans clearly have a harder time distinguishing
- Accuracy ➔chance (50%), time spent increases

It is very hard to tell whether a human or a machine wrote the text

Limitations

Language models in general
- Simple pre-training objective
  Both a strength (low cost) and a weakness (too simple)
- Lack of grounding → multimodal?
  Text only (video, images, audio… not supported)
  Multimodal: adopted in GPT-4, which can use images + text
- Poor sample efficiency
  Requires far too much training data
GPT-3
- Limited generation (repetitions, contradictions)
  Generates repetitive or contradictory sentences
- Limited "common sense" world model
  Lacks common sense
- Poor one-shot and zero-shot performance (on some reading comprehension and comparison tasks)
  Low-shot performance falls short on these tasks
- No bidirectionality → denoising objective?
  Not bidirectional; one direction only
Performance aside
- Not interpretable
  How the model arrived at an answer cannot be interpreted
- Adaptation vs. recognition
- Expensive! → distillation
  Training is too expensive
  (Symbolic) distillation: generate data with an LLM in advance → train an LM on it
