Self-Instruct: Aligning Language Models with Self-Generated Instructions (ACL 2023)

Minhan Cho·2024년 8월 4일

Summary

이전 instruction tuning dataset의 단점: human-written instruction data의 양, 다양성, 창의성이 부족한 경우가 많음 $\rightarrow$ model generality가 하락
- instruction tuning dataset1: PromptSource (2073 prompt types over 177 datasets)
- instruction tuning dataset2: Super-NaturalInstruction; a.k.a. Super-NI, 119 tasks & 100 instances each
- (본 논문에서는 지적하지 않지만) public instruction tuning dataset은 품질과 양 모두 낮고, private instruction tuning dataset (InstructGPT) 은 접근이 아예 불가능함 $\rightarrow$ 업계와 유리된 한계의 치명적 약점
본 논문에서는 PLM 자신의 데이터를 이용 (knowledge distillation) 해 instruction-following capability를 향상시키는 Self-Instruct를 제안
pipeline: instruction, input, output sample 을 LLM을 이용해 생성 $\rightarrow$ invalid & similar sample을 필터링 $\rightarrow$ finetune
실험결과:
- GPT3와 Super-NaturalInstruction dataset에 실험
- 33% improvement 기록, InstructGPT와 동등한 성능 기록: private user data & human annotation에 학습된 것과 동등한 성능임
- novel task에 대한 expert written instruction set에 대해서도 실험 진행했을 때, Instruct GPT보다 약간 떨어지는 성능 기록

Contributions

(1) Self-Instruct 제시: human-labeled data를 최소한도로 사용하는 instruction tuning approach
(2) insturction-tuning experiment로 Self-Instruct의 효과성을 증명
(3) 새로운 instruction tuning dataset 제시: 52k instruction과 manually-written novel tasks ((1) 에서 구축한 데이터셋 & evaluation을 위한 새로운 dataset)

아래는 Self-Instruct를 통해 만들어진 data example

Method

GPT3-davinci를 이용해 instance 생성

Automatic Instruction Data Generation

Instruction dataset의 구성

(1) instruction; a.k.a. task description
(2) input (optional)
(3) output

(1) instruction $\rightarrow$ (2) clf task 여부를 나누고 $\rightarrow$ (3) input & output 의 순서로 instance를 생성

(1) Task Instruction Generation

175개 task (task마다 각 1개 instruction과 1개 task의 instance를 포함) 으로 이루어진 task pool에서 task instruction을 생성
- 175개의 seed task는 이전 dataset을 참조하지 않고 모두 직접 만들었다고 함
- 150개는 non-clf, 25개는 clf task
- SuperNI dataset과의 낮은 Rouge-L score 확인 (약 0.34 가량)
Seed Task의 모양새 (최종 dataset sample도 같은 모양새임)

"id": "seed_task_0",
"name": "breakfast_suggestion",
"instruction": "Is there anything I can eat for a breakfast that doesn't include eggs, yet includes protein, and has roughly 700-1000 calories?",
"instances": [{"input": "", "output": "Yes, you can have 1 oatmeal banana protein shake and 4 strips of bacon. The oatmeal banana protein shake may contain 1/2 cup oatmeal, 60 grams whey protein powder, 1/2 medium banana, 1tbsp flaxseed oil and 1/2 cup watter, totalling about 550 calories. The 4 strips of bacon contains about 200 calories."}],
"is_classification": false}
task pool 안에서 8개를 in context example로 주어 instruction을 생성

(2) Classification Task Identification

classification과 non-classification task에 대해서는 different approach를 취하기 때문에, task 및 instruction이 classification인지 아닌지 분류
12개의 classification task instruction과 19개의 non-classification task instruction을 in-context example로 제시

(3) Instance Generation

(1)에서 instruction, (2)에서 task type을 정한 뒤 instance를 생성
PLM을 이용해 {instruction, input, output in-context example}로 주었을 때, 적절한 instance (input & output) 를 생성할 수 있음을 확인
Input-first approach와 Output-first approach를 사용하여 input과 output을 생성
Input-first Approach: instruction에 기반해 input field 를 먼저 생성하고, 이후에 output을 생성
- 해당 방식은 one label에 biased된 input을 생성한다는 한계가 있음 (e.g. 문법 오류 검출 여부 task에서 문법적으로 정확한 문장만을 생성)

Output-first Approach: output (class label)을 먼저 생성하고, output (class label)에 따라서 input 을 만드는 것
- input은 output에 conditioned되어 적절한 input & output을 생성할 수 있었다고 함

clf task에서는 Output-first approach를, non-clf task에서는 Input-first approach를 사용

(4) Filtering & Postprocessing

ROUGE-L 기준으로 0.7 미만일 경우에만 생성된 instruction을 instruction pool에 추가
heuristic rule 추가: instruction이 너무 길거나, 너무 짧거나, instance output이 input의 단순 반복이거나 하면 제외

Finetuning the LM to Follow Instructions

input prompt: instruction과 instance input
y: instance output
objective: supervised finetuning
different format에 대해 작동하기 위한 설계
- instruction 앞에 "Task:"를 붙이거나 붙이지 않거나
- instance input 앞에 "Input:"을 붙이거나 붙이지 않거나
- instance output 앞에 "Output:"을 붙이거나 붙이지 않거나
- 중간에 개행이 여러 줄 들어갈 수도

Self-Instruct Data from GPT3

Self-Instruct 방식을 사용해 구축한 데이터의 규모, 다양성, 질을 확인

Scale

52k instruction에 대해 82k instance 생성
(사견) Dolly v2가 15k 규모인 것을 감안하면 적은 편은 아님

Diversity

Berkeley Neural parser를 이용해 verb-noun sturcture 를 분석
- 가장 중심이 되는 동사와 해당 동사와 연결된 명사를 추출 (e.g. "Write a paragraph." $\rightarrow$ verb: write, noun: paragraph)
26,559 instructions out of 52,445 instructions가 verb-noun structure를 가짐
나머지는 complex clauses or framed questions에 해당
- "Classify whether this tweet contains political content or not."
아래 figure (top 20 common root verbs & top 4 direct noun objects) 를 확인하면, 다양한 동사와 다양한 명사가 등장하여 intent와 textual format의 다양성을 확인할 수 있음

seed instruction과의 overlap (ROUGE-L 중 가장 높은 값의 분포), instruction, input, output의 길이 분포도 확인

Quality

200개 instruction과 instruction마다 각 1개의 instance를 선택해 expert annotator에게 label
- instruction, instance input, instance output이 유효한지(valid) 아닌지를 확인

일부 오류가 있긴 하지만, instruction의 format이 정확하기 때문에 instruction following을 위한 training에는 문제가 없을 것으로 판단

Experimental Results

사용한 LLM: GPT3-davinci (instruction tuning 진행되지 않은 PLM 상태의 GPT3) 에 finetuning
Baselines
- off-the-shelf LM: T5 (PLM), GPT3-davinci001 (PLM)
- publicly available instruction-tuned models: T0, Tk-instruct (11B both)
  - T0: T5에 PromptSource dataset 적용해 instruction tuning이 적용된 모델
  - Tk-instruct: T0와 유사하게 SuperNI dataset 적용해 instruction tuning이 적용된 모델
- instruction-tuned GPT3 models: InstructGPT
- 다른 instruction tuning dataset에 FT한 모델: GPT3 with PromptSource, GPT3 with SuperNI
  - training budget 때문에 50k instance로 한정

Experiment 1: ZS on SuperNI benchmark

SuperNI task는 119 tasks with 100 instances each 로 구성됨

(1) instruction following 능력의 향상: GPT3은 instruction을 전혀 follow 하지 못하나, Self-Instruct에 학습하였을 때 insturction-following ability가 향상됨 (33%p)
(2) low-quality public data 보다 높은 성능을, high-quality private data와 비슷한 성능: GPT3에 PromptSource dataset (T0 dataset) 을 학습했을 때보다 높은 성능을, InstructGPT와 유사한 성능을 기록
(3) SuperNI dataset에 추가로 학습할 때, complementary data로 작동하면서 추가적인 성능 향상을 확인

Experiment 2: Generalization to User-oriented Instructions on Novel Tasks

논문의 저자들이 novel task에 대한 instruction과 instance를 제작해 시험
- domain: email writing, social media, productivity tools, entertainment, programming etc.
- instruction의 길이는 유동적
- input & output은 text를 포함해 bullet points, tables, codes, equations 등으로 표현
- 252 instructions with 1 instance per instruction

evaluation dataset에 대한 LLM의 output evaluation 또한 저자들이 4-level scale로 수행

human-written dataset보다 훨씬 더 효율적: self-instruct GPT는 PromptSource나 SuperNI에 학습한 모델들을 outperform
self-instruct GPT는 InstructGPT_001에 비해 5% 정도밖에 성능 차이 없었음
(사견) self-instruct dataset을 구축하면서 저자들이 다른 dataset을 reference로 참고하지 않았다고 하는데, 그 과정에서 novel한 instruction들이 들어간 듯

Data Size & Data Quality

만약 instruction dataset을 개선할 수 있다면 어떤 방향으로..?

Data Size

self-instruct dataset에서 instruction의 개수를 줄여가면서 실험
instruction 개수를 늘렸을 때 성능 향상은 있지만, 16k 이후로는 plateau 발견 (16k 일 때는 왜 점 안 찍었지..?)

Data Quality

InstructGPT003에 instruction과 input을 주고, output을 생성하게끔 함
10% 가량의 성능 향상
Self-Instruct로 initial data를 얻고, human expert나 distillation으로 quality를 끌어올리는 방안 가능할 것으로 보임

Conclusion & Take-home Message

instruction tuning 되지 않은 PLM (GPT3-davinci) 을 이용해 52k 규모의 instruction tuning dataset을 구축
해당 데이터셋을 GPT3-davinci에 학습한 결과, GPT3-davinci에 비해 33%의 성능 향상 on SuperNI benchmark
public dataset (PromptSource, SuperNI) 을 활용하는 것보다 더 좋은 성능을 보임

Limitations & Comment

GPT의 knowledge를 기반으로 dataset을 만들기 때문에 bias가 증폭될 수 있음

(Table 2) instruction 과 input에 적절한 output과 response인지 여부를 확인했을 때 58%에 불과한데, 너무 낮은 것 아닌지?

PromptSource와 SuperNI dataset과의 차이를 보여주는 자료가 있으면 좋을 것. 어떤 차이 때문에 PromptSource나 SuperNI를 사용하는 것보다 Self-Instruct를 사용하는 것이 더 좋은지 아리송

instruction tuning 시 task의 다양성이 중요함을 확인 가능 (단, instance 개수의 중요도는 이전 논문을 인용함으로써 넘어감)

하지만, 데이터의 품질 또한 중요하다는 점을 강조

instruction tuning scheduling 때 낮은 난이도의 data는 Self-Instruct 방식을, 높은 난이도의 data는 human-written 방식을 사용하여 efficiency와 effectiveness를 모두 잡는 것이 좋겠음

Miscs

T0 dataset (PromptSource)

{"prompt": "Make a list of 10 ways to help students improve their study skills.\n\nOutput:",
"completion": " 1. Make a schedule for studying and stick to it.\n2. Study in the same place every time.\n3. Set goals for yourself.\n4. Take breaks when you need them.\n5. Don't cram before an exam.\n6. Get enough sleep.\n7. Eat healthy food.\n8. Exercise regularly.\n9. Find a study partner.\n10. Reward yourself after completing a task.<|endoftext|>"}

SuperNI dataset

{"prompt": "Generate an appropriate title for the given text. The generated title must be short and include the main topic of the text. The preferred titles are under fifteen words.\n\n
Input: Chorley Council has objected to plans for the Cuerden retail park near Bamber Bridge in Lancashire amid concerns it could threaten its \u00a317m town centre redevelopment. The authority is seeking the compensation from the developers and backers Lancashire County Council about the planned 65-hectare retail park. They said the \"money is not justified\". The developers said the site features restaurants and a hotel and would also create 4,500 jobs and could house offices and shops, as well as the Ikea store. But Chorley Council is concerned the plans could drive people away from its own new-look town centre, which will feature a cinema and new high street shops. The planning application will be dealt with by South Ribble Borough Council next month. Chorley Council has written a letter spelling out concerns, including fears about the impact of increased traffic. The council's deputy leader Peter Wilson said the council had been watching the plans \"very closely\" and wanted to \"protect the interests\" of Chorley. He added: \"While we want to see economic growth across Lancashire, we are concerned that the proposals don't properly address the impact that a development of that size could have on Chorley town centre and the traffic and highways surrounding the area. \"For those reasons, we cannot support the proposals as they currently stand.\" The council's objection letter demands \"a financial contribution of \u00a311,520,121.00 to mitigate the impact of the Cuerden development\". A spokesman for the Cuerden Strategic Site developers said: \"We are aware that Chorley Council has objected to the Cuerden application, however it is unclear from their letter what the justification is for doing so. \"Following a detailed and robust assessment, our own professional advisers have concluded that the potential retail impacts on neighbouring areas are acceptable. \"Therefore, Chorley Council's request for a significant sum of money by way of mitigation is not justified.\"\n\nOutput:",
"completion": " Chorley Council seeks \u00a311.5m compensation over Ikea retail park plan<|endoftext|>"}

Minhan Cho

multidisciplinary

이전 포스트

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (EMNLP 2023)

다음 포스트