Jotting down some notes and a summary while reading https://newsletter.ruder.io/p/instruction-tuning-vol-1
Summary

| Dataset | Year | # Examples | # Tasks | Languages | Notes |
|---|---|---|---|---|---|
| Natural Instructions | 2022 | 193k | 61 | English | |
| NI v2 / SNI | 2022 | 5M | 76 | 55 | |
| Unnatural Instructions | 2023 | 240k | - | - | New examples generated from SNI with InstructGPT |
| Self-Instruct | 2023 | 82k | 175 (seed tasks) | - | Generated with InstructGPT from seed task examples (see the sketch below) |
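To make the Unnatural Instructions / Self-Instruct idea more concrete, here is a minimal sketch of generating new instructions by prompting a strong model with a few seed tasks. The prompt wording, the `call_llm` helper, and the filtering heuristic are my own placeholders, not the exact pipelines from the papers.

```python
import random

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a strong instruction-following model
    (e.g. InstructGPT); swap in your own API client here."""
    raise NotImplementedError

def generate_instructions(seed_tasks: list[str], n_new: int = 8) -> list[str]:
    """Draft new task instructions by showing the model a few seed tasks."""
    demos = random.sample(seed_tasks, k=min(4, len(seed_tasks)))
    prompt = "Here are some task instructions:\n"
    prompt += "\n".join(f"- {t}" for t in demos)
    prompt += f"\nWrite {n_new} new, diverse task instructions, one per line:"
    raw = call_llm(prompt)
    # Very rough filtering: drop blanks and exact copies of the seeds.
    candidates = [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]
    return [c for c in candidates if c and c not in seed_tasks]

# Each new instruction would then be sent back to the model to produce an
# input/output pair, filtered again, and added to the training pool.
```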
Longpre et al. (2023) and Iyer et al. (2022) ablate several important aspects of instruction data, which we highlight in the following.
Mixing few-shot settings. Training with mixed zero-shot and few-shot prompts significantly improves performance in both settings. Hmm, so we should put both few-shot and zero-shot prompts in the training data! Is training on zero-shot prompts alone not as good?
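As a concrete illustration of mixing prompt settings, here is a minimal sketch of serializing the same (instruction, input, output) triple in both zero-shot and few-shot formats; the template strings are my own assumptions, not the exact templates used in the papers above.

```python
def format_zero_shot(instruction: str, x: str, y: str) -> str:
    """Zero-shot style: instruction plus input, no demonstrations."""
    return f"{instruction}\nInput: {x}\nOutput: {y}"

def format_few_shot(instruction: str, demos: list[tuple[str, str]], x: str, y: str) -> str:
    """Few-shot style: prepend a few solved examples of the same task."""
    shots = "\n\n".join(f"Input: {dx}\nOutput: {dy}" for dx, dy in demos)
    return f"{instruction}\n\n{shots}\n\nInput: {x}\nOutput: {y}"

# A training corpus would interleave both formats so the model sees
# zero-shot and few-shot prompts during instruction tuning.
```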
Task diversity. Large models benefit from continuously increasing the number of tasks.
Data augmentation. Augmenting the data such as by inverting inputs and outputs (e.g., turning a question answering task into a question generation task) is beneficial. So data augmentation helps. (Hmm?)
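A tiny sketch of the input/output inversion idea, turning a question-answering example into a question-generation example; the field names and instruction text are illustrative assumptions.

```python
def invert_qa_example(context: str, question: str, answer: str) -> dict:
    """Turn a question-answering example into a question-generation example:
    the context and answer become the input, the question becomes the target."""
    return {
        "instruction": "Write a question whose answer is the given text.",
        "input": f"Context: {context}\nAnswer: {answer}",
        "output": question,
    }
```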
Mixing weights. When using a combination of instruction tuning datasets, appropriately tuning the mixing weights is important. You have to mix them well.
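Here is a minimal sketch of what dataset mixing with tunable weights could look like as weighted sampling; the dataset names and weights in the usage comment are placeholders, not recommended values.

```python
import random

def sample_batch(datasets: dict[str, list[dict]],
                 weights: dict[str, float],
                 batch_size: int) -> list[dict]:
    """Draw a batch where each example's source dataset is chosen
    in proportion to its mixing weight."""
    names = list(datasets)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        src = random.choices(names, weights=probs, k=1)[0]
        batch.append(random.choice(datasets[src]))
    return batch

# Placeholder usage (weights must be tuned; these are not recommended values):
# batch = sample_batch({"flan": flan_data, "sni": sni_data},
#                      {"flan": 0.7, "sni": 0.3}, batch_size=32)
```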
While the above datasets are mainly derived from classical NLP tasks, recent datasets such as Baize (Xu et al., 2023), OpenAssistant Conversations (Köpf et al., 2023) and others cover a more diverse set of applications and domains. We will discuss these in the next edition. Stay tuned! 👋
But!! Even if you know about this instruction data, you still need to have some insight. To really understand it, you have to read through the papers once and get an overall feel for them!!
More details at https://newsletter.ruder.io/p/instruction-tuning-vol-2
✅ Quality > quantity. As Zhou et al. (2023) observe, training on a small set of high-quality data outperforms instruction-tuning on larger, noisier data. Using more diverse prompts and quality filtering both improve performance.
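In the spirit of quality over quantity, a rough sketch of simple quality filtering (deduplication plus length heuristics) over instruction-response pairs; the thresholds are arbitrary assumptions, not the criteria used by Zhou et al. (2023).

```python
def filter_examples(examples: list[dict], min_len: int = 20, max_len: int = 2000) -> list[dict]:
    """Keep only reasonably sized, non-duplicate instruction-response pairs."""
    seen, kept = set(), []
    for ex in examples:
        resp = ex["output"].strip()
        key = (ex["instruction"].strip().lower(), resp.lower())
        if key in seen:
            continue  # drop exact duplicates
        if not (min_len <= len(resp) <= max_len):
            continue  # drop very short or very long responses
        seen.add(key)
        kept.append(ex)
    return kept
```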
🧑🎓 Imitation != mastery. Models that are instruction-tuned on ChatGPT-generated data mimic ChatGPT’s style (and may thus fool human raters!) but not its factuality (Gudibande et al., May 2023). They perform worse on standard benchmarks. Using stronger base models is the best way to address this.
🏛️ The stronger the base, the better. More powerful base models also produce stronger instruction-tuned models (Wang et al., June 2023).
🥇 The combination wins. Combining multiple instruction-tuning datasets results in the best average performance across tasks (Wang et al., June 2023). Dataset mixing and developing modular instruction-tuned models are thus important research directions.
Understanding instruction-tuning. While we have seen a proliferation of instruction-tuning datasets, we still lack a clear understanding of what makes a good instruction and good instruction–response pairs. There is much anecdotal knowledge when it comes to creating good model prompts, but to my knowledge it is unclear how instruction-following data can be created at scale in a more principled manner.
Improving data quality. To improve model performance, we need to develop more reliable methods to identify high-quality examples and filter out undesirable ones. In a similar vein, it is important to develop methods that allow us to identify how a particular instance affects model behavior and alignment at test time.
Evaluating instruction-tuned models. In light of the biases of both human and automatic evaluations, there is no clear gold standard for how to evaluate instruction-tuned models. Evaluating a model on a set of tests that can be efficiently and automatically evaluated is one way to side-step this issue (see LMentry (Efrat et al., ACL 2023), M2C (Hlavnova & Ruder, ACL 2023), IFEval (Zhou et al., Nov 2023), etc.), but these are restricted to a certain set of use cases. In general, it is crucial to design evaluations with a target application in mind.
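To illustrate the kind of test that can be evaluated efficiently and automatically (as in IFEval-style verifiable instructions), here is a minimal sketch of one programmatic check; the constraint and function name are my own, not IFEval's actual checkers.

```python
def check_bullet_count(response: str, expected: int = 3) -> bool:
    """Check an instruction like 'answer in exactly 3 bullet points'
    by counting lines that start with a bullet marker."""
    bullets = [line for line in response.splitlines()
               if line.strip().startswith(("-", "*", "•"))]
    return len(bullets) == expected

# Aggregate score = fraction of responses that pass their associated checks.
```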