Jotting down some notes and a summary while reading https://newsletter.ruder.io/p/instruction-tuning-vol-1
Summary

| Dataset | Year | # Examples | # Tasks | Languages | Notes |
|---|---|---|---|---|---|
| Natural Instructions | 2022 | 193k | 61 | English | |
| NI v2 / SNI | 2022 | 5M | 76 | 55 | |
| Unnatural Instructions | 2023 | 240k | - | - | New examples generated from SNI with InstructGPT |
| Self-Instruct | 2023 | 82k | 175 (seed tasks) | - | Generated with InstructGPT from seed task examples (see the sketch below) |
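To make the Unnatural Instructions / Self-Instruct idea more concrete, here is a minimal sketch of generating new instructions by prompting a strong model with a few seed tasks. The prompt wording, the `call_llm` helper, and the filtering heuristic are my own placeholders, not the exact pipelines from the papers.

```python
import random

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a strong instruction-following model
    (e.g. InstructGPT); swap in your own API client here."""
    raise NotImplementedError

def generate_instructions(seed_tasks: list[str], n_new: int = 8) -> list[str]:
    """Draft new task instructions by showing the model a few seed tasks."""
    demos = random.sample(seed_tasks, k=min(4, len(seed_tasks)))
    prompt = "Here are some task instructions:\n"
    prompt += "\n".join(f"- {t}" for t in demos)
    prompt += f"\nWrite {n_new} new, diverse task instructions, one per line:"
    raw = call_llm(prompt)
    # Very rough filtering: drop blanks and exact copies of the seeds.
    candidates = [line.strip("- ").strip() for line in raw.splitlines() if line.strip()]
    return [c for c in candidates if c and c not in seed_tasks]

# Each new instruction would then be sent back to the model to produce an
# input/output pair, filtered again, and added to the training pool.
```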
Longpre et al. (2023) and Iyer et al. (2022) ablate several important aspects of instruction data, which we highlight in the following.
Mixing few-shot settings. Training with mixed zero-shot and few-shot prompts significantly improves performance in both settings. Hmm, so we should put both few-shot and zero-shot prompts in the training data! Is training on zero-shot prompts alone not as good?
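As a concrete illustration of mixing prompt settings, here is a minimal sketch of serializing the same (instruction, input, output) triple in both zero-shot and few-shot formats; the template strings are my own assumptions, not the exact templates used in the papers above.

```python
def format_zero_shot(instruction: str, x: str, y: str) -> str:
    """Zero-shot style: instruction plus input, no demonstrations."""
    return f"{instruction}\nInput: {x}\nOutput: {y}"

def format_few_shot(instruction: str, demos: list[tuple[str, str]], x: str, y: str) -> str:
    """Few-shot style: prepend a few solved examples of the same task."""
    shots = "\n\n".join(f"Input: {dx}\nOutput: {dy}" for dx, dy in demos)
    return f"{instruction}\n\n{shots}\n\nInput: {x}\nOutput: {y}"

# A training corpus would interleave both formats so the model sees
# zero-shot and few-shot prompts during instruction tuning.
```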
Task diversity. Large models benefit from continuously increasing the number of tasks.
Data augmentation. Augmenting the data such as by inverting inputs and outputs (e.g., turning a question answering task into a question generation task) is beneficial. So data augmentation helps. (Hmm?)
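A tiny sketch of the input/output inversion idea, turning a question-answering example into a question-generation example; the field names and instruction text are illustrative assumptions.

```python
def invert_qa_example(context: str, question: str, answer: str) -> dict:
    """Turn a question-answering example into a question-generation example:
    the context and answer become the input, the question becomes the target."""
    return {
        "instruction": "Write a question whose answer is the given text.",
        "input": f"Context: {context}\nAnswer: {answer}",
        "output": question,
    }
```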
Mixing weights. When using a combination of instruction tuning datasets, appropriately tuning the mixing weights is important. You have to mix them well.
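Here is a minimal sketch of what dataset mixing with tunable weights could look like as weighted sampling; the dataset names and weights in the usage comment are placeholders, not recommended values.

```python
import random

def sample_batch(datasets: dict[str, list[dict]],
                 weights: dict[str, float],
                 batch_size: int) -> list[dict]:
    """Draw a batch where each example's source dataset is chosen
    in proportion to its mixing weight."""
    names = list(datasets)
    probs = [weights[n] for n in names]
    batch = []
    for _ in range(batch_size):
        src = random.choices(names, weights=probs, k=1)[0]
        batch.append(random.choice(datasets[src]))
    return batch

# Placeholder usage (weights must be tuned; these are not recommended values):
# batch = sample_batch({"flan": flan_data, "sni": sni_data},
#                      {"flan": 0.7, "sni": 0.3}, batch_size=32)
```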
While the above datasets are mainly derived from classical NLP tasks, recent datasets such as Baize (Xu et al., 2023), OpenAssistant Conversations (Köpf et al., 2023) and others cover a more diverse set of applications and domains. We will discuss these in the next edition. Stay tuned! 👋
But!! Even if you know about this instruction data, you still need to have some insight. To really understand it, you have to read through the papers once and get an overall feel for them!!
More details at https://newsletter.ruder.io/p/instruction-tuning-vol-2
✅ Quality > quantity. As Zhou et al. (2023) observe, training on a small set of high-quality data outperforms instruction-tuning on larger, noisier data. Using more diverse prompts and quality filtering both improve performance.
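In the spirit of quality over quantity, a rough sketch of simple quality filtering (deduplication plus length heuristics) over instruction-response pairs; the thresholds are arbitrary assumptions, not the criteria used by Zhou et al. (2023).

```python
def filter_examples(examples: list[dict], min_len: int = 20, max_len: int = 2000) -> list[dict]:
    """Keep only reasonably sized, non-duplicate instruction-response pairs."""
    seen, kept = set(), []
    for ex in examples:
        resp = ex["output"].strip()
        key = (ex["instruction"].strip().lower(), resp.lower())
        if key in seen:
            continue  # drop exact duplicates
        if not (min_len <= len(resp) <= max_len):
            continue  # drop very short or very long responses
        seen.add(key)
        kept.append(ex)
    return kept
```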
🧑🎓 Imitation != mastery. Models that are instruction-tuned on ChatGPT-generated data mimic ChatGPT’s style (and may thus fool human raters!) but not its factuality (Gudibande et al., May 2023). They perform worse on standard benchmarks. Using stronger base models is the best way to address this.
🏛️ The stronger the base, the better. More powerful base models also produce stronger instruction-tuned models (Wang et al., June 2023).
🥇 The combination wins. Combining multiple instruction-tuning datasets results in the best average performance across tasks (Wang et al., June 2023). Dataset mixing and developing modular instruction-tuned models are thus important research directions.
Understanding instruction-tuning. While we have seen a proliferation of instruction-tuning datasets, we still lack a clear understanding of what makes a good instruction and good instruction–response pairs. There is much anecdotal knowledge when it comes to creating good model prompts, but to my knowledge it is unclear how instruction-following data can be created at scale in a more principled manner.
Improving data quality. To improve model performance, we need to develop more reliable methods to identify high-quality examples and filter out undesirable ones. In a similar vein, it is important to develop methods that allow us to identify how a particular instance affects model behavior and alignment at test time.
Evaluating instruction-tuned models. In light of the biases of both human and automatic evaluations, there is no clear gold standard for how to evaluate instruction-tuned models. Evaluating a model on a set of tests that can be efficiently and automatically evaluated is one way to side-step this issue (see LMentry (Efrat et al., ACL 2023), M2C (Hlavnova & Ruder, ACL 2023), IFEval (Zhou et al., Nov 2023), etc.), but these are restricted to a certain set of use cases. In general, it is crucial to design evaluations with a target application in mind.
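To illustrate the kind of test that can be evaluated efficiently and automatically (as in IFEval-style verifiable instructions), here is a minimal sketch of one programmatic check; the constraint and function name are my own, not IFEval's actual checkers.

```python
def check_bullet_count(response: str, expected: int = 3) -> bool:
    """Check an instruction like 'answer in exactly 3 bullet points'
    by counting lines that start with a bullet marker."""
    bullets = [line for line in response.splitlines()
               if line.strip().startswith(("-", "*", "•"))]
    return len(bullets) == expected

# Aggregate score = fraction of responses that pass their associated checks.
```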