Cosmopedia is a dataset of synthetic textbooks, blog posts, stories, posts, and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1. It contains over 30 million files and 25 billion tokens, making it the largest open synthetic dataset to date.
Diverse prompt curation
The key challenge is maintaining diversity: prompts should cover a wide range of topics while minimizing duplicate outputs.
Before large-scale generation, the prompts were iterated on and tested with tools like **HuggingChat**.
Goal
For diversity, the Phi-1.5 technical report describes generating 20B tokens over 20,000 topics, seeded with web data:
“We carefully selected 20K topics to seed the generation of this new synthetic data. In our generation prompts, we use samples from web datasets for diversity.”
Assuming an average file length of 1000 tokens, this suggests using approximately 20 million distinct prompts. However, the methodology behind combining topics and web samples for increased diversity remains unclear.
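For reference, the implied prompt count follows from a quick back-of-the-envelope calculation; the 1,000-token average file length is the assumption stated above:

```python
# Rough estimate of the distinct prompts implied by the Phi-1.5 numbers,
# assuming ~1,000 tokens per generated file.
total_tokens = 20_000_000_000      # 20B tokens reported in the Phi-1.5 technical report
avg_tokens_per_file = 1_000        # assumed average file length
num_files = total_tokens // avg_tokens_per_file
print(f"{num_files:,} files -> roughly that many distinct prompts")  # 20,000,000
```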
Dataset
Figure 2. The distribution of data sources for building Cosmopedia prompts (left plot) and the distribution of sources inside the Curated sources category (right plot).
1) Curated data
Topics are drawn from reputable educational sources such as Stanford courses, Khan Academy, OpenStax, and WikiHow.
Figure 3. Prompts for generating the same textbook for young children vs for professionals and researchers vs for high school students.
Outlines were extracted from various Stanford courses and turned into prompts asking the model to generate textbooks for those course units.
→ These sources provide many topics that are valuable for an LLM to learn.
This approach yields high-quality content, but its scalability is limited.
To increase the diversity of the generated samples, variation in target audience and style is leveraged, turning each topic into 12x more prompts.
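A minimal sketch of how audience and style variation multiplies the prompt count; the topics, audiences, styles, and prompt wording below are illustrative placeholders, not the exact ones used for Cosmopedia:

```python
# Illustrative: 4 audiences x 3 styles = 12 prompts per curated topic.
TOPICS = ["Eigenvalues and eigenvectors", "Photosynthesis"]  # e.g. from Stanford course outlines
AUDIENCES = ["young children", "high school students", "college students", "professionals and researchers"]
STYLES = ["textbook", "blog post", "wikiHow-style article"]

def build_prompt(topic: str, audience: str, style: str) -> str:
    # Stand-in wording for the real Cosmopedia prompt templates.
    return (f"Write a {style} about the following topic: {topic}. "
            f"The target audience is {audience}; adapt depth, tone and examples accordingly.")

prompts = [build_prompt(t, a, s) for t in TOPICS for a in AUDIENCES for s in STYLES]
print(len(prompts) // len(TOPICS))  # 12 prompts per topic
```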
2) Web data
To achieve scalability, web data is used to build prompts (80% of Cosmopedia's prompts).
Millions of web samples from datasets such as RefinedWeb were clustered into 145 clusters, and the topic of each cluster was identified by providing extracts from 10 random samples and asking Mixtral to find their common topic. More details on this clustering are available in the Technical Stack section.
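A rough sketch of this clustering step, assuming sentence-transformers embeddings and scikit-learn KMeans (the actual pipeline uses the text-clustering setup described in the Technical Stack section):

```python
import random
from itertools import islice

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Stream a slice of RefinedWeb; the "content" column is assumed to hold the raw text.
stream = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
web_samples = [row["content"] for row in islice(stream, 5_000)]

# Embed and cluster the samples (145 clusters, as in the blog post).
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(web_samples)
labels = KMeans(n_clusters=145, random_state=0).fit_predict(embeddings)

# Name one cluster by showing Mixtral extracts from 10 random members.
cluster_id = 0
members = [s for s, label in zip(web_samples, labels) if label == cluster_id]
extracts = random.sample(members, k=min(10, len(members)))
topic_prompt = (
    "Here are extracts from 10 web documents:\n\n"
    + "\n---\n".join(e[:500] for e in extracts)
    + "\n\nWhat topic do these documents have in common? Answer with a short topic name."
)
# `topic_prompt` would then be sent to Mixtral-8x7B-Instruct-v0.1 to label the cluster.
```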
Figure 4. Example of a web extract and the associated prompt.
The prompts instruct the model to generate a textbook related to a given web sample while staying within the scope of the topic of its cluster.
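An illustrative template for such a web-seeded prompt (the wording is a stand-in, not the exact Cosmopedia template):

```python
def web_textbook_prompt(web_extract: str, cluster_topic: str, audience: str = "college students") -> str:
    # Hypothetical wording; the real prompts also vary format and audience as shown in Figure 5.
    return (
        f'Here is an extract from a webpage:\n"{web_extract}"\n\n'
        f"Write an extensive and detailed textbook unit for {audience} related to this extract, "
        f'staying within the scope of the topic "{cluster_topic}". '
        "Do not simply rehash the extract; develop the ideas with explanations and examples."
    )
```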
Figure 5. The distribution of seed data, generation format and target audiences in Cosmopedia dataset.
3) Instruction datasets and stories
To improve common sense and fundamental everyday knowledge, texts from the UltraChat and OpenHermes2.5 instruction-tuning datasets are used as seed data for story-generation prompts.
Figure 6. Prompts for generating stories from UltraChat and OpenHermes samples for young children vs a general audience vs reddit forums.
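A small sketch of seeding a story prompt from an instruction-dataset sample; the seed text and wording are made up for illustration:

```python
# Hypothetical seed: a user question of the kind found in UltraChat or OpenHermes2.5.
seed = "How do I politely decline a meeting invitation without upsetting my manager?"

def story_prompt(seed_text: str, audience: str = "young children") -> str:
    # Stand-in wording; Cosmopedia varies the audience (young children, general audience, reddit forums).
    return (
        f"Write a short story for {audience} that naturally conveys the everyday knowledge "
        f"needed to answer the following question:\n\n{seed_text}"
    )

print(story_prompt(seed))
```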
Data generation used **llm-swarm**, with the prompts iterated on beforehand in **HuggingChat**.
Decontamination pipeline: search for 10-gram overlaps between the generated data and benchmark datasets, then compare candidate matches with `difflib.SequenceMatcher`.
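A minimal sketch of such a check, with simplified whitespace tokenization and an assumed 0.5 overlap-ratio threshold (not Cosmopedia's exact implementation):

```python
from difflib import SequenceMatcher

def ngrams(text: str, n: int = 10) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(generated: str, benchmark_sample: str, ratio_threshold: float = 0.5) -> bool:
    # Step 1: cheap filter, only inspect pairs that share at least one 10-gram.
    if not ngrams(generated) & ngrams(benchmark_sample):
        return False
    # Step 2: measure how much of the benchmark sample is matched character-by-character.
    matcher = SequenceMatcher(None, generated, benchmark_sample)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(benchmark_sample) > ratio_threshold
```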
We used the datatrove library for data deduplication and tokenization, nanotron for model training, and lighteval for evaluation.
Reference:
https://huggingface.co/blog/cosmopedia