Recent LLMs scale up following the performance scaling law → MoE
MoE often requires non-trivial changes to training and inference frameworks
This hinders widespread applicability
Scaling up while retaining simplicity is important
Depth Up-Scaling (DUS)
Scaling the base model along the depth dimension and continually pretraining the scaled model
No MoE
No additional modules
No changes to training/inference frameworks
Applicable to any transformer architecture
SOLAR 10.7B > Mistral 7B, Llama 2 7B
SOLAR 10.7B-Instruct > Mixtral-8x7B
2. Depth Up-Scaling
Use the pretrained weights of a base model to scale up
Then continually pretrain the scaled model
Base Model
Any n-layer transformer architecture works (SOLAR uses the 32-layer Llama 2 architecture)
Initialized the Llama 2 architecture with pretrained weights from Mistral 7B
Depthwise Scaling
From the base model with n layers, set the target layer count s for the scaled model (largely dictated by the available hardware)
Copy the base model
Remove the final m layers from the original and the initial m layers from the duplicate
Concatenate the two to form s = 2⋅(n−m) layers (n=32, m=8, s=48 for SOLAR), as sketched below
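A minimal sketch of depthwise scaling with the notes' numbers (n=32, m=8, s=48), assuming a Hugging Face-style model whose decoder layers live in `model.model.layers`; the checkpoint name and loading details are illustrative, not the paper's exact recipe.

```python
import copy

import torch.nn as nn
from transformers import AutoModelForCausalLM

n, m = 32, 8                  # base depth, layers trimmed on each side of the seam
s = 2 * (n - m)               # target depth: 2 * (32 - 8) = 48

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
dup = copy.deepcopy(base)

# First copy keeps layers 0..n-m-1; the duplicate keeps layers m..n-1,
# so the seam joins (0-indexed) layer 23 to layer 8.
bottom = list(base.model.layers)[: n - m]
top = list(dup.model.layers)[m:]

scaled = base                 # reuse embeddings, final norm, and LM head
scaled.model.layers = nn.ModuleList(bottom + top)
scaled.config.num_hidden_layers = s
assert len(scaled.model.layers) == s   # 48 layers, ~10.7B parameters
```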
Continued Pretraining
The scaled model's performance initially drops below that of the base model
Rapid recovery is observed during continued pretraining
The particular way of depthwise scaling isolates the heterogeneity in the scaled model
If we simply repeated all layers to reach 2n total, the layer distance (difference in original layer indices) would be largest at the seam: layer 32 would sit next to layer 1 (distance 31)
SOLAR instead sacrifices the middle 2m layers, reducing the discrepancy at the seam: layer 24 meets layer 9 (distance 15)
The success of DUS comes from the combination of depthwise scaling and continued pretraining (sketch below)
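A sketch of the continued-pretraining step on the scaled model; the corpus, batch size, and learning rate are placeholders (`pretrain_dataset` is hypothetical), not the paper's settings. The objective is the standard causal-LM loss.

```python
import torch
from torch.utils.data import DataLoader

# `scaled` comes from the depthwise-scaling sketch above; `pretrain_dataset`
# is a hypothetical tokenized corpus yielding {"input_ids": LongTensor} items.
optimizer = torch.optim.AdamW(scaled.parameters(), lr=1e-5)   # placeholder LR
scaled.train()
for batch in DataLoader(pretrain_dataset, batch_size=8):      # placeholder batch size
    # Standard causal-LM objective: labels = inputs, shifted internally.
    out = scaled(input_ids=batch["input_ids"], labels=batch["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```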
Comparison to other up-scaling methods
DUS requires no separate training framework, no additional modules (e.g., gating networks or dynamic expert selection), and no specialized CUDA kernels
It integrates seamlessly and efficiently into existing training and inference frameworks
3. Training Details
Instruction Tuning
QA format + a synthesized math QA dataset
Seed math data comes from the MATH dataset only, to avoid contamination
Using a process similar to MetaMath, rephrase the questions and answers of the seed data → Synth. Math-Instruct (sketch below)
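A minimal sketch of building Synth. Math-Instruct; `rephrase` is a hypothetical helper standing in for whatever LLM performs the MetaMath-style rewriting, since the notes don't specify the model or prompts.

```python
# Hypothetical helper standing in for the rephrasing LLM; the notes don't
# specify which model or prompt performs the MetaMath-style rewriting.
def rephrase(text: str) -> str:
    raise NotImplementedError("stand-in for the rephrasing LLM call")

def build_synth_math_instruct(seed_pairs):
    """seed_pairs: [(question, answer), ...] drawn only from the MATH dataset."""
    return [
        {"question": rephrase(q), "answer": rephrase(a)}
        for q, a in seed_pairs
    ]
```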
Alignment Tuning
The instruction-tuned model is further fine-tuned with DPO to better align with human or strong-AI (e.g., GPT-4) preferences
Open-source data + Synth. Math-Instruct
Speculated that a rephrased answer is better than the original answer
Built DPO tuples as {prompt: rephrased question, chosen: rephrased answer, rejected: original answer} → Synth. Math-Alignment (sketch below)
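For reference, the objective optimized in this stage is the standard DPO loss (the general formulation, not anything paper-specific), where σ is the sigmoid, π_ref the frozen instruction-tuned reference model, and β a temperature:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\Big[\log \sigma\Big(\beta \log \tfrac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \tfrac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\Big)\Big]$$

And a minimal sketch of building the Synth. Math-Alignment tuples, reusing the hypothetical `rephrase` helper from the instruction-tuning sketch; the {prompt, chosen, rejected} keys match what common DPO trainers (e.g., TRL's DPOTrainer) expect.

```python
def build_synth_math_alignment(seed_pairs):
    """DPO tuples: the rephrased answer is assumed preferred over the original."""
    return [
        {
            "prompt": rephrase(q),    # rephrased question
            "chosen": rephrase(a),    # rephrased answer (speculated better)
            "rejected": a,            # original answer
        }
        for q, a in seed_pairs
    ]
```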
4. Result
Training Dataset
Did not always use all of the datasets
Synth. Math-Instruct can be replaced with MetaMathQA
Result
Merged some of the models trained during the instruction and alignment tuning stages
Implemented their own merging method (illustrative averaging sketch below)
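The notes say the authors implemented their own merging method but give no details, so as an illustration only, here is plain uniform weight averaging of two same-architecture checkpoints (a common baseline merge, not the paper's method).

```python
import torch

# Illustration only: uniform weight averaging of two same-architecture
# checkpoints. The paper's own merging method is not described in these notes.
def average_merge(model_a, model_b):
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    with torch.no_grad():
        merged = {
            k: (sd_a[k] + sd_b[k]) / 2 if sd_a[k].is_floating_point() else sd_a[k]
            for k in sd_a
        }
    model_a.load_state_dict(merged)   # overwrite model_a in place
    return model_a
```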
Ablation on Instruction Tuning
Alpaca-GPT4 and OpenOrca make the model behave differently (SFT v1 vs. SFT v2)
Synth. Math-Instruct was helpful (SFT v3, SFT v4)
Merging models that specialize in different tasks is a promising way to obtain a model that performs well generally
Ablation on Alignment Tuning
Adding Synth. Math-Alignment was helpful
Merging is not beneficial here, as DPO v2 is a strict improvement over DPO v1 (no complementary strengths to combine)
Ablation on SFT base models
Performance gaps on certain tasks among the SFT base models do not always carry over to the alignment-tuned models
Ablation on Merge Methods
When merge candidates have sufficiently different strengths, the choice of merge method may not be crucial.