1. Introduction
- Increased scale is one of the main drivers of better performance in DL (NLP, vision, speech, RL, multimodal, etc.)
- Most SOTA neural nets are trained from scratch (random weights) → training cost is high
- Model upcycling: upgrading an existing model with a relatively small additional computational budget
  - focus: upcycling pretrained dense Transformer checkpoints into larger, sparsely activated MoEs
  - less than 40% additional budget across all sizes, for both language and vision
- Valuable in two scenarios
  - You have access to a pretrained Transformer and want to improve it within a limited computational budget
  - You plan to train a large model and don't know whether a dense or MoE model would be more effective → first train the dense model, then upcycle it into an MoE
- The central challenge in model upcycling is the initial performance drop caused by changing a trained network's structure → the paper presents a model surgery recipe to minimize it
2. Background
2.1 Sparsely Activated Mixture of Experts (MoE)
Dense vs Sparse
- Dense model: applies all params to every input
- Sparse model: activates only a subset of params for each input
- MoE models are an accelerator-friendly family of sparse models that allow training models with up to trillions of params
MoE Model
- alternate standard Transformer blocks with MoE blocks
- usually replace the MLPs in a Transformer block with a number of "experts" (also MLPs) with different params, plus a router (a small neural net that decides which expert should be applied to each input)
- There are multiple routing algorithms (Top-K, BASE and Sinkhorn-BASE layers, Hash layers, Expert Choice routing)
Sparsely Gated MoE (Shazeer et al., 2017)
![](https://velog.velcdn.com/images/0404_not_found/post/9e1c69c9-829d-423e-b19a-7aa311a6e0e6/image.png)
- Gating network $G(x) \in \mathbb{R}^n$ and $n$ expert networks $E_1, E_2, \ldots, E_n$
- The output $y$ of the MoE module is $y = \sum_{i=1}^{n} G(x)_i \, E_i(x)$
- $G(x)$ is a sparse vector: it is non-zero only at the indices of the selected experts
- The choice of gating function determines the sparsity; Shazeer et al. use a (noisy) Top-K gate that keeps only the K largest gate values
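A minimal NumPy sketch of this combination (an illustrative toy, not the paper's implementation: a plain softmax-over-top-K gate rather than Shazeer et al.'s noisy gate, with made-up shapes `d`, `n_experts`, `k`):

```python
import numpy as np

def top_k_gate(x, W_g, k):
    """Sparse gate vector G(x): softmax over the top-k logits, zeros elsewhere."""
    logits = x @ W_g                          # (n_experts,)
    top_idx = np.argsort(logits)[-k:]         # indices of the k largest logits
    gates = np.zeros_like(logits)
    gates[top_idx] = np.exp(logits[top_idx]) / np.exp(logits[top_idx]).sum()
    return gates

def moe_forward(x, W_g, experts, k=2):
    """y = sum_i G(x)_i * E_i(x), evaluating only the selected experts."""
    gates = top_k_gate(x, W_g, k)
    y = np.zeros_like(x)
    for i in np.flatnonzero(gates):           # only the k chosen experts contribute
        y += gates[i] * experts[i](x)
    return y

# toy usage: 4 experts, each a tiny 2-layer MLP (ReLU stands in for GeLU)
rng = np.random.default_rng(0)
d, n_experts = 8, 4
W_g = rng.normal(size=(d, n_experts))
def make_expert():
    W1, W2 = rng.normal(size=(d, 2 * d)), rng.normal(size=(2 * d, d))
    return lambda x: np.maximum(x @ W1, 0) @ W2
experts = [make_expert() for _ in range(n_experts)]
print(moe_forward(rng.normal(size=d), W_g, experts).shape)   # (8,)
```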
Expert Choice routing (Zhou et al., 2022)
- $E$: total number of experts
- $n$: total number of tokens
- Router output $R \in \mathbb{R}^{n \times E}$: routing probabilities
- The row $r_i \in \mathbb{R}^E$ corresponds to the $i$-th token and is a distribution over experts (non-negative, sums to 1)
- Every expert $e$ independently chooses the $T$ tokens with the highest probabilities (top-$T$ per column) and processes them
- Parametrize $T$ as $T = C \cdot (n/E)$, where $C$ is a capacity factor controlling the number of tokens per expert (even with $C = 1$, some tokens will be processed by multiple experts while others by none)
- This lets the model parameter count grow with minimal FLOPs overhead (only the router computation)
- Letting $C > 1$ usually leads to higher performance at a higher compute cost
![](https://velog.velcdn.com/images/0404_not_found/post/922b30be-8ec1-4c82-99c0-ce017f784343/image.png)
$$S = \text{Softmax}(X \cdot W_g), \quad S \in \mathbb{R}^{n \times e}, \qquad G, I = \text{TopK}(S^\top, k), \qquad P = \text{Onehot}(I) \in \mathbb{R}^{e \times k \times n}$$
- $G \in \mathbb{R}^{e \times k}$ holds the combine weight of each expert for its selected tokens; $I$ is an index matrix where $I[i,j]$ is the $j$-th token selected by the $i$-th expert
- Then, apply the MoE with this gating function in place of the dense FFN layer
- input: $X_{in} = P \cdot X \in \mathbb{R}^{e \times k \times d}$, where $P$ acts as a permutation (gather) matrix
- $X_{in}[i] \in \mathbb{R}^{k \times d}$ is the input for the $i$-th expert
- output of each expert
$$X_e[i] = \text{GeLU}(X_{in}[i] \cdot W_1[i]) \cdot W_2[i]^\top$$
- Final output: $X_{out}[l, d] = \sum_{i,j} P[i, j, l] \, G[i, j] \, X_e[i, j, d]$, where $l$ indexes the token (batch) dimension and $d$ the model dimension
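A minimal NumPy sketch of the Expert Choice equations above (the scatter-add plays the role of the one-hot matrix $P$; the shapes `n`, `d`, `h`, `e` and the tanh approximation of GeLU are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def expert_choice_moe(X, Wg, W1, W2, c=2.0):
    """X: (n, d) tokens; Wg: (d, e) router; W1: (e, d, h), W2: (e, h, d) expert MLPs."""
    n, d = X.shape
    e = Wg.shape[1]
    k = int(c * n / e)                           # T = C * (n / E) tokens per expert

    S = np.exp(X @ Wg)
    S = S / S.sum(axis=1, keepdims=True)         # S = Softmax(X @ Wg), shape (n, e)

    # each expert picks its top-k tokens (top-k per column of S)
    I = np.argsort(-S.T, axis=1)[:, :k]          # (e, k) token indices
    G = np.take_along_axis(S.T, I, axis=1)       # (e, k) combine weights

    X_out = np.zeros_like(X)
    for i in range(e):
        X_in = X[I[i]]                           # (k, d) tokens routed to expert i
        X_e = gelu(X_in @ W1[i]) @ W2[i]         # expert MLP
        # scatter-add back: X_out[l] += G[i, j] * X_e[j] for each selected token l = I[i, j]
        np.add.at(X_out, I[i], G[i][:, None] * X_e)
    return X_out

# toy usage
rng = np.random.default_rng(0)
n, d, h, e = 16, 8, 32, 4
X = rng.normal(size=(n, d))
Wg = rng.normal(size=(d, e))
W1 = rng.normal(size=(e, d, h)); W2 = rng.normal(size=(e, h, d))
print(expert_choice_moe(X, Wg, W1, W2, c=2.0).shape)   # (16, 8)
```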
2.2 Architectures
- Apply the same sparse upcycling recipe to both language and vision tasks, on T5 and ViT (an encoder-only model)
- ViT: follow V-MoE, but use global average pooling and Expert Choice routing
- T5: use Expert Choice routing for the encoder and Top-K routing with K=2 for the decoder
3. The Upcycling Algorithm
Initialize
![](https://velog.velcdn.com/images/0404_not_found/post/1eaae4d4-c82b-423e-b584-60961c5b8e30/image.png)
- Use the dense model's parameters (checkpoint) to initialize the new sparse model's Transformer blocks (same number and shapes)
- A subset of the MLP layers is expanded into MoE layers
- The remaining layers are copied unchanged to the new model
- Each MoE layer has a fixed number of experts
- Each expert is initialized as a copy of the original MLP (see the sketch after this list)
- After initializing, continue training the model for a number of additional steps (depending on budget and resources)
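A minimal sketch of this initialization step, assuming (purely for illustration) that the dense checkpoint is a dict of NumPy arrays with `mlp`/`attn` entries per layer; the real recipe operates on T5X/ViT checkpoints, but the logic is the same: copy everything, duplicate the chosen MLPs into experts, and add a freshly initialized router.

```python
import copy
import numpy as np

def upcycle_checkpoint(dense_params, num_experts=32, moe_layers=(1, 3, 5, 7, 9, 11)):
    """Turn a dense checkpoint into a sparse (MoE) one.

    dense_params: {layer_idx: {"mlp": {"W1": ..., "W2": ...}, "attn": {...}}}
    moe_layers: which MLPs to expand (here: every other layer of a 12-layer toy model,
    matching the "replace half of the MLP layers" choice; illustrative, not the paper's exact layout).
    """
    sparse_params = copy.deepcopy(dense_params)          # non-MoE weights are reused unchanged
    rng = np.random.default_rng(0)
    for layer in moe_layers:
        mlp = sparse_params[layer].pop("mlp")
        d_model = mlp["W1"].shape[0]
        sparse_params[layer]["moe"] = {
            # each expert starts as an identical copy of the original MLP
            "experts": [copy.deepcopy(mlp) for _ in range(num_experts)],
            # the router is the only newly initialized component
            "router": rng.normal(scale=0.02, size=(d_model, num_experts)),
        }
    return sparse_params
```

Training then simply resumes from this state (for vision, also restoring the optimizer state, as noted below).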
Design Decisions
The performance of upcycled models is heavily influenced by the configuration of the MoE layers
Router Type
- ViT : Expert Choice routing with C=2 (encoder)
- T5 : Expert Choice routing with C=2 (encoder), Top-K routing with K=2 (decoder)
Number of layers to upcycle
- Adding more MoE layers increases model capacity
- replace half of the MLP layers of original model with MoE layers
Number of Experts to add in upcycled layers
- Adding more experts doesn't significantly affect the FLOPs, since the expert capacity is inversely proportional to the number of experts (see the worked example below)
- Too many experts cause a larger initial quality drop in the upcycled model (this can be overcome with sufficient upcycling compute)
- 32 experts worked well
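A quick worked example of this trade-off (illustrative numbers, not from the paper): with $n = 1024$ tokens per batch and capacity factor $C = 2$, using $E = 32$ experts gives $T = 2 \cdot 1024 / 32 = 64$ tokens per expert, while $E = 128$ experts gives only $T = 16$. The total expert-token work $E \cdot T = C \cdot n = 2048$ is unchanged, so only the parameter count and the small router matmul grow with the number of experts.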
Expert capacity
- A larger expert capacity generally yields higher quality but increases the FLOPs
- C=2 worked well
Resuming Optimizer State (Vision)
- Reusing the optimizer state gives a performance boost for vision models (but not for language)
Normalize weights after routing (Vision)
- To reduce the performance drop from the upcycling model surgery, normalize each token's router combine weights so they sum to 1 (a short sketch follows below)
- In the original dense model, each token was processed by exactly one "expert" (the single MLP), with an implicit weight of 1
- For vision this was helpful, but it hurt performance in the language case (the hypothesis is that this is because the T5 decoder uses Top-K routing)
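A minimal sketch of this renormalization, reusing the $(G, I)$ combine weights and token indices from the Expert Choice sketch in Section 2.1 (the scatter-based accumulation and `eps` are illustrative choices):

```python
import numpy as np

def normalize_combine_weights(G, I, n_tokens, eps=1e-9):
    """Rescale Expert Choice combine weights so each token's weights sum to 1.

    G: (e, k) combine weights, I: (e, k) token indices chosen by each expert.
    Tokens selected by no expert keep zero weight (they pass through the residual only).
    """
    per_token_sum = np.zeros(n_tokens)
    np.add.at(per_token_sum, I.ravel(), G.ravel())        # total weight received by each token
    scale = 1.0 / (per_token_sum + eps)
    return G * scale[I]                                    # divide each weight by its token's total

# toy usage with the shapes from the routing sketch above
e, k, n = 4, 8, 16
rng = np.random.default_rng(0)
G = rng.uniform(size=(e, k))
I = rng.integers(0, n, size=(e, k))
G_norm = normalize_combine_weights(G, I, n)
sums = np.zeros(n); np.add.at(sums, I.ravel(), G_norm.ravel())
print(np.allclose(sums[sums > 1e-6], 1.0))                 # True: each routed token's weights sum to 1
```

This makes the upcycled forward pass match the dense one at initialization, since each token again receives a total expert weight of 1.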
4. Experiments
4.1 Experimental Setup
- Vision: V-MoE setup; evaluate with ImageNet 10-shot accuracy, averaged over 5 different training sets
- Language: span corruption on English C4 (pretraining), then a proportional mix of all SuperGLUE tasks (fine-tuning); dense starting checkpoints are a Base baseline and the T5 1.1 checkpoints (L, XL)
4.2 Results
4.2.1 Core Result
Pretraining
- With only a small amount of extra training, performance already recovers to roughly that of the original checkpoint
![](https://velog.velcdn.com/images/0404_not_found/post/d233a7eb-4c47-4bb9-a590-a7b824a5abb9/image.png)
Full Fine-Tune
- Still, the upcycled models' scores improve faster
- For language, the difference is larger
![](https://velog.velcdn.com/images/0404_not_found/post/e5a254d9-649d-4136-b55a-6e0344d30695/image.png)
Sparse upcycling vs Sparse models from scratch
- Training from scratch takes longer to catch up with the upcycled models
- For language, it took about 120% of the original dense checkpoint's computation to catch up with the upcycled models
- Training from scratch can use a larger learning rate, and its experts can develop and diversify from the beginning
- Given a large computation budget (> 100% of the original dense training), training the MoE from scratch may be preferable
![](https://velog.velcdn.com/images/0404_not_found/post/12f399fb-9f35-4806-93ac-bcfd85ad1ac9/image.png)
Sparse upcycling vs Warm starting
- Dense upcycling (depth tiling) warm-starts a larger dense model by replicating layers from the dense Base checkpoint to construct the new layers
![](https://velog.velcdn.com/images/0404_not_found/post/9ea5d33f-bd77-4652-b51c-5a07488140c5/image.png)
4.2.2 Ablations
- Vision: B/16 sparse model with 32 experts, C=1, 6 MoE layers in the last blocks; dense checkpoint trained for 14 epochs + 7 additional upcycling epochs
- Language: Base with 32 experts, C=2, 6 MoE layers interspersed, 0.5M–1M additional steps
Amount of dense pretraining
- Regardless of the amount of dense pretraining, the upcycled models showed higher performance
![](https://velog.velcdn.com/images/0404_not_found/post/aeda3c38-56e3-41b1-8ecd-5a79f725ba74/image.png)
Router type
- For vision, Top-K routing with Batch Prioritized Routing matches Expert Choice routing on a per-step basis, but is slightly slower
- So Top-K underperforms Expert Choice routing on a per-time basis
![](https://velog.velcdn.com/images/0404_not_found/post/5ed51719-4088-4bc4-bd6d-82b3e34daef4/image.png)
Expert Capacity Factor
- The more tokens each expert processes, the greater the computation and, generally, the quality
- C=2 was best
![](https://velog.velcdn.com/images/0404_not_found/post/3d60620f-4891-4ed7-b1ef-7c4e3523e911/image.png)
Number of MoE layers
- More MoE layers are not always better, even on a per-step basis
![](https://velog.velcdn.com/images/0404_not_found/post/05959dd4-d2f0-4789-940d-5c53bfd78166/image.png)
![](https://velog.velcdn.com/images/0404_not_found/post/65ae8773-bdf7-4aaa-aa8d-c21cac29afbd/image.png)
Initialization of Experts
- Copying the original MLP layer ≫ training experts from scratch
- Adding small Gaussian noise to each copied MLP layer didn't help (a small amount has no effect, a large amount hurts performance)
![](https://velog.velcdn.com/images/0404_not_found/post/01eb4339-699e-4353-a6a5-bedf57c9cd42/image.png)
Number of Experts
- Adding more experts increases the model parameter count and quality
- Using a very large number of experts causes a larger initial quality drop (Fig. 10, left two panels)
5. Conclusion
- Provided a simple recipe for reusing pretrained dense checkpoints to initialize more powerful sparse models
- Enables a smooth transition from dense to MoE
- Applicable to both vision and language model upcycling
It was a different kind of MoE than I had imagined. I got to see the idea of using a router to choose experts, as well as the fresh idea of Expert Choice routing.