Sparse Upcycling: Training MoE from Dense Checkpoints

임재석 · January 16, 2024

1. Introduction

  • Increased scale is one of the main drivers of better performance in DL (NLP, Vision, Speech, RL, Multimodal, etc.)

  • Most SOTA neural nets are trained from scratch (random weights) → the cost of training is high

  • Model Upcycling: upgrading an existing model with a relatively small additional computational budget

    • focuses on upcycling dense models into larger, sparsely activated MoEs, starting from a pretrained dense Transformer checkpoint
    • less than 40% additional budget across all model sizes, for both language and vision
  • Valuable in two scenarios

    • Have access to a pretrained Transformer and want to improve it within a computational budget
    • Plan to train a large model but don't know whether a dense model or an MoE would be more effective → first train the dense model, then upcycle it into an MoE
  • The central challenge in model upcycling is the initial performance decrease entailed by changing a trained network's structure → the paper presents a model surgery recipe

2. Background

2.1 Sparsely Activated Mixture of Experts (MoE)

Dense vs Sparse

  • Dense model : applies all params to every input
  • Sparse model : activates a subset of params for each input
  • MoE models are an accelerator-friendly family of sparse models that allow training of models with up to trillions of params

MoE Model

  • alternate standard Transformer blocks with MoE blocks
  • usually replace the MLPs in a Transformer block with a number of 'experts' (also MLPs) with different params, plus a router (a small neural net that decides which expert should be applied)
  • There are multiple routing algorithms (Top-K, BASE and Sinkhorn-BASE layers, Hash layers, Expert Choice routing)

Sparsely Gated MoE (Shazeer et al., 2017)

  • Gating network $G(x) \in \mathbb{R}^n$ and $n$ expert networks $E_1, E_2, \dots, E_n$
  • the output $y$ of the MoE module is $y = \sum_{i=1}^{n} G(x)_i E_i(x)$
  • $G(x)$ is a sparse vector, with non-zero entries only at the indices of the selected experts
  • The choice of gating function
    • Softmax gating : $G_{\sigma}(x) = \text{Softmax}(x \cdot W_g)$ where $W_g$ is a trainable weight matrix

    • Noisy Top-K gating

      $$
      \begin{aligned}
      G(x) &= \text{Softmax}(\text{KeepTopK}(H(x), k)) \\
      H(x)_i &= (x \cdot W_g)_i + \text{StandardNormal}() \cdot \text{Softplus}((x \cdot W_{noise})_i) \\
      \text{KeepTopK}(v, k)_i &= \begin{cases} v_i & \text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty & \text{otherwise} \end{cases} \\
      \text{Softplus}(x) &= \ln(1 + e^x)
      \end{aligned}
      $$
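As a concrete illustration, here is a minimal NumPy sketch of the noisy top-k gate above for a single token. The dimensions, number of experts, and `k` value are made up for the example and are not from the paper.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def noisy_top_k_gate(x, W_g, W_noise, k, rng):
    """Noisy Top-K gating (Shazeer et al., 2017) for a single token x."""
    clean_logits = x @ W_g                                   # (num_experts,)
    noise_std = softplus(x @ W_noise)                        # learned, input-dependent noise scale
    noisy_logits = clean_logits + rng.standard_normal(clean_logits.shape) * noise_std
    # KeepTopK: keep the top-k logits and set the rest to -inf so softmax zeroes them out
    masked = np.full_like(noisy_logits, -np.inf)
    topk_idx = np.argsort(noisy_logits)[-k:]
    masked[topk_idx] = noisy_logits[topk_idx]
    shifted = masked - masked.max()
    return np.exp(shifted) / np.exp(shifted).sum()           # sparse gate vector G(x)

# toy example: d_model = 8, 4 experts, route each token to k = 2 experts
rng = np.random.default_rng(0)
d, num_experts, k = 8, 4, 2
x = rng.standard_normal(d)
W_g = rng.standard_normal((d, num_experts))
W_noise = rng.standard_normal((d, num_experts))
gates = noisy_top_k_gate(x, W_g, W_noise, k, rng)            # only 2 entries are non-zero
print(gates)
```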

Expert Choice routing (Zhou et al., 2022)

  • $E$ : total number of experts
  • $n$ : total number of tokens
  • Router output $\mathbf{R} \in \mathbb{R}^{n \times E}$ : routing probabilities
  • the row $r_i \in \mathbb{R}^{E}$ corresponds to the $i$-th token's distribution over experts (non-negative and sums to 1)
  • Every expert $e$ independently chooses the $T$ tokens with the highest probabilities (top-$T$ per column) and processes them
  • parametrize $T$ as $T = C \cdot (n/E)$, where $C$ is a capacity factor controlling the number of tokens per expert (with $C = 1$, some tokens will be processed by multiple experts while others by none)
  • This increases the model parameter count with minimal FLOPs overhead (only the router computation)
  • Letting $C > 1$ usually leads to higher performance at a higher compute cost
  • $S = \text{Softmax}(X \cdot W_g), \quad S \in \mathbb{R}^{n \times e}, \qquad G,\ I = \text{TopK}(S^T, k), \quad P = \text{Onehot}(I) \in \mathbb{R}^{e \times k \times n}$
  • $G \in \mathbb{R}^{e \times k}$ holds the combine weight of each expert for its selected tokens; $I$ is an index matrix where $I[i, j]$ is the $j$-th token selected by the $i$-th expert
  • Then, apply the experts and gating function in place of the dense FFN layer (see the sketch after this list)
    • input : $X_{in} = P \cdot X \in \mathbb{R}^{e \times k \times d}$, where $P$ acts as a one-hot gather (permutation) matrix
    • $X_{in}[i] \in \mathbb{R}^{k \times d}$ is the input to the $i$-th expert
    • output of each expert: $X_e[i] = \text{GeLU}(X_{in}[i] \cdot W_1[i]) \cdot W_2[i]^T$
    • Final output: $X_{\text{out}}[l, d] = \sum_{i,j} P[i, j, l]\, G[i, j]\, X_e[i, j, d]$, where $l$ indexes tokens in the batch and $d$ is the model dimension
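The routing and expert computation above can be sketched in a few lines of NumPy. This is a minimal illustration with toy shapes that gathers tokens by explicit indexing instead of the one-hot matrix $P$; none of the names or sizes come from the paper's implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def expert_choice_moe(X, W_g, W1, W2, capacity_factor=2.0):
    """Expert Choice MoE layer (Zhou et al., 2022): each expert picks its own top-T tokens."""
    n, d = X.shape
    e = W_g.shape[1]
    T = int(capacity_factor * n / e)             # tokens per expert: T = C * (n / E)

    logits = X @ W_g                             # (n, e)
    S = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)   # row-wise softmax
    I = np.argsort(S.T, axis=1)[:, -T:]          # (e, T): indices of each expert's top-T tokens
    G = np.take_along_axis(S.T, I, axis=1)       # (e, T): combine weights for those tokens

    X_out = np.zeros_like(X)
    for i in range(e):
        X_in = X[I[i]]                           # (T, d): tokens gathered for expert i
        X_e = gelu(X_in @ W1[i]) @ W2[i].T       # expert MLP: GeLU(X_in W1[i]) W2[i]^T
        # scatter-add the gate-weighted expert outputs back to the chosen token positions
        np.add.at(X_out, I[i], G[i][:, None] * X_e)
    return X_out

# toy shapes: 16 tokens, d_model = 8, d_ff = 32, 4 experts, C = 2
rng = np.random.default_rng(0)
n, d, d_ff, e = 16, 8, 32, 4
X = rng.standard_normal((n, d))
W_g = rng.standard_normal((d, e))
W1 = rng.standard_normal((e, d, d_ff))
W2 = rng.standard_normal((e, d, d_ff))           # applied as W2[i]^T, matching the formula above
Y = expert_choice_moe(X, W_g, W1, W2, capacity_factor=2.0)
print(Y.shape)                                   # (16, 8)
```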

2.2 Architectures

  • Apply the same sparse upcycling recipe to both language and vision tasks, on T5 and ViT (encoder)
  • ViT : follows V-MoE, but uses global average pooling and Expert Choice routing
  • T5 : uses Expert Choice routing for the encoder and Top-K routing with $K = 2$ for the decoder

3. The Upcycling Algorithm

Initialize

  • Use the dense model's parameters (checkpoint) to initialize a new Transformer with blocks of the same number and shape
  • A subset of the MLP layers is expanded into MoE layers
  • the remaining layers are copied to the new model
  • each MoE layer has a fixed number of experts
  • each expert is initialized as a copy of the original MLP
  • After initializing, continue training the model for a number of additional steps, chosen according to budget and resources (see the sketch below)
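A minimal sketch of this surgery step, assuming a toy checkpoint stored as a dict of NumPy arrays. The parameter layout, the `moe_every` choice, and the router initialization are illustrative assumptions, not the paper's actual T5/ViT code.

```python
import copy
import numpy as np

def upcycle_checkpoint(dense_params, num_experts=32, moe_every=2, rng=None):
    """Turn a dense checkpoint into an MoE checkpoint by copying MLP weights into experts.

    dense_params: {layer_idx: {"attn": ..., "mlp": {"W1": ..., "W2": ...}}} (toy layout).
    Every `moe_every`-th MLP is expanded into an MoE block whose experts all start as
    copies of the original MLP; all other parameters are copied unchanged.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    upcycled = {}
    for layer_idx, block in dense_params.items():
        new_block = copy.deepcopy(block)             # attention etc. copied as-is
        if (layer_idx + 1) % moe_every == 0:         # e.g. replace every other MLP layer
            mlp = new_block.pop("mlp")
            new_block["moe"] = {
                # each expert starts as an identical copy of the dense MLP
                "experts": [copy.deepcopy(mlp) for _ in range(num_experts)],
                # the router is new, so it is initialized from scratch (assumed init scale)
                "router": rng.normal(scale=0.02, size=(mlp["W1"].shape[0], num_experts)),
            }
        upcycled[layer_idx] = new_block
    return upcycled

# toy dense checkpoint: 4 Transformer blocks, d_model = 8, d_ff = 32
rng = np.random.default_rng(0)
d, d_ff = 8, 32
dense = {i: {"attn": rng.standard_normal((d, d)),
             "mlp": {"W1": rng.standard_normal((d, d_ff)),
                     "W2": rng.standard_normal((d_ff, d))}}
         for i in range(4)}
sparse = upcycle_checkpoint(dense, num_experts=32, moe_every=2, rng=rng)
```

After this initialization, training simply continues on the upcycled parameters for the additional step budget.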

Design Decisions

The performance of upcycled models is heavily influenced by the configuration of the MoE layers

Router Type

  • ViT : Expert Choice routing with $C = 2$ (encoder)
  • T5 : Expert Choice routing with $C = 2$ in the encoder, Top-K routing with $K = 2$ in the decoder

Number of layers to upcycle

  • Adding more MoE layers increases the model capacity
  • the recipe replaces half of the MLP layers of the original model with MoE layers

Number of Experts to add in upcycled layers

  • Adding more experts doesn't significantly affect the FLOPs (the expert capacity is inversely proportional to the number of experts)
  • Too many experts cause a larger initial quality drop in the upcycled model (this can be overcome with sufficient upcycling compute)
  • 32 experts worked well

Expert capacity

  • A larger expert capacity generally yields higher quality but increases the FLOPs
  • $C = 2$ worked well (worked example below)
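As an illustrative worked example (the token count is assumed here, not taken from the paper): with $E = 32$ experts, a batch of $n = 4096$ tokens, and capacity factor $C = 2$, each expert processes

$$T = C \cdot \frac{n}{E} = 2 \cdot \frac{4096}{32} = 256$$

tokens, roughly doubling the per-expert FLOPs compared to $C = 1$.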

Resuming Optimizer State (Vision)

  • reusing the optimizer state gives a performance boost for vision models, but not for language models

Normalize weights after routing (Vision)

  • To reduce the performance drop caused by the upcycling model surgery, normalize each token's router combine weights to sum to 1 (see the sketch below)
    • in the original dense model, each token was previously processed by only a single expert (the one MLP)
    • for vision this was helpful, but it hurt performance in the language case (hypothesized to be related to the T5 decoder using Top-K routing)
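A minimal sketch of this normalization, assuming the router's combine weights are materialized as a dense (tokens × experts) array with zeros for unselected experts; that representation is an assumption made for illustration.

```python
import numpy as np

def normalize_combine_weights(combine, eps=1e-9):
    """Rescale each token's router combine weights to sum to 1.

    combine: (num_tokens, num_experts) array, zero where a token was not selected
    by an expert. After normalization, the combined expert output is a convex
    combination, mimicking the dense model where each token was processed by
    exactly one MLP with weight 1.
    """
    totals = combine.sum(axis=-1, keepdims=True)
    return combine / (totals + eps)      # tokens picked by no expert stay all-zero

# toy example: token 0 picked by two experts, token 1 by one, token 2 by none
combine = np.array([[0.3, 0.2, 0.0],
                    [0.0, 0.7, 0.0],
                    [0.0, 0.0, 0.0]])
print(normalize_combine_weights(combine))
# rows now sum to 1 (or stay zero): ~[[0.6, 0.4, 0.], [0., 1., 0.], [0., 0., 0.]]
```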

4. Experiments

4.1 Experimental Setup

  • Vision : V-MoE setup, ImageNet 10-shot evaluation, accuracy averaged over 5 different training sets
  • Language : span corruption task on English C4 (pretraining), a proportional mix of all SuperGLUE tasks (fine-tuning); the dense baseline starting checkpoint is trained for Base, while T5 1.1 checkpoints are used for L and XL

4.2 Results

4.2.1 Core Result

Pretraining

  • With only a small amount of extra training, the upcycled models nearly recover the performance of their original checkpoints

Full Fine-Tune

  • Still, the upcycled models show faster score growth than the dense baselines
  • For language, the difference is larger

Sparse upcycling vs Sparse models from scratch

  • training from scratch takes longer to catch up with the upcycled models
  • For language, the from-scratch MoE needed about 120% of the original dense checkpoint's computation to catch up with the upcycled models
  • From-scratch training can use a larger learning rate, and its experts can develop and diversify from the beginning
  • Given a large computation budget (well over 100% of the original dense model's), training the MoE from scratch may be preferable

Sparse upcycling vs Warm starting

  • Dense upcycling (depth tiling) replicates layers from the dense Base checkpoint to construct new layers for a larger dense model

4.2.2 Ablations

  • Vision : B/16 sparse model with 32 experts, $C = 1$, 6 MoE layers in the last few blocks, dense checkpoint trained for 14 epochs + 7 additional epochs
  • Language : Base model with 32 experts, $C = 2$, 6 interspersed MoE layers, 0.5M ~ 1M additional steps

Amount of dense pretraining

  • Regardless of the amount of dense pretraining, the upcycled models showed higher performance

Router type

  • For vision, Top-K routing with Batch Prioritized Routing matches the performance of Expert Choice routing on a per-step basis, but is slightly slower
  • Top-K underperforms Expert Choice routing on a per-time basis

Expert Capacity Factor

  • The more tokens processed per expert, the greater the computation and the better the performance
  • $C = 2$ was best

Number of MoE layers

  • Adding more MoE layers is not always better, even on a per-step basis

Initialization of Experts

  • copying the MLP layer >> training the experts from scratch
  • adding small Gaussian noise to each copied MLP layer didn't help (a small amount had no effect, a large amount hurt performance)

Number of Experts

  • Adding more experts increases the model parameter count and quality
  • Using a very large number of experts leads to a large initial quality drop (Fig. 10, left two panels)

5. Conclusion

  • Provided a simple recipe to reuse pretrained dense checkpoints to initialize more powerful sparse models
  • Smooth transition from dense to MoE
  • Applicable to both vision and language
  • Demonstrated model upcycling as a cheaper alternative to training an MoE from scratch

6. Comment

The MoE here was different from what I had expected. It was interesting to see both the idea of using a router to select experts and the novel idea of Expert Choice routing.
