[NLP] MobileBERT Summary

double-oh · June 2, 2021

This summary avoids equations where possible and focuses on the model's concepts and characteristics.

What is MobileBERT?

Empirically measured performance
Like the original BERT, MobileBERT is a pre-trained model that is fine-tuned to perform a variety of downstream tasks
4.3x smaller, 5.5x faster than BERT base
77.7 score on GLUE (General Language Understanding Evaluation)
- 0.6 lower than BERT base
- GLUE = a benchmark score over 9 NLU tasks
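Runs inference on a Pixel 4 phone with 62 ms latency
Outperforms BERT base on SQuAD (dev F1 of 90.0 on v1.1 and 79.2 on v2.0)
Achieves the best performance among the compressed BERT models it is compared against (as claimed in the paper)
With 8-bit quantization, the model shrinks by a further 4x with almost no loss in performance.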
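
To make the 8-bit quantization point concrete, here is a minimal NumPy sketch of generic affine quantization of one weight matrix. It is not the post-training quantization pipeline used in the paper; the function names, the matrix shape, and the random weights are all assumptions for illustration.

```python
import numpy as np

def quantize_int8(w):
    """Affine (asymmetric) quantization of a float32 array to int8."""
    w_min, w_max = float(w.min()), float(w.max())
    # One int8 step covers `scale` in float units; fall back to 1.0 for constant arrays.
    scale = (w_max - w_min) / 255.0 or 1.0
    zero_point = int(round(-128 - w_min / scale))
    q = np.clip(np.round(w / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

# Hypothetical weight matrix (shape and values are made up for the demo).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(512, 128)).astype(np.float32)

q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

size_ratio = weights.nbytes / q.nbytes               # float32 (4 bytes) vs int8 (1 byte)
max_error = float(np.abs(weights - restored).max())  # bounded by roughly one quantization step
print(size_ratio, max_error, scale)
```

The storage ratio comes purely from going from 4-byte floats to 1-byte integers, and the reconstruction error stays within about one quantization step.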

The model keeps the same depth as BERT large while reducing the width to make it lightweight (a deep but thin architecture)
Layer normalization and GELU, which account for a substantial share of the latency, are replaced
- Layer normalization is replaced with an element-wise linear transformation (the paper calls this linear transformation NoNorm)
- GELU is replaced with ReLU
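
The two replacements can be sketched side by side in NumPy. This is an illustrative sketch, not code from the paper: `layer_norm`, `no_norm`, `gelu`, and `relu` are names chosen here, and `gelu` uses the common tanh approximation.

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-12):
    """Standard layer normalization: needs per-token mean and variance."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mean) / np.sqrt(var + eps) + beta

def no_norm(h, gamma, beta):
    """NoNorm: element-wise scale and shift only, no statistics to compute."""
    return gamma * h + beta

def gelu(x):
    """GELU via the tanh approximation (an assumption; exact GELU uses erf)."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def relu(x):
    """ReLU: a single element-wise max."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 128))       # 4 tokens, hidden size 128
gamma, beta = np.ones(128), np.zeros(128)

ln_out = layer_norm(h, gamma, beta)
nn_out = no_norm(h, gamma, beta)    # identical to h when gamma=1, beta=0
```

NoNorm skips the mean and variance reduction over the hidden dimension, and ReLU skips the `tanh` evaluation, which is where the latency savings described above would come from.
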
The embedding dimension is reduced to 128, and a 1D convolution with kernel size 3 is applied to produce a 512-dimensional embedding-layer output
- The paper calls this technique Embedding Factorization
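
A rough NumPy sketch of this embedding path follows. The zero padding at the sequence ends, the vocabulary size of 30522 used for the parameter count, and the small demo table are assumptions made here; position and segment embeddings are left out.

```python
import numpy as np

EMB, HIDDEN, KERNEL = 128, 512, 3

def factorized_embedding(token_ids, table, conv_w, conv_b):
    """table: (vocab, EMB); conv_w: (KERNEL, EMB, HIDDEN); conv_b: (HIDDEN,)."""
    x = table[token_ids]                       # (seq, EMB) raw 128-d token embeddings
    kernel = conv_w.shape[0]
    pad = kernel // 2
    x = np.pad(x, ((pad, pad), (0, 0)))        # zero-pad both ends (assumed "same" padding)
    seq = len(token_ids)
    out = np.tile(conv_b, (seq, 1)).astype(np.float64)
    for k in range(kernel):                    # 1D convolution as a sum of shifted matmuls
        out += x[k:k + seq] @ conv_w[k]
    return out                                 # (seq, HIDDEN)

rng = np.random.default_rng(0)
demo_vocab = 1000                              # small table just for the demo
table = rng.normal(size=(demo_vocab, EMB))
conv_w = rng.normal(size=(KERNEL, EMB, HIDDEN)) * 0.02
conv_b = np.zeros(HIDDEN)

token_ids = np.array([1, 5, 7, 2, 9])
out = factorized_embedding(token_ids, table, conv_w, conv_b)

# Parameter comparison, assuming a 30522-token vocabulary.
vocab = 30522
direct_params = vocab * HIDDEN
factorized_params = vocab * EMB + KERNEL * EMB * HIDDEN + HIDDEN
print(out.shape, direct_params, factorized_params)
```

Most of the savings come from the vocabulary table shrinking from 512 to 128 columns; the convolution adds comparatively few weights.
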
Bottleneck structure introduced
- The hidden dimension used in [MHA + FFN] is only 128, so linear layers are added to expand it to 512
Inverted-bottleneck structure introduced
- Two linear layers are added so that the input and output are 512-dimensional, which makes knowledge transfer to MobileBERT possible
- The 1:2 ratio between the number of parameters in MHA (Multi-Head Attention) and FFN (Feed Forward Network) is maintained by using stacked feed forward networks
  - stacked feed forward networks: stacking several feed forward networks in a row
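
The parameter-balance argument can be checked with a small back-of-the-envelope count. This is a sketch under stated assumptions, not the paper's exact accounting: biases are ignored, Q/K/V are assumed to read the 512-d inter-block input and project to 128, the FFN intermediate size is assumed to be 512, and 4 stacked FFNs is an assumed setting.

```python
def mha_weights(d_in, d_attn):
    """Q, K, V projections (d_in -> d_attn) plus the output projection (d_attn -> d_attn)."""
    return 3 * d_in * d_attn + d_attn * d_attn

def ffn_weights(d_model, d_ff, n_stacked=1):
    """Two linear layers (d_model -> d_ff -> d_model), repeated n_stacked times."""
    return n_stacked * 2 * d_model * d_ff

# Plain BERT base block: hidden 768, FFN intermediate 3072.
bert_ratio = ffn_weights(768, 3072) / mha_weights(768, 768)

# Bottleneck block under the assumptions above.
single_ffn_ratio = ffn_weights(128, 512, n_stacked=1) / mha_weights(512, 128)
stacked_ffn_ratio = ffn_weights(128, 512, n_stacked=4) / mha_weights(512, 128)

print(bert_ratio, round(single_ffn_ratio, 2), round(stacked_ffn_ratio, 2))
```

Under these assumptions a single narrow FFN has fewer weights than the attention module (FFN:MHA of about 0.6), while stacking four brings the ratio to about 2.5, back in the neighborhood of BERT's 1:2.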

The teacher is IB-BERT large, a BERT large model augmented with the inverted-bottleneck structure
The student is MobileBERT
Knowledge from IB-BERT is distilled into MobileBERT, a thin and deep network
Reducing the inter-block hidden size down to 512 does not hurt performance, so the inter-block hidden size is fixed at 512
Reducing the intra-block hidden size causes performance to drop dramatically, so it is left unchanged.
MobileBERT tiny feeds MHA with values that have passed through a linear layer, reducing the dimension further and cutting the parameter count

Knowledge distillation is performed after the teacher model has been trained
Knowledge transfer targets the feature maps and attention values of each layer
- feature map transfer: trained to minimize the difference between the teacher's output feature map and the student's output feature map
  - the difference is measured with mean squared error
- attention transfer: trained to minimize the difference between the teacher's self-attention values and the student's self-attention values
  - the difference is measured with KL-divergence
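
The two layer-wise losses can be sketched in NumPy as follows. The tensor shapes, the head count of 4, and the averaging over heads and positions are assumptions made for the example; the point is only that both losses need the teacher and student to share feature-map size and attention shape.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def feature_map_transfer_loss(h_teacher, h_student):
    """Mean squared error between teacher and student feature maps of one layer."""
    return float(np.mean((h_teacher - h_student) ** 2))

def attention_transfer_loss(a_teacher, a_student, eps=1e-12):
    """KL(teacher || student) per attention row, averaged over heads and query positions.

    a_teacher, a_student: (heads, seq, seq) with rows summing to 1.
    """
    kl = np.sum(a_teacher * (np.log(a_teacher + eps) - np.log(a_student + eps)), axis=-1)
    return float(np.mean(kl))

rng = np.random.default_rng(0)
seq, hidden, heads = 8, 512, 4                 # hypothetical sizes for the demo

h_teacher = rng.normal(size=(seq, hidden))
h_student = h_teacher + 0.1 * rng.normal(size=(seq, hidden))
a_teacher = softmax(rng.normal(size=(heads, seq, seq)))
a_student = softmax(rng.normal(size=(heads, seq, seq)))

fmt = feature_map_transfer_loss(h_teacher, h_student)
at = attention_transfer_loss(a_teacher, a_student)
print(fmt, at)
```

Because both models expose 512-dimensional feature maps between blocks, the MSE can be computed directly without an extra projection.
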
After the layer-wise knowledge transfer, knowledge distillation is performed on the whole model (the paper calls this pre-training distillation)
- As in DistilBERT, knowledge distillation is applied only to the MLM (Masked Language Model) loss
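
One way to picture the MLM-only distillation is as a soft-target cross-entropy on the masked positions, mixed with the usual MLM loss. The sketch below assumes that form; the mixing weight `alpha`, the function names, and the tiny vocabulary are hypothetical, and any other pre-training loss terms are left out.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def mlm_loss(student_logits, labels):
    """Cross-entropy against the true masked tokens. logits: (n_masked, vocab)."""
    logp = log_softmax(student_logits)
    return float(-np.mean(logp[np.arange(len(labels)), labels]))

def mlm_kd_loss(student_logits, teacher_logits):
    """Soft-target cross-entropy against the teacher's predicted distribution."""
    p_teacher = np.exp(log_softmax(teacher_logits))
    return float(-np.mean(np.sum(p_teacher * log_softmax(student_logits), axis=-1)))

def pretraining_distillation_loss(student_logits, teacher_logits, labels, alpha=0.5):
    """Linear mix of the hard-label MLM loss and the MLM distillation loss (alpha is hypothetical)."""
    hard = mlm_loss(student_logits, labels)
    soft = mlm_kd_loss(student_logits, teacher_logits)
    return alpha * hard + (1.0 - alpha) * soft

rng = np.random.default_rng(0)
n_masked, vocab = 6, 50                        # toy sizes for the demo
teacher_logits = rng.normal(size=(n_masked, vocab))
student_logits = rng.normal(size=(n_masked, vocab))
labels = rng.integers(0, vocab, size=n_masked)

loss = pretraining_distillation_loss(student_logits, teacher_logits, labels)
print(loss)
```

Only the MLM head's output distribution is matched to the teacher here, which is the "MLM loss only" restriction noted above.
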
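Three training strategies were tried, and progressive knowledge transfer turned out to be the most effective for training MobileBERT
1. Auxiliary Knowledge Transfer
- Uses a single loss that linearly combines the knowledge transfer losses of all layers with the pre-training distillation loss
2. Joint Knowledge Transfer
- Separates the two loss terms: layer-wise knowledge is transferred first, then pre-training distillation is performed
3. Progressive Knowledge Transfer
- Trains the layers progressively, one at a time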
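
The difference between the three strategies is easiest to see as training schedules. The sketch below only lists which losses are active in each stage; the dictionary layout and function names are made up here, and the final distillation stage in the progressive schedule is assumed from the note that whole-model distillation follows the layer-wise transfer. Details such as step counts per stage or how already-trained layers are treated are not modeled.

```python
def auxiliary_schedule(num_layers):
    """One stage: every layer's transfer loss plus the distillation loss in a single objective."""
    return [{"kt_layers": list(range(1, num_layers + 1)), "distill": True}]

def joint_schedule(num_layers):
    """Two stages: all layer-wise transfer losses together, then pre-training distillation."""
    return [
        {"kt_layers": list(range(1, num_layers + 1)), "distill": False},
        {"kt_layers": [], "distill": True},
    ]

def progressive_schedule(num_layers):
    """One stage per layer, bottom to top, followed by pre-training distillation (assumed)."""
    stages = [{"kt_layers": [layer], "distill": False} for layer in range(1, num_layers + 1)]
    stages.append({"kt_layers": [], "distill": True})
    return stages

# MobileBERT keeps BERT large's depth, so 24 layers.
for name, schedule in [
    ("auxiliary", auxiliary_schedule(24)),
    ("joint", joint_schedule(24)),
    ("progressive", progressive_schedule(24)),
]:
    print(name, len(schedule), "stage(s)")
```
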
Training data: BookCorpus and English Wikipedia
Training setup: 256 TPU v3 chips for 500k steps with batch size of 4096 and LAMB optimizer