ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations

tacorico · October 1, 2021

TL;DR

BERT? It's too big. Decouple the vocab embedding size from the hidden size. (Factorization)
NSP is too easy, which makes it inefficient -> propose SOP (Sentence Order Prediction)
Shrink the model further with Cross-layer Parameter Sharing.

Abstract

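Recently, pretrained models have been getting larger in order to improve performance.
In general, a larger Model Size gives better Downstream Task Performance.
-> This is also the recent direction of PLM research (GPT-3, T5, HyperClova, etc.)
However, models this large run into the following problems:

Memory Limitation (TPU/GPU) -> an obvious consequence of the larger size
Training Time -> an obvious consequence of the larger size
Model Degradation -> simply scaling up the model lowers performance (e.g., performance drops going from BERT-large to BERT-xlarge)

This paper therefore introduces: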

Parameter reduction techniques(factorization, cross-layer parameter sharing)
Self-supervised loss for sentence-order prediction(SOP)

The first is an effort to reduce the number of parameters while maintaining model performance; the second fixes the shortcomings of NSP, which the original BERT uses as a pretraining task. By addressing several of BERT's problems, ALBERT achieves SOTA on GLUE, RACE, and SQuAD while having fewer parameters than BERT-large.

Brief Comparison

	BERT-large	ALBERT-large
Relative parameter count	18x	1x
Training speed	1x	1.7x
Parameters	334M	18M

Factorized Embedding Parameterization

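WordPiece Embedding (E) -> context-independent -> carries little information
Hidden-layer Embedding (H) -> context-dependent -> carries a lot of information

-> The WordPiece embedding is far too large for the amount of information it carries! (Think about vocab sizes at the syllable or word level.) On top of that, it is only sparsely updated during training.
-> Factorize through an embedding size E with E << H!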

BERT: Embedding Size(E) == Hidden Size(H)
ALBERT: E << H

$O(V \times H) \rightarrow O(V \times E + E \times H)$

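Decoupling the vocab size from the hidden-layer dimension this way makes it possible to grow the hidden size, which used to be hard to increase because of the large vocab size (8000, 16000, 32000, ...). Even without growing it, the model size benefits because the embedding narrows to the embedding size E before being projected back up to H.
The results show that performance holds up even with this factorization.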
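As a rough check of the parameter-count formula above, the following sketch compares the tied case (E == H) with the factorized case (E << H). The function name `embedding_params` and the values V = 30,000, H = 1024, E = 128 are illustrative assumptions, not figures taken from this post.

```python
def embedding_params(vocab_size, hidden_size, embedding_size=None):
    """Count embedding parameters (biases ignored).

    embedding_size=None models the BERT-style case where E == H,
    so the cost is V * H. Otherwise the cost is V * E + E * H.
    """
    if embedding_size is None:
        return vocab_size * hidden_size
    return vocab_size * embedding_size + embedding_size * hidden_size


# Illustrative sizes (assumed for this example).
V, H, E = 30_000, 1024, 128

bert_style = embedding_params(V, H)        # V * H
albert_style = embedding_params(V, H, E)   # V * E + E * H

print(f"E == H : {bert_style:,} parameters")
print(f"E << H : {albert_style:,} parameters")
print(f"reduction: {bert_style / albert_style:.1f}x")
```

With these assumed sizes the V x E term dominates the factorized cost, which is why a small E shrinks the embedding table even after adding the E x H projection.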

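Cross-layer Parameter Sharing

The same parameters are shared and reused across layers. Sharing the FFN layer parameters lowers performance somewhat.
This makes the paper "Are Sixteen Heads Really Better than One?" worth thinking about alongside it.
(See also the review of that paper.)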
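To get a feel for how much cross-layer sharing saves, here is a minimal counting sketch. It assumes a standard Transformer encoder layer (four attention projections, a two-layer FFN with a 4x inner size, two layer norms); the helper names, the sharing modes, and the sizes H = 1024 with 24 layers are assumptions for illustration, and embedding parameters are left out.

```python
def layer_params(hidden, ffn_mult=4):
    """Parameter counts for one assumed Transformer encoder layer."""
    inner = ffn_mult * hidden
    return {
        # Q, K, V and output projections, each H x H plus a bias.
        "attention": 4 * (hidden * hidden + hidden),
        # Two dense layers: H -> inner and inner -> H, with biases.
        "ffn": hidden * inner + inner + inner * hidden + hidden,
        # Two layer norms, each with a scale and a shift vector.
        "layer_norm": 2 * 2 * hidden,
    }


def encoder_params(hidden, num_layers, share="all"):
    """Total encoder parameters under a given sharing mode.

    share is one of "all", "attention", "ffn", "none"; a shared
    component is stored once, an unshared one once per layer.
    """
    shared = {
        "all": {"attention", "ffn", "layer_norm"},
        "attention": {"attention"},
        "ffn": {"ffn"},
        "none": set(),
    }[share]
    total = 0
    for name, count in layer_params(hidden).items():
        copies = 1 if name in shared else num_layers
        total += copies * count
    return total


for mode in ("none", "attention", "ffn", "all"):
    print(f"{mode:>9}: {encoder_params(1024, 24, mode):,}")
```

Under these assumptions the FFN holds roughly two thirds of each layer's parameters, so sharing it saves the most memory, which is the trade-off against the performance drop noted above.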

Sentence Order Prediction

Why do NSP at all? -> To improve downstream performance.

NSP = Topic Prediction + Coherence Prediction

Sampling method

Positive Sampling: two consecutive sentences
Negative Sampling: two random sentences

Because the negative pair is sampled at random, its two sentences will in most cases come from passages on different topics, so the model ends up learning Topic Prediction. NSP folds Topic Prediction and Coherence Prediction into a single task, but Topic Prediction is easier to learn than Coherence Prediction and overlaps more with what the MLM loss already teaches. The training logs bear this out: NSP is a much easier task than MLM. To fix this, ALBERT avoids Topic Prediction and focuses on Coherence Prediction by adopting the Sentence Order Prediction (SOP) task, which samples as follows.

Positive Sampling: two consecutive sentences (same as before)
Negative Sampling: the same two consecutive sentences with their order swapped (to put the weight on Coherence Prediction rather than Topic Prediction)
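The two sampling schemes can be sketched side by side as follows. This is a simplified illustration, not the actual preprocessing code: the function names are hypothetical, each "sentence" is a plain string, positives and negatives are drawn 50/50, and the NSP negative is assumed to come from a different document.

```python
import random


def make_nsp_example(documents, doc_id, index, rng):
    """NSP: positive = next sentence, negative = random sentence
    from another document (assumption for this sketch)."""
    first = documents[doc_id][index]
    if rng.random() < 0.5:
        return (first, documents[doc_id][index + 1]), 1
    other_ids = [i for i in range(len(documents)) if i != doc_id]
    other_doc = documents[rng.choice(other_ids)]
    return (first, rng.choice(other_doc)), 0


def make_sop_example(sentences, index, rng):
    """SOP: positive = two consecutive sentences in order,
    negative = the same two sentences swapped."""
    first, second = sentences[index], sentences[index + 1]
    if rng.random() < 0.5:
        return (first, second), 1
    return (second, first), 0


rng = random.Random(0)
docs = [
    ["The cat sat down.", "Then it fell asleep."],
    ["Stocks rose today.", "Analysts were surprised."],
]
print(make_nsp_example(docs, 0, 0, rng))
print(make_sop_example(docs[0], 0, rng))
```

An SOP negative always shares its topic with the positive, so the only signal left to separate the two labels is sentence order, whereas an NSP negative can usually be spotted from the topic mismatch alone.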

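In the end, training with the SOP task is reported to give better performance on Downstream Tasks.
(Honestly, I can't see much of a difference elsewhere; SQuAD2.0 is the one place where there seems to be a bit of a gap.)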

Discussion

ALBERT-xxlarge has fewer parameters than BERT-large and performs much better. However, its larger structure makes it more expensive to compute.

