COCO-LM: Correcting and Contrasting Text Sequences for Language Model Pretraining

장한솔 · January 19, 2022


ICLR 2020 (Google Research)
ELECTRA: Efficiently Learning an Encoder that Classifies Token Replacements Accurately.
MLM (Masked Language Model, bidirectional representations)
- Masks 15% of the tokens and trains the model to restore them (substantial compute cost)
- Masking is never used in actual downstream tasks
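
To make the MLM setup above concrete, here is a minimal sketch of the masking step. It is a simplification under stated assumptions: the function name `mask_tokens`, the fixed seed, and the example sentence are all hypothetical, and it always substitutes `[MASK]` (BERT's actual recipe also keeps or randomizes some of the selected tokens).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets) for a simplified MLM step.

    targets maps each masked position to its original token; every other
    position contributes nothing to the training loss.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model must restore
        masked[pos] = MASK          # what the model actually sees
    return masked, targets

# Hypothetical 15-token example sentence.
tokens = "the chef cooked the meal and served it to the hungry guests at noon today".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

Only the positions in `targets` produce a learning signal, which is the source of the compute cost noted above, and the `[MASK]` symbol itself never shows up in downstream inputs.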

Small generator (the generator must not be too strong, otherwise the discriminator does not learn as intended.)
Replaced Token Detection (RTD)
- The discriminator is set up so that it never sees the artificial [MASK] token.
- Learning from all input positions causes ELECTRA to train much faster than BERT
  - Rather than predicting the answer only at [MASK] positions, it looks at every token and decides whether it is original or replaced (computationally efficient)
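
The RTD labeling described above can be sketched in a few lines. This is an illustration only: `rtd_labels` is a hypothetical helper, and the corrupted sequence is written by hand rather than sampled from a real generator.

```python
def rtd_labels(original, corrupted):
    """Label every position: 1 if the token was replaced, 0 if it is original."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]

original = ["the", "chef", "cooked", "the", "meal"]
# Hand-written stand-in for generator output: position 2 was masked and
# re-sampled as "ate"; position 4 was masked too, but the sampled token
# happened to match the original, so it counts as original.
corrupted = ["the", "chef", "ate", "the", "meal"]

labels = rtd_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0, 0]
```

Every position receives a label, so every position contributes to the discriminator's loss, and the input contains no `[MASK]` symbol.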
Joint training: the generator is discarded, and only the discriminator is used for downstream tasks.
Efficiency (why is it efficient? 15% vs. 100%)
- ELECTRA 15%: a variant in which the discriminator loss is computed over only 15% of the tokens.
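
The 15% vs. 100% question comes down to how many positions feed the loss. The sketch below contrasts the two cases with a plain binary cross-entropy. The name `rtd_loss`, the made-up probabilities, and the chosen positions are assumptions for illustration, not values from the paper.

```python
import math

def rtd_loss(p_replaced, labels, positions=None):
    """Mean binary cross-entropy over the chosen positions.

    p_replaced[i] is the discriminator's predicted probability that token i
    was replaced. positions=None means every token contributes; passing a
    subset mimics the "ELECTRA 15%" variant.
    Returns (mean_loss, number_of_positions_used).
    """
    if positions is None:
        positions = range(len(labels))
    positions = list(positions)
    total = 0.0
    for i in positions:
        # Probability the discriminator assigned to the correct label.
        p_correct = p_replaced[i] if labels[i] == 1 else 1.0 - p_replaced[i]
        total -= math.log(p_correct)
    return total / len(positions), len(positions)

seq_len = 20
masked_positions = [2, 6, 13]  # hypothetical 15% that went through the generator
labels = [0] * seq_len
for i in masked_positions:
    labels[i] = 1  # assume all three were actually replaced
p_replaced = [0.9 if y == 1 else 0.1 for y in labels]  # made-up predictions

_, n_all = rtd_loss(p_replaced, labels)
_, n_subset = rtd_loss(p_replaced, labels, masked_positions)
print(n_all, n_subset)  # 20 3
```

With the same forward pass, the full version collects a training signal from 20 positions while the 15% variant collects it from 3.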

Missing Language Modeling Benefits.
- language modeling capability
- The binary classification task may not be sufficient to capture certain word-level semantics
  - It is not well suited to settings such as few-shot learning.
Squeezing Representation Space.

base

base++

aux transformer

Pretraining task
- SCL only?
- Comparing RTD only with CLM only showed only a marginal difference, but SCL+RTD < COCO-LM (SCL+CLM).
Network setting
- Rel-Pos also shows higher performance on some tasks other than MNLI.
- ELECTRA's aux: 12 layers, hidden size 256
Training signal
- Trains the main transformer with random replacements instead of using the aux transformer.
- converged aux < pretrain two transformers together
  - the auxiliary model gradually increases the difficulty of the corrupted sequences
CLM setup
- Dropping CLM entirely and training with LM only (SCL + LM) lowered performance considerably.
- Correct corrupted text, with the binary classification removed
- Correct corrupted text, with binary classification but without the stop gradient that normally accompanies it
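
To make the two ingredients in the CLM setup easier to picture, the sketch below builds the per-position targets for a token-recovery head and for a binary head. It only covers the targets: `corrective_targets` and the `is_original` convention (1 = keep the input token) are hypothetical, and how the two heads are combined in the loss, including the stop gradient, is deliberately left out.

```python
def corrective_targets(original, corrupted):
    """Return per-position targets for the two heads.

    token_targets: the original token at every position (correct the text).
    is_original:   1 where the input token can be kept, 0 where it must be fixed.
    """
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    token_targets = list(original)
    is_original = [int(o == c) for o, c in zip(original, corrupted)]
    return token_targets, is_original

# Hypothetical example; the corrupted sequence is written by hand.
original = ["she", "opened", "the", "old", "door"]
corrupted = ["she", "closed", "the", "old", "window"]

token_targets, is_original = corrective_targets(original, corrupted)
print(token_targets)  # ['she', 'opened', 'the', 'old', 'door']
print(is_original)    # [1, 0, 1, 1, 0]
```

Removing the binary classification corresponds to training on `token_targets` alone, whereas plain RTD would use only the binary signal.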