Recent Trends
1.The Transformer model and its self-attention block have become a general-purpose sequence (or set) encoder and decoder in recent NLP applications as well as in other areas.
2.Training deeply stacked Transformer models via a self-supervised learning framework has significantly advanced various NLP tasks through transfer learning, e.g., BERT, GPT-3, XLNet, ALBERT, RoBERTa, Reformer, T5, ELECTRA, …
3.Other applications are rapidly adopting the self-attention and Transformer architecture as well as the self-supervised learning approach, e.g., recommender systems, drug discovery, computer vision, …
4.As for natural language generation, self-attention models still require greedy decoding, generating words one at a time.
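The decoding constraint in point 4 can be sketched as a simple left-to-right loop: at each step the model conditions on everything generated so far and commits to the single most likely next token. This is a minimal toy sketch (the `step_fn` interface and token names are assumptions for illustration, not any library's API):

```python
def greedy_decode(step_fn, bos, eos, max_len=20):
    """Left-to-right greedy decoding: repeatedly ask step_fn for the
    most likely next token given the partial sequence, until EOS."""
    seq = [bos]
    for _ in range(max_len):
        nxt = step_fn(seq)  # greedy choice: a real model would argmax over logits
        seq.append(nxt)
        if nxt == eos:
            break
    return seq

# Trivial step function that spells out a fixed reply, for illustration.
reply = ["hello", "world", "<eos>"]
out = greedy_decode(lambda seq: reply[len(seq) - 1], "<bos>", "<eos>")
print(out)  # → ['<bos>', 'hello', 'world', '<eos>']
```

Each output token depends on all previously generated tokens, which is why generation cannot be parallelized across positions the way Transformer encoding can.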
GPT-1
BERT
1.Learn through masked language modeling task
2.Use large-scale data and large-scale model
1.Motivation
Language models only use left context or right context, but language understanding is bi-directional
2.Problem: Words can “see themselves” (cheating) in a bi-directional encoder
1.Masked Language Model
1)How to
Mask out k% of the input words, and then predict the masked words
e.g., use k = 15%
Too little masking: too expensive to train
Too much masking: not enough context to learn from
2)Problem - Solution
prob: The [MASK] token is never seen during fine-tuning
sol: Of the selected words, replace 80% with [MASK], 10% with a random word, and keep 10% unchanged (the model must still predict the original word in all three cases)
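The masking recipe above can be sketched in a few lines. This is a toy illustration of BERT-style input corruption (the helper name and toy vocabulary are assumptions, not BERT's actual preprocessing code):

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Toy sketch of BERT-style masking: select ~mask_rate of positions
    as prediction targets; of those, 80% become [MASK], 10% become a
    random vocabulary word, and 10% are left unchanged."""
    rng = random.Random(seed)
    corrupted = list(tokens)
    targets = []
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"          # 80%: mask token
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)  # 10%: random word
            # else: 10% keep the original token unchanged
    return corrupted, targets

tokens = "the man went to the store to buy milk".split()
vocab = ["apple", "car", "river", "book"]
corrupted, targets = mask_tokens(tokens, vocab)
print(corrupted, targets)
```

Because 10% of targets keep their original word, the model cannot assume that an unmasked input token is always correct, which softens the train/fine-tune mismatch noted above.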
2.Next Sentence Prediction
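The pre-training data for this task pairs each sentence with either its true successor or a random sentence, labeled accordingly. A minimal sketch (the helper name and corpus are hypothetical; real BERT preprocessing also handles document boundaries):

```python
import random

def make_nsp_pairs(sentences, seed=0):
    """Toy sketch of Next Sentence Prediction data: for each adjacent
    pair, with 50% probability keep the true next sentence (label 1 =
    IsNext), otherwise substitute a random sentence (label 0 = NotNext)."""
    rng = random.Random(seed)
    pairs = []
    for a, b in zip(sentences, sentences[1:]):
        if rng.random() < 0.5:
            pairs.append((a, b, 1))                      # true next sentence
        else:
            pairs.append((a, rng.choice(sentences), 0))  # random sentence
    return pairs

corpus = ["the man went to the store", "he bought a gallon of milk",
          "penguins are flightless birds", "they live in the southern hemisphere"]
pairs = make_nsp_pairs(corpus)
print(pairs)
```

The 50/50 split gives the model a balanced binary classification signal about sentence-level coherence.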
BERT vs GPT
Not covered
Q1) What are the drawbacks of BERT's Masked Language Model? Consider how it differs from the way humans actually learn language.
A1) If the masked words are ones that play an important role in understanding the context, or words the model has never seen before, the model may have difficulty learning from them.
Not covered
We held a Zoom meeting with the new team members; I have completed about half of the planned study and plan to finish the rest tomorrow.