Shared structure
- Even in languages like English that are not agglutinative and aren't highly inflected, words share important structure
- Even if we never see the word "unfriendly" in our data, we should be able to reason about it as: un + friend + ly
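To make this concrete, here is a minimal sketch of decomposing an unseen word into known pieces. The toy vocabulary and the greedy longest-match rule are illustrative assumptions, not how real subword tokenizers (BPE, WordPiece) learn their splits.

```python
# Toy greedy longest-match decomposition into known subword units.
# The vocabulary below is hypothetical; real subword vocabularies are learned from data.
VOCAB = {"un", "friend", "ly", "happy", "kind"}

def decompose(word: str) -> list[str]:
    """Greedily split `word` into the longest known pieces from VOCAB."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest remaining prefix that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No known piece matched; fall back to a single character.
            pieces.append(word[i])
            i += 1
    return pieces

print(decompose("unfriendly"))  # ['un', 'friend', 'ly']
```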
[Note]
Attention
A concept used when combining word embeddings with a weighted sum.
The weights vary according to which words deserve more attention.
Define v to be a vector to be learned; think of it as an "important word" vector. The dot product here measures how similar each input vector is to that "important word" vector (see the NumPy sketch after the list of variations below).
Lots of variations on attention:
1) Linear transformation of x before dotting it with v
2) Non-linearities after each operation
3) "Multi-head attention": multiple v vectors to capture different phenomena that can be attended to in the input
4) Hierarchical attention (sentence representation with attention over words + document representation with attention over sentences: attend once at the word level, once at the sentence level, ...)
Attention gives us a normalized weight for every token in a sequence that tells us how important that word was for the prediction -> we can work backwards to see which words mattered
- This can be useful for visualization
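A minimal NumPy sketch of the basic mechanism, assuming a single learned vector v and toy random embeddings (in a real model, v and the embeddings are learned jointly by backpropagation):

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def attend(X, v):
    """Dot-product attention with a learned 'important word' vector v.

    X : (seq_len, d) word embeddings for one sequence
    v : (d,)         learned vector; the dot product x_i . v scores how
                     similar each word is to the 'important word' direction
    Returns the normalized attention weights and the weighted sum of embeddings.
    """
    scores = X @ v             # (seq_len,) similarity of each word to v
    weights = softmax(scores)  # normalized, sums to 1 -> inspectable per token
    summary = weights @ X      # (d,) weighted sum of the word embeddings
    return weights, summary

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))    # 5 words, 8-dim embeddings (toy values)
v = rng.normal(size=8)         # in practice learned during training
weights, summary = attend(X, v)
print(weights)                 # one normalized weight per token (visualizable)
# Using several v vectors (one per "head") and concatenating the summaries
# gives the multi-head variant mentioned above.
```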
- BERT learns the relationships between all tokens by taking attention into account
- BERT is a Transformer-based model; it is pretrained by filling in blanks (masked language modeling, which uses context from both directions through self-attention rather than a bidirectional RNN) and by splitting text into sentence pairs and checking whether the second sentence plausibly follows the first (next sentence prediction)
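As an illustration of the fill-in-the-blank objective, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint are available, the fill-mask pipeline returns a normalized score for each candidate word:

```python
from transformers import pipeline

# Masked language modeling: BERT predicts the [MASK] token from both
# the left and the right context (bidirectional self-attention).
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for candidate in fill_mask("The weather today is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```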