Word representations such as Word2Vec and GloVe are a key component of NLP models.
- However, learning high-quality representations is challenging.
- A high-quality representation lets a model capture the complex characteristics of words (semantics & syntax) and exploit context-dependent word use (polysemy, homonymy).
The new deep contextualized word representation handles both challenges and can be integrated into existing models.
- This new representation improves the SoTA on a range of NLP tasks.
what's the difference?
ELMo representation: Embeddings from Language Models
Each token's representation is computed as a function of the entire input sentence.
- The model that reads the whole sentence is a biLM, trained with a coupled LM objective on a large text corpus.
ELMo representations are deep: they are a linear combination of the outputs of all internal biLM layers, which yields richer word representations.
Because each word passes through multiple LM layers (two in the paper), the lower layer captures syntactic information while the higher layer captures semantic information.
- Thanks to this property, ELMo is broadly applicable, e.g. to word sense disambiguation and POS tagging.
experiment results
ELMo representations can be easily added to existing models for 6 NLP understanding problems
adding ELMo representations improves the SoTA in every case
word representations, e.g. word vectors (2010), Word2Vec (2013), and GloVe (2014), became a standard component of SoTA NLP architectures
- these word vectors allow only a single, context-independent representation for each word
previously proposed methods tried to overcome this by enriching vectors with subword information or by learning separate vectors for each word sense
- enriching with subword information: CHARAGRAM (Wieting et al., 2016) incorporates character-based n-grams into word vectors; fastText (Bojanowski et al., 2017) represents each word in the skip-gram model as a bag of character n-gram vectors
- learning separate vectors for each word sense: Neelakantan et al. (2014) extend skip-gram to learn multiple embeddings per word type
ELMo borrows both ideas above: subword units and multi-sense information
3. ELMo: Embeddings from Language Models
ELMo word representations are functions of the entire input sentence
computed on top of two-layer biLMs with character convolutions
3.1 Bidirectional Language Models
A forward language model computes the probability of a token sequence $(t_1, t_2, \ldots, t_N)$ by factorizing it as
$$p(t_1, t_2, \ldots, t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, \ldots, t_{k-1})$$
i.e. each token $t_k$ is predicted given its history $(t_1, \ldots, t_{k-1})$.
At each position $k$, an L-layer forward LSTM produces a context-dependent representation $\overrightarrow{h}^{LM}_{k,j}$ for $j = 1, \ldots, L$.
The biLM proposed in the paper jointly maximizes the log likelihoods of the forward and backward LMs, i.e. the sum below:
log likelihood of the biLM:
$$\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, \ldots, t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, \ldots, t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \right)$$
where $\Theta_x$ (token representation) and $\Theta_s$ (softmax layer) are shared between directions, and $\overrightarrow{\Theta}_{LSTM}$, $\overleftarrow{\Theta}_{LSTM}$ are the separate forward and backward LSTM parameters.
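To make the coupled objective concrete, here is a minimal PyTorch sketch under simplifying assumptions: a hypothetical ToyBiLM uses a plain word embedding instead of the paper's char-CNN input, and the forward/backward cross-entropy terms (negative log likelihoods) share the token-embedding ($\Theta_x$) and softmax ($\Theta_s$) parameters. This is not the paper's CNN-BIG-LSTM implementation.

```python
import torch
import torch.nn as nn

class ToyBiLM(nn.Module):
    """Toy biLM: shared token embedding/softmax, separate forward and backward LSTMs."""
    def __init__(self, vocab_size, dim=128, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)               # Theta_x (shared)
        self.fwd = nn.LSTM(dim, dim, layers, batch_first=True)   # forward LSTM parameters
        self.bwd = nn.LSTM(dim, dim, layers, batch_first=True)   # backward LSTM parameters
        self.softmax = nn.Linear(dim, vocab_size)                # Theta_s (shared)

    def forward(self, tokens):                                   # tokens: (batch, N)
        x = self.embed(tokens)
        h_fwd, _ = self.fwd(x)                                   # reads t_1 .. t_k left-to-right
        h_bwd, _ = self.bwd(torch.flip(x, dims=[1]))             # reads t_N .. t_k right-to-left
        h_bwd = torch.flip(h_bwd, dims=[1])                      # re-align to original positions
        return self.softmax(h_fwd), self.softmax(h_bwd)

def bilm_loss(model, tokens):
    """Negative of the joint log likelihood: forward + backward cross entropy."""
    logits_fwd, logits_bwd = model(tokens)
    ce = nn.CrossEntropyLoss()
    vocab = logits_fwd.size(-1)
    # forward LM predicts t_k from t_1..t_{k-1}; backward LM predicts t_k from t_{k+1}..t_N
    fwd = ce(logits_fwd[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1))
    bwd = ce(logits_bwd[:, 1:].reshape(-1, vocab), tokens[:, :-1].reshape(-1))
    return fwd + bwd
```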
3.2 ELMo
ELMo is a task-specific combination of the intermediate layer representations in the biLM
- using a biLM is more effective than using only a forward LM, even one trained on a large-scale corpus (Peters et al. (2017))
$$R_k = \{\, x_k^{LM},\ \overrightarrow{h}^{LM}_{k,j},\ \overleftarrow{h}^{LM}_{k,j} \mid j = 1, \ldots, L \,\} = \{\, h^{LM}_{k,j} \mid j = 0, \ldots, L \,\}$$
$$ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h^{LM}_{k,j}$$
($x_k^{LM}$: context-independent token representation, equal to $h^{LM}_{k,0}$; $h^{LM}_{k,j}$: output of layer $j$ at position $k$; $\gamma^{task}$: task-specific scale; $s_j^{task}$: softmax-normalized layer weights)
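A minimal sketch of the mixing equation above, assuming the weights $s_j^{task}$ come from a softmax over learnable scalars and $\gamma^{task}$ is a learnable scale; the ScalarMix class name and the dimensions in the usage example are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    """Task-specific weighted sum of biLM layers: gamma * sum_j s_j * h_{k,j}."""
    def __init__(self, num_layers):
        super().__init__()
        self.scalars = nn.Parameter(torch.zeros(num_layers))  # softmax-normalized -> s_j^task
        self.gamma = nn.Parameter(torch.ones(1))               # gamma^task

    def forward(self, layer_outputs):
        # layer_outputs: list of L+1 tensors of shape (batch, seq_len, dim);
        # index 0 is the token layer x_k, indices 1..L are the biLSTM layers h_{k,j}
        s = torch.softmax(self.scalars, dim=0)
        mixed = sum(w * h for w, h in zip(s, layer_outputs))
        return self.gamma * mixed

# usage: mix the token layer plus 2 biLSTM layers for a dummy batch of sentences
mix = ScalarMix(num_layers=3)
layers = [torch.randn(4, 10, 1024) for _ in range(3)]  # stand-in biLM outputs
elmo_repr = mix(layers)                                # (4, 10, 1024)
```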
3.3 Using biLMs for supervised NLP tasks
For each word, all biLM layer representations (plus the token representation) are recorded as $R_k$, and the end task model learns a linear combination of them ($ELMo_k^{task}$).
- The model structure proposed in the paper has the shape ELMo (biLM) -> end task model.
- The ELMo-enhanced representation $[x_k; ELMo_k^{task}]$ is fed into the end task model (see the sketch after this list).
Previous models instead feed each context-independent representation $x_k$ into the model and obtain context-dependent representations $h_k$ by passing it through layers (RNN, CNN, FFN, etc.).
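As a hedged sketch of the enhancement described above: a hypothetical sequence tagger that concatenates the context-independent $x_k$ with a precomputed $ELMo_k^{task}$ vector (e.g. the output of the ScalarMix sketch) before its own biLSTM. The paper freezes the biLM weights, so only the mixing weights and the task model are trained.

```python
import torch
import torch.nn as nn

class ElmoEnhancedTagger(nn.Module):
    """Illustrative end-task model: run a task biLSTM over [x_k; ELMo_k^task]."""
    def __init__(self, token_dim, elmo_dim, hidden_dim, num_tags):
        super().__init__()
        self.task_rnn = nn.LSTM(token_dim + elmo_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_repr, elmo_repr):
        # token_repr: (batch, seq, token_dim) context-independent x_k
        # elmo_repr:  (batch, seq, elmo_dim) task-weighted combination of frozen biLM layers
        enhanced = torch.cat([token_repr, elmo_repr], dim=-1)   # [x_k; ELMo_k^task]
        h, _ = self.task_rnn(enhanced)
        return self.out(h)

# usage with dummy tensors (dimensions are arbitrary)
model = ElmoEnhancedTagger(token_dim=300, elmo_dim=1024, hidden_dim=256, num_tags=17)
logits = model(torch.randn(4, 10, 300), torch.randn(4, 10, 1024))  # (4, 10, 17)
```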
3.4 Pre-trained biLM architecture
The architecture follows the models of Jozefowicz et al. (2016) and Kim et al. (2015).
- It provides 3 layers of representation for each input token.
- Previous approaches provide only 1 representation layer (for tokens of a fixed vocabulary).
After pretraining, the biLM can be fine-tuned for the specific task; a usage sketch follows below.
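For reference, pre-trained biLM weights are commonly loaded through AllenNLP's Elmo module rather than retrained from scratch. The snippet below is only an assumed usage sketch, not something described in the paper: the API may differ across AllenNLP versions, and options_file / weight_file are placeholders for the published option/weight files.

```python
from allennlp.modules.elmo import Elmo, batch_to_ids  # AllenNLP's ELMo wrapper (assumed API)

# placeholders for the published pre-trained biLM option/weight files
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# one scalar-mixed output representation, no dropout; the biLM weights stay frozen by default
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["The", "cat", "sat", "on", "the", "mat", "."]]
character_ids = batch_to_ids(sentences)          # character ids for the char-CNN input layer
output = elmo(character_ids)
elmo_repr = output["elmo_representations"][0]    # (batch, seq_len, 1024) contextual vectors
```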
4. Evaluation
Simply adding ELMo sets a new SoTA on 6 tasks (6-20% relative error reduction).
QA
24.9% relative error reduction, 4.7% F1 improvement over the baseline, 1.4% improvement over the previous SoTA
Uses SQuAD: 100k+ crowd-sourced QA pairs, with answers as spans in a Wikipedia paragraph
baseline model (Clark and Gardner, 2017): an improved version of the Bidirectional Attention Flow (BiDAF) model
Textual entailment
Given a premise, determine whether a hypothesis is true or false (entailed or not).
model: ELMo + ESIM
Stanford Natural Language Inference (SNLI) corpus: 550k hypothesis/premise pairs
baseline model (Chen et al. (2017)): ESIM model, biLSTM (encoding premise/hypothesis) - matrix attention (local inference) - biLSTM (inference composition) - pooling - output
Semantic role labeling (SRL)
predicate-argument structure of a sentence, answering "Who did what to whom"
baseline: He et al. (2017), 8-layer deep biLSTM with forward and backward directions interleaved
Coreference resolution
clustering mentions in text that refer to the same underlying real-world entities
baseline: Lee et al. (2017), biLSTM & attention (compute span representations) - softmax (find coreference chains)
Named entity extraction
CoNLL 2003 NER task: Reuters RCV1 corpus tagged with 4 different entity types (PER, LOC, ORG, MISC)
baseline: Peters et al. (2017), word embeddings & character-based CNN representations - 2 biLSTM layers - conditional random field (CRF) loss
difference from our model: the baseline uses only the top biLM layer, while our model uses a weighted average of all biLM layers
Sentiment analysis
SST-5 (Socher et al., 2013): Stanford Sentiment Treebank, 5 labels (very negative to very positive) on movie reviews
baseline: biattentive classification network (BCN; McCann et al., 2017) with CoVe
difference: CoVe is replaced with ELMo in the BCN
5. Analysis
Deep contextual representations (using the outputs of all layers) give higher performance than using only the top layer output (5.1).