13.1. Word Embedding
A word can be represented as a vector.
- **Word2vec**
- Using a large corpus,
- Predict the current word from neighboring words (Continuous Bag of Words; CBOW), or
- Predict the surrounding words given the current one (Skip-gram)
- Word meaning is determined by its (frequently co-occurring) neighboring words.
- Word vectors are fitted to maximize the likelihood:
$$L(\theta) = \prod_{t=1}^{N} \prod_{-m \le j \le m,\ j \ne 0} p(w_{t+j} \mid w_t; \theta)$$
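A minimal sketch of fitting both variants with the gensim library (the toy corpus and hyperparameters are illustrative, not the lecture's setup):

```python
# Minimal sketch: training CBOW and Skip-gram embeddings with gensim.
# The toy corpus and hyperparameters below are illustrative only.
from gensim.models import Word2Vec

corpus = [
    ["the", "cat", "sits", "on", "the", "mat"],
    ["the", "dog", "lies", "on", "the", "rug"],
]

# sg=0 -> CBOW (predict the current word from its neighbors)
cbow = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=0)

# sg=1 -> Skip-gram (predict the neighbors from the current word)
skipgram = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

print(skipgram.wv["cat"].shape)          # (50,)
print(skipgram.wv.most_similar("cat"))   # nearest neighbors by cosine similarity
```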
- **GloVe**
- Global Vectors: global corpus statistics are captured directly.
- Ratios of co-occurrence probabilities can encode meaning components:
$$w_i^\top \tilde{w}_j = \log p(i \mid j)$$
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
![](https://velog.velcdn.com/images/zzwon1212/post/1f517cf4-a8ef-44e1-9245-27bc8c7cf21f/image.png)
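A minimal NumPy sketch of the objective $J$ above (the toy co-occurrence matrix is made up; `x_max` and `alpha` follow the GloVe paper's defaults):

```python
# Minimal NumPy sketch of the GloVe objective J above.
# X is a toy co-occurrence matrix; x_max and alpha follow the paper's defaults.
import numpy as np

V, d = 5, 10                      # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(V, V)).astype(float)   # toy co-occurrence counts

W  = 0.1 * rng.standard_normal((V, d))   # word vectors w_i
Wt = 0.1 * rng.standard_normal((V, d))   # context vectors w~_j
b  = np.zeros(V)                         # biases b_i
bt = np.zeros(V)                         # biases b~_j

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function that down-weights rare and very frequent pairs."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, Wt, b, bt, X):
    J = 0.0
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:              # only observed co-occurrences contribute
                diff = W[i] @ Wt[j] + b[i] + bt[j] - np.log(X[i, j])
                J += f(X[i, j]) * diff ** 2
    return J

print(glove_loss(W, Wt, b, bt, X))
```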
Attention Is All You Need
In MLPs, CNNs, and RNNs, the output $\hat{y}$ is a weighted sum (plus fixed unary operations) of the input $x$. That is, the weights $W$ are optimized to best map the input to the output on the training set, in terms of the loss function.
- **Attention function**
Attention$(Q, K, V)$ = attention value. In the paper this is scaled dot-product attention, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top / \sqrt{d_k}\right) V$.
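A minimal sketch of this scaled dot-product attention in PyTorch (shapes are illustrative):

```python
# Minimal sketch of scaled dot-product attention:
# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import math
import torch

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # similarity of each query to each key
    weights = torch.softmax(scores, dim=-1)             # attention weights sum to 1 over the keys
    return weights @ V                                   # attention value: weighted sum of the values

Q = torch.randn(4, 8)   # 4 queries, dimension 8 (illustrative shapes)
K = torch.randn(6, 8)   # 6 keys
V = torch.randn(6, 8)   # 6 values
print(attention(Q, K, V).shape)   # torch.Size([4, 8])
```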
- **Self-attention**
Each element refines its own representation by attending to its context (the other elements in the input). More specifically, it becomes a weighted sum of the elements in the sequence, where the weights are computed by attention rather than being arbitrary learned weights.
![](https://velog.velcdn.com/images/zzwon1212/post/fd518a96-bdce-4e2b-92c4-1257b9950270/image.png)
- With the Transformer, we build Query, Key, and Value vectors.
- From the input tokens $\{x_1, x_2, \ldots, x_N\}$:
- Each token $x_i$ is mapped to its own Query $Q_i$, Key $K_i$, and Value $V_i$ vectors by a linear transformation.
- The linear weights $(W^Q, W^K, W^V)$ are the learned parameters, shared by all inputs.
- $W^Q$ ($W^K$, $W^V$) learns how to represent a vector so that it can serve as a Query (Key, Value) in general.
- We need another learnable parameter $W^O$, which maps the attention value back to the original space.
- Each token $x_i$ becomes the Query when we compute the representation of token $i$.
- The references are all tokens $\{x_1, x_2, \ldots, x_N\}$ in the input sequence, including $x_i$ itself.
- The step in the figure above is repeated multiple times to further contextualize the representations (a minimal sketch follows below).
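A minimal single-head self-attention sketch in PyTorch following the bullets above ($W^Q$, $W^K$, $W^V$ project every token, $W^O$ maps the attention value back; dimensions are illustrative):

```python
# Minimal single-head self-attention sketch following the bullets above:
# W_Q, W_K, W_V project every token x_i; W_O maps the attention value back.
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_model, bias=False)  # shared by all tokens
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)  # back to the original space

    def forward(self, x):                      # x: (N tokens, d_model)
        Q, K, V = self.W_Q(x), self.W_K(x), self.W_V(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(x.size(-1))
        Z = torch.softmax(scores, dim=-1) @ V  # each token attends to all tokens, itself included
        return self.W_O(Z)

x = torch.randn(5, 16)                 # 5 tokens, d_model = 16 (illustrative)
print(SelfAttention(16)(x).shape)      # torch.Size([5, 16])
```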
- **Training the Transformer model**
![](https://velog.velcdn.com/images/zzwon1212/post/5deaf295-798b-4711-be49-0df3bff487ac/image.png)
A dummy token (called the classification token) is appended to the input sequence, and its contextualized output is used as the aggregated embedding.
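A minimal sketch of adding such a learnable classification token (PyTorch; dimensions are illustrative):

```python
# Minimal sketch: add a learnable classification token whose contextualized
# output serves as the aggregated sequence embedding.
import torch
import torch.nn as nn

d_model = 16
cls_token = nn.Parameter(torch.zeros(1, d_model))     # learnable dummy token

tokens = torch.randn(5, d_model)                      # 5 input token embeddings (illustrative)
sequence = torch.cat([cls_token, tokens], dim=0)      # (6, d_model); prepended, as in BERT's [CLS]

# After the Transformer layers, the row at the classification-token position
# is fed to the classifier head, e.g. logits = head(encoder(sequence)[0]).
print(sequence.shape)   # torch.Size([6, 16])
```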
- **Transformer**
![](https://velog.velcdn.com/images/zzwon1212/post/0e41ec22-1948-43a4-aa6f-7f33c785eb3e/image.png)
- Multi-head Self-attention
Having multiple projections to $Q$, $K$, $V$ is beneficial because it allows the model to jointly attend to information from different representation subspaces at different positions.
- Multiple self-attention heads output multiple attention values $(Z_0, Z_1, \ldots, Z_{k-1})$, so we simply concatenate them and then linearly transform the result back to the original input size with $W^O$ (see the encoder-layer sketch after this list).
- Feed-forward layer
Each contextualized embedding goes through an additional FC layer. It is applied separately and identically, so there is no cross-token dependency.
- Residual connection
- Layer normalization
- Positional Encoding
$$PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{\text{model}}}\right), \qquad PE_{(pos, 2i+1)} = \cos\!\left(pos / 10000^{2i/d_{\text{model}}}\right)$$
(A sketch of this encoding follows after this list.)
- Masked Multi-head Self-attention
The predictions for position $i$ can depend only on the known outputs at positions less than $i$; a causal mask enforces this (sketched after this list).
- The decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack.
- Decoding steps are repeated until the next word is predicted as [EOS] (End of Sentence).
- The output sentence may be chosen greedily (always taking the top prediction at each step), or by keeping the top-$k$ candidate sequences at each step (beam search).
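A minimal encoder-layer sketch in PyTorch, assuming illustrative dimensions: `nn.MultiheadAttention` splits the input into heads, computes their attention values, concatenates them, and applies the output projection internally; the position-wise feed-forward layer, residual connections, and layer normalization follow the bullets above.

```python
# Minimal encoder-layer sketch: multi-head self-attention (heads are computed,
# concatenated, and projected back inside nn.MultiheadAttention), a position-wise
# feed-forward layer, residual connections, and layer normalization.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=64, num_heads=8, d_ff=256):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(                 # applied to each token separately and identically
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, tokens, d_model)
        attn_out, _ = self.mha(x, x, x)           # Query = Key = Value = x (self-attention)
        x = self.norm1(x + attn_out)              # residual connection + layer norm
        x = self.norm2(x + self.ffn(x))           # residual connection + layer norm
        return x

x = torch.randn(2, 5, 64)                         # batch of 2 sequences, 5 tokens each (illustrative)
print(EncoderLayer()(x).shape)                    # torch.Size([2, 5, 64])
```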
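And a minimal sketch of the sinusoidal positional encoding and of the causal mask used by masked self-attention; such a boolean mask can be passed to `nn.MultiheadAttention` through its `attn_mask` argument (`True` marks positions a query may not attend to).

```python
# Minimal sketch of the sinusoidal positional encoding and of the causal mask
# used by masked self-attention (position i may only attend to positions <= i).
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1).float()      # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()               # even embedding dimensions
    angle = pos / (10000 ** (i / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)      # PE(pos, 2i)
    pe[:, 1::2] = torch.cos(angle)      # PE(pos, 2i+1)   (assumes even d_model)
    return pe

def causal_mask(n):
    # True above the diagonal = "do not attend to future positions"
    return torch.triu(torch.ones(n, n), diagonal=1).bool()

print(positional_encoding(max_len=10, d_model=16).shape)   # torch.Size([10, 16])
print(causal_mask(4))
```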
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
![](https://velog.velcdn.com/images/zzwon1212/post/6efb92b5-7aa8-4986-b722-fcf4f89a48fc/image.png)
These days, pre-trained BERT is a default choice for word embeddings.
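A minimal sketch of extracting such contextual embeddings from pre-trained BERT with the Hugging Face `transformers` library:

```python
# Minimal sketch: contextual word embeddings from pre-trained BERT
# via the Hugging Face transformers library.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Attention is all you need.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state        # (1, num_tokens, 768): one vector per token
cls_embedding = embeddings[:, 0]              # the [CLS] token as an aggregated embedding
print(embeddings.shape, cls_embedding.shape)
```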
📙 Lecture