Summary of Word2vec
Keywords: CBOW, Skip-gram
Prerequisites
- Statistical NLP
- Create a statistical model to calculate the probability P(s) of a sentence.
- Using the chain rule of probability,
$P(s) = P(w_1 w_2 \cdots w_n) = P(w_1)P(w_2 \mid w_1) \cdots P(w_n \mid w_1 w_2 \cdots w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid h_{i-1})$, where $h_{i-1} = w_1 w_2 \cdots w_{i-1}$ and $P(w_i \mid h_{i-1}) = \frac{\text{count}(h_{i-1} w_i)}{\text{count}(h_{i-1})}$.
- i.e. word probabilities are estimated from relative counts (frequencies) in the training corpus.
- There is a sparsity problem: if $h_{i-1}$ is too long, the sequence $h_{i-1} w_i$ may never appear in the training corpus.
- For example, if $h_{i-1}$ = "I have a pen, I have an apple, Ah, apple", then $w_i$ should be "pen".
- But we are unlikely to observe such a long history in the training corpus.
- So we have to limit the history length => N-gram model (a minimal counting sketch follows this list).
- This statistical method has a fundamental problem: it cannot compare words with each other. In other words, it only models sentences, not individual words.
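As a rough illustration of the counting approach and the sparsity issue, here is a minimal bigram sketch; the toy corpus and function names are my own, not from the paper:

```python
# Count-based bigram language model on a tiny toy corpus.
from collections import Counter

corpus = "i have a pen i have an apple".split()

unigram = Counter(corpus)                      # count(h_{i-1}) with a one-word history
bigram = Counter(zip(corpus, corpus[1:]))      # count(h_{i-1} w_i)

def p_next(word, history_word):
    # P(w_i | h_{i-1}) ~= count(h_{i-1} w_i) / count(h_{i-1})
    return bigram[(history_word, word)] / unigram[history_word]

print(p_next("a", "have"))   # 0.5: "have" is followed by "a" once out of two occurrences
print(p_next("pen", "an"))   # 0.0: unseen bigram -> the sparsity problem
```

The unseen pair simply gets probability 0, which is exactly the sparsity problem described above.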
- Distributional semantic model
- Main idea: the distributional hypothesis, "Similar words occur in similar contexts".
- According to this hypothesis, our goal is to quantify semantic similarities (using various methods, e.g. vector similarities) between words based on their distributional properties.
- To compare words, we represent each word as a vector => vector representation
- One-hot encoding: No information about word similarity
- Dense representation: embed words in a continuous vector space of much lower dimension (e.g. $N \approx 100 \sim 1000$); see the sketch after this list.
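To make the contrast concrete, here is a small sketch comparing one-hot and dense vectors with cosine similarity; the dense vectors are hand-made toy values, not learned embeddings:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot: every pair of distinct words has similarity 0, so no similarity information.
apple_onehot = np.array([1.0, 0.0, 0.0])
plum_onehot  = np.array([0.0, 1.0, 0.0])
print(cosine(apple_onehot, plum_onehot))   # 0.0

# Dense: low-dimensional vectors can place similar words close together.
apple_dense = np.array([0.8, 0.1, 0.3])
plum_dense  = np.array([0.7, 0.2, 0.2])
print(cosine(apple_dense, plum_dense))     # ~0.99
```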
Previous work
- NNLM (Neural Network Language Model)
- To improve on the above idea, we train the word embeddings using a neural network (deep learning).
- This is more effective for NLP problems, so the question of how to embed words efficiently has attracted a lot of attention.
- NNLM here is like a bigram model; it consists of two fully connected layers:
- Linear
- Linear + softmax
- The input is a one-hot encoded vector, which is forward-passed through the hidden layer to the output layer.
- Before the embedding vectors are determined, a softmax is applied to the output so it can be compared with the target vector.
- Let $x_i$ be the one-hot encoded input of the $i$-th word, and let $W$, $W'$ be weight matrices, where the hidden layer has size $N$.
- Then $h = x_i W$ and $u = h W'$.
- After the softmax, $P(v_j \mid v_i) = \frac{e^{u_j}}{\sum_{t=1}^{V} e^{u_t}}$ (a numerical sketch follows this list).
- Later, this model was extended and the RNN-based RNNLM was proposed, which maintains context over time through a recurrent matrix that connects the hidden layer to itself.
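A minimal numerical sketch of the two-layer forward pass described above, with random weights and a tiny vocabulary chosen only for illustration:

```python
import numpy as np

V, N = 5, 3                      # vocabulary size, hidden (embedding) size
rng = np.random.default_rng(0)
W  = rng.normal(size=(V, N))     # input -> hidden weights (rows are word embeddings)
Wp = rng.normal(size=(N, V))     # hidden -> output weights

x_i = np.zeros(V); x_i[2] = 1.0  # one-hot encoding of the i-th word

h = x_i @ W                      # h = x_i W  (just selects row 2 of W)
u = h @ Wp                       # u = h W'   (score vector)
p = np.exp(u) / np.exp(u).sum()  # softmax: P(v_j | v_i) = e^{u_j} / sum_t e^{u_t}

print(p, p.sum())                # a probability distribution over the vocabulary
```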
Problems of Previous work
- Too slow and requires huge memory
- For example, in the NNLM above, if the vocabulary size is $V = 10^4$ and the vector size is $N = 3 \times 10^2$, then we have to update $2 \times V \times N = 6 \times 10^6$ weights (the two matrices $W$ and $W'$)!
- In practice, the vocabulary is much larger than $10^4$, and the vector size must also grow to keep up in performance.
- With this many weights and this much memory, we cannot train on a large corpus!
- Can't perform vector calculations between words
- If we are given the three hints "fruit", "red", and "not a plum", then we can guess "apple".
- Since the model does not learn syntactic (or semantic) similarity between words, this kind of task is impossible.
CBOW
- Generate the "center word" from the "surrounding words".
- Generate the one-hot encodings $\{x_1, x_2, \cdots, x_C\}$ of the $C$ input (context) words.
- Forward pass to get the embedded word vectors $\{h_1, h_2, \cdots, h_C\}$ via $h_c = x_c W$ (the weight matrix $W$ is shared across all context positions, like a convolution).
- Take the average $\bar{h}$ of these vectors, generate the score vector $u = \bar{h} W'$, and then apply the softmax.
- Our goal is to minimize the loss function
$E = -\log P(v_O \mid v_1, v_2, \cdots, v_C)$
- With $N$ input words, hidden layer dimension $D$, and a hierarchical softmax over the vocabulary of size $V$, the training complexity per example is
$Q = N \times D + D \times \log_2 V$
(a minimal CBOW sketch follows this list).
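A minimal CBOW sketch, assuming a toy vocabulary and random weights; it averages the context embeddings, applies the softmax, and evaluates the loss $E$ from above (plain softmax, not the hierarchical version used for the complexity estimate):

```python
import numpy as np

V, D, C = 6, 4, 3                     # vocabulary size, hidden size, number of context words
rng = np.random.default_rng(1)
W  = rng.normal(size=(V, D))          # shared input embedding matrix
Wp = rng.normal(size=(D, V))          # output weight matrix

context_ids = [0, 2, 5]               # indices of the C surrounding words
center_id = 3                         # index of the center word to predict

h = W[context_ids].mean(axis=0)       # average of h_c = x_c W over the context
u = h @ Wp                            # score vector
p = np.exp(u) / np.exp(u).sum()       # softmax over the vocabulary

E = -np.log(p[center_id])             # E = -log P(v_O | v_1, ..., v_C)
print(E)
```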
Skip-gram
- There are two ways to learn word embeddings with a (deep) neural network, depending on our perspective.
- One is CBOW (Continuous Bag of Words), which focuses on the context words as the input, and the other is Skip-gram, which focuses on the center word as the input.
- In particular, the Skip-gram method learns to predict the context words from the center word.
- Therefore we want the output $y$ to match the $C$ one-hot encodings of the actual output words, so using maximum log-likelihood, the objective function is the following.
$E = -\log P(v_{O_1}, v_{O_2}, \cdots, v_{O_C} \mid v_I) = -\log \prod_{c=1}^{C} \frac{e^{u_{j_c}}}{\sum_{t=1}^{V} e^{u_t}} = -\sum_{c=1}^{C} u_{j_c} + C \log \sum_{t=1}^{V} e^{u_t}$
where $V$ is the vocabulary size and $u$ is the score vector.
- With $C$ output words, hidden layer dimension $D$, and a hierarchical softmax over the vocabulary of size $V$, the training complexity per example is
$Q = C \times (D + D \times \log_2 V)$
(a minimal Skip-gram sketch follows this list).
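A minimal Skip-gram sketch under the same toy assumptions as the CBOW sketch, computing the loss $E$ exactly as written above (full softmax, no hierarchical softmax or other speed-ups):

```python
import numpy as np

V, D, C = 6, 4, 3
rng = np.random.default_rng(2)
W  = rng.normal(size=(V, D))              # input embedding matrix
Wp = rng.normal(size=(D, V))              # output weight matrix

center_id = 3                             # input word v_I
context_ids = [0, 2, 5]                   # the C actual output words v_{O_1}, ..., v_{O_C}

h = W[center_id]                          # h = x_I W
u = h @ Wp                                # score vector, shared by every output position

log_norm = np.log(np.exp(u).sum())        # log sum_t e^{u_t}
E = -u[context_ids].sum() + C * log_norm  # E = -sum_c u_{j_c} + C log sum_t e^{u_t}
print(E)
```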
Conclusion
- In this paper, the time and memory needed to compute word embeddings are dramatically reduced.
- This makes it possible to train on much more data with higher-dimensional word vectors, and to predict similarity at a high level.
- Previously, the focus was simply on improving either semantic or syntactic similarity, whereas this paper focuses on improving both kinds of similarity (i.e. semantic & syntactic) simultaneously.
- In addition, it became possible to predict words with quite different meanings through simple vector operations that were never explicitly trained (e.g. vector("King") - vector("Man") + vector("Woman") is closest to vector("Queen")), which can be said to be state of the art (SOTA) in language understanding; a small sketch of this follows.
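A small sketch of the analogy-by-vector-arithmetic idea; the 2-D vectors here are hand-made for illustration, whereas real word2vec vectors are learned from a corpus, but the arithmetic works the same way:

```python
import numpy as np

# Toy embeddings chosen so the analogy works; trained vectors behave similarly.
emb = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([0.5, 0.5]),
}

def nearest(query, exclude=()):
    # The word whose vector has the highest cosine similarity to `query`.
    sims = {w: query @ v / (np.linalg.norm(query) * np.linalg.norm(v))
            for w, v in emb.items() if w not in exclude}
    return max(sims, key=sims.get)

result = emb["king"] - emb["man"] + emb["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))   # "queen"
```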