Word2vec
Review

- Iterate through each word of the whole corpus
- Predict surrounding words using word vectors
- Calculate $J(\theta)$ and $\nabla_\theta J(\theta)$
- Update θ so you can predict well
- Each row of $U \in \mathbb{R}^{|V| \times d}$ is a word vector!
Note: You can use $V$ or the average of $U$ and $V$ instead of $U$, or use $V$ for centre words and $U$ for context words (a quick sketch follows below).
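A minimal numpy sketch of the prediction step, with randomly initialized stand-ins for the matrices $U$ (outside/context vectors) and $V$ (centre vectors), one word per row:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 1000, 100
U = rng.normal(size=(vocab_size, d))   # outside (context) vectors, one row per word
V = rng.normal(size=(vocab_size, d))   # centre vectors, one row per word

def predict_context(centre_id):
    """P(o | c): softmax over the vocabulary of scores u_w . v_c."""
    scores = U @ V[centre_id]               # dot product of every context vector with v_c
    probs = np.exp(scores - scores.max())   # numerically stabilized softmax
    return probs / probs.sum()

print(predict_context(42)[:5])    # probabilities of a few words appearing near word 42

word_vectors = U                  # each row of U is a word vector
word_vectors_avg = (U + V) / 2    # or average the two sets of vectors
```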
Q. How do we avoid putting too much weight on high-frequency words (like 'the', 'of', 'and')?
A. Discard the first (largest) component of the word vectors; it mainly carries frequency information (rough sketch below).
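A rough sketch of that idea (my own assumption about the mechanics, not the lecture's exact recipe): estimate the dominant direction of the embedding matrix with an SVD and subtract each vector's projection onto it.

```python
import numpy as np

def remove_top_component(U):
    """Remove the dominant direction (often frequency-related) from word vectors.

    U: (vocab_size, dim) matrix, one word vector per row.
    """
    U_centered = U - U.mean(axis=0)                  # centre the vectors
    _, _, Vt = np.linalg.svd(U_centered, full_matrices=False)
    top = Vt[0]                                      # top singular direction
    return U_centered - np.outer(U_centered @ top, top)  # subtract its projection

# toy usage with random vectors
U = np.random.randn(1000, 50)
U_clean = remove_top_component(U)
```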
Stochastic Gradient Descent
- Problem: $\nabla_\theta J(\theta)$ is very expensive to compute, since $J(\theta)$ is a sum over every window in the corpus
- Stochastic Gradient Descent(SGD): Repeatedly sample windows, and update after each one!
    while True:
        window = sample_window(corpus)                    # sample one window (centre word + its context)
        theta_grad = evaluate_gradient(J, window, theta)  # gradient of J estimated on this window only
        theta = theta - alpha * theta_grad                # gradient step with learning rate alpha
Negative Sampling
- Problem:
$P(o \mid c) = \dfrac{\exp(u_o^T v_c)}{\sum_{w \in V} \exp(u_w^T v_c)}$ ... the normalization factor (denominator) is too computationally expensive because it sums over the whole vocabulary.
- Negative Sampling: Instead of the full softmax, train the true (centre, outside) pair against randomly selected negative (noise) pairs.
- $J_{\text{neg-sample}}(o, v_c, U) = -\log \sigma(u_o^T v_c) - \displaystyle\sum_{k=1}^{K} \log \sigma(-u_k^T v_c)$
- $u_o$ is the vector of the real outside word
- $u_k$ are the vectors of randomly drawn negative samples
- $K$ is the number of negative samples (10-20 works fine)
- Maximize the probability that the real outside word appears; minimize the probability that random words appear around the centre word.
- $P(w) = U(w)^{3/4} / Z$
- $P(w)$: sampling distribution for negative samples
- $U(w)$: unigram distribution
- $Z$: normalization constant
- The $3/4$ power ⇒ common words are sampled less often, rare words more often (this exponent was determined heuristically; see the sketch below)
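A minimal numpy sketch of the two pieces above, using randomly generated toy vectors and counts: the per-pair negative-sampling loss, and the $U(w)^{3/4}$ sampling distribution.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sample_loss(v_c, u_o, u_neg):
    """J_neg-sample for one (centre, outside) pair.

    v_c:   (d,)   centre word vector
    u_o:   (d,)   real outside word vector
    u_neg: (K, d) vectors of K sampled negative words
    """
    return -np.log(sigmoid(u_o @ v_c)) - np.log(sigmoid(-u_neg @ v_c)).sum()

def neg_sampling_dist(word_counts):
    """P(w) proportional to U(w)^(3/4): dampens frequent words, boosts rare ones."""
    probs = word_counts ** 0.75
    return probs / probs.sum()        # division by the sum plays the role of Z

# toy usage
rng = np.random.default_rng(0)
d, K, V = 50, 10, 1000
v_c, u_o = rng.normal(size=d), rng.normal(size=d)
counts = rng.integers(1, 1000, size=V).astype(float)
P = neg_sampling_dist(counts)
neg_ids = rng.choice(V, size=K, p=P)          # draw K negative samples
U_ctx = rng.normal(size=(V, d))
print(neg_sample_loss(v_c, u_o, U_ctx[neg_ids]))
```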
GloVe
Word-document co-occurrence matrix
- A word-document co-occurrence matrix will give general topics, leading to "Latent Semantic Analysis" (LSA)
- Example of a simple window-based word-word co-occurrence matrix (window size 1, symmetric), built from this tiny corpus:
I like deep learning.
I like NLP.
I enjoy flying.
| counts | I | like | enjoy | deep | learning | NLP | flying | . |
|---|---|---|---|---|---|---|---|---|
| I | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 |
| like | 2 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| enjoy | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| deep | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| learning | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| NLP | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| . | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
- Problems with simple co-occurrence vectors:
1) Increase in size with vocabulary
2) Very high dimensional, requires a lot of storage
3) Sparsity issues with vectors
⇒ Naive solution: Singular Value Decomposition (SVD) (see the sketch after this list)
⇒ Better solution: GloVe
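A minimal sketch of the naive SVD route on the toy matrix above (illustration only, not the full LSA/COALS pipeline): factor the count matrix and keep the top-k singular dimensions as dense word vectors.

```python
import numpy as np

words = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying", "."]
X = np.array([                       # the co-occurrence counts from the table above
    [0, 2, 1, 0, 0, 0, 0, 0],
    [2, 0, 0, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
], dtype=float)

U, S, Vt = np.linalg.svd(X)          # X = U diag(S) Vt
k = 2                                # keep only the top-k singular dimensions
vectors = U[:, :k] * S[:k]           # dense k-dimensional word vectors

for w, v in zip(words, vectors):
    print(f"{w:10s} {v}")
```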
Count-based vs Direct-prediction
- The methods of word representation can be categorized as "count-based" or "direct-prediction"
| Count-based | Direct-prediction |
|---|---|
| LSA, HAL, COALS, Hellinger-PCA | Word2vec (Skip-gram, CBOW), NNLM, HLBL, RNN |
| Advantages: fast training; efficient usage of statistics | Advantages: improved performance on other tasks; can capture complex patterns beyond word similarity |
| Disadvantages: primarily used to capture word similarity; disproportionate importance given to large counts | Disadvantages: scales with corpus size; inefficient usage of statistics |
- In other words, count-based methods can capture characteristics of the whole corpus because they take global co-occurrence statistics as input, while direct-prediction methods are stronger at representing the syntactic and semantic features of individual words.
GloVe
- GloVe is designed to combine the advantages of both the count-based and direct-prediction approaches.
- Insight: Ratios of co-occurrence probabilities can encode meaning components.
|  | x = solid | x = gas | x = water | x = random |
|---|---|---|---|---|
| P(x∣ice) | large | small | large | small |
| P(x∣steam) | small | large | large | small |
| P(x∣ice) / P(x∣steam) | large | small | ≈1 | ≈1 |
- The ratio sends words related to both (water) or related to neither (fashion, i.e. a random word) close to 1, while discriminating words (solid, gas) give ratios much larger or much smaller than 1 (toy computation below).
Q. How can we capture ratios of co-occurrence probabilities as linear meaning components in a word vector space?
A. Log-bilinear model!
- Log-bilinear model ("bilinear" because the score is linear in each of the two word vectors separately)
- co-occurrence probability: $P_{ij} = P(j \mid i) = \dfrac{X_{ij}}{X_i}$, where $X_{ij}$ counts how often word $j$ appears in the context of word $i$ and $X_i = \sum_k X_{ik}$
- ratio of co-occurrence probabilities: $\dfrac{P_{ik}}{P_{jk}} = F(w_i, w_j, w_k)$
- We want $F$ to represent the information present in $P_{ik}/P_{jk}$.
Thus, $F(w_i, w_j, w_k) = F(w_i - w_j, w_k) = \dfrac{P_{ik}}{P_{jk}}$
- Consider this: when $w_i \approx w_j$, we want $P_{ik}/P_{jk}$ to be determined solely by word $k$.
- As $P_{ik}/P_{jk}$ is a scalar, $F(w_i - w_j, w_k)$ should also be a scalar.
Thus, $F(w_i - w_j, w_k) = F\big((w_i - w_j)^T w_k\big) = \dfrac{P_{ik}}{P_{jk}}$
- While F could be taken to be a complicated function parameterized by, e.g., a neural network, doing so would obfuscate the linear structure we are trying to capture.
- The distinction between a word and a context word is arbitrary and we are free to exchange the two roles.
Thus, $F\big((w_i - w_j)^T w_k\big) = \dfrac{F(w_i^T w_k)}{F(w_j^T w_k)} = \dfrac{P_{ik}}{P_{jk}}$,
which means $F(w_i^T w_k) = P_{ik} = \dfrac{X_{ik}}{X_i}$ and $F(w_j^T w_k) = P_{jk} = \dfrac{X_{jk}}{X_j}$.
- We require that $F$ be a homomorphism between the groups $(\mathbb{R}, +)$ and $(\mathbb{R}_{>0}, \times)$.
This simply means that we want $F(a + b) = F(a)\,F(b)$.
Thus $F = \exp$, and $F(w_i^T w_k) = \exp(w_i^T w_k) = P_{ik}$.
- Finally we get the log-bilinear model:
$w_i \cdot w_j = \log P(i \mid j) = \log P_{ij}$
And from this we also get, with vector differences:
$w_x \cdot (w_a - w_b) = \log \dfrac{P(x \mid a)}{P(x \mid b)}$
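One step the notes jump over (it is in the GloVe paper, reference 3): how the log-bilinear relation turns into the objective below, by absorbing the $\log X_i$ term into bias parameters. From here the context-side vectors are written with tildes ($\tilde{w}$, $\tilde{b}$), matching the cost function.

```latex
% From F = exp:
w_i^T \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i
% \log X_i does not depend on k, so absorb it into a bias b_i, and add
% \tilde{b}_k to restore symmetry between i and k:
w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}
% The GloVe objective penalizes the squared deviation from this equality,
% weighted by f(X_{ik}).
```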
- Cost function of GloVe:
$J = \displaystyle\sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$
- $f$ is a weighting function designed to cap the effect of very frequent co-occurrences (and it gives zero weight to pairs with $X_{ij} = 0$)
- Note: $b_i$ and $\tilde{b}_j$ are introduced to keep the expression symmetric in word $i$ and word $j$
- For more details, check the original GloVe paper.
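A minimal numpy sketch of this objective, using the weighting function with the paper's published defaults ($x_{\max} = 100$, $\alpha = 3/4$); the co-occurrence matrix and parameters here are random toy stand-ins, not a training loop.

```python
import numpy as np

def f_weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: grows as (x/x_max)^alpha, capped at 1, zero for x = 0."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0) * (x > 0)

def glove_loss(X, W, W_tilde, b, b_tilde):
    """J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    log_X = np.log(np.where(X > 0, X, 1.0))          # log X_ij (zero-count entries get weight 0 anyway)
    diff = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - log_X
    return np.sum(f_weight(X) * diff ** 2)

# toy usage with a random co-occurrence matrix and random parameters
rng = np.random.default_rng(0)
V, d = 8, 4
X = rng.integers(0, 50, size=(V, V)).astype(float)
W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)
print(glove_loss(X, W, W_tilde, b, b_tilde))
```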
vs Word2vec
- Someone's opinion: see the Stack Overflow answer in the references (link 6) for a practitioner's comparison of GloVe and Word2vec.
If there is something wrong in my writing or understanding, please comment and make corrections!
[references]
1. https://youtu.be/kEMJRjEdNzM
2. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture02-wordvecs2.pdf
3. https://aclanthology.org/D14-1162/
4. https://qr.ae/pGPB2h
5. https://youtu.be/cYzp5IWqCsg
6. https://stackoverflow.com/questions/56071689/whats-the-major-difference-between-glove-and-word2vec