Summary of Word2vec
Keywords: CBOW, Skip-gram
Prerequisites
- Statistical NLP
- Create a statistical model to calculate the probability P(s) of a sentence.
- Using the chain rule of probability,
$P(s) = P(w_1 w_2 \cdots w_n) = P(w_1)P(w_2 \mid w_1) \cdots P(w_n \mid w_1 w_2 \cdots w_{n-1}) = \prod_{i=1}^{n} P(w_i \mid h_{i-1})$, where $h_{i-1} = w_1 w_2 \cdots w_{i-1}$ and $P(w_i \mid h_{i-1}) = \frac{\text{count}(h_{i-1} w_i)}{\text{count}(h_{i-1})}$.
- i.e. word probabilities are estimated from relative counts (frequencies) in the training corpus.
- There is a sparsity problem: if $h_{i-1}$ is too long, the sequence $h_{i-1} w_i$ may never appear in the training corpus.
- For example, if $h_{i-1}$ = "I have a pen, I have an apple, Ah, apple", then $w_i$ should be "pen".
- But we are unlikely to observe such a long history in the training corpus.
- So we have to limit the history length => N-gram model (a minimal counting sketch follows this list).
- This statistical method has a fundamental problem: it cannot compare words with each other. In other words, it only models sentences, not individual words.
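As a rough illustration of the counting approach and the sparsity issue, here is a minimal bigram sketch; the toy corpus and function names are my own, not from the paper:

```python
# Count-based bigram language model on a tiny toy corpus.
from collections import Counter

corpus = "i have a pen i have an apple".split()

unigram = Counter(corpus)                      # count(h_{i-1}) with a one-word history
bigram = Counter(zip(corpus, corpus[1:]))      # count(h_{i-1} w_i)

def p_next(word, history_word):
    # P(w_i | h_{i-1}) ~= count(h_{i-1} w_i) / count(h_{i-1})
    return bigram[(history_word, word)] / unigram[history_word]

print(p_next("a", "have"))   # 0.5: "have" is followed by "a" once out of two occurrences
print(p_next("pen", "an"))   # 0.0: unseen bigram -> the sparsity problem
```

The unseen pair simply gets probability 0, which is exactly the sparsity problem described above.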
- Distributional semantic model
- Main idea: the distributional hypothesis, "Similar words occur in similar contexts".
- According to this hypothesis, our goal is to quantify semantic similarities (using various methods, e.g. vector similarities) between words based on their distributional properties.
- To compare words, we represent each word as a vector => vector representation
- One-hot encoding: No information about word similarity
- Dense representation: embed words in a continuous vector space of much lower dimension (e.g. $N \approx 100 \sim 1000$); see the sketch after this list.
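To make the contrast concrete, here is a small sketch comparing one-hot and dense vectors with cosine similarity; the dense vectors are hand-made toy values, not learned embeddings:

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# One-hot: every pair of distinct words has similarity 0, so no similarity information.
apple_onehot = np.array([1.0, 0.0, 0.0])
plum_onehot  = np.array([0.0, 1.0, 0.0])
print(cosine(apple_onehot, plum_onehot))   # 0.0

# Dense: low-dimensional vectors can place similar words close together.
apple_dense = np.array([0.8, 0.1, 0.3])
plum_dense  = np.array([0.7, 0.2, 0.2])
print(cosine(apple_dense, plum_dense))     # ~0.99
```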
Previous work
- NNLM (Neural Network Language Model)
- To improve on the above idea, we train the word embeddings using a neural network (deep learning).
- This is more effective for NLP problems, so the question of how to embed words efficiently has attracted a lot of attention.
- NNLM here is like a bigram model; it consists of two fully connected layers:
- Linear
- Linear + softmax
- The input is a one-hot encoded vector, which is forward-passed through the hidden layer to the output layer.
- Before the embedding vectors are determined, a softmax is applied to the output so it can be compared with the target vector.
- Let $x_i$ be the one-hot encoded input of the $i$-th word, and let $W$, $W'$ be weight matrices, where the hidden layer has size $N$.
- Then $h = x_i W$ and $u = h W'$.
- After the softmax, $P(v_j \mid v_i) = \frac{e^{u_j}}{\sum_{t=1}^{V} e^{u_t}}$ (a numerical sketch follows this list).
- Later, this model was extended and the RNN-based RNNLM was proposed, which maintains context over time through a recurrent matrix that connects the hidden layer to itself.
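A minimal numerical sketch of the two-layer forward pass described above, with random weights and a tiny vocabulary chosen only for illustration:

```python
import numpy as np

V, N = 5, 3                      # vocabulary size, hidden (embedding) size
rng = np.random.default_rng(0)
W  = rng.normal(size=(V, N))     # input -> hidden weights (rows are word embeddings)
Wp = rng.normal(size=(N, V))     # hidden -> output weights

x_i = np.zeros(V); x_i[2] = 1.0  # one-hot encoding of the i-th word

h = x_i @ W                      # h = x_i W  (just selects row 2 of W)
u = h @ Wp                       # u = h W'   (score vector)
p = np.exp(u) / np.exp(u).sum()  # softmax: P(v_j | v_i) = e^{u_j} / sum_t e^{u_t}

print(p, p.sum())                # a probability distribution over the vocabulary
```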
Problems of Previous work
- Too slow and requires huge memory
- For example, in the NNLM above, if the vocabulary size is $V = 10^4$ and the vector size is $N = 3 \times 10^2$, then we have to update $2 \times V \times N = 6 \times 10^6$ weights (the two matrices $W$ and $W'$)!
- In practice, the vocabulary is much larger than $10^4$, and the vector size must also grow to keep up in performance.
- With this many weights and this much memory, we cannot train on a large corpus!
- Can't perform vector calculations between words
- If we are given the three hints "fruit", "red", and "not a plum", then we can guess "apple".
- Since the model does not learn syntactic (or semantic) similarity between words, this kind of task is impossible.
CBOW
- Generate the "center word" from the "surrounding words".
- Generate the one-hot encodings $\{x_1, x_2, \cdots, x_C\}$ of the $C$ input (context) words.
- Forward pass to get the embedded word vectors $\{h_1, h_2, \cdots, h_C\}$ via $h_c = x_c W$ (the weight matrix $W$ is shared across all context positions, like a convolution).
- Take the average $\bar{h}$ of these vectors, generate the score vector $u = \bar{h} W'$, and then apply the softmax.
- Our goal is to minimize the loss function
$E = -\log P(v_O \mid v_1, v_2, \cdots, v_C)$
- With $N$ input words, hidden layer dimension $D$, and a hierarchical softmax over the vocabulary of size $V$, the training complexity per example is
$Q = N \times D + D \times \log_2 V$
(a minimal CBOW sketch follows this list).
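A minimal CBOW sketch, assuming a toy vocabulary and random weights; it averages the context embeddings, applies the softmax, and evaluates the loss $E$ from above (plain softmax, not the hierarchical version used for the complexity estimate):

```python
import numpy as np

V, D, C = 6, 4, 3                     # vocabulary size, hidden size, number of context words
rng = np.random.default_rng(1)
W  = rng.normal(size=(V, D))          # shared input embedding matrix
Wp = rng.normal(size=(D, V))          # output weight matrix

context_ids = [0, 2, 5]               # indices of the C surrounding words
center_id = 3                         # index of the center word to predict

h = W[context_ids].mean(axis=0)       # average of h_c = x_c W over the context
u = h @ Wp                            # score vector
p = np.exp(u) / np.exp(u).sum()       # softmax over the vocabulary

E = -np.log(p[center_id])             # E = -log P(v_O | v_1, ..., v_C)
print(E)
```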
Skip-gram
- There are two ways to learn word embeddings with a (deep) neural network, depending on our perspective.
- One is CBOW (Continuous Bag of Words), which focuses on the context words as the input, and the other is Skip-gram, which focuses on the center word as the input.
- In particular, the Skip-gram method learns to predict the context words from the center word.
- Therefore we want the output $y$ to match the $C$ one-hot encodings of the actual output words, so using maximum log-likelihood, the objective function is the following.
$E = -\log P(v_{O_1}, v_{O_2}, \cdots, v_{O_C} \mid v_I) = -\log \prod_{c=1}^{C} \frac{e^{u_{j_c}}}{\sum_{t=1}^{V} e^{u_t}} = -\sum_{c=1}^{C} u_{j_c} + C \log \sum_{t=1}^{V} e^{u_t}$
where $V$ is the vocabulary size and $u$ is the score vector.
- With $C$ output words, hidden layer dimension $D$, and a hierarchical softmax over the vocabulary of size $V$, the training complexity per example is
$Q = C \times (D + D \times \log_2 V)$
(a minimal Skip-gram sketch follows this list).
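A minimal Skip-gram sketch under the same toy assumptions as the CBOW sketch, computing the loss $E$ exactly as written above (full softmax, no hierarchical softmax or other speed-ups):

```python
import numpy as np

V, D, C = 6, 4, 3
rng = np.random.default_rng(2)
W  = rng.normal(size=(V, D))              # input embedding matrix
Wp = rng.normal(size=(D, V))              # output weight matrix

center_id = 3                             # input word v_I
context_ids = [0, 2, 5]                   # the C actual output words v_{O_1}, ..., v_{O_C}

h = W[center_id]                          # h = x_I W
u = h @ Wp                                # score vector, shared by every output position

log_norm = np.log(np.exp(u).sum())        # log sum_t e^{u_t}
E = -u[context_ids].sum() + C * log_norm  # E = -sum_c u_{j_c} + C log sum_t e^{u_t}
print(E)
```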
Conclusion
- In this paper, the time and memory needed to compute word embeddings are dramatically reduced.
- This makes it possible to train on much more data with higher-dimensional word vectors, and to predict similarity at a high level.
- Previously, the focus was simply on improving either semantic or syntactic similarity, whereas this paper focuses on improving both kinds of similarity (i.e. semantic & syntactic) simultaneously.
- In addition, it became possible to predict words with quite different meanings through simple vector operations that were never explicitly trained (e.g. vector("King") - vector("Man") + vector("Woman") is closest to vector("Queen")), which can be said to be state of the art (SOTA) in language understanding; a small sketch of this follows.
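A small sketch of the analogy-by-vector-arithmetic idea; the 2-D vectors here are hand-made for illustration, whereas real word2vec vectors are learned from a corpus, but the arithmetic works the same way:

```python
import numpy as np

# Toy embeddings chosen so the analogy works; trained vectors behave similarly.
emb = {
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.9, 0.2]),
    "man":   np.array([0.1, 0.8]),
    "woman": np.array([0.1, 0.2]),
    "apple": np.array([0.5, 0.5]),
}

def nearest(query, exclude=()):
    # The word whose vector has the highest cosine similarity to `query`.
    sims = {w: query @ v / (np.linalg.norm(query) * np.linalg.norm(v))
            for w, v in emb.items() if w not in exclude}
    return max(sims, key=sims.get)

result = emb["king"] - emb["man"] + emb["woman"]
print(nearest(result, exclude={"king", "man", "woman"}))   # "queen"
```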