[Paper Review] Word2Vec

Sung Jae Hyuk · October 30, 2023

Summary of Word2Vec

Keywords: CBOW, Skip-gram

Prerequisites

  1. Statistical NLP

    • Create a statistical model to calculate the probability $P(s)$ of a sentence.
    • Using the chain rule of probability,
      $$\begin{aligned} P(s)&=P(w_1w_2\cdots w_n)\\&=P(w_1)P(w_2|w_1)\cdots P(w_n|w_1w_2\cdots w_{n-1})\\&=\prod_{i=1}^n P(w_i|h_{i-1}) \end{aligned}$$
      where $h_{i-1}=w_1w_2\cdots w_{i-1}$ and $P(w_i|h_{i-1})=\dfrac{\text{count}(h_{i-1}w_i)}{\text{count}(h_{i-1})}$.
    • i.e. the probability of each word is estimated from frequency counts in the corpus.
    • There is a sparsity problem: if $h_{i-1}$ is too long, $h_{i-1}w_i$ does not occur in the training corpus.
    • For example, if $h_{i-1}$ = "I have a pen, I have an apple, Ah, apple", then $w_i$ will be "pen".
    • But we cannot observe the whole sentence in the training corpus because of its length.
    • So we have to limit the history length => N-gram model (see the bigram sketch after this list).
    • This statistical method has a big problem: it is impossible to compare words. In other words, it only models sentences, not individual words.
  2. Distributional semantic model

    • Main idea: Distributional hypothesis "Similar words occur in similar contexts".
    • According to this hypothesis, our goal is to quantify semantic similarities (using various methods, e.g. vector similarities) between words based on their distributional properties.
    • To compare words, we represent each word as a vector => Vector representation
      1. One-hot encoding: No information about word similarity
      2. Dense representation: Embed words in a continuous vector of much lower dimension (s.t. $N\approx 100\sim 1000$)
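
To make the count-based estimate above concrete, here is a minimal Python sketch restricted to a bigram history; the toy corpus and the helper name `bigram_prob` are made up for illustration. It computes $P(w_i|w_{i-1})=\frac{\text{count}(w_{i-1}w_i)}{\text{count}(w_{i-1})}$ from raw counts and shows how an unseen bigram gets probability zero, which is exactly the sparsity problem.

```python
from collections import Counter

# Toy corpus; in practice this would be a large training corpus.
corpus = "i have a pen i have an apple".split()

# Count unigrams and bigrams so that
#   P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """Count-based MLE of P(word | prev_word); 0.0 for unseen bigrams."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("have", "a"))   # 0.5 -> bigram seen in the corpus
print(bigram_prob("an", "pen"))   # 0.0 -> unseen bigram: the sparsity problem
```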

Previous work

  1. NNLM (Neural Network Language Model)
    • To improve on the above idea, we train word embeddings using a neural network (deep learning).
    • This is more effective for NLP problems, so the question of how to embed words efficiently has attracted a lot of attention.
    • Like a bi-gram model, NNLM takes one word as input; the model consists of two fully connected layers:
      1. Linear
      2. Linear + softmax
    • The input is a one-hot encoded vector, which is forward-passed through the hidden layer to the output layer.
    • Before the embedding vectors are determined, a softmax is applied so the output can be compared with the target vector.
    • Let $x_i$ be the one-hot encoded input of the $i$-th word, and $W,\ W'$ be the weight matrices, with hidden layer size $N$.
    • Then $\mathbf{h}=\mathbf{x}_i W$ and $\mathbf{u}=\mathbf{h}W'$.
    • After softmax, $P(v_j|v_i)=\dfrac{e^{u_j}}{\sum_{t=1}^V e^{u_t}}$ (see the sketch after this list).
    • Later, the model was developed further and the RNN-based RNNLM was proposed; it has no projection layer, and its recurrent (hidden-state) matrix lets it keep information from arbitrarily long contexts.
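
Below is a minimal NumPy sketch of the bigram-like NNLM forward pass described above ($\mathbf{h}=\mathbf{x}_i W$, $\mathbf{u}=\mathbf{h}W'$, then softmax). The sizes $V=10$, $N=4$ and the function name `forward` are illustrative only, not from the paper.

```python
import numpy as np

V, N = 10, 4                                   # vocabulary size and hidden size (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))         # input -> hidden weights (the embedding table)
W_prime = rng.normal(scale=0.1, size=(N, V))   # hidden -> output weights

def forward(i):
    """Given the index i of the input word, return P(v_j | v_i) for all j."""
    x = np.zeros(V)
    x[i] = 1.0                                 # one-hot encoding of the i-th word
    h = x @ W                                  # h = x_i W  (just selects row i of W)
    u = h @ W_prime                            # score vector u = h W'
    return np.exp(u) / np.exp(u).sum()         # softmax over the vocabulary

probs = forward(3)
print(probs.shape, probs.sum())                # (10,) ~1.0
```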

Problem of Previous work

  1. Too slow and requires huge memory

    • For example, in the NNLM above, if the vocabulary size is $10^4$ and the vector size is $3\times 10^2$, then we have to update $6\times 10^6$ weights (two weight matrices of size $10^4 \times 3\times 10^2$ each)!
    • In practice, the vocabulary size is much larger than $10^4$, so the vector size must also increase to perform well.
    • With so many weights and so much memory, we cannot train on a large corpus!
  2. Can't perform vector arithmetic

    • If we have the three words "Fruits", "red", and "not plum", then we can guess "apple" (see the sketch after this list).
    • But since the model does not learn syntactic (or semantic) similarity, this task is impossible.
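
For intuition, here is a sketch of the kind of vector arithmetic this refers to, assuming we already have trained embeddings. The random vectors and the helper `most_similar` are hypothetical, so the printed answer is only meaningful with real Word2vec embeddings, where the query would land near "apple".

```python
import numpy as np

# Hypothetical pre-trained embeddings (random here, just to show the mechanics;
# real Word2vec vectors would be needed for the analogy to actually work).
words = ["fruit", "red", "plum", "apple", "pen"]
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=8) for w in words}

def most_similar(query, exclude):
    """Return the word whose embedding has the highest cosine similarity to `query`."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(emb[w], query))

# "fruit" + "red" - "plum": with good embeddings this should land near "apple".
query = emb["fruit"] + emb["red"] - emb["plum"]
print(most_similar(query, exclude={"fruit", "red", "plum"}))
```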

CBOW

  • The goal is to generate the "center word" using the "surrounding words".
  • Generate one-hot encodings $\left\{x_1,\ x_2,\ \cdots,\ x_C\right\}$ of the $C$ input (context) words.
  • Forward pass to get the embedded word vectors $\left\{h_1,\ h_2,\ \cdots,\ h_C\right\}$ by $h_c=x_c W$ (sharing the weight matrix across positions, like a convolution).
  • Take the average, generate the score vector via $W'$, and then apply softmax (see the sketch after this list).
  • Our goal is to minimize the loss function
    $$E = -\log P(v_O|v_1,\ v_2,\ \cdots,\ v_C)$$
  • The number of input words is $N$, the dimension of the hidden layer is $D$, and then a (hierarchical) softmax is applied, so the complexity is
    $$Q=N\times D + D \times \log_2{V}$$
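
As a rough sketch of one CBOW step, the following computes $E=-\log P(v_O|v_1,\ \cdots,\ v_C)$ for a single example, using a plain full softmax for clarity rather than the hierarchical softmax that gives the $\log_2 V$ term in the complexity above. All sizes and names are illustrative.

```python
import numpy as np

V, D = 10, 4                                   # vocabulary size, embedding dimension (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, D))         # shared input embedding matrix
W_prime = rng.normal(scale=0.1, size=(D, V))   # output weight matrix

def cbow_loss(context_ids, center_id):
    """E = -log P(v_center | context words) for one training example."""
    h = W[context_ids].mean(axis=0)            # average of the context embeddings (h_c = x_c W)
    u = h @ W_prime                            # score vector
    p = np.exp(u) / np.exp(u).sum()            # full softmax (the paper uses hierarchical softmax)
    return -np.log(p[center_id])

print(cbow_loss(context_ids=[1, 2, 4, 5], center_id=3))
```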

Skip-gram

  • There are two ways to learn word embeddings using a (deep) neural network, depending on our perspective.
  • One is CBOW (Continuous Bag of Words), which focuses on the context words, and the other is Skip-gram, which focuses on the center word.
  • In particular, the skip-gram method learns the context words using the center word.
  • Therefore we want $\boldsymbol{y}$ to match the $C$ one-hot encodings of the actual output words, so using maximum log-likelihood, the objective function is the following (see the sketch after this section).
$$\begin{aligned} E&=-\log P(v_{O_1},\ v_{O_2},\ \cdots,\ v_{O_C}|v_I)\\ &=-\log \prod_{c=1}^C \dfrac{e^{u_{j_c}}}{\sum_{t=1}^V e^{u_t}}\\ &=-\sum_{c=1}^C u_{j_c} +C \log\sum_{t=1}^V e^{u_t} \end{aligned}$$

where $V$ is the vocabulary size and $u$ is the score vector.

  • The number of output words is $C$, the dimension of the hidden layer is $D$, and then a (hierarchical) softmax is applied, so the complexity is
    $$Q=C\times (D + D \times \log_2{V})$$
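
Here is a minimal sketch of the skip-gram loss above for one (center word, context window) pair, again with a full softmax for clarity instead of the hierarchical softmax assumed in the complexity formula; the sizes and names are illustrative.

```python
import numpy as np

V, D = 10, 4                                   # vocabulary size, embedding dimension (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, D))         # center-word embedding matrix
W_prime = rng.normal(scale=0.1, size=(D, V))   # output weight matrix

def skipgram_loss(center_id, context_ids):
    """E = -sum_c u_{j_c} + C * log(sum_t exp(u_t)) for one center word and its C context words."""
    h = W[center_id]                           # embedding of the center word v_I
    u = h @ W_prime                            # score vector over the vocabulary
    log_Z = np.log(np.exp(u).sum())            # log of the softmax normaliser
    return -u[context_ids].sum() + len(context_ids) * log_Z

print(skipgram_loss(center_id=3, context_ids=[1, 2, 4, 5]))
```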

Conclusion

  • In this paper, the authors dramatically reduce the time and memory needed to compute word embeddings.
  • This makes it possible to use more layers and word vectors, and to predict similarity at a high level.
  • Previously, the focus was simply on improving either semantic or syntactic similarity; this paper focuses on improving both kinds of similarity (semantic & syntactic) simultaneously.
  • In addition, it became possible to predict words with completely different meanings through vector operations that were never explicitly trained for, which can be said to be state of the art (SOTA) in language understanding.