[Paper Review] Word2Vec

Sung Jae Hyuk · October 30, 2023

Summary of Word2Vec

Keywords: CBOW, Skip-gram

Prerequisites

  1. Statistical NLP

    • Create a statistical model to calculate the probability $P(s)$ of a sentence.
    • Use the chain rule of probability,
      $$\begin{aligned} P(s)&=P(w_1w_2\cdots w_n)\\&=P(w_1)P(w_2|w_1)\cdots P(w_n|w_1w_2\cdots w_{n-1})\\&=\prod_{i=1}^n P(w_i|h_{i-1}) \end{aligned}$$
      where $h_{i-1}=w_1w_2\cdots w_{i-1}$ and $P(w_i|h_{i-1})=\dfrac{\text{count}(h_{i-1}w_i)}{\text{count}(h_{i-1})}$
    • i.e. the probability of each word is estimated from raw counts in the training corpus (see the toy bigram sketch after this list).
    • There is a sparsity problem (if $h_{i-1}$ is too long, $h_{i-1}w_i$ may not appear in the training corpus).
    • For example, if $h_{i-1}$ = "I have a pen, I have an apple, Ah, apple", then $w_i$ will be "pen".
    • But such a long history will rarely be observed in the training corpus.
    • So we have to limit the history length => N-gram models.
    • This statistical method has a fundamental limitation: it models sentences rather than individual words, so there is no way to compare words with each other.
  2. Distributional semantic model

    • Main idea: Distributional hypothesis "Similar words occur in similar contexts".
    • According to this hypothesis, our goal is to quantify semantic similarities (using various methods, e.g. vector similarities) between words based on their distributional properties.
    • To compare words, we represent each word as a vector => vector representation
      1. One-hot encoding: no information about word similarity (all distinct words are orthogonal; see the toy comparison below)
      2. Dense representation: embed words in a continuous vector space of much lower dimension (e.g. $N\approx 100\sim 1000$)
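
The count-based estimate above and its sparsity problem can be seen in a minimal sketch. This is my own toy example (the corpus, the words, and the restriction to bigrams are not from the paper):

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large training corpus.
corpus = "i have a pen i have an apple".split()

unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, h):
    """P(w | h) = count(h w) / count(h) -- the bigram (N = 2) case of the formula above."""
    return bigram[(h, w)] / unigram[h] if unigram[h] else 0.0

print(p_bigram("a", "have"))     # 0.5  = count("have a") / count("have")
print(p_bigram("plum", "have"))  # 0.0  -- sparsity: an unseen pair gets zero probability
```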

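To see why one-hot encodings carry no similarity information while dense vectors can, here is a small comparison. The dense vectors are hand-picked toy values, not trained embeddings:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot vectors: every pair of distinct words is orthogonal -> similarity is always 0.
apple, plum, car = np.eye(3)
print(cos(apple, plum), cos(apple, car))   # 0.0 0.0

# Dense vectors (toy values): similar words can end up close to each other.
apple_d = np.array([0.9, 0.1, 0.3])
plum_d  = np.array([0.8, 0.2, 0.25])
car_d   = np.array([-0.1, 0.9, -0.4])
print(cos(apple_d, plum_d))   # ~0.99 -> "apple" and "plum" are similar
print(cos(apple_d, car_d))    # much lower -> "apple" and "car" are not
```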
Previous work

  1. NNLM (Neural Network Language Model)
    • To improve on the above idea, we train word embeddings using a neural network (deep learning).
    • This is more effective for NLP problems, so the question of how to embed words efficiently has attracted a lot of attention.
    • NNLM resembles a bi-gram model; it consists of two fully connected layers:
      1. Linear
      2. Linear + softmax
    • The input is a one-hot encoded vector, which is passed forward through the hidden layer => output layer.
    • While the embedding vectors are being learned, a softmax is applied so the output can be compared with the target vector.
    • Let $x_i$ be the one-hot encoded input of the $i$th word, and $W,~W'$ be the weight matrices, with hidden layer size $N$.
    • Then $\mathbf{h}=\mathbf{x}_i W$ and $\mathbf{u}=\mathbf{h}W'$.
    • After softmax, $P(v_j|v_i)=\dfrac{e^{u_j}}{\sum_{t=1}^V e^{u_t}}$ (see the toy forward-pass sketch after this list).
    • Later, this model was extended and the RNN-based RNNLM was proposed; its recurrent hidden-state matrix lets it summarize arbitrarily long histories instead of a fixed context window.
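
A minimal numpy sketch of the forward pass in the bullets above. The vocabulary size, hidden size, word index, and random weights are toy values of mine, and only the single-input-word case written above is shown:

```python
import numpy as np

V, N = 10, 5                        # toy vocabulary size and hidden layer size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, N))         # input -> hidden weights
W_prime = rng.normal(size=(N, V))   # hidden -> output weights (W')

i = 3                               # index of the input word v_i
x = np.zeros(V)
x[i] = 1.0                          # one-hot encoding of v_i

h = x @ W                           # h = x_i W   (just selects row i of W)
u = h @ W_prime                     # u = h W'
p = np.exp(u) / np.exp(u).sum()     # softmax: P(v_j | v_i) = e^{u_j} / sum_t e^{u_t}

print(p.shape, round(p.sum(), 6))   # (10,) 1.0 -- a probability distribution over the vocabulary
```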

Problems of previous work

  1. Too slow and too much memory

    • For example, in the NNLM above, if the vocabulary size is $10^4$ and the vector size is $3\times 10^2$, then we have to update $6\times 10^6$ weights ($W$ and $W'$ are each $10^4\times 3\times 10^2 = 3\times 10^6$)!
    • In practice, the vocabulary is much larger than $10^4$, and the vector size must also grow to keep up in performance.
    • With this many weights and this much memory, we cannot train on a large corpus!
  2. Can't perform vector calculation

    • If we are given the three clues "fruit", "red", and "not a plum", we can guess "apple".
    • Since the model does not learn syntactic (or semantic) similarity, this kind of task is impossible for it (see the toy vector-arithmetic sketch after this list).
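
What we would like, and what word2vec later makes possible, is to answer such queries by simple vector arithmetic over learned embeddings. A toy sketch; the embedding values below are hand-picked by me purely for illustration, not trained vectors:

```python
import numpy as np

# Hypothetical toy embeddings; in word2vec these would be rows of the learned matrix W.
emb = {
    "fruit": np.array([0.9, 0.1, 0.0]),
    "red":   np.array([0.1, 0.9, 0.0]),
    "plum":  np.array([0.8, 0.2, 0.7]),
    "apple": np.array([0.85, 0.6, 0.1]),
    "car":   np.array([0.0, 0.1, 0.9]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "fruit" + "red" - "plum": combine the clue vectors and rank the remaining words.
query = emb["fruit"] + emb["red"] - emb["plum"]
scores = {w: cos(query, v) for w, v in emb.items() if w not in ("fruit", "red", "plum")}
print(max(scores, key=scores.get))  # "apple" with these toy vectors
```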

CBOW

  • The goal is to predict the "center word" from the "surrounding words".
  • Generate the one-hot encodings $\left\{x_1,\ x_2,\ \cdots,\ x_C\right\}$ of the $C$ input (context) words.
  • Forward pass to get the embedded word vectors $\left\{h_1,\ h_2,\ \cdots,\ h_C\right\}$ by $h_c=x_c W$ (the weight matrix $W$ is shared across positions, as in a convolution).
  • Take the average of these vectors, generate the score vector, and then apply softmax (see the sketch after this list).
  • Our goal is to minimize the loss function
    $$E = -\log P(v_O\mid v_1,\ v_2,\ \cdots,\ v_C)$$
  • The number of input (context) words is $N$, the hidden layer dimension is $D$, and the output uses a hierarchical softmax (the source of the $\log_2 V$ term), so the complexity is
    $$Q=N\times D + D \times \log_2 V$$
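
A minimal numpy sketch of this CBOW forward pass and loss. The sizes, word indices, and random weights are toy values of mine; the real model is trained by backpropagating this loss:

```python
import numpy as np

V, D, C = 10, 5, 4                  # toy vocabulary size, hidden dimension, context size
rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))         # shared input -> hidden weights
W_prime = rng.normal(size=(D, V))   # hidden -> output weights (W')

context = [1, 4, 6, 7]              # indices of the C surrounding words
target = 5                          # index of the center word v_O

h = W[context].mean(axis=0)         # average of the embedded context vectors h_c = x_c W
u = h @ W_prime                     # score vector
p = np.exp(u) / np.exp(u).sum()     # softmax over the vocabulary

loss = -np.log(p[target])           # E = -log P(v_O | v_1, ..., v_C)
print(loss)
```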

Skip-gram

  • There are two ways to learn word embeddings with a (deep) neural network, depending on our perspective.
  • One is CBOW (Continuous Bag of Words), which focuses on the context words, and the other is Skip-gram, which focuses on the center word.
  • In particular, the Skip-gram method learns to predict the context words from the center word.
  • Therefore we want $\boldsymbol{y}$ to match the $C$ one-hot encodings of the actual output words, so by maximum log-likelihood the objective function is the following (a toy sketch follows below).
$$\begin{aligned} E&=-\log P(v_{O_1},\ v_{O_2},\ \cdots,\ v_{O_C}\mid v_I)\\ &=-\log \prod_{c=1}^C \dfrac{e^{u_{j_c}}}{\sum_{t=1}^V e^{u_t}}\\ &=-\sum_{c=1}^C u_{j_c} +C \log\sum_{t=1}^V e^{u_t} \end{aligned}$$

where $V$ is the vocabulary size and $u$ is the score vector.

  • The number of output (context) words is $C$, the hidden layer dimension is $D$, and the output again uses a hierarchical softmax, so the complexity is
    $$Q=C\times (D + D \times \log_2 V)$$
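
A matching numpy sketch of the Skip-gram loss above, using the same toy sizes and random weights as before; the indices $j_1,\dots,j_C$ are the actual context words:

```python
import numpy as np

V, D, C = 10, 5, 4                  # toy vocabulary size, hidden dimension, number of context words
rng = np.random.default_rng(0)
W = rng.normal(size=(V, D))         # input -> hidden weights
W_prime = rng.normal(size=(D, V))   # hidden -> output weights (W')

center = 5                          # index of the center word v_I
context = [1, 4, 6, 7]              # indices j_1, ..., j_C of the actual output words

h = W[center]                       # hidden layer for the center word
u = h @ W_prime                     # one shared score vector for all C outputs

# E = -sum_c u_{j_c} + C * log(sum_t e^{u_t})
loss = -u[context].sum() + C * np.log(np.exp(u).sum())
print(loss)
```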

Conclusion

  • In this paper, the time and memory needed to compute word embeddings are dramatically reduced.
  • This makes it possible to train on much more data with larger word vectors, and to capture similarity at a much higher level.
  • Previously, the focus was simply on improving one kind of similarity (semantic or syntactic); this paper focuses on improving both kinds (semantic & syntactic) simultaneously.
  • In addition, vector operations the model was never explicitly trained for (e.g. vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")) can recover words with completely different meanings, which can be called state of the art (SOTA) in language understanding.