CS224N (1) Introduction and Word Vectors

Sanghyeok Choi · June 27, 2021

Note: I'll only write about the things that I think are noteworthy.


Lecture by Prof. Christopher Manning
Check course website

"It was the development of language that makes human being invincible."

Representation of the meaning of a word

  1. WordNet: A thesaurus containing lists of synonym sets and hypernyms
    • Problems
      missing nuance / missing new meanings of words / can't compute word similarity, etc.
  2. Discrete Symbols (One-Hot vectors)
    • Problems
      Vector dimension = # words in vocabulary (too big)
      Any two distinct one-hot vectors are orthogonal => can't compute word similarity (see the sketch after this list)
  3. Distributional semantics: A word's meaning is given by the words that frequently appear close-by
    • Word vectors (a.k.a. word embeddings, word representations)
      To represent a word w, use the set of words that frequently appear near w (its "contexts").
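A quick sketch of this contrast in Python (the hotel/motel/banking words echo the lecture's example, but the context words and counts below are made up for illustration):

```python
import numpy as np

# Toy vocabulary; the context words and counts below are made-up examples.
vocab = ["hotel", "motel", "banking"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# One-hot vectors: any two distinct words have dot product 0,
# so "hotel" looks no more similar to "motel" than to "banking".
print(one_hot["hotel"] @ one_hot["motel"])    # 0.0
print(one_hot["hotel"] @ one_hot["banking"])  # 0.0

# Distributional idea: represent a word by counts of the words seen in its contexts
# (hypothetical counts of "cheap", "room", "money" appearing nearby).
hotel   = np.array([3.0, 5.0, 0.0])
motel   = np.array([4.0, 4.0, 0.0])
banking = np.array([0.0, 0.0, 7.0])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(hotel, motel))    # ~0.97 -> similar
print(cosine(hotel, banking))  # 0.0   -> dissimilar
```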

Word2Vec

Word2vec(Mikolov et al. 2013) is a framework for learning word vectors.

Idea

1) We have a large corpus (pile of text)
2) Every word in a fixed vocabulary is represented by a vector (initialized randomly)
3) Go through each position t in the text, which has a center word c and context ("outside") words o (a sliding-window sketch in code follows this list)
4) Use the similarity of the word vectors for c and o to calculate $P(o \mid c)$ or $P(c \mid o)$
    Note: $P(o \mid c)$ is for Skip-gram, and $P(c \mid o)$ is for CBOW (Continuous Bag of Words)
5) Keep adjusting the word vectors to maximize this probability
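A minimal sketch of the sliding window from steps 1–3; the short corpus echoes the lecture's "problems turning into banking crises" example, and extract_pairs is just an illustrative helper name, not the course's code:

```python
# Sketch of the sliding window that yields (center, outside) word pairs.
def extract_pairs(tokens, window_size=2):
    """Yield (center, outside) pairs for every position t in the text."""
    for t, center in enumerate(tokens):
        for j in range(-window_size, window_size + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the center itself and positions outside the text
            yield center, tokens[t + j]

corpus = "problems turning into banking crises as".split()
for center, outside in extract_pairs(corpus, window_size=2):
    print(center, "->", outside)
# e.g. the center word "into" pairs with "problems", "turning", "banking", "crises"
```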

Example

Image from: here

Word2vec Derivation

  • Likelihood (Maximize)
    For each given center word position $t=1,\dots,T$,
    maximize the probability of occurrence of the context words $w_{t+j}$
    ($m$ is the window size)
    $L(\theta) = \prod\limits_{t=1}^{T} \prod\limits_{\substack{-m\le j\le m\\ j\not = 0}}P(w_{t+j} \mid w_{t};\theta)$

  • Objective Function (Minimize)
    $J(\theta)=-\frac{1}{T}\log{L(\theta)} = -\frac{1}{T}\sum\limits_{t=1}^{T} \sum\limits_{\substack{-m\le j\le m\\ j\not = 0}}\log P(w_{t+j} \mid w_{t};\theta)$
    * Minimizing the objective function <=> maximizing predictive accuracy

  • To calculate $P(w_{t+j} \mid w_{t})$, we will use two vectors per word w (for easier optimization)
    - $v_w$ when w is a center word
    - $u_w$ when w is a context word
    - Then, $P(w_{t+j} \mid w_{t};\theta)=\frac{\exp(u_o^T v_c)}{\sum_{w\in V}\exp(u_w^T v_c)}$ . . . This is softmax!

  • $\theta=\begin{bmatrix}v_{aardvark}\\v_{a}\\\vdots\\v_{zebra}\\u_{aardvark}\\u_{a}\\\vdots\\u_{zebra}\end{bmatrix} \in \mathbb{R}^{2dV}$ ... $d$ is the size of a word vector and $V$ is the number of words in the vocabulary.

  • Algorithm (by SGD with learning rate $\alpha$)

    For each sentence in the corpus,
    for each (center) word in the sentence:
    calculate $L(\theta)$ and $J(\theta)=-\frac{1}{T}\log{L(\theta)}$ restricted to the current window (computing the loss on just one window at a time is what makes this stochastic gradient descent)
    calculate $\nabla_{\theta}{J(\theta)}$ ... for more details, check here
    update $\theta = \theta - \alpha\nabla_{\theta}{J(\theta)}$
    (a numpy sketch of one such SGD step follows below)
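A minimal numpy sketch of one SGD step for Skip-gram with the softmax $P(o \mid c)$ above. The toy sizes, random initialization, and names like V_emb, U_emb, and sgd_step are my own illustration, not the course's code:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 10, 4                                 # vocabulary size and word-vector size (toy values)
V_emb = rng.normal(scale=0.1, size=(V, d))   # v_w: vectors used when w is the center word
U_emb = rng.normal(scale=0.1, size=(V, d))   # u_w: vectors used when w is a context word
alpha = 0.05                                 # learning rate

def softmax_probs(c):
    """P(w | c) = exp(u_w^T v_c) / sum_w' exp(u_w'^T v_c), for every w in the vocabulary."""
    scores = U_emb @ V_emb[c]      # shape (V,)
    scores -= scores.max()         # for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

def sgd_step(c, o):
    """One SGD update for a single (center=c, outside=o) pair; loss = -log P(o | c)."""
    global V_emb, U_emb
    p = softmax_probs(c)                # shape (V,)
    loss = -np.log(p[o])
    # Gradients of -log P(o | c) (the derivation is in the gradients link referenced above):
    grad_vc = U_emb.T @ p - U_emb[o]    # d(loss)/d v_c
    grad_U = np.outer(p, V_emb[c])      # d(loss)/d u_w for all w ...
    grad_U[o] -= V_emb[c]               # ... with an extra term for the true outside word
    V_emb[c] -= alpha * grad_vc
    U_emb -= alpha * grad_U
    return loss

# Repeating the step on the same toy pair should show the loss decreasing.
for _ in range(5):
    print(sgd_step(c=3, o=7))
```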

Side Note: Word2vec with one vector for each word

This is another way of implementing Word2vec, which uses only one vector for each word.

  • Architecture (Skip-gram)
    Note: The notation of the architecture below is different from what we learned so far.
             $d \rightarrow N$ ... size of word vector
             $\theta \rightarrow W_{V\times N}$ ... and $\theta$ uses 2 vectors ($u_w, v_w$) for each word while $W$ doesn't.

    Image from: here

    - The input $x_k$ is a one-hot vector.
    - $W_{V\times N}^T\cdot x_k$ gives us $h_k \in \mathbb{R}^{N}$, which is the word vector we want to learn.
    - With $h_k$, calculate $\hat{y}=\mathrm{softmax}({h_k}^T\cdot W'_{N\times V})$ for each context word, and compare it with the real context (one-hot) vector $y$ using cross-entropy.
    - Update $W$ and $W'$, which also means updating the word vectors.
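A minimal numpy sketch of this forward pass, just to make the shapes concrete (toy sizes, random W and W', and made-up word indices k and o; not the reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

V, N = 10, 4                                  # vocabulary size and word-vector size (toy values)
W = rng.normal(scale=0.1, size=(V, N))        # W_{V x N}
W_prime = rng.normal(scale=0.1, size=(N, V))  # W'_{N x V}

k = 3                             # index of the input (center) word
x_k = np.zeros(V)
x_k[k] = 1.0                      # one-hot input

h_k = W.T @ x_k                   # hidden layer = the word vector of word k
assert np.allclose(h_k, W[k])     # multiplying by a one-hot just selects row k of W

scores = h_k @ W_prime            # shape (V,)
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()              # softmax over the vocabulary

o = 7                             # index of one true context word
y = np.zeros(V)
y[o] = 1.0                        # real context vector (one-hot)
cross_entropy = -np.sum(y * np.log(y_hat))   # = -log y_hat[o]
print(cross_entropy)
```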


Here you can find my work for HW1!


If there is something wrong in my writing or understanding, please comment and make corrections!


[reference]
1. https://youtu.be/8rXD5-xhemo
2. https://arxiv.org/abs/1301.3781
3. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture01-wordvecs1.pdf
4. https://stats.stackexchange.com/questions/253244/gradients-for-skipgram-word2vec
5. https://reniew.github.io/22/
