CS224N (1) Introduction and Word Vectors

Sanghyeok Choi · June 27, 2021

Note: I'll only write about the things that I think are noteworthy.


Lecture by Prof. Christopher Manning
Check course website

"It was the development of language that makes human being invincible."

Representation of the meaning of a word

  1. WordNet: A thesaurus containing lists of synonym sets and hypernyms
    • Problems
      missing nuance / missing new meanings of words / can't compute word similarity, etc.
  2. Discrete Symbols (One-Hot vectors)
    • Problems
      Vector dimension = # words in vocabulary (too big)
      Any two distinct one-hot vectors are orthogonal => can't compute word similarity (see the sketch after this list)
  3. Distributional semantics: A word's meaning is given by the words that frequently appear close-by
    • Word vectors (a.k.a. word embeddings, word representations)
      To represent a word w, use the set of words that frequently appear near w (its "contexts").
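A quick sketch of this contrast in Python (the hotel/motel/banking words echo the lecture's example, but the context words and counts below are made up for illustration):

```python
import numpy as np

# Toy vocabulary; the context words and counts below are made-up examples.
vocab = ["hotel", "motel", "banking"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# One-hot vectors: any two distinct words have dot product 0,
# so "hotel" looks no more similar to "motel" than to "banking".
print(one_hot["hotel"] @ one_hot["motel"])    # 0.0
print(one_hot["hotel"] @ one_hot["banking"])  # 0.0

# Distributional idea: represent a word by counts of the words seen in its contexts
# (hypothetical counts of "cheap", "room", "money" appearing nearby).
hotel   = np.array([3.0, 5.0, 0.0])
motel   = np.array([4.0, 4.0, 0.0])
banking = np.array([0.0, 0.0, 7.0])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(hotel, motel))    # ~0.97 -> similar
print(cosine(hotel, banking))  # 0.0   -> dissimilar
```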

Word2Vec

Word2vec(Mikolov et al. 2013) is a framework for learning word vectors.

Idea

1) We have a large corpus (pile of text)
2) Every word in a fixed vocabulary is represented by a vector (initialized randomly)
3) Go through each position t in the text, which has a center word c and context ("outside") words o (a sliding-window sketch in code follows this list)
4) Use the similarity of the word vectors for c and o to calculate $P(o \mid c)$ or $P(c \mid o)$
    Note: $P(o \mid c)$ is for Skip-gram, and $P(c \mid o)$ is for CBOW (Continuous Bag of Words)
5) Keep adjusting the word vectors to maximize this probability
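A minimal sketch of the sliding window from steps 1–3; the short corpus echoes the lecture's "problems turning into banking crises" example, and extract_pairs is just an illustrative helper name, not the course's code:

```python
# Sketch of the sliding window that yields (center, outside) word pairs.
def extract_pairs(tokens, window_size=2):
    """Yield (center, outside) pairs for every position t in the text."""
    for t, center in enumerate(tokens):
        for j in range(-window_size, window_size + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the center itself and positions outside the text
            yield center, tokens[t + j]

corpus = "problems turning into banking crises as".split()
for center, outside in extract_pairs(corpus, window_size=2):
    print(center, "->", outside)
# e.g. the center word "into" pairs with "problems", "turning", "banking", "crises"
```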

Example

Image from: here

Word2vec Derivation

  • Likelihood (Maximize)
    For each given center word position $t=1,\dots,T$,
    maximize the probability of occurrence of the context words $w_{t+j}$
    ($m$ is the window size)
    $L(\theta) = \prod\limits_{t=1}^{T} \prod\limits_{\substack{-m\le j\le m\\ j\not = 0}}P(w_{t+j} \mid w_{t};\theta)$

  • Objective Function (Minimize)
    $J(\theta)=-\frac{1}{T}\log{L(\theta)} = -\frac{1}{T}\sum\limits_{t=1}^{T} \sum\limits_{\substack{-m\le j\le m\\ j\not = 0}}\log P(w_{t+j} \mid w_{t};\theta)$
    * Minimizing the objective function <=> maximizing predictive accuracy

  • To calculate $P(w_{t+j} \mid w_{t})$, we will use two vectors per word w (for easier optimization)
    - $v_w$ when w is a center word
    - $u_w$ when w is a context word
    - Then, $P(w_{t+j} \mid w_{t};\theta)=\frac{\exp(u_o^T v_c)}{\sum_{w\in V}\exp(u_w^T v_c)}$ . . . This is softmax!

  • $\theta=\begin{bmatrix}v_{aardvark}\\v_{a}\\\vdots\\v_{zebra}\\u_{aardvark}\\u_{a}\\\vdots\\u_{zebra}\end{bmatrix} \in \mathbb{R}^{2dV}$ ... $d$ is the size of a word vector and $V$ is the number of words in the vocabulary.

  • Algorithm (by SGD with learning rate $\alpha$)

    For each sentence in the corpus,
    for each (center) word in the sentence:
    calculate $L(\theta)$ and $J(\theta)=-\frac{1}{T}\log{L(\theta)}$ restricted to the current window (computing the loss on just one window at a time is what makes this stochastic gradient descent)
    calculate $\nabla_{\theta}{J(\theta)}$ ... for more details, check here
    update $\theta = \theta - \alpha\nabla_{\theta}{J(\theta)}$
    (a numpy sketch of one such SGD step follows below)
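A minimal numpy sketch of one SGD step for Skip-gram with the softmax $P(o \mid c)$ above. The toy sizes, random initialization, and names like V_emb, U_emb, and sgd_step are my own illustration, not the course's code:

```python
import numpy as np

rng = np.random.default_rng(0)

V, d = 10, 4                                 # vocabulary size and word-vector size (toy values)
V_emb = rng.normal(scale=0.1, size=(V, d))   # v_w: vectors used when w is the center word
U_emb = rng.normal(scale=0.1, size=(V, d))   # u_w: vectors used when w is a context word
alpha = 0.05                                 # learning rate

def softmax_probs(c):
    """P(w | c) = exp(u_w^T v_c) / sum_w' exp(u_w'^T v_c), for every w in the vocabulary."""
    scores = U_emb @ V_emb[c]      # shape (V,)
    scores -= scores.max()         # for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

def sgd_step(c, o):
    """One SGD update for a single (center=c, outside=o) pair; loss = -log P(o | c)."""
    global V_emb, U_emb
    p = softmax_probs(c)                # shape (V,)
    loss = -np.log(p[o])
    # Gradients of -log P(o | c) (the derivation is in the gradients link referenced above):
    grad_vc = U_emb.T @ p - U_emb[o]    # d(loss)/d v_c
    grad_U = np.outer(p, V_emb[c])      # d(loss)/d u_w for all w ...
    grad_U[o] -= V_emb[c]               # ... with an extra term for the true outside word
    V_emb[c] -= alpha * grad_vc
    U_emb -= alpha * grad_U
    return loss

# Repeating the step on the same toy pair should show the loss decreasing.
for _ in range(5):
    print(sgd_step(c=3, o=7))
```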

Side Note: Word2vec with one vector for each word

This is another way of implementing Word2vec, which uses only one vector for each word.

  • Architecture (Skip-gram)
    Note: The notation of the architecture below is different from what we learned so far.
             $d \rightarrow N$ ... size of word vector
             $\theta \rightarrow W_{V\times N}$ ... and $\theta$ uses 2 vectors ($u_w, v_w$) for each word while $W$ doesn't.

    Image from: here

    - The input $x_k$ is a one-hot vector.
    - $W_{V\times N}^T\cdot x_k$ gives us $h_k \in \mathbb{R}^{N}$, which is the word vector we want to learn.
    - With $h_k$, calculate $\hat{y}=\mathrm{softmax}({h_k}^T\cdot W'_{N\times V})$ for each context word, and compare it with the real context (one-hot) vector $y$ using cross-entropy.
    - Update $W$ and $W'$, which also means updating the word vectors.
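A minimal numpy sketch of this forward pass, just to make the shapes concrete (toy sizes, random W and W', and made-up word indices k and o; not the reference implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

V, N = 10, 4                                  # vocabulary size and word-vector size (toy values)
W = rng.normal(scale=0.1, size=(V, N))        # W_{V x N}
W_prime = rng.normal(scale=0.1, size=(N, V))  # W'_{N x V}

k = 3                             # index of the input (center) word
x_k = np.zeros(V)
x_k[k] = 1.0                      # one-hot input

h_k = W.T @ x_k                   # hidden layer = the word vector of word k
assert np.allclose(h_k, W[k])     # multiplying by a one-hot just selects row k of W

scores = h_k @ W_prime            # shape (V,)
y_hat = np.exp(scores - scores.max())
y_hat /= y_hat.sum()              # softmax over the vocabulary

o = 7                             # index of one true context word
y = np.zeros(V)
y[o] = 1.0                        # real context vector (one-hot)
cross_entropy = -np.sum(y * np.log(y_hat))   # = -log y_hat[o]
print(cross_entropy)
```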


Here you can find my work for HW1!


If there is something wrong in my writing or understanding, please comment and make corrections!


[reference]
1. https://youtu.be/8rXD5-xhemo
2. https://arxiv.org/abs/1301.3781
3. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/slides/cs224n-2019-lecture01-wordvecs1.pdf
4. https://stats.stackexchange.com/questions/253244/gradients-for-skipgram-word2vec
5. https://reniew.github.io/22/
