Lecture 1. Introduction and Word Vectors (Notes)

이동찬 · October 3, 2022

Usable meaning in a computer

  • WordNet (thesaurus containing lists of synonym sets and hypernyms)
    - hypernyms → “is a” relationships

    (Drawbacks) misses nuance; impossible to keep up to date (new meanings of words go missing)

Problems with traditional NLP

  • words are regarded as discrete symbols: one-hot vectors
  • vector dimension = number of words in the vocab

→ No similarity or relationships, because any two one-hot vectors are orthogonal
(Solution) learn to encode similarity in the vectors themselves with deep learning
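For example, a minimal sketch (toy three-word vocabulary, NumPy; not from the lecture) showing that one-hot vectors carry no similarity information:

```python
import numpy as np

# Toy vocabulary for illustration only; each word owns one dimension.
vocab = ["motel", "hotel", "cat"]

def one_hot(word):
    """One-hot vector for a word (dimension = vocab size)."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

# Any two distinct one-hot vectors are orthogonal: dot product is 0,
# so "motel" looks exactly as unrelated to "hotel" as to "cat".
print(np.dot(one_hot("motel"), one_hot("hotel")))  # 0.0
print(np.dot(one_hot("motel"), one_hot("cat")))    # 0.0
```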

Representing words by their context

  • Distributional semantics (meaning from a word’s distribution in context) ↔  denotational semantics (meaning as what a word literally denotes)

→ A word’s meaning is given by the words that frequently appear close-by.

→ the importance of “context” (within a fixed-size window)

→ one of the most successful ideas of modern statistical NLP

📌 “You shall know a word by the company it keeps.” (J.R.Firth)
  • 2 senses of “word” :

    • tokens : a single occurrence of a word in a sentence
    • types : the use and meaning shared across many token instances of the same word
  • (1) Word vectors : a dense vector for each word

    → words that appear in similar contexts get similar word vector values

    → similarity is measured with the dot (scalar) product (see the sketch below)

    → typically represented with around 300 dimensions

    • (2) word embeddings : when we have a whole bunch of words, their representations are placed in a high-dimensional vector space → they are embedded into that vector space
    • (3) (neural) word representations
    • (4) distributed representation (↔  localized representation)

When visualized, the distributionally learned word vectors group similar words close together.
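As a rough illustration (made-up low-dimensional vectors, not real 300-dimensional embeddings), the dot product gives a larger value for words that share contexts:

```python
import numpy as np

# Made-up 4-d "word vectors" for illustration; real embeddings are ~300-d.
king  = np.array([ 0.8,  0.3, -0.1,  0.5])
queen = np.array([ 0.7,  0.4, -0.2,  0.6])
apple = np.array([-0.5,  0.9,  0.4, -0.3])

def dot_sim(u, v):
    """Similarity via the dot (scalar) product."""
    return float(np.dot(u, v))

print(dot_sim(king, queen))  # larger  -> words used in similar contexts
print(dot_sim(king, apple))  # smaller -> words used in different contexts
```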

Word2vec (Mikolov et al. 2013)

  • framework for learning word vectors

  • Idea

    • every word in a fixed vocabulary is represented by a vector
    • as the position t moves through the text, each position involves 2 types of words
      1. the center word c
      2. the context (“outside”) words o
    • compute p(o|c) from word co-occurrence ⇒ maximize it
  • (ex) computing P(w_{t+j} | w_t) (window = 2)

  • Likelihood function
    • predict context words within a window of fixed size
    • for positions t = 1, …, T, window size m, center word w_t

  • objective function (cost or loss function)
    • (average) negative log likelihood
    • take the log to turn the product into a sum, and multiply by 1/T to average
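Written out, the likelihood and the objective the bullets above describe (the standard word2vec formulas) are:

L(\theta) = \prod_{t=1}^{T}\prod_{-m \le j \le m,\, j \neq 0} P(w_{t+j} | w_t;\theta)

J(\theta) = -\frac{1}{T}\log L(\theta) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{-m \le j \le m,\, j \neq 0} \log P(w_{t+j} | w_t;\theta)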

📌 Minimizing the objective function J(\theta) = maximizing predictive accuracy

  • How to calculate P(w_{t+j} | w_t;\theta) ?
    • Use two vectors per word w
      • v_w : when w is a center word
      • u_w : when w is a context word
    • for a center word c and a context word o :
      • numerator : the product of the one context word vector with the one center word vector
      • denominator : the products of every context word vector in the vocab with the one center word vector, summed
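Concretely, the prediction function being described (and taken apart term by term below) is:

P(o|c) = \frac{\exp(u_o^Tv_c)}{\sum_{w=1}^{V}\exp(u_w^Tv_c)}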

  1. numerator

    • Dot product compares similarity of o and c vector
    u^Tv = \sum_{i=1}^{n}u_iv_i
    • Larger dot product = Larger probability = more similar
  2. exponentiation makes anything positive

    • the result of step 1 can be positive or negative, but we want a probability, so we apply exp
    • (cf) a negative probability does not exist
  3. denominator

    • normalize over the entire vocab to obtain a probability distribution
    • so that everything sums to 1!

📌 Word2vec’s prediction function ⇒ an example of the [softmax function] \mathbb{R}^n \to (0, 1)^n

softmax(x_i) = \frac{\exp(x_i)}{\sum_{j=1}^{n}\exp(x_j)} = p_i

  • maps arbitrary values x_i to a probability distribution p_i
  • “soft” : still assigns some probability to smaller x_i
  • “max” : amplifies the probability of the largest x_i
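A minimal NumPy sketch of this softmax (subtracting the max before exponentiating is an optional numerical-stability detail, not part of the formula above):

```python
import numpy as np

def softmax(x):
    """Map arbitrary real values x_i to a probability distribution p_i."""
    x = np.asarray(x, dtype=float)
    exps = np.exp(x - x.max())  # subtracting the max avoids overflow; result is unchanged
    return exps / exps.sum()

p = softmax([2.0, 1.0, 0.1])
print(p)        # the largest input gets amplified ("max"), smaller ones keep some mass ("soft")
print(p.sum())  # 1.0
```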

Train the model

  • Optimize parameters to minimize loss
    = Maximize the prediction of context words

  • \theta : all the model parameters (= word vectors)

    ⇒ compute all vector gradients

  • (ex) d = 300, V = 500,000

  • every word has two vectors

    1. center vector
    2. context vector
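Since every word has both vectors, \theta stacks 2V vectors of dimension d; with the example numbers above that is:

\theta \in \mathbb{R}^{2dV}, \quad 2 \times 300 \times 500{,}000 = 3 \times 10^{8} \text{ parameters}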

Let’s differentiate the log P(w_{t+j} | w_t) term of the objective function.

  • We will use p(o|c) = \frac{\exp(u_o^Tv_c)}{\sum_{w=1}^{V}\exp(u_w^Tv_c)}
    \frac{\partial }{\partial v_c}\log\frac{\exp(u_o^Tv_c)}{\sum_{w=1}^{V}\exp(u_w^Tv_c)}
    = \frac{\partial }{\partial v_c}\log\exp(u_o^Tv_c) - \frac{\partial }{\partial v_c}\log\sum_{w=1}^{V}\exp(u_w^Tv_c)
    1. numerator

      \frac{\partial }{\partial v_c}\log\exp(u_o^Tv_c) = \frac{\partial }{\partial v_c}u_o^Tv_c = u_o
    2. denominator

      • use the chain rule, as when differentiating a composite function
      \frac{\partial }{\partial v_c}\log\sum_{w=1}^{V}\exp(u_w^Tv_c)
      = \frac{1}{\sum_{w=1}^{V}\exp(u_w^Tv_c)} \times \frac{\partial }{\partial v_c}\sum_{x=1}^{V}\exp(u_x^Tv_c)
      = \frac{1}{\sum_{w=1}^{V}\exp(u_w^Tv_c)} \times \left ( \sum_{x=1}^{V} \frac{\partial }{\partial v_c} \exp(u_x^Tv_c)\right )
      = \frac{1}{\sum_{w=1}^{V}\exp(u_w^Tv_c)} \times \left ( \sum_{x=1}^{V} \exp(u_x^Tv_c)u_x\right )
    • Putting it together,
      • the second term is an expectation: an average over all context vectors, weighted by their probability

        \frac{\partial }{\partial v_c}\log\left ( p(o|c) \right )
        = u_o - \frac{1}{\sum_{w=1}^{V}\exp(u_w^Tv_c)} \times \left ( \sum_{x=1}^{V} \exp(u_x^Tv_c)u_x\right )
        = u_o - \sum_{x=1}^{V}\frac{\exp(u_x^Tv_c)}{\sum_{w=1}^{V}\exp(u_w^Tv_c)}u_x
        = u_o - \sum_{x=1}^{V}p(x|c)u_x
        = observed - expected \; (= X - \overline{X})
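A quick NumPy check of this result (toy random vectors; the vocab size and dimension are arbitrary): the analytic gradient u_o − Σ_x p(x|c) u_x should match a numerical finite-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
V_size, d = 10, 5                     # toy vocab size and vector dimension
U = rng.normal(size=(V_size, d))      # context ("outside") vectors u_w
v_c = rng.normal(size=d)              # one center vector v_c
o = 3                                 # index of the observed context word

def log_p(o, v_c):
    """log p(o|c) = u_o^T v_c - log sum_w exp(u_w^T v_c)."""
    scores = U @ v_c
    return scores[o] - np.log(np.sum(np.exp(scores)))

# Analytic gradient: observed - expected
p = np.exp(U @ v_c); p /= p.sum()
grad_analytic = U[o] - p @ U

# Numerical gradient via central differences
eps = 1e-6
grad_numeric = np.array([
    (log_p(o, v_c + eps * np.eye(d)[i]) - log_p(o, v_c - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```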

Points to think about

  1. I can see that similarity is computed from the vectors, but how are the values that make up each vector generated in the first place?

    → initialized randomly (starting point)

    → iterative algorithm (progressively updating those word vectors using gradients)

    → the vectors progressively do a better job of predicting which words appear in the context of other words
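A minimal sketch of that loop (toy corpus, naive full-softmax skip-gram with plain gradient steps; the names and hyperparameters here are illustrative, and real word2vec adds tricks such as negative sampling):

```python
import numpy as np

corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}
V_size, d, m, lr = len(vocab), 20, 2, 0.05

rng = np.random.default_rng(0)
V_center = rng.normal(scale=0.1, size=(V_size, d))   # center vectors v_w, random starting point
U_context = rng.normal(scale=0.1, size=(V_size, d))  # context vectors u_w, random starting point

for epoch in range(100):
    for t, word in enumerate(corpus):
        c = word2id[word]
        for j in range(-m, m + 1):                    # fixed-size window around position t
            if j == 0 or not 0 <= t + j < len(corpus):
                continue
            o = word2id[corpus[t + j]]
            scores = U_context @ V_center[c]
            p = np.exp(scores - scores.max()); p /= p.sum()   # softmax P(.|c)
            grad_vc = U_context[o] - p @ U_context            # observed - expected
            grad_U = -np.outer(p, V_center[c]); grad_U[o] += V_center[c]
            V_center[c] += lr * grad_vc                       # ascend the log-likelihood
            U_context += lr * grad_U

# Inspect two learned center vectors (tiny corpus, so this is only suggestive).
print(V_center[word2id["quick"]] @ V_center[word2id["lazy"]])
```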
