[Paper Review] FastText

Sung Jae Hyuk · October 30, 2023

Summary of FastText

Prerequisites

  1. Skip-gram
    • There are two ways to learn word embeddings with a (deep) neural network, depending on the direction of prediction.
    • One is CBOW (Continuous Bag of Words), which takes the context words as input, and the other is Skip-gram, which takes the center word as input.
    • In particular, the skip-gram method learns to predict the context words from the center word.
    • Therefore we want $\boldsymbol{y}$ to match the $C$ one-hot encodings of the actual output words, so using maximum log-likelihood, the objective function is the following:
$$\begin{aligned} E&=-\log P(v_{O_1},\ v_{O_2},\ \cdots,\ v_{O_C}\mid v_I)\\ &=-\log \prod_{c=1}^C \dfrac{e^{u_{j_c}}}{\sum_{t=1}^V e^{u_t}}\\ &=-\sum_{c=1}^C u_{j_c} +C \log\sum_{t=1}^V e^{u_t} \end{aligned}$$

where $V$ is the size of the vocabulary and $u$ is the score vector.

  2. Negative Sampling
    • When we learn word embeddings as above, the softmax requires computing a score for every word in the vocabulary.
    • This is fine as long as the vocabulary is small, but in practice there are so many words that this takes a lot of time.
    • To solve this, we proceed with learning by selecting only the context words that are the answers together with a small sample of words that are not.
    • We call this "negative sampling".
    • In this sampling, we want the network to output $1$ for the true output word and $0$ for each negative word.
    • Instead of a softmax over the whole vocabulary, we use a sigmoid for only these $K+1$ words, as sketched in the code below.
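
To make the contrast concrete, here is a minimal NumPy sketch with made-up toy sizes and randomly initialized embedding matrices (none of these values come from the paper): the full softmax loss touches every word in the vocabulary, while negative sampling applies a sigmoid to only the true context word and $K$ sampled negatives.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, dim, K = 10, 8, 5                 # toy vocabulary size, embedding size, negatives

W_in = rng.normal(size=(V, dim))     # input (center-word) embeddings
W_out = rng.normal(size=(V, dim))    # output (context-word) embeddings

center, context = 3, 7               # toy word indices
v_c = W_in[center]

# Full softmax (the skip-gram objective above): the cost touches every word in V.
scores = W_out @ v_c                                   # u_t for t = 1..V
softmax_loss = -scores[context] + np.log(np.exp(scores).sum())

# Negative sampling: sigmoid targets 1 for the true context word, 0 for K negatives.
negatives = rng.choice(np.delete(np.arange(V), context), size=K, replace=False)
ns_loss = -np.log(sigmoid(W_out[context] @ v_c))
ns_loss += -np.log(sigmoid(-(W_out[negatives] @ v_c))).sum()

print(softmax_loss, ns_loss)
```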

Problems of previous work

  1. Hard to learn representations for unseen or rare words
    • Specifically, there is a problem in learning "unseen" or "rare" words.
    • e.g. "tensor" and "flow" are frequent words, so we can learn them easily,
      but since "tensorflow" is a rare word, there is a high probability that the word does not exist in the training data.
    • We call this problem the "OOV (Out Of Vocabulary) problem".
  2. Ignores morphologically rich languages
    • Specifically, there is a problem in learning words that have different tenses or parts of speech but come from the same base word.
    • e.g. suppose we learn the following two words: "have", "having".
    • After training, the embeddings of "have" and "having" should have high similarity; however, this depends on the distribution of the training data.
    • That means the embeddings of "have" and "having" may end up with low similarity, which is a wrong representation.

General model

  • We will use skip-gram with negative sampling, so the objective function becomes a set of binary classification problems.
  • Let $f : x \mapsto \log(1+e^{-x})$ (i.e., the logistic loss, $-\log \sigma(x)$ where $\sigma$ is the sigmoid); then we can write the objective as:
    $$\sum_{t=1}^T\left[\sum_{c\in\mathcal{C}_t} f(s(w_t,\ w_c))+\sum_{n\in\mathcal{N}_{t,\ c}}f(-s(w_t,\ n))\right]$$
    where $\mathcal{C}_t$ is the set of context-word indices and $\mathcal{N}_{t,\ c}$ is a set of negative samples drawn from the vocabulary (a code sketch of one term of this objective follows the list).
  • Note that $s(w_i,\ w_j)$ denotes the similarity between the embedding of the input word $w_i$ and the embedding of the output word $w_j$; if we normalize the embedding vectors to length $1$, the similarity function $s(w_i,\ w_j)$ can be defined as:
    $$s(w_i,\ w_j)=\dfrac{\mathbf{u}_{w_i}\cdot \mathbf{v}_{w_j}}{\lVert \mathbf{u}_{w_i} \rVert\,\lVert \mathbf{v}_{w_j} \rVert}=\mathbf{u}_{w_i}^T\mathbf{v}_{w_j}$$
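
Concretely, one term of this objective (a single position $t$ with one context word $w_c$) can be sketched as follows with toy random vectors; since $f(x)=\log(1+e^{-x})=-\log\sigma(x)$, this is exactly the negative-sampling loss from the prerequisites.

```python
import numpy as np

def f(x):
    # Logistic loss f(x) = log(1 + e^{-x}) = -log(sigmoid(x))
    return np.log1p(np.exp(-x))

rng = np.random.default_rng(0)
dim, K = 8, 5
u_wt = rng.normal(size=dim)          # u_{w_t}: embedding of the center word w_t
v_wc = rng.normal(size=dim)          # v_{w_c}: embedding of one context word w_c
v_neg = rng.normal(size=(K, dim))    # embeddings of the negative samples in N_{t,c}

s_pos = u_wt @ v_wc                  # s(w_t, w_c)
s_neg = v_neg @ u_wt                 # s(w_t, n) for each negative sample n

loss = f(s_pos) + f(-s_neg).sum()    # one (t, c) term of the objective
print(loss)
```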

Main idea: Learn subword representations instead of entire words

  • To solve the above two problems, we use a subword model.
  • Each word $w$ is represented as a bag of character $n$-grams, so we can calculate the word embedding as the sum (or mean) of all its $n$-gram embeddings instead of learning a single vector for the word itself.
  • In addition, whereas previously only whole words were learned, in this model learning proceeds not only over words but also over all of their subwords.
  • In practice, the authors extract all the $n$-grams for $n$ greater than or equal to $3$ and smaller than or equal to $6$.
  • Let $G$ be the size of the dictionary of $n$-grams.
    Given a word $w$, let us denote by $\mathcal{G}_w \subset \{1,\ 2,\ \cdots,\ G\}$ the set of $n$-grams appearing in $w$.
    Since we represent a word by the sum of the vector representations of its $n$-grams, we obtain the scoring function (sketched in code below)
    $$s(w,\ c)=\sum_{g\in\mathcal{G}_w} u_g^T v_c$$
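
Here is a minimal sketch of this subword scoring with a toy random embedding table; the boundary symbols '<' and '>' and the inclusion of the whole word as a special sequence follow the paper, while the vectors and the example word are made up.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    # Wrap the word in boundary symbols '<', '>' and collect all n-grams
    # with n_min <= n <= n_max, plus the whole word itself as a special sequence.
    w = f"<{word}>"
    grams = {w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)
    return sorted(grams)

rng = np.random.default_rng(0)
dim = 8
grams = char_ngrams("where")                    # G_w: '<wh', 'whe', 'her', ...

u = {g: rng.normal(size=dim) for g in grams}    # toy n-gram embeddings u_g
v_c = rng.normal(size=dim)                      # toy context embedding v_c

# s(w, c) = sum over g in G_w of u_g^T v_c
score = sum(u[g] @ v_c for g in grams)
print(grams)
print(score)
```

Because the score depends only on character $n$-grams, an out-of-vocabulary word such as "tensorflow" still gets a representation from the $n$-grams it shares with words seen during training.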