Summary of FastText
Prerequisites
- Skip-gram
- There are two main ways to learn word embeddings with a (deep) neural network.
- One is CBOW (Continuous Bag of Words), which takes the context words as input, and the other is skip-gram, which takes the center word as input.
- In particular, the skip-gram method learns to predict the context words from the center word.
- Therefore we want the output y to match the C one-hot encodings of the actual context words, so using maximum log-likelihood, the objective function is the following:
$$E = -\log P(v_{O_1}, v_{O_2}, \cdots, v_{O_C} \mid v_I) = -\log \prod_{c=1}^{C} \frac{e^{u_{j_c}}}{\sum_{t=1}^{V} e^{u_t}} = -\sum_{c=1}^{C} u_{j_c} + C \log \sum_{t=1}^{V} e^{u_t}$$
where $V$ is the vocabulary size, $u$ is the score vector, and $u_{j_c}$ is the score of the $c$-th actual context word.
- Negative Sampling
- When we learn word embeddings with the softmax objective above, we have to compute a score for every word in the vocabulary.
- This is fine when the vocabulary is small, but in practice there are so many words that this takes a lot of time.
- To solve this, we train on only the context words that are the answers plus a small sample of words that are not.
- This is called "negative sampling".
- With this sampling, we want the network to output 1 for the true output word and 0 for each negative word.
- Instead of the softmax over the whole vocabulary, we use a sigmoid on only the K+1 words (1 positive and K negatives); see the sketch after this list.
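As a concrete illustration of the two ideas above, here is a minimal Python sketch that generates skip-gram (center, context) pairs from a toy sentence and draws K negative samples for each pair. The toy corpus and the helper names (`skipgram_pairs`, `negative_samples`) are made up for illustration; real word2vec/fastText implementations also sample negatives from a unigram distribution raised to the 3/4 power rather than uniformly.

```python
import random

# Toy corpus and vocabulary (illustrative only; a real corpus is much larger).
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) id pairs: skip-gram predicts context words from the center word."""
    for t, center in enumerate(tokens):
        for c in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if c != t:
                yield word2id[center], word2id[tokens[c]]

def negative_samples(positive_id, k=5):
    """Draw K word ids that are not the true context word.
    (Uniform here for simplicity; word2vec uses a unigram^(3/4) distribution.)"""
    negatives = []
    while len(negatives) < k:
        n = random.randrange(len(vocab))
        if n != positive_id:
            negatives.append(n)
    return negatives

for center, context in skipgram_pairs(corpus):
    negs = negative_samples(context, k=2)
    # The network should output 1 for (center, context) and 0 for each (center, negative).
    print(vocab[center], "->", vocab[context], "| negatives:", [vocab[n] for n in negs])
    break  # show only the first training pair
```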
Problems of previous work
- Hard to learn a reliable representation for every individual word
- Specifically, it is hard to learn embeddings for "unseen" or "rare" words.
- e.g. "tensor" and "flow" are frequent words, so we can learn them easily,
but since "tensorflow" is a rare word, there is a high probability that it does not appear in the training data.
- We call this the "OOV (out-of-vocabulary) problem".
- Ignores morphologically rich languages
- Specifically, it is hard to learn forms of the same word that differ in tense or part of speech.
- e.g. suppose we learn the two words "have" and "having".
- After training, the embeddings of "have" and "having" should have high similarity, but this depends on the distribution of the training data.
- If the two forms rarely appear in similar contexts, their embeddings end up with low similarity, which is a wrong representation.
General model
- We use skip-gram with negative sampling, so the objective function becomes a set of binary classification problems (a small sketch follows this block).
- Let $f : x \mapsto \log(1 + e^{-x})$ (the logistic loss, i.e. the negative log of the sigmoid); we can write the objective as:
$$\sum_{t=1}^{T} \left[ \sum_{c \in C_t} f(s(w_t, w_c)) + \sum_{n \in N_{t,c}} f(-s(w_t, n)) \right]$$
where $C_t$ is the set of context words of $w_t$ and $N_{t,c}$ is the set of negative samples drawn from the vocabulary.
- Note that $s(w_i, w_j)$ is the similarity (score) between the embedding of the input word $w_i$ and the embedding of the output word $w_j$; if we normalize the embedding vectors to unit length, the similarity function $s(w_i, w_j)$ can be defined as:
$$s(w_i, w_j) = \frac{u_{w_i} \cdot v_{w_j}}{|u_{w_i}||v_{w_j}|} = u_{w_i}^{\top} v_{w_j}$$
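Below is a small NumPy sketch of this objective for a single (center, context) pair with a few negative samples. The embedding tables `U` and `V` are random placeholders just to make the shapes concrete; this is a sketch of the loss computation, not the full training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_size = 8, 20

# Input (u) and output (v) embedding tables; random placeholders for illustration.
U = rng.normal(scale=0.1, size=(vocab_size, dim))
V = rng.normal(scale=0.1, size=(vocab_size, dim))

def f(x):
    """Logistic loss f(x) = log(1 + exp(-x)) = -log(sigmoid(x))."""
    return np.log1p(np.exp(-x))

def s(w_i, w_j):
    """Score: dot product of the input embedding of w_i and the output embedding of w_j."""
    return U[w_i] @ V[w_j]

def pair_loss(center, context, negatives):
    """One term of the objective: f(s(w_t, w_c)) + sum over n of f(-s(w_t, n))."""
    return f(s(center, context)) + sum(f(-s(center, n)) for n in negatives)

# Example: center word id 3, true context word id 7, two negative sample ids.
print(pair_loss(center=3, context=7, negatives=[1, 12]))
```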
Main idea: Learn subword representation instead of entire word
- To solve the above two problems, we use a subword model.
- Each word w is represented as a bag of character n-grams, so the word embedding can be calculated as the sum (or mean) of all its n-gram embeddings instead of a single per-word vector.
- In addition, whereas previous models learned only whole words, in this model learning proceeds not only on words but also on all subwords.
- In practice, the authors extract all the n-grams with 3 ≤ n ≤ 6 (see the sketch at the end of this section).
- Let G be the dictionary of n-grams.
Given a word w, let us denote by $G_w \subset \{1, 2, \cdots, G\}$ the set of n-grams appearing in w.
Since we represent a word by the sum of the vector representations of its n-grams, we obtain the scoring function
$$s(w, c) = \sum_{g \in G_w} u_g^{\top} v_c$$
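As a rough sketch of this subword idea, the following Python code extracts character n-grams (3 ≤ n ≤ 6, with '<' and '>' boundary markers and the word itself, as described in the paper) and scores a word against a context vector by summing the dot products of its n-gram vectors. The n-gram table `u` and the context vector `v_c` are random placeholders for illustration, not the trained fastText parameters.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of the word, with '<' and '>' boundary markers,
    plus the full word itself."""
    w = "<" + word + ">"
    grams = {w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)
    return grams

rng = np.random.default_rng(0)
dim = 8
u = {}                                    # n-gram embedding table u_g (filled lazily here)
v_c = rng.normal(scale=0.1, size=dim)     # output embedding of some context word c

def subword_score(word):
    """s(w, c) = sum over n-grams g of w of u_g . v_c."""
    total = 0.0
    for g in char_ngrams(word):
        if g not in u:
            u[g] = rng.normal(scale=0.1, size=dim)
        total += u[g] @ v_c
    return total

# Even an unseen word like "tensorflow" gets a score, because it shares
# n-grams ("ten", "sor", "flow", ...) with words seen during training.
print(subword_score("tensorflow"))
```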