Summary of FastText
Prerequisites
- Skip-gram
- There are two main ways to learn word embeddings with a (deep) neural network.
- One is CBOW (Continuous Bag of Words), which takes the context words as input, and the other is skip-gram, which takes the center word as input.
- In particular, the skip-gram method learns to predict the context words from the center word.
- Therefore we want the output y to match the C one-hot encodings of the actual context words, so using maximum log-likelihood, the objective function is the following:
$$E = -\log P(v_{O_1}, v_{O_2}, \cdots, v_{O_C} \mid v_I) = -\log \prod_{c=1}^{C} \frac{e^{u_{j_c}}}{\sum_{t=1}^{V} e^{u_t}} = -\sum_{c=1}^{C} u_{j_c} + C \log \sum_{t=1}^{V} e^{u_t}$$
where $V$ is the vocabulary size, $u$ is the score vector, and $u_{j_c}$ is the score of the $c$-th actual context word.
- Negative Sampling
- When we learn word embeddings with the softmax objective above, we have to compute a score for every word in the vocabulary.
- This is fine when the vocabulary is small, but in practice there are so many words that this takes a lot of time.
- To solve this, we train on only the context words that are the answers plus a small sample of words that are not.
- This is called "negative sampling".
- With this sampling, we want the network to output 1 for the true output word and 0 for each negative word.
- Instead of the softmax over the whole vocabulary, we use a sigmoid on only the K+1 words (1 positive and K negatives); see the sketch after this list.
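As a concrete illustration of the two ideas above, here is a minimal Python sketch that generates skip-gram (center, context) pairs from a toy sentence and draws K negative samples for each pair. The toy corpus and the helper names (`skipgram_pairs`, `negative_samples`) are made up for illustration; real word2vec/fastText implementations also sample negatives from a unigram distribution raised to the 3/4 power rather than uniformly.

```python
import random

# Toy corpus and vocabulary (illustrative only; a real corpus is much larger).
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}

def skipgram_pairs(tokens, window=2):
    """Yield (center, context) id pairs: skip-gram predicts context words from the center word."""
    for t, center in enumerate(tokens):
        for c in range(max(0, t - window), min(len(tokens), t + window + 1)):
            if c != t:
                yield word2id[center], word2id[tokens[c]]

def negative_samples(positive_id, k=5):
    """Draw K word ids that are not the true context word.
    (Uniform here for simplicity; word2vec uses a unigram^(3/4) distribution.)"""
    negatives = []
    while len(negatives) < k:
        n = random.randrange(len(vocab))
        if n != positive_id:
            negatives.append(n)
    return negatives

for center, context in skipgram_pairs(corpus):
    negs = negative_samples(context, k=2)
    # The network should output 1 for (center, context) and 0 for each (center, negative).
    print(vocab[center], "->", vocab[context], "| negatives:", [vocab[n] for n in negs])
    break  # show only the first training pair
```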
Problems of previous work
- Hard to learn a reliable representation for every individual word
- Specifically, it is hard to learn embeddings for "unseen" or "rare" words.
- e.g. "tensor" and "flow" are frequent words, so we can learn them easily,
but since "tensorflow" is a rare word, there is a high probability that it does not appear in the training data.
- We call this the "OOV (out-of-vocabulary) problem".
- Ignores morphologically rich languages
- Specifically, it is hard to learn forms of the same word that differ in tense or part of speech.
- e.g. suppose we learn the two words "have" and "having".
- After training, the embeddings of "have" and "having" should have high similarity, but this depends on the distribution of the training data.
- If the two forms rarely appear in similar contexts, their embeddings end up with low similarity, which is a wrong representation.
General model
- We use skip-gram with negative sampling, so the objective function becomes a set of binary classification problems (a small sketch follows this block).
- Let $f : x \mapsto \log(1 + e^{-x})$ (the logistic loss, i.e. the negative log of the sigmoid); we can write the objective as:
$$\sum_{t=1}^{T} \left[ \sum_{c \in C_t} f(s(w_t, w_c)) + \sum_{n \in N_{t,c}} f(-s(w_t, n)) \right]$$
where $C_t$ is the set of context words of $w_t$ and $N_{t,c}$ is the set of negative samples drawn from the vocabulary.
- Note that $s(w_i, w_j)$ is the similarity (score) between the embedding of the input word $w_i$ and the embedding of the output word $w_j$; if we normalize the embedding vectors to unit length, the similarity function $s(w_i, w_j)$ can be defined as:
$$s(w_i, w_j) = \frac{u_{w_i} \cdot v_{w_j}}{|u_{w_i}||v_{w_j}|} = u_{w_i}^{\top} v_{w_j}$$
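Below is a small NumPy sketch of this objective for a single (center, context) pair with a few negative samples. The embedding tables `U` and `V` are random placeholders just to make the shapes concrete; this is a sketch of the loss computation, not the full training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, vocab_size = 8, 20

# Input (u) and output (v) embedding tables; random placeholders for illustration.
U = rng.normal(scale=0.1, size=(vocab_size, dim))
V = rng.normal(scale=0.1, size=(vocab_size, dim))

def f(x):
    """Logistic loss f(x) = log(1 + exp(-x)) = -log(sigmoid(x))."""
    return np.log1p(np.exp(-x))

def s(w_i, w_j):
    """Score: dot product of the input embedding of w_i and the output embedding of w_j."""
    return U[w_i] @ V[w_j]

def pair_loss(center, context, negatives):
    """One term of the objective: f(s(w_t, w_c)) + sum over n of f(-s(w_t, n))."""
    return f(s(center, context)) + sum(f(-s(center, n)) for n in negatives)

# Example: center word id 3, true context word id 7, two negative sample ids.
print(pair_loss(center=3, context=7, negatives=[1, 12]))
```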
Main idea: Learn subword representation instead of entire word
- To solve the above two problems, we use a subword model.
- Each word w is represented as a bag of character n-grams, so the word embedding can be calculated as the sum (or mean) of all its n-gram embeddings instead of a single per-word vector.
- In addition, whereas previous models learned only whole words, in this model learning proceeds not only on words but also on all subwords.
- In practice, the authors extract all the n-grams with 3 ≤ n ≤ 6 (see the sketch at the end of this section).
- Let G be the dictionary of n-grams.
Given a word w, let us denote by $G_w \subset \{1, 2, \cdots, G\}$ the set of n-grams appearing in w.
Since we represent a word by the sum of the vector representations of its n-grams, we obtain the scoring function
$$s(w, c) = \sum_{g \in G_w} u_g^{\top} v_c$$
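As a rough sketch of this subword idea, the following Python code extracts character n-grams (3 ≤ n ≤ 6, with '<' and '>' boundary markers and the word itself, as described in the paper) and scores a word against a context vector by summing the dot products of its n-gram vectors. The n-gram table `u` and the context vector `v_c` are random placeholders for illustration, not the trained fastText parameters.

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """All character n-grams of the word, with '<' and '>' boundary markers,
    plus the full word itself."""
    w = "<" + word + ">"
    grams = {w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)}
    grams.add(w)
    return grams

rng = np.random.default_rng(0)
dim = 8
u = {}                                    # n-gram embedding table u_g (filled lazily here)
v_c = rng.normal(scale=0.1, size=dim)     # output embedding of some context word c

def subword_score(word):
    """s(w, c) = sum over n-grams g of w of u_g . v_c."""
    total = 0.0
    for g in char_ngrams(word):
        if g not in u:
            u[g] = rng.normal(scale=0.1, size=dim)
        total += u[g] @ v_c
    return total

# Even an unseen word like "tensorflow" gets a score, because it shares
# n-grams ("ten", "sor", "flow", ...) with words seen during training.
print(subword_score("tensorflow"))
```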