[Paper Review] Enriching Word Vectors with Subword Information

fla1512 · August 29, 2022

References
https://www.youtube.com/watch?v=7UA21vg4kKE
https://amitness.com/2020/06/fasttext-embeddings/
https://tkdguq05.github.io/2020/08/14/Fasttext/
Book: 한국어 임베딩 (Korean Embeddings)

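Enriching Word Vectors with Subword Information
1. Word vectors: a word representation method that maps natural-language words to vectors
2. Fast (as in fastText): fast enough to train on large corpora, although Sec. 4.3 reports training about 1.5x slower than the skipgram baseline
3. Subword: morphological features are extracted from the subword units of a word, i.e. at the character level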

0. Abstract

Background

continuous word representations trained on large unlabeled corpora
-> useful for many NLP tasks
-> but they assign a distinct vector to each word and so ignore word morphology
-> a serious limitation for large vocabularies and rare words

continuous word representation (= word embedding, dense representation): a mapping from words to real-valued vectors

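Proposed approach

based on the skipgram model
each word is represented as a bag of character n-grams
- a vector representation is associated with each character n-gram
  (a word is represented as the sum of these representations)
Characteristics
- fast (models can be trained on large corpora)
- can compute representations for words that did not appear in the training data
- evaluated on nine languages, on both word similarity and analogy tasks
- achieves state-of-the-art performance

fastText
1. Skip-gram model: used as the base model; the argument given is about gradient flow (in skip-gram each center word is updated against every context word separately, while in CBOW the context words are averaged and share one update), so skip-gram tends to perform relatively better
2. Subword information: character n-grams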

1. Introduction

Prior work

continuous representations of words have a long history in NLP
- typically derived from large unlabeled corpora using co-occurrence statistics
- distributional semantics: a large body of work has studied the properties of these methods
- neural network community: word embeddings learned with a feed-forward neural network that predicts a word from the two words on its left and the two words on its right
- simple log-bilinear model: learns continuous representations of words on very large corpora

Limitations

each word in the vocabulary is usually represented by a distinct vector (without parameter sharing)
the internal structure of words is ignored (a major limitation for morphologically rich languages)
- e.g.
  - French, Spanish -> most verbs have more than forty different inflected forms
    -> many word forms occur rarely (or never) in the training corpus
    -> this makes good word representations hard to learn
  - Finnish -> fifteen cases for nouns

-> Hence, for morphologically rich languages, it helps to use character-level information

Limitations of existing representation methods
1. Representing every word with its own vector (one-to-one) does not scale
-> rare words, and especially words absent from the training data, cannot get an accurate embedding
-> the OOV (out-of-vocabulary) problem
2. The internal structure of words is ignored
-> hard to represent morphologically rich languages (languages where function morphemes such as particles and endings attach to content words)
-> hard to represent all the surface forms of words with similar meaning without parameter sharing
-> words with similar meaning are each learned separately, from context information alone
-> e.g. shared radical: eat, eats, eaten, eater, eating

Proposed approach

learn representations for character n-grams
represent words as the sum of their n-gram vectors

Contributions
1. an extension of the continuous skip-gram model
(one that takes subword information into account)
2. evaluation on nine languages exhibiting different morphologies, showing the benefit of the approach

Morphological word representations

incorporate morphological information into word representations
Alexandrescu and Kirchhoff (2006): factored neural language models (to better model rare words)
- can include morphological information
- applied to Turkish, a morphologically rich language
other works propose different composition functions to derive word representations from morphemes
-> these rely on a morphological decomposition of words
Chen et al. (2015): a method to jointly learn embeddings for Chinese words and characters
Cui et al. (2015): to constrain morphologically similar words to have similar representations.
Soricut and Och (2015): described a method to learn vector representations of morphological transformations
Cotterell and Schütze (2015): introduce word representations trained on morphologically annotated data
Schütze (1993): learned representations of character four-grams through singular value decomposition, and derived representations for words by summing the four-gram representations.
Wieting et al. (2016): represent words using character n-gram count vectors.

-> this last approach is based on paraphrase pairs, whereas our model can be trained on any text corpus

Character level features for NLP

discard the segmentation into words and aim at learning language representations directly from characters.

recurrent neural network
- applied to language modeling, text normalization, part-of-speech tagging, and parsing
convolutional neural network
- trained on characters
- applied to part-of-speech tagging, sentiment analysis, text classification, and language modeling
Sperr et al. (2013)
- language model based on restricted Boltzmann machines, in which words are encoded as a set of character n-grams
Sennrich et al. (2016); Luong and Manning (2016): recent works in machine translation
- use subword units to obtain representations of rare words

3. Model

learn word representations while taking morphology into account
morphology is modeled by considering subword units, and words are represented as the sum of their character n-grams

present the general framework used to train word vectors
present our subword model
describe how we handle the dictionary of character n-grams
(each word is represented by its character-level n-grams)

3.1 General model

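The model and techniques fastText builds on
: Skip-gram + Negative Sampling

+ Additional notes

Skip-Gram Model

trained to predict the surrounding words from the center word
a window of a given size slides over the text, and for each center word the probabilities of its surrounding words are updated
for an input word w, predict the context words inside the window
target word Wt, context word Wc
assumes the context words are conditionally independent given the target word
layers: input layer, hidden layer, output layer
the computation is inefficient
(predicting a single Wc requires scoring every word in the vocabulary, which is costly)
- only a small part of the vocabulary (size V) is actually used in each update
  -> comparing against every word in the vocabulary is unnecessary
- the wasted computation grows with the vocabulary size
  -> Negative sampling idea: input words are one-hot indices, so just look up the corresponding row of W_input, and on the output side score only a few sampled words
- turn the multi-class classification problem into binary classification: 1 for a context word, 0 for a non-context word
- compute only with the indices of the target word and of the positive and negative examples
  -> lower computational cost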

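Negative sampling

positive sample: a pair of the target word (center word, t) and a context word (c) that actually appears around it
negative sample: a pair of the target word and a word that does not appear around it (drawn at random from the whole corpus)
e.g. with a window of 2, only the two words before and the two words after the target word are used to build positive samples
negative sampling: a training technique that, given a (target word, context word) pair, performs binary classification of whether the pair is a positive sample (+) or a negative sample (-)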
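
As a small illustration of how positive samples are built with a window of 2, here is a minimal sketch in plain Python. The function name `skipgram_pairs` and the toy sentence are assumptions made for this example; they do not come from the paper or from the fastText code.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) positive pairs for a fixed window size."""
    pairs = []
    for t, target in enumerate(tokens):
        lo = max(0, t - window)                 # first index inside the window
        hi = min(len(tokens), t + window + 1)   # one past the last index
        for c in range(lo, hi):
            if c != t:                          # the target is not its own context
                pairs.append((target, tokens[c]))
    return pairs


sentence = "i am eating food now".split()
pairs = skipgram_pairs(sentence, window=2)
eating_contexts = [c for w, c in pairs if w == "eating"]
```

Any word that is not in `eating_contexts` can then serve as a negative sample for the target "eating".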
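How it works:
1. train so that the log-sigmoid of the dot product between the target vector and the context vector is maximized for positive pairs
2. sample words from outside the window (i.e. words that are not context words), and replace the objective: instead of a multi-class classification that predicts the context word, use a binary classification of whether a word is a context word or not
  -> the output layer becomes cheap to compute
Characteristics
- for a given Wt, predict for each candidate word wc whether it is a context word or not
- log-likelihood: for positive samples the score enters as is and is pushed up; for negative samples the score is multiplied by -1, so it is pushed down
Difference from word2vec
- as in word2vec, the model is trained by maximizing a conditional probability
- if an input pair (t, c) is a real positive sample, the model should predict it as positive
  -> fastText goes one step further
- when training on a (target word t, context word c) pair, all the character n-gram vectors (z) belonging to the target word t are updated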
Procedure:

e.g. center word: "eating", context words: "am", "food"

the embedding of the center word is computed as the sum of the vectors of its character n-grams and of the whole word itself
for the actual context words, the word vector is taken directly from the embedding table, without adding character n-grams
negative samples are drawn at random, with probability proportional to the square root of the unigram frequency
the dot product between the center word and each actual context word (and likewise each negative sample) is passed through the sigmoid function to get a match score between 0 and 1
the embedding vectors are updated with an SGD optimizer based on the loss
(pulling the actual context words closer to the center word and pushing the negative samples away)
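
The procedure above can be sketched in plain Python. This is a toy sketch, not the fastText implementation: the names `char_ngrams` and `train_pair`, the dimension 8, the initialization, and the learning rate are all assumptions chosen for illustration. Label 1 stands for a positive pair and label 0 for a negative pair.

```python
import math
import random


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>' plus the whole-word token itself."""
    token = f"<{word}>"
    grams = {token[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(token) - n + 1)}
    grams.add(token)
    return sorted(grams)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_pair(center, context, label, z, v, dim=8, lr=0.05, rng=random):
    """One SGD step on a (center, context) pair; returns the score before the update."""
    grams = char_ngrams(center)
    for g in grams:  # lazily create n-gram vectors (assumed initialization)
        z.setdefault(g, [rng.uniform(-0.5, 0.5) / dim for _ in range(dim)])
    v.setdefault(context, [0.0] * dim)
    # Center vector: sum of the n-gram vectors (whole word included).
    u = [sum(z[g][k] for g in grams) for k in range(dim)]
    score = sigmoid(sum(a * b for a, b in zip(u, v[context])))
    grad = score - label  # derivative of the logistic loss w.r.t. the dot product
    old_v = list(v[context])
    for k in range(dim):
        v[context][k] -= lr * grad * u[k]
    for g in grams:       # every n-gram of the center word is updated
        for k in range(dim):
            z[g][k] -= lr * grad * old_v[k]
    return score


rng = random.Random(0)
z, v = {}, {}
for _ in range(50):
    pos = train_pair("eating", "food", 1, z, v, rng=rng)   # actual context word
    neg = train_pair("eating", "paris", 0, z, v, rng=rng)  # hypothetical negative sample
```

After a few steps the match score of the positive pair moves above 0.5 and that of the negative pair below 0.5, which is the "pull closer / push away" behaviour described above.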

Skip-gram model

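(introduced here because our model is built on skip-gram)

the continuous skip-gram model with negative sampling of Mikolov et al. (2013b)
given a word vocabulary of size W, each word is identified by its index w ∈ {1, ..., W}
goal: learn a vectorial representation for each word w
word representations are trained to predict well the words that appear in their context
(words that occur often in the corpus are sampled more and rare words less, so frequent words are trained on repeatedly and more precisely)
objective: given a training corpus w_1, ..., w_T, maximize the following log-likelihood

$$\sum_{t=1}^{T} \sum_{c \in \mathcal{C}_t} \log p(w_c \mid w_t)$$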

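Wt: target word (the word at position t), Wc: context word (a word surrounding position t)
the objective measures how well the surrounding words are predicted given a word
the objective is maximized, and the softmax function scores how well each word is predicted
the context Ct is the set of indices of words surrounding word wt
p(Wc|Wt) is parameterized by the word vectors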

Negative sampling

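the scoring function s maps (word, context) pairs to scores in R
one possible way to define the probability of a context word is the softmax:

$$p(w_c \mid w_t) = \frac{e^{s(w_t, w_c)}}{\sum_{j=1}^{W} e^{s(w_t, j)}}$$

-> such a model does not fit our case, since it implies that, given a word wt, only one context word wc is predicted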
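predicting context words can instead be framed as a set of independent binary classification tasks
- goal: independently predict the presence (or absence) of context words
- for the word at position t, all context words are treated as positive examples, and negatives are sampled at random from the dictionary
for a chosen context position c, the binary logistic loss gives the following negative log-likelihood
(= loss for one actual context word + loss for the negative examples)

$$\log\left(1 + e^{-s(w_t, w_c)}\right) + \sum_{n \in \mathcal{N}_{t,c}} \log\left(1 + e^{s(w_t, n)}\right)$$

- Nt,c: the set of negative examples sampled from the vocabulary
- with the logistic loss function ℓ: x → log(1 + e^{-x}), summing over all positions t = 1, ..., T gives the final objective:

$$\sum_{t=1}^{T} \left[ \sum_{c \in \mathcal{C}_t} \ell\big(s(w_t, w_c)\big) + \sum_{n \in \mathcal{N}_{t,c}} \ell\big(-s(w_t, n)\big) \right]$$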
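
To make the loss concrete, here is a minimal numeric sketch of the negative-sampling loss for one (target, context) position. The function names and the example scores are assumptions for illustration; the scores would normally come from the scoring function s.

```python
import math


def logistic_loss(x):
    """l(x) = log(1 + exp(-x))."""
    return math.log1p(math.exp(-x))


def negative_sampling_loss(pos_score, neg_scores):
    """Loss for one actual context word plus the loss for its negative examples."""
    return logistic_loss(pos_score) + sum(logistic_loss(-s) for s in neg_scores)


# Hypothetical scores: the model is confident about the positive pair
# and gives low scores to the sampled negatives.
good = negative_sampling_loss(4.0, [-3.0, -5.0])
# Same pairs, but scored the wrong way round.
bad = negative_sampling_loss(-4.0, [3.0, 5.0])
```

The loss is small when the positive score is high and the negative scores are low (`good`), and large in the opposite case (`bad`).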

A natural parameterization for the scoring function s

use word vectors for wt and wc
- for each word w in the vocabulary, define two vectors $u_w$ and $v_w$ in $\mathbb{R}^d$
- in the literature these two vectors are sometimes called the input and output vectors
- the vectors $u_{w_t}$ and $v_{w_c}$ correspond to the words wt and wc
- the score can be computed as the scalar product
  between word and context vectors: $s(w_t, w_c) = u_{w_t}^\top v_{w_c}$

3.2 Subword model

Problem

by using a distinct vector representation for each word, the skipgram model ignores the internal structure of words

hard to give every word its own representation
-> rare words are hard to learn

morphological information is not used
-> forms that are morphologically (i.e. internally) similar, such as inflected forms (changed endings), typos, and compound words, are each learned separately, with no parameter sharing

proposes a different scoring function based on a bag of character n-grams
-> parameters are shared through subword embeddings

each word w is represented as a bag of character n-grams
special boundary symbols < and > are added
- at the beginning and end of each word
- so that prefixes and suffixes can be distinguished from other character sequences
the word w itself is also included in the set of its n-grams
-> so that a representation is learned for each word
e.g. where, n=3
character n-grams: <wh, whe, her, ere, re>
special sequence: <where>
- the sequence <her>, which corresponds to the word her, is different from the tri-gram her taken from where
- in practice all n-grams with 3 <= n <= 6 are extracted
- other sets of n-grams could also be considered, e.g. taking all prefixes and suffixes
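
The n-gram extraction described above can be sketched as follows. The function name `char_ngrams` and its parameters are assumptions for this example, not the fastText API.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in < and >, plus the whole sequence."""
    token = f"<{word}>"
    grams = [token[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(token) - n + 1)]
    if token not in grams:  # the special sequence for the word itself
        grams.append(token)
    return grams


trigrams = char_ngrams("where", n_min=3, n_max=3)
all_grams = char_ngrams("where")  # 3 <= n <= 6, as used in practice
```

Because of the boundary symbols, the tri-gram `her` inside where and the sequence `<her>` for the word her are different entries.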
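e.g. given a dictionary of n-grams of size G
- for a word w
- the set of n-grams appearing in w: Gw ⊂ {1, . . . , G}
- a vector representation zg is associated with each n-gram g
- a word is represented as the sum of the vector representations of its n-grams
- the scoring function becomes

$$s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^\top v_c$$

g: an n-gram in Gw, the bag of n-grams of the word w
zg: the vector of the n-gram (subword) g
vc: the vector of the context word c to be predicted
with a dictionary of G n-grams, the score can be written as in the formula above
-> embedding the subwords together with the original word mitigates the OOV problem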

Effect

this model shares representations across words, which makes it possible to learn reliable representations for rare words
to bound memory requirements, a hashing function is used

- maps n-grams to integers in 1 to K
- character sequences are hashed with the Fowler-Noll-Vo hashing function
-> a word is represented by its index in the word dictionary + the set of hashed n-grams it contains

hashing function
motivation: bound memory requirements when there is a huge number of unique n-grams
principle: instead of learning an embedding for each unique n-gram, learn a total of B embeddings (B is the bucket size)
each character n-gram is hashed to an integer between 1 and B
collisions can occur, but this helps keep the vocabulary size under control
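
As a rough illustration of the hashing trick, the sketch below hashes an n-gram into one of B buckets with the 32-bit FNV-1a variant of the Fowler-Noll-Vo hash. The function names and the bucket count used in the example are assumptions, and the sketch is not guaranteed to reproduce the bucket indices of the actual fastText code.

```python
FNV_OFFSET_BASIS = 2166136261  # standard 32-bit FNV offset basis
FNV_PRIME = 16777619           # standard 32-bit FNV prime


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFF  # keep the hash in 32 bits
    return h


def ngram_bucket(ngram: str, num_buckets: int) -> int:
    """Map an n-gram to an integer between 1 and num_buckets."""
    return fnv1a_32(ngram.encode("utf-8")) % num_buckets + 1


bucket = ngram_bucket("<wh", num_buckets=1000)  # hypothetical bucket count
```

Two different n-grams may land in the same bucket and then share one embedding, which is the collision mentioned above.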

4. Experimental setup

4.1 Baseline

except in Sec. 5.3, we compare against the C implementation of the skipgram and cbow models from the word2vec package

4.2 Optimization

stochastic gradient descent is applied to the negative log-likelihood
as in the baseline skipgram model
- a linear decay of the step size is used
- given a training set containing T words and a number of passes over the data equal to P
  - the step size at time t is γ0(1 − t/(TP))
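
The linear decay can be written as a one-line function. The name `step_size` and the example numbers are assumptions for illustration.

```python
def step_size(t, total_words, passes, gamma0=0.05):
    """Linearly decayed step size gamma0 * (1 - t / (T * P)) at time t."""
    return gamma0 * (1.0 - t / (total_words * passes))


# Hypothetical run: T = 1000 words, P = 5 passes.
start = step_size(0, total_words=1000, passes=5)
middle = step_size(2500, total_words=1000, passes=5)
end = step_size(5000, total_words=1000, passes=5)
```

The step size starts at γ0, is halved midway through training, and reaches zero after T·P words.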

4.3 Implementation details

the word vectors have dimension 300
for each positive example
- 5 negatives are sampled at random
  (with probability proportional to the square root of the uni-gram frequency)
context window of size c, with c sampled uniformly between 1 and 5
to sub-sample the most frequent words, a rejection threshold of 10^-4 is used
only words that appear at least 5 times in the training set are kept
step size γ0: 0.025 (skipgram baseline), 0.05 (our model and the cbow baseline)
speed is measured with this setting on English data
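
Two of these settings, the negative-sampling distribution and the random window size, can be sketched with the standard library. The helper names and the toy corpus are assumptions for this example.

```python
import math
import random
from collections import Counter


def negative_sampling_table(tokens):
    """Words and their sampling weights, proportional to sqrt(unigram count)."""
    counts = Counter(tokens)
    words = sorted(counts)
    weights = [math.sqrt(counts[w]) for w in words]
    return words, weights


def sample_negatives(words, weights, k, rng):
    """Draw k negatives at random according to the weights."""
    return rng.choices(words, weights=weights, k=k)


def sample_window(rng, max_window=5):
    """Window size c sampled uniformly between 1 and max_window."""
    return rng.randint(1, max_window)


rng = random.Random(0)
words, weights = negative_sampling_table(["a", "a", "a", "a", "b"])
negatives = sample_negatives(words, weights, k=5, rng=rng)
window = sample_window(rng)
```

With the square root, a word that is 4 times more frequent is only 2 times more likely to be drawn as a negative. A real implementation would also skip negatives that coincide with the actual context word, which this sketch does not do.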

Results

our model with character n-grams trains about 1.5x slower than the skipgram baseline
it processes 105k words/second/thread versus 145k words/second/thread for the baseline
the model is implemented in C++

4.4 Datasets

except in Sec. 5.3, models are trained on Wikipedia data
nine languages: Arabic, Czech, German, English, Spanish, French, Italian, Romanian and Russian
raw Wikipedia data is normalized with Matt Mahoney's pre-processing perl script

5. Results

Five experiments

an evaluation of word similarity
word analogies
a comparison to state-of-the-art methods
an analysis of the effect of the size of the training data
an analysis of the effect of the size of character n-grams

5.1 Human similarity judgement

evaluates the quality of our representations on the task of word similarity / relatedness

How: by computing Spearman's rank correlation coefficient (Spearman, 1904) between human judgement and the cosine similarity between the vector representations
For German: GUR65, GUR350, ZG222 -> models are compared on these three datasets
For English: WS353, rare word dataset (RW)
French word vectors are evaluated on the translated dataset RG65
Spanish, Arabic and Romanian word vectors are evaluated on the datasets of Hassan and Mihalcea (2009)
Russian word vectors are evaluated on the HJ dataset
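
The evaluation metric can be sketched with the standard library only. This is a minimal version of cosine similarity and Spearman's rank correlation (ties get average ranks); the function names and the toy word pairs, vectors, and human scores are made up for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy data (hypothetical): three word pairs with human scores and model vectors.
human = [9.0, 5.0, 1.0]
vector_pairs = [([1.0, 0.0], [1.0, 0.1]),
                ([1.0, 0.0], [1.0, 1.0]),
                ([1.0, 0.0], [0.0, 1.0])]
model = [cosine(a, b) for a, b in vector_pairs]
rho = spearman(human, model)
```

Only the ordering of the pairs matters: `rho` is high when the model ranks the word pairs the same way the human annotators do.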

Procedure

some words from these datasets do not appear in the training data, so cbow and skipgram cannot produce a word representation for them
-> these words get null vectors by default
our model uses subword information, so it can compute valid representations for out-of-vocabulary words
(by taking the sum of their n-gram vectors)
- sisg-: our model with null vectors for OOV words; sisg: our model with OOV vectors computed from the n-grams

Results

the proposed model (sisg) outperforms the baselines on all datasets except English WS353

computing vectors for out-of-vocabulary words (sisg) is always at least as good as not doing so (sisg-)
-> this shows the benefit of using subword information in the form of character n-grams

effect of using character n-grams

the effect is larger for Arabic, German and Russian than for English, French or Spanish
German and Russian exhibit grammatical declensions, with four cases for German and six for Russian
many German words are compound words
- e.g. the nominal phrase 'table tennis' is written as the single word "Tischtennis"
  - by exploiting the character-level similarities between "Tischtennis" and "Tennis", our model does not represent the two words as completely different words

our approach outperforms the baselines on the English Rare Words dataset (RW), but not on the English WS353 dataset

words in the WS353 dataset are common words, for which good vectors can be obtained without subword information
when evaluating on less frequent words, using similarities at the character level between words helps to learn good word vectors

5.2 Word analogy tasks

analogy questions of the form A is to B as C is to D (D must be predicted by the models)
datasets: English, Czech, German, Italian
questions containing words that do not appear in the training corpus are excluded from the evaluation
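
The notes do not spell out how D is predicted. A common way, assumed here, is the vector-offset method: take the word whose vector is closest (by cosine) to B − A + C, excluding A, B and C themselves. The function names and the toy vectors below are made up for illustration.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def solve_analogy(a, b, c, vectors):
    """Return the word D that best completes 'a is to b as c is to D'."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Hypothetical 3-dimensional word vectors.
toy_vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [0.0, 1.0, 1.0],
    "apple": [0.0, 0.0, -1.0],
}
answer = solve_analogy("man", "woman", "king", toy_vectors)
```

A question counts as correct when the predicted word equals D.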

Interpretation

morphological information improves the syntactic tasks, where our model outperforms the baselines
it does not help much on semantic questions
(performance even degrades for German and Italian)
this is related to the choice of the length of character n-grams (Sec. 5.5): when the size of the n-grams is chosen optimally, the semantic analogies degrade less
the improvement over the baselines is larger for morphologically rich languages (Czech, German)

5.3 Comparison with morphological representations

compares against previous work on word vectors incorporating subword information, on word similarity tasks
- to make the results comparable, our model is trained on the same data as the compared methods
also compared with the log-bilinear language model
representations of out-of-vocabulary words are obtained
(by summing the representations of character n-grams)

Interpretation

simple approach performs well relative to techniques based on subword information obtained from morphological segmentors.
our approach outperforms the Soricut and Och (2015) method, which is based on prefix and suffix analysis.
The large improvement for German is due to the fact that their approach does not model noun compounding, contrary to ours

5.4 Effect of the size of the training data

since character-level similarities between words are exploited, infrequent words can be modeled better
-> so the model should also be more robust to the size of the training data
we evaluate the performance of our word vectors on the similarity task as a function of the training data size
we train our model and the cbow baseline on portions of Wikipedia of increasing size

Procedure

as in Sec. 5.1, not all words from the evaluation set are present in the Wikipedia data
by default a null vector is used for these words (sisg-), or a vector is computed by summing the n-gram representations (sisg)
the out-of-vocabulary rate grows as the dataset shrinks
(so the performance of sisg- and cbow necessarily degrades)
the proposed model (sisg) assigns non-trivial vectors to previously unseen words

Results

for all datasets, and all sizes, the proposed approach (sisg) performs better than the baseline.

the performance of the baseline cbow model gets better as more and more data is available.
Our model, on the other hand, seems to quickly saturate and adding more data does not always lead to improved results.

proposed approach provides very good word vectors even when using very small training datasets.
-> well performing word vectors can be computed on datasets of a restricted size and still work well on previously unseen words.

5.5 Effect of the size of n-grams

the choice of n between 3 and 6 in Sec. 3.2 was arbitrary, motivated by the fact that n-grams of these lengths cover a wide range of information
they include short suffixes as well as longer roots
this section evaluates the effect of the range of n-grams
on English and German, on word similarity and analogy datasets

Results

English and German: the range 3-6 provides satisfactory performance across languages
the optimal choice of length ranges depends on the considered task and language and should be tuned appropriately
however, due to the scarcity of test data, the authors did not implement any proper validation procedure to automatically select the best parameters
it is important to include long n-grams: the columns corresponding to n ≤ 5 and n ≤ 6 work best -> especially for German (many nouns are compounds made up from several units that can only be captured by longer character sequences)
analogy tasks: using larger n-grams helps for semantic analogies
n ≥ 3 gives better results than n ≥ 2
(character 2-grams are not informative for that task)

5.6 Language modeling

word vectors are evaluated on a language modeling task on five languages (CS, DE, ES, FR, RU)
Our model is a recurrent neural network with 650 LSTM units~
Two baselines are considered: we compare our approach to the log-bilinear language model and the character aware language model of Kim et al. (2016).

Results

using word representations trained with subword information outperforms the plain skipgram model.

6. Qualitative analysis

6.1 Nearest neighbors

nearest neighbors according to cosine similarity for vectors trained using the proposed approach and for the skipgram baseline.
the nearest neighbors for complex, technical and infrequent words using our approach are better than the ones obtained using the baseline model

6.2 Character n-grams and morphemes

qualitatively evaluates whether or not the most important n-grams in a word correspond to morphemes
each word w is represented as the sum of its n-grams
for each n-gram g, compute the restricted representation $u_{w \setminus g}$ obtained by omitting g
rank n-grams by increasing value of cosine between $u_w$ and $u_{w \setminus g}$
-> the ranked n-grams for three languages are shown in Table 6
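
The ranking can be sketched as follows: an n-gram is important when removing it changes the word vector a lot, i.e. when the cosine between the full and the restricted vector is low. The function names and the toy n-gram vectors are assumptions for illustration, and the sketch assumes the word has at least two n-grams so that the restricted vector is non-zero.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def rank_ngrams(ngram_vectors):
    """Rank the n-grams of one word from most to least important.

    ngram_vectors maps each n-gram g of the word to its vector z_g.
    """
    dim = len(next(iter(ngram_vectors.values())))
    u_w = [sum(z[k] for z in ngram_vectors.values()) for k in range(dim)]
    scored = []
    for g, z_g in ngram_vectors.items():
        u_without_g = [u_w[k] - z_g[k] for k in range(dim)]  # omit g from the sum
        scored.append((cosine(u_w, u_without_g), g))
    return [g for _, g in sorted(scored)]  # increasing cosine


# Hypothetical n-gram vectors for one word.
toy = {"auto": [3.0, 0.0], "fahr": [0.0, 2.0], "rer": [0.1, 0.1]}
ranking = rank_ngrams(toy)
```

In this toy case, removing "auto" changes the direction of the word vector the most, so it is ranked first.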

Results

German: the most important n-grams correspond to valid morphemes
- good example: Autofahrer (car driver), whose most important n-grams are Auto (car) and Fahrer (driver)
English: compound nouns are separated into morphemes
- e.g. words such as lifetime or starfish
- n-grams can also correspond to affixes in words such as kindness or unlucky
French: the inflections of verbs, with endings such as ais>, ent> or ions>

6.3 Word similarity for OOV words

analyze which of the n-grams match best for OOV words by selecting a few word pairs from the English RW similarity dataset.

Procedure

select pairs such that one of the two words is not in the training vocabulary and is hence only represented by its ngrams.
For each pair of words, we display the cosine similarity between each pair of n-grams that appear in the words

Results

subwords match correctly
- e.g. chip -> there are two groups of n-grams in microcircuit that match well
  (These roughly correspond to micro and circuit, and n-grams in between don't match well)
- e.g. the pair rarity and scarceness
  (scarce roughly matches rarity while the suffix -ness matches -ity very well)
- e.g. preadolescent
  (matches young well thanks to the -adolesc- subword)
  -> we build robust word representations where prefixes and suffixes can be ignored if the grammatical form is not found in the dictionary

7. Conclusions

a simple method to learn word representations by taking subword information into account
incorporates character n-grams into the skipgram model
thanks to its simplicity -> trains fast and requires no preprocessing or supervision
shows that our model outperforms the baselines

[Code Review]

https://github.com/bentrevett/pytorch-sentiment-analysis/blob/master/3%20-%20Faster%20Sentiment%20Analysis.ipynb

Build the Model

Steps
1. compute the word embedding of each word ('Embedding layer') -- blue in the figure
2. compute the average of the word embeddings -- pink
3. feed the average to the 'Linear layer' -- grey

How the averaging is done

done with the avg_pool2d (average pool 2-dimensions) function
the word embeddings form a 2-dimensional grid
(the words are along one axis and the dimensions of the word embeddings are along the other)
the example is a sentence converted into 5-dimensional word embeddings
(the words along the vertical axis and the embeddings along the horizontal axis)

'avg_pool2d' uses a filter of size 'embedded.shape[1]' (i.e. the length of the sentence) by 1 -- pink

The average value of all elements covered by the filter is computed. The filter then slides to the right, computing the average of the next column of embedding values for each word in the sentence.

Each filter position gives a single value, the average of all covered elements. Once the filter has covered all embedding dimensions, we get a [1x5] tensor, which is passed through the linear layer to produce the prediction.
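
To see what this pooling computes without running PyTorch, here is a plain-Python sketch that averages each embedding dimension over the words of a sentence. The function name and the toy numbers are assumptions; the notebook itself does this with `avg_pool2d` on tensors.

```python
def average_over_words(embedded):
    """Average each embedding dimension over all words of one sentence.

    embedded is a list of rows: one row per word, one column per dimension.
    """
    sent_len = len(embedded)
    emb_dim = len(embedded[0])
    return [sum(embedded[w][d] for w in range(sent_len)) / sent_len
            for d in range(emb_dim)]


# Hypothetical sentence of 3 words with 5-dimensional embeddings.
embedded = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [3.0, 4.0, 5.0, 6.0, 7.0],
    [5.0, 6.0, 7.0, 8.0, 9.0],
]
pooled = average_over_words(embedded)
```

Whatever the sentence length, the result has one value per embedding dimension (here 5), which is why it can be fed to a linear layer of fixed input size.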
