This is a commentary and review of the Word2Vec paper "Efficient Estimation of Word Representations in Vector Space" (Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, 2013).
Let's treat words as vectors that capture similarity between words and compute with them. Words don't exist in isolation.
[1-1-1] Many current NLP systems and techniques treat words as atomic units - there is no notion of similarity between words, as these are represented as indices in a vocabulary. This choice has several good reasons - simplicity, robustness and the observation that simple models trained on huge amounts of data outperform complex systems trained on less data.
At the time this paper was written (2013), most NLP systems treated words as atomic units: each word was represented as a mere index in a vocabulary (a word list), with no notion of similarity between words.
Of course, if treating a 'word' as an atomic unit was the standard approach, it must have had its advantages, right?
Back then, this approach was favored because it is (1) simple, (2) robust (the output is not heavily affected by noise in the input), and (3) backed by the empirical observation that a 'simple model + huge data' beat a 'complex model + little data'.
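To make "no notion of similarity" concrete, here is a minimal sketch (my own toy example, not from the paper): each word is just an index in the vocabulary, and the corresponding one-hot vectors say nothing about how related two words are.

```python
import numpy as np

# Hypothetical toy vocabulary: each word is just an index.
vocab = {"hotel": 0, "motel": 1, "banana": 2}

def one_hot(word, size=len(vocab)):
    """Represent a word as a one-hot vector over the vocabulary."""
    v = np.zeros(size)
    v[vocab[word]] = 1.0
    return v

# Any two distinct words have dot product 0: "hotel" is no more
# similar to "motel" than it is to "banana".
print(one_hot("hotel") @ one_hot("motel"))   # 0.0
print(one_hot("hotel") @ one_hot("banana"))  # 0.0
```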
[PPL] An example is the popular N-gram model used for statistical language modeling - today, it is possible to train N-grams on virtually all available data (trillions of words [3]). However, the simple techniques are at their limits in many tasks. For example, the amount of relevant in-domain data for automatic speech recognition is limited - the performance is usually dominated by the size of high quality transcribed speech data (often just millions of words). In machine translation, the existing corpora for many languages contain only a few billions of words or less. Thus, there are situations where simple scaling up of the basic techniques will not result in any significant progress, and we have to focus on more advanced techniques.
The N-gram model is a fairly simple model used for statistical language modeling. Now that vast amounts of data can be collected, the training data we can feed an N-gram model is practically endless.
Hooray? Not quite. Does cramming all the data in the world into a simple model make it a good model? Rote learning has its limits; as the paper points out, for many tasks (speech recognition, machine translation) the relevant in-domain data is limited anyway, so simply scaling up the basic techniques stops producing significant progress.
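For reference, a toy count-based bigram model (the simplest N-gram case; my own sketch, the corpus and numbers are made up): probabilities come purely from counts of observed word pairs, so an unseen pair gets zero probability no matter how similar the words are.

```python
from collections import Counter

# Toy corpus; real N-gram models are estimated on billions of tokens.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigrams (w1, w2) and the unigrams w1 they condition on.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus[:-1])

def p_bigram(w2, w1):
    """Maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1)."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(p_bigram("cat", "the"))  # 2/4 = 0.5
print(p_bigram("dog", "the"))  # 0.0 -- unseen pair, however similar "dog" and "cat" are
```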
With progress of machine learning techniques in recent years, it has become possible to train more complex models on much larger data set, and they typically outperform the simple models. Probably the most successful concept is to use distributed representations of words [10]. For example, neural network based language models significantly outperform N-gram models [1, 27, 17].
Distributed Representations of Words
Explanation: in a distributed representation, each word is a dense real-valued vector rather than a single index, so a word's meaning is spread ("distributed") across many dimensions and semantically similar words end up with similar vectors.
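In contrast to the one-hot sketch above, here is a rough illustration of a distributed representation (the vectors below are made up for illustration; real word vectors are learned from data): every word is a dense vector, so similarity between words becomes something we can actually measure.

```python
import numpy as np

# Made-up 4-dimensional dense vectors; learned word vectors typically
# have hundreds of dimensions.
vectors = {
    "hotel":  np.array([0.8, 0.1, 0.6, 0.2]),
    "motel":  np.array([0.7, 0.2, 0.5, 0.1]),
    "banana": np.array([0.0, 0.9, 0.1, 0.8]),
}

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vectors["hotel"], vectors["motel"]))   # close to 1 -> similar words
print(cosine(vectors["hotel"], vectors["banana"]))  # much smaller
```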
[1-4-1] The main goal of this paper is to introduce techniques that can be used for learning high-quality word vectors from huge data sets with billions of words, and with millions of words in the vocabulary.
The goal of this paper: to introduce a technique (Word2Vec) that can learn high-quality word vectors from huge datasets with billions of words and vocabularies of millions of words.
Here, 'high quality' doesn't just mean that the dimensionality of the vector is high; in the paper it means vectors that preserve multiple degrees of similarity between words (syntactic and semantic regularities, e.g. vector("King") - vector("Man") + vector("Woman") ends up close to vector("Queen")), which in practice calls for higher-dimensional vectors trained on far more data.
[1-4-2] As far as we know, none of the previously proposed architectures has been successfully trained on more than a few hundred of millions of words, with a modest dimensionality of the word vectors between 50 - 100.
When this paper came out, earlier architectures had never been successfully trained on more than a few hundred million words, and they only learned fairly low-dimensional word vectors (a modest 50-100 dimensions)...
Of course, this paper was hardly humanity's first attempt at applying neural networks to NLP...
A very popular model architecture for estimating neural network language model (NNLM) was proposed in [1], where a feedforward neural network with a linear projection layer and a non-linear hidden layer was used to learn jointly the word vector representation and a statistical language model. This work has been followed by many others.
In short: a feedforward NN with a linear projection layer + a non-linear hidden layer.
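As a rough sketch of the NNLM architecture from [1] (my own simplification; vocabulary size, context length, and dimensions below are arbitrary): the indices of the previous words go through a shared linear projection (the word-vector table), the projected vectors are concatenated, passed through a non-linear hidden layer, and the output layer scores every word in the vocabulary as the next word.

```python
import torch
import torch.nn as nn

class FeedforwardNNLM(nn.Module):
    """Simplified feedforward NNLM in the spirit of [1]:
    linear projection layer + non-linear hidden layer + output over the vocabulary."""

    def __init__(self, vocab_size=10000, context=4, embed_dim=100, hidden_dim=500):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)           # linear projection layer (the word vectors)
        self.hidden = nn.Linear(context * embed_dim, hidden_dim)   # non-linear hidden layer
        self.out = nn.Linear(hidden_dim, vocab_size)                # scores for the next word

    def forward(self, context_ids):
        # context_ids: (batch, context) indices of the previous words
        e = self.embed(context_ids)                 # (batch, context, embed_dim)
        h = torch.tanh(self.hidden(e.flatten(1)))   # concatenate, then non-linearity
        return self.out(h)                          # logits over the vocabulary

# Usage: predict the next word from 4 previous word indices.
model = FeedforwardNNLM()
logits = model(torch.randint(0, 10000, (2, 4)))     # batch of 2 contexts
print(logits.shape)                                  # torch.Size([2, 10000])
```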
[13] T. Mikolov. Language Modeling for Speech Recognition in Czech, Master's thesis, Brno University of Technology, 2007.
[14] T. Mikolov, J. Kopecký, L. Burget, O. Glembek and J. Černocký. Neural network based language models for highly inflective languages, 2009.
Another interesting architecture of NNLM was presented in [13, 14], where the word vectors are first learned using neural network with a single hidden layer. The word vectors are then used to train the NNLM. Thus, the word vectors are learned even without constructing the full NNLM. In this work, we directly extend this architecture, and focus just on the first step where the word vectors are learned using a simple model.
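A hedged sketch of the two-step idea in [13, 14] (shapes and names below are my own assumptions): step 1 learns the word vectors with a simple model, and step 2 plugs them into the full NNLM's projection layer instead of learning everything jointly.

```python
import torch
import torch.nn as nn

vocab_size, context, embed_dim, hidden_dim = 10000, 4, 100, 500

# Step 1 (assumed already done): word vectors learned by a simple model
# with a single hidden layer; here they are just random placeholders.
pretrained_vectors = torch.randn(vocab_size, embed_dim)

# Step 2: the full NNLM starts from those vectors instead of learning the
# projection layer together with everything else.
projection = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
nnlm = nn.Sequential(
    projection,                      # reuses the pre-trained word vectors
    nn.Flatten(start_dim=1),         # concatenate the context word vectors
    nn.Linear(context * embed_dim, hidden_dim),
    nn.Tanh(),
    nn.Linear(hidden_dim, vocab_size),
)

logits = nnlm(torch.randint(0, vocab_size, (2, context)))
print(logits.shape)  # torch.Size([2, 10000])
```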
It was later shown that the word vectors can be used to significantly improve and simplify many NLP applications [4, 5, 29]. Estimation of the word vectors itself was performed using different model architectures and trained on various corpora [4, 29, 23, 19, 9], and some of the resulting word vectors were made available for future research and comparison. However, as far as we know, these architectures were significantly more computationally expensive for training than the one proposed in [13], with the exception of certain version of log-bilinear model where diagonal weight matrices are used [23].
Reference material:
(Video) Stanford CS224N: NLP with Deep Learning | Winter 2021 | Lecture 1 - Intro & Word Vectors