[๋…ผ๋ฌธ ํ•ด์„]Efficient Estimation of Word Representations in Vector Space

numver_se · March 12, 2021

💡 0. Abstract

๋ณธ๊ณ ๋Š” ๋Œ€๋Ÿ‰์˜ ๋ฐ์ดํ„ฐ๋กœ ๋ถ€ํ„ฐ ๋‹จ์–ด์˜ ์—ฐ์†์ ์ธ ๋ฒกํ„ฐํ‘œํ˜„์„ ๊ณ„์‚ฐํ•˜๊ธฐ ์œ„ํ•œ ๋‘ ๊ฐœ์˜ ์ƒˆ๋กœ์šด ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ด ํ‘œํ˜„๋“ค์˜ ์„ฑ๋Šฅ์€ ๋‹จ์–ด ์œ ์‚ฌ๋„๋กœ ์ธก์ •๋˜๋ฉฐ, ์ด ๊ฒฐ๊ณผ๋Š” ์ด์ „์— ๊ฐ€์žฅ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ƒˆ๋˜ ๋‹ค๋ฅธ ์œ ํ˜•์˜ ์‹ ๊ฒฝ๋ง๊ตฌ์กฐ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœํ•œ ๊ธฐ์ˆ ๊ณผ ๋น„๊ตํ•œ๋‹ค. ๋ณธ๊ณ ๋Š” ๋งค์šฐ ์ž‘์€ ๊ณ„์‚ฐ ๋ณต์žก๋„๋กœ ํฐ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ณด์—ฌ์ค€๋‹ค. ๋‹ค์‹œ๋งํ•ด, 1.6 billion ๊ฐœ์˜ ๋‹จ์–ด ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ ๋ถ€ํ„ฐ ๋†’์€ ํ’ˆ์งˆ์˜ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ๋ฐฐ์šฐ๋Š” ๋ฐ์— ํ•˜๋ฃจ๊ฐ€ ์ฑ„ ๊ฑธ๋ฆฌ์ง€ ์•Š๋Š”๋‹ค. ๋”์šฑ์ด, ๊ตฌ๋ฌธ ์œ ์‚ฌ๋„์™€ ์˜๋ฏธ ์œ ์‚ฌ๋„๋ฅผ ์ธก์ •ํ•˜๊ธฐ ์œ„ํ•ด ์ด ๋ฒกํ„ฐ๋“ค์ด ์ตœ์ฒจ๋‹จ์˜ ํ…Œ์ŠคํŠธ์…‹์„ ์ œ๊ณตํ•œ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค.

💡 1. Introduction

๋งŽ์€ NLP systems and techniques๋“ค์ด ๋‹จ์–ด๋ฅผ atomic unit(์›์ž ์š”์†Œ)๋กœ ๋‹ค๋ฃฌ๋‹ค. ์ฆ‰, ๋‹จ์–ด ๊ฐ„์˜ ์œ ์‚ฌ์„ฑ์— ๋Œ€ํ•œ ๊ฐœ๋…์ด ์—†๋‹ค๋Š” ๊ฒƒ์ด๋‹ค. ์ด๋Ÿฌํ•œ ๋ฐฉ์‹์€ ๋‹จ์ˆœํ•˜๊ณ , robustํ•˜๋ฉฐ, '๋งŽ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ํ›ˆ๋ จ๋œ ๋‹จ์ˆœํ•œ ๋ชจ๋ธ'์ด '์ ์€ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ํ›ˆ๋ จ๋œ ๋ณต์žกํ•œ ๋ชจ๋ธ'๋ณด๋‹ค ๋›ฐ์–ด๋‚œ ๊ฒƒ์ด ๊ด€์ฐฐ๋˜๋Š” ๋“ฑ์˜ ๋‹ค์–‘ํ•œ ์žฅ์ ๋“ค๋•Œ๋ฌธ์— ์ž์ฃผ ์‚ฌ์šฉ๋œ๋‹ค. ๊ทธ ์˜ˆ๋กœ N-gram model์„ ๋งํ•  ์ˆ˜ ์žˆ๋Š”๋ฐ, ์˜ค๋Š˜๋‚ ์˜ N-Gram์€ ์‚ฌ์‹ค์ƒ ๋ชจ๋“  ๋ฐ์ดํ„ฐ์— ๋Œ€ํ•ด ํ›ˆ๋ จ์ด ๊ฐ€๋Šฅํ•˜๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๋งŽ์€ ๋ฉด์—์„œ ์ œ์•ฝ์„ ๊ฐ€์ง„๋‹ค.

N-Gram model

n-gram์€ n๊ฐœ์˜ ์—ฐ์†์ ์ธ ๋‹จ์–ด ๋‚˜์—ด์„ ์˜๋ฏธํ•œ๋‹ค. ๋‹ค์‹œ ๋งํ•ด, ๊ฐ–๊ณ  ์žˆ๋Š” ์ฝ”ํผ์Šค์—์„œ n๊ฐœ์˜ ๋‹จ์–ด ๋ญ‰์น˜ ๋‹จ์œ„๋กœ ๋Š์€ ๊ฒƒ์„ ํ•˜๋‚˜์˜ ํ† ํฐ์œผ๋กœ ๊ฐ„์ฃผํ•œ๋‹ค.

๐Ÿ“Œ "An adorable little boy is spreading smiles"๋ผ๋Š” ๋ฌธ์žฅ์ด ์žˆ์„ ๋•Œ, ๊ฐ n์— ๋Œ€ํ•ด์„œ n-gram์„ ์ „๋ถ€ ๊ตฌํ•ด๋ณด๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

  • unigrams : an, adorable, little, boy, is, spreading, smiles
  • bigrams : an adorable, adorable little, little boy, boy is, is spreading, spreading smiles
  • trigrams : an adorable little, adorable little boy, little boy is, boy is spreading, is spreading smiles
  • 4-grams : an adorable little boy, adorable little boy is, little boy is spreading, boy is spreading smiles

n์ด 1์ผ ๋•Œ๋Š” ์œ ๋‹ˆ๊ทธ๋žจ(unigram), 2์ผ ๋•Œ๋Š” ๋ฐ”์ด๊ทธ๋žจ(bigram), 3์ผ ๋•Œ๋Š” ํŠธ๋ผ์ด๊ทธ๋žจ(trigram)์ด๋ผ๊ณ  ๋ช…๋ช…ํ•˜๊ณ  n์ด 4 ์ด์ƒ์ผ ๋•Œ๋Š” gram ์•ž์— ๊ทธ๋Œ€๋กœ ์ˆซ์ž๋ฅผ ๋ถ™์—ฌ์„œ ๋ช…๋ช…ํ•œ๋‹ค.

Recently, with progress in machine learning, it has become possible to train more complex models on much larger datasets, and these have outperformed the simple models. Probably the most successful concept is the use of distributed representations of words. For example, neural network based language models significantly outperform N-gram models.

Distributed Representation

๊ฐ๊ฐ์˜ ์†์„ฑ์„ ๋…๋ฆฝ์ ์ธ ์ฐจ์›์œผ๋กœ ๋‚˜ํƒ€๋‚ด์ง€ ์•Š๊ณ , ์šฐ๋ฆฌ๊ฐ€ ์ •ํ•œ ์ฐจ์›์œผ๋กœ ๋Œ€์ƒ์„ ๋Œ€์‘์‹œ์ผœ์„œ ํ‘œํ˜„ํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ํ•ด๋‹น ์†์„ฑ์„ 5์ฐจ์›์œผ๋กœ ํ‘œํ˜„ํ•  ๊ฒƒ์ด๋ผ๊ณ  ์ •ํ•˜๋ฉด ๊ทธ ์†์„ฑ์„ 5์ฐจ์› ๋ฒกํ„ฐ์— ๋Œ€์‘(embedding)์‹œํ‚ค๋Š” ๊ฒƒ์ด๋‹ค.

์ž„๋ฒ ๋”ฉ๋œ ๋ฒกํ„ฐ๋Š” ๋”์ด์ƒ sparseํ•˜์ง€ ์•Š๋‹ค. One-hot encoding์ฒ˜๋Ÿผ ๋Œ€๋ถ€๋ถ„์ด 0์ธ ๋ฒกํ„ฐ๊ฐ€ ์•„๋‹ˆ๋ผ, ๋ชจ๋“  ์ฐจ์›์ด ๊ฐ’์„ ๊ฐ–๊ณ  ์žˆ๋Š” ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„์ด ๋œ๋‹ค. โ€˜Distributedโ€™๋ผ๋Š” ๋ง์ด ๋ถ™๋Š” ์ด์œ ๋Š” ํ•˜๋‚˜์˜ ์ •๋ณด๊ฐ€ ์—ฌ๋Ÿฌ ์ฐจ์›์— ๋ถ„์‚ฐ๋˜์–ด ํ‘œํ˜„๋˜๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. Sparse representation์—์„œ๋Š” ๊ฐ๊ฐ์˜ ์ฐจ์›์ด ๊ฐ๊ฐ์˜ ๋…๋ฆฝ์ ์ธ ์ •๋ณด๋ฅผ ๊ฐ–๊ณ  ์žˆ์ง€๋งŒ, Distribution representation์—์„œ๋Š” ํ•˜๋‚˜์˜ ์ฐจ์›์ด ์—ฌ๋Ÿฌ ์†์„ฑ๋“ค์ด ๋ฒ„๋ฌด๋ ค์ง„ ์ •๋ณด๋ฅผ ๋“ค๊ณ  ์žˆ๋‹ค. ์ฆ‰, ํ•˜๋‚˜์˜ ์ฐจ์›์ด ํ•˜๋‚˜์˜ ์†์„ฑ์„ ๋ช…์‹œ์ ์œผ๋กœ ํ‘œํ˜„ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ ์—ฌ๋Ÿฌ ์ฐจ์›๋“ค์ด ์กฐํ•ฉ๋˜์–ด ๋‚˜ํƒ€๋‚ด๊ณ ์ž ํ•˜๋Š” ์†์„ฑ๋“ค์„ ํ‘œํ˜„ํ•˜๋Š” ๊ฒƒ์ด๋‹ค.

1. Goals of the Paper

๋…ผ๋ฌธ์˜ ์ฃผ ๋ชฉ์ ์€ ์ˆ˜ ์–ต๊ฐœ์˜ ๋‹จ์–ด๋กœ ๊ตฌ์„ฑ๋œ ๋งค์šฐ ํฐ ๋ฐ์ดํ„ฐ์—์„œ ํ€„๋ฆฌํ‹ฐ ๋†’์€ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐ์ˆ ๋“ค์„ ์†Œ๊ฐœํ•˜๋Š” ๋ฐ์— ์žˆ๋‹ค. ์ด์ œ๊ป ์ œ์•ˆ๋œ architecture ์ค‘์— ์–ด๋–ค ๊ฒƒ๋„ ์ˆ˜๋ฐฑ๋งŒ๊ฐœ ๋‹จ์–ด๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์— ์„ฑ๊ณตํ•˜์ง€ ๋ชปํ–ˆ์œผ๋ฉฐ, ๋ฒกํ„ฐ์˜ ํฌ๊ธฐ๋„ 50~100 ์ •๋„๋ฐ–์— ์‚ฌ์šฉํ•˜์ง€ ๋ชปํ–ˆ๋‹ค.

The paper proposes techniques for measuring the quality of vector representations, under the assumption that similar words lie close to each other and that words can have multiple degrees of similarity. This had been observed earlier in the context of inflectional languages (such as Latin): for example, nouns can have multiple word endings, and when searching for similar words in a subspace of the original vector space, it is possible to find words with similar endings. Somewhat surprisingly, the similarity of word representations goes beyond simple syntactic regularities. Using the word offset technique, in which simple algebraic operations are performed on word vectors, vector("King") - vector("Man") + vector("Woman") results in a vector closest to vector("Queen").
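The word offset result can be tried directly with the gensim library; this sketch assumes gensim is installed and downloads the large pretrained Google News vectors (about 1.6 GB), so the exact neighbor and score may vary.

```python
# Word offset technique with pretrained vectors (assumes gensim is installed).
import gensim.downloader as api

model = api.load("word2vec-google-news-300")  # pretrained word2vec vectors

# vector("king") - vector("man") + vector("woman") ≈ vector("queen")
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.71...)]
```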

์ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹จ์–ด ์‚ฌ์ด์˜ ์„ ํ˜• ๊ทœ์น™์„ ๋ณด์กดํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ๋งŒ๋“ค์–ด์„œ vector representation ์—ฐ์‚ฐ์˜ ์ •ํ™•๋„๋ฅผ ๊ทน๋Œ€ํ™” ์‹œํ‚ฌ ๊ฒƒ์ด๋‹ค. ๊ตฌ๋ฌธ ๊ทœ์น™๊ณผ ์˜๋ฏธ ๊ทœ์น™์„ ์ธก์ •ํ•˜๊ธฐ ์œ„ํ•ด ์ดํ•ดํ•˜๊ธฐ ์‰ฌ์šด ์ƒˆ๋กœ์šด ํ…Œ์ŠคํŠธ์…‹์„ ๋งŒ๋“ค์—ˆ๊ณ , ๋†’์€ ์ •ํ™•๋„๋กœ ๊ทœ์น™๋“ค์ด ํ•™์Šต๋˜๋Š” ๊ฒƒ์„ ๋ณด์˜€๋‹ค. ๋˜ํ•œ, ํ›ˆ๋ จ ์‹œ๊ฐ„๊ณผ ์ •ํ™•๋„๊ฐ€ ๋‹จ์–ด ๋ฒกํ„ฐ์˜ ์ฐจ์›๊ณผ training ๋ฐ์ดํ„ฐ์˜ ์–‘์— ์–ผ๋งˆ๋‚˜ ์˜์กดํ•˜๋Š”์ง€์— ๋Œ€ํ•ด ์ด์•ผ๊ธฐ ํ•˜๊ณ ์ž ํ•œ๋‹ค.

2. Previous Work

์ด์ „์—๋„ ๋‹จ์–ด๋ฅผ ์—ฐ์†์ ์ธ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์‚ฌ์šฉํ–ˆ์œผ๋ฉฐ, ๊ทธ ์ค‘์—์„œ๋„ NNLM(neural network language model)์— ๊ด€ํ•œ ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•œ ๊ฒƒ๋“ค์ด ์ž˜ ์•Œ๋ ค์ ธ ์žˆ๋‹ค.A neural probabilistic language model ๋…ผ๋ฌธ์—์„œ ์ œ์‹œ๋œ ๋ชจ๋ธ์€ Linear Projection Layer์™€ Non-Linear Hidden Layer ๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ Feedforward Neural Network๋ฅผ ํ†ตํ•ด ๋‹จ์–ด ๋ฒกํ„ฐ ํ‘œํ˜„๊ณผ ํ†ต๊ณ„ํ•™์ ์ธ ์–ธ์–ด ๋ชจ๋ธ์˜ ๊ฒฐํ•ฉ์„ ํ•™์Šตํ•˜๋Š”๋ฐ ์‚ฌ์šฉ๋œ๋‹ค.
Another interesting NNLM architecture is presented in "Language Modeling for Speech Recognition in Czech" and "Neural network based language models for highly inflective languages". There, the word vectors are first learned by a neural network with a single hidden layer, and those word vectors are then used to train the NNLM; the word vectors are thus learned even without constructing the full NNLM. This paper directly extends that architecture and focuses on the first step, where the word vectors are learned using a simple model.
์ด ๋‹จ์–ด ๋ฒกํ„ฐ๋“ค์€ ๋งŽ์€ NLP program์˜ ์—„์ฒญ๋‚œ ํ–ฅ์ƒ๊ณผ ๋‹จ์ˆœํ™”์— ์‚ฌ์šฉ๋  ์ˆ˜ ์žˆ์„ ๊ฒƒ์ด๋‹ค. ๋‹จ์–ด ๋ฒกํ„ฐ์˜ ์˜ˆ์ธก์€ ๋‹ค๋ฅธ ๋ชจ๋ธ ๊ตฌ์กฐ์„ ์‚ฌ์šฉํ•˜๋Š”๋ฐ ์‹คํ–‰๋˜๊ณ , ๋‹ค์–‘ํ•œ ๋‹จ์–ด corpora๋ฅผ ํ•™์Šตํ•œ๋‹ค. ๋‹จ์–ด ๋ฒกํ„ฐ ๊ฒฐ๊ณผ์˜ ์ผ๋ถ€๋Š” ๋ฏธ๋ž˜ ์—ฐ๊ตฌ์™€ ๋น„๊ต๋ฅผ ์œ„ํ•ด ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋œ๋‹ค. ํ•˜์ง€๋งŒ ์ด ๊ตฌ์กฐ๋“ค์€ ํ•™์Šต์„ ํ•˜๊ธฐ ์œ„ํ•ด ๋งค์šฐ ๊ณ„์‚ฐ ๋ณต์žก๋„๊ฐ€ ์ปค์ง€๋ฉฐ ๋น„์šฉ์ด ๋งŽ์ด ๋“ ๋‹ค.

💡 2. Model Architectures

Many different types of models, including LSA and LDA, have been proposed for estimating continuous representations of words. This paper focuses on distributed representations of words learned by neural networks. Similarly to "Strategies for Training Large Scale Neural Network Language Models", the computational complexity of a model is defined as the number of parameters that need to be accessed to fully train it. The goal is then to minimize computational complexity while maximizing accuracy.

The training complexity is defined as follows.

O = E × T × Q

where E is the number of training epochs, T is the number of words in the training set, and Q is defined further for each model architecture.

  • ์ผ๋ฐ˜์ ์œผ๋กœ E = 3~50, T๋Š” 10์–ต๊ฐœ ์ด์ƒ์œผ๋กœ ์ •์˜๋จ
  • ๋ชจ๋“  ๋ชจ๋ธ์€ stochastic gradient descent์™€ backpropagation์„ ์ด์šฉํ•˜์—ฌ ํ•™์Šต

1. Feedforward Neural Net Language Model (NNLM)

The NNLM model proposed in "A neural probabilistic language model" consists of input, projection, hidden, and output layers. At the input layer, the N preceding words are encoded using 1-of-V coding, where V is the size of the whole vocabulary, so each word is a vector of size V. The input layer is then projected to a projection layer P of dimensionality N × D, using a shared projection matrix.
The NNLM architecture becomes computationally expensive between the projection and the hidden layer, because the values in the projection layer are dense. For N = 10, P is typically 500 to 2000, while the hidden layer size H is typically 500 to 1000. Moreover, since the hidden layer is used to compute a probability distribution over all words in the vocabulary, the output layer has dimensionality V. Thus, the computational complexity per training example is as follows.

Q = N × D + N × D × H + H × V
  • dominating term์€ Hร—VHร—V

Several practical solutions exist for avoiding this cost, such as hierarchical versions of softmax or avoiding normalized models altogether. With binary tree representations of the vocabulary, the number of output units that need to be evaluated can go down to around log₂(V). Most of the complexity is then caused by the term N × D × H.
๋ณธ๊ณ ์—์„œ๋Š” Hierachical softmax๋ฅผ ์‚ฌ์šฉํ•˜๋Š”๋ฐ, ์ด๋Š” ๋‹จ์–ด๊ฐ€ Huffman binary tree๋กœ ๋‚˜ํƒ€๋‚˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค. ๋‹จ์–ด์˜ ๋นˆ๋„ ์ˆ˜๊ฐ€ NNLM์—์„œ class๋ฅผ ์–ป๊ธฐ์œ„ํ•ด ์ž˜ ์ž‘๋™ํ•œ๋‹ค๋Š” ์ด์ „์˜ ๊ด€์ธก๋“ค์„ ๋”ฐ๋ฅธ๋‹ค. Huffman trees๋Š” ๋นˆ๋„ ๋†’์€ ๋‹จ์–ด๋“ค์— ์งง์€ ์ด์ง„ ์ฝ”๋“œ๋ฅผ ํ• ๋‹นํ•˜๊ณ , ์ด๋Š” ํ‰๊ฐ€๋˜์–ด์•ผ ํ•˜๋Š” output unit์˜ ์ˆ˜๋ฅผ ๋‚ฎ์ถฐ์ค€๋‹ค. ๊ท ํ˜•์žกํžŒ ์ด์ง„ ํŠธ๋ฆฌ๋Š” ํ‰๊ฐ€๋˜์–ด์•ผ ํ•˜๋Š” log2(V)log_2(V)์˜ output์„ ์š”๊ตฌํ•˜๋Š” ๋ฐ˜๋ฉด, Hierachical softmax์— ๊ธฐ๋ฐ˜ํ•œ huffman tree๋Š” log2(Unigramโˆ’perplexity(V))log_2(Unigram-perplexity(V))์— ๋Œ€ํ•ด์„œ๋งŒ์„ ์š”๊ตฌํ•œ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด ๋‹จ์–ด ์‚ฌ์ด์ฆˆ๊ฐ€ ๋ฐฑ๋งŒ๊ฐœ์˜ ๋‹จ์–ด๋ผ๋ฉด, ์ด ๊ฒฐ๊ณผ๋Š” ํ‰๊ฐ€์— ์žˆ์–ด์„œ ์†๋„๋ฅผ ๋‘ ๋ฐฐ ๋” ๋น ๋ฅด๊ฒŒ ํ•œ๋‹ค. Nร—Dร—HNร—Dร—H ์‹์—์„œ ๊ณ„์‚ฐ์˜ ๋ณ‘๋ชฉํ˜„์ƒ์ด ์ผ์–ด๋‚˜๋Š” NNLM์—์„œ๋Š” ์ค‘์š”ํ•œ ๋ฌธ์ œ๊ฐ€ ์•„๋‹์ง€๋ผ๋„, ๋ณธ๊ณ ๋Š” hidden layer๊ฐ€ ์—†๊ณ  softmax normalization์˜ ํšจ์œจ์„ฑ์— ์ฃผ๋กœ ์˜์กดํ•˜๋Š” architectures๋ฅผ ์ œ์•ˆํ•  ๊ฒƒ์ด๋‹ค.

2. Recurrent Net Language Model (RNNLM)

The Recurrent Neural Net Language Model (RNNLM) was proposed to overcome certain limitations of the NNLM, such as the need to specify the context length (the order of the model N). Theoretically, RNNs can efficiently represent more complex patterns than shallow neural networks. The RNN model has no projection layer; only input, hidden, and output layers. What is special about this model is the recurrent matrix that connects the hidden layer to itself through time-delayed connections. This allows the recurrent model to form a kind of short-term memory: information about the past can be represented by the hidden layer state, which is updated based on the current input and the hidden layer state from the previous time step.
The complexity per training example of the RNN model is as follows.

Q = H × H + H × V

The word representations D have the same dimensionality as the hidden layer H. The term H × V can again be efficiently reduced to H × log₂(V) by using hierarchical softmax, so most of the complexity comes from H × H.
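The recurrent update behind the H × H term can be sketched in a few lines of numpy; the sizes and weights below are toy values.

```python
import numpy as np

H, V = 4, 10                            # toy hidden size and vocabulary size
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(H, V))  # input -> hidden weights
W = rng.normal(scale=0.1, size=(H, H))  # hidden -> hidden (recurrent matrix)

h = np.zeros(H)                         # short-term memory across time steps
for word_id in [3, 7, 1]:               # a toy word sequence
    x = np.eye(V)[word_id]              # 1-of-V input vector
    h = np.tanh(U @ x + W @ h)          # new state from current input + old state
    print(h)
```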

3. Parallel Training of Neural Networks

To train models on huge datasets, several models were implemented on top of a large-scale distributed framework called DistBelief, including the feedforward NNLM and the new models proposed in this paper. The framework allows running multiple replicas of the same model in parallel, and each replica synchronizes its gradient updates through a centralized server that keeps all the parameters. For this parallel training, mini-batch asynchronous gradient descent with an adaptive learning rate procedure called Adagrad is used. Under this framework, it is common to use one hundred or more model replicas, each using many CPU cores on different machines in a data center.
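For reference, this is a minimal sketch of the Adagrad idea (a per-parameter learning rate scaled by accumulated squared gradients); the gradients and hyperparameters are placeholders, not the DistBelief setup.

```python
import numpy as np

def adagrad_step(params, grads, accum, lr=0.025, eps=1e-8):
    """One Adagrad update: per-parameter rate shrinks as gradients accumulate."""
    accum += grads ** 2
    params -= lr * grads / (np.sqrt(accum) + eps)
    return params, accum

params = np.zeros(3)
accum = np.zeros(3)
for _ in range(3):
    grads = np.array([0.5, 0.1, -0.3])  # placeholder gradients
    params, accum = adagrad_step(params, grads, accum)
print(params)  # dimensions with larger past gradients move more cautiously
```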

💡 3. New Log-linear Models

์ด ์„น์…˜์—์„œ๋Š” computational complexity๋ฅผ ์ตœ์†Œํ™”ํ•˜๋ฉด์„œ distributed representation์„ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•œ ๋‘๊ฐ€์ง€ ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•œ๋‹ค. ์ด์ „ ์„น์…˜์—์„œ์˜ main observation์€ ๋Œ€๋ถ€๋ถ„์˜ ๋ณต์žก๋„๊ฐ€ non-linear hidden layer์— ์˜ํ•ด ์ƒ๊ธด๋‹ค๋Š” ๊ฒƒ์ด๋‹ค.

1. Continuous Bag-of-Words Model

The first proposed architecture is similar to the feedforward NNLM, but the non-linear hidden layer is removed and the projection layer is shared for all words. This architecture is called a bag-of-words model because the order of the words in the history does not influence the projection. Furthermore, words from the future are also used: the best performance on the task introduced in the next section is obtained by building a log-linear classifier with four future and four history words at the input, where the training criterion is to correctly classify the current (middle) word. The training complexity is as follows.

Q = N × D + D × log₂(V)

์šฐ๋ฆฌ๋Š” ์ด ๋ชจ๋ธ์„ ์•ž์œผ๋กœ CBOW๋ผ๊ณ  ๋ถ€๋ฅผ ๊ฒƒ์ด๋‹ค. ๋ณดํ†ต์˜ bag-of-word ๋ชจ๋ธ๊ณผ๋Š” ๋‹ค๋ฅด๊ฒŒ, ์ด๊ฒƒ์€ context์˜ continuous distributed representation๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค. input๊ณผ projection layer ์‚ฌ์ด์˜ ๊ฐ€์ค‘์น˜ ๋งคํŠธ๋ฆญ์Šค๋Š” NNLM๊ณผ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ๋ชจ๋“  ๋‹จ์–ด ์œ„์น˜๋ฅผ ์œ„ํ•ด ๊ณต์œ ๋œ๋‹ค.

2. Continuous Skip-gram Model

The second architecture is similar to CBOW, but instead of predicting the current word based on the context, it tries to maximize the classification of a word based on other words in the same sentence. More precisely, each current word is fed into a log-linear classifier with a continuous projection layer, and words within a certain range before and after the current word are predicted. Increasing this range improves the quality of the resulting word vectors, but it also increases the computational complexity. Since more distant words are usually less related to the current word than those close to it, distant words are given less weight by sampling them less often in the training examples. The training complexity of this architecture is proportional to the following expression.

Q = C × (D + D × log₂(V))

CC: ๋‹จ์–ด์˜ ์ตœ๋Œ€ ๊ฑฐ๋ฆฌ

💡 4. Results

  • Semantic acc : Skip-gram > CBOW > NNLM < RNNLM
  • Syntactic acc : CBOW > Skip-gram > NNLM < RNNLM
  • Total acc : Skip-gram > CBOW > NNLM

Reference

https://wikidocs.net/21692
https://dreamgonfly.github.io/blog/word2vec-explained/

0๊ฐœ์˜ ๋Œ“๊ธ€