Efficient Estimation of Word Representations in Vector Space (Review)

์ด์ƒ๋ฏผยท2023๋…„ 3์›” 20์ผ

๋…ผ๋ฌธ๋ฆฌ๋ทฐ

๋ชฉ๋ก ๋ณด๊ธฐ
1/29

Abstract

ํฐ ๋ฐ์ดํ„ฐ ์…‹์—์„œ ์—ฐ์†์ ์ธ ๋‹จ์–ด์˜ ๋ฒกํ„ฐ๋ฅผ ํ‘œํ˜„์„ ๊ณ„์‚ฐํ•˜๋Š” 2๊ฐœ์˜ ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ์ œ์•ˆํ•˜๋ คํ•จ.
์ด์ „์˜ ๊ฐ€์žฅ ์„ฑ๋Šฅ์ด ์ข‹๋˜ ๋ชจ๋ธ๋ณด๋‹ค ์ •ํ™•๋„ ์ฆ๊ฐ€, ๊ณ„์‚ฐ ๋น„์šฉ์ด ์ค„์–ด๋“ฆ์ด ์žˆ๋‹ค. (16์–ต๊ฐœ์˜ ๋†’์€ quality์˜ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ํ•˜๋ฃจ๋„ ์•ˆ๊ฑธ๋ ค์„œ ํ•™์Šต ๊ฐ€๋Šฅ)
์ด ๋ฒกํ„ฐ๋“ค์ด syntactic, semantic ๋ฒกํ„ฐ ์œ ์‚ฌ๋„๋ฅผ ์ธก์ •ํ•˜๋Š” test set์— ์ตœ์‹  ๊ธฐ์ˆ  ์„ฑ๋Šฅ์„ ์ œ๊ณตํ•จ

1. Introduction

์ง€๊ธˆ๊นŒ์ง€ NLP ์‹œ์Šคํ…œ์€ ๋‹จ์–ด๋ฅผ ์›์ž๋กœ ์—ฌ๊น€ ( ๋‹จ์–ด ์‚ฌ์ด์— ์œ ์‚ฌ์„ฑ ๊ฐœ๋…์ด ์—†์—ˆ์Œ. ๋‹จ์–ด๋ฅผ ์–ดํœ˜์ง‘์—์„œ index๋กœ ์—ฌ๊น€ )
์ด๊ฒƒ์— ๋Œ€ํ•œ ์žฅ์ ๋„ ์žˆ๋‹ค.
1. ๊ฐ„๊ฒฐ์„ฑ
2. ๊ฒฌ๊ณ ์„ฑ
3. ํฐ ๋ฐ์ดํ„ฐ๋กœ ํ•™์Šต๋œ ๊ฐ„๋‹จํ•œ ๋ชจ๋ธ์ด ์ ์€ ๋ฐ์ดํ„ฐ๋ฅผ ์ด์šฉํ•œ ๋ณต์žกํ•œ ์‹œ์Šคํ…œ๋ณด๋‹ค ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ƒ„

๊ทธ๋Ÿฌ๋‚˜ ์ด ๊ฐ„๋‹จํ•œ ๊ธฐ์ˆ ์€ ํ•œ๊ณ„์ ์ด ์žˆ์Œ
ex) ์ž๋™ ์Œ์„ฑ ์ธ์‹์„ ์œ„ํ•œ ๊ด€๋ จ ๋„๋ฉ”์ธ ๋‚ด ๋ฐ์ดํ„ฐ์˜ ์–‘์ด ์ œํ•œ์ 

๊ทธ๋ž˜์„œ ๋‹จ์ˆœํžˆ basic ๊ธฐ์ˆ ์˜ ๊ทœ๋ชจํ™•์žฅ(๋‹จ์–ด์˜ ์ฆ๊ฐ€)์œผ๋กœ๋Š” ์ค‘์š”ํ•œ ๋ฐœ์ „์„ ์ด๋Œ์–ด๋‚ผ ์ˆ˜ ์—†๋Š” ์ƒํ™ฉ๋“ค์ด ์žˆ๊ณ , ๊ทธ๋ž˜์„œ ๋” ์ง„๋ณด๋œ ๊ธฐ์ˆ ์— ์ดˆ์ ์„ ๋งž์ถฐ์•ผ ํ•จ

์ตœ๊ทผ ๋จธ์‹ ๋Ÿฌ๋‹ ๊ธฐ์ˆ ์˜ ์ง„๋ณด๋กœ, ๋” ํฐ ๋ฐ์ดํ„ฐ ์…‹์— ๋ณต์žกํ•œ ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅํ•ด์กŒ๋‹ค. ๊ฐ€์žฅ ์„ฑ๊ณต์ ์ธ concept์€ Distributed Representations of word(๋‹จ์–ด์˜ ๋ถ„์‚ฐ์  ํ‘œํ˜„)๋ฅผ ์ด์šฉํ•˜๋Š” ๊ฒƒ์ผ ๊ฒƒ์ด๋‹ค. ex) ๋‹จ์–ด ๋ชจ๋ธ ๊ธฐ๋ฐ˜ ๋‰ด๋Ÿด ๋„คํŠธ์›Œํฌ๊ฐ€ N-gram model๋ณด๋‹ค ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ƒˆ๋‹ค.
(N-gram model : ๋‹จ์–ด์˜ ์•ž์— (n-1)๊ฐœ์˜ ๋‹จ์–ด๋ฅผ ์ฐธ๊ณ ํ•˜์—ฌ ๋‹จ์–ด๋ฅผ ์œ ์ถ”ํ•˜๋Š” ๋ฐฉ๋ฒ•)
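The n-gram idea can be sketched with a tiny bigram (n = 2) counter; the corpus and counts below are made up purely for illustration.

```python
from collections import Counter, defaultdict

# Tiny bigram (n = 2) model: predict a word from the 1 preceding word.
corpus = "the cat sat on the mat the cat ran".split()
counts = defaultdict(Counter)
for prev, cur in zip(corpus, corpus[1:]):
    counts[prev][cur] += 1

# Most frequent word after "the" in this toy corpus (cat: 2, mat: 1).
print(counts["the"].most_common(1)[0][0])  # cat
```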

1-1. Goal of the Paper

๋…ผ๋ฌธ ๋ชฉํ‘œ๋Š” ์ˆ˜์‹ญ์–ต๊ฐœ์˜ ๋‹จ์–ด์˜ ํฐ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋†’์€ ํ€„๋ฆฌํ‹ฐ์˜ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ธฐ์ˆ ์„ ์†Œ๊ฐœํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ „์— ๋ชจ๋ธ์€ ๋งŽ์€ ๋‹จ์–ด์™€ 50-100 ๋‹จ์–ด ๋ฒกํ„ฐ ์ฐจ์›์„ ์„ฑ๊ณต์ ์œผ๋กœ ํ•™์Šตํ•œ๊ฒŒ ์—†์—ˆ๋‹ค

์šฐ๋ฆฌ๋Š” ์œ ์‚ฌํ•œ ๋‹จ์–ด๊ฐ€ ์„œ๋กœ ๊ฐ€๊นŒ์šด ๊ฒฝํ–ฅ์ด ์žˆ์„ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๋‹จ์–ด๊ฐ€ ์—ฌ๋Ÿฌ ์ˆ˜์ค€์˜ ์œ ์‚ฌ์„ฑ์„ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ธฐ๋Œ€๋ฅผ ๊ฐ€์ง€๊ณ  ๊ฒฐ๊ณผ ๋ฒกํ„ฐ ํ‘œํ˜„์˜ ํ’ˆ์งˆ์„ ์ธก์ •ํ•˜๊ธฐ ์œ„ํ•ด ์ตœ๊ทผ ์ œ์•ˆ๋œ ๊ธฐ์ˆ ์„ ์‚ฌ์šฉํ•œ๋‹ค (Linguistic Regularities in Continuous Space Word Representations, 2013 : ์—ฌ๋Ÿฌ ์ˆ˜์ค€์˜ ์œ ์‚ฌ์„ฑ)

์ด๊ฒƒ์€ ์–ด๋ฏธ ๋ณ€ํ™” ์–ธ์–ด์˜ ๋งฅ๋ฝ์—์„œ ์ดˆ๊ธฐ์— ๊ด€์ฐฐ๋˜์—ˆ๋‹ค - ์˜ˆ๋ฅผ ๋“ค์–ด, ๋ช…์‚ฌ๋Š” ๋ณต์ˆ˜์˜ ๋‹จ์–ด ๋์„ ๊ฐ€์งˆ ์ˆ˜ ์žˆ๊ณ , ๋งŒ์•ฝ ์šฐ๋ฆฌ๊ฐ€ ์›๋ž˜ ๋ฒกํ„ฐ ๊ณต๊ฐ„์˜ ๋ถ€๋ถ„ ๊ณต๊ฐ„์—์„œ ์œ ์‚ฌํ•œ ๋‹จ์–ด๋ฅผ ๊ฒ€์ƒ‰ํ•œ๋‹ค๋ฉด, ์œ ์‚ฌํ•œ ๋์„ ๊ฐ€์ง„ ๋‹จ์–ด๋ฅผ ์ฐพ๋Š” ๊ฒƒ์ด ๊ฐ€๋Šฅํ•˜๋‹ค
๋‹จ์–ด ํ‘œํ˜„์˜ ์œ ์‚ฌ์„ฑ์€ ๋‹จ์ˆœํ•œ ๊ตฌ๋ฌธ(syntactic) ๊ทœ์น™์„ฑ์„ ๋„˜์–ด์„ ๋‹ค๋Š” ๊ฒƒ์ด ๋ฐœ๊ฒฌ๋˜์—ˆ๋‹ค.

๋‹จ์–ด ํ‘œํ˜„์˜ ์œ ์‚ฌ์„ฑ์€ ๋‹จ์ˆœํ•œ ๋ฌธ๋ฒ• ๊ทœ์น™์„ ๋„˜์–ด์„œ๋Š”๋ฐ, ๋‹จ์–ด ๋ฒกํ„ฐ์— ๋Œ€ํ•œ ๋‹จ์ˆœํ•œ ๋Œ€์ˆ˜ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋‹จ์–ด ์˜คํ”„์…‹ ๊ธฐ์ˆ ์„ ์‚ฌ์šฉํ•˜์—ฌ "King" ๋ฒกํ„ฐ์—์„œ "Man" ๋ฒกํ„ฐ๋ฅผ ๋บ€ ํ›„ "Woman" ๋ฒกํ„ฐ๋ฅผ ๋”ํ•œ ๊ฒฐ๊ณผ๊ฐ€ "Queen" ๋ฒกํ„ฐ ํ‘œํ˜„์— ๊ฐ€์žฅ ๊ฐ€๊นŒ์šด ๋ฒกํ„ฐ์ž„์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค.

์ด ๋…ผ๋ฌธ์—์„œ๋Š” ๋‹จ์–ด์˜ ์„ ํ˜•์ ์ธ ๊ทœ์น™์„ฑ์„ ๋ณด์กดํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ชจ๋ธ ๋ฐœ์ „์„ ํ†ตํ•ด ๋ฒกํ„ฐ ์—ฐ์‚ฐ์˜ ์ •ํ™•๋„๋ฅผ ๊ทน๋Œ€ํ™” ์‹œํ‚ค๋ ค ๋…ธ๋ ฅํ•  ๊ฒƒ์ด๋‹ค. ๊ตฌ๋ฌธ ๋ฐ ์˜๋ฏธ์  ๊ทœ์น™์„ฑ์„ ๋ชจ๋‘ ์ธก์ •ํ•˜๋Š” ํฌ๊ด„์ ์ธ ์ƒˆ๋กœ์šด ํ…Œ์ŠคํŠธ ์„ธํŠธ๋ฅผ ๋””์ž์ธํ•˜๊ณ , ์ด๋Ÿฌํ•œ ๊ทœ์น™์„ฑ ์ค‘ ๋งŽ์€ ๊ฒƒ๋“ค์ด ๋†’์€ ์ •ํ™•๋„๋กœ ํ•™์Šต๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค. ๊ฒŒ๋‹ค๊ฐ€ ์–ด๋–ป๊ฒŒ ํ•™์Šต ์‹œ๊ฐ„๊ณผ ์ •ํ™•๋„๊ฐ€ ๋‹จ์–ด ๋ฒกํ„ฐ์˜ ์ฐจ์›๊ณผ ํ•™์Šต ๋ฐ์ดํ„ฐ์˜ ์–‘์— ์–ด๋–ค ์˜ํ–ฅ์„ ๋ฏธ์น˜๋Š”์ง€ ์–˜๊ธฐํ•ด๋ณผ ๊ฒƒ์ด๋‹ค.


<์š”์•ฝ>
๋‹จ์–ด์˜ ์œ ์‚ฌ์„ฑ์„ ๊ฐ€์ง€๊ณ  ๋ฒกํ„ฐ๋ฅผ ํ‘œํ˜„ํ•˜๊ณ  ์‹ถ์—ˆ๋‹ค.
์–ด๋ฏธ์˜ ๋ณ€ํ™”์™€ ๊ฐ™์€ syntatic (๊ตฌ๋ฌธ) ๊ทœ์น™์„ฑ์€ ์ดˆ๊ธฐ์— ๊ด€์ฐฐ๋˜์—ˆ์ง€๋งŒ, ๋‹จ์–ด์˜ ์œ ์‚ฌ์„ฑ์€ ๊ตฌ๋ฌธ์˜ ๊ทœ์น™์„ฑ์„ ๋„˜์–ด ์˜๋ฏธ์—์„œ๋„ ์ฐพ์„ ์ˆ˜ ์žˆ์—ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ์™• - ๋‚จ์ž + ์—ฌ์ž = ์—ฌ์™•.
์ฆ‰, ๊ตฌ๋ฌธ์  + ์˜๋ฏธ์  ๊ทœ์น™์„ฑ์„ ๋ชจ๋‘ ์ธก์ •ํ•˜๋Š” ํฌ๊ด„์ ์ธ ์ƒˆ๋กœ์šด vector representation์„ ๋””์ž์ธํ•˜๊ณ , ์ด๊ฒƒ๋“ค์ด ๋†’์€ ์ •ํ™•๋„๋กœ ํ•™์Šต๋  ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์„ ๋ณด์—ฌ์ค€๋‹ค.

2. Model Architectures

๋‹จ์–ด ๊ฐ„์— ์„ ํ˜•์  ๊ทœ์น™์„ ๋ณด์กดํ•˜๋Š” ๊ฒƒ์— LSA๋ณด๋‹ค ๋” ์ข‹๊ฒŒ ๋‚˜์˜ค๊ฑฐ ์žˆ๋Š” ๊ฒƒ์ด ์ด์ „์— ๋ณด์—ฌ์กŒ๊ธฐ์—, ์ด ๋…ผ๋ฌธ์€ ๋‹จ์–ด์˜ ๋ถ„์‚ฐ ํ‘œํ˜„์— ์ดˆ์ ์„ ๋งž์ถœ ๊ฒƒ์ด๋‹ค.

๋‹ค๋ฅธ ๋ชจ๋ธ ๊ตฌ์กฐ์™€ ๋น„๊ตํ•จ์— ์šฐ๋ฆฌ๋Š” ์ฒ˜์Œ์œผ๋กœ computational ๋ณต์žก๋„๋ฅผ ๋ชจ๋ธ์„ ์™„์ „ํžˆ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด ์—‘์„ธ์Šคํ•ด์•ผํ•˜๋Š” parameter์˜ ์ˆ˜๋กœ ์ •์˜ํ•œ๋‹ค. ๊ทธ๋ฆฌ๊ณ  ์šฐ๋ฆฌ๋Š” computational complexity๋ฅผ ์ตœ์†Œํ™”ํ•˜๋ฉด์„œ ์ •ํ™•๋„๋ฅผ ์ตœ๋Œ€ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋…ธ๋ ฅํ•  ๊ฒƒ์ด๋‹ค.

ํ•™์Šต ๋ณต์žก๋„(training complexity)๋Š”

E๋Š” ํ•™์Šต ํšŸ์ˆ˜(training epochs), T๋Š” training set์— ๋“ค์–ด์žˆ๋Š” ๋‹จ์–ด์˜ ์ˆ˜, Q๋Š” ๊ฐ ๋ชจ๋ธ architecture์— ์˜ํ•ด ์ž์„ธํžˆ ์ •์˜๋จ
์ผ๋ฐ˜์ ์œผ๋กœ E๋Š” 3~50, T๋Š” 10์–ต๊นŒ์ง€๋กœ ์„ ํƒํ•จ
๋ชจ๋“  ๋ชจ๋ธ์€ ํ™•๋ฅ ์  ๊ฒฝ์‚ฌ ํ•˜๊ฐ•๋ฒ•๊ณผ ์˜ค์ฐจ์—ญ์ „๋ฒ•์„ ์ด์šฉํ•ด ํ•™์Šต๋œ๋‹ค

2.1 Feedforward Neural Net Language Model (NNLM)

The probabilistic feedforward neural network language model was proposed in 'A neural probabilistic language model' (Journal of Machine Learning Research). It consists of input, projection, hidden, and output layers.

Q = N x D + N x D x H + H x V

The weight matrix used for input -> projection has size (V x D).
The weight matrix used for projection -> hidden has size (N*D x H).
The weight matrix used for hidden -> output has size (H x V).

์—ฌ๊ธฐ์„œ ์ฃผ๋ชฉํ•  ์ ์€ projection์€ linear ํ•˜์ง€๋งŒ, hidden์€ non-linearํ•˜๊ฒŒ ๋งŒ๋“ค์–ด์ค€๋‹ค. ๊ทธ๋ฆผ์—์„  ๊ทธ๋ƒฅ ๊ฐ€์ค‘์น˜์™€ ํฌ๊ธฐ์— ๋Œ€ํ•ด์„œ๋งŒ ๋‚˜ํƒ€๋‚ด๋ ค๊ณ  ์ด๋ ‡๊ฒŒ ์ ์€ ๊ฒƒ์ด๊ณ  hidden์œผ๋กœ ๊ฐˆ ๋•Œ๋Š” ํ™œ์„ฑํ™” ํ•จ์ˆ˜๋„ ๊ณ„์‚ฐํ•ด ์ฃผ์–ด์•ผ ํ•œ๋‹ค.

Q์—์„œ VxP๊ฐ€ ์•„๋‹ˆ๋ผ NxP์ธ ์ด์œ ๋Š” ์‚ฌ์‹ค input์€ one-hot vector๋กœ ์ด๋ฃจ์–ด์ ธ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ€์ค‘์น˜ ๋ฐฑํ„ฐ์˜ ํ–‰์„ ๊ฐ€์ ธ์˜จ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค. ๊ฒฐ๊ตญ ์—ด์˜ ๊ฐœ์ˆ˜์ธ D๊ฐœ๋ฅผ ๊บผ๋‚ด์™€์•ผํ•˜๋‹ˆ๊นŒ D๋งŒํผ ์—ฐ์‚ฐ์„ ์ˆ˜ํ–‰ํ•  ๊ฒƒ์ด๊ณ , ๋‹จ์–ด๊ฐ€ N๊ฐœ๋‹ˆ๊นŒ ๋ณต์žก๋„๋Š” N x D ๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

์—ฌ๊ธฐ์„œ ๊ณ„์‚ฐ์„ ์ค„์ด๋Š” ๋ฒ•์€ hierarchical softmax์„ ์‚ฌ์šฉํ•˜๋ฉด ๋œ๋‹ค.
๋“ฑ์žฅํ•˜๋Š” ๋นˆ๋„์™€ ๊ด€๊ณ„์žˆ๊ฒŒ ์ด์ง„ ํŠธ๋ฆฌ์—์„œ level์„ ์„ค์ •ํ•ด์ฃผ๋Š” Huffman Tree๋ฅผ ์ด์šฉ.
๋‹จ์–ด๋“ค์„ leaf๋ฅผ ๋‘๊ณ  ๊ณ„์‚ฐ

์ถœ์ฒ˜ : https://uponthesky.tistory.com/15

"asinine"๊ณผ "cost"์˜ ๊ฐ’์„ ๊ณ„์‚ฐํ•˜๋ ค๋ฉด

  • ์šฐ์„  ์˜ค๋ฅธ์ชฝ์ผ์ง€ ์™ผ์ชฝ์ผ์ง€ ๋ฐฉํ–ฅ์„ ์ •ํ•œ๋‹ค. (์˜ค๋ฅธ์ชฝ์ด๋ผ ํ•˜์ž ์—ฌ๊ธฐ์„ )
  1. "asinine"๊ณผ row6 ๋‚ด์  ํ›„ sigmoid ํ•จ์ˆ˜ ์ด์šฉ
    ๊ฐ’์ด 0.24๋ผ๊ณ  ํ•˜๋ฉด row6์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ๊ฐˆ ํ™•๋ฅ ์ด 0.24, sigmoid ํ•จ์ˆ˜์˜ ํŠน์ง•์— ๋”ฐ๋ผ ์™ผ์ชฝ์œผ๋กœ ๊ฐˆ ํ™•๋ฅ ์€ 1 - 0.24 =0.76 ์ด ๋œ๋‹ค.

  2. "asinine"๊ณผ row4 ๋‚ด์  ํ›„ sigmoid ํ•จ์ˆ˜ ์ด์šฉ
    ๊ณ„์‚ฐํ•œ ๊ฐ’์ด 0.43์ด๋ผ ํ•˜๋ฉด row4์—์„œ ์™ผ์ชฝ์œผ๋กœ ๊ฐˆ ํ™•๋ฅ ์€ 1 - 0.43 = 0.57

  3. "asinine"๊ณผ row3 ๋‚ด์  ํ›„ sigmoid ํ•จ์ˆ˜ ์ด์šฉ
    ๊ณ„์‚ฐํ•œ ๊ฐ’์ด 0.68์ด๋ผ๋ฉด row3์—์„œ ์˜ค๋ฅธ์ชฝ์œผ๋กœ ๊ฐˆ ํ™•๋ฅ ์€ 0.68

์ฆ‰ : "asinine"์„ ์ž…๋ ฅํ–ˆ์„ ๋•Œ "cost"๋ฅผ ์ถœ๋ ฅํ•  ํ™•๋ฅ ์€ 0.76 0.57 0.68
์ด๋Ÿฐ ์‹์œผ๋กœ ๊ณ„์‚ฐ ์‹์„ ์ค„์—ฌ์ค€๋‹ค ์ด์ง„ ํŠธ๋ฆฌ๋กœ ๋งŒ๋“ค์–ด์ฃผ์—ˆ๊ธฐ ๋•Œ๋ฌธ์— log2(V) ๋กœ ์ค„์–ด๋“ ๋‹ค

2-2. Recurrent Neural Net Language Model (RNNLM)

The RNNLM was devised to overcome the limitation of having to specify a fixed window of N words.
Also, in theory, RNN models can represent more complex patterns efficiently than shallow neural networks.

Layer์˜ ๊ตฌ์„ฑ์€ {input layer, output layer, hidden layer} ์ด ์žˆ๋‹ค. NNLM๊ณผ ๋‹ฌ๋ฆฌ projection layer๊ฐ€ ์—†๋‹ค

์—ฌ๊ธฐ์„œ special type์€ time delay connection์„ ์ด์šฉํ•ด ์ž๊ธฐ ์ž์‹ ๊ณผ hidden layer๋ฅผ ์ด์–ด์ฃผ๋Š” recurrent matrix๋‹ค.

๊ณผ๊ฑฐ๋กœ๋ถ€ํ„ฐ์˜ ์ •๋ณด๊ฐ€ ์ด์ „์˜ hidden layer state์™€ ํ˜„์žฌ input์„ ๊ธฐ๋ฐ˜์œผ๋กœ update๋˜๋Š” hidden state๋กœ ํ‘œํ˜„๋  ์ˆ˜ ์žˆ๋‹ค

Q = H x H + H x V

hidden(t-1) -> hidden(t): computational complexity H x H
hidden -> output: computational complexity H x V

(Going input -> hidden adds essentially nothing: since the input is a one-hot vector, it amounts to picking one row of the input-to-hidden weight matrix, with no matrix multiplication.)
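A single recurrent step can be sketched as follows (a toy sketch with invented sizes; the paper gives no code). The input lookup is a row selection, so the H x H recurrent multiply and the H x V output layer dominate, matching Q = H x H + H x V.

```python
import numpy as np

V, H = 5, 3                       # toy vocabulary and hidden sizes
rng = np.random.default_rng(2)
W_xh = rng.normal(size=(V, H))    # input -> hidden (one-hot: row lookup)
W_hh = rng.normal(size=(H, H))    # recurrent matrix (H x H cost per step)
W_hy = rng.normal(size=(H, V))    # hidden -> output (H x V cost per step)

h = np.zeros(H)
for word in [0, 3, 1]:            # a toy sequence of word indices
    # Hidden state depends on the current input and the previous state.
    h = np.tanh(W_xh[word] + h @ W_hh)

scores = h @ W_hy                 # unnormalized scores over the vocabulary
print(scores.shape)  # (5,)
```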

3. New Log-Linear Models

Two models are proposed to lower the computational complexity.
In the previous models, most of the complexity came from the non-linear hidden layer.
The non-linear hidden layer is what makes neural networks attractive, but here we experiment with simpler models that might not represent the data as precisely as neural networks, yet can be trained much more efficiently.

3-1. CBOW Model (Continuous Bag-of-Words Model)

A model that predicts the target word from the context (the surrounding words).

NNLM๊ณผ ๋น„์Šทํ•˜๋‹ค
non - linear hidden layer๊ฐ€ ์ œ๊ฑฐ๋˜๊ณ  projection layer๊ฐ€ ๋ชจ๋“  ๋‹จ์–ด์— ๊ณต์œ ๋œ๋‹ค. ๊ทธ๋ž˜์„œ ๋ชจ๋“  ๋‹จ์–ด๊ฐ€ ๊ฐ™์€ ์œ„์น˜์— ์˜์‚ฌ๋œ๋‹ค. ( ๋ฒกํ„ฐ ํ‰๊ท ํ™” )
standard bag-of-words model๊ณผ ๋‹ฌ๋ฆฌ ๋ฐฐ๊ฒฝ์˜ ์—ฐ์†์ ์ธ ๋ถ„๋ฐฐ๋œ ํ‘œํ˜„์„ ์ด์šฉํ•จ (Continuous distributed representation of the context)
=> standard bag-of-words model์€ ๋ชจ๋“  ๋‹จ์–ด๋ฅผ bag์— ๋‹ด์•„์„œ context์— ๋Œ€ํ•œ ๊ฐœ๋…์ด ์—†์ง€๋งŒ, CBOW๋Š” window ๊ฐœ์ˆ˜๋ฅผ ์ •ํ•ด๋‘ ์œผ๋กœ์จ ๋ฐฐ๊ฒฝ์˜ ๋ถ„๋ฐฐ๋ฅผ ์ด์šฉํ•จ

์ค‘์š”ํ•œ ๊ฒƒ์€ Input Layer์™€ Projection Layer ์‚ฌ์ด์˜ ๊ฐ€์ค‘์น˜ matrix๋Š” NNLM๊ณผ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ๋ชจ๋“  ๋‹จ์–ด์— ๊ณต์œ (shared)๋œ๋‹ค

Loss Function์€ Cross Entropy๋ฅผ ์ด์šฉํ•ด ํ•™์Šต์„ ํ•œ๋‹ค

๊ณ„์‚ฐ ๋ณต์žก๋„
Q = N x D + D x Log2(V)
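A minimal CBOW forward pass might look like this: a sketch with made-up sizes, using a plain softmax over the vocabulary instead of the hierarchical softmax that gives the log2(V) term.

```python
import numpy as np

V, D, N = 6, 4, 2                 # toy vocab size, vector dim, context words
rng = np.random.default_rng(1)
W_in = rng.normal(size=(V, D))    # shared input -> projection matrix
W_out = rng.normal(size=(D, V))   # projection -> output matrix

context = [1, 3]                  # indices of the surrounding words
h = W_in[context].mean(axis=0)    # average the context vectors (N x D cost)

scores = h @ W_out
probs = np.exp(scores - scores.max())
probs /= probs.sum()              # softmax over the vocabulary
print(int(probs.argmax()))        # index of the predicted target word
```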


์—ฌ๊ธฐ์„œ ๋ง‰๊ฐ„ !
Word Embedding ์ด๋ž€ ?
ํ˜น์‹œ ๋ชจ๋ฅผ๊นŒ๋ด ๋งํ•˜๋Š” ๊ฒƒ์ด์ง€๋งŒ..
one-hot vector ๊ฐ™์ด sparse vector์„ dense vector๋กœ ๋ณ€ํ™˜ํ•ด์ฃผ๋Š” ๊ณผ์ • !


3-2. Skip-Gram Model

A model that predicts the context words from the center word.

It is known to perform better than the CBOW model, reportedly because, depending on the window size, each word often gets trained repeatedly as an output.
=> When training on 'I am a boy you are a girl' with a window of 2, 'am' is trained when 'I' is the center word and again when 'a' and 'boy' are the center words; this repeated training appears to be why the performance is better.
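The repeated training in that example can be made concrete by generating the (center, context) pairs for the sentence with window size 2:

```python
sentence = "I am a boy you are a girl".split()
window = 2

# Each (center, context) pair is one training example for skip-gram.
pairs = []
for i, center in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    for j in range(lo, hi):
        if j != i:
            pairs.append((center, sentence[j]))

# "am" appears as the context (output) of three different center words.
print([p for p in pairs if p[1] == "am"])  # [('I', 'am'), ('a', 'am'), ('boy', 'am')]
```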

๊ณ„์‚ฐ ๋ณต์žก๋„
Q = C x (D + D x log2(V))

D : input -> hidden
D x Log2(V) : hidden -> output

์—ฌ๊ธฐ์„œ C๋Š” maximum distance of the words

Conclusion

๊ตฌ๋ฌธ์ (syntactic), ์˜๋ฏธ์ (semantic)์„ ์ด์šฉํ•œ ๋ชจ๋ธ์„ ๊ณต๋ถ€ํ–ˆ๋‹ค.
์œ ๋ช…ํ•œ neural network์™€ ๋น„๊ต๋˜๊ฒŒ ๊ฐ„๋‹จํ•œ ๋ชจ๋ธ ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง€๊ณ  ๋†’์€ quality์˜ word vector์„ ํ•™์Šตํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.
๊ณ„์‚ฐ ๋ณต์žก๋„๊ฐ€ ์ค„์–ด๋“ค๋ฉด์„œ, ๋” ํฐ ๋ฐ์ดํ„ฐ์…‹์—์„œ ๋” ์ •ํ™•ํ•œ high - dimensional word vector์„ ๊ณ„์‚ฐํ•  ์ˆ˜ ์žˆ์—ˆ๋‹ค.


๊ณต๋ถ€๋ฅผ ํ•œ ๋‚ด์šฉ์ด๋ฏ€๋กœ ์ •ํ™•ํ•˜์ง€ ์•Š์„ ์ˆ˜ ์žˆ๊ณ , ํ˜น์‹œ ์ˆ˜์ •๋  ๋ถ€๋ถ„์ด ์žˆ๋‹ค๋ฉด ๋งํ•ด์ฃผ์„ธ์š”~
