🔥 Paper Review - word2vec (Efficient Estimation of Word Representations in Vector Space)

esc247 · September 4, 2023

Abstract

  • Proposes two model architectures for computing continuous vector representations of words from very large data sets
    • CBOW & Skip-gram
  • Performance is measured on a word similarity task
  • Better accuracy at much lower computational cost

Introduction

  • Traditional models treat each word as an independent atomic unit

    • Cannot represent similarity between words
    • Advantages
      • Simplicity
      • Robustness
      • A simple model trained on a lot of data beats a complex model trained on little data
      • N-Gram Model
        • Mitigates the sparsity problem, a limitation of statistical language models
          • Sparsity arises because, in a very large vocabulary, word frequencies are highly uneven and most words appear only rarely in limited data
        • When predicting the next word, only a fixed window of N words is considered instead of the full history → N consecutive words are treated as one token
        • Limitations
          • Lower accuracy than a language model that considers all previous words
          • The sparsity problem still remains
          • Trade-off depending on the choice of N
  • As ML techniques advanced, training complex models on large amounts of data became feasible → approaches that were previously impossible can now be attempted

  • In particular, Distributed Representations become possible

    • A representation method built on the assumption that words appearing in similar contexts have similar meanings
    • Similarity between word vectors can be computed
  • Word vectors capture not only the similarity of nearby words but also multiple degrees of similarity

  • Syntactic regularities are also captured

    • King - Man + Woman = Queen (see the vector-arithmetic sketch after this list)
  • This paper develops model architectures that preserve linear regularities between words, designs a test set for measuring syntactic and semantic regularities, and shows that these regularities can be learned with high accuracy.

  • It also discusses how vector dimensionality and the amount of training data affect training time and accuracy.
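
A minimal sketch of the King - Man + Woman ≈ Queen idea using cosine similarity. The 3-dimensional vectors below are made up purely for illustration; real word2vec vectors are learned from large corpora and have hundreds of dimensions.

```python
import numpy as np

# Hypothetical word vectors, hand-picked for illustration only.
vectors = {
    "king":  np.array([0.80, 0.65, 0.10]),
    "queen": np.array([0.78, 0.10, 0.68]),
    "man":   np.array([0.20, 0.70, 0.05]),
    "woman": np.array([0.18, 0.12, 0.60]),
    "apple": np.array([0.05, 0.90, 0.90]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# King - Man + Woman should land closest to Queen.
target = vectors["king"] - vectors["man"] + vectors["woman"]
candidates = (w for w in vectors if w not in {"king", "man", "woman"})
print(max(candidates, key=lambda w: cosine(target, vectors[w])))  # queen
```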

Previous Work

  • NNLM
    • Feedforward Neural Network = linear projection layer + non-linear hidden layer
    • Word vectors are first learned using a single hidden layer, and are then used to train the NNLM
      • That is, the word vectors can be learned without constructing the full NNLM
    • This paper extends that architecture and focuses on the first step, where word vectors are learned with a simple model

Model Architecture

  • Focuses on learning distributed representations of words with neural networks.
  • Outperforms Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA)
  • How training complexity is measured
    • O = E (number of epochs) X T (number of words in the training set) X Q (defined per model architecture)
      • E = 3 ~ 50, T = up to 1B
      • All models are trained with SGD

Feedforward Neural Net Language Model (NNLM)

  • input, projection, hidden and output layers

  • Input Layer : the previous N words are one-hot encoded (vocabulary size V)

  • The input layer is projected onto a projection layer P with dimensionality N X D

  • The N X D projection then passes through a D X H hidden layer, producing an N X H output

  • Q = N X D + N X D X H + H X V

  • Using Hierarchical Softmax, H X V → H X log(V)

    • Uses a Huffman binary tree
    • Frequent words get short binary codes ⇒ the more frequent a word, the closer it sits to the root (see the Huffman sketch after this section)
    • Negative Sampling is another alternative.
  • Therefore N X D X H determines the cost.

  • Limitation

    • To predict the next word, only a fixed number of N preceding words can be used rather than the full history

Recurrent Neural Net Language Model (RNNLM)

  • Using an RNN, complex patterns can be represented with a shallower structure
  • No projection layer; only input, hidden, and output layers exist
  • Time-delayed connections link the hidden layer back to itself ⇒ a form of short-term memory
  • Q = H X H + H X V
    • D X H == H X H : the word representation dimensionality D has the same size as the hidden layer H.
    • Again H X V → H X log(V) is possible ⇒ complexity is determined by H X H (see the numeric sketch below)
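
To see why those terms dominate, the small sketch below plugs assumed hyperparameters into the two Q formulas (N, D, H, V here are illustrative choices, not the paper's exact settings), using the hierarchical-softmax output cost H X log2(V).

```python
import math

# Assumed hyperparameters, for illustration only (not the paper's settings).
N, D, H, V = 10, 500, 500, 1_000_000

# Per-example complexity Q, with hierarchical softmax (H*V -> H*log2(V)).
q_nnlm  = N * D + N * D * H + H * math.log2(V)
q_rnnlm = H * H + H * math.log2(V)

print(f"NNLM : Q ~ {q_nnlm:,.0f}  (N*D*H term = {N * D * H:,})")
print(f"RNNLM: Q ~ {q_rnnlm:,.0f}  (H*H term   = {H * H:,})")
# The N*D*H (2.5M) and H*H (250K) terms dwarf H*log2(V) (~10K), which is
# why they determine each model's training cost.
```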

Parallel Training of Neural Networks

  • Uses DistBelief
    • run multiple replicas of the same model in parallel
    • Can roughly be thought of as data-parallel distributed training (toy sketch below)

New Log-linear Models

  • In the previous architectures, most of the complexity comes from the non-linear hidden layer

CBOW (Continuous Bag-of-Words Model)

  • Predicts the middle word from the surrounding context words

  • The non-linear hidden layer is removed

  • All words share the projection layer → every word is projected with the same matrix

  • Because the order of words has no effect, it is called a bag-of-words model

  • The word being predicted = center word; the words used for prediction = context words (a minimal forward-pass sketch follows below)

  • Q = N X D + D X log(V)
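
A minimal CBOW forward pass, assuming a tiny vocabulary and a plain softmax in place of the paper's hierarchical softmax; all names and values here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]
V, D = len(vocab), 8
W_in  = rng.normal(scale=0.1, size=(V, D))   # shared projection (input vectors)
W_out = rng.normal(scale=0.1, size=(D, V))   # output weights

def cbow_probs(context_ids):
    # All context words are projected with the same matrix and averaged,
    # so word order inside the window does not matter ("bag of words").
    h = W_in[context_ids].mean(axis=0)        # D-dimensional hidden state
    scores = h @ W_out                        # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                    # softmax over the vocabulary

# Context "the", "sat", "on", "mat" -> probability of each candidate center word.
context = [vocab.index(w) for w in ["the", "sat", "on", "mat"]]
probs = cbow_probs(context)
print(vocab[int(probs.argmax())], float(probs.max()))
```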

Skip-Gram

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c,\, j \neq 0} \log p(w_{t+j} \mid w_t)$$

  • Predicts the surrounding words from the center word (a sketch of this objective on a toy corpus follows after this list)
  • Q = C (Max Distance) X ( D + D X log(V) )
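
A sketch that evaluates the objective above on a toy corpus with randomly initialized vectors, using a full softmax for log p(w_{t+j} | w_t) instead of hierarchical softmax; everything here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat".split()
vocab = sorted(set(corpus))
V, D, c = len(vocab), 8, 2                    # c = max context distance
W_in  = rng.normal(scale=0.1, size=(V, D))    # center-word (input) vectors
W_out = rng.normal(scale=0.1, size=(D, V))    # output weights
ids = [vocab.index(w) for w in corpus]

def log_p(out_id, center_id):
    # log p(w_out | w_center) with a plain softmax over the tiny vocabulary.
    scores = W_in[center_id] @ W_out
    scores -= scores.max()
    return scores[out_id] - np.log(np.exp(scores).sum())

# (1/T) * sum_t sum_{-c<=j<=c, j!=0} log p(w_{t+j} | w_t)
T = len(ids)
objective = sum(log_p(ids[t + j], ids[t])
                for t in range(T)
                for j in range(-c, c + 1)
                if j != 0 and 0 <= t + j < T) / T
print(objective)  # average log-probability of context words given center words
```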

Result

Examples of the Learned Relationships

Conclusion

  • Presents research results on vector representations of words
  • High-quality word vectors can be learned from very simple model architectures