๐Ÿ“ Week 4: Tokenization, Word Embedding

oceann · Aug 30, 2024

💻 Naver Boostcamp AI Tech 7th Cohort NLP


์ด์ƒํ•˜๊ฒŒ ์ด ๋ถ€๋ถ„์€ ๋ณผ ๋•Œ๋งˆ๋‹ค ์žฌ๋ฏธ๊ฐ€ ์—†๋‹ค. ํ•˜์ง€๋งŒ ๊ฒฐ๊ตญ ํ•˜๊ฒŒ ๋œ๋‹ค. ์•„๋Š” ๋งŒํผ ๋ณด์ธ๋‹ค..
๋ฐ€๋ ธ์ง€๋งŒ ์•„๋ฌดํŠผ ์ •๋ฆฌ ๋.. ์ด์ œ์•ผ Tokenization๊ณผ Word Embedding์˜ ๊ฐœ๋…์„ ํ™•์‹คํžˆ ์žก์€ ๊ฒƒ ๊ฐ™๋‹ค.
์•ž์œผ๋กœ๋Š” ์‚ฌ๋‹ด์„ ์ค„์ด๊ณ  ์ข€ ๋” ์ฒด๊ณ„์ ์ธ ๊ธฐ๋ก์„ ๋‚จ๊ธฐ๋ ค๊ณ  ํ•œ๋‹ค. ๋„์ „!


Tokenization

Concept

Tokenization (tokenizing) refers to splitting a given text into token units. A token is the single unit that an NLP model can process at each timestep. The shape of a token varies with the situation, but tokens are usually split along units of meaning.

Classification by Level

Word-level Tokenization
Thinking of English, splitting by word generally means splitting on whitespace.
I love AI → ['I', 'love', 'AI']
Korean has particles, roots, affixes, and so on, so words are often split by morpheme instead.
나는 인공지능이 좋다 ("I like AI") → ['나', '는', '인공지능', '이', '좋다']

However, when a word is missing from the predefined vocab, it is mapped to the UNK token; this is the Out-of-Vocabulary (OOV) problem.

Character-level Tokenization
token์„ ์ฒ ์ž ๋‹จ์œ„๋กœ ๊ตฌ๋ถ„ํ•œ๋‹ค. ๋‹ค๋ฅธ ์–ธ์–ด๋ผ๋„ ๊ฐ™์€ ์ฒ ์ž๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒฝ์šฐ์— ์ฒ ์ž ๋‹จ์œ„ token์œผ๋กœ ์ฒ˜๋ฆฌ๊ฐ€ ๊ฐ€๋Šฅํ•ด์ง„๋‹ค.
๋˜ํ•œ ๋ชจ๋“  ์ฒ ์ž๋ฅผ ๋“ฑ๋กํ•ด๋†“์œผ๋ฉด, ๋“ฑ์žฅํ•˜์ง€ ์•Š์€ ๋‹จ์–ด๋ผ๋„ ์ฒ ์ž๋“ค์˜ ์กฐํ•ฉ์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์„ ๊ฒƒ์ด๊ธฐ ๋•Œ๋ฌธ์— OOV ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•˜์ง€ ์•Š๋Š”๋‹ค.

ํ•˜์ง€๋งŒ ์ฃผ์–ด์ง„ text์— ๋Œ€ํ•œ token์˜ ์ˆ˜๊ฐ€ ์ง€๋‚˜์น˜๊ฒŒ ๋งŽ์•„์ง„๋‹ค.
์ฒ ์ž๋งŒ์œผ๋กœ ์˜๋ฏธ๋ฅผ ๊ฐ€์งˆ ์ˆ˜ ์—†๊ธฐ ๋•Œ๋ฌธ์— ํ•ด๋‹น ๋ฐฉ์‹์œผ๋กœ tokenization์„ ์ˆ˜ํ–‰ํ•  ๊ฒฝ์šฐ, ๋ชจ๋ธ์ด ๋‚ฎ์€ ์„ฑ๋Šฅ์„ ๋ณด์ธ๋‹ค.

Subword-level Tokenization
A subword is a unit smaller than a single word that still carries meaning.
For example, the word preprocessing can be split into pre-, process, and -ing.
But since it is impossible to register every subword in existence ahead of time, there are various methods for performing the tokenization; the subword units are determined differently by methodology, e.g., BPE (Byte-Pair Encoding), WordPiece, and SentencePiece.

Compared to character-level tokenization, subword tokenization uses fewer tokens on average and has no OOV problem, resolving the drawbacks of the two previous approaches. And because the splits follow units of meaning, the model can understand the semantics, giving higher accuracy.

Byte-Pair Encoding

This is the representative example of subword-level tokenization. Let's walk through the procedure with an example.

Suppose we have the word list ['low', 'lower', 'newest', 'widest'].

  1. Build the initial subword list from individual characters.
    ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd']

  2. Among pairs of adjacent symbols, add the most frequent pair as a new token.
    Here lo appears twice, tied with a few other pairs for the highest count (ties are typically broken by order of appearance). We add this pair to the subword list as a token.
    ['l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', 'lo']

  3. Repeat the process above, treating the added pair as a single symbol.
    The stopping condition can be a fixed number of loop iterations, or the subword list reaching a set size.
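
To make the merge loop concrete, here is a minimal BPE sketch in Python (a toy illustration; the helper names are my own, not from any library):

from collections import Counter

def get_pair_counts(words):
    # Count adjacent-symbol pairs, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Rewrite every word, replacing the chosen pair with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Every word starts as a tuple of characters, each with frequency 1 here.
words = {tuple(w): 1 for w in ['low', 'lower', 'newest', 'widest']}
for step in range(3):  # stopping condition: a fixed number of merges
    best_pair, _ = get_pair_counts(words).most_common(1)[0]
    words = merge_pair(words, best_pair)
    print(step, best_pair, list(words))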

WordPiece

This is similar to Byte-Pair Encoding, except that when choosing which pair of adjacent tokens to merge, it uses a likelihood-based score instead of the raw frequency. It is an algorithm from Google, used to train BERT; since the code was never released, most sources only describe it theoretically.

์—ฌ๊ธฐ์„œ๋„ ['low', 'lower', 'newest', 'widest']๋ผ๋Š” ๋‹จ์–ด ๋ชฉ๋ก์ด ์žˆ๋‹ค๊ณ  ํ•˜์ž.

  1. Build the initial subword list from individual characters.
    ['l', '##o', '##w', '##e', '##r', 'n', '##s', '##t', 'w', '##i', '##d']
    The difference from BPE here is that any character that does not start a word is marked with a ## prefix.
  2. Unlike BPE, the pair to add to the subword list is chosen not by how often the adjacent pair appears, but by a likelihood-based score:
    $\text{score} = \frac{\text{pair\_freq}}{\text{first\_element\_freq} \times \text{second\_element\_freq}}$
    This way, no matter how often a pair appears, it is assigned a low score if its individual elements are even more frequent.
  3. Repeat the process above until the vocabulary reaches a certain size, or for a specified number of iterations.

The WordPiece algorithm differs from BPE in how the score is computed. Looking at the formula alone, I get the feeling that a base form like low would be assigned a higher score than the comparative and superlative forms lower and lowest, but I'm not entirely sure..
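
To see the effect of the denominator, here is a tiny sketch (the numbers are made up purely for illustration):

def wordpiece_score(pair_freq, first_freq, second_freq):
    # score = pair_freq / (first_element_freq * second_element_freq)
    return pair_freq / (first_freq * second_freq)

# A frequent pair built from very frequent elements...
print(wordpiece_score(pair_freq=20, first_freq=50, second_freq=40))  # 0.01
# ...can score lower than a rarer pair whose elements are themselves rare.
print(wordpiece_score(pair_freq=5, first_freq=6, second_freq=7))     # ~0.119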

Unigram

For ease of calculation, we'll use the following corpus as the example here.
{"hugs bun": 4, "hugs pug": 1, "hug pug pun": 4, "hug pun": 6, "pun": 2}
The keys are example sentences and the values are their frequencies.

  1. Split the words on whitespace. You can think of this as performing Word-level Tokenization, but these words are not used as tokens themselves.
    {"hug": 10, "pug": 5, "pun": 12, "bun":4, "hugs":5}

  2. Add the substrings of each word to the vocab.
    For example, for hug, all of h, u, g, hu, ug, and hug become tokens.
    ['h', 'u', 'g', 'p', 'n', 's', 'b', 'hu', 'ug', 'pu', 'un', 'bu', 'gs', 'hug', 'pug', 'pun', 'bun', 'ugs', 'hugs']

  3. Count the frequency of each token. (Following the Hugging Face example, the whole-word tokens pug, pun, bun, and hugs are dropped here, keeping the characters and the more frequent substrings.)
    {'h': 15, 'u': 36, 'g': 20, 'hu': 15, 'ug': 20, 'p': 17, 'pu': 17, 'n': 16, 'un': 16, 'b': 4, 'bu': 4, 's': 5, 'hug': 15, 'gs': 5, 'ugs': 5}

  4. Compute the probability of generating each word by combining tokens. The tokens' probabilities are assumed independent, so a word's probability is the product of its tokens' probabilities.
    For example, the probabilities of all possible combinations that produce pug are:
    ['p', 'u', 'g'] = 0.000389, ['pu', 'g'] = 0.0022676, ['p', 'ug'] = 0.0022676.
    The conclusion is that pug should be tokenized as pu g or as p ug.
    Computing the probability for every word gives:
    'hug': ['hug'] (prob 0.071428)
    'pug': ['pu', 'g'] (prob 0.002267)
    'pun': ['pu', 'n'] (prob 0.006168)
    'bun': ['bu', 'n'] (prob 0.001451)
    'hugs': ['hug', 's'] (prob 0.001701)

  5. Use each word's probability to compute the current score, using the negative log likelihood:

    $\text{score} = \sum \text{word\_freq} \times (-\log(\text{prob}))$

    So the current score is $10 \times (-\log(0.071428)) + 5 \times (-\log(0.002267)) + 12 \times (-\log(0.006168)) + 4 \times (-\log(0.001451)) + 5 \times (-\log(0.001701)) = 169.8$.

  6. ํ˜„์žฌ์˜ vocab์—์„œ ํ•„์š”ํ•˜์ง€ ์•Š์€ p%์˜ token์„ ์ œ๊ฑฐํ•œ๋‹ค. token์„ ์ œ๊ฑฐํ•˜๋Š” ๊ธฐ์ค€์€, ํ•ด๋‹น token์ด ์ œ๊ฑฐ๋˜์—ˆ์„ ๋•Œ ํ˜„์žฌ์˜ score์—์„œ ๋ฐœ์ƒํ•˜๋Š” loss๊ฐ€ ์ตœ์†Œํ™”๋˜๋Š” token์ด๋‹ค.
    ์˜ˆ๋ฅผ ๋“ค์–ด, 4๋ฒˆ์—์„œ p ug๋กœ tokenization์„ ์ˆ˜ํ–‰ํ•ด๋„ ๊ฐ™์€ score๋ฅผ ๊ฐ–๊ฒŒ ๋˜๊ธฐ ๋•Œ๋ฌธ์— ํ˜„์žฌ์˜ vocab์—์„œ pu๋Š” ์‚ญ์ œ๋˜์–ด๋„ ๊ดœ์ฐฎ๋‹ค๊ณ  ํŒ๋‹จํ•œ๋‹ค.
    ์ด๋•Œ p%์˜ p๋Š” ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋กœ, ์ ์ ˆํžˆ ์„ค์ •ํ•œ๋‹ค.

  7. Repeat this process until the vocab reaches the desired size, or for the desired number of iterations.

๊ณผ์ •์ด ๊ธธ๊ณ  ๋ณต์žกํ•˜๊ธฐ์— ์žฅ๋‹จ์ ์„ ์‚ดํŽด๋ณด์ž.
pros ๋ชจ๋“  substring์„ ์ „๋ถ€ ๊ณ ๋ คํ•œ vocab์œผ๋กœ๋ถ€ํ„ฐ ์‹œ์ž‘ํ•˜๋Š” ๋ฐฉ์‹์ด๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ€๋Šฅํ•œ token๋“ค์„ ๋ชจ๋‘ ์กฐํ•ฉํ•  ์ˆ˜ ์žˆ๋‹ค.
cons ๋ชจ๋“  ์กฐํ•ฉ์„ ๋‹ค ๊ณ ๋ คํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ณ„์‚ฐ ๊ณผ์ •์ด ๋งŽ๊ณ , ์‹œ๊ฐ„์ด ์˜ค๋ž˜ ๊ฑธ๋ฆฐ๋‹ค. ๋˜ํ•œ ๊ฐ token์„ ์กฐํ•ฉํ•˜์—ฌ ๋‹จ์–ด๋ฅผ ์ƒ์„ฑํ•  ๋•Œ token๋“ค์ด ๋…๋ฆฝ์ด๋ผ๊ณ  ๊ฐ€์ •ํ•˜๋Š”๋ฐ, ์ด๋Š” ํ˜„์‹ค ์„ธ๊ณ„์™€ ๋ถ€ํ•ฉํ•˜์ง€ ์•Š๋Š”๋‹ค. (Ex. ์˜์–ด์—์„œ๋Š” ๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ q ๋’ค์— u๊ฐ€ ์™€์•ผ ํ•จ)
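
To ground steps 4 and 5, here is a brute-force sketch that scores every segmentation of a word under the step-3 frequencies (real implementations use the Viterbi algorithm instead of full enumeration; this is just for illustration):

from math import prod

# Token frequencies from step 3; each probability is freq / total.
freqs = {'h': 15, 'u': 36, 'g': 20, 'hu': 15, 'ug': 20, 'p': 17, 'pu': 17,
         'n': 16, 'un': 16, 'b': 4, 'bu': 4, 's': 5, 'hug': 15, 'gs': 5, 'ugs': 5}
total = sum(freqs.values())  # 210
prob = {tok: f / total for tok, f in freqs.items()}

def segmentations(word):
    # Enumerate every split of `word` into in-vocab tokens.
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        if word[:i] in prob:
            for rest in segmentations(word[i:]):
                yield [word[:i]] + rest

def best_split(word):
    # Tokens are assumed independent (step 4), so a segmentation's
    # probability is the product of its tokens' probabilities.
    return max(segmentations(word), key=lambda seg: prod(prob[t] for t in seg))

print(best_split('pug'))  # ['p', 'ug'] (tied with ['pu', 'g'])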

SentencePiece

๋น„์ง€๋„ํ•™์Šต ๋ฐฉ์‹์œผ๋กœ ์ˆ˜ํ–‰๋˜๋Š” tokenizer์ด๋‹ค. BPE ๋ฐฉ์‹๊ณผ Unigram ๋ฐฉ์‹์„ ํฌํ•จํ•˜๊ณ  ์žˆ์œผ๋ฉฐ, ์–ธ์–ด์˜ ์ข…๋ฅ˜์— ๊ตฌ์• ๋ฐ›์ง€ ์•Š๊ณ  tokenization์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค๊ณ  ํ•œ๋‹ค.
SentencePiece ํ•™์Šต๋ฒ•์€ ๋ถ„ํฌ ์ถ”์ •(Variational Inference)์˜ ์ผ์ข…์œผ๋กœ, ๊ด€์ธก ๋ฐ์ดํ„ฐ(evidence)์™€ ๋ชจ๋ธ ํŒŒ๋ผ๋ฏธํ„ฐ(theta)๊ฐ€ ์žˆ์„ ๋•Œ ๊ฐ€์„ค์— ๋Œ€ํ•œ ๋ถ„ํฌ P๋ฅผ virational parameter๋ฅผ ๋„์ž…ํ•ด Q๋กœ ๊ทผ์‚ฌํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ๊ทผ์‚ฌ๋ฅผ ํ•˜๋Š” ์ด์œ ๋Š” ๋ถ„ํฌ P ์ž์ฒด๋ฅผ ์ฐพ์•„๋‚ด๋Š” ๊ฒƒ์ด ๋งค์šฐ ์–ด๋ ต๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค.
๋ถ„ํฌ ์ถ”์ •์ด๋ž€, ์ฃผ์‚ฌ์œ„๋ฅผ ๊ตด๋ ธ์„ ๋•Œ ์–ด๋–ค ์ˆซ์ž๊ฐ€ ๋‚˜์˜ฌ ํ™•๋ฅ ์„ ๋ชจ๋“  ์‹คํ–‰์„ ํ†ตํ•ด ๊ตฌํ•  ์ˆ˜ ์—†์„ ๋•Œ ์ด๋ฅผ 16\frac{1}{6}์œผ๋กœ ์ถ”์ •ํ•˜๋Š” ๊ณผ์ •์„ ์ƒ๊ฐํ•˜๋ฉด ๋œ๋‹ค. MLE, MAP ๋˜ํ•œ ๋ถ„ํฌ ์ถ”์ •์˜ ๋ฐฉ์‹์ด๋‹ค.
์–ด๋ ค์šฐ๋‹ˆ๊นŒ ์ฝ”๋“œ๋กœ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ฐฉ์‹์„ ์•Œ์•„๋ณด์ž...

The sample data is the Naver movie review dataset (Naver Sentiment Movie Corpus v1.0 in the references below).

  1. First, install the sentencepiece module.
    pip install sentencepiece
  2. Import the SentencePiece module, train it on the dataset, and load the trained model.
import sentencepiece as spm

# txt_file_path: path to the raw-text training file
spm.SentencePieceTrainer.train(input=txt_file_path, model_prefix='mymodel', vocab_size=8000, model_type='bpe')
sp = spm.SentencePieceProcessor(model_file='mymodel.model')

The required arguments are as follows.
input: the path of the txt file. You can feed in the data you want to use without any separate preprocessing, and it performs tokenization on its own.
model_prefix: the name of the model to create after tokenization. A .model and a .vocab file with the given name are generated.

The generated .vocab file lists each token together with its score, one per line.

vocab_size: the size of the vocab to generate; it sets how many tokens the .vocab above will contain.
model_type: the algorithm to apply. The options are bpe, unigram, char, and word.

  3. Encode test data with the trained SentencePiece model.
sentences = ["example sentence 1", "example sentence 2", ...]

# sentences to tokens
tokens = sp.encode(sentences, out_type=str)

# tokens to sentences
backToSentences = sp.decode(tokens)

์ง์ ‘ ์ธ์ฝ”๋”ฉํ•˜๊ณ  ๋””์ฝ”๋”ฉํ•œ ๊ฒฐ๊ณผ๋Š” ์•„๋ž˜์™€ ๊ฐ™๋‹ค.
์™ผ์ชฝ๋ถ€ํ„ฐ ์ˆœ์„œ๋Œ€๋กœ ์›๋ž˜ ๋ฌธ์žฅ, token, ๋ณต์›ํ•œ ๋ฌธ์žฅ์ด๋‹ค.

References
딥러닝을 이용한 자연어 처리 입문 (Introduction to NLP with Deep Learning): 13-01 Byte Pair Encoding
Hugging Face: WordPiece tokenization
Hugging Face: Unigram tokenization
velog.io/@gibonki77/SentencePiece
Naver Sentiment Movie Corpus v1.0




Word Embedding

Concept

After performing the Tokenization we studied above, representing each token as a One-Hot Vector based on its assigned index has two characteristics.

1 It is simple, since the index is directly expressed as a one-hot vector.
2 It requires as many dimensions as there are tokens, and the distance between any two tokens is identical.

Word Embedding was proposed because of point 2, the drawback. The one-hot representation is called a Sparse Representation, while Word Embedding compresses words into a small number of dense dimensions and is therefore called a Dense Representation.

Word Embedding methods include LSA, Word2Vec, FastText, and GloVe. Let's look at Word2Vec, the most representative one.

Word2Vec

Distributed Representation
Earlier we said that representing a word not as a one-hot vector but as a vector with densely packed elements is called a Dense Representation. Spreading a word's meaning across the many dimensions of such a dense vector is called a Distributed Representation.
Distributed representation starts from the assumption that words appearing in similar contexts have similar meanings. For example, IU (personal bias: on) appears frequently alongside the words 'pretty' and 'cute', so when 'pretty' and 'cute' are vectorized, the distance between the two should be small.

Word2Vec์˜ ํ•™์Šต ๋ฐฉ์‹์—๋Š” CBOW์™€ Skip-Gram ๋‘ ๊ฐ€์ง€๊ฐ€ ์žˆ๋‹ค. CBOW๋Š” ์ฃผ๋ณ€ ๋‹จ์–ด๋ฅผ ์ž…๋ ฅ์œผ๋กœ ์ค‘๊ฐ„์˜ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ํ•™์Šต ๋ฐฉ์‹์ด๊ณ , Skip-Gram์€ ์ค‘๊ฐ„ ๋‹จ์–ด๋“ค์„ ์ž…๋ ฅ์œผ๋กœ ์ฃผ๋ณ€ ๋‹จ์–ด๋“ค์„ ์˜ˆ์ธกํ•˜๋Š” ๋ฐฉ๋ฒ•์ด๋‹ค.

CBOW
As an example, take the sentence The cat sits on the mat. With window_size=2, the two words before and after the target word become its surrounding (context) words; that is, there are 2*window_size context words. Pairing each center word with its context words gives:
The: [[pad], [pad], cat, sits]
cat: [[pad], The, sits, on]
sits: [The, cat, on, the]
on: [cat, sits, the, mat]
the: [sits, on, mat, [pad]]
mat: [on, the, [pad], [pad]]
Sliding the window across the sentence one word at a time to build the dataset like this is called a sliding window.
Here, center words that can't fill the window were still paired up using padding; if the corpus is large enough these words may simply be dropped, or, to keep every context the same length, extra words can be added as in The: [cat, sits, on, the].
With the dataset preprocessing done, we feed it to the model; a small sketch of the pair generation is below.
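
A minimal sketch of the sliding-window pair generation described above (the function and PAD names are my own):

PAD = '[pad]'

def cbow_pairs(tokens, window_size=2):
    # Pad both ends, then pair each center word with the
    # 2 * window_size words around it.
    padded = [PAD] * window_size + tokens + [PAD] * window_size
    pairs = []
    for i, center in enumerate(tokens):
        j = i + window_size  # position of the center word in `padded`
        context = padded[j - window_size:j] + padded[j + 1:j + window_size + 1]
        pairs.append((center, context))
    return pairs

for center, context in cbow_pairs("The cat sits on the mat".split()):
    print(f"{center}: {context}")  # reproduces the table above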

  1. Project the surrounding words using $W_{V \times M}$. Here V is the dimension of the input vector x, and M is the dimension of the projection layer.

Looking at the projection step in more detail:

v is the form the vector takes in the projection layer. It has shape 1×d (d being the M described above), and since there is only one hidden layer, the model is called a shallow neural network. W is also described as a look-up table mapping to the word to predict. No activation function is used here: this preserves the linearity between words, and avoids adding a nonlinearity that would destroy the distance information between words.

  1. ๋ณต์›์„ ์œ„ํ•ด WMโˆ—Vโ€ฒW'_{M*V}์„ ์‚ฌ์šฉํ•ด์„œ ์˜ˆ์ธก๊ฐ’์„ ๊ตฌํ•œ๋‹ค. ์ด Wโ€ฒW'์™€ WW๋Š” transpose๊ฐ€ ์•„๋‹ˆ๋ผ ๋‹ค๋ฅธ ๊ฐ€์ค‘์น˜์ด๋‹ค.

z has shape 1×V and is converted into relative probability values through softmax. This is compared with the ground truth, and $W$ and $W'$ are trained with cross-entropy as the loss function.

$loss(\hat{y}, y) = -\sum_{j=1}^{V} y_j \log(\hat{y}_j)$

ํ•™์Šต์ด ์™„๋ฃŒ๋œ WW์™€ Wโ€ฒW'๋ฅผ ๋ชจ๋‘ embedding vector๋กœ ์‚ฌ์šฉํ•˜๊ธฐ๋„ ํ•˜๊ณ , M์ฐจ์›์˜ ํฌ๊ธฐ๋ฅผ ๊ฐ–๋Š” WW ํ–‰๋ ฌ์˜ ํ–‰๋งŒ์„ ์‚ฌ์šฉํ•˜๊ธฐ๋„ ํ•œ๋‹ค.

Skip-Gram
Skip-Gram์€ ์ค‘์‹ฌ ๋‹จ์–ด๋กœ๋ถ€ํ„ฐ ์ฃผ๋ณ€ ๋‹จ์–ด๋ฅผ ์˜ˆ์ธกํ•œ๋‹ค.
์•ž์˜ ์˜ˆ๋ฌธ The cat sits on the mat๋ฅผ ๊ทธ๋Œ€๋กœ ์‚ฌ์šฉํ•˜์—ฌ ์ง์„ ์ง€์–ด๋ณด๋ฉด ์•„๋ž˜์™€ ๊ฐ™๋‹ค.
sits: [The]
sits: [cat]
sits: [on]
sits: [the]
on: [cat]
on: [sits]
on: [the]
on: [mat]
Compared to CBOW, the context for each training pair is much shorter; accordingly, there is no averaging step in the projection layer.
Research has found that Skip-Gram performs better than CBOW. One way to think about this is that Skip-Gram's task is harder, so each word learns more about its relationships with other words.
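
In practice the two schemes are one flag apart in gensim (assuming gensim 4.x is available; sg=1 selects Skip-Gram, sg=0 CBOW):

from gensim.models import Word2Vec

sentences = [["The", "cat", "sits", "on", "the", "mat"]]  # toy corpus
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(model.wv["cat"].shape)         # (100,): the embedding for "cat"
print(model.wv.most_similar("cat"))  # nearest neighbours by cosine similarity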

References
딥러닝을 이용한 자연어 처리 입문 (Introduction to NLP with Deep Learning): 09-02 Word2Vec
