BoW (Bag of Words)

์• ๋Š™์€์ดยท2023๋…„ 9์›” 10์ผ
0

NLP ์—ฌํ–‰๊ธฐ

๋ชฉ๋ก ๋ณด๊ธฐ
12/13
post-thumbnail

๐Ÿค” What is BoW?

BoW is short for Bag of Words, a method of vectorizing text using a word bag. In other words, given a bag holding a set of words, we count how many times each word in the bag occurs in the text.

For example, suppose we have a word bag of seven words.

Given the sentence I really really like dog., BoW represents it by counting how many times each word in the bag occurs in the sentence. Vectorizing the sentence this way yields [1, 1, 1, 0, 2, 0, 0], where the 2 is the count of really and each 0 marks a bag word that does not appear in the sentence at all.
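To make this concrete, here is a minimal sketch. Only the fact that really sits at the index holding the 2 is implied by the vector above; the other bag words and their positions are hypothetical, chosen purely for illustration.

word_bag = {"I": 0, "like": 1, "dog": 2, "cat": 3, "really": 4, "hate": 5, "movie": 6}  # hypothetical word bag

sentence = "I really really like dog".split()  # simple whitespace tokenization
vector = [0] * len(word_bag)  # one slot per bag word
for word in sentence:
    if word in word_bag:
        vector[word_bag[word]] += 1  # count each occurrence

print(vector)


# Result
# [1, 1, 1, 0, 2, 0, 0]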

๐Ÿ’ป Implementing BoW

BoW์„ ๊ตฌํ˜„ํ•˜๊ธฐ ์œ„ํ•ด์„  ๋จผ์ € ๋‹จ์–ด ๊ฐ€๋ฐฉ์„ ๋งŒ๋“ค์–ด์•ผ ํ•  ํ•„์š”๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹จ์–ด ๊ฐ€๋ฐฉ์˜ ๊ฒฝ์šฐ, ์ค‘๋ณต๋˜๋Š” ๋‹จ์–ด๊ฐ€ ์—†์–ด์•ผ ํ•˜๊ณ  ๋ฒกํ„ฐํ™”ํ–ˆ์„์‹œ ์–ด๋Š ๋‹จ์–ด์ธ์ง€ ๊ตฌ๋ณ„ํ•˜๊ธฐ ์œ„ํ•ด ์ˆœ๋ฒˆ์ด ๋ถ€์—ฌ๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

๋‹ค์Œ๊ณผ ๊ฐ™์€ ๋ฌธ์žฅ์— ๋Œ€ํ•ด์„œ BoW๋ฅผ ์‹คํ–‰ํ•œ๋‹ค ๊ฐ€์ •ํ•ด๋ด…์‹œ๋‹ค.

sentences = ["I really want to stay at your house",
             "I wish that I could turn back time",
             "I need your time"]

In this case, the word bag must be as long as the number of unique words across the three sentences.

from nltk.tokenize import word_tokenize  # requires the "punkt" tokenizer: nltk.download("punkt")

word_bag = {}

for sent in sentences:
    tokens = word_tokenize(sent)
    for token in tokens:
        if token not in word_bag:  # register each word only once
            word_bag[token] = len(word_bag)  # assign the next available index

print(word_bag)


# Result
# {'I': 0, 'really': 1, 'want': 2, 'to': 3, 'stay': 4, 'at': 5, 'your': 6, 'house': 7, 'wish': 8, 'that': 9, 'could': 10, 'turn': 11, 'back': 12, 'time': 13, 'need': 14}

๋‹จ์–ด ๊ฐ€๋ฐฉ์„ ๊ธฐ๋ฐ˜์œผ๋กœ BoW๋ฅผ ์ง„ํ–‰ํ•˜๋ฉด ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

result = []

for sent in sentences:
    tokens = word_tokenize(sent)  # tokenize so we count whole tokens, not substrings
    vector = []
    for word in word_bag:
        if word in tokens:
            vector.append(tokens.count(word))
        else:
            vector.append(0)
    result.append(vector)

print(result)


# Result
# [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
#  [2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0],
#  [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1]]

You can see that each sentence becomes a list of frequencies, one per word in the word bag, in the bag's index order. Wrapping all of this up as a function looks like this.

import re
from nltk.tokenize import word_tokenize


def vector_word_of_bag(sents: list[str]) -> list[list[int]]:
    # Strip special characters.
    sents = [re.sub(r"[^a-zA-Z0-9\s]", "", sent) for sent in sents]
    # Tokenize each sentence.
    tokens: list[list[str]] = [word_tokenize(sent) for sent in sents]

    # Build the word bag: each unseen token gets the next index.
    word_bag: dict[str, int] = {}
    for token in sum(tokens, []):  # flatten the per-sentence token lists
        if token not in word_bag:
            word_bag[token] = len(word_bag)

    # Vectorize each tokenized sentence against the word bag.
    result: list[list[int]] = []
    for sent in tokens:
        temp: list[int] = []
        for word in word_bag:
            if word in sent:
                temp.append(sent.count(word))
            else:
                temp.append(0)
        result.append(temp)

    return result
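Calling the function on the earlier sentences reproduces the same matrix (these sentences contain no special characters, so the cleanup step changes nothing):

print(vector_word_of_bag(sentences))
# [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
#  [2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0],
#  [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1]]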

BoW๋Š” ์ง์ ‘์ ์ธ ๊ตฌํ˜„ ๋ง๊ณ ๋„ ํŒŒ์ด์ฌ ๋‚ด scikit-learn ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ์˜ CountVectorizer() ํด๋ž˜์Šค๋ฅผ ํ†ตํ•ด์„œ ๊ตฌํ˜„ํ•  ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ’ก What is scikit-learn?

scikit-learn is a Python library that mainly provides machine-learning functionality. Its CountVectorizer() class vectorizes text by word frequency. (It needs to be installed via pip first.)
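For example:

pip install scikit-learn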

from sklearn.feature_extraction.text import CountVectorizer

sents = ["I really want to stay at your house",
         "I wish that I could turn back time",
         "I need your time"]
           
vectorizer = CountVectorizer()
vectorizer.fit(sents)  # learn the vocabulary from the text data
word_bag = vectorizer.vocabulary_
result = vectorizer.transform(sents).toarray()

print(word_bag)
print(result)


# Result
# {'really': 5, 'want': 11, 'to': 9, 'stay': 6, 'at': 0, 'your': 13, 'house': 3, 'wish': 12, 'that': 7, 'could': 2, 'turn': 10, 'back': 1, 'time': 8, 'need': 4}
# [[1 0 0 1 0 1 1 0 0 1 0 1 0 1]
#  [0 1 1 0 0 0 0 1 1 0 1 0 1 0]
#  [0 0 0 0 1 0 0 0 1 0 0 0 0 1]]

์ด ๊ฒฝ์šฐ์—๋Š” ๋ถˆ์šฉ์–ด๊ฐ€ ์ฒ˜๋ฆฌ๋˜์–ด ์˜๋ฏธ์žˆ๋Š” ๋‹จ์–ด๋“ค์˜ ๋นˆ๋„์ˆ˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ” BoW์˜ ์žฅ๋‹จ์ 

๐ŸŸข Pros

Because BoW vectorizes text by the frequency of the words in the word bag, the resulting vectors can be compared with one another. In other words, we can measure how similar two sentences are.
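One common way to do this (cosine similarity is my choice here, not something the text above prescribes) is to compare the BoW vectors pairwise:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sents = ["I really want to stay at your house",
         "I wish that I could turn back time",
         "I need your time"]

vectors = CountVectorizer().fit_transform(sents)  # BoW matrix, one row per sentence
print(cosine_similarity(vectors))  # 3x3 matrix of pairwise sentence similarities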

๐Ÿ”ด Cons

์›-ํ•ซ ์ธ์ฝ”๋”ฉ๊ณผ ๋งˆ์ฐฌ๊ฐ€์ง€๋กœ, ๋ฌธ์žฅ ๋‚ด์—์„œ ๋‹จ์–ด ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๋“œ๋Ÿฌ๋‚ด์ง€๋Š” ๋ชปํ•ฉ๋‹ˆ๋‹ค. ๋‹จ์–ด์˜ ์ˆœ์„œ๋‚˜ ์œ„์น˜๋ฅผ ๋ฐ˜์˜ํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์ด์ฃ . ๋˜ํ•œ ๋‹จ์–ด ์ˆ˜๊ฐ€ ๋งŽ์•„์งˆ์ˆ˜๋ก ๋‹จ์–ด ๊ฐ€๋ฐฉ์ด ๋Š˜์–ด๋‚˜๊ณ , ๊ทธ๋งŒํผ ๋ฒกํ„ฐ ์ฐจ์›๋„ ๋Š˜์–ด๋‚˜๊ธฐ ๋•Œ๋ฌธ์— ๊ณต๊ฐ„์ ์œผ๋กœ ๋น„ํšจ์œจ์ ์ผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

๐Ÿ”ฅ What is DTM?

DTM์€ Document-Term Matrix, ๋ฌธ์„œ ๋‹จ์–ด ํ–‰๋ ฌ์˜ ์•ฝ์ž๋กœ, ์•ž์„œ ๋ฐฐ์› ๋˜ BoW๋ฅผ ํ™œ์šฉํ•˜์—ฌ ์—ฌ๋Ÿฌ ๋ฌธ์„œ๋“ค์„ ํ•˜๋‚˜์˜ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„ํ•œ ๊ฒƒ์„ ๋งํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๋ฌธ์„œ1, ๋ฌธ์„œ2, ๋ฌธ์„œ3, ๋ฌธ์„œ4๊ฐ€ ์žˆ๋‹ค๋ฉด ์ด๋“ค์˜ ๋‹จ์–ด ๊ฐ€๋ฐฉ์„ ๋งŒ๋“ค๊ณ  BoW๋ฅผ ์‹คํ–‰ํ•œ ๊ฒƒ์ด์ฃ .

Document 1. I eat rice.
Document 2. I like rice and I also like cooking.
Document 3. Cooking is fun, and I also enjoy eating rice.
Document 4. Rice is tasty, but curry is also tasty.

DTM์˜ ๊ฒฝ์šฐ BoW์˜ ์žฅ๋‹จ์ ์—์„œ ๋ณผ ์ˆ˜ ์žˆ๋“ฏ, ๋‹จ์–ด ์ง‘ํ•ฉ์ด ์ปค์งˆ ์ˆ˜๋ก ๋‹จ์–ด ๊ฐ€๋ฐฉ ๋˜ํ•œ ์ปค์ง€๊ธฐ ๋•Œ๋ฌธ์— ๊ณต๊ฐ„์ ์œผ๋กœ ๋น„ํšจ์œจ์ ์ด๋ผ๋Š” ๋‹จ์ ์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋˜ํ•œ, ๋‹จ์ˆœ ๋นˆ๋„ ์ˆ˜๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๊ธฐ์— ๋ถˆ์šฉ์–ด ์ฒ˜๋ฆฌ๋ฅผ ์ž˜ ํ•˜์ง€ ์•Š์œผ๋ฉด ์œ ์˜๋ฏธํ•œ ๊ฒฐ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์—†๋‹ค๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.
