Bag-of-Words (BoW)

Rainy Night for Sapientia · July 8, 2023

NLP

NLP demystified

Bag-of-Words

Before a machine learning algorithm can be applied, natural-language text has to be converted into fixed-length numeric vectors.

This conversion is called vectorization.
Bag-of-words is one of the earliest and best-known vectorization methods.

IDEA

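The idea behind BoW (bag-of-words) is that the more similar two documents are, the more words they tend to share. This is a reasonable assumption: a pizza recipe and an airplane safety manual will use very different sets of words.

In a BoW matrix, each row represents a document (a sentence, a tweet, or an entire book) and each column holds the number of occurrences of one word.

In other words, the number of rows equals the number of documents in the corpus,
and the number of columns equals the vocabulary size.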

Implementation

There are two main variants.
In binary BoW, each row is a feature vector whose entries are 1 if the word is present in the document and 0 otherwise. Pretty simple, right?
In frequency BoW, each entry is instead the number of times the token occurs.

Either way, because the vocabulary of a whole corpus is usually very large, each feature vector will be very long, and most of its entries will be 0.

Let's implement it.
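Before reaching for a library, the two variants can be sketched by hand. This is a minimal illustration, not how scikit-learn works internally: the two-document list, the regex-based `tokenize` helper, and the names `freq_bow` and `binary_bow` are assumptions made up for this example.

```python
import re
from collections import Counter

# Two short documents, used only for illustration.
docs = [
    "Red Bull drops hint on F1 engine.",
    "Honda exits F1, leaving F1 partner Red Bull.",
]

def tokenize(text):
    # Hypothetical helper: lowercase, then keep alphanumeric runs of 2+ chars.
    return re.findall(r"[a-z0-9]{2,}", text.lower())

tokenized = [tokenize(d) for d in docs]

# The vocabulary is the sorted set of all tokens; its size is the column count.
vocab = sorted({tok for toks in tokenized for tok in toks})

def frequency_row(tokens):
    # One row of the frequency BoW: how often each vocabulary word occurs.
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

freq_bow = [frequency_row(toks) for toks in tokenized]

# Binary BoW only records presence (1) or absence (0).
binary_bow = [[1 if c > 0 else 0 for c in row] for row in freq_bow]

zeros = sum(c == 0 for row in freq_bow for c in row)
total = len(freq_bow) * len(vocab)

print(vocab)
print(freq_bow)
print(binary_bow)
print(f"{zeros} of {total} entries are zero")
```

Even with only two documents, a noticeable share of the entries is already zero; the share grows quickly as the vocabulary gets larger.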

Importing libraries

import spacy

from scipy import spatial
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A corpus of sentences.
corpus = [
  "Red Bull drops hint on F1 engine.",
  "Honda exits F1, leaving F1 partner Red Bull.",
  "Hamilton eyes record eighth F1 title.",
  "Aston Martin announces sponsor."
]

We build the BoW for this corpus with CountVectorizer().

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

As shown below, each vocabulary term is stored as a feature.

# View features (tokens).
print(vectorizer.get_feature_names_out())

# View vocabulary dictionary.
vectorizer.vocabulary_

###
['announces' 'aston' 'bull' 'drops' 'eighth' 'engine' 'exits' 'eyes' 'f1'
 'hamilton' 'hint' 'honda' 'leaving' 'martin' 'on' 'partner' 'record'
 'red' 'sponsor' 'title']
{'red': 17,
 'bull': 2,
 'drops': 3,
 'hint': 10,
 'on': 14,
 'f1': 8,
 'engine': 5,
 'honda': 11,
 'exits': 6,
 'leaving': 12,
 'partner': 15,
 'hamilton': 9,
 'eyes': 7,
 'record': 16,
 'eighth': 4,
 'title': 19,
 'aston': 1,
 'martin': 13,
 'announces': 0,
 'sponsor': 18}

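The BoW matrix is sparse: most of its entries are 0.
Rather than storing it as a dense array, scikit-learn returns it as a SciPy CSR (Compressed Sparse Row) matrix, which is far more memory-efficient. CSR stores only the non-zero values together with their row and column positions, from which the full matrix can be reconstructed.
The type and the stored entries look like this.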

print(type(bow))
#
<class 'scipy.sparse._csr.csr_matrix'>

In each tuple, the first number is the row (document), the second is the column (token id), and the number that follows is the count of that word in the document.

print(bow)
###
  (0, 17)	1
  (0, 2)	1
  (0, 3)	1
  (0, 10)	1
  (0, 14)	1
  (0, 8)	1
  (0, 5)	1
  (1, 17)	1
  (1, 2)	1
  (1, 8)	2
  (1, 11)	1
  (1, 6)	1
  (1, 12)	1
  (1, 15)	1
  (2, 8)	1
  (2, 9)	1
  (2, 7)	1
  (2, 16)	1
  (2, 4)	1
  (2, 19)	1
  (3, 1)	1
  (3, 13)	1
  (3, 0)	1
  (3, 18)	1
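To make the sparse storage a little more concrete, the sketch below rebuilds the same BoW and inspects a few attributes of the SciPy CSR matrix. The variable name `dense` is an assumption for this example; `shape`, `nnz`, `indptr`, `indices`, `data`, and `toarray()` are standard members of `scipy.sparse.csr_matrix`.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Red Bull drops hint on F1 engine.",
    "Honda exits F1, leaving F1 partner Red Bull.",
    "Hamilton eyes record eighth F1 title.",
    "Aston Martin announces sponsor.",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(corpus)

# (number of documents, vocabulary size)
print(bow.shape)

# Number of explicitly stored entries, versus rows * columns cells in total.
print(bow.nnz, "stored entries out of", bow.shape[0] * bow.shape[1])

# CSR layout: row i's entries live in data[indptr[i]:indptr[i + 1]],
# and indices holds the column (token id) of each stored value.
print(bow.indptr)
print(bow.indices[bow.indptr[1]:bow.indptr[2]])
print(bow.data[bow.indptr[1]:bow.indptr[2]])

# Convert to a dense NumPy array only when the matrix is small enough.
dense = bow.toarray()
print(dense[1])
```

Calling `toarray()` is fine for a toy corpus like this one, but on a realistic vocabulary it can use a lot of memory, which is the reason the sparse format exists.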

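Shall we add a pinch of preprocessing?
With a custom tokenizer callback, a single line is enough to filter tokens. The one below drops punctuation; stop words could be removed the same way by also checking t.is_stop.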

# As usual, we start by importing spaCy and loading a statistical model.
nlp = spacy.load('en_core_web_sm')

# Create a tokenizer callback using spaCy under the hood. Here, we tokenize 
# the passed-in text and return the tokens, filtering out punctuation.
def spacy_tokenizer(doc):
  return [t.text for t in nlp(doc) if not t.is_punct]

By changing parameters as shown below, we can also skip case folding (lowercase=False) or build a binary BoW (binary=True).

vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True)
bow = vectorizer.fit_transform(corpus)

Similarity

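Let's now view these BoW feature vectors as points in a vector space.
The number of dimensions is the vocabulary size.

We can then use cosine similarity to measure how similar any two feature vectors are. The value lies between 0 and 1, because the entries in each column, that is, the word frequencies, can never be negative.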
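As a rough sketch of this, the pairwise similarities for the example corpus can be computed with `cosine_similarity` from scikit-learn, which accepts the sparse BoW matrix directly. The variable name `sim` is an assumption for this example.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Red Bull drops hint on F1 engine.",
    "Honda exits F1, leaving F1 partner Red Bull.",
    "Hamilton eyes record eighth F1 title.",
    "Aston Martin announces sponsor.",
]

bow = CountVectorizer().fit_transform(corpus)

# sim[i, j] is the cosine similarity between document i and document j.
sim = cosine_similarity(bow)
print(np.round(sim, 3))

# The two Red Bull headlines share "red", "bull" and "f1",
# so they should score higher than unrelated pairs.
print("doc 0 vs doc 1:", round(float(sim[0, 1]), 3))
print("doc 0 vs doc 3:", round(float(sim[0, 3]), 3))
```

The first two headlines share three words and score about 0.478, while the Aston Martin headline shares no words with the others and scores 0 against all of them.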

Drawbacks

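BoW is an important vectorization method that marks a starting point of NLP,
but it has many drawbacks when used to build real models.
They can be summarized as follows.

It cannot capture or represent relationships between similar words.
It cannot handle out-of-vocabulary (OOV) words that do not appear in the corpus.
It produces a sparse matrix, which is computationally inefficient.
- Storing it in a dictionary-like form or shrinking it with truncated SVD helps somewhat, but it remains inefficient.
It ignores word order. For example, "Chelsea beats Barcelona" and "Barcelona beats Chelsea" map to the same vector.
- n-grams can compensate for this to some extent.
Finally, because it relies on raw occurrence counts, unimportant words can end up with large weights.
- TF-IDF addresses this.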
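Two of these drawbacks, word order and OOV handling, are easy to reproduce. The sketch below is only an illustration: the sentence "Arsenal wins" and the variable names are assumptions made up for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

pair = ["Chelsea beats Barcelona", "Barcelona beats Chelsea"]

# Unigram BoW: both sentences contain the same three words once each.
unigram = CountVectorizer()
uni = unigram.fit_transform(pair).toarray()
same_unigram = bool((uni[0] == uni[1]).all())
print("identical with unigrams:", same_unigram)

# Adding bigrams keeps some local word order, so the rows now differ.
bigram = CountVectorizer(ngram_range=(1, 2))
bi = bigram.fit_transform(pair).toarray()
same_bigram = bool((bi[0] == bi[1]).all())
print("identical with unigrams + bigrams:", same_bigram)

# OOV: words never seen during fit have no column, so they are silently dropped.
oov = unigram.transform(["Arsenal wins"]).toarray()
print("vector for unseen words:", oov[0])
```

With unigrams the two opposite results are indistinguishable, and a sentence made only of unseen words becomes an all-zero vector.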

Let's take a quick look at code that passes n-grams as a parameter.

vectorizer = CountVectorizer(tokenizer=spacy_tokenizer, lowercase=False, binary=True, ngram_range=(1,2))
bigrams = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print('Number of features: {}'.format(len(vectorizer.get_feature_names_out())))
print(vectorizer.vocabulary_)

###
['Aston' 'Aston Martin' 'Bull' 'Bull drops' 'F1' 'F1 engine' 'F1 leaving'
 'F1 partner' 'F1 title' 'Hamilton' 'Hamilton eyes' 'Honda' 'Honda exits'
 'Martin' 'Martin announces' 'Red' 'Red Bull' 'announces'
 'announces sponsor' 'drops' 'drops hint' 'eighth' 'eighth F1' 'engine'
 'exits' 'exits F1' 'eyes' 'eyes record' 'hint' 'hint on' 'leaving'
 'leaving F1' 'on' 'on F1' 'partner' 'partner Red' 'record'
 'record eighth' 'sponsor' 'title']
Number of features: 40
{'Red': 15, 'Bull': 2, 'drops': 19, 'hint': 28, 'on': 32, 'F1': 4, 'engine': 23, 'Red Bull': 16, 'Bull drops': 3, 'drops hint': 20, 'hint on': 29, 'on F1': 33, 'F1 engine': 5, 'Honda': 11, 'exits': 24, 'leaving': 30, 'partner': 34, 'Honda exits': 12, 'exits F1': 25, 'F1 leaving': 6, 'leaving F1': 31, 'F1 partner': 7, 'partner Red': 35, 'Hamilton': 9, 'eyes': 26, 'record': 36, 'eighth': 21, 'title': 39, 'Hamilton eyes': 10, 'eyes record': 27, 'record eighth': 37, 'eighth F1': 22, 'F1 title': 8, 'Aston': 0, 'Martin': 13, 'announces': 17, 'sponsor': 38, 'Aston Martin': 1, 'Martin announces': 14, 'announces sponsor': 18}