Bag of Words (BoW)

허허맨 · July 31, 2025



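1. Concept

Bag of Words is a way of representing text that completely ignores word order
and looks only at **how often each word occurs (frequency)**.

Literal meaning: "a bag of words"
Put the words in a bag and shake it: the order is lost and only the count of each word remains
A representative local representation technique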

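2. How to Build a BoW

Build the vocabulary
- Extract every unique word that appears in the document
- Assign each word a unique index
Vectorize the word frequencies
- At each word's index, record how many times that word occurs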
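
The two steps above can be sketched in plain Python. This is only a minimal illustration under simple assumptions: tokens are lowercase words split on whitespace, and the function name `bag_of_words` and the sample sentence are made up for this example.

```python
def bag_of_words(text):
    """Return (vocabulary, vector) for a single document.

    Assumption: tokens are lowercase words separated by whitespace.
    """
    vocabulary = {}  # word -> index, in order of first appearance
    vector = []      # vector[i] = count of the word whose index is i

    for word in text.lower().split():
        if word not in vocabulary:
            # Step 1: give a new word the next free index
            vocabulary[word] = len(vocabulary)
            vector.append(0)
        # Step 2: record the occurrence at that word's index
        vector[vocabulary[word]] += 1

    return vocabulary, vector


vocabulary, vector = bag_of_words("the cat saw the other cat")
print(vocabulary)  # {'the': 0, 'cat': 1, 'saw': 2, 'other': 3}
print(vector)      # [2, 2, 1, 1]
```

The vector has no notion of where each word appeared; it only keeps the counts.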

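Example (Korean)

Document 1: 정부가 발표하는 물가상승률과 소비자가 느끼는 물가상승률은 다르다.
("The inflation rate the government announces differs from the inflation rate consumers feel.")

BoW result:

vocabulary: {'정부':0, '가':1, '발표':2, '하는':3, '물가상승률':4, ...}
vector:     [1, 2, 1, 1, 2, ...]

물가상승률 ("inflation rate", index=4) occurs twice in the document → the value at that position is 2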

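3. Merging Documents

You can unify several documents under one shared vocabulary and then build a BoW for each document against it.

Document 3 (Document 1 and Document 2 concatenated) vocabulary: contains every word from Document 1 + Document 2
Document 1 BoW: [1, 2, 1, 1, 2, 1, 1, ...]
Document 2 BoW: [0, 0, 0, 1, 1, 0, 1, ...]

The same word occurs a different number of times in each document, so the BoW values differ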
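
As a rough sketch of the shared-vocabulary idea, the snippet below builds one vocabulary from two documents and then counts each document against it. It assumes whitespace tokenization, and the helper names (`build_vocabulary`, `vectorize`) and the English toy documents are hypothetical, chosen only for this example.

```python
def build_vocabulary(documents):
    """Map every unique word across all documents to an index."""
    vocabulary = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(doc, vocabulary):
    """Count the words of one document against a fixed vocabulary."""
    vector = [0] * len(vocabulary)
    for word in doc.lower().split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector


docs = ["i like tea", "i like coffee and tea and milk"]
shared_vocab = build_vocabulary(docs)
vectors = [vectorize(doc, shared_vocab) for doc in docs]

print(shared_vocab)  # {'i': 0, 'like': 1, 'tea': 2, 'coffee': 3, 'and': 4, 'milk': 5}
print(vectors[0])    # [1, 1, 1, 0, 0, 0]
print(vectors[1])    # [1, 1, 1, 1, 2, 1]
```

Because both vectors share the same index layout, they have the same length and can be compared position by position.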

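4. Use Cases

Document classification: assign a category based on the frequencies of particular words
Document similarity: measure the distance/similarity between BoW vectors
Search engines: compare the query's BoW ↔ each document's BoW

For example:

'running', 'stamina', 'strength' appear often → a sports-related document
'differentiation', 'equation', 'inequality' appear often → a math-related document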
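
One common way to compare BoW vectors is cosine similarity. The sketch below is an illustration only: it assumes whitespace tokenization, uses `collections.Counter` as a sparse BoW, and the query and documents are invented for this example.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the BoW counts of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Only words present in both texts contribute to the dot product
    shared_words = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared_words)

    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is similar to nothing
    return dot / (norm_a * norm_b)


query = "running stamina"
sports_doc = "running builds stamina and running builds strength"
math_doc = "solve the equation then check the inequality"

sports_score = cosine_similarity(query, sports_doc)
math_score = cosine_similarity(query, math_doc)

print(sports_score > math_score)  # True
print(math_score)                 # 0.0
```

The math document shares no words with the query, so its score is exactly zero, while the sports document ranks higher.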

5. Building a BoW with CountVectorizer

scikit-learn's CountVectorizer makes it easy to build a BoW.

from sklearn.feature_extraction.text import CountVectorizer

corpus = ['you know I want your love. because I love you.']
vector = CountVectorizer()

print(vector.fit_transform(corpus).toarray()) 
print(vector.vocabulary_)

Output:

[[1 1 2 1 2 1]]
{'you':4, 'know':1, 'want':3, 'your':5, 'love':2, 'because':0}

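you and love each occur twice
CountVectorizer only performs simple tokenization: it lowercases the text, splits on whitespace and punctuation, and by default keeps only tokens of two or more characters, which is why I is missing from the vocabulary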
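
To see why `I` disappears, here is an approximation of the default analyzer using only the standard library. It relies on two documented defaults of CountVectorizer, `lowercase=True` and `token_pattern=r"(?u)\b\w\w+\b"`; the function name `simple_analyzer` is hypothetical, and this is a stand-in for illustration, not scikit-learn's own code.

```python
import re
from collections import Counter

# Same regex as CountVectorizer's documented default token_pattern:
# runs of two or more word characters
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def simple_analyzer(text):
    """Lowercase the text, then keep tokens of 2+ word characters."""
    return TOKEN_PATTERN.findall(text.lower())


tokens = simple_analyzer("you know I want your love. because I love you.")
counts = Counter(tokens)

print(tokens)
# ['you', 'know', 'want', 'your', 'love', 'because', 'love', 'you']
print(sorted(counts.items()))
# [('because', 1), ('know', 1), ('love', 2), ('want', 1), ('you', 2), ('your', 1)]
```

Sorted alphabetically, the counts line up with the `[[1 1 2 1 2 1]]` row shown above.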

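6. Caveats for Korean BoW

Korean attaches particles and inflected endings to words, so splitting on whitespace alone does not tokenize it accurately
Example: 물가상승률과 / 물가상승률은 → treated as two different words
For Korean BoW, it is better to use a morphological analyzer (e.g. Okt, Mecab)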
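
A quick way to see the problem is to split the earlier example sentence on whitespace only. This sketch uses nothing beyond the standard library and deliberately skips morphological analysis to show what goes wrong.

```python
from collections import Counter

sentence = "정부가 발표하는 물가상승률과 소비자가 느끼는 물가상승률은 다르다."

# Whitespace-only tokenization: particles stay glued to the nouns
tokens = sentence.replace(".", "").split()
counts = Counter(tokens)

print(tokens)
# ['정부가', '발표하는', '물가상승률과', '소비자가', '느끼는', '물가상승률은', '다르다']
print(counts["물가상승률"])    # 0 (the bare noun never appears as a token)
print(counts["물가상승률과"])  # 1
print(counts["물가상승률은"])  # 1
```

The noun occurs twice in the sentence, yet its count is split across two unrelated entries because each occurrence carries a different particle.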

7. Removing Stopwords

BoW counts every word regardless of how important it is,
so removing words that carry little meaning (stopwords) can improve performance.

Stopword removal with CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

text = ["Family is not an important thing. It's everything."]
vect = CountVectorizer(stop_words="english")

print(vect.fit_transform(text).toarray())
print(vect.vocabulary_)

Output:

[[1 1 1]]
{'family':0, 'important':1, 'thing':2}

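Stopwords such as "is", "not", and "an" are removed; "it" and "everything" are also on scikit-learn's built-in English list, so only three words remain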
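
The mechanism itself is just a filter applied before counting. The sketch below shows it with the standard library only; the stopword set is a small hand-picked assumption for this example, not scikit-learn's built-in list, and the function name `count_without_stopwords` is hypothetical.

```python
import re
from collections import Counter

# Hypothetical, hand-picked stopword set (NOT scikit-learn's built-in list)
STOPWORDS = {"is", "not", "an", "it"}


def count_without_stopwords(text, stopwords):
    """Tokenize, drop stopwords, then count what is left."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(token for token in tokens if token not in stopwords)


counts = count_without_stopwords(
    "Family is not an important thing. It's everything.", STOPWORDS
)
print(sorted(counts))  # ['everything', 'family', 'important', 'thing']
```

Here `everything` survives because it is not in the hand-picked set, which shows how much the result depends on which stopword list you choose.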

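8. Characteristics and Limitations of BoW

Pros: simple and easy to implement
Cons:
- Ignores word order
- Cannot express semantic similarity
- Sensitive to proper nouns and to variations in word form
- Vectors are as long as the vocabulary and mostly zeros (sparse), so storing them densely uses a lot of memory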
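
Two of these limitations are easy to demonstrate. The sketch below, using only the standard library, shows that sentences with opposite meanings can produce identical BoW counts, and how few entries of a vector are non-zero. The vocabulary size of 10,000 is a made-up figure for illustration.

```python
from collections import Counter

# Word order is lost: opposite meanings, identical counts
bow_a = Counter("dog bites man".split())
bow_b = Counter("man bites dog".split())
print(bow_a == bow_b)  # True

# Sparsity: with a hypothetical 10,000-word vocabulary,
# this document fills only 3 positions of its vector
vocabulary_size = 10_000  # assumed for illustration
nonzero_entries = len(bow_a)
zero_entries = vocabulary_size - nonzero_entries
print(nonzero_entries, zero_entries)  # 3 9997
```

Storing only the non-zero counts, as the `Counter` does here, is the usual way around the memory cost of a dense vector.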

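📌 Summary

BoW = ignore word order + count occurrences
A baseline technique for document classification, search, and similarity analysis
For Korean, using it together with a morphological analyzer is strongly recommended
A concept to understand before moving on to more advanced techniques (TF-IDF, word embeddings)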

Full code

# BoW for Korean text using a morphological analyzer
from konlpy.tag import Okt

okt = Okt()

def build_bag_of_words(document):
    # Remove periods
    document = document.replace('.', '')
    # Tokenize into morphemes
    tokenized_document = okt.morphs(document)

    # Initialize the vocabulary and the BoW vector
    word_to_index = {}
    bow = []

    for word in tokenized_document:
        if word not in word_to_index:
            # New word: assign the next index
            word_to_index[word] = len(word_to_index)
            bow.append(1)  # its count starts at 1
        else:
            # Known word: increment its count
            index = word_to_index[word]
            bow[index] += 1

    return word_to_index, bow

# Test document
doc1 = "정부가 발표하는 물가상승률과 소비자가 느끼는 물가상승률은 다르다."
vocab, bow = build_bag_of_words(doc1)

# Which index was assigned to each word?
print("📌 Vocabulary:", vocab)

# How many times does the word at each index occur?
print("📌 BoW Vector:", bow)




###################################

doc2 = "소비자는 주로 소비하는 상품을 기준으로 물가상승률을 느낀다."

# Merge the two documents
doc3 = doc1 + " " + doc2
vocab3, bow3 = build_bag_of_words(doc3)

print("\n📌 Document 3 Vocabulary:", vocab3)
print("📌 Document 3 BoW Vector:", bow3)

# Build a BoW against a fixed vocabulary (here, Document 3's)
def build_bow_with_vocab(document, fixed_vocab):
    document = document.replace('.', '')
    tokenized_document = okt.morphs(document)
    bow = [0] * len(fixed_vocab)
    for word in tokenized_document:
        # Words outside the fixed vocabulary are ignored
        if word in fixed_vocab:
            index = fixed_vocab[word]
            bow[index] += 1
    return bow

bow_doc1_fixed = build_bow_with_vocab(doc1, vocab3)
bow_doc2_fixed = build_bow_with_vocab(doc2, vocab3)

print("\n📌 Document 1 BoW (Document 3 vocabulary):", bow_doc1_fixed)
print("📌 Document 2 BoW (Document 3 vocabulary):", bow_doc2_fixed)



##############################
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['you know I want your love. because I love you.']
vectorizer = CountVectorizer()

# BoW vector
X = vectorizer.fit_transform(corpus).toarray()
# Word-to-index mapping
vocab = vectorizer.vocabulary_

print("\n📌 BoW Vector (English):", X)
print("📌 Vocabulary (English):", vocab)

###############################
# Method 1: specify the stopwords yourself
vect_custom_stop = CountVectorizer(stop_words=["the", "a", "an", "is", "not"])
X_custom = vect_custom_stop.fit_transform(["Family is not an important thing. It's everything."]).toarray()
print("\n📌 Custom Stopwords BoW:", X_custom)
print("📌 Vocabulary:", vect_custom_stop.vocabulary_)

# Method 2: scikit-learn's built-in English stopword list
vect_en_stop = CountVectorizer(stop_words="english")
X_en = vect_en_stop.fit_transform(["Family is not an important thing. It's everything."]).toarray()
print("\n📌 English Stopwords BoW:", X_en)
print("📌 Vocabulary:", vect_en_stop.vocabulary_)
