NLP Sentiment Analysis (3)

JiHyeon Lee · March 28, 2024

Part 3 More Fun With Word Vectors

In Part 2, we trained a model to learn the meaning of words. How can we put it to use?

From Words To Paragraphs, Attempt 1: Vector Averaging

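One problem with the IMDB dataset is that the reviews vary in length. We need a way to take the individual word feature vectors and turn them into a feature set of the same length for every review.

Each word is a feature vector in 300-dimensional space, so we can use vector operations to combine the individual word vectors. One simple approach is to average the word vectors in each review. (Stop words are removed in this step, since they would add noise.)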

import numpy as np  # Make sure that numpy is imported

def makeFeatureVec(words, model, num_features):
    # Function to average all of the word vectors in a given paragraph
    # Pre-initialize an empty numpy array (for speed)
    featureVec = np.zeros((num_features,),dtype="float32")
    
    # nwords is the number of words counted in the paragraph
    nwords = 0.
    
    # Index2word is a list that contains the names of the words in 
    # the model's vocabulary. Convert it to a set, for speed 
    index2word_set = set(model.index2word)
   
    # Loop over each word in the review and, if it is in the model's
    # vocabulary, add its feature vector to the total
    for word in words:
        if word in index2word_set: 
            nwords = nwords + 1.
            featureVec = np.add(featureVec,model[word])
    
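    # Divide the result by the number of words to get the average.
    # If no word was in the vocabulary, keep the zero vector to avoid
    # dividing by zero.
    if nwords > 0:
        featureVec = np.divide(featureVec, nwords)
    return featureVec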


def getAvgFeatureVecs(reviews, model, num_features):
    # Given a set of reviews (each one a list of words), calculate 
    # the average feature vector for each one and return a 2D numpy array 
    
    # Preallocate a 2D numpy array, for speed
    reviewFeatureVecs = np.zeros((len(reviews), num_features), dtype="float32")

    # Loop through the reviews; enumerate gives an integer index, which
    # numpy requires for row assignment
    for counter, review in enumerate(reviews):

        # Print a status message every 1000th review
        if counter % 1000 == 0:
            print("Review %d of %d" % (counter, len(reviews)))

        # Call the function (defined above) that makes average feature vectors
        reviewFeatureVecs[counter] = makeFeatureVec(review, model, num_features)

    return reviewFeatureVecs

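The makeFeatureVec function takes a single review, given as a list of words, sums the feature vectors of its words, and divides by the word count to get the average. Only words in the model's vocabulary are summed and counted; stop words have already been removed from the input.

The getAvgFeatureVecs function calls makeFeatureVec on each review and stores the resulting average feature vectors in a 2D NumPy array, one row per review.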

The array returned by getAvgFeatureVecs has the following shape:
Rows: len(reviews)
Columns: num_features
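
As a rough illustration of the averaging and the resulting array shape, here is a minimal sketch on toy data. The 3-dimensional `toy_model` dict and the `average_vector` helper are hypothetical stand-ins for the trained Word2Vec model and makeFeatureVec; they are not part of the tutorial code.

```python
import numpy as np

# Hypothetical 3-dimensional "model": a plain dict standing in for Word2Vec
toy_model = {
    "great": np.array([1.0, 0.0, 2.0], dtype="float32"),
    "movie": np.array([3.0, 2.0, 0.0], dtype="float32"),
}

def average_vector(words, vectors, num_features):
    """Average the vectors of the words that are present in `vectors`."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        # No known words: fall back to a zero vector
        return np.zeros(num_features, dtype="float32")
    return np.mean(known, axis=0)

reviews = [["great", "movie"], ["great", "unknownword"], ["movie"]]
features = np.vstack([average_vector(r, toy_model, 3) for r in reviews])

print(features.shape)  # (3, 3): one row per review, one column per feature
print(features[0])     # [2. 1. 1.]
```

Note that the second review is averaged over one word only, because "unknownword" is not in the toy vocabulary.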

# ****************************************************************
# Calculate average feature vectors for training and testing sets,
# using the functions we defined above. Notice that we now use stop word
# removal.

clean_train_reviews = []
for review in train["review"]:
    clean_train_reviews.append( review_to_wordlist( review, \
        remove_stopwords=True ))

trainDataVecs = getAvgFeatureVecs( clean_train_reviews, model, num_features )

print("Creating average feature vecs for test reviews")
clean_test_reviews = []
for review in test["review"]:
    clean_test_reviews.append( review_to_wordlist( review, \
        remove_stopwords=True ))

testDataVecs = getAvgFeatureVecs( clean_test_reviews, model, num_features )

Using the averaging functions defined above, we built trainDataVecs and testDataVecs.

# Fit a random forest to the training data, using 100 trees
import pandas as pd  # used below to write the results
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier( n_estimators = 100 )

print("Fitting a random forest to labeled training data...")
forest = forest.fit( trainDataVecs, train["sentiment"] )

# Test & extract results 
result = forest.predict( testDataVecs )

# Write the test results 
output = pd.DataFrame( data={"id":test["id"], "sentiment":result} )
output.to_csv( "Word2Vec_AverageVectors.csv", index=False, quoting=3 )

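We trained a random forest on the average vectors. Only the labeled data can be used here.
(A label is the target variable for each data point; in this task, it is whether the sentiment is positive or negative.)

Only labeled data is used because we are training a supervised learning model. A supervised learning algorithm learns the relationship between the input features and the target variable (the label) and uses it to predict labels for new data, so labeled data is required for training. (Unsupervised learning algorithms, by contrast, learn from data without a target variable and try to discover patterns or structure in it.)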

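On the test set, this did better than assigning positive/negative at random, but it underperformed Bag of Words by a few percentage points.

Since the element-wise average of the word vectors did not meaningfully improve the model, a smarter approach is needed. A standard way to weight word vectors is to apply "tf-idf" weights, which measure how important a given word is within a given set of documents. In Python, tf-idf weights can be extracted with scikit-learn's TfidfVectorizer. (However, the tutorial reports that weighting the word vectors this way did not improve performance much.)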

What are tf-idf weights?

"tf-idf" stands for "Term Frequency-Inverse Document Frequency". It is a measure of how important a word is to a particular document within a collection of documents. "Term Frequency" is how often the word appears in the document, and "Inverse Document Frequency" reflects how rare the word is across the whole collection. Multiplying the two means that very common words, despite their high frequency, receive low weights, while distinctive words that appear in only a few documents receive higher weights.

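To make the tf-idf definition concrete, here is a minimal sketch using the textbook formulation, tf × log(N / df). The three toy documents and the `idf` and `tf_idf` helpers are hypothetical, and scikit-learn's TfidfVectorizer applies smoothing and normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Three tiny hypothetical "documents", already tokenized
docs = [
    ["the", "movie", "was", "great"],
    ["the", "movie", "was", "awful"],
    ["the", "plot", "was", "great"],
]

def idf(term, docs):
    """Textbook inverse document frequency: log(N / df)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    """Relative frequency of `term` in `doc` times its idf over `docs`."""
    tf = Counter(doc)[term] / len(doc)
    return tf * idf(term, docs)

print(tf_idf("the", docs[1], docs))    # 0.0: "the" appears in every document
print(tf_idf("awful", docs[1], docs))  # positive: "awful" appears in one document
```

A word such as "the" gets zero weight even though it is frequent, while a word that appears in a single document gets the highest weight.
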
From Words to Paragraphs, Attempt 2: Clustering

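Word2Vec creates clusters of semantically related words, so another approach is to exploit the similarity of words within a cluster. Grouping vectors this way is known as "vector quantization". To do it, we first need to find the centers of the word clusters, which can be done with a clustering algorithm such as K-Means.

In K-Means, the one parameter we need to set is "K", the number of clusters. By trial and error, small clusters averaging about 5 words each gave better results than large clusters containing many words.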

How K-Means works

Initialization of centroids: The algorithm first initializes K centroids, where K is the number of clusters specified by the user. The initial centroids are either placed at random or chosen from among the data points.
Assignment step: Each data point is assigned to its nearest centroid, measured by Euclidean distance or another distance metric.
Update step: Each centroid is recomputed as the mean of all data points assigned to its cluster.
Iteration: The assignment and update steps are repeated until the centroids converge, at which point the algorithm terminates.
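
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration of plain K-Means (Lloyd's algorithm) on toy 2-D points, not the scikit-learn implementation used in this post; the `kmeans` function, its parameters, and the toy data are all hypothetical.

```python
import numpy as np

def kmeans(points, k, num_iters=100, seed=0):
    """Plain K-Means (Lloyd's algorithm) on an (n, d) array."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(num_iters):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points
        # (a centroid with no points is left where it is)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy groups of 2-D points
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(points, k=2)
print(labels)
```

Which group receives index 0 depends on the random initialization, so only the grouping itself is meaningful, not the label values.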

(A more detailed explanation of the K-Means algorithm: https://velog.io/@jhlee508/%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-K-%ED%8F%89%EA%B7%A0K-Means-%EC%95%8C%EA%B3%A0%EB%A6%AC%EC%A6%98)

Below is the clustering code, which runs K-Means with scikit-learn.

from sklearn.cluster import KMeans

# Set "k" (num_clusters) to be 1/5th of the vocabulary size, or an
# average of 5 words per cluster
word_vectors = model.syn0
# Integer division: KMeans requires n_clusters to be an int
num_clusters = word_vectors.shape[0] // 5

# Initialize a k-means object and use it to extract centroids
kmeans_clustering = KMeans( n_clusters = num_clusters )
idx = kmeans_clustering.fit_predict( word_vectors )

The cluster assignment of each word is stored in idx, and the vocabulary of the Word2Vec model is stored in model.index2word, so we zip the two into a single dictionary.

word_centroid_map = dict(zip( model.index2word, idx ))
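
One way to sanity-check such a word-to-cluster dictionary is to list the words in a few clusters and see whether they look related. The sketch below uses a hypothetical `toy_centroid_map` in place of the real word_centroid_map, and `words_in_cluster` is an illustrative helper, not part of the tutorial code.

```python
# Hypothetical stand-in for word_centroid_map: word -> cluster index
toy_centroid_map = {"good": 0, "great": 0, "bad": 1, "awful": 1, "film": 2}

def words_in_cluster(word_centroid_map, cluster):
    """Return the sorted list of words assigned to the given cluster index."""
    return sorted(w for w, c in word_centroid_map.items() if c == cluster)

for cluster in range(3):
    print("Cluster %d: %s" % (cluster, words_in_cluster(toy_centroid_map, cluster)))
```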

Since we now know which cluster (centroid) each word is assigned to, we can write a function that converts a review into a bag of centroids.

def create_bag_of_centroids( wordlist, word_centroid_map ):
    
    # The number of clusters is equal to the highest cluster index
    # in the word / centroid map
    num_centroids = max( word_centroid_map.values() ) + 1
    
    # Pre-allocate the bag of centroids vector (for speed)
    bag_of_centroids = np.zeros( num_centroids, dtype="float32" )
    
    # Loop over the words in the review. If the word is in the vocabulary,
    # find which cluster it belongs to, and increment that cluster count 
    # by one
    for word in wordlist:
        if word in word_centroid_map:
            index = word_centroid_map[word]
            bag_of_centroids[index] += 1
            # If the word belongs to a cluster, increment the count at that cluster's index
    
    # Return the "bag of centroids"
    return bag_of_centroids

This function takes a review and represents it as a vector with one dimension per cluster.
This lets us represent a document as a vector based on clusters rather than individual words.

# Pre-allocate an array for the training set bags of centroids (for speed)
train_centroids = np.zeros((train["review"].size, num_clusters), dtype="float32")

# Transform the training set reviews into bags of centroids
counter = 0
for review in clean_train_reviews:
    train_centroids[counter] = create_bag_of_centroids(review, word_centroid_map)
    counter += 1

# Repeat for the test set reviews
test_centroids = np.zeros((test["review"].size, num_clusters), dtype="float32")

counter = 0
for review in clean_test_reviews:
    test_centroids[counter] = create_bag_of_centroids(review, word_centroid_map)
    counter += 1

# Fit a random forest and extract predictions
forest = RandomForestClassifier(n_estimators=100)

# Fitting the forest may take a few minutes
print("Fitting a random forest to labeled training data...")
forest = forest.fit(train_centroids, train["sentiment"])
result = forest.predict(test_centroids)

# Write the test results
output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
output.to_csv("BagOfCentroids.csv", index=False, quoting=3)

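Each word is assigned to exactly one centroid. So when a review made up of many words is passed to create_bag_of_centroids, the result is a vector in which each entry counts how many words of the review belong to that centroid's cluster.
The train_centroids matrix is built by running this over every review, so each row holds the centroid counts for one review; the number of rows equals the number of reviews, and the number of columns equals the number of centroids, that is, the number of clusters.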
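
A small worked example may help show that the entries are counts rather than 0/1 flags. The `toy_centroid_map` and the `bag_of_centroids` helper below are hypothetical; the helper mirrors the logic of create_bag_of_centroids but takes the number of centroids as an explicit argument.

```python
import numpy as np

# Hypothetical word -> cluster index map with 3 clusters
toy_centroid_map = {"good": 0, "great": 0, "bad": 1, "film": 2}

def bag_of_centroids(wordlist, word_centroid_map, num_centroids):
    """Count how many words of the review fall into each cluster."""
    bag = np.zeros(num_centroids, dtype="float32")
    for word in wordlist:
        if word in word_centroid_map:
            bag[word_centroid_map[word]] += 1
    return bag

review = ["good", "great", "film", "unseen"]
print(bag_of_centroids(review, toy_centroid_map, 3))  # [2. 0. 1.]
```

Cluster 0 gets a count of 2 because both "good" and "great" map to it, and "unseen" is skipped because it is not in the map.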

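Part 4 Comparing Deep and Non-Deep Learning Methods

Comparing deep and shallow learning methods

Bag of Words from Part 1 and Word2Vec from Parts 2 and 3 are both learning methods that do not use deep neural networks.
In fact, comparing the performance of Bag of Words and Word2Vec shows no large difference.
The main reason is that averaging the vectors and using centroids both discard word order, which makes the Word2Vec approach very similar in concept to Bag of Words.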
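
The loss of word order is easy to see on a toy example: two word lists with opposite meanings but the same words produce the same average vector. The 2-dimensional `toy_vectors` below are made-up values for illustration only, not output from a trained model.

```python
import numpy as np

# Made-up 2-D word vectors, for illustration only
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "not": np.array([0.0, 1.0]),
    "bad": np.array([-1.0, 0.0]),
}

def average_vector(words, vectors):
    """Element-wise mean of the vectors of the given words."""
    return np.mean([vectors[w] for w in words], axis=0)

a = ["good", "not", "bad"]  # "good, not bad"
b = ["bad", "not", "good"]  # "bad, not good"

same = np.allclose(average_vector(a, toy_vectors), average_vector(b, toy_vectors))
print(same)  # True: averaging cannot tell the two orderings apart
```

The same holds for a bag of centroids, since it only counts cluster memberships.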

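Things to try to improve the model:

First, train Word2Vec on more text.
Google's results are based on word vectors learned from a corpus of more than a billion words, whereas the labeled and unlabeled training sets in this example total only about 18 million words.

Second, use a distributed word vector technique, a deep learning approach.
One reason distributed vector methods do better than the approaches tried here is that vector averaging and clustering ignore word order, while Paragraph Vector preserves word order information. Because these methods use a neural network architecture to learn the meaning of each paragraph or document, they are regarded as deep learning methods.

(Reference: https://www.kaggle.com/competitions/word2vec-nlp-tutorial/overview)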
