Stopword Removal

애늙은이 · August 17, 2023

NLP 여행기 (NLP Travelogue)

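A corpus holds a large volume of language data in which meaningful and uninformative words are mixed together. Words such as "a" and "the", for example, occur very often, but as articles they carry little meaning of their own. Removing such words from the corpus makes later processing more efficient.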

🤔 What Are Stopwords?

Stopwords are words, like the "a" and "the" mentioned above, that appear frequently but carry little meaning. Removing them reduces the amount of data that has to be analyzed.
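
To make the "frequent but uninformative" idea concrete, here is a minimal sketch that counts word frequencies in a short passage. The sample sentence and the plain whitespace split are assumptions made for illustration; they are not part of NLTK.

```python
from collections import Counter

# Hypothetical sample text, chosen only for illustration.
sample = "the cat sat on the mat and the dog sat on the rug"

# Naive whitespace tokenization; enough for this lowercase, punctuation-free sample.
tokens = sample.split()

# Count how often each token occurs and look at the most frequent ones.
counts = Counter(tokens)
top_three = counts.most_common(3)
print(top_three)  # [('the', 4), ('sat', 2), ('on', 2)]
```

Even in this tiny sample the article "the" is the most frequent token, although it says nothing about what the text is about.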

NLTK provides stopwords through its stopwords corpus.

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')

stopword_lst = stopwords.words('english') # The English stopword list.
print(len(stopword_lst)) # 179 in the author's run; the count may differ with other versions of the NLTK data.
print(stopword_lst)

As the code above shows, NLTK ships with its own stopword list.

✂ Removing Stopwords

You can remove stopwords using the list that NLTK provides.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

stopword_lst = stopwords.words('english')

text = "Everything that irritates us about others can lead us to an understanding about ourselves."

text = text.lower() # Lowercase everything; the NLTK stopword list is all lowercase.
tokenized_text = word_tokenize(text) # Needs the NLTK tokenizer data, e.g. nltk.download('punkt').

result = []
for token in tokenized_text:
    if token not in stopword_lst:
        result.append(token)

print(result)


# Output
# ['everything', 'irritates', 'us', 'others', 'lead', 'us', 'understanding', '.']

This code uses a for loop and an if statement to skip every token that appears in NLTK's stopword list, which leaves a list of tokens with the stopwords removed.
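
The same filtering can also be written as a list comprehension over a set, which gives average constant-time membership checks instead of scanning a list. The sketch below additionally shows why lowercasing matters. The small stopword set and the token list are hypothetical stand-ins, so that the example runs without NLTK data.

```python
# Hypothetical mini stopword set standing in for stopwords.words('english').
stopword_set = {"that", "about", "can", "to", "an"}

# Hand-written tokens; note the capitalized "About".
tokens = ["Everything", "that", "irritates", "us", "About", "others"]

# Case-sensitive comparison: "About" slips through because the set is lowercase.
case_sensitive = [t for t in tokens if t not in stopword_set]

# Lowercasing each token before the lookup removes it as expected.
case_insensitive = [t.lower() for t in tokens if t.lower() not in stopword_set]

print(case_sensitive)    # ['Everything', 'irritates', 'us', 'About', 'others']
print(case_insensitive)  # ['everything', 'irritates', 'us', 'others']
```

With the real NLTK list, wrapping it once as set(stopwords.words('english')) would give the same result as the loop above while avoiding a list scan per token.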

To add or exclude specific stopwords, modify the stopword list with the usual list methods.

from nltk.corpus import stopwords

stopword_lst = stopwords.words('english')

stopword_lst.append('hey') # Add 'hey' to the stopword list.
stopword_lst.remove('a') # Remove 'a' from the stopword list.

print(stopword_lst)
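
As an alternative to mutating the list in place, the same customization can be sketched with set operations. Note that list.remove raises ValueError when the word is not present, whereas set difference simply ignores missing words. The base list here is a hypothetical stand-in for the NLTK list.

```python
# Hypothetical base list standing in for stopwords.words('english').
base_stopwords = ["a", "an", "the", "is"]

# Build a customized copy: add 'hey', drop 'a'. The base list is left untouched.
custom_stopwords = (set(base_stopwords) | {"hey"}) - {"a"}

# Removing a word that is not in the set is harmless with set difference.
custom_stopwords = custom_stopwords - {"not-a-stopword"}

print(sorted(custom_stopwords))  # ['an', 'hey', 'is', 'the']
print(base_stopwords)            # ['a', 'an', 'the', 'is']
```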

Alternatively, you can build your own stopword list.

from nltk.tokenize import word_tokenize

stopword_lst = ['a', 'an', 'the']

text = "You can stand tall without standing on someone. You can be a victor without having victims."
text = text.lower()
tokenized_text = word_tokenize(text)

result = []
for token in tokenized_text:
    if token not in stopword_lst:
        result.append(token)

print(result)


# Output
# ['you', 'can', 'stand', 'tall', 'without', 'standing', 'on', 'someone', '.', 'you', 'can', 'be', 'victor', 'without', 'having', 'victims', '.']

🔥 Using Stopword Removal

Stopword removal can be combined with the lemmatization and stemming covered in earlier posts.

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

def stem_text(tokens):
    """
    Take tokenized text and return the stem of each token.
    """
    stemmer = PorterStemmer()
    result = []

    for token in tokens:
        result.append(stemmer.stem(token))

    return result

def lemmatize_text(tagged_text):
    """
    Take POS-tagged (token, tag) pairs and lemmatize each token according to its part of speech.
    """
    lemmatizer = WordNetLemmatizer()
    result = []

    for token, tag in tagged_text:
        # Map the Penn Treebank tag prefix to the WordNet POS code.
        if tag.startswith('N'):
            lemma = lemmatizer.lemmatize(token, pos='n')
        elif tag.startswith('V'):
            lemma = lemmatizer.lemmatize(token, pos='v')
        elif tag.startswith('J'):
            lemma = lemmatizer.lemmatize(token, pos='a')
        elif tag.startswith('R'):
            lemma = lemmatizer.lemmatize(token, pos='r')
        else:
            lemma = lemmatizer.lemmatize(token)

        result.append(lemma)

    return result

def remove_stopword(tokenized_text, stopword_list):
    """
    Take tokenized text and a stopword list, and return the tokens with the stopwords removed.
    """
    result = []

    for token in tokenized_text:
        if token not in stopword_list:
            result.append(token)

    return result
    
    
text = "The height of your accomplishments will equal the depth of your convictions."
text = text.lower()

tokens = word_tokenize(text) # Tokenize the text.
stopword_lst = stopwords.words('english') # Set the stopwords.

removed_tokens = remove_stopword(tokens, stopword_lst) # Remove the stopwords.
pos_tags = pos_tag(removed_tokens) # Tag parts of speech for lemmatization (needs the NLTK tagger data).

stem_result = stem_text(removed_tokens) # Run stemming.
lemma_result = lemmatize_text(pos_tags) # Run lemmatization (needs the NLTK 'wordnet' data).

print(f"Original: {text}")
print(f"Stemming result: {stem_result}")
print(f"Lemmatization result: {lemma_result}")


# Output
# Original: the height of your accomplishments will equal the depth of your convictions.
# Stemming result: ['height', 'accomplish', 'equal', 'depth', 'convict', '.']
# Lemmatization result: ['height', 'accomplishment', 'equal', 'depth', 'conviction', '.']