[Python Machine Learning Complete Guide] ch8.2 Text Preprocessing

반소희 · July 22, 2022

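ch8.2 Text Normalization

What is text normalization?
Text normalization means applying preprocessing steps such as the ones below to text data so that it can be used as input to an NLP application.
Text normalization tasks

Cleansing
Tokenization
Filtering / stop word removal / spelling correction
Stemming
Lemmatization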

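ch8.2 Cleansing

What is cleansing?
Removing unnecessary characters, symbols, and similar noise that get in the way of analysis before the text is processed
Examples: HTML and XML tags, specific symbols, etc.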
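
As a rough illustration of cleansing, the sketch below strips anything that looks like an HTML/XML tag and decodes character entities using only the standard library. The `cleanse` helper and the sample string are hypothetical and not from the book; a regex like this is only adequate for simple markup, and a proper HTML parser is the safer choice for real pages.

```python
import re
from html import unescape


def cleanse(text: str) -> str:
    """Remove tag-like spans and collapse whitespace (hypothetical helper)."""
    no_tags = re.sub(r"<[^>]+>", " ", text)  # drop anything shaped like <...>
    decoded = unescape(no_tags)              # turn entities such as &amp; into characters
    return re.sub(r"\s+", " ", decoded).strip()


raw = "<p>The <b>Matrix</b> is everywhere &amp; all around us.</p>"
cleaned = cleanse(raw)
print(cleaned)  # The Matrix is everywhere & all around us.
```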

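ch8.2 Text Tokenization

What is text tokenization?
It is divided into sentence tokenization, which splits a document into sentences, and word tokenization, which splits text into word tokens

Sentence tokenization (sent_tokenize)
Sentence tokenization is generally used when the semantic meaning of each sentence is an important factor
It splits text on symbols that mark the end of a sentence, such as a period (.) or a newline character (\n)
Sentence tokenization based on regular expressions is also possible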

import nltk
from nltk import sent_tokenize

nltk.download('punkt')  # tokenizer data; recent NLTK releases may ask for 'punkt_tab' instead

text_sample = 'The Matrix is everywhere its all around us, here even in this room. \
               You can see it out your window or on your television. \
               You feel it when you go to work, or go to church or pay your taxes.'

sentences = sent_tokenize(text=text_sample)  # sent_tokenize returns a list with one string per sentence
print(type(sentences),len(sentences))
print(sentences)

<class 'list'> 3
['The Matrix is everywhere its all around us, here even in this room.', 'You can see it out your window or on your television.', 'You feel it when you go to work, or go to church or pay your taxes.']
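
The notes mention that sentence tokenization can also be done with regular expressions. A minimal sketch of that idea, assuming sentences end in `.`, `!`, or `?` followed by whitespace (the `regex_sent_tokenize` name is made up for this example):

```python
import re


def regex_sent_tokenize(text: str) -> list[str]:
    """Split after ., ! or ? when whitespace (including a newline) follows."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


sample = ("The Matrix is everywhere its all around us, here even in this room. "
          "You can see it out your window or on your television.\n"
          "You feel it when you go to work, or go to church or pay your taxes.")
regex_sentences = regex_sent_tokenize(sample)
print(len(regex_sentences))  # 3
```

This simple rule would also split after abbreviations such as "Dr.", which is one reason a trained tokenizer like the one behind `sent_tokenize` is usually preferable.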

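Word tokenization (word_tokenize)

What is word tokenization?
Splitting a sentence into word tokens
By default, words are separated on whitespace, commas (,), periods (.), newline characters (\n), and so on
Regular expressions can be used to tokenize in many other ways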

from nltk import word_tokenize

sentence = "The Matrix is everywhere its all around us, here even in this room."

words = word_tokenize(sentence)
print(type(words), len(words))
print(words)

<class 'list'> 15
['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.']
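
For comparison, here is a small regex-based word tokenizer, a sketch of the "regular expression" route mentioned above rather than what `word_tokenize` does internally. The pattern and the `regex_word_tokenize` name are assumptions for illustration.

```python
import re


def regex_word_tokenize(sentence: str) -> list[str]:
    # \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark
    return re.findall(r"\w+|[^\w\s]", sentence)


tokens = regex_word_tokenize(
    "The Matrix is everywhere its all around us, here even in this room."
)
print(len(tokens), tokens[-3:])  # 15 ['this', 'room', '.']
```

On this sentence the result matches the NLTK output, but the two diverge on contractions: this pattern breaks "don't" into three pieces, whereas `word_tokenize` splits it into `do` and `n't`.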

Example: combining sentence and word tokenization to tokenize every word in a document

## sample document to tokenize
text_sample = 'The Matrix is everywhere its all around us, here even in this room. \
               You can see it out your window or on your television. \
               You feel it when you go to work, or go to church or pay your taxes.'

from nltk import word_tokenize, sent_tokenize

def tokenize_text(text):
    
    ## sentence tokenization
    sentences = sent_tokenize(text)
    ## word tokenization, one token list per sentence
    word_tokens = [word_tokenize(sentence) for sentence in sentences]

    return word_tokens

## pass in the document and run sentence/word tokenization
word_tokens = tokenize_text(text_sample)

## print the result
print(type(word_tokens),len(word_tokens))
print(word_tokens)

<class 'list'> 3
[['The', 'Matrix', 'is', 'everywhere', 'its', 'all', 'around', 'us', ',', 'here', 'even', 'in', 'this', 'room', '.'], ['You', 'can', 'see', 'it', 'out', 'your', 'window', 'or', 'on', 'your', 'television', '.'], ['You', 'feel', 'it', 'when', 'you', 'go', 'to', 'work', ',', 'or', 'go', 'to', 'church', 'or', 'pay', 'your', 'taxes', '.']]

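However, when sentences are tokenized word by word as above, contextual meaning is inevitably lost. The n-gram was introduced to mitigate this problem.

n-gram
Treats n consecutive words as a single tokenization unit
Builds a window n words wide and tokenizes by sliding it rightward from the start of the sentence
n-gram example
- bigram (2-gram)
  Tokenizes by moving through the sentence two consecutive words at a time
  For example, the sentence 'I like pizza' is tokenized as (I, like), (like, pizza)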
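
The sliding window can be sketched in a few lines of plain Python. `make_ngrams` is a hypothetical helper written for this note; NLTK ships a comparable utility as `nltk.util.ngrams`.

```python
def make_ngrams(tokens, n=2):
    """Slide a window of n tokens from left to right (hypothetical helper)."""
    # zip over n shifted copies of the token list; it stops when the shortest copy runs out
    return list(zip(*(tokens[i:] for i in range(n))))


bigrams = make_ngrams("I like pizza".split(), n=2)
print(bigrams)  # [('I', 'like'), ('like', 'pizza')]
```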

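ch8.2 Stop Word Removal

What is a stop word?
- A word that carries little meaning for analysis
  (e.g., is, a, the, will: words that are grammatically required to form a sentence but contribute little to its meaning)
- If these words are not removed beforehand, their sheer frequency can cause them to be treated as important words
- NLTK provides stop word lists for each language
Downloading the per-language stop word lists

## download the per-language stop word lists provided by NLTK
import nltk
nltk.download('stopwords')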

Checking the number of English stop words

## check how many English stop words there are
print('Number of English stop words:',len(nltk.corpus.stopwords.words('english')))
print(nltk.corpus.stopwords.words('english')[:20])

Number of English stop words: 179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']

Removing stop words

## remove stop words from the per-sentence word_tokens list, keeping only the words that matter for analysis
stopwords = set(nltk.corpus.stopwords.words('english'))  # a set makes the membership test fast
all_tokens = []

for sentence in word_tokens: # iterate over each of the 3 sentences
    filtered_words = []
    
    for word in sentence:
        
        # uppercase -> lowercase (the stop word list is lowercase)
        word = word.lower()
        
        # keep the word only if it is not a stop word
        if word not in stopwords:
            filtered_words.append(word)
    
    all_tokens.append(filtered_words)
    
print(all_tokens)

[['matrix', 'everywhere', 'around', 'us', ',', 'even', 'room', '.'], ['see', 'window', 'television', '.'], ['feel', 'go', 'work', ',', 'go', 'church', 'pay', 'taxes', '.']]
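
To see why frequent stop words can end up looking important, the sketch below counts word frequencies on the same sample text before and after filtering. It uses a tiny hand-written stop list (an assumption for illustration, far shorter than the NLTK list) so that it runs without any downloads.

```python
import re
from collections import Counter

text = ("The Matrix is everywhere its all around us, here even in this room. "
        "You can see it out your window or on your television. "
        "You feel it when you go to work, or go to church or pay your taxes.")

# Tiny illustrative stop list (assumption; not the NLTK list)
toy_stopwords = {"the", "is", "its", "all", "here", "in", "this", "you", "can",
                 "it", "out", "your", "or", "on", "when", "to"}

words = re.findall(r"[a-z]+", text.lower())
before = Counter(words)
after = Counter(w for w in words if w not in toy_stopwords)

print(before.most_common(3))  # dominated by 'you', 'your', 'or'
print(after.most_common(1))   # [('go', 2)]
```

Before filtering, the three most frequent words are all function words; after filtering, a content word rises to the top.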

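ch8.2 Stemming and Lemmatization

In many languages, the form of a word changes with grammatical features such as past/present tense, third-person singular, and progressive aspect. Stemming and lemmatization are both ways of finding the base form of a word that has changed grammatically or semantically.

Stemming
- Applies general, more simplified rules when reducing a word to its base form
- Tends to produce stems in which part of the original spelling has been cut off
Lemmatization
- More sophisticated than stemming; finds the base form on a semantic basis
- Takes grammatical features such as part of speech and the word's meaning into account to find a correctly spelled base form
- Takes longer to run
Stemming example in NLTK (LancasterStemmer)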

from nltk.stem import LancasterStemmer
stemmer = LancasterStemmer()

print(stemmer.stem('working'),stemmer.stem('works'),stemmer.stem('worked'))
print(stemmer.stem('amusing'),stemmer.stem('amuses'),stemmer.stem('amused'))
print(stemmer.stem('happier'),stemmer.stem('happiest'))
print(stemmer.stem('fancier'),stemmer.stem('fanciest'))

work work work
amus amus amus
happy happiest
fant fanciest
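
Stems such as `amus` appear because a stemmer strips suffixes by rule without checking whether the result is a real word. The toy function below is a deliberately crude stand-in that shows the effect; it is not the Lancaster algorithm, and its suffix list is an assumption made for this example.

```python
def toy_stem(word: str) -> str:
    """Strip the first matching suffix; a crude illustration, not a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # keep at least 3 characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = [toy_stem(w) for w in ("working", "works", "worked",
                               "amusing", "amuses", "amused")]
print(stems)  # ['work', 'work', 'work', 'amus', 'amus', 'amus']
```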

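Lemmatization example in NLTK (WordNetLemmatizer)
- For more accurate results, the part of speech should be passed along with the word (e.g., v for verbs, a for adjectives); without it, WordNetLemmatizer treats the word as a noun
- The output below shows that it recovers base forms more accurately than stemming

import nltk
from nltk.stem import WordNetLemmatizer

# WordNet data used by the lemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemma = WordNetLemmatizer()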

print(lemma.lemmatize('amusing', 'v'),lemma.lemmatize('amuses', 'v'),lemma.lemmatize('amused', 'v'))
print(lemma.lemmatize('happier', 'a'),lemma.lemmatize('happiest', 'a'))
print(lemma.lemmatize('fancier', 'a'),lemma.lemmatize('fanciest', 'a'))

amuse amuse amuse
happy happy
fancy fancy
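
Why the part of speech matters can be shown with a toy lookup table keyed by (word, part of speech). The table and the `toy_lemmatize` helper are hypothetical; a real lemmatizer relies on a full dictionary such as WordNet plus morphological rules.

```python
# Hypothetical miniature lemma table keyed by (word, part of speech)
toy_lemmas = {
    ("saw", "v"): "see",      # "I saw it" -> verb
    ("saw", "n"): "saw",      # "a rusty saw" -> noun
    ("amused", "v"): "amuse",
    ("happier", "a"): "happy",
}


def toy_lemmatize(word: str, pos: str) -> str:
    # Fall back to the word itself when the (word, pos) pair is unknown
    return toy_lemmas.get((word, pos), word)


print(toy_lemmatize("saw", "v"), toy_lemmatize("saw", "n"))  # see saw
```

The same surface form maps to different base forms depending on the tag, which is why supplying the correct part of speech improves the result.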

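Summary

Text analysis proceeds in three stages: 1) text preprocessing, 2) feature vectorization/extraction, and 3) building a machine learning model and training, predicting, and evaluating with it. This post covered text preprocessing.

Text normalization means applying various preprocessing steps to text data so that it can be used as input to an NLP application. The steps are 1) cleansing, 2) tokenization, 3) stop word removal, and 4) stemming and lemmatization.

1) Cleansing removes tags and other elements that get in the way of analysis, and 2) tokenization splits text into sentences and words. Tokenization comes in two kinds, sentence tokenization and word tokenization; unless the semantic meaning of each sentence is an important factor, usually only word tokenization is performed. Tokenizing word by word can lose contextual meaning, and the n-gram was introduced to mitigate this. 3) Stop word removal drops words that carry little meaning for analysis, such as grammatical function words, and 4) stemming and lemmatization find the base form of words that have changed grammatically or semantically.

The next post will cover how feature values are extracted by assigning values to text data preprocessed in this way.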
