Deep Learning (AI Study 44)

이유진 · July 8, 2024

--26.텍스트 전처리.ipynb--

Text Preprocessing

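Natural language is the language we use in everyday life. Natural language processing (NLP) is the work of analyzing the meaning of natural language so that a computer can process it.

NLP is used in speech recognition, summarization, translation, sentiment analysis, text classification (spam filtering, news category classification), question answering systems, and chatbots.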

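Tokenization

Natural language data -> tokenization & cleaning -> normalization

Tokenization is the task of splitting a corpus into units called tokens.

Word Tokenization

NLTK: a natural language preprocessing package for English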

from nltk.tokenize import word_tokenize

import nltk
nltk.download('punkt') # newer NLTK releases may ask for 'punkt_tab' instead

sentence = "Don't be fooled by the dark sounding name, Mr. Jone's Orphanage is as cheery as cheery goes for a pastry shop."

word_tokenize(sentence)

from nltk.tokenize import WordPunctTokenizer

WordPunctTokenizer().tokenize(sentence)

from tensorflow.keras.preprocessing.text import text_to_word_sequence

sentence

text_to_word_sequence(sentence)

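Tokenization is not just stripping punctuation and splitting on whitespace!

Tokens such as the following contain punctuation that has to be kept: Ph.D, AT&T, $45.55, 123,456,789, ...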
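To make the point concrete, here is a minimal sketch using only the standard library. The sample sentence is made up for illustration, and neither approach is what NLTK does internally; the sketch only shows what is lost when punctuation is stripped blindly.

```python
import re

# Illustrative sentence built from the tricky tokens listed above.
text = "Ph.D students at AT&T paid $45.55 for 123,456,789 items."

# Naive approach: delete every character that is not a word character
# or whitespace, then split on whitespace.
naive_tokens = re.sub(r"[^\w\s]", "", text).split()

# Plain whitespace split: punctuation inside each token survives,
# but the final period stays glued to the last word.
whitespace_tokens = text.split()

print(naive_tokens)
print(whitespace_tokens)
```

The naive version turns `$45.55` into `4555` and `AT&T` into `ATT`, while the whitespace version leaves `items.` with its period attached. A real tokenizer needs rules that sit between these two extremes.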

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
sentence = "Starting a home-based restaurant may be an ideal. it doesn't have a food chain or restaurant of their own."

print(tokenizer.tokenize(sentence))

Sentence Tokenization

What marks the end of a sentence? Splitting on . ? ! ... alone is unreliable.

ex) Log in to the server at IP 192.168.56.31, save the log file, and send the results to aaa@gmail.com. Then let's go get lunch.
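The following sketch shows why splitting on every period fails for a sentence like the example above. The second rule (a terminator followed by whitespace) is only an assumption chosen for illustration; it is not how `sent_tokenize` works, and it would still break on abbreviations such as "Ph.D. students".

```python
import re

text = ("Log in to the server at IP 192.168.56.31, save the log file, "
        "and send the results to aaa@gmail.com. Then let's go get lunch.")

# Naive rule: every period ends a sentence.
naive = [s.strip() for s in text.split(".") if s.strip()]

# Slightly better rule: a sentence ends at . ? or ! only when that
# character is followed by whitespace.
better = [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]

print(len(naive), naive)
print(len(better), better)
```

The naive rule cuts the IP address and the e-mail address into pieces, while the whitespace-aware rule returns the two intended sentences for this input.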

from nltk.tokenize import sent_tokenize

text = "His barber kept his word. But keeping such a huge secret to himself was driving him crazy. Finally, the barber went up a mountain and almost to the edge of a cliff. He dug a hole in the midst of some reeds. He looked about, to make sure no one was near."
text

sent_tokenize(text)

text = "I am actively looking for Ph.D. students. and you are a Ph.D student."

sent_tokenize(text)

A sentence tokenization tool for Korean

KSS

! pip install kss

import kss

text = '딥 러닝 자연어 처리가 재미있기는 합니다. 그런데 문제는 영어보다 한국어로 할 때 너무 어렵습니다. 이제 해보면 알걸요?'

kss.split_sentences(text)

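Why Korean is hard to tokenize

Reason 1: Korean is an agglutinative language, with particles (조사), inflected endings, and so on.

'그가', '그에게', '그를', '그와', '그는' ... (the same word '그' with different particles attached)

So analyzing Korean requires separating particles and endings from the words they attach to.

That is why the concept of the morpheme is important in Korean tokenization.

Morpheme: the smallest unit of language that carries meaning

Free morpheme: a morpheme that can be used on its own, regardless of affixes, endings, or particles. It is a word by itself. Examples are substantives (nouns, pronouns, numerals), modifiers (determiners, adverbs), and interjections.

Bound morpheme: a morpheme that is used in combination with other morphemes. Affixes, endings, particles, and stems belong here.

"에디가 책을 읽었다." ("Eddie read a book.")

Tokenizing on whitespace only -> "에디가", "책을", "읽었다."

Splitting into morphemes

Free morphemes: 에디, 책

Bound morphemes: -가, -을, 읽-, -었-, -다

Reason 2: Spacing rules are followed less consistently in Korean than in English.

EX1) 제가이렇게띄어쓰기를전혀하지않고글을썼다고하더라도글을이해할수있습니다. (Korean written with no spaces at all is still readable.)

EX2) Tobeornottobethatisthequestion (The same is much harder to read in English.)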

Part-of-Speech (POS) Tagging

fly: to fly (verb) / a fly (noun)

못: a nail, as in hammer and nail (noun) / cannot (adverb)

from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

nltk.download('averaged_perceptron_tagger')

text = "I am actively looking for Ph.D. students. and you are a Ph.D student."
text

tokenized_sentence = word_tokenize(text)

tokenized_sentence

pos_tag(tokenized_sentence) # POS tagging

Korean Morphological Analysis

KoNLPy

!pip install konlpy

Korean morphological analyzers

KoNLPy bundles the following morphological analysis and tagging libraries so that they can be used easily from Python.

Hannanum: 한나눔. Developed by the KAIST Semantic Web Research Center.
- http://semanticweb.kaist.ac.kr/hannanum/
Kkma: 꼬꼬마. Developed by the IDS (Intelligent Data Systems) lab at Seoul National University.
- http://kkma.snu.ac.kr/
Komoran: 코모란. Developed by Shineware.
- https://github.com/shin285/KOMORAN
Mecab: 메카브. A Japanese morphological analyzer adapted for Korean.
- https://bitbucket.org/eunjeon/mecab-ko
Open Korean Text: an open-source Korean analyzer, formerly the Twitter morphological analyzer.
- https://github.com/open-korean-text/open-korean-text

from konlpy.tag import Okt
from konlpy.tag import Kkma

okt = Okt()
kkma = Kkma()

sentence = "열심히 코딩한 당신, 연휴에는 여행을 가봐요"

okt.morphs(sentence) # extract morphemes

okt.pos(sentence) # POS tagging

okt.nouns(sentence) # nouns only

print(kkma.morphs(sentence))
print(kkma.pos(sentence))
print(kkma.nouns(sentence))

sentence = "아버지가방에들어가신다"
print(okt.pos(sentence))
print(kkma.pos(sentence))

sentence = '그래욬ㅋㅋ'
okt.pos(sentence)

kkma.pos(sentence)

okt.pos(sentence, norm=True, stem=True) # stem=True reduces each token to its base (dictionary) form

okt.pos(sentence, norm=True) # norm=True normalizes informal spellings before tagging

Cleaning and Normalization

Rule-based merging of words with different spellings

Cleaning: removing noise data from the corpus at hand.

Normalization: merging words that are written differently into the same word.

USA, US <- these have the same meaning, so they can be normalized into a single word.

uh-huh, uhhuh, and so on

Stemming and lemmatization

Case folding

Automobile, automobile <- merge them if there is a need to

Do not merge unconditionally -> 'US' vs 'us'
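One way to fold case without merging pairs like 'US' and 'us' is to lowercase everything except an allowlist. This is only a sketch: the `KEEP_CASE` set and the `fold_case` helper are hypothetical names introduced here, and how such a list is built depends on the task.

```python
# Hypothetical allowlist of tokens whose capitalization carries meaning.
KEEP_CASE = {"US", "AT&T"}

def fold_case(tokens, keep_case=KEEP_CASE):
    """Lowercase every token except the ones in the allowlist."""
    return [t if t in keep_case else t.lower() for t in tokens]

tokens = ["Automobile", "automobile", "US", "us"]
print(fold_case(tokens))
```

Here 'Automobile' and 'automobile' collapse into one form, while 'US' (the country) stays distinct from 'us' (the pronoun).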

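Removing unnecessary words

noise data

Noise data means not only special characters but also words that do not serve the purpose of the analysis.

Methods: removing stop words, removing words by frequency of occurrence, removing short words, ...

In English, removing just the words that are two or three characters long or shorter already removes a lot of noise.

(This is not an absolute rule.)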

a, an, in, by, on ...

import re

text = "I was wondering if anyone out there could enlighten me on this car."
text

shortword = re.compile(r'\W*\b\w{1,2}\b') # words of 1-2 characters, plus any non-word characters right before them

shortword.sub('', text)

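Stemming and Lemmatization

Normalization techniques that reduce the number of distinct words in a corpus.

Words that look different but are really the same word -> generalized into one word -> fewer distinct words in the documents.

Lemma: the base dictionary form of a word

am, are, is ... => be (lemma)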

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

words = ['policy', 'doing', 'organization', 'have', 'going', 'love', 'lives', 'fly', 'dies', 'watched', 'has', 'starting']
words

nltk.download('wordnet')

[lemmatizer.lemmatize(word) for word in words]

lemmatizer.lemmatize("dies")

lemmatizer.lemmatize("dies", "v")

lemmatizer.lemmatize("watched")

lemmatizer.lemmatize("watched", "v")

lemmatizer.lemmatize("has")

lemmatizer.lemmatize("has", "v")

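Stemming

A simplified version of morphological analysis.

(Or) a heuristic that chops off word endings by looking only at fixed rules.

Stemming is not a precise operation. Its result may be a word that does not exist in the dictionary.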
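To illustrate what "chopping endings by fixed rules" means, here is a toy stemmer. It is not the Porter algorithm: the `SUFFIX_RULES` table and the `toy_stem` function are made up for this sketch, and the rules were picked only to keep the example short.

```python
# Toy rules, chosen only for illustration: (suffix, replacement), checked in order.
SUFFIX_RULES = [("ization", "ize"), ("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]

def toy_stem(word):
    """Apply the first matching suffix rule; leave the word alone otherwise."""
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters to remain after cutting the suffix.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

words = ["organization", "flies", "watched", "starting", "has"]
print([toy_stem(w) for w in words])
```

Results such as 'fli' and 'ha' are not dictionary words, which is exactly the imprecision described above: the rules look only at the spelling, never at the meaning or part of speech.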

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

sentence = "This was not the map we found in Billy Bones's chest, but an accurate copy, complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes."
tokenized_sentence = word_tokenize(sentence)

print(tokenized_sentence)

print([stemmer.stem(word) for word in tokenized_sentence])

words = ['formalize', 'allowance', 'electricical']

print('Before stemming:', words)
print('After stemming:', [stemmer.stem(word) for word in words])

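Stemming in Korean

Korean grammar has 5 word categories (언) and 9 parts of speech (품사)

Substantives (체언) - nouns, pronouns, numerals

Modifiers (수식언) - determiners, adverbs

Relational words (관계언) - particles

Independent words (독립언) - interjections

Predicates (용언) - verbs, adjectives <= a combination of a stem and an ending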

Stop Words

Removing tokens for words that carry little meaning for the analysis

i, my, me, over, ... These appear often but contribute very little to the actual analysis.

from nltk.corpus import stopwords

nltk.download('stopwords')

Checking the stop words in NLTK

stop_word_list = stopwords.words('english')
print(len(stop_word_list))
print(stop_word_list)

example = "Family is not an important thing. It's everything."
example

stop_words = set(stop_word_list)
word_tokens = word_tokenize(example)

result = []
for word in word_tokens:
    if word not in stop_words:
        result.append(word)

print('Before stop word removal:', word_tokens)
print('After stop word removal:', result)

example = "고기를 아무렇게나 구우려고 하면 안 돼. 고기라고 다 같은 게 아니거든. 예컨대 삼겹살을 구울 때는 중요한 게 있지."
stop_words = "를 아무렇게나 구 우려 고 안 돼 같은 게 구울 때 는"

stop_words = set(stop_words.split(' '))
word_tokens = okt.morphs(example)

result = [word for word in word_tokens if word not in stop_words]

print('Before stop word removal:', word_tokens)
print('After stop word removal:', result)

https://www.ranks.nl/stopwords/korean

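Integer Encoding

In NLP, text has to be converted into numbers, and there are various techniques for doing so.

One of them is to map each word to a unique integer.

ex) If the text has 5,000 distinct words -> assign unique integer indices from 1 to 5,000.

How to assign the indices

- They can be assigned randomly, but

- usually the words are sorted by frequency of occurrence first and then indexed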

raw_text = "A barber is a person. a barber is good person. a barber is huge person. he Knew A Secret! The Secret He Kept is huge secret. Huge secret. His barber kept his word. a barber kept his word. His barber kept his secret. But keeping and keeping such a huge secret to himself was driving the barber crazy. the barber went up a huge mountain."
raw_text

sentences = sent_tokenize(raw_text)

print(sentences)

vocab = {}
preprocessed_sentences = []
stop_words = set(stopwords.words('english'))

for sentence in sentences:
    tokenized_sentence = word_tokenize(sentence)
    result = []

    for word in tokenized_sentence:
        word = word.lower()  # lowercase every word to reduce the number of distinct words
        if word not in stop_words:  # remove stop words
            if len(word) > 2:  # remove words of 2 characters or fewer
                result.append(word)
                if word not in vocab:
                    vocab[word] = 0
                vocab[word] += 1

    preprocessed_sentences.append(result)

preprocessed_sentences

vocab

vocab_sorted = sorted(vocab.items(), key = lambda x: x[1], reverse = True)

vocab_sorted

Assign lower integer indices to words with higher frequency (1-based)

word_to_index = {}
i = 0
for (word, frequency) in vocab_sorted:
    if frequency > 1:  # drop words that appear only once
        i = i + 1
        word_to_index[word] = i

word_to_index

To use only the top n words by frequency

vocab_size = 5

Collect the words whose index is greater than 5

words_frequency = [word for word,index in word_to_index.items() if index >= vocab_size + 1]

Delete the index entries for those words

for w in words_frequency:
    del word_to_index[w]

print(word_to_index)

Words that are missing from the index --> Out-Of-Vocabulary (OOV: words that are not in the vocabulary)

Add 'OOV' to word_to_index so that words outside the vocabulary are encoded with the OOV index

word_to_index['OOV'] = len(word_to_index) + 1

print(word_to_index)

Use word_to_index to convert each word of each sentence into an integer

encoded_sentences = []
for sentence in preprocessed_sentences:
    encoded_sentence = []
    for word in sentence:
        try:
            encoded_sentence.append(word_to_index[word])  # word is in the vocabulary: use its index
        except KeyError:
            encoded_sentence.append(word_to_index['OOV'])  # word is not in the vocabulary: use the OOV index

    encoded_sentences.append(encoded_sentence)

print(preprocessed_sentences)
print(encoded_sentences)

The same work can also be done easily with Counter, FreqDist, enumerate, or the Keras Tokenizer.
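As a rough sketch of the Counter and enumerate route, the block below repeats the frequency-based indexing with the standard library only. The small `sentences` list is an illustrative stand-in for preprocessed_sentences, and `vocab_size = 3` is an arbitrary choice for the demo.

```python
from collections import Counter

# Illustrative stand-in for the preprocessed sentences built earlier.
sentences = [
    ["barber", "person"],
    ["barber", "good", "person"],
    ["barber", "huge", "person"],
    ["knew", "secret"],
    ["secret", "kept", "huge", "secret"],
]

# Flatten the sentences and count every word in one pass.
vocab = Counter(word for sentence in sentences for word in sentence)

# Keep only the most frequent words and number them from 1.
vocab_size = 3
word_to_index = {
    word: index
    for index, (word, _count) in enumerate(vocab.most_common(vocab_size), start=1)
}
word_to_index["OOV"] = len(word_to_index) + 1

# dict.get falls back to the OOV index, replacing the try/except pattern.
encoded = [
    [word_to_index.get(word, word_to_index["OOV"]) for word in sentence]
    for sentence in sentences
]

print(word_to_index)
print(encoded)
```

`Counter.most_common(n)` does the sorting and the cut-off in one call, and `enumerate(..., start=1)` keeps index 0 free for padding.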

Preprocessing with Keras

from tensorflow.keras.preprocessing.text import Tokenizer

preprocessed_sentences

tokenizer = Tokenizer()

tokenizer.fit_on_texts(preprocessed_sentences) # build the vocabulary, ordered by frequency

tokenizer.word_index

tokenizer.word_counts

tokenizer.texts_to_sequences(preprocessed_sentences)

Keeping only the n most frequent words with the Keras Tokenizer

vocab_size = 5
tokenizer = Tokenizer(num_words = vocab_size + 1) # keep only the top 5 words; num_words keeps indices 1 to num_words - 1, hence the +1
tokenizer.fit_on_texts(preprocessed_sentences)

tokenizer.word_index

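We asked for only the top 5, so why does word_index still show 13 words?

tokenizer.word_counts

word_index and word_counts always keep every word. The num_words limit is only applied when texts_to_sequences is called.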

encoded = tokenizer.texts_to_sequences(preprocessed_sentences)
encoded

↑ Only the top 5 words (indices 1 to 5) are kept, and all other words are dropped!

Padding

Machine learning models process data in matrix form,

so for parallel computation, sentences of different lengths have to be preprocessed to the same length.

preprocessed_sentences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(preprocessed_sentences)
encoded = tokenizer.texts_to_sequences(preprocessed_sentences)
encoded

To make them all the same length, compute the length of the longest sentence

max_len = max([len(item) for item in encoded])
max_len

import numpy as np

Make every sentence length 7 (the max_len computed above)

Fill the empty positions with 'PAD', whose index is 0

for sentence in encoded:
    while len(sentence) < max_len:
        sentence.append(0)

padded_up = np.array(encoded)
padded_up

pad_sequences()

The Keras padding function

from tensorflow.keras.preprocessing.sequence import pad_sequences

encoded = tokenizer.texts_to_sequences(preprocessed_sentences)
encoded

padded = pad_sequences(encoded)
padded

By default, pad_sequences() fills zeros at the front of each sentence.

padded = pad_sequences(encoded, padding='post')
padded

(padded == padded_up).all()

The padded length can be limited

padded = pad_sequences(encoded, padding='post', maxlen=5)
padded

A sentence that originally had 7 tokens lost its first two tokens (data loss).

To drop tokens from the end instead

padded = pad_sequences(encoded, padding='post', maxlen=5, truncating='post')
padded
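The `padding` and `truncating` options can be easier to follow in plain Python. The `pad` function below is an illustrative re-implementation written for this post, not the Keras source, and the `encoded` list is made-up sample data; it only mirrors the documented defaults (`'pre'` for both options) for lists of integers.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or cut every sequence to the same length (illustrative sketch)."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # 'pre' drops items from the front, 'post' drops them from the back.
            seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        # 'pre' puts the filler in front, 'post' puts it at the end.
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded

encoded = [[1, 5], [1, 8, 5], [7, 7, 3, 2, 10, 1, 11]]

print(pad(encoded, maxlen=5, padding="post"))
print(pad(encoded, maxlen=5, padding="post", truncating="post"))
```

In the first call the 7-token sentence keeps its last five tokens, and in the second it keeps its first five. The two options are independent, so `padding='post'` alone does not change which side gets cut.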

The padding value does not have to be 0. (Any other number works.)

Let's use the vocabulary size + 1 as the padding value

last_value = len(tokenizer.word_index) + 1
last_value

padded = pad_sequences(encoded, padding='post', value=last_value)
padded

One-Hot Encoding

Vocabulary

The set of distinct words.

book, books <- treated as different words (by default).

The first thing to do before one-hot encoding is to build a good vocabulary.

tokens = okt.morphs('나는 자연어 처리를 배운다')

print(tokens)

Assign a unique integer (index) to each token

word_to_index = {word : index for index, word in enumerate(tokens)}

print('Vocabulary:', word_to_index)

A function that builds a one-hot vector

def one_hot_encoding(word, word_to_index):
    one_hot_vector = [0] * (len(word_to_index))  # all zeros, one slot per word
    index = word_to_index[word]
    one_hot_vector[index] = 1  # set only the slot of this word to 1
    return one_hot_vector

one_hot_encoding("자연어", word_to_index)

to_categorical()

One-hot encoding with Keras

text = "나랑 점심 먹으러 갈래 점심 메뉴는 햄버거 갈래 갈래 햄버거 최고야"

from tensorflow.keras.utils import to_categorical

tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])
print('Vocabulary:', tokenizer.word_index)

sub_text = "점심 먹으러 갈래 메뉴는 햄버거 최고야"

encoded = tokenizer.texts_to_sequences([sub_text])[0]
encoded

one_hot = to_categorical(encoded)
print(one_hot)

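Problem 1 with One-Hot Encoding

As the number of words grows, the space needed to store the vectors keeps growing -> the vector dimension grows.

The size of the vocabulary becomes the dimension of each vector.

-> Memory is used inefficiently.

Problem 2

It cannot express similarity between words.

wolf, tiger, dog, cat ->

[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]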
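The similarity problem can be checked directly with a dot product. This is a small sketch with hypothetical helper names (`one_hot`, `dot`) and an English version of the four-word vocabulary above.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given index."""
    vector = [0] * size
    vector[index] = 1
    return vector

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

vocab = ["wolf", "tiger", "dog", "cat"]
vectors = {word: one_hot(i, len(vocab)) for i, word in enumerate(vocab)}

print(dot(vectors["dog"], vectors["cat"]))   # two different words
print(dot(vectors["dog"], vectors["wolf"]))  # two different words
print(dot(vectors["dog"], vectors["dog"]))   # the same word
```

Every pair of different words gives 0 and every word with itself gives 1, so by this measure 'dog' is no closer to 'wolf' than it is to 'cat'. The encoding only says whether two words are identical.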

Customized KoNLPy

Morphological analysis input: '은경이는 사무실로 갔습니다.'

Morphological analysis result: ['은', '경이', '는', '사무실', '로', '갔습니다', '.'] (the name '은경이' is split incorrectly)

!pip install customized_konlpy

from ckonlpy.tag import Twitter
twitter = Twitter()
twitter.morphs('은경이는 사무실로 갔습니다.')

okt.morphs('은경이는 사무실로 갔습니다.')

twitter.add_dictionary('은경이', 'Noun') # add the word to the dictionary

twitter.morphs('은경이는 사무실로 갔습니다.')
