Natural Language Processing

이승수 · October 28, 2021

Deep Learning

1. Natural Language Processing (NLP)

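A natural language is a language people use in everyday life, as distinguished from an artificial (constructed) language
Natural language processing is the technology of processing such natural language with a computer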

2. Vectorization

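The process of turning natural language itself into vectors so that a computer can work with it
It plays an important role in determining the performance of an NLP model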

① Count-based Representation: vectorizes text based on how many times each word appears in a document or sentence

Bag-of-Words (CountVectorizer)
TF-IDF

② Distributed Representation: vectorizes a word based on the words that surround the target word

Word2Vec
GloVe
fastText
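
To make "the words that surround the target word" concrete, the sketch below lists (target, context) pairs for a small window. It only illustrates the kind of input that distribution-based methods start from, not how Word2Vec, GloVe, or fastText are actually trained; the function name `context_pairs` and the `window` parameter are hypothetical.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


tokens = "the quick brown fox jumps".split()
pairs = context_pairs(tokens, window=1)
print(pairs[:3])
# → [('the', 'quick'), ('quick', 'the'), ('quick', 'brown')]
```

Words that keep showing up in similar contexts end up with similar vectors, which is the idea behind this family of methods.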

3. Text Preprocessing

An important step that accounts for more than half of the work in NLP

Preprocessing with built-in methods (lower, replace, ...)

Regular expressions (Regex)

Stop word removal

Statistical trimming

Stemming or lemmatization

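① Curse of dimensionality

"As the number of features grows linearly, the number of instances needed to keep the same explanatory power grows exponentially. In other words, for a dataset with a fixed number of instances, explanatory power drops as the dimensionality increases."

Each distinct word in the whole corpus becomes a feature, i.e. a dimension, of the dataset. Reducing the number of distinct words therefore mitigates the curse of dimensionality.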

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_sm")

tokenizer = Tokenizer(nlp.vocab)

sentence1 = "Hello my name is KOREA!!"
sentence2 = "Hi my name is ENGLAND!!"

sent_list = [sentence1, sentence2]
total_tokens = []

for s in tokenizer.pipe(sent_list):
    sentence_token = [token.text for token in s]
    total_tokens.extend(sentence_token)
    print(sentence_token)
# → ['Hello', 'my', 'name', 'is', 'KOREA!!']
#   ['Hi', 'my', 'name', 'is', 'ENGLAND!!']

token_set = set(total_tokens)
print(token_set)
# → {'name', 'Hi', 'is', 'ENGLAND!!', 'Hello', 'KOREA!!', 'my'}  (set order may vary)

def word2idx(sent, vocab):
    # 1 if the vocabulary word occurs in the sentence, otherwise 0
    sent_token = sent.split()
    return [1 if word in sent_token else 0 for word in vocab]

# Fix the column order once; a set has no guaranteed order
vocab = list(token_set)

sent1_idx = word2idx(sentence1, vocab)
sent2_idx = word2idx(sentence2, vocab)

import pandas as pd
df = pd.DataFrame([sent1_idx, sent2_idx], columns=vocab)
df

The two sentences can be represented in 7 dimensions, one per unique token

	name	Hi	is	ENGLAND!!	Hello	KOREA!!	my
0	1	0	1	0	1	1	1
1	1	1	1	1	0	0	1

② Case normalization

# Convert all values to lowercase
# (df is assumed here to be the product review DataFrame with 'brand' and 'reviews.text' columns)
df['brand'] = df['brand'].str.lower()
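
As a small illustration of why case normalization matters for dimensionality, the sketch below compares the vocabulary size before and after lowercasing. The token list is made up for the example.

```python
# Hypothetical tokens in which the same words appear with different casing
tokens = ["Battery", "battery", "BATTERY", "Great", "great", "price"]

vocab_before = set(tokens)                  # case-sensitive vocabulary
vocab_after = {t.lower() for t in tokens}   # vocabulary after lowercasing

print(len(vocab_before), len(vocab_after))
# → 6 3
```

Every casing variant would otherwise become its own feature, so lowercasing removes dimensions that carry no extra information.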

③ Regular expressions (Regex)

If the corpus contains punctuation or special characters, tokenization does not work properly (for example, 'KOREA!!' above stayed a single token)

import re    # Python's regular expression package

# [] matches any one character listed inside; a leading ^ negates the set
regex = r"[^a-zA-Z0-9 ]"    # anything except a-z, A-Z, 0-9 and space
text = "HELLO k!@$o?re$&a"
subst = ""   # replacement string

re.sub(regex, subst, text)
# → "HELLO korea"

def tokenize(text):
    tokens = re.sub(regex, subst, text)   # apply the regex
    tokens = tokens.lower().split()       # lowercase, then split on whitespace

    return tokens   # list of tokens

df['tokens'] = df['reviews.text'].apply(tokenize)

df[['reviews.text', 'tokens']].head(3)

	reviews.text	tokens
0	Though I have got it for cheap price during bl...	[though, i, have, got, it, for, cheap, price, ...
1	I purchased the 7" for my son when he was 1.5 ...	[i, purchased, the, 7, for, my, son, when, he,...
2	Great price and great batteries! I will keep o...	[great, price, and, great, batteries, i, will,...

from collections import Counter

# A Counter stores each distinct element together with its count
# and can be updated incrementally with .update()
word_counts = Counter()

# Add each review's token list to the counter
df['tokens'].apply(lambda x: word_counts.update(x))

# The 10 most frequent words
word_counts.most_common(10)

# → [('the', 10514),
#    ('and', 8137),
#    ('i', 7465),
#    ('to', 7150),
#    ('for', 6617),
#    ('a', 6421),
#    ('it', 6096),
#    ('my', 4119),
#    ('is', 4111),
#    ('this', 3752)]
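
The list above is dominated by very common words. One common reading of statistical trimming is to drop words that appear in too many or too few documents. The sketch below does that on a toy corpus; the thresholds `min_df` and `max_ratio` and the documents are assumptions for illustration, not values from the original analysis.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical)
docs = [
    ["great", "price", "and", "great", "batteries"],
    ["the", "batteries", "last", "and", "last"],
    ["cheap", "price", "and", "fast", "delivery"],
]

# Document frequency: in how many documents each word appears
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc))

n_docs = len(docs)
min_df = 2        # assumed: drop words seen in fewer than 2 documents
max_ratio = 0.9   # assumed: drop words seen in more than 90% of documents

kept = sorted(
    word for word, count in doc_freq.items()
    if count >= min_df and count / n_docs <= max_ratio
)
print(kept)
# → ['batteries', 'price']
```

Here 'and' is removed for being everywhere and words like 'cheap' for being too rare, which shrinks the vocabulary from both ends.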

④ Stop words

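Words such as 'I', 'and', and 'of' carry almost no meaning from the standpoint of review analysis
They are therefore excluded from the analysis
Most NLP libraries ship with a built-in list of common stop words that includes conjunctions, articles, adverbs, pronouns, and common verbs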

tokens = []
# Remove stop words and punctuation from the tokens, lowercasing what remains
for doc in tokenizer.pipe(df['reviews.text']):
    doc_tokens = []

    for token in doc:
        # Keep the token only if it is neither a stop word nor punctuation
        if not token.is_stop and not token.is_punct:
            doc_tokens.append(token.text.lower())

    tokens.append(doc_tokens)

df['tokens'] = tokens
df.tokens.head()  # → tokens with stop words removed

# Stop words can also be extended with custom entries
STOP_WORDS = nlp.Defaults.stop_words.union(
    ['I', 'i', 'it', "it's", 'it.', 'the', 'this'])

tokens = []

for doc in tokenizer.pipe(df['reviews.text']):
    
    doc_tokens = []
    
    for token in doc: 
        if token.text.lower() not in STOP_WORDS:
            doc_tokens.append(token.text.lower())
   
    tokens.append(doc_tokens)
    
df['tokens'] = tokens

⑤ Stemming and Lemmatization

'batteries' and 'battery' share the same root
Such words are normalized through stemming or lemmatization

※ What is a stem?
The part of a word that carries its meaning, with affixes removed
A stem is not necessarily identical to the word's dictionary form
ex) the stem of argue, argued, arguing, argues is argu

Stemming
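- The Porter algorithm used here simply cuts off the ends of words, so it often produces forms that are not in the dictionary
- In practice stemming performs reasonably well, and because the algorithm is simple and fast it is widely used in search, where speed matters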

from nltk.stem import PorterStemmer

ps = PorterStemmer()

words = ["wolf", "wolves"]

for word in words:
    print(ps.stem(word))

# → wolf
#   wolv

Lemmatization
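- Lemmatization is more systematic than stemming: each word is converted to its lemma, the base dictionary form
- Plural nouns become singular, and verbs are reduced to their base form (e.g. are → be)
- Finding the lemma of each word requires more computation than stemming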

lem = "The social wolf. Wolves are complex."

nlp = spacy.load("en_core_web_sm")

doc = nlp(lem)

for token in doc:
    print(token.text, "  ", token.lemma_)

# → The    the
#   social    social
#   wolf    wolf
#   .    .
#   Wolves    wolf
#   are    be
#   complex    complex
#   .    .

4. Count-based Word Representation

A count-based representation vectorizes a document based on how many times each word occurs in that document (or sentence)
Representative methods are the Bag-of-Words approaches (TF and TF-IDF)
The vectorized documents are expressed as a Document-Term Matrix (DTM)

① Bag-of-Words (BoW): TF (Term Frequency)

Vectorizes a document (or sentence) using only word frequencies, ignoring grammar and word order

TF(w) = \text{number of occurrences of word w in a given document}
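
As a minimal sketch of this definition, the following builds a small document-term matrix of raw counts with the standard library only. The two example sentences are made up; a library such as scikit-learn's CountVectorizer would normally do this job.

```python
from collections import Counter

# Two toy documents (hypothetical)
docs = ["the cat sat on the mat", "the dog sat"]

# Vocabulary: every distinct word in the corpus, in a fixed order
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, values are TF(w)
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[word] for word in vocab])

print(vocab)
# → ['cat', 'dog', 'mat', 'on', 'sat', 'the']
for row in dtm:
    print(row)
# → [1, 0, 1, 1, 1, 2]
#   [0, 1, 0, 0, 1, 1]
```

Word order is lost entirely: only how often each word occurs in each document survives.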

② Bag-of-Words (BoW): TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF gives more weight to words that do not appear in other documents, i.e. words that appear only in particular documents

\text{total number of documents to be classified} : n \\ \text{number of documents containing word w} : df(w)

1 is added to the denominator so that it never becomes 0

\text{IDF(w)} = \log \bigg(\frac{n}{1 + df(w)}\bigg)
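
The formula above can be checked by hand on a tiny corpus. The sketch below implements exactly this IDF (natural log, +1 in the denominator); the corpus and the helper name `idf` are made up for illustration. Note that scikit-learn's TfidfVectorizer uses a differently smoothed variant by default, so its numbers will not match this formula exactly.

```python
import math

# Toy tokenized corpus (hypothetical)
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]
n = len(docs)  # total number of documents


def idf(word):
    """IDF(w) = log(n / (1 + df(w))), following the formula in the text."""
    df = sum(1 for doc in docs if word in doc)  # documents containing the word
    return math.log(n / (1 + df))


for word in ["the", "sat", "cat"]:
    print(word, round(idf(word), 3))
# → the -0.288
#   sat 0.0
#   cat 0.405
```

A word found in every document ('the') gets the lowest weight, while a word found in a single document ('cat') gets the highest, which is the behavior TF-IDF is meant to produce.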

# Apply TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

text = """In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.
It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling.
The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word,
which helps to adjust for the fact that some words appear more frequently in general.
tf–idf is one of the most popular term-weighting schemes today."""

sentences_lst = text.split('\n')
tfidf = TfidfVectorizer(stop_words='english', max_features=15)

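# Fit, then build the DTM (a tf-idf value for every document/word pair)
dtm_tfidf = tfidf.fit_transform(sentences_lst)

# get_feature_names_out() replaces get_feature_names(), which newer scikit-learn versions removed
dtm_tfidf = pd.DataFrame(dtm_tfidf.toarray(), columns=tfidf.get_feature_names_out())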
dtm_tfidf

	corpus	document	frequency	idf	information	number	recommender	...
0	0.239165	0.478329	0.583318	0.173029	0.239165	0.00000	0.291659	...
1	0.000000	0.000000	0.000000	0.426900	0.00000	0.000000	0.000000	...
2	0.277399	0.277399	0.000000	0.200691	0.000000	0.67657	0.000000	...

※ K-NN (K-Nearest Neighbors)

A predictive analysis that finds the K closest data points and estimates or classifies a point based on its similarity to those K points

from sklearn.neighbors import NearestNeighbors

# Fit a nearest-neighbors model on the DTM / 5 nearest neighbors (the default)
nn = NearestNeighbors(n_neighbors=5, algorithm='kd_tree')
nn.fit(dtm_tfidf)

# Distances and indices of the 5 documents closest to the document at the given index
nn.kneighbors([dtm_tfidf.iloc[2]])
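
To show what a nearest-neighbor query computes, here is a brute-force sketch using Euclidean distance and the standard library only. It is not how the kd-tree above works internally; the helper `k_nearest` and the 2-D vectors are hypothetical.

```python
import math


def k_nearest(vectors, query, k=2):
    """Return the k (distance, index) pairs closest to `query`, nearest first."""
    dists = [(math.dist(vec, query), idx) for idx, vec in enumerate(vectors)]
    dists.sort()
    return dists[:k]


# Toy "document vectors" (hypothetical)
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]

neighbors = k_nearest(vectors, vectors[1], k=3)
print([idx for _, idx in neighbors])
# → [1, 0, 2]
```

As with the kneighbors call above, querying with a vector that is part of the fitted data returns that vector itself as the nearest neighbor, at distance 0.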