[NLP] Getting Started

syEON · October 23, 2023

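Introduction

This post summarizes what I have learned about NLP (Natural Language Processing).

It walks through solving a multi-class classification problem on Korean data: classifying customer inquiries (VOC, voice of customer) by type.
Feature column: 'text'
Target column: 'label'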

Checking the data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.font_manager as fm
import seaborn as sns
from IPython.display import display
from wordcloud import WordCloud
from collections import Counter

Text length distribution

display(train_df.label.value_counts())  # number of samples per label
texts = list(train_df['text'])
tokenized_texts = [t.split() for t in texts]  # split each sample on whitespace
texts_len_by_words = [len(t) for t in tokenized_texts]  # number of words per sample
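
The word counts per sample are useful mainly as a summary, for example when picking a padding length later on. A minimal sketch using only the standard library follows; sample_texts is a hypothetical stand-in for train_df['text'], not the actual data.

```python
import statistics

# Hypothetical sample texts standing in for train_df['text']
sample_texts = [
    "로그인 이 안 됩니다",
    "결제 오류 가 계속 발생 합니다 확인 부탁드립니다",
    "비밀번호 변경 방법 문의",
]

# Whitespace tokenization, as in the snippet above
lengths = [len(t.split()) for t in sample_texts]

# Basic summary statistics of the words-per-sample distribution
summary = {
    "min": min(lengths),
    "max": max(lengths),
    "mean": statistics.mean(lengths),
    "median": statistics.median(lengths),
}
print(lengths)
print(summary)
```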

Data exploration

Method 1. KoNLPy

'KoNLPy', a Python package for Korean natural language processing
https://konlpy.org/ko/latest/index.html
https://konlpy.org/ko/latest/morph/#pos-tagging-with-konlpy
https://datascienceschool.net/03%20machine%20learning/03.01.02%20KoNLPy%20%ED%95%9C%EA%B5%AD%EC%96%B4%20%EC%B2%98%EB%A6%AC%20%ED%8C%A8%ED%82%A4%EC%A7%80.html

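Morphological analysis identifies the structure of linguistic attributes such as morphemes, roots, prefixes/suffixes, and parts of speech (POS).
POS tagging marks up each morpheme according to its meaning and context.

KoNLPy offers several POS taggers that differ in performance and tagging scheme. I checked the official documentation and picked the most suitable one. The examples below show how to use Mecab and Okt.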

Kkma
Komoran
Hannanum
Okt (formerly Twitter)
Mecab

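Common methods

nouns: extract nouns
morphs: extract morphemes
pos: attach POS tags

Using mecab
Mecab requires a separate installation when used through KoNLPy, so it is more convenient to install it as a standalone Python package from https://pypi.org/project/python-mecab-ko/.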

# python-mecab-ko package; mind the capitalization of MeCab!
from mecab import MeCab
mecab = MeCab()

text_noun = []
text_pos = []
text_morphs = []

for i in texts:
  text_noun.append(mecab.nouns(i))     # extract nouns
  text_pos.append(mecab.pos(i))        # (morpheme, POS tag) pairs
  text_morphs.append(mecab.morphs(i))  # extract morphemes

text_noun[0]

['여기', '커널', '사이즈', '은', '단어', '최대', '길이', '가정', '선언', '것']

text_pos[0]

[('self', 'SL'), ('convs', 'SL'), ('1', 'SN'), ('nn', 'SL'), ('in', 'SL'), ('Ks', 'SL'), ('1', 'SN'), ('여기', 'NP'), ('서', 'JKB'), ('커널', 'NNG'), ('사이즈', 'NNG')]

text_morphs[0]

['길이', '가', '100', '이', '넘지', '않는다는', '가정', '으로', '그냥', '100', '으로']

Using Okt

from konlpy.tag import Okt
okt = Okt()
for i in texts:
  print(okt.nouns(i))
  print(okt.pos(i))
  print(okt.morphs(i))

NLTK

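The Natural Language Toolkit (NLTK) is a Python library for natural language processing and text analysis.

Converting the nouns, POS tags, and morphemes extracted with KoNLPy into NLTK objects

import nltk

# nouns, morphs, pos and pos_tuple are assumed to be pandas Series holding one token list per document
nltk_nouns = nltk.Text(nouns.explode())
nltk_nouns = nltk.Text([w for w in nltk_nouns if isinstance(w, str)])  # drop non-string entries such as NaN
nltk_morphs = nltk.Text(morphs.explode())
nltk_morphs = nltk.Text([w for w in nltk_morphs if isinstance(w, str)])
nltk_pos = nltk.Text(pos.explode())
nltk_pos_tuple = nltk.Text(pos_tuple.explode())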

print(nltk_nouns.vocab())
print(nltk_pos.vocab())
print(nltk_morphs.vocab())
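
The frequency table returned by vocab() behaves like a Counter, so the same counts can be sketched with the standard library alone. The example below is only an illustration: doc_nouns is a hypothetical list shaped like text_noun above, not the real data.

```python
from collections import Counter
from itertools import chain

# Hypothetical per-document noun lists, shaped like text_noun above
doc_nouns = [
    ["커널", "사이즈", "단어", "길이"],
    ["단어", "길이", "가정"],
    ["단어", "사전"],
]

# Flatten the nested lists and count how often each token occurs
noun_counts = Counter(chain.from_iterable(doc_nouns))
print(noun_counts.most_common(2))
```

A plain mapping of token to count like this is also the kind of input that a frequency-based word cloud expects.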

Exploring other NLTK features
https://www.nltk.org/howto/concordance.html

Making a word cloud

# FONT_PATH must point to a font file that contains Hangul glyphs (set it for your environment)
cloud = WordCloud(
        max_font_size=100, max_words=50,
        background_color='white', relative_scaling=.5,
        width=800, height=600, font_path=FONT_PATH).generate_from_frequencies(nltk_nouns.vocab())
plt.figure(figsize=(12, 6))
plt.imshow(cloud, interpolation='bilinear')
plt.axis('off')
plt.show()

Preprocessing

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

Splitting the data into train and test
Keep in mind that any preprocessing fitted on train must be applied to test in the same way!

x_train,x_test,y_train,y_test = train_test_split(preprocessed_df.text,preprocessed_df.label, test_size=0.25,random_state=42)

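Removing special characters

This step is not applied to X_tr and X_val; only an example of removing characters with a regular expression is kept here.

import re
import string

# Characters to strip explicitly (the regex below already covers all of them)
removal_list = "‘’◇‘”’'·\\“·△●■()\">>`/-∼=ㆍ<>.?!【】…◆%"
removal_list += string.punctuation

# sentence is a single input string
sentence = sentence.translate(str.maketrans(removal_list, ' '*len(removal_list)))
sentence = re.sub(r"[^가-힣0-9a-zA-Z\s]", " ", sentence)  # keep Hangul, digits, Latin letters, whitespace
sentence = re.sub(r"\s+", " ", sentence)  # collapse runs of whitespace last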
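
The regex-based cleanup can be wrapped in a small function so the same rule is applied to every text. This is a sketch under assumptions: clean_text and the sample sentence are hypothetical names and data, and the final strip() is an addition that is not in the snippet above.

```python
import re

def clean_text(sentence: str) -> str:
    """Keep Hangul syllables, digits, Latin letters and whitespace; collapse runs of spaces."""
    sentence = re.sub(r"[^가-힣0-9a-zA-Z\s]", " ", sentence)
    sentence = re.sub(r"\s+", " ", sentence)
    return sentence.strip()  # extra step: drop leading/trailing spaces

print(clean_text("【문의】 결제가 안돼요!!! (오류코드: E-102)"))
```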

Vectorization

Ref.
https://wikidocs.net/33661

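When vectorizing, tokenization of the sentences can be done in the same step.
Each text is converted with the mecab pos method introduced above, then tokenized and vectorized.

def mecab_tokenizer(text):
    # Join each (morpheme, tag) pair into a single "morpheme/TAG" token
    return ["/".join(res) for res in mecab.pos(str(text))]

Method 1
CountVectorizer + TfidfTransformer()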

count_vect = CountVectorizer(tokenizer = mecab_tokenizer, ngram_range=(1,2))
x_train_counts = count_vect.fit_transform(x_train)
x_test_counts = count_vect.transform(x_test)

### check
print(count_vect.get_feature_names_out())
print(count_vect.vocabulary_)
print(x_train_counts.toarray())  # toarray converts the sparse matrix to a dense array
print(x_test_counts.toarray())

transformer = TfidfTransformer()
x_train_tf = transformer.fit_transform(x_train_counts)
x_test_tf = transformer.transform(x_test_counts)  # reuse the IDF learned on train; do not refit on test

# check
print(x_train_tf.shape)
print(x_train_tf.toarray())
print(x_test_tf.shape)
print(x_test_tf.toarray())
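
The reason test data is only transformed, never fitted, is that the columns are defined by the training vocabulary. The pure-Python sketch below illustrates the idea; fit_vocabulary and transform are hypothetical helpers written for this illustration, not scikit-learn functions.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Learn a token -> column index mapping from training documents only."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    """Count tokens per document; tokens missing from the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        rows.append([counts.get(tok, 0) for tok in vocabulary])
    return rows

train_docs = ["결제 오류 문의", "로그인 오류"]
test_docs = ["결제 환불 문의"]  # "환불" never appears in train_docs

vocab = fit_vocabulary(train_docs)
train_counts = transform(train_docs, vocab)
test_counts = transform(test_docs, vocab)
print(vocab)
print(test_counts)
```

Because the test rows reuse the training vocabulary, both matrices have the same number of columns, which is what a downstream classifier requires.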

Method 2
Using TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(tokenizer=mecab_tokenizer)
x_tr_tfidfv = tfidf_vectorizer.fit_transform(x_train)
#x_val_tfidfv = tfidf_vectorizer.transform(X_val)
x_te_tfidfv = tfidf_vectorizer.transform(x_test)

The vectorized results of Method 1 and Method 2 differ only at the level of decimal places. I will use Method 1.

What is TF-IDF?
https://wikidocs.net/31698

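Let d be a document, t a term, and n the total number of documents.
tf(d,t): the number of times term t appears in document d
df(t): the number of documents in which term t appears
idf(t): a value inversely related to df(t)
TF-IDF is the product of TF and IDF

TF-IDF treats words that appear frequently across all documents as less important and words that appear frequently only in particular documents as more important. A low TF-IDF value means low importance and a high value means high importance. Stop words such as "the" or "a" tend to appear in every document, so their TF-IDF values naturally end up lower than those of other words.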
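
To make the definitions concrete, here is a small hand computation. It assumes the smoothed IDF variant ln((1 + n) / (1 + df(t))) + 1, which scikit-learn documents as its default; other references use slightly different formulas. The toy documents are hypothetical, and the final row normalization that scikit-learn applies is left out.

```python
import math

# Hypothetical tokenized documents
docs = [
    ["결제", "오류", "문의"],
    ["로그인", "오류"],
    ["오류", "환불"],
]
n = len(docs)

def df(term):
    """Number of documents that contain the term."""
    return sum(1 for doc in docs if term in doc)

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n) / (1 + df(term))) + 1

def tfidf(term, doc):
    """Raw term count in the document times the idf weight."""
    return doc.count(term) * idf(term)

print(idf("오류"), idf("환불"))
print(tfidf("오류", docs[2]), tfidf("환불", docs[2]))
```

The term that occurs in every document gets the smallest IDF, so within the last document the rarer term ends up with the larger weight.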

What is an N-gram?
https://wikidocs.net/21692

An n-gram is a sequence of n consecutive words. The corpus is cut into chunks of n words, and each chunk is treated as a single token.
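
A short sketch of the idea, with a hypothetical helper named ngrams and a made-up token list:

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["결제", "오류", "문의", "드립니다"]
print(ngrams(tokens, 1))
print(ngrams(tokens, 2))
```

Setting ngram_range=(1,2) in CountVectorizer corresponds to using both lists together: every single token plus every pair of neighbouring tokens.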

Sequence

Ref.
https://keras.io/ko/preprocessing/text/
https://wikidocs.net/182469
https://codetorial.net/tensorflow/natural_language_processing_in_tensorflow_01.html
https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences

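Introduction
Turning the training data into sequences is a preprocessing step for deep learning.

Procedure
🍑 Tokenize the train and test data in Korean with mecab, then convert them to sequences.
After tokenizing a sample, you can either join the tokens back into a single string or keep them as separate items.

Form 1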
['기존/NNG 에/JKB 있/VV 던/ETM 파일/NNG 을/JKO 삭제/NNG 해서/XSV+EC 윈도우/NNP 10/SN ./SF ova/SL 가져오/VV']

Form 2
['기존/NNG',
'에/JKB',
'있/VV',
'던/ETM',
'파일/NNG',
'을/JKO',
'삭제/NNG',
'해서/XSV+EC',
'윈도우/NNP',
'10/SN',
'./SF',
'ova/SL',
'가져오/VV']

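Of the two, I recommend keeping the tokens separate (form 2).
When I inspected the resulting vocabulary, form 1 had lost all the POS markers such as /VV and only the nouns remained (note that "/" is one of the characters in the Tokenizer's filters). You can inspect the vocabulary with tokenizer.word_index or tokenizer.word_counts.
I have not checked which form gives better deep learning performance.

🍑 The Tokenizer parameter num_words
num_words is the maximum number of words the Tokenizer takes into account.
For example, with num_words=5000 the Tokenizer uses only the most frequent words in the training data up to that limit (in practice the top 4999, since only indices below num_words are kept) and ignores the rest.
However, print(len(tokenizer.word_index)) showed far more than 5000 entries: the vocabulary is built from every word regardless of num_words.
When I printed the list produced by texts_to_sequences (x_train_seq), it contained no index of 5000 or above.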

from keras.preprocessing.text import Tokenizer

# 1. Tokenize (used)
x_train_token = x_train.apply(mecab_tokenizer)
x_test_token = x_test.apply(mecab_tokenizer)


# 2. Join the tokens back into one string (not used)
x_train_token_str = x_train.apply(lambda x: ' '.join(mecab_tokenizer(x)))
x_test_token_str = x_test.apply(lambda x: ' '.join(mecab_tokenizer(x)))


# Create the Tokenizer object
tokenizer = Tokenizer(
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', 
    num_words= 5000   # in practice 4999 words are considered
)
tokenizer.fit_on_texts(x_train_token)
print(len(tokenizer.word_index))  # size of the learned vocabulary
x_train_seq = tokenizer.texts_to_sequences(x_train_token)  # convert text to sequences
x_test_seq = tokenizer.texts_to_sequences(x_test_token)
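
The num_words behaviour described above can be imitated in a few lines of plain Python. This is only a sketch of the idea, not the Keras implementation: build_word_index and to_sequences are hypothetical helpers and token_lists is made-up data.

```python
from collections import Counter

def build_word_index(token_lists):
    """Rank every token by frequency; index 1 is the most frequent (0 is left for padding)."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    ranked = [tok for tok, _ in counts.most_common()]
    return {tok: rank for rank, tok in enumerate(ranked, start=1)}

def to_sequences(token_lists, word_index, num_words):
    """Keep only indices below num_words; every other token is dropped."""
    return [
        [word_index[tok] for tok in tokens
         if tok in word_index and word_index[tok] < num_words]
        for tokens in token_lists
    ]

token_lists = [["오류", "결제", "오류"], ["오류", "결제", "환불"], ["로그인"]]
word_index = build_word_index(token_lists)
sequences = to_sequences(token_lists, word_index, num_words=3)
print(len(word_index))
print(sequences)
```

The vocabulary still holds all four tokens, while the sequences only contain indices 1 and 2, which mirrors the observation that word_index is larger than num_words but the converted sequences never reach it.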

Tokenized sentences differ in length, so fix a length and pad every sequence to match it.
By default, zeros are added at the front.

from keras.preprocessing.sequence import pad_sequences
max_len = 500
x_train_seq = pad_sequences(x_train_seq, maxlen = max_len)
x_test_seq = pad_sequences(x_test_seq, maxlen = max_len)

print(x_test_seq.shape)
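
What the default padding does to each sequence can be sketched in plain Python. pad_pre is a hypothetical helper written for illustration; it assumes the default settings, where both padding and truncation happen at the front of the sequence.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` items of long ones."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # drop items from the front if the sequence is too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4))
```

Note that samples longer than max_len lose their beginning under these defaults, so max_len is worth choosing with the length distribution in mind.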

Saving and loading data

For sparse data, use scipy.sparse.save_npz.

Save

import scipy.sparse

# set path to suit your environment
scipy.sparse.save_npz(path+'/x_train_tf.npz', x_train_tf)
scipy.sparse.save_npz(path+'/x_test_tf.npz', x_test_tf)

np.save(path +"/" + 'x_train_seq.npy',x_train_seq)
np.save(path +"/" + 'x_test_seq.npy',x_test_seq)

np.save(path +"/" + 'y_train.npy',y_train)
np.save(path +"/" + 'y_test.npy',y_test)

Load

# n-gram (use the same path separator as when saving)
x_train_tf = scipy.sparse.load_npz(path + '/x_train_tf.npz')
x_test_tf = scipy.sparse.load_npz(path + '/x_test_tf.npz')
print(x_train_tf.shape, x_test_tf.shape)

# sequence
x_train_seq = np.load(path + '/x_train_seq.npy')
x_test_seq = np.load(path + '/x_test_seq.npy')
print(x_train_seq.shape, x_test_seq.shape)

y_train = np.load(path + '/y_train.npy')
y_test = np.load(path + '/y_test.npy')
print(y_train.shape, y_test.shape)

Machine learning

The machine learning models are trained on the TF-IDF vectors (x_train_tf).

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

Example 1. Using RandomForestClassifier()

rfc = RandomForestClassifier()
rfc.fit(x_train_tf,y_train)
y_pred = rfc.predict(x_test_tf)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
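
For reading the printed matrix: rows correspond to the true labels and columns to the predicted labels, and accuracy is the share of counts on the diagonal. A minimal pure-Python sketch with made-up labels follows; confusion is a hypothetical helper, not the scikit-learn function.

```python
def confusion(y_true, y_pred, n_classes):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up labels for three classes
y_true = [0, 0, 1, 2, 2, 2]
y_hat = [0, 1, 1, 2, 2, 0]

matrix = confusion(y_true, y_hat, n_classes=3)
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true)
print(matrix)
print(accuracy)
```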

Example 2. Naive Bayes model
This was the best-performing model; I used randomized search to tune it.

from sklearn.naive_bayes import ComplementNB
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

# Create the ComplementNB model
model = ComplementNB()

# Hyperparameter ranges to search
param_dist = {
    'alpha': uniform(0.1, 10.0),  # scipy's uniform(loc, scale) samples from [0.1, 10.1]
    'fit_prior': [True, False]
}

# Create the randomized search object
random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=5, n_jobs=-1, random_state=42)
random_search.fit(x_train_tf, y_train)

# Print the best hyperparameters and the best cross-validation score
print("Best hyperparameters:", random_search.best_params_)
print("Best accuracy:", random_search.best_score_)
nb_pred = random_search.predict(x_test_tf)

Example 3. SGDClassifier

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint

# Create the SGDClassifier model
model = SGDClassifier()

# Hyperparameter ranges to search
param_dist = {
    'loss': ['hinge', 'log_loss', 'modified_huber', 'squared_hinge'],
    'penalty': ['l2', 'l1', 'elasticnet'],
    'alpha': loguniform(1e-6, 1e-3),
    'max_iter': randint(100, 1000)
}

# Create the randomized search object
random_search = RandomizedSearchCV(model, param_distributions=param_dist, n_iter=10, cv=5, n_jobs=-1, random_state=42)

# Fit on the training features and labels
random_search.fit(x_train_tf, y_train)

# Print the best hyperparameters and the best cross-validation score
print("Best hyperparameters:", random_search.best_params_)
print("Best accuracy:", random_search.best_score_)

Deep learning

For deep learning I built three models and produced predictions for x_test without knowing its true labels.

Example 1. DNN

from keras import Input, Model
from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization, Activation,Embedding, Conv1D, GlobalMaxPooling1D, GlobalAveragePooling1D, LSTM, Bidirectional
from keras.backend import clear_session
from keras.optimizers import Adam
from keras.metrics import categorical_accuracy, SparseCategoricalAccuracy
from keras.callbacks import EarlyStopping

es = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
input_dim = x_train_seq.shape[1]

x = Input(shape=[input_dim])
h = Dense(128)(x)
h = BatchNormalization()(h)
h = Activation('swish')(h)
h = Dropout(rate=0.6)(h)

h = Dense(64)(h)  # chain from the previous block's output, not from the input
h = BatchNormalization()(h)
h = Activation('swish')(h)
h = Dropout(rate=0.6)(h)

h = Dense(32)(h)
h = BatchNormalization()(h)
h = Activation('swish')(h)
h = Dropout(rate=0.6)(h)

y = Dense(5, activation='softmax')(h)  # 5 output classes
dnn = Model(x, y)

dnn.compile(optimizer=Adam(learning_rate=0.0001), loss ='sparse_categorical_crossentropy', metrics=[SparseCategoricalAccuracy()])
history = dnn.fit(x_train_seq, y_train, epochs=100, validation_split=0.2, verbose=2, batch_size=128, callbacks=[es])
c_predict = np.argmax(dnn.predict(x_test_seq), axis=1)

Example 2. CNN
https://keras.io/examples/nlp/text_classification_from_scratch/

# vocab_size is not defined in this post; it must be at least the largest token index in x_train_seq
inputs = Input(shape=(None,), dtype="int64")

# Map the vocabulary indices into a 128-dimensional embedding space.
x = Embedding(vocab_size+1, 128)(inputs)
x = Dropout(0.5)(x)

# Conv1D + global max pooling
x = Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = Conv1D(128, 7, padding="valid", activation="relu", strides=3)(x)
x = GlobalMaxPooling1D()(x)

# We add a vanilla hidden layer:
x = Dense(128, activation="relu")(x)
x = Dropout(0.5)(x)

# Project onto 5 output units (one per class) with a softmax:
predictions = Dense(5, activation="softmax", name="predictions")(x)

model = Model(inputs, predictions)

# Compile the model with sparse categorical crossentropy loss and an Adam optimizer.
model.compile(optimizer=Adam(learning_rate=0.0001), loss ='sparse_categorical_crossentropy', metrics=[SparseCategoricalAccuracy()])
history = model.fit(x_train_seq, y_train, epochs=100, validation_split=0.2, verbose=2, batch_size=128, callbacks=[es])

#loss, acc = model.evaluate(x_test_seq, y_test)
c_predict = model.predict(x_test_seq)
c_predict = np.argmax(c_predict, axis=1)
print(c_predict)
#print(loss, acc)

Example 3. LSTM
https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM

model_lstm = Sequential()
model_lstm.add(Embedding(vocab_size+1, 128))
model_lstm.add(Bidirectional(LSTM(64, return_sequences=True)))
model_lstm.add(Bidirectional(LSTM(32)))
model_lstm.add(Dense(64, activation='relu'))
model_lstm.add(Dropout(0.5))
model_lstm.add(Dense(32, activation='relu'))
model_lstm.add(Dropout(0.5))
model_lstm.add(Dense(5, activation='softmax'))

model_lstm.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['acc'])
history = model_lstm.fit(x_train_seq, y_train, epochs=50, callbacks=[es], batch_size=128, validation_split=0.2)
lstm_predict = np.argmax(model_lstm.predict(x_test_seq), axis=1)
print(lstm_predict)