[MBTIgram] AI의 이야기[2]-전처리 및 EDA

MBTIgram·2023년 8월 24일

AI

목록 보기

2/4

안녕하세요. 저는 'MBTIgram'의 AI 개발을 맡은 BTSpa Winter(지유경)입니다.😆
이번 포스팅에서는 전처리 및 EDA 과정을 설명해드리겠습니다.

1. 데이터 전처리

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

import seaborn as sns
import matplotlib.pyplot as plt

#데이터셋 로드
data = pd.read_csv('/content/drive/MyDrive/spp_project/mbti_concat.csv')

이전 포스트에서 설명드렸다시피 두 개의 데이터셋을 합쳐서 전처리를 진행했습니다. 위 코드에서 로드한 데이터셋은 사전에 concat()함수를 이용하여 2개의 데이터셋을 합친 csv 파일입니다.

csv 파일을 합치는 방법은 다음과 같습니다.

result = pd.concat([data1, data2], ignore_index=True)

합치는 과정에서 index를 재배열하기 위해 ignore_index=True 옵션을 추가하였습니다.
concat()에는 다양한 옵션이 존재합니다.

join : 어떤 방식으로 병합할지 결정
join의 default는 'outer'이기 때문에 저는 따로 join 옵션을 주지 않았습니다. 만약, 공통으로 있는 열만 선택하여 합치고 싶다면, join='inner'를 사용하시면 됩니다.
axis : 어떤 방향으로 병합할지 결정
axis = 0 은 세로로 합치기, axis = 1은 가로로 합치기이며 axis의 default는 '0'입니다. 제가 가진 두 개의 데이터 셋은 각각 type, posts 두 개의 열로 이루어져있고 type을 예측변수, posts를 설명변수로 각각 사용해야하기 때문에 데이터셋 두개를 세로로 합쳐야 했습니다. 따라서, default 값을 사용해도 되기 때문에 따로 axis 옵션을 주지 않았습니다.

data

데이터셋을 합치면서 생성된 'Unnamed: 0' 컬럼이 보입니다. concat()을 진행하는 과정에서 index가 하나 더 생긴 것 같습니다. 불필요하기 때문에 해당 열 전체를 삭제해줍니다.

# 불필요한 열 제거
data = data.drop(['Unnamed: 0'], axis=1)
data

'Unnamed: 0'이라는 열을 제거하기 때문에 'axis = 1'로 옵션을 주면 됩니다. 만약, 행을 제거 한다면 'axis = 0'으로 옵션을 주면 됩니다.

해당 열이 삭제된 것을 확인되었으니 본격적인 전처리를 시작하겠습니다.
영어에는 '축약형'이라는 것이 존재하기 때문에 이러한 텍스트를 정제하는 과정을 거쳐야 합니다. 아래의 링크를 참고하여 코드를 작성하였습니다.

https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python

# 전처리 함수에서 사용할 contractions 생성
contractions = {"'cause": 'because',
 "I'd": 'I would',
 "I'd've": 'I would have',
 "I'll": 'I will',
 "I'll've": 'I will have',
 "I'm": 'I am',
 "I've": 'I have',
 "ain't": 'is not',
 "aren't": 'are not',
 "can't": 'cannot',
 "could've": 'could have',
 "couldn't": 'could not',
 "didn't": 'did not',
 "doesn't": 'does not',
 "don't": 'do not',
 "hadn't": 'had not',
 "hasn't": 'has not',
 "haven't": 'have not',
 "he'd": 'he would',
 "he'll": 'he will',
 "he's": 'he is',
 "here's": 'here is',
 "how'd": 'how did',
 "how'd'y": 'how do you',
 "how'll": 'how will',
 "how's": 'how is',
 "i'd": 'i would',
 "i'd've": 'i would have',
 "i'll": 'i will',
 "i'll've": 'i will have',
 "i'm": 'i am',
 "i've": 'i have',
 "isn't": 'is not',
 "it'd": 'it would',
 "it'd've": 'it would have',
 "it'll": 'it will',
 "it'll've": 'it will have',
 "it's": 'it is',
 "let's": 'let us',
 "ma'am": 'madam',
 "mayn't": 'may not',
 "might've": 'might have',
 "mightn't": 'might not',
 "mightn't've": 'might not have',
 "must've": 'must have',
 "mustn't": 'must not',
 "mustn't've": 'must not have',
 "needn't": 'need not',
 "needn't've": 'need not have',
 "o'clock": 'of the clock',
 "oughtn't": 'ought not',
 "oughtn't've": 'ought not have',
 "sha'n't": 'shall not',
 "shan't": 'shall not',
 "shan't've": 'shall not have',
 "she'd": 'she would',
 "she'd've": 'she would have',
 "she'll": 'she will',
 "she'll've": 'she will have',
 "she's": 'she is',
 "should've": 'should have',
 "shouldn't": 'should not',
 "shouldn't've": 'should not have',
 "so's": 'so as',
 "so've": 'so have',
 "that'd": 'that would',
 "that'd've": 'that would have',
 "that's": 'that is',
 "there'd": 'there would',
 "there'd've": 'there would have',
 "there's": 'there is',
 "they'd": 'they would',
 "they'd've": 'they would have',
 "they'll": 'they will',
 "they'll've": 'they will have',
 "they're": 'they are',
 "they've": 'they have',
 "this's": 'this is',
 "to've": 'to have',
 "wasn't": 'was not',
 "we'd": 'we would',
 "we'd've": 'we would have',
 "we'll": 'we will',
 "we'll've": 'we will have',
 "we're": 'we are',
 "we've": 'we have',
 "weren't": 'were not',
 "what'll": 'what will',
 "what'll've": 'what will have',
 "what're": 'what are',
 "what's": 'what is',
 "what've": 'what have',
 "when's": 'when is',
 "when've": 'when have',
 "where'd": 'where did',
 "where's": 'where is',
 "where've": 'where have',
 "who'll": 'who will',
 "who'll've": 'who will have',
 "who's": 'who is',
 "who've": 'who have',
 "why's": 'why is',
 "why've": 'why have',
 "will've": 'will have',
 "won't": 'will not',
 "won't've": 'will not have',
 "would've": 'would have',
 "wouldn't": 'would not',
 "wouldn't've": 'would not have',
 "y'all": 'you all',
 "y'all'd": 'you all would',
 "y'all'd've": 'you all would have',
 "y'all're": 'you all are',
 "y'all've": 'you all have',
 "you'd": 'you would',
 "you'd've": 'you would have',
 "you'll": 'you will',
 "you'll've": 'you will have',
 "you're": 'you are',
 "you've": 'you have'}

데이터셋에서 유의미한 단어 토큰만을 선별하기 위해서 큰 의미가 없는 단어 토큰을 제거하는 과정이 필요합니다. 예를 들어, 'I', 'my', 'me', 조사, 접미사 등과 같은 단어들은 문장에 빈번하게 등장하지만 의미 분석을 하는데는 많은 기여를 하지 않는 경우가 많습니다. 이러한 단어들을 불용어(stopword)라고 하며, nltk에는 불용어들을 패키지 내에서 미리 정의하고 있습니다.

nltk의 불용어를 사용하기위한 모듈을 import해야 하는데, 만약 데이터가 없다는 에러가 발생하면 nltk.download라는 커맨드를 통해서 다운로드가 가능합니다.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# NLTK의 불용어
stop_words = set(stopwords.words('english'))
print('불용어 개수 :', len(stop_words))
print(stop_words)

stopwords.words("english")는 nltk가 정의한 영어 불용어 리스트를 반환해줍니다. 위의 코드로 불용어의 개수와 불용어를 출력해서 확인할 수 있습니다. 불용어 개수는 179개라는 것을 확인하였습니다.

import re
from bs4 import BeautifulSoup

# 전처리 함수
def preprocess_sentence(sentence, remove_stopwords = True):
    sentence = re.sub(r'https?:\/\/.*?[\s+]', '', sentence) # Links 제거
    sentence = sentence.lower() # 텍스트 소문자화
    sentence = BeautifulSoup(sentence, "lxml").text # <br />, <a href = ...> 등의 html 태그 제거
    sentence = re.sub(r'\([^)]*\)', '', sentence) # 괄호로 닫힌 문자열  제거 Ex) my friend(yugyeong) -> my friend
    sentence = re.sub('"','', sentence) # 쌍따옴표 " 제거
    sentence = ' '.join([contractions[t] if t in contractions else t for t in sentence.split(" ")]) # 약어 정규화
    sentence = re.sub(r"'s\b","",sentence) # 소유격 제거. Ex) yugyeong's -> yugyeong
    sentence = re.sub("[^a-zA-Z]", " ", sentence) # 영어 외 문자(숫자, 특수문자 등) 공백으로 변환
    sentence = re.sub('[m]{2,}', 'mm', sentence) # m이 3개 이상이면 2개로 변경. Ex) ummmmmmm  -> umm

    pers_types = ['infp' ,'infj', 'intp', 'intj', 'istp', 'isfp', 'isfj','istp', 'entp', 'enfp', 'entj', 'enfj', 'estp', 'esfp' ,'esfj' ,'estj']
    for types in pers_types:
      sentence = sentence.replace(types, '')

    # 불용어 제거 (Text)
    if remove_stopwords:
        tokens = ' '.join(word for word in sentence.split() if not word in stop_words if len(word) > 1)
    # 불용어 미제거 (Summary)
    else:
        tokens = ' '.join(word for word in sentence.split() if len(word) > 1)
    return tokens

전처리 함수를 위와 같이 정의해줍니다. 코드에서 볼 수 있는 pers_types는 mbti 종류로 데이터셋 내부에 mbti 종류가 포함되어 있다면 예측 정확도에 영향을 끼칠 수도 있기 때문에 제거를 했습니다.

# posts 열 전처리
clean_posts = []
for s in data['posts']:
    clean_posts.append(preprocess_sentence(s))
clean_posts[:5]

결과를 보면, 처음 데이터셋을 불러올때 0번째 행에 포함되어 있던 'intj'라는 단어와 같이 mbti 종류와 링크, 특수문자 제거 및 소문자화 등의 과정이 제대로 진행된 것을 확인할 수 있습니다.

data['posts'] = clean_posts

# 전처리 진행과정에서 결측치 생성 여부 확인
print(data.isnull().sum())

posts 0
type 0
dtype: int64

결측치가 없는 것을 확인했습니다.
이제, collections 모듈을 사용하여 전체 posts 열에서 중복이 많은 단어들을 확인하고 word cloud를 이용하여 시각화를 진행하겠습니다.

import collections
from collections import Counter

# collections 모듈의 Counter를 사용하여 posts 열에서 중복이 많은 단어 40개 출력
words = list(data["posts"].apply(lambda x: x.split()))
words = [x for y in words for x in y]
Counter(words).most_common(40)

사진에 모두 담지는 못했지만, 중복이 많은 단어 40개를 출력한 것을 확인했습니다. Counter 생성자에 중복된 데이터가 저장된 배열을 인자로 넘기면 각 원소가 몇 번씩 나오는지 저장된 객체를 얻게 됩니다.

collection 모듈의 Counter를 사용하는 방법은 https://www.daleseo.com/python-collections-counter/ 를 참고하시면 됩니다.

import wordcloud
from wordcloud import WordCloud, STOPWORDS

wc = wordcloud.WordCloud(width=1200, height=500,
                         collocations=False, background_color="white",
                         colormap="tab20b").generate(" ".join(words))

plt.figure(figsize=(25,10))
# word cloud 생성
plt.imshow(wc, interpolation='bilinear')
_ = plt.axis("off")

fig, ax = plt.subplots(len(data['type'].unique()), sharex=True, figsize=(15,len(data['type'].unique())))
k = 0
for i in data['type'].unique():
    data_4 = data[data['type'] == i]
    wordcloud = WordCloud(max_words=1628,relative_scaling=1,normalize_plurals=False).generate(data_4['posts'].to_string())
    plt.subplot(4,4,k+1)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.title(i)
    ax[k].axis("off")
    k+=1

word cloud를 통해서 각 mbti별로 자주 사용한 단어를 확인할 수 있었습니다.

2. EDA

전처리가 끝났으니, 전처리된 데이터셋으로 EDA를 수행합니다.

data.head()

data.info()

data.isnull().sum().to_frame().rename(columns={0: "Count of Missing Values"})

import seaborn as sns
import matplotlib.pyplot as plt

# 스타일과 색상 설정
sns.set(style="whitegrid", palette="pastel")

# count plot 생성
plt.figure(figsize=(14, 6))
ax = sns.countplot(data=data, x='type', order=sorted(data['type'].unique()),
                   palette="pastel")
ax.set_title('Distribution of MBTI Types')
ax.set_xlabel('MBTI Type')
ax.set_ylabel('Count')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches:
    ax.annotate(f'{p.get_height():.0f}', (p.get_x()+p.get_width()/2, p.get_height()),
                ha='center', va='bottom', fontsize=12)
plt.tight_layout()

plt.show()

각 mbti별 데이터 분포를 살펴본 결과 클래스 불균형이 심각한 것을 확인할 수 있었습니다. 모델링을 진행할 때, 클래스 불균형 문제에 잘 대응할 수 있는 모델을 선정하는 것이 성능 향상에 가장 중요할 것 같다는 생각이 들었습니다.

data['word_count'] = data['posts'].apply(lambda x: len(x.split()))

plt.figure(figsize=(14, 6))
sns.histplot(data=data, x='word_count', bins=30, kde=True)
plt.title('Distribution of Word Count in Tweets')
plt.xlabel('Word Count')
plt.ylabel('Frequency')
plt.show()

단어별 개수 분포에 대한 결과입니다.

import plotly.express as px

# 색상 설정
color_palette = px.colors.qualitative.Pastel

# 박스 플롯 생성
fig = px.box(data, x="type", y="word_count", color="type",
             title="Word Count Distribution by MBTI Personality Type",
             category_orders={"label": sorted(data["type"].unique())},
             color_discrete_sequence=color_palette)

# 라벨 이름 설정
fig.update_xaxes(title="MBTI Personality Type", showgrid=False,
                 tickfont=dict(size=12, color="black"))
fig.update_yaxes(title="Word Count", showgrid=False,
                 tickfont=dict(size=12, color="black"))

# 타이틀 설정
fig.update_layout(title_font=dict(size=24, color="darkblue"))

fig.show()

mtbi 종류별 단어 개수 분포에 대한 결과입니다.

시각화를 하면서 생성된 word_count 컬럼을 삭제하고 'data_result.csv' 로 저장합니다. 저장한 csv 파일은 모델링에 사용할 최종 데이터셋입니다.

# word_count열 제거 후, csv파일로 저장
data = data.drop(['word_count'], axis=1)
data.to_csv('data_result.csv')

마치며..

프로젝트 초반에 했던 전처리는 축약형, 링크, MBTI type은 변환 및 제거하지 않았기 때문에 모델 성능에도 좋지 않은 영향을 끼쳤던 것 같습니다. 이렇게 불필요한 데이터로 학습이 된다면 정확하지 못한 결과 얻을 수 있습니다.
아래의 링크를 참고하시면 프로젝트 초반에 작성한 EDA 코드를 확인하실 수 있습니다. 전처리 코드는 현재의 코드에서 축약형, 링크, MBTI type만 고려하지 않은 코드이기 때문에 따로 첨부하지 않았습니다.
윈터의 velog) https://velog.io/@yugyeong_929/AI-web-service-project-MBTIgram-데이터셋-전처리-및-EDA
처음 작성한 EDA 코드를 보시면 얼마나 빈약한 상태로 모델링을 했는지 느끼실 수 있습니다..😅 EDA는 아무리 해도 어려워서 더 많이 공부해봐야 할 것 같습니다..! 다음 포스트는 'XGBoost와 RNN 모델링 과정'입니다. 많은 관심 부탁드립니다:D

이상으로 MBTIgram의 AI 두 번째 이야기를 마치겠습니다.🙋‍♀️