💯[Python EDA 5] Commercial District Analysis EDA with Public Data

김미연 · August 22, 2023

๊ตญ๋‚ด ์Œ์‹์  ํ˜„ํ™ฉ ๋ฐ ํŠธ๋ Œ๋“œ ํŒŒ์•…

DataSet : ์†Œ์ƒ๊ณต์ธ ์ƒ๊ถŒ ์ •๋ณด

  • Jupyter notebook ํ™œ์šฉ
  • Python ํ™œ์šฉ
    โ€‹

1. Setting

  • pecab : a morphological analyzer
    - Drawback : it is slow, so it is not well suited to large datasets
  • wordcloud : a tool for rendering word clouds
# Install the libraries needed for text analysis
!pip install pecab wordcloud

# ๋ถ„์„์— ํ•„์š”ํ•œ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

โ€‹

2. ๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

  • glob : ํŒŒ์ผ ์ด๋ฆ„ ๊ทœ์น™์— ํ•ด๋‹นํ•˜๋Š” ํŒŒ์ผ๊ฒฝ๋กœ ๋ชจ๋‘ ๊ฐ€์ ธ์˜ค๋Š” ํ•จ์ˆ˜
  • tqdm : progress bar(๋ฐ์ดํ„ฐ ๋ถˆ๋Ÿฌ์˜ฌ ๋•Œ ์ง„ํ–‰ ์ •๋„ ํ™•์ธ ๊ฐ€๋Šฅ)
  • tab ํ™œ์šฉ : ๊ฒฝ๋กœ ์ž…๋ ฅ ์‹œ tab์„ ๋ˆ„๋ฅด๋ฉด ๋‹ค์Œ ๊ฒฝ๋กœ ๋ชฉ๋ก ํ™•์ธ ๊ฐ€๋Šฅ
from glob import glob
from tqdm.auto import tqdm

# ํŒŒ์ผ ์ด๋ฆ„ ๊ทœ์น™์— ํ•ด๋‹นํ•˜๋Š” ํŒŒ์ผ๊ฒฝ๋กœ ๋ชจ๋‘ ๊ฐ€์ ธ์˜ค๊ธฐ
file_list = sorted(glob('./data/์ƒ๊ฐ€(์ƒ๊ถŒ)์ •๋ณด_20230630/*.csv'))

data = pd.DataFrame()

# ํ•ด๋‹น ํŒŒ์ผ ๋ชจ๋‘ ๋ถˆ๋Ÿฌ์™€ ๋ณ‘ํ•ฉ
for file in tqdm(file_list): # tqdm ์„ค์ •
    temp = pd.read_csv(file)
    data = pd.concat([data, temp], axis=0)
    
# tqdm์ด ์ž‘๋™๋˜์ง€ ์•Š์„ ์‹œ ์•„๋ž˜ ์ฝ”๋“œ๋กœ ๋น„์Šทํ•œ ๊ธฐ๋Šฅ ๊ตฌํ˜„ ๊ฐ€๋Šฅ    
# for idx, file in enumerate(file_list):
#     print("Loading %d%%" % (idx/len(file_list)*100))
#     temp = pd.read_csv(file)
#     data = pd.concat([data, temp], axis=0)   
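
As a quick check of what the glob pattern picks up, the following minimal sketch builds a temporary folder with made-up file names (`a.csv`, `b.csv`, `notes.txt`, all hypothetical) and lists only the CSV files. `list_csv_files` is an illustrative helper, not part of the original notebook.

```python
import tempfile
from glob import glob
from pathlib import Path


def list_csv_files(folder):
    """Return the sorted paths of every .csv file directly inside folder."""
    return sorted(glob(str(Path(folder) / "*.csv")))


with tempfile.TemporaryDirectory() as tmp:
    # Create three empty-ish files; only two of them match the *.csv pattern
    for name in ["b.csv", "a.csv", "notes.txt"]:
        (Path(tmp) / name).write_text("x\n")
    found = [Path(p).name for p in list_csv_files(tmp)]

print(found)  # ['a.csv', 'b.csv']
```

Sorting the result, as the notebook does, keeps the load order stable across runs, since glob itself does not promise any particular order.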


3. ๋ฐ์ดํ„ฐ ํ™•์ธ ๋ฐ ๊ฐ€๊ณต

  • gc : garbage collector(์‹œ์Šคํ…œ ์†Œํ”„ํŠธ์›จ์–ด)
    - ํ•„์š” ์—†๋Š” ๋ฉ”๋ชจ๋ฆฌ๋ฅผ ๋‹ค์‹œ ๊ฐ€์ ธ์˜ด(๋ฉ”๋ชจ๋ฆฌ ์ฒญ์†Œ)
    - ๋Œ€์šฉ๋Ÿ‰ ๋ฐ์ดํ„ฐ ์‚ฌ์šฉ ์‹œ ํ•„์š”(์‚ญ์ œ ๋’ค ์‚ฌ์šฉ!)
import gc

gc.collect() # release memory
# Check the actual memory usage
data.info(memory_usage='deep')

# Work with a subset of the data to find the columns to use
# df = data[:10000]
df = data.sample(n=10000, random_state=42) # random sample of 10,000 rows

# ๋ถ„์„์„ ์œ„ํ•ด ํ•„์š”ํ•œ ์ปฌ๋Ÿผ ํ™•์ธ
df.iloc[:, 1:11] # ์ƒํ˜ธ๋ช…, ์ƒ๊ถŒ์—…์ข…๋Œ€๋ถ„๋ฅ˜๋ช…, ์ƒ๊ถŒ์—…์ข…์ค‘๋ถ„๋ฅ˜๋ช…
df.iloc[:, 11:15] # ์‹œ๋„๋ช…

# ํ•„์š” ์ปฌ๋Ÿผ๋งŒ์„ ๋‹ด์€ ๋ฐ์ดํ„ฐ ๊ตฌ์ถ•
data = data[['์ƒํ˜ธ๋ช…', '์ƒ๊ถŒ์—…์ข…๋Œ€๋ถ„๋ฅ˜๋ช…', '์ƒ๊ถŒ์—…์ข…์ค‘๋ถ„๋ฅ˜๋ช…', '์‹œ๋„๋ช…']]

# ๋ฉ”๋ชจ๋ฆฌ ๋ฐ˜ํ™˜
gc.collect() 
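
For context on why the collector is called after deleting or reassigning objects: in CPython an object is normally freed as soon as its last reference disappears, and `gc.collect()` additionally reclaims objects that only keep each other alive through reference cycles. A minimal sketch, using a hypothetical `Node` class that is not part of the notebook:

```python
import gc


class Node:
    """Toy object that can point at another Node."""

    def __init__(self):
        self.partner = None


gc.disable()                 # pause automatic collection so the demo is deterministic
gc.collect()                 # start from a clean state
a, b = Node(), Node()
a.partner, b.partner = b, a  # a and b now reference each other (a cycle)
del a, b                     # the names are gone, but the cycle keeps both objects alive
freed = gc.collect()         # returns the number of unreachable objects it found
gc.enable()

print(freed >= 2)  # True
```

In the notebook, reassigning `data` to the four-column subset drops the reference to the full table; the explicit `gc.collect()` afterwards is a safeguard in case anything cyclic is still holding memory.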


4. ๋ฐ์ดํ„ฐ ๋ถ„์„

1) ํ•œ์‹ / ์ผ์‹ / ์ค‘์‹ ์Œ์‹์  ๋น„์œจ ์ฐพ๊ธฐ

# Note: df is the 10,000-row sample, so the shares below are estimates
# Rows for Korean restaurants (한식)
kr_restaurant = df[df['상권업종중분류명'] == '한식']
# Rows for Japanese restaurants (일식)
jp_restaurant = df[df['상권업종중분류명'] == '일식']
# Rows for Chinese restaurants (중식)
cn_restaurant = df[df['상권업종중분류명'] == '중식']
# Rows for all restaurants (major category 음식)
total_restaurant = df[df['상권업종대분류명'] == '음식']

# Print the share of Korean / Japanese / Chinese restaurants among all restaurants
print(f'Korean restaurant share : {len(kr_restaurant) / len(total_restaurant) * 100:.2f}%')
print(f'Japanese restaurant share : {len(jp_restaurant) / len(total_restaurant) * 100:.2f}%')
print(f'Chinese restaurant share : {len(cn_restaurant) / len(total_restaurant) * 100:.2f}%')
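
The arithmetic behind these shares can be checked on a toy example. The sketch below uses made-up (major category, sub-category) pairs with English stand-in labels, not the real dataset values, and only the standard library.

```python
from collections import Counter

# Hypothetical (major category, sub-category) pairs standing in for real rows
rows = [
    ("food", "korean"), ("food", "korean"), ("food", "japanese"),
    ("food", "chinese"), ("food", "cafe"), ("retail", "clothing"),
]

# Denominator: every row whose major category is food
food_subs = [sub for major, sub in rows if major == "food"]
counts = Counter(food_subs)

# Share of each cuisine among all food rows, in percent
shares = {
    sub: counts[sub] * 100 / len(food_subs)
    for sub in ("korean", "japanese", "chinese")
}

for sub, pct in shares.items():
    print(f"{sub}: {pct:.2f}%")
# korean: 40.00%
# japanese: 20.00%
# chinese: 20.00%
```

The three shares add up to 80% here rather than 100%, because the denominator also contains food rows from other sub-categories (the `cafe` row). The same holds for the real data.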

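2) Computing the restaurant share for each region

# True for rows whose major category is 음식 (food); the same for every region
cond2 = df['상권업종대분류명'] == '음식'

# dropna() skips missing region names, which would otherwise match no rows and divide by zero
for sido in df['시도명'].dropna().unique():
    cond1 = df['시도명'] == sido
    print(f'{sido} restaurant share : {len(df.loc[cond1 & cond2]) / len(df[cond1]) * 100:.2f}%')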
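
The per-region share is a grouped ratio: for each region, food rows divided by all rows. A minimal standard-library sketch on made-up (region, major category) pairs, where the region names and labels are illustrative assumptions rather than real data, computes it in a single pass:

```python
from collections import defaultdict

# Hypothetical (region, major category) pairs
rows = [
    ("Seoul", "food"), ("Seoul", "retail"), ("Seoul", "food"),
    ("Busan", "food"), ("Busan", "retail"), ("Busan", "retail"),
    ("Busan", "service"),
]

totals = defaultdict(int)  # all rows per region
food = defaultdict(int)    # food rows per region
for region, major in rows:
    totals[region] += 1
    if major == "food":
        food[region] += 1

# Food share per region, in percent
region_share = {region: food[region] * 100 / totals[region] for region in totals}

for region, pct in region_share.items():
    print(f"{region}: {pct:.2f}%")
# Seoul: 66.67%
# Busan: 25.00%
```

Counting in one pass avoids re-filtering the whole table once per region, which starts to matter when the loop runs over the full dataset instead of a sample.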

3) Finding the words Korean restaurants use most often

  • Text mining : EDA on text data
    ex) finding keywords and related words
    ex) LDA (finds key words; widely used in social science research)
    - Social science research often draws on surveys (as well as Naver blogs, Instagram, etc.)

    [text mining process]

    1. Define the corpus (the text data to analyze)
      ex) Naver stock discussion boards

    2. Preprocessing (text cleaning) - takes the most time
    • Remove stopwords (profanity, specific words, etc.)
    • Unify forms (text normalization)
      ex) number agreement (apple, apples)
      ex) handling go, went, going
      ex) 경제 교육 / 경제교육 (the same term with and without a space) >> preprocess into one form
    • Use regular expressions
    • Handing the regular-expression part to ChatGPT can also be an option

    3. Tokenization (deciding the unit of analysis) 🌟🌟🌟
    • The results differ depending on this step
    • By character, by morpheme, etc.

    4. Modeling

    5. Visualization
  • pecab
    - morphs(sentence) : splits the given sentence into morphemes
    - nouns(sentence) : extracts only the nouns from the given sentence
    - pos(sentence) : performs POS (Part-Of-Speech) tagging
    - [Reference] Korean POS tags comparison chart, column J (https://docs.google.com/spreadsheets/d/1OGAjUvalBuX-oZvZ-9tEfYD2gQe7hTGsgUpiiBSXI8/edit#gid=0)
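
The first three steps of the text mining process above, followed by a simple frequency count, can be sketched with the standard library alone. This is an illustration under stated assumptions: the shop names and the stopword list are made up, the names are in English so that a simple regular expression suffices, and whitespace splitting stands in for a morphological analyzer such as pecab.

```python
import re
from collections import Counter

# 1. Corpus: hypothetical shop names (the notebook uses the business-name column)
corpus = ["Kim's  BBQ House!!", "kim bbq (2nd branch)", "Seoul Noodle House"]

# Assumed stopword list for this toy corpus
stopwords = {"branch", "2nd", "s"}


def clean(text):
    """2. Preprocessing: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """3. Tokenization: split on whitespace and drop stopwords."""
    return [tok for tok in text.split() if tok not in stopwords]


tokens = [tok for doc in corpus for tok in tokenize(clean(doc))]
top = Counter(tokens).most_common(3)
print(top)  # [('kim', 2), ('bbq', 2), ('house', 2)]
```

Changing only the tokenizer, for example to character n-grams or to nouns from a morphological analyzer, changes which words end up in the count, which is why the tokenization step is flagged as the most important one.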
# Business names of Korean restaurants; drop missing names so the tokenizer only sees strings
corpus = data.loc[data["상권업종중분류명"]=='한식', '상호명'].dropna()

from pecab import PeCab
pecab = PeCab()

# Extract only the nouns from each business name
tokenized_corpus = []

for doc in corpus:
    tokenized_corpus.extend(pecab.nouns(doc))

# Print the 30 most frequent words
from collections import Counter
counter = Counter(tokenized_corpus)
counter.most_common(30)

# Set the font file path (a font with Hangul glyphs is needed to render Korean words)
# font_location = '/System/Library/Fonts/AppleSDGothicNeo.ttc' # For Apple
font_location = 'C:/Windows/Fonts/Malgun.ttf' # For Windows


from wordcloud import WordCloud

# font_path : path to the font to use
# max_words : maximum number of words to show (by frequency)
# width : canvas width in pixels
# height : canvas height in pixels
# random_state : for reproducibility
# background_color : background color (default : black)
# colormap : color palette

wc = WordCloud(font_path=font_location,
               max_words=50,
               width=1920,
               height=1080,
               random_state=42,
               background_color='white',
               colormap='viridis').generate_from_frequencies(counter)

plt.imshow(wc) # draw the word cloud first so the saved figure is not empty
plt.axis('off') # hide the axes
plt.savefig('./wordcloud.png') # save the word cloud figure

0๊ฐœ์˜ ๋Œ“๊ธ€