💯[Python EDA 5] 공공데이터 상권 분석 EDA

김미연·2023년 8월 22일

[나만의 노트] Python EDA

목록 보기

6/9

국내 음식점 현황 및 트렌드 파악

DataSet : 소상공인 상권 정보

Jupyter notebook 활용
Python 활용

1. Setting

pecab : 형태소 분석기
- 단점 : 느리기 때문에 대용량 데이터에 사용하지 않는다

wordcloud : 단어 모음 출력 도구

# 텍스트 분석에 필요한 라이브러리 설치
!pip install pecab wordcloud

# 분석에 필요한 라이브러리 불러오기
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. 데이터 불러오기

glob : 파일 이름 규칙에 해당하는 파일경로 모두 가져오는 함수

tqdm : progress bar(데이터 불러올 때 진행 정도 확인 가능)

tab 활용 : 경로 입력 시 tab을 누르면 다음 경로 목록 확인 가능

from glob import glob
from tqdm.auto import tqdm

# 파일 이름 규칙에 해당하는 파일경로 모두 가져오기
file_list = sorted(glob('./data/상가(상권)정보_20230630/*.csv'))

data = pd.DataFrame()

# 해당 파일 모두 불러와 병합
for file in tqdm(file_list): # tqdm 설정
    temp = pd.read_csv(file)
    data = pd.concat([data, temp], axis=0)
    
# tqdm이 작동되지 않을 시 아래 코드로 비슷한 기능 구현 가능    
# for idx, file in enumerate(file_list):
#     print("Loading %d%%" % (idx/len(file_list)*100))
#     temp = pd.read_csv(file)
#     data = pd.concat([data, temp], axis=0)

3. 데이터 확인 및 가공

gc : garbage collector(시스템 소프트웨어)
- 필요 없는 메모리를 다시 가져옴(메모리 청소)
- 대용량 데이터 사용 시 필요(삭제 뒤 사용!)
gc.collect() # 메모리 반환

# 실제 메모리 사용 정도 확인
data.info(memory_usage='deep')

# 사용할 column을 찾기 위해 데이터의 일부를 사용
# df = data[:10000]
df = data.sample(n=10000, random_state=42) # 1만개 랜덤 추출

# 분석을 위해 필요한 컬럼 확인
df.iloc[:, 1:11] # 상호명, 상권업종대분류명, 상권업종중분류명
df.iloc[:, 11:15] # 시도명

# 필요 컬럼만을 담은 데이터 구축
data = data[['상호명', '상권업종대분류명', '상권업종중분류명', '시도명']]

# 메모리 반환
gc.collect()

4. 데이터 분석

1) 한식 / 일식 / 중식 음식점 비율 찾기

# 한식 음식점의 데이터
kr_restaurant = df[df['상권업종중분류명'] == '한식']
# 일식 음식점의 데이터
jp_restaurant = df[df['상권업종중분류명'] == '일식']
# 중식 음식점의 데이터
cn_restaurant = df[df['상권업종중분류명'] == '중식']
# 전체 음식점의 데이터
total_restaurant = df[df['상권업종대분류명'] == '음식']

# 한식/일식/중식 음식점 비율 출력
print(f'한식 음식점 비율 : {len(kr_restaurant) / len(total_restaurant) *100:.2f}%')
print(f'일식 음식점 비율 : {len(jp_restaurant) / len(total_restaurant) *100:.2f}%')
print(f'중식 음식점 비율 : {len(cn_restaurant) / len(total_restaurant) *100:.2f}%')

2) 각 지역별 음식점 비율 계산해보기

for sido in df['시도명'].unique():
    cond1 = df['시도명'] == sido
    cond2 = df['상권업종대분류명'] == '음식'
    print(f'{sido} 음식점 비율 : {len(df.loc[cond1 & cond2]) / len(df[cond1]) * 100:.2f}')

3) 한식 음식점들이 많이 사용하는 단어 찾아보기

텍스트 마이닝(text mining) : 텍스트 데이터로 EDA
ex) 키워드, 연관어 찾기
ex) LDA(주요 단어 찾기-사회과학 연구에 많이 사용)
- 사회과학 연구 시 설문조사 많이 활용(그 외 네이버 블로그, 인스타그램 등)

[text mining process]

corpus(분석할 텍스트 데이터) 정의
ex) 네이버 종목 토론실

전처리(text cleaning) - 가장 많은 시간 소요

불용어 제거(욕, 특정 단어 등)

형태 통일 (text normalization)
ex) 수 일치(apple, apples)
ex) go, went, going 처리
ex) 경제 교육/경제교육 >> 전처리

regular expression(정규 표현식) 활용

챗GPT에게 정규표현식 부분 맡기는 것도 한 방법일 수 있다

tokenization(분석 단위 결정) 🌟🌟🌟

이 과정에 따라 결과 다르게 나온다

문자, 형태소 기준 등

modeling

visualization

pecab
- morphs(문장) : 주어진 문장을 형태소 단위로 나누는 함수
- nouns(문장) : 주어진 문장에서 명사만 추출하는 함수
- pos(문장) : POS(Part-OfSpeech) tagging(=품사 분석)하는 함수
- [참고][Korean POS tags comparison chart] J열(https://docs.google.com/spreadsheets/d/1OGAjUvalBuX-oZvZ-9tEfYD2gQe7hTGsgUpiiBSXI8/edit#gid=0)

corpus = data.loc[data["상권업종중분류명"]=='한식', '상호명']

from pecab import PeCab
pecab = PeCab()

# 상호명에서 명사만 추출
tokenized_corpus = []

for doc in corpus:
    tokenized_corpus.extend(pecab.nouns(doc))

 # 빈도가 높은 30개 단어 출력
from collections import Counter
counter = Counter(tokenized_corpus)
counter.most_common(30)

# 글꼴 파일 경로 지정
# font_location = '/System/Library/Fonts/AppleSDGothicNeo.ttc' # For Apple
font_location = 'C:/Windows/Fonts/Malgun.ttf' # For Windows


from wordcloud import WordCloud

# font_path : 사용하는 글꼴의 경로
# max_words : 최대 몇개의 단어를 사용할지(빈도순)
# width : 가로 길이
# height : 세로 길이
# random_state : for reproducing
# background_color : 배경 색(기본색 : 검정)
# colormap : color palette

wc = WordCloud(font_path= font_location, 
               max_words=50, 
               width=1920, 
               height=1080, 
               random_state=42, 
               background_color='white', 
	colormap='viridis',).generate_from_frequencies(counter)

plt.axis('off') # 축을 출력하지 않음
plt.savefig('./wordcloud.png') # wordcloud 결과 저장
plt.imshow(wc)

김미연

이전 포스트

💯[Python EDA 5] 공공데이터 상권 분석 EDA

[나만의 노트] Python EDA

국내 음식점 현황 및 트렌드 파악

1. Setting

2. 데이터 불러오기

3. 데이터 확인 및 가공

4. 데이터 분석

1) 한식 / 일식 / 중식 음식점 비율 찾기

2) 각 지역별 음식점 비율 계산해보기

3) 한식 음식점들이 많이 사용하는 단어 찾아보기

🧑‍💻 [Python EDA 4] Seaborn

0개의 댓글

관련 채용 정보