[Zero-Base DS] Study Note: Web Data Analysis (05)

HAHAHAEUN · April 10, 2024

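Key topics

1. Loading web data

2. Data visualization

Data source

Opinet gas station information: https://www.opinet.co.kr/searRgSelect.do

Question to answer

Are self-service gas stations really cheaper?

I. Loading web data

1. Accessing the page with Selenium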

from selenium import webdriver
from selenium.webdriver.common.by import By

url = "https://www.opinet.co.kr/searRgSelect.do"
driver = webdriver.Chrome()
driver.get(url)

2. Region: getting the city/province (sido) list

Find the element in the browser's developer tools, then fetch the city/province data by its id.

sido_list_raw = driver.find_element(By.ID, "SIDO_NM0")

sido_list = sido_list_raw.find_elements(By.TAG_NAME, 'option')

Use a loop to collect every city/province value, then drop the unneeded first entry

Method 1) Using a for loop

sido_names = []

for option in sido_list:
    sido_names.append(option.get_attribute("value"))
sido_names

sido_names = sido_names[1:]

Method 2) Written in one line (list comprehension)

sido_names = [option.get_attribute("value") for option in sido_list]

sido_names = sido_names[1:]
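
Slicing with [1:] relies on the unneeded entry always being first. A minimal sketch of an alternative that filters by content instead. The values below are hypothetical stand-ins for what option.get_attribute("value") returns, and it assumes the placeholder option has an empty value, which should be checked on the real page.

```python
# Hypothetical option values; on the real page these come from
# option.get_attribute("value") for each <option> element.
raw_values = ["", "서울특별시", "부산광역시", "대구광역시"]

# Keep only non-empty values instead of relying on position.
sido_names = [value for value in raw_values if value]
print(sido_names)
```

If the placeholder turns out to carry a non-empty value, the positional slice used above is the simpler choice.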

3. Getting the districts (gu) of the selected city

Written for Seoul.
Repeats the same steps as the city/province retrieval in step 2.

gu_list_raw = driver.find_element(By.ID, "SIGUNGU_NM0") # parent tag
gu_list = gu_list_raw.find_elements(By.TAG_NAME, "option") # child tags

gu_names = [option.get_attribute("value") for option in gu_list]
gu_names = gu_names[1:]
gu_names, len(gu_names)

3. Downloading the Excel data for a single district

# Using a CSS selector
driver.find_element(By.CSS_SELECTOR, "#templ_list0 > div:nth-child(7) > div > a > span").click()

# Using XPath (same button; run only one of the two)
driver.find_element(By.XPATH, "//*[@id='templ_list0']/div[7]/div/a/span").click()

Use a for loop to download the Excel data for every district automatically

import time
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated

for gu in tqdm(gu_names):
    # Select the district by typing its name into the dropdown
    element = driver.find_element(By.ID, "SIGUNGU_NM0")
    element.send_keys(gu)
    time.sleep(3)

    # Click the Excel download button, then give the download time to finish
    driver.find_element(By.XPATH, "//*[@id='templ_list0']/div[7]/div/a/span").click()
    time.sleep(3)
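
A fixed time.sleep(3) after the click can be too short on a slow connection and wasteful on a fast one. One possible alternative, sketched here, is to poll the download folder until a new file appears. wait_for_new_file is a hypothetical helper (not part of Selenium), and it assumes the browser gives the file its final name only once the download has completed.

```python
import time
from glob import glob


def wait_for_new_file(pattern, known_files, timeout=30.0, poll=0.5):
    """Return the first path matching `pattern` that is not in `known_files`.

    Hypothetical helper: polls the folder until a new file shows up,
    and raises TimeoutError if none appears within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        new_files = sorted(set(glob(pattern)) - set(known_files))
        if new_files:
            return new_files[0]
        time.sleep(poll)
    raise TimeoutError(f"no new file matching {pattern!r} within {timeout}s")
```

Inside the loop, this would mean recording before = glob(pattern) ahead of the click and calling wait_for_new_file(pattern, before) in place of the second sleep, where pattern points at the browser's download folder.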

4. Organizing the downloaded data

Use glob to get the whole file list at once

import pandas as pd
from glob import glob

glob("../data/지역_*.xls")

Store the file names and set the header row

stations_files = glob("../data/지역_*.xls")

tmp = pd.read_excel(stations_files[0], header = 2)
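
glob does not guarantee the order of the paths it returns, so stations_files[0] may not be the same file on every machine. A small sketch of wrapping the call in sorted() for a reproducible order. The folder and file names below are hypothetical and created only for the demo.

```python
import os
import tempfile
from glob import glob

# Build a throwaway folder with hypothetical file names for the demo.
with tempfile.TemporaryDirectory() as data_dir:
    for name in ["region_b.xls", "region_a.xls", "other.txt"]:
        open(os.path.join(data_dir, name), "w").close()

    # sorted() turns glob's arbitrary order into alphabetical order.
    stations_files = sorted(glob(os.path.join(data_dir, "region_*.xls")))
    file_names = [os.path.basename(path) for path in stations_files]

print(file_names)
```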

Use a for loop to load all the data
- When the files share the same format and only need to be stacked one after another, use concat

tmp_raw = []

for file_name in stations_files:
    tmp = pd.read_excel(file_name, header=2)
    tmp_raw.append(tmp)
    
stations_raw = pd.concat(tmp_raw)
stations_raw
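
pd.concat keeps each file's own row index, so the combined frame ends up with repeated labels; this is why the index gets reset later in this note. A toy sketch of the effect and of the ignore_index option, using made-up station names and prices:

```python
import pandas as pd

# Two toy frames standing in for two downloaded district files.
df_a = pd.DataFrame({"상호": ["A주유소", "B주유소"], "휘발유": [1700, 1750]})
df_b = pd.DataFrame({"상호": ["C주유소"], "휘발유": [1690]})

# Default: each frame keeps its own index, so labels repeat.
stacked = pd.concat([df_a, df_b])

# ignore_index=True renumbers the rows 0..n-1 while concatenating.
renumbered = pd.concat([df_a, df_b], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```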

Create a DataFrame (only the columns needed)

stations = pd.DataFrame({
    "상호": stations_raw["상호"],
    "주소": stations_raw["주소"],
    "가격": stations_raw["휘발유"],
    "셀프": stations_raw["셀프여부"],
    "상표": stations_raw["상표"]    
})
stations.tail()

Extract only the district ("gu") from the address and add it to the DataFrame

for eachAddress in stations["주소"]:
    print(eachAddress.split())
    
stations["구"]  = [eachAddress.split()[1] for eachAddress in stations["주소"]]
stations
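
The extraction works because split() breaks the address on whitespace and the district is the second token. A tiny sketch with hypothetical addresses. It assumes every address follows the "city district street" shape, which is worth verifying afterwards with stations["구"].unique().

```python
# Hypothetical addresses in the "city district street ..." shape.
addresses = [
    "서울특별시 강남구 테헤란로 1",
    "서울 중구 세종대로 2",
]

# split() breaks on whitespace; element [1] is the district.
gu_list = [address.split()[1] for address in addresses]
print(gu_list)  # ['강남구', '중구']
```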

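Keep only stations that have price information

# Rows without a price are marked "-"; .copy() avoids SettingWithCopyWarning below
stations = stations[stations["가격"] != "-"].copy()

# Convert the price column from object to float
stations["가격"] = stations["가격"].astype("float")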

stations
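
Filtering on "-" only handles that one marker. As a sketch of a more general route, pd.to_numeric with errors="coerce" turns every non-numeric entry into NaN, which can then be dropped. The frame below is toy data, not the Opinet download.

```python
import pandas as pd

# Toy data: "-" marks a station without a listed price.
toy = pd.DataFrame({"상호": ["A", "B", "C"], "가격": ["1700", "-", "1689"]})

# errors="coerce" turns anything non-numeric into NaN.
toy["가격"] = pd.to_numeric(toy["가격"], errors="coerce")

# Drop rows whose price could not be parsed.
toy = toy.dropna(subset=["가격"])
print(toy)
```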

Reset the index and discard the old one

# drop=True discards the old index instead of adding it back as an "index" column
stations.reset_index(drop=True, inplace=True)
stations.tail()

II. Data visualization

1. Using boxplots

outlier
There are two categories of outliers: (1) outliers and (2) extreme points. Values above Q3 + 1.5 × IQR or below Q1 - 1.5 × IQR are considered outliers. Values above Q3 + 3 × IQR or below Q1 - 3 × IQR are considered extreme points (or extreme outliers).
[Source] https://rpkgs.datanovia.com/rstatix/reference/outliers.html#:~:text=Boxplots%20are%20a%20popular%20and,points%20(or%20extreme%20outliers).
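
The two thresholds can be worked through on a small sample. The prices below are hypothetical values chosen for illustration, and the quartiles use linear interpolation between data points (the "inclusive" method of the standard library), which is one of several quartile conventions.

```python
import statistics

# Hypothetical gasoline prices (KRW per liter), for illustration only.
prices = [1620, 1650, 1670, 1690, 1700, 1710, 1730, 1850, 2000]

# Quartiles with linear interpolation between data points.
q1, _median, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1


def classify(value):
    """Label a value using the 1.5 x IQR and 3 x IQR fences."""
    if value > q3 + 3 * iqr or value < q1 - 3 * iqr:
        return "extreme"
    if value > q3 + 1.5 * iqr or value < q1 - 1.5 * iqr:
        return "outlier"
    return "normal"


labels = {price: classify(price) for price in prices}
print(q1, q3, iqr)                 # 1670.0 1730.0 60.0
print(labels[1850], labels[2000])  # outlier extreme
```

Here the fences sit at 1580 and 1820 for outliers and at 1490 and 1910 for extreme points, so 1850 is an outlier and 2000 is an extreme point.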

1) Import modules and set the font

import matplotlib.pyplot as plt
import seaborn as sns
import platform
from matplotlib import font_manager, rc

get_ipython().run_line_magic("matplotlib", "inline")

path = "C:/Windows/Fonts/malgun.ttf"

if platform.system() == "Darwin":
    rc("font", family = "Arial Unicode MS")
elif platform.system() == "Windows":
    font_name = font_manager.FontProperties(fname = path).get_name()
    rc("font", family = font_name)
else:
    print("Unknown system")

2) boxplot(feat.pandas)

stations.boxplot(column="가격", by="셀프", figsize = (12,8));

Not every self-service station is cheaper, but the plot shows that self-service stations are cheaper on the whole.

3) boxplot(feat.seaborn)

plt.figure(figsize = (12, 8))
sns.boxplot(x ="셀프", y="가격", data=stations, palette="Set3")
plt.grid(True)
plt.show()

seaborn produces a more visually polished plot.

4) boxplot(feat.seaborn), by brand

plt.figure(figsize = (12, 8))
sns.boxplot(x="상표", y="가격", hue="셀프", data = stations, palette="Set3")
plt.show()

2. Map visualization (Seoul gas stations)

1) Import modules

import json
import folium
import numpy as np
import warnings
warnings.simplefilter(action="ignore", category = FutureWarning)

2) Create the pivot table

gu_data = pd.pivot_table(data = stations, index="구", values="가격", aggfunc="mean")  # string name avoids the np.mean FutureWarning
gu_data.head()
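
With a single index column and a single value column, the pivot table is just the per-district mean price, indexed by district name. A toy sketch showing that a groupby mean gives the same numbers, using hypothetical prices:

```python
import pandas as pd

# Toy data with hypothetical prices for two districts.
toy = pd.DataFrame({
    "구": ["강남구", "강남구", "중구"],
    "가격": [1800.0, 1900.0, 1700.0],
})

# pivot_table with one index column and one value column...
by_pivot = pd.pivot_table(data=toy, index="구", values="가격", aggfunc="mean")

# ...gives the same numbers as a groupby mean on that column.
by_group = toy.groupby("구")[["가격"]].mean()

print(by_pivot)
```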

3) Visualizing on a map (folium)

geo_path = "../data/02. skorea_municipalities_geo_simple.json"
with open(geo_path, encoding="utf-8") as f:  # close the file after loading
    geo_str = json.load(f)

my_map = folium.Map(
    location = [37.5502, 126.982],
    zoom_start = 10.5,
    tiles = "OpenStreetMap"
)
folium.Choropleth(
    geo_data = geo_str,
    data = gu_data,
    columns = [gu_data.index, "가격"],
    key_on = "feature.id",
    fill_color = "PuRd"
).add_to(my_map)

my_map