[EDA] 03. 웹 데이터 분석

쩡이·2023년 8월 8일

EDA 데이터분석 데이터취업스쿨 웹크롤링

EDA

목록 보기

6/12

웹크롤링
요즘은 웹에서 데이터를 얻는 것을 거의 웹크롤링이라고 하는 경향이 있음

Beautiful Soup

: 파이썬 유저들이 웹데이터를 얻을 때, 가장 많이 사용하는 모듈

설치

conda install -c anaconda beautifulsoup4
pip install beautifulsoup4

html 기초 및 파일작성

태그를 사용함
HTML 태그는 웹 페이지를 표현
HEAD 태그는 눈에 보이지 않지만 문서에 필요한 헤더 정보를 보관 HEAD 태그 안에 title 태그에는 웹 페이지의 이름을 적어준다.
BODY 태그는 눈에 보이는 정보를 보관
p 태그는 문단을 의미함
a 태그는 링크를 담고 있음

<!DOCTYPE html>
<html>
    <head>
        <title>Very Simple HTML Code by Jeonge</title>
    </head>
    <body>
        <div>
            <p class="inner-text first-item" id="first">
                Happy Jeonge.
                <a href="https://velog.io/@yujeonge" id="je-link">Jeonge</a>
            </p>
            <p class="inner-text second-item">
                Happy Data Science.
                <a href="https://www.python.org" target="_blink" id="py-link">Python</a>
            </p> #target="_blink" : 링크를 새 창으로 연결
        </div>
        <p class="outer-text first-item" id="second">
            <b>Data Science is funny.</b> #<b> : 볼드체
        </p> 
        <p class="outer-text">
            <i>All I need is Love.</i>  #<i> : 필기체
        </p>
    </body>
</html>

Beautiful Soup 기초

html 파일을 읽어야하므로 open 명령이 필요함
open : 파일명과 함께 읽기(r)/쓰기(w) 지정
html.parser : Beautiful Soup의 html을 읽는 엔진(lxml도 많이 사용, lxml은 환경을 새로만들었다면 설치해야함)
prettify() : 들여쓰기를 해서 보기 좋게 출력함

파일 불러오기

그냥 변수명으로 불러오면 가독성이 떨어짐
print(변수명) print함수를 쓰면 좀 낫지만, 더 보기 좋은 매서드인 prettify()를 활용 -> 들여쓰기가 있어서 보기 좋게 출력됨

from bs4 import BeautifulSoup
page = open("../data/03. zerobase.html", "r").read()
soup = BeautifulSoup(page, "html.parser") 
print(soup.prettify())

태그 확인

# head 태그 확인
soup.head

# body 태그 확인
soup.body

# p 태그 확인
# 처음 발견한 p 태그만 출력
soup.p

# find()로 태그 확인 - 가장 상단에 있는 태그 하나만 찾음
soup.find("p")

#find()에 조건주기
soup.find("p", class_="inner-text second-item") #class_ : 파이썬 예약어 class와 겹치지 않기 위해 언더바 사용
soup.find("p", {"class":"outer-text first-item"}).text.strip() #strip() : 공백지우기

# 다중 조건 
soup.find("p", {"class":"inner-text first-item","id":"first"})

#find_all() : 찾고자 하는 태그를 모두 찾음, list 형태로 반환
soup.find_all("p")

# find_all()로 특정 태그 확인
soup.find_all(class_="outer-text") #class 명으로 찾기
soup.find_all(id="je-link") #id로 찾기 

#text만 출력
soup.find_all(id="je-link")[0].text #find_all이 리스트 형태로 반환하기 때문에 인덱스를 넣어줘야함

# text 출력 : text, string, get_text()
print(soup.find_all("p")[0].text)
print(soup.find_all("p")[1].string)
print(soup.find_all("p")[2].get_text())

# p 태그 리스트에서 텍스트 속성만 출력
for each_tag in soup.find_all("p"):
    print("="*50)
    print(each_tag.text)
    
# a 태그에서 href 속성값에 있는 값 추출
links = soup.find_all("a")
links

# 주소 값만 받아오기 : get("href"), ["href"] 마스킹
#리스트의 첫번째 값에서 href의 값을 가져온다
links[0].get("href") 
links[1]["href"]

#text와 주소 받아오기
for each in links:
    href = each.get("href")
    text = each.get_text()
    print(text + "->" + href)

크롬 개발자 도구

크롬에서 오른쪽 상단 점 세개 - 도구 더보기 - 개발자도구
또는 ctrl+shift+i

예제 1-1 환율 정보 가져오기

참고 사이트
https://finance.naver.com/marketindex/

<가져올 정보>
국가명
환율
변동금액
상승/하락 여부
해당 국가 환율 차트 사이트

#데이터 가져오기 방법1
# import
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://finance.naver.com/marketindex/"
page = urlopen(url)
or
response = urlopen(url)
response.status #결과값은 HTML 상태 코드/HTML 상태 코드는 문제 발생시, 어떤 문제인지 파악하기 쉽다
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())

#데이터 가져오기 방법2 - requests 모듈 이용
import requests #from urllib.request.Request와 같은 기능
from bs4 import BeautifulSoup

url = "https://finance.naver.com/marketindex/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())

#requests 모듈은 response에 status를 쓰지 않아도 HMTL상태코드가 나옴
#requests.get(), requests.post() 두가지 방식이 있다고 한다.
#requests 모듈 설치 : !pip install requests

#데이터 추출 방법 - find or select 이용
#select_one, find : 단일선택 / select, find_all : 다중선택
#방법 1~4 find 이용
#방법1 - 이름만 이용
soup.find_all("span", "value"), len(soup.find_all("span", "value"))

#방법2 - class 이용
soup.find_all("span", class_="value"), len(soup.find_all("span", "value"))

#방법3 - 딕셔너리 이용
soup.find_all("span",{"class":"value"}), len(soup.find_all("span",{"class":"value"}))

# 금액만 가져오는 3가지 방법
soup.find_all("span",{"class":"value"})[0].text
soup.find_all("span",{"class":"value"})[0].string
soup.find_all("span",{"class":"value"})[0].get_text()

#select 이용
exchangeList = soup.select("#exchangeList > li")
len(exchangeList), exchangeList
#class-> . 사용, id -> # 사용

# 가져올 정보 추출
title = exchangeList[0].select_one(".h_lst").text
exchange = exchangeList[0].select_one(".value").text
change = exchangeList[0].select_one(".change").text
updown = exchangeList[0].select(".blind")[2].text
또는
updown = exchangeList[0].select_one(".head_info.point_up > .blind").text 
# link
baseurl = "https://finance.naver.com"
baseurl + exchangeList[0].select_one("a").get("href") # 사이트 기본 주소 값이 없어서 따로 처리를 해줌

# 환율이 감소폭이면 위 updown의 코드에서 ".head_info.point_up"에서 up에서 dn으로 바뀜
#원래는 클래스 명이 "head_info point_up"인데 이것을 그대로 쓰면 안되고 속성값이 2개 있는 클래스라고 생각해서 ".head_info.point_up"으로 쓴다.
# > : head_info point_up에 직접 연결되는 하위 blind를 선택하는 것, >가 없다면 밑에 있는 모든 하위 태그를 선택
title, exchange, change, updown

(결과)
('미국 USD', '1,316.50', '9.00', '상승')

# 4개 데이터 수집

exchange_datas = []
baseurl = "https://finance.naver.com"

for item in exchangeList:
    data = {
        "title": item.select_one(".h_lst").text,
        "exchange":item.select_one(".value").text,
        "change":item.select_one(".change").text,
        "updown":item.select(".blind")[2].text,
        "link":baseurl + item.select_one("a").get("href")
    }
    exchange_datas.append(data)
exchange_datas
import pandas as pd
df = pd.DataFrame(exchange_datas)
df
df.to_excel("./naverfinance.xlsx", encoding="utf-8") 

# . ->현재경로, .. -> 상위 경로
# updown = exchangeList[0].select_one(".head_info.point_up > .blind").text  이 코드는 환율 변동이 하락인 나라에서는 쓸 수 없기 때문에 에러가 발생해서 사용하지 않았음

#위 내용을 .py 파일형태로
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://finance.naver.com/marketindex/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")


exchangeList = soup.select("#exchangeList > li")

exchange_datas = []
baseurl = "https://finance.naver.com"

for item in exchangeList:
    data = {
        "title": item.select_one(".h_lst").text,
        "exchange":item.select_one(".value").text,
        "change":item.select_one(".change").text,
        "updown":item.select(".blind")[2].text,
        "link":baseurl + item.select_one("a").get("href")
    }
    exchange_datas.append(data)
df = pd.DataFrame(exchange_datas)
df.to_excel("./naverfinance.xlsx", encoding="utf-8")

저장된 엑셀파일

위키백과 데이터 가져오기

import urllib
from urllib.request import urlopen, Request

# html = "https://ko.wikipedia.org/wiki/%ED%82%B9%EB%8D%94%EB%9E%9C%EB%93%9C" #주소에서 wiki/ 뒷부분 한글이 깨져서 나오므로 따로 처리를 해줌
#https://ko.wikipedia.org/wiki/킹더랜드 #url decode 검색해서 복사해온 것
html = "https://ko.wikipedia.org/wiki/{search_words}"
req = Request(html.format(search_words=urllib.parse.quote("킹더랜드"))) # 글자를 URL로 인코딩
response = urlopen(req)
soup = BeautifulSoup(response,"html.parser")
print(soup.prettify())

#주요 인물 출력
#예시
soup.find_all("li")[76].text.strip().replace("\xa0:", "").split()[0]

print("<주요 인물>")
for i in range(76, 82):
    print(soup.find_all("li")[i].text.strip().replace("\xa0:", "").split()[0])

(결과)
<주요 인물>
이준호
임윤아
김가은
고원희
안세하
김재원

쩡이

이전 포스트

[EDA] 02. 서울시 범죄 현황 분석 18~32

다음 포스트