Web Crawling

Variety_ · November 2, 2021

Data Analysis

Fetching Wikipedia article information

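If you copy the URL of a web page whose address contains Korean and paste it into Notepad or a Jupyter cell, it shows up looking garbled, because non-ASCII characters in a URL have to be percent-encoded as UTF-8. => Search Google for "URL Decode" and use an online decoder, or format the URL as shown below!
In a string, a name wrapped in braces ( {} ) is treated as a placeholder that str.format() fills in.
quote : converts Korean text into its percent-encoded UTF-8 form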

import urllib.parse
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

html = "https://ko.wikipedia.org/wiki/{search_words}"
req = Request(html.format(search_words=urllib.parse.quote("여명의_눈동자")))

response = urlopen(req)

soup = BeautifulSoup(response, "html.parser")
soup
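
To see what quote actually produces, here is a minimal sketch that encodes the same article title, decodes it again with unquote, and builds the URL with str.format(). The variable names (title, encoded, decoded, url) are only for illustration.

```python
from urllib.parse import quote, unquote

title = "여명의_눈동자"

# quote() percent-encodes the UTF-8 bytes of every non-ASCII character;
# the underscore is a safe character and is left unchanged.
encoded = quote(title)

# unquote() reverses the encoding and gives back the original text.
decoded = unquote(encoded)

# The {search_words} placeholder is filled in by str.format().
url = "https://ko.wikipedia.org/wiki/{search_words}".format(search_words=encoded)

print(encoded)
print(decoded)
print(url)
```

This is also why a pasted URL looks "broken": the browser shows the decoded title, but the clipboard holds the encoded form.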

Finding the cast information

replace() : replaces a given substring with the string you want

n = 0
for each in soup.find_all("ul"):
    print("=>" + str(n) + "====================")
    print(each.get_text())
    n += 1
    
soup.find_all("ul")[15].text.strip().replace("\xa0","").replace("\n","")
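
The cleanup chain above can be tried without a network request. The sketch below runs the same strip().replace() calls on a made-up string (raw is hypothetical, shaped roughly like what get_text() returns for a list) and also shows a variant that substitutes a separator instead of deleting the characters.

```python
# Hypothetical text, shaped like the output of get_text() on a <ul> element.
raw = "\n first item\xa0(note)\nsecond item \n"

# Same chain as in the post: trim the ends, then delete non-breaking
# spaces (\xa0) and newlines. Each replace() returns a new string.
cleaned = raw.strip().replace("\xa0", "").replace("\n", "")

# Variant: keep the items readable by substituting separators instead.
spaced = raw.strip().replace("\xa0", " ").replace("\n", ", ")

print(repr(cleaned))
print(repr(spaced))
```

Strings are immutable, so raw itself is never modified; only the returned values change.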

The list data type

A list is created with square brackets.
extend() : appends multiple items to the end
insert() : inserts an item at the position you choose
isinstance(data, type) : checks the data type and returns True/False

colors = ['red', 'blue', 'green']
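b = colors    # b refers to the same list object, so changing b also changes colors
b[1] = 'black'
colors
# shallow copy: c is a new list object, so reassigning c's items does not affect colors
c = colors.copy()

# Using the in operator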
if 'black' in colors:
    print("True")

colors.extend(['pink', 'yellow'])

colors.insert(1, "purple")

isinstance(colors, list)
output : True
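
As a small sketch of the difference between a reference, list.copy(), and copy.deepcopy(): list.copy() creates a new outer list but still shares any nested objects, so it is a shallow copy. The names alias, shallow, nested, shallow_nested and deep_nested are made up for this illustration.

```python
import copy

colors = ["red", "blue", "green"]

alias = colors          # same list object, not a copy
alias[1] = "black"      # also visible through colors

shallow = colors.copy() # new list object with the same items
shallow[0] = "white"    # colors is not affected

# With nested lists the difference between shallow and deep copies shows up.
nested = [["red", "blue"], ["green"]]
shallow_nested = nested.copy()        # inner lists are still shared
deep_nested = copy.deepcopy(nested)   # inner lists are copied too
nested[0][0] = "black"

colors.extend(["pink", "yellow"])     # add several items at the end
colors.insert(1, "purple")            # insert at index 1

print(colors)
print(shallow_nested, deep_nested)
```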

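Analyzing Chicago restaurant data

Collect each restaurant's information from 51 pages in total
- restaurant name, signature menu item, price of the signature menu item, restaurant address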

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
from fake_useragent import UserAgent
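# The site moved from http to https, which caused an error; the unverified SSL context below works around it
# (note that it skips certificate verification)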
import ssl
context = ssl._create_unverified_context()

url_base = "https://www.chicagomag.com"
url_sub = "/Chicago-Magazine/November-2012/Best-Sandwiches-Chicago/"
url = url_base + url_sub

ua = UserAgent()
ua.ie

req = Request(url, headers={"User-Agent" : ua.ie})

response = urlopen(req, context=context)
response.status

soup = BeautifulSoup(response, "html.parser")

# print(soup.prettify())

If the type is bs4.element.Tag, it means you can call find on it.

tmp_one = soup.find_all("div", "sammy")[0]
type(tmp_one)

tmp_one.find(class_="sammyRank")

tmp_one.find(class_="sammyRank").get_text()

tmp_one.find(class_="sammyListing").get_text()
tmp_one.find("a")["href"]

output:'/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/'
# The linked page address is a relative path

import re
tmp_string = tmp_one.find(class_="sammyListing").get_text()
re.split(("\n|\r\n"),tmp_string)

output : ['BLT', 'Old Oak Tap', 'Read more ']
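
The split step can be checked offline. This sketch uses a hand-written listing string (hypothetical, mixing both line-ending styles) with the same pattern as above, and compares it with str.splitlines(), which is a possible alternative.

```python
import re

# Hypothetical listing text containing both \r\n and \n line endings.
listing = "BLT\r\nOld Oak Tap\nRead more "

# Same pattern as in the post: split on either kind of line break.
parts = re.split("\n|\r\n", listing)
menu, cafe = parts[0], parts[1]

# str.splitlines() handles both line endings without a regex.
same_parts = listing.splitlines()

print(parts)
print(menu, "/", cafe)
```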

# Module for handling both relative and absolute addresses
from urllib.parse import urljoin

url_base = "http://www.chicagomag.com"

rank = []
main_menu = []
cafe_name = []
url_add = []
# Prepare empty lists for the data we need; each list becomes one column, and they are combined into a DataFrame later

list_soup = soup.find_all("div", "sammy")

# urljoin : if the second argument is an absolute URL it is returned as-is; if it is relative, it is resolved against url_base
for item in list_soup:
    rank.append(item.find(class_="sammyRank").get_text())
    tmp_string = item.find(class_="sammyListing").get_text()
    main_menu.append(re.split(("\n|\r\n"), tmp_string)[0])
    cafe_name.append(re.split(("\n|\r\n"), tmp_string)[1])
    url_add.append(urljoin(url_base, item.find("a")["href"]))
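
The behaviour of urljoin can be confirmed without fetching anything. In this sketch the relative path is the one shown earlier, while the absolute URL is a made-up example used only to show that url_base is ignored in that case.

```python
from urllib.parse import urljoin

url_base = "http://www.chicagomag.com"

# Relative path taken from the href shown earlier in the post.
relative = "/Chicago-Magazine/November-2012/Best-Sandwiches-in-Chicago-Old-Oak-Tap-BLT/"

# Hypothetical absolute URL, only for illustration.
absolute = "https://www.chicagomag.com/some/other/page/"

joined_relative = urljoin(url_base, relative)  # base + path
joined_absolute = urljoin(url_base, absolute)  # returned unchanged

print(joined_relative)
print(joined_absolute)
```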

Combining into a DataFrame

import pandas as pd

data = {"Rank": rank, "Menu":main_menu, "Cafe":cafe_name, "URL": url_add}
df = pd.DataFrame(data)
df.head()

df = pd.DataFrame(data, columns=["Rank", "Cafe", "Menu", "URL"])
df.head()
# Reorder the columns

# Save
df.to_csv(
    "./data/03. best_sandwiches_list_chicago.csv",
    sep=",",
    encoding="UTF-8"
)

Analyzing the sub-pages

df["URL"][0]

req = Request(df["URL"][0], headers={"User-Agent" : "Chrome"})
html = urlopen(req, context=context).read()
soup_tmp = BeautifulSoup(html, "html.parser")
print(soup_tmp.find("p", "addy"))

output : <p class="addy">
<em>$10. 2109 W. Chicago Ave., 773-772-0406, <a href="http://www.theoldoaktap.com/">theoldoaktap.com</a></em></p>

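We only want the price, but it sits in the same text as the address => use a regular expression

.x	any single character followed by x (the match ends with x)
x+	x repeated one or more times
x?	x is present or absent (zero or one time)
x*	x repeated zero or more times
x|y	matches x or y (the or operator)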
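
Each row of the table can be tried with re.findall. The patterns and sample strings below are made up for illustration and are not taken from the restaurant pages.

```python
import re

# .x : any single character followed by x
any_then_x = re.findall(r".x", "ax bx")

# x+ : one or more x
one_or_more = re.findall(r"x+", "x xx xxx")

# u? : the u may be present or absent
optional = re.findall(r"colou?r", "color colour")

# x* : zero or more x after the a
zero_or_more = re.findall(r"ax*", "a ax axx")

# x|y : either alternative
either = re.findall(r"cat|dog", "cat dog bird")

print(any_then_x, one_or_more, optional, zero_or_more, either)
```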

price_tmp = soup_tmp.find("p", "addy").get_text()
price_tmp

import re

re.split(".,", price_tmp)

price_tmp = re.split(".,", price_tmp)[0]
price_tmp

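tmp = re.search(r"\$\d+\.(\d+)?", price_tmp).group()
price_tmp[len(tmp) + 2:]
# \$ requires a literal $, \d+ allows one or more digits, \. requires a period, and (\d+)? means digits may or may not follow it
# The position where the price ends is used to treat everything after it as the address
# (+2 skips the leading newline and the space after the price)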
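
The price/address split can be wrapped in a small helper and tested on plain strings. This is only a sketch: split_price_address is a hypothetical name, it uses match.end() and strip() instead of the fixed + 2 offset, and the second input string is invented to show how the ".," split behaves when there is no period before the comma.

```python
import re

def split_price_address(text):
    """Return (price, address) from the text of a <p class="addy"> element."""
    # Keep everything before the first "<any character>," sequence.
    head = re.split(".,", text)[0]
    match = re.search(r"\$\d+\.(\d+)?", head)
    if match is None:
        # No price found: treat the whole head as the address.
        return None, head.strip()
    # Everything after the end of the price is the address.
    return match.group(), head[match.end():].strip()

# Text corresponding to the <p class="addy"> output shown above.
sample = "\n$10. 2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com"
price, address = split_price_address(sample)

# Hypothetical input without a period before the first comma.
multi_price, multi_address = split_price_address("\n$8. Multiple locations, example.com")

print(price, "|", address)
print(multi_price, "|", multi_address)
```

Because "." matches any character, the split also consumes the character just before the comma, which is why the second address comes back as "Multiple location" without the final s.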

When you run a for loop and cannot tell whether it is still working or how much time is left => TQDM

from tqdm import tqdm
price = []
address = []

# Wrapping the iterator in tqdm shows a progress bar; total tells it how many rows to expect
for idx, row in tqdm(df.iterrows(), total=len(df)):
    req = Request(row["URL"], headers={"User-Agent" : "Chrome"})
    html = urlopen(req, context=context).read()
    
    soup_tmp = BeautifulSoup(html, "html.parser")
    
    gettings = soup_tmp.find("p", "addy").get_text()
    
    # Keep the text before the first ".,", then pull the price off the front
    price_tmp = re.split(".,", gettings)[0]
    tmp = re.search(r"\$\d+\.(\d+)?", price_tmp).group()
    
    price.append(tmp)
    address.append(price_tmp[len(tmp) + 2 :])

Tidying up the DataFrame

df["Price"] = price
df["Address"] = address
df = df.loc[:,["Rank", "Cafe", "Menu", "Price", "Address"]]
df.set_index("Rank", inplace=True)
df.head()

Visualizing Chicago restaurant data on a map

import folium
import pandas as pd
import googlemaps
import numpy as np
from tqdm import tqdm

df = pd.read_csv("./data/03. best_sandwiches_list_chicago2.csv", index_col=0)
df.head()

gmaps_key = "YOUR_API_KEY"  # placeholder: put your own Google Maps API key here
gmaps = googlemaps.Client(key=gmaps_key)

lat = []
lng = []
for idx, row in tqdm(df.iterrows()):
    if not row["Address"] == "Multiple location":
        target_name = row["Address"] + ", " + "Chicago"
        gmaps_output = gmaps.geocode(target_name)
        location_output = gmaps_output[0].get("geometry")
        lat.append(location_output["location"]["lat"])
        lng.append(location_output["location"]["lng"])
    else:
        lat.append(np.nan)
        lng.append(np.nan)
df["lat"] = lat
df["lng"] = lng
df.head()
mapping = folium.Map(location=[41.895558, -87.679967], zoom_start=11)
for idx, row in df.iterrows():
    if not row["Address"] == "Multiple location":
        folium.Marker([row["lat"], row["lng"]], popup=row["Cafe"]).add_to(mapping)
mapping
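
The geocoding loop above indexes gmaps_output[0] directly, which would raise an IndexError if a lookup came back empty. The helper below is a defensive sketch under that assumption (that the response can be an empty list); extract_lat_lng and fake_result are hypothetical names, and fake_result only mimics the nested keys the loop above reads.

```python
import math

def extract_lat_lng(geocode_result):
    """Return (lat, lng) from a geocode response, or (nan, nan) if it is empty."""
    if not geocode_result:
        return math.nan, math.nan
    location = geocode_result[0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Hand-written stand-in for a geocode response; no API call is made here.
fake_result = [{"geometry": {"location": {"lat": 41.8956, "lng": -87.68}}}]

found = extract_lat_lng(fake_result)
missing = extract_lat_lng([])

print(found)
print(missing)
```

Inside the loop, the two append calls could then take their values from extract_lat_lng(gmaps_output), so rows with no result get NaN just like the "Multiple location" rows.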
