Web Scraping (with BeautifulSoup)

니나노개발생활 · April 25, 2021



requests

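First, understand that a web page is a document written in a format called HTML!
To read what the document contains, we have to send a request for it.
As shown below, pass the URL to requests.get(). (The URL used here is an open API that returns JSON rather than HTML, so the body is parsed with r.json().)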

import requests # the requests library must be installed first

r = requests.get('http://openapi.seoul.go.kr:8088/6d4d776b466c656533356a4b4b5872/json/RealtimeCityAir/1/99')
rjson = r.json()

print(rjson['RealtimeCityAir']['row'][0]['NO2'])
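The nested lookup above can be tried without calling the API. This is a minimal sketch that assumes the response has the shape the code above relies on (a `RealtimeCityAir` object holding a `row` list whose items carry an `NO2` field). The sample values and the helper name `no2_values` are made up for illustration.

```python
import json

# Hypothetical sample shaped like the response used above; the numbers are invented.
SAMPLE = '''
{"RealtimeCityAir": {"row": [{"NO2": 0.021}, {"NO2": 0.034}, {"NO2": 0.018}]}}
'''


def no2_values(payload):
    """Return the NO2 value of every row in an already-parsed response."""
    rows = payload.get("RealtimeCityAir", {}).get("row", [])
    return [row["NO2"] for row in rows if "NO2" in row]


payload = json.loads(SAMPLE)  # r.json() performs this parsing step on a live response
values = no2_values(payload)

print(values[0])  # same value as payload['RealtimeCityAir']['row'][0]['NO2']
print(values)     # NO2 for every row instead of only the first one
```

Using `.get()` with defaults means an unexpected payload yields an empty list instead of raising a `KeyError`.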

beautifulsoup

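If you print the raw response from requests, it is hard to make sense of the data at a glance.
BeautifulSoup is the Python library that lets you navigate an HTML document and easily pick out just the parts you want!

Let's see how to scrape (crawl) with this library.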

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=pnt&date=20200303',headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')

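Why pass headers: many pages block requests that look like they come from a script, so sending a browser User-Agent makes the request look as if someone typed the URL into a browser and pressed Enter!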
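To see what the header actually changes, the request can be built without being sent. This is a small sketch, not part of the original post: it uses `requests.Request(...).prepare()` to inspect the outgoing headers offline and `requests.utils.default_user_agent()` to show what requests would send if no User-Agent were supplied.

```python
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
url = 'https://movie.naver.com/movie/sdb/rank/rmovie.nhn?sel=pnt&date=20200303'

# Build the request without sending it, so the headers can be inspected offline.
prepared = requests.Request('GET', url, headers=headers).prepare()
custom_agent = prepared.headers['User-Agent']

# What requests identifies itself as when no User-Agent is supplied.
default_agent = requests.utils.default_user_agent()

print(default_agent)  # starts with "python-requests/"
print(custom_agent)   # the browser-like string defined above
```

A server that filters on User-Agent can tell the two apart, which is why the default one is often rejected.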

- select_one : returns only the first element that matches the selector

title = soup.select_one('#old_content > table > tbody > tr:nth-child(2) > td.title > div > a')

📍 When copying the path from the browser's developer tools, be sure to use Copy selector!

print(title)          ## Result: <a href="/movie/bi/mi/basic.nhn?code=171539" title="그린 북">그린 북</a>
print(title.text)     ## Result: 그린 북 (text only)
print(title['href'])  ## Result: /movie/bi/mi/basic.nhn?code=171539 (value of the href attribute only)
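The same calls can be tried offline on a small HTML string. This is a sketch under assumptions: the markup below is a hand-written stand-in that only mimics the nesting the selector expects, not the real page source.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the ranking table; not the real page markup.
html = """
<div id="old_content">
  <table>
    <tbody>
      <tr><th>rank</th><th>title</th></tr>
      <tr>
        <td class="title"><div><a href="/movie/bi/mi/basic.nhn?code=171539" title="그린 북">그린 북</a></div></td>
      </tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
title = soup.select_one('#old_content > table > tbody > tr:nth-child(2) > td.title > div > a')

print(title.text)           # text inside the tag
print(title['href'])        # attribute value; raises KeyError if the attribute is absent
print(title.get('target'))  # .get() returns None for a missing attribute instead of raising

# select_one returns None when nothing matches the selector.
missing = soup.select_one('#old_content > table > tbody > tr:nth-child(9) > td.title > div > a')
print(missing)
```

Because `select_one` can return `None`, reading `.text` on the result without a check fails whenever the selector matches nothing.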

- select : returns every matching element as a list

#old_content > table > tbody > tr:nth-child(2) > td.title > div > a
#old_content > table > tbody > tr:nth-child(3) > td.title > div > a

📍 Pass select only the part the copied selectors have in common (up to tr here)! When copying, be sure to use Copy selector!


trs = soup.select('#old_content > table > tbody > tr')

for tr in trs:
    print(tr)     ## Result: each <tr> element in the list is printed in turn

Scraping

trs = soup.select('#old_content > table > tbody > tr')

for tr in trs :
    a_tag = tr.select_one('td.title > div > a')
    print(a_tag)

I only want the text from the printed results, but there are None values scattered in between..!
(Whether this happens depends on the structure of the page you are scraping.)

for tr in trs :
    a_tag = tr.select_one('td.title > div > a')
    if a_tag is not None :
        print(a_tag.text)

## Result
그린 북
가버나움
베일리 어게인
주전장 ...
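The post does not show why some rows give None. One plausible cause, assumed here, is rows that have no title link at all, such as spacer rows. The sketch below reproduces that situation on hand-written markup (the `blank` row is hypothetical) and filters it out the same way.

```python
from bs4 import BeautifulSoup

# Hand-written markup; the middle row has no td.title link, like a spacer row.
html = """
<div id="old_content">
  <table>
    <tbody>
      <tr><td class="title"><div><a href="#">그린 북</a></div></td></tr>
      <tr><td colspan="3" class="blank"></td></tr>
      <tr><td class="title"><div><a href="#">가버나움</a></div></td></tr>
    </tbody>
  </table>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
trs = soup.select('#old_content > table > tbody > tr')

# select_one yields None for the row that has no matching link.
links = [tr.select_one('td.title > div > a') for tr in trs]
print([link is None for link in links])

# Keep only rows where a link was found before touching .text.
titles = [link.text for link in links if link is not None]
print(titles)
```

Skipping the check and calling `.text` on every result would raise `AttributeError` on the spacer row.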

quiz

trs = soup.select('#old_content > table > tbody > tr')

for tr in trs:
    movie = tr.select_one('td.title > div > a')
    if movie is not None:
        title = movie.text
        rank = tr.select_one('td:nth-child(1) > img')['alt']
        star = tr.select_one('td.point').text
        print(rank, title, star)
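As a possible next step, the quiz loop can collect its values into records instead of printing them. This is a sketch, not the post's solution: the helper name `parse_rows` is hypothetical, the markup and ratings are invented, and it assumes the `alt` text and the rating cell contain plain numbers.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the ranking table; ratings are invented.
html = """
<div id="old_content">
  <table>
    <tbody>
      <tr>
        <td class="ac"><img alt="01"></td>
        <td class="title"><div><a href="#">그린 북</a></div></td>
        <td class="point">9.59</td>
      </tr>
      <tr><td colspan="3" class="blank"></td></tr>
      <tr>
        <td class="ac"><img alt="02"></td>
        <td class="title"><div><a href="#">가버나움</a></div></td>
        <td class="point">9.58</td>
      </tr>
    </tbody>
  </table>
</div>
"""


def parse_rows(soup):
    """Turn each movie row into a dict, skipping rows without a title link."""
    records = []
    for tr in soup.select('#old_content > table > tbody > tr'):
        movie = tr.select_one('td.title > div > a')
        if movie is None:
            continue
        img = tr.select_one('td:nth-child(1) > img')
        point = tr.select_one('td.point')
        records.append({
            'rank': int(img['alt']) if img is not None else None,
            'title': movie.text.strip(),
            'star': float(point.text) if point is not None else None,
        })
    return records


records = parse_rows(BeautifulSoup(html, 'html.parser'))
for record in records:
    print(record['rank'], record['title'], record['star'])
```

Converting rank and rating to numbers makes the records easy to sort or filter later, and the extra None checks keep one malformed row from stopping the whole loop.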
