Your First Web Scraper

Koo · August 6, 2023

Crawling python

Building a Web Crawler with Python (series, part 1 of 6)

How to work out and interpret the structure of data without the help of a browser
How to extract the content you want from that data

Connecting

Basic network connection

When desktop B connects to server A:

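1. B sends a stream of bits, distinguished by high and low voltages. The bits carry a header and a body. The header holds the MAC address of B's router as the immediate next destination and server A's IP address as the final destination; the body holds the request for server A.

2. B's router interprets the bits as a packet going from B's MAC address to A's IP address. It records its own IP address as the sender and sends the packet out to the Internet.

3. B's packet travels through several intermediate servers on its way to server A.

4. Server A receives the packet at its IP address.

5. Server A reads the port number in the header and hands the packet to the appropriate application (the web server application).

6. The application receives the stream of bits (the request).

7. The application bundles the requested file into new packets and sends them through its own router back to B's computer, where they arrive by the same process as steps 1 through 6.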
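The steps above can be tried out at the level where Python first gets involved: an IP address, a port number, and a stream of bytes. The following is a minimal sketch, not how a real scraper would be written. It assumes a throwaway local server (the hypothetical `HelloHandler`) standing in for server A, and a helper named `raw_http_get` that plays the role of desktop B. Everything below the socket (MAC addresses, routers, voltages) is handled by the operating system and the hardware.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for the web server application on 'server A'."""

    def do_GET(self):
        body = b"<html><body><h1>Hello from A</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def raw_http_get(host, port, path="/"):
    """Send a bare HTTP/1.0 GET request over TCP and return the raw reply bytes."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # the server closed the connection: reply is complete
                break
            chunks.append(chunk)
    return b"".join(chunks)


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

host, port = server.server_address
reply = raw_http_get(host, port)
server.shutdown()
server.server_close()

# The reply is itself a header and a body, separated by a blank line.
header_part, _, body_part = reply.partition(b"\r\n\r\n")
print(header_part.decode("ascii"))
print(body_part.decode("utf-8"))
```

The IP address selects the machine (step 4) and the port number selects the application on it (step 5). The request and the reply are both just bytes until some program interprets them.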

Web Browser

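A web browser is a very useful application that creates packets, sends them, and interprets the data that comes back so it can be presented in various forms.
A web browser can instruct the processor to send data to the application that handles the wired or wireless interface.
The same thing can be done without a web browser, from a programming language using a library.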

from urllib.request import urlopen

html = urlopen('http://pythonscraping.com/pages/page1.html')
print(html.read())

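The object returned by urlopen is not rendered as a page; it holds only the single HTML file at that URL. Images, CSS, and other files the page links to are not fetched.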
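To check this without depending on an external site, here is a small sketch that serves one made-up page from a local test server and records every path that gets requested. The `PageHandler` class, the `PAGE` content, and the `/logo.png` image reference are all assumptions made up for the illustration.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

# Made-up page that refers to an image, as most real pages do.
PAGE = b'<html><body><h1>Demo</h1><img src="/logo.png"></body></html>'
requested_paths = []


class PageHandler(BaseHTTPRequestHandler):
    """Hypothetical server that returns PAGE and logs each requested path."""

    def do_GET(self):
        requested_paths.append(self.path)
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


server = HTTPServer(("127.0.0.1", 0), PageHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

with urlopen(f"http://{host}:{port}/page1.html") as response:
    raw = response.read()        # bytes, exactly as sent by the server
    text = raw.decode("utf-8")   # str, ready for printing or parsing

server.shutdown()
server.server_close()

print(type(raw).__name__, type(text).__name__)
print(text)
print(requested_paths)
```

Only `/page1.html` shows up in `requested_paths`: the `img` tag is just text as far as urlopen is concerned, whereas a browser would go on to request the image as well. Note also that `read()` returns bytes, which is why `print(html.read())` shows a `b'...'` literal.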

urllib is split into the submodules request, parse, error, and robotparser.
See https://docs.python.org/3/library/urllib.html for details.
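As a quick tour of two of those submodules besides request, the sketch below takes a URL apart with urllib.parse and looks at the exception classes in urllib.error. It needs no network access. The relative link `page2.html` is a made-up example, not a page known to exist on the site.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import urljoin, urlparse

page_url = "http://www.pythonscraping.com/pages/page1.html"

# urllib.parse splits a URL into its components.
parts = urlparse(page_url)
print(parts.scheme)   # http
print(parts.netloc)   # www.pythonscraping.com
print(parts.path)     # /pages/page1.html

# Resolve a relative link against the page it was found on
# (page2.html is a hypothetical link used only for illustration).
next_url = urljoin(page_url, "page2.html")
print(next_url)       # http://www.pythonscraping.com/pages/page2.html

# urllib.error holds the exceptions raised by urllib.request;
# HTTPError is a subclass of URLError.
print(issubclass(HTTPError, URLError))  # True
```

urljoin becomes useful as soon as a scraper starts following links, since many pages use relative paths in their href attributes.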

BeautifulSoup

BeautifulSoup repairs malformed HTML and converts it into a Python object with an XML-like structure that is easy to traverse.

pip install beautifulsoup4

Convert the object obtained from urlopen above into a BeautifulSoup object:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bsObj = BeautifulSoup(html.read(), 'html.parser')
print(bsObj.h1)

bsObj.h1 returns the first h1 tag found in the HTML fetched by urlopen.
In the HTML, the h1 tag is nested inside other tags (html, then body), but it can be accessed directly without spelling out those levels.
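The nesting can be seen more easily on a small inline document. The HTML string below is a made-up stand-in for a downloaded page, so the sketch runs without a network connection; it assumes beautifulsoup4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page.
html = """
<html>
  <head><title>A Useful Page</title></head>
  <body>
    <h1>An Interesting Title</h1>
    <div>Some body text.</div>
  </body>
</html>
"""

bsObj = BeautifulSoup(html, "html.parser")

# All of these reach the same h1 tag; the shorter forms search the
# tree for the first match instead of naming each level.
print(bsObj.html.body.h1)
print(bsObj.body.h1)
print(bsObj.h1)

# get_text() returns only the text inside the tag.
print(bsObj.h1.get_text())
```

Because the short form returns the first match anywhere in the document, the longer path is still worth writing out when a page has several h1 tags and only one particular location is wanted.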

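Reliable connections

Many exceptional situations can come up while scraping:

The page cannot be found, or an error occurs while retrieving the URL
The server cannot be found

1. The page cannot be found, or an error occurs while retrieving the URL

The server responds with an HTTP error, and the urlopen function raises HTTPError.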

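from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

try:
    html = urlopen('http://www.pythonscraping.com/pages/page1.html')
except HTTPError as e:
    print(e)  # return None, break out of a loop, or take some other fallback action
else:
    # no error was raised, so the program continues
    bsObj = BeautifulSoup(html.read(), 'html.parser')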
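The second case from the list, a server that cannot be found, does not raise HTTPError, because no HTTP response ever arrives. urlopen raises URLError instead. The sketch below triggers both cases locally, under some assumptions: `NotFoundHandler` is a made-up server that always answers 404, `fetch` is a hypothetical helper, and the unreachable server is simulated by a local port with nothing listening on it (a hostname that fails to resolve would raise URLError in the same way).

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


class NotFoundHandler(BaseHTTPRequestHandler):
    """Hypothetical server that answers every request with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def fetch(url):
    """Return the page bytes, or a short description of what went wrong."""
    try:
        with urlopen(url, timeout=5) as response:
            return response.read()
    except HTTPError as e:   # the server answered, but with an error status
        return f"HTTP error {e.code}"
    except URLError as e:    # no server could be reached at all
        return "server could not be reached"


server = HTTPServer(("127.0.0.1", 0), NotFoundHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
host, port = server.server_address

# Reserve a port and release it again so that nothing is listening there.
with socket.socket() as probe:
    probe.bind(("127.0.0.1", 0))
    closed_port = probe.getsockname()[1]

missing_page = fetch(f"http://{host}:{port}/missing.html")
no_server = fetch(f"http://127.0.0.1:{closed_port}/")

server.shutdown()
server.server_close()

print(missing_page)
print(no_server)
```

HTTPError is a subclass of URLError, so the `except HTTPError` clause has to come first; otherwise the URLError clause would catch both cases.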

2. Every time you access a tag, you need to check that the tag actually exists

If you try to access a tag that does not exist, BeautifulSoup returns None.
If you then access a tag on that None object as if it existed, an AttributeError is raised.

# check whether nonExistingTag and anotherTag actually exist
try:
    badContent = bsObj.nonExistingTag.anotherTag
except AttributeError as e:
    print('Tag was not found')
else:
    if badContent is None:
        print('Tag was not found')
    else:
        print(badContent)
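An alternative to catching AttributeError is to check for None after every step. The following is one possible sketch of that idea; `find_nested` is a hypothetical helper written for this note, not part of BeautifulSoup, and the HTML string is a made-up sample.

```python
from bs4 import BeautifulSoup

html = "<html><body><h1>An Interesting Title</h1></body></html>"  # made-up sample
bsObj = BeautifulSoup(html, "html.parser")


def find_nested(soup, *tag_names):
    """Look up tag_names in order; return None as soon as one is missing."""
    current = soup
    for name in tag_names:
        current = current.find(name)  # find() returns None when nothing matches
        if current is None:
            return None
    return current


print(find_nested(bsObj, "body", "h1"))
print(find_nested(bsObj, "nonExistingTag", "anotherTag"))
```

The caller then has a single `is None` check to make, regardless of how deep the chain of tags is.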

Wrapping this exception handling in a function up front gives you a web scraper that is easy to reuse.

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    # return None if the page cannot be retrieved
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None

    # return None if the page has no body tag
    try:
        bsObj = BeautifulSoup(html.read(), 'html.parser')
        title = bsObj.body.h1
    except AttributeError as e:
        return None

    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print("Title couldn't be found")
else:
    print(title)
