[Crawling] Chapter 1: Your First Web Scraper - 2. Introduction to BeautifulSoup

채린 · September 27, 2023

Series: Building Web Crawlers with Python

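The BeautifulSoup library:
It repairs malformed HTML and converts it into a Python object with an XML-like structure that is easy to traverse. This is handy when you have to navigate messy web pages.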
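To make the "repairs malformed HTML" point concrete, here is a minimal sketch that parses a deliberately broken string instead of a live page. The markup in `broken_html` is made up for illustration, and the exact tree you get can differ between parsers.

```python
from bs4 import BeautifulSoup

# Deliberately broken markup (made up for this example):
# <h1>, <p>, <body>, and <html> are never closed.
broken_html = "<html><body><h1>An Interesting Title<p>Some unclosed text"

bs = BeautifulSoup(broken_html, "html.parser")

# The parsed tree can still be navigated even though the source was malformed.
print(bs.h1 is not None)
print(bs.p.get_text())

# Serializing the tree writes a closing tag for every element it contains.
repaired = str(bs)
print(repaired)
```

Printing `repaired` is a quick way to see how a given parser chose to interpret the broken input.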

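Installing BeautifulSoup

It is not part of the Python standard library, so it has to be installed.

(Mac)
python3 -m ensurepip --upgrade   # only if pip is missing; easy_install, used in older guides, is deprecated
pip3 install beautifulsoup4
python3 myScript.py              # run your script with Python 3

(Verify)
python3
from bs4 import BeautifulSoup -> no error means the install worked

(Use a virtual environment where it makes sense)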

Running BeautifulSoup

from urllib.request import urlopen 
from bs4 import BeautifulSoup

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
bs = BeautifulSoup(html.read(),'html.parser')
# bs = BeautifulSoup(html, 'html.parser')  # also works without .read()
print(bs.h1)

=> <h1>An Interesting Title</h1>: returns the first <h1> tag on the page

bs.h1
bs.html.body.h1
bs.body.h1
bs.html.h1

=> All four return the same result
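The four expressions agree because attribute access such as `bs.h1` looks for the first tag with that name anywhere below the current node, so skipping levels still lands on the same element. A small offline sketch, assuming an inline HTML string that only imitates the structure of the sample page:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the sample page (assumed structure, not fetched).
html = """
<html>
<head><title>A Useful Page</title></head>
<body>
<h1>An Interesting Title</h1>
<div>Some body text.</div>
</body>
</html>
"""

bs = BeautifulSoup(html, "html.parser")

# Four different ways to reach the same <h1> tag.
paths = [bs.h1, bs.html.body.h1, bs.body.h1, bs.html.h1]
for tag in paths:
    print(tag)
```

Keep in mind that this shortcut only ever returns the first match, so it is best suited to tags that appear once on the page.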

BeautifulSoup(html.read(),'html.parser')

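=> First argument: the HTML text
=> Second argument: the parser used to build the BeautifulSoup object (you can choose one)
html.parser ships with Python 3, so nothing extra needs to be installed
lxml is also widely used: pip3 install lxml
    Better at parsing messy HTML (it works around problems without stopping at each one) and somewhat faster than html.parser
    Downside: it must be installed separately and depends on third-party C libraries
html5lib is also widely used: pip3 install html5lib
    Repairs badly formed HTML before parsing it, and corrects a wider range of errors than lxml
    Downside: it needs an external dependency and is slower than both lxml and html.parser
(In web scraping the network is usually the biggest bottleneck, so parser speed rarely matters)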
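Since lxml and html5lib are optional installs, one possible pattern is to try them first and fall back to html.parser. The helper name `make_soup` and the preference order below are assumptions made for this sketch, not something the library prescribes.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Parse markup with the first parser in `preferred` that is installed.

    Returns a (soup, parser_name) tuple. Hypothetical helper for illustration.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # bs4 raises FeatureNotFound when the requested parser is missing.
            continue
    raise RuntimeError("no usable parser found")


soup, parser_used = make_soup("<h1>An Interesting Title</h1>")
print(parser_used, soup.h1.get_text())
```

Because different parsers can build slightly different trees from the same broken markup, it helps to pin one parser once a scraper depends on a particular structure.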

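Reliable Connections and Exception Handling

Many pages do not follow the expected data format, sites go down often, and plenty of other unexpected problems come up.

html = urlopen('http://www.pythonscraping.com/pages/page1.html')
Things that can go wrong with this line:

HTTPError

The page cannot be found, or an error occurred while retrieving it
The server answers with an HTTP error such as 404 or 500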

    from urllib.request import urlopen
    from urllib.error import HTTPError

    try:
        html = urlopen('http://www.pythonscraping.com/pages/error.html')
    except HTTPError as e:
        print(e)
        # return None, break, or fall back to some other plan here
    else:
        # The program continues. Not needed if the except block returns or breaks.
        pass

Result: HTTP Error 404: Not Found
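Beyond printing the exception, an HTTPError also exposes the numeric status as `e.code` and the text as `e.reason`. The sketch below uses a hypothetical `fetch_status` helper with an injectable `opener` parameter (an assumption, added so the example runs without network access), plus a made-up `always_404` stand-in for `urlopen`.

```python
from urllib.request import urlopen
from urllib.error import HTTPError


def fetch_status(url, opener=urlopen):
    """Return (status code, reason) for url. `opener` is injectable for testing."""
    try:
        response = opener(url)
    except HTTPError as e:
        # e.code is the numeric status, e.reason the accompanying text.
        return e.code, e.reason
    return response.status, response.reason


def always_404(url):
    """Hypothetical stand-in for urlopen that simulates a missing page."""
    raise HTTPError(url, 404, "Not Found", None, None)


print(fetch_status("http://www.pythonscraping.com/pages/error.html", opener=always_404))
```

Calling `fetch_status(url)` without the `opener` argument would make a real request through `urlopen`.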

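URLError

Raised when the server cannot be found at all
If the server is down or the URL contains a typo, urlopen raises a URLError (one level more serious than an HTTP error, because no response comes back)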

    from urllib.request import urlopen
    from urllib.error import HTTPError
    from urllib.error import URLError

    try:
        html = urlopen('http://pythonscrapingthisurldoesnorexist.com')
    except HTTPError as e:
        print(e)
    except URLError as e:
        print('The server could not be found!')
    else:
        print('It Worked!')

Result: The server could not be found!
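The order of the two except clauses matters: HTTPError is a subclass of URLError, so a URLError clause placed first would also catch HTTP errors. A short offline sketch, where `classify` is a hypothetical helper that re-raises a hand-built exception just to show which clause handles it:

```python
from urllib.error import HTTPError, URLError


def classify(exc):
    """Re-raise exc and report which except clause handled it."""
    try:
        raise exc
    except HTTPError as e:
        # Must come first, otherwise the URLError clause would catch it.
        return f"HTTP error {e.code}"
    except URLError as e:
        return f"server not reached: {e.reason}"


print(issubclass(HTTPError, URLError))  # True
print(classify(HTTPError("http://example.com", 500, "Internal Server Error", None, None)))
print(classify(URLError("name not known")))
```

If you only need a single fallback for both cases, catching URLError alone is enough.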

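AttributeError

When you access a tag that does not exist, BeautifulSoup returns None. The trouble starts when you assume the tag is there and try to access something on that None.

print(bs.nonExistentTag) -> None
Checking for and handling the None object itself is not a problem

BUT if you ignore the possibility of None and access an attribute or call a function on it, you get an error
print(bs.nonExistentTag.someTag)
-> AttributeError: 'NoneType' object has no attribute 'someTag'

Fix: check for both situations explicitly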

try:
    badContent = bs.nonExistentTag.anotherTag
except AttributeError as e:
    print("Tag was not found")
else:
    if badContent is None:
        print("Tag was not found")
    else:
        print(badContent)

Result: Tag was not found
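Another way to cover both cases at once is to walk the chain of tags one step at a time and stop as soon as a step returns None. The helper `find_path` below is hypothetical (not part of BeautifulSoup) and uses an inline HTML string so it runs offline.

```python
from bs4 import BeautifulSoup


def find_path(node, *names):
    """Follow a chain of tag names, returning None as soon as one is missing."""
    for name in names:
        if node is None:
            return None
        # find() returns the first matching descendant, or None.
        node = node.find(name)
    return node


bs = BeautifulSoup(
    "<html><body><h1>An Interesting Title</h1></body></html>", "html.parser"
)

print(find_path(bs, "body", "h1"))
print(find_path(bs, "nonExistentTag", "anotherTag"))
```

With this approach the caller only has to check the final result against None, and no AttributeError is raised along the way.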

The same logic, rewritten to be easier to read

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup

def getTitle(url):
    # Return the page's <h1> tag, or None if anything goes wrong.
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        # bs.body was None, so the page has no <body> tag
        return None
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print("Title could not be found")
else:
    print(title)

Result: <h1>An Interesting Title</h1>

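When you write a scraper, think about the overall pattern of the code so that it handles exceptions and stays easy to read.
Build general-purpose functions such as getSiteHTML or getTitle so you can write scrapers that are quick to put together and reliable.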
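As a rough sketch of what such general-purpose functions could look like, the example below splits fetching and parsing into two small helpers. The names `get_site_html` and `get_title`, the `opener` parameter, and the `fake_page` stub are all assumptions made for illustration, so the example can run without touching the network.

```python
import io
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup


def get_site_html(url, opener=urlopen):
    """Return the raw page body, or None if the request fails."""
    try:
        with opener(url) as response:
            return response.read()
    except (HTTPError, URLError):
        return None


def get_title(html):
    """Return the first <h1> inside <body>, or None if it is missing."""
    if html is None:
        return None
    bs = BeautifulSoup(html, "html.parser")
    if bs.body is None:
        return None
    return bs.body.h1


def fake_page(url):
    """Hypothetical stand-in for urlopen that serves a fixed page."""
    return io.BytesIO(b"<html><body><h1>An Interesting Title</h1></body></html>")


title = get_title(get_site_html("http://www.pythonscraping.com/pages/page1.html", opener=fake_page))
print(title if title is not None else "Title could not be found")
```

Dropping the `opener` argument makes `get_site_html` use the real `urlopen`, and every caller still only has to handle a single None case.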
