BeautifulSoup

GreenBean · January 6, 2022

Today I learned

BeautifulSoup

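BeautifulSoup official documentation

Basic BeautifulSoup functions

find() and find_all()

find() and findAll() are the functions you will use most often in BeautifulSoup
- findAll() is the older camelCase name of find_all(); the two behave the same, and find_all() is the name used in the current bs4 documentation
- With them you can easily filter an HTML page for the tags you want, based on their various attributes
Example site: http://www.pythonscraping.com/pages/warandpeace.html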

find(tag, attributes, recursive, text, keywords)

findAll(tag, attributes, recursive, text, limit, keywords)

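In practice, most calls use only the first two parameters, tag and attributes

The tag parameter takes a tag name as a string, or a Python collection of tag names (a list, or a set as in the example below)
- .findAll({"h1","h2","h3","h4","h5","h6"})

The attributes parameter takes a Python dictionary of attributes and matches tags that have any one of the listed values
- For example, the following call returns both the green and the red span tags in the HTML document
- .findAll("span", {"class":{"green", "red"}})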
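The two parameters can be tried out without fetching the example page. The following is a minimal sketch, assuming a small made-up HTML snippet (not the real warandpeace.html) and using lists rather than sets for the candidate values; find_all is the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the example page.
html = """
<h1>War and Peace</h1>
<h2>Chapter 1</h2>
<p><span class="red">Well, Prince</span> said
<span class="green">Anna Pavlovna</span> to
<span class="blue">a guest</span>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# tag parameter: a list of names matches a tag with any of those names.
headings = soup.find_all(["h1", "h2", "h3"])
print([h.name for h in headings])

# attributes parameter: a list of values for one attribute matches any of them.
colored = soup.find_all("span", {"class": ["green", "red"]})
print([s.get_text() for s in colored])
```

The blue span is left out because its class is not among the listed values, and results come back in document order.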

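The recursive parameter is a Boolean that controls how deep into the document the search goes
- If recursive is True, findAll looks through children, children's children, and so on for tags that match the parameters
- If it is False, only the direct children of the tag you call it on are checked
- By default findAll works recursively (recursive is True), and it is generally best to leave this option alone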
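A small sketch of the difference, assuming a made-up snippet with one p tag directly inside a div and another nested one level deeper:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> is a direct child of the div, the other is nested.
html = "<div id='outer'><p>direct</p><section><p>nested</p></section></div>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.find("div", {"id": "outer"})

all_p = outer.find_all("p")                      # default, recursive=True
direct_p = outer.find_all("p", recursive=False)  # direct children of the div only

print([p.get_text() for p in all_p])
print([p.get_text() for p in direct_p])
```

With recursive=False the nested p is skipped because it is a grandchild of the div, not a child.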

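The text parameter is a little different: it matches against the text content of tags rather than their attributes
- Recent bs4 releases call this parameter string; text is still accepted as the older name
- For example, to see how many times 'the prince' appears surrounded by tags on the example page, use the following
- nameList = bsObj.findAll(text="the prince")
- print(len(nameList))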

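The limit parameter, naturally, applies only to findAll
- find is equivalent to calling findAll with limit set to 1
- Use it when you only care about the first few items on the page
- Be aware that items are returned in the order they appear on the page, which is not guaranteed to be the order you want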
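One practical difference is worth seeing in code: find_all with a limit still returns a list, while find returns the tag itself (or None when nothing matches). A minimal sketch with a made-up list:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with three list items.
html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
soup = BeautifulSoup(html, "html.parser")

first_two = soup.find_all("li", limit=2)   # list with the first two matches
only_first = soup.find_all("li", limit=1)  # list with a single element
single = soup.find("li")                   # the Tag itself, not a list
missing = soup.find("table")               # None, because there is no table

print([li.get_text() for li in first_two])
print(only_first[0] is single)
print(missing)
```

Because find can return None, chaining attribute access directly onto it fails on pages where the tag is absent, so a None check is a reasonable habit.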

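The keyword parameters let you select tags that have a particular attribute
- allText = bsObj.findAll(id="text")
- print(allText[0].get_text())

Tip! A caveat when using keyword parameters

Keyword parameters can be very handy in some situations, but technically they duplicate functionality BeautifulSoup already has

For example, the following two lines are exactly equivalent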

bsObj.findAll(id="text")

bsObj.findAll("", {"id":"text"})

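Keyword parameters also occasionally cause trouble, most commonly when searching for elements by their class attribute, because class is a protected keyword in Python

That is, class is a Python reserved word, so it cannot be used as a variable or parameter name

This has nothing to do with the keyword parameters of BeautifulSoup.findAll()

For example, the following line uses class in a non-standard way and raises a syntax error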

bsObj.findAll(class="green")

Workarounds

Add a trailing underscore

bsObj.findAll(class_="green")

Put class in quotes

bsObj.findAll("", {"class":"green"})
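Both workarounds can be checked side by side. This is a minimal sketch on a made-up snippet; it passes the dictionary through the attrs keyword instead of an empty tag name, and uses the built-in compile() only to show that the class= spelling is rejected by Python itself before BeautifulSoup is ever called.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one green and one red span.
html = '<span class="green">Anna</span><span class="red">Well</span>'
soup = BeautifulSoup(html, "html.parser")

by_keyword = soup.find_all(class_="green")         # trailing underscore
by_dict = soup.find_all(attrs={"class": "green"})  # class as a quoted dict key
print(by_keyword == by_dict)

# class= is a syntax error in Python, independent of BeautifulSoup.
class_keyword_is_valid = True
try:
    compile('soup.find_all(class="green")', "<example>", "eval")
except SyntaxError:
    class_keyword_is_valid = False
print(class_keyword_is_valid)
```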

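Passing a list of tags to .findAll() acts like an or filter
- That is, it selects every tag that is tag1, or tag2, or tag3, and so on
- If the list is long, you will pick up plenty of tags you do not need
The keyword parameters, by contrast, act like an and filter, so they do not have that problem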
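The or/and distinction can be sketched as follows, assuming a made-up snippet in which only one tag satisfies every condition at once:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three green tags, only one of them a span with id="name".
html = """
<span class="green" id="name">Anna</span>
<span class="green">Vasili</span>
<p class="green" id="note">note</p>
"""
soup = BeautifulSoup(html, "html.parser")

# OR: a tag matches if its name is any one of the listed names.
either_tag = soup.find_all(["span", "p"])

# AND: the tag name and every keyword condition must all match.
both_attrs = soup.find_all("span", class_="green", id="name")

print(len(either_tag))
print([t.get_text() for t in both_attrs])
```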

Example code

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('***')
bs = BeautifulSoup(html, 'html.parser')

# Get all the green text
nameList = bs.findAll('span', {'class': 'green'})  # find_all works the same way
for i in nameList:
    print(i.get_text())

# Output
Anna
Pavlovna Scherer
Empress Marya
Fedorovna
Prince Vasili Kuragin
Anna Pavlovna
St. Petersburg
the prince
Anna Pavlovna
Anna Pavlovna
the prince
the prince
the prince
Prince Vasili
Anna Pavlovna
... (omitted)

findAll() finds every matching tag on the page, and get_text() extracts the text with the tags stripped out

# Return a list of all heading tags in the document
bs.findAll({'h1','h2','h3','h4','h5','h6'})

# Output
[<h1>War and Peace</h1>, <h2>Chapter 1</h2>]

# Return both the green and the red span tags
bs.findAll('span', {'class': {'green', 'red'}})

# Output
<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>,
 <span class="green">Anna
 Pavlovna Scherer</span>,
... (omitted)

recursive=True is the default, which searches all the way down through children of children; with False, only direct children are searched

# text parameter: count how many times 'the prince' appears surrounded by tags
nameList = bs.findAll(text='the prince')
print(len(nameList))

# Output
7

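A match requires that the text 'the prince', and nothing else, is what the tag encloses
- For example, searching for 'the' finds nothing, because 'the' never appears on its own inside a tag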
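When a partial match is what you want, a compiled regular expression can be passed in place of the plain string. A minimal sketch on a made-up sentence, using string, the current name of the text parameter:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: 'the prince' is enclosed in a span twice.
html = "<p><span>the prince</span> greeted <span>the prince</span> and the princess</p>"
soup = BeautifulSoup(html, "html.parser")

exact = soup.find_all(string="the prince")        # whole text node must equal this
exact_the = soup.find_all(string="the")           # no text node is exactly 'the'
partial = soup.find_all(string=re.compile("the")) # any text node containing 'the'

print(len(exact), len(exact_the), len(partial))
```

The regular expression also picks up the trailing text node that contains "the princess", which the exact match ignores.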

# keyword parameter: select tags that have a particular attribute
bs.findAll(id='text')

# Output
[<div id="text">
 "<span class="red">Well, Prince, so Genoa and Lucca are now just family estates of the
 Buonapartes. But I warn you, if you don't tell me that this means war,
 if you still try to defend the infamies and horrors perpetrated by
 that Antichrist- I really believe he is Antichrist- I will have
 nothing more to do with you and you are no longer my friend, no longer
 my 'faithful slave,' as you call yourself! But how do you do? I see
 I have frightened you- sit down and tell me all the news.</span>"
 <p></p>
 It was in July, 1805, and the speaker was the well-known <span class="green">Anna
 Pavlovna Scherer</span>, maid of honor and favorite of the <span class="green">Empress Marya
 Fedorovna</span>. With these words she greeted <span class="green">Prince Vasili Kuragin</span>, a man
 of high rank and importance, who was the first to arrive at her
 reception. <span class="green">Anna Pavlovna</span> had had a cough for some days. She was, as
 she said, suffering from la grippe; grippe being then a new word in
 <span class="green">St. Petersburg</span>, used only by the elite.
 ... (omitted)

# Another way to find the green text
bs.findAll(class_='green')

# Output
<span class="green">Anna
 Pavlovna Scherer</span>, <span class="green">Empress Marya
 Fedorovna</span>, <span class="green">Prince Vasili Kuragin</span>, <span class="green">Anna Pavlovna</span>, <span class="green">St. Petersburg</span>,
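... (omitted)

class is a Python reserved word, so it has to be written as class_

limit parameter: how many results to collect, counting from the top of the page
- Use it when you only care about the first few items on the page
- find() is no different from findAll() with limit set to 1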

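Tree structure

The findAll function finds tags by name and attributes
But what if you need to find a tag based on its position in the document?
- That is where tree navigation comes in

Example site: http://www.pythonscraping.com/pages/page3.html

The HTML of this page can be represented as the following tree
- Some tags are omitted for brevity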

html
  - body
    - div.wrapper
      - h1
      - div.content
      - table#giftList
        - tr
          - th
          - th
          - th
          - th
        - tr.gift#gift1
          - td
          - td
            - span.excitingNote
          - td
          - td
            - img
        - ...more table rows...
      - div.footer

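Children and descendants

The BeautifulSoup library distinguishes between children and descendants
- As in a human family tree, a child is always exactly one tag below its parent, while a descendant can be any number of levels below an ancestor
- On the example page, the tr tags are children of the table tag, while tr, th, td, img, and span are all descendants of the table tag
- Every child is a descendant, but not every descendant is a child

In general, BeautifulSoup functions always work on the descendants of the currently selected tag
- For example, bsObj.body.h1 selects the first h1 tag that is a descendant of body
  - It does not find tags outside body
- Likewise, bsObj.div.findAll("img") finds the first div tag in the document and then returns a list of every img tag that is a descendant of that div
To find only children, use .children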

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("***")
bsObj = BeautifulSoup(html, "html.parser")

for child in bsObj.find("table",{"id":"giftList"}).children:
    print(child)

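This code prints the list of product rows in the giftList table
- Had it used .descendants instead of .children, more than twenty tags in the table would have been printed, including the img, span, and td tags
- The distinction between children and descendants matters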
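The gap between the two can be made concrete by counting tags. The sketch below assumes a much smaller made-up table than the one on page3.html, written on one line so that no whitespace text nodes appear, and keeps only Tag objects:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical two-row table, far smaller than the real giftList table.
html = (
    '<table id="giftList">'
    '<tr><th>Item</th><th>Cost</th></tr>'
    '<tr class="gift"><td>Basket <span>new</span></td><td>$15.00</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "giftList"})

# .children yields only the nodes directly under the table.
child_tags = [c.name for c in table.children if isinstance(c, Tag)]

# .descendants walks every level below the table, in document order.
descendant_tags = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_tags)
print(descendant_tags)
```

Both attributes also yield text nodes, which is why the isinstance filter is there.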

Children example code

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('***')
bs = BeautifulSoup(html, 'html.parser')

# Get the child tags of the table (the parent) whose id is giftList
for child in bs.find('table', {'id': 'giftList'}).children:
    print("="*10)
    print(child)

# Output
==========


==========
<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>
==========


==========
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
==========


==========
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
==========

... (omitted)

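The direct child is <tr class="gift" id="gift1">, and that means not just the opening tag but everything through the result shown below
- Only <tr class="gift" id="gift1">…</tr> as a whole is the child tag
- In other words, it is printed as a single result object
- The empty results between rows are the newline text nodes that sit between the tr tags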

<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>

Descendants example code

To print the nested tags individually as well, use descendants to walk the descendant tags

# When the nested tags need to be printed too
for des in bs.find('table', {'id': 'giftList'}).descendants:
    print("="*10)
    print(des)

# Output
==========
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
==========
<td>
Vegetable Basket
</td>
==========

Vegetable Basket

==========
<td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td>
==========

This vegetable basket is the perfect gift for your health conscious (or overweight) friends!

==========
<span class="excitingNote">Now with super-colorful bell peppers!</span>
==========
Now with super-colorful bell peppers!
==========


==========
<td>
$15.00
</td>
==========

$15.00

==========
<td>
<img src="../img/gifts/img1.jpg"/>
</td>
==========


==========
<img src="../img/gifts/img1.jpg"/>
==========

Working with siblings

BeautifulSoup's next_siblings makes it easy to collect data from tables, especially tables with a title row

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("***")
bsObj = BeautifulSoup(html, "html.parser")

for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings:
    print(sibling)

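The output of this code is every product row in the table except the first, title row
- Why is the title row skipped?
- First, an object cannot be its own sibling
  - Whenever you get an object's siblings, the object itself is always left out of the list
- Second, this function only gets the siblings that come next
  - For example, if you select a row in the middle of the list and call next_siblings on it, only the siblings after it are returned
- So selecting the title row and calling next_siblings selects every table row except the title row itself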

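Tip! Make selections specific

The code above would have worked just as well if it had used bsObj.table.tr, or even bsObj.tr, to select the first row of the table

Even so, it takes the trouble to spell the selection out in a longer, more explicit form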

bsObj.find("table",{"id":"giftList"}).tr

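Even when a page appears to have only one table (or other target tag), it is easy to get this wrong

Page layouts also change all the time

A table that was first on the page when you wrote the code may one day turn out to be the second or third

To make a scraper more robust, it is best to always select tags as specifically as possible, using tag attributes whenever you can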
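A sketch of how the short form can go wrong, assuming a made-up page where a second, unrelated table has been added ahead of giftList:

```python
from bs4 import BeautifulSoup

# Hypothetical page: a new table now appears before the one we care about.
html = (
    '<table><tr><td>ad banner</td></tr></table>'
    '<table id="giftList"><tr><th>Item Title</th></tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

loose = soup.table.tr.get_text()                               # first table on the page
precise = soup.find("table", {"id": "giftList"}).tr.get_text() # the table with that id

print(loose)
print(precise)
```

The short form silently returns a row from the wrong table, while the id-based selection is unaffected by the layout change.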

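previous_siblings complements next_siblings
- Use it when the tag that is easy to select is the one at the end of the list of sibling tags you want

There are also next_sibling and previous_sibling, which are almost the same as next_siblings and previous_siblings
- They work the same way, except that they return a single sibling rather than a sequence of them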
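One detail to watch for: the nearest sibling is often a whitespace text node rather than a tag. The sketch below assumes a made-up table with newlines between rows, and shows two ways around that, filtering on Tag and calling find_next_sibling with a tag name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical table with a newline between each row, as on most real pages.
html = """<table id="giftList">
<tr><th>Item</th></tr>
<tr id="gift1"><td>Basket</td></tr>
<tr id="gift2"><td>Dolls</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")
header = soup.find("table", {"id": "giftList"}).tr

raw_next = header.next_sibling  # the newline text node after the header row

# Keep only real tags when iterating over the following siblings.
rows = [s for s in header.next_siblings if isinstance(s, Tag)]
row_ids = [r["id"] for r in rows]

# find_next_sibling skips text nodes and returns the next matching tag.
first_row = header.find_next_sibling("tr")

print(repr(raw_next))
print(row_ids)
print(first_row["id"])
```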

Siblings (next_siblings) example code

next_siblings makes it easy to collect data from a table on a web page, especially when the table has a title (header) row

# Get only the entries of the table, skipping the title row
for siblings in bs.find('table', {'id': 'giftList'}).tr.next_siblings:
    print("="*10)
    print(siblings)

# Output
==========


==========
<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>
==========


==========
<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>
==========


==========
<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>
==========


==========
<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>
==========


==========
<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>
==========

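Similarly, previous_siblings gets the sibling tags that come before the selected tag
Without the s, next_sibling and previous_sibling get just the one sibling immediately after or before the selected tag

Tip!

Use next_siblings when a tag at the start, such as a header, is easy to identify

Use previous_siblings when the tag at the end is easy to identify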
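A minimal sketch of the second case, assuming a made-up list whose last item carries an id that makes it easy to select. Note that previous_siblings walks backwards, nearest sibling first.

```python
from bs4 import BeautifulSoup

# Hypothetical list where only the last item is easy to identify.
html = "<ul><li>a</li><li>b</li><li>c</li><li id='last'>total</li></ul>"
soup = BeautifulSoup(html, "html.parser")
last = soup.find("li", {"id": "last"})

# Iterates from the closest preceding sibling back to the first one.
before = [s.get_text() for s in last.previous_siblings]

print(before)
```

Reverse the list if the original document order is needed.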

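Working with parents

When scraping pages, you will occasionally need to find a tag's parent rather than its children or siblings
- Typically, when you look at an HTML page in order to collect data, you start at the top level and think about how to drill down to the data you want
- Now and then, though, BeautifulSoup's parent-finding attributes .parent and .parents are needed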

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("***")
bsObj = BeautifulSoup(html, "html.parser")
print(bsObj.find("img", {"src": "../img/gifts/img1.jpg"})
      .parent.previous_sibling.get_text())

This code prints the price of the item shown in the ../img/gifts/img1.jpg image (in this case, $15.00)
The tree structure of the part of the HTML page in question, with the steps marked by number, looks like this

<tr>
  - <td>
  - <td>
  - <td> ③
    - "$15.00" ④
  - <td> ②
    - <img src="../img/gifts/img1.jpg"> ①

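①. First, select the image whose src is "../img/gifts/img1.jpg"
②. Select its parent tag (in this case, the <td> tag)
③. Select the previous_sibling of the <td> selected in ② (in this case, the tag that holds the product price)
④. Select the text inside that tag, $15.00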
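The four steps can be followed one at a time on a single made-up row, without fetching the page. This sketch uses find_previous_sibling("td") for step ③, a variant of previous_sibling that skips any whitespace between cells, and also peeks at .parents to list the ancestors of the image.

```python
from bs4 import BeautifulSoup

# Hypothetical one-row table modeled on the gift1 row.
html = (
    '<table><tr>'
    '<td>Vegetable Basket</td>'
    '<td>$15.00</td>'
    '<td><img src="../img/gifts/img1.jpg"/></td>'
    '</tr></table>'
)
soup = BeautifulSoup(html, "html.parser")

img = soup.find("img", {"src": "../img/gifts/img1.jpg"})  # step 1: the image
cell = img.parent                                          # step 2: its <td>
price_cell = cell.find_previous_sibling("td")              # step 3: the <td> before it
price = price_cell.get_text().strip()                      # step 4: its text

# .parents walks upward from the image; keep the three nearest ancestors.
ancestors = [p.name for p in img.parents][:3]

print(price)
print(ancestors)
```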

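Parent example code

Working from a parent down to its children once you understand the structure is the common case, but sometimes you need to find a parent through its child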

# Find the parent by way of its child
bs.find('img', {'src': '../img/gifts/img1.jpg'}).parent.previous_sibling.get_text()

# Output
'\n$15.00\n'

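This uses a particular image to get the price that goes with it, in the following order
- Select the img tag for that image
- Select its parent tag
- Select the sibling tag immediately before it
- Extract the text inside that tag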
