Python web crawling (feat. beautifulsoup4) #1

eunji hwang · April 13, 2020


Web crawling

import

from bs4 import BeautifulSoup
from urllib.request import urlopen

import csv, re, requests

Import the modules we will use. requests, mentioned in the earlier installation post, is imported too.

Creating the CSV file

output = "filename.csv"                                      # final output file
csv_open = open(output, 'w+', encoding='utf-8', newline='')  # open(path, mode, encoding); the csv docs recommend newline=''
csv_writer = csv.writer(csv_open)                            # csv.writer takes the open file object
csv_writer.writerow(('title', 'image_url'))                  # write the header row of the csv file
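
The file opened above is never closed explicitly. Here is a minimal sketch of the same step using a `with` block. The temporary directory, the file name `result.csv`, and the two rows are made up for illustration and are not from the original post.

```python
import csv
import os
import tempfile

# Hypothetical rows standing in for scraped data.
rows = [("first post", "http://example.com/a.png"),
        ("second post", "http://example.com/b.png")]

# A temporary directory keeps the sketch from touching real files.
output = os.path.join(tempfile.mkdtemp(), "result.csv")

# newline='' lets the csv module control line endings itself;
# the with block closes the file even if a write fails.
with open(output, "w", encoding="utf-8", newline="") as csv_open:
    csv_writer = csv.writer(csv_open)
    csv_writer.writerow(("title", "image_url"))  # header row
    csv_writer.writerows(rows)                   # one line per tuple

# Read the file back to check what was written.
with open(output, encoding="utf-8", newline="") as f:
    saved = list(csv.reader(f))
print(saved)
```

Each row comes back as a list of strings, header first.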

Fetching the HTML document

orig_url = 'URL of the page to crawl'   # keep the url in a variable to keep the code simple
req = requests.get(orig_url)   # send a GET request to the page; printing req shows only the status, e.g. <Response [200]>
html = req.text                # the html source as a string
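
The three lines above assume the request always succeeds. Below is a rough sketch that adds a timeout and a status check. `fetch_html` and its `session` and `timeout` parameters are hypothetical names chosen for this example; the calls it makes (`Session.get`, `raise_for_status`, `.text`) are regular requests API.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the HTML text at url, raising on HTTP error statuses."""
    # fetch_html, session and timeout are illustrative names, not from the post.
    if session is None:
        session = requests.Session()
    req = session.get(url, timeout=timeout)  # stop waiting after `timeout` seconds
    req.raise_for_status()                   # raises requests.HTTPError on 4xx/5xx
    return req.text                          # decoded response body


# html = fetch_html('https://example.com/')  # uncomment to send a real request
```

Accepting an optional session makes it easy to reuse one connection across many pages, or to swap in a stub when testing.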

Making the soup

# BeautifulSoup(markup, "parser"): see the parser section of the Beautiful Soup documentation.
# The document passed in is converted to Unicode, and HTML entities are converted to Unicode characters.
soup = BeautifulSoup(html, 'html.parser')
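
To see the entity conversion mentioned in the comment, here is a small sketch that parses a literal string. The sample markup is made up for this example.

```python
from bs4 import BeautifulSoup

# Sample markup with two HTML entities: &eacute; and &amp;
html = "<html><body><p>caf&eacute; &amp; bar</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

text = soup.p.get_text()  # entities come back as ordinary Unicode characters
print(text)               # café & bar
```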

Searching the tree

Methods

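Let's look at the methods used most often. Besides find() and findAll(), Beautiful Soup provides ten more search methods (find_parents(), find_parent(), and so on). According to the official documentation, five of them work like find() and the other five like findAll(), so it is enough to start with find() and findAll().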

- find(name, attrs, recursive, text, **kwargs)
- find_all(name, attrs, recursive, text, limit, **kwargs)
- find_parent(name, attrs, text, **kwargs)
- find_parents(name, attrs, text, limit, **kwargs)
- find_next_sibling(name, attrs, text, **kwargs)
- find_next_siblings(name, attrs, text, limit, **kwargs)
- find_previous_sibling(name, attrs, text, **kwargs)
- find_previous_siblings(name, attrs, text, limit, **kwargs)
- find_next(name, attrs, text, **kwargs)
- find_all_next(name, attrs, text, limit, **kwargs)
- find_previous(name, attrs, text, **kwargs)
- find_all_previous(name, attrs, text, limit, **kwargs)
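
The relative search methods in the list above are easier to follow on a concrete tree. This is a minimal sketch on made-up markup; the ids and hrefs are for illustration only.

```python
from bs4 import BeautifulSoup

html = """
<div id="box">
  <p class="story">
    <a id="link1" href="/elsie">Elsie</a>
    <a id="link2" href="/lacie">Lacie</a>
    <a id="link3" href="/tillie">Tillie</a>
  </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
link1 = soup.find("a", id="link1")
link2 = soup.find("a", id="link2")
link3 = soup.find("a", id="link3")

parent_div = link2.find_parent("div")                           # nearest enclosing <div>
ancestors = [t.name for t in link2.find_parents(["p", "div"])]  # innermost first
next_id = link2.find_next_sibling("a")["id"]                    # next <a> on the same level
prev_id = link2.find_previous_sibling("a")["id"]                # previous <a> on the same level
after = [a["id"] for a in link1.find_all_next("a")]             # every <a> later in the document
before = [a["id"] for a in link3.find_all_previous("a")]        # every <a> earlier, nearest first

print(parent_div["id"], ancestors, next_id, prev_id, after, before)
```

The singular methods return one tag (or None), and the plural and find_all_* methods return a list, mirroring find() and find_all().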

findAll()

findAll(name, attrs, recursive, text, limit, **kwargs)
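Scans every descendant of the tag and returns all matches in a list. findAll() is the older name for find_all(); both behave the same.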

find()

find(name, attrs, recursive, text, **kwargs)
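Use it when you expect only one match, such as <body>. find('body') does the same search as findAll('body', limit=1), but it returns the tag itself instead of a one-element list, and returns None when nothing matches, so it is the better choice in that case.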
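
A short sketch of the difference in return values, using made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>one</p><p>two</p></body></html>", "html.parser")

first = soup.find("p")                 # a single Tag
as_list = soup.find_all("p", limit=1)  # a list holding that same Tag
missing_one = soup.find("table")       # None when nothing matches
missing_all = soup.find_all("table")   # an empty list when nothing matches

print(first, as_list, missing_one, missing_all)
```

Because find() can return None, check the result before chaining attribute access onto it.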

Searching by tag

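name: the filter applied to tag names. It can be a string (tag name), a regular expression, a list, a function, or a bool.
Several arguments can be combined to narrow down the matches.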

soup.find_all("a", class_="sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

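attrs: a string passed here is treated as a CSS class and returns the tags that have that class; a dict such as {'id': 'link1'} matches attributes. Keyword arguments are usually more convenient, though.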
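
As a rough illustration, the three calls below match the same class in different ways. The markup is made up for this sketch.

```python
from bs4 import BeautifulSoup

html = ('<p><a class="sister big" href="/elsie" id="link1">Elsie</a>'
        '<a class="brother" href="/tom" id="link2">Tom</a></p>')
soup = BeautifulSoup(html, "html.parser")

by_string = soup.find_all("a", "sister")                 # second positional argument is read as a CSS class
by_keyword = soup.find_all("a", class_="sister")         # keyword form; class_ because class is reserved
by_dict = soup.find_all("a", attrs={"class": "sister"})  # explicit attrs dict

print([a["id"] for a in by_string])
```

All three match the first link even though its class attribute holds two classes, since matching any one of them is enough.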

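find_all() collects everything that matches the given filter and returns it in a list. The filter can be a string (tag name), a regular expression, a list, a bool, or a function. Let's look at how each one is used.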

1. `'tag'`

To extract the data inside a particular tag, pass the tag name. Note that it must be given as a string.

soup.find('span')     # <span> hahaha </span>
soup.find_all('span') # [<span>hahaha</span>, <span>second</span>, ...]

find_all() returns every <span> tag in a list, while find() returns only the first one.

2. Regular expression

When you pass a regular expression, Beautiful Soup filters tag names with the pattern's search() method.
Pass re.compile('pattern') as the argument.

soup.find_all(re.compile('^b')) # finds every tag whose name starts with b
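
A small runnable sketch of the regular expression filter, on made-up markup. The second pattern has no anchor, so it matches a `t` anywhere in the tag name.

```python
import re

from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><p><b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

starts_with_b = [t.name for t in soup.find_all(re.compile("^b"))]
contains_t = [t.name for t in soup.find_all(re.compile("t"))]

print(starts_with_b)  # ['body', 'b']
print(contains_t)     # ['html', 'title']
```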

3. List

Returns, in a list, every tag whose name matches any of the strings in the given list.

soup.find_all(['p','span'])
# [
# <p> this is a p tag </p>,
# <span> this is span1 </span>,
# <span> this is span2 </span>
# ]
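
The same idea as a runnable sketch on made-up markup. Results come back in document order, not grouped by tag name.

```python
from bs4 import BeautifulSoup

html = "<div><span>span1</span><p>a p tag</p><b>bold</b><span>span2</span></div>"
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all(["p", "span"])  # any tag named p or span
names = [t.name for t in matches]
texts = [t.get_text() for t in matches]

print(names)  # ['span', 'p', 'span']
print(texts)  # ['span1', 'a p tag', 'span2']
```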

4. bool

Passing True matches every tag in the document, but none of the text strings. The example below prints only tag.name for each match, not the text the tag contains.

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b...
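
A quick sketch confirming that find_all(True) returns Tag objects rather than names or strings. The markup is made up for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<html><head><title>T</title></head><body><p>text <b>bold</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

tags = soup.find_all(True)                      # every tag, no text nodes
names = [t.name for t in tags]
all_tags = all(isinstance(t, Tag) for t in tags)

print(names)     # ['html', 'head', 'title', 'body', 'p', 'b']
print(all_tags)  # True
```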

5. Function

If none of the filters above can express the condition, define a function. It takes a tag as its only argument and returns a bool.

# example code taken from the official Beautiful Soup documentation
from bs4 import NavigableString
def surrounded_by_strings(tag):
    return (isinstance(tag.next_element, NavigableString)
            and isinstance(tag.previous_element, NavigableString))

for tag in soup.find_all(surrounded_by_strings):
    print(tag.name)
# p
# a
# a
# a
# p

This function returns True when the tag is surrounded by string objects.
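
Here is another small sketch of a function filter. `has_href_but_no_id` is a hypothetical helper written for this example, and the markup is made up.

```python
from bs4 import BeautifulSoup


def has_href_but_no_id(tag):
    """True for tags that carry an href attribute but no id attribute."""
    return tag.has_attr("href") and not tag.has_attr("id")


html = '<p><a href="/a" id="x">A</a><a href="/b">B</a><span>no link</span></p>'
soup = BeautifulSoup(html, "html.parser")

matches = soup.find_all(has_href_but_no_id)  # the function is called once per tag
texts = [t.get_text() for t in matches]

print(texts)  # ['B']
```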

6. Keyword arguments

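Keyword arguments (kwargs): filter on a specific attribute value, as in id='no-222'. They are handy when:

- looking for a tag with a particular class or id value (class is a reserved word in Python, so write class_)
- using a regular expression on href to find URLs that contain a certain string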

soup.find_all(id='link2')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.find_all(id=True)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find_all(href=re.compile("elsie"), id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
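
A runnable sketch of keyword filters on made-up markup. The `data-kind` attribute is invented for this example. Names like `data-kind` are not valid Python keyword arguments, so they go through the `attrs` dict instead.

```python
import re

from bs4 import BeautifulSoup

html = ('<div>'
        '<a class="sister" href="http://example.com/elsie" id="link1" data-kind="cat">Elsie</a>'
        '<a class="sister" href="http://example.com/lacie" id="link2" data-kind="dog">Lacie</a>'
        '</div>')
soup = BeautifulSoup(html, "html.parser")

by_href_and_id = soup.find_all(href=re.compile("elsie"), id="link1")  # both conditions must hold
by_class = soup.find_all(class_="sister")                             # class_ stands in for class
by_data = soup.find_all(attrs={"data-kind": "dog"})                   # hyphenated names need attrs

print([a["id"] for a in by_href_and_id])  # ['link1']
print([a["id"] for a in by_class])        # ['link1', 'link2']
print([a["id"] for a in by_data])         # ['link2']
```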

7. limit = num

Sets the maximum number of results in the returned list. The search stops as soon as that many matches are found.

8. recursive = bool

With recursive=False, only the direct children of the tag are searched. (By default, all descendants are searched.)
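
A minimal sketch of both arguments on made-up markup:

```python
from bs4 import BeautifulSoup

html = ("<html><head><title>The title</title></head>"
        "<body><p>1</p><p>2</p><p>3</p></body></html>")
soup = BeautifulSoup(html, "html.parser")

first_two = [p.get_text() for p in soup.find_all("p", limit=2)]  # stop after two matches
deep = soup.html.find_all("title")                               # searches all descendants
direct = soup.html.find_all("title", recursive=False)            # direct children of <html> only

print(first_two)    # ['1', '2']
print(len(deep))    # 1
print(len(direct))  # 0
```

<title> sits under <head>, not directly under <html>, so the non-recursive search finds nothing.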

Searching with CSS selectors

Use the .select(css_selector) method.

According to the documentation, if CSS selectors are all you need, using lxml directly instead of going through Beautiful Soup is faster.

1. Tag selectors

soup.select('title') # [<title>The Dormouse's story</title>]
soup.select('html head title') # [<title>The Dormouse's story</title>]
soup.select("head > title") # [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie"  id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

2. Class selectors

soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

3. ID selectors

soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

4. Attribute selectors

soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
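
The selector snippets above assume a `soup` built from the sample document in the official docs. Below is a self-contained sketch with a shortened version of that markup. It also uses `select_one()`, which returns the first match, or None when there is none.

```python
from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body><p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

title = soup.select_one("head > title").get_text()               # tag selector with child combinator
ids_child = [a["id"] for a in soup.select("p > a")]              # <a> directly under <p>
ids_class = [a["id"] for a in soup.select("p.story a.sister")]   # class selectors
href_by_id = soup.select_one("a#link2")["href"]                  # id selector
ends_with = [a["id"] for a in soup.select('a[href$="tillie"]')]  # attribute selector: href ends with tillie
nothing = soup.select_one("table")                               # None when nothing matches

print(title, ids_child, ids_class, href_by_id, ends_with, nothing)
```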

Viewing the result

.prettify() returns the markup as a string with line breaks and indentation added, one tag or string per line.

soup.prettify() # '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
print(soup.a.prettify())

# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>
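
To compare the raw and prettified forms side by side, here is a minimal sketch using the same link markup as the output above:

```python
from bs4 import BeautifulSoup

html = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(html, "html.parser")

raw = str(soup.a)            # the markup on a single line
pretty = soup.a.prettify()   # the same markup with line breaks and indentation

print(raw)
print(pretty)
```

prettify() changes whitespace, so it is best kept for inspecting the tree, not for producing the text you save.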
