CRAWLING

양희연 · June 7, 2020

Python


requests

⛑ Installation

(base) $ conda create -n crawling python=3.8
(base) $ conda activate crawling
(crawling) $ pip install requests

💻 code

import requests

# HTTP GET request
url = requests.get('address')  # 'address' is a placeholder for the target URL

# Get the HTML source
html = url.text

url.text only returns a Python string object, which makes it hard to extract information from it.
For that reason we use BeautifulSoup.
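To make the problem concrete, here is a minimal sketch of what extracting data from a raw string looks like. The markup is made up for illustration and stands in for what `url.text` would return; no request is actually sent.

```python
# Stand-in for what `url.text` returns: an ordinary Python string (made-up markup)
html = '<h2 class="entry-title"><a href="/post/1">First post</a></h2>'

print(type(html).__name__)  # a plain str has no notion of tags or attributes

# Pulling out the link text by hand means searching for markup fragments
# and slicing between them.
start = html.find('>', html.find('<a ')) + 1  # position just after the opening <a ...> tag
end = html.find('</a>', start)                # position of the closing tag
link_text = html[start:end]
print(link_text)
```

This kind of slicing breaks as soon as the markup changes slightly, which is the gap an HTML parser fills.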

beautifulsoup

BeautifulSoup handles parsing, that is, converting HTML code into an object structure that Python understands. With this library you can extract properly meaningful information from a page.

⛑ Installation

(crawling) $ pip install beautifulsoup4

💻 code

import requests
from bs4 import BeautifulSoup

url = requests.get('address')  # 'address' is a placeholder for the target URL
html = url.text

# Parse the HTML
bs = BeautifulSoup(html, 'html.parser')

# Store the a tags that are direct children of h2 tags with class entry-title in titles
titles = bs.select('h2.entry-title > a')

for title in titles:
    # Print the text of the a tag
    print(title.text)
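To see what the parsed object structure looks like without requesting a real site, the following sketch parses an inline HTML string instead. The markup is invented for illustration; only the `h2.entry-title > a` selector comes from the example above.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page
html = """
<h2 class="entry-title"><a href="/post/1">First post</a></h2>
<h2 class="entry-title"><a href="/post/2">Second post</a></h2>
"""

bs = BeautifulSoup(html, 'html.parser')

# After parsing, tags are objects that can be navigated by name
first_h2 = bs.h2
print(first_h2.name)       # tag name
print(first_h2['class'])   # class is multi-valued, so it comes back as a list
print(first_h2.a['href'])  # attribute of the nested a tag

# The same selector as in the example above, collected into a list of strings
titles = [a.text for a in bs.select('h2.entry-title > a')]
print(titles)
```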

✔️ select

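The select function returns all matching elements as a list.

select('p')                                 # all p tags
select('p')[0]                              # the first p tag (a single element, not a list); same as select_one('p')
select('p span')                            # span tags anywhere beneath a p tag
select('p > span')                          # span tags that are direct children of a p tag (call .text on each element, not on the list)
select('p.bs')                              # all p tags with class bs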
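The differences between these forms are easier to see on a small input. The sketch below uses made-up markup in which one span is a direct child of a p tag and another is nested one level deeper; the class name `nope` is a hypothetical non-matching selector.

```python
from bs4 import BeautifulSoup

# Made-up markup: one span is a direct child of p, the other is nested inside b
html = '<p class="bs"><span>direct</span><b><span>nested</span></b></p><p>plain</p>'
bs = BeautifulSoup(html, 'html.parser')

all_p = bs.select('p')             # list of every p tag
first_p = bs.select_one('p')       # a single Tag, the same element as all_p[0]
descendants = bs.select('p span')  # spans at any depth under a p
children = bs.select('p > span')   # only spans directly under a p
missing = bs.select_one('p.nope')  # no match: select_one gives None
empty = bs.select('p.nope')        # no match: select gives an empty list

print(len(all_p), first_p == all_p[0])
print([s.text for s in descendants])
print([s.text for s in children])
print(missing, len(empty))
```

Because select_one returns None when nothing matches while indexing an empty list raises IndexError, the two are only interchangeable when at least one element matches.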

# Print, one by one, the text of each a tag that is a direct child of an h2 tag with class bs
titles = bs.select('h2.bs > a')
for title in titles:
    print(title.text)

# Print, one by one, the href of each a tag that is a direct child of an h2 tag with class bs
links = bs.select('h2.bs > a')
for link in links:
    print(link['href'])
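One thing to watch for when reading attributes: indexing with `link['href']` raises KeyError if a tag has no href. A hedged sketch of a more forgiving loop, using made-up markup where the second a tag lacks the attribute:

```python
from bs4 import BeautifulSoup

# Made-up markup: the second a tag has no href attribute
html = '<h2 class="bs"><a href="/post/1">One</a></h2><h2 class="bs"><a>Two</a></h2>'
bs = BeautifulSoup(html, 'html.parser')

pairs = []
for link in bs.select('h2.bs > a'):
    # link['href'] would raise KeyError for the second tag; get() returns None instead
    pairs.append((link.text, link.get('href')))

print(pairs)
```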

✅ css selector

Use "Copy selector" in the browser's developer tools to get the selector for an element.

<!-- HTML file -->
<p id='color1' title='red'> red
    <span> aaa </span>
    <a href = 'http://test1'> url1 </a>
</p>
 
<p id='color2' title='green'> green
     <span> bbb </span>
     <a href = 'http://test2'> url2 </a>
</p>
 
<p id='color3' title='blue'> blue
     <span class = 'location'> ccc </span>
     <a href = 'http://test3'> url3 </a>
</p>

# Python file
import requests
from bs4 import BeautifulSoup

url = requests.get('http://127.0.0.1:5500/index.html')
html = url.text
bs = BeautifulSoup(html, 'html.parser')

titles = bs.select('p > span')
links = bs.select('p > a')

# Pair each span with the link from the same paragraph
for title, link in zip(titles, links):
    print(title.text, link['href'])
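Selectors copied from the developer tools usually rely on ids, classes, or attributes rather than bare tag names. The exact string the browser produces depends on the page, so the selectors below are hand-written assumptions against the sample HTML, which is inlined here so that no local server is needed.

```python
from bs4 import BeautifulSoup

# The sample HTML from above, inlined as a string
html = """
<p id='color1' title='red'> red
    <span> aaa </span>
    <a href='http://test1'> url1 </a>
</p>
<p id='color2' title='green'> green
    <span> bbb </span>
    <a href='http://test2'> url2 </a>
</p>
<p id='color3' title='blue'> blue
    <span class='location'> ccc </span>
    <a href='http://test3'> url3 </a>
</p>
"""
bs = BeautifulSoup(html, 'html.parser')

by_id = bs.select_one('#color2 > a')            # id selector
by_class = bs.select_one('span.location')       # class selector
by_attr = bs.select_one('p[title="blue"] > a')  # attribute selector

print(by_id['href'])
print(by_class.text.strip())  # strip() removes the padding spaces around the text
print(by_attr['href'])
```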

📂 Saving to a CSV file

Save the crawled data to a CSV file.

💻 code

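import csv

# bs is the BeautifulSoup object created in the earlier example
titles = bs.select('h2.entry-title > a')

# newline='' keeps the csv module from writing blank rows on Windows
with open('test.csv', 'w', encoding='utf-8', newline='') as out_file:
    data_writer = csv.writer(out_file)
    data_writer.writerow(['title'])  # header row

    for title in titles:
        # writerow expects a sequence of fields, so wrap the text in a list
        data_writer.writerow([title.text])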
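Scraped titles often contain commas or quotes, so it can help to check how the csv module handles them. The sketch below writes to an in-memory buffer instead of a file and reads the result back; the titles are made up for illustration.

```python
import csv
import io

# Made-up titles; the second one contains a comma and double quotes
titles = ['First post', 'Hello, "world"']

buffer = io.StringIO()        # in-memory stand-in for the open file
writer = csv.writer(buffer)
writer.writerow(['title'])    # header row
for title in titles:
    writer.writerow([title])  # one field per row

csv_text = buffer.getvalue()
print(csv_text)               # the second title is quoted and its quotes are doubled

# Reading the text back recovers the original strings
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)
```

The quoting is applied automatically by csv.writer, which is a good reason to use it rather than joining fields with commas by hand.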

selenium

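A framework for testing web applications.
It can retrieve content that is loaded asynchronously or late through JavaScript.
Selenium can locate elements directly in the page rendered by the browser, so BeautifulSoup is not strictly required.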

⛑ Installation

① Install selenium

(crawling) $ pip install selenium

② Install ChromeDriver
A tool provided for automating web tests. (Install the version that matches your browser version.)

💻 code

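import csv

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Location of the downloaded web driver.
# Selenium 4 takes the path through a Service object; Selenium 3 accepted it
# as the first positional argument of webdriver.Chrome().
driver = webdriver.Chrome(service=Service('/Users/user/Downloads/chromedriver'))

try:
    driver.get('https://www.starbucks.co.kr/menu/drink_list.do')
    html = driver.page_source  # HTML of the page as currently rendered by the browser
finally:
    # close() only closes the active window; quit() shuts down the whole browser
    driver.quit()

bs = BeautifulSoup(html, 'html.parser')
infos = bs.select('a.goDrinkView')

# newline='' keeps the csv module from writing blank rows on Windows
with open('starbucks.csv', 'w', encoding='utf-8', newline='') as out_file:
    data_writer = csv.writer(out_file)
    data_writer.writerow(['menu', 'image'])  # header row

    for info in infos:
        img = info.select_one('img')  # first img inside the link
        data_writer.writerow((img['alt'], img['src']))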