🔥 Web crawling (feat. bs4 & selenium)

yeeun lee · April 16, 2020

For two weeks I've been learning how to crawl with Beautiful Soup and selenium. Maybe because there are so many languages and approaches you can crawl with, none of it quite sticks yet 💀💀💀

The frameworks don't read a page the way a person would expect, which gets frustrating a lot. Still, things become mine once I keep doing them, so I'll stop doubting and keep at it. *I also added links to sites that looked like good references!

1. setup

First, this assumes every module imported below (bs4, requests, etc.) is already installed in your development environment. If you use conda, it's a good idea to keep a separate environment for crawling. Since the results will be saved to a csv file, let's look at how to save first.

1.1 csv

csv stands for comma-separated values: a text file format in which the columns on each line are separated by commas. Simple data can be handled by splitting on commas, but when the data itself contains commas it's better to use Python's built-in csv module. Let's start with the code that writes the extracted data into a file you can open in Excel.

Reference: 예제로 배우는 파이썬 프로그래밍 (Learning Python Programming by Example)

import csv

# Create and open "tag.csv" in write mode (w+).
tag_open = open("tag.csv", 'w+', encoding='utf-8', newline='')

# Wrap the file in a csv.writer object.
tag_writer = csv.writer(tag_open)

# Use the writer's writerow method to create the header: a title column and a tags column.
# Rows added later, two values at a time, are saved below this header.
tag_writer.writerow(('title', 'tags'))

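* newline option

If you create the csv file in write mode without this option, the data can look wrong when you open it in Excel: on Windows an empty row shows up after every row, because the csv module writes its own line endings and text mode then translates them again. Passing newline='' leaves the line endings alone, so the extra line break after each write goes away! (The other oddity, where each cell holds only a single character, is what happens when writerow receives a bare string instead of a list or tuple.)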
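To make the point about commas inside the data concrete, here is a minimal sketch. The file name demo.csv and the sample row are made up for illustration; only the csv calls themselves come from the standard library.

```python
import csv
import os
import tempfile

# Hypothetical sample row: the title itself contains a comma.
row = ("Hello, Goodbye", "pop")

# Naive approach: join with commas, then split again. The title breaks in two.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")

# csv module: the writer quotes the field, and the reader restores it intact.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerow(row)

with open(path, encoding="utf-8", newline="") as f:
    csv_fields = next(csv.reader(f))

print(naive_fields)  # ['Hello', ' Goodbye', 'pop']
print(csv_fields)    # ['Hello, Goodbye', 'pop']
```

Opening the file in a with block also closes it automatically, so there is no separate close() call to forget.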

2. beautifulsoup

Korean documentation was hard to find, so I'm putting the link first. As an aside, people say that, unlike selenium, beautifulsoup is a bit more intuitive and faster at pulling out elements. But the Korean docs are really poor, and search results mix several languages together, so it's very easy to waste time going down the wrong path ... 😞

In my case, too, I searched for ages only to find out that the method I was looking at was a JavaScript method, which made me feel I really need to understand what I'm using.

2.1 setup

Honestly, at first I just imported everything the way I was taught without knowing what any of it meant. I don't know how much crawling I'll do in the future, but I've added explanations so that I use these knowing what they do.

from bs4 import BeautifulSoup
from urllib.request import urlopen

import csv
import requests
import re

# The url I want to fetch data from.
crawling_url = "https://www.billboard.com/charts/hot-100"

# Fetch the data at the url with an HTTP GET request.
req = requests.get(crawling_url)

# Get the html source (the fetched data, returned as a str).
# This is the HTML received as the HTTP response, the same as "view page source" in Chrome.
html = req.text

# Parse the data with bs4 into a structure Python can work with.
bs = BeautifulSoup(html, 'html.parser')

The two modules below aren't actually used in the code I wrote, but I was taught to import them as part of the setup, so I've noted what they are.

  • re: compiles regular expressions and lets you call methods on the compiled pattern object. It is used to search and replace strings concisely.

  • urlopen function (reference: 코딩 도장 (Coding Dojang)): urlopen, from the urllib.request module, is a function that opens a URL; when the URL opens successfully, response.status is 200.
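Since re never appears in the code in this post, here is a small sketch of the compile-then-call pattern described above. The input strings and patterns are made up for illustration. urlopen is left out because it needs network access.

```python
import re

# Hypothetical scraped strings; the data is made up for illustration.
raw_rank = "  No. 12 \n"
raw_title = "Blinding   Lights\n(Remix)"

# Compile once, then reuse the pattern objects' methods.
digits = re.compile(r"\d+")
whitespace = re.compile(r"\s+")

rank = int(digits.search(raw_rank).group())      # search: find the first match
title = whitespace.sub(" ", raw_title).strip()   # sub: replace every match

print(rank)   # 12
print(title)  # Blinding Lights (Remix)
```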

* content attribute

Sometimes, when creating the beautifulsoup object, people pass .content as the first argument, as below. The content attribute holds the HTML as raw bytes, whereas .text holds the same HTML decoded into a str; BeautifulSoup accepts either. So the step above that fetched the html source (the one with .text) can be folded into this one.

response = requests.get("https://www.billboard.com/charts/hot-100")
soup = BeautifulSoup(response.content, 'html.parser')
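The difference between the two attributes is easier to see without a network call. In this sketch the variables content and text are stand-ins for response.content and response.text, and the HTML fragment is made up.

```python
# Stand-in for response.content: the body as raw bytes (no network call here).
content = "<p>빌보드 Hot 100</p>".encode("utf-8")

# Stand-in for response.text: the same body decoded into a str.
text = content.decode("utf-8")

print(type(content).__name__)  # bytes
print(type(text).__name__)     # str
print(text)                    # <p>빌보드 Hot 100</p>
```

With .text, requests picks the encoding for you; with .content, the decoding is left to BeautifulSoup, which can help when a page declares its encoding only inside the HTML.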

2.2 accessing elements

- object.select()

I wrote "object" because this part changes depending on what you named your beautifulsoup object.

Below is the code that pulls the rank, song, and artist from the Billboard Hot 100 page and writes them to a csv file. I inspected the elements I wanted in the developer tools, then ran a for loop over all the text found under those class names.

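# Assumed setup, following section 1.1 (the file name is hypothetical).
csv_open = open("billboard.csv", 'w+', encoding='utf-8', newline='')
csv_writer = csv.writer(csv_open)
csv_writer.writerow(('rank', 'song', 'artist'))

rank_list = bs.select('.chart-element__rank__number')
song_list = bs.select('.chart-element__information__song')
artist_list = bs.select('.chart-element__information__artist')

# zip pairs up the n-th rank, song, and artist.
for item in zip(rank_list, song_list, artist_list):
    rank = item[0].text
    song = item[1].text
    artist = item[2].text

    csv_writer.writerow((rank, song, artist))

csv_open.close()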

- what select returns

The result of calling the object's select method is a list. So you have to either

  • pick an index and convert it with .text, or
  • run a for loop and take the elements out one at a time

Only through one of these two can you see the result in human-readable form.
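The two ways of reading a select() result can be tried on a small inline fragment. This is only a sketch: the HTML below is made up (the class names mirror the ones used above), and it assumes bs4 is installed.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment; the class names mirror the ones used above.
html = """
<ul>
  <li><span class="chart-element__rank__number">1</span>
      <span class="chart-element__information__song">Song A</span></li>
  <li><span class="chart-element__rank__number">2</span>
      <span class="chart-element__information__song">Song B</span></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

songs = soup.select(".chart-element__information__song")

# select() returns a list of Tag objects, not strings.
first_song = songs[0].text               # 1) pick an index, then .text
all_songs = [tag.text for tag in songs]  # 2) loop over every element

print(len(songs))   # 2
print(first_song)   # Song A
print(all_songs)    # ['Song A', 'Song B']
```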

2.3 tips

- using the python shell

Honestly, the most annoying part of crawling is running things one at a time to check whether they're right, and getting frustrated when nothing comes out. A tip I got from a friend I study with is to try it in the python shell!

If you import the modules in the shell and set things up (the url, the get request, and so on), you can check right there in the interactive environment whether the method you're using returns a result. When you write many lines and run them at once it's easy to lose track of where the problem is, but here you see the result of each line immediately, which is really nice 😍

# In Terminal, type python3 and press Enter to open the shell.

Python 3.7.7 (default, Mar 26 2020, 10:32:53)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.

from bs4 import BeautifulSoup
from urllib.request import urlopen

import csv
import requests
import re

crawling_url = "https://www.billboard.com/charts/hot-100"

req = requests.get(crawling_url)
html = req.text

bs = BeautifulSoup(html, 'html.parser')

rank_list = bs.select('.chart-element__rank__number')
print(rank_list[0].text) # output : 1 

- way to find elements

While looking through questions on stackoverflow, I found an answer that summed up some useful points for finding nested tags when crawling, so I've added it here. (I meant to attach the link too, but I forgot and lost track of it..)


The more specifically you spell out the path to an element, the more easily the code breaks unless you handle exceptions properly, and that path logic may not generalize anyway!

The more detail you specify the path to find your element. The easier your code will break if you don't handle the exceptions correctly and actually, the logic you find the path might not be general at all..


Rather than counting the order of generic tags, locate elements by a unique id or class whenever you can.

Try to locate elements by unique id or classes instead of counting on the order of some general tags.


If the text you're looking for follows a particular pattern, use the text itself in the query.

If the text you are trying to collect follow a pattern. you can find it easily using text itself , which is more straightforward for programmer... texts are what people see actually.
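The three tips can be compared side by side on a toy page. This is a sketch under assumptions: the HTML, ids, classes, and the opening-hours pattern are all made up, and it assumes bs4 is installed.

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML; ids, classes, and text are illustrative only.
html = """
<div id="main">
  <div><p>ad</p></div>
  <div class="listing"><h2 class="title">Cafe Mong</h2><p>Open 10:00-20:00</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Fragile: relies on tag order, so it breaks if another <div> is inserted.
by_position = soup.select("#main > div")[1].h2.text

# Sturdier: goes straight to a distinctive class.
by_class = soup.select_one(".listing .title").text

# By text: find the string that matches a pattern (here, opening hours).
hours = soup.find(string=re.compile(r"\d{1,2}:\d{2}-\d{1,2}:\d{2}"))

print(by_position)  # Cafe Mong
print(by_class)     # Cafe Mong
print(hours)        # Open 10:00-20:00
```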

3. selenium

The framework to reach for when you have to crawl dynamic pages. It lets your code control the browser: clicking automatically, typing text, pressing Enter, and so on. I'll be crawling in the Chrome browser.

3.1 setup

import time
from selenium import webdriver
import requests

The webdriver API is what lets you control the browser. Check your Chrome version, download the ChromeDriver version that matches it, and give the script the path where you saved it.

Writing only up to Users/yeni isn't enough here; you have to include the file name for selenium to find the driver...!
At first I left chromedriver off the path instead of writing it like below, and it just kept refusing to run ^^

driver = webdriver.Chrome('/Users/yeni/chromedriver')

Just as I made the bs object in the beautifulsoup section above, below I made an object called driver. The get method opens the url, and time.sleep pauses for a fixed time to allow for how long the page takes to load its content.

driver = webdriver.Chrome('/Users/yeni/chromedriver')
driver.get('https://www.billboard.com/charts/hot-100')
time.sleep(1)

3.2 accessing elements

In the developer tools, right-click the part of the page you want and inspect it to find the code for the part you want to scrape. Right-clicking that code and opening Copy lets you choose whether to copy the element, the selector, or the XPath. (reference link)

find_element_by_name('HTML_name')
find_element_by_id('HTML_id')
find_element_by_xpath('/html/body/some/xpath')
find_element_by_css_selector('#css > div.selector')
find_element_by_class_name('some_class_name')
find_element_by_tag_name('h1')

To access multiple elements on the page, just add an s to the methods above (elements). find_element returns the first match and raises an exception if there is none, while find_elements returns a list, which is empty when nothing matches. Honestly, at my current level I don't really get why they made methods that differ only by an s. The names aren't completely different, just with or without an s, yet they behave differently, so when something breaks because of this it makes me pretty mad 👿👿👿

Still, compared with beautifulsoup, where the method names are all over the place, this looks consistent, so it's actually nicer to use... phew ;

When you use the methods above, the browser parses the HTML for you, so you don't need BeautifulSoup on the Python side.

- driver.page_source

The HTML exactly as the browser shows it, identical to the contents of the Elements tab in the Chrome developer tools. You can think of it as "get every element on the page"! I didn't use it much, but it seemed worth knowing, so I've noted it here.

html = driver.page_source 

- example code

I wrote code that goes into each post on the 비마이펫 (BeMyPet) page and comes back out. After opening the page that shows the full list, I created a variable, places, that gives access to each post through its class.

Then, looping as many times as the length of that variable, I

  1. click the place
  2. run the function that collects the content inside the place (this was saved as separate code and is just called here)
  3. go back with driver.back()

and that's the whole flow. The two problems that kept me stuck for almost the entire day are described below ^^

driver = webdriver.Chrome('/Users/yeni/chromedriver')
bemypet_url = 'https://mypetlife.co.kr/map/place-listings/'
driver.get(bemypet_url)

places = driver.find_elements_by_class_name('job_listing-clickbox')

# Function that scrapes the content inside a post page.
# Assumes about_writer and tag_writer are csv.writer objects set up as in section 1.1.
def save_contents_in_exel():
    title = driver.find_element_by_class_name('job_listing-title').text
    explanation = driver.find_element_by_css_selector('#listify_widget_panel_listing_content-1').text
    tags = driver.find_elements_by_class_name('ion-pricetag')

    about_writer.writerow([title, explanation])

    for t in tags:
        tag_writer.writerow([title, t.text])

# Loop that goes back and forth between the list page and each post.
for num in range(len(places)):
    places[num].click()

    save_contents_in_exel()

    driver.back()
    # Reload the list page and look the elements up again (see 3.3, solution 1).
    driver.get(bemypet_url)
    time.sleep(5)
    places = driver.find_elements_by_class_name('job_listing-clickbox')

3.3 Message: stale element reference: element is not attached to the page document

The error message I saw the most, from morning till evening 🤬🤬🤬🤬
It means the element can't be used because it is no longer attached to the page. I can't explain it in precise terms, but apparently when you go one level deep and then go back, the original page has changed, so the elements you found earlier can no longer be used.

- solution 1

So pass the url to get again and define the variable again, which tells Chrome to fetch the same elements from the same page afresh.

That is, after coming back, and 'before' the .click() that goes in to scrape the next item!!!

  1. get the url again
  2. redefine the variable

Both steps are a must. If that still doesn't do it, increase the time.sleep duration to give the page more time to load. For the code, see the example code above!
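Instead of guessing a longer time.sleep, one option is to poll until the elements actually show up. The helper below is a hypothetical sketch, not part of selenium: wait_until and fake_find_places are names invented here, and the fake function stands in for a real lookup so the example runs without a browser.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Hypothetical helper, not part of selenium. Returns the truthy value,
    or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)

# Stand-in for a page that needs three polls before its elements show up.
calls = {"n": 0}

def fake_find_places():
    calls["n"] += 1
    return ["place-1", "place-2"] if calls["n"] >= 3 else []

places = wait_until(fake_find_places, timeout=2.0, poll=0.01)
print(places)      # ['place-1', 'place-2']
print(calls["n"])  # 3
```

With a real driver, the condition could be something like lambda: driver.find_elements_by_class_name('job_listing-clickbox'), since an empty list is falsy. selenium also ships its own WebDriverWait class for this purpose, which is probably the better choice in real code.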


- solution 2

And if the page you're crawling is really badly built, the endpoint may not distinguish between pages when you move from the first page to the second. (Example: instead of something like map/listing/1, it only ever shows listing.)

In that case, even if you go to the second page and scrape it, the get url sends you back to the first page again. On top of that, there's simply no way to handle it with the existing code.

So instead, just collect the url of each post from the list page and store them in a list. Then, with no need to go back and forth (driver.back), visit the saved urls in order and crawl each one.

driver.get('https://mypetlife.co.kr/map/place-listings/')
time.sleep(3)

# List that collects the url of every post.
links = []

for page_idx in range(1, 7):
    # Once on the page,
    page = driver.find_elements_by_xpath('//*[@id="main"]/div/nav/ul/li')
    time.sleep(3)

    # store the elements you want to extract data from in a variable.
    places = driver.find_elements_by_class_name('job_listing')

    # Append each element's url to the list in turn.
    for place in places:
        link = place.find_element_by_css_selector("a.job_listing-clickbox").get_attribute('href')
        links.append(link)

# Take the saved links out one at a time!
for link in links:
    driver.get(link)

Still, after sticking with it and solving things one at a time, I feel like I've clearly improved compared with a week ago. I still don't really get beautifulsoup's find, so that's what I'll write up next.

New blog: yenilee.github.io
