📌 Web Crawling & Crawling Starbucks

may_soouu · September 6, 2020

Web crawling

1. Libraries to install

pip install requests
 # module for handling HTTP requests
pip install beautifulsoup4
 # module used for web crawling or scraping
pip install selenium
 # one of the tools that automates a web UI by controlling a web browser
pip install webdriver-manager
 # downloads and manages the browser driver (chromedriver) for Selenium

์œ„์˜ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์„ค์น˜ ํ›„ ์›น ํฌ๋กค๋ง์„ ์ง„ํ–‰ํ•ด๋ณด๋ ค๊ณ  ํ•œ๋‹ค.
๋ฏธ๋‹ˆ์ฝ˜๋‹ค ๊ฐ€์ƒํ™˜๊ฒฝ์ด ์„ค์น˜๋˜์–ด์žˆ๋‹ค๋Š” ๊ฐ€์ •ํ•˜์— ์ง„ํ–‰ํ•˜์˜€๋‹ค.

2. Looking at the Starbucks page

I want to pull the item names and image addresses from the 'Drinks' list of the Starbucks menu.
Opening the developer tools shows that the image source is in the src attribute of the img tag and the drink name is in its alt attribute.
Looking at the parent tag, each product is represented by one li tag whose class name is 'menuDataSet'.

3. Writing the code

After activating the virtual environment, create a file
import csv
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

Import the required libraries
✋🏻 Hold on! This is where I hit an error. When I ran the code without the last import line,
I got the error chromedriver executable needs to be in PATH 😂
After some googling, I found that I had to import it as in the last line and then call
webdriver.Chrome(ChromeDriverManager().install())
That call appears in the code further down.


์ž„ํฌํŠธ ํ›„, csvํŒŒ์ผ์— ์ €์žฅํ•ด์•ผ ํ•˜๋‹ˆ, csvํŒŒ์ผ์„ ์ƒ์„ฑํ•œ๋‹ค.

filename = "starbucks_product.csv"
# newline="" is what the csv module recommends, so no blank rows appear on Windows
csv_open = open(filename, "w+", encoding="utf-8", newline="")
csv_writer = csv.writer(csv_open)

The file is created with open in w+ mode (read + write).

| Mode | Description |
| --- | --- |
| r | Open the file read-only |
| r+ | Read + write; writes over the existing content in place |
| w | Open the file for writing (existing content is erased) |
| w+ | Read + write; erases the existing file before writing |
| a | Open the file for appending |
| t | Open the file in text mode |
| b | Open the file in binary mode |

Use a write call to put the list into the file (with csv.writer, that is writerow).
If you want several strings on a single line, use writelines, which adds no line breaks between them!
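
To make the difference concrete, here is a small sketch that compares write, writelines, and csv.writer's writerow. It writes to in-memory buffers instead of real files, and the drink names and example.com URLs are invented sample data.

```python
import csv
import io

# Made-up sample data, only for illustration.
rows = [
    ("Americano", "https://example.com/americano.jpg"),
    ("Latte", "https://example.com/latte.jpg"),
]

# write() takes a single string per call.
buf_write = io.StringIO()
buf_write.write("Americano")
buf_write.write("Latte")

# writelines() takes an iterable of strings and adds no newlines,
# so the items end up back to back on one line.
buf_lines = io.StringIO()
buf_lines.writelines(["Americano", "Latte"])

# csv.writer.writerow() joins the fields with commas and terminates the row.
buf_csv = io.StringIO(newline="")
writer = csv.writer(buf_csv)
for row in rows:
    writer.writerow(row)

print(repr(buf_write.getvalue()))
print(repr(buf_lines.getvalue()))
print(repr(buf_csv.getvalue()))
```

For tabular data like item name plus image URL, writerow is usually the better fit, because it also takes care of quoting fields that contain commas.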


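# webdriver-manager downloads a matching chromedriver and returns its path.
# With newer Selenium 4 releases, pass that path through a Service object:
# webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# (Service is imported from selenium.webdriver.chrome.service)
driver = webdriver.Chrome(ChromeDriverManager().install())
url = "https://www.starbucks.co.kr/menu/drink_list.do"
driver.get(url)
time.sleep(7)  # give the page time to finish loading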

์›น์ด ์ผœ์ง€๊ณ , ์•„๋ž˜ url๋กœ ์ ‘์† ํ•  ์ˆ˜ ์žˆ๋„๋ก driver~ ๋‚ด์šฉ์„ ์ž…๋ ฅํ•œ๋‹ค.
get๋ฉ”์†Œ๋“œ๋กœ url์ ‘์† !!!
ํ˜น์‹œ๋‚˜, ํฌ๋กฌ ๋กœ๋”ฉ ๋˜๋Š”๋ฐ ์‹œ๊ฐ„์ด ๊ฑธ๋ฆด ์ˆ˜ ์žˆ์œผ๋‹ˆ 7์ดˆ ๊ธฐ๋‹ค๋ฆฌ์ž๐Ÿ˜€
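
A fixed 7-second sleep either waits longer than necessary or, on a slow connection, not long enough. A common alternative is to poll until the content has appeared, with an upper time limit. Selenium ships its own WebDriverWait class for this purpose; the helper below is only a library-independent sketch of the same idea, and wait_until, its parameters, and the stand-in page_loaded check are all hypothetical names made up for this illustration.

```python
import time


def wait_until(condition, timeout=7.0, poll_interval=0.1):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for "the menu items are present on the page":
# it becomes true on the third check.
checks = {"count": 0}


def page_loaded():
    checks["count"] += 1
    return checks["count"] >= 3


loaded = wait_until(page_loaded, timeout=2.0, poll_interval=0.01)
print(loaded, checks["count"])  # True 3
```

In a real script the condition would be something like "the page source contains at least one menuDataSet item", so the script continues as soon as the list is there.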


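html = driver.page_source  # HTML of the page as rendered by the browser
soup = BeautifulSoup(html, 'html.parser')
# every <li> whose class matches the pattern "menuDataSet"
drinks = soup.find_all("li", {"class": re.compile("menuDataSet")})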

1) parser: analyzes the structure of the text

BeautifulSoup(html, 'html.parser')
The page I opened is a string structured as HTML, so please parse it as HTML!

🚦 Wait a moment! I used html.parser, but there are several kinds of parser.
The two most common ones are html.parser and lxml.

| parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| html.parser | BeautifulSoup(markup, "html.parser") | Fully featured, decent speed | Poor compatibility with older Python versions (before 2.7.3 or 3.2.2) |
| lxml | BeautifulSoup(markup, "lxml") | Very fast, fairly compatible with older Python versions | Depends on an external C library |

🤷‍♂️ lxml is somewhat faster, but from what I found, html.parser is the usual choice unless you are doing specialized, high-volume work.
lxml can also handle markup that is not strictly valid HTML.

Source: parser source

2) Regular expressions

I summarized this once before on velog, but I'll go over it again.
My earlier velog post (regular expressions)

print("m.group():", m.group())  # returns the matched string
print("m.string:", m.string)    # the input string
print("m.start():", m.start())  # start index of the matched string
print("m.end():", m.end())      # end index of the matched string
print("m.span():", m.span())    # start/end index of the matched string

Ex.
import re

def print_match(m):
    if m:
        print("m.group():", m.group())
        print("m.string:", m.string)
        print("m.start():", m.start())
        print("m.end():", m.end())
        print("m.span():", m.span())

p = re.compile("ca.e")
m = p.search("careless")  # search checks whether any part of the given string matches
print_match(m)

# Result
m.group(): care
m.string: careless
m.start(): 0
m.end(): 4
m.span(): (0, 4)
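
The same search behaviour is what makes re.compile("menuDataSet") usable as a class filter: search succeeds when the pattern occurs anywhere in the string, while fullmatch requires the whole string to match. A small sketch of the difference follows; the class names and words in it are invented for illustration.

```python
import re

pattern = re.compile("menuDataSet")

# Invented class names, only for illustration.
class_names = ["menuDataSet", "menuDataSet_new", "menu", "product"]

# search() looks for the pattern anywhere in the string.
found_by_search = [name for name in class_names if pattern.search(name)]

# fullmatch() only succeeds if the entire string matches the pattern.
found_by_fullmatch = [name for name in class_names if pattern.fullmatch(name)]

# "." in a pattern such as "ca.e" stands for exactly one character.
dot_pattern = re.compile("ca.e")
dot_matches = [w for w in ["care", "cafe", "cave", "caffe"] if dot_pattern.search(w)]

print(found_by_search)     # ['menuDataSet', 'menuDataSet_new']
print(found_by_fullmatch)  # ['menuDataSet']
print(dot_matches)         # ['care', 'cafe', 'cave']
```

If only the exact class name is wanted, passing the plain string instead of a compiled pattern is the stricter option.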

Now let's look at the next piece of code.
The elements matching the regular expression were put into drinks, and the loop goes through them one by one:

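for drink in drinks:
    image_tag = drink.find("img")
    if image_tag is None:  # skip items that have no image
        continue
    image_url = image_tag['src']  # image address
    title = image_tag['alt']      # drink name
    csv_writer.writerow((title, image_url))

csv_open.close()  # flush and close the CSV file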

After finding the img tag, I stored src in the image_url variable and alt in title.
The CSV file then shows the item names and image URLs laid out as a table!
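
The parsing step can also be tried without opening a browser. The sketch below runs the same idea on a hand-written HTML snippet; the markup, drink names, and example.com URLs are hypothetical stand-ins that only imitate the structure described earlier (li tags with class menuDataSet wrapping an img), not the real Starbucks page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure described in the post.
html = """
<ul>
  <li class="menuDataSet"><img src="https://example.com/a.jpg" alt="Americano"></li>
  <li class="menuDataSet"><img src="https://example.com/b.jpg" alt="Latte"></li>
  <li class="banner">no image here</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.find_all("li", class_="menuDataSet"):
    img = item.find("img")
    if img is None:  # skip items without an image
        continue
    rows.append((img["alt"], img["src"]))

# Write to an in-memory buffer instead of a real file.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)

print(rows)
print(buffer.getvalue())
```

Testing against a fixed snippet like this makes it easier to tell whether a problem comes from the parsing code or from the page not having loaded yet.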

๋Š๋‚€ ์ 

์ดˆ๋ฐ˜์— ์œ ํˆฌ๋ธŒ์—์„œ ํฌ๋กค๋ง ์˜์ƒ๋ณด๊ณ  ๋”ฐ๋ผํ•˜๋Š”๋ฐ, ๊ณ„์† ์˜ค๋ฅ˜๊ฐ€ ๋‚˜์„œ ๋ช‡์‹œ๊ฐ„์€ ๋‚ ๋ฆฐ ๊ฒƒ ๊ฐ™๋‹ค ใ… ใ…  ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋Š” ์‹น๋‹ค ์„ค์น˜ํ–ˆ๋Š”๋ฐ ๊ณ„์† ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๊ฐ€ ์—†๋‹ค๊ณ  ์˜ค๋ฅ˜๊ฐ€ ๋‚ฌ๋‹ค..
์›์ธ์€,, ํŒŒ์ด์ฌ ๋ฒ„์ „๊ณผ ๊ฐ€์ƒํ™˜๊ฒฝ์œผ๋กœ ์ƒ์„ฑํ•œ ํŒŒ์ด์ฌ ๋ฒ„์ „์ด ๋‹ฌ๋ผ์„œ ์ถฉ๋Œ์ด ์žˆ์—ˆ๋˜ ๊ฒƒ ๊ฐ™๋‹ค ๐Ÿ˜‚
์‹น ๋‹ค ์ง€์šฐ๊ณ , ๊ฐ€์ƒํ™˜๊ฒฝ ์„ค์ • ์‹œ ํŒŒ์ด์ฌ ๋ฒ„์ „ ์ง€์ •ํ•ด์„œ ๋‹ค์‹œ ์‹œ์ž‘ํ–ˆ๋‹ค!


2๊ฐœ์˜ ๋Œ“๊ธ€

April 3, 2021

Hello, thank you for the great material! One thing, though: from selenuim import webdriver has a typo, it should be selenium, so I wanted to let you know.

1๊ฐœ์˜ ๋‹ต๊ธ€