Let's Try Web Crawling 🤔

기린이 · February 25, 2021

Tobigs (투빅스) 🧠

Web scraping... fascinating, fun, and also a bit of a hassle.
In this post I'll go through Requests, BeautifulSoup, and Selenium.

Basic Prerequisites

HTTP

HTTP is a protocol for fetching resources such as HTML documents. It is the foundation of all data exchange on the web, and it is a client-server protocol.

It is the communication between my PC and a web server: the language used when asking for data (request) or handing it back (response).
When we crawl, we use this language to ask the web server for data.
For reference, GET is used to request data from the server, and POST is used to send data to the server.
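To make the GET/POST distinction concrete, here is a minimal sketch that builds both kinds of request with the requests library without actually sending them. The URL and payload are placeholders invented for illustration.

```python
import requests

# Placeholder URL and payload, used only for illustration.
url = 'https://example.com/search'
payload = {'query': 'youtube'}

# Build the requests without sending them, so no network access is needed.
get_request = requests.Request('GET', url, params=payload).prepare()
post_request = requests.Request('POST', url, data=payload).prepare()

# GET carries the parameters in the URL's query string and has no body.
print(get_request.method, get_request.url, get_request.body)

# POST keeps the URL unchanged and carries the data in the request body.
print(post_request.method, post_request.url, post_request.body)
```

To actually send either request, requests.get(url, params=payload) or requests.post(url, data=payload) would be the usual calls.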

HTML

HTML (Hypertext Markup Language) is not a programming language. It is a markup language that tells the browser how the web page we see is structured.

When we request data from a web server, the server returns source code written in HTML. The browser interprets this code and displays it as the web page we normally see.

So to crawl, we send a request to the web server, receive the HTML source code, and extract the parts we need from it.
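As a rough illustration of "extracting the parts we need" from HTML source, the sketch below pulls every link out of a small HTML string. It uses Python's standard html.parser module rather than the libraries covered later in this post, and both the HTML string and the LinkCollector class are made up for the example.

```python
from html.parser import HTMLParser

# A made-up HTML document standing in for a server response.
html_source = """
<html>
  <body>
    <h1>News</h1>
    <a href="https://example.com/a">First article</a>
    <a href="https://example.com/b">Second article</a>
  </body>
</html>
"""


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get('href')
            if href is not None:
                self.links.append(href)


collector = LinkCollector()
collector.feed(html_source)
print(collector.links)
```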

Requests

Module summary

GET

import requests

url = 'https://naver.com'
response = requests.get(url)
response.text

This is the most basic approach.

url = 'https://search.naver.com/search.naver'
params = {
    'where':'image',
    'sm':'tab_jum',
    'query':'youtube'
}
response = requests.get(url, params=params)
html = response.content

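Parameters can be passed as a dictionary. When you pass a dictionary, requests assembles the parameters into the URL's query string for you.

For the parameters above, the request URL becomes https://search.naver.com/search.naver?where=image&sm=tab_jum&query=youtube. Non-ASCII values are percent-encoded: with where=news and query=아이폰, the URL becomes https://search.naver.com/search.naver?where=news&sm=tab_jum&query=%EC%95%84%EC%9D%B4%ED%8F%B0.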
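The way a parameter dictionary turns into a URL can be checked without sending anything. This is a small sketch that prepares the request locally and then decodes the query string again. The parameter values are the ones used for the news search later in this post.

```python
import requests
from urllib.parse import urlsplit, parse_qs

url = 'https://search.naver.com/search.naver'
params = {'where': 'news', 'sm': 'tab_jum', 'query': '아이폰'}

# prepare() builds the final URL without sending the request.
prepared = requests.Request('GET', url, params=params).prepare()
print(prepared.url)

# Decoding the query string gives back the original parameter values.
decoded = parse_qs(urlsplit(prepared.url).query)
print(decoded)
```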

For reference, the difference between text and content is that text returns the response body decoded as a Unicode string, while content returns the raw bytes.
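The bytes-versus-string difference can be sketched in plain Python. The two variables below are only stand-ins for response.content and response.text, and the UTF-8 encoding is an assumption (requests chooses the encoding from the response itself).

```python
# Stand-in for response.content: the raw bytes a server might send.
content = '<p>아이폰</p>'.encode('utf-8')

# Stand-in for response.text: the same bytes decoded into a str.
text = content.decode('utf-8')

print(type(content), len(content))  # bytes, length counted in bytes
print(type(text), len(text))        # str, length counted in characters
```

Binary data such as an image has no meaningful text decoding, which is why content is the one to use for it.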

Only need the text -> text
Want to read images and other kinds of data as well -> content

BeautifulSoup

It parses the HTML code into a form that is easier to work with. (This is called the soup.)

You can extract the parts you want from the soup using select and select_one.

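import requests
from bs4 import BeautifulSoup

# html is the response body fetched in the GET example above
soup = BeautifulSoup(html, 'html.parser')

# select takes a CSS selector: elements with both classes photo_group and _listGrid
soup.select('.photo_group._listGrid')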

url = 'https://search.naver.com/search.naver'

params = {
    'where' : 'news',
    'sm' : 'tab_jum',
    'query' : '아이폰'  # "iPhone" in Korean
}

response = requests.get(url, params = params)

response.url

soup = BeautifulSoup(response.content, 'html.parser')

elements = soup.select('.news_area')

news_list = []
for element in elements:
    news = element.select_one('.news_tit')
    print(news)
    
    news_data ={
        'title' : news['title'],
        'link' : news['href']
    }
    news_list.append(news_data)

import pandas as pd
pd.DataFrame(news_list)

This reads 10 Naver news articles, builds a dictionary with each article's title and link, and turns the result into a DataFrame.

select returns a list containing every element that matches the selector.
select_one returns only the first element that matches, or None if nothing matches.
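The difference between select and select_one is easy to see on a small, made-up HTML snippet. The class names below mimic the ones used in the news example, but the markup itself is invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML that mimics a list of news results.
html = """
<div class="news_area"><a class="news_tit" href="https://example.com/1" title="First">First</a></div>
<div class="news_area"><a class="news_tit" href="https://example.com/2" title="Second">Second</a></div>
"""

soup = BeautifulSoup(html, 'html.parser')

# select returns every match as a list-like result.
all_titles = soup.select('.news_tit')
print(len(all_titles))

# select_one returns only the first match.
first_title = soup.select_one('.news_tit')
print(first_title['title'], first_title['href'])

# When nothing matches, select gives an empty list and select_one gives None.
print(soup.select('.missing'), soup.select_one('.missing'))
```

Because select_one can return None, checking the result before indexing into it avoids a TypeError when a page's layout changes.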

Selenium

Selenium is an automation tool. It can automate the actions we perform in a web browser,
such as clicking buttons and typing in an ID and password.

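from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style; find_element_by_xpath was removed in Selenium 4.3
driver = webdriver.Chrome(service=Service('./chromedriver'))

driver.get('https://www.naver.com')

# Open the login page
driver.find_element(By.XPATH, '//*[@id="account"]/a').click()

# Type the ID and password
driver.find_element(By.XPATH, '//*[@id="id"]').send_keys('123')
driver.find_element(By.XPATH, '//*[@id="pw"]').send_keys('123')

# Click the login button
driver.find_element(By.XPATH, '//*[@id="log.login"]').click()

driver.close()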

This is the Naver login flow.
send_keys types text into an element, click clicks it, and time.sleep(1) waits for 1 second.

Using Selenium and BeautifulSoup Together

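import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = 'https://color.adobe.com/ko/search?q=warm'

# Fetch the page with requests and inspect the parsed HTML
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
soup

# Open the same page in a browser driven by Selenium
driver = webdriver.Chrome(service=Service('./chromedriver'))
driver.get(url)

# Click the X button to close the popup
driver.find_element(By.XPATH, '//*[@id="react-spectrum-8"]/div/div[3]').click()

# Parse the HTML of the page as it currently stands in the browser
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

a = soup.select('div.Theme__theme___2NcED')

a[3]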

Selenium clicks the popup's X button, and then BeautifulSoup parses the HTML source of the page at that point.
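The parsing half of this flow can be tried without a browser by feeding BeautifulSoup a string in place of driver.page_source. The sketch below assumes the suffix in a class name like Theme__theme___2NcED is auto-generated and may change, so it matches on the stable-looking prefix instead. extract_theme_blocks and the sample HTML are hypothetical.

```python
from bs4 import BeautifulSoup


def extract_theme_blocks(page_source):
    """Return the div elements whose class contains 'Theme__theme'.

    extract_theme_blocks is a hypothetical helper name.
    """
    soup = BeautifulSoup(page_source, 'html.parser')
    # [class*="..."] matches any class attribute containing the substring,
    # so a change in the generated suffix would not break the selector.
    return soup.select('div[class*="Theme__theme"]')


# Made-up HTML standing in for driver.page_source.
sample_source = """
<div class="Theme__theme___2NcED">warm 1</div>
<div class="Theme__theme___2NcED">warm 2</div>
<div class="Other__box___aaaaa">unrelated</div>
"""

blocks = extract_theme_blocks(sample_source)
print([block.get_text() for block in blocks])
```

With a live driver, the call would be extract_theme_blocks(driver.page_source).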
