Web Scraping/Crawling with Python (2)

이민재 · February 17, 2023

Python/Library

0. intro

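In the previous post I covered the basics of how web pages are structured for web scraping/crawling, along with simple usage of the requests and bs4 libraries. This post goes beyond scraping static pages and covers the basics of selenium, a library that lets you perform dynamic actions on a web page.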

1. Installation

selenium can be installed from a terminal with the following command.

pip install selenium

If you also install webdriver_manager, a library that installs the web driver matching your Chrome browser version, you do not have to download a matching driver yourself.

pip install webdriver_manager

2. Basics

Once the required libraries are installed, the following flow lets you drive actions on a web page.

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options

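service = Service(ChromeDriverManager().install())  # install the web driver matching the local Chrome version
chrome_options = Options()
driver = webdriver.Chrome(service=service, options=chrome_options)

url = "https://example.com"  # placeholder: replace with the target page URL
driver.get(url)  # navigate to the page

# code for the actions to perform goes here (clicking, typing, etc.)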
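
One thing the flow above leaves open is cleanup: if an action raises an error, the browser can be left running. A minimal sketch of one way to handle that is shown below. The helper names fetch_title and make_chrome_driver are hypothetical and only used for illustration; the selenium calls are the same ones used above.

```python
def fetch_title(driver, url):
    """Open `url` with an already-created WebDriver and return the page title.

    The browser is always closed, even if loading the page raises an error.
    """
    try:
        driver.get(url)       # navigate to the page
        return driver.title   # text of the <title> tag of the loaded page
    finally:
        driver.quit()         # close the browser and stop the driver process


def make_chrome_driver():
    """Create a Chrome driver the same way as in the basic flow."""
    # Imported inside the function so fetch_title can be reused on its own.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=Options())
```

It could be called as fetch_title(make_chrome_driver(), "https://example.com"), where the URL is again only a placeholder.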

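3. Using options

-Keep the browser window open after the script finishes

chrome_options.add_experimental_option("detach", True)

-Set the user-agent

user_agent = "your browser's user-agent string"  # placeholder: fill in your own value
chrome_options.add_argument("--user-agent=" + user_agent)

-Open the browser window in incognito mode

chrome_options.add_argument('--incognito')

-Run commands without showing a browser window

chrome_options.add_argument('--headless')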
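
When several of these options are switched on and off between runs, it can help to collect them in one place. The following is a small sketch under that assumption; chrome_arguments and build_options are hypothetical helper names, and the flags are written with the leading `--` that Chrome uses for its command-line switches.

```python
def chrome_arguments(headless=False, incognito=False, user_agent=None):
    """Return the list of Chrome command-line arguments for the chosen settings."""
    args = []
    if headless:
        args.append("--headless")      # run without a visible window
    if incognito:
        args.append("--incognito")     # open in incognito mode
    if user_agent:
        args.append(f"--user-agent={user_agent}")  # override the user-agent header
    return args


def build_options(**settings):
    """Create a selenium Options object from the same keyword settings."""
    # Imported here so chrome_arguments can be used and tested without selenium.
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(**settings):
        options.add_argument(arg)
    return options
```

For example, build_options(headless=True, incognito=True) would give an Options object that can be passed as options= to webdriver.Chrome.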

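4. Finding elements: find_element()

First, import the By class.

from selenium.webdriver.common.by import By

After that, find_element() is used to locate the element you want. The call takes a locator strategy and the value to match:

driver.find_element(By.<locator strategy>, "<value to match>")

The locator strategies that can follow By. are listed below, together with the string each constant holds.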

ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
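
find_element() returns only the first match and raises an error when nothing matches. When every match is needed, selenium also provides find_elements(), which returns a list (empty if nothing matches). The sketch below shows one way to use it; texts_of is a hypothetical helper name, and the locator is passed in as a (strategy, value) pair.

```python
def texts_of(driver, locator):
    """Return the visible text of every element matching `locator`.

    `locator` is a (strategy, value) pair such as (By.CSS_SELECTOR, "a").
    Because the By constants are plain strings, ("css selector", "a")
    means the same thing.
    """
    by, value = locator
    # find_elements returns a list, which is empty when nothing matches
    return [element.text for element in driver.find_elements(by, value)]
```

With a real driver this could be called as texts_of(driver, (By.TAG_NAME, "a")) to collect the text of every link on the page.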

5. Performing actions

-Clicking

elem = driver.find_element(By.???)  # fill in a locator strategy and value
elem.click()

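-Typing/clearing text

elem = driver.find_element(By.???)  # fill in a locator strategy and value
elem.send_keys("text to type")  # type into the element
elem.clear()  # clear the text in the element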
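
Typing and clearing are often combined, because send_keys() appends to whatever is already in the field. A minimal sketch of that combination follows; replace_text is a hypothetical helper name and the locator is again a (strategy, value) pair.

```python
def replace_text(driver, locator, text):
    """Clear an input element and type `text` into it, returning the element."""
    by, value = locator
    elem = driver.find_element(by, value)
    elem.clear()          # remove any text already in the field
    elem.send_keys(text)  # type the new text
    return elem
```

A call might look like replace_text(driver, (By.NAME, "q"), "selenium"), where the name "q" is only an assumed example of an input field.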

-Drag and drop

from selenium.webdriver import ActionChains

elem = driver.find_element(By.???)  # element to drag
target = driver.find_element(By.?!?!)  # element to drop onto

action_chains = ActionChains(driver)
action_chains.drag_and_drop(elem, target).perform()  # queue the drag and run it

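-Taking a screenshot

# select the element to capture
element = driver.find_element(By.???)
# save a screenshot of that element (driver.save_screenshot captures the whole window instead)
element.screenshot('image.png')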

-Scrolling down the window (JavaScript code)

# scroll the browser to the very bottom of the page
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
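
On pages that load more content as you scroll, a single scroll to the bottom is usually not enough. One common approach, sketched below under that assumption, is to keep scrolling until the page height stops growing. scroll_to_bottom, pause, and max_rounds are hypothetical names chosen for this example.

```python
import time


def scroll_to_bottom(driver, pause=1.0, max_rounds=20):
    """Scroll down repeatedly until the page height stops growing.

    Returns the number of scrolls performed (at most `max_rounds`).
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give newly requested content time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new appeared, so this is the real bottom
        last_height = new_height
    return max_rounds
```

The max_rounds limit is there so that a page that never stops growing cannot keep the loop running forever.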

-Waiting for the page to load

import time

time.sleep(10)  # always wait a fixed 10 seconds

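Alternatively, you can use an implicit wait, or WebDriverWait together with expected_conditions.

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Implicit wait: every element lookup waits up to 10 seconds (continues as soon as the element is found)
driver.implicitly_wait(10)

# Explicit wait: wait up to 10 seconds for the element to be present, then read its text (continues as soon as it is present)
text = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.???))).text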
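
To make the difference from time.sleep() concrete, here is a simplified sketch of the polling idea behind an explicit wait. It is not selenium's implementation, only an illustration: wait_until and its parameters are hypothetical names, and the condition is any function that returns a truthy value once the page is ready.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value as soon as it appears, or raises TimeoutError
    once `timeout` seconds have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # ready: stop waiting immediately
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)   # wait a little before checking again
```

Unlike a fixed sleep, the wait ends as soon as the condition holds, so a page that is ready after one second costs about one second rather than the full ten.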

-Moving back and forward between pages / closing the page (current tab, all tabs)

driver.back()  # go back
driver.forward()  # go forward

driver.close()  # close only the current tab
driver.quit()  # close the whole browser

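6. outro

This series covered the basics of writing web scraping/crawling code in Python.

1) Understanding web page structure: HTML, CSS
2) Using the requests and bs4 libraries
3) Using the selenium library

The process can be summed up in these three steps.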
