Web Scraping with Selenium

Kepler · March 9, 2020

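I am currently working on a project to clone Tidal, the music streaming site run by Jay-Z.

To collect data, I was given the mission of crawling albums, tracks, artists, images, and other information from the Tidal website, which is built with React, so I ended up using Selenium for the first time.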

1. Installing Selenium

Installation is a single pip command.

pip install selenium

2. Installing the WebDriver

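Install the WebDriver for the browser you use. I use Chrome, so I installed chromedriver. Before installing, check your browser version: click the three-dot icon in the top-right corner of Chrome, then Help > About Google Chrome, and the version is shown on the page that opens.

The driver can be downloaded from the site below. The version does not have to match exactly; it is OK as long as the leading numbers mostly match. (As of March 2020, I installed 80.0.3987.106.)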
https://sites.google.com/a/chromium.org/chromedriver/downloads
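The "leading numbers match" check can be written down as a small helper. This is only a sketch: the function names are made up for illustration, and comparing the first three components (major.minor.build) is my assumption about how strict the match needs to be. Pass `parts=1` to compare only the major version.

```python
def leading_parts(version, parts=3):
    """Return the first `parts` numeric components of a dotted version string."""
    return tuple(int(p) for p in version.split(".")[:parts])


def versions_compatible(chrome_version, driver_version, parts=3):
    """True when the leading components of both version strings are identical."""
    return leading_parts(chrome_version, parts) == leading_parts(driver_version, parts)
```

For example, `versions_compatible("80.0.3987.132", "80.0.3987.106")` compares `(80, 0, 3987)` on both sides. The Chrome version in that call is invented for the example; only 80.0.3987.106 comes from this post.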

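3. Setting the driver PATH

The directory that contains the driver has to be on the PATH of the shell you use. I added the following line at the bottom of my .zshrc file (this assumes chromedriver was placed in ~/bin).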

export PATH=${PATH}:~/bin
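To confirm that the PATH change took effect, something like the following can be run from Python. It is a minimal sketch using only the standard library; `find_driver` is a hypothetical helper name, not part of Selenium.

```python
import shutil


def find_driver(name="chromedriver", path=None):
    """Return the full path of `name` if it is found on PATH, else None.

    `path` overrides the PATH environment variable, which is handy for testing.
    """
    return shutil.which(name, path=path)


# None here means the shell that launched Python does not see the driver yet.
print(find_driver())
```

If this prints None right after editing .zshrc, open a new terminal or run `source ~/.zshrc` so the updated PATH is loaded.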

4. Importing the modules

import time
from selenium import webdriver

driver = webdriver.Chrome('chromedriver')

url = "https://listen.tidal.com/album/133516096"

driver.get(url)  # Running the .py file up to this point opens a browser window, which confirms that Selenium is working.

driver.implicitly_wait(10)  # when locating elements, wait up to 10 seconds for them to appear

a_img    = driver.find_elements_by_css_selector('div.imageContainer--2wNM9.albumImage--4rehY > img')[0].get_attribute('src')
a_title  = driver.find_elements_by_css_selector('div > header > div.meta--1bV17 > h1')[0].text
a_artist = driver.find_elements_by_css_selector('div > span > span > a')[0].text

driver.quit()     # close the browser and end the session
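One thing to watch in the script above: find_elements_by_css_selector returns a list, so `[0]` raises IndexError when nothing matches, and the script then stops before reaching driver.quit(), leaving the browser open. A possible guard is sketched below. Both `quitting` and `first_or_none` are hypothetical helper names introduced here, not Selenium APIs.

```python
from contextlib import contextmanager


@contextmanager
def quitting(driver):
    """Yield the driver and always call driver.quit() afterwards, even on errors."""
    try:
        yield driver
    finally:
        driver.quit()


def first_or_none(elements):
    """Return the first item of a find_elements-style list, or None if it is empty."""
    return elements[0] if elements else None


# Usage sketch (needs a real browser, so it is left as a comment):
# with quitting(webdriver.Chrome('chromedriver')) as driver:
#     driver.get(url)
#     title = first_or_none(driver.find_elements_by_css_selector('h1'))
#     print(title.text if title is not None else 'title not found')
```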

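5. Selenium tips

Unlike a static site, a dynamic site does not receive its finished HTML from the server; the page is rendered with JavaScript after the request is sent. Scraping therefore has to start only after allowing for that rendering time. Selenium provides several wait options for this purpose.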

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://listen.tidal.com/album/133516096"

wait = WebDriverWait(driver, 10)   # set the maximum wait time (seconds)

wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'h1')))     # wait, up to the time set above, until the element matching the CSS selector h1 is visible on the page
artist = [driver.find_elements_by_css_selector('h1')[0].text]   # get the text of the first element matching the CSS selector passed as the argument


wait.until(EC.url_to_be(url))    # wait until the browser's current URL is exactly url
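It may help to picture what an explicit wait does: it calls a condition repeatedly until the condition returns something truthy or the time limit runs out. The sketch below is a simplified standard-library model of that idea, not Selenium's actual implementation. `wait_until` is a hypothetical name, and it raises the built-in TimeoutError, whereas Selenium raises its own TimeoutException.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

The practical difference from time.sleep(10) is that the wait ends as soon as the condition holds, so a page that renders in one second costs about one second rather than ten.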

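If you put the same code in the except block as in the try block, data that was missed in the try block sometimes gets scraped on the second attempt. In effect this is a retry that gives the page a little more time to render. The readability of such code is questionable, but it does get the data you want, so it is worth a try.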
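The readability concern can be eased by moving the "run it again in except" pattern into one helper. The following is a sketch of that idea under my own assumptions; `retry` and its parameters are hypothetical names, not something Selenium provides.

```python
import time


def retry(func, attempts=2, delay=1.0, exceptions=(Exception,)):
    """Call func() up to `attempts` times, sleeping `delay` seconds between tries.

    Returns func()'s result; re-raises the last exception if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            time.sleep(delay)


# Usage sketch with the driver from this post (left as a comment, needs a browser):
# title = retry(lambda: driver.find_elements_by_css_selector('h1')[0].text,
#               exceptions=(IndexError,))
```

Limiting `exceptions` to the errors you expect (here IndexError from an empty result list) keeps unrelated bugs from being retried silently.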
When using the expected_conditions module, you may run into the following error.

visibility_of_element_located: __init__() takes exactly 2 arguments (3 given)

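This happens because visibility_of_element_located is not a plain function but a class whose initializer takes exactly one argument besides self, the locator.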

class visibility_of_element_located(object):
	# ...
	def __init__(self, locator):
		# ...

The locator therefore has to be passed as a single tuple, so check that the number of parentheses is right.
(https://stackoverflow.com/questions/23661734/selenium-visibility-of-element-located-init-takes-exactly-2-arguments)
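The difference between the failing and the working call can be reproduced without a browser. `one_locator_condition` below is a hypothetical stand-in that only mirrors the `__init__(self, locator)` signature shown above; it is not the Selenium class. The string "css selector" is the value behind By.CSS_SELECTOR.

```python
class one_locator_condition:
    """Hypothetical stand-in whose initializer takes exactly one argument."""

    def __init__(self, locator):
        self.locator = locator


by, selector = "css selector", "h1"

# Wrong: two separate arguments, so Python raises TypeError.
try:
    one_locator_condition(by, selector)
except TypeError as exc:
    error = exc

# Right: a single argument that happens to be a (by, value) tuple.
condition = one_locator_condition((by, selector))
```

The exact wording of the TypeError message differs between Python versions, but the cause is the same: a missing pair of parentheses around the locator.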

Selenium wait syntax: http://allselenium.info/wait-for-elements-python-selenium-webdriver/
Summary of how to use waits: https://thesoul214.github.io/python/2019/06/01/Python-Selenium-2.html
expected conditions documentation: https://selenium-python.readthedocs.io/waits.html
