Selenium crawling: infinite scroll and clicks

jomminii_before · February 26, 2020

Tags: crawling, crawler, django, python, selenium

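I'm currently handling the back end of a StyleShare web clone project at Wecode. Along the way I needed to crawl StyleShare's brand list page, and unlike my earlier post, "Django http & crawling basics: scraping a Naver blog list", this one had to be done with selenium.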

The page I crawled is the StyleShare brand list shown below.
[Image: screenshot of the brand list]

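To crawl this page, you have to click each index button (가 through #123) to see the brands under that index. On top of that, the brand list on each index page uses infinite scroll: new entries keep loading as you scroll down.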

One more thing to watch for: clicking an index button does not navigate to a new URL. Only the view changes within the current URL, so the usual approach of requesting one URL per page was not an option.

This post assumes you already know the basics of selenium and covers only the infinite-scroll and click parts.

Clicking the index buttons

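# Click the index buttons
from selenium import webdriver
from selenium.webdriver.common.by import By

# `driver` is the webdriver.Chrome instance created in the full code
for page in range(1, 17):                    # [1]
    if page == 1:                            # [2]
        pass
    else:                                    # [3]
        button = driver.find_element(By.XPATH, f'//*[@id="app"]/div/div[2]/div[1]/div/button[{page}]')
        driver.execute_script("arguments[0].click();", button)

[ 1 ] : Loop over index buttons 1 through 16, one at a time.
[ 2 ] : The brand list page opens on the [가] index by default, and clicking that button again stopped scrolling from working, so button 1 (가) is skipped.
[ 3 ] : Use an f-string to build the XPath for each index button, store the element that is found in the button variable, and click it through JavaScript.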
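
The loop above can be factored into a small generator. This is a minimal sketch, not the code used in this post: `index_button_xpath` and `visit_index_pages` are hypothetical names, the XPath template is copied from the snippet above, and the locator strategy is passed as the plain string "xpath" (the value of selenium's `By.XPATH` constant), so the block needs no import of its own.

```python
BUTTON_XPATH = '//*[@id="app"]/div/div[2]/div[1]/div/button[{n}]'


def index_button_xpath(n):
    """Return the XPath of the n-th index button (1-based, as in XPath)."""
    return BUTTON_XPATH.format(n=n)


def visit_index_pages(driver, count=16):
    """Yield each index number once its view is showing.

    Index 1 is assumed to be selected when the page loads, so it is
    yielded without a click; every other button is clicked via JavaScript.
    """
    for n in range(1, count + 1):
        if n != 1:
            # "xpath" is the string value of selenium's By.XPATH
            button = driver.find_element("xpath", index_button_xpath(n))
            driver.execute_script("arguments[0].click();", button)
        yield n
```

A caller would then write `for n in visit_index_pages(driver):` and do the scrolling and scraping for that index inside the loop body.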

Infinite scroll

# Infinite scroll (in the full code this runs inside the for loop, once per index page)
import time

SCROLL_PAUSE_TIME = 2

# [1] Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    # [2] Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load page
    time.sleep(SCROLL_PAUSE_TIME)                                                # [3]
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight-50);")  # [4]
    time.sleep(SCROLL_PAUSE_TIME)

    # [5] Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")

    if new_height == last_height:                                                # [6]
        break

    last_height = new_height

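[ 1 ] : Store the page's current scroll height (document.body.scrollHeight) as the baseline.
[ 2 ] : Scroll down to that height, i.e. the bottom of the page.
[ 3 ] : Pause for 2 seconds after scrolling so the page can load. Note that time.sleep always blocks for the full 2 seconds; it does not return early when loading finishes.
[ 4 ] : Scrolling straight to the very bottom did not bring up the next part of the list, so after hitting the bottom I scroll back up slightly so the list can load.
[ 5 ] : Store the scroll height after scrolling as the new height.
[ 6 ] : If the new height is the same as the previous height, stop scrolling.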
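
The scrolling steps above can be wrapped in a function. The following is a minimal sketch under stated assumptions rather than the code used in this post: `scroll_until_stable` and its parameters (`timeout`, `poll`, `nudge`, `max_rounds`) are hypothetical names, and it only needs an object with selenium's `execute_script` method. It swaps the fixed `time.sleep` for a polling wait that moves on as soon as the height grows, and the `max_rounds` cap is an added safety net so a page that never settles cannot loop forever.

```python
import time


def scroll_until_stable(driver, timeout=2.0, poll=0.2, nudge=50, max_rounds=200):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the last scroll height that was observed.
    """
    get_height = "return document.body.scrollHeight"
    last_height = driver.execute_script(get_height)

    for _ in range(max_rounds):
        # Jump to the bottom, then back up a little to trigger lazy loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(poll)
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight - arguments[0]);", nudge
        )

        # Poll instead of sleeping a fixed time: leave as soon as the height grows.
        deadline = time.monotonic() + timeout
        new_height = driver.execute_script(get_height)
        while new_height == last_height and time.monotonic() < deadline:
            time.sleep(poll)
            new_height = driver.execute_script(get_height)

        if new_height == last_height:
            break  # nothing new loaded within the timeout
        last_height = new_height

    return last_height
```

With a real driver this would be called once per index page, for example `scroll_until_stable(driver, timeout=2.0)`. Whether a 2-second timeout is long enough depends on the site and the network, so treat the defaults as starting points.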

Full code

import time
import csv

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

url = 'https://www.styleshare.kr/brands'

# Selenium 4 style; on Selenium 3, pass the path directly: webdriver.Chrome('/Applications/chromedriver')
driver = webdriver.Chrome(service=Service('/Applications/chromedriver'))
driver.get(url)

driver.implicitly_wait(5)

SCROLL_PAUSE_TIME = 2
brand_list = []

for page in range(1, 17):
    # The default view is already index 1 (가), so only click from the second button on
    if page != 1:
        button = driver.find_element(By.XPATH, f'//*[@id="app"]/div/div[2]/div[1]/div/button[{page}]')
        driver.execute_script("arguments[0].click();", button)

    # Get scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")

    while True:
        # Scroll down to bottom
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load page, then nudge the view up slightly so the next batch loads
        time.sleep(SCROLL_PAUSE_TIME)
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight-50);")
        time.sleep(SCROLL_PAUSE_TIME)

        # Calculate new scroll height and compare with last scroll height
        new_height = driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:
            break

        last_height = new_height

    # Collect every brand card that is now loaded for this index
    brand_imgs  = driver.find_elements(By.XPATH, '//*[@id="app"]/div/div[2]/div[2]/div[1]/div/a/img')
    brand_names = driver.find_elements(By.XPATH, '//*[@id="app"]/div/div[2]/div[2]/div[1]/div/div/div/a')
    brand_links = driver.find_elements(By.XPATH, '//*[@id="app"]/div/div[2]/div[2]/div[1]/div/a')

    for name, img, link in zip(brand_names, brand_imgs, brand_links):
        brand_list.append(
            {
                "name" : name.text,
                "img"  : img.get_attribute("src"),
                "link" : link.get_attribute("href"),
            }
        )

# newline='' is what the csv module expects; utf-8 keeps Korean brand names intact
with open('./brand_infos.csv', mode='w', newline='', encoding='utf-8') as brand_infos:
    brand_writer = csv.writer(brand_infos)

    # `brand` instead of `list`, which would shadow the built-in
    for brand in brand_list:
        brand_writer.writerow([brand["name"], brand["img"], brand["link"]])

driver.quit()
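
The export step at the end could also be written with `csv.DictWriter`, which adds a header row and takes the brand dictionaries directly. This is a minimal sketch under stated assumptions: `write_brand_rows` is a hypothetical helper name, and the sample row is placeholder data for illustration, not real crawl output.

```python
import csv
import io

FIELDNAMES = ["name", "img", "link"]


def write_brand_rows(rows, out):
    """Write brand dicts to a text file object as CSV with a header row.

    Returns the number of data rows written.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    count = 0
    for row in rows:
        writer.writerow(row)
        count += 1
    return count


# Placeholder data for illustration only, not real crawl output.
sample = [
    {
        "name": "Example Brand",
        "img": "https://example.com/a.png",
        "link": "https://example.com/a",
    }
]
buffer = io.StringIO()
written = write_brand_rows(sample, buffer)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open('./brand_infos.csv', 'w', newline='', encoding='utf-8')` in place of `buffer`.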

3 comments

Broside

April 13, 2020

Hello, I ran into something I don't understand while studying crawling. Would it be all right if I asked you about it?

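박승식

April 15, 2020

I also recently had to crawl an infinite-scroll page (data loaded asynchronously).
It was app reviews on the Google store.
Possibly for the reason you describe in step [4], it works fine for a while, but then, even though I keep scrolling, the JavaScript seems to stall, no more list items are generated, and the job stops unintentionally.
Do you happen to know what causes [4]?
It may just be a problem I run into because I don't know much about the web.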
