How to Speed Up Scraping (Selenium)

문주은 · January 12, 2024

1. Use explicit waits

Use appropriate wait conditions to minimize unnecessary waiting.
With WebDriverWait and expected_conditions (EC), the script waits only until a specific condition is met.

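# Remove unnecessary fixed pauses
time.sleep()

-->

# WebDriverWait adjusts the wait dynamically: it returns as soon as the condition is met
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".product_title"))
)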
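To illustrate why an explicit wait usually finishes sooner than a fixed sleep, here is a simplified, Selenium-free sketch of the polling idea behind WebDriverWait. The helper `wait_until` and the simulated condition are hypothetical names made up for this example; Selenium's real implementation differs in details (for instance, it raises its own TimeoutException).

```python
import time


def wait_until(condition, timeout=30.0, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass.

    Hypothetical helper that mimics the idea behind WebDriverWait.until();
    it is not part of Selenium.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop waiting as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll_interval)


# Simulated "element" that becomes available 0.2 s after the start.
start = time.monotonic()


def element_is_ready():
    return "product_title" if time.monotonic() - start >= 0.2 else None


found = wait_until(element_is_ready, timeout=5.0, poll_interval=0.05)
elapsed = time.monotonic() - start
print(found, f"(waited about {elapsed:.2f} s instead of a fixed 5 s)")
```

The timeout is only an upper bound: the call returns on the first poll where the condition holds, whereas a fixed sleep always costs the full duration.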

2. Loading optimization

2-1. Image loading optimization

Disable image loading in the web driver.
This saves bandwidth and can improve page load speed.

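# webdriver options
# Note: '--disable-images' is widely copied, but it does not appear to be a documented
# Chromium switch and may have no effect; the pref and the Blink setting below are the
# commonly used ways to block images.
options.add_argument('--disable-images')
options.add_experimental_option("prefs", {'profile.managed_default_content_settings.images': 2})
options.add_argument('--blink-settings=imagesEnabled=false')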

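2-2. Set the page load strategy

Set the page load strategy to 'none' using a DesiredCapabilities object.
With 'none', driver.get() returns without waiting for the page to finish loading, so combine it with explicit waits for the elements you need.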

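# Set page load strategy to "none"

from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
caps = DesiredCapabilities.CHROME.copy()  # copy so the shared class-level dict is not mutated
caps["pageLoadStrategy"] = "none"

# In Selenium 4 the same setting is available directly on the options object:
# options.page_load_strategy = "none"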

3. Optimize DataFrame concatenation

Calling pd.concat repeatedly inside a loop --> append the data to a list inside the loop, then create the DataFrame only once after the loop ends.
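A minimal sketch of the list-then-DataFrame pattern described above. The records and column names are made-up sample data, not taken from the original scraper.

```python
import pandas as pd

# Hypothetical scraped records; in practice each dict would come from one page.
scraped_pages = [
    {"id": 1, "title": "item A", "price": 1000},
    {"id": 2, "title": "item B", "price": 2500},
    {"id": 3, "title": "item C", "price": 800},
]

# Slow pattern (avoid): df = pd.concat([df, pd.DataFrame([record])]) inside the loop
# copies all previously collected rows on every iteration.

rows = []
for record in scraped_pages:
    rows.append(record)  # appending to a list is cheap

df = pd.DataFrame(rows)  # build the DataFrame once, after the loop
print(df.shape)
```

The benefit grows with the number of iterations, since each pd.concat call has to copy everything collected so far.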

4. Use the headless option

When no browser UI is needed, use headless Chrome.
It runs without a graphical user interface, so it is faster and uses fewer resources.

## remote chromedriver settings
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--ignore-ssl-errors=yes')
options.add_argument('--ignore-certificate-errors')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--no-sandbox')
options.add_argument('--log-level=3')
options.add_argument('--disable-gpu')
options.add_argument('--incognito')

# Disable image loading to reduce loading time
options.add_argument('--disable-images')
options.add_experimental_option(
    "prefs", {'profile.managed_default_content_settings.images': 2})
options.add_argument('--blink-settings=imagesEnabled=false')
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.200'
options.add_argument(f'user-agent={user_agent}')

driver = webdriver.Chrome(options=options)

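5. Minimize unnecessary requests

Minimize the number of requests sent to the website.
Use a unique identifier to check whether an item was already scraped, and scrape only when necessary.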
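One way to sketch the "check the unique ID first" idea is to keep a set of already-scraped IDs and persist it between runs. The file name `seen_ids.json`, the helper names, and the sample IDs are all hypothetical.

```python
import json
from pathlib import Path

SEEN_FILE = Path("seen_ids.json")  # hypothetical location of the ID cache


def load_seen_ids(path=SEEN_FILE):
    """Return the set of IDs scraped in earlier runs (empty on the first run)."""
    if path.exists():
        return set(json.loads(path.read_text(encoding="utf-8")))
    return set()


def save_seen_ids(seen_ids, path=SEEN_FILE):
    """Persist the IDs so the next run can skip them."""
    path.write_text(json.dumps(sorted(seen_ids)), encoding="utf-8")


def scrape_new_items(candidate_ids, scrape_one, seen_ids):
    """Call `scrape_one` only for IDs that have not been scraped before."""
    results = {}
    for item_id in candidate_ids:
        if item_id in seen_ids:
            continue  # already scraped: skip the request entirely
        results[item_id] = scrape_one(item_id)
        seen_ids.add(item_id)
    return results


# Usage with a stand-in for the real scraping function.
seen = {"A100", "A101"}
new_data = scrape_new_items(
    ["A100", "A101", "A102"],
    scrape_one=lambda item_id: f"details for {item_id}",
    seen_ids=seen,
)
print(new_data)
```

In a real run, `load_seen_ids()` would be called at startup and `save_seen_ids(seen)` at the end, so only the listing page is requested for items that were scraped before.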

6. Code profiling

Use a profiling tool (cProfile, profile) to identify performance bottlenecks.
The results show which parts of the code are worth optimizing.
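A small sketch of profiling a scraper with cProfile. The functions `fetch_page`, `parse_page`, and `scrape` are stand-ins invented for this example; a real scraper would be profiled the same way.

```python
import cProfile
import io
import pstats
import time


def fetch_page():
    time.sleep(0.05)  # stand-in for a slow network request


def parse_page():
    return sum(i * i for i in range(10_000))  # stand-in for parsing work


def scrape():
    for _ in range(3):
        fetch_page()
        parse_page()


profiler = cProfile.Profile()
profiler.enable()
scrape()
profiler.disable()

# Sort by cumulative time so the slowest call chains appear first.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)  # show only the top 5 entries
report = stream.getvalue()
print(report)
```

For a whole script, running `python -m cProfile -s cumulative your_script.py` gives a similar report without code changes (the script name here is a placeholder).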

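7. Parallelize scraping

If the website permits it, run scrapers in parallel to collect data from different parts of the site at the same time.
To parallelize the work you can use multi-threading, for example with ThreadPoolExecutor from Python's concurrent.futures module.

7-1. Multi-threading example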

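from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


def scrape_currency(url):
    # scrapping_setting is the author's own helper (not shown here); it is assumed to
    # create a WebDriver with the given window size and open the URL.
    driver_currency = scrapping_setting(url, (800, 600))

    try:
        # Wait until at least one element is present in the DOM
        wait = WebDriverWait(driver_currency, 10)
        wait.until(EC.presence_of_element_located((By.XPATH, '//*')))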

        content = driver_currency.page_source
        soup = BeautifulSoup(content, 'lxml')
        target_element = soup.select_one('div.YMlKec.fxKbKc')

        if target_element:
            return target_element.get_text()
        else:
            return 'Element not found'
    finally:
        # Close the WebDriver instance individually
        if driver_currency:
            driver_currency.quit()

urls = [
    "https://www.google.com/finance/quote/CNY-USD",
    "https://www.google.com/finance/quote/CNY-VND",
    "https://www.google.com/finance/quote/CNY-IDR",
    "https://www.google.com/finance/quote/CNY-KRW",
    "https://www.google.com/finance/quote/CNY-EUR",
]

data = [datetime.now(), 1]
# Each worker thread opens and closes its own WebDriver inside scrape_currency;
# executor.map returns the results in the same order as urls.
with ThreadPoolExecutor() as executor:
    results = list(executor.map(scrape_currency, urls))
data.extend(results)
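Because each thread in the example above starts its own browser, it can help to cap the number of workers explicitly. The sketch below uses a fake scraper (a sleep instead of Selenium) and placeholder URLs to show the two properties that matter here: `max_workers` bounds concurrency, and `executor.map` preserves input order.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_scrape(url):
    """Stand-in for a real scraper; sleeps to imitate page load time."""
    time.sleep(0.2)
    return f"result for {url}"


urls = [f"https://example.com/page/{n}" for n in range(1, 6)]  # placeholder URLs

start = time.perf_counter()
# Cap the number of workers: with Selenium, each worker would start its own browser.
with ThreadPoolExecutor(max_workers=5) as executor:
    results = list(executor.map(fake_scrape, urls))
elapsed = time.perf_counter() - start

# executor.map returns results in the same order as the input URLs.
for url, result in zip(urls, results):
    print(url, "->", result)
print(f"5 pages took about {elapsed:.2f} s in parallel")
```

With real browsers the best worker count depends on available memory and CPU, so a value lower than the number of URLs is often the safer starting point.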

8. Use the requests library

Use Selenium for navigating dynamic pages.
If the scraping target is a static page, the requests library is more efficient.

# Example
import requests
from bs4 import BeautifulSoup

# Reuse the cookies from the Selenium session in requests
driver_cookies = driver.get_cookies()
cookies = {cookie['name']: cookie['value'] for cookie in driver_cookies}

# base_url and final_page are assumed to be defined earlier;
# page numbers are assumed to start at 1
responses = []
for num in range(1, final_page + 1):
    url = base_url + str(num)
    response = requests.get(url, cookies=cookies)
    responses.append(response)

# Parsing with BeautifulSoup
data_list = []
for res in responses:
    soup = BeautifulSoup(res.text, 'html.parser')

    # Scraping specific values
    elements = soup.find_all(class_='Resource')
    ids = [element.get('id') for element in elements]
    data_list.extend(ids)

Bonus) Managing newly scraped data

This section covers ways to manage existing data and newly appearing data effectively when scraping.

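Full re-scraping:
- Periodically scrape the entire dataset from scratch. This is the simplest approach: every page is scraped again to refresh all data. It is useful for keeping the data consistent and removing stale records.
Scraping only changes:
- Compare against the existing data and scrape only what has changed. This requires versioning the data and building an algorithm or tool to detect changes. It can reduce the scraping workload.
Using a database:
- Store the scraped data in a database and manage changes there. When new data arrives, compare it with the existing records and update or insert accordingly.
Webhooks or notifications:
- Set up notifications for when changes are detected. A detected change can then trigger a scraping job to fetch the new data.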

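Which approach to choose depends on project requirements, available resources, data volume, and so on. If changes must be monitored in real time, the fourth approach (webhooks or notifications) can be useful. If updating the data at regular intervals is enough, the first approach (full re-scraping) will do.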
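As a rough illustration of how the change-detection and database approaches might be combined, the sketch below stores a content hash next to each record in SQLite and writes only when the hash differs. The table layout, helper names, and sample values are assumptions made for this example.

```python
import hashlib
import sqlite3


def content_hash(text):
    """Fingerprint of the scraped content, used to detect changes."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert_item(conn, item_id, content):
    """Insert a new item or update a changed one; return what happened."""
    new_hash = content_hash(content)
    row = conn.execute(
        "SELECT hash FROM items WHERE id = ?", (item_id,)
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO items (id, content, hash) VALUES (?, ?, ?)",
            (item_id, content, new_hash),
        )
        return "inserted"
    if row[0] != new_hash:
        conn.execute(
            "UPDATE items SET content = ?, hash = ? WHERE id = ?",
            (content, new_hash, item_id),
        )
        return "updated"
    return "unchanged"


conn = sqlite3.connect(":memory:")  # in-memory database for the example
conn.execute("CREATE TABLE items (id TEXT PRIMARY KEY, content TEXT, hash TEXT)")

first_run = [
    upsert_item(conn, "A100", "price: 1000"),
    upsert_item(conn, "A101", "price: 2500"),
]
second_run = [
    upsert_item(conn, "A100", "price: 1200"),  # content changed
    upsert_item(conn, "A101", "price: 2500"),  # same as before
]
conn.commit()
print(first_run, second_run)
```

Comparing hashes keeps the check cheap even for long pages, and the returned status makes it easy to log how many records were new, changed, or untouched in each run.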