Final Project Day 12

hyun-jin · June 18, 2025

TIL Sparta Final


Data Analysis Day 81

Crawling

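The data pulled from the API is product data, and the more EDA I did, the more I felt the need for sales data.
Decided to try crawling each product page, using the URL column, to get the stock on hand and the number of units sold.
The crawling code runs fine~! Used it to pull 10,000 items.
It takes forever...
Still, once I have this data, I think I can draw better insights during EDA.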

1. Crawling code - change the sleep time to 2.5!

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
import re
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def extract_availability_and_sold(url):
    options = Options()
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    # options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    available, sold = None, None
    try:
        driver.get(url)
        time.sleep(1.5)  # adjust to 1-2 s depending on network/machine speed

        try:
            availability_box = driver.find_element(By.ID, "qtyAvailability")
            spans = availability_box.find_elements(By.TAG_NAME, "span")
            for span in spans:
                text = span.text.strip().lower()
                if "available" in text:
                    match = re.search(r"(\d+)", text)
                    if match:
                        available = int(match.group(1))
                if "sold" in text:
                    match = re.search(r"(\d+)", text)
                    if match:
                        sold = int(match.group(1))
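        except Exception:
            # Some listings have no availability box; leave both values as None.
            pass
    except Exception as e:
        print(f"❌ Error for {url}: {e}")
    finally:
        driver.quit()  # always close the browser, even if loading the page failed
    return available, sold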

# =========================
# Main script
# =========================

# df = pd.read_csv("your_ebay_file.csv")
sample_urls = df['itemWebUrl'].dropna().iloc[1000:2000]
sample_idx = sample_urls.index

results = [None] * len(sample_urls)
max_workers = 10  # how many browsers to run at once (4-8 is about right, depending on the PC)
print(f"🌐 Starting parallel crawl of {len(sample_urls)} URLs! ({max_workers} at a time)")

with ThreadPoolExecutor(max_workers=max_workers) as executor:
    future_to_idx = {executor.submit(extract_availability_and_sold, url): i for i, url in enumerate(sample_urls)}
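    # Futures finish out of order, so count completions separately from the URL position.
    for done, future in enumerate(as_completed(future_to_idx), start=1):
        i = future_to_idx[future]
        try:
            results[i] = future.result()
        except Exception as e:
            print(f"Error at index {i}: {e}")
            results[i] = (None, None)
        print(f"({done}/{len(sample_urls)}) done")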

available_list, sold_list = zip(*results)
df.loc[sample_idx, 'available_quantity'] = available_list
df.loc[sample_idx, 'sold_quantity'] = sold_list

df.to_csv("ebay_with_avail_sold2.csv", index=False)  # change this
print("✅ Parallel crawl of 2000 items complete!")  # change this
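One thing to watch in the parsing step: `(\d+)` stops at the first non-digit, so a text like "1,234 sold" would be read as 1. I have not checked whether the listing pages actually show counts with thousands separators, so this is only a minimal sketch of a more tolerant parser. `parse_quantity` is a hypothetical helper name, and the sample strings are made up rather than copied from a real page.

```python
import re

# A digit followed by any mix of digits and commas, e.g. "7" or "1,234".
_NUMBER = re.compile(r"\d[\d,]*")

def parse_quantity(text, keyword):
    """Return the first integer in `text` if it mentions `keyword`, else None."""
    lowered = text.strip().lower()
    if keyword not in lowered:
        return None
    match = _NUMBER.search(lowered)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return int(match.group(0).replace(",", ""))

# Made-up span texts for illustration only.
print(parse_quantity("3 available", "available"))  # 3
print(parse_quantity("1,234 sold", "sold"))        # 1234
print(parse_quantity("Last one", "available"))     # None
```

Keeping the parsing in a plain function like this also means it can be checked without opening a browser.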

2. Data join

result = df1.combine_first(df)
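A quick reminder of what `combine_first` does, since the argument order matters: the calling frame's values win, and the other frame only fills cells where the caller is missing. The sketch below assumes both frames share the same index and columns. `batch_a` and `batch_b` are hypothetical miniature frames, not the real crawl output.

```python
import pandas as pd

# Hypothetical miniature batches; the real frames come from the crawl output.
batch_a = pd.DataFrame(
    {"itemId": ["a", "b", "c", "d"], "sold_quantity": [12.0, 3.0, None, None]}
)
batch_b = pd.DataFrame(
    {"itemId": ["a", "b", "c", "d"], "sold_quantity": [None, 99.0, 7.0, 0.0]}
)

# Values from batch_a win; batch_b only fills the cells where batch_a is missing.
merged = batch_a.combine_first(batch_b)
print(merged["sold_quantity"].tolist())  # [12.0, 3.0, 7.0, 0.0]
```

Row "b" keeps 3.0 from `batch_a` even though `batch_b` has 99.0, so whichever frame should take priority has to be the one calling the method.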

Keep in mind

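Right now, 'data interpretation / visualization / insight' matters more than machine learning models.
Always think about the nature and structure of the data, and the logic by which listings get registered and removed.
When a visualization shows something odd or a meaningful pattern, ask 'why?' and question it.
Within the EDA & dashboard schedule, try as many different visualizations, interpretations, and hypotheses as possible.
In-depth observation plus the analyst's own 'perspective' is what produces the real result!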
