๐Ÿงฉ Cafe24 ์ฝ˜ํ…์ธ  ๋ฐ์ดํ„ฐ ์ž๋™ํ™” ํŒŒ์ดํ”„๋ผ์ธ ๊ตฌ์ถ•๊ธฐ (ft. BigQuery)

yeahcoldยท2025๋…„ 3์›” 24์ผ
0

Data Engineering

๋ชฉ๋ก ๋ณด๊ธฐ
9/20

๐Ÿ’ฌ Intro

Cafe24์˜ ์œ ํŠœ๋ธŒ ์ฝ˜ํ…์ธ  ํ†ต๊ณ„ ๋ฐ์ดํ„ฐ๋ฅผ ๋งค์ผ ์ˆ˜์ง‘ํ•ด BigQuery์— ์ ์žฌํ•˜๋Š” ์ž๋™ํ™” ํŒŒ์ดํ”„๋ผ์ธ์„ ๋งŒ๋“ค๋ฉด์„œ ๋‹ค์–‘ํ•œ ์‹œํ–‰์ฐฉ์˜ค๋ฅผ ๊ฒช์—ˆ๋‹ค. ํŠนํžˆ Cloud Functions, Selenium ํฌ๋กค๋ง, CSV ์ฒ˜๋ฆฌ, BigQuery ์Šคํ‚ค๋งˆ ์„ค๊ณ„์™€ ๊ด€๋ จํ•ด ์ž‘์€ ๋ฌธ์ œ๋“ค์ด ๋งŽ์•˜๊ณ , ์ด๋ฅผ ํ•˜๋‚˜์”ฉ ํ•ด๊ฒฐํ•ด๊ฐ€๋ฉฐ ์ž๋™ํ™”์— ์„ฑ๊ณตํ•œ ๊ธฐ๋ก์„ ๊ณต์œ ํ•ด๋ณด๋ ค ํ•œ๋‹ค.


๐Ÿ“Œ ์ „์ฒด ๊ตฌ์กฐ

  • Cloud Function (Python)
    • export-cafe24-contents: ํฌ๋กค๋ง ๋ฐ ์—…๋กœ๋“œ ํ•จ์ˆ˜
  • Selenium + Requests
    • Cafe24 ๊ด€๋ฆฌ์ž ํŽ˜์ด์ง€์—์„œ ์œ ํŠœ๋ธŒ ์ฝ˜ํ…์ธ  ํ†ต๊ณ„ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค์šด๋กœ๋“œ
  • BigQuery
    • ์ผ์ž๋ณ„ ๋ฐ์ดํ„ฐ๋ฅผ ์ ์žฌ (์ค‘๋ณต ์ œ๊ฑฐ ํ›„ append ๋ฐฉ์‹)
  • Cloud Scheduler
    • ๋งค์ผ ์˜ค์ „ ์‹คํ–‰

โš ๏ธ ํŠธ๋Ÿฌ๋ธ”์ŠˆํŒ… ๋ชจ์Œ

1. Excel ๋‹ค์šด๋กœ๋“œ ํŒŒ์ผ์ด ์—ด๋ฆฌ์ง€ ์•Š์Œ

ํ˜„์ƒ

  • Cafe24 ๊ด€๋ฆฌ์ž์—์„œ ๋‹ค์šด๋กœ๋“œํ•œ .xlsx ํŒŒ์ผ์ด Excel์—์„œ ์—ด๋ฆฌ์ง€ ์•Š๊ณ  ๊ฒฝ๊ณ  ๋ฐœ์ƒ

์›์ธ

  • ์‹ค์ œ ํŒŒ์ผ์€ .csv ํฌ๋งท์ด์ง€๋งŒ ํ™•์žฅ์ž๋งŒ .xlsx

ํ•ด๊ฒฐ

  • ํŒŒ์ผ์„ .csv๋กœ ์ €์žฅํ•˜์—ฌ ์ฒ˜๋ฆฌ
  • response.content๋ฅผ wb๋กœ ์ €์žฅ ํ›„ csv.reader๋กœ ์žฌํŒŒ์‹ฑํ•˜์—ฌ ํ•ด๊ฒฐ

2. Selenium์œผ๋กœ ๋กœ๊ทธ์ธ ํ›„ session ์ „ํ™˜ ์‹คํŒจ

ํ˜„์ƒ

  • Selenium์œผ๋กœ ๋กœ๊ทธ์ธ์€ ๋˜์ง€๋งŒ, ์ดํ›„ API ์š”์ฒญ ์‹œ ์„ธ์…˜ ์ธ์ฆ ์‹คํŒจ

ํ•ด๊ฒฐ

cookies = driver.get_cookies()
session = requests.Session()
for cookie in cookies:
    session.cookies.set(cookie["name"], cookie["value"])

โ†’ Selenium์—์„œ ๋ฐ›์€ ์ฟ ํ‚ค๋ฅผ Requests ์„ธ์…˜์— ์˜ฎ๊ธฐ๋Š” ๋ฐฉ์‹์œผ๋กœ ํ•ด๊ฒฐ


3. ๋ฐ์ดํ„ฐ ๊ธฐ๊ฐ„ ์„ค์ • ๋ฌธ์ œ

์š”๊ตฌ์‚ฌํ•ญ

  • ๋งค์ผ ์ „์ผ ๊ธฐ์ค€ ๋ฐ์ดํ„ฐ๋งŒ ์ ์žฌํ•˜๊ณ  ์‹ถ์Œ

๊ธฐ์กด ์ฝ”๋“œ

start_date = "2025-02-22"
end_date = "2025-03-23"

๋ณ€๊ฒฝ ์ฝ”๋“œ

target_date = (datetime.today() - timedelta(days=1)).strftime("%Y-%m-%d")

โ†’ ์ „์ผ ๊ธฐ์ค€์œผ๋กœ start/end ๋‚ ์งœ๋ฅผ ๋™์ผํ•˜๊ฒŒ ์„ค์ •ํ•˜์—ฌ ํ•ด๊ฒฐ


4. BigQuery ํ…Œ์ด๋ธ”์ด ์—†์œผ๋ฉด ์—๋Ÿฌ ๋ฐœ์ƒ

ํ˜„์ƒ

  • NotFound: Table ... was not found ์—๋Ÿฌ

ํ•ด๊ฒฐ

  • get_table()๋กœ ํ™•์ธํ•˜๊ณ  ์—†์œผ๋ฉด dataset โ†’ table ์ˆœ์„œ๋กœ ์ž๋™ ์ƒ์„ฑ ๋กœ์ง ์ž‘์„ฑ
try:
    client.get_table(table_ref)
except Exception:
    # dataset ์ƒ์„ฑ โ†’ table ์ƒ์„ฑ

5. ๋ฐฑํ•„(Backfill) ์ž๋™ํ™”

์š”๊ตฌ์‚ฌํ•ญ

  • 2025-01-01๋ถ€ํ„ฐ ํ•˜๋ฃจ์”ฉ ๋ชจ๋“  ๋ฐ์ดํ„ฐ๋ฅผ ์ˆ˜์ง‘ํ•ด์„œ ์ ์žฌํ•˜๊ณ  ์‹ถ์Œ

ํ•ด๊ฒฐ

def backfill(start_date_str, end_date_str):
    ...

โ†’ ๋‚ ์งœ๋ฅผ ํŒŒ๋ผ๋ฏธํ„ฐ๋กœ ๋„˜๊ธธ ์ˆ˜ ์žˆ๊ฒŒ download_csv(target_date=...), load_to_bigquery(target_date=...) ํ•จ์ˆ˜ ๋ชจ๋‘ ๋ฆฌํŒฉํ† ๋ง


6. ํ•œ๊ธ€ ์ปฌ๋Ÿผ๋ช… โ†’ ์˜๋ฌธ ์ปฌ๋Ÿผ๋ช…

๋ฌธ์ œ์ 

  • BigQuery SQL ์ฟผ๋ฆฌ์—์„œ ํ•œ๊ธ€ ์ปฌ๋Ÿผ ์‚ฌ์šฉ ์‹œ ๋ถˆํŽธํ•จ

ํ•ด๊ฒฐ

header = ["log_date", "content_name", "channel_name", "sales", "order_count", "visitor_count", "conversion_rate"]

โ†’ CSV ์ €์žฅ ์‹œ ์ปฌ๋Ÿผ๋ช…์„ ์˜๋ฌธ์œผ๋กœ ๋ณ€๊ฒฝํ•˜์—ฌ ํ†ต์ผ


7. ์ค‘๋ณต ๋ฐ์ดํ„ฐ ์ œ๊ฑฐ

๋ฌธ์ œ์ 

  • ๋‚ ์งœ๋ณ„๋กœ ๊ฐ™์€ ๋ฐ์ดํ„ฐ๊ฐ€ ์Œ“์ด๋ฉด ์ค‘๋ณต๋จ

ํ•ด๊ฒฐ

DELETE FROM `{table_ref}` WHERE log_date = '{target_date}'

โ†’ ์ ์žฌ ์ „์— ํ•ด๋‹น ๋‚ ์งœ์˜ ๊ธฐ์กด ๋ฐ์ดํ„ฐ๋ฅผ ์‚ญ์ œ ํ›„ ์ƒˆ๋กœ insert


โœ… ์ตœ์ข… ๊ฒฐ๊ณผ

  • ๋งค์ผ ์ฝ˜ํ…์ธ ๋ณ„ ํ†ต๊ณ„ ๋ฐ์ดํ„ฐ๋ฅผ ํฌ๋กค๋ง โ†’ ์ „์ฒ˜๋ฆฌ โ†’ BigQuery ์ ์žฌ๊นŒ์ง€ ์ž๋™ํ™” ์„ฑ๊ณต
  • Cloud Scheduler + Cloud Function์„ ํ†ตํ•ด ๋ฌด์ค‘๋‹จ ์šด์˜ ๊ฐ€๋Šฅ
  • ์›ํ•˜๋Š” ๊ธฐ๊ฐ„๊นŒ์ง€ ๋ฐฑํ•„ ์™„๋ฃŒ
profile
Software Engineer

0๊ฐœ์˜ ๋Œ“๊ธ€