2022-09-07

jm · September 14, 2022

TIL


📌 YouTube comment crawling and word cloud visualization

(이것도 μ…€λ ˆλ‹ˆμ›€μ„ μ‚¬μš©ν•΄μ„œ 크둀링을 ν•  것이기 λ•Œλ¬Έμ— 크둀링 전에 κΌ­ μ„€μΉ˜ ν•΄μ€˜μ•Ό 됨!!)

✅ YouTube comment crawling

# 라이브러리 μž„ν¬νŠΈ
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

import time
import pandas as pd

import warnings
warnings.filterwarnings('ignore')
options = webdriver.ChromeOptions()
options.add_argument('--headless')        
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = webdriver.Chrome(options=options) # prepare a headless Chrome browser (newer Selenium 4 releases no longer accept the driver path as a positional argument)

driver.get('https://www.youtube.com/watch?v=ycEtLNlX_ss') # open the video page
driver.implicitly_wait(3)

time.sleep(1.5)

driver.execute_script("window.scrollTo(0,800)") # scroll down to y = 800 px
time.sleep(3)

# Scroll down so the comments get loaded
last_height = driver.execute_script("return document.documentElement.scrollHeight")  # record the scroll height right after the first load
# Keep scrolling until the page height stops growing

while True:
  driver.execute_script("window.scrollTo(0, document.documentElement.scrollHeight);")
  time.sleep(2)

  new_height = driver.execute_script("return document.documentElement.scrollHeight")

  if new_height == last_height:
    break

  last_height = new_height
  time.sleep(2)

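  try:
    # Close the "YouTube Premium 1 month free" popup if it appears.
    # 'css selector' is the string value of By.CSS_SELECTOR, so no extra import is needed.
    driver.find_element('css selector', '#dismiss-button > a').click()
    time.sleep(1.5)

  except Exception:
    # No popup on this pass; keep scrolling.
    pass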

🔼 These are the steps that drive the browser through code so the comments can be crawled.
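
The scroll loop above can also be wrapped in a small helper with an upper bound on the number of rounds, so it cannot spin forever if the page keeps growing. This is only a sketch: `scroll_to_bottom`, `pause`, and `max_rounds` are hypothetical names that are not in the original code, and `driver` is assumed to expose `execute_script` the way a Selenium WebDriver does. The popup handling is left out to keep it short.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=200):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls that actually loaded new content.
    """
    get_height = "return document.documentElement.scrollHeight"
    last_height = driver.execute_script(get_height)
    loaded = 0
    for _ in range(max_rounds):
        driver.execute_script(
            "window.scrollTo(0, document.documentElement.scrollHeight);"
        )
        time.sleep(pause)  # give the page time to load more comments
        new_height = driver.execute_script(get_height)
        if new_height == last_height:
            break  # height unchanged: assume there is nothing left to load
        last_height = new_height
        loaded += 1
    return loaded
```

With the `driver` from above, this could be called as `scroll_to_bottom(driver)` in place of the `while True` loop.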


# Crawl the comments
html_source = driver.page_source
soup = BeautifulSoup(html_source, 'html.parser')

id_list = soup.select('div#header-author > h3 > #author-text > span')
comment_list = soup.select('yt-formatted-string#content-text')

id_final = []
comment_final = []

for i in range(len(comment_list)):
  temp_id = id_list[i].text
  temp_id = temp_id.replace('\n', '').replace('\t', '').replace(' ', '').strip()
  id_final.append(temp_id) # comment author

  temp_comment = comment_list[i].text
  temp_comment = temp_comment.replace('\n', '').replace('\t', '').replace('\r', '').strip()
  comment_final.append(temp_comment) # comment text

🔼 Crawling the comment authors and the comment text.
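
One thing to note about the `replace` chain above: deleting `\n` outright glues together the words on either side of a line break, which can produce odd tokens later in the word cloud. A possible alternative, sketched here with a hypothetical helper `clean_text` (not part of the original code), collapses every run of whitespace into a single space instead.

```python
def clean_text(raw):
    """Collapse any run of whitespace (newlines, tabs, spaces) to one space."""
    # str.split() with no argument splits on all whitespace and drops empties
    return " ".join(raw.split())


print(clean_text("  first line\nsecond\tline \r\n"))  # first line second line
```

For author names the original code drops spaces entirely, so this helper would behave differently there; it is meant for the comment text.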


# Build a DataFrame (list -> dict -> DataFrame)

youtube_dic = {"아이디":id_final, "댓글 내용":comment_final} # columns: "아이디" (author ID), "댓글 내용" (comment text)
youtube_pd = pd.DataFrame(youtube_dic)

🔼 Store the crawled data as a DataFrame.


youtube_pd.to_csv('유튜브댓글_크롤링_오후_20220909.csv', encoding='utf-8-sig', index=False)

🔼 Don't forget to save it to a file as well ..
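
`encoding='utf-8-sig'` writes a UTF-8 byte order mark (BOM) at the start of the file, which is commonly done so that spreadsheet programs such as Excel detect UTF-8 and display Korean text correctly. The sketch below uses only the standard library to show the three BOM bytes; the file name `demo.csv` and the row contents are made up for illustration.

```python
import csv
import os
import tempfile

rows = [["아이디", "댓글 내용"], ["user1", "안녕하세요"]]  # made-up sample rows

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")  # hypothetical file name
    with open(path, "w", encoding="utf-8-sig", newline="") as f:
        csv.writer(f).writerows(rows)

    with open(path, "rb") as f:
        head = f.read(3)  # the BOM occupies the first three bytes

    with open(path, encoding="utf-8-sig", newline="") as f:
        loaded = list(csv.reader(f))  # utf-8-sig strips the BOM on read

print(head == b"\xef\xbb\xbf")  # True
print(loaded == rows)           # True
```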


✅ Word cloud visualization

df = pd.read_csv('/content/유튜브댓글_크롤링_오후_20220909.csv')
text = " ".join(li for li in df['댓글 내용'].astype(str))

After loading the file back as a DataFrame, join all the comment texts into one string for the word cloud visualization.

The word cloud visualization code is the same as always ...
and then!!

This is how it comes out.
I just asked a friend to send me any video with 10,000 comments or fewer,
so I have no idea what kind of video this is or why "사이토" (Saito) shows up as the biggest word ...
Maybe it's the name of someone who appears in the video??
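
A quick way to check why one word dominates is to count how often each token appears, since word clouds typically size words by frequency. The sketch below is only an illustration under simple assumptions: it splits on whitespace (no Korean morphological analysis), and both `top_words` and the sample string are hypothetical, not taken from the crawled data.

```python
from collections import Counter


def top_words(text, n=3):
    """Return the n most frequent whitespace-separated tokens in text."""
    return Counter(text.split()).most_common(n)


sample = "saito goal saito pass goal saito"  # made-up stand-in for the joined comments
print(top_words(sample, n=2))  # [('saito', 3), ('goal', 2)]
```

With the joined `text` string from above, `top_words(text, n=10)` would list the ten most frequent tokens.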
