[Hadoop] 12. Wordcount via Crawling

YS Choi · July 9, 2024

Hadoop Ecosystem


1) Install the libraries needed for crawling


# Install the libraries needed for crawling
pip3 install requests beautifulsoup4 pandas pyarrow hdfs
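To confirm the packages were installed correctly, a quick import test such as the one below can be run (a simple sanity check, not part of the original setup):

python3 -c "import requests, bs4, pandas, pyarrow, hdfs; print('crawling libraries OK')"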


2) Create the crawling script (movie_crawling.py)


cd /home/ubuntu/src
# Copy and paste the contents of sample_movie_crawling.py
vim movie_crawling.py
  • movie_crawling.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
# Add the user site-packages path so locally installed libraries can be imported
sys.path.append('/home/ubuntu/.local/lib/python3.8/site-packages')

import requests
from bs4 import BeautifulSoup as bs 
 
import pandas as pd 
from hdfs import InsecureClient

# Fill in your own user-agent string
def do_crawling(url):
    header = {
        "user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
        "referer": "https://www.google.com/"
    }
    response = requests.get(url, headers=header)
    response.raise_for_status()

    return response

def get_titles(response):
    soup = bs(response.text, "html.parser")
    html_titles = soup.find_all('a', class_='title')

    # Strip whitespace so each title is a clean single line
    return [title.text.strip() for title in html_titles]


def main(guest_ip, review_url, localpath, hdfspath):

    # Crawl the review page
    response = do_crawling(review_url)

    # Extract the review titles
    review_titles = get_titles(response)

    # Save the data locally, one title per line
    with open(localpath, "w", encoding='utf8') as file:
        file.writelines(title + "\n" for title in review_titles)

    # Upload the data to HDFS over WebHDFS
    # (this series uses namenode HTTP port 50070; vanilla Hadoop 3.x defaults to 9870)
    hdfs_ip = "http://{guest_ip}:50070".format(guest_ip=guest_ip)
    client_hdfs = InsecureClient(hdfs_ip, user='ubuntu')
    client_hdfs.upload(hdfspath, localpath)


###############################
# Crawling entry point
###############################
if __name__ == "__main__":
    # IP address of the master instance
    guest_ip = '10.0.2.xx'
    # URL of the review page to crawl
    review_url = "https://www.imdb.com/title/tt0111161/reviews/?ref_=tt_ov_rt"
    # Destination path in HDFS
    hdfspath = '/crawling/input/review_titles.txt'
    # Local (Linux) save path
    localpath = '/home/ubuntu/data/review_titles.txt'

    main(guest_ip, review_url, localpath, hdfspath)
  • Grant permissions to the script
sudo chmod 777 movie_crawling.py
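
Before running the full crawler, it can help to verify that the WebHDFS endpoint is reachable from the instance. The snippet below is a minimal sketch using the same hdfs library; 10.0.2.xx is the same placeholder for the master IP, and port 50070 follows the namenode HTTP port used in this series.

#!/usr/bin/env python
# check_webhdfs.py (hypothetical helper): list the HDFS root to confirm connectivity
from hdfs import InsecureClient

client = InsecureClient("http://10.0.2.xx:50070", user="ubuntu")
print(client.list("/"))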



3) Run movie_crawling.py


  • Create the crawling input directory in HDFS
hdfs dfs -mkdir -p /crawling/input

  • Run movie_crawling.py
python3 ~/src/movie_crawling.py

  • Check the results
hdfs dfs -ls -R /crawling

cat ~/data/review_titles.txt



4) Run the Hadoop Wordcount job


hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-3.3.0.jar \
-files '/home/ubuntu/src/wordcount_mapper.py,/home/ubuntu/src/wordcount_reducer.py' \
-mapper 'python3 wordcount_mapper.py' \
-reducer 'python3 wordcount_reducer.py' \
-input /crawling/input/* \
-output /crawling/output 

hdfs dfs -text /crawling/output/*
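
  • (Reference) wordcount_mapper.py / wordcount_reducer.py

The streaming job above expects wordcount_mapper.py and wordcount_reducer.py under /home/ubuntu/src, presumably created in an earlier post of this series. If they are not in place yet, a minimal sketch of the pair could look like the following; the whitespace tokenization and tab-separated key/value output are standard Hadoop Streaming conventions and may differ slightly from the original scripts.

#!/usr/bin/env python
# wordcount_mapper.py: emit "word<TAB>1" for every whitespace-separated token on stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

#!/usr/bin/env python
# wordcount_reducer.py: sum the counts per word (streaming sorts mapper output by key)
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, _, count = line.strip().partition("\t")
    if not word:
        continue
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word = word
        current_count = int(count)

if current_word is not None:
    print(current_word + "\t" + str(current_count))

Note that Hadoop Streaming fails if the output path already exists, so remove it with hdfs dfs -rm -r /crawling/output before re-running the job.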



5) Check the Web UI (YARN ResourceManager)


http://127.0.0.1:8088/


