Picking Apart Web Crawling #4 | Extracting HTML and Elements with bs4, Saving to txt

두더지 · June 10, 2025

💭 Getting started..

The previous post touched on bs4 briefly. This post practices using bs4 to fetch a website's HTML, extract the content we want from the page, and save the extracted content to a txt file.

1. Introduction

BeautifulSoup is a Python library used for web scraping.
It extracts data from HTML and XML files.
Dependencies: BeautifulSoup only parses markup, so it is paired with requests to fetch pages and with an HTML/XML parser such as lxml.

1.1 Installing

pip install beautifulsoup4
pip install requests
pip install lxml

1.2 Import

from bs4 import BeautifulSoup
import requests

2. Steps before scraping a website

2.1 Fetch the pages (obtained a response object)

result = requests.get("https://www.google.com")  # the URL must include the scheme (https://)
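
A request can fail or hang, so it may help to wrap the fetch step in a small helper. The sketch below is only one way to do it: the name fetch_html and the 10-second timeout are assumptions made for illustration, not part of the original course code.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the body of `url` as text, or raise if the request fails.

    fetch_html is a hypothetical helper name; the timeout value is arbitrary.
    """
    # timeout keeps the script from waiting forever on an unresponsive server
    response = requests.get(url, timeout=timeout)
    # raise_for_status() turns 4xx/5xx responses into requests.HTTPError
    response.raise_for_status()
    return response.text
```

Calling fetch_html with a URL that has no scheme, such as "www.google.com", raises requests.exceptions.MissingSchema before any network traffic is sent.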

2.2 Page content

content = result.text

2.3 Create Soup (create the object)

soup = BeautifulSoup(content, "lxml")
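
The second argument selects the parser. As a minimal sketch, assuming lxml is not installed, Python's built-in "html.parser" can be swapped in. The content string below is made-up sample HTML, not a real page.

```python
from bs4 import BeautifulSoup

# Made-up sample HTML standing in for result.text
content = "<html><head><title>Hello</title></head><body><p>Hi</p></body></html>"

# "html.parser" ships with Python; "lxml" must be installed separately
soup = BeautifulSoup(content, "html.parser")

page_title = soup.title.string       # text inside <title>
first_paragraph = soup.p.get_text()  # text inside the first <p>
print(page_title, first_paragraph)
```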

2.4 Finding HTML elements

<article class="main-article">
	<h1> Titanic (1997) </h1>
	<p class="plot"> 84 years later ... </p>
	<div class="full-script"> 13 meters. You ... </div>
</article>

Find an element by ID

element = soup.find(id="element_id")

Find an element by class name

element = soup.find("tag_name", class_="class_name")

Find an element by tag name

h1_element = soup.find("h1")

Find multiple elements: use the find_all() method

h2_elements = soup.find_all("h2")
for h2 in h2_elements:
    print(h2)
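
To see these lookups together, here is a small sketch that runs them against the sample article markup from this section. It assumes the built-in "html.parser" instead of lxml so it runs without extra installs, and the variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup

html = """
<article class="main-article">
  <h1> Titanic (1997) </h1>
  <p class="plot"> 84 years later ... </p>
  <div class="full-script"> 13 meters. You ... </div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag name plus class: narrows the search to the article box
article = soup.find("article", class_="main-article")

# Searching inside a found element only looks at its descendants
h1_text = article.find("h1").get_text(strip=True)
plot_text = soup.find("p", class_="plot").get_text(strip=True)

# find() returns None when nothing matches; find_all() returns an empty list
missing = soup.find(id="element_id")
h2_count = len(soup.find_all("h2"))

print(h1_text, plot_text, missing, h2_count)
```

Because find() can return None, it is worth checking the result before calling get_text() on it when scraping a page whose layout may change.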

3. Scrape Examples

https://subslikescript.com/movie/Titanic-120338 → example website

ex_6.py | Get the HTML from a website

from bs4 import BeautifulSoup
import requests

website = 'https://subslikescript.com/movie/Titanic-120338'
result = requests.get(website)
content = result.text

soup = BeautifulSoup(content, 'lxml')
print(soup.prettify())

lxml is a fast, easy-to-use Python library for processing XML and HTML.
prettify() re-indents the HTML so its structure is easier to read.

ex_6_1.py | Scrape a single page

# https://www.udemy.com/course/web-scraping-course-in-python-bs4-selenium-and-scrapy/learn/lecture/27676578#learning-tools
# Example code from the Udemy course

from bs4 import BeautifulSoup
import requests

website = 'https://subslikescript.com/movie/Titanic-120338'
result = requests.get(website)
content = result.text

soup = BeautifulSoup(content, 'lxml')
# print(soup)

box = soup.find('article', class_="main-article")
# class_ takes a trailing underscore because class is a reserved keyword in Python

title = box.find('h1').get_text()  # get_text() extracts the text inside the element
transcription = box.find('div', class_="full-script").get_text(strip=True, separator=' ')  # strip=True trims each text fragment; separator=' ' joins the fragments with a space
print(title)
print(transcription)

strip=True: strips leading and trailing whitespace from each piece of text
separator=' ': joins the pieces with a space
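
The effect of these two arguments is easier to see on a tiny input. The snippet below is a sketch using made-up script lines separated by <br/> tags and the built-in "html.parser". It is not the real page markup.

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating a script block with line breaks
html = '<div class="full-script">13 meters.<br/>You should see it.<br/>  Okay, take her up. </div>'
div = BeautifulSoup(html, "html.parser").find("div", class_="full-script")

# Default: text fragments are concatenated as-is, so lines run together
raw = div.get_text()

# strip=True trims each fragment; separator=' ' puts one space between them
cleaned = div.get_text(strip=True, separator=" ")

print(repr(raw))
print(repr(cleaned))
```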

ex_6_2.py | Exporting data to a txt file


from bs4 import BeautifulSoup
import requests

website = 'https://subslikescript.com/movie/Titanic-120338'
result = requests.get(website)
content = result.text

soup = BeautifulSoup(content, 'lxml')
# print(soup)

box = soup.find('article', class_="main-article")
# class_ takes a trailing underscore because class is a reserved keyword in Python

title = box.find('h1').get_text()  # get_text() extracts the text inside the element
transcription = box.find('div', class_="full-script").get_text(strip=True, separator=' ')  # strip=True trims each text fragment; separator=' ' joins the fragments with a space

with open(f'{title}.txt', 'w', encoding='utf-8') as file:  # file name is the title plus .txt; utf-8 avoids encoding errors on non-ASCII text
    file.write(transcription)
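
A scraped title can carry surrounding spaces or characters that are not allowed in file names on some systems. The following is a minimal sketch of one way to handle that. safe_filename and save_transcript are hypothetical helper names, and the set of replaced characters is an assumption based on common Windows restrictions.

```python
import re
from pathlib import Path


def safe_filename(title: str, default: str = "untitled") -> str:
    """Turn a scraped title into a string that is safe to use as a file name."""
    # Replace characters that Windows forbids in file names, then trim the ends
    cleaned = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    # Collapse runs of whitespace (including newlines) into single spaces
    cleaned = re.sub(r"\s+", " ", cleaned)
    return cleaned or default


def save_transcript(title: str, text: str, out_dir: str = ".") -> Path:
    """Write `text` to '<out_dir>/<safe title>.txt' as UTF-8 and return the path."""
    path = Path(out_dir) / f"{safe_filename(title)}.txt"
    path.write_text(text, encoding="utf-8")
    return path
```

With these helpers, the last two lines of the script could become save_transcript(title, transcription).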

ex_6_2.py | Result

Running the script writes the transcript to a .txt file named after the page's h1 title.

Wrapping up..

Aiming to finish the whole online course..
