📕 Week2 day2 (Web Scraping Basics)

๋ฐ•์ค€ํฌยท2023๋…„ 8์›” 29์ผ


HTML๋ถ„์„๊ธฐ - BeautifulSoup


BeautifulSoup ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ

HTML ์ฝ”๋“œ๋ฅผ ๋ถ„์„ํ•ด์ฃผ๋Š”, HTML Parser

from bs4 import BeautifulSoup

# Create a BeautifulSoup object.
# The first argument is the response body as text.
# The second argument names the parser, here Python's built-in "html.parser".
# (res is assumed to be a requests.Response obtained earlier with requests.get.)

soup = BeautifulSoup(res.text, "html.parser")

# Get the <title> element
soup.title

# Get the <head> element
soup.head

# Get the <body> element
soup.body

# Find the first element wrapped in an <h1> tag
h1 = soup.find("h1")

# Find every element wrapped in a <p> tag
soup.find_all("p")

# Get the tag name
h1.name

# Get the text inside the tag
h1.text
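
To try the calls above without sending a request, here is a minimal sketch that parses a small HTML string. The markup and the variable names (`html`, `title_text`, `paragraphs`) are made up for illustration; in the notes the input is `res.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for res.text
html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Hello, BeautifulSoup</h1>
    <p>First paragraph</p>
    <p>Second paragraph</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.text      # text inside <title>
h1 = soup.find("h1")              # first <h1> Tag, or None if there is none
paragraphs = soup.find_all("p")   # list-like collection of every <p> Tag

print(title_text)
print(h1.name, h1.text)
print([p.text for p in paragraphs])
```

Note that `find` gives back a single element while `find_all` gives back a collection, so the second result has to be indexed or looped over.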

Getting the Elements You Want I


# Import the libraries needed for scraping.

import requests
from bs4 import BeautifulSoup

# Send a request to the example site and build a BeautifulSoup object from the response.

res = requests.get("http://books.toscrape.com/catalogue/category/books/travel_2/index.html")
# res.text is the body decoded to str; res.content is the raw bytes.
soup = BeautifulSoup(res.text, "html.parser")

# Find the first element matching the <h3> tag.

book = soup.find("h3")

# Find every element matching the <h3> tag.

h3_results = soup.find_all("h3")
h3_results[0]
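
On the difference between `res.text` and `res.content`: in requests, `res.text` is the body decoded to a `str`, and `res.content` is the raw `bytes`. BeautifulSoup accepts either one. The sketch below imitates both with a made-up HTML string instead of a real response, so `html_text` and `html_bytes` are stand-ins, not values from the example site.

```python
from bs4 import BeautifulSoup

# Stand-ins for res.text (str) and res.content (bytes); the markup is hypothetical
html_text = "<html><head><title>Demo</title></head><body><h1>Hello</h1></body></html>"
html_bytes = html_text.encode("utf-8")

# BeautifulSoup takes a str directly, or bytes whose encoding it works out itself
soup_from_text = BeautifulSoup(html_text, "html.parser")
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser")

print(type(html_text).__name__, soup_from_text.title.text)
print(type(html_bytes).__name__, soup_from_bytes.title.text)
```

For a page like this the two routes give the same tree. The choice matters mainly when the encoding guessed for `res.text` turns out to be wrong.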

Everything returned is an object (a Tag, or a list-like ResultSet of Tags), so we can pull data out of it in the ways we are already used to, such as indexing and attribute access.
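
As a rough illustration of treating the results as objects, the sketch below indexes into the result of `find_all`, steps into a child tag, and reads attributes. The markup is hypothetical and only loosely modeled on a book listing; the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on a book listing
html = """
<h3><a href="book-1/index.html" title="Book One (Full Title)">Book One ...</a></h3>
<h3><a href="book-2/index.html" title="Book Two (Full Title)">Book Two ...</a></h3>
"""
soup = BeautifulSoup(html, "html.parser")
h3_results = soup.find_all("h3")

first = h3_results[0]        # index it like a list
link = first.a               # first <a> inside the <h3>
print(link.text)             # visible text of the link
print(link["title"])         # attributes are read like dictionary keys
print(link.get("href"))      # .get() returns None when the attribute is missing

# A list comprehension works as it would on any list
titles = [h3.a["title"] for h3 in h3_results]
print(titles)
```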

HTML์˜ Locator๋กœ ์›ํ•˜๋Š” ์š”์†Œ ์ฐพ๊ธฐ


ํƒœ๊ทธ๋Š” ์ž์‹ ์˜ ์ด๋ฆ„ ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ๊ณ ์œ ํ•œ ์†์„ฑ ๋˜ํ•œ ๊ฐ€์งˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
์ด ์ค‘์—์„œ id์™€ class๋Š” Locator๋กœ์„œ, ํŠน์ • ํƒœ๊ทธ๋ฅผ ์ง€์นญํ•˜๋Š” ๋ฐ์— ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

  • tagname: ํƒœ๊ทธ์˜ ์ด๋ฆ„
  • id: ํ•˜๋‚˜์˜ ๊ณ ์œ  ํƒœ๊ทธ๋ฅผ ๊ฐ€๋ฆฌํ‚ค๋Š” ๋ผ๋ฒจ
  • class: ์—ฌ๋Ÿฌ ํƒœ๊ทธ๋ฅผ ๋ฌถ๋Š” ๋ผ๋ฒจ

Finding a tag by id

# Find the div tag whose id is "results".

soup.find("div", id="results")
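
Here is a small offline sketch of the id lookup above. The HTML is invented for the example; the point is that `find` returns the one matching tag, which can then be searched further, and returns `None` when no tag has that id.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two divs that differ only by id
html = """
<div id="header">Header</div>
<div id="results"><p>Result A</p><p>Result B</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

results = soup.find("div", id="results")
print(results["id"])

# The returned Tag can be searched again, limited to its own children
result_texts = [p.text for p in results.find_all("p")]
print(result_texts)

# No match: find() gives None, so check before using the result
missing = soup.find("div", id="no-such-id")
print(missing is None)
```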

Finding a tag by class

# Find the div tag whose class is "page-header".
# A string passed as the second positional argument is matched against the class.

find_results = soup.find("div", "page-header")
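
The class lookup can also be written with the `class_` keyword (with a trailing underscore, because `class` is a reserved word in Python). The sketch below uses made-up markup, with a div that carries two classes, to show that both spellings find the same tag.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the first div has two classes
html = """
<div class="page-header action"><h1>Travel</h1></div>
<div class="sidebar">Menu</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_position = soup.find("div", "page-header")        # second positional argument
by_keyword = soup.find("div", class_="page-header")  # explicit keyword form

print(by_position == by_keyword)
print(by_keyword["class"])     # class is multi-valued, so it comes back as a list
print(by_keyword.h1.text)
```

Matching on one class name is enough even when the tag has several, which is why `"page-header"` still finds the div with `class="page-header action"`.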

💡 I found it interesting that the data located through web scraping comes back as objects, and I expect that will make many kinds of data processing more convenient.
