(Package) BeautifulSoup

임경민 · October 12, 2023

🔑 Summary


  • Package - BeautifulSoup Basic

📗 Contents


BeautifulSoup Basic

- conda install -c anaconda beautifulsoup4
- pip install beautifulsoup4
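  • Check the data
# Import BeautifulSoup and parse the local HTML file
from bs4 import BeautifulSoup
# Assumes the file is UTF-8 encoded; the with block closes it after reading
with open("../data/03. zerobase.html", "r", encoding="utf-8") as f:
    page = f.read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())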
  • HTML : describes the structure of a web page
  • Head : holds the header information the document needs
  • Body : holds the content that is visible on the page

  • Actual HTML code
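
The contents of zerobase.html are not reproduced in this post, so the following is only a minimal sketch that parses a hypothetical stand-in document. The markup, the title text, and the variable names are assumptions modeled on the class names and ids used later in the post; the real file may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for zerobase.html (an assumption, not the real file).
html = """
<!DOCTYPE html>
<html>
  <head>
    <title>Very Simple HTML Code</title>
  </head>
  <body>
    <div>
      <p class="inner-text first-item" id="first">
        Happy PinkWink.
        <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a>
      </p>
      <p class="inner-text second-item">
        Happy Data Science.
        <a href="https://www.python.org" id="py-link">Python</a>
      </p>
    </div>
    <p class="outer-text first-item" id="second">
      <b>Data Science is funny.</b>
    </p>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Head holds document metadata such as the title.
title_text = soup.head.title.get_text()

# Body holds the visible content; count the <p> tags inside it.
body_p_count = len(soup.body.find_all("p"))

print(title_text, body_p_count)
```

Parsing a string instead of a file makes it easy to try the calls in the rest of the post without the course data.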


Checking tags

  • Check the head tag
# Check the head tag
soup.head
  • Check the body tag
# Check the body tag
soup.body

  • Check the p tag
    • Returns only the first p tag found
    • Equivalent to find("p")
# Check the p tag
# Shorthand for soup.find("p")
soup.p


find()

soup.find("p")
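
As a small sketch of the equivalence between attribute access and find(), the snippet below uses a two-paragraph HTML string that is made up for illustration (it is not the course file). It also shows what comes back when no tag matches.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = "<body><p id='a'>first</p><p id='b'>second</p></body>"
soup = BeautifulSoup(html, "html.parser")

first_via_attr = soup.p          # attribute-style access
first_via_find = soup.find("p")  # explicit find() call

# Both return the first <p> in document order.
print(first_via_attr["id"], first_via_find["id"])

# find() returns None when nothing matches, so check before using the result.
missing = soup.find("table")
print(missing)
```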

  • class_ : class is a reserved keyword in Python, so BeautifulSoup uses a trailing underscore ('_') to tell the argument apart
# Python reserved keywords: class, def, ...
# (id, list, str, int, and tuple are built-in names rather than keywords, but are best not reused either)
soup.find("p", class_="inner-text second-item")

soup.find("p", {"class":"outer-text first-item"}).text.strip()

'Data Science is funny.'
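
How class matching behaves is easy to get wrong, so here is a hedged sketch on a made-up HTML string that reuses the class names from this post. It assumes the documented BeautifulSoup behavior: a single class name matches any tag carrying that class, while a multi-class string has to equal the class attribute exactly.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; class names borrowed from the post.
html = """
<p class="inner-text first-item" id="first">Happy PinkWink.</p>
<p class="inner-text second-item">Happy Data Science.</p>
<p class="outer-text first-item" id="second">  Data Science is funny.  </p>
"""
soup = BeautifulSoup(html, "html.parser")

# A single class name matches any tag that has that class.
by_one_class = soup.find("p", class_="second-item")

# A multi-class string is compared with the whole attribute value, order included.
by_exact_string = soup.find("p", class_="inner-text second-item")
wrong_order = soup.find("p", class_="second-item inner-text")  # no match

# The attrs dict form is an alternative that needs no trailing underscore.
outer_text = soup.find("p", {"class": "outer-text first-item"}).text.strip()

print(by_one_class.text, by_exact_string.text, wrong_order, outer_text)
```

A misspelled class name fails the same way as the wrong order: find() simply returns None.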

  • Multiple conditions
# Multiple conditions: every attribute in the dict must match
soup.find("p", {"class":"inner-text first-item", "id":"first"})
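
A short sketch of combining conditions, again on made-up HTML. It shows the dict form from above next to the keyword-argument form, plus a combination that matches nothing because both conditions must hold for the same tag.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; class names and ids borrowed from the post.
html = """
<p class="inner-text first-item" id="first">Happy PinkWink.</p>
<p class="inner-text second-item">Happy Data Science.</p>
<p class="outer-text first-item" id="second">Data Science is funny.</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Dict form: all key/value pairs must match.
with_dict = soup.find("p", {"class": "inner-text first-item", "id": "first"})

# Keyword form: same filter, written with class_ and id arguments.
with_kwargs = soup.find("p", class_="inner-text first-item", id="first")

# No single tag has both this class string and this id, so the result is None.
no_match = soup.find("p", {"class": "inner-text first-item", "id": "second"})

print(with_dict["id"], with_kwargs["id"], no_match)
```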

  • find_all() : returns every matching tag in a list-like object (a ResultSet)
soup.find_all("p")

  • Check specific tags
# Check specific tags
soup.find_all(id="pw-link")[0].text
soup.find_all("p", class_="inner-text second-item")
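
The following sketch, on made-up HTML, illustrates what find_all() gives back. Since an empty result is an empty list, indexing it with [0] would raise an IndexError, so the example checks before indexing.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration; ids borrowed from the post.
html = """
<p>Happy PinkWink. <a href="http://www.pinkwink.kr" id="pw-link">PinkWink</a></p>
<p>Happy Data Science. <a href="https://www.python.org" id="py-link">Python</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

all_p = soup.find_all("p")           # every <p>, in document order
by_id = soup.find_all(id="pw-link")  # filter by attribute only, any tag name
none_found = soup.find_all("table")  # nothing matches: empty result, not None

# Guard the index so an empty result does not raise IndexError.
first_link_text = by_id[0].text if by_id else None

print(len(all_p), first_link_text, none_found)
```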

  • ๊ธธ์ด
len(soup.find_all("p"))
  • ํ…์ŠคํŠธ ์†์„ฑ๋งŒ ์ถœ๋ ฅํ•˜๊ธฐ
# p ํƒœ๊ทธ ๋ฆฌ์ŠคํŠธ์—์„œ ํ…์ŠคํŠธ ์†์„ฑ๋งŒ ์ถœ๋ ฅ 

for each_tag in soup.find_all("p"):
    print("=" * 50)
    print(each_tag.text)
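
When the text is needed as data rather than printed, a list comprehension is one option. This is a sketch on made-up HTML; it uses get_text() with a separator and strip=True so that surrounding whitespace is dropped and nested pieces are joined with a single space.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only.
html = """
<p>  Happy PinkWink. <a href="http://www.pinkwink.kr">PinkWink</a> </p>
<p>
  Data Science is funny.
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect the cleaned text of every <p> tag into a plain list of strings.
texts = [p.get_text(" ", strip=True) for p in soup.find_all("p")]

for text in texts:
    print("=" * 50)
    print(text)
```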

  • Extract the value of an attribute
# Extract the value of the href attribute from a tags

links = soup.find_all("a")
links[0].get("href"), links[1]["href"]

('http://www.pinkwink.kr', 'https://www.python.org')

for each in links:
    href = each.get("href") # same as each["href"] when the attribute exists
    text = each.get_text()
    print(text + "=>" + href)

PinkWink=>http://www.pinkwink.kr
Python=>https://www.python.org
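
The two access styles differ once an a tag has no href: get("href") returns None, while ["href"] raises a KeyError. The sketch below uses made-up HTML with one such anchor (the third link is an assumption added for illustration) and shows two ways to skip it.

```python
from bs4 import BeautifulSoup

# Made-up HTML; the third <a> has no href and exists only to show the edge case.
html = (
    '<a href="http://www.pinkwink.kr">PinkWink</a>'
    '<a href="https://www.python.org">Python</a>'
    '<a name="top">No link</a>'
)
soup = BeautifulSoup(html, "html.parser")

pairs = []
for each in soup.find_all("a"):
    href = each.get("href")  # None when missing; each["href"] would raise KeyError
    if href is None:
        continue             # skip anchors without a link target
    pairs.append((each.get_text(), href))

# Alternative: ask find_all() for only the tags that have an href attribute.
only_with_href = soup.find_all("a", href=True)

for text, href in pairs:
    print(text + "=>" + href)
```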

0๊ฐœ์˜ ๋Œ“๊ธ€