[Python] Beautiful Soup

신은지 · November 23, 2024


Beautiful Soup

Beautiful Soup is a library for reading and parsing HTML, such as an HTML file saved on disk.
For more detail, see the 🔗Beautiful Soup official documentation.

open : opens a file by name; the mode can be read (r) or write (w)

html.parser : one of the parsers Beautiful Soup can use to read HTML (lxml is also widely used)

prettify() : returns the HTML as an indented, easier-to-read string
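
The three pieces above fit together roughly as follows. This is a minimal sketch: the file name `sample.html` and its contents are made up for illustration, and the file is written to a temporary directory first so the example can run on its own.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical sample page, written to disk only so there is a file to read.
html_text = "<html><head><title>Sample</title></head><body><p>Hello</p></body></html>"
html_path = Path(tempfile.mkdtemp()) / "sample.html"
html_path.write_text(html_text, encoding="utf-8")

# open() in read mode ("r"), then pass the file object to Beautiful Soup
# together with the name of the parser to use.
with open(html_path, "r", encoding="utf-8") as page:
    soup = BeautifulSoup(page, "html.parser")

# prettify() returns the document as an indented string,
# with each tag and each piece of text on its own line.
pretty = soup.prettify()
print(pretty)
```

Swapping `"html.parser"` for `"lxml"` should work the same way, assuming the lxml package is installed separately.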

Installing the Beautiful Soup Library

conda install -c anaconda beautifulsoup4
# or
pip install beautifulsoup4

Checking Basic Tags

<head> tag
soup.head

<body> tag
soup.body

<p> tag
- Returns only the first <p> tag found
soup.p
# or
soup.find("p")
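
The accessors above can be tried on a small document. The HTML below is a made-up sample, not taken from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical document used only for this illustration.
html_text = """
<html>
  <head><title>Sample</title></head>
  <body>
    <p>first paragraph</p>
    <p>second paragraph</p>
  </body>
</html>
"""
soup = BeautifulSoup(html_text, "html.parser")

print(soup.head)       # the whole <head> element, including <title>
print(soup.body.name)  # the tag name of the <body> element

# Both forms return only the first <p> in the document.
first_by_attr = soup.p
first_by_find = soup.find("p")
print(first_by_attr.text)
print(first_by_attr == first_by_find)
```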

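Finding Tags When an Attribute Name Is a Python Reserved Word
- class is a Python reserved word, so it cannot be used as a keyword argument; Beautiful Soup accepts class_ instead. Of the names class, id, def, list, str, int, tuple..., only class and def are reserved words; the rest are built-in names and can be passed as keyword arguments (for example id="...").
Single condition
soup.find("tag_name", class_="class_name")
soup.find("tag_name", {"class": "class_name"}).text.strip()
Multiple conditions
soup.find("tag_name", {"class": "class_name", "id": "id_name"})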
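
A short sketch of the single-condition and multiple-condition forms. The class and id values (`intro`, `note`, `first`, `second`) are invented for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class and id values are made up for illustration.
html_text = """
<div>
  <p class="intro" id="first">  Welcome  </p>
  <p class="intro" id="second">Again</p>
  <p class="note">Other</p>
</div>
"""
soup = BeautifulSoup(html_text, "html.parser")

# Single condition: keyword form (class_) and dictionary form are equivalent here.
by_keyword = soup.find("p", class_="intro")
by_dict = soup.find("p", {"class": "intro"})
stripped = by_dict.text.strip()  # strip() removes the surrounding whitespace

# Multiple conditions: every key in the dictionary must match.
both = soup.find("p", {"class": "intro", "id": "second"})

# find() returns None when nothing matches, so check before using .text.
missing = soup.find("p", class_="absent")

print(by_keyword["id"], stripped, both.text, missing)
```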

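Returning Multiple Tags
An id normally appears only once in an HTML document, so find_all() is often unnecessary when searching by id. However, find_all() always returns a list (a ResultSet, which is a list subclass), so use it even for an id when you want the result as a list.
soup.find_all("tag_name")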
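
To illustrate the list behavior described above, here is a minimal sketch on an invented list of items; the id value `only` is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html_text = """
<ul>
  <li id="only">one</li>
  <li>two</li>
  <li>three</li>
</ul>
"""
soup = BeautifulSoup(html_text, "html.parser")

items = soup.find_all("li")       # every <li>, in document order
by_id = soup.find_all(id="only")  # still a list, even with a single match
none_found = soup.find_all("table")  # an empty list rather than None

print(len(items), isinstance(items, list))
print(len(by_id), by_id[0].text)
print(len(none_found))
```

Because the result is always a list, it can be looped over or indexed without a separate None check, unlike the result of find().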

Checking a Specific Tag

idx is the zero-based index of the tag you want within the results.

soup.find_all(id="id_name")[idx].text

soup.find_all("tag_name", class_="class_name")

print(soup.find_all("tag_name")[idx].text)
print(soup.find_all("tag_name")[idx].string)  # None when the tag has more than one child
print(soup.find_all("tag_name")[idx].get_text())
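
The three accessors above do not always agree. The sketch below, on made-up markup, shows where .string differs from .text and get_text().

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one paragraph with plain text, one with a nested tag.
html_text = """
<div>
  <p class="plain">just text</p>
  <p class="mixed">text with <b>bold</b> inside</p>
</div>
"""
soup = BeautifulSoup(html_text, "html.parser")
paragraphs = soup.find_all("p")

plain = paragraphs[0]  # idx = 0
mixed = paragraphs[1]  # idx = 1

# With a single text child, all three give the same text.
print(plain.text, plain.string, plain.get_text())

# With nested tags, .text and get_text() join all the text,
# while .string is None because there is more than one child.
print(mixed.text)
print(mixed.string)
print(mixed.get_text())
```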

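Printing Only Specific Attributes

Print only the text of each tag in the list of <p> tags
for each_tag in soup.find_all("p"):
    print("=" * 50)
    print(each_tag.text)

Extract the value of the href attribute from <a> tags

links = soup.find_all("a")
links[0].get("href"), links[1]["href"]  # .get() returns None if href is missing; ["href"] raises KeyError
# or
for each in links:
    href = each.get("href")  # each["href"] also works when the attribute exists
    text = each.get_text()
    print(f"{text}=>{href}")  # an f-string avoids a TypeError when href is None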
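
A runnable sketch of the link loop, including an <a> tag without an href. The example.com URLs and link texts are placeholders, not real data.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the URLs are placeholders.
html_text = """
<div>
  <a href="https://example.com/a">First</a>
  <a href="https://example.com/b">Second</a>
  <a>No link</a>
</div>
"""
soup = BeautifulSoup(html_text, "html.parser")
links = soup.find_all("a")

pairs = []
for each in links:
    href = each.get("href")  # None for the third <a>, which has no href
    text = each.get_text()
    pairs.append((text, href))
    print(f"{text}=>{href}")

# Passing href=True keeps only the <a> tags that actually have the attribute.
with_href = soup.find_all("a", href=True)
print(len(with_href))
```

Filtering with href=True first is one way to make each["href"] safe inside the loop, since every remaining tag is known to carry the attribute.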