EDA (3) - Web Data

Jasmine · July 7, 2023

Beautiful Soup

A Python module for parsing documents made up of tags (HTML, XML)

from bs4 import BeautifulSoup

# read the saved HTML file, closing it automatically afterwards
with open("../data/03. JasmineK.html", "r") as f:
    page = f.read()
soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())

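When reading an HTML file saved on disk
open : takes the file name together with a mode, read (r) or write (w)
html.parser : one of the parsers Beautiful Soup can use to read HTML (lxml is also widely used)
prettify() : returns the HTML indented so that it is easier to read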
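
The steps above can be tried without the original data file. The following is a minimal sketch, assuming a small made-up HTML document (`SAMPLE_HTML` and the file name `sample.html` are hypothetical and not part of the original notes). It writes the document to a temporary file, reads it back with `open`, and parses it with `html.parser`.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical sample document, used only for illustration.
SAMPLE_HTML = (
    "<html><head><title>Sample</title></head>"
    "<body><p>Hello</p></body></html>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    html_path = Path(tmp_dir) / "sample.html"
    html_path.write_text(SAMPLE_HTML, encoding="utf-8")

    # "r" opens the file for reading; stating the encoding avoids
    # platform-dependent defaults when the page contains Korean text.
    with open(html_path, "r", encoding="utf-8") as f:
        page = f.read()

soup = BeautifulSoup(page, "html.parser")
print(soup.prettify())  # indented view of the parsed document

title_text = soup.title.string
first_paragraph = soup.p.text
```

Passing `encoding="utf-8"` is a choice made for this sketch; the right value depends on how the file was actually saved.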

The "open in a new window" option for a href

target="_blank"

Bold text

<b> </b>

Ways to get tags in Beautiful Soup

soup.head	# the head tag
soup.body	# the body tag
soup.p		# the p tag - only the first p tag found
soup.find("p")	# the p tag (same result as soup.p)

find

# set a search condition
soup.find("p", class_="inner-text second-item")

# set the condition with a dictionary
soup.find("p", {"class":"outer-text first-item"})

# get only the text inside the p tag
soup.find("p", {"class":"outer-text first-item"}).text

# strip() : remove leading and trailing whitespace
soup.find("p", {"class":"outer-text first-item"}).text.strip()

# multiple conditions
soup.find("p", {"class":"inner-text first-item", "id":"first"})
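
To see what these `find` calls return, here is a small sketch on a made-up HTML fragment (the markup and the class name `no-such-class` are assumptions for illustration, not the contents of the original file). It also shows that `find` returns `None` when nothing matches, which is worth checking before calling `.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only.
html = """
<div>
  <p class="outer-text first-item" id="first">  Outer paragraph  </p>
  <p class="inner-text second-item">Inner paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ has a trailing underscore because class is reserved.
by_keyword = soup.find("p", class_="inner-text second-item")

# Dictionary form: the same kind of condition, written as {attribute: value}.
by_dict = soup.find("p", {"class": "outer-text first-item"})

# Several attributes at once.
multi = soup.find("p", {"class": "outer-text first-item", "id": "first"})

# No match: find returns None instead of raising an error.
missing = soup.find("p", class_="no-such-class")

outer_text = by_dict.text.strip()  # strip() drops the surrounding spaces
print(by_keyword.text, "|", outer_text, "|", missing)
```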

find_all

Returns multiple tags
Returns them as a list-like result (shown in square brackets [ ])

soup.find_all("p")

# return the tags that match an attribute
soup.find_all(class_="outer-text")
soup.find_all(id="pw-link")

# set a search condition
soup.find_all("p", class_="inner-text second-item")

# extracting text from the list: you have to index into it first

# soup.find_all(id="pw-link").text : raises an error (AttributeError)
soup.find_all(id="pw-link")[0].text

# ways to get the text
soup.find_all("p")[0].text
soup.find_all("p")[1].string      # None if the tag has more than one child
soup.find_all("p")[1].get_text()
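
The three ways of getting text do not always agree. A short sketch on a made-up fragment (the markup is an assumption for illustration) shows the difference: `.string` only works when the tag has a single child, while `.text` and `.get_text()` join the text of all descendants.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second p tag contains a nested b tag.
soup = BeautifulSoup("<p>Plain</p><p>Hello <b>World</b></p>", "html.parser")
simple, nested = soup.find_all("p")

simple_string = simple.string        # single child, so the text is returned
nested_string = nested.string        # more than one child, so this is None
nested_text = nested.text            # text of all descendants, joined
nested_get_text = nested.get_text()  # same result as .text

print(simple_string, nested_string, nested_text)
```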

# print only the text of each tag in the list of p tags

for each_tag in soup.find_all("p"):
    print("="*50)
    print(each_tag.text)

# extract the value of the href attribute from a tags
links = soup.find_all("a")
links

# two ways to pull out just the link
links[0].get("href")	# use get
links[1]["href"]		# index directly with the attribute name
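
The two ways of reading `href` behave differently when the attribute is missing. The sketch below uses made-up links (the markup and ids other than `pw-link` are assumptions for illustration): `get` returns `None`, while indexing raises `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the third a tag has no href attribute.
html = """
<a href="https://www.python.org" id="py-link">Python</a>
<a href="https://pandas.pydata.org" id="pw-link">pandas</a>
<a id="no-href">No link</a>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# get() is the safe form: a missing attribute gives None.
hrefs = [link.get("href") for link in links]

# Indexing works when the attribute exists ...
first_href = links[0]["href"]

# ... and raises KeyError when it does not.
missing_raises = False
try:
    links[2]["href"]
except KeyError:
    missing_raises = True

# find_all returns a list-like result, so index before asking for .text.
pw_text = soup.find_all(id="pw-link")[0].text
print(hrefs, pw_text)
```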

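find, select_one : select a single element
find_all, select : select multiple elements

When searching by class with find / find_all, use class_ (with a trailing underscore), because class is a reserved word in Python

A space in a class attribute means the tag has two classes, so in a CSS selector join them with a dot (.)

A direct child is written with > (whether the > is there matters: without it, the selector matches any descendant)

id => #, class => .
exchangeList = soup.select("#exchangeList > li")    # > means one level down: get every li directly inside #exchangeList
updown_element_up = item.select_one('div.head_info.point_up > .blind')    # item : one element of exchangeList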
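
The selector rules above can be checked on a small fragment. This is a sketch under assumptions: the markup below only imitates the ids and class names used in the selectors and is not the real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the ids and classes from the selectors above.
html = """
<ul id="exchangeList">
  <li><div class="head_info point_up"><span class="blind">up</span></div></li>
  <li><div class="head_info point_dn"><span class="blind">down</span></div></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# "#exchangeList > li": li tags that are direct children of id="exchangeList".
items = soup.select("#exchangeList > li")

# "div.head_info.point_up": a div that has both classes (joined with dots).
up = items[0].select_one("div.head_info.point_up > .blind")

# The second item has point_dn instead, so select_one returns None.
not_up = items[1].select_one("div.head_info.point_up > .blind")

print(len(items), up.text, not_up)
```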

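urllib's quote function : percent-encodes an address that contains Korean characters (UTF-8 by default)
pandas unique() : returns the unique values as an array
(DataFrame = 2-D table, Series = a single column. pandas)
(array = numpy array)
urljoin : converts a relative address to an absolute one.
regular expression (regex)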
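
A short sketch of `quote` and `urljoin` from the standard library follows. The `example.com` addresses and the search word are made up for illustration.

```python
from urllib.parse import quote, unquote, urljoin

# quote percent-encodes non-ASCII characters using UTF-8 by default.
keyword = "파이썬"
encoded = quote(keyword)
decoded = unquote(encoded)  # reverses the encoding

# Hypothetical search address built from the encoded keyword.
search_url = "https://example.com/search?q=" + encoded

# urljoin resolves a relative address against a base address.
base = "https://example.com/list/index.html"
same_folder = urljoin(base, "detail/1.html")  # relative to /list/
from_root = urljoin(base, "/detail/1.html")   # leading slash: from the site root

print(search_url)
print(same_folder)
print(from_root)
```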

Crawling web data

urlopen

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://finance.naver.com/marketindex/"
response = urlopen(url)
# response or res is often used as the variable name instead of page
response.status     # HTTP status code (200 : success), checks that the server responded normally
soup = BeautifulSoup(response, "html.parser")
print(soup.prettify())

Two request modules (same purpose)
1 ) requests
2 ) urllib.request.Request

import requests
# the requests module
# from urllib.request import Request  : same purpose, different module
from bs4 import BeautifulSoup


url = "https://finance.naver.com/marketindex/"
response = requests.get(url)
# requests works by sending a request and receiving a response
# requests.get() and requests.post() are the two most commonly used calls
# response.text (str) or response.content (bytes) gives the body directly
soup = BeautifulSoup(response.text, "html.parser")  # BeautifulSoup(HTML document as a string, parser)
print(soup.prettify()) # show the HTML in a readable form
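
For comparison, this is roughly how the `urllib.request.Request` route mentioned above might look. It is a sketch, not the notes' own code: the helper names `build_request` and `fetch` and the `User-Agent` value are assumptions, and `fetch` is only defined here, not called, since it needs network access.

```python
from urllib.request import Request, urlopen


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a Request object that carries a browser-like User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Send the request and return (status code, raw body bytes)."""
    with urlopen(build_request(url)) as response:
        return response.status, response.read()


req = build_request("https://finance.naver.com/marketindex/")
print(req.full_url, req.get_method())
```

The bytes returned by `fetch` can be passed to `BeautifulSoup(..., "html.parser")` in the same way as `response.text` from `requests`.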

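The list data type

append : adds one item to the end. The item can itself be a list.
extend : adds multiple items
pop : removes the last item (and returns it)
remove : removes the first item equal to the given value
insert : inserts at the given position. list.insert(index, value)
isinstance : checks whether an object is a list, e.g. isinstance(x, list)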
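
The list methods above, run in order on a small made-up list (the values are placeholders for illustration):

```python
items = ["A", "B"]

items.append("C")          # one item at the end
items.extend(["D", "E"])   # several items at the end
last = items.pop()         # removes and returns the last item
items.remove("B")          # removes the first item equal to "B"
items.insert(0, "Z")       # inserts "Z" at index 0
items.append(["X", "Y"])   # append adds the list itself as a single item

is_list = isinstance(items, list)
nested_is_list = isinstance(items[-1], list)

print(items, last)
```

Note the difference between the last `append` and `extend`: `append` nests the list as one element, while `extend` would add "X" and "Y" as two separate elements.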

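Miscellaneous

Jupyter Notebook tips
Esc, then m : switch the cell to Markdown
b : add a cell below
HTTP status code
Checks that the server responded normally (200 : success)

# three ways (for a urlopen response)
response.getcode()  # deprecated since Python 3.9 in favor of status
response.code
response.status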

Checking from the command prompt whether a module is installed
pip list | grep module_name
Topics to look up further
find_all vs select
requests vs urllib.request
