(Package) BeautifulSoup - Example

์ž„๊ฒฝ๋ฏผยท2023๋…„ 10์›” 19์ผ

Example 1-1. Naver Finance


๋ชฉํ‘œ

  • ๊ธˆ์•ก์— ํ•ด๋‹นํ•˜๋Š” ๊ธˆ์œต ๋ฐ์ดํ„ฐ 12๊ฐœ๋ฅผ ๊ฐ€์ ธ์˜ค๋Š” ๊ฒƒ
  • ๋งํฌ : https://finance.naver.com/marketindex/

  • ํฌ๋กฌ( Chrome) ๊ฐœ๋ฐœ์ž ๋„๊ตฌ ํ™œ์šฉ
    • ํฌ๋กฌ ์„ค์ • - ๋„๊ตฌ ๋”๋ณด๊ธฐ - ๊ฐœ๋ฐœ์ž ๋„๊ตฌ

  • ์•„์ด์ฝ˜ ์„ ํƒ

  • ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ ๋ถ€๋ถ„ ์„ ํƒ
    • ์„ ํƒํ•˜๋ฉด ์„ ํƒํ•œ ํƒœ๊ทธ์— ํ•ด๋‹นํ•˜๋Š” ์ฝ”๋“œ๋กœ ์ด๋™
    • ๊ธฐ์–ตํ•  ๊ฒƒ : **<span class = โ€˜valueโ€™>**


  • Module Load
# import 
from urllib.request import urlopen 
from bs4 import BeautifulSoup
  • ๋ณด๊ณ ์ž ํ•˜๋Š” ํŽ˜์ด์ง€ ํ˜ธ์ถœ
url = "https://finance.naver.com/marketindex/"
# page = urlopen(url)
response = urlopen(url)
response
soup = BeautiefulSoup(page, "html.parser")
print(soup.prettify())

ํ™˜์œจ ๊ฐ€๊ฒฉ(<span class = "value">) ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

# 1 
soup.find_all("span", "value"), len(soup.find_all("span", "value"))

  • ๋ณด๊ณ ์ž ํ•˜๋Š” ์ง€์—ญ์— ํ•ด๋‹นํ•˜๋Š” ๊ฐ’(Value ) ํ˜ธ์ถœ
soup.find_all("span", {"class":"value"})[0].text, soup.find_all("span", {"class":"value"})[0].string, soup.find_all("span", {"class":"value"})[0].get_text()

Output : ('1,171.10', '1,171.10', '1,171.10')
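For context, a minimal local sketch (sample HTML, not part of the original post) of why the three accessors agreed above: the tag holds a single text node. They diverge as soon as a tag has several children.

```python
from bs4 import BeautifulSoup

html = '<div><span class="value">1,171.10</span><p>a<b>b</b></p></div>'
soup = BeautifulSoup(html, "html.parser")

# Single text child: .text, .string, and .get_text() all agree.
span = soup.find("span", "value")
print(span.text, span.string, span.get_text())  # 1,171.10 1,171.10 1,171.10

# Multiple children: .string becomes None, while .text / .get_text()
# concatenate every descendant string.
p = soup.find("p")
print(p.text)    # ab
print(p.string)  # None
```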



Example 1-2. Naver Finance


Summary

  • !pip install requests
  • find, find_all
  • select, select_one
  • find, select_one: select a single element
  • select, find_all: select multiple elements
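The summary above can be sketched on a small sample snippet (hypothetical HTML modeled on the page structure, not the live site):

```python
from bs4 import BeautifulSoup

html = '''
<ul id="exchangeList">
  <li class="on"><span class="value">1,349.40</span></li>
  <li><span class="value">917.97</span></li>
</ul>
'''
soup = BeautifulSoup(html, "html.parser")

# find / select_one: return the FIRST match (a single Tag)
first_find = soup.find("span", "value")
first_css = soup.select_one("#exchangeList .value")

# find_all / select: return EVERY match (a list of Tags)
all_find = soup.find_all("span", "value")
all_css = soup.select("#exchangeList > li > span.value")

print(first_find.text, first_css.text)  # same element either way
print(len(all_find), len(all_css))      # 2 2
```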
import requests # send a request, receive the response

# from urllib.request.Request
from bs4 import BeautifulSoup
url = "https://finance.naver.com/marketindex/"
response = requests.get(url)
# requests.get(), requests.post()
# response.text
soup = BeautifulSoup(response.text, "html.parser") 
print(soup.prettify())

  • ๋ณ€์ˆ˜๋ช….status : ํฌ๋กค๋ง์„ ์ง„ํ–‰ํ–ˆ์„ ๋•Œ, ์ •์ƒ์ ์œผ๋กœ ์š”์ฒญํ•˜๊ณ  ์‘๋‹ต๋ฐ›์•˜๋Š”์ง€๋ฅผ ํ™•์ธ์‹œ์ผœ์ฃผ๋Š” ์ˆซ์ž
url = "https://finance.naver.com/marketindex/"
response = requests.get(url)
response.status
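A no-network sketch of what that status number means. Note the naming difference between the two libraries used in this post: a requests Response exposes the number as `status_code`, while urllib's urlopen response exposes it as `.status`.

```python
from http import HTTPStatus

def is_success(status_code: int) -> bool:
    """Any 2xx code means the request/response cycle completed normally."""
    return 200 <= status_code < 300

print(int(HTTPStatus.OK), is_success(HTTPStatus.OK))                # 200 True
print(int(HTTPStatus.NOT_FOUND), is_success(HTTPStatus.NOT_FOUND))  # 404 False
```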
# soup.find_all("li", "on")
# id => # 
# class => . 
exchangeList = soup.select("#exchangeList > li")
len(exchangeList), exchangeList

title = exchangeList[0].select_one(".h_lst").text
exchange = exchangeList[0].select_one(".value").text
change = exchangeList[0].select_one(".change").text
updown = exchangeList[0].select_one(".head_info.point_up > .blind").text
# link 

title, exchange, change, updown

Output : ('๋ฏธ๊ตญ USD', '1,349.40', '4.40', '์ƒ์Šน')

  • 4๊ฐœ์˜ ๊ฐ’(Value)์— ํ•ด๋‹นํ•˜๋Š” ์ •๋ณด๋ฅผ ๊ฐ€์ ธ์˜ด
    • find_all
    • select๋ฅผ ์‚ฌ์šฉํ•  ๊ฒฝ์šฐ ์ƒ/ํ•˜์œ„ ์ด๋™์ด ์ข€ ๋” ์ž์œ ๋กœ์›€
findmethod = soup.find_all("ul", id="exchangeList")
findmethod[0].find_all("span", "value")
  • ์ฃผ์†Œ๊ฐ’ ํ˜ธ์ถœ
    • exchangeList[0].select_one("a").get("href")์˜ output /marketindex/exchangeDetail.naver?marketindexCd=FX_USDKRW ๋„ค์ด๋ฒ„ ๊ธˆ์œต์—์„œ ๋ณด๊ธฐ ๋•Œ๋ฌธ์— ์ฃผ์†Œ ์ถ”๊ฐ€ ํ•„์š”
baseUrl = "https://finance.naver.com"
baseUrl + exchangeList[0].select_one("a").get("href")

Output : 'https://finance.naver.com/marketindex/exchangeDetail.naver?marketindexCd=FX_USDKRW'
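Plain string concatenation works here; the stdlib `urljoin` is a slightly safer sketch since it also handles trailing slashes and absolute paths correctly:

```python
from urllib.parse import urljoin

baseUrl = "https://finance.naver.com"
href = "/marketindex/exchangeDetail.naver?marketindexCd=FX_USDKRW"

# urljoin resolves the relative href against the base address
print(urljoin(baseUrl, href))
# https://finance.naver.com/marketindex/exchangeDetail.naver?marketindexCd=FX_USDKRW
```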

  • ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘
    • exchange_datas ์ €์žฅ ํ˜•ํƒœ : dictionary

import pandas as pd
# 4๊ฐœ ๋ฐ์ดํ„ฐ ์ˆ˜์ง‘ 

exchange_datas = [] 
baseUrl = "https://finance.naver.com"

for item in exchangeList:
    data = {
        "title": item.select_one(".h_lst").text,
        "exchange": item.select_one(".value").text,
        "change": item.select_one(".change").text,
        "updown": item.select_one(".head_info.point_up > .blind").text,
        "link": baseUrl + item.select_one("a").get("href")
    }
    print(data)
    exchange_datas.append(data)
df = pd.DataFrame(exchange_datas)
df.to_excel("./naverfinance.xlsx")

  • ์ €์žฅ ์—‘์…€ ํŒŒ์ผ



Example 2-1. Fetching Wikipedia article information


๋ชฉํ‘œ

import urllib
from urllib.request import urlopen, Request

html = "https://ko.wikipedia.org/wiki/{search_words}"
# https://ko.wikipedia.org/wiki/์—ฌ๋ช…์˜_๋ˆˆ๋™์ž
req = Request(html.format(search_words=urllib.parse.quote("여명의_눈동자"))) # percent-encode the Korean title for the URL
response = urlopen(req)
soup = BeautifulSoup(response, "html.parser")
print(soup.prettify())
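The encoding step can be checked on its own (a small sketch): Korean characters are not legal in a URL, so `quote()` percent-encodes their UTF-8 bytes and `unquote()` reverses it.

```python
import urllib.parse

title = "여명의_눈동자"
encoded = urllib.parse.quote(title)
print(encoded)  # percent-encoded form; the underscore passes through unchanged

# The round trip recovers the original title
assert urllib.parse.unquote(encoded) == title
```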

n = 0 

for each in soup.find_all("ul"):
    print("=>" + str(n) + "========================")
    print(each.get_text())
    n += 1

soup.find_all("ul")[15].text.strip().replace("\xa0", "").replace("\n", "")

Output : '์ฑ„์‹œ๋ผ: ์œค์—ฌ์˜ฅ ์—ญ (์•„์—ญ: ๊น€๋ฏผ์ •)๋ฐ•์ƒ์›: ์žฅํ•˜๋ฆผ(ํ•˜๋ฆฌ๋ชจํ†  ๋‚˜์ธ ์˜ค) ์—ญ (์•„์—ญ: ๊น€ํƒœ์ง„)์ตœ์žฌ์„ฑ: ์ตœ๋Œ€์น˜(์‚ฌ์นด์ด) ์—ญ (์•„์—ญ: ์žฅ๋•์ˆ˜)โ€™

0๊ฐœ์˜ ๋Œ“๊ธ€