🕷 Data Crawling 1

Gyeomii · July 14, 2022

์›น ํฌ๋กค๋ง(Web Scraping)

  • ์ปดํ“จํ„ฐ ์†Œํ”„ํŠธ์›จ์–ด ๊ธฐ์ˆ ๋กœ ์›น ์‚ฌ์ดํŠธ๋“ค์—์„œ ์›ํ•˜๋Š” ์ •๋ณด๋ฅผ ์ถ”์ถœํ•˜๋Š” ๊ฒƒ
    • ์›น์€ ๊ธฐ๋ณธ์ ์œผ๋กœ HTMLํ˜•ํƒœ(์–ด๋–ค ์ •ํ˜•ํ™”๋œ ํ˜•ํƒœ)๋กœ ๋˜์–ด ์žˆ๋‹ค.
    • HTML์„ ๋ถ„์„ํ•ด์„œ ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” ์ •๋ณด๋“ค๋งŒ ๋ฝ‘์•„์˜ค๋Š” ๊ฒƒ
  • ์™ธ๊ตญ์—์„  'Web Crawling'๋ณด๋‹ค๋Š” 'Web Scraping'์ด๋ผ๋Š” ์šฉ์–ด๋ฅผ ๋” ์ž์ฃผ ์‚ฌ์šฉํ•จ
  • Python์œผ๋กœ ํฌ๋กค๋ง ํ•˜๋Š” ์†Œ์Šค๋“ค์ด ๊ฐ€์žฅ ํ”ํ•˜๋‹ค
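The idea above ("parse the HTML, keep only what you want") can be sketched without touching the network at all, by feeding BeautifulSoup an inline HTML string. The snippet below is made up for illustration; it is not the EMPLIST page itself.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment standing in for a real page.
html = """
<html><body>
  <h1>EMPLIST</h1>
  <p class="name">Hong Gildong</p>
  <p class="addr">Daejeon</p>
</body></html>
"""

# Parse the HTML, then use CSS selectors to extract only what we want.
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("h1").text)      # EMPLIST
print(soup.select_one("p.name").text)  # Hong Gildong
```

Everything else in this post is this same pattern, just with the HTML fetched over HTTP first.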

์‹œ๋„ํ•ด๋ณด๊ธฐ

์ง์ ‘๋งŒ๋“  EMPLIST์‚ฌ์ดํŠธ๋ฅผ crawling ํ•ด๋ณด์ž

  • ์„ค์น˜ํ•˜๋Š”๋ฐฉ๋ฒ•

  • Code
import requests

URL = "http://127.0.0.1:5000/"  # the locally hosted EMPLIST site
resp = requests.get(URL)        # send an HTTP GET request
print(resp.status_code)         # 200 if the request succeeded
print(resp.text)                # the response body (the page's HTML)
  • Result

  • The same HTML you get from the browser's "View Page Source" is printed.

ํ…Œ์ด๋ธ” ๋ฐ์ดํ„ฐ์—์„œ ํ•„์š”ํ•œ ๊ฒƒ๋งŒ ๊ฐ€์ ธ์˜ค๊ธฐ

  • Code
import requests
from bs4 import BeautifulSoup

url = "http://127.0.0.1:5000/"

response = requests.get(url)

if response.status_code == 200:
    html = response.text
    soup = BeautifulSoup(html, 'html.parser')
    trArr = soup.select('tr')  # fetch every <tr> tag as a list
    for idx, tr in enumerate(trArr):
        if idx > 0:  # skip the header row
            tdArr = tr.select('td')  # fetch the <td> tags inside this row as a list
            print(idx, tdArr[1].text, tdArr[3].text)  # the columns holding the name and the address

else:
    print(response.status_code)
  • Result

  • Only the names and addresses were pulled out of the table on the original site.
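Instead of only printing, the same tr/td walk can collect the rows into structured records (e.g., a list of dicts) for later use. The inline table below is a made-up stand-in for the EMPLIST page, with the same assumed column order (name in column 1, address in column 3):

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the EMPLIST table; the real page's columns may differ.
html = """
<table>
  <tr><th>No</th><th>Name</th><th>Dept</th><th>Addr</th></tr>
  <tr><td>1</td><td>Kim</td><td>Sales</td><td>Daejeon</td></tr>
  <tr><td>2</td><td>Lee</td><td>HR</td><td>Seoul</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.select("tr")[1:]:  # skip the header row
    tds = [td.text for td in tr.select("td")]
    rows.append({"name": tds[1], "addr": tds[3]})

print(rows)  # [{'name': 'Kim', 'addr': 'Daejeon'}, {'name': 'Lee', 'addr': 'Seoul'}]
```

From here the records could be written to CSV or a database rather than just printed.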