Class Day 57: Python Web Scraping

유동우 · December 11, 2022

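■ Web scraping // collecting data from static pages

• A technique that has a computer automatically extract and collect the desired content from a web page
• In practice, it means reading the content or the attribute values of the HTML tags that make up the page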

■ Web crawling // collecting data from dynamic pages
• An automated bot called a web crawler browses multiple web pages according to predefined rules
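
To make "predefined rules" concrete, here is a minimal sketch of a crawler's link-following loop. It runs against a small in-memory dictionary instead of the real web; the `PAGES` data, the `example.com` URLs, and the `crawl` and `LinkParser` names are all made up for illustration. The rule assumed here is: visit each page once, breadth-first, and never leave the starting host.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical site: URL -> HTML. A real crawler would download these pages.
PAGES = {
    'http://example.com/': '<a href="/a">A</a> <a href="/b">B</a>',
    'http://example.com/a': '<a href="/b">B</a> <a href="http://other.com/">out</a>',
    'http://example.com/b': '<a href="/">home</a>',
}


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


def crawl(start_url):
    """Visit pages breadth-first, once each, staying on the start host."""
    host = urlparse(start_url).netloc
    visited = []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in visited or url not in PAGES:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(PAGES[url])
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == host:
                queue.append(absolute)
    return visited


print(crawl('http://example.com/'))
```

The link to `other.com` is dropped by the host check, and the link back to the home page is skipped because it has already been visited.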

■ Python web scraping libraries
• BeautifulSoup
• scrapy

■ Tools used in this class

Static web crawling: BeautifulSoup
Dynamic web crawling: Selenium

See the PDF file for details.

■ CSS selectors

• id selector // a unique value (one element per page)

• class selector // like an occupation (many elements can share it)
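
A small sketch of the difference, using BeautifulSoup's `select_one()` and `select()` methods, which accept CSS selectors. The HTML snippet and the names in it are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one element has an id, several share a class
html = """
<ul>
  <li id="leader" class="engineer">Kim</li>
  <li class="engineer">Lee</li>
  <li class="designer">Park</li>
</ul>
"""

soup = BeautifulSoup(html, 'html.parser')

# '#leader' matches the single element whose id is "leader"
leader = soup.select_one('#leader')
print(leader.text)

# '.engineer' matches every element that carries that class
engineers = [li.text for li in soup.select('.engineer')]
print(engineers)
```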

■ JavaScript

• HTML: the structural skeleton of the page
• CSS: the visual design of the page
• JS: event handling and dynamic manipulation of HTML and CSS
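
This split matters for scraping: an HTML parser reads the markup exactly as it was delivered and never runs the JavaScript, so content that a script fills in later is invisible to it. The sketch below shows this on a made-up page; the `price` id and the script are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

# Made-up page: the <div> is empty in the markup and is only filled
# in when a browser runs the script.
html = """
<div id="price"></div>
<script>
  document.getElementById('price').textContent = '12,000';
</script>
"""

soup = BeautifulSoup(html, 'html.parser')
price = soup.find('div', {'id': 'price'})

# BeautifulSoup parses the markup but does not execute JavaScript,
# so the div is still empty here.
print(repr(price.text))
```

Pages like this are the case where a browser-driving tool such as Selenium is needed instead of a plain download.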

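find(): searches the parsed document with BeautifulSoup and returns the first HTML tag that matches

find_all(): returns every matching tag as a list
First parameter: tag name
Second parameter: attribute name and attribute value, passed as a dict

text: the text between the start tag and end tag of a tag returned by find()
get(): the value of the specified attribute on a tag returned by find()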
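
The four calls can be tried without any network access by parsing a string. This is a minimal sketch; the HTML and its class names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up markup with two blocks of the same shape
html = """
<div class="quote">
  <span class="text">Stay hungry.</span>
  <a href="/author/1">about</a>
</div>
<div class="quote">
  <span class="text">Stay foolish.</span>
  <a href="/author/2">about</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first match
first = soup.find('div', {'class': 'quote'})
first_text = first.find('span', {'class': 'text'}).text

# find_all() returns every match as a list
quotes = soup.find_all('div', {'class': 'quote'})

# get() reads an attribute value from a tag
links = [q.find('a').get('href') for q in quotes]

# find() returns None when nothing matches
missing = soup.find('table')

print(first_text, len(quotes), links, missing)
```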

============================================

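import urllib.request
from bs4 import BeautifulSoup as bs

print('-' * 5, 'Connecting to the web with Python', '-' * 5)

url = 'https://www.naver.com'
res = urllib.request.urlopen(url)

print(type(res))     # the response object type
print(res.status)    # HTTP status code, e.g. 200
print(res.version)   # HTTP protocol version, e.g. 11 for HTTP/1.1
print(res.msg)       # reason phrase, e.g. OK

res_header = res.getheaders()
print('[header info] -----')

# Each header is a (name, value) tuple
for s in res_header:
    print(s)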

=========================

import urllib.request as ur
from bs4 import BeautifulSoup as bs

print('-' * 5, 'Web crawling with BeautifulSoup', '-' * 5)

url = 'https://quotes.toscrape.com/'
html = ur.urlopen(url)

if html.status != 200:
    print('Could not connect to the site.')
    exit()

# print(html.read())
# print(html)

# print('-' * 10)
# soup = bs(html.read(), 'html.parser')
# print(soup)

soup = bs(html.read(), 'html.parser')

# Each quote on the page sits in a <div class="quote">
div_list = soup.find_all('div', {'class': 'quote'})

# span_text = div_list[0].find('span', {'class': 'text'})
# print(span_text)

# small_text = div_list[0].find('small')
# print(small_text.text)

for div in div_list:
    text = div.find('span', {'class': 'text'})
    author = div.find('small')
    print(f'Quote : {text.text}')
    print(f'Author : {author.text}')
    print('-' * 20)
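
One caveat about the `status != 200` check: `urllib.request.urlopen` raises `urllib.error.HTTPError` for 4xx and 5xx responses instead of returning them, so that branch is rarely reached. A hedged sketch of a helper that catches the exceptions instead follows; the `fetch_html` name and the 10-second timeout are assumptions, not part of the class material.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as res:
            return res.read()
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f'HTTP error {e.code}: {url}')
    except urllib.error.URLError as e:
        # No usable response at all (DNS failure, refused connection, ...)
        print(f'Connection failed ({e.reason}): {url}')
    return None
```

A caller would then write `html = fetch_html(url)` and stop if `html is None`, before handing the bytes to BeautifulSoup.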

==================================

import urllib.request as ur
from bs4 import BeautifulSoup as bs

url = 'https://news.daum.net/'
html = ur.urlopen(url)

if html.status != 200:
    print('Could not connect to the specified site.')
    quit()

soup = bs(html.read(), 'html.parser')

# Select the highest-level parent tag that contains the data to collect
# ul_list = soup.find_all('ul', {'class': 'list_newsissue'})

# Check how many matching tags were found
# print(len(ul_list))

# ul = soup.find('ul', {'class': 'list_newsissue'})
# div_list = ul.find_all('div', {'class': 'cont_thumb'})
# print(len(div_list))

# a_list = div_list[0].find_all('a', {'class': 'link_txt'})
# print(f'number of a tags : {len(a_list)}')

# a = div_list[0].find('a', {'class': 'link_txt'})
# print(a.text.strip())

ul = soup.find('ul', {'class': 'list_newsissue'})

div_list = ul.find_all('div', {'class': 'cont_thumb'})

for div in div_list:
    a = div.find('a', {'class': 'link_txt'})
    title = a.text.strip()
    print(title)
    print('-' * 20)
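
The narrowing pattern used here (find the outer container first, then the items inside it) fails with an AttributeError when the container is missing, because `find()` returns None. Below is a minimal sketch that guards against that. It runs on made-up markup that reuses the class names from the code above; the `extract_titles` helper is hypothetical.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return the link texts inside ul.list_newsissue, or [] if absent."""
    soup = BeautifulSoup(html, 'html.parser')
    ul = soup.find('ul', {'class': 'list_newsissue'})
    if ul is None:  # layout changed, or a different page came back
        return []
    titles = []
    for div in ul.find_all('div', {'class': 'cont_thumb'}):
        a = div.find('a', {'class': 'link_txt'})
        if a is not None:  # skip items that have no title link
            titles.append(a.text.strip())
    return titles


# Made-up sample: the second item has no link and is skipped
sample = """
<ul class="list_newsissue">
  <li><div class="cont_thumb"><a class="link_txt" href="/1"> First headline </a></div></li>
  <li><div class="cont_thumb"><span>no link here</span></div></li>
  <li><div class="cont_thumb"><a class="link_txt" href="/3">Third headline</a></div></li>
</ul>
"""

print(extract_titles(sample))
print(extract_titles('<p>empty page</p>'))
```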

=======================================

import urllib.request as ur
from bs4 import BeautifulSoup as bs

# Problem 1) Write a program that crawls and prints the titles of all articles in the '오늘의 연재' (Today's Series) section of the Daum news page

url = 'https://news.daum.net/'
html = ur.urlopen(url)

if html.status != 200:
    quit()

# Parse the full HTML with BeautifulSoup
soup = bs(html.read(), 'html.parser')
# Find every item in the '오늘의 연재' content area
item_todayseries = soup.find_all('div', {'class': 'item_todayseries'})

# From each result, print only the article title part
for items in item_todayseries:
    cont_thumb = items.find('div', {'class': 'cont_thumb'})
    a = cont_thumb.find('a', {'class': 'link_txt'})
    title = a.text.strip()
    link = a.get('href')

    # Open the article page itself
    html_sub = ur.urlopen(link)
    soup_sub = bs(html_sub.read(), 'html.parser')

    article_view = soup_sub.find('div', {'class': 'article_view'})

    p_list = article_view.find_all('p', {'dmcf-ptype': 'general'})

    # Keep the first three paragraphs of the article body
    news = ''
    for p in p_list[:3]:
        news += p.text.strip() + '\n'

    print(f'Article title : {title}')
    print(f'Article link : {link}')
    print(f'Article body : {news}')

    print('-' * 20)

print('\n' + '-' * 5, 'Visiting a sub-page', '-' * 5, '\n')

cont_thumb = item_todayseries[0].find('div', {'class': 'cont_thumb'})
a = cont_thumb.find('a', {'class': 'link_txt'})
title = a.text.strip()
link = a.get('href')

print(f'Title : {title}')
print(f'Article URL : {link}')

html2 = ur.urlopen(link)
soup2 = bs(html2.read(), 'html.parser')

print(soup2)

div_list = soup2.find_all('div', {'class': 'article_view'})
print(len(div_list))

=============================================

# Problem 3) Write a program that fetches each news title and link from the Daum news page, along with the first 3 lines of the article
# Save the collected content to a file
# File name: news.txt
import urllib.request as ur
from bs4 import BeautifulSoup as bs

url = 'https://news.daum.net/'
html = ur.urlopen(url)

if html.status != 200:
    print('Could not connect to the specified site.')
    quit()

soup = bs(html.read(), 'html.parser')

ul = soup.find('ul', {'class': 'list_newsissue'})
div_list = ul.find_all('div', {'class': 'cont_thumb'})

news_list = []

for div in div_list:
    a = div.find('a', {'class': 'link_txt'})
    title = a.text.strip()
    link = a.get('href')

    html_sub = ur.urlopen(link)
    soup_sub = bs(html_sub.read(), 'html.parser')

    article_view = soup_sub.find('div', {'class': 'article_view'})
    p_list = article_view.find_all('p', {'dmcf-ptype': 'general'})

    news = ''

    if len(p_list) > 0:
        # Accumulate the first three paragraphs
        for p in p_list[:3]:
            news += p.text.strip() + '\n'
    else:
        # Fall back to the first <p> inside <div dmcf-ptype="general">
        body_div = article_view.find('div', {'dmcf-ptype': 'general'})

        p = body_div.find('p')
        news += p.text.strip() + '\n'

    print(f'Article title : {title}')
    print(f'Article link : {link}')
    print(f'Article body : {news}')

    news_list.append(title + '\n' + link + '\n' + news + '\n\n')

    print('-' * 20)

# The with block closes the file even if writing fails
with open('news.txt', 'w', encoding='utf-8') as f:
    f.writelines(news_list)
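
The file-saving step can be tried on its own, without scraping anything. This is a minimal sketch under stated assumptions: the records are made up, and the file goes into a temporary directory so nothing in the working folder is overwritten.

```python
import os
import tempfile

# Made-up records standing in for scraped (title, link, body) values
articles = [
    ('First headline', 'https://example.com/1', 'line 1\nline 2\nline 3'),
    ('Second headline', 'https://example.com/2', 'only one line'),
]

path = os.path.join(tempfile.mkdtemp(), 'news.txt')

# 'with' closes the file automatically, even if a write raises
with open(path, 'w', encoding='utf-8') as f:
    for title, link, body in articles:
        f.write(f'{title}\n{link}\n{body}\n\n')

# Read the file back to confirm what was written
with open(path, encoding='utf-8') as f:
    saved = f.read()

print(saved)
```

Passing `encoding='utf-8'` explicitly matters for Korean headlines, since the default encoding depends on the operating system.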

=========================================

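from selenium import webdriver
from selenium.webdriver.chrome.service import Service
import time

# Selenium 4 takes the driver path through a Service object
# (Selenium 3 accepted it directly: webdriver.Chrome('chromedriver.exe'))
driver = webdriver.Chrome(service=Service('chromedriver.exe'))

driver.get('http://www.naver.com')
time.sleep(3)

driver.get('http://www.daum.net')
time.sleep(3)

driver.quit()

# Choosing the web driver
# webdriver.Chrome(service=Service('full path to the web driver file'))

# get(url) : open the given web site

# close() : close the current tab
# quit() : close the whole web browser

# back() : go back
# forward() : go forward

# window_handles : returns the list of tabs (window handles) open in the browser
# window_handles[index] : the handle of the tab at the given index, starting from 0
# switch_to.window(driver.window_handles[0]) : switch to the first tab

# Getting HTML tags
# Both find_element and find_elements exist: find_element returns the first matching HTML tag, find_elements returns all matching HTML tags as a list

# The find_element_by_* helpers below are the Selenium 3 style. They were removed in Selenium 4.3;
# there, use find_element(By.ID, ...) and so on, with By imported from selenium.webdriver.common.by
# find_element_by_id(id) : search by the given id value
# find_element_by_class_name(class name) : search by the given class value
# find_element_by_tag_name(tag name) : search by the given HTML tag name
# find_element_by_css_selector(css selector) : search by the given CSS selector
# find_element_by_xpath(xpath) : search by the given XPath expression

# click() : fire a click event on the tag found with find_element

# send_keys('text' | key code) : type text into the tag found with find_element, or send a keyboard key code directly
# clear() : delete the text already entered in the tag found with find_element

# execute_script(JS code) : run a JavaScript command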
