[Data Analysis] Week 1. Colab / BeautifulSoup / Requests

Jake · December 20, 2022

(RPA, Robotic Process Automation)

Day 1. Colab

Colab(Google Colaboratory)

A Jupyter Notebook environment provided by Google

  https://colab.research.google.com/notebooks/welcome.ipynb

Day 2. Scraping

Installing the bs4 and requests libraries

  ```python
  !pip install bs4 requests
  ```

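pip: the package management system used to install and manage software packages written in Python
bs4 (Beautiful Soup)
- A widely used Python package (library) for web scraping
- Parses the markup of the HTML and XML documents that make up web pages (JSON is handled separately, for example with Python's json module)
requests
- An HTTP library for Python
- A module that sends HTTP requests to a given website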
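
To see what Beautiful Soup does on its own, here is a minimal sketch that parses a small HTML string without any network access. The markup in `sample_html` is made up for illustration and does not come from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration (not taken from a real site).
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p class="intro">Hello, <b>world</b>!</p>
    <a href="https://example.com/a">First link</a>
    <a href="https://example.com/b">Second link</a>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

page_title = soup.title.text                         # text inside <title>
intro_text = soup.select_one('p.intro').get_text()   # text of the tag and its children
links = [a['href'] for a in soup.select('a')]        # href attribute of every <a>

print(page_title)
print(intro_text)
print(links)
```

`select_one` returns the first match for a CSS selector and `select` returns a list of all matches; both are used in the crawling code later in this post.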

Basic crawling code
- Fetch the page with requests → turn it into a form that is easy to analyze with BeautifulSoup.

	import requests
	from bs4 import BeautifulSoup

	# Send a browser-like User-Agent header along with the request
	headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
	data = requests.get('https://search.naver.com/search.naver?where=news&ie=utf8&sm=nws_hty&query=삼성전자', headers=headers)

	# Parse the response HTML with Python's built-in parser
	soup = BeautifulSoup(data.text, 'html.parser')

Crawling multiple items
- Understand the structure of the web document that holds the information → find the list → find the class

	import requests
	from bs4 import BeautifulSoup

	headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
	data = requests.get('https://search.naver.com/search.naver?where=news&ie=utf8&sm=nws_hty&query=삼성전자', headers=headers)

	soup = BeautifulSoup(data.text, 'html.parser')

	lis = soup.select('#main_pack > section > div > div.group_news > ul > li')  # get the list of news items
	# a = lis[0].select_one('a.news_tit')  # store the element whose class is news_tit
	# print(a.text)
	for li in lis:  # loop over every news item and print it
		a = li.select_one('a.news_tit')
		if a is not None:  # skip items without a headline link
			print(a.text, a['href'])
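
The "structure → list → class" steps can also be tried offline. The sketch below runs the same selector against a hand-written HTML string; the markup only imitates the nesting that the selector expects and is an assumption, since the real page can differ or change at any time. `extract_headlines` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that only imitates the nesting used by the selector;
# it is not copied from the real search result page.
sample_html = """
<div id="main_pack">
  <section>
    <div>
      <div class="group_news">
        <ul>
          <li><a class="news_tit" href="https://example.com/1">First headline</a></li>
          <li><a class="news_tit" href="https://example.com/2">Second headline</a></li>
          <li><span>An item without a headline link</span></li>
        </ul>
      </div>
    </div>
  </section>
</div>
"""

def extract_headlines(html):
    """Return (title, url) pairs for every list item that has an a.news_tit link."""
    soup = BeautifulSoup(html, 'html.parser')
    # Step 1 and 2: walk the document structure down to the list items
    items = soup.select('#main_pack > section > div > div.group_news > ul > li')
    results = []
    for li in items:
        # Step 3: pick the element by its class
        a = li.select_one('a.news_tit')
        if a is None:  # select_one returns None when nothing matches
            continue
        results.append((a.text, a['href']))
    return results

headlines = extract_headlines(sample_html)
for title, url in headlines:
    print(title, url)
```

Returning a list instead of printing inside the loop makes it easier to reuse the results later, for example to save them to a file.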

Searching for different keywords

	import requests
	from bs4 import BeautifulSoup

	def get_news(keyword):
		headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
		data = requests.get(f'https://search.naver.com/search.naver?where=news&ie=utf8&sm=nws_hty&query={keyword}', headers=headers)

		soup = BeautifulSoup(data.text, 'html.parser')

		lis = soup.select('#main_pack > section > div > div.group_news > ul > li')  # get the list of news items

		for li in lis:
			a = li.select_one('a.news_tit')
			if a is not None:  # skip items without a headline link
				print(a.text, a['href'])

	get_news(input("Search keyword: "))
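
One thing to watch when the keyword goes straight into an f-string: a keyword containing `&` or spaces would change the meaning of the query string. A minimal sketch of building the URL with the standard library's `urlencode` follows; `BASE_URL` and `build_search_url` are hypothetical names introduced here, and the keyword is just an example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = 'https://search.naver.com/search.naver'

def build_search_url(keyword):
    """Build the news search URL with the keyword safely percent-encoded."""
    params = {'where': 'news', 'ie': 'utf8', 'sm': 'nws_hty', 'query': keyword}
    return f'{BASE_URL}?{urlencode(params)}'

url = build_search_url('R&D 투자')
print(url)

# Decoding the query string gives back the original keyword unchanged.
decoded = parse_qs(urlsplit(url).query)
print(decoded['query'])
```

`requests.get` also accepts a `params=` dictionary and performs this encoding itself, so passing the query parts that way is another option.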
