220517_python_Crawling, BeautifulSoup, Newspaper3k

juyeon · May 17, 2022

dataframe

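import pandas as pd

# df and df2 are assumed to be existing DataFrames

# head() prints the first 5 rows by default
df.head()

# tail() prints the last 5 rows by default
df.tail()

# sample(n) prints n randomly chosen rows (here, 2)
df.sample(2)

# Concatenate two DataFrames
total_df = pd.concat([df, df2])
total_df

# Save the DataFrame to a CSV file
total_df.to_csv('data.csv', index=False)

# Read the CSV file back into a DataFrame
new_df = pd.read_csv('data.csv')
new_df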
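
To try the calls above without any existing data, here is a minimal sketch. The two small DataFrames and their column names are made up for illustration, and the CSV is written to an in-memory buffer instead of data.csv so nothing is left on disk.

```python
import io

import pandas as pd

# Hypothetical sample data, only for illustration
df = pd.DataFrame({"name": ["a", "b", "c"], "price": [100, 200, 300]})
df2 = pd.DataFrame({"name": ["d", "e"], "price": [400, 500]})

# ignore_index=True renumbers the rows 0..n-1 after concatenation
total_df = pd.concat([df, df2], ignore_index=True)

# Write the CSV into an in-memory buffer, then read it back
buffer = io.StringIO()
total_df.to_csv(buffer, index=False)
buffer.seek(0)
new_df = pd.read_csv(buffer)

print(total_df.head(2))    # first 2 rows
print(total_df.tail(2))    # last 2 rows
print(total_df.sample(2))  # 2 random rows, different on each run
print(new_df)
```

Without ignore_index=True, pd.concat keeps each frame's original index, so the row labels would repeat (0, 1, 2, 0, 1).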

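Crawling

Web

: made up of HTML

HT - HyperText: documents are connected to one another by links.
M - Markup: the document is made up of tags.
L - Language

Tags
Selectors

BeautifulSoup

: a Python package for crawling; it parses the HTML of a page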

# !pip install <package name>
!pip install beautifulsoup4

# Import the BeautifulSoup class from the bs4 package
from bs4 import BeautifulSoup

# Store the HTML document in the string html
html = '''
<html> 
    <head> 
    </head> 
    <body> 
        <h1> Cart
            <p id='clothes' class='name' title='Round-neck T-shirt'> Round-neck T-shirt
                <span class = 'number'> 25 </span>
                <span class = 'price'> 29000 </span>
                <span class = 'menu'> Clothing</span>
                <a href = 'http://www.naver.com'> Go to page </a>
            </p>
            <p id='watch' class='name' title='Watch'> Watch
                <span class = 'number'> 28 </span>
                <span class = 'price'> 32000 </span>
                <span class = 'menu'> Accessories </span>
                <a href = 'http://www.facebook.com'> Go to page </a>
            </p>
            </p> 
        </h1> 
    </body> 
</html>
'''

# Create a BeautifulSoup instance. The second argument is the parser to use.
soup = BeautifulSoup(html, 'html.parser')

soup.something

soup.select('tag') : select by tag name
soup.select('.class') : select by class name
soup.select('#id') : select by ID
soup.select('ancestor descendant') : descendant relationship (every tag nested anywhere inside a tag is its descendant)
soup.select('parent > child') : child relationship (of the tags inside a tag, only those exactly one level below are its children)
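
The selector forms above can be tried on a small page. This is only a sketch: the HTML below is a shortened, made-up variant of the cart example, and the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the cart HTML
html = """
<div id="cart">
  <p id="clothes" class="name" title="T-shirt">T-shirt
    <span class="price">29000</span>
    <a href="http://www.naver.com">Go</a>
  </p>
  <p id="watch" class="name" title="Watch">Watch
    <span class="price">32000</span>
    <a href="http://www.facebook.com">Go</a>
  </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("p")              # every <p> tag
by_class = soup.select(".price")       # every tag with class="price"
by_id = soup.select("#watch")          # the tag with id="watch"
descendants = soup.select("div span")  # <span> anywhere inside <div>
children = soup.select("div > span")   # <span> directly under <div>

# select always returns a list, even for a single match or no match
print(len(by_tag), len(by_class), len(by_id))
print(len(descendants), len(children))
```

Here the spans sit inside the p tags, so they count as descendants of the div but not as its children.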

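'soup.tag'
: Returns the tag together with everything up to its closing tag. If the document contains several such tags, only the first one is returned.
For example, the HTML above has two a tags, so soup.a returns only the first one.
soup.tag.get('attribute')
: Returns the value of that attribute. An attribute is anything written in the form aaa = 'bbb'; calling get('aaa') returns the value 'bbb'.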
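
A short sketch of soup.tag and get() follows. The HTML is a shortened, made-up variant of the cart example, and 'target' is just an attribute name that happens not to exist on the tag.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the cart HTML
html = """
<p id='clothes' class='name' title='T-shirt'> T-shirt
    <a href='http://www.naver.com'> Go </a>
</p>
<p id='watch' class='name' title='Watch'> Watch
    <a href='http://www.facebook.com'> Go </a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a             # only the first <a>, although there are two
href = soup.a.get("href")       # value of the href attribute
title = soup.p.get("title")     # value of the title attribute on the first <p>
classes = soup.p.get("class")   # class can hold several values, so this is a list
missing = soup.a.get("target")  # get() returns None for an absent attribute

# To reach every <a>, fall back to select and loop over the list
all_hrefs = [a.get("href") for a in soup.select("a")]

print(first_link)
print(href, title, classes, missing)
print(all_hrefs)
```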

Newspaper3k

: a package for crawling news data

Crawling news data

-> To be rewritten once I understand it better

Week 1 homework

: 'Scrape the songs ranked 1 to 50 on Genie Music.'
https://www.genie.co.kr/chart/top200?ditc=D&ymd=20211103&hh=13&rtm=N&pg=1

# Hint:

Use the Chrome developer tools mentioned in class (Windows: F12, Mac: CMD + Option + I) to find where the rank, title, and artist elements sit in the page.

Once you know where they are, make good use of functions such as soup.select to print the elements.

→ First fetch all the <tr> tags that wrap the song information with soup.select, then use a for loop to print the songs one at a time.

The rank and song title will not come out cleanly: there may be whitespace around them, or other text may show up too. Take a close look at Python's built-in string method `strip()`.

**(If strip is too hard, it is fine to submit the untidy output for now.)**
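
As a rough illustration of the cleanup the hint is pointing at, here is a small sketch. The raw strings are made up to imitate untidy cell text and are not copied from the actual page.

```python
# Hypothetical raw text, imitating what a table cell might return
raw_rank = "1\n                        \n    3 up"
raw_title = "\n      Some Title\n   "

# strip() removes whitespace (spaces, tabs, newlines) from both ends only
title = raw_title.strip()

# Slicing keeps the first two characters, then strip() drops the newline
rank = raw_rank[0:2].strip()

# Alternative: split on whitespace and take the first token.
# This does not depend on the rank being at most two characters long.
rank_token = raw_rank.split()[0]

print(repr(title), repr(rank), repr(rank_token))
```

Slicing with [0:2] is enough for ranks 1 to 50; the split() variant is one possible choice if the rank could be longer.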

Homework answer:

Using select
: select fetches every tag that matches the condition in the parentheses and returns them in a list (the elements are placed inside []).

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header with the request
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://www.genie.co.kr/chart/top200?ditc=D&ymd=20211103&hh=13&rtm=N&pg=1', headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')
# Fetch the <tr> tags that wrap each song's information
trs = soup.select('#body-content > div.newest-list > div > table > tbody > tr')

for tr in trs:
    # select returns a list, so take the first match with [0]
    title = tr.select('td.info > a.title.ellipsis')[0].text.strip()
    # The rank cell holds extra text, so keep only the first two characters
    rank = tr.select('td.number')[0].text[0:2].strip()
    artist = tr.select('td.info > a.artist.ellipsis')[0].text
    print(rank, title, artist)

Using select_one
: select_one returns only the first of all the tags that match the condition in the parentheses (or None if nothing matches).

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent header with the request
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
data = requests.get('https://www.genie.co.kr/chart/top200?ditc=D&ymd=20211103&hh=13&rtm=N&pg=1', headers=headers)

soup = BeautifulSoup(data.text, 'html.parser')
# Fetch the <tr> tags that wrap each song's information
trs = soup.select('#body-content > div.newest-list > div > table > tbody > tr')

for tr in trs:
    # select_one returns the first matching tag directly, so no [0] is needed
    title = tr.select_one('td.info > a.title.ellipsis').text.strip()
    # The rank cell holds extra text, so keep only the first two characters
    rank = tr.select_one('td.number').text[0:2].strip()
    artist = tr.select_one('td.info > a.artist.ellipsis').text
    print(rank, title, artist)
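
To compare the two approaches without hitting the network, here is an offline sketch. The table markup is invented to loosely resemble the chart rows (it is not the real page source), and the extra row without song info is an assumption added to show why checking for None can help.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely imitating the chart table
sample = """
<table><tbody>
  <tr>
    <td class="number">1
        <span class="rank">up 2</span></td>
    <td class="info">
      <a class="title ellipsis">  Song A  </a>
      <a class="artist ellipsis">Artist A</a>
    </td>
  </tr>
  <tr>
    <td class="number">2
        <span class="rank">same</span></td>
    <td class="info">
      <a class="title ellipsis">  Song B  </a>
      <a class="artist ellipsis">Artist B</a>
    </td>
  </tr>
  <tr><td class="ad">not a song row</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(sample, "html.parser")

rows = []
for tr in soup.select("tbody > tr"):
    title_tag = tr.select_one("td.info > a.title.ellipsis")
    if title_tag is None:
        # select_one found nothing in this row, so skip it.
        # With select(...)[0] the same row would raise an IndexError.
        continue
    title = title_tag.text.strip()
    rank = tr.select_one("td.number").text[0:2].strip()
    artist = tr.select_one("td.info > a.artist.ellipsis").text.strip()
    rows.append((rank, title, artist))

for rank, title, artist in rows:
    print(rank, title, artist)
```

Collecting the results in a list instead of printing right away also makes it easy to pass them to pd.DataFrame later.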