Crawling (AI Study 14)

이유진 · June 27, 2024

AI Crawling colab python

-- 01.BeautifulSoup 기본.ipynb --
Web Crawling

Using BeautifulSoup

Reading the file

# The with block closes the file automatically, even if read() fails
with open("simple.html", "r", encoding="utf-8") as fp:
    html = fp.read()

from google.colab import drive
drive.mount('/content/drive')

html

Using the BeautifulSoup parsing library

It can parse HTML, XML, and similar markup.

from bs4 import BeautifulSoup

dom = BeautifulSoup(html, "html.parser")

This parses the given string html as an HTML document and creates a BeautifulSoup object that represents its Document Object Model (DOM).
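Because simple.html itself is not shown in this post, the same step can be tried on an inline string. This is only a minimal sketch: SAMPLE_HTML and sample_dom are hypothetical names, and the markup is a made-up stand-in for the real file.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for simple.html (the real file is not shown in this post)
SAMPLE_HTML = """
<html>
  <body>
    <h1>Fruits</h1>
    <ul>
      <li class="fruit"> apple </li>
      <li class="fruit"> banana </li>
    </ul>
  </body>
</html>
"""

# Parse the string with Python's built-in parser; no file on disk is needed
sample_dom = BeautifulSoup(SAMPLE_HTML, "html.parser")

print(type(sample_dom).__name__)         # class name of the parsed document object
print(sample_dom.select_one("h1").text)  # text of the first <h1>
```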

# The second argument of the BeautifulSoup constructor is usually html.parser or lxml. For this document both give the same result.
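html.parser ships with Python, while lxml is a separate package that may not be installed. One possible way to prefer lxml and fall back to html.parser is sketched below. The markup and the name parser_dom are illustrative assumptions, not part of the original notebook.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup, used only for illustration
markup = "<p>hello</p>"

try:
    # lxml is an optional third-party parser
    parser_dom = BeautifulSoup(markup, "lxml")
except FeatureNotFound:
    # bs4 raises FeatureNotFound when the requested parser is not installed
    parser_dom = BeautifulSoup(markup, "html.parser")

print(parser_dom.select_one("p").text)
```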

type(dom)

select(), select_one()

#dom.select_one(CSS selector)

Returns the first element matched by the given CSS selector.

dom.select_one("h1")

type(dom.select_one("h1"))

dom.select_one("li")

Even if several elements match the selector, only the first one is returned.

dom.select(".fruit")

select(CSS selector)

Returns a list of all elements matched by the given CSS selector.

If nothing matches, it still returns a list, just an empty one.

dom.select('xxx')

dom.select_one('xxx') # returns None when nothing is found
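The difference between the two "no match" results matters in practice: an empty list can be looped over safely, but calling .text on None raises AttributeError. A small sketch with made-up markup (all names here are illustrative):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
sample_dom = BeautifulSoup('<ul><li class="fruit">apple</li></ul>', "html.parser")

missing_many = sample_dom.select(".vegetable")     # no match: empty list
missing_one = sample_dom.select_one(".vegetable")  # no match: None

# An empty list is safe to loop over; the loop body simply never runs
names = [e.text.strip() for e in missing_many]

# None has no .text, so check before using the result
first_name = missing_one.text.strip() if missing_one is not None else None

print(names, first_name)
```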

len(dom.select('.fruit'))

dom.select('.fruit')[0] # first of the selected elements; the result is a list, so it can be indexed

Element attributes and methods

.text : the content of an element

with the tags removed

dom.select('.fruit')[0].text

When reading data from a web page, strip the leading and trailing whitespace.

dom.select('.fruit')[0].text.strip()

dom.select('.fruit')[1]

dom.select('.fruit')[1].text.strip()

dom.select_one('ul')

dom.select_one('ul').text # the inner texts come back as one chunk
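When the text of a container comes back as one chunk, get_text() with a separator, or the stripped_strings generator, can split it into clean pieces. The markup below is a hypothetical stand-in for the list in simple.html.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
sample_dom = BeautifulSoup(
    '<ul>\n  <li class="fruit"> apple </li>\n  <li class="fruit"> banana </li>\n</ul>',
    "html.parser",
)
ul = sample_dom.select_one("ul")

# strip=True trims each piece of text and drops the empty ones;
# the remaining pieces are joined with the separator
joined = ul.get_text(separator=", ", strip=True)

# stripped_strings yields the same trimmed pieces one at a time
pieces = list(ul.stripped_strings)

print(joined)
print(pieces)
```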

["apple", "banana"]

result = []
for element in dom.select('.fruit'):
    result.append(element.text.strip())

result

Using a comprehension

[e.text.strip() for e in dom.select('.fruit')]

.attrs

Getting attribute information

Get the address and the link name of the links below as a list of dicts

네이버

daum

Expected result:

[
    {"url": "http://www.naver.com", "link": "네이버"},
    {"url": "http://www.daum.net", "link": "daum"},
]

dom.select_one()

items = dom.select("ol li")
items

select() and select_one() can be called again on an element object

items[0].select_one("a")

items[0].select_one("a").attrs # attrs returns a dict

items[0].select_one("a").attrs['href']

items[0].select_one("a").attrs.get('href')
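The two lookups above behave differently when the attribute is missing: indexing the attrs dict raises KeyError, while get() returns None. A minimal sketch, assuming made-up markup in which the second link has no href:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link has no href attribute
sample_dom = BeautifulSoup(
    '<ol><li><a href="http://www.naver.com">네이버</a></li><li><a>daum</a></li></ol>',
    "html.parser",
)
links = sample_dom.select("ol > li > a")

print(links[0].attrs["href"])      # attribute exists
print(links[1].attrs.get("href"))  # attribute missing: get() gives None

try:
    links[1].attrs["href"]         # attribute missing: indexing raises KeyError
except KeyError as err:
    print("KeyError:", err)

# Keep only the links that actually carry an href
urls = [a.attrs["href"] for a in links if "href" in a.attrs]
print(urls)
```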

[
    {
        'url': item.attrs.get('href').strip(),
        'link': item.text.strip()
    }
    for item in dom.select('ol > li > a')
]

dom.select('ol > li > a')

decompose()

Removes an element from the DOM
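A minimal sketch of the effect, using a made-up table cell. It assumes, as the code later in this post suggests, that the price label sits inside a b tag.

```python
from bs4 import BeautifulSoup

# Hypothetical cell: the label is assumed to be wrapped in <b>
sample_dom = BeautifulSoup(
    "<table><tr><td><b>[도서]</b> 19,200원</td></tr></table>",
    "html.parser",
)
cell = sample_dom.select_one("td")

before = cell.text.strip()        # label and price together
cell.select_one("b").decompose()  # remove the <b> element from the tree
after = cell.text.strip()         # only the price is left

print(before)
print(after)
print(cell.select_one("b"))       # the <b> element is gone from the cell
```

decompose() changes the tree in place, so re-parse the HTML if the original element is needed again.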

rows = dom.select_one("#books").select("tr")
rows

len(rows)

rows

[Challenge]

Build the result below with a comprehension

<Expected result>
[{'제목': '이것이 파이썬이다', '가격': '[도서] 19,200원'},
 {'제목': '저것도 파이썬이다', '가격': '[할인] 12,800원'},
 {'제목': '그래도 파이썬인가?', '가격': '[중고] 6,500원'}]

None

[
    {
        "제목": element.select('td')[0].text.strip(),  # 제목 = title
        "가격": element.select('td')[1].text.strip(),  # 가격 = price
    }
    for element in rows
    if element.select_one("td")  # skip rows that have no <td> cells
]
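To run the comprehension without simple.html, the sketch below uses a made-up table. BOOKS_HTML is a hypothetical stand-in: the real markup of the #books table is not shown in this post, so the header row and the b tags are assumptions.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the #books table in simple.html
BOOKS_HTML = """
<table id="books">
  <tr><th>제목</th><th>가격</th></tr>
  <tr><td>이것이 파이썬이다</td><td><b>[도서]</b> 19,200원</td></tr>
  <tr><td>저것도 파이썬이다</td><td><b>[할인]</b> 12,800원</td></tr>
</table>
"""

books_dom = BeautifulSoup(BOOKS_HTML, "html.parser")
book_rows = books_dom.select_one("#books").select("tr")

books = [
    {
        "제목": row.select("td")[0].text.strip(),
        "가격": row.select("td")[1].text.strip(),
    }
    for row in book_rows
    if row.select_one("td")  # the header row has only <th>, so it is skipped
]

print(len(book_rows))  # counts the header row too
print(books)
```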

dom = BeautifulSoup(html, 'html.parser')
rows = dom.select_one('#books').select('tr')

result = []

for row in rows:
    if row.select_one("td"):  # skip rows that have no <td> cells
        price = row.select_one("td:nth-child(2)")
        print('before decompose():', price)
        price.select_one('b').decompose()  # drop the <b> tag from the price cell
        print('after decompose():', price)

        item = {
            "제목": row.select_one("td:first-child").text.strip(),  # 제목 = title
            "가격": price.text.strip()  # 가격 = price
        }
        result.append(item)

result

Converting the price to a numeric type

19,200원 --> 19200

myStr = "1,232,200원"

Using replace()

myStr.replace(',','')
myStr.replace(',','')[:-1]
int(myStr.replace(',','')[:-1])

Using for

int(''.join([
    ch
    for ch in myStr
    if '0' <= ch <= '9'  # keep only the digit characters
]))

Regular expressions

import re

int(re.sub(r'\D', '', myStr))
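The three approaches differ in what they assume about the input. The replace() version relies on the string ending in exactly one unit character, while the digit filter and the regular expression ignore everything that is not a digit. A small helper along those lines might look like this (parse_price is a hypothetical name, not something from the original notebook):

```python
import re


def parse_price(text):
    """Return the digits found in text as an int, or None if there are none."""
    digits = re.sub(r"\D", "", text)  # drop every non-digit character
    return int(digits) if digits else None


print(parse_price("1,232,200원"))
print(parse_price("[도서] 19,200원"))
print(parse_price("품절"))  # no digits at all
```

Because all the digits are concatenated, this only works when the text contains a single number.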

dom = BeautifulSoup(html, 'html.parser')
rows = dom.select_one('#books').select('tr')

result = []

for row in rows:
    if row.select_one("td"):  # skip rows that have no <td> cells
        price = row.select_one("td:nth-child(2)")
        print('before decompose():', price)
        price.select_one('b').decompose()  # drop the <b> tag from the price cell
        print('after decompose():', price)

        item = {
            "제목": row.select_one("td:first-child").text.strip(),  # 제목 = title
            # remove the commas, cut the trailing 원, then convert to int
            "가격": int(price.text.strip().replace(',','')[:-1])
        }
        result.append(item)

result

import pandas as pd

pd.DataFrame(result)
