AI Training Course - Python.13

단비 · February 13, 2023

Scraping: crawling plus extracting and processing the data, which is the end goal
Selenium: a library that lets you control a browser from code
Crawling: collecting information from the internet so that it can be analyzed and used
- requests.get: fetches the page at a URL
  - headers
    - For sites that block crawlers, adding a header may let the request through
    - header = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36'}
- BeautifulSoup(request.text): parses the HTML text into an object that can be searched
```
import requests
from bs4 import BeautifulSoup

site = 'https://basicenglishspeaking.com/daily-english-conversation-topics/'
request = requests.get(site)  # or requests.get(site, headers=header)
request.text  # the page's HTML source
soup = BeautifulSoup(request.text, 'html.parser')  # name the parser explicitly
soup.find('div', {'class': 'thrv-columns'})  # first <div> with class "thrv-columns"
```
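To see what the header actually changes, here is a minimal sketch that builds the request without sending it. `HEADERS` and `fetch_html` are names made up for this note, and whether a particular site accepts the request still depends on that site.

```python
import requests

# Browser-like User-Agent copied from the note above.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/109.0.0.0 Safari/537.36"
    )
}

# Without a custom header, requests identifies itself as "python-requests/<version>".
default_agent = requests.utils.default_user_agent()

# Build the request without sending it, to inspect the header that would be sent.
prepared = requests.Request(
    "GET",
    "https://basicenglishspeaking.com/daily-english-conversation-topics/",
    headers=HEADERS,
).prepare()
sent_agent = prepared.headers["User-Agent"]


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page using the browser-like header."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    return response.text


print(default_agent)
print(sent_agent)
```

Calling `fetch_html(site)` would then return the same kind of text as `request.text` above.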

😇 Instead of installing ChromeDriver by hand, a library can do it for you

```
!pip install chromedriver-autoinstaller
```

```
import chromedriver_autoinstaller
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

chromedriver_autoinstaller.install()  # downloads a matching ChromeDriver if needed and adds it to PATH
```

Commands

send_keys

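```
from selenium.webdriver.common.keys import Keys

search = driver.find_element('name', 'q')  # the element whose name attribute is "q"
search.send_keys('lambda')     # type into the field
search.send_keys(Keys.RETURN)  # press Enter
```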

findAll

find

```
soup = BeautifulSoup(driver.page_source, 'html.parser')
comment_area = soup.findAll('span', {'class': 'u_cbox_contents'})  # findAll is the older alias of find_all
```
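The difference between find and findAll is easiest to see on a small page. The sketch below uses a made-up HTML snippet (only the class name `u_cbox_contents` comes from the note above) and the newer spelling `find_all`.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for driver.page_source.
html = """
<ul>
  <li><span class="u_cbox_contents">first comment</span></li>
  <li><span class="u_cbox_contents">second comment</span></li>
  <li><span class="other">not a comment</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# find returns only the first match, or None when nothing matches.
first = soup.find("span", {"class": "u_cbox_contents"})
missing = soup.find("span", {"class": "does-not-exist"})

# find_all returns every match as a list-like result.
all_comments = soup.find_all("span", {"class": "u_cbox_contents"})
texts = [span.get_text(strip=True) for span in all_comments]

print(first.get_text())  # first comment
print(texts)             # ['first comment', 'second comment']
print(missing)           # None
```

Because find can return None, it is worth checking the result before calling methods on it.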

find_elements

find_element

```
driver = webdriver.Chrome()
driver.get('https://www.python.org')

search = driver.find_element('name', 'q')  # same as find_element(By.NAME, 'q')
```

to_excel
- Saves a pandas DataFrame as an Excel file
```
banapresso.to_excel('banapresso.xlsx')
```
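As a rough sketch of where a DataFrame like `banapresso` might come from: the rows below are invented placeholders, not real scraped data, and writing `.xlsx` assumes an Excel engine such as openpyxl is installed alongside pandas.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows standing in for scraped results.
rows = [
    {"name": "Store A", "address": "1 Example Road"},
    {"name": "Store B", "address": "2 Example Road"},
]
banapresso = pd.DataFrame(rows)

# Write into a temporary folder so the example leaves the working directory clean.
# index=False keeps the 0, 1, 2... row labels out of the sheet.
path = os.path.join(tempfile.mkdtemp(), "banapresso.xlsx")
banapresso.to_excel(path, index=False)

print(os.path.exists(path))
```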

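implicitly_wait(seconds)

Set once after creating the WebDriver. Each element lookup then waits up to this many seconds for the element to appear.

```
driver = webdriver.Chrome()
driver.implicitly_wait(3)  # wait up to 3 seconds per lookup
```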

sleep

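Pauses for a fixed time after loading a site.
time is part of the standard library, so it needs only import time, no installation.

```
url = "https://www.instagram.com/explore/tags/사과/"
driver.get(url)
time.sleep(6)  # always blocks for the full 6 seconds
```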
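The two waits behave differently: implicitly_wait stops waiting as soon as the element is found, while time.sleep always blocks for the full duration. The sketch below imitates the first behavior in plain Python. `wait_until` and `element_is_present` are names made up for this note; this is an illustration of polling with a timeout, not Selenium's own implementation.

```python
import time


def wait_until(condition, timeout=3.0, interval=0.1):
    """Call condition() repeatedly until it returns something truthy or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page element that shows up 0.3 seconds from now.
ready_at = time.monotonic() + 0.3


def element_is_present():
    return time.monotonic() >= ready_at


start = time.monotonic()
wait_until(element_is_present, timeout=3.0)
elapsed = time.monotonic() - start

print(f"waited {elapsed:.2f}s instead of the full 3s")
```

A fixed `time.sleep(3)` in the same spot would have taken three seconds whether or not the element was already there.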

click

```
driver.find_element(By.XPATH, '//*[@id="loginForm"]/div/div[3]/button/div').click()
```

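XPath
- A language for writing the path to a specific element in a markup document
- An XML path language, similar to the path expressions used in a file system
  - Relative path: basic syntax
    - Using * as the tag name matches any tag
    - Starts partway through the HTML DOM (from the element you choose)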
```
//tagname[@attribute='Value']/div/div[3]/button/div
```
  - Absolute path
    - Spells out the whole path, starting from the top of the document
```
/html/body/div[2]/div/div[2]/div[1]/div[2]/form/div/input
```
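Both path styles can be tried without a browser. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, on a made-up page (only the id `loginForm` is borrowed from the click example). ElementTree paths start at the root element, so the absolute path `/html/body/div[2]/form/button/div` is written here as `./body/div[2]/form/button/div`.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page for illustration.
PAGE = """
<html>
  <body>
    <div id="header">menu</div>
    <div id="loginForm">
      <form>
        <input name="username" />
        <button><div>Log in</div></button>
      </form>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)  # root is the <html> element

# Relative path: start from any element with id="loginForm"; * matches any tag name.
relative = root.find(".//*[@id='loginForm']/form/button/div")

# Absolute-style path: every step from the top is spelled out.
# div[2] means the second <div> child of <body> (XPath positions start at 1).
absolute = root.find("./body/div[2]/form/button/div")

print(relative.text)         # Log in
print(relative is absolute)  # True
```

The relative path keeps working if another block is inserted above the form, while the absolute path breaks as soon as the position of the div changes.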
