TIL - indeed crawling

Heechul Yoon · February 12, 2020

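Let's fetch the details (url, title, location) of Python-related job postings from the Indeed site. Note that the search URL used below actually queries django for Gangnam-gu, Seoul.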

import requests
from bs4 import BeautifulSoup

url = 'https://kr.indeed.com/%EC%B7%A8%EC%97%85?q=django&l=%EC%84%9C%EC%9A%B8+%EA%B0%95%EB%82%A8%EA%B5%AC'

html = requests.get(url).text

soup = BeautifulSoup(html,'html.parser')

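First, import requests and BeautifulSoup, store the address of the search results page in url, and fetch that page's HTML text with requests.get(url).text.
Then parse the HTML with BeautifulSoup and keep the resulting object in soup.
Now let's get the URLs out of titles. The post does not show how titles is built; judging from the output, it is presumably the result of selecting the div.title elements, e.g. soup.select('div.title').
Printing titles gives the following.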

[<div class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=55a19c92c845685e&amp;fccid=74d05cee5b52f133&amp;vjs=3" id="jl_55a19c92c845685e" onclick="setRefineByCookie([]); return rclk(this,jobmap[0],true,0);" onmousedown="return rclk(this,jobmap[0],0);" rel="noopener nofollow" target="_blank" title="Developer / Designer recruit">
Developer / Designer recruit</a>
</div>, <div class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=6c1d50662fa2d7f7&amp;fccid=71bba75c2a7d79a0&amp;vjs=3" id="jl_6c1d50662fa2d7f7" onclick="setRefineByCookie([]); return rclk(this,jobmap[1],true,0);" onmousedown="return rclk(this,jobmap[1],0);" rel="noopener nofollow" target="_blank" title="서울 강남구 인썸니아 시니어 개발자 및 퍼블리셔">
서울 강남구 인썸니아 시니어 개발자 및 퍼블리셔</a>
</div>, <div class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=a3a65a4d59061430&amp;fccid=533c91503ff32d48&amp;vjs=3" id="jl_a3a65a4d59061430" onclick="setRefineByCookie([]); return rclk(this,jobmap[2],true,0);" onmousedown="return rclk(this,jobmap[2],0);" rel="noopener nofollow" target="_blank" title="위시켓과 함께할 서버개발자님">
위시켓과 함께할 서버개발자님</a>
</div>, <div class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=b76a28e9af1117ea&amp;fccid=ba4215b904dee414&amp;vjs=3" id="jl_b76a28e9af1117ea" onclick="setRefineByCookie([]); return rclk(this,jobmap[3],true,0);" onmousedown="return rclk(this,jobmap[3],0);" rel="noopener nofollow" target="_blank" title="[라프텔] 서버(백엔드) 개발자">
[라프텔] 서버(백엔드) 개발자</a>
</div>, <div class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=273d830a0c62aedd&amp;fccid=aba87288e4cc3687&amp;vjs=3" id="jl_273d830a0c62aedd" onclick="setRefineByCookie([]); return rclk(this,jobmap[4],true,0);" onmousedown="return rclk(this,jobmap[4],0);" rel="noopener nofollow" target="_blank" title="솔루션 프론트엔드 개발자">
솔루션 프론트엔드 개발자</a>
</div>, <div class="title">
<a class="jobtitle turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=b01342d6d3c2a24a&amp;fccid=1637bdd8f03e93aa&amp;vjs=3" id="jl_b01342d6d3c2a24a" onclick="setRefineByCookie([]); return rclk(this,jobmap[5],true,0);" onmousedown="return rclk(this,jobmap[5],0);" rel="noopener nofollow" target="_blank" title="FinTech 서비스 플랫폼 개발 통해 세상 변화시킬 개발자분 만나고 싶습니다">
FinTech 서비스 플랫폼 개발 통해 세상 변화시킬 개발자분 만나고 싶...</a>

select returns a list of the matching tags, and each tag gives dictionary-style access to its attributes.

titles[0].a['href']

So, to get only the URL from the first item, index into the list to get the tag object, then read the href attribute of its a tag.
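The lookup above can be tried without hitting the site. This is a minimal sketch on a trimmed copy of the printed markup; the selector div.title is an assumption about how titles was built, and sample_html, first_link, first_href and first_attrs are names made up for the illustration.

```python
from bs4 import BeautifulSoup

# Trimmed-down sample of the markup printed above (most attributes dropped).
sample_html = """
<div class="title">
<a class="jobtitle turnstileLink" href="/rc/clk?jk=55a19c92c845685e&amp;fccid=74d05cee5b52f133&amp;vjs=3" title="Developer / Designer recruit">
Developer / Designer recruit</a>
</div>
<div class="title">
<a class="jobtitle turnstileLink" href="/rc/clk?jk=b76a28e9af1117ea&amp;fccid=ba4215b904dee414&amp;vjs=3" title="[라프텔] 서버(백엔드) 개발자">
[라프텔] 서버(백엔드) 개발자</a>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Assumption: the post's `titles` comes from a selector like this one.
titles = soup.select('div.title')

first_link = titles[0].a         # first <a> tag inside the first div.title
first_href = first_link['href']  # dictionary-style access to one attribute
first_attrs = first_link.attrs   # all attributes of the tag as a dict
```

The parser decodes the &amp; entities, so first_href contains plain & characters.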

urls = [t.a['href'] for t in titles]

Then a list comprehension collects every href value into the urls variable.
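The collected hrefs are relative paths, so they may need a host in front before they can be opened. A minimal sketch with the standard library follows; base_url and absolute_urls are names introduced here, and using the search page's host as the base is an assumption, not something the post states.

```python
from urllib.parse import urljoin

# Assumption: relative hrefs resolve against the host of the search page.
base_url = 'https://kr.indeed.com'

# Two hrefs taken from the output above, with &amp; already decoded to &.
urls = [
    '/rc/clk?jk=55a19c92c845685e&fccid=74d05cee5b52f133&vjs=3',
    '/rc/clk?jk=6c1d50662fa2d7f7&fccid=71bba75c2a7d79a0&vjs=3',
]

# Join each relative path onto the base to get a full address.
absolute_urls = [urljoin(base_url, href) for href in urls]
```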

In the same way, get the job titles as text from the same titles elements.

title = [t.text for t in titles]
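In the printed markup the title text starts on a new line inside the a tag, so .text keeps the surrounding line breaks. If those are unwanted, get_text(strip=True) is one option. A small sketch on a shortened sample; raw_title and clean_title are illustrative names, not variables from the post.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same line breaks as the printed output.
sample_html = """
<div class="title">
<a href="/rc/clk?jk=55a19c92c845685e">
Developer / Designer recruit</a>
</div>
"""

titles = BeautifulSoup(sample_html, 'html.parser').select('div.title')

# .text keeps the newlines around the title.
raw_title = [t.text for t in titles]

# get_text(strip=True) trims whitespace from each text fragment.
clean_title = [t.get_text(strip=True) for t in titles]
```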

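Next, let's get the part of the page that holds the location.
Using 'Inspect' on the location element in the browser's developer tools gives this selector: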

#p_55a19c92c845685e > div.sjcl > span

As shown above, the copied path is the span tag inside the div with class sjcl, inside the element with id p_55a19c92c845685e.

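An id is unique to a single element, so keeping it would restrict the selector to that one posting instead of matching every location. The id part is therefore dropped: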

div.sjcl > span

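Selecting with only this common part, without the id (the code below refers to the result as locations), and printing it gives the following.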

[<span class="location accessible-contrast-color-location">서울 논현동</span>, <span class="location accessible-contrast-color-location">서울 서초구</span>, <span class="location accessible-contrast-color-location">서울 강남구</span>, <span class="location accessible-contrast-color-location">서울</span>, <span class="location accessible-contrast-color-location">성남 분당구</span>, <span class="location accessible-contrast-color-location">서울 여의도</span>, <span class="location accessible-contrast-color-location">서울 강남구</span>, <span class="location accessible-contrast-color-location">서울 강남구</span>, <span class="location accessible-contrast-color-location">서울 마포구</span>, <span class="location accessible-contrast-color-location">서울</span>]
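The difference between the two selectors can be checked on a small sample. This is a sketch under assumptions: the nesting in sample_html is reconstructed from the selector alone, the second id is a made-up placeholder, and with_id is a name used only here.

```python
from bs4 import BeautifulSoup

# Assumed nesting, reconstructed from the selector; the second id is hypothetical.
sample_html = """
<div id="p_55a19c92c845685e">
  <div class="sjcl"><span class="location accessible-contrast-color-location">서울 논현동</span></div>
</div>
<div id="p_0000000000000000">
  <div class="sjcl"><span class="location accessible-contrast-color-location">서울 서초구</span></div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# With the id, only the span of that single posting matches.
with_id = soup.select('#p_55a19c92c845685e > div.sjcl > span')

# Without the id, the span of every posting matches.
locations = soup.select('div.sjcl > span')
location = [loc.text for loc in locations]
```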

This result still has the span tags wrapped around the text, so only the text part should be kept.

location = [loc.text for loc in locations]

Only the text is read and stored in the location variable.

Combining the data
Now that the three variables are lists, combine them with the zip() function.

job_info=[]
# Use urls (the list of hrefs), not url (the search page address string).
for i in zip(urls, title, location):
    job_info.append(
        {
            'url' : i[0],
            'title' : i[1],
            'location' : i[2],
        }
    )

Each entry is built as a dictionary with later storage in a database in mind.
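The combining step can be sketched on its own with a few values copied from the outputs above. Pairing the first title with the first location simply follows list order, which is what zip() does; whether the lists always line up on the real page is an assumption.

```python
# Sample values copied from the printed outputs earlier in the post.
urls = [
    '/rc/clk?jk=55a19c92c845685e&fccid=74d05cee5b52f133&vjs=3',
    '/rc/clk?jk=6c1d50662fa2d7f7&fccid=71bba75c2a7d79a0&vjs=3',
]
title = [
    'Developer / Designer recruit',
    '서울 강남구 인썸니아 시니어 개발자 및 퍼블리셔',
]
location = ['서울 논현동', '서울 서초구']

# zip() pairs items by position and stops at the shortest list.
job_info = [
    {'url': u, 'title': t, 'location': loc}
    for u, t, loc in zip(urls, title, location)
]
```

Because zip() stops at the shortest input, a missing location on the page would silently drop postings, so comparing the three list lengths first may be worthwhile.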
