02. Using Python - Crawling & MongoDB

ID์งฑ์žฌ ยท February 25, 2021


๐ŸŒˆ Crawling Practice

๐Ÿ”ฅ Understanding GET vs POST

๐Ÿ”ฅ POST-style requests

๐Ÿ”ฅ Cine21 crawling

๐Ÿ”ฅ Crawling and data preprocessing

๐Ÿ”ฅ Saving crawled data to MongoDB


1. Understanding GET vs POST

  • When sending a request to a URL for crawling, first check which request method the page uses
  • A GET-style page changes the URL in the address bar whenever the page changes, so that URL pattern is what you analyze
  • A POST-style page keeps the same URL in the address bar even as the page changes
  • To crawl a POST-style page, you therefore need to check the request details in [Developer Tools] โ‡ข [Network]

    1) How to request a POST-style page

    • Setup: check Preserve log and make sure the filter is set to All
    • Each time the page changes, you can watch the Form Data change with it
    • Check the General info and the Form Data info
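The difference can be seen without touching a real site by preparing (not sending) the two request types; the URL and parameter below are placeholders, not a real endpoint.

```python
import requests

# Prepare (but do not send) a GET and a POST request carrying the same parameter.
get_req = requests.Request('GET', 'http://example.com/rank', params={'page': 2}).prepare()
post_req = requests.Request('POST', 'http://example.com/rank', data={'page': 2}).prepare()

print(get_req.url)    # the page number is visible in the address
print(post_req.url)   # the POST address stays unchanged...
print(post_req.body)  # ...the parameter travels in the form body instead
```

This mirrors what Developer Tools shows: a GET page changes the address bar, while a POST page changes only the Form Data.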

2. POST-style requests

  • For the request URL, use the Request URL shown in [Developer Tools] โ‡ข [Network]
  • Also pass along the Form Data shown in [Developer Tools] โ‡ข [Network] as a dictionary
  • The request form is requests.post(url, data=[dict])
  • ๐Ÿ” res = requests.post(url, data=post_data)
  • ๐Ÿ” Compared with GET: res = requests.get(url)

โœ๐Ÿป python

# Step 1: import libraries
import pymongo 
import requests
from bs4 import BeautifulSoup
# Step 2: MongoDB connection
conn = pymongo.MongoClient() # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21 # create the cine21 database and keep a handle in actor_db
actor_collection = actor_db.actor_collection # create the actor_collection collection and keep a handle
# Step 3: request the crawling URL (http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict() # pass the Form Data section as a dictionary
post_data['section'] = 'actor' 
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data) # send the request
# Step 4: parsing and data extraction
soup = BeautifulSoup(res.content, 'html.parser') # parse the HTML

3. Cine21 crawling

1) Crawling target 1: crawl just the actor names

โœ๐Ÿป python

# Step 1: import libraries
import pymongo 
import requests
from bs4 import BeautifulSoup
# Step 2: MongoDB connection
conn = pymongo.MongoClient() # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21 # create the cine21 database and keep a handle in actor_db
actor_collection = actor_db.actor_collection # create the actor_collection collection and keep a handle
# Step 3: request the crawling URL (http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict() # pass the Form Data section as a dictionary
post_data['section'] = 'actor' 
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data) # send the request
# Step 4: parsing and data extraction (CSS selectors)
soup = BeautifulSoup(res.content, 'html.parser') # parse the HTML
actors = soup.select('li.people_li div.name') # collect all actor names
for actor in actors:
    print(actor.text)

2) Preprocessing the data with regular expressions

  • Add the regex library โ‡ข import re
  • re.sub([regex], [replacement], [original]) โ‡ข rewrites the parts of the original value that match the regex pattern
  • ๐Ÿ” re.sub(r'\(\w*\)', '', [variable]) โ‡ข removes the pattern (word characters wrapped in parentheses) from the variable's value
    โœ๐Ÿป python
# Step 1: import libraries
import pymongo 
import re
import requests
from bs4 import BeautifulSoup
# Step 2: MongoDB connection
conn = pymongo.MongoClient() # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21 # create the cine21 database and keep a handle in actor_db
actor_collection = actor_db.actor_collection # create the actor_collection collection and keep a handle
# Step 3: request the crawling URL (http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict() # pass the Form Data section as a dictionary
post_data['section'] = 'actor' 
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data) # send the request
# Step 4: parsing and data extraction (CSS selectors)
soup = BeautifulSoup(res.content, 'html.parser') # parse the HTML
actors = soup.select('li.people_li div.name') # collect all actor names
for actor in actors:
    print(re.sub(r'\(\w*\)', '', actor.text))
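The re.sub call above can be tried in isolation; the name below is a made-up example of the "name(annotation)" pattern on the ranking page.

```python
import re

# Hypothetical actor entry: a name followed by a parenthesized annotation.
raw_name = '์†ก์ค‘๊ธฐ(Song)'
# r'\(\w*\)' matches a literal '(' and ')' wrapping zero or more word characters.
clean_name = re.sub(r'\(\w*\)', '', raw_name)
print(clean_name)
```

Using a raw string (r'...') keeps the backslashes intact; writing '\(\w*\)' without the r prefix triggers invalid-escape warnings on recent Python versions.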

3) Crawling target 2: crawl actor details from each actor's individual link

  • ๋งํฌ์ฃผ์†Œ ๊ฐ€์ ธ์˜ค๊ธฐ : ๋ฐฐ์šฐ ๋งํฌ ์ •๋ณด ํ™•์ธํ•˜๊ธฐ
    โœ๐Ÿป python
actors = soup.select('li.people_li div.name')
for actor in actors:
	print('http://www.cine21.com' + actor.select_one('a').attrs['href'])
  • ๋ฐฐ์šฐ ์ƒ์„ธ์ •๋ณด ์ถ”์ถœ : ๋ฐฐ์šฐ
    โœ๐Ÿป python
# Step 1: import libraries
import pymongo 
import re
import requests
from bs4 import BeautifulSoup
# Step 2: MongoDB connection
conn = pymongo.MongoClient() # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21 # create the cine21 database and keep a handle in actor_db
actor_collection = actor_db.actor_collection # create the actor_collection collection and keep a handle
# Step 3: request the crawling URL (http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict() # pass the Form Data section as a dictionary
post_data['section'] = 'actor' 
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data) # send the request
# Step 4: parsing and data extraction (CSS selectors)
soup = BeautifulSoup(res.content, 'html.parser') # parse the HTML
actors = soup.select('li.people_li div.name') # collect all actor names
for actor in actors:
    actor_link = 'http://www.cine21.com' + actor.select_one('a').attrs['href']
    res_actor = requests.get(actor_link)
    soup_actor = BeautifulSoup(res_actor.content, 'html.parser')
    default_info = soup_actor.select_one('ul.default_info')
    actor_details = default_info.select('li')
    for actor_item in actor_details:
        print(actor_item)


4. ํฌ๋กค๋ง๊ณผ ๋ฐ์ดํ„ฐ ์ „์ฒ˜๋ฆฌ

  • <span class="tit">์˜ ํ…์ŠคํŠธ ๊ฐ’์€ ๊ฐ€์ ธ์˜ค๊ธฐ ์‰ฝ์ง€๋งŒ, ๊ทธ ๋’ค์— ์žˆ๋Š” ํ…์ŠคํŠธ๋ฅผ ๊ฐ€์ ธ์˜ค๊ธฐ ์–ด๋ ค์›€

  • ์ด๋Ÿฐ ๊ตฌ์กฐ์ผ ๋•Œ ๋’ท๋ถ€๋ถ„ ํ…์ŠคํŠธ(ํŒŒ๋ž€๋„ค๋ชจ)๋ฅผ ์ •๊ทœํ‘œํ˜„์‹์„ ํ†ตํ•ด ์ฒ˜๋ฆฌํ•˜์—ฌ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ์Œ

  • ์ •๊ทœํ‘œํ˜„์‹ ์—ฐ์Šต ์‹ธ์ดํŠธ : https://regexr.com/

    1) A special regex pattern: greedy (.*)

    • In a regex, the dot (.) means any single character except the newline \n
    • The star (*) means the preceding character repeated zero or more times
    • So .* means any character except \n repeated zero or more times (i.e., any run of characters, symbols included)

    2) A special regex pattern: non-greedy (.*?)

    • By contrast, non-greedy (.*?) treats the pattern as matched at the first opportunity
    • Non-greedy (.*?) is a good fit for extracting just the text while excluding the tags
    • That is, writing the regex <span.*?>.*?</span> pinpoints the span tags and the text between them
    • Removing that region with re.sub() leaves only <li>text</li>
    • Processing the <li></li> once more yields the trailing text we want
  • Preprocessing the data with the non-greedy regex looks like this:

โœ๐Ÿป python

# Step 1: import libraries
import pymongo 
import re
import requests
from bs4 import BeautifulSoup
# Step 2: MongoDB connection
conn = pymongo.MongoClient() # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21 # create the cine21 database and keep a handle in actor_db
actor_collection = actor_db.actor_collection # create the actor_collection collection and keep a handle
# Step 3: request the crawling URL (http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict() # pass the Form Data section as a dictionary
post_data['section'] = 'actor' 
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data) # send the request
# Step 4: parsing and data extraction (CSS selectors)
soup = BeautifulSoup(res.content, 'html.parser') # parse the HTML
actors = soup.select('li.people_li div.name') # collect all actor names
for actor in actors:
    actor_link = 'http://www.cine21.com' + actor.select_one('a').attrs['href']
    res_actor = requests.get(actor_link)
    soup_actor = BeautifulSoup(res_actor.content, 'html.parser')
    default_info = soup_actor.select_one('ul.default_info')
    actor_details = default_info.select('li')
    for actor_item in actor_details:
        print(actor_item.select_one('span.tit').text) # text of <span class="tit">text</span>
        actor_item_value = re.sub('<span.*?>.*?</span>', '', str(actor_item)) # leaves <li>text</li>
        actor_item_value = re.sub('<.*?>', '', actor_item_value)
        print(actor_item_value)
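The two-step re.sub cleanup above can be checked on a single made-up <li> line, which also shows why the non-greedy .*? matters:

```python
import re

# A made-up sample of the <li> structure on the actor detail page.
html = '<li><span class="tit">Birth</span>1980-01-01</li>'

# Greedy .* runs from the first '<' to the last '>': one match, the whole line.
greedy = re.findall('<.*>', html)
# Non-greedy .*? stops at the first possible '>': each tag matches on its own.
non_greedy = re.findall('<.*?>', html)

# Step 1: remove the <span ...>...</span> label, leaving <li>1980-01-01</li>.
value = re.sub('<span.*?>.*?</span>', '', html)
# Step 2: strip the remaining tags, leaving just the trailing text.
value = re.sub('<.*?>', '', value)
print(value)
```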

  • To insert the data into MongoDB, it has to be converted into dictionaries (JSON form)
  • To use insert_many(), collect the per-actor results in a list

โœ๐Ÿป python

# Step 1: import libraries
import pymongo 
import re
import requests
from bs4 import BeautifulSoup
# Step 2: MongoDB connection
conn = pymongo.MongoClient() # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21 # create the cine21 database and keep a handle in actor_db
actor_collection = actor_db.actor_collection # create the actor_collection collection and keep a handle
# Step 3: request the crawling URL (http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict() # pass the Form Data section as a dictionary
post_data['section'] = 'actor' 
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data) # send the request
# Step 4: parsing and data extraction (CSS selectors)
soup = BeautifulSoup(res.content, 'html.parser') # parse the HTML
actors = soup.select('li.people_li div.name') # collect all actor names
actors_info_list = list() # list that will hold the details of every actor
for actor in actors:
    actor_link = 'http://www.cine21.com' + actor.select_one('a').attrs['href']
    res_actor = requests.get(actor_link)
    soup_actor = BeautifulSoup(res_actor.content, 'html.parser')
    default_info = soup_actor.select_one('ul.default_info')
    actor_details = default_info.select('li')
    actor_info_dict = dict() # dictionary holding one actor's details in JSON form
    for actor_item in actor_details:
        actor_item_key = actor_item.select_one('span.tit').text # text of <span class="tit">text</span>
        actor_item_value = re.sub('<span.*?>.*?</span>', '', str(actor_item)) # leaves <li>text</li>
        actor_item_value = re.sub('<.*?>', '', actor_item_value)
        actor_info_dict[actor_item_key] = actor_item_value
    actors_info_list.append(actor_info_dict) # add the dictionary to the list
print(actors_info_list)
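The list-of-dicts shape built above is exactly what insert_many() consumes, one dict per document. A minimal sketch with made-up names and values (no database needed to see the shape):

```python
# Build a list of dicts the way the crawler does; keys and values are made up.
docs = []
for name, birth in [('๊ฐ€๋‚˜๋‹ค', '1980-01-01'), ('๋ผ๋งˆ๋ฐ”', '1990-12-31')]:
    doc = {}  # one document per actor
    doc['์ด๋ฆ„'] = name
    doc['์ƒ๋…„์›”์ผ'] = birth
    docs.append(doc)
print(len(docs))
# actor_collection.insert_many(docs)  # each dict would become one MongoDB document
```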

  • ๋ฐฐ์šฐ์ด๋ฆ„, ์ถœ์—ฐ์˜ํ™”, ํฅํ–‰์ง€์ˆ˜๋„ ์ถ”๊ฐ€ํ•ด์„œ ์ถ”์ถœํ•ด ๋ณด๊ธฐ

โœ๐Ÿป python

# Step 1: import libraries
import pymongo 
import re
import requests
from bs4 import BeautifulSoup
# Step 2: MongoDB connection
conn = pymongo.MongoClient() # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21 # create the cine21 database and keep a handle in actor_db
actor_collection = actor_db.actor_collection # create the actor_collection collection and keep a handle
# Step 3: request the crawling URL (http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict() # pass the Form Data section as a dictionary
post_data['section'] = 'actor' 
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data) # send the request
# Step 4: parsing and data extraction (CSS selectors)
soup = BeautifulSoup(res.content, 'html.parser') # parse the HTML
actors = soup.select('li.people_li div.name') # collect all actor names
actors_info_list = list() # list that will hold the details of every actor
for actor in actors:
    actor_link = 'http://www.cine21.com' + actor.select_one('a').attrs['href']
    res_actor = requests.get(actor_link)
    soup_actor = BeautifulSoup(res_actor.content, 'html.parser')
    default_info = soup_actor.select_one('ul.default_info')
    actor_details = default_info.select('li')
    actor_info_dict = dict() # dictionary holding one actor's details in JSON form
    for actor_item in actor_details:
        actor_item_key = actor_item.select_one('span.tit').text # text of <span class="tit">text</span>
        actor_item_value = re.sub('<span.*?>.*?</span>', '', str(actor_item)) # leaves <li>text</li>
        actor_item_value = re.sub('<.*?>', '', actor_item_value)
        actor_info_dict[actor_item_key] = actor_item_value
    actors_info_list.append(actor_info_dict) # add the dictionary to the list
# extract the actor name, box-office score, and filmography
# actors = soup.select('li.people_li div.name') # already defined above, so omitted
hits = soup.select('ul.num_info > li > strong')
movies = soup.select('ul.mov_list')
for index, actor in enumerate(actors):
    print("actor name:", re.sub(r'\(\w*\)', '', actor.text))
    print("box-office score:", int(hits[index].text.replace(',', '')))
    movie_titles = movies[index].select('li a span')
    movie_title_list = list()
    for movie_title in movie_titles:
        movie_title_list.append(movie_title.text)
    print("filmography:", movie_title_list)


5. Saving crawled data to MongoDB

  • โœ”๏ธ Addition: extend the crawl over multiple pages with a loop
  • โœ”๏ธ Addition: actors_info_list = list() has to move as far up as possible, above the loop; otherwise the list is reinitialized every time the page changes and the results crawled from earlier pages are lost
  • โœ”๏ธ Addition: add the ranking info

โœ๐Ÿป python

# Step 1: import libraries
import pymongo 
import re
import requests
from bs4 import BeautifulSoup
actors_info_list = list() # list that will hold the details of every actor
# Step 2: MongoDB connection
conn = pymongo.MongoClient() # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21 # create the cine21 database and keep a handle in actor_db
actor_collection = actor_db.actor_collection # create the actor_collection collection and keep a handle
# Step 3: request the crawling URL (http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict() # pass the Form Data section as a dictionary
post_data['section'] = 'actor' 
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
for page in range(1, 21): # crawl pages 1-20 (named page so it doesn't shadow the inner loop's index)
    post_data['page'] = page
    res = requests.post(url, data=post_data) # send the request
    # Step 4: parsing and data extraction (CSS selectors)
    soup = BeautifulSoup(res.content, 'html.parser') # parse the HTML
    # Step 5: collect actor names, box-office scores, and filmographies from the soup
    actors = soup.select('li.people_li div.name') # all actor names
    hits = soup.select('ul.num_info > li > strong') # all box-office scores
    movies = soup.select('ul.mov_list') # all filmographies
    rankings = soup.select('li.people_li > span.grade') # actor ranking info
    # Step 6: per-actor crawling
    for index, actor in enumerate(actors):
        actor_name = re.sub(r'\(\w*\)', '', actor.text)
        actor_hits = int(hits[index].text.replace(',', ''))
        movie_titles = movies[index].select('li a span')
        movie_title_list = list()
        for movie_title in movie_titles:
            movie_title_list.append(movie_title.text)
        # add name, box-office score, and filmography as a dict (JSON form);
        # the Korean keys are kept as the MongoDB document field names
        actor_info_dict = dict() # dictionary holding one actor's details in JSON form
        actor_info_dict['๋ฐฐ์šฐ์ด๋ฆ„'] = actor_name
        actor_info_dict['ํฅํ–‰์ง€์ˆ˜'] = actor_hits
        actor_info_dict['์ถœ์—ฐ์˜ํ™”'] = movie_title_list
        actor_info_dict['๋žญํ‚น'] = rankings[index].text
        # fetch each actor's detail page and add the info to the dict
        actor_link = 'http://www.cine21.com' + actor.select_one('a').attrs['href'] # per-actor link
        res_actor = requests.get(actor_link) # request the actor's page
        soup_actor = BeautifulSoup(res_actor.content, 'html.parser') # parse it
        default_info = soup_actor.select_one('ul.default_info')
        actor_details = default_info.select('li')
        for actor_item in actor_details:
            actor_item_key = actor_item.select_one('span.tit').text # text of <span class="tit">text</span>
            actor_item_value = re.sub('<span.*?>.*?</span>', '', str(actor_item)) # leaves <li>text</li>
            actor_item_value = re.sub('<.*?>', '', actor_item_value) # strip remaining tags, keeping only text
            actor_info_dict[actor_item_key] = actor_item_value
        actors_info_list.append(actor_info_dict) # add the dictionary to the list
print(actors_info_list) # print the data to the console
# insert the data into MongoDB
actor_collection.insert_many(actors_info_list) 
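The second checklist item — keeping actors_info_list outside the page loop — can be seen in miniature with plain lists, no crawling involved:

```python
# Correct: the accumulator is created once, before the page loop.
results = []
for page in range(1, 4):
    # If `results = []` sat inside this loop instead, each new page would
    # wipe out everything collected from the previous pages.
    results.append({'page': page})
print(len(results))  # one entry per page survives
```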
