🔥 Understanding GET vs POST
🔥 POST-style requests
🔥 Cine21 crawling
🔥 Crawling and data preprocessing
🔥 Saving crawled data to MongoDB
1) How to request a POST-style web page
- Setup: in the browser DevTools Network tab, tick Preserve log and make sure the request filter is set to All.
- Each time you change the page, you can watch the Form Data values change accordingly.
- Check the General and Form Data sections of the selected request.
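Before replaying that request in code, here is a minimal sketch of how the two request styles differ in the requests library. The https://httpbin.org endpoint is only a neutral test service used for illustration; the full crawling script below then sends the captured Form Data fields with requests.post.
✍🏻 python
import requests

# GET: parameters travel in the URL query string (visible in the address bar)
get_res = requests.get('https://httpbin.org/get', params={'page': 1})
print(get_res.url)              # e.g. https://httpbin.org/get?page=1

# POST: parameters travel in the request body as form data,
# which is exactly what the Form Data panel in DevTools shows
post_res = requests.post('https://httpbin.org/post', data={'section': 'actor', 'page': 1})
print(post_res.status_code)     # 200 on success; the URL itself carries no parameters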
✍🏻 python
# Step 1: import libraries
import pymongo
import requests
from bs4 import BeautifulSoup

# Step 2: MongoDB connection
conn = pymongo.MongoClient()                    # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21                          # create/use the cine21 database
actor_collection = actor_db.actor_collection    # create/use the actor_collection collection

# Step 3: request the crawling URL (ranking page: http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict()                              # pass the Form Data fields as a dictionary
post_data['section'] = 'actor'
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data)        # send the POST request

# Step 4: parsing and data extraction
soup = BeautifulSoup(res.content, 'html.parser')  # parse the response HTML
1) Crawling target 1: crawl only the actor names
✍🏻 python
# Step 1: import libraries
import pymongo
import requests
from bs4 import BeautifulSoup

# Step 2: MongoDB connection
conn = pymongo.MongoClient()                    # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21                          # create/use the cine21 database
actor_collection = actor_db.actor_collection    # create/use the actor_collection collection

# Step 3: request the crawling URL (ranking page: http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict()                              # pass the Form Data fields as a dictionary
post_data['section'] = 'actor'
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data)        # send the POST request

# Step 4: parsing and data extraction (CSS selectors)
soup = BeautifulSoup(res.content, 'html.parser')  # parse the response HTML
actors = soup.select('li.people_li div.name')     # grab every actor-name element
for actor in actors:
    print(actor.text)
2) Data preprocessing with regular expressions
- Add the regular-expression library: import re
- re.sub([regex], [replacement], [original]): replaces every part of the original string that matches the regex pattern (passing an empty string deletes those parts).
- 📌 re.sub(r'\(\w*\)', '', [variable]): removes the matched pattern, a pair of parentheses and the word characters between them, from the variable's value (a tiny standalone demo follows this list).
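As a standalone illustration of the call above, a minimal sketch on a made-up string that mimics the "name(count)" text shown on the ranking page:
✍🏻 python
import re

sample = 'Song Kang-ho(2)'                 # hypothetical sample: actor name plus a parenthesized film count

# \( and \) match literal parentheses, \w* matches the word characters between them
print(re.sub(r'\(\w*\)', '', sample))      # -> 'Song Kang-ho'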
✍🏻 python
# Step 1: import libraries
import pymongo
import re
import requests
from bs4 import BeautifulSoup

# Step 2: MongoDB connection
conn = pymongo.MongoClient()                    # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21                          # create/use the cine21 database
actor_collection = actor_db.actor_collection    # create/use the actor_collection collection

# Step 3: request the crawling URL (ranking page: http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict()                              # pass the Form Data fields as a dictionary
post_data['section'] = 'actor'
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data)        # send the POST request

# Step 4: parsing and data extraction (CSS selectors)
soup = BeautifulSoup(res.content, 'html.parser')  # parse the response HTML
actors = soup.select('li.people_li div.name')     # grab every actor-name element
for actor in actors:
    print(re.sub(r'\(\w*\)', '', actor.text))     # strip the parenthesized suffix from each name
3) Crawling target 2: crawl actor details from each actor's individual link
- Getting the link address: check each actor's link
✍🏻 python
actors = soup.select('li.people_li div.name')
for actor in actors:
    # build the absolute URL to each actor's detail page from the relative href
    print('http://www.cine21.com' + actor.select_one('a').attrs['href'])
- Extracting each actor's detail info
✍🏻 python
# Step 1: import libraries
import pymongo
import re
import requests
from bs4 import BeautifulSoup

# Step 2: MongoDB connection
conn = pymongo.MongoClient()                    # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21                          # create/use the cine21 database
actor_collection = actor_db.actor_collection    # create/use the actor_collection collection

# Step 3: request the crawling URL (ranking page: http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict()                              # pass the Form Data fields as a dictionary
post_data['section'] = 'actor'
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data)        # send the POST request

# Step 4: parsing and data extraction (CSS selectors)
soup = BeautifulSoup(res.content, 'html.parser')  # parse the response HTML
actors = soup.select('li.people_li div.name')     # grab every actor-name element
for actor in actors:
    actor_link = 'http://www.cine21.com' + actor.select_one('a').attrs['href']
    res_actor = requests.get(actor_link)                          # request each actor's detail page
    soup_actor = BeautifulSoup(res_actor.content, 'html.parser')
    default_info = soup_actor.select_one('ul.default_info')       # the <ul> holding the profile fields
    actor_details = default_info.select('li')
    for actor_item in actor_details:
        print(actor_item)
Each profile item has the shape <li><span class="tit">label</span>value</li>. The label inside <span class="tit"> is easy to grab with a selector, but the value text that follows the closing tag is hard to select directly. With this structure, the trailing text can still be extracted by cleaning the HTML with a regular expression.
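For comparison only, the trailing text can also be reached without a regex by walking to the text node right after the span; a minimal sketch on a hypothetical fragment (the notes below continue with the regex approach instead).
✍🏻 python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like one profile <li>
html = '<li><span class="tit">Job</span>Actor</li>'
li = BeautifulSoup(html, 'html.parser').select_one('li')

print(li.select_one('span.tit').text)           # 'Job'   -> the label inside the span is easy
print(li.select_one('span.tit').next_sibling)   # 'Actor' -> the bare text node right after the span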
Regex practice site: https://regexr.com/
1) A special regular expression: Greedy
(.*)
- In a regular expression, the dot (.) matches any single character except the newline character \n.
- The star (*) means the preceding character is repeated zero or more times.
- So .* matches zero or more of any character except \n, i.e. every character including symbols (a short demo follows below).
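A minimal sketch of how greedy matching behaves on a hypothetical fragment shaped like the profile items above; note how far the match reaches.
✍🏻 python
import re

html = '<li><span class="tit">Job</span>Actor</li>'   # made-up fragment

# Greedy: .* expands as far as possible, so <span.*> runs to the LAST '>' in the string
print(re.findall(r'<span.*>', html))
# -> ['<span class="tit">Job</span>Actor</li>']  (far more than just the opening tag)

# Deleting that match therefore wipes out the value text as well
print(re.sub(r'<span.*>', '', html))
# -> '<li>'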
2) A special regular expression: Non-Greedy
(.*?)
- In contrast, the Non-Greedy form (.*?) stops at the first possible match, treating the shortest match as the pattern.
- Non-Greedy (.*?) is therefore useful when you want to extract only the text while excluding the tags.
- In other words, writing the regex as <span.*?>.*?</span> pins down the opening span tag, the text between, and the closing span tag.
- Erasing only that region with re.sub() leaves just <li>text</li>.
- Handling the leftover <li></li> tags in one more pass extracts the trailing text we want (a worked mini-example follows this list).
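Here is a minimal sketch of that two-step re.sub() cleanup on a hypothetical fragment (the label and value strings are made up; the script below applies the same idea to str(actor_item)).
✍🏻 python
import re

item = '<li><span class="tit">Job</span>Actor</li>'   # made-up profile item

# Step 1: the non-greedy pattern removes only the <span ...>...</span> block
step1 = re.sub(r'<span.*?>.*?</span>', '', item)
print(step1)          # -> '<li>Actor</li>'

# Step 2: strip the remaining tags, leaving just the trailing text
step2 = re.sub(r'<.*?>', '', step1)
print(step2)          # -> 'Actor'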
Processing the data with the Non-Greedy regex looks like the following:
✍🏻 python
# Step 1: import libraries
import pymongo
import re
import requests
from bs4 import BeautifulSoup

# Step 2: MongoDB connection
conn = pymongo.MongoClient()                    # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21                          # create/use the cine21 database
actor_collection = actor_db.actor_collection    # create/use the actor_collection collection

# Step 3: request the crawling URL (ranking page: http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict()                              # pass the Form Data fields as a dictionary
post_data['section'] = 'actor'
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data)        # send the POST request

# Step 4: parsing and data extraction (CSS selectors)
soup = BeautifulSoup(res.content, 'html.parser')  # parse the response HTML
actors = soup.select('li.people_li div.name')     # grab every actor-name element
for actor in actors:
    actor_link = 'http://www.cine21.com' + actor.select_one('a').attrs['href']
    res_actor = requests.get(actor_link)
    soup_actor = BeautifulSoup(res_actor.content, 'html.parser')
    default_info = soup_actor.select_one('ul.default_info')
    actor_details = default_info.select('li')
    for actor_item in actor_details:
        print(actor_item.select_one('span.tit').text)                           # label text inside <span class="tit">
        actor_item_value = re.sub(r'<span.*?>.*?</span>', '', str(actor_item))  # drop the span block -> <li>value</li>
        actor_item_value = re.sub(r'<.*?>', '', actor_item_value)               # strip remaining tags -> value
        print(actor_item_value)
✍🏻 python
# Step 1: import libraries
import pymongo
import re
import requests
from bs4 import BeautifulSoup

# Step 2: MongoDB connection
conn = pymongo.MongoClient()                    # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21                          # create/use the cine21 database
actor_collection = actor_db.actor_collection    # create/use the actor_collection collection

# Step 3: request the crawling URL (ranking page: http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict()                              # pass the Form Data fields as a dictionary
post_data['section'] = 'actor'
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data)        # send the POST request

# Step 4: parsing and data extraction (CSS selectors)
soup = BeautifulSoup(res.content, 'html.parser')  # parse the response HTML
actors = soup.select('li.people_li div.name')     # grab every actor-name element
actors_info_list = list()                          # list that will hold every actor's detail info
for actor in actors:
    actor_link = 'http://www.cine21.com' + actor.select_one('a').attrs['href']
    res_actor = requests.get(actor_link)
    soup_actor = BeautifulSoup(res_actor.content, 'html.parser')
    default_info = soup_actor.select_one('ul.default_info')
    actor_details = default_info.select('li')
    actor_info_dict = dict()                       # per-actor details collected in JSON(dict) form
    for actor_item in actor_details:
        actor_item_key = actor_item.select_one('span.tit').text                 # label text inside <span class="tit">
        actor_item_value = re.sub(r'<span.*?>.*?</span>', '', str(actor_item))  # drop the span block -> <li>value</li>
        actor_item_value = re.sub(r'<.*?>', '', actor_item_value)               # strip remaining tags -> value
        actor_info_dict[actor_item_key] = actor_item_value
    actors_info_list.append(actor_info_dict)       # append each actor's dict to the list
print(actors_info_list)
✍🏻 python
# Step 1: import libraries
import pymongo
import re
import requests
from bs4 import BeautifulSoup

# Step 2: MongoDB connection
conn = pymongo.MongoClient()                    # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21                          # create/use the cine21 database
actor_collection = actor_db.actor_collection    # create/use the actor_collection collection

# Step 3: request the crawling URL (ranking page: http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict()                              # pass the Form Data fields as a dictionary
post_data['section'] = 'actor'
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'
post_data['page'] = 1
res = requests.post(url, data=post_data)        # send the POST request

# Step 4: parsing and data extraction (CSS selectors)
soup = BeautifulSoup(res.content, 'html.parser')  # parse the response HTML
actors = soup.select('li.people_li div.name')     # grab every actor-name element
actors_info_list = list()                          # list that will hold every actor's detail info
for actor in actors:
    actor_link = 'http://www.cine21.com' + actor.select_one('a').attrs['href']
    res_actor = requests.get(actor_link)
    soup_actor = BeautifulSoup(res_actor.content, 'html.parser')
    default_info = soup_actor.select_one('ul.default_info')
    actor_details = default_info.select('li')
    actor_info_dict = dict()                       # per-actor details collected in JSON(dict) form
    for actor_item in actor_details:
        actor_item_key = actor_item.select_one('span.tit').text                 # label text inside <span class="tit">
        actor_item_value = re.sub(r'<span.*?>.*?</span>', '', str(actor_item))  # drop the span block -> <li>value</li>
        actor_item_value = re.sub(r'<.*?>', '', actor_item_value)               # strip remaining tags -> value
        actor_info_dict[actor_item_key] = actor_item_value
    actors_info_list.append(actor_info_dict)       # append each actor's dict to the list

# extract actor name, box-office index (흥행지수), and movies appeared in (출연영화)
# actors = soup.select('li.people_li div.name')   # already selected above, so omitted
hits = soup.select('ul.num_info > li > strong')    # box-office index elements
movies = soup.select('ul.mov_list')                # movie-list elements
for index, actor in enumerate(actors):
    print("배우이름:", re.sub(r'\(\w*\)', '', actor.text))       # actor name
    print("흥행지수:", int(hits[index].text.replace(',', '')))   # box-office index, '12,345' -> 12345
    movie_titles = movies[index].select('li a span')
    movie_title_list = list()
    for movie_title in movie_titles:
        movie_title_list.append(movie_title.text)
    print("출연영화:", movie_title_list)                         # movies appeared in
actors_info_list = list() must be moved as far up in the script as possible. Otherwise the list is re-initialized every time the page changes, and the results crawled from the previous pages are lost.
✍🏻 python
# Step 1: import libraries
import pymongo
import re
import requests
from bs4 import BeautifulSoup

actors_info_list = list()        # list that will hold every actor's detail info (initialized once, before the page loop)

# Step 2: MongoDB connection
conn = pymongo.MongoClient()                    # connect to MongoDB with pymongo (localhost:27017)
actor_db = conn.cine21                          # create/use the cine21 database
actor_collection = actor_db.actor_collection    # create/use the actor_collection collection

# Step 3: request the crawling URL (ranking page: http://www.cine21.com/rank/person)
url = 'http://www.cine21.com/rank/person/content'
post_data = dict()               # pass the Form Data fields as a dictionary
post_data['section'] = 'actor'
post_data['period_start'] = '2020-02'
post_data['gender'] = 'all'

for page in range(1, 21):        # crawl ranking pages 1 through 20
    post_data['page'] = page
    res = requests.post(url, data=post_data)    # send the POST request

    # Step 4: parsing and data extraction (CSS selectors)
    soup = BeautifulSoup(res.content, 'html.parser')

    # Step 5: collect actor names, box-office index, movies, and rankings from the page
    actors = soup.select('li.people_li div.name')          # every actor name
    hits = soup.select('ul.num_info > li > strong')        # every box-office index (흥행지수)
    movies = soup.select('ul.mov_list')                    # every movie list (출연영화)
    rankings = soup.select('li.people_li > span.grade')    # every ranking (랭킹)

    # Step 6: per-actor crawling
    for index, actor in enumerate(actors):
        actor_name = re.sub(r'\(\w*\)', '', actor.text)          # strip the parenthesized suffix from the name
        actor_hits = int(hits[index].text.replace(',', ''))      # '12,345' -> 12345
        movie_titles = movies[index].select('li a span')
        movie_title_list = list()
        for movie_title in movie_titles:
            movie_title_list.append(movie_title.text)

        # store name, box-office index, movies, and ranking as a dict (JSON-like)
        actor_info_dict = dict()                              # per-actor details collected in JSON(dict) form
        actor_info_dict['배우이름'] = actor_name              # actor name
        actor_info_dict['흥행지수'] = actor_hits              # box-office index
        actor_info_dict['출연영화'] = movie_title_list        # movies appeared in
        actor_info_dict['랭킹'] = rankings[index].text        # ranking

        # fetch each actor's detail page and add the profile fields to the dict
        actor_link = 'http://www.cine21.com' + actor.select_one('a').attrs['href']  # per-actor link
        res_actor = requests.get(actor_link)                                        # request the detail page
        soup_actor = BeautifulSoup(res_actor.content, 'html.parser')                # parsing
        default_info = soup_actor.select_one('ul.default_info')
        actor_details = default_info.select('li')
        for actor_item in actor_details:
            actor_item_key = actor_item.select_one('span.tit').text                 # label text inside <span class="tit">
            actor_item_value = re.sub(r'<span.*?>.*?</span>', '', str(actor_item))  # drop the span block -> <li>value</li>
            actor_item_value = re.sub(r'<.*?>', '', actor_item_value)               # strip remaining tags -> value
            actor_info_dict[actor_item_key] = actor_item_value
        actors_info_list.append(actor_info_dict)     # append each actor's dict to the list

print(actors_info_list)                              # print the collected data to the console

# insert the data into MongoDB
actor_collection.insert_many(actors_info_list)
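To confirm the insert_many() call worked, the stored documents can be read back; a minimal sketch assuming the same local connection and the Korean field names used above (흥행지수 = box-office index).
✍🏻 python
import pymongo

conn = pymongo.MongoClient()                       # same local MongoDB (localhost:27017)
actor_collection = conn.cine21.actor_collection

print(actor_collection.count_documents({}))        # number of stored actor documents

# top 3 actors by box-office index, showing only name and score
cursor = actor_collection.find({}, {'_id': 0, '배우이름': 1, '흥행지수': 1}).sort('흥행지수', -1).limit(3)
for doc in cursor:
    print(doc)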