Python Crawling (2)

김용환 · November 19, 2023


Continuing from Part 1, I looked for another way to collect App Store reviews and tried it out.

The second method: App Store reviews can be crawled through the RSS feed.

import pandas as pd
import xmltodict
import requests
import os

Function that gets the index of the last review page

def get_url_index(url):
    response = requests.get(url).content.decode('utf8')
    xml = xmltodict.parse(response)

    last_url = [l['@href'] for l in xml['feed']['link'] if (l['@rel'] == 'last')][0]
    last_index = [int(s.replace('page=', '')) for s in last_url.split('/') if ('page=' in s)][0]

    return last_index
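
The page-number step can be tried on its own without a network call. The sketch below is only an illustration: `extract_page_index` is a hypothetical helper name, and the sample href is a made-up URL shaped like the feed's `last` link.

```python
import re


def extract_page_index(href):
    """Return the integer after 'page=' in a feed link, or None if absent."""
    match = re.search(r'page=(\d+)', href)
    return int(match.group(1)) if match else None


# Made-up href shaped like the feed's rel="last" link (not a real response).
sample_href = ('https://itunes.apple.com/kr/rss/customerreviews/'
               'page=10/id=378084485/sortby=mostrecent/xml')
print(extract_page_index(sample_href))
```

Returning `None` when no `page=` segment exists makes the "no reviews" case explicit instead of relying on an `IndexError`.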

Function that fetches all reviews from the App Store

def appstore_crawler(appid, outfile='./appstore_reviews.csv'):
    url = 'https://itunes.apple.com/kr/rss/customerreviews/page=1/id=%i/sortby=mostrecent/xml' % appid
    try:
        last_index = get_url_index(url)
    except Exception as e:
        print (url)
        print ('\tNo Reviews: appid %i' %appid)
        print ('\tException:', e)
        return

    result = list()
    # Iterate over the pages the feed actually reports instead of a fixed 100.
    for idx in range(1, last_index + 1):
        url = "https://itunes.apple.com/kr/rss/customerreviews/page=%i/id=%i/sortby=mostrecent/xml?urlDesc=/customerreviews/id=%i/sortBy=mostRecent/xml" % (idx, appid, appid)
        print(url)

        response = requests.get(url).content.decode('utf8')
        try:
            xml = xmltodict.parse(response)
        except Exception as e:
            print ('\tXml Parse Error %s\n\tSkip %s :' %(e, url))
            continue

        try:
            num_reviews = len(xml['feed']['entry'])
        except Exception as e:
            print ('\tNo Entry', e)
            continue

        # xmltodict returns a dict for a single <entry> and a list for several;
        # normalize to a list so both cases share one code path.
        entries = xml['feed']['entry']
        if isinstance(entries, dict):
            entries = [entries]

        for entry in entries:
            result.append({
                'USER': entry['author']['name'],
                'DATE': entry['updated'],
                'STAR': int(entry['im:rating']),
                'LIKE': int(entry['im:voteSum']),
                'TITLE': entry['title'],
                'REVIEW': entry['content'][0]['#text'],
            })

    res_df = pd.DataFrame(result)
    res_df['DATE'] = pd.to_datetime(res_df['DATE'], format="%Y-%m-%dT%H:%M:%S")
    res_df.to_csv(outfile, encoding='utf-8-sig', index=False)
    print ('Save reviews to file: %s \n' %(outfile))
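
One thing to watch in the `DATE` conversion: if the feed's `updated` values carry a UTC offset (an assumption here, e.g. `2023-11-18T05:12:33-07:00`), a format string without `%z` may not match on newer pandas versions. Below is a minimal standard-library sketch of parsing such a value, using a made-up timestamp rather than real feed output.

```python
from datetime import datetime, timezone

# Made-up timestamp in ISO 8601 form with a UTC offset (assumed feed format).
raw = '2023-11-18T05:12:33-07:00'

parsed = datetime.fromisoformat(raw)       # keeps the -07:00 offset
as_utc = parsed.astimezone(timezone.utc)   # normalize to UTC for comparison

print(parsed.isoformat())
print(as_utc.isoformat())
```

With pandas, passing `utc=True` to `pd.to_datetime` and omitting `format` is one way to let it handle the offset, though which option fits depends on the pandas version in use.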

Main function

if __name__ == '__main__':
    # Baemin (배민) app_id 378084485
    # Yogiyo (요기요) app_id 543831532
    app_id = 378084485
    outfile = os.path.join('appstore_' + str(app_id)+'.csv')
    appstore_crawler(app_id, outfile=outfile)

Process summary

Fetch the user reviews as XML from the RSS feed for the app ID.
Parse the XML to extract the review data, then convert it to a CSV file for use in the project.
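
The two steps above can be sketched offline with only the standard library. This is not the code used in this post: it swaps `xmltodict` and pandas for `xml.etree.ElementTree` and `csv`, and both the sample feed and the `im` namespace URI are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up feed with one review, shaped like an Atom document (assumed layout).
SAMPLE_XML = """<feed xmlns="http://www.w3.org/2005/Atom" xmlns:im="http://itunes.apple.com/rss">
  <entry>
    <updated>2023-11-18T05:12:33-07:00</updated>
    <title>Great app</title>
    <content type="text">Fast delivery.</content>
    <im:rating>5</im:rating>
    <im:voteSum>2</im:voteSum>
    <author><name>sample_user</name></author>
  </entry>
</feed>"""

# Namespace prefixes used in the XPath expressions below (assumed URIs).
NS = {'atom': 'http://www.w3.org/2005/Atom', 'im': 'http://itunes.apple.com/rss'}
FIELDS = ['USER', 'DATE', 'STAR', 'LIKE', 'TITLE', 'REVIEW']


def parse_reviews(xml_text):
    """Step 1: extract one dict per <entry> from the feed text."""
    root = ET.fromstring(xml_text)
    rows = []
    for entry in root.findall('atom:entry', NS):
        rows.append({
            'USER': entry.findtext('atom:author/atom:name', namespaces=NS),
            'DATE': entry.findtext('atom:updated', namespaces=NS),
            'STAR': int(entry.findtext('im:rating', namespaces=NS)),
            'LIKE': int(entry.findtext('im:voteSum', namespaces=NS)),
            'TITLE': entry.findtext('atom:title', namespaces=NS),
            'REVIEW': entry.findtext('atom:content', namespaces=NS),
        })
    return rows


def to_csv_text(rows):
    """Step 2: serialize the extracted rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_reviews(SAMPLE_XML)
print(to_csv_text(rows))
```

Because `findall` always returns a list, the single-review and multi-review cases need no special handling in this variant.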

References

https://kibua20.tistory.com/196
