[EDA] mini project 10 _ Internet Usage Rate by Country (test 04)

jaam._.mini · January 4, 2024

📊 EDA. Exploratory Data Analysis


Problem introduction and data preparation

Original data sources

Target Data (CSV): Global Internet Usage (internet usage rate by country)

Source: Kaggle
Download: archive.zip

Reference Data01 (HTML link): population by country

Source: Wiki

Reference Data02 (CSV): ISO codes and region classification by country

Source: Kaggle
Download: archive.zip

Notes

The three datasets above were created at different times, so the results derived in this test do not match real-world figures.
If a problem includes a hint, you are free not to use it.
This test includes a step that fetches online data from Wiki.
- If that page changes, your results may differ from the answers shared later.
- The data as it was when the problems were written is provided together with the answers, at the path below.
  - Population data by country
    [DS]EDA Level Test_week 4/solution/datas/wiki_population.csv
This test uses a library called pycountry.
- https://pypi.org/project/pycountry/#description
- https://github.com/flyingcircusio/pycountry
- If the library is not installed, install it with one of the commands below.
  pip install pycountry or conda install -c conda-forge pycountry
```
!pip install pycountry
!pip install lxml
```

📌 Another way to build a pivot_table
Reference: https://tourspace.tistory.com/511, [[ ]] column selection

df_result = pd.pivot_table(df_draft, index=['region', 'sub_region']) # group rows by the index columns
df_result = df_result[['weighted_ave_internet', 'weighted_ave_income']] # keep only the columns of interest
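To make the note above concrete, here is a minimal sketch of what `pd.pivot_table` does when only `index=` is given. The frame and its numbers are made up for illustration; only the column names come from this post. Note that the default aggregation is a plain (unweighted) mean.

```python
import pandas as pd

# Hypothetical per-row values, only to show the shape of the result.
df_draft = pd.DataFrame({
    'region': ['Asia', 'Asia', 'Europe'],
    'sub_region': ['Eastern Asia', 'Eastern Asia', 'Northern Europe'],
    'weighted_ave_internet': [40.0, 60.0, 90.0],
    'weighted_ave_income': [1000.0, 3000.0, 30000.0],
})

# index= groups the rows; the default aggfunc is the (unweighted) mean.
pivot = pd.pivot_table(
    df_draft,
    index=['region', 'sub_region'],
    values=['weighted_ave_internet', 'weighted_ave_income'],
)

# Double brackets select columns and fix their order.
pivot = pivot[['weighted_ave_internet', 'weighted_ave_income']]
```

Passing `values=` explicitly keeps non-numeric columns out of the aggregation.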

Step 1

Loading and preprocessing the Target Data

Problem 1-1) Target Data preprocessing 01 (5 points)

Handle the null values in the DataFrame loaded from the CSV (df_target) according to the conditions below.

Condition 1: If any of the 'incomeperperson', 'internetuserate', or 'urbanrate' columns is null, drop that row.

Condition 2: Do not change the index or the row order.

df_target = pd.read_csv('./datas/gapminder_internet.csv')
df_target

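# Module
import pandas as pd

# Load the data
df_target = pd.read_csv('./datas/gapminder_internet.csv')

# Handle nulls: drop every row that has a null in any of the three columns
# (dropna keeps the original index labels and row order)
df_target = df_target.dropna(subset=['incomeperperson', 'internetuserate', 'urbanrate'])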
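As a quick check of Condition 2, the sketch below runs the same `dropna(subset=...)` call on a tiny made-up frame (a stand-in for gapminder_internet.csv, which is not reproduced here) and shows that the surviving rows keep their original index labels.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for gapminder_internet.csv.
df_demo = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'incomeperperson': [100.0, np.nan, 300.0, 400.0],
    'internetuserate': [10.0, 20.0, np.nan, 40.0],
    'urbanrate': [50.0, 60.0, 70.0, 80.0],
})

cols = ['incomeperperson', 'internetuserate', 'urbanrate']
df_clean = df_demo.dropna(subset=cols)

# Rows 1 and 2 are gone; labels 0 and 3 are kept as-is (no renumbering).
remaining_index = df_clean.index.tolist()
nulls_left = int(df_clean[cols].isna().sum().sum())
```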

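Problem 1-2) Target Data preprocessing 02 (5 points)

Using the DataFrame from 1-1 (df_target) and the df_target_change_list below, rename countries (column: 'country') according to the conditions below.

Note: The country names are changed so that the pycountry library can later (Problem 1-3) look up country codes (e.g. Republic of Korea: KR).

df_target_change_list below holds tuples that pair the index of each df_target row to change with its new country name.

Condition 1: Use df_target_change_list to change the country names in df_target.

Condition 2: Do not change the index or the row order.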

df_target_change_list = [
    (33, 'Cabo Verde'),
    (35, 'Central African Republic'),
    (41, 'Congo, The Democratic Republic of the'),
    (42, 'Congo'),
    (45, "Côte d'Ivoire"),
    (49, 'Czech Republic'),
    (53, 'Dominican Republic'),
    (83, 'Hong Kong'),
    (100, 'Korea, Republic of'),
    (103, "Lao People's Democratic Republic"),
    (112, 'Macao'),
    (113, 'Republic of North Macedonia'),
    (125, 'Federated States of Micronesia'),
    (183, 'Eswatini'),
    (210, 'Yemen')
]

# Change values in the 'country' column
# .at accesses a single cell by row label and column name
for idx, name in df_target_change_list:
    df_target.at[idx, 'country'] = name
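One thing to watch with the loop above: the list refers to rows by index label, and some labels may already have been removed by `dropna` in 1-1. The sketch below, on a made-up frame with a hypothetical `change_list`, skips labels that no longer exist instead of assigning to them.

```python
import pandas as pd

# Hypothetical frame whose label 1 was already dropped by dropna().
df_demo = pd.DataFrame(
    {'country': ['Cape Verde', 'Czech Rep.', 'Yemen, Rep.']},
    index=[0, 2, 3],
)
change_list = [(0, 'Cabo Verde'), (1, 'Nowhere'), (3, 'Yemen')]

skipped = []
for idx, name in change_list:
    if idx in df_demo.index:
        df_demo.at[idx, 'country'] = name
    else:
        skipped.append(idx)  # label no longer exists; do not create a row
```

Printing `skipped` afterwards is a cheap way to confirm that every intended rename was applied.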

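Problem 1-3) Target Data preprocessing 03 (10 points)

Using the DataFrame from 1-2 (df_target) and the pycountry library, obtain country codes according to the conditions below.

Reference: pycountry.countries

Condition 1: Add a 'code' column to df_target and fill in the two-letter country code (alpha_2) that matches each row's country name.

Condition 2: Try the exact lookup (pycountry.countries.get(name=country_name)) first; if it returns no result, fall back to the fuzzy search (pycountry.countries.search_fuzzy(country_name)).

Condition 3: When the fuzzy search is used, take the country code of the first item (index=0) in the result list.

Condition 4: Do not change the index or the row order.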

# Module
import pycountry

# Example 1: get
country_name = 'korea, republic of'
country = pycountry.countries.get(name=country_name)
country

pycountry.countries.get(name=country_name).alpha_2

# Example 2: search_fuzzy
country_name = 'korea, republic of'
country = pycountry.countries.search_fuzzy(country_name)
country

pycountry.countries.search_fuzzy(country_name)[0].alpha_2

# Get country codes with the pycountry module
    # loc accesses a value via df.loc[index label, 'column name']
    # loc[condition, 'column name'] = new value (also works for adding a new column)
# LookupError: turkey was raised
    # Changed 'Turkey' to 'Türkiye': the error seems to come from the country's renaming
    # alpha_2 reference: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes

for idx, row in df_target.iterrows():
    country_name = row['country']

    # Turkey >> 'Türkiye'
    if country_name == 'Turkey':
        country_name = 'Türkiye'

    # Exact lookup first; get() returns None when nothing matches
    # (older pycountry releases raise KeyError instead)
    try:
        country = pycountry.countries.get(name=country_name)
    except KeyError:
        country = None

    # Fall back to the fuzzy search and take the first result
    if country is None:
        country = pycountry.countries.search_fuzzy(country_name)[0]

    df_target.loc[idx, 'code'] = country.alpha_2
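The lookup order in Conditions 2 and 3 can also be isolated in a small helper, which makes it easier to test. This is only a sketch: `resolve_alpha2` and `StubCountries` are hypothetical names, and the stub is a crude substring matcher that stands in for `pycountry.countries` so the example runs without the library. In real use, `pycountry.countries` would presumably be passed in place of the stub.

```python
from types import SimpleNamespace


def resolve_alpha2(name, countries):
    """Exact lookup first, then the first fuzzy hit (conditions 2 and 3)."""
    try:
        country = countries.get(name=name)
    except KeyError:
        country = None
    if country is None:
        country = countries.search_fuzzy(name)[0]
    return country.alpha_2


class StubCountries:
    """Hypothetical stand-in for pycountry.countries, so this runs offline."""

    _rows = [
        SimpleNamespace(name='Korea, Republic of', alpha_2='KR'),
        SimpleNamespace(name='Yemen', alpha_2='YE'),
    ]

    def get(self, name):
        wanted = name.lower()
        return next((c for c in self._rows if c.name.lower() == wanted), None)

    def search_fuzzy(self, query):
        # Substring match only; the real fuzzy search is more elaborate.
        hits = [c for c in self._rows if query.lower() in c.name.lower()]
        if not hits:
            raise LookupError(query)
        return hits


stub = StubCountries()
exact_code = resolve_alpha2('Yemen', stub)   # found by get()
fuzzy_code = resolve_alpha2('Korea', stub)   # get() misses, fuzzy search hits
```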

Step 2

Loading, preprocessing, and merging Reference Data01

Problem 2-1) Fetching web data (10 points)

Load the population-by-country table at the link below into a pandas DataFrame, following the condition below.

Link: https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

Condition: Name the DataFrame variable 'df_population'.

Hint: pandas has a method that reads the tables on a web page and turns them into DataFrames.

# pandas has a method (pd.read_html) that reads the tables on a web page into DataFrames
# [Reference] https://blog.naver.com/m_biz/223062483264 , https://wwwi.tistory.com/376

# Modules
from io import StringIO

import pandas as pd
from selenium import webdriver

# Open the page (Selenium 4.6+ can locate a matching chromedriver on its own)
driver = webdriver.Chrome()
url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'
driver.get(url)

# HTML of the page as rendered by selenium
html = driver.page_source

driver.quit()

# read_html returns a list of DataFrames, one per <table>;
# [0] takes the first table on the page
web_table = pd.read_html(StringIO(html))[0]
web_table

# Rename the columns (assumes the table still has these 7 columns)
columns = ['Rank', 'Country / Dependency', 'Population', '% of the world', 'Date', 'Source (official or from the United Nations)', 'Notes']
web_table.columns = columns

# Copy so that web_table itself stays untouched
df_population = web_table.copy()

Problem 2-2) Population Data preprocessing 01 (5 points)

Modify the DataFrame from 2-1 (df_population) according to the conditions below.

Condition 1: Keep only two columns, 'Country / Dependency' and 'Population'.

Condition 2: Rename the columns as follows.

'Country / Dependency' -> 'country'

'Population' -> 'population'

Condition 3: Drop the first row (index=0), whose 'country' value is 'World'.

Condition 4: Do not change the index or the row order.

# Keep only 'Country / Dependency' and 'Population'
df_population = df_population.drop(columns=['Rank', '% of the world', 'Date', 'Source (official or from the United Nations)', 'Notes'])
df_population.tail(1)

# Rename the columns
columns = ['country', 'population']
df_population.columns = columns

df_population.tail(1)

# Drop the first row (index=0), where country is 'World'
df_population = df_population.drop(0)
df_population.head(2)
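The same three conditions can be written so that they depend less on the exact layout of the Wikipedia table, which changes over time. The sketch below uses a made-up slice of the table: it selects the two columns by name, renames them with a dict, and drops the 'World' row by value rather than by position.

```python
import pandas as pd

# Hypothetical slice of the Wikipedia table (values are made up).
web_table = pd.DataFrame({
    'Rank': ['-', '1', '2'],
    'Country / Dependency': ['World', 'Aland', 'Bland'],
    'Population': [1000, 600, 400],
    'Notes': ['', '', ''],
})

df_population = (
    web_table[['Country / Dependency', 'Population']]
    .rename(columns={'Country / Dependency': 'country', 'Population': 'population'})
)

# Drop by value instead of by position, in case 'World' is not at index 0.
df_population = df_population[df_population['country'] != 'World']
```

The remaining rows keep their original index labels, so Condition 4 still holds.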

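Problem 2-3) Population Data preprocessing 02 (5 points)

Using the DataFrame from 2-2 (df_population) and the df_population_change_dict below, rename countries (column: 'country') according to the conditions below.

Note: The country names are changed so that this frame can later be merged with df_target from Step 1.

Only countries that appear in df_target from Step 1 are renamed.

Some countries are written differently in the two sources, so the names are unified to match df_target.

df_population_change_dict below maps each existing country name in df_population (key) to its new name (value).

Condition 1: Use df_population_change_dict to change the country names in df_population.

Condition 2: Do not change the index or the row order.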

df_population_change_dict = {
    'Bermuda (United Kingdom)': 'Bermuda',
    'Cape Verde': 'Cabo Verde',
    'DR Congo': 'Congo, The Democratic Republic of the',
    'Ivory Coast': "Côte d'Ivoire",
    'Greenland (Denmark)': 'Greenland',
    'Hong Kong (China)': 'Hong Kong',
    'South Korea': 'Korea, Republic of',
    'Laos': "Lao People's Democratic Republic",
    'Macau (China)': 'Macao',
    'North Macedonia': 'Republic of North Macedonia',
    'Micronesia': 'Federated States of Micronesia',
    'Puerto Rico (United States)': 'Puerto Rico',
    'Slovakia': 'Slovak Republic',
    'East Timor': 'Timor-Leste',
}

# Rename countries using the key-value mapping
# [Reference] https://blockdmask.tistory.com/557

df_population['country'] = df_population['country'].replace(df_population_change_dict)
df_population
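Because the Wikipedia page is edited over time, some keys in the mapping may no longer match any row. A small sketch of how that could be checked before replacing, using a made-up frame and a shortened hypothetical `change_dict`:

```python
import pandas as pd

df_demo = pd.DataFrame({'country': ['South Korea', 'Laos', 'Japan']})
change_dict = {
    'South Korea': 'Korea, Republic of',
    'Laos': "Lao People's Democratic Republic",
    'East Timor': 'Timor-Leste',  # not present in df_demo
}

# Keys that match nothing usually mean the source spelling has changed.
unmatched = sorted(set(change_dict) - set(df_demo['country']))

# Series.replace with a dict swaps whole values and leaves the rest alone.
df_demo['country'] = df_demo['country'].replace(change_dict)
```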

Problem 2-4) Merging the data (10 points)

# Ways to combine: pd.merge(df_left, df_right, ...) / pd.concat([df_left, df_right], axis=1)
# merge() parameters: how= is the join type ('inner', 'left', ...), on= is the key column to join on

# Join rows whose country name ('country') matches: on='country'
# Use an intersection join: how='inner'
df = pd.merge(df_target, df_population, on='country', how='inner')

# Sort by the 'code' column in ascending order (sort_values returns a new DataFrame)
df = df.sort_values(by='code', ascending=True)

# Reset the index
df = df.reset_index(drop=True)
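An inner join silently discards countries whose names differ between the two frames. The sketch below, on made-up frames, first runs an outer join with `indicator=True` to list the rows that would be lost, and then performs the inner join, sort, and index reset in one chain.

```python
import pandas as pd

df_left = pd.DataFrame({'country': ['Korea, Republic of', 'Turkey', 'Yemen'],
                        'code': ['KR', 'TR', 'YE']})
df_right = pd.DataFrame({'country': ['Korea, Republic of', 'Türkiye', 'Yemen'],
                         'population': [51, 85, 33]})  # made-up numbers

# An outer join with indicator=True labels where each row came from.
check = pd.merge(df_left, df_right, on='country', how='outer', indicator=True)
lost = check.loc[check['_merge'] != 'both', 'country'].tolist()

# The inner join itself, sorted by code with a fresh 0..n-1 index.
df = (
    pd.merge(df_left, df_right, on='country', how='inner')
    .sort_values(by='code')
    .reset_index(drop=True)
)
```

Here `lost` contains both spellings of the same country, which points at a name that still needs to be unified.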

Step 3

Loading, preprocessing, and merging Reference Data02

Problem 3-1) Region Data preprocessing 01 (10 points)

Modify the DataFrame df_region (the cell that loads it is not included in this post) according to the conditions below.

Condition 1

The country code of one country ('Namibia') is 'NA', so when the CSV is read with the default settings that code is parsed as NaN.

Change that country's code to 'NA'. (The value must be of type string.)

Note

The country code is in the 'alpha-2' column before the renaming in Condition 3 and in the 'code' column after it.

The country-name column is 'name'.

Condition 2

df_region_drop_col below lists the columns that are not needed for the analysis.

Drop those columns from df_region.

Condition 3

df_region_rename_dict below maps existing column names (keys) to new column names (values).

Rename the columns of df_region accordingly.

Condition 4: Do not change the index or the row order.

df_region_drop_col = [
                    'country-code',
                    'alpha-3',
                    'iso_3166-2',
                    'region-code',
                    'sub-region-code',
                    'intermediate-region-code'
                    ]

df_region_rename_dict = {
                        'alpha-2': 'code',
                        'sub-region': 'sub_region',
                        'intermediate-region': 'intermediate_region',
                        }

# Set Namibia's country code to the string 'NA'
df_region.loc[df_region['name'] == 'Namibia', 'alpha-2'] = 'NA'
df_region.tail(1)

# Drop the columns listed in df_region_drop_col
df_region.drop(columns=df_region_drop_col, inplace=True)
df_region.tail(1)

# Rename columns using the key-value mapping
df_region.rename(columns=df_region_rename_dict, inplace=True)
df_region
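The Namibia issue in Condition 1 comes from `pd.read_csv` treating the literal string 'NA' as a missing value by default. The sketch below reproduces it on a two-row in-memory CSV (made-up rows in a simplified version of the region file's layout) and shows one possible alternative to patching the value afterwards: restricting which strings count as missing at load time.

```python
from io import StringIO

import pandas as pd

# Two hypothetical rows in a simplified layout of the region CSV.
csv_text = (
    "name,alpha-2,region,intermediate-region\n"
    "Namibia,NA,Africa,Southern Africa\n"
    "Norway,NO,Europe,\n"
)

# Default parsing: the literal string 'NA' is read as a missing value.
df_default = pd.read_csv(StringIO(csv_text))
namibia_default = df_default.loc[0, 'alpha-2']

# Only treat empty fields as missing, so 'NA' survives as a string.
df_fixed = pd.read_csv(StringIO(csv_text), keep_default_na=False, na_values=[''])
namibia_fixed = df_fixed.loc[0, 'alpha-2']
```

The test itself asks for the value to be fixed after loading, so this is a variation rather than the expected answer.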

Problem 3-2) Merging the data (10 points)

df_rename_dict = {
                'incomeperperson': 'income_per_person',
                'internetuserate': 'internet_use_rate',
                }

df_drop_col = ['name']

new_col_order = [
                'code',
                'country',
                'population',
                'income_per_person',
                'internet_use_rate',
                'urbanrate',
                'region',
                'sub_region',
                'intermediate_region'
                ]

# Condition 1: Join df and df_region on rows whose country code ('code' column) matches.
# Condition 2: Use an intersection join (keep only country codes present in both).
# Condition 3: Combine the DataFrames column-wise.
df_merge = pd.merge(df, df_region, on='code', how='inner')


# Condition 4: Sort by the 'code' column in ascending order.
df_merge.sort_values(by='code', inplace=True)

# Condition 5: Rename the columns of df using df_rename_dict.
df_merge = df_merge.rename(columns=df_rename_dict)

# Condition 6: Reorder the columns of df following new_col_order.
# Condition 7: Drop the 'name' column.
# Selecting new_col_order does both at once, since 'name' is not in the list.
df_merge = df_merge[new_col_order]
# df_merge = df_merge.drop(columns=['name'])

# Condition 8: Reset the index.
df_merge.reset_index(drop=True, inplace=True)

# Condition 9 (as transcribed): the result DataFrame should have 6 columns ('country', 'incomeperperson', 'internetuserate', 'urbanrate', 'code', 'population').
# The frame built here has the 9 columns listed in new_col_order.
df = df_merge.copy()

Step 4

Analyzing the data (weighted mean & variance)

# Modules
import numpy as np

# Work on a copy
df_copy = df.copy()

# Weights and weighted average
def weighted_ave(group):
    # values and weights
    a = group[['internet_use_rate', 'income_per_person']]
    b = group['population']
    # weighted average of each value column
    return np.average(a, weights=b, axis=0)

# Apply weighted_ave to each group
# Series of arrays > DataFrame: .to_frame() [Reference] https://seong6496.tistory.com/364
# groupby().apply() passes each group to the function as a DataFrame
df_draft = df_copy.groupby(['region', 'sub_region']).apply(weighted_ave).to_frame()
df_draft

# Add the columns 'weighted_ave_internet' and 'weighted_ave_income'
df_draft['weighted_ave_internet'] = [i[0] for i in df_draft[0]]
df_draft['weighted_ave_income'] = [i[1] for i in df_draft[0]]
df_draft

# Drop the helper column 0
del df_draft[0]
df_result = df_draft

df_result
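The population-weighted average of a column is sum(value × population) / sum(population) within each group. As a variation on the code above, the function can return a named `pd.Series`, in which case `apply` builds the two result columns directly and the helper column 0 is never created. The frame below is made up so the result can be checked by hand; this is a sketch, not the test's reference answer.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({  # made-up numbers
    'region': ['Asia', 'Asia', 'Asia'],
    'sub_region': ['Eastern Asia', 'Eastern Asia', 'Southern Asia'],
    'internet_use_rate': [80.0, 20.0, 10.0],
    'income_per_person': [1000.0, 3000.0, 500.0],
    'population': [1, 3, 2],
})


def weighted_ave(group):
    # sum(value * population) / sum(population), per value column
    values = group[['internet_use_rate', 'income_per_person']].to_numpy()
    averages = np.average(values, weights=group['population'], axis=0)
    return pd.Series(averages, index=['weighted_ave_internet', 'weighted_ave_income'])


# Returning a named Series makes apply() build the two columns directly.
grouped = df_demo.groupby(['region', 'sub_region'])[
    ['internet_use_rate', 'income_per_person', 'population']
]
df_result = grouped.apply(weighted_ave)
```

For Eastern Asia in this toy data, the internet rate is (80×1 + 20×3) / 4 = 35 and the income is (1000×1 + 3000×3) / 4 = 2500.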

Problem 4-2) Weighted average under specific conditions (15 points)

Compute the internet usage rate (internet_use_rate) and income per person (income_per_person) for Eastern Asia and Southern Asia, excluding China and India, according to the conditions below.

Condition 1: Use the population column as the weights.

Condition 2: Show the internet usage rate (internet_use_rate) and income per person (income_per_person) of Asia (region) - Eastern Asia, Southern Asia (sub_region), excluding China (country code: 'CN') and India (country code: 'IN'), with the index and column layout below.

index = ['region', 'sub_region']

column = ['weighted_ave_internet', 'weighted_ave_income']

# Exclude the China & India codes
# Intersection: A[A.isin(B)] / difference: A[~A.isin(B)]
df_draft2 = df[~df['code'].isin(['CN', 'IN'])]

# Check: the row count drops from 180 to 178
df_draft2

# Keep only Asia (region) - Eastern Asia, Southern Asia
df_draft2 = df_draft2[df_draft2['sub_region'].isin(['Eastern Asia', 'Southern Asia'])]

df_draft2

# Weights and weighted average, with China & India excluded
def weighted_ave(group):
    # values and weights
    a = group[['internet_use_rate', 'income_per_person']]
    b = group['population']
    # weighted average of each value column
    return np.average(a, weights=b, axis=0)

# Apply weighted_ave to each group
df_draft2 = df_draft2.groupby(['region', 'sub_region']).apply(weighted_ave).to_frame()
df_draft2

# # Build a pivot_table
# df_result = pd.pivot_table(df_draft2, index=['region', 'sub_region']) # group rows by the index columns
# df_result = df_result[['weighted_ave_internet', 'weighted_ave_income']] # keep only the columns of interest

# Add the columns 'weighted_ave_internet' and 'weighted_ave_income'
df_draft2['weighted_ave_internet'] = [i[0] for i in df_draft2[0]]
df_draft2['weighted_ave_income'] = [i[1] for i in df_draft2[0]]
df_draft2

# Drop the helper column 0
del df_draft2[0]
df_result = df_draft2

df_result
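The two filtering steps in 4-2 (exclude two codes, keep two sub-regions) can also be combined into a single boolean mask. A minimal sketch on made-up rows; `excluded_codes` and `wanted_sub_regions` are names introduced here for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame({  # made-up rows
    'code': ['CN', 'JP', 'IN', 'PK', 'FR'],
    'sub_region': ['Eastern Asia', 'Eastern Asia', 'Southern Asia',
                   'Southern Asia', 'Western Europe'],
})

excluded_codes = ['CN', 'IN']
wanted_sub_regions = ['Eastern Asia', 'Southern Asia']

# ~ negates the first mask; & requires both conditions to hold.
mask = ~df_demo['code'].isin(excluded_codes) & df_demo['sub_region'].isin(wanted_sub_regions)
df_filtered = df_demo[mask]
```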
