[Zerobase] EDA test 4: Internet usage rate by country

Gracie · June 8, 2024

Zerobase Data Job School


Original data sources:
- ISO codes and regional classification by country:
  Country Mapping - ISO, Continent, Region
- Internet usage rate by country:
  Global Internet Usage
- Population by country:
  List of countries and dependencies by population

Library installation: pycountry

!pip install pycountry
!pip install lxml

https://pypi.org/project/pycountry/#description
https://github.com/pycountry/pycountry

How to look up a country

pycountry.countries.get(name = country_name) # returns a single result
pycountry.countries.search_fuzzy(country_name) # returns one or more results as a list

1. Loading and preprocessing the target data (Global Internet Usage)

import pandas as pd
import pycountry

df_target = pd.read_csv('./datas/gapminder_internet.csv')

# Drop rows with null values
df_target.dropna(inplace = True)

# Rename entries in the country column so that pycountry can find their country codes
df_target_change_list = [
    (33, 'Cabo Verde'),
    (35, 'Central African Republic'),
    (41, 'Congo, The Democratic Republic of the'),
    (42, 'Congo'),
    (45, "Côte d'Ivoire"),
    (49, 'Czech Republic'),
    (53, 'Dominican Republic'),
    (83, 'Hong Kong'),
    (100, 'Korea, Republic of'),
    (103, "Lao People's Democratic Republic"),
    (112, 'Macao'),
    (113, 'Republic of North Macedonia'),
    (125, 'Federated States of Micronesia'),
    (183, 'Eswatini'),
    (210, 'Yemen')
]

for c in df_target_change_list:
    df_target.loc[c[0], 'country'] = c[1]

# Add a code column using pycountry
# The code column holds the two-letter country code (alpha_2) matching each country name

df_target.loc[196, 'country'] = 'Türkiye'
code_list = []

for idx, row in df_target.iterrows():

    try:
        cd = pycountry.countries.get(name = row['country']).alpha_2
        print(f'try: {cd}') # printed to check the result
    except (AttributeError, LookupError): # no exact match: get() returned None or raised
        cd = pycountry.countries.search_fuzzy(row['country'])[0].alpha_2
        print(f'except: {cd}')

    code_list.append(cd)

df_target['code'] = code_list

The for loop kept throwing errors, and when I looked into it, pycountry had renamed Turkey to Türkiye.. Türkiye on the Block..
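One way to catch this kind of rename earlier is to collect the names that cannot be resolved instead of letting the loop fail. The sketch below shows the "exact match first, fuzzy match second, otherwise flag it" pattern. It does not call pycountry: `OFFICIAL_NAMES` and `resolve_code` are hypothetical stand-ins, and the fuzzy step uses `difflib` from the standard library, which will not behave the same as `search_fuzzy`.

```python
import difflib

# Hypothetical stand-in for a country database: official name -> alpha-2 code.
# pycountry is deliberately not used, so this runs without any install.
OFFICIAL_NAMES = {
    "Türkiye": "TR",
    "Korea, Republic of": "KR",
    "Congo": "CG",
    "Yemen": "YE",
}

def resolve_code(name, cutoff=0.6):
    """Return (code, how), where how is 'exact', 'fuzzy', or 'unresolved'."""
    if name in OFFICIAL_NAMES:
        return OFFICIAL_NAMES[name], "exact"
    # Fall back to the closest official name, if one is similar enough.
    close = difflib.get_close_matches(name, list(OFFICIAL_NAMES), n=1, cutoff=cutoff)
    if close:
        return OFFICIAL_NAMES[close[0]], "fuzzy"
    return None, "unresolved"

names = ["Yemen", "Turkiye", "Atlantis"]
results = {n: resolve_code(n) for n in names}

# Names that need a manual fix, like the (index, name) list used for df_target.
unresolved = [n for n, (code, how) in results.items() if how == "unresolved"]
print(results)
print("Needs manual fix:", unresolved)
```

Fuzzy matches are worth printing and checking by eye, since a similar-looking name can belong to a different country.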

df_target.head()

2. Loading Reference Data 1 (population by country) from the web and preprocessing it

url = 'https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population'

# Read every table in the HTML document
tables = pd.read_html(url)
len(tables)

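pd.read_html(url) reads the tables on a web page and turns each one into a DataFrame.
Not having to use the (difficult) BeautifulSoup for this is very handy ^.^

Checking tables showed that the page contains several tables,
so I use only the first one (tables[0]), which holds the data I want.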

df_population = tables[0]

# Rename columns
df_population.rename(columns={'Unnamed: 0':'Rank', 'Unnamed: 6':'Notes'},inplace=True)
df_population.rename(columns={'Location':'Country / Dependency'},inplace=True)

# Fix anomalous values
df_population.loc[1,'Rank'] = 1
df_population.loc[2,'Rank'] = 2

df_population.head(11)

pd.read_html(url)
A function that reads the table tags in an HTML page into DataFrames.

On the web page itself, the table is displayed as follows.

# Drop unused columns
df_population.drop(columns=['Rank','% of world', 'Date',
       'Source (official or from the United Nations)', 'Notes'],inplace=True)

# Rename columns
df_population.rename(columns={'Country / Dependency' : 'country', 'Population' : 'population'},inplace=True)

# Drop the unused row
df_population.drop(index = 0, inplace=True)

Merging the target data and Reference Data 1

To merge the two DataFrames on country, the country names in df_population have to be changed to match those in df_target.

df_population_change_dict = {
    'Bermuda (UK)': 'Bermuda',
    'Cape Verde': 'Cabo Verde',
    'Democratic Republic of the Congo': 'Congo, The Democratic Republic of the',
    'Ivory Coast': "Côte d'Ivoire",
    'Greenland (Denmark)': 'Greenland',
    'Hong Kong (China)': 'Hong Kong',
    'South Korea': 'Korea, Republic of',
    'Laos': "Lao People's Democratic Republic",
    'Macau (China)': 'Macao',
    'North Macedonia': 'Republic of North Macedonia',
    'Micronesia': 'Federated States of Micronesia',
    'Puerto Rico (US)': 'Puerto Rico',
    'Slovakia': 'Slovak Republic',
    'East Timor': 'Timor-Leste',
    'Turkey' : 'Türkiye',
    'Republic of the Congo' : 'Congo'
}

for key, value in df_population_change_dict.items():
    for idx, row in df_population.iterrows():
        if row['country'] == key :
            df_population.loc[idx, 'country'] = value
            print(f'Change Success: {key} > {value}')
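The nested loop above can also be written as a single dictionary-based replacement. The sketch below uses a small made-up DataFrame (`df_pop` and its numbers are placeholders, not the real population table) to show `Series.replace` doing the same whole-cell renaming. It does not print a line per change, so it suits cases where that log is not needed.

```python
import pandas as pd

# Toy stand-in for df_population; the values are placeholders, not real data.
df_pop = pd.DataFrame({
    "country": ["South Korea", "Turkey", "Japan"],
    "population": [100, 200, 300],
})

rename_map = {"South Korea": "Korea, Republic of", "Turkey": "Türkiye"}

# replace() with a dict swaps cells that exactly equal a key;
# names that are not in the dict (here, Japan) are left unchanged.
df_pop["country"] = df_pop["country"].replace(rename_map)
print(df_pop["country"].tolist())
```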

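Merging on country

Condition 1: keep only countries whose names appear in both DataFrames (intersection)
Condition 2: merge column-wise
Condition 3: sort in ascending order by the code column
Condition 4: reset the index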

df = pd.merge(df_target, df_population, on = 'country',how='inner')

# Rename columns
df.columns = ['country', 'incomeperperson', 'internetuserate', 'urbanrate', 'code', 'population']

df.sort_values(by='code',inplace=True)
df.reset_index(inplace=True, drop=True)
df
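An inner merge silently drops any country whose name differs between the two tables, so it helps to see which names failed to match before settling on the rename dictionary. A minimal sketch with made-up frames (`left`, `right`, and their values are placeholders): an outer merge with `indicator=True` labels each row by where it came from.

```python
import pandas as pd

# Made-up stand-ins for df_target and df_population.
left = pd.DataFrame({"country": ["Congo", "Türkiye", "Slovak Republic"],
                     "internetuserate": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"country": ["Congo", "Türkiye", "Slovakia"],
                      "population": [10, 20, 30]})

# how='outer' keeps every row; indicator=True adds a '_merge' column that is
# 'left_only', 'right_only', or 'both' for each row.
check = pd.merge(left, right, on="country", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["country", "_merge"]]
print(unmatched)

# The inner merge keeps only the names present in both frames.
inner = pd.merge(left, right, on="country", how="inner")
print(len(inner))
```

Here "Slovak Republic" and "Slovakia" show up as unmatched, which is the kind of pair the rename dictionary is meant to reconcile.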

Loading and preprocessing Reference Data 2 (country code data)

df_region = pd.read_csv('./datas/continents2.csv')

# Replace the NaN in alpha-2 with the string 'NA'
df_region['alpha-2'] = df_region['alpha-2'].fillna('NA')

# Drop unneeded columns
df_region_drop_col = [
                    'country-code',
                    'alpha-3',
                    'iso_3166-2',
                    'region-code',
                    'sub-region-code',
                    'intermediate-region-code'
                    ]

df_region.drop(columns= df_region_drop_col,inplace=True)

# Rename columns
df_region_rename_dict = {
                        'alpha-2': 'code',
                        'sub-region': 'sub_region',
                        'intermediate-region': 'intermediate_region',
                        }
df_region.rename(columns= df_region_rename_dict, inplace=True)
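The NaN in alpha-2 most likely comes from how `pd.read_csv` parses text: by default the string `NA` is treated as a missing-value marker, and `NA` is also Namibia's two-letter code. The sketch below reproduces this with a two-row inline CSV (a made-up stand-in, not continents2.csv) and shows `keep_default_na=False` as an alternative to filling the value back in afterwards.

```python
import io
import pandas as pd

# Inline stand-in for a country-code CSV.
csv_text = "name,alpha-2\nNamibia,NA\nNorway,NO\n"

# Default parsing: the string 'NA' is on pandas' list of missing-value markers.
default = pd.read_csv(io.StringIO(csv_text))
print(default["alpha-2"].isna().tolist())

# keep_default_na=False turns that list off, so 'NA' survives as text.
# Note that every other default marker (e.g. empty cells) then stays as text too.
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(kept["alpha-2"].tolist())
```

Whether to fill afterwards or change the parsing depends on whether other columns rely on the default missing-value handling.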

Merging Reference Data 2 into df

Condition 1: join rows of df and df_region that have the same country code (the value of the 'code' column).
Condition 2: use an inner join (keep only country codes present in both).
Condition 3: merge the DataFrames column-wise.
Condition 4: sort in ascending order by the 'code' column.

df_merged = pd.merge(df, df_region, on = 'code', how = 'inner')
df_merged.sort_values('code', inplace = True)

# Rename columns
df_rename_dict = {
                'incomeperperson': 'income_per_person',
                'internetuserate': 'internet_use_rate',
                }
df_merged.rename(columns=df_rename_dict, inplace=True)

# Reorder the columns and drop the name column
new_col_order = [
                'code',
                'country',
                'population',
                'income_per_person',
                'internet_use_rate',
                'urbanrate',
                'region',
                'sub_region',
                'intermediate_region'
                ]
df_merged = df_merged.reindex(columns=new_col_order)     

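# reset_index returns a new DataFrame, so assign the result back
df_merged = df_merged.reset_index(drop=True)

# Reorder the columns and drop the unneeded one (intermediate_region);
# region and sub_region are kept because the groupby below needs them
df = df_merged[['country', 'income_per_person', 'internet_use_rate', 'urbanrate', 'code', 'population', 'region', 'sub_region']]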

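Analyzing the data (weighted average & variance)

The goal is to aggregate each country's internet usage rate and income per person into averages by continent (region) and sub-region.
Instead of a simple mean, compute a weighted average that weights each country by its population.
- Condition 1: use the population column as the weights
- Condition 2: name the result columns weighted_ave_internet and weighted_ave_income

I really struggled to write the weighted-average code here,, so I borrowed the power of ChatGPT ^.^
I considered pivot_table, but I couldn't find a way to pass a weighted average as the agg value, so I had to use groupby.

Functions used: NumPy's np.average() -> computes the weighted average, and pandas' groupby()

np.average(data, weights = weights)  # weights: one weight per data point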
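To make the difference from a simple mean concrete, here is a small plain-Python sketch of the formula sum(w_i * x_i) / sum(w_i), which is the quantity a weighted average computes. `weighted_average` is a hypothetical helper written for illustration, and the numbers are made up.

```python
def weighted_average(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ZeroDivisionError("weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight

rates = [80.0, 20.0]   # made-up usage rates for two countries
populations = [1, 3]   # made-up relative population sizes

simple = sum(rates) / len(rates)                 # every country counts equally
weighted = weighted_average(rates, populations)  # larger countries count more
print(simple, weighted)
```

The larger country pulls the weighted result toward its own rate, which is why the two numbers differ.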


import numpy as np

df_result = df.groupby(['region', 'sub_region']).apply(
    lambda x: pd.Series({  # x is the DataFrame for one group
        'weighted_ave_income': np.average(x['income_per_person'], weights=x['population']),
        'weighted_ave_internet': np.average(x['internet_use_rate'], weights=x['population'])
    })
)
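As an alternative to `apply`, the same per-group weighted average can be built from ordinary group sums: multiply the value by the weight, sum both per group, then divide. This is only a sketch on a made-up frame (`toy` and the helper column `rate_x_pop` are hypothetical), not the approach used in this post.

```python
import pandas as pd

# Made-up data: two countries in sub-region A, one in sub-region B.
toy = pd.DataFrame({
    "sub_region": ["A", "A", "B"],
    "internet_use_rate": [80.0, 20.0, 50.0],
    "population": [1, 3, 2],
})

# Weighted mean per group = sum(rate * pop) / sum(pop), using only built-in sums.
toy["rate_x_pop"] = toy["internet_use_rate"] * toy["population"]
sums = toy.groupby("sub_region")[["rate_x_pop", "population"]].sum()
weighted_ave_internet = sums["rate_x_pop"] / sums["population"]
print(weighted_ave_internet)
```

Because it relies only on `sum`, this version avoids calling a Python function once per group, at the cost of a temporary helper column.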

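pandas groupby(): returns a GroupBy object, not a DataFrame.
GroupBy object: it only defines how the data is grouped; it is not itself a summarized or displayable DataFrame.
It holds the DataFrame in a grouped state so that an operation can be run on each group.
So a GroupBy object has to be turned back into a DataFrame by computing a summary statistic on it.
ex)
# Group the DataFrame
grouped = df.groupby(['region', 'sub_region'])
# Compute the mean of each group; numeric_only=True skips text columns such as country and code
grouped_mean = grouped.mean(numeric_only=True).reset_index()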

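pandas apply(): a very powerful tool that applies a function to each element of a Series, or to each column (or row) of a DataFrame

Used together with a GroupBy object, it applies a user-defined function to each group.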

s = pd.Series([1,2,3,4,5])
squared = s.apply(lambda x: x** 2)
print(squared)
# Output:
0     1
1     4
2     9
3    16
4    25
dtype: int64
# Applying it to a DataFrame
df_ex = pd.DataFrame({
    'A' : [1,2,3],
    'B' : [4,5,6]
})
# Compute the sum of each column
column_sum = df_ex.apply(lambda x: x.sum())
print(column_sum)
# Output:
A     6
B    15
dtype: int64
# Applying along rows or columns
# axis = 0 (default): apply the function to each column
# axis = 1: apply the function to each row
row_sum = df_ex.apply(lambda x: x.sum(), axis=1)
print(row_sum)
# Output:
0     5
1     7
2     9
dtype: int64

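Weighted average under specific conditions

Compute the internet usage rate and income per person for Eastern Asia and Southern Asia, excluding China and India
- Condition 1: use the population column as the weights
- Condition 2: exclude China (code: CN) and India (code: IN)

# Keep only Eastern Asia and Southern Asia rows, excluding China and India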
df = df[df['region'] == 'Asia']
df = df[df['code'] != 'CN']
df = df[df['code'] != 'IN']
df= df[df['sub_region'].isin(['Eastern Asia', 'Southern Asia'])]

df_result = df.groupby(['region', 'sub_region']).apply(
    lambda x: pd.Series({
        'weighted_ave_income': np.average(x['income_per_person'], weights=x['population']),
        'weighted_ave_internet': np.average(x['internet_use_rate'], weights=x['population'])
    })
)