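A write-up and retrospective of my individual Data Science project from Section 1 of the Codestates AI Bootcamp, carried out under the shared theme "What kind of game should we design next quarter?"
Project goal: answer the question "What kind of game should we design next quarter?"
Secondary goal: make the analysis understandable to people without background knowledge.
Does the preferred game genre differ by region?
Are there year-by-year trends in games?
Analysis and visualization of games with high sales volume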
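Introduction
Background
Project goals
Project workflow
Data
Data preprocessing
Feature engineering
Trends in Video Game
Global trends
Genre trends
Game platform type trends
Platform company trends
Hypothesis Test
Multi-platform vs. single-platform
Conclusion
Summary
Conclusions and insights
Retrospective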
ATARI 2600
NES (Famicom)
SNES (Super Famicom)
GAMEBOY (Game Boy)
GENESIS (Mega Drive)
N64 (Nintendo 64)
SATURN (Sega Saturn)
PlayStation
GAMECUBE (GameCube)
GBA (Game Boy Advance)
PS2 (PlayStation 2)
XBOX
Wii (Nintendo Wii)
NDS (Nintendo DS)
PS3 (PlayStation 3)
PSP (PlayStation Portable)
XBOX360 (Xbox 360)
Wii (Nintendo Wii)
NDS (Nintendo DS)
PS4 (PlayStation 4)
PSP (PlayStation Portable)
XBOX360 (Xbox 360)
Nintendo SWITCH
PS5 (PlayStation 5)
XBOX SERIES X|S
Which game genres are preferred?
Which type is preferred: home console, handheld console, or PC?
Which company's platforms are preferred?
Is developing for multiple platforms (Multi-Platform) effective?
import numpy as np
import pandas as pd

# Load the csv file
df = pd.read_csv('vgames2.csv')
# The first column is a meaningless index, so exclude it
df1 = df.iloc[:,1:]
display(df1.head())
# Check missing values
df1.isnull().sum()
#output
'''
Name 0
Platform 0
Year 271
Genre 50
Publisher 58
NA_Sales 0
EU_Sales 0
JP_Sales 0
Other_Sales 0
dtype: int64
'''
# Shape of the original data
df1.shape
#output
'''
(16598, 9)
'''
# Select the rows with a missing Year or Genre, then drop them
df_year_null = df1[df1.Year.isnull()|df1.Genre.isnull()]
df1_drop = df1.drop(index=df_year_null.index).reset_index(drop=True)
# Check the shape
df1_drop.shape
#output
'''
(16277, 9)
'''
# Number of rows removed
df1.shape[0]-df1_drop.shape[0]
#output
'''
321
'''
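The same two steps (dropping rows with a missing Year or Genre, then filling a missing Publisher) can also be written with dropna and fillna. This is only a minimal sketch on a made-up four-row frame; the `raw` data below is hypothetical and is not taken from vgames2.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not from vgames2.csv)
raw = pd.DataFrame({
    'Name': ['a', 'b', 'c', 'd'],
    'Year': [2008, np.nan, 2010, 2011],
    'Genre': ['Action', 'Sports', None, 'Puzzle'],
    'Publisher': ['X', None, 'Y', None],
})

# Drop rows where Year or Genre is missing, then renumber the index
cleaned = raw.dropna(subset=['Year', 'Genre']).reset_index(drop=True)
# A missing Publisher is kept and labeled 'Unknown' instead of being dropped
cleaned['Publisher'] = cleaned['Publisher'].fillna('Unknown')

print(cleaned)
print('rows removed:', len(raw) - len(cleaned))
```

Using `subset` keeps the intent (which columns trigger a drop) visible in a single call.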
# Fill missing Publisher values using fillna
df2 = df1_drop.copy()
df2.Publisher = df2.Publisher.fillna('Unknown')
# Check missing values
df2.isnull().sum()
#output
'''
Name 0
Platform 0
Year 0
Genre 0
Publisher 0
NA_Sales 0
EU_Sales 0
JP_Sales 0
Other_Sales 0
dtype: int64
'''
# Shape after handling missing values
df2.shape
#output
'''
(16277, 9)
'''
Checking the columns in order
Checking the Platform column
#Platform column
df2.Platform.unique()
#output
'''
array(['DS', 'Wii', 'PSP', 'PS3', 'PC', 'PS', 'GBA', 'PS4', 'PS2', 'XB',
'X360', 'GC', '3DS', '2600', 'SAT', 'GB', 'NES', 'DC', 'N64',
'XOne', 'SNES', 'WiiU', 'PSV', 'GEN', 'SCD', 'WS', 'NG', 'TG16',
'3DO', 'GG', 'PCFX'], dtype=object)
'''
Based on prior research, the Platform column contains no labeling errors.
Checking the Year column
#Year column
df2.Year.unique()
#output
'''
array([2.008e+03, 2.009e+03, 2.010e+03, 2.005e+03, 2.011e+03, 2.007e+03,
2.001e+03, 2.003e+03, 2.006e+03, 2.014e+03, 2.015e+03, 2.002e+03,
1.997e+03, 2.013e+03, 1.996e+03, 2.004e+03, 2.000e+03, 1.984e+03,
1.998e+03, 2.016e+03, 1.985e+03, 1.999e+03, 9.000e+00, 9.700e+01,
1.995e+03, 1.993e+03, 2.012e+03, 1.987e+03, 1.982e+03, 1.100e+01,
1.994e+03, 1.990e+03, 1.500e+01, 1.992e+03, 1.991e+03, 1.983e+03,
1.988e+03, 1.981e+03, 3.000e+00, 1.989e+03, 9.600e+01, 6.000e+00,
8.000e+00, 1.986e+03, 1.000e+00, 5.000e+00, 4.000e+00, 1.000e+01,
9.800e+01, 7.000e+00, 1.600e+01, 8.600e+01, 1.400e+01, 9.500e+01,
2.017e+03, 1.980e+03, 2.020e+03, 2.000e+00, 1.300e+01, 0.000e+00,
1.200e+01, 9.400e+01])
'''
# Define a Year Series and cast it to integer
Year_series = df2.Year.astype(int)
Year_series.unique()
#output
'''
array([2008, 2009, 2010, 2005, 2011, 2007, 2001, 2003, 2006, 2014, 2015,
2002, 1997, 2013, 1996, 2004, 2000, 1984, 1998, 2016, 1985, 1999,
9, 97, 1995, 1993, 2012, 1987, 1982, 11, 1994, 1990, 15,
1992, 1991, 1983, 1988, 1981, 3, 1989, 96, 6, 8, 1986,
1, 5, 4, 10, 98, 7, 16, 86, 14, 95, 2017,
1980, 2020, 2, 13, 0, 12, 94])
'''
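The Year notation is inconsistent (two-digit and four-digit values), so it needs to be fixed
# Define the function
def year(y):
    # 0-21 is read as 2000-2021
    if (y>=0) and (y<22):
        y = y+2000
    # 22-999 is read as a 1900s year
    elif (y>=22) and (y<1000):
        y = y+1900
    else:
        y = y
    return y
# Apply, then check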
Year_fix = Year_series.apply(year)
Year_fix.unique()
#output
'''
array([2008, 2009, 2010, 2005, 2011, 2007, 2001, 2003, 2006, 2014, 2015,
2002, 1997, 2013, 1996, 2004, 2000, 1984, 1998, 2016, 1985, 1999,
1995, 1993, 2012, 1987, 1982, 1994, 1990, 1992, 1991, 1983, 1988,
1981, 1989, 1986, 2017, 1980, 2020], dtype=int64)
'''
# Overwrite with the fixed values
df3 = df2.copy()
df3.Year = Year_fix
df3.Year.unique()
#output
'''
array([2008, 2009, 2010, 2005, 2011, 2007, 2001, 2003, 2006, 2014, 2015,
2002, 1997, 2013, 1996, 2004, 2000, 1984, 1998, 2016, 1985, 1999,
1995, 1993, 2012, 1987, 1982, 1994, 1990, 1992, 1991, 1983, 1988,
1981, 1989, 1986, 2017, 1980, 2020], dtype=int64)
'''
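The year correction above can also be done without a row-by-row apply. The following is a sketch, assuming the same pivot of 22 used in `year`; the function name `normalize_year` and the sample values are made up for illustration.

```python
import pandas as pd

def normalize_year(years: pd.Series, pivot: int = 22) -> pd.Series:
    """Expand two-digit years to four digits; four-digit years are kept."""
    years = years.astype(int)
    # Values below the pivot are read as 20xx
    expanded = years.where(years >= pivot, years + 2000)
    # Values from the pivot up to 999 are read as 19xx
    expanded = expanded.where((years < pivot) | (years >= 1000), years + 1900)
    return expanded

# Hypothetical sample covering each branch
sample = pd.Series([2008.0, 9.0, 97.0, 0.0, 1980.0, 21.0, 22.0])
fixed = normalize_year(sample)
print(fixed.tolist())
```

Note that the pivot is a judgment call: a value of 22 is read as 1922 here, exactly as in the original function.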
Checking the Genre column
#Genre column
df3.Genre.unique()
#output
'''
array(['Action', 'Adventure', 'Misc', 'Platform', 'Sports', 'Simulation',
'Racing', 'Role-Playing', 'Puzzle', 'Strategy', 'Fighting',
'Shooter'], dtype=object)
'''
No problems found in the Genre column
Checking the Sales columns
#Sales column
df3.iloc[:,-4:].dtypes
#output
'''
NA_Sales object
EU_Sales object
JP_Sales object
Other_Sales object
dtype: object
'''
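The dtype is object, so non-numeric data is very likely mixed in.
First, look at the NA_Sales (North America) rows that contain letters.
df3_NA_err = df3[df3.NA_Sales.str.contains('[a-zA-Z]')]
df3_NA_err.head()
K and M suffixes appear.
K means 1,000 and M means 1,000,000.
Values without a K or M suffix turned out to be in units of M (millions).
The units will be unified to M.
# Define the function
def scalefix(s):
    # Values ending in 'K' are in thousands: convert to millions
    if s.endswith('K'):
        s = s.replace('K','')
        s = float(s)*0.001
    # Values ending in 'M' are already in millions: strip the suffix
    elif s.endswith('M'):
        s = s.replace('M','')
    return s
# Try it on NA_Sales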
df3_fix_test = df3_NA_err.NA_Sales.apply(scalefix)
df3_fix_test.head()
#output
'''
10 0.48
44 0.06
142 0.0
440 0.58
451 0.25
Name: NA_Sales, dtype: object
'''
# Apply to all Sales columns
df3_Sales = df3.iloc[:,-4:]
df3_Sales_fix = df3_Sales.applymap(scalefix)
df3_Sales_fix_f = df3_Sales_fix.astype(float)
# Overwrite with the fixed values
df4 = df3.copy()
df4.iloc[:,-4:] = df3_Sales_fix_f
display(df4.dtypes)
#output
Name object
Platform object
Year int64
Genre object
Publisher object
NA_Sales float64
EU_Sales float64
JP_Sales float64
Other_Sales float64
dtype: object
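As a small illustration of the unit conversion, here is a variant that always returns a float in millions. It is a sketch, not the code used in this project: `to_millions` and the sample strings are made up, and it divides by 1000 rather than multiplying by 0.001.

```python
def to_millions(value: str) -> float:
    """Convert a sales string such as '480K', '0.06M' or '1.25' to millions."""
    text = str(value).strip()
    if text.endswith('K'):
        # Thousands: divide by 1000 to express the value in millions
        return float(text[:-1]) / 1000
    if text.endswith('M'):
        # Already in millions: only strip the suffix
        return float(text[:-1])
    # No suffix: assumed to be in millions, as observed in the dataset
    return float(text)

# Hypothetical sample values
samples = ['480K', '0.06M', '1.25', '0K']
converted = [to_millions(s) for s in samples]
print(converted)
```

Returning a float from every branch means no separate astype(float) step is needed afterwards.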
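# Check with .describe() (look at the max values)
display(df4.describe())
# Look for rows duplicated on Name, Platform, Year and Publisher
df4[df4.duplicated(subset=['Name','Platform','Year','Publisher'],keep=False)]
# Remove the duplicate at index 8559
df5 = df4.drop(index=8559).reset_index(drop=True)
# Shape after removing the duplicate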
df5.shape
#output
'''
(16276, 9)
'''
df5.to_csv('vgames_pre.csv',index=False)
fe = pd.read_csv('vgames_pre.csv')
fe.head()
# Data shape
fe.shape
#output
'''
(16276, 9)
'''
fe1 = fe.copy()
fe1['Global_Sales'] = fe1.NA_Sales+fe1.EU_Sales+fe1.JP_Sales+fe1.Other_Sales
fe1.describe()
# Data shape
fe1.shape
#output
'''
(16276, 10)
'''
#Platform column
print('Number of platforms : {}'.format(len(fe1.Platform.unique())))
fe1.Platform.unique()
#output
'''
Number of platforms : 31
array(['DS', 'Wii', 'PSP', 'PS3', 'PC', 'PS', 'GBA', 'PS4', 'PS2', 'XB',
'X360', 'GC', '3DS', '2600', 'SAT', 'GB', 'NES', 'DC', 'N64',
'XOne', 'SNES', 'WiiU', 'PSV', 'GEN', 'SCD', 'WS', 'NG', 'TG16',
'3DO', 'GG', 'PCFX'], dtype=object)
'''
Traditional = ['2600','NES','SNES','N64','GC','Wii','WiiU',
'GEN','SAT','DC','PS','PS2','PS3','PS4',
'XB','X360','XOne']
Portable = ['GB','GBA','DS','3DS','PSP','PSV']
PC = ['PC']
# Select rows whose platform is not in any of the three groups
fe1_type = fe1[~fe1.Platform.isin(Traditional+Portable+PC)]
print('Rows not covered by the classification: {}'.format(fe1_type.shape[0]))
#output
'''
Rows not covered by the classification: 31
'''
#drop
fe2 = fe1.copy()
fe2 = fe2.drop(index=fe1_type.index).reset_index(drop=True)
fe_type_check = fe2[~fe2.Platform.isin(Traditional+Portable+PC)]
print('Rows not covered by the classification after removal: {}'.format(fe_type_check.shape[0]))
print('\nNumber of platforms : {}'.format(len(fe2.Platform.unique())))
#output
'''
Rows not covered by the classification after removal: 0
Number of platforms : 24
'''
# Check that the platform list matches the classification lists
print(set(fe2.Platform.unique()) == set(Traditional+Portable+PC))
#output
'''
True
'''
# Define a function that returns the platform type
def plat_type(p):
    if p in Traditional:
        p = 'Traditional'
    elif p in Portable:
        p = 'Portable'
    elif p in PC:
        p = 'PC'
    return p
# Create the Platform_type column
fe2_type = fe2.copy()
fe2_type['Platform_type'] = fe2_type.Platform.apply(plat_type)
# Check whether any value was left unset because of a function error
print(fe2_type.Platform_type.isnull().sum())
#output
'''
0
'''
display(fe2_type.head())
# Data shape
fe2_type.shape
#output
'''
(16245, 11)
'''
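The platform-to-type lookup can also be expressed as a dictionary used with Series.map. This is a sketch under the same three groups as above; `groups`, `platform_to_type` and the four sample platforms are names introduced here for illustration. One difference worth noting: map returns NaN for a platform that is missing from the dictionary, so an isnull check catches it, whereas `plat_type` returns the input unchanged in that case.

```python
import pandas as pd

groups = {
    'Traditional': ['2600', 'NES', 'SNES', 'N64', 'GC', 'Wii', 'WiiU',
                    'GEN', 'SAT', 'DC', 'PS', 'PS2', 'PS3', 'PS4',
                    'XB', 'X360', 'XOne'],
    'Portable': ['GB', 'GBA', 'DS', '3DS', 'PSP', 'PSV'],
    'PC': ['PC'],
}
# Invert the groups into a flat {platform: type} lookup
platform_to_type = {p: t for t, members in groups.items() for p in members}

# Hypothetical sample; 'WS' is deliberately outside the classification
platforms = pd.Series(['DS', 'Wii', 'PC', 'WS'])
mapped = platforms.map(platform_to_type)
print(mapped)
print('unmapped:', mapped.isna().sum())
```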
Atari = ['2600']
Nintendo = ['NES','SNES','N64','GC','Wii','WiiU','GB','GBA','DS','3DS']
Sega = ['GEN','SAT','DC']
Sony = ['PS','PS2','PS3','PS4','PSP','PSV']
Microsoft = ['XB','X360','XOne']
PC = ['PC']
print(set(fe2_type.Platform.unique()) == set(Atari+Nintendo+Sega+Sony+Microsoft+PC))
#output
'''
True
'''
# Define a function that returns the platform company
def plat_company(p):
    if p in Atari:
        p = 'Atari'
    elif p in Nintendo:
        p = 'Nintendo'
    elif p in Sega:
        p = 'Sega'
    elif p in Sony:
        p = 'Sony'
    elif p in Microsoft:
        p = 'Microsoft'
    elif p in PC:
        p = 'PC'
    return p
# Create the Platform_company column
fe2_comp = fe2_type.copy()
fe2_comp['Platform_company'] = fe2_comp.Platform.apply(plat_company)
# Check whether any value was left unset because of a function error
print(fe2_comp.Platform_company.isnull().sum())
#output
'''
0
'''
display(fe2_comp.head())
# Data shape
fe2_comp.shape
#output
'''
(16245, 12)
'''
gen2 = ['2600']
gen3 = ['NES']
gen4 = ['SNES','GEN','GB']
gen5 = ['N64','SAT','PS']
gen6 = ['GC','DC','PS2','XB','GBA']
gen7 = ['Wii','PS3','X360','DS','PSP']
gen8 = ['WiiU','PS4','XOne','3DS','PSV']
PC = ['PC']
print(set(fe2_comp.Platform.unique()) == set(gen2+gen3+gen4+gen5+gen6+gen7+gen8+PC))
#output
'''
True
'''
# Define a function that returns the platform generation
def plat_gen(p):
    if p in gen2:
        p = 2
    elif p in gen3:
        p = 3
    elif p in gen4:
        p = 4
    elif p in gen5:
        p = 5
    elif p in gen6:
        p = 6
    elif p in gen7:
        p = 7
    elif p in gen8:
        p = 8
    # PC has no console generation, so leave it missing for now
    elif p in PC:
        p = np.nan
    return p
# Create the Generation column
fe2_gen = fe2_comp.copy()
fe2_gen['Generation']=fe2_gen.Platform.apply(plat_gen)
# Check missing values (should equal the number of PC rows)
print(fe2_gen.Generation.isnull().sum())
print(sum(fe2_gen.Platform=='PC'))
#output
'''
940
940
'''
# Split by generation (release-year range)
fe2_gen2 = fe2_gen.query('1977 < Year <= 1983')
fe2_gen3 = fe2_gen.query('1983 < Year <= 1988')
fe2_gen4 = fe2_gen.query('1988 < Year <= 1994')
fe2_gen5 = fe2_gen.query('1994 < Year <= 1998')
fe2_gen6 = fe2_gen.query('1998 < Year <= 2005')
fe2_gen7 = fe2_gen.query('2005 < Year <= 2013')
fe2_gen8 = fe2_gen.query('2013 < Year <= 2020')
# Fill the missing Generation values (the PC rows) from the year range
fe2_f_gen2 = fe2_gen2.copy()
fe2_f_gen2.Generation = fe2_f_gen2.Generation.fillna(2)
fe2_f_gen3 = fe2_gen3.copy()
fe2_f_gen3.Generation = fe2_f_gen3.Generation.fillna(3)
fe2_f_gen4 = fe2_gen4.copy()
fe2_f_gen4.Generation = fe2_f_gen4.Generation.fillna(4)
fe2_f_gen5 = fe2_gen5.copy()
fe2_f_gen5.Generation = fe2_f_gen5.Generation.fillna(5)
fe2_f_gen6 = fe2_gen6.copy()
fe2_f_gen6.Generation = fe2_f_gen6.Generation.fillna(6)
fe2_f_gen7 = fe2_gen7.copy()
fe2_f_gen7.Generation = fe2_f_gen7.Generation.fillna(7)
fe2_f_gen8 = fe2_gen8.copy()
fe2_f_gen8.Generation = fe2_f_gen8.Generation.fillna(8)
# Concatenate the split data and cast Generation to int
fe2_f_gen = pd.concat([fe2_f_gen2,
fe2_f_gen3,
fe2_f_gen4,
fe2_f_gen5,
fe2_f_gen6,
fe2_f_gen7,
fe2_f_gen8]).sort_index()
fe2_f_gen.Generation = fe2_f_gen.Generation.astype(int)
# Check the data
fe2_f_gen.info()
#output
'''
Int64Index: 16245 entries, 0 to 16244
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 16245 non-null object
1 Platform 16245 non-null object
2 Year 16245 non-null int64
3 Genre 16245 non-null object
4 Publisher 16245 non-null object
5 NA_Sales 16245 non-null float64
6 EU_Sales 16245 non-null float64
7 JP_Sales 16245 non-null float64
8 Other_Sales 16245 non-null float64
9 Global_Sales 16245 non-null float64
10 Platform_type 16245 non-null object
11 Platform_company 16245 non-null object
12 Generation 16245 non-null int32
dtypes: float64(5), int32(1), int64(1), object(6)
memory usage: 1.7+ MB
'''
# Data shape
fe2_f_gen.shape
#output
'''
(16245, 13)
'''
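Filling the PC generation from the release year can also be sketched with pd.cut, which assigns each year to a right-closed bin and so reproduces boundaries of the form '1977 < Year <= 1983'. The `toy` frame and the `bins`/`labels` names below are made up for illustration, and a year outside the bins would stay missing and make the final cast fail.

```python
import numpy as np
import pandas as pd

# Right-closed bin edges matching the year ranges used for each generation
bins = [1977, 1983, 1988, 1994, 1998, 2005, 2013, 2020]
labels = [2, 3, 4, 5, 6, 7, 8]

# Hypothetical toy data: PC rows have no generation yet
toy = pd.DataFrame({
    'Platform': ['PC', 'PS2', 'PC'],
    'Year': [1996, 2003, 2011],
    'Generation': [np.nan, 6, np.nan],
})

# Generation implied by the release year
year_gen = pd.cut(toy['Year'], bins=bins, labels=labels).astype(float)
# Only missing values are filled; console rows keep their platform generation
toy['Generation'] = toy['Generation'].fillna(year_gen).astype(int)
print(toy)
```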
# Split the data: titles that appear once vs. titles that appear on several platforms
fe2_mono = fe2_f_gen[~fe2_f_gen.Name.duplicated(keep=False)]
fe2_multi = fe2_f_gen[fe2_f_gen.Name.duplicated(keep=False)]
# Check the shapes
print(fe2_mono.shape)
print(fe2_multi.shape)
print(fe2_f_gen.shape)
fe2_f_gen.shape[0] == fe2_mono.shape[0]+fe2_multi.shape[0]
#output
'''
(8594, 13)
(7651, 13)
(16245, 13)
True
'''
# Create a column indicating multi-platform support
fe2_mono2 = fe2_mono.copy()
fe2_multi2 = fe2_multi.copy()
fe2_mono2['Platform_Multi'] = 'Native'
fe2_multi2['Platform_Multi'] = 'Multi'
fe2_multi = pd.concat([fe2_mono2,fe2_multi2]).sort_index()
# Check the shape
fe2_multi.shape
#output
'''
(16245, 14)
'''
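The multi-platform flag is derived from whether a title name appears more than once. A compact sketch of the same idea with numpy.where follows; the `games` frame is hypothetical. It also shares the same assumption: two rows with the same Name are treated as one game on several platforms, which may not hold for remakes or unrelated titles that share a name.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: title 'A' is released on two platforms
games = pd.DataFrame({
    'Name': ['A', 'B', 'A', 'C'],
    'Platform': ['PS3', 'Wii', 'X360', 'DS'],
})

# duplicated(keep=False) marks every row whose Name occurs more than once
is_multi = games['Name'].duplicated(keep=False)
games['Platform_Multi'] = np.where(is_multi, 'Multi', 'Native')
print(games)
```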
fe3 = fe2_multi.copy()
fe3 = fe3[['Name', 'Publisher', 'Year', 'Genre', 'Generation',
'Platform', 'Platform_type', 'Platform_company', 'Platform_Multi',
'NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']]
fe3.to_csv('vgames_final.csv',index=False)
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'Malgun Gothic'
plt.rcParams['axes.unicode_minus'] = False
import seaborn as sns
import plotly.express as plex
from plotly.subplots import make_subplots as plsub
import plotly.graph_objects as go
vgames = pd.read_csv('vgames_final.csv')
vgames.head()
Total sales were highest in North America, followed by Europe, Japan, and other regions.
Checking total sales
Sales_list = ['NA_Sales','EU_Sales','JP_Sales','Other_Sales','Global_Sales']
df_global_sum = vgames[Sales_list].sum()
df_global_sum
#output
'''
NA_Sales 4311.82
EU_Sales 2395.63
JP_Sales 1267.78
Other_Sales 783.42
Global_Sales 8758.65
dtype: float64
'''
global_sum_bar = plex.bar(df_global_sum,color=df_global_sum.index)
global_sum_bar.show()
se_global_sum_pie = df_global_sum[:-1]
sum_index = se_global_sum_pie.index
sum_value = se_global_sum_pie.values
global_sum_pie = go.Figure(data=[go.Pie(labels=sum_index,
values=sum_value,
hole=.3,
textinfo='label+percent')])
global_sum_pie.show()
Overall, North America accounts for the largest share of sales.
Total sales trend downward from around 2010, the period when smartphones emerged.
Grouping sales by year (generation)
df_year = vgames.groupby(['Year'],as_index=True)[Sales_list].sum()
df_gen = vgames.groupby(['Generation'],as_index=True)[Sales_list].sum()
plt.figure(figsize=(12,8))
sns.set_style('whitegrid')
sns.lineplot(data=df_year,markers=True,dashes=False)
plt.axvline(x=1980, color='gray',linestyle='--')
plt.axvline(x=1983, color='gray',linestyle='--')
plt.axvline(x=1988, color='gray',linestyle='--')
plt.axvline(x=1994, color='gray',linestyle='--')
plt.axvline(x=1998, color='gray',linestyle='--')
plt.axvline(x=2005, color='gray',linestyle='--')
plt.axvline(x=2013, color='gray',linestyle='--')
plt.axvline(x=2020, color='gray',linestyle='--')
plt.xlabel('')
plt.ylabel('Sales(Million)',fontsize=16)
plt.show()
year_genre = plex.bar(data_frame=df_gen,title='Sales Trend(Global)',barmode='group')
year_genre.update_layout(yaxis_title='Sales(Million)')
year_genre.show()
The best-selling title worldwide is Nintendo's Wii Sports.
Games developed by Nintendo make up a large share of the top 100.
Top 20 and Top 100 data
top20_global = vgames.sort_values(by='Global_Sales',ascending=False).head(20).reset_index(drop=True)
top100_global = vgames.sort_values(by='Global_Sales',ascending=False).head(100).reset_index(drop=True)
global_top20_name = plex.bar(x=top20_global.Global_Sales,y=top20_global.Name)
global_top20_name.update_layout(yaxis=dict(autorange='reversed')
,xaxis_title=dict(text='Global Sales(Million)')
,yaxis_title=dict(text=''))
global_top20_name.show()
global_top100_pub_pie = plex.pie(data_frame=top100_global,hole=0.3,
values='Global_Sales',names='Publisher')
global_top100_pub_pie.update_traces(textposition='inside', textinfo='percent+label')
global_top100_pub_pie.update_layout(annotations=[dict(text='Global',showarrow=False)])
global_top100_pub_pie.show()
Broken down by region, games developed by Nintendo still account for the largest share.
The plotting code is identical for every region, so this post records only the code for Other Country.
Other Country bar chart (plotly)
# Data
top20_Ot = vgames.sort_values(by='Other_Sales',ascending=False).head(20).reset_index(drop=True)
top100_Ot = vgames.sort_values(by='Other_Sales',ascending=False).head(100).reset_index(drop=True)
#bar chart
Ot_top20_name = plex.bar(x=top20_Ot.Other_Sales,y=top20_Ot.Name)
Ot_top20_name.update_layout(yaxis=dict(autorange='reversed')
,xaxis_title=dict(text='Other Sales(Million)')
,yaxis_title=dict(text=''))
Ot_top20_name.show()
Because plotly graphs are interactive, for the ppt I drag-zoomed the Top 20 plot down to the Top 5 and captured that view.
#pie chart
Ot_top100_pub_pie = plex.pie(data_frame=top100_Ot,hole=0.3,
values='Other_Sales',names='Publisher')
Ot_top100_pub_pie.update_traces(textposition='inside', textinfo='percent+label')
Ot_top100_pub_pie.update_layout(annotations=[dict(text='Other',showarrow=False)])
Ot_top100_pub_pie.show()
Worldwide, the most preferred genres are Action, Sports, Shooter, Role-Playing, and Platform, in that order.
Grouping by genre
df_Genre = vgames.groupby(['Genre'],as_index=False)[Sales_list].sum()
#global barplot
Genre_sort = df_Genre.sort_values(by='Global_Sales',ascending=True).reset_index(drop=True)
plot_index = np.arange(len(Genre_sort.Genre))
plt.figure(figsize=(12,8))
plt.barh(plot_index,Genre_sort.Global_Sales)
plt.title('Sales by Genre (Global)',fontsize=24)
plt.xlabel('Sales(Million)',fontsize=16)
plt.ylabel('')
plt.yticks(plot_index,Genre_sort.Genre,fontsize=16)
plt.grid(True,axis='x',linestyle='--')
plt.xlim([0,1800])
plt.show()
#Global pie chart
explode = np.repeat(0.025,12)
wedgeprops = {'width': 0.5, 'edgecolor': 'w', 'linewidth': 1.5}
textprops={'size':12}
Genre_sort2 = df_Genre.sort_values(by='Global_Sales',ascending=False).reset_index(drop=True)
plt.figure(figsize=(12,12))
plt.pie(Genre_sort2.Global_Sales,
labels=Genre_sort2.Genre,
labeldistance=0.725,
startangle=0,
autopct='%.1f%%',
explode=explode,
wedgeprops=wedgeprops,
textprops=textprops)
plt.legend(loc='center')
plt.show()
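In Japan, Role-Playing was the most preferred genre and Shooter the least preferred.
In all other regions, Action, Sports, and Shooter were the most preferred, in that order.
Genre preference by region: stacked bar plot (matplotlib)
# Genre data by region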
Genre_sort2 = df_Genre.sort_values(by='Global_Sales',ascending=False).reset_index(drop=True)
genre_stack = Genre_sort2.iloc[:,:-1].T.iloc[1:,:]
genre_stack.columns = Genre_sort2.Genre
genre_per = genre_stack.div(genre_stack.sum(axis=1),axis=0)*100
# Stacked bar plot by region
genre_per_ud = genre_per.loc[::-1]
genre_per_ud.index = ['Other','Japan','Europe','North America']
genre_per_ud.plot(kind = 'barh',stacked=True,figsize=(10,4))
plt.grid(True,axis='x',linestyle='--')
plt.xlabel('Percent(%)')
plt.legend(loc=(1.02,0.1))
plt.show();
# Genre data by region
Genre_NA = df_Genre.sort_values(by='NA_Sales',ascending=False).reset_index(drop=True)
Genre_EU = df_Genre.sort_values(by='EU_Sales',ascending=False).reset_index(drop=True)
Genre_JP = df_Genre.sort_values(by='JP_Sales',ascending=False).reset_index(drop=True)
Genre_Ot = df_Genre.sort_values(by='Other_Sales',ascending=False).reset_index(drop=True)
# Bar plots by region
region_genre = plsub(rows=2,cols=2,subplot_titles=('North America','Europe','Japan','Other'))
region_genre.add_bar(x= Genre_NA.Genre,y=Genre_NA.NA_Sales,name='North America',row=1,col=1)
region_genre.add_bar(x= Genre_EU.Genre,y=Genre_EU.EU_Sales,name='Europe',row=1,col=2)
region_genre.add_bar(x= Genre_JP.Genre,y=Genre_JP.JP_Sales,name='Japan',row=2,col=1)
region_genre.add_bar(x= Genre_Ot.Genre,y=Genre_Ot.Other_Sales,name='Other',row=2,col=2)
region_genre.update_xaxes(tickangle= 45)
region_genre.show()
top5genre = ['Action','Sports','Shooter','Role-Playing','Platform']
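# Sales by year and by generation for each genre. These two definitions are missing
# from the original post and are assumed to follow the groupby pattern used elsewhere.
df_genre_year = vgames.groupby(['Year','Genre'],as_index=False)[Sales_list].sum()
df_genre_gen = vgames.groupby(['Generation','Genre'],as_index=False)[Sales_list].sum()
df_top5_genre_year = df_genre_year[df_genre_year.Genre.isin(top5genre)]
df_top5_genre_gen = df_genre_gen[df_genre_gen.Genre.isin(top5genre)]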
top5_year_genre = plex.line(data_frame=df_top5_genre_year,
x='Year',
y='Global_Sales',
color='Genre',
title='Top5 Genre Sales Trend(Global)')
top5_year_genre.add_vline(1980)
top5_year_genre.add_vline(1983)
top5_year_genre.add_vline(1988)
top5_year_genre.add_vline(1994)
top5_year_genre.add_vline(1998)
top5_year_genre.add_vline(2005)
top5_year_genre.add_vline(2013)
top5_year_genre.add_vline(2020)
top5_year_genre.show()
top5_gen_genre = plex.bar(data_frame=df_top5_genre_gen,
x='Generation',
y='Global_Sales',
color='Genre',
title='Top5 Genre Sales Trend(Global)',
barmode='group')
top5_gen_genre.show()
df_type = vgames.groupby(['Platform_type'],as_index=False)[Sales_list].sum()
df_type
#global barplot
type_sort = df_type.sort_values(by='Global_Sales',ascending=False).reset_index(drop=True)
type_index = np.arange(len(type_sort.Platform_type))
plt.figure(figsize=(8,8))
plt.bar(type_index,type_sort.Global_Sales)
plt.title('Sales by Platform type (Global)',fontsize=24)
plt.xlabel('')
plt.ylabel('Sales(Million)',fontsize=16)
plt.xticks(type_index,type_sort.Platform_type,fontsize=16)
plt.grid(True,axis='y',linestyle='--')
plt.ylim([0,7000])
plt.show()
#Global pie chart
explode = np.repeat(0.025,3)
wedgeprops = {'width': 0.5, 'edgecolor': 'w', 'linewidth': 1.5}
textprops={'size':12}
type_sort2 = df_type.sort_values(by='Global_Sales',ascending=False).reset_index(drop=True)
plt.figure(figsize=(12,12))
plt.pie(type_sort2.Global_Sales,
labels=type_sort2.Platform_type,
labeldistance=0.725,startangle=0,
autopct='%.1f%%',
explode=explode,
wedgeprops=wedgeprops,
textprops=textprops)
plt.legend(loc='center')
plt.show()
# Data
type_stack = type_sort2.iloc[:,:-1].T.iloc[1:,:]
type_stack.columns = type_sort2.Platform_type
type_per = type_stack.div(type_stack.sum(axis=1),axis=0)*100
# Stacked bar plot by region
type_per_ud = type_per.loc[::-1]
type_per_ud.index = ['Other','Japan','Europe','North America']
type_per_ud.plot(kind = 'barh',stacked=True,figsize=(10,4))
plt.grid(True,axis='x',linestyle='--')
plt.xlabel('Percent(%)')
plt.legend(loc=(1.02,0.1))
plt.show();
# Data for the bar plots by region
type_NA = df_type.sort_values(by='NA_Sales',ascending=False).reset_index(drop=True)
type_EU = df_type.sort_values(by='EU_Sales',ascending=False).reset_index(drop=True)
type_JP = df_type.sort_values(by='JP_Sales',ascending=False).reset_index(drop=True)
type_Ot = df_type.sort_values(by='Other_Sales',ascending=False).reset_index(drop=True)
# Bar plots by region
region_type = plsub(rows=2,cols=2,subplot_titles=('North America','Europe','Japan','Other'))
region_type.add_bar(x= type_NA.Platform_type,y=type_NA.NA_Sales,name='North America',row=1,col=1)
region_type.add_bar(x= type_EU.Platform_type,y=type_EU.EU_Sales,name='Europe',row=1,col=2)
region_type.add_bar(x= type_JP.Platform_type,y=type_JP.JP_Sales,name='Japan',row=2,col=1)
region_type.add_bar(x= type_Ot.Platform_type,y=type_Ot.Other_Sales,name='Other',row=2,col=2)
region_type.update_xaxes(tickangle= 0)
region_type.show()
df_type_year = vgames.groupby(['Year','Platform_type'],as_index=False)[Sales_list].sum()
df_type_gen = vgames.groupby(['Generation','Platform_type'],as_index=False)[Sales_list].sum()
year_type = plex.line(data_frame=df_type_year,
x='Year',
y='Global_Sales',
color='Platform_type',
title='Platform type Sales Trend(Global)')
year_type.add_vline(1980)
year_type.add_vline(1983)
year_type.add_vline(1988)
year_type.add_vline(1994)
year_type.add_vline(1998)
year_type.add_vline(2005)
year_type.add_vline(2013)
year_type.add_vline(2020)
year_type.show()
gen_type = plex.bar(data_frame=df_type_gen,
x='Generation',
y='Global_Sales',
color='Platform_type',
title='Platform type Sales Trend(Global)',
barmode='group')
gen_type.show()
df_company = vgames.groupby(['Platform_company'],as_index=False)[Sales_list].sum()
df_company
#global barplot
company_sort = df_company.sort_values(by='Global_Sales',ascending=False).reset_index(drop=True)
company_index = np.arange(len(company_sort.Platform_company))
plt.figure(figsize=(8,8))
plt.bar(company_index,company_sort.Global_Sales)
plt.title('Sales by Platform Company (Global)',fontsize=24)
plt.xlabel('')
plt.ylabel('Sales(Million)',fontsize=16)
plt.xticks(company_index,company_sort.Platform_company,fontsize=16)
plt.grid(True, axis='y',linestyle='--')
plt.ylim([0,4000])
plt.show()
#Global pie chart
explode = np.repeat(0.025,6)
wedgeprops = {'width': 0.5, 'edgecolor': 'w', 'linewidth': 1.5}
textprops={'size':12}
company_sort2 = df_company.sort_values(by='Global_Sales',ascending=False).reset_index(drop=True)
plt.figure(figsize=(12,12))
plt.pie(company_sort2.Global_Sales,
labels=company_sort2.Platform_company,
labeldistance=0.725,
startangle=0,
autopct='%.1f%%',
explode=explode,
wedgeprops=wedgeprops,
textprops=textprops)
plt.legend(loc='center')
plt.show()
# Stacked bar plot by region
company_stack = company_sort2.iloc[:,:-1].T.iloc[1:,:]
company_stack.columns = company_sort2.Platform_company
company_per = company_stack.div(company_stack.sum(axis=1),axis=0)*100
company_per_ud = company_per.loc[::-1]
company_per_ud.index = ['Other','Japan','Europe','North America']
company_per_ud.plot(kind = 'barh',stacked=True,figsize=(10,4))
plt.grid(True,axis='x',linestyle='--')
plt.xlabel('Percent(%)')
plt.legend(loc=(1.02,0.1))
plt.show();
# Data for the bar plots by region
company_NA = df_company.sort_values(by='NA_Sales',ascending=False).reset_index(drop=True)
company_EU = df_company.sort_values(by='EU_Sales',ascending=False).reset_index(drop=True)
company_JP = df_company.sort_values(by='JP_Sales',ascending=False).reset_index(drop=True)
company_Ot = df_company.sort_values(by='Other_Sales',ascending=False).reset_index(drop=True)
# Bar plots by region
region_company = plsub(rows=2,cols=2,subplot_titles=('North America','Europe','Japan','Other'))
region_company.add_bar(x= company_NA.Platform_company,y=company_NA.NA_Sales,name='North America',row=1,col=1)
region_company.add_bar(x= company_EU.Platform_company,y=company_EU.EU_Sales,name='Europe',row=1,col=2)
region_company.add_bar(x= company_JP.Platform_company,y=company_JP.JP_Sales,name='Japan',row=2,col=1)
region_company.add_bar(x= company_Ot.Platform_company,y=company_Ot.Other_Sales,name='Other',row=2,col=2)
region_company.update_xaxes(tickangle= 0)
region_company.show()
df_company_year = vgames.groupby(['Year','Platform_company'],as_index=False)[Sales_list].sum()
df_company_gen = vgames.groupby(['Generation','Platform_company'],as_index=False)[Sales_list].sum()
year_company = plex.line(data_frame=df_company_year,
x='Year',
y='Global_Sales',
color='Platform_company',
title='Platform company Sales Trend(Global)')
year_company.add_vline(1980)
year_company.add_vline(1983)
year_company.add_vline(1988)
year_company.add_vline(1994)
year_company.add_vline(1998)
year_company.add_vline(2005)
year_company.add_vline(2013)
year_company.add_vline(2020)
year_company.show()
gen_company = plex.bar(data_frame=df_company_gen,
x='Generation',
y='Global_Sales',
color='Platform_company',
title='Platform company Sales Trend(Global)',
barmode='group')
gen_company.show()
vgames_sum = vgames.groupby('Platform_Multi',as_index=False)[Sales_list].sum()
display(vgames_sum)
vgames_count = vgames.groupby('Platform_Multi',as_index=False)[Sales_list].count()
display(vgames_count)
multi_global_sum = plex.bar(data_frame=vgames_sum,x='Platform_Multi',y='Global_Sales')
multi_global_sum.show()
multi_global_sum_pie = plex.pie(data_frame=vgames_sum,hole=0.3,
values='Global_Sales',
names='Platform_Multi',
color='Platform_Multi',
color_discrete_map={'Native':'red','Multi':'blue'})
multi_global_sum_pie.update_traces(textposition='inside', textinfo='percent+label')
multi_global_sum_pie.update_layout(annotations=[dict(text='Global_Sales',showarrow=False)])
multi_global_sum_pie.show()
multi_global_n = plex.bar(data_frame=vgames_count,x='Platform_Multi',y='Global_Sales')
multi_global_n.update_layout(yaxis_title=dict(text='Count'))
multi_global_n.show()
multi_global_n_pie = plex.pie(data_frame=vgames_count,hole=0.3,
values='Global_Sales',names='Platform_Multi', color='Platform_Multi',
color_discrete_map={'Native':'red','Multi':'blue'})
multi_global_n_pie.update_traces(textposition='inside', textinfo='percent+label')
multi_global_n_pie.update_layout(annotations=[dict(text='N',showarrow=False)])
multi_global_n_pie.show()
df_multi_year = vgames.groupby(['Year','Platform_Multi'],as_index=False)[Sales_list].sum()
df_multi_gen = vgames.groupby(['Generation','Platform_Multi'],as_index=False)[Sales_list].sum()
year_multi = plex.line(data_frame=df_multi_year,
x='Year',y='Global_Sales',
color='Platform_Multi',
title='Multi Platform Sales Trend(Global)')
year_multi.add_vline(1980)
year_multi.add_vline(1983)
year_multi.add_vline(1988)
year_multi.add_vline(1994)
year_multi.add_vline(1998)
year_multi.add_vline(2005)
year_multi.add_vline(2013)
year_multi.add_vline(2020)
year_multi.show()
gen_multi = plex.bar(data_frame=df_multi_gen,
x='Generation',y='Global_Sales',
color='Platform_Multi',
title='Multi Platform Sales Trend(Global)',
barmode='group')
gen_multi.show()
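Null hypothesis: the mean sales of multi-platform games are not greater than the mean sales of single-platform games.
Alternative hypothesis: the mean sales of multi-platform games are greater than the mean sales of single-platform games.
Result: the mean sales of multi-platform games are significantly greater than the mean sales of single-platform games.
vgames_mean = vgames.groupby('Platform_Multi',as_index=False)[Sales_list].mean()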
display(vgames_mean)
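# error bar: half-width of the confidence interval of the mean
import numpy as np
from scipy import stats
def mean_conf_inter(data, confidence=0.95):
    data = np.array(data)
    n = len(data)
    # Standard error of the mean
    s = stats.sem(data)
    # t critical value times the standard error
    i = stats.t.ppf((1+confidence)/2,n-1)*s
    return i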
vgames_native = vgames.query('Platform_Multi == "Native"')
vgames_multi = vgames.query('Platform_Multi == "Multi"')
error_bar_native = mean_conf_inter(vgames_native.Global_Sales)
error_bar_multi = mean_conf_inter(vgames_multi.Global_Sales)
print(error_bar_native)
print(error_bar_multi)
# error bar output
'''
0.03694679704961938
0.02969598890309412
'''
#bar chart with error bar
multi_global_mean_error = go.Figure(data=[go.Bar(x=vgames_mean.Platform_Multi, y=vgames_mean.Global_Sales,
error_y=dict(type='data', array=[error_bar_multi,error_bar_native])
)])
multi_global_mean_error.show()
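The error bar is the half-width of a t-based confidence interval for the mean, i.e. the t critical value multiplied by the standard error. As a sanity check, the sketch below computes the same quantity by hand and compares it with scipy.stats.t.interval. The function name `ci_half_width` and the six sample values are made up for illustration.

```python
import numpy as np
from scipy import stats

def ci_half_width(data, confidence=0.95):
    """Half-width of the two-sided t confidence interval for the mean."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # Standard error of the mean, using the sample standard deviation (ddof=1)
    sem = data.std(ddof=1) / np.sqrt(n)
    return stats.t.ppf((1 + confidence) / 2, df=n - 1) * sem

# Hypothetical sample values
sample = [0.2, 0.5, 0.1, 1.3, 0.4, 0.8]
half = ci_half_width(sample)

# Reference interval from scipy; its half-width should agree with `half`
low, high = stats.t.interval(0.95, df=len(sample) - 1,
                             loc=np.mean(sample), scale=stats.sem(sample))
print(half, (high - low) / 2)
```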
vgames_native = vgames.query('Platform_Multi == "Native"')
vgames_multi = vgames.query('Platform_Multi == "Multi"')
se_native_global = vgames_native.Global_Sales
se_multi_global = vgames_multi.Global_Sales
levene_multi = stats.levene(se_native_global,se_multi_global).pvalue
print('Levene equal-variance test p-value : {:.3f}'.format(levene_multi))
ttest_multi = stats.ttest_ind(se_multi_global,se_native_global,alternative='greater',equal_var=False).pvalue
print('One-sided t-test p-value : {:.3f}'.format(ttest_multi))
#output
'''
Levene equal-variance test p-value : 0.006
One-sided t-test p-value : 0.000
'''
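Because the Levene test rejects equal variances, the t-test is run with equal_var=False, which is Welch's t-test. The sketch below spells out that statistic and its one-sided p-value by hand and checks the result against scipy.stats.ttest_ind. The two samples are randomly generated stand-ins, not the project data, and `welch_greater` is a name introduced here.

```python
import numpy as np
from scipy import stats

def welch_greater(a, b):
    """Welch's t-test for H1: mean(a) > mean(b). Returns (t, dof, p)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Squared standard errors of each sample mean
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    t = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite degrees of freedom
    dof = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # One-sided p-value: upper tail of the t distribution
    p = stats.t.sf(t, dof)
    return t, dof, p

# Synthetic stand-in samples with different sizes and spreads
rng = np.random.default_rng(0)
multi = rng.exponential(0.6, size=200)
native = rng.exponential(0.5, size=300)

t, dof, p = welch_greater(multi, native)
ref = stats.ttest_ind(multi, native, equal_var=False, alternative='greater')
print(t, dof, p)
print(ref.statistic, ref.pvalue)
```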
# Data for the bar plots by region
multi_NA = vgames_mean.sort_values(by='NA_Sales',ascending=False).reset_index(drop=True)
multi_EU = vgames_mean.sort_values(by='EU_Sales',ascending=False).reset_index(drop=True)
multi_JP = vgames_mean.sort_values(by='JP_Sales',ascending=False).reset_index(drop=True)
multi_Ot = vgames_mean.sort_values(by='Other_Sales',ascending=False).reset_index(drop=True)
# Bar plots by region
region_multi = plsub(rows=2,cols=2,subplot_titles=('North America','Europe','Japan','Other'))
region_multi.add_bar(x= multi_NA.Platform_Multi,y=multi_NA.NA_Sales,name='North America',row=1,col=1)
region_multi.add_bar(x= multi_EU.Platform_Multi,y=multi_EU.EU_Sales,name='Europe',row=1,col=2)
region_multi.add_bar(x= multi_JP.Platform_Multi,y=multi_JP.JP_Sales,name='Japan',row=2,col=1)
region_multi.add_bar(x= multi_Ot.Platform_Multi,y=multi_Ot.Other_Sales,name='Other',row=2,col=2)
region_multi.update_xaxes(tickangle= 0)
region_multi.show()
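The feedback pointed to an insufficient explanation of the dataset and to the color scheme used in the presentation template. I agreed with both points and had been unsatisfied with them myself, so from the next project onward I tried to reflect this feedback as much as possible.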