📕Week3 day3(Pandas)

박준희 · September 6, 2023


Pandas

pandas is a software library written for the Python programming language for data manipulation and analysis. Source: Wikipedia

Getting started with pandas


Import the pandas library.

import pandas as pd

Handling 1-D data with pandas - Series


What is a Series?

  • 1-D (one-dimensional) labeled array
  • An index can be specified
s = pd.Series([1,4,9,16,25])
s
>>>0     1
1     4
2     9
3    16
4    25
dtype: int64

t = pd.Series({'one':1, 'two':2,'three':3,'four':4,'five':5})
t
>>>one      1
two      2
three    3
four     4
five     5
dtype: int64
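
Both examples above rely on default or dictionary-derived labels. As a minimal sketch of specifying the index directly (the labels and the variable name `u` are made up for illustration), the `index` argument can be passed when the Series is created:

```python
import pandas as pd

# Hypothetical labels, chosen only for this illustration.
u = pd.Series([1, 4, 9], index=["a", "b", "c"])

print(u.index.tolist())  # the labels passed via `index`
print(u["b"])            # label-based lookup
```

The list passed to `index` must have the same length as the data.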

Series + Numpy

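  • A Series behaves much like a NumPy array.
s[1]  # indexing works, and so does slicing
>>>4
t.iloc[1]  # positional access; plain t[1] on a string index is deprecated in recent pandas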
>>>2
t[1:3]
>>>two      2
three    3
dtype: int64

s[s > s.median()]  # select the values greater than the Series' own median
>>>3    16
4    25
dtype: int64

s[[3, 1, 4]]  # several index labels can be selected at once by passing a list
>>>3    16
1     4
4    25
dtype: int64

import numpy as np
np.exp(s)  # computes the exponential of every element of s
>>>0    2.718282e+00
1    5.459815e+01
2    8.103084e+03
3    8.886111e+06
4    7.200490e+10
dtype: float64

s.dtype
>>>dtype('int64')
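
A Series carries labels as well as positions, so it can help to state which one is meant. The following is a small sketch, assuming the same `t` as above, that uses `.iloc` for positions and `.loc` for labels; the variable names are only for illustration:

```python
import pandas as pd

t = pd.Series({"one": 1, "two": 2, "three": 3, "four": 4, "five": 5})

second_by_position = t.iloc[1]     # second element, whatever its label is
second_by_label = t.loc["two"]     # element whose label is "two"
middle = t.iloc[1:3]               # positional slice: the end is excluded
labelled = t.loc["two":"three"]    # label slice: the end is included

print(middle)
print(labelled)
```

Both slices select the same two elements here, but for different reasons: the positional slice stops before position 3, while the label slice runs through "three".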

Series + dictionary

  • A Series also behaves much like a dictionary.
t
>>>one      1
two      2
three    3
four     4
five     5
dtype: int64

t['one']
>>>1

# Add a value to the Series
t['six'] = 6
t
>>>one      1
two      2
three    3
four     4
five     5
six      6
dtype: int64

'six' in t
>>>True
'seven' in t
>>>False

t.get('seven')  # returns None instead of raising an error when the key is missing
t.get('seven',0)
>>>0
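
The dictionary-style calls above can be checked in one place. This is only a sketch with a shortened, made-up version of `t`; it shows what `get` returns for a missing key and two other dictionary-like helpers:

```python
import pandas as pd

t = pd.Series({"one": 1, "two": 2, "three": 3})

missing = t.get("seven")        # None: no KeyError is raised
fallback = t.get("seven", 0)    # the supplied default is returned instead
labels = list(t.keys())         # keys() returns the index
as_dict = t.to_dict()           # back to a plain Python dict

print(missing, fallback, labels, as_dict)
```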

Naming a Series

  • A Series has a name attribute.
  • The name can be set when the Series is first created.
s = pd.Series(np.random.randn(5),name = "random_nums")
s
>>>0    2.193934
1   -0.756458
2   -0.968082
3    0.069727
4   -0.914793
Name: random_nums, dtype: float64
# The output now includes a Name field.
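
The name does not have to be fixed at creation time. As a hedged sketch (the new names "noise" and "gaussian" are invented here), it can be reassigned through the attribute or with `rename`, and it becomes the column label when the Series is turned into a DataFrame:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(5), name="random_nums")

s.name = "noise"                  # reassign the attribute on the same object
renamed = s.rename("gaussian")    # rename() returns a new Series with that name
frame = renamed.to_frame()        # the name becomes the column label

print(s.name, renamed.name, list(frame.columns))
```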

Handling 2-D data with pandas - DataFrame


DataFrame

  • 2-D (two-dimensional) labeled table
  • An index can be specified
d = {'height':[1,2,3,4],'weight':[30,40,50,60]}

df = pd.DataFrame(d)
df


The DataFrame is displayed as a 2-D table.

## Check the dtypes
# Each column of a DataFrame can have a different type, so use dtypes
df.dtypes
>>>height    int64
weight    int64
dtype: object
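
In the example above both columns happen to be integers. To see per-column types differ, and to see the index being specified, here is a small sketch with made-up data (the frame `people` and its row labels are hypothetical):

```python
import pandas as pd

# Made-up data: one integer column, one float column and one text column.
people = pd.DataFrame(
    {
        "height": [150, 160, 170],
        "weight": [45.5, 52.0, 61.3],
        "name": ["a", "b", "c"],
    },
    index=["p1", "p2", "p3"],  # hypothetical row labels
)

print(people.dtypes)  # one dtype per column
print(people.index.tolist())
```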

From CSV to dataframe

  • A DataFrame can be created from a comma-separated values (CSV) file.
  • pd.read_csv loads the file.
# When country_wise_latest.csv is in the same directory
#covid = pd.read_csv("./country_wise_latest.csv")
# The path where I saved it
covid = pd.read_csv("./archive/country_wise_latest.csv")
covid


This loads the country_wise_latest.csv file as a DataFrame.
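
To try `read_csv` without the file on disk, the CSV text can be supplied from memory. The rows below are invented for this sketch and are not taken from country_wise_latest.csv; the `Country` column name is also hypothetical:

```python
import io

import pandas as pd

# Stand-in for a file on disk; read_csv accepts any file-like object.
csv_text = """Country,Confirmed,WHO Region
A,100,Europe
B,250,Africa
C,75,Europe
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (rows, columns)
print(sample.columns.tolist())  # header row becomes the column names
```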

Using pandas 1. Viewing part of the data

  • head(): returns the first n rows
# Show the first 5 rows
covid.head(5)

  • tail(): returns the last n rows
covid.tail()
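
A quick way to see what `head` and `tail` return is a frame whose rows are easy to recognise. This sketch uses a made-up frame `numbers` rather than the covid data:

```python
import pandas as pd

numbers = pd.DataFrame({"n": range(10)})  # made-up frame with rows 0..9

first_three = numbers.head(3)   # rows 0, 1, 2
last_two = numbers.tail(2)      # rows 8, 9
default_head = numbers.head()   # n defaults to 5 when omitted

print(first_three)
print(last_two)
```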

Using pandas 2. Accessing data

  • df['column_name'] or df.column_name
covid['Confirmed']
>>>0      36263
1       4880
2      27973
3        907
4        950
       ...  
182    10621
183       10
184     1691
185     4552
186     2704
Name: Confirmed, Length: 187, dtype: int64

A column can be accessed in much the same way as 1-D data.
The form covid.Active also works, but it cannot be used for WHO Region because the name contains a space.

Data extracted this way is a Series.

type(covid['Confirmed'])  # each column of a DataFrame is a Series
>>>pandas.core.series.Series
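
The difference between the two access forms comes down to whether the column name can be written as a Python attribute. A small sketch with made-up rows (only the column names mirror the covid data) checks this with `str.isidentifier`:

```python
import pandas as pd

# Made-up rows; only the column names follow the covid data.
df = pd.DataFrame({"Active": [10, 20], "WHO Region": ["Europe", "Africa"]})

same = df.Active.equals(df["Active"])   # both forms return the same Series

# A name with a space is not a valid identifier, so df.<name> cannot be written for it.
usable = {col: col.isidentifier() for col in df.columns}

regions = df["WHO Region"]              # bracket access works for any column name
print(same, usable)
```

Bracket access is the safer default, since attribute access also stops working when a column name clashes with an existing DataFrame method.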

Therefore,

covid['Confirmed'][0]
>>>36263

a single value can be extracted this way, and

covid['Confirmed'][1:5]
>>>1     4880
2    27973
3      907
4      950
Name: Confirmed, dtype: int64

slicing works as well.

Using pandas 3. Accessing data with conditions

Rows that satisfy a condition can be extracted by indexing with that condition.

# Find the countries with more than 100 new cases
covid[covid["New cases"]>100].head(5)


Every value in the New cases column of the result is greater than 100.

Next, I extracted only the rows whose WHO region is South-East Asia.

# Find the countries whose WHO Region is South-East Asia
covid['WHO Region'].unique()
>>>array(['Eastern Mediterranean', 'Europe', 'Africa', 'Americas',
       'Western Pacific', 'South-East Asia'], dtype=object)

I did not know exactly how South-East Asia is written in the WHO Region column, so I used unique() to list the values that column contains.

covid[covid['WHO Region']=='South-East Asia']

After that, a condition extracts only the rows whose WHO region is South-East Asia.
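
What happens inside the brackets is easier to see when the condition is stored first. In this sketch the rows are made up (the column names follow the covid data), and `mask` is just an illustrative variable name:

```python
import pandas as pd

# Made-up rows; the column names follow the covid data used above.
df = pd.DataFrame({
    "New cases": [150, 30, 500, 0],
    "WHO Region": ["Europe", "South-East Asia", "South-East Asia", "Africa"],
})

mask = df["New cases"] > 100   # boolean Series with one value per row
busy = df[mask]                # keeps only the rows where mask is True

# isin() tests membership in a list of accepted values.
sea = df[df["WHO Region"].isin(["South-East Asia"])]

print(mask.tolist())
print(busy)
print(sea)
```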

Using pandas 4. Accessing data by row

Until now we accessed data by column; this time we access it by row.
First, create some made-up library data.

books_dict = {'Available':[True,True,True], 'Location' : [102,215,323],'Genre':['Programming','Physics','Math']}
books_df = pd.DataFrame(books_dict, index = ['버그란 무엇인가','두근두근 물리학', '미분해줘 홈즈'])
books_df

  • Select by index label: .loc[row, col]
books_df.loc['버그란 무엇인가']
type(books_df.loc['버그란 무엇인가'])
>>>Available           True
Location             102
Genre        Programming
Name: 버그란 무엇인가, dtype: object
>>>pandas.core.series.Series

Because the book titles were used as the index when the data was created, a book title goes inside loc. Data extracted this way is also a Series.

  • Select by integer position: .iloc[rowindex, colindex]
# Get the value at row 0, column 1
books_df.iloc[0,1]
>>>102

iloc selects data using integer positions only.

# Get columns 0-1 of row 1
books_df.iloc[1,0:2]
>>>Available    True
Location      215
Name: 두근두근 물리학, dtype: object
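
The two accessors can be compared side by side on the same cell. This is a sketch only: it rebuilds the library table with short hypothetical labels ("b1", "b2", "b3") in place of the book titles:

```python
import pandas as pd

books_df = pd.DataFrame(
    {
        "Available": [True, True, True],
        "Location": [102, 215, 323],
        "Genre": ["Programming", "Physics", "Math"],
    },
    index=["b1", "b2", "b3"],  # short hypothetical labels instead of book titles
)

cell_by_label = books_df.loc["b2", "Location"]      # row label, column label
cell_by_position = books_df.iloc[1, 1]              # the same cell by position
label_slice = books_df.loc["b1":"b2", "Location"]   # label slices include the end
position_slice = books_df.iloc[0:2, 1]              # positional slices exclude it

print(cell_by_label, cell_by_position)
print(label_slice)
```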

Using pandas 5. groupby

groupby() works in three steps.

  • Split: divide the DataFrame into groups according to some criterion
  • Apply: reduce each group by applying a statistical function (sum, mean, median, ...)
  • Combine: build a new Series from the applied results (group_key : applied_value)

I ran groupby on the covid data.

# Confirmed cases by WHO Region

#1. Extract only the Confirmed column from covid.
#2. Group it by covid's WHO Region (split); calling sum() afterwards does the apply and combine.
covid_by_region = covid['Confirmed'].groupby(by = covid["WHO Region"])  # by is passed as a Series because the Confirmed column alone carries no WHO Region values
covid_by_region.sum()
>>>WHO Region
Africa                    723207
Americas                 8839286
Eastern Mediterranean    1490744
Europe                   3299523
South-East Asia          1835297
Western Pacific           292428
Name: Confirmed, dtype: int64

# Mean confirmed cases per region: sum / number of countries
covid_by_region.mean()
>>>WHO Region
Africa                    15066.812500
Americas                 252551.028571
Eastern Mediterranean     67761.090909
Europe                    58920.053571
South-East Asia          183529.700000
Western Pacific           18276.750000
Name: Confirmed, dtype: float64

Looking at the code, we extract the needed 'Confirmed' column from covid and run groupby with the by argument as the grouping key. Applying an aggregation such as sum() then finishes the job.
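
The split, apply and combine steps can also be written out by hand, which may make it clearer what groupby does internally. The numbers below are made up (only the column names follow the covid data), and the manual loop is an illustration rather than how pandas implements it:

```python
import pandas as pd

# Made-up numbers; only the column names follow the covid data.
df = pd.DataFrame({
    "WHO Region": ["Europe", "Africa", "Europe", "Africa", "Europe"],
    "Confirmed": [10, 20, 30, 40, 50],
})

# Split: iterate over one sub-frame per region. Apply: sum each one.
pieces = {
    region: group["Confirmed"].sum()
    for region, group in df.groupby("WHO Region")
}
# Combine: collect the per-group results into a new Series.
manual = pd.Series(pieces, name="Confirmed")

# The same result in one call, grouping by column name.
built_in = df.groupby("WHO Region")["Confirmed"].sum()

print(manual)
print(built_in)
```

Grouping the whole frame by column name and then selecting `Confirmed` gives the same values as grouping the selected column by a separate Series.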

Practice exercises

1. In the covid data, which country has the highest death rate per 100 cases (Deaths / 100 Cases)?

covid[covid['Deaths / 100 Cases']==covid['Deaths / 100 Cases'].max()]


I used max() to extract the row whose Deaths / 100 Cases value is the largest.
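
An alternative worth knowing is `idxmax`, which returns the index label of the maximum directly. This sketch uses made-up rows, and the `Country` column name is hypothetical:

```python
import pandas as pd

# Made-up rows; "Country" is a hypothetical column name.
df = pd.DataFrame({
    "Country": ["A", "B", "C"],
    "Deaths / 100 Cases": [1.5, 9.2, 3.3],
})

top_label = df["Deaths / 100 Cases"].idxmax()   # index label of the maximum
top_row = df.loc[top_label]                     # the full row as a Series

print(top_label)
print(top_row)
```

One difference: `idxmax` returns only the first maximum, while comparing against `max()` keeps every row that ties for the highest value.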

2. In the covid data, print every country with no new cases whose WHO Region is 'Europe'.

tmp = covid.copy()
conditions =(tmp['New cases']==0)&(tmp["WHO Region"]=='Europe')
tmp = tmp[conditions]
tmp


Writing two or more conditions inline in a single indexing expression is hard to read, so I stored the combined condition in a variable named conditions and then used it to access the data.
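
Taking the same idea one step further, each condition can get its own name before they are combined. The rows here are made up and the mask names are only illustrative:

```python
import pandas as pd

# Made-up rows; the column names follow the covid data.
df = pd.DataFrame({
    "New cases": [0, 0, 12, 0],
    "WHO Region": ["Europe", "Africa", "Europe", "Europe"],
})

no_new_cases = df["New cases"] == 0
in_europe = df["WHO Region"] == "Europe"

# & is the element-wise AND; when the comparisons are written inline,
# each one needs its own parentheses because & binds more tightly than ==.
result = df[no_new_cases & in_europe]

print(result)
```

Boolean indexing already returns a new object, so the filtered result can be kept in its own variable without first copying the whole frame.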

3. Using the following data, print the highest average price (AveragePrice) of avocados for each region.

avocado = pd.read_csv("./avocado.csv")  # load the data
avocado.head()  # check part of the data
avocado.columns  # check the column names
avocado['AveragePrice'].groupby(by = avocado['region']).max()
>>>region
Albany                 2.13
Atlanta                2.75
BaltimoreWashington    2.28
Boise                  2.79
Boston                 2.19
...
Syracuse               2.44
Tampa                  3.17
TotalUS                2.09
West                   2.52
WestTexNewMexico       2.93
Name: AveragePrice, dtype: float64

First I loaded the data and used head() to look over part of it.
Then, to output values per region, I checked the column names with the columns attribute and used groupby().max() to print the highest AveragePrice for each region.
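
If more than one statistic per region is wanted, `agg` accepts a list of function names. The prices below are invented for this sketch; only the column names follow avocado.csv as used above:

```python
import pandas as pd

# Made-up prices; the column names follow avocado.csv as used above.
avocado = pd.DataFrame({
    "region": ["Albany", "Albany", "Boston", "Boston"],
    "AveragePrice": [1.2, 2.1, 1.8, 1.5],
})

highest = avocado.groupby("region")["AveragePrice"].max()
summary = avocado.groupby("region")["AveragePrice"].agg(["min", "max"])

print(highest)
print(summary)  # one row per region, one column per statistic
```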

The end~!

