📕Pandas Study

강기환 · December 13, 2022

Pandas is built on top of NumPy and handles multidimensional data quickly.
In short, you can think of it as Excel that you operate by writing code.

1. Pandas features

  1. Fast, efficient, and expressive data structures.
    A Python package built for real-world data analysis

  2. Suited to many kinds of data
    Tabular data with columns of heterogeneous types
    Time series data
    Labeled matrix data
    Observational and statistical data sets

  3. Core structures
    Series: a single column with a one-dimensional structure
    DataFrame: two-dimensional data with multiple columns

  4. What pandas does well
    Handling missing data
    Adding and removing data (adding a new column, dropping a specific column, etc.)
    Sorting and other kinds of data manipulation

2. Series data structure (data object)

import pandas as pd
import numpy as np

# A Series is a single column; an index is generated automatically
data = pd.Series([10, 20, 30])  # placeholder values

# Index information
data.index
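
To make the automatic index concrete, here is a minimal sketch. The values and the names `scores` and `named` are made up for illustration and do not come from the notes above.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration.
scores = pd.Series([90, 85, 70])

# With no index given, pandas builds positional labels 0, 1, 2.
default_labels = list(scores.index)

# An explicit index replaces the positional labels.
named = pd.Series([90, 85, 70], index=["a", "b", "c"])
value_b = named["b"]  # label-based lookup
```

Passing `index=` is optional; without it, lookups such as `scores[0]` use the generated labels.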

3. Creating a DataFrame

# From a dict: each key becomes a column name
dic = {'name':['김수안','김수정','박동윤'], 'age':[19,23,22]}
data = pd.DataFrame(dic)

# From a list of rows, naming the columns (and optionally the index)
data1 = [['김수안',19],['김지안',20]]
data = pd.DataFrame(data1,columns=['name','age'])
data = pd.DataFrame(data1, index=['one','two'],columns=['name','age'])
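
A small sketch of what the resulting frame looks like once it is built. The names `records` and `people` and the sample values are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical rows in the same shape as the notes above.
records = [["Kim", 19], ["Lee", 20]]
people = pd.DataFrame(records, index=["one", "two"], columns=["name", "age"])

n_rows, n_cols = people.shape          # number of rows and columns
age_of_two = people.loc["two", "age"]  # look up one cell by its labels
age_column = people["age"]             # a single column comes back as a Series
```

This also shows the link between the two core structures: each column of a DataFrame is a Series that shares the frame's index.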

4. Data output

# Column labels
data.columns

# Summary information (dtypes, non-null counts)
data.info()

# Select by row and column labels
data.loc[[index, index],['column','column']]

# Select by row and column positions
data.iloc[:,0:1]

# Unique values
data['column'].unique()

# Unique values together with their counts
np.unique(data['column'],return_counts=True)
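
The difference between `loc` and `iloc`, and between the two ways of getting unique values, is easier to see on a tiny frame. The frame `df` and its contents below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame(
    {"city": ["Seoul", "Busan", "Seoul"], "sales": [100, 80, 120]},
    index=["r1", "r2", "r3"],
)

by_label = df.loc[["r1", "r3"], ["sales"]]  # rows and columns chosen by label
by_position = df.iloc[:, 0:1]               # all rows, first column (end-exclusive)

cities = df["city"].unique()                # values in order of first appearance
values, counts = np.unique(df["city"], return_counts=True)  # values sorted, with counts
```

As a pandas-only alternative to `np.unique(..., return_counts=True)`, `df["city"].value_counts()` gives the same counts as a Series.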

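5. Deleting data

data.drop([index, index])                          # drops rows by index label (axis=0 is the default)
data.drop(['column','column'])                     # KeyError: with the default axis these are looked up as row labels
data.drop(['column','column'],axis=1)              # drops columns and returns a new frame
data.drop(['column','column'],axis=1,inplace=True) # modifies data itself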
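
The sketch below walks through the axis behavior of `drop` on a made-up frame. The frame `df` and its column names `a`, `b`, `c` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

without_rows = df.drop([0, 2])         # default axis=0: labels are row labels
without_cols = df.drop(columns=["b"])  # equivalent to df.drop(["b"], axis=1)

# With the default axis, a column name is searched among row labels and fails.
try:
    df.drop(["b"])
    raised = False
except KeyError:
    raised = True

# drop returns a new frame, so df is unchanged until inplace=True (or reassignment).
columns_before = list(df.columns)
df.drop(["c"], axis=1, inplace=True)
columns_after = list(df.columns)
```

Reassigning (`df = df.drop(...)`) has the same effect as `inplace=True` and is often easier to follow.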

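6. Missing values

# Flag missing values (isnull is an alias of isna)
data.isna()
data.isnull()

# True for each column that contains at least one NaN
data.isna().any()

# Drop NaN values
data.dropna(axis=0)              # drop rows that contain NaN
data.dropna(axis=1)              # drop columns that contain NaN
data.dropna(axis=1,how = 'any')  # drop a column if any value is NaN (the default)
data.dropna(axis=1,how = 'all')  # drop a column only if every value is NaN

# Fill NaN values: pass either a value or a fill direction, not both
data.fillna(value=0)
data.bfill() # or data.ffill()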
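
Here is a short sketch of how these calls behave on a frame with one partly missing column and one fully missing column. The frame `df` and its values are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, np.nan]})

has_nan = df.isna().any()                # one boolean per column
rows_kept = df.dropna(axis=0)            # every row has a NaN in y, so no row survives
cols_all = df.dropna(axis=1, how="all")  # only y is entirely NaN, so only y is dropped

filled_zero = df.fillna(0)               # constant fill
filled_back = df["x"].bfill()            # NaN takes the next valid value
filled_fwd = df["x"].ffill()             # NaN takes the previous valid value
```

`bfill()` and `ffill()` are used here instead of `fillna(method=...)`, since combining `value` and `method` in one `fillna` call is rejected.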

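7. Operations

# Sum of the data
data.sum()
# Covariance / correlation: data.cov(), data.corr()

# Summary statistics
data.describe()

# Sorting (rows are ordered by the values in the given columns)
data.sort_values(by=['column'],ascending=False,ignore_index=True)

# Comparison operations
data['column'] >= 100
data[mask1 & mask2] # mask1, mask2: boolean Series such as the comparison above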
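
A minimal sketch of sorting and boolean filtering together. The frame `df`, its columns, and the thresholds are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"item": ["a", "b", "c"], "price": [150, 90, 300]})

total = df["price"].sum()

# Sort rows by price, highest first, and renumber the index from 0.
ranked = df.sort_values(by="price", ascending=False, ignore_index=True)

# Each comparison yields a boolean Series; combine them with & (and), | (or), ~ (not).
# The parentheses are required because & binds tighter than >= and <.
mask = (df["price"] >= 100) & (df["price"] < 200)
selected = df[mask]
```

Indexing with a boolean Series keeps only the rows where the mask is True.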

8. File input and output

pd.read_table(path,sep='\t')  # tab-separated text
pd.read_csv(path,sep=',')     # comma-separated text

data.to_csv(path)
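
A round trip can be tried without touching the disk by using an in-memory buffer in place of `path`. This is only a sketch; the frame `df` and its contents are assumptions.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Kim", "Lee"], "age": [19, 20]})

# With no path, to_csv returns the CSV text; index=False leaves out the row labels.
csv_text = df.to_csv(index=False)

# read_csv accepts a path or any file-like object.
restored = pd.read_csv(io.StringIO(csv_text))

# Tab-separated text works the same way once sep is given.
tsv_text = df.to_csv(index=False, sep="\t")
from_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
```

Without `index=False`, the row labels are written as an extra unnamed first column, which then shows up when the file is read back.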

9. merge

# Join
pd.merge(a,b,on='column',how='inner')
a.merge(b,on='column',how='inner')

# Specify a different key column on each side
pd.merge(a,b,left_on='lkey',right_on='rkey',how='inner',suffixes=['_left','_right'])

pd.merge(a,b,on='key',how='inner') # how can also be 'left', 'right', or 'outer'

# Join on the index of both frames
pd.merge(a,b,left_index=True,right_index=True,how='outer')

# NumPy concatenation
np.concatenate((a,b),axis=1)

# pandas concatenation
pd.concat([a,b],axis=0)

pd.concat([a,b],axis=1,join='inner',ignore_index=False)
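
The effect of `how`, `suffixes`, and the `concat` axis is clearer on two small frames. The frames `left` and `right` and their contents are hypothetical.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "v": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "v": [20, 30, 40]})

inner = pd.merge(left, right, on="key", how="inner")  # only keys present in both
outer = pd.merge(left, right, on="key", how="outer")  # keys from either side

# Overlapping non-key columns are disambiguated with suffixes.
named = pd.merge(left, right, on="key", suffixes=("_left", "_right"))

stacked = pd.concat([left, right], axis=0, ignore_index=True)  # rows appended
side_by_side = pd.concat([left, right], axis=1, join="inner")  # columns aligned on the index
```

Note that `merge` matches rows on key values, while `concat` with `axis=1` only lines frames up by their index labels.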

# Find duplicates
data.duplicated(keep='last') # the last occurrence is not flagged
data.duplicated(keep=False)  # every occurrence is flagged

# Find duplicates within specific columns
data.duplicated(subset=['column'])

# Remove duplicates
data.drop_duplicates(keep='first')

# Remove every row that has a duplicate
data.drop_duplicates(keep=False)
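
The `keep` options are easy to mix up, so here is a small sketch on a frame where rows 0 and 2 are identical. The frame `df` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "b", "a", "c"], "v": [1, 2, 1, 3]})

first_kept = df.duplicated()             # default keep='first': only the later copy is flagged
last_kept = df.duplicated(keep="last")   # only the earlier copy is flagged
all_flagged = df.duplicated(keep=False)  # both copies are flagged

deduped = df.drop_duplicates(keep="first")   # one copy of the repeated row remains
no_repeats = df.drop_duplicates(keep=False)  # the repeated row disappears entirely
```

`drop_duplicates` keeps the original index labels; pass `ignore_index=True` if a fresh 0..n-1 index is wanted.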

10. Groupby

# Dummy variables
pd.get_dummies(data)

pd.get_dummies(data['column1'],prefix='column2')

# Grouping
data.groupby('column')
data.groupby(['column1','column2'])

# MultiIndex
data.xs(key=100,level='column')
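
`groupby` on its own only returns a grouping object, so the sketch below adds an aggregation and then uses `xs` on the resulting MultiIndex. The frame `df`, its columns, and the choice of `sum` are assumptions for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "team": ["A", "A", "B", "B"],
        "year": [2021, 2022, 2021, 2022],
        "score": [10, 20, 30, 40],
    }
)

by_team = df.groupby("team")["score"].sum()            # one value per team
by_both = df.groupby(["team", "year"])["score"].sum()  # indexed by (team, year)

# xs takes a cross-section at one level of the MultiIndex.
year_2022 = by_both.xs(key=2022, level="year")

# One indicator column per distinct value, named with the given prefix.
dummies = pd.get_dummies(df["team"], prefix="team")
```

Grouping by two columns is what produces the MultiIndex that `xs` then slices by level name.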