판다스의 groupby

코드싸개·2021년 1월 4일

캐글 Advanced 머신러닝 실전 박치기

목록 보기

2/7

SQL에서 한라인으로 처리할 수 있는 작업을 Pandas에서는 agg() 로 한다.

import pandas as pd

cust_dict = {
    'customer_name':['Alice', 'Tom', 'James', 'Yerin', 'Min', 'Katherine', 'John', 'Park', 'Maria', 'Derik', 'Jin'],
    'cust_country':['US', 'GB', 'US', 'KOR', 'KOR', 'GB', 'US', 'KOR', 'US', 'GB', 'KOR'],
    'gender':['F', 'M', 'M', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'F'],
    'grade':[3, 1, 2, 5, 4, 3, 5, 7, 2, 4, 5],
    'age':[25, 34, 26, 33, 67, 29, 54, 21, 77, 29, 16]
}
customer = pd.DataFrame(cust_dict)

customer.head(11)

	customer_name	cust_country	gender	grade	age
0	Alice	US	F	3	25
1	Tom	GB	M	1	34
2	James	US	M	2	26
3	Yerin	KOR	F	5	33
4	Min	KOR	M	4	67
5	Katherine	GB	F	3	29
6	John	US	M	5	54
7	Park	KOR	M	7	21
8	Maria	US	F	2	77
9	Derik	GB	M	4	29
10	Jin	KOR	F	5	16

Pandas Groupby

SQL Groupby

SELECT cust_country, sum(grade) sum_grade, max(grade) max_grade, avg(age) avg_age
, max(age) max_age FROM customer group by cust_country

grouby key 컬럼으로 DataFrameGroupby 를 생성

cust_group = customer.groupby('cust_country')
type(cust_group)

pandas.core.groupby.generic.DataFrameGroupBy

cust_group.head()

	customer_name	cust_country	gender	grade	age
0	Alice	US	F	3	25
1	Tom	GB	M	1	34
2	James	US	M	2	26
3	Yerin	KOR	F	5	33
4	Min	KOR	M	4	67
5	Katherine	GB	F	3	29
6	John	US	M	5	54
7	Park	KOR	M	7	21
8	Maria	US	F	2	77
9	Derik	GB	M	4	29
10	Jin	KOR	F	5	16

개별 aggregation 컬럼 및 aggregation 연산별로 API 호출

cust_group['grade'].sum()

cust_country
GB      8
KOR    21
US     12
Name: grade, dtype: int64

# 같은 의미
customer.groupby('cust_country')['grade'].sum()

cust_country
GB      8
KOR    21
US     12
Name: grade, dtype: int64

cust_agg = pd.DataFrame()
cust_agg['sum_grade'] = cust_group['grade'].sum()
cust_agg['max_grade'] = cust_group['grade'].max()
cust_agg.head()

	sum_grade	max_grade
cust_country
GB	8	4
KOR	21	7
US	12	5

cust_agg['avg_age'] = cust_group['age'].mean()
cust_agg['max_age'] = cust_group['age'].max()
cust_agg.head()

	sum_grade	max_grade	avg_age	max_age
cust_country
GB	8	4	30.666667	34
KOR	21	7	34.250000	67
US	12	5	45.500000	77

# group by 키 컬럼은 Pandas DataFrame의 인덱스로 변환. 
print(cust_agg.index.name)
cust_agg = cust_agg.reset_index()
cust_agg.head()

cust_country

	cust_country	sum_grade	max_grade	avg_age	max_age
0	GB	8	4	30.666667	34
1	KOR	21	7	34.250000	67
2	US	12	5	45.500000	77

agg()함수를 활용. agg() 인자로 aggregation 함수 리스트를 입력. 여전히 aggregation 컬럼별로 API 적용

cust_agg1 = cust_group['grade'].agg(['sum', 'max'])
cust_agg2 = cust_group['age'].agg(['mean', 'max'])
print(cust_agg1.head())
print(cust_agg2.head())

              sum  max
cust_country          
GB              8    4
KOR            21    7
US             12    5
                   mean  max
cust_country                
GB            30.666667   34
KOR           34.250000   67
US            45.500000   77

cust_agg = cust_agg1.merge(cust_agg2, on='cust_country', how='left')
cust_agg.head()

	sum	max_x	mean	max_y
cust_country
GB	8	4	30.666667	34
KOR	21	7	34.250000	67
US	12	5	45.500000	77

cust_agg = pd.DataFrame()
cust_agg[['sum_grade', 'max_grade']] = cust_group['grade'].agg(['sum', 'max'])
cust_agg[['avg_age', 'max_age']] = cust_group['age'].agg(['mean', 'max'])
cust_agg.head()

	sum_grade	max_grade	avg_age	max_age
cust_country
GB	8	4	30.666667	34
KOR	21	7	34.250000	67
US	12	5	45.500000	77

agg()에 인자로 Aggregation 컬럼값과 Aggregation 함수 리스트를 기재한 dictinary값 입력

보편적으로 이 방법을 사용한다.

agg_dict = {
    'grade':['sum', 'max'],
    'age':['mean', 'max']
}
cust_agg = cust_group.agg(agg_dict)
cust_agg.head()

	grade	age
	sum	max	mean	max
cust_country
GB	8	4	30.666667	34
KOR	21	7	34.250000	67
US	12	5	45.500000	77

print(cust_agg.columns)
[('_').join(column) for column in cust_agg.columns]

MultiIndex([('grade',  'sum'),
            ('grade',  'max'),
            (  'age', 'mean'),
            (  'age',  'max')],
           )





['grade_sum', 'grade_max', 'age_mean', 'age_max']

groupby 수행 결과 DataFrame의 컬럼명 변경.

cust_agg.columns = [('_').join(column) for column in cust_agg.columns]
cust_agg.head()

	grade_sum	grade_max	age_mean	age_max
cust_country
GB	8	4	30.666667	34
KOR	21	7	34.250000	67
US	12	5	45.500000	77

Pandas Groupby Case When

Pandas는 Group by case when 구문을 지원하지 않는다.

SQL

SELECT cust_country, sum(grade) total_sum,  
Sum(case when gender==‘M’ then grade end) male_sum,  
Sum(case when gender==‘F’ then grade end) female_sum  FROM customer GROUP BY cust_country

case when 조건으로 각각 filtering하여 별도의 DataFrame생성

cond_male = customer['gender'] == 'M'
cond_female = customer['gender'] == 'F'
cust_male = customer[cond_male]
cust_female = customer[cond_female]
cust_male

	customer_name	cust_country	gender	grade	age
1	Tom	GB	M	1	34
2	James	US	M	2	26
4	Min	KOR	M	4	67
6	John	US	M	5	54
7	Park	KOR	M	7	21
9	Derik	GB	M	4	29

cust_female

	customer_name	cust_country	gender	grade	age
0	Alice	US	F	3	25
3	Yerin	KOR	F	5	33
5	Katherine	GB	F	3	29
8	Maria	US	F	2	77
10	Jin	KOR	F	5	16

원본 데이터에 cust_country로 groupby 수행.

cust_agg = customer.groupby('cust_country')['grade'].sum()
cust_agg = cust_agg.reset_index()
cust_agg.head()

	cust_country	grade
0	GB	8
1	KOR	21
2	US	12

filtering된 데이터에 cust_country로 groupby 수행.

cust_male_agg = cust_male.groupby('cust_country')['grade'].sum()
cust_female_agg = cust_female.groupby('cust_country')['grade'].sum()
cust_male_agg = cust_male_agg.reset_index()
cust_female_agg = cust_female_agg.reset_index()
cust_male_agg

	cust_country	grade
0	GB	5
1	KOR	11
2	US	7

원본 데이터에 groupby 된 데이터 세트와 filtering된 데이터에 groupby 된 데이터 세트를 조인.

cust_agg = cust_agg.merge(cust_male_agg, on='cust_country', how='left')
cust_agg = cust_agg.merge(cust_female_agg, on='cust_country', how='left')
cust_agg.head()

	cust_country	grade_x	grade_y	grade
0	GB	8	5	3
1	KOR	21	11	10
2	US	12	7	5

cust_agg.columns = ['cust_country', 'sum_grade', 'sum_male_grade', 'sum_female_grade']
cust_agg

	cust_country	sum_grade	sum_male_grade	sum_female_grade
0	GB	8	5	3
1	KOR	21	11	10
2	US	12	7	5

Source: inflearn / 캐글 Advanced 머신러닝 실전 박치기

코드싸개

데이터 분석 공부용 벨로그

이전 포스트

판다스의 join

다음 포스트

판다스의 groupby

캐글 Advanced 머신러닝 실전 박치기

Pandas Groupby

SQL Groupby

grouby key 컬럼으로 DataFrameGroupby 를 생성

개별 aggregation 컬럼 및 aggregation 연산별로 API 호출

agg()함수를 활용. agg() 인자로 aggregation 함수 리스트를 입력. 여전히 aggregation 컬럼별로 API 적용

agg()에 인자로 Aggregation 컬럼값과 Aggregation 함수 리스트를 기재한 dictinary값 입력

groupby 수행 결과 DataFrame의 컬럼명 변경.

Pandas Groupby Case When

SQL

case when 조건으로 각각 filtering하여 별도의 DataFrame생성

원본 데이터에 cust_country로 groupby 수행.

filtering된 데이터에 cust_country로 groupby 수행.

원본 데이터에 groupby 된 데이터 세트와 filtering된 데이터에 groupby 된 데이터 세트를 조인.

판다스의 join

Kaggle: Home Credit Default Risk 첫 번째 모델 학습 및 성능 평가

0개의 댓글