21. DataFrame Applications - Group Operation Methods (Apply - Combine)

김동웅 · September 28, 2021

Pandas with python


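🔔 Data Aggregation

Just as we computed the mean of each group on the group object split earlier, a variety of operations can be applied to a group object. This process is called data aggregation.

The pandas methods with built-in aggregation functionality include

mean(), max(), min(), sum(), count(), size(), var(), std(), describe(), first(), last(), and others.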
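As a quick illustration of a few of these built-in methods, here is a minimal sketch on a small made-up DataFrame. The `toy` data and its values are hypothetical and only stand in for the titanic columns used in this post.

```python
import pandas as pd

# Hypothetical toy data (not the titanic dataset)
toy = pd.DataFrame({
    'class': ['First', 'First', 'Second', 'Second', 'Third'],
    'fare': [80.0, 120.0, 20.0, 30.0, 8.0],
    'age': [40.0, None, 30.0, 28.0, 22.0],
})

toy_grouped = toy.groupby('class')

fare_mean = toy_grouped['fare'].mean()   # one mean value per group
age_count = toy_grouped['age'].count()   # count() skips missing values
group_size = toy_grouped.size()          # size() counts every row, missing or not

print(fare_mean)
print(age_count)
print(group_size)
```

Note the difference between count() and size(): the 'First' group has two rows but only one non-missing age.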

Aggregating the standard deviation: group_object.std()


import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:,['age','sex','class','fare','survived']]

grouped = df.groupby(['class'])

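# Aggregate the standard deviation of every numeric column per group into a DataFrame
# (numeric_only=True skips the string column 'sex', which has no standard deviation)
std_all = grouped.std(numeric_only=True)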
print(std_all)

          age       fare  survived
class
First   14.802856  78.380373  0.484026
Second  14.001077  13.417399  0.500623
Third   12.495398  11.778142  0.428949

💡 The standard deviation of the fare is far larger in First class (about 78.4) than in Second (about 13.4) or Third (about 11.8).

To apply a user-defined aggregation function to a group object, use the agg() method.

Aggregating with the agg() method: group_object.agg(mapping_function)


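def min_max(x):
    # Range of the values: maximum minus minimum
    return x.max() - x.min()

# Apply only to the numeric columns; subtraction is undefined for the string column 'sex'
agg_minmax = grouped[['age', 'fare', 'survived']].agg(min_max)
print(agg_minmax.head())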

          age      fare  survived
class
First   79.08  512.3292         1
Second  69.33   73.5000         1
Third   73.58   69.5500         1

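Several functions can also be applied at once to aggregate each group's data.

Mapping several functions to all columns:
group_object.agg([func1, func2, func3])

Mapping a different function to each column:
group_object.agg({'col1': func1, 'col2': func2, ...})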

agg_all = grouped.agg(['min','max'])
print(agg_all.head())

agg_sep = grouped.agg({'fare':['min','max'],'age':'mean'})
print(agg_sep.head())

         age           sex       fare           survived
         min   max     min   max  min       max      min max
class
First   0.92  80.0  female  male  0.0  512.3292        0   1
Second  0.67  70.0  female  male  0.0   73.5000        0   1
Third   0.42  74.0  female  male  0.0   69.5500        0   1


            fare             age
        min       max       mean
class
First   0.0  512.3292  38.233441
Second  0.0   73.5000  29.877630
Third   0.0   69.5500  25.140620
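Both outputs above have two-level column labels (column name, then function name). The following sketch, on a hypothetical `toy` DataFrame, shows one way to pick a single aggregated column out of such a result and, optionally, to flatten the labels. The flattened names such as 'fare_max' are an illustrative choice, not something pandas produces on its own.

```python
import pandas as pd

# Hypothetical toy data (not the titanic dataset)
toy = pd.DataFrame({
    'class': ['First', 'First', 'Second', 'Second'],
    'fare': [80.0, 120.0, 20.0, 30.0],
    'age': [40.0, 50.0, 30.0, 28.0],
})

toy_agg = toy.groupby('class').agg({'fare': ['min', 'max'], 'age': 'mean'})

# The columns are (column name, function name) pairs
print(toy_agg.columns.tolist())

# Select one aggregated column with a tuple key
fare_max = toy_agg[('fare', 'max')]
print(fare_max)

# Optionally join the two levels into single strings such as 'fare_max'
flat = toy_agg.copy()
flat.columns = ['_'.join(col) for col in flat.columns]
print(flat)
```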

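Data Transformation with Group Operations

The agg() method covered above -> applies a function to each group's data separately and returns one aggregated result per group.

The transform() method -> also applies the function group by group, but instead of returning one aggregate per group it returns the results aligned with the original row index and column names of each element.

In other words, it reshapes the result of the group operation to match the form of the original DataFrame.

Data transformation: group_object.transform(mapping_function)

❗ Let's look at an example to see exactly what this means.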

import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:,['age','sex','class','fare','survived']]

# Group by a single column name so that each key is a plain label such as 'First'
grouped = df.groupby('class')

age_mean = grouped.age.mean()
print(age_mean,'\n')

age_std = grouped.age.std()
print(age_std,'\n')

# Iterate over the 'age' column of the group object, compute the z-score per group, and print it
for key, group in grouped.age:
    group_zscore = (group - age_mean.loc[key]) / age_std.loc[key]
    print('* origin : ', key)
    print(group_zscore.head(3))
    print('\n')

class
First     38.233441
Second    29.877630
Third     25.140620
Name: age, dtype: float64 

class
First     14.802856
Second    14.001077
Third     12.495398
Name: age, dtype: float64

* origin :  First
1   -0.015770
3   -0.218434
6    1.065103
Name: age, dtype: float64

* origin :  Second
9    -1.134029
15    1.794317
17         NaN
Name: age, dtype: float64

* origin :  Third
0   -0.251342
2    0.068776
4    0.789041
Name: age, dtype: float64

This time, let's use the transform() method to convert the 'age' column to z-scores directly.
We define a user function that computes the z-score and pass it as the argument to transform().

def z_score(x):
    return (x - x.mean()) / x.std()

# Convert the 'age' column to z-scores with the transform() method
age_zscore = grouped.age.transform(z_score)

print(age_zscore.loc[[1, 9, 0]])  # first row of each of the First, Second, and Third groups
print('\n')

print(len(age_zscore))  # length of the value returned by transform()
print('\n')

print(age_zscore.loc[0:9])  # values returned by transform() (first 10)
print('\n')

print(type(age_zscore))  # type of the object returned by transform()

1   -0.015770
9   -1.134029
0   -0.251342
Name: age, dtype: float64

891

0   -0.251342
1   -0.015770
2    0.068776
3   -0.218434
4    0.789041
5         NaN
6    1.065103
7   -1.851931
8    0.148805
9   -1.134029
Name: age, dtype: float64

<class 'pandas.core.series.Series'>
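To make the contrast between agg() and transform() concrete, here is a minimal sketch on a hypothetical six-row `toy` DataFrame. It shows that agg() returns one row per group while transform() returns one row per original row, and it checks that the z-scores have mean 0 and standard deviation 1 within each group.

```python
import pandas as pd

# Hypothetical toy data (not the titanic dataset)
toy = pd.DataFrame({
    'class': ['First', 'Second', 'First', 'Second', 'First', 'Second'],
    'age': [30.0, 20.0, 40.0, 22.0, 50.0, 24.0],
})
toy_grouped = toy.groupby('class')

def z_score(x):
    return (x - x.mean()) / x.std()

age_mean = toy_grouped['age'].agg('mean')       # one row per group
age_z = toy_grouped['age'].transform(z_score)   # one row per original row

print(len(age_mean), len(age_z))
print(age_z.index.equals(toy.index))  # same row index as the original data

# Within each group, the transformed values have mean 0 and std 1
check = age_z.groupby(toy['class']).agg(['mean', 'std'])
print(check.round(6))
```

Because the result keeps the original index, it can be assigned straight back as a new column, for example `toy['age_z'] = age_z`.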

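Filtering Group Objects
: When a function containing a conditional expression is passed to the filter() method of a group object, only the groups for which the condition is true are kept.

Group object filtering: group_object.filter(condition_function)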

import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:,['age','sex','class','fare','survived']]

grouped = df.groupby(['class'])

# Keep only the groups with at least 200 rows and return them as a DataFrame
group_filter = grouped.filter(lambda x : len(x)>=200)
print(group_filter.head())

print('\n')

print(type(group_filter))

This time, let's keep only the groups whose mean 'age' is below 30.


age_filter = grouped.filter(lambda x : x.age.mean()<30)
print(age_filter.tail())
print('\n')

print(type(age_filter))

      age     sex   class    fare  survived
884  25.0    male   Third   7.050         0
885  39.0  female   Third  29.125         0
886  27.0    male  Second  13.000         0
888   NaN  female   Third  23.450         0
890  32.0    male   Third   7.750         0

❗ The only groups with a mean age below 30 are those whose 'class' value is 'Second' or 'Third', i.e. the second- and third-class passengers.
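The key point is that filter() returns the original rows of the groups that pass, not one row per group. A minimal sketch on a hypothetical `toy` DataFrame, with thresholds chosen only for illustration:

```python
import pandas as pd

# Hypothetical toy data (not the titanic dataset)
toy = pd.DataFrame({
    'class': ['First', 'First', 'Second', 'Second', 'Second', 'Third'],
    'age': [40.0, 50.0, 20.0, 30.0, 25.0, 22.0],
})
toy_grouped = toy.groupby('class')

# Keep only groups with at least 2 rows: 'Third' (1 row) is dropped
big_groups = toy_grouped.filter(lambda g: len(g) >= 2)

# Keep only groups whose mean age is below 30: 'First' (mean 45) is dropped
young_groups = toy_grouped.filter(lambda g: g['age'].mean() < 30)

print(big_groups)
print(young_groups)
```

The condition is evaluated once per group, so a row with age 30 in the 'Second' group is kept because the group mean (25) is below 30.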

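Mapping Functions onto Group Objects

General-purpose method: group_object.apply(mapping_function)

We apply the describe() method, which returns summary statistics, to the three groups split on the 'class' column.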

import pandas as pd
import seaborn as sns

titanic = sns.load_dataset('titanic')
df = titanic.loc[:,['age','sex','class','fare','survived']]

# Group by a single column name so that get_group() later accepts plain labels such as 'Second'
grouped = df.groupby('class')

agg_grouped = grouped.apply(lambda x : x.describe())

print(agg_grouped)

                     age        fare    survived
class
First  count  186.000000  216.000000  216.000000
       mean    38.233441   84.154687    0.629630
       std     14.802856   78.380373    0.484026
       min      0.920000    0.000000    0.000000
       25%     27.000000   30.923950    0.000000
       50%     37.000000   60.287500    1.000000
       75%     49.000000   93.500000    1.000000
       max     80.000000  512.329200    1.000000
Second count  173.000000  184.000000  184.000000
       mean    29.877630   20.662183    0.472826
       std     14.001077   13.417399    0.500623
       min      0.670000    0.000000    0.000000
       25%     23.000000   13.000000    0.000000
       50%     29.000000   14.250000    0.000000
       75%     36.000000   26.000000    1.000000
       max     70.000000   73.500000    1.000000
Third  count  355.000000  491.000000  491.000000
       mean    25.140620   13.675550    0.242363
       std     12.495398   11.778142    0.428949
       min      0.420000    0.000000    0.000000
       25%     18.000000    7.750000    0.000000
       50%     24.000000    8.050000    0.000000
       75%     32.000000   15.500000    0.000000
       max     74.000000   69.550000    1.000000

This time, let's use a user function that computes the z-score to convert the 'age' column to z-scores.


def z_score(x):
    return (x-x.mean())/x.std()

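# group_keys=False keeps the original row index; recent pandas versions otherwise add the group key as an extra index level
age_zscore = df.groupby('class', group_keys=False).age.apply(z_score)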
print(age_zscore.head())

0   -0.251342
1   -0.015770
2    0.068776
3   -0.218434
4    0.789041
Name: age, dtype: float64

This time, we determine which groups have a mean 'age' below 30, i.e. an average age under 30.


age_filter = grouped.apply(lambda x: x.age.mean() < 30)
print(age_filter)

# Print the first rows of each group for which the condition is True
for i in age_filter.index:
    if age_filter[i]:
        age_filter_df = grouped.get_group(i)
        print(age_filter_df.head())
        print('\n')

class
First     False
Second     True
Third      True
dtype: bool

     age     sex   class     fare  survived
9   14.0  female  Second  30.0708         1
15  55.0  female  Second  16.0000         1
17   NaN    male  Second  13.0000         1
20  35.0    male  Second  26.0000         0
21  34.0    male  Second  13.0000         1

    age     sex  class     fare  survived
0  22.0    male  Third   7.2500         0
2  26.0  female  Third   7.9250         1
4  35.0    male  Third   8.0500         0
5   NaN    male  Third   8.4583         0
7   2.0    male  Third  21.0750         0
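The boolean result of apply() can also be used to select rows without an explicit loop. The sketch below uses a hypothetical `toy` DataFrame and swaps the get_group() loop for isin(), which is an alternative technique and not the approach used above. It then checks that the selected rows match what filter() returns for the same condition.

```python
import pandas as pd

# Hypothetical toy data (not the titanic dataset)
toy = pd.DataFrame({
    'class': ['First', 'First', 'Second', 'Second', 'Third'],
    'age': [40.0, 50.0, 20.0, 30.0, 22.0],
})
toy_grouped = toy.groupby('class')

# apply() on the 'age' column returns one boolean per group
is_young = toy_grouped['age'].apply(lambda s: s.mean() < 30)
print(is_young)

# Names of the groups for which the condition holds
young_classes = is_young[is_young].index.tolist()
print(young_classes)

# Rows of those groups, selected with isin(), compared against filter()
via_apply = toy[toy['class'].isin(young_classes)]
via_filter = toy_grouped.filter(lambda g: g['age'].mean() < 30)
print(via_apply.equals(via_filter))
```

When only the matching rows are needed, filter() does this in one step; apply() is the more general tool when the per-group result itself (here, the True/False flag) is also of interest.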