Sorting, Handling Missing Data, and Transforming Data with lambda Expressions

CharliePark · September 7, 2020

TIL machine learning pandas


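Sorting, Aggregation Functions, and GroupBy

Sorting a DataFrame or Series - sort_values()

sort_values() plays a role similar to the ORDER BY clause in SQL.

Its main parameters are by, ascending, and inplace.

by : the column (or list of columns) to sort by
ascending : defaults to ascending=True; True sorts in ascending order, False in descending order.
inplace : defaults to inplace=False; False returns the sorted result as a new object, True applies the sort to the original object.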

import pandas as pd
titanic_df = pd.read_csv('titanic_train.csv')

titanic_sorted = titanic_df.sort_values(by=['Name'])
titanic_sorted.head(3)

output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
845	846	0	3	Abbing, Mr. Anthony	male	42.0	0	0	C.A. 5547	7.55	NaN	S
746	747	0	3	Abbott, Mr. Rossmore Edward	male	16.0	1	1	C.A. 2673	20.25	NaN	S
279	280	1	3	Abbott, Mrs. Stanton (Rosa Hunt)	female	35.0	1	1	C.A. 2673	20.25	NaN	S

To sort by multiple columns, pass a list of column names to by.

titanic_sorted = titanic_df.sort_values(by=['Pclass', 'Name'], ascending=False)
titanic_sorted.head(3)

output

	PassengerId	Pclass	Name	Sex	Age	Parch	Ticket	Fare	Cabin	Embarked
868	869	3	van Melkebeke, Mr. Philemon	male	NaN	0	345777	9.5	NaN	S
153	154	3	van Billiard, Mr. Austin Blyler	male	40.5	2	A/5. 851	14.5	NaN	S
282	283	3	de Pelsmaeker, Mr. Alfons	male	16.0	0	345778	9.5	NaN	S
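As a small sketch of the parameters described above: ascending also accepts a list with one flag per column in by, and inplace=True sorts the original object and returns None. The DataFrame here is hypothetical toy data, not the Titanic file.

```python
import pandas as pd

# Hypothetical toy data standing in for a few Titanic-like rows.
df = pd.DataFrame({
    'Pclass': [3, 1, 3, 2, 1],
    'Name': ['Braund', 'Cumings', 'Heikkinen', 'Nasser', 'Allen'],
})

# One flag per column: Pclass ascending, Name descending within each class.
sorted_df = df.sort_values(by=['Pclass', 'Name'], ascending=[True, False])
print(sorted_df)

# inplace=True sorts df itself and returns None.
result = df.sort_values(by='Name', inplace=True)
print(result)
print(df['Name'].tolist())
```

Because inplace=True returns None, assigning its result to a variable is a common mistake; either reassign the returned DataFrame or use inplace=True, not both.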

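Using Aggregation Functions

Applying aggregation functions such as min(), max(), sum(), and count() to a DataFrame is similar to using aggregate functions in SQL. One difference is that calling an aggregation on a DataFrame applies it to every column. Note that count() counts only non-NaN values, which is why Age, Cabin, and Embarked show fewer than 891 below.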

titanic_df.count()

output

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

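To apply an aggregation function to specific columns only, select those columns first and then call the function. Note that the example below passes axis=1, so it averages Age and Fare across each row and returns one value per passenger; omit axis=1 (the default is axis=0) to get one mean per column instead.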

titanic_df[['Age', 'Fare']].mean(axis=1)

output

0      14.62500
1      54.64165
2      16.96250
3      44.05000
4      21.52500
         ...   
886    20.00000
887    24.50000
888    23.45000
889    28.00000
890    19.87500
Length: 891, dtype: float64
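The axis argument decides the direction of the aggregation. A minimal sketch on hypothetical toy values (not the Titanic data) shows the difference between the default axis=0 and axis=1.

```python
import pandas as pd

# Hypothetical toy values chosen so the means are easy to check by hand.
df = pd.DataFrame({'Age': [20.0, 30.0, 40.0], 'Fare': [10.0, 20.0, 30.0]})

# Default axis=0: one mean per column.
col_means = df[['Age', 'Fare']].mean()
print(col_means)

# axis=1: one mean per row, computed across the selected columns.
row_means = df[['Age', 'Fare']].mean(axis=1)
print(row_means)
```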

Applying groupby()

Calling groupby() on a DataFrame returns a DataFrameGroupBy object. It is not itself a DataFrame; it holds the grouping information until an aggregation is applied.

titanic_groupby = titanic_df.groupby(by='Pclass')
print(type(titanic_groupby)) 
print(titanic_groupby)

output

<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000020495A77088>

Calling an aggregation function on the object returned by groupby() applies it to every column except the grouping column.

titanic_groupby = titanic_df.groupby('Pclass').count()
titanic_groupby

output

	PassengerId	Survived	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
Pclass
1	216	216	216	216	186	216	216	216	216	176	214
2	184	184	184	184	173	184	184	184	184	16	184
3	491	491	491	491	355	491	491	491	491	12	491

To aggregate only specific columns, select them from the DataFrameGroupBy object returned by groupby() and then call the aggregation function.

titanic_groupby = titanic_df.groupby(by='Pclass')[['PassengerId', 'Survived']].count()
titanic_groupby

output

	PassengerId	Survived
Pclass
1	216	216
2	184	184
3	491	491

To apply several different aggregation functions, pass them to agg() on the grouped object as a list of function names.

titanic_df.groupby('Pclass')['Age'].agg(['max', 'min'])

output

	max	min
Pclass
1	80.0	0.92
2	70.0	0.67
3	74.0	0.42

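Because groupby() works through method calls, it can be less flexible than SQL's GROUP BY in some cases.

For example, applying a different aggregation function to each of several columns takes a little more work.

To do this, pass agg() a dictionary that maps each column to its function.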

agg_format={'Age':'max', 'SibSp':'sum', 'Fare':'mean'}
titanic_df.groupby('Pclass').agg(agg_format)

output

	Age	SibSp	Fare
Pclass
1	80.0	90	84.154687
2	70.0	74	20.662183
3	74.0	302	13.675550
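A dictionary has one key per column, so it cannot name two results from the same column. As a hedged alternative, recent pandas versions (0.25 and later) support named aggregation, where each keyword is an output column and each value is a (column, function) pair. The data and the output names age_max, age_min, and fare_mean are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file.
df = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3],
    'Age': [38.0, 35.0, 14.0, 22.0, 26.0],
    'Fare': [70.0, 50.0, 30.0, 7.0, 9.0],
})

# Named aggregation: output_column=(input_column, function_name).
summary = df.groupby('Pclass').agg(
    age_max=('Age', 'max'),
    age_min=('Age', 'min'),
    fare_mean=('Fare', 'mean'),
)
print(summary)
```

This reads much like a SQL SELECT list with aliases, which may narrow the flexibility gap mentioned above.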

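Handling Missing Data

pandas provides an API for handling missing data.

By default, NaN values (missing data) are not accepted by many machine learning algorithms, so they need to be handled beforehand. pandas aggregations such as mean() and sum() also skip NaN values by default.

Use isna() to check whether a value is NaN, and fillna() to replace NaN with another value.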
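A minimal sketch of the behavior just described, on a hypothetical three-element Series: mean() and sum() skip NaN unless skipna=False is passed, and fillna() returns a new Series with the gaps replaced.

```python
import pandas as pd

# Hypothetical Series with one missing value (None becomes NaN).
s = pd.Series([1.0, None, 3.0])

print(s.isna().tolist())        # which entries are missing
print(s.mean())                 # NaN is skipped: (1 + 3) / 2
print(s.sum())                  # NaN is skipped: 1 + 3
print(s.mean(skipna=False))     # NaN propagates when skipping is turned off

# fillna() returns a new Series; s itself is unchanged.
filled = s.fillna(0.0)
print(filled.tolist())
```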

Checking for Missing Data with isna()

isna() returns a DataFrame of the same shape whose values are True where the original value is NaN and False otherwise.

titanic_df.isna().head(3)

output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	False	False	False	False	False	False	False	False	False	False	True	False
1	False	False	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	True	False

Chaining sum() onto this result gives the number of missing values in each column (internally, True is treated as 1 and False as 0).

titanic_df.isna().sum()

output

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Replacing Missing Data with fillna()

Replace the NaN values in the 'Cabin' column with 'C000'.

titanic_df['Cabin'] = titanic_df['Cabin'].fillna('C000')
titanic_df.head(3)

output

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	C000	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	C000	S

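Note that fillna() returns a new object by default, so the original data changes only if you assign the result back, as above, or pass inplace=True. Assigning the result back is the safer choice: in recent pandas versions, calling fillna(inplace=True) on a selected column such as titanic_df['Cabin'] may not update titanic_df.

Now replace the NaN values in 'Age' with the mean age and those in 'Embarked' with 'S', so that no missing data remains.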

titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].mean())
titanic_df['Embarked'] = titanic_df['Embarked'].fillna('S')
titanic_df.isna().sum()

output

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
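The column-by-column replacements above can also be written as a single call, since fillna() on a DataFrame accepts a dictionary mapping column names to fill values. This is a sketch on hypothetical toy data; the variable names fill_values and filled are illustrative.

```python
import pandas as pd

# Hypothetical toy data with gaps in three columns.
df = pd.DataFrame({
    'Age': [20.0, None, 40.0],
    'Cabin': ['C85', None, None],
    'Embarked': ['S', 'C', None],
})

# One fill value per column; the Age mean is computed with NaN skipped.
fill_values = {'Age': df['Age'].mean(), 'Cabin': 'C000', 'Embarked': 'S'}

# Returns a new DataFrame; df keeps its NaN values until reassigned.
filled = df.fillna(value=fill_values)
print(filled)
print(filled.isna().sum())
```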

Transforming Data with apply and lambda Expressions

pandas lets you combine apply() with a lambda expression to transform the data in a DataFrame or Series record by record.

First, a quick look at lambda expressions.

def get_square(a):
    return a**2

print('Square of 3:', get_square(3))

output

Square of 3: 9

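The same function can be written as a lambda expression.

lambda_square = lambda x : x ** 2
print('Square of 3:', lambda_square(3))

output

Square of 3: 9

In lambda x : x ** 2, the ':' separates the input argument (on the left) from the expression whose value is returned (on the right).

To apply a lambda expression to every element of a collection, it is commonly combined with map().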

a=[1,2,3]
squares = map(lambda x : x**2, a)
list(squares)

output

[1, 4, 9]
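As a small extension of the example above: map() also accepts several iterables, in which case the lambda takes one argument per iterable. A list comprehension is a common equivalent for the single-list case. The list b is a hypothetical second input added for illustration.

```python
a = [1, 2, 3]
b = [10, 20, 30]  # hypothetical second list

# Two iterables: the lambda receives one element from each per call.
sums = list(map(lambda x, y: x + y, a, b))
print(sums)

# Equivalent of map(lambda x: x**2, a) written as a list comprehension.
squares = [x ** 2 for x in a]
print(squares)
```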

Now use a lambda expression with apply() on a DataFrame column.

titanic_df['Name_len']= titanic_df['Name'].apply(lambda x : len(x))
titanic_df[['Name','Name_len']].head(3)

output

	Name	Name_len
0	Braund, Mr. Owen Harris	23
1	Cumings, Mrs. John Bradley (Florence Briggs Th...	51
2	Heikkinen, Miss. Laina	22
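For string lengths specifically, the vectorized .str.len() accessor gives the same result without a lambda. A minimal sketch on two names taken from the output above:

```python
import pandas as pd

names = pd.Series(['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina'])

# apply + lambda, as in the example above.
via_apply = names.apply(lambda x: len(x))

# Vectorized string accessor; no Python-level lambda needed.
via_str = names.str.len()

print(via_apply.tolist())
print(via_str.tolist())
```

One practical difference: .str.len() yields NaN for a missing name, while len() inside the lambda raises a TypeError on NaN.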

A slightly more complex case uses an if/else expression.

This example labels passengers aged 15 or younger as 'Child' and everyone else as 'Adult'.

titanic_df['Child_Adult'] = titanic_df['Age'].apply(lambda x : 'Child' if x <=15 else 'Adult' )
titanic_df[['Age','Child_Adult']].head(10)

output

	Age	Child_Adult
0	22.000000	Adult
1	38.000000	Adult
2	26.000000	Adult
3	35.000000	Adult
4	35.000000	Adult
5	29.699118	Adult
6	54.000000	Adult
7	2.000000	Child
8	27.000000	Adult
9	14.000000	Child

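Note that the value to return must come right after the ':', so the result is written before the if, in the form A if condition else B. The value for the else case follows the else keyword.

A lambda expression cannot contain elif, because it holds a single expression rather than statements. To chain conditions, nest another if/else inside the else branch, wrapped in parentheses.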

titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : 'Child' if x<=15 else ('Adult' if x <= 60 else 'Elderly'))
titanic_df['Age_cat'].value_counts()

output

Adult      786
Child       83
Elderly     22
Name: Age_cat, dtype: int64

Nesting else branches in parentheses over and over gets hard to read. In that case it is better to write a separate function.

def get_category(age):
    cat = ''
    if age <= 5: cat = 'Baby'
    elif age <= 12: cat = 'Child'
    elif age <= 18: cat = 'Teenager'
    elif age <= 25: cat = 'Student'
    elif age <= 35: cat = 'Young Adult'
    elif age <= 60: cat = 'Adult'
    else : cat = 'Elderly'
    
    return cat

titanic_df['Age_cat'] = titanic_df['Age'].apply(lambda x : get_category(x))
titanic_df[['Age','Age_cat']].head()

output

	Age	Age_cat
0	22.0	Student
1	38.0	Adult
2	26.0	Young Adult
3	35.0	Young Adult
4	35.0	Young Adult
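As an alternative to a hand-written function, the same age bands can be sketched with pd.cut(), which bins numeric values into labeled intervals. This is a different technique from the get_category() approach above, not a replacement the original post uses. Intervals are closed on the right by default, which matches the <= comparisons. The ages here are hypothetical sample values.

```python
import pandas as pd

# Hypothetical sample ages, one per category boundary of interest.
ages = pd.Series([2.0, 14.0, 22.0, 35.0, 38.0, 70.0])

# Bin edges mirror the thresholds in get_category(); intervals are (left, right].
bins = [float('-inf'), 5, 12, 18, 25, 35, 60, float('inf')]
labels = ['Baby', 'Child', 'Teenager', 'Student', 'Young Adult', 'Adult', 'Elderly']

age_cat = pd.cut(ages, bins=bins, labels=labels)
print(age_cat.tolist())
```

Unlike apply() with a Python function, pd.cut() returns a categorical Series and leaves NaN ages as NaN rather than assigning them a label.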
