Handling Missing Values

jeongwoo · April 2, 2022


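Handling missing values is mandatory before modeling.
Handle them while carrying out univariate analysis.
Why? If missing values are present, the boxplots used to check for outliers and the hypothesis tests used in bivariate analysis cannot be used as-is.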

Missing values can be checked with df.info() or df.isna().sum() (equivalently, df.isnull().sum()).

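There are two ways to handle missing values: filling or removing.
There are several ways to fill them.
Removal should be avoided where possible.
Why? Removing missing values during preprocessing means that, when missing values arrive in production as well, that data will be discarded.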

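★Q. What if a variable had no missing values at modeling time, so there is no handling code for it, but the data obtained in production does have missing values?
A. Missing values usually occur in variables that already had them. If this case does occur, treat it as an issue to be investigated.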
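
One way to notice that situation is to compare incoming data against the training data. The sketch below is only an illustration: find_unexpected_missing, train_df, and new_df are hypothetical names and toy data, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def find_unexpected_missing(train_df: pd.DataFrame, new_df: pd.DataFrame) -> list:
    """Return columns that had no NaN in train_df but contain NaN in new_df."""
    clean_in_train = train_df.columns[train_df.isna().sum() == 0]
    shared = [c for c in clean_in_train if c in new_df.columns]
    return [c for c in shared if new_df[c].isna().any()]


# Toy data: Age already had missing values at training time, Fare did not
train_df = pd.DataFrame({"Age": [22.0, np.nan, 30.0], "Fare": [7.25, 71.3, 8.05]})
new_df = pd.DataFrame({"Age": [np.nan, 41.0], "Fare": [np.nan, 13.0]})

unexpected = find_unexpected_missing(train_df, new_df)
print(unexpected)  # ['Fare']
```

Only Fare is reported, because Age was already known to contain missing values and presumably has handling code.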

Importing libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Loading the Titanic data

titanic = pd.read_csv('titanic_train.csv')

# Check
titanic.tail(2)

Data source: https://www.kaggle.com/c/titanic

Separating x and y

target = 'Survived'
x0 = titanic.drop(target, axis=1)
y0 = titanic[target]

Splitting into train, val, and test sets

# Import the required library
from sklearn.model_selection import train_test_split

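A float test_size is interpreted as a proportion; an integer is interpreted as an absolute number of samples.
random_state is the seed for the random number generator, which makes the split reproducible.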

# Split off the test set first, then split the remainder into train and validation sets
x, x_test, y, y_test = train_test_split(x0, y0, test_size=.1, random_state=2022)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=80, random_state=2022)
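
The difference between a float and an integer test_size is easiest to see on a toy array. This is a minimal sketch with made-up data (X and y below are not the Titanic data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(200).reshape(100, 2)  # 100 toy samples
y = np.arange(100)

# Float test_size: 10% of the 100 samples go to the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.1, random_state=2022
)

# Integer test_size: exactly 20 of the remaining 90 samples go to validation
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=20, random_state=2022
)

print(len(X_train), len(X_val), len(X_test))  # 70 20 10
```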

Checking missing values

df.info() or df.isna().sum()

titanic.info()

titanic.isna().sum()
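
On a small frame the output is easy to read. The sketch below uses a made-up DataFrame and also shows isna().mean(), which gives the share of missing values per column (an addition for illustration, not part of the original notebook):

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in two columns
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.3, 8.05, 13.0],
})

missing_count = df.isna().sum()   # number of missing values per column
missing_ratio = df.isna().mean()  # fraction of missing values per column

print(missing_count)
print(missing_ratio)
```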

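Visualizing missing values

To visualize missing values, import missingno as msno and matplotlib.pyplot as plt.

Checking as a matrix

Variables with missing values appear blank where the values are missing.
- In other words, the more white there is, the more missing values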

import missingno as msno
import matplotlib.pyplot as plt

msno.matrix(x_train)
plt.show()

Checking as a bar chart

msno.bar(x_train)
plt.show()

Filling missing values

SimpleImputer

cf) https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

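The strategy parameter accepts the following values:

1. most_frequent: the most frequent value (mode)
2. mean: the mean
3. median: the median
4. constant: a specified value (number or string)

When using constant, the fill_value=number / string parameter should also be given.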
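
The four strategies can be compared on one toy column. A minimal sketch with made-up values (the ages array and the fill value -1.0 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One column with a single missing value at row index 3
ages = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ["most_frequent", "mean", "median"]:
    imputer = SimpleImputer(strategy=strategy)
    # Keep only the value that replaced the NaN
    filled[strategy] = imputer.fit_transform(ages)[3, 0]

# constant needs an explicit fill_value to be meaningful
imputer = SimpleImputer(strategy="constant", fill_value=-1.0)
filled["constant"] = imputer.fit_transform(ages)[3, 0]

print(filled)
# most_frequent -> 2.0, mean -> 3.75, median -> 2.0, constant -> -1.0
```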

When using most_frequent
- Typically used to fill categorical variables (or discrete numeric ones)

from sklearn.impute import SimpleImputer
# Declare the target columns as a list
imputer1_list = ['Embarked']

# Instantiate, then fit_transform
imputer1 = SimpleImputer(strategy = 'most_frequent')
x_train[imputer1_list] = imputer1.fit_transform(x_train[imputer1_list])
x_train.isna().sum()
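
A fitted imputer can also be reused on other splits, so that validation and test data are filled with the value learned from the training data. This is a minimal sketch; train_df and val_df are hypothetical toy frames standing in for x_train and x_val.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

train_df = pd.DataFrame({"Embarked": ["S", "S", "C", np.nan]})
val_df = pd.DataFrame({"Embarked": [np.nan, "Q"]})

cols = ["Embarked"]
imputer = SimpleImputer(strategy="most_frequent")

# fit_transform learns the mode ('S') from the training data and fills it
train_df[cols] = imputer.fit_transform(train_df[cols])

# transform reuses the learned mode; nothing is re-learned from val_df
val_df[cols] = imputer.transform(val_df[cols])

print(val_df["Embarked"].tolist())  # ['S', 'Q']
```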

When using constant
- Add the fill_value=number / string parameter.

x_train[x_train['Embarked'].isna()]

Check after filling

from sklearn.impute import SimpleImputer
# Declare the target columns as a list
imputer1_list = ['Embarked']

# Instantiate, then fit_transform
imputer1 = SimpleImputer(strategy = 'constant', fill_value='S')
x_train[imputer1_list] = imputer1.fit_transform(x_train[imputer1_list])
x_train.isna().sum()

Check in the DataFrame

x_train[x_train['PassengerId'] == 62]

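When using constant without the fill_value parameter
The intended value is not filled in; scikit-learn's default placeholder is used instead (the string "missing_value" for string or object data, 0 for numeric data).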

Check after filling

from sklearn.impute import SimpleImputer
# Declare the target columns as a list
imputer1_list = ['Embarked']

# Instantiate, then fit_transform
imputer1 = SimpleImputer(strategy = 'constant')
x_train[imputer1_list] = imputer1.fit_transform(x_train[imputer1_list])
x_train.isna().sum()

Check in the DataFrame

x_train[x_train['PassengerId'] == 62]
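
What ends up in the cell when fill_value is omitted can be checked on toy data. A minimal sketch, with made-up frames (text_col and num_col are hypothetical names):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

text_col = pd.DataFrame({"Embarked": ["S", "C", np.nan]})
num_col = pd.DataFrame({"Age": [22.0, np.nan, 30.0]})

# strategy='constant' with no fill_value falls back to the library default
text_filled = SimpleImputer(strategy="constant").fit_transform(text_col)
num_filled = SimpleImputer(strategy="constant").fit_transform(num_col)

print(text_filled[2, 0])  # missing_value
print(num_filled[1, 0])   # 0.0
```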

Numeric variables

Filling with a specific value (such as the mean)

# Assign the result back; inplace=True on a selected column is unreliable in recent pandas
x_train['Age'] = x_train['Age'].fillna(x_train['Age'].mean())
x_train['Age'].isna().sum()  # 0

Filling with replace()

x_train['Age'] = x_train['Age'].replace(np.nan, x_train['Age'].mean())
x_train['Age'].isna().sum()  # 0

Filling with predicted values

There is no fixed way to do the prediction.
I have even seen someone fill missing values using a linear regression algorithm.
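
As a rough illustration of the linear regression idea, the sketch below predicts a missing Age from Fare. The data is made up so that Age = Fare + 10 exactly, and the choice of Fare as the only predictor is an assumption for the example, not a recommendation from the original post.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    "Fare": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Age": [20.0, 30.0, np.nan, 50.0, 60.0],
})

known = df[df["Age"].notna()]   # rows used to fit the model
unknown = df[df["Age"].isna()]  # rows whose Age will be predicted

model = LinearRegression().fit(known[["Fare"]], known["Age"])
df.loc[unknown.index, "Age"] = model.predict(unknown[["Fare"]])

print(df)  # the missing Age is filled with a value close to 40.0
```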

Using KNNImputer

cf) https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html?highlight=knnimputer#sklearn.impute.KNNImputer

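KNNImputer is mainly used when you have no idea how the values should be filled.
To use KNNImputer, the data must not contain strings.
That is, identifier-like variables must be removed and categorical variables must be dummy-encoded first.
The imputer fills missing values automatically using the KNN algorithm.
Because KNN is a distance-based algorithm, scaling must also be done beforehand.

KNNImputer is therefore covered later.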
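
Until then, the order of operations can be sketched on toy data. The frame below is made up and already dummy-encoded (Sex_male is a hypothetical dummy column); MinMaxScaler and n_neighbors=2 are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Numeric-only toy data: no strings, no identifier columns
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "Sex_male": [1, 0, 0, 0, 1],
})

# 1) Scale first; MinMaxScaler ignores NaN when fitting and keeps it as NaN
scaled = MinMaxScaler().fit_transform(df)

# 2) Fill each NaN with the mean of that column over the 2 nearest rows
imputed = KNNImputer(n_neighbors=2).fit_transform(scaled)

result = pd.DataFrame(imputed, columns=df.columns)
print(result)  # values are in scaled units (0 to 1)
```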

For time series

Loading time-series data

air = pd.read_csv('https://bit.ly/3qmthqZ')
air.head(7)

Data source: https://github.com/DA4BAM/dataset

Checking missing values

air.isna().sum()

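Filling with the previous value

df.fillna(method='ffill') or df.ffill() (recent pandas versions deprecate the method argument of fillna in favor of df.ffill())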

tmp = air.copy()
tmp['Solar.R'] = tmp['Solar.R'].fillna(method='ffill')
tmp.head(7)

or

tmp = air.copy()
tmp['Solar.R'] = tmp['Solar.R'].ffill()
tmp.head(7)

Filling with the next value

df.fillna(method='bfill'), df.fillna(method='backfill'), and df.bfill() all do the same thing.

tmp = air.copy()
tmp['Solar.R'] = tmp['Solar.R'].fillna(method='bfill')
tmp.head(7)

or

tmp = air.copy()
tmp['Solar.R'] = tmp['Solar.R'].fillna(method='backfill')
tmp.head(7)

or

tmp = air.copy()
tmp['Solar.R'] = tmp['Solar.R'].bfill()
tmp.head(7)
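
The difference between forward and backward filling shows up clearly on a short toy series (made-up values, not the air quality data):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0, np.nan])

forward = s.ffill()   # carries the last seen value forward
backward = s.bfill()  # pulls the next seen value backward

print(forward.tolist())   # [nan, 1.0, 1.0, 1.0, 4.0, 4.0]
print(backward.tolist())  # [1.0, 1.0, 4.0, 4.0, 4.0, nan]
```

Note that ffill leaves a leading NaN and bfill leaves a trailing NaN, because there is no earlier or later value to copy from.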

Filling by linear interpolation

df.interpolate(): fills the gap with evenly spaced values between the preceding and following values

tmp = air.copy()
tmp['Solar.R'] = tmp['Solar.R'].interpolate()
tmp.head(7)
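
"Evenly spaced" is easiest to see on a toy series with a gap of two values (made-up data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Default method is linear: the gap between 1.0 and 4.0 is split into equal steps
filled = s.interpolate()

print(filled.tolist())  # [1.0, 2.0, 3.0, 4.0]
```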

Categorical variables

For categorical variables, missing values are usually filled with the mode.

Filling with a specific value (the mode)

x_train['Cabin'] = x_train['Cabin'].fillna(x_train['Cabin'].mode()[0])
x_train['Cabin'].isna().sum()  # 0

cf) Series.mode() returns a Series, so Series.mode()[0] is needed to retrieve the mode itself without an error.

# Series.mode()
x_train['Cabin'].mode()
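
The reason mode() returns a Series is that there can be more than one most frequent value. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

s = pd.Series(["C85", "B96", "B96", np.nan])

modes = s.mode()  # a Series; NaN is ignored by default
top = s.mode()[0]  # the first (here, the only) mode as a plain value

# With a tie, every tied value is returned, in sorted order
tied = pd.Series(["a", "b", "a", "b"]).mode()

print(modes.tolist())  # ['B96']
print(top)             # B96
print(tied.tolist())   # ['a', 'b']
```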

Removing missing values

Dropping variables that fall below a threshold

Setting the % threshold

rate = 0.8  # drop a column if fewer than 80% of its values are present
int(len(x_train) * rate)  # minimum number of non-missing values a column must have

Drop if below the threshold

# Drop every column that falls below the threshold
filtered_x_train = x_train.dropna(thresh=int(len(x_train) * rate), axis=1)

Check with the missing-value visualization

msno.matrix(filtered_x_train)
plt.show()
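
The same column filter can be traced by hand on a toy frame. This is a minimal sketch with made-up data; min_non_missing is a name introduced here for readability.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],                           # 5 of 5 present
    "B": [1.0, np.nan, 3.0, 4.0, 5.0],              # 4 of 5 present
    "C": [np.nan, np.nan, np.nan, np.nan, 5.0],     # 1 of 5 present
})

rate = 0.8
min_non_missing = int(len(df) * rate)  # 4

# axis=1 applies the threshold to columns: keep those with >= 4 present values
filtered = df.dropna(thresh=min_non_missing, axis=1)

print(filtered.columns.tolist())  # ['A', 'B']
```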

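Removing rows or columns that contain missing values

When missing values are handled by filling, the fill can in some cases introduce substantial distortion.
In such cases, removing the data instead of filling it can preserve the integrity of the data.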

Listwise deletion

Deletes every row that contains a missing value

x_train.dropna(inplace=True)  # axis=0 (rows) is the default
x_train.isna().sum()

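Pairwise deletion

Deletes a row only when all of its variables are missing. Note that in the statistical literature, pairwise deletion usually refers to using, for each analysis, all cases that are complete for the variables involved; how='all' simply drops rows that are entirely empty.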

x_train.dropna(how='all', inplace=True)
x_train.isna().sum()
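
The difference between the default how='any' and how='all' can be seen on a three-row toy frame (made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Age": [22.0, np.nan, np.nan],
    "Cabin": ["C85", None, "E46"],
})

any_dropped = df.dropna()           # how='any' (default): drop rows with any NaN
all_dropped = df.dropna(how="all")  # drop only rows where every value is NaN

print(any_dropped.index.tolist())  # [0]
print(all_dropped.index.tolist())  # [0, 2]
```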

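Setting a thresh (threshold)

Explanations found online often describe the thresh value as "a row is dropped if it has at least this many missing values."
However, the value actually refers to the number of filled variables in the row (total number of variables - number of missing values in the row).
That is, thresh is the minimum number of non-missing values a row must have in order to be kept.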

Full missing-value check on x_train

x_train.isna().sum()

Number of rows where both Age and Cabin are missing

len(x_train[(x_train['Age'].isna()) & (x_train['Cabin'].isna())])

Among rows with a missing Age, the number of rows where Age is the only missing variable

len(x_train[x_train['Age'].isna()]) - len(x_train[(x_train['Age'].isna()) & (x_train['Cabin'].isna())])

Setting thresh to 10 (11 variables in total)
- Removes rows that do not have at least 10 filled values

x_train.dropna(thresh=10, inplace=True)
x_train.isna().sum()

If thresh is set larger than the total number of variables, you can see that every row is removed, not just the rows with missing values.

x_train.dropna(thresh=12, inplace=True)
x_train.isna().sum()

print(len(x_train))
x_train
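
Both thresh behaviors described above can be reproduced on a toy frame with three columns (made-up data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 1.0, np.nan],
    "b": [2.0, np.nan, np.nan],
    "c": [3.0, 3.0, 3.0],
})
# Non-missing values per row: 3, 2, 1

kept = df.dropna(thresh=2)       # keep rows with at least 2 non-missing values
none_left = df.dropna(thresh=4)  # 4 > 3 columns, so no row can qualify

print(kept.index.tolist())  # [0, 1]
print(len(none_left))       # 0
```

Row 1 has one missing value and is kept, which shows that thresh counts filled values rather than missing ones.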

Dropping by a specific column

Removes rows that have a missing value in the specified column

x_train.dropna(subset=['Age'], inplace=True)
x_train.isna().sum()
