Data Preprocessing (Missing Values, Dummy Encoding)

예린 · March 30, 2024

Machine Learning


1) Load libraries and data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the data
path = "titanic.csv"
titanic = pd.read_csv(path)

# Check the first rows
titanic.head()

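2) Remove unnecessary variables

Cabin is 77.1% NaN, too sparse to fill in reliably, so drop it
PassengerId, Name, and Ticket are unique (or nearly unique) identifiers, so drop them
axis=0 means rows, axis=1 means columns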

# Drop several columns at once
drop_cols = ['Cabin', 'PassengerId', 'Name', 'Ticket']
titanic.drop(drop_cols, axis=1, inplace=True)

# Verify
titanic.head()

# Keep a backup of this state so the exercises below can be repeated
titanic_bk = titanic.copy()
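
The axis convention above is easy to mix up, so here is a minimal sketch on a toy DataFrame. The column names a, b, c are made up for illustration and are not part of the Titanic data.

```python
import pandas as pd

# Toy frame; the column names are hypothetical
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

# axis=1 drops columns by label
no_b = df.drop(["b"], axis=1)

# axis=0 drops rows by index label
no_row0 = df.drop([0], axis=0)

# Keyword forms that do the same without remembering the axis number
no_b_kw = df.drop(columns=["b"])
no_row0_kw = df.drop(index=[0])

print(no_b.columns.tolist())
print(no_row0.index.tolist())
```

The columns= and index= keywords may read more clearly than axis=1 and axis=0, but both spellings behave the same.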

3) Handling missing values

Checking for NaN

# Count the NaN values in each variable
# sum() defaults to axis=0, so it adds down the rows and returns one total per column: sum(axis=0)
titanic.isna().sum()

# Show as a percentage (%)
titanic.isna().sum() / len(titanic) * 100
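
As a side note, the percentage can also be computed with mean(), since the mean of a boolean column is the fraction of True values. A small sketch on made-up data (the columns x and y are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with known missing counts: x has 2 of 4 missing, y has 1 of 4
df = pd.DataFrame({
    "x": [1.0, np.nan, 3.0, np.nan],
    "y": ["a", "b", None, "d"],
})

# Same calculation as in the post
pct_a = df.isna().sum() / len(df) * 100

# Shorter equivalent: mean of the boolean mask
pct_b = df.isna().mean() * 100

print(pct_b)
```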

Handling NaN

Dropping rows

Dropping every row that has a NaN

# Check before processing
titanic.isna().sum()

# Drop every row (axis=0) that contains a NaN
titanic.dropna(axis=0, inplace=True)

# Verify
titanic.isna().sum()

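Dropping only some rows

# Restore the backup so the NaN values are back
titanic = titanic_bk.copy()

# Check before processing
titanic.isna().sum()

# Drop only the rows where Age is NaN
titanic.dropna(subset=['Age'], axis=0, inplace=True)

# Verify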
titanic.isna().sum()
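
To see the difference between the two row-dropping approaches side by side, here is a small sketch on a made-up frame that borrows the Age and Embarked column names. The values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Row 1 is missing Age, row 2 is missing Embarked
df = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 41.0],
    "Embarked": ["S", "C", None, "Q"],
})

# Drops any row with a NaN in any column
all_dropped = df.dropna(axis=0)

# Drops only the rows where Age is NaN; the Embarked NaN survives
age_only = df.dropna(subset=["Age"])

print(all_dropped.index.tolist())
print(age_only.index.tolist())
```

With subset, rows that are missing other variables are kept, so those NaN values still need to be handled afterwards.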

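Dropping variables (columns)

# Restore the backup so the NaN values are back
titanic = titanic_bk.copy()

# Check before processing
titanic.isna().sum()

# Drop every variable (axis=1) that contains a NaN (best avoided: it discards whole columns)
titanic.dropna(axis=1, inplace=True)

# Verify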
titanic.isna().sum()

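Filling NaN

Filling with a specific value

When the rows or columns that contain NaN cannot be dropped, fill the NaN with a specific value instead

Mean

# Restore the backup so Age and Embarked are back with their NaN values
titanic = titanic_bk.copy()

# Compute the mean of Age
mean_age = titanic['Age'].mean()

# Fill NaN with the mean
# (assign the result back: inplace=True on a selected column triggers a chained-assignment warning in pandas 2.2 and has no effect under Copy-on-Write)
titanic['Age'] = titanic['Age'].fillna(mean_age)

# Verify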
titanic.isna().sum()

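Median

# Restore the backup so Age has its NaN values again
titanic = titanic_bk.copy()

# Compute the median of Age
median_age = titanic['Age'].median()

# Fill NaN with the median
titanic['Age'] = titanic['Age'].fillna(median_age)

# Verify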
titanic.isna().sum()
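
Which of the two to use depends on the data. As a rough sketch, the toy series below (made-up ages, not from the Titanic file) shows how one outlier pulls the mean but not the median:

```python
import numpy as np
import pandas as pd

# Made-up ages with one outlier (80) and one missing value
ages = pd.Series([20.0, 22.0, 24.0, 26.0, 80.0, np.nan])

# mean() and median() skip NaN by default
mean_age = ages.mean()      # (20 + 22 + 24 + 26 + 80) / 5
median_age = ages.median()  # middle of the five known values

filled_mean = ages.fillna(mean_age)
filled_median = ages.fillna(median_age)

print(mean_age, median_age)
```

When a variable is skewed or has outliers, the median is often the safer fill value, though this is a judgment call rather than a rule.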

Mode

# Find the mode of Embarked, option 1
titanic['Embarked'].value_counts(dropna=True).idxmax()

# Find the mode of Embarked, option 2
mode_embarked = titanic['Embarked'].mode()[0]

# Fill NaN with the most frequent value ('S' in this dataset)
# Calling mode()[0] inline inside fillna() also works, but a separate variable is easier to read
titanic['Embarked'] = titanic['Embarked'].fillna(mode_embarked)

# Verify
titanic.isna().sum()
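
The two ways of finding the mode can be checked against each other on a small example. The series below is made up and only reuses the Embarked labels:

```python
import pandas as pd

# Made-up labels: "S" appears three times, two values are missing
s = pd.Series(["S", "C", "S", None, "Q", "S", None])

# Option 1: most frequent label from the counts
via_counts = s.value_counts(dropna=True).idxmax()

# Option 2: mode() returns a Series, so take its first element
via_mode = s.mode()[0]

filled = s.fillna(via_mode)

print(via_counts, via_mode)
print(filled.tolist())
```

mode() returns a Series because several values can tie for most frequent. Taking [0] simply picks the first of them, so it is worth checking the counts when a tie is possible.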

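Filling with the previous or next value

Commonly used for time series data
ffill (.ffill(), formerly fillna(method='ffill')): fill with the value immediately before
bfill (.bfill(), formerly fillna(method='bfill')): fill with the value immediately after

# `air` is assumed to be an air quality DataFrame with Ozone and Solar.R columns; loading it is not shown in this post
# Fill NaN in Ozone with the previous value
# (fillna(method=...) is deprecated since pandas 2.1; ffill() and bfill() replace it)
air['Ozone'] = air['Ozone'].ffill()

# Fill NaN in Solar.R with the next value
air['Solar.R'] = air['Solar.R'].bfill()

# Verify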
air.isna().sum()
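
One thing to watch: a forward fill cannot fill a NaN at the very start, and a backward fill cannot fill one at the very end. A minimal sketch on a made-up series:

```python
import numpy as np
import pandas as pd

# Made-up series with NaN at the start, in the middle, and at the end
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0, np.nan])

forward = s.ffill()    # the leading NaN has no previous value, so it stays
backward = s.bfill()   # the trailing NaN has no next value, so it stays

# Chaining both is one way to cover the edges
both = s.ffill().bfill()

print(forward.tolist())
print(backward.tolist())
print(both.tolist())
```

This is why isna().sum() can still report a remaining NaN after a single ffill() or bfill().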

Filling by linear interpolation

# Fill NaN in Ozone by linear interpolation
air['Ozone'] = air['Ozone'].interpolate(method='linear')

# Fill NaN in Solar.R by linear interpolation
air['Solar.R'] = air['Solar.R'].interpolate(method='linear')

# Verify
air.isna().sum()
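
Linear interpolation places the missing values on a straight line between the known neighbors. A small sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# Made-up series: a leading NaN, then a gap of two between 10 and 40
s = pd.Series([np.nan, 10.0, np.nan, np.nan, 40.0])

linear = s.interpolate(method="linear")

# The gap is filled in equal steps: 10 -> 20 -> 30 -> 40
# The leading NaN has no earlier point to draw a line from, so it stays NaN
print(linear.tolist())
```

method='linear' treats the rows as equally spaced and ignores the index values, which is a reasonable assumption for evenly sampled data such as daily measurements.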

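4) Dummy encoding

Variables to convert into dummy variables:
Categorical data that is not stored as numbers (variables with the object dtype)
Categorical data stored as numbers other than just 0 and 1
Categorical data stored as int where the integer value has no numeric meaning

For a survey variable with values from 1 (very dissatisfied) to 5 (very satisfied), dummy encoding is unnecessary if each step can be treated as an equal increase

Set drop_first=True to avoid multicollinearity: if all k dummy columns are kept they always sum to 1, so one of them is redundant

When a variable has many distinct values, one-hot encoding can produce a very large number of columns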

# Inspect column info (to identify which variables to encode)
titanic.info()

# Variables to encode
dumm_cols = ['Pclass', 'Sex', 'Embarked']

# Dummy encoding
# Without dtype=int, pandas 2.0+ returns bool columns (True/False)
titanic = pd.get_dummies(titanic, columns=dumm_cols, drop_first=True, dtype=int)

# Verify
titanic.head()
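
To make the effect of drop_first=True concrete, here is a small sketch on a made-up three-row frame that reuses the Sex and Embarked column names:

```python
import pandas as pd

# Made-up rows, for illustration only
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

cols = ["Sex", "Embarked"]

# Keep every category: one column per distinct value
full = pd.get_dummies(df, columns=cols, dtype=int)

# Drop the first category (in sorted order) of each variable
reduced = pd.get_dummies(df, columns=cols, drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

The dropped category becomes the baseline: a row with Sex_male equal to 0 is female, and a row with both Embarked_Q and Embarked_S equal to 0 is C, so no information is lost.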
