Generating Null Values to Evaluate Imputation Performance

지리산근육곰 · December 21, 2021

Frequently Used Code in ML

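When the training set has no null values and only the testing set does, there is no direct way to tell which imputation method works best for the testing set.
In that case, generate null values in the training set, compare how each imputation method performs there, and then apply the best one to the testing set.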

Example: column_3 of the testing set contains 15% null values.

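# import libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# randomly pick 15% of the rows to receive null values (X and y are assumed to be defined already)
X_nonNull, X_null, y_nonNull, y_null = train_test_split(X, y, test_size=0.15, random_state=42)

# copy first to avoid a SettingWithCopyWarning, then null out column_3 in the selected rows
X_null = X_null.copy()
X_null['column_3'] = np.nan

# concatenate the two parts back together
X = pd.concat([X_nonNull, X_null])
# restore the original row order by index (y is unchanged, so it still lines up with X)
X = X.sort_index(ascending=True)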
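
Once the nulls are in place, the "compare performance" step could look roughly like the sketch below. This is only a minimal illustration and not part of the original snippet: the toy DataFrame `X_full`, the helper `imputation_rmse`, and the choice of mean, median, and KNN imputers are all assumptions made for the example. It keeps a copy of the true values, masks 15% of column_3, and scores each imputer by RMSE on the masked rows only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical toy data standing in for the real training features
rng = np.random.default_rng(42)
X_full = pd.DataFrame({
    "column_1": rng.normal(size=200),
    "column_2": rng.normal(size=200),
})
# column_3 is made to depend on column_1 so that KNN has a signal to use
X_full["column_3"] = 2.0 * X_full["column_1"] + rng.normal(scale=0.1, size=200)

# Mask 15% of column_3 and remember which rows were masked
masked_idx = X_full.sample(frac=0.15, random_state=42).index
X_masked = X_full.copy()
X_masked.loc[masked_idx, "column_3"] = np.nan

def imputation_rmse(imputer):
    """RMSE between imputed and true column_3 values on the masked rows."""
    imputed = pd.DataFrame(
        imputer.fit_transform(X_masked),
        columns=X_masked.columns,
        index=X_masked.index,
    )
    diff = imputed.loc[masked_idx, "column_3"] - X_full.loc[masked_idx, "column_3"]
    return float(np.sqrt((diff ** 2).mean()))

# Candidate imputers to compare (an illustrative selection)
candidates = {
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "knn": KNNImputer(n_neighbors=5),
}
scores = {name: imputation_rmse(imp) for name, imp in candidates.items()}
best = min(scores, key=scores.get)
print(scores)
print("lowest RMSE:", best)
```

Scoring the imputed values directly is just one option. If what matters is downstream accuracy, the same masked data can instead be fed to the model and the imputers compared by model score.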