Feature Engineering

림이 · April 24, 2024

Machine Learning / Deep Learning (series, part 2 of 5)

Feature Engineering (Pipeline)

import pandas as pd

df1 = pd.read_csv('01_Data.csv')
df1

>>> 	Index	Member_ID	Sales_Type	Contract_Type	Channel	Datetime	Term	Payment_Type	Product_Type	Amount_Month	Customer_Type	Age	Address1	Address2	State	Overdue_count	Overdue_Type	Gender	Credit_Rank	Bank
0	1	66758234	렌탈	일반계약	영업방판	2022-05-05	60	CMS	DES-1	96900	개인	42.0	경기도	경기도	계약확정	0	없음	여자	9.0	새마을금고
1	2	66755948	렌탈	교체계약	영업방판	2023-02-19	60	카드이체	DES-1	102900	개인	39.0	경기도	경기도	계약확정	0	없음	남자	2.0	현대카드

Imputation

Replace missing values with other values (the same idea as fillna)

Import the library for simple imputation

from sklearn.impute import SimpleImputer

Check the number of missing values in the 'Credit_Rank' column

df1['Credit_Rank'].isnull().sum()

>>> 8781

✔︎ Create the derived variable 'CR_clean': a column in which missing values are replaced with the mean

df1['CR_clean'] = SimpleImputer(strategy='mean').fit_transform(df1[['Credit_Rank']])

Compare the original and imputed columns to confirm that the missing values were filled

df1[['Credit_Rank','CR_clean']].tail(7)

✔︎ For string columns, fill missing values with the most frequent value (mode)

s1 = SimpleImputer(strategy = 'most_frequent').fit_transform(df1[['Bank']])
pd.DataFrame(s1).iloc[200:210]

>>> 	0
200	새마을금고
201	하나은행
202	농협중앙회
203	신한은행
204	신한은행
205	롯데카드
206	롯데카드
207	롯데카드
208	롯데카드
209	농협중앙회
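The two imputation strategies above can be checked by hand on a tiny example. This is a minimal sketch with hypothetical toy data (the values and the bank labels "NH" and "KB" are made up for illustration, not taken from 01_Data.csv).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data, not rows from 01_Data.csv
toy = pd.DataFrame({
    "Credit_Rank": [1.0, 3.0, np.nan, 8.0],
    "Bank": ["NH", np.nan, "NH", "KB"],
})

# Mean imputation: the NaN becomes (1 + 3 + 8) / 3 = 4.0
cr_clean = SimpleImputer(strategy="mean").fit_transform(toy[["Credit_Rank"]]).ravel()

# Mode imputation: the NaN becomes the most frequent string, "NH"
bank_clean = SimpleImputer(strategy="most_frequent").fit_transform(toy[["Bank"]]).ravel()

print(cr_clean)
print(bank_clean)
```

`fit_transform` returns a 2-D array, so `.ravel()` is used here to get a flat column that is easy to assign back to a DataFrame.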

Scaling & Encoding

Scaling : adjust numeric features that have different scales so they can be trained together
Encoding : convert string data into numeric form for training

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

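✔︎ StandardScaler : transforms the data to mean 0 and standard deviation 1

Used for traditional statistical methods such as linear regression, or for learning algorithms based on linear algebra
Apply it to data that has no outliers and follows a normal distribution well

Check the descriptive statistics first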

df1[['Amount_Month', 'Term', 'Age']].describe()

>>> 	Amount_Month	Term	Age
count	51301.000000	51301.000000	44329.000000
mean	93994.974289	55.639149	50.024093
std	15304.263988	12.009915	10.983877
min	54603.000000	12.000000	25.000000
25%	81900.000000	60.000000	42.000000
50%	96900.000000	60.000000	49.000000
75%	98400.000000	60.000000	57.000000
max	215700.000000	60.000000	102.000000

scale_df1 = StandardScaler().fit_transform(df1[['Amount_Month','Term','Age']])
pd.DataFrame(scale_df1, columns = ['Amount_Month','Term','Age']).describe()

>>> 	Amount_Month	Term	Age
count	5.130100e+04	5.130100e+04	4.432900e+04
mean	3.440456e-16	-1.883663e-17	-4.680423e-17
std	1.000010e+00	1.000010e+00	1.000011e+00
min	-2.573947e+00	-3.633629e+00	-2.278282e+00
25%	-7.903086e-01	3.631078e-01	-7.305420e-01
50%	1.898199e-01	3.631078e-01	-9.323703e-02
75%	2.878328e-01	3.631078e-01	6.351114e-01
max	7.952438e+00	3.631078e-01	4.732072e+00
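The std row above shows 1.00001 rather than exactly 1. A likely reason is that StandardScaler divides by the population standard deviation (ddof=0), while `describe()` reports the sample standard deviation (ddof=1). A minimal sketch on hypothetical ages:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ages, not the real Age column
x = np.array([[25.0], [42.0], [49.0], [57.0], [102.0]])

scaled = StandardScaler().fit_transform(x)

# Manual z-score: np.std uses ddof=0 by default, like StandardScaler
manual = (x - x.mean(axis=0)) / x.std(axis=0)
same = np.allclose(scaled, manual)

# The sample std (ddof=1) of the scaled data is sqrt(n / (n - 1)), slightly above 1
sample_std = scaled.std(ddof=1)
print(same, sample_std)
```

With n = 51301 rows the factor sqrt(n / (n - 1)) is about 1.00001, which matches the output above.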

✔︎ MinMaxScaler : transforms the data so that the minimum is 0 and the maximum is 1

Mainly used when unstructured or categorical data are trained together with the numeric features

scale_df1 = MinMaxScaler().fit_transform(df1[['Amount_Month','Term','Age']])
pd.DataFrame(scale_df1, columns = ['Amount_Month','Term','Age']).describe()

>>> 	Amount_Month	Term	Age
count	51301.000000	51301.000000	44329.000000
mean	0.244523	0.909149	0.324988
std	0.095000	0.250207	0.142648
min	0.000000	0.000000	0.000000
25%	0.169444	1.000000	0.220779
50%	0.262556	1.000000	0.311688
75%	0.271867	1.000000	0.415584
max	1.000000	1.000000	1.000000

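✔︎ RobustScaler : transforms the data to median 0 and IQR 1

Non-parametric (suited to data with outliers or data that does not follow a normal distribution)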

scale_df1 = RobustScaler().fit_transform(df1[['Amount_Month','Term','Age']])
pd.DataFrame(scale_df1, columns = ['Amount_Month','Term','Age']).describe()

>>> 	Amount_Month	Term	Age
count	51301.000000	51301.000000	44329.000000
mean	-0.176062	-4.360851	0.068273
std	0.927531	12.009915	0.732258
min	-2.563455	-48.000000	-1.600000
25%	-0.909091	0.000000	-0.466667
50%	0.000000	0.000000	0.000000
75%	0.090909	0.000000	0.533333
max	7.200000	0.000000	3.533333
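To see how the two scalers react to an outlier, and why Term keeps its original spread in the RobustScaler output above, here is a small sketch on hypothetical toy columns. The zero-IQR behaviour shown for the second column is what scikit-learn appears to do (it falls back to a scale of 1), which is consistent with the Term statistics above.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Hypothetical toy columns: one with an outlier, one that is almost constant
amount = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
term = np.array([[60.0], [60.0], [60.0], [60.0], [12.0]])

# MinMaxScaler: (x - min) / (max - min); the outlier squeezes the rest near 0
minmax_amount = MinMaxScaler().fit_transform(amount).ravel()

# RobustScaler: (x - median) / IQR; median 3, IQR 4 - 2 = 2
robust_amount = RobustScaler().fit_transform(amount).ravel()

# 25% and 75% are both 60, so the IQR is 0 and only the median is subtracted
robust_term = RobustScaler().fit_transform(term).ravel()

print(minmax_amount)
print(robust_amount)
print(robust_term)
```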

✔︎ Label Encoding : converts each string category into a specific integer
✔︎ One Hot Encoding : converts string categories into a table of 1/0 integer columns

pd.get_dummies(df1['Channel'])
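The difference between the two encodings is easiest to see side by side. A minimal sketch with hypothetical channel values ("A", "B", "C" are placeholders, not the real Channel categories):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categories, not the real Channel values
channel = pd.Series(["A", "B", "A", "C"], name="Channel")

# Label encoding: one integer per category (classes are sorted first)
labels = LabelEncoder().fit_transform(channel)

# One-hot encoding: one 0/1 column per category
onehot = pd.get_dummies(channel, dtype=int)

print(labels)
print(onehot)
```

Label encoding implies an order between categories (0 < 1 < 2), which may or may not be meaningful; one-hot encoding avoids that at the cost of more columns.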

[Feature Engineering + Training]

1. Data handling

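# '계약확정' means "contract confirmed"; every other state is labeled as cancelled
cond1 = (df1['State']=='계약확정')
df1.loc[cond1, 'Target'] = '정상'   # "normal"
df1.loc[~cond1, 'Target'] = '해약'  # "cancelled"
df1['Target'].value_counts()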
>>> Target
정상    50620
해약      681
Name: count, dtype: int64

2. Set the target variable and the explanatory variables

Y = df1['Target']
X = df1[['Product_Type','Amount_Month', 'Age', 'Gender', 'Credit_Rank', 'Term']]

Split into a training set and a test set

from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, random_state=1234)
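Because the target is heavily imbalanced (681 cancelled out of 51301), one optional variation is to pass `stratify` so both splits keep roughly the same class ratio. This is a sketch on hypothetical toy data, not the split used in this post:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 normal ('정상') and 10 cancelled ('해약') rows
X_toy = [[i] for i in range(100)]
y_toy = ["정상"] * 90 + ["해약"] * 10

# stratify=y_toy keeps the class ratio similar in the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1234
)

print(len(y_tr), len(y_te), y_te.count("해약"))
```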

Import libraries

# Feature engineering + training
from sklearn.pipeline import make_pipeline
# Run string columns and numeric columns through separate pipelines in parallel
from sklearn.compose import make_column_transformer
# (Feature engineering) 1. Missing-value handling
from sklearn.impute import SimpleImputer
# (Feature engineering) 2. Scaling / encoding
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# (Training) 3. Classification model
from sklearn.tree import DecisionTreeClassifier

Separate the numeric columns from the string columns

numeric_list = X.describe().columns
category_list = X.describe(include = 'object').columns
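An alternative way to build the same two lists is `select_dtypes`, which avoids computing summary statistics just to read the column names. A minimal sketch on a hypothetical three-column frame:

```python
import pandas as pd

# Hypothetical subset of the explanatory variables
X_toy = pd.DataFrame({
    "Age": [42.0, 39.0],
    "Gender": ["여자", "남자"],
    "Term": [60, 60],
})

# Numeric columns, and everything that is not numeric
numeric_cols = X_toy.select_dtypes(include="number").columns.tolist()
category_cols = X_toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, category_cols)
```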

3. Design the pipeline

Numeric data: impute missing values (median) -> scale

numeric_pipe = make_pipeline(SimpleImputer(strategy = 'median'),
                            StandardScaler())

String data: impute missing values (most frequent value) -> encode

category_pipe = make_pipeline(SimpleImputer(strategy='most_frequent'),
                              OneHotEncoder())

Build a parallel pipe in which numeric columns and string columns are processed separately

preprocess_pipe = make_column_transformer((numeric_pipe, numeric_list),
                        (category_pipe, category_list))

4. Build the training pipe

model_pipe = make_pipeline(preprocess_pipe, DecisionTreeClassifier())
model_pipe.fit(X_train, Y_train)
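The whole flow (impute, scale or encode, then train) can be tried end to end on a few rows. This is a minimal sketch with hypothetical toy data; `handle_unknown="ignore"` is an addition not used in this post, included because OneHotEncoder raises an error by default when it meets a category at prediction time that it did not see during fit.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy training data with one missing value per column
train = pd.DataFrame({
    "Age": [25.0, np.nan, 40.0, 60.0],
    "Gender": ["남자", "여자", np.nan, "여자"],
})
y = ["정상", "정상", "해약", "해약"]

numeric_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
category_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),  # assumption: tolerate unseen categories
)
preprocess = make_column_transformer(
    (numeric_pipe, ["Age"]),
    (category_pipe, ["Gender"]),
)
model = make_pipeline(preprocess, DecisionTreeClassifier(random_state=0))
model.fit(train, y)

# A new row whose Gender value never appeared in training
new_row = pd.DataFrame({"Age": [30.0], "Gender": ["기타"]})
pred = model.predict(new_row)
print(pred)
```

Because every preprocessing step lives inside the pipeline, the imputation values and scaling statistics are learned from the training rows only and then reused for new rows.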

Import the evaluation libraries

from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import classification_report

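Define the evaluation function

def evaluation_func1(model):
    # Predict on both splits so training fit and generalization can be compared
    Y_train_pred = model.predict(X_train)
    Y_test_pred = model.predict(X_test)
    print('학습능력')  # "training performance"
    print(classification_report(Y_train, Y_train_pred))
    print('일반화능력')  # "generalization performance"
    print(classification_report(Y_test, Y_test_pred))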

5. Evaluation

evaluation_func1(model_pipe)

>>> 학습능력
              precision    recall  f1-score   support

          정상       0.99      1.00      0.99     37956
          해약       0.93      0.20      0.33       519

    accuracy                           0.99     38475
   macro avg       0.96      0.60      0.66     38475
weighted avg       0.99      0.99      0.99     38475

일반화능력
              precision    recall  f1-score   support

          정상       0.99      1.00      0.99     12664
          해약       0.02      0.01      0.01       162

    accuracy                           0.98     12826
   macro avg       0.50      0.50      0.50     12826
weighted avg       0.98      0.98      0.98     12826
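In the report above, test accuracy is 0.98 while recall for the cancelled class ('해약') is only 0.01, because accuracy is dominated by the majority class. Per-class precision and recall can also be computed directly with `pos_label`. A minimal sketch on hypothetical labels:

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true and predicted labels
y_true = ["정상", "정상", "정상", "해약", "해약"]
y_pred = ["정상", "정상", "해약", "해약", "정상"]

# For '해약': TP = 1, FP = 1, FN = 1
precision = precision_score(y_true, y_pred, pos_label="해약")  # 1 / (1 + 1)
recall = recall_score(y_true, y_pred, pos_label="해약")        # 1 / (1 + 1)

print(precision, recall)
```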

6. Save the trained model with the pickle library

import pickle
pickle.dump(model_pipe, open('model_pipe.sav', 'wb'))
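A saved model is only useful if it can be loaded back. The sketch below does a full save and load round trip with a small stand-in model and a temporary directory (both are assumptions for illustration), using `with` so the file handles are closed.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Small stand-in model; in the post this would be model_pipe
model = DecisionTreeClassifier(random_state=0).fit([[0], [1]], ["정상", "해약"])

# Write to a temporary directory so the example leaves no file behind in the project
path = os.path.join(tempfile.mkdtemp(), "model_pipe.sav")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load the model back and predict with it
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[0], [1]]))
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.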

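7. Build a churn prediction system

Build a routine that takes customer information as input and reports whether the contract is predicted to be normal ('정상') or cancelled ('해약')

print(X['Product_Type'].unique())
# input() returns strings, so the numeric fields are cast to float
x1 = input('Enter the product type: ')
x2 = float(input('Enter the monthly rental fee: '))
x3 = float(input('Enter the customer age: '))
x4 = input('Enter the customer gender (남자/여자): ')
x5 = float(input('Enter the customer credit rank: '))
x6 = float(input('Enter the contract term: '))

input_data = pd.DataFrame(data=[[x1,x2,x3,x4,x5,x6]], columns=X.columns)

>>> ['DES-1' 'DES-3A' 'DES-2' 'DES-R4' 'MMC' 'ERA']
Enter the product type: DES-1
Enter the monthly rental fee: 100000
Enter the customer age: 29
Enter the customer gender (남자/여자): 남자
Enter the customer credit rank: 1
Enter the contract term: 12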

model_pipe.predict(input_data)

>>> array(['정상'], dtype=object)
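The interactive steps above can also be wrapped in a reusable function. This is a sketch only: `predict_status`, the reduced `FEATURES` list, and the toy model are hypothetical names and data introduced for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical reduced feature set (the post uses six columns)
FEATURES = ["Amount_Month", "Age", "Term"]

def predict_status(model, amount_month, age, term):
    """Build a one-row DataFrame from raw inputs and return the predicted label."""
    row = pd.DataFrame(
        [[float(amount_month), float(age), float(term)]], columns=FEATURES
    )
    return model.predict(row)[0]

# Toy model trained on two hypothetical customers
toy_X = pd.DataFrame([[90000, 30, 12], [100000, 50, 60]], columns=FEATURES)
toy_model = DecisionTreeClassifier(random_state=0).fit(toy_X, ["해약", "정상"])

# String inputs, as they would arrive from input()
result = predict_status(toy_model, "100000", "50", "60")
print(result)
```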

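✔︎ Imputation / Scaling / Encoding : feature engineering techniques for preparing the data properly. For the tree-based model used here they usually have little effect on predictive performance, although scale-sensitive models (for example distance-based ones) can depend on them

✔︎ Cross Validation / Hyperparameter Tuning / Imbalanced Data Sampling ... : these directly affect model performance

Import 'GridSearchCV'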

from sklearn.model_selection import GridSearchCV

CV : Cross-Validation

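✔︎ param_grid = {} : Hyperparameter Tuning (an empty dict searches nothing, so the pipeline is only cross-validated with its current settings)

Run 5-fold cross-validation on the data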

grid_model = GridSearchCV(model_pipe, param_grid = {}, cv=5)

grid_model.fit(X_train, Y_train)

Select the best-performing model

best_model = grid_model.best_estimator_
best_model
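With an empty `param_grid` there is only one candidate, so `best_estimator_` is simply the original pipeline refit on the training data. To actually tune, the grid needs entries named `<step name>__<parameter>`, where `make_pipeline` names each step after its lowercased class. A minimal sketch on hypothetical random data; the chosen parameters, their values, and the `f1_macro` scoring are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical random data standing in for X_train / Y_train
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# Keys follow the "<step>__<parameter>" convention of scikit-learn pipelines
param_grid = {
    "decisiontreeclassifier__max_depth": [2, 4, 8],
    "decisiontreeclassifier__class_weight": [None, "balanced"],
}

# 3 depths x 2 class weights = 6 candidates, each scored with 5-fold CV
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="f1_macro")
grid.fit(X_toy, y_toy)

print(grid.best_params_)
```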

Evaluate the model with the function

evaluation_func1(best_model)

>>> 학습능력
              precision    recall  f1-score   support

          정상       0.99      1.00      0.99     37956
          해약       0.93      0.20      0.33       519

    accuracy                           0.99     38475
   macro avg       0.96      0.60      0.66     38475
weighted avg       0.99      0.99      0.99     38475

일반화능력
              precision    recall  f1-score   support

          정상       0.99      1.00      0.99     12664
          해약       0.02      0.01      0.01       162

    accuracy                           0.98     12826
   macro avg       0.50      0.50      0.50     12826
weighted avg       0.98      0.98      0.98     12826

pipe1 : impute with the mean, then scale with StandardScaler
pipe2 : impute with the median, then scale with MinMaxScaler
pipe3 : impute with the most frequent value, then encode with OneHotEncoder

pipe1 = make_pipeline(SimpleImputer(strategy='mean'), StandardScaler())
pipe2 = make_pipeline(SimpleImputer(strategy='median'), MinMaxScaler())
pipe3 = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())

multi_pipe = make_column_transformer((pipe1, ['Amount_Month', 'Age']),
                                     (pipe2, ['Credit_Rank', 'Term']),
                                     (pipe3, ['Product_Type', 'Gender']))

Train a DecisionTreeClassifier

model_pipe2 = make_pipeline(multi_pipe, DecisionTreeClassifier())
grid_model = GridSearchCV(model_pipe2, param_grid={}, cv=3)
grid_model.fit(X_train, Y_train)