
Purity / Impurity
- The proportion in which different classes of values are mixed together in a node.
- The more a node is dominated by a single class, the higher its purity and the lower its impurity (see the Gini sketch below).
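One standard way to quantify impurity is the Gini index; a minimal sketch on toy labels (not the wine data) showing that a pure node scores 0 and an evenly mixed node scores 0.5:

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum(p_c^2) over the class proportions p_c.
    0 means a pure node; larger values mean a more mixed node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

print(gini(['red'] * 10))                 # 0.0 -> one class only: high purity
print(gini(['red'] * 5 + ['white'] * 5))  # 0.5 -> evenly mixed: high impurity
```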
```python
import pandas as pd

wine = pd.read_csv("data/wine.csv")
wine.shape
```
```
(6497, 13)
```
- DecisionTree-family models
    - Categorical features: Label Encoding. Continuous features: no feature scaling needed.
- Linear-family models (models that feed all features into a single computation when predicting)
    - Categorical features: One-Hot Encoding. Continuous features: apply feature scaling.

A sketch contrasting the two recipes follows below.
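A minimal sketch of the two preprocessing recipes on a toy DataFrame (`cat_col` and `num_col` are hypothetical column names, not from the wine data):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler

df_toy = pd.DataFrame({'cat_col': ['A', 'B', 'A'], 'num_col': [1.0, 10.0, 100.0]})

# Tree-family recipe: integer-code the categorical column, leave the scale alone.
tree_X = df_toy.copy()
tree_X['cat_col'] = OrdinalEncoder().fit_transform(tree_X[['cat_col']]).ravel()

# Linear-family recipe: one-hot the categorical column, scale the continuous one.
cat_ohe = OneHotEncoder(sparse_output=False).fit_transform(df_toy[['cat_col']])  # sklearn >= 1.2
num_scaled = StandardScaler().fit_transform(df_toy[['num_col']])
```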
```python
# X, y
X = wine.drop(columns='color').values
y = wine['color'].values
```

Train/test set split
```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

## Label-encode quality: convert the categorical variable to integers.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['A', 'B', 'C'])
X_train[:, -1] = le.transform(X_train[:, -1])  # label-encode the last column (quality)
X_test[:, -1] = le.transform(X_test[:, -1])
```

DecisionTreeClassifier: create, train, evaluate
```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

print("Depth:", tree.get_depth())
print("Number of leaf nodes:", tree.get_n_leaves())
```
```
Depth: 13
Number of leaf nodes: 55
```
```python
from metrics import print_binary_classification_metrics

print_binary_classification_metrics(
    y_train, tree.predict(X_train), tree.predict_proba(X_train)[:, 1],
    "Train set evaluation"
)
```
```
Train set evaluation
Accuracy: 0.9997947454844006
Recall: 1.0
Precision: 0.9991666666666666
F1 score: 0.9995831596498541
Average Precision: 0.9999986099527384
ROC-AUC Score: 0.9999997729299327
```
```python
print_binary_classification_metrics(
    y_test, tree.predict(X_test), tree.predict_proba(X_test)[:, 1],
    "Test set evaluation"
)
```
```
Test set evaluation
Accuracy: 0.9858461538461538
Recall: 0.965
Precision: 0.9772151898734177
F1 score: 0.9710691823899371
Average Precision: 0.9516280428432327
ROC-AUC Score: 0.9788265306122449
```

The near-perfect train scores against the noticeably lower test scores indicate the unconstrained tree overfits; the hyperparameter search further below addresses this.
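`print_binary_classification_metrics` comes from a local `metrics` module that is not shown in these notes; a minimal sketch of what it presumably wraps (assuming 0/1-encoded labels), built on standard sklearn functions:

```python
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

def print_binary_classification_metrics(y, pred, pos_proba, title=None):
    """Sketch of the local helper; the real module may format things differently."""
    if title:
        print(title)
    print("Accuracy:", accuracy_score(y, pred))
    print("Recall:", recall_score(y, pred))
    print("Precision:", precision_score(y, pred))
    print("F1 score:", f1_score(y, pred))
    print("Average Precision:", average_precision_score(y, pos_proba))
    print("ROC-AUC Score:", roc_auc_score(y, pos_proba))
```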
Visualizing the tree structure with Graphviz

```python
from sklearn.tree import export_graphviz
from graphviz import Source

graph = Source(
    export_graphviz(
        tree,
        feature_names=wine.columns[:-1],
        class_names=['White', 'Red'],
        filled=True,
        rounded=True
    )
)
graph
```
```python
export_graphviz(
    tree,
    feature_names=wine.columns[:-1],
    class_names=['White', 'Red'],
    filled=True,
    rounded=True,
    out_file="wine_tree_model.dot"  # save the DOT source to a file
)

# Convert the saved dot file to an image - CLI command.
!dot -Tpng wine_tree_model.dot -o wine_tree_model.png
```
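If the Graphviz binary isn't available, `sklearn.tree.plot_tree` draws a similar diagram with matplotlib alone; a sketch (depth capped for readability, since the full tree is 13 levels deep):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

fig, ax = plt.subplots(figsize=(20, 10))
plot_tree(
    tree,
    feature_names=list(wine.columns[:-1]),
    class_names=['White', 'Red'],
    filled=True, rounded=True,
    max_depth=3,  # draw only the top levels; the full tree has depth 13
    ax=ax
)
plt.show()
```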
### Inspecting feature importances after fit()

```python
fi = tree.feature_importances_
fi
```
```
array([0.00215273, 0.0170795 , 0.00306507, 0.00443252, 0.20950343,
       0.00077958, 0.68631811, 0.05093937, 0.01158388, 0.01314941,
       0.00099639, 0.        ])
```
```python
import pandas as pd

pd.Series(fi, index=wine.columns[:-1]).sort_values(ascending=False)
```
```
total sulfur dioxide    0.686318
chlorides               0.209503
density                 0.050939
volatile acidity        0.017080
sulphates               0.013149
pH                      0.011584
residual sugar          0.004433
citric acid             0.003065
fixed acidity           0.002153
alcohol                 0.000996
free sulfur dioxide     0.000780
quality                 0.000000
dtype: float64
```
```python
pd.Series(fi, index=wine.columns[:-1]).sort_values().plot(kind='barh');
```
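Impurity-based importances are computed from the train set and can overstate whichever features the tree happened to split on; `sklearn.inspection.permutation_importance` is a common cross-check that measures how much the score drops when a feature's values are shuffled. A sketch on the test set:

```python
from sklearn.inspection import permutation_importance

result = permutation_importance(tree, X_test, y_test, n_repeats=10, random_state=0)
pd.Series(result.importances_mean, index=wine.columns[:-1]) \
    .sort_values().plot(kind='barh');
```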
Create and fit a RandomizedSearchCV
```python
from sklearn.model_selection import RandomizedSearchCV

params = {
    "max_depth": range(1, 14),
    "max_leaf_nodes": range(10, 56),
    "min_samples_leaf": range(10, 1000, 50),
    "max_features": range(1, 13)
}
gs = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    params,
    cv=5,
    n_jobs=-1,
    n_iter=60
)
gs.fit(X_train, y_train)
```

Checking the results
```python
gs.best_params_
```
```
{'min_samples_leaf': 10,
 'max_leaf_nodes': 10,
 'max_features': 12,
 'max_depth': 9}
```
```python
gs.best_score_
```
```
0.9811178855367768
```
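Beyond `best_params_`, the fitted search's `cv_results_` holds the score of every sampled candidate; a quick way to inspect the top candidates (a sketch, using pandas as elsewhere in this notebook):

```python
import pandas as pd

cv_results = pd.DataFrame(gs.cv_results_)
cv_results[['params', 'mean_test_score', 'rank_test_score']] \
    .sort_values('rank_test_score').head()
```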
```python
best_model = gs.best_estimator_
fi = pd.Series(
    best_model.feature_importances_, index=wine.columns[:-1]
).sort_values(ascending=False)
fi
```
```
total sulfur dioxide    0.715779
chlorides               0.214657
density                 0.052025
volatile acidity        0.009644
residual sugar          0.004070
sulphates               0.003825
fixed acidity           0.000000
citric acid             0.000000
free sulfur dioxide     0.000000
pH                      0.000000
alcohol                 0.000000
quality                 0.000000
dtype: float64
```
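`best_score_` is a cross-validation score on the train set; as a sanity check, one would typically also evaluate the tuned model on the held-out test set. A sketch reusing the helpers already imported above:

```python
print_binary_classification_metrics(
    y_test,
    best_model.predict(X_test),
    best_model.predict_proba(X_test)[:, 1],
    "Tuned model - Test set evaluation"
)
```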
DecisionTreeRegressor

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/boston_hosing.csv")
X = df.drop(columns='MEDV')
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train.shape, X_test.shape
```
```
((379, 13), (127, 13))
```
```python
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from graphviz import Source
from metrics import print_regression_metrics

model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(X_train, y_train)

## Visualize the split structure
graph = Source(
    export_graphviz(
        model,
        feature_names=X.columns,
        filled=True,
        rounded=True
    )
)
graph
```
How to read a node in the rendered tree (here, the root node):

```
LSTAT <= 8.13           # the split question for this node (chosen so the split yields the smallest error)
squared_error = 85.308  # expected error (mean squared error) when predicting with this node's value
samples = 379           # number of samples in this node
value = 22.609          # this node's prediction (the mean of the y values it contains)
```
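The root-node numbers can be verified by hand: `value` is simply the mean of `y_train`, and `squared_error` is the MSE of always predicting that mean. A quick check (should print roughly 22.609 and 85.308):

```python
import numpy as np

print(y_train.mean())                            # ~22.609 -> the root node's value
print(np.mean((y_train - y_train.mean()) ** 2))  # ~85.308 -> the root node's squared_error
```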
```python
print_regression_metrics(y_train, model.predict(X_train))
```
```
MSE: 23.175292750947712
RMSE: 4.814072366608931
R Squared: 0.7283346372537175
```
```python
print_regression_metrics(y_test, model.predict(X_test))
```
```
MSE: 33.32551599016633
RMSE: 5.772825650421666
R Squared: 0.5920940318375818
```
Random Forest
- Random: trains each tree on a random sample of the train dataset.
- Forest: ensembles many (Decision) Tree models, as shown in the sketch below.
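A minimal from-scratch sketch of those two ideas on hypothetical toy data (not the wine set): every tree fits a bootstrap sample, and the forest majority-votes the trees' predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=200, random_state=0)
rng = np.random.default_rng(0)

# "Random": each tree trains on rows sampled with replacement (a bootstrap sample).
trees = []
for i in range(25):
    idx = rng.integers(0, len(X_toy), size=len(X_toy))
    trees.append(DecisionTreeClassifier(random_state=i).fit(X_toy[idx], y_toy[idx]))

# "Forest": majority vote over the individual trees' predictions.
votes = np.stack([t.predict(X_toy) for t in trees])
forest_pred = (votes.mean(axis=0) >= 0.5).astype(int)
```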

Train/test set split
```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv('data/wine.csv')
X = df.drop(columns=['color', 'quality'])
y = df['color']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
```

RandomForestClassifier: create, train, evaluate
```python
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(
    n_estimators=200,  # number of DecisionTrees (use at least 200)
    max_features=10,   # each tree randomly selects features from within this many
    max_depth=5,       # DecisionTree hyperparameter (every tree has the same shape)
    random_state=0,
    n_jobs=-1,         # train/predict the individual trees in parallel (each tree is independent)
)

rfc.fit(X_train, y_train)

# Predict classes
pred_train = rfc.predict(X_train)
pred_test = rfc.predict(X_test)

# Per-class probabilities
pred_train_proba = rfc.predict_proba(X_train)
pred_test_proba = rfc.predict_proba(X_test)
```
```python
from metrics import print_binary_classification_metrics

print_binary_classification_metrics(
    y_train, pred_train,
    pred_train_proba[:, 0],  # bug: column 0 is the negative-class probability, which inverts
                             # the Average Precision and ROC-AUC below; use [:, 1] as in the test call
    "Train set evaluation"
)
print_binary_classification_metrics(
    y_test, pred_test, pred_test_proba[:, 1], "Test set"
)
```
```
Train set evaluation
Accuracy: 0.9950738916256158
Recall: 0.9799833194328608
Precision: 1.0
F1 score: 0.9898904802021904
Average Precision: 0.15864233040170136
ROC-AUC Score: 0.001007509888333753
Test set
Accuracy: 0.9870769230769231
Recall: 0.9675
Precision: 0.979746835443038
F1 score: 0.9735849056603774
Average Precision: 0.9940479931929254
ROC-AUC Score: 0.9967744897959184
```

Feature importance
```python
fi = pd.Series(rfc.feature_importances_, index=X.columns).sort_values(ascending=False)
fi
```
```
total sulfur dioxide    0.503064
chlorides               0.382871
volatile acidity        0.036238
density                 0.026470
sulphates               0.016476
fixed acidity           0.011452
pH                      0.009814
residual sugar          0.008413
citric acid             0.002299
alcohol                 0.001950
free sulfur dioxide     0.000951
dtype: float64
```
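The bootstrap sampling also gives a free validation signal: each tree never sees the rows left out of its bootstrap sample, and passing `oob_score=True` scores the forest on those out-of-bag rows. A sketch with the same hyperparameters as above:

```python
rfc_oob = RandomForestClassifier(
    n_estimators=200, max_features=10, max_depth=5,
    oob_score=True, random_state=0, n_jobs=-1
)
rfc_oob.fit(X_train, y_train)
rfc_oob.oob_score_  # accuracy estimated on out-of-bag samples, without a separate validation set
```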