library(mlbench)
library(xgboost)  # provides xgb.DMatrix() and xgb.train()
data("PimaIndiansDiabetes2")
ds <- PimaIndiansDiabetes2
ds <- na.omit(ds)
set.seed(20)  # fix the seed so the split is reproducible
# sample 70% of the row indices for the training set
parts <- sample(1:nrow(ds), size = nrow(ds) * 0.7)
parts
train <- ds[parts, ]
test <- ds[-parts, ]
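# recode the factor label (neg/pos) as 0/1 for the binary:logistic objective
train.label <- as.integer(train$diabetes) - 1
# column 9 is the diabetes label, so drop it from the feature matrices
mat_train.data <- as.matrix(train[, -9])
mat_test.data <- as.matrix(test[, -9])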
xgb.train <- xgb.DMatrix(
data = mat_train.data,
label = train.label)
xgb.test <- xgb.DMatrix(
data = mat_test.data)
param_list <- list(
booster = "gbtree",
eta = 0.001,
max_depth = 10,
gamma = 5,
subsample = 0.8,
colsample_bytree = 0.8,
objective = "binary:logistic",
eval_metric = "auc")
md.xgb <- xgb.train(
params = param_list,
data = xgb.train,
nrounds = 200,
early_stopping_rounds = 10,
# the watchlist holds only the training set, so early stopping tracks training AUC
watchlist = list(val1 = xgb.train),
verbose = 1
)
xgb.pred <- predict(md.xgb, newdata = xgb.test)
Check the predicted values
> xgb.pred <- predict(md.xgb, newdata = xgb.test)
> xgb.pred
[1] 0.5013106 0.4982389 0.5019429 0.5004498 0.5012721 0.4969598 0.5024905
[8] 0.5019429 0.4969598 0.4996026 0.4969598 0.5004482 0.4969598 0.4978122
[15] 0.5015162 0.4969598 0.4969598 0.4982389 0.4969598 0.4969598 0.5013106
[22] 0.5021177 0.4969598 0.4969598 0.4988046 0.4969598 0.5015721 0.4982957
[29] 0.4978597 0.4982397 0.4995905 0.5007281 0.5004361 0.4969598 0.4969598
[36] 0.4986664 0.4974414 0.4969598 0.4975085 0.4978122 0.5009399 0.4969598
[43] 0.4969598 0.5005187 0.5019429 0.4995905 0.4999838 0.4969598 0.4999717
[50] 0.4971825 0.4969598 0.5004904 0.4971825 0.5007888 0.4971825 0.5007888
[57] 0.5021314 0.5011454 0.4978681 0.4969598 0.4969598 0.4971825 0.4974414
[64] 0.5021314 0.4978191 0.5025512 0.4969598 0.4969598 0.5004503 0.5007281
[71] 0.4978191 0.4978191 0.4974414 0.5003006 0.4996639 0.5013106 0.4969598
[78] 0.4969598 0.4985930 0.5008210 0.4978054 0.4969598 0.4978191 0.4969598
[85] 0.4978122 0.4970304 0.4969598 0.4969598 0.4969598 0.4969598 0.4978681
[92] 0.4969598 0.4969598 0.4999717 0.4969598 0.4995905 0.4986664 0.4969598
[99] 0.5025512 0.5007888 0.4978191 0.4969598 0.4995905 0.5015721 0.5004550
[106] 0.5009399 0.4969598 0.4978191 0.4978690 0.5021245 0.5021245 0.4973411
[113] 0.4982957 0.4969598 0.4969598 0.4978681 0.4970304 0.5015721
>
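The predictions are currently numeric probabilities.
To express them as categorical results, we convert them to a factor:
values of 0.5 or higher become positive ("pos") and values below 0.5 become negative ("neg").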
xgb.pred2 <- factor(ifelse(xgb.pred >= 0.5, 1, 0), levels = c(0, 1),
labels = c("neg", "pos")
)
Method 1 converts values of 0.5 or higher to 1 and values below 0.5 to 0, then uses the labels argument to map them to "neg" and "pos".
xgb.pred2 <- ifelse(
xgb.pred >= 0.5,
"pos",  # return the label directly; assigning inside ifelse() would overwrite xgb.pred
"neg"
)
xgb.pred2 <- as.factor(xgb.pred2)
Method 2 applies ifelse first, labelling values of 0.5 or higher as "pos" and values below 0.5 as "neg",
and then converts the result to a factor with as.factor.
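The two conversion methods can be compared side by side on a small vector. The following is a minimal sketch using made-up probabilities (the names `p`, `m1`, `m2`, and `p_low` are illustrative and not part of the original session). It also shows one case where the methods differ: if every prediction falls on the same side of the threshold, as.factor keeps only the level that actually occurs, while Method 1 always keeps both levels.

```r
# Hypothetical probabilities, not taken from the model above
p <- c(0.20, 0.70, 0.50, 0.49)

# Method 1: map to 0/1, then attach labels through factor()
m1 <- factor(ifelse(p >= 0.5, 1, 0), levels = c(0, 1), labels = c("neg", "pos"))

# Method 2: map to character labels, then convert with as.factor()
m2 <- as.factor(ifelse(p >= 0.5, "pos", "neg"))

# Same labels and same level order on this input
same_labels <- identical(as.character(m1), as.character(m2))
same_levels <- identical(levels(m1), levels(m2))
print(c(same_labels = same_labels, same_levels = same_levels))

# Edge case: every probability is below the threshold
p_low <- c(0.10, 0.20)
one_class_m1 <- factor(ifelse(p_low >= 0.5, 1, 0), levels = c(0, 1), labels = c("neg", "pos"))
one_class_m2 <- as.factor(ifelse(p_low >= 0.5, "pos", "neg"))
print(levels(one_class_m1))  # both levels are kept
print(levels(one_class_m2))  # only the observed level is kept
```

Fixing the levels explicitly, as Method 1 does, is the safer choice when the factor is later compared against reference labels that contain both classes.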
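The call that produced the confusion matrices below is not shown in the original. The output format resembles that of confusionMatrix() from the caret package, so that is assumed here. The sketch below uses base R's table() on toy factors (`pred` and `truth` are hypothetical stand-ins for xgb.pred2 and test$diabetes) to show the same rows-are-predictions, columns-are-reference layout.

```r
# Toy stand-ins for xgb.pred2 and test$diabetes (hypothetical values)
pred  <- factor(c("neg", "pos", "pos", "neg", "neg", "pos"), levels = c("neg", "pos"))
truth <- factor(c("neg", "pos", "neg", "neg", "pos", "pos"), levels = c("neg", "pos"))

# Cross-tabulate: rows are predictions, columns are reference labels
cm <- table(Prediction = pred, Reference = truth)
print(cm)

# Assumed equivalent in the article's session (requires the caret package):
# caret::confusionMatrix(data = xgb.pred2, reference = test$diabetes, positive = "pos")
```

Passing positive = "pos" matches the 'Positive' Class line in the output; without it, caret would treat the first factor level ("neg") as the positive class.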
## Method 1
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 66 16
pos 15 21
Accuracy : 0.7373
95% CI : (0.6483, 0.814)
No Information Rate : 0.6864
P-Value [Acc > NIR] : 0.1369
Kappa : 0.3852
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.5676
Specificity : 0.8148
Pos Pred Value : 0.5833
Neg Pred Value : 0.8049
Prevalence : 0.3136
Detection Rate : 0.1780
Detection Prevalence : 0.3051
Balanced Accuracy : 0.6912
'Positive' Class : pos
## Method 2
Confusion Matrix and Statistics
Reference
Prediction neg pos
neg 66 16
pos 15 21
Accuracy : 0.7373
95% CI : (0.6483, 0.814)
No Information Rate : 0.6864
P-Value [Acc > NIR] : 0.1369
Kappa : 0.3852
Mcnemar's Test P-Value : 1.0000
Sensitivity : 0.5676
Specificity : 0.8148
Pos Pred Value : 0.5833
Neg Pred Value : 0.8049
Prevalence : 0.3136
Detection Rate : 0.1780
Detection Prevalence : 0.3051
Balanced Accuracy : 0.6912
'Positive' Class : pos
The two factor-conversion methods differ slightly, but the results are identical.
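The reported statistics can be reproduced by hand from the four cell counts of the confusion matrix. The sketch below copies the counts from the output above and applies the standard definitions, with "pos" as the positive class. The variable names are illustrative.

```r
# Cell counts from the confusion matrix (rows = prediction, columns = reference)
tn <- 66  # predicted neg, reference neg
fn <- 16  # predicted neg, reference pos
fp <- 15  # predicted pos, reference neg
tp <- 21  # predicted pos, reference pos
n  <- tn + fn + fp + tp

accuracy          <- (tp + tn) / n
sensitivity       <- tp / (tp + fn)   # share of actual positives that were detected
specificity       <- tn / (tn + fp)   # share of actual negatives that were detected
pos_pred_value    <- tp / (tp + fp)
neg_pred_value    <- tn / (tn + fn)
prevalence        <- (tp + fn) / n
no_info_rate      <- max(tp + fn, tn + fp) / n  # accuracy of always guessing the majority class
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(
  accuracy = accuracy,
  sensitivity = sensitivity,
  specificity = specificity,
  pos_pred_value = pos_pred_value,
  neg_pred_value = neg_pred_value,
  prevalence = prevalence,
  no_info_rate = no_info_rate,
  balanced_accuracy = balanced_accuracy
), 4))
```

The accuracy of 0.7373 is only modestly above the no-information rate of 0.6864, which is consistent with the P-Value [Acc > NIR] of 0.1369 in the output.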