Proportional Odds Model in R

지리산근육곰·2021년 9월 1일
0
post-thumbnail

1. Setting

1.1 데이터 불러오기

train = read.csv("newTrain.csv")

1.2 데이터 Structure 확인 후 변환

str(train)
str(train)
## 'data.frame':    22724 obs. of  22 variables:
##  $ X              : int  1 2 3 4 5 6 7 9 10 11 ...
##  $ child_num      : int  1 0 0 0 2 0 0 0 0 1 ...
##  $ income_total   : num  247500 450000 202500 157500 270000 ...
##  $ income_type    : chr  "Commercial associate" "Working" "Commercial associate" "State servant" ...
##  $ family_type    : chr  "Civil marriage" "Married" "Married" "Married" ...
##  $ house_type     : chr  "House / apartment" "House / apartment" "House / apartment" "House / apartment" ...
##  $ DAYS_BIRTH     : int  -11380 -19087 -15088 -15037 -13413 -17570 -14896 -15785 -19063 -11759 ...
##  $ DAYS_EMPLOYED  : int  -1540 -4434 -2092 -2105 -4996 -1978 -5420 -1308 -2213 -91 ...
##  $ occyp_type     : chr  "Laborers" "Managers" "Sales staff" "Managers" ...
##  $ family_size    : num  3 2 2 2 4 1 2 2 1 3 ...
##  $ begin_month    : num  -5 -22 -37 -26 -18 -41 -53 -5 -40 -51 ...
##  $ credit         : num  1 2 0 2 1 2 0 2 2 2 ...
##  $ income_quartile: num  4 4 3 3 4 4 4 1 3 2 ...
##  $ income_quintile: num  5 5 4 3 5 5 5 1 4 2 ...
##  $ income_decile  : num  9 10 7 6 9 10 9 2 7 4 ...
##  $ age            : int  32 53 42 42 37 49 41 44 53 33 ...
##  $ age_group      : int  1 3 2 2 1 2 2 2 3 1 ...
##  $ used_years     : int  0 1 3 2 1 3 4 0 3 4 ...
##  $ worked_year    : int  4 12 5 5 13 5 14 3 6 0 ...
##  $ women          : num  1 0 1 1 1 1 0 1 1 0 ...
##  $ yesCar         : num  0 1 0 1 0 0 0 0 0 1 ...
##  $ yesReality     : num  1 1 1 1 1 0 1 1 1 1 ...
# training set
## X variables
train$income_type = as.factor(train$income_type)
train$family_type = as.factor(train$family_type)
train$house_type = as.factor(train$house_type)
train$occyp_type = as.factor(train$occyp_type)
## y variable
train$credit = as.factor(train$credit)
summary(train)
##        X           child_num       income_total    
##  Min.   :    1   Min.   :0.0000   Min.   :  27000  
##  1st Qu.: 6656   1st Qu.:0.0000   1st Qu.: 121500  
##  Median :13268   Median :0.0000   Median : 157500  
##  Mean   :13242   Mean   :0.4174   Mean   : 186461  
##  3rd Qu.:19831   3rd Qu.:1.0000   3rd Qu.: 225000  
##  Max.   :26456   Max.   :4.0000   Max.   :1575000  
##                                                    
##                income_type                  family_type   
##  Commercial associate: 5176   Civil marriage      : 1824  
##  Pensioner           : 4447   Married             :15544  
##  State servant       : 1762   Separated           : 1333  
##  Student             :    6   Single / not married: 3010  
##  Working             :11333   Widow               : 1013  
##                                                           
##                                                           
##                house_type      DAYS_BIRTH     DAYS_EMPLOYED   
##  Co-op apartment    :  106   Min.   :-25152   Min.   :-15713  
##  House / apartment  :20370   1st Qu.:-19777   1st Qu.: -3052  
##  Municipal apartment:  722   Median :-15798   Median : -1435  
##  Office apartment   :  161   Mean   :-16142   Mean   : 69229  
##  Rented apartment   :  347   3rd Qu.:-12557   3rd Qu.:  -307  
##  With parents       : 1018   Max.   : -7705   Max.   :365243  
##                                                               
##        occyp_type    family_size      begin_month  credit    income_quartile
##  Laborers   :4512   Min.   : 1.000   Min.   :-60   0: 2791   Min.   :1.000  
##  Unempolyed :4438   1st Qu.: 2.000   1st Qu.:-39   1: 5413   1st Qu.:2.000  
##  Core staff :2646   Median : 2.000   Median :-24   2:14520   Median :3.000  
##  Sales staff:2539   Mean   : 2.185   Mean   :-26             Mean   :2.629  
##  Managers   :2167   3rd Qu.: 3.000   3rd Qu.:-12             3rd Qu.:4.000  
##  Drivers    :1575   Max.   :20.000   Max.   :  0             Max.   :4.000  
##  (Other)    :4847                                                           
##  income_quintile income_decile         age          age_group    
##  Min.   :1.000   Min.   : 1.000   Min.   :22.00   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.: 3.000   1st Qu.:35.00   1st Qu.:1.000  
##  Median :3.000   Median : 6.000   Median :44.00   Median :2.000  
##  Mean   :3.122   Mean   : 5.753   Mean   :44.72   Mean   :2.015  
##  3rd Qu.:4.000   3rd Qu.: 8.000   3rd Qu.:55.00   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :10.000   Max.   :69.00   Max.   :4.000  
##                                                                  
##    used_years     worked_year         women            yesCar      
##  Min.   :0.000   Min.   :-1.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.: 0.000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :2.000   Median : 3.000   Median :1.0000   Median :0.0000  
##  Mean   :1.719   Mean   : 5.165   Mean   :0.6658   Mean   :0.3781  
##  3rd Qu.:3.000   3rd Qu.: 8.000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :5.000   Max.   :43.000   Max.   :1.0000   Max.   :1.0000  
##                                                                    
##    yesReality    
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :1.0000  
##  Mean   :0.6805  
##  3rd Qu.:1.0000  
##  Max.   :1.0000  
## 

2. Proportional Odds Model

2.1 Library 불러오기

library(MASS)
# Credit order 지정
train$credit = factor(train$credit, ordered=TRUE, 
                      levels=c("2", "1", "0"))

2.2 모델 생성 및 비교

mod1 = polr(credit~yesCar+yesReality+income_type+family_type
                  +occyp_type+income_quartile+age_group+worked_year
                  +begin_month, 
                   data=train, Hess=TRUE)
summary(mod1)
## Call:
## polr(formula = credit ~ yesCar + yesReality + income_type + family_type + 
##     occyp_type + income_quartile + age_group + worked_year + 
##     begin_month, data = train, Hess = TRUE)
## 
## Coefficients:
##                                      Value Std. Error    t value
## yesCar                          -3.728e-02  0.0303715   -1.22741
## yesReality                       5.190e-02  0.0298230    1.74029
## income_typePensioner            -1.183e+01  0.0438716 -269.63387
## income_typeState servant         5.072e-02  0.0603868    0.83999
## income_typeStudent               4.972e-01  0.7551641    0.65838
## income_typeWorking               8.469e-02  0.0355889    2.37967
## family_typeMarried              -4.049e-02  0.0509414   -0.79479
## family_typeSeparated            -2.546e-02  0.0747911   -0.34044
## family_typeSingle / not married -2.548e-02  0.0604889   -0.42123
## family_typeWidow                 6.955e-02  0.0816933    0.85130
## occyp_typeCleaning staff        -3.116e-01  0.1260286   -2.47225
## occyp_typeCooking staff         -1.461e-01  0.1179518   -1.23861
## occyp_typeCore staff            -1.006e-01  0.0790919   -1.27220
## occyp_typeDrivers               -1.806e-01  0.0863227   -2.09223
## occyp_typeHigh skill tech staff -3.776e-02  0.0923882   -0.40876
## occyp_typeHR staff              -1.055e+00  0.3442720   -3.06339
## occyp_typeIT staff               2.315e-01  0.3143037    0.73663
## occyp_typeLaborers              -1.069e-01  0.0743313   -1.43861
## occyp_typeLow-skill Laborers    -1.499e-01  0.2005121   -0.74764
## occyp_typeManagers              -1.486e-01  0.0816779   -1.81883
## occyp_typeMedicine staff        -2.739e-01  0.1006894   -2.72027
## occyp_typePrivate service staff -9.384e-02  0.1478357   -0.63479
## occyp_typeRealty agents         -3.712e-01  0.2851487   -1.30190
## occyp_typeSales staff           -1.623e-01  0.0790219   -2.05440
## occyp_typeSecretaries           -2.901e-02  0.2110826   -0.13742
## occyp_typeSecurity staff        -4.119e-01  0.1259874   -3.26941
## occyp_typeUnempolyed             1.179e+01  0.0438716  268.78506
## occyp_typeWaiters/barmen staff   6.335e-02  0.1817119    0.34861
## income_quartile                 -6.892e-04  0.0128734   -0.05354
## age_group                       -4.745e-02  0.0157109   -3.02049
## worked_year                     -4.582e-03  0.0026605   -1.72207
## begin_month                      2.144e-02  0.0008544   25.09679
## 
## Intercepts:
##     Value     Std. Error t value  
## 2|1   -0.1572    0.0977    -1.6092
## 1|0    1.2729    0.0984    12.9401
## 
## Residual Deviance: 39477.10 
## AIC: 39545.10
  • 시도해본 모델 중 위의 모델의 AIC가 가장 낮게 나왔다.

2.2.1 Model Variables' p-values

ctable = coef(summary(mod1))
p = pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)*2
round(cbind(ctable, "p value"=p), 4)
##                                    Value Std. Error   t value p value
## yesCar                           -0.0373     0.0304   -1.2274  0.2197
## yesReality                        0.0519     0.0298    1.7403  0.0818
## income_typePensioner            -11.8293     0.0439 -269.6339  0.0000
## income_typeState servant          0.0507     0.0604    0.8400  0.4009
## income_typeStudent                0.4972     0.7552    0.6584  0.5103
## income_typeWorking                0.0847     0.0356    2.3797  0.0173
## family_typeMarried               -0.0405     0.0509   -0.7948  0.4267
## family_typeSeparated             -0.0255     0.0748   -0.3404  0.7335
## family_typeSingle / not married  -0.0255     0.0605   -0.4212  0.6736
## family_typeWidow                  0.0695     0.0817    0.8513  0.3946
## occyp_typeCleaning staff         -0.3116     0.1260   -2.4723  0.0134
## occyp_typeCooking staff          -0.1461     0.1180   -1.2386  0.2155
## occyp_typeCore staff             -0.1006     0.0791   -1.2722  0.2033
## occyp_typeDrivers                -0.1806     0.0863   -2.0922  0.0364
## occyp_typeHigh skill tech staff  -0.0378     0.0924   -0.4088  0.6827
## occyp_typeHR staff               -1.0546     0.3443   -3.0634  0.0022
## occyp_typeIT staff                0.2315     0.3143    0.7366  0.4613
## occyp_typeLaborers               -0.1069     0.0743   -1.4386  0.1503
## occyp_typeLow-skill Laborers     -0.1499     0.2005   -0.7476  0.4547
## occyp_typeManagers               -0.1486     0.0817   -1.8188  0.0689
## occyp_typeMedicine staff         -0.2739     0.1007   -2.7203  0.0065
## occyp_typePrivate service staff  -0.0938     0.1478   -0.6348  0.5256
## occyp_typeRealty agents          -0.3712     0.2851   -1.3019  0.1930
## occyp_typeSales staff            -0.1623     0.0790   -2.0544  0.0399
## occyp_typeSecretaries            -0.0290     0.2111   -0.1374  0.8907
## occyp_typeSecurity staff         -0.4119     0.1260   -3.2694  0.0011
## occyp_typeUnempolyed             11.7920     0.0439  268.7851  0.0000
## occyp_typeWaiters/barmen staff    0.0633     0.1817    0.3486  0.7274
## income_quartile                  -0.0007     0.0129   -0.0535  0.9573
## age_group                        -0.0475     0.0157   -3.0205  0.0025
## worked_year                      -0.0046     0.0027   -1.7221  0.0851
## begin_month                       0.0214     0.0009   25.0968  0.0000
## 2|1                              -0.1572     0.0977   -1.6092  0.1076
## 1|0                               1.2729     0.0984   12.9401  0.0000
  • 현재 p-values를 고려 했을 때 'yesCar', 'yesReality', 'family_type', 'income_quartile', 'worked_year'의 경우 drop하는 것이 좋아 보인다.

2.3 Best Model's p-values

mod2 = polr(credit~income_type+occyp_type+age_group+begin_month, 
                   data=train, Hess=TRUE)
summary(mod2)
## Call:
## polr(formula = credit ~ income_type + occyp_type + age_group + 
##     begin_month, data = train, Hess = TRUE)
## 
## Coefficients:
##                                      Value Std. Error    t value
## income_typePensioner            -11.832886  0.0410348 -288.36236
## income_typeState servant          0.038892  0.0599544    0.64870
## income_typeStudent                0.530268  0.7521999    0.70496
## income_typeWorking                0.079266  0.0352396    2.24933
## occyp_typeCleaning staff         -0.281468  0.1254583   -2.24352
## occyp_typeCooking staff          -0.131518  0.1174376   -1.11990
## occyp_typeCore staff             -0.102439  0.0790278   -1.29624
## occyp_typeDrivers                -0.190993  0.0855666   -2.23210
## occyp_typeHigh skill tech staff  -0.044318  0.0922492   -0.48041
## occyp_typeHR staff               -1.039204  0.3437040   -3.02354
## occyp_typeIT staff                0.228233  0.3140885    0.72665
## occyp_typeLaborers               -0.109843  0.0742864   -1.47864
## occyp_typeLow-skill Laborers     -0.133683  0.1997971   -0.66909
## occyp_typeManagers               -0.157925  0.0811011   -1.94726
## occyp_typeMedicine staff         -0.285005  0.1002296   -2.84352
## occyp_typePrivate service staff  -0.086903  0.1477218   -0.58829
## occyp_typeRealty agents          -0.367264  0.2846345   -1.29030
## occyp_typeSales staff            -0.150683  0.0788143   -1.91187
## occyp_typeSecretaries            -0.005761  0.2107002   -0.02734
## occyp_typeSecurity staff         -0.396938  0.1256130   -3.16001
## occyp_typeUnempolyed             11.857727  0.0410347  288.96814
## occyp_typeWaiters/barmen staff    0.094600  0.1811099    0.52234
## age_group                        -0.051154  0.0145921   -3.50561
## begin_month                       0.021607  0.0008522   25.35481
## 
## Intercepts:
##     Value     Std. Error t value  
## 2|1   -0.1275    0.0771    -1.6534
## 1|0    1.3021    0.0780    16.6932
## 
## Residual Deviance: 39488.04 
## AIC: 39540.04

2.3.1 mod2's p-vlaues

ctable = coef(summary(mod2))
p = pnorm(abs(ctable[, "t value"]), lower.tail = FALSE)*2
round(cbind(ctable, "p value"=p), 4)
##                                    Value Std. Error   t value p value
## income_typePensioner            -11.8329     0.0410 -288.3624  0.0000
## income_typeState servant          0.0389     0.0600    0.6487  0.5165
## income_typeStudent                0.5303     0.7522    0.7050  0.4808
## income_typeWorking                0.0793     0.0352    2.2493  0.0245
## occyp_typeCleaning staff         -0.2815     0.1255   -2.2435  0.0249
## occyp_typeCooking staff          -0.1315     0.1174   -1.1199  0.2628
## occyp_typeCore staff             -0.1024     0.0790   -1.2962  0.1949
## occyp_typeDrivers                -0.1910     0.0856   -2.2321  0.0256
## occyp_typeHigh skill tech staff  -0.0443     0.0922   -0.4804  0.6309
## occyp_typeHR staff               -1.0392     0.3437   -3.0235  0.0025
## occyp_typeIT staff                0.2282     0.3141    0.7267  0.4674
## occyp_typeLaborers               -0.1098     0.0743   -1.4786  0.1392
## occyp_typeLow-skill Laborers     -0.1337     0.1998   -0.6691  0.5034
## occyp_typeManagers               -0.1579     0.0811   -1.9473  0.0515
## occyp_typeMedicine staff         -0.2850     0.1002   -2.8435  0.0045
## occyp_typePrivate service staff  -0.0869     0.1477   -0.5883  0.5563
## occyp_typeRealty agents          -0.3673     0.2846   -1.2903  0.1969
## occyp_typeSales staff            -0.1507     0.0788   -1.9119  0.0559
## occyp_typeSecretaries            -0.0058     0.2107   -0.0273  0.9782
## occyp_typeSecurity staff         -0.3969     0.1256   -3.1600  0.0016
## occyp_typeUnempolyed             11.8577     0.0410  288.9681  0.0000
## occyp_typeWaiters/barmen staff    0.0946     0.1811    0.5223  0.6014
## age_group                        -0.0512     0.0146   -3.5056  0.0005
## begin_month                       0.0216     0.0009   25.3548  0.0000
## 2|1                              -0.1275     0.0771   -1.6534  0.0982
## 1|0                               1.3021     0.0780   16.6932  0.0000
  • income_type에서 'Pensioner', 'Working' 만이 의미있는 정보로 보인다.
  • occyp_type 에서 'Cleaning staff', 'Drivers', 'HR staff', 'Medicine staff', 'Security staff', 'Unempolyed'가 의미있는 변수들로 고려된다.

2.4 Model ANOVA Comparison

anova(mod1, mod2, test = "Chisq")
## Likelihood ratio tests of ordinal regression models
## 
## Response: credit
##                                                                                                                    Model
## 1                                                                     income_type + occyp_type + age_group + begin_month
## 2 yesCar + yesReality + income_type + family_type + occyp_type + income_quartile + age_group + worked_year + begin_month
##   Resid. df Resid. Dev   Test    Df LR stat.   Pr(Chi)
## 1     22698   39488.04                                
## 2     22690   39477.10 1 vs 2     8 10.94169 0.2050253
  • ANOVA Comparison 또한 mod2 가 mod1보다 성능이 좋다고 보여준다.

0개의 댓글