When building a model used for prediction, one measure of the model's usefulness is how well it performs on unseen (future) data. One way to evaluate a model on unseen data is to partition the data into a training set and a test set. The training set is the data one uses to build the model; the test set is the data one uses to evaluate the model's predictions. Four approaches to creating a training set and a test set are illustrated below.
The data set HELPrct from the mdsr package, after removing rows with missing values, is used to illustrate the four approaches with a 70/30 split of the values. That is, roughly 70% of the values will be used for the training set and the remainder will be used for the test set. Note that there are 117 rows in HELPrct with complete information (no missing values), so a 70/30 split should result in roughly \(117 \times 0.70 \approx 82\) values in the training set and the remaining 35 values in the test set.
Using the sample() function (method 1):

> library(mdsr)
> HELP <- na.omit(HELPrct)
> set.seed(123)
> trainIndex1 <- sample(x = c(TRUE, FALSE), size = nrow(HELP), replace = TRUE, prob = c(0.7, 0.3))
> testIndex1 <- (!trainIndex1)
> trainData1 <- HELP[trainIndex1, ]
> testData1 <- HELP[testIndex1, ]
> dim(HELP)
[1] 117 29
> dim(trainData1)
[1] 82 29
> dim(testData1)
[1] 35 29
> # Train Fraction
> dim(trainData1)[1]/dim(HELP)[1]
[1] 0.7008547
> # Test Fraction
> dim(testData1)[1]/dim(HELP)[1]
[1] 0.2991453
Using the sample() function (method 2):

> set.seed(123)
> trainIndex2 <- sample(x = 1:nrow(HELP), size = round(0.70*nrow(HELP), 0), replace = FALSE)
> trainData2 <- HELP[trainIndex2, ]
> testData2 <- HELP[-trainIndex2, ]
> dim(trainData2)
[1] 82 29
> dim(testData2)
[1] 35 29
> # Train Fraction
> dim(trainData2)[1]/dim(HELP)[1]
[1] 0.7008547
> # Test Fraction
> dim(testData2)[1]/dim(HELP)[1]
[1] 0.2991453
Using the runif() function (method 3):

> set.seed(123)
> HELP$Partition <- runif(dim(HELP)[1])
> trainData3 <- subset(HELP, HELP$Partition > 0.3)
> testData3 <- subset(HELP, HELP$Partition <= 0.3)
> dim(trainData3)
[1] 84 30
> dim(testData3)
[1] 33 30
> # Train Fraction
> dim(trainData3)[1]/dim(HELP)[1]
[1] 0.7179487
> # Test Fraction
> dim(testData3)[1]/dim(HELP)[1]
[1] 0.2820513
Using createDataPartition() from the caret package (method 4):

> library(caret) # load the caret package
> set.seed(123)
> trainIndex4 <- createDataPartition(y = HELP$homeless, p = .7,
+ list = FALSE,
+ times = 1)
> trainData4 <- HELP[trainIndex4, ]
> testData4 <- HELP[-trainIndex4, ]
> dim(trainData4)
[1] 83 30
> dim(testData4)
[1] 34 30
> # Train Fraction
> dim(trainData4)[1]/dim(HELP)[1]
[1] 0.7094017
> # Test Fraction
> dim(testData4)[1]/dim(HELP)[1]
[1] 0.2905983
Note: When the y argument to createDataPartition() is a factor, the random sampling occurs within each class and should preserve the overall class distribution of the data.
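To see the stratification, one can compare the class proportions in the full data with those in the stratified training set (a quick check, not part of the original output; the two sets of proportions should be nearly identical):

# Proportions of homeless/housed in the full data vs. the stratified training set
prop.table(table(HELP$homeless))
prop.table(table(trainData4$homeless))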
In this section, a model is built using the training set trainData4 to predict whether a person is homeless. (Note that trainData4 has 30 columns rather than 29 because the Partition column created in method 3 was added to HELP; it therefore shows up as a candidate predictor in the stepwise output below.)
> model.null <- glm(homeless ~ 1, data = trainData4, family = binomial)
> model.full <- glm(homeless ~ ., data = trainData4, family = binomial)
> AICmod <- step(model.null, scope = list(lower = model.null, upper = model.full), direction = "both", data = trainData4)
Start: AIC=110.6
homeless ~ 1
Df Deviance AIC
+ i1 1 96.888 100.89
+ avg_drinks 1 96.888 100.89
+ i2 1 98.747 102.75
+ max_drinks 1 98.747 102.75
+ female 1 102.302 106.30
+ sex 1 102.302 106.30
+ pss_fr 1 103.184 107.18
+ indtot 1 104.016 108.02
+ e2b 1 105.096 109.10
+ linkstatus 1 105.727 109.73
+ link 1 105.727 109.73
+ substance 2 104.093 110.09
+ g1b 1 106.258 110.26
+ pcs 1 106.570 110.57
+ sexrisk 1 106.580 110.58
<none> 108.605 110.61
+ dayslink 1 106.883 110.88
+ d1 1 106.948 110.95
+ drugrisk 1 107.162 111.16
+ age 1 107.664 111.66
+ treat 1 108.352 112.35
+ anysubstatus 1 108.437 112.44
+ anysub 1 108.437 112.44
+ cesd 1 108.476 112.48
+ satreat 1 108.496 112.50
+ daysanysub 1 108.532 112.53
+ Partition 1 108.533 112.53
+ mcs 1 108.596 112.60
+ id 1 108.605 112.61
+ racegrp 3 108.124 116.12
Step: AIC=100.89
homeless ~ i1
Df Deviance AIC
+ sex 1 89.820 95.820
+ female 1 89.820 95.820
+ pss_fr 1 92.616 98.616
+ linkstatus 1 93.154 99.154
+ link 1 93.154 99.154
+ dayslink 1 93.344 99.344
+ indtot 1 94.268 100.268
+ drugrisk 1 94.355 100.355
<none> 96.888 100.888
+ g1b 1 94.996 100.996
+ sexrisk 1 95.266 101.266
+ e2b 1 95.916 101.916
+ pcs 1 95.980 101.980
+ treat 1 96.134 102.134
+ id 1 96.524 102.524
+ d1 1 96.614 102.614
+ daysanysub 1 96.743 102.743
+ anysubstatus 1 96.781 102.781
+ anysub 1 96.781 102.781
+ satreat 1 96.783 102.783
+ cesd 1 96.835 102.835
+ age 1 96.862 102.862
+ Partition 1 96.875 102.875
+ i2 1 96.880 102.880
+ max_drinks 1 96.880 102.880
+ mcs 1 96.885 102.885
+ substance 2 96.804 104.804
+ racegrp 3 96.167 106.167
- i1 1 108.605 110.605
Step: AIC=95.82
homeless ~ i1 + sex
Df Deviance AIC
+ g1b 1 85.356 93.356
+ dayslink 1 86.476 94.476
+ linkstatus 1 86.567 94.567
+ link 1 86.567 94.567
+ pss_fr 1 87.194 95.194
+ drugrisk 1 87.704 95.704
<none> 89.820 95.820
+ pcs 1 87.920 95.920
+ sexrisk 1 88.549 96.549
+ treat 1 88.675 96.675
+ d1 1 88.969 96.969
+ e2b 1 89.165 97.165
+ indtot 1 89.189 97.189
+ cesd 1 89.244 97.244
+ id 1 89.351 97.351
+ daysanysub 1 89.511 97.511
+ anysubstatus 1 89.523 97.523
+ anysub 1 89.523 97.523
+ Partition 1 89.543 97.543
+ mcs 1 89.683 97.683
+ i2 1 89.780 97.780
+ max_drinks 1 89.780 97.780
+ age 1 89.784 97.784
+ satreat 1 89.811 97.811
+ substance 2 89.717 99.717
- sex 1 96.888 100.888
+ racegrp 3 89.328 101.328
- i1 1 102.302 106.302
Step: AIC=93.36
homeless ~ i1 + sex + g1b
Df Deviance AIC
+ pss_fr 1 82.333 92.333
+ linkstatus 1 82.888 92.888
+ link 1 82.888 92.888
+ dayslink 1 82.925 92.925
<none> 85.356 93.356
+ sexrisk 1 83.721 93.721
+ d1 1 83.949 93.949
+ pcs 1 84.005 94.005
+ drugrisk 1 84.690 94.690
+ treat 1 84.761 94.761
+ id 1 84.924 94.924
+ e2b 1 84.965 94.965
+ age 1 85.095 95.095
+ mcs 1 85.129 95.129
+ indtot 1 85.140 95.140
+ Partition 1 85.141 95.141
+ i2 1 85.314 95.314
+ max_drinks 1 85.314 95.314
+ daysanysub 1 85.332 95.332
+ satreat 1 85.343 95.343
+ cesd 1 85.344 95.344
+ anysubstatus 1 85.356 95.356
+ anysub 1 85.356 95.356
- g1b 1 89.820 95.820
+ substance 2 85.209 97.209
+ racegrp 3 84.586 98.586
- sex 1 94.996 100.996
- i1 1 97.479 103.479
Step: AIC=92.33
homeless ~ i1 + sex + g1b + pss_fr
Df Deviance AIC
+ dayslink 1 79.519 91.519
+ linkstatus 1 79.613 91.613
+ link 1 79.613 91.613
<none> 82.333 92.333
+ sexrisk 1 80.603 92.603
- pss_fr 1 85.356 93.356
+ d1 1 81.421 93.421
+ pcs 1 81.573 93.573
+ mcs 1 81.607 93.607
+ drugrisk 1 81.617 93.617
+ id 1 81.644 93.644
+ treat 1 81.656 93.656
+ e2b 1 81.956 93.956
+ age 1 82.033 94.033
+ Partition 1 82.103 94.103
+ cesd 1 82.168 94.168
+ satreat 1 82.217 94.217
+ daysanysub 1 82.307 94.307
+ indtot 1 82.315 94.315
+ i2 1 82.318 94.318
+ max_drinks 1 82.318 94.318
+ anysubstatus 1 82.332 94.332
+ anysub 1 82.332 94.332
- g1b 1 87.194 95.194
+ substance 2 82.053 96.053
+ racegrp 3 81.595 97.595
- sex 1 90.088 98.088
- i1 1 93.342 101.342
Step: AIC=91.52
homeless ~ i1 + sex + g1b + pss_fr + dayslink
Df Deviance AIC
+ sexrisk 1 77.385 91.385
<none> 79.519 91.519
+ mcs 1 78.329 92.329
- dayslink 1 82.333 92.333
- pss_fr 1 82.925 92.925
+ e2b 1 79.003 93.003
+ satreat 1 79.067 93.067
+ drugrisk 1 79.090 93.090
+ d1 1 79.153 93.153
+ id 1 79.187 93.187
+ pcs 1 79.196 93.196
- g1b 1 83.202 93.202
+ cesd 1 79.210 93.210
+ anysubstatus 1 79.264 93.264
+ anysub 1 79.264 93.264
+ Partition 1 79.366 93.366
+ age 1 79.454 93.454
+ i2 1 79.485 93.485
+ max_drinks 1 79.485 93.485
+ daysanysub 1 79.487 93.487
+ treat 1 79.505 93.505
+ indtot 1 79.511 93.511
+ linkstatus 1 79.515 93.515
+ link 1 79.515 93.515
+ substance 2 79.279 95.279
- sex 1 86.308 96.308
+ racegrp 3 78.622 96.622
- i1 1 92.140 102.140
Step: AIC=91.38
homeless ~ i1 + sex + g1b + pss_fr + dayslink + sexrisk
Df Deviance AIC
<none> 77.385 91.385
- sexrisk 1 79.519 91.519
+ drugrisk 1 76.277 92.277
+ satreat 1 76.555 92.555
- dayslink 1 80.603 92.603
+ mcs 1 76.636 92.636
+ anysubstatus 1 76.658 92.658
+ anysub 1 76.658 92.658
+ pcs 1 76.740 92.740
- pss_fr 1 80.916 92.916
+ age 1 77.067 93.067
+ e2b 1 77.119 93.119
+ d1 1 77.128 93.128
+ Partition 1 77.154 93.154
+ indtot 1 77.235 93.235
+ cesd 1 77.259 93.259
+ daysanysub 1 77.302 93.302
+ treat 1 77.312 93.312
+ id 1 77.355 93.355
+ linkstatus 1 77.375 93.375
+ link 1 77.375 93.375
+ i2 1 77.383 93.383
+ max_drinks 1 77.383 93.383
- g1b 1 81.468 93.468
+ substance 2 77.019 95.019
- sex 1 83.654 95.654
+ racegrp 3 77.212 97.212
- i1 1 88.139 100.139
> summary(AICmod)
Call:
glm(formula = homeless ~ i1 + sex + g1b + pss_fr + dayslink +
sexrisk, family = binomial, data = trainData4)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1433 -0.6923 -0.3944 0.7927 1.9780
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.903045 1.240429 0.728 0.4666
i1 -0.057817 0.023137 -2.499 0.0125 *
sexmale -1.801138 0.764603 -2.356 0.0185 *
g1byes -1.338143 0.710160 -1.884 0.0595 .
pss_fr 0.159495 0.087981 1.813 0.0699 .
dayslink 0.003429 0.001979 1.733 0.0831 .
sexrisk -0.161055 0.112725 -1.429 0.1531
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 108.605 on 82 degrees of freedom
Residual deviance: 77.385 on 76 degrees of freedom
AIC: 91.385
Number of Fisher Scoring iterations: 6
One can think of the pseudo \(R^2\) for logistic regression as an analog of the \(R^2\) used in linear regression: pseudo \(R^2\) measures how much of the deviance is explained by the model, just as \(R^2\) measures how much of the variability is explained by a linear model.
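The version computed below is

\[R^2_{\text{pseudo}} = 1 - \frac{\text{residual deviance}}{\text{null deviance}}.\]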
> resid.dev <- summary(AICmod)$deviance
> resid.dev
[1] 77.38461
> null.dev <- summary(AICmod)$null.deviance
> null.dev
[1] 108.6048
> PR2 <- 1 - (resid.dev / null.dev)
> PR2 # Pseudo R squared
[1] 0.2874657
The model explains only 28.75% of the deviance in the training data, and this value will likely be smaller for the test data, suggesting that other factors that have not been measured likely predict homelessness.
> testmod <- glm(homeless ~ i1 + sex + g1b + pss_fr + dayslink + sexrisk, family = binomial, data = testData4)
> resid.dev <- summary(testmod)$deviance
> resid.dev
[1] 37.53389
> null.dev <- summary(testmod)$null.deviance
> null.dev
[1] 44.14889
> PR2 <- 1 - (resid.dev / null.dev)
> PR2 # Pseudo R squared
[1] 0.1498339
> # Note 'type = "response"' gives the predicted probabilities
> PM <- predict(AICmod, newdata = testData4, type = "response")
> pred <- ifelse(PM > 0.5, "homeless", "housed")
> T1 <- table(pred, testData4$homeless)
> T1
pred homeless housed
homeless 6 3
housed 16 9
> Accuracy <- sum(diag(T1))/sum(T1)
> Accuracy
[1] 0.4411765
> # Using caret
> confusionMatrix(data = pred, reference = testData4$homeless)
Confusion Matrix and Statistics
Reference
Prediction homeless housed
homeless 6 3
housed 16 9
Accuracy : 0.4412
95% CI : (0.2719, 0.6211)
No Information Rate : 0.6471
P-Value [Acc > NIR] : 0.995694
Kappa : 0.0182
Mcnemar's Test P-Value : 0.005905
Sensitivity : 0.2727
Specificity : 0.7500
Pos Pred Value : 0.6667
Neg Pred Value : 0.3600
Prevalence : 0.6471
Detection Rate : 0.1765
Detection Prevalence : 0.2647
Balanced Accuracy : 0.5114
'Positive' Class : homeless
See Confusion Matrix on Wikipedia.
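As a check on the caret output, the sensitivity and specificity can be computed directly from T1 (a sketch; homeless is the positive class):

# Sensitivity: fraction of truly homeless predicted homeless (6/22)
T1["homeless", "homeless"] / sum(T1[, "homeless"])
# Specificity: fraction of truly housed predicted housed (9/12)
T1["housed", "housed"] / sum(T1[, "housed"])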
An ROC curve is a really useful shortcut for summarizing the performance of a classifier over all possible thresholds; it saves a lot of tedious work computing class predictions for many different thresholds and examining the confusion matrix for each one.
> library(caTools)
> colAUC(PM, testData4$homeless, plotROC = TRUE)
[,1]
homeless vs. housed 0.625
> library(pROC)
> ROC <- roc(testData4$homeless, PM) # Note the order is switched here
> plot(ROC, col = "blue")
> auc(ROC)
Area under the curve: 0.625
An AUC of 0.5 is no better than random guessing, an AUC of 1.0 is a perfectly predictive model, and an AUC of 0.0 is perfectly anti-predictive (which rarely happens). One can use the trainControl() function in caret to tune the parameters of a model using AUC instead of accuracy. The twoClassSummary() convenience function allows one to do this easily.
When using twoClassSummary(), be sure to always include the argument classProbs = TRUE, or your model will throw an error! (You cannot calculate AUC with just class predictions; you need class probabilities as well.)
> # Create trainControl object: myControl
> myControl <- trainControl(
+ # method = "cv",
+ # number = 10,
+ summaryFunction = twoClassSummary,
+ classProbs = TRUE, # IMPORTANT!
+ verboseIter = FALSE
+ )
> #
> # Train glm with custom trainControl: model
> model <- train(homeless ~ ., data = trainData4, method = "glmStepAIC",
+ trControl = myControl)
Start: AIC=124.46
.outcome ~ age + anysubstatus + anysubyes + cesd + d1 + daysanysub +
dayslink + drugrisk + e2b + female + sexmale + g1byes + i1 +
i2 + id + indtot + linkstatus + linkyes + mcs + pcs + pss_fr +
racegrphispanic + racegrpother + racegrpwhite + satreatyes +
sexrisk + substancecocaine + substanceheroin + treatyes +
avg_drinks + max_drinks + Partition
Step: AIC=124.46
.outcome ~ age + anysubstatus + anysubyes + cesd + d1 + daysanysub +
dayslink + drugrisk + e2b + female + sexmale + g1byes + i1 +
i2 + id + indtot + linkstatus + linkyes + mcs + pcs + pss_fr +
racegrphispanic + racegrpother + racegrpwhite + satreatyes +
sexrisk + substancecocaine + substanceheroin + treatyes +
avg_drinks + Partition
Step: AIC=124.46
.outcome ~ age + anysubstatus + anysubyes + cesd + d1 + daysanysub +
dayslink + drugrisk + e2b + female + sexmale + g1byes + i1 +
i2 + id + indtot + linkstatus + linkyes + mcs + pcs + pss_fr +
racegrphispanic + racegrpother + racegrpwhite + satreatyes +
sexrisk + substancecocaine + substanceheroin + treatyes +
Partition
Step: AIC=124.46
.outcome ~ age + anysubstatus + anysubyes + cesd + d1 + daysanysub +
dayslink + drugrisk + e2b + female + sexmale + g1byes + i1 +
i2 + id + indtot + linkstatus + mcs + pcs + pss_fr + racegrphispanic +
racegrpother + racegrpwhite + satreatyes + sexrisk + substancecocaine +
substanceheroin + treatyes + Partition
Step: AIC=124.46
.outcome ~ age + anysubstatus + anysubyes + cesd + d1 + daysanysub +
dayslink + drugrisk + e2b + female + g1byes + i1 + i2 + id +
indtot + linkstatus + mcs + pcs + pss_fr + racegrphispanic +
racegrpother + racegrpwhite + satreatyes + sexrisk + substancecocaine +
substanceheroin + treatyes + Partition
Step: AIC=124.46
.outcome ~ age + anysubstatus + cesd + d1 + daysanysub + dayslink +
drugrisk + e2b + female + g1byes + i1 + i2 + id + indtot +
linkstatus + mcs + pcs + pss_fr + racegrphispanic + racegrpother +
racegrpwhite + satreatyes + sexrisk + substancecocaine +
substanceheroin + treatyes + Partition
Df Deviance AIC
- treatyes 1 68.460 122.46
- cesd 1 68.461 122.46
- i2 1 68.476 122.48
- d1 1 68.498 122.50
- satreatyes 1 68.527 122.53
- racegrpother 1 68.629 122.63
- id 1 68.669 122.67
- racegrphispanic 1 68.675 122.67
- e2b 1 68.853 122.85
- linkstatus 1 68.857 122.86
- i1 1 68.872 122.87
- age 1 68.877 122.88
- racegrpwhite 1 68.877 122.88
- substancecocaine 1 68.881 122.88
- indtot 1 68.886 122.89
- pcs 1 68.897 122.90
- daysanysub 1 69.125 123.12
- dayslink 1 69.398 123.40
- Partition 1 69.509 123.51
- substanceheroin 1 69.614 123.61
- anysubstatus 1 69.632 123.63
- mcs 1 69.712 123.71
- drugrisk 1 70.232 124.23
<none> 68.458 124.46
- female 1 71.664 125.66
- sexrisk 1 71.689 125.69
- pss_fr 1 71.765 125.77
- g1byes 1 73.622 127.62
Step: AIC=122.46
.outcome ~ age + anysubstatus + cesd + d1 + daysanysub + dayslink +
drugrisk + e2b + female + g1byes + i1 + i2 + id + indtot +
linkstatus + mcs + pcs + pss_fr + racegrphispanic + racegrpother +
racegrpwhite + satreatyes + sexrisk + substancecocaine +
substanceheroin + Partition
Df Deviance AIC
- cesd 1 68.462 120.46
- i2 1 68.478 120.48
- d1 1 68.508 120.51
- satreatyes 1 68.527 120.53
- racegrpother 1 68.631 120.63
- id 1 68.671 120.67
- racegrphispanic 1 68.678 120.68
- e2b 1 68.855 120.86
- linkstatus 1 68.857 120.86
- i1 1 68.872 120.87
- age 1 68.877 120.88
- racegrpwhite 1 68.878 120.88
- substancecocaine 1 68.881 120.88
- pcs 1 68.909 120.91
- indtot 1 68.910 120.91
- daysanysub 1 69.174 121.17
- dayslink 1 69.426 121.43
- Partition 1 69.534 121.53
- substanceheroin 1 69.636 121.64
- anysubstatus 1 69.711 121.71
- mcs 1 69.712 121.71
- drugrisk 1 70.235 122.23
<none> 68.460 122.46
- sexrisk 1 71.699 123.70
- female 1 71.708 123.71
- pss_fr 1 71.813 123.81
- g1byes 1 73.623 125.62
Step: AIC=120.46
.outcome ~ age + anysubstatus + d1 + daysanysub + dayslink +
drugrisk + e2b + female + g1byes + i1 + i2 + id + indtot +
linkstatus + mcs + pcs + pss_fr + racegrphispanic + racegrpother +
racegrpwhite + satreatyes + sexrisk + substancecocaine +
substanceheroin + Partition
Df Deviance AIC
- i2 1 68.483 118.48
- d1 1 68.508 118.51
- satreatyes 1 68.527 118.53
- racegrpother 1 68.631 118.63
- racegrphispanic 1 68.684 118.68
- id 1 68.713 118.71
- linkstatus 1 68.865 118.86
- i1 1 68.873 118.87
- e2b 1 68.879 118.88
- racegrpwhite 1 68.880 118.88
- age 1 68.881 118.88
- substancecocaine 1 68.895 118.89
- indtot 1 68.912 118.91
- pcs 1 69.004 119.00
- daysanysub 1 69.235 119.23
- dayslink 1 69.453 119.45
- Partition 1 69.557 119.56
- substanceheroin 1 69.656 119.66
- anysubstatus 1 69.777 119.78
- drugrisk 1 70.242 120.24
- mcs 1 70.387 120.39
<none> 68.462 120.46
- female 1 71.732 121.73
- sexrisk 1 71.754 121.75
- pss_fr 1 71.864 121.86
- g1byes 1 73.765 123.77
Step: AIC=118.48
.outcome ~ age + anysubstatus + d1 + daysanysub + dayslink +
drugrisk + e2b + female + g1byes + i1 + id + indtot + linkstatus +
mcs + pcs + pss_fr + racegrphispanic + racegrpother + racegrpwhite +
satreatyes + sexrisk + substancecocaine + substanceheroin +
Partition
Df Deviance AIC
- d1 1 68.529 116.53
- satreatyes 1 68.538 116.54
- racegrpother 1 68.646 116.65
- racegrphispanic 1 68.707 116.71
- id 1 68.725 116.72
- racegrpwhite 1 68.881 116.88
- linkstatus 1 68.894 116.89
- age 1 68.896 116.90
- substancecocaine 1 68.896 116.90
- indtot 1 68.921 116.92
- e2b 1 68.981 116.98
- pcs 1 69.009 117.01
- daysanysub 1 69.250 117.25
- dayslink 1 69.476 117.48
- Partition 1 69.602 117.60
- substanceheroin 1 69.683 117.68
- anysubstatus 1 69.795 117.80
- i1 1 69.929 117.93
- drugrisk 1 70.366 118.37
- mcs 1 70.395 118.39
<none> 68.483 118.48
- female 1 71.768 119.77
- sexrisk 1 71.786 119.79
- pss_fr 1 71.981 119.98
- g1byes 1 73.806 121.81
Step: AIC=116.53
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk +
e2b + female + g1byes + i1 + id + indtot + linkstatus + mcs +
pcs + pss_fr + racegrphispanic + racegrpother + racegrpwhite +
satreatyes + sexrisk + substancecocaine + substanceheroin +
Partition
Df Deviance AIC
- satreatyes 1 68.580 114.58
- racegrpother 1 68.677 114.68
- racegrphispanic 1 68.740 114.74
- id 1 68.779 114.78
- linkstatus 1 68.897 114.90
- substancecocaine 1 68.922 114.92
- age 1 68.939 114.94
- racegrpwhite 1 68.964 114.96
- indtot 1 68.970 114.97
- e2b 1 69.000 115.00
- pcs 1 69.133 115.13
- daysanysub 1 69.261 115.26
- dayslink 1 69.476 115.48
- Partition 1 69.631 115.63
- substanceheroin 1 69.803 115.80
- anysubstatus 1 69.830 115.83
- i1 1 70.020 116.02
<none> 68.529 116.53
- mcs 1 70.559 116.56
- drugrisk 1 70.675 116.67
- sexrisk 1 71.821 117.82
- female 1 71.837 117.84
- pss_fr 1 72.393 118.39
- g1byes 1 73.999 120.00
Step: AIC=114.58
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk +
e2b + female + g1byes + i1 + id + indtot + linkstatus + mcs +
pcs + pss_fr + racegrphispanic + racegrpother + racegrpwhite +
sexrisk + substancecocaine + substanceheroin + Partition
Df Deviance AIC
- racegrpother 1 68.700 112.70
- racegrphispanic 1 68.752 112.75
- id 1 68.860 112.86
- linkstatus 1 68.929 112.93
- age 1 68.964 112.96
- indtot 1 69.024 113.02
- substancecocaine 1 69.062 113.06
- racegrpwhite 1 69.158 113.16
- pcs 1 69.241 113.24
- e2b 1 69.283 113.28
- daysanysub 1 69.298 113.30
- dayslink 1 69.494 113.49
- Partition 1 69.799 113.80
- anysubstatus 1 69.866 113.87
- substanceheroin 1 69.981 113.98
- i1 1 70.057 114.06
<none> 68.580 114.58
- drugrisk 1 70.958 114.96
- mcs 1 71.193 115.19
- female 1 71.843 115.84
- sexrisk 1 71.874 115.87
- pss_fr 1 72.488 116.49
- g1byes 1 74.339 118.34
Step: AIC=112.7
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk +
e2b + female + g1byes + i1 + id + indtot + linkstatus + mcs +
pcs + pss_fr + racegrphispanic + racegrpwhite + sexrisk +
substancecocaine + substanceheroin + Partition
Df Deviance AIC
- racegrphispanic 1 68.785 110.78
- linkstatus 1 68.966 110.97
- id 1 68.974 110.97
- age 1 69.016 111.02
- indtot 1 69.082 111.08
- substancecocaine 1 69.346 111.35
- pcs 1 69.441 111.44
- daysanysub 1 69.442 111.44
- racegrpwhite 1 69.488 111.49
- dayslink 1 69.498 111.50
- e2b 1 69.531 111.53
- Partition 1 69.936 111.94
- anysubstatus 1 69.967 111.97
- i1 1 70.060 112.06
- substanceheroin 1 70.088 112.09
<none> 68.700 112.70
- drugrisk 1 71.101 113.10
- mcs 1 71.212 113.21
- sexrisk 1 71.875 113.88
- female 1 72.089 114.09
- pss_fr 1 73.065 115.06
- g1byes 1 74.444 116.44
Step: AIC=110.78
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk +
e2b + female + g1byes + i1 + id + indtot + linkstatus + mcs +
pcs + pss_fr + racegrpwhite + sexrisk + substancecocaine +
substanceheroin + Partition
Df Deviance AIC
- linkstatus 1 68.983 108.98
- age 1 69.100 109.10
- id 1 69.133 109.13
- indtot 1 69.146 109.15
- daysanysub 1 69.491 109.49
- dayslink 1 69.498 109.50
- pcs 1 69.518 109.52
- substancecocaine 1 69.543 109.54
- e2b 1 69.701 109.70
- racegrpwhite 1 69.896 109.90
- Partition 1 69.953 109.95
- anysubstatus 1 69.994 109.99
- substanceheroin 1 70.089 110.09
- i1 1 70.094 110.09
<none> 68.785 110.78
- drugrisk 1 71.112 111.11
- mcs 1 71.273 111.27
- sexrisk 1 71.957 111.96
- female 1 72.169 112.17
- pss_fr 1 73.201 113.20
- g1byes 1 74.746 114.75
Step: AIC=108.98
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk +
e2b + female + g1byes + i1 + id + indtot + mcs + pcs + pss_fr +
racegrpwhite + sexrisk + substancecocaine + substanceheroin +
Partition
Df Deviance AIC
- indtot 1 69.373 107.37
- age 1 69.471 107.47
- id 1 69.580 107.58
- daysanysub 1 69.643 107.64
- substancecocaine 1 69.769 107.77
- pcs 1 69.773 107.77
- e2b 1 69.967 107.97
- Partition 1 69.996 108.00
- racegrpwhite 1 70.004 108.00
- anysubstatus 1 70.122 108.12
- substanceheroin 1 70.137 108.14
- i1 1 70.159 108.16
<none> 68.983 108.98
- drugrisk 1 71.279 109.28
- mcs 1 71.300 109.30
- sexrisk 1 72.422 110.42
- female 1 72.423 110.42
- pss_fr 1 73.223 111.22
- dayslink 1 73.688 111.69
- g1byes 1 74.867 112.87
Step: AIC=107.37
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk +
e2b + female + g1byes + i1 + id + mcs + pcs + pss_fr + racegrpwhite +
sexrisk + substancecocaine + substanceheroin + Partition
Df Deviance AIC
- id 1 69.852 105.85
- age 1 69.966 105.97
- substancecocaine 1 70.045 106.05
- pcs 1 70.063 106.06
- racegrpwhite 1 70.174 106.17
- daysanysub 1 70.212 106.21
- e2b 1 70.295 106.30
- substanceheroin 1 70.403 106.40
- Partition 1 70.511 106.51
- anysubstatus 1 70.796 106.80
- i1 1 70.843 106.84
- mcs 1 71.300 107.30
<none> 69.373 107.37
- drugrisk 1 71.946 107.95
- sexrisk 1 72.828 108.83
- dayslink 1 73.769 109.77
- female 1 73.772 109.77
- pss_fr 1 74.240 110.24
- g1byes 1 74.948 110.95
Step: AIC=105.85
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk +
e2b + female + g1byes + i1 + mcs + pcs + pss_fr + racegrpwhite +
sexrisk + substancecocaine + substanceheroin + Partition
Df Deviance AIC
- pcs 1 70.271 104.27
- daysanysub 1 70.488 104.49
- e2b 1 70.511 104.51
- substancecocaine 1 70.553 104.55
- age 1 70.574 104.57
- racegrpwhite 1 70.647 104.65
- substanceheroin 1 70.713 104.71
- Partition 1 70.797 104.80
- anysubstatus 1 71.113 105.11
- mcs 1 71.619 105.62
<none> 69.852 105.85
- i1 1 71.877 105.88
- drugrisk 1 72.253 106.25
- sexrisk 1 72.875 106.88
- dayslink 1 73.869 107.87
- female 1 74.283 108.28
- pss_fr 1 75.070 109.07
- g1byes 1 75.106 109.11
Step: AIC=104.27
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk +
e2b + female + g1byes + i1 + mcs + pss_fr + racegrpwhite +
sexrisk + substancecocaine + substanceheroin + Partition
Df Deviance AIC
- daysanysub 1 70.750 102.75
- substanceheroin 1 70.864 102.86
- substancecocaine 1 70.923 102.92
- Partition 1 71.226 103.23
- racegrpwhite 1 71.237 103.24
- e2b 1 71.260 103.26
- anysubstatus 1 71.360 103.36
- age 1 71.396 103.40
- mcs 1 72.145 104.14
<none> 70.271 104.27
- i1 1 72.693 104.69
- drugrisk 1 72.776 104.78
- sexrisk 1 73.134 105.13
- female 1 74.448 106.45
- dayslink 1 74.830 106.83
- g1byes 1 75.679 107.68
- pss_fr 1 75.970 107.97
Step: AIC=102.75
.outcome ~ age + anysubstatus + dayslink + drugrisk + e2b + female +
g1byes + i1 + mcs + pss_fr + racegrpwhite + sexrisk + substancecocaine +
substanceheroin + Partition
Df Deviance AIC
- e2b 1 71.510 101.51
- substancecocaine 1 71.606 101.61
- age 1 71.666 101.67
- anysubstatus 1 71.702 101.70
- Partition 1 71.795 101.80
- substanceheroin 1 72.038 102.04
- racegrpwhite 1 72.049 102.05
- mcs 1 72.524 102.52
<none> 70.750 102.75
- i1 1 72.824 102.82
- sexrisk 1 73.223 103.22
- drugrisk 1 73.899 103.90
- female 1 74.799 104.80
- dayslink 1 74.981 104.98
- g1byes 1 75.975 105.97
- pss_fr 1 76.428 106.43
Step: AIC=101.51
.outcome ~ age + anysubstatus + dayslink + drugrisk + female +
g1byes + i1 + mcs + pss_fr + racegrpwhite + sexrisk + substancecocaine +
substanceheroin + Partition
Df Deviance AIC
- substancecocaine 1 72.273 100.27
- Partition 1 72.352 100.35
- age 1 72.386 100.39
- racegrpwhite 1 72.484 100.48
- substanceheroin 1 72.614 100.61
- anysubstatus 1 72.623 100.62
- mcs 1 72.894 100.89
<none> 71.510 101.51
- i1 1 73.998 102.00
- drugrisk 1 74.305 102.31
- sexrisk 1 74.479 102.48
- dayslink 1 75.627 103.63
- female 1 75.836 103.84
- pss_fr 1 76.886 104.89
- g1byes 1 76.903 104.90
Step: AIC=100.27
.outcome ~ age + anysubstatus + dayslink + drugrisk + female +
g1byes + i1 + mcs + pss_fr + racegrpwhite + sexrisk + substanceheroin +
Partition
Df Deviance AIC
- substanceheroin 1 72.758 98.758
- racegrpwhite 1 72.773 98.773
- Partition 1 73.106 99.106
- age 1 73.111 99.111
- anysubstatus 1 73.429 99.429
- mcs 1 73.735 99.735
<none> 72.273 100.273
- sexrisk 1 74.612 100.612
- drugrisk 1 74.621 100.621
- dayslink 1 76.160 102.160
- female 1 76.433 102.433
- i1 1 76.625 102.625
- pss_fr 1 77.269 103.269
- g1byes 1 77.994 103.994
Step: AIC=98.76
.outcome ~ age + anysubstatus + dayslink + drugrisk + female +
g1byes + i1 + mcs + pss_fr + racegrpwhite + sexrisk + Partition
Df Deviance AIC
- racegrpwhite 1 73.188 97.188
- Partition 1 73.692 97.692
- anysubstatus 1 73.841 97.841
- age 1 73.979 97.979
- mcs 1 74.149 98.149
- drugrisk 1 74.621 98.621
<none> 72.758 98.758
- sexrisk 1 75.690 99.690
- dayslink 1 76.645 100.645
- pss_fr 1 77.298 101.298
- g1byes 1 78.648 102.648
- female 1 78.944 102.944
- i1 1 79.133 103.133
Step: AIC=97.19
.outcome ~ age + anysubstatus + dayslink + drugrisk + female +
g1byes + i1 + mcs + pss_fr + sexrisk + Partition
Df Deviance AIC
- Partition 1 73.961 95.961
- age 1 74.141 96.141
- mcs 1 74.403 96.403
- anysubstatus 1 74.411 96.411
- drugrisk 1 74.655 96.655
<none> 73.188 97.188
- sexrisk 1 76.817 98.817
- dayslink 1 76.934 98.934
- pss_fr 1 77.891 99.891
- g1byes 1 78.652 100.652
- i1 1 79.179 101.179
- female 1 79.344 101.344
Step: AIC=95.96
.outcome ~ age + anysubstatus + dayslink + drugrisk + female +
g1byes + i1 + mcs + pss_fr + sexrisk
Df Deviance AIC
- age 1 74.574 94.574
- mcs 1 74.891 94.891
- anysubstatus 1 75.073 95.073
- drugrisk 1 75.245 95.245
<none> 73.961 95.961
- sexrisk 1 77.281 97.281
- dayslink 1 77.772 97.772
- pss_fr 1 78.421 98.421
- g1byes 1 79.061 99.061
- female 1 79.501 99.501
- i1 1 80.645 100.645
Step: AIC=94.57
.outcome ~ anysubstatus + dayslink + drugrisk + female + g1byes +
i1 + mcs + pss_fr + sexrisk
Df Deviance AIC
- anysubstatus 1 75.375 93.375
- mcs 1 75.575 93.575
- drugrisk 1 75.792 93.792
<none> 74.574 94.574
- sexrisk 1 77.377 95.377
- dayslink 1 78.607 96.607
- pss_fr 1 78.907 96.907
- g1byes 1 79.114 97.114
- female 1 79.535 97.535
- i1 1 84.250 102.250
Step: AIC=93.38
.outcome ~ dayslink + drugrisk + female + g1byes + i1 + mcs +
pss_fr + sexrisk
Df Deviance AIC
- mcs 1 76.277 92.277
- drugrisk 1 76.636 92.636
<none> 75.375 93.375
- sexrisk 1 77.693 93.693
- dayslink 1 78.742 94.742
- g1byes 1 79.225 95.225
- pss_fr 1 79.543 95.543
- female 1 80.661 96.661
- i1 1 86.093 102.093
Step: AIC=92.28
.outcome ~ dayslink + drugrisk + female + g1byes + i1 + pss_fr +
sexrisk
Df Deviance AIC
- drugrisk 1 77.385 91.385
<none> 76.277 92.277
- sexrisk 1 79.090 93.090
- dayslink 1 79.335 93.335
- g1byes 1 79.342 93.342
- pss_fr 1 79.783 93.783
- female 1 82.121 96.121
- i1 1 86.744 100.744
Step: AIC=91.38
.outcome ~ dayslink + female + g1byes + i1 + pss_fr + sexrisk
Df Deviance AIC
<none> 77.385 91.385
- sexrisk 1 79.519 91.519
- dayslink 1 80.603 92.603
- pss_fr 1 80.916 92.916
- g1byes 1 81.468 93.468
- female 1 83.654 95.654
- i1 1 88.139 100.139
> # Print model to console
> model
Generalized Linear Model with Stepwise Feature Selection
83 samples
29 predictors
2 classes: 'homeless', 'housed'
No pre-processing
Resampling: Bootstrapped (25 reps)
Summary of sample sizes: 83, 83, 83, 83, 83, 83, ...
Resampling results:
ROC Sens Spec
0.5307497 0.5851495 0.4547207
> summary(model)
Call:
NULL
Deviance Residuals:
Min 1Q Median 3Q Max
-2.1433 -0.6923 -0.3944 0.7927 1.9780
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.898093 0.951086 -0.944 0.3450
dayslink 0.003429 0.001979 1.733 0.0831 .
female 1.801138 0.764603 2.356 0.0185 *
g1byes -1.338143 0.710160 -1.884 0.0595 .
i1 -0.057817 0.023137 -2.499 0.0125 *
pss_fr 0.159495 0.087981 1.813 0.0699 .
sexrisk -0.161055 0.112725 -1.429 0.1531
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 108.605 on 82 degrees of freedom
Residual deviance: 77.385 on 76 degrees of freedom
AIC: 91.385
Number of Fisher Scoring iterations: 6
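The glmnet method used next fits an elastic net model, whose penalty blends the ridge (\(\ell_2\)) and lasso (\(\ell_1\)) penalties: up to glmnet's scaling conventions, the loss is augmented by

\[\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right],\]

so the mixing parameter alpha selects the type of penalty and lambda controls its strength: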
alpha = 0 \(\rightarrow\) ridge
alpha = 1 \(\rightarrow\) lasso

> set.seed(234)
> # Create custom trainControl: myControl
> myControl <- trainControl(
+ method = "cv", number = 10,
+ summaryFunction = twoClassSummary,
+ classProbs = TRUE, # IMPORTANT!
+ verboseIter = TRUE
+ )
> # Train glmnet with custom trainControl and tuning: model
> model <- train(
+ homeless ~ ., data = trainData4,
+ tuneGrid = expand.grid(alpha = 0:1, lambda = seq(0.0001, 4, length = 20)),
+ method = "glmnet",
+ trControl = myControl
+ )
Aggregating results
Selecting tuning parameters
Fitting alpha = 0, lambda = 4 on full training set
> # Print model to console
> model
glmnet
83 samples
29 predictors
2 classes: 'homeless', 'housed'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 74, 75, 75, 75, 75, 75, ...
Resampling results across tuning parameters:
alpha lambda ROC Sens Spec
0 0.0001000 0.5766667 0.7166667 0.40000000
0 0.2106211 0.6188889 0.7966667 0.33333333
0 0.4211421 0.6455556 0.8466667 0.30000000
0 0.6316632 0.6522222 0.9233333 0.26666667
0 0.8421842 0.6588889 0.9600000 0.20000000
0 1.0527053 0.6711111 0.9600000 0.13333333
0 1.2632263 0.6833333 0.9600000 0.10000000
0 1.4737474 0.6833333 0.9600000 0.10000000
0 1.6842684 0.6966667 0.9600000 0.06666667
0 1.8947895 0.6966667 0.9600000 0.03333333
0 2.1053105 0.6966667 0.9800000 0.00000000
0 2.3158316 0.6966667 1.0000000 0.00000000
0 2.5263526 0.6966667 1.0000000 0.00000000
0 2.7368737 0.7033333 1.0000000 0.00000000
0 2.9473947 0.7033333 1.0000000 0.00000000
0 3.1579158 0.7033333 1.0000000 0.00000000
0 3.3684368 0.7033333 1.0000000 0.00000000
0 3.5789579 0.7033333 1.0000000 0.00000000
0 3.7894789 0.7033333 1.0000000 0.00000000
0 4.0000000 0.7033333 1.0000000 0.00000000
1 0.0001000 0.5333333 0.6633333 0.43333333
1 0.2106211 0.5000000 1.0000000 0.00000000
1 0.4211421 0.5000000 1.0000000 0.00000000
1 0.6316632 0.5000000 1.0000000 0.00000000
1 0.8421842 0.5000000 1.0000000 0.00000000
1 1.0527053 0.5000000 1.0000000 0.00000000
1 1.2632263 0.5000000 1.0000000 0.00000000
1 1.4737474 0.5000000 1.0000000 0.00000000
1 1.6842684 0.5000000 1.0000000 0.00000000
1 1.8947895 0.5000000 1.0000000 0.00000000
1 2.1053105 0.5000000 1.0000000 0.00000000
1 2.3158316 0.5000000 1.0000000 0.00000000
1 2.5263526 0.5000000 1.0000000 0.00000000
1 2.7368737 0.5000000 1.0000000 0.00000000
1 2.9473947 0.5000000 1.0000000 0.00000000
1 3.1579158 0.5000000 1.0000000 0.00000000
1 3.3684368 0.5000000 1.0000000 0.00000000
1 3.5789579 0.5000000 1.0000000 0.00000000
1 3.7894789 0.5000000 1.0000000 0.00000000
1 4.0000000 0.5000000 1.0000000 0.00000000
ROC was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 4.
> # Print maximum ROC statistic
> max(model[["results"]][["ROC"]])
[1] 0.7033333
> plot(model)
Missing values can be imputed as part of preprocessing via the preProcess argument to train(), for example with median imputation (medianImpute), k-nearest-neighbors imputation (knnImpute), or bagged-tree imputation (bagImpute). The example below uses knnImpute, with na.action = na.pass so that incomplete rows reach the imputation step.

> # Create custom trainControl: myControl
> myControl <- trainControl(
+ method = "cv", number = 10,
+ summaryFunction = twoClassSummary,
+ classProbs = TRUE, # IMPORTANT!
+ verboseIter = TRUE
+ )
>
> # Train glmnet with custom trainControl and tuning: model
>
> model <- train(
+ homeless ~ ., data = HELPrct,
+ tuneGrid = expand.grid(alpha = 0:1, lambda = seq(0.0001, 100, length = 20)),
+ method = "glmnet",
+ metric = "ROC",
+ trControl = myControl,
+ preProcess = c("knnImpute"),
+ na.action = na.pass
+ )
Aggregating results
Selecting tuning parameters
Fitting alpha = 0, lambda = 84.2 on full training set
> # Print model to console
> model
glmnet
453 samples
28 predictor
2 classes: 'homeless', 'housed'
Pre-processing: nearest neighbor imputation (31), centered (31),
scaled (31)
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 408, 407, 407, 408, 408, 409, ...
Resampling results across tuning parameters:
alpha lambda ROC Sens Spec
0 0.000100 0.6509484 0.482142857 0.6888333
0 5.263253 0.6622321 0.104761905 0.9631667
0 10.526405 0.6614226 0.033333333 0.9916667
0 15.789558 0.6611885 0.004761905 1.0000000
0 21.052711 0.6619583 0.000000000 1.0000000
0 26.315863 0.6621488 0.000000000 1.0000000
0 31.579016 0.6623472 0.000000000 1.0000000
0 36.842168 0.6621488 0.000000000 1.0000000
0 42.105321 0.6621488 0.000000000 1.0000000
0 47.368474 0.6623472 0.000000000 1.0000000
0 52.631626 0.6623472 0.000000000 1.0000000
0 57.894779 0.6623472 0.000000000 1.0000000
0 63.157932 0.6625456 0.000000000 1.0000000
0 68.421084 0.6625456 0.000000000 1.0000000
0 73.684237 0.6625456 0.000000000 1.0000000
0 78.947389 0.6625456 0.000000000 1.0000000
0 84.210542 0.6625456 0.000000000 1.0000000
0 89.473695 0.6623552 0.000000000 1.0000000
0 94.736847 0.6623552 0.000000000 1.0000000
0 100.000000 0.6623552 0.000000000 1.0000000
1 0.000100 0.6481567 0.486904762 0.6888333
1 5.263253 0.5000000 0.000000000 1.0000000
1 10.526405 0.5000000 0.000000000 1.0000000
1 15.789558 0.5000000 0.000000000 1.0000000
1 21.052711 0.5000000 0.000000000 1.0000000
1 26.315863 0.5000000 0.000000000 1.0000000
1 31.579016 0.5000000 0.000000000 1.0000000
1 36.842168 0.5000000 0.000000000 1.0000000
1 42.105321 0.5000000 0.000000000 1.0000000
1 47.368474 0.5000000 0.000000000 1.0000000
1 52.631626 0.5000000 0.000000000 1.0000000
1 57.894779 0.5000000 0.000000000 1.0000000
1 63.157932 0.5000000 0.000000000 1.0000000
1 68.421084 0.5000000 0.000000000 1.0000000
1 73.684237 0.5000000 0.000000000 1.0000000
1 78.947389 0.5000000 0.000000000 1.0000000
1 84.210542 0.5000000 0.000000000 1.0000000
1 89.473695 0.5000000 0.000000000 1.0000000
1 94.736847 0.5000000 0.000000000 1.0000000
1 100.000000 0.5000000 0.000000000 1.0000000
ROC was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 84.21054.
> # Print maximum ROC statistic
> max(model[["results"]][["ROC"]])
[1] 0.6625456
> plot(model)
> plot(model$finalModel)
> # getting coefficients
> coef(model$finalModel, model$bestTune$lambda)
32 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 1.547866e-01
age -5.052402e-04
anysubstatus 7.175112e-05
anysubyes 7.173183e-05
cesd -5.094988e-04
d1 -3.875314e-04
daysanysub 4.354365e-05
dayslink 1.266469e-04
drugrisk -2.331586e-04
e2b -9.065190e-04
female 5.748621e-04
sexmale -5.747121e-04
g1byes -7.787904e-04
i1 -1.392394e-03
i2 -1.147066e-03
id -2.852674e-04
indtot -1.268734e-03
linkstatus -2.771621e-04
linkyes -2.770897e-04
mcs 3.990764e-04
pcs 5.605130e-04
pss_fr 1.019032e-03
racegrphispanic 8.851136e-05
racegrpother -9.493057e-07
racegrpwhite -7.793520e-04
satreatyes -4.924555e-04
sexrisk -5.681747e-04
substancecocaine 6.125276e-04
substanceheroin 5.974870e-04
treatyes 1.996529e-04
avg_drinks -1.391529e-03
max_drinks -1.146225e-03
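For comparison, the same pipeline could use median imputation instead of k-nearest neighbors; a sketch (not run here, so no output is shown) only swaps the preProcess argument:

# Same glmnet fit, but imputing missing values with column medians
model.med <- train(
  homeless ~ ., data = HELPrct,
  tuneGrid = expand.grid(alpha = 0:1, lambda = seq(0.0001, 100, length = 20)),
  method = "glmnet",
  metric = "ROC",
  trControl = myControl,
  preProcess = c("medianImpute"),
  na.action = na.pass
)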
> library(ISLR)
> library(caret)
> Hitters <- na.omit(Hitters)
> summary(Hitters)
AtBat Hits HmRun Runs
Min. : 19.0 Min. : 1.0 Min. : 0.00 Min. : 0.00
1st Qu.:282.5 1st Qu.: 71.5 1st Qu.: 5.00 1st Qu.: 33.50
Median :413.0 Median :103.0 Median : 9.00 Median : 52.00
Mean :403.6 Mean :107.8 Mean :11.62 Mean : 54.75
3rd Qu.:526.0 3rd Qu.:141.5 3rd Qu.:18.00 3rd Qu.: 73.00
Max. :687.0 Max. :238.0 Max. :40.00 Max. :130.00
RBI Walks Years CAtBat
Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0
1st Qu.: 30.00 1st Qu.: 23.00 1st Qu.: 4.000 1st Qu.: 842.5
Median : 47.00 Median : 37.00 Median : 6.000 Median : 1931.0
Mean : 51.49 Mean : 41.11 Mean : 7.312 Mean : 2657.5
3rd Qu.: 71.00 3rd Qu.: 57.00 3rd Qu.:10.000 3rd Qu.: 3890.5
Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0
CHits CHmRun CRuns CRBI
Min. : 4.0 Min. : 0.00 Min. : 2.0 Min. : 3.0
1st Qu.: 212.0 1st Qu.: 15.00 1st Qu.: 105.5 1st Qu.: 95.0
Median : 516.0 Median : 40.00 Median : 250.0 Median : 230.0
Mean : 722.2 Mean : 69.24 Mean : 361.2 Mean : 330.4
3rd Qu.:1054.0 3rd Qu.: 92.50 3rd Qu.: 497.5 3rd Qu.: 424.5
Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.0
CWalks League Division PutOuts Assists
Min. : 1.0 A:139 E:129 Min. : 0.0 Min. : 0.0
1st Qu.: 71.0 N:124 W:134 1st Qu.: 113.5 1st Qu.: 8.0
Median : 174.0 Median : 224.0 Median : 45.0
Mean : 260.3 Mean : 290.7 Mean :118.8
3rd Qu.: 328.5 3rd Qu.: 322.5 3rd Qu.:192.0
Max. :1566.0 Max. :1377.0 Max. :492.0
Errors Salary NewLeague
Min. : 0.000 Min. : 67.5 A:141
1st Qu.: 3.000 1st Qu.: 190.0 N:122
Median : 7.000 Median : 425.0
Mean : 8.593 Mean : 535.9
3rd Qu.:13.000 3rd Qu.: 750.0
Max. :32.000 Max. :2460.0
> dim(Hitters)
[1] 263 20
> set.seed(1)
> myControl <- trainControl(
+ method = "cv",
+ number = 10
+ )
> hit.model <- train(
+ Salary ~ ., data = Hitters,
+ tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(3, -2, length = 100)),
+ method = "glmnet",
+ trControl = myControl
+ )
> hit.model
glmnet
263 samples
19 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 235, 237, 238, 236, 237, 237, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
1.000000e-02 326.1370 0.4702243 232.7176
1.123324e-02 326.1370 0.4702243 232.7176
1.261857e-02 326.1370 0.4702243 232.7176
1.417474e-02 326.1370 0.4702243 232.7176
1.592283e-02 326.1370 0.4702243 232.7176
1.788650e-02 326.1370 0.4702243 232.7176
2.009233e-02 326.1370 0.4702243 232.7176
2.257020e-02 326.1370 0.4702243 232.7176
2.535364e-02 326.1370 0.4702243 232.7176
2.848036e-02 326.1370 0.4702243 232.7176
3.199267e-02 326.1370 0.4702243 232.7176
3.593814e-02 326.1370 0.4702243 232.7176
4.037017e-02 326.1370 0.4702243 232.7176
4.534879e-02 326.1370 0.4702243 232.7176
5.094138e-02 326.1370 0.4702243 232.7176
5.722368e-02 326.1370 0.4702243 232.7176
6.428073e-02 326.1370 0.4702243 232.7176
7.220809e-02 326.1370 0.4702243 232.7176
8.111308e-02 326.1370 0.4702243 232.7176
9.111628e-02 326.1370 0.4702243 232.7176
1.023531e-01 326.1370 0.4702243 232.7176
1.149757e-01 326.1370 0.4702243 232.7176
1.291550e-01 326.1370 0.4702243 232.7176
1.450829e-01 326.1370 0.4702243 232.7176
1.629751e-01 326.1370 0.4702243 232.7176
1.830738e-01 326.1370 0.4702243 232.7176
2.056512e-01 326.1370 0.4702243 232.7176
2.310130e-01 326.1370 0.4702243 232.7176
2.595024e-01 326.1370 0.4702243 232.7176
2.915053e-01 326.1370 0.4702243 232.7176
3.274549e-01 326.1370 0.4702243 232.7176
3.678380e-01 326.1370 0.4702243 232.7176
4.132012e-01 326.1370 0.4702243 232.7176
4.641589e-01 326.1370 0.4702243 232.7176
5.214008e-01 326.1370 0.4702243 232.7176
5.857021e-01 326.1370 0.4702243 232.7176
6.579332e-01 326.1370 0.4702243 232.7176
7.390722e-01 326.1370 0.4702243 232.7176
8.302176e-01 326.1370 0.4702243 232.7176
9.326033e-01 326.1370 0.4702243 232.7176
1.047616e+00 326.1370 0.4702243 232.7176
1.176812e+00 326.1370 0.4702243 232.7176
1.321941e+00 326.1370 0.4702243 232.7176
1.484968e+00 326.1370 0.4702243 232.7176
1.668101e+00 326.1370 0.4702243 232.7176
1.873817e+00 326.1370 0.4702243 232.7176
2.104904e+00 326.1370 0.4702243 232.7176
2.364489e+00 326.1370 0.4702243 232.7176
2.656088e+00 326.1370 0.4702243 232.7176
2.983647e+00 326.1370 0.4702243 232.7176
3.351603e+00 326.1370 0.4702243 232.7176
3.764936e+00 326.1370 0.4702243 232.7176
4.229243e+00 326.1370 0.4702243 232.7176
4.750810e+00 326.1370 0.4702243 232.7176
5.336699e+00 326.1370 0.4702243 232.7176
5.994843e+00 326.1370 0.4702243 232.7176
6.734151e+00 326.1370 0.4702243 232.7176
7.564633e+00 326.1370 0.4702243 232.7176
8.497534e+00 326.1370 0.4702243 232.7176
9.545485e+00 326.1370 0.4702243 232.7176
1.072267e+01 326.1370 0.4702243 232.7176
1.204504e+01 326.1370 0.4702243 232.7176
1.353048e+01 326.1370 0.4702243 232.7176
1.519911e+01 326.1370 0.4702243 232.7176
1.707353e+01 326.1370 0.4702243 232.7176
1.917910e+01 326.1370 0.4702243 232.7176
2.154435e+01 326.1370 0.4702243 232.7176
2.420128e+01 326.1514 0.4701971 232.7182
2.718588e+01 326.1457 0.4703697 232.5095
3.053856e+01 326.1902 0.4702580 232.2813
3.430469e+01 326.2403 0.4701406 232.0321
3.853529e+01 326.2892 0.4700387 231.7596
4.328761e+01 326.3342 0.4699625 231.4765
4.862602e+01 326.3743 0.4699165 231.1753
5.462277e+01 326.4152 0.4698879 230.8785
6.135907e+01 326.4545 0.4698845 230.6108
6.892612e+01 326.4863 0.4699266 230.3380
7.742637e+01 326.5160 0.4700038 230.0882
8.697490e+01 326.5473 0.4701094 229.9021
9.770100e+01 326.5785 0.4702506 229.7382
1.097499e+02 326.6129 0.4704264 229.6020
1.232847e+02 326.6534 0.4706327 229.4659
1.384886e+02 326.7068 0.4708540 229.3308
1.555676e+02 326.7761 0.4710941 229.2008
1.747528e+02 326.8627 0.4713565 229.0840
1.963041e+02 326.9766 0.4716209 229.0076
2.205131e+02 327.1242 0.4718747 228.9716
2.477076e+02 327.3120 0.4721262 228.9740
2.782559e+02 327.5442 0.4723607 229.1073
3.125716e+02 327.8324 0.4725723 229.4141
3.511192e+02 328.1865 0.4727470 229.7859
3.944206e+02 328.6129 0.4728922 230.3146
4.430621e+02 329.1222 0.4729971 230.9594
4.977024e+02 329.7271 0.4730507 231.7418
5.590810e+02 330.4383 0.4730559 232.5937
6.280291e+02 331.2665 0.4730212 233.5686
7.054802e+02 332.2254 0.4729347 234.6652
7.924829e+02 333.3302 0.4727990 235.9015
8.902151e+02 334.5908 0.4726197 237.4501
1.000000e+03 336.0245 0.4723993 239.3706
Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 21.54435.
> coef(hit.model$finalModel, s = 11498)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 407.205936774
AtBat 0.037003083
Hits 0.138357552
HmRun 0.525195508
Runs 0.230978290
RBI 0.240114775
Walks 0.289971555
Years 1.108832399
CAtBat 0.003135215
CHits 0.011666684
CHmRun 0.087642789
CRuns 0.023406258
CRBI 0.024165723
CWalks 0.025042117
LeagueN 0.086629234
DivisionW -6.225431332
PutOuts 0.016506596
Assists 0.002616335
Errors -0.020564158
NewLeagueN 0.302922899
> coef(hit.model$finalModel, s = 705)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 54.39019010
AtBat 0.11199103
Hits 0.65618308
HmRun 1.17557058
Runs 0.93745780
RBI 0.84682838
Walks 1.32044897
Years 2.58210272
CAtBat 0.01082850
CHits 0.04681147
CHmRun 0.33843942
CRuns 0.09372165
CRBI 0.09788930
CWalks 0.07179422
LeagueN 13.69631793
DivisionW -54.71693537
PutOuts 0.11864655
Assists 0.01610360
Errors -0.70439782
NewLeagueN 8.60936304
> coef(hit.model$finalModel, s = 50)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 4.821654e+01
AtBat -3.538650e-01
Hits 1.953167e+00
HmRun -1.285127e+00
Runs 1.156329e+00
RBI 8.087771e-01
Walks 2.709765e+00
Years -6.202919e+00
CAtBat 6.085854e-03
CHits 1.070832e-01
CHmRun 6.290984e-01
CRuns 2.172926e-01
CRBI 2.152888e-01
CWalks -1.488961e-01
LeagueN 4.586262e+01
DivisionW -1.182304e+02
PutOuts 2.501647e-01
Assists 1.208491e-01
Errors -3.277073e+00
NewLeagueN -9.423459e+00
> plot(hit.model)
> coef(hit.model$finalModel, hit.model$bestTune$lambda)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) 8.112693e+01
AtBat -6.815959e-01
Hits 2.772312e+00
HmRun -1.365680e+00
Runs 1.014826e+00
RBI 7.130224e-01
Walks 3.378558e+00
Years -9.066800e+00
CAtBat -1.199478e-03
CHits 1.361029e-01
CHmRun 6.979958e-01
CRuns 2.958896e-01
CRBI 2.570711e-01
CWalks -2.789666e-01
LeagueN 5.321272e+01
DivisionW -1.228345e+02
PutOuts 2.638876e-01
Assists 1.698796e-01
Errors -3.685645e+00
NewLeagueN -1.810510e+01
> hit.model$bestTune$lambda
[1] 21.54435
Following ISLR, the data set is split into a training and a test set in order to estimate the error of each method.
> library(caret)
> set.seed(52)
> trainIndex <- createDataPartition(y = Hitters$Salary, p = 0.5,
+ list = FALSE,
+ times = 1)
> HittersTrain <- Hitters[trainIndex, ]
> HittersTest <- Hitters[-trainIndex, ]
> dim(HittersTrain)
[1] 133 20
> dim(HittersTest)
[1] 130 20
> myControl <- trainControl(
+ method = "repeatedcv",
+ repeats = 3
+ )
> hit.modelT <- train(
+ Salary ~ ., data = HittersTrain,
+ tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(3, -2, length = 100)),
+ method = "glmnet",
+ trControl = myControl
+ )
> hit.modelT
glmnet
133 samples
19 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 121, 119, 118, 119, 120, 119, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
1.000000e-02 321.2987 0.5202461 238.3084
1.123324e-02 321.2987 0.5202461 238.3084
1.261857e-02 321.2987 0.5202461 238.3084
1.417474e-02 321.2987 0.5202461 238.3084
1.592283e-02 321.2987 0.5202461 238.3084
1.788650e-02 321.2987 0.5202461 238.3084
2.009233e-02 321.2987 0.5202461 238.3084
2.257020e-02 321.2987 0.5202461 238.3084
2.535364e-02 321.2987 0.5202461 238.3084
2.848036e-02 321.2987 0.5202461 238.3084
3.199267e-02 321.2987 0.5202461 238.3084
3.593814e-02 321.2987 0.5202461 238.3084
4.037017e-02 321.2987 0.5202461 238.3084
4.534879e-02 321.2987 0.5202461 238.3084
5.094138e-02 321.2987 0.5202461 238.3084
5.722368e-02 321.2987 0.5202461 238.3084
6.428073e-02 321.2987 0.5202461 238.3084
7.220809e-02 321.2987 0.5202461 238.3084
8.111308e-02 321.2987 0.5202461 238.3084
9.111628e-02 321.2987 0.5202461 238.3084
1.023531e-01 321.2987 0.5202461 238.3084
1.149757e-01 321.2987 0.5202461 238.3084
1.291550e-01 321.2987 0.5202461 238.3084
1.450829e-01 321.2987 0.5202461 238.3084
1.629751e-01 321.2987 0.5202461 238.3084
1.830738e-01 321.2987 0.5202461 238.3084
2.056512e-01 321.2987 0.5202461 238.3084
2.310130e-01 321.2987 0.5202461 238.3084
2.595024e-01 321.2987 0.5202461 238.3084
2.915053e-01 321.2987 0.5202461 238.3084
3.274549e-01 321.2987 0.5202461 238.3084
3.678380e-01 321.2987 0.5202461 238.3084
4.132012e-01 321.2987 0.5202461 238.3084
4.641589e-01 321.2987 0.5202461 238.3084
5.214008e-01 321.2987 0.5202461 238.3084
5.857021e-01 321.2987 0.5202461 238.3084
6.579332e-01 321.2987 0.5202461 238.3084
7.390722e-01 321.2987 0.5202461 238.3084
8.302176e-01 321.2987 0.5202461 238.3084
9.326033e-01 321.2987 0.5202461 238.3084
1.047616e+00 321.2987 0.5202461 238.3084
1.176812e+00 321.2987 0.5202461 238.3084
1.321941e+00 321.2987 0.5202461 238.3084
1.484968e+00 321.2987 0.5202461 238.3084
1.668101e+00 321.2987 0.5202461 238.3084
1.873817e+00 321.2987 0.5202461 238.3084
2.104904e+00 321.2987 0.5202461 238.3084
2.364489e+00 321.2987 0.5202461 238.3084
2.656088e+00 321.2987 0.5202461 238.3084
2.983647e+00 321.2987 0.5202461 238.3084
3.351603e+00 321.2987 0.5202461 238.3084
3.764936e+00 321.2987 0.5202461 238.3084
4.229243e+00 321.2987 0.5202461 238.3084
4.750810e+00 321.2987 0.5202461 238.3084
5.336699e+00 321.2987 0.5202461 238.3084
5.994843e+00 321.2987 0.5202461 238.3084
6.734151e+00 321.2987 0.5202461 238.3084
7.564633e+00 321.2987 0.5202461 238.3084
8.497534e+00 321.2987 0.5202461 238.3084
9.545485e+00 321.2987 0.5202461 238.3084
1.072267e+01 321.2987 0.5202461 238.3084
1.204504e+01 321.2987 0.5202461 238.3084
1.353048e+01 321.2987 0.5202461 238.3084
1.519911e+01 321.2987 0.5202461 238.3084
1.707353e+01 321.2987 0.5202461 238.3084
1.917910e+01 321.2987 0.5202461 238.3084
2.154435e+01 321.2987 0.5202461 238.3084
2.420128e+01 321.2952 0.5202716 238.3122
2.718588e+01 320.9940 0.5206662 238.0906
3.053856e+01 320.4549 0.5212331 237.7615
3.430469e+01 319.9159 0.5218472 237.4002
3.853529e+01 319.3814 0.5224950 237.0234
4.328761e+01 318.8385 0.5232110 236.5997
4.862602e+01 318.3038 0.5239649 236.1519
5.462277e+01 317.7589 0.5247870 235.6663
6.135907e+01 317.2143 0.5256686 235.1726
6.892612e+01 316.6696 0.5266046 234.6630
7.742637e+01 316.1244 0.5276052 234.1716
8.697490e+01 315.5830 0.5286521 233.6538
9.770100e+01 315.0459 0.5297588 233.1027
1.097499e+02 314.5231 0.5308968 232.5554
1.232847e+02 314.0164 0.5320742 231.9891
1.384886e+02 313.5346 0.5332739 231.4450
1.555676e+02 313.0803 0.5344948 230.8870
1.747528e+02 312.6684 0.5357123 230.3830
1.963041e+02 312.3063 0.5369151 229.9022
2.205131e+02 312.0036 0.5380941 229.4195
2.477076e+02 311.7692 0.5392476 228.9595
2.782559e+02 311.6191 0.5403385 228.5224
3.125716e+02 311.5661 0.5413587 228.1176
3.511192e+02 311.6207 0.5423086 227.7972
3.944206e+02 311.7973 0.5431752 227.6065
4.430621e+02 312.1111 0.5439354 227.5245
4.977024e+02 312.5786 0.5445833 227.6521
5.590810e+02 313.2115 0.5451243 227.9577
6.280291e+02 314.0241 0.5455579 228.3820
7.054802e+02 315.0340 0.5458633 229.1220
7.924829e+02 316.2548 0.5460531 230.1591
8.902151e+02 317.7011 0.5461314 231.4492
1.000000e+03 319.3866 0.5461103 232.9972
Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 312.5716.
> plot(hit.modelT)
> coef(hit.modelT$finalModel, hit.modelT$bestTune$lambda)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -1.247903e+02
AtBat 1.459814e-01
Hits 1.081891e+00
HmRun 1.698469e+00
Runs 1.460525e+00
RBI 1.253918e+00
Walks 1.812634e+00
Years 3.240116e+00
CAtBat 1.105952e-02
CHits 5.240973e-02
CHmRun 1.439892e-01
CRuns 1.036923e-01
CRBI 8.809254e-02
CWalks 3.357095e-02
LeagueN 3.943543e+01
DivisionW -9.517743e+01
PutOuts 2.307162e-01
Assists -6.219518e-03
Errors 4.962977e-01
NewLeagueN 1.523493e+01
> # Test RMSE error
> preds <- predict(hit.modelT, HittersTest)
> TestRMSE1 <- sqrt(mean((HittersTest$Salary - preds)^2))
> TestRMSE1
[1] 365.7261
> set.seed(14)
> myControl <- trainControl(
+ method = "repeatedcv",
+ repeats = 3
+ )
> hit.modelT1 <- train(
+ Salary ~ ., data = HittersTrain,
+ tuneGrid = expand.grid(alpha = 1, lambda = 10^seq(3, -2, length = 100)),
+ method = "glmnet",
+ trControl = myControl
+ )
> hit.modelT1
glmnet
133 samples
19 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 120, 120, 119, 120, 120, 118, ...
Resampling results across tuning parameters:
lambda RMSE Rsquared MAE
1.000000e-02 344.9567 0.5317168 251.2481
1.123324e-02 344.9567 0.5317168 251.2481
1.261857e-02 344.9567 0.5317168 251.2481
1.417474e-02 344.9567 0.5317168 251.2481
1.592283e-02 344.9567 0.5317168 251.2481
1.788650e-02 344.9567 0.5317168 251.2481
2.009233e-02 344.9567 0.5317168 251.2481
2.257020e-02 344.9567 0.5317168 251.2481
2.535364e-02 344.9484 0.5317445 251.2406
2.848036e-02 344.9330 0.5317989 251.2263
3.199267e-02 344.9155 0.5318604 251.2086
3.593814e-02 344.8958 0.5319292 251.1888
4.037017e-02 344.8740 0.5320059 251.1666
4.534879e-02 344.8490 0.5320934 251.1410
5.094138e-02 344.8214 0.5321900 251.1124
5.722368e-02 344.7907 0.5322978 251.0802
6.428073e-02 344.7565 0.5324185 251.0441
7.220809e-02 344.7144 0.5325631 251.0005
8.111308e-02 344.6602 0.5327433 250.9457
9.111628e-02 344.6019 0.5329387 250.8865
1.023531e-01 344.5408 0.5331493 250.8245
1.149757e-01 344.4645 0.5334218 250.7559
1.291550e-01 344.4250 0.5337040 250.6943
1.450829e-01 344.3778 0.5339582 250.6183
1.629751e-01 344.3072 0.5342806 250.5108
1.830738e-01 344.2157 0.5347082 250.4100
2.056512e-01 344.1290 0.5351863 250.3028
2.310130e-01 344.0172 0.5356807 250.1652
2.595024e-01 343.8899 0.5363539 250.0190
2.915053e-01 343.7420 0.5371636 249.8460
3.274549e-01 343.6231 0.5380336 249.6046
3.678380e-01 343.5535 0.5383117 249.5622
4.132012e-01 343.2964 0.5390409 249.3999
4.641589e-01 343.1303 0.5395689 249.3928
5.214008e-01 342.9319 0.5401276 249.3993
5.857021e-01 342.6649 0.5408087 249.3580
6.579332e-01 342.4264 0.5414772 249.4000
7.390722e-01 342.0848 0.5423246 249.4004
8.302176e-01 341.6879 0.5432463 249.4128
9.326033e-01 341.0577 0.5448035 249.2220
1.047616e+00 340.0577 0.5471681 248.7181
1.176812e+00 338.8994 0.5498354 248.1312
1.321941e+00 337.6266 0.5527886 247.4105
1.484968e+00 336.2798 0.5554993 246.6158
1.668101e+00 335.0729 0.5574949 245.9131
1.873817e+00 333.9522 0.5595600 245.2620
2.104904e+00 332.7679 0.5617792 244.5384
2.364489e+00 331.5553 0.5637075 243.9282
2.656088e+00 330.3735 0.5652986 243.2993
2.983647e+00 329.2407 0.5667164 242.6740
3.351603e+00 328.3826 0.5677183 242.0551
3.764936e+00 327.5944 0.5685767 241.4207
4.229243e+00 326.8148 0.5693254 240.7473
4.750810e+00 326.0566 0.5698774 240.1645
5.336699e+00 325.5196 0.5698206 239.7754
5.994843e+00 325.0129 0.5698516 239.3582
6.734151e+00 324.4087 0.5701554 239.0325
7.564633e+00 323.3235 0.5712337 238.4229
8.497534e+00 321.7210 0.5732175 237.3059
9.545485e+00 320.6527 0.5744247 236.5642
1.072267e+01 320.0481 0.5748184 236.0644
1.204504e+01 319.7647 0.5749693 235.7334
1.353048e+01 319.6746 0.5751985 235.4524
1.519911e+01 319.6609 0.5752208 235.2205
1.707353e+01 319.7615 0.5749260 235.1216
1.917910e+01 319.9474 0.5745316 235.0700
2.154435e+01 320.2151 0.5739632 235.0588
2.420128e+01 320.5658 0.5732442 235.1808
2.718588e+01 321.0514 0.5723001 235.4514
3.053856e+01 321.5387 0.5712260 235.7161
3.430469e+01 322.0805 0.5702089 235.9398
3.853529e+01 322.7885 0.5690059 236.2451
4.328761e+01 323.8444 0.5670850 236.6879
4.862602e+01 325.3568 0.5642881 237.5184
5.462277e+01 327.3306 0.5606403 238.7807
6.135907e+01 329.8954 0.5558401 240.7357
6.892612e+01 333.1740 0.5497009 243.6548
7.742637e+01 336.9878 0.5434489 247.8263
8.697490e+01 341.3459 0.5373413 252.6921
9.770100e+01 346.2057 0.5326136 258.4168
1.097499e+02 352.3579 0.5261646 265.7827
1.232847e+02 360.1852 0.5170375 274.7528
1.384886e+02 369.9365 0.5033856 285.3332
1.555676e+02 381.7669 0.4812320 297.1868
1.747528e+02 395.0145 0.4516864 309.5694
1.963041e+02 408.9939 0.4208011 322.8322
2.205131e+02 423.4003 0.3716421 336.2986
2.477076e+02 435.0806 0.2665585 346.2688
2.782559e+02 437.3556 0.0412020 348.6942
3.125716e+02 437.4395 NaN 348.8411
3.511192e+02 437.4395 NaN 348.8411
3.944206e+02 437.4395 NaN 348.8411
4.430621e+02 437.4395 NaN 348.8411
4.977024e+02 437.4395 NaN 348.8411
5.590810e+02 437.4395 NaN 348.8411
6.280291e+02 437.4395 NaN 348.8411
7.054802e+02 437.4395 NaN 348.8411
7.924829e+02 437.4395 NaN 348.8411
8.902151e+02 437.4395 NaN 348.8411
1.000000e+03 437.4395 NaN 348.8411
Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 15.19911.
> plot(hit.modelT1)
> coef(hit.modelT1$finalModel, hit.model$bestTune$lambda)
20 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -101.08611784
AtBat .
Hits 2.12682293
HmRun .
Runs .
RBI 1.52192252
Walks 2.18699728
Years .
CAtBat .
CHits .
CHmRun .
CRuns 0.40146625
CRBI 0.06345388
CWalks .
LeagueN 28.44971454
DivisionW -109.81369939
PutOuts 0.31513067
Assists .
Errors .
NewLeagueN .
> # Test RMSE error
> preds1 <- predict(hit.modelT1, HittersTest)
> TestRMSE2 <- sqrt(mean((HittersTest$Salary - preds1)^2))
> TestRMSE2
[1] 365.8057
Note that the test RMSE of the ridge model is 365.7261, while the test RMSE of the lasso model is 365.8057. In this scenario, the lasso has an advantage over the ridge model in terms of parsimony, because 11 of the coefficients in the lasso model are exactly zero.
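A quick way to count the zeroed-out coefficients (a sketch; coef() returns a sparse matrix in which the '.' entries printed above are exact zeros):

# Count the exact zeros in the lasso coefficient vector printed above
lasso.coefs <- coef(hit.modelT1$finalModel, hit.model$bestTune$lambda)
sum(as.vector(lasso.coefs) == 0)  # should return 11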
(Example taken from An Introduction to Statistical Learning)
Goal: Predict whether an individual will default on his or her credit card payment.
> library(ISLR)
> head(Default)
default student balance income
1 No No 729.5265 44361.625
2 No Yes 817.1804 12106.135
3 No No 1073.5492 31767.139
4 No No 529.2506 35704.494
5 No No 785.6559 38463.496
6 No Yes 919.5885 7491.559
> ggplot(data = Default, aes(x = balance, y = income, colour = default)) +
+ geom_point(alpha = 0.8) +
+ theme_bw() +
+ scale_color_brewer(palette=7)
> ggplot(data = Default, aes(x = default, y = balance, fill = default)) +
+ geom_boxplot() +
+ theme_bw() +
+ guides(fill = FALSE) +
+ scale_fill_brewer(palette=7)
> ggplot(data = Default, aes(x = default, y = income, fill = default)) +
+ geom_boxplot() +
+ theme_bw() +
+ guides(fill = FALSE) +
+ scale_fill_brewer(palette=7)
Why not linear regression?
> Default$defaultN <- ifelse(Default$default == "No", 0, 1)
> Default$studentN <- ifelse(Default$student =="No", 0, 1)
> summary(Default$defaultN)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 0.0000 0.0333 0.0000 1.0000
> head(Default)
default student balance income defaultN studentN
1 No No 729.5265 44361.625 0 0
2 No Yes 817.1804 12106.135 0 1
3 No No 1073.5492 31767.139 0 0
4 No No 529.2506 35704.494 0 0
5 No No 785.6559 38463.496 0 0
6 No Yes 919.5885 7491.559 0 1
> ggplot(data = Default, aes(x = balance, y = defaultN)) +
+ geom_point(alpha = 0.5) +
+ theme_bw() +
+ stat_smooth(method = "lm") +
+ labs(y = "Probability of Default")
Some estimated probabilities are negative! For balances close to zero, we predict a negative probability of default. For very large balances, we get values greater than 1.
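This can be verified by fitting the linear probability model directly and inspecting the range of its fitted values; a minimal sketch (lin.mod is an illustrative name used only here):
> # Fit the linear probability model and inspect the fitted range
> lin.mod <- lm(defaultN ~ balance, data = Default)
> range(fitted(lin.mod)) # the lower end is negative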
What we need is a function of \(X\) that returns values between 0 and 1. Consider the logistic function
\[p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}\]
which always produces an S-shaped curve, so the predicted probability stays between 0 and 1 regardless of the value of \(X\).
Let \(y = \beta_0 + \beta_1X\), then
\[p(X)=\frac{e^y}{1 + e^y} \;\Rightarrow\; p(X) + e^{y}p(X) = e^{y} \;\Rightarrow\; p(X) = e^{y}\left(1 - p(X)\right) \;\Rightarrow\; \frac{p(X)}{1 - p(X)} = e^{y} = e^{\beta_0 + \beta_1X}.\]
Taking the logarithm of both sides gives
\[\text{log}\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X.\]
The quantity \(\text{log}\left(\frac{p(X)}{1 - p(X)}\right)\) is called the log-odds or logit, and the quantity \(\frac{p(X)}{1 - p(X)}\) is called the odds. For example, a default probability of \(p(X) = 0.2\) corresponds to odds of \(0.2/0.8 = 1/4\).
NOTE: In regression, \(\beta_1\) gives the average change in \(Y\) associated with a one-unit increase in \(X\). In a logistic regression model, increasing \(X\) by one unit changes the log odds by \(\beta_1\), or equivalently it multiplies the odds by \(e^{\beta_1}\).
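Before fitting the model to the data, the generic S-shape can be visualized by plotting the logistic function for hypothetical coefficient values (b0 = -5 and b1 = 0.01 below are illustrative choices, not estimates):
> # Plot p(X) for hypothetical coefficients b0 and b1
> logistic <- function(x, b0, b1) exp(b0 + b1*x)/(1 + exp(b0 + b1*x))
> curve(logistic(x, b0 = -5, b1 = 0.01), from = 0, to = 1000,
+ xlab = "X", ylab = "p(X)")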
> ggplot(data = Default, aes(x = balance, y = defaultN)) +
+ geom_point(alpha = 0.5) +
+ theme_bw() +
+ stat_smooth(method = "glm", family = "binomial") +
+ labs(y = "Probability of Default")
The probability of default given balance can be written as P(default = Yes | balance).
> log.mod <- glm(default ~ balance, data = Default, family = "binomial")
> summary(log.mod)
Call:
glm(formula = default ~ balance, family = "binomial", data = Default)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2697 -0.1465 -0.0589 -0.0221 3.7589
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.065e+01 3.612e-01 -29.49 <2e-16 ***
balance 5.499e-03 2.204e-04 24.95 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1596.5 on 9998 degrees of freedom
AIC: 1600.5
Number of Fisher Scoring iterations: 8
From the output note that \(\hat{\beta}_1 = 0.0054989\). This value indicates that a one-unit increase in balance is associated with an increase in the log odds of default by 0.0054989 units.
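Equivalently, each one-dollar increase in balance multiplies the odds of default by \(e^{\hat{\beta}_1}\); a minimal sketch computing the implied multipliers from the fitted model:
> # Odds multipliers implied by the balance coefficient
> b1 <- coef(log.mod)["balance"]
> exp(b1) # odds multiplier per $1 increase in balance
> exp(100 * b1) # odds multiplier per $100 increase in balance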
Using the estimated coefficients, the predicted probability of default for an individual with a balance of $1,500 is
\[\hat{p}(X) = \frac{e^{\hat{\beta}_0 + \hat{\beta}_1X}}{1 + e^{\hat{\beta}_0 + \hat{\beta}_1X}} = \frac{e^{-10.6513306 + 0.0054989 \times 1500}}{1 + e^{-10.6513306 + 0.0054989 \times 1500}} = 0.0829476.\]
> predict(log.mod, newdata = data.frame(balance = 1500), type = "response")
1
0.08294762
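The same value can be reproduced by hand from the stored coefficients; a minimal sketch:
> # Recompute the predicted probability from the coefficients
> b <- coef(log.mod)
> eta <- b[1] + b[2] * 1500 # linear predictor at balance = 1500
> unname(exp(eta)/(1 + exp(eta)))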
For an individual with a balance of $2,500, the predicted probability of default is \(\hat{p}(X) = 0.9567259\).
> predict(log.mod, newdata = data.frame(balance = 2500), type = "response")
1
0.9567259
Are students more likely to default than non-students?
> mod2 <- glm(default ~ student, data = Default, family = "binomial")
> summary(mod2)
Call:
glm(formula = default ~ student, family = "binomial", data = Default)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.2970 -0.2970 -0.2434 -0.2434 2.6585
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.50413 0.07071 -49.55 < 2e-16 ***
studentYes 0.40489 0.11502 3.52 0.000431 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 2908.7 on 9998 degrees of freedom
AIC: 2912.7
Number of Fisher Scoring iterations: 6
Note that the coefficient associated with studentYes is positive and its p-value is small (statistically significant). This indicates that students tend to have higher default probabilities than non-students:
\[\widehat{Pr}(\text{default = Yes }|\text{ student = Yes}) = \frac{e^{-3.5041278+ 0.4048871 \times 1}}{1 + e^{-3.5041278+ 0.4048871 \times 1}} = 0.0431386,\]
> predict(mod2, newdata = data.frame(student = "Yes"), type = "response")
1
0.04313859
\[\widehat{Pr}(\text{default = Yes }|\text{ student = No}) = \frac{e^{-3.5041278 + 0.4048871 \times 0}}{1 + e^{-3.5041278 + 0.4048871 \times 0}} = 0.029195.\]
> predict(mod2, newdata = data.frame(student = "No"), type = "response")
1
0.02919501
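The fitted odds ratio for students versus non-students follows directly from the studentYes coefficient; a minimal sketch (note this is the unadjusted odds ratio, in contrast to the adjusted one discussed at the end of this section):
> # Unadjusted odds ratio: students vs. non-students
> exp(coef(mod2)["studentYes"])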
Consider a model that uses balance, income, and student to predict default.
> mlog.mod <- glm(default ~ balance + income + student, data = Default, family = "binomial")
> summary(mlog.mod)
Call:
glm(formula = default ~ balance + income + student, family = "binomial",
data = Default)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.4691 -0.1418 -0.0557 -0.0203 3.7383
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.087e+01 4.923e-01 -22.080 < 2e-16 ***
balance 5.737e-03 2.319e-04 24.738 < 2e-16 ***
income 3.033e-06 8.203e-06 0.370 0.71152
studentYes -6.468e-01 2.363e-01 -2.738 0.00619 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2920.6 on 9999 degrees of freedom
Residual deviance: 1571.5 on 9996 degrees of freedom
AIC: 1579.5
Number of Fisher Scoring iterations: 8
There is a surprising result here. The p-values associated with balance and student are both very small, indicating that each of these variables is associated with the probability of default. However, the coefficient for student is now negative, indicating that, for fixed values of balance and income, students are less likely to default than non-students. Yet the coefficient for student in mod2 was positive. How can student be associated with an increase in the probability of default in mod2 and a decrease in mlog.mod?
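The resolution is confounding: students tend to carry larger balances, and balance is strongly associated with default, so ignoring balance makes students look riskier overall. A minimal sketch comparing average balances by student status:
> # Students carry higher balances on average, which explains the
> # sign flip once balance is held fixed
> aggregate(balance ~ student, data = Default, FUN = mean)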
> Default$PrDef <- exp(predict(mlog.mod))/(1 + exp(predict(mlog.mod)))
> # Equivalently, use predict() with type = "response"
> Default$PRdef <- predict(mlog.mod, type = "response")
> ggplot(data = Default, aes(x = balance, y = default, color = student)) +
+ geom_line(aes(x = balance, y = PrDef)) +
+ scale_color_brewer(palette = 7) +
+ theme_bw() +
+ labs(y = "Probability of Default")
> library(dplyr)
> Default <- Default %>%
+ arrange(balance) %>%
+ mutate(balanceFAC = cut(balance, breaks = seq(500, 2655, length.out = 10), include.lowest = TRUE))
> head(Default)
default student balance income defaultN studentN PrDef
1 No No 0 29275.27 0 0 2.080617e-05
2 No Yes 0 21871.07 0 1 1.065494e-05
3 No No 0 50265.31 0 0 2.217400e-05
4 No No 0 49956.58 0 0 2.215325e-05
5 No No 0 32481.54 0 0 2.100952e-05
6 No No 0 43658.23 0 0 2.173402e-05
PRdef balanceFAC
1 2.080617e-05 <NA>
2 1.065494e-05 <NA>
3 2.217400e-05 <NA>
4 2.215325e-05 <NA>
5 2.100952e-05 <NA>
6 2.173402e-05 <NA>
> T1 <- xtabs(~balanceFAC + default, data = Default, subset = student == "Yes")
> T2 <- xtabs(~balanceFAC + default, data = Default, subset = student == "No")
> # Default rate among students within each balance bin
> SDR <- T1[, 2]/apply(T1[, 1:2], 1, sum)
> # Default rate among non-students within each balance bin
> NSDR <- T2[, 2]/apply(T2[, 1:2], 1, sum)
> # Approximate midpoints of the nine balance bins
> BalanceM <- c(500+120, 739+120, 979+120, 1220+120, 1460+120, 1700+120, 1940+120, 2180+120, 2420+120)
> DF <- data.frame(SDR, NSDR, BalanceM)
> DF
SDR NSDR BalanceM
(500,739] 0.000000000 0.001571092 620
(739,979] 0.000000000 0.003007519 859
(979,1.22e+03] 0.005454545 0.012252592 1099
(1.22e+03,1.46e+03] 0.025000000 0.043923865 1340
(1.46e+03,1.7e+03] 0.085365854 0.140056022 1580
(1.7e+03,1.94e+03] 0.248407643 0.394736842 1820
(1.94e+03,2.18e+03] 0.553571429 0.760000000 2060
(2.18e+03,2.42e+03] 0.818181818 0.888888889 2300
(2.42e+03,2.66e+03] 1.000000000 1.000000000 2540
> ggplot(data = DF, aes(x = BalanceM, y = NSDR)) +
+ geom_line(color = "red") + # non-student default rate
+ geom_line(aes(x = BalanceM, y = SDR), color = "green") + # student default rate
+ theme_bw() +
+ labs(y = "Default Rate", x = "Credit Card Balance")
The odds ratio, denoted OR, is the ratio of the odds for \(X = 1\) to the odds for \(X = 0\), and is given by the equation
\[\text{OR}=\frac{\frac{p(1)}{1 - p(1)}}{\frac{p(0)}{1 - p(0)}}.\]
Substituting
\[\frac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}\] for \(p(1)\) and \[\frac{e^{\beta_0}}{1 + e^{\beta_0}}\] for \(p(0)\), and recalling that \(\frac{p(X)}{1 - p(X)} = e^{\beta_0 + \beta_1X}\), the odds ratio simplifies to
\[\text{OR} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}.\]
The odds ratio is widely used as a measure of association as it approximates how much more likely or unlikely (in terms of odds) it is for the outcome to be present among those subjects with \(X = 1\) as compared to those subjects with \(X = 0\).
Using the mlog.mod model, the response variable \(Y\) is default, which denotes whether a credit card holder has defaulted (Yes/1) or not (No/0) on his or her current payment. The independent variable \(X\) is student, which denotes whether the individual is a student (Yes/1) or not a student (No/0). A hypothetical odds ratio of 2 in this scenario would mean that the odds of default among students are two times the odds of default among non-students.
An estimate of the (adjusted) odds ratio is
\[\widehat{\text{OR}} = e^{-0.6467758} = 0.5237317.\]
This is interpreted to mean that, holding balance and income fixed, the odds of default among students are 0.5237317 times the odds of default among non-students.
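A minimal sketch reproducing this estimate from the fitted model, together with a Wald-type 95% confidence interval (the interval is an addition here, not part of the output above):
> # Adjusted odds ratio for student from mlog.mod, with a Wald 95% CI
> est <- coef(summary(mlog.mod))["studentYes", ]
> exp(est["Estimate"])
> exp(est["Estimate"] + c(-1, 1) * 1.96 * est["Std. Error"])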