1 Partitions

When building a model used for prediction, one measure of the model’s usefulness is how well it performs on unseen (future) data. One method to evaluate a model on unseen data is to partition the data into a training set and a testing set. The training set is the data set one uses to build the model. The test set is the data one uses to evaluate the model’s predictions. Three approaches are used to create a training set and a test set.

The data set HELPrct after removing rows with missing values from the mdsr package is used to illustrate the three approaches with a 70/30 split of values. That is, roughly 70% of the values will be used for the training set and the remainder will be used for the test set. Note that there are 117 rows of data in HELPrct with complete information (no missing values), so a 70/30 split should result in roughly \(117\times 0.70 \approx 82\) values in the training set and the remaining 35 values in the test set.

> library(mdsr)
> HELP <- na.omit(HELPrct)
> set.seed(123)
> trainIndex1 <- sample(x = c(TRUE, FALSE), size = nrow(HELP), replace = TRUE, prob = c(0.7, 0.3))
> testIndex1 <- (!trainIndex1)
> trainData1 <- HELP[trainIndex1, ]
> testData1 <- HELP[testIndex1, ]
> dim(HELP)
[1] 117  29
> dim(trainData1)
[1] 82 29
> dim(testData1)
[1] 35 29
> # Train Fraction
> dim(trainData1)[1]/dim(HELP)[1]
[1] 0.7008547
> # Test Fraction
> dim(testData1)[1]/dim(HELP)[1]
[1] 0.2991453
> set.seed(123)
> trainIndex2 <- sample(x = 1:nrow(HELP), size = round(0.70*nrow(HELP), 0), replace = FALSE)
> trainData2 <- HELP[trainIndex2, ]
> testData2 <- HELP[-trainIndex2, ]
> dim(trainData2)
[1] 82 29
> dim(testData2)
[1] 35 29
> # Train Fraction
> dim(trainData2)[1]/dim(HELP)[1]
[1] 0.7008547
> # Test Fraction
> dim(testData2)[1]/dim(HELP)[1]
[1] 0.2991453
> set.seed(123)
> HELP$Partition <- runif(dim(HELP)[1])
> trainData3 <- subset(HELP, HELP$Partition > 0.3)
> testData3 <- subset(HELP, HELP$Partition <= 0.3)
> dim(trainData3)
[1] 84 30
> dim(testData3)
[1] 33 30
> # Train Fraction
> dim(trainData3)[1]/dim(HELP)[1]
[1] 0.7179487
> # Test Fraction
> dim(testData3)[1]/dim(HELP)[1]
[1] 0.2820513
> library(caret)  # load the caret package
> set.seed(123)
> trainIndex4 <- createDataPartition(y = HELP$homeless, p = .7, 
+                                   list = FALSE, 
+                                   times = 1)
> trainData4 <- HELP[trainIndex4, ]
> testData4 <- HELP[-trainIndex4, ]
> dim(trainData4)
[1] 83 30
> dim(testData4)
[1] 34 30
> # Train Fraction
> dim(trainData4)[1]/dim(HELP)[1]
[1] 0.7094017
> # Test Fraction
> dim(testData4)[1]/dim(HELP)[1]
[1] 0.2905983

Note: When the y argument to createDataPartition is a factor, the random sampling occurs within each class and should preserve the overall class distribution of the data.

1.1 Building a model

In this section, a model is built using the training set to predict whether a person is homeless.

> model.null <- glm(homeless ~ 1, data = trainData4, family = binomial)
> model.full <- glm(homeless ~ ., data = trainData4, family = binomial)
> AICmod <- step(model.null, scope = list(lower = model.null, upper = model.full), direction = "both", data = trainData4)
Start:  AIC=110.6
homeless ~ 1

               Df Deviance    AIC
+ i1            1   96.888 100.89
+ avg_drinks    1   96.888 100.89
+ i2            1   98.747 102.75
+ max_drinks    1   98.747 102.75
+ female        1  102.302 106.30
+ sex           1  102.302 106.30
+ pss_fr        1  103.184 107.18
+ indtot        1  104.016 108.02
+ e2b           1  105.096 109.10
+ linkstatus    1  105.727 109.73
+ link          1  105.727 109.73
+ substance     2  104.093 110.09
+ g1b           1  106.258 110.26
+ pcs           1  106.570 110.57
+ sexrisk       1  106.580 110.58
<none>             108.605 110.61
+ dayslink      1  106.883 110.88
+ d1            1  106.948 110.95
+ drugrisk      1  107.162 111.16
+ age           1  107.664 111.66
+ treat         1  108.352 112.35
+ anysubstatus  1  108.437 112.44
+ anysub        1  108.437 112.44
+ cesd          1  108.476 112.48
+ satreat       1  108.496 112.50
+ daysanysub    1  108.532 112.53
+ Partition     1  108.533 112.53
+ mcs           1  108.596 112.60
+ id            1  108.605 112.61
+ racegrp       3  108.124 116.12

Step:  AIC=100.89
homeless ~ i1

               Df Deviance     AIC
+ sex           1   89.820  95.820
+ female        1   89.820  95.820
+ pss_fr        1   92.616  98.616
+ linkstatus    1   93.154  99.154
+ link          1   93.154  99.154
+ dayslink      1   93.344  99.344
+ indtot        1   94.268 100.268
+ drugrisk      1   94.355 100.355
<none>              96.888 100.888
+ g1b           1   94.996 100.996
+ sexrisk       1   95.266 101.266
+ e2b           1   95.916 101.916
+ pcs           1   95.980 101.980
+ treat         1   96.134 102.134
+ id            1   96.524 102.524
+ d1            1   96.614 102.614
+ daysanysub    1   96.743 102.743
+ anysubstatus  1   96.781 102.781
+ anysub        1   96.781 102.781
+ satreat       1   96.783 102.783
+ cesd          1   96.835 102.835
+ age           1   96.862 102.862
+ Partition     1   96.875 102.875
+ i2            1   96.880 102.880
+ max_drinks    1   96.880 102.880
+ mcs           1   96.885 102.885
+ substance     2   96.804 104.804
+ racegrp       3   96.167 106.167
- i1            1  108.605 110.605

Step:  AIC=95.82
homeless ~ i1 + sex

               Df Deviance     AIC
+ g1b           1   85.356  93.356
+ dayslink      1   86.476  94.476
+ linkstatus    1   86.567  94.567
+ link          1   86.567  94.567
+ pss_fr        1   87.194  95.194
+ drugrisk      1   87.704  95.704
<none>              89.820  95.820
+ pcs           1   87.920  95.920
+ sexrisk       1   88.549  96.549
+ treat         1   88.675  96.675
+ d1            1   88.969  96.969
+ e2b           1   89.165  97.165
+ indtot        1   89.189  97.189
+ cesd          1   89.244  97.244
+ id            1   89.351  97.351
+ daysanysub    1   89.511  97.511
+ anysubstatus  1   89.523  97.523
+ anysub        1   89.523  97.523
+ Partition     1   89.543  97.543
+ mcs           1   89.683  97.683
+ i2            1   89.780  97.780
+ max_drinks    1   89.780  97.780
+ age           1   89.784  97.784
+ satreat       1   89.811  97.811
+ substance     2   89.717  99.717
- sex           1   96.888 100.888
+ racegrp       3   89.328 101.328
- i1            1  102.302 106.302

Step:  AIC=93.36
homeless ~ i1 + sex + g1b

               Df Deviance     AIC
+ pss_fr        1   82.333  92.333
+ linkstatus    1   82.888  92.888
+ link          1   82.888  92.888
+ dayslink      1   82.925  92.925
<none>              85.356  93.356
+ sexrisk       1   83.721  93.721
+ d1            1   83.949  93.949
+ pcs           1   84.005  94.005
+ drugrisk      1   84.690  94.690
+ treat         1   84.761  94.761
+ id            1   84.924  94.924
+ e2b           1   84.965  94.965
+ age           1   85.095  95.095
+ mcs           1   85.129  95.129
+ indtot        1   85.140  95.140
+ Partition     1   85.141  95.141
+ i2            1   85.314  95.314
+ max_drinks    1   85.314  95.314
+ daysanysub    1   85.332  95.332
+ satreat       1   85.343  95.343
+ cesd          1   85.344  95.344
+ anysubstatus  1   85.356  95.356
+ anysub        1   85.356  95.356
- g1b           1   89.820  95.820
+ substance     2   85.209  97.209
+ racegrp       3   84.586  98.586
- sex           1   94.996 100.996
- i1            1   97.479 103.479

Step:  AIC=92.33
homeless ~ i1 + sex + g1b + pss_fr

               Df Deviance     AIC
+ dayslink      1   79.519  91.519
+ linkstatus    1   79.613  91.613
+ link          1   79.613  91.613
<none>              82.333  92.333
+ sexrisk       1   80.603  92.603
- pss_fr        1   85.356  93.356
+ d1            1   81.421  93.421
+ pcs           1   81.573  93.573
+ mcs           1   81.607  93.607
+ drugrisk      1   81.617  93.617
+ id            1   81.644  93.644
+ treat         1   81.656  93.656
+ e2b           1   81.956  93.956
+ age           1   82.033  94.033
+ Partition     1   82.103  94.103
+ cesd          1   82.168  94.168
+ satreat       1   82.217  94.217
+ daysanysub    1   82.307  94.307
+ indtot        1   82.315  94.315
+ i2            1   82.318  94.318
+ max_drinks    1   82.318  94.318
+ anysubstatus  1   82.332  94.332
+ anysub        1   82.332  94.332
- g1b           1   87.194  95.194
+ substance     2   82.053  96.053
+ racegrp       3   81.595  97.595
- sex           1   90.088  98.088
- i1            1   93.342 101.342

Step:  AIC=91.52
homeless ~ i1 + sex + g1b + pss_fr + dayslink

               Df Deviance     AIC
+ sexrisk       1   77.385  91.385
<none>              79.519  91.519
+ mcs           1   78.329  92.329
- dayslink      1   82.333  92.333
- pss_fr        1   82.925  92.925
+ e2b           1   79.003  93.003
+ satreat       1   79.067  93.067
+ drugrisk      1   79.090  93.090
+ d1            1   79.153  93.153
+ id            1   79.187  93.187
+ pcs           1   79.196  93.196
- g1b           1   83.202  93.202
+ cesd          1   79.210  93.210
+ anysubstatus  1   79.264  93.264
+ anysub        1   79.264  93.264
+ Partition     1   79.366  93.366
+ age           1   79.454  93.454
+ i2            1   79.485  93.485
+ max_drinks    1   79.485  93.485
+ daysanysub    1   79.487  93.487
+ treat         1   79.505  93.505
+ indtot        1   79.511  93.511
+ linkstatus    1   79.515  93.515
+ link          1   79.515  93.515
+ substance     2   79.279  95.279
- sex           1   86.308  96.308
+ racegrp       3   78.622  96.622
- i1            1   92.140 102.140

Step:  AIC=91.38
homeless ~ i1 + sex + g1b + pss_fr + dayslink + sexrisk

               Df Deviance     AIC
<none>              77.385  91.385
- sexrisk       1   79.519  91.519
+ drugrisk      1   76.277  92.277
+ satreat       1   76.555  92.555
- dayslink      1   80.603  92.603
+ mcs           1   76.636  92.636
+ anysubstatus  1   76.658  92.658
+ anysub        1   76.658  92.658
+ pcs           1   76.740  92.740
- pss_fr        1   80.916  92.916
+ age           1   77.067  93.067
+ e2b           1   77.119  93.119
+ d1            1   77.128  93.128
+ Partition     1   77.154  93.154
+ indtot        1   77.235  93.235
+ cesd          1   77.259  93.259
+ daysanysub    1   77.302  93.302
+ treat         1   77.312  93.312
+ id            1   77.355  93.355
+ linkstatus    1   77.375  93.375
+ link          1   77.375  93.375
+ i2            1   77.383  93.383
+ max_drinks    1   77.383  93.383
- g1b           1   81.468  93.468
+ substance     2   77.019  95.019
- sex           1   83.654  95.654
+ racegrp       3   77.212  97.212
- i1            1   88.139 100.139
> summary(AICmod)

Call:
glm(formula = homeless ~ i1 + sex + g1b + pss_fr + dayslink + 
    sexrisk, family = binomial, data = trainData4)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1433  -0.6923  -0.3944   0.7927   1.9780  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.903045   1.240429   0.728   0.4666  
i1          -0.057817   0.023137  -2.499   0.0125 *
sexmale     -1.801138   0.764603  -2.356   0.0185 *
g1byes      -1.338143   0.710160  -1.884   0.0595 .
pss_fr       0.159495   0.087981   1.813   0.0699 .
dayslink     0.003429   0.001979   1.733   0.0831 .
sexrisk     -0.161055   0.112725  -1.429   0.1531  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 108.605  on 82  degrees of freedom
Residual deviance:  77.385  on 76  degrees of freedom
AIC: 91.385

Number of Fisher Scoring iterations: 6

1.1.1 Pseudo \(R^2\)

One can think of the pseudo \(R^2\) for logistic regression as the analog to \(R^2\) used in linear regression. Pseudo \(R^2\) measures how much of the deviance is explained by the model, where \(R^2\) measures the variability explained by a linear model.

> resid.dev <- summary(AICmod)$deviance
> resid.dev
[1] 77.38461
> null.dev <- summary(AICmod)$null.deviance
> null.dev
[1] 108.6048
> PR2 <- 1 - (resid.dev / null.dev)
> PR2 # Pseudo R squared
[1] 0.2874657

The model only explains 28.75% of the deviance in the training data. This will likely be less for the test data suggesting there are likely other factors that have not been measured that predict homelessness.

> testmod <- glm(homeless ~ i1 + sex + g1b + pss_fr + dayslink + sexrisk, family = binomial, data = testData4)
> resid.dev <- summary(testmod)$deviance
> resid.dev
[1] 37.53389
> null.dev <- summary(testmod)$null.deviance
> null.dev
[1] 44.14889
> PR2 <- 1 - (resid.dev / null.dev)
> PR2 # Pseudo R squared
[1] 0.1498339
> # Note 'type = "response"' gives the predicted probabilities
> PM <- predict(AICmod, newdata = testData4, type = "response")
> pred <- ifelse(PM > 0.5, "homeless", "housed")
> T1 <- table(pred, testData4$homeless)
> T1
          
pred       homeless housed
  homeless        6      3
  housed         16      9
> Accuracy <- sum(diag(T1))/sum(T1)
> Accuracy
[1] 0.4411765
> # Using caret
> confusionMatrix(data = pred, reference = testData4$homeless)
Confusion Matrix and Statistics

          Reference
Prediction homeless housed
  homeless        6      3
  housed         16      9
                                          
               Accuracy : 0.4412          
                 95% CI : (0.2719, 0.6211)
    No Information Rate : 0.6471          
    P-Value [Acc > NIR] : 0.995694        
                                          
                  Kappa : 0.0182          
 Mcnemar's Test P-Value : 0.005905        
                                          
            Sensitivity : 0.2727          
            Specificity : 0.7500          
         Pos Pred Value : 0.6667          
         Neg Pred Value : 0.3600          
             Prevalence : 0.6471          
         Detection Rate : 0.1765          
   Detection Prevalence : 0.2647          
      Balanced Accuracy : 0.5114          
                                          
       'Positive' Class : homeless        
                                          

See Confusion Matrix on Wikipedia.

1.2 ROC

An ROC curve is a really useful shortcut for summarizing the performance of a classifier over all possible thresholds. This saves one a lot of tedious work computing class predictions for many different thresholds and examining the confusion matrix for each threshold.

1.2.1 AUC (Area Under the Curve)

  • Single-number summary of model accuracy
  • Summarizes performance accross all thresholds
  • Rank different models within the same dataset
> library(caTools)
> colAUC(PM, testData4$homeless, plotROC = TRUE)

                     [,1]
homeless vs. housed 0.625
> library(pROC)
> ROC <- roc(testData4$homeless, PM) # Note the order is switched here
> plot(ROC, col = "blue")

> auc(ROC)
Area under the curve: 0.625

An AUC of 0.5 is no better than random guessing, an AUC of 1.0 is a perfectly predictive model, and an AUC of 0.0 is perfectly anti-predictive (which rarely happens). One can use the trainControl() function in caret to use AUC (instead of acccuracy), to tune the parameters of your models. The twoClassSummary() convenience function allows one to do this easily.

When using twoClassSummary(), be sure to always include the argument classProbs = TRUE or your model will throw an error! (You cannot calculate AUC with just class predictions. You need to have class probabilities as well.)

> # Create trainControl object: myControl
> myControl <- trainControl(
+   # method = "cv",
+   # number = 10,
+   summaryFunction = twoClassSummary,
+   classProbs = TRUE, # IMPORTANT!
+   verbose = FALSE
+ )
> #
> # Train glm with custom trainControl: model
> model <- train(homeless ~ ., data = trainData4, method = "glmStepAIC",
+                trControl = myControl)
Start:  AIC=124.46
.outcome ~ age + anysubstatus + anysubyes + cesd + d1 + daysanysub + 
    dayslink + drugrisk + e2b + female + sexmale + g1byes + i1 + 
    i2 + id + indtot + linkstatus + linkyes + mcs + pcs + pss_fr + 
    racegrphispanic + racegrpother + racegrpwhite + satreatyes + 
    sexrisk + substancecocaine + substanceheroin + treatyes + 
    avg_drinks + max_drinks + Partition


Step:  AIC=124.46
.outcome ~ age + anysubstatus + anysubyes + cesd + d1 + daysanysub + 
    dayslink + drugrisk + e2b + female + sexmale + g1byes + i1 + 
    i2 + id + indtot + linkstatus + linkyes + mcs + pcs + pss_fr + 
    racegrphispanic + racegrpother + racegrpwhite + satreatyes + 
    sexrisk + substancecocaine + substanceheroin + treatyes + 
    avg_drinks + Partition


Step:  AIC=124.46
.outcome ~ age + anysubstatus + anysubyes + cesd + d1 + daysanysub + 
    dayslink + drugrisk + e2b + female + sexmale + g1byes + i1 + 
    i2 + id + indtot + linkstatus + linkyes + mcs + pcs + pss_fr + 
    racegrphispanic + racegrpother + racegrpwhite + satreatyes + 
    sexrisk + substancecocaine + substanceheroin + treatyes + 
    Partition


Step:  AIC=124.46
.outcome ~ age + anysubstatus + anysubyes + cesd + d1 + daysanysub + 
    dayslink + drugrisk + e2b + female + sexmale + g1byes + i1 + 
    i2 + id + indtot + linkstatus + mcs + pcs + pss_fr + racegrphispanic + 
    racegrpother + racegrpwhite + satreatyes + sexrisk + substancecocaine + 
    substanceheroin + treatyes + Partition


Step:  AIC=124.46
.outcome ~ age + anysubstatus + anysubyes + cesd + d1 + daysanysub + 
    dayslink + drugrisk + e2b + female + g1byes + i1 + i2 + id + 
    indtot + linkstatus + mcs + pcs + pss_fr + racegrphispanic + 
    racegrpother + racegrpwhite + satreatyes + sexrisk + substancecocaine + 
    substanceheroin + treatyes + Partition


Step:  AIC=124.46
.outcome ~ age + anysubstatus + cesd + d1 + daysanysub + dayslink + 
    drugrisk + e2b + female + g1byes + i1 + i2 + id + indtot + 
    linkstatus + mcs + pcs + pss_fr + racegrphispanic + racegrpother + 
    racegrpwhite + satreatyes + sexrisk + substancecocaine + 
    substanceheroin + treatyes + Partition

                   Df Deviance    AIC
- treatyes          1   68.460 122.46
- cesd              1   68.461 122.46
- i2                1   68.476 122.48
- d1                1   68.498 122.50
- satreatyes        1   68.527 122.53
- racegrpother      1   68.629 122.63
- id                1   68.669 122.67
- racegrphispanic   1   68.675 122.67
- e2b               1   68.853 122.85
- linkstatus        1   68.857 122.86
- i1                1   68.872 122.87
- age               1   68.877 122.88
- racegrpwhite      1   68.877 122.88
- substancecocaine  1   68.881 122.88
- indtot            1   68.886 122.89
- pcs               1   68.897 122.90
- daysanysub        1   69.125 123.12
- dayslink          1   69.398 123.40
- Partition         1   69.509 123.51
- substanceheroin   1   69.614 123.61
- anysubstatus      1   69.632 123.63
- mcs               1   69.712 123.71
- drugrisk          1   70.232 124.23
<none>                  68.458 124.46
- female            1   71.664 125.66
- sexrisk           1   71.689 125.69
- pss_fr            1   71.765 125.77
- g1byes            1   73.622 127.62

Step:  AIC=122.46
.outcome ~ age + anysubstatus + cesd + d1 + daysanysub + dayslink + 
    drugrisk + e2b + female + g1byes + i1 + i2 + id + indtot + 
    linkstatus + mcs + pcs + pss_fr + racegrphispanic + racegrpother + 
    racegrpwhite + satreatyes + sexrisk + substancecocaine + 
    substanceheroin + Partition

                   Df Deviance    AIC
- cesd              1   68.462 120.46
- i2                1   68.478 120.48
- d1                1   68.508 120.51
- satreatyes        1   68.527 120.53
- racegrpother      1   68.631 120.63
- id                1   68.671 120.67
- racegrphispanic   1   68.678 120.68
- e2b               1   68.855 120.86
- linkstatus        1   68.857 120.86
- i1                1   68.872 120.87
- age               1   68.877 120.88
- racegrpwhite      1   68.878 120.88
- substancecocaine  1   68.881 120.88
- pcs               1   68.909 120.91
- indtot            1   68.910 120.91
- daysanysub        1   69.174 121.17
- dayslink          1   69.426 121.43
- Partition         1   69.534 121.53
- substanceheroin   1   69.636 121.64
- anysubstatus      1   69.711 121.71
- mcs               1   69.712 121.71
- drugrisk          1   70.235 122.23
<none>                  68.460 122.46
- sexrisk           1   71.699 123.70
- female            1   71.708 123.71
- pss_fr            1   71.813 123.81
- g1byes            1   73.623 125.62

Step:  AIC=120.46
.outcome ~ age + anysubstatus + d1 + daysanysub + dayslink + 
    drugrisk + e2b + female + g1byes + i1 + i2 + id + indtot + 
    linkstatus + mcs + pcs + pss_fr + racegrphispanic + racegrpother + 
    racegrpwhite + satreatyes + sexrisk + substancecocaine + 
    substanceheroin + Partition

                   Df Deviance    AIC
- i2                1   68.483 118.48
- d1                1   68.508 118.51
- satreatyes        1   68.527 118.53
- racegrpother      1   68.631 118.63
- racegrphispanic   1   68.684 118.68
- id                1   68.713 118.71
- linkstatus        1   68.865 118.86
- i1                1   68.873 118.87
- e2b               1   68.879 118.88
- racegrpwhite      1   68.880 118.88
- age               1   68.881 118.88
- substancecocaine  1   68.895 118.89
- indtot            1   68.912 118.91
- pcs               1   69.004 119.00
- daysanysub        1   69.235 119.23
- dayslink          1   69.453 119.45
- Partition         1   69.557 119.56
- substanceheroin   1   69.656 119.66
- anysubstatus      1   69.777 119.78
- drugrisk          1   70.242 120.24
- mcs               1   70.387 120.39
<none>                  68.462 120.46
- female            1   71.732 121.73
- sexrisk           1   71.754 121.75
- pss_fr            1   71.864 121.86
- g1byes            1   73.765 123.77

Step:  AIC=118.48
.outcome ~ age + anysubstatus + d1 + daysanysub + dayslink + 
    drugrisk + e2b + female + g1byes + i1 + id + indtot + linkstatus + 
    mcs + pcs + pss_fr + racegrphispanic + racegrpother + racegrpwhite + 
    satreatyes + sexrisk + substancecocaine + substanceheroin + 
    Partition

                   Df Deviance    AIC
- d1                1   68.529 116.53
- satreatyes        1   68.538 116.54
- racegrpother      1   68.646 116.65
- racegrphispanic   1   68.707 116.71
- id                1   68.725 116.72
- racegrpwhite      1   68.881 116.88
- linkstatus        1   68.894 116.89
- age               1   68.896 116.90
- substancecocaine  1   68.896 116.90
- indtot            1   68.921 116.92
- e2b               1   68.981 116.98
- pcs               1   69.009 117.01
- daysanysub        1   69.250 117.25
- dayslink          1   69.476 117.48
- Partition         1   69.602 117.60
- substanceheroin   1   69.683 117.68
- anysubstatus      1   69.795 117.80
- i1                1   69.929 117.93
- drugrisk          1   70.366 118.37
- mcs               1   70.395 118.39
<none>                  68.483 118.48
- female            1   71.768 119.77
- sexrisk           1   71.786 119.79
- pss_fr            1   71.981 119.98
- g1byes            1   73.806 121.81

Step:  AIC=116.53
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk + 
    e2b + female + g1byes + i1 + id + indtot + linkstatus + mcs + 
    pcs + pss_fr + racegrphispanic + racegrpother + racegrpwhite + 
    satreatyes + sexrisk + substancecocaine + substanceheroin + 
    Partition

                   Df Deviance    AIC
- satreatyes        1   68.580 114.58
- racegrpother      1   68.677 114.68
- racegrphispanic   1   68.740 114.74
- id                1   68.779 114.78
- linkstatus        1   68.897 114.90
- substancecocaine  1   68.922 114.92
- age               1   68.939 114.94
- racegrpwhite      1   68.964 114.96
- indtot            1   68.970 114.97
- e2b               1   69.000 115.00
- pcs               1   69.133 115.13
- daysanysub        1   69.261 115.26
- dayslink          1   69.476 115.48
- Partition         1   69.631 115.63
- substanceheroin   1   69.803 115.80
- anysubstatus      1   69.830 115.83
- i1                1   70.020 116.02
<none>                  68.529 116.53
- mcs               1   70.559 116.56
- drugrisk          1   70.675 116.67
- sexrisk           1   71.821 117.82
- female            1   71.837 117.84
- pss_fr            1   72.393 118.39
- g1byes            1   73.999 120.00

Step:  AIC=114.58
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk + 
    e2b + female + g1byes + i1 + id + indtot + linkstatus + mcs + 
    pcs + pss_fr + racegrphispanic + racegrpother + racegrpwhite + 
    sexrisk + substancecocaine + substanceheroin + Partition

                   Df Deviance    AIC
- racegrpother      1   68.700 112.70
- racegrphispanic   1   68.752 112.75
- id                1   68.860 112.86
- linkstatus        1   68.929 112.93
- age               1   68.964 112.96
- indtot            1   69.024 113.02
- substancecocaine  1   69.062 113.06
- racegrpwhite      1   69.158 113.16
- pcs               1   69.241 113.24
- e2b               1   69.283 113.28
- daysanysub        1   69.298 113.30
- dayslink          1   69.494 113.49
- Partition         1   69.799 113.80
- anysubstatus      1   69.866 113.87
- substanceheroin   1   69.981 113.98
- i1                1   70.057 114.06
<none>                  68.580 114.58
- drugrisk          1   70.958 114.96
- mcs               1   71.193 115.19
- female            1   71.843 115.84
- sexrisk           1   71.874 115.87
- pss_fr            1   72.488 116.49
- g1byes            1   74.339 118.34

Step:  AIC=112.7
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk + 
    e2b + female + g1byes + i1 + id + indtot + linkstatus + mcs + 
    pcs + pss_fr + racegrphispanic + racegrpwhite + sexrisk + 
    substancecocaine + substanceheroin + Partition

                   Df Deviance    AIC
- racegrphispanic   1   68.785 110.78
- linkstatus        1   68.966 110.97
- id                1   68.974 110.97
- age               1   69.016 111.02
- indtot            1   69.082 111.08
- substancecocaine  1   69.346 111.35
- pcs               1   69.441 111.44
- daysanysub        1   69.442 111.44
- racegrpwhite      1   69.488 111.49
- dayslink          1   69.498 111.50
- e2b               1   69.531 111.53
- Partition         1   69.936 111.94
- anysubstatus      1   69.967 111.97
- i1                1   70.060 112.06
- substanceheroin   1   70.088 112.09
<none>                  68.700 112.70
- drugrisk          1   71.101 113.10
- mcs               1   71.212 113.21
- sexrisk           1   71.875 113.88
- female            1   72.089 114.09
- pss_fr            1   73.065 115.06
- g1byes            1   74.444 116.44

Step:  AIC=110.78
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk + 
    e2b + female + g1byes + i1 + id + indtot + linkstatus + mcs + 
    pcs + pss_fr + racegrpwhite + sexrisk + substancecocaine + 
    substanceheroin + Partition

                   Df Deviance    AIC
- linkstatus        1   68.983 108.98
- age               1   69.100 109.10
- id                1   69.133 109.13
- indtot            1   69.146 109.15
- daysanysub        1   69.491 109.49
- dayslink          1   69.498 109.50
- pcs               1   69.518 109.52
- substancecocaine  1   69.543 109.54
- e2b               1   69.701 109.70
- racegrpwhite      1   69.896 109.90
- Partition         1   69.953 109.95
- anysubstatus      1   69.994 109.99
- substanceheroin   1   70.089 110.09
- i1                1   70.094 110.09
<none>                  68.785 110.78
- drugrisk          1   71.112 111.11
- mcs               1   71.273 111.27
- sexrisk           1   71.957 111.96
- female            1   72.169 112.17
- pss_fr            1   73.201 113.20
- g1byes            1   74.746 114.75

Step:  AIC=108.98
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk + 
    e2b + female + g1byes + i1 + id + indtot + mcs + pcs + pss_fr + 
    racegrpwhite + sexrisk + substancecocaine + substanceheroin + 
    Partition

                   Df Deviance    AIC
- indtot            1   69.373 107.37
- age               1   69.471 107.47
- id                1   69.580 107.58
- daysanysub        1   69.643 107.64
- substancecocaine  1   69.769 107.77
- pcs               1   69.773 107.77
- e2b               1   69.967 107.97
- Partition         1   69.996 108.00
- racegrpwhite      1   70.004 108.00
- anysubstatus      1   70.122 108.12
- substanceheroin   1   70.137 108.14
- i1                1   70.159 108.16
<none>                  68.983 108.98
- drugrisk          1   71.279 109.28
- mcs               1   71.300 109.30
- sexrisk           1   72.422 110.42
- female            1   72.423 110.42
- pss_fr            1   73.223 111.22
- dayslink          1   73.688 111.69
- g1byes            1   74.867 112.87

Step:  AIC=107.37
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk + 
    e2b + female + g1byes + i1 + id + mcs + pcs + pss_fr + racegrpwhite + 
    sexrisk + substancecocaine + substanceheroin + Partition

                   Df Deviance    AIC
- id                1   69.852 105.85
- age               1   69.966 105.97
- substancecocaine  1   70.045 106.05
- pcs               1   70.063 106.06
- racegrpwhite      1   70.174 106.17
- daysanysub        1   70.212 106.21
- e2b               1   70.295 106.30
- substanceheroin   1   70.403 106.40
- Partition         1   70.511 106.51
- anysubstatus      1   70.796 106.80
- i1                1   70.843 106.84
- mcs               1   71.300 107.30
<none>                  69.373 107.37
- drugrisk          1   71.946 107.95
- sexrisk           1   72.828 108.83
- dayslink          1   73.769 109.77
- female            1   73.772 109.77
- pss_fr            1   74.240 110.24
- g1byes            1   74.948 110.95

Step:  AIC=105.85
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk + 
    e2b + female + g1byes + i1 + mcs + pcs + pss_fr + racegrpwhite + 
    sexrisk + substancecocaine + substanceheroin + Partition

                   Df Deviance    AIC
- pcs               1   70.271 104.27
- daysanysub        1   70.488 104.49
- e2b               1   70.511 104.51
- substancecocaine  1   70.553 104.55
- age               1   70.574 104.57
- racegrpwhite      1   70.647 104.65
- substanceheroin   1   70.713 104.71
- Partition         1   70.797 104.80
- anysubstatus      1   71.113 105.11
- mcs               1   71.619 105.62
<none>                  69.852 105.85
- i1                1   71.877 105.88
- drugrisk          1   72.253 106.25
- sexrisk           1   72.875 106.88
- dayslink          1   73.869 107.87
- female            1   74.283 108.28
- pss_fr            1   75.070 109.07
- g1byes            1   75.106 109.11

Step:  AIC=104.27
.outcome ~ age + anysubstatus + daysanysub + dayslink + drugrisk + 
    e2b + female + g1byes + i1 + mcs + pss_fr + racegrpwhite + 
    sexrisk + substancecocaine + substanceheroin + Partition

                   Df Deviance    AIC
- daysanysub        1   70.750 102.75
- substanceheroin   1   70.864 102.86
- substancecocaine  1   70.923 102.92
- Partition         1   71.226 103.23
- racegrpwhite      1   71.237 103.24
- e2b               1   71.260 103.26
- anysubstatus      1   71.360 103.36
- age               1   71.396 103.40
- mcs               1   72.145 104.14
<none>                  70.271 104.27
- i1                1   72.693 104.69
- drugrisk          1   72.776 104.78
- sexrisk           1   73.134 105.13
- female            1   74.448 106.45
- dayslink          1   74.830 106.83
- g1byes            1   75.679 107.68
- pss_fr            1   75.970 107.97

Step:  AIC=102.75
.outcome ~ age + anysubstatus + dayslink + drugrisk + e2b + female + 
    g1byes + i1 + mcs + pss_fr + racegrpwhite + sexrisk + substancecocaine + 
    substanceheroin + Partition

                   Df Deviance    AIC
- e2b               1   71.510 101.51
- substancecocaine  1   71.606 101.61
- age               1   71.666 101.67
- anysubstatus      1   71.702 101.70
- Partition         1   71.795 101.80
- substanceheroin   1   72.038 102.04
- racegrpwhite      1   72.049 102.05
- mcs               1   72.524 102.52
<none>                  70.750 102.75
- i1                1   72.824 102.82
- sexrisk           1   73.223 103.22
- drugrisk          1   73.899 103.90
- female            1   74.799 104.80
- dayslink          1   74.981 104.98
- g1byes            1   75.975 105.97
- pss_fr            1   76.428 106.43

Step:  AIC=101.51
.outcome ~ age + anysubstatus + dayslink + drugrisk + female + 
    g1byes + i1 + mcs + pss_fr + racegrpwhite + sexrisk + substancecocaine + 
    substanceheroin + Partition

                   Df Deviance    AIC
- substancecocaine  1   72.273 100.27
- Partition         1   72.352 100.35
- age               1   72.386 100.39
- racegrpwhite      1   72.484 100.48
- substanceheroin   1   72.614 100.61
- anysubstatus      1   72.623 100.62
- mcs               1   72.894 100.89
<none>                  71.510 101.51
- i1                1   73.998 102.00
- drugrisk          1   74.305 102.31
- sexrisk           1   74.479 102.48
- dayslink          1   75.627 103.63
- female            1   75.836 103.84
- pss_fr            1   76.886 104.89
- g1byes            1   76.903 104.90

Step:  AIC=100.27
.outcome ~ age + anysubstatus + dayslink + drugrisk + female + 
    g1byes + i1 + mcs + pss_fr + racegrpwhite + sexrisk + substanceheroin + 
    Partition

                  Df Deviance     AIC
- substanceheroin  1   72.758  98.758
- racegrpwhite     1   72.773  98.773
- Partition        1   73.106  99.106
- age              1   73.111  99.111
- anysubstatus     1   73.429  99.429
- mcs              1   73.735  99.735
<none>                 72.273 100.273
- sexrisk          1   74.612 100.612
- drugrisk         1   74.621 100.621
- dayslink         1   76.160 102.160
- female           1   76.433 102.433
- i1               1   76.625 102.625
- pss_fr           1   77.269 103.269
- g1byes           1   77.994 103.994

Step:  AIC=98.76
.outcome ~ age + anysubstatus + dayslink + drugrisk + female + 
    g1byes + i1 + mcs + pss_fr + racegrpwhite + sexrisk + Partition

               Df Deviance     AIC
- racegrpwhite  1   73.188  97.188
- Partition     1   73.692  97.692
- anysubstatus  1   73.841  97.841
- age           1   73.979  97.979
- mcs           1   74.149  98.149
- drugrisk      1   74.621  98.621
<none>              72.758  98.758
- sexrisk       1   75.690  99.690
- dayslink      1   76.645 100.645
- pss_fr        1   77.298 101.298
- g1byes        1   78.648 102.648
- female        1   78.944 102.944
- i1            1   79.133 103.133

Step:  AIC=97.19
.outcome ~ age + anysubstatus + dayslink + drugrisk + female + 
    g1byes + i1 + mcs + pss_fr + sexrisk + Partition

               Df Deviance     AIC
- Partition     1   73.961  95.961
- age           1   74.141  96.141
- mcs           1   74.403  96.403
- anysubstatus  1   74.411  96.411
- drugrisk      1   74.655  96.655
<none>              73.188  97.188
- sexrisk       1   76.817  98.817
- dayslink      1   76.934  98.934
- pss_fr        1   77.891  99.891
- g1byes        1   78.652 100.652
- i1            1   79.179 101.179
- female        1   79.344 101.344

Step:  AIC=95.96
.outcome ~ age + anysubstatus + dayslink + drugrisk + female + 
    g1byes + i1 + mcs + pss_fr + sexrisk

               Df Deviance     AIC
- age           1   74.574  94.574
- mcs           1   74.891  94.891
- anysubstatus  1   75.073  95.073
- drugrisk      1   75.245  95.245
<none>              73.961  95.961
- sexrisk       1   77.281  97.281
- dayslink      1   77.772  97.772
- pss_fr        1   78.421  98.421
- g1byes        1   79.061  99.061
- female        1   79.501  99.501
- i1            1   80.645 100.645

Step:  AIC=94.57
.outcome ~ anysubstatus + dayslink + drugrisk + female + g1byes + 
    i1 + mcs + pss_fr + sexrisk

               Df Deviance     AIC
- anysubstatus  1   75.375  93.375
- mcs           1   75.575  93.575
- drugrisk      1   75.792  93.792
<none>              74.574  94.574
- sexrisk       1   77.377  95.377
- dayslink      1   78.607  96.607
- pss_fr        1   78.907  96.907
- g1byes        1   79.114  97.114
- female        1   79.535  97.535
- i1            1   84.250 102.250

Step:  AIC=93.38
.outcome ~ dayslink + drugrisk + female + g1byes + i1 + mcs + 
    pss_fr + sexrisk

           Df Deviance     AIC
- mcs       1   76.277  92.277
- drugrisk  1   76.636  92.636
<none>          75.375  93.375
- sexrisk   1   77.693  93.693
- dayslink  1   78.742  94.742
- g1byes    1   79.225  95.225
- pss_fr    1   79.543  95.543
- female    1   80.661  96.661
- i1        1   86.093 102.093

Step:  AIC=92.28
.outcome ~ dayslink + drugrisk + female + g1byes + i1 + pss_fr + 
    sexrisk

           Df Deviance     AIC
- drugrisk  1   77.385  91.385
<none>          76.277  92.277
- sexrisk   1   79.090  93.090
- dayslink  1   79.335  93.335
- g1byes    1   79.342  93.342
- pss_fr    1   79.783  93.783
- female    1   82.121  96.121
- i1        1   86.744 100.744

Step:  AIC=91.38
.outcome ~ dayslink + female + g1byes + i1 + pss_fr + sexrisk

           Df Deviance     AIC
<none>          77.385  91.385
- sexrisk   1   79.519  91.519
- dayslink  1   80.603  92.603
- pss_fr    1   80.916  92.916
- g1byes    1   81.468  93.468
- female    1   83.654  95.654
- i1        1   88.139 100.139
> # Print model to console
> model
Generalized Linear Model with Stepwise Feature Selection 

83 samples
29 predictors
 2 classes: 'homeless', 'housed' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 83, 83, 83, 83, 83, 83, ... 
Resampling results:

  ROC        Sens       Spec     
  0.5307497  0.5851495  0.4547207
> summary(model)

Call:
NULL

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1433  -0.6923  -0.3944   0.7927   1.9780  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)  
(Intercept) -0.898093   0.951086  -0.944   0.3450  
dayslink     0.003429   0.001979   1.733   0.0831 .
female       1.801138   0.764603   2.356   0.0185 *
g1byes      -1.338143   0.710160  -1.884   0.0595 .
i1          -0.057817   0.023137  -2.499   0.0125 *
pss_fr       0.159495   0.087981   1.813   0.0699 .
sexrisk     -0.161055   0.112725  -1.429   0.1531  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 108.605  on 82  degrees of freedom
Residual deviance:  77.385  on 76  degrees of freedom
AIC: 91.385

Number of Fisher Scoring iterations: 6

1.3 GLMNET

  • alpha = 0 \(\rightarrow\) ridge
  • alpha = 1 \(\rightarrow\) lasso
> set.seed(234)
> # Create custom trainControl: myControl
> myControl <- trainControl(
+   method = "cv", number = 10,
+   summaryFunction = twoClassSummary,
+   classProbs = TRUE, # IMPORTANT!
+   verboseIter = TRUE
+ )
> # Train glmnet with custom trainControl and tuning: model
> model <- train(
+   homeless ~ ., data = trainData4,
+   tuneGrid = expand.grid(alpha  = 0:1, lambda = seq(0.0001, 4, length = 20)),
+   method = "glmnet",
+   trControl = myControl
+ )
Aggregating results
Selecting tuning parameters
Fitting alpha = 0, lambda = 4 on full training set
> # Print model to console
> model
glmnet 

83 samples
29 predictors
 2 classes: 'homeless', 'housed' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 74, 75, 75, 75, 75, 75, ... 
Resampling results across tuning parameters:

  alpha  lambda     ROC        Sens       Spec      
  0      0.0001000  0.5766667  0.7166667  0.40000000
  0      0.2106211  0.6188889  0.7966667  0.33333333
  0      0.4211421  0.6455556  0.8466667  0.30000000
  0      0.6316632  0.6522222  0.9233333  0.26666667
  0      0.8421842  0.6588889  0.9600000  0.20000000
  0      1.0527053  0.6711111  0.9600000  0.13333333
  0      1.2632263  0.6833333  0.9600000  0.10000000
  0      1.4737474  0.6833333  0.9600000  0.10000000
  0      1.6842684  0.6966667  0.9600000  0.06666667
  0      1.8947895  0.6966667  0.9600000  0.03333333
  0      2.1053105  0.6966667  0.9800000  0.00000000
  0      2.3158316  0.6966667  1.0000000  0.00000000
  0      2.5263526  0.6966667  1.0000000  0.00000000
  0      2.7368737  0.7033333  1.0000000  0.00000000
  0      2.9473947  0.7033333  1.0000000  0.00000000
  0      3.1579158  0.7033333  1.0000000  0.00000000
  0      3.3684368  0.7033333  1.0000000  0.00000000
  0      3.5789579  0.7033333  1.0000000  0.00000000
  0      3.7894789  0.7033333  1.0000000  0.00000000
  0      4.0000000  0.7033333  1.0000000  0.00000000
  1      0.0001000  0.5333333  0.6633333  0.43333333
  1      0.2106211  0.5000000  1.0000000  0.00000000
  1      0.4211421  0.5000000  1.0000000  0.00000000
  1      0.6316632  0.5000000  1.0000000  0.00000000
  1      0.8421842  0.5000000  1.0000000  0.00000000
  1      1.0527053  0.5000000  1.0000000  0.00000000
  1      1.2632263  0.5000000  1.0000000  0.00000000
  1      1.4737474  0.5000000  1.0000000  0.00000000
  1      1.6842684  0.5000000  1.0000000  0.00000000
  1      1.8947895  0.5000000  1.0000000  0.00000000
  1      2.1053105  0.5000000  1.0000000  0.00000000
  1      2.3158316  0.5000000  1.0000000  0.00000000
  1      2.5263526  0.5000000  1.0000000  0.00000000
  1      2.7368737  0.5000000  1.0000000  0.00000000
  1      2.9473947  0.5000000  1.0000000  0.00000000
  1      3.1579158  0.5000000  1.0000000  0.00000000
  1      3.3684368  0.5000000  1.0000000  0.00000000
  1      3.5789579  0.5000000  1.0000000  0.00000000
  1      3.7894789  0.5000000  1.0000000  0.00000000
  1      4.0000000  0.5000000  1.0000000  0.00000000

ROC was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 4.
> # Print maximum ROC statistic
> max(model[["results"]][["ROC"]])
[1] 0.7033333
> plot(model)

1.3.1 Imputation

  • Median (medianImpute)
  • knn (knnImpute)
  • bagged tree model (bagImpute)
> # Create custom trainControl: myControl
> myControl <- trainControl(
+   method = "cv", number = 10,
+   summaryFunction = twoClassSummary,
+   classProbs = TRUE, # IMPORTANT!
+   verboseIter = TRUE
+ )
> 
> # Train glmnet with custom trainControl and tuning: model
> 
> model <- train(
+   homeless ~ ., data = HELPrct, 
+   tuneGrid = expand.grid(alpha  = 0:1, lambda = seq(0.0001, 100, length = 20)),
+   method = "glmnet",
+   metric = "ROC",
+   trControl = myControl,
+   preProcess = c("knnImpute"),
+   na.action = na.pass
+ )
Aggregating results
Selecting tuning parameters
Fitting alpha = 0, lambda = 84.2 on full training set
> # Print model to console
> model
glmnet 

453 samples
 28 predictor
  2 classes: 'homeless', 'housed' 

Pre-processing: nearest neighbor imputation (31), centered (31),
 scaled (31) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 408, 407, 407, 408, 408, 409, ... 
Resampling results across tuning parameters:

  alpha  lambda      ROC        Sens         Spec     
  0        0.000100  0.6509484  0.482142857  0.6888333
  0        5.263253  0.6622321  0.104761905  0.9631667
  0       10.526405  0.6614226  0.033333333  0.9916667
  0       15.789558  0.6611885  0.004761905  1.0000000
  0       21.052711  0.6619583  0.000000000  1.0000000
  0       26.315863  0.6621488  0.000000000  1.0000000
  0       31.579016  0.6623472  0.000000000  1.0000000
  0       36.842168  0.6621488  0.000000000  1.0000000
  0       42.105321  0.6621488  0.000000000  1.0000000
  0       47.368474  0.6623472  0.000000000  1.0000000
  0       52.631626  0.6623472  0.000000000  1.0000000
  0       57.894779  0.6623472  0.000000000  1.0000000
  0       63.157932  0.6625456  0.000000000  1.0000000
  0       68.421084  0.6625456  0.000000000  1.0000000
  0       73.684237  0.6625456  0.000000000  1.0000000
  0       78.947389  0.6625456  0.000000000  1.0000000
  0       84.210542  0.6625456  0.000000000  1.0000000
  0       89.473695  0.6623552  0.000000000  1.0000000
  0       94.736847  0.6623552  0.000000000  1.0000000
  0      100.000000  0.6623552  0.000000000  1.0000000
  1        0.000100  0.6481567  0.486904762  0.6888333
  1        5.263253  0.5000000  0.000000000  1.0000000
  1       10.526405  0.5000000  0.000000000  1.0000000
  1       15.789558  0.5000000  0.000000000  1.0000000
  1       21.052711  0.5000000  0.000000000  1.0000000
  1       26.315863  0.5000000  0.000000000  1.0000000
  1       31.579016  0.5000000  0.000000000  1.0000000
  1       36.842168  0.5000000  0.000000000  1.0000000
  1       42.105321  0.5000000  0.000000000  1.0000000
  1       47.368474  0.5000000  0.000000000  1.0000000
  1       52.631626  0.5000000  0.000000000  1.0000000
  1       57.894779  0.5000000  0.000000000  1.0000000
  1       63.157932  0.5000000  0.000000000  1.0000000
  1       68.421084  0.5000000  0.000000000  1.0000000
  1       73.684237  0.5000000  0.000000000  1.0000000
  1       78.947389  0.5000000  0.000000000  1.0000000
  1       84.210542  0.5000000  0.000000000  1.0000000
  1       89.473695  0.5000000  0.000000000  1.0000000
  1       94.736847  0.5000000  0.000000000  1.0000000
  1      100.000000  0.5000000  0.000000000  1.0000000

ROC was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0 and lambda = 84.21054.
> # Print maximum ROC statistic
> max(model[["results"]][["ROC"]])
[1] 0.6625456
> plot(model)

> plot(model$finalModel)

> # getting coefficients
> coef(model$finalModel, model$bestTune$lambda) 
32 x 1 sparse Matrix of class "dgCMatrix"
                             1
(Intercept)       1.547866e-01
age              -5.052402e-04
anysubstatus      7.175112e-05
anysubyes         7.173183e-05
cesd             -5.094988e-04
d1               -3.875314e-04
daysanysub        4.354365e-05
dayslink          1.266469e-04
drugrisk         -2.331586e-04
e2b              -9.065190e-04
female            5.748621e-04
sexmale          -5.747121e-04
g1byes           -7.787904e-04
i1               -1.392394e-03
i2               -1.147066e-03
id               -2.852674e-04
indtot           -1.268734e-03
linkstatus       -2.771621e-04
linkyes          -2.770897e-04
mcs               3.990764e-04
pcs               5.605130e-04
pss_fr            1.019032e-03
racegrphispanic   8.851136e-05
racegrpother     -9.493057e-07
racegrpwhite     -7.793520e-04
satreatyes       -4.924555e-04
sexrisk          -5.681747e-04
substancecocaine  6.125276e-04
substanceheroin   5.974870e-04
treatyes          1.996529e-04
avg_drinks       -1.391529e-03
max_drinks       -1.146225e-03

1.4 Working with ISLR data

1.4.1 First Ridge

> library(ISLR)
> library(caret)
> Hitters <- na.omit(Hitters)
> summary(Hitters)
     AtBat            Hits           HmRun            Runs       
 Min.   : 19.0   Min.   :  1.0   Min.   : 0.00   Min.   :  0.00  
 1st Qu.:282.5   1st Qu.: 71.5   1st Qu.: 5.00   1st Qu.: 33.50  
 Median :413.0   Median :103.0   Median : 9.00   Median : 52.00  
 Mean   :403.6   Mean   :107.8   Mean   :11.62   Mean   : 54.75  
 3rd Qu.:526.0   3rd Qu.:141.5   3rd Qu.:18.00   3rd Qu.: 73.00  
 Max.   :687.0   Max.   :238.0   Max.   :40.00   Max.   :130.00  
      RBI             Walks            Years            CAtBat       
 Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0  
 1st Qu.: 30.00   1st Qu.: 23.00   1st Qu.: 4.000   1st Qu.:  842.5  
 Median : 47.00   Median : 37.00   Median : 6.000   Median : 1931.0  
 Mean   : 51.49   Mean   : 41.11   Mean   : 7.312   Mean   : 2657.5  
 3rd Qu.: 71.00   3rd Qu.: 57.00   3rd Qu.:10.000   3rd Qu.: 3890.5  
 Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0  
     CHits            CHmRun           CRuns             CRBI       
 Min.   :   4.0   Min.   :  0.00   Min.   :   2.0   Min.   :   3.0  
 1st Qu.: 212.0   1st Qu.: 15.00   1st Qu.: 105.5   1st Qu.:  95.0  
 Median : 516.0   Median : 40.00   Median : 250.0   Median : 230.0  
 Mean   : 722.2   Mean   : 69.24   Mean   : 361.2   Mean   : 330.4  
 3rd Qu.:1054.0   3rd Qu.: 92.50   3rd Qu.: 497.5   3rd Qu.: 424.5  
 Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.0  
     CWalks       League  Division    PutOuts          Assists     
 Min.   :   1.0   A:139   E:129    Min.   :   0.0   Min.   :  0.0  
 1st Qu.:  71.0   N:124   W:134    1st Qu.: 113.5   1st Qu.:  8.0  
 Median : 174.0                    Median : 224.0   Median : 45.0  
 Mean   : 260.3                    Mean   : 290.7   Mean   :118.8  
 3rd Qu.: 328.5                    3rd Qu.: 322.5   3rd Qu.:192.0  
 Max.   :1566.0                    Max.   :1377.0   Max.   :492.0  
     Errors           Salary       NewLeague
 Min.   : 0.000   Min.   :  67.5   A:141    
 1st Qu.: 3.000   1st Qu.: 190.0   N:122    
 Median : 7.000   Median : 425.0            
 Mean   : 8.593   Mean   : 535.9            
 3rd Qu.:13.000   3rd Qu.: 750.0            
 Max.   :32.000   Max.   :2460.0            
> dim(Hitters)
[1] 263  20
> set.seed(1)
> myControl <- trainControl(
+   method = "cv", 
+   number = 10
+ )
> hit.model <- train(
+   Salary ~ ., data = Hitters,
+   tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(3, -2, length = 100)),
+   method = "glmnet",
+   trControl = myControl
+ )
> hit.model
glmnet 

263 samples
 19 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 235, 237, 238, 236, 237, 237, ... 
Resampling results across tuning parameters:

  lambda        RMSE      Rsquared   MAE     
  1.000000e-02  326.1370  0.4702243  232.7176
  1.123324e-02  326.1370  0.4702243  232.7176
  1.261857e-02  326.1370  0.4702243  232.7176
  1.417474e-02  326.1370  0.4702243  232.7176
  1.592283e-02  326.1370  0.4702243  232.7176
  1.788650e-02  326.1370  0.4702243  232.7176
  2.009233e-02  326.1370  0.4702243  232.7176
  2.257020e-02  326.1370  0.4702243  232.7176
  2.535364e-02  326.1370  0.4702243  232.7176
  2.848036e-02  326.1370  0.4702243  232.7176
  3.199267e-02  326.1370  0.4702243  232.7176
  3.593814e-02  326.1370  0.4702243  232.7176
  4.037017e-02  326.1370  0.4702243  232.7176
  4.534879e-02  326.1370  0.4702243  232.7176
  5.094138e-02  326.1370  0.4702243  232.7176
  5.722368e-02  326.1370  0.4702243  232.7176
  6.428073e-02  326.1370  0.4702243  232.7176
  7.220809e-02  326.1370  0.4702243  232.7176
  8.111308e-02  326.1370  0.4702243  232.7176
  9.111628e-02  326.1370  0.4702243  232.7176
  1.023531e-01  326.1370  0.4702243  232.7176
  1.149757e-01  326.1370  0.4702243  232.7176
  1.291550e-01  326.1370  0.4702243  232.7176
  1.450829e-01  326.1370  0.4702243  232.7176
  1.629751e-01  326.1370  0.4702243  232.7176
  1.830738e-01  326.1370  0.4702243  232.7176
  2.056512e-01  326.1370  0.4702243  232.7176
  2.310130e-01  326.1370  0.4702243  232.7176
  2.595024e-01  326.1370  0.4702243  232.7176
  2.915053e-01  326.1370  0.4702243  232.7176
  3.274549e-01  326.1370  0.4702243  232.7176
  3.678380e-01  326.1370  0.4702243  232.7176
  4.132012e-01  326.1370  0.4702243  232.7176
  4.641589e-01  326.1370  0.4702243  232.7176
  5.214008e-01  326.1370  0.4702243  232.7176
  5.857021e-01  326.1370  0.4702243  232.7176
  6.579332e-01  326.1370  0.4702243  232.7176
  7.390722e-01  326.1370  0.4702243  232.7176
  8.302176e-01  326.1370  0.4702243  232.7176
  9.326033e-01  326.1370  0.4702243  232.7176
  1.047616e+00  326.1370  0.4702243  232.7176
  1.176812e+00  326.1370  0.4702243  232.7176
  1.321941e+00  326.1370  0.4702243  232.7176
  1.484968e+00  326.1370  0.4702243  232.7176
  1.668101e+00  326.1370  0.4702243  232.7176
  1.873817e+00  326.1370  0.4702243  232.7176
  2.104904e+00  326.1370  0.4702243  232.7176
  2.364489e+00  326.1370  0.4702243  232.7176
  2.656088e+00  326.1370  0.4702243  232.7176
  2.983647e+00  326.1370  0.4702243  232.7176
  3.351603e+00  326.1370  0.4702243  232.7176
  3.764936e+00  326.1370  0.4702243  232.7176
  4.229243e+00  326.1370  0.4702243  232.7176
  4.750810e+00  326.1370  0.4702243  232.7176
  5.336699e+00  326.1370  0.4702243  232.7176
  5.994843e+00  326.1370  0.4702243  232.7176
  6.734151e+00  326.1370  0.4702243  232.7176
  7.564633e+00  326.1370  0.4702243  232.7176
  8.497534e+00  326.1370  0.4702243  232.7176
  9.545485e+00  326.1370  0.4702243  232.7176
  1.072267e+01  326.1370  0.4702243  232.7176
  1.204504e+01  326.1370  0.4702243  232.7176
  1.353048e+01  326.1370  0.4702243  232.7176
  1.519911e+01  326.1370  0.4702243  232.7176
  1.707353e+01  326.1370  0.4702243  232.7176
  1.917910e+01  326.1370  0.4702243  232.7176
  2.154435e+01  326.1370  0.4702243  232.7176
  2.420128e+01  326.1514  0.4701971  232.7182
  2.718588e+01  326.1457  0.4703697  232.5095
  3.053856e+01  326.1902  0.4702580  232.2813
  3.430469e+01  326.2403  0.4701406  232.0321
  3.853529e+01  326.2892  0.4700387  231.7596
  4.328761e+01  326.3342  0.4699625  231.4765
  4.862602e+01  326.3743  0.4699165  231.1753
  5.462277e+01  326.4152  0.4698879  230.8785
  6.135907e+01  326.4545  0.4698845  230.6108
  6.892612e+01  326.4863  0.4699266  230.3380
  7.742637e+01  326.5160  0.4700038  230.0882
  8.697490e+01  326.5473  0.4701094  229.9021
  9.770100e+01  326.5785  0.4702506  229.7382
  1.097499e+02  326.6129  0.4704264  229.6020
  1.232847e+02  326.6534  0.4706327  229.4659
  1.384886e+02  326.7068  0.4708540  229.3308
  1.555676e+02  326.7761  0.4710941  229.2008
  1.747528e+02  326.8627  0.4713565  229.0840
  1.963041e+02  326.9766  0.4716209  229.0076
  2.205131e+02  327.1242  0.4718747  228.9716
  2.477076e+02  327.3120  0.4721262  228.9740
  2.782559e+02  327.5442  0.4723607  229.1073
  3.125716e+02  327.8324  0.4725723  229.4141
  3.511192e+02  328.1865  0.4727470  229.7859
  3.944206e+02  328.6129  0.4728922  230.3146
  4.430621e+02  329.1222  0.4729971  230.9594
  4.977024e+02  329.7271  0.4730507  231.7418
  5.590810e+02  330.4383  0.4730559  232.5937
  6.280291e+02  331.2665  0.4730212  233.5686
  7.054802e+02  332.2254  0.4729347  234.6652
  7.924829e+02  333.3302  0.4727990  235.9015
  8.902151e+02  334.5908  0.4726197  237.4501
  1.000000e+03  336.0245  0.4723993  239.3706

Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 21.54435.
> coef(hit.model$finalModel, s = 11498)
20 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept) 407.205936774
AtBat         0.037003083
Hits          0.138357552
HmRun         0.525195508
Runs          0.230978290
RBI           0.240114775
Walks         0.289971555
Years         1.108832399
CAtBat        0.003135215
CHits         0.011666684
CHmRun        0.087642789
CRuns         0.023406258
CRBI          0.024165723
CWalks        0.025042117
LeagueN       0.086629234
DivisionW    -6.225431332
PutOuts       0.016506596
Assists       0.002616335
Errors       -0.020564158
NewLeagueN    0.302922899
> coef(hit.model$finalModel, s = 705) 
20 x 1 sparse Matrix of class "dgCMatrix"
                       1
(Intercept)  54.39019010
AtBat         0.11199103
Hits          0.65618308
HmRun         1.17557058
Runs          0.93745780
RBI           0.84682838
Walks         1.32044897
Years         2.58210272
CAtBat        0.01082850
CHits         0.04681147
CHmRun        0.33843942
CRuns         0.09372165
CRBI          0.09788930
CWalks        0.07179422
LeagueN      13.69631793
DivisionW   -54.71693537
PutOuts       0.11864655
Assists       0.01610360
Errors       -0.70439782
NewLeagueN    8.60936304
> coef(hit.model$finalModel, s = 50) 
20 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)  4.821654e+01
AtBat       -3.538650e-01
Hits         1.953167e+00
HmRun       -1.285127e+00
Runs         1.156329e+00
RBI          8.087771e-01
Walks        2.709765e+00
Years       -6.202919e+00
CAtBat       6.085854e-03
CHits        1.070832e-01
CHmRun       6.290984e-01
CRuns        2.172926e-01
CRBI         2.152888e-01
CWalks      -1.488961e-01
LeagueN      4.586262e+01
DivisionW   -1.182304e+02
PutOuts      2.501647e-01
Assists      1.208491e-01
Errors      -3.277073e+00
NewLeagueN  -9.423459e+00
> plot(hit.model)

> coef(hit.model$finalModel, hit.model$bestTune$lambda) 
20 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept)  8.112693e+01
AtBat       -6.815959e-01
Hits         2.772312e+00
HmRun       -1.365680e+00
Runs         1.014826e+00
RBI          7.130224e-01
Walks        3.378558e+00
Years       -9.066800e+00
CAtBat      -1.199478e-03
CHits        1.361029e-01
CHmRun       6.979958e-01
CRuns        2.958896e-01
CRBI         2.570711e-01
CWalks      -2.789666e-01
LeagueN      5.321272e+01
DivisionW   -1.228345e+02
PutOuts      2.638876e-01
Assists      1.698796e-01
Errors      -3.685645e+00
NewLeagueN  -1.810510e+01
> hit.model$bestTune$lambda
[1] 21.54435

Following ISLR, the data set is split into a training and a test set in order to estimate the error of each method.

> library(caret)
> set.seed(52)
> trainIndex <- createDataPartition(y = Hitters$Salary, p = 0.5,
+                                 list = FALSE,
+                                 times = 1)
> HittersTrain <- Hitters[trainIndex, ]
> HittersTest <- Hitters[-trainIndex, ]
> dim(HittersTrain)
[1] 133  20
> dim(HittersTest)
[1] 130  20
> myControl <- trainControl(
+   method = "repeatedcv", 
+   repeats = 3
+ )
> hit.modelT <- train(
+   Salary ~ ., data = HittersTrain,
+   tuneGrid = expand.grid(alpha = 0, lambda = 10^seq(3, -2, length = 100)),
+   method = "glmnet",
+   trControl = myControl
+ )
> hit.modelT
glmnet 

133 samples
 19 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 121, 119, 118, 119, 120, 119, ... 
Resampling results across tuning parameters:

  lambda        RMSE      Rsquared   MAE     
  1.000000e-02  321.2987  0.5202461  238.3084
  1.123324e-02  321.2987  0.5202461  238.3084
  1.261857e-02  321.2987  0.5202461  238.3084
  1.417474e-02  321.2987  0.5202461  238.3084
  1.592283e-02  321.2987  0.5202461  238.3084
  1.788650e-02  321.2987  0.5202461  238.3084
  2.009233e-02  321.2987  0.5202461  238.3084
  2.257020e-02  321.2987  0.5202461  238.3084
  2.535364e-02  321.2987  0.5202461  238.3084
  2.848036e-02  321.2987  0.5202461  238.3084
  3.199267e-02  321.2987  0.5202461  238.3084
  3.593814e-02  321.2987  0.5202461  238.3084
  4.037017e-02  321.2987  0.5202461  238.3084
  4.534879e-02  321.2987  0.5202461  238.3084
  5.094138e-02  321.2987  0.5202461  238.3084
  5.722368e-02  321.2987  0.5202461  238.3084
  6.428073e-02  321.2987  0.5202461  238.3084
  7.220809e-02  321.2987  0.5202461  238.3084
  8.111308e-02  321.2987  0.5202461  238.3084
  9.111628e-02  321.2987  0.5202461  238.3084
  1.023531e-01  321.2987  0.5202461  238.3084
  1.149757e-01  321.2987  0.5202461  238.3084
  1.291550e-01  321.2987  0.5202461  238.3084
  1.450829e-01  321.2987  0.5202461  238.3084
  1.629751e-01  321.2987  0.5202461  238.3084
  1.830738e-01  321.2987  0.5202461  238.3084
  2.056512e-01  321.2987  0.5202461  238.3084
  2.310130e-01  321.2987  0.5202461  238.3084
  2.595024e-01  321.2987  0.5202461  238.3084
  2.915053e-01  321.2987  0.5202461  238.3084
  3.274549e-01  321.2987  0.5202461  238.3084
  3.678380e-01  321.2987  0.5202461  238.3084
  4.132012e-01  321.2987  0.5202461  238.3084
  4.641589e-01  321.2987  0.5202461  238.3084
  5.214008e-01  321.2987  0.5202461  238.3084
  5.857021e-01  321.2987  0.5202461  238.3084
  6.579332e-01  321.2987  0.5202461  238.3084
  7.390722e-01  321.2987  0.5202461  238.3084
  8.302176e-01  321.2987  0.5202461  238.3084
  9.326033e-01  321.2987  0.5202461  238.3084
  1.047616e+00  321.2987  0.5202461  238.3084
  1.176812e+00  321.2987  0.5202461  238.3084
  1.321941e+00  321.2987  0.5202461  238.3084
  1.484968e+00  321.2987  0.5202461  238.3084
  1.668101e+00  321.2987  0.5202461  238.3084
  1.873817e+00  321.2987  0.5202461  238.3084
  2.104904e+00  321.2987  0.5202461  238.3084
  2.364489e+00  321.2987  0.5202461  238.3084
  2.656088e+00  321.2987  0.5202461  238.3084
  2.983647e+00  321.2987  0.5202461  238.3084
  3.351603e+00  321.2987  0.5202461  238.3084
  3.764936e+00  321.2987  0.5202461  238.3084
  4.229243e+00  321.2987  0.5202461  238.3084
  4.750810e+00  321.2987  0.5202461  238.3084
  5.336699e+00  321.2987  0.5202461  238.3084
  5.994843e+00  321.2987  0.5202461  238.3084
  6.734151e+00  321.2987  0.5202461  238.3084
  7.564633e+00  321.2987  0.5202461  238.3084
  8.497534e+00  321.2987  0.5202461  238.3084
  9.545485e+00  321.2987  0.5202461  238.3084
  1.072267e+01  321.2987  0.5202461  238.3084
  1.204504e+01  321.2987  0.5202461  238.3084
  1.353048e+01  321.2987  0.5202461  238.3084
  1.519911e+01  321.2987  0.5202461  238.3084
  1.707353e+01  321.2987  0.5202461  238.3084
  1.917910e+01  321.2987  0.5202461  238.3084
  2.154435e+01  321.2987  0.5202461  238.3084
  2.420128e+01  321.2952  0.5202716  238.3122
  2.718588e+01  320.9940  0.5206662  238.0906
  3.053856e+01  320.4549  0.5212331  237.7615
  3.430469e+01  319.9159  0.5218472  237.4002
  3.853529e+01  319.3814  0.5224950  237.0234
  4.328761e+01  318.8385  0.5232110  236.5997
  4.862602e+01  318.3038  0.5239649  236.1519
  5.462277e+01  317.7589  0.5247870  235.6663
  6.135907e+01  317.2143  0.5256686  235.1726
  6.892612e+01  316.6696  0.5266046  234.6630
  7.742637e+01  316.1244  0.5276052  234.1716
  8.697490e+01  315.5830  0.5286521  233.6538
  9.770100e+01  315.0459  0.5297588  233.1027
  1.097499e+02  314.5231  0.5308968  232.5554
  1.232847e+02  314.0164  0.5320742  231.9891
  1.384886e+02  313.5346  0.5332739  231.4450
  1.555676e+02  313.0803  0.5344948  230.8870
  1.747528e+02  312.6684  0.5357123  230.3830
  1.963041e+02  312.3063  0.5369151  229.9022
  2.205131e+02  312.0036  0.5380941  229.4195
  2.477076e+02  311.7692  0.5392476  228.9595
  2.782559e+02  311.6191  0.5403385  228.5224
  3.125716e+02  311.5661  0.5413587  228.1176
  3.511192e+02  311.6207  0.5423086  227.7972
  3.944206e+02  311.7973  0.5431752  227.6065
  4.430621e+02  312.1111  0.5439354  227.5245
  4.977024e+02  312.5786  0.5445833  227.6521
  5.590810e+02  313.2115  0.5451243  227.9577
  6.280291e+02  314.0241  0.5455579  228.3820
  7.054802e+02  315.0340  0.5458633  229.1220
  7.924829e+02  316.2548  0.5460531  230.1591
  8.902151e+02  317.7011  0.5461314  231.4492
  1.000000e+03  319.3866  0.5461103  232.9972

Tuning parameter 'alpha' was held constant at a value of 0
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0 and lambda = 312.5716.
> plot(hit.modelT)

> coef(hit.modelT$finalModel, hit.modelT$bestTune$lambda) 
20 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept) -1.247903e+02
AtBat        1.459814e-01
Hits         1.081891e+00
HmRun        1.698469e+00
Runs         1.460525e+00
RBI          1.253918e+00
Walks        1.812634e+00
Years        3.240116e+00
CAtBat       1.105952e-02
CHits        5.240973e-02
CHmRun       1.439892e-01
CRuns        1.036923e-01
CRBI         8.809254e-02
CWalks       3.357095e-02
LeagueN      3.943543e+01
DivisionW   -9.517743e+01
PutOuts      2.307162e-01
Assists     -6.219518e-03
Errors       4.962977e-01
NewLeagueN   1.523493e+01
> # Test RMSE error
> preds <- predict(hit.modelT, HittersTest)
> TestRMSE1 <- sqrt(mean((HittersTest$Salary - preds)^2))
> TestRMSE1
[1] 365.7261

1.4.2 Next Lasso

> set.seed(14)
> myControl <- trainControl(
+   method = "repeatedcv", 
+   repeats = 3
+ )
> hit.modelT1 <- train(
+   Salary ~ ., data = HittersTrain,
+   tuneGrid = expand.grid(alpha = 1, lambda = 10^seq(3, -2, length = 100)),
+   method = "glmnet",
+   trControl = myControl
+ )
> hit.modelT1
glmnet 

133 samples
 19 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 120, 120, 119, 120, 120, 118, ... 
Resampling results across tuning parameters:

  lambda        RMSE      Rsquared   MAE     
  1.000000e-02  344.9567  0.5317168  251.2481
  1.123324e-02  344.9567  0.5317168  251.2481
  1.261857e-02  344.9567  0.5317168  251.2481
  1.417474e-02  344.9567  0.5317168  251.2481
  1.592283e-02  344.9567  0.5317168  251.2481
  1.788650e-02  344.9567  0.5317168  251.2481
  2.009233e-02  344.9567  0.5317168  251.2481
  2.257020e-02  344.9567  0.5317168  251.2481
  2.535364e-02  344.9484  0.5317445  251.2406
  2.848036e-02  344.9330  0.5317989  251.2263
  3.199267e-02  344.9155  0.5318604  251.2086
  3.593814e-02  344.8958  0.5319292  251.1888
  4.037017e-02  344.8740  0.5320059  251.1666
  4.534879e-02  344.8490  0.5320934  251.1410
  5.094138e-02  344.8214  0.5321900  251.1124
  5.722368e-02  344.7907  0.5322978  251.0802
  6.428073e-02  344.7565  0.5324185  251.0441
  7.220809e-02  344.7144  0.5325631  251.0005
  8.111308e-02  344.6602  0.5327433  250.9457
  9.111628e-02  344.6019  0.5329387  250.8865
  1.023531e-01  344.5408  0.5331493  250.8245
  1.149757e-01  344.4645  0.5334218  250.7559
  1.291550e-01  344.4250  0.5337040  250.6943
  1.450829e-01  344.3778  0.5339582  250.6183
  1.629751e-01  344.3072  0.5342806  250.5108
  1.830738e-01  344.2157  0.5347082  250.4100
  2.056512e-01  344.1290  0.5351863  250.3028
  2.310130e-01  344.0172  0.5356807  250.1652
  2.595024e-01  343.8899  0.5363539  250.0190
  2.915053e-01  343.7420  0.5371636  249.8460
  3.274549e-01  343.6231  0.5380336  249.6046
  3.678380e-01  343.5535  0.5383117  249.5622
  4.132012e-01  343.2964  0.5390409  249.3999
  4.641589e-01  343.1303  0.5395689  249.3928
  5.214008e-01  342.9319  0.5401276  249.3993
  5.857021e-01  342.6649  0.5408087  249.3580
  6.579332e-01  342.4264  0.5414772  249.4000
  7.390722e-01  342.0848  0.5423246  249.4004
  8.302176e-01  341.6879  0.5432463  249.4128
  9.326033e-01  341.0577  0.5448035  249.2220
  1.047616e+00  340.0577  0.5471681  248.7181
  1.176812e+00  338.8994  0.5498354  248.1312
  1.321941e+00  337.6266  0.5527886  247.4105
  1.484968e+00  336.2798  0.5554993  246.6158
  1.668101e+00  335.0729  0.5574949  245.9131
  1.873817e+00  333.9522  0.5595600  245.2620
  2.104904e+00  332.7679  0.5617792  244.5384
  2.364489e+00  331.5553  0.5637075  243.9282
  2.656088e+00  330.3735  0.5652986  243.2993
  2.983647e+00  329.2407  0.5667164  242.6740
  3.351603e+00  328.3826  0.5677183  242.0551
  3.764936e+00  327.5944  0.5685767  241.4207
  4.229243e+00  326.8148  0.5693254  240.7473
  4.750810e+00  326.0566  0.5698774  240.1645
  5.336699e+00  325.5196  0.5698206  239.7754
  5.994843e+00  325.0129  0.5698516  239.3582
  6.734151e+00  324.4087  0.5701554  239.0325
  7.564633e+00  323.3235  0.5712337  238.4229
  8.497534e+00  321.7210  0.5732175  237.3059
  9.545485e+00  320.6527  0.5744247  236.5642
  1.072267e+01  320.0481  0.5748184  236.0644
  1.204504e+01  319.7647  0.5749693  235.7334
  1.353048e+01  319.6746  0.5751985  235.4524
  1.519911e+01  319.6609  0.5752208  235.2205
  1.707353e+01  319.7615  0.5749260  235.1216
  1.917910e+01  319.9474  0.5745316  235.0700
  2.154435e+01  320.2151  0.5739632  235.0588
  2.420128e+01  320.5658  0.5732442  235.1808
  2.718588e+01  321.0514  0.5723001  235.4514
  3.053856e+01  321.5387  0.5712260  235.7161
  3.430469e+01  322.0805  0.5702089  235.9398
  3.853529e+01  322.7885  0.5690059  236.2451
  4.328761e+01  323.8444  0.5670850  236.6879
  4.862602e+01  325.3568  0.5642881  237.5184
  5.462277e+01  327.3306  0.5606403  238.7807
  6.135907e+01  329.8954  0.5558401  240.7357
  6.892612e+01  333.1740  0.5497009  243.6548
  7.742637e+01  336.9878  0.5434489  247.8263
  8.697490e+01  341.3459  0.5373413  252.6921
  9.770100e+01  346.2057  0.5326136  258.4168
  1.097499e+02  352.3579  0.5261646  265.7827
  1.232847e+02  360.1852  0.5170375  274.7528
  1.384886e+02  369.9365  0.5033856  285.3332
  1.555676e+02  381.7669  0.4812320  297.1868
  1.747528e+02  395.0145  0.4516864  309.5694
  1.963041e+02  408.9939  0.4208011  322.8322
  2.205131e+02  423.4003  0.3716421  336.2986
  2.477076e+02  435.0806  0.2665585  346.2688
  2.782559e+02  437.3556  0.0412020  348.6942
  3.125716e+02  437.4395        NaN  348.8411
  3.511192e+02  437.4395        NaN  348.8411
  3.944206e+02  437.4395        NaN  348.8411
  4.430621e+02  437.4395        NaN  348.8411
  4.977024e+02  437.4395        NaN  348.8411
  5.590810e+02  437.4395        NaN  348.8411
  6.280291e+02  437.4395        NaN  348.8411
  7.054802e+02  437.4395        NaN  348.8411
  7.924829e+02  437.4395        NaN  348.8411
  8.902151e+02  437.4395        NaN  348.8411
  1.000000e+03  437.4395        NaN  348.8411

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 15.19911.
> plot(hit.modelT1)

> coef(hit.modelT1$finalModel, hit.model$bestTune$lambda)
20 x 1 sparse Matrix of class "dgCMatrix"
                        1
(Intercept) -101.08611784
AtBat          .         
Hits           2.12682293
HmRun          .         
Runs           .         
RBI            1.52192252
Walks          2.18699728
Years          .         
CAtBat         .         
CHits          .         
CHmRun         .         
CRuns          0.40146625
CRBI           0.06345388
CWalks         .         
LeagueN       28.44971454
DivisionW   -109.81369939
PutOuts        0.31513067
Assists        .         
Errors         .         
NewLeagueN     .         
> # Test RMSE error
> preds1 <- predict(hit.modelT1, HittersTest)
> TestRMSE2 <- sqrt(mean((HittersTest$Salary - preds1)^2))
> TestRMSE2
[1] 365.8057

Note the test RMSE of the ridge model is 365.7260832 while the test RMSE of the lasso model is 365.8056865. In this scenario, the lasso has an advantage over the ridge model in terms of parsimony because 11 of the coefficients in the lasso model are exactly zero.

1.5 Dichotomous Response Variable (Logistic Regression)

(Example taken from An Introduction to Statistical Learning)

Goal: Predict whether an individual will default on his or her credit card payment.

> library(ISLR)
> head(Default)
  default student   balance    income
1      No      No  729.5265 44361.625
2      No     Yes  817.1804 12106.135
3      No      No 1073.5492 31767.139
4      No      No  529.2506 35704.494
5      No      No  785.6559 38463.496
6      No     Yes  919.5885  7491.559
> ggplot(data = Default, aes(x = balance, y = income, colour = default)) + 
+   geom_point(alpha = 0.8) + 
+   theme_bw() + 
+   scale_color_brewer(palette=7)

> ggplot(data = Default, aes(x = default, y = balance, fill = default)) + 
+   geom_boxplot() + 
+   theme_bw() + 
+   guides(fill = FALSE) + 
+   scale_fill_brewer(palette=7)

> ggplot(data = Default, aes(x = default, y = income, fill = default)) + 
+   geom_boxplot() + 
+   theme_bw() + 
+   guides(fill = FALSE) + 
+   scale_fill_brewer(palette=7)

Why not regression?

> Default$defaultN <- ifelse(Default$default == "No", 0, 1)
> Default$studentN <- ifelse(Default$student =="No", 0, 1)
> summary(Default$defaultN)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.0000  0.0000  0.0000  0.0333  0.0000  1.0000 
> head(Default)
  default student   balance    income defaultN studentN
1      No      No  729.5265 44361.625        0        0
2      No     Yes  817.1804 12106.135        0        1
3      No      No 1073.5492 31767.139        0        0
4      No      No  529.2506 35704.494        0        0
5      No      No  785.6559 38463.496        0        0
6      No     Yes  919.5885  7491.559        0        1
> ggplot(data = Default, aes(x = balance, y = defaultN)) + 
+   geom_point(alpha = 0.5) + 
+   theme_bw() + 
+   stat_smooth(method = "lm") +
+   labs(y = "Probability of Default")

Some estimated probabilities are negative! For balances close to zero, we predict a negative probability of default. For very large balances, we get values greater than 1.

1.6 Logistic regression

What we need is a function of \(X\) that returns values between 0 and 1. Consider the logistic function

\[p(X) = \frac{e^{\beta_0 + \beta_1X}}{1 + e^{\beta_0 + \beta_1X}}\]

which will always produce an S-shaped curve regardless of the value of \(X\).

Let \(y = \beta_0 + \beta_1X\), then

\(p(X)=\frac{e^y}{1 + e^y} \rightarrow p(X) + e^{y}p(X) = e^{y} \rightarrow p(X) = e^{y} - e^{y}p(X) \rightarrow \frac{p(X)}{1 - p(X)}=e^{y} = e^{\beta_0 + \beta_1X}\)

\[\text{log}\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1X\]

The quantity \(\text{log}\left(\frac{p(X)}{1 - p(X)}\right)\) is called the log-odds or logit, and the quantity \(\frac{p(X)}{1 - p(X)}\) is called the odds.

NOTE: In regression, \(\beta_1\) gives the average change in \(Y\) associated with a one-unit increase in \(X\). In a logistic regression model, increasing \(X\) by one unit changes the log odds by \(\beta_1\), or equivalently it multiplies the odds by \(e^{\beta_1}\).

> ggplot(data = Default, aes(x = balance, y = defaultN)) + 
+   geom_point(alpha = 0.5) + 
+   theme_bw() + 
+   stat_smooth(method = "glm", family = "binomial") +
+   labs(y = "Probability of Default")

The probability of default given balance can be written as P(default = Yes | balance).

> log.mod <- glm(default ~ balance, data = Default, family = "binomial")
> summary(log.mod)

Call:
glm(formula = default ~ balance, family = "binomial", data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2697  -0.1465  -0.0589  -0.0221   3.7589  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.065e+01  3.612e-01  -29.49   <2e-16 ***
balance      5.499e-03  2.204e-04   24.95   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1596.5  on 9998  degrees of freedom
AIC: 1600.5

Number of Fisher Scoring iterations: 8

From the output note that \(\hat{\beta} = 0.0054989\). This value indicates that an increase in balance is associated with an increase in the log odds of default by 0.0054989 units.

1.7 Making predictions

Using the estimated coefficients, the predicted probability of default for an individual with a balance of $1,500 is

\[\hat{p}(X) = \frac{e^{\hat{\beta_0} + \hat{\beta}_1X}}{1 + e^{\hat{\beta_0} + \hat{\beta_1}X}} = \frac{e^{-10.6513306+ 0.0054989 \times 1500}}{1 + e^{-10.6513306+ 0.0054989 \times 1500}} = 0.0829476.\]

> predict(log.mod, newdata = data.frame(balance = 1500), type = "response")
         1 
0.08294762 

For an individual with a balance of $2,500, the probability of default is

\[p(X) = 0.9567259\].

> predict(log.mod, newdata = data.frame(balance = 2500), type = "response")
        1 
0.9567259 

Are students more likely to default than non-students?

> mod2 <- glm(default ~ student, data = Default, family = "binomial")
> summary(mod2)

Call:
glm(formula = default ~ student, family = "binomial", data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.2970  -0.2970  -0.2434  -0.2434   2.6585  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.50413    0.07071  -49.55  < 2e-16 ***
studentYes   0.40489    0.11502    3.52 0.000431 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 2908.7  on 9998  degrees of freedom
AIC: 2912.7

Number of Fisher Scoring iterations: 6

Note that the coefficient associated with studentYes is positive, and the associated p-value is statistically significant. This indicates that students tend to have higher default probabilities than non-students:

\[\widehat{Pr}(\text{default = Yes }|\text{ student = Yes}) = \frac{e^{-3.5041278+ 0.4048871 \times 1}}{1 + e^{-3.5041278+ 0.4048871 \times 1}} = 0.0431386,\]

> predict(mod2, newdata = data.frame(student = "Yes"), type = "response")
         1 
0.04313859 

\[\widehat{Pr}(\text{default = Yes }|\text{ student = No}) = \frac{e^{-3.5041278+ 0.4048871 \times 0}}{1 + e^{-3.5041278+ 0.4048871 \times 1}} = 0.029195.\]

> predict(mod2, newdata = data.frame(student = "No"), type = "response")
         1 
0.02919501 

1.8 Multiple Logistic Regression

Consider a model that uses balance, income, and student to predict default.

> mlog.mod <- glm(default ~ balance + income + student, data = Default, family = "binomial")
> summary(mlog.mod)

Call:
glm(formula = default ~ balance + income + student, family = "binomial", 
    data = Default)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.4691  -0.1418  -0.0557  -0.0203   3.7383  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.087e+01  4.923e-01 -22.080  < 2e-16 ***
balance      5.737e-03  2.319e-04  24.738  < 2e-16 ***
income       3.033e-06  8.203e-06   0.370  0.71152    
studentYes  -6.468e-01  2.363e-01  -2.738  0.00619 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2920.6  on 9999  degrees of freedom
Residual deviance: 1571.5  on 9996  degrees of freedom
AIC: 1579.5

Number of Fisher Scoring iterations: 8

There is a surprising result here. The p-value associated with balance and the student are very small, indicating that each of these variables is associated with the probability of default. However, the coefficient for student is negative, indicating that students are less likely to default than non-students. Yet, the coefficient in mod2 for student was positive. How is it possible for the variable student to be associated with an increase in probability in mod2 and a decrease in probability in mlog.mod?

> Default$PrDef <- exp(predict(mlog.mod))/(1 + exp(predict(mlog.mod)))
> # Or the following is the same
> Default$PRdef <- predict(mlog.mod, type = "response")
> ggplot(data = Default, aes(x = balance, y = default, color = student)) + 
+   geom_line(aes(x = balance, y = PrDef)) + 
+   scale_color_brewer(palette = 7) + 
+   theme_bw() +
+   labs(y = "Probability of Default")

> library(dplyr)
> Default <- Default %>% 
+   arrange(balance) %>% 
+   mutate(balanceFAC = cut(balance, breaks = seq(500, 2655, length.out = 10), include.lower = TRUE))
> head(Default)
  default student balance   income defaultN studentN        PrDef
1      No      No       0 29275.27        0        0 2.080617e-05
2      No     Yes       0 21871.07        0        1 1.065494e-05
3      No      No       0 50265.31        0        0 2.217400e-05
4      No      No       0 49956.58        0        0 2.215325e-05
5      No      No       0 32481.54        0        0 2.100952e-05
6      No      No       0 43658.23        0        0 2.173402e-05
         PRdef balanceFAC
1 2.080617e-05       <NA>
2 1.065494e-05       <NA>
3 2.217400e-05       <NA>
4 2.215325e-05       <NA>
5 2.100952e-05       <NA>
6 2.173402e-05       <NA>
> T1 <- xtabs(~balanceFAC + default, data = Default, subset = student == "Yes")
> T2 <- xtabs(~balanceFAC + default, data = Default, subset = student == "No")
> SDR  <- T1[, 2]/apply(T1[, 1:2], 1, sum)
> NSDR <- T2[, 2]/apply(T2[, 1:2], 1, sum)
> BalanceM <- c(500+120, 739+120, 979+120, 1220+120, 1460+120, 1700+120, 1940+120, 2180+120, 2420+120)
> DF <- data.frame(SDR, NSDR, BalanceM)
> DF
                            SDR        NSDR BalanceM
(500,739]           0.000000000 0.001571092      620
(739,979]           0.000000000 0.003007519      859
(979,1.22e+03]      0.005454545 0.012252592     1099
(1.22e+03,1.46e+03] 0.025000000 0.043923865     1340
(1.46e+03,1.7e+03]  0.085365854 0.140056022     1580
(1.7e+03,1.94e+03]  0.248407643 0.394736842     1820
(1.94e+03,2.18e+03] 0.553571429 0.760000000     2060
(2.18e+03,2.42e+03] 0.818181818 0.888888889     2300
(2.42e+03,2.66e+03] 1.000000000 1.000000000     2540
> ggplot(data = DF, aes(x = BalanceM, y = NSDR)) +
+   geom_line(color = "red") + 
+   geom_line(aes(x = BalanceM, y = SDR), color = "green") + 
+   theme_bw() +
+   labs(y = "Default Rate", x = "Credit Card Balance") 

1.9 Odds Ratio (OR)

The odds ratio, denoted OR, is the ratio of the odds for \(X = 1\) to the odds for \(X = 0\), and is given by the equation

\[\text{OR}=\frac{\frac{p(1)}{1 - p(1)}}{\frac{p(0)}{1 - p(0)}}.\]

Substituting
\[\frac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}\] for \(p(X=1)\), and \[\frac{e^{\beta_0}}{1 + e^{\beta_0}}\] for \(p(X=0)\), respectively, one can show that \[\text{OR} = e^{\beta_1}.\]

The odds ratio is widely used as a measure of association as it approximates how much more likely or unlikely (in terms of odds) it is for the outcome to be present among those subjects with \(X = 1\) as compared to those subjects with \(X = 0\).

Using the mlog.mod model, the response variable \(Y\) is default, which denotes whether a credit card holder has defaulted (Yes/1) or not defaulted (No/0) on his/her current payment. The independent variable \(X\) is student, which denotes whether the individual is a student (Yes/1) or is not a student (No/1). A hypothetical odds ratio in this scenario of 2 is interpreted to mean that the odds of default among students is two times greater than the odds of default among non-students.

An estimate of the odds ratio is

\[\widehat{\text{OR}} = e^{-0.6467758} = 0.5237317.\]

This is interpreted to mean that the odds of default among students is 0.5237317 the odds of default among non-students.