1 Advantages and Disadvantages of Trees

1.1 Pros

  • Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!

  • Some people believe that decision trees more closely mirror human decision-making than do regression and classification approaches.

  • Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).

  • Trees can easily handle qualitative predictors without the need to create dummy variables (model.matrix()).
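
As a small illustration of that last point, compare what a linear model needs (dummy variables built by model.matrix()) with how rpart() handles a factor directly. The toy data frame below is made up for this sketch.

> toy <- data.frame(y = c(10, 12, 20, 22),
+                   grade = factor(c("low", "low", "high", "high")))
> model.matrix(y ~ grade, data = toy)         # lm() needs this dummy coding
> library(rpart)
> rpart(y ~ grade, data = toy, minsplit = 2)  # a tree splits on the factor as is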

1.2 Cons

  • Trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches.

  • Trees suffer from high variance. This means if we split the training data into two parts at random, and fit a decision tree to both halves, the results that we get could be quite different. In contrast, a procedure with low variance will yield similar results if applied repeatedly to distinct data sets.
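
A quick way to see this instability (a sketch, assuming the Boston housing data from the MASS package used throughout the rest of this document; the seed is arbitrary) is to fit a tree to each random half of the data and compare the variable chosen for the first split.

> library(MASS)
> library(rpart)
> set.seed(2)
> half <- sample(1:nrow(Boston), nrow(Boston)/2)
> fit_a <- rpart(medv ~ ., data = Boston[half, ])
> fit_b <- rpart(medv ~ ., data = Boston[-half, ])
> as.character(fit_a$frame$var[1])  # variable used for the first split, first half
> as.character(fit_b$frame$var[1])  # variable used for the first split, second half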

However, by aggregating many decision trees, using methods like bagging, random forests, and boosting, the predictive performance of trees can be substantially improved.

  • Bagging (bootstrap aggregation) is a general-purpose procedure for reducing the variance of a statistical learning method.
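
Because bagging a regression tree is equivalent to growing a random forest that considers every predictor at each split, one way to bag trees for the Boston data used below is a randomForest() call with mtry set to the number of predictors (a sketch; the seed and number of trees are arbitrary).

> library(MASS)
> library(randomForest)
> set.seed(1)
> bag_boston <- randomForest(medv ~ ., data = Boston,
+                            mtry = 13, ntree = 500, importance = TRUE)
> bag_boston  # reports the OOB mean of squared residuals and % variance explained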

2 Training data set

> library(MASS)
> library(rpart)
> set.seed(8341)
> train <- sample(1:nrow(Boston), nrow(Boston)/2)

2.1 Create train1 and test1 with caret

> library(caret)
> set.seed(821)
> trainIndex <- createDataPartition(y = Boston$medv,
+                                   p = 0.50,
+                                   list = FALSE,
+                                   times = 1)
> train1 <- Boston[trainIndex, ]
> test1 <- Boston[-trainIndex, ]
> dim(train1)
[1] 254  14
> dim(test1)
[1] 252  14

2.1.1 Stepwise selection with caret

> fitControl <- trainControl(method = "repeatedcv",
+                           number = 10,
+                           repeats = 3,
+                           allowParallel = TRUE)
> set.seed(34)
> stepMod <- train(medv ~., 
+               data = train1,
+               method = "leapSeq",
+               trControl = fitControl,
+               verbose = FALSE)
> stepMod
Linear Regression with Stepwise Selection 

254 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 228, 228, 227, 229, 228, 230, ... 
Resampling results across tuning parameters:

  nvmax  RMSE      Rsquared   MAE     
  2      5.967332  0.5849737  4.290178
  3      5.539786  0.6402357  3.993155
  4      5.626944  0.6280172  3.992085

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 3.
> summary(stepMod)
Subset selection object
13 Variables  (and intercept)
        Forced in Forced out
crim        FALSE      FALSE
zn          FALSE      FALSE
indus       FALSE      FALSE
chas        FALSE      FALSE
nox         FALSE      FALSE
rm          FALSE      FALSE
age         FALSE      FALSE
dis         FALSE      FALSE
rad         FALSE      FALSE
tax         FALSE      FALSE
ptratio     FALSE      FALSE
black       FALSE      FALSE
lstat       FALSE      FALSE
1 subsets of each size up to 3
Selection Algorithm: 'sequential replacement'
         crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
1  ( 1 ) " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
2  ( 1 ) " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
3  ( 1 ) " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
> summary(stepMod$finalModel)
Subset selection object
13 Variables  (and intercept)
        Forced in Forced out
crim        FALSE      FALSE
zn          FALSE      FALSE
indus       FALSE      FALSE
chas        FALSE      FALSE
nox         FALSE      FALSE
rm          FALSE      FALSE
age         FALSE      FALSE
dis         FALSE      FALSE
rad         FALSE      FALSE
tax         FALSE      FALSE
ptratio     FALSE      FALSE
black       FALSE      FALSE
lstat       FALSE      FALSE
1 subsets of each size up to 3
Selection Algorithm: 'sequential replacement'
         crim zn  indus chas nox rm  age dis rad tax ptratio black lstat
1  ( 1 ) " "  " " " "   " "  " " " " " " " " " " " " " "     " "   "*"  
2  ( 1 ) " "  " " " "   " "  " " "*" " " " " " " " " " "     " "   "*"  
3  ( 1 ) " "  " " " "   " "  " " "*" " " " " " " " " "*"     " "   "*"  
> coef(stepMod$finalModel, id = 3)
(Intercept)          rm     ptratio       lstat 
 24.5726339   3.5454182  -0.8656869  -0.6481990 

2.1.1.1 Test

> yhat <- predict(stepMod, newdata = test1)
> RMSE <- sqrt(mean((test1$medv - yhat)^2))
> RMSE
[1] 4.885194

2.2 Fitting a linear model

> modstep <- stepAIC(lm(medv ~ ., data = Boston, subset = train), k = log(253))
Start:  AIC=849.55
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + 
    tax + ptratio + black + lstat

          Df Sum of Sq    RSS    AIC
- age      1      2.35 5353.6 844.13
- indus    1     16.66 5367.9 844.80
- zn       1     38.68 5390.0 845.84
- black    1     61.60 5412.9 846.91
- crim     1     62.41 5413.7 846.95
- tax      1     88.71 5440.0 848.17
<none>                 5351.3 849.55
- rad      1    161.68 5513.0 851.55
- nox      1    249.64 5600.9 855.55
- chas     1    275.33 5626.6 856.71
- dis      1    526.83 5878.1 867.77
- ptratio  1    611.25 5962.5 871.38
- lstat    1    993.46 6344.7 887.10
- rm       1   1910.15 7261.4 921.24

Step:  AIC=844.13
medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax + 
    ptratio + black + lstat

          Df Sum of Sq    RSS    AIC
- indus    1     16.70 5370.3 839.38
- zn       1     42.14 5395.8 840.58
- black    1     60.34 5414.0 841.43
- crim     1     62.07 5415.7 841.51
- tax      1     89.97 5443.6 842.81
<none>                 5353.6 844.13
- rad      1    168.47 5522.1 846.43
- chas     1    273.40 5627.0 851.19
- nox      1    291.26 5644.9 852.00
- dis      1    537.55 5891.2 862.80
- ptratio  1    619.39 5973.0 866.29
- lstat    1   1279.12 6632.7 892.80
- rm       1   1995.95 7349.6 918.76

Step:  AIC=839.38
medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
    black + lstat

          Df Sum of Sq    RSS    AIC
- zn       1     38.65 5409.0 835.66
- black    1     57.81 5428.1 836.56
- crim     1     65.43 5435.8 836.91
- tax      1     73.32 5443.6 837.28
<none>                 5370.3 839.38
- rad      1    152.09 5522.4 840.91
- nox      1    274.60 5644.9 846.46
- chas     1    283.71 5654.0 846.87
- ptratio  1    602.70 5973.0 860.76
- dis      1    632.29 6002.6 862.01
- lstat    1   1262.86 6633.2 887.28
- rm       1   1979.26 7349.6 913.23

Step:  AIC=835.66
medv ~ crim + chas + nox + rm + dis + rad + tax + ptratio + black + 
    lstat

          Df Sum of Sq    RSS    AIC
- crim     1     54.83 5463.8 832.68
- tax      1     57.07 5466.0 832.78
- black    1     59.53 5468.5 832.90
<none>                 5409.0 835.66
- rad      1    140.66 5549.6 836.62
- chas     1    277.59 5686.6 842.79
- nox      1    293.51 5702.5 843.50
- dis      1    663.67 6072.6 859.41
- ptratio  1    831.71 6240.7 866.31
- lstat    1   1270.25 6679.2 883.50
- rm       1   2074.09 7483.1 912.25

Step:  AIC=832.68
medv ~ chas + nox + rm + dis + rad + tax + ptratio + black + 
    lstat

          Df Sum of Sq    RSS    AIC
- tax      1     56.71 5520.5 829.76
- black    1     70.60 5534.4 830.39
- rad      1     99.42 5563.2 831.71
<none>                 5463.8 832.68
- nox      1    274.24 5738.0 839.54
- chas     1    293.02 5756.8 840.36
- dis      1    635.60 6099.4 854.99
- ptratio  1    801.83 6265.6 861.79
- lstat    1   1455.92 6919.7 886.91
- rm       1   2095.20 7559.0 909.27

Step:  AIC=829.76
medv ~ chas + nox + rm + dis + rad + ptratio + black + lstat

          Df Sum of Sq    RSS    AIC
- rad      1     44.10 5564.6 826.24
- black    1     80.49 5601.0 827.89
<none>                 5520.5 829.76
- chas     1    306.10 5826.6 837.88
- nox      1    350.56 5871.1 839.80
- dis      1    649.04 6169.5 852.35
- ptratio  1    866.50 6387.0 861.11
- lstat    1   1458.57 6979.1 883.54
- rm       1   2207.01 7727.5 909.31

Step:  AIC=826.24
medv ~ chas + nox + rm + dis + ptratio + black + lstat

          Df Sum of Sq    RSS    AIC
- black    1     66.69 5631.3 823.72
<none>                 5564.6 826.24
- chas     1    292.35 5857.0 833.66
- nox      1    311.02 5875.6 834.46
- dis      1    623.89 6188.5 847.59
- ptratio  1    848.26 6412.9 856.60
- lstat    1   1427.14 6991.8 878.47
- rm       1   2327.78 7892.4 909.12

Step:  AIC=823.72
medv ~ chas + nox + rm + dis + ptratio + lstat

          Df Sum of Sq    RSS    AIC
<none>                 5631.3 823.72
- chas     1    311.03 5942.3 831.79
- nox      1    387.84 6019.1 835.04
- dis      1    647.59 6278.9 845.72
- ptratio  1    874.95 6506.2 854.72
- lstat    1   1561.79 7193.1 880.11
- rm       1   2270.74 7902.0 903.90
> summary(modstep)

Call:
lm(formula = medv ~ chas + nox + rm + dis + ptratio + lstat, 
    data = Boston, subset = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-10.2397  -2.8547  -0.7412   1.9924  23.9609 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27.19591    6.30753   4.312 2.35e-05 ***
chas          4.70186    1.27557   3.686  0.00028 ***
nox         -18.98584    4.61252  -4.116 5.26e-05 ***
rm            5.49957    0.55218   9.960  < 2e-16 ***
dis          -1.17646    0.22119  -5.319 2.35e-07 ***
ptratio      -0.96136    0.15550  -6.182 2.61e-09 ***
lstat        -0.53352    0.06459  -8.260 9.03e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.785 on 246 degrees of freedom
Multiple R-squared:  0.7664,    Adjusted R-squared:  0.7607 
F-statistic: 134.5 on 6 and 246 DF,  p-value: < 2.2e-16
> yhat1 <- predict(modstep, newdata= Boston[-train, ])
> boston_test1 <- Boston[-train, "medv"]
> MSE1 <- mean((yhat1 - boston_test1)^2)
> RMSE1 <- sqrt(MSE1)
> c(MSE1, RMSE1)
[1] 27.456752  5.239919

Here we fit a regression tree (with rpart) to the Boston training data.

> tree_boston <- rpart(medv ~ ., data = Boston, subset = train)
> summary(tree_boston)
Call:
rpart(formula = medv ~ ., data = Boston, subset = train)
  n= 253 

          CP nsplit rel error    xerror       xstd
1 0.54137149      0 1.0000000 1.0069042 0.11526246
2 0.17236659      1 0.4586285 0.4661992 0.06124901
3 0.04628434      2 0.2862619 0.3058307 0.04862013
4 0.04409662      3 0.2399776 0.2939379 0.04973640
5 0.02134681      4 0.1958810 0.2462115 0.04601763
6 0.01807595      5 0.1745341 0.2293116 0.04528531
7 0.01027511      6 0.1564582 0.2099538 0.04169544
8 0.01000000      7 0.1461831 0.2116157 0.04133696

Variable importance
     rm   lstat     age     nox    crim     dis     tax ptratio      zn 
     33      27       7       6       6       5       5       4       3 
  indus   black     rad 
      2       1       1 

Node number 1: 253 observations,    complexity param=0.5413715
  mean=22.79486, MSE=95.27725 
  left son=2 (224 obs) right son=3 (29 obs)
  Primary splits:
      rm      < 7.121    to the left,  improve=0.5413715, (0 missing)
      lstat   < 5.44     to the right, improve=0.4720781, (0 missing)
      ptratio < 19.9     to the right, improve=0.2594405, (0 missing)
      indus   < 6.66     to the right, improve=0.2570786, (0 missing)
      nox     < 0.6695   to the right, improve=0.2215197, (0 missing)
  Surrogate splits:
      lstat   < 4.475    to the right, agree=0.933, adj=0.414, (0 split)
      ptratio < 13.85    to the right, agree=0.901, adj=0.138, (0 split)
      zn      < 87.5     to the left,  agree=0.897, adj=0.103, (0 split)

Node number 2: 224 observations,    complexity param=0.1723666
  mean=20.21071, MSE=43.5639 
  left son=4 (80 obs) right son=5 (144 obs)
  Primary splits:
      lstat   < 14.915   to the right, improve=0.4257827, (0 missing)
      nox     < 0.6695   to the right, improve=0.2936967, (0 missing)
      crim    < 5.84803  to the right, improve=0.2462949, (0 missing)
      ptratio < 19.9     to the right, improve=0.2378162, (0 missing)
      age     < 77.4     to the right, improve=0.2360563, (0 missing)
  Surrogate splits:
      age  < 91.15    to the right, agree=0.875, adj=0.650, (0 split)
      dis  < 2.1288   to the left,  agree=0.804, adj=0.450, (0 split)
      crim < 5.84803  to the right, agree=0.799, adj=0.438, (0 split)
      nox  < 0.6695   to the right, agree=0.786, adj=0.400, (0 split)
      tax  < 434.5    to the right, agree=0.772, adj=0.362, (0 split)

Node number 3: 29 observations,    complexity param=0.04409662
  mean=42.75517, MSE=44.72385 
  left son=6 (10 obs) right son=7 (19 obs)
  Primary splits:
      rm      < 7.433    to the left,  improve=0.8195545, (0 missing)
      lstat   < 5.185    to the right, improve=0.3105049, (0 missing)
      black   < 393.1    to the right, improve=0.2059922, (0 missing)
      dis     < 3.6617   to the right, improve=0.1586536, (0 missing)
      ptratio < 15.05    to the right, improve=0.1189739, (0 missing)
  Surrogate splits:
      lstat < 4.725    to the right, agree=0.828, adj=0.5, (0 split)
      indus < 1.775    to the left,  agree=0.724, adj=0.2, (0 split)
      black < 393.1    to the right, agree=0.724, adj=0.2, (0 split)
      crim  < 0.018935 to the left,  agree=0.690, adj=0.1, (0 split)
      zn    < 28.5     to the right, agree=0.690, adj=0.1, (0 split)

Node number 4: 80 observations,    complexity param=0.02134681
  mean=14.4325, MSE=17.84719 
  left son=8 (50 obs) right son=9 (30 obs)
  Primary splits:
      nox     < 0.603    to the right, improve=0.3603984, (0 missing)
      crim    < 0.853015 to the right, improve=0.2732551, (0 missing)
      tax     < 551.5    to the right, improve=0.2454497, (0 missing)
      ptratio < 19.6     to the right, improve=0.2261544, (0 missing)
      dis     < 2.0037   to the left,  improve=0.2202198, (0 missing)
  Surrogate splits:
      tax   < 393.5    to the right, agree=0.900, adj=0.733, (0 split)
      indus < 14.345   to the right, agree=0.888, adj=0.700, (0 split)
      dis   < 2.399    to the left,  agree=0.875, adj=0.667, (0 split)
      crim  < 1.62073  to the right, agree=0.850, adj=0.600, (0 split)
      rad   < 16       to the right, agree=0.788, adj=0.433, (0 split)

Node number 5: 144 observations,    complexity param=0.04628434
  mean=23.42083, MSE=28.99734 
  left son=10 (128 obs) right son=11 (16 obs)
  Primary splits:
      lstat   < 5.41     to the right, improve=0.26719180, (0 missing)
      rm      < 6.542    to the left,  improve=0.24368150, (0 missing)
      dis     < 1.85635  to the right, improve=0.08338557, (0 missing)
      chas    < 0.5      to the left,  improve=0.06863430, (0 missing)
      ptratio < 20.55    to the right, improve=0.06312148, (0 missing)

Node number 6: 10 observations
  mean=34.41, MSE=5.3309 

Node number 7: 19 observations
  mean=47.14737, MSE=9.511967 

Node number 8: 50 observations
  mean=12.468, MSE=10.23458 

Node number 9: 30 observations
  mean=17.70667, MSE=13.38262 

Node number 10: 128 observations,    complexity param=0.01807595
  mean=22.43672, MSE=18.80889 
  left son=20 (66 obs) right son=21 (62 obs)
  Primary splits:
      lstat   < 9.98     to the right, improve=0.18098300, (0 missing)
      rm      < 6.5165   to the left,  improve=0.11088040, (0 missing)
      indus   < 4.22     to the right, improve=0.07527922, (0 missing)
      ptratio < 20.95    to the right, improve=0.06686426, (0 missing)
      chas    < 0.5      to the left,  improve=0.04522875, (0 missing)
  Surrogate splits:
      rm    < 6.3825   to the left,  agree=0.719, adj=0.419, (0 split)
      crim  < 0.08577  to the right, agree=0.703, adj=0.387, (0 split)
      nox   < 0.5125   to the right, agree=0.695, adj=0.371, (0 split)
      indus < 6.145    to the right, agree=0.680, adj=0.339, (0 split)
      age   < 48.1     to the right, agree=0.680, adj=0.339, (0 split)

Node number 11: 16 observations
  mean=31.29375, MSE=40.77434 

Node number 20: 66 observations
  mean=20.64848, MSE=9.997649 

Node number 21: 62 observations,    complexity param=0.01027511
  mean=24.34032, MSE=21.16079 
  left son=42 (55 obs) right son=43 (7 obs)
  Primary splits:
      black < 371.25   to the right, improve=0.1887872, (0 missing)
      indus < 15.465   to the left,  improve=0.1113402, (0 missing)
      crim  < 1.066625 to the left,  improve=0.1113402, (0 missing)
      rad   < 7.5      to the left,  improve=0.1016110, (0 missing)
      tax   < 400.5    to the left,  improve=0.0976496, (0 missing)
  Surrogate splits:
      nox   < 0.589    to the left,  agree=0.935, adj=0.429, (0 split)
      dis   < 1.9624   to the right, agree=0.935, adj=0.429, (0 split)
      indus < 18.84    to the left,  agree=0.919, adj=0.286, (0 split)
      age   < 89.45    to the left,  agree=0.919, adj=0.286, (0 split)
      tax   < 400.5    to the left,  agree=0.919, adj=0.286, (0 split)

Node number 42: 55 observations
  mean=23.62727, MSE=10.23835 

Node number 43: 7 observations
  mean=29.94286, MSE=71.59673 
> tree_boston1 <- rpart(medv ~ ., data = train1)
> summary(tree_boston1)
Call:
rpart(formula = medv ~ ., data = train1)
  n= 254 

          CP nsplit rel error    xerror       xstd
1 0.46109185      0 1.0000000 1.0038882 0.12286406
2 0.20759279      1 0.5389082 0.6392510 0.08041842
3 0.07223984      2 0.3313154 0.4127999 0.06519430
4 0.04628076      3 0.2590755 0.3195668 0.05551688
5 0.02742681      4 0.2127948 0.3283610 0.05363943
6 0.01413471      5 0.1853679 0.2671046 0.03738031
7 0.01000000      6 0.1712332 0.2585818 0.03660635

Variable importance
  lstat      rm   indus    crim     nox      zn     dis     age     rad 
     27      19      12      12      12       8       4       3       1 
    tax ptratio 
      1       1 

Node number 1: 254 observations,    complexity param=0.4610918
  mean=22.53819, MSE=82.27488 
  left son=2 (156 obs) right son=3 (98 obs)
  Primary splits:
      lstat   < 9.545    to the right, improve=0.4610918, (0 missing)
      rm      < 7.414    to the left,  improve=0.4193780, (0 missing)
      indus   < 4.01     to the right, improve=0.2441960, (0 missing)
      ptratio < 18.75    to the right, improve=0.2255930, (0 missing)
      nox     < 0.6695   to the right, improve=0.1867062, (0 missing)
  Surrogate splits:
      crim  < 0.09628  to the right, agree=0.791, adj=0.459, (0 split)
      indus < 7.625    to the right, agree=0.791, adj=0.459, (0 split)
      nox   < 0.5125   to the right, agree=0.783, adj=0.439, (0 split)
      rm    < 6.413    to the left,  agree=0.776, adj=0.418, (0 split)
      zn    < 15       to the left,  agree=0.764, adj=0.388, (0 split)

Node number 2: 156 observations,    complexity param=0.07223984
  mean=17.65641, MSE=21.87579 
  left son=4 (82 obs) right son=5 (74 obs)
  Primary splits:
      lstat < 15       to the right, improve=0.4423738, (0 missing)
      crim  < 6.107135 to the right, improve=0.2756082, (0 missing)
      dis   < 2.06575  to the left,  improve=0.2728879, (0 missing)
      nox   < 0.6635   to the right, improve=0.2521726, (0 missing)
      black < 115.805  to the left,  improve=0.2307359, (0 missing)
  Surrogate splits:
      age   < 90.6     to the right, agree=0.769, adj=0.514, (0 split)
      dis   < 2.06575  to the left,  agree=0.724, adj=0.419, (0 split)
      crim  < 0.866645 to the right, agree=0.692, adj=0.351, (0 split)
      indus < 7.625    to the right, agree=0.679, adj=0.324, (0 split)
      nox   < 0.5835   to the right, agree=0.667, adj=0.297, (0 split)

Node number 3: 98 observations,    complexity param=0.2075928
  mean=30.30918, MSE=80.09573 
  left son=6 (83 obs) right son=7 (15 obs)
  Primary splits:
      rm      < 7.437    to the left,  improve=0.5526851, (0 missing)
      lstat   < 4.65     to the right, improve=0.3556923, (0 missing)
      nox     < 0.574    to the left,  improve=0.2754423, (0 missing)
      dis     < 2.99335  to the right, improve=0.2189892, (0 missing)
      ptratio < 15.05    to the right, improve=0.1987182, (0 missing)
  Surrogate splits:
      lstat   < 3.99     to the right, agree=0.908, adj=0.400, (0 split)
      zn      < 87.5     to the left,  agree=0.857, adj=0.067, (0 split)
      indus   < 18.84    to the left,  agree=0.857, adj=0.067, (0 split)
      tax     < 219      to the right, agree=0.857, adj=0.067, (0 split)
      ptratio < 14.75    to the right, agree=0.857, adj=0.067, (0 split)

Node number 4: 82 observations,    complexity param=0.01413471
  mean=14.70122, MSE=12.03597 
  left son=8 (37 obs) right son=9 (45 obs)
  Primary splits:
      crim  < 6.029745 to the right, improve=0.2992903, (0 missing)
      nox   < 0.603    to the right, improve=0.2541619, (0 missing)
      black < 28.14    to the left,  improve=0.2338012, (0 missing)
      dis   < 2.0037   to the left,  improve=0.2287683, (0 missing)
      tax   < 551.5    to the right, improve=0.2255979, (0 missing)
  Surrogate splits:
      rad     < 15.5     to the right, agree=0.915, adj=0.811, (0 split)
      tax     < 551.5    to the right, agree=0.890, adj=0.757, (0 split)
      ptratio < 20.15    to the right, agree=0.744, adj=0.432, (0 split)
      black   < 85.73    to the left,  agree=0.732, adj=0.405, (0 split)
      dis     < 2.2085   to the left,  agree=0.720, adj=0.378, (0 split)

Node number 5: 74 observations
  mean=20.93108, MSE=12.37863 

Node number 6: 83 observations,    complexity param=0.04628076
  mean=27.48072, MSE=38.36927 
  left son=12 (76 obs) right son=13 (7 obs)
  Primary splits:
      dis   < 2.0891   to the right, improve=0.3036966, (0 missing)
      nox   < 0.589    to the left,  improve=0.3036966, (0 missing)
      age   < 88.75    to the left,  improve=0.2790731, (0 missing)
      tax   < 400      to the left,  improve=0.1883790, (0 missing)
      indus < 14.48    to the left,  improve=0.1813055, (0 missing)
  Surrogate splits:
      nox   < 0.589    to the left,  agree=1.000, adj=1.000, (0 split)
      age   < 87.6     to the left,  agree=0.976, adj=0.714, (0 split)
      crim  < 0.64202  to the left,  agree=0.964, adj=0.571, (0 split)
      indus < 16.57    to the left,  agree=0.964, adj=0.571, (0 split)
      rad   < 16       to the left,  agree=0.952, adj=0.429, (0 split)

Node number 7: 15 observations
  mean=45.96, MSE=21.7664 

Node number 8: 37 observations
  mean=12.60811, MSE=9.918583 

Node number 9: 45 observations
  mean=16.42222, MSE=7.21284 

Node number 12: 76 observations,    complexity param=0.02742681
  mean=26.44474, MSE=19.64852 
  left son=24 (56 obs) right son=25 (20 obs)
  Primary splits:
      rm      < 6.8035   to the left,  improve=0.3838246, (0 missing)
      lstat   < 5.265    to the right, improve=0.2307796, (0 missing)
      ptratio < 18.65    to the right, improve=0.1500866, (0 missing)
      tax     < 222.5    to the right, improve=0.1312044, (0 missing)
      indus   < 1.715    to the right, improve=0.1245970, (0 missing)
  Surrogate splits:
      indus   < 1.715    to the right, agree=0.803, adj=0.25, (0 split)
      lstat   < 4.24     to the right, agree=0.789, adj=0.20, (0 split)
      zn      < 87.5     to the left,  agree=0.763, adj=0.10, (0 split)
      ptratio < 13.65    to the right, agree=0.763, adj=0.10, (0 split)
      nox     < 0.4045   to the right, agree=0.750, adj=0.05, (0 split)

Node number 13: 7 observations
  mean=38.72857, MSE=103.4563 

Node number 24: 56 observations
  mean=24.80357, MSE=11.06213 

Node number 25: 20 observations
  mean=31.04, MSE=15.0324 
> plot(tree_boston)
> text(tree_boston, pretty = 0, use.n = TRUE, all = TRUE, cex = 0.7)

> library(tree)
> tree_boston2 <- tree(medv ~ ., data = Boston, subset = train)
> summary(tree_boston2)

Regression tree:
tree(formula = medv ~ ., data = Boston, subset = train)
Variables actually used in tree construction:
[1] "rm"    "lstat" "dis"   "black" "nox"  
Number of terminal nodes:  9 
Residual mean deviance:  13.15 = 3208 / 244 
Distribution of residuals:
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-11.74000  -2.10700   0.06429   0.00000   1.65300  19.08000 
> plot(tree_boston2)
> text(tree_boston2, pretty = 0, use.n = TRUE, all = TRUE, cex = 0.7)

Notice that the output of summary() indicates that five of the variables have been used in constructing the tree. In the context of a regression tree, the deviance is simply the sum of the squared errors for the tree (this is verified numerically after the plots below). We now plot the trees with the partykit package, whose as.party() function works on trees created with rpart but not on those created with tree.

> library(partykit)
> plot(as.party(tree_boston))

> plot(as.party(tree_boston1))
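
As a quick numerical check of the deviance statement above, the residual mean deviance reported for tree_boston2 (13.15 = 3208 / 244) can be recomputed from the fitted values (a sketch, assuming the objects created above are still in the workspace).

> fitted_tree <- predict(tree_boston2, newdata = Boston[train, ])
> sse <- sum((Boston[train, "medv"] - fitted_tree)^2)
> sse              # total deviance (sum of squared errors), about 3208
> sse / (253 - 9)  # deviance divided by (n - number of terminal nodes)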

The variable lstat measures the percentage of individuals with lower socioeconomic status. The trees indicate that lower values of lstat correspond to more expensive houses. For example, tree_boston1 predicts a median home value of roughly $46,000 for larger homes (rm >= 7.437) in suburbs where residents have higher socioeconomic status (lstat < 9.545). We now use the cv.tree() function on tree_boston2 to see whether pruning the tree will improve performance.

> set.seed(3)
> cv_boston2 <- cv.tree(tree_boston2)
> plot(cv_boston2$size, cv_boston2$dev, type = "b")

In this case, the most complex tree is selected by cross-validation. However, if we wish to prune the tree, we could do so as follows, using the prune.tree() function:

> prune_boston2 <- prune.tree(tree_boston2, best = 5)
> plot(prune_boston2)
> text(prune_boston2, pretty = 0, all = TRUE, cex = 0.7)

In keeping with the cross-validation results, we use the unpruned tree to make predictions on the test set.

> yhat <- predict(tree_boston2, newdata = Boston[-train, ])
> boston_test <- Boston[-train, "medv"]
> plot(yhat, boston_test)
> abline(0, 1)

> MSE <- mean((yhat - boston_test)^2)
> RMSE <- sqrt(mean((yhat - boston_test)^2))
> c(MSE, RMSE)
[1] 24.383148  4.937929

In other words, the test MSE associated with the regression tree is 24.3831. The square root of the MSE is 4.9379, indicating that this model leads to test predictions that are within around $4,938 of the true median home value for the suburb.
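
For comparison, the same calculation could be repeated with the pruned five-node tree from above (a sketch; whether pruning helps on this test set is an empirical question).

> yhat_pruned <- predict(prune_boston2, newdata = Boston[-train, ])
> sqrt(mean((yhat_pruned - boston_test)^2))  # test RMSE for the pruned tree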

2.3 Using test1

> yhat1 <- predict(tree_boston1, newdata = test1)
> MSE <- mean((test1$medv - yhat1)^2)
> RMSE <- sqrt(MSE)
> c(MSE, RMSE)
[1] 19.347308  4.398557

2.4 Random Forest

> library(randomForest)
> set.seed(32)
> mod_forest <- randomForest(medv ~ ., data = train1, ntree = 1000)
> mod_forest

Call:
 randomForest(formula = medv ~ ., data = train1, ntree = 1000) 
               Type of random forest: regression
                     Number of trees: 1000
No. of variables tried at each split: 4

          Mean of squared residuals: 12.68986
                    % Var explained: 84.58
> yhat_rf <- predict(mod_forest, newdata = test1)
> MSE_test <- mean((test1$medv - yhat_rf)^2)
> RMSE_test <- sqrt(MSE_test)
> c(MSE_test, RMSE_test)
[1] 11.537566  3.396699

2.5 Random Forest with caret

> fitControl <- trainControl(method = "repeatedcv",
+                           number = 10,
+                           repeats = 3,
+                           allowParallel = TRUE)
> set.seed(32)
> mtry = c(2, 4, 8)
> tunegrid <- expand.grid(.mtry = mtry)
> mod_rfc <- train(medv ~ .,
+                  data = train1,
+                  method = "rf",
+                  ntree = 1000,
+                  tuneGrid = tunegrid,
+                  trControl = fitControl)
> mod_rfc
Random Forest 

254 samples
 13 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 228, 229, 229, 228, 229, 229, ... 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
  2     3.935595  0.8405704  2.663757
  4     3.448269  0.8730797  2.439154
  8     3.273626  0.8782315  2.392458

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 8.
> yhat2 <- predict(mod_rfc, newdata = test1)
> MSE_test <- mean((test1$medv - yhat2)^2)
> RMSE_test <- sqrt(MSE_test)
> c(MSE_test, RMSE_test)
[1] 11.465552  3.386082

3 Using caret and caretEnsemble

This time, we will create a training set containing roughly 80% of the observations in Boston.

> library(caret)
> set.seed(757)
> trainIndex <- createDataPartition(y = Boston$medv,
+                                   p = 0.80,
+                                   list = FALSE)
> training <- Boston[trainIndex, ]
> testing <- Boston[-trainIndex, ]
> dim(training)
[1] 407  14
> dim(testing)
[1] 99 14

Consider the type of each variable.

> sapply(training, class)
     crim        zn     indus      chas       nox        rm       age 
"numeric" "numeric" "numeric" "integer" "numeric" "numeric" "numeric" 
      dis       rad       tax   ptratio     black     lstat      medv 
"numeric" "integer" "numeric" "numeric" "numeric" "numeric" "numeric" 
> str(training)
'data.frame':   407 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 85.9 94.3 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 18.9 15 ...
> summary(training)
      crim                zn             indus            chas        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
 1st Qu.: 0.08232   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
 Median : 0.28960   Median :  0.00   Median : 9.69   Median :0.00000  
 Mean   : 3.50216   Mean   : 11.41   Mean   :11.25   Mean   :0.07125  
 3rd Qu.: 3.39766   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
      nox               rm             age              dis        
 Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.169  
 1st Qu.:0.4510   1st Qu.:5.886   1st Qu.: 43.55   1st Qu.: 2.102  
 Median :0.5380   Median :6.193   Median : 78.30   Median : 3.216  
 Mean   :0.5555   Mean   :6.300   Mean   : 68.62   Mean   : 3.791  
 3rd Qu.:0.6240   3rd Qu.:6.622   3rd Qu.: 94.45   3rd Qu.: 5.117  
 Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
      rad              tax           ptratio          black       
 Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
 1st Qu.: 4.000   1st Qu.:280.5   1st Qu.:16.95   1st Qu.:376.07  
 Median : 5.000   Median :330.0   Median :19.00   Median :392.05  
 Mean   : 9.391   Mean   :407.3   Mean   :18.41   Mean   :358.95  
 3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.26  
 Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
     lstat             medv      
 Min.   : 1.730   Min.   : 5.00  
 1st Qu.: 6.725   1st Qu.:17.05  
 Median :11.250   Median :21.20  
 Mean   :12.481   Mean   :22.63  
 3rd Qu.:16.695   3rd Qu.:25.00  
 Max.   :36.980   Max.   :50.00  

Let us take a quick look at the correlation between all of the predictors.

> cor(training[, -14])
               crim         zn       indus        chas         nox
crim     1.00000000 -0.1855645  0.37191762 -0.05599437  0.38946869
zn      -0.18556449  1.0000000 -0.54430460 -0.04837940 -0.51456396
indus    0.37191762 -0.5443046  1.00000000  0.04153292  0.74991644
chas    -0.05599437 -0.0483794  0.04153292  1.00000000  0.06889513
nox      0.38946869 -0.5145640  0.74991644  0.06889513  1.00000000
rm      -0.19833754  0.3114150 -0.41414262  0.09654506 -0.31401792
age      0.32862662 -0.5670406  0.64571900  0.09146708  0.73601549
dis     -0.35005456  0.6865718 -0.70074282 -0.09070932 -0.76548127
rad      0.59476255 -0.3055176  0.56751891 -0.01255656  0.58279945
tax      0.54973661 -0.3026334  0.70952060 -0.05186279  0.64240452
ptratio  0.26587913 -0.3993144  0.37179013 -0.11869778  0.16737086
black   -0.38126073  0.1624189 -0.32549057  0.04230642 -0.36273394
lstat    0.43099691 -0.4093366  0.60784942 -0.03265481  0.59949041
                 rm         age         dis         rad         tax
crim    -0.19833754  0.32862662 -0.35005456  0.59476255  0.54973661
zn       0.31141500 -0.56704063  0.68657185 -0.30551757 -0.30263337
indus   -0.41414262  0.64571900 -0.70074282  0.56751891  0.70952060
chas     0.09654506  0.09146708 -0.09070932 -0.01255656 -0.05186279
nox     -0.31401792  0.73601549 -0.76548127  0.58279945  0.64240452
rm       1.00000000 -0.22861271  0.19559393 -0.21033418 -0.30913441
age     -0.22861271  1.00000000 -0.76493082  0.43321989  0.49428877
dis      0.19559393 -0.76493082  1.00000000 -0.47290741 -0.51708554
rad     -0.21033418  0.43321989 -0.47290741  1.00000000  0.90136813
tax     -0.30913441  0.49428877 -0.51708554  0.90136813  1.00000000
ptratio -0.36950800  0.25485463 -0.21521083  0.44766904  0.45989270
black    0.15412667 -0.26451179  0.27563498 -0.40829445 -0.40718417
lstat   -0.62594701  0.60834627 -0.50661891  0.47294964  0.54223926
           ptratio       black       lstat
crim     0.2658791 -0.38126073  0.43099691
zn      -0.3993144  0.16241893 -0.40933661
indus    0.3717901 -0.32549057  0.60784942
chas    -0.1186978  0.04230642 -0.03265481
nox      0.1673709 -0.36273394  0.59949041
rm      -0.3695080  0.15412667 -0.62594701
age      0.2548546 -0.26451179  0.60834627
dis     -0.2152108  0.27563498 -0.50661891
rad      0.4476690 -0.40829445  0.47294964
tax      0.4598927 -0.40718417  0.54223926
ptratio  1.0000000 -0.15071089  0.37972555
black   -0.1507109  1.00000000 -0.38038074
lstat    0.3797256 -0.38038074  1.00000000

Several of the predictors are strongly correlated with one another; for example, rad and tax have a correlation of about 0.90.
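
One way to act on this is caret's findCorrelation(), which flags predictors whose pairwise correlations exceed a chosen cutoff (a sketch with an arbitrary cutoff of 0.75).

> highCor <- findCorrelation(cor(training[, -14]), cutoff = 0.75)
> names(training[, -14])[highCor]  # candidates to drop before fitting linear models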

3.1 Univariate Data Visualizations

Let us take a quick look at the distribution of each predictor.

> par(mfrow = c(2, 7))
> for(i in 1:13){
+   hist(training[, i], main = names(training)[i], col = "pink")
+ }
> par(mfrow = c(1, 1))

> par(mfrow = c(4, 4))
> for(i in 1:13){
+   plot(density(training[, i]), main = names(training)[i], col = "blue")
+ }
> par(mfrow = c(1, 1))

It appears that some variables follow an exponential distribution while a few others might be bimodal. Consider boxplots of each predictor next.

> par(mfrow = c(4, 4))
> for(i in 1:13){
+   boxplot(training[, i], main = names(training)[i])
+ }
> par(mfrow = c(1, 1))
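
To quantify the apparent skewness rather than judge it by eye, the skewness of each predictor can be computed (a sketch, assuming the e1071 package is installed).

> library(e1071)
> round(sapply(training[, 1:13], skewness), 2)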

3.2 Multivariate Data Visualizations

> pairs(training[, -14]) # Yuck

> library(corrplot)
> rs <- cor(training[, -14])
> corrplot(rs, method = "circle")

3.2.1 Trying Different Algorithms

> myControl <- trainControl(method = "repeatedcv",
+                           number = 10,
+                           repeats = 3,
+                           allowParallel = TRUE)
> library(caretEnsemble)
> set.seed(32)
> models <- caretList(log(medv) ~ .,
+                     data = training,
+                     trControl = myControl,
+                     tuneList = list(
+                       mod_lm = caretModelSpec(method = "lm"),
+                       mod_svm = caretModelSpec(method = "svmRadial"),
+                       mod_knn = caretModelSpec(method = "knn"),
+                       mod_gbm = caretModelSpec(method = "gbm"),
+                       mod_BIC = caretModelSpec(method = "lmStepAIC", k = log(407)),
+                       mod_AIC = caretModelSpec(method = "lmStepAIC"),
+                       mod_BE = caretModelSpec(method = "leapBackward"),
+                       mod_FS = caretModelSpec(method = "leapForward"),
+                       mod_SS = caretModelSpec(method = "leapSeq"),
+                       mod_RF = caretModelSpec(method = "ranger"),
+                       mod_EN = caretModelSpec(method = "glmnet", 
+                                               tuneGrid = expand.grid(alpha = c(0, 0.5, 1), 
+                                                                      lambda = seq(0.0001, 1, length = 20)))
+                       
+                     ))
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        0.1418            -nan     0.1000    0.0187
     2        0.1228            -nan     0.1000    0.0192
     3        0.1080            -nan     0.1000    0.0134
     4        0.0951            -nan     0.1000    0.0126
     5        0.0849            -nan     0.1000    0.0097
     6        0.0758            -nan     0.1000    0.0094
     7        0.0689            -nan     0.1000    0.0067
     8        0.0619            -nan     0.1000    0.0054
     9        0.0562            -nan     0.1000    0.0048
    10        0.0510            -nan     0.1000    0.0045
    20        0.0277            -nan     0.1000    0.0006
    40        0.0185            -nan     0.1000   -0.0002
    60        0.0150            -nan     0.1000   -0.0002
    80        0.0130            -nan     0.1000   -0.0001
   100        0.0114            -nan     0.1000   -0.0000
   120        0.0102            -nan     0.1000   -0.0002
   140        0.0091            -nan     0.1000   -0.0001
   150        0.0086            -nan     0.1000   -0.0000

Start:  AIC=-1331.84
.outcome ~ crim + zn + indus + chas + nox + rm + age + dis + 
    rad + tax + ptratio + black + lstat

          Df Sum of Sq    RSS     AIC
- age      1    0.0003 12.551 -1337.8
- indus    1    0.0197 12.571 -1337.2
<none>                 12.551 -1331.8
- zn       1    0.1873 12.738 -1331.8
- chas     1    0.2168 12.768 -1330.9
- black    1    0.4511 13.002 -1323.5
- tax      1    0.4573 13.008 -1323.3
- nox      1    0.5538 13.105 -1320.3
- rad      1    0.6111 13.162 -1318.5
- rm       1    1.0284 13.579 -1305.8
- dis      1    1.1001 13.651 -1303.7
- ptratio  1    1.1860 13.737 -1301.1
- crim     1    1.5136 14.065 -1291.5
- lstat    1    5.6242 18.175 -1187.2

Step:  AIC=-1337.84
.outcome ~ crim + zn + indus + chas + nox + rm + dis + rad + 
    tax + ptratio + black + lstat

          Df Sum of Sq    RSS     AIC
- indus    1    0.0196 12.571 -1343.2
<none>                 12.551 -1337.8
- zn       1    0.1897 12.741 -1337.7
- chas     1    0.2167 12.768 -1336.9
- black    1    0.4516 13.003 -1329.5
- tax      1    0.4589 13.010 -1329.2
- nox      1    0.6076 13.159 -1324.6
- rad      1    0.6200 13.171 -1324.2
- rm       1    1.0822 13.633 -1310.2
- ptratio  1    1.2031 13.754 -1306.6
- dis      1    1.2087 13.760 -1306.4
- crim     1    1.5137 14.065 -1297.5
- lstat    1    6.4543 19.006 -1175.0

Step:  AIC=-1343.21
.outcome ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
    black + lstat

          Df Sum of Sq    RSS     AIC
- zn       1    0.1760 12.747 -1343.6
<none>                 12.571 -1343.2
- chas     1    0.2304 12.801 -1341.8
- black    1    0.4498 13.021 -1334.9
- tax      1    0.4774 13.048 -1334.0
- nox      1    0.5927 13.164 -1330.5
- rad      1    0.6102 13.181 -1329.9
- rm       1    1.0630 13.634 -1316.2
- ptratio  1    1.1857 13.757 -1312.5
- dis      1    1.3201 13.891 -1308.6
- crim     1    1.5247 14.096 -1302.6
- lstat    1    6.4348 19.006 -1181.0

Step:  AIC=-1343.56
.outcome ~ crim + chas + nox + rm + dis + rad + tax + ptratio + 
    black + lstat

          Df Sum of Sq    RSS     AIC
<none>                 12.747 -1343.6
- chas     1    0.2250 12.972 -1342.5
- tax      1    0.3674 13.114 -1338.0
- black    1    0.4542 13.201 -1335.3
- rad      1    0.5451 13.292 -1332.5
- nox      1    0.6545 13.401 -1329.2
- dis      1    1.2161 13.963 -1312.5
- rm       1    1.2782 14.025 -1310.7
- crim     1    1.4481 14.195 -1305.8
- ptratio  1    1.8165 14.563 -1295.3
- lstat    1    6.3325 19.079 -1185.4
Start:  AIC=-1387.96
.outcome ~ crim + zn + indus + chas + nox + rm + age + dis + 
    rad + tax + ptratio + black + lstat

          Df Sum of Sq    RSS     AIC
- age      1    0.0003 12.551 -1390.0
- indus    1    0.0197 12.571 -1389.3
<none>                 12.551 -1388.0
- zn       1    0.1873 12.738 -1383.9
- chas     1    0.2168 12.768 -1383.0
- black    1    0.4511 13.002 -1375.6
- tax      1    0.4573 13.008 -1375.4
- nox      1    0.5538 13.105 -1372.4
- rad      1    0.6111 13.162 -1370.6
- rm       1    1.0284 13.579 -1357.9
- dis      1    1.1001 13.651 -1355.8
- ptratio  1    1.1860 13.737 -1353.2
- crim     1    1.5136 14.065 -1343.6
- lstat    1    5.6242 18.175 -1239.3

Step:  AIC=-1389.95
.outcome ~ crim + zn + indus + chas + nox + rm + dis + rad + 
    tax + ptratio + black + lstat

          Df Sum of Sq    RSS     AIC
- indus    1    0.0196 12.571 -1391.3
<none>                 12.551 -1390.0
- zn       1    0.1897 12.741 -1385.8
- chas     1    0.2167 12.768 -1385.0
- black    1    0.4516 13.003 -1377.6
- tax      1    0.4589 13.010 -1377.3
- nox      1    0.6076 13.159 -1372.7
- rad      1    0.6200 13.171 -1372.3
- rm       1    1.0822 13.633 -1358.3
- ptratio  1    1.2031 13.754 -1354.7
- dis      1    1.2087 13.760 -1354.5
- crim     1    1.5137 14.065 -1345.6
- lstat    1    6.4543 19.006 -1223.1

Step:  AIC=-1391.32
.outcome ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio + 
    black + lstat

          Df Sum of Sq    RSS     AIC
<none>                 12.571 -1391.3
- zn       1    0.1760 12.747 -1387.7
- chas     1    0.2304 12.801 -1385.9
- black    1    0.4498 13.021 -1379.0
- tax      1    0.4774 13.048 -1378.2
- nox      1    0.5927 13.164 -1374.6
- rad      1    0.6102 13.181 -1374.0
- rm       1    1.0630 13.634 -1360.3
- ptratio  1    1.1857 13.757 -1356.6
- dis      1    1.3201 13.891 -1352.7
- crim     1    1.5247 14.096 -1346.7
- lstat    1    6.4348 19.006 -1225.1
> models_pred <- lapply(models, predict, newdata = testing)
> models_pred <- as.data.frame((models_pred))
> models_pred <- exp(models_pred)
> head(models_pred)
     mod_lm  mod_svm  mod_knn  mod_gbm  mod_BIC  mod_AIC   mod_BE   mod_FS
9  12.10126 17.09738 17.22209 15.11849 12.54971 12.13816 12.70811 12.70811
14 20.41695 19.06091 21.14211 19.74192 20.59685 20.49928 21.38807 21.38807
20 19.03650 18.05127 23.61302 18.09039 19.03806 19.14874 18.90036 18.90036
36 23.45565 20.73759 27.09339 22.77786 23.54151 23.60537 21.73006 21.73006
40 31.54351 29.94356 38.89945 29.92197 29.23639 31.51472 28.84757 28.84757
44 24.85047 23.87144 23.07799 24.35731 25.46217 24.72752 25.16348 25.16348
     mod_SS   mod_RF   mod_EN
9  11.36432 17.22285 12.13043
14 22.40196 19.92057 20.46086
20 19.69802 18.78564 19.06339
36 22.74828 21.30284 23.44067
40 29.68668 28.94304 31.41850
44 26.43419 23.91167 24.83434
> bwplot(resamples(models))

> modelCor(resamples(models))
           mod_lm   mod_svm   mod_knn   mod_gbm   mod_BIC   mod_AIC
mod_lm  1.0000000 0.7315707 0.5035556 0.6312936 0.9821889 0.9981746
mod_svm 0.7315707 1.0000000 0.6402334 0.7012810 0.7370684 0.7341203
mod_knn 0.5035556 0.6402334 1.0000000 0.5169711 0.5605050 0.5008608
mod_gbm 0.6312936 0.7012810 0.5169711 1.0000000 0.6293528 0.6387542
mod_BIC 0.9821889 0.7370684 0.5605050 0.6293528 1.0000000 0.9840121
mod_AIC 0.9981746 0.7341203 0.5008608 0.6387542 0.9840121 1.0000000
mod_BE  0.8861203 0.7792973 0.5760775 0.6826216 0.8872274 0.8814543
mod_FS  0.8876586 0.7906338 0.5816167 0.7026625 0.8873709 0.8825207
mod_SS  0.7686125 0.6594322 0.4747544 0.5963098 0.7871257 0.7626301
mod_RF  0.6780607 0.8626853 0.6765305 0.8495601 0.6943648 0.6887197
mod_EN  0.9997953 0.7369954 0.5123585 0.6336658 0.9847568 0.9984690
           mod_BE    mod_FS    mod_SS    mod_RF    mod_EN
mod_lm  0.8861203 0.8876586 0.7686125 0.6780607 0.9997953
mod_svm 0.7792973 0.7906338 0.6594322 0.8626853 0.7369954
mod_knn 0.5760775 0.5816167 0.4747544 0.6765305 0.5123585
mod_gbm 0.6826216 0.7026625 0.5963098 0.8495601 0.6336658
mod_BIC 0.8872274 0.8873709 0.7871257 0.6943648 0.9847568
mod_AIC 0.8814543 0.8825207 0.7626301 0.6887197 0.9984690
mod_BE  1.0000000 0.9971566 0.9141154 0.7776125 0.8905587
mod_FS  0.9971566 1.0000000 0.9107504 0.7858050 0.8919799
mod_SS  0.9141154 0.9107504 1.0000000 0.6706237 0.7725255
mod_RF  0.7776125 0.7858050 0.6706237 1.0000000 0.6841685
mod_EN  0.8905587 0.8919799 0.7725255 0.6841685 1.0000000

Let us pick the three best-performing algorithms (random forest, SVM, and GBM).

> set.seed(32)
> models <- caretList(log(medv) ~ .,
+                     data = training,
+                     trControl = myControl,
+                     tuneList = list(
+                       mod_svm = caretModelSpec(method = "svmRadial"),
+                       mod_gbm = caretModelSpec(method = "gbm"),
+                       mod_RF = caretModelSpec(method = "ranger")
+                     ))
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        0.1425            -nan     0.1000    0.0218
     2        0.1245            -nan     0.1000    0.0156
     3        0.1086            -nan     0.1000    0.0132
     4        0.0949            -nan     0.1000    0.0125
     5        0.0842            -nan     0.1000    0.0091
     6        0.0745            -nan     0.1000    0.0085
     7        0.0671            -nan     0.1000    0.0059
     8        0.0600            -nan     0.1000    0.0063
     9        0.0545            -nan     0.1000    0.0042
    10        0.0494            -nan     0.1000    0.0042
    20        0.0275            -nan     0.1000    0.0007
    40        0.0186            -nan     0.1000   -0.0000
    60        0.0152            -nan     0.1000   -0.0002
    80        0.0129            -nan     0.1000    0.0000
   100        0.0114            -nan     0.1000   -0.0001
   120        0.0101            -nan     0.1000   -0.0001
   140        0.0093            -nan     0.1000   -0.0002
   150        0.0088            -nan     0.1000   -0.0001
> models_pred <- lapply(models, predict, newdata = testing)
> models_pred <- as.data.frame((models_pred))
> models_pred <- exp(models_pred)
> head(models_pred)
   mod_svm  mod_gbm   mod_RF
1 17.08445 15.05113 17.09280
2 19.05936 19.91364 19.86004
3 18.04901 18.20923 19.12218
4 20.73916 22.64887 21.41413
5 29.93608 28.57065 28.69266
6 23.87584 25.03544 23.90853
> bwplot(resamples(models))

> modelCor(resamples(models))
          mod_svm   mod_gbm    mod_RF
mod_svm 1.0000000 0.7643689 0.8246182
mod_gbm 0.7643689 1.0000000 0.8674933
mod_RF  0.8246182 0.8674933 1.0000000
> set.seed(32)
> models <- caretList(medv ~ .,
+                     data = training,
+                     trControl = myControl,
+                     preProc = c("center", "scale", "BoxCox"),
+                     tuneList = list(
+                       # mod_svm = caretModelSpec(method = "svmRadial"),
+                       mod_gbm = caretModelSpec(method = "gbm"),
+                       mod_RF = caretModelSpec(method = "ranger")
+                     ))
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1       74.7077            -nan     0.1000   13.6639
     2       64.8469            -nan     0.1000   11.2699
     3       56.3836            -nan     0.1000    5.6149
     4       49.5307            -nan     0.1000    6.8685
     5       44.3351            -nan     0.1000    5.6392
     6       39.5944            -nan     0.1000    4.3578
     7       35.1894            -nan     0.1000    4.3334
     8       31.7152            -nan     0.1000    3.3525
     9       28.2104            -nan     0.1000    3.4845
    10       25.9699            -nan     0.1000    1.6611
    20       14.4979            -nan     0.1000    0.2964
    40        9.5577            -nan     0.1000    0.1728
    60        7.8066            -nan     0.1000   -0.0375
    80        6.4782            -nan     0.1000   -0.0871
   100        5.7063            -nan     0.1000    0.0004
   120        5.2300            -nan     0.1000   -0.0365
   140        4.7595            -nan     0.1000   -0.0308
   150        4.5734            -nan     0.1000   -0.0515
> models_pred <- lapply(models, predict, newdata = testing)
> models_pred <- as.data.frame((models_pred))
> # models_pred <- exp(models_pred)
> head(models_pred)
   mod_gbm   mod_RF
1 14.98090 18.13443
2 20.14400 20.48507
3 18.90286 19.22108
4 23.00341 21.55955
5 28.84768 29.90032
6 24.63848 24.04461
> bwplot(resamples(models))

> modelCor(resamples(models))
          mod_gbm    mod_RF
mod_gbm 1.0000000 0.8567665
mod_RF  0.8567665 1.0000000

3.3 Should tune (next course)
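
As a preview of what that would involve, tuning the gbm component amounts to supplying a tuning grid to caretModelSpec(); the grid values below are hypothetical.

> gbmGrid <- expand.grid(n.trees = c(500, 1000, 1500),
+                        interaction.depth = c(1, 3, 5),
+                        shrinkage = c(0.01, 0.1),
+                        n.minobsinnode = 10)
> # then, inside caretList():
> #   mod_gbm = caretModelSpec(method = "gbm", tuneGrid = gbmGrid)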

3.4 Ensemble

> set.seed(31)
> ensemble <- caretEnsemble(models, trControl = myControl)
> summary(ensemble)
The following models were ensembled: mod_gbm, mod_RF 
They were weighted: 
-1.911 0.2534 0.8331
The resulting RMSE is: 3.0836
The fit for each individual model on the RMSE is: 
  method     RMSE    RMSESD
 mod_gbm 3.225121 0.9622825
  mod_RF 3.094267 1.0071046
> coef(ensemble$ens_model$finalModel)[-1]
  mod_gbm    mod_RF 
0.2534409 0.8330883 
> ####
> sum(coef(ensemble$ens_model$finalModel)[-1])
[1] 1.086529
> ####
> BM <- as.matrix(models_pred)
> X <- cbind(rep(1, dim(BM)[1]), BM)
> BETA <- t(coef(ensemble$ens_model$finalModel))
> dim(BETA)
[1] 1 3
> ANS <- X %*% t(BETA)
> head(ANS) 
         [,1]
[1,] 16.99340
[2,] 20.26023
[3,] 18.89265
[4,] 21.88006
[5,] 30.30984
[6,] 24.36473
> MSE_test <- mean((testing$medv - ANS)^2)
> RMSE_test <- sqrt(MSE_test)
> c(MSE_test, RMSE_test)
[1] 11.35616  3.36989
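
The matrix product above just makes the ensemble weights explicit; the ensemble object also has its own predict() method (a sketch, output not shown).

> yhat_ens <- predict(ensemble, newdata = testing)
> head(yhat_ens)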

3.5 Trees for Classification

> url <- "https://assets.datacamp.com/production/course_3022/datasets/credit.csv"
> GC <- read.csv(url)
> library(caret)
> set.seed(321)
> trainIndex <- createDataPartition(y = GC$default,
+                                   p = 0.80,
+                                   list = FALSE,
+                                   times = 1)
> train80 <- GC[trainIndex, ]
> test20 <- GC[-trainIndex, ]
> dim(train80)
[1] 800  17
> dim(test20)
[1] 200  17
> library(rpart)
> library(partykit)
> credit_model <- rpart(default ~ ., data = train80, method = "class")
> plot(partykit::as.party(credit_model))

> rpart.plot::rpart.plot(x = credit_model, type = 5)

Next, we evaluate the performance of the model on the test set with a confusion matrix.

> library(caret)
> class_prediction <- predict(object = credit_model,  
+                         newdata = test20,   
+                         type = "class")  
>                             
> # Calculate the confusion matrix for the test set
> confusionMatrix(data = class_prediction,       
+                 reference = test20$default)  
Confusion Matrix and Statistics

          Reference
Prediction  no yes
       no  122  43
       yes  18  17
                                         
               Accuracy : 0.695          
                 95% CI : (0.6261, 0.758)
    No Information Rate : 0.7            
    P-Value [Acc > NIR] : 0.59526        
                                         
                  Kappa : 0.1757         
 Mcnemar's Test P-Value : 0.00212        
                                         
            Sensitivity : 0.8714         
            Specificity : 0.2833         
         Pos Pred Value : 0.7394         
         Neg Pred Value : 0.4857         
             Prevalence : 0.7000         
         Detection Rate : 0.6100         
   Detection Prevalence : 0.8250         
      Balanced Accuracy : 0.5774         
                                         
       'Positive' Class : no             
                                         

Next, we cross-validate a bagged tree model with caret.

> # Specify the training configuration
> ctrl <- trainControl(method = "cv",     # Cross-validation
+                      number = 5,      # 5 folds
+                      classProbs = TRUE,                  # For AUC
+                      summaryFunction = twoClassSummary)  # For AUC
> 
> # Cross validate the credit model using "treebag" method; 
> # Track AUC (Area under the ROC curve)
> set.seed(1)  # for reproducibility
> credit_caret_model <- train(default ~ .,
+                             data = train80, 
+                             method = "treebag",
+                             metric = "ROC",
+                             trControl = ctrl)
> 
> # Look at the model object
> print(credit_caret_model)
Bagged CART 

800 samples
 16 predictor
  2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 640, 640, 640, 640, 640 
Resampling results:

  ROC        Sens       Spec     
  0.7731399  0.8589286  0.4708333
> # Inspect the contents of the model list 
> names(credit_caret_model)
 [1] "method"       "modelInfo"    "modelType"    "results"     
 [5] "pred"         "bestTune"     "call"         "dots"        
 [9] "metric"       "control"      "finalModel"   "preProcess"  
[13] "trainingData" "resample"     "resampledCM"  "perfNames"   
[17] "maximize"     "yLimits"      "times"        "levels"      
[21] "terms"        "coefnames"    "contrasts"    "xlevels"     
> # Print the CV AUC
> credit_caret_model$results[,"ROC"]
[1] 0.7731399
> # Generate predictions on the test set
> pred <- predict(object = credit_caret_model, 
+                 newdata = test20,
+                 type = "prob")
> 
> # Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
> Metrics::auc(actual = ifelse(test20$default == "yes", 1, 0), 
+                     predicted = pred[,"yes"])
[1] 0.705

3.6 Random Forest with GC data

> # Train a Random Forest
> set.seed(1)  # for reproducibility
> credit_model <- train(default ~ .,
+                         data = train80, 
+                         method = "ranger",
+                         metric = "ROC",
+                         trControl = ctrl)
>                              
> # Print the model output                             
> print(credit_model)
Random Forest 

800 samples
 16 predictor
  2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 640, 640, 640, 640, 640 
Resampling results across tuning parameters:

  mtry  splitrule   ROC        Sens       Spec     
   2    gini        0.7803571  0.9750000  0.1416667
   2    extratrees  0.7681176  0.9785714  0.1208333
  18    gini        0.7880952  0.8785714  0.4208333
  18    extratrees  0.7749628  0.8785714  0.4291667
  35    gini        0.7902158  0.8660714  0.4666667
  35    extratrees  0.7742560  0.8678571  0.4416667

Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 35, splitrule = gini
 and min.node.size = 1.
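
To compare this random forest with the bagged tree on the held-out data, the test-set AUC can be computed the same way as before (a sketch, reusing the objects above).

> pred_rf <- predict(object = credit_model, newdata = test20, type = "prob")
> Metrics::auc(actual = ifelse(test20$default == "yes", 1, 0),
+              predicted = pred_rf[, "yes"])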