Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
Some people believe that decision trees more closely mirror human decision-making than do regression and classification approaches.
Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
Trees can easily handle qualitative predictors without the need to create dummy variables (model.matrix()).
Trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches.
Trees suffer from high variance. This means that if we split the training data into two parts at random and fit a decision tree to each half, the results we get could be quite different. In contrast, a procedure with low variance will yield similar results if applied repeatedly to distinct data sets.
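To see this instability concretely, we could fit a default rpart tree to each random half of Boston and compare the variables the two trees split on; a minimal sketch (the seed and object names are illustrative and not part of the lab below):

# Fit a tree to each random half of the data and compare the splits used
library(MASS)
library(rpart)
set.seed(1)
half <- sample(1:nrow(Boston), nrow(Boston) / 2)
fit_a <- rpart(medv ~ ., data = Boston[half, ])
fit_b <- rpart(medv ~ ., data = Boston[-half, ])
unique(as.character(fit_a$frame$var[fit_a$frame$var != "<leaf>"]))  # splits in tree A
unique(as.character(fit_b$frame$var[fit_b$frame$var != "<leaf>"]))  # splits in tree B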
However, by aggregating many decision trees, using methods like bagging, random forests, and boosting, the predictive performance of trees can be substantially improved.
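Bagging, for instance, is simply a random forest that considers all of the predictors at each split; a minimal sketch (using mtry = 13 because Boston has 13 predictors; the lab below tunes mtry instead):

# Bagging = random forest with mtry equal to the number of predictors (13 here)
library(MASS)
library(randomForest)
set.seed(1)
bag_boston <- randomForest(medv ~ ., data = Boston, mtry = 13, ntree = 500)
bag_boston   # out-of-bag MSE and % variance explained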
> library(MASS)
> library(rpart)
> set.seed(8341)
> train <- sample(1:nrow(Boston), nrow(Boston)/2)
Next, we create train1 and test1 with caret.
> library(caret)
> set.seed(821)
> trainIndex <- createDataPartition(y = Boston$medv,
+ p = 0.50,
+ list = FALSE,
+ times = 1)
> train1 <- Boston[trainIndex, ]
> test1 <- Boston[-trainIndex, ]
> dim(train1)
[1] 254 14
> dim(test1)
[1] 252 14
Next we use caret to perform stepwise selection with repeated cross-validation.
> fitControl <- trainControl(method = "repeatedcv",
+ number = 10,
+ repeats = 3,
+ allowParallel = TRUE)
> set.seed(34)
> stepMod <- train(medv ~.,
+ data = train1,
+ method = "leapSeq",
+ trControl = fitControl,
+ verbose = FALSE)
> stepMod
Linear Regression with Stepwise Selection
254 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 228, 228, 227, 229, 228, 230, ...
Resampling results across tuning parameters:
nvmax RMSE Rsquared MAE
2 5.967332 0.5849737 4.290178
3 5.539786 0.6402357 3.993155
4 5.626944 0.6280172 3.992085
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 3.
> summary(stepMod)
Subset selection object
13 Variables (and intercept)
Forced in Forced out
crim FALSE FALSE
zn FALSE FALSE
indus FALSE FALSE
chas FALSE FALSE
nox FALSE FALSE
rm FALSE FALSE
age FALSE FALSE
dis FALSE FALSE
rad FALSE FALSE
tax FALSE FALSE
ptratio FALSE FALSE
black FALSE FALSE
lstat FALSE FALSE
1 subsets of each size up to 3
Selection Algorithm: 'sequential replacement'
crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
> summary(stepMod$finalModel)
Subset selection object
13 Variables (and intercept)
Forced in Forced out
crim FALSE FALSE
zn FALSE FALSE
indus FALSE FALSE
chas FALSE FALSE
nox FALSE FALSE
rm FALSE FALSE
age FALSE FALSE
dis FALSE FALSE
rad FALSE FALSE
tax FALSE FALSE
ptratio FALSE FALSE
black FALSE FALSE
lstat FALSE FALSE
1 subsets of each size up to 3
Selection Algorithm: 'sequential replacement'
crim zn indus chas nox rm age dis rad tax ptratio black lstat
1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
> coef(stepMod$finalModel, id = 3)
(Intercept) rm ptratio lstat
24.5726339 3.5454182 -0.8656869 -0.6481990
> yhat <- predict(stepMod, newdata = test1)
> RMSE <- sqrt(mean((test1$medv - yhat)^2))
> RMSE
[1] 4.885194
> modstep <- stepAIC(lm(medv ~ ., data = Boston, subset = train), k = log(253))
Start: AIC=849.55
medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad +
tax + ptratio + black + lstat
Df Sum of Sq RSS AIC
- age 1 2.35 5353.6 844.13
- indus 1 16.66 5367.9 844.80
- zn 1 38.68 5390.0 845.84
- black 1 61.60 5412.9 846.91
- crim 1 62.41 5413.7 846.95
- tax 1 88.71 5440.0 848.17
<none> 5351.3 849.55
- rad 1 161.68 5513.0 851.55
- nox 1 249.64 5600.9 855.55
- chas 1 275.33 5626.6 856.71
- dis 1 526.83 5878.1 867.77
- ptratio 1 611.25 5962.5 871.38
- lstat 1 993.46 6344.7 887.10
- rm 1 1910.15 7261.4 921.24
Step: AIC=844.13
medv ~ crim + zn + indus + chas + nox + rm + dis + rad + tax +
ptratio + black + lstat
Df Sum of Sq RSS AIC
- indus 1 16.70 5370.3 839.38
- zn 1 42.14 5395.8 840.58
- black 1 60.34 5414.0 841.43
- crim 1 62.07 5415.7 841.51
- tax 1 89.97 5443.6 842.81
<none> 5353.6 844.13
- rad 1 168.47 5522.1 846.43
- chas 1 273.40 5627.0 851.19
- nox 1 291.26 5644.9 852.00
- dis 1 537.55 5891.2 862.80
- ptratio 1 619.39 5973.0 866.29
- lstat 1 1279.12 6632.7 892.80
- rm 1 1995.95 7349.6 918.76
Step: AIC=839.38
medv ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
black + lstat
Df Sum of Sq RSS AIC
- zn 1 38.65 5409.0 835.66
- black 1 57.81 5428.1 836.56
- crim 1 65.43 5435.8 836.91
- tax 1 73.32 5443.6 837.28
<none> 5370.3 839.38
- rad 1 152.09 5522.4 840.91
- nox 1 274.60 5644.9 846.46
- chas 1 283.71 5654.0 846.87
- ptratio 1 602.70 5973.0 860.76
- dis 1 632.29 6002.6 862.01
- lstat 1 1262.86 6633.2 887.28
- rm 1 1979.26 7349.6 913.23
Step: AIC=835.66
medv ~ crim + chas + nox + rm + dis + rad + tax + ptratio + black +
lstat
Df Sum of Sq RSS AIC
- crim 1 54.83 5463.8 832.68
- tax 1 57.07 5466.0 832.78
- black 1 59.53 5468.5 832.90
<none> 5409.0 835.66
- rad 1 140.66 5549.6 836.62
- chas 1 277.59 5686.6 842.79
- nox 1 293.51 5702.5 843.50
- dis 1 663.67 6072.6 859.41
- ptratio 1 831.71 6240.7 866.31
- lstat 1 1270.25 6679.2 883.50
- rm 1 2074.09 7483.1 912.25
Step: AIC=832.68
medv ~ chas + nox + rm + dis + rad + tax + ptratio + black +
lstat
Df Sum of Sq RSS AIC
- tax 1 56.71 5520.5 829.76
- black 1 70.60 5534.4 830.39
- rad 1 99.42 5563.2 831.71
<none> 5463.8 832.68
- nox 1 274.24 5738.0 839.54
- chas 1 293.02 5756.8 840.36
- dis 1 635.60 6099.4 854.99
- ptratio 1 801.83 6265.6 861.79
- lstat 1 1455.92 6919.7 886.91
- rm 1 2095.20 7559.0 909.27
Step: AIC=829.76
medv ~ chas + nox + rm + dis + rad + ptratio + black + lstat
Df Sum of Sq RSS AIC
- rad 1 44.10 5564.6 826.24
- black 1 80.49 5601.0 827.89
<none> 5520.5 829.76
- chas 1 306.10 5826.6 837.88
- nox 1 350.56 5871.1 839.80
- dis 1 649.04 6169.5 852.35
- ptratio 1 866.50 6387.0 861.11
- lstat 1 1458.57 6979.1 883.54
- rm 1 2207.01 7727.5 909.31
Step: AIC=826.24
medv ~ chas + nox + rm + dis + ptratio + black + lstat
Df Sum of Sq RSS AIC
- black 1 66.69 5631.3 823.72
<none> 5564.6 826.24
- chas 1 292.35 5857.0 833.66
- nox 1 311.02 5875.6 834.46
- dis 1 623.89 6188.5 847.59
- ptratio 1 848.26 6412.9 856.60
- lstat 1 1427.14 6991.8 878.47
- rm 1 2327.78 7892.4 909.12
Step: AIC=823.72
medv ~ chas + nox + rm + dis + ptratio + lstat
Df Sum of Sq RSS AIC
<none> 5631.3 823.72
- chas 1 311.03 5942.3 831.79
- nox 1 387.84 6019.1 835.04
- dis 1 647.59 6278.9 845.72
- ptratio 1 874.95 6506.2 854.72
- lstat 1 1561.79 7193.1 880.11
- rm 1 2270.74 7902.0 903.90
> summary(modstep)
Call:
lm(formula = medv ~ chas + nox + rm + dis + ptratio + lstat,
data = Boston, subset = train)
Residuals:
Min 1Q Median 3Q Max
-10.2397 -2.8547 -0.7412 1.9924 23.9609
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.19591 6.30753 4.312 2.35e-05 ***
chas 4.70186 1.27557 3.686 0.00028 ***
nox -18.98584 4.61252 -4.116 5.26e-05 ***
rm 5.49957 0.55218 9.960 < 2e-16 ***
dis -1.17646 0.22119 -5.319 2.35e-07 ***
ptratio -0.96136 0.15550 -6.182 2.61e-09 ***
lstat -0.53352 0.06459 -8.260 9.03e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.785 on 246 degrees of freedom
Multiple R-squared: 0.7664, Adjusted R-squared: 0.7607
F-statistic: 134.5 on 6 and 246 DF, p-value: < 2.2e-16
> yhat1 <- predict(modstep, newdata= Boston[-train, ])
> boston_test1 <- Boston[-train, "medv"]
> MSE1 <- mean((yhat1 - boston_test1)^2)
> RMSE1 <- sqrt(MSE1)
> c(MSE1, RMSE1)
[1] 27.456752 5.239919
Here we fit a regression tree to the Boston data set.
> tree_boston <- rpart(medv ~ ., data = Boston, subset = train)
> summary(tree_boston)
Call:
rpart(formula = medv ~ ., data = Boston, subset = train)
n= 253
CP nsplit rel error xerror xstd
1 0.54137149 0 1.0000000 1.0069042 0.11526246
2 0.17236659 1 0.4586285 0.4661992 0.06124901
3 0.04628434 2 0.2862619 0.3058307 0.04862013
4 0.04409662 3 0.2399776 0.2939379 0.04973640
5 0.02134681 4 0.1958810 0.2462115 0.04601763
6 0.01807595 5 0.1745341 0.2293116 0.04528531
7 0.01027511 6 0.1564582 0.2099538 0.04169544
8 0.01000000 7 0.1461831 0.2116157 0.04133696
Variable importance
rm lstat age nox crim dis tax ptratio zn
33 27 7 6 6 5 5 4 3
indus black rad
2 1 1
Node number 1: 253 observations, complexity param=0.5413715
mean=22.79486, MSE=95.27725
left son=2 (224 obs) right son=3 (29 obs)
Primary splits:
rm < 7.121 to the left, improve=0.5413715, (0 missing)
lstat < 5.44 to the right, improve=0.4720781, (0 missing)
ptratio < 19.9 to the right, improve=0.2594405, (0 missing)
indus < 6.66 to the right, improve=0.2570786, (0 missing)
nox < 0.6695 to the right, improve=0.2215197, (0 missing)
Surrogate splits:
lstat < 4.475 to the right, agree=0.933, adj=0.414, (0 split)
ptratio < 13.85 to the right, agree=0.901, adj=0.138, (0 split)
zn < 87.5 to the left, agree=0.897, adj=0.103, (0 split)
Node number 2: 224 observations, complexity param=0.1723666
mean=20.21071, MSE=43.5639
left son=4 (80 obs) right son=5 (144 obs)
Primary splits:
lstat < 14.915 to the right, improve=0.4257827, (0 missing)
nox < 0.6695 to the right, improve=0.2936967, (0 missing)
crim < 5.84803 to the right, improve=0.2462949, (0 missing)
ptratio < 19.9 to the right, improve=0.2378162, (0 missing)
age < 77.4 to the right, improve=0.2360563, (0 missing)
Surrogate splits:
age < 91.15 to the right, agree=0.875, adj=0.650, (0 split)
dis < 2.1288 to the left, agree=0.804, adj=0.450, (0 split)
crim < 5.84803 to the right, agree=0.799, adj=0.438, (0 split)
nox < 0.6695 to the right, agree=0.786, adj=0.400, (0 split)
tax < 434.5 to the right, agree=0.772, adj=0.362, (0 split)
Node number 3: 29 observations, complexity param=0.04409662
mean=42.75517, MSE=44.72385
left son=6 (10 obs) right son=7 (19 obs)
Primary splits:
rm < 7.433 to the left, improve=0.8195545, (0 missing)
lstat < 5.185 to the right, improve=0.3105049, (0 missing)
black < 393.1 to the right, improve=0.2059922, (0 missing)
dis < 3.6617 to the right, improve=0.1586536, (0 missing)
ptratio < 15.05 to the right, improve=0.1189739, (0 missing)
Surrogate splits:
lstat < 4.725 to the right, agree=0.828, adj=0.5, (0 split)
indus < 1.775 to the left, agree=0.724, adj=0.2, (0 split)
black < 393.1 to the right, agree=0.724, adj=0.2, (0 split)
crim < 0.018935 to the left, agree=0.690, adj=0.1, (0 split)
zn < 28.5 to the right, agree=0.690, adj=0.1, (0 split)
Node number 4: 80 observations, complexity param=0.02134681
mean=14.4325, MSE=17.84719
left son=8 (50 obs) right son=9 (30 obs)
Primary splits:
nox < 0.603 to the right, improve=0.3603984, (0 missing)
crim < 0.853015 to the right, improve=0.2732551, (0 missing)
tax < 551.5 to the right, improve=0.2454497, (0 missing)
ptratio < 19.6 to the right, improve=0.2261544, (0 missing)
dis < 2.0037 to the left, improve=0.2202198, (0 missing)
Surrogate splits:
tax < 393.5 to the right, agree=0.900, adj=0.733, (0 split)
indus < 14.345 to the right, agree=0.888, adj=0.700, (0 split)
dis < 2.399 to the left, agree=0.875, adj=0.667, (0 split)
crim < 1.62073 to the right, agree=0.850, adj=0.600, (0 split)
rad < 16 to the right, agree=0.788, adj=0.433, (0 split)
Node number 5: 144 observations, complexity param=0.04628434
mean=23.42083, MSE=28.99734
left son=10 (128 obs) right son=11 (16 obs)
Primary splits:
lstat < 5.41 to the right, improve=0.26719180, (0 missing)
rm < 6.542 to the left, improve=0.24368150, (0 missing)
dis < 1.85635 to the right, improve=0.08338557, (0 missing)
chas < 0.5 to the left, improve=0.06863430, (0 missing)
ptratio < 20.55 to the right, improve=0.06312148, (0 missing)
Node number 6: 10 observations
mean=34.41, MSE=5.3309
Node number 7: 19 observations
mean=47.14737, MSE=9.511967
Node number 8: 50 observations
mean=12.468, MSE=10.23458
Node number 9: 30 observations
mean=17.70667, MSE=13.38262
Node number 10: 128 observations, complexity param=0.01807595
mean=22.43672, MSE=18.80889
left son=20 (66 obs) right son=21 (62 obs)
Primary splits:
lstat < 9.98 to the right, improve=0.18098300, (0 missing)
rm < 6.5165 to the left, improve=0.11088040, (0 missing)
indus < 4.22 to the right, improve=0.07527922, (0 missing)
ptratio < 20.95 to the right, improve=0.06686426, (0 missing)
chas < 0.5 to the left, improve=0.04522875, (0 missing)
Surrogate splits:
rm < 6.3825 to the left, agree=0.719, adj=0.419, (0 split)
crim < 0.08577 to the right, agree=0.703, adj=0.387, (0 split)
nox < 0.5125 to the right, agree=0.695, adj=0.371, (0 split)
indus < 6.145 to the right, agree=0.680, adj=0.339, (0 split)
age < 48.1 to the right, agree=0.680, adj=0.339, (0 split)
Node number 11: 16 observations
mean=31.29375, MSE=40.77434
Node number 20: 66 observations
mean=20.64848, MSE=9.997649
Node number 21: 62 observations, complexity param=0.01027511
mean=24.34032, MSE=21.16079
left son=42 (55 obs) right son=43 (7 obs)
Primary splits:
black < 371.25 to the right, improve=0.1887872, (0 missing)
indus < 15.465 to the left, improve=0.1113402, (0 missing)
crim < 1.066625 to the left, improve=0.1113402, (0 missing)
rad < 7.5 to the left, improve=0.1016110, (0 missing)
tax < 400.5 to the left, improve=0.0976496, (0 missing)
Surrogate splits:
nox < 0.589 to the left, agree=0.935, adj=0.429, (0 split)
dis < 1.9624 to the right, agree=0.935, adj=0.429, (0 split)
indus < 18.84 to the left, agree=0.919, adj=0.286, (0 split)
age < 89.45 to the left, agree=0.919, adj=0.286, (0 split)
tax < 400.5 to the left, agree=0.919, adj=0.286, (0 split)
Node number 42: 55 observations
mean=23.62727, MSE=10.23835
Node number 43: 7 observations
mean=29.94286, MSE=71.59673
> tree_boston1 <- rpart(medv ~ ., data = train1)
> summary(tree_boston1)
Call:
rpart(formula = medv ~ ., data = train1)
n= 254
CP nsplit rel error xerror xstd
1 0.46109185 0 1.0000000 1.0038882 0.12286406
2 0.20759279 1 0.5389082 0.6392510 0.08041842
3 0.07223984 2 0.3313154 0.4127999 0.06519430
4 0.04628076 3 0.2590755 0.3195668 0.05551688
5 0.02742681 4 0.2127948 0.3283610 0.05363943
6 0.01413471 5 0.1853679 0.2671046 0.03738031
7 0.01000000 6 0.1712332 0.2585818 0.03660635
Variable importance
lstat rm indus crim nox zn dis age rad
27 19 12 12 12 8 4 3 1
tax ptratio
1 1
Node number 1: 254 observations, complexity param=0.4610918
mean=22.53819, MSE=82.27488
left son=2 (156 obs) right son=3 (98 obs)
Primary splits:
lstat < 9.545 to the right, improve=0.4610918, (0 missing)
rm < 7.414 to the left, improve=0.4193780, (0 missing)
indus < 4.01 to the right, improve=0.2441960, (0 missing)
ptratio < 18.75 to the right, improve=0.2255930, (0 missing)
nox < 0.6695 to the right, improve=0.1867062, (0 missing)
Surrogate splits:
crim < 0.09628 to the right, agree=0.791, adj=0.459, (0 split)
indus < 7.625 to the right, agree=0.791, adj=0.459, (0 split)
nox < 0.5125 to the right, agree=0.783, adj=0.439, (0 split)
rm < 6.413 to the left, agree=0.776, adj=0.418, (0 split)
zn < 15 to the left, agree=0.764, adj=0.388, (0 split)
Node number 2: 156 observations, complexity param=0.07223984
mean=17.65641, MSE=21.87579
left son=4 (82 obs) right son=5 (74 obs)
Primary splits:
lstat < 15 to the right, improve=0.4423738, (0 missing)
crim < 6.107135 to the right, improve=0.2756082, (0 missing)
dis < 2.06575 to the left, improve=0.2728879, (0 missing)
nox < 0.6635 to the right, improve=0.2521726, (0 missing)
black < 115.805 to the left, improve=0.2307359, (0 missing)
Surrogate splits:
age < 90.6 to the right, agree=0.769, adj=0.514, (0 split)
dis < 2.06575 to the left, agree=0.724, adj=0.419, (0 split)
crim < 0.866645 to the right, agree=0.692, adj=0.351, (0 split)
indus < 7.625 to the right, agree=0.679, adj=0.324, (0 split)
nox < 0.5835 to the right, agree=0.667, adj=0.297, (0 split)
Node number 3: 98 observations, complexity param=0.2075928
mean=30.30918, MSE=80.09573
left son=6 (83 obs) right son=7 (15 obs)
Primary splits:
rm < 7.437 to the left, improve=0.5526851, (0 missing)
lstat < 4.65 to the right, improve=0.3556923, (0 missing)
nox < 0.574 to the left, improve=0.2754423, (0 missing)
dis < 2.99335 to the right, improve=0.2189892, (0 missing)
ptratio < 15.05 to the right, improve=0.1987182, (0 missing)
Surrogate splits:
lstat < 3.99 to the right, agree=0.908, adj=0.400, (0 split)
zn < 87.5 to the left, agree=0.857, adj=0.067, (0 split)
indus < 18.84 to the left, agree=0.857, adj=0.067, (0 split)
tax < 219 to the right, agree=0.857, adj=0.067, (0 split)
ptratio < 14.75 to the right, agree=0.857, adj=0.067, (0 split)
Node number 4: 82 observations, complexity param=0.01413471
mean=14.70122, MSE=12.03597
left son=8 (37 obs) right son=9 (45 obs)
Primary splits:
crim < 6.029745 to the right, improve=0.2992903, (0 missing)
nox < 0.603 to the right, improve=0.2541619, (0 missing)
black < 28.14 to the left, improve=0.2338012, (0 missing)
dis < 2.0037 to the left, improve=0.2287683, (0 missing)
tax < 551.5 to the right, improve=0.2255979, (0 missing)
Surrogate splits:
rad < 15.5 to the right, agree=0.915, adj=0.811, (0 split)
tax < 551.5 to the right, agree=0.890, adj=0.757, (0 split)
ptratio < 20.15 to the right, agree=0.744, adj=0.432, (0 split)
black < 85.73 to the left, agree=0.732, adj=0.405, (0 split)
dis < 2.2085 to the left, agree=0.720, adj=0.378, (0 split)
Node number 5: 74 observations
mean=20.93108, MSE=12.37863
Node number 6: 83 observations, complexity param=0.04628076
mean=27.48072, MSE=38.36927
left son=12 (76 obs) right son=13 (7 obs)
Primary splits:
dis < 2.0891 to the right, improve=0.3036966, (0 missing)
nox < 0.589 to the left, improve=0.3036966, (0 missing)
age < 88.75 to the left, improve=0.2790731, (0 missing)
tax < 400 to the left, improve=0.1883790, (0 missing)
indus < 14.48 to the left, improve=0.1813055, (0 missing)
Surrogate splits:
nox < 0.589 to the left, agree=1.000, adj=1.000, (0 split)
age < 87.6 to the left, agree=0.976, adj=0.714, (0 split)
crim < 0.64202 to the left, agree=0.964, adj=0.571, (0 split)
indus < 16.57 to the left, agree=0.964, adj=0.571, (0 split)
rad < 16 to the left, agree=0.952, adj=0.429, (0 split)
Node number 7: 15 observations
mean=45.96, MSE=21.7664
Node number 8: 37 observations
mean=12.60811, MSE=9.918583
Node number 9: 45 observations
mean=16.42222, MSE=7.21284
Node number 12: 76 observations, complexity param=0.02742681
mean=26.44474, MSE=19.64852
left son=24 (56 obs) right son=25 (20 obs)
Primary splits:
rm < 6.8035 to the left, improve=0.3838246, (0 missing)
lstat < 5.265 to the right, improve=0.2307796, (0 missing)
ptratio < 18.65 to the right, improve=0.1500866, (0 missing)
tax < 222.5 to the right, improve=0.1312044, (0 missing)
indus < 1.715 to the right, improve=0.1245970, (0 missing)
Surrogate splits:
indus < 1.715 to the right, agree=0.803, adj=0.25, (0 split)
lstat < 4.24 to the right, agree=0.789, adj=0.20, (0 split)
zn < 87.5 to the left, agree=0.763, adj=0.10, (0 split)
ptratio < 13.65 to the right, agree=0.763, adj=0.10, (0 split)
nox < 0.4045 to the right, agree=0.750, adj=0.05, (0 split)
Node number 13: 7 observations
mean=38.72857, MSE=103.4563
Node number 24: 56 observations
mean=24.80357, MSE=11.06213
Node number 25: 20 observations
mean=31.04, MSE=15.0324
> plot(tree_boston)
> text(tree_boston, pretty = 0, use.n = TRUE, all = TRUE, cex = 0.7)
> library(tree)
> tree_boston2 <- tree(medv ~ ., data = Boston, subset = train)
> summary(tree_boston2)
Regression tree:
tree(formula = medv ~ ., data = Boston, subset = train)
Variables actually used in tree construction:
[1] "rm" "lstat" "dis" "black" "nox"
Number of terminal nodes: 9
Residual mean deviance: 13.15 = 3208 / 244
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-11.74000 -2.10700 0.06429 0.00000 1.65300 19.08000
> plot(tree_boston2)
> text(tree_boston2, pretty = 0, use.n = TRUE, all = TRUE, cex = 0.7)
Notice that the output of summary() indicates that five of the variables have been used in constructing the tree. In the context of a regression tree, the deviance is simply the sum of the squared errors for the tree (a quick check of this follows the plots below). We now plot the tree with the package partykit, which only works on trees created with the package rpart, not tree.
> library(partykit)
> plot(as.party(tree_boston))
> plot(as.party(tree_boston1))
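Because the deviance reported by summary() for tree_boston2 is just the sum of squared errors, we can verify it directly; a small sketch, assuming tree_boston2 and train from above:

# Residual deviance of a regression tree = SSE over the training observations
rss <- sum((Boston$medv[train] - predict(tree_boston2))^2)
rss               # should reproduce the 3208 reported by summary() above
rss / (253 - 9)   # residual mean deviance, about 13.15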
The variable lstat measures the percentage of individuals with lower socioeconomic status. The tree indicates that lower values of lstat correspond to more expensive houses. The tree predicts a median house price of $46,000 for larger homes in suburbs in which residents have high socioeconomic status (rm >= 7.437 and lstat < 9.545). Now we use the cv.tree() function to see whether pruning the tree will improve performance.
> set.seed(3)
> cv_boston2 <- cv.tree(tree_boston2)
> plot(cv_boston2$size, cv_boston2$dev, type = "b")
In this case, the most complex tree is selected by cross-validation. However, if we wish to prune the tree, we could do so as follows, using the prune.tree() function:
> prune_boston2 <- prune.tree(tree_boston2, best = 5)
> plot(prune_boston2)
> text(prune_boston2, pretty = 0, all = TRUE, cex = 0.7)
In keeping with the cross-validation results, we use the unpruned tree to make predictions on the test set.
> yhat <- predict(tree_boston2, newdata = Boston[-train, ])
> boston_test <- Boston[-train, "medv"]
> plot(yhat, boston_test)
> abline(0, 1)
> MSE <- mean((yhat - boston_test)^2)
> RMSE <- sqrt(mean((yhat - boston_test)^2))
> c(MSE, RMSE)
[1] 24.383148 4.937929
In other words, the test MSE associated with the regression tree is 24.3831. The square root of the MSE is 4.9379, indicating that this model leads to test predictions that are within approximately $4,938 of the true median home value for the suburb.
We also compute the test RMSE for tree_boston1 using test1.
> yhat1 <- predict(tree_boston1, newdata = test1)
> MSE <- mean((test1$medv - yhat1)^2)
> RMSE <- sqrt(MSE)
> c(MSE, RMSE)
[1] 19.347308 4.398557
> library(randomForest)
> set.seed(32)
> mod_forest <- randomForest(medv ~ ., data = train1, ntree = 1000)
> mod_forest
Call:
randomForest(formula = medv ~ ., data = train1, ntree = 1000)
Type of random forest: regression
Number of trees: 1000
No. of variables tried at each split: 4
Mean of squared residuals: 12.68986
% Var explained: 84.58
> yhat_rf <- predict(mod_forest, newdata = test1)
> MSE_test <- mean((test1$medv - yhat_rf)^2)
> RMSE_test <- sqrt(MSE_test)
> c(MSE_test, RMSE_test)
[1] 11.537566 3.396699
Next we tune mtry for the random forest with caret, using repeated cross-validation.
> fitControl <- trainControl(method = "repeatedcv",
+ number = 10,
+ repeats = 3,
+ allowParallel = TRUE)
> set.seed(32)
> mtry = c(2, 4, 8)
> tunegrid <- expand.grid(.mtry = mtry)
> mod_rfc <- train(medv ~ .,
+ data = train1,
+ method = "rf",
+ ntree = 1000,
+ tuneGrid = tunegrid,
+ trControl = fitControl)
> mod_rfc
Random Forest
254 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 228, 229, 229, 228, 229, 229, ...
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
2 3.935595 0.8405704 2.663757
4 3.448269 0.8730797 2.439154
8 3.273626 0.8782315 2.392458
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 8.
> yhat2 <- predict(mod_rfc, newdata = test1)
> MSE_test <- mean((test1$medv - yhat2)^2)
> RMSE_test <- sqrt(MSE_test)
> c(MSE_test, RMSE_test)
[1] 11.465552 3.386082
Next, we use caret and caretEnsemble. This time, we will create a training set of roughly 80% of the values in Boston.
> library(caret)
> set.seed(757)
> trainIndex <- createDataPartition(y = Boston$medv,
+ p = 0.80,
+ list = FALSE)
> training <- Boston[trainIndex, ]
> testing <- Boston[-trainIndex, ]
> dim(training)
[1] 407 14
> dim(testing)
[1] 99 14
Consider the type of each variable.
> sapply(training, class)
crim zn indus chas nox rm age
"numeric" "numeric" "numeric" "integer" "numeric" "numeric" "numeric"
dis rad tax ptratio black lstat medv
"numeric" "integer" "numeric" "numeric" "numeric" "numeric" "numeric"
> str(training)
'data.frame': 407 obs. of 14 variables:
$ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
$ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
$ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
$ chas : int 0 0 0 0 0 0 0 0 0 0 ...
$ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
$ rm : num 6.58 6.42 7.18 7 7.15 ...
$ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 85.9 94.3 ...
$ dis : num 4.09 4.97 4.97 6.06 6.06 ...
$ rad : int 1 2 2 3 3 3 5 5 5 5 ...
$ tax : num 296 242 242 222 222 222 311 311 311 311 ...
$ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
$ black : num 397 397 393 395 397 ...
$ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
$ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 18.9 15 ...
> summary(training)
crim zn indus chas
Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.: 0.08232 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median : 0.28960 Median : 0.00 Median : 9.69 Median :0.00000
Mean : 3.50216 Mean : 11.41 Mean :11.25 Mean :0.07125
3rd Qu.: 3.39766 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.169
1st Qu.:0.4510 1st Qu.:5.886 1st Qu.: 43.55 1st Qu.: 2.102
Median :0.5380 Median :6.193 Median : 78.30 Median : 3.216
Mean :0.5555 Mean :6.300 Mean : 68.62 Mean : 3.791
3rd Qu.:0.6240 3rd Qu.:6.622 3rd Qu.: 94.45 3rd Qu.: 5.117
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
rad tax ptratio black
Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
1st Qu.: 4.000 1st Qu.:280.5 1st Qu.:16.95 1st Qu.:376.07
Median : 5.000 Median :330.0 Median :19.00 Median :392.05
Mean : 9.391 Mean :407.3 Mean :18.41 Mean :358.95
3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.26
Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
lstat medv
Min. : 1.730 Min. : 5.00
1st Qu.: 6.725 1st Qu.:17.05
Median :11.250 Median :21.20
Mean :12.481 Mean :22.63
3rd Qu.:16.695 3rd Qu.:25.00
Max. :36.980 Max. :50.00
Let us take a quick look at the correlation between all of the predictors.
> cor(training[, -14])
crim zn indus chas nox
crim 1.00000000 -0.1855645 0.37191762 -0.05599437 0.38946869
zn -0.18556449 1.0000000 -0.54430460 -0.04837940 -0.51456396
indus 0.37191762 -0.5443046 1.00000000 0.04153292 0.74991644
chas -0.05599437 -0.0483794 0.04153292 1.00000000 0.06889513
nox 0.38946869 -0.5145640 0.74991644 0.06889513 1.00000000
rm -0.19833754 0.3114150 -0.41414262 0.09654506 -0.31401792
age 0.32862662 -0.5670406 0.64571900 0.09146708 0.73601549
dis -0.35005456 0.6865718 -0.70074282 -0.09070932 -0.76548127
rad 0.59476255 -0.3055176 0.56751891 -0.01255656 0.58279945
tax 0.54973661 -0.3026334 0.70952060 -0.05186279 0.64240452
ptratio 0.26587913 -0.3993144 0.37179013 -0.11869778 0.16737086
black -0.38126073 0.1624189 -0.32549057 0.04230642 -0.36273394
lstat 0.43099691 -0.4093366 0.60784942 -0.03265481 0.59949041
rm age dis rad tax
crim -0.19833754 0.32862662 -0.35005456 0.59476255 0.54973661
zn 0.31141500 -0.56704063 0.68657185 -0.30551757 -0.30263337
indus -0.41414262 0.64571900 -0.70074282 0.56751891 0.70952060
chas 0.09654506 0.09146708 -0.09070932 -0.01255656 -0.05186279
nox -0.31401792 0.73601549 -0.76548127 0.58279945 0.64240452
rm 1.00000000 -0.22861271 0.19559393 -0.21033418 -0.30913441
age -0.22861271 1.00000000 -0.76493082 0.43321989 0.49428877
dis 0.19559393 -0.76493082 1.00000000 -0.47290741 -0.51708554
rad -0.21033418 0.43321989 -0.47290741 1.00000000 0.90136813
tax -0.30913441 0.49428877 -0.51708554 0.90136813 1.00000000
ptratio -0.36950800 0.25485463 -0.21521083 0.44766904 0.45989270
black 0.15412667 -0.26451179 0.27563498 -0.40829445 -0.40718417
lstat -0.62594701 0.60834627 -0.50661891 0.47294964 0.54223926
ptratio black lstat
crim 0.2658791 -0.38126073 0.43099691
zn -0.3993144 0.16241893 -0.40933661
indus 0.3717901 -0.32549057 0.60784942
chas -0.1186978 0.04230642 -0.03265481
nox 0.1673709 -0.36273394 0.59949041
rm -0.3695080 0.15412667 -0.62594701
age 0.2548546 -0.26451179 0.60834627
dis -0.2152108 0.27563498 -0.50661891
rad 0.4476690 -0.40829445 0.47294964
tax 0.4598927 -0.40718417 0.54223926
ptratio 1.0000000 -0.15071089 0.37972555
black -0.1507109 1.00000000 -0.38038074
lstat 0.3797256 -0.38038074 1.00000000
There are a lot of correlated variables…
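caret's findCorrelation() can flag which predictors are most problematic; a small sketch (the 0.75 cutoff is an arbitrary choice):

# Columns whose pairwise correlations exceed the cutoff
library(caret)
highCorr <- findCorrelation(cor(training[, -14]), cutoff = 0.75)
names(training[, -14])[highCorr]   # candidates one might consider dropping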
Next, we take a quick look at the distribution of each predictor.
> par(mfrow = c(2, 7))
> for(i in 1:13){
+ hist(training[, i], main = names(training)[i], col = "pink")
+ }
> par(mfrow = c(1, 1))
> par(mfrow = c(4, 4))
> for(i in 1:13){
+ plot(density(training[, i]), main = names(training)[i], col = "blue")
+ }
> par(mfrow = c(1, 1))
It appears that some variables follow an exponential distribution while a few others might be bimodal; a numeric skewness check follows the boxplots below. Consider boxplots of each predictor next.
> par(mfrow = c(4, 4))
> for(i in 1:13){
+ boxplot(training[, i], main = names(training)[i])
+ }
> par(mfrow = c(1, 1))
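As noted above, the apparent skewness can also be checked numerically; a sketch assuming the e1071 package is installed:

# Sample skewness of each predictor (large positive values suggest long right tails)
library(e1071)
round(sapply(training[, -14], skewness), 2)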
> pairs(training[, -14]) # Yuck
> library(corrplot)
> rs <- cor(training[, -14])
> corrplot(rs, method = "circle")
> myControl <- trainControl(method = "repeatedcv",
+ number = 10,
+ repeats = 3,
+ allowParallel = TRUE)
> library(caretEnsemble)
> set.seed(32)
> models <- caretList(log(medv) ~ .,
+ data = training,
+ trControl = myControl,
+ tuneList = list(
+ mod_lm = caretModelSpec(method = "lm"),
+ mod_svm = caretModelSpec(method = "svmRadial"),
+ mod_knn = caretModelSpec(method = "knn"),
+ mod_gbm = caretModelSpec(method = "gbm"),
+ mod_BIC = caretModelSpec(method = "lmStepAIC", k = log(407)),
+ mod_AIC = caretModelSpec(method = "lmStepAIC"),
+ mod_BE = caretModelSpec(method = "leapBackward"),
+ mod_FS = caretModelSpec(method = "leapForward"),
+ mod_SS = caretModelSpec(method = "leapSeq"),
+ mod_RF = caretModelSpec(method = "ranger"),
+ mod_EN = caretModelSpec(method = "glmnet",
+ tuneGrid = expand.grid(alpha = c(0, 0.5, 1),
+ lambda = seq(0.0001, 1, length = 20)))
+
+ ))
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.1418 -nan 0.1000 0.0187
2 0.1228 -nan 0.1000 0.0192
3 0.1080 -nan 0.1000 0.0134
4 0.0951 -nan 0.1000 0.0126
5 0.0849 -nan 0.1000 0.0097
6 0.0758 -nan 0.1000 0.0094
7 0.0689 -nan 0.1000 0.0067
8 0.0619 -nan 0.1000 0.0054
9 0.0562 -nan 0.1000 0.0048
10 0.0510 -nan 0.1000 0.0045
20 0.0277 -nan 0.1000 0.0006
40 0.0185 -nan 0.1000 -0.0002
60 0.0150 -nan 0.1000 -0.0002
80 0.0130 -nan 0.1000 -0.0001
100 0.0114 -nan 0.1000 -0.0000
120 0.0102 -nan 0.1000 -0.0002
140 0.0091 -nan 0.1000 -0.0001
150 0.0086 -nan 0.1000 -0.0000
Start: AIC=-1331.84
.outcome ~ crim + zn + indus + chas + nox + rm + age + dis +
rad + tax + ptratio + black + lstat
Df Sum of Sq RSS AIC
- age 1 0.0003 12.551 -1337.8
- indus 1 0.0197 12.571 -1337.2
<none> 12.551 -1331.8
- zn 1 0.1873 12.738 -1331.8
- chas 1 0.2168 12.768 -1330.9
- black 1 0.4511 13.002 -1323.5
- tax 1 0.4573 13.008 -1323.3
- nox 1 0.5538 13.105 -1320.3
- rad 1 0.6111 13.162 -1318.5
- rm 1 1.0284 13.579 -1305.8
- dis 1 1.1001 13.651 -1303.7
- ptratio 1 1.1860 13.737 -1301.1
- crim 1 1.5136 14.065 -1291.5
- lstat 1 5.6242 18.175 -1187.2
Step: AIC=-1337.84
.outcome ~ crim + zn + indus + chas + nox + rm + dis + rad +
tax + ptratio + black + lstat
Df Sum of Sq RSS AIC
- indus 1 0.0196 12.571 -1343.2
<none> 12.551 -1337.8
- zn 1 0.1897 12.741 -1337.7
- chas 1 0.2167 12.768 -1336.9
- black 1 0.4516 13.003 -1329.5
- tax 1 0.4589 13.010 -1329.2
- nox 1 0.6076 13.159 -1324.6
- rad 1 0.6200 13.171 -1324.2
- rm 1 1.0822 13.633 -1310.2
- ptratio 1 1.2031 13.754 -1306.6
- dis 1 1.2087 13.760 -1306.4
- crim 1 1.5137 14.065 -1297.5
- lstat 1 6.4543 19.006 -1175.0
Step: AIC=-1343.21
.outcome ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
black + lstat
Df Sum of Sq RSS AIC
- zn 1 0.1760 12.747 -1343.6
<none> 12.571 -1343.2
- chas 1 0.2304 12.801 -1341.8
- black 1 0.4498 13.021 -1334.9
- tax 1 0.4774 13.048 -1334.0
- nox 1 0.5927 13.164 -1330.5
- rad 1 0.6102 13.181 -1329.9
- rm 1 1.0630 13.634 -1316.2
- ptratio 1 1.1857 13.757 -1312.5
- dis 1 1.3201 13.891 -1308.6
- crim 1 1.5247 14.096 -1302.6
- lstat 1 6.4348 19.006 -1181.0
Step: AIC=-1343.56
.outcome ~ crim + chas + nox + rm + dis + rad + tax + ptratio +
black + lstat
Df Sum of Sq RSS AIC
<none> 12.747 -1343.6
- chas 1 0.2250 12.972 -1342.5
- tax 1 0.3674 13.114 -1338.0
- black 1 0.4542 13.201 -1335.3
- rad 1 0.5451 13.292 -1332.5
- nox 1 0.6545 13.401 -1329.2
- dis 1 1.2161 13.963 -1312.5
- rm 1 1.2782 14.025 -1310.7
- crim 1 1.4481 14.195 -1305.8
- ptratio 1 1.8165 14.563 -1295.3
- lstat 1 6.3325 19.079 -1185.4
Start: AIC=-1387.96
.outcome ~ crim + zn + indus + chas + nox + rm + age + dis +
rad + tax + ptratio + black + lstat
Df Sum of Sq RSS AIC
- age 1 0.0003 12.551 -1390.0
- indus 1 0.0197 12.571 -1389.3
<none> 12.551 -1388.0
- zn 1 0.1873 12.738 -1383.9
- chas 1 0.2168 12.768 -1383.0
- black 1 0.4511 13.002 -1375.6
- tax 1 0.4573 13.008 -1375.4
- nox 1 0.5538 13.105 -1372.4
- rad 1 0.6111 13.162 -1370.6
- rm 1 1.0284 13.579 -1357.9
- dis 1 1.1001 13.651 -1355.8
- ptratio 1 1.1860 13.737 -1353.2
- crim 1 1.5136 14.065 -1343.6
- lstat 1 5.6242 18.175 -1239.3
Step: AIC=-1389.95
.outcome ~ crim + zn + indus + chas + nox + rm + dis + rad +
tax + ptratio + black + lstat
Df Sum of Sq RSS AIC
- indus 1 0.0196 12.571 -1391.3
<none> 12.551 -1390.0
- zn 1 0.1897 12.741 -1385.8
- chas 1 0.2167 12.768 -1385.0
- black 1 0.4516 13.003 -1377.6
- tax 1 0.4589 13.010 -1377.3
- nox 1 0.6076 13.159 -1372.7
- rad 1 0.6200 13.171 -1372.3
- rm 1 1.0822 13.633 -1358.3
- ptratio 1 1.2031 13.754 -1354.7
- dis 1 1.2087 13.760 -1354.5
- crim 1 1.5137 14.065 -1345.6
- lstat 1 6.4543 19.006 -1223.1
Step: AIC=-1391.32
.outcome ~ crim + zn + chas + nox + rm + dis + rad + tax + ptratio +
black + lstat
Df Sum of Sq RSS AIC
<none> 12.571 -1391.3
- zn 1 0.1760 12.747 -1387.7
- chas 1 0.2304 12.801 -1385.9
- black 1 0.4498 13.021 -1379.0
- tax 1 0.4774 13.048 -1378.2
- nox 1 0.5927 13.164 -1374.6
- rad 1 0.6102 13.181 -1374.0
- rm 1 1.0630 13.634 -1360.3
- ptratio 1 1.1857 13.757 -1356.6
- dis 1 1.3201 13.891 -1352.7
- crim 1 1.5247 14.096 -1346.7
- lstat 1 6.4348 19.006 -1225.1
> models_pred <- lapply(models, predict, newdata = testing)
> models_pred <- as.data.frame((models_pred))
> models_pred <- exp(models_pred)
> head(models_pred)
mod_lm mod_svm mod_knn mod_gbm mod_BIC mod_AIC mod_BE mod_FS
9 12.10126 17.09738 17.22209 15.11849 12.54971 12.13816 12.70811 12.70811
14 20.41695 19.06091 21.14211 19.74192 20.59685 20.49928 21.38807 21.38807
20 19.03650 18.05127 23.61302 18.09039 19.03806 19.14874 18.90036 18.90036
36 23.45565 20.73759 27.09339 22.77786 23.54151 23.60537 21.73006 21.73006
40 31.54351 29.94356 38.89945 29.92197 29.23639 31.51472 28.84757 28.84757
44 24.85047 23.87144 23.07799 24.35731 25.46217 24.72752 25.16348 25.16348
mod_SS mod_RF mod_EN
9 11.36432 17.22285 12.13043
14 22.40196 19.92057 20.46086
20 19.69802 18.78564 19.06339
36 22.74828 21.30284 23.44067
40 29.68668 28.94304 31.41850
44 26.43419 23.91167 24.83434
> bwplot(resamples(models))
> modelCor(resamples(models))
mod_lm mod_svm mod_knn mod_gbm mod_BIC mod_AIC
mod_lm 1.0000000 0.7315707 0.5035556 0.6312936 0.9821889 0.9981746
mod_svm 0.7315707 1.0000000 0.6402334 0.7012810 0.7370684 0.7341203
mod_knn 0.5035556 0.6402334 1.0000000 0.5169711 0.5605050 0.5008608
mod_gbm 0.6312936 0.7012810 0.5169711 1.0000000 0.6293528 0.6387542
mod_BIC 0.9821889 0.7370684 0.5605050 0.6293528 1.0000000 0.9840121
mod_AIC 0.9981746 0.7341203 0.5008608 0.6387542 0.9840121 1.0000000
mod_BE 0.8861203 0.7792973 0.5760775 0.6826216 0.8872274 0.8814543
mod_FS 0.8876586 0.7906338 0.5816167 0.7026625 0.8873709 0.8825207
mod_SS 0.7686125 0.6594322 0.4747544 0.5963098 0.7871257 0.7626301
mod_RF 0.6780607 0.8626853 0.6765305 0.8495601 0.6943648 0.6887197
mod_EN 0.9997953 0.7369954 0.5123585 0.6336658 0.9847568 0.9984690
mod_BE mod_FS mod_SS mod_RF mod_EN
mod_lm 0.8861203 0.8876586 0.7686125 0.6780607 0.9997953
mod_svm 0.7792973 0.7906338 0.6594322 0.8626853 0.7369954
mod_knn 0.5760775 0.5816167 0.4747544 0.6765305 0.5123585
mod_gbm 0.6826216 0.7026625 0.5963098 0.8495601 0.6336658
mod_BIC 0.8872274 0.8873709 0.7871257 0.6943648 0.9847568
mod_AIC 0.8814543 0.8825207 0.7626301 0.6887197 0.9984690
mod_BE 1.0000000 0.9971566 0.9141154 0.7776125 0.8905587
mod_FS 0.9971566 1.0000000 0.9107504 0.7858050 0.8919799
mod_SS 0.9141154 0.9107504 1.0000000 0.6706237 0.7725255
mod_RF 0.7776125 0.7858050 0.6706237 1.0000000 0.6841685
mod_EN 0.8905587 0.8919799 0.7725255 0.6841685 1.0000000
Let us pick the three best algorithms (RF, SVM, and GBM).
> set.seed(32)
> models <- caretList(log(medv) ~ .,
+ data = training,
+ trControl = myControl,
+ tuneList = list(
+ mod_svm = caretModelSpec(method = "svmRadial"),
+ mod_gbm = caretModelSpec(method = "gbm"),
+ mod_RF = caretModelSpec(method = "ranger")
+ ))
Iter TrainDeviance ValidDeviance StepSize Improve
1 0.1425 -nan 0.1000 0.0218
2 0.1245 -nan 0.1000 0.0156
3 0.1086 -nan 0.1000 0.0132
4 0.0949 -nan 0.1000 0.0125
5 0.0842 -nan 0.1000 0.0091
6 0.0745 -nan 0.1000 0.0085
7 0.0671 -nan 0.1000 0.0059
8 0.0600 -nan 0.1000 0.0063
9 0.0545 -nan 0.1000 0.0042
10 0.0494 -nan 0.1000 0.0042
20 0.0275 -nan 0.1000 0.0007
40 0.0186 -nan 0.1000 -0.0000
60 0.0152 -nan 0.1000 -0.0002
80 0.0129 -nan 0.1000 0.0000
100 0.0114 -nan 0.1000 -0.0001
120 0.0101 -nan 0.1000 -0.0001
140 0.0093 -nan 0.1000 -0.0002
150 0.0088 -nan 0.1000 -0.0001
> models_pred <- lapply(models, predict, newdata = testing)
> models_pred <- as.data.frame((models_pred))
> models_pred <- exp(models_pred)
> head(models_pred)
mod_svm mod_gbm mod_RF
1 17.08445 15.05113 17.09280
2 19.05936 19.91364 19.86004
3 18.04901 18.20923 19.12218
4 20.73916 22.64887 21.41413
5 29.93608 28.57065 28.69266
6 23.87584 25.03544 23.90853
> bwplot(resamples(models))
> modelCor(resamples(models))
mod_svm mod_gbm mod_RF
mod_svm 1.0000000 0.7643689 0.8246182
mod_gbm 0.7643689 1.0000000 0.8674933
mod_RF 0.8246182 0.8674933 1.0000000
> set.seed(32)
> models <- caretList(medv ~ .,
+ data = training,
+ trControl = myControl,
+ preProc = c("center", "scale", "BoxCox"),
+ tuneList = list(
+ # mod_svm = caretModelSpec(method = "svmRadial"),
+ mod_gbm = caretModelSpec(method = "gbm"),
+ mod_RF = caretModelSpec(method = "ranger")
+ ))
Iter TrainDeviance ValidDeviance StepSize Improve
1 74.7077 -nan 0.1000 13.6639
2 64.8469 -nan 0.1000 11.2699
3 56.3836 -nan 0.1000 5.6149
4 49.5307 -nan 0.1000 6.8685
5 44.3351 -nan 0.1000 5.6392
6 39.5944 -nan 0.1000 4.3578
7 35.1894 -nan 0.1000 4.3334
8 31.7152 -nan 0.1000 3.3525
9 28.2104 -nan 0.1000 3.4845
10 25.9699 -nan 0.1000 1.6611
20 14.4979 -nan 0.1000 0.2964
40 9.5577 -nan 0.1000 0.1728
60 7.8066 -nan 0.1000 -0.0375
80 6.4782 -nan 0.1000 -0.0871
100 5.7063 -nan 0.1000 0.0004
120 5.2300 -nan 0.1000 -0.0365
140 4.7595 -nan 0.1000 -0.0308
150 4.5734 -nan 0.1000 -0.0515
> models_pred <- lapply(models, predict, newdata = testing)
> models_pred <- as.data.frame((models_pred))
> # models_pred <- exp(models_pred)
> head(models_pred)
mod_gbm mod_RF
1 14.98090 18.13443
2 20.14400 20.48507
3 18.90286 19.22108
4 23.00341 21.55955
5 28.84768 29.90032
6 24.63848 24.04461
> bwplot(resamples(models))
> modelCor(resamples(models))
mod_gbm mod_RF
mod_gbm 1.0000000 0.8567665
mod_RF 0.8567665 1.0000000
> set.seed(31)
> ensemble <- caretEnsemble(models, trControl = myControl)
> summary(ensemble)
The following models were ensembled: mod_gbm, mod_RF
They were weighted:
-1.911 0.2534 0.8331
The resulting RMSE is: 3.0836
The fit for each individual model on the RMSE is:
method RMSE RMSESD
mod_gbm 3.225121 0.9622825
mod_RF 3.094267 1.0071046
> coef(ensemble$ens_model$finalModel)[-1]
mod_gbm mod_RF
0.2534409 0.8330883
> ####
> sum(coef(ensemble$ens_model$finalModel)[-1])
[1] 1.086529
> ####
> BM <- as.matrix(models_pred)
> X <- cbind(rep(1, dim(BM)[1]), BM)
> BETA <- t(coef(ensemble$ens_model$finalModel))
> dim(BETA)
[1] 1 3
> ANS <- X %*% t(BETA)
> head(ANS)
[,1]
[1,] 16.99340
[2,] 20.26023
[3,] 18.89265
[4,] 21.88006
[5,] 30.30984
[6,] 24.36473
> MSE_test <- mean((testing$medv - ANS)^2)
> RMSE_test <- sqrt(MSE_test)
> c(MSE_test, RMSE_test)
[1] 11.35616 3.36989
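The hand-rolled matrix product above should agree with the predict() method that caretEnsemble supplies for the ensemble object; a quick check (depending on the caretEnsemble version, the result may come back as a one-column data frame, hence the coercion):

# Predictions from the ensemble object itself
ens_pred <- as.numeric(unlist(predict(ensemble, newdata = testing)))
head(ens_pred)                            # should match head(ANS) above
sqrt(mean((testing$medv - ens_pred)^2))   # should match RMSE_test above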
> url <- "https://assets.datacamp.com/production/course_3022/datasets/credit.csv"
> GC <- read.csv(url)
> library(caret)
> set.seed(321)
> trainIndex <- createDataPartition(y = GC$default,
+ p = 0.80,
+ list = FALSE,
+ times = 1)
> train80 <- GC[trainIndex, ]
> test20 <- GC[-trainIndex, ]
> dim(train80)
[1] 800 17
> dim(test20)
[1] 200 17
> library(rpart)
> library(partykit)
> credit_model <- rpart(default ~ ., data = train80, method = "class")
> plot(partykit::as.party(credit_model))
> rpart.plot::rpart.plot(x = credit_model, type = 5)
Next, we evaluate the performance of the model using the test set classification error.
> library(caret)
> class_prediction <- predict(object = credit_model,
+ newdata = test20,
+ type = "class")
>
> # Calculate the confusion matrix for the test set
> confusionMatrix(data = class_prediction,
+ reference = test20$default)
Confusion Matrix and Statistics
Reference
Prediction no yes
no 122 43
yes 18 17
Accuracy : 0.695
95% CI : (0.6261, 0.758)
No Information Rate : 0.7
P-Value [Acc > NIR] : 0.59526
Kappa : 0.1757
Mcnemar's Test P-Value : 0.00212
Sensitivity : 0.8714
Specificity : 0.2833
Pos Pred Value : 0.7394
Neg Pred Value : 0.4857
Prevalence : 0.7000
Detection Rate : 0.6100
Detection Prevalence : 0.8250
Balanced Accuracy : 0.5774
'Positive' Class : no
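Equivalently, the test set classification error is one minus the accuracy reported above; computed directly:

# Misclassification rate on the test set: (18 + 43) / 200
mean(class_prediction != test20$default)   # about 0.305, i.e., 1 - 0.695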
Next, we cross-validate a bagged tree model with caret.
> # Specify the training configuration
> ctrl <- trainControl(method = "cv", # Cross-validation
+ number = 5, # 5 folds
+ classProbs = TRUE, # For AUC
+ summaryFunction = twoClassSummary) # For AUC
>
> # Cross validate the credit model using "treebag" method;
> # Track AUC (Area under the ROC curve)
> set.seed(1) # for reproducibility
> credit_caret_model <- train(default ~ .,
+ data = train80,
+ method = "treebag",
+ metric = "ROC",
+ trControl = ctrl)
>
> # Look at the model object
> print(credit_caret_model)
Bagged CART
800 samples
16 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 640, 640, 640, 640, 640
Resampling results:
ROC Sens Spec
0.7731399 0.8589286 0.4708333
> # Inspect the contents of the model list
> names(credit_caret_model)
[1] "method" "modelInfo" "modelType" "results"
[5] "pred" "bestTune" "call" "dots"
[9] "metric" "control" "finalModel" "preProcess"
[13] "trainingData" "resample" "resampledCM" "perfNames"
[17] "maximize" "yLimits" "times" "levels"
[21] "terms" "coefnames" "contrasts" "xlevels"
> # Print the CV AUC
> credit_caret_model$results[,"ROC"]
[1] 0.7731399
> # Generate predictions on the test set
> pred <- predict(object = credit_caret_model,
+ newdata = test20,
+ type = "prob")
>
> # Compute the AUC (`actual` must be a binary (or 1/0 numeric) vector)
> Metrics::auc(actual = ifelse(test20$default == "yes", 1, 0),
+ predicted = pred[,"yes"])
[1] 0.705
Now we train a random forest on the GC data.
> # Train a Random Forest
> set.seed(1) # for reproducibility
> credit_model <- train(default ~ .,
+ data = train80,
+ method = "ranger",
+ metric = "ROC",
+ trControl = ctrl)
>
> # Print the model output
> print(credit_model)
Random Forest
800 samples
16 predictor
2 classes: 'no', 'yes'
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 640, 640, 640, 640, 640
Resampling results across tuning parameters:
mtry splitrule ROC Sens Spec
2 gini 0.7803571 0.9750000 0.1416667
2 extratrees 0.7681176 0.9785714 0.1208333
18 gini 0.7880952 0.8785714 0.4208333
18 extratrees 0.7749628 0.8785714 0.4291667
35 gini 0.7902158 0.8660714 0.4666667
35 extratrees 0.7742560 0.8678571 0.4416667
Tuning parameter 'min.node.size' was held constant at a value of 1
ROC was used to select the optimal model using the largest value.
The final values used for the model were mtry = 35, splitrule = gini
and min.node.size = 1.
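As with the bagged model, we could score this tuned random forest on the held-out data; a sketch, assuming the same test20 split and the Metrics package:

# Predicted class probabilities from the tuned ranger model
rf_prob <- predict(credit_model, newdata = test20, type = "prob")
# Test AUC (`actual` must be a binary 1/0 vector)
Metrics::auc(actual = ifelse(test20$default == "yes", 1, 0),
             predicted = rf_prob[, "yes"])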