Chapter 2 Regression models: fitting them and evaluating their performance
In this chapter of the course, you’ll fit regression models with train() and evaluate their out-of-sample performance using cross-validation and root-mean-square error (RMSE).
Welcome to the Toolbox Video
In-sample RMSE for linear regression
RMSE is commonly calculated in-sample on your training set. What’s a potential drawback to calculating training set error?
There’s no potential drawback to calculating training set error, but you should calculate \(R^2\) instead of RMSE.
You have no idea how well your model generalizes to new data (i.e. overfitting).
You should manually inspect your model to validate its coefficients and calculate RMSE.
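To make the drawback concrete, here is a minimal sketch (not part of the exercise) showing how a deliberately over-flexible model can look accurate in-sample yet do much worse on held-out rows. The small_train/holdout names and the polynomial degree are arbitrary choices for illustration:
library(ggplot2)
set.seed(1)
idx <- sample(nrow(diamonds), 100)   # deliberately small training sample
small_train <- diamonds[idx, ]
holdout     <- diamonds[-idx, ]
# An over-flexible polynomial in carat fits the 100 training rows closely
fit <- lm(price ~ poly(carat, 8), data = small_train)
in_rmse  <- sqrt(mean((small_train$price - predict(fit, newdata = small_train))^2))
out_rmse <- sqrt(mean((holdout$price - predict(fit, newdata = holdout))^2))
in_rmse   # flattering in-sample error
out_rmse  # typically much worse on rows the model never saw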
2.1 In-sample RMSE for linear regression on diamonds
diamonds is a classic dataset from the ggplot2 package written by Wickham et al. (2020). The dataset contains physical attributes of diamonds as well as the price they sold for. One interesting modeling challenge is predicting diamond prices from their other attributes using something like a linear regression.
Recall that to fit a linear regression, you use the lm() function in the following format:
mod <- lm(y ~ x, data = my_data)
To make predictions using mod on the original data, you call the predict() function:
pred <- predict(mod, newdata = my_data)
Exercise
- Fit a linear model on the diamonds dataset predicting price using all other variables as predictors (i.e. price ~ .). Save the result to model.
library(ggplot2)
# Fit lm model: model
model <- lm(price ~ ., data = diamonds)
- Make predictions using model on the full original dataset and save the result to p.
# Predict on full data: p
p <- predict(model, newdata = diamonds)
- Compute errors using the formula errors = actual - predicted. Save the result to error.
# Compute errors: error
error <- diamonds$price - p
- Compute RMSE and print it to the console.
# Compute RMSE
RMSE <- sqrt(mean(error^2))
RMSE
[1] 1129.843
Out-of-sample error measures video
Out-of-sample RMSE for linear regression
What is the advantage of using a train/test split rather than just validating your model in-sample on the training set?
It takes less time to calculate error on the test set, since it is smaller than the training set.
There is no advantage to using a test set. You can just use adjusted \(R^2\) on your training set.
It gives you an estimate of how well your model performs on new data.
2.2 Randomly order the data frame
One way you can take a train/test split of a dataset is to order the dataset randomly, then divide it into the two sets. This ensures that the training set and test set are both random samples and that any biases in the ordering of the dataset (e.g. if it had originally been ordered by price or size) are not retained in the samples you take for training and testing your models. You can think of this like shuffling a brand new deck of playing cards before dealing hands.
First, you set a random seed so that your work is reproducible and you get the same random split each time you run your script:
set.seed(42)
Next, you use the sample() function to shuffle the row indices of the diamonds dataset. You can later use these indices to reorder the dataset.
rows <- sample(nrow(diamonds))
Finally, you can use this random vector to reorder the diamonds dataset:
diamonds <- diamonds[rows, ]
Exercise
- Set the random seed to 42.
# Set seed
set.seed(42)
- Make a vector of row indices called rows.
# Shuffle row indices: rows
rows <- sample(nrow(diamonds))
- Randomly reorder the diamonds data frame.
# Randomly order data
diamonds <- diamonds[rows, ]
2.3 Try an 80/20 split
Now that your dataset is randomly ordered, you can split the first 80% of it into a training set, and the last 20% into a test set. You can do this by choosing a split point approximately 80% of the way through your data:
split <- round(nrow(mydata) * 0.80)
You can then use this point to break off the first 80% of the dataset as a training set:
mydata[1:split, ]
And then you can use that same point to determine the test set:
mydata[(split + 1):nrow(mydata), ]
Exercise
- Choose a row index to split on so that the split point is approximately 80% of the way through the diamonds dataset. Call this index split.
# Determine row to split on: split
split <- round(nrow(diamonds) * 0.80)
- Create a training set called train using that index.
# Create train
train <- diamonds[1:split, ]
- Create a test set called test using that index.
# Create test
test <- diamonds[(split + 1):nrow(diamonds), ]
2.4 Predict on test set
Now that you have a randomly split training set and test set, you can use the lm() function as you did in the first exercise to fit a model to your training set, rather than the entire dataset. Recall that you can use the formula interface to the linear regression function to fit a model with a specified target variable using all other variables in the dataset as predictors:
mod <- lm(y ~ ., data = training_data)
You can use the predict() function to make predictions from that model on new data. The new dataset must have all of the columns from the training data, but they can be in a different order with different values. Here, rather than re-predicting on the training set, you can predict on the test set, which you did not use for training the model. This will allow you to determine the out-of-sample error for the model in the next exercise:
p <- predict(model, newdata = new_data)
Exercise
- Fit an lm() model called model to predict price using all other variables as covariates. Be sure to use the training set, train.
# Fit lm model on train: model
model <- lm(price ~ ., data = train)
- Predict on the test set, test, using predict(). Store these values in a vector called p.
# Predict on test: p
p <- predict(model, newdata = test, type = "response")
2.5 Calculate test set RMSE
Now that you have predictions on the test set, you can use these predictions to calculate an error metric (in this case RMSE) on the test set and see how the model performs out-of-sample, rather than in-sample as you did in the first exercise. First, calculate the errors by subtracting the predictions from the actual diamond prices.
Once you have an error vector, calculating RMSE is as simple as squaring it, taking the mean, then taking the square root:
sqrt(mean(error^2))
Exercise
- Calculate the error between the predictions on the test set and the actual diamond prices in the test set. Call this error.
# Compute errors: error
error <- test$price - p
- Calculate RMSE using this error vector and print the result to the console.
# Calculate RMSE
RMSE <- sqrt(mean(error^2))
RMSE
[1] 1137.466
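If you find yourself computing RMSE repeatedly, a small helper keeps the formula in one place. The rmse() function below is a hypothetical utility (not from any package), shown comparing the in-sample and out-of-sample errors side by side using the train and test sets from the previous exercises:
# Hypothetical helper wrapping the RMSE formula
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}
rmse(train$price, predict(model, newdata = train))  # in-sample RMSE
rmse(test$price, predict(model, newdata = test))    # out-of-sample RMSE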
Comparing out-of-sample RMSE to in-sample RMSE
Why is the test set RMSE higher than the training set RMSE?
Because you overfit the training set and the test set contains data the model hasn’t seen before.
Because you should not use a test set at all and instead just look at error on the training set.
Because the test set has a smaller sample size than the training set and thus the mean error is lower.
Cross Validation Video
Advantage of cross-validation
What is the advantage of cross-validation over a single train/test split?
There is no advantage to cross-validation, just as there is no advantage to a single train/test split. You should be validating your models in-sample with a metric like adjusted \(R^2\).
You can pick the best test set to minimize the reported RMSE of your model.
It gives you multiple estimates of out-of-sample error, rather than a single estimate.
Note: If all of your estimates give similar outputs, you can be more certain of the model’s accuracy. If your estimates give different outputs, that tells you the model does not perform consistently and suggests a problem with it.
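This note becomes concrete once you have fit a model with caret's train() (introduced in the next exercise): the fitted object stores the per-fold estimates, so you can check how consistent they are. A minimal sketch, assuming model is a train() object fit with cross-validation:
# One row of out-of-sample metrics per fold
model$resample
# A small spread across folds suggests the model performs consistently
sd(model$resample$RMSE)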
2.6 10-fold cross-validation
A better approach to validating models is to use multiple systematic test sets rather than a single random train/test split. Fortunately, the caret package written by Kuhn (2020) makes this very easy to do:
model <- train(y ~ ., my_data)
caret supports many types of cross-validation, and you can specify which type of cross-validation and the number of cross-validation folds with the trainControl() function, which you pass to the trControl argument in train():
model <- train(
y ~ ., my_data,
method = "lm",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = TRUE
)
)
It is important to note that you pass the method for modeling to the main train() function and the method for cross-validation to the trainControl() function.
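One convenient pattern (an optional style choice, not something caret requires) is to save the trainControl() object once and reuse it across several train() calls; myControl below is a hypothetical name:
# Reusable cross-validation settings
myControl <- trainControl(method = "cv", number = 10, verboseIter = FALSE)
# The same settings can then be passed to any number of models
model <- train(y ~ ., my_data, method = "lm", trControl = myControl)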
Exercise
- Load the
caretpackage.
# Load the caret package
library(caret)
- Fit a linear regression to model price using all other variables in the diamonds dataset as predictors. Use the train() function and 10-fold cross-validation.
# Fit lm model using 10-fold CV: model
model <- train(
price ~ ., data = diamonds,
method = "lm",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = FALSE
)
)
- Print the model to the console and examine the results.
# Print model to console
model
Linear Regression
53940 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 48547, 48546, 48546, 48547, 48545, 48547, ...
Resampling results:
RMSE Rsquared MAE
1130.633 0.9197711 740.5017
Tuning parameter 'intercept' was held constant at a value of TRUE
model$finalModel # show model coefficients
Call:
lm(formula = .outcome ~ ., data = dat)
Coefficients:
(Intercept) carat cut.L cut.Q cut.C `cut^4`
5753.762 11256.978 584.457 -301.908 148.035 -20.794
color.L color.Q color.C `color^4` `color^5` `color^6`
-1952.160 -672.054 -165.283 38.195 -95.793 -48.466
clarity.L clarity.Q clarity.C `clarity^4` `clarity^5` `clarity^6`
4097.431 -1925.004 982.205 -364.918 233.563 6.883
`clarity^7` depth table x y z
90.640 -63.806 -26.474 -1008.261 9.609 -50.119
# summary(model) # to see all
2.7 5-fold cross-validation
In this tutorial, you will use a wide variety of datasets to explore the full flexibility of the caret package. Here, you will use the famous Boston housing dataset, where the goal is to predict median home values in various Boston suburbs.
You can use exactly the same code as in the previous exercise, but change the dataset used by the model:
model <- train(
medv ~ ., Boston,
method = "lm",
trControl = trainControl(
method = "cv", number = 10,
verboseIter = FALSE
)
)
Next, you can reduce the number of cross-validation folds from 10 to 5 using the number argument to the trainControl() function:
trControl = trainControl(
method = "cv", number = 5,
verboseIter = TRUE
)
Exercise
- Load the MASS package.
# Load the MASS package
library(MASS)
- Fit an lm() model to the Boston housing dataset, such that medv is the response variable and all other variables are explanatory variables. Use 5-fold cross-validation rather than 10-fold cross-validation.
# Fit lm model using 5-fold CV: model
model <- train(
medv ~ ., data = Boston,
method = "lm",
trControl = trainControl(
method = "cv", number = 5,
verboseIter = FALSE
)
)
- Print the model to the console and inspect the results.
# Print model to console
model
Linear Regression
506 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 405, 405, 406, 403, 405
Resampling results:
RMSE Rsquared MAE
4.860247 0.7209221 3.398114
Tuning parameter 'intercept' was held constant at a value of TRUE
# show coefficients of model
model$finalModel
Call:
lm(formula = .outcome ~ ., data = dat)
Coefficients:
(Intercept) crim zn indus chas nox
3.646e+01 -1.080e-01 4.642e-02 2.056e-02 2.687e+00 -1.777e+01
rm age dis rad tax ptratio
3.810e+00 6.922e-04 -1.476e+00 3.060e-01 -1.233e-02 -9.527e-01
black lstat
9.312e-03 -5.248e-01
summary(model)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
2.8 \(5 \times 5\)-fold cross-validation
You can do more than just one round of cross-validation: repeating the entire cross-validation procedure gives you a better estimate of the test-set error. This takes longer, but gives you many more out-of-sample datasets to look at and a much more precise assessment of how well the model performs.
One of the awesome things about the train() function in caret is how easy it is to run very different models or methods of cross-validation just by tweaking a few simple arguments to the function call. For example, you could repeat your entire cross-validation procedure 5 times for greater confidence in your estimates of the model’s out-of-sample accuracy, e.g.:
trControl = trainControl(
method = "repeatedcv", number = 5,
repeats = 5, verboseIter = TRUE
)
Exercise
- Re-fit the linear regression model to the Boston housing dataset. Use 5 repeats of 5-fold cross-validation.
# Fit lm model using 5 x 5-fold CV: model
model <- train(
medv ~ ., data = Boston,
method = "lm",
trControl = trainControl(
method = "repeatedcv", number = 5,
repeats = 5, verboseIter = FALSE
)
)
- Print the model to the console.
# Print model to console
model
Linear Regression
506 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 405, 406, 405, 403, 405, 405, ...
Resampling results:
RMSE Rsquared MAE
4.845724 0.7277269 3.402735
Tuning parameter 'intercept' was held constant at a value of TRUE
summary(model)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
2.9 Making predictions on new data
Finally, the model you fit with the train() function has the exact same predict() interface as the linear regression models you fit earlier.
After fitting a model with train(), you can call predict() with new data, e.g.:
predict(my_model, newdata = new_data)
Exercise
- Use the predict() function to make predictions with model on the full Boston housing dataset. Print the result to the console.
# Predict on full Boston dataset
head(predict(model, newdata = Boston))
       1        2        3        4        5        6 
30.00384 25.02556 30.56760 28.60704 27.94352 25.25628
tail(predict(model, newdata = Boston))
     501      502      503      504      505      506 
20.46871 23.53334 22.37572 27.62743 26.12797 22.34421
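As a quick sanity check (not part of the exercise), you could compare these predictions to the observed medv values. Keep in mind that because the model was trained on all of Boston, this is an in-sample error; the cross-validated RMSE printed by train() above is the better guide to out-of-sample performance. A small sketch:
# In-sample RMSE on the full Boston data
p_boston <- predict(model, newdata = Boston)
sqrt(mean((Boston$medv - p_boston)^2))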