Chapter 2 Regression models: fitting them and evaluating their performance
In this chapter, you’ll fit regression models with train()
and evaluate their out-of-sample performance using cross-validation and root-mean-square error (RMSE).
Welcome to the Toolbox Video
In-sample RMSE for linear regression
RMSE is commonly calculated in-sample on your training set. What’s a potential drawback to calculating training set error?
There’s no potential drawback to calculating training set error, but you should calculate \(R^2\) instead of RMSE.
You have no idea how well your model generalizes to new data (i.e. overfitting).
You should manually inspect your model to validate its coefficients and calculate RMSE.
2.1 In-sample RMSE for linear regression on diamonds
The diamonds dataset is a classic dataset from the ggplot2 package written by Wickham et al. (2020). The dataset contains physical attributes of diamonds as well as the price they sold for. One interesting modeling challenge is predicting a diamond’s price from its attributes using something like a linear regression.
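Before fitting anything, it can help to take a quick look at the data. A minimal sketch (assuming the ggplot2 package is installed, since it provides diamonds):

library(ggplot2)

# Dimensions and first few rows of the dataset
dim(diamonds)
head(diamonds)

# Distribution of the target variable, price
summary(diamonds$price)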
Recall that to fit a linear regression, you use the lm()
function in the following format:
mod <- lm(y ~ x, data = my_data)
To make predictions using mod
on the original data, you call the predict()
function:
pred <- predict(mod, newdata = my_data)
Exercise
- Fit a linear model on the diamonds dataset predicting price using all other variables as predictors (i.e. price ~ .). Save the result to model.
library(ggplot2)
# Fit lm model: model
model <- lm(price ~ ., data = diamonds)
- Make predictions using model on the full original dataset and save the result to p.
# Predict on full data: p
p <- predict(model, newdata = diamonds)
- Compute errors using the formula errors = actual - predicted. Save the result to error.
# Compute errors: error
error <- diamonds$price - p
- Compute RMSE and print it to the console.
# Compute RMSE
RMSE <- sqrt(mean(error^2))
RMSE
[1] 1129.843
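If you find yourself computing RMSE more than once, you can wrap the formula in a small helper function. This is a hypothetical convenience function for illustration, not something provided by ggplot2 or caret:

# Hypothetical helper: root-mean-square error of actual vs. predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Reproduces the in-sample RMSE computed above
rmse(diamonds$price, p)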
Out-of-sample error measures video
Out-of-sample RMSE for linear regression
What is the advantage of using a train/test split rather than just validating your model in-sample on the training set?
It takes less time to calculate error on the test set, since it is smaller than the training set.
There is no advantage to using a test set. You can just use adjusted \(R^2\) on your training set.
It gives you an estimate of how well your model performs on new data.
2.2 Randomly order the data frame
One way you can take a train/test split of a dataset is to order the dataset randomly, then divide it into the two sets. This ensures that the training set and test set are both random samples and that any biases in the ordering of the dataset (e.g. if it had originally been ordered by price or size) are not retained in the samples you take for training and testing your models. You can think of this like shuffling a brand new deck of playing cards before dealing hands.
First, you set a random seed so that your work is reproducible and you get the same random split each time you run your script:
set.seed(42)
Next, you use the sample() function to shuffle the row indices of the diamonds dataset. You can later use these indices to reorder the dataset.
rows <- sample(nrow(diamonds))
Finally, you can use this random vector to reorder the diamonds dataset:
diamonds <- diamonds[rows, ]
Exercise
- Set the random seed to 42.
# Set seed
set.seed(42)
- Make a vector of row indices called rows.
# Shuffle row indices: rows
rows <- sample(nrow(diamonds))
- Randomly reorder the diamonds data frame.
# Randomly order data
diamonds <- diamonds[rows, ]
2.3 Try an 80/20 split
Now that your dataset is randomly ordered, you can split the first 80% of it into a training set, and the last 20% into a test set. You can do this by choosing a split point approximately 80% of the way through your data:
split <- round(nrow(mydata) * 0.80)
You can then use this point to break off the first 80% of the dataset as a training set:
mydata[1:split, ]
And then you can use that same point to determine the test set:
mydata[(split + 1):nrow(mydata), ]
Exercise
- Choose a row index to split on so that the split point is approximately 80% of the way through the diamonds dataset. Call this index split.
# Determine row to split on: split
split <- round(nrow(diamonds) * 0.80)
- Create a training set called train using that index.
# Create train
train <- diamonds[1:split, ]
- Create a test set called test using that index.
# Create test
test <- diamonds[(split + 1):nrow(diamonds), ]
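As an aside, the caret package also provides createDataPartition() for making train/test splits; rather than shuffling and slicing by position as above, it samples within groups of the outcome so the split is roughly balanced on the target. A minimal sketch, assuming caret is installed:

library(caret)

set.seed(42)
# Row indices for an 80% training split, stratified on price
in_train <- createDataPartition(diamonds$price, p = 0.80, list = FALSE)[, 1]

train <- diamonds[in_train, ]
test <- diamonds[-in_train, ]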
2.4 Predict on test set
Now that you have a randomly split training set and test set, you can use the lm()
function as you did in the first exercise to fit a model to your training set, rather than the entire dataset. Recall that you can use the formula interface to the linear regression function to fit a model with a specified target variable using all other variables in the dataset as predictors:
mod <- lm(y ~ ., data = training_data)
You can use the predict()
function to make predictions from that model on new data. The new dataset must have all of the columns from the training data, but they can be in a different order with different values. Here, rather than re-predicting on the training set, you can predict on the test set, which you did not use for training the model. This will allow you to determine the out-of-sample error for the model in the next exercise:
p <- predict(mod, newdata = new_data)
Exercise
- Fit an lm() model called model to predict price using all other variables as covariates. Be sure to use the training set, train.
# Fit lm model on train: model
model <- lm(price ~ ., data = train)
- Predict on the test set, test, using predict(). Store these values in a vector called p.
# Predict on test: p
p <- predict(model, newdata = test, type = "response")
2.5 Calculate test set RMSE
Now that you have predictions on the test set, you can use them to calculate an error metric (in this case RMSE) and see how the model performs out-of-sample, rather than in-sample as in the first exercise. First, calculate the errors between the predicted diamond prices and the actual diamond prices by subtracting the predictions from the actual values.
Once you have an error vector, calculating RMSE is as simple as squaring it, taking the mean, then taking the square root:
sqrt(mean(error^2))
Exercise
- Calculate the error between the predictions on the test set and the actual diamond prices in the test set. Call this error.
# Compute errors: error
error <- test$price - p
- Calculate RMSE using this error vector, just printing the result to the console.
# Calculate RMSE
RMSE <- sqrt(mean(error^2))
RMSE
[1] 1137.466
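To see the gap directly, you can compute the in-sample and out-of-sample RMSE side by side. A short sketch, assuming the model, train, and test objects from the exercises above are still in your workspace:

# In-sample RMSE: predictions on the data the model was trained on
p_train <- predict(model, newdata = train)
rmse_in <- sqrt(mean((train$price - p_train)^2))

# Out-of-sample RMSE: predictions on the held-out test set
p_test <- predict(model, newdata = test)
rmse_out <- sqrt(mean((test$price - p_test)^2))

c(in_sample = rmse_in, out_of_sample = rmse_out)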
Comparing out-of-sample RMSE to in-sample RMSE
Why is the test set RMSE higher than the training set RMSE?
Because you overfit the training set and the test set contains data the model hasn’t seen before.
Because you should not use a test set at all and instead just look at error on the training set.
Because the test set has a smaller sample size than the training set and thus the mean error is lower.
Cross-Validation Video
Advantage of cross-validation
What is the advantage of cross-validation over a single train/test split?
There is no advantage to cross-validation, just as there is no advantage to a single train/test split. You should be validating your models in-sample with a metric like adjusted \(R^2\).
You can pick the best test set to minimize the reported RMSE of your model.
It gives you multiple estimates of out-of-sample error, rather than a single estimate.
Note: If all of your estimates give similar outputs, you can be more certain of the model’s accuracy. If your estimates give different outputs, that tells you the model does not perform consistently and suggests a problem with it.
2.6 10-fold cross-validation
A better approach to validating models is to use multiple systematic test sets rather than a single random train/test split. Fortunately, the caret
package written by Kuhn (2020) makes this very easy to do:
model <- train(y ~ ., my_data)
caret
supports many types of cross-validation, and you can specify which type of cross-validation and the number of cross-validation folds with the trainControl()
function, which you pass to the trControl
argument in train()
:
model <- train(
  y ~ ., my_data,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = TRUE
  )
)
It is important to note that you pass the method for modeling to the main train()
function and the method for cross-validation to the trainControl()
function.
Exercise
- Load the caret package.
# Load the caret package
library(caret)
- Fit a linear regression to model price using all other variables in the diamonds dataset as predictors. Use the train() function and 10-fold cross-validation.
# Fit lm model using 10-fold CV: model
model <- train(
  price ~ ., data = diamonds,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = FALSE
  )
)
- Print the model to the console and examine the results.
# Print model to console
model
Linear Regression
53940 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 48547, 48546, 48546, 48547, 48545, 48547, ...
Resampling results:
RMSE Rsquared MAE
1130.633 0.9197711 740.5017
Tuning parameter 'intercept' was held constant at a value of TRUE
model$finalModel  # show model coefficients
Call:
lm(formula = .outcome ~ ., data = dat)
Coefficients:
(Intercept) carat cut.L cut.Q cut.C `cut^4`
5753.762 11256.978 584.457 -301.908 148.035 -20.794
color.L color.Q color.C `color^4` `color^5` `color^6`
-1952.160 -672.054 -165.283 38.195 -95.793 -48.466
clarity.L clarity.Q clarity.C `clarity^4` `clarity^5` `clarity^6`
4097.431 -1925.004 982.205 -364.918 233.563 6.883
`clarity^7` depth table x y z
90.640 -63.806 -26.474 -1008.261 9.609 -50.119
# summary(model)  # to see the full model summary
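Because caret stores the hold-out results for every fold, you can also look at how much RMSE varies across the 10 folds rather than relying only on the averaged value in the printout. A short sketch using the resample element of the fitted object:

# Per-fold hold-out metrics from the 10-fold cross-validation
model$resample

# Mean and spread of the fold-level RMSE estimates
mean(model$resample$RMSE)
sd(model$resample$RMSE)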
2.7 5-fold cross-validation
In this tutorial, you will use a wide variety of datasets to explore the full flexibility of the caret
package. Here, you will use the famous Boston
housing dataset, where the goal is to predict median home values in various Boston suburbs.
You can use exactly the same code as in the previous exercise, but change the dataset used by the model:
model <- train(
  medv ~ ., Boston,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 10,
    verboseIter = FALSE
  )
)
Next, you can reduce the number of cross-validation folds from 10 to 5 using the number argument to the trainControl() function:
trControl = trainControl(
  method = "cv", number = 5,
  verboseIter = TRUE
)
Exercise
- Load the MASS package.
# Load the MASS package
library(MASS)
- Fit an lm() model to the Boston housing dataset, such that medv is the response variable and all other variables are explanatory variables. Use 5-fold cross-validation rather than 10-fold cross-validation.
# Fit lm model using 5-fold CV: model
model <- train(
  medv ~ ., data = Boston,
  method = "lm",
  trControl = trainControl(
    method = "cv", number = 5,
    verboseIter = FALSE
  )
)
- Print the model to the console and inspect the results.
# Print model to console
model
Linear Regression
506 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 405, 405, 406, 403, 405
Resampling results:
RMSE Rsquared MAE
4.860247 0.7209221 3.398114
Tuning parameter 'intercept' was held constant at a value of TRUE
# show coefficients of model
model$finalModel
Call:
lm(formula = .outcome ~ ., data = dat)
Coefficients:
(Intercept) crim zn indus chas nox
3.646e+01 -1.080e-01 4.642e-02 2.056e-02 2.687e+00 -1.777e+01
rm age dis rad tax ptratio
3.810e+00 6.922e-04 -1.476e+00 3.060e-01 -1.233e-02 -9.527e-01
black lstat
9.312e-03 -5.248e-01
summary(model)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
2.8 \(5 \times 5\)-fold cross-validation
You can do more than just one iteration of cross-validation: repeating the entire cross-validation procedure gives you a better estimate of the test-set error. This takes longer, but gives you many more out-of-sample datasets to look at and much more precise assessments of how well the model performs.
One of the awesome things about the train()
function in caret
is how easy it is to run very different models or methods of cross-validation just by tweaking a few simple arguments to the function call. For example, you could repeat your entire cross-validation procedure 5 times for greater confidence in your estimates of the model’s out-of-sample accuracy, e.g.:
trControl = trainControl(
  method = "repeatedcv", number = 5,
  repeats = 5, verboseIter = TRUE
)
Exercise
- Re-fit the linear regression model to the Boston housing dataset. Use 5 repeats of 5-fold cross-validation.
# Fit lm model using 5 x 5-fold CV: model
model <- train(
  medv ~ ., data = Boston,
  method = "lm",
  trControl = trainControl(
    method = "repeatedcv", number = 5,
    repeats = 5, verboseIter = FALSE
  )
)
- Print the model to the console.
# Print model to console
model
Linear Regression
506 samples
13 predictor
No pre-processing
Resampling: Cross-Validated (5 fold, repeated 5 times)
Summary of sample sizes: 405, 406, 405, 403, 405, 405, ...
Resampling results:
RMSE Rsquared MAE
4.845724 0.7277269 3.402735
Tuning parameter 'intercept' was held constant at a value of TRUE
summary(model)
Call:
lm(formula = .outcome ~ ., data = dat)
Residuals:
Min 1Q Median 3Q Max
-15.595 -2.730 -0.518 1.777 26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
crim -1.080e-01 3.286e-02 -3.287 0.001087 **
zn 4.642e-02 1.373e-02 3.382 0.000778 ***
indus 2.056e-02 6.150e-02 0.334 0.738288
chas 2.687e+00 8.616e-01 3.118 0.001925 **
nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
age 6.922e-04 1.321e-02 0.052 0.958229
dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
tax -1.233e-02 3.760e-03 -3.280 0.001112 **
ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
black 9.312e-03 2.686e-03 3.467 0.000573 ***
lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
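With 5 repeats of 5-fold cross-validation you now have 25 fold-level estimates instead of 5, which makes the spread of the out-of-sample RMSE easier to judge. A small sketch inspecting the resample element of the repeated-CV fit:

# 25 hold-out estimates (5 folds x 5 repeats)
nrow(model$resample)

# Distribution of the fold-level RMSE estimates
summary(model$resample$RMSE)
sd(model$resample$RMSE)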
2.9 Making predictions on new data
Finally, the model you fit with the train()
function has the exact same predict()
interface as the linear regression models you fit earlier.
After fitting a model with train()
, you can call predict()
with new data, e.g.:
predict(my_model, newdata = new_data)
Exercise
- Use the predict() function to make predictions with model on the full Boston housing dataset. Print the result to the console.
# Predict on full Boston dataset
head(predict(model, newdata = Boston))
1 2 3 4 5 6
30.00384 25.02556 30.56760 28.60704 27.94352 25.25628
tail(predict(model, newdata = Boston))
501 502 503 504 505 506
20.46871 23.53334 22.37572 27.62743 26.12797 22.34421
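To close the loop, you can score these predictions with the same RMSE formula used throughout the chapter (keeping in mind that this is in-sample error, since every row of Boston was seen during cross-validation):

# RMSE of the cross-validated model's predictions on the full Boston data
p_full <- predict(model, newdata = Boston)
sqrt(mean((Boston$medv - p_full)^2))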