Page 389 - Applied Statistics with R
or 10-fold cross-validation are much more popular. For example, in 5-fold cross-validation, the model is fit 5 times, each time leaving out one-fifth of the data, then predicting on those held-out values. We'll leave an in-depth examination of cross-validation to a machine learning course, and simply use LOOCV here.
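To make the idea concrete, here is a minimal sketch of 5-fold cross-validation RMSE for a simple linear model. The data are simulated purely for illustration; they are not the poly_data used elsewhere in this chapter.

```r
# A minimal sketch of 5-fold cross-validation RMSE.
# sim_data is simulated for illustration only.
set.seed(42)
n = 100
sim_data = data.frame(x = runif(n, 0, 10))
sim_data$y = 2 + 3 * sim_data$x + rnorm(n, sd = 2)

k = 5
folds = sample(rep(1:k, length.out = n))  # randomly assign each row to a fold

fold_rmse = numeric(k)
for (i in 1:k) {
  train = sim_data[folds != i, ]  # fit on all folds except the i-th
  test  = sim_data[folds == i, ]  # predict on the held-out fold
  fit   = lm(y ~ x, data = train)
  pred  = predict(fit, newdata = test)
  fold_rmse[i] = sqrt(mean((test$y - pred) ^ 2))
}
mean(fold_rmse)
```

Each observation is held out exactly once, so the model is fit only 5 times instead of n times as in LOOCV.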
Let's calculate the LOOCV RMSE for both models, then discuss why we want to do so. We first write a function that calculates the LOOCV RMSE as defined, using the shortcut formula for linear models.
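As a reminder, the shortcut formula for linear models (stated earlier in the chapter) avoids refitting the model n times:

```latex
\text{RMSE}_{\text{LOOCV}} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \frac{e_i}{1 - h_i} \right)^2}
```

Here e_i are the residuals and h_i are the leverages, which R reports via resid() and hatvalues().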
calc_loocv_rmse = function(model) {
sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}
Then calculate the metric for both models.
calc_loocv_rmse(fit_quad)
## [1] 23.57189
calc_loocv_rmse(fit_big)
## [1] 1334.357
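For linear models, the shortcut formula gives exactly the value obtained by actually refitting the model n times. A quick check on simulated data (again, not the text's poly_data) makes this concrete:

```r
# Check that the leverage-based shortcut matches brute-force LOOCV.
# sim_data is simulated for illustration only.
calc_loocv_rmse = function(model) {
  sqrt(mean((resid(model) / (1 - hatvalues(model))) ^ 2))
}

set.seed(1)
n = 30
sim_data = data.frame(x = runif(n, 0, 10))
sim_data$y = 1 + 2 * sim_data$x + rnorm(n)

fit = lm(y ~ x, data = sim_data)

# Brute force: refit n times, each time predicting the single held-out point.
errors = sapply(1:n, function(i) {
  fit_i = lm(y ~ x, data = sim_data[-i, ])
  sim_data$y[i] - predict(fit_i, newdata = sim_data[i, ])
})
brute_force = sqrt(mean(errors ^ 2))

all.equal(calc_loocv_rmse(fit), brute_force)  # TRUE: the two agree
```

The shortcut only requires one fit, which is why LOOCV is cheap for linear models even when it is expensive in general.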
Now we see that the quadratic model has a much smaller LOOCV RMSE, so we would prefer the quadratic model. This is because the large model has severely overfit the data: leaving out even a single data point and refitting the large model produces a fit very different from the fit using all of the data. For example, let's leave out the third data point, fit both models, then plot the results.
fit_quad_removed = lm(y ~ poly(x, degree = 2), data = poly_data[-3, ])
fit_big_removed = lm(y ~ poly(x, degree = 8), data = poly_data[-3, ])
plot(y ~ x, data = poly_data, ylim = c(-100, 400), cex = 2, pch = 20)
xplot = seq(0, 10, by = 0.1)
lines(xplot, predict(fit_quad_removed, newdata = data.frame(x = xplot)),
      col = "dodgerblue", lwd = 2, lty = 1)
lines(xplot, predict(fit_big_removed, newdata = data.frame(x = xplot)),
      col = "darkorange", lwd = 2, lty = 2)

