The cv.glm() function takes as input the data (in this case the training data, spam_trn), a model fit via glm(), and K, the number of folds. See ?cv.glm for details.

Previously, for cross-validating RMSE in ordinary linear regression, we used LOOCV. We certainly could do that here. However, with logistic regression, we no longer have the clever trick that would allow us to obtain a LOOCV metric without needing to fit the model n times. So instead, we'll use 5-fold cross-validation. (5-fold and 10-fold are the most common in practice.) Instead of leaving a single observation out repeatedly, we'll leave out a fifth of the data. Essentially we'll repeat the following process 5 times:

   • Randomly set aside a fifth of the data (each observation will only be held out once)
   • Train model on remaining data
   • Evaluate misclassification rate on held-out data

The 5-fold cross-validated misclassification rate will be the average of these misclassification rates. By only needing to refit the model 5 times, instead of n times, we will save a lot of computation time.
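
To make this procedure concrete before handing it to cv.glm(), here is a minimal sketch of 5-fold cross-validation written out by hand. It assumes spam_trn and fit_caps as defined earlier in the chapter, a response factor with levels "nonspam" and "spam" as in the kernlab spam data, and a 0.5 probability cutoff; the fold and miscl names are ours, not part of cv.glm().

set.seed(42)
# randomly assign each observation to one of five folds
fold  = sample(rep(1:5, length.out = nrow(spam_trn)))
miscl = rep(0, 5)
for (k in 1:5) {
  held_out = spam_trn[fold == k, ]
  # refit the model on the remaining four fifths of the data
  fit = update(fit_caps, data = spam_trn[fold != k, ])
  # classify the held-out fifth using a 0.5 probability cutoff
  pred = ifelse(predict(fit, held_out, type = "response") > 0.5,
                "spam", "nonspam")
  miscl[k] = mean(pred != held_out$type)
}
mean(miscl) # 5-fold cross-validated misclassification rate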

library(boot)
set.seed(1)
cv.glm(spam_trn, fit_caps, K = 5)$delta[1]


## [1] 0.2166961

cv.glm(spam_trn, fit_selected, K = 5)$delta[1]

## [1] 0.1587043

cv.glm(spam_trn, fit_additive, K = 5)$delta[1]

## [1] 0.08684467

cv.glm(spam_trn, fit_over, K = 5)$delta[1]

## [1] 0.14
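
One detail worth flagging: by default, cv.glm() measures prediction error with average squared error applied to the fitted probabilities. To target the misclassification rate at a 0.5 cutoff explicitly, a cost function can be supplied as the third argument; the function below is the example given in ?cv.glm.

# cost from ?cv.glm: proportion of held-out responses on the wrong
# side of a 0.5 probability cutoff, i.e. the misclassification rate
cost = function(r, pi = 0) mean(abs(r - pi) > 0.5)
cv.glm(spam_trn, fit_additive, cost, K = 5)$delta[1]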


                                 Note that we’re suppressing warnings again here. (Now there would be a lot
                                 more, since were fitting a total of 20 models.)
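
In the document source this is presumably done with a chunk option such as warning = FALSE; interactively, the same effect could be achieved by wrapping each call.

# silence the warnings produced while refitting, for a single call
suppressWarnings(cv.glm(spam_trn, fit_over, K = 5)$delta[1])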

Based on these results, fit_caps and fit_selected are underfitting relative to fit_additive. Similarly, fit_over is overfitting relative to fit_additive.