Page 446 - Applied Statistics with R
CHAPTER 17. LOGISTIC REGRESSION
the data (in this case training), a model fit via glm(), and K, the number of
folds. See ?cv.glm for details.
Previously, for cross-validating RMSE in ordinary linear regression, we used
LOOCV. We certainly could do that here. However, with logistic regression,
we no longer have the clever trick that would allow us to obtain a LOOCV
metric without needing to fit the model n times. So instead, we'll use 5-fold
cross-validation. (5-fold and 10-fold are the most common in practice.) Instead
of leaving a single observation out repeatedly, we'll leave out a fifth of the data.
Essentially, we'll repeat the following process 5 times:
• Randomly set aside a fifth of the data (each observation will only be held-
out once)
• Train model on remaining data
• Evaluate misclassification rate on held-out data
The 5-fold cross-validated misclassification rate will be the average of these
misclassification rates. By only needing to refit the model 5 times, instead of
n times, we will save a lot of computation time.
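Before handing the work to cv.glm(), the procedure above can be sketched by hand. This is a minimal illustration on simulated data (the data, formula, and 0.5 cutoff here are stand-ins, not the chapter's spam example):

```r
# Manual 5-fold cross-validation of the misclassification rate,
# on simulated data standing in for spam_trn
set.seed(42)
n   <- 500
x   <- rnorm(n)
dat <- data.frame(x = x, y = rbinom(n, 1, plogis(1.5 * x)))

# Randomly assign each observation to one of 5 folds,
# so each is held out exactly once
folds <- sample(rep(1:5, length.out = n))

err <- numeric(5)
for (k in 1:5) {
  held_out <- folds == k
  # Train on the remaining four fifths of the data
  fit  <- glm(y ~ x, data = dat[!held_out, ], family = binomial)
  # Predicted probabilities on the held-out fifth
  prob <- predict(fit, newdata = dat[held_out, ], type = "response")
  # Classify with a 0.5 cutoff, record the fold's misclassification rate
  err[k] <- mean((prob > 0.5) != dat$y[held_out])
}

mean(err)  # average of the five fold-wise misclassification rates
```

The average of the five fold-wise rates is the 5-fold cross-validated misclassification rate.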
library(boot)
set.seed(1)
cv.glm(spam_trn, fit_caps, K = 5)$delta[1]
## [1] 0.2166961
cv.glm(spam_trn, fit_selected, K = 5)$delta[1]
## [1] 0.1587043
cv.glm(spam_trn, fit_additive, K = 5)$delta[1]
## [1] 0.08684467
cv.glm(spam_trn, fit_over, K = 5)$delta[1]
## [1] 0.14
Note that we're suppressing warnings again here. (Now there would be a lot
more, since we're fitting a total of 20 models.)
Based on these results, fit_caps and fit_selected are underfitting relative
to fit_additive. Similarly, fit_over is overfitting relative to fit_additive.
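One detail worth knowing: by default, cv.glm() reports delta as an average squared error between the response and the predicted probabilities. To cross-validate the misclassification rate directly, a cost function can be supplied via the cost argument. The sketch below uses simulated data in place of spam_trn and the additive fit:

```r
library(boot)

# Misclassification cost for cv.glm():
# y  = observed 0/1 response, pi = predicted probability
cost_misclass <- function(y, pi) mean(abs(y - pi) > 0.5)

# Hypothetical data standing in for spam_trn / fit_additive
set.seed(1)
dat   <- data.frame(x = rnorm(300))
dat$y <- rbinom(300, 1, plogis(2 * dat$x))
fit   <- glm(y ~ x, data = dat, family = binomial)

# 5-fold cross-validated misclassification rate
cv_err <- cv.glm(dat, fit, cost = cost_misclass, K = 5)$delta[1]
cv_err
```

With a 0/1 response, abs(y - pi) > 0.5 is TRUE exactly when the 0.5-cutoff classification disagrees with the observed class, so its mean is the misclassification rate.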

