The cv.glm() function takes as input the data (in this case the training data, spam_trn), a model fit via glm(), and K, the number of folds. See ?cv.glm for details.

Previously, for cross-validating RMSE in ordinary linear regression, we used LOOCV. We certainly could do that here. However, with logistic regression, we no longer have the clever trick that would allow us to obtain a LOOCV metric without needing to fit the model n times. So instead, we'll use 5-fold cross-validation. (5-fold and 10-fold are the most common in practice.) Instead of leaving a single observation out repeatedly, we'll leave out a fifth of the data. Essentially we'll repeat the following process 5 times:

   • Randomly set aside a fifth of the data (each observation will only be held out once)
   • Train model on remaining data
   • Evaluate misclassification rate on held-out data

The 5-fold cross-validated misclassification rate will be the average of these misclassification rates. By only needing to refit the model 5 times, instead of n times, we will save a lot of computation time.
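
To make this procedure concrete before handing it to cv.glm(), here is a minimal sketch of 5-fold cross-validation written out by hand. It assumes spam_trn and fit_caps as defined earlier in the chapter, a response factor with levels "nonspam" and "spam" as in the kernlab spam data, and a 0.5 probability cutoff; the fold and miscl names are ours, not part of cv.glm().

set.seed(42)
# randomly assign each observation to one of five folds
fold  = sample(rep(1:5, length.out = nrow(spam_trn)))
miscl = rep(0, 5)
for (k in 1:5) {
  held_out = spam_trn[fold == k, ]
  # refit the model on the remaining four fifths of the data
  fit = update(fit_caps, data = spam_trn[fold != k, ])
  # classify the held-out fifth using a 0.5 probability cutoff
  pred = ifelse(predict(fit, held_out, type = "response") > 0.5,
                "spam", "nonspam")
  miscl[k] = mean(pred != held_out$type)
}
mean(miscl) # 5-fold cross-validated misclassification rate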

library(boot)
set.seed(1)
cv.glm(spam_trn, fit_caps, K = 5)$delta[1]


## [1] 0.2166961

cv.glm(spam_trn, fit_selected, K = 5)$delta[1]

## [1] 0.1587043

cv.glm(spam_trn, fit_additive, K = 5)$delta[1]

## [1] 0.08684467

cv.glm(spam_trn, fit_over, K = 5)$delta[1]

## [1] 0.14
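
One detail worth flagging: by default, cv.glm() measures prediction error with average squared error applied to the fitted probabilities. To target the misclassification rate at a 0.5 cutoff explicitly, a cost function can be supplied as the third argument; the function below is the example given in ?cv.glm.

# cost from ?cv.glm: proportion of held-out responses on the wrong
# side of a 0.5 probability cutoff, i.e. the misclassification rate
cost = function(r, pi = 0) mean(abs(r - pi) > 0.5)
cv.glm(spam_trn, fit_additive, cost, K = 5)$delta[1]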


                                 Note that we’re suppressing warnings again here. (Now there would be a lot
                                 more, since were fitting a total of 20 models.)
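
In the document source this is presumably done with a chunk option such as warning = FALSE; interactively, the same effect could be achieved by wrapping each call.

# silence the warnings produced while refitting, for a single call
suppressWarnings(cv.glm(spam_trn, fit_over, K = 5)$delta[1])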

Based on these results, fit_caps and fit_selected are underfitting relative to fit_additive. Similarly, fit_over is overfitting relative to fit_additive.