Page 447 - Applied Statistics with R

17.4. CLASSIFICATION                                              447


                      Thus, based on these results, we prefer the classifier created based on the logistic
                      regression fit and stored in fit_additive.

                      Going forward, to evaluate and report on the efficacy of this classifier, we’ll use
                      the test dataset. We take the position that the test dataset should
                      never be used in training, which is why we used cross-validation within the
                      training dataset to select a model. Even though cross-validation uses hold-out
                      sets to generate metrics, every observation it is given is used for training at some point.
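
As a rough illustration of that last point, the following is a minimal sketch of K-fold cross-validation written out by hand. It uses simulated data, and every name in it (`sim_trn`, `fold`, `fold_err`, `cv_err`) is hypothetical rather than taken from the spam analysis. It is a stand-in for whatever cross-validation routine was used to select the model, not a reproduction of it.

```r
# Hypothetical simulated data standing in for a training set (not the spam data).
set.seed(42)
n = 200
sim_trn = data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim_trn$y = rbinom(n, size = 1, prob = plogis(-0.5 + 1.5 * sim_trn$x1))

# Randomly assign each training observation to one of k folds.
k = 5
fold = sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, then measure the
# misclassification rate on the held-out fold.
fold_err = sapply(1:k, function(i) {
  fit = glm(y ~ x1 + x2, data = sim_trn[fold != i, ], family = binomial)
  held_out = sim_trn[fold == i, ]
  pred = ifelse(predict(fit, held_out, type = "response") > 0.5, 1, 0)
  mean(pred != held_out$y)
})

# Cross-validated misclassification rate.
cv_err = mean(fold_err)
```

Each observation is held out exactly once but is part of the fitting data in the other `k - 1` folds, which is why data reserved for final testing should not be passed to a procedure like this.
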
                      To quickly summarize how well this classifier works, we’ll create a confusion
                      matrix.













                                           Figure 17.1: Confusion Matrix

                      A confusion matrix further breaks down the classification errors into false
                      positives and false negatives.

                      make_conf_mat = function(predicted, actual) {
                        table(predicted = predicted, actual = actual)
                      }
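
To make the layout of the resulting table concrete, here is a small sketch that applies the function to toy labels. The vectors `toy_actual` and `toy_predicted` are made up for illustration and are not drawn from the spam data. Treating "spam" as the positive class is also an assumption of this sketch.

```r
make_conf_mat = function(predicted, actual) {
  table(predicted = predicted, actual = actual)
}

# Hypothetical toy labels, not taken from the spam data.
toy_actual    = c("spam", "spam", "nonspam", "nonspam", "nonspam", "spam")
toy_predicted = c("spam", "nonspam", "nonspam", "spam", "nonspam", "spam")

toy_conf_mat = make_conf_mat(predicted = toy_predicted, actual = toy_actual)

# Rows index the predicted class, columns index the actual class.
# Treating "spam" as the positive class (an assumption of this sketch):
false_pos = toy_conf_mat["spam", "nonspam"]  # predicted spam, actually nonspam
false_neg = toy_conf_mat["nonspam", "spam"]  # predicted nonspam, actually spam

# Correct classifications lie on the diagonal.
accuracy = sum(diag(toy_conf_mat)) / sum(toy_conf_mat)
```

Indexing the table by name rather than by position avoids having to remember which row or column comes first.
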


                      Let’s explicitly store the predicted values of our classifier on the test dataset.

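                      # Classify on the link (log-odds) scale, which predict() uses by default for a glm fit.
                      spam_tst_pred = ifelse(predict(fit_additive, spam_tst) > 0,
                                              "spam",
                                              "nonspam")
                      # Equivalently, classify on the response (probability) scale.
                      spam_tst_pred = ifelse(predict(fit_additive, spam_tst, type = "response") > 0.5,
                                              "spam",
                                              "nonspam")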


                      The previous two statements produce the same predictions, since


                                                η(x) = 0 ⟺ p(x) = 0.5
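
The equivalence can be checked numerically. In logistic regression the estimated probability is the inverse logit of the linear predictor (the log-odds), and the inverse logit equals 0.5 exactly when its argument is 0. The sketch below uses simulated data. The names `sim`, `fit_sim`, `eta_hat`, and `p_hat` are hypothetical and are not part of the spam analysis.

```r
# Hypothetical simulated data; the variable names are illustrative only.
set.seed(1)
sim = data.frame(x = rnorm(100))
sim$y = rbinom(100, size = 1, prob = plogis(0.5 + 2 * sim$x))
fit_sim = glm(y ~ x, data = sim, family = binomial)

eta_hat = predict(fit_sim, sim)                     # default type = "link": log-odds
p_hat   = predict(fit_sim, sim, type = "response")  # estimated probabilities

# The inverse logit, plogis(), maps the link scale to the response scale.
same_scale = isTRUE(all.equal(unname(p_hat), unname(plogis(eta_hat))))

# Thresholding at 0 on the link scale should match thresholding at 0.5
# on the response scale.
pred_link = ifelse(eta_hat > 0, "spam", "nonspam")
pred_resp = ifelse(p_hat > 0.5, "spam", "nonspam")
same_pred = identical(pred_link, pred_resp)
```

Both `same_scale` and `same_pred` are expected to be `TRUE` here. Agreement could in principle fail for an observation sitting numerically on the boundary, but that is very unlikely with continuous predictors.
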
                      Now we’ll use these predictions to create a confusion matrix.