Page 447 - Applied Statistics with R

P. 447

17.4. CLASSIFICATION 447

Thus, based on these results, we prefer the classifier created based on the logistic
regression fit and stored in fit_additive.

Going forward, to evaluate and report on the eﬀicacy of this classifier, we’ll use
the test dataset. We’re going to take the position that the test data set should
never be used in training, which is why we used cross-validation within the
training dataset to select a model. Even though cross-validation uses hold-out
sets to generate metrics, at some point all of the data is used for training.
To quickly summarize how well this classifier works, we’ll create a confusion
matrix.

Figure 17.1: Confusion Matrix

It further breaks down the classification errors into false positives and false
negatives.

make_conf_mat = function(predicted, actual) {
table(predicted = predicted, actual = actual)
}

Let’s explicitly store the predicted values of our classifier on the test dataset.

spam_tst_pred = ifelse(predict(fit_additive, spam_tst) > 0,
"spam",
"nonspam")
spam_tst_pred = ifelse(predict(fit_additive, spam_tst, type = "response") > 0.5,
"spam",
"nonspam")

The previous two lines of code produce the same output, that is the same
predictions, since

(x) = 0 ⟺ (x) = 0.5
Now we’ll use these predictions to create a confusion matrix.

442 443 444 445 446 447 448 449 450 451 452