Page 212 - Applied Statistics with R
CHAPTER 11. CATEGORICAL PREDICTORS AND INTERACTIONS
11.3 Factor Variables
So far in this chapter, we have limited our use of categorical variables to binary
categorical variables. Specifically, we have limited ourselves to dummy variables
which take a value of 0 or 1 and represent a categorical variable numerically.
We will now discuss factor variables, which are the special way that R handles
categorical variables. With factor variables, a user can simply think
about the categories of a variable, and R will create the necessary dummy
variables, with no 0/1 assignment needed from the user.
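To make this concrete, here is a minimal sketch on a small made-up data frame. The names toy, mpg, and origin, along with the values, are hypothetical and are not taken from autompg. The model.matrix() function displays the design matrix that R would build for a model formula, so we can see the 0/1 column that R creates from a factor.

```r
# Hypothetical toy data, used only for illustration (not the autompg data)
toy = data.frame(
  mpg    = c(18, 30, 22, 35),
  origin = factor(c("domestic", "foreign", "domestic", "foreign"))
)

# Design matrix R would use for the formula mpg ~ origin
X = model.matrix(mpg ~ origin, data = toy)

# Columns: an intercept, plus a dummy that is 1 for "foreign" and 0 otherwise
X
```

Here "domestic" is the first level alphabetically, so R treats it as the reference level and creates a single dummy column named originforeign.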
is.factor(autompg$domestic)
## [1] FALSE
Earlier, when we used the domestic variable, it was not a factor variable. It was
simply a numeric variable that took only two possible values: 1 for domestic
and 0 for foreign. Let’s create a new variable origin that stores the same
information, but in a different way.
autompg$origin[autompg$domestic == 1] = "domestic"
autompg$origin[autompg$domestic == 0] = "foreign"
head(autompg$origin)
## [1] "domestic" "domestic" "domestic" "domestic" "domestic" "domestic"
Now the origin variable stores "domestic" for domestic cars and "foreign"
for foreign cars.
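As an aside, the same recoding could be written in a single step with ifelse(). The following is only a sketch on a hypothetical stand-in vector named domestic, not the actual autompg column.

```r
# Hypothetical stand-in for autompg$domestic
domestic = c(1, 1, 0, 1, 0)

# ifelse() is vectorized: 1 becomes "domestic", any other non-missing value "foreign"
origin = ifelse(domestic == 1, "domestic", "foreign")

origin
is.character(origin)  # the result is still a character vector, not a factor
```

One difference in behavior: the two-step assignment only labels values that are exactly 1 or 0, while this version labels every non-missing value other than 1 as "foreign".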
is.factor(autompg$origin)
## [1] FALSE
However, this is simply a vector of character values. A vector of car models is
a character variable in R. A vector of Vehicle Identification Numbers (VINs) is
a character variable as well. But those don’t represent a short list of levels that
might influence a response variable. We will want to coerce this origin variable
to be something more: a factor variable.
autompg$origin = as.factor(autompg$origin)
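To see what this coercion changes, the following sketch applies as.factor() to a small hypothetical character vector (the names origin_chr and origin_fct are made up for illustration) and inspects the result with levels() and as.integer().

```r
# Hypothetical character vector, standing in for autompg$origin
origin_chr = c("domestic", "foreign", "domestic")
origin_fct = as.factor(origin_chr)

is.factor(origin_fct)   # TRUE
levels(origin_fct)      # "domestic" "foreign", sorted alphabetically by default
as.integer(origin_fct)  # 1 2 1, the integer codes R stores for each level
```

A factor therefore stores a short list of levels plus an integer code for each observation, which is what lets R construct dummy variables from it automatically.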
Now when we check the structure of the autompg dataset, we see that origin
is a factor variable.

