Page 212 - Applied Statistics with R

212 CHAPTER 11. CATEGORICAL PREDICTORS AND INTERACTIONS

                                 11.3     Factor Variables


                                 So far in this chapter, we have limited our use of categorical variables to binary
                                 categorical variables. Specifically, we have limited ourselves to dummy variables
                                 which take a value of 0 or 1 and represent a categorical variable numerically.
                                 We will now discuss factor variables, which are the special way that R deals with
                                 categorical variables. With factor variables, a human user can simply think
                                 about the categories of a variable, and R will take care of the necessary dummy
                                 variables without any 0/1 assignment being done by the user.
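To make "R will take care of the necessary dummy variables" concrete, here is a minimal sketch on a small made-up data frame. The name `toy` and its values are assumptions for illustration only and are not part of the autompg data. It relies on R's default treatment contrasts, under which the first level (alphabetically) becomes the reference level.

```r
# Hypothetical toy data, not the autompg dataset
toy = data.frame(
  mpg    = c(18, 15, 31, 27),
  origin = factor(c("domestic", "domestic", "foreign", "foreign"))
)

# model.matrix() returns the design matrix R builds for a formula.
# With default treatment contrasts, the factor origin is expanded into
# a single 0/1 column named "originforeign"; "domestic" is the reference.
X = model.matrix(mpg ~ origin, data = toy)
colnames(X)           # "(Intercept)" "originforeign"
X[, "originforeign"]  # 0 0 1 1
```

The user never writes the 0/1 column by hand; it is derived from the factor's levels whenever the factor appears in a model formula.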

                                 is.factor(autompg$domestic)


                                 ## [1] FALSE

                                 Earlier when we used the domestic variable, it was not a factor variable. It was
                                 simply a numerical variable that only took two possible values, 1 for domestic,
                                 and 0 for foreign. Let’s create a new variable origin that stores the same
                                 information, but in a different way.

                                 autompg$origin[autompg$domestic == 1] = "domestic"
                                 autompg$origin[autompg$domestic == 0] = "foreign"
                                 head(autompg$origin)


                                 ## [1] "domestic" "domestic" "domestic" "domestic" "domestic" "domestic"

                                 Now the origin variable stores "domestic" for domestic cars and "foreign"
                                 for foreign cars.

                                 is.factor(autompg$origin)


                                 ## [1] FALSE


                                 However, this is simply a vector of character values. A vector of car models is
                                 a character variable in R. A vector of Vehicle Identification Numbers (VINs) is
                                 a character variable as well. But those don’t represent a short list of levels that
                                 might influence a response variable. We will want to coerce this origin variable
                                 to be something more: a factor variable.
                                 autompg$origin = as.factor(autompg$origin)
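As a rough illustration of what this coercion does, the sketch below applies `as.factor()` to a short made-up character vector. The names `origin_chr` and `origin_fct` are hypothetical stand-ins, not variables from the autompg data.

```r
# Hypothetical character vector standing in for autompg$origin
origin_chr = c("domestic", "foreign", "domestic")

# Coerce to a factor: R records the distinct values as levels
origin_fct = as.factor(origin_chr)

is.factor(origin_chr)   # FALSE, still a plain character vector
is.factor(origin_fct)   # TRUE
levels(origin_fct)      # "domestic" "foreign" (sorted order)
as.integer(origin_fct)  # 1 2 1, the integer codes behind the labels
```

A factor therefore stores a short list of levels plus an integer code per observation, which is what lets R treat the variable as categorical rather than as arbitrary text.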


                                 Now when we check the structure of the autompg dataset, we see that origin
                                 is a factor variable.