Page 215 - Applied Statistics with R

11.3. FACTOR VARIABLES


                      When R created $x_2$, the dummy variable, it used domestic cars as the reference
                      level, that is, the default value of the factor variable. So when the dummy
                      variable is 0, the model represents this reference level, which is domestic. (R
                      makes this choice because domestic comes before foreign alphabetically.)
                      So the two models have different estimated coefficients, but due to the different
                      model representations, they are actually the same model.
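The effect of the reference level can be checked directly. The following is a minimal sketch, assuming the built-in `mtcars` data as a stand-in (it is not the `autompg` data used in this chapter); the names `cars`, `trans`, `trans_manual`, `fit_auto`, and `fit_manual` are hypothetical and chosen only for illustration. It fits the same model twice with different reference levels and compares the results.

```r
# Stand-in data: the built-in mtcars data set, not the autompg data from the text.
cars <- mtcars

# am is coded 0/1 in mtcars; turn it into a labeled two-level factor.
cars$trans <- factor(cars$am, labels = c("automatic", "manual"))

# The first level listed is the reference level R will use.
levels(cars$trans)

# Model with the default reference level.
fit_auto <- lm(mpg ~ disp * trans, data = cars)

# Same model after making "manual" the reference level with relevel().
cars$trans_manual <- relevel(cars$trans, ref = "manual")
fit_manual <- lm(mpg ~ disp * trans_manual, data = cars)

# The estimated coefficients differ between the two parameterizations...
coef(fit_auto)
coef(fit_manual)

# ...but the fitted values agree, so both describe the same fitted model.
all.equal(fitted(fit_auto), fitted(fit_manual))
```

Changing the reference level only changes which group the intercept and slope describe directly; the other group's line is recovered by adding the dummy and interaction coefficients.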


                      11.3.1    Factors with More Than Two Levels


                      Let’s now consider a factor variable with more than two levels. In this dataset,
                      cyl is an example.

                      is.factor(autompg$cyl)



                      ## [1] TRUE

                      levels(autompg$cyl)


                      ## [1] "4" "6" "8"


                      Here the cyl variable has three possible levels: 4, 6, and 8. You may wonder,
                      why not simply use cyl as a numerical variable? You certainly could.

                      However, that would force the difference in average mpg between 4 and 6
                      cylinders to be the same as the difference in average mpg between 6 and 8
                      cylinders. That usually makes sense for a continuous variable, but not for a
                      discrete variable with so few possible values. In the case of this variable,
                      there is no such thing as a 7-cylinder engine or a 6.23-cylinder engine in
                      personal vehicles. For these reasons, we will simply consider cyl to be
                      categorical. This is a decision that will commonly need to be made with
                      ordinal variables. Often, when a variable has a large number of categories,
                      treating it as numerical is appropriate because, otherwise, a large number
                      of dummy variables would be needed to represent it.
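The consequence of the numerical coding can be seen in a small sketch. It assumes the built-in `mtcars` data as a stand-in (not the `autompg` data used in this chapter), where `cyl` is stored as a number and also takes the values 4, 6, and 8; the object names are hypothetical.

```r
# Stand-in data: the built-in mtcars data set, where cyl is numeric (4, 6, 8).
fit_num <- lm(mpg ~ cyl, data = mtcars)          # cyl treated as numerical
fit_fac <- lm(mpg ~ factor(cyl), data = mtcars)  # cyl treated as categorical

# Estimated average mpg at each cylinder count under both models.
new_cyl  <- data.frame(cyl = c(4, 6, 8))
pred_num <- predict(fit_num, newdata = new_cyl)
pred_fac <- predict(fit_fac, newdata = new_cyl)

# Numerical coding: the 4-to-6 and 6-to-8 differences are forced to be equal
# (each is two times the slope).
diff(pred_num)

# Categorical coding: each level gets its own mean, so the two differences
# are free to differ.
diff(pred_fac)
```

With the factor coding, the predictions are simply the sample mean of mpg within each cylinder group, at the cost of estimating one more parameter than the numerical coding.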

                      Let’s define three dummy variables related to the cyl factor variable.

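\[
v_1 =
\begin{cases}
1 & \text{4 cylinder} \\
0 & \text{not 4 cylinder}
\end{cases}
\]

\[
v_2 =
\begin{cases}
1 & \text{6 cylinder} \\
0 & \text{not 6 cylinder}
\end{cases}
\]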