Page 199 - Applied Statistics with R

11.1. DUMMY VARIABLES








                      [Figure: scatterplot of mpg (vertical axis) against hp (horizontal axis), with a legend distinguishing Automatic and Manual transmissions.]


                      We should notice a pattern here. The red, manual observations largely fall above
                      the line, while the black, automatic observations are mostly below the line. This
                      means our model underestimates the fuel efficiency of manual transmissions, and
                      overestimates the fuel efficiency of automatic transmissions. To correct for this,
                      we will add a predictor to our model, namely, am as $x_2$.
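The pattern described above can also be checked numerically. The following is a minimal sketch, assuming the data are R's built-in mtcars data set (which contains the columns mpg, hp, and am); the object names mpg_hp_slr, res, and mean_res_by_am are illustrative and not taken from the text.

```r
# Assumption: the data are R's built-in mtcars (columns mpg, hp, am).
# Fit the single-predictor model whose line is drawn in the plot.
mpg_hp_slr <- lm(mpg ~ hp, data = mtcars)

# Residual = observed mpg - fitted mpg.
# A positive residual means the point lies above the fitted line.
res <- resid(mpg_hp_slr)

# Average residual within each transmission type
# (am: 0 = automatic, 1 = manual).
mean_res_by_am <- tapply(res, mtcars$am, mean)
mean_res_by_am
```

A positive average residual for am = 1 and a negative one for am = 0 would match what the plot suggests: the single line sits too low for manual cars and too high for automatic cars.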
                      Our new model is

                      $$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon,$$

                      where $x_1$ and $Y$ remain the same, but now

                      $$x_2 = \begin{cases} 1 & \text{manual transmission} \\ 0 & \text{automatic transmission} \end{cases}.$$
                      In this case, we call $x_2$ a dummy variable. A dummy variable is somewhat
                      unfortunately named, as it is in no way “dumb”. In fact, it is actually somewhat
                      clever. A dummy variable is a numerical variable that is used in a regression
                      analysis to “code” for a binary categorical variable. Let’s see how this works.
                      First, note that am is already a dummy variable, since it uses the values 0 and
                      1 to represent automatic and manual transmissions. Often, a variable like am
                      would store the character values auto and man and we would either have to
                      convert these to 0 and 1, or, as we will see later, R will take care of creating
                      dummy variables for us.
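
As a rough illustration of the two points above, the sketch below first confirms that am holds only 0 and 1, then converts a character version by hand, and finally fits the model with both predictors. It assumes the data are R's built-in mtcars data set; the names am_chr, am_dummy, and mpg_hp_add are hypothetical, and the character vector is constructed here only to mimic the "auto"/"man" storage the text describes.

```r
# Assumption: the data are R's built-in mtcars, where am is already coded 0/1.
unique(mtcars$am)

# Hypothetical character version of the same variable, built for illustration.
am_chr <- ifelse(mtcars$am == 1, "man", "auto")

# Manual conversion to a 0/1 dummy variable: 1 = manual, 0 = automatic.
am_dummy <- as.numeric(am_chr == "man")

# Check that the hand-made dummy matches the original coding.
all(am_dummy == mtcars$am)

# Fit the model with hp and the dummy variable am as predictors.
mpg_hp_add <- lm(mpg ~ hp + am, data = mtcars)
coef(mpg_hp_add)
```

The coefficient reported for am is the estimated coefficient on the dummy variable. Because the dummy is 0 for automatic and 1 for manual transmissions, that estimate is the vertical shift between the two fitted lines, which share the same slope in hp.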