Page 203 - Applied Statistics with R

                      we see $\beta_1$ is the average change in $Y$ for an increase in $x_1$, no matter the value
                      of $x_2$. Also, $\beta_2$ is always the difference in the average of $Y$ for any value of $x_1$.
                      These are two restrictions we won’t always want, so we need a way to specify a
                      more flexible model.
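
To make the two restrictions concrete, the model can be written out separately for each value of the dummy variable. This is a sketch that assumes the additive form $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$ with $x_2 \in \{0, 1\}$; the conditional-expectation notation is introduced here for illustration.

```latex
\begin{aligned}
x_2 = 0: \quad E[Y \mid x_1, x_2 = 0] &= \beta_0 + \beta_1 x_1 \\
x_2 = 1: \quad E[Y \mid x_1, x_2 = 1] &= (\beta_0 + \beta_2) + \beta_1 x_1
\end{aligned}
```

Under this assumption both lines share the slope $\beta_1$, and the vertical gap between them is $\beta_2$ at every value of $x_1$.
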
                      Here we restricted ourselves to a single numerical predictor $x_1$ and one dummy
                      variable $x_2$. However, the concept of a dummy variable can be used with larger
                      multiple regression models. We only use a single numerical predictor here for
                      ease of visualization since we can think of the “two lines” interpretation. But
                      in general, we can think of a dummy variable as creating “two models,” one for
                      each category of a binary categorical variable.
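
The "two lines" reading of an additive dummy-variable model can be checked numerically. The sketch below uses simulated data rather than any dataset from the text; the variable names, sample size, and true coefficients are all illustrative assumptions.

```r
# simulated data; every name and value here is an illustrative assumption
set.seed(42)
n  = 100
x1 = runif(n, 0, 10)                 # numerical predictor
x2 = rep(c(0, 1), each = n / 2)      # dummy variable coded 0 / 1
y  = 2 + 3 * x1 + 5 * x2 + rnorm(n)  # additive truth: one slope, two intercepts

# additive model: no interaction term
fit = lm(y ~ x1 + x2)
b   = coef(fit)

# the two lines implied by the fitted additive model
line_0 = c(intercept = unname(b["(Intercept)"]),
           slope     = unname(b["x1"]))
line_1 = c(intercept = unname(b["(Intercept)"] + b["x2"]),
           slope     = unname(b["x1"]))

line_0
line_1
```

By construction the two fitted lines have identical slopes and differ only in their intercepts, by the estimated coefficient on `x2`. That is exactly the "same slope" restriction an additive model imposes.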
                      11.2     Interactions


                      To remove the “same slope” restriction, we will now discuss interaction. To
                      illustrate this concept, we will return to the autompg dataset we created in the
                      last chapter, with a few more modifications.

                      # read data frame from the web
                      autompg = read.table(
                        "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data",
                        quote = "\"",
                        comment.char = "",
                        stringsAsFactors = FALSE)
                      # give the dataframe headers
                      colnames(autompg) = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin", "name")
                      # remove missing data, which is stored as "?"
                      autompg = subset(autompg, autompg$hp != "?")
                      # remove the plymouth reliant, as it causes some issues
                      autompg = subset(autompg, autompg$name != "plymouth reliant")
                      # give the dataset row names, based on the engine, year and name
                      rownames(autompg) = paste(autompg$cyl, "cylinder", autompg$year, autompg$name)
                      # remove the variable for name
                      autompg = subset(autompg, select = c("mpg", "cyl", "disp", "hp", "wt", "acc", "year", "origin"))
                      # change horsepower from character to numeric
                      autompg$hp = as.numeric(autompg$hp)
                      # create a dummy variable for foreign vs domestic cars. domestic = 1.
                      autompg$domestic = as.numeric(autompg$origin == 1)
                      # remove 3 and 5 cylinder cars (which are very rare.)
                      autompg = autompg[autompg$cyl != 5,]
                      autompg = autompg[autompg$cyl != 3,]
                      # the following line would verify the remaining cylinder possibilities are 4, 6, 8
                      #unique(autompg$cyl)
                      # change cyl to a factor variable
                      autompg$cyl = as.factor(autompg$cyl)
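
The cleaning steps above can be tried without the download. The following sketch applies the same operations to a small made-up data frame; `toy` and its values are hypothetical and are not taken from the autompg data.

```r
# hypothetical four-row data frame mimicking the relevant autompg columns
toy = data.frame(
  origin = c(1, 2, 3, 1),
  cyl    = c(4, 6, 8, 3),
  hp     = c("130", "?", "95", "110"),
  stringsAsFactors = FALSE)

# remove missing data, which is stored as "?"
toy = subset(toy, toy$hp != "?")
# change horsepower from character to numeric
toy$hp = as.numeric(toy$hp)
# create a dummy variable for foreign vs domestic cars. domestic = 1.
toy$domestic = as.numeric(toy$origin == 1)
# remove the 3 cylinder row
toy = toy[toy$cyl != 3, ]
# change cyl to a factor variable
toy$cyl = as.factor(toy$cyl)

str(toy)
```

Two rows remain in this toy example: one domestic car (`domestic` equal to 1) and one foreign car (`domestic` equal to 0), with `cyl` stored as a factor with levels "4" and "8".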