Page 215 - Applied Statistics with R
When R created $v_2$, the dummy variable, it used domestic cars as the reference
level, that is, the default value of the factor variable. So when the dummy
variable is 0, the model represents this reference level, which is domestic. (R
makes this choice because domestic comes before foreign alphabetically.)
So the two models have different estimated coefficients, but due to the different
model representations, they are actually the same model.
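As a small illustration of this reference-level behavior, the following sketch uses a made-up origin vector (hypothetical toy data, not the autompg data). It suggests how R orders factor levels alphabetically by default, how the first level ends up with a dummy value of 0 under R's default treatment contrasts, and how relevel() can be used to pick a different reference level.

```r
# Toy data: a made-up vector for illustration, not taken from autompg
origin = factor(c("foreign", "domestic", "domestic", "foreign"))

# Levels are sorted alphabetically, so "domestic" is the reference level
levels(origin)
## [1] "domestic" "foreign"

# The design matrix has an intercept and one dummy column, originforeign,
# which is 0 for the reference level (domestic) and 1 for foreign
X = model.matrix(~ origin)
X

# relevel() makes "foreign" the reference level instead
origin_foreign_ref = relevel(origin, ref = "foreign")
levels(origin_foreign_ref)
## [1] "foreign"  "domestic"
```

Changing the reference level changes which group the intercept describes and the sign of the dummy coefficient, but not the fitted values, which is why the two representations describe the same model.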
11.3.1 Factors with More Than Two Levels
Let’s now consider a factor variable with more than two levels. In this dataset,
cyl is an example.
is.factor(autompg$cyl)
## [1] TRUE
levels(autompg$cyl)
## [1] "4" "6" "8"
Here the cyl variable has three possible levels: 4, 6, and 8. You may wonder,
why not simply use cyl as a numerical variable? You certainly could.
However, that would force the difference in average mpg between 4 and 6 cylinders
to be the same as the difference in average mpg between 6 and 8 cylinders.
That usually makes sense for a continuous variable, but not for a discrete variable
with so few possible values. In the case of this variable, there is no such
thing as a 7-cylinder engine or a 6.23-cylinder engine in personal vehicles. For
these reasons, we will simply consider cyl to be categorical. This is a decision
that will commonly need to be made with ordinal variables. Often, with a large
number of categories, the decision to treat them as numerical variables is appropriate
because, otherwise, a large number of dummy variables are needed
to represent these variables.
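To make the equal-difference constraint concrete, here is a minimal sketch using R's built-in mtcars data as a stand-in. This is an assumption for illustration only: mtcars is not the autompg data used in this chapter, although it also contains mpg and a cyl variable taking the values 4, 6, and 8. With cyl numeric, the fitted change from 4 to 6 cylinders must equal the fitted change from 6 to 8. With cyl as a factor, each level gets its own fitted mean.

```r
# mtcars ships with R; used here only as a stand-in for autompg
fit_num = lm(mpg ~ cyl, data = mtcars)          # cyl treated as numeric
fit_fac = lm(mpg ~ factor(cyl), data = mtcars)  # cyl treated as categorical

new_cyl = data.frame(cyl = c(4, 6, 8))
pred_num = predict(fit_num, newdata = new_cyl)
pred_fac = predict(fit_fac, newdata = new_cyl)

# Numeric: both differences equal 2 times the slope on cyl
diff(pred_num)

# Factor: the differences are free to differ, and the fitted values
# are the sample mean mpg within each cylinder group
diff(pred_fac)
tapply(mtcars$mpg, mtcars$cyl, mean)
```

The factor version costs one extra parameter here (two dummy coefficients instead of one slope), which is the trade-off described above when the number of categories grows.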
Let’s define three dummy variables related to the cyl factor variable.
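\[
v_1 =
\begin{cases}
1 & \text{4 cylinder} \\
0 & \text{not 4 cylinder}
\end{cases}
\]

\[
v_2 =
\begin{cases}
1 & \text{6 cylinder} \\
0 & \text{not 6 cylinder}
\end{cases}
\]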

