Page 180 - Jolliffe I. Principal Component Analysis
6.4. Examples Illustrating Variable Selection
as they account for 87.8%, 93.3%, respectively, of the total variation. The
third and fourth eigenvalues are 0.96 and 0.68, so that a cut-off of l* = 0.70
gives m = 3, but l_4 is so close to 0.70 that caution suggests m = 4. Such
conservatism is particularly appropriate for small sample sizes, where
sampling variation may be substantial. As in the previous example, Jolliffe
(1973) found that the inclusion of a fourth variable produced a marked
improvement in reproducing some of the results given by all 18 variables.
McCabe (1982) also indicated that m = 3 or 4 is appropriate.
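The cut-off reasoning above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `count_retained` and the `margin` parameter (how close to the cut-off counts as "borderline") are assumptions introduced here, not part of the original analysis. Only the third and fourth eigenvalues (0.96 and 0.68) are quoted in the text; the first two entries are placeholders chosen to exceed 0.96, and the remaining fourteen eigenvalues are omitted.

```python
def count_retained(eigenvalues, cutoff=0.70, margin=0.05):
    """Count eigenvalues above the cut-off and flag near misses.

    Returns (m, borderline), where m is the number of eigenvalues
    strictly greater than `cutoff`, and `borderline` lists the 1-based
    positions of eigenvalues that fall at or below the cut-off by no
    more than `margin` (a hypothetical tolerance, not from the text).
    """
    m = sum(1 for value in eigenvalues if value > cutoff)
    borderline = [
        k + 1
        for k, value in enumerate(eigenvalues)
        if value <= cutoff and cutoff - value <= margin
    ]
    return m, borderline


# Placeholders for the two leading eigenvalues (not given in the text),
# followed by the quoted third and fourth eigenvalues.
eigs = [2.0, 1.5, 0.96, 0.68]

m, borderline = count_retained(eigs)
print(m, borderline)
```

With these inputs the rule gives m = 3 and flags the fourth eigenvalue as borderline, which mirrors the argument for cautiously taking m = 4.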
The subsets chosen in Table 6.5 overlap less than in the previous example,
and McCabe’s subsets change noticeably in going from m = 3 to m = 4.
However, there is still substantial agreement; for example, variable 1 is
a member of all but one of the selected subsets and variable 13 is also
selected by all four methods, whereas variables {2, 6, 9, 11, 12, 15, 18} are
not selected at all.
Of the variables that are chosen by all four methods, variable 1 is
‘homicide,’ which dominates the third PC and is the only crime whose occurrence
shows no evidence of serial correlation during the period 1950–63. Because
its behaviour is different from that of all the other variables, it is
important that it should be retained in any subset that seeks to account for most
of the variation in x. Variable 13 (assault) is also atypical of the general
upward trend—it actually decreased between 1950 and 1963.
The values of the criteria (6.3.4) and (6.3.5) for Jolliffe’s and McCabe’s
subsets are closer to optimality and less erratic than in the earlier example.
No chosen subset does worse with respect to (6.3.5) than 0.925 for 3
variables and 0.964 for 4 variables, compared with optimal values of 0.942
and 0.970, respectively. The behaviour with respect to (6.3.4) is less good, but
far less erratic than in the previous example.
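As a quick check on how close the worst chosen subsets come to the optimum for criterion (6.3.5), the quoted figures can be compared directly. The short Python sketch below uses only the four numbers given above; the variable names are illustrative.

```python
# Worst value of criterion (6.3.5) among the chosen subsets, and the
# optimal value, for m = 3 and m = 4 variables (figures quoted in the text).
worst = {3: 0.925, 4: 0.964}
optimal = {3: 0.942, 4: 0.970}

# Ratio of the worst chosen subset to the optimum for each m.
ratios = {m: worst[m] / optimal[m] for m in worst}

for m, ratio in ratios.items():
    print(f"m = {m}: worst chosen subset reaches {ratio:.1%} of the optimum")
```

This prints about 98.2% for m = 3 and 99.4% for m = 4, so even the poorest selected subset loses less than 2% relative to the best possible subset on this criterion.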
In addition to the examples given here, Al-Kandari (1998), Cadima and
Jolliffe (2001), Gonzalez et al. (1990), Jolliffe (1973), King and Jackson
(1999) and McCabe (1982, 1984) all give further illustrations of variable
selection based on PCs. Krzanowski (1987b) looks at variable selection for
the alate adelges data set of Section 6.4.1, but in the context of preserving
group structure. We discuss this further in Chapter 9.

