Page 178 - Jolliffe I. Principal Component Analysis


6.4. Examples Illustrating Variable Selection
both among the group of dominant variables for the second PC, and variable 13
(tibia length 3) has the largest coefficient of any variable for PC1.
Comparisons can be made regarding how well Jolliffe’s and McCabe’s selections
perform with respect to the criteria (6.3.4) and (6.3.5). For (6.3.5),
Jolliffe’s choices are closer to optimality than McCabe’s, achieving values
of 0.933 and 0.945 for four variables, compared with 0.907 and 0.904 for
McCabe, whereas the optimal value is 0.948. Discrepancies are generally
larger but more variable for criterion (6.3.4). For example, the B2 selection
of three variables achieves a value of only 0.746 compared with the optimal
value of 0.942, which is attained by B4. Values for McCabe’s selections are
intermediate (0.838, 0.880).
Regarding the choice of m, the l_k criterion of Section 6.1.2 was found
by Jolliffe (1972), using simulation studies, to be appropriate for methods
B2 and B4, with a cut-off close to l* = 0.7. In the present example the
criterion suggests m = 3, as l_3 = 0.75 and l_4 = 0.50. Confirmation that m
should be this small is given by the criterion t_m of Section 6.1.1. Two PCs
account for 85.4% of the variation, three PCs give 89.4% and four PCs
contribute 92.0%, from which Jeffers (1967) concludes that two PCs are
sufficient to account for most of the variation. However, Jolliffe (1973) also
looked at how well other aspects of the structure of data are reproduced for
various values of m. For example, the form of the PCs and the division into
four distinct groups of aphids (see Section 9.2 for further discussion of this
aspect) were both examined and found to be noticeably better reproduced
for m = 4 than for m = 2 or 3, so it seems that the criteria of Sections 6.1.1
and 6.1.2 might be relaxed somewhat when very small values of m are
indicated, especially when coupled with small values of n, the sample size.
McCabe (1982) notes that four or five of the original variables are necessary
in order to account for as much variation as the first two PCs, confirming
that m = 4 or 5 is probably appropriate here.
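
The two stopping rules discussed above can be checked against each other numerically. The following is a minimal sketch, not the procedure used in the cited studies. It assumes that the PCA is based on the correlation matrix of p = 19 variables, so that the eigenvalues sum to p; the value of p and all function names are assumptions made for illustration. Only the cumulative percentages quoted in the text are used as data.

```python
# Cumulative percentage of variation t_m quoted in the text for m = 2, 3, 4.
cumulative_pct = {2: 85.4, 3: 89.4, 4: 92.0}

# Assumption: correlation-matrix PCA on p = 19 variables, so eigenvalues sum to p.
P = 19


def eigenvalue_from_cumulative(k, cum_pct, p):
    """Recover l_k from consecutive cumulative percentages:
    l_k = p * (t_k - t_{k-1}) / 100."""
    return p * (cum_pct[k] - cum_pct[k - 1]) / 100.0


def passes_cutoff(l_k, cutoff=0.7):
    """Cut-off rule: component k is retained when l_k is at least l*."""
    return l_k >= cutoff


def smallest_m_reaching(cum_pct, threshold):
    """Smallest tabulated m whose cumulative percentage reaches the
    threshold, or None if no tabulated m does."""
    reached = [m for m in sorted(cum_pct) if cum_pct[m] >= threshold]
    return reached[0] if reached else None


l3 = eigenvalue_from_cumulative(3, cumulative_pct, P)
l4 = eigenvalue_from_cumulative(4, cumulative_pct, P)

print(f"l_3 ~ {l3:.3f}, retained: {passes_cutoff(l3)}")
print(f"l_4 ~ {l4:.3f}, retained: {passes_cutoff(l4)}")
print("smallest tabulated m with t_m >= 90%:",
      smallest_m_reaching(cumulative_pct, 90.0))
```

Under these assumptions the recovered values are roughly 0.76 and 0.49, which agree with the quoted l_3 = 0.75 and l_4 = 0.50 up to rounding of the percentages. The third eigenvalue passes the 0.7 cut-off and the fourth does not, giving m = 3, whereas a 90% threshold on t_m would only be reached at m = 4.
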
Tanaka and Mori (1997) suggest, on the basis of their two criteria and
using a backward elimination algorithm, that seven or nine variables should
be kept, rather more than Jolliffe (1973) or McCabe (1982). If only four
variables are retained, Tanaka and Mori’s (1997) analysis keeps variables
5, 6, 14, 19 according to the RV-coefficient, and variables 5, 14, 17, 18 using
residuals from regression. At least three of the four variables overlap with
choices made in Table 6.4. On the other hand, the selection rule based
on influential variables suggested by Mori et al. (2000) retains variables
2, 4, 12, 13 in a 4-variable subset, a quite different selection from those of
the other methods.
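
The overlap between the four-variable subsets quoted in this paragraph can be tabulated directly. This is only an illustrative sketch: the dictionary labels are hypothetical names, and the selections of Table 6.4 are left out because they are not reproduced on this page.

```python
from itertools import combinations

# Four-variable subsets quoted in the text (variable numbers as in the source).
selections = {
    "Tanaka-Mori, RV-coefficient": {5, 6, 14, 19},
    "Tanaka-Mori, regression residuals": {5, 14, 17, 18},
    "Mori et al., influential variables": {2, 4, 12, 13},
}

# Variables shared by each pair of selections, in ascending order.
overlaps = {
    (a, b): sorted(selections[a] & selections[b])
    for a, b in combinations(selections, 2)
}

for (a, b), shared in overlaps.items():
    print(f"{a} vs {b}: {shared}")
```

The two Tanaka and Mori subsets share variables 5 and 14, while the subset based on influential variables has no variable in common with either of them.
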

6.4.2 Crime Rates

These data were given by Ahamad (1967) and consist of measurements of
the crime rate in England and Wales for 18 different categories of crime
(the variables) for the 14 years, 1950–63. The sample size n = 14 is very
