Page 178 - Jolliffe I. Principal Component Analysis

6.4. Examples Illustrating Variable Selection
                              both among the group of dominant variables for the second PC, and vari-
                              able 13 (tibia length 3) has the largest coefficient of any variable for PC1.
                                Comparisons can be made regarding how well Jolliffe’s and McCabe’s se-
                              lections perform with respect to the criteria (6.3.4) and (6.3.5). For (6.3.5),
                              Jolliffe’s choices are closer to optimality than McCabe’s, achieving values
                              of 0.933 and 0.945 for four variables, compared to 0.907 and 0.904 for
                              McCabe, whereas the optimal value is 0.948. Discrepancies are generally
                              larger but more variable for criterion (6.3.4). For example, the B2 selection
                              of three variables achieves a value of only 0.746 compared with the optimal
                              value of 0.942, which is attained by B4. Values for McCabe’s selections are
                              intermediate (0.838, 0.880).
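The B2 and B4 selections compared above can be sketched in a few lines. The sketch assumes the usual description of these rules: B4 retains, for each of the first m PCs in turn, the not-yet-chosen variable with the largest absolute coefficient, whereas B2 discards, for each of the last p - m PCs starting with the last, the not-yet-discarded variable with the largest absolute coefficient. The function names and the small correlation matrix are hypothetical, and variables are indexed from 0 rather than from 1 as in the text.

```python
import numpy as np


def pcs_from_correlation(R):
    """Return eigenvalues and eigenvectors of R, ordered by decreasing eigenvalue."""
    vals, vecs = np.linalg.eigh(R)      # eigh returns ascending order
    order = np.argsort(vals)[::-1]      # re-order so that PC1 comes first
    return vals[order], vecs[:, order]


def select_b4(loadings, m):
    """B4-style rule (assumed form): keep one variable for each of the first m PCs."""
    p = loadings.shape[0]
    kept = []
    for k in range(m):
        candidates = [j for j in range(p) if j not in kept]
        # variable with the largest absolute coefficient in PC k+1
        kept.append(max(candidates, key=lambda j: abs(loadings[j, k])))
    return sorted(kept)


def select_b2(loadings, m):
    """B2-style rule (assumed form): drop one variable for each of the last p - m PCs."""
    p = loadings.shape[0]
    dropped = []
    for k in range(p - 1, m - 1, -1):   # start with the last PC
        candidates = [j for j in range(p) if j not in dropped]
        dropped.append(max(candidates, key=lambda j: abs(loadings[j, k])))
    return sorted(set(range(p)) - set(dropped))


# Hypothetical correlation matrix for four variables (not the aphid data).
R = np.array([
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.3, 0.1],
    [0.2, 0.3, 1.0, 0.5],
    [0.1, 0.1, 0.5, 1.0],
])
eigenvalues, loadings = pcs_from_correlation(R)
print("B4 keeps:", select_b4(loadings, 2))
print("B2 keeps:", select_b2(loadings, 2))
```

The two rules need not agree, because B4 looks only at the high-variance PCs and B2 only at the low-variance ones, which is consistent with the different values they achieve for criterion (6.3.4).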
                                Regarding the choice of m, the l_k criterion of Section 6.1.2 was found
                              by Jolliffe (1972), using simulation studies, to be appropriate for methods
                              B2 and B4, with a cut-off close to l* = 0.7. In the present example the
                              criterion suggests m = 3, as l_3 = 0.75 and l_4 = 0.50. Confirmation that m
                              should be this small is given by the criterion t_m of Section 6.1.1. Two PCs
                              account for 85.4% of the variation, three PCs give 89.4% and four PCs
                              contribute 92.0%, from which Jeffers (1967) concludes that two PCs are
                              sufficient to account for most of the variation. However, Jolliffe (1973) also
                              looked at how well other aspects of the structure of data are reproduced for
                              various values of m. For example, the form of the PCs and the division into
                              four distinct groups of aphids (see Section 9.2 for further discussion of this
                              aspect) were both examined and found to be noticeably better reproduced
                              for m = 4 than for m = 2 or 3, so it seems that the criteria of Sections 6.1.1
                              and 6.1.2 might be relaxed somewhat when very small values of m are
                              indicated, especially when coupled with small values of n, the sample size.
                              McCabe (1982) notes that four or five of the original variables are necessary
                              in order to account for as much variation as the first two PCs, confirming
                              that m = 4 or 5 is probably appropriate here.
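The two rules for choosing m can be checked against the figures quoted above. The following is a minimal sketch, assuming that the eigenvalue rule retains every PC whose correlation-matrix eigenvalue exceeds the cut-off and that the cumulative-variance rule takes the smallest m whose cumulative percentage reaches a chosen threshold. Only the values reported in the text are used; the function names and the 80% and 90% thresholds are illustrative assumptions.

```python
def m_by_eigenvalue_cutoff(eigenvalues_by_index, cutoff=0.7):
    """Largest k with l_k > cutoff, given a mapping {k: l_k}.

    Because eigenvalues are ordered l_1 >= l_2 >= ..., the largest index
    whose eigenvalue exceeds the cut-off is the number of PCs retained.
    """
    above = [k for k, l in eigenvalues_by_index.items() if l > cutoff]
    return max(above) if above else 0


def m_by_cumulative_variance(cumulative_pct_by_m, threshold):
    """Smallest m (among those supplied) with t_m >= threshold, else None."""
    for m in sorted(cumulative_pct_by_m):
        if cumulative_pct_by_m[m] >= threshold:
            return m
    return None


# Values quoted in the text; l_1, l_2 and t_1 are not reported there.
eigenvalues = {3: 0.75, 4: 0.50}
cumulative_pct = {2: 85.4, 3: 89.4, 4: 92.0}

print(m_by_eigenvalue_cutoff(eigenvalues, cutoff=0.7))           # 3
print(m_by_cumulative_variance(cumulative_pct, threshold=80.0))  # 2 (assumed threshold)
print(m_by_cumulative_variance(cumulative_pct, threshold=90.0))  # 4 (assumed threshold)
```

The result of the cumulative-variance rule depends strongly on the threshold, which is one reason the text looks at other aspects of the data structure before settling on m = 4 or 5.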
                                Tanaka and Mori (1997) suggest, on the basis of their two criteria and
                              using a backward elimination algorithm, that seven or nine variables should
                              be kept, rather more than Jolliffe (1973) or McCabe (1982). If only four
                              variables are retained, Tanaka and Mori’s (1997) analysis keeps variables
                              5, 6, 14, 19 according to the RV-coefficient, and variables 5, 14, 17, 18 using
                              residuals from regression. At least three of the four variables overlap with
                              choices made in Table 6.4. On the other hand, the selection rule based
                              on influential variables suggested by Mori et al. (2000) retains variables
                              2, 4, 12, 13 in a 4-variable subset, a quite different selection from those of
                              the other methods.
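A backward elimination search of the kind mentioned above can be sketched generically. The criterion used here is a stand-in, not Tanaka and Mori's RV-coefficient or their regression-residual criterion: it is the fraction of total variance explained when all variables are regressed linearly on the retained subset. The function names and the small covariance matrix are hypothetical, and any other criterion taking a covariance matrix and a subset could be passed in instead.

```python
import numpy as np


def explained_fraction(S, subset):
    """Fraction of trace(S) explained by regressing all variables on `subset`."""
    K = list(subset)
    S_kk = S[np.ix_(K, K)]                          # covariance within the subset
    S_ak = S[:, K]                                  # covariance of all variables with the subset
    fitted = S_ak @ np.linalg.solve(S_kk, S_ak.T)   # S_ak S_kk^{-1} S_ka
    return float(np.trace(fitted) / np.trace(S))


def backward_eliminate(S, n_keep, criterion=explained_fraction):
    """Drop variables one at a time, each time removing the one whose
    removal leaves the criterion as large as possible."""
    kept = list(range(S.shape[0]))
    while len(kept) > n_keep:
        to_drop = max(
            kept,
            key=lambda j: criterion(S, [i for i in kept if i != j]),
        )
        kept.remove(to_drop)
    return kept


# Hypothetical covariance matrix: variables 0 and 1 are nearly redundant,
# variable 2 is uncorrelated with both.
S = np.array([
    [1.00, 0.99, 0.00],
    [0.99, 1.00, 0.00],
    [0.00, 0.00, 1.00],
])
print(backward_eliminate(S, n_keep=2))
```

Backward elimination is a greedy search, so the subset it reaches is not guaranteed to be the best of its size under the chosen criterion.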

                              6.4.2 Crime Rates

                              These data were given by Ahamad (1967) and consist of measurements of
                              the crime rate in England and Wales for 18 different categories of crime
                              (the variables) for the 14 years, 1950–63. The sample size n = 14 is very