Page 180 - Jolliffe I. Principal Component Analysis

6.4. Examples Illustrating Variable Selection
                              as they account for 87.8%, 93.3%, respectively, of the total variation. The
                              third and fourth eigenvalues are 0.96, 0.68 so that a cut-off of l* = 0.70
                              gives m = 3, but l_4 is so close to 0.70 that caution suggests m = 4. Such
                              conservatism is particularly appropriate for small sample sizes, where sam-
                              pling variation may be substantial. As in the previous example, Jolliffe
                              (1973) found that the inclusion of a fourth variable produced a marked
                              improvement in reproducing some of the results given by all 18 variables.
                              McCabe (1982) also indicated that m = 3 or 4 is appropriate.
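
The cut-off reasoning above can be sketched in a few lines of Python. This is only an illustrative sketch, not the procedure used in the book: the function name choose_m and the margin parameter are assumptions introduced here to mimic the informal "caution" applied when an eigenvalue falls just below l*. Only the third and fourth eigenvalues (0.96, 0.68) come from the text; the other values in the example list are placeholders.

```python
def choose_m(eigenvalues, cutoff=0.70, margin=0.05):
    """Return (strict_m, cautious_m) for an eigenvalue cut-off rule.

    strict_m counts eigenvalues greater than the cut-off.
    cautious_m also counts eigenvalues that fall within `margin`
    below the cut-off (an assumed, hypothetical tolerance).
    """
    ordered = sorted(eigenvalues, reverse=True)
    strict_m = sum(1 for value in ordered if value > cutoff)
    cautious_m = sum(1 for value in ordered if value > cutoff - margin)
    return strict_m, cautious_m


# 0.96 and 0.68 are the third and fourth eigenvalues quoted in the text;
# 9.0, 5.0 and 0.40 are placeholders, NOT values from the crime data.
example = [9.0, 5.0, 0.96, 0.68, 0.40]
print(choose_m(example))  # prints (3, 4)
```

With a strict cut-off the rule gives m = 3, while allowing a small tolerance below l* gives m = 4, which mirrors the choice discussed above. The size of the tolerance is a judgment call and is not prescribed by the source.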
                                The subsets chosen in Table 6.5 overlap less than in the previous example,
                              and McCabe’s subsets change noticeably in going from m = 3 to m = 4.
                              However, there is still substantial agreement; for example, variable 1 is
                              a member of all but one of the selected subsets and variable 13 is also
                              selected by all four methods, whereas variables {2, 6, 9, 11, 12, 15, 18} are
                              not selected at all.
                                Of the variables that are chosen by all four methods, variable 1 is ‘homi-
                              cide,’ which dominates the third PC and is the only crime whose occurrence
                              shows no evidence of serial correlation during the period 1950–63. Because
                              its behaviour is different from that of all the other variables, it is impor-
                              tant that it should be retained in any subset that seeks to account for most
                              of the variation in x. Variable 13 (assault) is also atypical of the general
                              upward trend—it actually decreased between 1950 and 1963.
                                The values of the criteria (6.3.4) and (6.3.5) for Jolliffe’s and McCabe’s
                              subsets are closer to optimality and less erratic than in the earlier exam-
                              ple. No chosen subset does worse with respect to (6.3.5) than 0.925 for 3
                              variables and 0.964 for 4 variables, compared to optimal values of 0.942,
                              0.970 respectively. The behaviour with respect to (6.3.4) is less good, but
                              far less erratic than in the previous example.
                                In addition to the examples given here, Al-Kandari (1998), Cadima and
                              Jolliffe (2001), Gonzalez et al. (1990), Jolliffe (1973), King and Jackson
                              (1999) and McCabe (1982, 1984) all give further illustrations of variable
                              selection based on PCs. Krzanowski (1987b) looks at variable selection for
                              the alate adelges data set of Section 6.4.1, but in the context of preserving
                              group structure. We discuss this further in Chapter 9.