Page 149 - Jolliffe I. Principal Component Analysis

6. Choosing a Subset of Principal Components or Variables
                              strap versions of these rules are used by Jackson (1993) and are discussed
                              further in Section 6.1.5. Stauffer et al. (1985) informally compare scree
                              plots from a number of ecological data sets with corresponding plots from
                              random data sets of the same size. They incorporate bootstrap confidence
                              intervals (see Section 6.1.5) but their main interest is in the stability of
                              the eigenvalues (see Section 10.3) rather than the choice of m. Preisendorfer
                              and Mobley’s (1988) Rule N, described in Section 6.1.7, also uses ideas
                              similar to parallel analysis.
                                Turning to the LEV diagram, an example of which is given in Sec-
                              tion 6.2.2 below, one of the earliest published descriptions was in Craddock
                              and Flood (1969), although, like the scree graph, it had been used routinely
                              for some time before this. Craddock and Flood argue that, in meteorology,
                              eigenvalues corresponding to ‘noise’ should decay in a geometric progres-
                              sion, and such eigenvalues will therefore appear as a straight line on the
                              LEV diagram. Thus, to decide on how many PCs to retain, we should
                              look for a point beyond which the LEV diagram becomes, approximately,
                              a straight line. This is the same procedure as in Cattell’s interpretation of
                              the scree graph, but the results are different, as we are now plotting log(l_k)
                              rather than l_k. To justify Craddock and Flood’s procedure, Farmer (1971)
                              generated simulated data with various known structures (or no structure).
                              For purely random data, with all variables uncorrelated, Farmer found that
                              the whole of the LEV diagram is approximately a straight line. Further-
                              more, he showed that if structures of various dimensions are introduced,
                              then the LEV diagram is useful in indicating the correct dimensionality, al-
                              though real examples, of course, give much less clear-cut results than those
                              of simulated data.
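
The step from geometric decay to a straight line in Craddock and Flood's argument can be written out explicitly. The constants c and ρ below are illustrative symbols, not notation from the text; they stand for the scale and the common ratio of the assumed geometric progression among the 'noise' eigenvalues.

```latex
l_k = c\,\rho^{k}, \quad c > 0,\; 0 < \rho < 1
\quad\Longrightarrow\quad
\log l_k = \log c + k \log \rho .
```

Under this assumption log(l_k) is linear in k with negative slope log ρ, which is why the noise eigenvalues would fall on a straight line in the LEV diagram but not in the scree graph.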


                              6.1.4 The Number of Components with Unequal Eigenvalues
                                     and Other Hypothesis Testing Procedures

                              In Section 3.7.3 a test, sometimes known as Bartlett’s test, was described
                              for the null hypothesis

                                                 H_{0,q}: λ_{q+1} = λ_{q+2} = ··· = λ_p
                              against the general alternative that at least two of the last (p−q) eigenvalues
                              are unequal. It was argued that, by using this test for various values of q, it
                              can be discovered how many of the PCs contribute substantial amounts of
                              variation, and how many are simply ‘noise.’ If m, the required number of
                              PCs to be retained, is defined as the number of PCs that are not noise,
                              then the test is used sequentially to find m.
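
The sequential use of the test can be sketched in code. This is a minimal illustration under stated assumptions, not a definitive implementation. The statistic in `equality_statistic` is the commonly quoted likelihood-ratio form for equality of the last p − q eigenvalues, with ν = (p − q + 2)(p − q − 1)/2 degrees of freedom and n replaced by n − (2p + 11)/6; it is assumed here and should be checked against Section 3.7.3. The function names are hypothetical, the chi-squared tail probability uses the Wilson-Hilferty normal approximation in place of an exact routine, and returning 0 when no hypothesis is ever rejected is a choice made for this sketch.

```python
import math


def chi2_upper_tail(x, df):
    """Approximate P(chi2_df > x) with the Wilson-Hilferty normal approximation."""
    if x <= 0:
        return 1.0
    mean = 1.0 - 2.0 / (9.0 * df)
    sd = math.sqrt(2.0 / (9.0 * df))
    z = ((x / df) ** (1.0 / 3.0) - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def equality_statistic(l, q, n):
    """Assumed statistic for H_{0,q}: the last p - q eigenvalues are equal.

    l is the list of sample eigenvalues in decreasing order, n the sample size.
    Returns (statistic, degrees of freedom).
    """
    p = len(l)
    tail = l[q:]                       # l_{q+1}, ..., l_p
    r = p - q
    n_adj = n - (2 * p + 11) / 6.0     # assumed small-sample adjustment
    stat = n_adj * (r * math.log(sum(tail) / r) - sum(math.log(v) for v in tail))
    df = (r + 2) * (r - 1) / 2.0
    return stat, df


def sequential_m(l, n, alpha=0.05):
    """Test H_{0,p-2}, H_{0,p-3}, ... and stop at the first rejection, q = q*."""
    p = len(l)
    for q in range(p - 2, -1, -1):
        stat, df = equality_statistic(l, q, n)
        if chi2_upper_tail(stat, df) < alpha:
            # m = q* + 1, or q* + 2 (= p) when the very first test rejects
            return p if q == p - 2 else q + 1
    return 0  # sketch's own convention: no hypothesis was rejected


if __name__ == "__main__":
    # Two large eigenvalues followed by three nearly equal ones
    print(sequential_m([5.0, 3.0, 0.51, 0.5, 0.49], n=200))
```

Each test in the loop is carried out at the same nominal level `alpha`, so the overall significance level of the sequence is not `alpha`; the sketch makes no attempt to adjust for this.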
                                H_{0,p−2} is tested first, that is λ_{p−1} = λ_p, and if H_{0,p−2} is not rejected then
                              H_{0,p−3} is tested. If H_{0,p−3} is not rejected, H_{0,p−4} is tested next, and this
                              sequence continues until H_{0,q} is first rejected at q = q*, say. The value of
                              m is then taken to be q* + 1 (or q* + 2 if q* = p−2). There are a number of
                              disadvantages to this procedure, the first of which is that equation (3.7.6)