Bootstrap versions of these rules are used by Jackson (1993) and are discussed
further in Section 6.1.5. Stauffer et al. (1985) informally compare scree
plots from a number of ecological data sets with corresponding plots from
random data sets of the same size. They incorporate bootstrap confidence
intervals (see Section 6.1.5), but their main interest is in the stability of
the eigenvalues (see Section 10.3) rather than in the choice of m. Preisendorfer
and Mobley's (1988) Rule N, described in Section 6.1.7, also uses ideas
similar to parallel analysis.
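The comparison underlying parallel analysis is straightforward to compute. The following Python sketch (a minimal illustration, not the procedure of any of the authors cited above; the function name and the 95% cut-off are choices made here) retains those components whose correlation-matrix eigenvalues exceed the corresponding quantile of eigenvalues obtained from uncorrelated normal data of the same size:

import numpy as np

def parallel_analysis(X, n_sim=100, quantile=0.95, seed=None):
    """Retain PCs whose correlation-matrix eigenvalues exceed the chosen
    quantile of eigenvalues from uncorrelated normal data of the same size."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Eigenvalues of the observed correlation matrix, largest first.
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    sim = np.empty((n_sim, p))
    for i in range(n_sim):
        Z = rng.standard_normal((n, p))  # 'random data' of the same size
        sim[i] = np.sort(np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False)))[::-1]
    # Compare each l_k with the same-rank eigenvalue of the random data.
    return int(np.sum(obs > np.quantile(sim, quantile, axis=0)))

Plotting obs together with the simulated quantiles reproduces the informal scree-plot comparison of Stauffer et al. (1985); resampling the rows of X instead of generating fully random data would move the sketch closer to the bootstrap versions discussed in Section 6.1.5.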
Turning to the LEV diagram, an example of which is given in Sec-
tion 6.2.2 below, one of the earliest published descriptions was in Craddock
and Flood (1969), although, like the scree graph, it had been used routinely
for some time before this. Craddock and Flood argue that, in meteorology,
eigenvalues corresponding to ‘noise’ should decay in a geometric progres-
sion, and such eigenvalues will therefore appear as a straight line on the
LEV diagram. Thus, to decide on how many PCs to retain, we should
look for a point beyond which the LEV diagram becomes, approximately,
a straight line. This is the same procedure as in Cattell’s interpretation of
the scree graph, but the results are different, as we are now plotting $\log(l_k)$
rather than $l_k$. To justify Craddock and Flood's procedure, Farmer (1971)
generated simulated data with various known structures (or no structure).
For purely random data, with all variables uncorrelated, Farmer found that
the whole of the LEV diagram is approximately a straight line. Further-
more, he showed that if structures of various dimensions are introduced,
then the LEV diagram is useful in indicating the correct dimensionality, al-
though real examples, of course, give much less clear-cut results than those
of simulated data.
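Craddock and Flood's rationale translates directly into a plot: eigenvalues decaying in a geometric progression are linear in k on a log scale, so one looks for the point beyond which the plotted points straighten out. A minimal Python sketch (the function name is illustrative; numpy and matplotlib are assumed):

import numpy as np
import matplotlib.pyplot as plt

def lev_diagram(eigvals):
    """Plot log(l_k) against k: geometrically decaying 'noise'
    eigenvalues appear as an approximately straight line."""
    l = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # must be positive
    k = np.arange(1, len(l) + 1)
    plt.plot(k, np.log(l), marker="o")
    plt.xlabel("component number k")
    plt.ylabel("log(eigenvalue)")
    plt.title("LEV diagram")
    plt.show()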
6.1.4 The Number of Components with Unequal Eigenvalues
and Other Hypothesis Testing Procedures
In Section 3.7.3 a test, sometimes known as Bartlett’s test, was described
for the null hypothesis
$$H_{0,q}\colon\ \lambda_{q+1} = \lambda_{q+2} = \cdots = \lambda_p$$
against the general alternative that at least two of the last (p−q) eigenvalues
are unequal. It was argued that, by using this test for various values of q, it
can be discovered how many of the PCs contribute substantial amounts of
variation, and how many are simply ‘noise.’ If m, the required number of
PCs to be retained, is defined as the number of PCs that are not noise,
then the test is used sequentially to find m.
$H_{0,p-2}$ is tested first, that is, $\lambda_{p-1} = \lambda_p$, and if $H_{0,p-2}$ is not rejected then
$H_{0,p-3}$ is tested. If $H_{0,p-3}$ is not rejected, $H_{0,p-4}$ is tested next, and this
sequence continues until $H_{0,q}$ is first rejected at $q = q^*$, say. The value of
$m$ is then taken to be $q^* + 1$ (or $q^* + 2$ if $q^* = p - 2$).
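As a concrete sketch of this sequential scheme (an illustration only: it uses a common asymptotic chi-squared form of Bartlett's statistic with multiplier n, whereas the statistic actually prescribed here is equation (3.7.6), and the function name is ours):

import numpy as np
from scipy import stats

def sequential_bartlett(eigvals, n, alpha=0.05):
    """Test H_{0,q} for q = p-2, p-3, ... until the first rejection at
    q = q*; return m = q* + 1 (or q* + 2 if q* = p - 2)."""
    l = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # covariance eigenvalues
    p = len(l)
    for q in range(p - 2, -1, -1):
        tail = l[q:]                      # l_{q+1}, ..., l_p (assumed positive)
        r = p - q                         # number of eigenvalues under test
        Q = n * (r * np.log(tail.mean()) - np.sum(np.log(tail)))
        df = 0.5 * (r + 2) * (r - 1)      # degrees of freedom for H_{0,q}
        if stats.chi2.sf(Q, df) < alpha:  # H_{0,q} rejected, so q* = q
            return q + 2 if q == p - 2 else q + 1
    return 0  # nothing rejected: no eigenvalue stands out from the 'noise'

Calling sequential_bartlett(l, n) with the sample covariance eigenvalues l and the sample size n then returns the number of PCs judged, at the chosen significance level, to lie above the noise.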
There are a number of disadvantages to this procedure, the first of which is that equation (3.7.6)

