Page 165 - Jolliffe I. Principal Component Analysis

6. Choosing a Subset of Principal Components or Variables
                              Table 6.2. First six eigenvalues for the covariance matrix, blood chemistry data.

                              Component number, k         1        2        3        4        5        6
                              Eigenvalue, l_k       1704.68    15.07     6.98     2.64     0.13     0.07
                              l_k / \bar{l}            7.88     0.07     0.03     0.01   0.0006   0.0003
                              t_m                      98.6     99.4     99.8    99.99   99.995  99.9994
                              l_{k-1} - l_k                  1689.61     8.09     4.34     2.51     0.06

                              Here \bar{l} is the mean eigenvalue and t_m = 100 \sum_{k=1}^{m} l_k / \sum_{k=1}^{p} l_k.

                              nearly 80% of the total variation, but it takes six PCs to account for 90%.
                              A cut-off at l* = 0.7 for the second criterion retains four PCs, but the next
                              eigenvalue is not very much smaller, so perhaps five should be retained. In
                              the scree graph the slope actually increases between k = 3 and 4, but then
                              falls sharply and levels off, suggesting that perhaps only four PCs should
                              be retained. The LEV diagram (not shown) is of little help here; it has no
                              clear indication of constant slope after any value of k, and in fact has its
                              steepest slope between k = 7 and 8.
                                Using Cattell’s (1966) formulation, there is no strong straight-line
                              behaviour after any particular point, although perhaps a cut-off at k = 4 is
                              most appropriate. Cattell suggests that the first point on the straight line
                              (that is, the ‘elbow’ point) should be retained. However, if we consider the
                              scree graph in the same light as the test of Section 6.1.4, then all
                              eigenvalues after, and including, the elbow are deemed roughly equal and so all
                              corresponding PCs should be deleted. This would lead to the retention of
                              only three PCs in the present case.
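
The two readings of the scree graph differ only in whether the elbow itself is kept. A minimal sketch, assuming the elbow has already been located by eye at k = 4 as suggested above; the function name and its arguments are hypothetical and are not part of either author's method:

```python
def components_retained(elbow_k: int, keep_elbow: bool) -> int:
    """Number of PCs retained once the elbow of the scree graph is fixed.

    keep_elbow=True  : Cattell's reading, the elbow point is retained.
    keep_elbow=False : the elbow and everything after it are treated as
                       roughly equal eigenvalues and deleted.
    """
    return elbow_k if keep_elbow else elbow_k - 1


# ASSUMPTION: elbow read off the scree graph at k = 4, as in the text.
ELBOW_K = 4
cattell_m = components_retained(ELBOW_K, keep_elbow=True)
equal_tail_m = components_retained(ELBOW_K, keep_elbow=False)
print(cattell_m, equal_tail_m)
```

With the elbow at k = 4, the first reading keeps four PCs and the second keeps three, which is the discrepancy described in the paragraph above.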
                                Turning to Table 6.2, which gives information for the covariance matrix,
                              corresponding to that presented for the correlation matrix in Table 6.1, the
                              three ad hoc measures all conclusively suggest that one PC is sufficient. It
                              is undoubtedly true that choosing m = 1 accounts for the vast majority
                              of the variation in x, but this conclusion is not particularly informative
                              as it merely reflects that one of the original variables accounts for nearly
                              all the variation in x. The PCs for the covariance matrix in this example
                              were discussed in Section 3.3, and it can be argued that it is the use of
                              the covariance matrix, rather than the rules of Sections 6.1.1–6.1.3, that is
                              inappropriate for these data.
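
The ad hoc measures for the covariance matrix can be recomputed from the eigenvalues listed in Table 6.2. The sketch below is illustrative only and rests on two assumptions that the table itself does not state: that there are p = 8 variables, and that the eigenvalues beyond the sixth are small enough to be treated as zero. The function names are hypothetical. The 90% threshold and the cut-off l* = 0.7 are the values used in the discussion above, with the cut-off applied to l_k / \bar{l}.

```python
# Eigenvalues l_1..l_6 of the covariance matrix, copied from Table 6.2.
eigenvalues = [1704.68, 15.07, 6.98, 2.64, 0.13, 0.07]

# ASSUMPTION: p = 8 variables; the unlisted eigenvalues are treated as zero,
# so totals and means below are slight underestimates.
P_ASSUMED = 8


def cumulative_percentages(l):
    """t_m = 100 * (l_1 + ... + l_m) / (sum of all listed eigenvalues)."""
    total = sum(l)
    running, out = 0.0, []
    for value in l:
        running += value
        out.append(100.0 * running / total)
    return out


def ratios_to_mean(l, p):
    """l_k divided by the mean eigenvalue, with the mean taken over p variables."""
    mean = sum(l) / p
    return [value / mean for value in l]


def successive_differences(l):
    """l_{k-1} - l_k for k = 2, ..., len(l)."""
    return [l[k - 1] - l[k] for k in range(1, len(l))]


def smallest_m_reaching(percentages, threshold):
    """Smallest m whose cumulative percentage reaches the threshold."""
    for m, t in enumerate(percentages, start=1):
        if t >= threshold:
            return m
    return len(percentages)


t = cumulative_percentages(eigenvalues)
ratios = ratios_to_mean(eigenvalues, P_ASSUMED)
diffs = successive_differences(eigenvalues)

m_cumulative = smallest_m_reaching(t, 90.0)    # cumulative percentage rule
m_cutoff = sum(1 for r in ratios if r > 0.7)   # cut-off rule with l* = 0.7

print(m_cumulative, m_cutoff)
print([round(d, 2) for d in diffs])
```

Under these assumptions both rules select a single component, and the successive differences show one very large drop followed by much smaller ones, which is why the three measures agree so readily for the covariance matrix.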


                              6.2.2 Gas Chromatography Data

                              These data, which were originally presented by McReynolds (1970), and
                              which have been analysed by Wold (1978) and by Eastment and
                              Krzanowski (1982), are concerned with gas chromatography retention indices. After
                              removal of a number of apparent outliers and an observation with a missing
                              value, there remain 212 (Eastment and Krzanowski) or 213 (Wold)
                              measurements on ten variables. Wold (1978) claims that his method indicates