Page 152 - Jolliffe I. Principal Component Analysis

6.1. How Many Principal Components?
(1982), the number of terms in the estimate for $X$, corresponding to the
number of PCs, is successively taken as 1, 2, ..., and so on, until overall
prediction of the $x_{ij}$ is no longer significantly improved by the addition of
extra terms (PCs). The number of PCs to be retained, $m$, is then taken to
be the minimum number necessary for adequate prediction.
  Using the SVD, $x_{ij}$ can be written, as in equations (3.5.2), (5.3.3),

$$x_{ij} = \sum_{k=1}^{r} u_{ik}\, l_k^{1/2}\, a_{jk},$$

where $r$ is the rank of $X$. (Recall that, in this context, $l_k$, $k = 1, 2, \ldots, p$,
are eigenvalues of $X'X$, rather than of $S$.)
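
As a numerical check of this representation, the following NumPy sketch rebuilds a small matrix element by element from its SVD and confirms that the squared singular values are the eigenvalues of $X'X$. The random data, the column-centring step and all variable names are illustrative assumptions and do not come from the text.

```python
import numpy as np

# Illustrative data (an assumption): 8 observations on 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
X = X - X.mean(axis=0)            # column-centre, as is usual before PCA

# Thin SVD: X = U diag(s) A', where s_k plays the role of l_k^{1/2}.
U, s, At = np.linalg.svd(X, full_matrices=False)
A = At.T
l = s ** 2                        # eigenvalues of X'X in this context

# Element-wise form: x_ij = sum_k u_ik * l_k^{1/2} * a_jk.
X_rebuilt = np.einsum("ik,k,jk->ij", U, np.sqrt(l), A)

# Eigenvalues of X'X computed directly, sorted in decreasing order.
eig_XtX = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]

print(np.allclose(X_rebuilt, X), np.allclose(l, eig_XtX))
```
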

  An estimate of $x_{ij}$, based on the first $m$ PCs and using all the data, is

$${}_m\tilde{x}_{ij} = \sum_{k=1}^{m} u_{ik}\, l_k^{1/2}\, a_{jk}, \tag{6.1.1}$$

but what is required is an estimate based on a subset of the data that does
not include $x_{ij}$. This estimate is written

$${}_m\hat{x}_{ij} = \sum_{k=1}^{m} \hat{u}_{ik}\, \hat{l}_k^{1/2}\, \hat{a}_{jk}, \tag{6.1.2}$$

where $\hat{u}_{ik}$, $\hat{l}_k$, $\hat{a}_{jk}$ are calculated from suitable subsets of the data. The sum
of squared differences between predicted and observed $x_{ij}$ is then

$$\mathrm{PRESS}(m) = \sum_{i=1}^{n} \sum_{j=1}^{p} \left({}_m\hat{x}_{ij} - x_{ij}\right)^2. \tag{6.1.3}$$
The notation PRESS stands for PREdiction Sum of Squares, and is taken
from the similar concept in regression, due to Allen (1974). All of the above
is essentially common to both Wold (1978) and Eastment and Krzanowski
(1982); they differ in how a subset is chosen for predicting $x_{ij}$, and in how
(6.1.3) is used for deciding on $m$.
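
To see why the all-data estimate (6.1.1) cannot be used directly in a criterion of this kind, the sketch below computes its sum of squared errors for successive $m$. That in-sample error equals the sum of the discarded eigenvalues $l_{m+1} + \cdots + l_r$, so it can only decrease as $m$ grows and never signals a stopping point. The function names `rank_m_estimate` and `sum_sq_error` and the random data are hypothetical, chosen only for illustration; (6.1.3) would apply the same sum of squares to predictions formed with $x_{ij}$ left out.

```python
import numpy as np

def rank_m_estimate(X, m):
    """All-data estimate (6.1.1): first m terms of the SVD of X."""
    U, s, At = np.linalg.svd(X, full_matrices=False)
    return (U[:, :m] * s[:m]) @ At[:m, :]

def sum_sq_error(X_pred, X):
    """Sum over all i, j of squared differences between X_pred and X."""
    return float(np.sum((X_pred - X) ** 2))

# Illustrative data (an assumption): 10 observations on 5 variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 5))
X -= X.mean(axis=0)

l = np.linalg.svd(X, compute_uv=False) ** 2    # eigenvalues of X'X

# In-sample error for m = 1, ..., 5; compare with the discarded eigenvalues.
in_sample = [sum_sq_error(rank_m_estimate(X, m), X) for m in range(1, 6)]
for m, err in enumerate(in_sample, start=1):
    print(m, round(err, 6), round(float(l[m:].sum()), 6))
```
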
  Eastment and Krzanowski (1982) use an estimate $\hat{a}_{jk}$ in (6.1.2) based on
the data set with just the $i$th observation $x_i$ deleted. $\hat{u}_{ik}$ is calculated with
only the $j$th variable deleted, and $\hat{l}_k$ combines information from the two
cases with the $i$th observation and the $j$th variable deleted, respectively.
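
A minimal sketch of a deletion scheme of this kind follows. The text does not say how the two sets of eigenvalues are combined, so the code assumes $\hat{l}_k^{1/2}$ is the geometric mean of the $k$th singular values from the row-deleted and column-deleted matrices. It also assumes that signs of the singular vectors are aligned with those of the full-data SVD and that the data are centred once, without recentring after deletion. The names `ek_press` and `_align` and the simulated data are hypothetical.

```python
import numpy as np

def _align(V, V_ref):
    """Flip column signs of V so each column has a non-negative inner product with V_ref."""
    signs = np.sign(np.sum(V * V_ref, axis=0))
    signs[signs == 0] = 1.0
    return V * signs

def ek_press(X, m):
    """PRESS(m) under the assumed row/column deletion scheme described above."""
    n, p = X.shape
    if not 1 <= m <= min(n - 1, p - 1):
        raise ValueError("m must satisfy 1 <= m <= min(n - 1, p - 1)")
    U, _, At = np.linalg.svd(X, full_matrices=False)
    A = At.T

    # u-hat and singular values from X with the jth variable (column) deleted.
    col_U, col_s = [], []
    for j in range(p):
        Uj, sj, _ = np.linalg.svd(np.delete(X, j, axis=1), full_matrices=False)
        col_U.append(_align(Uj[:, :m], U[:, :m]))
        col_s.append(sj[:m])

    total = 0.0
    for i in range(n):
        # a-hat and singular values from X with the ith observation (row) deleted.
        _, si, Ati = np.linalg.svd(np.delete(X, i, axis=0), full_matrices=False)
        Ai = _align(Ati.T[:, :m], A[:, :m])
        for j in range(p):
            l_half = np.sqrt(col_s[j] * si[:m])     # assumed combination rule
            pred = np.sum(col_U[j][i] * l_half * Ai[j])
            total += (pred - X[i, j]) ** 2
    return float(total)

# Simulated data with two strong components plus small noise (an assumption).
rng = np.random.default_rng(2)
scores = rng.normal(size=(15, 2)) * np.array([8.0, 2.0])
loadings, _ = np.linalg.qr(rng.normal(size=(5, 2)))   # orthonormal columns
X = scores @ loadings.T + 0.1 * rng.normal(size=(15, 5))
X -= X.mean(axis=0)

press_values = {m: ek_press(X, m) for m in (1, 2, 3)}
print(press_values)
```

Because one variable is always deleted when computing $\hat{u}_{ik}$, this sketch can evaluate at most $p - 1$ components.
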
Wold (1978), on the other hand, divides the data into $g$ blocks, where he
recommends that $g$ should be between four and seven and must not be a
divisor of $p$, and that no block should contain the majority of the elements
in any row or column of $X$. Quantities equivalent to $\hat{u}_{ik}$, $\hat{l}_k$ and $\hat{a}_{jk}$ are
calculated $g$ times, once with each block of data deleted, and the estimates
formed with the $h$th block deleted are then used to predict the data in the
$h$th block, $h = 1, 2, \ldots, g$.
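
The restriction that $g$ must not divide $p$ can be illustrated with a small sketch. It assumes, purely for illustration, that the elements of $X$ are numbered row by row and assigned to blocks cyclically; the text does not specify the deletion pattern, and the function names `cyclic_blocks` and `max_share` are hypothetical. Under that assumption, a $g$ that divides $p$ places every column entirely in one block, which violates the recommendation about rows and columns.

```python
import numpy as np

def cyclic_blocks(n, p, g):
    """Assign element (i, j) to block (i * p + j) mod g (assumed pattern)."""
    return np.arange(n * p).reshape(n, p) % g

def max_share(blocks, g):
    """Largest fraction of any single row or column that falls in one block."""
    worst = 0.0
    for h in range(g):
        mask = blocks == h
        worst = max(worst, mask.mean(axis=1).max(), mask.mean(axis=0).max())
    return float(worst)

n, p = 10, 5
good = cyclic_blocks(n, p, 4)     # 4 does not divide p = 5
bad = cyclic_blocks(n, p, 5)      # 5 divides p = 5

print(max_share(good, 4))         # no block holds a majority of any row or column
print(max_share(bad, 5))          # each column lies wholly in one block
```
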
  With respect to the choice of $m$, Wold (1978) and Eastment and Krzanowski
(1982) each use a (different) function of PRESS($m$) as a criterion