Page 144 - Jolliffe I. Principal Component Analysis

6.1. How Many Principal Components?
                              that is, the sum of the variances of the PCs is equal to the sum of the
                              variances of the elements of x. The obvious definition of ‘percentage of
                              variation accounted for by the first m PCs’ is therefore
                              $$t_m = 100 \sum_{k=1}^{m} l_k \Big/ \sum_{j=1}^{p} s_{jj} = 100 \sum_{k=1}^{m} l_k \Big/ \sum_{k=1}^{p} l_k,$$
                              which reduces to
                                                            100
                                                                 m
                                                       t m =
                                                             p     l k
                                                                k=1
                              in the case of a correlation matrix.
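The two forms of the definition can be illustrated with a minimal sketch. The function name `percent_variation` and the eigenvalues below are hypothetical, chosen only for illustration; the eigenvalues are assumed to be supplied already sorted in decreasing order.

```python
from itertools import accumulate


def percent_variation(eigenvalues):
    """Return [t_1, ..., t_p]: cumulative percentage of total variation.

    `eigenvalues` is assumed to hold l_1 >= ... >= l_p, the PC variances.
    """
    total = sum(eigenvalues)  # equals the sum of the s_jj (the trace of S)
    return [100.0 * c / total for c in accumulate(eigenvalues)]


# Hypothetical eigenvalues of a 4 x 4 correlation matrix (they sum to p = 4).
l = [2.0, 1.0, 0.6, 0.4]
p = len(l)

t = percent_variation(l)
# Correlation-matrix shortcut: t_m = (100 / p) * (l_1 + ... + l_m).
t_corr = [100.0 / p * c for c in accumulate(l)]

print([round(v, 1) for v in t])       # [50.0, 75.0, 90.0, 100.0]
print([round(v, 1) for v in t_corr])  # [50.0, 75.0, 90.0, 100.0]
```

Both lists agree because the eigenvalues of a correlation matrix sum to p, so the denominator in the general formula is simply p.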
                                Choosing a cut-off $t^*$ somewhere between 70% and 90% and retaining m
                              PCs, where m is the smallest integer for which $t_m > t^*$, provides a rule
                              which in practice preserves in the first m PCs most of the information in
                              x. The best value for $t^*$ will generally become smaller as p increases, or
                              as n, the number of observations, increases. Although a sensible cutoff is
                              very often in the range 70% to 90%, it can sometimes be higher or lower
                              depending on the practical details of a particular data set. For example,
                              a value greater than 90% will be appropriate when one or two PCs repre-
                              sent very dominant and rather obvious sources of variation. Here the less
                              obvious structures beyond these could be of interest, and to find them a
                              cut-off higher than 90% may be necessary. Conversely, when p is very large
                              choosing m corresponding to 70% may give an impractically large value of
                              m for further analyses. In such cases the threshold should be set somewhat
                              lower.
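The cut-off rule can be sketched as a short function. The name `choose_m` and the eigenvalues are hypothetical and serve only as an illustration; the threshold is passed as a percentage, and the eigenvalues are assumed to be sorted in decreasing order.

```python
from itertools import accumulate


def choose_m(eigenvalues, t_star):
    """Return the smallest m for which t_m > t_star (t_star in percent)."""
    total = sum(eigenvalues)
    for m, cum in enumerate(accumulate(eigenvalues), start=1):
        if 100.0 * cum / total > t_star:
            return m
    # No t_m strictly exceeds t_star (e.g. t_star = 100): keep all p PCs.
    return len(eigenvalues)


# Hypothetical eigenvalues of a covariance matrix with p = 6.
# The cumulative percentages are roughly 50, 75, 88.3, 95, 98.3, 100.
l = [3.0, 1.5, 0.8, 0.4, 0.2, 0.1]

print(choose_m(l, 70))  # 2
print(choose_m(l, 80))  # 3
print(choose_m(l, 90))  # 4
```

Moving the threshold across the 70% to 90% range changes m from 2 to 4 here, which shows how sensitive the retained dimension can be to the choice of cut-off.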
                                Using the rule is, in a sense, equivalent to looking at the spectral de-
                              composition of the covariance (or correlation) matrix S (see Property A3
                              of Sections 2.1, 3.1), or the SVD of the data matrix X (see Section 3.5). In
                              either case, deciding how many terms to include in the decomposition in
                              order to get a good fit to S or X respectively is closely related to looking
                              at $t_m$, because an appropriate measure of lack-of-fit of the first m terms in
                              either decomposition is $\sum_{k=m+1}^{p} l_k$. This follows because
                              $$\sum_{i=1}^{n} \sum_{j=1}^{p} ({}_m\tilde{x}_{ij} - x_{ij})^2 = (n-1) \sum_{k=m+1}^{p} l_k,$$

                              (Gabriel, 1978) and $\|{}_mS - S\| = \sum_{k=m+1}^{p} l_k$ (see the discussion of Property
                              G4 in Section 3.2), where ${}_m\tilde{x}_{ij}$ is the rank m approximation to $x_{ij}$ based
                              on the SVD as given in equation (3.5.3), and ${}_mS$ is the sum of the first m
                              terms of the spectral decomposition of S.
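The sum-of-squares identity attributed to Gabriel (1978) can be checked directly from the SVD. The following is a sketch rather than the book's own derivation. It assumes that X is column-centred, so that X'X = (n - 1)S, and it uses $\sigma_k$, $u_k$ and $a_k$ (singular values, left and right singular vectors of X) as notation introduced here for illustration only.

```latex
\begin{align*}
X &= \sum_{k=1}^{p} \sigma_k u_k a_k', \qquad \sigma_k^2 = (n-1)\, l_k, \\
{}_m\tilde{X} &= \sum_{k=1}^{m} \sigma_k u_k a_k', \\
\sum_{i=1}^{n} \sum_{j=1}^{p} ({}_m\tilde{x}_{ij} - x_{ij})^2
  &= \Bigl\| \sum_{k=m+1}^{p} \sigma_k u_k a_k' \Bigr\|_F^2
   = \sum_{k=m+1}^{p} \sigma_k^2
   = (n-1) \sum_{k=m+1}^{p} l_k .
\end{align*}
```

The middle equality uses the orthonormality of the $u_k$ and of the $a_k$, which makes all cross terms vanish in the squared Frobenius norm $\|\cdot\|_F^2$ (the sum of squared elements).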
                                A number of attempts have been made to find the distribution of $t_m$,
                              and hence to produce a formal procedure for choosing m, based on $t_m$.
                              Mandel (1972) presents some expected values for $t_m$ for the case where all
                              variables are independent, normally distributed, and have the same vari-