Page 47 - Jolliffe I. Principal Component Analysis

2. Properties of Population Principal Components
                                 There is therefore a non-zero vector $\alpha$ in $S_k$ of the form $\alpha = B\gamma$ for a
                               $\gamma$ in $T_k$, and it follows that
\[
\mu_k \;\le\; \frac{\gamma' B' \Sigma B \gamma}{\gamma' \gamma} \;=\; \frac{\gamma' B' \Sigma B \gamma}{\gamma' B' B \gamma} \;=\; \frac{\alpha' \Sigma \alpha}{\alpha' \alpha} \;\le\; \lambda_k .
\]



                               Thus the $k$th eigenvalue of $B'\Sigma B$ is no greater than the $k$th eigenvalue of $\Sigma$ for $k = 1, \ldots, q$.

                               This means that
\[
\det(\Sigma_y) \;=\; \prod_{k=1}^{q} \bigl(k\text{th eigenvalue of } B'\Sigma B\bigr) \;\le\; \prod_{k=1}^{q} \lambda_k .
\]
                               But if $B = A_q$, then the eigenvalues of $B'\Sigma B$ are
\[
\lambda_1, \lambda_2, \ldots, \lambda_q , \qquad \text{so that} \qquad \det(\Sigma_y) \;=\; \prod_{k=1}^{q} \lambda_k
\]
                               in this case, and therefore $\det(\Sigma_y)$ is maximized when $B = A_q$.
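The maximization just derived is easy to check numerically. The following sketch (not from the text; all variable names are illustrative) draws a random covariance matrix $\Sigma$, takes $A_q$ to be the matrix of its first $q$ eigenvectors, and confirms that $\det(B'\Sigma B)$ for random orthonormal-column $B$ never exceeds $\det(A_q'\Sigma A_q) = \prod_{k=1}^{q}\lambda_k$:

```python
# Numerical check of Property A4: det(B' Sigma B) over matrices B with
# orthonormal columns is maximized at B = A_q (first q eigenvectors of Sigma).
import numpy as np

rng = np.random.default_rng(0)
p, q = 6, 3

# Random symmetric positive definite covariance matrix Sigma.
M = rng.standard_normal((p, p))
Sigma = M @ M.T

# eigh returns eigenvalues in ascending order; reverse for descending.
eigvals, eigvecs = np.linalg.eigh(Sigma)
A_q = eigvecs[:, ::-1][:, :q]                  # first q eigenvectors
det_opt = np.linalg.det(A_q.T @ Sigma @ A_q)

# det_opt equals the product of the q largest eigenvalues of Sigma.
assert np.isclose(det_opt, np.prod(eigvals[::-1][:q]))

# No random orthonormal-column B does better (Q factor of a random matrix).
for _ in range(1000):
    B, _ = np.linalg.qr(rng.standard_normal((p, q)))
    assert np.linalg.det(B.T @ Sigma @ B) <= det_opt + 1e-9
```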
                               The result can be extended to the case where the columns of $B$ are not
                               necessarily orthonormal, but the diagonal elements of $B'B$ are unity (see
                               Okamoto (1969)). A stronger, stepwise version of Property A4 is discussed
                              by O’Hagan (1984), who argues that it provides an alternative derivation of
                              PCs, and that this derivation can be helpful in motivating the use of PCA.
                              O’Hagan’s derivation is, in fact, equivalent to (though a stepwise version
                              of) Property A5, which is discussed next.
                                Note that Property A1 could also have been proved using similar reason-
                              ing to that just employed for Property A4, but some of the intermediate
                               results derived during the earlier proof of A1 are useful elsewhere in the
                              chapter.
                                The statistical importance of the present result follows because the de-
                              terminant of a covariance matrix, which is called the generalized variance,
                              can be used as a single measure of spread for a multivariate random vari-
                              able (Press, 1972, p. 108). The square root of the generalized variance,
                              for a multivariate normal distribution is proportional to the ‘volume’ in
                              p-dimensional space that encloses a fixed proportion of the probability dis-
                              tribution of x. For multivariate normal x, the first q PCs are, therefore, as
                              a consequence of Property A4, q linear functions of x whose joint probabil-
                              ity distribution has contours of fixed probability enclosing the maximum
                              volume.
                               Property A5.    Suppose that we wish to predict each random variable $x_j$
                               in $x$ by a linear function of $y$, where $y = B'x$, as before. If $\sigma_j^2$ is the residual
                               variance in predicting $x_j$ from $y$, then $\sum_{j=1}^{p} \sigma_j^2$ is minimized if $B = A_q$.
                                The statistical implication of this result is that if we wish to get the best
                              linear predictor of x in a q-dimensional subspace, in the sense of minimizing
                              the sum over elements of x of the residual variances, then this optimal
                              subspace is defined by the first q PCs.
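Property A5 can likewise be verified numerically. In this sketch (again illustrative, not from the text), the residual covariance of $x$ after best linear prediction from $y = B'x$ is $\Sigma - \Sigma B(B'\Sigma B)^{-1}B'\Sigma$, whose trace is the sum of residual variances; taking $B = A_q$ attains the minimum, $\sum_{k=q+1}^{p}\lambda_k$:

```python
# Numerical check of Property A5: the sum of residual variances in linearly
# predicting x from y = B'x is minimized when B = A_q.
import numpy as np

rng = np.random.default_rng(1)
p, q = 6, 3

M = rng.standard_normal((p, p))
Sigma = M @ M.T                                # population covariance of x

def residual_variance_sum(B):
    # trace of Sigma - Sigma B (B' Sigma B)^{-1} B' Sigma, the residual
    # covariance after the best linear prediction of x from y = B'x.
    S_by = Sigma @ B
    return np.trace(Sigma - S_by @ np.linalg.solve(B.T @ Sigma @ B, S_by.T))

eigvals, eigvecs = np.linalg.eigh(Sigma)       # ascending eigenvalues
A_q = eigvecs[:, ::-1][:, :q]                  # first q eigenvectors of Sigma
best = residual_variance_sum(A_q)

# The minimum equals the sum of the p - q smallest eigenvalues of Sigma.
assert np.isclose(best, eigvals[::-1][q:].sum())

# Random orthonormal-column B never does better.
for _ in range(1000):
    B, _ = np.linalg.qr(rng.standard_normal((p, q)))
    assert best <= residual_variance_sum(B) + 1e-9
```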