Page 67 - Jolliffe I. Principal Component Analysis

3. Properties of Sample Principal Components
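\[
\begin{aligned}
\sum_{i=1}^{n} \mathbf{x}_i' \mathbf{B}\mathbf{B}' \mathbf{x}_i
  &= \operatorname{tr} \sum_{i=1}^{n} \left( \mathbf{x}_i' \mathbf{B}\mathbf{B}' \mathbf{x}_i \right) \\
  &= \sum_{i=1}^{n} \operatorname{tr}\left( \mathbf{x}_i' \mathbf{B}\mathbf{B}' \mathbf{x}_i \right) \\
  &= \sum_{i=1}^{n} \operatorname{tr}\left( \mathbf{B}' \mathbf{x}_i \mathbf{x}_i' \mathbf{B} \right) \\
  &= \operatorname{tr}\left[ \mathbf{B}' \left( \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i' \right) \mathbf{B} \right] \\
  &= \operatorname{tr}\left[ \mathbf{B}' \mathbf{X}' \mathbf{X} \mathbf{B} \right] \\
  &= (n - 1) \operatorname{tr}\left( \mathbf{B}' \mathbf{S} \mathbf{B} \right).
\end{aligned}
\]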

Finally, from Property A1, $\operatorname{tr}(\mathbf{B}'\mathbf{S}\mathbf{B})$ is maximized when $\mathbf{B} = \mathbf{A}_q$.
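As a numerical check on the chain of equalities above, the following is a minimal sketch for the simplest case p = 2, q = 1, so that B is a single unit vector. The data values and the names `projected_ss`, `trace_form` and `theta_pc1` are assumptions made for illustration and do not come from the text. The direction of the first eigenvector of S is obtained from the standard closed form for a symmetric 2 x 2 matrix, which avoids any dependence on a linear algebra library.

```python
import math

# Hypothetical toy data: n = 5 observations on p = 2 variables.
data = [(2.0, 1.0), (3.0, 4.0), (5.0, 4.5), (6.0, 7.0), (9.0, 8.5)]
n = len(data)

# Column-centre the data, so that X'X = (n - 1) S.
means = [sum(row[j] for row in data) / n for j in range(2)]
X = [(row[0] - means[0], row[1] - means[1]) for row in data]

# Elements of the sample covariance matrix S (divisor n - 1).
s11 = sum(x[0] * x[0] for x in X) / (n - 1)
s22 = sum(x[1] * x[1] for x in X) / (n - 1)
s12 = sum(x[0] * x[1] for x in X) / (n - 1)


def projected_ss(theta):
    """Sum over i of x_i' B B' x_i for the unit vector B = (cos theta, sin theta)'."""
    b = (math.cos(theta), math.sin(theta))
    return sum((x[0] * b[0] + x[1] * b[1]) ** 2 for x in X)


def trace_form(theta):
    """(n - 1) tr(B' S B) for the same B; the trace is a scalar when q = 1."""
    b = (math.cos(theta), math.sin(theta))
    quad = b[0] * b[0] * s11 + 2.0 * b[0] * b[1] * s12 + b[1] * b[1] * s22
    return (n - 1) * quad


# Angle of the first eigenvector of the symmetric 2 x 2 matrix S.
theta_pc1 = 0.5 * math.atan2(2.0 * s12, s11 - s22)

# The two expressions agree for any direction, not only the optimal one.
for t in (0.0, 0.3, 1.2, theta_pc1):
    assert math.isclose(projected_ss(t), trace_form(t), rel_tol=1e-9, abs_tol=1e-9)

# Scan directions in [0, pi): none should beat the first eigenvector.
grid = [k * math.pi / 1800 for k in range(1800)]
best_on_grid = max(projected_ss(t) for t in grid)

print("projected sum of squares at PC1 direction:", projected_ss(theta_pc1))
print("best value found on the grid:             ", best_on_grid)
```

The grid scan is only a crude stand-in for Property A1, which establishes the maximum for general p and q; here it merely illustrates that no other direction in the plane gives a larger sum of squared projected lengths.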
Instead of treating this property (G3) as just another property of sample PCs, it can also be viewed as an alternative derivation of the PCs. Rather than adapting for samples the algebraic definition of population PCs given in Chapter 1, there is an alternative geometric definition of sample PCs. They are defined as the linear functions (projections) of $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ that successively define subspaces of dimension $1, 2, \ldots, q, \ldots, (p - 1)$ for which the sum of squared perpendicular distances of $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ from the subspace is minimized. This definition provides another way in which PCs can be interpreted as accounting for as much as possible of the total variation in the data, within a lower-dimensional space. In fact, this is essentially the approach adopted by Pearson (1901), although he concentrated on the two special cases, where $q = 1$ and $q = (p - 1)$. Given a set of points in $p$-dimensional space, Pearson found the ‘best-fitting line,’ and the ‘best-fitting hyperplane,’ in the sense of minimizing the sum of squared deviations of the points from the line or hyperplane. The best-fitting line determines the first principal component, although Pearson did not use this terminology, and the direction of the last PC is orthogonal to the best-fitting hyperplane. The scores for the last PC are simply the perpendicular distances of the observations from this best-fitting hyperplane.
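The geometric definition can be illustrated with a small sketch in two dimensions, where the best-fitting line and the best-fitting hyperplane coincide because q = 1 = p - 1. The data and the names `perp_ss`, `theta_fit`, `last_pc_scores` and `perp_dist` are hypothetical, chosen only for this example. The line is taken to pass through the centre of gravity of the points, and perpendicular distances are computed by Pythagoras' theorem so that they are obtained independently of the second PC.

```python
import math

# Hypothetical toy data: n = 6 points in p = 2 dimensions.
points = [(1.0, 2.0), (2.0, 2.5), (3.0, 4.5), (4.0, 4.0), (5.0, 6.5), (6.0, 6.0)]
n = len(points)

# Centre the points at their centre of gravity.
cx = sum(pt[0] for pt in points) / n
cy = sum(pt[1] for pt in points) / n
X = [(pt[0] - cx, pt[1] - cy) for pt in points]


def perp_ss(theta):
    """Sum of squared perpendicular distances of the points from the line
    through the centre of gravity with direction (cos theta, sin theta)."""
    u = (math.cos(theta), math.sin(theta))
    total = 0.0
    for x in X:
        along = x[0] * u[0] + x[1] * u[1]            # projection onto the line
        total += x[0] ** 2 + x[1] ** 2 - along ** 2  # Pythagoras
    return total


# Direction of the first PC from the closed form for a symmetric 2 x 2 matrix.
sxx = sum(x[0] ** 2 for x in X)
syy = sum(x[1] ** 2 for x in X)
sxy = sum(x[0] * x[1] for x in X)
theta_fit = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
u1 = (math.cos(theta_fit), math.sin(theta_fit))  # first PC direction
u2 = (-u1[1], u1[0])                             # last PC direction, orthogonal to u1

# A scan over directions in [0, pi) should not find a better-fitting line.
grid = [k * math.pi / 1800 for k in range(1800)]
grid_min = min(perp_ss(t) for t in grid)

# Scores on the last PC, and perpendicular distances from the fitted line.
last_pc_scores = [x[0] * u2[0] + x[1] * u2[1] for x in X]
perp_dist = [
    math.sqrt(max(0.0, x[0] ** 2 + x[1] ** 2 - (x[0] * u1[0] + x[1] * u1[1]) ** 2))
    for x in X
]

print("perpendicular SS at PC1 direction:", perp_ss(theta_fit))
print("smallest value found on the grid: ", grid_min)
for score, dist in zip(last_pc_scores, perp_dist):
    print(f"last-PC score {score: .6f}   perpendicular distance {dist:.6f}")
```

The scores are signed, so it is their absolute values that match the distances; the sign records on which side of the fitted line an observation lies.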
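Property G4. Let $\mathbf{X}$ be the $(n \times p)$ matrix whose $(i, j)$th element is $\tilde{x}_{ij} - \bar{x}_j$, and consider the matrix $\mathbf{X}\mathbf{X}'$. The $i$th diagonal element of $\mathbf{X}\mathbf{X}'$ is $\sum_{j=1}^{p} (\tilde{x}_{ij} - \bar{x}_j)^2$, which is the squared Euclidean distance of $\mathbf{x}_i$ from the centre of gravity $\bar{\mathbf{x}}$ of the points $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, where $\bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i$. Also, the $(h, i)$th element of $\mathbf{X}\mathbf{X}'$ is $\sum_{j=1}^{p} (\tilde{x}_{hj} - \bar{x}_j)(\tilde{x}_{ij} - \bar{x}_j)$, which measures the cosine of the angle between the lines joining $\mathbf{x}_h$ and $\mathbf{x}_i$ to $\bar{\mathbf{x}}$, multiplied by the distances of $\mathbf{x}_h$ and $\mathbf{x}_i$ from $\bar{\mathbf{x}}$. Thus $\mathbf{X}\mathbf{X}'$ contains information about the configuration of $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ relative to $\bar{\mathbf{x}}$. Now suppose that $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ are projected onto a $q$-dimensional subspace with the usual orthogonal transformation $\mathbf{y}_i = \mathbf{B}'\mathbf{x}_i$, $i = 1, 2, \ldots, n$. Then the transfor-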