Page 407 - Jolliffe I. Principal Component Analysis

13. Principal Component Analysis for Special Types of Data
                              there may be a large number of zeros in the data. If two variables x_j
                              and x_k simultaneously record zero for a non-trivial number of sites, the
                              calculation of covariance or correlation between this pair of variables is
                              likely to be distorted. Legendre and Legendre (1983, p. 285) argue that data
                              are better analysed by nonmetric multidimensional scaling (Cox and Cox,
                              2001) or with correspondence analysis (as in Section 5.4.1), rather than by
                              PCA, when there are many such ‘double zeros’ present. Even when such
                              zeros are not a problem, species abundance data often have highly skewed
                              distributions, and a transformation, for example taking logarithms, may be
                              advisable before PCA is contemplated.
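As a rough illustration of the two points above, the following sketch counts 'double zeros' for each pair of species and applies a logarithmic transformation. The abundance table `counts` and the helper `double_zero_fraction` are hypothetical, and the use of log(1 + x) rather than a plain logarithm is an assumption made here so that zero counts remain defined.

```python
import numpy as np

# Hypothetical abundance table: rows are sites, columns are species.
counts = np.array([
    [0, 0, 12, 3],
    [0, 0, 150, 0],
    [5, 0, 0, 1],
    [0, 0, 40, 0],
    [2, 1, 900, 0],
])

def double_zero_fraction(data):
    """Fraction of sites at which both species of each pair record zero."""
    zero = (data == 0).astype(float)
    # Entry (j, k) counts sites where species j and k are both zero,
    # divided by the number of sites.
    return zero.T @ zero / data.shape[0]

dz = double_zero_fraction(counts)

# log(1 + x) is an assumed choice of transformation; it reduces skewness
# while leaving zero counts at zero.
logged = np.log1p(counts)

print(np.round(dz, 2))
print(np.round(logged, 2))
```

A large off-diagonal entry of `dz` flags a pair of species whose covariance or correlation may be distorted by shared absences.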
                                Another unique aspect of species abundance data is an interest in the
                              diversity of species at the various sites. It has been argued that to exam-
                              ine diversity, it is more appropriate to use uncentred than column-centred
                              PCA. This is discussed further in Section 14.2.3, together with doubly
                              centred PCA, which has also found applications to species abundance data.
                              Large Data Sets
                              The problems of large data sets are different depending on whether the
                              number of observations n or the number of variables p is large, with the
                              latter typically causing greater difficulties than the former. With large n
                              there may be problems in viewing graphs because of superimposed observa-
                              tions, but it is the size of the covariance or correlation matrix that usually
                              determines computational limitations. However, if p > n it should be
                              remembered (Property G4 of Section 3.2) that the eigenvectors of X'X
                              corresponding to non-zero eigenvalues can be found from those of the smaller
                              matrix XX'.
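The relationship between the two eigenproblems can be sketched numerically. If u is an eigenvector of XX' with non-zero eigenvalue l, then X'u is an eigenvector of X'X with the same eigenvalue, and it has length sqrt(l). The function name `eigvecs_via_small_matrix`, the tolerance `tol` and the simulated data are illustrative assumptions, not part of the text.

```python
import numpy as np

def eigvecs_via_small_matrix(X, tol=1e-10):
    """Eigenpairs of X'X (p x p) obtained from the smaller matrix XX' (n x n)."""
    X = np.asarray(X, dtype=float)
    G = X @ X.T                         # n x n, cheap when n is much less than p
    evals, U = np.linalg.eigh(G)        # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]     # re-sort into descending order
    evals, U = evals[order], U[:, order]
    keep = evals > tol * evals[0]       # discard numerically zero eigenvalues
    evals, U = evals[keep], U[:, keep]
    # XX'u = l*u implies X'X(X'u) = l*(X'u); dividing by sqrt(l) gives unit length.
    A = X.T @ U / np.sqrt(evals)
    return evals, A

rng = np.random.default_rng(0)
n, p = 5, 50
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                     # column-centre, so the rank is at most n - 1

evals, A = eigvecs_via_small_matrix(X)
print(A.shape)
print(np.allclose(X.T @ X @ A, A * evals))
```

Only a 5 x 5 eigenproblem is solved here, although X'X is 50 x 50.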

                                For very large values of p, Preisendorfer and Mobley (1988, Chapter 11)
                              suggest splitting the variables into subsets of manageable size, performing
                              PCA on each subset, and then using the separate eigenanalyses to approx-
                              imate the eigenstructure of the original large data matrix. Developments
                              in computer architecture may soon allow very large problems to be tackled
                              much faster using neural network algorithms for PCA (see Appendix A1
                              and Diamantaras and Kung (1996, Chapter 8)).
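To give a flavour of what a neural-type algorithm for PCA looks like, the sketch below uses Oja's rule, a classic single-neuron learning rule that updates a weight vector one observation at a time and tends towards the first principal component. It is offered only as one example of the kind; it is not necessarily the algorithm the cited references have in mind. The learning rate `eta`, the number of passes `epochs` and the simulated data are all assumptions.

```python
import numpy as np

def oja_first_component(X, eta=0.005, epochs=20, seed=0):
    """Estimate the first principal component of centred data X by Oja's rule."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = rng.standard_normal(p)
    w /= np.linalg.norm(w)
    for _ in range(epochs):
        for i in rng.permutation(n):    # visit observations in random order
            x = X[i]
            y = w @ x                   # output of the single linear neuron
            w += eta * y * (x - y * w)  # Hebbian term with a decay that limits the norm of w
    return w / np.linalg.norm(w)

# Simulated data with one dominant direction of variation (assumed example).
rng = np.random.default_rng(1)
variances = np.array([5.0, 1.0, 0.5, 0.3, 0.1])
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # random orthogonal rotation
X = (rng.standard_normal((2000, 5)) * np.sqrt(variances)) @ Q.T
X -= X.mean(axis=0)

w = oja_first_component(X)

# Compare with the leading eigenvector of the sample covariance matrix.
evals, evecs = np.linalg.eigh(X.T @ X / (len(X) - 1))
cosine = abs(w @ evecs[:, -1])
print(round(float(cosine), 3))
```

The update never forms the p x p covariance matrix, which is part of the appeal of such methods when p is very large.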