Page 407 - Jolliffe I. Principal Component Analysis

13. Principal Component Analysis for Special Types of Data
there may be a large number of zeros in the data. If two variables x_j
and x_k simultaneously record zero for a non-trivial number of sites, the
calculation of covariance or correlation between this pair of variables is
likely to be distorted. Legendre and Legendre (1983, p. 285) argue that data
are better analysed by nonmetric multidimensional scaling (Cox and Cox,
2001) or with correspondence analysis (as in Section 5.4.1), rather than by
PCA, when there are many such ‘double zeros’ present. Even when such
zeros are not a problem, species abundance data often have highly skewed
distributions, and a transformation, for example taking logarithms, may be
advisable before PCA is contemplated.
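
As a rough illustration of the transformation step, the following Python sketch applies a logarithmic transformation to skewed count data before a covariance-based PCA. The data are simulated and purely hypothetical, the helper name pca_eigenvalues is invented for this example, and log(1 + x) is used instead of log(x) only as an assumption, so that the zeros typical of abundance data stay defined.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical abundance counts: 30 sites (rows) by 5 species (columns).
# The negative binomial gives skewed counts with many zeros.
counts = rng.negative_binomial(n=0.5, p=0.05, size=(30, 5)).astype(float)

# Assumption: log(1 + x) rather than log(x), because log(0) is undefined.
logged = np.log1p(counts)


def pca_eigenvalues(data):
    """Eigenvalues of the sample covariance matrix, largest first."""
    centred = data - data.mean(axis=0)
    cov = centred.T @ centred / (data.shape[0] - 1)
    return np.linalg.eigvalsh(cov)[::-1]


raw_eigs = pca_eigenvalues(counts)
log_eigs = pca_eigenvalues(logged)

# Proportion of total variance carried by the first component in each case.
print(raw_eigs[0] / raw_eigs.sum(), log_eigs[0] / log_eigs.sum())
```

Comparing the two proportions gives a quick impression of how strongly a few extreme counts dominate the untransformed analysis. Whether the transformation is appropriate remains a judgement about the data at hand.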
Another unique aspect of species abundance data is an interest in the
diversity of species at the various sites. It has been argued that to
examine diversity, it is more appropriate to use uncentred than column-centred
PCA. This is discussed further in Section 14.2.3, together with doubly
centred PCA, which has also found applications to species abundance data.
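
The three centring options mentioned here can be sketched in a few lines of Python. This is only a minimal sketch: the function names and the simulated matrix X (sites as rows, species as columns) are assumptions made for the example, and the details of each variant are those of Section 14.2.3, not of this code.

```python
import numpy as np


def uncentred(X):
    """No centring: the analysis is based on X itself."""
    return X.copy()


def column_centred(X):
    """Subtract each variable's (column's) mean, as in standard PCA."""
    return X - X.mean(axis=0, keepdims=True)


def doubly_centred(X):
    """Subtract row means and column means, then add back the overall mean."""
    return (X
            - X.mean(axis=0, keepdims=True)
            - X.mean(axis=1, keepdims=True)
            + X.mean())


def pc_directions(Z):
    """Eigenvalues and eigenvectors of Z'Z, obtained from the SVD of Z."""
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    return s ** 2, vt.T


rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(8, 4)).astype(float)  # hypothetical counts

for transform in (uncentred, column_centred, doubly_centred):
    eigenvalues, vectors = pc_directions(transform(X))
    print(transform.__name__, np.round(eigenvalues, 3))
```

After double centring every row and every column of the matrix has zero mean, so at least one eigenvalue of Z'Z is zero.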
Large Data Sets
The problems of large data sets are different depending on whether the
number of observations n or the number of variables p is large, with the
latter typically causing greater difficulties than the former. With large n
there may be problems in viewing graphs because of superimposed
observations, but it is the size of the covariance or correlation matrix that usually
determines computational limitations. However, if p > n it should be
remembered (Property G4 of Section 3.2) that the eigenvectors of X'X
corresponding to non-zero eigenvalues can be found from those of the smaller
matrix XX'.
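
The relationship between the two eigenproblems can be checked numerically. If u is a unit eigenvector of the (n x n) matrix XX' with eigenvalue l > 0, then X'u / sqrt(l) is a unit eigenvector of the (p x p) matrix X'X with the same eigenvalue. The Python sketch below illustrates this on simulated data; the dimensions, the random matrix and the tolerance used to discard zero eigenvalues are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 50                      # far more variables than observations
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)               # column-centre; rank is now at most n - 1

# Eigenanalysis of the small (n x n) matrix XX'.
small = X @ X.T
lam, U = np.linalg.eigh(small)    # eigh returns eigenvalues in ascending order
order = np.argsort(lam)[::-1]
lam, U = lam[order], U[:, order]

# Keep only the non-zero eigenvalues (the tolerance is an arbitrary choice).
keep = lam > 1e-10 * lam[0]
lam, U = lam[keep], U[:, keep]

# Map each eigenvector u of XX' to the eigenvector X'u / sqrt(l) of X'X.
V = X.T @ U / np.sqrt(lam)        # (p x r) matrix with unit-length columns

# Check against the large (p x p) matrix, formed here only for verification.
big = X.T @ X
print(np.allclose(big @ V, V * lam))
```

In practice the (p x p) matrix would never be formed. It appears here only to confirm that the columns of V satisfy the eigen-equation.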

For very large values of p, Preisendorfer and Mobley (1988, Chapter 11)
suggest splitting the variables into subsets of manageable size, performing
PCA on each subset, and then using the separate eigenanalyses to
approximate the eigenstructure of the original large data matrix. Developments
in computer architecture may soon allow very large problems to be tackled
much faster using neural network algorithms for PCA (see Appendix A1
and Diamantaras and Kung (1996, Chapter 8)).
