Page 424 - Jolliffe I. Principal Component Analysis

14.2. Weights, Metrics, Transformations and Centerings
                                Standardization, in the sense of dividing each column of the data matrix
                              by its standard deviation, leads to PCA based on the correlation matrix,
                              and its pros and cons are discussed in Sections 2.3 and 3.3. This can be
                              thought of as a version of weighted PCA (Section 14.2.1). So, also, can dividing
                              each column by its range or its mean (Gower, 1966), in the latter case
                              giving a matrix of coefficients of variation. Underhill (1990) suggests a
                              biplot based on this matrix (see Section 5.3.2). Such plots are only relevant
                              when variables are non-negative, as with species abundance data.
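The two scalings above can be checked numerically. The following is a minimal sketch in Python with NumPy, not taken from the text; the simulated data and all variable names are assumptions made for illustration. It shows that dividing each column by its standard deviation yields a covariance matrix equal to the correlation matrix of the original data, and that dividing each column by its mean yields a covariance matrix whose diagonal holds the squared coefficients of variation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical non-negative data: 50 observations of 3 variables
# measured on very different scales (illustrative only).
X_raw = rng.gamma(shape=2.0, scale=[1.0, 10.0, 100.0], size=(50, 3))

# Scaling 1: divide each column by its standard deviation.
# The covariance matrix of the scaled data is the correlation
# matrix of the original data.
X_std = X_raw / X_raw.std(axis=0, ddof=1)
cov_of_scaled = np.cov(X_std, rowvar=False)
corr_of_raw = np.corrcoef(X_raw, rowvar=False)

# Scaling 2: divide each column by its mean.
# The diagonal of the resulting covariance matrix contains the
# squared coefficients of variation (sd / mean) of each variable.
X_cv = X_raw / X_raw.mean(axis=0)
cov_cv = np.cov(X_cv, rowvar=False)
cv = X_raw.std(axis=0, ddof=1) / X_raw.mean(axis=0)

print(np.allclose(cov_of_scaled, corr_of_raw))
print(np.allclose(np.diag(cov_cv), cv ** 2))
```

The division by column means is only sensible here because the simulated variables are strictly positive, in line with the restriction to non-negative variables noted above.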
                                Principal components are linear functions of x whose coefficients are
                              given by the eigenvectors of a covariance or correlation matrix or,
                              equivalently, the eigenvectors of a matrix X′X. Here X is an (n × p)
                              matrix whose (i, j)th element is the value for the ith observation of
                              the jth variable,
                              measured about the mean for that variable. Thus, the columns of X have
                              been centred, so that the sum of each column is zero, though Holmes-Junca
                              (1985) notes that centering by either medians or modes has been suggested
                              as an alternative to centering by means.
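The equivalence between the covariance matrix and the cross-product matrix of the column-centred data can be illustrated as follows. This is a sketch under assumptions: the data are simulated, the names are hypothetical, and the sample covariance matrix is taken with divisor n − 1. Because the two matrices differ only by that scalar factor, they share eigenvectors (up to sign), and their eigenvalues differ by the same factor.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 40 observations of 4 correlated variables.
X_raw = rng.normal(size=(40, 4)) @ rng.normal(size=(4, 4))
n = X_raw.shape[0]

# Column-centre: each variable is measured about its own mean,
# so every column of X sums to zero.
X = X_raw - X_raw.mean(axis=0)

# Sample covariance matrix (divisor n - 1) and cross-product matrix.
S = np.cov(X_raw, rowvar=False)
XtX = X.T @ X

# eigh returns eigenvalues in ascending order for symmetric matrices.
evals_s, evecs_s = np.linalg.eigh(S)
evals_xtx, evecs_xtx = np.linalg.eigh(XtX)

# Eigenvalues differ by the factor (n - 1); eigenvectors agree up to sign.
same_values = np.allclose(evals_xtx / (n - 1), evals_s)
same_vectors = np.allclose(np.abs(np.sum(evecs_xtx * evecs_s, axis=0)), 1.0)
print(same_values, same_vectors)
```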
                                Two alternatives to ‘column-centering’ are:
                              (i) the columns of X are left uncentred, that is x_ij is now the value for
                                 the ith observation of the jth variable, as originally measured;
                              (ii) both rows and columns of X are centred, so that sums of rows, as well
                                 as sums of columns, are zero.

                              In either (i) or (ii) the analysis now proceeds by looking at linear
                              functions of x whose coefficients are the eigenvectors of X′X, with X now
                              non-centred or doubly centred. Of course, these linear functions no longer
                              maximize variance, and so are not PCs according to the usual definition,
                              but it is convenient to refer to them as non-centred and doubly centred
                              PCs, respectively.
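The three centring options can be put side by side in a short sketch. The helper names `centre` and `components` and the simulated count data are hypothetical, introduced here only for illustration. Double centring is assumed to mean subtracting row means and column means and adding back the grand mean, which makes both row sums and column sums zero.

```python
import numpy as np

def centre(X_raw, mode="column"):
    """Return X centred according to mode: 'none', 'column' or 'double'."""
    X_raw = np.asarray(X_raw, dtype=float)
    if mode == "none":
        # Alternative (i): values as originally measured.
        return X_raw.copy()
    if mode == "column":
        # Usual PCA: each variable measured about its mean.
        return X_raw - X_raw.mean(axis=0)
    if mode == "double":
        # Alternative (ii): row sums and column sums are both zero.
        row_means = X_raw.mean(axis=1, keepdims=True)
        col_means = X_raw.mean(axis=0, keepdims=True)
        return X_raw - row_means - col_means + X_raw.mean()
    raise ValueError(f"unknown centring mode: {mode!r}")

def components(X):
    """Eigenvalues and eigenvectors of X'X, largest eigenvalue first."""
    evals, evecs = np.linalg.eigh(X.T @ X)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

# Hypothetical species-count data: 30 sites (rows), 5 species (columns).
rng = np.random.default_rng(2)
counts = rng.poisson(5.0, size=(30, 5)).astype(float)

for mode in ("none", "column", "double"):
    evals, evecs = components(centre(counts, mode))
    print(mode, np.round(evals / evals.sum(), 3))
```

One consequence of double centring is visible in this sketch: because every row of the doubly centred X sums to zero, X′X has a zero eigenvalue whose eigenvector is proportional to the vector of ones, so at most p − 1 components carry information.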
                                Non-centred PCA is a fairly well-established technique in ecology (Ter
                              Braak, 1983). It has also been used in chemistry (Jackson, 1991, Section
                              3.4; Cochran and Horne, 1977) and geology (Reyment and Jöreskog, 1993).
                              As noted by Ter Braak (1983), the technique projects observations onto
                              the best fitting plane (or flat) through the origin, rather than through the
                              centroid of the data set. If the data are such that the origin is an important
                              point of reference, then this type of analysis can be relevant. However, if the
                              centre of the observations is a long way from the origin, then the first ‘PC’
                              will dominate the analysis, and will simply reflect the position of the cen-
                              troid. For data that consist of counts of a number of biological species (the
                              variables) at various sites (the observations), Ter Braak (1983) claims that
                              non-centred PCA is better than standard (centred) PCA at simultaneously
                              representing within-site diversity and between-site diversity of species (see
                              also Digby and Kempton (1987, Section 3.5.5)). Centred PCA is better at
                              representing between-site species diversity than non-centred PCA, but it is
                              more difficult to deduce within-site diversity from a centred PCA.
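The remark that the first non-centred ‘PC’ simply reflects the position of the centroid when the data lie far from the origin can be seen in a small simulation. This is an illustrative sketch only: the centroid, the noise level and the variable names are assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: a tight cloud of 200 observations whose centroid
# lies a long way from the origin.
centroid = np.array([50.0, 30.0, 20.0])
X_raw = centroid + rng.normal(scale=1.0, size=(200, 3))

# Non-centred analysis: eigen-decomposition of X'X with X left uncentred.
evals, evecs = np.linalg.eigh(X_raw.T @ X_raw)
first_vec = evecs[:, -1]           # eigenvector of the largest eigenvalue
share = evals[-1] / evals.sum()    # proportion of the total sum of squares

# Direction from the origin to the sample centroid, as a unit vector.
mean_dir = X_raw.mean(axis=0)
mean_dir = mean_dir / np.linalg.norm(mean_dir)

# Absolute cosine between the first non-centred 'PC' and the centroid direction.
cosine = abs(first_vec @ mean_dir)
print(f"share of first component: {share:.4f}")
print(f"|cosine| with centroid direction: {cosine:.6f}")
```

In this setting the first component accounts for almost all of the uncentred sum of squares and is nearly parallel to the centroid direction, so the structure within the cloud is left to the later components.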