Page 424 - Jolliffe I. Principal Component Analysis

14.2. Weights, Metrics, Transformations and Centerings
Standardization, in the sense of dividing each column of the data matrix
by its standard deviation, leads to PCA based on the correlation matrix,
and its pros and cons are discussed in Sections 2.3 and 3.3. This can be
thought of as a version of weighted PCA (Section 14.2.1). The same is true of
dividing each column by its range or its mean (Gower, 1966), in the latter case
giving a matrix of coefficients of variation. Underhill (1990) suggests a
biplot based on this matrix (see Section 5.3.2). Such plots are only relevant
when variables are non-negative, as with species abundance data.
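
The effect of the two scalings can be checked numerically. The following is a minimal sketch, assuming NumPy and a simulated non-negative data set; the data and the helper centred_cov are hypothetical, introduced only for illustration. It shows that the covariance matrix of columns divided by their standard deviations is the correlation matrix, and that the covariance matrix of columns divided by their means has squared coefficients of variation on its diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical non-negative data: 50 observations of 3 variables on very different scales.
X = rng.gamma(shape=2.0, scale=[1.0, 10.0, 100.0], size=(50, 3))

def centred_cov(Z):
    """Sample covariance matrix of the columns of Z (divisor n - 1)."""
    Zc = Z - Z.mean(axis=0)
    return Zc.T @ Zc / (Z.shape[0] - 1)

# Scaling by standard deviations: the covariance matrix becomes the correlation matrix.
Z_sd = X / X.std(axis=0, ddof=1)
cov_of_standardized = centred_cov(Z_sd)
corr = np.corrcoef(X, rowvar=False)

# Scaling by means: the diagonal holds squared coefficients of variation.
Z_mean = X / X.mean(axis=0)
cov_of_mean_scaled = centred_cov(Z_mean)
cv = X.std(axis=0, ddof=1) / X.mean(axis=0)

print(np.allclose(cov_of_standardized, corr))
print(np.allclose(np.diag(cov_of_mean_scaled), cv ** 2))
```

Dividing by the mean is only sensible here because every simulated value is positive, in line with the restriction to non-negative variables noted above.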
Principal components are linear functions of x whose coefficients are
given by the eigenvectors of a covariance or correlation matrix or, equivalently,
the eigenvectors of a matrix X'X. Here X is an (n × p) matrix whose
(i, j)th element is the value for the ith observation of the jth variable,
measured about the mean for that variable. Thus, the columns of X have
been centred, so that the sum of each column is zero, though Holmes-Junca
(1985) notes that centering by either medians or modes has been suggested
as an alternative to centering by means.
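
The equivalence between the eigenvectors of the covariance matrix and those of X'X for column-centred X can be illustrated as follows. This is a sketch under stated assumptions: NumPy is used, the data are simulated, and all variable names are hypothetical. The two matrices differ only by the scalar factor (n − 1), so the eigenvectors coincide (up to sign) and the eigenvalues differ by that factor.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 40 observations of 4 correlated variables with non-zero means.
data = rng.normal(size=(40, 4)) @ rng.normal(size=(4, 4)) + 5.0
n = data.shape[0]

# Column-centre: each variable is measured about its own mean.
X = data - data.mean(axis=0)

# Eigen-decompositions of the sample covariance matrix and of X'X.
S = np.cov(data, rowvar=False)
evals_S, evecs_S = np.linalg.eigh(S)
evals_XtX, evecs_XtX = np.linalg.eigh(X.T @ X)

# Eigenvectors are defined only up to sign, so compare absolute inner products.
alignment = np.abs(np.sum(evecs_S * evecs_XtX, axis=0))

# PC scores: linear functions of x with eigenvector coefficients (largest eigenvalue first).
scores = X @ evecs_XtX[:, ::-1]

print(np.allclose(evals_XtX, (n - 1) * evals_S))
print(np.allclose(alignment, 1.0, atol=1e-6))
```

Note that np.linalg.eigh returns eigenvalues in ascending order, which is why the eigenvector columns are reversed before computing the scores.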
Two alternatives to ‘column-centering’ are:
(i) the columns of X are left uncentred, that is x_ij is now the value for
the ith observation of the jth variable, as originally measured;
(ii) both rows and columns of X are centred, so that sums of rows, as well
as sums of columns, are zero.

In either (i) or (ii) the analysis now proceeds by looking at linear functions
of x whose coefficients are the eigenvectors of X'X, with X now
non-centred or doubly centred. Of course, these linear functions no longer
maximize variance, and so are not PCs according to the usual definition,
but it is convenient to refer to them as non-centred and doubly centred
PCs, respectively.
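
The three centerings can be compared side by side. The sketch below assumes NumPy and simulated count data; the functions centre and xtx_eigen are hypothetical helpers, not part of any library. Double centering is implemented by subtracting row and column means and adding back the grand mean, which makes both row sums and column sums zero.

```python
import numpy as np

def centre(data, mode="column"):
    """Return the data uncentred ('none'), column-centred, or doubly centred."""
    data = np.asarray(data, dtype=float)
    if mode == "none":
        return data.copy()
    if mode == "column":
        return data - data.mean(axis=0)
    if mode == "double":
        return (data
                - data.mean(axis=0)                 # remove column means
                - data.mean(axis=1, keepdims=True)  # remove row means
                + data.mean())                      # add back the grand mean
    raise ValueError(f"unknown mode: {mode}")

def xtx_eigen(X):
    """Eigenvalues and eigenvectors of X'X, largest eigenvalue first."""
    evals, evecs = np.linalg.eigh(X.T @ X)
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order]

rng = np.random.default_rng(2)
# Hypothetical counts: 30 sites (rows) by 3 species (columns).
counts = rng.poisson(lam=[5.0, 20.0, 50.0], size=(30, 3))

X_double = centre(counts, "double")
evals_none, evecs_none = xtx_eigen(centre(counts, "none"))
evals_col, evecs_col = xtx_eigen(centre(counts, "column"))
evals_dbl, evecs_dbl = xtx_eigen(X_double)

print(np.allclose(X_double.sum(axis=0), 0.0), np.allclose(X_double.sum(axis=1), 0.0))
```

Because every row of the doubly centred matrix sums to zero, X'X then has a zero eigenvalue whose eigenvector is proportional to a vector of ones, so at most p − 1 doubly centred 'PCs' carry any variation.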
Non-centred PCA is a fairly well-established technique in ecology (Ter
Braak, 1983). It has also been used in chemistry (Jackson, 1991, Section
3.4; Cochran and Horne, 1977) and geology (Reyment and Jöreskog, 1993).
As noted by Ter Braak (1983), the technique projects observations onto
the best fitting plane (or flat) through the origin, rather than through the
centroid of the data set. If the data are such that the origin is an important
point of reference, then this type of analysis can be relevant. However, if the
centre of the observations is a long way from the origin, then the first ‘PC’
will dominate the analysis, and will simply reflect the position of the centroid.
For data that consist of counts of a number of biological species (the
variables) at various sites (the observations), Ter Braak (1983) claims that
non-centred PCA is better than standard (centred) PCA at simultaneously
representing within-site diversity and between-site diversity of species (see
also Digby and Kempton (1987, Section 3.5.5)). Centred PCA is better at
representing between-site species diversity than non-centred PCA, but it is
more difficult to deduce within-site diversity from a centred PCA.
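
The dominance of the first non-centred 'PC' when the centroid is far from the origin can be seen in a small simulation. This is an illustrative sketch only, assuming NumPy; the centroid and noise level are hypothetical choices made so that the centroid lies many standard deviations from the origin.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data: 200 observations scattered tightly around a distant centroid.
centroid = np.array([30.0, 40.0, 50.0])
data = centroid + rng.normal(scale=1.0, size=(200, 3))

# Non-centred analysis: eigen-decomposition of X'X with X left uncentred.
evals, evecs = np.linalg.eigh(data.T @ data)  # eigenvalues in ascending order
first = evecs[:, -1]                          # eigenvector with the largest eigenvalue

# Direction from the origin to the sample centroid.
mean_vec = data.mean(axis=0)
mean_dir = mean_vec / np.linalg.norm(mean_vec)

cosine = abs(first @ mean_dir)   # close to 1 when the two directions coincide
share = evals[-1] / evals.sum()  # proportion of the total sum of squares

print(cosine > 0.999, share > 0.99)
```

In this setting the first eigenvector is almost exactly the direction of the centroid and accounts for nearly all of the uncentred sum of squares, so the remaining components carry the information that a centred analysis would show.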
