Page 406 - Jolliffe I. Principal Component Analysis

13.8. Some Other Types of Data
discuss two adaptations of PCA for such data. In the first, called the
VERTICES method, the ith row of the (n × p) data matrix is replaced by the
$2^p$ distinct rows whose elements have either $\underline{x}_{ij}$ or
$\overline{x}_{ij}$ in their jth column. A PCA is then done on the resulting
$(n2^p \times p)$ matrix. The value or score of a PC from this analysis can
be calculated for each of the $n2^p$ rows of the new data matrix. For the
ith observation there are $2^p$ such scores and an
interval can be constructed for the observation, bounded by the smallest
and largest of these scores. In plotting the observations, either with respect
to the original variables or with respect to PCs, each observation is repre-
sented by a rectangle or hyperrectangle in two or higher-dimensional space.
The boundaries of the (hyper)rectangle are determined by the intervals for
the variables or PC scores. Chouakria et al. (2000) examine a number of
indices measuring the quality of representation of an interval data set by
a small number of ‘interval PCs’ and the contributions of each observation
to individual PCs.
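
The VERTICES construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of Chouakria et al.: the function name `vertices_pca`, the argument names `lower` and `upper` (arrays of interval endpoints), and the choice of a covariance-based PCA are all assumptions made for the example.

```python
import itertools
import numpy as np

def vertices_pca(lower, upper):
    """Illustrative VERTICES-style PCA for interval data (hypothetical helper).

    lower, upper: (n, p) arrays holding the interval endpoints of each
    observation on each variable. Returns the eigenvalues, the loadings,
    and the (n, p) arrays of smallest and largest PC scores per observation.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n, p = lower.shape

    # Every way of picking the lower (False) or upper (True) endpoint
    # for each of the p variables: 2^p combinations.
    choices = np.array(list(itertools.product([False, True], repeat=p)))

    # vertices[i, k, j] is the value of variable j at vertex k of observation i.
    vertices = np.where(choices[None, :, :], upper[:, None, :], lower[:, None, :])
    stacked = vertices.reshape(n * 2**p, p)      # the (n 2^p x p) matrix

    # Assumption: PCA on the covariance matrix of the stacked vertices.
    centred = stacked - stacked.mean(axis=0)
    cov = np.atleast_2d(np.cov(centred, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]            # largest variance first
    eigvals, loadings = eigvals[order], eigvecs[:, order]

    # Scores of all vertices, regrouped so each observation has 2^p scores per PC.
    scores = (centred @ loadings).reshape(n, 2**p, p)
    return eigvals, loadings, scores.min(axis=1), scores.max(axis=1)
```

The two returned score arrays give, for each observation and each PC, the interval bounded by the smallest and largest vertex scores. Because the number of stacked rows grows as 2^p, the sketch is only practical for small p.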
For large values of p, the VERTICES method produces very large matri-
ces. As an alternative, Chouakria et al. suggest the CENTERS procedure,
in which a PCA is done on the (n × p) matrix whose (i, j)th element is
$(\underline{x}_{ij} + \overline{x}_{ij})/2$. The immediate results give a single score for each observation
on each PC, but Chouakria and coworkers use the intervals of possible val-
ues for the variables to construct intervals for the PC scores. This is done
by finding the combinations of allowable values for the variables, which,
when inserted in the expression for a PC in terms of the variables, give the
maximum and minimum scores for the PC. An example is given to compare
the VERTICES and CENTERS approaches.
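
A minimal sketch of the CENTERS idea follows, again under stated assumptions: the names `centers_pca`, `lower` and `upper` are hypothetical, and a covariance-based PCA is assumed. Since a PC score is a linear function of the variables, one way to find its extreme values over the allowable intervals is to take the upper endpoint wherever the loading is positive and the lower endpoint wherever it is negative (and the reverse for the minimum); the sketch uses that rule, which may differ in detail from the authors' procedure.

```python
import numpy as np

def centers_pca(lower, upper):
    """Illustrative CENTERS-style PCA for interval data (hypothetical helper).

    lower, upper: (n, p) arrays of interval endpoints.
    Returns the loadings, the single score per observation and PC obtained
    from the interval midpoints, and the lower/upper bounds of the score
    intervals.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)

    # PCA on the matrix of interval midpoints.
    centers = (lower + upper) / 2.0
    mean = centers.mean(axis=0)
    cov = np.atleast_2d(np.cov(centers - mean, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]            # largest variance first
    loadings = eigvecs[:, order]
    point_scores = (centers - mean) @ loadings

    # Split the loadings into their positive and negative parts.
    pos = np.clip(loadings, 0.0, None)
    neg = np.clip(loadings, None, 0.0)

    # Maximum score: upper endpoints where loadings are positive,
    # lower endpoints where they are negative; minimum is the reverse.
    score_hi = (upper - mean) @ pos + (lower - mean) @ neg
    score_lo = (lower - mean) @ pos + (upper - mean) @ neg
    return loadings, point_scores, score_lo, score_hi
```

Unlike the VERTICES sketch, this works with (n × p) matrices throughout, so its cost does not grow exponentially in p.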
Ichino and Yaguchi (1994) describe a generalization of PCA that can be
used on a wide variety of data types, including discrete variables in which a
measurement is a subset of more than one of the possible values for a vari-
able; continuous variables recorded as intervals are also included. To carry
out PCA, the measurement on each variable is converted to a single value.
This is done by first calculating a ‘distance’ between any two observations
on each variable, constructed from a formula that involves the union and
intersection of the values of the variable taken by the two observations.
From these distances a ‘reference event’ is found, defined as the observa-
tion whose sum of distances from all other observations is minimized, where
distance here refers to the sum of ‘distances’ for each of the p variables.
The coordinate of each observation for a particular variable is then taken
as the distance on that variable from the reference event, with a suitably
assigned sign. The coordinates of the n observations on the p variables thus
defined form a data set, which is then subjected to PCA.
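
The conversion step can be illustrated for interval-valued variables with the sketch below. The text above does not give the distance formula or the rule for assigning signs, so both are assumptions here: the 'distance' is taken as the length of the span covering both intervals minus the length of their overlap (a simple stand-in built from union and intersection, not necessarily the formula of Ichino and Yaguchi), and the sign is taken from a comparison of interval midpoints with the reference event. The function names are hypothetical.

```python
def interval_distance(a, b):
    """Assumed stand-in distance between intervals a = (lo, hi) and b = (lo, hi):
    length of the span covering both minus the length of their overlap."""
    span = max(a[1], b[1]) - min(a[0], b[0])
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    return span - overlap

def reference_coordinates(data):
    """data[i][j] is the (lo, hi) interval of observation i on variable j.

    Returns the index of the reference event and the n x p list of signed
    coordinates that would then be passed to an ordinary PCA.
    """
    n, p = len(data), len(data[0])

    def total(i, k):
        # Distance between two observations: sum of per-variable distances.
        return sum(interval_distance(data[i][j], data[k][j]) for j in range(p))

    # Reference event: the observation with the smallest summed distance
    # to all other observations (ties go to the first such observation).
    sums = [sum(total(i, k) for k in range(n)) for i in range(n)]
    ref = min(range(n), key=sums.__getitem__)

    coords = []
    for i in range(n):
        row = []
        for j in range(p):
            d = interval_distance(data[i][j], data[ref][j])
            # Assumed sign rule: negative if the interval midpoint lies
            # below the midpoint of the reference event on this variable.
            mid_i = (data[i][j][0] + data[i][j][1]) / 2
            mid_ref = (data[ref][j][0] + data[ref][j][1]) / 2
            row.append(-d if mid_i < mid_ref else d)
        coords.append(row)
    return ref, coords
```

For example, with `data = [[(0, 1), (2, 3)], [(1, 2), (2, 3)], [(4, 6), (0, 1)]]` the second observation has the smallest summed distance and becomes the reference event, so its coordinates are all zero.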

Species Abundance Data
These data are common in ecology—an example was given in Section 5.4.1.
When the study area has diverse habitats and many species are included,
