Page 406 - Jolliffe I. Principal Component Analysis

13.8. Some Other Types of Data
                              discuss two adaptations of PCA for such data. In the first, called the
                              VERTICES method, the ith row of the (n × p) data matrix is replaced by the
                              $2^p$ distinct rows whose elements have either $\underline{x}_{ij}$ or
                              $\overline{x}_{ij}$ in their jth column. A PCA is then done on the resulting
                              $(n2^p \times p)$ matrix. The value or score of a PC from this analysis can
                              be calculated for each of the $n2^p$ rows of the new data matrix. For the
                              ith observation there are $2^p$ such scores and an
                              interval can be constructed for the observation, bounded by the smallest
                              and largest of these scores. In plotting the observations, either with respect
                              to the original variables or with respect to PCs, each observation is repre-
                              sented by a rectangle or hyperrectangle in two or higher-dimensional space.
                              The boundaries of the (hyper)rectangle are determined by the intervals for
                              the variables or PC scores. Chouakria et al. (2000) examine a number of
                              indices measuring the quality of representation of an interval data set by
                              a small number of ‘interval PCs’ and the contributions of each observation
                              to individual PCs.
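
                              The VERTICES construction described above can be sketched in a few
                              lines of NumPy. This is only an illustrative sketch, not the authors'
                              implementation: the function name vertices_pca and its arguments are
                              hypothetical, and a covariance-based PCA (via the SVD of the centered
                              matrix) is assumed, since the text does not say whether the variables
                              are standardized.

```python
import itertools
import numpy as np

def vertices_pca(lower, upper):
    """Sketch of the VERTICES idea for interval data (hypothetical helper).

    lower, upper: (n, p) arrays holding the lower and upper interval bounds.
    Returns the smallest and largest PC score per observation, plus loadings.
    Assumption: covariance-based PCA on the stacked vertex matrix.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    n, p = lower.shape

    # All 2^p ways of picking the lower (0) or upper (1) bound per variable.
    choices = np.array(list(itertools.product([0, 1], repeat=p)))

    # vertices[i, k, j]: value of variable j at vertex k of observation i.
    vertices = np.where(choices[None, :, :] == 1,
                        upper[:, None, :], lower[:, None, :])

    # Stack into the (n * 2^p) x p matrix and centre its columns.
    stacked = vertices.reshape(n * 2 ** p, p)
    centered = stacked - stacked.mean(axis=0)

    # Rows of vt are the PC loading vectors.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)

    # Score every vertex, then take the extremes within each observation.
    scores = (centered @ vt.T).reshape(n, 2 ** p, -1)
    return scores.min(axis=1), scores.max(axis=1), vt

lower = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
upper = np.array([[1.0, 1.0], [2.0, 4.0], [5.0, 2.0]])
score_lo, score_hi, loadings = vertices_pca(lower, upper)
```

                              Each row of score_lo and score_hi bounds the interval for one
                              observation on each PC. The stacked matrix has n * 2^p rows, so the
                              sketch is only practical for small p.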
                                For large values of p, the VERTICES method produces very large matri-
                              ces. As an alternative, Chouakria et al. suggest the CENTERS procedure,
                              in which a PCA is done on the (n × p) matrix whose (i, j)th element is
                              $(\underline{x}_{ij} + \overline{x}_{ij})/2$. The immediate results give a single score for each observation
                              on each PC, but Chouakria and coworkers use the intervals of possible val-
                              ues for the variables to construct intervals for the PC scores. This is done
                              by finding the combinations of allowable values for the variables, which,
                              when inserted in the expression for a PC in terms of the variables, give the
                              maximum and minimum scores for the PC. An example is given to compare
                              the VERTICES and CENTERS approaches.
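
                              A minimal sketch of the CENTERS idea follows, under the same
                              assumptions as before (hypothetical function name centers_pca,
                              covariance-based PCA). Because a PC score is a linear function of the
                              variables, its maximum over the allowable values is reached by taking
                              the upper bound wherever the loading is positive and the lower bound
                              wherever it is negative, and conversely for the minimum.

```python
import numpy as np

def centers_pca(lower, upper):
    """Sketch of the CENTERS idea for interval data (hypothetical helper).

    lower, upper: (n, p) arrays of interval bounds.
    Returns point scores of the midpoints and interval bounds for the scores.
    Assumption: covariance-based PCA on the matrix of interval midpoints.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)

    # PCA on the (n x p) matrix of midpoints.
    centers = (lower + upper) / 2.0
    mean = centers.mean(axis=0)
    _, _, vt = np.linalg.svd(centers - mean, full_matrices=False)
    scores = (centers - mean) @ vt.T

    # Split each loading vector into its positive and negative parts.
    pos = np.clip(vt, 0.0, None)
    neg = np.clip(vt, None, 0.0)

    # Maximum: upper bound where the loading is positive, lower otherwise.
    score_max = (upper - mean) @ pos.T + (lower - mean) @ neg.T
    # Minimum: the reverse pairing.
    score_min = (lower - mean) @ pos.T + (upper - mean) @ neg.T
    return scores, score_min, score_max

lower = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
upper = np.array([[1.0, 1.0], [2.0, 4.0], [5.0, 2.0]])
scores, score_min, score_max = centers_pca(lower, upper)
```

                              Only an (n × p) matrix is decomposed here, so the cost does not grow
                              with 2^p as it does for VERTICES.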
                                Ichino and Yaguchi (1994) describe a generalization of PCA that can be
                              used on a wide variety of data types, including discrete variables in which a
                              measurement is a subset of more than one of the possible values for a vari-
                              able; continuous variables recorded as intervals are also included. To carry
                              out PCA, the measurement on each variable is converted to a single value.
                              This is done by first calculating a ‘distance’ between any two observations
                              on each variable, constructed from a formula that involves the union and
                              intersection of the values of the variable taken by the two observations.
                              From these distances a ‘reference event’ is found, defined as the observa-
                              tion whose sum of distances from all other observations is minimized, where
                              distance here refers to the sum of ‘distances’ for each of the p variables.
                              The coordinate of each observation for a particular variable is then taken
                              as the distance on that variable from the reference event, with a suitably
                              assigned sign. The coordinates of the n observations on the p variables thus
                              defined form a data set, which is then subjected to PCA.
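
                              The steps in this paragraph can be illustrated for the special case
                              of interval-valued variables. The sketch below is a simplified
                              stand-in rather than Ichino and Yaguchi's formula: the per-variable
                              'distance' is assumed to be the length of the span of the union of
                              two intervals minus the length of their intersection, and the sign
                              rule (positive when an observation's midpoint is at or above that of
                              the reference event) is likewise an assumption, because the text says
                              only that the sign is 'suitably assigned'. All function names are
                              hypothetical.

```python
import numpy as np

def interval_distances(lower, upper):
    """d[i, k, j]: assumed 'distance' between observations i and k on variable j.

    Assumption: span of the union minus length of the intersection.
    """
    lo = np.asarray(lower, dtype=float)
    up = np.asarray(upper, dtype=float)
    # Length of the smallest interval covering both intervals.
    span = (np.maximum(up[:, None, :], up[None, :, :])
            - np.minimum(lo[:, None, :], lo[None, :, :]))
    # Length of the overlap (zero when the intervals are disjoint).
    overlap = np.clip(np.minimum(up[:, None, :], up[None, :, :])
                      - np.maximum(lo[:, None, :], lo[None, :, :]), 0.0, None)
    return span - overlap

def reference_coordinates(lower, upper):
    """Find the reference event and signed coordinates relative to it."""
    lo = np.asarray(lower, dtype=float)
    up = np.asarray(upper, dtype=float)
    d = interval_distances(lo, up)

    # Total distance between observations: sum over the p variables.
    total = d.sum(axis=2)
    # Reference event: smallest summed distance to all other observations.
    ref = int(np.argmin(total.sum(axis=1)))

    # Assumed sign rule: compare interval midpoints with the reference event.
    mid = (lo + up) / 2.0
    sign = np.where(mid >= mid[ref], 1.0, -1.0)
    return ref, sign * d[:, ref, :]

lower = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 1.0]])
upper = np.array([[1.0, 1.0], [2.0, 4.0], [5.0, 2.0]])
ref, coords = reference_coordinates(lower, upper)

# The (n x p) coordinate matrix is then treated as ordinary data for PCA.
centered = coords - coords.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pc_scores = centered @ vt.T
```

                              The reference event itself has coordinate zero on every variable, so
                              the coordinates measure signed departures from it.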


                              Species Abundance Data
                              These data are common in ecology—an example was given in Section 5.4.1.
                              When the study area has diverse habitats and many species are included,