Page 278 - Jolliffe I. Principal Component Analysis

                                          10.1. Detection of Outliers Using Principal Components
                              that this (male) student has the equal largest chest measurement, but that
                              only three of the other 16 male students are shorter than him, and only two
                              have a smaller waist measurement—perhaps he was a body builder? Similar
                              analyses can be done for other observations in Table 10.1. For example,
                              observation 20 is extreme on the fifth PC. This PC, which accounts for
                              2.7% of the total variation, is mainly a contrast between height and forearm
                              length with coefficients 0.67 and −0.52, respectively. Observation 20 is (jointly
                              with one other) the shortest student of the 28, but only one of the other
                              ten women has a larger forearm measurement. Thus, observations 15 and
                              20, and other observations indicated as extreme by the last few PCs, are
                              students for whom some aspects of their physical measurements contradict
                              the general positive correlation among all seven measurements.
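
A brief sketch of why such a student is extreme on the fifth PC, written using only the two dominant coefficients quoted above. The symbols are notation assumed here and are not taken from the text: $x^{*}_{i,\text{height}}$ and $x^{*}_{i,\text{forearm}}$ denote the centred (and possibly standardized) measurements for student $i$, and $r_i$ collects the five terms with smaller coefficients. The overall sign of a PC is arbitrary.

```latex
z_{i5} = 0.67\, x^{*}_{i,\text{height}} - 0.52\, x^{*}_{i,\text{forearm}} + r_i
```

For a typical student, the positive correlation between height and forearm length keeps the first two terms roughly in balance, which is consistent with the small variance of this PC. A student who is well below average in height but whose forearm is long relative to that height has a large negative first term that the second term does not offset, so the score lies far from zero.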




                              Household Formation Data
                              These data were described in Section 8.7.2 and are discussed in detail by
                              Garnham (1979) and Bassett et al. (1980). Section 8.7.2 gives the results of
                              a PC regression of average annual total income per adult on 28 other
                              demographic variables for 168 local government areas in England and Wales.
                              Garnham (1979) also examined plots of the last few and first few PCs of
                              the 28 predictor variables in an attempt to detect outliers. Two such plots,
                              for the first two and last two PCs, are reproduced in Figures 10.4 and 10.5.
                              An interesting aspect of these figures is that the most extreme observations
                              with respect to the last two PCs, namely observations 54, 67, 41 (and 47,
                              53) are also among the most extreme with respect to the first two PCs.
                              Some of these observations are, in addition, in outlying positions on plots
                              of other low-variance PCs. The most blatant case is observation 54, which
                              is among the few most extreme observations on PCs 24 to 28 inclusive, and
                              also on PC1. This observation is ‘Kensington and Chelsea,’ which must be
                              an outlier with respect to several variables individually, as well as being
                              different in correlation structure from most of the remaining observations.
                                In addition to plotting the data with respect to the last few and first few
                              PCs, Garnham (1979) examined the statistics $d_{1i}^2$ for $q = 1, 2, \ldots, 8$ using
                              gamma plots, and also looked at normal probability plots of the values of
                              various PCs. As a combined result of these analyses, he identified six likely
                              outliers, the five mentioned above together with observation 126, which is
                              moderately extreme according to several analyses.
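
The calculation of $d_{1i}^2$ for several values of $q$ can be sketched as follows, taking $d_{1i}^2$ to be the sum of squared scores of observation $i$ on the last $q$ PCs, as we understand the definition given earlier in this section. This is a minimal illustration under stated assumptions, not a reconstruction of Garnham's analysis: the data are synthetic, the PCs are assumed to come from the correlation matrix, and the function names `pc_scores` and `d1_squared` are hypothetical. The gamma plots themselves are not reproduced.

```python
import numpy as np

def pc_scores(X):
    """PC scores from the correlation matrix, columns in order of decreasing variance."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each variable
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    order = np.argsort(eigvals)[::-1]                  # largest variance first
    return Z @ eigvecs[:, order]

def d1_squared(scores, q):
    """Sum of squared scores on the last q PCs, one value per observation."""
    return np.sum(scores[:, -q:] ** 2, axis=1)

# Synthetic stand-in data (NOT the household formation data):
# 168 observations on 28 positively correlated variables.
rng = np.random.default_rng(0)
n, p = 168, 28
X = rng.normal(size=(n, 1)) + 0.5 * rng.normal(size=(n, p))

scores = pc_scores(X)
for q in range(1, 9):
    d = d1_squared(scores, q)
    largest = np.argsort(d)[::-1][:5] + 1   # 1-based labels of the five largest values
    print(f"q={q}: observations with largest d^2: {largest.tolist()}")
```

Observations that appear among the largest values for many different $q$ are the natural candidates for closer inspection, in the same spirit as the combined analyses described above.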
                                The PC regression was then repeated without these six observations. The
                              results of the regression were noticeably changed, and were better in two
                              respects than those derived from all the observations. The number of PCs
                              which it was necessary to retain in the regression was decreased, and the
                              prediction accuracy was improved, with the standard error of prediction
                              reduced to 77.3% of that for the full data set.
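
The comparison described above, fitting a PC regression with and without a set of flagged observations, can be sketched as below. Everything here is an assumption made for illustration: the data are synthetic with contamination deliberately planted in the response, the indices in `flagged` are hypothetical, the number of retained PCs `m` is held fixed rather than re-chosen, and the residual standard error stands in for the standard error of prediction, whose exact definition is not given in this passage. The ratio printed bears no relation to the 77.3% figure reported in the text.

```python
import numpy as np

def pc_regression_se(X, y, m):
    """Regress y on the first m PCs of X; return the residual standard error."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize predictors
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    A = eigvecs[:, np.argsort(eigvals)[::-1][:m]]      # coefficients of the first m PCs
    design = np.column_stack([np.ones(len(y)), Z @ A]) # intercept plus PC scores
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(np.sqrt(resid @ resid / (len(y) - m - 1)))

# Synthetic data: six correlated predictors and a response driven by their common factor.
rng = np.random.default_rng(1)
n, p, m = 168, 6, 2
common = rng.normal(size=n)
X = common[:, None] + 0.5 * rng.normal(size=(n, p))
y = 2.0 * common + rng.normal(size=n)

flagged = np.array([5, 17, 42, 60, 99, 120])   # hypothetical outlier indices
y[flagged] += 30.0                              # planted contamination

keep = np.setdiff1d(np.arange(n), flagged)
se_all = pc_regression_se(X, y, m)
se_trimmed = pc_regression_se(X[keep], y[keep], m)
print(f"standard error without flagged rows / with all rows: {se_trimmed / se_all:.3f}")
```

Note that the PCs are recomputed from the reduced data in the second fit, since deleting observations changes the correlation matrix as well as the regression.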