Page 278 - Jolliffe I. Principal Component Analysis
10.1. Detection of Outliers Using Principal Components
that this (male) student has the equal largest chest measurement, but that
only 3 of the other 16 male students are shorter than him, and only two
have a smaller waist measurement—perhaps he was a body builder? Similar
analyses can be done for other observations in Table 10.1. For example,
observation 20 is extreme on the fifth PC. This PC, which accounts for
2.7% of the total variation, is mainly a contrast between height and forearm
length with coefficients 0.67, −0.52, respectively. Observation 20 is (jointly
with one other) the shortest student of the 28, but only one of the other
ten women has a larger forearm measurement. Thus, observations 15 and
20, and other observations indicated as extreme by the last few PCs, are
students for whom some aspects of their physical measurements contradict
the general positive correlation among all seven measurements.
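The effect of a contrast PC can be illustrated with a small arithmetic sketch. It uses the two coefficients quoted above for the fifth PC (0.67 for height, −0.52 for forearm length) and assumes the measurements are expressed in standardized units. The function name and the two example students are hypothetical and are not taken from Table 10.1; the other five variables are ignored, so this shows only the dominant part of the score.

```python
# Coefficients of the fifth PC quoted in the text (height, forearm length).
A_HEIGHT, A_FOREARM = 0.67, -0.52

def contrast_score(z_height, z_forearm):
    """Contribution of height and forearm length to the PC score.

    Inputs are assumed to be standardized measurements (an assumption
    made for this sketch).
    """
    return A_HEIGHT * z_height + A_FOREARM * z_forearm

# Hypothetical students, not observations from Table 10.1.
consistent = contrast_score(1.0, 1.0)      # tall, with a long forearm
contradictory = contrast_score(-1.5, 1.0)  # short, with a long forearm

print(consistent, contradictory)
```

A student whose height and forearm length agree with the general positive correlation gets a score near zero (about 0.15 here), whereas a short student with a long forearm gets a score of about −1.525, which is large relative to the small variance of such a PC.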
Household Formation Data
These data were described in Section 8.7.2 and are discussed in detail by
Garnham (1979) and Bassett et al. (1980). Section 8.7.2 gives the results of
a PC regression of average annual total income per adult on 28 other demographic
variables for 168 local government areas in England and Wales.
Garnham (1979) also examined plots of the last few and first few PCs of
the 28 predictor variables in an attempt to detect outliers. Two such plots,
for the first two and last two PCs, are reproduced in Figures 10.4 and 10.5.
An interesting aspect of these figures is that the most extreme observations
with respect to the last two PCs, namely observations 54, 67, 41 (and 47,
53) are also among the most extreme with respect to the first two PCs.
Some of these observations are, in addition, in outlying positions on plots
of other low-variance PCs. The most blatant case is observation 54, which
is among the few most extreme observations on PCs 24 to 28 inclusive, and
also on PC1. This observation is ‘Kensington and Chelsea,’ which must be
an outlier with respect to several variables individually, as well as being
different in correlation structure from most of the remaining observations.
In addition to plotting the data with respect to the last few and first few
PCs, Garnham (1979) examined the statistics $d_{1i}^2$ for $q = 1, 2, \ldots, 8$ using
gamma plots, and also looked at normal probability plots of the values of
various PCs. As a combined result of these analyses, he identified six likely
outliers, the five mentioned above together with observation 126, which is
moderately extreme according to several analyses.
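The statistic examined by Garnham can be sketched as follows, assuming the definition given earlier in the section (not shown in this excerpt), $d_{1i}^2 = \sum_{k=p-q+1}^{p} z_{ik}^2$, where $z_{ik}$ is the score of observation $i$ on the $k$th PC. The function names and the layout of the score matrix (columns ordered from PC1 to PC$p$) are assumptions made for illustration; the gamma plots themselves are not reproduced.

```python
import numpy as np

def d1_squared(scores, q):
    """Sum of squared scores on the last q PCs, one value per observation.

    `scores` is an (n, p) array whose columns are assumed to be ordered
    from the first (highest-variance) PC to the last.
    """
    scores = np.asarray(scores, dtype=float)
    p = scores.shape[1]
    if not 1 <= q <= p:
        raise ValueError("q must lie between 1 and the number of PCs")
    return (scores[:, -q:] ** 2).sum(axis=1)

def d1_table(scores, q_max=8):
    """Columns hold d1_squared for q = 1, ..., q_max."""
    return np.column_stack(
        [d1_squared(scores, q) for q in range(1, q_max + 1)]
    )
```

Sorting each column of `d1_table(scores)` gives the ordered values that would be plotted against gamma quantiles; observations that are largest for several values of $q$ are candidate outliers.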
The PC regression was then repeated without these six observations. The
results of the regression were noticeably changed, and were better in two
respects than those derived from all the observations. The number of PCs
which it was necessary to retain in the regression was decreased, and the
prediction accuracy was improved, with the standard error of prediction
reduced to 77.3% of that for the full data set.
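The comparison described above can be sketched in outline. The code below is a minimal illustration under stated assumptions, not Garnham's procedure: it standardizes the predictors, regresses the response on the first `m` PCs, and uses the residual standard error as a stand-in for the standard error of prediction. The function names, the choice of `m`, and the use of the same `m` before and after deletion are all assumptions; in the analysis reported here the number of retained PCs also changed.

```python
import numpy as np

def pcr_residual_se(X, y, m):
    """Residual standard error from regressing y on the first m PCs of X."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize the predictors (correlation-matrix PCA).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the PC directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt[:m].T
    A = np.column_stack([np.ones(len(y)), scores])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return np.sqrt(resid @ resid / (len(y) - m - 1))

def se_ratio_after_dropping(X, y, m, drop):
    """Ratio of residual SE without the rows in `drop` to that with all rows."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.setdiff1d(np.arange(len(y)), drop)
    return pcr_residual_se(X[keep], y[keep], m) / pcr_residual_se(X, y, m)
```

A ratio below 1 from `se_ratio_after_dropping` corresponds to the kind of improvement reported above, where the standard error fell to 77.3% of its value for the full data set.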

