Page 268 - Jolliffe I. Principal Component Analysis
10.1. Detection of Outliers Using Principal Components
variables, and will often be extreme with respect to one or both of these
variables looked at individually.
By contrast, the last few PCs may detect outliers that are not apparent
with respect to the original variables. A strong correlation structure be-
tween variables implies that there are linear functions of the variables with
small variances compared to the variances of the original variables. In the
simple height-and-weight example described above, height and weight have
a strong positive correlation, so it is possible to write
x_2 = βx_1 + ε,
where x_1, x_2 are height and weight measured about their sample means,
β is a positive constant, and ε is a random variable with a much smaller
variance than x_1 or x_2. Therefore the linear function
x_2 − βx_1
has a small variance, and the last (in this case the second) PC in an analysis
of x_1, x_2 has a similar form, namely a_22 x_2 − a_12 x_1, where a_12, a_22 > 0.
Calculation of the value of this second PC for each observation will detect
observations such as (175 cm, 25 kg) that are outliers with respect to the
correlation structure of the data, though not necessarily with respect to
individual variables. Figure 10.2 shows a plot of the data from Figure 10.1,
with respect to the PCs derived from the correlation matrix. The outlying
observation is ‘average’ for the first PC, but very extreme for the second.
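This calculation can be illustrated with a short numerical sketch (not the book's own computation; the height-and-weight data below are simulated for illustration). We append an outlier of the form (175 cm, 25 kg), compute the PCs of the correlation matrix, and score every observation on the last PC:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
height = rng.normal(170, 10, n)                              # cm
weight = 0.9 * (height - 170) + 65 + rng.normal(0, 3, n)     # kg, strongly correlated
X = np.column_stack([height, weight])
X = np.vstack([X, [175.0, 25.0]])       # outlier: average height, aberrant weight

# PCs of the correlation matrix = PCs of the standardized data
Z = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
# np.linalg.eigh returns eigenvalues in ascending order,
# so column 0 is the last (smallest-variance) PC
scores_last = Z @ eigvecs[:, 0]

# The appended outlier has by far the largest |score| on the last PC
print(np.argmax(np.abs(scores_last)))   # index of the outlying observation
```

For a strongly positively correlated pair, the last PC is proportional to the difference of the standardized variables, so an observation that breaks the correlation receives an extreme score even though neither coordinate need be extreme on its own.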
This argument generalizes readily when the number of variables p is
greater than two; by examining the values of the last few PCs, we may be
able to detect observations that violate the correlation structure imposed
by the bulk of the data, but that are not necessarily aberrant with respect
to individual variables. Of course, it is possible that, if the sample size is
relatively small or if a few observations are sufficiently different from the
rest, then the outlier(s) may so strongly influence the last few PCs that
these PCs now reflect mainly the position of the outlier(s) rather than the
structure of the majority of the data. One way of avoiding this masking
or camouflage of outliers is to compute PCs leaving out one (or more)
observations and then calculate for the deleted observations the values of
the last PCs based on the reduced data set. To do this for each observation
is a heavy computational burden, but it might be worthwhile in small
samples where such camouflaging is, in any case, more likely to occur.
Alternatively, if PCs are estimated robustly (see Section 10.4), then the
influence of outliers on the last few PCs should be reduced and it may be
unnecessary to repeat the analysis with each observation deleted.
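The leave-one-out idea can be sketched as follows (a minimal numpy implementation of the procedure described above, not code from the book; here only the single last PC of each reduced data set is used):

```python
import numpy as np

def loo_last_pc_scores(X):
    """For each observation i, fit PCs (of the correlation matrix) on the
    data with row i deleted, then score row i on the last PC of that
    reduced data set.  Extreme |score| suggests an outlier that might
    otherwise mask itself by distorting the last few PCs."""
    n, p = X.shape
    scores = np.empty(n)
    for i in range(n):
        Xr = np.delete(X, i, axis=0)
        mu, sd = Xr.mean(axis=0), Xr.std(axis=0)
        R = np.corrcoef((Xr - mu) / sd, rowvar=False)
        _, eigvecs = np.linalg.eigh(R)          # ascending: column 0 = last PC
        scores[i] = ((X[i] - mu) / sd) @ eigvecs[:, 0]
    return scores
```

The sign of each score is arbitrary (eigenvectors are defined up to sign), so comparisons should be made on absolute values. The loop makes the O(n) cost of refitting explicit, which is the computational burden noted above.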
A series of scatterplots of pairs of the first few and last few PCs may
be useful in identifying possible outliers. One way of presenting each PC
separately is as a set of parallel boxplots. These have been suggested as a
means of deciding how many PCs to retain (see Section 6.1.5), but they
may also be useful for flagging potential outliers (Besse, 1994).
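A simple numeric stand-in for inspecting such parallel boxplots (an illustrative sketch, not Besse's procedure) is to flag any observation whose score on some PC falls outside the usual boxplot fences, quartiles ± 1.5 × IQR:

```python
import numpy as np

def flag_outliers_by_pc(X, k=1.5):
    """Flag observations whose score on any PC (correlation matrix) lies
    outside the boxplot fences [Q1 - k*IQR, Q3 + k*IQR] for that PC --
    a numeric analogue of scanning parallel boxplots of the PC scores."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    _, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    scores = Z @ eigvecs                      # one column of scores per PC
    q1, q3 = np.percentile(scores, [25, 75], axis=0)
    iqr = q3 - q1
    outside = (scores < q1 - k * iqr) | (scores > q3 + k * iqr)
    return outside.any(axis=1)                # True where any PC flags the row
```

Because every PC is examined, this catches both the outliers visible on the first few PCs and those, like the height-and-weight example, visible only on the last few.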

