Page 272 - Jolliffe I. Principal Component Analysis
10.1. Detection of Outliers Using Principal Components
Note that $d_{1i}^2$, computed separately for several populations, is also used
in a form of discriminant analysis (SIMCA) by Wold (1976) (see Section 9.1).
Mertens et al. (1994) use this relationship to suggest modifications
to SIMCA. They investigate variants in which $d_{1i}^2$ is replaced by $d_{2i}^2$, $d_{3i}^2$
or $d_{4i}$ as a measure of the discrepancy between a new observation and a
group. In an example they find that $d_{2i}^2$, but not $d_{3i}^2$ or $d_{4i}$, improves the
cross-validated misclassification rate compared to that for $d_{1i}^2$.
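As a rough illustration of the four discrepancy measures compared here, the following Python sketch computes them from the last q sample PCs. It assumes the definitions given earlier in the chapter, which are not restated in this passage: $d_{1i}^2 = \sum z_{ik}^2$, $d_{2i}^2 = \sum z_{ik}^2 / l_k$, $d_{3i}^2 = \sum l_k z_{ik}^2$ and $d_{4i} = \max |z_{ik}| / \sqrt{l_k}$, with sums and maximum taken over the last q PCs, $z_{ik}$ the PC scores and $l_k$ their sample variances. The function name `pc_outlier_statistics` and the simulated data are hypothetical.

```python
import numpy as np

def pc_outlier_statistics(X, q):
    """Return d1^2, d2^2, d3^2 and d4 for each row of X, based on the last q PCs.

    Assumes the definitions stated in the lead-in; covariance-based PCA.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # centre each variable
    S = np.cov(Xc, rowvar=False)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigenvalues in ascending order
    l = eigvals[:q]                          # variances of the q lowest-variance PCs
    Z = Xc @ eigvecs[:, :q]                  # scores on those PCs
    d1_sq = np.sum(Z**2, axis=1)
    d2_sq = np.sum(Z**2 / l, axis=1)
    d3_sq = np.sum(Z**2 * l, axis=1)
    d4 = np.max(np.abs(Z) / np.sqrt(l), axis=1)
    return d1_sq, d2_sq, d3_sq, d4

# Hypothetical data: 100 standard normal observations with one planted outlier.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X[0] = [8.0, -8.0, 8.0, -8.0]
d1_sq, d2_sq, d3_sq, d4 = pc_outlier_statistics(X, q=4)
```

With q = p, as in this usage example, $d_{2i}^2$ coincides with the squared Mahalanobis distance of each observation from the sample mean, and $d_{1i}^2$ with the squared Euclidean distance.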
The exact distributions for $d_{1i}^2$, $d_{2i}^2$, $d_{3i}^2$ and $d_{4i}$ can be deduced if we assume
that the observations are from a multivariate normal distribution with
mean µ and covariance matrix Σ, where µ, Σ are both known (see Hawkins
(1980, p. 113) for results for $d_{2i}^2$, $d_{4i}$). Both $d_{3i}^2$ and $d_{2i}^2$ when q = p, as well
as $d_{1i}^2$, have (approximate) gamma distributions if no outliers are present
and if normality can be (approximately) assumed (Gnanadesikan and Kettenring,
1972), so that gamma probability plots of $d_{2i}^2$ (with q = p) and $d_{3i}^2$
can again be used to look for outliers. However, in practice µ, Σ are unknown,
and the data will often not have a multivariate normal distribution,
so that any distributional results derived under the restrictive assumptions
can only be approximations. Jackson (1991, Section 2.7.2) gives a fairly
complicated function of $d_{1i}^2$ that has, approximately, a standard normal
distribution when no outliers are present.
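A probability plot of this kind can be sketched as follows for $d_{2i}^2$ with q = p. The sketch rests on several assumptions that are not part of the text: the reference distribution is fixed at chi-squared with p degrees of freedom (a gamma distribution with shape p/2 and scale 2, which is exact only when µ and Σ are known and the data are multivariate normal), whereas a gamma plot may instead estimate the shape parameter from the data; sample estimates replace µ and Σ; and the quantiles come from the Wilson-Hilferty approximation rather than an exact gamma quantile function. The names `chi2_quantile_wh` and `gamma_plot_coordinates` are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def chi2_quantile_wh(prob, df):
    """Wilson-Hilferty approximation to chi-squared quantiles (not exact)."""
    z = np.array([NormalDist().inv_cdf(float(u)) for u in np.atleast_1d(prob)])
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * np.sqrt(c)) ** 3

def gamma_plot_coordinates(X):
    """Return (theoretical quantiles, ordered d2^2, ordering) for q = p."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Xc, rowvar=False))
    d2_sq = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)   # d2^2 with q = p
    order = np.argsort(d2_sq)                          # indices, smallest first
    probs = (np.arange(1, n + 1) - 0.5) / n            # plotting positions
    return chi2_quantile_wh(probs, p), d2_sq[order], order

# Hypothetical data with one planted outlier in row 0.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
X[0] = [8.0, -8.0, 8.0, -8.0]
theoretical, observed, order = gamma_plot_coordinates(X)
# Plot `observed` against `theoretical`; points far above the line through
# the bulk of the data are candidate outliers (order[-1] indexes the largest).
```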
In order to be satisfactory, such approximations to the distributions of
$d_{1i}^2$, $d_{2i}^2$, $d_{3i}^2$, $d_{4i}$ often need not be particularly accurate. Although there are
exceptions, such as detecting possible unusual patient behaviour in safety
data from clinical trials (see Penny and Jolliffe, 2001), outlier detection is
frequently concerned with finding observations that are blatantly different
from the rest, corresponding to very small significance levels for the test
statistics. An observation that is ‘barely significant at 5%’ is typically not
of interest, so that there is no great incentive to compute significance levels
very accurately. The outliers that we wish to detect should ‘stick out like
a sore thumb’ provided we find the right direction in which to view the
data; the problem in multivariate outlier detection is to find appropriate
directions. If, on the other hand, identification of less clear-cut outliers
is important and multivariate normality cannot be assumed, Dunn and
Duncan (2000) propose a procedure, in the context of evaluating habitat
suitability, for assessing ‘significance’ based on the empirical distribution of
their test statistics. The statistics they use are individual terms from $d_{2i}^2$.
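The general idea of judging a statistic against its empirical distribution, rather than a normal-theory reference, can be illustrated with a simple tail proportion. This is a generic sketch and should not be read as Dunn and Duncan's procedure, whose details are not given here. It assumes that the individual terms of $d_{2i}^2$ are $z_{ik}^2 / l_k$, and the function names and simulated reference sample are hypothetical.

```python
import numpy as np

def d2_terms(X_ref, x_new):
    """Individual terms z_k^2 / l_k for a reference sample and a new observation."""
    X_ref = np.asarray(X_ref, dtype=float)
    mean = X_ref.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X_ref - mean, rowvar=False))
    ref_terms = ((X_ref - mean) @ eigvecs) ** 2 / eigvals            # shape (n, p)
    new_terms = ((np.asarray(x_new, dtype=float) - mean) @ eigvecs) ** 2 / eigvals
    return ref_terms, new_terms

def empirical_tail_proportions(ref_terms, new_terms):
    """Per-PC proportion of reference terms at least as large as the new term.

    The +1 in numerator and denominator counts the new observation itself.
    """
    n = ref_terms.shape[0]
    return (1 + np.sum(ref_terms >= new_terms, axis=0)) / (n + 1)

# Hypothetical reference sample and an extreme new observation.
rng = np.random.default_rng(2)
X_ref = rng.normal(size=(200, 3))
ref_terms, new_terms = d2_terms(X_ref, [50.0, 50.0, 50.0])
proportions = empirical_tail_proportions(ref_terms, new_terms)
```

A small proportion for some PC indicates that the new observation is more extreme along that component than almost all of the reference sample, with no appeal to multivariate normality.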
PCs can be used to detect outliers in any multivariate data set, regardless
of the subsequent analysis which is envisaged for that data set. For par-
ticular types of data or analysis, other considerations come into play. For
multiple regression, Hocking (1984) suggests that plots of PCs derived from
(p + 1) variables consisting of the p predictor variables and the dependent
variable, as used in latent root regression (see Section 8.4), tend to reveal
outliers together with observations that are highly influential (Section 10.2)
for the regression equation. Plots of PCs derived from the predictor vari-
ables only also tend to reveal influential observations. Hocking’s (1984)

