Page 266 - Jolliffe I. Principal Component Analysis


Hoaglin et al. (1983) for a more readable approach), and robustness with
respect to distributional assumptions, as well as with respect to outlying or
influential observations, may be of interest. A number of techniques have
been suggested for robustly estimating PCs, and these are discussed in the
fourth section of this chapter; the final section presents a few concluding
remarks.
10.1 Detection of Outliers Using Principal Components

There is no formal, widely accepted, definition of what is meant by an
‘outlier.’ The books on the subject by Barnett and Lewis (1994) and Hawkins
(1980) both rely on informal, intuitive definitions, namely that outliers are
observations that are in some way different from, or inconsistent with, the
remainder of a data set. For p-variate data, this definition implies that
outliers are a long way from the rest of the observations in the p-dimensional
space defined by the variables. Numerous procedures have been suggested
for detecting outliers with respect to a single variable, and many of these
are reviewed by Barnett and Lewis (1994) and Hawkins (1980). The
literature on multivariate outliers is less extensive, with each of these two
books containing only one chapter (comprising less than 15% of their total
content) on the subject. Several approaches to the detection of multivariate
outliers use PCs, and these will now be discussed in some detail. As well as
the methods described in this section, which use PCs in fairly direct ways
to identify potential outliers, techniques for robustly estimating PCs (see
Section 10.4) may also be used to detect outlying observations.
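
The idea that an outlier lies a long way from the rest of the observations in
the p-dimensional space can be made concrete with a small simulation. The
following Python sketch is illustrative only and is not taken from the text:
the data are simulated, and the correlation of 0.9, the candidate point
(1.5, -1.5), the variable names, and the use of standardized PC scores as a
measure of distance are all assumptions chosen for the example. It compares
the candidate's z-scores on each variable separately with its scores on the
PCs of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Simulated data (hypothetical): 200 observations on two strongly
# positively correlated variables, each with mean 0 and variance 1.
cov_true = np.array([[1.0, 0.9],
                     [0.9, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=200)

# Candidate observation (an assumption of this sketch): moderately high
# on the first variable and moderately low on the second.
candidate = np.array([1.5, -1.5])

# Univariate view: z-scores of the candidate on each variable separately.
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)
z_marginal = (candidate - mean) / std

# PC view: eigendecomposition of the sample covariance matrix.
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues returned in ascending order
order = np.argsort(eigvals)[::-1]      # reorder so the first PC has largest variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Standardized PC scores: each score divided by that PC's standard deviation.
z_pc = ((candidate - mean) @ eigvecs) / np.sqrt(eigvals)

# The sum of squared standardized scores over all PCs equals the squared
# Mahalanobis distance of the candidate from the sample mean.
d2 = float(np.sum(z_pc ** 2))

print("marginal z-scores:      ", np.round(z_marginal, 2))
print("standardized PC scores: ", np.round(z_pc, 2))
print("squared distance:       ", round(d2, 2))
```

Under these assumptions the candidate should look unremarkable on each
variable taken alone, while its standardized score on the last
(low-variance) PC is large, because it runs against the positive correlation
in the rest of the data.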
A major problem in detecting multivariate outliers is that an observation
that is not extreme on any of the original variables can still be an outlier,
because it does not conform with the correlation structure of the remainder
of the data. It is impossible to detect such outliers by looking solely at the
original variables one at a time. As a simple example, suppose that heights
and weights are measured for a sample of healthy children of various ages
between 5 and 15 years old. Then an ‘observation’ with height and weight
of 175 cm (70 in) and 25 kg (55 lb), respectively, is not particularly extreme
on either the height or weight variables individually, as 175 cm is a plausible
height for the older children and 25 kg is a plausible weight for the youngest
children. However, the combination (175 cm, 25 kg) is virtually impossible,
and will be a clear outlier because it combines a large height with a small
weight, thus violating the general pattern of a positive correlation between
the two variables. Such an outlier is apparent on a plot of the two variables
(see Figure 10.1) but, if the number of variables p is large, it is quite possible
that some outliers will not be apparent on any of the $\frac{1}{2}p(p-1)$ plots of two
variables at a time. Thus, for large p we need to consider the possibility
