Page 269 - Jolliffe I. Principal Component Analysis


10. Outlier Detection, Influential Observations and Robust Estimation

Figure 10.2. The data set of Figure 10.1, plotted with respect to its PCs.

As well as simple plots of observations with respect to PCs, it is possible
to set up more formal tests for outliers based on PCs, assuming that the PCs
are normally distributed. Strictly, this assumes that x has a multivariate
normal distribution but, because the PCs are linear functions of p random
variables, an appeal to the Central Limit Theorem may justify approximate
normality for the PCs even when the original variables are not normal. A
battery of tests is then available for each individual PC, namely those for
testing for the presence of outliers in a sample of univariate normal data
(see Hawkins (1980, Chapter 3) and Barnett and Lewis (1994, Chapter 6)).
The latter reference describes 47 tests for univariate normal data, plus 23
for univariate gamma distributions and 17 for other distributions. Other
tests, which combine information from several PCs rather than examining
one at a time, are described by Gnanadesikan and Kettenring (1972) and
Hawkins (1974), and some of these will now be discussed. In particular, we
define four statistics, which are denoted $d_{1i}^2$, $d_{2i}^2$, $d_{3i}^2$ and $d_{4i}$.
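As a rough illustration of examining one PC at a time, the sketch below computes PC scores from the sample covariance matrix and, for each PC, finds the observation with the largest absolute studentized score. This is only an assumed, simple stand-in for the univariate tests catalogued by Hawkins (1980) and Barnett and Lewis (1994): no critical values are computed, and the function names and simulated data are hypothetical, not taken from the text.

```python
import numpy as np

def pc_scores(X):
    """Return PC scores (columns ordered by decreasing variance) and PC variances."""
    Xc = X - X.mean(axis=0)                    # centre each variable
    S = np.cov(Xc, rowvar=False)               # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return Xc @ eigvecs, eigvals

def max_studentized_deviation(scores):
    """For each PC, return the largest |score - mean| / sd and the row attaining it."""
    z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
    abs_z = np.abs(z)
    return abs_z.max(axis=0), abs_z.argmax(axis=0)

# Simulated data (an assumption for illustration): x3 is almost exactly x1 + x2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + X[:, 1] + 0.05 * rng.normal(size=50)
# Observation 7 breaks the relation, although each coordinate alone is unremarkable.
X[7] = [1.0, 1.0, 0.0]

scores, variances = pc_scores(X)
stat, idx = max_studentized_deviation(scores)
for k, (s, i) in enumerate(zip(stat, idx), start=1):
    print(f"PC{k}: variance={variances[k - 1]:.4f}, max |z|={s:.2f} at observation {i}")
```

In this simulated example the planted observation stands out only on the last, low-variance PC. In practice, the per-PC statistic would be compared with the critical value of whichever univariate test is adopted.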
The last few PCs are likely to be more useful than the first few in detecting
outliers that are not apparent from the original variables, so one
