Page 273 - Jolliffe I. Principal Component Analysis
10. Outlier Detection, Influential Observations and Robust Estimation
suggestions are illustrated with an example, but no indication is given of
whether the first few or last few PCs are more likely to be useful—his ex-
ample has only three predictor variables, so it is easy to look at all possible
plots. Mason and Gunst (1985) refer to outliers among the predictor vari-
ables as leverage points. They recommend constructing scatter plots of the
first few PCs normalized to have unit variance, and claim that such plots
are often effective in detecting leverage points that cluster and leverage
points that are extreme in two or more dimensions. In the case of multi-
variate regression, another possibility for detecting outliers (Gnanadesikan
and Kettenring, 1972) is to look at the PCs of the (multivariate) residuals
from the regression analysis.
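As a rough sketch of this idea (synthetic data and arbitrary parameter choices, not an example from Gnanadesikan and Kettenring, 1972), one can fit a multivariate regression by least squares and examine the PCs of the residual matrix; a planted outlier then appears as an extreme score on the first residual PC.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n observations, q predictors, m response variables
n, q, m = 50, 3, 4
X = rng.normal(size=(n, q))
B = rng.normal(size=(q, m))
Y = X @ B + 0.1 * rng.normal(size=(n, m))
Y[10] += 3.0                      # plant one outlying observation in the responses

# Multivariate least-squares fit and (n x m) residual matrix
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
E = Y - X @ B_hat

# PCs of the residuals: eigenvectors of their covariance matrix
E_c = E - E.mean(axis=0)
evals, evecs = np.linalg.eigh(E_c.T @ E_c / (n - 1))
scores = E_c @ evecs[:, ::-1]     # columns reordered to decreasing variance

# The planted outlier shows up as the extreme score on the first residual PC
print(np.argmax(np.abs(scores[:, 0])))
```

Because the outlier perturbs the responses but not the fitted systematic part, it dominates the residual covariance and hence the leading residual PC.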
Peña and Yohai (1999) propose a PCA on a matrix of regression diagnos-
tics that is also useful in detecting outliers in multiple regression. Suppose
that a sample of n observations is available for the analysis. Then an (n×n)
matrix can be calculated whose (h, i)th element is the difference ŷ_h − ŷ_h(i)
between the predicted value of the dependent variable y for the hth obser-
vation when all n observations are used in the regression, and when (n−1)
observations are used with the ith observation omitted. Peña and Yohai
(1999) refer to this as a sensitivity matrix and seek a unit-length vector
such that the sum of squared lengths of the projections of the rows of the
matrix onto that vector is maximized. This leads to the first principal com-
ponent of the sensitivity matrix, and subsequent components can be found
in the usual way. Peña and Yohai (1999) call these components principal
sensitivity components and show that they also represent directions that
maximize standardized changes to the vector of regression coefficients.
The definition and properties of principal sensitivity components mean that
high-leverage outliers are likely to appear as extremes on at least one of
the first few components.
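A minimal sketch of this construction follows, using the standard leave-one-out identity ŷ_h − ŷ_h(i) = h_hi e_i / (1 − h_ii) to build the sensitivity matrix without refitting the regression n times. The data are synthetic, not Peña and Yohai's own example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical single-response regression with one high-leverage outlier
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
X[5, 1:] = 6.0                      # move observation 5 far out in predictor space
beta = np.array([1.0, 2.0, -1.0])
y = X @ beta + 0.2 * rng.normal(size=n)
y[5] += 10.0                        # make the high-leverage point an outlier

# Hat matrix and residuals from the full fit
H = X @ np.linalg.solve(X.T @ X, X.T)
e = y - H @ y

# Sensitivity matrix: T[h, i] = yhat_h (all n obs) - yhat_h (obs i omitted),
# via the leave-one-out identity T[h, i] = H[h, i] * e[i] / (1 - H[i, i])
T = H * (e / (1.0 - np.diag(H)))

# Principal sensitivity components: unit vectors maximizing the summed
# squared projections of the rows of T, i.e. eigenvectors of T'T
evals, evecs = np.linalg.eigh(T.T @ T)
scores = T @ evecs[:, -1]           # projections onto the first component

# The high-leverage outlier is extreme on the first sensitivity component
print(np.argmax(np.abs(scores)))
```

Deleting the outlier changes every fitted value appreciably, so the corresponding row and column dominate the sensitivity matrix and the point stands out on the first component.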
Lu et al. (1997) also advocate the use of the PCs of a matrix of regres-
sion diagnostics. In their case the matrix is what they call the standardized
influence matrix (SIM). If a regression equation has p unknown parame-
ters and n observations with which to estimate them, a (p × n) influence
matrix can be formed whose (j, i)th element is a standardized version of
the theoretical influence function (see Section 10.2) for the jth parame-
ter evaluated for the ith observation. Leaving aside the technical details,
the so-called complement of the standardized influence matrix (SIM_c) can
be viewed as a covariance matrix for the ‘data’ in the influence matrix.
Lu et al. (1997) show that finding the PCs of these standardized data,
and hence the eigenvalues and eigenvectors of SIM_c, can identify outliers
and influential points and give insights into the structure of that influence.
Sample versions of SIM and SIM_c are given, as are illustrations of their
use.
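The full definitions of SIM and SIM_c involve technical details not reproduced here. The following sketch illustrates only the basic ingredient: a (p × n) matrix of empirical influences for OLS coefficients, with each row standardized by the corresponding coefficient's standard error, and a PCA of its columns used to flag an influential observation. The data are synthetic and the standardization is a plausible stand-in, not necessarily the one used by Lu et al. (1997).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical OLS data with one influential observation
n, p = 60, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
X[7, 1:] = 5.0                      # high leverage in predictor space
beta = np.array([0.5, 1.5, -2.0])
y = X @ beta + 0.3 * rng.normal(size=n)
y[7] += 8.0                         # outlying response as well

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
e = y - X @ beta_hat
s2 = e @ e / (n - p)

# Empirical influence of observation i on coefficient j:
# column i of the (p x n) matrix is (X'X)^{-1} x_i e_i,
# standardized row-wise by each coefficient's standard error
infl = XtX_inv @ (X.T * e)
se = np.sqrt(s2 * np.diag(XtX_inv))
M = infl / se[:, None]

# PCs of the columns of M: eigenvectors of the (p x p) matrix M M';
# scores of the observations on the first PC flag the influential point
evals, evecs = np.linalg.eigh(M @ M.T)
scores = M.T @ evecs[:, -1]
print(np.argmax(np.abs(scores)))
```

The influential observation contributes by far the largest column of the influence matrix, so it dominates the leading eigenvector and is extreme on the first component's scores.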
Another specialized field in which the use of PCs has been proposed in
order to detect outlying observations is that of statistical process control,
which is the subject of Section 13.7. A different way of using PCs to detect

