
10. Outlier Detection, Influential Observations and Robust Estimation
suggestions are illustrated with an example, but no indication is given of whether the first few or last few PCs are more likely to be useful; his example has only three predictor variables, so it is easy to look at all possible plots. Mason and Gunst (1985) refer to outliers among the predictor variables as leverage points. They recommend constructing scatter plots of the first few PCs normalized to have unit variance, and claim that such plots are often effective in detecting leverage points that cluster and leverage points that are extreme in two or more dimensions. In the case of multivariate regression, another possibility for detecting outliers (Gnanadesikan and Kettenring, 1972) is to look at the PCs of the (multivariate) residuals from the regression analysis.
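As a rough illustration of the Gnanadesikan and Kettenring (1972) suggestion, the following sketch fits a multivariate regression by least squares and examines the PCs of the residual matrix; the simulated data, the planted outlier, and the cut-off of five flagged observations are invented for illustration and are not from the original paper.

import numpy as np

rng = np.random.default_rng(0)
n, p, q = 100, 3, 2                       # observations, predictors, responses
X = rng.normal(size=(n, p))
B = rng.normal(size=(p, q))
Y = X @ B + 0.1 * rng.normal(size=(n, q))
Y[0] += 3.0                               # plant a single outlying observation

B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)   # multivariate least squares
E = Y - X @ B_hat                               # (n x q) residual matrix

E_c = E - E.mean(axis=0)                        # centre the residuals
eigvals, eigvecs = np.linalg.eigh(E_c.T @ E_c / (n - 1))
scores = E_c @ eigvecs[:, ::-1]                 # PC scores, largest variance first

# observations extreme on the residual PCs are candidate outliers
print(np.argsort(-np.abs(scores[:, 0]))[:5])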
Peña and Yohai (1999) propose a PCA on a matrix of regression diagnostics that is also useful in detecting outliers in multiple regression. Suppose that a sample of n observations is available for the analysis. Then an (n × n) matrix can be calculated whose (h, i)th element is the difference ŷ_h − ŷ_{h(i)} between the predicted value of the dependent variable y for the hth observation when all n observations are used in the regression, and when (n − 1) observations are used with the ith observation omitted. Peña and Yohai (1999) refer to this as a sensitivity matrix and seek a unit-length vector such that the sum of squared lengths of the projections of the rows of the matrix onto that vector is maximized. This leads to the first principal component of the sensitivity matrix, and subsequent components can be found in the usual way. Peña and Yohai (1999) call these components principal sensitivity components and show that they also represent directions that maximize standardized changes to the vector of regression coefficients. The definition and properties of principal sensitivity components mean that high-leverage outliers are likely to appear as extremes on at least one of the first few components.
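A minimal sketch of the computation may help. It uses the standard leave-one-out identity ŷ_h − ŷ_{h(i)} = h_{hi} e_i / (1 − h_{ii}), where h_{hi} is an element of the hat matrix and e_i the ith ordinary residual, so that no regression need be refitted; the simulated data and variable names are illustrative, not Peña and Yohai's own example.

import numpy as np

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
y[7] += 5.0                                   # contaminate one observation

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
e = y - H @ y                                 # ordinary residuals

# (n x n) sensitivity matrix: S[h, i] = H[h, i] * e[i] / (1 - H[i, i])
S = H * (e / (1.0 - np.diag(H)))[np.newaxis, :]

# the unit vector maximizing the summed squared projections of the rows
# of S is the leading eigenvector of S.T @ S
eigvals, eigvecs = np.linalg.eigh(S.T @ S)
psc1 = S @ eigvecs[:, -1]                     # first principal sensitivity component
print(np.argsort(-np.abs(psc1))[:3])          # extreme scores flag candidates

Using the leave-one-out identity keeps the cost at a single fit plus matrix operations, rather than n separate regressions.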
Lu et al. (1997) also advocate the use of the PCs of a matrix of regression diagnostics. In their case the matrix is what they call the standardized influence matrix (SIM). If a regression equation has p unknown parameters and n observations with which to estimate them, a (p × n) influence matrix can be formed whose (j, i)th element is a standardized version of the theoretical influence function (see Section 10.2) for the jth parameter evaluated for the ith observation. Leaving aside the technical details, the so-called complement of the standardized influence matrix (SIM_c) can be viewed as a covariance matrix for the ‘data’ in the influence matrix. Lu et al. (1997) show that finding the PCs of these standardized data, and hence the eigenvalues and eigenvectors of SIM_c, can identify outliers and influential points and give insights into the structure of that influence. Sample versions of SIM and SIM_c are given, as are illustrations of their use.
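Since the technical details of SIM and SIM_c are omitted here, the sketch below substitutes a DFBETAS-style empirical influence matrix for the standardized theoretical influence function; it should be read as an assumption-laden approximation of the flavour of Lu et al.'s construction, not their exact definition.

import numpy as np

rng = np.random.default_rng(2)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.5, -2.0]) + rng.normal(size=n)
y[4] += 4.0                                   # contaminate one observation

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T
e = y - H @ y
h = np.diag(H)

# (p x n) empirical influence matrix: column i is the (crudely
# standardized) change in the coefficient vector when observation i
# is deleted
delta = (XtX_inv @ X.T) * (e / (1.0 - h))       # DFBETA-style raw changes
T = delta / np.sqrt(np.diag(XtX_inv))[:, None]  # per-coefficient scaling

# treating the columns of T as 'data', an eigenanalysis of their
# cross-product matrix plays the role here of the PCs of SIM_c
eigvals, eigvecs = np.linalg.eigh(T @ T.T / n)
scores = T.T @ eigvecs[:, -1]
print(np.argsort(-np.abs(scores))[:3])          # most influential observations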
Another specialized field in which the use of PCs has been proposed in order to detect outlying observations is that of statistical process control, which is the subject of Section 13.7. A different way of using PCs to detect