Page 272 - Jolliffe I. Principal Component Analysis

10.1. Detection of Outliers Using Principal Components
                                Note that $d_{1i}^2$, computed separately for several populations, is also used
                              in a form of discriminant analysis (SIMCA) by Wold (1976) (see Section 9.1).
                              Mertens et al. (1994) use this relationship to suggest modifications
                              to SIMCA. They investigate variants in which $d_{1i}^2$ is replaced by $d_{2i}^2$, $d_{3i}^2$
                              or $d_{4i}$ as a measure of the discrepancy between a new observation and a
                              group. In an example they find that $d_{2i}^2$, but not $d_{3i}^2$ or $d_{4i}$, improves the
                              cross-validated misclassification rate compared to that for $d_{1i}^2$.
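
The idea of computing $d_{1i}^2$ separately for each population and using it as a measure of discrepancy can be illustrated with a small sketch. It assumes the definition given earlier in Section 10.1 (not reproduced on this page), namely that $d_{1i}^2$ is the sum of squared scores on the last q PCs. The nearest-group rule below is a deliberately simplified stand-in, not SIMCA itself or the variants of Mertens et al.; the function names `fit_group` and `d1_squared` and the synthetic data are hypothetical.

```python
import numpy as np

def fit_group(X, n_keep):
    """Fit a PCA to one group; return its mean and the loadings of the last q PCs."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    q = X.shape[1] - n_keep                  # number of low-variance PCs retained for d1
    return mean, eigvecs[:, :q]              # first q columns = smallest-variance PCs

def d1_squared(x, model):
    """Assumed d_1^2: sum of squared scores of x on the group's last q PCs."""
    mean, last_vecs = model
    z = (x - mean) @ last_vecs
    return float(np.sum(z ** 2))

rng = np.random.default_rng(0)
# Two synthetic groups in 3 dimensions, each lying close to a different plane.
A = rng.normal(size=(100, 3)) * np.array([3.0, 2.0, 0.1])   # thin along the third axis
B = rng.normal(size=(100, 3)) * np.array([0.1, 2.0, 3.0])   # thin along the first axis

models = {"A": fit_group(A, n_keep=2), "B": fit_group(B, n_keep=2)}

x_new = np.array([2.5, -1.0, 0.0])           # lies near group A's plane
dists = {g: d1_squared(x_new, m) for g, m in models.items()}
label = min(dists, key=dists.get)            # assign to the group with smallest discrepancy
print(dists, label)
```

Replacing `d1_squared` by a version that divides each squared score by its eigenvalue would give the corresponding rule based on $d_{2i}^2$, again under the assumed definition.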
                                The exact distributions for $d_{1i}^2$, $d_{2i}^2$, $d_{3i}^2$ and $d_{4i}$ can be deduced if we assume
                              that the observations are from a multivariate normal distribution with
                              mean µ and covariance matrix Σ, where µ, Σ are both known (see Hawkins
                              (1980, p. 113) for results for $d_{2i}^2$, $d_{4i}$). Both $d_{3i}^2$ and $d_{2i}^2$ when q = p, as well
                              as $d_{1i}^2$, have (approximate) gamma distributions if no outliers are present
                              and if normality can be (approximately) assumed (Gnanadesikan and Kettenring,
                              1972), so that gamma probability plots of $d_{2i}^2$ (with q = p) and $d_{3i}^2$
                              can again be used to look for outliers. However, in practice µ, Σ are un-
                              known, and the data will often not have a multivariate normal distribution,
                              so that any distributional results derived under the restrictive assumptions
                              can only be approximations. Jackson (1991, Section 2.7.2) gives a fairly
                              complicated function of $d_{1i}^2$ that has, approximately, a standard normal
                              distribution when no outliers are present.
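
The gamma probability plot mentioned above can be sketched numerically. The sketch assumes the definition from earlier in Section 10.1 (not reproduced on this page) that $d_{2i}^2$ is the sum of $z_{ik}^2 / l_k$ over the last q PCs, so that with q = p it coincides with the squared Mahalanobis distance. The reference distribution used here, a gamma with shape p/2 and scale 2 (that is, chi-squared with p degrees of freedom), is the one that holds exactly only for known µ and Σ; with estimated parameters it is an approximation. Reference quantiles are obtained by Monte Carlo rather than from an exact quantile function, and the data and mixing matrix `A` are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4

# Synthetic correlated normal data; the mixing matrix A is arbitrary (illustrative).
Z = rng.normal(size=(n, p))
Z[0] = 6.0                                   # plant one gross outlier in observation 0
A = np.array([[2.0, 0.5, 0.0, 0.0],
              [0.0, 1.0, 0.3, 0.0],
              [0.0, 0.0, 1.5, 0.2],
              [0.0, 0.0, 0.0, 0.5]])
X = Z @ A

# PCA on the sample covariance matrix.
xc = X - X.mean(axis=0)
S = np.cov(xc, rowvar=False)
l, a = np.linalg.eigh(S)                     # eigenvalues l_k, eigenvectors a_k
z = xc @ a                                   # PC scores z_ik

# Assumed d_2^2 with q = p: sum over all PCs of z_ik^2 / l_k.
d2 = np.sum(z ** 2 / l, axis=1)

# Check against the squared Mahalanobis distance computed directly.
maha = np.einsum("ij,jk,ik->i", xc, np.linalg.inv(S), xc)
assert np.allclose(d2, maha)

# Coordinates of a gamma probability plot: ordered d2 against reference quantiles.
probs = (np.arange(1, n + 1) - 0.5) / n      # plotting positions
ref = rng.gamma(shape=p / 2, scale=2.0, size=200_000)
theo = np.quantile(ref, probs)               # Monte Carlo gamma quantiles
order = np.argsort(d2)
ordered = d2[order]

# Points far above the line ordered = theo are candidate outliers.
print("most extreme observation:", order[-1])
print("observed vs reference quantile:", ordered[-1], theo[-1])
```

Plotting `ordered` against `theo` with any plotting tool gives the probability plot; in this synthetic example the planted observation sits far above the reference line, which is the 'sore thumb' behaviour one hopes for.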
                                In order to be satisfactory, such approximations to the distributions of
                              $d_{1i}^2$, $d_{2i}^2$, $d_{3i}^2$, $d_{4i}$ often need not be particularly accurate. Although there are
                              exceptions, such as detecting possible unusual patient behaviour in safety
                              data from clinical trials (see Penny and Jolliffe, 2001), outlier detection is
                              frequently concerned with finding observations that are blatantly different
                              from the rest, corresponding to very small significance levels for the test
                              statistics. An observation that is ‘barely significant at 5%’ is typically not
                              of interest, so that there is no great incentive to compute significance levels
                              very accurately. The outliers that we wish to detect should ‘stick out like
                              a sore thumb’ provided we find the right direction in which to view the
                              data; the problem in multivariate outlier detection is to find appropriate
                              directions. If, on the other hand, identification of less clear-cut outliers
                              is important and multivariate normality cannot be assumed, Dunn and
                              Duncan (2000) propose a procedure, in the context of evaluating habitat
                              suitability, for assessing ‘significance’ based on the empirical distribution of
                              their test statistics. The statistics they use are individual terms from $d_{2i}^2$.
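
To make concrete what assessing 'significance' from an empirical distribution can look like, here is a generic sketch. It is not Dunn and Duncan's procedure, whose details are not given here; it only shows a standard empirical tail probability applied to one individual term of $d_{2i}^2$, assumed (from the definition earlier in Section 10.1) to be $z_{ik}^2 / l_k$ for a single PC k. The helper `empirical_p_value` and the skewed synthetic data are hypothetical.

```python
import numpy as np

def empirical_p_value(reference, value):
    """Proportion of reference values at least as large as `value` (with +1 correction)."""
    reference = np.asarray(reference, dtype=float)
    return (1 + np.count_nonzero(reference >= value)) / (reference.size + 1)

rng = np.random.default_rng(2)

# Deliberately non-normal (skewed) synthetic data, treated as outlier-free reference data.
X = rng.lognormal(size=(500, 3))
xc = X - X.mean(axis=0)
l, a = np.linalg.eigh(np.cov(xc, rowvar=False))   # ascending eigenvalues

# Individual terms z_ik^2 / l_k; their row sums give the assumed d_2^2 with q = p.
terms = (xc @ a) ** 2 / l

k = 0                                             # term for the smallest-variance PC
ref = terms[:, k]

p_typical = empirical_p_value(ref, np.median(ref))    # an unremarkable value
p_extreme = empirical_p_value(ref, 10 * ref.max())    # far beyond anything observed
print(p_typical, p_extreme)
```

The smallest attainable value is 1/(n + 1), so the resolution of such an empirical 'significance level' is limited by the size of the reference sample.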
                                PCs can be used to detect outliers in any multivariate data set, regardless
                              of the subsequent analysis which is envisaged for that data set. For par-
                              ticular types of data or analysis, other considerations come into play. For
                              multiple regression, Hocking (1984) suggests that plots of PCs derived from
                              (p + 1) variables consisting of the p predictor variables and the dependent
                              variable, as used in latent root regression (see Section 8.4), tend to reveal
                              outliers together with observations that are highly influential (Section 10.2)
                              for the regression equation. Plots of PCs derived from the predictor vari-
                              ables only also tend to reveal influential observations. Hocking’s (1984)