Page 266 - Jolliffe I. Principal Component Analysis
                              Hoaglin et al. (1983) for a more readable approach), and robustness with
                              respect to distributional assumptions, as well as with respect to outlying or
                              influential observations, may be of interest. A number of techniques have
                              been suggested for robustly estimating PCs, and these are discussed in the
                              fourth section of this chapter; the final section presents a few concluding
                              remarks.
                              10.1 Detection of Outliers Using Principal Components


                              There is no formal, widely accepted, definition of what is meant by an ‘out-
                              lier.’ The books on the subject by Barnett and Lewis (1994) and Hawkins
                              (1980) both rely on informal, intuitive definitions, namely that outliers are
                              observations that are in some way different from, or inconsistent with, the
                              remainder of a data set. For p-variate data, this definition implies that out-
                              liers are a long way from the rest of the observations in the p-dimensional
                              space defined by the variables. Numerous procedures have been suggested
                              for detecting outliers with respect to a single variable, and many of these
                              are reviewed by Barnett and Lewis (1994) and Hawkins (1980). The lit-
                              erature on multivariate outliers is less extensive, with each of these two
                              books containing only one chapter (comprising less than 15% of their total
                              content) on the subject. Several approaches to the detection of multivariate
                              outliers use PCs, and these will now be discussed in some detail. As well as
                              the methods described in this section, which use PCs in fairly direct ways
                              to identify potential outliers, techniques for robustly estimating PCs (see
                              Section 10.4) may also be used to detect outlying observations.
                                A major problem in detecting multivariate outliers is that an observation
                              that is not extreme on any of the original variables can still be an outlier,
                              because it does not conform with the correlation structure of the remainder
                              of the data. It is impossible to detect such outliers by looking solely at the
                              original variables one at a time. As a simple example, suppose that heights
                              and weights are measured for a sample of healthy children of various ages
                              between 5 and 15 years old. Then an ‘observation’ with height and weight
                              of 175 cm (70 in) and 25 kg (55 lb), respectively, is not particularly extreme
                              on either the height or weight variables individually, as 175 cm is a plausible
                              height for the older children and 25 kg is a plausible weight for the youngest
                              children. However, the combination (175 cm, 25 kg) is virtually impossible,
                              and will be a clear outlier because it combines a large height with a small
                              weight, thus violating the general pattern of a positive correlation between
                              the two variables. Such an outlier is apparent on a plot of the two variables
                              (see Figure 10.1) but, if the number of variables p is large, it is quite possible
                              that some outliers will not be apparent on any of the $\frac{1}{2}p(p-1)$ plots of two
                              variables at a time. Thus, for large p we need to consider the possibility