Page 275 - Jolliffe I. Principal Component Analysis

10. Outlier Detection, Influential Observations and Robust Estimation
                              study of methods for detecting multivariate outliers. It did well compared to
                              other methods in some circumstances, particularly when there are multiple
                              outliers and p is not too large.
                                Before turning to examples, recall that an example in which outliers
                              are detected using PCs in a rather different way was given in Section 5.6.
                              In that example, Andrews’ curves (Andrews, 1972) were computed using
                              PCs and some of the observations stood out as different from the others
                              when plotted as curves. Further examination of these different observations
                              showed that they were indeed ‘outlying’ in some respects, compared to the
                              remaining observations.
                              10.1.1 Examples
                              In this section one example will be discussed in some detail, while three
                              others will be described more briefly.
                              Anatomical Measurements
                              A set of seven anatomical measurements on 28 students was discussed in
                              Section 5.1.1 and it was found that on a plot of the first two PCs
                              (Figures 1.3, 5.1) there was an extreme observation on the second PC. When
                              the measurements of this individual were examined in detail, it was found
                              that he had an anomalously small head circumference. Whereas the other
                              27 students all had head girths in the narrow range 21–24 cm, this student
                              (no. 16) had a measurement of 19 cm. It is impossible to check whether
                              this was an incorrect measurement or whether student 16 indeed had an
                              unusually small head (his other measurements were close to average), but
                              it is clear that this observation would be regarded as an ‘outlier’ according
                              to most definitions of the term.
                                This particular outlier is detected on the second PC, and it was
                              suggested above that any outliers detected by high-variance PCs are usually
                              detectable on examination of individual variables; this is indeed the case
                              here. Another point concerning this observation is that it is so extreme on
                              the second PC that it may be suspected that it alone is largely responsible
                              for the direction of this PC. This question will be investigated at the end
                              of Section 10.2, which deals with influential observations.
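
The point that an outlier on a high-variance PC usually also shows up in an individual variable can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic (not the students' measurements), the helper name `standardized_pc_scores` is hypothetical, column 6 merely plays the role of head girth, and PCA on the correlation matrix is assumed. One observation (index 15, i.e. the sixteenth) is given a head value of 19 while the others lie roughly between 21 and 24.

```python
import numpy as np

def standardized_pc_scores(X, k=2):
    """Return column z-scores and the first k PC scores scaled to unit variance.

    PCA is done on the correlation matrix (an assumption of this sketch).
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    order = np.argsort(eigvals)[::-1][:k]      # eigh returns ascending order
    scores = Z @ eigvecs[:, order]             # PC scores, variance = eigenvalue
    return Z, scores / np.sqrt(eigvals[order])

# Synthetic data: 28 observations, 7 variables driven by a common "size" factor.
rng = np.random.default_rng(0)
n, p = 28, 7
size = np.linspace(-1.5, 1.5, n)
X = size[:, None] + rng.normal(scale=0.1, size=(n, p))
X[:, 6] = 22.5 + 0.5 * size + rng.normal(scale=0.1, size=n)  # "head girth"
X[15, 6] = 19.0                                # planted anomalously small value

Z, S = standardized_pc_scores(X, k=2)

# Most extreme entry among the first two standardized PC scores.
obs_pc, which_pc = np.unravel_index(np.argmax(np.abs(S)), S.shape)
# Most extreme entry among the individual standardized variables.
obs_var, which_var = np.unravel_index(np.argmax(np.abs(Z)), Z.shape)

print("most extreme on PCs:       observation", obs_pc, "on PC", which_pc + 1)
print("most extreme on variables: observation", obs_var, "on variable", which_var)
```

Under these assumptions the planted observation should stand out in both views, which is the sense in which a high-variance PC adds little to a variable-by-variable inspection here. Absolute values are used because the sign of each eigenvector is arbitrary.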
                                Figure 1.3 indicates one other possible outlier at the extreme left of
                              the diagram. This turns out to be the largest student in the class, 190
                              cm (6 ft 3 in) tall, with all measurements except head girth at least as
                              large as those of the other 27 students. There is no suspicion here of any incorrect
                              measurements.
                                Turning now to the last few PCs, we hope to detect any observations
                              which are ‘outliers’ with respect to the correlation structure of the data.
                              Figure 10.3 gives a plot of the scores of the observations for the last two
                              PCs, and Table 10.1 gives the values of $d_{1i}^2$, $d_{2i}^2$ and $d_{4i}$, defined in
                              equations (10.1.1), (10.1.2) and (10.1.4), respectively, for the six ‘most extreme’