Page 400 - Jolliffe I. Principal Component Analysis

13.6. Principal Component Analysis in the Presence of Missing Data
                              of prior knowledge about q, there is, at present, no procedure for choosing
                              its value without repeating the analysis for a range of values.
                                Most published work, including Little and Rubin (1987), does not ex-
                              plicitly deal with PCA, but with the estimation of covariance matrices in
                              general. Tipping and Bishop (1999a) is one of relatively few papers that
                              focus specifically on PCA when discussing missing data. Another is Wiberg
                              (1976). His approach is via the singular value decomposition (SVD), which
                              gives a least squares approximation of rank m to the data matrix X. In
                              other words, the approximation ${}_m\tilde{x}_{ij}$ minimizes
                              $$\sum_{i=1}^{n} \sum_{j=1}^{p} ({}_m x_{ij} - x_{ij})^2,$$
                              where ${}_m x_{ij}$ is any rank m approximation to $x_{ij}$ (see Section 3.5). Principal
                              components can be computed from the SVD (see Section 3.5 and Appendix
                              A1). With missing data, Wiberg (1976) suggests minimizing the same quantity,
                              but with the summation only over values of (i, j) for which $x_{ij}$ is not
                              missing; PCs can then be estimated from the modified SVD. The same idea
                              is implicitly suggested by Gabriel and Zamir (1979). Wiberg (1976) reports
                              that for simulated multivariate normal data his method is slightly worse
                              than the method based on maximum likelihood estimation. However, his
                              method has the virtue that it can be used regardless of whether or not the
                              data come from a multivariate normal distribution.
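
The criterion with the summation restricted to non-missing (i, j) can be illustrated with a short numerical sketch. The code below is only an illustration under stated assumptions: it uses plain alternating least squares, which is swapped in here as a simple way to reduce the restricted sum of squares and is not claimed to be Wiberg's own algorithm. The function name `masked_low_rank`, its parameters, the small ridge term and the simulated data are all hypothetical, and the data matrix is assumed to be suitably centred beforehand.

```python
import numpy as np


def masked_low_rank(X, mask, m, n_iter=200, ridge=1e-8, seed=0):
    """Rank-m fit to X using only entries where mask is True.

    Alternating least squares on the factorisation A @ B.T, with the sum of
    squared errors taken over observed (i, j) only. Entries of X where mask
    is False are never read, so they may be NaN.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    A = np.zeros((n, m))                 # row factors (overwritten below)
    B = rng.standard_normal((p, m))      # column factors, random start
    reg = ridge * np.eye(m)              # tiny ridge: numerical safeguard only
    for _ in range(n_iter):
        # Update each row factor from the columns observed in that row.
        for i in range(n):
            obs = mask[i]
            Bo = B[obs]
            A[i] = np.linalg.solve(Bo.T @ Bo + reg, Bo.T @ X[i, obs])
        # Update each column factor from the rows observed in that column.
        for j in range(p):
            obs = mask[:, j]
            Ao = A[obs]
            B[j] = np.linalg.solve(Ao.T @ Ao + reg, Ao.T @ X[obs, j])
    return A @ B.T


# Simulated example (hypothetical data): an exact rank-2 matrix in which
# every second row has one value removed.
rng = np.random.default_rng(42)
X_true = rng.standard_normal((40, 2)) @ rng.standard_normal((2, 6))
mask = np.ones(X_true.shape, dtype=bool)
for i in range(0, 40, 2):
    mask[i, i % 6] = False
X_miss = np.where(mask, X_true, np.nan)

fit = masked_low_rank(X_miss, mask, m=2)
```

Estimated PCs could then be taken from the SVD of the fitted matrix `fit`, in the spirit of the modified SVD described above. When no values are missing, the same iteration targets the ordinary rank m least squares approximation given by the truncated SVD.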
                                For the specialized use of PCA in analysing residuals from an addi-
                              tive model for data from designed experiments (see Section 13.4), Freeman
                              (1975) shows that incomplete data can be easily handled, although mod-
                              ifications to procedures for deciding the rank of the model are needed.
                              Michailidis and de Leeuw (1998) note three ways of dealing with miss-
                              ing data in non-linear multivariate analysis, including non-linear PCA
                              (Section 14.1).
                                A special type of ‘missing’ data occurs when observations or variables
                              correspond to different times or different spatial locations, but with irreg-
                              ular spacing between them. In the common atmospheric science set-up,
                              where variables correspond to spatial locations, Karl et al. (1982) examine
                              differences between PCAs when locations are on a regularly spaced grid,
                              and when they are irregularly spaced. Unsurprisingly, for the irregular data
                              the locations in areas with the highest density of measurements tend to in-
                              crease their loadings on the leading PCs, compared to the regularly spaced
                              data. This is because of the larger correlations observed in the high-density
                              regions. Kaplan et al. (2001) discuss methodology based on PCA for inter-
                              polating spatial fields (see Section 12.4.4). Such interpolation is, in effect,
                              imputing missing data.
                                Another special type of data in which some values are missing occurs
                              when candidates choose to take a subset of p′ out of p examinations, with
                              different candidates choosing different subsets. Scores on examinations not