Page 400 - Jolliffe I. Principal Component Analysis
13.6. Principal Component Analysis in the Presence of Missing Data
of prior knowledge about q, there is, at present, no procedure for choosing
its value without repeating the analysis for a range of values.
Most published work, including Little and Rubin (1987), does not ex-
plicitly deal with PCA, but with the estimation of covariance matrices in
general. Tipping and Bishop (1999a) is one of relatively few papers that
focus specifically on PCA when discussing missing data. Another is Wiberg
(1976). His approach is via the singular value decomposition (SVD), which
gives a least squares approximation of rank m to the data matrix X. In
other words, the approximation ${}_m\tilde{x}_{ij}$ minimizes
\[
\sum_{i=1}^{n} \sum_{j=1}^{p} \left( {}_m x_{ij} - x_{ij} \right)^2 ,
\]
where ${}_m x_{ij}$ is any rank m approximation to $x_{ij}$ (see Section 3.5). Principal
components can be computed from the SVD (see Section 3.5 and Appendix
A1). With missing data, Wiberg (1976) suggests minimizing the same quantity,
but with the summation only over values of (i, j) for which $x_{ij}$ is not
missing; PCs can then be estimated from the modified SVD. The same idea
is implicitly suggested by Gabriel and Zamir (1979). Wiberg (1976) reports
that for simulated multivariate normal data his method is slightly worse
than the method based on maximum likelihood estimation. However, his
method has the virtue that it can be used regardless of whether or not the
data come from a multivariate normal distribution.
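
The idea of restricting the least squares criterion to the observed (i, j) can be illustrated with a short numerical sketch. The code below is not Wiberg's (1976) algorithm; it substitutes a simple alternating least squares scheme that minimizes the same masked sum of squares. The function name `masked_low_rank`, the use of NaN to mark missing values, the small ridge term and the iteration count are all assumptions made for this illustration.

```python
import numpy as np


def masked_low_rank(X, m, n_iter=200, ridge=1e-8, seed=0):
    """Rank-m least squares fit to the observed (non-NaN) entries of X.

    Illustrative alternating least squares, not Wiberg's own algorithm.
    The fit is written as A @ B.T with A (n x m) and B (p x m).
    """
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)              # True where x_ij is not missing
    n, p = X.shape
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, m))      # row factors
    B = rng.standard_normal((p, m))      # column factors
    reg = ridge * np.eye(m)              # tiny ridge keeps each solve well posed

    for _ in range(n_iter):
        # Update row i of A using only the columns observed in row i.
        for i in range(n):
            Bi = B[observed[i]]
            A[i] = np.linalg.solve(Bi.T @ Bi + reg, Bi.T @ X[i, observed[i]])
        # Update row j of B using only the rows observed in column j.
        for j in range(p):
            Aj = A[observed[:, j]]
            B[j] = np.linalg.solve(Aj.T @ Aj + reg, Aj.T @ X[observed[:, j], j])

    return A @ B.T


# Hypothetical usage: a 30 x 6 matrix of exact rank 2 with about 20% of entries missing.
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 6))
X_demo[rng.random(X_demo.shape) < 0.2] = np.nan
X_fit = masked_low_rank(X_demo, m=2)

# PCs can then be estimated from an SVD of the column-centred fitted matrix.
U, s, Vt = np.linalg.svd(X_fit - X_fit.mean(axis=0), full_matrices=False)
```

Each inner solve is an ordinary least squares problem in m unknowns, so no distributional assumption enters the computation. Alternating schemes of this kind can converge to a local minimum, so in practice several starting values may be worth trying.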
For the specialized use of PCA in analysing residuals from an addi-
tive model for data from designed experiments (see Section 13.4), Freeman
(1975) shows that incomplete data can be easily handled, although mod-
ifications to procedures for deciding the rank of the model are needed.
Michailidis and de Leeuw (1998) note three ways of dealing with miss-
ing data in non-linear multivariate analysis, including non-linear PCA
(Section 14.1).
A special type of ‘missing’ data occurs when observations or variables
correspond to different times or different spatial locations, but with irreg-
ular spacing between them. In the common atmospheric science set-up,
where variables correspond to spatial locations, Karl et al. (1982) examine
differences between PCAs when locations are on a regularly spaced grid,
and when they are irregularly spaced. Unsurprisingly, for the irregular data
the locations in areas with the highest density of measurements tend to in-
crease their loadings on the leading PCs, compared to the regularly spaced
data. This is because of the larger correlations observed in the high-density
regions. Kaplan et al. (2001) discuss methodology based on PCA for inter-
polating spatial fields (see Section 12.4.4). Such interpolation is, in effect,
imputing missing data.
Another special type of data in which some values are missing occurs
when candidates choose to take a subset of $p'$ out of p examinations, with
different candidates choosing different subsets. Scores on examinations not

