Page 398 - Jolliffe I. Principal Component Analysis
13.6 Principal Component Analysis in the
Presence of Missing Data
In all the examples given in this text, the data sets are complete. However,
it is not uncommon, especially for large data sets, for some of the values
of some of the variables to be missing. The most usual way of dealing with
such a situation is to delete, entirely, any observation for which at least one
of the variables has a missing value. This is satisfactory if missing values are
few, but clearly wasteful of information if a high proportion of observations
have missing values for just one or two variables. To address this problem, a
number of alternatives have been suggested.
The first step in a PCA is usually to compute the covariance or cor-
relation matrix, so interest often centres on estimating these matrices in
the presence of missing data. There are a number of what Little and Rubin (1987, Chapter 3) call ‘quick’ methods. One option is to compute the $(j, k)$th correlation or covariance element-wise, using all observations for which the values of both $x_j$ and $x_k$ are available. Unfortunately, this leads to covariance or correlation matrices that are not necessarily positive semidefinite. Beale and Little (1975) note a modification of this option. When computing the summation $\sum_i (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k)$ in the covariance or correlation matrix, $\bar{x}_j$, $\bar{x}_k$ are calculated from all available values of $x_j$, $x_k$, respectively, instead of only from observations for which both $x_j$ and $x_k$ have values present. They state that, at least in the regression context, the results can be unsatisfactory. However, Mehrota (1995), in discussing robust estimation of covariance matrices (see Section 10.4), argues that the problem of a possible lack of positive semi-definiteness is less important than making efficient use of as many data as possible. He therefore advocates element-wise estimation of the variances and covariances in a covariance matrix, with possible adjustment if positive semi-definiteness is lost.
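As a concrete sketch of the element-wise option, the following NumPy code (illustrative only; NaN marks a missing value, and the eigenvalue-clipping repair shown is just one possible adjustment, not one prescribed in the text) computes a pairwise-complete covariance matrix and restores positive semi-definiteness if it is lost:

```python
import numpy as np

def pairwise_covariance(X):
    """Element-wise ('pairwise complete') covariance estimate.

    Entry (j, k) uses every observation for which both variables
    j and k are observed (NaN marks a missing value).  The result
    is symmetric but not guaranteed to be positive semidefinite.
    """
    n, p = X.shape
    S = np.empty((p, p))
    for j in range(p):
        for k in range(j, p):
            ok = ~np.isnan(X[:, j]) & ~np.isnan(X[:, k])
            xj, xk = X[ok, j], X[ok, k]
            S[j, k] = S[k, j] = np.mean((xj - xj.mean()) * (xk - xk.mean()))
    return S

def make_psd(S):
    """One simple adjustment: clip negative eigenvalues to zero
    and reconstruct, yielding the nearest PSD matrix in this sense."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.clip(vals, 0, None)) @ vecs.T
```

In use, one would check `np.linalg.eigvalsh(S).min()` and apply `make_psd` only when a negative eigenvalue appears; both function names are invented for this sketch.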
Another quick method is to replace missing values for variable $x_j$ by the mean value $\bar{x}_j$, calculated from the observations for which the value of $x_j$ is available. This is a simple way of ‘imputing’ rather than ignoring missing
values. A more sophisticated method of imputation is to use regression of
the missing variables on the available variables case-by-case. An extension
to the idea of imputing missing values is multiple imputation. Each missing
value is replaced by a value drawn from a probability distribution, and
this procedure is repeated M times (Little and Rubin, 1987, Section 12.4;
Schafer, 1997, Section 4.3). The analysis, in our case PCA, is then done
M times, corresponding to each of the M different sets of imputed values.
The variability in the results of the analyses gives an indication of the
uncertainty associated with the presence of missing values.
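A minimal sketch of the multiple-imputation idea, assuming a deliberately crude imputation model (each missing value is drawn from a normal distribution with the column's observed mean and standard deviation, a stand-in for the proper imputation models discussed by Little and Rubin (1987) and Schafer (1997)), might look like this; the function name and the choice of summary are invented for illustration:

```python
import numpy as np

def multiple_imputation_pca(X, M=20, seed=0):
    """Repeat PCA on M independently imputed versions of X.

    Each missing value (NaN) is drawn from N(column mean, column
    variance), a crude imputation model.  The PCA here is the
    eigendecomposition of the covariance matrix of each completed
    data set; the spread of the leading eigenvalue across the M
    analyses indicates the uncertainty due to the missing values.
    """
    rng = np.random.default_rng(seed)
    mu = np.nanmean(X, axis=0)
    sd = np.nanstd(X, axis=0)
    miss = np.isnan(X)
    cols = miss.nonzero()[1]          # column index of each missing entry
    leading = []
    for _ in range(M):
        Xi = X.copy()
        Xi[miss] = rng.normal(mu[cols], sd[cols])   # one imputed data set
        S = np.cov(Xi, rowvar=False)
        leading.append(np.linalg.eigvalsh(S)[-1])   # largest eigenvalue
    leading = np.asarray(leading)
    return leading.mean(), leading.std()
```

The between-imputation standard deviation returned here plays the role of the ‘variability in the results of the analyses’ described above; a fuller treatment would track eigenvectors as well, with attention to sign and ordering ambiguities.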
A different class of procedures is based on maximum likelihood estima-
tion (Little and Rubin, 1987, Section 8.2). The well-known EM algorithm
(Dempster et al., 1977) can easily cope with maximum likelihood estimation

