Page 144 - Jolliffe I. Principal Component Analysis
6.1. How Many Principal Components?
that is the sum of the variances of the PCs is equal to the sum of the
variances of the elements of x. The obvious definition of ‘percentage of
variation accounted for by the first m PCs’ is therefore
\[ t_m = 100 \sum_{k=1}^{m} l_k \Big/ \sum_{j=1}^{p} s_{jj} = 100 \sum_{k=1}^{m} l_k \Big/ \sum_{k=1}^{p} l_k , \]
which reduces to
\[ t_m = \frac{100}{p} \sum_{k=1}^{m} l_k \]
in the case of a correlation matrix.
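The definition of t_m can be checked numerically. The following is a minimal sketch in Python with NumPy, not part of the original text: the data matrix is synthetic, and the helper name `cumulative_percentage` is an assumption made for illustration. It computes t_m from the eigenvalues of a covariance matrix and confirms that, for a correlation matrix, the denominator equals p.

```python
import numpy as np

def cumulative_percentage(eigenvalues):
    """Return t_m for m = 1, ..., p from the eigenvalues l_k (hypothetical helper)."""
    l = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # l_1 >= ... >= l_p
    return 100.0 * np.cumsum(l) / l.sum()

# Synthetic data (an assumption for illustration): n = 50 observations, p = 4 variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) @ np.diag([3.0, 2.0, 1.0, 0.5])

S = np.cov(X, rowvar=False)        # sample covariance matrix
R = np.corrcoef(X, rowvar=False)   # sample correlation matrix

l_cov = np.linalg.eigvalsh(S)      # eigenvalues, ascending order
l_cor = np.linalg.eigvalsh(R)

t_cov = cumulative_percentage(l_cov)
t_cor = cumulative_percentage(l_cor)

# For a correlation matrix the eigenvalues sum to p, so t_m = (100 / p) * sum of the first m.
p = R.shape[0]
t_cor_short = (100.0 / p) * np.cumsum(np.sort(l_cor)[::-1])
```

Here `t_cov[m - 1]` holds t_m for the covariance matrix, and `t_cor` and `t_cor_short` agree up to rounding error.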
Choosing a cut-off $t^*$ somewhere between 70% and 90% and retaining m
PCs, where m is the smallest integer for which $t_m > t^*$, provides a rule
which in practice preserves in the first m PCs most of the information in
x. The best value for $t^*$ will generally become smaller as p increases, or
as n, the number of observations, increases. Although a sensible cut-off is
very often in the range 70% to 90%, it can sometimes be higher or lower
depending on the practical details of a particular data set. For example,
a value greater than 90% will be appropriate when one or two PCs represent
very dominant and rather obvious sources of variation. Here the less
obvious structures beyond these could be of interest, and to find them a
cut-off higher than 90% may be necessary. Conversely, when p is very large
choosing m corresponding to 70% may give an impractically large value of
m for further analyses. In such cases the threshold should be set somewhat
lower.
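The cut-off rule itself is simple to express in code. The sketch below is an illustration under stated assumptions, not a procedure taken from the text: the function name `choose_m`, its default cut-off of 80%, and the eigenvalues are all hypothetical. It returns the smallest m for which t_m strictly exceeds the chosen cut-off.

```python
import numpy as np

def choose_m(eigenvalues, t_star=80.0):
    """Smallest m with t_m > t_star (hypothetical helper; t_star in percent)."""
    l = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]  # l_1 >= ... >= l_p
    t = 100.0 * np.cumsum(l) / l.sum()                        # t_1, ..., t_p
    exceeds = t > t_star
    if not exceeds.any():
        return len(l)          # no t_m exceeds the cut-off (e.g. t_star = 100): keep all PCs
    return int(np.argmax(exceeds)) + 1   # first index where t_m > t_star, converted to a count

# Hypothetical eigenvalues summing to 8, so t_m = 50, 75, 87.5, 93.75, 100.
eigenvalues = [4.0, 2.0, 1.0, 0.5, 0.5]

m_70 = choose_m(eigenvalues, t_star=70.0)   # 2, since t_2 = 75 > 70
m_90 = choose_m(eigenvalues, t_star=90.0)   # 4, since t_4 = 93.75 > 90
```

With these eigenvalues, raising the cut-off from 70% to 90% doubles the number of retained PCs, which shows how sensitive m can be to the choice of cut-off when the later eigenvalues are small and similar.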
Using the rule is, in a sense, equivalent to looking at the spectral
decomposition of the covariance (or correlation) matrix S (see Property A3
of Sections 2.1, 3.1), or the SVD of the data matrix X (see Section 3.5). In
either case, deciding how many terms to include in the decomposition in
order to get a good fit to S or X respectively is closely related to looking
at $t_m$, because an appropriate measure of lack-of-fit of the first m terms in
either decomposition is $\sum_{k=m+1}^{p} l_k$. This follows because
\[ \sum_{i=1}^{n} \sum_{j=1}^{p} ({}_m\tilde{x}_{ij} - x_{ij})^2 = (n-1) \sum_{k=m+1}^{p} l_k , \]
(Gabriel, 1978) and $\| {}_mS - S \| = \sum_{k=m+1}^{p} l_k$ (see the discussion of Property
G4 in Section 3.2), where ${}_m\tilde{x}_{ij}$ is the rank m approximation to $x_{ij}$ based
on the SVD as given in equation (3.5.3), and ${}_mS$ is the sum of the first m
terms of the spectral decomposition of S.
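The data-matrix identity attributed to Gabriel (1978) can be verified numerically. The sketch below is an illustration only: it assumes a synthetic, column-centred X and a covariance matrix S computed with divisor n − 1, and it covers only the SVD identity, not the statement about S. The variable names are not from the text.

```python
import numpy as np

# Synthetic column-centred data (an assumption for illustration).
rng = np.random.default_rng(1)
n, p, m = 30, 5, 2
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)

# Rank-m approximation of X from the first m terms of its SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_m = (U[:, :m] * s[:m]) @ Vt[:m, :]

# Eigenvalues l_1 >= ... >= l_p of the covariance matrix S = X'X / (n - 1).
S = X.T @ X / (n - 1)
l = np.linalg.eigvalsh(S)[::-1]

lhs = np.sum((X_m - X) ** 2)        # sum over i, j of squared residuals
rhs = (n - 1) * l[m:].sum()         # (n - 1) times the sum of the discarded eigenvalues

# The same quantity expressed through t_m: the percentage of variation NOT accounted for.
t_m = 100.0 * l[:m].sum() / l.sum()
lack_of_fit_pct = 100.0 * lhs / np.sum(X ** 2)
```

Up to rounding error, `lhs` equals `rhs`, and `lack_of_fit_pct` equals `100 - t_m`, which is the sense in which choosing m by t_m amounts to choosing how many terms of the decomposition to keep.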
A number of attempts have been made to find the distribution of $t_m$,
and hence to produce a formal procedure for choosing m, based on $t_m$.
Mandel (1972) presents some expected values for $t_m$ for the case where all
variables are independent, normally distributed, and have the same vari-

