Page 172 - Jolliffe I. Principal Component Analysis
6.3. Selecting a Subset of Variables
variable selection in factor analysis. Cadima and Jolliffe (2001) show that
Yanai’s coefficient can be written as
\[
\operatorname{corr}(P_q, P_m) = \frac{1}{\sqrt{qm}} \sum_{k=1}^{q} r_{km}^{2} \tag{6.3.4}
\]
where $r_{km}$ is the multiple correlation between the kth PC and the set of
m selected variables.
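The form of (6.3.4) can be checked in one line. The sketch below assumes the usual definition of a matrix correlation, corr(A, B) = tr(A'B) / sqrt(tr(A'A) tr(B'B)), which is not restated on this page, and introduces the notation $u_k$ (not used in the text) for the kth PC score vector scaled to unit length. It uses the facts that an orthogonal projector is symmetric and idempotent, so that its trace equals its rank, and that $u_k' P_m u_k$ is the squared multiple correlation of the kth PC with the selected variables when the data are column-centred.

```latex
\[
\operatorname{corr}(P_q, P_m)
  = \frac{\operatorname{tr}(P_q P_m)}{\sqrt{\operatorname{tr}(P_q)\,\operatorname{tr}(P_m)}}
  = \frac{1}{\sqrt{qm}} \sum_{k=1}^{q} u_k' P_m u_k
  = \frac{1}{\sqrt{qm}} \sum_{k=1}^{q} r_{km}^{2},
\qquad P_q = \sum_{k=1}^{q} u_k u_k' .
\]
```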
The second indicator examined by Cadima and Jolliffe (2001) is again a
matrix correlation, this time between the data matrix X and the matrix
formed by orthogonally projecting X onto the space spanned by the m
selected variables. It can be written
\[
\operatorname{corr}(X, P_m X) = \left( \frac{\sum_{k=1}^{p} \lambda_k r_{km}^{2}}{\sum_{k=1}^{p} \lambda_k} \right)^{1/2}. \tag{6.3.5}
\]
It turns out that this measure is equivalent to the second of McCabe’s
(1984) criteria defined above (see also McCabe (1986)). Cadima and Jol-
liffe (2001) discuss a number of other interpretations, and relationships
between their measures and previous suggestions in the literature. Both
indicators (6.3.4) and (6.3.5) are weighted averages of the squared multi-
ple correlations between each PC and the set of selected variables. In the
second measure, the weights are simply the eigenvalues of S, and hence the
variances of the PCs. For the first indicator the weights are positive and
equal for the first q PCs, but zero otherwise. Thus the first indicator ignores
PCs outside the chosen q-dimensional subspace when assessing closeness,
but it also gives less weight than the second indicator to the PCs with the
very largest variances relative to those with intermediate variances.
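As an illustration of how the two indicators relate to the squared multiple correlations, the following minimal NumPy sketch computes (6.3.4) and (6.3.5) for a given subset and checks them against the matrix-correlation definitions. The function names (`projector`, `matrix_corr`, `subset_indicators`) and the simulated data are assumptions made for this example, not code from Cadima and Jolliffe (2001). The sketch also assumes that the selected columns are linearly independent and that all PCs have non-zero variance.

```python
import numpy as np

def projector(A):
    """Orthogonal projector onto the column space of A (full column rank assumed)."""
    Q, _ = np.linalg.qr(A)
    return Q @ Q.T

def matrix_corr(A, B):
    """Matrix correlation tr(A'B) / sqrt(tr(A'A) tr(B'B))."""
    return np.trace(A.T @ B) / np.sqrt(np.trace(A.T @ A) * np.trace(B.T @ B))

def subset_indicators(X, subset, q):
    """Return indicator (6.3.4), indicator (6.3.5) and the vector of r_km^2."""
    Xc = X - X.mean(axis=0)                        # column-centre the data
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    lam = s**2 / (Xc.shape[0] - 1)                 # eigenvalues of S = PC variances
    P_m = projector(Xc[:, subset])
    # r_km^2 = u_k' P_m u_k, because each u_k is a centred, unit-length PC score vector
    r2 = np.einsum("ik,ij,jk->k", U, P_m, U)
    m = len(subset)
    gcd = r2[:q].sum() / np.sqrt(q * m)            # (6.3.4): equal weights on first q PCs
    rm = np.sqrt((lam * r2).sum() / lam.sum())     # (6.3.5): eigenvalue weights on all PCs
    return gcd, rm, r2

# Simulated correlated data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6)) @ rng.normal(size=(6, 6))
subset, q = [0, 2, 4], 2
gcd, rm, r2 = subset_indicators(X, subset, q)

# Sanity check against the matrix-correlation definitions
Xc = X - X.mean(axis=0)
U = np.linalg.svd(Xc, full_matrices=False)[0]
P_q = U[:, :q] @ U[:, :q].T
P_m = projector(Xc[:, subset])
assert np.isclose(gcd, matrix_corr(P_q, P_m))
assert np.isclose(rm, matrix_corr(Xc, P_m @ Xc))
```

Working from the vector of $r_{km}^2$ values makes the difference in weighting explicit: the first indicator uses only the first q entries, whereas the second uses all p entries weighted by the eigenvalues.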
Cadima and Jolliffe (2001) discuss algorithms for finding good subsets
of variables and demonstrate the use of the two measures on three exam-
ples, one of which is large (p = 62) compared to those typically used for
illustration. The examples show that the two measures can lead to quite
different optimal subsets, implying that it is necessary to know what aspect
of a subspace it is most desirable to preserve before choosing a subset of
variables to achieve this. They also show that
• the algorithms usually work efficiently in cases where numbers of
variables are small enough to allow comparisons with an exhaustive
search;
• as discussed elsewhere (Section 11.3), choosing variables on the basis
of the size of coefficients or loadings in the PCs’ eigenvectors can be
inadvisable;
• to match the information provided by the first q PCs it is often only
necessary to keep (q + 1) or (q + 2) variables.
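The exhaustive search used as a benchmark in the comparisons above can be sketched as follows. This is a brute-force illustration under stated assumptions, not one of the algorithms discussed by Cadima and Jolliffe (2001). The names `squared_multiple_corrs` and `exhaustive_search` are hypothetical, and every candidate subset is assumed to have linearly independent columns. It evaluates (6.3.4) and (6.3.5) for every subset of size m and records the best subset under each criterion.

```python
from itertools import combinations
import numpy as np

def squared_multiple_corrs(U, Xc, subset):
    """r_km^2 for every PC k, given unit-length PC score vectors in the columns of U."""
    Q, _ = np.linalg.qr(Xc[:, list(subset)])       # orthonormal basis of the subset's span
    return ((Q.T @ U) ** 2).sum(axis=0)            # ||Q' u_k||^2 = u_k' P_m u_k

def exhaustive_search(X, m, q):
    """Best size-m subsets under (6.3.4) and (6.3.5), found by full enumeration."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    lam = s**2                                     # proportional to the eigenvalues of S;
                                                   # the constant cancels in (6.3.5)
    best = {"gcd": (-np.inf, None), "rm": (-np.inf, None)}
    for subset in combinations(range(X.shape[1]), m):
        r2 = squared_multiple_corrs(U, Xc, subset)
        gcd = r2[:q].sum() / np.sqrt(q * m)        # indicator (6.3.4)
        rm = np.sqrt((lam * r2).sum() / lam.sum()) # indicator (6.3.5)
        if gcd > best["gcd"][0]:
            best["gcd"] = (gcd, subset)
        if rm > best["rm"][0]:
            best["rm"] = (rm, subset)
    return best

# Hypothetical usage on simulated data
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 5)) @ rng.normal(size=(5, 5))
best = exhaustive_search(X, m=3, q=2)
print(best["gcd"], best["rm"])
```

Because the loop visits every one of the p-choose-m subsets, this approach is practical only for small p.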
For data sets in which p is too large to conduct an exhaustive search
for the optimal subset, algorithms that can find a good subset are needed.

