Page 172 - Jolliffe I. Principal Component Analysis

6.3. Selecting a Subset of Variables
                              variable selection in factor analysis. Cadima and Jolliffe (2001) show that
                              Yanai’s coefficient can be written as
                              $$\operatorname{corr}(P_q, P_m) = \frac{1}{\sqrt{qm}} \sum_{k=1}^{q} r_{km}^{2} \tag{6.3.4}$$
                              where $r_{km}$ is the multiple correlation between the kth PC and the set of
                              m selected variables.
                                The second indicator examined by Cadima and Jolliffe (2001) is again a
                              matrix correlation, this time between the data matrix X and the matrix
                              formed by orthogonally projecting X onto the space spanned by the m
                              selected variables. It can be written
                              $$\operatorname{corr}(X, P_m X) = \sqrt{\frac{\sum_{k=1}^{p} \lambda_k r_{km}^{2}}{\sum_{k=1}^{p} \lambda_k}}. \tag{6.3.5}$$
                              It turns out that this measure is equivalent to the second of McCabe’s
                              (1984) criteria defined above (see also McCabe (1986)). Cadima and
                              Jolliffe (2001) discuss a number of other interpretations, and relationships
                              between their measures and previous suggestions in the literature. Both
                              indicators (6.3.4) and (6.3.5) are weighted averages of the squared
                              multiple correlations between each PC and the set of selected variables. In the
                              second measure, the weights are simply the eigenvalues of S, and hence the
                              variances of the PCs. For the first indicator the weights are positive and
                              equal for the first q PCs, but zero otherwise. Thus the first indicator ignores
                              PCs outside the chosen q-dimensional subspace when assessing closeness,
                              but it also gives less weight than the second indicator to the PCs with the
                              very largest variances relative to those with intermediate variances.
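
To make the two weighting schemes concrete, the following is a minimal Python sketch that evaluates (6.3.4) and (6.3.5) for a given subset of variables. It assumes NumPy, a data matrix `X` with more observations than variables and full column rank, and PCs taken from the sample covariance matrix S. The function name `subset_indicators` and its arguments are hypothetical, chosen only for illustration; this is not the implementation used by Cadima and Jolliffe (2001).

```python
import numpy as np


def subset_indicators(X, subset, q):
    """Evaluate (6.3.4) and (6.3.5) for the columns of X listed in `subset`.

    Returns a pair (gcd, rm): the first and second indicators respectively.
    Assumes X has full column rank (an illustrative simplification).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # column-centre the data
    n, p = Xc.shape
    m = len(subset)

    # Eigen-decomposition of the sample covariance matrix S, largest first.
    S = Xc.T @ Xc / (n - 1)
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                    # column k holds the kth PC

    # Orthonormal basis for the span of the m selected (centred) variables.
    Q, _ = np.linalg.qr(Xc[:, list(subset)])
    fitted = Q @ (Q.T @ scores)              # each PC projected onto that span

    # r_km^2 = fitted sum of squares / total sum of squares for the kth PC.
    ss_total = np.sum(scores**2, axis=0)
    ss_fitted = np.sum(fitted**2, axis=0)
    r2 = np.divide(ss_fitted, ss_total,
                   out=np.zeros(p), where=ss_total > 1e-12)

    gcd = r2[:q].sum() / np.sqrt(q * m)                     # (6.3.4)
    rm = np.sqrt(np.sum(eigvals * r2) / np.sum(eigvals))    # (6.3.5)
    return gcd, rm
```

The last two assignments show the difference in weighting directly: `gcd` uses only `r2[:q]` with equal weights, whereas `rm` weights every squared multiple correlation by the corresponding eigenvalue. If all p variables are selected, every squared multiple correlation equals 1, so the first indicator reduces to sqrt(q/p) and the second to 1.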
                                Cadima and Jolliffe (2001) discuss algorithms for finding good subsets
                              of variables and demonstrate the use of the two measures on three
                              examples, one of which is large (p = 62) compared to those typically used for
                              illustration. The examples show that the two measures can lead to quite
                              different optimal subsets, implying that it is necessary to know what aspect
                              of a subspace it is most desirable to preserve before choosing a subset of
                              variables to achieve this. They also show that
                                 • the algorithms usually work efficiently in cases where numbers of
                                   variables are small enough to allow comparisons with an exhaustive
                                   search;
                                 • as discussed elsewhere (Section 11.3), choosing variables on the basis
                                   of the size of coefficients or loadings in the PCs’ eigenvectors can be
                                   inadvisable;
                                 • to match the information provided by the first q PCs it is often only
                                   necessary to keep (q +1) or (q + 2) variables.
                                For data sets in which p is too large to conduct an exhaustive search
                              for the optimal subset, algorithms that can find a good subset are needed.
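
For small p, the exhaustive search used as a benchmark can be sketched in a few lines. The Python example below is only an illustration under stated assumptions: it uses NumPy, assumes `X` has full column rank with more observations than variables, and simply scores every subset of size m with (6.3.4) and (6.3.5). The name `exhaustive_search` and the returned dictionary layout are hypothetical, and this brute-force loop is not one of the algorithms discussed by Cadima and Jolliffe (2001).

```python
from itertools import combinations

import numpy as np


def exhaustive_search(X, m, q):
    """Score every m-variable subset with (6.3.4) and (6.3.5).

    Returns {"gcd": (best_subset, value), "rm": (best_subset, value)}.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                          # column-centre the data
    n, p = Xc.shape

    # PCs of the sample covariance matrix, ordered by decreasing variance.
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc / (n - 1))
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs
    ss_total = np.sum(scores**2, axis=0)

    best = {"gcd": (None, -np.inf), "rm": (None, -np.inf)}
    for subset in combinations(range(p), m):
        # Orthonormal basis for the span of the selected variables.
        Q, _ = np.linalg.qr(Xc[:, list(subset)])
        # Squared length of each PC's projection onto that span.
        ss_fitted = np.sum((Q.T @ scores) ** 2, axis=0)
        r2 = np.divide(ss_fitted, ss_total,
                       out=np.zeros(p), where=ss_total > 1e-12)

        gcd = r2[:q].sum() / np.sqrt(q * m)                    # (6.3.4)
        rm = np.sqrt(np.sum(eigvals * r2) / np.sum(eigvals))   # (6.3.5)
        if gcd > best["gcd"][1]:
            best["gcd"] = (subset, gcd)
        if rm > best["rm"][1]:
            best["rm"] = (subset, rm)
    return best
```

The loop visits every one of the "p choose m" subsets, so its cost grows combinatorially with p. Because the two criteria are tracked separately, the sketch also makes it easy to check whether they select different subsets for a given data set.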