Page 175 - Jolliffe I. Principal Component Analysis

6. Choosing a Subset of Principal Components or Variables
                              to compare what is being optimized here with the approaches described
                              earlier.
                                 • The RV-coefficient compares linear combinations of subsets of
                                   variables with the full set of variables.
                                 • Some methods, such as those of Jolliffe (1970, 1972, 1973), compare
                                   principal components of subsets of variables with principal
                                   components from the full set.
                                 • Some approaches, such as McCabe’s (1984) principal variables, simply
                                   compare subsets of the variables with the full set of variables.

                                 • Some criteria, such as Yanai’s generalized coefficient of determination,
                                   compare subspaces spanned by a subset of variables with subspaces
                                   spanned by a subset of PCs, as in Cadima and Jolliffe (2001).

                                No examples are presented by Robert and Escoufier (1976) of how their
                              method works in practice. However, Gonzalez et al. (1990) give a stepwise
                              algorithm for implementing the procedure and illustrate it with a small
                              example (n = 49; p = 6). The example is small enough for all subsets of
                              each size to be evaluated. Only for m = 1, 2, 3 does the stepwise algorithm
                              give the best subset with respect to RV, as identified by the full search.
                              Escoufier (1986) provides further discussion of the properties of the RV-
                              coefficient when used in this context.
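
The comparison between a stepwise search and a full search can be sketched in a few lines. The sketch below is illustrative only and rests on several assumptions: the data are synthetic random numbers (not the data of Gonzalez et al.), the RV-coefficient is taken in its usual form tr(S_XY S_YX) / sqrt(tr(S_XX^2) tr(S_YY^2)) for column-centred matrices, each subset is compared with the full set directly rather than through an optimized linear combination, and the greedy forward selection is a generic one, not necessarily the stepwise algorithm of Gonzalez et al. (1990). The function names are hypothetical.

```python
from itertools import combinations

import numpy as np


def rv_coefficient(X, Y):
    """RV-coefficient between two data matrices with the same n rows."""
    Xc = X - X.mean(axis=0)  # column-centre both matrices
    Yc = Y - Y.mean(axis=0)
    Sxy = Xc.T @ Yc
    Sxx = Xc.T @ Xc
    Syy = Yc.T @ Yc
    # Scaling constants such as 1/(n - 1) cancel in the ratio.
    return np.trace(Sxy @ Sxy.T) / np.sqrt(np.trace(Sxx @ Sxx) * np.trace(Syy @ Syy))


def best_subset_exhaustive(X, m):
    """Evaluate every subset of m columns and return the one with the largest RV."""
    p = X.shape[1]
    return max(combinations(range(p), m),
               key=lambda s: rv_coefficient(X, X[:, list(s)]))


def forward_stepwise(X, m):
    """Greedy forward selection: add the column that most increases RV at each step."""
    chosen = []
    remaining = list(range(X.shape[1]))
    while len(chosen) < m:
        best = max(remaining,
                   key=lambda j: rv_coefficient(X, X[:, chosen + [j]]))
        chosen.append(best)
        remaining.remove(best)
    return tuple(sorted(chosen))


# Synthetic correlated data with the same dimensions as the small example.
rng = np.random.default_rng(0)
X = rng.normal(size=(49, 6)) @ rng.normal(size=(6, 6))

for m in range(1, 6):
    full = best_subset_exhaustive(X, m)
    step = forward_stepwise(X, m)
    print(m, full, step, "agree" if full == step else "differ")
```

For m = 1 the two searches coincide by construction. For larger m the greedy search can only match or fall short of the full search, which is the kind of discrepancy reported above.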
                                Tanaka and Mori (1997) also use the RV-coefficient, as one of two criteria
                              for variable selection. They consider the same linear combinations $M X_1$ of
                              a given set of variables as Robert and Escoufier (1976), and call these linear
                              combinations modified principal components. Tanaka and Mori (1997)
                              assess how well a subset reproduces the full set of variables by means of
                              the RV-coefficient. They also have a second form of ‘modified’ principal
                              components, constructed by minimizing the trace of the residual covariance
                              matrix obtained by regressing $X$ on $M X_1$. This latter formulation is
                              similar to Rao’s (1964) PCA of instrumental variables (see Section 14.3).
                              The difference between Tanaka and Mori’s (1997) instrumental variable
                              approach and that of Rao (1964) is that Rao attempts to predict $X_2$, the
                              $(n \times (p - m))$ complementary matrix to $X_1$, using linear functions of $X_1$,
                              whereas Tanaka and Mori try to predict the full matrix $X$.
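
The difference between the two prediction targets can be made concrete with a small numerical sketch. It assumes synthetic data, takes the first m columns of X as the subset, and uses a hypothetical choice of M (the leading r principal axes of the subset) purely for illustration; it does not reproduce either author's optimal M. With the subset stored as an n × m array, the linear combinations are computed as `X1 @ M`, an assumption about orientation made for conformability. The function name `residual_trace` is also hypothetical.

```python
import numpy as np


def residual_trace(target, Z):
    """Trace of the residual covariance matrix after regressing target on Z."""
    coef, *_ = np.linalg.lstsq(Z, target, rcond=None)  # least-squares fit
    resid = target - Z @ coef
    return np.trace(resid.T @ resid) / (target.shape[0] - 1)


rng = np.random.default_rng(1)
n, p, m, r = 60, 6, 3, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # synthetic data
X = X - X.mean(axis=0)                                 # column-centre
X1, X2 = X[:, :m], X[:, m:]                            # subset and its complement

# Hypothetical M (m x r): leading r principal axes of the subset X1.
_, _, Vt = np.linalg.svd(X1, full_matrices=False)
M = Vt[:r].T
Z = X1 @ M                                             # r linear combinations of X1

full_target = residual_trace(X, Z)    # predict the full matrix X
comp_target = residual_trace(X2, Z)   # predict only the complement X2

print("residual trace, target X  :", full_target)
print("residual trace, target X2 :", comp_target)
```

Because X contains the columns of both the subset and its complement, the residual trace for the full target is never smaller than that for the complement alone. The two agree when all m columns of the subset are used as predictors, since the subset then predicts itself exactly.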
                                Both of Tanaka and Mori’s modified PCAs solve the same eigenequation

                              $$(S_{11}^2 + S_{12} S_{21}) a = l S_{11} a, \tag{6.3.6}$$

                              with obvious notation, but differ in the way that the quality of a subset
                              is measured. For the instrumental variable approach, the criterion
                              is proportional to $\sum_{k=1}^{m} l_k$, whereas for the components derived via the
                              RV-coefficient, quality is based on $\sum_{k=1}^{m} l_k^2$, where $l_k$ is the $k$th largest
                              eigenvalue in the solution of (6.3.6). A backward elimination method is
                              used to delete variables until some threshold is reached, although in the