Page 168 - Jolliffe I. Principal Component Analysis

                                Krzanowski (1983) examines the gas chromatography example further
                              by generating six different artificial data sets with the same sample covari-
                              ance matrix as the real data. The values of W are fairly stable across the
                              replicates and confirm the choice of four PCs obtained above by slightly de-
                              creasing the cut-off for W. For the full data set, with outliers not removed,
                              the replicates give some different, and useful, information from that in the
                              original data.
                              6.3 Selecting a Subset of Variables


                              When p, the number of variables observed, is large it is often the case that
                              a subset of m variables, with m ≪ p, contains virtually all the information
                              available in all p variables. It is then useful to determine an appropriate
                              value of m, and to decide which subset or subsets of m variables are best.
                                Solution of these two problems, the choice of m and the selection of a
                              good subset, depends on the purpose to which the subset of variables is
                              to be put. If the purpose is simply to preserve most of the variation in
                              x, then the PCs of x can be used fairly straightforwardly to solve both
                              problems, as will be explained shortly. A more familiar variable selection
                              problem is in multiple regression, and although PCA can contribute in this
                              context (see Section 8.5), it is used in a more complicated manner. This is
                              because external considerations, namely the relationships of the predictor
                              (regressor) variables with the dependent variable, as well as the internal
                              relationships between the regressor variables, must be considered. External
                              considerations are also relevant in other variable selection situations, for
                              example in discriminant analysis (Section 9.1); these situations will not be
                              considered in the present chapter. Furthermore, practical considerations,
                              such as ease of measurement of the selected variables, may be important in
                              some circumstances, and it must be stressed that such considerations, as
                              well as the purpose of the subsequent analysis, can play a prominent role in
                              variable selection. Here, however, we concentrate on the problem of finding
                              a subset of x in which the sole aim is to represent the internal variation of
                              x as well as possible.
                                Regarding the choice of m, the methods of Section 6.1 are all relevant.
                              The techniques described there find the number of PCs that account for
                              most of the variation in x, but they can also be interpreted as finding the
                              effective dimensionality of x. If x can be successfully described by only m
                              PCs, then it will often be true that x can be replaced by a subset of m (or
                              perhaps slightly more) variables, with a relatively small loss of information.
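
To make the idea of effective dimensionality concrete, the following is a minimal sketch of one rule of the kind discussed in Section 6.1: retain the smallest number of PCs whose cumulative share of the total variation reaches a chosen cut-off. The function name choose_m, the 90% cut-off, the use of the correlation matrix, and the simulated data are all illustrative assumptions, not part of the text.

```python
import numpy as np

def choose_m(X, threshold=0.90):
    """Return the smallest m such that the first m PCs account for at
    least `threshold` of the total variation, plus the cumulative shares.

    Assumption: PCs are taken from the correlation matrix of X
    (rows are observations, columns are variables).
    """
    R = np.corrcoef(X, rowvar=False)
    # eigvalsh returns eigenvalues in ascending order; reverse them
    eigvals = np.linalg.eigvalsh(R)[::-1]
    cum = np.cumsum(eigvals) / eigvals.sum()
    # index of the first cumulative share >= threshold, converted to a count
    m = int(np.searchsorted(cum, threshold)) + 1
    return min(m, len(eigvals)), cum

# Hypothetical data: p = 6 observed variables driven by 2 underlying factors
rng = np.random.default_rng(0)
n, p = 200, 6
factors = rng.normal(size=(n, 2))
loadings = rng.normal(size=(2, p))
X = factors @ loadings + 0.1 * rng.normal(size=(n, p))

m, cum = choose_m(X, threshold=0.90)
print("chosen m:", m)
print("cumulative share of variation:", np.round(cum, 3))
```

The value of m returned here is only a starting point for variable selection: as noted above, a subset of m, or perhaps slightly more, of the original variables may then be sought to replace x.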
                                Moving on to the choice of m variables, Jolliffe (1970, 1972, 1973) dis-
                              cussed a number of methods for selecting a subset of m variables that
                              preserve most of the variation in x. Some of the methods compared, and
                              indeed some of those which performed quite well, are based on PCs. Other