Krzanowski (1983) examines the gas chromatography example further
by generating six different artificial data sets with the same sample covari-
ance matrix as the real data. The values of W are fairly stable across the
replicates and confirm the choice of four PCs obtained above by slightly de-
creasing the cut-off for W. For the full data set, with outliers not removed,
the replicates provide some information that differs from, and usefully adds
to, that in the original data.
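
As a computational aside, the idea of generating artificial data whose sample
covariance matrix reproduces a given matrix can be sketched as follows. This
is only an illustrative Python/NumPy construction, not necessarily the
procedure Krzanowski (1983) used; the function name and the Cholesky-based
transformation are assumptions made here.

import numpy as np

def match_sample_covariance(S, n, rng):
    # Generate an n x p artificial data set whose *sample* covariance
    # matrix equals S exactly (requires n > p and S positive definite).
    # Illustrative construction; not necessarily Krzanowski's (1983) method.
    p = S.shape[0]
    Z = rng.standard_normal((n, p))
    Z -= Z.mean(axis=0)                               # centre the draws
    Lz = np.linalg.cholesky(np.cov(Z, rowvar=False))  # factor of simulated covariance
    L = np.linalg.cholesky(S)                         # factor of the target covariance
    return Z @ np.linalg.inv(Lz).T @ L.T              # sample covariance is now exactly S

rng = np.random.default_rng(0)
S = np.array([[2.0, 0.8], [0.8, 1.0]])                # hypothetical target matrix
replicates = [match_sample_covariance(S, 30, rng) for _ in range(6)]

Because the transformation forces the sample covariance of each replicate to
equal S, any statistic computed from the covariance matrix alone is identical
across replicates; differences between replicates arise only through
quantities that depend on the individual observations.
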
6.3 Selecting a Subset of Variables
When p, the number of variables observed, is large it is often the case that
a subset of m variables, with m ≪ p, contains virtually all the information
available in all p variables. It is then useful to determine an appropriate
value of m, and to decide which subset or subsets of m variables are best.
Solution of these two problems, the choice of m and the selection of a
good subset, depends on the purpose to which the subset of variables is
to be put. If the purpose is simply to preserve most of the variation in
x, then the PCs of x can be used fairly straightforwardly to solve both
problems, as will be explained shortly. A more familiar variable selection
problem is in multiple regression, and although PCA can contribute in this
context (see Section 8.5), it is used in a more complicated manner. This is
because external considerations, namely the relationships of the predictor
(regressor) variables with the dependent variable, as well as the internal
relationships between the regressor variables, must be considered. External
considerations are also relevant in other variable selection situations, for
example in discriminant analysis (Section 9.1); these situations will not be
considered in the present chapter. Furthermore, practical considerations,
such as ease of measurement of the selected variables, may be important in
some circumstances, and it must be stressed that such considerations, as
well as the purpose of the subsequent analysis, can play a prominent role in
variable selection. Here, however, we concentrate on the problem of finding
a subset of x in which the sole aim is to represent the internal variation of
x as well as possible.
Regarding the choice of m, the methods of Section 6.1 are all relevant.
The techniques described there find the number of PCs that account for
most of the variation in x, but they can also be interpreted as finding the
effective dimensionality of x. If x can be successfully described by only m
PCs, then it will often be true that x can be replaced by a subset of m (or
perhaps slightly more) variables, with a relatively small loss of information.
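
To make the link with Section 6.1 concrete, one simple rule (the cumulative
percentage of variance criterion) can be sketched as follows; the function
name choose_m and the 90% threshold are illustrative choices, not
prescriptions from the text.

import numpy as np

def choose_m(X, threshold=0.90):
    # Smallest m such that the first m PCs account for at least `threshold`
    # of the total variance of X (covariance-based PCA; the 90% default is
    # an illustrative cut-off, not a recommendation from the text).
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending order
    cum_prop = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cum_prop, threshold)) + 1

The value of m obtained in this way then serves as the target size for the
variable subsets considered next.
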
Moving on to the choice of m variables, Jolliffe (1970, 1972, 1973) dis-
cussed a number of methods for selecting a subset of m variables that
preserve most of the variation in x. Some of the methods compared, and
indeed some of those which performed quite well, are based on PCs. Other

