Page 176 - Jolliffe I. Principal Component Analysis

examples given by Tanaka and Mori (1997) the decision on when to stop
deleting variables appears to be rather subjective.
Mori et al. (1999) propose that the subsets selected in modified PCA
are also assessed by means of a PRESS criterion, similar to that defined in
equation (6.1.3), except that $_m\tilde{x}_{ij}$ is replaced by the prediction of $x_{ij}$ found
from modified PCA with the ith observation omitted. Mori et al. (2000)
demonstrate a procedure in which the PRESS criterion is used directly to
select variables, rather than as a supplement to another criterion. Tanaka
and Mori (1997) show how to evaluate the influence of variables on param-
eters in a PCA (see Section 10.2 for more on influence), and Mori et al.
(2000) implement and illustrate a backward-elimination variable selection
algorithm in which variables with the smallest influence are successively
removed.
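
The idea of using a leave-one-out PRESS value to drive backward elimination can be sketched in code. The sketch below is illustrative only and makes several assumptions that are not taken from the papers cited above: the function names `press_for_subset` and `backward_eliminate` are hypothetical, and modified PCA is replaced by a simpler stand-in, in which the first m ordinary PCs of the retained variables are used as least-squares predictors of all p variables. The stopping rule (a fixed final subset size `q_final`) is also an assumption.

```python
import numpy as np


def press_for_subset(X, subset, m):
    """Leave-one-out PRESS for predicting every column of X from the
    first m PCs of the columns listed in `subset`.

    Stand-in for modified PCA (an assumption of this sketch): ordinary
    PCs of the subset, followed by least-squares regression.
    """
    n, _ = X.shape
    press = 0.0
    for i in range(n):
        train = np.delete(X, i, axis=0)          # omit the ith observation
        mean = train.mean(axis=0)
        Xc = train - mean                        # centre without row i
        # Loadings of the first m PCs of the retained variables
        _, _, vt = np.linalg.svd(Xc[:, subset], full_matrices=False)
        load = vt[:m].T
        scores = Xc[:, subset] @ load
        # Least-squares map from the m PC scores to all variables
        coef, *_ = np.linalg.lstsq(scores, Xc, rcond=None)
        # Predict the omitted row from its retained variables only
        z_i = (X[i, subset] - mean[subset]) @ load
        pred = mean + z_i @ coef
        press += np.sum((pred - X[i]) ** 2)
    return press


def backward_eliminate(X, q_final, m):
    """Repeatedly delete the variable whose removal gives the smallest
    PRESS, until q_final variables remain (requires m <= q_final)."""
    subset = list(range(X.shape[1]))
    while len(subset) > q_final:
        candidates = [
            (press_for_subset(X, [v for v in subset if v != j], m), j)
            for j in subset
        ]
        _, drop = min(candidates)
        subset.remove(drop)
    return subset
```

Calling `backward_eliminate(X, q_final=4, m=2)` on an (n × p) data array `X` would return the indices of four retained variables. Because each candidate deletion requires n refits, the cost grows quickly with n and p, which is one reason a fixed subset size is used here in place of a data-driven stopping rule.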
Hawkins and Eplett (1982) describe a method which can be used for
selecting a subset of variables in regression; their technique and an ear-
lier one introduced by Hawkins (1973) are discussed in Sections 8.4 and
8.5. Hawkins and Eplett (1982) note that their method is also potentially
useful for selecting a subset of variables in situations other than multiple
regression, but, as with the RV-coefficient, no numerical example is given
in the original paper. Krzanowski (1987a,b) describes a methodology, using
principal components together with Procrustes rotation, for selecting
subsets of variables. As his main objective is preserving ‘structure’ such as
groups in the data, we postpone detailed discussion of his technique until
Section 9.2.2.

6.4 Examples Illustrating Variable Selection

Two examples are presented here; two other relevant examples are given in
Section 8.7.

6.4.1 Alate adelges (Winged Aphids)

These data were first presented by Jeffers (1967) and comprise 19 different
variables measured on 40 winged aphids. A description of the variables,
together with the correlation matrix and the coefficients of the first four
PCs based on the correlation matrix, is given by Jeffers (1967) and will
not be reproduced here. For 17 of the 19 variables all of the correlation
coefficients are positive, reflecting the fact that 12 variables are lengths
or breadths of parts of each individual, and some of the other (discrete)
variables also measure aspects of the size of each aphid. Not surprisingly,
the first PC based on the correlation matrix accounts for a large proportion
(73.0%) of the total variation, and this PC is a measure of overall size of
each aphid. The second PC, accounting for 12.5% of total variation, has its
