Page 174 - Jolliffe I. Principal Component Analysis
6.3. Selecting a Subset of Variables
of m variables, but rather than treating m as fixed they also consider
how to choose m. They use methods of variable selection due to Jolliffe
(1972, 1973), adding a new variant that was computationally infeasible in
1972. To choose m, King and Jackson (1999) consider the rules described
in Sections 6.1.1 and 6.1.2, including the broken stick method, together
with a rule that selects the largest value of m for which n/m > 3. To
assess the quality of a chosen subset of size m, King and Jackson (1999)
compare plots of scores on the first two PCs for the full data set and for
the data set containing only the m selected variables. They also compute a
Procrustes measure of fit (Krzanowski, 1987a) between the m-dimensional
configurations given by PC scores in the full and reduced data sets, and a
weighted average of correlations between PCs in the full and reduced data
sets.
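The broken stick rule referred to above (Section 6.1.2) retains the kth PC only while its share of the total variance exceeds the expected length of the kth-longest segment of a unit stick broken at random into p pieces. A minimal sketch, assuming the eigenvalues of the correlation matrix are available (the function name `broken_stick_m` is our own, not King and Jackson's):

```python
# Broken stick rule sketch: retain PC k while its variance proportion
# exceeds the broken-stick expectation g_k = (1/p) * sum_{i=k}^{p} 1/i.
# Stops at the first PC falling below its threshold.

def broken_stick_m(eigenvalues):
    p = len(eigenvalues)
    total = sum(eigenvalues)
    m = 0
    for k in range(1, p + 1):
        g_k = sum(1.0 / i for i in range(k, p + 1)) / p
        if eigenvalues[k - 1] / total > g_k:
            m += 1
        else:
            break
    return m
```

Because the thresholds g_k decay slowly, this rule tends to retain few components, consistent with the small values of m reported below.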
The data set analyzed by King and Jackson (1999) has n = 37 and
p = 36. The results of applying the various selection procedures to these
data confirm, as Jolliffe (1972, 1973) found, that methods B2 and B4 do
reasonably well. The results also confirm that the broken stick method
generally chooses smaller values of m than the other methods, though its
subsets do better with respect to the Procrustes measure of fit than some
much larger subsets. The small number of variables retained by the broken
stick implies a correspondingly small proportion of total variance accounted
for by the subsets it selects. King and Jackson’s (1999) recommendation of
method B4 with the broken stick could therefore be challenged.
We conclude this section by briefly describing a number of other possible
methods for variable selection. None uses PCs directly to select variables,
but all are related to topics discussed more fully in other sections or chap-
ters. Bartkowiak (1991) uses a method described earlier in Bartkowiak
(1982) to select a set of ‘representative’ variables in an example that also
illustrates the choice of the number of PCs (see Section 6.1.8). Variables
are added sequentially to a ‘representative set’ by considering each vari-
able currently outside the set as a candidate for inclusion. The maximum
residual sum of squares is calculated from multiple linear regressions of
each of the other excluded variables on all the variables in the set plus the
candidate variable. The candidate for which this maximum sum of squares
is minimized is then added to the set. One of Jolliffe’s (1970, 1972, 1973)
rules uses a similar idea, but in a non-sequential way. A set of m variables
is chosen if it maximizes the minimum multiple correlation between each
of the (p − m) non-selected variables and the set of m selected variables.
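A minimal sketch of this non-sequential rule, using exhaustive search over subsets of the correlation matrix and so practical only for small p (the function name `best_subset` is our own illustrative choice, not Jolliffe's):

```python
# Jolliffe's non-sequential rule sketch: among all subsets S of size m,
# choose the one maximizing the minimum squared multiple correlation
# between each excluded variable and the variables in S.

import itertools
import numpy as np

def best_subset(R, m):
    """R is the (p x p) correlation matrix of the variables."""
    p = R.shape[0]
    best, best_score = None, -np.inf
    for S in itertools.combinations(range(p), m):
        S = list(S)
        excluded = [j for j in range(p) if j not in S]
        R_SS_inv = np.linalg.inv(R[np.ix_(S, S)])
        # squared multiple correlation of each excluded variable on S
        r2 = [R[j, S] @ R_SS_inv @ R[S, j] for j in excluded]
        score = min(r2)
        if score > best_score:
            best, best_score = S, score
    return best, best_score
```

The exhaustive search over all p-choose-m subsets is what made such rules computationally demanding in the early 1970s, as noted at the start of this section.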
The RV-coefficient, due to Robert and Escoufier (1976), was defined in
Section 3.2. To use the coefficient to select a subset of variables, Robert
and Escoufier suggest finding X1 which maximizes RV(X, MX1), where
RV(X, Y) is defined by equation (3.2.2) of Section 3.2. The matrix X1
is the (n × m) submatrix of X consisting of n observations on a subset
of m variables, and M is a specific (m × m) orthogonal matrix, whose
construction is described in Robert and Escoufier’s paper. It is interesting

