Page 415 - Jolliffe I. Principal Component Analysis
14. Generalizations and Adaptations of Principal Component Analysis
networks interface are covered briefly in Section 14.6.1.
Diamantaras and Kung (1996, Section 6.6) give a general definition of
non-linear PCA as minimizing
$$E\left[\, \lVert x - g(h(x)) \rVert^{2} \,\right], \qquad (14.1.4)$$
where y = h(x) is a q (< p)-dimensional function of x and g(y) is a
p-dimensional function of y. The functions g(.), h(.) are chosen from some
given sets of non-linear functions so as to minimize (14.1.4). When g(.)
and h(.) are restricted to be linear functions, it follows from Property A5
of Section 2.1 that minimizing (14.1.4) gives the usual (linear) PCs.
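
The linear case can be checked numerically on a small example. The following is a minimal sketch, not taken from the source: it assumes p = 2 and q = 1, restricts h and g to an orthogonal projection onto a line through the sample mean, and uses simulated data. The function names `first_pc` and `reconstruction_error` are hypothetical. It compares the mean squared reconstruction error of the first PC direction with that of a grid of other directions.

```python
import math
import random

def first_pc(data):
    """Return (unit direction of the first PC, sample mean) for 2-D data."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)
    # Angle of the leading eigenvector of the 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(angle), math.sin(angle)), (mx, my)

def reconstruction_error(data, direction, mean):
    """Sample version of (14.1.4) for a linear h and g with q = 1."""
    ax, ay = direction
    mx, my = mean
    total = 0.0
    for x, y in data:
        cx, cy = x - mx, y - my
        score = ax * cx + ay * cy                    # y = h(x)
        rx, ry = cx - score * ax, cy - score * ay    # x - g(h(x))
        total += rx * rx + ry * ry
    return total / len(data)

# Simulated, correlated two-variable data (an assumption for illustration).
random.seed(0)
data = []
for _ in range(200):
    t = random.gauss(0.0, 2.0)
    data.append((t + random.gauss(0.0, 0.3), 0.5 * t + random.gauss(0.0, 0.3)))

direction, mean = first_pc(data)
pc_error = reconstruction_error(data, direction, mean)

# Errors for 180 alternative directions, one per degree.
grid_errors = [
    reconstruction_error(data, (math.cos(a), math.sin(a)), mean)
    for a in (math.pi * k / 180 for k in range(180))
]
print(pc_error, min(grid_errors))
```

Under these assumptions, no direction on the grid should give a smaller error than the first PC direction, apart from floating-point rounding.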
Diamantaras and Kung (1996, Section 6.6.1) note that, for some types
of network, allowing non-linear functions leads to no improvement in
minimizing (14.1.4) compared to the linear case. Kramer (1991) describes a
network for which improvement does occur. There are two parts to the
network, one that creates the components $z_k$ from the p variables $x_j$, and
a second that approximates the p variables given a reduced set of m (<p)
components. The components are constructed from the variables by means
of the formula
$$z_k = \sum_{l=1}^{N} w_{lk2}\, \sigma\!\left( \sum_{j=1}^{p} w_{jl1} x_j + \theta_l \right),$$
where
$$\sigma\!\left( \sum_{j=1}^{p} w_{jl1} x_j + \theta_l \right) = \left[ 1 + \exp\!\left( -\sum_{j=1}^{p} w_{jl1} x_j - \theta_l \right) \right]^{-1}, \qquad (14.1.5)$$
in which $w_{lk2}$, $w_{jl1}$, $\theta_l$, j = 1, 2, ..., p; k = 1, 2, ..., m; l = 1, 2, ..., N are
constants to be chosen, and N is the number of nodes in the hidden layer.
A similar equation relates the estimated variables $\hat{x}_j$ to the components $z_k$,
and Kramer (1991) combines both relationships into a single network. The
objective is to find the values of all the unknown constants so as to minimize
the Euclidean norm of the matrix of residuals formed by estimating n
values of each $x_j$ by the corresponding values of $\hat{x}_j$. This is therefore a
special case of Diamantaras and Kung’s general formulation with g(.), h(.)
both restricted to the class of non-linear functions defined by (14.1.5).
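
The formula for the components, together with (14.1.5), can be written out directly. The following is a minimal sketch of that first part of the network only. The function names and the nested-list layout of the weights are assumptions made for illustration, not Kramer's implementation, and no fitting of the constants is attempted.

```python
import math

def sigmoid(u):
    """Logistic function [1 + exp(-u)]^(-1), as in (14.1.5)."""
    return 1.0 / (1.0 + math.exp(-u))

def components(x, w1, theta, w2):
    """Compute z_k = sum_l w2[l][k] * sigmoid(sum_j w1[j][l] * x[j] + theta[l]).

    x     : list of the p variables x_j
    w1    : p x N nested list; w1[j][l] plays the role of w_{jl1}
    theta : list of the N constants theta_l
    w2    : N x m nested list; w2[l][k] plays the role of w_{lk2}
    """
    p, n_hidden = len(x), len(theta)
    m = len(w2[0])
    # Output of each of the N hidden-layer nodes.
    hidden = [
        sigmoid(sum(w1[j][l] * x[j] for j in range(p)) + theta[l])
        for l in range(n_hidden)
    ]
    # Each component is a weighted sum of the hidden-node outputs.
    return [sum(w2[l][k] * hidden[l] for l in range(n_hidden)) for k in range(m)]

# Illustrative values only: p = 3 variables, N = 2 hidden nodes, m = 2 components.
x = [1.0, -2.0, 0.5]
w1 = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # all-zero weights give sigmoid(0) = 0.5
theta = [0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 2.0]]
z = components(x, w1, theta, w2)
print(z)
```

With all first-layer weights and $\theta_l$ set to zero, every hidden node outputs 0.5, so each $z_k$ is half the sum of the corresponding column of `w2`. This gives a quick sanity check on the indexing.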
For Kramer’s network, m and N need to be chosen, and he discusses various
strategies for doing this, including the use of information criteria such as
AIC (Akaike, 1974) and the comparison of errors in training and test sets
to avoid overfitting. In the approach just described, m components are
calculated simultaneously, but Kramer (1991) also discusses a sequential
version in which one component at a time is extracted. Two examples are
given of very different sizes. One is a two-variable artificial example in which
non-linear PCA finds a built-in non-linearity. The second is from chemical
engineering with 100 variables, and again non-linear PCA appears to be
superior to its linear counterpart.
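
The choice of m and N by an information criterion can be sketched as follows. This is an illustration under stated assumptions rather than Kramer's procedure: it uses the common least-squares form of AIC, n ln(SSE/n) + 2k, and it counts the constants as if the equation for the estimated variables mirrored the one for the components. The function names and the error values in the example are hypothetical.

```python
import math

def count_constants(p, m, n_hidden):
    """Count unknown constants, assuming the second half mirrors the first."""
    mapping = p * n_hidden + n_hidden + n_hidden * m      # w_{jl1}, theta_l, w_{lk2}
    demapping = m * n_hidden + n_hidden + n_hidden * p    # assumed mirror image
    return mapping + demapping

def aic_least_squares(sse, n_values, n_constants):
    """Least-squares form of AIC: n * ln(SSE / n) + 2 * (number of constants)."""
    return n_values * math.log(sse / n_values) + 2 * n_constants

def pick_architecture(candidates, p, n_values):
    """candidates maps (m, N) to a training SSE; return the pair with lowest AIC."""
    return min(
        candidates,
        key=lambda mn: aic_least_squares(
            candidates[mn], n_values, count_constants(p, mn[0], mn[1])
        ),
    )

# Hypothetical sums of squared residuals for three candidate networks,
# with p = 2 variables and 100 observations (200 fitted values in total).
candidates = {(1, 2): 40.0, (1, 4): 12.0, (1, 8): 11.5}
best = pick_architecture(candidates, p=2, n_values=200)
print(best)
```

In this made-up example, the largest network reduces the residual sum of squares only slightly, so the penalty on the number of constants favours the intermediate choice. Comparing errors on training and test sets, the other strategy mentioned, would replace the criterion with a held-out error.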

