
14. Generalizations and Adaptations of Principal Component Analysis
…networks interface are covered briefly in Section 14.6.1.
Diamantaras and Kung (1996, Section 6.6) give a general definition of non-linear PCA as minimizing

$$\mathrm{E}\left[\left\|\mathbf{x} - g(h(\mathbf{x}))\right\|^{2}\right], \tag{14.1.4}$$

where $\mathbf{y} = h(\mathbf{x})$ is a $q\,(<p)$-dimensional function of $\mathbf{x}$ and $g(\mathbf{y})$ is a $p$-dimensional function of $\mathbf{y}$. The functions $g(\cdot)$, $h(\cdot)$ are chosen from some given sets of non-linear functions so as to minimize (14.1.4). When $g(\cdot)$ and $h(\cdot)$ are restricted to be linear functions, it follows from Property A5 of Section 2.1 that minimizing (14.1.4) gives the usual (linear) PCs.
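To make the linear special case concrete, the following sketch (not from the text; the data and dimensions are illustrative) verifies numerically that when $h(\mathbf{x}) = \mathbf{A}_q'\mathbf{x}$ and $g(\mathbf{z}) = \mathbf{A}_q\mathbf{z}$, with $\mathbf{A}_q$ holding the first $q$ eigenvectors of the covariance matrix, the reconstruction error in (14.1.4) equals the sum of the discarded eigenvalues, as Property A5 guarantees.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 5, 2
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))  # correlated data
X -= X.mean(axis=0)                        # centre, as PCA assumes

S = np.cov(X, rowvar=False)                # sample covariance (n-1 denominator)
eigvals, eigvecs = np.linalg.eigh(S)       # eigh returns ascending order
eigvals, A = eigvals[::-1], eigvecs[:, ::-1]

A_q = A[:, :q]                             # first q eigenvectors
Z = X @ A_q                                # h(x) = A_q' x: the q PC scores
X_hat = Z @ A_q.T                          # g(z) = A_q z: back to p dimensions

sse = np.sum((X - X_hat) ** 2) / (n - 1)   # residual norm on the n-1 scale
print(np.isclose(sse, eigvals[q:].sum()))  # True: error = discarded variance
```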
Diamantaras and Kung (1996, Section 6.6.1) note that for some types of network allowing non-linear functions leads to no improvement in minimizing (14.1.4) compared to the linear case. Kramer (1991) describes a network for which improvement does occur. There are two parts to the network: one that creates the components $z_k$ from the $p$ variables $x_j$, and a second that approximates the $p$ variables given a reduced set of $m\,(<p)$ components. The components are constructed from the variables by means of the formula

$$z_{k} = \sum_{l=1}^{N} w_{lk2}\,\sigma\!\left(\sum_{j=1}^{p} w_{jl1}x_{j} + \theta_{l}\right),$$

where

$$\sigma\!\left(\sum_{j=1}^{p} w_{jl1}x_{j} + \theta_{l}\right) = \left[1 + \exp\!\left(-\sum_{j=1}^{p} w_{jl1}x_{j} - \theta_{l}\right)\right]^{-1}, \tag{14.1.5}$$
in which $w_{lk2}$, $w_{jl1}$, $\theta_{l}$, $j = 1, 2, \ldots, p$; $k = 1, 2, \ldots, m$; $l = 1, 2, \ldots, N$ are constants to be chosen, and $N$ is the number of nodes in the hidden layer. A similar equation relates the estimated variables $\hat{x}_j$ to the components $z_k$, and Kramer (1991) combines both relationships into a single network. The objective is to find the values of all the unknown constants so as to minimize the Euclidean norm of the matrix of residuals formed by estimating $n$ values of each $x_j$ by the corresponding values of $\hat{x}_j$. This is therefore a special case of Diamantaras and Kung's general formulation with $g(\cdot)$, $h(\cdot)$ both restricted to the class of non-linear functions defined by (14.1.5).
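As an illustration, the mapping half of the network transcribes directly into code as below; the dimensions and (untrained) weights are placeholders, and the demapping half that produces $\hat{x}_j$ from the $z_k$ mirrors this structure.

```python
import numpy as np

def sigmoid(u):
    """The logistic function of (14.1.5)."""
    return 1.0 / (1.0 + np.exp(-u))

def components(x, W1, theta, W2):
    """z_k = sum_l w_{lk2} * sigma( sum_j w_{jl1} x_j + theta_l ).

    x     : (p,)   the p variables x_j
    W1    : (p, N) weights w_{jl1}
    theta : (N,)   offsets theta_l
    W2    : (N, m) weights w_{lk2}
    """
    hidden = sigmoid(x @ W1 + theta)   # outputs of the N hidden nodes
    return hidden @ W2                 # the m components z_k

# Illustrative dimensions with random, untrained weights:
rng = np.random.default_rng(1)
p, N, m = 4, 8, 2
z = components(rng.standard_normal(p), rng.standard_normal((p, N)),
               rng.standard_normal(N), rng.standard_normal((N, m)))
print(z.shape)  # (2,): one value per component
```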
For Kramer's network, $m$ and $N$ need to be chosen, and he discusses various strategies for doing this, including the use of information criteria such as AIC (Akaike, 1974) and the comparison of errors in training and test sets to avoid overfitting. In the approach just described, $m$ components are calculated simultaneously, but Kramer (1991) also discusses a sequential version in which one component at a time is extracted. Two examples of very different sizes are given. One is a two-variable artificial example in which non-linear PCA finds a built-in non-linearity. The second is from chemical engineering with 100 variables, and again non-linear PCA appears to be superior to its linear counterpart.
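The training/test comparison for choosing $m$ can be sketched as follows. The code is not from Kramer (1991); it uses scikit-learn's MLPRegressor, trained to reproduce its own inputs, as a rough stand-in for his network (with the caveat that this bottleneck layer is also logistic rather than a linear combination), on artificial two-variable data with a built-in non-linearity reminiscent of his first example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
t = rng.uniform(-1, 1, 400)
X = np.column_stack([t, t ** 2]) + 0.05 * rng.standard_normal((400, 2))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

N = 8                                    # nodes in each mapping layer
for m in (1, 2):                         # candidate numbers of components
    net = MLPRegressor(hidden_layer_sizes=(N, m, N), activation='logistic',
                       solver='lbfgs', max_iter=5000, random_state=0)
    net.fit(X_train, X_train)            # autoassociative: target = input
    err_train = np.mean((net.predict(X_train) - X_train) ** 2)
    err_test = np.mean((net.predict(X_test) - X_test) ** 2)
    print(m, err_train, err_test)        # stop enlarging m once test error flattens
```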