Face image analysis by unsupervised learning and redundancy reduction
Marian Stewart Bartlett
Doctoral Dissertation
University of California, San Diego
1998
Abstract
In a task such as face recognition, much of the important information may
be contained in the high-order relationships among the image pixels.
Representations such as "Eigenfaces" (Turk & Pentland, 1991) and "Holons"
(Cottrell & Metcalfe, 1991) are based on principal component analysis
(PCA), which encodes the correlational structure of the input, but
does not address high-order statistical dependencies such as relationships
among three or more pixels. Independent component analysis (ICA) is a
generalization of PCA which encodes the high-order dependencies in the
input in addition to the correlations. Representations for face recognition
were developed from the independent components of face images. The ICA
representations were superior to PCA for recognizing faces across sessions
and changes in expression.
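
As an illustration of the distinction drawn above, the sketch below derives PCA ("Eigenface"-style) and ICA codes from a set of flattened face images. It is a minimal sketch only: the placeholder data, the number of components, the FastICA estimator (the dissertation is not limited to this particular ICA algorithm), and the nearest-neighbor matching step are illustrative assumptions, not the dissertation's actual pipeline.

    import numpy as np
    from sklearn.decomposition import PCA, FastICA

    rng = np.random.default_rng(0)
    faces = rng.random((100, 60 * 60))   # placeholder: 100 flattened 60x60 face images

    # PCA ("Eigenfaces") captures only the second-order, correlational structure.
    pca = PCA(n_components=40)
    pca_codes = pca.fit_transform(faces)

    # ICA additionally seeks statistically independent, high-order structure.
    ica = FastICA(n_components=40, whiten="unit-variance", max_iter=1000, random_state=0)
    ica_codes = ica.fit_transform(faces)

    # Faces could then be matched across sessions by nearest-neighbor comparison
    # of these codes, e.g. with cosine similarity (an illustrative choice).
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
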
ICA was compared to more than eight other image analysis methods on a task
of recognizing facial expressions in a project to automate the Facial
Action Coding System (Ekman & Friesen, 1978). These methods included
estimation of optical flow; representations based on the second-order
statistics of the full face images, such as Eigenfaces (Cottrell & Fleming, 1990;
Turk & Pentland, 1991), local feature analysis (Penev & Atick, 1996), and linear
discriminant analysis (Belhumeur, Hespanha, & Kriegman, 1997); and
representations based on the outputs of local filters, such as Gabor
wavelet representations (Daugman, 1988; Lades et al., 1993) and local PCA
(Padgett & Cottrell, 1997). The ICA and Gabor wavelet representations
achieved the best performance of 96% for classifying 12 facial actions.
Relationships between the independent component representation and the
Gabor representation are discussed.
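
For concreteness, the following sketch illustrates a Gabor wavelet representation of the kind referred to above: an image is convolved with a bank of complex Gabor kernels at several spatial frequencies and orientations, and the response magnitudes are used as features. The kernel parameters, filter-bank size, and dense (rather than landmark-based) sampling are illustrative assumptions rather than the dissertation's settings.

    import numpy as np
    from scipy.signal import convolve2d

    def gabor_kernel(frequency, theta, sigma=3.0, size=15):
        """Complex 2-D Gabor kernel: Gaussian envelope times a complex carrier."""
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)
        yr = -x * np.sin(theta) + y * np.cos(theta)
        envelope = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))
        carrier = np.exp(1j * 2 * np.pi * frequency * xr)
        return envelope * carrier

    def gabor_features(image, frequencies=(0.1, 0.2, 0.3), n_orientations=4):
        """Magnitude responses of the filter bank, flattened into one feature vector."""
        feats = []
        for f in frequencies:
            for k in range(n_orientations):
                theta = k * np.pi / n_orientations
                kern = gabor_kernel(f, theta)
                real = convolve2d(image, kern.real, mode="same")
                imag = convolve2d(image, kern.imag, mode="same")
                feats.append(np.hypot(real, imag).ravel())
        return np.concatenate(feats)

    # Example on a placeholder gray-level face image:
    image = np.random.default_rng(0).random((60, 60))
    features = gabor_features(image)
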
Temporal redundancy contains information for learning
invariances. Different views of a face tend to appear in close temporal
proximity as the person changes expression or pose, or moves through the
environment. The final chapter modeled the development of viewpoint-invariant
responses to faces from visual experience in a biological system
by encoding spatio-temporal dependencies. The simulations combined
temporal smoothing of activity signals with Hebbian learning (Földiák,
1991) in a network with both feed-forward connections and a recurrent layer
that was a generalization of a Hopfield attractor network. Following
training on sequences of gray-level images of faces as they changed pose,
multiple views of a given face fell into the same basin of attraction, and
the system acquired representations of faces that were approximately
viewpoint invariant.
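
The temporal-smoothing idea can be made concrete with a small sketch of a Földiák-style trace learning rule, in which output activity is low-pass filtered over time so that temporally adjacent inputs (for example, successive views of the same face) strengthen the same output units. The network sizes, learning rate, trace decay, winner-take-all output, and placeholder data below are assumptions; the recurrent attractor layer described above is not modeled here.

    import numpy as np

    def trace_hebbian(sequences, n_outputs=4, eta=0.02, decay=0.8, seed=0):
        """Trace-modulated Hebbian learning over temporal sequences of inputs."""
        rng = np.random.default_rng(seed)
        n_inputs = sequences[0].shape[1]
        W = rng.random((n_outputs, n_inputs))
        W /= np.linalg.norm(W, axis=1, keepdims=True)
        for seq in sequences:                   # one sequence = successive views of one face
            trace = np.zeros(n_outputs)
            for x in seq:
                y = W @ x                       # feed-forward activation
                winner = np.zeros(n_outputs)
                winner[np.argmax(y)] = 1.0      # simple competitive (winner-take-all) output
                trace = decay * trace + (1 - decay) * winner      # temporally smoothed activity
                W += eta * trace[:, None] * (x[None, :] - W)      # trace-modulated Hebbian update
                W /= np.linalg.norm(W, axis=1, keepdims=True)
        return W

    # Placeholder "pose sequences": each row is one flattened view of the same face.
    rng = np.random.default_rng(1)
    sequences = [rng.random((5, 100)) for _ in range(8)]
    W = trace_hebbian(sequences)
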