Foreword
Computers are good at many things that we are not good at, like sorting a
long list of numbers and calculating the trajectory of a rocket, but they
are not at all good at things that we do easily and without much thought,
like seeing and hearing. In the early days of computers, it was not
obvious that vision was a difficult problem. Today, despite great advances
in speed, computers are still limited in what they can pick out from a
complex scene and recognize. Some progress has been made, particularly in
the area of face processing, which is the subject of this monograph.
Faces are dynamic objects that change shape rapidly, on the time scale of
seconds during changes of expression, and more slowly over time as we age.
We use faces to identify individuals, and we rely on facial expressions to
assess feelings and get feedback on how well we are communicating. It
is disconcerting to talk with someone whose face is a mask. If we want
computers to communicate with us, they will have to learn how to make and
assess facial expressions. A method for automating the analysis of facial
expressions would be useful in many psychological and psychiatric studies
as well as have great practical benefit in business and forensics.
The research in this monograph arose through a collaboration with Paul
Ekman, which began 10 years ago. Dr. Beatrice Golomb, then a postdoctoral
fellow in my laboratory, had developed a neural network called Sexnet,
which could distinguish the sex of a person from a photograph of their face
(Golomb et al. 1991). This is a difficult problem since no single feature can
be used to reliably make this judgment, but humans are quite good at it.
This project was the starting point for a major research effort, funded by
the National Science Foundation, to automate the Facial Action Coding
System (FACS), developed by Ekman and Friesen (1978). Joseph Hager made a
major contribution in the early stages of this research by obtaining a high
quality set of videos of experts who could produce each facial action.
Without such a large dataset of labeled images of each action it would not
have been possible to use neural network learning algorithms.
In this monograph, Dr. Marian Stewart Bartlett presents the results of her
doctoral research into automating the analysis of facial expressions. When
she began her research, one of the methods that she used to study the FACS
dataset, a new algorithm for Independent Component Analysis (ICA), had
recently been developed, so she was pioneering not only the automated
analysis of facial expressions, but also the initial exploration of ICA.
Her comparison of
ICA with other algorithms on the recognition of facial expressions is
perhaps the most thorough analysis we have of the strengths and limits of ICA.
Much of human learning is unsupervised; that is, without the benefit of an
explicit teacher. The goal of unsupervised learning is to discover the
underlying probability distributions of sensory inputs (Hinton & Sejnowski,
1999). Or
as Yogi Berra once said, "You can observe a lot just by watchin'." The
identification of an object in an image nearly always depends on the
physical causes of the image rather than the pixel intensities.
Unsupervised learning can be used to solve the difficult problem of
extracting the underlying causes, and decisions about responses can be left
to a supervised learning algorithm that takes the underlying causes rather
than the raw sensory data as its inputs.
Several types of input representation are compared here on the problem of
discriminating between facial actions. Perhaps the most intriguing result
is that two different input representations, Gabor filters and a version of
ICA, both gave excellent results that were roughly comparable with trained
humans. The responses of simple cells in the first stage of processing in
the visual cortex of primates are similar to those of Gabor filters, which
form a roughly statistically independent set of basis vectors over a wide
range of natural images (Bell & Sejnowski, 1997). The disadvantage of
Gabor filters from an image processing perspective is that they are
computationally intensive. The ICA filters, in contrast, are much more
computationally efficient, since they were optimized for faces. The
disadvantage is that they are too specialized a basis set and could not be
used for other problems in visual pattern discrimination.
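To make the Gabor representation concrete, the sketch below constructs a small bank of 2-D Gabor filters: sinusoidal gratings windowed by a Gaussian envelope, qualitatively similar to simple-cell receptive fields. This is a generic illustration, not the specific filter parameters used in the monograph; the size, wavelength, and bandwidth values are arbitrary choices for demonstration.

```python
import numpy as np

def gabor_filter(size=16, wavelength=6.0, theta=0.0, sigma=4.0, phase=0.0):
    """Return a size x size real-valued Gabor filter: a cosine grating
    at orientation theta, windowed by an isotropic Gaussian envelope.
    (Parameter values here are illustrative, not those from the text.)"""
    half = size // 2
    y, x = np.mgrid[-half:half, -half:half]
    # Rotate coordinates so the grating varies along direction theta.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength + phase)
    return envelope * carrier

# A small bank at several orientations, as might be convolved over an
# image patch to produce the input representation discussed above.
bank = [gabor_filter(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
```

Convolving an image with such a bank at several scales is what makes the Gabor representation computationally intensive; a learned basis like the ICA filters replaces this fixed, general-purpose bank with a smaller set tuned to the statistics of face images.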
One of the reasons why facial analysis is such a difficult problem in
visual pattern recognition is the great variability in the images of faces.
Lighting conditions may vary greatly, and variation in the size and
orientation of the face makes the problem even more challenging. The
differences between the
same face under these different conditions are much greater than the
differences between the faces of different individuals. Dr. Bartlett takes
up this challenge in Chapter 7 and shows that learning algorithms may also
be used to help overcome some of these difficulties.
The results reported here form the foundation for future studies on face
analysis, and the same methodology can be applied toward other problems in
visual recognition. Although there may be something special about faces,
we may have learned a more general lesson about the problem of
discriminating between similar complex shapes: A few good filters are all
you need, but each class of object may need a quite different set for
optimal discrimination.
Terrence J. Sejnowski