Foreword
Computers are good at many things that we are not good at, like sorting a
long list of numbers and calculating the trajectory of a rocket, but they
are not at all good at things that we do easily and without much thought,
like seeing and hearing. In the early days of computers, it was not
obvious that vision was a difficult problem. Today, despite great advances
in speed, computers are still limited in what they can pick out from a
complex scene and recognize. Some progress has been made, particularly in
the area of face processing, which is the subject of this monograph.
Faces are dynamic objects that change shape rapidly, on the time scale of
seconds during changes of expression, and more slowly over time as we age.
We use faces to identify individuals, and we rely on facial expressions to
assess feelings and get feedback on how well we are communicating. It
is disconcerting to talk with someone whose face is a mask. If we want
computers to communicate with us, they will have to learn how to make and
assess facial expressions. A method for automating the analysis of facial
expressions would be useful in many psychological and psychiatric studies
as well as have great practical benefit in business and forensics.
The research in this monograph arose through a collaboration with Paul
Ekman, which began 10 years ago. Dr. Beatrice Golomb, then a postdoctoral
fellow in my laboratory, had developed a neural network called Sexnet,
which could distinguish the sex of a person from a photograph of their face
(Golomb et al. 1991). This is a difficult problem since no single feature can
be used to reliably make this judgment, but humans are quite good at it.
This project was the starting point for a major research effort, funded by
the National Science Foundation, to automate the Facial Action Coding
System (FACS), developed by Ekman and Friesen (1978). Joseph Hager made a
major contribution in the early stages of this research by obtaining a high
quality set of videos of experts who could produce each facial action.
Without such a large dataset of labeled images of each action it would not
have been possible to use neural network learning algorithms.
In this monograph, Dr. Marian Stewart Bartlett presents the results of her
doctoral research into automating the analysis of facial expressions. When
she began her research, one of the methods that she used to study the FACS
dataset, a new algorithm for Independent Component Analysis (ICA), had
recently been developed, so she was pioneering not only the automated
analysis of facial expressions, but also the initial exploration of ICA.
Her comparison of
ICA with other algorithms on the recognition of facial expressions is
perhaps the most thorough analysis we have of the strengths and limits of ICA.
Much of human learning is unsupervised; that is, without the benefit of an
explicit teacher. The goal of unsupervised learning is to discover the
underlying probability distributions of sensory inputs (Hinton & Sejnowski,
1999). Or
as Yogi Berra once said, "You can observe a lot just by watchin'." The
identification of an object in an image nearly always depends on the
physical causes of the image rather than the pixel intensities.
Unsupervised learning can be used to solve the difficult problem of
extracting the underlying causes, and decisions about responses can be left
to a supervised learning algorithm that takes the underlying causes rather
than the raw sensory data as its inputs.
Several types of input representation are compared here on the problem of
discriminating between facial actions. Perhaps the most intriguing result
is that two different input representations, Gabor filters and a version of
ICA, both gave excellent results that were roughly comparable with trained
humans. The responses of simple cells in the first stage of processing in
the visual cortex of primates are similar to those of Gabor filters, which
form a roughly statistically independent set of basis vectors over a wide
range of natural images (Bell & Sejnowski, 1997). The disadvantage of
Gabor filters from an image processing perspective is that they are
computationally intensive. The ICA filters, in contrast, are much more
computationally efficient, since they were optimized for faces. The
disadvantage is that they are too specialized a basis set and could not be
used for other problems in visual pattern discrimination.
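To make the Gabor representation concrete, the sketch below constructs a small bank of 2-D Gabor filters: sinusoidal gratings windowed by a Gaussian envelope, qualitatively similar to simple-cell receptive fields. This is a generic illustration, not the specific filter parameters used in the monograph; the size, wavelength, and bandwidth values are arbitrary choices for demonstration.

```python
import numpy as np

def gabor_filter(size=16, wavelength=6.0, theta=0.0, sigma=4.0, phase=0.0):
    """Return a size x size real-valued Gabor filter: a cosine grating
    at orientation theta, windowed by an isotropic Gaussian envelope.
    (Parameter values here are illustrative, not those from the text.)"""
    half = size // 2
    y, x = np.mgrid[-half:half, -half:half]
    # Rotate coordinates so the grating varies along direction theta.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength + phase)
    return envelope * carrier

# A small bank at several orientations, as might be convolved over an
# image patch to produce the input representation discussed above.
bank = [gabor_filter(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)]
```

Convolving an image with such a bank at several scales is what makes the Gabor representation computationally intensive; a learned basis like the ICA filters replaces this fixed, general-purpose bank with a smaller set tuned to the statistics of face images.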
One of the reasons why facial analysis is such a difficult problem in
visual pattern recognition is the great variability in the images of faces.
Lighting conditions may vary greatly, and variation in the size and
orientation of the face makes the problem even more challenging. The
differences between the
same face under these different conditions are much greater than the
differences between the faces of different individuals. Dr. Bartlett takes
up this challenge in Chapter 7 and shows that learning algorithms may also
be used to help overcome some of these difficulties.
The results reported here form the foundation for future studies on face
analysis, and the same methodology can be applied toward other problems in
visual recognition. Although there may be something special about faces,
we may have learned a more general lesson about the problem of
discriminating between similar complex shapes: A few good filters are all
you need, but each class of object may need a quite different set for
optimal discrimination.
Terrence J. Sejnowski