Automating The Facial Action Coding System:
Issues And Image Representations
Bartlett, M.S., Donato, G.L., Movellan, J.R., Ekman, P., & Sejnowski, T.J.
NIPS Post-Conference Workshop on Affective Computing, Breckenridge, CO,
December 2.
Abstract
Faces contain much information beyond what is conveyed by basic emotion
categories, including signs of cognitive state such as interest, boredom,
and confusion, conversational signals that provide emphasis to speech and
information about syntax, and blends of two or more emotions (e.g., happiness
+ disgust = smug). In addition, variations within an emotional category
(e.g., vengeance vs. resentment) and variations in magnitude (annoyance
vs. fury) may be signaled by which muscles are contracted as well as by
the intensity of the contraction. Instead of classifying expressions into a
few basic emotion categories, this system attempts to measure the full
range of facial behavior by recognizing the facial action units that
comprise facial expressions. The system is based on the Facial Action
Coding System (FACS) (Ekman & Friesen, 1978), which was developed by
experimental psychologists to objectively measure facial movement. In FACS,
human scorers decompose each facial expression into component muscle
movements. Advantages of FACS over other sets of animation parameters
defined by the engineering community include 1) comprehensiveness: each
independent motion of the face is described by one of the forty-six action
units, and 2) a robust link with ground truth: there is over 20 years of
behavioral data on the relationships between FACS movement parameters and
underlying emotional or cognitive states.
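As a concrete, purely hypothetical illustration of what such a decomposition
looks like, the short Python sketch below represents a coded expression as a
set of action units, each with a FACS intensity grade from A (weakest) to E
(strongest); the AU subset shown and the helper function are illustrative
simplifications and are not part of the original system.

    # Hypothetical sketch: a FACS coding describes which action units (AUs)
    # are active and how intense each contraction is (grades A, weakest,
    # through E, strongest). Only four of the forty-six AUs are listed here.
    AU_NAMES = {
        1: "inner brow raiser",
        4: "brow lowerer",
        6: "cheek raiser",
        12: "lip corner puller",
    }

    # Example coding of a Duchenne (enjoyment) smile: AU6 + AU12.
    smile = {6: "B", 12: "C"}

    def describe(coding):
        """Render a coding as a compact string, e.g. 'AU6B+AU12C'."""
        return "+".join(f"AU{au}{grade}" for au, grade in sorted(coding.items()))

    print(describe(smile))                  # AU6B+AU12C
    print([AU_NAMES[au] for au in smile])   # ['cheek raiser', 'lip corner puller']

A full FACS score would of course draw on all forty-six action units rather
than the four listed here.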
The first part of the talk described the Facial Action Coding System and
motivated its application to affective computing. The second
part of the talk explored and compared techniques for automatically
recognizing facial actions in sequences of images. These methods include
unsupervised learning techniques for finding image filters such as
principal component analysis, independent component analysis and local
feature analysis, and supervised learning techniques such as Fisher's
linear discriminants. These data-driven filters are compared to Gabor
wavelets, in which the filter kernels are predefined. The best performances
were obtained using the Gabor wavelet representation and the independent
component representation, both of which achieved 96% accuracy for
classifying twelve facial actions. The ICA and Gabor wavelet kernels share
the property of spatial locality. In addition, both are related to
receptive fields in primary visual cortex, and both are sensitive to
high-order dependencies in the image ensemble. The ICA representation
employed two orders of magnitude fewer kernels than the Gabor
representation and required 90% less CPU time to compute for new
images. The results provide evidence for the importance of using local
filter kernels, high spatial frequencies, and statistical independence for
classifying facial actions.
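To make the contrast between predefined and data-driven filters concrete,
here is a minimal sketch of a Gabor filter bank in Python; it is not taken
from the original work, and the kernel sizes, wavelengths, and orientation
count are illustrative assumptions. The magnitude responses of such a bank
form a Gabor-style representation of an image, whereas the data-driven
representations (PCA, ICA, LFA) would instead learn their kernels from the
image ensemble, for example by fitting scikit-learn's PCA or FastICA to the
same set of images.

    import numpy as np
    from scipy.signal import fftconvolve

    def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
        """Complex Gabor kernel: a Gaussian envelope modulating an oriented plane wave."""
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        # Rotate image coordinates into the filter's orientation.
        xr = x * np.cos(theta) + y * np.sin(theta)
        yr = -x * np.sin(theta) + y * np.cos(theta)
        envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
        carrier = np.exp(2j * np.pi * xr / wavelength)
        return envelope * carrier

    def gabor_features(image, wavelengths=(4, 8, 16), n_orientations=8):
        """Convolve an image with a bank of Gabor kernels spanning several
        spatial frequencies and orientations; return the magnitude responses."""
        responses = []
        for lam in wavelengths:
            for k in range(n_orientations):
                kern = gabor_kernel(size=2 * lam + 1, wavelength=lam,
                                    theta=k * np.pi / n_orientations, sigma=0.5 * lam)
                # Magnitude of the complex response is locally phase-invariant.
                responses.append(np.abs(fftconvolve(image, kern, mode="same")))
        return np.stack(responses)  # shape: (len(wavelengths) * n_orientations, H, W)

    # Example on a hypothetical 64x64 face-image patch.
    features = gabor_features(np.random.rand(64, 64))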