Automatic Analysis of Facial Expressions: The State of the Art By Maja Pantic, Leon Rothkrantz


Presentation Outline
  Motivation
  Desired functionality and evaluation criteria
  Face detection
  Expression data extraction
  Classification
  Conclusions and future research

Motivation
  Human-computer interaction (HCI)
    Hope to achieve robust communication by recovering from failure of one communication channel using information from another channel
    According to some estimates, the facial expression of the speaker accounts for 55% of the effect of the spoken message (with the voice intonation contributing 38%, and the verbal part just 7%)
  Behavioral science research
    Automation of objective measurement of facial activity

Desired Functionality
  Human visual system = good reference point
  Desired properties:
    Works on images of people of any sex, age, and ethnicity
    Robust to variation in lighting
    Insensitive to hair style changes, presence of glasses, facial hair, partial occlusions
    Can deal with rigid head motions
    Is real-time
    Capable of classifying expressions into multiple emotion categories
    Able to learn the range of emotional expression by a particular person
    Able to distinguish all possible facial expressions (probably impossible)

Overview
  Three basic problems need to be solved:
    Face detection
    Facial expression data extraction
    Facial expression classification
  Both static images and image sequences have been used in studies surveyed in the paper

Face Detection
  In arbitrary images
    A. Pentland et al.
      Detection in a single image
        Principal Component Analysis is used to generate a face space from a set of sample images
        A face map is created by calculating the distance between the local subimage and the face space at every location in the image
        If the distance is smaller than a certain threshold, the presence of a face is declared
      Detection in an image sequence
        Frame differencing is used
        The difference image is thresholded to obtain motion blobs
        Blobs are tracked and analyzed over time to determine if motion is caused by a person and to determine the head position
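The single-image steps on this slide (face space from PCA, distance map, threshold) might be sketched roughly as follows. This is a minimal illustration and not the implementation of Pentland et al.: the function names, the use of NumPy's SVD, the sliding-window loop, and the choice of reconstruction error as the "distance" are all assumptions made for the example.

```python
import numpy as np

def build_face_space(samples, n_components):
    """Return (mean, basis) of a PCA face space.

    samples: array of shape (n_samples, n_pixels), one flattened patch per row.
    """
    mean = samples.mean(axis=0)
    # Rows of vt are the principal directions of the centered samples.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    return mean, vt[:n_components]

def distance_from_face_space(patch, mean, basis):
    """Reconstruction error of a flattened patch with respect to the face space."""
    centered = patch - mean
    projection = basis.T @ (basis @ centered)
    return float(np.linalg.norm(centered - projection))

def face_map(image, mean, basis, patch_shape):
    """Distance from face space for every patch position in a 2-D image."""
    ph, pw = patch_shape
    rows = image.shape[0] - ph + 1
    cols = image.shape[1] - pw + 1
    out = np.empty((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = image[r:r + ph, c:c + pw].ravel()
            out[r, c] = distance_from_face_space(patch, mean, basis)
    return out

def detect_faces(image, mean, basis, patch_shape, threshold):
    """Top-left corners of patches whose distance is below the threshold."""
    fmap = face_map(image, mean, basis, patch_shape)
    return [(int(r), int(c)) for r, c in np.argwhere(fmap < threshold)]
```

In practice the threshold and the number of components would have to be tuned on real data; the values are left to the caller here because the slide does not give them.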

Face Detection (Continued)
  In face images
    Holistic approaches (the face is detected as a whole unit)
      M. Pantic, L. Rothkrantz
        Use a frontal and a profile face image
        Outer head boundaries are determined by analyzing the horizontal and vertical histograms of the frontal face image
        The face contour is obtained by using an HSV color model based algorithm (the face is extracted as the biggest object in the scene having the Hue parameter in the defined range)
        The profile contour is determined by following the procedure below:
          The value component of the HSV color model is used to threshold the input image
          The number of background pixels between the right edge of the image and the first “On” pixel is counted (this gives a vector that represents a discrete approximation of the contour curve)
          Noise is removed by averaging
          Local extrema correspond to points of interest (found by determining zero crossings of the 1st derivative)
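The profile contour procedure above can be illustrated with a small pure-Python sketch. The function names, the 3-sample moving average, and the use of a sign change in the first difference as the "zero crossing" test are assumptions for illustration; the slide does not specify the smoothing window or the threshold value.

```python
def threshold_value(value_rows, t):
    """Binarize the V (value) channel: 1 where the pixel exceeds t, else 0."""
    return [[1 if v > t else 0 for v in row] for row in value_rows]

def profile_contour(binary_rows):
    """Per row, count background pixels from the right edge to the first 'on' pixel."""
    contour = []
    for row in binary_rows:
        count = 0
        for pixel in reversed(row):
            if pixel:
                break
            count += 1
        contour.append(count)
    return contour

def smooth(values, window=3):
    """Moving average; the window is clipped at both ends of the vector."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def local_extrema(values):
    """Indices where the first difference changes sign (a discrete zero crossing)."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return [i for i in range(1, len(diffs)) if diffs[i - 1] * diffs[i] < 0]

# Toy 5x5 "profile": the tip of the shape in the middle row is the only extremum.
rows = [
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]
contour = profile_contour(rows)           # [2, 1, 0, 1, 2]
points = local_extrema(smooth(contour))   # [2]
```

Plateaus in the contour (a zero first difference over several rows) are not reported by this simple sign test, so a real implementation would need a more careful extremum rule.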

Face Detection (Continued)
  Analytic approaches (the face is detected by detecting some important facial features first)
    H. Kobayashi, F. Hara
      Brightness distribution data of the human face is obtained with a camera in monochrome mode
      An average of brightness distribution data obtained from 10 subjects is calculated
      Irises are identified by computing cross-correlation between the average image and the novel image
      The locations of other features are determined using relative locations of the facial features in the face
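As a rough illustration of locating a feature by cross-correlating an average template with a new image, the sketch below uses a zero-mean normalized cross-correlation and an exhaustive search. The normalization, the function names, and the toy template are assumptions; the slide only says that cross-correlation with the average image is used.

```python
import numpy as np

def normalized_cross_correlation(patch, template):
    """Zero-mean normalized correlation in [-1, 1]; 0.0 for constant inputs."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    if denom == 0:
        return 0.0
    return float((p * t).sum() / denom)

def best_match(image, template):
    """Top-left corner and score of the patch that correlates best with the template."""
    th, tw = template.shape
    best_score, best_pos = -2.0, None
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            score = normalized_cross_correlation(image[r:r + th, c:c + tw], template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

# Hypothetical 3x3 "average iris" template placed in an otherwise empty image.
template = np.array([[0.0, 1.0, 0.0], [1.0, 5.0, 1.0], [0.0, 1.0, 0.0]])
image = np.zeros((8, 8))
image[2:5, 3:6] = template
position, score = best_match(image, template)
```

Normalizing by the patch and template energy makes the score insensitive to a uniform change in brightness or contrast, which is one plausible reason to prefer it over a raw correlation sum.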

Template-based facial expression data extraction using static images
  Edwards et al.
    Use Active Appearance Models (AAMs)
      Combined model of shape and gray-level appearance
      A training set of hand-labeled images with landmark points marked at key positions to outline the main features
      PCA is applied to shape and gray-level data separately, then applied again to a vector of concatenated shape and gray-level parameters
      The result is a description in terms of “appearance” parameters
      80 appearance parameters sufficient to explain 98% of the variation in the 400 training images labeled with 122 points
    Given a new face image, they find appearance parameter values that minimize the error between the new image and the synthesized AAM image
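The statement that a fixed number of parameters explains 98% of the variation corresponds to truncating a PCA by cumulative explained variance. The sketch below shows only that truncation step on a generic data matrix; it is not an AAM implementation, and the function name and the toy data are assumptions.

```python
import numpy as np

def components_for_variance(data, target=0.98):
    """Smallest number of principal components whose cumulative variance reaches target.

    data: array of shape (n_samples, n_features), one training vector per row.
    """
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variance = singular_values ** 2
    cumulative = np.cumsum(variance) / variance.sum()
    # searchsorted returns the first index whose cumulative value is >= target.
    count = int(np.searchsorted(cumulative, target)) + 1
    return min(count, len(cumulative))

# Toy data: almost all variance lies along the first axis.
toy = np.array([
    [10.0, 0.0, 0.0],
    [-10.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, -1.0, 0.0],
])
needed = components_for_variance(toy, target=0.98)
```

Here the first axis carries 200/202 of the total variance, so one component already passes the 98% mark, while a stricter target would require two.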

Feature-based facial expression data extraction using static images
  M. Pantic, L. Rothkrantz
    A point-based face model is used
      19 points selected in the frontal-view image, and 10 in the side-view image
      Face model features are defined as some geometric relationship between facial points or the image intensity in a small region defined relative to facial points (e.g. Feature 17 = Distance KL)
    Neutral facial expression analyzed first
    The positions of facial points are determined by using information from feature detectors
    Multiple feature detectors are used for each facial feature localization and model feature extraction
      The result obtained from each detector is stored in a separate file
      The detector output is checked for accuracy
      After “inaccurate” results are discarded, those that were obtained by the highest priority detector are selected for use in the classification stage

Template-based facial expression data extraction using image sequences
  M. Black, Y. Yacoob
    Do not address the problem of initially locating the various facial features
    The motion of various face regions is estimated using parameterized optical flow
    Estimates of deformation and motion parameters (e.g. horizontal and vertical translation, divergence, curl) are derived
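To make the motion parameters on this slide concrete, the sketch below fits an affine flow model u = a0 + a1*x + a2*y, v = a3 + a4*x + a5*y to a set of flow vectors and reads off translation, divergence (du/dx + dv/dy = a1 + a5), and curl (dv/dx - du/dy = a4 - a2). The plain least-squares fit on already-computed flow vectors and the function names are assumptions for illustration; they stand in for, and are much simpler than, the estimation procedure used by the authors.

```python
import numpy as np

def fit_affine_flow(points, flows):
    """Least-squares fit of an affine flow model.

    points: (n, 2) array of (x, y) positions inside a face region.
    flows:  (n, 2) array of (u, v) flow vectors at those positions.
    Returns the six coefficients [a0, a1, a2, a3, a4, a5].
    """
    x, y = points[:, 0], points[:, 1]
    design = np.column_stack([np.ones_like(x), x, y])
    coef_u = np.linalg.lstsq(design, flows[:, 0], rcond=None)[0]
    coef_v = np.linalg.lstsq(design, flows[:, 1], rcond=None)[0]
    return np.concatenate([coef_u, coef_v])

def motion_parameters(a):
    """Translation, divergence, and curl of the affine flow field."""
    a0, a1, a2, a3, a4, a5 = (float(value) for value in a)
    return {
        "translation_x": a0,
        "translation_y": a3,
        "divergence": a1 + a5,  # isotropic expansion (+) or contraction (-)
        "curl": a4 - a2,        # rotation about the viewing direction
    }

# Synthetic example: a region that shifts by (1, -2) and expands uniformly.
pts = np.array([[x, y] for x in range(-2, 3) for y in range(-2, 3)], dtype=float)
expansion = np.column_stack([1.0 + 0.1 * pts[:, 0], -2.0 + 0.1 * pts[:, 1]])
params = motion_parameters(fit_affine_flow(pts, expansion))
```

For the synthetic field above the fit recovers a divergence of about 0.2 and a curl of about 0, which matches the intuition of a region that grows without rotating.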

Feature-based facial expression data extraction using image sequences
  Cohn et al. (the only surveyed method)
    Feature points in the first frame manually marked with a mouse around facial landmarks
    A 13x13 flow window is centered around each point
    Hierarchical optical flow method of Lucas and Kanade used to track feature points in the image sequence
    Displacement of each point calculated relative to the first frame
    The displacement of feature points between the initial and peak frames used for classification
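Once the points have been tracked, the displacement features described above reduce to simple subtraction against the first frame. The sketch below assumes the tracker output is already available as a list of per-frame point lists; the data layout and function names are hypothetical, and the tracking step itself is not shown.

```python
def displacements(tracks):
    """Displacement of every point in every frame relative to frame 0.

    tracks[f][p] is the (x, y) position of feature point p in frame f.
    """
    first = tracks[0]
    return [
        [(x - x0, y - y0) for (x, y), (x0, y0) in zip(frame, first)]
        for frame in tracks
    ]

def peak_feature_vector(tracks, peak_index):
    """Flatten the displacements at the peak frame into [dx0, dy0, dx1, dy1, ...]."""
    flat = []
    for dx, dy in displacements(tracks)[peak_index]:
        flat.extend([dx, dy])
    return flat

# Two hypothetical feature points tracked over three frames.
tracks = [
    [(10, 10), (20, 10)],  # initial (neutral) frame
    [(10, 12), (21, 10)],
    [(10, 15), (23, 9)],   # peak frame
]
features = peak_feature_vector(tracks, peak_index=2)
```

The resulting vector is the kind of input a classifier could consume; which frame counts as the peak is assumed to be known here.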

Classification
  Two basic problems:
    Defining a set of categories/classes
    Choosing a classification mechanism
  People are not very good at it either
    In one study, a trained observer could classify only 87% of the faces correctly
  Expressions can be classified in terms of facial actions that cause an expression or “typical” emotions
  Facial muscle activity can be described by a set of codes
    The codes are called Action Units (AUs). All possible, visually detectable facial changes can be described by a set of 44 AUs. These codes form the basis of the Facial Action Coding System (FACS), which provides a linguistic description for each code.

Classification (Continued)
  Most of the studies perform an emotion classification and use the following 6 basic categories: happiness, sadness, surprise, fear, anger, and disgust
  No agreement among psychologists whether these are the right categories
  People rarely produce “pure” expressions (e.g. 100% happiness); blends are much more common

Template-based classification using static images
  Edwards et al.
    The Mahalanobis distance measure can be used for classification:
      d_i^2 = (c - c̄_i)^T C^-1 (c - c̄_i)
      where c is the vector of appearance parameters for the new image, c̄_i is the centroid of the multivariate distribution for class i, and C^-1 is the inverse of the within-class covariance matrix for all the training images
    Classification into 6 basic + neutral categories
    Correct recognition of 74% reported
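A nearest-centroid classifier under the Mahalanobis distance, with one covariance matrix shared by all classes, might look like the sketch below. The class labels, centroids, and covariance values are made up for illustration; only the form of the distance, the quadratic form of (c minus centroid) with the inverse covariance, comes from the slide.

```python
import numpy as np

def mahalanobis_sq(c, centroid, cov_inv):
    """Squared Mahalanobis distance between a parameter vector and a class centroid."""
    diff = c - centroid
    return float(diff @ cov_inv @ diff)

def classify(c, centroids, cov_inv):
    """Label of the class whose centroid is closest under the Mahalanobis distance."""
    return min(centroids, key=lambda label: mahalanobis_sq(c, centroids[label], cov_inv))

# Hypothetical 2-D appearance parameters with much more spread along the first axis.
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
cov_inv = np.linalg.inv(cov)
centroids = {
    "happiness": np.array([0.0, 0.0]),
    "surprise": np.array([4.0, 2.0]),
}
label = classify(np.array([3.0, 0.0]), centroids, cov_inv)
```

In this toy case the point (3, 0) is closer to "surprise" in plain Euclidean terms, but the large variance along the first axis makes "happiness" the nearer class under the Mahalanobis distance, which is the effect the covariance weighting is meant to capture.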

Neural network-based classification using static images
  H. Kobayashi, F. Hara
    Used 234x50x6 neural network trained off-line using backpropagation
    The input layer units correspond to intensity values extracted from the input image along the 13 vertical lines
    The output units correspond to the 6 basic emotion categories
    Average correct recognition rate 85%

Neural network-based classification using static images (Continued)
  Zhang et al.
    Used 680x7x7 neural network
    Output units represent six basic emotion categories plus the neutral category
    Output units give a probability of the analyzed expression belonging to the corresponding emotion category
    Cross-validation used for testing
  J. Zhao, G. Kearney
    Used 10x10x3 neural network
    Neural network trained and tested on the whole set of data with 100% recognition rate

Rule-based classification using static images
  M. Pantic, L. Rothkrantz (the only surveyed method)
    Two-stage classification:
      1. Facial actions (corresponding to one of the Action Units) are deduced from changes in face geometry
        Action Units are described in terms of face model feature values (e.g. AU28 = (Both) lips sucked in = feature 17 is 0, where feature 17 = Distance KL)
      2. The stage 1 classification results are used to classify the expression into one of the emotion categories
        E.g. AU6 + AU12 + AU16 + AU25 => Happiness
    The two-stage classification process allows “weighted emotion labels”
      Assumption: each AU that is part of the AU-coded description of a “pure” emotional expression has the same influence on the intensity of that emotional expression
      E.g. if the analysis of some image results in the activation of AU6, AU12, and AU16, then the expression is classified as 75% happiness
    The system can distinguish 29 AUs
    Recognition rate 92% for upper face AUs, and 86% for lower face AUs
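The weighted emotion label in the example above follows directly from the equal-influence assumption: the weight is the fraction of the emotion's AUs that are active. A minimal sketch, using only the happiness rule quoted on the slide (the function name and the set-based representation are assumptions):

```python
# AU-coded description of "pure" happiness, as given in the slide example.
HAPPINESS_AUS = {6, 12, 16, 25}

def weighted_label(active_aus, emotion_aus):
    """Percentage of an emotion's AUs that are active, each AU weighted equally."""
    matched = set(active_aus) & set(emotion_aus)
    return 100.0 * len(matched) / len(emotion_aus)

# AU6, AU12, and AU16 are active; AU25 is missing.
score = weighted_label({6, 12, 16}, HAPPINESS_AUS)  # 75.0
```

Active AUs that do not belong to the emotion's description are simply ignored in this sketch; how the actual system treats them is not stated on the slide.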

Template-based classification using image sequences
  Cohn et al.
    Classification in terms of Action Units
    Uses Discriminant Function Analysis
      Deals with each face region separately
      Used for classification only (i.e. all facial point displacements are used as input)
    Does not deal with image sequences containing several consecutive facial actions
    Recognition rate: 92% in the brow region, 88% in the eye region, 83% in the nose and mouth region

Rule-based classification using image sequences
  M. Black, Y. Yacoob (the only surveyed method)
    Mid- and high-level descriptions of facial actions are used
    The parameter values (e.g. translation, divergence) derived from optical flow are thresholded
      E.g. Div > 0.02 => expansion, Div < -0.02 => contraction. This is what the authors would call a mid-level predicate for the mouth.
    High-level predicates are rules for classifying facial expressions
      Rules for detecting the beginning and the end of an expression
      Use the results of applying mid-level rules as input
      E.g. Beginning of surprise = Raising brows and vertical expansion of mouth, End of surprise = Lowering brows and vertical contraction of mouth
    The rules used for classification are not designed to deal with blends of emotional expressions (Anger + Fear recognized as disgust)
    Recognition rate: 88%
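The two rule levels above can be sketched as a pair of small functions. The 0.02 divergence threshold and the surprise rules are taken from the slide; everything else is an assumption for illustration, in particular the string labels, the function names, and the idea that the brow predicate arrives as an already-computed label (the slide gives no brow thresholds).

```python
DIV_THRESHOLD = 0.02  # threshold quoted in the slide's mouth example

def mouth_predicate(divergence):
    """Mid-level predicate: threshold the mouth divergence parameter."""
    if divergence > DIV_THRESHOLD:
        return "expansion"
    if divergence < -DIV_THRESHOLD:
        return "contraction"
    return None

def surprise_event(brow_predicate, mouth_divergence):
    """High-level rule: detect the beginning or the end of a surprise expression."""
    mouth = mouth_predicate(mouth_divergence)
    if brow_predicate == "raising" and mouth == "expansion":
        return "begin"
    if brow_predicate == "lowering" and mouth == "contraction":
        return "end"
    return None

# Hypothetical per-frame inputs: (brow label, mouth divergence).
frames = [("raising", 0.05), ("raising", 0.01), ("lowering", -0.04)]
events = [surprise_event(brow, div) for brow, div in frames]
```

Because each rule fires only on an exact combination of predicates, a blend of two expressions falls through or matches the wrong rule, which is consistent with the limitation noted on the slide.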

Conclusions and Possible Directions for Future Research
  Active research area
  Most surveyed systems rely on the frontal view of the face and assume no facial hair or glasses
  None of the surveyed systems can distinguish all 44 AUs defined in FACS
  Classification into basic emotion categories in most surveyed studies
  Some reported results are of little practical value
  The ability of the human visual system to “fill in” missing parts of the observed face (i.e. deal with partial occlusions) has not been investigated

Conclusions and Possible Directions for Future Research (Continued)
  Not clear at all whether the 6 “basic” emotion categories are universal
  Each person has his/her own range of expression intensity, so systems that start with a generic classification and then adapt may be of interest
  Assignment of a higher priority to upper face features by the human visual system (when interpreting facial expressions) has not been the subject of much research
  Hard or impossible to compare reported results objectively without a well-defined, commonly used database of face images

References
  M. Pantic, L. Rothkrantz, “Automatic Analysis of Facial Expressions: The State of the Art”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 12, December 2000
  M. Pantic, L. Rothkrantz, “Expert System for Automatic Analysis of Facial Expressions”, Image and Vision Computing, Vol. 18, No. 11, pp. 881-905, 2000
  M. J. Black, Y. Yacoob, “Recognizing Facial Expressions in Image Sequences Using Local Parameterized Models of Image Motion”, Int’l J. Computer Vision, Vol. 25, No. 1, pp. 23-48, 1997
  J. F. Cohn, A. J. Zlochower, J. J. Lien, T. Kanade, “Feature-Point Tracking by Optical Flow Discriminates Subtle Differences in Facial Expression”, Proc. Int’l Conf. Automatic Face and Gesture Recognition, pp. 396-401, 1998
  G. J. Edwards, T. F. Cootes, C. J. Taylor, “Face Recognition Using Active Appearance Models”, Proc. European Conf. Computer Vision, Vol. 2, pp. 581-695, 1998
  G. J. Edwards, T. F. Cootes, C. J. Taylor, “Active Appearance Models”, Proc. European Conf. Computer Vision, Vol. 2, pp. 484-498, 1998
  H. Kobayashi, F. Hara, “Facial Interaction between Animated 3D Face Robot and Human Beings”, Proc. Int’l Conf. Systems, Man, Cybernetics, pp. 3,732-3,737, 1997
