Copyright (c) 2011 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
A Deformable 3D Facial Expression Model for Dynamic Human Emotional State Recognition
Yun Tie, Member, IEEE, Ling Guan, Fellow, IEEE
Abstract—Automatic emotion recognition from facial expression is one of the most intensively researched topics in affective computing and human-computer interaction (HCI). However, it is well known that, due to the lack of 3D features and dynamic analysis, the functional aspect of affective computing is insufficient for natural interaction. In this paper we present an automatic emotion recognition approach for video sequences based on a fiducial point controlled 3D facial model. The facial region is first detected with local normalization in the input frames. The 26 fiducial points are then located on the facial region and tracked through the video sequences by multiple particle filters. Depending on the displacement of the fiducial points, they may be used as landmarked control points to synthesize the input emotional expressions on a generic mesh model. As a physics-based transformation, Elastic Body Spline (EBS) technology is introduced to the facial mesh to generate a smooth warp that reflects the control point correspondences, and to extract deformation features from the realistic emotional expressions. Discriminative Isomap (D-Isomap) based classification is used to embed the deformation features into a low-dimensional manifold that spans an expression space with one neutral and six emotion class centers. The final decision is made by computing the Nearest Class Center (NCC) in the feature space.
Index Terms—Video analysis, Elastic Body Spline, Differential Evolution Markov Chain, Discriminative Isomap, Nearest Class Center.
I. INTRODUCTION
WITH the rapid development of Human-Machine Interaction (HMI), affective computing is currently gaining popularity in research and flourishing in the industrial domain. It aims to equip computing devices with effortless and natural communication. The ability to recognize human affective states will empower intelligent computers to interpret, understand, and respond to human emotions, moods, and possibly intentions, similar to the way humans rely on their senses to assess each other's affective state [1]. Many potential applications, such as intelligent automobile systems, the game and entertainment industries, interactive video, and the indexing and retrieval of image or video databases, can benefit from this ability.
Emotion recognition is the first and one of the most important issues in the affective computing field. It gives computers the ability to interact with humans more naturally and in a friendly manner. Affective interaction can have maximal impact when emotion recognition and expression are
Y. Tie and L. Guan are with the Department of Electrical and Computer Engineering, Ryerson University, Toronto, ON, Canada (e-mail: ytie@ryerson.ca; lguan@ee.ryerson.ca).
Copyright (c) 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.
available to all parties, human and computers [2]. Most of the
existing systems attempt to recognize the human prototypic
emotions. It is widely accepted in psychological theory that human emotions can be classified into six archetypal emotions: surprise, fear, disgust, anger, happiness, and sadness, the so-called six basic emotions pioneered by Ekman and Friesen [3]. According to Ekman, the six basic emotions are not culturally determined but universal to human culture, and thus biological in origin. Several other emotions and many combinations of emotions have also been studied, but they remain unconfirmed as universally distinguishable.
Facial expression regulates face-to-face interactions, indi-
cates reciprocity, interpersonal attraction or repulsion, and en-
ables intersubjectivity between members of different cultures
[4]. Recent research in the fields of psychology and neurology has shown that facial expression is the most natural and primary cue for communicating the quality and nature of emotions, and that it correlates well with the body and voice [5]. Each of the six basic emotions corresponds to a unique facial expression. For the objectives of an emotion recognition system, facial expression analysis is therefore considered the major indicator of a human affective state.
In the past 20 years there has been much research on recognizing emotion through facial expressions. However, challenges still remain. Traditionally, the majority of approaches to human facial expression recognition attempt to perform the task on two-dimensional data, either 2D images or 2D video sequences. Unfortunately, such approaches have difficulty handling pose variations, illumination changes and subtle facial behavior. The performance of 2D-based algorithms remains unsatisfactory, and often proves unreliable under adverse conditions.
Using 3D visual features to recognize and understand facial expressions has been demonstrated to be a more robust approach for human emotion recognition [6]. However, general 3D emotion recognition approaches are mainly based on static analysis. A growing body of psychological research supports the view that the timing of expressions is a critical parameter in recognizing emotions and that the detailed spatial dynamic deformation of the expression is important in expression recognition. Therefore, dynamic analysis of the state transitions of 3D faces could be a crucial clue to the investigation of human emotional states.
Another weakness of the existing 3D based approaches is
the complexity and intensive computational cost required to meet the challenge of accuracy. The temporal and detailed spatial information in the 3D visual cues, both at local and global scales,
may cause more difficulties and complexities in describing human facial movement. Moreover, automatic detection and segmentation based on the facial components with respect to emotion recognition has not been reported so far. Most of the existing works require manual initialization.
Fig. 1. Overall System Diagram.
In light of these problems, this paper presents an automatic
emotion recognition method from video sequences based on
a deformable 3D facial expression model. We use the elastic body spline (EBS) based approach for human emotion classification, with active deformation feature extraction depending on the 3D generic model. This model is driven by
the key fiducial points and thus makes it possible to generate
the intrinsic geometries of the emotional space. The block
diagram of this method is shown in Fig. 1. The rest of the paper
is organized as follows. Section II gives an overview of the state of the art in human emotion recognition. We then present the
proposed 3D facial modeling and feature extraction from video
sequences using EBS techniques in Section III. Discriminative
Isomap (D-Isomap) based classification is discussed in Section
IV. The experimental results are presented in Section V.
Section VI gives our conclusions.
II. RELATED WORKS
The most commonly used vision-based coding system is
the facial action coding system (FACS) proposed by Ekman
and Friesen [7] for the manual labeling of facial behavior.
To recognize emotions from facial cues, FACS enables facial expression analysis through standardized coding of changes in facial motion in terms of atomic facial actions called Action Units (AUs). FACS decomposes the facial muscular actions into 44 basic actions and describes facial expressions as combinations of the AUs. Many researchers have been inspired by this work and have tried to analyze facial expressions in image and video processing. Most methods use the distribution of facial features as input to a classification system, and the outcome is one of the facial expression classes.
Lyons et al. [8] used a set of multi-scale, multi-orientation
Gabor filters to transform the images first. The Gabor coef-
ficients sampled on the grid were combined into one single
vector. They tested their system and achieved 75% expression
classification accuracy by using Linear Discriminant Analysis
(LDA). Silva and Hui [9] determined the eye and lip position
using low-pass filtering and edge detection methods. They
achieved an average emotion recognition rate of 60% using
a neural network (NN). Cohen et al. [10] introduced the
temporal information from video sequences for recognizing
human facial expression. They proposed a multi-level hidden
Markov model (HMM) classifier for dynamic classification, in
which the temporal information was also taken into account.
Guo and Dyer [11] introduced a linear programming based
method for facial expression recognition with a small number
of training images for each expression. A pairwise framework
for feature selection was presented and three methods were
compared in the experimental part. Pantic and Patras [12]
presented a method to handle a large range of human facial
behavior by recognizing facial muscle actions that produce
expressions. The algorithm performed both automatic seg-
mentation into facial expressions and recognition of temporal
segments of 27 AUs. Anderson and McOwan [13] presented an
automated multistage system for real-time recognition of facial
expression. The system used facial motion to characterize
monochrome frontal views of facial expressions. It was able to
operate effectively in cluttered and dynamic scenes, recogniz-
ing the six emotions universally associated with unique facial
expressions. Gunes and Piccardi [14] proposed an automatic
method for temporal segment detection and affect recognition from facial and body displays. Wang and Guan [15] constructed a bimodal system for emotion recognition. They used
a facial detection scheme based on a Hue Saturation Value
(HSV) color model to detect the face from the background
and Gabor wavelet features to represent the facial expressions.
Presently, state-of-the-art 3D facial modeling via the physically based paradigm has been recognized as a key research area of emotion recognition for next-generation human-computer
interaction (HCI) [16]. Song et al. [17] presented a generic fa-
cial expression analogy technique to transfer facial expressions
between arbitrary 3D facial models, and between 2D facial
images. A geometry encoding for triangle meshes, vertex-tent coordinates, was proposed to formulate expression transfer in the 2D and 3D cases as a solution to a simple system of linear
equations. In [18], a 3D features based method for human
emotion recognition was proposed. 3D geometric information
plus colour/density information of the facial expressions were
extracted by 3D Gabor library to construct visual feature
vectors. The improved kernel canonical correlation analysis
(IKCCA) algorithm was applied for final decision, and the
overall recognition rate was about 85%. A static 3D facial
expression recognition method was proposed in [19]. The
primitive 3D facial expression features were extracted from
3D models based on the principal curvature calculation on
3D mesh models. Classification into one of the six basic emotions was done based on statistical analysis of these features, and the best performance was obtained using LDA.
Although several methods can achieve a very high recogni-
tion rate, most of the existing 3D face expression recognition
works are based on static data. Soyel and Demirel [20], [21]
used six distance measures from 3D distributions of facial
feature points to form the feature vectors. The probabilistic
NN architecture was applied to classify the facial expressions.
They obtained an average recognition rate of 87.8%. Unfortu-
nately, the authors did not specify how to identify this set
of feature points. Tang and Huang [22], [23] used similar
distance features based on the change of face shape between
the emotional expressions. Normalized Euclidean distances
between the facial feature points were used for emotion
classification. An automatic feature selection method was also
proposed based on maximizing the average relative entropy
of marginalized class-conditional feature distributions. Using
a regularized multi-class AdaBoost classification algorithm,
they achieved a 95.1% average recognition rate. However, the facial feature points were predefined on the cropped 3D face mesh model, and were not generated automatically. Such an approach is, therefore, difficult to use in real-world applications. Thus far, few efforts have been reported exploiting
3D facial expression recognition in dynamic or deformable
feature analysis. Sun and Yin [24] extracted sophisticated
features of geometric labeling and used 2D HMMs to model
the spatial and temporal relationships between the features
for recognizing expressions from 3D facial model sequences.
However, this method requires manual detection and annotation of certain facial landmarks.
III. METHODOLOGY
In this section we present a fully automatic method for
emotion recognition that exploits the EBS features between
neutral and expressional faces based on a 3D deformable mesh
(DM) model. The system developed consists of several steps.
The facial region is first detected automatically in the input
frames using the local normalization based method [25]. We
then locate 26 fiducial points over the facial region using scale-
space extrema and scale invariant feature examination. The
fiducial points are tracked continuously by multiple particle
filters throughout the video sequences. EBS is used to extract
the deformation features and the D-Isomap algorithm is then
applied for the final decision.
A. Preprocessing
Automatic face detection is considered to be the first essential requirement for our emotion recognition system. Since faces are non-rigid and have a high degree of variability in location, color and pose, facial detection is more complex than many other pattern detection problems. Occlusion, lighting distortions and varying illumination conditions can also change the overall appearance
of a face. We detect facial regions in the input video sequence using a scheme of feature selection and classification based on a local normalization technique [25]. Compared to Viola and Jones' algorithm [26], the proposed method is adaptive to the normalized input image and is designed to complete the segmentation in a single iteration. With the local normalization based method, the proposed emotion recognition system is more robust under different illumination conditions.
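The exact local normalization scheme of [25] is not reproduced here. Purely as an illustration of the idea, the sketch below normalizes each pixel by the mean and standard deviation of its neighborhood, which suppresses smooth illumination gradients before detection; the window size and epsilon are assumed values, not parameters from [25]:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_normalize(img, size=15, eps=1e-6):
    """Subtract the local mean and divide by the local standard
    deviation inside a size x size window around every pixel."""
    img = np.asarray(img, dtype=np.float64)
    mean = uniform_filter(img, size)
    sq_mean = uniform_filter(img * img, size)
    std = np.sqrt(np.maximum(sq_mean - mean * mean, 0.0))
    return (img - mean) / (std + eps)
```

A slowly varying illumination field shifts the local mean but barely changes the normalized output, which is why such a step makes a downstream face detector less sensitive to lighting.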
Fiducial points are a set of facial salient points, usually
located on the corners, tips or mid points of facial components.
Automatically detected fiducial points can capture the prominent characteristics of facial expressions, with the distances between points and the relative sizes of the facial components forming a feature vector. The chosen feature points should represent the most important characteristics of the face and be easy to extract. Active Appearance Models (AAM) and Active Shape Models (ASM) are two popular
feature localization methods with statistical face models to
prevent locating inappropriate feature points. The AAM [27],
[28] fits a generative model to the region of interest. The best
match of the model simultaneously calculates feature point
locations. The ASM algorithm learns a statistical model of
shape from manually labeled images and the PCA models
of patches around individual feature points. The best local
match of each feature is found with constraints on the relative
configuration of feature points. They are commonly used to
track faces in video. In general, the point-to-point accuracy is around 85% if the bias of the automatic labeling result relative to the manual labeling result is less than 20% of the true inter-ocular distance [29]. However, this is not sufficient for facial expression analysis.
We choose 26 fiducial points [30] on the facial region according to anthropometric measurements of the maximum movement of the facial components during expressions. To follow the subtle changes in the facial feature appearance, we define a SUCCESS case as one where the bias of a detected point from the true facial point is less than 10% of the inter-ocular distance in the test image.
test image. The proposed method constructs a set of fiducial
point detectors with scale invariant feature. Candidate points
are selected over the facial region by the local scale-space ex-
trema detection. The scale invariant feature for each candidate
point is extracted to form the feature vectors for the detection.
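The paper does not give the detector in code form. As a rough sketch of local scale-space extrema detection only, the following difference-of-Gaussians (DoG) routine marks points that are extrema of their 3x3x3 scale-space neighborhood; the sigma ladder and contrast threshold are assumptions for illustration, not values from [30]:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, minimum_filter

def scale_space_extrema(img, sigmas=(1.0, 1.6, 2.56, 4.1), thresh=0.003):
    """Return (row, col, scale_index) candidates that are extrema of
    their 3x3x3 neighbourhood in a difference-of-Gaussians stack."""
    img = np.asarray(img, dtype=np.float64)
    blurred = [gaussian_filter(img, s) for s in sigmas]
    dog = np.stack([b2 - b1 for b1, b2 in zip(blurred, blurred[1:])])
    is_max = dog == maximum_filter(dog, size=3)
    is_min = dog == minimum_filter(dog, size=3)
    peaks = (is_max | is_min) & (np.abs(dog) > thresh)
    peaks[0] = peaks[-1] = False   # need scale neighbours on both sides
    s, r, c = np.nonzero(peaks)
    return list(zip(r, c, s))
```

The scale-invariant descriptor computed around each surviving candidate (not shown) is what the fiducial point detectors would actually classify.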
We use multiple Differential Evolution Markov Chain (DE-
MC) particle filters [31] to track the fiducial points depending
on the locations of the current appearance of the spatially
sampled features. The kernel correlation based on HSV color
histograms is used to estimate the observation likelihood and
measure the correctness of particles. We define the observation
likelihood of the color measurement distribution using the correlation coefficient. Starting with a mode-seeking procedure, the posterior modes are subsequently detected through the kernel correlation analysis. This provides a consistent way to resolve
the ambiguities that arise in associating multiple objects with
measurements of the similarity criterion between the target
points and the candidate points. The proposed method achieves
an overall accuracy of 91% for the 26 fiducial points [31].
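The DE-MC proposal of [31] is not reproduced here; the sketch below is only a generic bootstrap particle filter with a histogram-correlation likelihood, to make the predict, weight, resample cycle concrete. The patch size, motion noise and likelihood sharpening factor are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_hist(img, cx, cy, r=8, bins=16):
    """Normalized intensity histogram of the patch around (cx, cy)."""
    h, w = img.shape
    x0, x1 = max(cx - r, 0), min(cx + r, w)
    y0, y1 = max(cy - r, 0), min(cy + r, h)
    hist, _ = np.histogram(img[y0:y1, x0:x1], bins=bins, range=(0.0, 1.0))
    return hist / max(hist.sum(), 1)

def correlation(h1, h2):
    """Correlation coefficient between two histograms."""
    a, b = h1 - h1.mean(), h2 - h2.mean()
    return (a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + 1e-12)

def track_step(img, particles, ref_hist, sigma=2.0):
    """One predict -> weight -> resample cycle; returns resampled
    particles and the weighted-mean position estimate."""
    particles = particles + rng.normal(0.0, sigma, particles.shape)
    scores = np.array([correlation(patch_hist(img, int(x), int(y)), ref_hist)
                       for x, y in particles])
    weights = np.exp(4.0 * scores)          # sharpen the likelihood
    weights /= weights.sum()
    estimate = (particles * weights[:, None]).sum(axis=0)
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx], estimate
```

In the paper's setting the histogram would be over HSV color and one such filter would run per fiducial point, with the DE-MC moves replacing the plain Gaussian random-walk prediction used above.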
B. 3D EBS Facial Modeling
The EBS is an image morphing technique derived from the Navier equation, which describes the deformation of homogeneous elastic tissues. Davis et al. [32], [33] designed the EBS for matching 3D magnetic resonance images (MRIs) of the breast used in the evaluation of breast cancer. The coordinate transformations were evaluated with different types of deformations and different numbers of corresponding coordinate locations. The EBS is
based on a mechanical model of an elastic body, which
can approximate the properties of body tissues. The spline
maps can be expressed as the linear combination of an affine
transformation and a Navier interpolation spline. It allows each
landmark to be mapped to the corresponding landmark in
7/28/2019 A Deformable 3D Facial Expression Model for Dynamic Human Emotional State Recognition
4/14
Copyright (c) 2011 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
4
the other image and provides interpolation of this mapping
at intermediate locations. Hassanien and Nakajima [34] used
Navier EBS to generate warp functions for facial animation
with the interpolating scattered data points. Kuo et al. [35]
proposed an iterative EBS algorithm to obtain the elastic
property of a facial model for facial expression generation.
However, most of the feature points in these works were manually localized, and only 2D examples were considered for facial image analysis.
Fig. 2. Proposed 3D mesh model with 26 fiducial points (black) and 28 characteristic points (red).
The proposed EBS method automatically generates facial expressions using a 3D physically based DM model, following a deformable-feature perspective executed through the control points, within an acceptable time for emotion recognition. The mesh wireframe generic facial model consists of characteristic feature points and deformable polygons with an EBS structure. We can deform the wireframe model to best fit a human face with any expression. The 3D affine transformation realizes the facial expressions by imitating the facial muscular actions. It formulates the deforming rules according to the FACS coding
system using the 26 fiducial points as control points. Fig. 2
shows the proposed model based on this standardized coding
system. In practical applications, not all feature points in the
model can be easily detected from the input sequences, so
we use 54 characteristic feature points for facial expression
parameterization. Characteristic feature points include: a) the
26 control points based on the fiducial points, and b) 28
dependent points which are determined by the control points.
We also assume that the physical property of the EBS structure
is the same within the facial region. The EBS deformation
analysis is presented in the following section.
Merits of this approach are: a) a physically based DM model of the human face with fiducial points for driving facial deformation according to muscle movement parameterization. The face can be modeled as an elastic body that is deformed under a tension force field. Muscles are interpreted as forces deforming the polygonal mesh of the face. The factors affecting the deformation are the tension of the muscle, the elasticity of the skin and the zone of influence. Higher-level parameterizations are easier to use for emotional expressions and can be defined in terms of low-level parameters. b) We extend a DM
facial model by a set of well-designed polygons with EBS
structure which can be efficiently modified to establish the
facial expression model. A 3D face is decomposed into area
or volume elements, each endowed with physical parameters
embedded in an EBS model according to the surface curva-
ture. The deformable element relationships are computed by
integrating the piecewise components over the entire face. c)
The control points are predefined by the landmarked fiducial
points. The number of control points is small and they can be
identified robustly and automatically. Once the control points
are adjusted, the emotional facial model can be established
using the transform function of EBS and extended to obtain
expression parameters for final recognition.
Using EBS transforms we can interpolate the positions of the characteristic feature points such that the 3D facial model of a non-neutral expression can be generated from the input video frame. Based on the arrangement of the facial muscle fibers, our EBS model calculates elastic characteristics for each emotional face by modeling the facial muscle fiber as an elastic body. The affine elastic body coordinate transformation is fitted to the displacements of the facial expression under the continuity condition. The spline obtained by this method is mathematically identical to one whose coefficients are computed directly from the original displacements of the control points.
Moreover, the resulting spline is added to the initial mesh of
the elastic body transformation to give the overall coordinate
transformation. Simulation results show that the facial model
generated by our method demonstrates good performance
under the availability of control point positions.
C. EBS parameterizations
EBS is applied for generating different facial expressions
with a generic facial model from a neutral face. By varying
the position of control points, EBS mathematically describes
the equilibrium displacement of the facial expressions sub-
jected to muscular forces using a Navier partial differential
equation (PDE). The deformable facial model equations can
be expressed in 3D vector form with the interpolation spline
relating the set of corresponding control points. The PDE of
an elastic body is based on notions of stress and strain. When
a body is subject to an external force this induces internal
forces within the body which cause it to deform. The integral
of the surface forces and body forces must be zero [36]. Let x denote a set of feature points in the 3D facial model of the neutral face, and let y_i be the corresponding control points with expressions. We then have the Navier equilibrium PDE:

μ∇²l(x) + (λ + μ)∇[∇·l(x)] + f(x) = 0    (1)

where l(x) is the displacement of the characteristic feature points within the facial model from their original positions (neutral face), λ and μ are the Lame coefficients which describe the physical properties of the face (μ is also referred to as the shear modulus), ∇² and ∇ denote the Laplacian and gradient operators, respectively, ∇·l(x) is the divergence of l(x), and f(x) is the muscular force field applied to the face.
To find an appropriate physical property for an expressional model, the muscular forces are assumed to be distributed over the homogeneous isotropic elastic body of the facial model to obtain
smooth deformation. A polynomial radially symmetric force is therefore considered:

f(x) = w d(x)    (2)

where w = [w₁ w₂ w₃]ᵀ is the strength of the force field and d(x) = (x₁² + x₂² + x₃²)^(1/2). The solutions of the PDE (1) can be computed as:

l(x) = E(x)w    (3)

and

E(x) = [α d(x)² I - 3 x xᵀ] d(x)    (4)

where α = (11μ + 5λ)/(λ + μ) = 12(1 - ν) - 1, ν is the Poisson's ratio, I is a 3×3 identity matrix, and x xᵀ is an outer product. This form is obtained using the Galerkin vector method [36] to transform the three coupled PDEs into three independent ones. The solution can be verified by substituting (3) into (1). The EBS displacement L_EBS(x) is a linear combination of the PDE solutions in (3):

L_EBS(x) = Σ_{i=1}^N E(x - y_i) w_i + Ax + b    (5)

where Ax + b is the affine portion of the EBS, A = [a₁ a₂ a₃]ᵀ is a 3×3 matrix and b is a 3×1 vector. The coefficients of the spline are determined from the control points y_i and the displacements of the feature points. The spline relaxes to an affine transformation as the distance from the control points approaches infinity.
The summation in (5), evaluated at the control points together with the side conditions, can be expressed in matrix-vector form as:

L_EBS = H E_EBS    (6)

where H is a (3N + 12) × (3N + 12) transfer matrix as described by Kuo [35], and E_EBS is the (3N + 12) × 1 vector of all the EBS coefficients:

E_EBS = [w₁ᵀ w₂ᵀ ... w_Nᵀ a₁ᵀ a₂ᵀ a₃ᵀ bᵀ]ᵀ    (7)
In our system, the 26 control points and the displacements of the control point sets are obtained from the fiducial detection and tracking steps. We solve (6) under the requirement that the spline displacements equal the control point displacements, with a constant ν over the facial region. The flatness constraints, expressed in terms of second or higher order terms (e.g., x_i², x_j² or x_i x_j), are set to zero, enforcing the conservation of linear and angular momenta for an equilibrium solution. These constraints cause the force field to be balanced so that the EBS facial model is stationary. The values of the spline for the 28 dependent points are computed from (5) with the spline coefficients E_EBS, the spline basis H and the control point locations.
The muscular force field f(x), given by (2), can be calculated from the EBS solution according to the displacements of the control points:

f(x) = [f₁ f₂ f₃]ᵀ = Σ_{i=1}^N d(x - y_i) w_i    (8)
With different ν we obtain different f(x). By the principle of superposition for an elastic body, the external forces must be minimized according to the roughness measurement constraints [35]. This ensures that the forces are optimally smooth and just sufficient to deform the elastic material so that the EBS matches the given displacements at the control point locations. By varying the value of ν in (4), we can calculate each corresponding muscular force field. To find the minimum muscular force field |f(x)|_min, we obtain the appropriate physical property and the associated EBS coefficients E_EBS. We then construct the deformable visual feature v for classification from ν and E_EBS. The algorithm for deformation feature extraction is summarized as follows.
1) Initialize the feature point positions x in the 3D facial model of the neutral face according to the detection results for the 26 fiducial points.
2) Set ν for the facial region to 0.01.
3) Update the corresponding control point positions y_i in the expressional facial model subject to the tracking results.
4) Calculate the displacements of the control point sets in the facial region.
5) Solve the EBS in (6) to obtain the associated spline coefficients E_EBS.
6) Compute the positions of the dependent points in the facial region based on the EBS solution from the previous step.
7) Calculate the muscular force field f(x) in (2) from the solution of the EBS.
8) Sweep ν from 0.02, 0.03, ..., to 0.5 and repeat steps 5), 6) and 7) to obtain the new muscular force fields.
9) Find the minimum muscular force field |f(x)|_min; fix ν and the EBS coefficients E_EBS.
10) Construct the deformable visual feature v for classification from E_EBS and ν.
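The steps above can be sketched numerically. The code below is a minimal illustration, not the authors' implementation: it assembles a (3N+12) × (3N+12) bordered linear system consistent with (5)-(7), solves it for the spline and affine coefficients, and sweeps the Poisson ratio keeping the fit with the smallest total force, as in steps 2) and 8)-10). The exact constraint layout and the force-magnitude proxy are assumptions:

```python
import numpy as np

def ebs_basis(x, nu):
    """E(x) = [alpha*d(x)^2*I - 3*x*x^T]*d(x), alpha = 12*(1-nu) - 1."""
    d = np.linalg.norm(x)
    alpha = 12.0 * (1.0 - nu) - 1.0
    return (alpha * d * d * np.eye(3) - 3.0 * np.outer(x, x)) * d

def fit_ebs(control_pts, displacements, nu):
    """Solve the bordered system so the spline interpolates the given
    control-point displacements, with side conditions on the w_i."""
    n = len(control_pts)
    m = 3 * n + 12
    H = np.zeros((m, m))
    rhs = np.zeros(m)
    for j, yj in enumerate(control_pts):            # interpolation rows
        r = 3 * j
        for i, yi in enumerate(control_pts):
            H[r:r + 3, 3 * i:3 * i + 3] = ebs_basis(yj - yi, nu)
        for k in range(3):                          # affine term A@yj + b
            H[r + k, 3 * n + 3 * k:3 * n + 3 * k + 3] = yj
        H[r:r + 3, 3 * n + 9:] = np.eye(3)
        rhs[r:r + 3] = displacements[j]
    for i in range(n):                              # side conditions
        for k in range(3):                          # sum_i w_i = 0
            H[3 * n + k, 3 * i + k] = 1.0
        for a in range(3):                          # sum_i y_i[a]*w_i = 0
            for b in range(3):
                H[3 * n + 3 + 3 * a + b, 3 * i + b] = control_pts[i][a]
    return np.linalg.solve(H, rhs)

def ebs_displacement(x, control_pts, coeffs, nu):
    """Evaluate (5) at an arbitrary point (e.g. a dependent point)."""
    n = len(control_pts)
    W = coeffs[:3 * n].reshape(n, 3)
    A = coeffs[3 * n:3 * n + 9].reshape(3, 3)
    b = coeffs[3 * n + 9:]
    return sum(ebs_basis(x - yi, nu) @ wi
               for yi, wi in zip(control_pts, W)) + A @ x + b

def extract_feature(control_pts, displacements):
    """Sweep nu and keep the fit with minimum total force, cf. (8)."""
    best = None
    for nu in np.arange(0.01, 0.50, 0.01):
        coeffs = fit_ebs(control_pts, displacements, nu)
        W = coeffs[:3 * len(control_pts)].reshape(-1, 3)
        force = sum(np.linalg.norm(
            sum(np.linalg.norm(y - yi) * wi
                for yi, wi in zip(control_pts, W)))
            for y in control_pts)
        if best is None or force < best[0]:
            best = (force, nu, coeffs)
    _, nu, coeffs = best
    return np.concatenate(([nu], coeffs))           # feature v = (nu, E_EBS)
```

In the paper's setting the control points would be the 26 tracked fiducials and `ebs_displacement` would place the 28 dependent points, as in step 6).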
IV. D-ISOMAP BASED CLASSIFIER
Once the deformable facial features have been obtained
with the EBS, we use an isomap based method for emotion
classification. Isomap was first proposed by Tenenbaum [37],
and is one of the most popular manifold learning techniques for nonlinear dimensionality reduction. It attempts to learn complex embedded manifolds using local geometric metrics within a single global coordinate system. The Isomap algorithm uses geodesic distances between points instead of simple Euclidean distances, thus encoding the manifold structure of the input space into the distances. The geodesic distances are computed by constructing a sparse graph in which each node is connected only to its closest neighbors. The geodesic distance between each pair of nodes is taken to be the length of the shortest path in the graph that connects them. These approximated geodesic distances are then used as inputs to classical multidimensional scaling (MDS). Yang
proposed a face recognition method based on Extended Isomap
(EI) [38]. In his work, the EI method was utilized by a Fisher
Linear Discriminant (FLD) algorithm. The main difference
between EI and the original Isomap is that after a geodesic
distance is obtained, the EI algorithm uses FLD to achieve the low-dimensional embedding, while the original Isomap algorithm uses MDS. X. Geng [39] proposed an improved
version of Isomap to guide the procedure of nonlinear di-
mensionality reduction. The neighborhood graph of the input
data is constructed according to a certain kind of dissimilarity
between data points, which is specially designed to integrate
the class information.
The Isomap algorithm generally has three steps: construct
a neighborhood graph, compute shortest paths, and construct
d-dimensional embedding. Classical MDS is applied to the ma-
trix of graph distances to obtain a low-dimensional embedding
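The graph-construction and shortest-path steps above can be sketched with NumPy and SciPy. This is a generic Isomap sketch, not the paper's implementation; the neighborhood size is illustrative:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import squareform, pdist

def geodesic_distances(X, n_neighbors=5):
    """Isomap steps 1-2: connect each point to its k nearest
    neighbors, then approximate geodesic distances by shortest
    paths in that sparse graph (non-edges are set to inf)."""
    E = squareform(pdist(X))                     # Euclidean distances
    n = len(E)
    W = np.full((n, n), np.inf)                  # inf marks "no edge"
    for i in range(n):
        nbrs = np.argsort(E[i])[1:n_neighbors + 1]
        W[i, nbrs] = E[i, nbrs]                  # keep the k shortest edges
    # shortest-path lengths stand in for on-manifold distances;
    # they feed classical MDS in Isomap's final step
    return shortest_path(W, directed=False)
```

For points sampled along a curve, the shortest-path length between two distant points follows the curve rather than cutting across the ambient space, which is exactly what MDS then preserves.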
of the data. However, since the original prototype Isomap does
not discriminate data acquired from different classes, when
dealing with multi-class data, several isolated sub-graphs will
result in undesirable embedding. On the other hand, the EI
[38] used the Euclidean distance to approximate the distance
between two nearest points in two classes. When the number
of classes becomes large, the classes may construct their
own spatially intrinsic structure. Then the EI and the improved version cannot recover the classes' intrinsic structures of the high-dimensional data. In order to cope with such problems,
in this paper, we propose a D-Isomap based method for
emotion classification. The discriminative information of facial
features [40] is considered so that the features can properly represent
the discriminative structures of the emotional space on the
manifold. The proposed D-Isomap provides a simple way
to obtain the low dimensional embedding and discovers the
discriminative structure on the manifold. It has the capability
of discovering nonlinear degrees of freedom and finding the
globally optimal solutions guaranteed to converge for each
manifold [41].
There are two general approaches to build the final classifier
using dynamic information from video sequences. One is to determine the dependencies based on the joint probability
distribution among the score level decisions. The other is based
on the distribution of dynamic features, in which case the
features can be discrete or continuous. Le et al. [42] proposed
a 3D dynamic expression recognition method using spatio-
temporal shape features. The HMMs algorithm was adopted
for the final classification. Sandbach et al. [43] also proposed
to recognize 3D facial expression using HMM dependent on
the motion-based features.
In this work, the final classifier is constructed based on the
dynamic feature level fusion. We change the facial expression
model following the trajectory of the 54 characteristic feature
points frame by frame. It explicitly describes the relationship between the motions of the facial feature points and the expression changes. The EBS model sequence v(t) is effectively represented by a sequence of observations from the input video, where t is the time variable. Before the raw data samples in the datasets can be used for training/testing of classification, it
is necessary to normalize the sequences such that they were in
the format required by the system. The frame rate is reduced
to 10 fps and the sequences last 3 seconds in total from a
neutral face to the apex of one expression. Since the original
displacement of v(t) in each frame depends on each individual, we use the length (distance between the Top point of the head
and the Tip point of the chin) and width (distance between
the Left point of the head and the Right point of the head) of
the neutral face for scale normalization. We then normalize
the feature matrix to regulate the variances from the EBS
coefficients and the constant λ using the L2 method. The EBS model sequence takes into account the temporal dynamics of
the feature vectors, and the labeled graph matching is then
applied to determine the category of the sample video.
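The normalization described above might be sketched as follows, under the assumption that each video is a (T, P, 2) array of tracked point coordinates recorded at 30 fps; the function name and argument layout are hypothetical:

```python
import numpy as np

def normalize_sequence(points, face_len, face_wid, src_fps=30, dst_fps=10):
    """Normalize one video's feature-point trajectories: (i) downsample
    to 10 fps, (ii) scale x by the face width and y by the face length
    measured on the neutral first frame, (iii) L2-normalize each
    frame's feature vector.  `points` is (T, P, 2): T frames, P tracked
    points, (x, y) coordinates."""
    step = max(src_fps // dst_fps, 1)
    seq = points[::step].astype(float)           # temporal downsampling
    seq[..., 0] /= face_wid                      # scale normalization
    seq[..., 1] /= face_len
    flat = seq.reshape(len(seq), -1)             # one vector per frame
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    return flat / np.where(norms == 0, 1.0, norms)  # L2 normalization
```

The scale step removes head-size differences between subjects; the L2 step regulates the per-frame variance before classification.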
The EBS feature v for each emotional facial model can be seen as one point in a high dimensional space. As we have 54 characteristic feature points in the 3D facial model, each EBS feature v has 175 dimensions. Given the variations of facial configurations during emotional expressions, these points can
be embedded into a lower dimensional space. We define the
facial EBS feature set V as the input data:
V = \{v_t\} \in \mathbb{R}^{T \times M}    (9)

where t = 1, ..., T is the input sample number and M = 175 is the dimensionality of the original data. Let U denote the embedding space of V into a low dimensional manifold with m dimensions such that:

U = \{u_t\} \in \mathbb{R}^{T \times m}    (10)
which preserves the manifold's estimated intrinsic geometry.
The D-Isomap provides a simple way to obtain the low dimen-
sional embedding and discovers the discriminative structure
on the manifold as well [40], [41]. We compute Euclidean
distance D between any pairwise points (v_t, v_{t'}) from the input space V for the training data with a discriminative weight factor λ such that:

D(v_t, v_{t'}) = { λ‖v_t − v_{t'}‖₂  if Z(v_t) = Z(v_{t'});  ‖v_t − v_{t'}‖₂  if Z(v_t) ≠ Z(v_{t'}) }    (11)

where Z(v_t) denotes the class label to which the input data v_t belongs. For pairwise points with the same class label, the Euclidean distance is shortened by the weight factor λ (0 < λ < 1). The compacting and expanding parameters are calculated empirically for the discriminative matrix. This avoids the problems encountered in [44] when the dimensions of the scatter matrices become very high in real data sets.
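The discriminative distance amounts to shrinking within-class Euclidean distances by λ. A minimal sketch, with the function name and default λ chosen for illustration:

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist

def discriminative_distances(V, labels, lam=0.5):
    """Discriminative distance matrix in the spirit of Eq. (11):
    plain Euclidean distances, shrunk by the weight factor lambda
    (0 < lambda < 1) for pairs that share a class label."""
    D = squareform(pdist(V))                 # pairwise Euclidean distances
    same = labels[:, None] == labels[None, :]
    return np.where(same, lam * D, D)        # compact within-class pairs
```

Smaller λ pulls same-class points together in the subsequent embedding, while between-class distances are left untouched (or, equivalently, are relatively expanded).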
A neighborhood graph G is constructed according to the discriminative matrix. If one point is one of the closest points to, or lies within a fixed radius of, any other point, it is defined as a neighbor of that point. The pairs are connected with paths between points, which are acquired by adding up a sequence of edges equal to the distance between neighboring points. The distances between all point pairs are computed based on a chosen distance metric. We then calculate a distance matrix
between all pairwise points by computing the shortest paths in
the neighborhood graph. The geodesic distance matrix between
all points is set to be:
D_G = \min(D_G, D_G^T)    (12)
The embedding matrix, Dm, in low dimensional space can
be calculated by converting the distance matrix to inner
products with a translation mapping [45]. Computing the largest eigenvalues and the top m eigenvectors of D_G, we obtain the
eigenvector matrix E \in \mathbb{R}^{n \times m} and the eigenvalue matrix M \in \mathbb{R}^{m \times m}. The embedding matrix in low dimensional space is then calculated as:

D_m = M^{1/2} E^T    (13)
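Eqs. (12) and (13) can be sketched as follows, assuming the standard double-centering step is the "translation mapping" of [45]; this is an interpretation, not the authors' code:

```python
import numpy as np

def embed_from_geodesics(DG, m):
    """Symmetrize the shortest-path matrix (Eq. 12), convert it to
    inner products by double centering, and embed with the top m
    eigenpairs, D_m = M^(1/2) E^T (Eq. 13)."""
    DG = np.minimum(DG, DG.T)                    # Eq. (12)
    n = DG.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * H @ (DG ** 2) @ H                 # distances -> inner products
    w, E = np.linalg.eigh(B)
    top = np.argsort(w)[::-1][:m]                # top m eigenpairs
    M = np.diag(np.maximum(w[top], 0.0))         # eigenvalue matrix
    return np.sqrt(M) @ E[:, top].T              # D_m, shape (m, n)
```

When the geodesic distances come from points that genuinely lie near an m-dimensional manifold, the rows of D_m recover those coordinates up to rotation and sign.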
We then use a Nearest Class Center (NCC) algorithm [46] to
determine the emotion classes. The NCC algorithm is a centre
based method for data classification in a high dimensional
space. Many classification methods may be considered for the final decision, such as nearest neighbors, k-means, or the EM
algorithm. In nearest-neighbor based classification, the representational capacity and the error rate depend on how the dataset is chosen to account for possible variations and on how many samples are available. The k-means method adjusts the center of
a class based on the distance of its neighboring patterns. The
EM algorithm is a special kind of quasi-Newton algorithm
with a searching direction having a positive projection on
the gradient of the log likelihood. In each EM iteration, the
estimation maximizes a likelihood function which is further
refined in each iteration by a maximization step. When the EM iteration converges, it should ideally obtain a maximum likelihood estimate of the data distribution.
A commonality among these methods is that they define a
distance between the dataset and an individual class; the classification is then determined by isolated points in the feature space. However, since the emotional features in our work are complex and not directly interpretable, a formal centre for each emotion class may be difficult to determine, or may be misplaced.
In many cases, multiple clusters are available within one video
sequence. Such property can be utilized to improve the final
decision but has been ignored by other methods. For this
reason, we need a more efficient way to generalize the representational capacity, with a sufficiently large number of feature points stored to account for as many variations as possible. Unlike the alternatives, NCC considers the centers of the k clusters with known labels from the training data and generalizes the class center for each emotion group. The derived cluster centers have more variations than the original input features and thus expand the capacity of the available
data sets. The classification for the test data is based on the
nearest distance to each class center.
The NCC algorithm is applied for the classification of
the input video based on the number of clusters k and the embedding matrix D_m. We assume that the clusters can be classified into classes a priori through any viable means and are available within each video sequence. The distance matrix thus makes use of the class information contained in the clusters of each class. A subspace is constructed from the
entire feature space based on the prior knowledge and the
within-class clusters are generalized to represent the variants
of that emotion class. Thus the generalization ability of the
classifier is increased.
Let c_k be a set of k cluster centers for the feature points belonging to a class. The k clusters determine the output class label of the input data. Each cluster approximates the sample
population in the feature space for the samples that belong
to it. The statistics of every cluster are used to estimate the
probability for the dataset. The probability distribution can be
calculated from the training data at this level. The centers of
these clusters provide essential information for discriminative
subspace, since these clusters are formed according to class
labels of emotions. We can simply enforce the mapping to be
orthogonal, i.e., we can impose the condition
U U^T = I    (14)
for the feature points on the projected set. In our case, a total of k cluster centers give (k − 1) discriminative features which span a (k − 1)-dimensional discriminative space. The cluster centers for a test data sample can be calculated using the objective function:
E(c_k) = c_k^* − c_k    (15)
A dense matrix h = ee^T, where e = [1, ..., 1]^T, is imposed on the distance matrix D_G to calculate cluster centers from the training data. Since D_G is symmetric, we put the uniform weight 1/N on every single pair for the full graph. Let p denote the sample number in one cluster, l = 7 the emotional space for labeling, and U_t the t-th element of the embedded manifold matrix for a test data sample from (10); the objective function becomes:

{C_k}_l = (1/p) \sum_{t=1}^{p} (D_m U_t − (1/2) H D_G H)    (16)

where H is the centering matrix, H = I − (1/N) ee^T. The
labeled class center {C_k}_l for the emotional space of a test video can be calculated from (16). Each data sample along with its k clusters lies on a local manifold. Since D-Isomap seeks to preserve the intrinsic geometric properties of the local neighborhoods, the input data is reconstructed by a linear combination of its nearest centers with the labeled graph matching.
For each category of facial expression, we calculate average class center coordinates C_l from the training samples. Computing the class centers c_l for the test data using (16), we obtain its class label C using the Euclidean distance to the nearest class center coordinates C_l:
C = \arg\min_{c_l} (c_l, C_l, D_m, U_t, \lambda)    (17)
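A simplified reading of the NCC decision is: estimate each emotion's center as the mean of its embedded training features, then assign a test point to the nearest center. The sketch below omits the cluster-level machinery of (15)-(16) and uses hypothetical function names:

```python
import numpy as np

def ncc_fit(U_train, labels):
    """Mean embedded feature per emotion class (the class centers)."""
    classes = np.unique(labels)
    centers = np.stack([U_train[labels == c].mean(axis=0) for c in classes])
    return classes, centers

def ncc_predict(U_test, classes, centers):
    """Label each test point with the nearest class center, in the
    spirit of Eq. (17)."""
    # pairwise distances: (n_test, n_classes)
    d = np.linalg.norm(U_test[:, None, :] - centers[None, :, :], axis=2)
    return classes[np.argmin(d, axis=1)]
```

Averaging per class makes the decision boundary piecewise linear between centers, which is robust when each emotion forms a compact cluster in the embedded space.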
V. EXPERIMENT AND RESULTS
To evaluate the performance of our proposed method, two
facial expression video datasets are used for the experiment:
RML Emotion database and Mind Reading DVD database.
The RML Emotion database [15] was originally recorded for language and context independent emotion recognition with the six fundamental emotional states: happiness, sadness, anger, disgust, fear and surprise. It includes eight subjects in a nearly frontal view (2 Italian, 2 Chinese, 2 Pakistani, 1 Persian, and 1 Canadian) and 520 video sequences in total. Each video pictures a single emotional expression and ends at the apex of that expression, while the first frame of every video sequence shows a neutral face. Video sequences from neutral to target display are digitized into 320 × 340 pixel
arrays with 24-bit color values. The Mind Reading DVD [47] is an interactive computer-based resource for facial emotional expressions, developed by Baron-Cohen and his team of psychologists. It consists of 2472 faces, 2472 voices and 2472 stories. Each video pictures the frontal face of a single actor (30 actors in total, of varying ages and ethnic origins) displaying a single facial expression. All the videos are recorded at 30 frames per second, last between 5 and 8 seconds, and have a resolution of 320 × 240.
A. Facial region detection
The facial region is detected in the input video sequence
using the face detection method with local normalization [25].
The normalized results of the original sequences show that the
histograms of all input images are widely spread to cover the
entire gray scale by local normalization; and the distribution
of pixels is not too far from uniform. As a result, dark images,
bright images, and low contrast images are much enhanced to
have an appearance of high contrast. The overall performance
of the system is considerably improved by incorporating local
normalization.
B. Fiducial points detection and tracking
The fiducial points are then detected [30] and tracked [31]
automatically in the facial region. As the location of each
fiducial point is at the center of a 16 × 16 pixel neighborhood window, and the feature vectors for the point detectors are extracted from this region, we consider detected points displaced within
five pixels from the corresponding ground truth facial points
as successful detections. 180 videos of 6 subjects from RML
Emotion database and 240 videos of 20 subjects from Mind
Reading DVD database are selected for experiment, which
constitute a total of 420 sequences of 26 subjects. We randomly divide all the 420 sequences into training and testing subsets containing 210 sequences each.
An overall recall of 92.45% and precision of 90.93% are achieved simultaneously. We also implement the AAM method mentioned
in [27] for the 26 fiducial point detection and tracking, as
shown in Fig. 3. The proposed method has a better perfor-
mance on both efficiency and accuracy.
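The five-pixel success criterion can be scored as below; for brevity the sketch matches detected and ground-truth points by index, which is an assumption about the evaluation protocol:

```python
import numpy as np

def detection_rate(detected, ground_truth, tol=5.0):
    """Fraction of fiducial points detected within `tol` pixels of the
    corresponding ground-truth point (the paper's success criterion).
    `detected` and `ground_truth` are (P, 2) arrays, matched by index."""
    detected = np.asarray(detected, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    # Euclidean displacement of each detected point from its target
    hits = np.linalg.norm(detected - ground_truth, axis=1) <= tol
    return hits.mean()
```

Recall and precision then follow from counting hits against the numbers of ground-truth and detected points per frame.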
C. EBS based emotional facial modeling
In this section, we verify the performance of the EBS based method for emotional facial modeling on the aforementioned databases. The positions of the 26 fiducial points are obtained from the detection and tracking step and then used for calculating the positions of the 28 dependent points. These positions are 2D data in the video sequences and cannot be used directly in the 3D EBS analysis. All the fiducial points need to be aligned to our 3D model first. We use a flexible generic facial modeling algorithm, FGPFM [48], for fitting each face image to the 3D mesh model. The geometric values used in FGPFM
are obtained from the BU-3DFE database [49]. There are 2500
3D facial expression models for 100 subjects in this database.
We use the 3D facial expression model with the associated
Fig. 3. Detection and tracking Result
(a) anger faces.
(b) disgust faces.
(c) fear faces.
Fig. 4. Emotional EBS model construction.
frontal-view texture image as ground truth data to train the
3D model. Initially, we define a face-centered coordinate
system used for FGPFM. All the 3-D coordinates, curvature
parameters for every vertex generation function, the weights in
the interrelationship functions and the statistical model ratios
are recorded in an FGPFM. The clustering process is used to
construct the accurate generic facial models from the training
3D data. All the selected typical training examples are used
to acquire the geometric values for each FGPFM. The optimal
geometric values of FGPFM result in full coincidence between
superimpositions of the transformed FGPFM and those facial
contours of training images. Geometric values of FGPFM are
established using the profile matching technique for silhouettes
of the training images and the FGPFM with the known view
directions. The reconstruction procedure can be regarded as
a block function of FGPFM, and the input parameters are
3-D face-centered coordinates of control points. When the
control points are accurately modified, the desired 3-D facial
model is determined based on the topological and geometric
descriptions of FGPFM.
To remove the individual differences in the facial expres-
sions, each face shape from the video sequences is normalized
to the same scale. The 26 control points on the 3D facial model
are initially estimated by the fiducial points using the back
projection technique with the set of predefined unified depth
values. The original dependent points are also predefined in
the model. Classified FGPFM ratio features are selected with a minimal Euclidean distance between the estimated ratios and the codebook-like ratio database. The depth values of the control points and the curvature parameters for reconstructing the EBS facial model are obtained from the selected ratio features classifier.
Fig. 4 shows some representative sample results for emotional model construction. Our objective here is to find the
positions of dependent points after emotional facial defor-
mation under the availability of the fiducial point position.
The basic six emotions are analyzed in this experiment. The
best-fit mesh model of a given face is estimated from the
first input frame with neutral emotion. Based on the known
tracking information, the positions of all characteristic feature
points are calculated and the EBS model is reconstructed for
any particular expression. From experimental results we can
see that our method provides good construction following the
variations of the control points.
(a) (b)
(c) (d)
Fig. 5. EBS facial model constructions with different Poisson's ratios ν: (a) a male anger face, (b) a female sadness face, (c) a female anger face, (d) a male happiness face.
We provide more experimental results in Fig. 5 to verify
the consistency of the proposed method. Fig. 5 presents the
results of the emotional facial model for different people.
The Poisson's ratio ν is assumed to be constant for the whole facial region and determined under the condition of minimum muscular force field generation. Fig. 5(a-d) shows the results when ν is obtained experimentally. Subjectively, the proposed method
provides a good facial model under different people and
expressions.
D. D-Isomap for final decision
In this section, 280 video sequences of eight subjects from
the RML Emotion database and 420 video sequences of 12
subjects from the Mind Reading DVD database are selected for D-Isomap based classifier evaluation, which constitute a
total of 700 sequences of 20 subjects with six emotions and
neutral faces.
The facial EBS features are extracted to construct a 175
dimensional vector sequence that is too large to manipulate
directly. We use the D-Isomap algorithm for dimensionality reduction, as discussed in Section IV. Since each feature vector
can be seen as one point in a 175 dimensional space, the
D-Isomap is utilized to find the embedding manifold in a
low-dimensional space to represent the original data. These
representations should cover most of the variances of the
observation based on the continuous variations of facial config-
urations. The low-dimensional space structures are extracted to reflect the manifold's estimated intrinsic geometry through D-Isomap's capability for nonlinear analysis and its convergence to globally optimal solutions.
The geodesic distance graph from (12) is used for D-Isomap
based embedding. Fig. 6 shows examples of distance matrices
with discriminative weight factors λ for seven emotional expressions of randomly selected subjects. The distance graph
reflects the intrinsic similarity of the original expression data
and consequently is considered for determining true embed-
ding. From Fig. 6 we can see, by applying the weight factor,
the points from the same cluster can be projected closer in
low dimensional space, thus the distance is compacted. On
the other hand, the distance between different clusters could
be expanded by increasing .
(a) λ = 0.1. (b) λ = 0.25.
(c) λ = 0.5. (d) λ = 0.75.
Fig. 6. Distance matrix graphs with different weight factors λ; higher values are shown in red, lower values in blue.
Increasing the dimension of the embedding space, we can
calculate the residual variance for the original data. The true
dimension of data can be found by considering the decreasing
trend in the residual value. The embedding results using
Isomap and the proposed D-Isomap with different k are presented in Fig. 7, which shows the results when k is set to 7, 12, and 20, respectively. From the results we see that our proposed method achieves an average of 10% improvement when compared with the original Isomap. The best performance is obtained when k is 12 and the dimension of the embedded space is reduced to 20, which covers more than 95% of the variance of the observations from the input data. Therefore, these 20 dimensional components are used here to represent
facial expressions in the input videos.
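The residual-variance criterion for choosing the embedding dimension is commonly computed as 1 - R^2 between the geodesic distances and the Euclidean distances in the embedding; the sketch below follows that standard Isomap practice, which is an assumption about the paper's exact formula:

```python
import numpy as np
from scipy.spatial.distance import squareform, pdist

def residual_variance(D_geo, U):
    """Isomap-style residual variance: 1 - R^2 between the geodesic
    distance matrix D_geo and the Euclidean distances in the embedding
    U. Sweep the embedding dimension and look for the 'elbow' where
    the residual levels off."""
    d_emb = squareform(pdist(U))
    iu = np.triu_indices_from(D_geo, k=1)        # upper-triangle pairs
    r = np.corrcoef(D_geo[iu], d_emb[iu])[0, 1]  # distance correlation
    return 1.0 - r ** 2
```

A near-zero residual means the embedding already reproduces the geodesic geometry, so adding dimensions beyond that point buys little.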
(a) (b)
(c) (d)
(e) (f)
Fig. 7. Dimensionality reduction using Isomap and D-Isomap: (a), (c), (e) show the results using Isomap with k = 7, 12, and 20, respectively; (b), (d), (f) show the corresponding results using D-Isomap.
We also provide expressional configurations to show apparent emotional variation in Fig. 8. For each video sequence from the database, we constructed 10 sub-clips of the samples with different frames from neutral to the apex, which can improve the representational capacity with a sufficiently large number of feature points stored to account for as many variations in the original data as possible. To show apparent
emotional variations, we provided the expressional configu-
rations based on different numbers of samples. In Fig. 8,
(a) shows the result using 700 samples with one sample for
each video, (b) using 10 samples for each video and 7000
samples in total. From the results we can see that the EBS
model sequences are embedded to a discriminative structure
on the low dimensional feature space. By applying the NCC
algorithm to the embedding results from the D-Isomap using
(17), we can determine the emotion class for a test video. We
label the emotion class centers on the embedded feature space,
shown in Fig. 8.
(a)
(b)
Fig. 8. Labeled class centers in a 2D space based on the embedding results: (a) using 700 samples, (b) using 7000 samples.
To evaluate the performance of our proposed method, we
divide these 700 sequences into five subsets of 140 sequences each. Each time, one of the five subsets is used as the testing set and the other four subsets as the training set. The evaluation procedure is repeated until every subset has been used as the testing set. A test video sequence is treated as a unit and labeled with a single expression category. The recognition accuracy is calculated as the ratio of the number of
correctly classified videos to the total number of videos in the
data set. By using the proposed classifier, we achieve an overall
accuracy of 88.2%. We list the confusion matrix for emotion
recognition with numbers representing percentage correct in
Table I. From the results we can see that features representing
different expressions exhibit great diversity since the distances
between different emotions are relatively high. On the other
hand, the same expressions collected from different subjects
are very similar due to the short distances within the same
class.
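The five-fold protocol above can be sketched as an index generator; the fold sizes follow the text, while the random shuffling and seed are assumptions:

```python
import numpy as np

def five_fold_indices(n_sequences=700, n_folds=5, seed=0):
    """Split sequence indices into 5 disjoint folds of 140 each; every
    fold serves once as the test set, the remaining four as training."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_sequences)         # shuffle once
    folds = np.array_split(order, n_folds)
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, test
```

Averaging the per-fold accuracies then gives the overall figure reported for the classifier.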
standard deviation and average rate of emotion recognition
between three Isomap methods. Table III indicates that the
proposed algorithm achieves better performance than the original Isomap (OI) and the Extended Isomap (EI). D-Isomap outperforms the other methods because it can
compact the data points from the same cluster on a high-
dimension manifold to make them closer in the low-dimension
space, and make the data points from the different clusters
farther as well. This ability could be beneficial in preserving
the homogeneous characteristics for emotion classification.
To demonstrate the discriminative embedding performance
of the proposed D-Isomap, we conducted some experiments
with state-of-the-art manifold learning methods, i.e., localized
LDA (LLDA), the discriminative version of LLE (DLLE)
and Laplacian Eigenmap (LE). LLDA [50] is based on the
local estimates of the model parameters that approximate
the non-linearity of the manifold structure with piece-wise
linear discriminating subspaces. The local neighborhood size k = 30 and subspace dimensionality d = 32 are selected to compute local metrics. DLLE [51] preserves the local
geometric properties within each class according to LLE
criterion, and the separability between different classes is enforced by maximizing margins between point pairs in different classes. The balance term h = 1, nearest neighbors k1 = 1, and smallest distances k2 = 100 are used for classification with the closest centroid. LE [52] makes use of an incremental
Laplacian Eigenmap to reduce the dimension and extract features from the data points. Drawing on the correspondence between
the graph Laplacian, the Laplacian Beltrami operator on the
manifold, and the connections to the heat equation, a geo-
metrically motivated algorithm is utilized for representing the
high dimensional data that has locality preserving properties
and a natural connection to clustering. The experiments are
conducted with a compression dimension of 50.
TABLE IV
RECOGNITION RATE OF DIFFERENT MANIFOLD LEARNING METHODS.

Method      Dimensions   Recognition Rate
LLDA        32           80.5%
DLLE        40           85.3%
LE          25           84.7%
D-Isomap    20           88.2%
In all the experiments, the final classification after dimen-
sionality reduction is determined by the nearest neighbor
criterion. Table IV shows the experimental results of different
algorithms. The results demonstrate the greater effectiveness
of D-Isomap for both feature reduction and final recognition.
It considers the label information and local manifold structure.
When dealing with multiple classes and a complex data set distribution, the proposed D-Isomap takes
advantage of weight factor to separate data with different
labels farther and cluster data with the same label closer. Thus
the proposed algorithm can gain better recognition rate.
VI. CONCLUSIONS
In this paper we present an automatic emotion recognition
method from video sequences using the 3D active deformable
information. From experimental results we find that the signif-
icant features to distinguish one individual emotion from the
other emotions are different. Some of the features selected in a
global scenario are redundant, while some of the other features
might contribute to the classification of a specific emotion.
Another observation is that there is no single feature which is significant for all the classes. This actually reveals the nature
of human emotion; there are no sharp boundaries between
the emotional states. Any single emotion may share similar
patterns to other emotions, but not all. The human perception
of emotion is based on the integration of different patterns.
In the emotion recognition field, current techniques for the
detection and tracking of facial expressions are sensitive to
head pose, clutter, and variations in lighting conditions. Few
approaches to automatic facial expression analysis are based
on deformable or dynamic 3D facial models. The proposed
system attempts to solve such problems by using a generic 3D
mesh model with D-Isomap classification. The facial region
and fiducial points are detected and tracked in the input video
frames automatically. The generic facial mesh model is then
used for EBS feature extraction. D-Isomap based classification is applied for the final decision. The merits of this work are
summarized as follows.
Facial expressions are detected and tracked automatically
in the video sequences, which can alleviate a common
problem in conventional detection and tracking methods, namely inconsistent performance due to sensitivity to illumination variations such as local shadowing, noise
and occlusion.
We model the face as an elastic body exhibiting different elastic characteristics for different facial expressions. Based on the continuity condition, the elastic property of each facial expression is found, and a complete wireframe facial model can be generated from a limited number of available feature point positions.
An adaptive partition of polygons is embedded in EBS
according to the surface curvature through the characteristic feature points. The subtle structural information can be expressed without requiring complicated facial features.
The generic 3D facial model is established so that the
good parameters of EBS can be used for emotion recogni-
tion, e.g. the appropriate physical characteristics for face
deformations, control points, etc.
We propose the use of D-Isomap for emotion recognition.
It can compact the data points from the same emotion
class on a high-dimension manifold to make them closer
in the low-dimension space, and make the data points from different clusters farther apart as well. This results in a high recognition rate when compared with other Isomap methods.
Experimental results and comparison with several other
algorithms demonstrated the effectiveness of the proposed
method.
REFERENCES
[1] R.A. Calvo and S. D'Mello, Affect Detection: An Interdisciplinary Review of Models, Methods, and Their Applications, IEEE Transactions on Affective Computing, Vol. 1 (1), pp. 18-34, June 2010.
[2] N. Sebe, H. Aghajan, T. Huang, N.M. Thalmann, C. Shan, Special Issueon Multimodal Affective Interaction, IEEE Transactions on Multimedia,Vol.12 (6), pp. 477-480, 2010.
[3] P. Ekman, T. Dalgleish, and M.E. Power, Basic emotions, Handbook ofCognition and Emotion, Wiley, Chichester, U.K., 1999.
[4] C. Darwin, The Expression of Emotions in Man and Animals, JohnMurray, 1872, reprinted by University of Chicago Press, 1965.
[5] J.F. Cohn, Advances in Behavioral Science Using Automated FacialImage Analysis and Synthesis, Signal Processing Magazine, IEEE,Vol.27 (6), pp. 128-133, 2010.
[6] K.I. Chang, K.W. Bowyer, and P.J. Flynn, Multiple Nose RegionMatching for 3D Face Recognition under Varying Facial Expression,
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.28(10), pp. 1695-1700, October 2006.
[7] P. Ekman, W.V. Friesen, and J.C. Hager, The Facial Action CodingSystem: A Technique for the Measurement of Facial Movement, SanFrancisco, Consulting Psychologist, 2002.
[8] M.J. Lyons, J. Budynek, A. Plante, and S. Akamatsu, Classifying facialattributes using a 2-D Gabor wavelet representation and discriminantanalysis, Proceedings of the 4th International Conference on AutomaticFace and Gesture Recognition, pp. 202-207, March 2000.
[9] L.D. Silva and S.C. Hui, Real-time facial feature extraction and emotionrecognition, Proceedings of 4th International Conference on Informa-tion, Communications and Signal Processing, Vol.3, pp. 1310-1314,Singapore, December 2003.
[10] I. Cohen, N. Sebe, Y. Sun, M. S. Lew, and T.S. Huang, Evaluation ofexpression recognition techniques, Proceedings of International Confer-
ence on Image and Video Retrieval, pp. 184-195, IL, USA July 2003.[11] G. Guo and C.R. Dyer, Learning from examples in the small sample
case: face expression recognition, IEEE Transactions on Systems, Man,and Cybernetics, Part B, Vol.35 (3), pp. 477-488, June 2005.
[12] M. Pantic and I. Patras, Dynamics of facial expression: recognitionof facial actions and their temporal segments from face profile imagesequences, IEEE Transactions on Systems, Man, and Cybernetics, Part
B, Vol.36 (2), pp. 433-449, April 2006.[13] K. Anderson and P.W. McOwan, A real-time automated system for the
recognition of human facial expressions, IEEE Transactions on Systems,Man, and Cybernetics, Part B, Vol.36 (1), pp. 96-105, February 2006.
[14] H. Gunes and M. Piccardi, Automatic Temporal Segment Detection andAffect Recognition from Face and Body Display, IEEE Transactions onSystems, Man, and Cybernetics, Part B, Vol.39 (1), pp. 64-84, February2009.
[15] Y. Wang and L. Guan, Recognizing Human Emotional State fromAudiovisual Signals, IEEE Transactions on Multimedia, Vol.10 (5), pp.
659-668, August 2008.[16] Z. Zeng, M. Pantic, G.I. Roisman, and T.S. Huang, A Survey of Affect
Recognition Methods: Audio, Visual, and Spontaneous Expressions,IEEE Transactions on Pattern Analysis and Machine Intelligent, Vol.31(1), pp. 39-58, January 2009.
[17] M. Song, Z. Dong, C. Theobalt, H.Q. Wang, Z.C. Liu, and H.P. Seidel,A General Framework for Efficient 2D and 3D Facial ExpressionAnalogy, IEEE Transactions on Multimedia, Vol.9 (7), pp. 1384-1395,November 2007.
[18] T. Yun and L. Guan, Human Emotion Recognition Using Real 3DVisual Features from Gabor Library, IEEE International Workshop on
Multimedia Signal Processing, pp. 481-486, Saint Malo, October 2010.[19] J. Wang, L. Yin, X. Wei, and Y. Sun, 3D facial expression recognition
based on primitive surface feature distribution, IEEE InternationalConference on Computer Vision and Pattern Recognition, pp. 1399-1406,New York, June 2006.
[20] H. Soyel and H. Demirel, Facial expression recognition using 3D
facial feature distances, International Conference on Image Analysis andRecognition, Vol.4633, pp. 831-838, Montreal, August 2007.
[21] H. Soyel and H. Demirel, Optimal feature selection for 3D facial ex-pression recognition using coarse-to-fine classification, Turkish Journalof Electrical Engineering and Computer Sciences, Vol.18 (6), pp. 1031-1040, 2010.
[22] H. Tang and T. S. Huang, 3D facial expression recognition based onautomatically selected features, IEEE Computer Society Conference onComputer Vision and Pattern Recognition Workshops, pp. 1-8, Anchorage,June 2008.
[23] H. Tang and T. Huang, 3D facial expression recognition based onproperties of line segments connecting facial feature points, IEEE
International Conference on Automatic Face and Gesture Recognition,pp. 1-6, Amsterdam, The Netherlands, 2008.
[24] Y. Sun and L. Yin, Facial expression recognition based on 3D dynamicrange model sequences, Computer Vision - ECCV 08, pp. 58-71, 2008.
[25] T. Yun and L. Guan, Automatic face detection in video sequences usinglocal normalization and optimal adaptive correlation techniques, Pattern
Recognition, Vol.42 (9), pp. 1859-1868, September 2009.
[26] P. Viola and M. Jones, Robust Real Time Object Detection, Pro-ceedings 2nd International Workshop on Statistical and ComputationalTheories of Vision, Vancouver, Canada, July 2001.
[27] J. Xiao, S. Baker, I. Matthews, T. Kanade,Real-time combined 2d+3dactive appearance models, Computer Vision and Pattern RecognitionConference, Vol.2, pp. 535-542, July 2004.
[28] R.Gross, I.Matthews, S.Baker, Constructing and Fitting Active Appear-
ance Models With Occlusion, IEEE Workshop on Face Processing inVideo, pp. 72, 2004.
[29] D. Vukadinovic, M. Pantic, Fully Automatic Facial Feature PointDetection Using Gabor Feature Based Boosted Classifiers, IEEE Inter-national Conference on Systems, Man and Cybernetics Waikoloa, Vol. 2,pp.1692 - 1698, October 2005.
[30] T. Yun and L. Guan, Automatic fiducial points detection for facialexpressions using scale invariant feature, IEEE International Workshopon Multimedia Signal Processing, pp. 1-6, Rio de Janero, Brazil, October2009.
[31] T. Yun and L. Guan, Fiducial Point Tracking for Facial ExpressionUsing Multiple Particle Filters with Kernel Correlation Analysis, IEEE
International Conference on Image Processing, pp. 373-376, Hongkong,September 2010.
[32] M.H. Davis, A. Khotanzad, D.P Flamig, and S.E Harms, A physics-based coordinate transformation for 3-D image matching, IEEE Trans-actions on Medical Imaging, Vol.16 (3), pp. 317 -328, June 1997.
[33] M.H. Davis, A. Khotanzad, D.P. Flamig, S.E. Harms, Elastic bodysplines: a physics based approach to coordinate transformation in medicalimage matching, IEEE Symposium on Computer-Based Medical Systems,pp. 81 - 88, 1995.
[34] A.Hassanien, M.Nakajima, Image morphing of facial images transfor-mation based on Navier elastic body splines, Computer Animation, pp.119 - 125, 1998.
[35] C.J. Kuo, J. Hung, M. Tsai, and P. Shih, Elastic Body Spline Techniquefor Feature Point Generation and Face Modeling, IEEE Transactions on
Image Processing, Vol.14 (12), pp. 2159-2166, December 2005.
[36] P. C. Chou and N. J. Pagano,Elasticity: Tensor, Dyadic, and EngineeringApproaches, Dover, New York, 1992.
[37] J.B. Tenenbaum, V. Silva, and J.C. Langford, A global geometricframework for nonlinear dimensional reduction, Science, Vol.290, pp.2319-2323, December 2000.
[38] M.H. Yang, Face recognition using extended isomap, InternationalConference on Image Processing, Vol.2, pp.117-120, New York, Septem-ber 2002.
[39] X. Geng, D.C. Zhan, and Z.H. Zhou, Supervised Nonlinear Dimension-ality Reduction for Visualization and Classification, IEEE Transactionson Systems, Man and Cybernetics, Part B, Vol.35 (6), pp. 10981107,December 2005.
[40] Y. Wu, K.L. Chan, and L. Wang, Face recognition based on discrimina-tive manifold learning, International Conference on Pattern Recognition,Vol.4 pp. 171-174, Cambridge, UK, September 2004.
[41] D. Zhao and L. Yang, Incremental Isometric Embedding of High-Dimensional Data Using Connected Neighborhood Graphs, IEEE Trans-actions on Pattern Analysis and Machine Intelligence, Vol.31, pp. 86-98,January 2009.
[42] V. Le, H. Tang, and T.S. Huang, Expression recognition from 3Ddynamic faces using robust spatio-temporal shape features, AutomaticFace Gesture Recognition and Workshops, pp. 414-421, 2011.
[43] G. Sandbach, S. Zafeiriou, M. Pantic, and D.Rueckert, A dynamic
approach to the recognition of 3D facial expressions and their temporalmodels, Automatic Face Gesture Recognition and Workshops, pp. 406-413, 2011.
[44] T. Friedrich, Nonlinear Dimensionality Reduction with Locally LinearEmbedding and Isomap, University of Sheffield, 2002.
[45] E. Kokiopoulou and Y. Saad, Orthogonal Neighborhood PreservingProjections: A Projection- Based Dimensionality Reduction Technique,
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.29,pp. 2143-2156, December 2007.
[46] J. Handl and J. Knowles, An Evolutionary Approach to MultiobjectiveClustering, IEEE Transactions on Evolutionary Computation, Vol.11 (1),pp. 56-76, February 2007.
[47] S.B. Cohen, Mind Reading: The Interactive Guide to Emotions, Lon-don, Jessica Kingsley, 2004.
[48] S.Y. Ho and H.L. Huang, Facial Modeling from an UncalibratedFace Image Using Flexible Generic Parameterized Facial Models, IEEE
7/28/2019 A Deformable 3D Facial Expression Model for Dynamic Human Emotional State Recognition
14/14
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
14
Transactions on Systems, Man and Cybernetics, Part B, Vol.31 (8),October 2005.
[49] L. Yin, X. Wei, Y. Sun, J. Wang, M.J. Rosato,3D Facial ExpressionDatabase for Facial Behavior Research, Automatic Face and Gesture
Recognition, pp. 211 - 216, 2006.[50] L. Zhu, F. Yun, Y. Junsong, T.S. Huang, and W. Ying, Query Driven
Localized Linear Discriminant Models for Head Pose Estimation, IEEEInternational Conference on Multimedia and Expo,pp. 1810 - 1813, July2007.
[51] X. Li, S. Lin, S. Yan, and D. Xu, Discriminant Locally Linear Em-
bedding With High-Order Tensor Data, IEEE Transactions on Systems,Man, and Cybernetics, Part B: Cybernetics,Vol. 38 (2), pp. 342 - 352,April 2008.
[52] W. Luo, Face recognition based on Laplacian Eigenmaps, Interna-tional Conference on Computer Science and Service System, pp 416 -419, June 2011.
Yun Tie (S'07) received his B.Sc. degree from Nanjing University of Science and Technology, China, his M.A.Sc. degree in Computer Science from Kwangju Institute of Science and Technology (KJIST), Korea, and his Ph.D. degree from Ryerson University, Canada. He is currently a Post-Doctoral Fellow in the Ryerson Multimedia Lab at Ryerson University, Toronto, Canada. His research interests include image/video processing, pattern recognition, 3D data modeling, and intelligent classification and their applications.
Ling Guan (S'88-M'90-SM'96-F'08) received his B.Sc. degree in Electronic Engineering from Tianjin University, China, his M.A.Sc. degree in Systems Design Engineering from the University of Waterloo, Canada, and his Ph.D. degree in Electrical Engineering from the University of British Columbia, Canada. He is currently a Professor and a Tier I Canada Research Chair in the Department of Electrical and Computer Engineering at Ryerson University, Toronto, Canada. He has also held visiting positions at British Telecom (1994), Tokyo Institute of Technology (1999), Princeton University (2000), Hong Kong Polytechnic University (2008), and Microsoft Research Asia (2002, 2009).
He has published extensively in multimedia processing and communications, human-centered computing, pattern analysis and machine intelligence, and adaptive image and signal processing. He is a recipient of the 2005 IEEE Transactions on Circuits and Systems for Video Technology Best Paper Award and an IEEE Circuits and Systems Society Distinguished Lecturer.