
Facial Expression Analysis using Nonlinear Decomposable Generative Models

Chan-Su Lee and Ahmed Elgammal

Computer Science, Rutgers University, Piscataway NJ 08854, USA

{chansu, elgammal}@cs.rutgers.edu

Abstract. We present a new framework to represent and analyze dynamic facial motions using a decomposable generative model. In this paper, we consider facial expressions which lie on a one dimensional closed manifold, i.e., which start from some configuration and come back to the same configuration, while other sources of variability, such as different classes of expression and different people, all need to be parameterized. The learned model supports tasks such as facial expression recognition, person identification, and synthesis. We aim to learn a generative model that can generate different dynamic facial appearances for different people and different expressions. Given a single image or a sequence of images, we can use the model to solve for the temporal embedding, expression type and person identification parameters. As a result we can directly infer the intensity of the facial expression, the expression type, and the person identity from the visual input. The model can successfully be used to recognize expressions performed by people never seen during training. We show experimental results applying the framework to simultaneous face and facial expression recognition.
Sub-categories: 1.1 Novel algorithms, 1.6 Others: modeling facial expression

1 Introduction

The appearance of a face performing a facial expression is an example of a dynamic appearance with global and local deformations. There are two interesting components in dynamic facial expressions: face identity (the face geometry and appearance characterizing the person) and facial motion (the deformation of the face geometry through the expression and its temporal characteristics). There has been extensive research related to face recognition [24], driven by applications in security and visual surveillance. Most face recognition systems focus on still face images, i.e., capturing identity through facial geometry and appearance. There has also been interest in expression-invariant face recognition [15, 14, 2]. Individual differences in facial expression, such as expressiveness, can be useful as a biometric to enhance accuracy in face recognition [8]. On the other hand, facial expression analysis has gained interest in computer vision with applications in human emotion analysis for HCI and affective computing. Most studies of facial expression recognition have focused on static displays of intense expressions, even though facial dynamics are important for interpreting facial expressions precisely [1].


Our objective in this paper is to learn dynamic models for facial expressions that enable simultaneous recognition of faces and facial expressions. We learn a dynamic generative model that factors out the face appearances of different people and at the same time parameterizes the different expressions.

Despite the high dimensionality of the image space, facial motions lie intrinsically on much lower dimensional subspaces. Therefore, researchers have tried to exploit subspace analysis in face recognition and facial expression analysis. PCA has been widely used in appearance modeling to discover subspaces of face appearance variation, as in [21, 10]. When dealing with dynamic facial expressions, image data lie on low dimensional nonlinear manifolds embedded in the high dimensional input space. Embedding expression manifolds in low dimensional spaces provides a way to model such manifolds explicitly. Linear subspace analysis can achieve a linear embedding of the motion manifold in a subspace; however, the dimensionality of the subspace depends on the variations in the data rather than on the intrinsic dimensionality of the manifold. Nonlinear dimensionality reduction approaches can achieve much lower dimensional embeddings of nonlinear manifolds by changing the metric from the original space to the embedding space based on the local structure of the manifold, e.g. [17, 19]. Nonlinear dimensionality reduction has recently been exploited to model manifold structure in face recognition and facial expression analysis [3]. However, all these approaches (linear and nonlinear) are data-driven, i.e., the visual input is used to model the motion manifolds. The resulting embedded manifolds therefore vary with a person's facial geometry, appearance, facial deformation, and the dynamics of the expression, all of which collectively affect the appearance of facial expressions. The embedding of the same facial expression performed by different people will be quite different, and it is hard to find a unified representation of the manifold. Conceptually, however, all these manifolds (for the same expression) are the same: we can think of them as the same expression manifold twisted differently in the input space according to each person's facial appearance. They are all topologically equivalent, i.e., homeomorphic to each other, and we can establish a bijection between any pair of them. Therefore, we utilize a conceptual manifold representation to model the facial expression configuration and learn mappings between this unified conceptual representation and each individual data manifold.

Different factors affect face appearance, and there have been efforts to decompose the multiple factors affecting appearance in face and facial expression data. Bilinear models were applied to decompose the person-dependent factor and the pose-dependent factor, as style and content, from pose-aligned face images of different people [20], and for facial expression synthesis [4]. Multilinear analysis, or higher-order singular value decomposition [11], was applied to aligned face images with variations in people, illumination and expression, and used for face recognition [22]. In this model, face images are decomposed into a tensor multiplication of people basis, illumination basis and expression basis. Facial expressions were also analyzed using multilinear analysis of a feature space similar to that of active appearance models, to recognize face and facial expression simultaneously [23]. All these approaches are limited in capturing the nonlinearity of facial expressions, as the subspaces are expansions of linear subspaces of facial images. In addition, they all deal with static facial expressions and do not model the dynamics of the expression.

In this paper, we learn nonlinear mappings between a conceptual embedding space and the facial expression image space, and decompose the mapping space using multilinear analysis. The mapping between a facial expression sequence and the embedding points captures characteristics of the data that are invariant to temporal variation but change with the person performing the expression and with the expression type. We decompose the mapping space into a person face appearance factor, which is person dependent and consistent for each person, and an expression factor, which depends on the expression type and is common to all people performing the same expression. In addition, we explicitly decompose the intrinsic face configuration during the expression, as a function of time in the embedding space, from other conceptually orthogonal factors such as facial expression and person face appearance. As a result, we learn a nonlinear generative model of facial expression that models the dynamics in a low dimensional embedding space and decomposes the multiple factors of facial expressions.

Contribution: In this paper we consider facial expressions which lie on a one dimensional closed manifold, i.e., which start from some configuration and come back to the same configuration. We introduce a framework to learn decomposable generative models for the dynamic appearance of facial expressions, where the motion is constrained to one dimensional closed manifolds while other sources of variability, such as different classes of expression and different people, all need to be parameterized. The learned model supports tasks such as facial expression recognition, person identification, and synthesis. Given a single image or a sequence of images, we can use the model to solve for the temporal embedding, expression type and person identification parameters. As a result we can directly infer the intensity of the facial expression, the expression type, and the person's face from the visual input. The model can successfully be used to recognize expressions performed by people never seen during training.

2 Facial Expression Manifolds and Nonlinear Decomposable Generative Models

We investigate low dimensional manifolds and propose conceptual manifold embedding as a representation of facial expression dynamics in Sec. 2.1. In order to preserve the nonlinearity of facial expressions in our generative model, we learn a nonlinear mapping between the embedding space and the image space of facial expressions in Sec. 2.2. A decomposable compact parameterization of the generative model is achieved using multilinear analysis of the mapping coefficients in Sec. 2.3.

2.1 Facial Expression Manifolds and Conceptual Manifold Embedding

Fig. 1. Facial expression manifolds for different subjects: (a) smile sequences from subject Db, (b) smile sequences from subject S (2 cycles, 480 frames; every 40th frame shown), (c) LLE embedding for Db's smile, (d) LLE embedding for S's smile.

We use conceptual manifold embedding for facial expressions as a uniform representation of facial expression manifolds. Conceptually, each expression sequence forms a one-dimensional closed trajectory in the input space, as the expression starts from a neutral face and comes back to the neutral face. Data-driven low dimensional manifolds obtained by nonlinear dimensionality reduction algorithms such as LLE [17] and Isomap [19] vary across people and across expression types. Fig. 1 shows low dimensional manifold representations of facial expression sequences obtained by applying LLE to high dimensional vector representations of the image sequences. Facial expression data with two repetitions of the same expression type were captured and normalized for each person, as shown in Fig. 1 (a) and (b). Fig. 1 (c) and (d) show the manifolds found by the LLE algorithm. The manifolds are elliptical curves with distortions according to the person's face and expression. Isomap and other nonlinear dimensionality reduction algorithms show similar results. Sometimes the manifold does not form a smooth curve due to noise in the tracking data and images. In addition, the embedded manifolds can be very different in some cases, and it is hard to find comparable representations across manifolds for multiple expression styles and expression types. Conceptually, however, all data-driven manifolds are equivalent: they are topologically equivalent, i.e., homeomorphic to each other and to a circular curve. Therefore, we can use the unit circle in 2D space as a conceptual embedding space for facial expressions.
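For illustration, a data-driven embedding like those in Fig. 1 (c) and (d) can be obtained with an off-the-shelf LLE implementation; the frame count, image size and neighbor count below are arbitrary stand-ins, not the paper's settings:

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding

# Stand-in for a normalized expression sequence: 480 frames of 64x64 faces.
frames = np.random.rand(480, 64 * 64)

# Data-driven nonlinear embedding (LLE [17]); the result varies per person.
lle = LocallyLinearEmbedding(n_neighbors=10, n_components=3)
coords = lle.fit_transform(frames)      # (480, 3) embedded manifold points
```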

A set of image sequences representing a full cycle of each facial expression is used in the conceptual embedding. Each image sequence is of a certain person with a certain expression, and each person has multiple expression image sequences. The image sequences need not be of the same length. We denote each sequence by $Y^{se} = \{y^{se}_1 \cdots y^{se}_{N_{se}}\}$, where $e$ denotes the expression label and $s$ the person face label. Let $N_e$ and $N_s$ denote the number of expressions and the number of people respectively, i.e., there are $N_s \times N_e$ sequences. Each sequence is temporally embedded at equidistant points on a unit circle such that $x^{se}_i = [\cos(2\pi i/N_{se})\ \sin(2\pi i/N_{se})]$, $i = 1 \cdots N_{se}$. Notice that by temporal embedding on a unit circle we do not preserve the metric of the input space; rather, we preserve the topology of the manifold.
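A minimal sketch of this temporal embedding (the helper name is ours):

```python
import numpy as np

def embed_on_unit_circle(num_frames):
    """x_i = [cos(2*pi*i/N), sin(2*pi*i/N)] for i = 1..N: equidistant
    temporal embedding of an N-frame sequence on the unit circle."""
    i = np.arange(1, num_frames + 1)
    theta = 2.0 * np.pi * i / num_frames
    return np.stack([np.cos(theta), np.sin(theta)], axis=1)   # (N, 2)

X = embed_on_unit_circle(20)            # a 20-frame expression cycle
assert np.allclose(np.linalg.norm(X, axis=1), 1.0)
```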

2.2 Nonlinear Mapping between Embedding Space and Image Space

Nonlinear mapping between the embedding space and the image space can be achieved through radial basis function interpolation [6]. Given a set of distinctive representative and arbitrary points $\{z_i \in \mathbb{R}^2, i = 1 \cdots N\}$ we can define an empirical kernel map [18] as $\psi_N(x): \mathbb{R}^2 \to \mathbb{R}^N$ where

$$\psi_N(x) = [\phi(x, z_1), \cdots, \phi(x, z_N)]^T, \quad (1)$$

given a kernel function $\phi(\cdot)$. For each input sequence $Y^{se}$ and its embedding $X^{se}$ we can learn a nonlinear mapping function $f^{se}(x)$ that satisfies $f^{se}(x_i) = y_i$, $i = 1 \cdots N_{se}$ and minimizes a regularized risk criterion. Such a function admits a representation of the form

$$f(x) = \sum_{i=1}^{N} w_i \phi(x, z_i),$$

i.e., the whole mapping can be written as

$$f^{se}(x) = B^{se} \cdot \psi(x) \quad (2)$$

where $B^{se}$ is a $d \times N$ coefficient matrix. If a radially symmetric kernel function is used, we can think of equation 2 as a typical generalized radial basis function (GRBF) interpolation [16], where each row of the matrix $B^{se}$ holds the interpolation coefficients for the corresponding element of the input, i.e., we have $d$ simultaneous interpolation functions, each from 2D to 1D. The mapping coefficients can be obtained by solving the linear system

$$[y^{se}_1 \cdots y^{se}_{N_{se}}] = B^{se} [\psi(x^{se}_1) \cdots \psi(x^{se}_{N_{se}})]$$

where the left hand side is a $d \times N_{se}$ matrix formed by stacking the images of sequence $se$ column wise and the right hand side is an $N \times N_{se}$ matrix formed by stacking the kernel mapped vectors. Using these nonlinear mappings, we can capture the nonlinearity of facial expressions across different people and expressions. More details about fitting the model can be found in [6].

2.3 Decomposition of Nonlinear Mapping Space

Each nonlinear mapping is affected by multiple factors, such as the expression and the person's face. The mapping coefficients can be arranged into a high order tensor according to expression type and person face. We apply multilinear tensor analysis to decompose the mapping into multiple orthogonal factors. This is a generalization of the nonlinear style and content decomposition introduced in [7]. Multilinear analysis can be achieved by higher-order singular value decomposition (HOSVD) with unfolding, which is a generalization of singular value decomposition (SVD) [11]. Each of the coefficient matrices $B^{se} = [b_1 b_2 \cdots b_N]$ can be represented as a coefficient vector $b^{se}$ by column stacking (stacking its columns above each other to form a vector). Therefore, $b^{se}$ is an $N_c = d \cdot N$ dimensional vector. All the coefficient vectors can then be arranged in an order-three facial expression coefficient tensor $\mathcal{B}$ with dimensionality $N_s \times N_e \times N_c$. The coefficient tensor is then decomposed as

$$\mathcal{B} = \mathcal{Z} \times_1 S \times_2 E \times_3 F \quad (3)$$

where $S$ is the mode-1 basis of $\mathcal{B}$, which represents the orthogonal basis for the person face. Similarly, $E$ is the mode-2 basis representing the orthogonal basis of the expression, and $F$ represents the basis for the mapping coefficient space. The dimensionalities of these matrices are $N_s \times N_s$, $N_e \times N_e$ and $N_c \times N_c$ for $S$, $E$ and $F$ respectively. $\mathcal{Z}$ is a core tensor with dimensionality $N_s \times N_e \times N_c$ which governs the interactions among the different mode basis matrices. As in PCA, it is desirable to reduce the dimensionality of each of the orthogonal spaces to retain a subspace representation. This can be achieved by applying higher-order orthogonal iteration for dimensionality reduction [12].

Given this decomposition, and given any $N_s$ dimensional person face vector $s$ and any $N_e$ dimensional expression vector $e$, we can generate a coefficient matrix $B^{se}$ by unstacking the vector $b^{se}$ obtained by the tensor product $b^{se} = \mathcal{Z} \times_1 s \times_2 e$. We can therefore generate any specific instant of the expression by specifying the configuration parameter $x_t$ through the kernel map defined in equation 1. The whole model for generating image $y^{se}_t$ can be expressed as

$$y^{se}_t = \mathrm{unstacking}(\mathcal{Z} \times_1 s \times_2 e) \cdot \psi(x_t).$$

This can also be expressed abstractly in generative form by arranging the tensor $\mathcal{Z}$ into an order-four tensor $\mathcal{C}$:

$$y_t = \mathcal{C} \times_1 s \times_2 e \times_3 \psi(x) \times_4 L, \quad (4)$$

where the dimensionality of the core tensor $\mathcal{C}$ is $N_s \times N_e \times N \times d$, $\psi(x)$ is a basis vector for the kernel mapping with dimension $N$ for a given $x$, and $L$ is the collection of basis vectors of all pixel elements with dimension $d \times d$. We can analyze a facial expression image sequence by estimating the parameters of this generative model.
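As a concrete reading of equation 4, generation reduces to three tensor contractions followed by the pixel basis; a minimal sketch (all names are ours, and the pixel basis $L$ is passed explicitly):

```python
import numpy as np

def generate_image(C, s, e, psi_x, L):
    """y = C x1 s x2 e x3 psi(x) x4 L  (equation 4).
    C: (Ns, Ne, N, d) core tensor, s: (Ns,), e: (Ne,),
    psi_x: (N,) kernel map of the configuration, L: (d, d) pixel basis."""
    core = np.einsum('aend,a,e,n->d', C, s, e, psi_x)  # modes 1-3 contracted
    return L @ core                                    # mode-4 product with L
```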

3 Facial Expression Analysis and Synthesis using Generative Models

There are two main approaches to representing facial motion for facial expression analysis: model-based and appearance-based. In model-based approaches, geometric features are extracted with the aid of 2D or 3D face models; a 3D deformable generic face model [5] or multistate facial component models [13] are used to extract facial features. Active appearance models are employed to use both shape and texture in [10][23]. Our generative model uses pixel intensity itself as the appearance representation, as we want not only to analyze but also to synthesize facial expressions in the image space. The final representation of facial expressions in our generative model, however, is a compact person face vector and an expression vector that are invariant to temporal characteristics, plus a low dimensional embedding that represents the temporal characteristics.

The generative model supports both sequence-based and frame-based recognition of facial expressions. Facial expression recognition systems can be categorized into frame-based and sequence-based methods [8] according to their use of temporal information. In frame-based methods, the input image is treated independently, either as a static image or as a frame of a sequence; temporal information is not used in the recognition process. In sequence-based methods, HMMs are frequently used to utilize temporal information in facial expression recognition [5]. In our generative model, the temporal characteristics are modeled on a low dimensional conceptual manifold, and we can exploit the temporal characteristics of a whole sequence by analyzing the facial expression through the mapping between the low dimensional embedding and the whole image sequence. We also provide methods to estimate the expression and face parameters from a single static image.

3.1 Preprocessing: Cropping and Normalizing Face Images

Fig. 2. Cropping and normalizing face images to a standard frontal face: (a) selected templates (eyes and a nose tip), (b) template detection, affine transformation and selected cropping region, (c) a normalized sequence where templates are selected from the first frame, (d) a normalized sequence from another expression of the same person.

The alignment and normalization of captured faces to a standard face is an important preprocessing step for robust facial expression recognition under head motion and lighting changes. We interactively select the locations of the two eyes and the nose tip, which are relatively consistent during facial expressions, from one face image of each subject. Based on the selected templates, we detect each template location in subsequent frames by finding the maximum correlation of the template images within the given frame. We crop images based on the eye and nose locations, similar to [14], after an affine transformation that aligns the eyes and nose tip to a standard frontal face. Fig. 2 (a) shows the interactively selected eye and nose tip templates. A cropping region is determined after template detection and affine transformation for every new image, as shown in (b). Fig. 2 (c) shows normalization results for the sequence whose first frame was used to select the templates, and (d) for another sequence with a different expression of the same subject without new template selection. We further normalize brightness when necessary. As a result, we can recognize facial expressions robustly under changes of head location and small changes of head orientation from a frontal view.
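As an illustration of the template detection step, a brute-force normalized cross-correlation search might look as follows (array conventions and the helper name are ours; a real implementation would use an optimized matcher):

```python
import numpy as np

def find_template(frame, template):
    """Locate a template patch (e.g., an eye) by maximizing normalized
    cross-correlation over all positions in the frame (brute force)."""
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-8)
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(frame.shape[0] - th + 1):
        for c in range(frame.shape[1] - tw + 1):
            p = frame[r:r + th, c:c + tw]
            score = np.sum(t * (p - p.mean()) / (p.std() + 1e-8))
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos  # top-left corner of the best match
```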

3.2 Facial Expression Representation

Our generative model represents facial expressions using three state variables: the person face vector $s$, the expression vector $e$, and the embedding manifold point $x$, whose dimensions are $N_s$, $N_e$ and 2 without further dimensionality reduction by orthogonal iteration. The embedding can be parameterized by a one dimensional value, as the conceptual embedding manifold, the unit circle, is a one dimensional manifold in two dimensional space. The total number of parameters representing a facial image is therefore $N_s + N_e + 1$ after we learn the generative model. Fig. 3 shows examples of person face vectors (a) and expression vectors (b) when we learn the generative model from eight people with six different expressions related to the basic emotions from the Cohn-Kanade AU coded facial expression database [9], where $N_s = 8$ and $N_e = 6$. Plots in three dimensional space of the first three parameters of the face class vectors (c) and facial expression class vectors (d) give insight into the similarity among different person faces and among different facial expression classes. Interestingly, the plot of the first three parameters of the six basic expressions in Fig. 3 (d) mirrors the conceptual similarity of the six expressions in the image space. The surprise expression class vector is located far from the other expressions, which matches the distinctly different visual motion of surprise. Anger, fear, disgust, and sadness are relatively close to each other, since they are distinguished visually by more subtle motions. The expression vector thus captures characteristics of the image space facial expressions in a low dimensional space.

Fig. 3. Facial expression analysis for eight subjects with six expressions from the Cohn-Kanade dataset: (a) eight style vectors, (b) six expression vectors, (c) style vectors plotted in 3D, (d) expression vectors plotted in 3D.

3.3 Sequence-based Facial Expression Recognition

Fig. 4. The convergence of estimated expression parameters over iterations for four test sequences: (a) sadness, (b) surprise, (c) happy, (d) anger.

Given a sequence of images representing a facial expression, we can solve for the expression class parameter $e$ and the person face parameter $s$. First, the sequence is embedded on a unit circle and aligned to the model as described in Sec. 2. Then, mapping coefficients $B$ are learned from the aligned embedding to the input. Given such coefficients, we need to find the optimal $s$ and $e$ which minimize the error

$$E(s, e) = \| b - \mathcal{Z} \times_1 s \times_2 e \|, \quad (5)$$

where $b$ is the vector representation of matrix $B$ by column stacking. If the person face parameter $s$ is known, we can obtain a closed form solution for $e$. This can be achieved by evaluating the tensor product $\mathcal{G} = \mathcal{Z} \times_1 s$. A solution for $e$ can then be obtained by solving the system $b = \mathcal{G} \times_2 e$, which can be written as a typical linear system by unfolding $\mathcal{G}$ as a matrix. Therefore the expression estimate $e$ can be obtained by

$$e = (G_2)^+ b \quad (6)$$

where $G_2$ is the matrix obtained by mode-2 unfolding of $\mathcal{G}$ and $^+$ denotes the pseudo-inverse using singular value decomposition (SVD). Similarly, we can analytically solve for $s$ if the expression parameter $e$ is known by forming the tensor $\mathcal{H} = \mathcal{Z} \times_2 e$:

$$s = (H_1)^+ b \quad (7)$$

where $H_1$ is the matrix obtained by mode-1 unfolding of $\mathcal{H}$.

Iterative estimation of $e$ and $s$ using equations 6 and 7 leads to a local minimum of the error in equation 5. Fig. 4 shows examples of expression estimation over iterations for new sequences. The y-axis shows the Euclidean distance between the estimated expression vector and the six expression class vectors of the generative model of Sec. 4.1. The estimated expression parameters usually converge to one of the expression class vectors within a few iterations. Fig. 4 (d) shows a case where more than ten iterations are required to reach a stable solution.
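A minimal sketch of this alternating estimation in Python/NumPy (the initialization and iteration count are our choices; $\mathcal{Z}$ and $b$ follow the shapes defined above):

```python
import numpy as np

def estimate_expression_and_style(Z, b, s0, iters=15):
    """Alternate the closed-form solves of equations 6 and 7:
    e = (G_2)^+ b with G = Z x1 s, then s = (H_1)^+ b with H = Z x2 e.
    Z: (Ns, Ne, Nc) core tensor, b: (Nc,) stacked mapping coefficients."""
    s = s0                                    # e.g. the mean person style
    for _ in range(iters):
        G2 = np.einsum('ijk,i->jk', Z, s)     # (Ne, Nc): unfolding of Z x1 s
        e = np.linalg.pinv(G2.T) @ b          # b = G2^T e  =>  e = (G2^T)^+ b
        H1 = np.einsum('ijk,j->ik', Z, e)     # (Ns, Nc): unfolding of Z x2 e
        s = np.linalg.pinv(H1.T) @ b
    return s, e
```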

3.4 Frame-based Facial Expression Recognition

When the input is a single face image, we need to estimate the temporal embedding, or face configuration, in addition to the expression and person face parameters of the generative model. Given an input image $y$, we need to estimate the configuration $x$, the expression parameter $e$, and the person face parameter $s$ which minimize the reconstruction error

$$E(x, s, e) = \| y - \mathcal{C} \times_1 s \times_2 e \times_3 \psi(x) \| \quad (8)$$

We can use a robust error metric instead of the Euclidean distance in the error measurement. In both cases we end up with a nonlinear optimization problem.

We assume the optimal estimated expression parameter for a given image can be written as a linear combination of the expression class vectors in the training data, i.e., we need to solve for linear regression weights $\alpha$ such that $e = \sum_{k=1}^{K_e} \alpha_k e^k$, where each $e^k$ is one of the $K_e$ expression class vectors in the training data. Similarly for the person face, we need to solve for weights $\beta$ such that $s = \sum_{k=1}^{K_s} \beta_k s^k$, where each $s^k$ is one of the $K_s$ face class vectors.

If the expression vector and the person face vector are known, then equation 8 reduces to a nonlinear 1-dimensional search for the configuration $x$ on the unit circle that minimizes the error. On the other hand, if the configuration vector and the person face vector are known, we can obtain expression conditional class probabilities $p(e^k | y, x, s)$, which are proportional to the observation likelihood $p(y | x, s, e^k)$. Such likelihood can be estimated assuming a Gaussian density centered around $\mathcal{C} \times_1 s \times_2 e^k \times_3 \psi(x)$, i.e.,

$$p(y | x, s, e^k) \approx \mathcal{N}(\mathcal{C} \times_1 s \times_2 e^k \times_3 \psi(x), \Sigma_{e^k}).$$

Given the expression class probabilities, we can set the weights to $\alpha_k = p(e^k | y, x, s)$. Similarly, if the configuration vector and the expression vector are known, we can obtain face class weights by evaluating the image likelihood given each face class $s^k$, assuming a Gaussian density centered at $\mathcal{C} \times_1 s^k \times_2 e \times_3 \psi(x)$.

This setting favors an iterative procedure for solving for $x, e, s$. However, a wrong estimate of any of the vectors would lead to wrong estimates of the others and to a local minimum. For example, a wrong estimate of the expression vector would lead to a totally wrong estimate of the configuration parameter and therefore a wrong estimate of the person face parameter. To avoid this, we use a deterministic annealing like procedure where, in the beginning, the expression weights and person face weights are forced to be close to uniform to avoid hard decisions about expression and face classes; the weights gradually become more discriminative thereafter. To achieve this, we use variable expression and person face class variances which are uniform across classes and are defined as $\Sigma_e = T_e \sigma_e^2 I$ and $\Sigma_s = T_s \sigma_s^2 I$ respectively. The parameters $T_e$ and $T_s$ start with large values and are gradually reduced, and at each step a new configuration estimate is computed. Several iterations with decreasing $T_e$ and $T_s$ allow the expression vector, the person face vector and the face configuration to be estimated iteratively from a single image.
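The following condensed sketch illustrates one way to implement this annealing loop, assuming a reconstruction function `gen(s, e, x)` standing in for equation 4; the discretized search over the circle, the single temperature schedule and all constants are illustrative choices, not the paper's:

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a list/array of scores."""
    v = np.asarray(v, dtype=float)
    v -= v.max()
    w = np.exp(v)
    return w / w.sum()

def frame_based_estimate(gen, y, E_cls, S_cls, steps=20, T0=10.0, decay=0.8):
    """Deterministic-annealing estimation of (x, e, s) for a single image y.
    gen(s, e, x) reconstructs an image; E_cls (Ke x Ne) and S_cls (Ks x Ns)
    hold the expression and person face class vectors."""
    alpha = np.ones(len(E_cls)) / len(E_cls)      # near-uniform start
    beta = np.ones(len(S_cls)) / len(S_cls)
    xs = np.linspace(0.0, 2 * np.pi, 100)         # discretized unit circle
    T = T0
    for _ in range(steps):
        e, s = alpha @ E_cls, beta @ S_cls
        # 1D search for the configuration x minimizing reconstruction error.
        errs = [np.linalg.norm(y - gen(s, e, x)) for x in xs]
        x = xs[int(np.argmin(errs))]
        # Class weights from Gaussian likelihoods; T plays the role of Te, Ts.
        alpha = softmax([-np.linalg.norm(y - gen(s, ek, x)) ** 2 / T for ek in E_cls])
        beta = softmax([-np.linalg.norm(y - gen(sk, e, x)) ** 2 / T for sk in S_cls])
        T *= decay                                # gradually sharpen decisions
    return x, alpha @ E_cls, beta @ S_cls
```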

3.5 Facial Expression Synthesis

Our model can generate new facial expressions by combining new facial expression and person face parameters. Since we have decomposed the mapping space that captures the nonlinear deformation in facial expressions, linear interpolation of face style and facial expression still captures nonlinearity in the synthesized expression. In addition, we can control the person face and facial expression parameters separately as a result of the multilinear decomposition. A new person face vector and a new facial expression vector can be synthesized by linear interpolation of existing person face class vectors and expression class vectors using parameters $\alpha_i$ and $\beta_j$ as follows:

$$e^{new} = \alpha_1 e^1 + \alpha_2 e^2 + \cdots + \alpha_{N_e} e^{N_e}, \quad s^{new} = \beta_1 s^1 + \beta_2 s^2 + \cdots + \beta_{N_s} s^{N_s}, \quad (9)$$

where $\sum_i \alpha_i = 1$, $\sum_j \beta_j = 1$, and $\alpha_i \geq 0$, $\beta_j \geq 0$, so that the interpolation stays within the convex set of the original expression classes and face classes. Here $\alpha_i$ and $\beta_j$ are control parameters, whereas in recognition they are estimated as in Sec. 3.4. We can also control these interpolation parameters according to temporal information or configuration. A new facial expression image can then be generated using the new style and expression parameters:

$$y^{new}_t = \mathcal{C} \times_1 s^{new}_t \times_2 e^{new}_t \times_3 \psi(x_t) \quad (10)$$

Fig. 5 shows examples of the synthesis of new facial expressions and person faces. During synthesis of the new images, we couple the control parameter $t$ to the embedding coordinate $x$ and the interpolation parameters $\alpha$ and $\beta$. In the case of Fig. 5 (a), $t$ varies from 0 to 1 and the new expression parameter is $e^{new}_t = (1-t) e^{smile} + t e^{surprise}$. As a result, the facial expression starts from the neutral pose of the smile and animates a new expression as $t$ changes; when $t = 1$, the expression becomes a peak surprise expression. In the case of (b), $t$ varies from 1 to 0. In the same way, we can synthesize new faces during a smile expression, as in (c) and (d). Fig. 5 (e) shows simultaneous control of the person face and expression parameters. This demonstrates the potential for synthesizing new facial expressions in the image space using our generative model.

Fig. 5. Facial expression synthesis. First row: expression transfer, (a) neutral → smile → surprise, (b) surprise → angry → neutral. Second row: person face transfer during a smile expression, (c) subject A face → subject B face, (d) subject B face → subject C face. Third row: (e) simultaneous transfer of face and expression: neutral → smile → surprise → fear → neutral.

4 Experimental Results

4.1 Person Independent Recognition of Facial Expression: Cohn-Kanade Facial Expression Dataset

We test the performance of facial expression analysis by our generative model using the Cohn-Kanade AU coded facial expression database [9]. We first collected eight subjects with all six basic expression sequences, i.e., 48 expression sequences whose lengths vary between 11 and 33 frames from neutral to the target display. We normalized the image sequences by cropping based on the eye and nose templates as explained in Sec. 3.1. We embed each sequence onto a half circle of the conceptual manifold, since the recorded data cover half of one neutral → target expression → neutral cycle. Eight equidistant centers are used in learning the GRBF mapping with a thin-plate spline basis. We use the full dimensionality to represent each style and expression. Fig. 3 shows the expression vectors and person face vectors after learning the generative model from these eight subjects with six expressions. Fig. 5 shows examples of facial expression synthesis using this generative model.

Sequence-based expression recognition: The performance of person independent facial expression recognition is tested by leave-one-out cross-validation using whole sequences in the database [9]. We learned a generative model using 42 sequences from seven subjects and tested on the six sequences of the remaining subject, whose data were not used in learning the model. We tested recognition performance by selecting the nearest expression class vector after the iterations of sequence-based expression recognition in Sec. 3.3. Table 1 shows the confusion matrix for the 48 sequences. The result shows the potential of the estimated expression vectors as feature vectors for more advanced classifiers such as SVM.
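For reference, a sketch of the thin-plate spline basis and the eight equidistant centers mentioned in the setup above (treating $\phi(0) = 0$ as the limiting value is our choice):

```python
import numpy as np

def thin_plate(x, z, eps=1e-12):
    """Thin-plate spline kernel phi(r) = r^2 log r with r = ||x - z||;
    the value at r = 0 is taken as 0 (the limit)."""
    r = np.linalg.norm(x - z)
    return 0.0 if r < eps else r * r * np.log(r)

# Eight equidistant GRBF centers on the unit circle, as used above.
theta = 2.0 * np.pi * np.arange(8) / 8
centers = np.stack([np.cos(theta), np.sin(theta)], axis=1)
```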

Table 1. Person-independent average confusion matrix by sequence-based expression recognition

Emotion    Happy      Surprise   Sadness    Anger      Disgust    Fear
Happy      25%(2)     0          0          37.5%(3)   25%(2)     25%(2)
Surprise   12.5%(1)   62.5%(5)   12.5%(1)   0          0          12.5%(1)
Sadness    0          0          37.5%(3)   25%(2)     12.5%(1)   25%(2)
Anger      12.5%(1)   0          37.5%(3)   50%(4)     0          0
Disgust    12.5%(1)   12.5%(1)   12.5%(1)   25%(2)     12.5%(1)   25%(2)
Fear       0          0          0          50%(4)     0          50%(4)

Frame-based expression recognition: Using the generative model, we can estimate person face parameters and expression parameters for a given expression image, or for a sequence of images, by frame-by-frame estimation. We collected additional data with five different expressions from 16 subjects. We used the generative model learned from eight subjects with six expressions to estimate expression parameters and person face parameters using the deterministic annealing procedure of Sec. 3.4. Fig. 6 (a), (b), (c) show the expression weight values $\alpha$ for every frame in three different expression sequences. The weights become more discriminative as the expression gets closer to the target expression. We can recognize the expression using the maximum-weight expression class in every frame. Table 2 shows the recognition results when we classify facial expressions by the maximum expression weight at the last frame of 80 sequences.

Fig. 6. Estimated expression weights in frame-based estimation: (a) happy (4th, 8th, 12th, 16th, 20th frames), (b) surprise (2nd, 5th, 8th, 11th, 14th frames), (c) sadness (1st, 5th, 9th, 13th, 17th frames).

Table 2. Person-independent average confusion matrix by frame-based recognition: classification by the maximum-weight expression at the last frame

Emotion    Happy      Surprise   Sadness    Anger      Disgust    Fear
Happy      93.3%(14)  0          0          0          0          6.7%(1)
Surprise   0          100%(16)   0          0          0          0
Sadness    0          7.1%(1)    28.6%(4)   7.1%(1)    35.7%(5)   21.4%(3)
Anger      9.1%(1)    0          18.2%(2)   27.3%(3)   45.4%(5)   0
Disgust    9.1%(1)    0          9.1%(1)    18.2%(2)   63.6%(7)   0
Fear       25%(3)     0          8.3%(1)    0          8.3%(1)    58.3%(7)

4.2 Dynamic Facial Expression and Face Recognition

We used the CMU-AMP facial expression database, which has been used for robust face recognition under varying facial expressions [14]. We collected sequences of ten people with three expressions (smile, anger, surprise) by manual segmentation from the whole sequences. We learned a generative model from nine people; the data of the remaining person were used to test recognition of expressions for a new person. The unit circle is used to embed each expression sequence.

We used the learned generative model to recognize facial expression and person identity at each frame of the whole sequence using the frame-based algorithm in Sec. 3.4. Fig. 7 (a) shows example frames of a whole sequence, and (d), (e), (f) the three expression probabilities obtained at each frame. The person face weights, which are used for person identification, consistently show dominant weights for the subject's face, as in Fig. 7 (b). Fig. 7 (c) shows that the estimated embedding parameters are close to the true embedding of the manually selected sequences. We also used the learned model to recognize facial expressions in sequences of a new person whose data were not used during training. Fig. 8 shows expression recognition for the new person: the model generalizes to the new person and can distinguish the three expressions throughout the whole sequence.

Fig. 7. Facial expression analysis with partially trained segments: (a) source sequence images (frames 6, 11, 16, 21, ..., 71), (b) estimated style weights, (c) estimated embedding parameters (true vs. estimated values), (d), (e), (f) expression weights for smile, angry and surprise.

Fig. 8. Expression recognition for a new person: (a) source sequence images (frames 6, 11, 16, 21, ..., 71), (b), (c), (d) expression weights for smile, angry and surprise.

5 Conclusion

In this paper we presented a framework for learning a decomposable generative model for facial expression analysis. Conceptual manifold embedding on a unit circle is used to model the intrinsic facial expression configuration on a closed 1D manifold. The embedding allows modeling any variations (twists) of the manifold given any number of factors, such as different people, different expressions, etc., since all the resulting manifolds are still topologically equivalent to the unit circle. This is not achievable if data-driven embedding is used. The use of a generative model is tied to the use of conceptual embedding, since the mapping from the manifold representation to the input space is well defined, in contrast to a discriminative model where the mapping from the visual input to the manifold representation is not necessarily a function. We introduced a framework to solve for facial expression factors, person face factors and configurations using iterative methods for a whole sequence and deterministic annealing for a single frame. The estimated expression parameters can be used as feature vectors for expression recognition with advanced classification algorithms like SVM. The frame-by-frame estimation of facial expression shows similar weights when the expression image is close to the neutral face, and more discriminative weights near the target facial expressions. The facial expression weights may be useful not only for expression recognition but also for other characteristics, such as the expressiveness of the expression.

References

1. Z. Ambadar, J. W. Schooler, and J. F. Cohn. Deciphering the enigmatic face: The importance of facial dynamics in interpreting subtle facial expressions. Psychological Science, 16(5):403–410, 2005.

2. A. M. Bronstein, M. M. Bronstein, and R. Kimmel. Expression-invariant 3D face recognition. In AVBPA, LNCS 2688, pages 62–70, 2003.

3. Y. Chang, C. Hu, and M. Turk. Probabilistic expression analysis on manifolds. In Proc. of CVPR, pages 520–527, 2004.

4. E. S. Chuang, H. Deshpande, and C. Bregler. Facial expression space learning. In Pacific Conference on Computer Graphics and Applications, pages 68–76, 2002.

5. I. Cohen, N. Sebe, A. Garg, L. S. Chen, and T. S. Huang. Facial expression recognition from video sequences: Temporal and static modeling. CVIU, pages 160–187, 2003.

6. A. Elgammal. Nonlinear manifold learning for dynamic shape and dynamic appearance. In Workshop Proc. of GMBV, 2004.

7. A. Elgammal and C.-S. Lee. Separating style and content on a nonlinear manifold. In Proc. of CVPR, volume 1, pages 478–485, 2004.

8. A. K. Jain and S. Z. Li, editors. Handbook of Face Recognition, chapter 11: Face Expression Analysis. Springer, 2005.

9. T. Kanade, Y. Tian, and J. F. Cohn. Comprehensive database for facial expression analysis. In Proc. of FGR, pages 46–53, 2000.

10. A. Lanitis, C. J. Taylor, and T. F. Cootes. Automatic interpretation and coding of face images using flexible models. IEEE Trans. PAMI, 19(7):743–756, 1997.

11. L. De Lathauwer, B. de Moor, and J. Vandewalle. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4):1253–1278, 2000.

12. L. De Lathauwer, B. de Moor, and J. Vandewalle. On the best rank-1 and rank-(r1, r2, ..., rn) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4):1324–1342, 2000.

13. Y.-L. Tian, T. Kanade, and J. F. Cohn. Recognizing action units for facial expression analysis. IEEE Trans. PAMI, 23(2), 2001.

14. X. Liu, T. Chen, and B. V. Kumar. Face authentication for multiple subjects using eigenflow. Pattern Recognition, 36:313–328, 2003.

15. A. M. Martinez. Recognizing expression variant faces from a single sample image per class. In Proc. of CVPR, pages 353–358, 2003.

16. T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9):1481–1497, 1990.

17. S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.

18. B. Schölkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond. MIT Press, 2002.

19. J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.

20. J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Computation, 12:1247–1283, 2000.

21. M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.

22. M. A. O. Vasilescu and D. Terzopoulos. Multilinear analysis of image ensembles: TensorFaces. In 7th European Conference on Computer Vision, pages 447–460, 2002.

23. H. Wang and N. Ahuja. Facial expression decomposition. In Proc. of ICCV, volume 2, pages 958–965, 2003.

24. W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Comput. Surv., 35(4):399–458, 2003.