Vision-based Control of 3D Facial Animation

Eurographics/SIGGRAPH Symposium on Computer Animation (2003), D. Breen, M. Lin (Editors)

Vision-based Control of 3D Facial Animation

Jin-xiang Chai 1†, Jing Xiao 1 and Jessica Hodgins 1

1 The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA

Abstract

Controlling and animating the facial expression of a computer-generated 3D character is a difficult problem because the face has many degrees of freedom while most available input devices have few. In this paper, we show that a rich set of lifelike facial actions can be created from a preprocessed motion capture database and that a user can control these actions by acting out the desired motions in front of a video camera. We develop a real-time facial tracking system to extract a small set of animation control parameters from video. Because of the nature of video data, these parameters may be noisy, low-resolution, and contain errors. The system uses the knowledge embedded in motion capture data to translate these low-quality 2D animation control signals into high-quality 3D facial expressions. To adapt the synthesized motion to a new character model, we introduce an efficient expression retargeting technique whose run-time computation is constant, independent of the complexity of the character model. We demonstrate the power of this approach through two users who control and animate a wide range of 3D facial expressions of different avatars.

Categories and Subject Descriptors (according to ACM CCS): I.3.7 [Computer Graphics]: Animation; I.3.6 [Computer Graphics]: Interaction techniques; I.4.8 [Image Processing and Computer Vision]: Tracking

1. Introduction

Providing intuitive control over three-dimensional facial expressions is an important problem in the context of computer graphics, virtual reality, and human computer interaction. Such a system would be useful in a variety of applications such as video games, electronically mediated communication, user interface agents and teleconferencing. Two difficulties, however, arise in animating a computer-generated face model: designing a rich set of believable actions for the virtual character and giving the user intuitive control over these actions.

One approach to animating motion is to simulate the dynamics of muscle and skin. Building and simulating such a model has proven to be an extremely difficult task because of the subtlety of facial skin motions [40, 34]. An alternative solution is to use motion capture data [20, 30]. Recent technological advances in motion capture equipment make it possible to record three-dimensional facial expressions with high fidelity, resolution, and consistency. Although motion capture data are a reliable way to capture the detail and nuance of live motion, reuse and modification for a different purpose remains a challenging task.

† http://graphics.cs.cmu.edu/projects/face-animation

Figure 1: Interactive Expression Control: a user can control 3D facial expressions of an avatar interactively. (Left) The users act out the motion in front of a single-view camera. (Middle) The controlled facial movement of the avatars with gray masks. (Right) The controlled facial movement of the avatars with texture-mapped models.

Providing the user with an intuitive interface to control a broad range of facial expressions is difficult because character expressions are high dimensional but most available input devices are not. Intuitive control of individual degrees of freedom is currently not possible for interactive environments unless the user can use his or her own face to act out the motion using online motion capture. However, accurate motion capture requires costly hardware and extensive instrumenting of the subject and is therefore not widely available or practical. A vision-based interface would offer an inexpensive and nonintrusive alternative for controlling the avatar interactively, though accurate, high-resolution, and real-time facial tracking systems have not yet been developed.

In this paper, we propose to combine the strengths of a vision-based interface with those of motion capture data for interactive control of 3D facial animations. We show that a rich set of lifelike facial actions can be created from a motion capture database and that the user can control these actions interactively by acting out the desired motions in front of a video camera (figure 1). The control parameters derived from a vision-based interface are often noisy and frequently contain errors [24]. The key insight in our approach is to use the knowledge embedded in motion capture data to translate these low-quality control signals into high-quality facial expressions.

In the process, we face several challenges. First, we must map low-quality visual signals to high-quality motion data. This problem is particularly difficult because the mapping between the two spaces is not one-to-one, so a direct frame-by-frame mapping will not work. Second, in order to control facial actions interactively via a vision-based interface, we need to extract meaningful animation control signals from the video sequence of a live performer in real time. Third, we want to animate any 3D character model by reusing and interpolating motion capture data, but motion capture data record only the motion of a finite number of facial markers on a source model. Thus we need to adapt the motion capture data of a source model to all the vertices of the character model to be animated. Finally, if the system is to allow any user to control any 3D face model, we must correct differences between users because each person has somewhat different facial proportions and geometry.

1.1. Related Work

Facial animation has been driven by keyframe interpolation [31, 36, 27], direct parameterization [32, 33, 11], the user's performance [13, 43, 41, 14], pseudomuscle-based models [42, 9, 23], muscle-based simulation [40, 26], 2D facial data for speech [6, 5, 15] and full 3D motion capture data [20, 30]. Among these approaches, our work is most closely related to the performance-driven approach and motion capture, and we therefore review the research related to these two approaches in greater detail. Parke and Waters offer an excellent survey of the entire field [34].

A number of researchers have described techniques for recovering facial motions directly from video. Williams tracked expressions of a live performer with special makeup and then mapped 2D tracking data onto the surface of a scanned 3D face model [43]. Facial trackers without special markers can be based on deformable patches [4], edge or feature detectors [41, 25], 3D face models [14, 37, 12] or data-driven models [19]. For example, Terzopoulos and Waters tracked the contour features on eyebrows and lips for automatic estimation of the face muscle contraction parameters from a video sequence, and these muscle parameters were then used to animate the physically based muscle structure of a synthetic character [41]. Essa and his colleagues tracked facial expressions using optical flow in an estimation and control framework coupled with a physical model describing the skin and muscle structure of the face [14]. More recently, Gokturk and his colleagues applied PCA to stereo tracking data to learn a deformable model and then incorporated the learned deformation model into an optical flow estimation framework to simultaneously track the head and a small set of facial features [19].

The direct use of tracked motion for animation requires that the face model of a live performer have similar proportions and geometry as those of the animated model. Recently, however, the vision-based tracking approach has been combined with blendshape interpolation techniques [32, 27] to create facial animations for a new target model [7, 10, 16]. The target model could be a 2D drawing or any 3D model. Generally, this approach requires that an artist generate a set of key expressions for the target model. These key expressions are correlated to the different values of facial features in the labelled images. Vision-based tracking can then extract the values of the facial expressions in the video image, and the examples can be interpolated appropriately.

Buck and his colleagues introduced a 2D hand-drawn animation system in which a small set of 2D facial parameters are tracked in the video images of a live performer and then used to blend a set of hand-drawn faces for various expressions [7]. The FaceStation system automatically located and tracked 2D facial expressions in real time and then animated a 3D virtual character by morphing among 16 predefined 3D morphs [16]. Chuang and Bregler explored a similar approach for animating a 3D face model by interpolating a set of 3D key facial shapes created by an artist [10]. The primary strength of such animation systems is the flexibility to animate a target face model that is different from the source model in the video. These systems, however, rely on the labor of skilled animators to produce the key poses for each animated model. The resolution and accuracy of the final animation remain highly sensitive to those of the visual control signal due to the direct frame-by-frame mappings adopted in these systems. Any jitter in the tracking system results in an unsatisfactory animation. A simple filtering technique like a Kalman filter might be used to reduce the noise and the jumping artifacts, but it would also remove the details and high frequencies in the visual signals.

Like vision-based animation, motion capture also uses measured human motion data to animate facial expressions. Motion capture data, though, have more subtle details than vision tracking data because an accurate hardware setup is used in a controlled capture environment. Guenter and his colleagues created an impressive system for capturing human facial expressions using special facial markers and multiple calibrated cameras and replaying them as a highly realistic 3D "talking head" [20]. Recently, Noh and Neumann presented an expression cloning technique to adapt existing motion capture data of a 3D source facial model onto a new 3D target model [30]. A recent notable example of motion capture data is the movie The Lord of the Rings: The Two Towers, where prerecorded movement and facial expressions were used to animate the synthetic character "Gollum."

Both the vision-based approach and motion capture have advantages and disadvantages. The vision-based approach gives us an intuitive and inexpensive way to control a wide range of actions, but current results are disappointing for animation applications [18]. In contrast, motion capture data generate high quality animation but are expensive to collect, and once they have been collected, may not be exactly what the animator needs, particularly for interactive applications in which the required motions cannot be precisely or completely predicted in advance. Our goal is to obtain the advantage of each method while avoiding the disadvantages. In particular, we use a vision-based interface to extract a small set of animation control parameters from a single-camera video and then use the embedded knowledge in the motion capture data to translate it into high quality facial expressions. The result is an animation that does what an animator wants it to do, but has the same high quality as motion capture data.

An alternative approach to performance-based animation is to use high degree-of-freedom (DOF) input devices to control and animate facial expressions directly. For example, DeGraf demonstrated a real-time facial animation system that used a special purpose interactive device called a "waldo" to achieve real-time control [13]. Faceworks is another high-DOF device that gives an animator direct control of a character's facial expressions using motorized sliders [17]. These systems are appropriate for creating animation interactively by trained puppeteers but not for occasional users of video games or teleconferencing systems because an untrained user cannot learn to simultaneously manipulate a large number of DOFs independently in a reasonable period of time.

Computer vision researchers have explored another way to combine motion capture data and vision-based tracking [21, 35, 39]. They use motion capture data to build a prior model describing the movement of the human body and then incorporate this model into a probabilistic Bayesian framework to constrain the search. The main difference between our work and this approach is that our goal is to animate and control the movement of any character model rather than reconstruct and recognize the motion of a specific subject. What matters in our work is how the motion looks, how well the character responds to the user's input, and the latency. To this end, we not only want to remove the jitter and wobbles from the vision-based interface but also to retain as much as possible the small details and high frequencies in motion capture data. In contrast, tracking applications require that the prior model from motion capture data is able to predict the real world behaviors well so that it can provide a powerful cue for the reconstructed motion in the presence of occlusion and measurement noise. The details and high frequencies of the reconstructed motion are not the primary concern of these vision systems.

Figure 2: System overview diagram. At run time, the video images from a single-view camera are fed into the Video Analysis component, which simultaneously extracts two types of animation control parameters: expression control parameters and 3D pose control parameters. The Expression Control and Animation component uses the expression control parameters as well as a preprocessed motion capture database to synthesize the facial expression, which describes only the movement of the motion capture markers on the surface of the motion capture subject. The Expression Retargeting component uses the synthesized expression, together with the scanned surface model of the motion capture subject and the input avatar surface model, to produce the facial expression for an avatar. The avatar expression is then combined with the avatar pose, which is directly derived from the pose control parameters, to generate the final animation.

1.2. Overview

Our system transforms low-quality tracking motion into high-quality animation by using the knowledge of human facial motion that is embedded in motion capture data. The input to our system consists of a single video stream recording the user's facial movement, a preprocessed motion capture database, a 3D source surface scanned from the motion capture subject, and a 3D avatar surface model to be animated.

By acting out a desired expression in front of a camera, any user has interactive control over the facial expressions of any 3D character model. Our system is organized into four major components (figure 2):

• Video analysis. Simultaneously track the 3D position and orientation of the head and a small group of important facial features in video and then automatically translate them into two sets of high-level animation control parameters: expression control parameters and head pose parameters.

• Motion capture data preprocessing. Automatically separate head motion from facial deformations in the motion capture data and extract the expression control parameters from the decoupled motion capture data.

• Expression control and animation. Efficiently transform the noisy and low-resolution expression control signals to high-quality motion in the context of the motion capture database. Degrees of freedom that were noisy and corrupted are filtered and then mapped to the motion capture data; missing degrees of freedom and details are synthesized using the information contained in the motion capture data.

• Expression retargeting. Adapt the synthesized motion to animate all the vertices of a different character model at run time.

The motion capture data preprocessing is done off-line; the other stages are completed online based on input from the user. We describe each of these components in more detail in the next four sections of this paper.

2. Video Analysis

An ideal vision-based animation system would accurately track both 3D head motion and deformations of the face in video images of a performer. If the system is to allow any user to control the actions of a character model, the system must be user-independent. In the computer vision literature, extensive research has been done in the area of vision-based facial tracking. Many algorithms exist, but the performance of facial tracking systems, particularly user-independent expression tracking, is not very good for animation applications because of the direct use of vision-based tracking data for animation. Consequently, we do not attempt to track all the details of the facial expression. Instead, we choose to robustly track a small set of distinctive facial features in real time and then translate these low-quality tracking data into high-quality animation using the information contained in the motion capture database.

Section 2.1 describes how the system simultaneously tracks the pose and expression deformation from video. The pose tracker recovers the position and orientation of the head, whereas the expression tracker tracks the position of 19 distinctive facial features. Section 2.2 explains how to extract a set of high-level animation control parameters from the tracking data.

2.1. Facial tracking

Our system tracks the 6 DOFs of head motion (yaw, pitch, roll and 3D position). We use a generic cylinder model to approximate the head geometry of the user and then apply a model-based tracking technique to recover the head pose of the user in the monocular video stream [45, 8]. The expression tracking step involves tracking 19 2D features on the face: one for each mid-point of the upper and lower lip, one for each mouth corner, two for each eyebrow, four for each eye, and three for the nose, as shown in figure 3. We choose these facial features because they have high contrast textures in the local feature window and because they record the movement of important facial areas.

Initialization: The facial tracking algorithm is based on a hand-initialized first frame. Pose tracking needs the initialization of the pose and the parameters of the cylinder model, including the radius, height and center position, whereas expression tracking requires the initial positions of the 2D features. By default, the system starts with the neutral expression in the frontal-parallel pose. A user clicks on the 2D positions of the 19 points in the first frame. After the features are identified in the first frame, the system automatically computes the cylindrical model parameters. After the initialization step, the system builds a texture-mapped reference head model for use in tracking by projecting the first image onto the surface of the initialized cylinder model.

Pose tracking: During tracking, the system dynamically updates the position and orientation of the reference head model. The texture of the model is updated by projecting the current image onto the surface of the cylinder. When the next frame is captured, the new head pose is automatically computed by minimizing the sum of the squared intensity differences between the projected reference head model and the new frame. The dynamically updated reference head model can deal with gradual lighting changes and self-occlusion, which allows us to recover the head motion even when most of the face is not visible.

Figure 3: User-independent facial tracking: the red arrow denotes the position and orientation of the head and the green dots show the positions of the tracked points.

Because the reference head model is dynamically updated, the tracking errors will accumulate over time. We use a re-registration technique to prevent this problem. At run time, the system automatically stores several texture-mapped head models in key poses, and chooses to register the new video image with the closest example rather than with the reference head model when the difference between the current pose and its closest pose among the examples falls below a user-defined threshold.

Some pixels in the processed images may disappear or become distorted or corrupted because of occlusion, non-rigid motion, and sensor noise. Those pixels should contribute less to motion estimation than other pixels. The system uses a robust technique, compensated iteratively re-weighted least squares (IRLS), to reduce the contribution of these noisy and corrupted pixels [3].

Expression tracking: After the head poses have been recovered, the system uses the poses and head model to warp the images into the fronto-parallel view. The positions of facial features are then estimated in the warped current image, given their locations in the warped reference image. Warping removes the effects of global poses in the 2D tracking features by associating the movements of features only with expression deformation. It also improves the accuracy and robustness of our feature tracking because the movement of features becomes smaller after warping.

To track the 2D position of a facial feature, we define a small square window centered at the feature's position. We make the simplifying assumption that the movement of pixels in the feature window can be approximated as the affine motion of a plane centered at the 3D coordinate of the feature. Affine deformations have 6 degrees of freedom and can be inferred using optical flow in the feature window. We employ a gradient-based motion estimation method to find the affine motion parameters, thereby minimizing the sum of the squared intensity differences in the feature window between the current frame and the reference frame [2, 38, 1].
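To make the feature-tracking step concrete, the sketch below tracks the feature windows between two warped (fronto-parallel) frames. It substitutes OpenCV's pyramidal Lucas-Kanade tracker for the paper's 6-DOF affine, gradient-based estimator, and clamps the horizontal motion of eyelid features as a simplified version of the geometric priors described in the next paragraph; the function and parameter names are ours, not the authors'.

```python
import cv2
import numpy as np

def track_features(prev_warped, cur_warped, prev_pts, eyelid_idx, max_dx=1.5):
    """Track 2D facial features between two fronto-parallel (warped) frames.

    prev_pts: (N, 2) float32 array of feature positions in prev_warped.
    eyelid_idx: indices of eyelid features, whose horizontal motion is clamped.
    Note: the paper estimates a full 6-DOF affine motion per feature window;
    here OpenCV's pyramidal Lucas-Kanade tracker is used as a simpler stand-in.
    """
    p0 = prev_pts.reshape(-1, 1, 2).astype(np.float32)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_warped, cur_warped, p0, None,
        winSize=(15, 15), maxLevel=2,
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 20, 0.01))
    new_pts = p1.reshape(-1, 2)

    # Geometric prior: eyelid features deform almost vertically, so clamp
    # their horizontal displacement to a small threshold.
    dx = new_pts[eyelid_idx, 0] - prev_pts[eyelid_idx, 0]
    new_pts[eyelid_idx, 0] = prev_pts[eyelid_idx, 0] + np.clip(dx, -max_dx, max_dx)

    # Features that failed to track keep their previous positions.
    lost = status.ravel() == 0
    new_pts[lost] = prev_pts[lost]
    return new_pts
```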

Even in warped images, tracking single features such as the eyelids based on intensity only is unreliable. To improve tracking robustness, we incorporate several prior geometric constraints into the feature tracking system. For example, we assign a small threshold to limit the horizontal displacement of features located on the eyelids because the eyelids deform almost vertically. The system uses a re-registration technique similar to the one used in pose tracking to handle the error accumulation of the expression tracking. In particular, the system stores the warped facial images and feature locations in example expressions at run time and re-registers the facial features of the new images with those in the closest example rather than with those in the reference image when the difference between the current expression and its closest template falls below a small threshold.

Our facial tracking system runs in real time at 20 fps. The system is user-independent and can track the facial movement of different human subjects. Figure 3 shows the results of our pose tracking and expression tracking for two performers.

2.2. Control Parameters

To build a common interface between the motion capture data and the vision-based interface, we derive a small set of parameters from the tracked facial features as a robust and discriminative control signal for facial animation. We want the control parameters extracted from feature tracking to correspond in a meaningful and intuitive way with the expression movement qualities they control. In total, the system automatically extracts 15 control parameters that describe the facial expression of the observed actor:

• Mouth (6): The system extracts six scalar quantities describing the movement of the mouth based on the positions of four tracking features around the mouth: left corner, right corner, upper lip and lower lip. More precisely, these six control parameters include the parameters measuring the distance between the lower and upper lips (1), the distance between the left and right corners of the mouth (1), the center of the mouth (2), the angle of the line segment connecting the lower lip and upper lip with respect to the vertical line (1), and the angle of the line segment connecting the left and right corners of the mouth with respect to the horizontal line (1).

• Nose (2): Based on three tracked features on the nose, we compute two control parameters describing the movement of the nose: the distance between the left and right corners of the nose and the distance between the top point of the nose and the line segment connecting the left and right corners (2).

• Eye (2): The control parameters describing the eye movements are the distance between the upper and lower eyelids of each eye (2).

• Eyebrow (5): The control parameters for the eyebrow actions consist of the angle of each eyebrow relative to a horizontal line (2), the distance between each eyebrow and eye (2), and the distance between the left and right eyebrow (1).

Figure 4: The scanned head surface model of the motion capture subject aligned with 76 motion capture markers.

The expression control signal describes the evolution of M = 15 control parameters derived from the tracking data. We use the notation Z̃_i ≡ {z̃_{a,i} | a = 1, ..., 15} to denote the control signal at time i. Here a is the index for the individual control parameter. These 15 parameters are used to control the expression deformation of the character model. In addition, the head pose derived from video gives us 6 additional animation control parameters, which are used to drive the avatar's position and orientation.
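As an illustration of how such control parameters can be computed, the sketch below derives the six mouth parameters from the four tracked mouth features; the sign and axis conventions are our own choices, since the paper specifies only which quantities are measured.

```python
import numpy as np

def mouth_control_parameters(left, right, upper, lower):
    """Compute the six mouth parameters from four tracked 2D points.

    Each argument is a length-2 array (x, y) in the warped, fronto-parallel
    image. The exact sign and axis conventions here are illustrative; the
    paper only lists which quantities are measured.
    """
    left, right = np.asarray(left, float), np.asarray(right, float)
    upper, lower = np.asarray(upper, float), np.asarray(lower, float)

    open_dist = np.linalg.norm(upper - lower)          # lip separation (1)
    width     = np.linalg.norm(right - left)           # mouth width (1)
    center    = 0.25 * (left + right + upper + lower)  # mouth center (2)

    # Angle of the lower-to-upper-lip segment w.r.t. the vertical axis (1).
    v = upper - lower
    tilt_v = np.arctan2(v[0], v[1])

    # Angle of the left-to-right-corner segment w.r.t. the horizontal axis (1).
    h = right - left
    tilt_h = np.arctan2(h[1], h[0])

    return np.array([open_dist, width, center[0], center[1], tilt_v, tilt_h])
```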

3. Motion Capture Data Preprocessing

We used a Minolta Vivid 700 laser scanner to build the surface model of a motion capture subject [22]. We set up a Vicon motion capture system to record facial movement by attaching 76 reflective markers onto the face of the motion capture subject. The scanned surface model with the attached motion capture markers is shown in figure 4. The markers were arranged so that their movement captures the subtle nuances of the facial expressions. During capture sessions, the subject must be allowed to move his head freely because head motion is involved in almost all natural expressions. As a result, the head motion and the facial expressions are coupled in the motion capture data, and we need to separate the head motion from the facial expression accurately in order to reuse and modify the motion capture data. Section 3.1 describes a factorization algorithm to separate the head motion from the facial expression in the motion capture data by utilizing the inherent rank constraints in the motion capture data.

3.1. Decoupling Pose and Expression

Assuming the facial expression deforms with L independent modes of variation, its shape can be represented as a linear combination of a deformation basis set S_1, S_2, ..., S_L. Each deformation basis S_i is a 3×P matrix describing the deformation mode of P points. The recorded facial motion capture data X_f combines the effects of 3D head pose and local expression deformation:

X_f = R_f \cdot \left( \sum_{i=1}^{L} c_{fi}\, S_i \right) + T_f \qquad (1)

where R_f is a 3×3 head rotation matrix and T_f is a 3×1 head translation in frame f, and c_{fi} is the weight corresponding to the i-th deformation basis S_i. Our goal is to separate the head pose R_f and T_f from the motion capture data X_f so that the motion capture data record only expression deformations.

We eliminate T_f from X_f by subtracting the mean of all 3D points. We then represent the resulting motion capture data in matrix notation:

\underbrace{\begin{bmatrix} X_1 \\ \vdots \\ X_F \end{bmatrix}}_{M}
=
\underbrace{\begin{bmatrix} c_{11}R_1 & \cdots & c_{1L}R_1 \\ \vdots & \ddots & \vdots \\ c_{F1}R_F & \cdots & c_{FL}R_F \end{bmatrix}}_{Q}
\cdot
\underbrace{\begin{bmatrix} S_1 \\ \vdots \\ S_L \end{bmatrix}}_{B}
\qquad (2)

where F is the number of frames of motion capture data, M is a 3F×P matrix storing the 3D coordinates of all the motion capture marker locations, Q is a 3F×3L scaled rotation matrix recording the head orientations and the weight for each deformation basis in every frame, and B is a 3L×P matrix containing all deformation bases. Equation (2) shows that without noise, the rank of the data matrix M is at most 3L. Therefore, we can automatically determine the number of deformation bases by computing the rank of matrix M.

When noise corrupts the motion capture data, the data matrix M will not be exactly of rank 3L. However, we can perform singular value decomposition (SVD) on the data matrix M such that M = U S V^T, and then get the best possible rank-3L approximation of the data matrix, factoring it into two matrices:

\tilde{Q} = U_{3F,3L}\, S_{3L,3L}^{1/2}, \qquad \tilde{B} = S_{3L,3L}^{1/2}\, V_{P,3L}^{T} \qquad (3)

The rank of the data matrix M, that is 3L, is automatically determined by keeping a specific amount of the original data energy. In our experiment, we found that 25 deformation bases were sufficient to capture 99.6% of the deformation variations in the motion capture database.
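A minimal sketch of this factorization step is given below, assuming the marker data have already been assembled into the 3F×P matrix M with the per-frame translation removed; the exact energy criterion (sum of squared singular values) is our assumption.

```python
import numpy as np

def factor_motion_matrix(M, energy=0.996):
    """Factor the (3F x P) marker-motion matrix M into Q_tilde and B_tilde.

    The rank 3L is chosen so that the kept singular values retain the requested
    fraction of the data energy; energy is measured here as the sum of squared
    singular values, which is one plausible reading of the paper's criterion.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    rank = int(np.searchsorted(cum, energy) + 1)
    rank = 3 * int(np.ceil(rank / 3))          # round up to a multiple of 3
    L = rank // 3

    sqrt_s = np.sqrt(s[:rank])
    Q_tilde = U[:, :rank] * sqrt_s             # (3F x 3L) scaled rotations
    B_tilde = sqrt_s[:, None] * Vt[:rank, :]   # (3L x P) deformation bases
    return Q_tilde, B_tilde, L
```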

The decomposition of Equation (3) is determined up to a linear transformation. Any non-singular 3L×3L matrix G and its inverse could be inserted between Q̃ and B̃ and their product would still be the same. Thus the actual scaled rotation matrix Q and basis matrix B are given by

Q = \tilde{Q} \cdot G, \qquad B = G^{-1} \cdot \tilde{B} \qquad (4)

with the appropriate 3L×3L invertible matrix G selected.

To recover the appropriate linear transformation matrix G, we introduce two different sets of linear constraints, rotation constraints and basis constraints, on the matrix GG^T (for details, please see our technical report [44]). Rotation constraints utilize the orthogonal property of the rotation matrices to impose linear constraints on the matrix GG^T. Given 3D motion capture data, the deformation bases representing the facial expression are not unique. Any non-singular transformation of the deformation bases is still valid to describe the facial expressions. To remove the ambiguity, our algorithm automatically finds the L appropriate frames in the motion capture data that cover all the deformation variations in the data. We choose these L frames as the deformation basis. The specific form of the deformation basis provides another set of linear constraints on the matrix GG^T. These two sets of constraints allow us to uniquely recover the matrix GG^T via standard least-squares techniques. We use the SVD technique again to factor the matrix GG^T into the matrix G and its transpose G^T, a step that leads to the actual rotation R_f, configuration coefficients c_{fi}, and deformation bases S_1, ..., S_L.

After we separate the poses from the motion capture data, we project each frame of the 3D motion capture data into the fronto-parallel view and extract the expression control parameters for each motion capture frame much as we extracted expression control parameters from the visual tracking data. Let X_i ≡ {x_{b,i} | b = 1, ..., 76} be the 3D positions of the motion capture markers in frame i and Z_i ≡ {z_{a,i} | a = 1, ..., 15} be the control parameters derived from frame i. Here x_{b,i} is the 3D coordinate of the b-th motion capture marker corresponding to frame i. In this way, each motion capture frame X_i is automatically associated with animation control parameters Z_i.

4. Expression Control and Animation

Given the control parameters derived from a vision-based interface, controlling the head motion of a virtual character is straightforward. The system directly maps the orientation of the performer to the virtual character. The position parameters derived from video need to be appropriately scaled before they are used to control the position of an avatar. This scale is computed as the ratio of the mouth width between the user and the avatar.

Controlling the expression deformation requires integrating the information in the expression control parameters and the motion capture data. In this section, we present a novel data-driven approach for motion synthesis that translates the noisy and lower-resolution expression control signals to high-resolution motion data using the information contained in the motion capture database. Previous work in this area can be classified into two categories: synthesis by examples and synthesis by a parametric or probabilistic model. Both approaches have advantages: the former allows details of the original motion to be retained for synthesis; the latter creates a simpler structure for representing the data. In order to keep the details and high frequencies of the motion capture data, we choose to synthesize the facial animation by examples. The system finds the K closest examples in the motion capture database using the low-resolution and noisy query control parameters from the vision-based interface and then linearly interpolates the corresponding high-quality motion examples in the database with a local regression technique. Because the mapping from the control parameter space to the motion data space is not one to one, the query control signal is based on multiple frames rather than a single frame, thereby eliminating the mapping ambiguity by integrating the query evidence forward and backward over a window of a short fixed length.

The motion synthesis process begins with a normalization step that corrects for the difference between the animation control parameters in the tracking data and the motion capture data. We then describe a data-driven approach to filter the noisy expression control signals derived from the vision-based interface. Next, we introduce a data-driven expression synthesis approach that transforms the filtered control signals into high quality motion data. Finally, we describe a new data structure and an efficient K nearest points search technique that we use to speed up the synthesis process.

4.1. Control Parameter Normalization

The expression control parameters from the tracking data are inconsistent with those in the motion capture data because the user and the motion capture subject have different facial geometry and proportions. We use a normalization step that automatically scales the measured control parameters according to the control parameters of the neutral expression to approximately remove these differences. By scaling the control parameters, we ensure that the control parameters extracted from the user have approximately the same magnitude as those extracted from the motion capture data when both are in the same expression.
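A minimal sketch of one plausible normalization follows, assuming a simple per-parameter ratio against the neutral-expression values; the paper does not give the exact formula, so the handling of near-zero neutral values is our choice.

```python
import numpy as np

def normalize_controls(z_user, z_user_neutral, z_mocap_neutral, eps=1e-6):
    """Scale user control parameters into the motion capture subject's range.

    Each parameter is scaled by the ratio of the mocap subject's neutral-pose
    value to the user's neutral-pose value, so that the same expression yields
    roughly the same magnitudes in both spaces. Parameters whose neutral value
    is near zero (e.g. angles) are left unscaled.
    """
    z_user_neutral = np.asarray(z_user_neutral, float)
    z_mocap_neutral = np.asarray(z_mocap_neutral, float)

    scale = np.ones_like(z_user_neutral)
    nonzero = np.abs(z_user_neutral) > eps
    scale[nonzero] = z_mocap_neutral[nonzero] / z_user_neutral[nonzero]
    return np.asarray(z_user, float) * scale
```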


4.2. Data-driven Filtering

Control signals in a vision-based interface are often noisy. We divide them into segments of a fixed, short temporal length W at run time and then use the prior knowledge embedded in the motion capture database to sequentially filter the control signals segment by segment. The prior model of the expression control signals is a local linear model learned at run time. When a new segment arrives, the motion capture database is searched for the examples that are most relevant to the segment. These examples are then used to build a local linear dynamical model that captures the dynamical behavior of the control signals over a fixed length sequence.

We collect the set of neighboring frames with the same temporal window W for each frame in the motion capture database and treat them as a data point. Conceptually, all the motion segments of the facial control signals form a nonlinear manifold embedded in a high-dimensional configuration space. Each motion segment in the motion capture database is a point sample on the manifold. The motion segments of control signals from a vision-based interface are noisy samples of this manifold. The key idea of our filtering technique is that we can use a low-dimensional linear subspace to approximate the local region of the high-dimensional nonlinear manifold. For each noisy sample, we apply Principal Component Analysis (PCA) to learn a linear subspace using the data points that lie within the local region and then reconstruct the sample using the low-dimensional linear subspace.

If we choose too many frames for the motion segments, the samples from the motion capture data might not be dense enough to learn an accurate local linear model in the high-dimensional space. If we choose too few frames, the model will not capture enough motion regularities. The length of the motion segments also determines the response time of the system, or the action delay between the user and the avatar. From our experiments, we found 20 to be a reasonable number of frames for each segment. The delay of our system is then 0.33 s because the frame rate of our video camera is 60 fps.

Let φ̃_t ≡ [Z̃_1, ..., Z̃_W] be a fragment of input control parameters. The filtering step is:

• Find the K closest slices in the motion capture database.

• Compute the principal components of the K closest slices. We keep the M largest eigenvectors U_1, ..., U_M as the filter basis, and M is automatically determined by retaining 99% of the variation of the original data.

• Project φ̃_t into the local linear space spanned by U_1, ..., U_M and reconstruct the control signal φ̄_t = [Z̄_1, ..., Z̄_W] using the projection coefficients.

The principal components, which can be interpreted as the major sources of dynamical variation in the local region of the input control signals, naturally capture the dynamical behavior of the animation control parameters in the local region of the input control signal φ̃_t. We can use this low-dimensional linear space to reduce the noise and error in the sensed control signal. The performance of our data-driven filtering algorithm depends on the number of closest examples. As the number of closest examples K increases, the dimension of the subspace M becomes higher so that it is capable of representing a larger range of local dynamical behavior. The higher dimensional subspace will provide fewer constraints on the sensed motion, and the subspace constraints might be insufficient to remove the noise in the data. If fewer examples are used, they might not be particularly similar to the online noisy motion and the subspace might not fit the motion very well. In that case, the filtered motion might be over-smoothed and distorted. The specific choice of K depends on the properties of a given motion capture database and the control data extracted from video. In our experiments, we found that a K between 50 and 150 gave us a good filtering result.
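The sketch below summarizes the filtering step under the assumptions above: an exhaustive nearest-segment search stands in for the graph-based search of Section 4.4, and each segment is represented as a flattened W×15 vector.

```python
import numpy as np

def filter_segment(phi_noisy, mocap_segments, K=100, var_keep=0.99):
    """Filter a noisy control-signal segment with a local linear (PCA) model.

    phi_noisy:      flattened segment of W frames x 15 parameters.
    mocap_segments: (N, W*15) array of segments sampled from the database.
    A minimal sketch of the data-driven filter: find the K nearest database
    segments, build a PCA subspace from them, and reconstruct the noisy
    segment from its projection onto that subspace.
    """
    # K nearest database segments (exhaustive search; Section 4.4 replaces
    # this with a neighborhood-graph search).
    d = np.linalg.norm(mocap_segments - phi_noisy, axis=1)
    nearest = mocap_segments[np.argsort(d)[:K]]

    mean = nearest.mean(axis=0)
    centered = nearest - mean
    _, s, Vt = np.linalg.svd(centered, full_matrices=False)

    # Keep enough principal directions to retain var_keep of the variance.
    var = s**2 / np.sum(s**2)
    M = int(np.searchsorted(np.cumsum(var), var_keep) + 1)
    basis = Vt[:M]                          # (M, W*15) local linear subspace

    coeffs = basis @ (phi_noisy - mean)     # project the noisy segment
    return mean + basis.T @ coeffs          # reconstruction = filtered segment
```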

4.3. Data-driven Expression Synthesis

Suppose we have N frames of motion capture data X_1, ..., X_N and their associated control parameters Z_1, ..., Z_N. Our expression control problem can then be stated as follows: given a new segment of control signals φ̄_t = [Z̄_{t_1}, ..., Z̄_{t_W}], synthesize the corresponding motion capture example M̄_t = [X̄_{t_1}, ..., X̄_{t_W}].

The simplest solution to this problem might be K-nearest-neighbor interpolation. For each frame Z̄_i such that i = t_1, ..., t_W, the system can choose the K closest example points of the control parameters, denoted as Z_{i_1}, ..., Z_{i_K}, in the motion capture data and assign each of them a weight ω_i based on their distance to the query Z̄_i. These weights could then be applied to synthesize the 3D motion data by interpolating the corresponding motion capture data X_{i_1}, ..., X_{i_K}. The downside of this approach is that the generated motion might not be smooth because the mapping from the control parameter space to the motion configuration space is not one to one.

Instead, we develop a segment-based motion synthesis technique to remove the mapping ambiguity from the control parameter space to the motion configuration space. Given the query segment φ̄_t, we compute the interpolation weights based on its distance to the K closest segments in the motion capture database. The distance is measured as the Euclidean distance in the local linear subspace. We can then synthesize the segment of motion data via a linear combination of the motion data in the K closest segments. Synthesizing the motion in this way allows us to handle the mapping ambiguity by integrating expression control parameters forwards and backwards over the whole segment.

Both the dynamical filtering and the motion mapping procedures work on a short temporal window of 20 frames. The system automatically breaks the video stream into segments at run time, and then sequentially transforms the visual control signals into high-quality motion data segment by segment. Because a small discontinuity might occur at the transition between segments, we introduce some overlap between neighboring segments so that we can move smoothly from one segment to another by blending the overlap. In our experiment, the blend interval is set to 5 frames. Two synthesized fragments are blended by fading out one while fading in the other using a sigmoid-like function:

\alpha = 0.5\cos(\beta\pi) + 0.5

Over the transition duration, β moves linearly from 0 to 1. The transitional motion is determined by linearly interpolating the two overlapping segments with weights α and 1−α.
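A small sketch of this cross-fade, using the sigmoid-like weight defined above; the segment layout (the last five frames of one segment overlapping the first five of the next) is our assumption.

```python
import numpy as np

def blend_segments(seg_out, seg_in, overlap=5):
    """Cross-fade two synthesized motion segments over `overlap` frames.

    seg_out, seg_in: (T, D) arrays; the last `overlap` frames of seg_out
    overlap the first `overlap` frames of seg_in. Uses the sigmoid-like
    weight alpha = 0.5*cos(beta*pi) + 0.5 with beta moving from 0 to 1.
    """
    beta = np.linspace(0.0, 1.0, overlap)
    alpha = 0.5 * np.cos(beta * np.pi) + 0.5          # fades from 1 down to 0
    mixed = (alpha[:, None] * seg_out[-overlap:] +
             (1.0 - alpha)[:, None] * seg_in[:overlap])
    return np.vstack([seg_out[:-overlap], mixed, seg_in[overlap:]])
```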

4.4. Data Structure

Because a large number of the K-nearest-neighbor queries may need to be conducted over the same data set S, the computational cost can be reduced if we preprocess S to create a data structure that allows fast nearest-point search. Many such data structures have been proposed, and we refer the reader to [28] for a more complete reference. However, most of these algorithms assume generic inputs and do not attempt to take advantage of special structures. We present a data structure and efficient K nearest neighbor search technique that takes advantage of the temporal coherence of our query data, that is, neighboring query examples are within a small distance in the high-dimensional space. The K closest points search can be described as follows:

• Construct neighborhood graph G. We collect the set of neighboring frames within the temporal window W for each frame in the motion capture database and treat them as a data point. Then we compute the standard Euclidean distance d_x(i, j) for every pair i and j of data points in the database. A graph G is defined over all data points by connecting points i and j if they are closer than a specific threshold ε, or if i is one of the K nearest neighbors of j. The edge lengths of G are set to d_x(i, j). The generated graph is used as a data structure for efficient nearest-point queries.

• K-nearest-neighbor search. Rather than search the full database, we consider only the data points that are within a particular distance of the last query because of the temporal coherence in the expression control signals. When a new motion query arrives, we first find the closest example E among the K closest points of the last query in the buffer. To find the K nearest points of the current query motion, the graph is then traversed from E in a best-first order by comparing the query with the children, and then following the ones that have a smaller distance to the query. This process terminates when the number of examples exceeds a pre-selected size or the largest distance is higher than a given threshold.

A single search in a database of size |S| can be approximately achieved in time O(K), which is independent of the size of the database |S| and is much faster than an exhaustive search with linear time complexity O(|S|).
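The best-first traversal can be sketched as follows; the adjacency-list representation and termination details are our choices, not the authors' exact implementation.

```python
import heapq
import numpy as np

def graph_knn(query, data, graph, start, K=100, max_dist=None):
    """Best-first K-nearest-neighbor search on a precomputed neighborhood graph.

    data:  (N, D) database points.      graph: adjacency list {i: [j, ...]}.
    start: index of the closest point of the *previous* query, used as the
    entry point because consecutive queries are temporally coherent.
    """
    dist = lambda i: float(np.linalg.norm(data[i] - query))
    visited = {start}
    frontier = [(dist(start), start)]
    found = []
    while frontier and len(found) < K:
        d, i = heapq.heappop(frontier)
        if max_dist is not None and d > max_dist:
            break                      # all remaining candidates are farther away
        found.append(i)
        for j in graph[i]:             # expand neighbors in best-first order
            if j not in visited:
                visited.add(j)
                heapq.heappush(frontier, (dist(j), j))
    return found                       # indices of (approximately) nearest points
```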

Figure 5: Dense surface correspondence. (Left) The scanned source surface model. (Middle) The animated surface model. (Right) The model morphed from the source surface to the target surface using the surface correspondence.

5. Expression Retargeting

The synthesized motion described in the previous section specifies the movement of a finite set of points on the source surface; however, animating a 3D character model requires moving all the vertices on the animated surface. Noh and Neumann introduced an expression cloning technique to map the expression of a source model to the surface of a target model [30]. Their method modifies the magnitudes and directions of the source motion vectors using the local shapes of the two models. It produces good results, but the run-time computation cost depends on the complexity of the animated model because the motion vector of each vertex on the target surface is adapted individually at run time.

In this section, we present an efficient expression retargeting method whose run-time computation is constant, independent of the complexity of the character model. The basic idea of our expression retargeting method is to precompute all deformation bases of the target model so that the run-time operation involves only blending together these deformation bases appropriately.

As input to this process, we take the scanned source surface model, the input target surface model, and the deformation bases of the motion capture database. We require both models to be in the neutral expression. The system first builds a surface correspondence between the two models and then adapts the deformation bases of the source model to the target model based on the deformation relationship derived from the local surface correspondence. At run time, the system operates on the synthesized motion to generate the weight for each deformation basis. The output animation is created by blending together the target deformation bases using the weights. In particular, the expression retargeting process consists of four stages: motion vector interpolation, dense surface correspondences, motion vector transfer, and target motion synthesis.

Motion Vector Interpolation: Given the deformation vector of the key points on the source surface for each deformation mode, the goal of this step is to deform the remaining vertices on the source surface by linearly interpolating the movement of the key points using barycentric coordinates. First, the system generates a mesh model based on the 3D positions of the motion capture markers on the source model. For each vertex, the system determines the face on which the vertex is located and the corresponding barycentric coordinates for interpolation. Then the deformation vector of the remaining vertices is interpolated accordingly.

Figure 6: The top seven deformation bases for a target surface model. (a) The gray mask is the target surface model in the neutral expression. (b-h) The needles show the scale and direction of the 3D deformation vector on each vertex.
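A minimal sketch of the interpolation step in the Motion Vector Interpolation stage above, assuming the vertex-to-triangle assignment and barycentric coordinates have been precomputed:

```python
import numpy as np

def interpolate_vertex_motion(tri_indices, bary_coords, marker_deform):
    """Interpolate per-marker deformation onto all source-mesh vertices.

    tri_indices:   (V, 3) int array; for each vertex, the three markers of the
                   marker-mesh triangle it was assigned to in preprocessing.
    bary_coords:   (V, 3) barycentric coordinates of each vertex in that triangle.
    marker_deform: (P, 3) deformation vectors of the motion capture markers.
    Returns a (V, 3) array of interpolated deformation vectors.
    """
    corner_deform = marker_deform[tri_indices]             # (V, 3 corners, 3)
    return np.einsum('vc,vcd->vd', bary_coords, corner_deform)
```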

Dense Surface Correspondences: Beginning with a small set of manually established correspondences between the two surfaces, a dense surface correspondence is computed by volume morphing with a Radial Basis Function followed by a cylindrical projection [29] (figure 5). This gives us a homeomorphic mapping (one to one and onto) between the two surfaces. Then we use Radial Basis Functions once again to learn continuous homeomorphic mapping functions f(x_s) = (f_1(x_s), f_2(x_s), f_3(x_s)) from the neutral expression of the source surface x_s = (x_s, y_s, z_s) to the neutral expression of the target surface x_t = (x_t, y_t, z_t), such that x_t = f(x_s).

Motion Vector Transfer: The deformation on the source surface cannot simply be transferred to a target model without adjusting the direction and scale of each motion vector, because facial proportions and geometry vary between models. For any point x_s on the source surface, we can compute the corresponding point x_t on the target surface by using the mapping function f(x_s). Given a small deformation δx_s = (δx_s, δy_s, δz_s) for a source point x_s, the deformation δx_t of its corresponding target point x_t is computed by the Jacobian matrices J_f(x_s) = (J_{f_1}(x_s), J_{f_2}(x_s), J_{f_3}(x_s))^T:

\delta \mathbf{x}_t = J_{\mathbf{f}}(\mathbf{x}_s) \cdot \delta \mathbf{x}_s \qquad (5)

We use the learnt RBF function f(x_s) to compute the Jacobian matrix at x_s numerically:

J_{f_i}(\mathbf{x}_s) =
\begin{bmatrix}
\dfrac{f_i(x_s + \delta x_s, y_s, z_s) - f_i(x_s, y_s, z_s)}{\delta x_s} \\[2mm]
\dfrac{f_i(x_s, y_s + \delta y_s, z_s) - f_i(x_s, y_s, z_s)}{\delta y_s} \\[2mm]
\dfrac{f_i(x_s, y_s, z_s + \delta z_s) - f_i(x_s, y_s, z_s)}{\delta z_s}
\end{bmatrix},
\quad i = 1, 2, 3 \qquad (6)

Geometrically, the Jacobian matrix adjusts the direction and magnitude of the source motion vector according to the local surface correspondence between the two models. Because the deformation of the source motion is represented as a linear combination of a set of small deformation bases, the deformation δx_t can be computed as:

\delta \mathbf{x}_t = J_{\mathbf{f}}(\mathbf{x}_s) \cdot \sum_{i=1}^{L} \lambda_i\, \delta \mathbf{x}_{s,i} = \sum_{i=1}^{L} \lambda_i \left( J_{\mathbf{f}}(\mathbf{x}_s) \cdot \delta \mathbf{x}_{s,i} \right) \qquad (7)

where δx_{s,i} represents the small deformation corresponding to the i-th deformation basis, and λ_i is the combination weight. Equation (7) shows that we can precompute the deformation bases of the target surface δx_{t,i} such that δx_{t,i} = J_f(x_s) · δx_{s,i}. Figure 6 shows the top seven deformation bases on a target surface model.
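The offline precomputation implied by Equation (7) can be sketched as follows, using a finite-difference Jacobian of the learned mapping f as in Equation (6); the batch interface assumed for f is ours.

```python
import numpy as np

def precompute_target_bases(source_pts, source_bases, f, eps=1e-3):
    """Precompute target-surface deformation bases (offline step of Equation 7).

    source_pts:   (V, 3) vertices of the source surface (neutral expression).
    source_bases: (L, V, 3) per-vertex deformation of each source basis.
    f:            learned RBF mapping from source to target surface,
                  assumed to accept a batch of points and return (V, 3).
    For every vertex we form the Jacobian of f by finite differences
    (Equation 6) and map each source basis vector through it.
    """
    V = source_pts.shape[0]
    J = np.empty((V, 3, 3))                  # numerical Jacobian at each vertex
    f0 = f(source_pts)
    for k in range(3):                       # perturb x, y, z in turn
        d = np.zeros(3)
        d[k] = eps
        J[:, :, k] = (f(source_pts + d) - f0) / eps

    # delta_x_{t,i} = J(x_s) . delta_x_{s,i} for every basis i and vertex v.
    target_bases = np.einsum('vjk,lvk->lvj', J, source_bases)
    return target_bases                      # (L, V, 3)
```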


Target Motion Synthesis: After the motion data are synthesized from the motion capture database via the vision-based interface, the system projects them into the deformation basis space of the source model S_1, ..., S_L to compute the combination weights λ_1, ..., λ_L. The deformation of the target surface δx_t is generated by blending together the deformation bases of the target surface δx_{t,i} using the combination weights λ_i according to Equation (7).

The target motion synthesis is done online and the other three steps are completed off-line. One strength of our expression retargeting method is speed. The run-time computational cost of our expression retargeting depends only on the number of deformation bases of the motion capture database, L, rather than the number of vertices on the animated model. In our implementation, the expression retargeting cost is far less than the rendering cost.
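The corresponding run-time step might look like the sketch below; the least-squares projection used to obtain the weights λ_i is one reasonable reading of the projection step, since the paper does not spell out the solver.

```python
import numpy as np

def retarget_frame(synth_markers, neutral_markers, source_bases, target_bases):
    """Run-time expression retargeting for one synthesized frame.

    synth_markers / neutral_markers: (P, 3) synthesized and neutral marker
    positions; source_bases: (L, P, 3) marker-space deformation bases;
    target_bases: (L, V, 3) precomputed target-surface bases (see above).
    """
    L = source_bases.shape[0]
    deform = (synth_markers - neutral_markers).ravel()          # (3P,)
    A = source_bases.reshape(L, -1).T                           # (3P, L)
    lam, *_ = np.linalg.lstsq(A, deform, rcond=None)            # weights lambda_i

    # Blend the precomputed target bases; cost depends only on L, not on V.
    target_deform = np.tensordot(lam, target_bases, axes=1)     # (V, 3)
    return lam, target_deform
```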

6. Results

All of the motion data in our experiments were from one subject and collected at a rate of 120 frames per second. In total, the motion capture database contains about 70000 frames (approximately 10 minutes). We captured the subject performing various sets of facial actions, including the six basic facial expressions (anger, fear, surprise, sadness, joy, and disgust) and other common facial actions such as eating, yawning, and snoring. During the capture, the subject was instructed to repeat the same facial action at least 6 times in order to capture the different styles of the same facial action. We also recorded a small amount of motion data related to speaking (about 6000 frames), but the amount of speaking data was not sufficient to cover all the variations of the facial movements related to speaking.

We tested our facial animation system on different users. We use a single video camera to record the facial expression of a live performer. The frame rate of the video camera is about 60 frames per second. The users were not told which facial actions were contained in the motion capture database. We also did not give specific instructions to the users about the kind of expressions they should perform and how they should perform them. The user was instructed to start from the neutral expression under the frontal-parallel view.

At run time, the system is completely automatic except that the positions of the tracking features in the first frame must be initialized. The facial tracking system then runs in real time (about 20 frames per second). Dynamical filtering, expression synthesis, and expression retargeting do not require user intervention and the resulting animation is synthesized in real time. Currently, the system has a 0.33 s delay to remove the mapping ambiguity between the visual tracking data and the motion capture data. Figure 7 and Figure 8 show several sample frames from the single-camera video of two subjects and the animated 3D facial expressions of two different virtual characters. The virtual characters exhibit a wide range of facial expressions and capture the detailed muscle movement in untracked areas like the lower cheek. Although the tracked features are noisy and suffer from the typical jitter in vision-based tracking systems, the controlled expression animations are pleasingly smooth and high quality.

7. Discussion

We have demonstrated how a preprocessed motion capture database, in conjunction with a real-time facial tracking system, can be used to create a performance-based facial animation in which a live performer effectively controls the expressions and facial actions of a 3D virtual character. In particular, we have developed an end-to-end facial animation system for tracking real-time facial movements in video, preprocessing the motion capture database, controlling and animating the facial actions using the preprocessed motion capture database and the extracted animation control parameters, and retargeting the synthesized expressions to a new target model.

We would like to do a user study on the quality of the synthesized animation. We can generate animations both from our system and from "ground truth" motion capture data and then present all segments in random order to users. The subjects will be asked to select the "more natural" animation so that we can evaluate the quality of the synthesized animation based on feedback from users. We also would like to test the performance of our system on a larger set of users and character models.

The output animation might not exactly match the facial movement in the video because the synthesized motion is interpolated and adapted from the motion capture data. If the database does not include the exact facial movements seen in the video, the animation is synthesized by appropriately interpolating the closest motion examples in the database. In addition, the same facial action performed by a user and an avatar will differ because they generally have different facial geometry and proportions.
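As an illustration of this interpolation idea (not the paper's exact synthesis algorithm), the following sketch blends the k database examples whose control parameters are closest to the query; the distance metric, the value of k, and the inverse-distance weights are assumptions made for the example.

```python
# A minimal sketch of blending the closest database examples.
import numpy as np

def synthesize(control, db_controls, db_expressions, k=5, eps=1e-6):
    """Blend the k database expressions whose control parameters are closest
    to the query control vector.

    control        : (d,)   query control parameters from the video tracker
    db_controls    : (N, d) control parameters of the database frames
    db_expressions : (N, m) corresponding high-resolution 3D expressions
    """
    dists = np.linalg.norm(db_controls - control, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)       # closer frames count more
    weights /= weights.sum()
    return weights @ db_expressions[nearest]     # (m,) blended expression
```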

One limitation of the current system is that it sometimes loses details of the lip movement. This problem arises because the motion capture database does not include sufficient samples related to speaking. Alternatively, the amount of tracking data in the vision-based interface might not be enough to capture the subtle movement of the lips. With a larger database and more tracking features around the mouth, this limitation might be eliminated.

Currently, our system does not consider speech as an input; however, previous work on speech animation shows that there is a good deal of mutual information between vocal and facial gestures 6, 5, 15. The combination of a speech interface and a vision-based interface would improve the quality of the final animation and increase the controllability of the facial movements.


Figure 7: Results of two users controlling and animating 3D facial expressions of two different target surface models.


Figure 8: Results of two users controlling and animating 3D facial expressions of two different texture-mapped avatar models.


We would like to record the facial movement and the audio from the motion capture subject simultaneously, and to increase the size of the database by capturing more speaking data. We would also like to explore how to control the facial movement of an avatar using a multimodal interface, a combination of the vision-based interface and a speech interface.

We are also interested in exploring other intuitive interfaces for data-driven facial animation. For example, a keyframing interpolation system might use the information contained in a motion capture database to add the details and nuances of live motion to the degrees of freedom that are not specified by the animator.

Acknowledgements

The authors would like to thank Daniel Huber for his help in scanning and processing the source surface model of the motion capture subject. We also would like to acknowledge the support of the NSF through EIA 0196217 and IIS 0205224.

References

1. S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework part 1: The quantity approximated, the warp update rule, and the gradient descent approximation. In International Journal of Computer Vision. 2003.

2. J. R. Bergen, P. Anandan, and K. J. Hanna. Hierarchical model-based motion estimation. In Proceedings of European Conference on Computer Vision. 1992. 237-252.

3. M. Black. Robust incremental optical flow. PhD thesis, Yale University, 1992.

4. M. J. Black and Y. Yacoob. Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In International Conference on Computer Vision. 1995. 374-381.

5. M. E. Brand. Voice puppetry. In Proceedings of SIGGRAPH 99. Computer Graphics Proceedings, Annual Conference Series, 1999. 21-28.

6. C. Bregler, M. Covell, and M. Slaney. Video rewrite: Driving visual speech with audio. In Proceedings of SIGGRAPH 97. Computer Graphics Proceedings, Annual Conference Series, 1997. 353-360.

7. I. Buck, A. Finkelstein, C. Jacobs, A. Klein, D. H. Salesin, J. Seims, R. Szeliski, and K. Toyama. Performance-driven hand-drawn animation. In Symposium on Non-Photorealistic Animation and Rendering 2000. Annecy, France, 2000. 101-108.

8. M. La Cascia and S. Sclaroff. Fast, reliable head tracking under varying illumination. In IEEE Conference on Computer Vision and Pattern Recognition. 1999. 604-610.

9. J. E. Chadwick, D. R. Haumann, and R. E. Parent. Layered construction for deformable animated characters. In Proceedings of SIGGRAPH 89. Computer Graphics Proceedings, Annual Conference Series, 1989. 243-252.

10. E. Chuang and C. Bregler. Performance driven facial animation using blendshape interpolation. Computer Science Technical Report, Stanford University, 2002. CS-TR-2002-02.

11. M. N. Cohen and D. W. Massaro. Modeling coarticulation in synthetic visual speech. In Models and Techniques in Computer Animation, edited by Thalmann, N. M. and Thalmann, D. Springer-Verlag, Tokyo, Japan, 1992.

12. D. DeCarlo and D. Metaxas. Optical flow constraints on deformable models with applications to face tracking. In International Journal of Computer Vision. 2000. 38(2):99-127.

13. B. DeGraf. Notes on facial animation. In SIGGRAPH 89 Tutorial Notes: State of the Art in Facial Animation. 1989. 10-11.

14. I. Essa, S. Basu, T. Darrell, and A. Pentland. Modeling, tracking and interactive animation of faces and heads using input from video. In Proceedings of Computer Animation Conference. 1996. 68-79.

15. T. Ezzat, G. Geiger, and T. Poggio. Trainable videorealistic speech animation. In ACM Transactions on Graphics (SIGGRAPH 2002). San Antonio, Texas, 2002. 21(3):388-398.

16. FaceStation. URL: http://www.eyematic.com.

17. Faceworks. URL: http://www.puppetworks.com.

18. M. Gleicher and N. Ferrier. Evaluating video-based motion capture. In Proceedings of Computer Animation 2002. 2002.

19. S. B. Gokturk, J. Y. Bouguet, and R. Grzeszczuk. A data driven model for monocular face tracking. In IEEE International Conference on Computer Vision. 2001. 701-708.

20. B. Guenter, C. Grimm, D. Wood, H. Malvar, and F. Pighin. Making faces. In Proceedings of SIGGRAPH 98. Computer Graphics Proceedings, Annual Conference Series, 1998. 55-66.

21. N. Howe, M. Leventon, and W. Freeman. Bayesian reconstruction of 3D human motion from single-camera video. In Neural Information Processing Systems 1999 (NIPS 12). 1999.

22. D. F. Huber and M. Hebert. Fully automatic registration of multiple 3D data sets. In IEEE Computer Society Workshop on Computer Vision Beyond the Visible Spectrum (CVBVS 2001). December, 2001. 19:989-1003.


23. P. Kalra, A. Mangili, N. Magnenat Thalmann, and D. Thalmann. Simulation of facial muscle actions based on rational free form deformations. In Computer Graphics Forum (EUROGRAPHICS '92 Proceedings). Cambridge, 1992. 59-69.

24. R. Kjeldsen and J. Hartman. Design issues for vision-based computer interaction systems. In Proceedings of the 2001 Workshop on Perceptive User Interfaces. 2001.

25. A. Lanitis, C. J. Taylor, and T. F. Cootes. Automatic interpretation and coding of face images using flexible models. In IEEE Pattern Analysis and Machine Intelligence. 1997. 19(7):743-756.

26. Y. Lee, D. Terzopoulos, and K. Waters. Realistic modeling for facial animation. In Proceedings of SIGGRAPH 95. Computer Graphics Proceedings, Annual Conference Series, 1995. 55-62.

27. J. P. Lewis, M. Cordner, and N. Fong. Pose space deformation: A unified approach to shape interpolation and skeleton-driven deformation. In Proceedings of ACM SIGGRAPH 2000. Computer Graphics Proceedings, Annual Conference Series, New Orleans, LA, 2000. 165-172.

28. S. Nene and S. Nayar. A simple algorithm for nearest neighbor search in high dimensions. In IEEE Transactions on Pattern Analysis and Machine Intelligence. 1997. 19:989-1003.

29. G. M. Nielson. Scattered data modeling. In IEEE Computer Graphics and Applications. 1993. 13(1):60-70.

30. J. Noh and U. Neumann. Expression cloning. In Proceedings of ACM SIGGRAPH 2001. Computer Graphics Proceedings, Annual Conference Series, Los Angeles, CA, 2001. 277-288.

31. F. I. Parke. Computer generated animation of faces. In Proceedings of the ACM National Conference. 1972. 1:451-457.

32. F. I. Parke. A parametric model for human faces. Ph.D. thesis, University of Utah, Salt Lake City, Utah, 1974. UTEC-CSc-75-047.

33. F. I. Parke. Parameterized models for facial animation revisited. In ACM SIGGRAPH Facial Animation Tutorial Notes. 1989. 53-56.

34. F. I. Parke and K. Waters. Computer Facial Animation. A. K. Peters, Wellesley, MA, 1996.

35. V. Pavlovic, J. M. Rehg, T.-J. Cham, and K. Murphy. A dynamic Bayesian network approach to figure tracking using learned dynamic models. In IEEE International Conference on Computer Vision. 1999. 94-101.

36. F. Pighin, J. Hecker, D. Lischinski, R. Szeliski, and D. Salesin. Synthesizing realistic facial expressions from photographs. In Proceedings of SIGGRAPH 98. Computer Graphics Proceedings, Annual Conference Series, San Antonio, TX, 1998. 75-84.

37. F. Pighin, R. Szeliski, and D. Salesin. Resynthesizing facial animation through 3D model-based tracking. In International Conference on Computer Vision. 1999. 143-150.

38. J. Shi and C. Tomasi. Good features to track. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 1994. 593-600.

39. H. Sidenbladh, M. J. Black, and L. Sigal. Implicit probabilistic models of human motion for synthesis and tracking. In European Conference on Computer Vision. Springer-Verlag, 2002. 784-800.

40. D. Terzopoulos and K. Waters. Physically-based facial modelling, analysis, and animation. In The Journal of Visualization and Computer Animation. 1990. 1(2):73-80.

41. D. Terzopoulos and K. Waters. Analysis and synthesis of facial image sequences using physical and anatomical models. In IEEE Transactions on Pattern Analysis and Machine Intelligence. 1993. 15:569-579.

42. N. M. Thalmann, N. E. Primeau, and D. Thalmann. Abstract muscle action procedures for human face animation. In The Visual Computer. 1988. 3(5):290-297.

43. L. Williams. Performance-driven facial animation. In Proceedings of SIGGRAPH 90. Computer Graphics Proceedings, Annual Conference Series, 1990. 235-242.

44. J. Xiao and J. Chai. A linear solution to nonrigid structure from motion. Robotics Institute Technical Report, Carnegie Mellon University. 2003. CMU-RI-TR-03-16.

45. J. Xiao, T. Kanade, and J. Cohn. Robust full motion recovery of head by dynamic templates and re-registration techniques. In Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition. 2002. 156-162.
