Face Animation Overview with Shameless Bias Toward MPEG-4 Face Animation Tools
Dr. Eric Petajan


  • Face Animation Overview with Shameless Bias Toward MPEG-4 Face Animation Tools

    Dr. Eric Petajan
    Chief Scientist and Founder
    face2face animation, [email protected]

  • Computer-generated Face Animation Methods

    Morph targets / key frames (traditional)
    Speech articulation model (TTS)
    Facial Action Coding System (FACS)
    Physics-based (skin and muscle models)
    Marker-based (dots glued to face)
    Video-based (surface features)

  • Morph targets / key frames

    Advantages:
    Complete manual control of each frame
    Good for exaggerated expressions

    Disadvantages:
    Hard to achieve good lip sync without manual tweaking
    Morph targets must be downloaded to the terminal for streaming animation (delay)

  • Speech articulation model

    Advantages:
    High-level control of the face
    Enables TTS

    Disadvantages:
    Robotic character
    Hard to sync with real voice

  • Facial Action Coding System

    Advantages:
    Very high-level control of the face
    Maps to morph targets
    Explicit specification of emotional states

    Disadvantages:
    Not good for speech
    Not quantified

  • Physics-based

    Advantages:
    Good for realistic skin, muscle and fat
    Collision detection

    Disadvantages:
    High complexity
    Must be driven by high-level articulation parameters (TTS)
    Hard to drive with motion capture data

  • Marker-based

    Advantages:
    Can provide accurate motion data from most of the face
    Face models can be animated directly from surface feature point motion

    Disadvantages:
    Dots glued to the face
    Dots must be manually registered
    Not good for accurate inner lip contour or eyelid tracking

  • Video-based

    Advantages:
    Simple to capture video of the face
    Face models can be animated directly from surface feature motion

    Disadvantages:
    Must have a good view of the face

  • What is MPEG-4 Multimedia?

    Natural audio and video objects
    2D and 3D graphics (based on VRML)
    Animation (virtual humans)
    Synthetic speech and audio

  • Samples versus Objects

    Traditional video coding is sample based (blocks of pixels are compressed)
    MPEG-4 provides visual object representation for better compression and new functionalities
    Objects are rendered in the terminal after decoding object descriptors

  • Object-based Functionalities

    User can choose which content layers to display
    Individual objects (text, models) can be searched or stored for later use
    Content is independent of display resolution
    Content can easily be repurposed by the provider for different networks and users

  • MPEG-4 Object Composition

    Objects are organized in a scene graph
    Scene graphs are specified using a binary format called BIFS (based on VRML)
    Both 2D and 3D objects, properties and transforms are specified in BIFS
    BIFS allows objects to be transmitted once and instanced repeatedly in the scene after transformations

  • MPEG-4 Operation Sequence

  • Faces are Special

    Humans are hard-wired to respond to faces
    The face is the primary communication interface
    Human faces can be automatically analyzed and parameterized for a wide variety of applications

  • MPEG-4 Face and Body Animation Coding

    Face animation is in MPEG-4 version 1
    Body animation is in MPEG-4 version 2
    Face animation parameters displace feature points from their neutral positions
    Body animation parameters are joint angles
    Face and body animation parameter sequences are compressed to low bitrates

  • Neutral Face Definition

    Head axes parallel to the world axes
    Gaze is in the direction of the Z axis
    Eyelids tangent to the iris
    Pupil diameter is one third of the iris diameter
    Mouth is closed and the upper and lower teeth are touching
    Tongue is flat and horizontal, with the tip of the tongue touching the boundary between the upper and lower teeth

  • Face Feature Points

    [Figure: face feature point diagram, distinguishing the teeth, the feature points affected by FAPs, and the other feature points]

  • Face Animation Parameter Normalization

    Face Animation Parameters (FAPs) are normalized to facial dimensions
    Each FAP is measured as a fraction of the neutral-face mouth width, mouth-nose distance, eye separation, or iris diameter
    The 3 head and 2 eyeball rotation FAPs are Euler angles
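As a rough sketch of this normalization, assuming the conventional MPEG-4 FAP units (each neutral-face dimension divided by 1024); the function names are illustrative, not from the standard:

```python
def fap_units(mouth_width, mouth_nose, eye_sep, iris_diam):
    """Compute FAP units (FAPU) from neutral-face dimensions.

    Each unit is the corresponding dimension divided by 1024, so FAP
    values stay small integers even for sub-millimeter displacements.
    """
    return {
        "MW": mouth_width / 1024.0,    # mouth-width unit
        "MNS": mouth_nose / 1024.0,    # mouth-nose separation unit
        "ES": eye_sep / 1024.0,        # eye-separation unit
        "IRISD": iris_diam / 1024.0,   # iris-diameter unit
    }

def to_fap(displacement, fapu, unit):
    """Express a feature-point displacement as an integer FAP value."""
    return round(displacement / fapu[unit])
```

Because each FAP is a fraction of the speaker's own facial dimensions, the same FAP stream animates any conformant model regardless of its absolute proportions.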

  • Neutral Face Dimensions for FAP Normalization

  • FAP Groups

    Group 1: visemes and expressions (2 FAPs)
    Group 2: jaw, chin, inner lower lip, corner lips, mid lip (16 FAPs)
    Group 3: eyeballs, pupils, eyelids (12 FAPs)
    Group 4: eyebrow (8 FAPs)
    Group 5: cheeks (4 FAPs)
    Group 6: tongue (5 FAPs)
    Group 7: head rotation (3 FAPs)
    Group 8: outer lip positions (10 FAPs)
    Group 9: nose (4 FAPs)
    Group 10: ears (4 FAPs)
    (68 FAPs in total)

  • Lip FAPs

    The mouth is closed when the sum of the upper and lower lip FAPs equals 0
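One reading of this rule, assuming opposing upper/lower lip FAP pairs (a sketch, not the normative test):

```python
def mouth_closed(upper_lip_faps, lower_lip_faps):
    """True when every opposing upper/lower lip FAP pair sums to zero,
    i.e. the displacements are equal and opposite so the lips touch."""
    return all(u + l == 0 for u, l in zip(upper_lip_faps, lower_lip_faps))
```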

  • Face Model Independence

    FAPs are always normalized for model independence
    FAPs (and BAPs) can be used without MPEG-4 systems/BIFS
    Private face models can be accurately animated with FAPs
    Face models can be simple or complex depending on terminal resources

  • MPEG-4 BIFS Face Node

    The Face node contains the FAP node, Face scene graph, Face Definition Parameters (FDP), FIT, and FAT
    FIT (Face Interpolation Table) specifies interpolation of FAPs in the terminal
    FAT (Face Animation Table) maps FAPs to face model deformation
    FDP information includes face feature point positions and a texture map

  • Face Model Download

    3D graphical models (e.g. faces) can be downloaded to the terminal with MPEG-4
    3D model specification is based on VRML
    The Face Animation Table (FAT) maps FAPs to face model vertex displacements
    Appearance and animation of downloaded face models are exactly predictable
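A minimal sketch of what "FAT maps FAPs to vertex displacements" can look like, assuming one linear segment per entry (the real FAT also supports piecewise-linear intervals; the data layout here is illustrative):

```python
def apply_fat(vertices, fat, fap_values):
    """Deform a copy of the neutral mesh.

    `fat` maps a FAP id to a list of (vertex index, per-unit (dx, dy, dz))
    entries; each active FAP displaces its vertices in proportion to its value.
    """
    out = [list(v) for v in vertices]
    for fap_id, value in fap_values.items():
        for vidx, (dx, dy, dz) in fat.get(fap_id, []):
            out[vidx][0] += value * dx
            out[vidx][1] += value * dy
            out[vidx][2] += value * dz
    return out
```

Because the table travels with the model, the encoder knows exactly how the terminal will deform the mesh, which is what makes downloaded models "exactly predictable".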

  • FAP Compression

    FAPs are adaptively quantized to the desired quality level
    Quantized FAPs are differentially coded
    Adaptive arithmetic coding further reduces the bitrate
    Typical compressed FAP bitrate is less than 2 kilobits/second

  • FAP Predictive Coding

    [Block diagram: the previous reconstructed frame (inverse quantizer Q⁻¹ plus frame delay) is subtracted from FAP(t); the residual is quantized (Q) and arithmetic coded into the bitstream]
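The prediction loop can be sketched as plain DPCM (the arithmetic coder is omitted; step size and function names are illustrative). The point of the diagram is that the predictor uses the previously reconstructed value, so encoder and decoder never drift apart despite quantization error:

```python
def dpcm_encode(faps, step):
    """Quantize frame-to-frame prediction residuals of one FAP track."""
    prev_rec, symbols = 0, []
    for f in faps:
        q = round((f - prev_rec) / step)  # quantized residual
        symbols.append(q)
        prev_rec += q * step              # reconstruct exactly as the decoder will
    return symbols

def dpcm_decode(symbols, step):
    """Invert dpcm_encode: accumulate dequantized residuals."""
    prev_rec, out = 0, []
    for q in symbols:
        prev_rec += q * step
        out.append(prev_rec)
    return out
```

A smaller step gives finer quality at more bits per symbol, which is the "adaptively quantized to desired quality level" trade-off on the previous slide.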

  • Face Analysis System

    MPEG-4 does not specify analysis systems
    The face2face face analysis system tracks nostrils for robust operation
    The inner lip contour is estimated using adaptive color thresholding and lip modeling
    Eyelids, eyebrows and gaze direction are also estimated

  • Nostril Tracking

  • Inner Lip Contour Estimation

  • FAP Estimation Algorithm

    Head scale is normalized based on the neutral (closed-mouth) mouth width
    Head pitch is approximated from the vertical nostril deviation from the neutral head position
    Head roll is computed from smoothed eye or nostril orientation, depending on availability
    Inner lip FAPs are measured directly from the inner lip contour as deviations from the neutral lip position (closed mouth)
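Two of the steps above are simple enough to sketch directly (coordinates in pixels; the helper names are hypothetical, not part of the face2face system):

```python
import math

def head_scale(mouth_width_px, neutral_mouth_width_px):
    """Scale factor relative to the neutral (closed-mouth) mouth width."""
    return mouth_width_px / neutral_mouth_width_px

def head_roll(left_eye, right_eye):
    """Roll angle in radians from the orientation of the inter-eye line
    (image y grows downward, so a positive angle tilts the right eye down)."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.atan2(dy, dx)
```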

  • FAP Sequence Smoothing

  • MPEG-4 Visemes and Expressions

    A weighted combination of 2 visemes and 2 facial expressions for each frame
    The decoder is free to interpret the effect of visemes and expressions after FAPs are applied
    Definitions of visemes and expressions in terms of FAPs can also be downloaded
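The "weighted combination of 2 visemes" might be applied like this (a sketch; the FAP names and definition tables below are illustrative, not the standard's):

```python
def blend(def_a, def_b, w_a, w_b):
    """Weighted mix of two viseme (or expression) FAP definitions.

    Each definition maps a FAP name to a value; missing FAPs count as 0.
    """
    keys = set(def_a) | set(def_b)
    return {k: w_a * def_a.get(k, 0.0) + w_b * def_b.get(k, 0.0) for k in keys}
```

Blending, e.g., 75% of an /m/ viseme with 25% of an /a/ viseme gives smooth mouth shapes during the transition between the two sounds.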

  • Visemes

    Phonemes      Example words
    p, b, m       put, bed, mill
    f, v          far, voice
    T, D          think, that
    t, d          tip, doll
    k, g          call, gas
    tS, dZ, S     chair, join, she
    s, z          sir, zeal
    n, l          lot, not

  • Facial Expressions

    joy: The eyebrows are relaxed. The mouth is open and the mouth corners pulled back toward the ears.
    sadness: The inner eyebrows are bent upward. The eyes are slightly closed. The mouth is relaxed.
    anger: The inner eyebrows are pulled downward and together. The eyes are wide open. The lips are pressed against each other or opened to expose the teeth.
    fear: The eyebrows are raised and pulled together. The inner eyebrows are bent upward. The eyes are tense and alert.
    disgust: The eyebrows and eyelids are relaxed. The upper lip is raised and curled, often asymmetrically.
    surprise: The eyebrows are raised. The upper eyelids are wide open, the lower relaxed. The jaw is opened.

  • Free Face Model Software

    Wireface is an OpenGL-based, MPEG-4 compliant face model
    A good starting point for building high-quality face models for web applications
    Reads a FAP file and a raw audio file
    Renders face and audio in real time
    Wireface source is freely available

  • Body Animation

    Harmonized with the VRML H-Anim spec
    Body Animation Parameters (BAPs) are humanoid skeleton joint Euler angles
    A Body Animation Table (BAT) can be downloaded to map BAPs to skin deformation
    BAPs can be highly compressed for streaming

  • Body Animation Parameters (BAPs)

    186 humanoid skeleton Euler angles
    110 free parameters for use with a downloaded body surface mesh
    Coded using the same codecs as FAPs
    Typical bitrates for coded BAPs are 5-10 kbps

  • Body Definition Parameters (BDPs)

    Humanoid joint center positions
    Names and hierarchy harmonized with the VRML/Web3D H-Anim working group
    Default positions in the standard for broadcast applications
    Download just the BDPs to accurately animate an unknown body model

  • Faces Enhance the User Experience

    Virtual call center agents
    News readers (e.g. Ananova)
    Story tellers for the child in all of us
    Learning
    Program guide
    Multilingual (same face, different voice)
    Entertainment animation
    Multiplayer games

  • Visual Content for the Practical Internet

    Broadband deployment is happening slowly
    DSL availability is limited and cable is shared
    Talking heads need a high frame rate
    Consumer graphics hardware is cheap and powerful
    MPEG-4 SNHC/FBA tools are matched to available bandwidth and terminals

  • Visual Speech Processing

    FAPs can be used to improve speech recognition accuracy
    Text-to-speech systems can use FAPs to animate face models
    FAPs can be used in computer-human dialogue systems to communicate emotions, intentions and speech, especially in noisy environments

  • Video-driven Face Animation

    Facial expressions, lip movements and head motion are transferred to the face model
    FAPs are extracted from talking-head video with a special computer vision system
    No face markers or lipstick are required
    Normal lighting is used
    Communicates lip movements and facial expressions with visual anonymity

  • Automatic Face Animation Demonstration

    FAPs extracted from camcorder video
    FAPs compressed to less than 2 kbits/sec
    30 frames/sec animation generated automatically
    Face models animated with a bones rig or a fixed deformable mesh (real time)

  • What is easy, solved, or almost solved

    Can we do photorealistic non-animated face models? YES
    Can we do near-real-time lip syncing that is indistinguishable from a human? NO

  • What is really hard

    Synthesizing human speech and facial expressions
    Hair

  • What we have assumed someone else is solving

    Graphics acceleration
    Video camera cost and resolution
    Multimedia communication infrastructure

  • Where we need help

    We have a face with 68 parameters, but we need the psychologists to tell us how to drive it autonomously
    We need to embody our agents in graphical models that have a couple of thousand parameters to control gaze, gesture and body language, and do collision detection -> NEED MORE SPEED

  • Core functionality of the face

    Speech: lips, teeth, tongue
    Emotional expressions: gaze, eyebrows, eyelids, head pose
    Non-verbal communication
    Sensory responsivity
    Technical requirements: frame rate, synchronization, latency, bitrate, spatial resolution, complexity
    Common framework with the body
    Interaction: different faces should respond similarly to common commands
    Accessible to everyone

  • Interaction with other components

    Language and discourse: phoneme-to-viseme mapping, given/new
    Action in the environment
    Global information: emotional state, personality, culture, world knowledge
    Central time-base and timestamps

  • Open questions

    Central vs. peripheral functionality
    Degree of interface commonality
    Degree of agent autonomy
    What should the VH be capable of?