
Creating Interactive Virtual Humans: Some Assembly Required

Jonathan Gratch, USC Institute for Creative Technologies

Jeff Rickel, USC Information Sciences Institute

Elisabeth André, University of Augsburg

Norman Badler, University of Pennsylvania

Justine Cassell, MIT Media Lab

Eric Petajan, Face2Face Animation, Inc.

Science fiction has long imagined a future populated with artificial humans—human-looking devices with human-like intelligence. Although Asimov's benevolent robots and the Terminator movies' terrible war machines are still a distant fantasy, researchers across a wide range of disciplines are beginning to work together toward a more modest goal—building virtual humans. These software entities look and act like people and can engage in conversation and collaborative tasks, but they live in simulated environments. With the untidy problems of sensing and acting in the physical world thus dispensed, the focus of virtual human research is on capturing the richness and dynamics of human behavior.

The potential applications of this technology are considerable. History students could visit ancient Greece and debate Aristotle. Patients with social phobias could rehearse threatening social situations in the safety of a virtual environment. Social psychologists could study theories of communication by systematically modifying a virtual human's verbal and nonverbal behavior. A variety of applications are already in progress, including education and training,1 therapy,2 marketing,3,4 and entertainment.5,6

Building a virtual human is a multidisciplinary effort, joining traditional artificial intelligence problems with a range of issues from computer graphics to social science. Virtual humans must act and react in their simulated environment, drawing on the disciplines of automated reasoning and planning. To hold a conversation, they must exploit the full gamut of natural language research, from speech recognition and natural language understanding to natural language generation and speech synthesis. Providing human bodies that can be controlled in real time delves into computer graphics and animation. And because an agent looks like a human, people expect it to behave like one as well and will be disturbed by, or misinterpret, discrepancies from human norms. Thus, virtual human research must draw heavily on psychology and communication theory to appropriately convey nonverbal behavior, emotion, and personality.

This broad range of requirements poses a serious problem. Researchers working on particular aspects of virtual humans cannot explore their component in the context of a complete virtual human unless they can understand results across this array of disciplines and assemble the vast range of software tools (for example, speech recognizers, planners, and animation systems) required to construct one. Moreover, these tools were rarely designed to interoperate and, worse, were often designed with different purposes in mind. For example, most computer graphics research has focused on high-fidelity offline image rendering that does not support the fine-grained interactive control that a virtual human must have over its body.

In the spring of 2002, about 30 international researchers from across disciplines convened at the University of Southern California to begin to bridge this gap in knowledge and tools (see www.ict.usc.edu/~vhumans). Our ultimate goal is a modular architecture and interface standards that will allow researchers in this area to reuse each other's work. This goal can only be achieved through a close multidisciplinary collaboration. Towards this end, the workshop gathered a collection of experts representing the range of required research areas, including

• human figure animation
• facial animation
• perception
• cognitive modeling
• emotions and personality
• natural language processing
• speech recognition and synthesis
• nonverbal communication
• distributed simulation
• computer games.

Here we discuss some of the key issues that must be addressed in creating virtual humans. As a first step, we overview the issues and available tools in three key areas of virtual human research: face-to-face conversation, emotions and personality, and human figure animation.

Face-to-face conversation

Human face-to-face conversation involves both language and nonverbal behavior. The behaviors during conversation don't just function in parallel, but interdependently. The meaning of a word informs the interpretation of a gesture, and vice versa. The time scales of these behaviors, however, are different—a quick look at the other person to check that they are listening lasts for less time than it takes to pronounce a single word, while a hand gesture that indicates what the word "caulk" means might last longer than it takes to say, "I caulked all weekend."

Coordinating verbal and nonverbal conversational behaviors for virtual humans requires meeting several interrelated challenges. How speech, intonation, gaze, and head movements make meaning together, the patterns of their co-occurrence in conversation, and what kinds of goals are achieved by the different channels are all equally important for understanding the construction of virtual humans. Speech and nonverbal behaviors do not always manifest the same information, but what they convey is virtually always compatible.7 In many cases, different modalities serve to reinforce one another through redundancy of meaning. In other cases, semantic and pragmatic attributes of the message are distributed across the modalities.8 The compatibility of meaning between gestures and speech recalls the interaction of words and graphics in multimodal presentations.9

For patterns of co-occurrence, there is a tight synchrony among the different conversational modalities in humans. For example, people accentuate important words by speaking more forcefully, illustrating their point with a gesture, and turning their eyes toward the listener when coming to the end of a thought. Meanwhile, listeners nod within a few hundred milliseconds of when the speaker's gaze shifts. This synchrony is essential to the meaning of conversation. When it is destroyed, as in low-bandwidth videoconferencing, satisfaction and trust in the outcome of a conversation diminish.10

Regarding the goals achieved by the different modalities, in natural conversation speakers tend to produce a gesture with respect to their propositional goals (to advance the conversation content), such as making the first two fingers look like legs walking when saying "it took 15 minutes to get here," and speakers tend to use eye movement with respect to interactional goals (to ease the conversation process), such as looking toward the other person when giving up the turn.7 To realistically generate all the different verbal and nonverbal behaviors, then, computational architectures for virtual humans must control both the propositional and interactional structures. In addition, because some of these goals can be equally well met by one modality or the other, the architecture must deal at the level of goals or functions, and not at the level of modalities or behaviors. That is, giving up the turn is often achieved by looking at the listener. But, if the speaker's eyes are on the road, he or she can get a response by saying, "Don't you think?"

Constructing a virtual human that can effectively participate in face-to-face conversation requires a control architecture with the following features:4

• Multimodal input and output. Because humans in face-to-face conversation send and receive information through gesture, intonation, and gaze as well as speech, the architecture should also support receiving and transmitting this information.

• Real-time feedback. The system must let the speaker watch for feedback and turn requests, while the listener can send these at any time through various modalities. The architecture should be flexible enough to track these different threads of communication in ways appropriate to each thread. Different threads have different response-time requirements; some, such as feedback and interruption, occur on a sub-second time scale. The architecture should reflect this by allowing different processes to concentrate on activities at different time scales.

• Understanding and synthesis of propositional and interactional information. Dealing with propositional information—the communication content—requires building a model of the user's needs and knowledge. The architecture must include a static domain knowledge base and a dynamic discourse knowledge base. Presenting propositional information requires a planning module for presenting multisentence output and managing the order of presentation of interdependent facts. Understanding interactional information—about the processes of conversation—on the other hand, entails building a model of the current state of the conversation with respect to the conversational process (to determine who the current speaker and listener are, whether the listener has understood the speaker's contribution, and so on).


• Conversational function model. Functions, such as initiating a conversation or giving up the floor, can be achieved by a range of different behaviors, such as looking repeatedly at another person or bringing your hands down to your lap. Explicitly representing conversational functions, rather than behaviors, provides both modularity and a principled way to combine different modalities. Functional models influence the architecture because the core system modules operate exclusively on functions, while other system modules at the edges translate input behaviors into functions, and functions into output behaviors. This also produces a symmetric architecture because the same functions and modalities are present in both input and output.
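As an illustration of this function-based layering, the sketch below maps abstract conversational functions to whatever concrete behaviors are currently available; the function names, behavior names, and selection rule are hypothetical and only meant to show how the core modules can stay ignorant of modalities.

```python
# Hypothetical sketch of a function-to-behavior translation layer. The core
# dialogue modules emit abstract conversational functions; this layer picks a
# concrete behavior in whichever modality is currently free.

from dataclasses import dataclass

@dataclass
class Behavior:
    modality: str   # "gaze", "speech", "gesture", ...
    action: str

# One function can be realized by several alternative behaviors.
FUNCTION_TO_BEHAVIORS = {
    "give_turn": [Behavior("gaze", "look_at_listener"),
                  Behavior("speech", "verbal_turn_yield")],
    "take_turn": [Behavior("gesture", "raise_hand"),
                  Behavior("speech", "start_speaking")],
    "give_feedback": [Behavior("gaze", "nod")],
}

def realize(function_name, free_modalities):
    """Pick the first alternative whose modality is not otherwise occupied."""
    for behavior in FUNCTION_TO_BEHAVIORS[function_name]:
        if behavior.modality in free_modalities:
            return behavior
    raise ValueError("no available modality for " + function_name)

# If the eyes are busy (say, on the road), fall back to a spoken turn yield.
print(realize("give_turn", {"speech", "gesture"}))
```

The same table can be read in the other direction on the input side, translating observed behaviors back into functions, which is what makes the architecture symmetric.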

To capture different time scales and the importance of co-occurrence, input to a virtual human must be incremental and time stamped. For example, incremental speech recognition lets the virtual human give feedback (such as a quick nod) right as the real human finishes a sentence, thereby influencing the direction the human speaker takes. At the very least, the system should report a significant change in state right away, even if full information about the event has not yet been processed. This means that if speech recognition cannot be incremental, at least the fact that someone is speaking or has finished speaking should be relayed immediately, even in the absence of a fully recognized utterance. This lets the virtual human give up the turn when the real human claims it and signal reception after being addressed. When dealing with multiple modalities, fusing interpretations of the different input events is important to understand what behaviors are acting together to convey meaning.12 For this, a synchronized clock across modalities is crucial so that events such as exactly when an emphasis beat gesture occurs can be compared to speech, word by word. This requires, of course, that the speech recognizer supply word onset times.
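A minimal sketch of such time-stamped, incremental input appears below; the event fields and the 250 ms fusion window are illustrative assumptions rather than values from any of the systems discussed here.

```python
# Hypothetical sketch: time-stamped input events from several modalities are
# fused on a shared clock so that co-occurring behaviors can be interpreted
# together (for example, a beat gesture aligned with a particular word).

from dataclasses import dataclass

@dataclass
class InputEvent:
    modality: str      # "speech", "gesture", "gaze", ...
    kind: str          # "word_onset", "beat_gesture", "speaking_started", ...
    t: float           # seconds on the shared clock
    payload: str = ""  # e.g., the recognized word

FUSION_WINDOW = 0.25   # assumed window (seconds) for treating events as co-occurring

def fuse(events):
    """Pair each beat gesture with the word whose onset is closest in time."""
    words = [e for e in events if e.kind == "word_onset"]
    beats = [e for e in events if e.kind == "beat_gesture"]
    pairs = []
    for beat in beats:
        nearest = min(words, key=lambda w: abs(w.t - beat.t), default=None)
        if nearest is not None and abs(nearest.t - beat.t) <= FUSION_WINDOW:
            pairs.append((beat, nearest))
    return pairs

events = [
    InputEvent("speech", "word_onset", 1.02, "caulked"),
    InputEvent("gesture", "beat_gesture", 1.05),
    InputEvent("speech", "word_onset", 1.40, "all"),
]
print(fuse(events))   # the beat gesture aligns with "caulked"
```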

Similarly, for the virtual human to produce a multimodal performance, the output channels also must be incremental and tightly synchronized. Incremental refers to two properties in particular: seamless transitions and interruptible behavior. When producing certain behaviors, such as gestures, the virtual human must reconfigure its limbs in a natural manner, usually requiring that some time be spent on interpolating from a previous posture to a new one. For the transition to be seamless, the virtual human must give the animation system advance notice of events such as gestures, so that it has time to bring the arms into place. Sometimes, however, behaviors must be abruptly interrupted, such as when the real human takes the turn before the virtual human has finished speaking. In that case, the current behavior schedule must be scrapped, the voice halted, and new attentive behaviors initiated—all with reasonable seamlessness.

Synchronicity between modalities is as important in the output as in the input. The virtual human must align a graphical behavior with the uttering of particular words or a group of words. The temporal association between the words and behaviors might have been resolved as part of the behavior generation process, as is done in SPUD (Sentence Planning Using Description),8 but it is essential that the speech synthesizer provide a mechanism for maintaining synchrony through the final production stage. There are two types of mechanisms, event based or time based. A text-to-speech engine can usually be programmed to send events on phoneme and word boundaries. Although this is geared towards supporting lip synch, other behaviors can be executed as well. However, this does not allow any time for behavior preparation. Preferably, the TTS engine can provide exact start times for each word prior to playing back the voice, as Festival does.13 This way, we can schedule the behaviors, and thus the transitions between behaviors, beforehand, and then play them back along with the voice for a perfectly seamless performance.
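The sketch below illustrates the time-based approach under the assumption that the synthesizer can report each word's start time before playback (as Festival can); the 0.3 s preparation lead time and the function names are invented for illustration.

```python
# Hypothetical sketch of time-based scheduling: given word start times from a
# TTS engine, pre-compute when each nonverbal behavior must begin so that the
# gesture stroke lands on the word it accompanies.

GESTURE_PREP_TIME = 0.3  # assumed seconds needed to bring the arm into place

def schedule_behaviors(word_times, gesture_plan):
    """word_times: {word_index: start_time_s}; gesture_plan: {word_index: gesture}."""
    schedule = []
    for word_index, gesture in gesture_plan.items():
        stroke_time = word_times[word_index]
        # Start preparing early enough that the stroke coincides with the word.
        schedule.append((max(0.0, stroke_time - GESTURE_PREP_TIME), "prepare:" + gesture))
        schedule.append((stroke_time, "stroke:" + gesture))
    return sorted(schedule)

# "I caulked all weekend" -- emphasize word 1 ("caulked") with an iconic gesture.
word_times = {0: 0.00, 1: 0.35, 2: 0.90, 3: 1.10}
print(schedule_behaviors(word_times, {1: "iconic_caulk"}))
# preparation starts at about 0.05 s; the stroke fires at 0.35 s
```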

On the output side, one tool that provides such tight synchronicity is the Behavior Expression Animation Toolkit system.11 Figure 1 shows BEAT's architecture. BEAT has the advantage of automatically annotating text with hand gestures, eye gaze, eyebrow movement, and intonation. The annotation is carried out in XML, through interaction with an embedded word ontology module, which creates a set of hypernyms that broadens a knowledge base search of the domain being discussed. The annotation is then passed to a set of behavior generation rules. Output is scheduled so that tight synchronization is maintained among modalities.

Figure 1. The Behavior Expression Animation Toolkit text-to-nonverbal behavior module.

Emotions and personality

People infuse their verbal and nonverbal behavior with emotion and personality, and modeling such behavior is essential for building believable virtual humans. Consequently, researchers have developed computational models for a wide range of applications. Computational approaches might be roughly divided into communication-driven and simulation-based approaches.

In communication-driven approaches, a virtual human chooses its emotional expression on the basis of its desired impact on the user. Catherine Pelachaud and her colleagues use facial expressions to convey affect in combination with other communicative functions.14 For example, making a request with a sorrowful face can evoke pity and motivate an affirmative response from the listener. An interesting feature of their approach is that the agent deliberately plans whether or not to convey a certain emotion. Tutoring applications usually also follow a communication-driven approach, intentionally expressing emotions with the goal of motivating the students and thus increasing the learning effect. The Cosmo system, where the agent's pedagogical goals drive the selection and sequencing of emotive behaviors, is one example.15 For instance, a congratulatory act triggers a motivational goal to express admiration that is conveyed with applause. To convey appropriate emotive behaviors, agents such as Cosmo need to appraise events not only from their own perspective but also from the perspective of others.

The second category of approaches aims at a simulation of "true" emotion (as opposed to deliberately conveyed emotion). These approaches build on appraisal theories of emotion, the most prominent being Andrew Ortony, Gerald Clore, and Allan Collins' cognitive appraisal theory—commonly referred to as the OCC model.16 This theory views emotions as arising from a valenced reaction to events and objects in the light of agent goals, standards, and attitudes. For example, an agent watching a game-winning move should respond differently depending on which team is preferred.3 Recent work by Stacy Marsella and Jonathan Gratch integrates the OCC model with coping theories that explain how people cope with strong emotions.17 For example, their agents can engage in either problem-focused coping strategies, selecting and executing actions in the world that could improve the agent's emotional state, or emotion-focused coping strategies, improving emotional state by altering the agent's mental state (for example, dealing with guilt by blaming someone else). Further simulation approaches are based on the observation that an agent should be able to dynamically adapt its emotions through its own experience, using learning mechanisms.6,18

Appraisal theories focus on the relationship between an agent's world assessment and the resulting emotions. Nevertheless, they are rather vague about the assessment process. For instance, they do not explain how to determine whether a certain event is desirable. A promising line of research is integrating appraisal theories with AI-based planning approaches,19 which might lead to a concretization of such theories. First, emotions can arise in response to a deliberative planning process (when relevant risks are noticed, progress assessed, and success detected). For example, several approaches derive an emotion's intensity from the importance of a goal and its probability of achievement.20,21 Second, emotions can influence decision-making by allocating cognitive resources to specific goals or threats. Plan-based approaches support the implementation of decision and action selection mechanisms that are guided by an agent's emotional state. For example, the Inhabited Market Place application treats emotions as filters to constrain the decision process when selecting and instantiating dialogue operators.3
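As a toy illustration of that intensity rule, the sketch below derives hope- and fear-like intensities from a goal's importance and its estimated probability of achievement; the formulas are a common simplification used here for illustration, not the exact rules of any system cited in this article.

```python
# Hypothetical appraisal sketch: derive emotion intensities for a single goal
# from its importance and the planner's current probability of achieving it.

def appraise_goal(importance, p_success):
    """importance and p_success are both in [0, 1]."""
    return {
        # Prospect-based emotions about an uncertain future outcome.
        "hope": importance * p_success,
        "fear": importance * (1.0 - p_success),
        # Outcome emotions once the goal is fully resolved.
        "joy": importance if p_success == 1.0 else 0.0,
        "distress": importance if p_success == 0.0 else 0.0,
    }

# A risk to an important goal is noticed: probability of success drops to 0.2.
print(appraise_goal(importance=0.9, p_success=0.2))
# Fear dominates; an emotion-focused coping strategy might now shift blame,
# while a problem-focused one would replan to raise p_success.
```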

In addition to generating affective states, we must also express them in a manner the user can easily interpret. Effective means of conveying emotions include body gestures, acoustic realization, and facial expressions (see Gary Collier's work for an overview of studies on emotive expressions22). Several researchers use Bayesian networks to model the relationship between emotion and its behavioral expression. Bayesian networks let us deal explicitly with uncertainty, which is a great advantage when modeling the connections between emotions and the resulting behaviors. Gene Ball and Jack Breese presented an example of such an approach. They constructed a Bayesian network that estimates the likelihood of specific body postures and gestures for individuals with different personality types and emotions.23 For instance, a negative emotion increases the probability that an agent will say "Oh, you again," as opposed to "Nice to see you!" Recent work by Catherine Pelachaud and colleagues employs Bayesian networks to resolve conflicts that occur when different communicative functions need to be shown on different channels of the face, such as eyebrow, mouth shape, gaze direction, head direction, and head movements (see Figure 2).14 In this case, the Bayesian network estimates the likelihood that a face movement overrides another. Bayesian networks also offer a possibility to model how emotions vary over time. Even though neither Gene Ball and Jack Breese nor Catherine Pelachaud and colleagues took advantage of this feature, the extension of the two approaches to dynamic Bayesian networks seems obvious.

Figure 2. Pelachaud uses an MPEG-4-compatible facial animation system to investigate how to resolve conflicts that arise when different communication functions need to be shown on different channels of the face: (a) certain, (b) sorry-for, (c) sorry-for + certain.
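A minimal sketch of the idea follows, using a hand-made conditional probability table rather than any published model; it only shows how a hidden valence node can bias the choice between the two greetings mentioned above.

```python
# Hypothetical sketch of the Bayesian-network idea: a hidden emotional valence
# node biases the probability of observable expressive behaviors.
# The numbers are illustrative and not taken from Ball and Breese's model.

import random

# P(greeting | valence): hand-made conditional probability table.
P_GREETING = {
    "negative": {"Oh, you again.": 0.8, "Nice to see you!": 0.2},
    "positive": {"Oh, you again.": 0.1, "Nice to see you!": 0.9},
}

def sample_greeting(valence):
    """Sample an utterance given the agent's current emotional valence."""
    table = P_GREETING[valence]
    return random.choices(list(table), weights=list(table.values()))[0]

print(sample_greeting("negative"))   # most likely: "Oh, you again."
print(sample_greeting("positive"))   # most likely: "Nice to see you!"
```

A dynamic Bayesian network, as suggested above, would add a time index to the valence node so that the current emotional state also depends on its value at the previous time step.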

While significant progress has been made on the visualization of emotive behaviors, automated speech synthesis still has a long way to go. The most natural-sounding approaches rely on a large inventory of human speech units (for example, combinations of phonemes) that are subsequently selected and combined based on the sentence to be synthesized. These approaches do not, yet, provide much ability to convey emotion through speech (for example, by varying prosody or intensity). Marc Schröder provides an overview of speech manipulations that have been successfully employed to express several basic emotions.25 While the interest in affective speech synthesis is increasing, hardly any work has been done on conveying emotion through sentence structure or word choice. An exception includes Eduard Hovy's pioneering work on natural language generation that addresses not only the goal of information delivery, but also pragmatic aspects, such as the speaker's emotions.26 Marilyn Walker and colleagues present a first approach to integrating acoustic parameters with other linguistic phenomena, such as sentence structure and wording.27
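To make the prosody idea concrete, the sketch below maps a few basic emotions onto global prosody settings; the direction of the adjustments loosely follows the kind of manipulations surveyed in reviews such as Schröder's, but the parameter names and numbers are invented.

```python
# Hypothetical sketch: express a basic emotion by globally adjusting prosody
# parameters before synthesis. The values below are illustrative only.

PROSODY_BY_EMOTION = {
    #            pitch shift, speaking-rate factor, volume factor
    "joy":      {"pitch": +0.3, "rate": 1.15, "volume": 1.1},
    "sadness":  {"pitch": -0.2, "rate": 0.85, "volume": 0.9},
    "anger":    {"pitch": +0.1, "rate": 1.10, "volume": 1.3},
    "neutral":  {"pitch":  0.0, "rate": 1.00, "volume": 1.0},
}

def synthesis_request(text, emotion):
    """Bundle text with emotion-dependent prosody settings for a TTS back end."""
    settings = PROSODY_BY_EMOTION.get(emotion, PROSODY_BY_EMOTION["neutral"])
    return {"text": text, **settings}

print(synthesis_request("Nice to see you!", "joy"))
```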

Obviously, there is a close relationship between emotion and personality. Dave Moffat differentiates between personality and emotion using the two dimensions duration and focus.28 Whereas personality remains stable over a long period of time, emotions are short-lived. Moreover, while emotions focus on particular events or objects, factors determining personality are more diffuse and indirect. Because of this obvious relationship, several projects aim to develop an integrated model of emotion and personality. As an example, Gene Ball and Jack Breese model dependencies between emotions and personality in a Bayesian network.23 To enhance the believability of animated agents beyond reasoning about emotion and personality, Helmut Prendinger and colleagues model the relationship between an agent's social role and the associated constraints on emotion expression, for example, by suppressing negative emotion when interacting with higher-status individuals.29

Another line of research aims at providing an enabling technology to support affective interactions. This includes both the definition of standardized languages for specifying emotive behaviors, such as the Affective Presentation Markup Language14 or the Emotion Markup Language (www.vhml.org), as well as the implementation of toolkits for affective computing combining a set of components addressing affective knowledge acquisition, representation, reasoning, planning, communication, and expression.30

Human figure animation

By engaging in face-to-face conversation, conveying emotion and personality, and otherwise interacting with the synthetic environment, virtual humans impose fairly severe behavioral requirements on the underlying animation system that must render their physical bodies. Most production work involves animator effort to design or script movements or direct performer motion capture. Replaying movements in real time is not the issue; rather, it is creating novel, contextually sensitive movements in real time that matters. Interactive and conversational agents, for example, will not enjoy the luxury of relying on animators to create human time-frame responses. Animation techniques must span a variety of body systems: locomotion, manual gestures, hand movements, body pose, faces, eyes, speech, and other physiological necessities such as breathing, blinking, and perspiring. Research in human figure animation has addressed all of these modalities, but historically the work focuses either on the animation of complete body movements or on animation of the face.

Body animation methods

In body animation, there are two basic ways to gain the required interactivity: use motion capture and additional techniques to rapidly modify or re-target movements to immediate needs,31 or write procedural code that allows program control over important movement parameters.32 The difficulty with the motion capture approach is maintaining environmental constraints such as solid foot contacts and proper reach, grasp, and observation interactions with the agent's own body parts and other objects. To alleviate these problems, procedural approaches parameterize target locations, motion qualities, and other movement constraints to form a plausible movement directly. Procedural approaches consist of kinematic and dynamics techniques. Each has its preferred domain of applicability: kinematics is generally better for goal-directed activities and slower (controlled) actions, and dynamics is more natural for movements directed by application of forces, impacts, or high-speed behaviors.33 The wide range of human movement demands that both approaches have real-time implementations that can be procedurally selected as required.
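As a small concrete example of the kinematic side, the sketch below solves the classic two-link planar reach problem analytically; it stands in for the goal-directed inverse kinematics mentioned above and is not the algorithm of any particular system cited here.

```python
# Hypothetical sketch: analytic inverse kinematics for a two-link planar "arm",
# the kind of goal-directed kinematic computation used for simple reach tasks.

import math

def two_link_ik(x, y, l1, l2):
    """Return (shoulder, elbow) angles in radians that place the hand at (x, y)."""
    d2 = x * x + y * y
    # Law of cosines gives the elbow angle; the check guards unreachable targets.
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target out of reach")
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(l2 * math.sin(elbow),
                                             l1 + l2 * math.cos(elbow))
    return shoulder, elbow

# Reach toward a point with an upper arm of 0.3 m and a forearm of 0.25 m.
print(two_link_ik(0.4, 0.2, 0.3, 0.25))
```

A dynamics-based controller would instead compute joint torques from desired accelerations, which is why the two families suit different kinds of motion.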

Animating a human body form requires more than just controlling skeletal rotation angles. People are neither skeletons nor robots, and considerable human qualities arise from intelligent movement strategies, soft deformable surfaces, and clothing. Movement strategies include reach or constrained contacts, often achieved with goal-directed inverse kinematics.34 Complex workplaces, however, entail more complex planning to avoid collisions, find free paths, and optimize strength availability. The suppleness of human skin and the underlying tissue biomechanics lead to shape changes caused by internal muscle actions as well as external contact with the environment. Modeling and animating the local, muscle-based deformation of body surfaces in real time is possible through shape morphing techniques,35,36 but providing appropriate shape changes in response to external forces is a challenging problem. "Skin-tight" texture-mapped clothing is prevalent in computer game characters, but animating draped or flowing garments requires dynamic simulation, fast collision detection, and appropriate collision response.37,38

Accordingly, animation systems build procedural models of these various behaviors and execute them on human models. The diversity of body movements involved has led to building more consistent agents: procedural animations that affect and control multiple body communication channels in coordinated ways.11,24,39,40 The particular challenge here is constructing computer graphics human models that balance sufficient articulation, detail, and motion generators to effect both gross and subtle movements with realism, real-time responsiveness, and visual acceptability. And if that isn't enough, consider the additional difficulty of modeling a specific real individual. Computer graphics still lacks effective techniques to transfer even captured motion into features that characterize a specific person's mannerisms and behaviors, though machine-learning approaches could prove promising.41

Implementing an animated human body is complicated by a relative paucity of generally available tools. Body models tend to be proprietary (for example, Extempo.com, Ananova.com), optimized for real time and thus limited in body structure and features (for example, DI-Guy, BDI.com, illustrated in Figure 3), or constructions for particular animations built with standard animator tools such as Poser, Maya, or 3DSMax. The best attempt to design a transportable, standard avatar is the Web3D Consortium's H-Anim effort (www.h-anim.org). With well-defined body structure and feature sites, the H-Anim specification has engendered model sharing and testing not possible with proprietary approaches. The present liability is the lack of an application programming interface in the VRML language binding of H-Anim. A general API for human models is a highly desirable next step, the benefits of which have been demonstrated by Norman Badler's research group's use of the software API in Jack (www.ugs.com/products/efactory/jack), which allows feature access and provides plug-in extensions for new real-time behaviors.

Figure 3. PeopleShop and DI-Guy are used to create scenarios for ground combat training. This scenario was used at Ft. Benning to enhance situation awareness in experiments to train US Army officers for urban combat. Image courtesy of Boston Dynamics.

Face animation methods

A computer-animated human face can evoke a wide range of emotions in real people because faces are central to human reality. Unfortunately, modeling and rendering artifacts can easily produce a negative response in the viewer. The great complexity and psychological depth of the human response to faces causes difficulty in predicting the response to a given animated face model. The partial or minimalist rendering of a face can be pleasing as long as it maintains quality and accuracy in certain key dimensions. The ultimate goal is to analyze and synthesize humans with enough fidelity and control to pass the Turing test, create any kind of virtual being, and enable total control over its virtual appearance. Eventually, surviving technologies will be combined to increase the accuracy and efficiency of the capture, linguistic, and rendering systems. Currently the approaches to animating the face are disjoint and driven by production costs and imperfect technology. Each method presents a distinct "look and feel," as well as advantages and disadvantages.

Facial animation methods fall into three major categories. The first and earliest method is to manually generate keyframes and then automatically interpolate frames between the keyframes (or use less skilled animators). This approach is used in traditional cel animation and in 3D animated feature films. Keyframe and morph target animation provides complete artistic control but can be time consuming to perfect.

The second method is to synthesize facial movements from text or acoustic speech. A TTS algorithm, or an acoustic speech recognizer, provides a translation to phonemes, which are then mapped to visemes (visual phonemes). The visemes drive a speech articulation model that animates the face. The convincing synthesis of a face from text has yet to be accomplished. The state of the art provides understandable acoustic and visual speech and facial expressions.42,43
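A fragment of the phoneme-to-viseme step might look like the sketch below; the grouping is a rough many-to-one illustration in the spirit of standard viseme sets rather than a normative table.

```python
# Hypothetical sketch of the phoneme-to-viseme step: many phonemes share the
# same visible mouth shape, so they collapse onto a much smaller viseme set.

PHONEME_TO_VISEME = {
    # Bilabials look alike on the lips.
    "p": "viseme_pbm", "b": "viseme_pbm", "m": "viseme_pbm",
    # Labiodentals.
    "f": "viseme_fv", "v": "viseme_fv",
    # A couple of vowels.
    "aa": "viseme_open", "iy": "viseme_spread",
}

def to_viseme_track(phonemes_with_times):
    """Map (phoneme, start_time) pairs to (viseme, start_time) pairs."""
    return [(PHONEME_TO_VISEME.get(p, "viseme_neutral"), t)
            for p, t in phonemes_with_times]

# Timed phoneme output from a TTS front end or acoustic recognizer ("beam").
print(to_viseme_track([("b", 0.00), ("iy", 0.08), ("m", 0.25)]))
```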

The third and most recent method for animating a face model is to measure human facial movements directly and then apply the motion data to the face model. The system can capture facial motions using one or more cameras and can incorporate face markers, structured light, laser range finders, and other face measurement modes. Each facial motion capture approach has limitations that might require post-processing to overcome. The ideal motion-capture data representation supports sufficient detail without sacrificing editability (for example, MPEG-4 Facial Animation Parameters). The choice of modeling and rendering technologies ranges from 2D line drawings to physics-based 3D models with muscles, skin, and bone.44,45 Of course, textured polygons (and non-uniform rational b-splines and subdivision surfaces) are by far the most common. A variety of surface deformation schemes exist that attempt to simulate the natural deformations of the human face while driven by external parameters.46,47

Figure 4. The set of MPEG-4 Face Definition Parameter (FDP) feature points, grouped by facial region (eyes, nose, mouth, teeth, and tongue); the figure distinguishes feature points affected by FAPs from other feature points.


MPEG-4, which was designed for high-quality visual communication at low bit rates coupled with low-cost graphics rendering systems, offers one existing standard for human figure animation (see Figure 4). It contains a comprehensive set of tools for representing and compressing content objects and the animation of those objects, and it treats virtual humans (faces and bodies) as a special type of object. The MPEG-4 Face and Body Animation standard provides anatomically specific locations and animation parameters. It defines Face Definition Parameter feature points and locates them on the face. Some of these points only serve to help define the face's shape. The rest of them are displaced by Facial Animation Parameters, which specify feature point displacements from the neutral face position. Some FAPs are descriptors for visemes and emotional expressions. Most remaining FAPs are normalized to be proportional to neutral-face mouth width, mouth-nose distance, eye separation, iris diameter, or eye-nose distance. Although MPEG-4 has defined a limited set of visemes and facial expressions, designers can specify two visemes or two expressions with a blend factor between the visemes and an intensity value for each expression. The normalization of the FAPs gives the face model designer freedom to create characters with any facial proportions, regardless of the source of the FAPs. They can embed MPEG-4 compliant face models into decoders, store them on CD-ROM, download them as an executable from a website, or build them into a Web browser.
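The sketch below illustrates the normalization idea: animation parameters are expressed in units derived from the model's own neutral-face proportions, so the same FAP stream can drive faces with different geometry. The unit names follow the MPEG-4 FAPU convention, but the scaling shown and the example values are illustrative assumptions.

```python
# Hypothetical sketch of MPEG-4-style FAP normalization: each face model derives
# its own FAP units (FAPUs) from its neutral-face geometry, so a transmitted FAP
# value produces a proportionally correct displacement on any face.

from dataclasses import dataclass

@dataclass
class NeutralFace:
    mouth_width: float        # distances measured on this model, in model units
    mouth_nose_sep: float
    eye_separation: float
    iris_diameter: float
    eye_nose_sep: float

    def fapu(self):
        """Derive FAP units from this model's own proportions (assumed scaling)."""
        return {
            "MW": self.mouth_width / 1024.0,
            "MNS": self.mouth_nose_sep / 1024.0,
            "ES": self.eye_separation / 1024.0,
            "IRISD": self.iris_diameter / 1024.0,
            "ENS": self.eye_nose_sep / 1024.0,
        }

def apply_fap(face, fap_value, unit):
    """Convert a transmitted FAP value into a displacement on this particular model."""
    return fap_value * face.fapu()[unit]

# The same FAP value drives a small face and a large face proportionally.
small = NeutralFace(40.0, 30.0, 60.0, 10.0, 50.0)
large = NeutralFace(80.0, 60.0, 120.0, 20.0, 100.0)
print(apply_fap(small, 512, "MNS"), apply_fap(large, 512, "MNS"))
```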

Integration challenges

Integrating all the various elements described here into a virtual human is a daunting task. It is difficult for any single research group to do it alone. Reusable tools and modular architectures would be an enormous benefit to virtual human researchers, letting them leverage each other's work. Indeed, some research groups have begun to share tools, and several standards have recently emerged that will further encourage sharing. However, we must confront several difficult issues before we can readily plug and play different modules to control a virtual human's behavior. Two key issues discussed at the workshop were consistency and timing of behavior.

Consistency

When combining a variety of behavioral components, one problem is maintaining consistency between the agent's internal state (for example, goals, plans, and emotions) and the various channels of outward behavior (for example, speech and body movements). When real people present multiple behavior channels, we interpret them for consistency, honesty, and sincerity, and for social roles, relationships, power, and intention. When these channels conflict, the agent might simply look clumsy or awkward, but it could appear insincere, confused, conflicted, emotionally detached, repetitious, or simply fake. To an actor or an expert animator, this is obvious. Bad actors might fail to control gestures or facial expressions to portray the demeanor of their persona in a given situation. The actor might not have internalized the character's goals and motivations enough to use the body's own machinery to manifest these inner drives as appropriate behaviors. A skilled animator (and actor) knows that all aspects of a character must be consistent with its desired mental state because we can control only voice, body shape, and movement for the final product. We cannot open a dialog with a pre-animated character to further probe its mind or its psychological state. With a real-time embodied agent, however, we might indeed have such an opportunity.

One approach to remedying this problem is to explicitly coordinate the agent's internal state with the expression of body movements in all possible channels. For example, Norman Badler's research group has been building a system, Emote, to parameterize and modulate action performance.24 It is based on Laban Movement Analysis, a human movement observation system. Emote is not an action selector per se; it is used to modify the execution of a given behavior and thus change its movement qualities or character. Emote's power arises from the relatively small number of parameters that control or affect a much larger set, and from new extensions to the original definitions that include non-articulated face movements. The same set of parameters controls many aspects of manifest behavior across the agent's body and therefore permits experimentation with similar or dissimilar settings. The hypothesis is that behaviors manifest in separate channels with similar Emote parameters will appear consistent to some internal state of the agent; conversely, dissimilar Emote parameters will convey various negative impressions of the character's internal consistency (a small sketch of this idea follows the list below). Most computer-animated agents provide direct evidence for the latter view:

• Arm gestures without facial expressions look odd.
• Facial expressions with neutral gestures look artificial.
• Arm gestures without torso involvement look insincere.
• Attempts at emotions in gait variations look funny without concomitant body and facial affect.
• Otherwise carefully timed gestures and speech fail to register with gesture performance and facial expressions.
• Repetitious actions become irritating because they appear unconcerned about our changing (more negative) feelings about them.
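The sketch promised above assumes a Laban-style effort parameter set (space, weight, time, flow) shared across output channels; the parameter names and the modulation rule are invented for illustration and are not Emote's actual model.

```python
# Hypothetical sketch: one small set of effort-like parameters modulates every
# output channel, so arms, face, and torso stay consistent with the same
# internal state instead of conflicting.

from dataclasses import dataclass

@dataclass
class Effort:
    space: float   # indirect (-1) .. direct (+1)
    weight: float  # light (-1) .. strong (+1)
    time: float    # sustained (-1) .. sudden (+1)
    flow: float    # free (-1) .. bound (+1)

def modulate(base_amplitude, e):
    """Scale a channel's motion amplitude by the shared effort parameters."""
    # Illustrative rule: strong, sudden movement is exaggerated in every channel.
    return base_amplitude * (1.0 + 0.3 * e.weight + 0.2 * e.time)

agitated = Effort(space=0.8, weight=0.9, time=0.7, flow=0.5)
for channel, amplitude in [("arm_gesture", 1.0), ("brow_raise", 0.6), ("torso_lean", 0.4)]:
    # Driving every channel from the same parameters is what keeps the overall
    # behavior looking consistent rather than conflicting.
    print(channel, round(modulate(amplitude, agitated), 2))
```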

Timing

In working together toward a unifying architecture, timing emerged as a central concern at the workshop. A virtual human's behavior must unfold over time, subject to a variety of temporal constraints. For example, speech-related gestures must closely follow the voice cadence. It became obvious during the workshop that previous work focused on a specific aspect of behavior (for example, speech, reactivity, or emotion), leading to architectures that are tuned to a subset of timing constraints and cannot straightforwardly incorporate others. During the final day of the workshop, we struggled with possible architectures that might address this limitation.

For example, BEAT schedules speech-related body movements using a pipelined architecture: a text-to-speech system generates a fixed timeline to which a subsequent gesture scheduler must conform. Essentially, behavior is a slave to the timing constraints of the speech synthesis tool. In contrast, systems that try to physically convey a sense of emotion or personality often work by altering the time course of gestures. For example, Emote works later in the pipeline, taking a previously generated sequence of gestures and shortening or drawing them out for emotional effect. Essentially, behavior is a slave to the constraints of emotional dynamics. Finally, some systems have focused on making the character highly reactive and embedded in the synthetic environment. For example, Mr. Bubb of Zoesis Studios (see Figure 5) is tightly responsive to unpredictable and continuous changes in the environment (such as mouse movements or bouncing balls). In such systems, behavior is a slave to environmental dynamics. Clearly, if these various capabilities are to be combined, we must reconcile these different approaches.

Figure 5. Mr. Bubb is an interactive character developed by Zoesis Studios that reacts continuously to the user's social interactions during a cooperative play experience. Image courtesy of Zoesis Studios.

One outcome of the workshop was a number of promising proposals for reconciling these competing constraints. At the very least, much more information must be shared between components in the pipeline. For example, if BEAT had more access to timing constraints generated by Emote, it could do a better job of up-front scheduling. Another possibility would be to specify all of the constraints explicitly and devise an animation system flexible enough to handle them all, an approach the motion graph technique suggests.48 Norman Badler suggests an interesting pipeline architecture that consists of "fat" pipes with weak uplinks. Modules would send down considerably more information (and possibly multiple options) and could poll downstream modules for relevant information (for example, how long would it take to look at the ball, given its current location). Exploring these and other alternatives is an important open problem in virtual human research.
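One way to picture the "fat pipe with weak uplink" proposal is sketched below: an upstream behavior planner sends rich option sets downstream and may poll the animation module for timing estimates before committing to a schedule. The module names, the query interface, and the timing rule are invented for illustration.

```python
# Hypothetical sketch of a "fat" pipe with a weak uplink: the behavior planner
# passes several options downstream and can ask the animation module how long a
# candidate action would take before fixing the schedule.

class AnimationModule:
    def estimate_duration(self, action, context):
        """Weak uplink: answer timing queries without taking over control."""
        if action == "look_at_ball":
            # Farther targets take longer to orient toward (illustrative rule).
            return 0.2 + 0.1 * context.get("ball_distance", 0.0)
        return 0.5

class BehaviorPlanner:
    def __init__(self, animation):
        self.animation = animation

    def plan(self, context):
        # Fat pipe: send multiple options plus the data needed to choose later.
        options = ["look_at_ball", "point_at_ball"]
        timed = [(o, self.animation.estimate_duration(o, context)) for o in options]
        # Commit to whichever option fits before the next speech-aligned gesture.
        deadline = context["time_until_next_gesture"]
        feasible = [o for o, d in timed if d <= deadline]
        return {"options_sent": timed, "chosen": feasible[0] if feasible else None}

planner = BehaviorPlanner(AnimationModule())
print(planner.plan({"ball_distance": 2.0, "time_until_next_gesture": 0.45}))
```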

The future of androids remains to be seen, but realistic interactive virtual humans will almost certainly populate our near future, guiding us toward opportunities to learn, enjoy, and consume. The move toward sharable tools and modular architectures will certainly hasten this progress, and, although significant challenges remain, work is progressing on multiple fronts. The emergence of animation standards such as MPEG-4 and H-Anim has already facilitated the modular separation of animation from behavioral controllers and sparked the development of higher-level extensions such as the Affective Presentation Markup Language. Researchers are already sharing behavioral models such as BEAT and Emote. We have outlined only a subset of the many issues that arise, ignoring many of the more classical AI issues such as perception, planning, and learning. Nonetheless, we have highlighted the considerable recent progress towards interactive virtual humans and some of the key challenges that remain. Assembling a new virtual human is still a daunting task, but the building blocks are getting bigger and better every day.

Acknowledgments

We are indebted to Bill Swartout for developing the initial proposal to run a Virtual Human workshop, to Jim Blake for helping us arrange funding, and to Julia Kim and Lynda Strand for handling all the logistics and support, without which the workshop would have been impossible. We gratefully acknowledge the Department of the Army for their financial support under contract number DAAD 19-99-D-0046. We also acknowledge the support of the US Air Force F33615-99-D-6001 (D0 #8), Office of Naval Research K-5-55043/3916-1552793, NSF IIS99-00297, and NASA NRA NAG 9-1279. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these agencies.

References

1. Johnson, W.L., J.W. Rickel, and J.C. Lester. 2000. Animated Pedagogical Agents: Face-to-Face Interaction in Interactive Learning Environments. International Journal of Artificial Intelligence in Education, 11, 47-78.

2. Marsella, S.C., W.L. Johnson, and C. LaBore. 2000. Interactive Pedagogical Drama. Proceedings of the Fourth International Conference on Autonomous Agents, 301-308, ACM Press.

3. André, E., Rist, T., van Mulken, S., Klesen, M., and Baldes, S. 2000. The Automated Design of Believable Dialogues for Animated Presentation Teams. In: Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. (eds.): Embodied Conversational Agents, 220-255, Cambridge, MA: MIT Press.

4. Cassell, J., Bickmore, T., Campbell, L., Vilhjalmsson, H., and Yan, H. 2000. Human Conversation as a System Framework: Designing Embodied Conversational Agents. In: Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. (eds.): Embodied Conversational Agents, Cambridge, MA: MIT Press.

5. Laird, J.E. 2001. It Knows What You're Going To Do: Adding Anticipation to a Quakebot. Proceedings of the Fifth International Conference on Autonomous Agents, 385-392, ACM Press.

6. Yoon, S., B. Blumberg, and G.E. Schneider. 2000. Motivation Driven Learning for Interactive Synthetic Characters. Proceedings of the Fourth International Conference on Autonomous Agents, 365-372, ACM Press.

7. Cassell, J. 2000. Nudge, Nudge, Wink, Wink: Elements of Face-to-Face Conversation for Embodied Conversational Agents. In: Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. (eds.): Embodied Conversational Agents, 1-27, Cambridge, MA: MIT Press.

8. Cassell, J., Stone, M., and Yan, H. 2000. Coordination and Context-Dependence in the Generation of Embodied Conversation. Proceedings of INLG 2000, Mitzpe Ramon, Israel.

9. Maybury, M.T. (ed.). 1993. Intelligent Multimedia Interfaces. AAAI Press, Menlo Park, CA.

10. O'Conaill, B. and S. Whittaker. 1997. Characterizing, Predicting, and Measuring Video-Mediated Communication: A Conversational Approach. In: Finn, K.E. and Sellen, A.J. (eds.): Video-Mediated Communication: Computers, Cognition, and Work, 107-131, Mahwah, NJ: Lawrence Erlbaum Associates.

11. Cassell, J., Vilhjálmsson, H., and Bickmore, T. 2001. BEAT: The Behavior Expression Animation Toolkit. Proceedings of SIGGRAPH 01, Los Angeles, CA.

12. Johnston, M., P.R. Cohen, et al. 1997. Unification-Based Multimodal Integration. Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL-97/EACL-97), Madrid, Spain.

13. Taylor, P., A. Black, et al. 1998. The Architecture of the Festival Speech Synthesis System. 3rd ESCA Workshop on Speech Synthesis, Jenolan Caves, Australia.

14. Pelachaud, C., Carofiglio, V., De Carolis, B., de Rosis, F., and Poggi, I. 2002. Embodied Agent in Information Delivering Application. Proceedings of Autonomous Agents and Multiagent Systems. To appear.


The Authors

Jonathan Gratch is a project leader for the stress and emotion project at the University of Southern California's Institute for Creative Technologies and is a research assistant professor in the Department of Computer Science. His research interests include cognitive science, emotion, planning, and the use of simulation in training. He received his PhD in computer science from the University of Illinois. Contact him at the USC Institute for Creative Technologies, 13274 Fiji Way, Marina del Rey, CA 90292; [email protected]; www.ict.usc.edu/~gratch.

Jeff Rickel is a project leader at the University of Southern California's Information Sciences Institute and a research assistant professor in the Department of Computer Science. His research interests include intelligent agents for education and training, especially animated agents that collaborate with people in virtual reality. He received his PhD in computer science from the University of Texas at Austin. He is a member of the ACM and the AAAI. Contact him at the USC Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292; [email protected]; www.isi.edu/~rickel.

Elisabeth André is a full professor at the University of Augsburg, Germany. Her research interests include embodied conversational agents, multimedia interfaces, and the integration of vision and natural language. She received her PhD from the University of Saarbrücken, Germany. Contact her at Lehrstuhl für Multimedia Konzepte und Anwendungen, Institut für Informatik, Universität Augsburg, Eichleitnerstr. 30, 86135 Augsburg, Germany; [email protected]; www.informatik.uni-augsburg.de/~lisa.

Justine Cassell is an associate professor at the MIT Media Lab, where she directs the Gesture and Narrative Language Research Group. Her research interests are bringing knowledge about verbal and nonverbal aspects of human conversation and storytelling to computational systems design, and the role that technologies—such as a virtual storytelling peer that has the potential to encourage children in creative, empowered, and independent learning—play in children's lives. She holds a master's degree in Literature from the Université de Besançon (France), a master's degree in Linguistics from the University of Edinburgh (Scotland), and a dual PhD from the University of Chicago, in Psychology and in Linguistics. Contact her at [email protected].

Eric Petajan is chief scientist and founder of face2face animation, inc. and chaired the MPEG-4 Face and Body Animation (FBA) group. Prior to forming face2face, Eric was a Bell Labs researcher, where he developed facial motion capture, HD video coding, and interactive graphics systems. Starting in 1989, Eric was a leader in the development of HDTV technology and standards leading up to the US HDTV Grand Alliance. He received a PhD in EE in 1984 and an MS in Physics from the University of Illinois, where he built the first automatic lipreading system. Eric is also associate editor of the IEEE Transactions on Circuits and Systems for Video Technology. Contact him at xxx.

Norman Badler is a Professor of Computer and Information Science at the University of Pennsylvania and has been on that faculty since 1974. Active in computer graphics since 1968, his research focuses on human figure modeling, real-time animation, embodied agent software, and computational connections between language and action. He received a BA in Creative Studies Mathematics from the University of California at Santa Barbara in 1970, an MSc in Mathematics in 1971, and a PhD in Computer Science in 1975, both from the University of Toronto. He directs the Center for Human Modeling and Simulation and is the Associate Dean for Academic Affairs in the School of Engineering and Applied Science at the University of Pennsylvania. Contact him at [email protected].


15. Lester, J.C., S.G. Towns, C. Callaway, J.L. Voerman, and P.J. FitzGerald. 2000. Deictic and Emotive Communication in Animated Pedagogical Agents. In: Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. (eds.): Embodied Conversational Agents, 123-154, Cambridge, MA: MIT Press.

16. Ortony, A., Clore, G.L., and Collins, A. 1988. The Cognitive Structure of Emotions. Cambridge: Cambridge University Press.

17. Marsella, S. and Gratch, J. 2002. A Step Towards Irrationality: Using Emotion to Change Belief. Proceedings of the 1st International Joint Conference on Autonomous Agents and Multi-Agent Systems, Bologna, Italy. To appear.

18. El-Nasr, M.S., Ioerger, T.R., and Yen, J. 1999. Peteei: A Pet with Evolving Emotional Intelligence. Proceedings of the 3rd International Conference on Autonomous Agents, Seattle, Washington, 9-15, New York: ACM Press.

19. Marsella, S. and Gratch, J. 2001. Modeling the Interplay of Emotions and Plans in Multi-Agent Simulations. Proceedings of the 23rd Annual Conference of the Cognitive Science Society, Edinburgh, Scotland.

20. Sloman, A. 1990. Motives, Mechanisms and Emotions. In: Boden, M.A. (ed.): The Philosophy of Artificial Intelligence, Oxford Readings in Philosophy, 231-247, Oxford University Press.

21. Gratch, J. 2000. Emile: Marshalling Passions in Training and Education. Proceedings of the 4th International Conference on Autonomous Agents, Barcelona, Spain.

22. Collier, G. 1985. Emotional Expression. Hillsdale, NJ: Lawrence Erlbaum.

23. Ball, G. and J. Breese. 2000. Emotion and Personality in a Conversational Character. In: Cassell, J., Sullivan, J., Prevost, S., and Churchill, E. (eds.): Embodied Conversational Agents, 189-219, Cambridge, MA: MIT Press.

24. Chi, D., Costa, M., Zhao, L., and Badler, N. 2000. The EMOTE Model for Effort and Shape. ACM SIGGRAPH 00, New Orleans, LA, 173-182.

25. Schröder, M. 2001. Emotional Speech Synthesis: A Review. Proceedings of Eurospeech 2001, Aalborg, Vol. 1, 561-564.

26. Hovy, E. 1987. Some Pragmatic Decision Criteria in Generation. In: Kempen, G. (ed.): Natural Language Generation, 3-17, Dordrecht: Martinus Nijhoff Publishers.

27. Walker, M., J. Cahn, and S.J. Whittaker. 1997. Improving Linguistic Style: Social and Affective Bases for Agent Personality. Proceedings of Autonomous Agents '97, 96-105, Marina del Rey, CA: ACM Press.

28. Moffat, D. 1997. Personality Parameters and Programs. In: Trappl, R. and Petta, P. (eds.): Creating Personalities for Synthetic Actors, 120-165, New York: Springer.

29. Prendinger, H. and Ishizuka, M. 2001. Social Role Awareness in Animated Agents. Proceedings of the Fifth Conference on Autonomous Agents, 270-377, New York: ACM Press.

30. Paiva, A., André, E., Arafa, Y., Costa, M., Figueiredo, P., Gebhard, P., Höök, K., Mamdani, A., Martino, C., Mourao, D., Petta, P., Sengers, P., and Val, M. 2001. Satira: Supporting Affective Interactions in Real-Time Applications. CAST 2001: Living in Mixed Realities, Bonn.

31. Gleicher, M. 2001. Comparing Constraint-Based Motion Editing Methods. Graphical Models 63(2), 107-134.

32. Badler, N., Phillips, C., and Webber, B. 1993. Simulating Humans: Computer Graphics, Animation, and Control. New York: Oxford University Press.

33. Raibert, M. and Hodgins, J. 1991. Animation of Dynamic Legged Locomotion. ACM SIGGRAPH 25(4), 349-358.

34. Tolani, D., Goswami, A., and Badler, N. 2000. Real-Time Inverse Kinematics Techniques for Anthropomorphic Limbs. Graphical Models 62(5), 353-388.

35. Lewis, J., Cordner, M., and Fong, N. 2000. Pose Space Deformations: A Unified Approach to Shape Interpolation and Skeleton-Driven Deformation. ACM SIGGRAPH, 165-172.

36. Sloan, P., Rose, C., and Cohen, M. 2001. Shape by Example. Symposium on Interactive 3D Graphics, March 2001.

37. Baraff, D. and Witkin, A. 1998. Large Steps in Cloth Simulation. ACM SIGGRAPH, 43-54.

38. Magnenat-Thalmann, N. and Volino, P. 1999. Dressing Virtual Humans. Morgan Kaufmann Publishers.

39. Perlin, K. 1995. Real Time Responsive Animation with Personality. IEEE Transactions on Visualization and Computer Graphics, 1(1): 5-15.

40. Byun, M. and Badler, N. 2002. FacEMOTE: Qualitative Parametric Modifiers for Facial Animations. Symposium on Computer Animation, July 2002.

41. Brand, M. and Hertzmann, A. 2000. Style Machines. ACM SIGGRAPH, 183-192.

42. Cosatto, E. and Graf, H.P. 2000. Photo-Realistic Talking-Heads from Image Samples. IEEE Transactions on Multimedia 2(3): 152-163.

43. Ezzat, T., Geiger, G., and Poggio, T. 2002. Trainable Videorealistic Speech Animation. Proceedings of ACM SIGGRAPH 2002, San Antonio, Texas. To appear.

44. Terzopoulos, D. and Waters, K. 1993. Analysis and Synthesis of Facial Images Using Physical and Anatomical Models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6), 569-579.

45. Parke, F.I. and Waters, K. 1994. Computer Facial Animation. AK Peters. ISBN 1-56881-014-8.

46. Capin, T.K., E. Petajan, and J. Ostermann. 2000. Efficient Modeling of Virtual Humans in MPEG-4. IEEE International Conference on Multimedia and Expo (II): 1103-1106.

47. Capin, T.K., E. Petajan, and J. Ostermann. 2000. Very Low Bitrate Coding of Virtual Human Animation in MPEG-4. IEEE International Conference on Multimedia and Expo (II): 1107-1110.

48. Kovar, L., M. Gleicher, and F. Pighin. 2002. Motion Graphs. Proceedings of ACM SIGGRAPH 2002.
