
Active Vision for Sociable Robots

Cynthia Breazeal, Aaron Edsinger, Paul Fitzpatrick, and Brian Scassellati

Abstract— In 1991, Ballard [1] described the implications of having a visual system that could actively position the camera coordinates in response to physical stimuli. In humanoid robotic systems, or in any animate vision system that interacts with people, social dynamics provide additional levels of constraint and provide additional opportunities for processing economy. In this paper, we describe an integrated visual-motor system that has been implemented on a humanoid robot to negotiate the robot's physical constraints, the perceptual needs of the robot's behavioral and motivational systems, and the social implications of motor acts.

Keywords— Active vision, robots, social interaction with humans.

I. Introduction

Animate vision introduces requirements for real-time processing, removes simplifying assumptions of static camera systems, and presents opportunities for simplifying computation. This simplification arises through situating perception in a behavioral context, by providing for opportunities to learn flexible behaviors, and by allowing the exploitation of dynamic regularities of the environment [1]. These benefits have been of critical interest to a variety of humanoid robotics projects [2], [3], [4], [5], and to the robotics and AI communities as a whole. On a practical level, the vast majority of these systems are still limited by the complexities of perception and thus focus on a single aspect of animate vision or concentrate on the integration of two well-known systems. On a theoretical level, existing systems often do not benefit from the advantages that Ballard proposed because of their limited scope.

In humanoid robotics, these problems are particularly evident. Animate vision systems that provide only a limited set of behaviors (such as supporting only smooth pursuit tracking) or that provide behaviors on extremely limited perceptual inputs (such as systems that track only very bright light sources) fail to provide a natural interaction between human and robot. We propose that in order to allow realistic human-machine interactions, an animate vision system must address a set of social constraints in addition to the other issues that classical active vision has addressed.

It is useful to view social constraints not as limitations, but as opportunities. They induce a natural "vocabulary" that makes the robot's behavior and state readable by a human. This in turn can provide a framework for the robot to negotiate a change in a human's behavior. For example, Section XI-A discusses a procedure the robot can use to control the "inter-personal" distance a human assigns to it. Having this control means that in social situations the robot can get quite far with a simple vision system that is tuned to a particular distance.

Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), Cambridge, MA. E-mail: {cynthia, edsinger, paulfitz, scaz}@ai.mit.edu


II. Social constraints

For robots and humans to interact meaningfully, it is important that they understand each other enough to be able to shape each other's behavior. This has several implications. One of the most basic is that robot and human should have at least some overlapping perceptual abilities. Otherwise, they can have little idea of what the other is sensing and responding to. Vision is one important sensory modality for human interaction, and it is the one we focus on in this article. We endow our robots with visual perception that is human-like in its physical implementation.

Fig. 1. Kismet, a robot capable of conveying intentionality through facial expressions and behavior. Here, the robot's physical state expresses attention to and interest in the human beside it. Another person, for example the photographer, would expect to have to attract the robot's attention before being able to influence its behavior.

Similarity of perception requires more than similarity of sensors. Not all sensed stimuli are equally behaviorally relevant. It is important that both human and robot find the same types of stimuli salient in similar conditions. Our robots have a set of perceptual biases based on the human pre-attentive visual system. These biases can be modulated by the motivational state of the robot, making later perceptual stages more behaviorally relevant. This approximates the top-down influence of motivation on the bottom-up pre-attentive process found in human vision.

Visual perception requires high bandwidth and is computationally demanding. In the early stages of human vision, the entire visual field is processed in parallel. Later computational steps are applied much more selectively, so that behaviorally relevant parts of the visual field can be processed in greater detail. This mechanism of visual attention is just as important for robots as it is for humans, from the same considerations of resource allocation. The existence of visual attention is also key to satisfying the expectations of humans concerning what can and cannot be perceived visually. We have implemented a context-dependent attention system that goes some way towards this.

Human eye movements have a high communicative value. For example, gaze direction is a good indicator of the locus of visual attention. Knowing a person's locus of attention reveals what that person currently considers behaviorally relevant, which is in turn a powerful clue to their intent. The dynamic aspects of eye movement, such as staring versus glancing, also convey information. Eye movements are particularly potent during social interactions, such as conversational turn-taking, where making and breaking eye contact plays an important role in regulating the exchange. We model the eye movements of our robots after humans, so that they may have similar communicative value.

Our hope is that by following the example of the human visual system, the robot's behavior will be easily understood because it is analogous to the behavior of a human in similar circumstances (see Figure 1). For example, when an anthropomorphic robot moves its eyes and neck to orient toward an object, an observer can effortlessly conclude that the robot has become interested in that object. These traits not only lead to behavior that is easy to understand but also allow the robot's behavior to fit into the social norms that the person expects.

There are other advantages to modeling our implementation after the human visual system. There is a wealth of data and proposed models for how the human visual system is organized. This data provides not only a modular decomposition but also mechanisms for evaluating the performance of the complete system. Another advantage is robustness. A system that integrates action, perception, attention, and other cognitive capabilities can be more flexible and reliable than a system that focuses on only one of these aspects. Adding additional perceptual capabilities and additional constraints between behavioral and perceptual modules can increase the relevance of behaviors while limiting the computational requirements [6]. For example, in isolation, two difficult problems for a visual tracking system are knowing what to track and knowing when to switch to a new target. These problems can be simplified by combining the tracker with a visual attention system that can identify objects that are behaviorally relevant and worth tracking. In turn, the tracking system benefits the attention system by maintaining the object of interest in the center of the visual field, which simplifies the computation necessary to implement behavioral habituation. The two modules work in concert, each compensating for the deficiencies of the other and limiting the computation required in each.

III. Physical form

Currently, the most sophisticated of our robots in terms of visual-motor behavior is Kismet. This robot is an active vision head augmented with expressive facial features (see Figure 2). Kismet is designed to receive and send human-like social cues to a caregiver, who can regulate its environment and shape its experiences as a parent would for a child. Kismet has three degrees of freedom to control gaze direction, three degrees of freedom to control its neck, and fifteen degrees of freedom in other expressive components of the face (such as ears and eyelids). To perceive its caregiver Kismet uses a microphone, worn by the caregiver, and four color CCD cameras. The positions of the neck and eyes are important both for expressive postures and for directing the cameras towards behaviorally relevant stimuli.

Fig. 2. Kismet has a large set of expressive features: eyelids, eyebrows, ears, jaw, lips, neck and eye orientation. The schematic on the right shows the degrees of freedom relevant to visual perception (omitting the eyelids): eye tilt, independent left and right eye pan, neck tilt, neck pan, and neck lean. The eyes can turn independently along the horizontal (pan), but turn together along the vertical (tilt). The neck can turn the whole head horizontally and vertically, and can also crane forward. Two cameras with narrow "foveal" fields of view rotate with the eyes. Two central cameras with wide fields of view rotate with the neck; these cameras are unaffected by the orientation of the eyes.

The cameras in Kismet's eyes have high acuity but a narrow field of view. Between the eyes, there are two unobtrusive central cameras fixed with respect to the head, each with a wider field of view but correspondingly lower acuity. The reason for this mixture of cameras is that typical visual tasks require both high acuity and a wide field of view. High acuity is needed for recognition tasks and for controlling precise visually guided motor movements. A wide field of view is needed for search tasks, for tracking multiple objects, for compensating for involuntary ego-motion, etc. A common trade-off found in biological systems is to sample part of the visual field at a high enough resolution to support the first set of tasks, and to sample the rest of the field at an adequate level to support the second set. This is seen in animals with foveate vision, such as humans, where the density of photoreceptors is highest at the center and falls off dramatically towards the periphery. This can be implemented by using specially designed imaging hardware [7], space-variant image sampling [8], or multiple cameras with different fields of view, as we have done.

The designs of our robots are constantly evolving. New degrees of freedom are added, old degrees of freedom are reorganized, sensors are replaced or rearranged, and new sensory modalities are introduced. The descriptions given here should be treated as a fleeting snapshot of the current state of the robots. Our hardware and software control architectures have been designed to meet the challenge of real-time processing of visual signals (approaching 30 Hz) with minimal latencies. Kismet's vision system is implemented on a network of nine 400 MHz commercial PCs running the QNX real-time operating system. Kismet's motivational system runs on a collection of four Motorola 68332 processors. Machines running Windows NT and Linux are also networked for speech generation and recognition, respectively. Even more so than Kismet's physical form, the control network is rapidly evolving as new behaviors and sensory modalities come on line.

IV. Levels of visual behavior

Visual behavior can be conceptualized on four different levels (as shown in Figure 3). These levels correspond to the social level, the behavior level, the skills level, and the primitives level. This decomposition is motivated by distinct temporal, perceptual, and interaction constraints that exist at each level. The temporal constraints pertain to how fast the motor acts must be updated and executed. These can range from real-time vision rates (30 Hz) to the relatively slow time scale of social interaction (potentially transitioning over minutes). The perceptual constraints pertain to what level of sensory feedback is required to coordinate behavior at that layer. This perceptual feedback can range from low-level visual processes, such as the current target from the attention system, to relatively high-level multi-modal percepts generated by the behavioral releasers. The interaction constraints pertain to the arbitration of the units that compose each layer. This can range from low-level oculomotor primitives (such as saccades and smooth pursuit) to using visual behavior to regulate human-robot turn-taking.

Fig. 3. Levels of behavioral organization. The primitive level is populated with tightly coupled sensorimotor loops. The skill level contains modules that coordinate primitives to achieve tasks. Behavior level modules deal with questions of relevance, persistence and opportunism in the arbitration of tasks. The social level comprises design-time considerations of how the robot's behaviors will be interpreted and responded to in a social environment.

Each level serves a particular purpose in generating the overall observed behavior. As such, each level must address a specific set of issues. The levels of abstraction help simplify the overall control of visual behavior by restricting each level to the core issues that are best managed at that level. By doing so, the coordination of visual behavior at each level (i.e., arbitration), between the levels (i.e., top-down and bottom-up), and through the world is maintained in a principled way.

• The social level: The social level explicitly deals with issues pertaining to having a human in the interaction loop. This requires careful consideration of how the human interprets and responds to the robot's behavior in a social context. Using visual behavior (making and breaking eye contact) to help regulate the transition of speaker turns during vocal turn-taking is an example.
• The behavior level: The behavior level deals with issues related to producing relevant, appropriately persistent, and opportunistic behavior. This involves arbitrating between the many possible goal-achieving behaviors that the robot could perform to establish the current task. Actively seeking out a desired stimulus and then visually engaging it is an example.
• The motor skill level: The motor skill level is responsible for figuring out how to move the motors to accomplish the current task. Fundamentally, this level deals with blending of, and sequencing between, coordinated ensembles of motor primitives (each ensemble is a distinct motor skill). The skills level must also coordinate multi-modal motor skills (e.g., those motor skills that combine speech, facial expression, and body posture). A fixed action pattern such as a searching behavior is an example, in which the robot alternates ballistic eye-neck orientation movements with gaze fixation on the most salient target. The ballistic movements are important for scanning the scene, and the fixation periods are important for locking on to the desired type of stimulus.
• The motor primitives level: The motor primitives level implements the building blocks of motor action. This level must deal with motor resource allocation and tightly coupled sensorimotor loops. For example, gaze stabilization must take sensory stimuli and produce motor commands in a very tight feedback loop. Kismet actually has four distinct motor systems at the primitives level: the affective vocal system, the facial expression system, the oculomotor system, and the body posturing system. Because this paper focuses on visual behavior, we only discuss the oculomotor system here.

We describe these levels in detail as they pertain to Kismet's visual behavior. We begin at the lowest level, the motor primitives pertaining to vision, and progress to the highest level, where we discuss the social constraints of animate vision.

V. Visual motor primitives

Kismet's visual-motor control is modeled after the human oculomotor system. The human system is so good at providing a stable percept of the world that we have no intuitive appreciation of the physical constraints under which it operates. Humans have foveate vision. The fovea (the center of the retina) has a much higher density of photoreceptors than the periphery. This means that to see an object clearly, humans must move their eyes such that the image of the object falls on the fovea. Human eye movement is not smooth. It is composed of many quick jumps, called saccades, which rapidly re-orient the eye to project a different part of the visual scene onto the fovea. After a saccade, there is typically a period of fixation, during which the eyes are relatively stable. They are by no means stationary, however, and continue to engage in corrective micro-saccades and other small movements. If the eyes fixate on a moving object, they can follow it with a continuous tracking movement called smooth pursuit. This type of eye movement cannot be evoked voluntarily, but only occurs in the presence of a moving object. Periods of fixation typically end after some hundreds of milliseconds, after which a new saccade will occur [9].

Fig. 4. Humans exhibit four characteristic types of eye motion. Saccadic movements are high-speed ballistic motions that center a target in the field of view. Smooth pursuit movements are used to track a moving object at low velocities. The vestibulo-ocular and opto-kinetic reflexes act to maintain the angle of gaze as the head and body move through the world. Vergence movements serve to maintain an object in the center of the field of view of both eyes as the object moves in depth.

The eyes normally move in lock-step, making equal, conjunctive movements. For a close object, the eyes need to turn towards each other somewhat to correctly image the object on the foveae of the two eyes. These disjunctive movements are called vergence, and rely on depth perception (see Figure 4). Since the eyes are located on the head, they need to compensate for any head movements that occur during fixation. The vestibulo-ocular reflex uses inertial feedback from the vestibular system to keep the orientation of the eyes stable as the head moves. This is a very fast response, but it is prone to the accumulation of error over time. The opto-kinetic response is a slower compensation mechanism that uses a measure of the visual slip of the image across the retina to correct for drift. These two mechanisms work together to give humans stable gaze as the head moves.
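As a rough illustration of how these two mechanisms can be combined in a control loop, the sketch below adds a fast inertial term to a slow visual-slip term to produce a stabilizing eye velocity. The gains and sign conventions are assumptions, not values from Kismet (whose inertial analogue, as noted later, is not yet running on this robot).

```python
def stabilizing_eye_velocity(head_velocity, visual_slip, k_vor=1.0, k_okr=0.2):
    """Combine a fast inertial term with a slow visual-slip term to keep gaze stable.

    head_velocity: angular velocity of the head from an inertial sensor (rad/s).
    visual_slip:   measured drift of the image across the camera (rad/s).

    Counter-rotating the eye against head motion cancels fast disturbances
    (vestibulo-ocular analogue); the slip term slowly removes the drift that the
    inertial path accumulates (opto-kinetic analogue). Gains are illustrative.
    """
    return -k_vor * head_velocity - k_okr * visual_slip
```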

Our implementation of an oculomotor system is an approximation of the human system. The motor primitives are organized around the needs of higher levels, such as maintaining and breaking mutual regard, performing visual search, etc. Since our motor primitives are tightly bound to visual attention, we will first discuss their sensory component.

VI. Pre-attentive visual perception

Human infants and adults naturally find certain perceptual features interesting. Features such as color, motion, and face-like shapes are very likely to attract our attention [10]. We have implemented a variety of perceptual feature detectors that are particularly relevant to interacting with people and objects. These include low-level feature detectors attuned to quickly moving objects, highly saturated color, and colors representative of skin tones. Examples of features we have used are shown in Figure 5. Looming objects are also detected pre-attentively, to facilitate a fast reflexive withdrawal.

Fig. 5. Overview of the attention system. The robot's attention is determined by a combination of low-level perceptual stimuli. The relative weightings of the stimuli are modulated by high-level behavioral and motivational influences. A sufficiently salient stimulus in any modality can pre-empt attention, similar to the human response to sudden motion. All else being equal, larger objects are considered more salient than smaller ones. The design is intended to keep the robot responsive to unexpected events, while avoiding making it a slave to every whim of its environment. With this model, people intuitively provide the right cues to direct the robot's attention (shake object, move closer, wave hand, etc.). Displayed images were captured during a behavioral trial session.

A. Color saliency feature map

One of the most basic and widely recognized visual features is color. Our models of color saliency are drawn from the complementary work on visual search and attention of Itti, Koch, and Niebur [11]. The incoming video stream contains three 8-bit color channels (r, g, and b), which are transformed into four color-opponency channels (r', g', b', and y'). Each input color channel is first normalized by the luminance l (a weighted average of the three input color channels):

    r_n = \frac{255}{3}\,\frac{r}{l}, \qquad g_n = \frac{255}{3}\,\frac{g}{l}, \qquad b_n = \frac{255}{3}\,\frac{b}{l}    (1)

These normalized color channels are then used to produce four opponent-color channels:

    r' = r_n - \frac{g_n + b_n}{2}    (2)

    g' = g_n - \frac{r_n + b_n}{2}    (3)

    b' = b_n - \frac{r_n + g_n}{2}    (4)

    y' = \frac{r_n + g_n}{2} - b_n - \lvert r_n - g_n \rvert    (5)

The four opponent-color channels are clamped to 8-bit values by thresholding. While some research seems to indicate that each color channel should be considered individually [10], we choose to maintain all of the color information in a single feature map to simplify the processing requirements (as does Wolfe [12], for more theoretical reasons).
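To make the arithmetic of equations (1)-(5) concrete, the following NumPy sketch computes the four opponent-color maps from an RGB frame. It is an illustration rather than the robot's code; the simple-average luminance and the clamp to [0, 255] are assumptions based on the description above.

```python
import numpy as np

def opponent_color_channels(rgb):
    """Compute luminance-normalized opponent-color maps from an 8-bit RGB image.

    rgb: array of shape (H, W, 3) with channels r, g, b in [0, 255].
    Returns r', g', b', y' maps clamped to [0, 255], following Eqs. (1)-(5).
    """
    r, g, b = [rgb[..., i].astype(np.float32) for i in range(3)]
    l = (r + g + b) / 3.0 + 1e-6            # luminance (simple average; weights assumed)

    # Eq. (1): normalize each channel by the luminance.
    rn = (255.0 / 3.0) * r / l
    gn = (255.0 / 3.0) * g / l
    bn = (255.0 / 3.0) * b / l

    # Eqs. (2)-(5): opponent-color channels.
    r_op = rn - (gn + bn) / 2.0
    g_op = gn - (rn + bn) / 2.0
    b_op = bn - (rn + gn) / 2.0
    y_op = (rn + gn) / 2.0 - bn - np.abs(rn - gn)

    # Clamp to 8-bit range by thresholding, as described in the text.
    return [np.clip(c, 0, 255).astype(np.uint8) for c in (r_op, g_op, b_op, y_op)]
```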

B. Motion feature map

In parallel with the color saliency computations, a second processor receives input images from the frame grabber and computes temporal differences to detect motion. Motion detection is performed on the wide field-of-view camera, which is often at rest since it does not move with the eyes. The incoming image is converted to grayscale and placed into a ring of frame buffers. A raw motion map is computed by passing the absolute difference between consecutive images through a threshold function T:

    M_{raw} = T(\lvert I_t - I_{t-1} \rvert)    (6)

This raw motion map is then smoothed with a uniform 7 × 8 field. The result is a binary 2-D map in which regions corresponding to motion have a high intensity value. The motion saliency feature map is computed at 25-30 Hz by a single 400 MHz processor node. Figure 5 gives an example of the motion feature map when the robot looks at a toy block that is being shaken.
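A minimal sketch of the motion map of equation (6) is given below, assuming a grayscale frame stream. The ring-buffer length, the threshold value, and the use of a plain box filter for the uniform smoothing field are assumptions, not the values used on Kismet.

```python
import numpy as np
from collections import deque

FRAME_BUFFER = deque(maxlen=4)   # ring of recent grayscale frames (length assumed)

def motion_feature_map(gray, threshold=15):
    """Eq. (6): threshold the absolute difference of consecutive frames, then
    smooth the result with a uniform field to produce a motion saliency map."""
    FRAME_BUFFER.append(gray.astype(np.int16))
    if len(FRAME_BUFFER) < 2:
        return np.zeros_like(gray, dtype=np.uint8)

    # Raw motion: thresholded absolute frame difference.
    raw = (np.abs(FRAME_BUFFER[-1] - FRAME_BUFFER[-2]) > threshold).astype(np.float32)

    # Smooth with a uniform field (the paper uses a 7 x 8 field).
    kh, kw = 7, 8
    pad = np.pad(raw, ((kh // 2, kh // 2), (kw // 2, kw - kw // 2 - 1)))
    smooth = np.zeros_like(raw)
    for dy in range(kh):
        for dx in range(kw):
            smooth += pad[dy:dy + raw.shape[0], dx:dx + raw.shape[1]]
    smooth /= kh * kw

    return (smooth * 255).astype(np.uint8)
```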

C. Skin tone feature map

Colors consistent with skin are also filtered for. This is a computationally inexpensive means to rule out regions that are unlikely to contain faces or hands. Most pixels on faces will pass these tests over a wide range of lighting conditions and skin colors. Pixels that pass these tests are weighted according to a function learned from instances of skin tone in images taken by Kismet's cameras (see Figure 6). In this implementation, a pixel is not skin-toned if any of the following holds (a sketch of these tests follows the list):

• r < 1.1 g (the red component fails to dominate green sufficiently);
• r < 0.9 b (the red component is excessively dominated by blue);
• r > 2.0 max(g, b) (the red component completely dominates both blue and green);
• r < 20 (the red component is too low to give good estimates of ratios);
• r > 250 (the red component is too saturated to give a good estimate of ratios).
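The rejection tests above translate directly into a small predicate. The sketch below is illustrative only; it omits the learned weighting function that is applied to pixels that pass the tests.

```python
def is_skin_tone(r, g, b):
    """Return True if an (r, g, b) pixel passes the skin-tone tests listed above.
    A pixel is rejected when any of the five 'not skin-toned' conditions holds."""
    if r < 1.1 * g:            # red fails to dominate green sufficiently
        return False
    if r < 0.9 * b:            # red is excessively dominated by blue
        return False
    if r > 2.0 * max(g, b):    # red completely dominates both blue and green
        return False
    if r < 20:                 # too dark to estimate ratios reliably
        return False
    if r > 250:                # too saturated to estimate ratios reliably
        return False
    return True
```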

VII. Visual attention

We have implemented Wolfe's model of human visual search and attention [12]. Our implementation is similar to other models based in part on Wolfe's work [11], but it additionally operates in conjunction with motivational and behavioral models, works with moving cameras, and addresses the issue of habituation.

Fig. 6. The skin tone filter responds to 4.7% of possible (R, G, B) values. Each grid in the figure to the left shows the response of the filter to all values of red and green for a fixed value of blue. The image to the right shows the filter in operation. Typical indoor objects that may also be consistent with skin tone include wooden doors, cream walls, etc.

The attention process acts in two parts. The low-level feature detectors discussed in the previous section are combined through a weighted average to produce a single attention map. This combination allows the robot to select regions that are visually salient and to direct its computational and behavioral resources towards those regions. The attention system also integrates influences from the robot's internal motivational and behavioral systems to bias the selection process. For example, if the robot's current goal is to interact with people, the attention system is biased toward objects that have colors consistent with skin tone. The attention system also has mechanisms for habituating to stimuli, thus providing the robot with a primitive attention span. The state of the attention system is usually reflected in the robot's gaze direction, unless there are behavioral reasons for this not to be the case. The attention system runs all the time, even when it is not controlling gaze, since it determines the perceptual input to which the motivational and behavioral systems respond.
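A rough sketch of how such a weighted combination might look is given below; the feature names, the gain values, and the normalization by the number of maps are illustrative assumptions rather than Kismet's actual parameters.

```python
import numpy as np

def attention_map(feature_maps, gains):
    """Combine low-level feature maps into a single attention activation map
    via a weighted average.

    feature_maps: dict mapping a name (e.g. 'color', 'motion', 'skin') to a
                  2-D saliency array.
    gains:        dict mapping the same names to scalar weights set by the
                  active behavior.
    """
    total = np.zeros_like(next(iter(feature_maps.values())), dtype=np.float32)
    for name, fmap in feature_maps.items():
        total += gains.get(name, 1.0) * fmap.astype(np.float32)
    return total / len(feature_maps)

# Example gain setting: a 'seek people' behavior might raise the skin-tone gain
# and lower the color gain, biasing attention toward faces and hands.
gains_seek_people = {"skin": 1.5, "color": 0.5, "motion": 1.0}
```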

A. Task-based influences on attention

For a goal-achieving creature, the behavioral state should also bias what the creature attends to next. For instance, when performing visual search, humans seem to be able to preferentially select the output of one broadly tuned channel per feature (e.g., "red" for color and "shallow" for orientation if searching for red horizontal lines).

In our system these top-down, behavior-driven factors modulate the output of the individual feature maps before they are summed to produce the bottom-up contribution. This process selectively enhances or suppresses the contribution of certain features, but does not alter the underlying raw saliency of a stimulus. To implement this, the bottom-up results of each feature map are passed through a filter (effectively a gain). The value of each gain is determined by the active behavior. For instance, as shown in Figure 7, the skin tone gain is enhanced when the seek people behavior is active and is suppressed when the avoid people behavior is active. Similarly, the color gain is enhanced when the seek toys behavior is active, and suppressed when the avoid toys behavior is active.


Fig. 7. Schematic of behaviors relevant to attention. The activation of a particular behavior depends on both perceptual and motivational factors. The perceptual factors come from post-attentive processing of the target stimulus into behaviorally relevant percepts. The drives within the motivation system have an indirect influence on attention by influencing the behavioral context. The behaviors at Level 1 of the behavior system directly manipulate the gains of the attention system to benefit their goals. Through behavior arbitration, only one of these behaviors is active at any time. These behaviors are further elaborated in deeper levels of the behavior system.

Whenever the engage people or engage toys behaviors are active, the face and color gains are restored to their default values, respectively.

Fig. 8. Effect of gain adjustment on looking preference. Circles correspond to fixation points, sampled at one second intervals. On the left, the gain of the skin tone filter is higher: the robot spends more time looking at the face in the scene (86% face, 14% block). This bias occurs despite the fact that the face is dwarfed by the block in the visual scene. On the right, the gain of the color saliency filter is higher: the robot now spends more time looking at the brightly colored block (28% face, 72% block).

These modulated feature maps are then summed to compute the overall attention activation map, thus biasing attention in a way that facilitates achieving the goal of the active behavior. For example, if the robot is searching for social stimuli, it becomes more sensitive to skin tone and less sensitive to color. Behaviorally, the robot may encounter toys in its search, but will continue until a skin-toned stimulus is found (often a person's face). Figure 8 shows the results of two such experiments. The left figure shows a looking preference for a person, despite a lesser "raw" saliency, when the robot is seeking out people. Conversely, when the robot is actively searching for a toy, it demonstrates a looking preference for the colorful block despite the dominant presence of a person's face in the visual field.

B. Habituation effects

To build a believable creature, the attention system must also implement habituation effects. Infants respond strongly to novel stimuli, but soon habituate and respond less as familiarity increases. This acts both to keep the infant from being continually fascinated with any single object and to force the caretaker to continually engage the infant with slightly new and interesting interactions. For a robot, a habituation mechanism removes the effects of highly salient background objects that are not currently involved in direct interactions, as well as placing requirements on the caretaker to maintain interaction with slightly novel stimulation.

To implement habituation effects, a habituation filter is applied to the activation map over the location currently being attended to. The habituation filter effectively decays the activation level of the currently attended location, making other locations of lesser activation bias attention more strongly.
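One simple way to realize such a habituation filter is to decay activation in a neighborhood of the attended location, as in the sketch below; the decay factor and neighborhood radius are assumptions rather than Kismet's parameters.

```python
import numpy as np

def apply_habituation(activation, attended_xy, decay=0.95, radius=30):
    """Decay activation in a neighborhood of the currently attended location, so
    that other, less-activated regions gradually win the attention competition.

    activation:  2-D attention activation map.
    attended_xy: (row, col) of the currently attended location.
    """
    h, w = activation.shape
    y, x = attended_xy
    yy, xx = np.ogrid[:h, :w]
    mask = (yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2

    out = activation.astype(np.float32).copy()
    out[mask] *= decay          # repeated application habituates the attended region
    return out
```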

C. Consistency of attention

In the presence of objects of similar salience, it is useful to be able to commit attention to one of the objects for a period of time. This gives time for post-attentive processing to be carried out on the object, and for downstream processes to organize themselves around the object. As soon as a decision is made that the object is not behaviorally relevant (for example, it may lack eyes, which are searched for post-attentively), attention can be withdrawn from it and visual search may continue. Committing to an object is also useful for behaviors that need to be atomically applied to a target (for example, a calling behavior where the robot needs to keep looking at the person it is calling).

To allow such commitment, the attention system is augmented with a tracker. The tracker follows a target in the visual field, using simple correlation between successive frames. Usually changes in the tracker target will be reflected in movements of the robot's eyes, unless this is behaviorally inappropriate. If the tracker loses the target, it has a very good chance of being able to reacquire it from the attention system.
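The tracker itself can be as simple as matching a small patch around the previous target position against a search window in the next frame. The sketch below uses a sum-of-squared-differences match as a stand-in for the correlation described above; patch and window sizes are assumptions, and the target is assumed to lie away from the image border.

```python
import numpy as np

def track_by_correlation(prev_frame, cur_frame, target_xy, patch=16, search=24):
    """Re-locate a target by matching a patch around its previous position
    against a search window in the current frame (grayscale 2-D arrays)."""
    x0, y0 = target_xy
    tpl = prev_frame[y0 - patch:y0 + patch, x0 - patch:x0 + patch].astype(np.float32)

    best, best_xy = -np.inf, target_xy
    for dy in range(-search, search + 1, 2):
        for dx in range(-search, search + 1, 2):
            y, x = y0 + dy, x0 + dx
            win = cur_frame[y - patch:y + patch, x - patch:x + patch].astype(np.float32)
            if win.shape != tpl.shape:
                continue                       # skip positions that fall off the image
            score = -np.sum((win - tpl) ** 2)  # SSD as a simple correlation surrogate
            if score > best:
                best, best_xy = score, (x, y)

    # The caller can fall back to the attention system if the best match is poor.
    return best_xy
```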

D. Experiments with directing attention

The overall attention system runs at 20 Hz on several 400 MHz processors. In Section VII-A, we presented Kismet's looking preference results with respect to directing its attention to task-relevant stimuli. In this section, we examine how easy it is for people to direct Kismet's attention to a specific target stimulus, and to determine when they have been successful in doing so.

Three naive subjects were invited to interact with Kismet. The subjects ranged in age from 25 to 28 years old. All used computers frequently but were not computer scientists by training. All interactions were video-recorded.


The robot's attention gains were set to their default values so that there would be no strong preference for one saliency feature over another. The subjects were asked to direct the robot's attention to each of the target stimuli. There were seven target stimuli used in the study: three were saturated color stimuli, three were skin-toned stimuli, and the last was a pure motion stimulus. Each target stimulus was used more than once per subject. The stimuli were:

• a highly saturated colorful block;
• a bright yellow stuffed dinosaur with multi-color spines;
• a bright green cylinder;
• a bright pink cup (which is actually detected by the skin tone feature map);
• the person's face;
• the person's hand;
• a black and white plush cow (which is only salient when moving).

The video was later analyzed to determine which cues the subjects used to attract the robot's attention, which cues they used to determine when they had been successful, and the length of time required to do so. They were also interviewed at the end of the session about which cues they used, which cues they read, and about how long they thought it took to direct the robot's attention. The results are summarized in Table I.

TABLE I
Summary from attention manipulation interactions.

Stimulus              Category                 Presentations   Average time (s)
multi-colored block   color & movement         8               6.5
yellow dinosaur       color & movement         8               8.5
green cylinder        color & movement         8               6.0
face                  skin-toned & movement    8               3.5
hand                  skin-toned & movement    8               5.0
pink cup              skin-toned & movement    8               6.5
b/w cow               motion only              8               5.0
Total                                          56              5.8

Commonly used cues: shaking motion; bringing the target close to the robot; motion across the centerline.
Commonly read cues: eye behavior, especially tracking; facial expression, especially raised eyebrows; body posture, especially forward lean or withdrawal.

To attract the robot's attention, the most frequently used cues included bringing the target close to and in front of the robot's face, shaking the object of interest, or moving it slowly across the centerline of the robot's face. Each cue increases the saliency of a stimulus by making it appear larger in the visual field, or by supplementing the color or skin-tone cue with motion. Note that there was an inherent competition between the saliency of the target and that of the subject's own face, as both could be visible from the wide field-of-view camera. If the subject did not try to direct the robot's attention to the target, the robot tended to look at the subject's face.

The subjects also effortlessly determined when they had successfully re-directed the robot's gaze. Interestingly, it is not sufficient for the robot to orient to the target. People look for a change in visual behavior, from ballistic orientation movements to smooth pursuit movements, before concluding that they have successfully re-directed the robot's attention. All subjects reported that eye movement was the most relevant cue for determining whether they had successfully directed the robot's attention. They all reported that it was easy to direct the robot's attention to the desired target. They estimated the mean time to direct the robot's attention at 5 to 10 seconds. This turns out to be the case; the mean time over all trials and all targets is 5.8 seconds.

VIII. Post-attentive processing

Once the attention system has selected regions of the visual field that are potentially behaviorally relevant, more intensive computation can be applied to these regions than could be applied across the whole field. Searching for eyes is one such task. Locating eyes is important for engaging in eye contact, and as a reference point for interpreting facial movements and expressions. We currently search for eyes after the robot directs its gaze to a locus of attention, so that a relatively high resolution image of the area being searched is available from the foveal cameras (see Figure 9). Once the target of interest has been selected, we also estimate its proximity to the robot using a stereo match between the two central wide cameras. Proximity is important for interaction, as things closer to the robot should be of greater interest. It is also useful for interaction at a distance, such as when a person stands too far away for face-to-face interaction but close enough to be beckoned closer. Clearly the relevant behavior (beckoning or playing) depends on the proximity of the human to the robot.

Fig. 9. Eyes are searched for within a restricted part of the robot's field of view. The eye detector actually looks for the region between the eyes. It has adequate performance over a limited range of distances and face orientations.

Eye-detection in a real-time, robotic domain is computationally expensive and prone to error due to the large variance in head posture, lighting conditions, and feature scales. We developed an approach based on successive feature extraction, combined with some inherent domain constraints, to achieve a robust and fast eye-detection system for Kismet. First, a set of feature filters are applied successively to the image in increasing feature granularity. This serves to reduce the computational overhead while maintaining a robust system. The successive filter stages are:

• detect skin-colored patches in the image (abort if the response does not pass above threshold);
• scan the image for ovals and characterize their skin tone to find a potential face;
• extract a sub-image of the oval and run a ratio template [13], [14] over it to find candidate eye locations;
• for each candidate eye location, run a pixel-based multi-layer perceptron on the region; the perceptron is previously trained to recognize shading patterns characteristic of the eyes and bridge of the nose.

At each stage, the set of possible eye locations in the image is reduced relative to the previous stage by a feature filter. This allows the eye detector to run in real time on a 400 MHz PC. The methodology assumes that the lighting conditions allow the eyes to be distinguished as dark regions surrounded by the highlights of the temples and the bridge of the nose, that human eyes are largely surrounded by regions of skin color, that the head is only moderately rotated, that the eyes are reasonably horizontal, and that people are within interaction distance of the robot (3 to 10 feet). A sketch of this cascade is given below.
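The staged structure of the cascade can be sketched as follows. The helper callables (skin_detector, oval_face_finder, ratio_template, eye_perceptron) are hypothetical placeholders for the detectors named above, not actual interfaces from Kismet's implementation.

```python
def find_eyes(image, skin_detector, oval_face_finder, ratio_template,
              eye_perceptron, skin_threshold=0.05):
    """Successive-filter sketch of the eye-detection stages described above.
    Each stage prunes candidates before the next, more expensive stage runs.
    skin_detector is assumed to return a binary mask array over the image."""
    # Stage 1: abort early if too little of the image is skin-toned.
    skin_mask = skin_detector(image)
    if skin_mask.mean() < skin_threshold:
        return []

    eyes = []
    # Stage 2: look for skin-toned ovals that could be faces.
    for face_region in oval_face_finder(image, skin_mask):
        # Stage 3: run the ratio template over the face sub-image
        # to propose candidate eye locations.
        for candidate in ratio_template(face_region):
            # Stage 4: verify each candidate with the trained perceptron.
            if eye_perceptron(face_region, candidate):
                eyes.append(candidate)
    return eyes
```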

IX. Eye movements

Kismet's eyes periodically saccade to new targets chosen by the attention system, tracking them smoothly if they move and the robot wishes to engage them. Vergence eye movements are more challenging to implement in a social setting, since errors in disjunctive eye movements can give the eyes a disturbing appearance of moving independently. Errors in conjunctive movements have a much smaller impact on an observer, since the eyes clearly move in lock-step. A crude approximation of the opto-kinetic reflex is rolled into our implementation of smooth pursuit. An analogue of the vestibulo-ocular reflex has been developed using a 3-axis inertial sensor, but has yet to be implemented on Kismet (it currently runs on other humanoid robots in our lab). Kismet uses an efferent copy mechanism to compensate the eyes for movements of the head. An overview of the oculomotor control system is shown in Figure 10.

The attention system operates on the view from the central camera (see Figure 2). A transformation is needed to convert pixel coordinates in images from this camera into position setpoints for the eye motors. This transformation in general requires the distance to the target to be known, since objects in many locations will project to the same point in a single image (see Figure 11). Distance estimates are often noisy, which is problematic if the goal is to center the target exactly in the eyes. In practice, it is usually enough to get the target within the field of view of the foveal cameras in the eyes.

[Figure 10 block diagram: the wide and foveal cameras feed frame grabbers and the skin, color, motion, and face detectors; their weighted outputs drive the attention system and wide tracker, which, together with the behavior and motivation systems, select among saccade, smooth pursuit and vergence, VOR, fixed action pattern, and affective postural shift controllers; an arbiter combines these into eye-head-neck commands sent through a motor daemon to the motors.]

Fig. 10. Organization of Kismet's eye/neck motor control. Many cross-level influences have been omitted. The modules shown in gray are not active in the results presented in this paper.

Clearly, the narrower the field of view of these cameras, the more accurately the distance to the object needs to be known. Other crucial factors are the distance between the wide and foveal cameras, and the closest distance at which the robot will need to interact with objects. These constraints determined the physical distribution of Kismet's cameras and the choice of lenses. The central location of the wide camera places it as close as possible to the foveal cameras. It also has the advantage that moving the head to center a target as seen in the central camera will in fact truly orient the head towards that target; for cameras in other locations, the accuracy of orientation would be limited by the accuracy of the measurement of distance.
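The geometry behind this requirement can be sketched as follows: converting a pixel in the wide camera into a pan setpoint for an eye that is displaced sideways from that camera requires a distance estimate, and the error introduced by a wrong estimate shrinks as the target moves farther away. All parameters below (focal length in pixels, baseline, distance) are illustrative assumptions, not Kismet's calibration.

```python
import math

def eye_pan_setpoint(u_pixel, cx, focal_px, baseline_m, distance_m):
    """Convert a horizontal pixel coordinate from the central wide camera into a
    pan angle for an eye displaced sideways by baseline_m from that camera.
    focal_px is the wide camera's focal length in pixels."""
    # Bearing of the target as seen from the wide camera.
    bearing = math.atan2(u_pixel - cx, focal_px)

    # Lateral and forward offsets of the target in the wide camera's frame.
    lateral = distance_m * math.sin(bearing)
    forward = distance_m * math.cos(bearing)

    # Pan angle needed by the displaced eye; as distance grows, the correction
    # for the baseline shrinks and the two angles converge, which is why a
    # wide foveal field of view tolerates noisy depth estimates.
    return math.atan2(lateral - baseline_m, forward)
```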

Fig. 11. Without distance information, knowing the position of a target in the wide camera only identifies a ray along which the object must lie, and does not uniquely identify its location. If the cameras are close to each other relative to the closest distance the object is expected to be at, the foveal cameras can be rotated to bring the object within their narrow field of view without needing an accurate estimate of its distance. If the cameras are far apart, or the field of view is very narrow, the minimum distance at which the object can be becomes large.

Higher-level influences modulate eye and neck movements in a number of ways. As already discussed, modifications to the weights in the attention system translate into changes of the locus of attention about which eye movements are organized. The regime used to control the eyes and neck is available as a set of primitives to higher-level modules. Regimes include low-commitment search, high-commitment engagement, avoidance, sustained gaze, and deliberate gaze breaking. The primitive percepts generated by this level include a characterization of the most salient regions of the image in terms of the feature maps, an extended characterization of the tracked region in terms of the results of post-attentive processing (eye detection, distance estimation), and signals related to undesired conditions, such as a looming object, or an object moving at speeds the tracker finds difficult to keep up with.

X. Visual motor skills

Given the current task (as dictated by the behavior system), the motor skills level is responsible for figuring out how to move the actuators to carry out the stated goal. Often this requires coordination between multiple motor modalities (speech, body posture, facial display, and gaze control). Requests for these modalities can originate from the top down (e.g., from the emotion system or behavior system), as well as from the bottom up (e.g., the vocal system requesting lip and jaw movements for lip synching). Hence, the motor skills level must address the issue of servicing the motor requests of different systems across the different motor resources.

Furthermore, satisfying a goal often requires a sequence of coordinated motor movements. Each motor movement is a primitive (or a combination of primitives) from one of the base motor systems (the vocal system, the oculomotor system, etc.). Each such coordinated series of motor primitives is called a skill, and each skill is implemented as a finite state machine (FSM). Each motor skill encodes knowledge of how to move from one motor state to the next, where each sequence is designed to bring the robot closer to the current goal. The motor skills level must arbitrate among the many different FSMs, selecting the one to become active based on the active goal. This decision process is straightforward since there is an FSM tailored to each task of the behavior system.

Many skills can be thought of as fixed action patterns (FAPs), as conceptualized by early ethologists. Each FAP consists of two components, the action component and the taxis (or orienting) component. For Kismet, FAPs often correspond to communicative gestures, where the action component corresponds to the facial gesture and the taxis component (to whom the gesture is directed) is controlled by gaze. People seem to intuitively understand that when Kismet makes eye contact with them, they are the locus of Kismet's attention and the robot's behavior is organized about them. This places the person in a state of action readiness where they are poised to respond to Kismet's gestures.

A simple example of a motor skill is Kismet's "calling" FAP. When the current task is to bring a person into a good interaction distance, the motor skill system activates the calling FSM. The taxis component of the FAP issues a hold-gaze request to the oculomotor system. This serves to maintain the robot's gaze on the person to be hailed. In the first state of the gestural component, Kismet leans its body toward the person (a request to the body posture motor system). This strengthens the person's perception that the robot has taken a particular interest in them. The ears also begin to waggle exuberantly (creating a significant amount of motion and noise), which further attracts the person's attention to the robot (a request to the face motor system). In addition, Kismet vocalizes excitedly, which is perceived as an initiation of engagement. At the completion of this gesture, the FSM transitions to the second state. In this state, the robot "sits back" and waits for a bit with an expectant expression (ears slightly perked, eyes slightly widened, and brows raised). If the person has not already approached the robot, the approach is likely to occur during this "anticipation" phase. If the person does not approach within the allotted time period, the FSM transitions to the third state, in which the face relaxes, the robot assumes a neutral posture, and gaze fixation is released. At this point, the robot is likely to shift gaze. As long as this FSM is active (as determined by the behavior system), the hailing cycle repeats. It can be interrupted at any state transition by the activation of another FSM (such as the "greeting" FSM when the person has approached).
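The calling skill described above can be read as a three-state machine. The sketch below is a simplified illustration of that structure; the timing value and the textual stand-ins for motor requests are assumptions, not Kismet's actual interfaces.

```python
import time

class CallingFAP:
    """Minimal sketch of the three-state calling skill: hail, anticipate, relax.
    Motor requests are represented as printed messages for illustration."""

    def __init__(self, wait_seconds=4.0):
        self.state = "hail"
        self.wait_seconds = wait_seconds
        self.entered = time.time()

    def step(self, person_approached):
        if self.state == "hail":
            print("hold gaze on person; lean forward; waggle ears; vocalize excitedly")
            self.state, self.entered = "anticipate", time.time()
        elif self.state == "anticipate":
            print("sit back with expectant expression (perked ears, raised brows)")
            if person_approached:
                self.state = "done"                      # e.g. a greeting FSM takes over
            elif time.time() - self.entered > self.wait_seconds:
                self.state, self.entered = "relax", time.time()
        elif self.state == "relax":
            print("relax face, neutral posture, release gaze fixation")
            self.state = "hail"                          # hailing cycle repeats while active
        return self.state
```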

We now move up another layer of abstraction, to the behavior level in the hierarchy that was shown in Figure 3.

XI. Visual behavior

The behavior level is responsible for establishing the current task of the robot by arbitrating among Kismet's goal-achieving behaviors. By doing so, the observed behavior should be relevant, appropriately persistent, and opportunistic. Both the current environmental conditions (as characterized by high-level perceptual releasers) and motivational factors (emotion processes and homeostatic regulation) contribute to this decision process.

Interaction of the behavior level with the social level occurs through the world, as determined by the nature of the interaction between Kismet and the human. As the human responds to Kismet, the robot's perceptual conditions change. This can activate a different behavior, whose goal is physically carried out by the underlying motor systems. The human observes the robot's ensuing response and shapes their reply accordingly.

Interaction of the behavior level with the motor skills level also occurs through the world. For instance, if Kismet is looking for a bright toy, then the seek toy behavior is active. This task is passed to the underlying motor skills, which carry out the search. The act of scanning the environment brings new perceptions into Kismet's field of view. If a toy is found, then the seek toy behavior is successful and is released. At this point, the perceptual conditions for engaging the toy are relevant and the engage toy behaviors become active. A new set of motor skills becomes active to track and smoothly pursue the toy.


A. Social level

Eye movements have communicative value. As discussed previously, they indicate the robot's locus of attention. The robot's degree of engagement can also be conveyed, to communicate how strongly the robot's behavior is organized around what it is currently looking at. If the robot's eyes flick about from place to place without resting, that indicates a low level of engagement, appropriate to a visual search behavior. Prolonged fixation with smooth pursuit and orientation of the head towards the target conveys a much greater level of engagement, suggesting that the robot's behavior is very strongly organized about the locus of attention.

[Figure 12 diagram: a comfortable interaction distance and speed are bounded by regulatory responses: a person who comes too close triggers a withdrawal response, a person too far away (or beyond sensor range) triggers the calling behavior, movement that is too fast triggers an irritation response, and movement that is both too fast and too close triggers a threat response; the person draws closer or backs off accordingly.]

Fig. 12. Regulating interaction. People too distant to be seen clearly are called closer; if they come too close, the robot signals discomfort and withdraws. The withdrawal moves the robot back somewhat physically, but is more effective in signaling to the human to back off. Toys or people that move too rapidly cause irritation.

Eye movements are the most obvious and direct motor actions that support visual perception, but they are by no means the only ones. Postural shifts and fixed action patterns involving the entire robot also have an important role. Kismet has a number of coordinated motor actions designed to deal with various limitations of Kismet's visual perception (see Figure 12). For example, if a person is visible but is too distant for their face to be imaged at adequate resolution, Kismet engages in a calling behavior to summon the person closer. People who come too close to the robot also cause difficulties for the cameras with narrow fields of view, since only a small part of a face may be visible. In this circumstance, a withdrawal response is invoked, in which Kismet draws back physically from the person. This behavior, by itself, aids the cameras somewhat by increasing the distance between Kismet and the human. But the behavior can have a secondary and greater effect through social amplification: for a human close to Kismet, a withdrawal response is a strong social cue to back away, since it is analogous to the human response to invasions of "personal space."

Similar kinds of behavior can be used to support the visual perception of objects. If an object is too close, Kismet can lean away from it; if it is too far away, Kismet can crane its neck towards it. Again, in a social context, such actions have power beyond their immediate physical consequences. A human, reading intent into the robot's actions, may amplify those actions. For example, neck-craning towards a toy may be interpreted as interest in that toy, resulting in the human bringing the toy closer to the robot. Another limitation of the visual system is how quickly it can track moving objects. If objects or people move at excessive speeds, Kismet has difficulty tracking them continuously. To bias people away from excessively boisterous behavior, in their own movements or in the movement of objects they manipulate, Kismet shows irritation when its tracker is at the limits of its ability. These limits are either physical (the maximum rate at which the eyes and neck can move) or computational (the maximum displacement per frame from the cameras over which a target is searched for).

Such regulatory mechanisms also play roles in more complex social interactions, such as conversational turn-taking. Here control of gaze direction is important for regulating conversation rate [15]. In general, people are likely to glance aside when they begin their turn, and to make eye contact when they are prepared to relinquish their turn and await a response. Blinks occur most frequently at the end of an utterance. These and other cues allow Kismet to influence the flow of conversation to the advantage of its auditory processing. The visual-motor system can also be driven by the requirements of a nominally unrelated sensory modality, just as behaviors that seem completely orthogonal to vision (such as ear-wiggling during the call behavior to attract a person's attention) are nevertheless recruited for the purposes of regulation. These mechanisms also help protect the robot. Objects that suddenly appear close to the robot trigger a looming reflex, causing the robot to quickly withdraw and appear startled. If the event is repeated, the response quickly habituates and the robot simply appears annoyed, since its best strategy for ending these repetitions is to clearly signal that they are undesirable. Similarly, rapidly moving objects close to the robot are threatening and trigger an escape response. These mechanisms are all designed to elicit natural and intuitive responses from humans, without any special training. But even without these carefully crafted mechanisms, it is often clear to a human when Kismet's perception is failing, and what corrective action would help, because the robot's perception is reflected in its behavior in a familiar way. Inferences made based on our human preconceptions are actually likely to work.

B. Evidence of Social Amplification

To evaluate the social implications of Kismet's behavior, we invited a few people to interact with the robot in a free-form exchange. There were four subjects in the study, two males (one adult and one child) and two females (both adults). They ranged in age from 12 to 28 years. None of the subjects were affiliated with MIT. All had substantial experience with computers. None of the subjects had any prior experience with Kismet. The child had prior experience with a variety of interactive toys. Each subject interacted with the robot for 20 to 30 minutes. All exchanges were video recorded for further analysis.

We analyzed the video for evidence of social amplification. Namely, did people read Kismet's cues, and did they respond to them in a manner that benefited the robot's perceptual processing or its behavior? We found several classes of interactions where the robot displayed social cues and successfully regulated the exchange.

B.1 Establishing a Personal Space

The strongest evidence of social amplification was apparent in cases where people came within very close proximity of Kismet. In numerous instances the subjects would bring their face very close to the robot's face. The robot would withdraw, shrinking backwards, perhaps with an annoyed expression on its face. In some cases the robot would also issue a vocalization with an expression of disgust. In one instance, the subject accidentally came too close and the robot withdrew without exhibiting any signs of annoyance. The subject immediately queried, "Am I too close to you? I can back up," and moved back to put a bit more space between himself and the robot. In another instance, a different subject intentionally put his face very close to the robot's face to explore the response. The robot withdrew while displaying full annoyance in both face and voice. The subject immediately pushed backwards, rolling the chair across the floor to put about an additional three feet between himself and the robot, and promptly apologized to the robot.

Overall, across different subjects, the robot successfully established a personal space. This benefits the robot's visual processing by keeping people at a distance where the visual system can detect eyes more robustly. This behavioral response was added to the robot's repertoire because previous interactions with naive subjects illustrated that the robot was not granted any personal space. This can be attributed to "baby movements", where people tend to get extremely close to infants, for instance.

B.2 Luring People to a Good Interaction Distance

People seem responsive to Kismet's calling behavior. When a person is close enough for the robot to perceive his/her presence, but too far away for face-to-face exchange, the robot issues this social display to bring the person closer. The most distinguishing features of the display are craning the neck forward in the direction of the person, wiggling the ears with large amplitude, and vocalizing with an excited affect. The function of the display is to lure people into an interaction distance that benefits the vision system. This behavior is not often witnessed, as most subjects simply pull up a chair in front of the robot and remain seated at a typical face-to-face interaction distance.

The youngest subject took the liberty of exploring different interaction ranges, however. Over the course of about fifteen minutes he would alternately approach the robot to a normal face-to-face distance, move very close to the robot (invading its personal space), and back away from the robot. Upon the first appearance of the calling response, the experimenter queried the subject about the robot's behavior. The subject interpreted the display as the robot wanting to play, and he approached the robot. At the end of the subject's investigation, the experimenter queried him about the farther interaction distances. The subject responded that when he was farther from Kismet, the robot would lean forward. He also noted that the robot had a harder time looking at his face when he was farther back. In general, he interpreted the leaning behavior as the robot's attempt to initiate an exchange with him. We have noticed from earlier interactions (with other people unfamiliar with the robot) that a few people have not immediately understood this display as a "calling" behavior. The display is flamboyant enough, however, to arouse their interest in approaching the robot.

XII. Limitations and extensions

There are a number of ways in which the current implementation could be improved and expanded upon. Some of these recommendations involve supplementing the existing framework; others involve integrating this system into a larger framework.

Kismet's visual perceptual world consists only of what is in view of the cameras. Ultimately, the robot should be able to construct an ego-centered saliency map of interaction space. In this representation, the robot could keep track of where interesting things are located, even if they are not currently in view. Human infants engage in social referencing with their caregiver at a very young age. If some event occurs that the infant is unsure about, the infant will look to the caregiver's face for an affective assessment. The infant will use this assessment to organize its behavior. For instance, if the caregiver looks frightened, the infant may become distressed and not probe further. If the caregiver looks pleased and encouraging, the infant is likely to continue exploring. With respect to Kismet, it will encounter many situations that it was not explicitly programmed to evaluate. However, if the robot can engage in social referencing, it can look to the human for the affective assessment and use it to bias learning and to organize subsequent behavior. Chances are, the event in question and the human's face will not be in view at the same time. Hence, a representation of where interesting things are in ego-centered interaction space is an important resource.
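One way to realize such a representation, purely as a sketch of the proposed extension (this is not part of the implementation described here), is a small store of salient targets in a body-centered frame whose saliency decays over time:

    import time

    class EgocentricSaliencyMap:
        # Sketch of the proposed ego-centered map of interaction space.
        # The structure and decay model are illustrative assumptions.
        def __init__(self, halflife_s=30.0):
            self.halflife_s = halflife_s
            self.targets = {}   # label -> (azimuth, elevation, distance, saliency, timestamp)

        def remember(self, label, azimuth, elevation, distance, saliency):
            # Store targets in a body-centered frame so they persist even when
            # the cameras are pointed elsewhere (e.g. at the caregiver's face).
            self.targets[label] = (azimuth, elevation, distance, saliency, time.time())

        def most_salient(self):
            # Saliency decays with age, so stale entries gradually lose influence.
            now = time.time()
            def decayed(entry):
                _, _, _, saliency, t = entry
                return saliency * 0.5 ** ((now - t) / self.halflife_s)
            if not self.targets:
                return None
            return max(self.targets.items(), key=lambda item: decayed(item[1]))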

The attention system could be extended by adding new feature maps. A depth map from stereo would be very useful; currently, distance is only computed post-attentively. Another interesting feature map to incorporate into the system would be edge orientation. Wolfe and Treisman, among others, argue in favor of edge orientation as a bottom-up feature map in humans. Currently, Kismet has no shape metrics to help it distinguish objects from each other (such as its toy block from its toy dinosaur). Adding features to support this is an important extension to the existing implementation.
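Assuming the attention system combines bottom-up feature maps with per-feature weights, in the spirit of guided search [12] and saliency models [11], a new cue such as stereo depth or edge orientation could be folded in as one more weighted term. The sketch below is illustrative only; the map names and weights are assumptions.

    import numpy as np

    def combine_feature_maps(feature_maps, weights):
        # Weighted sum of normalized bottom-up feature maps into one saliency map.
        # Adding a cue such as stereo depth or edge orientation only requires
        # registering another map and weight. Weights here are illustrative.
        saliency = None
        for name, fmap in feature_maps.items():
            peak = float(fmap.max())
            normalized = fmap / peak if peak > 0 else fmap
            contribution = weights.get(name, 0.0) * normalized
            saliency = contribution if saliency is None else saliency + contribution
        return saliency

    # Hypothetical usage with an added depth map:
    # maps = {"color": color_map, "motion": motion_map, "skin": skin_map, "depth": depth_map}
    # weights = {"color": 1.0, "motion": 1.0, "skin": 1.0, "depth": 0.5}
    # saliency_map = combine_feature_maps(maps, weights)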

There are no auditory bottom-up contributions. A sound localization feature map would be a nice multi-modal extension. Currently, Kismet assumes that the most salient person is the one who is talking to it. Often there are multiple people talking around and to the robot. It is important that the robot knows who is addressing it and when. Sound localization would be of great benefit here.

Acknowledgments

This work was funded by DARPA under contract DABT 63-99-1-0012 and by NTT.

XIII. Conclusions

Motor control for a social robot poses challenges beyond issues of stability and accuracy. Motor actions will be perceived by human observers as semantically rich, regardless of whether the imputed meaning is intended or not. This can be a powerful resource for facilitating natural interactions between robot and human, and places constraints on the robot's physical appearance and movement. It allows the robot to be readable: to make its behavioral intent and motivational state transparent at an intuitive level to those it interacts with. It allows the robot to regulate its interactions to suit its perceptual and motor capabilities, again in an intuitive way with which humans naturally cooperate. These social constraints give the robot leverage over the world that extends far beyond its physical competence, through social amplification of its perceived intent. If properly designed, the robot's visual behaviors can be matched to human expectations and allow both robot and human to participate in natural and intuitive social interactions.

References

[1] D.H. Ballard, "Animate vision," Artificial Intelligence, vol. 48, pp. 57-86, 1991.
[2] S. Vijayakumar, J. Conradt, T. Shibata, and S. Schaal, "Overt visual attention for a humanoid robot," in Proceedings of the International Conference on Intelligence in Robotics and Autonomous Systems (IROS01), Hawaii, 2001.
[3] K. Kawamura, A. Alford, K. Hambuchen, and M. Wilkes, "Towards a unified framework for human-humanoid interaction," in Proceedings of the First IEEE-RAS International Conference on Humanoid Robots, Cambridge, MA, 2000.
[4] G. Metta, BabyRoBot: A study in sensori-motor development, Ph.D. thesis, University of Genoa, LIRA-Lab, DIST, 1999.
[5] H. Kozima, "Attention-sharing and behavior-sharing in human-robot communication," in IEEE International Workshop on Robot and Human Communication (ROMAN-98), Takamatsu, Japan, 1998, pp. 9-14.
[6] D.H. Ballard, "Behavioral constraints on animate vision," Image and Vision Computing, vol. 7, no. 1, pp. 3-9, 1989.
[7] J. van der Spiegel, G. Kreider, C. Claeys, I. Debusschere, G. Sandini, P. Dario, F. Fantini, P. Belluti, and G. Soncini, "A foveated retina-like sensor using CCD technology," in Analog VLSI Implementation of Neural Systems, C. Mead and M. Ismail, Eds., pp. 189-212, Kluwer Academic Publishers, 1989.
[8] A. Bernardino and J. Santos-Victor, "Binocular visual tracking: Integration of perception and control," IEEE Transactions on Robotics and Automation, vol. 15, no. 6, pp. 1937-1958, 1999.
[9] M.E. Goldberg, "The control of gaze," in Principles of Neural Science, E.R. Kandel, J.H. Schwartz, and T.M. Jessell, Eds., McGraw-Hill, 4th edition, 2000.
[10] H.C. Nothdurft, "The role of features in preattentive vision: Comparison of orientation, motion and color cues," Vision Research, vol. 33, pp. 1937-1958, 1993.
[11] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 20, no. 11, pp. 1254-1259, 1998.
[12] J.M. Wolfe, "Guided Search 2.0: A revised model of visual search," Psychonomic Bulletin & Review, vol. 1, no. 2, pp. 202-238, 1994.
[13] B. Scassellati, "Finding eyes and faces with a foveated vision system," in Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), Madison, WI, 1998, pp. 969-976.
[14] P. Sinha, "Object recognition via image invariants: A case study," Investigative Ophthalmology and Visual Science, vol. 35, pp. 1735-1740, May 1994.
[15] J. Cassell, "Embodied conversation: Integrating face and gesture into automatic spoken dialogue systems."

Cynthia Breazeal is a postdoctoral associate at MIT. She received her Sc.D. from MIT in the Department of Electrical Engineering and Computer Science in May 2000. Her interests focus on human-like robots that can interact in natural, social ways with humans. She received her BS in Electrical and Computer Engineering at the University of California, Santa Barbara, and her MS in Electrical Engineering and Computer Science at MIT.

Aaron Edsinger is currently a graduate student with Professor Rodney Brooks at the MIT Artificial Intelligence Lab. He is interested in combining sophisticated sensorimotor control with aesthetic considerations to create naturalistic movement in robots. He received a BS in Computer Systems at Stanford.

Paul Fitzpatrick received a B.Eng. and M.Eng. in Computer Engineering at the University of Limerick, Ireland. He is currently a graduate student with Prof. Rodney Brooks at the MIT Artificial Intelligence Laboratory. His current work focuses on robot-human communication through social protocols realized in the visual and auditory domains.

Brian Scassellati is completing his PhD with Rodney Brooks at the MIT Artificial Intelligence Lab. His work is strongly grounded in theories of how the human mind develops, and he is interested in robotics as a tool for evaluating models from biological sciences. He received a BS in Computer Science, a BS in Brain and Cognitive Science, and his ME in Electrical Engineering and Computer Science, all from MIT.