
Visual Transformations in Gesture Imitation:

what you see is what you do

Manuel Cabido Lopes, José Santos-Victor

Instituto de Sistemas e Robótica, Instituto Superior Técnico

Lisbon, Portugal
{macl,jasv}@isr.ist.utl.pt

Abstract

We propose an approach for a robot to imitate the gestures of a human demonstrator. Our framework consists solely of two components: a Sensory-Motor Map (SMM) and a View-Point Transformation (VPT). The SMM establishes an association between an arm image and the corresponding joint angles, and it is learned by the system during a period of observation of its own gestures. The VPT is widely discussed in the psychology of visual perception and is used to transform the image of the demonstrator's arm to the so-called ego-centric image, as if the robot were observing its own arm. Different structures of the SMM and VPT are proposed in accordance with observations in human imitation. The whole system relies on monocular visual information and leads to a parsimonious architecture for learning by imitation. Real-time results are presented and discussed.

1 Introduction

The impressive advance of research and development in robotics and autonomous systems in the past years has led to robotic systems of increasing motor, perceptual and cognitive capabilities.

These achievements are opening the way for new application opportunities that will require these systems to interact with other robots or non-technical users during extended periods of time. Traditional programming methodologies and robot interfaces will no longer suffice, as the system needs to learn to execute complex tasks and improve its performance through its lifetime.

Our work has the long-term goal of building sophisticated robotic systems able to interact with humans or other robots in a natural and intuitive way. One promising approach relies on imitation, whereby a robot could learn how to handle a person's private objects by observing the owner's behavior over time.

Learning by imitation is not a new topic and has been addressed before in the literature. This learning paradigm has already been pursued in humanoid robotic applications [1], where the number of degrees of freedom is very large, in tele-operation [2] and in assembly tasks [3]. Most published works, however, focus their attention on isolated system components only, while we describe a complete architecture.

We will concentrate on the simplest form of imitation, which consists in replicating the gestures or movements of a demonstrator without seeking to understand the gestures or the action's goal. In the work described in [4], the imitator can replicate not only the gestures but also the dynamics of a demonstrator, but it requires an exoskeleton to sense the demonstrator's behavior. Instead, our approach is based exclusively on vision.

The motivation to use visual information for imitation arises from the fact that many living beings - like humans - resort to vision to solve an extremely large set of tasks. Also, from the engineering point of view, video cameras are low-cost, non-invasive devices that can be installed in ordinary houses and that provide an enormous quantity of information, especially if combined with domain knowledge or stereo data.

Interestingly, the process of imitation seems to be the primary learning process used by infants and monkeys during the first years of life. Recently, the discovery of mirror neurons in the monkey's brain [5, 6] has raised new hypotheses and provided a better understanding of the process of imitation in nature. These neurons are activated both when a monkey performs a certain action and when it sees the same action being performed by a demonstrator or another monkey.

Even if the role of these neurons is not yet fully understood, a few important conclusions can nevertheless be drawn. Firstly, mirror neurons clearly illustrate the intimate relationship between perception and action. Secondly, these neurons exhibit the remarkable ability of "recognizing" certain gestures or actions when seen from very different perspectives (associating gestures performed by the demonstrator to the subject's own gestures).


One of the main contributions of this paper is related to this last observation, which is illustrated in Figure 1. We propose a method that allows the system to "rotate" the image of gestures done by a demonstrator (allo-image) to the corresponding image (ego-image) that would be obtained if those same gestures were actually performed by the system itself. We call this process the View-Point Transformation (VPT). Surprisingly, in spite of the importance given to the VPT in psychological studies [7], it has received very little attention from other researchers in the field of visual imitation.

Figure 1: Gestures can be seen from very distinct perspectives. The image shows one's own arm performing a gesture (ego-image) and that of the demonstrator performing a similar gesture (allo-image).

One of the few works that dealt explicitly with the VPT is [8]. However, instead of considering the complete arm posture, only the end-effector position is mapped. The map between the allo and ego images is performed using epipolar geometry, based on a stereo camera pair.

Other studies addressed this problem in an implicit and superficial way. A mobile robot capable of learning the policy followed by another mobile vehicle is described in [9]. Since the system kinematics is very simple, the VPT corresponds to a transformation between the views of the two mobile robots. This is achieved in practice by delaying the imitator's perception until it reaches the same place as the demonstrator, without explicitly addressing the VPT. The work described in [10] has objectives similar to our own research and allows a robot to mimic the "dance" of an Avatar. However, it does not address the VPT at all, and special invasive hardware was used to perform this transformation. Instead, we present a simple architecture for imitation which carefully addresses the fundamental process of View-Point Transformation.

The VPT allows the robot to map observed gestures to a canonical point of view. The final step consists in transforming these mapped features into motor commands, which is referred to as the Sensory-Motor Map (SMM). Our complete architecture for imitation is shown in Fig. 2.

Figure 2: The combination of the Sensory-Motor Map and the View-Point Transformation allows the robot to imitate the arm movements executed by another robot or human.

The Sensory-Motor Map can be computed explicitly if the parameters of the arm-hand-eye configuration are known a priori but - more interestingly - it can be learned from observations of arm/hand motions. Again, biology can provide relevant insight. The Asymmetric Tonic Neck reflex [11] forces newborns to look at their hands, which allows them to learn the relationship between motor actions and the corresponding visual stimuli.

Similarly, in our work the robot learns the SMM during an initial period of self-observation, while performing hand/arm movements. Once the SMM has been estimated, the robot can observe a demonstrator, use the VPT to transform the image features to a canonical reference frame, and map these features to motor commands through the SMM. The final result will be a posture similar to that observed.

Structure of the paper

In Section 2, we present the models used throughout this work, namely the arm kinematics and the camera/eye geometry. Section 3 is devoted to the definition and estimation of the Sensory-Motor Map. In Section 4 we describe how the system performs the View-Point Transformation. In Section 5 we show how to use these elementary blocks to perform imitation and present experimental results. In Section 6, we draw some conclusions and establish directions for future work.

2 Modeling

Throughout the paper we consider a robotic system consisting of a computer simulating an anthropomorphic arm and equipped with a real web camera. This section presents the models used for the camera and robot body.


2.1 Body/arm kinematics

The anthropomorphic arm is modeled as an articulated link system. Fig. 3 shows the four arm links: L_1 - forearm, L_2 - upper arm, L_3 - shoulder width and L_4 - body height.

Figure 3: Kinematic model of the human arm.

It is further assumed that the relative sizes of these links are known, e.g. from biometric measurements: L_1 = L_2 = 1, L_3 = 1.25 and L_4 = 2.5.
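For later reference, these proportions can simply be stored as constants in the imitation software. The snippet below (Python, with illustrative names of our own) only encodes the values given above.

```python
# Relative link sizes assumed in the paper (dimensionless proportions).
L1_FOREARM = 1.0          # forearm
L2_UPPER_ARM = 1.0        # upper arm
L3_SHOULDER_WIDTH = 1.25  # shoulder width
L4_BODY_HEIGHT = 2.5      # body height
```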

2.2 Camera/eye geometry

An image is a projection of the 3D world whereby depth information is lost. In our case, we will retrieve depth information from a single image by using knowledge about the body links and a simplified, orthographic camera model.

We use the scaled orthographic projection model, which assumes that the image is obtained by projecting all points along parallel lines, up to a scale factor. Interestingly, such an approximation may have some biological grounding, taking into account the scale-compensation effect in human vision [12], whereby we normalize the sizes of known objects irrespective of their distances to the eye.

Let M = [X Y Z]^T denote a 3D point expressed in the camera coordinate frame. Then, with an orthographic camera model, M is projected onto m = [u v]^T, according to:

m = PM:    [u  v]^T = s [1 0 0; 0 1 0] [X  Y  Z]^T        (1)

where s is a scale factor that can be estimated by placing a segment of known size L fronto-parallel to the camera and measuring its image size l (s = l/L).

For simplification, we assume that the camera is positioned at the imitator's right shoulder with the optical axis pointing forward horizontally. With this specification of the camera pose, there is no need for an additional arm-eye coordinate transformation in Equation (1).
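As a concrete illustration, the following Python sketch implements the scaled orthographic projection of Equation (1) and the estimation of s from a fronto-parallel segment; the function names and the example numbers are ours, not the paper's.

```python
import numpy as np

def project_orthographic(M, s):
    """Scaled orthographic projection of Eq. (1): m = s * [[1,0,0],[0,1,0]] @ M."""
    P = s * np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
    return P @ np.asarray(M, dtype=float)

def estimate_scale(image_length_l, true_length_L):
    """Scale factor from a segment of known length placed fronto-parallel to the camera."""
    return image_length_l / true_length_L

# Example: a segment of length 1.0 (arm-link units) appears 80 pixels long in the image.
s = estimate_scale(80.0, 1.0)                  # s = 80 pixels per unit length
m = project_orthographic([0.3, 0.5, 2.0], s)   # the depth (Z) is discarded by the projection
```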

3 Sensory-Motor Map

The Sensory-Motor Map (SMM) defines a correspondence between perception and action. It can be interpreted in terms of forward/inverse kinematics in the case of robotic manipulators. The SMM can be used to predict the image resulting from moving one's arm to a certain posture. In our case, the SMM will allow the system to determine the arm's joint angles that correspond to a given image configuration of the arm.

In the context of imitation, the SMM can be used with different levels of ambiguity/completeness. In some cases, one wants to replicate exactly someone else's gestures, considering all the joint angles. In other cases, however, we may want to imitate the hand pose only, while the position of the elbow or the rest of the arm configuration is irrelevant. To encompass these possibilities, we have considered two cases: the full-arm SMM and the free-elbow SMM, described in the following sections. Finally, we describe how the system can learn the SMM during a period of self-observation.

3.1 Full-Arm SMM

We denote the elbow and wrist image coordinates by m_e and m_w, the forearm and upper-arm image lengths by l_1 and l_2, and the joint angles by θ_i, i = 1..4. We then have:

[θ_1, ..., θ_4] = F_1(m_e, m_w, l_1, l_2, L_1, L_2, s)

where F_1(.) denotes the SMM, L_1 and L_2 are the (known) lengths of the forearm and upper arm, and s is the camera scale factor.

The computation of this function can be done in successive steps, where the angles of the shoulder joint are determined first and then used to simplify the calculation of the elbow joint angles.

The inputs to the SMM consist of features extracted from the image points of the shoulder, elbow and wrist; the outputs are the angular positions of every joint. The shoulder pan and elevation angles, θ_1 and θ_2, can be readily obtained from image data as:

θ_1 = f_1(m_e) = arctan(v_e / u_e)
θ_2 = f_2(l_2, L_2, s) = arccos(l_2 / (s L_2))

Once the system has extracted the shoulder angles, the process is repeated for the elbow. Before computing this second set of joint angles, the image features undergo a set of transformations so as to compensate the rotation of the shoulder:

[u'_w  v'_w  ξ]^T = R_zy(θ_1, θ_2) ( [u_w  v_w  √(s²L_1² − l_1²)]^T − [u_e  v_e  0]^T )        (2)

where ξ is not used in the remaining computations and R_zy(θ_1, θ_2) denotes a rotation of θ_1 around the z axis followed by a rotation of θ_2 around the y axis.

With the transformed coordinates of the wrist we can finally extract the remaining joint angles, θ_3 and θ_4:

θ_3 = f_3(m'_w) = arctan(v'_w / u'_w)
θ_4 = f_4(m'_w, L_1, s) = arccos(l'_1 / (s L_1))

The approach just described allows the system to determine the joint angles corresponding to a certain image configuration of the arm. In the next section, we address the case where the elbow joint is allowed to vary freely.
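For concreteness, here is a minimal Python sketch of the closed-form full-arm SMM described above, assuming the shoulder at the image origin and scaled orthography. The rotation order and signs used to compensate the shoulder rotation (Eq. 2), the use of arctan2, and the clipping of the arccos arguments are implementation choices of this sketch rather than details taken from the paper.

```python
import numpy as np

def rot_z(a):
    c, s_ = np.cos(a), np.sin(a)
    return np.array([[c, -s_, 0.0], [s_, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(a):
    c, s_ = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s_], [0.0, 1.0, 0.0], [-s_, 0.0, c]])

def full_arm_smm(m_e, m_w, L1, L2, s):
    """Joint angles (theta1..theta4) from the elbow/wrist image points m_e, m_w."""
    u_e, v_e = m_e
    l2 = np.hypot(u_e, v_e)                                  # upper-arm image length
    theta1 = np.arctan2(v_e, u_e)                            # shoulder pan
    theta2 = np.arccos(np.clip(l2 / (s * L2), -1.0, 1.0))    # shoulder elevation

    # Forearm vector relative to the elbow; the depth term comes from the known link length.
    du, dv = m_w[0] - u_e, m_w[1] - v_e
    l1 = np.hypot(du, dv)
    dz = np.sqrt(max(s ** 2 * L1 ** 2 - l1 ** 2, 0.0))

    # Compensate the shoulder rotation (Eq. 2); order/signs are assumptions of this sketch.
    w_prime = rot_y(theta2) @ rot_z(theta1) @ np.array([du, dv, dz])

    l1_prime = np.hypot(w_prime[0], w_prime[1])
    theta3 = np.arctan2(w_prime[1], w_prime[0])                    # elbow rotation
    theta4 = np.arccos(np.clip(l1_prime / (s * L1), -1.0, 1.0))    # elbow flexion
    return theta1, theta2, theta3, theta4
```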

3.2 Free-Elbow SMM

The free-elbow SMM is used to reach a given hand position, while the elbow is left free to take different configurations. The input features consist of the hand image coordinates and the depth between the shoulder and the hand.

[θ_1, θ_2, θ_4] = F_2(m_w, ^r dZ_w, L_1, L_2, s)

The elbow joint angle, θ_3, is set to a comfortable position. This is done in an iterative process that aims at keeping the joint positions as far as possible from their limit values. The optimal elbow angle, θ_3, is chosen to maximize:

θ_3 = arg max_{θ_3} Σ_i (θ_i − θ_i^limits)²

while the other angles can be calculated from the arm features. Again, the estimation can be done sequentially, each joint being used to estimate the next one:

θ_4 = arcsin( ((^r x_h)² + (^r y_h)² + (^r z_h)²) / 2 − 1 )

θ_1 = 2 arctan( (b_1 − √(b_1² + a_1² − c_1²)) / (a_1 + c_1) )

θ_2 = 2 arctan( (b_2 − √(b_2² + a_2² − c_2²)) / (a_2 + c_2) ) + π

where the following constants have been used:

a_1 = sin θ_4 + 1
b_1 = cos θ_3 cos θ_4
c_1 = −(^r y_h)

a_2 = cos θ_4 cos θ_2 cos θ_3 − sin θ_2 (1 + sin θ_4)
b_2 = −cos θ_4 sin θ_3
c_2 = ^r x_h
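A minimal sketch of the comfort-driven elbow selection is shown below. It replaces the paper's iterative process with a simple grid search over θ_3, interprets the comfort criterion as the squared distance to the nearest joint limit, and treats the sequential analytic solution for the remaining angles as a user-supplied function; the joint limits shown are purely illustrative.

```python
import numpy as np

# Hypothetical joint limits in radians for (theta1, theta2, theta3, theta4); illustrative only.
JOINT_LIMITS = [(-np.pi / 2, np.pi / 2), (-np.pi / 2, np.pi / 2),
                (-np.pi, np.pi), (0.0, np.pi)]

def comfort(thetas, limits=JOINT_LIMITS):
    """Sum over joints of the squared distance to the nearest limit (larger = more comfortable)."""
    return sum(min((t - lo) ** 2, (t - hi) ** 2) for t, (lo, hi) in zip(thetas, limits))

def free_elbow_smm(hand_features, solve_remaining_angles, n_samples=181):
    """Pick theta3 maximizing the comfort criterion of Section 3.2.

    solve_remaining_angles(theta3, hand_features) -> (theta1, theta2, theta4)
    stands for the sequential analytic solution given in the text.
    """
    best_score, best_angles = -np.inf, None
    for theta3 in np.linspace(-np.pi, np.pi, n_samples):
        theta1, theta2, theta4 = solve_remaining_angles(theta3, hand_features)
        score = comfort((theta1, theta2, theta3, theta4))
        if score > best_score:
            best_score, best_angles = score, (theta1, theta2, theta3, theta4)
    return best_angles
```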

3.3 Learning the SMM

In the previous sections we derived the expressions of the full-arm and free-elbow SMMs. However, rather than coding these expressions directly, we adopted a learning approach whereby the system learns the SMM by performing arm movements and observing the effect on the image plane.

The computation of the SMM can be done sequentially: estimating the first angle, which is then used in the computation of the following angle, and so forth. This allows the system to learn the SMM as a sequence of smaller learning problems.

This approach bears a strong resemblance to the development of sensory-motor coordination in newborns and young infants, which starts with simple motions that become more and more elaborate as infants acquire better control over motor coordination.

In all cases, we use a Multi-Layer Perceptron (MLP) to learn the SMM, i.e. to approximate the functions f_i, i = 1..4. Table 1 presents the learning error and illustrates the good performance of our approach for estimating the SMM.

Joint   θ_1      θ_2      θ_3    θ_4
MSE     3.6e-2   3.6e-2   3.6    3.6

Table 1: Mean squared error (in deg²) for each joint in the full-arm SMM.
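As an illustration of the learning stage, the sketch below trains one small MLP per joint from self-observation data; scikit-learn's MLPRegressor is used as a stand-in for the paper's MLP, and self_observe (the robot's own arm/camera loop returning image features for commanded joint angles) is a placeholder.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def learn_smm(self_observe, n_samples=5000, seed=0):
    """Learn the SMM by self-observation: move the arm, record (image features, angles)."""
    rng = np.random.default_rng(seed)
    angles = rng.uniform(-np.pi / 2, np.pi / 2, size=(n_samples, 4))   # random arm postures
    features = np.array([self_observe(a) for a in angles])             # what the camera sees

    models = []
    for i in range(4):
        # Sequential decomposition: joint i is also given the previously learned joints,
        # mirroring the stepwise computation of Section 3.1.
        inputs = np.hstack([features, angles[:, :i]])
        mlp = MLPRegressor(hidden_layer_sizes=(10,), max_iter=2000)
        mlp.fit(inputs, angles[:, i])
        models.append(mlp)
    return models
```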

Ideas about development can be further exploited in this construction. Starting from simpler cases, de-coupling several degrees of freedom, and interleaving perception with action learning cycles are developmental "techniques" found in biological systems.

4 View-Point Transformation

A certain arm gesture can be seen from very different perspectives depending on whether the gesture is performed by the robot (self-observation) or by the demonstrator.

One can thus consider two distinct images: the ego-centric image, I_e, during self-observation, and the allo-centric image, I_a, when looking at other robots/people. The View-Point Transformation (VPT) has the role of aligning the allo-centric image of the demonstrator's arm with the ego-centric image, as if the system were observing its own arm.

The precise structure of the VPT is related to the ultimate meaning of imitation. Experiments in psychology show that imitation tasks can be ambiguous. In some cases, humans imitate the gestures of a demonstrator only partially (e.g. replicating the hand pose but having a different arm configuration, as in sign language), use a different arm, or execute gestures with distinct absolute orientations [13].


In some other cases, the goal consists in mimicking someone else's gestures as completely as possible, as when dancing or dismounting a complex mechanical part.

According to the structure of the chosen VPT, a class of imitation behaviors can be generated. We consider two different cases. In the first case - the 3D VPT - a complete three-dimensional imitation is intended. In the second case - the 2D VPT - the goal consists in achieving coherence only in the image, even if the arm pose might be different. Depending on the desired level of coherence (2D/3D), the corresponding (2D/3D) VPT allows the robot to transform the image of an observed gesture into an equivalent image, as if the gesture were executed by the robot itself.

4.1 3D View-Point Transformation

In this approach we explicitly reconstruct the posture of the observed arm in 3D and use fixed points (shoulders and hip) to determine the rigid transformation that aligns the allo-centric and ego-centric image features. We then have:

I_e = P T Rec(I_a) = VPT(I_a)

where T is a 3D rigid transformation and Rec(I_a) stands for the 3D reconstruction of the arm posture from allo-centric image features. Posture reconstruction and the computation of T are presented in the following sections.

4.1.1 Posture reconstruction

To reconstruct the 3D posture of the observed arm, we follow the approach suggested in [14], based on the orthographic camera and articulated arm models presented in Section 2.

Let M_1 and M_2 be the 3D endpoints of an arm link whose image projections are denoted by m_1 and m_2. Under orthography, the X, Y coordinates are readily computed from the image coordinates (simple scale). The depth variation, dZ = Z_1 − Z_2, can be determined as:

dZ = ± √(L² − l²/s²)

where L = ‖M_1 − M_2‖ and l = ‖m_1 − m_2‖. If the camera scale factor s is not known beforehand, one can use a different value provided that the following constraint, involving the relative sizes of the arm links, is met:

s ≥ max_i (l_i / L_i),   i = 1..4        (3)

Fig. 4 illustrates results of the reconstruction procedure. It shows an image of an arm gesture and the corresponding 3D reconstruction, obtained from a single view and assuming that s and the arm link proportions were known.

Figure 4: Left: Reconstructed arm posture. Right: Original view.

With this method there is an ambiguity in the sign of dZ. We overcome this problem by restricting the working volume of the arm. In the future, we will address this problem further; several approaches may be used: (i) optimization techniques to fit the arm kinematic model to the image; (ii) exploiting occlusions to determine which link is in the foreground; or (iii) using kinematic constraints to prune possible arm configurations.
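The depth recovery and the scale constraint (3) are simple enough to sketch directly; the sign argument below makes the ambiguity explicit (the paper resolves it by restricting the arm's working volume), and the function names are ours.

```python
import numpy as np

def link_depth(m1, m2, L, s, sign=+1.0):
    """Depth variation along one arm link: dZ = +/- sqrt(L^2 - (l/s)^2), with l = ||m1 - m2||."""
    l = np.linalg.norm(np.asarray(m1, dtype=float) - np.asarray(m2, dtype=float))
    radicand = L ** 2 - (l / s) ** 2
    return sign * np.sqrt(max(radicand, 0.0))   # clamp: l/s can slightly exceed L with noise

def minimal_scale(image_lengths, link_lengths):
    """Smallest admissible scale factor, from constraint (3): s >= max_i l_i / L_i."""
    return max(l / L for l, L in zip(image_lengths, link_lengths))
```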

4.1.2 Rigid Transformation (T )

A 3D rigid transformation is defined by three rotation angles and a translation vector. Since the arm joints are moving, they cannot be used as reference points. Instead, we consider the three points in Fig. 3: the left and right shoulders, (M_ls, M_rs), and the hip, M_hip, with image projections denoted by (m_ls, m_rs, m_hip). The transformation T is determined so as to translate and rotate these points until they coincide with those of the system's own body.

The translational component must place the demonstrator's right shoulder at the image origin (which coincides with the system's right shoulder) and can be defined directly in image coordinates:

t = −(^a m_rs)

After translating the image features, the remaining step consists in determining the rotation angles that align the shoulder line and the shoulder-hip contour. The angles of rotation about the z, y and x axes, denoted by φ, θ and ψ, are given by:

φ = arctan(v_ls / u_ls)
θ = arccos(u_hip / L_4)
ψ = arccos(v_hip / L_3)

Hence, by performing the image translation first and then the 3D rotation described in this section, we complete the process of aligning the image projections of the shoulders and hip with the ego-centric image coordinates.
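A sketch of this alignment step is given below; it follows the expressions above after translating the right shoulder to the origin. Whether the hip coordinates should additionally be divided by the scale factor s before normalizing by L_3 and L_4 is not spelled out in the text, so that detail is an assumption here.

```python
import numpy as np

def align_to_ego(m_ls, m_rs, m_hip, L3, L4):
    """Translation and rotation angles aligning the demonstrator's shoulders/hip
    with the ego-centric frame (right shoulder at the image origin)."""
    t = -np.asarray(m_rs, dtype=float)                 # translation: right shoulder -> origin
    u_ls, v_ls = np.asarray(m_ls, dtype=float) + t     # left shoulder after translation
    u_hip, v_hip = np.asarray(m_hip, dtype=float) + t  # hip after translation

    phi = np.arctan2(v_ls, u_ls)                           # rotation about z (shoulder line)
    theta = np.arccos(np.clip(u_hip / L4, -1.0, 1.0))      # rotation about y
    psi = np.arccos(np.clip(v_hip / L3, -1.0, 1.0))        # rotation about x
    return t, phi, theta, psi
```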

4.2 2D View-Point Transformation

The 2D VPT is used when one is not interested in imitating the depth variations of a certain movement, alleviating the need for a full 3D transformation. It can also be seen as a simplification of the 3D VPT if one assumes that the observed arm describes a fronto-parallel movement with respect to the camera.

The 2D VPT performs an image translation that aligns the shoulder of the demonstrator (^a m_s) with that of the system (at the image origin, by definition). The VPT can be written as:

VPT(^a m) = [−1 0; 0 1] (^a m − ^a m_s)        (4)

and is applied to the image projection of the demonstrator's hand or elbow, ^a m_h or ^a m_e.

Notice that when the arm used to imitate is the same as the demonstrator's, the imitated movement is a mirror image of the original. If we use the identity matrix in Equation (4) instead, the movement will be correct. At the image level both the 2D and 3D VPTs have the same result, but the 3D posture of the arm differs in the two cases.
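A few lines suffice to implement the 2D VPT of Equation (4); the mirror flag below switches between the default matrix diag(−1, 1) and the identity variant mentioned above.

```python
import numpy as np

def vpt_2d(m, m_shoulder, mirror=True):
    """2D View-Point Transformation (Eq. 4): shoulder-centred translation plus optional mirror.

    With mirror=True (the matrix diag(-1, 1) of Eq. 4) imitating with the same arm as the
    demonstrator produces a mirror image of the gesture; mirror=False uses the identity instead.
    """
    A = np.diag([-1.0, 1.0]) if mirror else np.eye(2)
    return A @ (np.asarray(m, dtype=float) - np.asarray(m_shoulder, dtype=float))
```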

From the biological standpoint, the 2D VPT is more plausible than the 3D version. In [13], several imitation behaviors are presented which are not always faithful to the demonstrated gesture: sometimes people do not care about using the correct hand, depth is irrelevant in other cases, movements can be reflections of the original ones, etc. The 3D VPT might be more useful in industrial facilities where gestures should be reproduced as exactly as possible.

5 Experiments

We have implemented the modules discussed in the previous sections to build a system able to learn by imitation. In all the experiments, we use a web camera to observe the demonstrator's gestures and a simulated robot arm to replicate those gestures.

We start by describing the approach used for hand tracking before presenting the overall imitation results. The position of the shoulder is assumed to be fixed. In the following sections we discuss the procedures used to perform imitation.

5.1 Hand Color Segmentation

To find the hand in the image we use a color segmentation scheme, implemented by a feed-forward neural network with three neurons in the hidden layer. As inputs we use the hue and saturation channels of the HSV color representation. The training data are obtained by selecting the hand and the background in a sample image. After color classification, a majority morphological operator is applied. The hand is identified as the largest blob found and its position is estimated over time with a Kalman filter. Figure 5 shows a typical result of this approach.

Figure 5: Skin color segmentation results.
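A compact sketch of this segmentation pipeline is given below, with scikit-learn's MLPClassifier standing in for the three-neuron feed-forward network and a median filter approximating the majority morphological operator; the Kalman smoothing of the blob centroid is omitted. OpenCV is used for the image operations, and the training-data helpers are hypothetical.

```python
import cv2
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_skin_classifier(hand_hs, background_hs):
    """Train a small feed-forward classifier on (hue, saturation) samples.
    hand_hs / background_hs: (N, 2) arrays picked by hand from one training image."""
    X = np.vstack([hand_hs, background_hs])
    y = np.hstack([np.ones(len(hand_hs)), np.zeros(len(background_hs))])
    clf = MLPClassifier(hidden_layer_sizes=(3,), max_iter=2000)
    return clf.fit(X, y)

def find_hand(frame_bgr, clf):
    """Classify pixels, clean the mask and return the centroid of the largest blob."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hs = hsv[..., :2].reshape(-1, 2).astype(float)
    mask = clf.predict(hs).reshape(hsv.shape[:2]).astype(np.uint8) * 255
    mask = cv2.medianBlur(mask, 5)                      # majority-like clean-up
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    if n < 2:                                           # only background was found
        return None
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    return centroids[largest]                           # to be smoothed with a Kalman filter
```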

5.2 Gesture Imitation

The first step to achieve imitation consists in training the system to learn the Sensory-Motor Map, as described in Section 3.3. This is accomplished by a neural network that estimates the SMM while the system performs a large number of arm movements.

The imitation process then consists of the following steps: (i) the system observes the demonstrator's arm movements; (ii) the VPT is used to transform these image coordinates to the ego-image, as proposed in Section 4; and (iii) the SMM generates the adequate joint angle references to execute the same arm movements.
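Putting the pieces together, one imitation cycle can be written as a short loop; all the callables below are placeholders for the components described in the previous sections.

```python
def imitation_step(frame, track_arm_features, vpt, smm, send_joint_commands):
    """One cycle of the imitation loop: observe, re-map to the ego image, map to joints."""
    allo_features = track_arm_features(frame)   # (i)   observe the demonstrator's arm
    ego_features = vpt(allo_features)           # (ii)  View-Point Transformation
    joint_angles = smm(ego_features)            # (iii) Sensory-Motor Map
    send_joint_commands(joint_angles)           # execute the gesture on the (simulated) arm
    return joint_angles
```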

Figure 6 shows experimental results obtained with the 3D VPT and the learned (full-arm) SMM. To assess the quality of the results, we overlaid the images of the executed arm gestures (wireframe) on those of the demonstrator. The figure shows that the quality of imitation is very good.

Figure 7 shows results obtained in real time (about 5 Hz) when using the 2D VPT and the free-elbow SMM. The goal of imitating the hand gesture is well achieved but, as expected, there are differences in the configuration of the elbow, particularly at more extreme positions.

These tests show that encouraging results can be obtained with the proposed framework under realistic conditions.

6 Conclusions and future work

We have proposed an approach for learning by imitation that relies exclusively on visual information provided by a single camera.

One of the main contributions is the View-Point Transformation, which performs a "mental rotation" of the image of the demonstrator's arm to the ego-image, as if the system were observing its own arm. In spite of the fundamental importance of the VPT in visual perception and in the psychology of imitation [7], it has received little attention from researchers in robotics.

Figure 6: The quality of the results can be assessed by the coincidence of the demonstrator's gestures and the result of imitation.

Figure 7: Family of solutions with different elbow angles, while the hand is faithfully imitated.

We described the two different VPTs needed for 3D or 2D imitation. The View-Point Transformation may also be of interest for mirror neuron studies, by providing a canonical frame of reference that greatly simplifies the recognition of arm gestures.

The observed actions are mapped into motor commands by the Sensory-Motor Map, which associates image features to motor acts. Again, two different types of SMM are proposed, depending on whether the task consists of imitating the entire arm or the hand position only. The SMM is learned automatically during a period of self-observation.

Experiments conducted to test the various subsystems have led to encouraging results, thus validating our approach to the problem.

Besides improvements to the feature detection component using shape and kinematic data, future work will focus on understanding the task goals to enhance the quality of imitation.

References

[1] S. Schaal. Is imitation learning the route to humanoid robots? Trends in Cognitive Sciences, 3(6):233–242, 1999.

[2] J. Yang, Y. Xu, and C. S. Chen. Hidden Markov model approach to skill learning and its application to telerobotics. IEEE Transactions on Robotics and Automation, 10(5):621–631, October 1994.

[3] T. G. Williams, J. J. Rowland, and M. H. Lee. Teaching from examples in assembly and manipulation of snack food ingredients by robot. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2300–2305, Oct. 29 - Nov. 3, 2001.

[4] A. D'Souza, S. Vijayakumar, and S. Schaal. Learning inverse kinematics. In International Conference on Intelligent Robots and Systems, pages 298–303, Maui, Hawaii, USA, 2001.

[5] V. S. Ramachandran. Mirror neurons and imitation learning as the driving force behind the great leap forward in human evolution. Edge, 69, June 2000.

[6] L. Fadiga, L. Fogassi, V. Gallese, and G. Rizzolatti. Visuomotor neurons: ambiguity of the discharge or 'motor' perception? International Journal of Psychophysiology, 35:165–177, 2000.

[7] J. S. Bruner. Nature and use of immaturity. American Psychologist, 27:687–708, 1972.


[8] M. Asada, Y. Yoshikawa, and K. Hosoda. Learning by observation without three-dimensional reconstruction. In Intelligent Autonomous Systems (IAS-6), pages 555–560, 2000.

[9] A. Billard and G. Hayes. DRAMA, a connectionist architecture for control and learning in autonomous robots. Adaptive Behaviour, 7(1):35–63, 1999.

[10] M. J. Matarić. Sensory-motor primitives as a basis for imitation: Linking perception to action and biology to robotics. In K. Dautenhahn and C. Nehaniv, editors, Imitation in Animals and Artifacts. MIT Press, 2000.

[11] G. Metta, G. Sandini, L. Natale, and F. Panerai. Sensorimotor interaction in a developing robot. In First International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems, pages 18–19, Lund, Sweden, September 2001.

[12] R. L. Gregory. Eye and Brain: The Psychology of Seeing. Princeton University Press, Princeton, New Jersey, 1990.

[13] P. Rochat. Ego function of early imitation. In A. N. Meltzoff and W. Prinz, editors, The Imitative Mind. Cambridge University Press, 2002.

[14] C. Taylor. Reconstruction of articulated objects from point correspondences in a single uncalibrated image. Computer Vision and Image Understanding, 80:349–363, 2000.
