Int J Soc Robot (2013) 5:619–626. DOI 10.1007/s12369-013-0208-9

Representing Affective Facial Expressions for Robots and Embodied Conversational Agents by Facial Landmarks

Caixia Liu · Jaap Ham · Eric Postma · Cees Midden · Bart Joosten · Martijn Goudbeek

Accepted: 19 July 2013 / Published online: 7 September 2013 © Springer Science+Business Media Dordrecht 2013

Abstract Affective robots and embodied conversational agents require convincing facial expressions to make them socially acceptable. To be able to virtually generate facial expressions, we need to investigate the relationship between technology and human perception of affective and social signals. Facial landmarks, the locations of the crucial parts of a face, are important for the perception of the affective and social signals conveyed by facial expressions. Earlier research did not use digital landmark-extraction technology, but rather used analogue techniques to generate point-light faces. The goal of our study is to investigate whether digitally extracted facial landmarks contain sufficient information to enable the facial expressions to be recognized by humans. This study presented participants with facial expressions encoded as moving landmarks in facial-landmark videos, which were extracted by face-analysis software from full-face videos of acted emotions. The facial-landmark videos were presented to 16 participants who were instructed to classify the sequences according to the emotion represented. Results revealed that for three out of five facial-landmark videos (happiness, sadness and anger), participants were able to recognize the emotions accurately, but for the other two facial-landmark videos (fear and disgust), recognition accuracy was below chance, suggesting that landmarks convey some, though not all, of the information about the expressed emotions. Results also show that emotions with high levels of arousal and valence are better recognized than those with low levels of arousal and valence. We argue that the question of whether these digitally extracted facial landmarks are a basis for representing facial expressions of emotions is crucial for the development of successful human-robot interaction in the future. We conclude that landmarks provide a basis for the virtual generation of emotions in humanoid agents, and discuss how additional facial information might be included to provide a sufficient basis for faithful emotion identification.

C. Liu (B) · J. Ham · C. Midden
Human-Technology Interaction Group, Department of Industrial Engineering and Innovation Sciences, Eindhoven University of Technology, Eindhoven, The Netherlands
e-mail: [email protected]

J. Ham
e-mail: [email protected]

C. Midden
e-mail: [email protected]

C. Liu · E. Postma · B. Joosten · M. Goudbeek
Tilburg Center for Cognition and Communication, Tilburg University, Tilburg, The Netherlands

E. Postma
e-mail: [email protected]

B. Joosten
e-mail: [email protected]

M. Goudbeek
e-mail: [email protected]

Keywords Robots · Embodied conversational agents · Emotion · Facial expression · Facial landmarks · FaceTracker

1 Introduction

Socially aware robots and embodied conversational agents (ECAs) require accurate recognition and generation of facial expressions. The automatic recognition of facial expressions is an active field of research in affective computing and social signal processing [1]. The virtual generation of facial expressions is still in its infancy.

Human faces only have a limited range of movements. Facial expressions may be fairly subtle changes in the proportions and relative positions of the facial muscles. Emotional facial expressions convey social signals that are dominant units of social communication among humans. For example, a smile can indicate approval or happiness, while a frown can signal disapproval or unhappiness. The circumplex model of emotion, as developed by Russell [2] and Mondloch [3], provides a useful characterization of emotions and the associated facial expressions in terms of arousal and valence. Figure 1 displays Mondloch's visualization of the circumplex model with the locations of five emotions (happiness, sadness, fear, anger, and disgust). The model provides a representation of the five emotions in terms of the dimensions of arousal (vertical) and valence (horizontal). For instance, sadness is characterized by a negative valence and low arousal, whereas happiness is associated with a positive valence and high arousal. The locations of the emotions have been established in empirical studies of emotion (see e.g., [4]).
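To make the dimensional representation concrete, the sketch below encodes the circumplex as a small lookup table of (valence, arousal) coordinates. Only the qualitative placement of happiness and sadness is stated above; the remaining coordinates are illustrative placeholders (assumptions), not values taken from Russell [2] or Mondloch [3].

```python
# Illustrative sketch: the circumplex as (valence, arousal) pairs on a [-1, 1] scale.
# Happiness and sadness follow the qualitative description in the text; the other
# three placements are assumed placeholders for the quadrants suggested by Fig. 1.
CIRCUMPLEX = {
    "happiness": (+0.8, +0.6),   # positive valence, high arousal (stated in the text)
    "sadness":   (-0.6, -0.5),   # negative valence, low arousal (stated in the text)
    "fear":      (-0.6, +0.7),   # assumed placement
    "anger":     (-0.7, +0.8),   # assumed placement
    "disgust":   (-0.7, +0.3),   # assumed placement
}

def describe(emotion: str) -> str:
    """Return a coarse description of an emotion's position in the circumplex."""
    valence, arousal = CIRCUMPLEX[emotion]
    v = "positive" if valence >= 0 else "negative"
    a = "high" if arousal >= 0 else "low"
    return f"{v} valence, {a} arousal"

print(describe("sadness"))   # -> "negative valence, low arousal"
```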

1.1 Facial Expressions in Robots and ECAs

In future computational devices, Embodied Conversational Agents (ECAs) will interact with humans by understanding and producing speech, gestures and facial expressions. ECAs may be physical robots or virtual cartoon-like or human-like renderings of bodies or faces displayed on a screen. The virtual generation of facial expressions in ECAs is a rapidly developing field of research. In the domain of robotics, a notable example is Leonardo [5–7], a socially aware robot created at the MIT Media Lab. Leonardo is able to establish and maintain eye contact, and to express a wide range of emotions by moving its mouth, arms, and ears. Figure 2 gives an impression of Leonardo's appearance. Leonardo was shown to be successful in establishing social contact with human observers. Due to the limited degrees of freedom of Leonardo's facial features, its facial expressions are not as complex as those of humans.

Fig. 1 Schematic representation of the circumplex model (based on [3]) with the locations of the five emotions we used in the current study

In ECAs, richer facial expressions may be generated by cloning the expressions of actors by means of motion capture (especially in the movie industry) or by means of real-time digital cloning of human facial expressions (from video) onto the face of an avatar [8]. Recent work on the generation of realistic facial expressions relies on three-dimensional models of human heads and on models of the muscles controlling the face. An alternative and more perceptually motivated approach is to define a representation space of facial expressions that can be extracted from perceivable features of faces.

1.2 Representing Facial Expressions

In a famous study by Johansson [9], participants watched videos of lights attached to the joints of walking people against an otherwise black background. Participants were able to identify familiar persons from their gait as reflected in the dynamics of the lights, as well as their gender and the nature of the task that they were engaged in. Apparently, the dynamics of light sources at informative locations (the joints) contains sufficient information to represent gait. Similarly, tiny light sources attached to the informative locations of the face may be sufficient to represent facial expressions. This idea was tested by Bassili [10].

Fig. 2 Impression of Leonardo [5]


In his study, the faces of some participants (the actors) were painted black with about 100 small white dots of paint superimposed, forming an approximately uniform grid of white dots (facial landmarks) covering the entire face. Some actors were instructed to sequentially express six emotions (happiness, sadness, surprise, fear, anger, and disgust). Other participants (the raters) were instructed to identify the emotions. The experimental setting ensured that the raters could only see the dots and not the texture of the surrounding black skin. Results showed that the raters could identify the six emotional expressions reasonably well. Table 1 contains a reproduction of the results obtained by Bassili. Rows list the emotions displayed by the actors and the columns the emotions reported by the raters. The entries in the table represent the percentages of correct recognition of the emotions for the fully displayed faces (left of the slash) and their white-dot representations (right of the slash).

In a more recent study, Tomlinson et al. [11] employed a point-light (PL) imaging technique to visualize facial expressions. The main difference from Bassili's experiment was the use of 72 light-emitting markers that were attached to an actor's face. Despite this difference, both studies were similar in their use of landmarks that are attached physically to the face, either by paint or by light-emitting markers. Recent developments in digital image recognition and processing [12] allow for the digital extraction of facial landmarks from video sequences of facial expressions.

Table 1 Bassili’s emotion recognition results for full faces/dot-patterned faces expressed as percentages of correct recognition (repro-duced from [10])

Displayedemotion

Reported emotion

Happiness Sadness Fear Surprise Anger Disgust

Happiness 31/31 6/13 0/6 38/31 19/6 6/13

Sadness 13/0 56/25 0/25 0/13 0/0 31/37

Fear 0/0 0/19 69/6 13/25 6/13 12/37

Surprise 0/0 0/6 6/0 94/75 0/6 0/13

Anger 6/0 0/13 0/0 0/6 50/6 44/75

Disgust 0/6 0/6 0/19 0/19 6/6 88/57

1.3 Goals of Our Study

The goal of our study is to investigate whether digitally extracted facial landmarks contain sufficient information to enable the facial expressions to be recognized by humans. In addition, we aim at relating our results to the circumplex model by investigating and comparing valence judgments and arousal judgments that participants make about full-face videos and facial-landmark videos. The insights obtained from our investigations will support our future work on the generation of facial emotions in ECAs.

Our study presented participants with facial expressions encoded as moving landmarks: facial-landmark sequences extracted by FaceTracker software [13] from full-face videos of acted emotions. The extracted landmarks were visualized as image sequences of white dots against a dark background, and these image sequences were converted to videos with a frame rate corresponding to the landmark sampling frequency. With the videos so obtained, we investigated how accurately expressed emotions can be perceived from the dynamics of the landmarks. If these facial landmarks contain sufficient information, it should be possible to recognize and generate emotional expressions based on the landmarks only. Our main research question is how well participants recognize the emotions of facial expressions in the facial-landmark videos as compared to their recognition of emotions in the full-face equivalents.
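As a rough illustration of the visualization step just described (not the authors' own code), the sketch below renders per-frame landmark coordinates as white dots on a black background and writes them to a video whose frame rate matches the landmark sampling frequency. It assumes OpenCV and NumPy are available; because the FaceTracker output format is not described in the paper, synthetic landmark data are used for the example.

```python
# Minimal sketch: render per-frame landmark coordinates as white dots on black
# frames and write them out as a video at the landmark sampling frequency.
import cv2
import numpy as np

def landmarks_to_video(landmark_frames, out_path, size=(640, 480), fps=25.0):
    """landmark_frames: iterable of (N, 2) arrays of (x, y) pixel coordinates."""
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(out_path, fourcc, fps, size)
    for points in landmark_frames:
        frame = np.zeros((size[1], size[0], 3), dtype=np.uint8)   # black background
        for x, y in points:
            cv2.circle(frame, (int(x), int(y)), 2, (255, 255, 255), -1)  # white dot
        writer.write(frame)
    writer.release()

# Example with synthetic data: 50 frames of 66 gently jittering landmarks.
rng = np.random.default_rng(0)
base = rng.uniform([100, 100], [540, 380], size=(66, 2))
frames = [base + rng.normal(0, 1.5, size=base.shape) for _ in range(50)]
landmarks_to_video(frames, "landmark_demo.mp4", fps=25.0)
```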

2 Digital Extraction of Facial Landmarks

To be able to present participants with facial-landmark videos representing facial expressions, we need information about the locations and movements of the crucial parts of a face while that face is displaying facial expressions. These locations might be calculated based on models of human faces and facial expressions [14, 15], but they can also be extracted directly from real human facial expressions.

As stated in the previous section, we used FaceTracker software (see Fig. 3) to extract this information from full-face videos of actors displaying the five emotions (happiness, sadness, fear, disgust and anger) shown in the visualization of the circumplex model in Fig. 1.

Fig. 3 Digital extraction of landmarks by FaceTracker. The left image shows the face of one of the authors with a grid of landmarks superimposed. The right image is an illustration of the extracted landmarks as they appear in the facial-landmark video

FaceTracker operates in real time and returns a mesh of interconnected landmarks that are located at the contours of the eyes, at the nose, at the mouth and at other facial parts. In our experiment, 66 landmarks were used, covering the prominent locations of the face (see Fig. 3).
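The extraction step in the paper relied on FaceTracker [13]. As a hedged stand-in for readers without that software, the sketch below shows the same idea with dlib's 68-point landmark model (a different but comparable mesh); the model file path is an assumption and must be obtained separately.

```python
# Sketch of per-frame landmark extraction from a full-face video, using dlib's
# 68-point predictor as a substitute for FaceTracker (assumed model file path).
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # assumed path

def extract_landmarks(video_path):
    """Return one list of (x, y) landmark coordinates per frame of the video."""
    capture = cv2.VideoCapture(video_path)
    all_frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)            # upsample once to find smaller faces
        if faces:
            shape = predictor(gray, faces[0])
            points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
            all_frames.append(points)
    capture.release()
    return all_frames
```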

3 Methods

3.1 Participants and Design

Sixteen healthy adult participants took part in the study. None of them were familiar with the purpose of the study. All participants were students at Eindhoven University of Technology with Dutch as their native language. Their average age was 25.6 years (SD = 10.46). The study used full-face videos and facial-landmark videos of five emotions, recorded by four actors (two male and two female). Each participant was presented with ten full-face videos (two actors, one male (M1) and one female (F1), each expressing five emotions in five separate videos), and also with ten facial-landmark videos (based on the other two actors, the other male (M2) and the other female (F2), each expressing five emotions in five separate videos). The two actors shown in a participant's facial-landmark videos were always different from the two actors shown in that participant's full-face videos, to prevent recognition of a previously seen expression or actor. Thus each participant saw videos of four different actors: two actors appeared in the facial-landmark videos and the other two in the full-face videos (with the assignment reversed for other participants).

The block of full-face videos and the block of landmark videos were presented one after the other. Both the order of presentation of the blocks and the gender of the actors featuring in the videos were counterbalanced. The design was a within-subject design (each participant was confronted with both conditions, namely full-face videos and facial-landmark videos) and the dependent variable was recognition accuracy or classification performance.

So, even though an individual participant saw different actors in the full-face videos and the facial-landmark videos, across all participants the same set of actors appeared in both the full-face videos and the facial-landmark videos. As a consequence, the final results could not be influenced by differences in the acting skills of the individual actors.
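The counterbalancing described above can be made explicit with a small scheduling sketch. The exact assignment scheme used by the authors is not fully specified, so the block-order and actor-set rules below are illustrative assumptions.

```python
# Illustrative sketch of the counterbalanced within-subject design (assumptions:
# even/odd participant ids alternate block order and actor-set assignment).
import random

EMOTIONS = ["happiness", "sadness", "fear", "disgust", "anger"]

def participant_schedule(participant_id, seed=0):
    """Build one participant's 20-trial schedule: two blocks of ten videos."""
    rng = random.Random(seed + participant_id)
    landmark_first = participant_id % 2 == 0          # counterbalance block order
    swap_actors = (participant_id // 2) % 2 == 0      # counterbalance actor sets
    full_actors, landmark_actors = ((["M1", "F1"], ["M2", "F2"]) if swap_actors
                                    else (["M2", "F2"], ["M1", "F1"]))

    def block(actors, video_type):
        trials = []
        for actor in actors:
            five = [(video_type, actor, emotion) for emotion in EMOTIONS]
            rng.shuffle(five)          # each set of five emotions in a random order
            trials.extend(five)
        return trials

    full_block = block(full_actors, "full-face")
    landmark_block = block(landmark_actors, "facial-landmark")
    return landmark_block + full_block if landmark_first else full_block + landmark_block

print(len(participant_schedule(3)))    # -> 20 trials
```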

3.2 Stimulus Materials

The stimulus materials were based on the GEMEP corpus [16, 17], featuring full-face videos of actors exhibiting emotional expressions. For the full-face videos, we used four actors (two male and two female). Of each actor, we used five short videos (average length 2 seconds), each showing the face and upper torso of the actor while he or she acted as if experiencing an assigned emotion. That is, for each actor, we used five videos representing five different emotions: happiness, sadness, fear, disgust, and anger (there were no "surprise" videos in our data set, so we did not include this emotion in our experiment).

Based on these full-face videos, we constructed the facial-landmark videos by applying the FaceTracker software to extract 66 landmarks, consisting of locations indicating the eyebrows, eyes, nose, mouth, and face profile, and rendering them as white points on a black background. Each facial-landmark video was based on one full-face video. In total, we thus employed four (different actors) times five (different emotions) full-face videos of actors expressing emotions, and four times five facial-landmark videos of the same emotions.

Each participant was shown full-face videos of two actors (five videos per actor, expressing the five emotions), and facial-landmark videos based on the other two actors in our set of four (to prevent interference as a result of identification of the actor), again expressing the five emotions. Within each trial, the video segment (full-face or facial-landmark) was displayed three times. So each participant was shown twenty different videos, five emotions times four different actors (two actors in full-face videos, the other two in facial-landmark videos), each displayed three times.

3.3 Procedure

Participants took part individually, in a cubicle that contained a desktop computer and a display. All instructions and stimulus materials were shown on the computer display, and the experiment was completely computer controlled. Each participant was instructed that he or she would be shown 20 short videos of faces expressing emotions, and that sometimes a video would be a full-face video and sometimes a facial-landmark video. Each video was shown to the participant three times. Also, participants were instructed that after each video they would be asked three questions about the emotion expressed by the face in the video. Each of these three questions was explained. Each participant was presented with ten full-face videos and, on different screens, also with ten facial-landmark videos (see Fig. 4). Half of the observers were presented with the full-face videos first and the facial-landmark videos second. The other observers were presented with the facial-landmark videos first and the full-face videos second. Within each set of five emotions, the five videos were displayed in a different random order.

3.4 Measures

For each of the videos, participants were first shown the video and then, on the next page, asked the three questions.


Fig. 4 An example of a frame of a full-face video and a frame of a facial-landmark video

In the first question, the participant was asked to identify the emotion expressed in the video by selecting one of six options: "happiness", "sadness", "fear", "disgust", "anger", and "don't know" (the "don't know" option was added so that participants were not forced into a choice and would not respond with a random guess). In the second and third questions, the participant was asked to rate the valence level of the expressed emotion (1 = negative, to 7 = positive, or "don't know") and the arousal level of the expressed emotion (1 = low arousal, to 7 = high arousal, or "don't know"). The questions could only be answered one by one, and a participant could not return to an earlier question. As described in Sect. 3.2, each facial-landmark video was constructed from one full-face video by using the FaceTracker software to extract 66 landmarks (the locations of the eyebrows, eyes, nose, mouth, and face profile), rendered as white points on a black background.

4 Results

The results of our experiment are listed in Table 2 and visualized in Fig. 5. The numbers in the table represent the percentages of correct recognition for full-face videos (left of the slash) and facial-landmark videos (right of the slash). Each row represents the responses of 16 participants in each of the two conditions. Although it was not our intention to replicate Bassili's study (there are many methodological differences between Bassili's study and ours), the reader may compare our results to those listed in Table 1. Figure 5 shows the results as a bar chart of the recognition accuracy (percentage) for each emotion, for the full-face videos (black bars) and the facial-landmark videos (white bars). The results obtained in the experiment give rise to three observations.

The first observation is that the participants could relatively accurately identify emotional facial expressions from the full-face videos. The average percentage of correct recognition across the five emotions was 80 %, and ranged from 65 % (disgust) to 100 % (anger), where 16.7 % accuracy would be expected by chance. The average accuracy was significantly above chance level, χ2(1) = 128.36, p < 0.001.

Table 2 Participants’ identification of emotion displayed on full-facevideos or facial-landmark videos

Displayedemotion

Reported emotion

Happiness Sadness Fear Disgust Anger Don’t know

Happiness 91/53 0/3 3/6 0/9 0/0 6/28

Sadness 0/3 66/38 6/16 13/ 6 0/9 16/28

Fear 0/22 13/9 78/ 9 0/3 9/16 0/41

Disgust 0/9 29/16 3/9 65/13 0/16 3/38

Anger 0/16 0/0 0/9 0/6 100/56 0/13

Fig. 5 Recognition accuracies (percentage) for full-face videos (black bars) and facial-landmark videos (white bars)

More specifically, post-hoc chi-square comparisons indicated that recognition accuracy was significantly above chance level for happiness (91 %; χ2(1) = 35.54, p < 0.001), sadness (66 %; χ2(1) = 16.04, p < 0.001), fear (78 %; χ2(1) = 24.12, p < 0.001), disgust (65 %; χ2(1) = 15.45, p < 0.001), and anger (100 %; χ2(1) = 45.68, p < 0.001).

The second observation is that recognition of emotions from facial-landmark videos was less accurate than for the full-face videos. The participants could relatively accurately identify three emotional facial expressions (happiness, sadness, and anger) based on the facial-landmark videos, and recognition accuracy for these three emotions was above chance, but this did not hold for the other two emotions (fear and disgust). The mean accuracy in the recognition of emotions on the basis of facial-landmark videos was 33.8 % and ranged from 9 % (fear) to 56 % (anger), where 16.7 % accuracy would be expected by chance.


Table 3 Participants’ evaluation of the valence and arousal levels ofthe facial emotional expressions

Displayedemotion

Reported emotion

Valence (Negative–Positive) Arousal (Low–High)

Happiness 6.4/4.9 5.8/4.6

Sadness 2.4/3.2 3.2/2.8

Fear 2.0/3.8 5.7/4.1

Disgust 1.9/3.6 5.7/4.3

Anger 1.3/2.8 6.6/5.8

This average accuracy level was significantly above chance level (16.7 %), χ2(1) = 12.39, p < 0.001. More specifically, post-hoc chi-square comparisons indicated that recognition accuracy was significantly above chance level for happiness (53 %; χ2(1) = 9.29, p < 0.01), sadness (38 %; χ2(1) = 3.65, p < 0.05), and anger (56 %; χ2(1) = 10.68, p < 0.01), but significantly below chance level for fear (9 %; χ2(1) = 0.85, p < 0.05) and disgust (13 %; χ2(1) = 0.17, p < 0.05).

As a final observation, a comparison of the overall accuracy rate on the basis of full-face videos and facial-landmark videos reveals that the former was significantly higher than the latter, χ2(1) = 69.63, p < 0.001 (McNemar's chi-square test).
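For readers who wish to reproduce this kind of analysis, the sketch below shows a chi-square goodness-of-fit test against the 16.7 % (1/6) chance level and a McNemar-style test on the discordant pairs of the two conditions. The per-trial counts behind the reported percentages are not listed in the paper, so the example counts below are hypothetical.

```python
# Hedged sketch of the reported statistics with hypothetical counts.
from scipy.stats import chisquare, chi2

def accuracy_vs_chance(n_correct, n_trials, chance=1 / 6):
    """Chi-square goodness-of-fit test of recognition accuracy against chance."""
    observed = [n_correct, n_trials - n_correct]
    expected = [n_trials * chance, n_trials * (1 - chance)]
    return chisquare(observed, f_exp=expected)      # (statistic, p-value)

def mcnemar(full_only_correct, landmark_only_correct):
    """McNemar's chi-square (continuity-corrected) on the discordant pairs."""
    b, c = full_only_correct, landmark_only_correct
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = chi2.sf(statistic, df=1)
    return statistic, p_value

# Hypothetical example: 128 of 160 full-face trials correct vs. chance,
# and 75 vs. 5 discordant pairs between the two conditions.
print(accuracy_vs_chance(128, 160))
print(mcnemar(75, 5))
```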

Because the original videos produced by the four actors sometimes yielded erroneous or confusable facial expressions, we assessed the similarity of the response patterns for the full-face and facial-landmark videos. The correlation coefficient of the responses given for full-face videos and facial-landmark videos (as shown in each cell of Table 2) was r = 0.77, p < 0.001. This significant correlation indicates that participants' errors were far from random: the errors they made for the facial-landmark videos were similar to those they made for the full-face videos.

Furthermore, the participants' judgments about the valence and arousal levels of the emotional facial expressions displayed in full-face videos show a strong correlation with those displayed in facial-landmark videos. The correlation between valence evaluations given for full-face videos and facial-landmark videos (as shown in Table 3) was r = 0.91, p < 0.05, and the correlation between participants' arousal evaluations of the full-face videos and the facial-landmark videos was r = 0.93, p < 0.05.
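These correlations can be checked directly from the mean ratings in Table 3; the sketch below uses scipy's pearsonr on the printed values, so small rounding differences from the reported coefficients are possible.

```python
# Correlation of full-face and facial-landmark mean ratings (values from Table 3,
# ordered happiness, sadness, fear, disgust, anger).
from scipy.stats import pearsonr

valence_full = [6.4, 2.4, 2.0, 1.9, 1.3]
valence_landmark = [4.9, 3.2, 3.8, 3.6, 2.8]
arousal_full = [5.8, 3.2, 5.7, 5.7, 6.6]
arousal_landmark = [4.6, 2.8, 4.1, 4.3, 5.8]

r_val, p_val = pearsonr(valence_full, valence_landmark)
r_aro, p_aro = pearsonr(arousal_full, arousal_landmark)
print(f"valence: r = {r_val:.2f} (reported 0.91), arousal: r = {r_aro:.2f} (reported 0.93)")
```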

Results revealed that for three out of five facial-landmark videos (happiness, sadness, and anger), participants were able to recognize emotions accurately, but for the other two facial-landmark videos (fear and disgust), their recognition accuracy was below chance level, suggesting that landmarks contain some information about the expressed emotions.

Table 3 lists the average levels of participants' valence and arousal evaluations of the emotion corresponding to the row label. Numbers on the left of the slash are for responses to the full-face videos, and those to the right are for the facial-landmark videos. Each row represents the responses of sixteen participants in each of the two conditions.

Fig. 6 The location of each emotional expression evaluated in the full-face videos ("◦") and facial-landmark videos ("•"), represented in a 2-dimensional space of valence (X-axis) and arousal (Y-axis)

Figure 6 illustrates the location of each emotional facial expression in a 2-dimensional space of valence (X-axis) and arousal (Y-axis). The white markers represent the results for the full-face videos and the black markers represent the results for the facial-landmark videos. Markers associated with the same emotion in both types of videos are connected by a line. Figure 6 suggests that the emotional facial expressions of full-face videos and facial-landmark videos are judged to lie in the same region of this 2-dimensional space of valence and arousal. The results for the valence and arousal levels of full-face videos and facial-landmark videos are similar. The results shown in Fig. 6 may be compared to the "ideal" of the circumplex model shown in Fig. 1. Clearly, there is a better match for the full-face videos ("◦" in Fig. 6) than for the facial-landmark videos ("•" in Fig. 6). Apparently, moving from full-face videos to landmark videos results in an overall decrease of valence and arousal.
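A Fig. 6-style plot can be reconstructed from the Table 3 means; the sketch below (assuming matplotlib is available) draws open circles for the full-face ratings, filled circles for the facial-landmark ratings, and a connecting line per emotion.

```python
# Plot the Table 3 means in a valence-arousal space, Fig. 6 style.
import matplotlib.pyplot as plt

emotions = ["happiness", "sadness", "fear", "disgust", "anger"]
full = {"valence": [6.4, 2.4, 2.0, 1.9, 1.3], "arousal": [5.8, 3.2, 5.7, 5.7, 6.6]}
landmark = {"valence": [4.9, 3.2, 3.8, 3.6, 2.8], "arousal": [4.6, 2.8, 4.1, 4.3, 5.8]}

fig, ax = plt.subplots()
for i, name in enumerate(emotions):
    xs = [full["valence"][i], landmark["valence"][i]]
    ys = [full["arousal"][i], landmark["arousal"][i]]
    ax.plot(xs, ys, color="gray", linewidth=1)                       # connect the pair
    ax.annotate(name, (xs[0], ys[0]), textcoords="offset points", xytext=(4, 4))
ax.scatter(full["valence"], full["arousal"], facecolors="none",
           edgecolors="black", label="full-face")                    # open circles
ax.scatter(landmark["valence"], landmark["arousal"], color="black",
           label="facial-landmark")                                  # filled circles
ax.set_xlabel("Valence (1 = negative, 7 = positive)")
ax.set_ylabel("Arousal (1 = low, 7 = high)")
ax.legend()
plt.show()
```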

5 Discussion and Conclusion

For the development of successful future human-robot interactions, the current research has investigated whether digitally extracted facial landmarks contain sufficient information to enable the facial expressions to be recognized by humans. To this end, our study presented participants with facial expressions encoded in moving landmarks, which were extracted by FaceTracker software from full-face videos of acted emotions. This allowed us to answer the question of how well participants recognize the emotions of facial expressions in the facial-landmark videos as compared to their recognition of emotions in the full-face equivalents. Results suggested that participants were able to recognize three out of five emotional facial expressions from the reduced landmark representations. Emotion recognition for full-face videos was better than for the reduced landmark videos.

Our results contribute to the existing literature on human-robot interaction by establishing that facial landmarks can be useful in the differentiation of emotions. Apparently, the landmarks form an informative but incomplete basis for the representation of emotional facial expressions in humanoid agents. This may be due to the limited number of landmarks, but a more likely explanation is the lack of shape, texture, and color information in the landmark videos. It is known that eyebrows and facial color are important for facial recognition [18]. The same may apply to emotional expression recognition. An additional explanation may be that the facial landmarks extracted by FaceTracker software do not include all extremities of the head (e.g., the top of the head). Finally, the current research investigated flat, 2-dimensional faces. Future research can also focus on the emotional facial expressions of faces presented in three dimensions. Indeed, for designing future social robots [19, 20], we need knowledge of the accuracy of recognition of facial expressions in three dimensions. Future research will be directed towards (i) the variation in the number and distribution of digitally extracted landmarks and (ii) the inclusion of additional visual cues (texture and color), to create recognizable emotional facial expressions.

To conclude, our results suggest that participants could accurately identify three out of five emotional facial expressions in facial-landmark videos (though less accurately than those expressed in full-face videos). Ultimately, our ambition would be to design ECAs with expressive faces. The results obtained here contribute to the literature on social robotics (and on human-ECA interaction) and lead to the conclusion that landmarks can be a useful but limited basis for generating virtual emotions in ECAs. This suggests that landmarks are important components for generating or displaying facial expressions of avatars.

Acknowledgements The authors acknowledge the anonymous reviewers for their constructive and detailed comments on an earlier version of this paper. We wish to express our gratitude to Ruud Mattheij, Peter Ruijten, and the Persuasive Technology Lab Group at TU/e for the fruitful discussions about this work. The first author also appreciates the scholarship for her Ph.D. project from the China Scholarship Council.

References

1. Vinciarelli A, Pantic M, Bourlard H (2009) Social signal processing: survey of an emerging domain. Image Vis Comput 27(12):1743–1759

2. Russell JA (1997) Reading emotions from and into faces: resurrecting a dimensional-contextual perspective. In: Russell JA, Fernandez-Dols JM (eds) The psychology of facial expressions. Cambridge University Press, New York, pp 295–320

3. Mondloch CJ (2012) Sad or fearful? The influence of body posture on adults' and children's perception of facial displays of emotion. J Exp Child Psychol 111:180–196

4. Aviezer H, Hassin R, Bentin S, Trope Y (2008) Putting facial expressions back in context. In: Ambady N, Skowronski JJ (eds) First impressions. Guilford, New York, pp 255–286

5. Breazeal CL Designing social robots. Personal Robots Group in MIT Media Lab, Cambridge

6. Breazeal CL (2000) Sociable machines: expressive social exchange between humans and robots. Diss Massachusetts Institute of Technology, pp 178–184

7. Breazeal CL (2003) Emotion and sociable humanoid robots. Int J Hum-Comput Stud 59(1):119–155

8. Saragih JM, Lucey S, Cohn JF, Court T (2011) Real-time avatar animation from a single image. In: Automatic face & gesture

9. Johansson G (1975) Visual motion perception. Sci Am 232:76–88

10. Bassili JN (1978) Facial motion in the perception of faces and of emotional expression. J Exp Psychol Hum Percept Perform 4:373–379

11. Tomlinson EK, Jones CA, Johnston RA, Meaden A, Wink B (2006) Facial emotion recognition from moving and static point-light images in schizophrenia. Schizophr Res 85(1–3):96–105

12. Saragih J, Lucey S, Cohn J (2011) Deformable model fitting by regularized landmark mean-shift. Int J Comput Vis 91:200–215

13. Lucey P, Lucey S, Cohn JF (2010) Registration invariant representations for expression detection. In: International conference on digital image computing: techniques and applications, pp 255–261

14. Alexander O, Rogers M, Lambeth W, Chiang M, Debevec P (2009) Creating a photoreal digital actor: the digital Emily project. In: Conference for visual media production, pp 176–187

15. Yang C, Chiang W (2007) An interactive facial expression generation system. Springer, Berlin

16. Bänziger T, Scherer KR (2010) Introducing the Geneva Multimodal Emotion Portrayal (GEMEP) corpus. In: Scherer KR, Bänziger T, Roesch EB (eds) Blueprint for affective computing: a sourcebook. Oxford University Press, Oxford, pp 271–294

17. Bänziger T, Mortillaro M, Scherer KR (2011) Introducing the Geneva multimodal expression corpus for experimental research on emotion perception. Emotion. doi:10.1037/a0025827

18. Sinha P, Balas B, Ostrovsky Y, Russell R (2006) Face recognition by humans: nineteen results all computer vision researchers should know about. Proc IEEE 94(11):1948–1962

19. Cheng L, Lin C, Huang C (2012) Visualization of facial expression deformation applied to the mechanism improvement of face robot. Int J Soc Robot. doi:10.1007/s12369-012-0168-5

20. Kedzierski J, Muszynski R, Zoll C, Oleksy A, Frontkiewicz M (2013) EMYS—Emotive head of a social robot. Int J Soc Robot 5(2):237–249

Caixia Liu received her B.Sc. degree in Mathematics from Xinyang Normal University and M.Sc. degree in Computer Science from The PLA Information Engineering University in China. Currently she is doing research as a Ph.D. student on the topic of automatic generation of emotional facial expressions in Embodied Conversational Agents (ECAs) and their evaluation by humans, in the Human-Technology Interaction research group at Eindhoven University of Technology and in Artificial Intelligence in the Tilburg Center for Cognition and Communication of Tilburg University.

Jaap Ham is an Associate Professor in the Human-Technology Interaction research group at Eindhoven University of Technology. He studies the psychology of human-technology interaction, investigating social robotics, (ambient) persuasive technology, and trust and acceptance of technology.

Eric Postma is a Full Professor in Artificial Intelligence in the Tilburg Center for Cognition and Communication of Tilburg University. His main interests include image recognition and analysis, social signal processing and cognitive modeling.

Cees Midden is a Full Professor of Human Technology Interaction and chair of the Human-Technology Interaction group at Eindhoven University of Technology, The Netherlands. His research focus is on the social psychological factors of human-technology interactions as these become apparent in the consumption and use of products and systems. He published various books and articles on environmental consumer behavior, on persuasive communication and the perception and communication of technological risks. In 2006, he chaired the first International Conference on Persuasive Technology.

Bart Joosten received his B.Sc. degree in knowledge management from Maastricht University. He received his M.A. in communication and information sciences in the Master track Human Aspects of Information Technology. Currently he is doing research as a Ph.D. student on the topic of digital analysis of facial expressions.

Martijn Goudbeek is an Assistant Professor of Communication and Information Sciences at the department of Humanities at the University of Tilburg, The Netherlands. He obtained his Ph.D. at the MPI for Psycholinguistics in Nijmegen and was a postdoctoral fellow at the Swiss Center for Affective Science at the University of Geneva. His research interests include emotional expression and encoding as well as the production of spoken language, in both a psycholinguistic and a computational linguistic approach.