
SUBMISSION TO IROS 2001

Emotive Qualities in Robot Speech

Cynthia Breazeal

Abstract- This paper explores the expression of emotion in synthesized speech for an anthropomorphic robot. We have adapted several key emotional correlates of human speech to the robot's speech synthesizer to allow the robot to speak in either an angry, calm, disgusted, fearful, happy, sad, or surprised manner. We have evaluated our approach through acoustic analysis of the speech patterns for each vocal affect and have studied how well human subjects perceive the intended affect.

Keywords- Human-robot interaction, emotive expression, synthesized speech.

C. Breazeal is with the Artificial Intelligence Laboratory, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA. E-mail: cynthia@ai.mit.edu.

I. INTRODUCTION

There is a growing research and commercial interest in building robots that can interact with people in a life-like and social manner. For robotic applications where the robot and human establish and maintain a long term relationship, such as robotic pets for children or robotic nursemaids for the elderly, communication of affect is important. There have been a number of projects exploring models of emotion for robots or animated life-like characters [1], [2], [3], [4], [5], the recognition of emotive states in people [6], [7], [8], [9], and the expression of affect in facial expression [10], [11], [12] and body movement [13]. This paper explores the expression of emotion in synthesized speech for an anthropomorphic robot (called Kismet) with a highly expressive face. We have adapted several key emotional correlates of human speech to the robot's synthesizer (based on DECtalk v4.0) to allow Kismet to speak in either an angry, calm, disgusted, fearful, happy, sad, or surprised manner. We have evaluated our approach through acoustic analysis of the speech patterns for each vocal affect. We have also studied how well human subjects perceive the intended affect.

It is well-accepted that facial expressions (related to affect) and facial displays (which serve a communication function) are important for verbal communication. Hence, Kismet's vocalizations should convey the affective state of the robot. This provides a person with important affective information as to how to appropriately engage a sociable robot like Kismet. If done properly, Kismet could then use its emotive vocalizations to convey disapproval, frustration, disappointment, attentiveness, or playfulness. This fosters richer and sustained social interaction, and helps to maintain the person's interest. For a compelling verbal exchange, it is also important for Kismet to accompany its expressive speech with appropriate motor movements of the lips, jaw, and face. The ability to lip synchronize with expressive speech strengthens the perception of Kismet as a social creature that expresses itself vocally and through facial expression. A disembodied voice would be a detriment to the life-like quality of interaction that we would like Kismet to have with people. Synchronized movements of the face with voice both complement as well as supplement the information transmitted through the verbal channel. In earlier work we have presented Kismet's emotion system and its expressive facial animation system (which includes emotive facial expressions and lip synchronization) [12]. This paper presents our work in giving Kismet's voice emotive qualities.

II. EMOTION IN SPEECH

There has been an increasing amount of work in identifying those acoustic features that vary with a speaker's affective state [14]. Figure 1 summarizes the effects of emotion in human speech that tend to alter the pitch, timing, voice quality, and articulation of the speech signal [15]. Several of these features, however, are also modulated by the prosodic effects that the speaker uses to communicate grammatical structure and lexical correlates. These tend to have a more localized influence on the speech signal, such as emphasizing a particular word. For recognition tasks, this increases the challenge of isolating those feature characteristics modulated by emotion. Even humans are not perfect at perceiving the intended emotion for those emotional states that have similar acoustic characteristics. For instance, surprise can be perceived or understood as either joyous surprise (happiness) or apprehensive surprise (fear). Disgust is a form of disapproval and can be confused with anger. Picard (1997) [6] presents a nice overview of work in this area.

[Table: The effect of emotions on the human voice]
Fig. 1. Typical effect of emotions on adult human speech, adapted from Murray and Arnott (1993) and Picard (1997).

There have been a few systems developed to synthesize emotional speech. For instance, Jun Sato (see www.ee.seikei.ac.jp/user/junsato/research/) trained a neural network to modulate a neutrally spoken speech signal (in Japanese) to convey one of four emotional states (happiness, anger, sorrow, disgust). The neural network was trained on speech spoken by Japanese actors. This approach has the advantage that the output speech signal sounds more natural than purely synthesized speech. For our interactive robot application, this approach has the disadvantage that the speech input to the system must be prerecorded. Kismet must be able to generate its own utterances to suit the circumstance.

The Affect Editor by Janet Cahn is among the earliest work in expressive synthesized speech [15]. Her system was based on DECtalk 3, a commercially available text-to-speech synthesizer. Given an English sentence and an emotional quality (one of anger, disgust, fear, joy, sorrow, or surprise), she developed a methodology for mapping the emotional correlates of speech (changes in pitch, timing, voice quality, and articulation) onto the underlying DECtalk synthesizer settings. She took great care to introduce the global prosodic effects of emotion while still preserving the more local influences of grammatical and lexical correlates of speech intonation. With respect to giving Kismet the ability to generate emotive vocalizations, Cahn's work is a valuable resource that we have adapted and extended to suit our purposes.

[Figure]
Fig. 2. Kismet's expressive speech GUI. Listed is a selection of emotive qualities, the vocal affect parameters, and the synthesizer settings.

III. THE EXPRESSIVE VOICE SYNTHESIS SYSTEM

Emotions have a global impact on speech since they modulate the respiratory system, larynx, vocal tract, muscular system, heart rate, and blood pressure. There is an assortment of vocal affect parameters (VAPs) that alter the pitch, timing, voice quality, and articulation aspects of the speech signal. The pitch-related parameters affect the pitch contour of the speech signal, which is the primary contributor of affective information. The pitch-related parameters include accent shape, average pitch, pitch contour slope, final lowering, pitch range, and pitch reference line. The timing-related parameters modify the prosody of the vocalization, often being reflected in speech rate and stress placement. The timing-related parameters include speech rate, pauses, exaggeration, and stress frequency. The voice-quality parameters include loudness, brilliance, breathiness, laryngealization, pitch discontinuity, and pause discontinuity. The articulation parameter modifies the precision of what is uttered, either being more enunciated or slurred. These vocal affect parameters are described in more detail below.

Our task is to derive a mapping of these physiological vocal affect parameters to the underlying synthesizer settings (we use DECtalk v4.0) to convey the emotional qualities of anger, fear, disgust, happiness, sadness, and surprise in Kismet's voice. There is currently a single fixed mapping per emotional quality. Figure 3, along with the equations presented in this paper, summarizes how the vocal affect parameters are mapped to the DECtalk synthesizer settings. The default values and max/min bounds for these settings are given in Figure 4. Figure 5 summarizes how each emotional quality of voice is mapped onto the VAPs.

[Table]
Fig. 3. Percent contributions of vocal affect parameters to DECtalk synthesizer settings. The absolute values of the contributions in the far right column add up to 1 (100%) for each synthesizer setting. See the equations in section III-B for the mapping.
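To make the fixed per-emotion mapping concrete, here is a minimal sketch (ours, in Python; not part of the original system) of an emotion-to-VAP preset table. Each VAP takes an integer in [-10, 10] with 0 as neutral, as described in section III-B. The preset values shown are hypothetical placeholders chosen to be consistent with the qualitative descriptions later in the paper; they are not the actual entries of figure 5, which did not survive reproduction here.

# Sketch: per-emotion vocal affect parameter (VAP) presets.
# All numeric values are hypothetical illustrations, not figure 5 entries.
VAP_NAMES = [
    # pitch
    "accent_shape", "average_pitch", "contour_slope",
    "final_lowering", "pitch_range", "reference_line",
    # timing
    "speech_rate", "stress_frequency",
    # voice quality
    "breathiness", "brilliance", "laryngealization",
    "loudness", "pause_discontinuity", "pitch_discontinuity",
    # articulation
    "precision",
]

EMOTION_PRESETS = {
    # anger: loud, slightly fast, wide pitch range, low mean pitch
    "anger":   {"average_pitch": -4, "pitch_range": 8, "speech_rate": 3,
                "loudness": 8, "precision": 5},
    # sadness: slow, soft, low and narrow pitch, slightly breathy
    "sadness": {"average_pitch": -5, "pitch_range": -7, "speech_rate": -6,
                "loudness": -5, "breathiness": 3},
}

def full_preset(emotion):
    """Expand a sparse preset so every VAP has a value (0 = neutral)."""
    preset = dict.fromkeys(VAP_NAMES, 0)
    preset.update(EMOTION_PRESETS[emotion])
    return preset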

A. The Vocal Affect Parameters (VAPs)

The following six pitch parameters influence the pitch contour of the spoken utterance. The pitch contour is the trajectory of the fundamental frequency, f0, over time.

* Accent Shape: Modifies the shape of the pitch contour for any pitch-accented word by varying the rate of f0 change about that word. A high accent shape corresponds to speaker agitation, where there is a high peak f0 and a steep rising and falling pitch contour slope. This parameter has a substantial contribution to DECtalk's stress rise setting, which regulates the f0 magnitude of pitch-accented words.
* Average Pitch: Quantifies how high or low the speaker appears to be speaking relative to their normal speech. It is the average f0 value of the pitch contour. It varies directly with DECtalk's average pitch setting.
* Contour Slope: Describes the general direction of the pitch contour, which can be characterized as rising, falling, or level. It contributes to two DECtalk settings: it has a small contribution to the assertiveness setting, and varies inversely with the baseline fall setting.
* Final Lowering: Refers to the amount that the pitch contour falls at the end of an utterance. In general, an utterance will sound emphatic with a strong final lowering, and tentative if weak. It can also be used as an auditory cue to regulate turn taking: a strong final lowering can signify the end of a speaking turn, whereas a speaker's intention to continue talking can be conveyed with a slight rise at the end. This parameter strongly contributes to DECtalk's assertiveness setting and somewhat to the baseline fall setting.
* Pitch Range: Measures the bandwidth between the maximum and minimum f0 of the utterance. The pitch range expands and contracts about the average f0 of the pitch contour. It varies directly with DECtalk's pitch range setting.
* Reference Line: Controls the reference pitch f0 contour. Pitch accents cause the pitch trajectory to rise above or dip below this reference value. DECtalk's hat rise setting very roughly approximates this.

[Table]
Fig. 4. Default DECtalk synthesizer settings for Kismet's voice that are used in the equations for altering these values to produce Kismet's expressive speech.

The vocal affect timing parameters contribute to speech rhythm. Such correlates arise in emotional speech from physiological changes in respiration rate (changes in breathing patterns) and level of arousal.

* Speech Rate: Controls the rate of words or syllables uttered per minute. It influences how quickly an individual word or syllable is uttered, the duration of sound and silence within an utterance, and the relative duration of phoneme classes. Speech is faster with higher arousal and slower with lower arousal. This parameter varies directly with DECtalk's speech rate setting. It varies inversely with DECtalk's period pause and comma pause settings, as faster speech is accompanied by shorter pauses.
* Stress Frequency: Controls the frequency of occurrence of pitch accents and determines the smoothness or abruptness of f0 transitions. As more words are stressed, the speech sounds more emphatic and the speaker more agitated. It filters other vocal affect parameters such as precision of articulation and accent shape, and thereby contributes to the associated DECtalk settings.

Emotion can induce not only changes in pitch and tempo, but in voice quality as well. These phenomena primarily arise from changes in the larynx and articulatory tract. The voice quality parameters are as follows:

* Breathiness: Controls the aspiration noise in the speech signal. It adds a tentative and weak quality to the voice when the speaker is minimally excited. DECtalk's breathiness and lax breathiness settings vary directly with this.
* Brilliance: Controls the perceptual effect of the relative energies of the high and low frequencies. When the speaker is agitated, higher frequencies predominate and the voice is harsh or "brilliant". When the speaker is relaxed or depressed, lower frequencies dominate and the voice sounds soothing and warm. DECtalk's richness setting varies directly with this parameter, as it enhances the lower frequencies. In contrast, DECtalk's smoothness setting varies inversely, since it attenuates higher frequencies.
* Laryngealization: Controls the perceived creaky voice phenomenon. It arises from minimal sub-glottal pressure and a small open quotient such that f0 is low, the glottal pulse is narrow, and the fundamental period is irregular. It varies directly with DECtalk's laryngealization setting.
* Loudness: Controls the amplitude of the speech waveform. As a speaker becomes aroused, the sub-glottal pressure builds, which increases the signal amplitude. As a result, the voice sounds louder. It varies directly with DECtalk's loudness setting. It also influences DECtalk's gain of voicing.
* Pause Discontinuity: Controls the smoothness of f0 transitions from sound to silence for unfilled pauses. Longer or more abrupt silences correlate with being more emotionally upset. It varies directly with DECtalk's quickness setting.

* Pitch Discontinuity: Controls the smoothness or abruptness of f0 transitions, and the degree to which the intended targets are reached. With more speaker control, the transitions are smoother; with less control, the transitions are more abrupt. It contributes to DECtalk's stress rise and quickness settings.

The autonomic nervous system modulates articulation by inducing an assortment of physiological changes, such as causing dryness of mouth or increased salivation. There is only one articulation parameter, as follows:

* Precision: Controls a range of articulation from enunciation to slurring. Slurring has minimal frication noise, whereas greater enunciation for consonants results in increased frication. Stronger enunciation also results in an increase in aspiration noise and voicing. The precision of articulation varies directly with DECtalk's gain of frication, gain of voicing, and gain of aspiration settings.

[Table]
Fig. 5. The mapping from each expressive quality of speech to the vocal affect parameters (VAPs). There is a single fixed mapping for each emotional quality.

B. Mapping VAPs to Synthesizer Settings

This section presents the equations that map the vocal affect parameters to synthesizer setting values. Linear changes in these vocal affect parameter values result in a non-linear change in the underlying synthesizer settings. Furthermore, the mapping between parameters and synthesizer settings is not necessarily one-to-one. Each parameter affects a percent of the final synthesizer setting's value (figure 3). When a synthesizer setting is modulated by more than one parameter, its final value is the sum of the effects of the controlling parameters. The total of the absolute values of these percentages must be 100%. See figure 4 for the allowable bounds of the synthesizer settings. The computational mapping occurs in three stages. The vocal affect parameters can assume integer values within the range of (-10, 10). Negative numbers correspond to lesser effects, positive numbers correspond to greater effects, and zero is the neutral setting. These values are set according to the current specified emotion as shown in figure 5.

In the first stage, the percentage PP_i of each VAP_i relative to its total range is computed by the equation:

  PP_i = (VAP_value,i + VAP_offset) / (VAP_max - VAP_min)

where VAP_i is the current VAP under consideration, VAP_value,i is its value specified by the current emotion, VAP_offset = 10 adjusts these values to be positive, VAP_max = 10, and VAP_min = -10.

In the second stage, a weighted contribution WC_j,i of those VAP_i that control each of DECtalk's synthesizer settings SS_j is computed. The far right column of figure 3 specifies each of the corresponding scale factors SF_j,i. Each scale factor represents the percentage of control that each VAP_i applies to its synthesizer setting SS_j. For each synthesizer setting SS_j, and each corresponding scale factor SF_j,i of VAP_i:

  WC_j,i = PP_i * SF_j,i              if SF_j,i >= 0
  WC_j,i = (1 - PP_i) * (-SF_j,i)     if SF_j,i < 0

  SS_j = sum over i of WC_j,i

At this point, each synthesizer setting has a value 0 <= SS_j <= 1. In the final stage, each synthesizer setting SS_j is scaled about norm = 0.5. This produces the final synthesizer value, SS_j,final, which is sent to the speech synthesizer. The maximum, minimum, and default values of the synthesizer settings are shown in figure 4. For each final synthesizer setting SS_j,final:

  SS_j,offset = SS_j - norm

  SS_j,final = SS_j,default + 2 * SS_j,offset * (SS_j,max - SS_j,default)   if SS_j,offset >= 0
  SS_j,final = SS_j,default + 2 * SS_j,offset * (SS_j,default - SS_j,min)   if SS_j,offset < 0
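The three stages above are compact enough to restate in code. The following Python sketch is ours, not the original implementation: the scale factors and setting bounds are hypothetical single-row stand-ins for figures 3 and 4, and the dv_command helper assumes DECtalk's inline [:dv ...] voice-parameter syntax with two-letter codes like those listed in figure 3.

# Sketch of the three-stage VAP -> DECtalk setting mapping described above.
# Only a hypothetical 'loudness' row is shown; figures 3 and 4 define the
# full scale-factor and bounds tables.

VAP_MIN, VAP_MAX, VAP_OFFSET = -10, 10, 10

# SF_j,i per setting: (controlling VAP, scale factor). The absolute values
# of the scale factors for one setting must sum to 1.0 (100%).
SCALE_FACTORS = {
    "loudness": [("loudness", 0.6), ("breathiness", -0.4)],  # hypothetical row
}

# (default, minimum, maximum) per setting, in the synthesizer's native units.
BOUNDS = {
    "loudness": (86, 60, 100),  # hypothetical stand-in for a figure 4 row
}

def stage1_pp(vap_value):
    """Stage 1: normalize a VAP value in [-10, 10] to PP_i in [0, 1]."""
    return (vap_value + VAP_OFFSET) / (VAP_MAX - VAP_MIN)

def stage2_ss(setting, preset):
    """Stage 2: sum the weighted contributions WC_j,i, giving 0 <= SS_j <= 1."""
    ss = 0.0
    for vap_name, sf in SCALE_FACTORS[setting]:
        pp = stage1_pp(preset[vap_name])
        ss += pp * sf if sf >= 0 else (1.0 - pp) * (-sf)
    return ss

def stage3_final(setting, ss, norm=0.5):
    """Stage 3: rescale SS_j about norm into [min, max] around the default."""
    default, lo, hi = BOUNDS[setting]
    offset = ss - norm
    if offset >= 0:
        return default + 2 * offset * (hi - default)
    return default + 2 * offset * (default - lo)

def dv_command(final_settings):
    """Format settings as a DECtalk inline voice command, e.g. '[:dv lo 93]'.
    Two-letter codes as in figure 3; only 'lo' (loudness) is mapped here."""
    codes = {"loudness": "lo"}
    body = " ".join("%s %d" % (codes[s], round(v)) for s, v in final_settings.items())
    return "[:dv " + body + "]"

With the hypothetical anger preset from the earlier sketch, loudness = 8 gives PP = 0.9, SS = 0.9 * 0.6 + (1 - 0.5) * 0.4 = 0.74, and a final value of 86 + 2 * 0.24 * (100 - 86), about 92.7, i.e. louder than the default, as expected for anger.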

Given a string to be spoken and the updated synthesizer settings, Kismet can vocally express itself with different emotional qualities (anger, disgust, fear, joy, sorrow, or surprise). To evaluate Kismet's speech, we analyzed the produced utterances with respect to the acoustical correlates of emotion. This reveals whether the implementation produces similar acoustical changes to the speech waveform given a specified emotional state. We also evaluated how the affective modulations of the synthesized speech are perceived by human listeners.

B.1 Analysis of Speech

To analyze the performance of the expressive vocalization system, we extracted the dominant acoustic features that are highly correlated with emotive state. The acoustic features and their modulation with emotion are summarized in figure 1. Specifically, these are average pitch, pitch range, pitch variance, and mean energy. To measure speech rate, we extracted the overall time to speak and the total time of voiced segments.
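As a rough illustration of this analysis step, the sketch below extracts the same features from a recorded utterance. The paper does not say which analysis tools were used; librosa's pYIN pitch tracker and the 80-600 Hz search band here are our assumptions.

# Sketch: extract the acoustic features analyzed above from a WAV file.
# librosa and the pitch search band are assumptions, not the original
# analysis pipeline.
import numpy as np
import librosa

def acoustic_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    # pYIN returns f0 per frame (NaN where unvoiced) plus a voicing flag.
    f0, voiced, _ = librosa.pyin(y, fmin=80, fmax=600, sr=sr)
    f0 = f0[~np.isnan(f0)]
    frame_s = 512.0 / sr  # pyin's default hop length is 512 samples
    return {
        "average_pitch_hz": float(np.mean(f0)),
        "pitch_range_hz":   float(np.max(f0) - np.min(f0)),
        "pitch_variance":   float(np.var(f0)),
        "mean_energy":      float(np.mean(librosa.feature.rms(y=y))),
        "total_time_s":     len(y) / sr,              # overall time to speak
        "voiced_time_s":    float(np.sum(voiced) * frame_s),
    }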

Features were extracted from three phrases:
* Look at that picture
* Go to the city
* It's been moved already

[Table]
Fig. 6. Table of acoustic features for the three utterances.

The results are summarized in figure 6. The values for each feature are displayed for each phrase with each emotive quality (including the neutral state). The averages are also presented in the table and plotted in figure 7. These plots easily illustrate how each emotive quality modulates these acoustic features with respect to one another. The pitch contours for each emotive quality are shown in figure 8. They correspond to the utterance "It's been moved already." Relating these plots with figure 1, it is clear that many of the acoustic correlates of emotive speech are preserved in Kismet's speech.

[Figure]
Fig. 7. Plots of acoustic features of Kismet's speech. The plots illustrate how each emotion relates to the others for each acoustic feature. The horizontal axis simply maps an integer value to each emotion for ease of viewing (anger=1, calm=2, etc.).

Kismet's vocal quality varies with its emotive state as follows:
* Fearful speech is very fast, with a wide pitch contour, large pitch variance, very high mean pitch, and normal intensity. We have added a slightly breathy quality to the voice, as people seem to associate it with a sense of trepidation.
* Angry speech is loud and slightly fast, with a wide pitch range and high variance. We have purposefully implemented a low mean pitch to give the voice a prohibiting quality. This differs from figure 1, but a preliminary study demonstrated a dramatic improvement in the recognition performance of naive subjects. This makes sense, as it gives the voice a threatening quality.
* Sad speech has a slower speech rate, with longer pauses than normal. It has a low mean pitch, a narrow pitch range, and low variance. It is softly spoken with a slight breathy quality. This differs from figure 1, but it gives the voice a tired quality. It has a pitch contour that falls at the end.
* Happy speech is relatively fast, with a high mean pitch, wide pitch range, and wide pitch variance. It is loud with smooth undulating inflections, as shown in figure 8.
* Disgusted speech is slow, with long pauses interspersed. It has a low mean pitch with a slightly wide pitch range. It is fairly quiet, with a slight creaky quality to the voice. The contour has a global downward slope, as shown in figure 8.
* Surprised speech is fast, with a high mean pitch and wide pitch range. It is fairly loud, with a steep rising contour on the stressed syllable of the final word.

B.2 Human Listener Experiments

To evaluate Kismet's expressive speech, nine subjects were asked to listen to prerecorded utterances and to fill out a forced-choice questionnaire. Subjects ranged from 23 to 54 years of age, all affiliated with MIT. The subjects had very limited to no familiarity with Kismet's voice.

In this study, each subject first listened to an introduction spoken with Kismet's neutral expression. This was to acquaint the subject with Kismet's synthesized quality of voice and neutral affect. A series of eighteen utterances followed, covering six expressive qualities (anger, fear, disgust, happiness, surprise, and sorrow). Within the experiment, the emotive qualities were distributed randomly. Given the small number of subjects per study, we only used a single presentation order per experiment. Each subject could work at his/her own pace and control the number of presentations of each stimulus.

The three stimulus phrases were: "I'm going to the city," "I saw your name in the paper," and "It's happening tomorrow." The first two test phrases were selected because Cahn (1990) had found the word choice to have reasonably neutral affect. In a previous version of the study, subjects reported that it was just as easy to map emotional correlates onto English phrases as onto Kismet's non-linguistic vocalizations (akin to infant-like babbles). Their performance for English phrases and Kismet's babbles supports this.

[Figure]
Fig. 8. Pitch analysis of Kismet's speech for the English phrase "It's been moved already."

[Table: forced choice percentage (random = 17%)]
Fig. 9. Naive subjects assessed the emotion conveyed in Kismet's voice in a forced-choice evaluation. All emotional qualities were recognized with reasonable performance except for "fear," which was most often confused for "surprise/excitement." Both expressive qualities share high arousal, so the confusion is not unexpected.

Using a forced choice paradigm, the subjects were simply asked to circle the word which best described the voice quality. The choices were "anger," "disgust," "fear/panic," "happy," "sad," and "surprise/excited." From a previous iteration of the study, we found that word choice mattered. A given emotion category can have a wide range of vocal affects. For instance, the subject could interpret "fear" to imply "apprehensive," which might be associated with Kismet's whispery vocal expression for sadness. Alternatively, it could be associated with "panic," which is a more aroused interpretation. The results from these evaluations are summarized in figure 9.
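The figure 9 tabulation is straightforward to reproduce. The sketch below is ours (the paper does not describe its scoring script): it builds a row-normalized confusion table from (intended, chosen) response pairs, against the 1/6, roughly 17%, chance level of the six-way forced choice.

# Sketch: tabulate forced-choice responses into a confusion table as in
# figure 9. The helper and data layout are our assumptions.
from collections import Counter

CHOICES = ["anger", "disgust", "fear/panic", "happy", "sad", "surprise/excited"]
CHANCE = 1.0 / len(CHOICES)  # ~0.17; recognition above this beats random

def confusion_table(responses):
    """responses: iterable of (intended_emotion, chosen_label) pairs."""
    rows = {intended: Counter() for intended in CHOICES}
    for intended, chosen in responses:
        rows[intended][chosen] += 1
    # Normalize each row so entries are the fraction of trials per choice.
    return {
        intended: {c: row[c] / max(sum(row.values()), 1) for c in CHOICES}
        for intended, row in rows.items()
    }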

Overall, the subjects exhibited reasonable performance in correctly mapping Kismet's expressive quality to the targeted emotion. However, the expression of "fear" proved somewhat problematic. For all other expressive qualities, the performance was significantly above random. Furthermore, misclassifications were highly correlated with similar emotions. For instance, "anger" was sometimes confused with "disgust" (sharing negative valence) or "surprise/excitement" (sharing high arousal). "Disgust" was confused with other negative emotions. "Fear" was confused with other high arousal emotions (with "surprise/excitement" in particular). The distribution for "happy" was more spread out, but it was most often confused with "surprise/excitement," with which it shares high arousal. Kismet's "sad" speech was confused with other negative emotions. The distribution for "surprise/excitement" was broad, but it was most often confused for "fear."

V. SUMMARY

For the purposes of evaluation, the current set of data is promising. Misclassifications are particularly informative. The mistakes are highly correlated with similar emotions, which suggests that arousal and valence are conveyed to people (arousal being more consistently conveyed than valence). We are using the results of this study to improve Kismet's expressive qualities. In addition, Kismet expresses itself through multiple modalities, not just through voice. We believe that Kismet's facial expression and body posture should help resolve the ambiguities encountered through voice alone.

ACKNOWLEDGMENTS

This work was supported in part by DARPA under contract DABT 63-99-1-0012 and in part by NTT.

REFERENCES

[1] J. Velasquez, "Modeling emotions and other motivations in synthetic agents," in Proceedings of the 1997 National Conference on Artificial Intelligence (AAAI97), July 1997, pp. 10-15.
[2] C. Breazeal, "Robot in society: friend or appliance?," in Proceedings of the Agents99 Workshop on Emotion-Based Architectures, Seattle, WA, 1999, pp. 18-26.
[3] D. Canamero, "Modeling motivations and emotions as a basis for intelligent behavior," in Proceedings of the First International Conference on Autonomous Agents, L. Johnson, Ed., ACM Press, 1997, pp. 148-155.
[4] S. Y. Yoon, B. Blumberg, and G. Schneider, "Motivation driven learning for interactive synthetic characters," in Proceedings of Agents2000 (to appear), 2000.
[5] S. Reilly, Believable Social and Emotional Agents, Ph.D. thesis, CMU School of Computer Science, Pittsburgh, PA, 1996.
[6] R. Picard, Affective Computing, MIT Press, Cambridge, MA, 1997.
[7] C. Breazeal and L. Aryananda, "Recognition of affective communicative intent in robot-directed speech," in Proceedings of Humanoids2000 (submitted), 2000.
[8] D. Roy and A. Pentland, "Automatic spoken affect analysis and classification," in Proceedings of the 1996 International Conference on Automatic Face and Gesture Recognition, 1996.
[9] L. Chen and T. Huang, "Multimodal human emotion/expression recognition," in Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pp. 366-371, April 1998.
[10] F. Hara, "Personality characterization of animate face robot through interactive communication with human," in Proceedings of IARP98, Tsukuba, Japan, 1998, pp. IV-1.
[11] H. Takanobu, A. Takanishi, S. Hirano, I. Kato, K. Sato, and T. Umetsu, "Development of humanoid robot heads for natural human-robot communication," in Proceedings of HURO98, 1998.
[12] C. Breazeal, Sociable Machines: Expressive Social Exchange Between Humans and Robots, Ph.D. thesis, Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, Cambridge, MA, May 2000.

[13] C. Rose, B. Bodenheimer, and M. Cohen, "Verbs and adverbs," in press, Computer Graphics and Animation, 1998.
[14] I. Murray and J. Arnott, "Toward the simulation of emotion in synthetic speech: a review of the literature on human vocal emotion," Journal of the Acoustical Society of America, vol. 93, no. 2, pp. 1097-1108, 1993.
[15] J. Cahn, "Generating expression in synthesized speech," M.S. thesis, MIT Media Lab, Cambridge, MA, 1990.