
Simulating Emotional Speech for a Talking Head
November 2000

Contents

1 Introduction .......................................................... 1

2 Problem Description ................................................... 2

2.1 Objectives .......................................................... 2

2.2 Subproblems ......................................................... 2

2.3 Significance ........................................................ 3

3 Literature Review ..................................................... 5

3.1 Emotion and Speech .................................................. 5

3.2 The Speech Correlates of Emotion .................................... 6

3.3 Emotion in Speech Synthesis ......................................... 8

3.4 Speech Markup Languages ............................................. 9

3.5 Extensible Markup Language (XML) ................................... 10

3.5.1 XML Features ..................................................... 10

3.5.2 The XML Document ................................................. 11

3.5.3 DTDs and Validation .............................................. 12

3.5.4 Document Object Model (DOM) ...................................... 14

3.5.5 SAX Parsing ...................................................... 15

3.5.6 Benefits of XML .................................................. 16

3.5.7 Future Directions in XML ......................................... 17

3.6 FAITH .............................................................. 18

3.7 Resource Review .................................................... 20

3.7.1 Text-to-Speech Synthesizer ....................................... 20

3.7.2 XML Parser ....................................................... 22

3.8 Summary ............................................................ 23

4 Research Methodology ................................................. 25

4.1 Hypotheses ......................................................... 25

Page i


4.2 Limitations and Delimitations ...................................... 26

4.2.1 Limitations ...................................................... 26

4.2.2 Delimitations .................................................... 26

4.3 Research Methodologies ............................................. 27

5 Implementation ....................................................... 28

5.1 TTS Interface ...................................................... 28

5.1.2 Module Inputs .................................................... 29

5.1.3 Module Outputs ................................................... 30

5.1.4 CTTS Utterance Structures ........................................ 37

5.6 Natural Language Parser ............................................ 39

5.6.1 Obtaining a Phoneme Transcription ................................ 40

5.6.2 Synthesizing in Sections ......................................... 42

5.6.3 Portability Issues ............................................... 43

5.7 Implementation of Emotion Tags ..................................... 44

5.7.1 Sadness .......................................................... 45

5.7.2 Happiness ........................................................ 46

5.7.3 Anger ............................................................ 47

5.7.4 Stressed Vowels .................................................. 48

5.7.5 Conclusion ....................................................... 48

5.8 Implementation of Low-level SML Tags ............................... 49

5.8.1 Speech Tags ...................................................... 49

5.8.2 Speaker Tag ...................................................... 53

5.9 Digital Signal Processor ........................................... 54

5.10 Cooperating with the FAML module .................................. 55

5.11 Summary ........................................................... 57



6 Results and Analysis ................................................. 58

6.1 Data Acquisition ................................................... 58

6.1.1 Questionnaire Structure and Design ............................... 58

6.1.2 Experimental Procedure ........................................... 61

6.1.3 Profile of Participants .......................................... 63

6.2 Recognizing Emotion in Synthetic Speech ............................ 64

6.2.1 Confusion Matrix ................................................. 64

6.2.2 Emotion Recognition for Section 2A ............................... 66

6.2.3 Emotion Recognition for Section 2B ............................... 69

6.2.4 Effect of Vocal Emotion on Emotionless Text ...................... 73

6.2.5 Effect of Vocal Emotion on Emotive Text .......................... 75

6.2.6 Further Analysis ................................................. 75

6.3 Talking Head and Vocal Expression .................................. 77

6.4 Summary ............................................................ 81

7 Future Work .......................................................... 82

7.1 Post Waveform Processing ........................................... 82

7.2 Speaking Styles .................................................... 83

7.3 Speech Emotion Development ......................................... 84

7.4 XML Issues ......................................................... 85

7.5 Talking Head ....................................................... 86

7.6 Increasing Communication Bandwidth ................................. 87

8 Conclusion ........................................................... 88

9 Bibliography ......................................................... 91

10 Appendix A SML Tag Specification .................................... 96

11 Appendix B SML DTD ................................................. 102

12 Appendix C Festival and Visual C++ ................................. 104

13 Appendix D Evaluation Questionnaire ................................ 107

14 Appendix E Test Phrases for Questionnaire, Section 2B .............. 113


List of Figures

Figure 1 - An XML document holding simple weather information .......... 11

Figure 2 - Sample section of a DTD file ................................ 12

Figure 3 - XML syntax error - list and item tags incorrectly matched ... 13

Figure 4 - Well-formed XML document, but does not follow grammar
specification in DTD file (an item tag occurs outside of list tag) ..... 13

Figure 5 - Well-formed XML document that also follows DTD grammar
specification. Will not produce any parse errors ....................... 13

Figure 6 - DOM representation of XML example ........................... 15

Figure 7 - FAITH project architecture .................................. 19

Figure 8 - Talking


Figure 17 - Utterance structures to hold the phrase "the moon".
U = CTTS_UtteranceInfo object, W = CTTS_WordInfo object,
P = CTTS_PhonemeInfo object, pp = CTTS_PitchPatternPoint object ........ 39

Figure 18 - Tokenization of a part of an SML Document .................. 40

Figure 19 - SML Document sub-tree representing example SML markup ...... 41

Figure 20 - Raw timeline showing server and client execution when
synthesizing example SML markup above .................................. 43

Figure 21 - Multiply factors of pitch and duration values for
emphasized phonemes .................................................... 50

Figure 22 - Processing a pause tag ..................................... 51

Figure 23 - The effect of widening the pitch range of an utterance ..... 52

Figure 24 - Processing the pron tag .................................... 52

Figure 25 - Example MBROLA input ....................................... 55

Figure 26 - Example utterance information supplied to the FAML module
by the TTS module. Example phrase: "And now the latest news" ........... 56

Figure 27 - A node carrying waveform processing instructions for an
operation .............................................................. 83

Figure 28 - Insertion of new submodule for post waveform processing .... 83

Figure 29 - SML markup containing a link to a stylesheet ............... 84

Figure 30 - Inclusion of an "XML


List of Tables

Table 1 - Summary of human vocal emotion effects ....................... 8

Table 2 - Summary of human vocal emotion effects for anger,
happiness, and sadness ................................................. 44

Table 3 - Speech correlate values implemented for sadness .............. 45

Table 4 - Speech correlate values implemented for happiness ............ 46

Table 5 - Speech correlate values implemented for anger ................ 47

Table 6 - Vowel-sounding phonemes are discriminated based on their
duration and pitch ..................................................... 48

Table 7 - MBROLA command line option values for en1 and us1 diphone
databases to output male and female voices ............................. 54

Table 8 - Statistics of participants ................................... 63

Table 9 - Confusion matrix template .................................... 64

Table 10 - Confusion matrix with sample data ........................... 65

Table 11 - Confusion matrix showing ideal experiment data: 100%
recognition rate for all simulated emotions ............................ 65

Table 12 - Listener response data for neutral phrases spoken with
happy emotion .......................................................... 66

Table 13 - Section 2A listener response data for neutral phrases ....... 67

Table 14 - Listener response data for Section 2A, Question 1 ........... 68

Table 15 - Listener response data for Section 2A, Question 2 ........... 68

Table 16 - Listener responses for utterances containing emotionless
text with no vocal emotion ............................................. 70

Table 17 - Listener responses for utterances containing emotive text
with no vocal emotion .................................................. 71

Table 18 - Listener responses for utterances containing emotionless
text with vocal emotion ................................................ 72


Table 19 - Listener responses for utterances containing emotive text
with vocal emotion ..................................................... 73

Table 20 - Percentage of listeners who improved in emotion
recognition with the addition of vocal emotion effects for neutral
text ................................................................... 74

Table 21 - Percentage of listeners whose emotion recognition
deteriorated with the addition of vocal emotion effects for neutral
text ................................................................... 74

Table 22 - Percentage of listeners whose emotion recognition improved
with the addition of vocal emotion effects for emotive text ............ 75

Table 23 - Percentage of listeners whose emotion recognition
deteriorated with the addition of vocal emotion effects for emotive
text ................................................................... 75

Table 24 - Listener responses for participants who speak English as
their first language. Utterance type is "neutral text, emotive voice" .. 76

Table 25 - Listener responses for participants who do NOT speak
English as their first language. Utterance type is "neutral text,
emotive voice" ......................................................... 76

Table 26 - Listener responses for participants who speak English as
their first language. Utterance type is "emotive text, emotive voice" .. 77

Table 27 - Listener responses for participants who do NOT speak
English as their first language. Utterance type is "emotive text,
emotive voice" ......................................................... 77

Table 28 - Participant responses when asked to choose the Talking


Chapter 1

Introduction

When we talk, we produce a complex acoustic signal that carries information in addition to the verbal content of the message. Vocal expression tells others about the emotional state of the speaker, as well as qualifying (or even disqualifying) the literal meaning of the words. Because of this, listeners expect to hear vocal effects, paying attention not only to what is being said but to how it is said. The problem with current speech synthesizers is that the effect of emotion on speech is not taken into account, producing output that sounds monotonic or, at worst, distinctly machine-like. As a result, the ability of a Talking Head to express its emotional state will be adversely affected if it uses a plain speech synthesizer to "talk". The objective of this research was to develop a system that is able to incorporate emotional effects in synthetic speech and thus improve the perceived naturalness of a Talking Head.

This thesis reviews the literature in the fields of speech emotion, synthetic speech synthesis, and XML. A discussion of XML features prominently in this thesis because it was the vehicle chosen for directing how the synthetic voice should sound. It also had considerable impact on how speech information was processed. The design and implementation details of the project are discussed to describe the developed system. An in-depth analysis of the project's evaluation data is then given, concluding with a discussion of future work that has been identified.


Chapter 2

Problem Description

2.1 Objectives

Development of the project was aimed at meeting two main objectives to support the hypotheses of Section 4.1:

1. To develop a system that can add simulated emotion effects to synthetic speech. This involved researching the speech correlates of emotion that have been identified in the literature. The findings were to be applied to the control parameters available in a speech synthesizer, allowing a specified emotion to be simulated using rules controlling the parameters.

2. To integrate the system within the TTS (text-to-speech) module of a Talking Head. The speech system was to be added to the Talking Head that is part of the FAITH¹ project. It is being developed jointly at Curtin University of Technology, Western Australia, and the University of Genoa in Italy (Beard et al. 1999). The text-to-speech module must be treated as a "black box", which is consistent with the modular design of FAQbot.

2.2 Subproblems

A number of subproblems were identified to successfully develop a system with the stated objectives.

1. Design and implementation of a speech markup language. It was desirable that the markup language be XML-based; the reasons for this will become apparent later in the thesis. The role of the speech markup language (SML) is to

¹ Facial Animated Interactive Talking Head


provide a way to specify in which emotion a text segment is to be rendered. In addition to this, it was decided to extend the application of the markup to provide a mechanism for the manipulation of generally useful speech properties such as rate, pitch and volume. SML was designed to closely follow the SABLE specification described by Sproat et al. (1998).
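Markup of this kind might look as follows. The tags below are a hypothetical sketch in the spirit of SABLE-style speech markup, not the actual SML tag set (the real specification is given in Appendix A):

```xml
<!-- Hypothetical SML-style markup: an emotion tag wrapping a text
     segment, plus low-level prosody controls such as rate and pitch.
     Tag and attribute names are illustrative only. -->
<sml>
  <emotion type="happiness">
    And now the latest news.
  </emotion>
  <rate speed="-20%">
    <pitch middle="+10%">This part is spoken more slowly, at a higher pitch.</pitch>
  </rate>
</sml>
```

Because the markup is well-formed XML, a standard XML parser can validate it and hand the synthesizer a document tree rather than raw text.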

2. Evaluation of each of the existing text-to-speech (TTS) submodules of the Talking Head was required. Its aim was to determine what could and could not be reused. This included assessing the existing TTS module(s) and the modules that interface with other subsystems of the Talking Head (namely the MPEG-4 subsystem).

3. Cooperative integration with modules that were being concurrently written for the Talking Head, namely the gesture markup language being developed by Huynh (2000). The collaboration between the two subprojects was aimed at providing the Talking Head with synchronization of vocal expressions and facial gestures. An architecture specification for allowing facial and speech synchronization is given by Ostermann et al. (1998).

4. Since the Talking Head is being developed to run over a number of platforms (Win32, Linux and IRIX 6.3), it was crucial that the new TTS module would not hamper efforts to make the Talking Head a platform independent application.

2.3 Significance

The project is significant because, despite the important role of the display of emotion in human communication, current text-to-speech synthesizers do not cater for its effect on speech. Research to add emotion effects to synthetic speech is ongoing, notably by Murray and Arnott (1996), but has been mainly restricted to standalone systems and not part of a Talking Head, as this project set out to do.

Increased naturalness in synthetic speech is seen as being important for its acceptance (Scherer 1996), and this is likely to be the case for applications of Talking Head technology as well. This thesis attempts to address this need. Advances in this area will also benefit work in the fields of speech analysis, speech recognition and speech synthesis when dealing with natural variability. This is because work with the speech correlates of emotion will help support or disprove speech correlates identified in speech analysis, help in proper feature extraction for the automatic recognition of emotion in the voice, and generally improve synthetic speech production.


Chapter 3

Literature Review

This section presents a brief review of the literature relevant to the areas the project is concerned with: the effects of emotion on speech, speech emotion synthesis, XML, and speech markup languages.

3.1 Emotion and Speech

Emotion is an integral part of speech. Semantic meaning in a conversation is conveyed not only in the actual words we say but also in how they are expressed (Knapp 1980; Malandro, Barker and Barker 1989). Even before they can understand words, children display the ability to recognize vocal emotion, illustrating the importance that nature places on being able to convey and recognize emotion in the speech channel bandwidth.

The intrinsic relationship that emotion shares with speech is seen in the direct effect that our emotional state has on the speech production mechanism. Physiological changes such as increased heart rate and blood pressure, muscle tremors, and dryness of mouth have been noted to be brought about by the arousal of the sympathetic nervous system, such as when experiencing fear, anger or joy (Cahn 1990). These effects of emotion on a person's speech apparatus ultimately affect how speech is produced, thus promoting the view that an emotion "carrier wave" is produced for the words spoken (Murray and Arnott 1993).

With emotion being described as "the organism's interface to the world outside" (Scherer 1981), considerable interest has been devoted to investigating the role of emotion in speech, particularly regarding its social aspects (Knapp 1980). One function is to notify others of our behavioural intentions in response to certain events (Scherer 1981). For example, the contraction of one's throat when experiencing fear will produce a harsh voice that is increased in loudness (Murray and Arnott 1993), serving to warn and frighten a would-be assailant, with the body tensing for a possible confrontation. The expression of emotion through speech also serves to communicate to others our judgement of a particular situation. Importantly, vocal changes due to emotion may in fact be cross-cultural in nature, though this may only be true for some emotions, and further work is required to ascertain this for certain (Murray, Arnott and Rohwer 1996).

We also deliberately use vocal expression in speech to communicate various meanings. Sudden pitch changes will make a syllable stand out, highlighting the associated word as an important component of that utterance (Dutoit 1997). A speaker will also pause at the end of key sentences in a discussion to allow listeners the chance to process what was said, and a phrase's pitch will increase towards the end to denote a question (Malandro, Barker and Barker 1989). When something is said in a way that seems to contradict the actual spoken words, we will usually accept the vocal meaning over the verbal meaning. For example, the expression "thanks a lot" spoken in an angry tone will generally be taken in a negative way, and not as a compliment as the literal meaning of the words alone would suggest. This underscores the importance we place on the vocal information that accompanies the verbal content.

3.2 The Speech Correlates of Emotion

Acoustics researchers and psychologists have endeavoured to identify the speech correlates of emotion. The motivation behind this work is based on the demonstrated ability of listeners to recognize different vocal expressions. If vocal emotions are distinguishable, then there are acoustic features responsible for how various emotions are expressed (Scherer 1996). However, this task has met with considerable difficulty. This is because coordination of the speech apparatus to produce vocal expression is done unconsciously, even when a speaking style is consciously adopted (Murray and Arnott 1996).

Traditionally, there have been three major experimental techniques that researchers have used to investigate the speech correlates of emotion (Knapp 1980; Murray and Arnott 1993):

1. Meaningless 'neutral' content (e.g. letters of the alphabet, numbers, etc.) is read by actors who express various emotions.

2. The same utterance is expressed in different emotions. This approach aids in comparing the emotions being studied.

3. The content is ignored altogether, either by using equipment designed to extract various speech attributes or by filtering out the content. The latter technique involves applying a low-pass filter to the speech signal, thus eliminating the high frequencies that word recognition is dependent upon. (This meets with limited success, however, since some of the vocal information also resides in the high frequency range.)
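The content-filtering technique in point 3 can be sketched with a first-order low-pass filter. This is a generic illustration, not code from the project:

```cpp
#include <cstddef>
#include <vector>

// First-order (one-pole) low-pass filter. It attenuates the high
// frequencies that word recognition depends on, while keeping the
// low-frequency prosodic information that carries much of the vocal
// emotion.
std::vector<double> lowPass(const std::vector<double>& samples,
                            double cutoffHz, double sampleRateHz) {
    const double kPi = 3.14159265358979323846;
    const double rc = 1.0 / (2.0 * kPi * cutoffHz);  // filter time constant
    const double dt = 1.0 / sampleRateHz;            // sample period
    const double alpha = dt / (rc + dt);             // smoothing factor in (0, 1)

    std::vector<double> out(samples.size());
    double y = 0.0;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        y += alpha * (samples[i] - y);  // y[i] = y[i-1] + alpha * (x[i] - y[i-1])
        out[i] = y;
    }
    return out;
}
```

Running intelligible speech through such a filter with a cutoff of a few hundred hertz leaves the words unrecognizable while the pitch contour remains audible.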

The problem of speech parameter identification is further compounded by the subjective nature of these tests. This is evident in the literature, as results taken from numerous studies rarely agree with each other. Nevertheless, a general picture of the speech parameters responsible for the expression of emotion can be constructed. There are three main categories of speech correlates of emotion (Cahn 1990; Murray, Arnott and Rohwer 1996):

Pitch contour. The intonation of an utterance, which describes the nature of accents and the overall pitch range of the utterance. Pitch is expressed as fundamental frequency (F0). Parameters include average pitch, pitch range, contour slope and final lowering.

Timing. Describes the speed at which an utterance is spoken, as well as rhythm and the duration of emphasized syllables. Parameters include speech rate, hesitation pauses and exaggeration.

Voice quality. The overall 'character' of the voice, which includes effects such as whispering, hoarseness, breathiness and intensity.

It is believed that value combinations of these speech parameters are used to express vocal emotion. Table 1 is a summary of human vocal emotion effects for four of the so-called basic emotions: anger, happiness, sadness and fear (Murray and Arnott 1993; Galanis, Darsinos and Kokkinakis 1996; Cahn 1990; Davitz 1964; Scherer 1996). The parameter descriptions are relative to neutral speech.

               Anger              Happiness          Sadness            Fear

Speech rate    Faster             Slightly faster    Slightly slower    Much faster

Pitch average  Very much higher   Much higher        Slightly lower     Very much higher

Pitch range    Much wider         Much wider         Slightly narrower  Much wider

Intensity      Higher             Higher             Lower              Higher

Pitch changes  Abrupt, downward   Smooth, upward     Downward           Downward terminal
               directed contours  inflections        inflections        inflections

Voice quality  Breathy, chesty    Breathy, blaring   Resonant           Irregular voicing
               tone

Articulation   Clipped            Slightly slurred   Slurred            Precise

Parameter terms are those used by Murray and Arnott (1993).

Table 1 - Summary of human vocal emotion effects.

The summary should not be taken as a complete and final description, but rather is meant as a guideline only. For instance, the table above emphasizes the role of fundamental frequency as a carrier of vocal emotion. However, Knower (1941, as referred to in Murray and Arnott 1993) notes that whispered speech is able to convey emotion, even though whispering makes no use of the voice's fundamental frequency. Nevertheless, being able to succinctly describe vocal expression like this has significant benefits for simulating emotion in synthetic speech.
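The three categories of correlates above can be collected into a simple data structure. This is an illustrative sketch only; the type and field names are invented, not taken from the project's implementation:

```cpp
#include <string>

// Illustrative grouping of the three categories of speech correlates
// of emotion. Values are expressed relative to neutral speech
// (1.0 = no change for multiplicative factors). All names here are
// hypothetical.
struct EmotionCorrelates {
    // Pitch contour (F0)
    double averagePitchFactor;  // scales average fundamental frequency
    double pitchRangeFactor;    // scales the overall pitch range
    double finalLowering;       // relative F0 drop at the utterance end

    // Timing
    double speechRateFactor;    // scales speaking speed
    double hesitationPauseMs;   // extra pause length, in milliseconds

    // Voice quality
    std::string quality;        // e.g. "breathy", "resonant", "irregular"
    double intensityFactor;     // scales loudness
};
```

A synthesizer driver could hold one such record per emotion and apply it to the neutral prosody it computes for the text.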

    3.3 motion in /peech /"nthesis

    In the past, focus has been placed on developing speech synthesizer techniques to
    produce clearer intelligibility, with intonation being confined to modelling neutral speech.
    However, the speech produced is distinctly machine sounding and unnatural. Speech
    synthesis is seen as being flawed for not possessing appropriate prosodic variation like
    that found in human speech. For this reason, some synthesis models are including the
    effects of emotion on speech to produce greater variability (Murray, Arnott and Rohwer
    1996). Interestingly, Scherer (1996) sees this as being crucial for the acceptance of
    synthetic speech.

    The advantage of the vocal emotion descriptions in Table 1 is that the speech
    parameters can be manipulated in current speech synthesizers to simulate emotional
    speech without dramatically affecting intelligibility. This approach thus allows emotive
    effects to be added on top of the output of text-to-speech synthesizers through the use of


    carefully constructed rules. Two of the better known systems capable of adding emotion-
    by-rule effects to speech are the "Affect Editor" developed by Cahn (1990b) and
    HAMLET, developed by Murray and Arnott (1995) (Murray, Arnott and Newell 1988).
    The systems both make use of the DECtalk text-to-speech synthesizer, mainly because of
    its extensive control parameter features.

    Future work is concerned with building a solid model of emotional speech, as this
    area is seen as being limited by our understanding of vocal expression and the quality of
    the speech correlates used to describe emotional speech (Cahn 1988; Murray and Arnott
    1995; Scherer 1996). Although not within the scope of the project, it is worth
    mentioning that research is being undertaken in concept-to-speech synthesis. This work
    is aimed at improving the intonation of synthetic speech by using extra linguistic
    information (i.e. tagged text) provided by another system, such as a natural language
    generation (NLG) system (Hitzeman et al. 1999).

    Variability in speech is also being investigated in the area of speech recognition, with
    the aim of possibly developing computer interfaces that respond differently according to
    the emotional state of the user (Dellaert, Polzin and Waibel 1996). Another avenue for
    future research could be to incorporate the effects of facial gestures on speech. For
    instance, Hess, Scherer and Kappas (1988) noted that voice quality is judged to be
    friendly over the phone when a person is smiling. A model that could cater for this
    would have extremely beneficial applications for recent work concerned with the
    synchronization of facial gestures and emotive speech in Talking Heads.

    Finally, simulating emotion in synthetic speech not only has the potential to build
    more realistic speech synthesizers (and hence provide the benefits that such a system
    would offer) but will also add to our understanding of speech emotion itself.

    3.4 Speech Markup Languages

    Ideally, a text-to-speech synthesizer would be able to accept plain text as input and speak
    it in a manner comparable to a human: emphasizing important words, pausing for effect,
    and pronouncing foreign words correctly. Unfortunately, automatically processing and
    analyzing plain text is extremely difficult for a machine. Without extra information to
    accompany the words it is to speak, the speech synthesizer will not only sound unnatural,
    but intelligibility will also decrease. Therefore, it is desirable to have an annotation
    scheme that will allow direct control over the speech synthesizer's output.


    Most research and commercial systems allow for such an annotation scheme, but
    almost all are synthesizer dependent, thus making it extremely difficult for software
    developers to build programs that can interface with any speech synthesizer. Recent
    moves by industry leaders to standardize a speech markup language have led to the draft
    specification of SABLE, a system-independent SGML-based markup language (Sproat
    et al. 1998). The SABLE specification has evolved from three existing speech synthesis
    markup languages: SSML (Taylor and Isard 1997), STML (Sproat et al. 1997) and
    Java's JSML.

    3.5 Extensible Markup Language (XML)

    XML is the Extensible Markup Language created by W3C, the World Wide Web
    Consortium (Extensible Markup Language 1998). It was specially designed to enable
    the use of large document management concepts for the World Wide Web that were
    embodied in SGML, the Standard Generalized Markup Language. In adopting SGML
    concepts, however, the aim was also to remove features of SGML that were either not
    needed for Web applications or were very difficult to implement (The XML FAQ 2000).
    The result was a simplified dialect of SGML that is relatively easy to learn, use and
    implement, and at the same time retains much of the power of SGML (Bosak 1997).

    It is important to note that XML is not a markup language in itself; rather, it is a
    meta-language: a language for describing other languages. Therefore, XML allows a
    user to specify the tag set and grammar of their own custom markup language that
    follows the XML specification.

    3.5.1 XML Features

    There are three significant features of XML that make it a very powerful meta-language
    (Bosak 1997):

    1. Extensibility: new tags and their attribute names can be defined at
    will. Because the author of an XML document can mark up data using any
    number of custom tags, the document is able to effectively describe the data
    embodied within the tags. This is not the case with HTML, which uses a
    fixed tag set.


    2. Structure: the structure of an XML document can be nested to
    any level of complexity, since it is the author that defines the tag set and
    grammar of the document.

    3. Validation: if a tag set and grammar definition is provided
    (usually via a Document Type Definition (DTD)), then applications
    processing the XML document can perform structural validation to make sure
    it conforms to the grammar specification. So, though the nested structure of an
    XML document can be quite complex, the fact that it follows a very rigid
    guideline makes document processing relatively easy.

    3.5.2 The XML Document

    An XML document is a sequence of characters that contains markup (the tags that
    describe the text they encapsulate) and character data (the actual text being "marked
    up"). Figure 1 shows an example of a simple XML document.

    Figure 1 – An XML document holding simple weather information.

    One of the main observations that should be made for the example given in Figure 1
    is that an XML document describes only the data, and not how it should be viewed. This
    is unlike HTML, which forces a specific view and does not provide a good mechanism
    for data description (Graham and Quinn 1999). For example, HTML tags such as P,
    DIV and TABLE describe how a browser is to display the encapsulated text, but are



    inadequate for specifying whether the data describes an automotive part, a section
    of a patient's health record, or the price of a grocery item.
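Figure 1 itself is an image that does not survive in this transcript. As a hedged sketch (the element names below are illustrative, not necessarily those of the thesis's actual figure), a weather document of the kind described, and the distinction between markup and the character data it encapsulates, can be shown with Python's standard `xml.dom.minidom`:

```python
# Hypothetical reconstruction of a Figure-1-style weather document;
# element names are illustrative, not the thesis's actual markup.
from xml.dom import minidom

WEATHER_XML = """<?xml version="1.0"?>
<weather>
    <date>October 30, 2000</date>
    <time>14:40</time>
    <forecast>Partly cloudy</forecast>
    <temperature>18</temperature>
</weather>"""

def character_data(tag_name: str) -> str:
    """Return the character data (the 'marked up' text) inside a given tag."""
    doc = minidom.parseString(WEATHER_XML)
    element = doc.getElementsByTagName(tag_name)[0]
    return element.firstChild.data

print(character_data("forecast"))  # only data, no presentation attached
```

Note that nothing in the document says how "Partly cloudy" should be displayed; the tags purely describe what the data is.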

    The fact that an XML document is encoded in plain text was a conscious decision
    made by the XML designers: the designing of a system-independent and vendor-
    independent solution (Bosak 1997). Although text files are usually larger than
    comparable binary formats, this can be easily compensated for using freely available
    utilities that can efficiently compress files, both in terms of size and time. At worst, the
    disadvantages associated with an uncompressed plain text file are deemed to be
    outweighed by the advantages of a universally understood and portable file format that
    does not require special software for encoding and decoding.

    3.5.3 DTDs and Validation

    The XML specification has very strict rules which describe the syntax of an XML
    document: for instance, the characters allowable within the markup section, how tags
    must encapsulate text, the handling of white space, etc. These rigid rules make the tasks
    of parsing and dividing the document into sub-components much easier. A well-formed
    XML document is one that follows the syntax rules set in the XML specification.
    However, since its author determines the structure of the document, a mechanism must be
    provided that allows grammar checking to take place. XML does this through the
    Document Type Definition, or DTD.

    A DTD file is written in XML's Declaration Syntax and contains the formal
    description of a document's grammar (The XML FAQ 2000). It defines, amongst other
    things, which tags can be used and where they can occur, the attributes within each tag,
    and how all the tags fit together.

    Figure 2 gives a sample DTD section that describes two elements, list and item.
    The example declares that one or more item tags can occur within a list tag.
    Furthermore, an item tag may optionally have a type attribute.

    Figure 2 – Sample section of a DTD file.
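The figure itself is not reproduced in this transcript, but the description above pins it down closely. A sketch of such a DTD section (using XML's standard declaration syntax; the exact wording of the thesis's figure may differ) might read:

```dtd
<!-- A list contains one or more item elements -->
<!ELEMENT list (item+)>

<!-- An item holds character data and may optionally carry a type attribute -->
<!ELEMENT item (#PCDATA)>
<!ATTLIST item type CDATA #IMPLIED>
```

Here `item+` expresses "one or more item tags within a list", and `#IMPLIED` marks the type attribute as optional.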


    Extending this example, the different levels of validation performed by an XML
    parser can be seen. Figure 3 shows an XML document that does not meet the syntax
    specified in the XML specification.

    Figure 3 – XML syntax error: list and item tags incorrectly matched.

    Figure 4 shows a well-formed XML document (i.e. it follows the XML syntax) that
    does not follow the grammar specified in the linked DTD file. (The DTD file is the one
    given in Figure 2.)

    Figure 4 – Well-formed XML document that does not follow the grammar specification in the
    DTD file (an item tag occurs outside of a list tag).

    Figure 5 shows a well-formed XML document that also meets the grammar
    specification given in the DTD file.

    Figure 5 – Well-formed XML document that also follows the DTD
    grammar specification. Will not produce any parse errors.
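Figures 3 to 5 are images lost to the transcript, but the three outcomes they illustrate can be sketched with Python's standard library. The stdlib checks well-formedness only; full DTD validation needs a validating parser (such as the libxml library discussed later), so the grammar-level outcome is noted in comments:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def is_well_formed(text: str) -> bool:
    """True if the document follows XML syntax (tags properly matched, etc.)."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False

# Figure 3 style: list and item tags incorrectly matched -> a syntax error.
broken = "<list><item>Item 1</list></item>"

# Figure 4 style: well-formed, but an item occurs outside a list, so a
# validating parser checking the Figure 2 grammar would reject it.
well_formed_invalid = "<item>Item 1</item>"

# Figure 5 style: well-formed and follows the list/item grammar.
valid = "<list><item>Item 1</item></list>"

print(is_well_formed(broken), is_well_formed(well_formed_invalid), is_well_formed(valid))
```

The documents shown are guesses at the figures' content, constructed from the list/item grammar of Figure 2.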

    The XML Recommendation states that any parse error detected while processing an
    XML document will immediately cause a fatal error (Extensible Markup Language
    1998): the XML document will not be processed any further, and the application will



    not attempt to second-guess the author's intent. Note that the DTD does NOT define how
    the data should be viewed either. Also, the DTD is able to define which sub-elements can
    occur within an element, but not the order in which they occur; the same applies for
    attributes specified for an element. For this reason, an application processing an XML
    document should avoid being dependent on the order of given tags or attributes.

    3.5.4 Document Object Model (DOM)

    The Document Object Model (DOM) Level 1 Specification defines the Document Object
    Model as "a platform- and language-neutral interface that allows programs and scripts to
    dynamically access and update the content, structure and style of documents" (Document
    Object Model 2000). It provides a tree-based representation of an XML document,
    allowing the creation, manipulation and navigation of any part within the document.
    However, it is important to note that the DOM specification itself does not specify that
    documents must be implemented as a tree: only that it is convenient for the logical structure
    of the document to be described as a tree, due to the hierarchical structure of marked-up
    documents. The DOM is therefore a programming interface for documents that is truly
    "structurally neutral" as well.

    Working with parts of the DOM is quite intuitive, since the object structure of the
    DOM very closely resembles the hierarchical structure of the document. For instance,
    the DOM shown in Figure 6b would represent the tagged text example in Figure 6a.
    Again, the hierarchical relationships are logical ones defined in the programming,
    and are not representations of any particular internal structures (Document Object Model
    2000).

    Once a DOM tree is constructed, it can be modified easily by adding/deleting nodes
    and moving sub-trees. The new DOM tree can then be used to output a new XML
    document, since all the information required to do so is held within the DOM
    representation. A DOM tree will not be constructed until the XML document has been
    fully parsed and validated.
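As a small illustration of this cycle (using Python's stdlib DOM purely as a stand-in for the C libraries discussed elsewhere in the thesis), a tree can be parsed, a node added, and a new XML document written back out:

```python
from xml.dom import minidom

# Parse an XML document into an in-memory DOM tree.
doc = minidom.parseString("<list><item>Item 1</item></list>")
root = doc.documentElement

# Modify the tree: create a new item element holding its own character data
# and append it under the root.
new_item = doc.createElement("item")
new_item.appendChild(doc.createTextNode("Item 2"))
root.appendChild(new_item)

# The modified DOM tree can emit a new XML document, since it holds
# all the information required to do so.
print(doc.toxml())
```

The serialized output contains both items, demonstrating that the DOM representation is complete enough to regenerate the document.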


    Figure 6 – DOM representation of XML example.

    3.5.5 SAX Parsing

    A downside to the DOM is that most XML parsers implementing the DOM make the
    entire tree reside in memory: apart from putting a strain on system resources, it also
    limits the size of the XML document that can be processed (Python/XML Howto 2000)
    (libxml 2000). Also, if, say, the application only needs to search the XML document for
    occurrences of a particular word, it would be inefficient to construct a complete in-
    memory tree to do this.

    [Figure 6 content: the weather example (October 30, 2000; 14:40; Partly cloudy; 18),
    shown as (a) the tagged text and (b) its DOM tree.]


    A SAX handler, on the other hand, can process very large documents, since it does
    not keep the entire document in memory during processing. SAX, the Simple API for
    XML, is a standard interface for event-based XML parsing (SAX 2.0 2000). Instead of
    building a structure representing the entire XML document, SAX reports parsing events
    (such as the start and end of tags) to the application through callbacks.
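A minimal sketch of this callback style (Python's stdlib SAX interface standing in for the parsers the thesis discusses): the handler below counts occurrences of a word in character data, as in the word-search scenario above, without ever building a tree:

```python
import xml.sax

class WordCounter(xml.sax.ContentHandler):
    """Counts occurrences of a word via parse events; no in-memory tree."""
    def __init__(self, word):
        super().__init__()
        self.word = word
        self.count = 0

    def characters(self, content):
        # Called back by the parser for each run of character data.
        self.count += content.count(self.word)

handler = WordCounter("cloudy")
xml.sax.parseString(
    b"<weather><forecast>Partly cloudy</forecast>"
    b"<tomorrow>cloudy again</tomorrow></weather>",
    handler,
)
print(handler.count)
```

Only the handler's own state (here, one counter) lives in memory, which is why SAX scales to documents far larger than a DOM tree could hold.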

    3.5.6 Benefits of XML

    The following benefits of using XML in applications have been identified (Microsoft
    2000) (SoftwareAG 2000b):

    Simplicity: XML is easy to read, write and process by both humans and
    computers.

    Openness: XML is an open and extensible format that leverages other
    (open) standards such as SGML. XML is now a W3C Recommendation, which
    means it is a very stable technology. In addition, XML is highly supported by
    industry market leaders such as Microsoft, IBM, Sun and Netscape, both in
    developer tools and user applications.

    Extensibility: data encoded in XML is not limited to a fixed tag set.
    This enables precise data description, greatly aiding data manipulators such as
    search engines to produce more meaningful searches.

    Local computation and manipulation: once data in XML format is sent
    to the client, all processing can be done on the local machine. The XML DOM
    allows data manipulation through scripting and other programming languages.

    Separation of data from presentation: this allows data to be written,
    read and sent in the best logical mode possible. Multiple views of the data are
    easily rendered, and the look and feel of XML documents can be changed
    through XSL style sheets; this means that the actual content of the document
    need not be changed.

    Granular updates: the structure of XML documents allows for granular
    updates to take place, since only modified elements need to be sent from the
    server to the client. This is currently a problem with HTML, since even with the
    slightest modification a page needs to be rebuilt. Granular updates will help
    reduce server workload.

    Scalability: separation of data from presentation also allows authors to
    embed within the structured data procedural descriptions of how to produce
    different views. This offloads much of the user interaction from the server to the


    Figure 7 – Talking


    Festival is a widely recognized research project developed at the Centre for Speech
    Technology Research (CSTR), University of Edinburgh, with the aim of offering a free,
    high quality text-to-speech system for the advancement of research (Black, Taylor and
    Caley 1999). The MBROLA project, initiated by the TCTS Lab of the Faculté
    Polytechnique de Mons (Belgium), is a free multi-lingual speech synthesizer developed
    with aims similar to Festival's (MBROLA Project Homepage 2000).

    Figure 8 – Top level outline showing how the Festival and MBROLA systems were used together.

    It was decided for this project to use the Festival system as the natural language
    parser (NLP) component of the module, which accepts text as input and transcribes this
    to its phoneme equivalent, plus duration and pitch information. This information can
    then be given to the MBROLA synthesizer, acting as the digital signal processing unit
    (DSP), which produces a waveform from this information. Although Festival has its own
    DSP unit, it was found that the Festival + MBROLA combination produces the best
    quality. It is important to note that the Festival system supports MBROLA.

    Because of the phoneme-duration-pitch input format required for MBROLA, it
    provides very fine pitch and timing control for each phoneme in the utterance. As stated
    before, this level of control is simply unattainable with commercial systems except

    [Figure 8 diagram: Text → NLP (Festival) → phonemes, pitch and duration →
    Pitch and Timing Modifier → modified phonemes, pitch and duration →
    DSP (MBROLA) → Waveform.]


    DECtalk. The advantage of using MBROLA over DECtalk, however, is in the fact that
    once a phoneme's pitch is altered in the latter system, the generated pitch contour is
    overwritten. Cahn (1990) first mentioned this problem, and as a result did not manipulate
    the utterance at the phoneme level, limiting the amount of control, which ultimately
    hindered the quality of the simulated emotion. To overcome this, Murray and Arnott
    (1995) had to write their own intonation model to replace the DECtalk-generated pitch
    contour when they changed pitch values at the phoneme level. Fortunately, this is not an
    issue with MBROLA, as changes to the pitch and duration levels can be made prior to
    passing them to MBROLA (as Figure 9 shows). Therefore, it can be seen that the
    Festival NLP and MBROLA DSP combination provides the fine control required.
    Festival's source code can be ported to the Win32 platform via relatively minor
    modifications. The MBROLA Homepage offers binaries for many platforms, including
    Win32, Linux, most Unix OS versions, BeOS, Macintosh and more.
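For reference, MBROLA's input is a plain-text phoneme file: each line names a phoneme, gives its duration in milliseconds, and may add (position %, pitch Hz) pairs placing targets on that phoneme's pitch contour. The fragment below is hypothetical (symbols and values illustrative only), not taken from the thesis:

```
; phoneme   duration(ms)   (position %, pitch Hz) pairs
_     100
h     60     50 120
@     120    20 125   80 130
l     70
@U    180    50 115
_     100
```

It is this per-phoneme duration and pitch-target format that lets the Pitch and Timing Modifier of Figure 8 adjust emotion-related parameters before any waveform is generated.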

    Before the final decision was made to make use of the Festival system, however, an
    important issue required investigation. The previous TTS module of the Talking Head
    did not use the Festival system because, although it was acknowledged that Festival's
    output is of a very high quality, the computation time was deemed far too expensive
    for use in an interactive application (Crossman 1999). For example, the phrase "Hello
    everybody. This is the voice of a Talking Head. The Talking Head project consists of
    researchers from Curtin University and will create a 3D model of a human head that will
    answer questions inside a web browser." took about 45 seconds to synthesize on a Silicon
    Graphics Indy workstation (Crossman 2000). It is contested, however, that the negative
    impression of the Festival system that could be formed from such data may be a little
    misleading. Though execution time may take longer on an SGI Indy workstation, informal
    testing on several standard PCs (Win32 and Linux platforms) showed that the same
    phrase took less than 5 seconds to synthesize (including the generation of a waveform).
    Since TTS processing is done on the server side, the system can be easily configured to
    ensure Festival will carry out its processing on a faster machine. Therefore, Festival's
    synthesis time was not considered a problem.
    3.6.2 XML Parser


    Since it is expected that the program's input will contain marked-up text, an XML parser
    was required to parse and validate the input and create a DOM tree structure for easy
    processing. There are a number of freely available XML parsers, though many are still in
    the development stage and implement the XML specification to varying degrees. One of the
    more complete parsers is libxml, a freely available XML C library for Gnome (libxml
    2000).

    Using libxml as the XML parser fulfilled the needs of the project in a number of
    ways:

    a) Portability: written in C, the library is highly portable. Along with the main
    program, it has been successfully ported to the Win32, Linux and IRIX
    platforms.

    b) Small and simple: only a limited range of the XML features are being used,
    therefore a complex parser was not required. This is not to say that libxml is a
    trivial library, as it offers some powerful features.

    c) Efficiency: informal testing showed libxml parses large documents in
    surprisingly little time. Although not used for this project, libxml offers a
    SAX interface to allow for more memory-efficient parsing (see Section 3.5.5).

    d) Free: libxml can be obtained cost-free and license-free.

    It is important to note that the libxml library's DOM tree building feature was used to
    help create the required objects that hold the program's utterance information. However,
    care was taken to make sure the program's objects were not dependent on the XML
    parser being used. Instead, a wrapper class, CTTSSMLParser, used libxml as the XML
    parser and output a custom tree-like structure very similar to that of the DOM. This
    ensured that all other objects within the program used the custom structure and not the
    DOM tree that libxml outputs. (See Chapter 5 for more details.)
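The thesis's wrapper is a C++ class built on libxml; as a language-neutral sketch of the same decoupling idea (class and field names here are illustrative, not the thesis's), a wrapper can parse with one library internally but hand the rest of the program only a custom tree, so that swapping parsers touches a single class:

```python
from xml.dom import minidom

class UtteranceNode:
    """Custom tree node: the only structure the rest of the program sees."""
    def __init__(self, name, text=""):
        self.name = name
        self.text = text
        self.children = []

class SMLParser:
    """Wrapper: uses a real XML parser internally, outputs UtteranceNodes.
    Changing the underlying parser library would only change this class."""
    def parse(self, text):
        dom = minidom.parseString(text)
        return self._convert(dom.documentElement)

    def _convert(self, element):
        node = UtteranceNode(element.tagName)
        for child in element.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                node.children.append(self._convert(child))
            elif child.nodeType == child.TEXT_NODE:
                node.text += child.data
        return node

root = SMLParser().parse("<sml><p><happy>Hello there!</happy></p></sml>")
print(root.name, root.children[0].children[0].text)
```

The conversion walk is the whole coupling surface: no caller ever holds a DOM object, only UtteranceNodes.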

    3.7 Summary

    This chapter has explored research that was applicable to this project, focusing on how
    the literature can help with achieving the stated objectives and subproblems of Chapter 2
    and supporting the hypotheses of Chapter 4. More specifically, the literature was
    investigated to find the speech correlates of emotion, seeking clear definitions so that
    there was a solid base to work from during the implementation phase. The work of
    prominent researchers in the field of synthetic speech emotion, such as Murray and


    Arnott (1995) and Cahn (1990), who have already attempted to simulate emotional
    speech, was sought in order to gain an understanding of the problems involved and the
    approach taken in solving them.

    The in-depth review of XML served two purposes: a) to describe what XML is and
    what the technology is trying to address, and b) to expound the benefits of XML so as to
    justify why SML was designed to be XML-based. A resource review was given to
    discuss the issues involved when deciding which tools to use for the TTS module, and to
    address one of the subproblems stated in Section 2.3; that is, that the TTS module should
    be able to run across the Win32, Linux and UNIX platforms.


    Chapter 4

    Research Methodology

    The literature review of Chapter 3 enabled the formation of the hypotheses stated in this
    chapter. It also identified areas where limitations would apply and defined the scope of
    the project.

    4.1


    4.2 Limitations and Delimitations

    4.2.1 Limitations

    Two main limitations have been identified:

    1. Vocal Parameters: The quality of the synthesized emotional speech will be
    limited by the ability of the vocal parameters to describe the various emotions.
    This is a reflection of the current level of understanding of speech emotion itself.

    2. Speech Synthesizer Quality: The quality of the speech synthesizer and the
    parameters it is able to handle will also have a direct effect on the speech
    produced. For instance, most speech synthesizers are unable to change voice
    quality features (breathiness, intensity, etc.) without significantly affecting the
    intelligibility of the utterance.

    4.2.2 Delimitations

    The purpose of this research is to determine how well the vocal effects of emotion can be
    added to synthetic speech: it is not concerned with generating an emotional state for the
    Talking Head based on the words it is to speak. Therefore, the system will not know the
    required emotion to simulate from the input text alone. This top-level information will be
    provided through the use of explicit tags, hence the need for the implementation of a
    speech markup language.

    Due to the strict time constraints placed on this project, the emotions that are to be
    simulated by the system were bounded to happiness, sadness and anger. These three
    emotions were chosen because of the wealth of study carried out on them (and
    hence an increased understanding) compared to other emotions. This is because
    happiness, sadness and anger (along with fear and grief) are often referred to as the
    "basic emotions" on which it is believed other emotions are built.


    4.3 Research Methodologies

    The following research methodologies of Mauch and Birch (1993) are applicable to this
    research:

    Design and Demonstration. This is the standard methodology used for
    the design and implementation of software systems. The speech synthesis
    system is being demonstrated as the TTS module of a Talking Head.

    Evaluation. The effectiveness of the system needed to be determined
    via listener questionnaires, testing how well the TTS module supports the stated
    hypotheses. Therefore, an evaluation research methodology was adopted.

    Meta-Analysis. The project involves a number of diverse fields other
    than speech synthesis; namely psychology, paralinguistics and ethology. The
    meta-analysis research methodology was used to determine how well the speech
    emotion parameters described in these fields mapped to speech synthesis.


    Chapter 5

    Implementation

    This chapter discusses the implementation of the TTS module to simulate emotional
    speech for a Talking Head, plus the stated subproblems of Section 2. The discussion
    covers how the module's input is processed and how the various emotional effects were
    implemented. This will involve a description of the various structures and objects that
    are used by the TTS module. Since the module relies heavily on SML, the speech
    markup language that was designed and implemented to enable direct control over the
    module's output, the chapter discusses SML issues such as parsing and tag processing.

    5.1 TTS Interface

    Before an in-depth description of each of the TTS module's components is given, it will
    be beneficial to describe the inputs and outputs of the system. It was important to be able
    to describe the system as a very high-level black box; not only for clarity of design, but
    also to ensure that the replacement of the existing TTS module of the Talking Head
    project would be a smooth one. It also minimizes module and tool interdependency.
    Figure 10 shows the black box design of the system as the TTS module of a Talking Head.

    Figure 10 – Black box design of the system, shown as the TTS module of a Talking Head.


    of this detail, nor should it. What is important to describe at this level is the module's
    interface; how the module produces its output is irrelevant to the user of the module.

    5.1.2 Module Inputs

    Figure 10 shows text as the single input to the TTS module. However, 'text' can be a
    fairly ambiguous description for input, and indeed the module caters for two distinct
    types of text: plain text, and text marked up in the TTS module's own custom Speech
    Markup Language (SML).

    Plain Text

    The simplest form of input, plain text means that the TTS module will endeavour to
    render the speech-equivalent of all the input text. In other words, it will be assumed that
    no characters within the input represent directives for how to generate the speech. As a
    result of this, speech generated using plain text will have default speech parameters,
    spoken with neutral intonation.

    SML Markup

    If direct control over the TTS module's output is desired, then the text to be spoken can
    be marked up in SML, the custom markup language implemented for the module.
    Although an in-depth description of SML will not be given here (see Section 5.2 and
    Appendix A), it was designed to provide the user of the TTS module with the following
    abilities:

    Direct control of speech production. For example, the system
    could be specified to speak at a certain speech rate or pitch, or pronounce a
    particular word in a certain way (this is especially useful for foreign names).

    Control over speaker properties. This gives the ability to control not only
    how the marked-up text is spoken, but also who is speaking.
    Speaker properties such as gender, age and voice can be dynamically changed
    within SML markup.

    The effect of the speaker's emotion on speech. For example, the
    markup may specify that the speaker is sad for a portion of the text. As a
    result, the speech will sound sad. One of the primary objectives of this thesis
    is to determine how effective the simulated effect of emotion on the voice is.


    specification (MPEG 1999). The phoneme-to-viseme translation submodule is one of
    the few that were retained from the existing TTS module.

    5.1.4 C/C++ API

    Modules that call the TTS module do so through its C/C++ API:


    TTS_SpeakText (const char *Message): same as
    CTTSCentral::SpeakText (const char *Filename).

    TTS_SpeakTextEx (const char *Message, int
    Emotion): same as CTTSCentral::SpeakTextEx (const char
    *Message, int Emotion).

    TTS_Destroy (): used to nicely clean up the TTS module once
    it is not needed any more. The function is called once only.
    5.2 SML: Speech Markup Language

    In Section 2.2 it was identified that the design and implementation of a suitable markup
    language was required, so that the emotion of a text segment could be specified, as well as
    providing a means of manipulating other useful speech parameters. SML is the TTS
    module's XML-based Speech Markup Language, designed to meet these requirements.
    This section will provide an overview of how an utterance should be marked up in SML.
    For a description of each SML tag with its associated attributes, see Appendix A. For
    issues regarding SML's implementation, see Section 5.4.

    5.2.1 SML Markup Structure

    An input file containing correct SML markup must contain an XML header declaration at
    the beginning of the file. Following the XML header, the sml tag encapsulates the entire
    marked-up text, and can contain multiple p (paragraph) tags. Figure 11 shows the basic
    layout of an input file marked up in SML. Note that all the XML constraints discussed in
    Section 3.5 apply to SML.

    Figure 11 – Top-level structure of an SML document.
    'i!ure 11 7 0op7level structure o an /L document.

    XML header

    Reference to

    SML v01 DTD

    Root tag

    Paragraphs


In turn, each p node can contain one or more emotion tags (sad, angry, happy, and neutral) and instances of the emph tag; text not contained within an emotion tag is not allowed. For example, Figure 12 shows valid SML markup, while Figure 13 shows SML markup that is invalid because it does not follow this rule. Note that, unlike "lazy" HTML, the paragraph (p) tags must be closed properly.

Figure 12 - Valid SML markup.

Figure 13 - Invalid SML markup.

All tags described in Appendix A can occur inside an emotion tag (except sml, p, and emph). A limitation of SML is that emotion tags cannot occur within other emotion tags. However, unless explicitly specified, most other tags can contain even instances of tags with the same name. For example, a pitch tag can contain another pitch tag, as the following example shows.
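A sketch of such markup, using the tag names given in the text (sml, p, the emotion tags, emph, pitch). The DTD filename, the pitch attribute name, and the sentence content are illustrative assumptions, not taken from the thesis:

```xml
<?xml version="1.0"?>
<!DOCTYPE sml SYSTEM "sml-v01.dtd">
<sml>
  <p>
    <happy>That's not <emph>too</emph> far away.</happy>
    <neutral>
      A <pitch middle="+20%">pitch tag may contain
        <pitch middle="+10%">another</pitch> pitch tag</pitch>.
    </neutral>
  </p>
</sml>
```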


5.3 TTS Module Subsystems Overview

Figure 14 - TTS module subsystems.

As Figure 14 shows, the design of the TTS module subsystems is centered on the SML Document object. The main steps for synthesizing the module's input text involve the creation, processing, and output of the SML Document. This is broken down into the following tasks:

a) Parsing. The input text is parsed by the SML Parser, which creates an SML Document object. The SML Parser makes use of libxml.

b) Text-to-Phoneme Transcription. The Natural Language Parser (NLP) is responsible for transcribing the text into its phoneme equivalent, plus providing intonation information in the form of each phoneme's duration and pitch values. This information is given to the SML Document object and


  • 7/25/2019 24656081-Simulating-Emotional-Speech-for-a-Talking-Head.pdf

    42/120

stored within its internal structures. The NLP unit makes use of the Festival Speech Synthesis System.

c) SML Tag Processing. Any SML tags present in the input text are processed. This usually involves modifying the text or phonemes held within the SML Document.

d) Waveform Generation. The phoneme data held within the SML Document is given to the Digital Signal Processing (DSP) unit to generate a waveform. The DSP makes use of the MBROLA Synthesizer.

e) Viseme Generation. The Visual Module is responsible for transcribing the phonemes to their viseme equivalents. Again, the phoneme data is obtained from the SML Document. In this thesis the Visual Module will not be discussed in any further detail, since it has reused much of the old TTS module's subroutines. Crossman (1999) provides a description of the phoneme-to-viseme translation process.
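The five tasks can be sketched as a pipeline around the central document object. The sketch below is illustrative only: the stand-in stages replace libxml, Festival, and MBROLA, and the type and function names are assumptions, not the thesis's (the real classes are the CTTS* family).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy pipeline mirroring tasks a)-e): parse, transcribe, process tags,
// generate a waveform, generate visemes. Every stage here is a stand-in
// for libxml/Festival/MBROLA and exists only to show the data flow
// through the central document object.
struct SMLDoc {
    std::string text;
    std::vector<std::string> phonemes;
    std::vector<short> waveform;
    std::vector<int> visemes;
};

SMLDoc parse(const std::string& input) {       // a) parsing (stub)
    return SMLDoc{input, {}, {}, {}};
}

void transcribe(SMLDoc& d) {                   // b) one "phoneme" per letter
    for (char c : d.text)
        if (c != ' ') d.phonemes.push_back(std::string(1, c));
}

void processTags(SMLDoc& d) { (void)d; }       // c) tag processing (no-op here)

void generateWaveform(SMLDoc& d) {             // d) one dummy sample per phoneme
    d.waveform.assign(d.phonemes.size(), 0);
}

void generateVisemes(SMLDoc& d) {              // e) one viseme per phoneme
    d.visemes.assign(d.phonemes.size(), 0);
}

size_t synthesize(SMLDoc& d) {
    transcribe(d);
    processTags(d);
    generateWaveform(d);
    generateVisemes(d);
    return d.waveform.size();
}
```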

5.4 SML Parser

The SML Parser, encapsulated in the CTTSSMLParser class, is responsible for parsing the module's text input to ensure both that it is a well-formed XML document and that its structure conforms to the grammar specification of the DTD. If the input is fully validated, then an SML Document object is created based on the input.

To perform full XML parsing on the input, the XML C library libxml is used. Apart from validating the input, libxml also constructs a DOM tree (described in Section 3.5.4) that represents the input's tag structure, should no parse errors occur. The SML Document object that is returned by the SML Parser follows the hierarchical structure of the DOM very closely. The parser therefore traverses the DOM and creates an SML Document containing nodes mirroring the DOM's structure. Once the SML Document has been constructed, the DOM is destroyed and the SML Document is returned.

It was mentioned in Section 5.1.2 that the TTS module is able to handle unknown tags present within the input markup. This is because the input is filtered to remove all unknown tags before any validation parsing is done by libxml. In doing so, the DOM tree that libxml creates does not hold any unknown tag nodes and, as a consequence, neither does the SML Document.


The TTS module keeps track of all SML tag names by keeping a special XML document that holds SML tag information [2]. Filtering of the input is done by creating a copy of the input file and copying across only those tags that are known. It is important that this filtering process is carried out because the input is envisaged to contain other, non-SML tags, such as those belonging to the FAML module. Figure 15 shows the filtering process.

Figure 15 - Filtering process of unknown tags.
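The filtering step amounts to a tag-name lookup performed while copying the input. The following sketch is a simplification under stated assumptions: it handles attribute-free, well-formed tags only, and the function name and in-memory interface are illustrative (the real module filters at the file level, using the tag list read from tag-names.xml).

```cpp
#include <cassert>
#include <set>
#include <string>

// Copies 'input' to the result, keeping only tags whose name appears in
// 'known'. Unknown tags are dropped while their text content is kept,
// mirroring the filtering done before libxml validation. Simplified:
// no attributes, no error handling.
std::string filterUnknownTags(const std::string& input,
                              const std::set<std::string>& known) {
    std::string out;
    size_t i = 0;
    while (i < input.size()) {
        if (input[i] != '<') { out += input[i++]; continue; }
        size_t close = input.find('>', i);
        if (close == std::string::npos) { out += input.substr(i); break; }
        std::string name = input.substr(i + 1, close - i - 1);
        if (!name.empty() && name[0] == '/') name = name.substr(1);
        if (known.count(name))                  // known tag: copy it across
            out += input.substr(i, close - i + 1);
        i = close + 1;                          // unknown tag: skip it
    }
    return out;
}
```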

5.5 SML Document

As the TTS module's subsystems diagram shows (Figure 14), the SML Document is at the core of the TTS module. Its role is to store all information required for speech synthesis to take place, such as word, phoneme, and intonation data. It also contains the full tag information that appears in the input; in fact, such is the depth of information held that the SML markup could easily be recreated from the information held in the SML Document. The tag data is used to control the manipulation of the text and phoneme data. In this section we describe the structure of the SML Document as well as the various structures required to perform the above-mentioned role. Finally, the data held within the SML Document is used to produce a waveform. The SML Document object is encapsulated by the CTTSSMLDocument and CTTSSMLNode classes.

5.5.1 Tree Structure

In the last section it was mentioned that the structure of the SML Document matches very closely that of the XML DOM. The SML Document consists of a hierarchy of nodes that represent the information held in the input SML markup. Therefore the nodes

[2] The XML document is called "tag-names.xml" and is held in the special TTS resource directory "TTS_rc".


  • 7/25/2019 24656081-Simulating-Emotional-Speech-for-a-Talking-Head.pdf

    44/120

hold markup information, attribute values, and character data. Figure 16 shows the high-level structure of an SML Document that would be constructed for the accompanying SML markup. Note how each node has a type that specifies what kind of node it is.

The hierarchical nature of the SML Document implies which text sections will be rendered in what way - a parent will affect all its children. So, for the example in Figure 16, the emph node will affect the phoneme data of its (one) child node, the text node containing the text "too". The happy node will affect the phoneme data of all its (three) child nodes, containing the text "That's not", "too", and "far away" respectively. Tags that were specified with attribute values are represented by element nodes that point to attribute information (this is not shown in Figure 16 for clarity).

Figure 16 - SML Document structure for the SML markup given above.

5.5.2 Utterance Structures

Each text node contains its own utterance information, which comprises word- and phoneme-related data. The information is held in different layers:

(The accompanying SML markup text reads: "10 Main Street. That's not too far away.")


1. Utterance level - the whole phrase held in that node. The CTTSUtteranceInfo class is responsible for holding information at this level.

2. Word level - the individual words of the utterance. The CTTSWordInfo class is responsible for holding information at this level.

3. Phoneme level - the phonemes that make up the words. The CTTSPhonemeInfo class is responsible for holding information at this level.

4. Phoneme pitch level - the pitch values of the phonemes (phonemes can have multiple pitch values). The CTTSPitchPatternPoint class is responsible for holding information at this level.

The above-mentioned objects are organized within a text node as follows:

- A text node contains one CTTSUtteranceInfo object.

- The CTTSUtteranceInfo object contains a list of CTTSWordInfo objects that hold word information.

- In turn, each CTTSWordInfo object contains a list of CTTSPhonemeInfo objects that hold phoneme information. A CTTSPhonemeInfo object contains the actual phoneme and its duration (ms).

- Each CTTSPhonemeInfo object then contains a list of CTTSPitchPatternPoint objects that hold pitch information for each phoneme. A pitch point is characterized by a pitch value and a percentage value of where the point occurs within the phoneme's duration.
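A minimal sketch of the four layers as nested containers. The thesis's classes are CTTSUtteranceInfo, CTTSWordInfo, CTTSPhonemeInfo, and CTTSPitchPatternPoint; the field names, the helper method, and the duration figures used below are assumptions (the pitch values are those shown in Figure 17).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Nested utterance structures: an utterance holds words, a word holds
// phonemes, a phoneme holds pitch points. A pitch point pairs a pitch
// value with the percentage of the phoneme's duration at which it occurs.
struct PitchPatternPoint { int percent; int pitchHz; };

struct PhonemeInfo {
    std::string phoneme;
    int durationMs;
    std::vector<PitchPatternPoint> pitchPoints;
};

struct WordInfo {
    std::string word;
    std::vector<PhonemeInfo> phonemes;
};

struct UtteranceInfo {
    std::vector<WordInfo> words;
    int totalDurationMs() const {      // sum the phoneme durations
        int t = 0;
        for (const auto& w : words)
            for (const auto& p : w.phonemes) t += p.durationMs;
        return t;
    }
};
```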


Figure 17 - Utterance structures to hold the phrase "the moon". U = a CTTSUtteranceInfo object, W = a CTTSWordInfo object, P = a CTTSPhonemeInfo object, pp = a CTTSPitchPatternPoint object.

5.6 Natural Language Parser

As introduced in Section 5.3, the NLP (Natural Language Parser) module is responsible for transcribing the text to be rendered as speech into its phoneme equivalent. It is also responsible for generating intonation information by providing pitch and duration values for each phoneme in the utterance. The goals this module sets out to achieve are non-trivial, and it is not surprising that this stage takes by far the longest time of any of the stages in the speech synthesis process. Dutoit (1997) gives an excellent discussion of the problems the NLP unit of a speech synthesizer must overcome.

Since the phoneme transcription and the intonation information greatly affect the quality of the synthesized speech, it was very important to have an NLP that would produce high-quality output. As mentioned in Section 3.4.1, the Festival Speech Synthesis System was chosen to provide these services; it is able to generate output comparable to that of commercial speech synthesizers.

(Figure 17 depicts the following hierarchy:

U "the moon"
  W "the":  P "dh" with pp (0, 95);  P "@" with pp (50, 101)
  W "moon": P "m" with pp (0, 102); P "uu" with pp (50, 110); P "n" with pp (100, 103)

where each pp pair gives the percentage position inside the phoneme's length and the pitch value.)


5.6.1 Obtaining a Phoneme Transcription

As described in Section 5.5, each text node within the SML Document contains utterance objects that will ultimately hold the node's word and phoneme information. One of the intermediate steps in obtaining a phoneme transcription is to tokenize the input character string into words. For example, the character string "On May 5 1985, 1985 people moved to Livingston" would be tokenized by Festival into the following words: "On May fifth nineteen eighty five one thousand nine hundred and eighty five people moved to Livingston". This illustrates the complexity of the input that Festival is able to handle, which has a direct effect on user perception of the intelligence of the Talking Head.

To tokenize the contents of the SML Document, the tree is traversed and each text node's content is individually given to Festival. Festival returns the tokens in the character string, and these are stored as words in the corresponding node's utterance object. Figure 18 shows how each node holds its own token information.

Figure 18 - Tokenization of a part of an SML Document.

Once word information is stored within each node's utterance object, phoneme data can be generated for each word. Obtaining the actual phoneme data (including intonation) is a more complex process, however. This is because an entire phrase should be given to Festival in order for correct intonation to be generated. As an example, consider the following SML markup (the corresponding nodes held in the SML Document are shown in Figure 19).

(Figure 18 shows a markup fragment in which the text "10 oranges cost", inside a neutral tag, is tokenized to the words "ten oranges cost", and "$8.30", inside an emph tag, is tokenized to "eight dollars thirty".)



Figure 19 - SML Document sub-tree representing the example SML markup.

If each text node's contents are given to Festival one at a time (i.e. first "I wonder,", then "you pronounced it", and so forth), then although Festival will be able to produce the correct phonemes, it will not generate proper pitch and timing information for them. This will result in an utterance whose words are pronounced properly but which contains inappropriate intonation breaks that make the utterance sound unnatural.

An appropriate analogy would be a person who is shown a pack of cards with words written on them, one card at a time, and asked to read them out loud. The person, not knowing what words will follow, will not know how to give the phrase an appropriate intonation.

Now, if the same person is given a card that contains the entire sentence on it, then, knowing what the phrase is saying, the person will read it out loud correctly. The same approach was taken in the solution to this problem. Continuing the above example will help to show how this is done.

1. The SML Document is traversed until an emotion node is encountered. In the example, traversal would stop at the happy node.

2. The contents of its child text nodes are then concatenated to make one phrase. So the contents of the four text nodes in Figure 19 would be concatenated to form the phrase "I wonder, you pronounced it tomato, did you not?" The phrase is stored in a temporary utterance object held in the happy node.



3. The phrase is given to Festival, and Festival generates the phoneme transcription as well as the intonation information.

4. The entire phoneme data is stored in the happy node's temporary utterance object.

5. Because each text node already contains word information in its utterance object, it is a simple process to disperse the phoneme data held in the happy node amongst its children. The temporary utterance object in happy is then destroyed.

If this procedure is followed, correct intonation is given to the utterance. Of course, a limitation is that this does not solve the problem of an emotion change occurring mid-sentence. However, the algorithm makes the assumption that this will not occur frequently, and that when it does, the intonation will not need to continue over emotion boundaries, so a break is acceptable.
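The concatenate-synthesize-disperse procedure can be sketched as follows. Festival is replaced by a stub that emits one "phoneme" per letter so that the redistribution logic is runnable, and the node type is a simplification of the real SML Document nodes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Each child text node already knows its words; phonemes for the whole
// phrase are produced in one call (stub below) and then dispersed back
// to the children word by word, as described in the steps above.
struct TextNode {
    std::vector<std::string> words;
    std::vector<std::string> phonemes;  // filled in by the dispersal step
};

// Stand-in for Festival: one "phoneme" per letter of each word.
std::vector<std::vector<std::string>> festivalStub(
        const std::vector<std::string>& words) {
    std::vector<std::vector<std::string>> perWord;
    for (const auto& w : words) {
        std::vector<std::string> ph;
        for (char c : w) ph.push_back(std::string(1, c));
        perWord.push_back(ph);
    }
    return perWord;
}

void synthesizeEmotionNode(std::vector<TextNode>& children) {
    // Steps 1-2: concatenate the children's words into one phrase.
    std::vector<std::string> phrase;
    for (const auto& c : children)
        phrase.insert(phrase.end(), c.words.begin(), c.words.end());
    // Steps 3-4: one synthesis call for the whole phrase.
    auto perWord = festivalStub(phrase);
    // Step 5: disperse the phoneme data back amongst the children.
    size_t w = 0;
    for (auto& c : children)
        for (size_t i = 0; i < c.words.size(); ++i, ++w)
            c.phonemes.insert(c.phonemes.end(),
                              perWord[w].begin(), perWord[w].end());
}
```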

5.6.2 Synthesizing in Sections

Speech synthesis can be a processor-intensive task and can take a significant amount of time and memory when synthesizing larger utterances. Finding any way to minimize the waiting time is highly desirable, especially when the speech production is being waited upon by an interactive Talking Head.

There was a concern that if a very large amount of SML markup was given to the TTS module, the execution time would be unacceptable for someone communicating with the Talking Head. To prevent this from occurring, a solution was implemented that took advantage of the client/server architecture of the Talking Head.

Instead of the entire SML Document being synthesized in one go, smaller portions (at the emotion node level) are synthesized one at a time on the server and sent to the client. As the Talking Head on the client begins to "speak", the server synthesizes the next emotion tag of the SML Document. By the time the Talking Head has finished talking, the next utterance is ready to be "spoken". This way, the actual waiting time is really only for the first utterance, and is now dependent on the communication speed between server and client rather than the synthesis time of the whole document. Figure 20 represents a timeline for the example markup.

It should be noted that this section-oriented method of producing speech involves not only the NLP submodule but also all the steps in the synthesis process after the creation of the SML Document.
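The effect of sectioned synthesis on waiting time can be illustrated with a small calculation. This is a simplification that ignores client-server transfer time, and the durations in the usage below are invented.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Computes when each section starts playing if the server synthesizes
// sections in order while the client plays finished ones, as in the
// timeline of Figure 20. synth[i] and play[i] are per-section durations
// in arbitrary time units.
std::vector<int> playStartTimes(const std::vector<int>& synth,
                                const std::vector<int>& play) {
    std::vector<int> start(synth.size());
    int synthDone = 0, prevPlayEnd = 0;
    for (size_t i = 0; i < synth.size(); ++i) {
        synthDone += synth[i];                        // section i is ready
        start[i] = std::max(synthDone, prevPlayEnd);  // wait for both events
        prevPlayEnd = start[i] + play[i];
    }
    return start;
}
```

With three sections, the user's wait is essentially the first section's synthesis time; later sections are synthesized while earlier ones play.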


Figure 20 - Raw timeline showing server and client execution when synthesizing the example SML markup above.

5.6.3 Portability Issues

To address the portability issue stated in Section 2.2, it was important that Festival be usable on multiple platforms. Because the Festival system has been developed primarily for the UNIX platform, compiling it for IRIX 6.3 was relatively straightforward. Similarly, obtaining a Linux version of Festival was also effortless, since Linux RPMs (RedHat Package Manager packages) containing precompiled Festival libraries are available. Although it has not been tested extensively on the Win32 platform, the Festival developers are confident that the source code is platform-independent enough for Festival to compile on Win32 machines without too many changes.

Despite this optimism, a considerable amount of the project's effort went into realizing this objective. In fact, changes made to the code were kept track of and, as the list grew, a help document for compiling Festival with Microsoft Visual C++ was made available at http://www.computing.edu.au/~stalloj/projects/honours/festival-help.html.

(Figure 20 content, left to right in time:

SERVER: synthesizing neutral tag (Utterance 1) | synthesizing happy tag (Utterance 2) | synthesizing sad tag (Utterance 3) | idle
CLIENT: idle | playing Utterance 1 | playing Utterance 2 | playing Utterance 3)

5.7 Implementation of Emotion Tags

Previous sections have described the framework constructed to support the main hypothesis; that is, to simulate the effect of emotion on speech. This section discusses the implementation of SML's emotion tags which, when used to mark up text, cause the text to be rendered with the specified emotion.

As has already been stated in this thesis, the speech correlates of emotion needed to be investigated in the literature for the main objectives to be met. Section 3.2 described the findings of this research, and a table was constructed that describes the speech correlates for four of the five so-called "basic" emotions: anger, happiness, sadness, and fear (see Table 1). The table formed the basis for implementing the angry, happy, and sad SML tags. For ease of reference, the contents of Table 1 for the anger, happiness, and sadness emotions are shown again in the following table:

                 Anger                 Happiness            Sadness
  Speech rate    Faster                Slightly faster      Slightly slower
  Pitch average  Very much higher      Much higher          Slightly lower
  Pitch range    Much wider            Much wider           Slightly narrower
  Intensity      Higher                Higher               Lower
  Pitch changes  Abrupt, downward      Smooth, upward       Downward
                 directed contours     inflections          inflections
  Voice quality  Breathy, chesty tone  Breathy, blaring     Resonant
  Articulation   Clipped               Slightly slurred     Slurred

Terms used by Murray and Arnott (1993).

Table 2 - Summary of human vocal emotion effects for anger, happiness, and sadness.

To implement the guidelines found in the literature on human vocal emotion, Murray and Arnott (1995) developed a number of prosodic rules for their HAMLET system. The TTS module has adopted some of these rules, though slight modifications were required. In addition, other similar prosodic rules have been developed through personal experimentation.


5.7.1 Sadness

    Basic Speech Correlates

Following the literature-derived guidelines for the speech correlates of emotion shown in Table 2, Table 3 shows the parameter values set for the SML sad tag. The values were optimized for the TTS module and are given as percentage values relative to neutral speech.

  Parameter      Value (relative to neutral speech)
  Speech rate    *+
  Pitch average  *+
  Pitch range    *-+
  Volume         ./0

Table 3 - Speech correlate values implemented for sadness.

As a result of the above speech parameter changes, the speech is slower, lower in tone, and more monotonic (the pitch range reduction gives a flatter intonation curve). The volume is reduced for sadness so that the speaker talks more softly. (Implementation details on how speech rate, volume, and pitch values are modified can be found in Section 5.8.)

    Prosodic rules

The following rules, adopted from Murray and Arnott (1995), were deemed necessary for the simulation of sadness. Some parameter values were slightly modified to work best with the TTS module.

1. Eliminate abrupt changes in pitch between phonemes. The phoneme data is scanned, and if any phoneme pair has a pitch difference greater than 10%, the lower of the two pitch values is increased by 5% of the pitch range.

2. Add pauses after long words. If any word in the utterance contains six or more phonemes, a slight pause (80 milliseconds) is inserted after the word.

The following rules were developed specifically for the TTS module.

1. Lower the pitch of every word that occurs before a pause. Such words are lowered by scanning the phoneme data in the particular word and lowering the last vowel-sounding phoneme (and any consonant-sounding phonemes that follow) by 15%. This has the effect of lowering the last syllable.

2. Final lowering of utterance. The last syllable of the last word in the utterance is lowered in pitch by 15%.
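The first adopted rule and the pause rule can be sketched as follows, operating on a flat list of per-phoneme pitch values rather than the real CTTSPhonemeInfo structures. The text's "10%" threshold is interpreted here as 10% of the pitch range, which is an assumption; the 5% correction and the 80 ms pause follow the text.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Sadness rule 1: where two neighbouring phonemes differ in pitch by
// more than 10% of the pitch range, raise the lower one by 5% of the
// range, smoothing abrupt changes.
void smoothPitch(std::vector<int>& pitch, int range) {
    for (size_t i = 0; i + 1 < pitch.size(); ++i) {
        if (std::abs(pitch[i] - pitch[i + 1]) > range / 10) {
            int& lower = pitch[i] < pitch[i + 1] ? pitch[i] : pitch[i + 1];
            lower += range / 20;  // 5% of the pitch range
        }
    }
}

// Sadness rule 2: an 80 ms pause after words of six or more phonemes.
int pauseAfterWordMs(int phonemeCount) {
    return phonemeCount >= 6 ? 80 : 0;
}
```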

5.7.2 Happiness

Utterances usually have a pitch drop in the final vowel and any following consonants. This rule increases the pitch values of these phonemes by 15%, hence reducing the size of the terminal pitch fall.

5.7.3 Anger

    Basic Speech Correlates

Table 5 shows the param