
Simulating Emotional Speech for a Talking Head
November 2000

Contents

1 Introduction .......................................................... 1

2 Problem Description ................................................... 2

2.1 Objectives .......................................................... 2

2.2 Subproblems ......................................................... 2

2.3 Significance ........................................................ 3

3 Literature Review ..................................................... 5

3.1 Emotion and Speech .................................................. 5

3.2 The Speech Correlates of Emotion .................................... 6

3.3 Emotion in Speech Synthesis ......................................... 8

3.4 Speech Markup Languages ............................................. 9

3.5 Extensible Markup Language (XML) ................................... 10

3.5.1 XML Features ..................................................... 10

3.5.2 The XML Document ................................................. 11

3.5.3 DTDs and Validation .............................................. 12

3.5.4 Document Object Model (DOM) ...................................... 14

3.5.5 SAX Parsing ...................................................... 15

3.5.6 Benefits of XML .................................................. 16

3.5.7 Future Directions in XML ......................................... 17

3.6 FAITH .............................................................. 18

3.7 Resource Review .................................................... 20

3.7.1 Text-to-Speech Synthesizer ....................................... 20

3.7.2 XML Parser ....................................................... 22

3.8 Summary ............................................................ 23

4 Research Methodology ................................................. 25

4.1 Hypotheses ......................................................... 25

Page i


4.2 Limitations and Delimitations ...................................... 26

4.2.1 Limitations ...................................................... 26

4.2.2 Delimitations .................................................... 26

4.3 Research Methodologies ............................................. 27

5 Implementation ....................................................... 28

5.1 TTS Interface ...................................................... 28

5.1.2 Module Inputs .................................................... 29

5.1.3 Module Outputs ................................................... 30

5.1.4 CTTS Utterance Structures ........................................ 37

5.6 Natural Language Parser ............................................ 39

5.6.1 Obtaining a Phoneme Transcription ................................ 40

5.6.2 Synthesizing in Sections ......................................... 42

5.6.3 Portability Issues ............................................... 43

5.7 Implementation of Emotion Tags ..................................... 44

5.7.1 Sadness .......................................................... 45

5.7.2 Happiness ........................................................ 46

5.7.3 Anger ............................................................ 47

5.7.4 Stressed Vowels .................................................. 48

5.7.5 Conclusion ....................................................... 48

5.8 Implementation of Low-level SML Tags ............................... 49

5.8.1 Speech Tags ...................................................... 49

5.8.2 Speaker Tag ...................................................... 53

5.9 Digital Signal Processor ........................................... 54

5.10 Cooperating with the FAML module .................................. 55

5.11 Summary ........................................................... 57



6 Results and Analysis ................................................. 58

6.1 Data Acquisition ................................................... 58

6.1.1 Questionnaire Structure and Design ............................... 58

6.1.2 Experimental Procedure ........................................... 61

6.1.3 Profile of Participants .......................................... 63

6.2 Recognizing Emotion in Synthetic Speech ............................ 64

6.2.1 Confusion Matrix ................................................. 64

6.2.2 Emotion Recognition for Section 2A ............................... 66

6.2.3 Emotion Recognition for Section 2B ............................... 69

6.2.4 Effect of Vocal Emotion on Emotionless Text ...................... 73

6.2.5 Effect of Vocal Emotion on Emotive Text .......................... 75

6.2.6 Further Analysis ................................................. 75

6.3 Talking Head and Vocal Expression .................................. 77

6.4 Summary ............................................................ 81

7 Future Work .......................................................... 82

7.1 Post Waveform Processing ........................................... 82

7.2 Speaking Styles .................................................... 83

7.3 Speech Emotion Development ......................................... 84

7.4 XML Issues ......................................................... 85

7.5 Talking Head ....................................................... 86

7.6 Increasing Communication Bandwidth ................................. 87

8 Conclusion ........................................................... 88

9 Bibliography ......................................................... 91

10 Appendix A SML Tag Specification .................................... 96

11 Appendix B SML DTD ................................................. 102

12 Appendix C Festival and Visual C++ ................................. 104

13 Appendix D Evaluation Questionnaire ................................ 107

14 Appendix E Test Phrases for Questionnaire, Section 2B .............. 113


List of Figures

Figure 1 - An XML document holding simple weather information .......... 11

Figure 2 - Sample section of a DTD file ................................ 12

Figure 3 - XML syntax error - list and item tags incorrectly matched ... 13

Figure 4 - Well-formed XML document, but does not follow grammar
specification in DTD file (an item tag occurs outside of list tag) ..... 13

Figure 5 - Well-formed XML document that also follows DTD grammar
specification. Will not produce any parse errors ....................... 13

Figure 6 - DOM representation of XML example ........................... 15

Figure 7 - FAITH project architecture .................................. 19

Figure 8 - Talking


Figure 17 - Utterance structures to hold the phrase "the moon".
U = CTTS_UtteranceInfo object, W = CTTS_WordInfo object,
P = CTTS_PhonemeInfo object, pp = CTTS_PitchPatternPoint object ........ 39

Figure 18 - Tokenization of a part of an SML Document .................. 40

Figure 19 - SML Document sub-tree representing example SML markup ...... 41

Figure 20 - Raw timeline showing server and client execution when
synthesizing example SML markup above .................................. 43

Figure 21 - Multiply factors of pitch and duration values for
emphasized phonemes .................................................... 50

Figure 22 - Processing a pause tag ..................................... 51

Figure 23 - The effect of widening the pitch range of an utterance ..... 52

Figure 24 - Processing the pron tag .................................... 52

Figure 25 - Example MBROLA input ....................................... 55

Figure 26 - Example utterance information supplied to the FAML module
by the TTS module. Example phrase: "And now the latest news" ........... 56

Figure 27 - A node carrying waveform processing instructions for an
operation .............................................................. 83

Figure 28 - Insertion of new submodule for post waveform processing .... 83

Figure 29 - SML markup containing a link to a stylesheet ............... 84

Figure 30 - Inclusion of an "XML


List of Tables

Table 1 - Summary of human vocal emotion effects ....................... 8

Table 2 - Summary of human vocal emotion effects for anger,
happiness, and sadness ................................................. 44

Table 3 - Speech correlate values implemented for sadness .............. 45

Table 4 - Speech correlate values implemented for happiness ............ 46

Table 5 - Speech correlate values implemented for anger ................ 47

Table 6 - Vowel-sounding phonemes are discriminated based on their
duration and pitch ..................................................... 48

Table 7 - MBROLA command line option values for en1 and us1 diphone
databases to output male and female voices ............................. 54

Table 8 - Statistics of participants ................................... 63

Table 9 - Confusion matrix template .................................... 64

Table 10 - Confusion matrix with sample data ........................... 65

Table 11 - Confusion matrix showing ideal experiment data: 100%
recognition rate for all simulated emotions ............................ 65

Table 12 - Listener response data for neutral phrases spoken with
happy emotion .......................................................... 66

Table 13 - Section 2A listener response data for neutral phrases ....... 67

Table 14 - Listener response data for Section 2A, Question 1 ........... 68

Table 15 - Listener response data for Section 2A, Question 2 ........... 68

Table 16 - Listener responses for utterances containing emotionless
text with no vocal emotion ............................................. 70

Table 17 - Listener responses for utterances containing emotive text
with no vocal emotion .................................................. 71

Table 18 - Listener responses for utterances containing emotionless
text with vocal emotion ................................................ 72


Table 19 - Listener responses for utterances containing emotive text
with vocal emotion ..................................................... 73

Table 20 - Percentage of listeners who improved in emotion
recognition with the addition of vocal emotion effects for neutral
text ................................................................... 74

Table 21 - Percentage of listeners whose emotion recognition
deteriorated with the addition of vocal emotion effects for neutral
text ................................................................... 74

Table 22 - Percentage of listeners whose emotion recognition improved
with the addition of vocal emotion effects for emotive text ............ 75

Table 23 - Percentage of listeners whose emotion recognition
deteriorated with the addition of vocal emotion effects for emotive
text ................................................................... 75

Table 24 - Listener responses for participants who speak English as
their first language. Utterance type is "neutral text, emotive voice" .. 76

Table 25 - Listener responses for participants who do NOT speak
English as their first language. Utterance type is "neutral text,
emotive voice" ......................................................... 76

Table 26 - Listener responses for participants who speak English as
their first language. Utterance type is "emotive text, emotive voice" .. 77

Table 27 - Listener responses for participants who do NOT speak
English as their first language. Utterance type is "emotive text,
emotive voice" ......................................................... 77

Table 28 - Participant responses when asked to choose the Talking


Chapter 1

Introduction

When we talk, we produce a complex acoustic signal that carries information in addition to the verbal content of the message. Vocal expression tells others about the emotional state of the speaker, as well as qualifying (or even disqualifying) the literal meaning of the words. Because of this, listeners expect to hear vocal effects, paying attention not only to what is being said but to how it is said. The problem with current speech synthesizers is that the effect of emotion on speech is not taken into account, producing output that sounds monotonic or, at worst, distinctly machine-like. As a result, the ability of a Talking Head to express its emotional state will be adversely affected if it uses a plain speech synthesizer to "talk". The objective of this research was to develop a system that is able to incorporate emotional effects in synthetic speech and thus improve the perceived naturalness of a Talking Head.

This thesis reviews the literature in the fields of speech emotion, synthetic speech synthesis, and XML. A discussion of XML features prominently in this thesis because it was the vehicle chosen for directing how the synthetic voice should sound. It also had considerable impact on how speech information was processed. The design and implementation details of the project are discussed to describe the developed system. An in-depth analysis of the project's evaluation data is then given, concluding with a discussion of future work that has been identified.


Chapter 2

Problem Description

2.1 Objectives

Development of the project was aimed at meeting two main objectives to support the hypotheses of Section 4.1:

1. To develop a system that can add simulated emotion effects to synthetic speech. This involved researching the speech correlates of emotion that have been identified in the literature. The findings were to be applied to the control parameters available in a speech synthesizer, allowing a specified emotion to be simulated using rules controlling the parameters.

2. To integrate the system within the TTS (text-to-speech) module of a Talking Head. The speech system was to be added to the Talking Head that is part of the FAITH¹ project. It is being developed jointly at Curtin University of Technology, Western Australia, and the University of Genoa in Italy (Beard et al. 1999). The text-to-speech module must be treated as a "black box", which is consistent with the modular design of FAQbot.

2.2 Subproblems

A number of subproblems were identified to successfully develop a system with the stated objectives.

1. Design and implementation of a speech markup language. It was desirable that the markup language be XML-based; the reasons for this will become apparent later in the thesis. The role of the speech markup language (SML) is to

¹ Facial Animated Interactive Talking Head


provide a way to specify in which emotion a text segment is to be rendered. In addition to this, it was decided to extend the application of the markup to provide a mechanism for the manipulation of generally useful speech properties such as rate, pitch and volume. SML was designed to closely follow the SABLE specification described by Sproat et al. (1998).
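Markup of this kind might look as follows. The tags below are a hypothetical sketch in the spirit of SABLE-style speech markup, not the actual SML tag set (the real specification is given in Appendix A):

```xml
<!-- Hypothetical SML-style markup: an emotion tag wrapping a text
     segment, plus low-level prosody controls such as rate and pitch.
     Tag and attribute names are illustrative only. -->
<sml>
  <emotion type="happiness">
    And now the latest news.
  </emotion>
  <rate speed="-20%">
    <pitch middle="+10%">This part is spoken more slowly, at a higher pitch.</pitch>
  </rate>
</sml>
```

Because the markup is well-formed XML, a standard XML parser can validate it and hand the synthesizer a document tree rather than raw text.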

2. Evaluation of each of the existing text-to-speech (TTS) submodules of the Talking Head was required. Its aim was to determine what could and could not be reused. This included assessing the existing TTS module(s) and the modules that interface with other subsystems of the Talking Head (namely the MPEG-4 subsystem).

3. Cooperative integration with modules that were being concurrently written for the Talking Head, namely the gesture markup language being developed by Huynh (2000). The collaboration between the two subprojects was aimed at providing the Talking Head with synchronization of vocal expressions and facial gestures. An architecture specification for allowing facial and speech synchronization is given by Ostermann et al. (1998).

4. Since the Talking Head is being developed to run over a number of platforms (Win32, Linux and IRIX 6.3), it was crucial that the new TTS module would not hamper efforts to make the Talking Head a platform independent application.

2.3 Significance

The project is significant because, despite the important role of the display of emotion in human communication, current text-to-speech synthesizers do not cater for its effect on speech. Research to add emotion effects to synthetic speech is ongoing, notably by Murray and Arnott (1996), but has been mainly restricted to standalone systems and not part of a Talking Head, as this project set out to do.

Increased naturalness in synthetic speech is seen as being important for its acceptance (Scherer 1996), and this is likely to be the case for applications of Talking Head technology as well. This thesis attempts to address this need. Advances in this area will also benefit work in the fields of speech analysis, speech recognition and speech synthesis when dealing with natural variability. This is because work with the speech correlates of emotion will help support or disprove speech correlates identified in speech analysis, help in proper feature extraction for the automatic recognition of emotion in the voice, and generally improve synthetic speech production.


Chapter 3

Literature Review

This section presents a brief review of the literature relevant to the areas the project is concerned with: the effects of emotion on speech, speech emotion synthesis, XML, and speech markup languages.

3.1 Emotion and Speech

Emotion is an integral part of speech. Semantic meaning in a conversation is conveyed not only in the actual words we say but also in how they are expressed (Knapp 1980; Malandro, Barker and Barker 1989). Even before they can understand words, children display the ability to recognize vocal emotion, illustrating the importance that nature places on being able to convey and recognize emotion in the speech channel bandwidth.

The intrinsic relationship that emotion shares with speech is seen in the direct effect that our emotional state has on the speech production mechanism. Physiological changes such as increased heart rate and blood pressure, muscle tremors, and dryness of mouth have been noted to be brought about by the arousal of the sympathetic nervous system, such as when experiencing fear, anger or joy (Cahn 1990). These effects of emotion on a person's speech apparatus ultimately affect how speech is produced, thus promoting the view that an emotion "carrier wave" is produced for the words spoken (Murray and Arnott 1993).

With emotion being described as "the organism's interface to the world outside" (Scherer 1981), considerable interest has been devoted to investigating the role of emotion in speech, particularly regarding its social aspects (Knapp 1980). One function is to notify others of our behavioural intentions in response to certain events (Scherer 1981). For example, the contraction of one's throat when experiencing fear will produce a harsh voice that is increased in loudness (Murray and Arnott 1993), serving to warn and frighten a would-be assailant, with the body tensing for a possible confrontation. The expression of emotion through speech also serves to communicate to others our judgement of a particular situation. Importantly, vocal changes due to emotion may in fact be cross-cultural in nature, though this may only be true for some emotions, and further work is required to ascertain this for certain (Murray, Arnott and Rohwer 1996).

We also deliberately use vocal expression in speech to communicate various meanings. Sudden pitch changes will make a syllable stand out, highlighting the associated word as an important component of that utterance (Dutoit 1997). A speaker will also pause at the end of key sentences in a discussion to allow listeners the chance to process what was said, and a phrase's pitch will increase towards the end to denote a question (Malandro, Barker and Barker 1989). When something is said in a way that seems to contradict the actual spoken words, we will usually accept the vocal meaning over the verbal meaning. For example, the expression "thanks a lot" spoken in an angry tone will generally be taken in a negative way, and not as a compliment as the literal meaning of the words alone would suggest. This underscores the importance we place on the vocal information that accompanies the verbal content.

3.2 The Speech Correlates of Emotion

Acoustics researchers and psychologists have endeavoured to identify the speech correlates of emotion. The motivation behind this work is based on the demonstrated ability of listeners to recognize different vocal expressions. If vocal emotions are distinguishable, then there are acoustic features responsible for how various emotions are expressed (Scherer 1996). However, this task has met with considerable difficulty. This is because coordination of the speech apparatus to produce vocal expression is done unconsciously, even when a speaking style is consciously adopted (Murray and Arnott 1996).

Traditionally, there have been three major experimental techniques that researchers have used to investigate the speech correlates of emotion (Knapp 1980; Murray and Arnott 1993):

1. Meaningless 'neutral' content (e.g. letters of the alphabet, numbers, etc.) is read by actors who express various emotions.

2. The same utterance is expressed in different emotions. This approach aids in comparing the emotions being studied.

3. The content is ignored altogether, either by using equipment designed to extract various speech attributes or by filtering out the content. The latter technique involves applying a low-pass filter to the speech signal, thus eliminating the high frequencies that word recognition is dependent upon. (This meets with limited success, however, since some of the vocal information also resides in the high frequency range.)
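The content-filtering technique in point 3 can be sketched with a first-order low-pass filter. This is a generic illustration, not code from the project:

```cpp
#include <cstddef>
#include <vector>

// First-order (one-pole) low-pass filter. It attenuates the high
// frequencies that word recognition depends on, while keeping the
// low-frequency prosodic information that carries much of the vocal
// emotion.
std::vector<double> lowPass(const std::vector<double>& samples,
                            double cutoffHz, double sampleRateHz) {
    const double kPi = 3.14159265358979323846;
    const double rc = 1.0 / (2.0 * kPi * cutoffHz);  // filter time constant
    const double dt = 1.0 / sampleRateHz;            // sample period
    const double alpha = dt / (rc + dt);             // smoothing factor in (0, 1)

    std::vector<double> out(samples.size());
    double y = 0.0;
    for (std::size_t i = 0; i < samples.size(); ++i) {
        y += alpha * (samples[i] - y);  // y[i] = y[i-1] + alpha * (x[i] - y[i-1])
        out[i] = y;
    }
    return out;
}
```

Running intelligible speech through such a filter with a cutoff of a few hundred hertz leaves the words unrecognizable while the pitch contour remains audible.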

The problem of speech parameter identification is further compounded by the subjective nature of these tests. This is evident in the literature, as results taken from numerous studies rarely agree with each other. Nevertheless, a general picture of the speech parameters responsible for the expression of emotion can be constructed. There are three main categories of speech correlates of emotion (Cahn 1990; Murray, Arnott and Rohwer 1996):

Pitch contour. The intonation of an utterance, which describes the nature of accents and the overall pitch range of the utterance. Pitch is expressed as fundamental frequency (F0). Parameters include average pitch, pitch range, contour slope and final lowering.

Timing. Describes the speed at which an utterance is spoken, as well as rhythm and the duration of emphasized syllables. Parameters include speech rate, hesitation pauses and exaggeration.

Voice quality. The overall 'character' of the voice, which includes effects such as whispering, hoarseness, breathiness and intensity.

It is believed that value combinations of these speech parameters are used to express vocal emotion. Table 1 is a summary of human vocal emotion effects for four of the so-called basic emotions: anger, happiness, sadness and fear (Murray and Arnott 1993; Galanis, Darsinos and Kokkinakis 1996; Cahn 1990; Davitz 1964; Scherer 1996). The parameter descriptions are relative to neutral speech.

               Anger              Happiness          Sadness            Fear

Speech rate    Faster             Slightly faster    Slightly slower    Much faster

Pitch average  Very much higher   Much higher        Slightly lower     Very much higher

Pitch range    Much wider         Much wider         Slightly narrower  Much wider

Intensity      Higher             Higher             Lower              Higher

Pitch changes  Abrupt, downward   Smooth, upward     Downward           Downward terminal
               directed contours  inflections        inflections        inflections

Voice quality  Breathy, chesty    Breathy, blaring   Resonant           Irregular voicing
               tone

Articulation   Clipped            Slightly slurred   Slurred            Precise

Parameter terms are those used by Murray and Arnott (1993).

Table 1 - Summary of human vocal emotion effects.

The summary should not be taken as a complete and final description, but rather is meant as a guideline only. For instance, the table above emphasizes the role of fundamental frequency as a carrier of vocal emotion. However, Knower (1941, as referred to in Murray and Arnott 1993) notes that whispered speech is able to convey emotion, even though whispering makes no use of the voice's fundamental frequency. Nevertheless, being able to succinctly describe vocal expression like this has significant benefits for simulating emotion in synthetic speech.
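The three categories of correlates above can be collected into a simple data structure. This is an illustrative sketch only; the type and field names are invented, not taken from the project's implementation:

```cpp
#include <string>

// Illustrative grouping of the three categories of speech correlates
// of emotion. Values are expressed relative to neutral speech
// (1.0 = no change for multiplicative factors). All names here are
// hypothetical.
struct EmotionCorrelates {
    // Pitch contour (F0)
    double averagePitchFactor;  // scales average fundamental frequency
    double pitchRangeFactor;    // scales the overall pitch range
    double finalLowering;       // relative F0 drop at the utterance end

    // Timing
    double speechRateFactor;    // scales speaking speed
    double hesitationPauseMs;   // extra pause length, in milliseconds

    // Voice quality
    std::string quality;        // e.g. "breathy", "resonant", "irregular"
    double intensityFactor;     // scales loudness
};
```

A synthesizer driver could hold one such record per emotion and apply it to the neutral prosody it computes for the text.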

    3.3 motion in /peech /"nthesis

    In the past, focus has been placed on developing speech synthesizer techniques to
    produce clearer intelligibility, with intonation being confined to modelling neutral speech.
    However, the speech produced is distinctly machine sounding and unnatural. Speech
    synthesis is seen as being flawed for not possessing appropriate prosodic variation like
    that found in human speech. For this reason, some synthesis models are including the
    effects of emotion on speech to produce greater variability (Murray, Arnott and Rohwer
    1996). Interestingly, Scherer (1996) sees this as being crucial for the acceptance of
    synthetic speech.

    The advantage of the vocal emotion descriptions in Table 1 is that the speech
    parameters can be manipulated in current speech synthesizers to simulate emotional
    speech without dramatically affecting intelligibility. This approach thus allows emotive
    effects to be added on top of the output of text-to-speech synthesizers through the use of


    carefully constructed rules. Two of the better known systems capable of adding emotion-
    by-rule effects to speech are the "Affect Editor" developed by Cahn (1990b) and
    HAMLET, developed by Murray and Arnott (1995) (Murray, Arnott and Newell 1988).
    The systems both make use of the DECtalk text-to-speech synthesizer, mainly because of
    its extensive control parameter features.

    Future work is concerned with building a solid model of emotional speech, as this
    area is seen as being limited by our understanding of vocal expression and the quality of
    the speech correlates used to describe emotional speech (Cahn 1988; Murray and Arnott
    1995; Scherer 1996). Although not within the scope of the project, it is worth
    mentioning that research is being undertaken in concept-to-speech synthesis. This work
    is aimed at improving the intonation of synthetic speech by using extra linguistic
    information (i.e. tagged text) provided by another system, such as a natural language
    generation (NLG) system (Hitzeman et al. 1999).

    Variability in speech is also being investigated in the area of speech recognition, with
    the aim of possibly developing computer interfaces that respond differently according to
    the emotional state of the user (Dellaert, Polzin and Waibel 1996). Another avenue for
    future research could be to incorporate the effects of facial gestures on speech. For
    instance, Hess, Scherer and Kappas (1988) noted that voice quality is judged to be
    friendly over the phone when a person is smiling. A model that could cater for this
    would have extremely beneficial applications for recent work concerned with the
    synchronization of facial gestures and emotive speech in Talking Heads.

    Finally, simulating emotion in synthetic speech not only has the potential to build
    more realistic speech synthesizers (and hence provide the benefits that such a system
    would offer) but will also add to our understanding of speech emotion itself.

    3.4 Speech Markup Languages

    Ideally, a text-to-speech synthesizer would be able to accept plain text as input and speak
    it in a manner comparable to a human: emphasizing important words, pausing for effect,
    and pronouncing foreign words correctly. Unfortunately, automatically processing and
    analyzing plain text is extremely difficult for a machine. Without extra information to
    accompany the words it is to speak, the speech synthesizer will not only sound unnatural,
    but intelligibility will also decrease. Therefore, it is desirable to have an annotation
    scheme that will allow direct control over the speech synthesizer's output.


    Most research and commercial systems allow for such an annotation scheme, but
    almost all are synthesizer dependent, thus making it extremely difficult for software
    developers to build programs that can interface with any speech synthesizer. Recent
    moves by industry leaders to standardize a speech markup language have led to the draft
    specification of SABLE, a system-independent SGML-based markup language (Sproat
    et al. 1998). The SABLE specification has evolved from three existing speech synthesis
    markup languages: SSML (Taylor and Isard 1997), STML (Sproat et al. 1997) and
    Java's JSML.

    3.5 Extensible Markup Language (XML)

    XML is the Extensible Markup Language created by W3C, the World Wide Web
    Consortium (Extensible Markup Language 1998). It was specially designed to enable
    the use of large document management concepts for the World Wide Web that were
    embodied in SGML, the Standard Generalized Markup Language. In adopting SGML
    concepts, however, the aim was also to remove features of SGML that were either not
    needed for Web applications or were very difficult to implement (The XML FAQ 2000).
    The result was a simplified dialect of SGML that is relatively easy to learn, use and
    implement, and at the same time retains much of the power of SGML (Bosak 1997).

    It is important to note that XML is not a markup language in itself; rather, it is a
    meta-language: a language for describing other languages. Therefore, XML allows a
    user to specify the tag set and grammar of their own custom markup language that
    follows the XML specification.

    3.5.1 XML Features

    There are three significant features of XML that make it a very powerful meta-language
    (Bosak 1997):

    1. Extensibility: new tags and their attribute names can be defined at
    will. Because the author of an XML document can mark up data using any
    number of custom tags, the document is able to effectively describe the data
    embodied within the tags. This is not the case with HTML, which uses a
    fixed tag set.


    2. Structure: the structure of an XML document can be nested to
    any level of complexity, since it is the author that defines the tag set and
    grammar of the document.

    3. Validation: if a tag set and grammar definition is provided
    (usually via a Document Type Definition (DTD)), then applications
    processing the XML document can perform structural validation to make sure
    it conforms to the grammar specification. So, though the nested structure of an
    XML document can be quite complex, the fact that it follows a very rigid
    guideline makes document processing relatively easy.

    3.5.2 The XML Document

    An XML document is a sequence of characters that contains markup (the tags that
    describe the text they encapsulate) and character data (the actual text being "marked
    up"). Figure 1 shows an example of a simple XML document.

    Figure 1 – An XML document holding simple weather information.

    One of the main observations that should be made for the example given in Figure 1
    is that an XML document describes only the data, and not how it should be viewed. This
    is unlike HTML, which forces a specific view and does not provide a good mechanism
    for data description (Graham and Quinn 1999). For example, HTML tags such as P,
    DIV and TABLE describe how a browser is to display the encapsulated text, but are



    inadequate for specifying whether the data describes an automotive part, a section
    of a patient's health record, or the price of a grocery item.
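Figure 1 itself is an image that does not survive in this transcript. As a hedged sketch (the element names below are illustrative, not necessarily those of the thesis's actual figure), a weather document of the kind described, and the distinction between markup and the character data it encapsulates, can be shown with Python's standard `xml.dom.minidom`:

```python
# Hypothetical reconstruction of a Figure-1-style weather document;
# element names are illustrative, not the thesis's actual markup.
from xml.dom import minidom

WEATHER_XML = """<?xml version="1.0"?>
<weather>
    <date>October 30, 2000</date>
    <time>14:40</time>
    <forecast>Partly cloudy</forecast>
    <temperature>18</temperature>
</weather>"""

def character_data(tag_name: str) -> str:
    """Return the character data (the 'marked up' text) inside a given tag."""
    doc = minidom.parseString(WEATHER_XML)
    element = doc.getElementsByTagName(tag_name)[0]
    return element.firstChild.data

print(character_data("forecast"))  # only data, no presentation attached
```

Note that nothing in the document says how "Partly cloudy" should be displayed; the tags purely describe what the data is.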

    The fact that an XML document is encoded in plain text was a conscious decision
    made by the XML designers: the designing of a system-independent and vendor-
    independent solution (Bosak 1997). Although text files are usually larger than
    comparable binary formats, this can be easily compensated for using freely available
    utilities that can efficiently compress files, both in terms of size and time. At worst, the
    disadvantages associated with an uncompressed plain text file are deemed to be
    outweighed by the advantages of a universally understood and portable file format that
    does not require special software for encoding and decoding.

    3.5.3 DTDs and Validation

    The XML specification has very strict rules which describe the syntax of an XML
    document: for instance, the characters allowable within the markup section, how tags
    must encapsulate text, the handling of white space, etc. These rigid rules make the tasks
    of parsing and dividing the document into sub-components much easier. A well-formed
    XML document is one that follows the syntax rules set in the XML specification.
    However, since its author determines the structure of the document, a mechanism must be
    provided that allows grammar checking to take place. XML does this through the
    Document Type Definition, or DTD.

    A DTD file is written in XML's Declaration Syntax and contains the formal
    description of a document's grammar (The XML FAQ 2000). It defines, amongst other
    things, which tags can be used and where they can occur, the attributes within each tag,
    and how all the tags fit together.

    Figure 2 gives a sample DTD section that describes two elements, list and item.
    The example declares that one or more item tags can occur within a list tag.
    Furthermore, an item tag may optionally have a type attribute.

    Figure 2 – Sample section of a DTD file.
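The figure itself is not reproduced in this transcript, but the description above pins it down closely. A sketch of such a DTD section (using XML's standard declaration syntax; the exact wording of the thesis's figure may differ) might read:

```dtd
<!-- A list contains one or more item elements -->
<!ELEMENT list (item+)>

<!-- An item holds character data and may optionally carry a type attribute -->
<!ELEMENT item (#PCDATA)>
<!ATTLIST item type CDATA #IMPLIED>
```

Here `item+` expresses "one or more item tags within a list", and `#IMPLIED` marks the type attribute as optional.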


    Extending this example, the different levels of validation performed by an XML
    parser can be seen. Figure 3 shows an XML document that does not meet the syntax
    specified in the XML specification.

    Figure 3 – XML syntax error: list and item tags incorrectly matched.

    Figure 4 shows a well-formed XML document (i.e. it follows the XML syntax) that
    does not follow the grammar specified in the linked DTD file. (The DTD file is the one
    given in Figure 2.)

    Figure 4 – Well-formed XML document that does not follow the grammar specification in the
    DTD file (an item tag occurs outside of a list tag).

    Figure 5 shows a well-formed XML document that also meets the grammar
    specification given in the DTD file.

    Figure 5 – Well-formed XML document that also follows the DTD
    grammar specification. Will not produce any parse errors.
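Figures 3 to 5 are images lost to the transcript, but the three outcomes they illustrate can be sketched with Python's standard library. The stdlib checks well-formedness only; full DTD validation needs a validating parser (such as the libxml library discussed later), so the grammar-level outcome is noted in comments:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

def is_well_formed(text: str) -> bool:
    """True if the document follows XML syntax (tags properly matched, etc.)."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False

# Figure 3 style: list and item tags incorrectly matched -> a syntax error.
broken = "<list><item>Item 1</list></item>"

# Figure 4 style: well-formed, but an item occurs outside a list, so a
# validating parser checking the Figure 2 grammar would reject it.
well_formed_invalid = "<item>Item 1</item>"

# Figure 5 style: well-formed and follows the list/item grammar.
valid = "<list><item>Item 1</item></list>"

print(is_well_formed(broken), is_well_formed(well_formed_invalid), is_well_formed(valid))
```

The documents shown are guesses at the figures' content, constructed from the list/item grammar of Figure 2.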

    The XML Recommendation states that any parse error detected while processing an
    XML document will immediately cause a fatal error (Extensible Markup Language
    1998): the XML document will not be processed any further, and the application will



    not attempt to second-guess the author's intent. Note that the DTD does NOT define how
    the data should be viewed either. Also, the DTD is able to define which sub-elements can
    occur within an element, but not the order in which they occur; the same applies for
    attributes specified for an element. For this reason, an application processing an XML
    document should avoid being dependent on the order of given tags or attributes.

    3.5.4 Document Object Model (DOM)

    The Document Object Model (DOM) Level 1 Specification defines the Document Object
    Model as "a platform- and language-neutral interface that allows programs and scripts to
    dynamically access and update the content, structure and style of documents" (Document
    Object Model 2000). It provides a tree-based representation of an XML document,
    allowing the creation, manipulation and navigation of any part within the document.
    However, it is important to note that the DOM specification itself does not specify that
    documents must be implemented as a tree: only that it is convenient for the logical structure
    of the document to be described as a tree, due to the hierarchical structure of marked-up
    documents. The DOM is therefore a programming interface for documents that is truly
    "structurally neutral" as well.

    Working with parts of the DOM is quite intuitive, since the object structure of the
    DOM very closely resembles the hierarchical structure of the document. For instance,
    the DOM shown in Figure 6b would represent the tagged text example in Figure 6a.
    Again, the hierarchical relationships are logical ones defined in the programming,
    and are not representations of any particular internal structures (Document Object Model
    2000).

    Once a DOM tree is constructed, it can be modified easily by adding/deleting nodes
    and moving sub-trees. The new DOM tree can then be used to output a new XML
    document, since all the information required to do so is held within the DOM
    representation. A DOM tree will not be constructed until the XML document has been
    fully parsed and validated.
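As a small illustration of this cycle (using Python's stdlib DOM purely as a stand-in for the C libraries discussed elsewhere in the thesis), a tree can be parsed, a node added, and a new XML document written back out:

```python
from xml.dom import minidom

# Parse an XML document into an in-memory DOM tree.
doc = minidom.parseString("<list><item>Item 1</item></list>")
root = doc.documentElement

# Modify the tree: create a new item element holding its own character data
# and append it under the root.
new_item = doc.createElement("item")
new_item.appendChild(doc.createTextNode("Item 2"))
root.appendChild(new_item)

# The modified DOM tree can emit a new XML document, since it holds
# all the information required to do so.
print(doc.toxml())
```

The serialized output contains both items, demonstrating that the DOM representation is complete enough to regenerate the document.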


    Figure 6 – DOM representation of XML example.

    3.5.5 SAX Parsing

    A downside to the DOM is that most XML parsers implementing the DOM make the
    entire tree reside in memory: apart from putting a strain on system resources, it also
    limits the size of the XML document that can be processed (Python/XML Howto 2000)
    (libxml 2000). Also, if, say, the application only needs to search the XML document for
    occurrences of a particular word, it would be inefficient to construct a complete in-
    memory tree to do this.

    [Figure 6 content: the weather example (October 30, 2000; 14:40; Partly cloudy; 18),
    shown as (a) the tagged text and (b) its DOM tree.]


    A SAX handler, on the other hand, can process very large documents, since it does
    not keep the entire document in memory during processing. SAX, the Simple API for
    XML, is a standard interface for event-based XML parsing (SAX 2.0 2000). Instead of
    building a structure representing the entire XML document, SAX reports parsing events
    (such as the start and end of tags) to the application through callbacks.
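A minimal sketch of this callback style (Python's stdlib SAX interface standing in for the parsers the thesis discusses): the handler below counts occurrences of a word in character data, as in the word-search scenario above, without ever building a tree:

```python
import xml.sax

class WordCounter(xml.sax.ContentHandler):
    """Counts occurrences of a word via parse events; no in-memory tree."""
    def __init__(self, word):
        super().__init__()
        self.word = word
        self.count = 0

    def characters(self, content):
        # Called back by the parser for each run of character data.
        self.count += content.count(self.word)

handler = WordCounter("cloudy")
xml.sax.parseString(
    b"<weather><forecast>Partly cloudy</forecast>"
    b"<tomorrow>cloudy again</tomorrow></weather>",
    handler,
)
print(handler.count)
```

Only the handler's own state (here, one counter) lives in memory, which is why SAX scales to documents far larger than a DOM tree could hold.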

    3.5.6 Benefits of XML

    The following benefits of using XML in applications have been identified (Microsoft
    2000) (SoftwareAG 2000b):

    Simplicity: XML is easy to read, write and process by both humans and
    computers.

    Openness: XML is an open and extensible format that leverages other
    (open) standards such as SGML. XML is now a W3C Recommendation, which
    means it is a very stable technology. In addition, XML is highly supported by
    industry market leaders such as Microsoft, IBM, Sun and Netscape, both in
    developer tools and user applications.

    Extensibility: data encoded in XML is not limited to a fixed tag set.
    This enables precise data description, greatly aiding data manipulators such as
    search engines to produce more meaningful searches.

    Local computation and manipulation: once data in XML format is sent
    to the client, all processing can be done on the local machine. The XML DOM
    allows data manipulation through scripting and other programming languages.

    Separation of data from presentation: this allows data to be written,
    read and sent in the best logical mode possible. Multiple views of the data are
    easily rendered, and the look and feel of XML documents can be changed
    through XSL style sheets; this means that the actual content of the document
    need not be changed.

    Granular updates: the structure of XML documents allows for granular
    updates to take place, since only modified elements need to be sent from the
    server to the client. This is currently a problem with HTML, since even with the
    slightest modification a page needs to be rebuilt. Granular updates will help
    reduce server workload.

    Scalability: separation of data from presentation also allows authors to
    embed within the structured data procedural descriptions of how to produce
    different views. This offloads much of the user interaction from the server to the


    Figure 7 – Talking


    Festival is a widely recognized research project developed at the Centre for Speech
    Technology Research (CSTR), University of Edinburgh, with the aim of offering a free,
    high quality text-to-speech system for the advancement of research (Black, Taylor and
    Caley 1999). The MBROLA project, initiated by the TCTS Lab of the Faculté
    Polytechnique de Mons (Belgium), is a free multi-lingual speech synthesizer developed
    with aims similar to Festival's (MBROLA Project Homepage 2000).

    Figure 8 – Top level outline showing how the Festival and MBROLA systems were used together.

    It was decided for this project to use the Festival system as the natural language
    parser (NLP) component of the module, which accepts text as input and transcribes this
    to its phoneme equivalent, plus duration and pitch information. This information can
    then be given to the MBROLA synthesizer, acting as the digital signal processing unit
    (DSP), which produces a waveform from this information. Although Festival has its own
    DSP unit, it was found that the Festival + MBROLA combination produces the best
    quality. It is important to note that the Festival system supports MBROLA.

    Because of the phoneme-duration-pitch input format required for MBROLA, it
    provides very fine pitch and timing control for each phoneme in the utterance. As stated
    before, this level of control is simply unattainable with commercial systems except

    [Figure 8 diagram: Text → NLP (Festival) → phonemes, pitch and duration →
    Pitch and Timing Modifier → modified phonemes, pitch and duration →
    DSP (MBROLA) → Waveform.]


    DECtalk. The advantage of using MBROLA over DECtalk, however, is in the fact that
    once a phoneme's pitch is altered in the latter system, the generated pitch contour is
    overwritten. Cahn (1990) first mentioned this problem, and as a result did not manipulate
    the utterance at the phoneme level, limiting the amount of control, which ultimately
    hindered the quality of the simulated emotion. To overcome this, Murray and Arnott
    (1995) had to write their own intonation model to replace the DECtalk-generated pitch
    contour when they changed pitch values at the phoneme level. Fortunately, this is not an
    issue with MBROLA, as changes to the pitch and duration levels can be made prior to
    passing them to MBROLA (as Figure 9 shows). Therefore, it can be seen that the
    Festival NLP and MBROLA DSP combination provides the fine control required.
    Festival's source code can be ported to the Win32 platform via relatively minor
    modifications. The MBROLA Homepage offers binaries for many platforms, including
    Win32, Linux, most Unix OS versions, BeOS, Macintosh and more.
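For reference, MBROLA's input is a plain-text phoneme file: each line names a phoneme, gives its duration in milliseconds, and may add (position %, pitch Hz) pairs placing targets on that phoneme's pitch contour. The fragment below is hypothetical (symbols and values illustrative only), not taken from the thesis:

```
; phoneme   duration(ms)   (position %, pitch Hz) pairs
_     100
h     60     50 120
@     120    20 125   80 130
l     70
@U    180    50 115
_     100
```

It is this per-phoneme duration and pitch-target format that lets the Pitch and Timing Modifier of Figure 8 adjust emotion-related parameters before any waveform is generated.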

    Before the final decision was made to make use of the Festival system, however, an
    important issue required investigation. The previous TTS module of the Talking Head
    did not use the Festival system because, although it was acknowledged that Festival's
    output is of a very high quality, the computation time was deemed far too expensive
    for use in an interactive application (Crossman 1999). For example, the phrase "Hello
    everybody. This is the voice of a Talking Head. The Talking Head project consists of
    researchers from Curtin University and will create a 3D model of a human head that will
    answer questions inside a web browser." took about 45 seconds to synthesize on a Silicon
    Graphics Indy workstation (Crossman 2000). It is contested, however, that the negative
    impression of the Festival system that could be formed from such data may be a little
    misleading. Though execution time may take longer on an SGI Indy workstation, informal
    testing on several standard PCs (Win32 and Linux platforms) showed that the same
    phrase took less than 5 seconds to synthesize (including the generation of a waveform).
    Since TTS processing is done on the server side, the system can be easily configured to
    ensure Festival will carry out its processing on a faster machine. Therefore, Festival's
    synthesis time was not considered a problem.
    3.6.2 XML Parser


    Since it is expected that the program's input will contain marked-up text, an XML parser
    was required to parse and validate the input and create a DOM tree structure for easy
    processing. There are a number of freely available XML parsers, though many are still in
    the development stage and implement the XML specification to varying degrees. One of the
    more complete parsers is libxml, a freely available XML C library for Gnome (libxml
    2000).

    Using libxml as the XML parser fulfilled the needs of the project in a number of
    ways:

    a) Portability: written in C, the library is highly portable. Along with the main
    program, it has been successfully ported to the Win32, Linux and IRIX
    platforms.

    b) Small and simple: only a limited range of the XML features are being used,
    therefore a complex parser was not required. This is not to say that libxml is a
    trivial library, as it offers some powerful features.

    c) Efficiency: informal testing showed libxml parses large documents in
    surprisingly little time. Although not used for this project, libxml offers a
    SAX interface to allow for more memory-efficient parsing (see Section 3.5.5).

    d) Free: libxml can be obtained cost-free and license-free.

    It is important to note that the libxml library's DOM tree building feature was used to
    help create the required objects that hold the program's utterance information. However,
    care was taken to make sure the program's objects were not dependent on the XML
    parser being used. Instead, a wrapper class, CTTSSMLParser, used libxml as the XML
    parser and output a custom tree-like structure very similar to that of the DOM. This
    ensured that all other objects within the program used the custom structure and not the
    DOM tree that libxml outputs. (See Chapter 5 for more details.)
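The thesis's wrapper is a C++ class built on libxml; as a language-neutral sketch of the same decoupling idea (class and field names here are illustrative, not the thesis's), a wrapper can parse with one library internally but hand the rest of the program only a custom tree, so that swapping parsers touches a single class:

```python
from xml.dom import minidom

class UtteranceNode:
    """Custom tree node: the only structure the rest of the program sees."""
    def __init__(self, name, text=""):
        self.name = name
        self.text = text
        self.children = []

class SMLParser:
    """Wrapper: uses a real XML parser internally, outputs UtteranceNodes.
    Changing the underlying parser library would only change this class."""
    def parse(self, text):
        dom = minidom.parseString(text)
        return self._convert(dom.documentElement)

    def _convert(self, element):
        node = UtteranceNode(element.tagName)
        for child in element.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                node.children.append(self._convert(child))
            elif child.nodeType == child.TEXT_NODE:
                node.text += child.data
        return node

root = SMLParser().parse("<sml><p><happy>Hello there!</happy></p></sml>")
print(root.name, root.children[0].children[0].text)
```

The conversion walk is the whole coupling surface: no caller ever holds a DOM object, only UtteranceNodes.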

    3.7 Summary

    This chapter has explored research that was applicable to this project, focusing on how
    the literature can help with achieving the stated objectives and subproblems of Chapter 2
    and supporting the hypotheses of Chapter 4. More specifically, the literature was
    investigated to find the speech correlates of emotion, seeking clear definitions so that
    there was a solid base to work from during the implementation phase. The work of
    prominent researchers in the field of synthetic speech emotion, such as Murray and


    Arnott (1995) and Cahn (1990), who have already attempted to simulate emotional
    speech, was sought in order to gain an understanding of the problems involved and the
    approach taken in solving them.

    The in-depth review of XML served two purposes: a) to describe what XML is and
    what the technology is trying to address, and b) to expound the benefits of XML so as to
    justify why SML was designed to be XML-based. A resource review was given to
    discuss the issues involved when deciding which tools to use for the TTS module, and to
    address one of the subproblems stated in Section 2.3; that is, that the TTS module should
    be able to run across the Win32, Linux and UNIX platforms.


    Chapter 4

    Research Methodology

    The literature review of Chapter 3 enabled the formation of the hypotheses stated in this
    chapter. It also identified areas where limitations would apply and defined the scope of
    the project.

    4.1


    4.2 Limitations and Delimitations

    4.2.1 Limitations

    Two main limitations have been identified:

    1. Vocal Parameters: The quality of the synthesized emotional speech will be
    limited by the ability of the vocal parameters to describe the various emotions.
    This is a reflection of the current level of understanding of speech emotion itself.

    2. Speech Synthesizer Quality: The quality of the speech synthesizer and the
    parameters it is able to handle will also have a direct effect on the speech
    produced. For instance, most speech synthesizers are unable to change voice
    quality features (breathiness, intensity, etc.) without significantly affecting the
    intelligibility of the utterance.

    4.2.2 Delimitations

    The purpose of this research is to determine how well the vocal effects of emotion can be
    added to synthetic speech: it is not concerned with generating an emotional state for the
    Talking Head based on the words it is to speak. Therefore, the system will not know the
    required emotion to simulate from the input text alone. This top-level information will be
    provided through the use of explicit tags, hence the need for the implementation of a
    speech markup language.

    Due to the strict time constraints placed on this project, the emotions that are to be
    simulated by the system were bounded to happiness, sadness and anger. These three
    emotions were chosen because of the wealth of study carried out on them (and
    hence an increased understanding) compared to other emotions. This is because
    happiness, sadness and anger (along with fear and grief) are often referred to as the
    "basic emotions" on which it is believed other emotions are built.


    4.3 Research Methodologies

    The following research methodologies of Mauch and Birch (1993) are applicable to this
    research:

    Design and Demonstration. This is the standard methodology used for
    the design and implementation of software systems. The speech synthesis
    system is being demonstrated as the TTS module of a Talking Head.

    Evaluation. The effectiveness of the system needed to be determined
    via listener questionnaires, testing how well the TTS module supports the stated
    hypotheses. Therefore, an evaluation research methodology was adopted.

    Meta-Analysis. The project involves a number of diverse fields other
    than speech synthesis; namely psychology, paralinguistics and ethology. The
    meta-analysis research methodology was used to determine how well the speech
    emotion parameters described in these fields mapped to speech synthesis.


    Chapter 5

    Implementation

    This chapter discusses the implementation of the TTS module to simulate emotional
    speech for a Talking Head, plus the stated subproblems of Section 2. The discussion
    covers how the module's input is processed and how the various emotional effects were
    implemented. This will involve a description of the various structures and objects that
    are used by the TTS module. Since the module relies heavily on SML, the speech
    markup language that was designed and implemented to enable direct control over the
    module's output, the chapter discusses SML issues such as parsing and tag processing.

    5.1 TTS Interface

    Before an in-depth description of each of the TTS module's components is given, it will
    be beneficial to describe the inputs and outputs of the system. It was important to be able
    to describe the system as a very high-level black box; not only for clarity of design, but
    also to ensure that the replacement of the existing TTS module of the Talking Head
    project would be a smooth one. It also minimizes module and tool interdependency.
    Figure 10 shows the black box design of the system as the TTS module of a Talking Head.

    Figure 10 – Black box design of the system, shown as the TTS module of a Talking Head.


    of this detail, nor should it. What is important to describe at this level is the module's
    interface; how the module produces its output is irrelevant to the user of the module.

    5.1.2 Module Inputs

    Figure 10 shows text as the single input to the TTS module. However, 'text' can be a
    fairly ambiguous description for input, and indeed the module caters for two distinct
    types of text: plain text, and text marked up in the TTS module's own custom Speech
    Markup Language (SML).

    Plain Text

    The simplest form of input, plain text means that the TTS module will endeavour to
    render the speech-equivalent of all the input text. In other words, it will be assumed that
    no characters within the input represent directives for how to generate the speech. As a
    result of this, speech generated using plain text will have default speech parameters,
    spoken with neutral intonation.

    SML Markup

    If direct control over the TTS module's output is desired, then the text to be spoken can
    be marked up in SML, the custom markup language implemented for the module.
    Although an in-depth description of SML will not be given here (see Section 5.2 and
    Appendix A), it was designed to provide the user of the TTS module with the following
    abilities:

    Direct control of speech production. For example, the system
    could be specified to speak at a certain speech rate or pitch, or pronounce a
    particular word in a certain way (this is especially useful for foreign names).

    Control over speaker properties. This gives the ability to control not only
    how the marked-up text is spoken, but also who is speaking.
    Speaker properties such as gender, age and voice can be dynamically changed
    within SML markup.

    The effect of the speaker's emotion on speech. For example, the
    markup may specify that the speaker is sad for a portion of the text. As a
    result, the speech will sound sad. One of the primary objectives of this thesis
    is to determine how effective the simulated effect of emotion on the voice is.


    specification (MPEG 1999). The phoneme-to-viseme translation submodule is one of
    the few that were retained from the existing TTS module.

    5.1.4 C/C++ API

    Modules that call the TTS module do so through its C/C++ API:


    TTS_SpeakText (const char *Message): same as
    CTTSCentral::SpeakText (const char *Filename).

    TTS_SpeakTextEx (const char *Message, int
    Emotion): same as CTTSCentral::SpeakTextEx (const char
    *Message, int Emotion).

    TTS_Destroy (): used to nicely clean up the TTS module once
    it is not needed any more. The function is called once only.
    5.2 SML: Speech Markup Language

    In Section 2.2 it was identified that the design and implementation of a suitable markup
    language was required, so that the emotion of a text segment could be specified, as well as
    providing a means of manipulating other useful speech parameters. SML is the TTS
    module's XML-based Speech Markup Language, designed to meet these requirements.
    This section will provide an overview of how an utterance should be marked up in SML.
    For a description of each SML tag with its associated attributes, see Appendix A. For
    issues regarding SML's implementation, see Section 5.4.

    5.2.1 SML Markup Structure

    An input file containing correct SML markup must contain an XML header declaration at
    the beginning of the file. Following the XML header, the sml tag encapsulates the entire
    marked-up text, and can contain multiple p (paragraph) tags. Figure 11 shows the basic
    layout of an input file marked up in SML. Note that all the XML constraints discussed in
    Section 3.5 apply to SML.

    Figure 11 – Top-level structure of an SML document.
    'i!ure 11 7 0op7level structure o an /L document.

    XML header

    Reference to

    SML v01 DTD

    Root tag

    Paragraphs


In turn, each p node can contain one or more emotion tags (sad, angry, happy, and neutral) and instances of the emph tag; text not contained within an emotion tag is not allowed. For example, Figure 12 shows valid SML markup, while Figure 13 shows SML markup that is invalid because it does not follow this rule. Note that, unlike "lazy" HTML, the paragraph (p) tags must be closed properly.

Figure 12 - Valid SML markup.

Figure 13 - Invalid SML markup.

All tags described in Appendix A can occur inside an emotion tag (except sml, p, and emph). A limitation of SML is that emotion tags cannot occur within other emotion tags. However, unless explicitly specified, most other tags can contain even instances of tags with the same name. For example, a pitch tag can contain another pitch tag, as the following example shows.
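A sketch of such markup, using the tag names given in the text (sml, p, the emotion tags, emph, pitch). The DTD filename, the pitch attribute name, and the sentence content are illustrative assumptions, not taken from the thesis:

```xml
<?xml version="1.0"?>
<!DOCTYPE sml SYSTEM "sml-v01.dtd">
<sml>
  <p>
    <happy>That's not <emph>too</emph> far away.</happy>
    <neutral>
      A <pitch middle="+20%">pitch tag may contain
        <pitch middle="+10%">another</pitch> pitch tag</pitch>.
    </neutral>
  </p>
</sml>
```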


5.3 TTS Module Subsystems Overview

Figure 14 - TTS module subsystems.

As Figure 14 shows, the design of the TTS module subsystems is centered on the SML Document object. The main steps for synthesizing the module's input text involve the creation, processing, and output of the SML Document. This is broken down into the following tasks:

a) Parsing. The input text is parsed by the SML Parser, which creates an SML Document object. The SML Parser makes use of libxml.

b) Text-to-Phoneme Transcription. The Natural Language Parser (NLP) is responsible for transcribing the text into its phoneme equivalent, plus providing intonation information in the form of each phoneme's duration and pitch values. This information is given to the SML Document object and


  • 7/25/2019 24656081-Simulating-Emotional-Speech-for-a-Talking-Head.pdf

    42/120

stored within its internal structures. The NLP unit makes use of the Festival Speech Synthesis System.

c) SML Tag Processing. Any SML tags present in the input text are processed. This usually involves modifying the text or phonemes held within the SML Document.

d) Waveform Generation. The phoneme data held within the SML Document is given to the Digital Signal Processing (DSP) unit to generate a waveform. The DSP makes use of the MBROLA Synthesizer.

e) Viseme Generation. The Visual Module is responsible for transcribing the phonemes to their viseme equivalents. Again, the phoneme data is obtained from the SML Document. In this thesis the Visual Module will not be discussed in any further detail, since it has reused much of the old TTS module's subroutines. Crossman (1999) provides a description of the phoneme-to-viseme translation process.
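The five tasks can be sketched as a pipeline around the central document object. The sketch below is illustrative only: the stand-in stages replace libxml, Festival, and MBROLA, and the type and function names are assumptions, not the thesis's (the real classes are the CTTS* family).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy pipeline mirroring tasks a)-e): parse, transcribe, process tags,
// generate a waveform, generate visemes. Every stage here is a stand-in
// for libxml/Festival/MBROLA and exists only to show the data flow
// through the central document object.
struct SMLDoc {
    std::string text;
    std::vector<std::string> phonemes;
    std::vector<short> waveform;
    std::vector<int> visemes;
};

SMLDoc parse(const std::string& input) {       // a) parsing (stub)
    return SMLDoc{input, {}, {}, {}};
}

void transcribe(SMLDoc& d) {                   // b) one "phoneme" per letter
    for (char c : d.text)
        if (c != ' ') d.phonemes.push_back(std::string(1, c));
}

void processTags(SMLDoc& d) { (void)d; }       // c) tag processing (no-op here)

void generateWaveform(SMLDoc& d) {             // d) one dummy sample per phoneme
    d.waveform.assign(d.phonemes.size(), 0);
}

void generateVisemes(SMLDoc& d) {              // e) one viseme per phoneme
    d.visemes.assign(d.phonemes.size(), 0);
}

size_t synthesize(SMLDoc& d) {
    transcribe(d);
    processTags(d);
    generateWaveform(d);
    generateVisemes(d);
    return d.waveform.size();
}
```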

5.4 SML Parser

The SML Parser, encapsulated in the CTTSSMLParser class, is responsible for parsing the module's text input to ensure both that it is a well-formed XML document and that its structure conforms to the grammar specification of the DTD. If the input is fully validated, then an SML Document object is created based on the input.

To perform full XML parsing on the input, the XML C library libxml is used. Apart from validating the input, libxml also constructs a DOM tree (described in Section 3.5.4) that represents the input's tag structure, should no parse errors occur. The SML Document object that is returned by the SML Parser follows the hierarchical structure of the DOM very closely. The parser therefore traverses the DOM and creates an SML Document containing nodes mirroring the DOM's structure. Once the SML Document has been constructed, the DOM is destroyed and the SML Document is returned.

It was mentioned in Section 5.1.2 that the TTS module is able to handle unknown tags present within the input markup. This is because the input is filtered to remove all unknown tags before any validation parsing is done by libxml. In doing so, the DOM tree that libxml creates does not hold any unknown tag nodes and, as a consequence, neither does the SML Document.


The TTS module keeps track of all SML tag names by keeping a special XML document that holds SML tag information [2]. Filtering of the input is done by creating a copy of the input file and copying across only those tags that are known. It is important that this filtering process is carried out because the input is envisaged to contain other, non-SML tags, such as those belonging to the FAML module. Figure 15 shows the filtering process.

Figure 15 - Filtering process of unknown tags.
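The filtering step amounts to a tag-name lookup performed while copying the input. The following sketch is a simplification under stated assumptions: it handles attribute-free, well-formed tags only, and the function name and in-memory interface are illustrative (the real module filters at the file level, using the tag list read from tag-names.xml).

```cpp
#include <cassert>
#include <set>
#include <string>

// Copies 'input' to the result, keeping only tags whose name appears in
// 'known'. Unknown tags are dropped while their text content is kept,
// mirroring the filtering done before libxml validation. Simplified:
// no attributes, no error handling.
std::string filterUnknownTags(const std::string& input,
                              const std::set<std::string>& known) {
    std::string out;
    size_t i = 0;
    while (i < input.size()) {
        if (input[i] != '<') { out += input[i++]; continue; }
        size_t close = input.find('>', i);
        if (close == std::string::npos) { out += input.substr(i); break; }
        std::string name = input.substr(i + 1, close - i - 1);
        if (!name.empty() && name[0] == '/') name = name.substr(1);
        if (known.count(name))                  // known tag: copy it across
            out += input.substr(i, close - i + 1);
        i = close + 1;                          // unknown tag: skip it
    }
    return out;
}
```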

5.5 SML Document

As the TTS module's subsystems diagram shows (Figure 14), the SML Document is at the core of the TTS module. Its role is to store all information required for speech synthesis to take place, such as word, phoneme, and intonation data. It also contains the full tag information that appears in the input; in fact, such is the depth of information held that the SML markup could easily be recreated from the information held in the SML Document. The tag data is used to control the manipulation of the text and phoneme data. In this section we describe the structure of the SML Document as well as the various structures required to perform the above-mentioned role. Finally, the data held within the SML Document is used to produce a waveform. The SML Document object is encapsulated by the CTTSSMLDocument and CTTSSMLNode classes.

5.5.1 Tree Structure

In the last section it was mentioned that the structure of the SML Document matches very closely that of the XML DOM. The SML Document consists of a hierarchy of nodes that represent the information held in the input SML markup. Therefore the nodes

[2] The XML document is called "tag-names.xml" and is held in the special TTS resource directory "TTS_rc".


  • 7/25/2019 24656081-Simulating-Emotional-Speech-for-a-Talking-Head.pdf

    44/120

hold markup information, attribute values, and character data. Figure 16 shows the high-level structure of an SML Document that would be constructed for the accompanying SML markup. Note how each node has a type that specifies what kind of node it is.

The hierarchical nature of the SML Document implies which text sections will be rendered in what way - a parent will affect all its children. So, for the example in Figure 16, the emph node will affect the phoneme data of its (one) child node, the text node containing the text "too". The happy node will affect the phoneme data of all its (three) child nodes, containing the text "That's not", "too", and "far away" respectively. Tags that were specified with attribute values are represented by element nodes that point to attribute information (this is not shown in Figure 16 for clarity).

Figure 16 - SML Document structure for the SML markup given above.

5.5.2 Utterance Structures

Each text node contains its own utterance information, which comprises word- and phoneme-related data. The information is held in different layers:

(The accompanying SML markup text reads: "10 Main Street. That's not too far away.")


1. Utterance level - the whole phrase held in that node. The CTTSUtteranceInfo class is responsible for holding information at this level.

2. Word level - the individual words of the utterance. The CTTSWordInfo class is responsible for holding information at this level.

3. Phoneme level - the phonemes that make up the words. The CTTSPhonemeInfo class is responsible for holding information at this level.

4. Phoneme pitch level - the pitch values of the phonemes (phonemes can have multiple pitch values). The CTTSPitchPatternPoint class is responsible for holding information at this level.

The above-mentioned objects are organized within a text node as follows:

- A text node contains one CTTSUtteranceInfo object.

- The CTTSUtteranceInfo object contains a list of CTTSWordInfo objects that hold word information.

- In turn, each CTTSWordInfo object contains a list of CTTSPhonemeInfo objects that hold phoneme information. A CTTSPhonemeInfo object contains the actual phoneme and its duration (ms).

- Each CTTSPhonemeInfo object then contains a list of CTTSPitchPatternPoint objects that hold pitch information for each phoneme. A pitch point is characterized by a pitch value and a percentage value of where the point occurs within the phoneme's duration.
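A minimal sketch of the four layers as nested containers. The thesis's classes are CTTSUtteranceInfo, CTTSWordInfo, CTTSPhonemeInfo, and CTTSPitchPatternPoint; the field names, the helper method, and the duration figures used below are assumptions (the pitch values are those shown in Figure 17).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Nested utterance structures: an utterance holds words, a word holds
// phonemes, a phoneme holds pitch points. A pitch point pairs a pitch
// value with the percentage of the phoneme's duration at which it occurs.
struct PitchPatternPoint { int percent; int pitchHz; };

struct PhonemeInfo {
    std::string phoneme;
    int durationMs;
    std::vector<PitchPatternPoint> pitchPoints;
};

struct WordInfo {
    std::string word;
    std::vector<PhonemeInfo> phonemes;
};

struct UtteranceInfo {
    std::vector<WordInfo> words;
    int totalDurationMs() const {      // sum the phoneme durations
        int t = 0;
        for (const auto& w : words)
            for (const auto& p : w.phonemes) t += p.durationMs;
        return t;
    }
};
```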


Figure 17 - Utterance structures to hold the phrase "the moon". U = a CTTSUtteranceInfo object, W = a CTTSWordInfo object, P = a CTTSPhonemeInfo object, pp = a CTTSPitchPatternPoint object.

5.6 Natural Language Parser

As introduced in Section 5.3, the NLP (Natural Language Parser) module is responsible for transcribing the text to be rendered as speech into its phoneme equivalent. It is also responsible for generating intonation information by providing pitch and duration values for each phoneme in the utterance. The goals this module sets out to achieve are non-trivial, and it is not surprising that this stage takes by far the longest time of any of the stages in the speech synthesis process. Dutoit (1997) gives an excellent discussion of the problems the NLP unit of a speech synthesizer must overcome.

Since the phoneme transcription and the intonation information greatly affect the quality of the synthesized speech, it was very important to have an NLP that would produce high-quality output. As mentioned in Section 3.4.1, the Festival Speech Synthesis System was chosen to provide these services; it is able to generate output comparable to that of commercial speech synthesizers.

(Figure 17 depicts the following hierarchy:

U "the moon"
  W "the":  P "dh" with pp (0, 95);  P "@" with pp (50, 101)
  W "moon": P "m" with pp (0, 102); P "uu" with pp (50, 110); P "n" with pp (100, 103)

where each pp pair gives the percentage position inside the phoneme's length and the pitch value.)


5.6.1 Obtaining a Phoneme Transcription

As described in Section 5.5, each text node within the SML Document contains utterance objects that will ultimately hold the node's word and phoneme information. One of the intermediate steps in obtaining a phoneme transcription is to tokenize the input character string into words. For example, the character string "On May 5 1985, 1985 people moved to Livingston" would be tokenized by Festival into the following words: "On May fifth nineteen eighty five one thousand nine hundred and eighty five people moved to Livingston". This illustrates the complexity of the input that Festival is able to handle, which has a direct effect on user perception of the intelligence of the Talking Head.

To tokenize the contents of the SML Document, the tree is traversed and each text node's content is individually given to Festival. Festival returns the tokens in the character string, and these are stored as words in the corresponding node's utterance object. Figure 18 shows how each node holds its own token information.

Figure 18 - Tokenization of a part of an SML Document.

Once word information is stored within each node's utterance object, phoneme data can be generated for each word. Obtaining the actual phoneme data (including intonation) is a more complex process, however. This is because an entire phrase should be given to Festival in order for correct intonation to be generated. As an example, consider the following SML markup (the corresponding nodes held in the SML Document are shown in Figure 19).

(Figure 18 shows a markup fragment in which the text "10 oranges cost", inside a neutral tag, is tokenized to the words "ten oranges cost", and "$8.30", inside an emph tag, is tokenized to "eight dollars thirty".)



Figure 19 - SML Document sub-tree representing the example SML markup.

If each text node's contents are given to Festival one at a time (i.e. first "I wonder,", then "you pronounced it", and so forth), then although Festival will be able to produce the correct phonemes, it will not generate proper pitch and timing information for them. This will result in an utterance whose words are pronounced properly but which contains inappropriate intonation breaks that make the utterance sound unnatural.

An appropriate analogy would be a person who is shown a pack of cards with words written on them, one card at a time, and asked to read them out loud. The person, not knowing what words will follow, will not know how to give the phrase an appropriate intonation.

Now, if the same person is given a card that contains the entire sentence on it, then, knowing what the phrase is saying, the person will read it out loud correctly. The same approach was taken in the solution to this problem. Continuing the above example will help to show how this is done.

1. The SML Document is traversed until an emotion node is encountered. In the example, traversal would stop at the happy node.

2. The contents of its child text nodes are then concatenated to make one phrase. So the contents of the four text nodes in Figure 19 would be concatenated to form the phrase "I wonder, you pronounced it tomato, did you not?" The phrase is stored in a temporary utterance object held in the happy node.



3. The phrase is given to Festival, and Festival generates the phoneme transcription as well as the intonation information.

4. The entire phoneme data is stored in the happy node's temporary utterance object.

5. Because each text node already contains word information in its utterance object, it is a simple process to disperse the phoneme data held in the happy node amongst its children. The temporary utterance object in happy is then destroyed.

If this procedure is followed, correct intonation is given to the utterance. Of course, a limitation is that this does not solve the problem of an emotion change occurring mid-sentence. However, the algorithm makes the assumption that this will not occur frequently, and that when it does, the intonation will not need to continue over emotion boundaries, so a break is acceptable.
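The concatenate-synthesize-disperse procedure can be sketched as follows. Festival is replaced by a stub that emits one "phoneme" per letter so that the redistribution logic is runnable, and the node type is a simplification of the real SML Document nodes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Each child text node already knows its words; phonemes for the whole
// phrase are produced in one call (stub below) and then dispersed back
// to the children word by word, as described in the steps above.
struct TextNode {
    std::vector<std::string> words;
    std::vector<std::string> phonemes;  // filled in by the dispersal step
};

// Stand-in for Festival: one "phoneme" per letter of each word.
std::vector<std::vector<std::string>> festivalStub(
        const std::vector<std::string>& words) {
    std::vector<std::vector<std::string>> perWord;
    for (const auto& w : words) {
        std::vector<std::string> ph;
        for (char c : w) ph.push_back(std::string(1, c));
        perWord.push_back(ph);
    }
    return perWord;
}

void synthesizeEmotionNode(std::vector<TextNode>& children) {
    // Steps 1-2: concatenate the children's words into one phrase.
    std::vector<std::string> phrase;
    for (const auto& c : children)
        phrase.insert(phrase.end(), c.words.begin(), c.words.end());
    // Steps 3-4: one synthesis call for the whole phrase.
    auto perWord = festivalStub(phrase);
    // Step 5: disperse the phoneme data back amongst the children.
    size_t w = 0;
    for (auto& c : children)
        for (size_t i = 0; i < c.words.size(); ++i, ++w)
            c.phonemes.insert(c.phonemes.end(),
                              perWord[w].begin(), perWord[w].end());
}
```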

5.6.2 Synthesizing in Sections

Speech synthesis can be a processor-intensive task and can take a significant amount of time and memory when synthesizing larger utterances. Finding any way to minimize the waiting time is highly desirable, especially when the speech production is being waited upon by an interactive Talking Head.

There was a concern that if a very large amount of SML markup was given to the TTS module, the execution time would be unacceptable for someone communicating with the Talking Head. To prevent this from occurring, a solution was implemented that took advantage of the client/server architecture of the Talking Head.

Instead of the entire SML Document being synthesized in one go, smaller portions (at the emotion node level) are synthesized one at a time on the server and sent to the client. As the Talking Head on the client begins to "speak", the server synthesizes the next emotion tag of the SML Document. By the time the Talking Head has finished talking, the next utterance is ready to be "spoken". This way, the actual waiting time is really only for the first utterance, and is now dependent on the communication speed between server and client rather than the synthesis time of the whole document. Figure 20 represents a timeline for the example markup.

It should be noted that this section-oriented method of producing speech involves not only the NLP submodule but also all the steps in the synthesis process after the creation of the SML Document.
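The effect of sectioned synthesis on waiting time can be illustrated with a small calculation. This is a simplification that ignores client-server transfer time, and the durations in the usage below are invented.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Computes when each section starts playing if the server synthesizes
// sections in order while the client plays finished ones, as in the
// timeline of Figure 20. synth[i] and play[i] are per-section durations
// in arbitrary time units.
std::vector<int> playStartTimes(const std::vector<int>& synth,
                                const std::vector<int>& play) {
    std::vector<int> start(synth.size());
    int synthDone = 0, prevPlayEnd = 0;
    for (size_t i = 0; i < synth.size(); ++i) {
        synthDone += synth[i];                        // section i is ready
        start[i] = std::max(synthDone, prevPlayEnd);  // wait for both events
        prevPlayEnd = start[i] + play[i];
    }
    return start;
}
```

With three sections, the user's wait is essentially the first section's synthesis time; later sections are synthesized while earlier ones play.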


Figure 20 - Raw timeline showing server and client execution when synthesizing the example SML markup above.

5.6.3 Portability Issues

To address the portability issue stated in Section 2.2, it was important that Festival be usable on multiple platforms. Because the Festival system has been developed primarily for the UNIX platform, compiling it for IRIX 6.3 was relatively straightforward. Similarly, obtaining a Linux version of Festival was also effortless, since Linux RPMs (RedHat Package Manager packages) containing precompiled Festival libraries are available. Although it has not been tested extensively on the Win32 platform, the Festival developers are confident that the source code is platform-independent enough for Festival to compile on Win32 machines without too many changes.

Despite this optimism, a considerable amount of the project's effort went into realizing this objective. In fact, changes made to the code were kept track of and, as the list grew, a help document for compiling Festival with Microsoft Visual C++ was made available at http://www.computing.edu.au/~stalloj/projects/honours/festival-help.html.

(Figure 20 content, left to right in time:

SERVER: synthesizing neutral tag (Utterance 1) | synthesizing happy tag (Utterance 2) | synthesizing sad tag (Utterance 3) | idle
CLIENT: idle | playing Utterance 1 | playing Utterance 2 | playing Utterance 3)

5.7 Implementation of Emotion Tags

Previous sections have described the framework constructed to support the main hypothesis; that is, to simulate the effect of emotion on speech. This section discusses the implementation of SML's emotion tags which, when used to mark up text, cause the text to be rendered with the specified emotion.

As has already been stated in this thesis, the speech correlates of emotion needed to be investigated in the literature for the main objectives to be met. Section 3.2 described the findings of this research, and a table was constructed that describes the speech correlates for four of the five so-called "basic" emotions: anger, happiness, sadness, and fear (see Table 1). The table formed the basis for implementing the angry, happy, and sad SML tags. For ease of reference, the contents of Table 1 for the anger, happiness, and sadness emotions are shown again in the following table:

                 Anger                 Happiness            Sadness
  Speech rate    Faster                Slightly faster      Slightly slower
  Pitch average  Very much higher      Much higher          Slightly lower
  Pitch range    Much wider            Much wider           Slightly narrower
  Intensity      Higher                Higher               Lower
  Pitch changes  Abrupt, downward      Smooth, upward       Downward
                 directed contours     inflections          inflections
  Voice quality  Breathy, chesty tone  Breathy, blaring     Resonant
  Articulation   Clipped               Slightly slurred     Slurred

Terms used by Murray and Arnott (1993).

Table 2 - Summary of human vocal emotion effects for anger, happiness, and sadness.

To implement the guidelines found in the literature on human vocal emotion, Murray and Arnott (1995) developed a number of prosodic rules for their HAMLET system. The TTS module has adopted some of these rules, though slight modifications were required. In addition, other similar prosodic rules have been developed through personal experimentation.


5.7.1 Sadness

    Basic Speech Correlates

Following the literature-derived guidelines for the speech correlates of emotion shown in Table 2, Table 3 shows the parameter values set for the SML sad tag. The values were optimized for the TTS module and are given as percentage values relative to neutral speech.

  Parameter      Value (relative to neutral speech)
  Speech rate    *+
  Pitch average  *+
  Pitch range    *-+
  Volume         ./0

Table 3 - Speech correlate values implemented for sadness.

As a result of the above speech parameter changes, the speech is slower, lower in tone, and more monotonic (the pitch range reduction gives a flatter intonation curve). The volume is reduced for sadness so that the speaker talks more softly. (Implementation details on how speech rate, volume, and pitch values are modified can be found in Section 5.8.)

    Prosodic rules

The following rules, adopted from Murray and Arnott (1995), were deemed necessary for the simulation of sadness. Some parameter values were slightly modified to work best with the TTS module.

1. Eliminate abrupt changes in pitch between phonemes. The phoneme data is scanned, and if any phoneme pair has a pitch difference greater than 10%, the lower of the two pitch values is increased by 5% of the pitch range.

2. Add pauses after long words. If any word in the utterance contains six or more phonemes, a slight pause (80 milliseconds) is inserted after the word.

The following rules were developed specifically for the TTS module.

1. Lower the pitch of every word that occurs before a pause. Such words are lowered by scanning the phoneme data in the particular word and lowering the last vowel-sounding phoneme (and any consonant-sounding phonemes that follow) by 15%. This has the effect of lowering the last syllable.

2. Final lowering of utterance. The last syllable of the last word in the utterance is lowered in pitch by 15%.
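The first adopted rule and the pause rule can be sketched as follows, operating on a flat list of per-phoneme pitch values rather than the real CTTSPhonemeInfo structures. The text's "10%" threshold is interpreted here as 10% of the pitch range, which is an assumption; the 5% correction and the 80 ms pause follow the text.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Sadness rule 1: where two neighbouring phonemes differ in pitch by
// more than 10% of the pitch range, raise the lower one by 5% of the
// range, smoothing abrupt changes.
void smoothPitch(std::vector<int>& pitch, int range) {
    for (size_t i = 0; i + 1 < pitch.size(); ++i) {
        if (std::abs(pitch[i] - pitch[i + 1]) > range / 10) {
            int& lower = pitch[i] < pitch[i + 1] ? pitch[i] : pitch[i + 1];
            lower += range / 20;  // 5% of the pitch range
        }
    }
}

// Sadness rule 2: an 80 ms pause after words of six or more phonemes.
int pauseAfterWordMs(int phonemeCount) {
    return phonemeCount >= 6 ? 80 : 0;
}
```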

5.7.2 Happiness

Utterances usually have a pitch drop in the final vowel and any following consonants. This rule increases the pitch values of these phonemes by 15%, hence reducing the size of the terminal pitch fall.

5.7.3 Anger

    Basic Speech Correlates

Table 5 shows the param