Simulating Emotional Speech for a Talking Head November 2000
Contents

1 Introduction..........1
2 Problem Description..........2
2.1 Objectives..........2
2.2 Subproblems..........2
2.3 Significance..........3
3 Literature Review..........5
3.1 Emotion and Speech..........5
3.2 The Speech Correlates of Emotion..........6
3.3 Emotion in Speech Synthesis..........8
3.4 Speech Markup Languages..........9
3.5 Extensible Markup Language (XML)..........10
3.5.1 XML Features..........10
3.5.2 The XML Document..........11
3.5.3 DTDs and Validation..........12
3.5.4 Document Object Model (DOM)..........14
3.5.5 SAX Parsing..........15
3.5.6 Benefits of XML..........16
3.5.7 Future Directions in XML..........17
3.6 FAITH..........18
3.7 Resource Review..........20
3.7.1 Text-to-Speech Synthesizer..........20
3.7.2 XML Parser..........22
3.8 Summary..........23
4 Research Methodology..........25
4.1 Hypotheses..........25
4.2 Limitations and Delimitations..........26
4.2.1 Limitations..........26
4.2.2 Delimitations..........26
4.3 Research Methodologies..........27
5 Implementation..........28
5.1 TTS Interface..........28
5.1.2 Module Inputs..........29
5.1.3 Module Outputs..........30
5.1.4 Utterance Structures..........37
5.6 Natural Language Parser..........39
5.6.1 Obtaining a Phoneme Transcription..........40
5.6.2 Synthesizing in Sections..........42
5.6.3 Portability Issues..........43
5.7 Implementation of Emotion Tags..........44
5.7.1 Sadness..........45
5.7.2 Happiness..........46
5.7.3 Anger..........47
5.7.4 Stressed Vowels..........48
5.7.5 Conclusion..........48
5.8 Implementation of Low-level SML Tags..........49
5.8.1 Speech Tags..........49
5.8.2 Speaker Tag..........53
5.9 Digital Signal Processor..........54
5.10 Cooperating with the FAML module..........55
5.11 Summary..........57
6 Results and Analysis..........58
6.1 Data Acquisition..........58
6.1.1 Questionnaire Structure and Design..........58
6.1.2 Experimental Procedure..........61
6.1.3 Profile of Participants..........63
6.2 Recognizing Emotion in Synthetic Speech..........64
6.2.1 Confusion Matrix..........64
6.2.2 Emotion Recognition for Section 2A..........66
6.2.3 Emotion Recognition for Section 2B..........69
6.2.4 Effect of Vocal Emotion on Emotionless Text..........73
6.2.5 Effect of Vocal Emotion on Emotive Text..........75
6.2.6 Further Analysis..........75
6.3 Talking Head and Vocal Expression..........77
6.4 Summary..........81
7 Future Work..........82
7.1 Post Waveform Processing..........82
7.2 Speaking Styles..........83
7.3 Speech Emotion Development..........84
7.4 XML Issues..........85
7.5 Talking Head..........86
7.6 Increasing Communication Bandwidth..........87
8 Conclusion..........88
9 Bibliography..........91
10 Appendix A SML Tag Specification..........96
11 Appendix B SML DTD..........102
12 Appendix C Festival and Visual C++..........104
13 Appendix D Evaluation Questionnaire..........107
14 Appendix E Test Phrases for Questionnaire, Section 2B..........113
List of Figures

Figure 1 - An XML document holding simple weather information..........11
Figure 2 - Sample section of a DTD file..........12
Figure 3 - XML syntax error - list and item tags incorrectly matched..........13
Figure 4 - Well-formed XML document, but does not follow grammar specification in DTD file (an item tag occurs outside of list tag)..........13
Figure 5 - Well-formed XML document that also follows DTD grammar specification. Will not produce any parse errors..........13
Figure 6 - DOM representation of XML example..........15
Figure 7 - FAITH project architecture..........19
Figure 8 - Talking
Figure 17 - Utterance structures to hold the phrase "the moon". U = CTTS_UtteranceInfo object, W = CTTS_UtteranceInfo object, P = CTTS_PhonemeInfo object, pp = CTTS_PitchPatternPoint object..........39
Figure 18 - Tokenization of a part of an SML Document..........40
Figure 19 - SML Document sub-tree representing example SML markup..........41
Figure 20 - Raw timeline showing server and client execution when synthesizing example SML markup above..........43
Figure 21 - Multiply factors of pitch and duration values for emphasized phonemes..........50
Figure 22 - Processing a pause tag..........51
Figure 23 - The effect of widening the pitch range of an utterance..........52
Figure 24 - Processing the pron tag..........52
Figure 25 - Example MBROLA input..........55
Figure 26 - Example utterance information supplied to the FAML module by the TTS module. Example phrase: "And now the latest news"..........56
Figure 27 - A node carrying waveform processing instructions for an operation..........83
Figure 28 - Insertion of new submodule for post waveform processing..........83
Figure 29 - SML Markup containing a link to a stylesheet..........84
Figure 30 - Inclusion of an "XML
List of Tables

Table 1 - Summary of human vocal emotion effects..........8
Table 2 - Summary of human vocal emotion effects for anger, happiness, and sadness..........44
Table 3 - Speech correlate values implemented for sadness..........45
Table 4 - Speech correlate values implemented for happiness..........46
Table 5 - Speech correlate values implemented for anger..........47
Table 6 - Vowel-sounding phonemes are discriminated based on their duration and pitch..........48
Table 7 - MBROLA command line option values for en1 and us1 diphone databases to output male and female voices..........54
Table 8 - Statistics of participants..........63
Table 9 - Confusion matrix template..........64
Table 10 - Confusion matrix with sample data..........65
Table 11 - Confusion matrix showing ideal experiment data: 100% recognition rate for all simulated emotions..........65
Table 12 - Listener response data for neutral phrases spoken with happy emotion..........66
Table 13 - Section 2A listener response data for neutral phrases..........67
Table 14 - Listener response data for Section 2A, Question 1..........68
Table 15 - Listener response data for Section 2A, Question 2..........68
Table 16 - Listener responses for utterances containing emotionless text with no vocal emotion..........70
Table 17 - Listener responses for utterances containing emotive text with no vocal emotion..........71
Table 18 - Listener responses for utterances containing emotionless text with vocal emotion..........72
Table 19 - Listener responses for utterances containing emotive text with vocal emotion..........73
Table 20 - Percentage of listeners who improved in emotion recognition with the addition of vocal emotion effects for neutral text..........74
Table 21 - Percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects for neutral text..........74
Table 22 - Percentage of listeners whose emotion recognition improved with the addition of vocal emotion effects for emotive text..........75
Table 23 - Percentage of listeners whose emotion recognition deteriorated with the addition of vocal emotion effects for emotive text..........75
Table 24 - Listener responses for participants who speak English as their first language. Utterance type is "neutral text, emotive voice"..........76
Table 25 - Listener responses for participants who do NOT speak English as their first language. Utterance type is "neutral text, emotive voice"..........76
Table 26 - Listener responses for participants who speak English as their first language. Utterance type is "emotive text, emotive voice"..........77
Table 27 - Listener responses for participants who do NOT speak English as their first language. Utterance type is "emotive text, emotive voice"..........77
Table 28 - Participant responses when asked to choose the Talking
Chapter 1
Introduction
When we talk we produce a complex acoustic signal that carries information in addition to the verbal content of the message. Vocal expression tells others about the emotional state of the speaker, as well as qualifying (or even disqualifying) the literal meaning of the words. Because of this, listeners expect to hear vocal effects, paying attention not only to what is being said but how it is said. The problem with current speech synthesizers is that the effect of emotion on speech is not taken into account, producing output that sounds monotonic or, at worst, distinctly machine-like. As a result, the ability of a Talking Head to express its emotional state will be adversely affected if it uses a plain speech synthesizer to "talk". The objective of this research was to develop a system that is able to incorporate emotional effects in synthetic speech and thus improve the perceived naturalness of a Talking Head.

This thesis reviews the literature in the fields of speech emotion, synthetic speech synthesis, and XML. A discussion of XML features prominently in this thesis because it was the vehicle chosen for directing how the synthetic voice should sound. It also had considerable impact on how speech information was processed. The design and implementation details of the project are discussed to describe the developed system. An in-depth analysis of the project's evaluation data is then given, concluding with a discussion of future work that has been identified.
Chapter 2
Problem Description
2.1 Objectives
Development of the project was aimed at meeting two main objectives to support the hypotheses of Section 4.1:

1. To develop a system that can add simulated emotion effects to synthetic speech. This involved researching the speech correlates of emotion that have been identified in the literature. The findings were to be applied to the control parameters available in a speech synthesizer, allowing a specified emotion to be simulated using rules controlling the parameters.

2. To integrate the system within the TTS (text-to-speech) module of a Talking Head. The speech system was to be added to the Talking Head that is part of the FAITH¹ project. It is being developed jointly at Curtin University of Technology, Western Australia, and the University of Genoa in Italy (Beard et al. 1999). The text-to-speech module must be treated as a "black box", which is consistent with the modular design of FAQBot.
2.2 Subproblems

A number of subproblems were identified to successfully develop a system with the stated objectives.

1. Design and implementation of a speech markup language. It was desirable that the markup language be XML-based; the reasons for this will become apparent later in the thesis. The role of the speech markup language (SML) is to

¹ Facial Animated Interactive Talking Head
provide a way to specify in which emotion a text segment is to be rendered. In addition to this, it was decided to extend the application of the markup to provide a mechanism for the manipulation of generally useful speech properties such as rate, pitch, and volume. SML was designed to closely follow the SABLE specification described by Sproat et al. (1998).
2. Evaluation of each of the existing text-to-speech (TTS) submodules of the Talking Head was required. Its aim was to determine what could and could not be reused. This included assessing the existing TTS module(s) and the modules that interface with other subsystems of the Talking Head (namely the MPEG-4 subsystem).

3. Cooperative integration with modules that were being concurrently written for the Talking Head, namely the gesture markup language being developed by Huynh (2000). The collaboration between the two subprojects was aimed at providing the Talking Head with synchronization of vocal expressions and facial gestures. An architecture specification for allowing facial and speech synchronization is given by Ostermann et al. (1998).

4. Since the Talking Head is being developed to run over a number of platforms (Win32, Linux, and IRIX 6.3), it was crucial that the new TTS module would not hamper efforts to make the Talking Head a platform-independent application.
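To make subproblem 1 concrete, a marked-up input to the TTS module might look like the fragment below. The fragment is purely illustrative: the element and attribute names (sml, sadness, rate, pitch) are placeholders invented here, not the tag set actually defined for SML later in the thesis.

```xml
<sml>
  <sadness>I'm afraid your flight has been cancelled.</sadness>
  <rate speed="-20%">
    <pitch middle="+10%">A replacement has been arranged for tomorrow.</pitch>
  </rate>
</sml>
```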
2.3 Significance

The project is significant because, despite the important role of the display of emotion in human communication, current text-to-speech synthesizers do not cater for its effect on speech. Research to add emotion effects to synthetic speech is ongoing, notably by Murray and Arnott (1996), but has been mainly restricted to standalone systems and not part of a Talking Head, as this project set out to do.

Increased naturalness in synthetic speech is seen as being important for its acceptance (Scherer 1996), and this is likely to be the case for applications of Talking Head technology as well. This thesis attempts to address this need. Advances in this area will also benefit work in the fields of speech analysis, speech recognition, and speech synthesis when dealing with natural variability. This is because work with the speech correlates of emotion will help support or disprove speech correlates identified in speech
analysis, help in proper feature extraction for the automatic recognition of emotion in the voice, and generally improve synthetic speech production.
Chapter 3
Literature Review
This section presents a brief review of the literature relevant to the areas the project is concerned with: the effects of emotion on speech, speech emotion synthesis, XML, and speech markup languages.
3.1 Emotion and Speech

Emotion is an integral part of speech. Semantic meaning in a conversation is conveyed not only in the actual words we say but also in how they are expressed (Knapp 1980; Malandro, Barker, and Barker 1989). Even before they can understand words, children display the ability to recognize vocal emotion, illustrating the importance that nature places on being able to convey and recognize emotion in the speech channel bandwidth.

The intrinsic relationship that emotion shares with speech is seen in the direct effect that our emotional state has on the speech production mechanism. Physiological changes such as increased heart rate and blood pressure, muscle tremors, and dryness of mouth have been noted to be brought about by the arousal of the sympathetic nervous system, such as when experiencing fear, anger, or joy (Cahn 1990). These effects of emotion on a person's speech apparatus ultimately affect how speech is produced, thus promoting the view that an emotion "carrier wave" is produced for the words spoken (Murray and Arnott 1993).

With emotion being described as "the organism's interface to the world outside" (Scherer 1981), considerable interest has been devoted to investigating the role of emotion in speech, particularly regarding its social aspects (Knapp 1980). One function is to notify others of our behavioural intentions in response to certain events (Scherer 1981). For example, the contraction of one's throat when experiencing fear will produce a harsh voice that is increased in loudness (Murray and Arnott 1993), serving to warn and
frighten a would-be assailant, with the body tensing for a possible confrontation. The expression of emotion through speech also serves to communicate to others our judgement of a particular situation. Importantly, vocal changes due to emotion may in fact be cross-cultural in nature, though this may only be true for some emotions, and further work is required to ascertain this for certain (Murray, Arnott, and Rohwer 1996).

We also deliberately use vocal expression in speech to communicate various meanings. Sudden pitch changes will make a syllable stand out, highlighting the associated word as an important component of that utterance (Dutoit 1997). A speaker will also pause at the end of key sentences in a discussion to allow listeners the chance to process what was said, and a phrase's pitch will increase towards the end to denote a question (Malandro, Barker, and Barker 1989). When something is said in a way that seems to contradict the actual spoken words, we will usually accept the vocal meaning over the verbal meaning. For example, the expression "thanks a lot" spoken in an angry tone will generally be taken in a negative way, and not as a compliment as the literal meaning of the words alone would suggest. This underscores the importance we place on the vocal information that accompanies the verbal content.
3.2 The Speech Correlates of Emotion

Acoustics researchers and psychologists have endeavoured to identify the speech correlates of emotion. The motivation behind this work is based on the demonstrated ability of listeners to recognize different vocal expressions. If vocal emotions are distinguishable, then there are acoustic features responsible for how various emotions are expressed (Scherer 1996). However, this task has met with considerable difficulty. This is because coordination of the speech apparatus to produce vocal expression is done unconsciously, even when a speaking style is consciously adopted (Murray and Arnott 1996).

Traditionally, there have been three major experimental techniques that researchers have used to investigate the speech correlates of emotion (Knapp 1980; Murray and Arnott 1993):

1. Meaningless 'neutral' content (e.g. letters of the alphabet, numbers, etc.) is read by actors who express various emotions.

2. The same utterance is expressed in different emotions. This approach aids in comparing the emotions being studied.
3. The content is ignored altogether, either by using equipment designed to extract various speech attributes or by filtering out the content. The latter technique involves applying a low-pass filter to the speech signal, thus eliminating the high frequencies that word recognition is dependent upon. (This meets with limited success, however, since some of the vocal information also resides in the high frequency range.)
The problem of speech parameter identification is further compounded by the subjective nature of these tests. This is evident in the literature, as results taken from numerous studies rarely agree with each other. Nevertheless, a general picture of the speech parameters responsible for the expression of emotion can be constructed. There are three main categories of speech correlates of emotion (Cahn 1990; Murray, Arnott, and Rohwer 1996):

Pitch contour. The intonation of an utterance, which describes the nature of accents and the overall pitch range of the utterance. Pitch is expressed as fundamental frequency (F0). Parameters include average pitch, pitch range, contour slope, and final lowering.

Timing. Describes the speed at which an utterance is spoken, as well as rhythm and the duration of emphasized syllables. Parameters include speech rate, hesitation pauses, and exaggeration.

Voice quality. The overall 'character' of the voice, which includes effects such as whispering, hoarseness, breathiness, and intensity.

It is believed that value combinations of these speech parameters are used to express vocal emotion. Table 1 is a summary of human vocal emotion effects of four of the so-called basic emotions: anger, happiness, sadness, and fear (Murray and Arnott 1993; Galanis, Darsinos, and Kokkinakis 1996; Cahn 1990; Davitz 1964; Scherer 1996). The parameter descriptions are relative to neutral speech.
|               | Anger                                | Happiness                  | Sadness              | Fear                          |
|---------------|--------------------------------------|----------------------------|----------------------|-------------------------------|
| Speech rate   | Faster                               | Slightly faster            | Slightly slower      | Much faster                   |
| Pitch average | Very much higher                     | Much higher                | Slightly lower       | Very much higher              |
| Pitch range   | Much wider                           | Much wider                 | Slightly narrower    | Much wider                    |
| Intensity     | Higher                               | Higher                     | Lower                | Higher                        |
| Pitch changes | Abrupt, downward, directed contours  | Smooth, upward inflections | Downward inflections | Downward terminal inflections |
| Voice quality | Breathy, chesty tone                 | Breathy, blaring           | Resonant             | Irregular voicing             |
| Articulation  | Clipped                              | Slightly slurred           | Slurred              | Precise                       |

Terms used by Murray and Arnott (1993).

Table 1 - Summary of human vocal emotion effects.
The summary should not be taken as a complete and final description, but rather is meant as a guideline only. For instance, the table above emphasizes the role of fundamental frequency as a carrier of vocal emotion. However, Knower (1941, as referred to in Murray and Arnott 1993) notes that whispered speech is able to convey emotion even though whispering makes no use of the voice's fundamental frequency. Nevertheless, being able to succinctly describe vocal expression like this has significant benefits for simulating emotion in synthetic speech.
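For synthesis purposes, qualitative descriptions like those in Table 1 are usually turned into relative parameter settings. The sketch below shows one way such guidelines could be encoded as multipliers against neutral speech; the numeric values are invented for illustration and are not taken from Table 1 or from this project's implementation.

```python
# Illustrative encoding of qualitative vocal-emotion guidelines as
# relative adjustments to neutral speech (all values are invented;
# only their direction follows the table: >1.0 raises, <1.0 lowers).
EMOTION_RULES = {
    "anger":     {"rate": 1.3, "pitch": 1.4,  "range": 1.6, "volume": 1.2},
    "happiness": {"rate": 1.1, "pitch": 1.3,  "range": 1.6, "volume": 1.1},
    "sadness":   {"rate": 0.9, "pitch": 0.95, "range": 0.8, "volume": 0.9},
    "fear":      {"rate": 1.5, "pitch": 1.4,  "range": 1.6, "volume": 1.1},
}

def scaling_for(emotion, parameter):
    """Return the multiplier applied to a neutral value (1.0 = unchanged)."""
    return EMOTION_RULES.get(emotion, {}).get(parameter, 1.0)

print(scaling_for("sadness", "rate"))   # 0.9: slightly slower than neutral
print(scaling_for("neutral", "rate"))   # 1.0: no rule, value unchanged
```

Keeping the rules as data rather than code makes it straightforward to tune a single emotion without touching the synthesis path.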
3.3 Emotion in Speech Synthesis

In the past, focus has been placed on developing speech synthesizer techniques to produce clearer intelligibility, with intonation being confined to modelling neutral speech. However, the speech produced is distinctly machine-sounding and unnatural. Speech synthesis is seen as being flawed for not possessing appropriate prosodic variation like that found in human speech. For this reason, some synthesis models are including the effects of emotion on speech to produce greater variability (Murray, Arnott, and Rohwer 1996). Interestingly, Scherer (1996) sees this as being crucial for the acceptance of synthetic speech.

The advantage of the vocal emotion descriptions in Table 1 is that the speech parameters can be manipulated in current speech synthesizers to simulate emotional speech without dramatically affecting intelligibility. This approach thus allows emotive effects to be added on top of the output of text-to-speech synthesizers through the use of
carefully constructed rules. Two of the better known systems capable of adding emotion-by-rule effects to speech are the "Affect Editor" developed by Cahn (1990b) and HAMLET, developed by Murray and Arnott (1995) (Murray, Arnott, and Newell 1988). Both systems make use of the DECtalk text-to-speech synthesizer, mainly because of its extensive control parameter features.
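Independent of any particular synthesizer, the general shape of an emotion-by-rule pass is a transformation of per-phoneme parameters. The sketch below is a simplified illustration; the Phoneme fields and the scaling factors are assumptions for this example, not the actual control parameters of DECtalk, the Affect Editor, or HAMLET.

```python
# Minimal sketch of an emotion-by-rule pass over synthesizer phoneme data.
from dataclasses import dataclass

@dataclass
class Phoneme:
    symbol: str
    duration_ms: float
    pitch_hz: float

def apply_emotion(phonemes, rate_scale, pitch_scale):
    """Scale duration and pitch of every phoneme relative to neutral.

    A faster speech rate (rate_scale > 1) shortens each phoneme, so
    duration is divided by the rate while pitch is multiplied directly.
    """
    return [
        Phoneme(p.symbol, p.duration_ms / rate_scale, p.pitch_hz * pitch_scale)
        for p in phonemes
    ]

neutral = [Phoneme("h", 60, 110.0), Phoneme("ax", 80, 115.0)]
# "Sadness": slightly slower (rate < 1) and slightly lower pitched.
sad = apply_emotion(neutral, rate_scale=0.9, pitch_scale=0.95)
print(round(sad[0].duration_ms, 1), round(sad[0].pitch_hz, 2))  # 66.7 104.5
```

Because the pass only rescales existing values, intelligibility is largely preserved, which is the same property that makes the rule-based approach attractive on top of an unmodified synthesizer.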
Future work is concerned with building a solid model of emotional speech, as this area is seen as being limited by our understanding of vocal expression and the quality of the speech correlates used to describe emotional speech (Cahn 1988; Murray and Arnott 1995; Scherer 1996). Although not within the scope of the project, it is worth mentioning that research is being undertaken in concept-to-speech synthesis. This work is aimed at improving the intonation of synthetic speech by using extra linguistic information (i.e. tagged text) provided by another system, such as a natural language generation (NLG) system (Hitzeman et al. 1999).

Variability in speech is also being investigated in the area of speech recognition, with the aim of possibly developing computer interfaces that respond differently according to the emotional state of the user (Dellaert, Polzin, and Waibel 1996). Another avenue for future research could be to incorporate the effects of facial gestures on speech. For instance, Hess, Scherer, and Kappas (1988) noted that voice quality is judged to be friendly over the phone when a person is smiling. A model that could cater for this would have extremely beneficial applications for recent work concerned with the synchronization of facial gestures and emotive speech in Talking Heads.

Finally, simulating emotion in synthetic speech not only has the potential to build more realistic speech synthesizers (and hence provide the benefits that such a system would offer) but will also add to our understanding of speech emotion itself.
3.4 Speech Markup Languages
Ideally, a text-to-speech synthesizer would be able to accept plain text as input and speak it in a manner comparable to a human: emphasizing important words, pausing for effect, and pronouncing foreign words correctly. Unfortunately, automatically processing and analyzing plain text is extremely difficult for a machine. Without extra information to accompany the words it is to speak, the speech synthesizer will not only sound unnatural, but intelligibility will also decrease. Therefore it is desirable to have an annotation scheme that allows direct control over the speech synthesizer's output.
Most research and commercial systems allow for such an annotation scheme, but almost all are synthesizer dependent, making it extremely difficult for software developers to build programs that can interface with any speech synthesizer. Recent moves by industry leaders to standardize a speech markup language have led to the draft specification of SABLE, a system-independent, SGML-based markup language (Sproat et al. 1998). The SABLE specification has evolved from three existing speech synthesis markup languages: SSML (Taylor and Isard 1997), STML (Sproat et al. 1997) and Java's JSML.
3.5 Extensible Markup Language (XML)
XML is the Extensible Markup Language created by the W3C, the World Wide Web Consortium (Extensible Markup Language 1998). It was specially designed to enable, on the World Wide Web, the use of the large-document management concepts embodied in SGML, the Standard Generalized Markup Language. In adopting SGML concepts, however, the aim was also to remove features of SGML that were either not needed for Web applications or were very difficult to implement (The XML FAQ 2000). The result was a simplified dialect of SGML that is relatively easy to learn, use and implement, and that at the same time retains much of the power of SGML (Bosak 1997).
It is important to note that XML is not a markup language in itself; rather, it is a meta-language: a language for describing other languages. XML therefore allows a user to specify the tag set and grammar of their own custom markup language that follows the XML specification.
3.5.1 XML Features
There are three significant features of XML that make it a very powerful meta-language (Bosak 1997):
1. Extensibility: new tags and their attribute names can be defined at will. Because the author of an XML document can mark up data using any number of custom tags, the document is able to effectively describe the data embodied within the tags. This is not the case with HTML, which uses a fixed tag set.
2. Structure: the structure of an XML document can be nested to any level of complexity, since it is the author that defines the tag set and grammar of the document.
3. Validation: if a tag set and grammar definition is provided (usually via a Document Type Definition (DTD)), then applications processing the XML document can perform structural validation to make sure it conforms to the grammar specification. So, though the nested structure of an XML document can be quite complex, the fact that it follows a very rigid guideline makes document processing relatively easy.
3.5.2 The XML Document
An XML document is a sequence of characters that contains markup (the tags that describe the text they encapsulate) and character data (the actual text being "marked up"). Figure 1 shows an example of a simple XML document.
Figure 1 - An XML document holding simple weather information.
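Figure 1 is not reproduced in this copy; a plausible reconstruction, with element names invented for illustration, is:

```xml
<?xml version="1.0"?>
<forecast>
  <date>October 30, 2000</date>
  <time>14:40</time>
  <outlook>Partly cloudy</outlook>
  <temperature>18</temperature>
</forecast>
```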
One of the main observations to be made about the example given in Figure 1 is that an XML document describes only the data, and not how it should be viewed. This is unlike HTML, which forces a specific view and does not provide a good mechanism for data description (Graham and Quinn 1999). For example, HTML tags such as P, DIV and TABLE describe how a browser is to display the encapsulated text, but are
inadequate for specifying whether the data describes an automotive part, a section of a patient's health record, or the price of a grocery item.
The fact that an XML document is encoded in plain text was a conscious decision made by the XML designers: part of designing a system-independent and vendor-independent solution (Bosak 1997). Although text files are usually larger than comparable binary formats, this can easily be compensated for using freely available utilities that compress files efficiently, both in terms of size and time. At worst, the disadvantages associated with an uncompressed plain-text file are deemed to be outweighed by the advantages of a universally understood and portable file format that does not require special software for encoding and decoding.
3.5.3 DTDs and Validation
The XML specification has very strict rules describing the syntax of an XML document: for instance, the characters allowable within the markup sections, how tags must encapsulate text, the handling of white space, and so on. These rigid rules make the tasks of parsing and dividing the document into sub-components much easier. A well-formed XML document is one that follows the syntax rules set in the XML specification. However, since its author determines the structure of the document, a mechanism must be provided that allows grammar checking to take place. XML does this through the Document Type Definition, or DTD.
A DTD file is written in XML's Declaration Syntax and contains the formal description of a document's grammar (The XML FAQ 2000). It defines, amongst other things, which tags can be used and where they can occur, the attributes within each tag, and how all the tags fit together.
Figure 2 gives a sample DTD section that describes two elements, list and item. The example declares that one or more item tags can occur within a list tag. Furthermore, an item tag may optionally have a type attribute.
Figure 2 - Sample section of a DTD file.
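Figure 2 itself is missing from this copy. A DTD fragment matching the description (one or more item elements inside a list, with an optional type attribute on item) would look like this:

```xml
<!ELEMENT list (item+)>
<!ELEMENT item (#PCDATA)>
<!ATTLIST item type CDATA #IMPLIED>
```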
Extending this example, the different levels of validation performed by an XML parser can be seen. Figure 3 shows an XML document that does not meet the syntax specified in the XML specification.
Figure 3 - XML syntax error: list and item tags incorrectly matched.
Figure 4 shows an XML document that is well formed (i.e. it follows the XML syntax) but that does not follow the grammar specified in the linked DTD file. (The DTD file is the one given in Figure 2.)
Figure 4 - Well-formed XML document that does not follow the grammar specification in the DTD file (an item tag occurs outside of a list tag).
Figure 5 shows a well-formed XML document that also meets the grammar specification given in the DTD file.
Figure 5 - Well-formed XML document that also follows the DTD grammar specification. Will not produce any parse errors.
The XML Recommendation states that any parse error detected while processing an XML document will immediately cause a fatal error (Extensible Markup Language 1998): the XML document will not be processed any further, and the application will
not attempt to second-guess the author's intent. Note that the DTD does NOT define how the data should be viewed either. Also, while a DTD defines which sub-elements can occur within an element, attribute order is never constrained, and element order is constrained only where a sequence content model is used. For this reason, an application processing an XML document should avoid being dependent on the order of given tags or attributes.
3.5.4 Document Object Model (DOM)
The Document Object Model (DOM) Level 1 Specification defines the Document Object Model as "a platform- and language-neutral interface that allows programs and scripts to dynamically access and update the content, structure and style of documents" (Document Object Model 2000). It provides a tree-based representation of an XML document, allowing the creation, manipulation and navigation of any part within the document. However, it is important to note that the DOM specification itself does not require that documents be implemented as a tree; it is only convenient that the logical structure of the document be described as a tree, given the hierarchical structure of marked-up documents. The DOM is therefore a programming API for documents that is truly "structurally neutral" as well.
Working with parts of the DOM is quite intuitive, since the object structure of the DOM very closely resembles the hierarchical structure of the document. For instance, the DOM shown in Figure 6b would represent the tagged text example in Figure 6a. Again, the hierarchical relationships are logical ones, defined in the programming, and are not representations of any particular internal structures (Document Object Model 2000).
Once a DOM tree is constructed, it can be modified easily by adding or deleting nodes and moving sub-trees. The new DOM tree can then be used to output a new XML document, since all the information required to do so is held within the DOM representation. A DOM tree will not be constructed until the XML document has been fully parsed and validated.
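The round trip just described (parse, navigate, modify, emit) can be sketched with Python's standard xml.dom.minidom. This stands in for the libxml C interface actually used in the project and is an illustration only:

```python
from xml.dom.minidom import parseString

doc = parseString("<list><item>Item 1</item></list>")

# Navigate: the object tree mirrors the document's hierarchy.
root = doc.documentElement                       # the <list> element
first = root.getElementsByTagName("item")[0]

# Modify: graft a new node into the tree.
new_item = doc.createElement("item")
new_item.appendChild(doc.createTextNode("Item 2"))
root.appendChild(new_item)

# Output: the modified tree holds everything needed to emit new XML.
print(root.toxml())  # -> <list><item>Item 1</item><item>Item 2</item></list>
```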
Figure 6 - DOM representation of XML example.
3.5.5 SAX Parsing
A downside to the DOM is that most XML parsers implementing the DOM keep the entire tree resident in memory. Apart from putting a strain on system resources, this also limits the size of the XML document that can be processed (Python/XML Howto 2000; libxml 2000). Also, if the application only needs, say, to search the XML document for occurrences of a particular word, it would be inefficient to construct a complete in-memory tree to do so.
[Figure 6a/6b content: the weather example (October 30, 2000; 14:40; Partly cloudy; 18) and its corresponding DOM tree.]
A SAX handler, on the other hand, can process very large documents, since it does not keep the entire document in memory during processing. SAX, the Simple API for XML, is a standard interface for event-based XML parsing (SAX 2.0 2000). Instead of building a structure representing the entire XML document, SAX reports parsing events (such as the start and end of tags) to the application through callbacks.
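As a sketch of this event-based style (using Python's standard xml.sax rather than the C libraries named in this chapter), a handler that merely counts item elements never materializes the tree:

```python
import xml.sax

class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> start-tags as the parser streams past them."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):  # callback fired for each open tag
        if name == "item":
            self.count += 1

handler = ItemCounter()
xml.sax.parseString(b"<list><item>a</item><item>b</item></list>", handler)
print(handler.count)  # -> 2
```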
3.5.6 Benefits of XML
The following benefits of using XML in applications have been identified (Microsoft 2000; SoftwareAG 2000b):
Simplicity: XML is easy to read, write and process, by both humans and computers.
Openness: XML is an open and extensible format that leverages other (open) standards such as SGML. XML is now a W3C Recommendation, which means it is a very stable technology. In addition, XML is highly supported by industry market leaders such as Microsoft, IBM, Sun and Netscape, in both developer tools and user applications.
Extensibility: data encoded in XML is not limited to a fixed tag set. This enables precise data description, greatly aiding data manipulators such as search engines to produce more meaningful searches.
Local computation and manipulation: once data in XML format is sent to the client, all processing can be done on the local machine. The XML DOM allows data manipulation through scripting and other programming languages.
Separation of data from presentation: this allows data to be written, read and sent in the best logical mode possible. Multiple views of the data are easily rendered, and the look and feel of XML documents can be changed through XSL style sheets; this means that the actual content of the document need not be changed.
Granular updates: the structure of XML documents allows granular updates to take place, since only modified elements need to be sent from the server to the client. This is currently a problem with HTML, since even with the slightest modification a page needs to be rebuilt. Granular updates will help reduce server workload.
Scalability: separation of data from presentation also allows authors to embed, within the structured data, procedural descriptions of how to produce different views. This offloads much of the user interaction from the server to the client.
Figure 7 - Talking Head.
Festival is a widely recognized research project developed at the Centre for Speech Technology Research (CSTR), University of Edinburgh, with the aim of offering a free, high-quality text-to-speech system for the advancement of research (Black, Taylor and Caley 1999). The MBROLA project, initiated by the TCTS Lab of the Faculté Polytechnique de Mons (Belgium), is a free multilingual speech synthesizer developed with aims similar to Festival's (MBROLA Project Homepage 2000, http://tcts.fpms.ac.be/synthesis/mbrola.html).
Figure 8 - Top-level outline showing how the Festival and MBROLA systems were used together.
It was decided for this project to use the Festival system as the natural language parser (NLP) component of the module, which accepts text as input and transcribes it to its phoneme equivalent, plus duration and pitch information. This information can then be given to the MBROLA synthesizer, acting as the digital signal processing (DSP) unit, which produces a waveform from it. Although Festival has its own DSP unit, it was found that the Festival plus MBROLA combination produces the best quality. It is important to note that the Festival system directly supports MBROLA.
Because of the phoneme-duration-pitch input format required for MBROLA, it provides very fine pitch and timing control for each phoneme in the utterance. As stated before, this level of control is simply unattainable with commercial systems except
DECtalk. The advantage of using MBROLA over DECtalk, however, lies in the fact that in the latter system, once a phoneme's pitch is altered, the generated pitch contour is overwritten. Cahn (1990) first mentioned this problem and, as a result, did not manipulate the utterance at the phoneme level, limiting the amount of control, which ultimately hindered the quality of the simulated emotion. To overcome this, Murray and Arnott (1995) had to write their own intonation model to replace the DECtalk-generated pitch contour when they changed pitch values at the phoneme level. Fortunately, this is not an issue with MBROLA, as changes to the pitch and duration levels can be made prior to passing them to MBROLA (as Figure 9 shows). The Festival system targets UNIX platforms, but its source code can be ported to the Win32 platform via relatively minor modifications. The MBROLA Homepage offers binaries for many platforms, including Win32, Linux, most UNIX OS versions, BeOS, Macintosh and more.
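The phoneme-duration-pitch input that MBROLA consumes (commonly a ".pho" file) lists one phoneme per line: the phoneme symbol, its duration in milliseconds, and optional (position-percent, pitch-in-Hz) pairs along the phoneme. The helper below is an illustrative sketch; the symbols and values are invented, not taken from the thesis:

```python
def pho_line(phoneme, duration_ms, pitch_points=()):
    """Format one MBROLA-style input line: symbol, duration, then
    (percent-of-duration, pitch-Hz) pairs, space-separated."""
    parts = [phoneme, str(duration_ms)]
    for percent, hz in pitch_points:
        parts.extend([str(percent), str(hz)])
    return " ".join(parts)

# A toy utterance; adjusting per-phoneme pitch targets like this is
# the kind of edit made before the data is handed to MBROLA.
lines = [
    pho_line("_", 50),                            # silence
    pho_line("h", 60, [(0, 120)]),
    pho_line("@", 110, [(50, 135), (100, 110)]),
]
print("\n".join(lines))
```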
Before the final decision was made to use the Festival system, however, an important issue required investigation. The previous TTS module of the Talking Head did not use the Festival system because, although it was acknowledged that Festival's output is of a very high quality, the computation time was deemed far too expensive for use in an interactive application (Crossman 1999). For example, the phrase "Hello everybody. This is the voice of a Talking Head. The Talking Head project consists of researchers from Curtin University and will create a 3D model of a human head that will answer questions inside a web browser." took about 45 seconds to synthesize on a Silicon Graphics Indy workstation (Crossman 2000). It is contended, however, that the negative impression of the Festival system that could be formed from such data may be misleading. Though execution may take longer on an SGI Indy workstation, informal testing on several standard PCs (Win32 and Linux platforms) showed that the same phrase took less than 5 seconds to synthesize (including the generation of a waveform). Since TTS processing is done on the server side, the system can easily be configured to ensure Festival carries out its processing on a faster machine. Therefore, Festival's synthesis time was not considered a problem.
3.6.2 XML Parser
Since it is expected that the program's input will contain marked-up text, an XML parser was required to parse and validate the input and create a DOM tree structure for easy processing. There are a number of freely available XML parsers, though many are still in the development stage and implement the XML specification to varying degrees. One of the more complete parsers is libxml, a freely available XML C library for Gnome (libxml 2000).
Using libxml as the XML parser fulfilled the needs of the project in a number of ways:
a) Portability: written in C, the library is highly portable. Along with the main program, it has been successfully ported to the Win32, Linux and IRIX platforms.
b) Small and simple: only a limited range of XML features is being used, so a complex parser was not required. This is not to say that libxml is a trivial library, as it offers some powerful features.
c) Efficiency: informal testing showed libxml parses large documents in surprisingly little time. Although not used for this project, libxml offers a SAX interface to allow for more memory-efficient parsing (see Section 3.5.5).
d) Free: libxml can be obtained cost-free and license-free.
It is important to note that the libxml library's DOM tree building feature was used to help create the required objects that hold the program's utterance information. However, care was taken to make sure the program's objects were not dependent on the XML parser being used. Instead, a wrapper class, CTTSSMLParser, used libxml as the XML parser and output a custom tree-like structure very similar to that of the DOM. This ensured that all other objects within the program used the custom structure and not the DOM tree that libxml outputs. (See Chapter 5 for more details.)
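The decoupling described here (parse with one library, then copy into a parser-independent tree) is a plain adapter pattern. A minimal sketch follows, with Python's minidom standing in for libxml; the class and field names are illustrative, not the thesis's actual CTTSSMLParser internals:

```python
from xml.dom.minidom import parseString

class SMLNode:
    """Parser-independent tree node: tag name (None for text nodes),
    attributes, text content, and children."""
    def __init__(self, name=None, text="", attrs=None):
        self.name, self.text = name, text
        self.attrs = attrs or {}
        self.children = []

def build_tree(dom_node):
    """Copy a DOM node into the custom structure; afterwards the DOM
    can be discarded and the rest of the program never sees it."""
    if dom_node.nodeType == dom_node.TEXT_NODE:
        return SMLNode(text=dom_node.data)
    node = SMLNode(name=dom_node.tagName,
                   attrs=dict(dom_node.attributes.items()))
    node.children = [build_tree(c) for c in dom_node.childNodes]
    return node

doc = parseString('<p><happy rate="fast">Hi!</happy></p>')
tree = build_tree(doc.documentElement)
print(tree.name, tree.children[0].name, tree.children[0].attrs)
```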
3.7 Summary
This chapter has explored research applicable to this project, focusing on how the literature can help with achieving the stated objectives and subproblems of Chapter 2 and with supporting the hypotheses of Chapter 4. More specifically, the literature was investigated to find the speech correlates of emotion, seeking clear definitions so that there was a solid base to work from during the implementation phase. The work of prominent researchers in the field of synthetic speech emotion, such as Murray and
Arnott (1995) and Cahn (1990), who have already attempted to simulate emotional speech, was sought in order to gain an understanding of the problems involved and the approach taken in solving them.
The in-depth review of XML served two purposes: a) to describe what XML is and what the technology is trying to address, and b) to expound the benefits of XML so as to justify why SML was designed to be XML-based. A resource review was given to discuss the issues involved when deciding which tools to use for the TTS module, and to address one of the subproblems stated in Section 2.3: that the TTS module should be able to run across the Win32, Linux and UNIX platforms.
Chapter 4
Research Methodology
The literature review of Chapter 3 enabled the formation of the hypotheses stated in this chapter. It also identified areas where limitations would apply and defined the scope of the project.
4.1
4.2 Limitations and Delimitations
4.2.1 Limitations
Two main limitations have been identified:
1. Vocal Parameters: the quality of the synthesized emotional speech will be limited by the ability of the vocal parameters to describe the various emotions. This is a reflection of the current level of understanding of speech emotion itself.
2. Speech Synthesizer Quality: the quality of the speech synthesizer, and the parameters it is able to handle, will also have a direct effect on the speech produced. For instance, most speech synthesizers are unable to change voice quality features (breathiness, intensity, etc.) without significantly affecting the intelligibility of the utterance.
4.2.2 Delimitations
The purpose of this research is to determine how well the vocal effects of emotion can be added to synthetic speech; it is not concerned with generating an emotional state for the Talking Head based on the words it is to speak. Therefore, the system will not know the required emotion to simulate from the input text alone. This top-level information will be provided through the use of explicit tags, hence the need for the implementation of a speech markup language.
Due to the strict time constraints placed on this project, the emotions to be simulated by the system were bounded to happiness, sadness and anger. These three emotions were chosen because of the wealth of study carried out on them (and hence an increased understanding) compared to other emotions. This is because happiness, sadness and anger (along with fear and grief) are often referred to as the "basic emotions" on which, it is believed, other emotions are built.
4.3 Research Methodologies
The following research methodologies of Mauch and Birch (1993) are applicable to this research:
Design and Demonstration. This is the standard methodology used for the design and implementation of software systems. The speech synthesis system is being demonstrated as the TTS module of a Talking Head.
Evaluation. The effectiveness of the system needed to be determined via listener questionnaires, testing how well the TTS module supports the stated hypotheses. Therefore, an evaluation research methodology was adopted.
Meta-Analysis. The project involves a number of diverse fields other than speech synthesis, namely psychology, paralinguistics and ethology. The meta-analysis research methodology was used to determine how well the speech emotion parameters described in these fields mapped to speech synthesis.
Chapter 5
Implementation
This chapter discusses the implementation of the TTS module to simulate emotional speech for a Talking Head, plus the stated subproblems of Section 2. The discussion covers how the module's input is processed and how the various emotional effects were implemented. This involves a description of the various structures and objects used by the TTS module. Since the module relies heavily on SML, the speech markup language designed and implemented to enable direct control over the module's output, the chapter discusses SML issues such as parsing and tag processing.
5.1 TTS Interface
Before an in-depth description of each of the TTS module's components is given, it will be beneficial to describe the inputs and outputs of the system. It was important to be able to describe the system as a very high-level black box, not only for clarity of design but also to ensure that the replacement of the existing TTS module of the FTH project would be a smooth one. It also minimizes module and tool interdependency. Figure 10 shows the black box design of the system as the TTS module of a Talking Head.
Figure 10 - Black box design of the system, shown as the TTS module of a Talking Head.
… of this detail, nor should it. What is important to describe at this level is the module's interface; how the module produces its output is irrelevant to the user of the module.
5.1.2 Module Inputs
Figure 10 shows text as the single input to the TTS module. However, 'text' can be a fairly ambiguous description of the input; indeed, the module caters for two distinct types of text: plain text, and text marked up in the TTS module's own custom Speech Markup Language (SML).
Plain Text
The simplest form of input, plain text means that the TTS module will endeavour to render the speech equivalent of all the input text. In other words, it will be assumed that no characters within the input represent directives for how to generate the speech. As a result, speech generated from plain text will have default speech parameters, spoken with neutral intonation.
SML Markup
If direct control over the TTS module's output is desired, then the text to be spoken can be marked up in SML, the custom markup language implemented for the module. Although an in-depth description of SML will not be given here (see Section 5.2 and Appendix A), it was designed to provide the user of the TTS module with the following abilities:
Direct control of speech production. For example, the system could be directed to speak at a certain speech rate or pitch, or to pronounce a particular word in a certain way (this is especially useful for foreign names).
Control over speaker properties. This gives the ability to control not only how the marked-up text is spoken, but also who is speaking. Speaker properties such as gender, age and voice can be dynamically changed within SML markup.
The effect of the speaker's emotion on speech. For example, the markup may specify that the speaker is sad for a portion of the text. As a result, the speech will sound sad. One of the primary objectives of this thesis is to determine how effective the simulated effect of emotion on the voice is.
… specification (MPEG 1999). The phoneme-to-viseme translation submodule is one of the few that were retained from the existing TTS module.
5.1.4 C/C++ API
Modules that call the TTS module do so through its C/C++ API.
TTS_SpeakText (const char *Message): same as CTTSCentral::SpeakText (const char *Filename).
TTS_SpeakTextEx (const char *Message, int Emotion): same as CTTSCentral::SpeakTextEx (const char *Message, int Emotion).
TTS_Destroy (): used to cleanly shut down the TTS module once it is no longer needed. The function is called once only.
5.2 SML: Speech Markup Language
In Section 2.2 it was identified that the design and implementation of a suitable markup language was required, so that the emotion of a text segment could be specified, as well as providing a means of manipulating other useful speech parameters. SML is the TTS module's XML-based Speech Markup Language, designed to meet these requirements. This section provides an overview of how an utterance should be marked up in SML. For a description of each SML tag with its associated attributes, see Appendix A. For issues regarding SML's implementation, see Section 5.4.
5.2.1 SML Markup Structure
An input file containing correct SML markup must contain an XML header declaration at the beginning of the file. Following the XML header, the sml tag encapsulates the entire marked-up text and can contain multiple p (paragraph) tags. Figure 11 shows the basic layout of an input file marked up in SML. Note that all the XML constraints discussed in Section 3.5 apply to SML.
Figure 11 - Top-level structure of an SML document.
[Figure 11 annotations: XML header; reference to SML v0.1 DTD; root tag; paragraphs.]
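Figure 11 is not reproduced in this copy. From its caption and annotations, an SML input file has roughly this top-level shape; the DTD filename below is a guess, not the thesis's actual reference:

```xml
<?xml version="1.0"?>
<!DOCTYPE sml SYSTEM "sml-v01.dtd">
<sml>
  <p> ... </p>
  <p> ... </p>
</sml>
```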
In turn, each p node can contain one or more emotion tags (sad, angry, happy and neutral) and instances of the embed tag; text not contained within an emotion tag is not allowed. For example, Figure 12 shows valid SML markup, while Figure 13 shows SML markup that is invalid because it does not follow this rule. Note that, unlike "lazy" HTML, the paragraph (p) tags must be closed properly.
Figure 12 - Valid SML markup.
Figure 13 - Invalid SML markup.
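Figures 12 and 13 are not reproduced in this copy. Given the rule just stated, they presumably contrasted markup along these lines (content invented for illustration):

```xml
<!-- Valid: all text sits inside an emotion tag -->
<p><happy>What a nice day!</happy></p>

<!-- Invalid: "Hello." is text directly inside p, outside any emotion tag -->
<p>Hello. <happy>What a nice day!</happy></p>
```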
All tags described in Appendix A can occur inside an emotion tag (except sml, p and embed). A limitation of SML is that emotion tags cannot occur within other emotion tags. However, unless explicitly specified, most other tags can contain even instances of tags with the same name. For example, a pitch tag can contain another pitch tag, as the following example shows.
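The example itself is missing from this copy. A plausible reconstruction of nested pitch tags follows; the attribute name and values are guesses, not taken from the thesis:

```xml
<p><neutral>
  <pitch middle="+20%">This is higher,
    <pitch middle="+20%">and this is higher still.</pitch>
  </pitch>
</neutral></p>
```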
5.3 TTS Module Subsystems Overview
Figure 14 - TTS module subsystems.
As Figure 14 shows, the design of the TTS module subsystems is centered on the SML Document object. The main steps for synthesizing the module's input text involve the creation, processing and output of the SML Document. This is broken down into the following tasks:
a) Parsing. The input text is parsed by the SML Parser, which creates an SML Document object. The SML Parser makes use of libxml.
b) Text-to-Phoneme Transcription. The Natural Language Parser (NLP) is responsible for transcribing the text into its phoneme equivalent, plus providing intonation information in the form of each phoneme's duration and pitch values. This information is given to the SML Document object and
stored within its internal structures. The NLP unit makes use of the Festival Speech Synthesis System.
c) SML Tag Processing. Any SML tags present in the input text are processed. This usually involves modifying the text or phonemes held within the SML Document.
d) Waveform Generation. The phoneme data held within the SML Document is given to the Digital Signal Processing (DSP) unit to generate a waveform. The DSP makes use of the MBROLA Synthesizer.
e) Viseme Generation. The Visual Module is responsible for transcribing the phonemes to their viseme equivalents. Again, the phoneme data is obtained from the SML Document. In this thesis, the Visual Module will not be discussed in any further detail, since it has reused much of the old TTS module's subroutines. Crossman (1999) provides a description of the phoneme-to-viseme translation process.
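Tasks (a) to (e) can be summarized as a pipeline. The sketch below is only a paraphrase of the flow in Figure 14, with every stage stubbed and every name invented; the real module delegates to libxml, Festival and MBROLA respectively:

```python
# Stubbed stages standing in for the real subsystems.
def parse_sml(text):        return {"text": text, "phonemes": []}       # (a) SML Parser
def add_phonemes(doc):      doc["phonemes"] = [("h", 60, 120), ("@", 110, 135)]  # (b) NLP
def process_tags(doc):      doc["phonemes"] = [(p, d * 2, f) for p, d, f in doc["phonemes"]]  # (c)
def generate_waveform(doc): return b"RIFF..."                           # (d) DSP, placeholder bytes
def generate_visemes(doc):  return [p for p, _, _ in doc["phonemes"]]   # (e) Visual Module

def synthesize(text):
    """Flow of Figure 14: parse, transcribe, tag-process, then emit waveform and visemes."""
    doc = parse_sml(text)
    add_phonemes(doc)
    process_tags(doc)
    return generate_waveform(doc), generate_visemes(doc)

wave, visemes = synthesize("<sml>...</sml>")
```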
5.4 SML Parser
The SML Parser, encapsulated in the CTTSSMLParser class, is responsible for parsing the module's text input to ensure both that it is a well-formed XML document and that its structure conforms to the grammar specification of the DTD. If the input is fully validated, then an SML Document object is created based on the input.
To perform full XML parsing on the input, the XML C library libxml is used. Apart from validating the input, libxml also constructs a DOM tree (described in Section 3.5.4) that represents the input's tag structure, should no parse errors occur. The SML Document object that is returned by the SML Parser follows the hierarchical structure of the DOM very closely. The parser therefore traverses the DOM and creates an SML Document containing nodes mirroring the DOM's structure. Once the SML Document has been constructed, the DOM is destroyed and the SML Document is returned.
It was mentioned in Section 5.1.2 that the TTS module is able to handle unknown tags present within the input markup. This is because the input is filtered to remove all unknown tags before any validation parsing is done by libxml. In doing so, the DOM tree that libxml creates does not hold any unknown tag nodes and, as a consequence, neither does the SML Document.
The TTS module keeps track of all SML tag names by keeping a special XML
document that holds SML tag information2. Filtering of the input is done by creating a
copy of the input file and copying across only those tags that are known. It is important that
this filtering process is carried out, because the input is envisaged to contain other, non-SML tags, such as those belonging to the FAML module. Figure 15 shows the filtering process.

Figure 15 – Filtering process of unknown tags.
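A minimal sketch of such a filter is shown below. It is an assumption-laden simplification: it recognizes tags by name alone, filters a string in memory rather than a copy of the input file, and does not handle self-closing tags. The known-tag set stands in for the names loaded from the module's tag-names document.

```cpp
#include <set>
#include <string>

// Strip tags whose names are not in the known set, keeping their character
// data. Known tags (including their attributes) are copied through verbatim.
std::string filterUnknownTags(const std::string& input,
                              const std::set<std::string>& knownTags) {
    std::string output;
    for (std::size_t i = 0; i < input.size(); ) {
        if (input[i] != '<') { output += input[i++]; continue; }
        std::size_t close = input.find('>', i);
        if (close == std::string::npos) { output += input.substr(i); break; }
        // Extract the tag name, skipping the '/' of a closing tag.
        std::size_t nameStart = i + 1;
        if (nameStart < close && input[nameStart] == '/') ++nameStart;
        std::size_t nameEnd = nameStart;
        while (nameEnd < close && input[nameEnd] != ' ') ++nameEnd;
        std::string name = input.substr(nameStart, nameEnd - nameStart);
        if (knownTags.count(name))
            output += input.substr(i, close - i + 1);  // keep known tag
        i = close + 1;                                 // unknown tag dropped
    }
    return output;
}
```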
5.5 SML Document

As the TTS module's subsystems diagram shows (Figure 14), the SML Document is at
the core of the TTS module. Its role is to store all information required for speech
synthesis to take place, such as word, phoneme, and intonation data. It also contains the full tag information that appears in the input; in fact, such is the depth of information held
that the SML markup could easily be recreated from the information held in the SML
Document. The tag data is used to control the manipulation of the text and phoneme
data. In this section we will describe the structure of the SML Document as well as the
various structures required to perform the above-mentioned role. Finally, the data held
within the SML Document is used to produce a waveform. The SML Document
object is encapsulated by the cTTSSMLDocument and cTTSSMLNode classes.
5.5.1 Tree Structure

In the last section it was mentioned that the structure of the SML Document matches
very closely that of the XML DOM. The SML Document consists of a hierarchy of
nodes that represent the information held in the input SML markup. The nodes therefore
hold markup information, attribute values, and character data. Figure 16 shows the high-level
structure of an SML Document that would be constructed for the accompanying
SML markup. Note how each node has a type that specifies what kind of node it is.

The hierarchical nature of the SML Document implies which text sections will be
rendered in what way – a parent will affect all its children. So for the example in Figure
16, the emph node will affect the phoneme data of its (one) child node, the text node
containing the text "too". The happy node will affect the phoneme data of all its (three)
children nodes, containing the text "That's not", "too", and "far away" respectively. Tags
that were specified with attribute values are represented by element nodes that point to
attribute information (this is not shown in Figure 16 for clarity purposes).

2 The XML document is called "tag-names.xml" and is held in the special TTS resource directory
"TTS_rc".

Figure 16 – SML Document structure for the SML markup given above.
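The parent-affects-children rule can be sketched as a traversal that gathers every text node beneath a tag node, so that the tag's handler can then modify the phoneme data of each one. The Node layout here is a hypothetical simplification of cTTSSMLNode.

```cpp
#include <memory>
#include <string>
#include <vector>

// A node is either an element (tag set, e.g. "happy") or a text node
// (tag empty, character data in text).
struct Node {
    std::string tag;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Collect every text node in the subtree rooted at n, in document order.
// A tag handler would then adjust the phoneme data of each collected node.
void collectTextNodes(Node& n, std::vector<Node*>& out) {
    if (n.tag.empty()) out.push_back(&n);
    for (auto& c : n.children) collectTextNodes(*c, out);
}
```

Applied to the example above, collecting from the happy node yields three text nodes ("That's not", "too", "far away"), while collecting from the emph node yields only "too".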
5.5.2 Utterance Structures

Each text node contains its own utterance information, which comprises word- and
phoneme-related data. The information is held in different layers:
1. Utterance level – the whole phrase held in that node. The
cTTSUtteranceInfo class is responsible for holding information at this
level.

2. Word level – the individual words of the utterance. The
cTTSWordInfo class is responsible for holding information at this level.

3. Phoneme level – the phonemes that make up the words. The
cTTSPhonemeInfo class is responsible for holding information at this
level.

4. Phoneme pitch level – the pitch values of the phonemes (phonemes
can have multiple pitch values). The cTTSPitchPatternPoint class is
responsible for holding information at this level.

The above-mentioned objects are organized within a text node as
follows:

• A text node contains one cTTSUtteranceInfo object.

• The cTTSUtteranceInfo object contains a list of cTTSWordInfo
objects that hold word information.

• In turn, each cTTSWordInfo object contains a list of
cTTSPhonemeInfo objects that hold phoneme information. A
cTTSPhonemeInfo object contains the actual phoneme and its duration (ms).

• Each cTTSPhonemeInfo object then contains a list of
cTTSPitchPatternPoint objects that hold pitch information for each
phoneme. A pitch point is characterized by a pitch value and a percentage value
indicating where the point occurs within the phoneme's duration.
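The containment described above can be sketched with plain structs. These are hypothetical simplifications of the cTTS* classes; field names and units are assumptions.

```cpp
#include <string>
#include <vector>

// Pitch point: percentage into the phoneme's duration, and a pitch value.
struct PitchPatternPoint { double percent; double pitch; };

struct PhonemeInfo {
    std::string phoneme;
    int durationMs;                              // phoneme duration (ms)
    std::vector<PitchPatternPoint> pitchPoints;  // multiple points allowed
};

struct WordInfo {
    std::string word;
    std::vector<PhonemeInfo> phonemes;
};

struct UtteranceInfo {
    std::vector<WordInfo> words;  // the whole phrase held in the text node
};
```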
Figure 17 – Utterance structures to hold the phrase "the moon". U = cTTSUtteranceInfo object,
W = cTTSWordInfo object, P = cTTSPhonemeInfo object, pp = cTTSPitchPatternPoint
object.
5.6 Natural Language Parser

As introduced in Section 5.3, the NLP (Natural Language Parser) module is responsible
for transcribing the text to be rendered as speech into its phoneme equivalent. It is also
responsible for generating intonation information by providing pitch and duration values
for each phoneme in the utterance. The goals this module sets out to achieve are non-trivial,
and it is not surprising that this stage takes by far the longest time of any of the
stages in the speech synthesis process. Dutoit (1997) gives an excellent discussion of the
problems the NLP unit of a speech synthesizer must overcome.

Since the phoneme transcription and the intonation information greatly affect the
quality of the synthesized speech, it was very important to have an NLP that would
produce high-quality output. As mentioned in Section 3.7.1, the Festival Speech
Synthesis System, which is able to generate output comparable to commercial speech
synthesizers, was chosen to provide these services.
5.6.1 Obtaining a Phoneme Transcription

As described in Section 5.5, each text node within the SML Document contains utterance
objects that will ultimately hold the node's word and phoneme information. One of the
intermediate steps in obtaining a phoneme transcription is to tokenize the input character
string into words. For example, the character string "On May 5, 1985, 1985 people
moved to Livingston" would be tokenized into the following words by Festival: "On May
fifth nineteen eighty five one thousand nine hundred and eighty five people moved to
Livingston". This illustrates the complexity of the input that Festival is able to handle,
which has a direct effect on user perception of the intelligence of the Talking Head.

To tokenize the contents of the SML Document, the tree is traversed and each text
node's content is individually given to Festival. Festival returns the tokens in the
character string, and these are stored as words in the corresponding node's utterance
object. Figure 18 shows how each node holds its own token information.

Figure 18 – Tokenization of a part of an SML Document.
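The per-node pass can be sketched as follows. A plain whitespace split stands in for the call to Festival, which in reality also expands numbers, dates, and currency symbols; the node layout is a hypothetical simplification.

```cpp
#include <sstream>
#include <string>
#include <vector>

// A text node and the word list that the tokenization pass fills in.
struct TextNode {
    std::string content;
    std::vector<std::string> words;
};

// Tokenize one node's content into its own utterance word list.
// (In the real module this content is handed to Festival instead.)
void tokenizeNode(TextNode& node) {
    std::istringstream in(node.content);
    std::string token;
    while (in >> token) node.words.push_back(token);
}
```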
Once word information is stored within each node's utterance object, phoneme data
can be generated for each word. Obtaining the actual phoneme data (including
intonation) is a more complex process, however. This is because an entire phrase should
be given to Festival in order for correct intonation to be generated. As an example,
consider the following SML markup (the corresponding nodes held in the SML
Document are shown in Figure 19):
<happy>
I wonder, <rate>you pronounced it</rate> <emph>tomato</emph>, did you not?
</happy>
Figure 19 – SML Document sub-tree representing the example SML markup.
If each text node's contents is given to Festival one at a time (i.e. first "I wonder,"
then "you pronounced it", and so forth), then, though Festival will be able to produce the
correct phonemes, it will not generate proper pitch and timing information for the
phonemes. This will result in an utterance whose words are pronounced properly but
which contains inappropriate intonation breaks that make the utterance sound unnatural.

An appropriate analogy to this would be if a person were shown a pack of cards with
words written on them, one at a time, and asked to read them out loud. The person, not
knowing what words will follow, will not know how to give the phrase an appropriate
intonation.

Now, if the same person is given a card that contains the entire sentence on it, then,
knowing what the phrase is saying, the person will read it out loud correctly. The
same approach was taken in the solution to this problem. Continuing the above example
will help in understanding how this is done:
• The SML Document is traversed until an emotion node is encountered. In
the example, traversal would stop at the happy node.

• The contents of its child text nodes are then concatenated to make one
phrase. So the contents of the four text nodes in Figure 19 would be
concatenated to form the phrase "I wonder, you pronounced it tomato, did you
not?" The phrase is stored in a temporary utterance object held in the happy
node.

• The phrase is given to Festival, and Festival generates the phoneme
transcription as well as intonation information.

• The entire phoneme data is stored in the happy node's temporary
utterance object.

• Because each text node already contains word information in its
utterance object, it is a simple process to disperse the phoneme data held in the
happy node amongst its children. The temporary utterance object in the happy
node is then destroyed.
If this procedure is followed, correct intonation is given to the utterance. Of
course, a limitation is that this does not solve the problem of an emotion change occurring in
mid-sentence. However, the algorithm makes the assumption that this will not occur
frequently and that, if it does, the intonation will not need to continue over emotion
boundaries, and a break is acceptable.
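The final dispersal step can be sketched as follows: given one phoneme result per word of the concatenated phrase, and the number of words each text node owns, the per-word results are dealt back out in document order. The data shapes are hypothetical simplifications of the utterance objects.

```cpp
#include <string>
#include <vector>

// Deal the per-word phoneme results for the whole phrase back out to the
// text nodes, using the word counts each node already recorded during
// tokenization. Each inner vector is one node's share, in order.
std::vector<std::vector<std::string>> disperse(
        const std::vector<std::string>& phraseWordPhonemes,
        const std::vector<int>& wordsPerNode) {
    std::vector<std::vector<std::string>> perNode;
    std::size_t pos = 0;
    for (int count : wordsPerNode) {
        perNode.emplace_back(phraseWordPhonemes.begin() + pos,
                             phraseWordPhonemes.begin() + pos + count);
        pos += count;
    }
    return perNode;
}
```

For the example phrase, the word counts per text node would be {2, 3, 1, 3} ("I wonder," / "you pronounced it" / "tomato" / "did you not?").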
5.6.2 Synthesizing in Sections

Speech synthesis can be a processor-intensive task, and it can take a significant amount of
time and memory to synthesize larger utterances. Finding any way to minimize the
waiting time is highly desirable, especially when the speech production is being waited
upon by an interactive Talking Head.

There was a concern that if a very large amount of SML markup were given to the
TTS module, the execution time would be unacceptable for someone communicating
with the Talking Head. To prevent this from occurring, a solution was implemented that
took advantage of the client/server architecture of the Talking Head.

Instead of the entire SML Document being synthesized in one go, smaller portions
(at the emotion node level) are synthesized one at a time on the server and sent to the
client. As the Talking Head on the client begins to "speak", the server synthesizes the
next emotion tag of the SML Document. By the time the Talking Head has finished
talking, the next utterance is ready to be "spoken". This way, the actual waiting time is
really only for the first utterance, and it is now dependent on the communication speed
between the server and the client, not the synthesis time of the whole document. Figure
20 represents a timeline of the example markup.

It should be noted that this section-oriented method of producing speech involves not
only the NLP submodule but also all the steps in the synthesis process after the creation
of the SML Document.
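A rough sketch of the timing argument, under the assumption that the server synthesizes sections sequentially while the client plays them: the client stalls only when the next section is not yet ready, so if each section synthesizes faster than the previous one plays, the perceived wait collapses to the first section's synthesis time. All times below are illustrative.

```cpp
#include <algorithm>
#include <vector>

// Waiting time when the whole document must be synthesized before playback.
int monolithicWaitMs(const std::vector<int>& synthMs) {
    int total = 0;
    for (int t : synthMs) total += t;
    return total;
}

// Waiting time when sections are synthesized on the server while the client
// plays earlier ones. The client stalls only if section i+1 is still being
// synthesized when section i finishes playing.
int pipelinedWaitMs(const std::vector<int>& synthMs,
                    const std::vector<int>& playMs) {
    int synthDone = 0;  // time the server finishes the current section
    int playClock = 0;  // time the client is ready to play the next section
    int waited = 0;
    for (std::size_t i = 0; i < synthMs.size(); ++i) {
        synthDone += synthMs[i];
        int start = std::max(playClock, synthDone);  // stall if not ready
        waited += start - playClock;
        playClock = start + playMs[i];
    }
    return waited;
}
```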
Figure 20 – Raw timeline showing server and client
execution when synthesizing the example SML markup above.
5.6.3 Portability Issues

To address the portability issue stated in Section 2.2, it was important that Festival be usable
over multiple platforms. Because the Festival system has been developed primarily
for the UNIX platform, compiling it for IRIX 6.3 was relatively straightforward. Similarly,
obtaining a Linux version of Festival was also effortless, since Linux RPMs (RedHat
Package Manager) containing precompiled Festival libraries are available. Although it has not been tested extensively on the Win32 platform, the Festival developers are
confident that the source code is platform-independent enough for Festival to compile on
Win32 machines without too many changes.

Despite this optimism, a considerable amount of the project's effort went into
realizing this objective. In fact, the changes made to the code were kept track of and, as the list
grew, a help document for compiling Festival with Microsoft Visual C++ was made
available at http://www.computing.edu.au/~stalloj/projects/honours/festival-help.html.
5.7 Implementation of Emotion Tags

Previous sections have dealt with describing the framework constructed to support the
main hypothesis; that is, to simulate the effect of emotion on speech. This section will
now discuss the implementation of SML's emotion tags which, when used to mark up text, cause the text to be rendered with the specified emotion.

As has already been stated in this thesis, the speech correlates of emotion needed to
be investigated in the literature for the main objectives to be met. Section 3.2 described
the findings of this research, and a table was constructed that describes the speech
correlates for four of the five so-called "basic" emotions: anger, happiness, sadness, and
fear (see Table 1). The table formed the basis for implementing the angry, happy, and
sad SML tags. For ease of reference, the contents of Table 1 for the anger, happiness,
and sadness emotions are shown again in the following table:
                 Anger                 Happiness           Sadness
Speech rate      Faster                Slightly faster     Slightly slower
Pitch average    Very much higher      Much higher         Slightly lower
Pitch range      Much wider            Much wider          Slightly narrower
Intensity        Higher                Higher              Lower
Pitch changes    Abrupt, downward,     Smooth, upward      Downward
                 directed contours     inflections         inflections
Voice quality    Breathy, chesty       Breathy, blaring    Resonant
                 tone
Articulation     Clipped               Slightly slurred    Slurred

Terms used by Murray and Arnott (1993).

Table 2 – Summary of human vocal emotion effects for anger, happiness, and sadness.
To implement the guidelines found in the literature on human speech emotion,
Murray and Arnott (1995) developed a number of prosodic rules for their HAMLET
system. The TTS module has adopted some of these rules, though slight modifications
were required. Also, other similar prosodic rules have been developed through personal
experimentation.
5.7.1 Sadness

Basic Speech Correlates

Following the literature-derived guidelines for the speech correlates of emotion shown in
Table 2, Table 3 shows the parameter values set for the SML sad tag. The values were
optimized for the TTS module and are given as percentage values relative to neutral
speech.

Parameter        Value (relative to neutral speech)
Speech rate      *+
Pitch average    *+
Pitch range      *-+
Volume           ./0

Table 3 – Speech correlate values implemented for sadness.

As a result of the above speech parameter changes, the speech is slower, lower in
tone, and more monotonic (the pitch range reduction gives a flatter intonation curve). The
volume is reduced for sadness so that the speaker talks more softly. (Implementation
details on how speech rate, volume, and pitch values are modified can be found in Section
5.8.)
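The parameter changes can be sketched as a single pass over the phoneme data. The percentages below are illustrative stand-ins, not the thesis's tuned values; the pitch range is narrowed by compressing each pitch value toward the utterance mean before the average is lowered.

```cpp
#include <vector>

struct Phoneme { int durationMs; double pitchHz; double volume; };

// Apply sad-style correlates: slower rate (longer durations), narrower
// pitch range, lower pitch average, reduced volume. All scale factors are
// hypothetical placeholders for the tuned values.
void applySadness(std::vector<Phoneme>& ph, double meanPitchHz) {
    const double rateScale   = 1.10;  // ~10% slower speech
    const double pitchScale  = 0.90;  // ~10% lower pitch average
    const double rangeScale  = 0.70;  // ~30% narrower pitch range
    const double volumeScale = 0.80;  // softer delivery
    for (Phoneme& p : ph) {
        p.durationMs = static_cast<int>(p.durationMs * rateScale);
        // Compress the excursion around the mean, then lower the mean.
        p.pitchHz = (meanPitchHz + (p.pitchHz - meanPitchHz) * rangeScale)
                    * pitchScale;
        p.volume *= volumeScale;
    }
}
```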
Prosodic Rules

The following rules, adopted from Murray and Arnott (1995), were deemed to be
necessary for the simulation of sadness. Some parameter values were slightly modified
to work best with the TTS module.

1. Eliminate abrupt changes in pitch between phonemes. The
phoneme data is scanned and, if any phoneme pair has a pitch difference of
greater than 10%, the lower of the two pitch values is increased by 5% of
the pitch range.

2. Add pauses after long words. If any word in the utterance contains
six or more phonemes, a slight pause (80 milliseconds) is inserted after
the word.

The following rules were developed specifically for the TTS module.

1. Lower the pitch of every word that occurs before a pause. Such
words are lowered by scanning the phoneme data in the particular word and
lowering the last vowel-sounding phoneme (and any consonant-sounding
phonemes that follow) by 15%. This has the effect of lowering the last
syllable.

2. Final lowering of utterance. The last syllable of the last word in
the utterance is lowered in pitch by 15%.
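The first adopted rule can be sketched as below. The text does not state what the 10% difference is measured relative to, so taking it relative to the higher pitch of each pair is an assumption made here.

```cpp
#include <algorithm>
#include <vector>

// Rule 1 for sadness: scan adjacent phoneme pitches and, where a pair
// differs by more than 10% (of the higher value, by assumption), raise the
// lower pitch by 5% of the overall pitch range.
void smoothAbruptPitchChanges(std::vector<double>& pitch, double pitchRange) {
    for (std::size_t i = 0; i + 1 < pitch.size(); ++i) {
        double hi = std::max(pitch[i], pitch[i + 1]);
        double lo = std::min(pitch[i], pitch[i + 1]);
        if (hi - lo > 0.10 * hi) {
            double bump = 0.05 * pitchRange;  // 5% of the pitch range
            (pitch[i] < pitch[i + 1] ? pitch[i] : pitch[i + 1]) += bump;
        }
    }
}
```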
5.7.2 Happiness

Utterances usually have a pitch drop in the final vowel and any following
consonants. This rule increases the pitch values of these phonemes by 15%,
hence reducing the size of the terminal pitch fall.
5.7.3 Anger

Basic Speech Correlates

Table 5 shows the param