
A modular holistic approach to prosody modelling for Standard Yorùbá speech synthesis

Ọdẹ́túnjí A. Ọdẹ́jọbí b,1, Shun Ha Sylvia Wong a,*, Anthony J. Beaumont a

a Computer Science, Aston University, Aston Triangle, Birmingham B4 7ET, UK
b Room 109, Computer Buildings, Computer Science and Engineering Department, Ọbáfẹ́mi Awólọ́wọ̀ University, Ilé-Ifẹ̀, Nigeria

Received 28 December 2005; received in revised form 3 May 2007; accepted 6 May 2007. Available online 23 May 2007.

Computer Speech and Language 22 (2008) 39–68
www.elsevier.com/locate/csl
doi:10.1016/j.csl.2007.05.002
0885-2308/$ - see front matter © 2007 Elsevier Ltd. All rights reserved.

Abstract

This paper presents a novel prosody model in the context of computer text-to-speech synthesis applications for tone languages. We have demonstrated its applicability using the Standard Yorùbá (SY) language. Our approach is motivated by the theory that abstract and realised forms of various prosody dimensions should be modelled within a modular and unified framework [Coleman, J.S., 1994. Polysyllabic words in the YorkTalk synthesis system. In: Keating, P.A. (Ed.), Phonological Structure and Forms: Papers in Laboratory Phonology III, Cambridge University Press, Cambridge, pp. 293–324]. We have implemented this framework using the Relational Tree (R-Tree) technique. R-Tree is a sophisticated data structure for representing a multi-dimensional waveform in the form of a tree.

The underlying assumption of this research is that it is possible to develop a practical prosody model by using appropriate computational tools and techniques which combine acoustic data with an encoding of the phonological and phonetic knowledge provided by experts. To implement the intonation dimension, fuzzy logic based rules were developed using speech data from native speakers of Yorùbá. The Fuzzy Decision Tree (FDT) and the Classification and Regression Tree (CART) techniques were tested in modelling the duration dimension. For practical reasons, we have selected the FDT for implementing the duration dimension of our prosody model.

To establish the effectiveness of our prosody model, we have also developed a Stem-ML prosody model for SY. We have performed both quantitative and qualitative evaluations on our implemented prosody models. The results suggest that, although the R-Tree model does not predict the numerical speech prosody data as accurately as the Stem-ML model, it produces synthetic speech prosody with better intelligibility and naturalness. The R-Tree model is particularly suitable for speech prosody modelling for languages with limited language resources and expertise, e.g. African languages. Furthermore, the R-Tree model is easy to implement, interpret and analyse.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Speech synthesis; Prosody modelling; Standard Yorùbá; Tone languages; Modular holistic model; Relational trees

* Corresponding author. Tel.: +44 (0) 121 204 3681; fax: +44 (0) 121 204 3473. E-mail addresses: [email protected] (O.A. Ọdẹ́jọbí), [email protected] (S.H.S. Wong), [email protected] (A.J. Beaumont).
1 Supported by the Commonwealth Scholarship Commission, UK.

1. Introduction

This work examines prosody modelling for the Standard Yorùbá (SY) language in the context of computer text-to-speech synthesis applications. Yorùbá is a typical tone language in which the lexical tones associated with syllables carry semantic connotations. In tone languages, a syllable can bear several tone types, each representing a different morpheme. In the context of speech technology research, Yorùbá has not been as widely studied as other tone languages such as Mandarin and Cantonese Chinese. This paper reports the first results of two complete prosody models for Yorùbá.

In Yorùbá, there are three lexical tones, denoted as High (H), Mid (M) and Low (L). A description of Yorùbá adequate for understanding this paper is presented in Ọdẹ́jọbí et al. (2006). In citation syllables, each Yorùbá tone has a clear fundamental frequency (f0) manifestation. In the context of continuous speech, however, the f0 signature of each tone is modified due to co-articulation and the well-known phenomenon of tone sandhi. The degree of change in the signature of a tone type is further affected by paralinguistic features such as emotion and rate of speaking. Although the tones are important in the perception of an SY utterance, the timing of the fundamental frequency contour as well as the loudness (or intensity) of the tonal signatures are also crucial to the prosodic quality of SY speech. Detailed studies on the phonological characteristics and prosodic properties of SY can be found in Owolabí (1998) and Connell and Ladd (1990), respectively.

In previous work, we have developed and implemented an intonation model (Ọdẹ́jọbí et al., 2006) and a duration model (Ọdẹ́jọbí et al., 2007) for modelling prosody in Yorùbá TTS. In the present study, we present the underlying methodology by which these individual models are integrated. Our prosody model adopts a modular holistic approach. As ours is the first work on Yorùbá TTS, no other prosody modelling results exist for Yorùbá TTS. In order to establish the performance and suitability of our approach to Yorùbá prosody modelling in the context of contemporary work, we have also developed a Stem-ML model for Yorùbá. This paper briefly describes this Stem-ML model and presents the quantitative and qualitative evaluation results of our modular holistic model and the Stem-ML model. In light of our results, we discuss the strengths and weaknesses of our modular holistic approach to prosody modelling in the context of TTS for African tone languages.

    2. Related work

The ultimate goal of this work is to produce a prosody model for Standard Yorùbá text-to-speech synthesis applications. Due to the limited language resources and expertise for Yorùbá, and the fact that there is no previous TTS work on the language, the prosody model should be simple in design, transparent in representation, and should require only limited language resources to implement. In our effort to find a suitable prosody model which meets these criteria, we have reviewed various current working prosody models for other languages, in particular tone languages. This section presents our findings and discusses the suitability of these approaches for our intended task.

There are two general approaches to prosody modelling in the context of speech synthesis technology: (i) data-driven and (ii) rule-driven. The aim of the data-driven approach is to construct a model from a large speech database. The fundamental idea underlying the data-driven approach is to automatically derive the relationship between input linguistic labels and numerical representations of prosodic features by using machine learning techniques. These techniques include statistical methods such as Artificial Neural Networks (ANN) (Burrows, 1996; Riedi, 1995; Fackrell et al., 1999; Vainio, 2001; Lin et al., 2004), Classification and Regression Trees (CART) (Fackrell et al., 1999), and probabilistic methods such as the Hidden Markov Model (HMM) (Tokuda et al., 2000, 2002) and Bayesian networks (Goubanova and Taylor, 2000).

An advantage of the data-driven approach is that it requires little knowledge about the theory of speech sound. Also, it can produce good quality synthetic speech prosody with relatively little development effort. The primary weakness of the data-driven approach is that it requires a large speech database which must be specially prepared, e.g. through annotation. In the context of the present research, this approach is not practical because, due to the relatively limited language resources for Yorùbá, the necessary speech database is not available and the necessary expertise in Yorùbá linguistics and phonetics required for developing such a

    database is scarce. Furthermore, data-driven approaches have weak generalisation ability, especially when

they are trained using corpora with a limited scope or size. It is also difficult to adapt data-driven prosody models to new contexts, e.g. the synthesis of speech for a different mode or speaker. In addition, the resulting model from a data-driven approach is not transparent and, therefore, cannot be used for explaining the phenomena captured in the model. Such an explanation is particularly important when there is a need to analyse the prosody model so as to better understand how its elements contribute to the realisation of natural speech. Such an understanding helps to further improve an existing model. These weaknesses of the data-driven approach make it unsuitable for the purpose of this work.

The rule-driven approach models speech prosody based on a theory of speech, e.g. tone phonology or articulatory phonology. The aim of such methods is to encode expert knowledge about speech generation by using mathematical or computational models. These models are usually developed around the intonation or fundamental frequency dimension, and the other dimensions of speech prosody are incorporated, usually implicitly.

There are two well-known classes of intonation models, namely: (i) the superposition model and (ii) the linear alignment model. Superposition models include those proposed by Gårding (1983), Fujisaki and Hirose (1982) and Thorsen (1986), while the intonation model proposed by Pierrehumbert (1981) is classified as a linear alignment model. Both Gårding and Fujisaki adopted the numerical model by Öhman (1967) as a basic component, but Gårding's model is qualitative while Fujisaki's is quantitative.

The Fujisaki model has been applied to a number of languages, e.g. Japanese (Fujisaki and Hirose, 1982), German (Möbius et al., 1993) and other languages (Fujisaki et al., 1998). It has also been applied successfully to the modelling of Mandarin Chinese speech prosody (Fujisaki et al., 2000, 2005). In its application to tone languages, the phrase commands were used to model intonation at the phrase level while the accent commands were used to model tones at the syllable level (Fujisaki et al., 2004). The phrase intonations are superimposed on sequences of the accent tones, and the interactions between the two levels, together with the base frequency, produce the final f0 contour.
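To make this superpositional mechanism concrete, the sketch below generates an f0 contour from phrase and accent (tone) commands using the standard Fujisaki filter equations; the command timings and amplitudes here are illustrative values, not fitted ones.

```python
import numpy as np

def phrase_response(t, alpha=2.0):
    """Impulse response of the phrase control mechanism: Gp(t) = a^2 t e^(-a t) for t >= 0."""
    return np.where(t >= 0, alpha**2 * t * np.exp(-alpha * np.maximum(t, 0)), 0.0)

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent/tone control mechanism:
    Ga(t) = min(1 - (1 + b t) e^(-b t), gamma) for t >= 0."""
    tp = np.maximum(t, 0)
    g = np.where(t >= 0, 1.0 - (1.0 + beta * tp) * np.exp(-beta * tp), 0.0)
    return np.minimum(g, gamma)

def fujisaki_f0(t, fb, phrase_cmds, accent_cmds):
    """ln F0(t) = ln Fb + sum Ap Gp(t - T0) + sum Aa [Ga(t - T1) - Ga(t - T2)]."""
    ln_f0 = np.full_like(t, np.log(fb))
    for t0, ap in phrase_cmds:                 # (onset time, amplitude)
        ln_f0 += ap * phrase_response(t - t0)
    for t1, t2, aa in accent_cmds:             # rectangular command active on [t1, t2]
        ln_f0 += aa * (accent_response(t - t1) - accent_response(t - t2))
    return np.exp(ln_f0)

t = np.linspace(0.0, 2.0, 400)
f0 = fujisaki_f0(t, fb=105.0,
                 phrase_cmds=[(0.0, 0.5)],                        # utterance-level command
                 accent_cmds=[(0.1, 0.3, 0.4), (0.5, 0.7, 0.2)])  # syllable-level tone commands
```

In a tone-language setting, each syllable contributes its own accent command, which is exactly where the rectangular-command limitation discussed below arises.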

An important feature of the Fujisaki model is that it allows some analytical separation of the model components. This helps to decide under which conditions and to what extent the concrete shape of the f0 contour is determined by linguistic features, such as the lexical tone, and non-linguistic factors, such as intrinsic and co-articulatory f0 variations and speaker characteristics. This model is more flexible than the linear model in that it can describe the amount of downstepping in an utterance, and the reset at syntactic boundaries can be controlled.

A limitation of the Fujisaki model is that the relationship between the f0 parameters and the linguistic description of intonation is not transparent. Moreover, although the timing of accent commands can, in principle, be systematically varied, the model does not facilitate varying the shape of the tonal elements. This is because the shape of a tonal element depends on the shape of the underlying accent command, which is rectangular. Since the f0 curve can have a non-rectangular shape, and its shape is important to the perceptual quality of the generated speech, this limitation reduces the accuracy of the Fujisaki model in capturing tone language prosody in the context of TTS applications.

The Fujisaki model uses a single sentence component of the response filter to model sentence intonation. This approach is inadequate for modelling Yorùbá sentence intonation because the sentence command generates a slow melodic movement, which makes intonation events such as H-rising and L-lowering difficult to model. A way to incorporate such events into the Fujisaki model is to add additional accent or phrase commands. However, adding such commands can result in the creation of intonation segments which may not correspond to any intonation event.

Another important modelling technique that has been used to model aspects of tone language prosody is the Soft template Mark-up Language (Stem-ML) developed by Kochanski and Shih (2003). The Stem-ML model incorporates a tagging system which generates fundamental frequency contours from a set of mathematically defined mark-up tags. The tags include stress tags for local tone shapes and step and slope tags for phrase curves. The Stem-ML model has been applied to Mandarin (Shih and Kochanski, 2000) and Cantonese (Lee et al., 2002). Like the Fujisaki model, the Stem-ML model derives its theoretical basis from articulatory phonology.

Basically, Stem-ML models the muscle dynamics and the planning process that control the tension of the vocal folds, which results in the f0 contour of the generated speech. The resulting soft template for each syllable is a compromise between articulation effort and communication accuracy. The Stem-ML model is able to describe the interactions between nearby tones, making it possible to capture surface tonal variations using highly constrained models with only one template for each lexical tone category, and a single prosodic strength per word (Kochanski et al., 2003b). The strength can also correlate with other prosody features such as syllable duration, mutual information and part-of-speech.

Both the Fujisaki model and the Stem-ML model share two important components: (i) a component that describes the local attributes of prosody, e.g. local f0 movement in the case of the fundamental frequency dimension, and (ii) a component which describes a global approximation to a prosody attribute, e.g. the shape or slope of the f0 contour spanning a phrase or an utterance. These models place little emphasis on perceptual information, which limits their capability to produce speech output of high naturalness (Klatt, 1987).

    3. An overview of our prosody modelling

Based on the weaknesses of the models discussed above, we decided not to use a particular intonation model, but to stylise and standardise the f0 curves of each SY tone with the aim of concatenating them according to SY phonological rules. We assume that prosody can be modelled as the superposition of independent multi-parameter waveforms which belong to hierarchical linguistic levels (Morlec et al., 2001). The prototypical structure, and the movements of each syllable prosody parameter within this structure, are used dynamically to generate the structure of longer linguistic units under the control of phonological rules. In essence, each syllable participates in the encoding of the abstract linguistic representation as well as the phonetic realisation of the various dimensions of speech prosody.

Our approach is motivated by the fact that the information about speech prosody, e.g. f0 variation and duration, that is likely to be relevant to the perception of tones is generally preserved in the recorded speech signal. This information can be captured symbolically and used to represent the pattern of each tone. Linguistically based heuristics can then be applied to generate the abstract phonological structure of an utterance from the tonal elements. The acoustic signal is generated from the phonological structure via computational mechanisms that are modelled to mimic human expert knowledge in the context of Artificial Intelligence (AI).

There are two goals that must be achieved for our prosody modelling technique to be successful. First, based on the specific properties of the component syllables, our modelling approach must account for the holistic properties of the prosodic units that are larger than the syllable. Second, we need to coordinate the temporal events at various linguistic (i.e. syllable and word) levels as well as at the syntactic boundaries within an utterance. Conceptually, the first goal subsumes the second. To achieve these goals, we need to be able first to adequately model the characteristics of each prosodic dimension, and then to combine the results of such modelling systematically to form the complete prosody model. Hence, we have developed a holistic framework within which linguistic entities can be overlapped or concatenated. Each prosodic dimension of speech can be modelled using different modelling techniques, and the results of the modelling can be combined effectively. We therefore call our approach a modular holistic approach to speech prosody modelling.

We implement such a holistic view in a tree-based framework using the Relational Tree (R-Tree) technique (Ehrich and Foith, 1976). R-Tree is a technique for representing a waveform using a binary tree data structure. In this representation, a succession of peaks and valleys on the waveform is represented as elements in the nodes of the tree. The self-embedding structure of the tree represents the relative heights and depths of the peaks and valleys of the spatial structure of the waveform.

The construction of an R-Tree involves generating a Skeletal Tree (S-Tree) which represents the abstract structure of the waveform. In our model, the S-Tree for the intonation contour of an utterance is generated using tone phonological rules. The dimensions of the perceptually significant points, corresponding to the peaks and valleys on the S-Tree, are then computed to synthesise the prosody of the target utterance. A complete R-Tree, therefore, contains nodes representing all the phonologically significant peaks and valleys on a waveform as well as their numerical values in modelling each prosodic dimension.

4. The S-Tree generation process

To generate the S-Tree for an utterance, we applied the following steps:

(1) Stylisation: Approximate the f0 curve of each tone by a simpler function using a stylisation technique (d'Alessandro and Mertens, 1995).
(2) Standardisation: Represent each stylised tone type in terms of a peak and a valley (Collier, 1990).
(3) S-Tree generation: Derive an algorithm for computing successive peaks and valleys from a sequence of tones based on the tone phonology of the language.

We have chosen to implement our stylisation from first principles. By this we mean we directly approximate the f0 data of the voiced portion of a syllable using various interpolation polynomials. These approximations are then subjected to perceptual evaluation in order to determine the one that produces the best speech quality while at the same time facilitating a transparent representation of the f0 curve. The assessment of perceptual relevance is motivated by the IPO-like automatic stylisation method (d'Alessandro and Mertens, 1995) as well as other work on stylisation. We found that a third-degree polynomial is the most appropriate interpolation function for the f0 curves of the three SY tones (Ọdẹ́jọbí et al., 2004).
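As an illustration of this stylisation step, the following sketch fits a third-degree polynomial to the f0 samples of one voiced region and reads off its turning points as candidate peak/valley locations; the sample values are invented for illustration.

```python
import numpy as np

# Hypothetical f0 samples (Hz) over the voiced portion of one syllable.
times = np.linspace(0.0, 0.18, 10)  # seconds
f0 = np.array([118, 121, 124, 126, 127, 126, 123, 119, 114, 110], float)

coeffs = np.polyfit(times, f0, deg=3)   # third-degree interpolation polynomial
stylised = np.poly1d(coeffs)

# Turning points of the stylised curve give candidate peaks and valleys:
roots = stylised.deriv().roots
roots = roots[np.isreal(roots)].real
roots = roots[(roots > times[0]) & (roots < times[-1])]
print([(round(float(t), 3), round(float(stylised(t)), 1)) for t in roots])
```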

For the standardisation, the peaks and valleys of the stylised f0 curve are taken to be the locations of the most important phonological events ('t Hart, 1991). The standardised f0 curve for each tone is a symbolic representation of the relative magnitude of the peaks and valleys on the underlying f0 curve. However, simply concatenating these standardised f0 curves will not result in the appropriate intonation pattern, because the realisation of intonation is governed by phonological rules. As a result, the computed shape of each tone in the intonation contour may differ from the standardised f0 curve. We used tone phonological rules to predict the structure and the course of the intonation pattern. In essence, the tone phonological rules are used to generate the skeletal tree (S-Tree).

In a phonetic study of SY tones on VCV nouns, Hombert (1976) documented the approximate shape of the f0 curves for six tone patterns, as shown in Fig. 1. These findings show that the tonal signatures of f0 curves can be linked using phonological rules, and that they can be used to predict the intonation pattern of the utterance corresponding to a sequence of tones. Consequently, an algorithm can be derived to generate the intonation pattern using the standardised f0 curves of tones discussed above. The peaks and valleys of a sequence of such tones are the basic input to the algorithm. We therefore proposed an algorithm, formulated around the notion of a Skeletal Tree (S-Tree), which continuously determines the highest peaks and lowest valleys in the intonation pattern and assigns them to nodes in the S-Tree.

[Fig. 1. Graphical representations of co-articulated tones according to Hombert (1976): six panels showing f0 (Hz, 90-115) over time for the tone pairs LL, LM, LH, ML, MM and MH.]

Based on the various phonological rules and heuristics described above, we used the following template for specifying the phonological rules: A ⇒ B/C. This rule template is interpreted as: tone A is realised as tone B in the context C. For example, the rule T1 ⇒ T3/T2_ means that the tone T1 is realised as tone T3 if it is preceded by tone T2. T2_ specifies the context, and the underscore indicates the position occupied by the tone on the left-hand side of the rule.

Using the rule template, we specify the SY phonological tone rules as shown in Table 1. Rule set (i) in Table 1 can be interpreted as: the f0 curve of any tone that is not preceded by a tone remains unmodified. This means that the first tone in an utterance, or an isolated syllable, retains its tonal property. Rule 3 in rule set (iii), i.e. L ⇒ Low(L)/L_, is interpreted as: the pitch of a syllable carrying a low tone will be further lowered if it is preceded by another syllable carrying a low tone.
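Because every rule in Table 1 conditions only on the immediately preceding tone, the whole rule set reduces to a lookup keyed on (tone, preceding tone). A minimal sketch of that reading, with the blank context U keyed as None:

```python
# Realisation of a tone given its left context (after Table 1):
# (tone, preceding tone) -> realisation label
RULES = {
    ("H", None): "H",              ("M", None): "M",               ("L", None): "L",
    ("H", "L"): "High(H)",         ("H", "M"): "SlightlyHigh(H)",  ("H", "H"): "Low(H)",
    ("L", "H"): "VeryLow(L)",      ("L", "M"): "SlightlyHigh(L)",  ("L", "L"): "Low(L)",
    ("M", "L"): "SlightlyHigh(M)", ("M", "H"): "SlightlyLow(M)",   ("M", "M"): "M",
}

def realise(tones):
    """Apply the A => B/C_ rules left to right over a tone sequence."""
    prev = None
    out = []
    for tone in tones:
        out.append(RULES[(tone, prev)])
        prev = tone
    return out

print(realise(["H", "L", "H", "H"]))  # ['H', 'VeryLow(L)', 'High(H)', 'Low(H)']
```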

For each phrase in an utterance, we generate an appropriate S-Tree. An S-Tree describes the underlying abstract waveform in terms of its peaks and valleys. An S-Tree is generated in such a way that the root and the internal nodes of the tree represent the valleys on the underlying waveform and the leaf nodes represent the peaks. The following assumptions are made in deriving the S-Tree generation algorithm:

(1) Each tone has exactly one peak and one valley on its stylised f0 curve.
(2) Any two peaks in an intonation waveform must have a valley between them.
(3) The peaks and the valleys correspond to the turning points on the intonation waveform.

    4.1. Skeletal tree generation for SY sentences

In the S-Tree generation algorithm, the lowest valley, as determined by the phonological rules of the language, corresponds to the root node of the S-Tree. The underlying principle is that lower valleys appear towards the right of the root node while higher valleys appear to the left of the root node. Each leaf node corresponds to a peak, with the highest peak positioned at the left-most leaf node of the tree and the lowest peak positioned at the right-most leaf.

As an illustration, let us take the example phonological rules in Fig. 2a. These rules specify that if two tones of the same type are adjacent to each other in an utterance, the second tone is realised at a lower f0 peak than the first. Generally, H tones will have higher peaks than M tones which, in turn, have higher peaks than L tones.

Table 1. Phonological rules for SY tone interaction

Rule set | Rule number | Rule specification
i   | 1 | H ⇒ H/U_
i   | 2 | M ⇒ M/U_
i   | 3 | L ⇒ L/U_
ii  | 1 | H ⇒ High(H)/L_
ii  | 2 | H ⇒ SlightlyHigh(H)/M_
ii  | 3 | H ⇒ Low(H)/H_
iii | 1 | L ⇒ VeryLow(L)/H_
iii | 2 | L ⇒ SlightlyHigh(L)/M_
iii | 3 | L ⇒ Low(L)/L_
iv  | 1 | M ⇒ SlightlyHigh(M)/L_
iv  | 2 | M ⇒ SlightlyLow(M)/H_
iv  | 3 | M ⇒ M/M_

Where U denotes a blank (no preceding tone).

[Fig. 2. Illustration of S-Tree generation for the tone sequence HLHH. (a) Example phonological rules: H ⇒ Low(H)/H_, L ⇒ Low(L)/L_, M ⇒ Low(M)/M_. (b) The tone sequence H L H H with its peaks P1-P4 and valleys V1-V4; note that Peak(H) > Peak(M) > Peak(L), so P1 > P3 > P4 > P2 and V2 is the deepest valley. (c)-(h) Successive stages of the S-Tree and its abstract waveform.]

Consider an utterance made up of four syllables with the tone sequence HLHH, as shown in Fig. 2b. Based on the phonological rules, we can generate the first level of the S-Tree in Fig. 2c by noting that the deepest valley in the utterance will be associated with the L tone, i.e. V2. Using the phonological rules, the highest peaks to the left and right of V2 are P1 and P3, respectively. The double circle enclosing P1 and P3 in Fig. 2c indicates that they still dominate other peaks in the waveform representation. The abstract waveform for this partial S-Tree representation is shown in Fig. 2d.

On the portion of the waveform dominated by P3, the next lowest valley is V4. This valley is bounded to the left by P3 and to the right by P4 (cf. Fig. 2e). The abstract waveform for this partial S-Tree representation is shown in Fig. 2f. On the portion of the waveform dominated by P1, the next lowest valley is V3. This valley is bounded to the left by P1 and to the right by P2 (cf. Fig. 2g). The abstract waveform for the final S-Tree representation is shown in Fig. 2h.

The process of determining peaks and valleys discussed above is applied recursively until all the peaks and valleys on the waveform have been represented as nodes on the S-Tree. The main pseudocode for the S-Tree generation algorithm is shown in Fig. 13 of Appendix B. The subroutine for building the S-Tree, expressed as a recursive algorithm, is shown in Fig. 14 of Appendix B. The subroutines for computing the positions of the lowest (or deepest) valleys and highest peaks within a sequence of tones are shown in Figs. 15 and 16 of Appendix B, respectively.
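Appendix B is not reproduced in this transcript, so the following is a reconstruction of that recursion from the prose alone: the deepest valley becomes the (sub)root, the waveform is split around it, and the subtree containing the phonologically dominant peak is attached as the left child. The depth and height rankings stand in for the phonological comparisons, and the labels are illustrative.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: str                        # e.g. "V2" (valley) or "P3" (peak)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_stree(peaks, valleys, depth, height):
    """peaks/valleys in waveform order, with valleys[i] between peaks[i] and peaks[i+1];
    depth ranks valleys (larger = deeper), height ranks peaks (larger = higher)."""
    if len(peaks) == 1:               # a lone peak becomes a leaf
        return Node(peaks[0])
    i = max(range(len(valleys)), key=lambda k: depth[valleys[k]])
    root = Node(valleys[i])           # the deepest valley is the (sub)root
    before = build_stree(peaks[:i + 1], valleys[:i], depth, height)
    after = build_stree(peaks[i + 1:], valleys[i + 1:], depth, height)
    # the side holding the phonologically dominant peak becomes the left child
    dominant_first = max(height[p] for p in peaks[:i + 1]) >= \
                     max(height[p] for p in peaks[i + 1:])
    root.left, root.right = (before, after) if dominant_first else (after, before)
    return root

# The HLHH example of Fig. 2: V2 (after the L tone) is deepest, P1 is highest.
tree = build_stree(["P1", "P2", "P3", "P4"], ["V1", "V2", "V3"],
                   depth={"V1": 2, "V2": 3, "V3": 1},
                   height={"P1": 4, "P2": 1, "P3": 3, "P4": 2})
print(tree.label)  # V2, the deepest valley, becomes the root
```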

To generate the R-Tree for a multi-phrase sentence, we use a composite approach. Consider a sentence made up of two phrases, Pr1 and Pr2. If we apply the idea that the f0 contour generally follows a downtrend pattern, as suggested by Viana et al. (2003), we can assume that the highest peak in a two-phrase utterance will be located in the first phrase, i.e. Pr1, while the lowest valley will be located in the second phrase, i.e. Pr2. If Vj and Vk are the deepest valleys in Pr1 and Pr2, respectively, then Vk will be a lower valley than Vj.

In the S-Tree generation algorithm, the underlying principle is that lower valleys appear towards the right and higher valleys appear to the left of the root of the tree. Following this principle, we can represent the two valleys Vj and Vk using the configuration in Fig. 3a. This configuration shows that the valley Vj is dominated by Vk, thus making Vk the root of the new S-Tree. The S-Trees for Pr1 and Pr2 can then take their roots from the vertices Vj and Vk, respectively.

[Fig. 3. Configuration of S-Tree for a multi-phrase sentence. (a) Two-phrase valley configuration, with Vk as root dominating Vj, and (b) three-phrase valley configuration, with Vk as root dominating Vi and Vj.]

    4.2. An illustration

We illustrate the S-Tree generation algorithm using the Yorùbá sentence Ọ̀dọ̀ mi lódé, Kó tó lọ (meaning "He came to my place before leaving."). This sentence contains two phrases and six words. The first and third words are made up of two syllables each; the other words are mono-syllabic. This makes a total of eight syllables with the tone sequence (LLMHH, HHM). The transcript of the S-Tree generation process for the sentence is shown in Fig. 4. In the first step of the algorithm, the tone of each syllable is associated with a peak (see Fig. 4a). Adjacent peaks are then paired from left to right and a valley is associated with each pair. The lowest valley and the highest peak are then determined and assigned to nodes on the tree by applying the phonological rules in a recursive manner.

Applying the algorithm discussed in Section 4.1 to the first phrase of the sentence, i.e. Ọ̀dọ̀ mi lódé, with tone pattern LLMHH, the lowest valley will fall on the second syllable, i.e. dọ̀. This valley is V12, and the highest peak to its left will be associated with the first syllable of the phrase, i.e. Ọ̀, which is P11. The highest peak to the right of V12 is associated with the fourth syllable in the phrase, i.e. lo, being the first syllable with an H tone. V12 is, therefore, the root node of the S-Tree representing this phrase. Following our algorithm, P14 is assigned to the left node of the root because it is the phonologically dominant peak. P11 is assigned to the right node of the tree.

On the part of the tree dominated by P11, the lowest valley is V11. This valley is bounded by P11 and P12, associated with the first and second syllables, respectively. Applying this approach to the other syllables in the phrase, the subtree with root at V12 is generated for the first phrase. Following the same approach, the tree with root at V22 is generated for the second phrase of the sentence, i.e. Kó tó lọ, which has the tone sequence HHM. These two trees are combined using the assumption that the lowest valleys are associated with the last phrase in a sentence. This makes V22 the overall root of the S-Tree for the sentence.

[Fig. 4. Transcript of S-Tree generation for the sentence Ọ̀dọ̀ mi lódé, Kó tó lọ. (a) Association of peaks and valleys with syllables, (b) skeletal tree and (c) abstract waveform.]

The S-Tree generated at the end of this process represents the peaks and valleys on the intonation waveform as well as the self-embedding structure of the waveform. This tree depicts the abstract structure of the intonation contour, as shown in Fig. 4c. Note that peaks on the S-Tree are arranged in a relative order of magnitude, and their subscripts indicate their positions on the waveform. For example, peak P11 occurs before P12, P12 occurs before P13, etc.

It can be observed that the first phrase, i.e. Ọ̀dọ̀ mi lódé, shows a general upward pitch pattern based on the LLMHH tone sequence in the phrase. The second phrase, i.e. Kó tó lọ, shows a general downward trend as a result of the HHM tone sequence in the phrase. At the end of the S-Tree generation process, the resulting S-Tree depicts the shape and structure of the intonation contour (cf. Fig. 4c).

    5. The prosody dimension realisation

The next step in our prosody modelling technique is the computation of the prosody dimensions corresponding to the abstract waveform represented by the S-Tree. In this stage of the R-Tree based prosody modelling, therefore, we compute the numerical values of perceptually significant transitions on the selected dimension. While the S-Tree generation algorithm is linguistically driven, the computation of the numerical values for each prosodic dimension is generic, in that it can be data-driven or rule-driven. The present work focuses on the realisation of the intonation and duration dimensions only. The realisation of the intensity dimension has not yet been incorporated into our model.

As the focus of this paper is on our modular holistic approach to prosody modelling, and we have previously reported on our modelling of the intonation dimension (Ọdẹ́jọbí et al., 2006) and duration dimension (Ọdẹ́jọbí et al., 2007) of SY in detail, we will simply summarise the key points of those models in this paper. Interested readers are referred to those papers for details.

Before we summarise the realisation of the intonation and duration dimensions, we will first describe the data used for the modelling and experimentation processes.

    5.1. Data

The domain for our speech synthesis is language education and the mass media. We selected four popular SY newspapers and three SY textbooks for creating our text database. The newspapers are: (i) Aláròyé, (ii) Alálàyé, (iii) Ìròyìn Yorùbá, and (iv) Akéde Àgbáyé. As these newspapers did not apply the standard orthography of the language accurately, the texts selected from the newspapers for our text corpus have been tone-marked and under-dotted appropriately. Textual anomalies such as numerals, foreign words and proper names are also expanded and written in SY orthography using appropriate SY accents.

The three textbooks selected are two SY language education textbooks (Bamgboṣe, 1990; Owolabí, 1998) and a book on SY culture (Ògunbọwale, 1966). In addition, we also composed a short SY story and added its text to the SY text corpus. The purpose of composing the story is to add typical dialogue-domain text to the collected texts. It also allows us to compare the tonal and linguistic distributions in the different domains of SY text. The texts are not syntactically tagged, but the linguistic boundaries, such as sentence and phrase boundaries, are indicated with appropriate punctuation marks.

Despite this relatively small text corpus, the coverage of our text database in terms of syllables and words is adequate for its intended purpose. The database may, however, need to be updated if a different domain were to be selected for analysis. For example, the number of words in a phrase and the number of phrases in a sentence depend on the style of writing and the domain of the text.

Out of the 690 possible SY syllables (Ọdẹ́jọbí, 2005), our text database contains 456 unique isolated syllables. These syllables are carefully selected to reflect the coverage of all syllable types in terms of phonetic and phonological distributions. For example, in the CV syllable type, the manner of articulation of the onset is considered. The onset consonants are selected from each manner of articulation class, i.e. Stop, Labio-velar, Fricative, Affricate, Sonorant or Semivowel. In order to select the syllable for each class of utterance, the selected onset is combined with each vowel type, e.g. Closed rounded, Half-closed front, etc. The same process is repeated for all syllable types. Our data set adequately represents all of the five SY syllable types (i.e. CV, CVn, Vn, V and N).

For our speech database, however, 350 syllables were selected based on the analysis of the Yorùbá newspapers and textbooks discussed above. We also generated 360 sentences, 95 of which were selected for our prosody modelling. Fifty-five of the sentences are one-phrase sentences while the remaining forty are two-phrase sentences (see Table 2). All sentences in our database are semantically well-formed statement sentences, and they are selected to reflect common, everyday use of SY. The minimum number of syllables per sentence in our database is 4 and the maximum is 24, with a mean of 6.7. The H and L tone syllables account for 40% each, while the M tone syllables account for the remaining 20%.

    5.2. Recording and annotation

The 350 syllables and the 95 sentences were read by three female and three male speakers. All are naïve native SY speakers. Their ages range from 21 to 36 years. Each speaker read the text at their own pace, resulting in average speaking rates ranging from 3.5 to about 6.3 syllables per second. Each recording session lasted about 1 hour. The first 15 minutes were used to familiarise the speakers with the recording process and the environment. All of the speakers participated voluntarily and were therefore not paid for this activity. The speech files were annotated using the Praat (Boersma and Weenink, 2004) speech processing software.

The data was divided into two subsets. The larger subset was used to develop the intonation and duration models, and the smaller subset was used to test and evaluate these models. The details of how the data is used in developing and testing the models can be found in Ọdẹ́jọbí et al. (2006, 2007).

    5.3. Intonation modelling

The implementation of our intonation model within the relational tree (R-Tree) technique is achieved in two steps:

(1) The abstract structure of the intonation waveform, called the skeletal tree, is generated by using the tone phonological rules of SY.

(2) The numerical values corresponding to the perceptually significant peaks and valleys on the skeletal tree are computed from the f0 data of each syllable in the utterance using a Fuzzy Logic based model (Takagi and Sugeno, 1985).

The variables for computing these numerical values are: (i) the tone contrast and (ii) the relative position of the syllable in the utterance. The tone contrast is the difference between the canonical peak/valley f0 values for each syllable in an utterance. The tone contrast for each tone is computed by a different equation.

The Fuzzy Logic model is made up of a set of fuzzy rules which take the form: IF premise THEN consequence. The premise of each of these rules is formulated using two kinds of linguistic information: the relative position of each syllable in the utterance and the relative magnitude of the peak and valley of each syllable. The consequence computes the f0 peaks and valleys that correspond to the premise (Ọdẹ́jọbí et al., 2006).
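A minimal Takagi-Sugeno style sketch of such a rule base is given below. The membership functions, rule premises and consequent coefficients are invented placeholders, not the fitted rules of Ọdẹ́jọbí et al. (2006); the point is only the IF-premise-THEN-consequence mechanics.

```python
def trimf(x, a, b, c):
    """Triangular fuzzy membership function over [a, c] with apex at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def predict_peak(rel_pos, tone_contrast):
    """Takagi-Sugeno inference: each rule pairs a fuzzy premise on the syllable's
    relative position (0 = start, 1 = end of utterance) with a linear consequent
    on the tone contrast. All memberships and coefficients are illustrative."""
    rules = [
        # (premise membership, consequent f0 = w0 + w1 * tone_contrast)
        (trimf(rel_pos, -0.5, 0.0, 0.5), lambda c: 130.0 + 0.9 * c),  # initial
        (trimf(rel_pos,  0.0, 0.5, 1.0), lambda c: 120.0 + 0.7 * c),  # medial
        (trimf(rel_pos,  0.5, 1.0, 1.5), lambda c: 110.0 + 0.5 * c),  # final
    ]
    num = sum(w * f(tone_contrast) for w, f in rules)
    den = sum(w for w, _ in rules)
    return num / den if den else 0.0

# Peak for a syllable 30% into the utterance with a +10 Hz tone contrast:
print(round(predict_peak(0.3, 10.0), 1))  # 131.8
```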

The computed value for each peak and valley is then incorporated into the corresponding peak (Pi) or valley (Vi) on the S-Tree. The actual intonation contour is obtained by joining the computed points using interpolation. The method for computing the points on the S-Tree is able to take advantage of the abstract computation that has already been carried out in the previous step. In the case of intonation modelling, the S-Tree can be viewed as a model which organises the intonation units (i.e. tones) into a coherent structure.

Table 2. Speech database description

             | Number of syllables | Number of one-phrase sentences | Number of two-phrase sentences
Each speaker | 350                 | 55                             | 40
Total        | 2100                | 330                            | 240

This structure forms the basis for realising the actual intonation waveform. A detailed description of this model is presented in Ọdẹ́jọbí et al. (2006).

    5.4. Duration modelling

We have chosen to model the duration dimension using a Fuzzy Decision Tree (FDT) based duration model (Ọdẹ́jọbí et al., 2007). Since the duration modelling is syllable-based, we first carried out a set of exploratory analytical experiments to determine the factors affecting the duration of SY syllables. Nine variables were considered for computing the duration of each syllable in an utterance. They include:

(1) the position of the syllable in the word,
(2) the position of the word in the sentence,
(3) the length of the word in which the target syllable occurs,
(4) the peak of the f0 curve for the target syllable,
(5) the phonetic structure of the target syllable,
(6) the phonetic structure of the preceding syllable,
(7) the phonetic structure of the following syllable,
(8) the f0 peak of the preceding syllable, and
(9) the f0 peak of the following syllable.

The results of this experiment led to the selection of the first seven factors (Ọdẹ́jọbí et al., 2007), which were then used to produce the duration model. To model the duration of each SY syllable in context, we also predict a scaling factor using the difference in duration between a citation syllable and its contextual counterpart. This scaling factor is used to compute the realised syllable duration.
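In code, that final step is a one-liner once the scaling factor is known. In the sketch below, predict_scaling is a stand-in for the trained FDT, and the rule inside it is invented:

```python
def realised_duration(citation_ms, scaling):
    """Scale a syllable's citation duration by the context-dependent factor
    predicted (in the paper) by the Fuzzy Decision Tree."""
    return citation_ms * (1.0 + scaling)

def predict_scaling(features):
    """Stand-in for the trained FDT: maps contextual factors to a scaling factor
    derived from (contextual - citation) / citation durations. Invented rule:
    word-final syllables lengthen slightly, everything else shortens a little."""
    return 0.15 if features.get("position_in_word") == "final" else -0.05

features = {"position_in_word": "final", "position_in_sentence": "medial"}
print(realised_duration(180.0, predict_scaling(features)))  # 207.0 ms
```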

Within our modular holistic approach to prosody modelling, we are able to experiment with different modelling approaches and compare their results. As FDT has not previously been used for duration modelling in the context of TTS, we have also implemented a Classification and Regression Tree (CART) model, because it is a popular and well-documented approach.

The results of qualitative and quantitative evaluations show that CART models the training data more accurately than FDT. The FDT model, however, is better at extrapolating from the training data, since it produced better accuracy on the test data set. Synthesised speech produced by the FDT duration model was also ranked higher in perceptual quality than that of the CART model.

A detailed description of our duration models and the associated experiments is presented in Ọdẹ́jọbí et al. (2007).

    6. Illustration of the prosody model

We illustrate our prosody model using two SY sentences: (i) Ó ní láti lọ wobẹ̀ (meaning "He must go and see the place.") and (ii) Ọ̀dọ̀ mi lódé, Kó tó lọ (meaning "He came to my place before leaving."). The first sentence contains one phrase with the tone sequence (HHHMMML). The second sentence contains two phrases with the tone sequence (LLMHH, HHM). To model the prosody of each of these sentences, we first generate an S-Tree for each sentence using the algorithm illustrated in Section 4.2. The transcripts of the S-Tree generation for these sentences are shown in Figs. 4 and 5, respectively.

The generated abstract waveform for each sentence, shown in Figs. 4c and 5c, conforms to the expected intonation patterns. If SY phonological rules are applied to the underlying tone sequence of each of these sentences, similar intonation patterns would be produced (Connell and Ladd, 1990).

After generating an S-Tree for each sentence, the numeric values for each peak and valley in each prosody dimension are computed using the techniques stated in Sections 5.3 and 5.4 (cf. Ọdẹ́jọbí et al., 2006, 2007 for details). In our experiments, we have included the modelling of the intonation dimension and the duration dimension of the synthesised speech only. To model the intonation dimension, we first compute the peaks and valleys on the underlying f0 curve, and the results are then incorporated into the abstract waveform. To model the duration dimension, the realised time of each peak and valley is then computed, and the results are incorporated into the abstract waveform. The complete R-Tree modelling the two prosody dimensions of the synthesised speech is then formed.

The computed peaks and valleys, and the times of the peaks and valleys, for our two sample sentences are shown in Tables 3 and 4. These data are used to synthesise the corresponding speech as explained in Section 5. The corresponding synthesised utterance waveforms, f0 and duration data are plotted in Figs. 6 and 7. In Figs. 6 and 7, the dashed lines indicate the synthesised f0 contour and the solid lines indicate the natural f0 contour. The * indicates sentence boundaries and the comma (,) indicates a phrase boundary.

As shown in Figs. 6 and 7, the synthesised f0 contours differ from the natural ones but generally follow the pattern of the natural f0 contour. The synthesised f0 also shows some blunt angles at syllable junctions. These angles result in some perceptible clicks in the synthesised speech sound. We observed that the f0 curve for syllables in the synthesised utterance does not always show every peak and valley, as is the case in the canonical syllables. This is particularly prevalent in the M tone syllables, where the f0 points computed as peaks and valleys usually have the same f0 values. This results in a line rather than a curve. Furthermore, some H and L tones have flat or slightly tilting f0 curves. Our observation, however, agrees with the findings of recent studies on Mandarin Chinese (Xu, 1998, 1999b), which suggest that a tone may have a peak only when it is in an appropriate tonal or prosodic context.

We observed some unexpected results in terms of the f0 peak of the fifth H tone syllable being higher than that of the preceding one, e.g. the syllables lo and de in Fig. 7. However, the natural f0 contour also exhibits a similar pattern. A speculative explanation may be that the f0 curves of the tones in question suffer some co-articulatory effects as well as boundary perturbations, since such phenomena mostly occur at word, phrase and sentence boundaries. However, there are no available theoretical findings in the literature to support this claim.

[Fig. 5. Transcript of S-Tree generation for the sentence Ó ní láti lọ wobẹ̀. (a) Association of peaks and valleys with syllables, (b) skeletal tree and (c) abstract waveform.]

Table 3. Computed prosody data for the sentence Ó ní láti lọ wobẹ̀

Syllable | f0 peak (Hz) | Time of f0 peak (s) | f0 valley (Hz) | Time of f0 valley (s)
Ó   | 148.000 | 0.126 | 135.900 | 0.051
ní  | 145.500 | 0.278 | 145.200 | 0.221
lá  | 149.200 | 0.346 | 140.700 | 0.311
ti  | 115.600 | 0.561 | 108.300 | 0.638
lọ  | 121.700 | 0.690 | 114.400 | 0.781
wo  | 112.800 | 0.800 | 110.300 | 0.881
bẹ̀  | 104.000 | 1.000 | 77.200  | 1.110

Table 4. Computed prosody data for the sentence Ọ̀dọ̀ mi lódé, Kó tó lọ

Syllable | f0 peak (Hz) | Time of f0 peak (s) | f0 valley (Hz) | Time of f0 valley (s)
Ọ̀   | 109.051 | 0.093 | 101.760 | 0.153
dọ̀  | 112.320 | 0.173 | 105.042 | 0.273
mi  | 126.029 | 0.405 | 116.264 | 0.303
ló  | 138.915 | 0.523 | 127.731 | 0.463
dé  | 137.854 | 0.773 | 122.586 | 0.653
Kó  | 133.950 | 0.933 | 124.067 | 0.993
tó  | 138.553 | 1.093 | 123.230 | 1.188
lọ  | 113.912 | 1.322 | 111.390 | 1.373

[Fig. 6. Synthesised utterance waveform for the sentence Ó ní láti lọ wobẹ̀: waveform and pitch (Hz, 0-170) over time (0-1.17 s), with syllable labels O(H) ni(H) la(H) ti(M) lo(M) wo(M) be(L) and * marking the sentence boundaries.]

6.1. The alignment model

In order to produce natural prosody, the f0 turning points, the segment durations and the timing of the f0 contour with respect to the syllable segments are important. In this respect, the f0 contour needs to be aligned with the syllable segments using the duration data. There are a number of ways of implementing this alignment (Venditti and van Santen, 1998). One option would be to use the f0 and duration data to develop a model of the alignment of the key f0 events, i.e. the f0 turning points, with respect to the segments. Linear interpolation can then be used to join the points. The problem with this approach, as suggested by Venditti and van Santen (1998), is that f0 curves are not simply straight lines, but curves of various shapes. The computed alignment will not fit the underlying prosody data, and hence the resulting synthesised speech will have poor perceptual quality.

[Fig. 7. Synthesised utterance waveform for the sentence Ọ̀dọ̀ mi lódé, Kó tó lọ: waveform and pitch (Hz), with natural (solid) and synthesised (dashed) f0 contours.]

Another approach would be to use the f0 and duration data to determine the start and end time of each f0 turning point with respect to the syllable segment. The problem with this approach is that the turning points may not appear as distinguishable peaks or valleys on the intonation waveform. Some situations may arise in which consecutive f0 turning points are on the same level, making it impossible to designate either of them as a peak or a valley.

The approach we adopted in this work is to assume that the alignment of tonal targets can be specified relative to the duration of the segmental components of the syllable (Dilley et al., 2005). The variation is determined based on the duration tier of the speech waveform.

The tone alignment is template-driven, and the template depends on the tone and syllable types. The templates used for the H tone and L tone of a CVn syllable type are shown in Fig. 8. The alignment process uses the f0 peak and valley, as well as the duration of the syllable, to adjust the template as appropriate. The details of this alignment process are documented in Xu (1999a) and Dilley et al. (2005).

The alignment process involves using the computed numeric values for the f0 contour and the timing to generate the synthesised prosody. The alignment process begins by aligning the timing of the f0 curve on the voiced portion of the first syllable with the duration of the intonation pattern of the utterance. The subsequent syllables are then concatenated. As the concatenation proceeds, the timing of the earlier syllables is adjusted in order to fit the prosody of the utterance. This procedure continues until all syllables in the utterance have been concatenated.
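The template idea can be sketched as follows, with anchor points expressed as fractions of the syllable duration and of the peak-valley f0 range; the anchor fractions here are invented, not the Fig. 8 templates.

```python
def align_tone(template, syl_start, syl_dur, f0_peak, f0_valley):
    """Map a tone template, given as (fraction of syllable duration, fraction of
    the f0 range above the valley) anchors, onto absolute (time, f0) targets."""
    rng = f0_peak - f0_valley
    return [(syl_start + tf * syl_dur, f0_valley + ff * rng) for tf, ff in template]

# Invented H-tone template for a CVn syllable: valley near the onset, peak near the coda.
H_CVN = [(0.0, 0.3), (0.2, 0.0), (0.8, 1.0), (1.0, 0.7)]
print(align_tone(H_CVN, syl_start=0.45, syl_dur=0.18, f0_peak=138.9, f0_valley=127.7))
```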

We illustrate how the timing of a syllable is computed. Let the total duration of a syllable be D_total, the duration of the onset be D_onset and the duration of the nucleus be D_nucleus. The total duration of the syllable can be expressed as:

D_total = ⟨D_onset, D_nucleus⟩   (1)

If the syllable is compressed by n, where 0.0 ≤ n ≤ 1.0, then D_nucleus will be reduced by a factor n of its original canonical value, as well as by a factor n of the onset's canonical value. This ensures that the total duration of the syllable is reduced by n at the compressible portion, i.e. the rhyme portion, as specified in the equation:

D_compressed: ⟨D_total − D_total × n⟩ = ⟨D_onset, D_nucleus − D_nucleus × n − D_onset × n⟩   (2)

The sign of n determines whether the syllable is compressed or expanded. For example, if n = +0.5, the syllable will be about one and a half times the canonical syllable in terms of duration. If n = −0.5, the syllable will be half as long as the canonical syllable. The application of this formula is, however, dependent on the phonological identity of the syllable. Overlapping syllables to synthesise a word in this manner requires the temporal co-ordination of the events at the end of one syllable and the beginning of the next (Coleman, 1994).

[Fig. 8. Schematics of time and f0 alignment for the CVn syllable type, showing the Start/Valley/Peak/End alignment points against the Onset-Nucleus-Coda syllable structure. (a) H tone alignment for a CVn syllable and (b) L tone alignment for a CVn syllable.]

Eq. (2) constrains the amount of lengthening to at most 100%, because we assumed that a sentence is not increased to more than twice its canonical size. This constraint was motivated by our experimental data (Ọdẹ́jọbí et al., 2007).

The same equations apply to all syllable types. In the case of a CVn syllable, it is important to note that, in SY, the Vn is a nasalised version of the vowel V. We therefore assume that the timing of the coda is absorbed by the preceding vowel. With syllables that have no onset, i.e. V, Vn and N, the onset part in Eq. (2) is set to zero.

As an example, the schematics of the timing and f0 alignment for CVn type syllables are shown in Fig. 8. This model assumes that the f0 for a syllable always starts and ends on the vowel that forms the nucleus. In the case of the H tone curve, the valley appears nearer to the onset and the peak appears nearer to the coda of the syllable (see Fig. 8a). The assumption for the M tone curve is similar to that of the H tone, except that the f0 range, i.e. the difference between the peak and the valley, is much smaller. For the L tone, the peak is closer to the onset and the valley is closer to the coda of the syllable (see Fig. 8b).

7. Stem-ML model for SY

In order to put the results of our work in the context of contemporary work, we have also developed a Stem-ML model. The Stem-ML model was proposed by Kochanski and Shih (2003), and it has been applied to model intonation in Mandarin (Shih and Kochanski, 2000) and Cantonese (Lee et al., 2002). The apparent success of Stem-ML in modelling prosody for these tone languages motivated us to perform this experiment. The implementation of the Stem-ML model for SY is discussed in the following sections.

7.1. Design of the Stem-ML model

The assumptions in our model, beyond those that are generic to Stem-ML (see Kochanski and Shih (2003)), are:

(1) Each syllable carries a soft intonation target with one of the three shapes, chosen by the lexical tone. Each target is a line segment.
(2) The prosodic strength of a syllable both affects the precision with which a tone is realised and scales the f0 range of that tone's template. One can expect that a linguistically stronger syllable will have both a larger pitch range and also be articulated more carefully. We include an adjustable parameter in the model (atype) to account for such a correlation.
(3) Minimal syntactic information is required to model the intonation of SY. Our model includes five syntactic classes of syllables: phrase-initial, phrase-final, sentence-initial, sentence-final, and medial (i.e. everything else).
(4) Syllables in one- and two-syllable words have the same phonetic realisations, and we assume that there are no intrinsic differences between the first and second syllables in a two-syllable word.

The strength of each syllable is given by:

S_i = A[s(i)] C[P(i)]   (3)

where the function s(i) returns the tone of the ith syllable, A[s(i)] is the intrinsic strength of s(i), the function P(i) returns the position type of the ith syllable in the sentence (e.g. sentence-final, medial, ...), and C[P(i)] is the strength factor for position P(i). Segmental effects that depend on the phoneme sequence are relatively small.

The overall model uses 22 parameters: 11 to specify the templates, 7 to specify the strengths, and 4 global, speaker-specific parameters, e.g. the base f0 of the speaker. Note that the Stem-ML model does not use phonological rules for modelling speech prosody.

The model was fitted to the corpus under the assumption that the f0 data had independent Gaussian errors, using a Bayesian Markov Chain Monte Carlo algorithm that produced samples from the posterior distribution of the parameters. Once the algorithm had converged to a stationary distribution, we collected the last 6000 samples. From these, we computed the average values of all parameters and their uncertainties for the resulting Stem-ML model.
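With the fitted values reported in Section 7.2 below, Eq. (3) reduces to a pair of table lookups. The sketch assumes a multiplicative combination of the two factors and a reference strength factor of 1 for medial syllables; both are my reading rather than stated facts.

```python
# Mean fitted parameters from Section 7.2 (uncertainties omitted).
A = {"H": 1.8, "M": 2.5, "L": 2.2}                       # intrinsic tone strengths
C = {"sentence-initial": 1.42, "sentence-final": 0.70,
     "phrase-initial": 1.09, "phrase-final": 0.73,
     "medial": 1.0}                                      # medial as reference (assumption)

def strength(tone, position):
    """Eq. (3): S_i = A[s(i)] * C[P(i)], read as a product of the two factors."""
    return A[tone] * C[position]

print(round(strength("H", "sentence-initial"), 3))       # 2.556
```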

The Stem-ML implementation is quantised in 10 ms increments, which raises the possibility of spurious local minima. To ensure that an optimal model is produced, we started 10 more Monte Carlo runs with different fixed values of the centershift parameter. We chose 10 random samples of parameters from the second thousand iterations of each run and computed the RMSE. The resulting set of samples traces out a minimum of RMSE against centershift that is consistent with the error reported in Section 8.

7.2. Analysis of best-fit parameters of the intonation model

The following describes the parameters used in our Stem-ML model for SY prosody modelling in the context of a TTS application. The parameter names and their fitted values are given below.

smooth = 0.062 ± 0.006: This controls the rate at which the speaker changes pitch for a weak accent. It corresponds to a time of 106 ± 8 ms, roughly double the value obtained for a Mandarin speaker (Kochanski and Shih, 2003).

base = 105 ± 0.5 Hz: The speaker's base frequency.

atype = 0.41 ± 0.05: This describes how much the pitch range of a template expands as the strength changes. It indicates a fairly weak (but significant at p < 0.01) effect: a 40% change in strength (e.g. changing from a sentence-initial to a medial syllable) would mean that the template of the stronger syllable has an f0 range just 14% larger than that of a comparable medial syllable (a worked check of this arithmetic follows this parameter list). The value is similar to the 0.87 ± 0.7 value from Kochanski et al. (2003b).

ctrshift and wscale: These two parameters describe the scope of the template relative to the syllable boundaries. In our model, the length of the target is 85 ± 2% of the length of a syllable, similar to the 88 ± 1% for Mandarin. The target is nearly centred in the syllable, 1 ± 1% of its width (i.e. about 2 ms) before the syllable centre. Combined with Stem-ML's time-symmetric mathematics, this implies a nearly equal balance between anticipatory and carry-over co-articulation.

H tone: The template slopes up from 8% to 112% above base. The target shape is both high and rising. The parameter for the H tone type is 0.72 ± 0.06. This implies that both the height and the shape are important, but errors in the tone's average value are more important than errors in the shape of the tone.

M tone: The template slopes down from 28% to 6% above base. This is a mid-level and weakly falling tone. The parameter for the M tone type is 0.59 ± 0.06, indicating that both the pitch height and slope are important.

L tone: The template slopes down from 15% to 68% below base. The parameter for the L tone type is 0.148 ± 0.04, indicating that the shape is more important than the average f0 value. It might be best to describe this tone as falling with a tendency towards low.

Intrinsic strengths of tones: For the intrinsic strength of tones, we obtained A[H] = 1.8 ± 0.12, A[M] = 2.5 ± 0.15, A[L] = 2.2 ± 0.08 (see Eq. (3)). The high tone is the weakest and the M tone is the strongest. Thus, when compared with other tones, there is a higher tendency for the shape of a high tone to be influenced by its environment. The difference in this tendency is fairly significant (p < 0.01).

Sentence boundaries: The strength factors for syllables in sentence-initial (SI) and sentence-final (SF) positions (Eq. (3)) are C[SI] = 1.42 ± 0.06 and C[SF] = 0.70 ± 0.03, respectively. Thus, sentence-initial syllables are stronger (i.e. they are articulated more precisely and have wider f0 swings) than medial syllables, and sentence-final syllables are weaker than medial syllables.

Phrase boundaries: The strength factors for syllables in phrase-initial (PI) and phrase-final (PF) positions (but not at the beginning or the end of a sentence) are C[PI] = 1.09 ± 0.03 and C[PF] = 0.73 ± 0.06. The phrase-initial syllables are slightly stronger than medial syllables, and phrase-final syllables are also weaker than medial syllables.

Our results for sentence and phrase boundaries parallel the results from Kochanski and Shih (2003) for Mandarin and Lee et al. (2002) for Cantonese. The differences in strength may be cues that help the listener to find phrase and sentence boundaries.
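As a quick sanity check of the atype figure above, assuming (our reading, not necessarily Stem-ML's exact formulation) that the pitch range scales as strength raised to the power atype:

atype = 0.41
strength_ratio = 1.40                     # a 40% change in strength
range_ratio = strength_ratio ** atype     # assumed power-law dependence
print(f"{(range_ratio - 1) * 100:.1f}% wider f0 range")  # ~14.8%, close to the ~14% quoted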

Fig. 9. Model fit and raw data for SY sentence On lati lo: wobe`:.

The results of applying the Stem-ML model to our earlier SY sentence samples, On lati lo: wobe`: and O`:do`:mi lode, Ko to lo:., are shown in Figs. 9 and 10, respectively. In these figures, the black dots mark measured f0, the grey curve is the predicted f0, the grey lines show the Stem-ML templates, and the vertical dashed lines show the syllable centres.

    8. Evaluation

The evaluation is divided into two major types: (i) quantitative and (ii) qualitative. In the quantitative evaluation, the aim is to investigate how accurately the models predict the data. In the qualitative evaluation, the aim is to see how native speakers judge the overall quality of the synthesised speech produced by the two models. The intelligibility and perceived naturalness of synthetic speech strongly depend on the prosodic quality.

    8.1. Evaluation data preparation

Our experimental data contains 25 isolated sentences of varying complexities (cf. Appendix A). Their selection was based on the following criteria: (i) tone combinations, (ii) syllable types and length, and (iii) phrasal structure (i.e. one and two phrase sentences). With respect to tone combinations, for example, we selected some sentences in which all syllables carry the same tones, e.g. O tun sare wa. (meaning He ran here again.), which has the tone sequence (HHHHH), while others contain syllables with various combinations of tone types, including some beginning and ending with a particular tone sequence, e.g. HHH, LLL or MMM.

Fig. 10. Stem-ML prediction of f0 and raw data for the two phrases of SY sentence O: do: mi lode, Ko to lo:. This data is in the test set; the model prediction is based on parameters derived from the training set, using syllable boundaries for this specific utterance. (a) Model fit and raw data for the first phrase O: do: mi lode and (b) model fit and raw data for the second phrase Ko to lo:.

An attempt was made to achieve a balance between the phonetic types of syllables in a sentence, but this was not possible in most cases due to the sparse nature of the SY syllable type distribution. In order to reduce the cognitive load of the stimuli, we have also limited the length of the sentences used in this experiment to 15 syllables.

Ten out of the twenty-five sentences come from the training set and the remaining 15 are from the test set. The test set is further divided into two groups. The first test data group consists of 10 sentences. The syllables that make up these sentences are in our syllable database, which was used for developing the R-Tree based prosody model. The remaining five sentences, which form the second test group, contain syllables that are not in our syllable database. The reason for using the second test group is to see how well the implemented models are able to extrapolate from known to unknown data.

All test sentences are statement sentences. Three types of stimuli were prepared for each of the sentences: (i) the speech synthesised using the parameters computed by the R-Tree model, (ii) the speech synthesised using the parameters computed by the Stem-ML model and (iii) the natural speech spoken by a naïve adult male native speaker of SY. The synthesised stimuli were created by replacing the duration tiers and pitch tiers of the natural speech with the ones computed by each model and applying the PSOLA re-synthesis function in the Praat speech processing software (Boersma and Weenink, 2004).
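A minimal sketch of this stimulus-preparation step, using the praat-parselmouth Python wrapper rather than Praat directly (an assumption on our part; the file name and f0 targets are hypothetical). A model-predicted duration tier would be swapped in the same way with "Replace duration tier".

import parselmouth
from parselmouth.praat import call

# Load a natural utterance (hypothetical file name).
sound = parselmouth.Sound("natural_utterance.wav")

# Build a Manipulation object (10 ms time step, pitch floor/ceiling in Hz).
manipulation = call(sound, "To Manipulation", 0.01, 75, 300)

# Create a replacement pitch tier from model-predicted (time, f0) targets.
pitch_tier = call("Create PitchTier", "model_f0", 0.0, sound.duration)
for t, f0 in [(0.10, 120.0), (0.25, 140.0), (0.40, 110.0)]:  # illustrative values
    call(pitch_tier, "Add point", t, f0)

# Swap the natural pitch tier for the model-predicted one and resynthesise
# with the overlap-add (PSOLA) method.
call([manipulation, pitch_tier], "Replace pitch tier")
resynthesised = call(manipulation, "Get resynthesis (overlap-add)")
resynthesised.save("synthetic_stimulus.wav", "WAV")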

    8.2. Quantitative evaluation

Two metrics are used for our quantitative evaluation: the Root Mean Square Error (RMSE) and Pearson's correlation (Corr) (Petruccelli et al., 1999). The RMSE measures how far apart two intonation contours are: it is the pointwise distance between the two contours over time, regardless of the contour shape. Pearson's correlation, on the other hand, measures how closely the synthetic f0 contour relates to the natural one, i.e. the degree to which the two are linearly related. The results of our quantitative evaluation are shown in Table 5.
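Both metrics are standard; a minimal sketch, assuming the natural and synthetic contours are sampled at the same time points:

import numpy as np

def rmse(natural_f0: np.ndarray, synthetic_f0: np.ndarray) -> float:
    """Root mean square error between two f0 contours (Hz)."""
    return float(np.sqrt(np.mean((natural_f0 - synthetic_f0) ** 2)))

def pearson_corr(natural_f0: np.ndarray, synthetic_f0: np.ndarray) -> float:
    """Pearson's correlation: how linearly related the two contours are,
    independent of overall offset and scale."""
    return float(np.corrcoef(natural_f0, synthetic_f0)[0, 1])

# Illustrative contours (Hz), sampled every 10 ms.
natural = np.array([120.0, 132.0, 141.0, 138.0, 125.0])
synthetic = np.array([118.0, 128.0, 144.0, 140.0, 121.0])
print(rmse(natural, synthetic), pearson_corr(natural, synthetic))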

On the f0 dimension, the Stem-ML model fits the first and second test sets with an RMSE of 16.56 Hz and 18.80 Hz, respectively. The fit to the training set has an RMSE of 12.00 Hz. The R-Tree model fits the first and second test sets with an RMSE of 17.40 Hz and 18.35 Hz; the fit to the training set has an RMSE of 16.50 Hz. These results indicate that, on the training set, the f0 contour predicted by the Stem-ML model is much closer to the natural f0 contour than that of the R-Tree model. When we compare the test data sets, the Stem-ML model performs only slightly better than the R-Tree model.

This observation is consistent with the well-known fact that quantitative approaches, i.e. the Stem-ML based model in this case, are strong at modelling the training data but are weak at extrapolating to test data (Wang et al., 2002; Sakurai et al., 2003). Although the R-Tree model did not perform as well as the Stem-ML model, its results are more consistent across the training and test data sets.

Table 5
f0 contour evaluation

Model           Data set             RMSE (Hz)      Corr.
R-Tree based    Training data set    16.50 (2.30)   0.77 (0.11)
R-Tree based    First test set       17.40 (2.11)   0.68 (0.06)
R-Tree based    Second test set      18.35 (1.72)   0.57 (0.05)
Stem-ML based   Training data set    12.00 (3.11)   0.85 (0.13)
Stem-ML based   First test set       16.56 (2.78)   0.76 (0.09)
Stem-ML based   Second test set      18.80 (1.89)   0.61 (0.07)

Standard deviations are shown in parentheses.

The correlation for the Stem-ML model was 0.85 (0.13) for the training set. For the first and second test sets, it was 0.76 (0.09) and 0.61 (0.07), respectively. The correlation for the R-Tree model was 0.77 (0.11) for the training set. For the first and second test sets, it was 0.68 (0.06) and 0.57 (0.05), respectively. These results indicate that the Stem-ML model produces a synthetic f0 contour that is closer to the natural one. The correlation results also indicate that Stem-ML models the training and test data more accurately than the R-Tree model.

Note that the RMSE on the f0 dimension for the two models is less than the JND (Just Noticeable Difference) of 22 Hz reported for SY by Harrison (2000). This implies that the errors in both models are unlikely to result in a wrong perception of tone for each syllable, since the errors are less than the minimum f0 range that is noticeable by a native speaker.

However, these quantitative evaluations cannot determine whether variation in the synthetic contour makes the synthetic speech sound any more or less natural; they only measure how much the two contours vary. Therefore, there was a need to perform a qualitative evaluation.

    8.3. Qualitative evaluation

Two important aspects of synthetic speech are its intelligibility and naturalness (Hawkins et al., 2000; Sun, 2002). These are determined by the prosody of the speech. Intelligibility defines how easy it is to comprehend the message in the synthetic speech. Naturalness defines how close the synthetic speech is to natural speech. The qualitative evaluation aims to provide an insight into the opinion of the potential users of a TTS system concerning the quality of the synthesised speech. Substantial efforts have been made to develop good methodologies for evaluating the quality of synthetic speech (e.g. Huggins and Nickerson, 1985; Monaghan and Ladd, 1990; Benoît et al., 1996; Bradlow et al., 1996; Zera, 2004; Viswanathan and Viswanathan, 2005).

The primary approach has been to use a subjective method whereby native speakers of the language listen to synthetic speech and rate it on a scale of 1–5. This approach is known as the Mean Opinion Score (Viswanathan and Viswanathan, 2005). In the following sections, we summarise the results of the intelligibility and naturalness evaluations of the two prosody models.

    8.4. Data for qualitative evaluation

To prepare the data for our qualitative evaluation, 25 sentences were selected from the 90 sentences discussed above (cf. Section 8.1). Ten of these sentences were from the training set, 10 from the first test set and 5 from the second test set. The participants were informed that the experiment dealt with the quality of synthetic speech, but they were not told which of the speech samples were natural or synthesised. The sentences were presented to each participant in random order. Each sample was played to the participant as many times as they wanted.

Seventeen naïve native SY speakers, ranging in age between 23 and 45 years old, volunteered to take part in the evaluation. To ascertain their hearing ability, they were all subjected to an initial screening process, which involved playing some natural speech samples to them and asking them to transcribe or repeat what they heard. Those who failed to produce 100% accuracy in this test were excluded from the evaluation experiment. As a result, 14 of them were selected for the evaluation. All 14 participants took part in the evaluation of the training and first test sets, but only 10 of them were also able to participate in the evaluation of the second test set. Each participant took about 45 min to evaluate the speech. In each case, the intelligibility evaluation was done first; after a 5-min break, the naturalness evaluation followed.

    8.5. Intelligibility evaluation

Our intelligibility test is designed to establish how well the listeners identify the syllables in an utterance. This intelligibility evaluation is adopted because we feel that the syllable is the most important perceptual unit in an SY utterance. If the listeners are able to identify all syllables in an utterance, then they can discern the underlying words, phrases and sentences.

The intelligibility test we adopted here is a transcription error test. During the test, the listeners are expected to transcribe or repeat what they have heard. The result of the intelligibility test for each sentence as rated by each participant is computed using the formula:

$$\text{Intelligibility} = \frac{T_{\text{All}} - T_{\text{Wrong}}}{T_{\text{All}}} \times 5.0 \qquad (4)$$

where $T_{\text{All}}$ is the total number of syllables in a sentence and $T_{\text{Wrong}}$ is the number of syllables that had been wrongly transcribed. The formula ensures that if all syllables are wrongly transcribed, the intelligibility score will be zero. The intelligibility score will be five if all syllables in the utterance are correctly transcribed.
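Eq. (4) is straightforward to implement; a minimal sketch:

def intelligibility(total_syllables: int, wrong_syllables: int) -> float:
    """Transcription-error intelligibility score on a 0-5 scale (Eq. (4))."""
    return (total_syllables - wrong_syllables) / total_syllables * 5.0

# A 10-syllable sentence with 2 wrongly transcribed syllables scores 4.0;
# all syllables wrong scores 0.0, all correct scores 5.0.
print(intelligibility(10, 2), intelligibility(10, 10), intelligibility(10, 0))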

On the training set, the R-Tree model had an average intelligibility rating of 4.0, whereas the Stem-ML model was rated 3.6 (see Table 6). A sign test (Petruccelli et al., 1999) shows that the listeners preferred the synthetic speech generated using the R-Tree model over that generated using the Stem-ML model (p ≤ 0.05). The R-Tree model also outperforms the Stem-ML model on the first and second test data, as its ratings were 3.7 and 3.3, respectively, as opposed to 3.2 and 2.9 for the Stem-ML model. However, these differences are not significant (p > 0.05). The results show that the synthetic speech produced using the R-Tree based prosody model is slightly more intelligible than that produced using the Stem-ML based prosody model.
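The sign test reduces to a binomial test on the pairs where the two models receive different ratings. A hedged sketch with made-up counts (not the paper's data):

from scipy.stats import binomtest

# Of the paired ratings where the two models differ, count how often the
# R-Tree stimulus was rated higher (illustrative counts only).
n_rtree_preferred = 18
n_differing_pairs = 25

# Under H0 (no preference) each differing pair favours either model with
# probability 0.5; the one-sided binomial tail gives the sign-test p-value.
result = binomtest(n_rtree_preferred, n_differing_pairs, p=0.5, alternative="greater")
print(result.pvalue)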

    8.6. Naturalness evaluation

The average Mean Opinion Score (MOS) and the standard deviation for each model over the thirty stimuli were also computed. In this case, the participants were asked to rate their overall impression of the speech quality in terms of how close it is to human speech. They were asked to rate the quality of the stimuli on the 5-point MOS scale described in Table 7. The participants were not informed which speech was natural or synthesised.

Table 6
Results of the intelligibility evaluation

                      Training data   First test data   Second test data
R-Tree based model    4.0 (0.15)      3.7 (0.05)        3.3 (0.12)
Stem-ML based model   3.6 (0.31)      3.2 (0.16)        2.9 (0.08)
Natural speech        5.0 (0.01)      4.9 (0.03)        4.9 (0.03)

The number in parentheses is the standard deviation.

Table 7
Qualitative evaluation scores

Value   Description
5       Perfect, indistinguishable from natural speech quality
4       Very good
3       Average
2       Poor
1       Weak or not acceptable

Table 8
Results of the naturalness evaluation

                      Training data   First test data   Second test data
R-Tree based model    3.1 (0.08)      2.5 (0.02)        2.2 (0.08)
Stem-ML based model   3.1 (0.13)      2.3 (0.03)        1.9 (0.17)
Natural speech        5.0 (0.00)      4.8 (0.01)        4.7 (0.02)

The number in parentheses is the standard deviation.

Fig. 11. Results of the intelligibility evaluation. (a) Intelligibility results for the training data set, (b) intelligibility results for the first test set and (c) intelligibility results for the second test set.

As shown in Table 8, on the training set, both the R-Tree model and the Stem-ML model score the same MOS of 3.1, but the result of the R-Tree model has a lower standard deviation. On the test sets, however, the results show that the R-Tree model received a higher average rating. While the first and second test sets were rated 2.5 and 2.2, respectively, for the R-Tree prosody model, they are rated 2.3 and 1.9 for the Stem-ML model.

The result of a sign test on the first test set shows that there is no evidence (p > 0.05) that the R-Tree model is preferred over the Stem-ML model; the same result is obtained for the second test set. In a further analysis, we also found no statistically significant evidence (p > 0.05) that listeners preferred the synthetic speech generated by the Stem-ML model over that of the R-Tree model. Taken together with the higher average MOS, this suggests that the R-Tree model is slightly better than the Stem-ML model in terms of naturalness, although the difference is not statistically significant.

    9. Discussion

The results of the quantitative analysis show that the R-Tree model does not model the prosody data as accurately as the Stem-ML model. However, the synthesised speech from the R-Tree model was rated slightly more intelligible and natural than that of the Stem-ML model.


A reason for this is that, in the R-Tree model, the f0 dimension is stylised. This ensures that only the perceptually significant points are captured by the model; points that are not perceptually significant are ignored in the modelling. The quantitative evaluation measures how accurately the model reproduces the entire prosody data. As the perceptually less significant points account for the greater proportion of the prosody data, and the R-Tree model simply interpolates between the perceptually significant points, the result is a less accurate reproduction of the entire prosody data. However, this does not translate to poor perceptual quality, because the most important points of the prosody data have been captured in the model.
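A minimal sketch of this interpolation idea (our illustration, not the authors' code): keep only the perceptually significant (time, f0) points and reconstruct a dense contour linearly between them.

import numpy as np

# Perceptually significant points of a stylised contour (illustrative values).
times = np.array([0.00, 0.12, 0.35, 0.60, 0.82])    # seconds
f0 = np.array([118.0, 140.0, 126.0, 132.0, 110.0])  # Hz

# Reconstruct a dense contour every 10 ms by linear interpolation; any fine
# detail between the kept points is deliberately discarded.
dense_t = np.arange(0.0, 0.82, 0.01)
dense_f0 = np.interp(dense_t, times, f0)
print(dense_f0[:5])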

Much of the error in the Stem-ML based prosody model can probably be accounted for by segmental effects (Silverman, 1987; Dusterhoff, 2000; Kochanski et al., 2003a), due to changes in the vowel and the syllable-onset consonant. Generally speaking, the worst-fitting syllables are those with the largest and fastest f0 excursions. These are conditions where Stem-ML's approximation between templates and the realised pitch curve is furthest from the actual perceptual metric. However, segmental effects or un-modelled differences in the strength of syllables may also play a role. The results of the fits for Stem-ML models are generally similar to other Stem-ML based intonation models (Kochanski et al., 2003a; Kochanski and Shih, 2003). We have observed from our qualitative evaluation that the Stem-ML model performs better for long sentences containing one or more phrases with consistent tone patterns; e.g. the intelligibility of our Stem-ML model is better on the seventh sentence in our training set (cf. Appendix A), which has the tone pattern (HHHHHHHMH, HMHLLHLHMLH).

Fig. 12. Results of the naturalness evaluation. (a) Naturalness results for the training data set, (b) naturalness results for the first test set and (c) naturalness results for the second test set.

On the whole, our qualitative evaluation shows that our R-Tree model performs better than our Stem-ML model (cf. Figs. 11 and 12). However, we observed that a strength of our Stem-ML model is that it models the prosody for sentences with uniform tone patterns better than those with mixed tone patterns. For example, our Stem-ML model produces better naturalness results for the fifth and sixth sentences in the training set and the seventh and ninth sentences in the first test set (cf. Appendix A). In terms of intelligibility, it seems that the longer the sentence, the better the performance of our Stem-ML model.

When we compare the two prosody models, the R-Tree model produces more consistent performance than the Stem-ML model. Unlike the Stem-ML model, the R-Tree model produces good intelligibility and naturalness results regardless of the length or the tone pattern of the utterance. This consistency may have resulted because the parameters of the R-Tree model are constrained by phonological rules. The incorporation of perceptual information into the R-Tree model, by way of stylisation of the f0 curves, may have improved the quality of the synthesised speech prosody. In addition, despite the apparent success of the Stem-ML model at capturing some aspects of SY intonation, it is difficult to relate these results to linguistic phenomena because the model does not take the phonological rules of SY into account.

Furthermore, an important feature of the R-Tree model is that each of the prosody dimensions can easily be analysed and its contribution to the quality of the synthesised speech prosody can be determined. This information can be used to improve corpus design and development. Since the R-Tree model is based on phonological rules, the model results can also be used to verify various observed phonological phenomena in SY.

    10. Conclusion

We have presented a new prosody model suitable for the Standard Yoru`ba (SY) language. This prosody model is conceptualised around a modular holistic approach and implemented using the Relational Tree (R-Tree) technique. The R-Tree is a sophisticated data structure for representing a multi-dimensional waveform in the form of a tree. An important feature of this framework is its flexibility in facilitating both the independent implementation of the different dimensions of prosody, i.e. intonation, duration and intensity, using different modelling techniques, and their subsequent integration.

The R-Tree for an utterance is generated by creating a Skeletal Tree (S-Tree) using an algorithm derived from the tone phonological rules of the target language, and then computing the numerical values of each prosody dimension corresponding to the perceptually significant points represented by the S-Tree. We have modelled the intonation dimension using fuzzy logic rules. For the duration dimension, we experimented with both Fuzzy Decision Tree (FDT) and Classification and Regression Tree (CART) techniques. We found that FDT is a practical choice for modelling SY duration. These prosody dimensions are integrated using a basic alignment model which coordinates the f0 and duration data with the syllable segments in an utterance.

To evaluate the effectiveness of our proposed prosody model, as well as to put the results of our work in the context of contemporary work on prosody modelling, we have also developed a Stem-ML prosody model for SY. We have performed both quantitative and qualitative evaluations of our implemented prosody models. The results of these evaluations suggest that Stem-ML models the prosody data more accurately than the R-Tree. However, in the qualitative evaluations, i.e. intelligibility and naturalness, the synthesised speech prosody generated by the R-Tree model is rated higher overall.

The R-Tree model is particularly suitable for modelling speech prosody in TTS applications for languages that have not been widely studied, e.g. African languages, for which expertise and language resources are limited. The R-Tree model is able to take advantage of the results of phonological studies and a relatively small speech corpus for prosody modelling. Such a modelling approach enables us to start off with a relatively small model and improve on it in an iterative manner. The R-Tree model also facilitates the integration of perceptual information into the model development. This results in an improved perceptual quality and a model that is easy to interpret and analyse.

Our modular holistic approach to prosody modelling provides a suitable framework for experimenting with and comparing different models for realising each prosody dimension. One drawback of such an approach is the need for a robust alignment model. We expect that, with a more accurate and realistic alignment model, the R-Tree based prosody model will perform better.

    Acknowledgements

We acknowledge Dr. Greg Kochanski of the Phonetics Laboratory at Oxford University for his help in the development of the Stem-ML model. We are grateful for the advice and contributions of Professor Robert Ladd of Theoretical and Applied Linguistics, and Dr. Simon King and Dr. Robert Clark of the Centre for Speech Technology Research at the University of Edinburgh, regarding the development of the CART duration model.

    Appendix A. Evaluation sentences

    The training set

(1) Ba`ba a`gbe`: ti ta ko`ko.
    The farmer has sold cocoa.
(2) O`:gbe:ni Gba`da`, wa oruko: sle`: fun kaa`d` `danimo`:.
    Mr. Gbada, come and register for an identification card.
(3) O mo`: pe e`mi ko:.
    He knows that it is not me.
(4) Ba`ba a`gbe`: ti ta ko`ko, koto mo: pe ko`ko ti gbowo lor.
    The farmer has sold his cocoa before knowing that the price of cocoa has increased.
(5) Is:e lo wawa.
    He came looking for a job.
(6) Dde tode lo wa sb.
    He came here as soon as he arrived.
(7) Ope:k o to de s ile, ntor o:`na` to j`n lo ti r`n wa.
    He came home late, because he walked from a far place.
(8) Wo:n ti mowo wa.
    They have brought the money.
(9) Olu`ko: ti de.
    The teacher has come.
(10) Ele:ru` ti de, e:je: ka gbe.
    The owner of the baggage has come, let us carry it.

The first test set

(1) La` fo`:ro`: gu`n, o mun afara gu`n.
    Without prolonging the issue, he climbed the bridge.
(2) Ati par.
    We have finished.
(3) O: mo: we:we: ni wo:n.
    They are kids.
(4) Ele:ru` ti de.
    The owner of the baggage has come.
(5) A`la`de wa, os` tun lo:.
    Alade came, and left as well.
(6) Botiw lor.
    It is as he has said.
(7) O tun sare wa.
    He came back running again.
(8) O`:do:mi lo ko:ko: sawa, k o to r`n pada` s `lu I`ba`da`n.
    He ran to my place, before running back to Ibadan town.
(9) Isu ape:ja.
    The fisher's yam.
(10) Olu`ko: wa, wo:n s` ko: wa.
    The teacher came, and he taught us.

    The second test set

(1) Ko` te`te` j.
    He woke up late.
(2) I`wa` o:de`le`.
    A behaviour of a betrayer.
(3) Ibi t a lo:, lat mbo`.
    We are coming from where we went.
(4) Oko A`d`gun lati s:is:e.
    We worked at Adigun's farm.
(5) Sbe` sbe`, o lo: la`` gba`s:e.
    Even then, he left without taking approval.

    Appendix B

See Figs. 13–16.

Fig. 13. Main pseudocode for S-Tree generation.
Fig. 14. Pseudocode for building a partial S-Tree.
Fig. 15. Pseudocode for finding the deepest valley within a sequence of tones.
Fig. 16. Pseudocode for finding the highest peak within a sequence of tones.
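The authors' actual pseudocode is given in the figures. As a rough illustration only of what routines in the spirit of Figs. 15 and 16 might do, assuming (our assumption) that tones are mapped to numeric levels such as L=0, M=1, H=2:

# Illustrative only: not the authors' pseudocode from Figs. 15 and 16.
TONE_LEVEL = {"L": 0, "M": 1, "H": 2}

def deepest_valley(tones: list[str]) -> int:
    """Index of the lowest tone in the sequence (ties: first occurrence)."""
    levels = [TONE_LEVEL[t] for t in tones]
    return levels.index(min(levels))

def highest_peak(tones: list[str]) -> int:
    """Index of the highest tone in the sequence (ties: first occurrence)."""
    levels = [TONE_LEVEL[t] for t in tones]
    return levels.index(max(levels))

print(deepest_valley(list("HMHLLHLHMLH")))  # 3: the first L
print(highest_peak(list("MLMHML")))         # 3: the first H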

    References

Bamgbos:e, A., 1990. Fonoloj` a`ti Grama` Yoru`ba. University Press PLC, Iba`da`n.
Benoît, C., Grice, M., Hazan, V., 1996. The SUS test: a method for the assessment of text-to-speech synthesis intelligibility using semantically unpredictable sentences. Speech Communication 18, 381–392.
Boersma, P., Weenink, D., 2004. Praat: doing phonetics by computer. http://www.fon.hum.uva.nl/praat/. Visited: March 2004.
Bradlow, A.R., Torretta, G.M., Pisoni, D.B., 1996. Intelligibility of normal speech I: global and fine-grained acoustic-phonetic talker characteristics. Speech Communication 20, 255–272.
Burrows, T.-L., 1996. Trainable speech synthesis. PhD thesis, Speech Processing with Linear and Neural Network Models, Cambridge.
Coleman, J.S., 1994. Polysyllabic words in the YorkTalk synthesis system. In: Keating, P.A. (Ed.), Phonological Structure and Forms: Papers in Laboratory Phonology III. Cambridge University Press, Cambridge, pp. 293–324.
Collier, R., 1990. On the perceptual analysis of intonation. Speech Communication 9, 443–451.
Connell, B., Ladd, D.R., 1990. Aspects of pitch realisation in Yoru`ba. Phonology 7, 1–29.
d'Alessandro, C., Mertens, P., 1995. Automatic pitch contour stylization using a model of tonal perception. Computer Speech and Language 9, 257–288.
Dilley, L.C., Ladd, D.R., Schepman, A., 2005. Alignment of L and H tone in bitonal pitch accents: testing two hypotheses. Journal of Phonetics 33, 115–119.
Dusterhoff, K., 2000. Synthesizing fundamental frequency using models automatically trained from data. PhD thesis, University of Edinburgh, Edinburgh.
Ehrich, R.W., Foith, J.P., 1976. Representation of random waveforms by relational trees. IEEE Transactions on Computers C-25 (7), 725–736.
Fackrell, J.W.A., Vereecken, H., Martens, J.P., Coile, B.V., 1999. Multilingual prosody modelling using cascades of regression trees and neural networks. http://chardonnay.elis.rug.ac.be/papers/1999_0001.pdf. Visited: September 2004.
Fujisaki, H., Hirose, K., 1982. Modelling the dynamic characteristics of voiced fundamental frequency with application to analysis and synthesis of intonation. In: Proceedings of the 13th International Congress of Linguistics, pp. 57–70.
Fujisaki, H., Ohno, S., Gu, W., 2004. Physiological and physical mechanisms for fundamental frequency control in some tone languages and a command-response model for generation of their f0 contours. In: Proceedings of the International Symposium on Tonal Aspects of Languages, Beijing. Visited: June 2004.
Fujisaki, H., Ohno, S., Wang, C., 1998. A command-response model for f0 contour generation in multi-lingual speech synthesis. In: Proceedings of the Third ESCA/COCOSDA International Workshop on Speech Synthesis, pp. 299–309.
Fujisaki, H., Tomana, R., Narusawa, S., Ohno, S., Wang, C., 2000. Physiological mechanism for fundamental frequency control in Standard Chinese. In: Proceedings of ICSLP 2000.
Fujisaki, H., Wang, C., Ohno, S., Gu, W., 2005. Analysis and synthesis of fundamental frequency contours of Standard Chinese using the command-response model. Speech Communication 47, 59–70.
Goubanova, O., Taylor, P., 2000. Using Bayesian belief networks for modelling duration in text-to-speech systems. In: Proceedings of ICSLP 2000.
Gårding, E., 1983. A generative model of intonation. In: Cutler, A., Ladd, D.R. (Eds.), Prosody: Models and Measurements. Springer, Berlin, pp. 11–25.
Harrison, P., 2000. Acquiring the phonology of lexical tone in infants. Lingua 110, 581–616.
Hawkins, S., Heid, S., House, J., Huckvale, M., 2000. Assessment of naturalness in the ProSynth speech synthesis project. In: Proceedings of the IEE Colloquium on Speech Synthesis, London.
Hombert, J.-M., 1976. Perception of tones of bisyllabic nouns in Yoru`ba. Studies in African Linguistics (Suppl. 6), 109–121.
Huggins, A., Nickerson, R.S., 1985. Speech quality evaluation using phoneme-specific sentences. Journal of the Acoustical Society of America 77 (5), 1896–1906.
Klatt, D.H., 1987. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America 82 (3), 737–793.
Kochanski, G., Shih, C., 2003. Prosody modelling with soft templates. Speech Communication 39, 311–352.
Kochanski, G., Shih, C., Jing, H., 2003a. Hierarchical structure and word strength prediction of Mandarin prosody. International Journal of Speech Technology 6, 33–43.
Kochanski, G., Shih, C., Jing, H., 2003b. Quantitative measurement of prosody strength in Mandarin. Speech Communication 41, 625–645.
Lee, T., Kochanski, G., Shih, C., Li, Y., 2002. Modeling tones in continuous Cantonese speech. In: Proceedings of the International Conference on Spoken Language Processing, Den