
Towards High-Quality Next-Generation Text-to-Speech Synthesis: A Multidomain Approach by Automatic Domain Classification

Francesc Alías, Member, IEEE, Xavier Sevillano, Joan Claudi Socoró, and Xavier Gonzalvo

Abstract—This paper is a contribution to the recent advancements in the development of high-quality next-generation text-to-speech (TTS) synthesis systems. Two of the hottest research topics in this area are oriented towards improving speech expressiveness and synthesis flexibility. In this context, this paper presents a new TTS strategy called multidomain TTS (MD-TTS) for synthesizing across different domains. Although the multidomain philosophy has been widely applied in spoken language systems, few research efforts have extended it to the TTS field. To do so, several proposals are described in this paper. First, a text classifier (TC) is included in the classic TTS architecture in order to automatically select the most appropriate domain for synthesizing the input text. In contrast to classic topic text classification tasks, the MD-TTS TC should consider not only the contents of the text but also its structure. To this end, this paper introduces a new text modeling scheme based on an associative relational network, which represents texts as a directional weighted word-based graph. The conducted experiments validate the proposal in terms of both objective (TC efficiency) and subjective (perceived synthetic speech quality) evaluation criteria.

Index Terms—Speech synthesis, text processing.

I. INTRODUCTION

THERE has been a very noticeable development over the last 20 years in the text-to-speech (TTS) synthesis research field. In particular, TTS systems have moved from diphone-based approaches, with only one instance per unit, to unit selection or corpus-based strategies, using large speech corpora containing multiple instances per unit [1], [2]. In this paradigm shift, the TTS research community has borrowed several aspects of the philosophy and some specific techniques (e.g., search algorithms, cost functions) from the automatic speech recognition (ASR) field [3]. During the last decade, this convergence has been stressed by the application of hidden Markov models (HMMs) to speech synthesis (e.g., see [4], [5] and related works) as opposed to classic concatenative strategies. HMM-based TTS synthesis allows higher flexibility thanks to speech signal parameterization, but it is still not capable of achieving the typical high speech quality obtained by unit-selection concatenative approaches [6].

Manuscript received June 29, 2007; revised November 13, 2007. This work was supported in part by the IntegraTV-4all project under Grant FIT-350301-2004-2 of the Spanish Science and Technology Council. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Steve Renals.

The authors are with the Grup de Recerca en Processament Multimodal, Enginyeria i Arquitectura La Salle, Universitat Ramon Llull, Quatre Camins, 2, 08022 Barcelona, Spain (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TASL.2008.925145

Fig. 1. Different approaches to TTS research towards perfect unconstrained speech synthesis, as a function of the task difficulty and the obtained synthetic speech quality—adapted from [7] and [8]—with the multidomain proposal superimposed.


However, whatever the synthesis approach taken, the final purpose of any text-to-speech system is the generation of perfectly natural synthetic speech from any input text. In this quest, two complementary strategies have been followed historically (see Fig. 1), which constitute a tradeoff between speech naturalness and system flexibility [7], [8]: 1) general-purpose TTS synthesis (GP-TTS), which prioritizes the flexibility of the application at the expense of the achieved synthetic speech quality, and 2) limited-domain TTS (LD-TTS), which restricts the scope of the input text (as done in early ASR systems [8]) so as to obtain high-quality synthetic speech [9]. Driven by the success of limited-domain speech understanding systems [3] and multimedia applications (e.g., [10]), there has been a growing interest in developing commercial systems based on LD-TTS.

According to [6] and [11], next-generation TTS systems are asked to deal with expressivity (emotions, speaking styles, etc.) [12], flexibility (multilinguality, voice transformation, voice impersonation, etc.) [5], spontaneity (whispering, pausing, laughing, etc.) [13], and even singing,1 among others. Thus, these TTS systems should be able to produce the message using the most appropriate prosody, speaking style, etc., an issue that can be faced by extracting more information from the input text.

1For instance, a Synthesis of Singing Challenge was held at the Interspeech 2007 conference.



Fig. 2. Modification of the classic block diagram of a TTS synthesis system by the inclusion of an automatic domain classification module.

For this reason, a new research direction has emerged in the TTS field towards extending text analysis beyond the typical capabilities of TTS systems. Several recent papers in the literature focus on this issue by, for instance, extracting the user attitude from text [14] or guessing the underlying emotion of the message [15]–[17]—see also references therein.

In this context, as an approach to increase the flexibility of TTS systems while trying to maintain a speech quality equivalent to that of LD-TTS, we introduced the idea of developing multidomain TTS (MD-TTS) systems in order to synthesize across different domains with high speech naturalness [18]. This approach can be seen as a step further in the convergence of the different key modules of multidomain spoken language systems (i.e., the ASR and TTS modules), using information regarding the domain of the conversation to improve the quality of the synthetic output. For instance, depending on the particular implementation of the TTS system, knowing the domain of the input text makes it possible to: 1) help in the text normalization process (e.g., if the input text belongs to a mathematical domain, the text “1/2” should be translated into “half” instead of “January the second”); 2) choose the most appropriate prosodic model or consider different prosodic patterns during unit search [19]; 3) select the corresponding subcorpus for corpus-based tiering approaches [10], [20], or guide the unit selection process by weighting the domain units accordingly for blending methods [15], [21]; 4) control the signal processing module depending on the speech characteristics of that domain (e.g., voice quality interpolation [22]); or 5) activate the voice transformation module to resemble the target domain, if necessary. A minimal sketch of point 1) is given below.
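To make point 1) concrete, the following sketch switches a set of text normalization rules by the detected domain. It is purely illustrative: the domain names and rules are hypothetical, not those of the system described in this paper.

    # Sketch of domain-conditioned text normalization. The domains and
    # rules are hypothetical illustrations.
    import re

    NORMALIZATION_RULES = {
        # In a mathematical domain, "1/2" is read as a fraction.
        "math": [(re.compile(r"\b1/2\b"), "half")],
        # In a calendar domain, the same token is read as a date.
        "calendar": [(re.compile(r"\b1/2\b"), "January the second")],
    }

    def normalize(text, domain):
        # Apply the normalization rules of the domain chosen by the classifier.
        for pattern, expansion in NORMALIZATION_RULES.get(domain, []):
            text = pattern.sub(expansion, text)
        return text

    print(normalize("Add 1/2 cup of sugar.", "math"))  # Add half cup of sugar.
    print(normalize("Meet me on 1/2.", "calendar"))    # Meet me on January the second.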

Therefore, MD-TTS systems need to know, at run time, which domain is the most suitable for synthesizing the input text so as to obtain the highest speech quality. Thus, if domain detection is to be conducted in a fully automatic manner from the raw input text, it is necessary to redefine the classic architecture of TTS systems by including a domain classification module (see Fig. 2).

There is a large amount of research on text classification (see [23] and [24] for an extensive review). Traditional text classification (TC) techniques are mainly focused on thematic (or topic) document categorization. In this context, documents are represented by considering only the occurrence of the key terms that constitute the texts (often after filtering function words and stemming), thus ignoring their relationships and text structure [23]. Although topic information is useful for organizing the speech corpus, relying solely on text contents is insufficient for considering the inherent sequential nature of speech (which is related to prosody and coarticulation issues). Thus, proper TC for MD-TTS should consider both the thematic and stylistic aspects of text, as in other TC-related applications such as authorship attribution or genre detection [25]. Equally important, TC for MD-TTS should consider all the terms and punctuation marks appearing in the text, not only because filtering function words would induce the loss of valuable information concerning text structure, but also because the texts input to TTS systems can be very short, e.g., only one sentence.

In this paper, we first describe the related work regarding multidomain spoken language systems and corpora (Section II). Second, the specific implementation of a multidomain TTS system following a tiering corpus-based synthesis technique is described (Section III). Next, a global and a reduced variant of a novel graph-based text representation model designed for conducting text classification within the TTS framework are introduced (Section IV). Then, several experiments regarding text classification in the MD-TTS context are described (Section V). Finally, we discuss several issues related to the proposal, outlining some interesting future research directions (Sections VI and VII).

II. RELATED WORK

This section describes the main issues related to multidomain spoken language systems, emphasizing the role that the automatic speech recognition module plays in this kind of system, and the motivations for exporting the multidomain strategy to the TTS field.

Although other speech-related research fields have adopted multidomain philosophies so as to improve the naturalness and usability of spoken language systems (see Section II-A), research in the TTS field has been largely an exception to this rule. This situation is motivated by two issues. First, early TTS systems were quite capable of facing general-purpose synthesis (with acceptable intelligibility), in contrast to ASR systems, which had to focus on restricted tasks (e.g., single-speaker digit dictation) to achieve reasonable performance [8]. Second, it is worth noting that TTS systems have often played a secondary role in multidomain spoken language systems, which typically make use of general-purpose TTS systems, since they are often only asked to deliver an intelligible message to the user (e.g., see [26]).

A. Multidomain Spoken Language Systems

The development of multidomain applications is one of the recent research directions in spoken language systems (SLS) [27]. Most multidomain SLS (MD-SLS), excluding general-purpose dictation systems, operate over a finite set of domains of interaction, e.g., different destinations in call routing, several topics in translation systems, or different subdomains in complex dialog systems [28]. Knowing the domain of communication (in this context, a domain generally corresponds to a topic) allows improving the performance and efficiency of the constituent modules of SLS [29], for instance, by 1) selecting the most appropriate language model of a speech recognizer (thus reducing its perplexity), 2) adapting the dialog manager strategy to reduce the number of dialogue turns, 3) dynamically loading the required resources according to the current domain of interaction, 4) helping the comprehension module to disambiguate word senses, or 5) controlling out-of-domain queries—see, e.g., [27], [28], and [30] and references therein.



One of the main research areas where multidomain strategies have been most extensively applied is ASR, where automatic domain classification plays a salient role. The following sections describe both issues.

1) Multidomain ASR: Automatic speech recognition systems have evolved from single-speaker, isolated-word, small-vocabulary tasks to speaker-independent, large-vocabulary continuous speech applications [8]. This evolution has affected the two key elements constituting ASR systems: the acoustic and the linguistic models. These models have undergone a transformation from highly controlled approaches (e.g., grammatical rules or task-and-speaker-dependent acoustic models) to more flexible strategies (e.g., multispeaker acoustic models or stochastic language models), thus being able to satisfy more complex and generic needs. One of the most critical problems, which derives from the increase in the number of users and the vocabulary size, is the mismatch between the training and test data characteristics. These misadjustments can be caused by the evolution of the domain of interaction (e.g., changes in the most common words used to interact), speaking style variations (e.g., due to mood changes), etc. (see [31] for more information). This issue can be tackled by adapting both the linguistic [32] and acoustic [33] models to the task and/or user speech particularities—see [31] for a general review of the most commonly applied adaptation techniques.

2) Domain Classification Strategies: Assigning domains to user utterances at run time is a key issue in MD-SLS [27]. In this context, the domain selection process can be user-guided, by explicitly using a predefined set of keywords, or dialog-guided, i.e., implicitly detected from speech recognition hypotheses [27], [28], [30]. The former simplifies the task of assigning domains but reduces the usability of the SLS. The latter allows natural navigation thanks to automatic domain classification, but it must be able to extract (reliable) information from short utterances (e.g., typical queries are between 10 and 20 words long [30]) despite speech recognition errors [28]—a far more complicated task than topic classification of articles or broadcast news, where very large data collections are used [23], [30].

B. Deeper Analysis of Text in TTS Systems

The typical analysis of the input text conducted in TTS systems has usually been restricted to tasks related to natural language processing, such as text normalization, pausing prediction, part-of-speech tagging, etc. However, several recent works have focused on extracting more information from the input text, e.g., trying to determine the attitude [14] or the emotion [15]–[17], [34], [35] from text. In [14], the correlation between the prosodic variations and the inclusion of adjectives (weighted by the accompanying adverbs) expressing positive or negative attitudes (with different levels of intensity) is analyzed and demonstrated. In [15], a Dictionary of Affect is used to detect and score the emotional keywords present in text—following a similar approach to the one described in [34]—adjusting the unit selection cost function to guide the unit search according to the emotional contents of the input text. The experiments are conducted on a unique expressive corpus composed of three different emotions (neutral, happy, and angry). The authors conclude that the subjective emotional perception is proportional to the number of emotional words in the input text. In [16] and [35], a similar approach is presented by defining a dictionary of emotional words (adjectives, nouns, or verbs), but also using part-of-speech tagging and linear classification to determine the emotion from text.2 From another point of view, works such as [36] and [37] make use of more complex knowledge like semantic and common-sense networks, respectively—more information can be found elsewhere.3

In [18], we introduced the seminal idea of synthesizing different domains within the same TTS system by introducing an automatic domain classification module in the classic TTS architecture, thus covering the niche between GP-TTS and LD-TTS. On the one hand, the TTS task difficulty is increased due to the management of multiple domains, but, on the other hand, the achieved speech quality aims to be equivalent to that of LD-TTS when the input text is assigned to the correct domain, thanks to the direct correspondence between style and domain. Hence, this approach can represent an advancement towards perfect unconstrained speech synthesis (see Fig. 1). The proposal was theoretically analyzed (no speech corpus was recorded), obtaining remarkable computational savings (again observed when developing the weather forecast application described in [10]) besides keeping good theoretical speech quality.4 Subsequently, in [39] we presented a hierarchical text classifier based on independent component analysis (ICA), which was capable of 1) organizing the contents of the corpus in a hierarchical manner (obtaining a structure similar to the one depicted in Fig. 3), and 2) classifying the texts to be synthesized according to the learned structure. Both works were developed on collected Catalan and Spanish news articles.

III. MULTIDOMAIN CORPUS-BASED TTS

Although corpus-based TTS systems are able to generate high-quality synthetic speech, it is commonplace that a dramatic decrease in speech quality occurs when the input text mismatches the corpus domain coverage, both for general-purpose [38], [40] and, more obviously, for limited-domain TTS synthesis [10]. This is due to the fact that the quality of the speech generated by corpus-based TTS systems is highly dependent on the style and coverage of the recorded speech corpus [41].

Several proposals have appeared in the literature trying to alleviate the consequences of domain mismatch. The most representative examples of these proposals are: 1) the adaptation of a GP-TTS system to the target domain by including small speech corpora of that domain [38], [40], 2) the design of speech corpora based on the emotional contents of the texts composing the corpus [42], and 3) the definition of a multidomain TTS (MD-TTS) approach to deal with the desired domains altogether [18]. All these proposals are based on the fact that knowing the most appropriate domain for synthesizing the input text allows a much more proper delivery [43]—provided that that domain is properly synthesized from the speech corpus.

2In this context, research has mainly focused on storytelling texts (e.g., fairy tales) due to their emotional content [16], [17].

3See, for instance, www.clairvoyancecorp.com/Research/Workshops/AAAI-EAAT-2004/home.html

4Computed as the average segment length [38] of the sequence of units retrieved from the speech corpus (although it is obvious that the largest set of units does not always attain the best synthetic results, it was the only measure we could use without having a speech corpus available).


Fig. 3. Block diagram of a multidomain corpus-based text-to-speech synthesis system, including automatic domain assignment by text classification and a tiering speech corpus.


Although particularized to a tiering corpus-based approach in this paper (see Section III-B), the MD-TTS philosophy can be adapted to any other corpus typology, or even to synthesis strategies other than corpus-based ones (e.g., HMM-based or hybrid solutions [6]). Therefore, the MD-TTS architecture allows a flexible and adaptable TTS system design and implementation that can be tuned according to the application needs or domain characteristics. In any case, the architecture of the TTS system must be modified by the inclusion of a domain classification module, which will interact with the remaining elements of the TTS system (see Fig. 2). In the following paragraphs, a discussion on domain classification strategies and a description of a specific implementation of a corpus-based MD-TTS system following a tiering approach are presented.

A. Domain Classification: Concepts and Strategies

The selection of the most suitable synthesis domain is often closely related to the determination of the most appropriate synthesis speaking style for a given input text, which usually depends on paralinguistic (speakers' relationship, mood, message intention, etc.) and extralinguistic (speaker's age, sex, personality, etc.) information [44]. For instance, the sentence “There is a lot of food in the fridge” can be spoken joyfully or with heavy sarcasm, depending on the context of communication. Although the study of these issues lies beyond the scope of this work, there are cases in which the style of delivery or even the speaker's gender can be inferred from the meaning and/or the structure of the input text (e.g., natural synthesis of positive or negative messages requires using appropriate prosodic patterns [14], [21], or some sentences are expected to be spoken by a boy or a girl [41]). In other cases, certain speaking styles can be readily discarded for synthesizing some types of sentences [45] (e.g., command utterances do not convey sadness or fear [20], and complex sentences are not usually spoken by children [41]).

As regards the implementation of the domain classification strategy, it can be defined as a task external to the TTS system itself, or it can be included in the TTS architecture as an automatic module. In the former case, domain selection can be either a manual process (the simplest method, allowing the user to change between domains by hand [8], as in some dialogue systems or multimodal applications) or a supervised one (usually related to systems where the message domain is known beforehand, by tagging the input text accordingly [10], [20] or by means of concept-to-speech synthesis [8]).

In this paper, domain assignment is regarded as an automatic process that relies solely on the input text. The most appropriate synthesis domain is inferred from the text thanks to the direct relationship between domain and speaking style, obtaining the highest possible synthetic speech quality [43] and/or reducing the computational cost of the unit selection process [18]. For this reason, it is necessary to go beyond the typical text analysis of TTS systems. In the current version of our approach, this module is implemented by an automatic text classification technique based on a vector space model representation of texts, which includes information about the frequency and collocation of words plus the structure of the text [18] (see Section IV).

B. Tiering Corpus-Based MD-TTS Architecture

As aforementioned, the MD-TTS approach requires including a domain classification module, which will interact with the modules found in the classic corpus-based TTS system architecture, such as the NLP, unit-selection, and digital signal processing modules. Moreover, corpus-based MD-TTS synthesis requires using a multidomain speech corpus, which can be implemented following distinct corpus typologies [41], [43]: 1) tiering, that is, defining an independent subcorpus for each domain [20], [42], or 2) blending, which consists in mixing different corpus subsets into a unique corpus, generally including a large general-purpose core [21], [38], [40]. Fig. 3 depicts a corpus-based MD-TTS system based on a tiering approach, which is employed throughout the experiments presented in Section V.

It is worth noting that the creation of multidomain speech corpora has also been handled by the HMM-based TTS research community, where the tiering and blending strategies are called style-dependent modeling and style-mixed modeling, respectively (see [5] and related works)—style corresponds to a particular kind of domain in the MD-TTS context, where the term domain can stand for emotion, speaking style, topic, etc. Moreover, notice the similarity between these concepts and the MD-ASR approaches described in Section II-A1.


IV. TEXT CLASSIFICATION FOR MD-TTS SYNTHESIS

This section describes the proposal for implementing the domain classification module in the MD-TTS framework, whose objectives are to 1) learn a multidomain model from the training texts that constitute each domain of the multidomain speech corpus, and 2) select the most appropriate domain(s) for synthesizing the input text. The starting requirements for designing this module—besides seeking the highest classification performance—are the following:

• adapt the text classification system to the particularities of text-to-speech synthesis, taking especially into account the importance of classifying texts as short as one sentence;

• minimize the classification algorithm complexity so as to avoid overloading the text-to-speech conversion process.

Most text classification strategies are mainly thematically oriented, thus treating texts as collections of isolated words, ignoring their order, relationships, and text structure (the so-called bag-of-words model) [23]. In this context, function words (e.g., prepositions and articles) and punctuation marks are commonly filtered out (stop listing), and words are usually reduced to their lemmas (stemming). However, stylistic (nonthematic) text classification tasks, such as authorship attribution (i.e., the identification of the author of a text) or genre detection (e.g., literary, scientific, etc.), do consider function word distributions, part-of-speech tags, word and sentence lengths, and vocabulary richness, among other parameters [24], [25].

Therefore, the text classification task involved in MD-TTS lies between thematic and stylistic classification. Whereas text content is useful to organize the text in the (tiering) multidomain corpus, it seems insufficient to determine the best way to pronounce a given text, making it necessary to include some information about the structure and sequentiality of the text (see Sections IV-B and IV-C1).

A. Associative Relational Networks

To deal with the aforementioned stylistic aspects of text, it is essential to use a text representation technique capable of codifying them. To this end, the developed TC system represents texts by means of an associative relational network (ARN), a graph-based model for representing information which was initially introduced in the context of the visual representation of documents [46]. However, in our approach, the nodes of the graph represent the terms of the text (i.e., words and punctuation marks) and their connections describe the co-occurrences between them (see Fig. 4). Each node contains a weight w_i defining the relevance of its corresponding term t_i, and each connection is weighted by the relationship strength c_ij between the linked terms t_i and t_j, also taking their order into account (i.e., not necessarily c_ij = c_ji, as opposed to the visual representation of documents [46]).

As a result, the ARN encodes the structure and the sequentiality of the text (modeled as a run of pairwise co-occurrences of terms), which are essential for classifying texts in the MD-TTS framework.

Fig. 4. Word-based associative relational network, inspired by the visual representation of documents of the Galaxy of News described in [46].
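As an informal illustration of the ARN just described (a simplified sketch, not the authors' implementation), the directed word-based graph of Fig. 4 can be held in two counters, one for term weights and one for ordered co-occurrences; the tokenizer below is deliberately naive.

    # Sketch of an associative relational network (ARN): nodes are terms
    # (words and punctuation marks); directed edges count ordered
    # co-occurrences of consecutive terms.
    import re
    from collections import Counter

    def tokenize(text):
        # Keep words and punctuation marks as separate terms.
        return re.findall(r"\w+|[^\w\s]", text.lower())

    def build_arn(text):
        terms = tokenize(text)
        node_weights = Counter(terms)                  # term relevance (raw TF here)
        edge_weights = Counter(zip(terms, terms[1:]))  # ordered co-occurrences (COF)
        return node_weights, edge_weights

    nodes, edges = build_arn("The fridge is full. The fridge is new.")
    print(edges[("the", "fridge")])  # 2: "the" is followed by "fridge" twice
    print(edges[("fridge", "the")])  # 0: the direction matters (c_ij != c_ji)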

B. Weighting the Network

Once the ARN architecture is defined, it is necessary to assign specific values to the network weights. In particular, the nodes will basically contain information about the text contents, whereas the internodal connections will be used to represent and extract structural text patterns. For the time being, the thematic features weighting the relevance of each term employed in this work are: 1) term frequency (TF) (i.e., the number of times a term occurs in a document) times inverse document frequency (i.e., the singularity of that term across the collection), denoted as TFIDF [47], and 2) a newly proposed weight called inverse term frequency (ITF), which is defined as

ITF_i = log( |d_j| / TF_ij )    (1)

where |·| denotes the cardinality of a set hereafter—in this case, |d_j| represents the number of terms of document d_j—and TF_ij is the term frequency of the ith term in that document. ITF can be interpreted as a local approximation of IDF [23], since it weighs each term according to its prominence within each text (or document), instead of considering its distribution across the whole training text collection.
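The sketch below computes both thematic weights; the logarithmic form of ITF follows the reconstruction of (1) above and should be read as an assumption rather than the paper's exact formula.

    # Thematic term weights: TFIDF and ITF (assumed log form, see (1)).
    import math
    from collections import Counter

    def tfidf(docs):
        # docs: list of token lists; returns one {term: weight} dict per document.
        n_docs = len(docs)
        df = Counter(t for doc in docs for t in set(doc))  # document frequency
        weights = []
        for doc in docs:
            tf = Counter(doc)
            weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
        return weights

    def itf(doc):
        # ITF weighs each term by its prominence within its own document only.
        tf = Counter(doc)
        return {t: math.log(len(doc) / tf[t]) for t in tf}  # log(|d_j| / TF_ij)

    docs = ["the fridge is full .".split(), "a fast new laptop .".split()]
    print(tfidf(docs)[0])
    print(itf(docs[0]))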

By its own definition, the ARN allows considering structural resemblance between texts when conducting classification. In this paper, the co-occurrence frequency (COF) of each consecutive pair of terms (i.e., the number of times two terms appear contiguously in the text) and the number of consecutive term pairs (see Section IV-C3) are considered as stylistic (structural) features—see Section VI for possible extensions.

C. Using the ARN for Conducting Text Classification in Corpus-Based TTS Synthesis

The following paragraphs describe, first, how the information included in the text is represented and parameterized, and second, the training and testing processes for using the associative relational network for classification purposes in the corpus-based TTS synthesis framework.


Fig. 5. Building (a) the global ARN and (b) the domain ARN-Fs as a function (f) of the global ARN representation, built from the corresponding training documents of each domain. In the graphs, filled nodes denote terms present in the domain, empty nodes represent absent terms, and dashed lines symbolize nonexistent co-occurrences. Each domain is represented within the VSM defined by the global ARN (see the example of Table I).

Notice that, as can be observed in Section V, the ARN model can be used for classifying either texts as short as one sentence or large paragraphs (e.g., 30 sentences long).

1) Turning the ARN Model into a Vector-Based Classifier: In order to make use of the information embedded in the ARN, it is necessary to define a suitable model for conducting the classification task on a set of categories. There are several possibilities for exploiting this information; however, up to now, the TC system included in the corpus-based MD-TTS architecture represents the ARN contents on a vector space model (VSM) [47], following the “Choose the best to modify the least” corpus-based philosophy [48]. According to the general definition of the VSM, each document d_j of the collection is represented as a vector of weights within the vector space built from the term set T [47]—see (2):

d_j = (w_{1j}, w_{2j}, ..., w_{|T|j})    (2)

where |T| is the total number of terms contained in the text collection, and w_{ij} represents the weighting of term t_i in document d_j.

Thus, each term defines a dimension of the multidimensional vector space model, where the documents of the training collection are represented as vectors. Thanks to this vector representation, algebraic operators (e.g., vector distances) become applicable for conducting text classification, as described afterwards. Moreover, notice that in the current approach, the dimensions of the VSM correspond to the thematic and stylistic features extracted from the text. Thus, the documents will be represented according to (3), defined as a generalization of (2), since it includes the co-occurrence weights of the term collection that composes the ARN (see Fig. 4):

d_j = (w_{1j}, ..., w_{|T|j}, c_{11j}, c_{12j}, ..., c_{|T||T|j})    (3)

where c_{ikj} represents the weighting of ordered co-occurrences between terms t_i and t_k in document d_j.

TABLE I
SYMBOLIC EXAMPLE OF DOMAIN PATTERN VECTORS p_k (ARN-F_k, k = 1, ..., N_D) AND A TEXT TO BE CLASSIFIED x REPRESENTED ACCORDING TO THE GLOBAL ARN, GIVEN THREE DIFFERENT DOMAINS D_1, D_2, AND D_3. THE SYMBOLS DENOTE THE TERM AND CO-OCCURRENCE WEIGHTS OF THE MODELED TEXTS

The |T|-dimensional vector space defined in (2) thus becomes (|T| + |T|²)-dimensional, since it integrates all the terms together with their co-occurrences (i.e., each term can appear contiguously in the text with any of the other terms and with itself).
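Under the assumptions of the earlier sketches, flattening an ARN into the vector of (3) amounts to concatenating the |T| term weights with the |T|² ordered co-occurrence weights; a dense layout is used purely for clarity.

    # Flattening an ARN into the (|T| + |T|^2)-dimensional vector of (3).
    # A real implementation would keep these vectors sparse.
    def arn_to_vector(node_weights, edge_weights, global_terms):
        term_part = [float(node_weights.get(t, 0)) for t in global_terms]
        cooc_part = [float(edge_weights.get((ti, tj), 0))
                     for ti in global_terms for tj in global_terms]
        return term_part + cooc_part

    T = ["the", "fridge", "is", "full", "."]
    nodes = {"the": 2, "fridge": 2, "is": 2, "full": 1, ".": 2}
    edges = {("the", "fridge"): 2, ("fridge", "is"): 2, ("is", "full"): 1}
    print(len(arn_to_vector(nodes, edges, T)))  # 5 + 5*5 = 30 dimensions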

2) Training the ARN-Based Text Classifier: The training process consists of building an ARN for each of the domains contained in the corpus, which are composed of the corresponding subsets of training documents. In order to obtain a consistent representation of the data across all the domains, a global ARN is first built from all the training texts [see Fig. 5(a)]. Next, this global ARN is used as a reference for building each domain's ARN, obtaining what we have called the Full ARN of the kth domain (denoted as ARN-F_k), as its components follow the order indicated by the global ARN [see Fig. 5(b)]. The training stage finishes after deriving a vectorial representation of each ARN-F_k, yielding the domain pattern vectors.5

3) Classifying the MD-TTS Input Texts: Given a text x input to the TTS system, it is first represented according to the global ARN model derived in the training stage [see (3)], obtaining its corresponding vector. Next, this vector can be compared to each of the domain pattern vectors through a similarity measure. Finally, the input text is assigned to the domain(s) attaining the highest similarity. The comparison can be done by simply computing a cosine similarity between the vectors [47].

5Each pattern vector represents the information contained in the corresponding domain according to its ARN-F.


Nevertheless, the cosine classification similarity can be enriched by including a multiplicative factor that takes into account the global structural resemblance between the compared texts [see (4) and (5)], thus exploiting higher-order similarity features extracted from the ARN model beyond the first-order adjacency in the text (i.e., co-occurrence frequency):

sim_PL(x, p_k) = cos(x, p_k) · (1 + PL(x, p_k))    (4)

sim_cPL(x, p_k) = cos(x, p_k) · (1 + cPL(x, p_k))    (5)

The pattern length (PL) is defined as the length of the longest sequence of identical consecutive terms appearing in the same order in the input text and each domain D_k, after representing them on the global common space (vector x and pattern vector p_k, respectively) [18]:

PL(x, p_k) = max_j(λ_j) / |x|    (6)

λ_j = max{ l : c^x_{s_i s_{i+1}} > 0 and c^{p_k}_{s_i s_{i+1}} > 0, i = j, ..., j + l − 1 }    (7)

where |x| is the number of terms of text x, λ_j is the index computing the number of coincident consecutive co-occurrences between the text to be classified and the considered domain starting at its jth term, c^x_{s_i s_{i+1}} is the co-occurrence frequency between terms s_i and s_{i+1} of the input text x, and, finally, c^{p_k}_{s_i s_{i+1}} is its corresponding value within the pattern vector, which must exist.6 As a result, the PL only computes the co-occurrences appearing in both the input text and the considered domain (c^x_{s_i s_{i+1}} > 0 and c^{p_k}_{s_i s_{i+1}} > 0), as indicated in (7).

Furthermore, if the pattern length is computed as the sum of the consecutive co-occurrence matches between the compared vectors [see (7)], we obtain the cumulative PL (cPL) following (8):

cPL(x, p_k) = (Σ_j λ_j) / |x|    (8)

As can be observed from (6) and (8), both structural similarity parameters are normalized by the total number of terms |x|; thus, 0 ≤ PL, cPL ≤ 1, so as to avoid fictitiously biasing their value due to the input text length. Please note that in the special case when the input text contains a single term (|x| = 1), both PL and cPL will be assigned a zero value—see Appendix B for a toy example showing the PL and cPL computation.
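A self-contained sketch of this classification loop follows, with sparse dictionary features and toy domains; the multiplicative (1 + cPL) weighting mirrors the reconstruction of (5) above and is an assumption, not the authors' code.

    # End-to-end sketch of cosine similarity weighted by the cumulative
    # pattern length, cf. (4)-(8).
    import math
    import re
    from collections import Counter

    def features(text):
        toks = re.findall(r"\w+|[^\w\s]", text.lower())
        feats = Counter(toks)                        # term weights (raw TF)
        feats.update(Counter(zip(toks, toks[1:])))   # ordered co-occurrences
        return toks, feats

    def cosine(u, v):
        dot = sum(w * v.get(k, 0) for k, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def cpl(toks, dom_feats):
        # Cumulative pattern length, cf. (8): zero for single-term texts.
        if len(toks) < 2:
            return 0.0
        hits = sum(1 for pair in zip(toks, toks[1:]) if dom_feats.get(pair, 0) > 0)
        return hits / len(toks)

    patterns = {dom: features(text)[1] for dom, text in {
        "cosmetics": "a sweet perfume for soft skin .",
        "technology": "a fast new laptop with a bright screen .",
    }.items()}

    toks, x = features("the perfume is sweet .")
    scores = {d: cosine(x, p) * (1 + cpl(toks, p)) for d, p in patterns.items()}
    print(max(scores, key=scores.get))  # cosmetics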

4) Reduced ARN Model: Due to the extremely high dimensionality of the global representation space,7 the classification of each input text is a computationally demanding task, since it requires going through the whole global ARN before conducting classification.

6The index sequence s = {s_1, ..., s_{|x|}} contains the positions of the terms of text x referenced to their position on the global ARN.

7In this approach, the training texts are fully represented, without stop listing or stemming, and the co-occurrences are also included, obtaining very large vectors.

Moreover, the vector representation of the input text will typically be very sparse, which results in a reduction of the separability properties of the pattern vectors, yielding poorer text classification efficiency [49]. In order to improve domain separability and minimize the computational cost of the classification task, a second ARN-based strategy called Reduced ARN (ARN-R) is introduced.

The main idea of the ARN-R model is the substitution of the full comparison space (built from the global ARN) by the VSM derived from the ARN generated from the input text x. Hence, during the classification stage, each domain is represented according to the ARN-R before conducting the comparison, in order to obtain a common representation space. That is, the domain ARN building process depicted in Fig. 5 is now conducted by substituting the global ARN with the ARN generated from the input text x. In this sense, the computational complexity of representing x on the global ARN space is replaced by the cost of representing each domain in the ARN-R space, which in general will be much lower.

Obviously, the ARN-R is just an approximation of the complete training data representation provided by the ARN-F, as the ARN-R misses most of the information stored in the full space generated from the training documents. Anyhow, it can be algebraically proved that the ARN-R is close to the best possible approximation of the ARN-F on the input text space in the least-mean-square sense, as described in Appendix A.

V. EXPERIMENTS

The experiments have been conducted on a 2.5-h Spanish multidomain speech corpus recorded by a female professional speaker. The speech corpus is composed of 2590 sentences extracted from an advertising database, which are grouped into three different domains following a tiering approach: education-training (916 sentences), technology (833 sentences), and cosmetics (841 sentences). Each domain was recorded using a predefined speaking style according to [50]: happy-elation (HAP), neutral-mature (NEU), and sensual-sweet8 (SEN), respectively. Thanks to the correspondence defined in [50] between speaking styles and domain contents, the automatic text classification module is able to select the most appropriate speaking style from text.

The analyzed text classification algorithms are trained on 80% of the corpus sentences and tested on the remaining ones, following a tenfold random subsampling strategy to obtain statistically reliable results. In order to evaluate the performance of the TC algorithms in terms of classification efficiency (computed by the classic F1 measure, the harmonic mean of precision and recall [23]), the labeled sentences have been randomly grouped into pseudo-documents (hereafter, documents). This is done to evaluate the performance of the proposed text classifiers when the number of sentences per document decreases, moving from a standard TC task (with many sentences per document) to a typical TTS scenario, with only one sentence per document.

8It is a warm, soft, and pleasant speaking style with a somewhat whispery nature.


To that effect, a sweep ranging from 1 to 30 sentences per document is conducted.
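The pseudo-document protocol can be sketched as follows; this is illustrative scaffolding rather than the authors' evaluation code.

    # Group labeled sentences into pseudo-documents of n sentences each,
    # sweeping n from 1 (the TTS scenario) to 30 (a standard TC task).
    import random

    def make_pseudo_documents(sentences, n):
        # sentences: list of (text, label) pairs.
        by_label = {}
        for text, label in sentences:
            by_label.setdefault(label, []).append(text)
        docs = []
        for label, texts in by_label.items():
            random.shuffle(texts)
            for i in range(0, len(texts) - n + 1, n):
                docs.append((" ".join(texts[i:i + n]), label))
        return docs

    # Sweep: for n in range(1, 31), classify make_pseudo_documents(test_set, n)
    # and report F1 (the harmonic mean of precision and recall) for each n.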

Moreover, in its current implementation, the unit selection module is adjusted to extract the longest sequence of consecutive units from the domain indicated by the text classifier, using a simple cost function [10]. The target prosody (pitch, duration, and energy) is predicted by the natural language processing module following the approach described in [12].

A. Baseline Method Selection

The goal of this first experiment is to select a baseline TC algorithm as a reference to validate the performance of the ARN-based TC proposals. In the context of thematic TC, support vector machines (SVMs) are regarded as the best performing classifiers [23]. However, as the TC in the implemented tiering corpus-based MD-TTS system is only trained with the texts corresponding to the recorded speech, SVM becomes an unsuitable option to implement the domain classification module. This is due to the imbalance between the high dimensionality of the feature space where MD-TTS text classification is conducted (|T| + |T|² dimensions) and the comparatively much smaller size of the training document collections in this context. When there are not enough examples to represent the training space accurately, SVMs decrease their performance dramatically, even becoming unable to operate [51], which is what happened in the informal experiment conducted using the SVM software of [52]. For instance, if linear kernels are to be used, the number of training examples should be commensurate with the dimensionality of the feature space [53], whereas in our experiments the dimensionality far exceeds the number of available training examples.

As a consequence of the aforementioned argument, we looked for TC strategies other than SVM, covering different approaches, as baselines for solving this classification problem. First, a basic nearest-neighbor (NN) classifier using TFIDF-weighted terms as features is analyzed [23]. This technique is based on representing each document as a vector in a VSM built from the training set. At classification time, each test document is assigned to the category of the most similar training document, according to a cosine distance. Second, a probabilistic TC algorithm based on bigrams is also analyzed. The first idea was to represent each domain by its own probabilistic language model obtained from the word-pair distribution across the documents of that domain. However, it was necessary to substitute words with characters, as in [54], due to the low statistical robustness of word-based probabilistic language models caused by the small size of the training collection (a problem equivalent to the one affecting SVM9). The input text is assigned to the domain attaining the highest membership probability. Finally, an ICA-based TC is applied to the problem. This technique, which considers topics as latent random variables, makes use of term extraction for better thematic identification (a latent semantic space is built from the independent components that constitute the basis of the information represented in the text). In previous works, the ICA-based TC has been successfully applied for semi-supervised text classification and hierarchization of document corpora, by identifying the correspondence between the independent components of text and the domains [39].

9Both issues may be tackled by considering larger text collections than the recorded ones, but this is left for future work.

Fig. 6. Classification efficiency of the analyzed baseline methods across the sentences-per-document sweep.

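For illustration, a minimal sketch of such an NN baseline follows; the smoothed IDF is an assumption added so that shared terms keep nonzero weight on this tiny example.

    # NN baseline sketch: TFIDF vectors, cosine distance, 1-nearest-neighbor.
    import math
    from collections import Counter

    def tfidf_vectors(docs):
        n = len(docs)
        df = Counter(t for d in docs for t in set(d.split()))
        vecs = []
        for d in docs:
            tf = Counter(d.split())
            # log(1 + n/df): smoothed IDF (an assumption for this sketch).
            vecs.append({t: tf[t] * math.log(1 + n / df[t]) for t in tf})
        return vecs

    def cos(u, v):
        dot = sum(w * v.get(k, 0.0) for k, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def nn_classify(test_doc, train_docs, train_labels):
        vecs = tfidf_vectors(train_docs + [test_doc])
        test_vec, train_vecs = vecs[-1], vecs[:-1]
        best = max(range(len(train_vecs)), key=lambda i: cos(test_vec, train_vecs[i]))
        return train_labels[best]

    print(nn_classify("soft perfume",
                      ["soft skin and perfume", "a fast laptop"],
                      ["cosmetics", "technology"]))  # cosmetics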

Fig. 6 depicts the performance of the analyzed baseline methods, in terms of average F1, across the conducted sweep of sentences per document. It can be observed that the NN method shows the best global behavior, followed by the probabilistic TC, whereas the ICA-based TC suffers a rapid worsening due to the fact that it is a predominantly thematic approach (the smaller the size of the documents, the more difficult the extraction of latent topics becomes). Hence, the NN classifier was selected as the baseline state-of-the-art method for validating the ARN-based proposals.

B. Objective Performance of the ARN-Based Proposals

The following paragraphs analyze the performance of the proposed ARN-based text classifiers, denoted as ARN-F and ARN-R, respectively. To that effect, four different text parameterizations are considered. We compare the influence of using TFIDF versus ITF as thematic features, besides considering structural information by including co-occurrence frequencies (COF) or not (NCOF) in the vectors built from the graph-based ARN. Moreover, we also analyze the impact of using similarity measures which incorporate structural information by means of the pattern length (PL) and its cumulative version (cPL) [see (6) to (8)].

1) Text Parameterization: Fig. 7 presents the classification efficiency results of the compared TC techniques across the predefined sweep, using the cosine distance as the similarity measure. It can be observed that the ARN-based methods achieve better results than the baseline NN classifier for any text parameterization when classifying one-sentence-long documents, which is the most demanding classification scenario. Notice that this is also the case along most of the sweep. However, both global representation methods (ARN-F and NN) are negatively affected by the inclusion of COF—due to the dramatic increase of the feature space dimensionality—achieving their optimal performance for the TFIDF NCOF parameterization. In contrast, ARN-R even experiences a slight performance improvement when COF is considered.


Fig. 7. Classification efficiency of the ARN-based and NN TC methods across the sentences-per-document sweep for different text parameterizations. (a) ARN-F versus NN. (b) ARN-F NCOF versus ARN-R.

Moreover, ARN-R achieves its optimal performance when ITF is selected as the thematic feature—with very similar results between the COF and NCOF parameterizations. As a conclusion, it can be stated that ARN-R, despite being an approximation of ARN-F, behaves more robustly with respect to the parameterization employed, besides achieving equal or slightly better classification results at every step of the sweep (in particular, ARN-R is the best classifier in the hardest categorization scenario, i.e., with one sentence per document, where it attains the highest average F1). Finally, notice that all TC approaches suffer from the decrease of sentences per document, which highlights the importance of finding a TC tuned to satisfactorily solve the domain classification task within the TTS synthesis framework.

2) Similarity Measures: Fig. 8 presents a global comparison regarding the use of stylistically weighted similarity measures for both ARN-based text classifiers. The ARN-F-based text classifier experiences a notable improvement when the cosine distance is enriched with the PL and cPL weightings, attaining average relative improvements of 14.2% and 19% on F1, respectively. On the contrary, ARN-R is nearly unaffected by the inclusion of these factors in the similarity measure. As a conclusion, the structural weighting of the cosine distance affects the ARN-F-based TC positively, since it makes up for the dramatic increase of vector length (each domain is represented by means of a large single vector), whereas this effect is less clear for the ARN-R classifier, since the managed vectors are of lower dimensionality.

C. Subjective Results of the MD-TTS System

As the final goal of introducing a text classifier into the MD-TTS system architecture is to achieve high-quality synthetic speech besides improving system flexibility, several listening preference tests were conducted in order to subjectively validate its naturalness. These experiments are intended to analyze the influence of correct and wrong domain classification decisions on the synthetic speech quality obtained by the MD-TTS approach when classifying texts as short as one sentence.

Fig. 8. Averaged classification efficiency of the TC methods across the sentences-per-document sweep of Fig. 7 for different similarity measures.

The correctness of the text classifier decisions is evaluated by taking into account the manual labels assigned to each document (the so-called ground truth). Since the implemented MD-TTS follows a tiering speech corpus typology, the TTS system selects both the target prosody pattern and the subcorpus according to the automatically assigned domain.

The evaluators (24 members of our University) were asked to select, by means of a web interface, the most appropriate/natural version (according to the sentence meaning) between two randomly ordered synthetic results obtained from the same input sentence. The evaluators were able to 1) listen to the generated files as many times as needed before taking a decision, and 2) select an indistinct option when they could not decide between the compared synthetic versions (equally good or bad). For this experiment, the ARN-R-based text classifier was used, since it attained the best performance among the analyzed TC approaches at the one-sentence level.

1) Subjective Evaluation of Correct Domain Classifications: The first preference test analyzes the achieved results when the correct domain (according to the ground truth) is chosen. As an MD-TTS system that makes correct decisions is essentially an LD-TTS system in terms of speech quality, comparing its results to the ones obtained from the neutral domain is somehow equivalent to comparing LD-TTS to GP-TTS.10


Fig. 9. Distribution of user preferences between synthetic results from correct-domain and wrong-domain classifications according to the manual labeling, including the indistinct votes. (a) Comparing the correct classifications of the happy and sensual domains versus neutral-domain syntheses, and misclassifications (yielding wrong syntheses) versus manually labeled domain syntheses (rightmost bar plot), indicating the 95% confidence levels. (b) Detail of the 12 happy-sentence votes. (c) Detail of the 15 sensual-sentence votes.

Both the correct-domain and the neutral-domain syntheses make use of the prosodic pattern corresponding to their speaking style. The test was conducted on 27 correctly classified sentences (12 happy and 15 sensual), which were selected by applying a simple greedy algorithm tuned to obtain phonetically balanced sentences.

10It is worth noting that the technology domain is not large enough to be properly considered a reliable general-purpose speech corpus; however, it is used as a reference for what could be achieved by general-purpose synthesis.

TABLE II. LIST OF WRONGLY CLASSIFIED SENTENCES EXTRACTED FROM THE SPEECH CORPUS USED TO VALIDATE THE SYNTHETIC SPEECH QUALITY OF THE TIERING MD-TTS PROPOSAL

correctly classified sentences (12 happy and 15 sensual), which were selected by applying a simple greedy algorithm tuned to obtain phonetically balanced sentences.

The results indicate a significant preference for the correctly classified domain outcomes over the reference neutral syntheses, for both the happy and sensual domains [see the two leftmost bar plots in Fig. 9(a)]. Moreover, the test on the sensual domain reveals a higher preference for the correct domain syntheses than the happy test does (75.8% and 71.5%, respectively), besides reducing the preferences for the neutral synthesis (17.2% and 24.3%, respectively). As can be observed from Fig. 9(b) and (c), this is due to the fact that some sentences from the happy domain are preferred when synthesized in a neutral style, e.g., sentences 4 (“Libro de la competición 93”) and 7 (“En teoría, una escuela de negocios”)—“93 competition's book” and “In theory, a business school,” in English.

2) Subjective Evaluation of Wrong Domain Classifications: The second test evaluates the perceptual impact of wrong automatic text classifications with respect to the ground truth. Hence, this experiment is equivalent to comparing worst-case MD-TTS to LD-TTS synthesis, besides validating whether the dependence between the style of delivery and the assigned domain is relevant in these wrong classification cases. To that effect, nine sentences misclassified by the automatic text classification module (listed in Table II) were presented to the evaluators.

As can be observed from the rightmost bar plot in Fig. 9(a), there is also a significant preference for the manually labeled domain results (65.8%) versus the syntheses coming from the incorrectly classified ones (29.1%), labeled as correct and wrong in the figure, respectively. However, the preference gap (36.8%) is significantly lower than in the previous tests (47.2% and 58.6% for the happy and sensual domains), showing a stronger tendency to select the syntheses obtained from the wrongly


classified domain. Moreover, according to the users' feedback, this experiment involved the most difficult choices (e.g., a larger number of turns were needed before deciding).

Hence, the smaller preference gap, together with the much less clear preference pattern across evaluators (already observed in [55]11), correlates with the automatic domain misclassifications, which mostly occur in sentences with no clear domain membership (e.g., “Soluciones a medida”—“Tailor-made solutions,” in English). However, there is still room for improvement to avoid misclassifications of sentences like “Pero no se pueden sustraer al perfume” (the last sentence in Table II), which contains the word perfume, indicating its membership to the cosmetics domain, or sentences 4 and 7 of the happy domain [see Fig. 9(b)]. Therefore, it seems necessary to include some deeper semantic analysis in future approaches.

VI. DISCUSSION

In this paper, the MD-TTS system has been implemented following a tiering speech corpus typology. As is well known, the tiering approach is very costly, since each new speaking style the system is asked to synthesize requires the design and recording of its corresponding speech corpus. Nevertheless, the concept behind MD-TTS synthesis can be exported to other speech synthesis strategies and corpus typologies, since our final goal is improving the synthesis flexibility besides obtaining high synthetic speech quality. To that effect, the MD-TTS philosophy allows selecting the most appropriate synthesis configuration (technique, corpus typology, signal processing, etc.) for a particular speaking style. The following paragraphs discuss the portability of the proposal and the quality of the obtained synthetic results.

One of the key elements of the MD-TTS approach is the flexibility of the introduced system architecture. First, notice that this architecture allows conducting: 1) general-purpose TTS synthesis (with a single generic speech corpus); 2) limited-domain TTS synthesis (with a single restricted domain corpus); and 3) multidomain TTS synthesis with different domains (with a flat or hierarchical corpus structure, depending on the domains' contents and their acoustic characteristics). Moreover, these domains can be explicitly incorporated as independent subcorpora (e.g., [41], [42]), as small appendices completing a general-purpose corpus (e.g., [15], [21], [38], [40]), or even as a result of dividing the corpus (with the same acoustic characteristics) into different subdomains (e.g., journalistic texts: politics, society, culture [39]). Second, the introduced architecture may be implemented by means of different synthesis strategies, such as: 1) corpus-based techniques (e.g., the tiering corpus depicted in Fig. 3 can be extended as desired, provided that each subdomain is large enough to conduct unit selection); 2) HMM-based synthesis (e.g., with domain-dependent statistical models; see [5], [45] and related works); or 3) hybrid solutions (e.g., see [6]). Finally, we would like to note that the architecture makes it possible to use speech corpus typologies other than tiering (e.g.,

11We detected a slight (but not statistically significant) tendency to select neutral syntheses when they came from wrong classifications (four sentences of Table II). Hence, in future experiments we want to compare general-purpose TTS versus the MD-TTS approach to analyze the ill effects caused by the misclassifications more exhaustively.

blending or mixed approaches), thus allowing the use of all the speech units during the unit selection process if necessary [15], [19], [21].

Following the same idea of flexibility, the directed word-based ARN model allows conducting: 1) thematic classification, by only considering key words (after stop listing and stemming)—hence turning the graph-based architecture into a bag-of-words approach; 2) stylistic classification, like authorship attribution or genre detection, by including the appropriate features in the model; or 3) domain classification of texts as short as one sentence, as in the current approach. Thus, the ARN model can somehow be regarded as a generic text representation that includes all terms and their order in the text, allowing the most appropriate term weighting according to the target task. In this context, higher-order structural relationships other than the ones introduced in this work (based on co-occurrence word frequencies) may be considered in future works. Furthermore, we want to point out that updating the ARN is very easy. If new training texts are to be incorporated into the document collection, 1) the full ARN only needs to update the global ARN model with the new texts besides rebuilding the domain ARNs, and 2) the reduced ARN only needs to build the ARNs of the new domains, as the space of classification is defined by the input text. A minimal sketch of such an incrementally updatable graph is given below.
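The following sketch illustrates such an incrementally updatable structure: a directed weighted word graph whose nodes are terms (including punctuation marks) and whose edges count co-occurrences between adjacent terms. Tokenization and term weighting are deliberately simplified; only the general shape of the model follows the paper.

from collections import defaultdict

class ARN:
    # Minimal associative relational network: a directed weighted word-based graph.
    def __init__(self):
        self.node_freq = defaultdict(int)   # term -> occurrence count
        self.edge_freq = defaultdict(int)   # (term_i, term_j) -> co-occurrence count

    def add_text(self, tokens):
        # Incrementally update the graph with a new token sequence,
        # which is what makes extending the model with new texts cheap.
        for tok in tokens:
            self.node_freq[tok] += 1
        for a, b in zip(tokens, tokens[1:]):
            self.edge_freq[(a, b)] += 1

domain_arn = ARN()
domain_arn.add_text(["the", "weather", "in", "barcelona", "is", "fantastic", "."])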

As pointed out throughout this paper, the synthesis process of the implemented MD-TTS system is based on the direct correspondence between domain contents and speaking styles established by [50]. Thanks to this relationship, the TTS system is capable of delivering the message with the appropriate speaking style in most cases (i.e., when the input text is assigned to the correct domain), yielding a performance equivalent to that of LD-TTS synthesis systems.

As regards the synthetic speech quality, we would like to point out that the mismatch between prosodic features and acoustic segments causes notable quality degradation when comparing the current results to the ones presented in [55]. In that work, when evaluating correct classification synthetic results, the prosodic pattern was set to fit the characteristics of the happy and sensual domains even when the neutral speech corpus was selected. As a result, the listening tests presented an overwhelming preference for the correct classification results compared to the reference ones.

VII. CONCLUSION

Next-generation high-quality text-to-speech synthesis systems are not only asked to generate high-quality synthetic speech (like limited-domain approaches) but also to be flexible enough to adapt to any application needs or speech signal characteristics, besides being capable of delivering natural (e.g., spontaneous or expressive) speech. This paper has introduced our proposal towards improving the flexibility of high-quality TTS systems by considering multiple domains, named multidomain TTS synthesis. This proposal belongs to a recent research direction focused on incorporating deeper text analysis for taking TTS systems a little closer to human behavior—which often includes mood changes, different speaking styles, etc., within the same conversation. In that sense, the MD-TTS approach follows a counterpart evolution to multidomain spoken


language systems, borrowing the idea of automatic domain assignment from deeper analysis of the input data—leaving paralinguistic and extralinguistic issues, which are defined by the context of the conversation, for further research.

Therefore, the MD-TTS approach can constitute a generic framework for developing any kind of application involving speech synthesis, by including or not (depending on the application requirements) the automatic domain assignment of the input text. In that sense, the introduced architecture allows changing the point of view when developing a TTS system: instead of predefining a synthesis strategy plus a corpus typology so as to meet the application requirements, the flexibility of the MD-TTS architecture allows adapting the TTS system modules to the acoustic needs of each desired output speaking style—from speaking styles with a particular voice quality (e.g., the whispering nature of the sensual domain) that need to be explicitly recorded to be properly delivered [22], [44], to styles realistically generated through signal processing modifications, such as good or bad news synthetic messages that can be obtained from a general-purpose corpus [21].

Moreover, the described ARN-based automatic text classification proposal satisfactorily tackles the problem of classifying texts as short as one sentence, by taking into account both thematic and structural features of the text after representing it on a graph-based model including all words and punctuation marks. It is important to note that, to date, the proposal for conducting text classification only takes into account the raw input text, without including external semantic knowledge (e.g., WordNet [16]), an issue that is left for future investigations. Moreover, the training text database corresponds to the texts of the recorded speech corpus, since the MD-TTS was implemented following a tiering corpus-based approach. However, we are currently considering the possibility of enlarging the training collection with texts not included in the speech corpus so as to generalize the domain classification process (an issue that will become more interesting as the employed synthesis strategy becomes more flexible, besides reducing the out-of-vocabulary problem at the same time—see Appendix I). Furthermore, the current implementation of the text classification module is being optimized towards reducing its computational cost and improving its classification efficiency when classifying input texts shorter than one sentence, besides studying its applicability to other text classification tasks. Finally, in future works we want to explore techniques capable of inferring the speaking style (e.g., user attitude or emotion) directly from text, without resorting to the correspondence between domains and speaking styles, which has been the basis of the current implementation of the MD-TTS proposal.

In terms of the synthetic speech quality, the conducted subjective experiments show a clear correlation between evaluators' preferences and TC assignments, validating the performance of the ARN-based TC perceptually. Specifically, the collected subjective results reveal that, when MD-TTS works properly (i.e., it is equivalent to LD-TTS), users significantly prefer MD-TTS synthesis results to equivalent general-purpose syntheses, as in [21], [38], and [40]. Moreover, when MD-TTS assigns the input sentence to a domain other than the one in which it was originally recorded (i.e., wrong domain classification), evaluators showed

lower (though still significant) preference for the results synthesized from the manually labeled domain, as misclassifications mainly occur on texts whose meaning does not convey a clear domain membership. In addition to the presented experiments, we also compared the synthesis from the correctly assigned domain to general-purpose TTS with all the domains gathered in a single database. However, no significant differences were observed between the set of units retrieved from the whole database and those selected from the assigned domain, since only the longest path of units has been considered in the cost function in the current experiments—thus making this experiment meaningless under the current conditions. Nevertheless, we are planning to conduct new experiments after grouping the domains into a common database (i.e., simultaneously using multidomain speech databases as a single speech database) once our current research on reliable subjective cost function weight tuning is finished [56].

APPENDIX I
ALGEBRAIC JUSTIFICATION OF THE ARN-R MODEL

In this section, the ARN-R approach is presented as the best ARN-F approximation (in the least mean square sense) by means of algebraic arguments. The ARN-F model can be represented in a real vector space $\mathbb{R}^N$, which is built from the training documents collection. In this vector space, the pattern vectors $\bar{d}_i$ representing each domain and the vector $\bar{t}$ modeling the text to be classified—with $N_a$ active and $N - N_a$ null components,12 where $N_a \le N$—are represented (see the example of Table I). Following the same idea, the ARN-R can also be considered to be defined in a real vector space $\mathbb{R}^{N_t}$, with $N_a \le N_t$. The last inequality stands for terms of $\bar{t}$ which are not represented in the global ARN (i.e., out-of-vocabulary (OOV) words13).

Moreover, within the vector space defined by the global ARN, it is possible to define a vector subspace $S \subset \mathbb{R}^N$ generated by a vector basis $B = \{\bar{b}_1, \ldots, \bar{b}_{N_a}\}$ composed of $N_a$ orthonormal vectors defined by the non-null components of the vector $\bar{t}$ (see Table IV). By making use of this basis $B$, the pattern vector $\bar{d}_i$ can be optimally approximated in the subspace $S$ as $\hat{d}_i$, in terms of the minimum square error, by means of simple orthogonal projection:

$$\hat{d}_i = \sum_{k=1}^{N_a} \langle \bar{d}_i, \bar{b}_k \rangle \, \bar{b}_k. \qquad (9)$$

Any other (nonorthogonal) projection of the vectors onto the subspace $S$ will achieve a higher approximation error, computed as the Euclidean norm of the difference between the original and the projected vectors [57].
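The claim can be checked numerically. In the sketch below, the subspace S is spanned by the canonical unit vectors of the text's active components, so the orthogonal projection of a pattern vector simply keeps those components and zeroes the rest, which is exactly the restriction performed by the ARN-R. The dimensions and values are a toy assumption.

import numpy as np

d = np.array([3.0, 0.0, 2.0, 1.0, 5.0, 0.5])  # a domain pattern vector in R^N (N = 6)
active = [0, 1, 4]                             # indexes of the text's non-null components

B = np.eye(6)[active]                          # orthonormal basis of S, shape (N_a, N)
d_hat = B.T @ (B @ d)                          # orthogonal projection of d onto S, as in (9)

# The projection keeps the active components and zeroes the rest,
# i.e., it matches the ARN-R restriction of the pattern vector.
assert np.allclose(d_hat[active], d[active])
assert np.allclose(np.delete(d_hat, active), 0.0)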

12$N_t$ is the number of parameters (related to words, punctuation marks, co-occurrences, etc.) representing the input text, and $N_a$ is the resulting number of parameters after representing it in the global ARN.

13As in any other deterministic machine-learning-based process, whatever is not seen during the training process is not considered for classification purposes. However, this is not a critical issue for our approach, according to the objective experimental results.


Fig. 10. ARN-based text representations for a reference domain D and a text t to be classified in that domain. Empty nodes are marked explicitly and dashed lines symbolize inexistent co-occurrences. (a) ARN-D built from the text “The weather in Barcelona is fantastic.” (b) Input text t = “The weather is fantastic,” represented according to the ARN-D of (a).

TABLE III. DATA REPRESENTATION OF TABLE I CONTENTS WITHIN THE SPACE DEFINED BY THE ARN-R BUILT FROM THE TEXT TO BE CLASSIFIED

Notice the relationship between the data representation in the vector space defined by the ARN-R (see Table III) and the one obtained when projecting the information onto the vector subspace $S$ defined by the global ARN (see Table IV). It can be observed that using the ARN-R strategy is equivalent to approximating the domain pattern vectors on the vector subspace with the minimum square error. Moreover, besides the change in the order of the vector components—which does not affect the distance computation results—there is a subtle difference in the null components within the pattern vectors, due to the OOV words contained in the text to be classified. However, these null cells, on the one hand, do not affect the result of the dot product of the vectors and, on the other hand, affect uniformly all the computations and comparisons when using the cosine distance through the vector norm.

Nevertheless, it is important to note that the ARN-R approach implies losing a certain amount of information contained in the global representation of the ARN-F pattern vectors, which will affect the similarity computation through the vector norm values. However, as can be observed in the experiments described in Section V, this problem does not have a clear impact on the achieved results. Nevertheless, we shall continue studying the particularities of the ARN-R approach in future experiments.

APPENDIX II
STRUCTURAL FEATURES: PL AND CPL COMPUTATION

Fig. 10 presents a toy example of the ARN-based text representation of a hypothetical domain built from a single sentence, together with a similar input text to be classified. As can be noticed, the terms not observed during the training process are not represented in the ARN built from the text, since it is the ARN-D which defines the comparison space (in this example, the space defined by the ARN-D of Fig. 10(a)). For instance, the connection

TABLE IV. DATA REPRESENTATION OF THE TABLE I EXAMPLE ACCORDING TO THE VECTOR SUBSPACE S CREATED FROM THE ORTHONORMAL BASIS B DEFINED BY THE $N_a$ ACTIVE COMPONENTS OF THE TEXT VECTOR, REPRESENTED ACCORDING TO THE GLOBAL VSM

TABLE V. COMPUTATION OF PL AND CPL FOR THE TEXT OF FIG. 10(b), GIVEN THE INDEX SEQUENCE OBTAINED WHEN INDEXING THE FIVE-TERM TEXT AGAINST THE DOMAIN ARN OF FIG. 10(a) AFTER VECTORIZATION

“Barcelona-is” of Fig. 10(b) is null (its weight is zero), whereas “weather-is” does not exist in the ARN-D (it is not represented at all).

Table V shows the computation of PL and cPL for the example depicted in Fig. 10, according to (6) to (8). The index vector contains the positions of the text terms referenced to the ARN model (e.g., terms 3 and 4 are not included in the text; thus, their indexes are not considered). As a result, since the largest set of consecutive terms is two, PL takes its value from that longest run; however, thanks to considering cPL, both sets of consecutive terms are taken into account. Finally, notice that one element has been omitted from the computation (see Table V). A minimal sketch of this computation is given below.
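Since (6)–(8) are not reproduced in this excerpt, the sketch below implements one plausible reading that is consistent with the worked example: the text terms are indexed against the domain ARN, PL is derived from the longest run of consecutive indexes, and cPL accumulates the lengths of all such runs. The normalization by text length is an assumption.

def consecutive_runs(indexes):
    # Split a sorted index sequence into maximal runs of consecutive values.
    runs, run = [], [indexes[0]]
    for prev, cur in zip(indexes, indexes[1:]):
        if cur == prev + 1:
            run.append(cur)
        else:
            runs.append(run)
            run = [cur]
    runs.append(run)
    return runs

def pl_cpl(indexes, n_terms):
    # Illustrative PL/cPL: PL rewards the longest run of consecutive terms,
    # while cPL accumulates all runs; both are normalized by the text length.
    runs = consecutive_runs(sorted(indexes))
    pl = max(len(r) for r in runs) / n_terms
    cpl = sum(len(r) for r in runs) / n_terms
    return pl, cpl

# Toy case in the spirit of Fig. 10: the text terms map to ARN positions 1, 2, 5, 6,
# giving two runs of consecutive terms ({1, 2} and {5, 6}).
print(pl_cpl([1, 2, 5, 6], n_terms=4))  # -> (0.5, 1.0)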


ACKNOWLEDGMENT

The authors would like to thank P. Barnola, D. García, I. Iriondo, J. A. Montero, and O. Guasch for their help, and also all participants involved in the subjective experiments.

REFERENCES

[1] Y. Sagisaka, “Speech synthesis by rule using an optimal selection of non-uniform synthesis units,” in Proc. ICASSP, New York, NY, 1988, pp. 679–682.

[2] A. Black and P. Taylor, “Automatically clustering similar units for unit selection in speech synthesis,” in Proc. EuroSpeech, Rhodes, Greece, 1997, pp. 601–604.

[3] M. Ostendorf and I. Bulyko, “The impact of speech recognition on speech synthesis,” in Proc. IEEE Workshop Speech Synthesis, Santa Monica, CA, 2002, pp. 99–106.

[4] H. Zen and T. Toda, “An overview of Nitech HMM-based speech synthesis system for Blizzard Challenge 2005,” in Proc. InterSpeech, Lisbon, Portugal, 2005, pp. 93–96.

[5] J. Yamagishi and T. Kobayashi, “Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training,” IEICE Trans. Inf. Syst., vol. E90-D, no. 2, pp. 533–543, Feb. 2007.

[6] A. Black, H. Zen, and K. Tokuda, “Statistical parametric speech synthesis,” in Proc. ICASSP, Honolulu, HI, 2007, vol. IV, pp. 1229–1232.

[7] J. Yi and J. Glass, “Natural-sounding speech synthesis using variable-length units,” in Proc. ICSLP, Sydney, Australia, 1998, pp. 1167–1170.

[8] P. Taylor, “Concept-to-speech synthesis by phonological structure matching,” Phil. Trans. R. Soc., Ser. A, vol. 356, no. 1769, pp. 1403–1416, 2000.

[9] A. Black and K. Lenzo, “Limited domain synthesis,” in Proc. ICSLP, Beijing, China, 2000, vol. 2, pp. 411–414.

[10] F. Alías, I. Iriondo, L. Formiga, X. Gonzalvo, C. Monzo, and X. Sevillano, “High quality Spanish restricted-domain TTS oriented to a weather forecast application,” in Proc. InterSpeech, Lisbon, Portugal, 2005, pp. 2573–2576.

[11] G. Bailly, N. Campbell, and B. Möbius, “ISCA special session: Hot topics in speech synthesis,” in Proc. EuroSpeech, Geneva, Switzerland, 2003, pp. 37–40.

[12] I. Iriondo, F. Alías, and J. Socoró, “Prosody modelling of Spanish for expressive speech synthesis,” in Proc. ICASSP, Honolulu, HI, 2007, vol. IV, pp. 821–824.

[13] S. Sundaram and S. Narayanan, “An empirical text transformation method for spontaneous speech synthesizers,” in Proc. EuroSpeech, Geneva, Switzerland, 2003, vol. 2, pp. 1221–1224.

[14] Y. Sagisaka, T. Yamashita, and Y. Kokenawa, “Generation and perception of F0 markedness for communicative speech synthesis,” Speech Commun., vol. 46, pp. 376–384, 2005.

[15] G. Hofer, K. Richmond, and R. Clark, “Informed blending of databases for emotional speech synthesis,” in Proc. InterSpeech, Lisbon, Portugal, 2005, pp. 501–504.

[16] C. Ovesdotter Alm, D. Roth, and R. Sproat, “Emotions from text: Machine learning for text-based emotion prediction,” in Proc. HLT/EMNLP, Vancouver, BC, Canada, 2005, pp. 579–586.

[17] V. Francisco and P. Gervás, “Automated mark up of affective information in English texts,” Lecture Notes in Comput. Sci., no. 4188, pp. 375–382, Sep. 2006.

[18] F. Alías, I. Iriondo, and P. Barnola, “Multi-domain text classification for unit selection text-to-speech synthesis,” in Proc. 15th Int. Congr. Phonetic Sci. (ICPhS), Barcelona, Spain, 2003, pp. 2341–2344.

[19] F. Campillo and E. R. Banga, “A method for combining intonation modelling and speech unit selection in corpus-based speech synthesis systems,” Speech Commun., vol. 48, no. 8, pp. 941–956, Aug. 2006.

[20] W. Johnson, S. Narayanan, R. Whitney, R. Das, M. Bulut, and C. LaBore, “Limited domain synthesis of expressive military speech for animated characters,” in Proc. IEEE Workshop Speech Synthesis, Santa Monica, CA, 2002, pp. 163–166.

[21] W. Hamza, R. Bakis, E. M. Eide, M. A. Picheny, and J. F. Pitrelli, “The IBM expressive speech synthesis system,” in Proc. ICSLP, Jeju Island, Korea, 2004, pp. 2577–2580.

[22] O. Turk, M. Schröder, B. Bozkurt, and L. Arslan, “Voice quality interpolation for emotional text-to-speech synthesis,” in Proc. InterSpeech, Lisbon, Portugal, 2005, pp. 797–800.

[23] F. Sebastiani, “Machine learning in automated text categorisation,” ACM Comput. Surveys, vol. 34, no. 1, pp. 1–47, 2002.

[24] F. Sebastiani, “Text categorization,” in Text Mining and its Applications, A. Zanasi, Ed. Southampton, U.K.: WIT Press, 2005, ch. 4, pp. 109–129.

[25] E. Stamatatos, G. Kokkinakis, and N. Fakotakis, “Automatic text categorization in terms of genre and author,” Comput. Linguist., vol. 26, no. 4, pp. 471–495, 2000.

[26] D. Pérez-Piñar and C. García, “Application of confidence measures for dialogue systems through the use of parallel speech recognizers,” in Proc. InterSpeech, Lisbon, Portugal, 2005, pp. 2785–2788.

[27] K. Rüggenmann and I. Gurevych, “Assigning domains to speech recognition hypotheses,” in Proc. HLT-NAACL Workshop Spoken Lang. Understanding for Conversational Syst. and Higher Level Linguist. Inf. for Speech Process., Boston, MA, 2004, pp. 70–77.

[28] I. Lane, T. Kawahara, T. Matsui, and S. Nakamura, “Dialogue speech recognition by combining hierarchical topic classification and language model switching,” IEICE Trans. Inf. Syst., vol. E88-D, no. 3, pp. 446–454, 2005.

[29] J. Allan, “Perspectives on information retrieval and speech,” in Lecture Notes in Comput. Sci. (Workshop on Inf. Retrieval Tech. for Speech Applicat.), 2001, vol. 2273, pp. 1–10.

[30] K. Asami, T. Takezawa, and G. Kikui, “Topic detection of an utterance for speech dialogue processing,” in Proc. ICSLP, Denver, CO, 2002, pp. 1977–1980.

[31] J. Bellegarda, “Statistical language model adaptation: Review and perspectives,” Speech Commun., vol. 42, no. 1, pp. 93–108, 2004.

[32] J. Diéguez, C. García, and A. Cardenal, “Effective topic-tree based language model adaptation,” in Proc. InterSpeech, Lisbon, Portugal, 2005, pp. 1289–1292.

[33] Y. Akita and T. Kawahara, “Language model adaptation based on PLSA of topics and speakers,” in Proc. ICSLP, Jeju Island, Korea, 2004, pp. 1045–1048.

[34] F. Sugimoto, K. Yazu, M. Murakami, and M. Yoneyama, “A method to classify emotional expressions of text and synthesize speech,” in Proc. 1st Int. Symp. Control, Commun. Signal Process., Hammamet, Tunisia, 2004, pp. 611–614.

[35] J. Tao and T. Tan, “Emotional Chinese talking head system,” in Proc. 6th Int. Conf. Multimodal Interfaces (ICMI), State College, PA, 2004, pp. 273–280.

[36] Z.-J. Chuang and C.-H. Wu, “Emotion recognition from textual input using an emotional semantic network,” in Proc. ICSLP, Denver, CO, 2002, pp. 2033–2036.

[37] H. Liu, H. Lieberman, and T. Selker, “A model of textual affect sensing using real-world knowledge,” in Proc. 8th Int. Conf. Intell. User Interfaces, Miami, FL, 2003, pp. 125–132.

[38] M. Chu, C. Li, P. Hu, and E. Chang, “Domain adaptation for TTS systems,” in Proc. ICASSP, Orlando, FL, 2002, pp. 453–456.

[39] X. Sevillano, F. Alías, and J. Socoró, “ICA-based hierarchical text classification for multi-domain text-to-speech synthesis,” in Proc. ICASSP, Montreal, QC, Canada, 2004, vol. 5, pp. 697–700.

[40] V. Fischer, J. Botella, and S. Kunzmann, “Domain adaptation methods in the IBM trainable text-to-speech system,” in Proc. ICSLP, Jeju Island, Korea, 2004, pp. 1165–1168.

[41] A. Black, “Unit selection and emotional speech,” in Proc. EuroSpeech, Geneva, Switzerland, 2003, pp. 1649–1652.

[42] A. Iida, N. Campbell, F. Higuchi, and M. Yasumura, “A corpus-based speech synthesis system with emotion,” Speech Commun., vol. 40, no. 1–2, pp. 161–187, 2003.

[43] A. Black, “Perfect synthesis for all of the people all of the time,” in Proc. IEEE Workshop Speech Synthesis, Santa Monica, CA, 2002, pp. 167–170.

[44] N. Campbell, “Developments in corpus-based speech synthesis: Approaching natural conversational speech,” IEICE Trans. Inf. Syst., vol. E88-D, no. 3, pp. 376–383, 2005.

[45] J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi, “Acoustic modelling of speaking styles and emotional expressions in HMM-based speech synthesis,” IEICE Trans. Inf. Syst., vol. E88-D, no. 3, pp. 502–509, 2005.

[46] E. Rennison, “Galaxy of News: An approach to visualizing and understanding expansive news landscapes,” in Proc. ACM Symp. User Interface Software Technol., 1994, pp. 3–12.

[47] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Reading, MA: Addison-Wesley, 1989.

[48] M. Balestri, A. Pacchiotti, S. Quazza, P. L. Salza, and S. Sandri, “Choose the best to modify the least: A new generation concatenative synthesis system,” in Proc. EuroSpeech, Budapest, Hungary, 1999, vol. 5, pp. 2291–2294.

[49] C. L. Isbell and P. Viola, “Restructuring sparse high dimensional data for effective retrieval,” Adv. Neural Inf. Process. Syst., vol. 11, pp. 480–486, 1999.


[50] N. Montoya, “El uso de la voz en la publicidad audiovisual dirigida a los niños y su eficacia persuasiva,” Ph.D. dissertation, Univ. Autònoma de Barcelona, Barcelona, Spain, 1999.

[51] M. Sassano, “Virtual examples for text classification with support vector machines,” in Proc. Conf. Empirical Methods in Natural Lang. Process., 2003, pp. 208–215.

[52] T. Joachims, SVMlight, 2000. [Online]. Available: http://ais.gmd.de/~thorsten/svm_light/

[53] J. Shawe-Taylor and N. Cristianini, Kernel Methods for Pattern Analysis. Cambridge, U.K.: Cambridge Univ. Press, 2004.

[54] W. Cavnar and J. Trenkle, “N-gram-based text categorization,” in Proc. 3rd Annu. Symp. Document Anal. Inf. Retrieval, 1994, pp. 161–175.

[55] F. Alías, J. Socoró, X. Sevillano, I. Iriondo, and X. Gonzalvo, “Multi-domain text-to-speech synthesis by automatic text classification,” in Proc. InterSpeech, Pittsburgh, PA, 2006, pp. 267–274.

[56] F. Alías, X. Llorà, L. Formiga, K. Sastry, and D. E. Goldberg, “Efficient interactive weight tuning for TTS synthesis: Reducing user fatigue by improving user consistency,” in Proc. ICASSP, Toulouse, France, May 2006, vol. I, pp. 865–868.

[57] B. Noble and J. Daniel, Applied Linear Algebra. Englewood Cliffs, NJ: Prentice-Hall, 1988.

Francesc Alías (S’05–M’07) received the B.Sc. degree in telecommunications engineering and the M.Sc. and Ph.D. degrees in electronics engineering from Enginyeria i Arquitectura La Salle, Universitat Ramon Llull, Barcelona, Spain, in 1997, 1999, and 2006, respectively.

From 1999 to 2004, he was a Research Assistant and a Practices/Demonstrating Teacher at the Department of Communications and Signal Theory, Enginyeria i Arquitectura La Salle, also becoming an Assistant Teacher in 2004. In September 2007, he joined the Acoustic Area of the Department of Audiovisual Technologies of the same faculty as a Researcher and Assistant Teacher. His current research interests include speech and audio processing, analysis, synthesis and recognition, multimodal systems, artificial intelligence, text analysis, and new teaching methodologies. From 2000 to 2004, he was a Ph.D. student granted by the Departament d’Universitats i Societat de la Informació (DURSI), Generalitat de Catalunya. He has authored or coauthored over 50 papers in scientific journals and conferences.

Dr. Alías has been a member of the IEEE Signal Processing Society since 2005, and he is currently a member of the Speech Synthesis Special Interest Group and the Special Interest Group on Iberian Languages of the International Speech Communication Association (ISCA) and the Speech Technologies Spanish Network.

Xavier Sevillano received the B.Sc. degree in telecommunications engineering and the M.Sc. degree in electronics engineering from Enginyeria i Arquitectura La Salle, Universitat Ramon Llull (URL), Barcelona, Spain, in 1997 and 2000, respectively, and the M.S. degree in project management from URL in 2002. He is currently pursuing the Ph.D. degree at URL, focused on multimodal clustering.

Since 2000, he has been an Assistant Teacher and Researcher at the Department of Communications and Signal Theory, Enginyeria i Arquitectura La Salle. His current research interests are text analysis, multimodal fusion, speech technologies, and cluster ensembles for robust multimedia data clustering. He has authored or coauthored over 25 papers in scientific journals and conferences.

Mr. Sevillano is currently a member of the Association for Computing Machinery (ACM).

Joan Claudi Socoró received the B.Sc. degree in telecommunications engineering and the M.Sc. and Ph.D. degrees in electronics engineering from Enginyeria i Arquitectura La Salle, Universitat Ramon Llull, Barcelona, Spain, in 1993, 1995, and 2002, respectively.

He has been with the Department of Communications and Signal Theory, Enginyeria i Arquitectura La Salle, since 1992, first as a Practices/Demonstrating Teacher and, since 1995, as a Researcher and Assistant Teacher. His current research interests include speech and audio processing, analysis, synthesis and recognition, and multimodal systems. From 1996 to 1998, he was a Ph.D. student granted by the Departament d’Universitats i Societat de la Informació (DURSI), Generalitat de Catalunya. He has authored or coauthored over 90 papers in scientific journals and conferences.

Dr. Socoró received the 1999/2000 Rosina Ribalta Research Consolation Prize for the best Ph.D. thesis in Information Technologies and Communications by the Epson Foundation. He has been a member of COST-251 and COST-262, and he is currently a member of the Speech Technologies Spanish Network.

Xavier Gonzalvo received the B.Sc. degree in telecommunications engineering and the M.Sc. degree in electronics engineering from Enginyeria i Arquitectura La Salle, Universitat Ramon Llull (URL), Barcelona, Spain, in 2002 and 2004, respectively. He is currently pursuing the Ph.D. degree at URL, focused on HMM-based text-to-speech synthesis.

He was with the Department of Communications and Signal Theory, Enginyeria i Arquitectura La Salle, as an Assistant Researcher from 2003 to March 2008. His current research interests include speech processing analysis, speech synthesis and recognition, multimodal systems, dialog systems, and array processing. He has authored or coauthored over 15 papers in scientific journals and conferences.

Mr. Gonzalvo is currently a member of the International Speech Communication Association (ISCA) and the Speech Technologies Spanish Network.