Control Concepts for Articulatory Speech Synthesis
Peter Birkholz
Institute for Computer Science, University of Rostock, Germany
Ingmar Steiner
Department of Computational Linguistics and Phonetics, Saarland University, Germany
Stefan Breuer
Institute of Communication Sciences (IfK), University of Bonn, Germany
6th ISCA Workshop on Speech Synthesis, Bonn, 2007
Articulatory Control Concepts 2
Outline
• Introduction
• The articulatory speech synthesizer
• Rule-based generation of gestural scores using the Bonn Open Synthesis System
• Speech resynthesis based on EMA data
• Conclusions
Motivation
• Articulatory speech synthesis has the highest potential to synthesize speech with any voice, in any language, and with the most natural quality.
• To achieve such a high quality, appropriate models are needed for the vocal tract, aero-acoustics, and articulatory control.
• A comprehensive, configurable articulatory synthesizer (“VocalTractLab”, formerly “Speak”) has been developed by Birkholz et al. over recent years (2004-2007).
• In this talk, we present
– a novel gesture-based method to control the articulatory movements of the vocal tract model
– two high-level concepts for the specification of articulatory gestures in terms of gestural scores.
Concepts for the specification of gestures
We investigated two concepts for the generation of gestural scores:
1. Generation of gestures from text using the open-source software platform BOSS (Bonn Open Synthesis System) for articulatory text-to-speech synthesis
• Phonetic transcription, duration prediction, and intonation prediction are done analogously to unit-selection text-to-speech synthesis.
2. Use timing information extracted from Electromagnetic Articulography (EMA) signals to create gestural scores for speech resynthesis
Application prospects
The combination of parametric flexibility of a vocal tract model and
high-level articulatory control concepts could facilitate...
• Expressive speech synthesis, which would benefit from the flexible control of prosodic parameters, mainly F0 and voice quality
• Multilingual speech synthesis – truly the same voice for speaking different languages
• Voice morphing
• Research in prosody
From gestural scores to speech movements
• A gestural score is transformed into trajectories for the vocal tract parameters and the glottal parameters
• Vocalic and consonantal gestures are associated with articulatory target configurations („macros“)
• Overlapping gestures are coarticulated
• The transition between speech sounds/gestures is modelled as a process of target approximation using 3rd order dynamical systems
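As an illustration of the target-approximation idea, a critically damped 3rd-order system can be realized as three cascaded first-order smoothing stages (a standard equivalence; the time constant `tau` and sampling rate below are illustrative assumptions, not VocalTractLab's actual parameters):

```python
def target_approximation(targets, durations, tau=0.015, fs=1000):
    """Sketch of target approximation with a critically damped
    3rd-order linear system, realized as three cascaded first-order
    low-pass filters. tau is the time constant in seconds; tau and fs
    are illustrative values only."""
    # Stepwise command signal built from (target, duration) pairs.
    u = []
    for t, d in zip(targets, durations):
        u.extend([t] * round(d * fs))
    a = 1.0 / (tau * fs + 1.0)      # per-sample smoothing coefficient
    x = u
    for _ in range(3):              # three stages -> 3rd-order system
        y = [x[0]]
        for v in x[1:]:
            y.append(y[-1] + a * (v - y[-1]))
        x = y
    return x
```

Because all three poles are real and identical, the step response approaches each new target smoothly and without overshoot, which is the desired behaviour for articulator trajectories.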
Rule-based generation of gestural scores

The Synthesis Process: Text Normalisation & Client/Server Communication

[Diagram: BOSS client-server architecture. TTS, CTS, and SSML clients exchange XML with the BOSS server via file access or network communication; BOSS_ConMan dynamically loads the voice/language modules (TranscriptionDE/PL, DurationDE/PL, Unit SelectionDE/PL, Concatenation & Manipulation) during initialisation; raw audio is output to the network or to a file.]

The Bonn Open Synthesis System (BOSS) is a developer framework for the design of unit-selection speech synthesis applications. It is designed as a client-server architecture. Clients are responsible for receiving either text or text with mark-up and converting it into the XML format understood by the server. The BOSS server contains the module scheduler that integrates the various synthesis components and calls them in the appropriate order.
BOSS clients are application-specific and need to be supplied by the user. They can be either TTS or CTS clients. The task of the client is to provide tokenisation and conversion into the server's XML format; it also sends the data to the server and receives the speech signal.
<SENTENCE Type=".">
  <WORD Orth="Guten"></WORD>
  <WORD Orth="Tag"></WORD>
  <WORD Orth="Herr"></WORD>
  <WORD Orth="Müller"></WORD>
</SENTENCE>
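A minimal sketch of how a client might assemble such a document (the element and attribute names are taken from the example; the function itself is an illustration, not actual BOSS client code):

```python
import xml.etree.ElementTree as ET

def sentence_to_boss_xml(words, sentence_type="."):
    """Build a SENTENCE element with one WORD child per token
    (illustrative sketch of the server's input format)."""
    sent = ET.Element("SENTENCE", Type=sentence_type)
    for w in words:
        ET.SubElement(sent, "WORD", Orth=w)
    return ET.tostring(sent, encoding="unicode")
```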
The BOSSWin frontend for VocalTractLab
[Diagram: BOSSWin architecture. BOSS_Synthesis dynamically loads the German modules (TTS preproc, TranscriptionDE, DurationDE, IntonationDE) during initialisation; the XML output is passed to VocalTractLab.]

For the integration with our articulatory synthesizer “VocalTractLab”, we ported BOSS to Windows, using only the German TTS components.
The client software was merged with the server.
The Synthesis Process: Automatic Phonetic Transcription
The transcription module adds the SYLLABLE, PHONE and HALFPHONE elements and provides the attributes TKey (which contains the transcription of the element) as well as the phrasing attributes PInt and PMode to the XML DOM.
The German module uses a three-step process to yield a transcription for each WORD element:
1. lexicon lookup
2. morpheme decomposition
3. decision-tree-based grapheme-to-phoneme conversion, stress assignment, and syllabification
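The lookup-with-fallback logic of these three steps can be sketched as follows (the helpers `decompose` and `g2p` stand in for BOSS's morpheme decomposition and decision-tree g2p modules and are assumptions here):

```python
def transcribe(word, lexicon, decompose, g2p):
    """Illustrative three-step transcription strategy:
    1. full-form lexicon lookup,
    2. morpheme decomposition (known parts from the lexicon,
       unknown parts via g2p),
    3. grapheme-to-phoneme conversion as the final fallback."""
    # Step 1: full-form lexicon lookup.
    if word in lexicon:
        return lexicon[word]
    # Step 2: morpheme decomposition.
    parts = decompose(word)
    if len(parts) > 1:
        return "".join(lexicon.get(p, g2p(p)) for p in parts)
    # Step 3: fall back to grapheme-to-phoneme conversion.
    return g2p(word)
```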
The Synthesis Process: Prediction of Prosodic Parameters

A CART-based module predicts sound durations in ms and adds the attribute Dur to each WORD, SYLLABLE, PHONE and HALFPHONE element in the DOM.
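As a sketch of how such a regression tree is applied at synthesis time (the nested-dict format and the toy splits below are illustrative assumptions, not BOSS's actual CART representation):

```python
def predict_duration(features, node):
    """Walk a decision tree to a leaf, which holds a phone
    duration in ms. Internal nodes ask a boolean feature question
    (illustrative sketch only)."""
    while isinstance(node, dict):
        node = node["yes"] if features.get(node["ask"]) else node["no"]
    return node

# Toy tree: stressed phones are longer, phrase-final ones longer still.
toy_tree = {
    "ask": "stressed",
    "yes": {"ask": "phrase_final", "yes": 110.0, "no": 85.0},
    "no": 60.0,
}
```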
For the automatic generation of gestural scores from these data,
• gesture durations had to be predicted from phone durations (they are NOT the same!), and
• the gestures had to be temporally coordinated (e.g., the glottal opening and the oral closure for /t/): phasing rules!
Coordination rules examples (1)
• Voiced and voiceless plosives mainly differ in the presence or absence of a glottal abduction gesture.
• For /t/, glottal opening starts approx. when the oral closure is established. Glottal closing starts around the oral closure release, but depends on the required degree of aspiration.
Coordination rules examples (2)
• For voiceless fricatives, glottal opening starts roughly when the tongue motion toward the constriction starts.
• Nasals require a velar gesture for the opening of the nasal port. Opening and closing times are not very critical, as long as the nasal port is at least slightly open during the oral closure.
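A minimal sketch of the phasing rules above for the glottal abduction gesture (times in seconds; the 40 ms default aspiration offset is an assumption for illustration):

```python
def glottal_gesture_times(kind, move_onset, closure_onset, release,
                          aspiration=0.04):
    """Phasing-rule sketch: voiceless plosives open the glottis
    roughly when the oral closure is established; voiceless
    fricatives already when the movement toward the constriction
    begins. Adduction starts around the release, shifted by the
    required degree of aspiration."""
    if kind == "plosive":
        abduction_onset = closure_onset
    elif kind == "fricative":
        abduction_onset = move_onset
    else:
        raise ValueError("rule defined for voiceless plosives/fricatives")
    adduction_onset = release + aspiration
    return abduction_onset, adduction_onset
```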
Examples for rule-based synthesis
• The rules exemplified on the previous slides were implemented quantitatively to generate gestural scores from BOSS data.
• Examples:– „Der Zug hat eine Stunde Verspätung.“
(“The train is one hour delayed.”)– „Guten Tag, liebe Zuhörer!“
(“Hello, dear listeners!”)
• Mapping from predicted phone durations to gestures is still at an experimental stage
• No intonation prediction (yet)
Speech resynthesis based on EMA data (1)
• Electromagnetic Articulography (EMA) allows motion capture of speech movements.
• EMA data can be used e.g. to analyze the timing of articulatory gestures.
• Timing information from EMA data could be used to improve the timing control of gestures for articulatory synthesis.
• Preliminary results indicate that resynthesis of speech, with gestural timing derived from EMA data, can produce results which strongly resemble the original with respect to gestural timing.
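As an illustration of how gestural timing might be derived from a single EMA channel, the common velocity-threshold criterion can be sketched as follows (the 20 % threshold is a frequent choice in EMA studies, an assumption here):

```python
import math

def movement_landmarks(pos, fs, fraction=0.2):
    """Sketch: a movement's onset and offset are the points where
    the velocity magnitude crosses a fraction of its peak value."""
    # Per-sample velocity magnitude (simple forward difference).
    vel = [abs(pos[i + 1] - pos[i]) * fs for i in range(len(pos) - 1)]
    peak = max(range(len(vel)), key=vel.__getitem__)
    thr = fraction * vel[peak]
    onset = peak
    while onset > 0 and vel[onset - 1] >= thr:
        onset -= 1
    offset = peak
    while offset < len(vel) - 1 and vel[offset + 1] >= thr:
        offset += 1
    return onset / fs, offset / fs

# Synthetic sigmoidal closing movement sampled at 500 Hz.
fs = 500
pos = [1.0 / (1.0 + math.exp(-(i / fs - 0.5) * 40.0)) for i in range(fs)]
onset, offset = movement_landmarks(pos, fs)
```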
Speech resynthesis based on EMA data (2)
DFG project “German vowels” (Munich)
• EMMA (electromagnetic midsagittal articulography) with the AG100
• recorded at LMU Munich (1993-1995)
• 7 German speakers (1 female, 6 male)
• sensors on lower lip (LLIP), tongue tip (TTIP), and 3 along the tongue body (TMID, TBACK, TDORS), among others
• manually annotated

Drawbacks
• no velic sensor
• no electroglottographic (EGG) data (F0)
Conclusions
• Both proposed concepts for high-level control lead to intelligible, though not natural, synthesis results.
• It is conceivable to train, e.g., a CART to predict gesture durations directly instead of phone durations, using the methods implemented in BOSS.
• Integration of Fujisaki parameter prediction into the synthesizer should yield improvements to intonation.
• The resynthesis method could be further automated and provide very natural timing information for the gestural scores.
• The quality of articulatory synthesis output depends crucially on gestural timing control.