Control Concepts for Articulatory Speech Synthesis
Peter Birkholz
Institute for Computer Science, University of Rostock, Germany
Ingmar Steiner
Department of Computational Linguistics and Phonetics, Saarland University, Germany
Stefan Breuer
Institute of Communication Sciences (IfK), University of Bonn, Germany
6th ISCA Workshop on Speech Synthesis, Bonn, 2007
Articulatory Control Concepts 2
Outline
• Introduction
• The articulatory speech synthesizer
• Rule-based generation of gestural scores using the Bonn Open Synthesis System
• Speech resynthesis based on EMA data
• Conclusions
Motivation
• Articulatory speech synthesis has the highest potential to synthesize speech with any voice, in any language, and with the most natural quality.
• To achieve such a high quality, appropriate models are needed for the vocal tract, aero-acoustics, and articulatory control.
• A comprehensive, configurable articulatory synthesizer (“VocalTractLab”, formerly “Speak”) has been developed by Birkholz et al. over recent years (2004-2007).
• In this talk, we present
– a novel gesture-based method to control the articulatory movements of the vocal tract model
– two high-level concepts for the specification of articulatory gestures in terms of gestural scores.
Concepts for the specification of gestures
We investigated two concepts for the generation of gestural scores:
1. Generation of gestures from text using the open-source software platform BOSS (Bonn Open Synthesis System) for articulatory text-to-speech synthesis
• Phonetic transcription, duration prediction, and intonation prediction are done analogously to unit-selection text-to-speech synthesis.
2. Use timing information extracted from Electromagnetic Articulography (EMA) signals to create gestural scores for speech resynthesis
Application prospects
The combination of parametric flexibility of a vocal tract model and
high-level articulatory control concepts could facilitate...
• Expressive speech synthesis, which would benefit from the flexible control of prosodic parameters, mainly F0 and voice quality
• Multilingual speech synthesis – truly the same voice for speaking different languages
• Voice morphing
• Research in prosody
From gestural scores to speech movements
• A gestural score is transformed into trajectories for the vocal tract parameters and the glottal parameters
• Vocalic and consonantal gestures are associated with articulatory target configurations („macros“)
• Overlapping gestures are coarticulated
• The transition between speech sounds/gestures is modelled as a process of target approximation using 3rd order dynamical systems
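As an illustration of the target-approximation idea, a critically damped 3rd-order system can be realized as three cascaded first-order smoothing stages (a standard equivalence; the time constant `tau` and sampling rate below are illustrative assumptions, not VocalTractLab's actual parameters):

```python
def target_approximation(targets, durations, tau=0.015, fs=1000):
    """Sketch of target approximation with a critically damped
    3rd-order linear system, realized as three cascaded first-order
    low-pass filters. tau is the time constant in seconds; tau and fs
    are illustrative values only."""
    # Stepwise command signal built from (target, duration) pairs.
    u = []
    for t, d in zip(targets, durations):
        u.extend([t] * round(d * fs))
    a = 1.0 / (tau * fs + 1.0)      # per-sample smoothing coefficient
    x = u
    for _ in range(3):              # three stages -> 3rd-order system
        y = [x[0]]
        for v in x[1:]:
            y.append(y[-1] + a * (v - y[-1]))
        x = y
    return x
```

Because all three poles are real and identical, the step response approaches each new target smoothly and without overshoot, which is the desired behaviour for articulator trajectories.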
Rule-based generation of gestural scores

The Synthesis Process: Text Normalisation & Client/Server Communication

[Diagram: BOSS client-server architecture. TTS, CTS, and SSML clients exchange XML with the BOSS server via file access or network communication; BOSS_ConMan dynamically loads the voice/language modules (TranscriptionDE/PL, DurationDE/PL, Unit SelectionDE/PL, Concatenation & Manipulation) during initialisation; raw audio is output to the network or to a file.]

The Bonn Open Synthesis System (BOSS) is a developer framework for the design of unit-selection speech synthesis applications. It is designed as a client-server architecture. Clients are responsible for receiving either text or text with mark-up and converting it into the XML format understood by the server. The BOSS server contains the module scheduler that integrates the various synthesis components and calls them in the appropriate order.
BOSS clients are application-specific and need to be supplied by the user. They can be either TTS or CTS clients. The task of the client is to provide tokenisation and conversion into the server's XML format; it also sends the data to the server and receives the speech signal.
<SENTENCE Type=".">
  <WORD Orth="Guten"></WORD>
  <WORD Orth="Tag"></WORD>
  <WORD Orth="Herr"></WORD>
  <WORD Orth="Müller"></WORD>
</SENTENCE>
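A minimal sketch of how a client might assemble such a document (the element and attribute names are taken from the example; the function itself is an illustration, not actual BOSS client code):

```python
import xml.etree.ElementTree as ET

def sentence_to_boss_xml(words, sentence_type="."):
    """Build a SENTENCE element with one WORD child per token
    (illustrative sketch of the server's input format)."""
    sent = ET.Element("SENTENCE", Type=sentence_type)
    for w in words:
        ET.SubElement(sent, "WORD", Orth=w)
    return ET.tostring(sent, encoding="unicode")
```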
The BOSSWin frontend for VocalTractLab
[Diagram: BOSSWin architecture. BOSS_Synthesis dynamically loads the German modules (TTS preproc, TranscriptionDE, DurationDE, IntonationDE) during initialisation; the XML output is passed to VocalTractLab.]

For the integration with our articulatory synthesizer “VocalTractLab”, we ported BOSS to Windows, using only the German TTS components.
The client software was merged with the server.
The Synthesis Process: Automatic Phonetic Transcription
The transcription module adds the SYLLABLE, PHONE and HALFPHONE elements and provides the attributes TKey (which contains the transcription of the element) as well as the phrasing attributes PInt and PMode to the XML DOM.
The German module uses a three-step process to yield a transcription for each WORD element:
1. lexicon lookup
2. morpheme decomposition
3. decision-tree-based grapheme-to-phoneme conversion, stress assignment, and syllabification
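The lookup-with-fallback logic of these three steps can be sketched as follows (the helpers `decompose` and `g2p` stand in for BOSS's morpheme decomposition and decision-tree g2p modules and are assumptions here):

```python
def transcribe(word, lexicon, decompose, g2p):
    """Illustrative three-step transcription strategy:
    1. full-form lexicon lookup,
    2. morpheme decomposition (known parts from the lexicon,
       unknown parts via g2p),
    3. grapheme-to-phoneme conversion as the final fallback."""
    # Step 1: full-form lexicon lookup.
    if word in lexicon:
        return lexicon[word]
    # Step 2: morpheme decomposition.
    parts = decompose(word)
    if len(parts) > 1:
        return "".join(lexicon.get(p, g2p(p)) for p in parts)
    # Step 3: fall back to grapheme-to-phoneme conversion.
    return g2p(word)
```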
The Synthesis Process: Prediction of Prosodic Parameters

A CART-based module predicts sound durations in ms and adds the attribute Dur to each WORD, SYLLABLE, PHONE and HALFPHONE element in the DOM.
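As a sketch of how such a regression tree is applied at synthesis time (the nested-dict format and the toy splits below are illustrative assumptions, not BOSS's actual CART representation):

```python
def predict_duration(features, node):
    """Walk a decision tree to a leaf, which holds a phone
    duration in ms. Internal nodes ask a boolean feature question
    (illustrative sketch only)."""
    while isinstance(node, dict):
        node = node["yes"] if features.get(node["ask"]) else node["no"]
    return node

# Toy tree: stressed phones are longer, phrase-final ones longer still.
toy_tree = {
    "ask": "stressed",
    "yes": {"ask": "phrase_final", "yes": 110.0, "no": 85.0},
    "no": 60.0,
}
```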
For the automatic generation of gestural scores from these data,
• gesture durations had to be predicted from phone durations (they are NOT the same!), and
• the gestures had to be temporally coordinated (e.g., the glottal opening and the oral closure for /t/): phasing rules!
Coordination rules examples (1)
• Voiced and voiceless plosives mainly differ in the presence or absence of a glottal abduction gesture.
• For /t/, glottal opening starts approx. when the oral closure is established. Glottal closing starts around the oral closure release, but depends on the required degree of aspiration.
Coordination rules examples (2)
• For voiceless fricatives, glottal opening starts roughly when the tongue motion toward the constriction starts.
• Nasals require a velar gesture for the opening of the nasal port. Opening and closing times are not very critical, as long as the nasal port is at least slightly open during the oral closure.
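A minimal sketch of the phasing rules above for the glottal abduction gesture (times in seconds; the 40 ms default aspiration offset is an assumption for illustration):

```python
def glottal_gesture_times(kind, move_onset, closure_onset, release,
                          aspiration=0.04):
    """Phasing-rule sketch: voiceless plosives open the glottis
    roughly when the oral closure is established; voiceless
    fricatives already when the movement toward the constriction
    begins. Adduction starts around the release, shifted by the
    required degree of aspiration."""
    if kind == "plosive":
        abduction_onset = closure_onset
    elif kind == "fricative":
        abduction_onset = move_onset
    else:
        raise ValueError("rule defined for voiceless plosives/fricatives")
    adduction_onset = release + aspiration
    return abduction_onset, adduction_onset
```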
Examples for rule-based synthesis
• The rules exemplified on the previous slides were implemented quantitatively to generate gestural scores from BOSS data.
• Examples:– „Der Zug hat eine Stunde Verspätung.“
(“The train is one hour delayed.”)– „Guten Tag, liebe Zuhörer!“
(“Hello, dear listeners!”)
• Mapping from predicted phone durations to gestures is still at an experimental stage
• No intonation prediction (yet)
Speech resynthesis based on EMA data (1)
• Electromagnetic Articulography (EMA) allows motion capture of speech movements.
• EMA data can be used e.g. to analyze the timing of articulatory gestures.
• Timing information from EMA data could be used to improve the timing control of gestures for articulatory synthesis.
• Preliminary results indicate that resynthesis of speech, with gestural timing derived from EMA data, can produce results which strongly resemble the original with respect to gestural timing.
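As an illustration of how gestural timing might be derived from a single EMA channel, the common velocity-threshold criterion can be sketched as follows (the 20 % threshold is a frequent choice in EMA studies, an assumption here):

```python
import math

def movement_landmarks(pos, fs, fraction=0.2):
    """Sketch: a movement's onset and offset are the points where
    the velocity magnitude crosses a fraction of its peak value."""
    # Per-sample velocity magnitude (simple forward difference).
    vel = [abs(pos[i + 1] - pos[i]) * fs for i in range(len(pos) - 1)]
    peak = max(range(len(vel)), key=vel.__getitem__)
    thr = fraction * vel[peak]
    onset = peak
    while onset > 0 and vel[onset - 1] >= thr:
        onset -= 1
    offset = peak
    while offset < len(vel) - 1 and vel[offset + 1] >= thr:
        offset += 1
    return onset / fs, offset / fs

# Synthetic sigmoidal closing movement sampled at 500 Hz.
fs = 500
pos = [1.0 / (1.0 + math.exp(-(i / fs - 0.5) * 40.0)) for i in range(fs)]
onset, offset = movement_landmarks(pos, fs)
```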
Speech resynthesis based on EMA data (2)
DFG project “German vowels” (Munich)
• EMMA (electromagnetic midsagittal articulography) with the AG100
• recorded at LMU Munich (1993-1995)
• 7 German speakers (1 female, 6 male)
• sensors on lower lip (LLIP), tongue tip (TTIP), and 3 along the tongue body (TMID, TBACK, TDORS), among others
• manually annotated

Drawbacks
• no velic sensor
• no electroglottographic (EGG) data (F0)
Conclusions
• Both proposed concepts for high-level control lead to intelligible, though not natural, synthesis results.
• It is conceivable to train, e.g., a CART to predict gesture durations directly instead of phone durations, using the methods implemented in BOSS.
• Integration of Fujisaki parameter prediction into the synthesizer should yield improvements to intonation.
• The resynthesis method could be further automated and provide very natural timing information for the gestural scores.
• The quality of articulatory synthesis output depends crucially on gestural timing control.