Text-to-Speech synthesis using OpenMARY
-
Upload
danilo-sousa -
Category
Education
-
view
190 -
download
2
Transcript of Text-to-Speech synthesis using OpenMARY
Text-to-Speech synthesis using OpenMARY
An introduction and practical tutorial
Marc Schröder, [email protected]
eNTERFACE Amsterdam, 14 July 2010
Marc Schröder, DFKI 2
Overview
Some Text-to-Speech (TTS) basicsNatural Language ProcessingGenerating the sound
diphone synthesisunit selection synthesisHMM-based synthesis
OpenMARYexisting system MARY 4.0toolkit for adding new languages and voices
Tutorial overviewwhat you will learn to do in the tutorial
Marc Schröder, DFKI 4
Applications of TTS
Texts readersfor the blindin eyes-free environments (e.g., while driving)
Telephone-based voice portalsMulti-modal interactive systems
talking heads“embodied conversational agents” (ECAs)
Marc Schröder, DFKI 5
A Talking Head
“Hello, nice to meet you.”
TTS
Informationon timing
and mouth shapes
Marc Schröder, DFKI 6
Structure of a TTS system
text analysis
audio generation
Text orSpeech synthesis markup
phonetic transcription +prosodic parameters
Either plain text orSSML document
Intonation specificationPausing & speech timing
natural languageprocessing techniques
signal processingtechniques
Wave file
TEXTSSML
ACOUSTPARAMS
AUDIO
Marc Schröder, DFKI 7
Structure of a TTS system: MARY TTS
Text analysisInput markup parser TEXT or SSML → RAWMARYXML
Shallow NLP RAWMARYXML → PARTSOFSPEECH
Phonemiser PARTSOFSPEECH → ALLOPHONES
Symbolic prosody ALLOPHONES → INTONATION
Acoust. parameters INTONATION → ACOUSTPARAMS
Audio generationwaveform synthesis ACOUSTPARAMS → AUDIO
Marc Schröder, DFKI 8
System structure: Input markup parser
System-internal XML representation MaryXML=> speech synthesis markup parsing is simple XML transformationUse XSLT => easily adaptable to new markup language
TEXT or SSML → RAWMARYXML
Marc Schröder, DFKI 9
System structure: Shallow NLP
Shallow NLPTokeniser RAWMARYXML → TOKENS
sentence boundaries, “tokens” = word-like units
Text normalisation TOKENS → WORDS
expanded, pronounceable forms (see next slide)
Part-of-speech tagger WORDS → PARTSOFSPEECH
Marc Schröder, DFKI 10
Preprocessing / Text normalisation
Net patterns (email, web addresses) [email protected] patterns 23/07/2001Time patterns 12:24 h, 12:24Duration patterns 12:24 h, 12 h 24 minCurrency patterns 12.95 €Measure patterns 123.09 kmTelephone number patterns +49-681-85775-5303Number patterns (cardinal, ordinal, roman) 3 3rd III.Abbreviations engl.Special characters &
Marc Schröder, DFKI 11
System structure: Phonemisation
Phonemiser PARTSOFSPEECH → PHONEMES
lexicon lookupletter-to-sound conversion
morphological decompositionletter-to-sound rulessyllabificationword stress assignment
Custom pronounciation PHONEMES → ALLOPHONES
slurring, non-standard pronounciation
potentially trainable from annotated data of a given person
Marc Schröder, DFKI 12
System structure: Prosody
“Prosody”?intonation (accented syllables; high or low phrase boundaries)rhythmic effects (pauses, syllable durations)loudness, voice quality
Symbolic prosody prediction ALLOPHONES → INTONATION
assign prosody by rule, based onpunctuationpart-of-speech
modelled using “Tones and Break Indices” (ToBI)tonal targets: accents, boundary tonesphrase breaks
Marc Schröder, DFKI 13
System structure:Calculation of acoustic parameters
Duration prediction INTONATION → DURATIONS
segment duration predictedby rulesor by decision trees
Contour generation DURATIONS → ACOUSTPARAMS
fundamental frequency curve predictedby rulesor by decision trees
Marc Schröder, DFKI 14
System structure: Waveform synthesis
Waveform synthesis ACOUSTPARAMS → AUDIO
several waveform generation technologies
Marc Schröder, DFKI 15
Creating sound:Waveform synthesis technologies (1)
Formant synthesisacoustic model of speechgenerate acoustic structure by rulerobotic sound
Marc Schröder, DFKI 16
Creating sound:Waveform synthesis technologies (2)
Concatenative synthesisdiphone synthesis
glue pre-recorded “diphones” togetheradapt prosody through signal processing
unit selection synthesisglue units from a large corpus of speech togetherprosody comes from the corpus, (nearly) no signal processing
Marc Schröder, DFKI 17
Creating sound:Waveform synthesis technologies (3)
Statistical-parametric speech synthesiswith Hidden Markov Modelsmodels trained on speech corporano data needed at runtime => small footprint
Marc Schröder, DFKI 18
Examples of speech synthesis technologies
MARY TTSunit selection
HMM-based
MBROLA diphones
expressive unit selection
Commercialunit selection
IVONALoquendo
formant synthesisDecTalk
Marc Schröder, DFKI 19
Concatenative synthesis:Isolated phones don't work
target: w I n t r= d eI
acoustic unit database(units = phone segments recorded in isolation)
weI
r=aI
tnd
T
Marc Schröder, DFKI 20
Concatenative synthesis:Diphones
target: w I n t r= d eI _-w w-I I-n n-t t-r= r=-d d-eI eI-_
acoustic unit databaseunits = diphone segments recorded in carrier words(flat intonation)
_-w (wonder)
w-I (will)
I-n (spin)
n-t (fountain)
t-r= (water)
r=-d (nerdy)
d-eI (date)
eI-_ (away)Diphones =sound segmentsfrom the middle of one phoneto the middle of the next phone
Marc Schröder, DFKI 21
Concatenative synthesis:Diphones (2)
target: w I n t r= d eI _-w w-I I-n n-t t-r= r=-d d-eI eI-_
PSOLApitchmanipulation
Marc Schröder, DFKI 22
Concatenative synthesisUnit selection
“Which of these?”
“Let's discuss the question of interchangesanother day.”
target: w I n t r= d eI
acoustic unit database units = (di-)phone segments recorded in natural sentences (natural intonation)
Marc Schröder, DFKI 23
AI Poker: The voices of Sam and Max
Sam:Unit Selection SynthesisVoice specificallyrecorded for AI PokerNatural sound withinpoker domain
Max:HMM-based synthesisSound quality is limitedbut constant with any text
Marc Schröder, DFKI 24
Sam's voice: Unit selection syntheis
...
severalhours ofspeech
recordings
Unit selection corpus
“Ich habe zwei Paare.”+ + +
=> very good quality within the poker domain!
Marc Schröder, DFKI 25
Sam's voice: Unit selection syntheis
...
severalhours ofspeech
recordings
Unit selection corpus
“Ich kann auch ganz andere Sachen...”+ + +
reduced quality with arbitrary text
Marc Schröder, DFKI 26
Max's voice: HMM-based synthesis
statisticalmodels
“Ich habe zwei Paare.”
Hidden Markov Modelsacousticfeature vectors
vocoder
Marc Schröder, DFKI 27
Max's voice: HMM-based synthesis
statisticalmodels
“Ich kann auch ganz andere Sachen...”
Hidden Markov Modelsacousticfeature vectors
vocoder
constant quality with arbitrary text
Marc Schröder, DFKI 28
MARY TTS 4.0
Pure JavaRuns on any platform with Java 5
Client-server architecturehttp interface – your browser is a MARY client
Multilingual, with UTF-8 supportEnglish (US and GB)GermanTurkishTelugu
Willkommen
Konuşma
స్చ స్నసస
Marc Schröder, DFKI 29
Audio effects in MARY 4.0
Some can be applied to any voicevocal tract length (longer – shorter )Robot effectWhisper effectJet pilot
More effects for HMM-based voicespitch level (higher – lower )pitch range (wider – narrower )speaking rate (faster – slower )
Can be parameterised & combined to create characteristic voices
WikipediaXML dump
Wikipedia text import
Dump splitter
Markup cleaner
most frequent words in
the language
clean text
sentences w/diphone+prosody
features
Script selectionoptimising coverage
selectedsentences /
script
Manual check, excludeunsuitable sentences
Redstartrecord speech db
audiofiles
Synthesis componentsenable conversion ALLOPHONES->Audio
in new voice
Voice Import Tools
acousticmodelsfor F0+duration
unitselection
voicefiles
HMM-basedvoicefiles
speaker-specific
pronoun-ciation
allo-phones
.xml
Transcription GUI
pronoun-ciationlexicon
list offunctionwords
letter-to-sound forunknown
words
Basic NLP componentsenable conversion TEXT->ALLOPHONES
in new locale
Phonemiser rudimentaryPOS tagger
Tokeniser Symbolicprosody
generic implementations withbasic functionality:
Feature maker
MARY TTS: New language support workflow
Marc Schröder, DFKI 31
What you will learn to do in the MARY Tutorial
Installing the MARY systemlanguages and voices
Interacting with MARY using the web clientbasic experimentationinteractive test of audio effectsinteractive documentation of http interface
Triggering TTS from your own softwarehttp interfaceJava client codeselecting language, voice and effects in requests