Text-to-Speech synthesis using OpenMARY

32
Text-to-Speech synthesis using OpenMARY An introduction and practical tutorial Marc Schröder, DFKI [email protected] eNTERFACE Amsterdam, 14 July 2010

Transcript of Text-to-Speech synthesis using OpenMARY

Text-to-Speech synthesis using OpenMARY

An introduction and practical tutorial

Marc Schröder, [email protected]

eNTERFACE Amsterdam, 14 July 2010

Marc Schröder, DFKI 2

Overview

Some Text-to-Speech (TTS) basicsNatural Language ProcessingGenerating the sound

diphone synthesisunit selection synthesisHMM-based synthesis

OpenMARYexisting system MARY 4.0toolkit for adding new languages and voices

Tutorial overviewwhat you will learn to do in the tutorial

Marc Schröder, DFKI 3

What is text-to-speech synthesis?

“You have one message from Dr Johnson.”

TTS

Marc Schröder, DFKI 4

Applications of TTS

Texts readersfor the blindin eyes-free environments (e.g., while driving)

Telephone-based voice portalsMulti-modal interactive systems

talking heads“embodied conversational agents” (ECAs)

Marc Schröder, DFKI 5

A Talking Head

“Hello, nice to meet you.”

TTS

Informationon timing

and mouth shapes

Marc Schröder, DFKI 6

Structure of a TTS system

text analysis

audio generation

Text orSpeech synthesis markup

phonetic transcription +prosodic parameters

Either plain text orSSML document

Intonation specificationPausing & speech timing

natural languageprocessing techniques

signal processingtechniques

Wave file

TEXTSSML

ACOUSTPARAMS

AUDIO

Marc Schröder, DFKI 7

Structure of a TTS system: MARY TTS

Text analysisInput markup parser TEXT or SSML → RAWMARYXML

Shallow NLP RAWMARYXML → PARTSOFSPEECH

Phonemiser PARTSOFSPEECH → ALLOPHONES

Symbolic prosody ALLOPHONES → INTONATION

Acoust. parameters INTONATION → ACOUSTPARAMS

Audio generationwaveform synthesis ACOUSTPARAMS → AUDIO

Danilo
Realce
Danilo
Realce
Danilo
Realce
Danilo
Realce
Danilo
Realce

Marc Schröder, DFKI 8

System structure: Input markup parser

System-internal XML representation MaryXML=> speech synthesis markup parsing is simple XML transformationUse XSLT => easily adaptable to new markup language

TEXT or SSML → RAWMARYXML

Marc Schröder, DFKI 9

System structure: Shallow NLP

Shallow NLPTokeniser RAWMARYXML → TOKENS

sentence boundaries, “tokens” = word-like units

Text normalisation TOKENS → WORDS

expanded, pronounceable forms (see next slide)

Part-of-speech tagger WORDS → PARTSOFSPEECH

Marc Schröder, DFKI 10

Preprocessing / Text normalisation

Net patterns (email, web addresses) [email protected] patterns 23/07/2001Time patterns 12:24 h, 12:24Duration patterns 12:24 h, 12 h 24 minCurrency patterns 12.95 €Measure patterns 123.09 kmTelephone number patterns +49-681-85775-5303Number patterns (cardinal, ordinal, roman) 3 3rd III.Abbreviations engl.Special characters &

Marc Schröder, DFKI 11

System structure: Phonemisation

Phonemiser PARTSOFSPEECH → PHONEMES

lexicon lookupletter-to-sound conversion

morphological decompositionletter-to-sound rulessyllabificationword stress assignment

Custom pronounciation PHONEMES → ALLOPHONES

slurring, non-standard pronounciation

potentially trainable from annotated data of a given person

Marc Schröder, DFKI 12

System structure: Prosody

“Prosody”?intonation (accented syllables; high or low phrase boundaries)rhythmic effects (pauses, syllable durations)loudness, voice quality

Symbolic prosody prediction ALLOPHONES → INTONATION

assign prosody by rule, based onpunctuationpart-of-speech

modelled using “Tones and Break Indices” (ToBI)tonal targets: accents, boundary tonesphrase breaks

Marc Schröder, DFKI 13

System structure:Calculation of acoustic parameters

Duration prediction INTONATION → DURATIONS

segment duration predictedby rulesor by decision trees

Contour generation DURATIONS → ACOUSTPARAMS

fundamental frequency curve predictedby rulesor by decision trees

Marc Schröder, DFKI 14

System structure: Waveform synthesis

Waveform synthesis ACOUSTPARAMS → AUDIO

several waveform generation technologies

Marc Schröder, DFKI 15

Creating sound:Waveform synthesis technologies (1)

Formant synthesisacoustic model of speechgenerate acoustic structure by rulerobotic sound

Marc Schröder, DFKI 16

Creating sound:Waveform synthesis technologies (2)

Concatenative synthesisdiphone synthesis

glue pre-recorded “diphones” togetheradapt prosody through signal processing

unit selection synthesisglue units from a large corpus of speech togetherprosody comes from the corpus, (nearly) no signal processing

Marc Schröder, DFKI 17

Creating sound:Waveform synthesis technologies (3)

Statistical-parametric speech synthesiswith Hidden Markov Modelsmodels trained on speech corporano data needed at runtime => small footprint

Marc Schröder, DFKI 18

Examples of speech synthesis technologies

MARY TTSunit selection

HMM-based

MBROLA diphones

expressive unit selection

Commercialunit selection

IVONALoquendo

formant synthesisDecTalk

Marc Schröder, DFKI 19

Concatenative synthesis:Isolated phones don't work

target: w I n t r= d eI

acoustic unit database(units = phone segments recorded in isolation)

weI

r=aI

tnd

T

Marc Schröder, DFKI 20

Concatenative synthesis:Diphones

target: w I n t r= d eI _-w w-I I-n n-t t-r= r=-d d-eI eI-_

acoustic unit databaseunits = diphone segments recorded in carrier words(flat intonation)

_-w (wonder)

w-I (will)

I-n (spin)

n-t (fountain)

t-r= (water)

r=-d (nerdy)

d-eI (date)

eI-_ (away)Diphones =sound segmentsfrom the middle of one phoneto the middle of the next phone

Marc Schröder, DFKI 21

Concatenative synthesis:Diphones (2)

target: w I n t r= d eI _-w w-I I-n n-t t-r= r=-d d-eI eI-_

PSOLApitchmanipulation

Marc Schröder, DFKI 22

Concatenative synthesisUnit selection

“Which of these?”

“Let's discuss the question of interchangesanother day.”

target: w I n t r= d eI

acoustic unit database units = (di-)phone segments recorded in natural sentences (natural intonation)

Marc Schröder, DFKI 23

AI Poker: The voices of Sam and Max

Sam:Unit Selection SynthesisVoice specificallyrecorded for AI PokerNatural sound withinpoker domain

Max:HMM-based synthesisSound quality is limitedbut constant with any text

Marc Schröder, DFKI 24

Sam's voice: Unit selection syntheis

...

severalhours ofspeech

recordings

Unit selection corpus

“Ich habe zwei Paare.”+ + +

=> very good quality within the poker domain!

Marc Schröder, DFKI 25

Sam's voice: Unit selection syntheis

...

severalhours ofspeech

recordings

Unit selection corpus

“Ich kann auch ganz andere Sachen...”+ + +

reduced quality with arbitrary text

Marc Schröder, DFKI 26

Max's voice: HMM-based synthesis

statisticalmodels

“Ich habe zwei Paare.”

Hidden Markov Modelsacousticfeature vectors

vocoder

Marc Schröder, DFKI 27

Max's voice: HMM-based synthesis

statisticalmodels

“Ich kann auch ganz andere Sachen...”

Hidden Markov Modelsacousticfeature vectors

vocoder

constant quality with arbitrary text

Marc Schröder, DFKI 28

MARY TTS 4.0

Pure JavaRuns on any platform with Java 5

Client-server architecturehttp interface – your browser is a MARY client

Multilingual, with UTF-8 supportEnglish (US and GB)GermanTurkishTelugu

Willkommen

Konuşma

స్చ స్నసస

Marc Schröder, DFKI 29

Audio effects in MARY 4.0

Some can be applied to any voicevocal tract length (longer – shorter )Robot effectWhisper effectJet pilot

More effects for HMM-based voicespitch level (higher – lower )pitch range (wider – narrower )speaking rate (faster – slower )

Can be parameterised & combined to create characteristic voices

WikipediaXML dump

Wikipedia text import

Dump splitter

Markup cleaner

most frequent words in

the language

clean text

sentences w/diphone+prosody

features

Script selectionoptimising coverage

selectedsentences /

script

Manual check, excludeunsuitable sentences

Redstartrecord speech db

audiofiles

Synthesis componentsenable conversion ALLOPHONES->Audio

in new voice

Voice Import Tools

acousticmodelsfor F0+duration

unitselection

voicefiles

HMM-basedvoicefiles

speaker-specific

pronoun-ciation

allo-phones

.xml

Transcription GUI

pronoun-ciationlexicon

list offunctionwords

letter-to-sound forunknown

words

Basic NLP componentsenable conversion TEXT->ALLOPHONES

in new locale

Phonemiser rudimentaryPOS tagger

Tokeniser Symbolicprosody

generic implementations withbasic functionality:

Feature maker

MARY TTS: New language support workflow

Marc Schröder, DFKI 31

What you will learn to do in the MARY Tutorial

Installing the MARY systemlanguages and voices

Interacting with MARY using the web clientbasic experimentationinteractive test of audio effectsinteractive documentation of http interface

Triggering TTS from your own softwarehttp interfaceJava client codeselecting language, voice and effects in requests

Marc Schröder, DFKI 32

What you will learn to do in the MARY Tutorial (2)

Using timing information:REALISED_ACOUSTPARAMS and REALISED_DURATIONSPerformance: caching