Speech Synthesis: Overviewtts.speech.cs.cmu.edu/11-752/slides/tts_intro.pdf · Overview Speech...

Speech Synthesis: Overview

11-752

Overview

◆ Speech Synthesis History: From knowledge-based to data driven
  ● Formant to Diphone
  ● Diphone to Unit Selection
  ● Unit Selection to Statistical Parametric

◆ Optimizing the Problem
  ● The right measures, the right algorithm
  ● The right databases, the right things to synthesize

◆ Some Hard Problems
◆ Evaluation

Physical Models

• Blowing air through tubes…
  – von Kempelen’s synthesizer, 1791

• Synthesis by physical models
  – Homer Dudley’s Voder, 1939

More Computation – More Data

◆ Formant synthesis (60s-80s)
  ● Waveform construction from components

◆ Diphone synthesis (80s-90s)
  ● Waveform by concatenation of small number of instances of speech

◆ Unit selection (90s-00s)
  ● Waveform by concatenation of very large number of instances of speech

◆ Statistical Parametric Synthesis (00s-..)
  ● Waveform construction from parametric models

Waveform Generation

- Formant synthesis
- Random word/phrase concatenation
- Phone concatenation
- Diphone concatenation
- Sub-word unit selection
- Cluster based unit selection
- Statistical Parametric Synthesis

Building a Research Field

◆ Tools
  ● Allow others to easily join the field

◆ Common Data Sets
  ● Be able to concentrate on techniques
  ● Have common comparisons

◆ Evaluation
  ● Realistically compare techniques

◆ Have Users
  ● Someone has to care about your results

◆ Don’t become stifled
  ● Ensure there are new tasks and directions

Festival Speech Synthesis System

http://festvox.org/festival
- General system for multi-lingual TTS
- C/C++ code with Scheme scripting language
- General replaceable modules
  - lexicons, LTS, duration, intonation, phrasing, POS tagging, tokenizing, diphone/unit selection
- General Tools
  - intonation analysis (F0, Tilt), signal processing, CART building, n-grams, SCFG, WFST, OLS
- No fixed theories
- New languages without new C++ code
- Multiplatform (Unix, Windows, OSX)
- Full sources in distribution
- Free Software

CMU FestVox Project

http://festvox.org
“I want it to speak like me!”
- Festival is an engine, how do you make voices?
- Building Synthetic Voices
  - Tools, scripts, documentation
  - Discussion and examples for building voices
  - Example voice databases
  - Step-by-step walkthroughs of processes
- Support for English and other languages
- Support for different waveform techniques:
  - diphone, unit selection, limited domain, HMM
- Other support: lexicon, prosody, text analysers

The CMU Flite project

http://cmuflite.org
“But I want it to run on my phone!”
- FLITE: a fast, small, portable run-time synthesizer
- C based (no loaded files)
- Basic FestVox voices compiled into C/data
- Thread safe
- Suitable for embedded devices
  - Ipaq, Linux, WinCE, PalmOS, Symbian
- Scalable:
  - quality/size/speed trade-offs
  - frequency-based lexicon pruning
- Sizes:
  - 2.4Meg footprint (code+data+runtime RAM)
  - < 0.025 secs “time-to-speak”

Common Data Sets

◆ Data-driven techniques need data
◆ Diphone Databases
  ● CSTR and CMU US English Diphone sets (kal and ked)

◆ CMU ARCTIC Databases
  ● 1200 phonetically balanced utterances (about 1 hour)
  ● 7 different speakers (2 male 2 female 3 accented)
  ● EGG, phonetically labeled
  ● Utterances chosen from out-of-copyright text
  ● Easy to say
  ● Freely distributable
  ● Tools to build your own in your own language

Blizzard Challenge

◆ Realistic evaluation
  ● Under the same conditions

◆ Blizzard Challenge [Black and Tokuda]
  ● Participants build voice from common dataset
  ● Synthesize test sentences
  ● Large set of listening experiments
  ● Since 2005, now in 9th year
  ● 15-20 groups (Academia, Research Labs and Commercial Companies)

How to test synthesis

◆ Blizzard tests:
  ● Do you like it? (MOS scores)
  ● Can you understand it?
    → SUS (semantically unpredictable) sentence
    → The unsure steaks overcame the zippy rudder

◆ Can’t this be done automatically?
  ● Not yet (at least not reliably enough)
  ● But we now have lots of data for training techniques

◆ Why does it still sound like a robot?
  ● Need better (appropriate) testing
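
As a rough illustration of the "Do you like it?" test, a mean opinion score (MOS) is usually reported as the average of listener ratings together with some measure of uncertainty. The sketch below assumes a 1 to 5 rating scale and a normal-approximation interval; the function name `mos_with_ci`, the system names, and the ratings are all made up for illustration and are not Blizzard data or the Blizzard analysis procedure.

```python
import math
import statistics
from typing import List, Tuple


def mos_with_ci(ratings: List[int], z: float = 1.96) -> Tuple[float, float]:
    """Return (mean opinion score, half-width of an approximate 95% interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    # Normal approximation to the standard error of the mean (an assumption;
    # with few listeners a t-distribution would give a wider interval).
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Hypothetical 1-5 ratings for two systems.
scores = {"system_A": [4, 5, 4, 3, 4], "system_B": [2, 3, 3, 2, 3]}
for name, ratings in scores.items():
    mos, ci = mos_with_ci(ratings)
    print(f"{name}: MOS {mos:.2f} +/- {ci:.2f}")
```

Overlapping intervals between two systems are a hint that more listeners or more test sentences are needed before claiming one is preferred.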

Speech Synthesis Techniques

◆ Unit selection
◆ Statistical parametric synthesis
◆ Automated voice building
  ● Database design
  ● Language portability

◆ Voice conversion

Unit Selection

• Target cost and Join cost [Hunt and Black 96]
  – Target cost is distance from desired unit to actual unit in the databases
    • Based on phonetic, prosodic, and metrical context
  – Join cost is how well the selected units join
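
The target/join cost formulation is typically solved as a shortest-path search over candidate units. The following is a minimal sketch of that idea using dynamic programming (Viterbi). The unit representation (dicts with `index` and `f0`), the two toy cost functions, and all numbers are assumptions chosen for illustration; they are not the features or weights used in Festival or in [Hunt and Black 96].

```python
from typing import Callable, List, Sequence, Tuple

Unit = dict


def select_units(
    targets: Sequence[Unit],
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[Unit, Unit], float],
    join_cost: Callable[[Unit, Unit], float],
) -> Tuple[List[Unit], float]:
    """Return the candidate sequence minimising total target + join cost."""
    # best[i][j]: cheapest cost of any path that ends in candidates[i][j]
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for cand in candidates[i]:
            costs = [
                best[i - 1][k] + join_cost(prev, cand)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + target_cost(targets[i], cand))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][j]
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total


def toy_target_cost(target: Unit, unit: Unit) -> float:
    # Hypothetical: distance in F0 only; real systems mix many features.
    return abs(target["f0"] - unit["f0"]) / 100.0


def toy_join_cost(left: Unit, right: Unit) -> float:
    # Hypothetical: units adjacent in the recording join for free.
    if right["index"] == left["index"] + 1:
        return 0.0
    return 0.5 + abs(left["f0"] - right["f0"]) / 100.0


targets = [{"f0": 120}, {"f0": 130}]
candidates = [
    [{"index": 10, "f0": 120}, {"index": 40, "f0": 110}],
    [{"index": 41, "f0": 125}, {"index": 77, "f0": 130}],
]
path, total = select_units(targets, candidates, toy_target_cost, toy_join_cost)
print([u["index"] for u in path], round(total, 3))
```

In this toy example the search prefers units 40 and 41, which were adjacent in the recording, even though neither is the best individual match to its target. That trade-off between target cost and join cost is the point of the formulation.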

Clustering Units

• Cluster units [Donovan et al 96, Black et al 97]

Unit Selection Issues

• Cost metrics
  – Finding best weights, best techniques etc

• Database design
  – Best database coverage

• Automatic labeling accuracy
  – Finding errors/confidence

• Limited domain:
  – Target the databases to a particular application
  – Talking clocks
  – Targeted domain synthesis

Unit Selection vs Parametric

Unit Selection
- The “standard” method
- “Select appropriate sub-word units from large databases of natural speech”

Parametric Synthesis: [NITECH: Tokuda et al]
- HMM-generation based synthesis
- Cluster units to form models
- Generate from the models
- “Take ‘average’ of units”

Old vs New

Unit Selection:
- large carefully labelled database
- quality good when good examples available
- quality will sometimes be bad
- no control of prosody

Parametric Synthesis:
- smaller less carefully labelled database
- quality consistent
- resynthesis requires vocoder (buzzy)
- can (must) control prosody
- model size much smaller than Unit DB

Parametric Synthesis

• Probabilistic Models

• Simplification

• Generative model
  – Predict acoustic frames from text

SPSS

◆ ASR vs SPSS
  ● Similar techniques but not the same

◆ Model training techniques
  ● Alignment, and cluster features
  ● MLLR (adaptation from multi-speaker models)

◆ Model improvement techniques
  ● Minimum generation error
  ● Label optimization

◆ Parameterization techniques
  ● MFCC, LSP, STRAIGHT, HSM
  ● Excitation modeling techniques

SPSS Goals

◆ Require optimal parameterization that
  ● Is derivable from speech
  ● Can generate high quality speech
  ● Is predictable from text

◆ Candidates
  ● Spectral, F0, excitation
  ● Formants, nasality, aspiration
  ● Articulatory features

SPSS Systems

◆ HTS (NITECH)
  ● Based on HTK
  ● Predicts HMM-states
  ● (Default) uses MCEP and MLSA filter
  ● Supported in Festival

◆ Clustergen (CMU)
  ● No use of HTK
  ● Predicts Frames
  ● (Default) uses MCEP and MLSA filter
  ● More tightly coupled with Festival

Building Synthetic Voices

The “standard” voice requires …
- A phone set
- Pronunciations:
  - Lexicon/letter-to-sound rules
- Phonetically and prosodically balanced corpus
  - Spoken by a good speaker
- Text analysis:
  - Number, symbol expansion, etc
- Prosodic modeling
  - Phrasing, intonation, duration etc
- Waveform generation
  - Diphones, unit selection, parametric synthesis
- Something else that is hard:
  - No vowels (Arabic), no word segmentation, number declensions

Designing a good corpus

- From a large set of text
  - Select “nice” utterances
    - 5 to 15 words, easy to say
    - All words in lexicon, no homographs
  - Convert text to phoneme strings
    - Possibly with lexical stress, onset/coda, tone etc
  - Select utterances that maximize di/triphone coverage
    - Looking for around 1000 utterances
    - Can seed initial data with “domain” data
- CMU ARCTIC databases
  - 7 x single speaker English DBs
  - 1200 phonetically balanced utterances
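
The coverage-maximizing selection step can be sketched as a greedy set-cover loop: repeatedly pick the utterance that adds the most diphones not yet covered. This is a minimal sketch under stated assumptions, not the FestVox selection script; the function names, the toy phone strings, and the choice of plain diphone counts (no stress, tone, or frequency weighting) are all illustrative.

```python
from typing import Dict, List, Set, Tuple

Diphone = Tuple[str, str]


def diphones(phones: List[str]) -> Set[Diphone]:
    """All adjacent phone pairs in one utterance."""
    return set(zip(phones, phones[1:]))


def greedy_select(utts: Dict[str, List[str]], max_utts: int) -> List[str]:
    """Pick up to max_utts utterance ids, each adding the most new diphones."""
    covered: Set[Diphone] = set()
    chosen: List[str] = []
    remaining = {uid: diphones(p) for uid, p in utts.items()}
    while remaining and len(chosen) < max_utts:
        # Ties go to the utterance that appears first in the input.
        uid = max(remaining, key=lambda u: len(remaining[u] - covered))
        gain = remaining[uid] - covered
        if not gain:
            break  # nothing left adds new coverage
        chosen.append(uid)
        covered |= gain
        del remaining[uid]
    return chosen


# Hypothetical phone strings for three candidate utterances.
utts = {
    "u1": ["pau", "h", "e", "l", "o", "pau"],
    "u2": ["pau", "h", "e", "pau"],
    "u3": ["pau", "l", "o", "w", "pau"],
}
print(greedy_select(utts, max_utts=2))
```

Here `u1` is chosen first (five new diphones), then `u3` (three new) ahead of `u2` (one new). Seeding with "domain" data would amount to initialising `covered` and `chosen` from the domain utterances before the loop starts.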

Hard Synthesis Problems

◆ Text Normalization
◆ Intonation modeling
  ● Intonation evaluation

◆ Style modeling
  ● Choosing the right style
  ● Evaluating the result

Text Normalization

◆ Finding the words
  ● Tokenizing, homograph disambiguation etc
  ● “$1.25” vs “$1.25 million” vs “$1.25 song”

◆ Very large number of rare events
◆ Formalized systems exist
  ● Trained from data, optimized and out-of-date
◆ Long term updated hacks rule systems
◆ ML Challenge
  ● Such a problem cannot be done by machine learning
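
To make the "$1.25" example concrete, here is a small hand-written rule sketch in which the same token is read three different ways depending on the following word. Everything in it is an assumption for illustration: the chosen verbalizations, the `HEAD_NOUNS` set (a hypothetical stand-in for a real part-of-speech tagger or classifier), and the restriction to amounts under $100. It shows the kind of context rule involved, not how Festival's text analysis is implemented.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()
MAGNITUDES = {"thousand", "million", "billion", "trillion"}
HEAD_NOUNS = {"song", "ticket", "coffee"}  # hypothetical stand-in for a tagger
MONEY = re.compile(r"\$(\d{1,2})\.(\d{2})")


def small_number(n: int) -> str:
    """Words for 0..99 only, which is enough for this sketch."""
    if not 0 <= n < 100:
        raise ValueError("sketch only handles 0..99")
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])


def normalize_money(text: str) -> str:
    """Expand $D.CC tokens, choosing the reading from the next word."""
    tokens = text.split()
    out, i = [], 0
    while i < len(tokens):
        m = MONEY.fullmatch(tokens[i])
        if not m:
            out.append(tokens[i])
            i += 1
            continue
        dollars, cents = int(m.group(1)), m.group(2)
        nxt = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if nxt in MAGNITUDES:
            # "$1.25 million": read as a decimal, currency moves to the end.
            digits = " ".join(ONES[int(d)] for d in cents)
            out.append(f"{small_number(dollars)} point {digits} {nxt} dollars")
            i += 2
        elif nxt in HEAD_NOUNS:
            # "$1.25 song": modifier reading with singular "dollar".
            out.append(f"{small_number(dollars)} dollar {small_number(int(cents))}")
            i += 1
        else:
            d_unit = "dollar" if dollars == 1 else "dollars"
            c_unit = "cent" if int(cents) == 1 else "cents"
            out.append(f"{small_number(dollars)} {d_unit} and "
                       f"{small_number(int(cents))} {c_unit}")
            i += 1
    return " ".join(out)


for example in ["$1.25", "$1.25 million", "a $1.25 song"]:
    print(example, "->", normalize_money(example))
```

Even this tiny case needs a word list, a lookahead, and special handling of plurals, and it still ignores punctuation and larger amounts, which gives a feel for why the number of rare events is so large.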

Intonation Modeling

◆ Accents, Phrases and F0
  ● Lots of statistical models available
  ● Lots of “objective” measures:
    → RMSE, Correlation
  ● No good subjective measures

◆ Listening tests
  ● Natural Intonation: good
  ● Naïve intonation: bad
  ● Various cute models for intonation: meh
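
The two "objective" measures named above can be computed directly from a reference and a predicted F0 contour. The sketch below assumes frame-aligned contours in Hz with 0 marking unvoiced frames, and compares only frames voiced in both; those conventions, the function name, and the toy contours are assumptions for illustration.

```python
import math
from typing import Sequence, Tuple


def f0_rmse_and_correlation(
    ref: Sequence[float], pred: Sequence[float]
) -> Tuple[float, float]:
    """RMSE (Hz) and Pearson correlation over frames voiced in both contours."""
    if len(ref) != len(pred):
        raise ValueError("contours must be frame-aligned")
    pairs = [(r, p) for r, p in zip(ref, pred) if r > 0 and p > 0]
    if len(pairs) < 2:
        raise ValueError("need at least two jointly voiced frames")
    n = len(pairs)
    rmse = math.sqrt(sum((r - p) ** 2 for r, p in pairs) / n)
    mean_r = sum(r for r, _ in pairs) / n
    mean_p = sum(p for _, p in pairs) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_p = sum((p - mean_p) ** 2 for _, p in pairs)
    if var_r == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a flat contour")
    return rmse, cov / math.sqrt(var_r * var_p)


# Toy contours: the prediction has the right shape but sits 10 Hz too high.
reference = [0.0, 100.0, 110.0, 120.0, 0.0]
predicted = [0.0, 110.0, 120.0, 130.0, 90.0]
rmse, corr = f0_rmse_and_correlation(reference, predicted)
print(f"RMSE = {rmse:.1f} Hz, correlation = {corr:.2f}")
```

The toy case gives a perfect correlation alongside a 10 Hz RMSE, a small reminder that the two numbers measure different things and that neither says how the intonation sounds to a listener.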

Improving Understanding

◆ Take reading comprehension stories
  ● For children’s reading tests, or TOEFL

◆ Synthesis with:
  ● Natural Intonation
  ● Naïve models
  ● Various cute models

◆ Human listening tests
  ● Answer questions about stories
  ● Best system: Naïve models

Style Modeling

◆ Classic Emotion Modeling
  ● Happy, sad, angry and neutral
  ● But no one needs that

◆ Style Modeling
  ● Polite, command, empathic

◆ Style usage
  ● When can it be used?
  ● How much should be used?

Dialog with Style

◆ Record human-human dialog
  ● Label dialog states:
    → Implicit confirmation, corrections, discourse markers

◆ Build dialog state sensitive voice
  ● Using dialog state in features

◆ Must be closely integrated into SDS
  ● Timing, dialog state appropriate

◆ But how do you test it?

Voice Transformation

- Collect small amount of data
  - 50 utterances
- Adapt existing voice to target voice
- Adaptation: What makes a voice:
  - Lexical choice
  - Phonetic variation
  - Prosody
  - Spectral/vocal tract/articulatory movement
  - Excitation mode
- Use articulatory modeling for transformation (Toth)

Voice Transformation

- Festvox GMM transformation suite (Toda)

[Grid of audio examples: conversions among the voices awb, bdl, jmk, slt]

Applications

◆ Speech output is only one component
◆ Need to integrate with larger applications
  ● Spoken Dialog Systems
  ● Speech-to-Speech Translation Systems
  ● Talking Heads
  ● Conversational participants
  ● Information delivery

Conclusions

• Synthesis has improved
  – But there is still much to do
  – Isolated sentences are clear …
  – … But conversational speech still in the future

• Speech Systems must adapt
  – To their usage
  – And their funding conditions

• But we can always fall back on our talents
