TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. ·...

49
Digital Speech Processing Digital Speech Processing— Lecture 18 Lecture 18 Lecture 18 Lecture 18 Text Text-to to-Speech (TTS) Speech (TTS) Synthesis Systems Synthesis Systems Synthesis Systems Synthesis Systems

Transcript of TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. ·...

Page 1: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Digital Speech ProcessingDigital Speech Processing——Lecture 18Lecture 18Lecture 18Lecture 18

TextText--toto--Speech (TTS) Speech (TTS) Synthesis SystemsSynthesis SystemsSynthesis SystemsSynthesis Systems

Page 2: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

TextText--toto--Speech (TTS) SynthesisSpeech (TTS) SynthesisTextText toto Speech (TTS) SynthesisSpeech (TTS) Synthesis• GOAL: convert arbitrary textual messages to

intelligible and natural sounding syntheticintelligible and natural sounding synthetic speech so as to transmit information from a machine to a personmachine to a person

input text

Text Analysis

Speech Synthesizer

abstract underlying linguistic

synthetic speech

t ty y

description output

• pronunciation of text => phonemes, stress, intonation, duration

• syntactic structure of sentence (pauses, rate of speaking, emphasis)

• semantic focus, ambiguity resolution (duration, intonation)

– rules for word etymology (especially names, foreign terms)

Page 3: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Text Analysis ComponentsText Analysis ComponentsText Analysis ComponentsText Analysis ComponentsRaw English Input TextRaw English Input Text

Basic Text ProcessingBasic Text ProcessingDocument Structure DetectionDocument Structure DetectionText NormalizationText NormalizationLinguistic AnalysisLinguistic Analysis DictionaryDictionaryLinguistic AnalysisLinguistic Analysis

Phonetic AnalysisPhonetic Analysis

Tagged TextTagged Text

DictionaryDictionary

Phonetic AnalysisPhonetic AnalysisHomograph disambiguationHomograph disambiguationGraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion

Tagged PhonesTagged Phones

Prosodic AnalysisProsodic AnalysisPitch and Duration RulesPitch and Duration RulesStress and Pause AssignmentStress and Pause Assignment

Tagged PhonesTagged Phones

Stress and Pause AssignmentStress and Pause Assignment

Synthesis Controls (Sequence of Synthesis Controls (Sequence of Sounds, Durations, Pitch)Sounds, Durations, Pitch)

Page 4: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Document StructureDocument StructureDocument StructureDocument Structure•• end of sentenceend of sentence marked by ‘.?!’ is not infallibley

– “The car is 72.5 in. long”•• ee--mailmail and web pages need special processing

– Larry:

Sure. I'll try to do it before Thursday :-)y y )

Ed•• multiple languagesmultiple languages•• multiple languagesmultiple languages

– insertion of foreign words, unusual accent and diacritical marks, etc.

Page 5: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Text NormalizationText Normalization

•• abbreviations and acronymsabbreviations and acronyms– Dr. is pronounced either as Doctor or drive depending on context (Dr.

Smith lives on Smith Dr.)S i d i h ‘ ’ ‘S i ’ d di (I li– St. is pronounced either as ‘street’ or ‘Saint’ depending on context (I live on Bourbon St. in St. Louis)

– DC is either direct current or District of Columbia– MIT is pronounced as either ‘M I T’ or ‘Massachusetts Institute of

Technology’ but never as ‘mitt’– DEC is pronounced as either ‘deck’ or ‘Digital Equipment Company’ but

never as ‘D E C’•• numbersnumbers

– 370-1111 can be either ‘three seven oh …’ or ‘three seventy-model 1111’ for the IBM 370 computer

– 1920 is either ‘nineteen-twenty’ or ‘one thousand, nine hundred, twenty’•• dates times currency account numbers ordinals cardinalsdates times currency account numbers ordinals cardinals•• dates, times currency, account numbers, ordinals, cardinals, dates, times currency, account numbers, ordinals, cardinals,

mathmath– Feb. 15, 1983 needs to convert to ‘February fifteenth, nineteen eighty-

three’$10 50 is pronounced as ‘ten dollars and fifty cents’– $10.50 is pronounced as ‘ten dollars and fifty cents’

– part # 10-50 needs to be pronounced as ‘part number 10 dash fifty’ rather than ‘part pound sign ten to fifty’

Page 6: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Text NormalizationText NormalizationText NormalizationText Normalization•• proper namesproper namesp pp p

– Rudy Prpch is pronounced as ‘Rudy Perpich’– Sorin Ducan is pronounced as ‘Sorin Duchan’

•• part of speechpart of speech– read is pronounced as ‘reed’ or ‘red’

record is pronounced as ‘rec ard’ or ‘ri cord’– record is pronounced as rec-ard or ri-cord•• word decompositionword decomposition

– need to decompose complex words into base formsneed to decompose complex words into base forms (morphemes) to determine pronunciation (‘indivisibility’ needs to be decomposed into ‘in-di-visible-ity’ to determine pronunciation)visible ity to determine pronunciation)

Page 7: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Text NormalizationText NormalizationText NormalizationText Normalization

• proper handling of special symbolsspecial symbols• proper handling of special symbolsspecial symbolsin text– punctuation, e.g., . , : ; ’ ” - -- _ * & ( ) ^

% @ ! ~ < > ? / \ = +•• string resolutionstring resolution

– 10:20 can be pronounced as either10:20 can be pronounced as either ‘twenty after 10 (as a time)’ or ‘ten to twenty’ (as a sequence)twenty (as a sequence)

Page 8: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Knowledge Source ConfusionsKnowledge Source ConfusionsKnowledge Source ConfusionsKnowledge Source Confusions

• Let us pray Lettuce spray• Let us pray ---------- Lettuce spray• Pay per view ---------- Paper viewy p p• Meet her on Main St. ----------

M t M i StMeter on Main St.• It is easy to recognize speech -----It is easy to recognize speech

----- It is easy to wreck a nice b hbeach

8

Page 9: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Linguistic AnalysisLinguistic AnalysisLinguistic AnalysisLinguistic Analysis• part-of-speech (POS)p p ( )• word sense• phrases

h• anaphora• emphasis• style• style

a conventional parser could be used, but a co e t o a pa se cou d be used, buttypically a simple shallow analysis is done for speed (parsers are not real-time!)

Page 10: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Homograph DisambiguationHomograph DisambiguationHomograph DisambiguationHomograph Disambiguation• “an absent boy” versus “do you choose toan absent boy versus do you choose to

absent yourself?”• “they will abuse him” versus “they won’tthey will abuse him versus they won t

take abuse”• “an overnight bag” versus “are you stayingan overnight bag versus are you staying

overnight?”• “he is a learned man” versus “he learnedhe is a learned man versus he learned

to play piano”• “El Camino Real road” versus “real world”El Camino Real road versus real world

Page 11: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

LetterLetter--toto--Sound (LTS) ConversionSound (LTS) Conversion

• CART “Whole Word” Dictionary Probe“Whole Word” Dictionary Probe

Input WordInput Word

(Classification and Regression Affix StrippingAffix Stripping

nono

nonogTree) analysis

• conventional “Root” Dictionary Probe“Root” Dictionary Probe

yesyesnono

nonoconventional dictionary search with

LetterLetter--toto--Sound and Sound and Stress RulesStress Rules

yesyesyesyes

search with letter-to-sound rules

Affix ReattachmentAffix Reattachmentrules

phonemes, stress, phonemes, stress, partsparts--ofof--speechspeech

This is only ONE way of doing This is only ONE way of doing LTS. FSMs are another.LTS. FSMs are another.

Page 12: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

ProsodyProsodyProsodyProsody•• pausespauses

– to indicate phrases and to avoid running out of breath

•• pitchpitch– fundamental frequency (F0) as a function of q y ( )

time•• rate/relative durationrate/relative duration

– phoneme durations, timing, and rhythm•• loudnessloudnessloudnessloudness

– relative amplitude/volume

Page 13: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Full TTS SystemFull TTS Systemyy

T t L tt t S th i

Text Face SyncText/Language

Text Analysis

Letter-to-Sound Rules

Synthesis Backend Speech

S th i d l i t TTS

- Numerical expansion(dates, times, $$$, …)

- abbreviations acronyms

- text and name dictionaries

- timing and duration[rate, stress (position, context)]

- intonation/pitch

Synthesis model impacts TTS, as well as unit representation.

choice of units: - words

phonesabbreviations, acronyms

- proper name id

“Dr. Smith lives at 23 Lakeshore Dr.”

- intonation/pitch(type of phrase[list,statement, ??])

- phonetic substitution(vowel reduction, flapping rules)Pr

osod

y - phones - diphones - dyads - syllables

choice of parameters:

- sentences, phrase and breath groups- semantic and syntactic accent; stress of compounds

- loudness/amplitude(phrasal amp., local reduction)

choice of parameters:- LPC - formants- waveform templates- articulatory parameters

Parsing:

y p- intonation types; parts of speech - sinusoidal parameters

method of computation:- rules- concatenation

Page 14: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Word Concatenation SynthesisWord Concatenation Synthesisyy

• words in sentences are much shorter than• words in sentences are much shorter than in isolation (up to 50% shorter) (see next page)page)

• words cannot preserve sentence-level t h th i t ti ttstress, rhythm or intonation patterns

• too many words to store (1.7 million surnames), extended words using prefixes and suffixes

Page 15: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Word ConcatenationWord ConcatenationWord ConcatenationWord Concatenation

Page 16: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Proper Name StatisticsProper Name StatisticsProper Name StatisticsProper Name StatisticsNumber of NamesNumber of Names CoverageCoverageNumber of NamesNumber of Names CoverageCoverage

10 4.9%

100 16.3%

5,000 59.1%5,000 59.1%

50,000 83.2%

100,000 88.6%

200,000 93.0%, %

Page 17: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Statistics of Proper Name Statistics of Proper Name CCCoverageCoverage

name name pronunciation pronunciation

based onbased onbased on based on etymologyetymology

Page 18: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Concatenative Word SynthesisConcatenative Word SynthesisConcatenative Word SynthesisConcatenative Word Synthesis

several examples of several examples of concatenated words and fullconcatenated words and fullconcatenated words and full concatenated words and full

sentencessentences----SBRSBR

wordword--based based synthesis does synthesis does

not work!not work!

Page 19: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Speech Synthesis MethodsSpeech Synthesis MethodsSpeech Synthesis MethodsSpeech Synthesis Methods• 1939—the VODER (Voice Operated1939 the VODER (Voice Operated

DEmonstratoR)—Homer Dudley– based on a simple model of speech sound production– select voicing source (with foot pedal control of pitch)

or noise sourcet filt h d th t d l– ten filters shaped the source to produce vocal or noise-excited sounds—controlled by finger motions

– separate keys for stop soundsseparate keys for stop sounds– wrist bar control of signal energy

Page 20: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

The VODERThe VODERThe VODERThe VODER

Page 21: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Articulatory SynthesisArticulatory Synthesis•• in theoryin theory — can create more natural and more

realistic motions of the articulators (rather than formant parameters), thereby leading to more natural sounding synthetic speech

utilize physical constraints of articulator movements– utilize physical constraints of articulator movements– use X-ray data to characterize individual speech

sounds– model how articulatory parameters move smoothly

between sounds• direct method: solve wave equation for sound pressure at q p

lips• indirect method: convert to formants or LPC parameters for

final synthesis in order to utilize existing synthesizers– use highly constrained motions of articulatory

parameters

Page 22: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Articulatory ModelArticulatory ModelArticulatory ModelArticulatory ModelArticulatory Parameters of Articulatory Parameters of Model:Model:

•• lip lip openingopening——WW

•• lip protrusion lengthlip protrusion length——LLlip protrusion lengthlip protrusion length LL

•• tongue body height and tongue body height and lengthlength——Y,XY,X

l ll l KK•• velar closurevelar closure——KK

•• tongue tip height and tongue tip height and lengthlength——B,RB,R

•• jaw raising (dependent jaw raising (dependent parameter)parameter)

•• velum openingvelum opening——NNp gp g

Page 23: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Articulatory Synthesis Using Articulatory Synthesis Using F S h i B k dF S h i B k dFormant Synthesizer BackendFormant Synthesizer Backend

Cecil CokerCecil Coker----teaching computers to teaching computers to talktalk

Articulatory Synthesis of Articulatory Synthesis of SpeechSpeech——Cecil CokerCecil Coker

talktalk

SpeechSpeech Cecil CokerCecil Coker

Page 24: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Articulatory Synthesis by Copying Articulatory Synthesis by Copying M d V l T t D tM d V l T t D tMeasured Vocal Tract DataMeasured Vocal Tract Data

• fully automatic closed-loop optimizationi iti li d f ti l t d b k l t– initialized from articulatory codebooks, neural nets

– Schroeter and Sondhi, 1987

One example: original: re-synthesis:

Page 25: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Articulatory Synthesis IssuesArticulatory Synthesis IssuesArticulatory Synthesis IssuesArticulatory Synthesis Issues

• requires highly accurate models of glottisrequires highly accurate models of glottis and vocal tract

• requires rules for dynamics of the• requires rules for dynamics of the articulators

Page 26: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

SourceSource--Filter Synthesis ModelsFilter Synthesis ModelsSourceSource Filter Synthesis ModelsFilter Synthesis Models• cascade/serial (formant) synthesis model

Page 27: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Serial/Formant Synthesis ModelSerial/Formant Synthesis Model

•• flaws in the serial/formant synthesis model:flaws in the serial/formant synthesis model:flaws in the serial/formant synthesis model:flaws in the serial/formant synthesis model:–– can’t handle voiced fricativescan’t handle voiced fricatives–– no zeros for nasal soundsno zeros for nasal sounds–– no precise control for stop consonantsno precise control for stop consonants–– pitch pulse shape fixedpitch pulse shape fixed——independent of pitchindependent of pitch–– spectral compensation is inadequatespectral compensation is inadequate

OVE 1OVE 1----FantFant SPASS SPASS synthesissynthesis

“To Be …”“To Be …”--Bell LabsBell Labs

DaisyDaisy--Daisy with Daisy with

musicmusic

We Wish We Wish You …You …

JSRU JSRU SynthesisSynthesis

Page 28: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Parallel Synthesis ModelParallel Synthesis ModelA i l th i i d h f l l t tA serial synthesizer is a good approach for open, non-nasal vocal tracts (vowels, liquids). For obstruents and nasals, we need to control the amplitudes of each resonance, and to introduce zeros in addition to the poles.

24

1 2 21

1 21 2

cos( )( )cos( )

k k k

k k k k

r rV zr z r z

θθ − −

=

− +=

− +∏14

1 2 21 1 2 cos( )

use the approximation

k k

k k k k

A B zr z r zθ

− −=

−=

− +∑i

4

1 2 21 1 2

use the approximation

( )cos( )

k

k k k k

AV zr z r zθ − −

=

≈− +∑

i

parallel synthesizer provides more parallel synthesizer provides more flexibility in matching spectrum levels flexibility in matching spectrum levels

t f t f i ( i it f t f i ( i iat formant frequencies (via gain at formant frequencies (via gain controls)controls)——however, zeros are however, zeros are introduced into the spectrum.introduced into the spectrum.

Page 29: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Parallel SynthesisParallel SynthesisParallel SynthesisParallel Synthesis•• issues:issues:

–– need individual resonance amplitudes need individual resonance amplitudes ((A1,…,A4A1,…,A4))——if resonances are close, this is a if resonances are close, this is a messy calculationmessy calculationmessy calculationmessy calculation

–– phasing of resonances neglected (the phasing of resonances neglected (the BBkkzz--11 terms)terms)–– synthetic speech has both resonances and zeros synthetic speech has both resonances and zeros

(at frequencies between the resonances) that may (at frequencies between the resonances) that may be perceptiblebe perceptiblebetter reproduction of complex consonantsbetter reproduction of complex consonants–– better reproduction of complex consonantsbetter reproduction of complex consonants

Parallel synthesis from BYUParallel synthesis from BYUParallel synthesisParallel synthesis--HolmesHolmes

Page 30: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Continuing Evolution (1959Continuing Evolution (1959--1987)1987)

• Haskins 1959• Haskins, 1959 • KTH – Stockholm, 1962• Bell Labs 1973• Bell Labs, 1973 • MIT, 1976

MIT t lk 1979• MIT-talk, 1979• Speak ‘N spell, 1980

BELL L b 1985• BELL Labs, 1985• Dec talk, 1987

Page 31: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

TextText--toto--Speech Synthesis (TTS) Speech Synthesis (TTS) EvolutionEvolution

Poor Good

Good Intelligibility;

C stomerPoor Intelligibility;

Poor Naturalness

Good Intelligibility;

Poor Naturalness

Customer Quality

Naturalness (Limited Context)

Formant Synthesis

LPC-Based Diphone/Dyad

Unit Selection Synthesis

Context)

Bell Labs; Joint Speech

Synthesis

Bell Labs; CNET;

ATR in Japan; CSTR in Scotland; Joint Speech

Research Unit; MIT (DEC-Talk);

Haskins Lab

;Bellcore; Berkeley Speech

Technology

BT in England; AT&T Labs (1998);

L&H in Belgium

1962 1967 1972 1977 1982 1987 1992 1997

Year

gy

Page 32: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Speech SynthesisSpeech Synthesis——the 90’sthe 90’sSpeech SynthesisSpeech Synthesis the 90 sthe 90 s• what changed?what changed?

– TTS was highly intelligible but extremely unnatural sounding

– a decade of work had not changed the naturalness substantiallycomputation and memory grew with Moore’s law– computation and memory grew with Moore s law, enabling highly complex concatenative systems to be created, implemented and perfected

– concatenative systems showed themselves capable of producing (in some cases) extremely natural sounding synthetic speechsounding synthetic speech

Page 33: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Concatenation TTS SystemsConcatenation TTS SystemsConcatenation TTS SystemsConcatenation TTS Systems• key idea: use segments of recorded speech for y g p

synthesis• data driven approach more segments give better

synthesis using an infinite number of segments leadssynthesis using an infinite number of segments leads to perfect synthesis

• key issues:what units to use– what units to use

– how to select units from natural speech– how to label and extract consistent units from a large database

h t i l t ti h ld b d f t ll– what signal representation should be used for spectrally smoothing units (at junctures) and for prosody modification (pitch, duration, amplitude)

Page 34: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Concatenation UnitsConcatenation UnitsConcatenation UnitsConcatenation Units

• choice of units:choice of units:– Words—there are an infinite number of them

Syllables there are about 10K in English– Syllables—there are about 10K in English– Phonemes—there are about 45 in English

Demi syllables there are about 2500 in– Demi-syllables—there are about 2500 in EnglishDiphones there are about 1500 2500 in– Diphones—there are about 1500-2500 in English

Page 35: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Choice of UnitsChoice of UnitsChoice of UnitsChoice of UnitsLength # Rules, Necessary

Unit ModificationsUnit # Units

(English)

Allophone 60-80

( g )QualityLow

ManyShort

pDiphone <402-652

Triphone <403-653

Demisyllable 2K ySyllable 11KVC*V2-syllable <11 K2yWord 100K-1.5MPhraseSentence HighFewLong

8Sentence gLong

Page 36: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Corpus Coverage by Unit TypeCorpus Coverage by Unit TypeCorpus Coverage by Unit TypeCorpus Coverage by Unit Type50000 2-Syllable

VC*V1.0 NOTE: depends on domain

(here SURNAMES)

40000TriphonesSyllablesDemisyllablesDiphones

.83

)

20000

30000DiphonesWords

Slope

# U

nits

oken

/uni

t)

10000

20000 .23.15.11

#(1

to

0.02

.0031 10K 20K 30K 40K 50K 2 M

Top N Surnames (rank)

Page 37: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Concatenation UnitsConcatenation UnitsConcatenation UnitsConcatenation Units• Words:

– no complete coverage for broad domains words have to be supplemented with smaller units

– limited ability to modify pitch, amplitude and durationlimited ability to modify pitch, amplitude and duration without losing naturalness and intelligibility

– need huge database to extract multiple versions of each word

• Sub-Words:– hard to isolate in context due to co-articulation

need allophonic variations to characterize units in all– need allophonic variations to characterize units in all contexts

– puts large burden on signal processing to smooth at unit join pointsunit join points

Page 38: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Block Diagram of a Block Diagram of a C i TTS SC i TTS SConcatenative TTS SystemConcatenative TTS System

Dictionary and Rules

Store of Sound Units

Text l

Assemble Speech W f

SpeechAnalysis,

Letter-to-Sound,Prosody

ssemble Units that

Match Input

Waveform Modification

and Synthesis

Alphabetic Ch t

Message Textp

Prosodyp

Targets SynthesisCharacters

Phonetic SymbolsPhonetic Symbols, Prosody Targets

Female MaleFemale Male

Page 39: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Procedure for Concatenative SynthesisProcedure for Concatenative Synthesis• off-line inventory preparation:

– record speech corpus and process with coding method f h iof choice

– determine location of speech units and store units in inventoryy

• on-line synthesis from text– normalize input text (expand abbreviations, etc.)p ( p )

– letter-to-sound (pronunciation dictionary and rules)

– prosody (melody/pitch&durations, stress patterns/amplitudes, …)

– select appropriate sequence of units from inventory– modify units (smooth at boundaries, match desired

prosody)prosody)– synthesize and output speech signal

Page 40: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Unit Selection SynthesisUnit Selection SynthesisUnit Selection SynthesisUnit Selection Synthesis• need to optimally match units at boundaries, e.g.,

fundamental frequency (pitch) and spectrumfundamental frequency (pitch), and spectrum • need to automatically and efficiently select optimal

sequence of units from database• issues in Unit Selection Synthesis

– several examples in each unit category (from 10 to 10 )– waveform modification used sparingly (leads to perceived

6

p g y ( pdistortions)

– high intelligibility must be maintained– “customer quality” attained with reasonable training set y g

(1-10 hours)– “natural quality” attained with large training set (10’s of hours)– “unit selection” algorithm must run in a fraction of real time on a

state-of-the-art processor

Page 41: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

(On(On--line) Unit Selection: Viterbi line) Unit Selection: Viterbi SearchSearch

u-##-a a-u

a-u(1)

#-a(1)

u-#(1)

#-# #-#

( )

a-u(2)

( )

#-a(2)

( )

u-#(2)

a-u(3)

#-a(3)

u-#(3)

• transitional (concatenation) costs are based on acoustic distances

a-u(4)

( )• node (target) costs are based on linguistic identity of unit

Page 42: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Unit Selection MeasuresUnit Selection MeasuresUnit Selection MeasuresUnit Selection Measures• USD—Unit Segmental Distortion differencesUSD Unit Segmental Distortion differences

between desired spectral pattern of target and that of candidate unit, throughout whole unit

• UCD—Unit Concatenative Distortion spectral discontinuity across boundaries of the concatenated units

Example: source context—want—/w ah n t/target context—cart—/k ah r t/USD (ah n versus ah r)

Page 43: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Modern TTS Systems (Natural Modern TTS Systems (Natural V i f AT&T)V i f AT&T)Voices from AT&T)Voices from AT&T)

NV2005NV2005• Soliloquy from Hamlet—• Gettysburg Address—

NV2005NV2005

NV2005NV2005

• Bob Story—• German female—• UK British female—• Spanish female—Spanish female• Korean female—• French male• French male—

Page 44: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Modern COMMERCIAL SystemsModern COMMERCIAL Systems

L t• Lucent• AcuVoice • Festival• L&H RealSpeakL&H RealSpeak• SpeechWorks female

S hW k l• SpeechWorks male• Cselt (Actor) - Italian

Page 45: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

TTS Future NeedsTTS Future NeedsTTS Future NeedsTTS Future Needs• TTS needs to know how things should be said• context-sensitive pronunciations of words• prosody prediction emphasis

– I gave the book to John (not someone else)g ( )– I gave the book to John (not the photos)– I gave the book to John (I did it, not someone else)

• unit selection process target cost captures mismatch p g pbetween predicted unit specification (phoneme name, duration, pitch, spectral properties) and actual features of a candidate recorded unit need better spectral distance measures that incorporate human perceptiondistance measures that incorporate human perception

• better signal processing compress units for small footprint devices

Page 46: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Business Drivers of TTSBusiness Drivers of TTSBusiness Drivers of TTSBusiness Drivers of TTS• cost reductioncost reduction

– TTS as a dialog component for customer care– TTS for delivering messages– TTS to replace expensive recorded IVR promptsTTS to replace expensive recorded IVR prompts

• new products and services– location-based services

providing information in cars (e g driving directions– providing information in cars (e.g., driving directions, traffic reports)

– unified Messaging (reading e-mail, fax)– voice Portals (enterprise, home, phone access tovoice Portals (enterprise, home, phone access to

Web-based services)– e-commerce (automatic information agents)– customized News, Stock Reports, Sports Scores, p , p– devices

Page 47: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Reading EmailReading EmailFrom: Marilyn Walker <[email protected]>

Example: Old TTS, No Filter

Reading EmailReading EmailFrom: Marilyn Walker [email protected] To: David Ross <[email protected]> Subject: Re: Today's Meeting Date: Tuesday, December 01, 1998 4:25 PM ---------------------------------------------------------------------------------------------------------------------------4 30 i fi f S t th ti4:30 is fine for me. See you at the meeting.

Marilyn

-----Original Message-----g gFrom: David Ross <[email protected]> To: Marilyn Walker <[email protected]> Date: Tuesday, December 01, 1998 2:25 PM Subject: Today's Meeting

Today's meeting has been changed from 4:00 to 4:30 PM. If the time change is a problem, please send me email at [email protected].

Thanks,

david ross

Page 48: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

Reading Email (final)Reading Email (final)From: Marilyn Walker <[email protected]>

Example: Enhanced TTS

Reading Email (final)Reading Email (final)

To: David Ross <[email protected]> Subject: Re: Today's Meeting Date: Tuesday, December 01, 1998 4:25 PM ---------------------------------------------------------------------------------------------------------------------------4:30 is fine for me. See you at the meeting.

Marilyn Walker

O i i l M-----Original Message-----From: David Ross <[email protected]> To: Marilyn Walker <[email protected]> Date: Tuesday, December 01, 1998 2:25 PM Subject: Today's Meeting j y g

Today's meeting has been changed from 4:00 to 4:30 PM. If the time change is a problem, please send me email at [email protected].

ThanksThanks,

david ross

Page 49: TextText--toto--Speech (TTS) Speech (TTS) Synthesis … · 2010. 5. 19. · GraphemeGrapheme--toto--Phoneme ConversionPhoneme Conversion Tagged Phones Prosodic Analysis Pitch and

TTS Application CategoriesTTS Application CategoriesTTS Application CategoriesTTS Application Categories• PDAs, cellphones, gaming, talking appliancesDevices

• driving directions, city and restaurant guides, location services (e.g., “Macys has a sale!”)

Automotive Connectivity

Devices

voice control of cell phones, VCRs, TVs • home information access over telephone (“Home

Voice Portals”)

Connectivity

ConsumerCommunications

• information access over the phone such as sales information, HR, internal phonebook, messaging

Enterprise Communications

Voice assisted • E-commerce, customer care (e.g., friendly automated talking web agents, FAQs, product information)

Voice-assistedE-Commerce

Call center • next-gen “HMIHY” automated operator servicesAutomation