Statistical Parametric Speech Synthesis
Heiga Zen, Google
June 9th, 2014

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Text-to-speech as sequence-to-sequence mapping

• Automatic speech recognition (ASR): Speech (continuous time series) → Text (discrete symbol sequence)

• Machine translation (MT): Text (discrete symbol sequence) → Text (discrete symbol sequence)

• Text-to-speech synthesis (TTS): Text (discrete symbol sequence) → Speech (continuous time series)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 1 of 79


Speech production process

[Figure: speech production process — text (concept) drives the fundamental frequency and the frequency transfer characteristics; the carrier wave is modulated by speech information (fundamental freq, voiced/unvoiced, freq transfer char, magnitude start–end); air flow passes the sound source (voiced: pulse, unvoiced: noise) and the vocal tract → speech]

Typical flow of TTS system

TEXT → [Text analysis: sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation, prosody prediction] → [Speech synthesis: waveform generation] → SYNTHESIZED SPEECH

• Text analysis (NLP; frontend): discrete ⇒ discrete
• Speech synthesis (speech; backend): discrete ⇒ continuous

This talk focuses on the backend.

Concatenative speech synthesis

[Figure: all candidate segments in the database; target cost and concatenation cost evaluated along the selected sequence]

• Concatenate actual instances of speech from a database

• Large data + automatic learning → high-quality synthetic voices can be built automatically

• Single inventory per unit → diphone synthesis [1]

• Multiple inventories per unit → unit selection synthesis [2]
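The target-cost + concatenation-cost search above is a dynamic-programming problem. A minimal sketch, assuming precomputed cost arrays (the candidate counts and cost values below are made up for illustration, not from the talk):

```python
import numpy as np

def select_units(target_costs, concat_costs):
    """Viterbi search over candidate units.

    target_costs: list of 1-D arrays, target_costs[t][i] = target cost of
        candidate i for slot t.
    concat_costs: list of 2-D arrays, concat_costs[t][i, j] = cost of joining
        candidate i at slot t to candidate j at slot t + 1.
    Returns the minimum-total-cost candidate index sequence.
    """
    T = len(target_costs)
    best = target_costs[0].astype(float)   # best path cost ending in each candidate
    back = []                              # backpointers
    for t in range(1, T):
        # total[i, j] = best path through candidate i at t-1, joined to j at t
        total = best[:, None] + concat_costs[t - 1] + target_costs[t][None, :]
        back.append(total.argmin(axis=0))
        best = total.min(axis=0)
    # Trace back the cheapest sequence
    path = [int(best.argmin())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    return path[::-1]

# Toy example: 3 slots, 2 candidate units each
tc = [np.array([1.0, 2.0]), np.array([0.5, 0.5]), np.array([2.0, 0.1])]
cc = [np.array([[0.0, 3.0], [3.0, 0.0]]), np.array([[0.0, 0.2], [0.2, 0.0]])]
print(select_units(tc, cc))   # → [0, 0, 1]
```

Real systems use the same recursion, just with acoustically motivated costs and pruning over much larger candidate sets.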

Statistical parametric speech synthesis (SPSS) [3]

[Diagram — training: Speech → speech analysis → y; Text → text analysis → x; (x, y) → model training → λ. Synthesis: Text → text analysis → x; λ → parameter generation → y → speech synthesis → Speech]

• Training
  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y):

    λ̂ = arg max_λ p(y | x, λ)

• Synthesis
  − Extract x from the text to be synthesized
  − Generate the most probable y from λ̂:

    ŷ = arg max_y p(y | x, λ̂)

  − Reconstruct speech from ŷ


Statistical parametric speech synthesis (SPSS) [3]

[Diagram: same training/synthesis pipeline as on the previous slide]

• Large data + automatic training → automatic voice building

• Parametric representation of speech → flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model → HMM-based speech synthesis system (HTS) [4]

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

HMM-based speech synthesis [4]

Training part:
SPEECH DATABASE → spectral parameter extraction / excitation parameter extraction → spectral & excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models

Synthesis part:
TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH



Source-filter model

[Diagram: pulse train (voiced) / white noise (unvoiced) → excitation e(n) → linear time-invariant system h(n) → speech x(n)]

x(n) = h(n) ∗ e(n)

  ↓ Fourier transform

X(e^{jω}) = H(e^{jω}) E(e^{jω})

Source excitation part: E(e^{jω}). Vocal tract resonance part: H(e^{jω}) — should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs.
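A toy illustration of the source-filter idea above, assuming a hand-picked two-coefficient all-pole resonator for h(n) (not a real vocal-tract model):

```python
import numpy as np

def excitation(n_samples, f0, fs, voiced):
    """Pulse train at f0 when voiced, white noise when unvoiced."""
    if voiced:
        e = np.zeros(n_samples)
        period = int(fs / f0)
        e[::period] = 1.0          # one pulse per pitch period
        return e
    rng = np.random.default_rng(0)
    return rng.standard_normal(n_samples) * 0.1

def all_pole_filter(e, a):
    """x[n] = e[n] + sum_k a[k] * x[n-1-k] — a simple LTI system h(n)."""
    x = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc += ak * x[n - 1 - k]
        x[n] = acc
    return x

fs = 16000
a = [1.3, -0.8]   # hypothetical resonator coefficients (stable: poles inside unit circle)
voiced_part = all_pole_filter(excitation(800, 100.0, fs, voiced=True), a)
unvoiced_part = all_pole_filter(excitation(800, 0.0, fs, voiced=False), a)
speech = np.concatenate([voiced_part, unvoiced_part])
```

Switching the excitation between pulse train and noise while keeping the same filter is exactly the V/UV switch drawn in the diagram.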

Parametric models of speech signal

Autoregressive (AR) model:

H(z) = K / (1 − Σ_{m=1}^{M} c(m) z^{−m})

Exponential (EX) model:

H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]

• p(x | c): EX model → (ML-based) cepstral analysis [6]
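For the AR case, the ML solution reduces to the classic autocorrelation (Yule-Walker) method. A minimal sketch — the AR(2) coefficients used for the self-check are arbitrary test values, not from the talk:

```python
import numpy as np

def lpc(x, order):
    """Linear predictive analysis via the autocorrelation (Yule-Walker) method."""
    x = np.asarray(x, dtype=float)
    # Autocorrelation r[0..order]
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R c = r[1:], with c(m) as in
    # H(z) = K / (1 - sum_m c(m) z^-m)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    c = np.linalg.solve(R, r[1:])
    # Residual power gives the gain K
    err = r[0] - np.dot(c, r[1:])
    return c, np.sqrt(err)

# Self-check: a known AR(2) process should be recovered approximately
rng = np.random.default_rng(1)
e = rng.standard_normal(50000)
x = np.zeros_like(e)
for n in range(len(e)):
    x[n] = e[n] + (1.3 * x[n - 1] if n >= 1 else 0.0) + (-0.8 * x[n - 2] if n >= 2 else 0.0)
c, K = lpc(x, 2)
print(np.round(c, 2))   # close to [1.3, -0.8]
```

Production systems use Levinson-Durbin recursion instead of a dense solve, but the normal equations are the same.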

Examples of speech spectra

[Figure: two log-magnitude spectra (−20 to 80 dB) over 0–5 kHz: (a) ML-based cepstral analysis, (b) linear prediction]


Structure of state-output (observation) vectors

[Diagram: state-output vector o_t]

• Spectrum part: mel-cepstral coefficients c_t, Δ mel-cepstral coefficients Δc_t, ΔΔ mel-cepstral coefficients Δ²c_t

• Excitation part: log F0 p_t, Δ log F0 Δp_t, ΔΔ log F0 Δ²p_t
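A sketch of assembling one such state-output vector o_t. For clarity it uses simple backward differences for the deltas; real systems typically compute them with symmetric regression windows:

```python
import numpy as np

def make_observation(c, p, t):
    """Stack statics and first/second differences for frame t.

    c: (T, M) mel-cepstral coefficients; p: (T,) log F0.
    Returns o_t = [c_t, Δc_t, Δ²c_t, p_t, Δp_t, Δ²p_t].
    """
    dc, ddc = c[t] - c[t - 1], c[t] - 2 * c[t - 1] + c[t - 2]
    dp, ddp = p[t] - p[t - 1], p[t] - 2 * p[t - 1] + p[t - 2]
    return np.concatenate([c[t], dc, ddc, [p[t], dp, ddp]])

T, M = 10, 3
c = np.arange(T * M, dtype=float).reshape(T, M)   # toy spectral trajectory
p = np.linspace(4.0, 5.0, T)                      # toy log F0 trajectory
o_t = make_observation(c, p, t=5)
print(o_t.shape)   # → (12,)  i.e. 3*M + 3
```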

Hidden Markov model (HMM)

[Diagram: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(o_t), b2(o_t), b3(o_t)]

O: observation sequence o1 o2 o3 o4 o5 … oT
Q: state sequence 1 1 1 1 2 2 3 3 …

Multi-stream HMM structure

[Diagram: o_t split into S = 4 streams — stream 1 (spectrum): c_t, Δc_t, Δ²c_t = o_t¹; streams 2–4 (excitation): p_t = o_t², Δp_t = o_t³, Δ²p_t = o_t⁴]

b_j(o_t) = Π_{s=1}^{S} ( b_j^{s}(o_t^{s}) )^{w_s}
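The stream-weighted output probability above can be sketched in the log domain (diagonal Gaussians per stream; the dimensions, means, and weights below are illustrative only):

```python
import numpy as np

def log_gauss(o, mu, var):
    """Log density of a diagonal Gaussian."""
    o, mu, var = map(np.atleast_1d, (o, mu, var))
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o - mu) ** 2 / var)

def multistream_logprob(streams, weights):
    """log b_j(o_t) = sum_s w_s * log b_j^s(o_t^s)."""
    return sum(w * log_gauss(o, mu, var)
               for w, (o, mu, var) in zip(weights, streams))

# Stream 1: spectrum (3-dim); streams 2-4: log F0 and its deltas (1-dim each)
streams = [
    (np.zeros(3), np.zeros(3), np.ones(3)),   # c_t, Δc_t, Δ²c_t
    (0.1, 0.0, 1.0),                          # p_t
    (0.0, 0.0, 1.0),                          # Δp_t
    (0.0, 0.0, 1.0),                          # Δ²p_t
]
lp = multistream_logprob(streams, weights=[1.0, 1.0, 1.0, 1.0])
```

With all stream weights set to 1, this is just the joint log likelihood of independent streams; the exponents w_s let a system rebalance spectrum vs. excitation.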

Training process

data & labels ↓

monophone (context-independent, CI) stage:
1. Compute variance floor (HCompV)
2. Initialize CI-HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by the EM algorithm (HRest & HERest)

fullcontext (context-dependent, CD) stage:
4. Copy CI-HMMs to CD-HMMs (HHEd CL)
5. Reestimate CD-HMMs by the EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by the EM algorithm (HERest)
8. Untie the parameter-tying structure (HHEd UT) → estimated HMMs

duration stage:
9. Estimate CD duration models from forward-backward statistics (HERest)
10. Decision tree-based clustering (HHEd TB) → estimated duration models

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models.

Decision tree-based state clustering [7]

[Diagram: binary decision tree — root question "L = voiced?"; the yes branch asks "L = w?" and "L = gy?", the no branch asks "R = silence?"; fullcontext models (w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n, …) are routed to leaf nodes, which define the synthesized states]
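The lookup such a tree performs can be sketched as follows. The question set, voicing table, and leaf names are hypothetical, loosely following the figure, not the real HTS question set:

```python
# Minimal sketch of decision tree-based state clustering lookup.

def parse_label(label):
    """Split a triphone-style label 'l-c+r' into its contexts."""
    left, rest = label.split("-")
    center, right = rest.split("+")
    return left, center, right

VOICED = {"a", "b", "g", "gy", "n", "w"}   # illustrative voicing table

def cluster(label):
    """Walk the (hard-coded, illustrative) tree to a leaf node."""
    left, _, right = parse_label(label)
    if left in VOICED:                 # "L = voiced?"
        if left == "w":                # "L = w?"
            return "leaf_1"
        return "leaf_2" if left == "gy" else "leaf_3"
    return "leaf_4" if right == "sil" else "leaf_5"

for lab in ["w-a+sil", "gy-a+pau", "t-a+n"]:
    print(lab, "->", cluster(lab))
```

All fullcontext models reaching the same leaf share one output distribution, which is how unseen contexts still get a model at synthesis time.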

Stream-dependent tree-based clustering

[Diagram: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

State duration models [8]

[Diagram: occupancy of state i from t = t0 to t1; t = 1 … T = 8]

Probability to enter state i at t0, then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Stream-dependent tree-based clustering

[Diagram: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for the state duration models]


Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)


Best state sequence

[Diagram: 3-state left-to-right HMM (π1; a11, a12, a22, a23, a33; b1(o_t), b2(o_t), b3(o_t))]

O: observation sequence o1 o2 o3 o4 o5 … oT
Q: state sequence 1 1 1 1 2 2 3 3 …
D: state durations 4, 10, 5

Best state outputs w/o dynamic features

[Figure: per-state output distributions (mean & variance) and the generated sequence]

ô becomes a step-wise mean vector sequence.
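Without dynamic features, generation degenerates to repeating each state's mean for its duration. A one-line sketch (state means and durations are toy values):

```python
import numpy as np

def stepwise_means(state_means, durations):
    """Without dynamic features, the ML trajectory is each state's mean
    repeated for its duration -> a piecewise-constant (step-wise) sequence."""
    return np.repeat(np.asarray(state_means), durations)

# Three states with durations matching the previous slide (4, 10, 5 frames)
traj = stepwise_means([0.2, 1.0, -0.5], durations=[4, 10, 5])
print(len(traj))   # → 19
```

The discontinuities at state boundaries are exactly what the dynamic feature constraints on the following slides remove.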

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t⊤, Δc_t⊤]⊤,  Δc_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form as o = W c:

[… c_{t−1}⊤ Δc_{t−1}⊤ c_t⊤ Δc_t⊤ c_{t+1}⊤ Δc_{t+1}⊤ …]⊤ = W [… c_{t−2}⊤ c_{t−1}⊤ c_t⊤ c_{t+1}⊤ …]⊤

where each static row block of W is (… 0 I 0 0 …) and each delta row block is (… −I I 0 0 …).

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q, λ)  subject to  o = W c

If the state-output distribution is a single Gaussian,

p(o | q, λ) = N(o; μ_q, Σ_q)

By setting ∂ log N(W c; μ_q, Σ_q) / ∂c = 0,

W⊤ Σ_q⁻¹ W c = W⊤ Σ_q⁻¹ μ_q
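The resulting linear system can be solved directly. A minimal 1-D sketch with static + delta features, using the slide's Δc_t = c_t − c_{t−1} and diagonal covariances (the means and variances below are toy values):

```python
import numpy as np

def mlpg(means, variances):
    """Solve W' Σ^-1 W c = W' Σ^-1 μ for a 1-dim static + delta stream.

    means, variances: (T, 2) per-frame [static, delta] Gaussian parameters.
    """
    T = means.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0              # static row: selects c_t
        W[2 * t + 1, t] = 1.0          # delta row: c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)          # diagonal Σ^-1
    A = W.T @ (prec[:, None] * W)               # W' Σ^-1 W (banded, PD)
    b = W.T @ (prec * mu)                       # W' Σ^-1 μ
    return np.linalg.solve(A, b)

# Two states: static means 0 then 1; a small delta variance penalizes jumps,
# so the solution is a smooth transition rather than a hard step.
means = np.array([[0.0, 0.0]] * 5 + [[1.0, 0.0]] * 5)
variances = np.full((10, 2), [1.0, 0.1])
c = mlpg(means, variances)
```

Real implementations exploit the band structure of W⊤Σ⁻¹W (e.g., Cholesky on a banded matrix) instead of a dense solve, and include Δ² rows as well.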


Speech parameter generation algorithm [9]

[Figure: structure of the linear system W⊤ Σ_q⁻¹ W c = W⊤ Σ_q⁻¹ μ_q — W⊤ Σ_q⁻¹ W is a banded positive-definite matrix over c_1 … c_T, built from the 1/−1 pattern of W, and the right-hand side stacks the state means μ_{q,1} … μ_{q,T} weighted by the precisions]

Generated speech parameter trajectory

[Figure: static and dynamic means/variances per state, and the generated smooth static feature trajectory c]


Waveform reconstruction

[Diagram: generated excitation parameters (log F0 with V/UV) → pulse train / white noise → excitation e(n); generated spectral parameters (cepstrum, LSP) → linear time-invariant system h(n); x(n) = h(n) ∗ e(n) → synthesized speech]

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Adaptation (mimicking voice) [13]

[Diagram: training speakers → adaptive training → average-voice model; average-voice model + adaptation → target speakers]

• Train an average voice model (AVM) from training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style → small cost to create new voices

Interpolation (mixing voice) [14 15 16 17]

[Diagram: HMM sets λ1 … λ4 interpolated into λ′ with ratios I(λ′, λ1) … I(λ′, λ4)]

λ: HMM set, I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Diagram: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction
  The harmonic effect often causes problems
  [Figure: power (dB) vs. frequency, 0–8 kHz, with harmonic ripple]

• Phase
  Important, but usually ignored

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics: statistics do not vary within an HMM state

• Conditional independence assumption: the state-output probability depends only on the current state

• Weak duration modeling: the state-duration probability decreases exponentially with time

None of them hold for real speech.

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectrograms, 0–8 kHz]

• Why?
  − Details of the spectral (formant) structure disappear
  − Use of a better AM relaxes the issue, but not enough

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics (adaptation; interpolation: eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Linguistic rarr acoustic mapping

• Training: learn the relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential.


HMM-based acoustic modeling for SPSS [4]

[Diagram: decision trees partitioning the acoustic space via yes/no context questions]

• Decision tree-clustered HMMs with GMM state-output distributions

DNN-based acoustic modeling for SPSS [18]

[Diagram: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces the decision trees and GMMs

Framework

[Diagram: TEXT → text analysis → input feature extraction (binary & numeric features, duration feature, frame position feature, for frames 1 … T; durations from duration prediction) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations → integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform


Framework

Is this new? ... no:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithm

• Statistical parametric speech synthesis techniques


Experimental setup

Database: US English, female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in the cepstrum domain [25]

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4 × 512)]

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters.

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

HMM (α) | DNN (#layers × #units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | −9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | −5.1
12.7 (1) | 36.6 (4 × 1024) | 50.7 | < 10⁻⁶ | −11.5

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Limitations of DNN-based acoustic modeling

[Figure: bimodal data samples in (y1, y2) space vs. the NN prediction collapsing to the conditional mean]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with an MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]


Mixture density network [26]

[Diagram: 1-dim, 2-mix MDN — the output layer parameterizes a 2-component GMM over y]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_i h_i w_{ij}

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)    w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
μ_1(x) = z_3    μ_2(x) = z_4
σ_1(x) = exp(z_5)    σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
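The activation functions above can be sketched directly for the 2-mix, 1-dim case (z is a made-up output vector, not from any trained network):

```python
import numpy as np

def mdn_params(z):
    """Map network outputs z1..z6 to 2-mixture GMM parameters over a 1-dim y,
    following the activation functions on the slide."""
    logits, mu, log_sigma = z[0:2], z[2:4], z[4:6]
    w = np.exp(logits) / np.exp(logits).sum()   # softmax -> mixture weights
    sigma = np.exp(log_sigma)                   # exponential -> std devs (positive)
    return w, mu, sigma

def mdn_density(y, z):
    """p(y | x) as the weighted sum of Gaussian components."""
    w, mu, sigma = mdn_params(z)
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.dot(w, comp))

z = np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])   # equal weights, means ±1, unit stds
print(mdn_params(z)[0])   # → [0.5 0.5]
```

Training minimizes the negative log of this density, so the network can place probability mass on several output modes instead of averaging them.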

DMDN-based SPSS [27]

[Diagram: TEXT → text analysis → input feature extraction → duration prediction → per-frame mixture density outputs w_i(x_t), μ_i(x_t), σ_i(x_t) for t = 1 … T → parameter generation → waveform synthesis → SPEECH]

Experimental setup

• Almost the same as the previous setup

• Differences:
  − DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
  − DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
  − Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix:   3.537 ± 0.113
               2 mix:   3.397 ± 0.115
DNN            4×1024:  3.635 ± 0.127
               5×1024:  3.681 ± 0.109
               6×1024:  3.652 ± 0.108
               7×1024:  3.637 ± 0.129
DMDN (4×1024)  1 mix:   3.654 ± 0.117
               2 mix:   3.796 ± 0.107
               4 mix:   3.766 ± 0.113
               8 mix:   3.805 ± 0.113
               16 mix:  3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

(Figure: an RNN unrolled in time — input x_t, output y_t, with recurrent connections between the hidden states at t−1, t, t+1.)

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• An RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

(Figure: LSTM memory block — a linear memory cell c_t with tanh input/output squashing; an input gate i_t (write), an output gate (read), and a forget gate (reset), each a sigmoid of x_t, h_{t−1}, and a bias (b_i, b_o, b_f; b_c for the cell input); block output h_t.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
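One step of the standard LSTM cell described above can be sketched as follows. This is a generic textbook-style sketch, not the deck's implementation; the stacked parameter layout of W and b is an assumption made for compactness.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step. Hypothetical layout: W maps [x_t; h_prev] to the
    stacked [input, forget, output, candidate] pre-activations, each size n."""
    n = h_prev.size
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[:n])           # input gate: write
    f = sigmoid(z[n:2 * n])      # forget gate: reset
    o = sigmoid(z[2 * n:3 * n])  # output gate: read
    g = np.tanh(z[3 * n:])       # candidate cell update
    c_t = f * c_prev + i * g     # linear memory cell
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.1, (16, 7))  # n = 4 hidden units, 3 input dims
b = np.zeros(16)
h1, c1 = lstm_cell(np.zeros(3), np.zeros(4), np.zeros(4), W, b)
```

The key point from the slide is visible in the code: the cell state c_t is updated linearly (f·c + i·g), so gradients can flow over long spans, while the gates decide what is written, read, and reset.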

LSTM-based SPSS [33 34]

(Figure: LSTM-based SPSS — TEXT → text analysis / input feature extraction → duration prediction → an LSTM maps linguistic features x_1, x_2, …, x_T to acoustic features y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UK Speech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Text-to-speech as sequence-to-sequence mapping

• Automatic speech recognition (ASR): Speech (continuous time series) → Text (discrete symbol sequence)

• Machine translation (MT): Text (discrete symbol sequence) → Text (discrete symbol sequence)

• Text-to-speech synthesis (TTS): Text (discrete symbol sequence) → Speech (continuous time series)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 1 of 79


Speech production process

(Figure: speech production — modulation of a carrier wave by speech information; text (concept) determines the fundamental frequency (voiced/unvoiced), magnitude, start–end timing, and frequency transfer characteristics; the air flow from the sound source (voiced: pulse, unvoiced: noise) is shaped by the vocal tract into speech.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 2 of 79

Typical flow of TTS system

TEXT
→ Text analysis (NLP, frontend; discrete ⇒ discrete): sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation, prosody prediction
→ Speech synthesis (speech, backend; discrete ⇒ continuous): waveform generation
→ SYNTHESIZED SPEECH

This talk focuses on the backend.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 3 of 79

Concatenative speech synthesis

(Figure: unit selection — a path through all candidate segments, scored by target cost and concatenation cost.)

• Concatenate actual instances of speech from a database

• Large data + automatic learning → high-quality synthetic voices can be built automatically

• Single inventory per unit → diphone synthesis [1]

• Multiple inventories per unit → unit selection synthesis [2]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 4 of 79

Statistical parametric speech synthesis (SPSS) [3]

(Figure: SPSS pipeline — speech analysis and text analysis yield acoustic features y and linguistic features x for model training λ; at synthesis time, text analysis, parameter generation, and speech synthesis produce speech.)

• Training
  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y):

    λ̂ = arg max_λ p(y | x, λ)

• Synthesis
  − Extract x from the text to be synthesized
  − Generate the most probable y from λ̂:

    ŷ = arg max_y p(y | x, λ̂)

  − Reconstruct speech from ŷ

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79


Statistical parametric speech synthesis (SPSS) [3]

(Figure: the same SPSS pipeline.)

• Large data + automatic training → automatic voice building

• Parametric representation of speech → flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model → HMM-based speech synthesis system (HTS) [4]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

HMM-based speech synthesis [4]

(Figure: HMM-based speech synthesis system.
Training part: SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → generated excitation parameters drive excitation generation, generated spectral parameters drive the synthesis filter → SYNTHESIZED SPEECH.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79



Source-filter model

(Figure: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) ∗ e(n).)

x(n) = h(n) ∗ e(n)
  ↓ Fourier transform
X(e^{jω}) = H(e^{jω}) E(e^{jω})

E(e^{jω}): source (excitation) part; H(e^{jω}): vocal tract resonance part.
H(e^{jω}) should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
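The source-filter relation x(n) = h(n) ∗ e(n) can be demonstrated numerically. This is a toy sketch: the pulse rate, signal length, and decaying impulse response are made-up values, not parameters from the deck.

```python
import numpy as np

fs, f0 = 16000, 100                 # sample rate and pitch (toy values)
e = np.zeros(800)
e[::fs // f0] = 1.0                 # voiced excitation: pulse train at 100 Hz
h = 0.97 ** np.arange(50)           # toy decaying LTI impulse response h(n)
x = np.convolve(e, h)               # speech-like output x(n) = h(n) * e(n)
```

In the frequency domain the same operation is a product, X(e^{jω}) = H(e^{jω})E(e^{jω}), which is why modeling H (the spectral envelope) and e (the excitation) separately reconstructs the waveform.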

Parametric models of speech signal

Autoregressive (AR) model:

  H(z) = K / (1 − Σ_{m=0}^{M} c(m) z^{−m})

Exponential (EX) model:

  H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

  ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]

• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
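Under the exponential model, the log spectrum at the FFT bins is just the Fourier transform of the cepstral coefficients, which is easy to sketch. The coefficients below are toy values, not the output of ML-based cepstral analysis.

```python
import numpy as np

# Exponential (cepstral) model: log|H(e^{jw})| = Re sum_m c(m) e^{-jwm}.
c = np.array([1.0, 0.5, 0.25, 0.1])   # toy cepstral coefficients c(0..3)
n_fft = 64
C = np.zeros(n_fft)
C[:c.size] = c
log_H = np.real(np.fft.fft(C))        # evaluate the cepstral sum on the unit circle
H = np.exp(log_H)                     # exponential model -> always-positive spectrum
```

At ω = 0 this gives H = exp(Σ c(m)), and the exp guarantees a valid (positive) spectral envelope for any coefficient values, which is one attraction of the EX parameterization.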

Examples of speech spectra

(Figure: example speech spectra, log magnitude (dB) vs. frequency 0–5 kHz — (a) ML-based cepstral analysis, (b) linear prediction.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79


Structure of state-output (observation) vectors

(Figure: structure of the state-output vector o_t —
Spectrum part: mel-cepstral coefficients c_t, ∆c_t, ∆∆c_t;
Excitation part: log F0 p_t, ∆ log F0 δp_t, ∆∆ log F0 δ²p_t.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

(Figure: a 3-state left-to-right HMM with initial probability π_1, transitions a_11, a_12, a_22, a_23, a_33, and state-output distributions b_1(o_t), b_2(o_t), b_3(o_t).

Observation sequence O = o_1, o_2, …, o_T; state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, ….)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
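For a fixed state alignment, the joint likelihood of an observation sequence and state sequence under such an HMM is a product of transition probabilities and state-output densities. The numbers below (transitions, Gaussian means, variance, observations) are toy values for illustration only.

```python
import numpy as np

# Toy 3-state left-to-right HMM with scalar Gaussian state outputs b_i(o_t).
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
means, var = np.array([0.0, 1.0, 2.0]), 0.5

def b(i, o):
    """Gaussian state-output density b_i(o)."""
    return np.exp(-(o - means[i]) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

O = np.array([0.1, -0.2, 0.9, 1.1, 2.2])  # toy observations
Q = [0, 0, 1, 1, 2]                        # a fixed left-to-right alignment
p = b(Q[0], O[0])
for t in range(1, len(O)):
    p *= A[Q[t - 1], Q[t]] * b(Q[t], O[t])  # p(O, Q) = prod a_{q_{t-1} q_t} b_{q_t}(o_t)
```

Summing this quantity over all alignments Q gives p(O | λ), which is what the forward-backward algorithm computes efficiently.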

Multi-stream HMM structure

(Figure: multi-stream HMM structure — the state-output vector o_t is split into streams: stream 1: c_t, ∆c_t, ∆∆c_t (spectrum); streams 2–4: p_t, δp_t, δ²p_t (excitation).)

b_j(o_t) = Π_{s=1}^{S} (b_{sj}(o_t^s))^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79

Training process

Given data & labels:

1. Compute variance floor (HCompV)
2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by the EM algorithm (HRest & HERest)
4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by the EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by the EM algorithm (HERest)
8. Untie the parameter-tying structure (HHEd UT)
9. Estimate CD duration models from forward-backward statistics (HERest)
10. Decision tree-based clustering (HHEd TB)

→ estimated HMMs and duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes in {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

(Figure: decision tree-based state clustering — binary context questions such as "L=voice?", "L=w?", "R=silence?", "L=gy?" split fullcontext states (e.g., w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) down to leaf nodes, which define the synthesized states.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

(Figure: separate decision trees for mel-cepstrum and for F0.)

Spectrum & excitation can have different context dependency
→ build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

(Figure: state occupancy — entering state i at t0 and staying through t1, with t = 1, …, T = 8.)

Probability to enter state i at t0 and leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

(Figure: stream-dependent tree-based clustering — decision trees for mel-cepstrum, for F0, and a decision tree for state duration models, all attached to the HMM.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79


Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words w:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)

ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

(Figure: the same 3-state left-to-right HMM as before — observation sequence O, state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …, and state durations D = 4, 10, 5.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs (w/o dynamic features)

(Figure: with only static features, ô becomes a step-wise mean vector sequence, with variance bands around the state means.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^⊤, ∆c_t^⊤]^⊤,   ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form, o = Wc: for each frame t, the static row block of W selects c_t (block I) and the dynamic row block computes ∆c_t = c_t − c_{t−1} (blocks −I, I), giving

[…, c_{t−1}, ∆c_{t−1}, c_t, ∆c_t, c_{t+1}, ∆c_{t+1}, …]^⊤ = W [… , c_{t−2}, c_{t−1}, c_t, c_{t+1}, …]^⊤

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
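The window matrix W can be built explicitly. This sketch uses only the simple first-difference delta window from the slide (∆c_t = c_t − c_{t−1}, with ∆c_1 = c_1 at the boundary); real systems typically use wider regression windows.

```python
import numpy as np

def build_w(T, M):
    """W maps static features c (T*M,) to stacked [static; delta] o (2*T*M,)."""
    W = np.zeros((2 * T * M, T * M))
    I = np.eye(M)
    for t in range(T):
        W[2 * t * M:(2 * t + 1) * M, t * M:(t + 1) * M] = I           # static row: c_t
        W[(2 * t + 1) * M:(2 * t + 2) * M, t * M:(t + 1) * M] = I     # delta row: +c_t
        if t > 0:
            W[(2 * t + 1) * M:(2 * t + 2) * M, (t - 1) * M:t * M] = -I  # delta row: -c_{t-1}
    return W

W = build_w(3, 2)
o = W @ np.arange(6.0)   # toy static trajectory c, 3 frames of 2 dims
```

Each frame of o contains the static block followed by the delta block, exactly the o_t = [c_t; ∆c_t] layout above.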

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
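Solving the normal equations above can be sketched end-to-end for a 1-dimensional static feature. The means and variances below are toy stand-ins for state-output statistics, and a dense solver is used for clarity; in practice the band structure of W^⊤Σ^{−1}W is exploited.

```python
import numpy as np

T = 4
# W stacks, per frame, a static row (c_t) and a delta row (c_t - c_{t-1}).
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

# Toy state means [static, delta] per frame and diagonal variances.
mu = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, -1.0])
Sinv = np.diag(1.0 / np.full(2 * T, 0.5))

# Normal equations: W^T Sigma^-1 W c = W^T Sigma^-1 mu
c = np.linalg.solve(W.T @ Sinv @ W, W.T @ Sinv @ mu)
```

Because the delta rows couple adjacent frames, the solved trajectory c is smooth rather than the step-wise sequence obtained from static features alone.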


Speech parameter generation algorithm [9]

(Figure: numerical illustration of W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} μ_q — the left-hand matrix is band-diagonal, so the static feature sequence c = c_1, …, c_T can be solved efficiently from the stacked state means μ_{q_1}, …, μ_{q_T}.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

(Figure: generated speech parameter trajectory c — static and dynamic tracks shown against state means and variances; the trajectory is smooth rather than step-wise.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79


Waveform reconstruction

(Figure: generated excitation parameters (log F0 with V/UV) drive a pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n).)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

(Figure: adaptive training over training speakers yields an average-voice model, which adaptation then maps to target speakers.)

• Train an average voice model (AVM) from training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
→ small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

(Figure: a new model λ′ interpolated among HMM sets λ_1, …, λ_4 with interpolation ratios I(λ′, λ_k).)

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
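The core of interpolation can be sketched for a single Gaussian mean: the new voice's parameter is a convex combination of the representative models' parameters. The means and ratios below are toy values, and real systems interpolate full model sets, not one mean.

```python
import numpy as np

# Representative models' means (toy 2-dim examples standing in for lambda_1..lambda_4).
mus = [np.array([0.0, 1.0]), np.array([2.0, -1.0]),
       np.array([1.0, 0.0]), np.array([-1.0, 2.0])]
ratios = np.array([0.4, 0.3, 0.2, 0.1])   # hypothetical ratios I(lambda', lambda_k)

# New voice lambda' as a weighted combination of the representatives.
mu_new = sum(r * m for r, m in zip(ratios, mus))
```

Varying the ratios moves λ′ continuously between the representative voices, which is how new voices are obtained without any adaptation data.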

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model mixes of V/UV sounds (e.g., voiced fricatives)
  (Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced).)

• Spectral envelope extraction
  Harmonic effects often cause problems
  (Figure: power (dB) vs. frequency 0–8 kHz, showing harmonic structure.)

• Phase
  Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State-output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: spectrograms (0–8 kHz) of natural vs. generated speech]

• Why?

  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
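The generation step mentioned on this slide has a closed form; the following is the standard maximum-likelihood parameter-generation (MLPG) formulation from the literature (notation is the conventional one, not defined on this slide):

```latex
% Static feature sequence c, dynamic-feature window matrix W, so that o = Wc.
% For Gaussian state outputs with mean mu and covariance Sigma along the state
% sequence, maximizing p(Wc | lambda) over c gives:
\hat{c} = \arg\max_{c}\; \mathcal{N}\!\left(Wc \mid \mu, \Sigma\right)
        = \left(W^{\top}\Sigma^{-1}W\right)^{-1} W^{\top}\Sigma^{-1}\mu
```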

Oversmoothing compensation

• Postfiltering

  − Mel-cepstrum
  − LSP

• Nonparametric approach

  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics

  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

  − Flexibility to change voice characteristics
    (adaptation, interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback

  − Quality

• Major factors for quality degradation [3]

  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training: learn the relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS

  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than in ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree-clustered HMMs over the acoustic space, with yes/no questions at each node]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
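As a toy sketch of the mapping on this slide (not the actual system; layer sizes, input values, and the sigmoid/linear layout are illustrative assumptions loosely following the experimental setup later in the deck), a feedforward network maps a linguistic feature vector to an acoustic feature vector frame by frame:

```python
import math
import random

random.seed(0)

def init_layer(n_in, n_out):
    # Small random weights and zero biases for one fully connected layer.
    return ([[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

def forward(x, layers):
    # Sigmoid hidden layers, linear output layer.
    h = x
    for i, (W, b) in enumerate(layers):
        z = [sum(w * v for w, v in zip(row, h)) + bi for row, bi in zip(W, b)]
        h = z if i == len(layers) - 1 else [1.0 / (1.0 + math.exp(-v)) for v in z]
    return h

# Hypothetical sizes: 10 binary/numeric linguistic inputs, 16 hidden units,
# 4 acoustic outputs (e.g., a few mel-cepstral coefficients).
layers = [init_layer(10, 16), init_layer(16, 4)]
x = [1.0, 0.0, 1.0, 0.0, 0.5, 0.2, 0.0, 1.0, 0.3, 0.7]
y = forward(x, layers)  # one acoustic vector per input frame
```

In the full system this forward pass runs once per frame, and the outputs are treated as statistics for the parameter generation step.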

Framework

[Figure: DNN-based TTS framework. TEXT → text analysis → input feature extraction (binary & numeric linguistic features, duration feature, frame-position feature, for frames 1 … T; duration prediction supplies the alignment) → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → feature extraction integrated into acoustic modeling

• Distributed representation

  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production

  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No.

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database              US English, female speaker
Training / test data  33,000 & 173 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic features   11 categorical features, 25 numeric features
Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology          5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture      1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing        Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions & numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient trajectories over frames 0–500: natural speech, HMM (α = 1), DNN (4 × 512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    -9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    -5.1
12.7 (1)    36.6 (4 × 1,024)       50.7      < 10⁻⁶    -11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples in (y1, y2) space vs. the NN prediction, which sits at the conditional mean]

• Unimodality

  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with MSE loss → approximates the conditional mean

• Lack of variance

  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN. The output layer yields mixture weights w1(x), w2(x), means µ1(x), µ2(x), and standard deviations σ1(x), σ2(x) of a GMM over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions (1-dim, 2-mix MDN):

z_j = Σ_{i=1}^{4} h_i w_ij

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)

µ1(x) = z3    µ2(x) = z4

σ1(x) = exp(z5)    σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
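The output-layer computation above can be sketched in a few lines of pure Python (the 1-dim, 2-mix layout follows the slide; the six pre-activation values are made up for illustration):

```python
import math

def mdn_params(z):
    """Map 6 network pre-activations to the parameters of a 1-dim, 2-mix GMM,
    using the activations from the slide: softmax for weights, identity for
    means, exp for standard deviations."""
    z1, z2, z3, z4, z5, z6 = z
    denom = math.exp(z1) + math.exp(z2)          # softmax normalizer
    weights = (math.exp(z1) / denom, math.exp(z2) / denom)
    means = (z3, z4)                              # linear activation
    sigmas = (math.exp(z5), math.exp(z6))         # exp keeps sigma positive
    return weights, means, sigmas

w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.5])
```

The exponential on the variance units guarantees positive standard deviations, and the softmax guarantees the mixture weights sum to one, so every network output is a valid GMM.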

DMDN-based SPSS [27]

[Figure: DMDN-based SPSS pipeline. TEXT → text analysis → input feature extraction → duration prediction → per-frame linguistic inputs x1, x2, …, xT → DMDN outputs w1(xt), w2(xt), µ1(xt), µ2(xt), σ1(xt), σ2(xt) for each frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture    4–7 hidden layers, 1024 units/hidden layer,
                    ReLU (hidden), linear (output)
DMDN architecture   4 hidden layers, 1024 units/hidden layer,
                    ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization        AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM              1 mix      3.537 ± 0.113
                 2 mix      3.397 ± 0.115
DNN              4 × 1024   3.635 ± 0.127
                 5 × 1024   3.681 ± 0.109
                 6 × 1024   3.652 ± 0.108
                 7 × 1024   3.637 ± 0.129
DMDN (4 × 1024)  1 mix      3.654 ± 0.117
                 2 mix      3.796 ± 0.107
                 4 mix      3.766 ± 0.113
                 8 mix      3.805 ± 0.113
                 16 mix     3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features

  − A fixed number of preceding/succeeding contexts
    (e.g., ±2 phonemes / syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: basic RNN. Inputs x_{t−1}, x_t, x_{t+1} map to outputs y_{t−1}, y_t, y_{t+1}, with recurrent connections between successive hidden states]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts

  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block. A linear memory cell c_t is surrounded by an input gate i_t (controls writes), an output gate (controls reads), and a forget gate (resets the cell). Gates take x_t, h_{t−1}, and biases b_i, b_f, b_o, b_c through sigmoid activations; the cell input and output pass through tanh, producing the block output h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
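The gate structure above can be sketched in a few lines (a scalar toy version, not the projected 256-unit layer from the experiments later in the deck; the parameter names and values are made up for illustration):

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a standard LSTM cell for scalar input/state; p holds
    per-gate input weights (w*), recurrent weights (u*), and biases (b*)."""
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate: write
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate: reset
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate: read
    g = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])  # candidate cell input
    c = f * c_prev + i * g                                   # linear memory cell
    h = o * math.tanh(c)                                     # block output
    return h, c

params = {k: 0.5 for k in
          ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wc", "uc", "bc")}
h, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):  # run a short input sequence through the cell
    h, c = lstm_step(x, h, c, params)
```

The additive update `c = f * c_prev + i * g` is what lets information survive many time steps, in contrast with the purely multiplicative recurrence of the basic RNN.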

LSTM-based SPSS [33 34]

[Figure: LSTM-based SPSS pipeline. TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database             US English, female speaker
Train / dev set      34,632 & 100 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  DNN: 449, LSTM: 289
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                  4 hidden layers, 1024 units/hidden layer,
                     ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM                 1 forward LSTM layer, 256 units, 128 projection,
                     asynchronous SGD on CPUs [35]
Postprocessing       Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      -6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      -1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

  − Learn the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Text-to-speech as sequence-to-sequence mapping

• Automatic speech recognition (ASR):
  Speech (continuous time series) → Text (discrete symbol sequence)

• Machine translation (MT):
  Text (discrete symbol sequence) → Text (discrete symbol sequence)

• Text-to-speech synthesis (TTS):
  Text (discrete symbol sequence) → Speech (continuous time series)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 1 of 79


Speech production process

[Figure: speech production process. Text (concept) controls the fundamental frequency (voiced/unvoiced) and the frequency transfer characteristics (magnitude, start–end); air flow drives a sound source (voiced: pulse, unvoiced: noise), and the vocal tract's transfer characteristics shape it — modulation of a carrier wave by speech information — to produce speech]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 2 of 79

Typical flow of TTS system

[Figure: typical flow of a TTS system. TEXT → text analysis (frontend, NLP; discrete ⇒ discrete): sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation, prosody prediction → speech synthesis (backend, Speech; discrete ⇒ continuous): waveform generation → SYNTHESIZED SPEECH]

This talk focuses on the backend.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 3 of 79

Concatenative speech synthesis

[Figure: unit selection. All candidate segments from the database are scored with target costs and concatenation costs]

• Concatenate actual instances of speech from a database

• Large data + automatic learning
  → High-quality synthetic voices can be built automatically

• Single inventory per unit → diphone synthesis [1]

• Multiple inventories per unit → unit selection synthesis [2]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 4 of 79

Statistical parametric speech synthesis (SPSS) [3]

[Figure: SPSS pipeline. Training: speech → speech analysis → acoustic features y; text → text analysis → linguistic features x; (x, y) → model training → model λ. Synthesis: text → text analysis → x; parameter generation from λ → y → speech synthesis → speech]

• Training

  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y):

    λ̂ = arg max_λ p(y | x, λ)

• Synthesis

  − Extract x from the text to be synthesized
  − Generate the most probable y from λ̂:

    ŷ = arg max_y p(y | x, λ̂)

  − Reconstruct speech from ŷ

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79


Statistical parametric speech synthesis (SPSS) [3]

[Figure: SPSS pipeline — training (speech analysis & text analysis → model training → λ) and synthesis (text analysis → parameter generation → speech synthesis)]

• Large data + automatic training
  → Automatic voice building

• Parametric representation of speech
  → Flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model
→ HMM-based speech synthesis system (HTS) [4]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

HMM-based speech synthesis [4]

[Figure: HMM-based speech synthesis system. Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79


Speech production process

[Figure: speech production process. Text (concept) controls the fundamental frequency (voiced/unvoiced) and the frequency transfer characteristics (magnitude, start–end); air flow drives a sound source (voiced: pulse, unvoiced: noise), and the vocal tract's transfer characteristics shape it — modulation of a carrier wave by speech information — to produce speech]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79

Source-filter model

[Figure: source-filter model. Excitation e(n) (pulse train for voiced, white noise for unvoiced) drives a linear time-invariant system h(n), producing speech x(n) = h(n) ∗ e(n)]

x(n) = h(n) ∗ e(n)

  ↓ Fourier transform

X(e^jω) = H(e^jω) E(e^jω)

E(e^jω): source (excitation) part    H(e^jω): vocal tract resonance part

H(e^jω) should be defined by HMM state-output vectors,
e.g., mel-cepstrum, line spectral pairs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
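The relation x(n) = h(n) ∗ e(n) on this slide can be checked with a minimal sketch (the pulse-train excitation and decaying impulse response are made-up toy values, not speech data):

```python
def convolve(h, e):
    # Full discrete convolution of two finite sequences.
    out = [0.0] * (len(h) + len(e) - 1)
    for n, hv in enumerate(h):
        for m, ev in enumerate(e):
            out[n + m] += hv * ev
    return out

e = [1.0 if n % 4 == 0 else 0.0 for n in range(8)]  # pulse train, period 4
h = [1.0, 0.5, 0.25]                                 # decaying "vocal tract" IR
x = convolve(h, e)
# x[:6] → [1.0, 0.5, 0.25, 0.0, 1.0, 0.5]: each pulse is replaced by a copy of h
```

In the frequency domain this same operation becomes the product X = H·E shown on the slide.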

Parametric models of speech signal

Autoregressive (AR) model:

H(z) = K / ( 1 − Σ_{m=0}^{M} c(m) z^{−m} )

Exponential (EX) model:

H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]

• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79

Examples of speech spectra

[Figure: log magnitude (dB) vs. frequency (0–5 kHz) speech spectra. (a) ML-based cepstral analysis, (b) linear prediction]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79

HMM-based speech synthesis [4]

[Figure: HMM-based speech synthesis system overview, as before — training part (parameter extraction → training HMMs → context-dependent HMMs & state duration models) and synthesis part (text analysis → parameter generation → excitation generation → synthesis filter)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79

Structure of state-output (observation) vectors

[Figure: structure of the state-output (observation) vector o_t.
Spectrum part: mel-cepstral coefficients c_t, ∆ mel-cepstral coefficients ∆c_t, ∆∆ mel-cepstral coefficients ∆²c_t.
Excitation part: log F0 p_t, ∆ log F0 δp_t, ∆∆ log F0 δ²p_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(o_t), b2(o_t), b3(o_t). Observation sequence O = o1, o2, …, oT is aligned to state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79
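The quantities in this diagram combine into the standard HMM likelihood (textbook form; π, a, and b match the slide's symbols):

```latex
p(O \mid \lambda) = \sum_{Q} \pi_{q_1}\, b_{q_1}(o_1)
  \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)
```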

Multi-stream HMM structure

[Figure: multi-stream HMM structure. The observation vector o_t is split into streams: stream 1 — spectrum (c_t, ∆c_t, ∆²c_t); streams 2–4 — excitation (p_t, δp_t, δ²p_t), each with its own state-output distribution b_sj(o_t^s)]

b_j(o_t) = Π_{s=1}^{S} ( b_sj(o_t^s) )^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79

Training process

1. Compute variance floor (HCompV)
2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
9. Estimate CD duration models from forward-backward stats (HERest)
10. Decision tree-based clustering of duration models (HHEd TB)

(Inputs: data & labels; outputs: estimated HMMs and duration models)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• Preceding & succeeding two phonemes
• Position of current phoneme in current syllable
• # of phonemes in preceding, current, succeeding syllable
• Accent & stress of preceding, current, succeeding syllable
• Position of current syllable in current word
• # of preceding & succeeding stressed/accented syllables in phrase
• # of syllables from previous to next stressed/accented syllable
• Guess at part of speech of preceding, current, succeeding word
• # of syllables in preceding, current, succeeding word
• Position of current word in current phrase
• # of preceding & succeeding content words in current phrase
• # of words from previous to next content word
• # of syllables in preceding, current, succeeding phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Diagram: binary decision tree — yes/no questions such as "L=voice?", "L=w?", "R=silence?", "L=gy?" route context-dependent models (e.g. w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) to leaf nodes, from which synthesized states are drawn]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

Decision trees for mel-cepstrum / decision trees for F0

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Diagram: state i occupied from time t₀ to t₁ within t = 1…T = 8]

Probability of entering state i at t₀ and leaving at t₁ + 1:

χ_{t₀,t₁}(i) ∝ Σ_{j≠i} α_{t₀−1}(j) a_{ji} · a_{ii}^{t₁−t₀} Π_{t=t₀}^{t₁} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t₁+1}) β_{t₁+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Diagram: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

(Training/synthesis system diagram repeated from slide 14.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given HMM λ and words w:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Diagram: 3-state left-to-right HMM as before — π₁, a₁₁, a₁₂, a₂₂, a₂₃, a₃₃, b₁(o_t), b₂(o_t), b₃(o_t)]

Observation sequence O: o₁ o₂ o₃ o₄ o₅ … o_T
State sequence Q: 1 1 1 1 2 2 … 3 3
State durations D: 4, 10, 5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
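A trivial sketch of how predicted per-state durations expand into the frame-level state sequence shown above (an illustrative helper, not part of the talk):

```python
def expand_durations(durations):
    """Frame-level state alignment from per-state durations, e.g. the
    slide's durations (4, 2, 2) give Q = 1 1 1 1 2 2 3 3."""
    q = []
    for state, d in enumerate(durations, start=1):
        q.extend([state] * d)  # stay in `state` for d frames
    return q

q = expand_durations([4, 2, 2])
```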

Best state outputs w/o dynamic features

[Figure: per-state means and variances]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [ c_t^T, ∆c_t^T ]^T,  ∆c_t = c_t − c_{t−1}

(c_t: M-dimensional static features; o_t: 2M-dimensional)

The relationship between static and dynamic features can be arranged as o = Wc, where W stacks one static row and one delta row per frame:

  [ …, c_{t−1}, ∆c_{t−1}, c_t, ∆c_t, c_{t+1}, ∆c_{t+1}, … ]^T
      = W [ …, c_{t−2}, c_{t−1}, c_t, c_{t+1}, … ]^T

  block rows of W:   c_t : [ 0  0  I  0 ],   ∆c_t : [ 0  −I  I  0 ]
  (and similarly at t−1 and t+1)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

W^T Σ_q̂^{−1} W c = W^T Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

[Figure: band structure of the linear system W^T Σ_q^{−1} W c = W^T Σ_q^{−1} μ_q over c = [c₁, …, c_T]^T and μ_q = [μ_{q₁}, …, μ_{q_T}]^T — W^T Σ_q^{−1} W is a banded positive-definite matrix, so the system can be solved efficiently]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: generated static and dynamic speech parameter trajectories — per-state means and variances, and the smooth generated trajectory c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79
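The generation algorithm [9] on the preceding slides can be sketched end-to-end for a 1-dimensional static + ∆ stream. This toy version builds W densely and uses a generic solver; production implementations exploit the band structure shown above (e.g. a banded Cholesky solve). Names and the boundary handling are illustrative:

```python
import numpy as np

def mlpg(mu, var):
    """Speech parameter generation for a 1-dim static + delta stream.
    mu, var: (T, 2) per-frame means/variances for [c_t, dc_t].
    Solves W' S^-1 W c = W' S^-1 mu, with dc_t = c_t - c_{t-1}
    (boundary simplified to dc_0 = c_0)."""
    T = mu.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0            # static row: c_t
        W[2 * t + 1, t] = 1.0        # delta row: c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    Sinv = np.diag(1.0 / var.reshape(-1))
    A = W.T @ Sinv @ W               # banded, positive definite
    b = W.T @ Sinv @ mu.reshape(-1)
    return np.linalg.solve(A, b)     # in practice: banded Cholesky

# Step-wise static means (two "states") are smoothed by the delta constraint
mu = np.zeros((6, 2)); mu[3:, 0] = 1.0   # static means 0,0,0,1,1,1; delta means 0
var = np.ones((6, 2))
c = mlpg(mu, var)
```

The result is the smooth trajectory of the figure: instead of jumping from 0 to 1 between states, c ramps across the boundary because the ∆ statistics penalize large frame-to-frame differences.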

HMM-based speech synthesis [4]

(Training/synthesis system diagram repeated from slide 14.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Diagram: pulse train (voiced) / white noise (unvoiced) → excitation e(n) → linear time-invariant system h(n) → synthesized speech x(n) = h(n) ∗ e(n)]

Generated excitation parameters (log F0 with V/UV) drive excitation generation; generated spectral parameters (cepstrum, LSP) define the synthesis filter.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
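The source-filter reconstruction above can be sketched with a toy excitation generator and a stand-in impulse response (real systems derive h(n) from the generated cepstrum via an MLSA-type filter; all names and constants here are illustrative):

```python
import numpy as np

def make_excitation(f0, fs=16000, frame_shift=80, rng=None):
    """Pulse-train (voiced) / white-noise (unvoiced) excitation e(n).
    f0: per-frame F0 in Hz, 0 = unvoiced; frame_shift in samples."""
    rng = rng or np.random.RandomState(0)
    e = np.zeros(len(f0) * frame_shift)
    phase = 0.0
    for i, f in enumerate(f0):
        s = i * frame_shift
        if f > 0:                        # voiced: one pulse every fs/f samples
            period = fs / f
            for n in range(frame_shift):
                phase += 1.0
                if phase >= period:
                    e[s + n] = 1.0
                    phase -= period
        else:                            # unvoiced: white noise
            e[s:s + frame_shift] = rng.randn(frame_shift)
    return e

f0 = [0, 0, 100, 100, 100, 0]            # per-frame V/UV decision + F0
e = make_excitation(f0)
h = np.exp(-np.arange(64) / 8.0)         # stand-in impulse response h(n)
x = np.convolve(e, h)                    # x(n) = h(n) * e(n)
```

This is exactly the "simple pulse/noise excitation" criticized later in the talk: a hard V/UV switch cannot represent mixed sounds such as voiced fricatives.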

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Diagram: adaptive training over training speakers → average-voice model; adaptation → target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Diagram: new HMM set λ′ interpolated among representative HMM sets λ₁–λ₄ with interpolation ratios I(λ′, λ₁)…I(λ′, λ₄)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation:
  Difficult to model mixes of V/UV sounds (e.g. voiced fricatives)

  [Diagram: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction:
  Harmonic effects often cause problems

  [Figure: power (dB) vs frequency (0–8 kHz) with harmonic ripple]

• Phase:
  Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

  [Figure: natural vs generated spectrograms, 0–8 kHz]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training:
  Learn the relationship between linguistic & acoustic features

• Synthesis:
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS:
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than in ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Diagram: decision trees (yes/no questions) partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Diagram: linguistic features x → hidden layers h₁, h₂, h₃ → acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

[Diagram: TEXT → text analysis → input feature extraction → input features (binary & numeric linguistic features, plus duration and frame-position features) at frames 1…T → DNN (input, hidden, output layers) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies per-phoneme frame counts]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
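The frame-level linguistic-to-acoustic mapping in the diagram can be sketched as a plain feedforward pass (sigmoid hidden layers and a linear output, matching the setup described later; sizes and names are illustrative, and training is omitted):

```python
import numpy as np

def dnn_forward(x, weights):
    """Frame-level linguistic -> acoustic mapping: sigmoid hidden layers,
    linear output layer (predicting acoustic feature statistics)."""
    h = x
    for W, b in weights[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden layer
    W, b = weights[-1]
    return h @ W + b                              # linear output layer

rng = np.random.RandomState(0)
sizes = [30, 64, 64, 12]      # e.g. 30 linguistic feats -> 12 acoustic feats
weights = [(0.1 * rng.randn(a, b), np.zeros(b))
           for a, b in zip(sizes[:-1], sizes[1:])]
y = dnn_forward(rng.randn(5, 30), weights)        # 5 frames at once
```

Because every frame shares the same network, the mapping replaces both the decision trees (input partitioning) and the Gaussians' tied means (output prediction) of the HMM system.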

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → feature extraction integrated into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in the cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions & numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500 — natural speech vs HMM (α = 1) vs DNN (4 × 512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α) | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | −9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | −5.1
12.7 (1) | 36.6 (4 × 1024) | 50.7 | < 10⁻⁶ | −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNN-based acoustic modeling

[Figure: one-to-many data samples over (y₁, y₂) vs a unimodal NN prediction]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Diagram: 1-dim, 2-mix mixture density network — network outputs z₁…z₆ map to GMM weights w₁(x), w₂(x), means μ₁(x), μ₂(x), and variances σ₁(x), σ₂(x) over y]

Inputs of activation function: z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights → softmax activation function:
  w₁(x) = exp(z₁) / Σ_{m=1}^{2} exp(z_m),  w₂(x) = exp(z₂) / Σ_{m=1}^{2} exp(z_m)

Means → linear activation function:
  μ₁(x) = z₃,  μ₂(x) = z₄

Variances → exponential activation function:
  σ₁(x) = exp(z₅),  σ₂(x) = exp(z₆)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
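The 1-dim, 2-mix case above can be sketched directly: the six network outputs map to GMM parameters exactly as on the slide, and the training criterion is the negative log-likelihood of the target under that GMM (function names are illustrative):

```python
import numpy as np

def mdn_params(z):
    """Map the 6 outputs of a 1-dim, 2-mix MDN [26] to GMM parameters:
    z1,z2 -> weights (softmax), z3,z4 -> means (linear),
    z5,z6 -> std devs (exponential, guaranteeing positivity)."""
    w = np.exp(z[:2] - z[:2].max())
    w /= w.sum()                      # softmax -> mixture weights
    mu = z[2:4]                       # linear -> means
    sigma = np.exp(z[4:6])            # exp -> positive std devs
    return w, mu, sigma

def mdn_nll(y, z):
    # Negative log-likelihood of target y under the output GMM (the loss).
    w, mu, sigma = mdn_params(z)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())

z = np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])  # equal weights, means -1 and 1
w, mu, sigma = mdn_params(z)
```

With means at −1 and +1 this output distribution is bimodal, which is precisely what the plain MSE-trained DNN of the previous section cannot represent.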

DMDN-based SPSS [27]

[Diagram: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x₁, x₂, …, x_T → DMDN outputs per-frame GMM parameters w₁(x_t), w₂(x_t), μ₁(x_t), μ₂(x_t), σ₁(x_t), σ₂(x_t) over y → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113; 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127; 5×1024 3.681 ± 0.109; 6×1024 3.652 ± 0.108; 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117; 2 mix 3.796 ± 0.107; 4 mix 3.766 ± 0.113; 8 mix 3.805 ± 0.113; 16 mix 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Diagram: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1} → hidden layer with recurrent connections → outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Diagram: LSTM block — x_t and h_{t−1} (with biases b_i, b_f, b_o, b_c) feed sigmoid input, forget, and output gates and a tanh input transform; the memory cell c_t is written via the input gate, reset via the forget gate, and read via the output gate (through tanh) to produce h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
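One step of the LSTM block above can be sketched with explicit gates (a common simplified variant without peephole connections; all names and sizes are illustrative):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step [32]: gates read [x_t, h_{t-1}]; the linear memory
    cell c is written (input gate), reset (forget gate), and read
    (output gate) to produce the new hidden state h_t."""
    v = np.concatenate([x, h_prev])
    i = sigm(p['Wi'] @ v + p['bi'])                  # input gate (write)
    f = sigm(p['Wf'] @ v + p['bf'])                  # forget gate (reset)
    o = sigm(p['Wo'] @ v + p['bo'])                  # output gate (read)
    c = f * c_prev + i * np.tanh(p['Wc'] @ v + p['bc'])
    h = o * np.tanh(c)
    return h, c

rng = np.random.RandomState(0)
nx, nh = 4, 3
p = {k: 0.1 * rng.randn(nh, nx + nh) for k in ('Wi', 'Wf', 'Wo', 'Wc')}
p.update({k: np.zeros(nh) for k in ('bi', 'bf', 'bo', 'bc')})
h, c = lstm_step(rng.randn(nx), np.zeros(nh), np.zeros(nh), p)
```

Because the cell update is additive (f · c_prev + i · …), gradients can survive over many frames, which is what lets the model capture the long-span contexts plain RNNs lose.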

LSTM-based SPSS [33 34]

[Diagram: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x₁, x₂, …, x_T → LSTM → per-frame outputs y₁, y₂, …, y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in the cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z value | p value
50.0 | 14.2 | – | – | 35.8 | 12.0 | < 10⁻¹⁰
– | – | 30.2 | 15.6 | 54.2 | 5.1 | < 10⁻⁶
15.8 | – | 34.0 | – | 50.2 | −6.2 | < 10⁻⁹
28.4 | – | – | 33.6 | 38.0 | −1.5 | 0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.


References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.


Page 4: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Text-to-speech as sequence-to-sequence mapping

• Automatic speech recognition (ASR): Speech (continuous time series) → Text (discrete symbol sequence)

• Machine translation (MT): Text (discrete symbol sequence) → Text (discrete symbol sequence)

• Text-to-speech synthesis (TTS): Text (discrete symbol sequence) → Speech (continuous time series)


Speech production process

[Figure: speech production, from text (concept) to speech, as modulation of a carrier wave by speech information: fundamental frequency (voiced/unvoiced), frequency transfer characteristics, and magnitude (start-end); air flow from a sound source (voiced: pulse, unvoiced: noise) is shaped by the vocal tract into speech]

Typical flow of TTS system

[Diagram: typical flow of a TTS system. TEXT → text analysis (sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation, prosody prediction) → waveform generation → SYNTHESIZED SPEECH. The frontend (NLP, discrete ⇒ discrete) performs text analysis; the backend (speech, discrete ⇒ continuous) performs waveform generation.]

This talk focuses on the backend.

Concatenative speech synthesis

[Figure: selecting from all segments in the database by minimizing target cost and concatenation cost]

• Concatenate actual instances of speech from a database

• Large data + automatic learning → high-quality synthetic voices can be built automatically

• Single inventory per unit → diphone synthesis [1]

• Multiple inventories per unit → unit selection synthesis [2]

Statistical parametric speech synthesis (SPSS) [3]

[Diagram: training maps Text → text analysis → x and Speech → speech analysis → y into an acoustic model λ; synthesis maps Text → text analysis → x, then λ → parameter generation → y → speech synthesis → Speech]

• Training
  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y):

    λ̂ = arg max_λ p(y | x, λ)

• Synthesis
  − Extract x from the text to be synthesized
  − Generate the most probable y from λ̂:

    ŷ = arg max_y p(y | x, λ̂)

  − Reconstruct speech from ŷ


Statistical parametric speech synthesis (SPSS) [3]

[Diagram: the SPSS training/synthesis pipeline (text analysis, speech analysis, model training, parameter generation, speech synthesis)]

• Large data + automatic training → automatic voice building

• Parametric representation of speech → flexible to change voice characteristics

Hidden Markov model (HMM) as its acoustic model → HMM-based speech synthesis system (HTS) [4]

Outline

Background
  • HMM-based statistical parametric speech synthesis (SPSS)
  • Flexibility
  • Improvements

Statistical parametric speech synthesis with neural networks
  • Deep neural network (DNN)-based SPSS
  • Deep mixture density network (DMDN)-based SPSS
  • Recurrent neural network (RNN)-based SPSS

Summary

HMM-based speech synthesis [4]

[Diagram: training part: SPEECH DATABASE → spectral & excitation parameter extraction → spectral/excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral/excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

HMM-based speech synthesis [4]

[Diagram: the same training/synthesis pipeline, repeated to highlight the current stage]

Speech production process

[Figure: the speech production process, repeated: source (fundamental frequency, voiced/unvoiced) and vocal tract frequency transfer characteristics shape air flow into speech]

Source-filter model

[Diagram: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n)]

x(n) = h(n) ∗ e(n)

↓ Fourier transform

X(e^{jω}) = H(e^{jω}) E(e^{jω})

E(e^{jω}): source excitation part; H(e^{jω}): vocal tract resonance part

H(e^{jω}) should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs
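The source-filter relation x(n) = h(n) ∗ e(n) is easy to sketch numerically. The following is a minimal, illustrative Python sketch (not from the slides; the impulse response and parameter values are made up for the example): a pulse train or white noise excitation is convolved with an LTI impulse response.

```python
import numpy as np

def pulse_train(n_samples, period):
    """Voiced excitation: unit impulses every `period` samples."""
    e = np.zeros(n_samples)
    e[::period] = 1.0
    return e

def synthesize(excitation, h):
    """x(n) = h(n) * e(n): pass the excitation through an LTI system."""
    return np.convolve(excitation, h)[:len(excitation)]

# Toy impulse response (a decaying resonance) standing in for the vocal tract.
n = np.arange(64)
h = (0.9 ** n) * np.cos(0.3 * np.pi * n)

rng = np.random.default_rng(0)
voiced = synthesize(pulse_train(400, period=80), h)        # pulse train -> "voiced"
unvoiced = synthesize(0.1 * rng.standard_normal(400), h)   # white noise -> "unvoiced"
```

Changing only the excitation (pulse vs. noise) while keeping h(n) fixed is exactly the voiced/unvoiced switch in the diagram above.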

Parametric models of speech signal

Autoregressive (AR) model:

  H(z) = K / (1 − Σ_{m=0}^{M} c(m) z^{−m})

Exponential (EX) model:

  H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

  ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]

• p(x | c): EX model → (ML-based) cepstral analysis [6]
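For the AR branch, the ML solution is the classic autocorrelation method solved by the Levinson-Durbin recursion. A compact sketch (my own illustration, not code from the slides; the AR(2) test signal is made up):

```python
import numpy as np

def lpc(x, order):
    """Linear predictive analysis (autocorrelation method, Levinson-Durbin).
    Returns A(z) coefficients a = [1, a1, ..., aM] and the prediction error power."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]  # r[0..order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient from the current prediction error
        acc = r[i] + np.dot(a[1:i], r[1:i][::-1])
        k = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# Recover the coefficients of a synthetic AR(2) process driven by white noise.
rng = np.random.default_rng(0)
e = rng.standard_normal(20000)
x = np.zeros_like(e)
for n in range(2, len(x)):
    x[n] = 0.5 * x[n - 1] - 0.3 * x[n - 2] + e[n]
a, err = lpc(x, order=2)  # expect a close to [1, -0.5, 0.3]
```

Here H(z) = 1/A(z), so the estimated a[1], a[2] approach the negated recursion weights of the generating process.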

Examples of speech spectra

[Figure: log magnitude spectra (dB, 0–5 kHz) for (a) ML-based cepstral analysis and (b) linear prediction]

HMM-based speech synthesis [4]

[Diagram: the same training/synthesis pipeline, repeated to highlight the current stage]

Structure of state-output (observation) vectors

o_t = [c_t, Δc_t, Δ²c_t, p_t, δp_t, δ²p_t]

Spectrum part: mel-cepstral coefficients c_t, Δ mel-cepstral coefficients Δc_t, ΔΔ mel-cepstral coefficients Δ²c_t

Excitation part: log F0 p_t, Δ log F0 δp_t, ΔΔ log F0 δ²p_t
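Building such state-output vectors is a simple stacking operation. A minimal sketch (illustrative only; the 40-dim mel-cepstrum and the simple backward difference Δc_t = c_t − c_{t−1} follow the slides, the frame count is made up):

```python
import numpy as np

def make_observations(c, lf0):
    """Stack static, delta, and delta-delta features into per-frame
    state-output vectors o_t = [c, Δc, Δ²c, p, δp, δ²p]."""
    def delta(x):
        # Backward difference; first frame's delta is defined as 0.
        return np.diff(x, axis=0, prepend=x[:1])
    dc, ddc = delta(c), delta(delta(c))
    dl, ddl = delta(lf0), delta(delta(lf0))
    return np.hstack([c, dc, ddc, lf0, dl, ddl])

rng = np.random.default_rng(0)
c = rng.standard_normal((100, 40))   # mel-cepstral coefficients, 100 frames
lf0 = rng.standard_normal((100, 1))  # (continuous) log F0
o = make_observations(c, lf0)        # shape (100, 3*40 + 3) = (100, 123)
```

In a real system the log F0 stream needs voiced/unvoiced handling (MSD [22] or continuous F0 [24]); this sketch ignores that.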

Hidden Markov model (HMM)

[Diagram: a 3-state left-to-right HMM with initial probability π₁, transition probabilities a₁₁, a₁₂, a₂₂, a₂₃, a₃₃, and state-output distributions b₁(o_t), b₂(o_t), b₃(o_t); observation sequence O = o₁, o₂, …, o_T aligned to state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …]

Multi-stream HMM structure

[Diagram: the state-output vector o_t split into streams: stream 1 (spectrum): o_t¹ = [c_t, Δc_t, Δ²c_t]; streams 2–4 (excitation): o_t² = p_t, o_t³ = δp_t, o_t⁴ = δ²p_t]

b_j(o_t) = Π_{s=1}^{S} ( b_j^{(s)}(o_t^{(s)}) )^{w_s}

Training process

[Flowchart, from data & labels to estimated models:

Monophone (context-independent, CI) stage: compute variance floor (HCompV) → initialize CI-HMMs by segmental k-means (HInit) → re-estimate CI-HMMs by EM algorithm (HRest & HERest).

Fullcontext (context-dependent, CD) stage: copy CI-HMMs to CD-HMMs (HHEd CL) → re-estimate CD-HMMs by EM algorithm (HERest) → decision tree-based clustering (HHEd TB) → re-estimate CD-HMMs by EM algorithm (HERest) → untie parameter-tying structure (HHEd UT) → iterate → estimated HMMs.

Duration stage: estimate CD duration models from forward-backward statistics (HERest) → decision tree-based clustering (HHEd TB) → estimated duration models]

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes

• Position of current phoneme in current syllable

• # of phonemes in {preceding, current, succeeding} syllable

• {accent, stress} of {preceding, current, succeeding} syllable

• Position of current syllable in current word

• # of {preceding, succeeding} {stressed, accented} syllables in phrase

• # of syllables {from previous, to next} {stressed, accented} syllable

• Guess at part of speech of {preceding, current, succeeding} word

• # of syllables in {preceding, current, succeeding} word

• Position of current word in current phrase

• # of {preceding, succeeding} content words in current phrase

• # of words {from previous, to next} content word

• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Decision tree-based state clustering [7]

[Figure: a decision tree of yes/no context questions (e.g., L=voiced?, L="w"?, R=silence?, L="gy"?) routing fullcontext states (w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n, …) to leaf nodes of synthesized states]

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency → build decision trees individually

State duration models [8]

[Figure: trellis over states i and frames t = 1, …, T = 8, entering state i at t₀ and leaving at t₁ + 1]

Probability to enter state i at t₀ and then leave at t₁ + 1:

χ_{t₀,t₁}(i) ∝ Σ_{j≠i} α_{t₀−1}(j) a_{ji} · a_{ii}^{t₁−t₀} · Π_{t=t₀}^{t₁} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t₁+1}) β_{t₁+1}(k)

→ estimate state duration models

Stream-dependent tree-based clustering

[Figure: an HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]

HMM-based speech synthesis [4]

[Diagram: the same training/synthesis pipeline, repeated to highlight the current stage]

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)

ô = arg max_o p(o | q̂, λ)


Best state sequence

[Diagram: the same 3-state left-to-right HMM, with observation sequence O, best state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …, and state durations D = 4, 10, 5]

Best state outputs w/o dynamic features

[Figure: state-wise means and variances: ô becomes a step-wise mean vector sequence]

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^⊤, Δc_t^⊤]^⊤,  Δc_t = c_t − c_{t−1}

where c_t has M dimensions and o_t has 2M. The relationship between static and dynamic features can be arranged in matrix form, o = Wc:

[…, c_{t−1}^⊤, Δc_{t−1}^⊤, c_t^⊤, Δc_t^⊤, c_{t+1}^⊤, Δc_{t+1}^⊤, …]^⊤ = W […, c_{t−2}^⊤, c_{t−1}^⊤, c_t^⊤, c_{t+1}^⊤, …]^⊤

where each pair of block rows of W is

  [⋯  0   I  0  0 ⋯]   (static:  c_t)
  [⋯  0  −I  I  0 ⋯]   (delta:   Δc_t = c_t − c_{t−1})

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂


Speech parameter generation algorithm [9]

[Figure: the linear system W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} μ_q written out, with c = [c₁, …, c_T]^⊤ and μ_q = [μ_{q₁}, …, μ_{q_T}]^⊤; W is a sparse band matrix of 1 / −1 entries, so the system is band-structured and can be solved efficiently]
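The band-structured system above can be solved directly. A minimal sketch of this parameter generation step (my own illustration for a 1-dim static feature with one delta and diagonal covariances; a real implementation exploits the band structure instead of a dense solve, and the means/variances below are made up):

```python
import numpy as np

def mlpg(mu, var):
    """Solve W^T S^-1 W c = W^T S^-1 mu for the static trajectory c.
    mu, var: (T, 2) arrays of [static, delta] means and variances."""
    T = mu.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0              # static row: c_t
        W[2 * t + 1, t] = 1.0          # delta row:  c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    P = np.diag(1.0 / var.reshape(-1))  # diagonal Sigma^{-1}
    A = W.T @ P @ W
    b = W.T @ P @ mu.reshape(-1)
    return np.linalg.solve(A, b)

# Step-wise static means (two "states") with zero delta means:
mu_static = np.repeat([0.0, 1.0], 5)
mu = np.stack([mu_static, np.zeros(10)], axis=1)
var = np.ones((10, 2))
var[:, 1] = 0.1   # tight delta variance -> delta constraint smooths the step
c = mlpg(mu, var)
```

With only static features, c would reproduce the step-wise means; the delta constraint turns the step into a smooth trajectory, which is exactly the effect shown on the next slide.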

Generated speech parameter trajectory

[Figure: generated static and dynamic speech parameter trajectories c, with state-wise means and variances]

HMM-based speech synthesis [4]

[Diagram: the same training/synthesis pipeline, repeated to highlight the current stage]

Waveform reconstruction

[Diagram: generated excitation parameters (log F0 with V/UV) drive a pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Outline

Background
  • HMM-based statistical parametric speech synthesis (SPSS)
  • Flexibility
  • Improvements

Statistical parametric speech synthesis with neural networks
  • Deep neural network (DNN)-based SPSS
  • Deep mixture density network (DMDN)-based SPSS
  • Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Diagram: adaptive training over training speakers yields an average-voice model; adaptation maps it to target speakers]

• Train an average voice model (AVM) from training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker / speaking style → small cost to create new voices

Interpolation (mixing voice) [14 15 16 17]

[Diagram: a new HMM set λ′ placed among representative HMM sets λ₁…λ₄ with interpolation ratios I(λ′, λ₁)…I(λ′, λ₄); λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data

Outline

Background
  • HMM-based statistical parametric speech synthesis (SPSS)
  • Flexibility
  • Improvements

Statistical parametric speech synthesis with neural networks
  • Deep neural network (DNN)-based SPSS
  • Deep mixture density network (DMDN)-based SPSS
  • Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation: difficult to model a mix of V/UV sounds (e.g., voiced fricatives)

[Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]

• Spectral envelope extraction: harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz with harmonic ripple]

• Phase: important, but usually ignored

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic/stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics: statistics do not vary within an HMM state

• Conditional independence assumption: state output probability depends only on the current state

• Weak duration modeling: state duration probability decreases exponentially with time

None of them hold for real speech

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectrograms (0–8 kHz)]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics (adaptation; interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Outline

Background
  • HMM-based statistical parametric speech synthesis (SPSS)
  • Flexibility
  • Improvements

Statistical parametric speech synthesis with neural networks
  • Deep neural network (DNN)-based SPSS
  • Deep mixture density network (DMDN)-based SPSS
  • Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training: learn the relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential


HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees partition the acoustic space via yes/no context questions]

• Decision tree-clustered HMM with GMM state-output distributions

DNN-based acoustic modeling for SPSS [18]

[Diagram: linguistic features x feed hidden layers h₁, h₂, h₃, which produce acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces decision trees and GMMs
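The mapping itself is an ordinary feedforward pass. A minimal sketch (illustrative only; the layer sizes and the sigmoid-hidden / linear-output choice mirror the experimental setup later in the talk, but the input/output dimensions here are made-up placeholders, not the exact feature counts used in [18]):

```python
import numpy as np

def forward(x, layers):
    """Feedforward DNN: sigmoid hidden layers, linear output layer,
    mapping a linguistic feature vector to acoustic-feature statistics."""
    h = x
    for i, (W, b) in enumerate(layers):
        a = h @ W + b
        # Linear activation at the output layer, sigmoid elsewhere.
        h = a if i == len(layers) - 1 else 1.0 / (1.0 + np.exp(-a))
    return h

rng = np.random.default_rng(0)
sizes = [352, 512, 512, 512, 512, 127]  # linguistic in, 4x512 hidden, acoustic out
layers = [(0.01 * rng.standard_normal((m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
y = forward(rng.standard_normal(352), layers)  # one frame's acoustic statistics
```

In the framework on the next slide this pass is run once per frame, with frame-position and duration features appended to the per-phoneme linguistic features.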

Framework

[Diagram: TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features, for frames 1…T; durations from a duration prediction module) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations → integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform


Framework

Is this new? … no

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithm

• Statistical parametric speech synthesis techniques


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Example of speech parameter trajectories

(w/o grouping questions, numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum coefficient over frames 0–500: natural speech vs. HMM (α = 1) vs. DNN (4 × 512)]

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α) | DNN (#layers × #units) | Neutral | p value | z value
15.8% (16) | 38.5% (4 × 256) | 45.7% | < 10⁻⁶ | −9.9
16.1% (4) | 27.2% (4 × 512) | 56.8% | < 10⁻⁶ | −5.1
12.7% (1) | 36.6% (4 × 1024) | 50.7% | < 10⁻⁶ | −11.5

Outline

Background
  • HMM-based statistical parametric speech synthesis (SPSS)
  • Flexibility
  • Improvements

Statistical parametric speech synthesis with neural networks
  • Deep neural network (DNN)-based SPSS
  • Deep mixture density network (DMDN)-based SPSS
  • Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: bimodal data samples in (y₁, y₂); the NN prediction falls between the two modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
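The unimodality problem can be demonstrated without any network at all: for a fixed input, the MSE-optimal prediction is the conditional mean of the targets, even when that mean is a value the data never takes. A small, self-contained illustration (my own toy example, not from the slides):

```python
import numpy as np

# One fixed input value, two equally likely target modes: y = -1 or y = +1.
y = np.array([-1.0] * 500 + [1.0] * 500)

# Sweep candidate constant predictions and pick the MSE minimizer.
preds = np.linspace(-1.5, 1.5, 301)
mse = np.array([np.mean((y - p) ** 2) for p in preds])
best = preds[int(np.argmin(mse))]  # lands at the conditional mean, ~0
```

`best` sits between the two modes rather than on either of them, which is exactly why a mixture density output layer (next slide) is preferable for one-to-many mappings.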


Mixture density network [26]

(Figure: a network whose output layer parameterizes a Gaussian mixture over y — weights w₁(x), w₂(x), means μ₁(x), μ₂(x), and variances σ₁(x), σ₂(x) — illustrated for a 1-dim, 2-mix MDN.)

Inputs of activation functions:  z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights → softmax activation function:
  w₁(x) = exp(z₁) / Σ_{m=1}^{2} exp(z_m),   w₂(x) = exp(z₂) / Σ_{m=1}^{2} exp(z_m)

Means → linear activation function:
  μ₁(x) = z₃,   μ₂(x) = z₄

Variances → exponential activation function:
  σ₁(x) = exp(z₅),   σ₂(x) = exp(z₆)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
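The activation functions above can be sketched in a few lines of NumPy. This is a minimal 1-dim illustration, not code from the talk; the names `mdn_output_layer` and `mdn_nll` are made up for this sketch:

```python
import numpy as np

def mdn_output_layer(z, n_mix):
    """Map the 3*n_mix linear activations z onto GMM parameters,
    assuming the layout [weight logits | means | log std devs]."""
    zw, zmu, zsigma = z[:n_mix], z[n_mix:2*n_mix], z[2*n_mix:]
    w = np.exp(zw - zw.max())      # softmax -> mixture weights
    w /= w.sum()
    mu = zmu                       # linear -> means
    sigma = np.exp(zsigma)         # exponential -> positive std devs
    return w, mu, sigma

def mdn_nll(y, w, mu, sigma):
    """Negative log-likelihood of scalar target y under the mixture."""
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())
```

Training minimizes `mdn_nll` instead of MSE, so the network can represent one-to-many mappings and per-frame variances.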

DMDN-based SPSS [27]

(Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x₁, x₂, …, x_T; for each frame x_t the DMDN outputs mixture weights w₁(x_t), w₂(x_t), means μ₁(x_t), μ₂(x_t), and variances σ₁(x_t), σ₂(x_t) over y; parameter generation and waveform synthesis then produce SPEECH.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

  DNN:  4–7 hidden layers, 1,024 units/hidden layer;
        architecture: ReLU (hidden), linear (output)

  DMDN: 4 hidden layers, 1,024 units/hidden layer;
        architecture: ReLU [28] (hidden), mixture density (output), 1–16 mix

  Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

               1 mix     3.537 ± 0.113
  HMM          2 mix     3.397 ± 0.115

               4×1,024   3.635 ± 0.127
  DNN          5×1,024   3.681 ± 0.109
               6×1,024   3.652 ± 0.108
               7×1,024   3.637 ± 0.129

               1 mix     3.654 ± 0.117
  DMDN         2 mix     3.796 ± 0.107
  (4×1,024)    4 mix     3.766 ± 0.113
               8 mix     3.805 ± 0.113
               16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts
    (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

(Figure: a network mapping input x_t to output y_t, with recurrent connections carrying the hidden state across time steps t−1, t, t+1.)

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

(Figure: an LSTM block with memory cell c_t, tanh input/output nonlinearities, and sigmoid gates fed by x_t and h_{t−1} with biases b_i, b_f, b_o, b_c — input gate: write, output gate: read, forget gate: reset.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
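One step of the cell sketched above can be written out explicitly. This is a minimal NumPy sketch of a standard LSTM step (no peephole connections), with a made-up name `lstm_step` and an assumed stacked-weight layout; it is not code from the talk:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x; h_prev] to the four stacked gate
    pre-activations (input, forget, output, cell); n is hidden size."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0*n:1*n])   # input gate: write to memory
    f = sigmoid(z[1*n:2*n])   # forget gate: reset memory
    o = sigmoid(z[2*n:3*n])   # output gate: read memory
    g = np.tanh(z[3*n:4*n])   # candidate cell update
    c = f * c_prev + i * g    # linear memory cell
    h = o * np.tanh(c)        # block output
    return h, c
```

The linear cell update `c = f * c_prev + i * g` is what lets gradients and information survive over long time spans.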

LSTM-based SPSS [33, 34]

(Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x₁, x₂, …, x_T; an LSTM maps them to outputs y₁, y₂, …, y_T, followed by parameter generation and waveform synthesis → SPEECH.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

  Database          US English, female speaker
  Train/dev data    34,632 & 100 sentences
  Sampling rate     16 kHz
  Analysis window   25-ms width, 5-ms shift
  Linguistic        DNN: 449
  features          LSTM: 289
  Acoustic          0–39 mel-cepstrum, log F0,
  features          5-band aperiodicity (Δ, Δ²)
  DNN               4 hidden layers, 1,024 units/hidden layer;
                    ReLU (hidden), linear (output);
                    AdaDec [29] on GPU
  LSTM              1 forward LSTM layer, 256 units, 128 projection;
                    asynchronous SGD on CPUs [35]
  Postprocessing    Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

  DNN             LSTM            Stats
  w/ Δ   w/o Δ    w/ Δ   w/o Δ    Neutral   z      p
  50.0   14.2     –      –        35.8      12.0   < 10⁻¹⁰
  –      –        30.2   15.6     54.2      5.1    < 10⁻⁶
  15.8   –        34.0   –        50.2      −6.2   < 10⁻⁹
  28.4   –        –      33.6     38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79



Speech production process

(Figure: modulation of a carrier wave by speech information — text (concept) drives the fundamental frequency (voiced/unvoiced), frequency transfer characteristics, and magnitude (start–end); air flow through a sound source (voiced: pulse, unvoiced: noise) is shaped by the vocal tract to produce speech.)
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 2 of 79

Typical flow of TTS system

  TEXT
   ↓  Text analysis (NLP, frontend; discrete ⇒ discrete):
      sentence segmentation, word segmentation, text normalization,
      part-of-speech tagging, pronunciation, prosody prediction
   ↓  Speech synthesis (speech, backend; discrete ⇒ continuous):
      waveform generation
  SYNTHESIZED SPEECH

This talk focuses on the backend.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 3 of 79

Concatenative speech synthesis

(Figure: candidate segments from the database, selected by target and concatenation costs.)

• Concatenate actual instances of speech from a database

• Large data + automatic learning
  → High-quality synthetic voices can be built automatically

• Single inventory per unit → diphone synthesis [1]

• Multiple inventories per unit → unit selection synthesis [2]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 4 of 79

Statistical parametric speech synthesis (SPSS) [3]

(Figure: training — speech analysis yields acoustic features y and text analysis yields linguistic features x for model training of λ; synthesis — text analysis yields x, parameter generation yields y, speech synthesis reconstructs the waveform.)

• Training
  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y):
      λ̂ = arg max_λ p(y | x, λ)

• Synthesis
  − Extract x from the text to be synthesized
  − Generate the most probable y from λ̂:
      ŷ = arg max_y p(y | x, λ̂)
  − Reconstruct speech from ŷ

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79


Statistical parametric speech synthesis (SPSS) [3]

(Figure: the same training/synthesis pipeline.)

• Large data + automatic training
  → Automatic voice building

• Parametric representation of speech
  → Flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model
  → HMM-based speech synthesis system (HTS) [4]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

HMM-based speech synthesis [4]

(System diagram — training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models; synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79

HMM-based speech synthesis [4]

(Same system diagram as the previous slide.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 9 of 79

Speech production process

(Same speech-production figure as before: text (concept) drives fundamental frequency (voiced/unvoiced), frequency transfer characteristics, and magnitude; air flow through a voiced-pulse/unvoiced-noise sound source is shaped by the vocal tract to produce speech.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79

Source-filter model

(Figure: pulse train (voiced) / white noise (unvoiced) excitation e(n) drives a linear time-invariant system h(n) to produce speech x(n).)

  x(n) = h(n) * e(n)
    ↓ Fourier transform
  X(e^{jω}) = H(e^{jω}) E(e^{jω})

Source excitation part: E(e^{jω}); vocal tract resonance part: H(e^{jω})

H(e^{jω}) should be defined by HMM state-output vectors,
e.g., mel-cepstrum, line spectral pairs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
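The source-filter relation x(n) = h(n) * e(n) can be demonstrated directly. A minimal NumPy sketch (the names `excitation` and `synthesize` are illustrative, not from the talk):

```python
import numpy as np

def excitation(n_samples, period, voiced=True, rng=None):
    """Pulse train (voiced) or white noise (unvoiced) excitation e(n)."""
    if voiced:
        e = np.zeros(n_samples)
        e[::period] = 1.0        # one pulse every `period` samples
        return e
    rng = rng or np.random.default_rng(0)
    return rng.standard_normal(n_samples)

def synthesize(e, h):
    """x(n) = h(n) * e(n): pass the excitation through the linear
    time-invariant vocal-tract filter with impulse response h."""
    return np.convolve(e, h)[:len(e)]
```

Changing the pulse period changes the pitch; changing h changes the spectral envelope — exactly the two parameter streams the HMM models.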

Parametric models of speech signal

  Autoregressive (AR) model:   H(z) = K / (1 − Σ_{m=0}^{M} c(m) z^{−m})

  Exponential (EX) model:      H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

  ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]

• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
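For the exponential model, the log-magnitude spectrum is simply the real part of the Fourier transform of the cepstral coefficients: log|H(e^{jω})| = Re{Σ_m c(m) e^{−jωm}}. A minimal NumPy sketch (the function name is made up for this example):

```python
import numpy as np

def ex_model_log_spectrum(c, n_fft=64):
    """Log-magnitude response of H(z) = exp(sum_m c(m) z^{-m})
    on the unit circle, evaluated via an FFT of the cepstrum."""
    buf = np.zeros(n_fft)
    buf[:len(c)] = c                 # zero-padded cepstrum
    return np.real(np.fft.rfft(buf))  # Re{DFT} = log|H| at each bin
```

This is why cepstral coefficients are a convenient HMM state-output representation: a small vector c compactly parameterizes the whole spectral envelope.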

Examples of speech spectra

(Figure: log magnitude (dB) vs. frequency (0–5 kHz) for (a) ML-based cepstral analysis and (b) linear prediction.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79

HMM-based speech synthesis [4]

(Same system diagram as before.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79

Structure of state-output (observation) vectors

  Spectrum part:    mel-cepstral coefficients c_t
                    Δ mel-cepstral coefficients Δc_t
                    Δ² mel-cepstral coefficients Δ²c_t

  Excitation part:  log F0 p_t
                    Δ log F0 δp_t
                    Δ² log F0 δ²p_t

  (stacked into the observation vector o_t)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

(Figure: a 3-state left-to-right HMM with initial probability π₁, transition probabilities a₁₁, a₁₂, a₂₂, a₂₃, a₃₃, and state-output distributions b₁(o_t), b₂(o_t), b₃(o_t); an observation sequence O = o₁, …, o_T is emitted along a state sequence Q, e.g., 1 1 1 1 2 2 3 3.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

(Figure: the observation vector o_t is split into streams — spectrum: o_t¹ = [c_t, Δc_t, Δ²c_t]; excitation: o_t² = p_t, o_t³ = δp_t, o_t⁴ = δ²p_t — each with its own stream-output distribution b_j^s(o_t^s).)

  b_j(o_t) = Π_{s=1}^{S} ( b_j^s(o_t^s) )^{w_s}

where w_s is the weight of stream s.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
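The weighted product of stream likelihoods above is one line when computed in the log domain. A minimal sketch (illustrative function name, not from the talk):

```python
import numpy as np

def multi_stream_likelihood(stream_likelihoods, stream_weights):
    """b_j(o_t) = prod_s b_j^s(o_t^s)^{w_s}: weighted product of
    per-stream observation likelihoods, via a log-domain dot product."""
    logs = np.log(np.asarray(stream_likelihoods, dtype=float))
    return float(np.exp(np.dot(stream_weights, logs)))
```

Setting a stream weight to 0 removes that stream's influence; weights of 1 recover a plain product of likelihoods.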

Training process

  1. Compute variance floor (HCompV)
  2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
  3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
  4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
  5. Reestimate CD-HMMs by EM algorithm (HERest)
  6. Decision tree-based clustering (HHEd TB)
  7. Reestimate CD-HMMs by EM algorithm (HERest)
  8. Untie parameter tying structure (HHEd UT)
  9. Estimate CD duration models from forward-backward statistics (HERest)
  10. Decision tree-based clustering of duration models (HHEd TB)
  → Estimated HMMs & duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

(Figure: fullcontext states such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n are clustered by yes/no questions — e.g., L=voice?, L=“w”?, R=silence?, L=“gy”? — and the leaf nodes give the synthesized states.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

(Figure: separate decision trees for mel-cepstrum and for F0.)

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

(Figure: occupying state i from t₀ to t₁ along t = 1, …, T = 8.)

Probability to enter state i at t₀ and leave at t₁ + 1:

  χ_{t₀,t₁}(i) ∝ Σ_{j≠i} α_{t₀−1}(j) a_{ji} a_{ii}^{t₁−t₀} Π_{t=t₀}^{t₁} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t₁+1}) β_{t₁+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

(Figure: an HMM with separate decision trees for mel-cepstrum, for F0, and for the state duration model.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

(Same system diagram as before.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words w:

  ô = arg max_o p(o | w, λ)
    = arg max_o Σ_{∀q} p(o, q | w, λ)
    ≈ arg max_o max_q p(o, q | w, λ)
    = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

  q̂ = arg max_q P(q | w, λ)
  ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

(Figure: the same 3-state left-to-right HMM; the best state sequence Q for observation sequence O, e.g., 1 1 1 1 2 2 3 3, gives the state durations D, e.g., 4, 10, 5.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

(Figure: per-state mean and variance over time.)

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

  o_t = [c_t^⊤, Δc_t^⊤]^⊤,   Δc_t = c_t − c_{t−1}

(with static features c_t of dimension M, so o_t has dimension 2M)

The relationship between static and dynamic features can be arranged in matrix form as

  o = W c,

where o stacks […, c_{t−1}^⊤, Δc_{t−1}^⊤, c_t^⊤, Δc_t^⊤, c_{t+1}^⊤, Δc_{t+1}^⊤, …]^⊤, c stacks the static features …, c_{t−2}, c_{t−1}, c_t, c_{t+1}, …, and W is a sparse band matrix whose rows are [… 0 I 0 0 …] for static features and [… −I I 0 0 …] for delta features.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

  ô = arg max_o p(o | q̂, λ)   subject to   o = W c

If the state-output distribution is a single Gaussian,

  p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

  W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

(Figure: the linear system W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} μ_q written out — W^⊤ Σ_q^{−1} W is a banded matrix built from the 1 / −1 delta-window coefficients, c stacks c₁ … c_T, and the right-hand side is formed from the state means μ_{q₁} … μ_{q_T}.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
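The banded system above can be solved directly for a toy case. A minimal NumPy sketch for a 1-dim static feature with a single delta window Δc_t = c_t − c_{t−1} and diagonal covariances (the name `mlpg` and the dense solve are for illustration only — a real implementation would exploit the band structure):

```python
import numpy as np

def mlpg(mu, var):
    """Solve W^T Sigma^{-1} W c = W^T Sigma^{-1} mu for the trajectory c.
    mu/var hold per-frame [static, delta] means/variances, shape (T, 2)."""
    T = mu.shape[0]
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0            # static row: c_t
        W[2 * t + 1, t] = 1.0        # delta row: c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    P = np.diag(1.0 / var.reshape(-1))   # diagonal Sigma^{-1}
    m = mu.reshape(-1)
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ m)
```

With small delta variances the solution is pulled toward smooth transitions between the step-wise static means, which is exactly the smoothing effect shown on the next slide.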

Generated speech parameter trajectory

(Figure: the generated static and dynamic trajectories c̄ pass smoothly through the state means, within the state variances.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

(Same system diagram as before.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

(Figure: generated excitation parameters (log F0 with V/UV) drive the pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); the synthesized speech is x(n) = h(n) * e(n).)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

(Figure: adaptive training of an average-voice model on training speakers, then adaptation to target speakers.)

• Train average voice model (AVM) from training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

(Figure: a new model λ′ placed among HMM sets λ₁, λ₂, λ₃, λ₄ with interpolation ratios I(λ′, λ).)

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation:
  difficult to model a mix of V/UV sounds (e.g., voiced fricatives)

  (Figure: pulse train for voiced and white noise for unvoiced segments, switched to form e(n).)

• Spectral envelope extraction:
  harmonic effects often cause problems

  (Figure: power (dB) vs. frequency, 0–8 kHz, showing harmonic ripple.)

• Phase: important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
  statistics do not vary within an HMM state

• Conditional independence assumption:
  state output probability depends only on the current state

• Weak duration modeling:
  state duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
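The weak-duration point can be checked directly: with self-transition probability a_ii, a standard HMM's state duration is geometric, so longer stays are always less likely. A small sketch with an illustrative a_ii:

```python
# Under a standard HMM, the probability of staying in a state for exactly
# d frames is geometric: p(d) = a_ii^(d-1) * (1 - a_ii), which decays
# exponentially with d -- a poor fit for real phone durations.
def state_duration_prob(a_ii, d):
    return a_ii ** (d - 1) * (1.0 - a_ii)

probs = [state_duration_prob(0.6, d) for d in range(1, 6)]
# Each extra frame multiplies the probability by a_ii, so p(d) is
# strictly decreasing: the most likely duration is always one frame.
```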

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
– Trended HMM
– Polynomial segment model
– Trajectory HMM
• Conditional independence assumption → graphical model
– Buried Markov model
– Autoregressive HMM
– Trajectory HMM
• Weak duration modeling → explicit duration model
– Hidden semi-Markov model

Oversmoothing

• Speech parameter generation algorithm
– Dynamic feature constraints make generated parameters smooth
– Often too smooth → sounds muffled
(Figure: spectrograms of natural vs. generated speech, 0–8 kHz)
• Why?
– Details of spectral (formant) structure disappear
– Use of a better acoustic model relaxes the issue, but not enough

Oversmoothing compensation

• Postfiltering
– Mel-cepstrum
– LSP
• Nonparametric approach
– Conditional parameter generation
– Discrete HMM-based speech synthesis
• Combine multiple-level statistics
– Global variance (intra-utterance variance)
– Modulation spectrum (intra-utterance frequency components)
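As a rough illustration of the global-variance idea, one can rescale a generated trajectory so its intra-utterance variance matches a target value (e.g., the average GV of natural speech). This is a simplified sketch, not the actual GV-based generation algorithm; the function name and values are assumptions:

```python
import numpy as np

def gv_compensate(traj, target_var):
    """Rescale a generated parameter trajectory around its mean so that
    its intra-utterance (global) variance matches target_var."""
    traj = np.asarray(traj, dtype=float)
    mean = traj.mean()
    scale = np.sqrt(target_var / traj.var())
    return mean + scale * (traj - mean)

smooth = np.array([0.0, 0.1, 0.0, -0.1, 0.0])    # oversmoothed trajectory
restored = gv_compensate(smooth, target_var=0.02)  # variance restored
```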

Characteristics of SPSS

• Advantages
– Flexibility to change voice characteristics (adaptation, interpolation, eigenvoice, CAT, multiple regression)
– Small footprint
– Robustness
• Drawback
– Quality
• Major factors for quality degradation [3]
– Vocoder (speech analysis & synthesis)
– Acoustic model (HMM) → neural networks
– Oversmoothing (parameter generation)

Outline

Background
– HMM-based statistical parametric speech synthesis (SPSS)
– Flexibility
– Improvements

Statistical parametric speech synthesis with neural networks
– Deep neural network (DNN)-based SPSS
– Deep mixture density network (DMDN)-based SPSS
– Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
– Phoneme-, syllable-, word-, phrase-, and utterance-level features
– e.g., phone identity, POS, stress of words in a phrase
– Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential.


HMM-based acoustic modeling for SPSS [4]

(Figure: binary decision tree of yes/no questions partitioning the acoustic space)

• Decision tree-clustered HMM with GMM state-output distributions

DNN-based acoustic modeling for SPSS [18]

(Figure: linguistic features x mapped through hidden layers h1, h2, h3 to acoustic features y)

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs
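A minimal sketch of the mapping the slide describes: sigmoid hidden layers followed by a linear output layer that regresses acoustic features from linguistic features. Layer sizes, weights, and the function name are illustrative assumptions, not the deck's actual configuration:

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Forward pass: sigmoid hidden layers, linear output layer
    predicting acoustic features y from linguistic features x."""
    h = np.asarray(x, dtype=float)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden layers
    return weights[-1] @ h + biases[-1]          # linear output layer

rng = np.random.default_rng(0)
sizes = [10, 16, 16, 3]   # linguistic dim -> two hidden layers -> acoustic dim
Ws = [rng.normal(size=(o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(o) for o in sizes[1:]]
y = dnn_forward(rng.normal(size=10), Ws, bs)
```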

Framework

(Figure: TEXT → text analysis → input feature extraction (binary & numeric features, duration feature, frame position feature); duration prediction provides per-frame inputs from frame 1 to frame T; the DNN (input layer, hidden layers, output layer) maps each frame's input features to statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH)

Advantages of NN-based acoustic modeling

• Integrating feature extraction
– Can model high-dimensional, highly correlated features efficiently
– Layered architecture w/ non-linear operations → feature extraction integrated into acoustic modeling
• Distributed representation
– Can be exponentially more efficient than fragmented representation
– Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
– concept → linguistic → articulatory → waveform


Framework

Is this new? — no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

(Figure: 5th mel-cepstrum trajectories over 500 frames — natural speech vs. HMM (α=1) vs. DNN (4×512))

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α) | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10^-6 | -9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10^-6 | -5.1
12.7 (1) | 36.6 (4 × 1024) | 50.7 | < 10^-6 | -11.5

Outline

Background
– HMM-based statistical parametric speech synthesis (SPSS)
– Flexibility
– Improvements

Statistical parametric speech synthesis with neural networks
– Deep neural network (DNN)-based SPSS
– Deep mixture density network (DMDN)-based SPSS
– Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

(Figure: data samples from a one-to-many mapping in the (y1, y2) plane, with the NN prediction running through their mean)

• Unimodality
– Humans can speak in different ways → one-to-many mapping
– NN trained with an MSE loss → approximates the conditional mean
• Lack of variance
– DNN-based SPSS uses variances computed from all training data
– The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
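The unimodality point can be demonstrated in a few lines: when the same input maps to two target modes, the constant prediction that minimizes squared error is their mean, which matches neither mode. Illustrative values only:

```python
import numpy as np

# One-to-many mapping: for the same input, the target is either +1 or -1.
# The constant prediction minimizing mean squared error is the conditional
# mean (here 0), which corresponds to neither of the actual modes.
targets = np.array([1.0, -1.0, 1.0, -1.0])
candidates = np.linspace(-1.5, 1.5, 301)
mse = [np.mean((targets - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(mse))]   # ~0.0: the mean, not a mode
```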


Mixture density network [26]

(Figure: 1-dim, 2-mix MDN — the output layer emits w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x))

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_ij

• Weights → softmax activation: w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m), w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
• Means → linear activation: μ1(x) = z3, μ2(x) = z4
• Variances → exponential activation: σ1(x) = exp(z5), σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
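The output-layer computation above can be sketched as follows: the activations are split into mixture weights (softmax), means (linear), and standard deviations (exponential, keeping them positive). The helper name and the activation ordering are assumptions for illustration:

```python
import numpy as np

def mdn_params(z, n_mix):
    """Split MDN output-layer activations z into GMM parameters for a
    1-dim, n_mix-component mixture: softmax -> weights, identity ->
    means, exp -> standard deviations."""
    z = np.asarray(z, dtype=float)
    logits, means, log_sigmas = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:]
    e = np.exp(logits - logits.max())   # numerically stable softmax
    weights = e / e.sum()
    return weights, means, np.exp(log_sigmas)

# z1..z6 for a 2-mix MDN: equal weights, means -1 and +1, unit sigmas.
w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.0], n_mix=2)
```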

DMDN-based SPSS [27]

(Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → DMDN emitting mixture parameters w1(x_t), w2(x_t), μ1(x_t), μ2(x_t), σ1(x_t), σ2(x_t) for each frame → parameter generation → waveform synthesis → SPEECH)

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113; 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127; 5×1024 3.681 ± 0.109; 6×1024 3.652 ± 0.108; 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117; 2 mix 3.796 ± 0.107; 4 mix 3.766 ± 0.113; 8 mix 3.805 ± 0.113; 16 mix 3.791 ± 0.102

Outline

Background
– HMM-based statistical parametric speech synthesis (SPSS)
– Flexibility
– Improvements

Statistical parametric speech synthesis with neural networks
– Deep neural network (DNN)-based SPSS
– Deep mixture density network (DMDN)-based SPSS
– Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
– A fixed number of preceding / succeeding contexts (e.g., ±2 phonemes / syllable stress) are used as inputs
– Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
– Each frame is mapped independently
– Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]


Basic RNN

(Figure: input x feeds output y through recurrent connections; unrolled over time, x_{t-1}, x_t, x_{t+1} → y_{t-1}, y_t, y_{t+1})

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
– Information in hidden layers loops through recurrent connections → quickly decays over time
– Prone to being overwritten by new information arriving from inputs
→ long short-term memory (LSTM) RNN [32]
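A minimal sketch of the recurrence: each hidden state depends on the current input and the previous hidden state, so each output can use all previous contexts but none of the future ones. Shapes and names are illustrative assumptions:

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    """Basic (unidirectional) RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x in xs:                       # strictly left-to-right in time
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(1)
xs = rng.normal(size=(5, 4))           # 5 frames, 4-dim inputs
H = rnn_forward(xs, rng.normal(size=(8, 4)),
                rng.normal(size=(8, 8)), np.zeros(8))
```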

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

(Figure: LSTM block — memory cell c_t with input gate (write), output gate (read), and forget gate (reset), each driven by x_t and h_{t-1} through sigmoid units, with tanh squashing on the cell input and output, producing h_t)
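One step of the gating scheme in the figure can be sketched as follows. This is a simplified cell (no peephole connections), and the stacked-weight layout is an assumption for illustration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: input gate writes, forget gate resets the linear
    memory cell, output gate reads. W maps [x; h_prev] to the four gate
    pre-activations stacked together."""
    z = W @ np.concatenate([x, h_prev]) + b
    n = len(h_prev)
    i = sigmoid(z[0 * n:1 * n])      # input (write) gate
    f = sigmoid(z[1 * n:2 * n])      # forget (reset) gate
    o = sigmoid(z[2 * n:3 * n])      # output (read) gate
    g = np.tanh(z[3 * n:4 * n])      # candidate cell input
    c = f * c_prev + i * g           # linear memory-cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(2)
n, d = 3, 2
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n),
                 rng.normal(size=(4 * n, d + n)), np.zeros(4 * n))
```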

LSTM-based SPSS [33, 34]

(Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH)

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z | p
50.0 | 14.2 | – | – | 35.8 | 12.0 | < 10^-10
– | – | 30.2 | 15.6 | 54.2 | 5.1 | < 10^-6
15.8 | – | 34.0 | – | 50.2 | -6.2 | < 10^-9
28.4 | – | – | 33.6 | 38.0 | -1.5 | 0.138

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Outline

Background
– HMM-based statistical parametric speech synthesis (SPSS)
– Flexibility
– Improvements

Statistical parametric speech synthesis with neural networks
– Deep neural network (DNN)-based SPSS
– Deep mixture density network (DMDN)-based SPSS
– Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
– Flexible (e.g., adaptation, interpolation)
– Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
– Learns the mapping from linguistic features to acoustic ones
– Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.
[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.
[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.
[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.
[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.
[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.
[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).
[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.
[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.
[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.
[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.
[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.
[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.
[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.
[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.
[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.
[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.
[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.
[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.
[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.
[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.
[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.
[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.
[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.
[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.
[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts
[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.
[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.


Speech production process

(Figure: text (concept) → modulation of a carrier wave by speech information — fundamental frequency (voiced/unvoiced), frequency transfer characteristics, magnitude, start–end; air flow through the sound source (voiced: pulse, unvoiced: noise) → speech)

Typical flow of TTS system

TEXT
↓ Text analysis (NLP, frontend; discrete ⇒ discrete): sentence segmentation, word segmentation, text normalization, part-of-speech tagging, pronunciation, prosody prediction
↓ Speech synthesis (Speech, backend; discrete ⇒ continuous): waveform generation
SYNTHESIZED SPEECH

This talk focuses on the backend.

Concatenative speech synthesis

(Figure: all segments in the database, scored by target cost & concatenation cost)

• Concatenate actual instances of speech from a database
• Large data + automatic learning → high-quality synthetic voices can be built automatically
• Single inventory per unit → diphone synthesis [1]
• Multiple inventories per unit → unit selection synthesis [2]

Statistical parametric speech synthesis (SPSS) [3]

(Figure: training — speech and text pass through speech analysis / text analysis to give acoustic features y and linguistic features x, used to train model λ; synthesis — text analysis gives x, parameter generation gives y, speech synthesis reconstructs speech)

• Training
– Extract linguistic features x & acoustic features y
– Train acoustic model λ given (x, y): λ̂ = arg max_λ p(y | x, λ)
• Synthesis
– Extract x from the text to be synthesized
– Generate the most probable y from λ̂: ŷ = arg max_y p(y | x, λ̂)
– Reconstruct speech from ŷ


Statistical parametric speech synthesis (SPSS) [3]

• Large data + automatic training → automatic voice building
• Parametric representation of speech → flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model → HMM-based speech synthesis system (HTS) [4]
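A toy illustration of the two arg-max steps, assuming a single-Gaussian model per linguistic class: ML training picks the sample mean and variance (λ̂ = arg max p(y | x, λ)), and synthesis returns the most probable output, which for a Gaussian is its mean (ŷ = arg max p(y | x, λ̂)). A deliberately minimal stand-in for the real models:

```python
import numpy as np

def train(pairs):
    """lambda-hat: ML estimate per linguistic class (mean, variance)."""
    model = {}
    for x, ys in pairs.items():
        ys = np.asarray(ys, dtype=float)
        model[x] = (ys.mean(), ys.var())
    return model

def synthesize(model, x):
    """y-hat: most probable y under a Gaussian is its mean."""
    mean, _ = model[x]
    return mean

model = train({"a": [1.0, 2.0, 3.0], "b": [10.0, 10.0]})
y = synthesize(model, "a")
```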

Outline

Background
– HMM-based statistical parametric speech synthesis (SPSS)
– Flexibility
– Improvements

Statistical parametric speech synthesis with neural networks
– Deep neural network (DNN)-based SPSS
– Deep mixture density network (DMDN)-based SPSS
– Recurrent neural network (RNN)-based SPSS

Summary

HMM-based speech synthesis [4]

(Figure: training part — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH)



Source-filter model

pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → x(n) = h(n) ∗ e(n)

x(n) = h(n) ∗ e(n)  — Fourier transform →  X(e^jω) = H(e^jω) E(e^jω)

E(e^jω): source (excitation) part; H(e^jω): vocal tract resonance part — should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs
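The convolution x(n) = h(n) ∗ e(n) can be sketched directly: a periodic pulse train (voiced excitation) filtered by a toy impulse response. The period and the decaying-exponential h(n) are illustrative values, not a real vocal-tract filter:

```python
import numpy as np

period = 10
e = np.zeros(100)
e[::period] = 1.0                  # voiced excitation: periodic pulse train
h = 0.9 ** np.arange(20)           # toy impulse response h(n)
x = np.convolve(e, h)[:100]        # x(n) = h(n) * e(n)
```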

Parametric models of speech signal

Autoregressive (AR) model: H(z) = K / (1 − Σ_{m=0}^{M} c(m) z^{−m})
Exponential (EX) model: H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML: ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]
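For the EX model, the log-magnitude spectrum is linear in the cepstral coefficients, since log H(e^jω) = Σ_m c(m) e^{−jωm}. A sketch evaluating this with an FFT over illustrative coefficients (not from any real analysis):

```python
import numpy as np

c = np.array([0.5, 0.3, -0.2, 0.1])   # toy cepstrum c(0..3)
n_fft = 64
C = np.fft.fft(c, n_fft)              # sum_m c(m) e^{-j w m} on an FFT grid
H = np.exp(C)                         # EX model: H(e^{jw}) = exp(C(w))
log_mag = np.log(np.abs(H))           # log|H| = Re(C): linear in c(m)
```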

Examples of speech spectra

(Figure: log magnitude spectra (dB), 0–5 kHz — (a) ML-based cepstral analysis, (b) linear prediction)


Structure of state-output (observation) vectors

o_t = [ c_t, ∆c_t, ∆²c_t | p_t, δp_t, δ²p_t ]

Spectrum part: mel-cepstral coefficients c_t, plus ∆ and ∆∆ mel-cepstral coefficients
Excitation part: log F0 p_t, plus ∆ log F0 and ∆∆ log F0
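Appending dynamic features can be sketched with common regression windows; the exact windows used in the deck are not specified here, so these are typical choices (0.5·(c[t+1] − c[t−1]) for ∆; c[t−1] − 2c[t] + c[t+1] for ∆∆), with boundary frames repeated at the edges:

```python
import numpy as np

def append_deltas(c):
    """Build per-frame observation vectors [c_t, delta c_t, delta^2 c_t]
    from a static coefficient sequence."""
    c = np.asarray(c, dtype=float)
    padded = np.concatenate([c[:1], c, c[-1:]])          # repeat edges
    delta = 0.5 * (padded[2:] - padded[:-2])             # first-order window
    delta2 = padded[:-2] - 2.0 * padded[1:-1] + padded[2:]  # second-order
    return np.stack([c, delta, delta2], axis=1)

o = append_deltas([0.0, 1.0, 4.0, 9.0])   # 4 frames x [static, d, dd]
```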

Hidden Markov model (HMM)

(Figure: 3-state left-to-right HMM — initial probability π1, self-transitions a11, a22, a33, forward transitions a12, a23, state-output distributions b1(o_t), b2(o_t), b3(o_t))

O: observation sequence o1, o2, o3, o4, o5, …, oT
Q: state sequence, e.g., 1 1 1 1 2 2 3 3

Multi-stream HMM structure

The observation vector o_t is split into streams: stream 1: [c_t, ∆c_t, ∆²c_t] (spectrum); streams 2–4: p_t, δp_t, δ²p_t (excitation)

b_j(o_t) = Π_{s=1}^{S} ( b_sj(o_t^s) )^{w_s}
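The stream-weighted output probability is easiest to compute in the log domain, where the weighted product becomes a weighted sum of per-stream log-likelihoods (illustrative values):

```python
import numpy as np

def multi_stream_log_likelihood(stream_log_probs, stream_weights):
    """log b_j(o_t) = sum_s w_s * log b_sj(o_t^s), i.e. the log of
    prod_s b_sj(o_t^s)^{w_s}."""
    return float(np.dot(stream_weights, stream_log_probs))

# e.g. one spectrum stream and three excitation streams, all weight 1.0
ll = multi_stream_log_likelihood([-2.0, -0.5, -0.3, -0.2],
                                 [1.0, 1.0, 1.0, 1.0])
```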

Training process

1. Compute variance floor (HCompV)
2. Initialize context-independent (monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by the EM algorithm (HRest & HERest)
4. Copy CI-HMMs to context-dependent (fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by the EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by the EM algorithm (HERest)
8. Untie parameter-tying structure (HHEd UT), then repeat reestimation and clustering
9. Estimate CD duration models from forward-backward stats (HERest)
10. Decision tree-based clustering of duration models (HHEd TB)
→ Estimated HMMs and duration models

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Decision tree-based state clustering [7]

[Figure: a binary decision tree with yes/no questions such as L=voice, L="w", R=silence, L="gy"; context-dependent states (e.g. w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) are routed down to leaf nodes, whose distributions are shared by the synthesized states.]

Stream-dependent tree-based clustering

[Figure: separate decision trees for the mel-cepstrum and for F0.]

Spectrum & excitation can have different context dependency
→ build decision trees individually

State duration models [8]

[Figure: state i occupied from frame t₀ to frame t₁ within an observation sequence of length T = 8.]

Probability to enter state i at t₀ and then leave at t₁ + 1:

χ_{t₀,t₁}(i) ∝ Σ_{j≠i} α_{t₀−1}(j) a_{ji} · a_{ii}^{t₁−t₀} · Π_{t=t₀}^{t₁} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t₁+1}) β_{t₁+1}(k)

→ estimate state duration models

Stream-dependent tree-based clustering

[Figure: the HMM has separate decision trees for the mel-cepstrum, for F0, and for the state duration model.]

HMM-based speech synthesis [4]

[Figure: in the training part, spectral and excitation parameters extracted from the SPEECH DATABASE are used with labels to train context-dependent HMMs & state duration models; in the synthesis part, text analysis converts TEXT into labels, speech parameters are generated from the HMMs, and excitation generation plus a synthesis filter produce the SYNTHESIZED SPEECH.]

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)


Best state sequence

[Figure: the 3-state left-to-right HMM from before, with the observation sequence O aligned to the state sequence Q = 1 1 1 1 2 2 3 3 and state durations D = 4, 10, 5.]

Best state outputs without dynamic features

[Figure: per-state means and variances; without dynamic features, ô becomes a step-wise mean vector sequence.]

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_tᵀ, Δc_tᵀ]ᵀ,  Δc_t = c_t − c_{t−1}

[Figure: the M-dimensional static sequence c_{t−2} … c_{t+2} and its delta sequence Δc_{t−2} … Δc_{t+2}, stacked into 2M-dimensional observations.]

The relationship between static and dynamic features can be arranged as o = Wc, where W is a sparse band matrix: for each frame t it contains a static row (… 0 I 0 0 …) that selects c_t and a delta row (… −I I 0 0 …) that forms c_t − c_{t−1}.
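The band matrix W can be built explicitly for small cases. A toy sketch for 1-dimensional static features with a single delta window Δc_t = c_t − c_{t−1} (boundary handled by taking Δc_1 = c_1; real systems use wider windows and M-dimensional blocks):

```python
import numpy as np

# o = W c, with o = [c_1, Δc_1, c_2, Δc_2, ..., c_T, Δc_T].
def build_W(T):
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0              # static row: selects c_t
        W[2 * t + 1, t] = 1.0          # delta row: c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0 # ... minus c_{t-1}
    return W

W = build_W(3)
o = W @ np.array([2.0, 5.0, 4.0])
# o = [2, 2, 5, 3, 4, -1]: statics interleaved with their differences
```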

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

WᵀΣ_q̂⁻¹W c = WᵀΣ_q̂⁻¹ μ_q̂

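The normal equations above can be solved directly for small cases. This is a minimal MLPG sketch with toy sizes and a dense solver (production systems exploit the banded structure with a banded Cholesky decomposition); the window matrix uses the simple Δc_t = c_t − c_{t−1} convention:

```python
import numpy as np

def build_W(T):
    """Window matrix for 1-dim statics with Δc_t = c_t − c_{t−1}."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

def mlpg(mu, var, W):
    """Solve WᵀΣ⁻¹W c = WᵀΣ⁻¹μ for the static trajectory c."""
    P = np.diag(1.0 / var)        # Σ⁻¹ for a diagonal covariance
    A = W.T @ P @ W               # banded, positive definite
    b = W.T @ P @ mu
    return np.linalg.solve(A, b)  # real systems use a banded solver

W = build_W(3)
c_true = np.array([1.0, 2.0, 4.0])
c_hat = mlpg(W @ c_true, np.ones(6), W)  # recovers [1, 2, 4]
```

When the means are self-consistent (μ = Wc for some c), the solution recovers that trajectory exactly; with the step-wise HMM means, it yields the smooth trajectory shown on the next slides.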

Speech parameter generation algorithm [9]

[Figure: the band structure of WᵀΣ_q̂⁻¹W c = WᵀΣ_q̂⁻¹μ_q̂ — the left-hand side is a banded matrix (blocks of 1, 0, −1 entries weighted by the inverse variances) acting on the static trajectory c₁ … c_T; the right-hand side stacks the variance-weighted means μ_q̂₁ … μ_q̂T.]

Generated speech parameter trajectory

[Figure: static and dynamic mean/variance sequences, and the smooth static trajectory c generated under the dynamic feature constraints.]

HMM-based speech synthesis [4]

[Figure: the training/synthesis diagram again — training HMMs from spectral and excitation parameters plus labels; at synthesis time, text analysis, parameter generation from HMMs, excitation generation, and the synthesis filter produce the SYNTHESIZED SPEECH.]

Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) switch between a pulse train and white noise to form the excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); the synthesized speech is x(n) = h(n) ∗ e(n).]

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter
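The source-filter reconstruction x(n) = h(n) ∗ e(n) can be sketched with a toy impulse response (this is an illustration only — h here is not an actual MLSA/LSP filter, and real systems update h frame by frame):

```python
import numpy as np

# Pulse-train excitation for voiced frames, white noise for unvoiced.
def excitation(n_samples, f0, fs=16000, voiced=True, rng=None):
    rng = rng or np.random.default_rng(0)
    if not voiced:
        return rng.standard_normal(n_samples)
    e = np.zeros(n_samples)
    period = int(fs / f0)   # pitch period in samples
    e[::period] = 1.0       # unit pulses at the pitch period
    return e

h = np.array([1.0, 0.5, 0.25])           # toy impulse response
x = np.convolve(excitation(80, 200), h)  # x(n) = h(n) * e(n)
```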

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: training speakers → average-voice model via adaptive training; adaptation then maps the model to target speakers.]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: a new HMM set λ′ interpolated among representative HMM sets λ₁ … λ₄ with interpolation ratios I(λ′, λ_k).]

• Interpolate representative HMM sets
• Can obtain new voices without adaptation data
• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)
  [Figure: binary switch between a pulse train (voiced) and white noise (unvoiced) forming e(n).]
• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple.]
• Phase
  Important but usually ignored

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of them hold for real speech

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
  [Figure: natural vs. generated spectra over 0–8 kHz; the generated spectrum loses fine formant structure.]
• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential


HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees partition the acoustic space with yes/no questions; each leaf holds a GMM state-output distribution.]

• Decision tree-clustered HMM with GMM state-output distributions

DNN-based acoustic modeling for SPSS [18]

[Figure: a feed-forward network with hidden layers h₁, h₂, h₃ mapping linguistic features x to acoustic features y.]

• The DNN represents the conditional distribution of y given x
• The DNN replaces the decision trees and GMMs
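The frame-level mapping the slide describes can be sketched as a plain feed-forward pass. Shapes and weights below are illustrative placeholders, not the architecture of the evaluated system:

```python
import numpy as np

# Stack of sigmoid hidden layers mapping a linguistic feature vector x
# to acoustic statistics y, with a linear output layer.
def dnn_forward(x, layers):
    h = x
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))  # sigmoid hidden layers
    W, b = layers[-1]
    return W @ h + b                            # linear output layer

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((16, 8)), np.zeros(16)),   # 8 -> 16 hidden
          (rng.standard_normal((4, 16)), np.zeros(4))]    # 16 -> 4 output
y = dnn_forward(rng.standard_normal(8), layers)
```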

Framework

[Figure: the DNN-based synthesis pipeline — text analysis and input feature extraction produce input features (binary & numeric linguistic features, duration feature, frame-position feature) for each frame 1 … T; the network (input layer, hidden layers, output layer) outputs statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature); parameter generation and waveform synthesis produce SPEECH, with duration prediction supplying the frame alignment.]

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform


Framework

Is this new? … no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques


Experimental setup

Database: US English female speaker
Training & test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Example of speech parameter trajectories

Without grouping questions & numeric contexts; silence frames removed.

[Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α = 1), and DNN (4 × 512).]

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)      DNN (layers × units)   Neutral   p value   z value
15.8 (16)    38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)     27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)     36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples with a bimodal distribution in (y₁, y₂); the NN prediction runs between the two modes.]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with an MSE loss approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]


Mixture density network [26]

[Figure: a 1-dim, 2-mixture MDN — the network's six outputs z₁ … z₆ parameterize the mixture weights w₁(x), w₂(x), means μ₁(x), μ₂(x), and variances σ₁(x), σ₂(x) of a GMM over y.]

Inputs of the activation function: z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights → softmax activation function:
  w₁(x) = exp(z₁) / Σ_{m=1}^{2} exp(z_m),  w₂(x) = exp(z₂) / Σ_{m=1}^{2} exp(z_m)
Means → linear activation function:
  μ₁(x) = z₃,  μ₂(x) = z₄
Variances → exponential activation function:
  σ₁(x) = exp(z₅),  σ₂(x) = exp(z₆)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
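The output-layer activations above translate directly into code. A minimal sketch of the 1-dim, 2-mixture case on the slide (six network outputs z₁ … z₆; indices only, no trained weights):

```python
import numpy as np

# Map last-layer activations z to GMM parameters: softmax -> weights,
# linear -> means, exp -> positive standard deviations.
def mdn_params(z):
    zw, zmu, zsig = z[:2], z[2:4], z[4:6]
    w = np.exp(zw) / np.exp(zw).sum()  # softmax: weights sum to 1
    mu = zmu                           # linear activation
    sigma = np.exp(zsig)               # exponential keeps sigma > 0
    return w, mu, sigma

w, mu, sigma = mdn_params(np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0]))
# w = [0.5, 0.5], mu = [-1, 1], sigma = [1, 1]
```

The same construction extends to more mixtures and higher-dimensional targets by enlarging the output layer.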

DMDN-based SPSS [27]

[Figure: the DMDN pipeline — text analysis and input feature extraction feed frame-level inputs x₁ … x_T; at each frame the network outputs mixture parameters w_m(x_t), μ_m(x_t), σ_m(x_t); parameter generation and waveform synthesis produce SPEECH, with duration prediction supplying the alignment.]

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM        1 mix     3.537 ± 0.113
           2 mix     3.397 ± 0.115
DNN        4×1024    3.635 ± 0.127
           5×1024    3.681 ± 0.109
           6×1024    3.652 ± 0.108
           7×1024    3.637 ± 0.129
DMDN       1 mix     3.654 ± 0.117
(4×1024)   2 mix     3.796 ± 0.107
           4 mix     3.766 ± 0.113
           8 mix     3.805 ± 0.113
           16 mix    3.791 ± 0.102

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]


Basic RNN

[Figure: a recurrent network unrolled in time — input x_t, output y_t, with recurrent connections linking the hidden states at t−1, t, t+1.]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: an LSTM block — the memory cell c_t is updated through a tanh input squashing and sigmoid input, forget, and output gates, each fed by x_t and h_{t−1} with biases b_i, b_f, b_o, b_c; the input gate controls writing, the output gate reading, and the forget gate resetting; the block output is h_t.]
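A single step of the gated update in the figure can be written out in a few lines. This is a toy cell with random placeholder weights (no peephole connections or projection layer, which the evaluated system's LSTM variant may include):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: W is (4n, |x|+n) for cell size n."""
    z = W @ np.concatenate([x, h_prev]) + b
    n = len(c_prev)
    i = sigmoid(z[:n])            # input gate (write)
    f = sigmoid(z[n:2 * n])       # forget gate (reset)
    o = sigmoid(z[2 * n:3 * n])   # output gate (read)
    g = np.tanh(z[3 * n:])        # candidate cell input
    c = f * c_prev + i * g        # linear memory-cell update
    h = o * np.tanh(c)            # gated block output
    return h, c

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 5))   # cell size 2, input size 3
h, c = lstm_step(np.ones(3), np.zeros(2), np.zeros(2), W, np.zeros(8))
```

The additive `f * c_prev + i * g` update is what lets information survive many time steps instead of decaying through repeated squashing, addressing the long-range-context problem of the basic RNN.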

LSTM-based SPSS [33, 34]

[Figure: text analysis and input feature extraction produce frame-level inputs x₁ … x_T; an LSTM maps them to acoustic frames y₁ … y_T; parameter generation and waveform synthesis produce SPEECH, with duration prediction supplying the alignment.]

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN               LSTM              Stats
w/ Δ    w/o Δ     w/ Δ    w/o Δ     Neutral   z      p
50.0    14.2      –       –         35.8      12.0   < 10⁻¹⁰
–       –         30.2    15.6      54.2      5.1    < 10⁻⁶
15.8    –         34.0    –         50.2      −6.2   < 10⁻⁹
28.4    –         –       33.6      38.0      −1.5   0.138

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UK Speech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Typical flow of TTS system

TEXT
→ Text analysis (NLP, frontend; discrete ⇒ discrete):
  sentence segmentation, word segmentation, text normalization,
  part-of-speech tagging, pronunciation, prosody prediction
→ Speech synthesis (Speech, backend; discrete ⇒ continuous):
  waveform generation
→ SYNTHESIZED SPEECH

This talk focuses on the backend

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 3 of 79

Concatenative speech synthesis

(Figure: selecting from all candidate segments in the database using a target cost and a concatenation cost.)

• Concatenate actual instances of speech from a database

• Large data + automatic learning
  → High-quality synthetic voices can be built automatically

• Single inventory per unit → diphone synthesis [1]

• Multiple inventories per unit → unit selection synthesis [2]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 4 of 79

Statistical parametric speech synthesis (SPSS) [3]

(Diagram: Speech → speech analysis → y; Text → text analysis → x; model training yields λ. At synthesis time: Text → text analysis → x → parameter generation → y → speech synthesis → Speech.)

• Training

− Extract linguistic features x & acoustic features y
− Train acoustic model λ given (x, y):

  λ̂ = arg max_λ p(y | x, λ)

• Synthesis

− Extract x from the text to be synthesized
− Generate the most probable y from λ̂:

  ŷ = arg max_y p(y | x, λ̂)

− Reconstruct speech from ŷ

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79


Statistical parametric speech synthesis (SPSS) [3]

(Same training/synthesis diagram as above.)

• Large data + automatic training
  → Automatic voice building

• Parametric representation of speech
  → Flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model
→ HMM-based speech synthesis system (HTS) [4]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79

Outline

Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary

HMM-based speech synthesis [4]

(System diagram)

Training part: SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → spectral/excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models.

Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters → excitation generation; spectral parameters → synthesis filter → SYNTHESIZED SPEECH.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79

HMM-based speech synthesis [4]

(Same training/synthesis system diagram as before.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 9 of 79

Speech production process

(Figure: speech production — text (concept) is conveyed by modulation of a carrier wave by speech information: the sound source (voiced: pulse, unvoiced: noise; driven by air flow) determines the fundamental frequency (voiced/unvoiced), the vocal tract determines the frequency transfer characteristics, and magnitude and start–end timing complete the signal → speech.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79

Source-filter model

pulse train (voiced) / white noise (unvoiced) → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) ∗ e(n)

x(n) = h(n) ∗ e(n)

  ↓ Fourier transform

X(e^{jω}) = H(e^{jω}) E(e^{jω})

E(e^{jω}): source excitation part; H(e^{jω}): vocal tract resonance part
— should be defined by HMM state-output vectors, e.g. mel-cepstrum, line spectral pairs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
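The source-filter relation above can be sketched numerically. This is a minimal, hypothetical example — the sampling rate, F0, and impulse response are all illustrative values, not from the slides — that builds a pulse-train excitation and convolves it with an LTI impulse response to obtain x(n) = h(n) ∗ e(n).

```python
import numpy as np

fs = 16000          # sampling rate (Hz), assumed for this sketch
f0 = 100            # fundamental frequency of the pulse train (Hz)
n = 800             # number of samples (50 ms)

# Pulse-train excitation e(n): one unit impulse every fs/f0 samples
e = np.zeros(n)
e[::fs // f0] = 1.0

# Toy vocal-tract impulse response h(n): a decaying resonance (illustrative only)
t = np.arange(64)
h = np.exp(-t / 20.0) * np.cos(2 * np.pi * 500 * t / fs)

# Convolution in time corresponds to X(e^jw) = H(e^jw) E(e^jw) in frequency
x = np.convolve(e, h)[:n]
```

Replacing the pulse train with white noise would model the unvoiced case.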

Parametric models of speech signal

Autoregressive (AR) model:

  H(z) = K / ( 1 − Σ_{m=1}^{M} c(m) z^{−m} )

Exponential (EX) model:

  H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

  ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]

• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
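As an illustration of the two envelope models, the sketch below (with illustrative coefficients, not from the slides) evaluates both transfer functions on the unit circle z = e^{jω}.

```python
import numpy as np

# Illustrative coefficients c(1..M); real systems estimate these by ML
c = np.array([0.5, 0.2, -0.1])
K = 1.0
omega = np.linspace(0, np.pi, 256)

# z^{-m} on the unit circle, for m = 1 .. M
z_inv = np.exp(-1j * np.outer(omega, np.arange(1, len(c) + 1)))

H_ar = K / (1.0 - z_inv @ c)    # AR (all-pole) model: K / (1 - sum c(m) z^-m)
H_ex = np.exp(z_inv @ c)        # EX model: exp(sum c(m) z^-m)

log_mag_ar = 20 * np.log10(np.abs(H_ar))
log_mag_ex = 20 * np.log10(np.abs(H_ex))
```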

Examples of speech spectra

(Figure: log magnitude spectra over 0–5 kHz — (a) ML-based cepstral analysis, (b) linear prediction.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79

HMM-based speech synthesis [4]

(Same training/synthesis system diagram as before.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79

Structure of state-output (observation) vectors

Spectrum part: mel-cepstral coefficients c_t, ∆ mel-cepstral coefficients ∆c_t, ∆∆ mel-cepstral coefficients ∆²c_t

Excitation part: log F0 p_t, ∆ log F0 δp_t, ∆∆ log F0 δ²p_t

o_t = [ c_t^⊤, ∆c_t^⊤, ∆²c_t^⊤, p_t, δp_t, δ²p_t ]^⊤

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

(Figure: 3-state left-to-right HMM with initial probability π₁, transition probabilities a₁₁, a₁₂, a₂₂, a₂₃, a₃₃, and state-output distributions b₁(o_t), b₂(o_t), b₃(o_t).)

O = o₁ o₂ o₃ o₄ o₅ … o_T   (observation sequence)

Q = 1 1 1 1 2 2 … 3 3       (state sequence)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

(Figure: state-output vector o_t divided into streams — stream 1: c_t, ∆c_t, ∆²c_t (spectrum); streams 2–4: p_t, δp_t, δ²p_t (excitation) — each stream s with its own output distribution b_sj(o_t^s).)

b_j(o_t) = Π_{s=1}^{S} ( b_sj(o_t^s) )^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
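The stream-weighted product b_j(o_t) = Π_s (b_sj(o_t^s))^{w_s} can be sketched as follows; the scalar Gaussian streams, their parameters, and the stream weights are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian(o, mu, var):
    # Scalar Gaussian density N(o; mu, var)
    return np.exp(-0.5 * (o - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# (observation, mean, variance) per stream — illustrative values
streams = [
    (0.3, 0.0, 1.0),   # e.g. a spectrum stream (scalar for simplicity)
    (5.0, 5.2, 0.5),   # e.g. a log F0 stream
]
weights = [1.0, 1.0]   # stream weights w_s (assumed)

# b_j(o_t) = product over streams of b_sj(o_t^s) ** w_s
b = 1.0
for (o, mu, var), w in zip(streams, weights):
    b *= gaussian(o, mu, var) ** w
```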

Training process

(Flowchart, given data & labels:)

1. Compute variance floor (HCompV)
2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by the EM algorithm (HRest & HERest)
4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by the EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by the EM algorithm (HERest)
8. Untie the parameter-tying structure (HHEd UT)
9. Estimate CD duration models from forward-backward statistics (HERest)
10. Decision tree-based clustering of duration models (HHEd TB)

→ Estimated HMMs and duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• Preceding & succeeding two phonemes

• Position of current phoneme in current syllable

• # of phonemes in preceding, current, succeeding syllable

• Accent & stress of preceding, current, succeeding syllable

• Position of current syllable in current word

• # of preceding & succeeding stressed/accented syllables in phrase

• # of syllables from previous / to next stressed/accented syllable

• Guess at part of speech of preceding, current, succeeding word

• # of syllables in preceding, current, succeeding word

• Position of current word in current phrase

• # of preceding & succeeding content words in current phrase

• # of words from previous / to next content word

• # of syllables in preceding, current, succeeding phrase

Impossible to have all possible models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

(Figure: decision tree over context questions such as "L = voice?", "L = 'w'?", "R = silence?", "L = 'gy'?"; leaf nodes hold clustered (synthesized) states shared by contexts like w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

(Figure: separate decision trees for mel-cepstrum and for F0.)

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

(Figure: state i entered at t₀ and left at t₁ + 1, with t = 1 … T = 8.)

Probability of entering state i at t₀ and then leaving at t₁ + 1:

χ_{t₀,t₁}(i) ∝ Σ_{j≠i} α_{t₀−1}(j) a_{ji} · a_{ii}^{t₁−t₀} · Π_{t=t₀}^{t₁} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t₁+1}) β_{t₁+1}(k)

→ Estimate state duration models
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

(Figure: an HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for the state duration models.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

(Same training/synthesis system diagram as before.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM λ and words w:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)

ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

(Figure: the same 3-state left-to-right HMM; observation sequence O = o₁ … o_T, state sequence Q = 1 1 1 1 2 2 … 3 3, state durations D = 4, 10, 5.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

(Figure: per-state means and variances.)

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [ c_t^⊤, ∆c_t^⊤ ]^⊤,  ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form as o = Wc:

[ …, c_{t−1}^⊤, ∆c_{t−1}^⊤, c_t^⊤, ∆c_t^⊤, c_{t+1}^⊤, ∆c_{t+1}^⊤, … ]^⊤ = W [ …, c_{t−2}^⊤, c_{t−1}^⊤, c_t^⊤, c_{t+1}^⊤, … ]^⊤

where each static row of W contains I in the column of c_t, and each delta row contains −I in the column of c_{t−1} and I in the column of c_t.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
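A minimal sketch of building W for 1-dimensional static features with a single delta, using the slide's ∆c_t = c_t − c_{t−1}; taking c₀ = 0 at the boundary is an assumption of this sketch.

```python
import numpy as np

T = 4
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0          # static row: picks out c_t
    W[2 * t + 1, t] = 1.0      # delta row: c_t - c_{t-1}
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

c = np.array([0.0, 1.0, 1.5, 1.5])
o = W @ c                      # [c_1, dc_1, c_2, dc_2, ...]
```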

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

Setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0 gives:

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
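Putting the pieces together, a toy version of the generation step: build W and a diagonal Σ, then solve W^⊤Σ^{−1}Wc = W^⊤Σ^{−1}μ for the smooth static trajectory c. All numbers are illustrative (1-dimensional features, unit variances).

```python
import numpy as np

T = 5
mu_static = np.array([0.0, 0.0, 1.0, 1.0, 1.0])   # per-frame static means
mu_delta = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # per-frame delta means
var = np.ones(2 * T)                               # diagonal Sigma (assumed)

# W stacks [c_t, dc_t] with dc_t = c_t - c_{t-1} (c_0 = 0 at the boundary)
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu = np.empty(2 * T)
mu[0::2] = mu_static
mu[1::2] = mu_delta

Sinv = np.diag(1.0 / var)
A = W.T @ Sinv @ W      # banded, positive definite
b = W.T @ Sinv @ mu
c = np.linalg.solve(A, b)   # smooth static trajectory
```

In practice the banded structure of A allows an efficient band-Cholesky solve instead of a dense one.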


Speech parameter generation algorithm [9]

(Figure: explicit band structure of W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂ — the left-hand matrix W^⊤ Σ_q̂^{−1} W is banded, and the unknown static trajectory c = [c₁, …, c_T] is obtained from the means μ_q̂₁, …, μ_q̂T.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

(Figure: static and dynamic components of the generated trajectory c, with per-state means and variances.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

(Same training/synthesis system diagram as before.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

Generated excitation parameters (log F0 with V/UV) drive the pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n).

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics (adaptation, interpolation)
− Small footprint [10, 11]
− Robustness [12]

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

(Figure: adaptive training of an average-voice model on training speakers, then adaptation to target speakers.)

• Train an average voice model (AVM) from training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker / speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

(Figure: a new HMM set λ′ obtained by interpolating between HMM sets λ₁ … λ₄ with interpolation ratios I(λ′, λ_k).)

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation:
  difficult to model mixes of V/UV sounds (e.g. voiced fricatives)
  (Figure: excitation switching between white noise (unvoiced) and pulse train (voiced).)

• Spectral envelope extraction:
  harmonic effects often cause problems
  (Figure: power spectrum in dB, 0–8 kHz.)

• Phase:
  important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic / stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
  statistics do not vary within an HMM state

• Conditional independence assumption:
  state output probability depends only on the current state

• Weak duration modeling:
  state duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model

− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model

− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model

− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

(Figure: natural vs. generated spectra, 0–8 kHz.)

• Why?

− Details of the spectral (formant) structure disappear
− Using a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering

− Mel-cepstrum
− LSP

• Nonparametric approach

− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics

− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics
  (adaptation; interpolation, eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training: learn the relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS

− Phoneme, syllable, word, phrase, utterance-level features
− e.g. phone identity, POS, stress of words in a phrase
− Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

(Figure: decision trees — cascades of yes/no questions — partition the acoustic space; each leaf holds a GMM state-output distribution.)

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

(Figure: feed-forward network with hidden layers h₁, h₂, h₃ mapping linguistic features x to acoustic features y.)

• DNN represents the conditional distribution of y given x

• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
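A minimal sketch of the frame-wise mapping: a feed-forward network takes a linguistic feature vector x for one frame and outputs an acoustic feature vector y. The layer sizes are assumptions and the weights are random stand-ins for a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = rng.random(30)                      # linguistic features for one frame (assumed dim)
sizes = [30, 256, 256, 256, 40]         # 3 hidden layers, 40 acoustic dims (assumed)
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

h = x
for i, (Wl, bl) in enumerate(params):
    a = Wl @ h + bl
    # sigmoid hidden layers, linear output layer (as in the experimental setup)
    h = a if i == len(params) - 1 else sigmoid(a)

y = h   # predicted acoustic feature vector for this frame
```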

Framework

(Figure: DNN-based framework — TEXT → text analysis → input feature extraction (binary & numeric linguistic features at frames 1 … T, duration feature, frame position feature; durations from duration prediction) → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture with non-linear operations
  → Integrates feature extraction into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? — No

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker

Training / test data: 33,000 & 173 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: 11 categorical features, 25 numeric features

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²

HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]

DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

(Figure: 5-th mel-cepstrum over frames 0–500 — natural speech, HMM (α = 1), DNN (4 × 512).)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

HMM (α)     DNN (#layers × #units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)           45.7      < 10⁻⁶    −9.9
16.1 (4)    27.2 (4 × 512)           56.8      < 10⁻⁶    −5.1
12.7 (1)    36.6 (4 × 1024)          50.7      < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

(Figure: one-to-many data samples in (y₁, y₂) space vs. the NN prediction, which falls between the modes.)

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− NN trained with an MSE loss → approximates the conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

(Figure: 1-dim, 2-mix MDN — the network's output units are mapped to mixture weights w₁(x), w₂(x), means μ₁(x), μ₂(x), and variances σ₁(x), σ₂(x) of p(y | x).)

Inputs of activation function: z_j = Σ_{i=1}^{4} h_i w_{ij}

• Weights → softmax activation function:

  w₁(x) = exp(z₁) / Σ_{m=1}^{2} exp(z_m),  w₂(x) = exp(z₂) / Σ_{m=1}^{2} exp(z_m)

• Means → linear activation function:

  μ₁(x) = z₃,  μ₂(x) = z₄

• Variances → exponential activation function:

  σ₁(x) = exp(z₅),  σ₂(x) = exp(z₆)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
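The output-layer transforms above can be sketched directly for the 1-dim, 2-mix case; the activations z₁ … z₆ below are illustrative values standing in for a real network's outputs.

```python
import numpy as np

z = np.array([0.2, -0.4, 1.0, 3.0, -1.0, 0.5])   # illustrative z_1 .. z_6

w = np.exp(z[:2]) / np.exp(z[:2]).sum()   # weights: softmax over z_1, z_2
mu = z[2:4]                               # means: linear, z_3 and z_4
sigma = np.exp(z[4:6])                    # std devs: exponential of z_5, z_6

def mdn_density(y):
    # p(y | x) = sum_m w_m N(y; mu_m, sigma_m^2)
    return np.sum(w * np.exp(-0.5 * ((y - mu) / sigma) ** 2)
                  / (np.sqrt(2 * np.pi) * sigma))
```

Training minimizes the negative log of this density over the data, instead of the MSE loss of the plain DNN.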

DMDN-based SPSS [27]

(Figure: DMDN-based framework — TEXT → text analysis → input feature extraction (with duration prediction) → deep MDN outputs frame-wise GMM parameters w_m(x_t), μ_m(x_t), σ_m(x_t) for t = 1 … T → parameter generation → waveform synthesis → SPEECH.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)

DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density output layer, 1–16 mix

Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

System                    MOS
HMM, 1 mix                3.537 ± 0.113
HMM, 2 mix                3.397 ± 0.115
DNN, 4 × 1024             3.635 ± 0.127
DNN, 5 × 1024             3.681 ± 0.109
DNN, 6 × 1024             3.652 ± 0.108
DNN, 7 × 1024             3.637 ± 0.129
DMDN (4 × 1024), 1 mix    3.654 ± 0.117
DMDN (4 × 1024), 2 mix    3.796 ± 0.107
DMDN (4 × 1024), 4 mix    3.766 ± 0.113
DMDN (4 × 1024), 8 mix    3.805 ± 0.113
DMDN (4 × 1024), 16 mix   3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − Fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: recurrent connections — inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units
  − Input gate: write
  − Output gate: read
  − Forget gate: reset

[Figure: LSTM block — linear memory cell c_t with sigmoid input, forget, and output gates and tanh input/output nonlinearities, all fed by x_t and h_{t−1}, producing output h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
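The write/read/reset gating above can be sketched as one step of a vanilla LSTM block. The no-peephole form, the toy sizes, and the random parameter dict `p` are all assumptions for illustration:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a vanilla LSTM memory block (no peepholes).
    All gates read [x_t, h_{t-1}]; the linear cell c_t is gated on
    write (input gate), reset (forget gate) and read (output gate)."""
    z = np.concatenate([x, h_prev])
    i = sigm(p["Wi"] @ z + p["bi"])        # input gate: write
    f = sigm(p["Wf"] @ z + p["bf"])        # forget gate: reset
    o = sigm(p["Wo"] @ z + p["bo"])        # output gate: read
    g = np.tanh(p["Wc"] @ z + p["bc"])     # candidate cell input
    c = f * c_prev + i * g                 # linear memory cell update
    h = o * np.tanh(c)                     # block output
    return h, c

# Toy parameters (hypothetical): 3 inputs, 4 cells
rng = np.random.default_rng(0)
n_in, n_cell = 3, 4
p = {k: 0.1 * rng.standard_normal((n_cell, n_in + n_cell))
     for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({b: np.zeros(n_cell) for b in ("bi", "bf", "bo", "bc")})
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_cell), np.zeros(n_cell), p)
```

Because the cell update is linear in c_{t−1}, gradients can flow over many steps without the quick decay described on the previous slide.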

LSTM-based SPSS [33, 34]

[Figure: pipeline — TEXT → text analysis → input feature extraction → duration prediction → LSTM maps x₁ … x_T to y₁ … y_T frame by frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database            US English female speaker
Train / dev data    34,632 & 100 sentences
Sampling rate       16 kHz
Analysis window     25-ms width, 5-ms shift
Linguistic features DNN: 449, LSTM: 289
Acoustic features   0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN                 4 hidden layers, 1024 units/hidden layer;
                    ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM                1 forward LSTM layer, 256 units, 128 projection;
                    asynchronous SGD on CPUs [35]
Postprocessing      Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

      DNN                LSTM                          Stats
w/ Δ     w/o Δ      w/ Δ     w/o Δ     Neutral      z       p
50.0     14.2        –        –         35.8       12.0    < 10⁻¹⁰
 –        –         30.2     15.6       54.2        5.1    < 10⁻⁶
15.8      –         34.0      –         50.2       −6.2    < 10⁻⁹
28.4      –          –       33.6       38.0       −1.5    0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Concatenative speech synthesis

[Figure: selecting segments from a database of all segments, with target cost and concatenation cost]

• Concatenate actual instances of speech from database

• Large data + automatic learning
  → High-quality synthetic voices can be built automatically

• Single inventory per unit → diphone synthesis [1]

• Multiple inventories per unit → unit selection synthesis [2]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 4 of 79

Statistical parametric speech synthesis (SPSS) [3]

[Figure: training pipeline — speech analysis & text analysis produce acoustic features y and linguistic features x for model training of λ; at synthesis time, text analysis produces x, parameter generation produces y, and speech synthesis reconstructs the waveform]

• Training
  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y):

    λ̂ = arg max_λ p(y | x, λ)

• Synthesis
  − Extract x from text to be synthesized
  − Generate most probable y from λ̂:

    ŷ = arg max_y p(y | x, λ̂)

  − Reconstruct speech from ŷ

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79


Statistical parametric speech synthesis (SPSS) [3]

[Figure: same training/synthesis pipeline as above]

• Large data + automatic training
  → Automatic voice building

• Parametric representation of speech
  → Flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model
→ HMM-based speech synthesis system (HTS) [4]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

HMM-based speech synthesis [4]

[Figure: training part — SPEECH DATABASE → spectral & excitation parameter extraction; labels + parameters → training HMMs → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79

HMM-based speech synthesis [4]

[Figure: same block diagram as above]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 9 of 79

Speech production process

[Figure: text (concept) → modulation of carrier wave by speech information — fundamental frequency (magnitude, start–end), voiced/unvoiced decision, frequency transfer characteristics; air flow from sound source (voiced: pulse, unvoiced: noise) shaped by the vocal tract → speech]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79

Source-filter model

[Figure: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) ∗ e(n)]

x(n) = h(n) ∗ e(n)

  ↓ Fourier transform

X(e^jω) = H(e^jω) E(e^jω)

• Source (excitation) part: e(n)
• Vocal tract resonance part: H(e^jω)
  → defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
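The source-filter relation above can be sketched numerically: a pulse-train excitation convolved with an impulse response, and the convolution theorem X = H · E checked via zero-padded FFTs. The sample rate, pitch, and decaying impulse response are toy values, not a real vocoder:

```python
import numpy as np

fs, f0 = 8000, 100                 # sample rate and pitch (illustrative)
n = 400
e = np.zeros(n)
e[:: fs // f0] = 1.0               # voiced excitation: pulse every fs/f0 samples
h = 0.9 ** np.arange(50)           # toy decaying impulse response h(n)
x = np.convolve(e, h)              # x(n) = h(n) * e(n), length n + len(h) - 1

# Convolution theorem: with zero-padding to the full linear-convolution
# length, the DFT of x equals the product of the DFTs of e and h.
N = len(x)
X_time = np.fft.rfft(x, N)
X_freq = np.fft.rfft(e, N) * np.fft.rfft(h, N)
```

The same factorization is why SPSS can model excitation (F0) and spectrum (filter) as separate parameter streams and recombine them at synthesis time.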

Parametric models of speech signal

Autoregressive (AR) model:

H(z) = K / ( 1 − Σ_{m=1}^{M} c(m) z^{−m} )

Exponential (EX) model:

H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]

• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
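For the exponential model, log H(e^jω) = Σ_m c(m) e^{−jωm}, so the log-magnitude spectrum is just the real part of the Fourier transform of the cepstrum. A minimal sketch with arbitrary illustrative cepstral coefficients:

```python
import numpy as np

# Cepstral coefficients c(0)..c(3): hypothetical values for illustration
c = np.array([0.5, 0.8, -0.3, 0.1])
N = 64

# log|H(e^{jw_k})| = Re( sum_m c(m) e^{-j 2 pi k m / N} )
# i.e. the real part of the zero-padded DFT of the cepstrum.
logH = np.real(np.fft.fft(c, N))
```

At ω = 0 this reduces to log|H| = Σ_m c(m), a quick sanity check on the transform.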

Examples of speech spectra

[Figure: log magnitude spectra (dB) over 0–5 kHz — (a) ML-based cepstral analysis, (b) linear prediction]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79

HMM-based speech synthesis [4]

[Figure: same block diagram as above]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79

Structure of state-output (observation) vectors

o_t = [ c_t, Δc_t, Δ²c_t, p_t, δp_t, δ²p_t ]

• Spectrum part: mel-cepstral coefficients c_t, plus Δ and ΔΔ mel-cepstral coefficients

• Excitation part: log F0 p_t, plus Δ log F0 and ΔΔ log F0

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π₁, transitions a₁₁, a₁₂, a₂₂, a₂₃, a₃₃, and state-output distributions b₁(o_t), b₂(o_t), b₃(o_t)]

O: observation sequence  o₁ o₂ o₃ o₄ o₅ … o_T
Q: state sequence        1  1  1  1  2  2 … 3  3

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

[Figure: observation vector o_t split into streams — stream 1 (spectrum): o_t¹ = [c_t, Δc_t, Δ²c_t]; streams 2–4 (excitation): o_t² = p_t, o_t³ = δp_t, o_t⁴ = δ²p_t]

b_j(o_t) = Π_{s=1}^{S} ( b_j^s(o_t^s) )^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79

Training process

data & labels
→ Compute variance floor (HCompV)
→ Initialize CI-HMMs by segmental k-means (HInit)
→ Reestimate CI-HMMs by EM algorithm (HRest & HERest)    [monophone (context-independent, CI)]
→ Copy CI-HMMs to CD-HMMs (HHEd CL)
→ Reestimate CD-HMMs by EM algorithm (HERest)            [fullcontext (context-dependent, CD)]
→ Decision tree-based clustering (HHEd TB)
→ Reestimate CD-HMMs by EM algorithm (HERest)
→ Untie parameter tying structure (HHEd UT)
→ Estimated HMMs
→ Estimate CD-dur models from FB stats (HERest)
→ Decision tree-based clustering (HHEd TB)
→ Estimated dur models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes

• Position of current phoneme in current syllable

• # of phonemes at {preceding, current, succeeding} syllable

• {accent, stress} of {preceding, current, succeeding} syllable

• Position of current syllable in current word

• # of {preceding, succeeding} {stressed, accented} syllables in phrase

• # of syllables {from previous, to next} {stressed, accented} syllable

• Guess at part of speech of {preceding, current, succeeding} word

• # of syllables in {preceding, current, succeeding} word

• Position of current word in current phrase

• # of {preceding, succeeding} content words in current phrase

• # of words {from previous, to next} content word

• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Figure: binary decision tree with yes/no questions (L=voice?, L="w"?, R=silence?, L="gy"?) routing context-dependent states (w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n, …) to leaf nodes of synthesized states]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: state i occupied from t₀ to t₁ on a time axis t = 1 … T = 8]

Probability to enter state i at t₀ then leave at t₁ + 1:

χ_{t₀,t₁}(i) ∝ Σ_{j≠i} α_{t₀−1}(j) a_{ji} · a_{ii}^{t₁−t₀} Π_{t=t₀}^{t₁} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t₁+1}) β_{t₁+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Figure: same block diagram as above]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate most probable state outputs given HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: 3-state left-to-right HMM as before; observation sequence o₁ … o_T aligned to state sequence 1 1 1 1 2 2 … 3 3; state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: per-state mean and variance trajectories]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [ c_t^⊤, Δc_t^⊤ ]^⊤,   Δc_t = c_t − c_{t−1}

Relationship between static and dynamic features can be arranged in matrix form, o = Wc:

  ⋮             ⋯  ⋯   ⋯  ⋯  ⋯      ⋮
  c_{t−1}       ⋯  0   I  0  0 ⋯    c_{t−2}
  Δc_{t−1}      ⋯ −I   I  0  0 ⋯    c_{t−1}
  c_t        =  ⋯  0   0  I  0 ⋯  × c_t
  Δc_t          ⋯  0  −I  I  0 ⋯    c_{t+1}
  c_{t+1}       ⋯  0   0  0  I ⋯     ⋮
  Δc_{t+1}      ⋯  0   0 −I  I ⋯
  ⋮             ⋯  ⋯   ⋯  ⋯  ⋯

    o                  W                c

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
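The banded W above can be sketched for a 1-dimensional static feature using the slide's Δc_t = c_t − c_{t−1} definition; the t = 0 boundary row (where no c_{t−1} exists) is an assumption of this sketch:

```python
import numpy as np

def make_W(T):
    """Build the (2T x T) window matrix stacking [c_t, delta c_t] per frame,
    with delta c_t = c_t - c_{t-1} (delta of frame 0 taken as c_0)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0            # static row: picks out c_t
        W[2 * t + 1, t] = 1.0        # delta row: +c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0   # ... minus c_{t-1}
    return W

c = np.array([1.0, 2.0, 4.0])
o = make_W(3) @ c                    # [c_0, dc_0, c_1, dc_1, c_2, dc_2]
```

Applying W to the toy trajectory interleaves statics and deltas: o = [1, 1, 2, 1, 4, 2].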

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)   subject to   o = Wc

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
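The closed-form generation step amounts to solving the normal equations above for the static trajectory c. A minimal numpy sketch: W is rebuilt for a 1-dim feature with Δc_t = c_t − c_{t−1}, and the means μ_q and diagonal Σ_q are toy values, not from a trained model:

```python
import numpy as np

def mlpg(W, mu, var):
    """Solve W' Sigma^{-1} W c = W' Sigma^{-1} mu for c
    (diagonal covariance given as a variance vector)."""
    P = W.T @ np.diag(1.0 / var)     # W' Sigma^{-1}
    return np.linalg.solve(P @ W, P @ mu)

T = 3
W = np.zeros((2 * T, T))             # rows interleave [c_t, delta c_t]
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu = np.array([0.0, 0.0, 1.0, 0.5, 2.0, 1.0])   # toy stacked means
var = np.array([1.0, 4.0, 1.0, 4.0, 1.0, 4.0])  # toy variances
c_hat = mlpg(W, mu, var)
```

Because the delta rows of W couple neighbouring frames, the solved c_hat is smoother than the step-wise static means alone, which is exactly the effect shown on the trajectory slide.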


Speech parameter generation algorithm [9]

[Figure: the banded linear system W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} μ_q written out — 0 / 1 / −1 band structure of W, stacked means μ_q₁ … μ_qT on the right-hand side, and the static trajectory c₁ … c_T as the unknown]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: static and dynamic mean/variance sequences with the generated smooth trajectory c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: same block diagram as above]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) → pulse train / white noise → excitation e(n); generated spectral parameters (cepstrum, LSP) → linear time-invariant system h(n) → synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: training speakers → adaptive training → average-voice model → adaptation → target speakers]

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires only a small amount of data from target speaker/speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: HMM sets λ₁ … λ₄ interpolated into λ′ with ratios I(λ′, λ₁) … I(λ′, λ₄); λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation:
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction:
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz]

• Phase:
  Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic/stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
  Statistics do not vary within an HMM state

• Conditional independence assumption:
  State output probability depends only on the current state

• Weak duration modeling:
  State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectrograms, 0–8 kHz]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation; interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training:
  Learn relationship between linguistic & acoustic features

• Synthesis:
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree over the acoustic space, with yes/no context questions at each node]

• Decision tree-clustered HMM with GMM state-output distributions

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces the decision trees and GMMs
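The mapping just described can be sketched as a plain feed-forward pass; this is a minimal numpy illustration with toy dimensions and random weights, not the trained network from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_forward(x, weights, biases):
    """Feed-forward acoustic model: linguistic features x -> acoustic
    features y. Sigmoid hidden layers, linear output (regression)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden layer
    return h @ weights[-1] + biases[-1]          # linear output layer

# Toy sizes: 10 linguistic inputs, two 16-unit hidden layers, 4 acoustic outputs
sizes = [10, 16, 16, 4]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.standard_normal(10)   # one frame of linguistic features
y = mlp_forward(x, weights, biases)
print(y.shape)  # (4,)
```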

Framework

[Figure: DNN-based TTS pipeline. Text analysis extracts input features (binary & numeric, plus duration and frame-position features) for every frame 1…T; the network's input, hidden, and output layers map them, frame by frame, to statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature); parameter generation and waveform synthesis then produce SPEECH.]

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → feature extraction integrated into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform


Framework

Is this new? … no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques


Experimental setup

Database:             US English, female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate:        16 kHz
Analysis window:      25-ms width, 5-ms shift
Linguistic features:  11 categorical features, 25 numeric features
Acoustic features:    0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology:         5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture:     1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing:       Postfiltering in cepstrum domain [25]

Example of speech parameter trajectories

w/o grouping questions, numeric contexts, silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500, comparing natural speech, HMM (α=1), and DNN (4×512)]

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value    z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶     -9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶     -5.1
12.7 (1)    36.6 (4 × 1,024)       50.7      < 10⁻⁶     -11.5


Limitations of DNN-based acoustic modeling

[Figure: bimodal data samples in (y1, y2); the NN prediction falls between the modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; the network outputs w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x)]

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

• Weights → softmax activation function:
  w1(x) = exp(z1) / Σ_{m=1}^{2} exp(zm),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(zm)

• Means → linear activation function:
  µ1(x) = z3,  µ2(x) = z4

• Variances → exponential activation function:
  σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
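The three output activations above can be sketched in a few lines; this is a toy numpy illustration of how raw network outputs z1…z6 become valid mixture parameters for the 1-dim, 2-mix case.

```python
import numpy as np

def mdn_params(z):
    """Split raw outputs z = [z1..z6] of a 1-dim, 2-mixture MDN into
    mixture weights (softmax), means (linear), std devs (exponential)."""
    logits, means, log_sigma = z[:2], z[2:4], z[4:6]
    w = np.exp(logits - logits.max())   # shift for numerical stability
    w = w / w.sum()                     # softmax -> weights sum to 1
    sigma = np.exp(log_sigma)           # exponential -> strictly positive
    return w, means, sigma

z = np.array([0.2, -0.1, 1.5, -0.8, 0.0, 0.3])
w, mu, sigma = mdn_params(z)
print(w.sum())  # 1.0
```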

DMDN-based SPSS [27]

[Figure: full pipeline. TEXT → text analysis, input feature extraction, and duration prediction produce inputs x1, x2, …, xT; for each frame the DMDN outputs w1(xt), w2(xt), µ1(xt), µ2(xt), σ1(xt), σ2(xt); parameter generation and waveform synthesis produce SPEECH.]

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture:  4–7 hidden layers, 1,024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1,024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization:      AdaDec [29] (variant of AdaGrad [30]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM        1 mix    3.537 ± 0.113
           2 mix    3.397 ± 0.115
DNN        4×1024   3.635 ± 0.127
           5×1024   3.681 ± 0.109
           6×1024   3.652 ± 0.108
           7×1024   3.637 ± 0.129
DMDN       1 mix    3.654 ± 0.117
(4×1024)   2 mix    3.796 ± 0.107
           4 mix    3.766 ± 0.113
           8 mix    3.805 ± 0.113
           16 mix   3.791 ± 0.102


Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts
    (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]


Basic RNN

[Figure: RNN unrolled over time; recurrent connections carry hidden state across inputs x_{t−1}, x_t, x_{t+1} to outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block. Inputs x_t and h_{t−1} (with biases b_i, b_f, b_o, b_c) feed sigmoid input, forget, and output gates and a tanh input transform around the memory cell c_t; the input gate controls writes, the output gate controls reads, and the forget gate resets the cell, producing h_t.]
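The gating described above can be written out directly; this is a minimal, self-contained numpy sketch of one step of a standard LSTM cell (no peepholes, random toy weights), not the exact variant used in the systems discussed here.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W maps [x; h_prev] to the four stacked gate
    pre-activations (input, forget, output, cell candidate)."""
    n = h_prev.size
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0 * n:1 * n])    # input gate: write
    f = sigmoid(z[1 * n:2 * n])    # forget gate: reset
    o = sigmoid(z[2 * n:3 * n])    # output gate: read
    g = np.tanh(z[3 * n:4 * n])    # candidate cell input
    c = f * c_prev + i * g         # linear memory cell update
    h = o * np.tanh(c)             # gated hidden output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
print(h.shape)  # (4,)
```

The additive cell update `f * c_prev + i * g` is what lets gradients and information survive many time steps.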

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis, input feature extraction, and duration prediction produce x1, x2, …, xT; an LSTM maps them to y1, y2, …, yT; parameter generation and waveform synthesis produce SPEECH.]

Experimental setup

Database:            US English, female speaker
Train / dev set:     34,632 & 100 sentences
Sampling rate:       16 kHz
Analysis window:     25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features:   0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN:                 4 hidden layers, 1,024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM:                1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing:      Postfiltering in cepstrum domain [25]

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      -6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      -1.5   0.138

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)


Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted).

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.


Statistical parametric speech synthesis (SPSS) [3]

[Figure: training stage — speech analysis and text analysis extract acoustic features y and linguistic features x, used for model training of λ; synthesis stage — text analysis, parameter generation from λ, and speech synthesis turn text into speech.]

• Training
  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y):

    λ̂ = arg max_λ p(y | x, λ)

• Synthesis
  − Extract x from the text to be synthesized
  − Generate the most probable y from λ̂:

    ŷ = arg max_y p(y | x, λ̂)

  − Reconstruct speech from ŷ


Statistical parametric speech synthesis (SPSS) [3]

• Large data + automatic training
  → automatic voice building

• Parametric representation of speech
  → flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model
→ HMM-based speech synthesis system (HTS) [4]


HMM-based speech synthesis [4]

[Figure: system overview. Training part — spectral and excitation parameters are extracted from the SPEECH DATABASE and, with labels, used to train context-dependent HMMs & state duration models. Synthesis part — text analysis converts TEXT into labels; parameter generation from the HMMs yields spectral and excitation parameters; excitation generation and a synthesis filter produce the SYNTHESIZED SPEECH signal.]


Speech production process

[Figure: from text (concept) to speech as modulation of a carrier wave by speech information. The sound source (voiced: pulse, unvoiced: noise) sets the fundamental frequency (voiced/unvoiced, magnitude, start–end); the vocal tract imposes frequency transfer characteristics on the air flow, producing speech.]

Source-filter model

[Figure: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) ∗ e(n)]

x(n) = h(n) ∗ e(n)

↓ Fourier transform

X(e^{jω}) = H(e^{jω}) E(e^{jω})

E(e^{jω}): source (excitation) part; H(e^{jω}): vocal tract resonance part

H(e^{jω}) should be defined by HMM state-output vectors,
e.g., mel-cepstrum, line spectral pairs
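The relation x(n) = h(n) ∗ e(n) can be demonstrated directly; this is a toy numpy sketch with an assumed pulse-train excitation and a made-up decaying impulse response, not a real vocoder.

```python
import numpy as np

# Toy source-filter synthesis: voiced excitation convolved with a
# vocal-tract impulse response gives the speech signal.
fs, f0 = 16000, 100                      # assumed sample rate and pitch
n = np.arange(800)
e = np.zeros(n.shape, dtype=float)
e[::fs // f0] = 1.0                      # voiced source: pulse train at f0
h = 0.9 ** np.arange(50)                 # toy decaying impulse response h(n)
x = np.convolve(e, h)                    # speech x(n) = h(n) * e(n)
print(len(x))  # 849
```

Swapping the pulse train for white noise models unvoiced sounds with the same filter.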

Parametric models of speech signal

Autoregressive (AR) model:

  H(z) = K / (1 − Σ_{m=0}^{M} c(m) z^{−m})

Exponential (EX) model:

  H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

  ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]

Examples of speech spectra

[Figure: log magnitude (dB) vs. frequency (0–5 kHz) for (a) ML-based cepstral analysis and (b) linear prediction]


Structure of state-output (observation) vectors

Spectrum part: mel-cepstral coefficients c_t, ∆ mel-cepstral coefficients ∆c_t, ∆∆ mel-cepstral coefficients ∆²c_t
Excitation part: log F0 p_t, ∆ log F0 δp_t, ∆∆ log F0 δ²p_t

o_t = [c_t, ∆c_t, ∆²c_t, p_t, δp_t, δ²p_t]

Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(o_t), b2(o_t), b3(o_t). Observation sequence O = o1, o2, o3, o4, o5, …, oT; state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, ….]

Multi-stream HMM structure

[Figure: observation vector o_t split into streams — stream 1: c_t, ∆c_t, ∆²c_t (spectrum) as o_t¹; streams 2–4: p_t, δp_t, δ²p_t (excitation) as o_t², o_t³, o_t⁴ — each with its own state-output distribution b_sj(o_t^s)]

b_j(o_t) = Π_{s=1}^{S} ( b_sj(o_t^s) )^{w_s}
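The weighted per-stream product above can be sketched with scalar toy streams; function names and the use of 1-D Gaussians per stream are illustrative assumptions for clarity.

```python
import math

def stream_gauss(o, mu, var):
    """1-D Gaussian density for a single (toy, scalar) stream."""
    return math.exp(-(o - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def multi_stream_output(obs, mus, vars_, weights):
    """b_j(o_t) = prod_s b_sj(o_t^s)^{w_s}: combine per-stream
    densities with stream weights w_s."""
    p = 1.0
    for o, mu, v, w in zip(obs, mus, vars_, weights):
        p *= stream_gauss(o, mu, v) ** w
    return p

# Two streams (e.g., spectrum and excitation), equal unit weights
p = multi_stream_output([0.0, 1.0], [0.0, 0.0], [1.0, 1.0], [1.0, 1.0])
```

With all weights set to 1 this reduces to the ordinary product of stream likelihoods.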

Training process

[Flowchart: data & labels → compute variance floor (HCompV) → initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit) → reestimate CI-HMMs by the EM algorithm (HRest & HERest) → copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL) → reestimate CD-HMMs by the EM algorithm (HERest) → decision tree-based clustering (HHEd TB) → reestimate CD-HMMs by the EM algorithm (HERest) → untie parameter-tying structure (HHEd UT) → estimated HMMs. In parallel: estimate CD duration models from forward–backward stats (HERest) → decision tree-based clustering (HHEd TB) → estimated duration models.]

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Decision tree-based state clustering [7]

[Figure: binary decision tree with yes/no questions such as L=voice?, L="w"?, R=silence?, L="gy"?; leaf nodes hold the synthesized (clustered) states shared by contexts like w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n]

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ build decision trees individually

State duration models [8]

[Figure: occupancy of state i from t0 to t1 within a T=8 frame sequence]

Probability to enter state i at t0 and then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Stream-dependent tree-based clustering

[Figure: HMM with separate decision trees for mel-cepstrum and for F0, plus a decision tree for the state duration models]


Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words w:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)


Best state sequence

[Figure: the 3-state left-to-right HMM as before; observation sequence O, state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …; state durations D = 4, 10, 5]

Best state outputs w/o dynamic features

[Figure: per-state means and variances; ô becomes a step-wise mean vector sequence]

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t⊤, ∆c_t⊤]⊤,  ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form as o = Wc, where W is a sparse band matrix that stacks, for each frame t, a row block [ … 0 I 0 … ] selecting c_t and a row block [ … −I I 0 … ] computing ∆c_t = c_t − c_{t−1}.

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)   subject to   o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

WᵀΣ_q̂⁻¹W c = WᵀΣ_q̂⁻¹ μ_q̂
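The resulting linear system WᵀΣ⁻¹Wc = WᵀΣ⁻¹μ can be solved directly. A toy numpy sketch with illustrative sizes and random Gaussian statistics (not the slides' actual configuration); a production implementation would exploit the banded structure instead of forming dense matrices:

```python
import numpy as np

T = 6
rng = np.random.default_rng(0)

# W maps static features c (T,) to interleaved [c_t, delta c_t] (2T,)
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu = rng.normal(size=2 * T)           # stacked state-output means
var = rng.uniform(0.5, 2.0, 2 * T)    # stacked state-output variances
Sigma_inv = np.diag(1.0 / var)        # diagonal covariance assumed

# Solve W^T Sigma^-1 W c = W^T Sigma^-1 mu (banded, positive definite)
A = W.T @ Sigma_inv @ W
b = W.T @ Sigma_inv @ mu
c = np.linalg.solve(A, b)
print(c.round(3))
```

The solution c is the maximum-likelihood static trajectory, smoothed by the ∆ constraints rather than jumping between state means.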

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

[Figure: the linear system WᵀΣ_q̂⁻¹W c = WᵀΣ_q̂⁻¹μ_q̂ visualized — the 1/−1 band structure of W makes WᵀΣ_q̂⁻¹W a banded positive-definite matrix; c = [c₁, c₂, …, c_T] is the static feature sequence and the right-hand side stacks the state means μ_q̂₁, μ_q̂₂, …, μ_q̂T.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: generated static and dynamic speech parameter trajectories c plotted against the per-state means and variances.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: system overview. Training part — spectral and excitation parameters are extracted from the SPEECH DATABASE and used with labels to train context-dependent HMMs & state duration models. Synthesis part — TEXT is converted to labels by text analysis; spectral and excitation parameters are generated from the HMMs; excitation generation and a synthesis filter produce the SYNTHESIZED SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: the generated excitation parameters (log F0 with V/UV) drive a pulse-train / white-noise excitation e(n); the generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); the synthesized speech is x(n) = h(n) ∗ e(n).]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Adaptation (mimicking voice) [13]

[Figure: an average-voice model is trained from multiple training speakers (adaptive training) and then adapted to target speakers.]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Figure: a new HMM set λ′ is obtained from representative HMM sets λ1, λ2, λ3, λ4 with interpolation ratios I(λ′, λk).]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced).]
• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple.]
• Phase
  Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of these assumptions holds for real speech.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → Dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → Graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
  [Figure: natural vs. generated spectrograms, 0–8 kHz.]
• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (adaptation, interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Linguistic rarr acoustic mapping

• Training
  Learn the relationship between linguistic & acoustic features
• Synthesis
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree with yes/no context questions clustering the acoustic space.]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y.]

• The DNN represents the conditional distribution of y given x
• The DNN replaces the decision trees and GMMs
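The linguistic-to-acoustic mapping idea can be sketched as a tiny feed-forward regressor trained with MSE on synthetic data. All sizes, data, and hyperparameters below are illustrative placeholders, not the experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)
D_in, H, D_out, N = 20, 32, 4, 256

x = rng.normal(size=(N, D_in))          # stand-in linguistic features
true_W = rng.normal(size=(D_in, D_out))
y = np.tanh(x @ true_W)                 # synthetic acoustic targets

W1 = rng.normal(scale=0.1, size=(D_in, H))
W2 = rng.normal(scale=0.1, size=(H, D_out))
lr = 0.05

losses = []
for _ in range(200):
    h = np.tanh(x @ W1)                 # hidden layer (non-linear)
    pred = h @ W2                       # linear output layer
    err = pred - y
    losses.append(float((err ** 2).mean()))
    # backprop for the two-layer MSE network
    gW2 = h.T @ err / N
    gh = err @ W2.T * (1 - h ** 2)
    gW1 = x.T @ gh / N
    W1 -= lr * gW1
    W2 -= lr * gW2

print(losses[0], losses[-1])
```

Because the loss is MSE, this network approximates the conditional mean of y given x — exactly the unimodality limitation discussed on a later slide.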

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

[Figure: DNN-based SPSS pipeline. TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features) for frames 1…T → input layer → hidden layers → output layer → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH. Duration prediction supplies the frame-level input features.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? … no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database             US English female speaker
Training / test data 33,000 & 173 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  11 categorical features, 25 numeric features
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology         5-state, left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture     1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing       Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum trajectories over frames 0–500 for natural speech, HMM (α = 1), and DNN (4 × 512).]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)      DNN (layers × units)   Neutral   p value   z value
15.8 (16)    38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)     27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)     36.6 (4 × 1,024)       50.7      < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Limitations of DNN-based acoustic modeling

[Figure: one-to-many data samples in (y1, y2) vs. the unimodal NN prediction.]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the network's output activations z1…z6 are mapped to the mixture weights w1(x), w2(x), means μ1(x), μ2(x), and variances σ1(x), σ2(x) of a GMM over y.]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)      w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                                  μ2(x) = z4
σ1(x) = exp(z5)                             σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
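The 1-dim, 2-mix output layer above can be sketched directly. The activation values z below are arbitrary placeholders standing in for a network's outputs:

```python
import numpy as np

def mdn_params(z):
    """Map the six output activations z1..z6 to GMM parameters."""
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()   # softmax -> weights
    mu = z[2:4]                               # linear  -> means
    sigma = np.exp(z[4:6])                    # exp     -> std devs (> 0)
    return w, mu, sigma

def mdn_density(y, z):
    """Evaluate the mixture density p(y | x) defined by z."""
    w, mu, sigma = mdn_params(z)
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float((w * comp).sum())

z = np.array([0.2, -0.1, 1.0, -2.0, 0.0, 0.3])   # arbitrary activations
w, mu, sigma = mdn_params(z)
print(w.sum())                                   # weights sum to 1
print(mdn_density(0.5, z))
```

The softmax/linear/exp choices guarantee valid GMM parameters (weights on the simplex, positive variances) for any real-valued activations, which is why an MDN can be trained with plain backpropagation on the negative log-likelihood.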

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

[Figure: DMDN-based SPSS pipeline. TEXT → text analysis → input feature extraction → duration prediction yields frame-level inputs x1, x2, …, xT; for each frame xt the DMDN outputs GMM parameters w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) over y → parameter generation → waveform synthesis → SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture    4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture   4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization        AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM        1 mix     3.537 ± 0.113
           2 mix     3.397 ± 0.115
DNN        4×1024    3.635 ± 0.127
           5×1024    3.681 ± 0.109
           6×1024    3.652 ± 0.108
           7×1024    3.637 ± 0.129
DMDN       1 mix     3.654 ± 0.117
(4×1024)   2 mix     3.796 ± 0.107
           4 mix     3.766 ± 0.113
           8 mix     3.805 ± 0.113
           16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts
    (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: a recurrent network unrolled in time — inputs x_{t−1}, x_t, x_{t+1} map to outputs y_{t−1}, y_t, y_{t+1} through hidden states linked by recurrent connections.]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through the recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — a memory cell c_t with tanh input and output non-linearities, surrounded by sigmoid input, forget, and output gates, each fed by x_t, h_{t−1}, and a bias (b_i, b_f, b_c, b_o). Input gate: write; output gate: read; forget gate: reset.]
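The gating structure above can be sketched as a single forward step in numpy. This is a minimal illustration with random placeholder weights and no peephole connections, not the slides' trained configuration:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b stack the four blocks: i, f, o, g."""
    n = h_prev.size
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[:n])              # input gate  (write)
    f = sigmoid(z[n:2 * n])         # forget gate (reset)
    o = sigmoid(z[2 * n:3 * n])     # output gate (read)
    g = np.tanh(z[3 * n:])          # candidate cell input
    c = f * c_prev + i * g          # linear memory cell update
    h = o * np.tanh(c)              # block output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 5, 3
W = rng.normal(size=(4 * n_hid, n_in))
U = rng.normal(size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)

h = np.zeros(n_hid)
c = np.zeros(n_hid)
for t in range(4):                  # run over a short random sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
print(h, c)
```

Because the cell update c = f·c_prev + i·g is additive rather than multiplicative through a squashing non-linearity, gradients can flow across many frames — the property that makes long-range context usable.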

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

[Figure: LSTM-based SPSS pipeline. TEXT → text analysis → input feature extraction → duration prediction yields frame-level inputs x1, x2, …, xT → a recurrent network maps them to acoustic frames y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database             US English female speaker
Train / dev set data 34,632 & 100 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  DNN: 449; LSTM: 289
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                  4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM                 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing       Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN               LSTM              Stats
w/ ∆    w/o ∆     w/ ∆    w/o ∆     Neutral   z      p
50.0    14.2      –       –         35.8      12.0   < 10⁻¹⁰
–       –         30.2    15.6      54.2      5.1    < 10⁻⁶
15.8    –         34.0    –         50.2      −6.2   < 10⁻⁹
–       28.4      –       33.6      38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural networks trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Statistical parametric speech synthesis (SPSS) [3]

[Figure: SPSS pipeline. Training — speech analysis extracts acoustic features y and text analysis extracts linguistic features x; model training produces λ. Synthesis — text analysis extracts x, parameter generation produces y, and speech synthesis reconstructs the waveform.]

• Training
  − Extract linguistic features x & acoustic features y
  − Train acoustic model λ given (x, y):
      λ̂ = arg max_λ p(y | x, λ)
• Synthesis
  − Extract x from the text to be synthesized
  − Generate the most probable y from λ̂:
      ŷ = arg max_y p(y | x, λ̂)
  − Reconstruct speech from ŷ

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 5 of 79

Statistical parametric speech synthesis (SPSS) [3]

[Figure: same SPSS pipeline as on the previous slide.]

• Large data + automatic training
  → automatic voice building
• Parametric representation of speech
  → flexible to change its voice characteristics

Hidden Markov model (HMM) as the acoustic model
→ HMM-based speech synthesis system (HTS) [4]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

HMM-based speech synthesis [4]

[Figure: system overview. Training part — spectral and excitation parameters are extracted from the SPEECH DATABASE and used with labels to train context-dependent HMMs & state duration models. Synthesis part — TEXT is converted to labels by text analysis; spectral and excitation parameters are generated from the HMMs; excitation generation and a synthesis filter produce the SYNTHESIZED SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79


Speech production process

[Figure: speech production — text (concept) is conveyed by modulation of a carrier wave by speech information: the sound source (voiced: pulse; unvoiced: noise; driven by air flow) determines the fundamental frequency and voiced/unvoiced switching, while the vocal tract imposes the frequency transfer characteristics on the resulting speech.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79

Source-filter model

[Diagram: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) ∗ e(n).]

x(n) = h(n) ∗ e(n)
  ↓ Fourier transform
X(e^{jω}) = H(e^{jω}) E(e^{jω})

Source (excitation) part: E(e^{jω}). Vocal tract resonance part: H(e^{jω}) — should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
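A toy numpy illustration of the source-filter model above — a pulse-train excitation convolved with a stand-in impulse response. All values here (pitch, filter shape, lengths) are hypothetical, for illustration only:

```python
import numpy as np

# Hypothetical toy parameters: 100 Hz pitch at 16 kHz sampling,
# and a simple decaying impulse response standing in for the vocal tract.
fs, f0, n = 16000, 100, 1600

# Excitation e(n): pulse train for voiced speech
# (white noise would model unvoiced speech instead).
e = np.zeros(n)
e[:: fs // f0] = 1.0

# Linear time-invariant system h(n): toy vocal-tract impulse response.
h = 0.9 ** np.arange(50)

# Speech x(n) = h(n) * e(n): convolution in time,
# i.e., multiplication X = H·E in the frequency domain.
x = np.convolve(e, h)[:n]
```

Swapping the pulse train for white noise and h for a filter derived from spectral parameters is exactly what the synthesis part of the system does.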

Parametric models of speech signal

Autoregressive (AR) model:
  H(z) = K / (1 − Σ_{m=1}^{M} c(m) z^{−m})

Exponential (EX) model:
  H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:
  ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
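As a sketch of the AR route above: linear predictive analysis reduces to solving the Yule-Walker equations built from the autocorrelation of x. This is a minimal autocorrelation-method implementation; the name `lpc` and the toy setup are illustrative, not from the talk:

```python
import numpy as np

def lpc(x, M):
    """Estimate AR coefficients c(1..M) by linear predictive analysis
    (autocorrelation method: solve the Yule-Walker equations)."""
    # Autocorrelation r[0..M]
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(M + 1)])
    # Toeplitz system R c = r[1..M]
    R = np.array([[r[abs(i - j)] for j in range(M)] for i in range(M)])
    c = np.linalg.solve(R, r[1 : M + 1])
    # Residual (prediction error) energy
    gain = r[0] - np.dot(c, r[1 : M + 1])
    return c, gain
```

On a long AR(1) sequence with coefficient 0.8, the first estimated coefficient comes out close to 0.8.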

Examples of speech spectra

[Figure: log magnitude spectra (dB, −20 to 80) over 0–5 kHz. (a) ML-based cepstral analysis, (b) linear prediction.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79

HMM-based speech synthesis [4]

[Diagram: Training part — SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → spectral/excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral/excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79

Structure of state-output (observation) vectors

o_t = [c_tᵀ, ∆c_tᵀ, ∆²c_tᵀ, p_t, δp_t, δ²p_t]ᵀ

Spectrum part:
  mel-cepstral coefficients c_t
  ∆ mel-cepstral coefficients ∆c_t
  ∆∆ mel-cepstral coefficients ∆²c_t

Excitation part:
  log F0 p_t
  ∆ log F0 δp_t
  ∆∆ log F0 δ²p_t

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
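A toy numpy sketch of assembling such an observation vector from a static feature sequence, using simple first/second differences (real systems compute deltas with regression windows; the function name is illustrative):

```python
import numpy as np

def add_deltas(c):
    """Append difference-based dynamic features to a static sequence.
    c: (T, M) static features -> (T, 3M) observations [c, Δc, Δ²c]."""
    d1 = np.diff(c, axis=0, prepend=c[:1])    # Δc_t = c_t − c_{t−1}
    d2 = np.diff(d1, axis=0, prepend=d1[:1])  # Δ²c_t = Δc_t − Δc_{t−1}
    return np.concatenate([c, d1, d2], axis=1)
```

The excitation part (log F0 and its deltas) is stacked the same way, with voiced/unvoiced handling on top.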

Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π_1, transition probabilities a_11, a_12, a_22, a_23, a_33, and state-output distributions b_1(o_t), b_2(o_t), b_3(o_t). An observation sequence O = o_1 o_2 … o_T is aligned to a state sequence Q, e.g., 1 1 1 1 2 2 3 3.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

[Figure: the state-output vector o_t is split into streams — spectrum stream o_t¹ = [c_t, ∆c_t, ∆²c_t]; excitation streams o_t², o_t³, o_t⁴ = p_t, δp_t, δ²p_t — each with its own output distribution b_sj(o_tˢ).]

b_j(o_t) = Π_{s=1}^{S} (b_{sj}(o_tˢ))^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
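A minimal numpy sketch of the stream-weighted output probability above, assuming diagonal-Gaussian stream distributions (the function names are illustrative):

```python
import numpy as np

def gaussian(o, mu, var):
    """Diagonal-Gaussian density of vector o."""
    return np.exp(-0.5 * ((o - mu) ** 2 / var + np.log(2 * np.pi * var))).prod()

def multi_stream_output_prob(streams, params, weights):
    """b_j(o_t) = prod_s b_sj(o_t^s)^{w_s}.
    streams: per-stream observation vectors; params: (mu, var) per stream;
    weights: stream weights w_s."""
    p = 1.0
    for o_s, (mu, var), w in zip(streams, params, weights):
        p *= gaussian(np.asarray(o_s), np.asarray(mu), np.asarray(var)) ** w
    return p
```

With a single stream and weight 1 this reduces to the ordinary Gaussian output probability.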

Training process

1. Compute variance floor (HCompV)
2. Initialize CI-HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest) — monophone (context-independent, CI)
4. Copy CI-HMMs to CD-HMMs (HHEd CL) — fullcontext (context-dependent, CD)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
9. Estimate CD-dur models from FB stats (HERest)
10. Decision tree-based clustering (HHEd TB)

→ Estimated HMMs & estimated duration models (from data & labels)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• {preceding, current, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Figure: decision tree over context questions (e.g., L=voice?, L=w?, R=silence?, L=gy?) with yes/no branches; leaf nodes tie states shared across context-dependent models such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n; synthesized states are drawn from the leaf nodes.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0.]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: state i occupied from t_0 to t_1 on a sequence of T = 8 frames.]

Probability to enter state i at t_0, then leave at t_1 + 1:

χ_{t_0,t_1}(i) ∝ Σ_{j≠i} α_{t_0−1}(j) a_{ji} a_{ii}^{t_1−t_0} Π_{t=t_0}^{t_1} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t_1+1}) β_{t_1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for the state duration models.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Diagram: Training part — SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → spectral/excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral/excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: the same 3-state left-to-right HMM (π_1, a_11, a_12, a_22, a_23, a_33, b_1(o_t), b_2(o_t), b_3(o_t)); the observation sequence O = o_1 … o_T is aligned to the best state sequence Q = 1 1 1 1 2 2 3 3 with state durations D (e.g., 4, 10, 5).]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: per-state means and variances.]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_tᵀ, ∆c_tᵀ]ᵀ,  ∆c_t = c_t − c_{t−1}

(static part c_t: M-dimensional; dynamic part ∆c_t: M-dimensional; o_t: 2M-dimensional)

The relationship between static and dynamic features can be arranged as o = Wc:

  ⋮            ⋮                     ⋮
  c_{t−1}      ⋯  0   I   0   0 ⋯    c_{t−2}
  ∆c_{t−1}     ⋯ −I   I   0   0 ⋯    c_{t−1}
  c_t        = ⋯  0   0   I   0 ⋯    c_t
  ∆c_t         ⋯  0  −I   I   0 ⋯    c_{t+1}
  c_{t+1}      ⋯  0   0   0   I ⋯    ⋮
  ∆c_{t+1}     ⋯  0   0  −I   I ⋯
  ⋮            ⋮
  (o)                (W)             (c)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian:

p(o | q, λ) = N(o; μ_q, Σ_q)

By setting ∂ log N(Wc; μ_q, Σ_q) / ∂c = 0:

WᵀΣ_q⁻¹W c = WᵀΣ_q⁻¹μ_q

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
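The normal equations above can be solved directly. A minimal numpy sketch for 1-dimensional features with static + delta windows and diagonal covariance — a dense solve for clarity; real implementations exploit the band structure of WᵀΣ⁻¹W:

```python
import numpy as np

def make_W(T):
    """Band matrix W with o = W c, where o interleaves static and delta rows:
    o = [c_1, Δc_1, ..., c_T, Δc_T] and Δc_t = c_t − c_{t−1} (Δc_1 = c_1)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row: c_t
        W[2 * t + 1, t] = 1.0           # delta row: c_t ...
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0  # ... minus c_{t−1}
    return W

def mlpg(mu, var):
    """Solve Wᵀ Σ⁻¹ W c = Wᵀ Σ⁻¹ μ for the smooth static trajectory c.
    mu, var: interleaved [static, delta] means/variances, length 2T."""
    T = len(mu) // 2
    W = make_W(T)
    P = np.diag(1.0 / np.asarray(var))  # Σ⁻¹ for diagonal covariance
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ np.asarray(mu))
```

Feeding in step-wise static means with zero delta means produces a smoothed, non-step-wise trajectory — the effect shown on the next slides.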


Speech parameter generation algorithm [9]

[Figure: the linear system WᵀΣ_q⁻¹W c = WᵀΣ_q⁻¹μ_q written out — W is a band matrix of 1/−1 entries implementing the static and delta features, Σ_q⁻¹ is block-diagonal over states, and solving the system yields the static trajectory c_1 … c_T from the state means μ_q1 … μ_qT.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: generated speech parameter trajectory c — static and dynamic components with per-state means and variances; the generated trajectory is smooth rather than step-wise.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Diagram: Training part — SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → spectral/excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral/excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Diagram: generated excitation parameters (log F0 with V/UV) drive the pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n).]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Diagram: training speakers → adaptive training → average-voice model; adaptation → target speakers.]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Diagram: HMM sets λ_1 … λ_4 interpolated into λ′ with interpolation ratios I(λ′, λ_1) … I(λ′, λ_4).]

λ: HMM set; I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced).]

• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz.]

• Phase
  Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
  [Figure: natural vs. generated spectrograms over 0–8 kHz.]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training
  Learn the relationship between linguistic & acoustic features

• Synthesis
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress, # of words in a phrase
  − Around 50 different types, much more than in ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees over yes/no context questions partition the acoustic space.]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network — linguistic features x at the input layer, hidden layers h_1, h_2, h_3, acoustic features y at the output layer.]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

[Diagram: TEXT → text analysis → input feature extraction → input features including binary & numeric features at frames 1 … T (binary features, numeric features, duration feature, frame position feature; duration prediction supplies frame counts) → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
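A minimal numpy sketch of the frame-by-frame mapping in the framework above — a feed-forward network from linguistic input features to acoustic output statistics. The layer sizes, initialization, and activation choices here are toy assumptions, not the experimental configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

# Toy sizes: 300 linguistic features in, 127 acoustic statistics out,
# three hidden layers of 64 units (the talk's systems use 256-2048 units).
sizes = [300, 64, 64, 64, 127]
params = [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    """Map a batch of frame-level linguistic features to acoustic features."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)   # non-linear hidden layers
    W, b = params[-1]
    return h @ W + b          # linear output layer (regression)

y = forward(rng.standard_normal((10, 300)))  # 10 frames
```

Training would minimize the squared error between y and the target acoustic features; the predicted means then feed the usual parameter generation step.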

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4×512).]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)      DNN (layers × units)   Neutral   p value    z value
15.8 (16)    38.5 (4 × 256)         45.7      < 10⁻⁶     −9.9
16.1 (4)     27.2 (4 × 512)         56.8      < 10⁻⁶     −5.1
12.7 (1)     36.6 (4 × 1024)        50.7      < 10⁻⁶     −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples vs. NN prediction — for a one-to-many mapping, the MSE-trained NN predicts the conditional mean, which falls between the modes of the data.]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix mixture density network — the last hidden layer h feeds six linear units z_1 … z_6, which parameterize a two-component GMM over y: weights w_1(x), w_2(x), means μ_1(x), μ_2(x), standard deviations σ_1(x), σ_2(x).]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m),   w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
μ_1(x) = z_3,   μ_2(x) = z_4
σ_1(x) = exp(z_5),   σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
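A minimal numpy sketch of the 1-dim, 2-mix output layer described above, plus the negative log-likelihood that would be minimized in training (function names are illustrative):

```python
import numpy as np

def mdn_outputs(z):
    """Map the 6 linear outputs of a 1-dim, 2-mix MDN to GMM parameters.
    z[0:2] -> weights (softmax), z[2:4] -> means (linear),
    z[4:6] -> standard deviations (exponential)."""
    e = np.exp(z[0:2] - z[0:2].max())
    w = e / e.sum()             # softmax: weights sum to 1
    mu = z[2:4]                 # linear activation
    sigma = np.exp(z[4:6])      # exponential: strictly positive
    return w, mu, sigma

def mdn_nll(y, w, mu, sigma):
    """Negative log-likelihood of scalar y under the predicted mixture."""
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())
```

Because the network outputs full mixture parameters per frame, the parameter generation algorithm can use the predicted variances instead of global ones.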

DMDN-based SPSS [27]

[Diagram: DMDN-based SPSS — TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x_1 … x_T → DMDN outputs mixture weights w_k(x_t), means μ_k(x_t), and variances σ_k(x_t) for each frame → parameter generation → waveform synthesis → SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix    3.537 ± 0.113
               2 mix    3.397 ± 0.115
DNN            4×1024   3.635 ± 0.127
               5×1024   3.681 ± 0.109
               6×1024   3.652 ± 0.108
               7×1024   3.637 ± 0.129
DMDN (4×1024)  1 mix    3.654 ± 0.117
               2 mix    3.796 ± 0.107
               4 mix    3.766 ± 0.113
               8 mix    3.805 ± 0.113
               16 mix   3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of {preceding, succeeding} contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1}, outputs y_{t−1}, y_t, y_{t+1}, with recurrent connections carrying the hidden state forward in time.]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
    → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — linear memory cell c_t; input gate i_t (write), output gate (read), and forget gate (reset), each a sigmoid of (x_t, h_{t−1}, bias); tanh non-linearities on the cell input and output produce h_t.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
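A minimal numpy sketch of one step of the gating equations above — a single LSTM layer with no peepholes or projection; the packing of the four transforms into one weight matrix is an assumption of this sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W: (input+hidden, 4*hidden) packing the input-gate,
    forget-gate, cell-input, and output-gate transforms; b: (4*hidden,)."""
    n = h_prev.size
    z = np.concatenate([x, h_prev]) @ W + b
    i = sigmoid(z[:n])          # input gate: write
    f = sigmoid(z[n:2 * n])     # forget gate: reset
    g = np.tanh(z[2 * n:3 * n]) # candidate cell input
    o = sigmoid(z[3 * n:])      # output gate: read
    c = f * c_prev + i * g      # linear memory cell update
    h = o * np.tanh(c)          # block output
    return h, c
```

Because the cell update is linear in c_prev, gradients survive over many steps — the property that lets the network use long-range context.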

LSTM-based SPSS [33 34]

[Diagram: LSTM-based SPSS — TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x_1, x_2, …, x_T → LSTM → outputs y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral     z        p
  50.0       14.2          –           –          35.8     12.0    < 10⁻¹⁰
   –          –           30.2        15.6        54.2      5.1    < 10⁻⁶
  15.8        –           34.0         –          50.2     −6.2    < 10⁻⁹
  28.4        –            –          33.6        38.0     −1.5     0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (Submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Statistical parametric speech synthesis (SPSS) [3]

[Diagram: training part — speech analysis (acoustic features y) and text analysis (linguistic features x) feed model training (λ); synthesis part — text analysis (x) → parameter generation (y) → speech synthesis]

• Large data + automatic training
  → Automatic voice building

• Parametric representation of speech
  → Flexible to change its voice characteristics

Hidden Markov model (HMM) as its acoustic model
→ HMM-based speech synthesis system (HTS) [4]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 6 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

HMM-based speech synthesis [4]

[Diagram: training part — SPEECH DATABASE → spectral & excitation parameter extraction, text analysis → labels → training context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79

HMM-based speech synthesis [4]

[Same training/synthesis diagram as above.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 9 of 79

Speech production process

[Diagram: speech production as modulation of a carrier wave by speech information — text (concept) determines fundamental frequency (magnitude, start–end), voiced/unvoiced switching, and frequency transfer characteristics; air flow through the sound source (voiced: pulse, unvoiced: noise) → speech]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79

Source-filter model

[Diagram: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n)]

x(n) = h(n) ∗ e(n)

  ↓ Fourier transform

X(e^{jω}) = H(e^{jω}) E(e^{jω})

E(e^{jω}): source excitation part; H(e^{jω}): vocal tract resonance part

H(e^{jω}) should be defined by HMM state-output vectors,
e.g. mel-cepstrum, line spectral pairs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
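The source-filter relation x(n) = h(n) ∗ e(n) can be demonstrated with toy signals. A minimal sketch, not the lecture's vocoder: the 500 Hz resonance and its decay constant are arbitrary illustrative values standing in for a vocal-tract impulse response.

```python
import numpy as np

# Toy source-filter synthesis: x(n) = h(n) * e(n).  The excitation is a
# pulse train at F0 = 100 Hz; h(n) is a made-up decaying 500 Hz resonance
# standing in for the vocal-tract filter (not the lecture's vocoder).
fs = 16000                         # sampling rate (Hz)
f0 = 100                           # fundamental frequency (Hz)
N = fs // 10                       # 100 ms of samples

e = np.zeros(N)                    # voiced excitation: pulse train
e[:: fs // f0] = 1.0

t = np.arange(200) / fs            # 12.5 ms impulse response
h = np.exp(-300.0 * t) * np.sin(2 * np.pi * 500.0 * t)

x = np.convolve(e, h)[:N]          # x(n) = h(n) * e(n)
```

Swapping the pulse train for white noise gives the unvoiced branch of the diagram.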

Parametric models of speech signal

Autoregressive (AR) model:

H(z) = K / (1 − Σ_{m=0}^{M} c(m) z^{−m})

Exponential (EX) model:

H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]

• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
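For the AR model, linear predictive analysis reduces (under the autocorrelation method) to solving normal equations built from the signal's autocorrelation. A minimal sketch; the AR(1) test signal (true coefficient 0.9) and model order are illustrative assumptions, not the lecture's setup.

```python
import numpy as np

# Autocorrelation-method LPC sketch: solve the normal equations R a = r
# for the AR predictor coefficients.
def lpc(x, order):
    r = np.array([np.dot(x[: len(x) - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1 : order + 1])

rng = np.random.default_rng(0)
x = np.zeros(10000)
for n in range(1, len(x)):          # AR(1) process with coefficient 0.9
    x[n] = 0.9 * x[n - 1] + rng.standard_normal()

a = lpc(x, 1)                       # a[0] should be close to 0.9
```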

Examples of speech spectra

[Figure: log magnitude spectra (dB) over 0–5 kHz; (a) ML-based cepstral analysis, (b) linear prediction]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79

HMM-based speech synthesis [4]

[Same training/synthesis diagram as above.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79

Structure of state-output (observation) vectors

[Diagram: structure of the state-output (observation) vector o_t]

Spectrum part: mel-cepstral coefficients c_t, Δ mel-cepstral coefficients Δc_t, ΔΔ mel-cepstral coefficients Δ²c_t

Excitation part: log F0 p_t, Δ log F0 δp_t, ΔΔ log F0 δ²p_t

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

[Diagram: 3-state left-to-right HMM — initial probability π1, transitions a11, a12, a22, a23, a33, state-output distributions b1(o_t), b2(o_t), b3(o_t); observation sequence O = o1, o2, …, oT aligned to state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

[Diagram: multi-stream HMM — o_t split into stream 1 (spectrum: c_t, Δc_t, Δ²c_t) and streams 2–4 (excitation: p_t, δp_t, δ²p_t)]

b_j(o_t) = Π_{s=1}^{S} ( b_{sj}(o_t^s) )^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
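The weighted product b_j(o_t) = Π_s (b_sj(o_t^s))^{w_s} can be sketched with per-stream diagonal Gaussians. The stream contents, Gaussian parameters and stream weights below are toy values, not a trained model.

```python
import numpy as np

# Multi-stream output probability sketch:
# b_j(o_t) = prod_s [ b_sj(o_t^s) ]^(w_s)
def gauss(o, mu, var):
    """Diagonal Gaussian density evaluated at vector o."""
    return float(np.prod(np.exp(-0.5 * (o - mu) ** 2 / var)
                         / np.sqrt(2.0 * np.pi * var)))

streams = [np.array([0.1, -0.2]),   # e.g. spectrum stream
           np.array([0.0])]         # e.g. log F0 stream
mus = [np.zeros(2), np.zeros(1)]    # per-stream means
vs = [np.ones(2), np.ones(1)]       # per-stream variances
ws = [1.0, 1.0]                     # stream weights w_s

b = 1.0
for o, m, v, w in zip(streams, mus, vs, ws):
    b *= gauss(o, m, v) ** w        # weighted product over streams
```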

Training process

[Flowchart: speech data & labels →]

• Compute variance floor (HCompV)
• Initialize context-independent (CI) monophone HMMs by segmental k-means (HInit)
• Reestimate CI-HMMs by the EM algorithm (HRest & HERest)
• Copy CI-HMMs to context-dependent (CD) fullcontext HMMs (HHEd CL)
• Reestimate CD-HMMs by the EM algorithm (HERest)
• Decision tree-based clustering (HHEd TB)
• Reestimate CD-HMMs by the EM algorithm (HERest)
• Untie parameter tying structure (HHEd UT)
→ Estimated HMMs

• Estimate CD duration models from forward-backward statistics (HERest)
• Decision tree-based clustering (HHEd TB)
→ Estimated duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• Preceding/succeeding two phonemes

• Position of current phoneme in current syllable

• # of phonemes in preceding/current/succeeding syllable

• Accent/stress of preceding/current/succeeding syllable

• Position of current syllable in current word

• # of preceding/succeeding stressed/accented syllables in phrase

• # of syllables from previous/to next stressed/accented syllable

• Guess at part of speech of preceding/current/succeeding word

• # of syllables in preceding/current/succeeding word

• Position of current word in current phrase

• # of preceding/succeeding content words in current phrase

• # of words from previous/to next content word

• # of syllables in preceding/current/succeeding phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Diagram: decision tree with yes/no questions such as L=voice?, L=w?, R=silence?, L=gy?; leaf nodes tie (synthesized) states shared among context-dependent models such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

[Diagram: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Diagram: occupancy of state i from frame t0 to t1 over T = 8 frames]

Probability to enter state i at t0, then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} a_{ii}^{t1−t0} Π_{t=t0}^{t1} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Diagram: HMM with decision trees for mel-cepstrum, decision trees for F0, and a separate decision tree for the state duration models]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Same training/synthesis diagram as above.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM λ and words w:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)

ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Diagram: the same 3-state left-to-right HMM; state durations D = 4, 10, 5 expand into the state sequence Q = 1, 1, 1, 1, 2, 2, …, 3, 3 for observations O = o1, …, oT]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
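Expanding predicted state durations into a frame-level state sequence, as in the D = 4, 10, 5 example above, is a one-liner (a sketch; the duration values are taken from the slide's example):

```python
# Expand per-state durations into a frame-level state sequence
# (durations taken from the slide's example: D = 4, 10, 5).
durations = [4, 10, 5]
q = [state for state, d in enumerate(durations, start=1) for _ in range(d)]
# 4 frames of state 1, 10 of state 2, 5 of state 3 -> T = 19 frames
```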

Best state outputs w/o dynamic features

[Figure: per-state means and variances]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^⊤, Δc_t^⊤]^⊤,  Δc_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged as o = Wc: for every frame t, W contains a static row block (… 0 I 0 …) that selects c_t and a dynamic row block (… −I I 0 …) that computes c_t − c_{t−1}.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79

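The generation step solves the linear system W^⊤Σ^{−1}Wc = W^⊤Σ^{−1}μ. A minimal 1-dimensional sketch with a single delta window Δc_t = c_t − c_{t−1}; the means and variances are toy values, and a real system would use the trained model's statistics and a banded solver.

```python
import numpy as np

# 1-dim MLPG sketch: o = W c stacks static rows (c_t) and delta rows
# (c_t - c_{t-1}); solve  W^T P W c = W^T P mu  with P = Sigma^{-1}.
T = 5
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0              # static row selects c_t
    W[2 * t + 1, t] = 1.0          # delta row: c_t - c_{t-1}
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu = np.zeros(2 * T)
mu[0::2] = [0.0, 0.0, 1.0, 1.0, 1.0]   # step-wise static means
var = np.full(2 * T, 0.1)              # toy diagonal variances
P = np.diag(1.0 / var)                 # Sigma^{-1}

c = np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)
# The 0 -> 1 step in the static means is smoothed out in c.
```

With equal variances the solution trades closeness to the step-wise static means against small deltas, giving a smooth, monotone trajectory instead of a step.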

Speech parameter generation algorithm [9]

[Figure: band structure of the linear system W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} μ_q — W^⊤ Σ_q^{−1} W is a banded matrix, so the static trajectory c = [c_1, …, c_T] can be solved for efficiently from the means μ_{q1}, …, μ_{qT}]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: generated static and dynamic speech parameter trajectories with per-frame means and variances; the generated static trajectory c is smooth]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Same training/synthesis diagram as above.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Diagram: generated excitation parameters (log F0 with V/UV) drive the pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n); x(n) = h(n) ∗ e(n) gives the synthesized speech]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

  − Flexibility to change voice characteristics
    (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback

  − Quality

• Major factors for quality degradation [3]

  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Diagram: training speakers → adaptive training → average-voice model → adaptation → target speakers]

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Diagram: new voice λ′ placed among representative HMM sets λ1 … λ4 with interpolation ratios I(λ′, λ1) … I(λ′, λ4); λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
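At its simplest, interpolating HMM sets combines the Gaussian mean vectors of representative models with the ratios I(λ′, λ_k). A toy sketch: the two "speaker" mean vectors and the ratios below are made-up values.

```python
import numpy as np

# Interpolate the mean vectors of representative HMM sets:
# mu' = sum_k I(lambda', lambda_k) * mu_k
mu = np.array([[1.0, 2.0, 3.0],    # means from model lambda_1
               [3.0, 0.0, 1.0]])   # means from model lambda_2
ratios = np.array([0.25, 0.75])    # I(lambda', lambda_k), sums to 1

mu_interp = ratios @ mu            # -> [2.5, 0.5, 1.5]
```

Eigenvoice, CAT and multiple-regression approaches estimate the representative models (and hence the interpolation basis) from data rather than hand-picking them.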

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation:
  Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)
  [Diagram: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction:
  Harmonic effects often cause problems
  [Figure: power spectrum (dB), 0–8 kHz]

• Phase:
  Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic / stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
  Statistics do not vary within an HMM state

• Conditional independence assumption:
  State output probability depends only on the current state

• Weak duration modeling:
  State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
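The weak duration modeling is easy to verify numerically: with self-transition probability a_ii, the implicit state-duration distribution of an HMM is geometric, P(d) = a_ii^(d−1)(1 − a_ii), so the most probable duration is always a single frame (a_ii = 0.8 below is an arbitrary illustrative value):

```python
import numpy as np

# Implicit HMM state-duration distribution for self-transition a_ii:
# P(d) = a_ii^(d-1) * (1 - a_ii)  -- geometric, decaying with d.
a_ii = 0.8
d = np.arange(1, 11)                   # durations 1..10 frames
p = a_ii ** (d - 1) * (1.0 - a_ii)     # strictly decreasing in d
```

Real phone durations peak at several frames, which is why HSMMs replace this with an explicit duration model.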

Better acoustic modeling

• Piece-wise constant statistics → dynamical model

  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model

  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model

  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectrograms, 0–8 kHz]

• Why?

  − Details of spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering

  − Mel-cepstrum
  − LSP

• Nonparametric approach

  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics

  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

  − Flexibility to change voice characteristics
    (adaptation, interpolation / eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback

  − Quality

• Major factors for quality degradation [3]

  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training:
  Learn the relationship between linguistic & acoustic features

• Synthesis:
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS

  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Diagram: decision tree over the acoustic space, with yes/no questions leading to clustered states]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Diagram: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
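The forward pass of such a DNN acoustic model is just a stack of affine transforms and nonlinearities. A minimal sketch with random stand-in weights; the layer sizes and the 127-dim output are illustrative assumptions, not the paper's exact configuration, and a real model is trained by backpropagation (MSE loss).

```python
import numpy as np

# Sketch of the DNN acoustic model's forward pass: linguistic features x
# -> sigmoid hidden layers -> linear output (acoustic feature statistics).
rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

dim_in, dim_h, dim_out = 300, 512, 127   # illustrative sizes
W1 = rng.normal(0.0, 0.05, (dim_in, dim_h))
W2 = rng.normal(0.0, 0.05, (dim_h, dim_h))
W3 = rng.normal(0.0, 0.05, (dim_h, dim_out))

x = rng.normal(size=(1, dim_in))         # one frame of linguistic features
h1 = sigmoid(x @ W1)                     # hidden layer 1
h2 = sigmoid(h1 @ W2)                    # hidden layer 2
y = h2 @ W3                              # predicted acoustic features
```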

Framework

[Diagram: TEXT → text analysis → input feature extraction → input features (binary & numeric linguistic features, plus duration and frame-position features) at frames 1 … T → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; a duration prediction module supplies the frame-level alignment]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → Integrates feature extraction into acoustic modeling

• Distributed representation

  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production

  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? — No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English, female speaker

Training / test data: 33,000 & 173 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: 11 categorical features, 25 numeric features

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²

HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]

DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: one-to-many data samples in the (y1, y2) plane vs. the NN prediction, which passes through the conditional mean]

• Unimodality

  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean

• Lack of variance

  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Diagram: 1-dim, 2-mixture MDN — the network's output units z1 … z6 parameterize a GMM over y with weights w1(x), w2(x), means μ1(x), μ2(x) and variances σ1(x), σ2(x)]

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights → softmax activation function:
  w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m),  w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)

Means → linear activation function:
  μ_1(x) = z_3,  μ_2(x) = z_4

Variances → exponential activation function:
  σ_1(x) = exp(z_5),  σ_2(x) = exp(z_6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
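The output-layer transforms on the slide (softmax for weights, linear for means, exponential for variances) can be sketched as follows. This is a minimal 1-dim, 2-mix illustration with made-up raw outputs z1…z6; the layout of z and the helper names are assumptions, not the paper's code.

```python
import numpy as np

def mdn_params(z, n_mix=2):
    """Split raw network outputs z into GMM weights, means, and std devs
    for a 1-dim, n_mix-component mixture density output layer.
    Assumed layout: [weight logits | means | log std devs]."""
    logits, means, raw_sigma = z[:n_mix], z[n_mix:2*n_mix], z[2*n_mix:]
    w = np.exp(logits - logits.max())
    w /= w.sum()                      # softmax -> weights sum to 1
    sigma = np.exp(raw_sigma)         # exponential -> always positive
    return w, means, sigma

def mdn_density(y, w, mu, sigma):
    """p(y | x) as a mixture of 1-dim Gaussians."""
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return float(np.dot(w, comp))

z = np.array([0.2, -0.4, 1.0, 3.0, 0.1, 0.3])   # z1..z6, illustrative values
w, mu, sigma = mdn_params(z)
p = mdn_density(2.0, w, mu, sigma)
```

The exponential on the variance outputs is what keeps σ strictly positive without any constraint on the raw network outputs.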

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → DMDN outputs w_m(x_t), μ_m(x_t), σ_m(x_t) for each frame x_1 … x_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

  DNN: 4–7 hidden layers, 1024 units/hidden layer;
       architecture: ReLU (hidden), linear (output)

  DMDN: 4 hidden layers, 1024 units/hidden layer;
        architecture: ReLU [28] (hidden), mixture density (output), 1–16 mix

  Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Model           Configuration   MOS
HMM             1 mix           3.537 ± 0.113
                2 mix           3.397 ± 0.115
DNN             4×1024          3.635 ± 0.127
                5×1024          3.681 ± 0.109
                6×1024          3.652 ± 0.108
                7×1024          3.637 ± 0.129
DMDN (4×1024)   1 mix           3.654 ± 0.117
                2 mix           3.796 ± 0.107
                4 mix           3.766 ± 0.113
                8 mix           3.805 ± 0.113
                16 mix          3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts
    (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled over time; inputs x_{t−1}, x_t, x_{t+1}, outputs y_{t−1}, y_t, y_{t+1}, recurrent connections in the hidden layer]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block with memory cell c_t; input gate i_t (write), output gate (read), forget gate (reset); gate inputs x_t, h_{t−1} with biases b_i, b_c, b_o, b_f; sigmoid gate activations and tanh cell input/output activations producing h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
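The gate structure on the slide can be sketched as one numpy time step: a linear memory cell updated through write (input), reset (forget), and read (output) gates. The single stacked weight matrix and the random toy dimensions are assumptions for illustration, not the deck's implementation.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell (no peepholes).
    W maps [x; h_prev] to the four stacked gate pre-activations."""
    z = W @ np.concatenate([x, h_prev]) + b
    n = len(c_prev)
    i = sigm(z[0*n:1*n])          # input gate: write
    f = sigm(z[1*n:2*n])          # forget gate: reset
    o = sigm(z[2*n:3*n])          # output gate: read
    g = np.tanh(z[3*n:4*n])       # candidate cell input
    c = f * c_prev + i * g        # linear memory cell update
    h = o * np.tanh(c)            # gated output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4                # toy sizes, illustrative only
W = rng.standard_normal((4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h = np.zeros(n_hid); c = np.zeros(n_hid)
for t in range(5):                # run over a short input sequence
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
```

Because the cell update is linear in c_prev (scaled by the forget gate), information can survive many steps instead of decaying through a squashing nonlinearity, which is the point made against the basic RNN on the previous slide.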

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → LSTM maps linguistic features x_1, x_2 … x_T to acoustic features y_1, y_2 … y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database             US English, female speaker
Training / dev data  34,632 & 100 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  DNN: 449, LSTM: 289
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                  4 hidden layers, 1024 units/hidden layer;
                     ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM                 1 forward LSTM layer, 256 units, 128 projection;
                     asynchronous SGD on CPUs [35]
Postprocessing       Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

       DNN                 LSTM               Stats
  w/ ∆      w/o ∆     w/ ∆      w/o ∆    Neutral     z        p
  50.0      14.2       –         –        35.8      12.0   < 10^-10
   –         –        30.2      15.6      54.2       5.1   < 10^-6
  15.8       –        34.0       –        50.2      −6.2   < 10^-9
  28.4       –         –        33.6      38.0      −1.5     0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

HMM-based speech synthesis [4]

[Figure: Training part — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models.
Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation & synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79


Speech production process

[Figure: text (concept) → modulation of a carrier wave by speech information → speech.
The sound source (voiced: pulse train, unvoiced: noise) determines the fundamental frequency and the voiced/unvoiced decision; the vocal tract, driven by air flow, determines the frequency transfer characteristics; magnitude and start–end timing complete the picture]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79

Source-filter model

[Figure: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n)]

x(n) = h(n) * e(n)

↓ Fourier transform

X(e^{jω}) = H(e^{jω}) E(e^{jω})

• E(e^{jω}): source (excitation) part
• H(e^{jω}): vocal tract resonance part; should be defined by HMM state-output vectors,
  e.g., mel-cepstrum, line spectral pairs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
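The relation x(n) = h(n) * e(n) can be demonstrated directly with a toy excitation and impulse response. The sampling rate, F0, and the decaying-cosine "resonance" below are illustrative assumptions, not a real vocal tract model.

```python
import numpy as np

fs = 8000                          # sampling rate (Hz), assumed
f0 = 100                           # fundamental frequency (Hz), assumed
n = np.arange(800)

# Voiced excitation e(n): an impulse every fs/f0 = 80 samples.
e = np.zeros(len(n))
e[::fs // f0] = 1.0

# Toy LTI impulse response h(n): a single decaying resonance at 500 Hz.
m = np.arange(200)
h = (0.97 ** m) * np.cos(2 * np.pi * 500 * m / fs)

# x(n) = h(n) * e(n): each pulse excites a copy of the impulse response.
x = np.convolve(e, h)[:len(n)]
```

In the frequency domain this same operation is the product X = H·E on the slide: the harmonic comb of the pulse train sampled by the filter's resonance envelope.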

Parametric models of speech signal

Autoregressive (AR) model:

H(z) = K / (1 − Σ_{m=1}^{M} c(m) z^{−m})

Exponential (EX) model:

H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
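For the exponential model, evaluating H(e^{jω}) from a set of cepstral coefficients is a direct transcription of the formula above. The coefficients below are arbitrary illustrative values, not estimates from real speech.

```python
import numpy as np

# EX model: H(e^{jw}) = exp( sum_m c(m) e^{-jwm} ), m = 0..M.
c = np.array([1.0, 0.8, -0.3, 0.1])            # c(0)..c(M), illustrative
omega = np.linspace(0, np.pi, 64)              # frequency grid

# Matrix of e^{-jwm} terms, one row per frequency, one column per m.
E = np.exp(-1j * np.outer(omega, np.arange(len(c))))
H = np.exp(E @ c)                              # exponential model spectrum
log_mag_db = 20 * np.log10(np.abs(H))          # log magnitude in dB
```

At ω = 0 every exponential term is 1, so |H| reduces to exp(Σ c(m)), a quick sanity check on the implementation.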

Examples of speech spectra

[Figure: log magnitude (dB, −20 to 80) vs. frequency (0–5 kHz): (a) ML-based cepstral analysis, (b) linear prediction]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79


Structure of state-output (observation) vectors

[Figure: observation vector o_t = spectrum part (mel-cepstral coefficients c_t, ∆c_t, ∆²c_t) + excitation part (log F0 p_t, ∆p_t, ∆²p_t)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(o_t), b2(o_t), b3(o_t)]

O: observation sequence o1 o2 o3 o4 o5 … oT
Q: state sequence 1 1 1 1 2 2 3 3 …

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

[Figure: observation vector o_t split into streams: o_t^1 = [c_t, ∆c_t, ∆²c_t] (spectrum, stream 1) and o_t^2 = p_t, o_t^3 = δp_t, o_t^4 = δ²p_t (excitation, streams 2–4)]

b_j(o_t) = Π_{s=1}^{S} ( b_{sj}(o_t^s) )^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
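The stream-weighted product b_j(o_t) = Π_s b_sj(o_t^s)^{w_s} can be sketched for scalar streams with Gaussian state-output distributions. All observation values, means, variances, and stream weights below are illustrative assumptions.

```python
import numpy as np

def gauss(o, mu, var):
    """1-dim Gaussian density."""
    return np.exp(-0.5 * (o - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def multi_stream_output_prob(obs, dists, weights):
    """b_j(o_t) = prod_s b_sj(o_t^s)^{w_s} for scalar streams."""
    p = 1.0
    for o, (mu, var), w in zip(obs, dists, weights):
        p *= gauss(o, mu, var) ** w
    return p

obs = [0.1, 5.2, 0.0, -0.1]                       # e.g. spectrum, p_t, δp_t, δ²p_t
dists = [(0.0, 1.0), (5.0, 0.5), (0.0, 0.2), (0.0, 0.2)]
weights = [1.0, 1.0, 1.0, 1.0]                    # stream weights w_s
p = multi_stream_output_prob(obs, dists, weights)
```

Setting a stream's weight to 0 removes its influence entirely (its factor becomes 1), which is how stream weighting trades off spectrum against excitation.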

Training process

data & labels
→ Compute variance floor (HCompV)
→ Initialize CI-HMMs by segmental k-means (HInit)
→ Reestimate CI-HMMs by EM algorithm (HRest & HERest)   [monophone (context-independent, CI)]
→ Copy CI-HMMs to CD-HMMs (HHEd CL)
→ Reestimate CD-HMMs by EM algorithm (HERest)           [fullcontext (context-dependent, CD)]
→ Decision tree-based clustering (HHEd TB)
→ Reestimate CD-HMMs by EM algorithm (HERest)
→ Untie parameter tying structure (HHEd UT)
→ Estimated HMMs

→ Estimate CD duration models from FB stats (HERest)
→ Decision tree-based clustering (HHEd TB)
→ Estimated duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

→ Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Figure: binary decision tree with yes/no questions such as L=voice, L=w, R=silence, L=gy; leaf nodes cluster the states of contexts such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n; synthesized states are drawn from the leaf nodes]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: trellis over t = 1 … T = 8; state i is occupied from t0 to t1]

Probability to enter state i at t0, then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} a_{ii}^{t1−t0} Π_{t=t0}^{t1} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79


Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: 3-state left-to-right HMM aligned with observation sequence o1 … oT; state sequence Q = 1 1 1 1 2 2 3 3 …; state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: per-state mean and variance trajectories; ô becomes a step-wise mean vector sequence]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^⊤, ∆c_t^⊤]^⊤,  ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form:

o = Wc

where o = [… c_{t−1}^⊤ ∆c_{t−1}^⊤ c_t^⊤ ∆c_t^⊤ c_{t+1}^⊤ ∆c_{t+1}^⊤ …]^⊤ stacks static and dynamic features, c = [… c_{t−2}^⊤ c_{t−1}^⊤ c_t^⊤ c_{t+1}^⊤ …]^⊤ stacks only the static features, and W is a band matrix of 0, I, and −I blocks: each static row picks out c_t ([… 0 I 0 0 …]) and each delta row computes c_t − c_{t−1} ([… −I I 0 0 …]).

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
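Solving W^⊤Σ^{−1}Wc = W^⊤Σ^{−1}μ can be sketched for a 1-dim static feature with one delta (∆c_t = c_t − c_{t−1}, as defined on the earlier slide). The per-frame means and variances below are made-up illustrative values; a real system takes them from the aligned HMM state sequence, and uses a banded Cholesky solver rather than a dense solve.

```python
import numpy as np

T = 5
# Per-frame means/variances of [c_t, dc_t]; illustrative values only.
mu = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 1.0], [2.0, 0.0], [1.0, -1.0]])
var = np.ones((T, 2))

# Build W: rows alternate static (c_t) and delta (c_t - c_{t-1}).
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0             # static row: c_t
    W[2 * t + 1, t] = 1.0         # delta row: +c_t ...
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0  # ... - c_{t-1}

Sinv = np.diag(1.0 / var.reshape(-1))   # diagonal Sigma^-1
A = W.T @ Sinv @ W                      # W' Sigma^-1 W
b = W.T @ Sinv @ mu.reshape(-1)         # W' Sigma^-1 mu
c = np.linalg.solve(A, b)               # smooth static trajectory
```

The solution c interpolates between the step-wise static means under the delta constraints, which is exactly why the generated trajectory on the next slide is smooth rather than piecewise constant.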


Speech parameter generation algorithm [9]

[Figure: band structure of W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} μ_q; the left-hand side is a banded matrix of 1 / −1 blocks multiplying c1 … cT, and the right-hand side stacks the weighted state means μ_{q1} … μ_{qT}]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: static and dynamic mean/variance sequences and the generated smooth trajectory c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79


Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) drive a pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) * e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: training speakers → adaptive training → average-voice model → adaptation → target speakers]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: new HMM set λ′ interpolated among HMM sets λ1 … λ4 with interpolation ratios I(λ′, λ1) … I(λ′, λ4)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation:
  difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction:
  harmonic effects often cause problems
  [Figure: power (dB) vs. frequency (0–8 kHz) with harmonic ripple]

• Phase:
  important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
  statistics do not vary within an HMM state

• Conditional independence assumption:
  state output probability depends only on the current state

• Weak duration modeling:
  state duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: spectrograms (0–8 kHz) of natural vs. generated speech]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics:
    adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training:
  learn the relationship between linguistic & acoustic features

• Synthesis:
  map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79



HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees with yes/no context splits over the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
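The DNN acoustic model above is just a feed-forward map from a per-frame linguistic feature vector to acoustic features. A minimal sketch with sigmoid hidden layers and a linear output, as in the experiments here (illustrative dimensions and random weights, not the trained system):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dnn_acoustic_model(x, weights, biases):
    """Map a linguistic feature vector x to an acoustic feature vector.

    Hidden layers use sigmoid activations; the output layer is linear,
    giving the predicted acoustic feature values for this frame.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)           # hidden layers h1, h2, h3, ...
    return h @ weights[-1] + biases[-1]  # linear output layer

# Toy dimensions: 30 linguistic inputs, 4 x 64 hidden units, 40 acoustic outputs.
rng = np.random.default_rng(0)
dims = [30, 64, 64, 64, 64, 40]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(dims[:-1], dims[1:])]
biases = [np.zeros(n) for n in dims[1:]]

x = rng.standard_normal(30)   # one frame of binary & numeric linguistic features
y = dnn_acoustic_model(x, weights, biases)
```

In the full system this forward pass runs once per frame, and the outputs are treated as the statistics that drive parameter generation.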

Framework

[Figure: TEXT → text analysis → input feature extraction → input features (binary & numeric, plus duration & frame-position features) at frames 1 … T → DNN (input, hidden, output layers) → statistics (mean & variance) of speech parameter vector sequence (spectral, excitation & V/UV features) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture with non-linear operations
  → Integrates feature extraction into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? → No

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state, left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions & numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum trajectories over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)      DNN (layers × units)   Neutral   p value   z value
15.8 (16)    38.5 (4 × 256)         45.7      < 10⁻⁶    -9.9
16.1 (4)     27.2 (4 × 512)         56.8      < 10⁻⁶    -5.1
12.7 (1)     36.6 (4 × 1024)        50.7      < 10⁻⁶    -11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: scattered data samples (y1, y2) vs. the NN prediction]

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− NN trained with an MSE loss → approximates the conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
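The unimodality point can be demonstrated numerically: with a one-to-many mapping, the MSE-optimal predictor returns the conditional mean, a value the data itself never takes. A toy sketch (hypothetical data, not from the experiments):

```python
import numpy as np

# One-to-many data: the same input x = 1.0 is paired with two target values,
# y = -1 and y = +1 (two "ways of speaking" the same text).
X = np.ones((200, 1))
y = np.array([-1.0, 1.0] * 100)

# Least-squares (MSE) fit of a constant predictor w: minimizes sum (y - w*x)^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
prediction = float(X[0] @ w)

# The MSE-optimal prediction is the conditional mean, 0.0 --
# far from both of the values the data actually takes.
```

A mixture density output layer avoids this by predicting a multi-modal distribution instead of a single value.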


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN whose output layer emits w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of activation function: z_j = Σ_{i=1}^{4} h_i w_{ij}

1-dim, 2-mix MDN:

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)    w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
μ_1(x) = z_3                                μ_2(x) = z_4
σ_1(x) = exp(z_5)                           σ_2(x) = exp(z_6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
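The output-layer activations above can be written out directly. A sketch of the 1-dim, 2-mix case (toy z values; `mdn_output_layer` and `mdn_density` are illustrative names, not from any library):

```python
import numpy as np

def mdn_output_layer(z):
    """Map the 6 pre-activation outputs z of a 1-dim, 2-mix MDN to
    GMM parameters, following the slide's activation functions:
    weights -> softmax, means -> linear, variances -> exponential."""
    z = np.asarray(z, dtype=float)
    w = np.exp(z[0:2]) / np.sum(np.exp(z[0:2]))  # w1(x), w2(x): softmax over z1, z2
    mu = z[2:4]                                  # mu1(x) = z3, mu2(x) = z4: linear
    sigma = np.exp(z[4:6])                       # sigma1(x) = exp(z5), sigma2(x) = exp(z6)
    return w, mu, sigma

def mdn_density(y, w, mu, sigma):
    """p(y | x) as a 2-component Gaussian mixture."""
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return float(np.dot(w, comp))

w, mu, sigma = mdn_output_layer([0.2, -0.4, 1.0, -1.0, 0.1, 0.3])
p = mdn_density(0.5, w, mu, sigma)
```

The softmax guarantees the weights sum to one, and the exponential keeps the variances strictly positive, so any output vector z parameterizes a valid GMM.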

DMDN-based SPSS [27]

[Figure: DMDN framework — TEXT → text analysis → input feature extraction → duration prediction → DMDN emits w_m(x_t), μ_m(x_t), σ_m(x_t) for each frame x_1 … x_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM            1 mix     3.537 ± 0.113
               2 mix     3.397 ± 0.115
DNN            4×1024    3.635 ± 0.127
               5×1024    3.681 ± 0.109
               6×1024    3.652 ± 0.108
               7×1024    3.637 ± 0.129
DMDN (4×1024)  1 mix     3.654 ± 0.117
               2 mix     3.796 ± 0.107
               4 mix     3.766 ± 0.113
               8 mix     3.805 ± 0.113
               16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features

− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) are used as inputs
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1} via recurrent connections]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in hidden layers loops through recurrent connections
  → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell c_t with input gate (write), forget gate (reset) and output gate (read); each gate sees x_t and h_{t−1} through a sigmoid, with tanh squashing on the cell input and output, producing h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
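A single step of a standard LSTM block (no peephole connections; random toy parameters, not a trained model) can be sketched as:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a standard LSTM block: each gate sees the current
    input x_t and the previous block output h_{t-1}."""
    i = sigmoid(p["Wi"] @ x + p["Ri"] @ h_prev + p["bi"])  # input gate (write)
    f = sigmoid(p["Wf"] @ x + p["Rf"] @ h_prev + p["bf"])  # forget gate (reset)
    o = sigmoid(p["Wo"] @ x + p["Ro"] @ h_prev + p["bo"])  # output gate (read)
    g = np.tanh(p["Wc"] @ x + p["Rc"] @ h_prev + p["bc"])  # candidate cell input
    c = f * c_prev + i * g        # linear memory cell: gated keep + gated write
    h = o * np.tanh(c)            # gated read of the cell state
    return h, c

rng = np.random.default_rng(1)
n_in, n_cell = 8, 4
p = {}
for gate in "ifoc":
    p["W" + gate] = rng.standard_normal((n_cell, n_in)) * 0.1
    p["R" + gate] = rng.standard_normal((n_cell, n_cell)) * 0.1
    p["b" + gate] = np.zeros(n_cell)

h, c = np.zeros(n_cell), np.zeros(n_cell)
for t in range(5):                # run a short input sequence
    h, c = lstm_step(rng.standard_normal(n_in), h, c, p)
```

Because the cell update is additive (f·c + i·g) rather than repeatedly squashed, gradients and stored information survive over many more time steps than in a basic RNN.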

LSTM-based SPSS [33 34]

[Figure: LSTM framework — TEXT → text analysis → input feature extraction → duration prediction → LSTM maps x_1 … x_T to y_1 … y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

DNN               LSTM
w/ ∆    w/o ∆     w/ ∆    w/o ∆     Neutral   z       p
50.0    14.2      –       –         35.8      12.0    < 10⁻¹⁰
–       –         30.2    15.6      54.2      5.1     < 10⁻⁶
15.8    –         34.0    –         50.2      -6.2    < 10⁻⁹
28.4    –         –       33.6      38.0      -1.5    0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g., adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learns the mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


HMM-based speech synthesis [4]

[Figure: training part — SPEECH DATABASE → spectral & excitation parameter extraction → spectral & excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 8 of 79


Speech production process

[Figure: speech production — text (concept) → modulation of a carrier wave by speech information (fundamental frequency, voiced/unvoiced, frequency transfer characteristics, magnitude, start–end); air flow through the sound source (voiced: pulse, unvoiced: noise) → speech]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79

Source-filter model

[Figure: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) ∗ e(n)]

x(n) = h(n) ∗ e(n)

↓ Fourier transform

X(e^{jω}) = H(e^{jω}) E(e^{jω})

E(e^{jω}): source excitation part; H(e^{jω}): vocal tract resonance part

H(e^{jω}) should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
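The source-filter model is a plain discrete convolution of an excitation signal with the vocal-tract impulse response. A toy sketch (illustrative F0 and a made-up decaying h(n), not a real vocal-tract filter):

```python
import numpy as np

fs = 16000                      # sampling rate (Hz)
f0 = 100                        # fundamental frequency of the voiced source (Hz)
n = 400

# Voiced source: pulse train with period fs/f0 = 160 samples.
e = np.zeros(n)
e[::fs // f0] = 1.0
# (An unvoiced source would instead be white noise.)

# Toy "vocal tract" filter: a decaying impulse response h(n).
h = 0.9 ** np.arange(50)

# Source-filter model: x(n) = h(n) * e(n) (discrete convolution).
x = np.convolve(e, h)[:n]
```

In the vocoder, h(n) is not hand-picked like this but reconstructed from the spectral parameters (e.g., mel-cepstrum) that the acoustic model generates.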

Parametric models of speech signal

Autoregressive (AR) model:    H(z) = K / ( 1 − Σ_{m=0}^{M} c(m) z^{−m} )

Exponential (EX) model:       H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]

• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79

Examples of speech spectra

[Figure: log magnitude spectra (dB) over 0–5 kHz — (a) ML-based cepstral analysis, (b) linear prediction]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79


Structure of state-output (observation) vectors

o_t = [ c_t^T  ∆c_t^T  ∆²c_t^T | p_t  δp_t  δ²p_t ]^T

Spectrum part: mel-cepstral coefficients c_t, plus ∆ and ∆∆ mel-cepstral coefficients
Excitation part: log F0 p_t, plus ∆ log F0 and ∆∆ log F0

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM — initial probability π_1, transition probabilities a_11, a_12, a_22, a_23, a_33, state-output distributions b_1(o_t), b_2(o_t), b_3(o_t); observation sequence O = o_1 … o_T, state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

[Figure: observation vector o_t split into streams — stream 1 (spectrum): c_t, ∆c_t, ∆²c_t = o_t¹; streams 2–4 (excitation): p_t = o_t², δp_t = o_t³, δ²p_t = o_t⁴]

b_j(o_t) = Π_{s=1}^{S} ( b_{sj}(o_t^s) )^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
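The stream-weighted output probability is easiest to handle in the log domain, where the weighted product becomes a weighted sum. A minimal sketch (made-up log-probabilities and weights):

```python
def multi_stream_log_likelihood(stream_log_probs, stream_weights):
    """log b_j(o_t) for a multi-stream HMM state: the weighted product
    of per-stream output probabilities, Prod_s b_sj(o_t^s)^{w_s},
    becomes a weighted sum of log-probabilities."""
    return sum(w * lp for lp, w in zip(stream_log_probs, stream_weights))

# Stream 1: spectrum (c, delta c, delta^2 c); streams 2-4: log F0 and its deltas.
log_probs = [-12.3, -1.1, -0.9, -1.4]   # illustrative per-stream log-likelihoods
weights = [1.0, 1.0, 1.0, 1.0]          # equal stream weights
lp = multi_stream_log_likelihood(log_probs, weights)
```

Working in logs also sidesteps the numerical underflow that the raw product of many small probabilities would cause.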

Training process

[Flowchart: data & labels →]

1. Compute variance floor (HCompV)
2. Initialize CI-HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to CD-HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
9. Estimate CD duration models from FB stats (HERest)
10. Decision tree-based clustering (HHEd TB)

monophone (context-independent, CI) → fullcontext (context-dependent, CD)
→ estimated HMMs & duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes

• Position of current phoneme in current syllable

• # of phonemes at {preceding, current, succeeding} syllable

• {accent, stress} of {preceding, current, succeeding} syllable

• Position of current syllable in current word

• # of {preceding, succeeding} {stressed, accented} syllables in phrase

• # of syllables {from previous, to next} {stressed, accented} syllable

• Guess at part of speech of {preceding, current, succeeding} word

• # of syllables in {preceding, current, succeeding} word

• Position of current word in current phrase

• # of {preceding, succeeding} content words in current phrase

• # of words {from previous, to next} content word

• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Figure: binary decision tree — questions such as "L=voice?", "L=w?", "R=silence?", "L=gy?" with yes/no branches cluster context-dependent states (w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n, …) into leaf nodes → synthesized states]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

Decision trees for mel-cepstrum

Decision trees for F0

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: trellis over frames t = 1 … T = 8 — state i is entered at t_0 and left at t_1 + 1]

Probability to enter state i at t_0 and then leave at t_1 + 1:

χ_{t_0,t_1}(i) ∝ Σ_{j≠i} α_{t_0−1}(j) a_{ji} · a_{ii}^{t_1−t_0} · Π_{t=t_0}^{t_1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t_1+1}) β_{t_1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

State duration model

Decision trees for mel-cepstrum

Decision trees for F0

Decision tree for state duration models

HMM

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79


Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM λ and words w:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)

ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: 3-state left-to-right HMM with observation sequence O and state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …; state durations D = 4, 10, 5]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs (w/o dynamic features)

[Figure: step-wise trajectory with state means and variances]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [ c_t^T, ∆c_t^T ]^T,    ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged as o = W c:

[ … c_{t−1}^T ∆c_{t−1}^T c_t^T ∆c_t^T c_{t+1}^T ∆c_{t+1}^T … ]^T
    = W [ … c_{t−2}^T c_{t−1}^T c_t^T c_{t+1}^T … ]^T

where each static row of W is [ … 0 I 0 0 … ] and each delta row is [ … 0 −I I 0 … ]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
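For a 1-dimensional static feature with the single delta window ∆c_t = c_t − c_{t−1}, the matrix W can be built explicitly. A sketch (boundary handled by assuming c_0 = 0):

```python
import numpy as np

def build_w(T):
    """Window matrix W mapping static features c = [c_1 ... c_T]^T to
    o = [c_1, dc_1, ..., c_T, dc_T]^T with dc_t = c_t - c_{t-1}
    (1-dim static feature; dc_1 uses c_0 = 0 at the boundary)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row:  [... 0  I 0 ...]
        W[2 * t + 1, t] = 1.0           # delta row:   [... -I I 0 ...], the +I part
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0  # delta row: the -I part
    return W

T = 5
c = np.arange(1.0, T + 1.0)             # c = [1, 2, 3, 4, 5]
o = build_w(T) @ c                      # interleaved [c_t, dc_t] sequence
```

With c rising by 1 per frame, every delta row of o evaluates to 1, which is a quick way to check that W encodes the intended difference operator.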

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)   subject to   o = W c

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

W^T Σ_q̂^{−1} W c = W^T Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
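Given W, solving the normal equations above yields the smoothed static trajectory. A toy 1-dimensional sketch with unit variances and a step-wise static mean (illustrative values, not from the slides):

```python
import numpy as np

def build_w(T):
    """Window matrix for static + delta features (1-dim case),
    with dc_t = c_t - c_{t-1} and c_0 = 0 at the boundary."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

T = 6
W = build_w(T)

# Step-wise means: static means jump from 0 to 1 mid-utterance;
# delta means are 0 (interleaved [static, delta] layout).
mu = np.zeros(2 * T)
mu[0::2] = [0, 0, 0, 1, 1, 1]
Sigma_inv = np.eye(2 * T)          # unit variances for simplicity

# Maximizing N(Wc; mu, Sigma) over c gives the linear system
#   W^T Sigma^{-1} W c = W^T Sigma^{-1} mu
c = np.linalg.solve(W.T @ Sigma_inv @ W, W.T @ Sigma_inv @ mu)
```

The delta constraint turns the step-wise mean sequence into a gradual transition: c rises smoothly toward 1 instead of jumping, which is exactly the smoothing effect the slide's trajectory figure shows. In practice the system is band-diagonal and solved far more efficiently than with a dense solver.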


Speech parameter generation algorithm [9]

[Figure: banded structure of W^T Σ_q^{−1} W c = W^T Σ_q^{−1} μ_q — a band-diagonal system in c_1 … c_T whose right-hand side is built from μ_{q_1} … μ_{q_T}]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: static & dynamic components of the generated trajectory c, with state means and variances]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79


Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) → pulse train / white noise → excitation e(n); generated spectral parameters (cepstrum, LSP) → linear time-invariant system h(n) → synthesized speech x(n) = h(n) ∗ e(n)]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Adaptation (mimicking voice) [13]

[Figure: adaptive training of an average-voice model from training speakers, then adaptation to target speakers.]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: a new HMM set λ′ obtained by interpolating representative HMM sets λ1–λ4 with interpolation ratios I(λ′, λk).]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: pulse train and white noise switched between unvoiced and voiced segments to form e(n).]
• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power (dB) vs frequency (0–8 kHz) of an extracted spectral envelope.]
• Phase
  Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: spectrograms (0–8 kHz) of natural vs generated speech.]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (adaptation; interpolation: eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Linguistic → acoustic mapping

• Training
  Learn relationship between linguistic & acoustic features
• Synthesis
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision tree (yes/no questions) partitioning the acoustic space.]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: linguistic features x → hidden layers h1, h2, h3 → acoustic features y.]

• DNN represents conditional distribution of y given x
• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
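The mapping the slide describes is just a feedforward pass from linguistic features to acoustic-feature statistics. A minimal numpy sketch follows; the layer sizes, random weights, and feature counts are illustrative assumptions, not the deck's trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

def dnn_forward(x, weights, biases):
    """Feedforward acoustic model: linguistic features in, acoustic
    feature statistics out. Hidden layers use a sigmoid (as in the
    deck's first DNN experiments); the output layer is linear,
    since it predicts real-valued means rather than probabilities."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden layer
    return h @ weights[-1] + biases[-1]           # linear output layer

# Illustrative sizes: 36 linguistic inputs, two hidden layers of 256
# units, 127 acoustic outputs (e.g., mel-cepstra + log F0 +
# aperiodicity with deltas and a V/UV flag).
sizes = [36, 256, 256, 127]
weights = [rng.normal(0, 0.1, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

x = rng.normal(size=36)          # one frame of linguistic features
y = dnn_forward(x, weights, biases)
```

The same forward pass is run for every frame; the decision trees and GMMs of the HMM system are replaced by this single network.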

Framework

[Figure: TEXT → text analysis → input feature extraction → input features (binary & numeric, plus duration and frame-position features) for frames 1 … T → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH. Duration prediction supplies the frame alignment.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? ... no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database             US English female speaker
Training/test data   33,000 & 173 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  11 categorical features, 25 numeric features
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology         5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture     1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing       Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512).]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value    z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶     −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶     −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶     −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Limitations of DNN-based acoustic modeling

[Figure: scatter of data samples (y1, y2) vs the NN prediction, which lies between the modes.]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; the network outputs weights w1(x), w2(x), means μ1(x), μ2(x), and variances σ1(x), σ2(x) over the target y.]

Weights   → softmax activation function
Means     → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

    w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)     w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
    μ_1(x) = z_3                                 μ_2(x) = z_4
    σ_1(x) = exp(z_5)                            σ_2(x) = exp(z_6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
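The three activation functions on the slide are easy to sketch. A minimal numpy version for the 1-dim, 2-mix case follows; the function name and the toy activations z are illustrative, not from the deck.

```python
import numpy as np

def mdn_params(z, n_mix):
    """Map the network's final linear activations z to GMM parameters
    for a 1-dim target, using the activations on the slide:
    softmax for weights, identity for means, exp for variances."""
    zw, zmu, zsig = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:3 * n_mix]
    w = np.exp(zw - zw.max())
    w /= w.sum()                      # softmax -> mixture weights sum to 1
    mu = zmu                          # linear -> means (unbounded)
    sigma = np.exp(zsig)              # exponential -> std devs stay > 0
    return w, mu, sigma

# 1-dim, 2-mix MDN as on the slide: z = (z1, ..., z6)
z = np.array([0.2, -0.1, 1.5, -0.3, 0.0, 0.4])
w, mu, sigma = mdn_params(z, n_mix=2)
```

The exponential keeps variances positive without constraints, and the softmax keeps the weights a valid distribution, which is why those particular activations are chosen.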

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → DMDN outputs per-frame GMM parameters w_m(x_t), μ_m(x_t), σ_m(x_t) for t = 1 … T → parameter generation → waveform synthesis → SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture    4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture   4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization        AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM        1 mix     3.537 ± 0.113
           2 mix     3.397 ± 0.115
DNN        4×1024    3.635 ± 0.127
           5×1024    3.681 ± 0.109
           6×1024    3.652 ± 0.108
           7×1024    3.637 ± 0.129
DMDN       1 mix     3.654 ± 0.117
(4×1024)   2 mix     3.796 ± 0.107
           4 mix     3.766 ± 0.113
           8 mix     3.805 ± 0.113
           16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − Fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: recurrent connections feed the hidden activations at t−1 back as inputs at t; input x_t → output y_t.]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block with memory cell c_t and output h_t; input gate i_t (write), output gate (read), and forget gate (reset) are sigmoid units of x_t and h_{t−1} with biases b_i, b_o, b_f; the cell input and output use tanh.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
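The gate structure on the slide corresponds to the standard LSTM forward step, which can be sketched directly. This is a minimal numpy version; the parameter names and random initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of the standard LSTM cell sketched on the slide:
    sigmoid input/forget/output gates over [x_t, h_{t-1}], tanh cell
    input/output, and a linear memory cell c_t. `p` holds weight
    matrices and biases (hypothetical names, not from the deck)."""
    v = np.concatenate([x_t, h_prev])
    i = sigmoid(p["Wi"] @ v + p["bi"])           # input gate (write)
    f = sigmoid(p["Wf"] @ v + p["bf"])           # forget gate (reset)
    o = sigmoid(p["Wo"] @ v + p["bo"])           # output gate (read)
    g = np.tanh(p["Wc"] @ v + p["bc"])           # cell input
    c_t = f * c_prev + i * g                     # linear memory cell
    h_t = o * np.tanh(c_t)                       # block output
    return h_t, c_t

rng = np.random.default_rng(0)
nx, nh = 4, 3                                    # toy sizes
p = {k: rng.normal(0, 0.5, (nh, nx + nh)) for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({k: np.zeros(nh) for k in ("bi", "bf", "bo", "bc")})

h = c = np.zeros(nh)
for t in range(5):                               # run a short sequence
    h, c = lstm_step(rng.normal(size=nx), h, c, p)
```

Because the memory cell update is linear (f·c + i·g rather than a squashed recurrence), gradients and stored information decay far more slowly than in the basic RNN of the previous slide.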

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → linguistic features x1 … xT fed frame-by-frame to an LSTM-RNN producing acoustic features y1 … yT → parameter generation → waveform synthesis → SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database             US English female speaker
Train/dev set data   34,632 & 100 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  DNN: 449, LSTM: 289
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                  4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM                 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing       Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral    z      p
50.0       14.2        –           –            35.8       12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2       5.1    < 10⁻⁶
15.8       –           34.0        –            50.2       −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0       −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learn mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


HMM-based speech synthesis [4]

[Figure: system diagram. Training part: SPEECH DATABASE → spectral & excitation parameter extraction → train context-dependent HMMs & state duration models from parameters and labels. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 9 of 79

Speech production process

[Figure: text (concept) → modulation of a carrier wave by speech information. The sound source (voiced: pulse, unvoiced: noise) provides the fundamental frequency, magnitude, and start–end timing; the vocal tract shapes the frequency transfer characteristics; air flow → speech.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79

Source-filter model

[Figure: pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n) = h(n) ∗ e(n).]

    x(n) = h(n) ∗ e(n)
      ↓ Fourier transform
    X(e^jω) = H(e^jω) E(e^jω)

E(e^jω): source excitation part; H(e^jω): vocal tract resonance part

H(e^jω) should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
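The source-filter equation x(n) = h(n) ∗ e(n) can be demonstrated in a few lines of numpy. The decaying impulse response below is an arbitrary illustrative stand-in; in SPSS, h(n) would come from the generated spectral parameters (e.g., via an MLSA filter), not from this toy formula.

```python
import numpy as np

def excitation(n_samples, f0, fs, voiced=True):
    """Pulse train at F0 for a voiced segment, white noise otherwise."""
    if not voiced:
        return np.random.default_rng(0).normal(size=n_samples)
    e = np.zeros(n_samples)
    period = int(fs / f0)            # samples per pitch period
    e[::period] = 1.0                # one pulse per period
    return e

fs = 16000
e = excitation(n_samples=800, f0=200.0, fs=fs)   # 50 ms of voiced excitation
h = 0.9 ** np.arange(64)                          # toy decaying impulse response
x = np.convolve(e, h)[:800]                       # x(n) = h(n) * e(n)
```

Switching `voiced=False` swaps the pulse train for white noise, which is exactly the simple excitation scheme whose limitations the "Vocoding issues" slide discusses.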

Parametric models of speech signal

Autoregressive (AR) model:

    H(z) = K / ( 1 − Σ_{m=0}^{M} c(m) z^{−m} )

Exponential (EX) model:

    H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

    ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79

Examples of speech spectra

[Figure: log magnitude (dB, −20 to 80) vs frequency (0–5 kHz): (a) ML-based cepstral analysis, (b) linear prediction.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79

HMM-based speech synthesis [4]

[Figure: system diagram. Training part: SPEECH DATABASE → spectral & excitation parameter extraction → train context-dependent HMMs & state duration models from parameters and labels. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79

Structure of state-output (observation) vectors

    o_t = [ c_t, ∆c_t, ∆²c_t | p_t, δp_t, δ²p_t ]

Spectrum part: mel-cepstral coefficients c_t with their ∆ and ∆∆ coefficients
Excitation part: log F0 p_t with ∆ log F0 and ∆∆ log F0

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
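The dynamic parts of o_t are computed from the static sequence with fixed windows. A small numpy sketch follows; the specific window coefficients ∆c_t = 0.5(c_{t+1} − c_{t−1}) and ∆²c_t = c_{t−1} − 2c_t + c_{t+1} are the common choice in HMM-based synthesis, assumed here rather than stated on the slide.

```python
import numpy as np

def append_deltas(c):
    """Build state-output vectors o_t = [c_t, delta c_t, delta2 c_t]
    from a static feature sequence c of shape (T, D), padding the
    edges by replication so every frame gets a window."""
    pad = np.pad(c, ((1, 1), (0, 0)), mode="edge")
    delta = 0.5 * (pad[2:] - pad[:-2])            # first-order window
    delta2 = pad[:-2] - 2.0 * pad[1:-1] + pad[2:]  # second-order window
    return np.concatenate([c, delta, delta2], axis=1)

c = np.arange(12, dtype=float).reshape(4, 3)   # toy static mel-cepstra
o = append_deltas(c)                            # shape (4, 9)
```

These same windows are the rows of the matrix W used later by the parameter generation algorithm, which is why generating with dynamic feature constraints yields smooth static trajectories.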

Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π1, transitions a11, a12, a22, a23, a33, and state-output distributions b1(o_t), b2(o_t), b3(o_t). An observation sequence O = o1 o2 … oT is aligned to a state sequence Q, e.g., 1 1 1 1 2 2 3 3.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

[Figure: observation o_t split into streams — stream 1: [c_t, ∆c_t, ∆²c_t] (spectrum); streams 2–4: p_t, δp_t, δ²p_t (excitation).]

    b_j(o_t) = Π_{s=1}^{S} ( b_sj(o_t^s) )^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
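The stream-weighted product above is straightforward to compute. A minimal numpy sketch with diagonal Gaussians follows; the toy two-stream state, its statistics, and the unit stream weights are illustrative assumptions.

```python
import numpy as np

def gauss(o, mu, var):
    """Diagonal-Gaussian density of an observation (sub)vector."""
    d = o - mu
    return np.exp(-0.5 * np.sum(d * d / var)) / np.sqrt(
        np.prod(2.0 * np.pi * var))

def multi_stream_output_prob(streams, models, weights):
    """b_j(o_t) = prod_s b_sj(o_t^s) ** w_s: combine per-stream
    likelihoods (spectrum stream, F0 streams, ...) with exponents w_s."""
    b = 1.0
    for o_s, (mu, var), w in zip(streams, models, weights):
        b *= gauss(np.atleast_1d(o_s), mu, var) ** w
    return b

# Toy state j: stream 1 = 2-dim spectrum, stream 2 = 1-dim log F0.
streams = [np.array([0.1, -0.2]), np.array([5.0])]
models = [(np.zeros(2), np.ones(2)), (np.array([5.1]), np.array([0.04]))]
b = multi_stream_output_prob(streams, models, weights=[1.0, 1.0])
```

Keeping the streams separate is what later allows spectrum and F0 to be clustered with different decision trees, as the following slides describe.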

Training process

[Flowchart; HTK commands in parentheses:]

1. Compute variance floor (HCompV)
2. Initialize CI-HMMs by segmental k-means (HInit)            [monophone: context-independent, CI]
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to CD-HMMs (HHEd CL)                          [fullcontext: context-dependent, CD]
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
9. Estimate CD duration models from FB stats (HERest)
10. Decision tree-based clustering (HHEd TB)

→ Estimated HMMs and duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Figure: binary tree of context questions (e.g., L=voice?, L=w?, R=silence?, L=gy?) splitting context-dependent states such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n; leaf nodes hold the tied (synthesized) states.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

(Figure: separate decision trees for mel-cepstrum and for F0)

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

(Figure: state i occupied from t_0 to t_1 within a T = 8 frame sequence)

Probability to enter state i at t_0 and leave at t_1 + 1:

χ_{t_0,t_1}(i) ∝ Σ_{j≠i} α_{t_0−1}(j) a_{ji} a_{ii}^{t_1−t_0} ∏_{t=t_0}^{t_1} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t_1+1}) β_{t_1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

(Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

(Figure: system overview)

Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs from parameters & labels → context-dependent HMMs & state duration models

Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate most probable state outputs given HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

(Figure: 3-state left-to-right HMM with initial probability π_1, transition probabilities a_11, a_12, a_22, a_23, a_33, and output distributions b_1(o_t), b_2(o_t), b_3(o_t); observation sequence O = o_1 o_2 … o_T aligned to state sequence Q = 1 1 1 1 2 2 3 3; state durations D = 4, 10, 5)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

(Figure: per-state mean and variance)

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t⊤, Δc_t⊤]⊤,  Δc_t = c_t − c_{t−1}

(Figure: static features c_{t−2} … c_{t+2} (M-dimensional) and dynamic features Δc_{t−2} … Δc_{t+2}, stacked into 2M-dimensional o_t)

The relationship between static and dynamic features can be arranged in matrix form, o = Wc:

[ … c_{t−1}⊤ Δc_{t−1}⊤ c_t⊤ Δc_t⊤ c_{t+1}⊤ Δc_{t+1}⊤ … ]⊤
  = W [ … c_{t−2}⊤ c_{t−1}⊤ c_t⊤ c_{t+1}⊤ … ]⊤

where W is a sparse band matrix whose row pairs are [… 0 I 0 0 …] (static) and [… −I I 0 0 …] (delta).

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

(Figure: band structure of the linear system W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂; W consists of 1/−1 entries for static and delta features, Σ_q̂⁻¹ is block-diagonal with the state inverse covariances, and solving gives the static trajectory c_1 … c_T from the state means μ_{q_1} … μ_{q_T})

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
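The banded linear system above can be solved directly with off-the-shelf linear algebra. Below is a minimal sketch (not the published implementation) for a 1-dimensional feature stream with static + delta features and diagonal covariances; `build_w` and `mlpg` are illustrative helper names.

```python
import numpy as np

def build_w(T):
    """Stack static and delta rows so that o = W c.

    Row 2t   picks the static feature c_t;
    row 2t+1 forms the delta Δc_t = c_t − c_{t−1} (with c_{−1} = 0).
    """
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

def mlpg(mu, var):
    """Solve Wᵀ Σ⁻¹ W c = Wᵀ Σ⁻¹ μ for the static trajectory c.

    mu, var: length-2T arrays of interleaved [static, delta] means/variances.
    """
    T = len(mu) // 2
    W = build_w(T)
    precision = np.diag(1.0 / var)   # Σ⁻¹ (diagonal covariance assumed)
    A = W.T @ precision @ W
    b = W.T @ precision @ mu
    return np.linalg.solve(A, b)
```

When the per-frame means are consistent with some underlying trajectory, the solution recovers it exactly; with step-wise HMM means, the delta constraints smooth the result across state boundaries.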

Generated speech parameter trajectory

(Figure: generated static and dynamic feature trajectories c with state means and variances; the static trajectory is smooth across state boundaries)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

(Figure: system overview)

Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs from parameters & labels → context-dependent HMMs & state duration models

Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

(Figure: generated excitation parameters (log F0 with V/UV) → pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) → linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n))

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
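The pulse/noise excitation and LTI filtering above can be sketched as a toy vocoder. This is an illustrative simplification (one F0 value per frame, a crude noise gain, a made-up decaying-exponential impulse response), not STRAIGHT or the MLSA filter; `make_excitation` is a hypothetical helper.

```python
import numpy as np

def make_excitation(logf0, vuv, fs=16000, shift=80, rng=None):
    """Build e(n): pulse train in voiced frames, white noise in unvoiced ones."""
    rng = np.random.default_rng(0) if rng is None else rng
    e = np.zeros(len(vuv) * shift)
    phase = 0.0                              # carry pulse phase across frames
    for i, (lf, voiced) in enumerate(zip(logf0, vuv)):
        if voiced:
            period = fs / np.exp(lf)         # samples per pitch period
            t = phase
            while t < shift:
                e[i * shift + int(t)] = 1.0  # place a pulse
                t += period
            phase = t - shift
        else:
            e[i * shift:(i + 1) * shift] = 0.1 * rng.standard_normal(shift)
    return e

# Synthesis: convolve the excitation with an impulse response h(n)
# (a stand-in for the spectral filter defined by the generated cepstrum).
h = 0.9 ** np.arange(50)
x = np.convolve(make_excitation(np.log(100.0) * np.ones(10),
                                np.ones(10, dtype=bool)), h)
```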

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

(Figure: average-voice model obtained by adaptive training on training speakers, then adapted to target speakers)

• Train average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

(Figure: target model λ′ interpolated among HMM sets λ_1 … λ_4 with ratios I(λ′, λ_1) … I(λ′, λ_4))

λ: HMM set,  I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)
  (Figure: pulse train / white noise excitation e(n) over unvoiced and voiced regions)

• Spectral envelope extraction
  Harmonic effects often cause problems
  (Figure: power [dB] spectrum, 0–8 kHz)

• Phase
  Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
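The weak-duration point can be made concrete: a self-transition probability a_ii implies a geometric state-duration distribution p(d) = a_ii^(d−1) (1 − a_ii), which always peaks at d = 1 and decays exponentially. A quick illustrative sketch (`geometric_duration_pmf` is a hypothetical helper, not from the slides):

```python
import numpy as np

def geometric_duration_pmf(a_ii, max_d):
    # Duration pmf implied by an HMM self-transition:
    # p(d) = a_ii^(d-1) * (1 - a_ii), d = 1, 2, ..., max_d
    d = np.arange(1, max_d + 1)
    return a_ii ** (d - 1) * (1 - a_ii)
```

Real phone durations are unimodal with a peak well above one frame, which is why explicit duration models (e.g. hidden semi-Markov models) fit speech better.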

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

(Figure: natural vs generated spectra, 0–8 kHz)

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g. phone identity, POS, stress, # of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

(Figure: decision trees with yes/no questions partitioning the acoustic space)

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

(Figure: DNN with hidden layers h_1, h_2, h_3 mapping linguistic features x to acoustic features y)

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

(Figure: DNN-based TTS pipeline. TEXT → text analysis → input feature extraction → input features, including binary & numeric features, duration feature, and frame position feature, at frames 1 … T → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; a duration prediction module supplies the per-frame features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

(Figure: 5-th mel-cepstrum over frames 0–500, range −1 to 1, for natural speech, HMM (α=1), and DNN (4×512))

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)    DNN (layers × units)   Neutral   p value    z value
15.8 (16)  38.5 (4 × 256)         45.7      < 10⁻⁶     −9.9
16.1 (4)   27.2 (4 × 512)         56.8      < 10⁻⁶     −5.1
12.7 (1)   36.6 (4 × 1,024)       50.7      < 10⁻⁶     −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

(Figure: data samples vs NN prediction in (y_1, y_2) space)

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

(Figure: 1-dim, 2-mix MDN; network outputs z_1 … z_6 feed the mixture parameters)

Inputs of activation function:  z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights → softmax activation function:
  w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m),   w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)

Means → linear activation function:
  μ_1(x) = z_3,   μ_2(x) = z_4

Variances → exponential activation function:
  σ_1(x) = exp(z_5),   σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
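The output-layer activations above are easy to write out. The snippet below is an illustrative sketch (not the paper's code; `mdn_params` and `mdn_density` are hypothetical names) that maps raw network outputs z to GMM parameters and evaluates the resulting 1-dim mixture density.

```python
import numpy as np

def mdn_params(z, n_mix):
    """Split raw outputs z into (weights, means, std devs) for a 1-dim GMM."""
    w = np.exp(z[:n_mix] - z[:n_mix].max())      # softmax -> mixture weights
    w /= w.sum()
    mu = z[n_mix:2 * n_mix]                      # linear -> means
    sigma = np.exp(z[2 * n_mix:3 * n_mix])       # exponential -> std devs > 0
    return w, mu, sigma

def mdn_density(y, z, n_mix):
    """p(y | x) as a weighted sum of Gaussian components."""
    w, mu, sigma = mdn_params(z, n_mix)
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.sum(w * comp))
```

The exponential keeps variances positive and the softmax keeps weights on the simplex, so any raw network output yields a valid density.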

DMDN-based SPSS [27]

(Figure: TEXT → text analysis → input feature extraction → duration prediction → DMDN, which outputs mixture parameters w_1(x_t), w_2(x_t), μ_1(x_t), μ_2(x_t), σ_1(x_t), σ_2(x_t) for each input frame x_1 … x_T → parameter generation → waveform synthesis → SPEECH)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix    3.537 ± 0.113
               2 mix    3.397 ± 0.115
DNN            4×1024   3.635 ± 0.127
               5×1024   3.681 ± 0.109
               6×1024   3.652 ± 0.108
               7×1024   3.637 ± 0.129
DMDN (4×1024)  1 mix    3.654 ± 0.117
               2 mix    3.796 ± 0.107
               4 mix    3.766 ± 0.113
               8 mix    3.805 ± 0.113
               16 mix   3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

(Figure: RNN unrolled over time; recurrent connections carry hidden state across inputs x_{t−1}, x_t, x_{t+1} to outputs y_{t−1}, y_t, y_{t+1})

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

(Figure: LSTM block over inputs x_t and h_{t−1}: input gate i_t (write), forget gate (reset), output gate (read), memory cell c_t; sigmoid gate activations, tanh input and output non-linearities, biases b_i, b_f, b_o, b_c; block output h_t)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
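The gate equations in the figure can be written out in a few lines. This is a minimal single-step sketch (no peepholes or projection layer; the parameter dictionary `p` is a hypothetical layout), not a production implementation.

```python
import numpy as np

def lstm_cell(x, h_prev, c_prev, p):
    """One LSTM step: gates are sigmoids over [x, h_prev]; cell uses tanh."""
    sigm = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = np.concatenate([x, h_prev])
    i = sigm(p["Wi"] @ z + p["bi"])                       # input gate (write)
    f = sigm(p["Wf"] @ z + p["bf"])                       # forget gate (reset)
    o = sigm(p["Wo"] @ z + p["bo"])                       # output gate (read)
    c = f * c_prev + i * np.tanh(p["Wc"] @ z + p["bc"])   # memory cell update
    h = o * np.tanh(c)                                    # block output
    return h, c
```

Because the cell update is additive (f · c_prev + i · tanh(·)), gradients can flow through c over many steps instead of decaying through repeated squashing, which is the point of the linear memory cell.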

LSTM-based SPSS [33, 34]

(Figure: TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping input features x_1, x_2, …, x_T to acoustic features y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Speech production process

text (concept) → speech

[Figure: modulation of carrier wave by speech information; air flow from the lungs drives the sound source (voiced pulse / unvoiced noise), characterized by the fundamental frequency (voiced/unvoiced) and shaped by the frequency transfer characteristics of the vocal tract, with magnitude and start/end timing]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 10 of 79

Source-filter model

pulse train / white noise → excitation e(n) → linear time-invariant system h(n) → speech x(n)

x(n) = h(n) ∗ e(n)

↓ Fourier transform

X(e^{jω}) = H(e^{jω}) E(e^{jω})

E(e^{jω}): source (excitation) part. H(e^{jω}): vocal tract resonance part, which should be defined by HMM state-output vectors, e.g. mel-cepstrum, line spectral pairs.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
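The source-filter relation x(n) = h(n) ∗ e(n) above can be sketched numerically. The sample rate, F0, and the decaying-exponential impulse response below are toy assumptions for illustration, not values from the slides:

```python
import numpy as np

# Minimal sketch of the source-filter model: a pulse-train excitation e(n)
# (voiced speech) convolved with a vocal-tract impulse response h(n).
fs = 16000              # sample rate in Hz (assumption)
f0 = 100.0              # fundamental frequency in Hz (assumption)
n = np.arange(400)

# Pulse-train excitation: one impulse every fs/f0 samples
period = int(fs / f0)
e = np.zeros_like(n, dtype=float)
e[::period] = 1.0

# Toy vocal-tract impulse response: decaying exponential (assumption)
h = 0.9 ** np.arange(50)

# x(n) = h(n) * e(n)  (discrete convolution, truncated to the frame length)
x = np.convolve(e, h)[:len(n)]
```

Swapping the pulse train for white noise gives the unvoiced branch of the same model.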

Parametric models of speech signal

Autoregressive (AR) model:

H(z) = K / (1 − Σ_{m=1}^{M} c(m) z^{−m})

Exponential (EX) model:

H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]
• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
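For the exponential model, log H(e^{jω}) = Σ_m c(m) e^{−jωm}, so the log-magnitude spectrum is just the real part of the DFT of the cepstral coefficients. A small sketch with toy coefficients (the values of c and the FFT size are assumptions):

```python
import numpy as np

# Sketch: log-magnitude spectrum of the exponential (EX) model from
# cepstral coefficients, log H(e^{jw}) = sum_m c(m) e^{-jwm}.
c = np.array([1.0, 0.5, 0.25, 0.125])   # toy c(0)..c(3) (assumption)
n_fft = 64

# Evaluate C(e^{jw}) = sum_m c(m) e^{-jwm} on an FFT grid (zero-padded)
C = np.fft.rfft(c, n_fft)
log_mag = np.real(C)     # real part of log H(e^{jw}) = log |H(e^{jw})|
H = np.exp(C)            # H(e^{jw}) itself
```

At ω = 0 this reduces to log |H| = Σ_m c(m), which is a quick sanity check on the implementation.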

Examples of speech spectra

[Figure: two log-magnitude spectra (−20 to 80 dB) over 0–5 kHz; (a) ML-based cepstral analysis, (b) linear prediction]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79

HMM-based speech synthesis [4]

Training part: SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → spectral & excitation parameters, with labels → training HMMs → context-dependent HMMs & state duration models

Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79

Structure of state-output (observation) vectors

o_t = [c_t, Δc_t, Δ²c_t, p_t, δp_t, δ²p_t]

• Spectrum part: mel-cepstral coefficients c_t, Δ mel-cepstral coefficients Δc_t, ΔΔ mel-cepstral coefficients Δ²c_t
• Excitation part: log F0 p_t, Δ log F0 δp_t, ΔΔ log F0 δ²p_t

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79
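Stacking static features with their dynamic counterparts can be sketched directly. Following the simple difference definition used later in the deck (Δc_t = c_t − c_{t−1}), with a toy static sequence and Δ at the first frame set to zero (both assumptions):

```python
import numpy as np

# Sketch: building state-output vectors o_t = [c_t, Δc_t, ΔΔc_t]
# from a static feature sequence (T = 4 frames, 1-dim features).
c = np.array([[0.0], [1.0], [3.0], [6.0]])            # toy static features

delta = np.vstack([c[:1] * 0, np.diff(c, axis=0)])    # Δc_t = c_t − c_{t−1}
delta2 = np.vstack([delta[:1] * 0, np.diff(delta, axis=0)])  # ΔΔc_t
o = np.hstack([c, delta, delta2])                     # observation vectors
```

The excitation (log F0) rows of the slide's vector are built the same way, with an extra voiced/unvoiced flag handled by the multi-space distribution.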

Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π₁, self/forward transition probabilities a₁₁, a₁₂, a₂₂, a₂₃, a₃₃, and state-output distributions b₁(o_t), b₂(o_t), b₃(o_t)]

O = o₁, o₂, o₃, o₄, o₅, …, o_T (observation sequence)
Q = 1, 1, 1, 1, 2, 2, 3, 3, … (state sequence)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

The observation vector is split into streams:
o_t = [o_t¹, o_t², o_t³, o_t⁴], where o_t¹ = [c_t, Δc_t, Δ²c_t] (spectrum) and o_t² = p_t, o_t³ = δp_t, o_t⁴ = δ²p_t (excitation)

b_j(o_t) = Π_{s=1}^{S} ( b_j^s(o_t^s) )^{w_s},  S = 4

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
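The weighted product over streams is usually evaluated in the log domain. A minimal sketch with toy per-stream likelihoods and unit stream weights (both assumptions):

```python
import math

# Sketch: multi-stream state-output probability
# b_j(o_t) = prod_s b_j^s(o_t^s)^{w_s}, computed in the log domain.
stream_likelihoods = [0.2, 0.5, 0.5, 0.1]   # toy b^1..b^4 (assumption)
stream_weights = [1.0, 1.0, 1.0, 1.0]       # toy stream weights w_s

log_b = sum(w * math.log(b)
            for w, b in zip(stream_weights, stream_likelihoods))
b = math.exp(log_b)    # equals the weighted product of stream likelihoods
```

With all weights set to 1 this is just the product of the stream likelihoods; non-uniform weights rebalance the streams (e.g. spectrum vs. F0).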

Training process

data & labels →

1. Compute variance floor (HCompV)
2. Initialize monophone (context-independent, CI) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to fullcontext (context-dependent, CD) HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
   → estimated HMMs
9. Estimate CD duration models from forward-backward stats (HERest)
10. Decision tree-based clustering (HHEd TB)
    → estimated duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• Preceding & succeeding two phonemes
• Position of current phoneme in current syllable
• # of phonemes in preceding/current/succeeding syllable
• Accent & stress of preceding/current/succeeding syllable
• Position of current syllable in current word
• # of preceding & succeeding stressed/accented syllables in phrase
• # of syllables from previous & to next stressed/accented syllable
• Guess at part of speech of preceding/current/succeeding word
• # of syllables in preceding/current/succeeding word
• Position of current word in current phrase
• # of preceding & succeeding content words in current phrase
• # of words from previous & to next content word
• # of syllables in preceding/current/succeeding phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Figure: binary decision tree over context questions (e.g. L=voice?, L="w"?, R=silence?, L="gy"?); fullcontext states such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n are pooled at the leaf nodes, which define the synthesized states]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: state i occupied from frame t₀ to t₁ (T = 8)]

Probability to enter state i at t₀ and then leave at t₁ + 1:

χ_{t₀,t₁}(i) ∝ Σ_{j≠i} α_{t₀−1}(j) a_{ji} a_{ii}^{t₁−t₀} Π_{t=t₀}^{t₁} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t₁+1}) β_{t₁+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Figure: HMM with separate decision trees for mel-cepstrum, for F0, and a decision tree for the state duration models]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Same system diagram as before: training part (speech database → spectral & excitation parameter extraction → HMM training → context-dependent HMMs & state duration models) and synthesis part (text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → synthesized speech)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate most probable state outputs given HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: 3-state left-to-right HMM (π₁, a₁₁, a₁₂, a₂₂, a₂₃, a₃₃, b₁(o_t), b₂(o_t), b₃(o_t)) aligned with observation sequence O = o₁ … o_T and state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3; state durations D: 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: per-state output means and variances]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
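Without dynamic features, the generated trajectory is just each state's mean repeated for its duration. A minimal sketch with toy per-state means and durations (assumptions):

```python
import numpy as np

# Sketch: w/o dynamic features, ML generation picks each state's mean,
# so the trajectory is a step-wise (piecewise-constant) sequence.
state_means = [0.2, 0.8, 0.5]      # toy per-state output means (assumption)
state_durations = [4, 2, 2]        # frames spent in each state (assumption)

traj = np.repeat(state_means, state_durations)   # step-wise mean sequence
```

This discontinuous sequence is exactly what the dynamic feature constraints on the next slides are introduced to smooth.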

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_tᵀ, Δc_tᵀ]ᵀ,  Δc_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form as o = Wc, where W is a band matrix interleaving identity rows (static, [0 I 0]) and difference rows (dynamic, [−I I 0]):

[…, c_{t−1}ᵀ, Δc_{t−1}ᵀ, c_tᵀ, Δc_tᵀ, c_{t+1}ᵀ, Δc_{t+1}ᵀ, …]ᵀ = W […, c_{t−2}ᵀ, c_{t−1}ᵀ, c_tᵀ, c_{t+1}ᵀ, …]ᵀ

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
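The band matrix W for 1-dimensional static features can be sketched directly from the slide's definition Δc_t = c_t − c_{t−1} (for frame 0 the difference row only carries the +1 entry, an assumption about the boundary):

```python
import numpy as np

# Sketch: build the band matrix W mapping static features c to o = Wc,
# interleaving static rows (c_t) and delta rows (c_t − c_{t−1}).
T = 4
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0            # static row: c_t
    W[2 * t + 1, t] = 1.0        # delta row: +c_t
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0   # delta row: −c_{t−1}

c = np.array([0.0, 1.0, 3.0, 6.0])   # toy static trajectory (assumption)
o = W @ c                             # interleaved [c_0, Δc_0, c_1, Δc_1, …]
```

Adding Δ² features just appends a third row per frame with second-difference coefficients.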

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

Wᵀ Σ_q̂⁻¹ W c = Wᵀ Σ_q̂⁻¹ μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

[Figure: band structure of the linear system Wᵀ Σ_q̂⁻¹ W c = Wᵀ Σ_q̂⁻¹ μ_q̂; W interleaves 1 (static) and −1, 1 (delta) entries, the right-hand side stacks the state means μ_q̂₁ … μ_q̂_T, and the solution stacks the static features c₁ … c_T]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
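The normal equations above can be solved end-to-end for a tiny 1-dimensional case. The means and unit variances below are toy assumptions; a real system would exploit the band structure rather than a dense solve:

```python
import numpy as np

# Sketch: speech parameter generation (MLPG) for 1-dim features.
# Solve Wᵀ Σ⁻¹ W c = Wᵀ Σ⁻¹ μ for the smooth static trajectory c.
T = 3
W = np.zeros((2 * T, T))           # static rows + delta rows (Δc_t = c_t − c_{t−1})
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

# Toy state-output statistics: [mean(c_t), mean(Δc_t), ...] with unit variances
mu = np.array([1.0, 0.0, 1.0, 0.0, 2.0, 0.0])
var = np.ones(2 * T)
P = np.diag(1.0 / var)             # Σ⁻¹ (diagonal covariance assumption)

A = W.T @ P @ W                    # Wᵀ Σ⁻¹ W (banded, positive definite)
b = W.T @ P @ mu                   # Wᵀ Σ⁻¹ μ
c = np.linalg.solve(A, b)          # maximum-likelihood static trajectory
```

The delta means of zero pull neighboring frames together, so the solution is smoother than the raw static means (1, 1, 2).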

Generated speech parameter trajectory

[Figure: static and dynamic feature statistics (means and variances) with the generated smooth trajectory c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Same system diagram as before: training part (speech database → spectral & excitation parameter extraction → HMM training → context-dependent HMMs & state duration models) and synthesis part (text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → synthesized speech)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

Generated excitation parameters (log F0 with V/UV) → pulse train / white noise → excitation e(n)
Generated spectral parameters (cepstrum, LSP) → linear time-invariant system h(n)
→ synthesized speech x(n) = h(n) ∗ e(n)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: training speakers → adaptive training → average-voice model → adaptation → target speakers]

• Train average voice model (AVM) from training speakers using speaker-adaptive training (SAT)
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: HMM sets λ₁ … λ₄ interpolated into λ′ with ratios I(λ′, λ₁) … I(λ′, λ₄)]

λ: HMM set; I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
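The core of interpolation, mixing model parameters of representative sets with ratios that sum to one, can be sketched on Gaussian means. The two "speakers" and the ratios below are toy assumptions:

```python
import numpy as np

# Sketch: interpolating between representative models by mixing their
# Gaussian mean vectors with ratios I(λ', λ_k) that sum to one.
means = {
    "speaker1": np.array([1.0, 2.0]),   # toy mean vectors (assumption)
    "speaker2": np.array([3.0, 0.0]),
}
ratios = {"speaker1": 0.25, "speaker2": 0.75}   # interpolation ratios

mu_interp = sum(ratios[k] * means[k] for k in means)   # mean of λ'
```

Varying the ratios moves the synthesized voice continuously between the representative voices, which is why no adaptation data is needed.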

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation:
  Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]
• Spectral envelope extraction:
  Harmonic effects often cause problems
  [Figure: power spectrum in dB over 0–8 kHz]
• Phase:
  Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
  Statistics do not vary within an HMM state
• Conditional independence assumption:
  State output probability depends only on the current state
• Weak duration modeling:
  State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
  [Figure: natural vs. generated spectrograms over 0–8 kHz]
• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training:
  Learn relationship between linguistic & acoustic features
• Synthesis:
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g. phone identity, POS, stress, # of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees partitioning the acoustic space with yes/no context questions]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: linguistic features x → hidden layers h₁, h₂, h₃ → acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

[Figure: TEXT → text analysis → input feature extraction (binary & numeric linguistic features, duration feature, frame-position feature, for frames 1 … T; duration prediction feeds the frame-level inputs) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English, female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over 500 frames; natural speech vs. HMM (α = 1) vs. DNN (4 × 512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α) | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | −9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | −5.1
12.7 (1) | 36.6 (4 × 1024) | 50.7 | < 10⁻⁶ | −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples vs. NN prediction in the (y₁, y₂) space]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; hidden activations feed outputs z₁ … z₆, which parameterize mixture weights w₁(x), w₂(x), means μ₁(x), μ₂(x), and variances σ₁(x), σ₂(x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

z_j = Σ_i h_i w_{ij} (inputs of the activation functions)

w₁(x) = exp(z₁) / Σ_{m=1}² exp(z_m),  w₂(x) = exp(z₂) / Σ_{m=1}² exp(z_m)
μ₁(x) = z₃,  μ₂(x) = z₄
σ₁(x) = exp(z₅),  σ₂(x) = exp(z₆)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
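The output-layer mapping of the 1-dim, 2-mix MDN above can be sketched directly; the raw activations z₁ … z₆ below are toy values, not trained outputs:

```python
import numpy as np

# Sketch of a 1-dim, 2-mixture mixture density output layer: the six raw
# network outputs z_1..z_6 become weights (softmax), means (linear) and
# standard deviations (exponential), as on the slide.
z = np.array([0.5, -0.5, 1.0, 2.0, 0.0, 0.3])   # toy activations (assumption)

w = np.exp(z[:2]) / np.sum(np.exp(z[:2]))        # mixture weights w_1, w_2
mu = z[2:4]                                      # component means mu_1, mu_2
sigma = np.exp(z[4:6])                           # component std devs sigma_1, sigma_2

def mdn_density(y):
    # p(y | x) = sum_m w_m N(y; mu_m, sigma_m^2)
    norm = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.sum(w * norm))
```

Training minimizes the negative log of this density over the data, so the network learns input-dependent weights, means, and variances instead of a single conditional mean.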

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x₁ … x_T → deep mixture density network, which outputs per-frame mixture weights w_m(x_t), means μ_m(x_t), and variances σ_m(x_t) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM, 1 mix: 3.537 ± 0.113; 2 mix: 3.397 ± 0.115
DNN, 4×1024: 3.635 ± 0.127; 5×1024: 3.681 ± 0.109; 6×1024: 3.652 ± 0.108; 7×1024: 3.637 ± 0.129
DMDN (4×1024), 1 mix: 3.654 ± 0.117; 2 mix: 3.796 ± 0.107; 4 mix: 3.766 ± 0.113; 8 mix: 3.805 ± 0.113; 16 mix: 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: input x_t → hidden layer with recurrent connections → output y_t, unrolled over frames t − 1, t, t + 1]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block; input gate (write), forget gate (reset), and output gate (read) around the memory cell c_t; each gate is a sigmoid unit over (x_t, h_{t−1}) with bias b_i, b_f, b_o; the cell input and cell output pass through tanh, producing h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
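One step of the gated cell above can be sketched with scalar toy weights (all parameter values, and the scalar simplification, are assumptions; real LSTMs use weight matrices):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Sketch of one LSTM step: input, forget, and output gates around a
# linear memory cell, following the slide's block diagram.
def lstm_step(x, h_prev, c_prev, p):
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])   # input gate (write)
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])   # forget gate (reset)
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])   # output gate (read)
    c = f * c_prev + i * np.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])
    h = o * np.tanh(c)                                      # block output
    return h, c

# Toy parameters, all set to 0.5 (assumption)
params = {k: 0.5 for k in
          ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wc", "uc", "bc")}
h, c = lstm_step(1.0, 0.0, 0.0, params)
```

Because c is updated additively (f · c_prev plus a gated input), gradients decay far more slowly than in the basic RNN, which is what gives the LSTM its long-range memory.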

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x₁ … x_T → LSTM → outputs y₁ … y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English, female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ Δ | DNN w/o Δ | LSTM w/ Δ | LSTM w/o Δ | Neutral | z | p
50.0 | 14.2 | – | – | 35.8 | 12.0 | < 10⁻¹⁰
– | – | 30.2 | 15.6 | 54.2 | 5.1 | < 10⁻⁶
15.8 | – | 34.0 | – | 50.2 | −6.2 | < 10⁻⁹
– | 28.4 | – | 33.6 | 38.0 | −1.5 | 0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Source-filter model

Excitation (pulse train for voiced speech, white noise for unvoiced) e(n) is passed through a linear time-invariant system h(n):

x(n) = h(n) ∗ e(n)

↓ Fourier transform

X(e^{jω}) = H(e^{jω}) E(e^{jω})

E(e^{jω}): source excitation part. H(e^{jω}): vocal tract resonance part, which should be defined by HMM state-output vectors, e.g., mel-cepstrum, line spectral pairs.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 11 of 79
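The convolution x(n) = h(n) ∗ e(n) can be sketched in a few lines; the impulse response and pulse period here are made-up toy values, not a real vocal-tract filter:

```python
def pulse_train(n_samples, period):
    """Unit pulses every `period` samples (voiced excitation e(n))."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def convolve(h, e):
    """Plain O(N*M) convolution: x(n) = sum_k h(k) * e(n - k)."""
    x = [0.0] * (len(h) + len(e) - 1)
    for n, hn in enumerate(h):
        for m, em in enumerate(e):
            x[n + m] += hn * em
    return x

h = [0.5 ** k for k in range(8)]      # toy impulse response h(n)
e = pulse_train(32, period=10)        # excitation e(n)
x = convolve(h, e)                    # synthetic "speech" x(n)
```

Swapping the pulse train for white noise gives the unvoiced branch of the same model; in the frequency domain the two factors become the product X = H·E shown above.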

Parametric models of speech signal

Autoregressive (AR) model:

H(z) = K / (1 − Σ_{m=1}^{M} c(m) z^{−m})

Exponential (EX) model:

H(z) = exp Σ_{m=0}^{M} c(m) z^{−m}

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]

• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
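A toy sketch of the AR case: with a Gaussian residual, the ML fit of the predictor coefficients c(m) reduces to least squares. The signal below is synthetic, generated by an exact AR(2) recursion so the fit recovers the coefficients exactly:

```python
import numpy as np

def lpc(x, order):
    """Least-squares AR(order) fit: predict x[n] from x[n-1..n-order]."""
    X = np.array([x[n - order:n][::-1] for n in range(order, len(x))])
    y = x[order:]
    c, *_ = np.linalg.lstsq(X, y, rcond=None)
    return c

# Exact AR(2) signal x[n] = 1.3 x[n-1] - 0.4 x[n-2] (stable: roots 0.8, 0.5).
true_c = np.array([1.3, -0.4])
x = np.zeros(50)
x[0], x[1] = 1.0, 1.0
for n in range(2, len(x)):
    x[n] = true_c[0] * x[n - 1] + true_c[1] * x[n - 2]

c = lpc(x, order=2)   # recovered AR coefficients
```

Real linear predictive analysis typically uses autocorrelation plus Levinson-Durbin recursion rather than a dense least-squares solve; the estimated model is the same.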

Examples of speech spectra

(Figure: log magnitude spectra, 0–5 kHz — (a) ML-based cepstral analysis, (b) linear prediction.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79

HMM-based speech synthesis [4]

(Block diagram)

Training part: SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → spectral/excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models.

Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79

Structure of state-output (observation) vectors

State-output (observation) vector o_t:

• Spectrum part: mel-cepstral coefficients c_t, ∆ mel-cepstral coefficients ∆c_t, ∆∆ mel-cepstral coefficients ∆²c_t

• Excitation part: log F0 p_t, ∆ log F0 δp_t, ∆∆ log F0 δ²p_t

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

(Figure: 3-state left-to-right HMM with initial probability π₁, transition probabilities a₁₁, a₁₂, a₂₂, a₂₃, a₃₃ and state-output distributions b₁(o_t), b₂(o_t), b₃(o_t).)

Observation sequence O = o₁, o₂, o₃, o₄, o₅, …, o_T
State sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

(Figure: the observation vector o_t is split into S = 4 streams — stream 1 (spectrum): c_t, ∆c_t, ∆²c_t; streams 2–4 (excitation): p_t, δp_t, δ²p_t — each stream s with its own distribution b_j^s(o_t^s).)

b_j(o_t) = ∏_{s=1}^{S} ( b_j^s(o_t^s) )^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
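The stream-weighted output probability is usually computed in the log domain; a small sketch with made-up per-stream Gaussian statistics (1-dim per stream for brevity):

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_output_prob(obs, means, variances, weights):
    """log b_j(o_t) = sum_s w_s * log b_j^s(o_t^s)."""
    return sum(w * log_gauss(o, m, v)
               for o, m, v, w in zip(obs, means, variances, weights))

# Stream 1: a mel-cepstrum-like value; streams 2-4: log F0 and its deltas.
obs = [0.2, 4.9, 0.01, -0.02]
means = [0.0, 5.0, 0.0, 0.0]
variances = [1.0, 0.1, 0.01, 0.01]
weights = [1.0, 1.0, 1.0, 1.0]
lp = log_output_prob(obs, means, variances, weights)
```

Setting a stream weight w_s to 0 removes that stream's contribution entirely, which is how stream weights trade off spectrum against excitation.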

Training process

Given data & labels:

1. Compute variance floor (HCompV)
2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT), then repeat 5–7
9. Estimate CD duration models from forward-backward stats (HERest)
10. Decision tree-based clustering of duration models (HHEd TB)

→ Estimated HMMs & duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• Preceding/succeeding two phonemes

• Position of current phoneme in current syllable

• # of phonemes in preceding/current/succeeding syllable

• Accent/stress of preceding/current/succeeding syllable

• Position of current syllable in current word

• # of preceding/succeeding stressed/accented syllables in phrase

• # of syllables from previous/to next stressed/accented syllable

• Guess at part of speech of preceding/current/succeeding word

• # of syllables in preceding/current/succeeding word

• Position of current word in current phrase

• # of preceding/succeeding content words in current phrase

• # of words from previous/to next content word

• # of syllables in preceding/current/succeeding phrase

→ Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

(Figure: a binary decision tree splits fullcontext models — e.g. w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n — with yes/no questions such as L=voice?, L="w"?, R=silence?, L="gy"?; the leaf nodes give the clustered, synthesized states.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

(Figure: separate decision trees for mel-cepstrum and for F0.)

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

(Figure: state i occupied from t₀ to t₁ on a time axis t = 1, …, T = 8.)

Probability to enter state i at t₀ and leave at t₁ + 1:

χ_{t₀,t₁}(i) ∝ Σ_{j≠i} α_{t₀−1}(j) a_{ji} a_{ii}^{t₁−t₀} ∏_{t=t₀}^{t₁} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t₁+1}) β_{t₁+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

(Figure: an HMM with separate decision trees for mel-cepstrum, for F0, and a decision tree for the state duration models.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

(Training/synthesis block diagram repeated from slide 14.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given HMM λ and words w:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)

ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

(Figure: the same 3-state left-to-right HMM; the observation sequence O is aligned to the best state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …, which fixes the state durations D.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

(Figure: per-state means and variances.)

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^⊤, ∆c_t^⊤]^⊤,  ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form as o = Wc:

[…, c_{t−1}^⊤, ∆c_{t−1}^⊤, c_t^⊤, ∆c_t^⊤, c_{t+1}^⊤, ∆c_{t+1}^⊤, …]^⊤
  = W […, c_{t−2}^⊤, c_{t−1}^⊤, c_t^⊤, c_{t+1}^⊤, …]^⊤

where W stacks, per frame, a static block row [⋯ 0 I 0 0 ⋯] and a delta block row [⋯ −I I 0 0 ⋯], shifted one block column per frame (I: M×M identity; each o_t has dimension 2M, each c_t dimension M).

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
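The closed-form solve above can be illustrated with a tiny numpy sketch (1-dim static feature, a single delta ∆c_t = c_t − c_{t−1}, diagonal covariances; all per-frame statistics are made-up stand-ins for μ_q̂, Σ_q̂, not real model output):

```python
import numpy as np

def build_W(T):
    """W maps static features c (T,) to stacked [c_t, dc_t] pairs (2T,)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row: c_t
        W[2 * t + 1, t] = 1.0           # delta row:  c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

def mlpg(mu, var):
    """Solve W' S^-1 W c = W' S^-1 mu (S diagonal) for the trajectory c."""
    T = len(mu) // 2
    W = build_W(T)
    P = np.diag(1.0 / var)              # S^-1 (diagonal covariance)
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)

# Step-wise static means [0, 0, 1, 1]; delta means of 0 ask for smoothness.
mu = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0])
c = mlpg(mu, np.ones(8))                # smoothed trajectory
```

With equal static and delta variances, the step-wise means come out as a smooth, monotone trajectory; making the delta variances huge removes the smoothing and recovers the static means exactly.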



Speech parameter generation algorithm [9]

(Figure: band structure of the linear system W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂ — the left-hand matrix is banded, so c = [c₁^⊤, …, c_T^⊤]^⊤ can be obtained efficiently, e.g. by Cholesky decomposition, from the state means μ_q̂₁, …, μ_q̂_T.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

(Figure: generated static and dynamic parameter trajectories c, with per-frame means and variances.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

(Training/synthesis block diagram repeated from slide 14.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

Generated excitation parameters (log F0 with V/UV) drive the excitation (pulse train / white noise) e(n); generated spectral parameters (cepstrum, LSP) define the linear time-invariant system h(n). The synthesized speech is

x(n) = h(n) ∗ e(n)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

  − Flexibility to change voice characteristics (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback

  − Quality

• Major factors for quality degradation [3]

  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

(Figure: adaptive training over the training speakers yields an average-voice model, which is then adapted to target speakers.)

• Train average voice model (AVM) from training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

(Figure: a new HMM set λ′ is obtained by interpolating between HMM sets λ₁ … λ₄ with interpolation ratios I(λ′, λ_k).)

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
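The core of interpolation — mixing the mean vectors of corresponding Gaussians across HMM sets with the ratios I(λ′, λ_k) — can be sketched in a few lines (names and values are illustrative only, not the full eigenvoice/CAT machinery):

```python
# Sketch of speaker interpolation: mean vectors of corresponding
# Gaussians in K HMM sets are mixed with interpolation ratios
# I(lambda', lambda_k). Values below are made up for illustration.

def interpolate_means(means, ratios):
    """mu' = sum_k I(lambda', lambda_k) * mu_k, elementwise."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios should sum to 1"
    dim = len(means[0])
    return [sum(r * mu[d] for r, mu in zip(ratios, means))
            for d in range(dim)]

# Hypothetical mean vectors of one Gaussian for two speakers:
mu_a, mu_b = [1.0, 2.0], [3.0, 6.0]
mu_mix = interpolate_means([mu_a, mu_b], [0.5, 0.5])  # halfway voice
```

Variances can be interpolated analogously; with ratios like [1, 0] the interpolated voice degenerates to one of the original speakers.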

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation:
  difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  (Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced).)

• Spectral envelope extraction:
  harmonic effects often cause problems
  (Figure: power spectrum in dB, 0–8 kHz.)

• Phase:
  important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic/stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of these assumptions holds for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model

  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model

  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model

  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

(Figure: spectrograms, 0–8 kHz, of natural vs. generated speech.)

• Why?

  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering

  − Mel-cepstrum
  − LSP

• Nonparametric approach

  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics

  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

  − Flexibility to change voice characteristics
    (adaptation; interpolation: eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback

  − Quality

• Major factors for quality degradation [3]

  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training
  Learn the relationship between linguistic & acoustic features

• Synthesis
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS

  − Phoneme-, syllable-, word-, phrase-, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79



HMM-based acoustic modeling for SPSS [4]

(Figure: decision trees partitioning the acoustic space with yes/no questions.)

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

(Figure: feed-forward network with hidden layers h₁, h₂, h₃ mapping linguistic features x to acoustic features y.)

• The DNN represents the conditional distribution of y given x

• The DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
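A minimal sketch of such a feed-forward mapping, with random placeholder weights and illustrative layer sizes (449 linguistic inputs as in the talk's LSTM setup; the 139-dim acoustic output and all weights are made up, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_layer(n_in, n_out):
    # Random placeholder weights; a real system would train these.
    return rng.standard_normal((n_in, n_out)) * 0.1, np.zeros(n_out)

def forward(x, hidden, output):
    h = x
    for W, b in hidden:          # hidden layers: sigmoid units
        h = sigmoid(h @ W + b)
    W, b = output                # output layer: linear (regression)
    return h @ W + b

n_linguistic, n_acoustic = 449, 139   # illustrative sizes
hidden = [init_layer(n_linguistic, 256)] + \
         [init_layer(256, 256) for _ in range(3)]
output = init_layer(256, n_acoustic)

x = rng.standard_normal(n_linguistic)  # one frame of linguistic features
y = forward(x, hidden, output)         # predicted acoustic statistics
```

In the actual framework the output is the per-frame mean (and precomputed variance) of the speech parameter vector, which then feeds the parameter generation step.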

Framework

(Figure: TEXT → text analysis → input feature extraction — binary & numeric features, duration feature, frame position feature — for frames 1 … T, with duration prediction; the features pass through the input layer, hidden layers and output layer to statistics (mean & var) of the speech parameter vector sequence — spectral features, excitation features, V/UV feature — followed by parameter generation and waveform synthesis → SPEECH.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → feature extraction integrated into acoustic modeling

• Distributed representation

  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production

  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79



Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker

Training / test data: 33,000 & 173 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: 11 categorical features, 25 numeric features

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²

HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]

DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions & numeric contexts; silence frames removed

(Figure: trajectory of the 5th mel-cepstral coefficient over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512).)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters.

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

HMM (α)     DNN (#layers × #units)   Neutral   p value    z value
15.8 (16)   38.5 (4 × 256)           45.7      < 10⁻⁶     −9.9
16.1 (4)    27.2 (4 × 512)           56.8      < 10⁻⁶     −5.1
12.7 (1)    36.6 (4 × 1024)          50.7      < 10⁻⁶     −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

(Figure: data samples from a one-to-many mapping vs. the NN prediction, which falls between the modes.)

• Unimodality

  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with an MSE loss → approximates the conditional mean

• Lack of variance

  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79



Mixture density network [26]

(Figure: a 1-dim, 2-mix mixture density network — the output layer produces GMM weights w₁(x), w₂(x), means μ₁(x), μ₂(x) and variances σ₁(x), σ₂(x) for the target y.)

With z_j = Σ_{i=1}^{4} h_i w_{ij} the inputs of the output activation functions:

• Weights → softmax activation function:
  w₁(x) = exp(z₁) / Σ_{m=1}^{2} exp(z_m),  w₂(x) = exp(z₂) / Σ_{m=1}^{2} exp(z_m)

• Means → linear activation function:
  μ₁(x) = z₃,  μ₂(x) = z₄

• Variances → exponential activation function:
  σ₁(x) = exp(z₅),  σ₂(x) = exp(z₆)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
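The 1-dim, 2-mix output layer above can be written out directly; z here stands for the six output-layer pre-activations z₁ … z₆:

```python
import math

def mdn_params(z):
    """z = [z1..z6] -> (weights, means, sigmas) of a 2-mix GMM."""
    denom = math.exp(z[0]) + math.exp(z[1])
    weights = [math.exp(z[0]) / denom,   # softmax -> weights sum to 1
               math.exp(z[1]) / denom]
    means = [z[2], z[3]]                 # linear (identity) activation
    sigmas = [math.exp(z[4]),            # exp keeps variances positive
              math.exp(z[5])]
    return weights, means, sigmas

def mdn_density(y, z):
    """p(y | x) implied by the mixture density output layer."""
    weights, means, sigmas = mdn_params(z)
    return sum(w / (math.sqrt(2 * math.pi) * s)
               * math.exp(-0.5 * ((y - m) / s) ** 2)
               for w, m, s in zip(weights, means, sigmas))
```

With z = [0, 0, −1, 1, 0, 0] the layer yields an equal-weight mixture of unit-variance Gaussians at ±1 — a bimodal p(y | x) that a plain linear output layer cannot express.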

DMDN-based SPSS [27]

(Figure: at each frame t, the DMDN maps input features x_t to GMM parameters w_m(x_t), μ_m(x_t), σ_m(x_t); the full pipeline is TEXT → text analysis → input feature extraction → duration prediction → DMDN → parameter generation → waveform synthesis → SPEECH.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN: 4–7 hidden layers, 1024 units/hidden layer; architecture: ReLU (hidden), Linear (output)

DMDN: 4 hidden layers, 1024 units/hidden layer; architecture: ReLU [28] (hidden), Mixture density (output), 1–16 mix

Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM        1 mix     3.537 ± 0.113
           2 mix     3.397 ± 0.115
DNN        4×1024    3.635 ± 0.127
           5×1024    3.681 ± 0.109
           6×1024    3.652 ± 0.108
           7×1024    3.637 ± 0.129
DMDN       1 mix     3.654 ± 0.117
(4×1024)   2 mix     3.796 ± 0.107
           4 mix     3.766 ± 0.113
           8 mix     3.805 ± 0.113
           16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features

− Fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) are used as inputs
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1}, outputs y_{t−1}, y_t, y_{t+1}, recurrent connections in the hidden layer]

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in hidden layers loops through recurrent connections → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
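The unidirectional case above can be sketched in a few lines: the hidden state is updated left to right, so each output can only see inputs up to the current frame. A minimal NumPy sketch; all names are illustrative.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy):
    """Unidirectional RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1}),
    y_t = W_hy h_t.  y_t depends only on x_1..x_t (no future context)."""
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x in xs:                         # left-to-right over the sequence
        h = np.tanh(W_xh @ x + W_hh @ h)
        ys.append(W_hy @ h)
    return ys

rng = np.random.default_rng(0)
xs = [rng.standard_normal(3) for _ in range(5)]
ys = rnn_forward(xs, rng.standard_normal((4, 3)),
                 rng.standard_normal((4, 4)), rng.standard_normal((2, 4)))
```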

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — linear memory cell c_t with input gate i_t (write), output gate (read), and forget gate (reset); each gate is a sigmoid of (x_t, h_{t−1}) plus a bias (b_i, b_f, b_o, b_c); cell input and cell output pass through tanh, producing the block output h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
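One step of the block diagram above can be sketched as follows: the gates are sigmoids of the concatenated input and previous output, and the linear memory cell is written, reset, and read through them. A hedged sketch; the parameter-dictionary names are illustrative.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: gates are sigmoids of [x, h_{t-1}] plus biases;
    the linear memory cell c_t is updated additively and read via tanh."""
    z = np.concatenate([x, h_prev])
    i = sigm(p['Wi'] @ z + p['bi'])      # input gate  (write)
    f = sigm(p['Wf'] @ z + p['bf'])      # forget gate (reset)
    o = sigm(p['Wo'] @ z + p['bo'])      # output gate (read)
    g = np.tanh(p['Wc'] @ z + p['bc'])   # candidate cell input
    c = f * c_prev + i * g               # linear memory cell update
    h = o * np.tanh(c)                   # block output
    return h, c

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
p = {k: rng.standard_normal((n_h, n_in + n_h)) for k in ('Wi', 'Wf', 'Wo', 'Wc')}
p.update({k: np.zeros(n_h) for k in ('bi', 'bf', 'bo', 'bc')})
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_h), np.zeros(n_h), p)
```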

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → input features x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), Linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference (%):

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      -6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      -1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g., adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learns the mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

Page 17: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Parametric models of speech signal

Autoregressive (AR) model:

H(z) = K / (1 − Σ_{m=0..M} c(m) z^{−m})

Exponential (EX) model:

H(z) = exp Σ_{m=0..M} c(m) z^{−m}

Estimate model parameters based on ML:

ĉ = arg max_c p(x | c)

• p(x | c): AR model → linear predictive analysis [5]

• p(x | c): EX model → (ML-based) cepstral analysis [6]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 12 of 79
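The exponential model makes the link between cepstral coefficients and the log spectrum explicit: log H(e^{jω}) = Σ_m c(m) e^{−jωm}, so the log magnitude spectrum is the real part of the DFT of the (zero-padded) cepstrum. A small NumPy sketch; the function name is illustrative.

```python
import numpy as np

def ex_log_spectrum(c, n_fft=64):
    """Log magnitude spectrum of H(z) = exp(sum_m c(m) z^-m):
    log|H| at the FFT bins is the real part of the DFT of c."""
    buf = np.zeros(n_fft)
    buf[:len(c)] = c                  # zero-pad the cepstrum
    return np.real(np.fft.fft(buf))

# A single cepstral coefficient c = [1] gives a flat log spectrum of 1.
flat = ex_log_spectrum(np.array([1.0]), n_fft=8)
```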

Examples of speech spectra

[Figure: log magnitude (dB, −20 to 80) vs. frequency (0–5 kHz) — (a) ML-based cepstral analysis, (b) linear prediction]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79

HMM-based speech synthesis [4]

[Figure: Training part — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79

Structure of state-output (observation) vectors

o_t = [spectrum part; excitation part], where:

Spectrum part: c_t (mel-cepstral coefficients), ∆c_t (∆ mel-cepstral coefficients), ∆²c_t (∆∆ mel-cepstral coefficients)

Excitation part: p_t (log F0), δp_t (∆ log F0), δ²p_t (∆∆ log F0)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM — initial probability π1, self-transitions a11, a22, a33, forward transitions a12, a23, state-output distributions b1(ot), b2(ot), b3(ot); observation sequence O = o1, o2, …, oT; state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, …]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

The observation vector o_t is split into S = 4 streams: o¹_t = (c_t, ∆c_t, ∆²c_t) for the spectrum, and o²_t = p_t, o³_t = δp_t, o⁴_t = δ²p_t for the excitation. The state-output probability is the stream-weighted product:

b_j(o_t) = Π_{s=1..S} ( b_sj(o^s_t) )^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
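In the log domain the stream-weighted product above becomes a weighted sum, which is how it is usually computed. A hedged sketch assuming diagonal-Gaussian streams; all names are illustrative.

```python
import numpy as np

def gauss_logpdf(x, mu, var):
    """Log density of a diagonal Gaussian."""
    x, mu, var = map(np.atleast_1d, (x, mu, var))
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var))

def multistream_logprob(streams, means, variances, weights):
    """log b_j(o_t) = sum_s w_s log b_sj(o_t^s): each stream is scored
    by its own distribution and combined with its stream weight w_s."""
    return sum(w * gauss_logpdf(o, m, v)
               for o, m, v, w in zip(streams, means, variances, weights))

# Stream 1: 3-dim spectrum; streams 2-4: scalar log F0 and its deltas.
lp = multistream_logprob(
    [np.zeros(3), 0.0, 0.0, 0.0],
    [np.zeros(3), 0.0, 0.0, 0.0],
    [np.ones(3), 1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0, 1.0])
```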

Training process

(data & labels)

1. Compute variance floor (HCompV)
2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
9. Estimate CD duration models from forward-backward stats (HERest)
10. Decision tree-based clustering (HHEd TB)

→ Estimated HMMs and estimated duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes

• Position of current phoneme in current syllable

• # of phonemes at {preceding, current, succeeding} syllable

• {accent, stress} of {preceding, current, succeeding} syllable

• Position of current syllable in current word

• # of {preceding, succeeding} {stressed, accented} syllables in phrase

• # of syllables {from previous, to next} {stressed, accented} syllable

• Guess at part of speech of {preceding, current, succeeding} word

• # of syllables in {preceding, current, succeeding} word

• Position of current word in current phrase

• # of {preceding, succeeding} content words in current phrase

• # of words {from previous, to next} content word

• # of syllables in {preceding, current, succeeding} phrase

→ Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Figure: binary decision tree with yes/no questions such as L=voice?, L="w"?, R=silence?, L="gy"?; leaf nodes cluster context-dependent states (e.g., w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) into synthesized states]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: state occupancy over t = 1 … T = 8 — state i entered at t0, left at t1 + 1]

Probability to enter state i at t0 and leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · Π_{t=t0..t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Figure: HMM with separate decision trees for mel-cepstrum, for F0, and for the state duration model]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Figure: Training part — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: 3-state left-to-right HMM as before — observation sequence O = o1 … oT, best state sequence Q̂ = 1, 1, 1, 1, 2, 2, 3, 3, …, state durations D̂ = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: state means and variances over time]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t⊤, ∆c_t⊤]⊤,  ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form, o = W c, with

o = […, c_{t−1}⊤, ∆c_{t−1}⊤, c_t⊤, ∆c_t⊤, c_{t+1}⊤, ∆c_{t+1}⊤, …]⊤
c = […, c_{t−1}⊤, c_t⊤, c_{t+1}⊤, …]⊤

where each frame contributes two rows to the band matrix W:

row for c_t:   [ ⋯  0   I  0  ⋯ ]
row for ∆c_t:  [ ⋯ −I   I  0  ⋯ ]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = W c

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
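The generation step can be sketched end to end for a 1-dimensional static+delta stream: build W from the windows on the previous slide (∆c_t = c_t − c_{t−1}) and solve the normal equations. A minimal NumPy sketch under those assumptions; real systems exploit the banded structure of W⊤Σ⁻¹W rather than a dense solve.

```python
import numpy as np

def generate_trajectory(mu, var, T):
    """Solve W' P W c = W' P mu for c, with P = Sigma^{-1} diagonal.
    mu, var: length-2T vectors, interleaved [c_t, dc_t] per frame,
    with dc_t = c_t - c_{t-1} (dc_0 = c_0 for the first frame)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row:  c_t
        W[2 * t + 1, t] = 1.0           # delta row:   c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    P = np.diag(1.0 / var)              # diagonal Sigma_q^{-1}
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)

# Means consistent with the trajectory c = [0, 1, 2] are recovered exactly.
mu = np.array([0.0, 0.0, 1.0, 1.0, 2.0, 1.0])
c = generate_trajectory(mu, np.ones(6), T=3)
```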


Speech parameter generation algorithm [9]

[Figure: band structure of the linear system W⊤ Σ_q⁻¹ W c = W⊤ Σ_q⁻¹ μ_q — W contains only 0, 1, and −1 window coefficients, so the system relating c1 … cT to μ_q1 … μ_qT is banded and can be solved efficiently]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: static and dynamic state means and variances, and the generated smooth trajectory c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: Training part — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: excitation e(n) — pulse train (voiced) or white noise (unvoiced), driven by the generated excitation parameters (log F0 with V/UV) — passes through a linear time-invariant system h(n) built from the generated spectral parameters (cepstrum, LSP), giving synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
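The diagram above is essentially one convolution. A toy sketch of the source-filter reconstruction; the names and the direct FIR convolution are illustrative (real systems run MLSA-style recursive filters, as listed on the next slide).

```python
import numpy as np

def pulse_noise_excitation(n, period, voiced, seed=0):
    """e(n): pulse train every `period` samples if voiced, else white noise."""
    if voiced:
        e = np.zeros(n)
        e[::period] = 1.0
        return e
    return np.random.default_rng(seed).standard_normal(n)

def synthesize(e, h):
    """x(n) = h(n) * e(n): excitation through an LTI system."""
    return np.convolve(e, h)[:len(e)]

e = pulse_noise_excitation(100, period=10, voiced=True)
x = synthesize(e, np.array([1.0, 0.5, 0.25]))   # toy impulse response
```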

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics: adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: training speakers → adaptive training → average-voice model → adaptation → target speakers]

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires only a small amount of data from the target speaker / speaking style
→ Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: new voice λ′ obtained from HMM sets λ1 … λ4 with interpolation ratios I(λ′, λ1) … I(λ′, λ4); λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
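In the simplest setting, interpolating HMM sets amounts to mixing their Gaussian mean vectors with ratios that sum to one. A sketch of that case only (actual interpolation methods also combine covariances and other parameters); all names are illustrative.

```python
import numpy as np

def interpolate_means(means, ratios):
    """mu' = sum_k I(lambda', lambda_k) * mu_k, ratios summing to 1."""
    ratios = np.asarray(ratios, dtype=float)
    assert np.isclose(ratios.sum(), 1.0), "interpolation ratios must sum to 1"
    means = np.asarray(means, dtype=float)
    return ratios @ means               # weighted mixture of mean vectors

# Halfway between two "voices": midpoint of their mean vectors.
mu = interpolate_means([[0.0, 0.0], [2.0, 4.0]], [0.5, 0.5])
```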

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse / noise excitation
Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)

[Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction
Harmonic effects often cause problems

[Figure: power (dB, 0–80) vs. frequency (0–8 kHz) spectrum showing harmonic ripple]

• Phase
Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic / stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
Statistics do not vary within an HMM state

• Conditional independence assumption
State output probability depends only on the current state

• Weak duration modeling
State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model

− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model

− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model

− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: spectrograms (0–8 kHz) of natural vs. generated speech]

• Why?

− Details of the spectral (formant) structure disappear
− Use of a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering

− Mel-cepstrum
− LSP

• Nonparametric approach

− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics

− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → Neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training
Learn the relationship between linguistic & acoustic features

• Synthesis
Map linguistic features to acoustic ones

• Linguistic features used in SPSS

− Phoneme-, syllable-, word-, phrase-, and utterance-level features
− e.g., phone identity, POS & stress of words in a phrase
− Around 50 different types, much more than in ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision tree partitioning the acoustic space with yes/no questions]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
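The mapping on the slide is a plain feed-forward pass. A minimal sketch with sigmoid hidden layers and a linear output (matching the experimental setup later in the talk); the layer sizes and names are illustrative.

```python
import numpy as np

def dnn_forward(x, layers):
    """Map a linguistic feature vector x to acoustic feature statistics:
    sigmoid hidden layers, linear output layer."""
    h = x
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden units
    W, b = layers[-1]
    return W @ h + b                             # linear output

rng = np.random.default_rng(0)
sizes = [10, 16, 16, 5]                          # linguistic -> ... -> acoustic
layers = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = dnn_forward(rng.standard_normal(10), layers)
```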

Framework

[Figure: TEXT → text analysis → input feature extraction — binary & numeric linguistic features plus duration and frame-position features for frames 1 … T (duration prediction supplies the alignment) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence — spectral features, excitation features, V/UV feature → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
→ integrates feature extraction into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than a fragmented representation
− Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database              US English female speaker
Training / test data  33,000 & 173 sentences
Sampling rate         16 kHz
Analysis window       25-ms width, 5-ms shift
Linguistic features   11 categorical features, 25 numeric features
Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology          5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture      1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing        Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts, silence frames removed

(Figure: 5-th mel-cepstrum over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512))

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

HMM (α)    DNN (layers × units)  Neutral  p value  z value
15.8 (16)  38.5 (4 × 256)        45.7     < 10⁻⁶   -9.9
16.1 (4)   27.2 (4 × 512)        56.8     < 10⁻⁶   -5.1
12.7 (1)   36.6 (4 × 1024)       50.7     < 10⁻⁶   -11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

(Figure: data samples in the (y1, y2) plane vs. the NN prediction)

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− NN trained with MSE loss → approximates the conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

(Figure: 1-dim, 2-mix MDN — the network outputs w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x) over input x)

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of activation function: z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                                μ2(x) = z4
σ1(x) = exp(z5)                           σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
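The output-layer mapping above can be sketched in a few lines of numpy. This is a 1-dim, 2-mix illustration; the hidden activations and weights are made-up values, not taken from the talk:

```python
import numpy as np

def mdn_params(z):
    """Map the six raw outputs z1..z6 of a 1-dim, 2-mix MDN to
    mixture weights (softmax), means (linear), and std devs (exp)."""
    w = np.exp(z[0:2]) / np.exp(z[0:2]).sum()   # w1(x), w2(x)
    mu = z[2:4]                                  # mu1(x), mu2(x)
    sigma = np.exp(z[4:6])                       # sigma1(x), sigma2(x)
    return w, mu, sigma

# z_j = sum_i h_i w_ij for a toy 4-unit last hidden layer
rng = np.random.default_rng(0)
h = rng.normal(size=4)                  # hypothetical hidden activations
W = rng.normal(size=(4, 6))             # hypothetical output weights
w, mu, sigma = mdn_params(h @ W)
# by construction the weights sum to 1 and the sigmas are positive,
# so the outputs always form a valid 2-component Gaussian mixture
```

The exponential on the variance outputs is what lets the network be trained with an unconstrained linear layer while still guaranteeing positive variances.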

DMDN-based SPSS [27]

(Figure: TEXT → text analysis → input feature extraction → duration prediction → DMDN outputs mixture parameters w, μ, σ for each frame input x1, x2, …, xT → parameter generation → waveform synthesis → SPEECH)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture   4–7 hidden layers, 1024 units/hidden layer
                   ReLU (hidden), linear (output)
DMDN architecture  4 hidden layers, 1024 units/hidden layer
                   ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization       AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM            1 mix    3.537 ± 0.113
               2 mix    3.397 ± 0.115
DNN            4×1024   3.635 ± 0.127
               5×1024   3.681 ± 0.109
               6×1024   3.652 ± 0.108
               7×1024   3.637 ± 0.129
DMDN (4×1024)  1 mix    3.654 ± 0.117
               2 mix    3.796 ± 0.107
               4 mix    3.766 ± 0.113
               8 mix    3.805 ± 0.113
               16 mix   3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features

− Fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) are used as inputs
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

(Figure: feedforward network mapping input x to output y vs. an RNN unrolled over time, with recurrent connections linking the hidden states at t−1, t, t+1)

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in hidden layers loops through recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from inputs
→ long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

(Figure: LSTM block — memory cell c_t with input gate i_t (write), output gate (read), and forget gate (reset); each gate is a sigmoid of x_t, h_{t−1}, and a bias (b_i, b_o, b_f), while the cell input (bias b_c) and cell output pass through tanh, producing h_t)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
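The gating structure in the figure can be written out as one time step in numpy. This is a generic LSTM cell sketch (sizes and parameter values are illustrative, not the configuration used in the talk):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: sigmoid input/forget/output gates around a
    linear memory cell, with tanh on the cell input and output."""
    z = np.concatenate([x_t, h_prev])
    i = sigmoid(p["Wi"] @ z + p["bi"])                    # input gate (write)
    f = sigmoid(p["Wf"] @ z + p["bf"])                    # forget gate (reset)
    o = sigmoid(p["Wo"] @ z + p["bo"])                    # output gate (read)
    c = f * c_prev + i * np.tanh(p["Wc"] @ z + p["bc"])   # memory cell
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
p = {k: rng.normal(size=(n_hid, n_in + n_hid)) for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({b: np.zeros(n_hid) for b in ("bi", "bf", "bo", "bc")})
h = c = np.zeros(n_hid)
for t in range(5):                       # run over a toy input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, p)
```

Because the cell update c = f·c_prev + i·(…) is additive rather than a repeated squashing, gradients can flow across many time steps, which is the point of the design.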

LSTM-based SPSS [33, 34]

(Figure: TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping frame inputs x1, x2, …, xT to outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database             US English female speaker
Train / dev data     34,632 & 100 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  DNN: 449, LSTM: 289
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                  4 hidden layers, 1024 units/hidden layer
                     ReLU (hidden), linear (output)
                     AdaDec [29] on GPU
LSTM                 1 forward LSTM layer
                     256 units, 128 projection
                     Asynchronous SGD on CPUs [35]
Postprocessing       Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

DNN w/ ∆  DNN w/o ∆  LSTM w/ ∆  LSTM w/o ∆  Neutral  z      p
50.0      14.2       –          –           35.8     12.0   < 10⁻¹⁰
–         –          30.2       15.6        54.2     5.1    < 10⁻⁶
15.8      –          34.0       –           50.2     -6.2   < 10⁻⁹
28.4      –          –          33.6        38.0     -1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g., adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learns the mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Examples of speech spectra

(Figure: log magnitude (dB) vs. frequency (0–5 kHz): (a) ML-based cepstral analysis, (b) linear prediction)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 13 of 79

HMM-based speech synthesis [4]

(Figure: training part — spectral and excitation parameters are extracted from the SPEECH DATABASE and, together with labels, used for training HMMs, yielding context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → generated excitation and spectral parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79

Structure of state-output (observation) vectors

(Figure: observation vector o_t — spectrum part: mel-cepstral coefficients c_t, ∆ mel-cepstral coefficients ∆c_t, ∆∆ mel-cepstral coefficients ∆²c_t; excitation part: log F0 p_t, ∆ log F0 δp_t, ∆∆ log F0 δ²p_t)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

(Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(o_t), b2(o_t), b3(o_t); observation sequence O = o1, o2, …, oT aligned with state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

(Figure: observation vector o_t split into four streams — streams 1–3: c_t, ∆c_t, ∆²c_t (spectrum); stream 4: p_t, δp_t, δ²p_t (excitation) — with per-stream output distributions b_j^1(o_t^1), …, b_j^4(o_t^4))

b_j(o_t) = ∏_{s=1}^{S} ( b_j^s(o_t^s) )^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79

Training process

data & labels
→ Compute variance floor (HCompV)
→ Initialize CI-HMMs by segmental k-means (HInit)       [monophone (context-independent, CI)]
→ Reestimate CI-HMMs by EM algorithm (HRest & HERest)
→ Copy CI-HMMs to CD-HMMs (HHEd CL)                     [fullcontext (context-dependent, CD)]
→ Reestimate CD-HMMs by EM algorithm (HERest)
→ Decision tree-based clustering (HHEd TB)
→ Reestimate CD-HMMs by EM algorithm (HERest)
→ Untie parameter tying structure (HHEd UT)
→ Estimated HMMs
→ Estimate CD-dur models from FB stats (HERest)
→ Decision tree-based clustering (HHEd TB)
→ Estimated dur models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes

• Position of current phoneme in current syllable

• # of phonemes at {preceding, current, succeeding} syllable

• {accent, stress} of {preceding, current, succeeding} syllable

• Position of current syllable in current word

• # of {preceding, succeeding} {stressed, accented} syllables in phrase

• # of syllables {from previous, to next} {stressed, accented} syllable

• Guess at part of speech of {preceding, current, succeeding} word

• # of syllables in {preceding, current, succeeding} word

• Position of current word in current phrase

• # of {preceding, succeeding} content words in current phrase

• # of words {from previous, to next} content word

• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

(Figure: binary decision tree — yes/no questions such as R=silence?, L="w"?, L=voice?, L="gy"? route context-dependent states (e.g., w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) to leaf nodes, from which synthesized states are drawn)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

(Figure: separate decision trees for mel-cepstrum and for F0)

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

(Figure: state occupancy over frames t = 1, …, T = 8 — entering state i at t0 and leaving at t1 + 1)

Probability to enter state i at t0 then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} a_{ii}^{t1−t0} ∏_{t=t0}^{t1} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

(Figure: HMM with decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

(Figure: system diagram as before — training part producing context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

(Figure: the 3-state left-to-right HMM as before, with observation sequence O and best state sequence Q̂ = 1, 1, 1, 1, 2, 2, 3, 3; the best state sequence assigns state durations D = 4, 10, 5)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

(Figure: state means with variance bands — ô becomes a step-wise mean vector sequence)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^⊤, ∆c_t^⊤]^⊤,    ∆c_t = c_t − c_{t−1}

(Figure: static features c_t (dimension M) and dynamic features ∆c_t stacked into o_t (dimension 2M) across frames)

The relationship between static and dynamic features can be arranged as o = Wc:

[   ⋮     ]   [ ⋯   0   I   0   0  ⋯ ] [   ⋮     ]
[ c_{t−1}  ]   [ ⋯  −I   I   0   0  ⋯ ] [ c_{t−2}  ]
[ ∆c_{t−1} ]   [ ⋯   0   0   I   0  ⋯ ] [ c_{t−1}  ]
[ c_t      ] = [ ⋯   0  −I   I   0  ⋯ ] [ c_t      ]
[ ∆c_t     ]   [ ⋯   0   0   0   I  ⋯ ] [ c_{t+1}  ]
[ c_{t+1}  ]   [ ⋯   0   0  −I   I  ⋯ ] [   ⋮     ]
[ ∆c_{t+1} ]
[   ⋮     ]
      o                  W                   c

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
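The banded structure of W is easy to build explicitly. A minimal 1-dimensional sketch (here ∆c_1 is computed against an assumed boundary value c_0 = 0, a simplification, not a choice from the talk):

```python
import numpy as np

def delta_matrix(T):
    """Build W mapping static features c = [c_1 .. c_T] to
    o = [c_1, dc_1, ..., c_T, dc_T] with dc_t = c_t - c_{t-1}."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0           # static row:  [... 0  I 0 ...]
        W[2 * t + 1, t] = 1.0       # delta row:   [... -I I 0 ...]
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

c = np.array([0.0, 1.0, 3.0, 6.0])
o = delta_matrix(4) @ c
# o interleaves statics and deltas: [0, 0, 1, 1, 3, 2, 6, 3]
```

Real systems use M-dimensional identity blocks I instead of scalar 1s and usually second-order deltas as well, but the band pattern is the same.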

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

then setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0 gives

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

(Figure: structure of the linear system W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂ — W is a banded matrix of 1 / −1 blocks, Σ_q̂^{−1} is block-diagonal with state-wise precisions, μ_q̂ stacks the state means μ_q1, μ_q2, …, μ_qT, and c stacks c1, c2, …, cT)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
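Solving this linear system for the smooth trajectory can be sketched end-to-end in numpy. This is a 1-dimensional toy with made-up step-wise state means and variances, solved densely (production systems exploit the band structure for an O(T) solve):

```python
import numpy as np

def delta_matrix(T):
    """W mapping statics c to interleaved [c_t, dc_t] (dc_1 vs. c_0 = 0)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

T = 6
mu = np.zeros(2 * T)
mu[0::2] = [0, 0, 2, 2, 2, 0]       # step-wise static means ("states")
mu[1::2] = 0.0                      # delta means
var = np.ones(2 * T)
var[1::2] = 0.25                    # tight delta variances -> smoother c

W = delta_matrix(T)
P = W.T @ np.diag(1.0 / var)        # W^T Sigma^{-1}
c = np.linalg.solve(P @ W, P @ mu)  # W^T Sigma^{-1} W c = W^T Sigma^{-1} mu
# c no longer tracks the step-wise means exactly: the delta
# constraints smooth the jumps between "states"
```

Tightening the delta variances pulls the solution toward a constant trajectory; loosening them recovers the step-wise means, which is exactly the trade-off the figure illustrates.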

Generated speech parameter trajectory

(Figure: static and dynamic means/variances with the generated smooth trajectory c)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

(Figure: system diagram as before — training part producing context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

(Figure: source-filter model — generated excitation parameters (log F0 with V/UV) produce a pulse-train / white-noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n))

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
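The source-filter picture above can be made concrete with a toy synthesizer. Here h(n) is just a short made-up FIR impulse response; a real system would drive, e.g., an MLSA filter with generated mel-cepstra:

```python
import numpy as np

fs = 16000                                     # sampling rate (Hz)
f0 = 200.0                                     # toy constant F0
n = np.arange(fs // 10)                        # 100 ms of samples

period = int(fs / f0)
voiced = np.zeros(n.size)
voiced[::period] = 1.0                         # pulse train (voiced source)
unvoiced = np.random.default_rng(0).normal(size=n.size)  # white noise

vuv = n < n.size // 2                          # toy V/UV decision: first half voiced
e = np.where(vuv, voiced, unvoiced)            # switching excitation e(n)

h = np.array([1.0, 0.8, 0.3, 0.1])             # hypothetical LTI impulse response h(n)
x = np.convolve(e, h)[: n.size]                # x(n) = h(n) * e(n)
```

The hard V/UV switch is exactly the simplification the "Vocoding issues" slide criticizes: mixed voiced/unvoiced sounds such as voiced fricatives need an excitation that blends both sources.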

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics: adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

(Figure: average-voice model obtained from training speakers via adaptive training, then adapted to target speakers)

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
→ Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

(Figure: new HMM set λ′ obtained by interpolating representative HMM sets λ1–λ4 with interpolation ratios I(λ′, λk))

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
Difficult to model mixes of V/UV sounds (e.g., voiced fricatives)

(Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced))

• Spectral envelope extraction
Harmonic effects often cause problems

(Figure: power (dB) vs. frequency (0–8 kHz) showing harmonic ripple)

• Phase
Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic / stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
Statistics do not vary within an HMM state

• Conditional independence assumption
State output probability depends only on the current state

• Weak duration modeling
State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model

− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model

− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model

− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

(Figure: natural vs. generated spectra, 0–8 kHz)

• Why?

− Details of the spectral (formant) structure disappear
− Use of a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering

− Mel-cepstrum
− LSP

• Nonparametric approach

− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics

− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training: learn relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS

− Phoneme, syllable, word, phrase, utterance-level features
− e.g. phone identity, POS, stress of words in a phrase
− Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree over the acoustic space, with yes/no context questions at each node]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents conditional distribution of y given x

• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
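The frame-level linguistic → acoustic mapping above can be sketched as a plain feedforward pass. The layer sizes and random weights below are toy assumptions for illustration, not the configuration used in the talk.

```python
import numpy as np

# Toy feedforward DNN: linguistic features x -> hidden layers h1..h3 -> acoustic features y.
rng = np.random.default_rng(0)
sizes = [10, 8, 8, 8, 4]  # assumed toy dimensions: x, h1, h2, h3, y
Ws = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[1:], sizes[:-1])]
bs = [np.zeros(m) for m in sizes[1:]]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x):
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = sigmoid(W @ h + b)      # non-linear hidden layers
    return Ws[-1] @ h + bs[-1]      # linear output layer -> acoustic features

y = forward(rng.standard_normal(10))
```

In the actual system each frame's input vector also carries duration and frame-position features, and the outputs are treated as the mean of the acoustic feature distribution.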

Framework

[Figure: DNN-based framework: TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features, for frames 1…T) → input/hidden/output layers → statistics (mean & variance) of speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction provides the frame alignment]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
  → integrates feature extraction into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker

Training / test data: 33,000 & 173 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: 11 categorical features, 25 numeric features

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²

HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]

DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts, silence frames removed

[Figure: 5-th mel-cepstrum trajectories over ~500 frames: natural speech, HMM (α = 1), DNN (4 × 512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

HMM (α)   | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256)       | 45.7    | < 10^-6 | −9.9
16.1 (4)  | 27.2 (4 × 512)       | 56.8    | < 10^-6 | −5.1
12.7 (1)  | 36.6 (4 × 1024)      | 50.7    | < 10^-6 | −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples from a one-to-many mapping vs. the NN prediction (conditional mean) in (y1, y2) space]

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− NN trained by MSE loss → approximates conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: network whose output layer emits mixture weights w1(x), w2(x), means µ1(x), µ2(x), and variances σ1(x), σ2(x) of a GMM over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of activation function (1-dim, 2-mix MDN):

z_j = \sum_{i=1}^{4} h_i w_{ij}

w_1(x) = \exp(z_1) / \sum_{m=1}^{2} \exp(z_m)    w_2(x) = \exp(z_2) / \sum_{m=1}^{2} \exp(z_m)
µ_1(x) = z_3                                     µ_2(x) = z_4
σ_1(x) = \exp(z_5)                               σ_2(x) = \exp(z_6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
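The activation-function mapping above can be sketched directly for the 1-dim, 2-mix case. The activation vector z below is an arbitrary example, not trained network output.

```python
import numpy as np

# Mixture density output layer for a 1-dim, 2-mix MDN: six activations z
# are mapped to mixture weights, means, and variances.
def mdn_params(z):
    """z[0:2] -> weights (softmax), z[2:4] -> means (linear),
    z[4:6] -> variances (exponential)."""
    w = np.exp(z[0:2]) / np.sum(np.exp(z[0:2]))  # softmax -> weights sum to 1
    mu = z[2:4]                                   # linear activation
    sigma2 = np.exp(z[4:6])                       # exp keeps variances positive
    return w, mu, sigma2

z = np.array([0.5, -0.5, 1.0, -1.0, 0.0, 0.2])   # example activations
w, mu, sigma2 = mdn_params(z)
```

The softmax and exponential activations guarantee a valid GMM (weights on the simplex, variances positive) regardless of the raw activation values.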

DMDN-based SPSS [27]

[Figure: DMDN-based framework: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x_1, x_2, …, x_T → deep network with mixture density output layer emitting w_m(x_t), µ_m(x_t), σ_m(x_t) for each frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)

DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix

Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113, 2 mix 3.397 ± 0.115

DNN: 4×1024 3.635 ± 0.127, 5×1024 3.681 ± 0.109, 6×1024 3.652 ± 0.108, 7×1024 3.637 ± 0.129

DMDN (4×1024): 1 mix 3.654 ± 0.117, 2 mix 3.796 ± 0.107, 4 mix 3.766 ± 0.113, 8 mix 3.805 ± 0.113, 16 mix 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features

− Fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) are used as inputs
− Difficult to incorporate long time span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled over time: inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1} via recurrent connections]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in hidden layers loops through recurrent connections
  → quickly decays over time
− Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block with memory cell c_t; input gate (write), forget gate (reset), and output gate (read), each fed by x_t and h_{t−1} through sigmoid units, with tanh input/output squashing; block output h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
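One step of the gated memory cell above can be sketched with scalar weights. The parameter values below are arbitrary assumptions for illustration, not trained weights.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Single-unit LSTM step: the linear memory cell c_t is gated by the
# input gate (write), forget gate (reset), and output gate (read).
def lstm_step(x, h_prev, c_prev, p):
    i = sigmoid(p['wi'] * x + p['ui'] * h_prev + p['bi'])  # input gate (write)
    f = sigmoid(p['wf'] * x + p['uf'] * h_prev + p['bf'])  # forget gate (reset)
    o = sigmoid(p['wo'] * x + p['uo'] * h_prev + p['bo'])  # output gate (read)
    g = np.tanh(p['wc'] * x + p['uc'] * h_prev + p['bc'])  # candidate value
    c = f * c_prev + i * g        # linear memory cell update
    h = o * np.tanh(c)            # block output
    return h, c

# arbitrary illustrative parameters
p = {k: 0.5 for k in ['wi', 'ui', 'bi', 'wf', 'uf', 'bf',
                      'wo', 'uo', 'bo', 'wc', 'uc', 'bc']}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:        # a short input sequence
    h, c = lstm_step(x, h, c, p)
```

Because the cell update is additive (f·c + i·g) rather than a repeated squashing, gradients decay far more slowly than in a basic RNN.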

LSTM-based SPSS [33 34]

[Figure: LSTM-based framework: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x_1, x_2, …, x_T → LSTM → outputs y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker

Training / dev set data: 34,632 & 100 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: DNN 449, LSTM 289

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)

DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU

LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z    | p
50.0     | 14.2      | –         | –          | 35.8    | 12.0 | < 10^-10
–        | –         | 30.2      | 15.6       | 54.2    | 5.1  | < 10^-6
15.8     | –         | 34.0      | –          | 50.2    | −6.2 | < 10^-9
28.4     | –         | –         | 33.6       | 38.0    | −1.5 | 0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g. adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learn mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


HMM-based speech synthesis [4]

[Figure: HMM-based speech synthesis system. Training part: SPEECH DATABASE → spectral & excitation parameter extraction → train context-dependent HMMs & state duration models with labels. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 14 of 79

Structure of state-output (observation) vectors

[Figure: structure of state-output (observation) vector o_t. Spectrum part: mel-cepstral coefficients c_t, ∆ mel-cepstral coefficients ∆c_t, ∆∆ mel-cepstral coefficients ∆²c_t. Excitation part: log F0 p_t, ∆ log F0 δp_t, ∆∆ log F0 δ²p_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π_1, transition probabilities a_11, a_12, a_22, a_23, a_33, and state-output distributions b_1(o_t), b_2(o_t), b_3(o_t); observation sequence O = o_1, o_2, …, o_T aligned with state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

[Figure: state-output vector o_t divided into streams: stream 1, o_t^1 = (c_t, ∆c_t, ∆²c_t), for the spectrum; streams 2–4, o_t^2 = p_t, o_t^3 = δp_t, o_t^4 = δ²p_t, for the excitation]

b_j(o_t) = \prod_{s=1}^{S} \left( b_{sj}(o_t^s) \right)^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
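The weighted product above can be sketched directly. The per-stream probabilities and weights below are illustrative assumptions, not trained values.

```python
import numpy as np

# Multi-stream state-output probability: b_j(o_t) = prod_s (b_sj(o_t^s))^{w_s},
# one probability density value and one weight per stream.
def multi_stream_prob(stream_probs, stream_weights):
    return float(np.prod([p ** w for p, w in zip(stream_probs, stream_weights)]))

# 4 streams: spectrum (c, Δc, Δ²c) and three excitation streams (example values)
b = multi_stream_prob([0.8, 0.5, 0.5, 0.5], [1.0, 1.0, 1.0, 1.0])
```

Setting a stream weight to 0 effectively removes that stream from the likelihood, which is how stream weights are commonly used to balance spectrum against F0.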

Training process

Given data & labels:

1. Compute variance floor (HCompV)
2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
9. Estimate CD duration models from forward-backward stats (HERest)
10. Decision tree-based clustering of duration models (HHEd TB)

→ estimated HMMs & duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• Preceding, succeeding two phonemes

• Position of current phoneme in current syllable

• # of phonemes at preceding, current, succeeding syllable

• Accent, stress of preceding, current, succeeding syllable

• Position of current syllable in current word

• # of preceding, succeeding stressed/accented syllables in phrase

• # of syllables from previous/to next stressed/accented syllable

• Guess at part of speech of preceding, current, succeeding word

• # of syllables in preceding, current, succeeding word

• Position of current word in current phrase

• # of preceding, succeeding content words in current phrase

• # of words from previous/to next content word

• # of syllables in preceding, current, succeeding phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Figure: decision tree-based state clustering: context questions (L=voice?, L="w"?, R=silence?, L="gy"?) route full-context models such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n to leaf nodes of synthesized states]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: occupancy of state i from frame t_0 to t_1 within an utterance of T = 8 frames]

Probability to enter state i at t_0 and leave at t_1 + 1:

\chi_{t_0,t_1}(i) \propto \sum_{j \neq i} \alpha_{t_0-1}(j)\, a_{ji}\, a_{ii}^{t_1-t_0} \prod_{t=t_0}^{t_1} b_i(o_t) \sum_{k \neq i} a_{ik}\, b_k(o_{t_1+1})\, \beta_{t_1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Figure: stream-dependent tree-based clustering: decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models, all attached to the HMM]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Figure: HMM-based speech synthesis system (same diagram as before): training part with spectral & excitation parameter extraction and HMM training; synthesis part with text analysis, parameter generation from HMMs, excitation generation, and synthesis filter]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM λ and words w:

\hat{o} = \arg\max_o p(o \mid w, \lambda)
        = \arg\max_o \sum_{\forall q} p(o, q \mid w, \lambda)
        \approx \arg\max_o \max_q p(o, q \mid w, \lambda)
        = \arg\max_o \max_q p(o \mid q, \lambda)\, P(q \mid w, \lambda)

Determine the best state sequence and outputs sequentially:

\hat{q} = \arg\max_q P(q \mid w, \lambda)

\hat{o} = \arg\max_o p(o \mid \hat{q}, \lambda)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: the 3-state left-to-right HMM from before, with observation sequence O = o_1 … o_T, best state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3, and state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: per-state means and variances]

\hat{o} becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = \left[ c_t^\top, \Delta c_t^\top \right]^\top, \quad \Delta c_t = c_t - c_{t-1}

(for M-dimensional static features c_t, each o_t is 2M-dimensional)

The relationship between static and dynamic features can be arranged as o = Wc:

\begin{bmatrix} \vdots \\ c_{t-1} \\ \Delta c_{t-1} \\ c_t \\ \Delta c_t \\ c_{t+1} \\ \Delta c_{t+1} \\ \vdots \end{bmatrix}
=
\begin{bmatrix}
\cdots & \cdots & \cdots & \cdots & \cdots \\
\cdots & 0 & I & 0 & 0 & \cdots \\
\cdots & -I & I & 0 & 0 & \cdots \\
\cdots & 0 & 0 & I & 0 & \cdots \\
\cdots & 0 & -I & I & 0 & \cdots \\
\cdots & 0 & 0 & 0 & I & \cdots \\
\cdots & 0 & 0 & -I & I & \cdots \\
\cdots & \cdots & \cdots & \cdots & \cdots
\end{bmatrix}
\begin{bmatrix} \vdots \\ c_{t-2} \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{bmatrix}

where W is a sparse band matrix of identity blocks.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
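The window matrix W above can be sketched for 1-dimensional static features. The boundary convention Δc_1 = c_1 (no previous frame) is an assumption made for the first frame.

```python
import numpy as np

# Build W so that o = W c stacks [c_t, Δc_t] per frame, with the slide's
# backward difference Δc_t = c_t - c_{t-1} (assumed Δc_1 = c_1 at the boundary).
def window_matrix(T):
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row:  c_t
        W[2 * t + 1, t] = 1.0           # delta row:  +c_t
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0  # delta row:  -c_{t-1}
    return W

c = np.array([0.0, 1.0, 3.0, 2.0])      # example static trajectory
o = window_matrix(4) @ c                # o = [c_1, Δc_1, c_2, Δc_2, ...]
```

For M-dimensional features each scalar entry becomes an M×M identity block, matching the I / −I blocks in the matrix above.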

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

\hat{o} = \arg\max_o p(o \mid \hat{q}, \lambda) \quad \text{subject to} \quad o = Wc

If the state-output distribution is a single Gaussian:

p(o \mid \hat{q}, \lambda) = \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})

By setting \partial \log \mathcal{N}(Wc; \mu_{\hat{q}}, \Sigma_{\hat{q}}) / \partial c = 0:

W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

[Figure: band structure of the linear system W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}: the banded matrix W^\top \Sigma_{\hat{q}}^{-1} W multiplies the static trajectory c = (c_1, …, c_T), with the right-hand side built from the state means \mu_{\hat{q}_1}, …, \mu_{\hat{q}_T}]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
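The banded system above can be solved directly for a toy case. The 1-dim features, diagonal covariances, and step-wise means below are assumptions for illustration; a real implementation exploits the band structure (e.g. a banded Cholesky solver) instead of a dense solve.

```python
import numpy as np

def window_matrix(T):
    """W stacking [c_t, Δc_t] per frame, Δc_t = c_t - c_{t-1} (Δc_1 = c_1 assumed)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

def mlpg(mu, var, W):
    """Solve WᵀΣ⁻¹W c = WᵀΣ⁻¹μ for the smooth static trajectory c."""
    Sinv = np.diag(1.0 / var)        # diagonal Σ⁻¹
    A = W.T @ Sinv @ W               # WᵀΣ⁻¹W (banded; dense here for clarity)
    b = W.T @ Sinv @ mu              # WᵀΣ⁻¹μ
    return np.linalg.solve(A, b)

T = 5
# step-wise state means interleaved as [static, delta] per frame (toy values)
mu = np.array([0., 0., 1., 1., 1., 0., 0., -1., 0., 0.])
var = np.ones(2 * T)
c = mlpg(mu, var, window_matrix(T))
```

The delta rows of W couple neighboring frames, so the solution c is smooth rather than the step-wise mean sequence of the unconstrained case.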

Generated speech parameter trajectory

[Figure: generated speech parameter trajectory c: static and dynamic means/variances, with the smooth static trajectory passing through them]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: HMM-based speech synthesis system (same diagram as before), highlighting the synthesis part: parameter generation from HMMs → excitation generation → synthesis filter]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: source-filter waveform reconstruction: excitation e(n) (pulse train / white noise, driven by the generated excitation parameters: log F0 with V/UV) passed through a linear time-invariant system h(n) built from the generated spectral parameters (cepstrum, LSP); synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics
  Adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: average-voice model built from training speakers via adaptive training, then adapted to target speakers]

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires small data from target speaker/speaking style
  → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: a target HMM set λ′ is placed among representative HMM sets λ1 ... λ4 with interpolation ratios I(λ′, λ1) ... I(λ′, λ4); λ: HMM set, I(λ′, λ): interpolation ratio.]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice / CAT / multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
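In its simplest form, model interpolation mixes the Gaussian parameters of matched states across the representative HMM sets; a minimal sketch under that assumption (the actual systems interpolate full HMM sets, and other variance-combination rules exist):

```python
import numpy as np

def interpolate_gaussians(means, variances, ratios):
    """Linearly interpolate matched Gaussian state distributions.

    means, variances: one (dim,) array per representative HMM set.
    ratios: interpolation ratios I(lambda', lambda_k), summing to 1.
    """
    ratios = np.asarray(ratios)
    assert np.isclose(ratios.sum(), 1.0)
    mu = sum(r * m for r, m in zip(ratios, means))
    var = sum(r * v for r, v in zip(ratios, variances))  # one common choice
    return mu, var

mu, var = interpolate_gaussians(
    means=[np.array([0.0, 1.0]), np.array([2.0, 3.0])],
    variances=[np.array([1.0, 1.0]), np.array([2.0, 2.0])],
    ratios=[0.25, 0.75])
```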

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse / noise excitation
  Difficult to model mix of V/UV sounds (e.g. voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced).]

• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz.]

• Phase
  Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic / stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
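The weak-duration point is easy to verify numerically: with self-transition probability a_ii, the probability of occupying a state for exactly d frames is geometric, a_ii^(d−1)(1 − a_ii), so it always peaks at d = 1 — unlike real phone durations. A minimal sketch:

```python
def hmm_duration_pmf(a_ii, d):
    """P(stay exactly d frames in a state) under a standard HMM."""
    return a_ii ** (d - 1) * (1.0 - a_ii)

# Geometric decay: the mode is always d = 1, whatever a_ii is.
pmf = [hmm_duration_pmf(0.9, d) for d in range(1, 6)]
```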

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model
− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: natural vs. generated spectra over 0–8 kHz.]

• Why?
− Details of spectral (formant) structure disappear
− Use of better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
− Mel-cepstrum
− LSP

• Nonparametric approach
− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics
− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics
  Adaptation / Interpolation / eigenvoice / CAT / multiple regression
− Small footprint
− Robustness

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → Neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training
  Learn relationship between linguistic & acoustic features

• Synthesis
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS
− Phoneme, syllable, word, phrase, utterance-level features
− e.g. phone identity, POS, stress of words in a phrase
− Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: a decision tree of yes/no context questions partitioning the acoustic space.]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y.]

• DNN represents conditional distribution of y given x

• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
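A frame-level forward pass of such a network can be sketched as follows. This is a toy illustration: the sigmoid hidden units and layer sizes follow the experimental-setup slide later in this section, but the weights are random and the input/output dimensions here are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def init_dnn(sizes):
    """Random layers for sizes, e.g. [in_dim, 512, 512, 512, 512, out_dim]."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(layers, x):
    """Map one frame of linguistic features x to acoustic statistics y."""
    h = x
    for W, b in layers[:-1]:
        h = sigmoid(h @ W + b)      # sigmoid hidden layers
    W, b = layers[-1]
    return h @ W + b                # linear output layer (regression)

x = rng.standard_normal(300)        # binary + numeric linguistic features
layers = init_dnn([300, 512, 512, 512, 512, 127])
y = forward(layers, x)              # acoustic features (static + dynamic, V/UV)
```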

Framework

[Figure: TEXT → text analysis → input feature extraction → duration prediction; input features (binary & numeric, plus duration and frame position features) at frames 1 ... T are fed to the network (input layer, hidden layers, output layer), which outputs statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature); these drive parameter generation and waveform synthesis → SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
  → integrated feature extraction into acoustic modeling

• Distributed representation
− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithm

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum over frames 0–500 for natural speech, HMM (α = 1), and DNN (4×512).]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar # of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples (y1, y2) forming a one-to-many mapping; the NN prediction falls between the modes.]

• Unimodality
− Humans can speak in different ways → one-to-many mapping
− NN trained with MSE loss → approximates conditional mean

• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; the output layer produces mixture weights w1(x), w2(x), means μ1(x), μ2(x), and variances σ1(x), σ2(x).]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of activation function:

  z_j = Σ_{i=1}^{4} h_i w_{ij}

  w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)    w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
  μ_1(x) = z_3                                μ_2(x) = z_4
  σ_1(x) = exp(z_5)                           σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
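The output-layer activations above translate directly into code; a minimal sketch for one output dimension (the layout of z — weight logits, then means, then log-std activations — is a convention chosen here for illustration):

```python
import numpy as np

def mdn_output_layer(z, n_mix):
    """Split raw activations z into GMM parameters for one output dim.

    z has length 3*n_mix: [weight logits | means | log-std activations].
    """
    logits, mu, s = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:]
    w = np.exp(logits - logits.max())
    w /= w.sum()                    # softmax -> mixture weights
    sigma = np.exp(s)               # exponential -> positive std devs
    return w, mu, sigma

w, mu, sigma = mdn_output_layer(
    np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0]), n_mix=2)
```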

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction; for each frame t, the network maps x_t to mixture parameters w_k(x_t), μ_k(x_t), σ_k(x_t); the resulting sequence of densities drives parameter generation and waveform synthesis → SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM       1 mix    3.537 ± 0.113
          2 mix    3.397 ± 0.115
DNN       4×1024   3.635 ± 0.127
          5×1024   3.681 ± 0.109
          6×1024   3.652 ± 0.108
          7×1024   3.637 ± 0.129
DMDN      1 mix    3.654 ± 0.117
(4×1024)  2 mix    3.796 ± 0.107
          4 mix    3.766 ± 0.113
          8 mix    3.805 ± 0.113
          16 mix   3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
− Fixed number of preceding / succeeding contexts (e.g. ±2 phonemes / syllable stress) are used as inputs
− Difficult to incorporate long time span contextual effect

• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: network with recurrent connections from output y back to input x; unrolled in time it maps x_{t−1}, x_t, x_{t+1} to y_{t−1}, y_t, y_{t+1}.]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections
  → quickly decays over time
− Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block with memory cell c_t and output h_t; the input gate (write), forget gate (reset), and output gate (read) are each a sigmoid of (x_t, h_{t−1}) plus a bias, with tanh squashing on the cell input and output.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
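The gating described above can be written out directly; a minimal numpy sketch of one LSTM step (the standard formulation, without the peephole connections some variants add, and with random weights for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM.

    W maps [x; h_prev] to the stacked [input, forget, output, candidate]
    pre-activations; b is the matching bias vector.
    """
    n = len(c_prev)
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[:n])               # input gate: write
    f = sigmoid(z[n:2 * n])          # forget gate: reset
    o = sigmoid(z[2 * n:3 * n])      # output gate: read
    g = np.tanh(z[3 * n:])           # candidate cell input
    c = f * c_prev + i * g           # linear memory cell update
    h = o * np.tanh(c)               # block output
    return h, c

rng = np.random.default_rng(0)
x, h, c = rng.standard_normal(5), np.zeros(3), np.zeros(3)
W, b = rng.standard_normal((12, 8)) * 0.1, np.zeros(12)
h, c = lstm_cell(x, h, c, W, b)
```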

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction; a recurrent network maps the input sequence x_1, x_2, ..., x_T to acoustic features y_1, y_2, ..., y_T, followed by parameter generation and waveform synthesis → SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

      DNN                LSTM             Stats
w/ ∆    w/o ∆      w/ ∆    w/o ∆      Neutral    z       p
50.0    14.2        –       –          35.8     12.0   < 10⁻¹⁰
 –       –         30.2    15.6        54.2      5.1   < 10⁻⁶
15.8     –         34.0     –          50.2     −6.2   < 10⁻⁹
28.4     –          –      33.6        38.0     −1.5    0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
− Flexible (e.g. adaptation, interpolation)
− Improvements
  Vocoding / Acoustic modeling / Oversmoothing compensation

• NN-based SPSS
− Learn mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Structure of state-output (observation) vectors

[Figure: observation vector o_t stacks a spectrum part — mel-cepstral coefficients c_t with ∆c_t and ∆²c_t — and an excitation part — log F0 p_t with δp_t and δ²p_t.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 15 of 79

Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π1, transitions a11, a12, a22, a23, a33, and state-output distributions b1(ot), b2(ot), b3(ot); observation sequence O = o1 o2 o3 o4 o5 ... oT aligned with state sequence Q = 1 1 1 1 2 2 3 3 ....]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

[Figure: the observation vector is split into streams o_t = (o¹t, o²t, o³t, o⁴t): stream 1 holds the spectrum part (c_t, ∆c_t, ∆²c_t); streams 2–4 hold the excitation part (p_t, δp_t, δ²p_t); each stream s has its own output distribution b_sj(o^s_t).]

  b_j(o_t) = ∏_{s=1}^{S} ( b_sj(o^s_t) )^{w_s}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79
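In the log domain, the product of stream likelihoods above is just a weighted sum; a sketch with diagonal-Gaussian streams (two toy streams with made-up statistics, and stream weights w_s set to 1):

```python
import math

def gaussian_logpdf(o, mean, var):
    """Diagonal Gaussian log-density."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))

def multistream_logprob(streams, weights):
    """log b_j(o_t) = sum_s w_s * log b_sj(o_t^s)."""
    return sum(w * gaussian_logpdf(o, m, v)
               for w, (o, m, v) in zip(weights, streams))

streams = [([0.1, -0.2], [0.0, 0.0], [1.0, 1.0]),   # spectrum stream
           ([5.0], [5.2], [0.5])]                   # log F0 stream
logp = multistream_logprob(streams, weights=[1.0, 1.0])
```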

Training process

data & labels
1. Compute variance floor (HCompV)
2. Initialize CI-HMMs by segmental k-means (HInit)   [monophone (context-independent, CI)]
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to CD-HMMs (HHEd CL)   [fullcontext (context-dependent, CD)]
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
→ Estimated HMMs
9. Estimate CD-dur models from FB stats (HERest)
10. Decision tree-based clustering (HHEd TB)
→ Estimated dur models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes

• Position of current phoneme in current syllable

• # of phonemes at {preceding, current, succeeding} syllable

• {accent, stress} of {preceding, current, succeeding} syllable

• Position of current syllable in current word

• # of {preceding, succeeding} {stressed, accented} syllables in phrase

• # of syllables {from previous, to next} {stressed, accented} syllable

• Guess at part of speech of {preceding, current, succeeding} word

• # of syllables in {preceding, current, succeeding} word

• Position of current word in current phrase

• # of {preceding, succeeding} content words in current phrase

• # of words {from previous, to next} content word

• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Figure: yes/no context questions such as L=voice?, L=w?, R=silence?, L=gy? split fullcontext states (w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n, ...) into leaf nodes; synthesized states are drawn from the leaves.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79
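Tree traversal itself is just a chain of yes/no context questions; a minimal sketch (the question names here, like the left-phone test, are illustrative, not the actual question set):

```python
# A node is either a leaf id (str) or (question, yes_subtree, no_subtree).
def descend(tree, context):
    """Walk a decision tree until a leaf (clustered state) is reached."""
    while not isinstance(tree, str):
        question, yes_branch, no_branch = tree
        tree = yes_branch if question(context) else no_branch
    return tree

tree = (lambda c: c["L"] == "voiced",           # "is left phone voiced?"
        (lambda c: c["R"] == "sil", "leaf_A", "leaf_B"),
        "leaf_C")
leaf = descend(tree, {"L": "voiced", "R": "sil"})
```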

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0.]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: trellis over states i and frames t = 1, ..., T = 8, entering state i at t0 and leaving at t1 + 1.]

Probability to enter state i at t0, then leave at t1 + 1:

  χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} a_{ii}^{t1−t0} ∏_{t=t0}^{t1} b_i(o_t) Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Figure: the HMM has decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Figure: system overview. Training part: spectral and excitation parameters are extracted from the SPEECH DATABASE and, together with labels, used for training HMMs, giving context-dependent HMMs & state duration models. Synthesis part: text analysis turns TEXT into labels; parameter generation from HMMs produces spectral and excitation parameters; excitation generation and the synthesis filter output SYNTHESIZED SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

  ô = arg max_o p(o | w, λ)
    = arg max_o Σ_{∀q} p(o, q | w, λ)
    ≈ arg max_o max_q p(o, q | w, λ)
    = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

  q̂ = arg max_q P(q | w, λ)
  ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: the same 3-state left-to-right HMM (π1; a11, a12, a22, a23, a33; b1(ot), b2(ot), b3(ot)); observation sequence O = o1 o2 ... oT, best state sequence Q = 1 1 1 1 2 2 3 3 ..., state durations D = 4, 10, 5.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: per-state mean and variance; ô becomes a step-wise mean vector sequence.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

  o_t = [c_tᵀ, ∆c_tᵀ]ᵀ,  ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged in matrix form, o = Wc:

  [ ..., c_{t−1}, ∆c_{t−1}, c_t, ∆c_t, c_{t+1}, ∆c_{t+1}, ... ]ᵀ = W [ ..., c_{t−2}, c_{t−1}, c_t, c_{t+1}, ... ]ᵀ

where each static row of W is [ ... 0 I 0 ... ] and each delta row is [ ... −I I 0 ... ].

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
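The window matrix W can be built directly. A minimal numpy sketch using the simple first-order delta Δc_t = c_t − c_{t−1} from the slide (function name and sizes are illustrative):

```python
import numpy as np

def delta_matrix(T, M):
    """Build W mapping static features c = [c_1; ...; c_T] (each c_t of
    dimension M) to o = [c_1; dc_1; ...; c_T; dc_T], with the simple
    first-order delta dc_t = c_t - c_{t-1} (taking c_0 = 0)."""
    W = np.zeros((2 * T * M, T * M))
    I = np.eye(M)
    for t in range(T):
        W[2 * t * M:(2 * t + 1) * M, t * M:(t + 1) * M] = I           # static block
        W[(2 * t + 1) * M:(2 * t + 2) * M, t * M:(t + 1) * M] = I     # delta: +c_t
        if t > 0:
            W[(2 * t + 1) * M:(2 * t + 2) * M, (t - 1) * M:t * M] = -I  # delta: -c_{t-1}
    return W

T, M = 4, 2
c = np.arange(T * M, dtype=float)   # toy static trajectory
o = delta_matrix(T, M) @ c          # stacked statics and deltas
```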

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

$$
\hat{o} = \arg\max_{o} \; p(o \mid \hat{q}, \lambda) \quad \text{subject to} \quad o = Wc
$$

If the state-output distribution is a single Gaussian,

$$
p(o \mid \hat{q}, \lambda) = \mathcal{N}(o;\, \mu_{\hat{q}}, \Sigma_{\hat{q}})
$$

By setting $\partial \log \mathcal{N}(Wc;\, \mu_{\hat{q}}, \Sigma_{\hat{q}}) / \partial c = 0$,

$$
W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}
$$

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
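The closed-form solution above is a plain linear solve once W and the per-frame Gaussian statistics are in place. A toy 1-dimensional sketch (all numeric values illustrative, not from the slides):

```python
import numpy as np

# Toy MLPG: 1-dim static feature, T frames, first-order deltas.
# Observation layout: o = [c_1, dc_1, c_2, dc_2, ...], dc_t = c_t - c_{t-1}.
T = 5
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0               # static
    W[2 * t + 1, t] = 1.0           # delta: +c_t
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0  # delta: -c_{t-1}

# Step-wise state means [static, delta] per frame, and diagonal variances
mu = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.5, 1.0, 0.0, 1.0, 0.0])
var = np.full(2 * T, 0.1)
Sinv = np.diag(1.0 / var)

# Solve W' S^-1 W c = W' S^-1 mu for the smooth static trajectory c
A = W.T @ Sinv @ W
b = W.T @ Sinv @ mu
c = np.linalg.solve(A, b)
```

In practice W'Σ⁻¹W is banded, so a banded (Cholesky) solver is used instead of a dense one.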


Speech parameter generation algorithm [9]

[Figure: band structure of the linear system $W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}$: the left-hand matrix is banded (1 and −1 entries from the static/delta windows), so the system can be solved efficiently for $c_1, \ldots, c_T$ given the state means $\mu_{\hat{q}_1}, \ldots, \mu_{\hat{q}_T}$]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: static and dynamic components of the generated trajectory c, plotted against the state means and variances; the constrained solution is smooth rather than step-wise]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: system diagram]

Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs (with labels) → context-dependent HMMs & state duration models

Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: source-filter model] A pulse train (voiced) or white noise (unvoiced) forms the excitation e(n), controlled by the generated excitation parameters (log F0 with V/UV); it drives a linear time-invariant system h(n) built from the generated spectral parameters (cepstrum, LSP):

$$
x(n) = h(n) * e(n)
$$

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
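The excitation/filter relationship x(n) = h(n) ∗ e(n) can be sketched in a few lines; the impulse response h below is a toy stand-in for a real MLSA-style filter derived from generated spectral parameters:

```python
import numpy as np

fs = 16000
f0 = 100.0                       # voiced excitation: pulse train at F0
n = np.arange(fs // 10)          # 100 ms of samples

# Excitation: pulse train (voiced) or white noise (unvoiced)
period = int(fs / f0)
e_voiced = np.zeros(len(n))
e_voiced[::period] = 1.0
e_unvoiced = 0.1 * np.random.randn(len(n))

# Toy LTI "vocal tract" impulse response h(n); a real system would
# derive h from the generated cepstral/LSP parameters
h = np.array([1.0, 0.7, 0.3, 0.1])
x = np.convolve(e_voiced, h)[:len(n)]      # x(n) = h(n) * e(n), voiced
x_uv = np.convolve(e_unvoiced, h)[:len(n)] # unvoiced counterpart
```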

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics (adaptation, interpolation)
− Small footprint [10, 11]
− Robustness [12]
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: adaptive training of an average-voice model from training speakers, then adaptation to target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker / speaking style
→ small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: new HMM set λ′ interpolated among HMM sets λ1–λ4 with interpolation ratios I(λ′, λ)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
[Figure: excitation e(n) switching between pulse train (voiced) and white noise (unvoiced)]

• Spectral envelope extraction
Harmonic effects often cause problems
[Figure: power spectrum (dB) over 0–8 kHz showing harmonic structure]

• Phase
Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
Statistics do not vary within an HMM state
• Conditional independence assumption
State output probability depends only on the current state
• Weak duration modeling
State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
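The weak-duration point can be checked numerically: with self-transition probability a_ii, an HMM state's implicit duration distribution is geometric, P(d) = a_ii^(d−1)(1 − a_ii), so its mode is always d = 1, unlike real phone durations:

```python
a_ii = 0.9   # self-transition probability (illustrative)

# Geometric duration distribution over d = 1..50
P = [a_ii ** (d - 1) * (1 - a_ii) for d in range(1, 51)]

# The distribution is strictly decreasing: the most likely duration is
# always a single frame, regardless of a_ii
mode = max(range(len(P)), key=lambda i: P[i]) + 1
mean = sum(d * p for d, p in zip(range(1, 51), P)) / sum(P)
```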

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM
• Conditional independence assumption → graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM
• Weak duration modeling → explicit duration model
− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: natural vs. generated spectra over 0–8 kHz; the generated spectrum is oversmoothed]

• Why?
− Details of spectral (formant) structure disappear
− Use of a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
− Mel-cepstrum
− LSP
• Nonparametric approach
− Conditional parameter generation
− Discrete HMM-based speech synthesis
• Combine multiple-level statistics
− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics (adaptation, interpolation, eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness
• Drawback
− Quality
• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training
Learn the relationship between linguistic & acoustic features
• Synthesis
Map linguistic features to acoustic ones
• Linguistic features used in SPSS
− Phoneme-, syllable-, word-, phrase-, and utterance-level features
− e.g., phone identity, POS, stress of words in a phrase
− Around 50 different types, much more than in ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision tree (yes/no questions) partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
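A frame-level DNN acoustic model of this kind is a stack of affine layers with non-linearities. A forward-pass sketch; layer sizes and the random initialization are illustrative, not the exact configuration from the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Toy shapes: 425-dim linguistic input (binary + numeric features),
# 4 hidden layers of 512 units, 127-dim acoustic output
# (statics + deltas + V/UV) -- all sizes hypothetical.
dims = [425, 512, 512, 512, 512, 127]
params = [(rng.normal(0, 0.01, (din, dout)), np.zeros(dout))
          for din, dout in zip(dims[:-1], dims[1:])]

def forward(x):
    """Map one frame of linguistic features to acoustic features."""
    h = x
    for i, (Wl, bl) in enumerate(params):
        h = h @ Wl + bl
        if i < len(params) - 1:   # hidden layers non-linear, output linear
            h = relu(h)
    return h

y = forward(rng.normal(size=425))
```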

Framework

[Figure: DNN pipeline. TEXT → text analysis → input feature extraction → per-frame input features, including binary & numeric features at frames 1, ..., T (plus duration and frame-position features from duration prediction) → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence: spectral features, excitation features, V/UV feature → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture with non-linear operations
→ integrates feature extraction into acoustic modeling
• Distributed representation
− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α = 1), and DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

HMM (α) | DNN (#layers × #units) | Neutral | p value | z value
15.8 (16 mix) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | −9.9
16.1 (4 mix) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | −5.1
12.7 (1 mix) | 36.6 (4 × 1024) | 50.7 | < 10⁻⁶ | −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: one-to-many data samples in (y1, y2); the NN prediction falls between the modes]

• Unimodality
− Humans can speak the same text in different ways → one-to-many mapping
− NN trained with MSE loss → approximates the conditional mean
• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
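The conditional-mean behaviour is easy to reproduce: fit a least-squares (MSE) predictor on one-to-many data with target modes at −1 and +1, and it predicts 0, which belongs to neither mode:

```python
import numpy as np

# One-to-many data: identical input, two target modes at -1 and +1
X = np.ones((200, 1))
y = np.array([-1.0, 1.0] * 100)

# Closed-form least squares -- what MSE training converges to
w, *_ = np.linalg.lstsq(X, y, rcond=None)
prediction = float(X[0] @ w)   # the conditional mean of the two modes
```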


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; the network outputs mixture weights w1(x), w2(x), means μ1(x), μ2(x), and standard deviations σ1(x), σ2(x)]

Output-layer activation functions:
• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

With $z_j = \sum_{i} h_i w_{ij}$ the inputs to the activation functions,

$$
w_1(x) = \frac{\exp(z_1)}{\sum_{m=1}^{2} \exp(z_m)}, \qquad
w_2(x) = \frac{\exp(z_2)}{\sum_{m=1}^{2} \exp(z_m)}
$$
$$
\mu_1(x) = z_3, \quad \mu_2(x) = z_4, \qquad
\sigma_1(x) = \exp(z_5), \quad \sigma_2(x) = \exp(z_6)
$$

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
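The output-layer mapping above translates directly into code. A sketch for the 1-dim, 2-mix case, including the mixture negative log-likelihood that MDN training minimizes (function names hypothetical):

```python
import numpy as np

def mdn_params(z):
    """Map raw network outputs z = [z1..z6] of a 1-dim, 2-mix MDN to
    mixture weights (softmax), means (linear), std devs (exp)."""
    w = np.exp(z[:2]) / np.sum(np.exp(z[:2]))
    mu = z[2:4]
    sigma = np.exp(z[4:6])
    return w, mu, sigma

def mdn_nll(y, w, mu, sigma):
    """Negative log-likelihood of scalar target y under the mixture."""
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(np.sum(comp))

# Zero pre-activations -> equal weights, unit variances, means -1 and +1
w, mu, sigma = mdn_params(np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0]))
nll = mdn_nll(0.0, w, mu, sigma)
```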

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, ..., xT → DMDN predicting mixture parameters w_m(x_t), μ_m(x_t), σ_m(x_t) for each frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113; 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127; 5×1024 3.681 ± 0.109; 6×1024 3.652 ± 0.108; 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117; 2 mix 3.796 ± 0.107; 4 mix 3.766 ± 0.113; 8 mix 3.805 ± 0.113; 16 mix 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: network with recurrent connections, unrolled over frames t−1, t, t+1: input x_t → hidden state → output y_t]

• Only able to use previous contexts
→ bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in the hidden layers loops through recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block: input gate (write), forget gate (reset), and output gate (read), each a sigmoid over x_t and h_{t−1} with a bias, control a linear memory cell c_t; tanh nonlinearities on the cell input and output produce h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
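One LSTM time step from the figure can be written out explicitly. A numpy sketch with randomly initialized, purely illustrative parameters:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step: gates read [x_t, h_{t-1}]; the memory cell is
    updated linearly, gated by forget/input; the output gate reads it out."""
    v = np.concatenate([x, h_prev])
    i = sigmoid(p["Wi"] @ v + p["bi"])   # input gate (write)
    f = sigmoid(p["Wf"] @ v + p["bf"])   # forget gate (reset)
    o = sigmoid(p["Wo"] @ v + p["bo"])   # output gate (read)
    g = np.tanh(p["Wc"] @ v + p["bc"])   # candidate cell input
    c = f * c_prev + i * g               # linear memory cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
nx, nh = 8, 4
p = {k: rng.normal(0, 0.1, (nh, nx + nh)) for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({b: np.zeros(nh) for b in ("bi", "bf", "bo", "bc")})
h, c = lstm_step(rng.normal(size=nx), np.zeros(nh), np.zeros(nh), p)
```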

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, ..., xT → LSTM → outputs y1, y2, ..., yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/ ∆ 50.0 vs DNN w/o ∆ 14.2, neutral 35.8 (z = 12.0, p < 10⁻¹⁰)
LSTM w/ ∆ 30.2 vs LSTM w/o ∆ 15.6, neutral 54.2 (z = 5.1, p < 10⁻⁶)
DNN w/ ∆ 15.8 vs LSTM w/ ∆ 34.0, neutral 50.2 (z = −6.2, p < 10⁻⁹)
DNN w/ ∆ 28.4 vs LSTM w/o ∆ 33.6, neutral 38.0 (z = −1.5, p = 0.138)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
− Flexible (e.g., adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
− Learns the mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Hidden Markov model (HMM)

[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(ot), b2(ot), b3(ot); observation sequence O = o1, o2, ..., oT aligned with state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 16 of 79

Multi-stream HMM structure

[Figure: observation vector o_t partitioned into streams: spectrum stream o_t¹ = (c_t, ∆c_t, ∆²c_t); excitation streams o_t² = p_t, o_t³ = δp_t, o_t⁴ = δ²p_t]

$$
b_j(o_t) = \prod_{s=1}^{S} \left( b_j^s(o_t^s) \right)^{w_s}
$$

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79

Training process

[Flowchart, starting from data & labels:]
1. Compute variance floor (HCompV)
2. Initialize monophone (context-independent, CI) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by the EM algorithm (HRest & HERest)
4. Copy CI-HMMs to fullcontext (context-dependent, CD) HMMs (HHEd CL)
5. Reestimate CD-HMMs by the EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by the EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT) → estimated HMMs
9. Estimate CD duration models from forward-backward statistics (HERest)
10. Decision tree-based clustering (HHEd TB) → estimated duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• Preceding and succeeding two phonemes
• Position of current phoneme in current syllable
• Number of phonemes in preceding, current, succeeding syllable
• Accent and stress of preceding, current, succeeding syllable
• Position of current syllable in current word
• Number of preceding and succeeding stressed/accented syllables in phrase
• Number of syllables from previous and to next stressed/accented syllable
• Guess at part of speech of preceding, current, succeeding word
• Number of syllables in preceding, current, succeeding word
• Position of current word in current phrase
• Number of preceding and succeeding content words in current phrase
• Number of words from previous and to next content word
• Number of syllables in preceding, current, succeeding phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Figure: binary decision tree with questions such as L=voice?, L="w"?, R=silence?, L="gy"?; leaf nodes hold synthesized (shared) states for contexts such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: state i occupied from frame t0 to t1 within a sequence of T = 8 frames]

Probability to enter state i at t0, then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} a_{ii}^{t1−t0} ∏_{t=t0}^{t1} b_i(ot) Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Figure: HMM with separate decision trees for mel-cepstrum, for F0, and a decision tree for the state duration model]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Figure: system overview]
Training part: SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → spectral parameters, excitation parameters & labels → training HMMs → context-dependent HMMs & state duration models
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: 3-state left-to-right HMM (initial probability π1; transition probabilities a11, a12, a22, a23, a33; output distributions b1(ot), b2(ot), b3(ot)) aligned to the observation sequence O = o1, o2, …, oT with T = 8; state sequence Q = 1, 1, 1, 1, 2, 2, …, 3, 3; state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: per-state means and variances; the generated outputs follow the state means]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

ot = [ct⊤, ∆ct⊤]⊤,  ∆ct = ct − ct−1

[Figure: windows over the M-dimensional static sequence ct−2, …, ct+2 produce the deltas ∆ct−2, …, ∆ct+2, giving 2M-dimensional observations]

The relationship between static and dynamic features can be arranged in matrix form, o = W c: each observation ot = [ct⊤, ∆ct⊤]⊤ is obtained from the stacked static sequence c = [… ct−1⊤, ct⊤, ct+1⊤ …]⊤ through a pair of band rows of W — [··· 0 I 0 0 ···] for ct and [··· −I I 0 0 ···] for ∆ct.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
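The band matrix W above can be built in a few lines. A sketch for the simplest case — one static dimension per frame and the single delta window ∆ct = ct − ct−1 from the slide (boundary handled by taking ∆c0 = c0):

```python
import numpy as np

def build_window_matrix(T):
    """W maps the static sequence c (length T) to o = [c_t, dc_t] interleaved (length 2T)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0        # static row:  c_t
        W[2 * t + 1, t] = 1.0    # delta row:   dc_t = c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

W = build_window_matrix(4)
c = np.array([1.0, 2.0, 4.0, 7.0])
o = W @ c   # interleaved statics and deltas: [1, 1, 2, 1, 4, 2, 7, 3]
```

For M-dimensional features the 1 and −1 entries become I and −I blocks, exactly as in the slide's block matrix.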

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = W c

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

[Figure: the linear system W⊤ Σq⁻¹ W c = W⊤ Σq⁻¹ μq written out — the banded structure of W with rows [1 0 0 ···] and [−1 1 0 ···], the block-diagonal precisions Σq⁻¹, the stacked static features c1, c2, …, cT, and the stacked state means μq1, μq2, …, μqT]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
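Solving the system W⊤ Σ⁻¹ W c = W⊤ Σ⁻¹ μ directly is a few lines of numpy. A toy sketch (1-dim static feature, delta window ∆ct = ct − ct−1, identity precisions, and made-up state means — a step in the statics with zero delta means, so the delta constraint smooths the step):

```python
import numpy as np

T = 5
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0            # static row
    W[2 * t + 1, t] = 1.0        # delta row: dc_t = c_t - c_{t-1}
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu_static = np.array([0.0, 0.0, 1.0, 1.0, 1.0])   # step-wise state means
mu_delta = np.zeros(T)                             # deltas pulled toward 0
mu = np.ravel(np.column_stack([mu_static, mu_delta]))  # interleave like W's rows
prec = np.eye(2 * T)                               # toy precision matrix (Sigma^-1)

A = W.T @ prec @ W
b = W.T @ prec @ mu
c = np.linalg.solve(A, b)   # smooth, monotonically rising trajectory
```

With the delta means at zero, the solution rises gradually from ~0.09 to ~0.94 instead of jumping — exactly the smoothing effect the slides attribute to the dynamic feature constraints. Production systems exploit the band structure of A with a Cholesky solver instead of a dense solve.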

Generated speech parameter trajectory

[Figure: static & dynamic means and variances; the generated static trajectory c varies smoothly through the state means]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: system overview, as on slide 24]
Training part: SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → spectral parameters, excitation parameters & labels → training HMMs → context-dependent HMMs & state duration models
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: source-filter model]
Generated excitation parameters (log F0 with V/UV) → excitation e(n): pulse train (voiced) / white noise (unvoiced)
Generated spectral parameters (cepstrum, LSP) → linear time-invariant system h(n)
Synthesized speech: x(n) = h(n) ∗ e(n)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
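The source-filter reconstruction x(n) = h(n) ∗ e(n) can be sketched in a few lines. This is a deliberately crude illustration — a 200 Hz pulse train and a toy exponentially decaying impulse response, not a real vocoder filter:

```python
import numpy as np

fs = 16000                      # sampling rate (Hz)
f0 = 200.0                      # assumed F0 for the voiced example
n = np.arange(fs // 100)        # 10 ms of samples
period = int(fs / f0)           # 80 samples between glottal pulses

voiced_exc = np.zeros(len(n))
voiced_exc[::period] = 1.0      # pulse train for voiced frames
unvoiced_exc = np.random.default_rng(0).standard_normal(len(n))  # white noise

h = 0.5 ** np.arange(20)        # toy decaying impulse response h(n)
x = np.convolve(voiced_exc, h)[:len(n)]   # x(n) = h(n) * e(n)
```

A real system would derive h(n) from the generated mel-cepstrum via an MLSA filter and switch e(n) between the pulse train and noise per the V/UV decision.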

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: average-voice model built from training speakers by adaptive training, then adapted to target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only small data from the target speaker/speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: new HMM set λ′ interpolated among HMM sets λ1, λ2, λ3, λ4 with interpolation ratios I(λ′, λ)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation:
  difficult to model a mix of V/UV sounds (e.g. voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]
• Spectral envelope extraction:
  harmonic effects often cause problems
  [Figure: power spectrum in dB over 0–8 kHz]
• Phase:
  important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
  statistics do not vary within an HMM state
• Conditional independence assumption:
  state output probability depends only on the current state
• Weak duration modeling:
  state duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
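The "weak duration modeling" point is concrete: an HMM state with self-transition probability a_ii yields a geometric duration distribution, P(d) = a_ii^(d−1) (1 − a_ii), whose mode is always d = 1 — unlike real phone durations. A two-line check:

```python
import numpy as np

a_ii = 0.8                       # illustrative self-transition probability
d = np.arange(1, 51)             # durations in frames (truncated at 50)
p = a_ii ** (d - 1) * (1 - a_ii) # geometric duration distribution

mean_duration = np.sum(d * p)    # approx. 1 / (1 - a_ii) = 5 frames
```

The distribution is strictly decreasing from d = 1, which is why explicit-duration models (hidden semi-Markov models, slide 22) replace it with e.g. a Gaussian over d.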

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectra over 0–8 kHz]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision tree of yes/no context questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
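At its core the DNN acoustic model is just a per-frame regression from a linguistic feature vector x to an acoustic feature vector y. A minimal sketch of the forward pass — the layer sizes are made up for illustration and the weights are untrained random placeholders, not the deck's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical sizes: 30 linguistic inputs, 3 hidden layers, 10 acoustic outputs
sizes = [30, 64, 64, 64, 10]
Ws = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = relu(h @ W + b)        # hidden layers: nonlinear
    return h @ Ws[-1] + bs[-1]     # output layer: linear (regression)

x = rng.standard_normal(30)        # one frame's binary + numeric features
y = forward(x)                     # predicted acoustic statistics
```

Training minimizes the mean squared error between y and the target acoustic features over all frames; the resulting per-frame means (with global variances) then feed the usual parameter generation step.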

Framework

[Figure: TEXT → text analysis → input feature extraction (binary & numeric linguistic features at frames 1 … T, plus duration and frame-position features from duration prediction) → network with input layer, hidden layers, output layer → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? — no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α = 1), and DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    -9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    -5.1
12.7 (1)    36.6 (4 × 1,024)       50.7      < 10⁻⁶    -11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples from a bimodal distribution in (y1, y2) vs. the NN prediction running between the modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
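The unimodality problem fits in two lines: for a one-to-many mapping, the minimum-MSE prediction is the conditional mean, which can land between the modes. With targets drawn from {−1, +1}, the best constant least-squares fit is their mean, ≈ 0 — a value the data never takes:

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.choice([-1.0, 1.0], size=1000)  # bimodal "speaking styles"

best_constant = targets.mean()  # least-squares optimal constant prediction
```

A mixture density output layer sidesteps this by predicting the full (multimodal) distribution instead of its mean.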


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the network's raw outputs z1 … z6 become mixture weights w1(x), w2(x), means μ1(x), μ2(x), and variances σ1(x), σ2(x) of a density over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation function: z_j = Σ_{i=1}^{4} h_i w_ij

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)   w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                               μ2(x) = z4
σ1(x) = exp(z5)                          σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
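The output-layer transform above is mechanical to implement. A sketch for the slide's 1-dim, 2-mixture case, mapping six raw activations z to valid GMM parameters (the input values are arbitrary examples):

```python
import numpy as np

def mdn_outputs(z):
    """Map raw activations z1..z6 to (weights, means, std devs) of a 2-mix GMM."""
    z = np.asarray(z, dtype=float)
    logits, means, log_sigmas = z[:2], z[2:4], z[4:6]
    w = np.exp(logits - logits.max())   # softmax, shifted for stability
    w = w / w.sum()                     # mixture weights sum to 1
    sigmas = np.exp(log_sigmas)         # exponential keeps variances positive
    return w, means, sigmas

w, mu, sigma = mdn_outputs([0.0, 1.0, -2.0, 2.0, 0.0, -1.0])
```

The softmax and exponential are exactly what guarantee a valid density for any raw network output, so training can use the plain negative log-likelihood of the mixture as the loss.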

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT fed to a deep MDN, which outputs w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) for each frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix     3.537 ± 0.113
               2 mix     3.397 ± 0.115
DNN            4×1024    3.635 ± 0.127
               5×1024    3.681 ± 0.109
               6×1024    3.652 ± 0.108
               7×1024    3.637 ± 0.129
DMDN (4×1024)  1 mix     3.654 ± 0.117
               2 mix     3.796 ± 0.107
               4 mix     3.766 ± 0.113
               8 mix     3.805 ± 0.113
               16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: network with recurrent connections from output y back through the hidden layer, unrolled over time: inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — input gate i_t ("write"), output gate ("read"), forget gate ("reset"), and linear memory cell c_t; all gates see x_t and h_{t−1} with biases b_i, b_f, b_o, b_c; sigmoid gate activations, tanh input/output squashing, block output h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
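A single step of the cell in the slide's gate vocabulary can be sketched directly. This is a minimal vanilla LSTM (no peepholes or projection; weights are untrained random placeholders, and the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
nin, nh = 4, 3  # illustrative input and hidden sizes

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate; every gate sees [x_t, h_{t-1}]
W = {g: rng.standard_normal((nin + nh, nh)) * 0.1 for g in "ifco"}
b = {g: np.zeros(nh) for g in "ifco"}

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigm(z @ W["i"] + b["i"])                        # input gate: write
    f = sigm(z @ W["f"] + b["f"])                        # forget gate: reset
    o = sigm(z @ W["o"] + b["o"])                        # output gate: read
    c = f * c_prev + i * np.tanh(z @ W["c"] + b["c"])    # linear memory cell
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(rng.standard_normal(nin), np.zeros(nh), np.zeros(nh))
```

The key detail is the additive cell update c = f·c_prev + i·(candidate): because the cell state passes through a sum rather than a squashing nonlinearity, information can survive many steps, which is what fixes the decay/overwrite problems of the basic RNN on the previous slide.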

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT fed to an LSTM-RNN producing y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      -6.2   < 10⁻⁹
–          28.4        –           33.6         38.0      -1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.
[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.
[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.
[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.
[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.
[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.
[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.
[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).
[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.
[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.
[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.
[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.
[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.
[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.
[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.
[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.
[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.
[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.
[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.
[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.
[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.
[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.
[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.
[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.
[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.
[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.
[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts
[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.
[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

Page 22: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Multi-stream HMM structure

[Figure: the observation vector o_t is split into S = 4 streams — stream 1: spectrum (c_t, ∆c_t, ∆²c_t); streams 2–4: excitation (p_t, δp_t, δ²p_t) — and the state-output probability is the weighted product over streams:

b_j(o_t) = ∏_{s=1}^{S} ( b_j^s(o_t^s) )^{w_s} ]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 17 of 79

Training process

[Flowchart, HTK tools in parentheses: data & labels →
1. Compute variance floor (HCompV)
2. Initialize context-independent (CI, monophone) HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to context-dependent (CD, fullcontext) HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
→ estimated HMMs. In parallel: estimate CD duration models from forward-backward stats (HERest), decision tree-based clustering (HHEd TB) → estimated duration models]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 18 of 79

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Figure: a binary decision tree of yes/no context questions (e.g. L=voice, L=w, R=silence, L=gy) routes fullcontext states such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n to shared leaf nodes, from which synthesized states are drawn]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: occupancy of state i from t₀ to t₁ along a left-to-right HMM, t = 1 … T = 8]

Probability to enter state i at t₀ then leave at t₁ + 1:

χ_{t₀,t₁}(i) ∝ ∑_{j≠i} α_{t₀−1}(j) a_{ji} a_{ii}^{t₁−t₀} ∏_{t=t₀}^{t₁} b_i(o_t) ∑_{k≠i} a_{ik} b_k(o_{t₁+1}) β_{t₁+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Figure: an HMM with separate decision trees for mel-cepstrum, for F0, and for the state duration model]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Block diagram —
Training part: SPEECH DATABASE → spectral & excitation parameter extraction → spectral/excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate most probable state outputs given HMM and words:

  ô = arg max_o p(o | w, λ)
    = arg max_o ∑_{∀q} p(o, q | w, λ)
    ≈ arg max_o max_q p(o, q | w, λ)
    = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

  q̂ = arg max_q P(q | w, λ)
  ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: a 3-state left-to-right HMM (initial probability π₁; transitions a₁₁, a₁₂, a₂₂, a₂₃, a₃₃; output distributions b₁(o_t), b₂(o_t), b₃(o_t)) aligned to the observation sequence O = o₁, o₂, …, o_T; the best state sequence Q = 1, 1, 1, 1, 2, 2, …, 3, 3 gives state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: per-state output means and variances; ô becomes a step-wise mean vector sequence]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

  o_t = [c_t^⊤, ∆c_t^⊤]^⊤,  ∆c_t = c_t − c_{t−1}

(M-dimensional statics c_t → 2M-dimensional o_t.) The relationship between static and dynamic features can be arranged as o = Wc:

  [ ⋮ ; c_{t−1} ; ∆c_{t−1} ; c_t ; ∆c_t ; c_{t+1} ; ∆c_{t+1} ; ⋮ ]
  = [ ⋯ 0 I 0 0 ⋯ ; ⋯ −I I 0 0 ⋯ ; ⋯ 0 0 I 0 ⋯ ; ⋯ 0 −I I 0 ⋯ ; ⋯ 0 0 0 I ⋯ ; ⋯ 0 0 −I I ⋯ ] [ ⋮ ; c_{t−2} ; c_{t−1} ; c_t ; c_{t+1} ; ⋮ ]

where W is a sparse band matrix of 0, I, and −I blocks.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
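The window matrix W above can be made concrete with a small numpy sketch (an illustration, not from the slides): for 1-dimensional static features c_t with delta ∆c_t = c_t − c_{t−1}, W interleaves one identity row and one difference row per frame.

```python
import numpy as np

def build_W(T):
    """2T x T window matrix mapping statics [c_1..c_T] to o = [c_1, dc_1, ..., c_T, dc_T]."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0          # static row: picks out c_t
        W[2 * t + 1, t] = 1.0      # delta row: c_t - c_{t-1}
        if t > 0:                  # (at the first frame the delta is just c_1)
            W[2 * t + 1, t - 1] = -1.0
    return W

c = np.array([0.0, 1.0, 3.0, 6.0])
o = build_W(4) @ c                 # statics and deltas, interleaved
# o = [0, 0, 1, 1, 3, 2, 6, 3]
```

For M-dimensional features the scalar 1/−1 entries become the I/−I blocks of the band matrix above.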

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

  ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

  p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

  W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

[Figure: the banded structure of W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂ — blocks of 1/−1 window coefficients combine the per-state means μ_q̂₁, μ_q̂₂, …, μ_q̂T on the right-hand side; solving the band-diagonal linear system yields the static trajectory c = [c₁, c₂, …, c_T]]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
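A minimal numpy sketch of solving the band-diagonal system above (illustrative only: diagonal Σ, 1-dim features, made-up step-wise means) — the solution passes smoothly between the state means instead of jumping:

```python
import numpy as np

def mlpg(mu, var, T):
    """Solve W' diag(1/var) W c = W' diag(1/var) mu for the static trajectory c."""
    W = np.zeros((2 * T, T))       # rows alternate: static c_t, delta c_t - c_{t-1}
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    P = W.T * (1.0 / var)          # W^T Sigma^{-1} for diagonal Sigma
    return np.linalg.solve(P @ W, P @ mu)

# Step-wise state means: statics jump 0 -> 1, all delta means are 0 ("change slowly")
mu = np.array([0, 0, 0, 0, 1, 0, 1, 0], dtype=float)
c = mlpg(mu, np.ones(8), 4)        # smooth, monotone trajectory between 0 and 1
```

Production implementations exploit the band structure (e.g. Cholesky on the banded matrix) instead of a dense solve.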

Generated speech parameter trajectory

[Figure: static and dynamic feature tracks — per-state means and variances, with the generated trajectory c passing smoothly through them]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Block diagram repeated: the same training/synthesis pipeline — speech database → parameter extraction → HMM training; text → text analysis → parameter generation → excitation generation + synthesis filter → synthesized speech]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Diagram: generated excitation parameters (log F0 with V/UV) drive a pulse train (voiced) / white noise (unvoiced) excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
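The source-filter reconstruction above can be sketched in a few lines of numpy (toy excitation and impulse response, not a real vocoder; fs, f0, and h are made-up values):

```python
import numpy as np

fs, f0 = 16000, 200                    # sample rate and pitch (toy values)
n = np.arange(1600)                    # 100 ms of samples

# Excitation e(n): pulse train for a voiced half, white noise for an unvoiced half
e = np.zeros_like(n, dtype=float)
e[:800:fs // f0] = 1.0                 # one pulse every fs/f0 = 80 samples
e[800:] = np.random.randn(800) * 0.1

h = 0.95 ** np.arange(64)              # toy decaying impulse response h(n)
x = np.convolve(e, h)[: len(e)]        # synthesized waveform x(n) = h(n) * e(n)
```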

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Diagram: training speakers → adaptive training → average-voice model → adaptation → target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only small data from target speaker/speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Diagram: a new HMM set λ′ placed among representative HMM sets λ₁ … λ₄ with interpolation ratios I(λ′, λ₁) … I(λ′, λ₄)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation:
  Difficult to model mix of V/UV sounds (e.g. voiced fricatives)
  [Diagram: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]
• Spectral envelope extraction:
  Harmonic effects often cause problems
  [Figure: power (dB) vs frequency (0–8 kHz) showing harmonic ripple]
• Phase:
  Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics:
  Statistics do not vary within an HMM state
• Conditional independence assumption:
  State output probability depends only on the current state
• Weak duration modeling:
  State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
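The "weak duration modeling" point is easy to see numerically: a self-transition probability a implies the geometric duration distribution p(d) = a^(d−1)(1−a), whose mode is always d = 1 frame (a = 0.9 is a toy value):

```python
import numpy as np

a = 0.9                                # self-transition probability (toy value)
d = np.arange(1, 51)                   # durations 1..50 frames
p = a ** (d - 1) * (1 - a)             # implicit geometric duration distribution
# p is strictly decreasing: the most likely duration is always a single frame,
# whereas real phone durations peak well above one frame
```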

Better acoustic modeling

• Piece-wise constant statistics → Dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → Graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
  [Figure: spectra (0–8 kHz) of natural vs generated speech]
• Why?
  − Details of spectral (formant) structure disappear
  − Use of better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (adaptation, interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training:
  Learn relationship between linguistic & acoustic features
• Synthesis:
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: a decision tree of yes/no context questions partitions the acoustic space into clustered states]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: linguistic features x feed hidden layers h₁, h₂, h₃, producing acoustic features y]

• DNN represents conditional distribution of y given x
• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
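A sketch of the replacement (random, untrained weights; the layer sizes are illustrative, not the deck's exact feature counts): a feed-forward net maps one frame's linguistic feature vector directly to acoustic statistics, with sigmoid hidden layers and a linear output as in the deck's first experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# in -> 3 sigmoid hidden layers -> linear out (all sizes illustrative)
dims = [474, 512, 512, 512, 127]
Ws = [rng.standard_normal((i, o)) * 0.01 for i, o in zip(dims[:-1], dims[1:])]

def dnn(x):
    h = x
    for W in Ws[:-1]:
        h = sigmoid(h @ W)             # hidden layers: sigmoid
    return h @ Ws[-1]                  # output layer: linear (acoustic statistics)

y = dnn(rng.standard_normal(474))      # one frame: linguistic -> acoustic
```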

Framework

[Diagram: TEXT → text analysis → input feature extraction (with duration prediction and frame-position features) → input features including binary & numeric features for frames 1 … T → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → Integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? — No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α = 1), and DNN (4 × 512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)      DNN (layers × units)   Neutral   p value   z value
15.8 (16)    38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)     27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)     36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples from a bimodal distribution in (y₁, y₂), with the NN prediction falling between the modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
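The unimodality problem can be demonstrated in a few lines: fit an MSE-optimal (least-squares) predictor to a deliberately one-to-many dataset and it returns the conditional mean, matching neither mode (synthetic data, purely for illustration):

```python
import numpy as np

# Identical input for every sample, but two equally likely target modes at -1 and +1
x = np.ones((200, 1))
y = np.concatenate([np.full(100, -1.0), np.full(100, 1.0)])

w, *_ = np.linalg.lstsq(x, y, rcond=None)  # minimizes mean squared error
# w[0] is 0 = E[y|x]: the prediction sits between the modes, matching neither
```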


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the output layer emits z₁ … z₆, mapped to mixture weights w₁(x), w₂(x), means μ₁(x), μ₂(x), and variances σ₁(x), σ₂(x) of a GMM over y]

Inputs of activation function:  z_j = ∑_{i=1}^{4} h_i w_{ij}

Weights → Softmax activation function:
  w₁(x) = exp(z₁) / ∑_{m=1}^{2} exp(z_m),  w₂(x) = exp(z₂) / ∑_{m=1}^{2} exp(z_m)

Means → Linear activation function:
  μ₁(x) = z₃,  μ₂(x) = z₄

Variances → Exponential activation function:
  σ₁(x) = exp(z₅),  σ₂(x) = exp(z₆)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
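The 1-dim, 2-mix output layer above, written out in numpy (the z₁…z₆ layout follows the slide; the network producing z is omitted, and the example z values are made up):

```python
import numpy as np

def mdn_params(z):
    """z = [z1..z6] -> mixture weights, means, standard deviations."""
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()   # softmax over z1, z2: weights sum to 1
    mu = z[2:4]                               # linear: mu1 = z3, mu2 = z4
    sigma = np.exp(z[4:6])                    # exponential keeps sigma > 0
    return w, mu, sigma

def mdn_density(y, z):
    """p(y | x) under the 2-component GMM parameterized by z."""
    w, mu, sigma = mdn_params(z)
    g = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(w @ g)

z = np.array([0.2, -0.1, 1.0, 3.0, 0.0, 0.5])  # illustrative output-layer values
p = mdn_density(2.0, z)
```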

DMDN-based SPSS [27]

[Diagram: TEXT → text analysis → input feature extraction (with duration prediction) → per-frame inputs x₁, x₂, …, x_T → deep MDN emitting w₁(x_t), w₂(x_t), μ₁(x_t), μ₂(x_t), σ₁(x_t), σ₂(x_t) for each frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM             1 mix     3.537 ± 0.113
                2 mix     3.397 ± 0.115
DNN             4×1024    3.635 ± 0.127
                5×1024    3.681 ± 0.109
                6×1024    3.652 ± 0.108
                7×1024    3.637 ± 0.129
DMDN (4×1024)   1 mix     3.654 ± 0.117
                2 mix     3.796 ± 0.107
                4 mix     3.766 ± 0.113
                8 mix     3.805 ± 0.113
                16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − Fixed number of preceding/succeeding contexts
    (e.g. ±2 phonemes/syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: a network with recurrent connections, unrolled over time — inputs x_{t−1}, x_t, x_{t+1} and outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: an LSTM block — memory cell c_t with tanh input/output squashing; input gate i_t (write), output gate (read), and forget gate (reset), each a sigmoid of x_t, h_{t−1}, and a bias]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
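One step of the gated update in the figure, as a numpy sketch (random weights, biases omitted for brevity, hypothetical sizes — an illustration of the cell equations, not a trained model):

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nh = 8, 16                         # input / hidden sizes (illustrative)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

# One weight matrix per gate/cell input, acting on [x_t; h_{t-1}]
Wi, Wf, Wo, Wc = (rng.standard_normal((nx + nh, nh)) * 0.1 for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigm(z @ Wi)                   # input gate: write
    f = sigm(z @ Wf)                   # forget gate: reset
    o = sigm(z @ Wo)                   # output gate: read
    c = f * c_prev + i * np.tanh(z @ Wc)   # linear memory cell update
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(rng.standard_normal(nx), np.zeros(nh), np.zeros(nh))
```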

LSTM-based SPSS [33, 34]

[Diagram: TEXT → text analysis → input feature extraction (with duration prediction) → per-frame inputs x₁, x₂, …, x_T → LSTM → outputs y₁, y₂, …, y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z       p
50.0       14.2        –           –            35.8      12.0    < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1     < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2    < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5    0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.


Training process

1. Compute variance floor (HCompV)
2. Initialize CI-HMMs by segmental k-means (HInit)
3. Reestimate CI-HMMs by EM algorithm (HRest & HERest)
4. Copy CI-HMMs to CD-HMMs (HHEd CL)
5. Reestimate CD-HMMs by EM algorithm (HERest)
6. Decision tree-based clustering (HHEd TB)
7. Reestimate CD-HMMs by EM algorithm (HERest)
8. Untie parameter tying structure (HHEd UT)
9. Estimate CD-dur models from FB stats (HERest)
10. Decision tree-based clustering (HHEd TB)

CI = context-independent (monophone); CD = context-dependent (fullcontext).
Inputs: data & labels. Outputs: estimated HMMs & estimated duration models.

Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes
• Position of current phoneme in current syllable
• # of phonemes at {preceding, current, succeeding} syllable
• {accent, stress} of {preceding, current, succeeding} syllable
• Position of current syllable in current word
• # of {preceding, succeeding} {stressed, accented} syllables in phrase
• # of syllables {from previous, to next} {stressed, accented} syllable
• Guess at part of speech of {preceding, current, succeeding} word
• # of syllables in {preceding, current, succeeding} word
• Position of current word in current phrase
• # of {preceding, succeeding} content words in current phrase
• # of words {from previous, to next} content word
• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models.

Decision tree-based state clustering [7]

[Figure: a binary decision tree asks yes/no context questions (e.g. L=voice?, L=w?, R=silence?, L=gy?) at internal nodes; full-context states such as w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n share leaf nodes, from which the synthesized states are drawn.]

Stream-dependent tree-based clustering

Separate decision trees for mel-cepstrum and for F0.

Spectrum & excitation can have different context dependency
→ build decision trees individually.

State duration models [8]

[Figure: state i is occupied from frame t0 to t1 on a timeline t = 1, 2, …, T = 8.]

Probability to enter state i at t0 and then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} a_{ii}^{t1−t0} [ Π_{t=t0}^{t1} b_i(o_t) ] Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models from these occupancy statistics

Stream-dependent tree-based clustering

Separate decision trees for mel-cepstrum, for F0, and for the state duration models; each stream of the HMM is clustered individually.

HMM-based speech synthesis [4]

[Diagram]
Training part: SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → spectral & excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)


Best state sequence

[Figure: a 3-state left-to-right HMM (initial probability π1; transition probabilities a11, a12, a22, a23, a33; output distributions b1(ot), b2(ot), b3(ot)) aligned to an observation sequence O = o1, o2, …, oT, giving a state sequence Q (e.g. 1 1 1 1 2 2 3 3) and state durations D (e.g. 4, 10, 5).]

Best state outputs w/o dynamic features

[Figure: per-state means and variances over time.]

ô becomes a step-wise mean vector sequence.

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t⊤, Δc_t⊤]⊤,  Δc_t = c_t − c_{t−1}

(c_t is the M-dimensional static feature vector, so o_t is 2M-dimensional.)

The relationship between static and dynamic features can be arranged in matrix form as o = Wc:

[ …, c_{t−1}⊤, Δc_{t−1}⊤, c_t⊤, Δc_t⊤, c_{t+1}⊤, Δc_{t+1}⊤, … ]⊤ = W [ …, c_{t−2}⊤, c_{t−1}⊤, c_t⊤, c_{t+1}⊤, … ]⊤

where each frame t contributes two block rows to W: a static row [⋯ 0 I 0 0 ⋯] that selects c_t, and a delta row [⋯ 0 −I I 0 ⋯] that computes c_t − c_{t−1}.
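The block structure of W can be made concrete with a small numpy sketch (mine, not from the talk): one static row and one delta row per frame, using the Δc_t = c_t − c_{t−1} window above. T, M, and the values of c below are toy assumptions.

```python
import numpy as np

def build_window_matrix(T, M=1):
    """Stack static and delta rows so that o = W c, where
    o_t = [c_t, dc_t] and dc_t = c_t - c_{t-1}."""
    W = np.zeros((2 * T * M, T * M))
    I = np.eye(M)
    for t in range(T):
        W[(2 * t) * M:(2 * t + 1) * M, t * M:(t + 1) * M] = I          # static row: c_t
        W[(2 * t + 1) * M:(2 * t + 2) * M, t * M:(t + 1) * M] = I      # delta row: +c_t
        if t > 0:
            W[(2 * t + 1) * M:(2 * t + 2) * M, (t - 1) * M:t * M] = -I  # delta row: -c_{t-1}
    return W

c = np.array([1.0, 3.0, 2.0])            # toy static trajectory, T=3, M=1
o = build_window_matrix(3) @ c           # interleaved [c_1, dc_1, c_2, dc_2, c_3, dc_3]
```

With this convention the first frame has no predecessor, so its delta row reduces to +c_1.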

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_{q̂}, Σ_{q̂})

By setting ∂ log N(Wc; μ_{q̂}, Σ_{q̂}) / ∂c = 0,

W⊤ Σ_{q̂}⁻¹ W c = W⊤ Σ_{q̂}⁻¹ μ_{q̂}
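Given W, the closed-form solution above is just a linear system in the static trajectory c. A toy numpy sketch under simplifying assumptions (T = 3, one-dimensional features, diagonal covariance, made-up state means):

```python
import numpy as np

# Build W for T = 3 frames: static row (+c_t) and delta row (c_t - c_{t-1}).
T = 3
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0                  # static
    W[2 * t + 1, t] = 1.0              # delta: +c_t
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0     # delta: -c_{t-1}

mu = np.array([0.0, 0.0, 2.0, 0.5, 2.0, 0.0])  # hypothetical [c, dc] means per frame
P = np.diag(np.ones(6))                        # Sigma^{-1} (unit diagonal covariance)

# Solve W^T Sigma^{-1} W c = W^T Sigma^{-1} mu for the smooth trajectory c.
A = W.T @ P @ W
b = W.T @ P @ mu
c = np.linalg.solve(A, b)
```

In practice A is banded and positive definite, so a banded Cholesky solve (O(T)) is used instead of a dense solver.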


Speech parameter generation algorithm [9]

[Figure: the normal equations W⊤Σ_{q̂}⁻¹W c = W⊤Σ_{q̂}⁻¹μ_{q̂} written out element by element, with the rows of 1s and −1/1 pairs of W made explicit; W⊤Σ_{q̂}⁻¹W is a banded matrix, so the trajectory c = (c1, …, cT) can be solved for efficiently from the state means μ_{q̂1}, …, μ_{q̂T}.]

Generated speech parameter trajectory

[Figure: static and dynamic means and variances over time, with the generated smooth trajectory c passing through them.]


Waveform reconstruction

• Generated excitation parameters (log F0 with V/UV) → excitation e(n): pulse train (voiced) or white noise (unvoiced)
• Generated spectral parameters (cepstrum, LSP) → linear time-invariant system h(n)
• Synthesized speech: x(n) = h(n) ∗ e(n)
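The source-filter reconstruction x(n) = h(n) ∗ e(n) can be sketched in a few lines of numpy. The impulse response h and the frame length are toy assumptions; a real vocoder would use an MLSA-style filter driven per frame rather than a single direct convolution.

```python
import numpy as np

FS = 16000  # sampling rate in Hz

def excitation(f0, n, rng=np.random.default_rng(0)):
    """Pulse train for voiced frames (f0 > 0), white noise for unvoiced."""
    if f0 > 0:
        e = np.zeros(n)
        e[::int(FS / f0)] = 1.0   # one pulse per pitch period
        return e
    return rng.standard_normal(n)

def synthesize(e, h):
    """x(n) = h(n) * e(n): pass the excitation through an LTI filter."""
    return np.convolve(e, h)[:len(e)]

h = np.array([1.0, 0.5, 0.25])            # toy impulse response (assumption)
x = synthesize(excitation(100.0, 160), h)  # one 10-ms voiced frame at F0 = 100 Hz
```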

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Adaptation (mimicking voice) [13]

[Figure: adaptive training over training speakers yields an average-voice model, which is then adapted to target speakers.]

• Train average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: a new HMM set λ′ placed among representative sets λ1–λ4 with interpolation ratios I(λ′, λ1), …, I(λ′, λ4).]

λ: HMM set; I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data
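One way to picture interpolation is to mix the Gaussian parameters of several HMM sets with the ratios I(λ′, λ). This is a deliberately simplified sketch (the cited methods interpolate at the distribution level and differ in detail); the values below are toy assumptions.

```python
import numpy as np

def interpolate_gaussians(means, variances, ratios):
    """Combine per-state Gaussian parameters of several HMM sets with
    interpolation ratios I(lambda', lambda_k) (normalized to sum to 1)."""
    ratios = np.asarray(ratios, dtype=float)
    ratios /= ratios.sum()
    mu = sum(r * m for r, m in zip(ratios, means))        # interpolated mean
    var = sum(r * v for r, v in zip(ratios, variances))   # interpolated variance
    return mu, var

# Two toy "voices", mixed 50/50 to form a new voice.
mu, var = interpolate_gaussians(
    means=[np.array([0.0, 1.0]), np.array([2.0, 3.0])],
    variances=[np.ones(2), 2.0 * np.ones(2)],
    ratios=[0.5, 0.5])
```

Varying the ratios moves λ′ continuously between the representative voices without any adaptation data.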

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)
  [Figure: binary switch between pulse train (voiced) and white noise (unvoiced) producing e(n).]
• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power (dB) vs frequency (0–8 kHz) with harmonic ripple.]
• Phase
  Important, but usually ignored

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of these assumptions hold for real speech.
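The weak-duration point can be checked directly: with self-transition probability a_ii, a plain HMM state's implicit duration distribution is P(d) = a_ii^{d−1}(1 − a_ii), which always peaks at d = 1 and decays exponentially, regardless of a_ii (the value 0.8 below is an arbitrary example):

```python
import numpy as np

a_ii = 0.8                      # self-transition probability (toy value)
d = np.arange(1, 11)            # durations 1..10 frames
p = a_ii ** (d - 1) * (1 - a_ii)  # geometric duration distribution P(d)

mode = int(d[np.argmax(p)])     # most likely duration is always d = 1
```

Real phone durations are roughly bell-shaped, not geometric, which motivates the explicit duration models (HSMMs) listed next.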

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs generated spectra (0–8 kHz); the generated spectrum lacks fine formant structure.]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Linguistic → acoustic mapping

• Training
  Learn the relationship between linguistic & acoustic features
• Synthesis
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential.


HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees asking yes/no context questions partition the acoustic space.]

• Decision tree-clustered HMM with GMM state-output distributions

DNN-based acoustic modeling for SPSS [18]

[Figure: hidden layers h1–h3 mapping linguistic features x to acoustic features y.]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs
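A minimal forward pass for this mapping, with sigmoid hidden layers and a linear output layer (the activations used in the experiments reported below). Layer sizes and weights are arbitrary toy values, not the experimental configuration.

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Map a linguistic feature vector x to acoustic features y:
    sigmoid hidden layers followed by a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden layer
    return weights[-1] @ h + biases[-1]          # linear output layer

rng = np.random.default_rng(0)
sizes = [10, 16, 16, 4]                          # toy layer sizes (assumption)
weights = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

y = dnn_forward(rng.standard_normal(sizes[0]), weights, biases)
```

At synthesis time one such forward pass is run per frame; the outputs are treated as statistics of the speech parameter vector sequence and fed to the usual parameter generation step.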

Framework

[Diagram: TEXT → text analysis → input feature extraction → input features (binary & numeric, plus duration and frame-position features) for frames 1…T → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH. Duration prediction determines the number of frames T.]

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → feature extraction integrated into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform


Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Example of speech parameter trajectories

w/o grouping questions & numeric contexts; silence frames removed.

[Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α = 1), and DNN (4×512).]

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Limitations of DNN-based acoustic modeling

[Figure: scattered data samples in (y1, y2); the NN prediction falls at the conditional mean, away from the modes.]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]


Mixture density network [26]

[Figure: a 1-dim, 2-mix MDN; the output layer emits activations z1…z6, converted into the mixture weights w1(x), w2(x), means μ1(x), μ2(x), and variances σ1(x), σ2(x) of a GMM over y.]

Output-layer activation functions:
• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

With z_j = Σ_{i=1}^{4} h_i w_{ij} the inputs of the activation functions (1-dim, 2-mix MDN):

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3,  μ2(x) = z4
σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
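The three output activations can be sketched directly in numpy; the splitting of z and the 1-D, 2-mix setting mirror the slide, while the concrete z values are illustrative:

```python
import numpy as np

def mdn_output_layer(z, n_mix):
    """Split raw output activations z into GMM parameters:
    softmax -> mixture weights, identity -> means, exp -> std deviations
    (the activation choices for a 1-D target)."""
    zw, zm, zs = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:]
    w = np.exp(zw - zw.max())   # subtract max for numerical stability
    w /= w.sum()                # softmax: weights sum to 1
    return w, zm, np.exp(zs)    # weights, means, sigmas (always positive)

# z1..z6 for a 2-mixture, 1-dimensional MDN (illustrative values).
w, mu, sigma = mdn_output_layer(np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.5]), 2)
```

Training maximizes the log-likelihood of the targets under this GMM instead of minimizing MSE, so multimodal targets and per-frame variances are both representable.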

DMDN-based SPSS [27]

[Diagram: TEXT → text analysis → input feature extraction → duration prediction → for each frame t = 1…T, the DMDN maps x_t to GMM parameters w_m(x_t), μ_m(x_t), σ_m(x_t) over y → parameter generation → waveform synthesis → SPEECH.]

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM             1 mix     3.537 ± 0.113
                2 mix     3.397 ± 0.115
DNN             4×1024    3.635 ± 0.127
                5×1024    3.681 ± 0.109
                6×1024    3.652 ± 0.108
                7×1024    3.637 ± 0.129
DMDN (4×1024)   1 mix     3.654 ± 0.117
                2 mix     3.796 ± 0.107
                4 mix     3.766 ± 0.113
                8 mix     3.805 ± 0.113
                16 mix    3.791 ± 0.102

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements
Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS
Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]


Basic RNN

[Figure: input x_t feeds a hidden layer whose recurrent connections carry state forward, producing outputs y_{t−1}, y_t, y_{t+1}.]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: an LSTM block — the memory cell c_t, with tanh input and output nonlinearities, is controlled by three sigmoid gates computed from x_t and h_{t−1} (with biases b_i, b_f, b_o, b_c): the input gate (write), the forget gate (reset), and the output gate (read), producing h_t.]
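The gate equations can be written out as a single numpy step. Weight shapes and the input/hidden sizes below are toy assumptions, and this sketch omits peepholes and the projection layer used in the experimental setup:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: input/forget/output gates around a linear memory cell."""
    sigm = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = np.concatenate([x, h_prev])               # gate inputs: (x_t, h_{t-1})
    i = sigm(p["Wi"] @ z + p["bi"])               # input gate (write)
    f = sigm(p["Wf"] @ z + p["bf"])               # forget gate (reset)
    o = sigm(p["Wo"] @ z + p["bo"])               # output gate (read)
    c = f * c_prev + i * np.tanh(p["Wc"] @ z + p["bc"])  # linear memory cell
    h = o * np.tanh(c)                            # block output h_t
    return h, c

n_in, n_h = 3, 4                                  # toy sizes (assumption)
rng = np.random.default_rng(0)
p = {k: 0.1 * rng.standard_normal((n_h, n_in + n_h)) for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({b: np.zeros(n_h) for b in ("bi", "bf", "bo", "bc")})

h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_h), np.zeros(n_h), p)
```

Because the cell update is additive (f·c + i·tanh(·)), gradients survive over many steps where a basic RNN's would decay.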

LSTM-based SPSS [33, 34]

[Diagram: TEXT → text analysis → input feature extraction → duration prediction → an LSTM maps x_1, x_2, …, x_T to y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH.]

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/ Δ   DNN w/o Δ   LSTM w/ Δ   LSTM w/o Δ   Neutral   z       p
50.0       14.2        –           –            35.8      12.0    < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1     < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2    < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5    0.138

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Context-dependent acoustic modeling

• {preceding, succeeding} two phonemes

• Position of current phoneme in current syllable

• # of phonemes at {preceding, current, succeeding} syllable

• {accent, stress} of {preceding, current, succeeding} syllable

• Position of current syllable in current word

• # of {preceding, succeeding} {stressed, accented} syllables in phrase

• # of syllables {from previous, to next} {stressed, accented} syllable

• Guess at part of speech of {preceding, current, succeeding} word

• # of syllables in {preceding, current, succeeding} word

• Position of current word in current phrase

• # of {preceding, succeeding} content words in current phrase

• # of words {from previous, to next} content word

• # of syllables in {preceding, current, succeeding} phrase

Impossible to have all possible models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 19 of 79

Decision tree-based state clustering [7]

[Figure: binary decision tree; yes/no questions such as L=voice?, L=w?, R=silence?, L=gy? split context-dependent states (e.g., w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) down to clustered leaf nodes, which provide the synthesized states]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and decision trees for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: state occupancy along frames t = 1, 2, ..., T = 8, entering state i at t0 and leaving at t1]

Probability to enter state i at t0, then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_{ii}^{t1−t0} · ∏_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Figure: decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models, all attached to the HMM]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Figure: system overview]
Training part: speech signals from the SPEECH DATABASE undergo spectral and excitation parameter extraction; the extracted parameters and labels are used for training HMMs, giving context-dependent HMMs & state duration models.
Synthesis part: TEXT is converted by text analysis into labels; parameter generation from HMMs yields spectral and excitation parameters, which drive excitation generation and the synthesis filter to produce SYNTHESIZED SPEECH.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate most probable state outputs given HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)

ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: 3-state left-to-right HMM with initial probability π1, transitions a11, a12, a22, a23, a33, and output distributions b1(o_t), b2(o_t), b3(o_t); observation sequence O = o1, o2, o3, o4, o5, ..., oT aligned to state sequence Q = 1, 1, 1, 1, 2, 2, 3, 3; state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: step-wise trajectory following the state means, with state variances shown]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^⊤, ∆c_t^⊤]^⊤,  ∆c_t = c_t − c_{t−1}

(M-dim static features c_t paired with their deltas ∆c_t give 2M-dim output vectors.)

The relationship between static and dynamic features can be arranged as o = Wc:

  ⎡ ...      ⎤   ⎡ ... ... ... ... ⎤
  ⎢ c_{t−1}  ⎥   ⎢  0   I   0   0  ⎥   ⎡ c_{t−2} ⎤
  ⎢ ∆c_{t−1} ⎥   ⎢ −I   I   0   0  ⎥   ⎢ c_{t−1} ⎥
  ⎢ c_t      ⎥ = ⎢  0   0   I   0  ⎥ · ⎢ c_t     ⎥
  ⎢ ∆c_t     ⎥   ⎢  0  −I   I   0  ⎥   ⎢ c_{t+1} ⎥
  ⎢ c_{t+1}  ⎥   ⎢  0   0   0   I  ⎥   ⎣ ...     ⎦
  ⎢ ∆c_{t+1} ⎥   ⎢  0   0  −I   I  ⎥
  ⎣ ...      ⎦   ⎣ ... ... ... ... ⎦
       o                 W                 c

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
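The band matrix W can be built mechanically. Below is a toy numpy sketch, assuming the slide's simple delta definition ∆c_t = c_t − c_{t−1} with c_{−1} treated as 0 at the boundary; the function name is ours:

```python
import numpy as np

def build_window_matrix(T, M):
    """Band matrix W mapping static features c (T*M values) to
    static+delta observations o = Wc (T*2M values), with
    delta c_t = c_t - c_{t-1} and c_{-1} = 0 at the boundary."""
    W = np.zeros((2 * M * T, M * T))
    I = np.eye(M)
    for t in range(T):
        W[2 * M * t : 2 * M * t + M, M * t : M * t + M] = I           # static row: c_t
        W[2 * M * t + M : 2 * M * (t + 1), M * t : M * t + M] = I     # delta row: +c_t
        if t > 0:
            W[2 * M * t + M : 2 * M * (t + 1), M * (t - 1) : M * t] = -I  # delta row: -c_{t-1}
    return W

# T=3 frames, M=2 static dims: c = [c0; c1; c2]
c = np.arange(6, dtype=float)
W = build_window_matrix(3, 2)
o = W @ c   # interleaves [c_t, delta c_t] per frame
```

Real systems use wider windows (e.g., three-tap deltas and delta-deltas), but the banded structure of W is the same.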

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
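Solving this linear system gives the smooth trajectory directly. A minimal numpy sketch under toy assumptions (1-dim static feature, the simple delta ∆c_t = c_t − c_{t−1}, diagonal covariances, step-wise state means):

```python
import numpy as np

# Toy maximum-likelihood parameter generation (MLPG):
# solve W' Sigma^{-1} W c = W' Sigma^{-1} mu for the static trajectory c.
T = 5
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0             # static: c_t
    W[2 * t + 1, t] = 1.0         # delta: +c_t
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0    # delta: -c_{t-1}

# Step-wise state statistics per frame, interleaved [static, delta]:
# static means jump from 1 to 3; delta means of 0 enforce smoothness.
mu = np.array([1., 0., 1., 0., 3., 0., 3., 0., 3., 0.])
var = np.full(2 * T, 0.1)
P = np.diag(1.0 / var)            # Sigma^{-1}

A = W.T @ P @ W                   # W' Sigma^{-1} W
b = W.T @ P @ mu                  # W' Sigma^{-1} mu
c = np.linalg.solve(A, b)         # smooth generated trajectory
```

In practice the system is banded and solved with a Cholesky-based band solver rather than a dense solve, but the equation is the one on the slide.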


Speech parameter generation algorithm [9]

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

[Figure: the banded linear system written out elementwise; W stacks the 1/−1 window coefficients, Σ_q̂^{−1} is block-diagonal over frames, the right-hand side stacks μ_q1, μ_q2, ..., μ_qT, and solving gives the trajectory c1, c2, ..., cT]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: static and dynamic feature means and variances, and the smooth generated trajectory c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: system overview]
Training part: speech signals from the SPEECH DATABASE undergo spectral and excitation parameter extraction; the extracted parameters and labels are used for training HMMs, giving context-dependent HMMs & state duration models.
Synthesis part: TEXT is converted by text analysis into labels; parameter generation from HMMs yields spectral and excitation parameters, which drive excitation generation and the synthesis filter to produce SYNTHESIZED SPEECH.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: a pulse train (voiced) or white noise (unvoiced) forms the excitation e(n), controlled by the generated excitation parameters (log F0 with V/UV); a linear time-invariant system h(n), set from the generated spectral parameters (cepstrum, LSP), outputs synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
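The source-filter reconstruction above can be sketched in a few lines. This toy numpy example uses an arbitrary decaying impulse response in place of a real synthesis filter (real systems drive an MLSA-type filter with generated cepstra):

```python
import numpy as np

# Toy source-filter synthesis: x(n) = h(n) * e(n).
fs = 16000
f0 = 200.0                       # generated F0 for a voiced segment
n = np.arange(800)               # 50 ms of samples

e = np.zeros_like(n, dtype=float)
e[::int(fs / f0)] = 1.0          # pulse train at the pitch period (voiced)
# an unvoiced segment would instead use: e = np.random.randn(len(n))

h = 0.97 ** np.arange(100)       # placeholder decaying impulse response
x = np.convolve(e, h)[:len(n)]   # synthesized speech segment
```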

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Adaptation (mimicking voice) [13]

[Figure: adaptive training over training speakers produces an average-voice model, which is then adapted to target speakers]

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires only small data from the target speaker/speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Figure: HMM sets λ1, λ2, λ3, λ4 combined with ratios I(λ′, λ1), ..., I(λ′, λ4) to give a new model λ′; λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]

• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power (dB) vs. frequency, 0–8 kHz]

• Phase
  Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → Dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → Graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled
  [Figure: natural vs. generated spectra, 0–8 kHz]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of better acoustic models relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Linguistic rarr acoustic mapping

• Training
  Learn relationship between linguistic & acoustic features

• Synthesis
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree with yes/no questions partitioning the acoustic space into clustered regions]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x (input) to acoustic features y (output)]

• DNN represents conditional distribution of y given x

• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
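The frame-by-frame mapping such a network performs can be sketched with plain numpy. Weights below are random placeholders and the layer sizes are illustrative, not the talk's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, weights, biases):
    """Frame-level DNN acoustic model sketch: linguistic features x
    (binary + numeric) -> acoustic features y, with sigmoid hidden
    layers and a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden layers
    return h @ weights[-1] + biases[-1]          # linear output layer

# Illustrative dimensions: 449 linguistic inputs, 127 acoustic outputs.
sizes = [449, 1024, 1024, 127]
weights = [rng.normal(0.0, 0.01, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(b) for b in sizes[1:]]

x = rng.random(449)              # linguistic features for one frame
y = forward(x, weights, biases)  # predicted acoustic statistics
```

A trained model would apply this per frame and then run parameter generation over the predicted means and variances, as in the figure.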

Framework

[Figure: DNN-based framework. TEXT passes through text analysis and input feature extraction, producing input features including binary & numeric features at frames 1 ... T, plus a duration feature (from duration prediction) and a frame position feature. The network (input layer, hidden layers, output layer) outputs statistics (mean & variance) of the speech parameter vector sequence: spectral features, excitation features, and the V/UV feature. Parameter generation and waveform synthesis then produce SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → Integrated feature extraction to acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithm

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker

Training / test data: 33,000 & 173 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: 11 categorical features, 25 numeric features

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²

HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]

DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500, comparing natural speech, HMM (α=1), and DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α) | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | −9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | −5.1
12.7 (1) | 36.6 (4 × 1024) | 50.7 | < 10⁻⁶ | −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples forming two branches in (y1, y2) space; the NN prediction runs between them]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; network outputs z1 ... z6 are mapped to mixture weights w1(x), w2(x), means μ1(x), μ2(x), and variances σ1(x), σ2(x) of a GMM over y]

Inputs of activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights → softmax activation function:
w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)

Means → linear activation function:
μ1(x) = z3,  μ2(x) = z4

Variances → exponential activation function:
σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
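The three output activations can be sketched as a small helper. A toy numpy example; the packing order of the raw outputs z into weights/means/variances is our assumption:

```python
import numpy as np

def mdn_split(z, n_mix, dim):
    """Map raw network outputs z to GMM parameters, as on the slide:
    softmax -> mixture weights, linear -> means, exp -> std devs."""
    w_logits = z[:n_mix]
    mu = z[n_mix:n_mix + n_mix * dim].reshape(n_mix, dim)
    sigma = np.exp(z[n_mix + n_mix * dim:].reshape(n_mix, dim))
    w = np.exp(w_logits - w_logits.max())   # numerically stable softmax
    w = w / w.sum()
    return w, mu, sigma

# 1-dim, 2-mixture example matching the slide's z1 ... z6
z = np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.5])
w, mu, sigma = mdn_split(z, n_mix=2, dim=1)
# w = [0.5, 0.5]; mu = [[-1.], [1.]]; sigma = [[1.], [exp(0.5)]]
```

The exponential on the variance outputs guarantees positivity, and the softmax guarantees the weights sum to one — the two constraints a GMM needs.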

DMDN-based SPSS [27]

[Figure: DMDN-based framework. TEXT passes through text analysis, input feature extraction, and duration prediction, giving frame-level linguistic features x1, x2, ..., xT; for each frame the DMDN outputs mixture weights w1(x_t), w2(x_t), means μ1(x_t), μ2(x_t), and variances σ1(x_t), σ2(x_t); parameter generation and waveform synthesis then produce SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)

DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix

Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113, 2 mix 3.397 ± 0.115

DNN: 4×1024 3.635 ± 0.127, 5×1024 3.681 ± 0.109, 6×1024 3.652 ± 0.108, 7×1024 3.637 ± 0.129

DMDN (4×1024): 1 mix 3.654 ± 0.117, 2 mix 3.796 ± 0.107, 4 mix 3.766 ± 0.113, 8 mix 3.805 ± 0.113, 16 mix 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − Fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN mapping input x to output y through recurrent connections, unrolled over time: x_{t−1}, x_t, x_{t+1} → y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block; the memory cell c_t is written through a tanh input squashing gated by a sigmoid input gate (write), reset by a sigmoid forget gate, and read through a sigmoid output gate to give h_t; each gate receives x_t and h_{t−1} with biases b_i, b_f, b_o, b_c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
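The gate structure on this slide can be sketched in a few lines; a minimal single LSTM step (numpy, hypothetical dimensions, no peephole connections), with the write/reset/read roles marked:

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: linear memory cell c_t guarded by multiplicative gates."""
    z = np.concatenate([x_t, h_prev])                      # [x_t; h_{t-1}]
    i = sigm(p["W_i"] @ z + p["b_i"])                      # input gate  (write)
    f = sigm(p["W_f"] @ z + p["b_f"])                      # forget gate (reset)
    o = sigm(p["W_o"] @ z + p["b_o"])                      # output gate (read)
    c = f * c_prev + i * np.tanh(p["W_c"] @ z + p["b_c"])  # memory cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
D_x, D_h = 3, 4                                            # hypothetical sizes
p = {f"W_{g}": 0.1 * rng.standard_normal((D_h, D_x + D_h)) for g in "ifoc"}
p.update({f"b_{g}": np.zeros(D_h) for g in "ifoc"})
h, c = np.zeros(D_h), np.zeros(D_h)
for t in range(5):
    h, c = lstm_step(rng.standard_normal(D_x), h, c, p)
print(h.shape)
```

With the forget gate near 1 and the input gate near 0, c_t passes through almost unchanged, which is what lets the cell hold information over long ranges where a basic RNN would let it decay.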

LSTM-based SPSS [33, 34]

[Figure: LSTM-based SPSS pipeline; TEXT → text analysis → input feature extraction → duration prediction → LSTM maps linguistic features x_1, x_2, …, x_T to acoustic features y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker

Train / dev set data: 34,632 & 100 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: 449 (DNN), 289 (LSTM)

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)

DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU

LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

DNN w/ ∆ 50.0 vs. DNN w/o ∆ 14.2, Neutral 35.8 (z = 12.0, p < 10⁻¹⁰)
LSTM w/ ∆ 30.2 vs. LSTM w/o ∆ 15.6, Neutral 54.2 (z = 5.1, p < 10⁻⁶)
DNN w/ ∆ 15.8 vs. LSTM w/ ∆ 34.0, Neutral 50.2 (z = −6.2, p < 10⁻⁹)
DNN w/ ∆ 28.4 vs. LSTM w/o ∆ 33.6, Neutral 38.0 (z = −1.5, p = 0.138)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  • HMM-based statistical parametric speech synthesis (SPSS)
  • Flexibility
  • Improvements

Statistical parametric speech synthesis with neural networks
  • Deep neural network (DNN)-based SPSS
  • Deep mixture density network (DMDN)-based SPSS
  • Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

  − Learn mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Decision tree-based state clustering [7]

[Figure: binary decision tree; yes/no questions such as L=voice?, R=silence?, L="w"?, L="gy"? cluster context-dependent models (e.g., w-a+sil, w-a+sh, gy-a+pau, g-a+sil, gy-a+sil, w-a+t, k-a+b, t-a+n) into leaf nodes of synthesized states]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 20 of 79

Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: trellis over states i and frames t = 1 … T = 8, with a state occupancy from t0 to t1]

Probability to enter state i at t0 and then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} a_{ii}^{t1−t0} · Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Figure: HMM with stream-dependent trees; decision trees for mel-cepstrum, decision trees for F0, and a decision tree for state duration models]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Figure: HMM-based speech synthesis pipeline.
Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation & synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)

ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: 3-state left-to-right HMM (initial probability π1; transition probabilities a11, a12, a22, a23, a33; output distributions b1(o_t), b2(o_t), b3(o_t)) aligned to observation sequence O = o1, o2, …, oT; best state sequence Q = 1 1 1 1 2 2 3 3 …; state durations D]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: step-wise state means with variances]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^⊤, ∆c_t^⊤]^⊤,  ∆c_t = c_t − c_{t−1}

[Figure: M-dimensional static features c_{t−2} … c_{t+2} paired with M-dimensional dynamic features ∆c_{t−2} … ∆c_{t+2}, giving 2M-dimensional output vectors]

The relationship between static and dynamic features can be arranged as o = Wc:

[…, c_{t−1}, ∆c_{t−1}, c_t, ∆c_t, c_{t+1}, ∆c_{t+1}, …]^⊤ = W […, c_{t−2}, c_{t−1}, c_t, c_{t+1}, …]^⊤

where each static row of W is [⋯ 0 I 0 0 ⋯] (selecting c_t) and each dynamic row is [⋯ −I I 0 0 ⋯] (computing ∆c_t = c_t − c_{t−1})

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0:

W^⊤ Σ_q̂^{−1} W c = W^⊤ Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
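The closed-form solution above can be checked on a toy case; a minimal sketch (numpy) with a 1-dimensional static feature, ∆c_t = c_t − c_{t−1}, diagonal covariances, and hypothetical step-wise state means:

```python
import numpy as np

T = 4
# Build W: maps static features c (T,) to stacked [c_t, Δc_t] observations o (2T,)
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0            # static row selects c_t
    W[2 * t + 1, t] = 1.0        # delta row computes c_t - c_{t-1}
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

# Hypothetical step-wise state means and variances for [c_t, Δc_t] at each frame
mu = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0])
Sigma_inv = np.diag(1.0 / np.full(2 * T, 0.25))

# Solve  WᵀΣ⁻¹W c = WᵀΣ⁻¹μ  for the maximum-likelihood static trajectory
A = W.T @ Sigma_inv @ W
b = W.T @ Sigma_inv @ mu
c = np.linalg.solve(A, b)
print(c.shape)
```

The delta rows couple neighbouring frames, so c follows the means smoothly instead of reproducing the step-wise sequence; production systems solve the same system with a banded/Cholesky solver rather than a dense one.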


Speech parameter generation algorithm [9]

[Figure: band structure of the linear system W^⊤ Σ_q^{−1} W c = W^⊤ Σ_q^{−1} μ_q; a banded matrix over the static features c_1 … c_T and the state mean sequence μ_{q1} … μ_{qT}]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: generated static and dynamic speech parameter trajectories c, with state means and variances]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: HMM-based speech synthesis pipeline.
Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation & synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) drive a pulse train (voiced) / white noise (unvoiced) excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
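The reconstruction on this slide can be sketched end-to-end; a toy source-filter example (numpy) using a pulse-train excitation and a hypothetical decaying impulse response standing in for the real MLSA-type filters listed next:

```python
import numpy as np

fs = 16000
f0 = 200.0                       # generated F0 for a voiced frame (hypothetical)
n = np.arange(fs // 100)         # 10 ms of samples

# Excitation e(n): pulse train when voiced, white noise when unvoiced
period = int(fs / f0)
e = np.zeros_like(n, dtype=float)
e[::period] = 1.0                # voiced: one pulse per pitch period
# e = np.random.default_rng(0).standard_normal(len(n))  # unvoiced alternative

# Toy linear time-invariant system h(n); a real vocoder derives this from the
# generated cepstral/LSP parameters (e.g. via an MLSA filter)
h = 0.97 ** np.arange(64)

x = np.convolve(e, h)[: len(n)]  # x(n) = h(n) * e(n)
print(len(x))
```

Swapping the pulse train for white noise in unvoiced frames is the hard V/UV switch whose limitations motivate the mixed-excitation vocoders discussed later.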

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

  − Flexibility to change voice characteristics (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback

  − Quality

• Major factors for quality degradation [3]

  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  • HMM-based statistical parametric speech synthesis (SPSS)
  • Flexibility
  • Improvements

Statistical parametric speech synthesis with neural networks
  • Deep neural network (DNN)-based SPSS
  • Deep mixture density network (DMDN)-based SPSS
  • Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: average-voice model; adaptive training on training speakers, then adaptation to target speakers]

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Figure: HMM sets λ1 … λ4 interpolated into λ′ with interpolation ratios I(λ′, λ1) … I(λ′, λ4)]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  • HMM-based statistical parametric speech synthesis (SPSS)
  • Flexibility
  • Improvements

Statistical parametric speech synthesis with neural networks
  • Deep neural network (DNN)-based SPSS
  • Deep mixture density network (DMDN)-based SPSS
  • Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)

[Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction
  Harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz with harmonic ripple]

• Phase
  Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic/stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of these assumptions holds for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical models

  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical models

  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration models

  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectra over 0–8 kHz; generated spectra lose fine formant structure]

• Why?

  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering

  − Mel-cepstrum
  − LSP

• Nonparametric approaches

  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combining multiple-level statistics

  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

  − Flexibility to change voice characteristics
    (adaptation; interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback

  − Quality

• Major factors for quality degradation [3]

  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  • HMM-based statistical parametric speech synthesis (SPSS)
  • Flexibility
  • Improvements

Statistical parametric speech synthesis with neural networks
  • Deep neural network (DNN)-based SPSS
  • Deep mixture density network (DMDN)-based SPSS
  • Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training
  Learn the relationship between linguistic & acoustic features

• Synthesis
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS

  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree over the acoustic space; yes/no questions cluster states, each leaf holding a GMM]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

[Figure: DNN-based SPSS framework; TEXT → text analysis → input feature extraction (binary & numeric features, duration and frame-position features at frames 1 … T, with duration prediction) → DNN (input, hidden, output layers) → statistics (mean & variance) of the speech parameter vector sequence (spectral, excitation, and V/UV features) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrated feature extraction

  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → feature extraction integrated into acoustic modeling

• Distributed representation

  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production

  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker

Training / test data: 33,000 & 173 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: 11 categorical features, 25 numeric features

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²

HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]

DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions & numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500; natural speech vs. HMM (α = 1) vs. DNN (4 × 512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

HMM 15.8 (16 mix) vs. DNN 38.5 (4 × 256), Neutral 45.7 (p < 10⁻⁶, z = −9.9)
HMM 16.1 (4 mix) vs. DNN 27.2 (4 × 512), Neutral 56.8 (p < 10⁻⁶, z = −5.1)
HMM 12.7 (1 mix) vs. DNN 36.6 (4 × 1024), Neutral 50.7 (p < 10⁻⁶, z = −11.5)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  • HMM-based statistical parametric speech synthesis (SPSS)
  • Flexibility
  • Improvements

Statistical parametric speech synthesis with neural networks
  • Deep neural network (DNN)-based SPSS
  • Deep mixture density network (DMDN)-based SPSS
  • Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: scatter of data samples from a one-to-many mapping in (y1, y2); the NN prediction falls between the modes]

• Unimodality

  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with an MSE loss approximates the conditional mean

• Lack of variance

  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix mixture density network; hidden activations h_i feed outputs z_j = Σ_{i=1}^{4} h_i w_{ij}, which parameterize mixture weights w_1(x), w_2(x), means μ_1(x), μ_2(x), and variances σ_1(x), σ_2(x) of a GMM over y]

• Weights → softmax activation function

• Means → linear activation function

• Variances → exponential activation function

For a 1-dim, 2-mix MDN:

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m),  μ_1(x) = z_3,  σ_1(x) = exp(z_5)
w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m),  μ_2(x) = z_4,  σ_2(x) = exp(z_6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
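The three activation functions above map raw network outputs z to valid GMM parameters; a minimal sketch of the 1-dim, 2-mix case (numpy, hypothetical z values):

```python
import numpy as np

def mdn_params(z):
    """Map 6 network outputs z to a 1-dim, 2-mixture GMM (w, mu, sigma).

    z[0:2] -> mixture weights (softmax), z[2:4] -> means (linear),
    z[4:6] -> standard deviations (exponential).
    """
    ez = np.exp(z[0:2] - z[0:2].max())     # numerically stable softmax
    w = ez / ez.sum()                      # weights sum to 1
    mu = z[2:4]                            # linear activation
    sigma = np.exp(z[4:6])                 # exponential keeps sigma > 0
    return w, mu, sigma

w, mu, sigma = mdn_params(np.array([0.2, -0.4, 1.5, -2.0, 0.0, -1.0]))
print(w, mu, sigma)
```

The softmax and exponential guarantee a well-formed density for any z, so the whole network can be trained by maximizing the GMM log-likelihood of the observed acoustic features.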

DMDN-based SPSS [27]

[Figure: DMDN-based SPSS pipeline; TEXT → text analysis → input feature extraction → duration prediction → for each frame t, the DMDN maps x_t to GMM parameters w_1(x_t), w_2(x_t), μ_1(x_t), μ_2(x_t), σ_1(x_t), σ_2(x_t) over y → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)

DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix

Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM, 1 mix: 3.537 ± 0.113
HMM, 2 mix: 3.397 ± 0.115
DNN, 4 × 1024: 3.635 ± 0.127
DNN, 5 × 1024: 3.681 ± 0.109
DNN, 6 × 1024: 3.652 ± 0.108
DNN, 7 × 1024: 3.637 ± 0.129
DMDN (4 × 1024), 1 mix: 3.654 ± 0.117
DMDN (4 × 1024), 2 mix: 3.796 ± 0.107
DMDN (4 × 1024), 4 mix: 3.766 ± 0.113
DMDN (4 × 1024), 8 mix: 3.805 ± 0.113
DMDN (4 × 1024), 16 mix: 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  • HMM-based statistical parametric speech synthesis (SPSS)
  • Flexibility
  • Improvements

Statistical parametric speech synthesis with neural networks
  • Deep neural network (DNN)-based SPSS
  • Deep mixture density network (DMDN)-based SPSS
  • Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features

  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled in time — inputs x_{t−1}, x_t, x_{t+1} map to outputs y_{t−1}, y_t, y_{t+1} through hidden units with recurrent connections]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts

  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell c_t with input gate (write), output gate (read), and forget gate (reset); sigmoid gate activations and tanh input/output nonlinearities over x_t and h_{t−1}]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
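The gating structure above can be sketched in a few lines of numpy. This is a generic (peephole-free) LSTM cell, not necessarily the exact variant used in the experiments, and all names and shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W stacks the input, forget, cell-input and output
    gate weights, each acting on [x_t; h_{t-1}]; c is the linear memory cell."""
    n = c_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[:n])              # input gate  (write)
    f = sigmoid(z[n:2 * n])         # forget gate (reset)
    g = np.tanh(z[2 * n:3 * n])     # candidate cell input
    o = sigmoid(z[3 * n:])          # output gate (read)
    c_t = f * c_prev + i * g        # cell state: linear self-loop + gated write
    h_t = o * np.tanh(c_t)          # gated read of the cell
    return h_t, c_t
```

The linear self-loop `f * c_prev` is what lets gradients survive over long time spans, addressing the decay problem of the basic RNN.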

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction (x_1, x_2, …, x_T) → duration prediction → LSTM → (y_1, y_2, …, y_T) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

  Database             US English female speaker
  Train / dev set data 34,632 & 100 sentences
  Sampling rate        16 kHz
  Analysis window      25-ms width, 5-ms shift
  Linguistic features  DNN: 449; LSTM: 289
  Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
  DNN                  4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
  LSTM                 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
  Postprocessing       Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

  DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral     z       p
    50.0       14.2          –           –          35.8      12.0   < 10⁻¹⁰
     –          –           30.2        15.6        54.2       5.1   < 10⁻⁶
    15.8        –           34.0         –          50.2      −6.2   < 10⁻⁹
    28.4        –            –          33.6        38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Stream-dependent tree-based clustering

[Figure: separate decision trees for mel-cepstrum and for F0]

Spectrum & excitation can have different context dependency
→ Build decision trees individually

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 21 of 79

State duration models [8]

[Figure: trellis over t = 1 … T = 8 — entering state i at t_0 and leaving at t_1 + 1]

Probability of entering state i at t_0 and then leaving at t_1 + 1:

\chi_{t_0,t_1}(i) \propto \sum_{j \neq i} \alpha_{t_0-1}(j)\, a_{ji}\, a_{ii}^{t_1-t_0} \prod_{t=t_0}^{t_1} b_i(o_t) \sum_{k \neq i} a_{ik}\, b_k(o_{t_1+1})\, \beta_{t_1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79
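As a sketch, the statistic above can be computed directly; here α and β are the usual forward/backward variables, a the transition probabilities, and b[t, j] the state-output likelihoods (all toy values, with 0-based frame indices):

```python
import numpy as np

def chi(alpha, beta, a, b, i, t0, t1):
    """Unnormalized probability of entering state i at t0 and leaving at
    t1 + 1, written term-by-term after the formula above."""
    N = a.shape[0]
    # enter state i at t0 from any other state j
    enter = sum(alpha[t0 - 1, j] * a[j, i] for j in range(N) if j != i)
    # stay in state i for frames t0..t1 (t1 - t0 self-transitions)
    stay = a[i, i] ** (t1 - t0) * np.prod(b[t0:t1 + 1, i])
    # leave to any other state k at t1 + 1
    leave = sum(a[i, k] * b[t1 + 1, k] * beta[t1 + 1, k] for k in range(N) if k != i)
    return enter * stay * leave
```

Accumulating these statistics over all (t_0, t_1) pairs is what lets the state duration distributions be estimated.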

Stream-dependent tree-based clustering

[Figure: HMM with separate decision trees for mel-cepstrum, for F0, and a decision tree for the state duration models]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Figure: Training part — speech database → spectral & excitation parameter extraction → HMM training with labels → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → synthesized speech]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

\hat{o} = \arg\max_{o} p(o \mid w, \lambda)
        = \arg\max_{o} \sum_{\forall q} p(o, q \mid w, \lambda)
        \approx \arg\max_{o} \max_{q} p(o, q \mid w, \lambda)
        = \arg\max_{o} \max_{q} p(o \mid q, \lambda)\, P(q \mid w, \lambda)

Determine the best state sequence and outputs sequentially:

\hat{q} = \arg\max_{q} P(q \mid w, \lambda)
\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \lambda)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: 3-state left-to-right HMM (initial probability π_1; transitions a_11, a_12, a_22, a_23, a_33; output distributions b_1(o_t), b_2(o_t), b_3(o_t)) aligned to observation sequence O = o_1, o_2, …, o_T; best state sequence Q = 1 1 1 1 2 2 3 3 …, state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: per-state means and variances — \hat{o} becomes a step-wise mean vector sequence]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79
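The step-wise behaviour is easy to illustrate: without dynamic features, the most likely output in each state is simply that state's mean, repeated for the state's duration (an illustrative sketch; means and durations are made-up):

```python
import numpy as np

def stepwise_trajectory(means, durations):
    """Repeat each state's mean vector for its duration, giving the
    piece-wise constant trajectory sketched above."""
    return np.concatenate([np.tile(m, (d, 1)) for m, d in zip(means, durations)])

# e.g. three states with durations D = 4, 10, 5 as in the previous figure
traj = stepwise_trajectory(np.array([[0.0], [1.0], [0.5]]), [4, 10, 5])
```

The discontinuities at each state boundary are exactly what the dynamic feature constraints on the next slides smooth away.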

Using dynamic features

State output vectors include static & dynamic features:

o_t = \left[ c_t^\top, \Delta c_t^\top \right]^\top, \quad \Delta c_t = c_t - c_{t-1}

(c_t: M-dimensional static features; o_t: 2M-dimensional)

The relationship between static and dynamic features can be arranged as o = Wc:

\begin{pmatrix} \vdots \\ c_{t-1} \\ \Delta c_{t-1} \\ c_t \\ \Delta c_t \\ c_{t+1} \\ \Delta c_{t+1} \\ \vdots \end{pmatrix}
=
\begin{pmatrix}
\cdots & \vdots & \vdots & \vdots & \vdots & \cdots \\
\cdots & 0 & I & 0 & 0 & \cdots \\
\cdots & -I & I & 0 & 0 & \cdots \\
\cdots & 0 & 0 & I & 0 & \cdots \\
\cdots & 0 & -I & I & 0 & \cdots \\
\cdots & 0 & 0 & 0 & I & \cdots \\
\cdots & 0 & 0 & -I & I & \cdots \\
\cdots & \vdots & \vdots & \vdots & \vdots & \cdots
\end{pmatrix}
\begin{pmatrix} \vdots \\ c_{t-2} \\ c_{t-1} \\ c_t \\ c_{t+1} \\ \vdots \end{pmatrix}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
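A minimal construction of W for 1-dimensional static features, assuming c_0 = 0 at the boundary (the boundary handling is an assumption; real systems also stack a second-order window for ∆²):

```python
import numpy as np

def delta_window_matrix(T):
    """(2T x T) matrix mapping static c_1..c_T to stacked [c_t; delta c_t]
    with delta c_t = c_t - c_{t-1}, matching the band structure above."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row selects c_t
        W[2 * t + 1, t] = 1.0           # delta row: +c_t
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0  # delta row: -c_{t-1}
    return W
```

For example, `delta_window_matrix(3) @ [1, 2, 4]` stacks statics and deltas as `[1, 1, 2, 1, 4, 2]`.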

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

\hat{o} = \arg\max_{o} p(o \mid \hat{q}, \lambda) \quad \text{subject to} \quad o = Wc

If the state-output distribution is a single Gaussian,

p(o \mid \hat{q}, \lambda) = \mathcal{N}(o; \mu_{\hat{q}}, \Sigma_{\hat{q}})

then setting \partial \log \mathcal{N}(Wc; \mu_{\hat{q}}, \Sigma_{\hat{q}}) / \partial c = 0 gives

W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}}

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

[Figure: the banded linear system W^\top \Sigma_{\hat{q}}^{-1} W c = W^\top \Sigma_{\hat{q}}^{-1} \mu_{\hat{q}} — the band of 0 / 1 / −1 blocks in W and the stacked means \mu_{q_1}, \mu_{q_2}, \ldots, \mu_{q_T} determine the trajectory c_1, c_2, \ldots, c_T]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
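The closed-form solution above is just a linear system. A minimal dense sketch for 1-dimensional features with diagonal covariances follows; production systems instead exploit the band structure with a banded Cholesky solver, and the c_0 = 0 boundary here is an illustrative assumption:

```python
import numpy as np

def mlpg(mu, var, T):
    """Solve W^T Sigma^-1 W c = W^T Sigma^-1 mu for the static trajectory c.
    mu, var: (2T,) stacked per-frame [static, delta] means and variances
    along the best state path; delta c_t = c_t - c_{t-1} with c_0 = 0."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    P = W.T * (1.0 / var)                # W^T Sigma^-1 (diagonal Sigma)
    return np.linalg.solve(P @ W, P @ mu)
```

When the delta variances are huge (uninformative), the solution collapses to the step-wise static means; tight delta variances pull neighbouring frames together, producing the smooth trajectory in the next figure.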

Generated speech parameter trajectory

[Figure: static and dynamic mean/variance sequences with the generated smooth trajectory c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: Training part — speech database → spectral & excitation parameter extraction → HMM training with labels → context-dependent HMMs & state duration models. Synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → synthesized speech]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: source-filter model — generated excitation parameters (log F0 with V/UV) switch between pulse train and white noise to form the excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
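A toy numpy sketch of the source-filter reconstruction: pulse-train or noise excitation passed through a linear time-invariant filter. The one-pole filter is an arbitrary stand-in; a real system would derive h(n) from the generated cepstra (e.g., via an MLSA filter):

```python
import numpy as np

fs = 16000
n = fs // 2                                    # half a second of excitation
f0 = 120.0

e = np.zeros(n)
e[: n // 2 : int(round(fs / f0))] = 1.0        # voiced half: pulse train at F0
rng = np.random.default_rng(0)
e[n // 2 :] = 0.1 * rng.standard_normal(n - n // 2)  # unvoiced half: white noise

# x(n) = h(n) * e(n): a toy one-pole filter stands in for the spectral envelope
x = np.zeros(n)
for i in range(n):
    x[i] = e[i] + 0.95 * (x[i - 1] if i > 0 else 0.0)
```

Switching the excitation source per frame using the generated V/UV decision is exactly the "simple pulse/noise excitation" whose limitations are discussed under vocoding issues.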

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback

  − Quality

• Major factors for quality degradation [3]

  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: adaptive training of an average-voice model from training speakers, then adaptation to target speakers]

• Train average voice model (AVM) from training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker / speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: HMM sets λ_1 … λ_4 interpolated with ratios I(λ′, λ_k) to form a new voice λ′]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
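One simple instance of the idea (a sketch, not the exact scheme of [14]): combine the Gaussian state-output distributions of several HMM sets using the interpolation ratios I(λ′, λ_k):

```python
import numpy as np

def interpolate_gaussians(means, variances, ratios):
    """Ratio-weighted combination of corresponding state-output means and
    variances from several HMM sets; ratios are normalized to sum to one."""
    r = np.asarray(ratios, dtype=float)
    r = r / r.sum()
    mean = sum(ri * np.asarray(m) for ri, m in zip(r, means))
    var = sum(ri * np.asarray(v) for ri, v in zip(r, variances))
    return mean, var
```

Varying the ratios moves the synthesized voice continuously between the speakers or styles represented by the component models.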

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model mixes of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between pulse train (voiced) and white noise (unvoiced)]

• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple]

• Phase
  Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
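The weak-duration point can be made concrete: with self-transition probability a_ii, the probability of staying in a state for exactly d frames is geometric, p(d) = a_ii^{d−1}(1 − a_ii), so it decays exponentially regardless of the phone's real duration statistics (a_ii = 0.8 here is an arbitrary example value):

```python
import numpy as np

a_ii = 0.8
d = np.arange(1, 11)
p = a_ii ** (d - 1) * (1 - a_ii)   # geometric duration distribution
# p is strictly decreasing: the mode is always d = 1 frame,
# unlike real phone durations, which peak at several frames
```

This is the mismatch that explicit duration models (hidden semi-Markov models, next slide) are designed to remove.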

Better acoustic modeling

• Piece-wise constant statistics → dynamical model

  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model

  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model

  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs generated spectra (0–8 kHz) — the generated spectra lose the fine formant structure]

• Why?

  − Details of the spectral (formant) structure disappear
  − Using a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering

  − Mel-cepstrum
  − LSP

• Nonparametric approach

  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics

  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

  − Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback

  − Quality

• Major factors for quality degradation [3]

  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training
  Learn the relationship between linguistic & acoustic features

• Synthesis
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS

  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, far more than in ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees of yes/no questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h_1, h_2, h_3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
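A minimal forward pass of the mapping sketched above — sigmoid hidden layers and a linear output, as in the first DNN experiments below — with illustrative shapes (the real systems use far larger layers):

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Map a linguistic feature vector x to acoustic features y:
    sigmoid hidden layers followed by a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden layers
    return weights[-1] @ h + biases[-1]           # linear output layer
```

Training minimizes the mean squared error between predicted and extracted acoustic features, frame by frame.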

Framework

[Figure: TEXT → text analysis → input feature extraction — input features including binary & numeric features, duration and frame-position features, at frames 1 … T → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame-position features]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrated feature extraction

  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → feature extraction integrated into acoustic modeling

• Distributed representation

  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production

  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

  Database              US English female speaker
  Training / test data  33,000 & 173 sentences
  Sampling rate         16 kHz
  Analysis window       25-ms width, 5-ms shift
  Linguistic features   11 categorical features, 25 numeric features
  Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
  HMM topology          5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
  DNN architecture      1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
  Postprocessing        Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 — natural speech vs HMM (α = 1) vs DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

  HMM (α)      DNN (layers × units)   Neutral   p value   z value
  15.8 (16)    38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
  16.1 (4)     27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
  12.7 (1)     36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples vs NN prediction in (y_1, y_2) space — the prediction sits at the conditional mean, between the modes]

• Unimodality

  − Humans can speak in different ways → one-to-many mapping
  − NN trained with an MSE loss → approximates the conditional mean

• Lack of variance

  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
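The unimodality problem is easy to demonstrate numerically (a toy one-to-many target, not data from the experiments): with targets split evenly between −1 and +1, the MSE-optimal prediction is their mean, 0, which lies off both modes:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=10_000)   # one-to-many: two equally likely modes
mse = lambda pred: float(np.mean((y - pred) ** 2))

# the conditional mean (0.0) wins on MSE, yet matches neither mode
assert mse(0.0) < mse(1.0) and mse(0.0) < mse(-1.0)
```

A mixture density output layer avoids this by predicting the full multimodal distribution rather than a single point estimate.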


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the network's output units z_1 … z_6 are transformed into mixture weights, means, and variances]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = \sum_{i=1}^{4} h_i w_{ij}

w_1(x) = \frac{\exp(z_1)}{\sum_{m=1}^{2} \exp(z_m)}, \quad w_2(x) = \frac{\exp(z_2)}{\sum_{m=1}^{2} \exp(z_m)}

\mu_1(x) = z_3, \quad \mu_2(x) = z_4

\sigma_1(x) = \exp(z_5), \quad \sigma_2(x) = \exp(z_6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
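The output-layer transforms above, and the mixture negative log-likelihood used for training, can be sketched for the 1-dim case (generic MDN math; the layout of the raw outputs z is illustrative):

```python
import numpy as np

def mdn_params(z, n_mix):
    """Split raw network outputs z into [weights | means | log-sigmas] and
    apply the activations above: softmax, identity, exp."""
    zw, mu, zs = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:]
    w = np.exp(zw - zw.max())
    w = w / w.sum()                  # softmax -> mixture weights
    return w, mu, np.exp(zs)         # exp -> positive standard deviations

def mdn_nll(y, w, mu, sigma):
    """Negative log-likelihood of a scalar target y under the mixture;
    minimizing this over the training data trains the MDN."""
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(comp.sum())
```

Because each frame gets its own weights, means, and variances, the parameter generation algorithm can use predicted (rather than global) variances.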

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction (x_1, x_2, …, x_T) → duration prediction → DMDN emitting per-frame mixture parameters w_i(x_t), μ_i(x_t), σ_i(x_t) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)

DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)

1ndash16 mix

Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM       1 mix: 3.537 ± 0.113    2 mix: 3.397 ± 0.115
DNN       4×1024: 3.635 ± 0.127   5×1024: 3.681 ± 0.109
          6×1024: 3.652 ± 0.108   7×1024: 3.637 ± 0.129
DMDN      1 mix: 3.654 ± 0.117    2 mix: 3.796 ± 0.107
(4×1024)  4 mix: 3.766 ± 0.113    8 mix: 3.805 ± 0.113    16 mix: 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features

− A fixed number of preceding/succeeding contexts
  (e.g. ±2 phonemes/syllable stress) are used as inputs
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: a basic RNN unrolled in time; inputs x_{t−1}, x_t, x_{t+1} feed hidden units with recurrent connections, producing outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in hidden layers loops through recurrent connections
  → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
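The unidirectional recurrence on the slide can be sketched in a few lines of numpy; the hidden state h carries information from previous frames only, which is exactly the limitation the bidirectional RNN addresses. All weight names and sizes here are illustrative.

```python
import numpy as np

def rnn_forward(xs, Wxh, Whh, Why, bh, by):
    """Basic (unidirectional) RNN: frame-by-frame, left to right,
    with a tanh hidden layer fed by the input and the previous state."""
    h = np.zeros(Whh.shape[0])
    ys = []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h + bh)   # recurrent connection
        ys.append(Why @ h + by)               # per-frame output
    return np.array(ys)

rng = np.random.default_rng(1)
n_in, n_hid, n_out, T = 2, 5, 3, 4
ys = rnn_forward(rng.standard_normal((T, n_in)),
                 rng.standard_normal((n_hid, n_in)) * 0.1,
                 rng.standard_normal((n_hid, n_hid)) * 0.1,
                 rng.standard_normal((n_out, n_hid)) * 0.1,
                 np.zeros(n_hid), np.zeros(n_out))
```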

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block. A linear memory cell c_t with tanh input/output nonlinearities is controlled by sigmoid gates driven by x_t, h_{t−1} and biases b_i, b_f, b_o, b_c: the input gate (write), the forget gate (reset) and the output gate (read), producing the hidden output h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
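A minimal numpy sketch of one step of a standard LSTM cell with the gate structure shown on the slide (no peepholes): sigmoid input/forget/output gates around a linear memory cell with tanh nonlinearities. The parameter names and sizes are illustrative, not from the talk.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step. p holds the gate weight matrices (acting on the
    concatenated [x, h_prev]) and bias vectors."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(p["Wi"] @ z + p["bi"])   # input gate  (write)
    f = sigmoid(p["Wf"] @ z + p["bf"])   # forget gate (reset)
    o = sigmoid(p["Wo"] @ z + p["bo"])   # output gate (read)
    g = np.tanh(p["Wc"] @ z + p["bc"])   # candidate cell update
    c = f * c_prev + i * g               # linear memory cell
    h = o * np.tanh(c)                   # hidden output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
p = {k: rng.standard_normal((n_hid, n_in + n_hid)) * 0.1
     for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({k: np.zeros(n_hid) for k in ("bi", "bf", "bo", "bc")})
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), p)
```

Because the cell update is additive (f * c_prev + i * g), gradients can flow over many frames without the rapid decay of the basic RNN.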

LSTM-based SPSS [33 34]

[Figure: LSTM-based pipeline. Text analysis, input feature extraction and duration prediction produce per-frame linguistic features x1, x2, …, xT; the LSTM maps them to acoustic features y1, y2, …, yT; parameter generation and waveform synthesis produce SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference (%):
DNN w/ Δ   DNN w/o Δ   LSTM w/ Δ   LSTM w/o Δ   Neutral     z        p
  50.0       14.2          –           –          35.8      12.0   < 10⁻¹⁰
   –          –           30.2        15.6        54.2       5.1   < 10⁻⁶
  15.8        –           34.0         –          50.2      −6.2   < 10⁻⁹
  28.4        –            –          33.6        38.0      −1.5     0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g. adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learns a mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

State duration models [8]

[Figure: state i occupied from t0 to t1 on a timeline t = 1 … T = 8]

Probability to enter state i at t0 and then leave at t1 + 1:

χ_{t0,t1}(i) ∝ Σ_{j≠i} α_{t0−1}(j) a_{ji} · a_ii^{t1−t0} Π_{t=t0}^{t1} b_i(o_t) · Σ_{k≠i} a_{ik} b_k(o_{t1+1}) β_{t1+1}(k)

→ estimate state duration models

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 22 of 79

Stream-dependent tree-based clustering

[Figure: an HMM with separate decision trees for mel-cepstrum and for F0 at each state, plus a decision tree for the state duration models]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Figure: HMM-based speech synthesis pipeline. Training part: spectral and excitation parameters are extracted from the SPEECH DATABASE and, together with labels, used for training context-dependent HMMs & state duration models. Synthesis part: TEXT is analyzed into labels, spectral and excitation parameters are generated from the HMMs, and excitation generation followed by a synthesis filter produces SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = argmax_o p(o | w, λ)
  = argmax_o Σ_{∀q} p(o, q | w, λ)
  ≈ argmax_o max_q p(o, q | w, λ)
  = argmax_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = argmax_q P(q | w, λ)

ô = argmax_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: a 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33 and output distributions b1(o_t), b2(o_t), b3(o_t); the observation sequence O = o1 o2 … oT is aligned to the state sequence Q = 1 1 1 1 2 2 3 3 …, giving state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: per-state means and variances over time]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^T, Δc_t^T]^T,   Δc_t = c_t − c_{t−1}

(c_t: M-dim static features, o_t: 2M-dim)

The relationship between static and dynamic features can be arranged in matrix form, o = Wc:

  [ … c_{t−1} Δc_{t−1} c_t Δc_t c_{t+1} Δc_{t+1} … ]^T
  =
  [ …  0  I  0  0 … ]
  [ … −I  I  0  0 … ]
  [ …  0  0  I  0 … ]   [ … c_{t−2} c_{t−1} c_t c_{t+1} … ]^T
  [ …  0 −I  I  0 … ]
  [ …  0  0  0  I … ]
  [ …  0  0 −I  I … ]
          W

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
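The band structure of W is easy to see in code. Here is a small numpy sketch that builds W for 1-dim static features and the simple delta Δc_t = c_t − c_{t−1} used on the slide, taking c_0 = 0 at the boundary; the boundary convention is an assumption for illustration.

```python
import numpy as np

def delta_window_matrix(T):
    """Build W so that o = W c stacks [c_t, delta c_t] per frame,
    with delta c_t = c_t - c_{t-1} and c_0 taken as zero."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0           # static row:  c_t
        W[2 * t + 1, t] = 1.0       # delta row:   c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

W = delta_window_matrix(3)
c = np.array([1.0, 3.0, 2.0])
o = W @ c   # [c1, dc1, c2, dc2, c3, dc3] = [1, 1, 3, 2, 2, -1]
```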

Speech parameter generation algorithm [9]

Introduce the dynamic feature constraints:

ô = argmax_o p(o | q̂, λ)   subject to o = Wc

If the state-output distribution is a single Gaussian:

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

Setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0 gives

W^T Σ_q̂^{−1} W c = W^T Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

[Figure: the banded linear system W^T Σ_q^{−1} W c = W^T Σ_q^{−1} μ_q written out elementwise, with the 0/1/−1 band structure of W, the diagonal Σ_q^{−1}, the state means μ_q1 … μ_qT and the static features c1 … cT]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
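A minimal numpy sketch of this generation step: given per-frame means and diagonal variances over the [static; delta] observation vector, solve W^T Σ^{−1} W c = W^T Σ^{−1} μ for the static trajectory c. W is built inline for 1-dim features with Δc_t = c_t − c_{t−1} (c_0 = 0); a real implementation would exploit the banded structure (e.g. Cholesky on the band) rather than a dense solve, and the numbers below are purely illustrative.

```python
import numpy as np

def mlpg(mu, var, T):
    """Solve W^T Sigma^{-1} W c = W^T Sigma^{-1} mu for the static
    trajectory c (1-dim features, simple deltas, diagonal covariance)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    P = W.T @ np.diag(1.0 / var)            # W^T Sigma^{-1}
    return np.linalg.solve(P @ W, P @ mu)   # c = (W^T Sigma^{-1} W)^{-1} W^T Sigma^{-1} mu

# Step-wise static means [0, 0, 1] with tight delta variances (preferring
# delta ~ 0) pull the generated trajectory toward a smooth transition:
mu = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0])    # [c1, dc1, c2, dc2, c3, dc3]
var = np.array([1.0, 0.01, 1.0, 0.01, 1.0, 0.01])
c = mlpg(mu, var, 3)
```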

Generated speech parameter trajectory

[Figure: generated static and dynamic speech parameter trajectories c, with the per-state means and variances overlaid]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: the HMM-based speech synthesis pipeline again, highlighting the synthesis part: text analysis into labels, parameter generation from the HMMs, excitation generation, and the synthesis filter producing SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: source-filter waveform reconstruction. The generated excitation parameters (log F0 with V/UV) switch between a pulse train and white noise to form the excitation e(n); the generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); the synthesized speech is x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics (adaptation, interpolation)
− Small footprint [10, 11]
− Robustness [12]

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: adaptive training of an average-voice model on training speakers, followed by adaptation to target speakers]

• Train an average voice model (AVM) from training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
→ small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Figure: a new HMM set λ′ obtained by interpolating among representative HMM sets λ1 … λ4 with interpolation ratios I(λ′, λk)]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)

[Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]

• Spectral envelope extraction
Harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz showing harmonic structure]

• Phase
Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic/stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
Statistics do not vary within an HMM state

• Conditional independence assumption
State output probability depends only on the current state

• Weak duration modeling
State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model

− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model

− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model

− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: spectrograms (0–8 kHz) of natural vs. generated speech]

• Why?

− Details of spectral (formant) structure disappear
− Using a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering

− Mel-cepstrum
− LSP

• Nonparametric approach

− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics

− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics
  (adaptation; interpolation: eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training
Learn the relationship between linguistic & acoustic features

• Synthesis
Map linguistic features to acoustic ones

• Linguistic features used in SPSS

− Phoneme, syllable, word, phrase, utterance-level features
− e.g. phone identity, POS, stress of words in a phrase
− Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: a binary decision tree of yes/no questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: a DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x

• The DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
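The frame-wise mapping can be sketched as a plain feed-forward pass: affine + sigmoid hidden layers and a linear output layer, matching the sigmoid-hidden/linear-output architecture described later in the experiments. Layer sizes and weights here are illustrative, not the talk's actual configuration.

```python
import numpy as np

def dnn_forward(x, layers):
    """Map one frame of linguistic features x to acoustic features.
    `layers` is a list of (W, b) pairs; all but the last use sigmoid."""
    h = x
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden layer
    W, b = layers[-1]
    return W @ h + b                             # linear output layer

rng = np.random.default_rng(2)
sizes = [10, 32, 32, 5]   # e.g. 10 linguistic inputs -> 5 acoustic outputs
layers = [(rng.standard_normal((o, i)) * 0.1, np.zeros(o))
          for i, o in zip(sizes[:-1], sizes[1:])]
y = dnn_forward(rng.standard_normal(10), layers)
```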

Framework

[Figure: DNN-based TTS framework. Text analysis and input feature extraction turn TEXT into frame-level input features (binary & numeric linguistic features, plus duration and frame-position features) for frames 1 … T; the DNN (input, hidden and output layers) predicts the statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature); parameter generation and waveform synthesis produce SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
  → integrates feature extraction into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than a fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

(w/o grouping questions, numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α = 1), and DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters.

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference (%):
HMM (α)     DNN (layers × units)   Neutral   p value    z value
15.8 (16)   38.5 (4 × 256)          45.7     < 10⁻⁶     −9.9
16.1 (4)    27.2 (4 × 512)          56.8     < 10⁻⁶     −5.1
12.7 (1)    36.6 (4 × 1024)         50.7     < 10⁻⁶     −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples forming two branches in (y1, y2) space, with the NN prediction running between them]

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− NN trained by MSE loss → approximates the conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)w2(x1)

y

Weights rarr Softmax activation function

Means rarr Linear activation function

Variances rarr Exponential activation function

sumzj =

4

i=1

hiwij

w1(x) =exp(z1)sum2

m=1 exp(zm)

micro1(x) = z3

σ1(x) = exp(z5)

Inputs of activation function

1-dim 2-mix MDN

w2(x) =exp(z2)sum2

m=1 exp(zm)

micro1(x) = z4

σ2(x) = exp(z6)

NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

[Figure: DMDN-based SPSS — at each frame t the network maps input x_t to frame-level GMM parameters w1(x_t), w2(x_t), μ1(x_t), μ2(x_t), σ1(x_t), σ2(x_t); pipeline: TEXT → text analysis → input feature extraction → duration prediction → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:
  DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
  DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
  Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM            1 mix     3.537 ± 0.113
HMM            2 mix     3.397 ± 0.115
DNN            4×1024    3.635 ± 0.127
DNN            5×1024    3.681 ± 0.109
DNN            6×1024    3.652 ± 0.108
DNN            7×1024    3.637 ± 0.129
DMDN (4×1024)  1 mix     3.654 ± 0.117
DMDN (4×1024)  2 mix     3.796 ± 0.107
DMDN (4×1024)  4 mix     3.766 ± 0.113
DMDN (4×1024)  8 mix     3.805 ± 0.113
DMDN (4×1024)  16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

[Figure: RNN — input x_t, output y_t, with recurrent connections carrying the hidden state across time steps t−1, t, t+1]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → decays quickly over time
  − Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
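A minimal sketch of the unidirectional RNN just described makes the first limitation concrete: the hidden state at frame t can only carry information from frames 1..t. Shapes and weights here are illustrative placeholders, not the talk's configuration.

```python
import numpy as np

# Minimal Elman-style RNN forward pass: the hidden state h is updated
# frame by frame, left to right, so only past contexts are available.

rng = np.random.default_rng(0)
D_x, D_h, D_y = 3, 4, 2
W_xh = rng.normal(size=(D_h, D_x)) * 0.1   # input -> hidden
W_hh = rng.normal(size=(D_h, D_h)) * 0.1   # recurrent connections
W_hy = rng.normal(size=(D_y, D_h)) * 0.1   # hidden -> output

def rnn_forward(xs):
    h = np.zeros(D_h)
    ys = []
    for x in xs:                           # frame-by-frame
        h = np.tanh(W_xh @ x + W_hh @ h)   # hidden state carries past context
        ys.append(W_hy @ h)
    return np.array(ys)

ys = rnn_forward(rng.normal(size=(5, D_x)))
```

A bidirectional RNN would run a second pass right-to-left and concatenate both hidden states per frame, giving access to future contexts as well.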

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell c_t with input gate (write), output gate (read), and forget gate (reset); gates use sigmoid activations, cell input and output use tanh; gate inputs are x_t, h_{t−1}, and biases b_i, b_f, b_o, b_c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
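A single step of the gated cell in the figure can be sketched as follows — a linear memory update guarded by input ("write"), forget ("reset"), and output ("read") gates. The weights are random placeholders; only the gating structure follows the slide.

```python
import numpy as np

# One LSTM cell step (Hochreiter & Schmidhuber 1997), toy dimensions.

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    # W maps [x; h_prev] to the four stacked gate pre-activations.
    z = W @ np.concatenate([x, h_prev]) + b
    D = h_prev.size
    i = sigmoid(z[0:D])           # input gate  ("write")
    f = sigmoid(z[D:2*D])         # forget gate ("reset")
    o = sigmoid(z[2*D:3*D])       # output gate ("read")
    g = np.tanh(z[3*D:4*D])       # candidate cell input
    c = f * c_prev + i * g        # linear memory update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(1)
D_x, D_h = 3, 4
W = rng.normal(size=(4 * D_h, D_x + D_h)) * 0.1
b = np.zeros(4 * D_h)
h, c = np.zeros(D_h), np.zeros(D_h)
for x in rng.normal(size=(10, D_x)):
    h, c = lstm_step(x, h, c, W, b)
```

Because c is updated additively (f·c + i·g) rather than squashed through a nonlinearity at every step, gradients and stored information decay far more slowly than in the basic RNN above.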

LSTM-based SPSS [33 34]

[Figure: LSTM-based SPSS — TEXT → text analysis → input feature extraction → duration prediction → LSTM maps x1, x2, …, xT to y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker

Training / dev data: 34,632 & 100 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: DNN 449, LSTM 289

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)

DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU

LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

      DNN             LSTM
w/ Δ   w/o Δ    w/ Δ   w/o Δ    Neutral     z        p
50.0   14.2      –      –        35.8      12.0   < 10⁻¹⁰
 –      –       30.2   15.6      54.2       5.1   < 10⁻⁶
15.8    –       34.0    –        50.2      −6.2   < 10⁻⁹
28.4    –        –     33.6      38.0      −1.5    0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted).

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Stream-dependent tree-based clustering

[Figure: stream-dependent tree-based clustering — separate decision trees for mel-cepstrum, F0, and state duration models, attached to the HMM states]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 23 of 79

HMM-based speech synthesis [4]

[Figure: HMM-based speech synthesis — training part: spectral & excitation parameters extracted from the SPEECH DATABASE, with labels, train context-dependent HMMs & state duration models; synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_∀q p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)

ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79


Best state sequence

[Figure: 3-state left-to-right HMM with transition probabilities a11, a12, a22, a23, a33 and output distributions b1(o_t), b2(o_t), b3(o_t); observation sequence O = o1 … oT aligned to state sequence Q = 1 1 1 1 2 2 … 3 3, state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: per-state means and variances — ô becomes a step-wise mean vector sequence]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t⊤, Δc_t⊤]⊤,   Δc_t = c_t − c_{t−1}

(c_t is the M-dim static feature vector, so o_t is 2M-dim.)

The relationship between static and dynamic features can be arranged as o = W c, where W is a band matrix that stacks, for each frame t, an identity block selecting c_t and a difference block (−I, I) producing Δc_t:

[Figure: band matrix W mapping the static sequence (…, c_{t−1}, c_t, c_{t+1}, …) to the observation sequence (…, o_{t−1}, o_t, o_{t+1}, …)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
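The window matrix W above can be built explicitly; a minimal 1-dim sketch (with the first frame's delta falling back to Δc_1 = c_1) shows how o = W c interleaves static and delta rows:

```python
import numpy as np

# Build W so that (W @ c) = [c_1, Δc_1, c_2, Δc_2, ..., c_T, Δc_T]
# with Δc_t = c_t - c_{t-1} (1-dim static features for clarity).

def make_window_matrix(T):
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0            # static row: c_t
        W[2 * t + 1, t] = 1.0        # delta row:  c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

c = np.array([1.0, 3.0, 2.0])
o = make_window_matrix(3) @ c        # [c_1, Δc_1, c_2, Δc_2, c_3, Δc_3]
```

For M-dim features each 1 and −1 becomes an I or −I block, exactly the band structure in the figure.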

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ)   subject to   o = W c

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

then setting ∂ log N(W c; μ_q̂, Σ_q̂) / ∂c = 0 yields

W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
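The closed-form solution above can be sketched numerically: with toy step-wise state means and variances (not real model output), solving the normal equations turns the step-wise mean sequence into a smooth static trajectory.

```python
import numpy as np

# Solve W' Σ^{-1} W c = W' Σ^{-1} μ for the static trajectory c
# (1-dim features, diagonal covariance; toy statistics).

def make_window_matrix(T):
    # Rows alternate static (c_t) and delta (c_t - c_{t-1}; Δc_1 = c_1).
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

T = 6
mu = np.zeros(2 * T)
mu[0::2] = [0.0, 0.0, 0.0, 2.0, 2.0, 2.0]   # step-wise static means
var = np.ones(2 * T)
var[1::2] = 0.1                              # small variance on the (zero) delta means

W = make_window_matrix(T)
P = np.diag(1.0 / var)                       # Σ^{-1}
c = np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)  # smooth static trajectory
```

In practice W⊤Σ⁻¹W is banded, so a Cholesky or band solver gives the trajectory in O(T) rather than a dense solve; the sketch uses `np.linalg.solve` only for brevity.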


Speech parameter generation algorithm [9]

[Figure: structure of W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂ — W⊤ Σ_q̂⁻¹ W is a banded matrix, so the static trajectory c1 … cT can be solved efficiently from the state means μ_q̂1 … μ_q̂T]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: generated speech parameter trajectory c — static and dynamic means/variances with the smooth generated static trajectory]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: HMM-based speech synthesis — training part: spectral & excitation parameters extracted from the SPEECH DATABASE, with labels, train context-dependent HMMs & state duration models; synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: source-filter reconstruction — excitation e(n) (pulse train for voiced, white noise for unvoiced, driven by the generated excitation parameters: log F0 with V/UV) passed through a linear time-invariant system h(n) built from the generated spectral parameters (cepstrum, LSP): synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
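The source-filter reconstruction x(n) = h(n) ∗ e(n) can be sketched directly; the impulse response below is a toy exponential decay, not a real MLSA/LSP synthesis filter.

```python
import numpy as np

# Pulse-train excitation for a voiced segment (white noise would be
# used for unvoiced frames), convolved with a toy LTI impulse response.

n, T0 = 4000, 80                 # length; pitch period in samples (16 kHz / 200 Hz)
e = np.zeros(n)
e[::T0] = 1.0                    # pulse-train excitation e(n)

h = 0.9 ** np.arange(64)         # toy impulse response h(n)
x = np.convolve(e, h)[:n]        # synthesized waveform x(n) = h(n) * e(n)
```

Changing T0 changes the perceived pitch, while h shapes the spectral envelope — which is why the generated excitation and spectral parameter streams drive these two components separately.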

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics (adaptation, interpolation)
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: average-voice model — adaptive training over training speakers, then adaptation to target speakers]

• Train an average voice model (AVM) from training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Figure: interpolation among HMM sets λ1 … λ4 with interpolation ratios I(λ′, λ) yielding a new voice λ′]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model mixes of V/UV sounds (e.g., voiced fricatives)

  [Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction
  Harmonic effects often cause problems

  [Figure: power spectrum (dB), 0–8 kHz]

• Phase
  Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic/stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectrograms, 0–8 kHz]

• Why?
  − Details of spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combining multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics (adaptation; interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training: learn the relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees over yes/no context questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
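The frame-level mapping this slide describes can be sketched as a small feed-forward network taking one frame's linguistic feature vector x and predicting its acoustic feature vector y. Layer sizes and weights are illustrative placeholders, not the talk's actual configuration.

```python
import numpy as np

# Toy DNN forward pass: linguistic features -> hidden layers -> acoustic features.

rng = np.random.default_rng(0)
sizes = [10, 16, 16, 16, 5]      # linguistic dim -> hidden layers -> acoustic dim
Ws = [rng.normal(size=(m, k)) * 0.1 for k, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def dnn_forward(x):
    h = np.asarray(x, dtype=float)
    for Wl, bl in zip(Ws[:-1], bs[:-1]):
        h = 1.0 / (1.0 + np.exp(-(Wl @ h + bl)))   # sigmoid hidden units
    return Ws[-1] @ h + bs[-1]                     # linear output layer

y = dnn_forward(np.ones(10))
```

Trained with an MSE loss against frame-aligned acoustic targets, this single network replaces the per-leaf Gaussians of the decision-tree-clustered HMM: all contexts share the same weights instead of fragmenting the data across leaves.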

Framework

[Figure: DNN-based SPSS framework — TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features, for frames 1 … T) → DNN (input, hidden, output layers) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; a duration prediction module supplies the frame counts]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations → feature extraction integrated into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker

Training / test data: 33,000 & 173 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: 11 categorical features, 25 numeric features

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²

HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]

DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid (hidden), continuous F0 [24]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

(w/o grouping questions and numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters.

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)      DNN (layers × units)   Neutral   p value    z value
15.8 (16)    38.5 (4 × 256)          45.7     < 10⁻⁶     −9.9
16.1 (4)     27.2 (4 × 512)          56.8     < 10⁻⁶     −5.1
12.7 (1)     36.6 (4 × 1024)         50.7     < 10⁻⁶     −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — output units z_1 … z_6 produce w_1(x), w_2(x), µ_1(x), µ_2(x), σ_1(x), σ_2(x), parameterizing a 2-component mixture over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

  w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)    w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
  µ_1(x) = z_3                                µ_2(x) = z_4
  σ_1(x) = exp(z_5)                           σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 59 of 79
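The three activation functions can be written out directly. A sketch for the 1-dim, 2-mix case; the function and variable names are mine, not from the talk:

```python
import math

def mdn_params(z):
    """Map raw outputs z = (z1..z6) of a 1-dim, 2-mix MDN to GMM parameters:
    softmax -> weights, linear -> means, exponential -> variances."""
    z1, z2, z3, z4, z5, z6 = z
    den = math.exp(z1) + math.exp(z2)
    w = (math.exp(z1) / den, math.exp(z2) / den)  # softmax: w1 + w2 = 1
    mu = (z3, z4)                                  # linear: unconstrained means
    sigma = (math.exp(z5), math.exp(z6))           # exponential: sigma > 0
    return w, mu, sigma
```

The exponential keeps the variances positive without clipping, and the softmax keeps the mixture weights a valid distribution.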

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → deep MDN producing w_1(x_t), w_2(x_t), µ_1(x_t), µ_2(x_t), σ_1(x_t), σ_2(x_t) for each frame x_1, x_2, …, x_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 60 of 79

Experimental setup

• Almost the same as the previous setup
• Differences:

  DNN architecture    4–7 hidden layers, 1024 units/hidden layer;
                      ReLU (hidden), linear (output)
  DMDN architecture   4 hidden layers, 1024 units/hidden layer;
                      ReLU [28] (hidden), mixture density (output), 1–16 mix
  Optimization        AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM            1 mix    3.537 ± 0.113
                 2 mix    3.397 ± 0.115
  DNN            4×1024   3.635 ± 0.127
                 5×1024   3.681 ± 0.109
                 6×1024   3.652 ± 0.108
                 7×1024   3.637 ± 0.129
  DMDN (4×1024)  1 mix    3.654 ± 0.117
                 2 mix    3.796 ± 0.107
                 4 mix    3.766 ± 0.113
                 8 mix    3.805 ± 0.113
                 16 mix   3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 64 of 79


Basic RNN

[Figure: unrolled RNN — inputs x_{t−1}, x_t, x_{t+1}, outputs y_{t−1}, y_t, y_{t+1}, with recurrent connections in the hidden layer]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — linear memory cell c_t with tanh input/output nonlinearities; sigmoid input gate (write), forget gate (reset), and output gate (read), each fed by x_t, h_{t−1}, and a bias (b_i, b_f, b_o, b_c); output h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 66 of 79
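The gating in the block diagram can be sketched as a single recurrence step. This is a generic LSTM cell under assumed parameter names (W* input weights, R* recurrent weights, b* biases), not necessarily the exact variant used in the experiments:

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: sigmoid gates (write/reset/read) around a linear memory cell."""
    i = sigm(p["Wi"] @ x + p["Ri"] @ h_prev + p["bi"])     # input gate: write
    f = sigm(p["Wf"] @ x + p["Rf"] @ h_prev + p["bf"])     # forget gate: reset
    o = sigm(p["Wo"] @ x + p["Ro"] @ h_prev + p["bo"])     # output gate: read
    g = np.tanh(p["Wc"] @ x + p["Rc"] @ h_prev + p["bc"])  # candidate cell input
    c = f * c_prev + i * g   # linear memory cell: kept old content + new content
    h = o * np.tanh(c)       # exposed hidden state
    return h, c
```

Because the cell is updated additively (f·c_prev + i·g), stored information can survive many more steps than in the basic RNN, where it loops through a squashing nonlinearity every frame.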

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping input features x_1, x_2, …, x_T to acoustic features y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 67 of 79

Experimental setup

  Database             US English female speaker
  Train / dev data     34,632 & 100 sentences
  Sampling rate        16 kHz
  Analysis window      25-ms width, 5-ms shift
  Linguistic features  DNN: 449; LSTM: 289
  Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
  DNN                  4 hidden layers, 1024 units/hidden layer;
                       ReLU (hidden), linear (output); AdaDec [29] on GPU
  LSTM                 1 forward LSTM layer, 256 units, 128 projection;
                       asynchronous SGD on CPUs [35]
  Postprocessing       Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

  DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
  50.0       14.2        –           –            35.8      12.0   < 10^−10
  –          –           30.2        15.6         54.2      5.1    < 10^−6
  15.8       –           34.0        –            50.2      −6.2   < 10^−9
  28.4       –           –           33.6         38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 79 of 79


HMM-based speech synthesis [4]

[Figure: system diagram — training part: SPEECH DATABASE → spectral & excitation parameter extraction → labels → training context-dependent HMMs & state duration models; synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 24 of 79

Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

  ô = arg max_o p(o | w, λ)
    = arg max_o Σ_{∀q} p(o, q | w, λ)
    ≈ arg max_o max_q p(o, q | w, λ)
    = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

  q̂ = arg max_q P(q | w, λ)
  ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 25 of 79


Best state sequence

[Figure: 3-state left-to-right HMM with initial probability π_1, transition probabilities a_11, a_12, a_22, a_23, a_33, and output distributions b_1(o_t), b_2(o_t), b_3(o_t); observation sequence O = o_1, o_2, …, o_T aligned to state sequence Q = 1, 1, 1, 1, 2, 2, …, 3, 3 with state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: state-wise mean and variance sequences — ô becomes a step-wise mean vector sequence]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

  o_t = [c_t^T, ∆c_t^T]^T,   ∆c_t = c_t − c_{t−1}

[Figure: M-dim static features c_{t−2} … c_{t+2} and dynamic features ∆c_{t−2} … ∆c_{t+2}, stacked into 2M-dim vectors o_t]

The relationship between static and dynamic features can be arranged as o = Wc:

  [   ⋮     ]   [ ⋯               ⋯ ]   [   ⋮     ]
  [ c_{t−1}  ]   [ ⋯  0  I  0  0  ⋯ ]   [ c_{t−2}  ]
  [ ∆c_{t−1} ]   [ ⋯ −I  I  0  0  ⋯ ]   [ c_{t−1}  ]
  [ c_t      ] = [ ⋯  0  0  I  0  ⋯ ] · [ c_t      ]
  [ ∆c_t     ]   [ ⋯  0 −I  I  0  ⋯ ]   [ c_{t+1}  ]
  [ c_{t+1}  ]   [ ⋯  0  0  0  I  ⋯ ]   [   ⋮     ]
  [ ∆c_{t+1} ]   [ ⋯  0  0 −I  I  ⋯ ]
  [   ⋮     ]   [ ⋯               ⋯ ]
       o                  W                   c

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 28 of 79
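The band structure of W can be built programmatically. A sketch for 1-dimensional c_t with the slide's delta window ∆c_t = c_t − c_{t−1}, taking c_0 = 0 at the boundary (an assumption for the first frame):

```python
import numpy as np

def build_W(T):
    """W maps c = [c_1 .. c_T] to o = [c_1, dc_1, c_2, dc_2, ...],
    with dc_t = c_t - c_{t-1} (and c_0 = 0 at the boundary)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0               # static row: picks out c_t
        W[2 * t + 1, t] = 1.0           # delta row: +c_t
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0  # delta row: -c_{t-1}
    return W
```

For M-dimensional features, each 1 and −1 becomes an M×M identity block, matching the I and −I blocks on the slide.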

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

  ô = arg max_o p(o | q̂, λ)   subject to o = Wc

If the state-output distribution is a single Gaussian,

  p(o | q̂, λ) = N(o; µ_q̂, Σ_q̂)

By setting ∂ log N(Wc; µ_q̂, Σ_q̂) / ∂c = 0,

  W^T Σ_q̂^{−1} W c = W^T Σ_q̂^{−1} µ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 29 of 79


Speech parameter generation algorithm [9]

[Figure: the system W^T Σ_q^{−1} W c = W^T Σ_q^{−1} µ_q written out element-wise — W^T Σ_q^{−1} W is a band matrix multiplying the static trajectory c_1 … c_T, and the right-hand side stacks the weighted state means µ_{q_1} … µ_{q_T}]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 30 of 79
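Solving the banded system above for c is a small linear-algebra exercise. A sketch with a diagonal Σ and the simple one-frame delta window; the names and toy sizes are mine:

```python
import numpy as np

def mlpg(W, mu, var):
    """Solve W^T Sigma^{-1} W c = W^T Sigma^{-1} mu for the static
    trajectory c, with Sigma = diag(var) over the stacked o sequence."""
    Sinv = np.diag(1.0 / np.asarray(var))
    A = W.T @ Sinv @ W
    b = W.T @ Sinv @ np.asarray(mu)
    # A dense solve for clarity; real systems exploit the band structure.
    return np.linalg.solve(A, b)

# Toy example: T = 2 frames, o = [c1, dc1, c2, dc2] with dc_t = c_t - c_{t-1}
W = np.array([[1.0, 0.0],
              [1.0, 0.0],    # dc1 = c1 (c0 = 0 at the boundary)
              [0.0, 1.0],
              [-1.0, 1.0]])  # dc2 = c2 - c1
c = mlpg(W, mu=[1.0, 1.0, 2.0, 1.0], var=[1.0, 1.0, 1.0, 1.0])
```

The variances act as per-feature weights: frames with tight static variances pull c toward their means, while the delta rows couple neighboring frames and smooth the trajectory.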

Generated speech parameter trajectory

[Figure: static and dynamic mean/variance sequences with the generated smooth static trajectory c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: system diagram as on slide 24 — training part: SPEECH DATABASE → spectral & excitation parameter extraction → labels → training context-dependent HMMs & state duration models; synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 32 of 79

Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) drive a pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 33 of 79
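The simple pulse/noise excitation in the diagram is easy to sketch. Frame-level V/UV switching and real vocoders are more involved; the parameter names here are mine:

```python
import numpy as np

def make_excitation(f0, fs, n, seed=0):
    """Pulse train with period fs/f0 samples when voiced (f0 > 0),
    white Gaussian noise when unvoiced (f0 == 0)."""
    if f0 > 0:
        e = np.zeros(n)
        period = int(round(fs / f0))
        e[::period] = 1.0  # one pulse per pitch period
        return e
    return np.random.default_rng(seed).standard_normal(n)

# The synthesized waveform is then x(n) = h(n) * e(n), e.g. np.convolve(h, e)
# with the impulse response h of the synthesis filter built from the
# generated spectral parameters.
```

The hard voiced/unvoiced switch is exactly what makes mixed sounds such as voiced fricatives hard to model, motivating the "better vocoding" approaches listed later.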

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: adaptive training of an average-voice model from training speakers, then adaptation to target speakers]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: HMM sets λ_1 … λ_4 combined with ratios I(λ′, λ_1) … I(λ′, λ_4) to give λ′; λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 38 of 79
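At its simplest, interpolating HMM sets means mixing their Gaussian mean vectors with the ratios I(λ′, λ). A sketch of just the mean interpolation (full interpolation also combines covariances; the names are mine):

```python
import numpy as np

def interpolate_means(means, ratios):
    """New voice mean = sum_i I(lambda', lambda_i) * mu_i, ratios summing to 1."""
    ratios = np.asarray(ratios, dtype=float)
    assert np.isclose(ratios.sum(), 1.0), "interpolation ratios must sum to 1"
    means = [np.asarray(m, dtype=float) for m in means]
    return sum(r * m for r, m in zip(ratios, means))
```

Sliding the ratios between two speakers' models morphs the synthetic voice continuously between them, with no adaptation data needed.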

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation:
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]
• Spectral envelope extraction:
  Harmonic effects often cause problems
  [Figure: power spectrum (dB) over 0–8 kHz]
• Phase:
  Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics: statistics do not vary within an HMM state
• Conditional independence assumption: state output probability depends only on the current state
• Weak duration modeling: state duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectrograms, 0–8 kHz]

• Why?
  − Details of the spectral (formant) structure disappear
  − Use of a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than in ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree of yes/no context questions partitioning the acoustic space]

• Decision tree-clustered HMMs with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h_1, h_2, h_3 mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x
• The DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 50 of 79
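The per-frame mapping can be sketched as a plain feedforward pass from a linguistic feature vector to acoustic statistics, with sigmoid hidden units and a linear output as in the experiments later; the weight layout is mine:

```python
import numpy as np

def dnn_forward(x, layers):
    """Map one frame's linguistic features x to acoustic features.
    `layers` is a list of (W, b) pairs; sigmoid hidden units, linear output."""
    h = np.asarray(x, dtype=float)
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))  # sigmoid hidden layers
    W, b = layers[-1]
    return W @ h + b                             # linear output layer
```

At synthesis time this is applied frame by frame; the outputs are treated as the mean (and, with precomputed variances, the statistics) fed to the parameter generation algorithm.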

Framework

[Figure: TEXT → text analysis → input feature extraction (binary & numeric linguistic features, duration and frame-position features at frames 1 … T, from duration prediction) → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → Integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 52 of 79


Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 53 of 79


Experimental setup

  Database              US English female speaker
  Training / test data  33,000 & 173 sentences
  Sampling rate         16 kHz
  Analysis window       25-ms width, 5-ms shift
  Linguistic features   11 categorical features, 25 numeric features
  Acoustic features     0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
  HMM topology          5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
  DNN architecture      1–5 layers, 256/512/1024/2048 units/layer;
                        sigmoid, continuous F0 [24]
  Postprocessing        Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

  HMM (α)     DNN (layers × units)   Neutral   p value   z value
  15.8 (16)   38.5 (4 × 256)         45.7      < 10^−6   −9.9
  16.1 (4)    27.2 (4 × 512)         56.8      < 10^−6   −5.1
  12.7 (1)    36.6 (4 × 1024)        50.7      < 10^−6   −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; output layer emits w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x)]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                                μ2(x) = z4
σ1(x) = exp(z5)                           σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
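A minimal sketch of the output-layer mapping above for the 1-dim, 2-mix case. The layout of z follows the slide (z1, z2 → weights; z3, z4 → means; z5, z6 → variances); the input values are illustrative.

```python
import numpy as np

def mdn_params(z):
    """Map the 6 linear outputs z of a 1-dim, 2-mix MDN to GMM parameters."""
    z = np.asarray(z, dtype=float)
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()   # softmax -> mixture weights
    mu = z[2:4]                               # linear -> means
    sigma = np.exp(z[4:6])                    # exponential -> std devs > 0
    return w, mu, sigma

# Toy activations: equal weight logits, means -1 and +1.
w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.5])
print(w)      # weights sum to 1; equal here
print(sigma)  # positive regardless of z
```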

DMDN-based SPSS [27]

[Figure: pipeline TEXT → text analysis → input feature extraction → duration prediction → parameter generation → waveform synthesis → SPEECH; at each frame t, the DMDN maps input features x_t to GMM parameters w_m(x_t), μ_m(x_t), σ_m(x_t) over the output y]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:
  − DNN: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
  − DMDN: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
  − Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

MOS results:
HMM            1 mix: 3.537 ± 0.113    2 mix: 3.397 ± 0.115
DNN            4×1024: 3.635 ± 0.127   5×1024: 3.681 ± 0.109   6×1024: 3.652 ± 0.108   7×1024: 3.637 ± 0.129
DMDN (4×1024)  1 mix: 3.654 ± 0.117    2 mix: 3.796 ± 0.107    4 mix: 3.766 ± 0.113    8 mix: 3.805 ± 0.113    16 mix: 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

[Figure: network with recurrent connections from hidden state to hidden state, unrolled over inputs x_{t−1}, x_t, x_{t+1} and outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell c_t with tanh input/output squashing; sigmoid input, forget, and output gates each read x_t and h_{t−1}; input gate = write, output gate = read, forget gate = reset]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
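The gate structure in the figure can be written out directly. This is a generic, unoptimized LSTM step under the stated assumptions (no peephole connections or projection layer, which practical implementations may add); sizes and weights are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: all four gate/input blocks read [x, h_prev]."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b   # all 4 blocks at once
    i = sigmoid(z[:n])                        # input gate  (write)
    f = sigmoid(z[n:2*n])                     # forget gate (reset)
    o = sigmoid(z[2*n:3*n])                   # output gate (read)
    g = np.tanh(z[3*n:])                      # cell input
    c = f * c_prev + i * g                    # linear memory cell
    h = o * np.tanh(c)                        # block output
    return h, c

rng = np.random.default_rng(0)
x, h, c = rng.standard_normal(3), np.zeros(2), np.zeros(2)
W, b = rng.standard_normal((8, 5)) * 0.1, np.zeros(8)
h, c = lstm_step(x, h, c, W, b)
```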

LSTM-based SPSS [33, 34]

[Figure: pipeline TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping x_1 ... x_T to y_1 ... y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train/dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference (%):
DNN w/ ∆: 50.0   DNN w/o ∆: 14.2                     Neutral: 35.8   z = 12.0   p < 10⁻¹⁰
                 LSTM w/ ∆: 30.2   LSTM w/o ∆: 15.6  Neutral: 54.2   z = 5.1    p < 10⁻⁶
DNN w/ ∆: 15.8   LSTM w/ ∆: 34.0                     Neutral: 50.2   z = −6.2   p < 10⁻⁹
DNN w/ ∆: 28.4   LSTM w/o ∆: 33.6                    Neutral: 38.0   z = −1.5   p = 0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Speech parameter generation algorithm [9]

Generate the most probable state outputs given the HMM and words:

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)
ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79

Best state sequence

[Figure: 3-state left-to-right HMM with transition probabilities a11, a12, a22, a23, a33 and output distributions b1(o_t), b2(o_t), b3(o_t); observation sequence O = o1 o2 ... oT aligned to state sequence Q = 1 1 1 1 2 2 ... 3 3, with state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79
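The duration-to-frame alignment in the figure (D = 4, 10, 5) amounts to repeating each state index for its duration; a minimal sketch:

```python
def expand_durations(durations):
    """Expand per-state durations into a frame-level state sequence."""
    seq = []
    for state, d in enumerate(durations, start=1):
        seq.extend([state] * d)
    return seq

# Durations from the slide: states 1..3 last 4, 10, and 5 frames.
q = expand_durations([4, 10, 5])
print(len(q))   # 19 frames in total
print(q[:5])    # first frames: [1, 1, 1, 1, 2]
```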

Best state outputs w/o dynamic features

[Figure: mean and variance trajectories]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^T, ∆c_t^T]^T,   ∆c_t = c_t − c_{t−1}

The relationship between static and dynamic features can be arranged as o = Wc:

o = [..., c_{t−1}^T, ∆c_{t−1}^T, c_t^T, ∆c_t^T, c_{t+1}^T, ∆c_{t+1}^T, ...]^T
c = [..., c_{t−2}^T, c_{t−1}^T, c_t^T, c_{t+1}^T, ...]^T

with block rows of W (columns for c_{t−2}, c_{t−1}, c_t, c_{t+1}):

  c_{t−1}:   [ 0   I   0   0 ]
  ∆c_{t−1}:  [−I   I   0   0 ]
  c_t:       [ 0   0   I   0 ]
  ∆c_t:      [ 0  −I   I   0 ]
  c_{t+1}:   [ 0   0   0   I ]
  ∆c_{t+1}:  [ 0   0  −I   I ]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79

Speech parameter generation algorithm [9]

Introduce the dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

W^T Σ_q̂^{−1} W c = W^T Σ_q̂^{−1} μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
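The normal equations above can be solved directly for a toy case. Here W implements only the first-order delta ∆c_t = c_t − c_{t−1} from the previous slide, and the diagonal Σ and the means are illustrative values, not data from the talk:

```python
import numpy as np

T, var_s, var_d = 3, 1.0, 2.0   # toy: 3 frames, static/delta variances

# W stacks a static row and a delta row per frame: o_t = [c_t, c_t - c_{t-1}]
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0           # static row:  c_t
    W[2 * t + 1, t] = 1.0       # delta row:   c_t - c_{t-1}
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])   # interleaved static/delta means
Sigma_inv = np.diag([1 / var_s, 1 / var_d] * T)

# Normal equations: W^T Sigma^{-1} W c = W^T Sigma^{-1} mu
A = W.T @ Sigma_inv @ W
c = np.linalg.solve(A, W.T @ Sigma_inv @ mu)
print(c)   # smooth static trajectory satisfying the delta constraints
```

In practice this system is banded, so it is solved with a band Cholesky decomposition rather than a dense solve.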

Speech parameter generation algorithm [9]

[Figure: band structure of the linear system W^T Σ_q^{−1} W c = W^T Σ_q^{−1} μ_q; the banded matrix W^T Σ_q^{−1} W multiplies the static feature sequence c = [c_1, ..., c_T]^T, and the right-hand side stacks the state means μ_{q_1}, ..., μ_{q_T}]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: static and dynamic feature trajectories; mean, variance, and generated trajectory c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: system overview. Training part: speech database → spectral & excitation parameter extraction, plus labels → training HMMs → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation + synthesis filter → synthesized speech]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) switch between a pulse train and white noise to form the excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); the synthesized speech is x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
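A toy rendering of this source-filter reconstruction. The decaying-exponential h(n) is a stand-in for a real synthesis filter (e.g. MLSA), and the 100 Hz F0 is illustrative:

```python
import numpy as np

fs = 16000
f0 = 100.0                        # voiced frame: pulse train at F0
n = np.arange(800)                # 50 ms of samples

# Excitation: impulse train for voiced frames (white noise for unvoiced).
period = int(fs / f0)
e = np.zeros_like(n, dtype=float)
e[::period] = 1.0

# Toy impulse response standing in for the LTI synthesis filter h(n).
h = 0.9 ** np.arange(50)

# x(n) = h(n) * e(n)  (convolution)
x = np.convolve(e, h)[: len(e)]
```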

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Adaptation (mimicking voice) [13]

[Figure: training speakers → adaptive training → average-voice model → adaptation → target speakers]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: target model λ′ obtained from HMM sets λ1 ... λ4 with interpolation ratios I(λ′, λ_k)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
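Numerically, interpolating model sets reduces to convex combinations of their statistics. A toy sketch for the mean vectors of one state across four hypothetical HMM sets (all values illustrative):

```python
import numpy as np

# Hypothetical mean vectors of the same state in four representative
# HMM sets (e.g. four speakers), one row per set.
means = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [2.0, 2.0],
                  [0.0, 0.0]])

# Interpolation ratios I(lambda', lambda_k); must sum to 1.
ratios = np.array([0.4, 0.3, 0.2, 0.1])

# Interpolated model: convex combination of the set statistics.
mu_interp = ratios @ means
print(mu_interp)   # [0.8, 0.7]
```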

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)

  [Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction
  Harmonic effects often cause problems

  [Figure: power spectrum (dB) over 0–8 kHz]

• Phase
  Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics: statistics do not vary within an HMM state
• Conditional independence assumption: state output probability depends only on the current state
• Weak duration modeling: state duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectrograms, 0–8 kHz]

• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees over yes/no context questions partition the acoustic space]

• Decision tree-clustered HMMs with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
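As a sketch, the mapping such a network performs at one frame is a plain feedforward pass. Sigmoid hidden layers and a linear output match the DNN setup described later; the sizes and weights below are toy values, not those of the actual systems:

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Feedforward acoustic model: sigmoid hidden layers, linear output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden layer
    return weights[-1] @ h + biases[-1]          # linear output layer

rng = np.random.default_rng(0)
sizes = [10, 16, 16, 4]   # toy linguistic dim -> hiddens -> acoustic dim
Ws = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes, sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
y = dnn_forward(rng.standard_normal(10), Ws, bs)
```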

Framework

[Figure: DNN-based framework. TEXT → text analysis → input feature extraction (binary & numeric linguistic features, plus duration and frame-position features, at frames 1 ... T) → input layer → hidden layers → output layer → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction provides the frame alignment]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
  → feature extraction integrated into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79

Experimental setup

Database: US English female speaker
Training/test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α = 1), and DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference (%):
HMM (α = 16): 15.8   DNN (4×256): 38.5    Neutral: 45.7   p < 10⁻⁶   z = −9.9
HMM (α = 4): 16.1    DNN (4×512): 27.2    Neutral: 56.8   p < 10⁻⁶   z = −5.1
HMM (α = 1): 12.7    DNN (4×1024): 36.6   Neutral: 50.7   p < 10⁻⁶   z = −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)w2(x1)

y

Weights rarr Softmax activation function

Means rarr Linear activation function

Variances rarr Exponential activation function

sumzj =

4

i=1

hiwij

w1(x) =exp(z1)sum2

m=1 exp(zm)

micro1(x) = z3

σ1(x) = exp(z5)

Inputs of activation function

1-dim 2-mix MDN

w2(x) =exp(z2)sum2

m=1 exp(zm)

micro1(x) = z4

σ2(x) = exp(z6)

NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
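The output-layer mappings above can be made concrete in code. The following is an illustrative sketch of a 1-dim, 2-mix MDN head, mapping the six network pre-activations z1…z6 to valid GMM parameters; the function names and the accompanying negative log-likelihood loss are my own additions, not from the slides:

```python
import numpy as np

def mdn_params(z):
    """Map pre-activations z1..z6 of a 1-dim, 2-mix MDN to GMM parameters:
    weights via softmax, means via identity, variances via exponential."""
    z = np.asarray(z, dtype=float)
    w = np.exp(z[0:2] - np.max(z[0:2]))    # softmax over z1, z2 (stabilized)
    w = w / w.sum()
    mu = z[2:4]                            # mu1 = z3, mu2 = z4 (linear)
    sigma = np.exp(z[4:6])                 # sigma1 = exp(z5), sigma2 = exp(z6)
    return w, mu, sigma

def mdn_nll(z, y):
    """Negative log-likelihood of target y under the predicted 2-mix GMM;
    this is the training loss that replaces MSE."""
    w, mu, sigma = mdn_params(z)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())
```

The exponential on the variance outputs guarantees positivity without any constraint on the raw network outputs, which is why it is preferred over, e.g., squaring.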

DMDN-based SPSS [27]

[Figure: DMDN-based TTS pipeline — TEXT → Text analysis → Input feature extraction (with Duration prediction) → DMDN predicting, for each frame t, GMM weights w1(xt), w2(xt), means μ1(xt), μ2(xt), and variances σ1(xt), σ2(xt) over y → Parameter generation → Waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)

Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM        1 mix      3.537 ± 0.113
           2 mix      3.397 ± 0.115

DNN        4×1024     3.635 ± 0.127
           5×1024     3.681 ± 0.109
           6×1024     3.652 ± 0.108
           7×1024     3.637 ± 0.129

DMDN       1 mix      3.654 ± 0.117
(4×1024)   2 mix      3.796 ± 0.107
           4 mix      3.766 ± 0.113
           8 mix      3.805 ± 0.113
           16 mix     3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features

− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) are used as inputs
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: an RNN unrolled over time — input xt feeds a hidden layer whose recurrent connections carry state across t−1, t, t+1 to produce outputs yt]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
    → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell ct with tanh input and output squashing; input gate it (write), forget gate (reset), and output gate (read), each a sigmoid of xt, h_{t−1}, and a bias, gating the cell update and the hidden output ht]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
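A minimal sketch of one LSTM step as described above, in the standard formulation (no peephole connections or projection layer; all names and sizes are illustrative, and W stacks the four gate/input transforms of [xt; h_{t−1}]):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: a linear memory cell guarded by multiplicative gates.
    W has shape (4n, d+n) and maps [x; h_prev] to the stacked
    pre-activations of the input, forget, output gates and cell input."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:n])          # input gate  (write)
    f = sigmoid(z[n:2*n])        # forget gate (reset)
    o = sigmoid(z[2*n:3*n])      # output gate (read)
    g = np.tanh(z[3*n:4*n])      # candidate cell input
    c = f * c_prev + i * g       # linear memory update
    h = o * np.tanh(c)           # gated read-out
    return h, c
```

Because the cell update is additive (f · c_prev + i · g) rather than a squashed recurrence, gradients and stored information decay far more slowly than in the basic RNN above.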

LSTM-based SPSS [33 34]

[Figure: LSTM-based TTS pipeline — TEXT → Text analysis → Input feature extraction (with Duration prediction) → LSTM mapping input features x1 … xT to acoustic features y1 … yT → Parameter generation → Waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker

Train / dev set data: 34,632 & 100 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: DNN 449, LSTM 289

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)

DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU

LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

DNN w/ Δ   DNN w/o Δ   LSTM w/ Δ   LSTM w/o Δ   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Speech parameter generation algorithm [9]

Generate most probable state outputs given HMM and words

ô = arg max_o p(o | w, λ)
  = arg max_o Σ_{∀q} p(o, q | w, λ)
  ≈ arg max_o max_q p(o, q | w, λ)
  = arg max_o max_q p(o | q, λ) P(q | w, λ)

Determine the best state sequence and outputs sequentially:

q̂ = arg max_q P(q | w, λ)

ô = arg max_o p(o | q̂, λ)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 25 of 79

Best state sequence

[Figure: 3-state left-to-right HMM with initial probability π1, transition probabilities a11, a12, a22, a23, a33, and state-output distributions b1(ot), b2(ot), b3(ot); the observation sequence O = o1 o2 … oT is aligned to the state sequence Q = 1 1 1 1 2 2 … 3 3 with state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputswo dynamic features

[Figure: per-state means and variances of the static features]

ô becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

ot = [ct⊤, Δct⊤]⊤,  Δct = ct − c_{t−1}

[Figure: per-frame static features c_{t−2} … c_{t+2} (M dims each) stacked with dynamic features Δc_{t−2} … Δc_{t+2} into 2M-dim output vectors]

The relationship between static and dynamic features can be arranged as o = Wc: each static block row of W selects the corresponding ct (⋯ 0 I 0 ⋯) and each dynamic block row computes Δct = ct − c_{t−1} (⋯ −I I 0 ⋯), so that

[… c_{t−1}⊤ Δc_{t−1}⊤ ct⊤ Δct⊤ c_{t+1}⊤ Δc_{t+1}⊤ …]⊤ = W [… c_{t−2}⊤ c_{t−1}⊤ ct⊤ c_{t+1}⊤ …]⊤

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
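The arrangement o = Wc can be sketched concretely for a 1-dimensional static feature with the first-order dynamics Δct = ct − c_{t−1} defined above (the zero boundary at t = 0 is my assumption):

```python
import numpy as np

def build_W(T):
    """Build the (2T x T) matrix W with o = W c, where
    o_t = [c_t, dc_t] and dc_t = c_t - c_{t-1} (c_{-1} = 0 boundary)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0          # static row: selects c_t
        W[2 * t + 1, t] = 1.0      # delta row:  c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

c = np.array([1.0, 3.0, 2.0])
o = build_W(3) @ c                 # interleaved [c_t, dc_t] per frame
```

With c = [1, 3, 2] this yields o = [1, 1, 3, 2, 2, −1]: each frame contributes its static value followed by its delta.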

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q̂, λ) subject to o = Wc

If the state-output distribution is a single Gaussian,

p(o | q̂, λ) = N(o; μ_q̂, Σ_q̂)

By setting ∂ log N(Wc; μ_q̂, Σ_q̂) / ∂c = 0,

W⊤ Σ_q̂⁻¹ W c = W⊤ Σ_q̂⁻¹ μ_q̂

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

[Figure: band structure of the linear system W⊤ Σ_q⁻¹ W c = W⊤ Σ_q⁻¹ μ_q — the banded matrix W⊤ Σ_q⁻¹ W relates the static trajectory c1 … cT to the per-state means μ_q1 … μ_qT]
Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
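A toy sketch of the generation step: given per-frame means and variances for the stacked [static, delta] features, solve W⊤Σ⁻¹Wc = W⊤Σ⁻¹μ for the static trajectory c. This uses a 1-dim feature, a diagonal Σ, and a dense solve for clarity; a real implementation exploits the band structure of W⊤Σ⁻¹W, and the helper names are mine:

```python
import numpy as np

def build_W(T):
    """(2T x T) window matrix: static rows select c_t, delta rows
    compute dc_t = c_t - c_{t-1} (zero boundary assumed)."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0
        W[2 * t + 1, t] = 1.0
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W

def mlpg(mu, var):
    """Solve W' Sigma^-1 W c = W' Sigma^-1 mu for the static trajectory c.
    mu, var interleave [static, delta] statistics per frame."""
    T = mu.shape[0] // 2
    W = build_W(T)
    P = np.diag(1.0 / var)          # Sigma^-1 (diagonal covariance)
    A = W.T @ P @ W                 # banded normal-equation matrix
    b = W.T @ P @ mu
    return np.linalg.solve(A, b)    # dense solve; banded Cholesky in practice
```

Small delta variances pull neighbouring frames together, which is exactly the smoothing effect the dynamic feature constraints provide.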

Generated speech parameter trajectory

[Figure: generated speech parameter trajectory c — static and dynamic means and variances, with the smooth trajectory c passing through them]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: system overview — Training part: spectral and excitation parameters are extracted from the SPEECH DATABASE and, with labels, used to train context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral and excitation parameters → excitation generation and synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: source-filter reconstruction — generated excitation parameters (log F0 with V/UV) drive a pulse train / white noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
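The reconstruction x(n) = h(n) ∗ e(n) can be sketched as follows — a toy excitation and impulse response to make the source-filter idea concrete, not an actual vocoder:

```python
import numpy as np

def excitation(n_samples, f0, fs, voiced, seed=0):
    """Pulse-train excitation at the fundamental period for voiced frames,
    white noise for unvoiced frames (all parameters illustrative)."""
    if voiced:
        e = np.zeros(n_samples)
        period = int(fs / f0)               # samples between glottal pulses
        e[::period] = 1.0
    else:
        e = np.random.default_rng(seed).standard_normal(n_samples)
    return e

def synthesize(e, h):
    """x(n) = h(n) * e(n): excitation filtered by the LTI impulse response."""
    return np.convolve(e, h)[: len(e)]

fs = 16000
e = excitation(400, f0=200.0, fs=fs, voiced=True)  # pulses every 80 samples
x = synthesize(e, h=np.array([1.0, 0.5, 0.25]))    # toy impulse response
```

In a real system h(n) would be derived from the generated spectral parameters (e.g., an MLSA filter for mel-cepstra) and updated frame by frame rather than being time-invariant.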

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics: adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: average-voice model built from training speakers via adaptive training, then adapted to target speakers]

• Train an average voice model (AVM) from training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Figure: interpolation among HMM sets λ1 … λ4 with interpolation ratios I(λ′, λk) to obtain a new model λ′]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
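One simple way to realize the mixing is to reduce each HMM set to per-state Gaussians and combine their first and second moments with the interpolation ratios — an illustrative formulation of the idea, not necessarily the exact scheme of [14]–[17]:

```python
import numpy as np

def interpolate(means, variances, ratios):
    """Interpolate K single-Gaussian models with ratios a_k = I(lambda', lambda_k).
    The interpolated mean is the ratio-weighted mean; the interpolated
    variance is the second moment of the weighted mixture, re-centred."""
    a = np.asarray(ratios, dtype=float)
    a = a / a.sum()                     # normalize interpolation ratios
    mu = sum(ak * mk for ak, mk in zip(a, means))
    second = sum(ak * (vk + mk ** 2) for ak, mk, vk in zip(a, means, variances))
    return mu, second - mu ** 2

mu, var = interpolate(means=[np.array([0.0]), np.array([2.0])],
                      variances=[np.array([1.0]), np.array([1.0])],
                      ratios=[0.5, 0.5])
```

Mixing two voices with means 0 and 2 at equal ratios gives mean 1 and variance 2: the shared variance 1 plus the spread between the two speakers' means.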

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation:
  difficult to model mixes of V/UV sounds (e.g., voiced fricatives)

[Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction:
  harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz showing harmonic structure]

• Phase:
  important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

bull Mixed excitation linear prediction (MELP)

bull STRAIGHT

bull Multi-band excitation

bull Harmonic + noise model (HNM)

bull Harmonic stochastic model

bull LF model

bull Glottal waveform

bull Residual codebook

bull ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model

− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model

− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model

− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: natural vs. generated spectrograms (0–8 kHz) — formant detail is lost in the generated one]

• Why?

− Details of the spectral (formant) structure disappear
− Use of a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering

− Mel-cepstrum
− LSP

• Nonparametric approach

− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics

− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics: adaptation; interpolation (eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training:
  learn the relationship between linguistic & acoustic features

• Synthesis:
  map linguistic features to acoustic ones

• Linguistic features used in SPSS

− Phoneme, syllable, word, phrase, utterance-level features
− e.g., phone identity, POS, stress of words in a phrase
− Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree of yes/no context questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network — linguistic features x pass through hidden layers h1, h2, h3 to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
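A toy forward pass of such a network — random weights and illustrative sizes, just to make the x → h1 → h2 → h3 → y mapping concrete:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dnn_forward(x, weights, biases):
    """Map a linguistic feature vector x through sigmoid hidden layers
    to acoustic features y via a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(W @ h + b)               # hidden layers h1 .. h3
    return weights[-1] @ h + biases[-1]      # linear output layer

rng = np.random.default_rng(0)
sizes = [10, 16, 16, 16, 4]                  # x -> h1 -> h2 -> h3 -> y (toy)
weights = [rng.standard_normal((m, n)) * 0.1
           for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = dnn_forward(rng.standard_normal(10), weights, biases)
```

In the actual systems described here the input mixes binary context answers with numeric features, and the output stacks the statics, deltas, and V/UV flag per frame; both ends are far wider than in this sketch.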

Framework

[Figure: DNN-based TTS framework — TEXT → text analysis → input feature extraction (binary & numeric features, plus duration and frame-position features from duration prediction) at frames 1 … T → DNN (input, hidden, output layers) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
  → integrates feature extraction into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than a fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? — no

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker

Training / test data: 33,000 & 173 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: 11 categorical features, 25 numeric features

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²

HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]

DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters.

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

HMM (α)     DNN (layers × units)   Neutral   p value    z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶     −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶     −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶     −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)w2(x1)

y

Weights rarr Softmax activation function

Means rarr Linear activation function

Variances rarr Exponential activation function

sumzj =

4

i=1

hiwij

w1(x) =exp(z1)sum2

m=1 exp(zm)

micro1(x) = z3

σ1(x) = exp(z5)

Inputs of activation function

1-dim 2-mix MDN

w2(x) =exp(z2)sum2

m=1 exp(zm)

micro1(x) = z4

σ2(x) = exp(z6)

NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)

w2(x1)

y

x1

micro1(x2 ) micro2(x2 )

σ2(x2 )σ1(x2 )

w1(x2 ) w2(x2 )

y

x2

micro1(xT ) micro2(xT )

σ1(xT )σ2(xT )

w2(xT )

w1(xT )

y

xT

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM            1 mix    3.537 ± 0.113
                 2 mix    3.397 ± 0.115
  DNN            4×1024   3.635 ± 0.127
                 5×1024   3.681 ± 0.109
                 6×1024   3.652 ± 0.108
                 7×1024   3.637 ± 0.129
  DMDN (4×1024)  1 mix    3.654 ± 0.117
                 2 mix    3.796 ± 0.107
                 4 mix    3.766 ± 0.113
                 8 mix    3.805 ± 0.113
                 16 mix   3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as inputs
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: unrolled RNN — input x_t, output y_t, with recurrent connections carrying the hidden state from t−1 to t]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
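The recurrence above can be sketched in a few lines (illustrative shapes and a hypothetical `rnn_forward` helper, not the actual system). Each y_t depends on x_1 ... x_t through the hidden state h_t, which is why a unidirectional RNN cannot see future contexts; a bidirectional RNN runs a second pass over the reversed sequence and concatenates the two hidden states.

```python
import numpy as np


def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    """Basic RNN: the recurrent connection feeds h_{t-1} back into h_t,
    so y_t depends on all previous inputs (but none of the future ones)."""
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)  # recurrent connection
        ys.append(W_hy @ h + b_y)
    return np.array(ys)


rng = np.random.default_rng(0)
T, d_in, d_h, d_out = 5, 3, 4, 2
x_seq = rng.normal(size=(T, d_in))
y = rnn_forward(x_seq,
                rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)),
                rng.normal(size=(d_out, d_h)), np.zeros(d_h), np.zeros(d_out))
assert y.shape == (T, d_out)
```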

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell c_t with tanh input/output squashing; input gate (write), output gate (read), and forget gate (reset), each a sigmoid of x_t, h_{t−1}, and a bias]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
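A minimal sketch of one LSTM step matching the gate structure on the slide. The function name and weight dictionary are illustrative, and peephole connections are omitted:

```python
import numpy as np


def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: a linear memory cell c_t guarded by multiplicative
    input (write), forget (reset), and output (read) gates."""
    i = sigm(p['Wi'] @ x_t + p['Ri'] @ h_prev + p['bi'])     # input gate
    f = sigm(p['Wf'] @ x_t + p['Rf'] @ h_prev + p['bf'])     # forget gate
    g = np.tanh(p['Wc'] @ x_t + p['Rc'] @ h_prev + p['bc'])  # cell input
    c = f * c_prev + i * g                                   # linear memory cell
    o = sigm(p['Wo'] @ x_t + p['Ro'] @ h_prev + p['bo'])     # output gate
    h = o * np.tanh(c)
    return h, c


rng = np.random.default_rng(0)
d_in, d_h = 3, 4
p = {'W' + k: rng.normal(size=(d_h, d_in)) for k in 'ifco'}
p.update({'R' + k: rng.normal(size=(d_h, d_h)) for k in 'ifco'})
p.update({'b' + k: np.zeros(d_h) for k in 'ifco'})
h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), p)
assert h.shape == c.shape == (d_h,)
```

Because the cell update is additive (f · c_prev + i · g) rather than a repeated squashing, gradients can flow over many time steps, addressing the decay/overwriting problems of the basic RNN.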

LSTM-based SPSS [33, 34]

[Figure: TTS pipeline — TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping x_1 ... x_T to y_1 ... y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

       DNN              LSTM
  w/ ∆   w/o ∆     w/ ∆   w/o ∆    Neutral      z         p
  50.0   14.2       –      –        35.8       12.0    < 10⁻¹⁰
   –      –        30.2   15.6      54.2        5.1    < 10⁻⁶
  15.8    –        34.0    –        50.2       −6.2    < 10⁻⁹
  28.4    –         –     33.6      38.0       −1.5      0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

Best state sequence

[Figure: 3-state left-to-right HMM with initial probability π1, transitions a11, a12, a22, a23, a33, and state-output distributions b1(ot), b2(ot), b3(ot); an observation sequence O = o1, o2, ..., oT is aligned to a state sequence Q = 1, 1, 1, 1, 2, 2, ..., 3, 3 with state durations D = 4, 10, 5]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 26 of 79

Best state outputs w/o dynamic features

[Figure: per-state means and variances over time]

o becomes a step-wise mean vector sequence

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

    o_t = [c_t^T, ∆c_t^T]^T,   ∆c_t = c_t − c_{t−1}

(c_t is the M-dimensional static feature vector; o_t is 2M-dimensional.)

The relationship between static and dynamic features can be arranged as o = Wc:

    [   ...    ]   [  ...                 ] [  ...   ]
    [ c_{t−1}  ]   [ ···  0  I  0  0  ··· ] [ c_{t−2} ]
    [ ∆c_{t−1} ]   [ ··· −I  I  0  0  ··· ] [ c_{t−1} ]
    [ c_t      ] = [ ···  0  0  I  0  ··· ] [ c_t     ]
    [ ∆c_t     ]   [ ···  0 −I  I  0  ··· ] [ c_{t+1} ]
    [ c_{t+1}  ]   [ ···  0  0  0  I  ··· ] [   ...   ]
    [ ∆c_{t+1} ]   [ ···  0  0 −I  I  ··· ]
    [   ...    ]   [  ...                 ]
         o                    W                 c

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
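For 1-dimensional static features, the window matrix above can be sketched in numpy. `make_W` is an illustrative name, and the first frame's delta is taken as ∆c_1 = c_1, i.e. a zero boundary c_0 = 0:

```python
import numpy as np


def make_W(T):
    """Stack, per frame t, a static row (c_t) and a delta row
    (delta c_t = c_t - c_{t-1}), giving the 2T x T matrix W with o = W c."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0             # static row: c_t
        W[2 * t + 1, t] = 1.0         # delta row:  c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    return W


c = np.array([0.0, 1.0, 3.0])
o = make_W(3) @ c
# o interleaves statics and deltas: [c1, dc1, c2, dc2, c3, dc3]
assert np.allclose(o, [0.0, 0.0, 1.0, 1.0, 3.0, 2.0])
```

For M-dimensional features each scalar entry above becomes an M×M identity block, exactly the I / −I pattern of the matrix on the slide.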

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

    ô = arg max_o p(o | q, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian:

    p(o | q, λ) = N(o; µ_q, Σ_q)

By setting ∂ log N(Wc; µ_q, Σ_q) / ∂c = 0:

    W^T Σ_q^{−1} W c = W^T Σ_q^{−1} µ_q

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

[Figure: band structure of W^T Σ_q^{−1} W c = W^T Σ_q^{−1} µ_q — W interleaves identity rows (static features) with difference rows (−1, 1) (dynamic features), so solving for c1, ..., cT given the state means µ_q1, ..., µ_qT reduces to a banded linear system]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
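The closed-form solution above can be sketched in a few lines of numpy (illustrative names; T = 3, 1-dim features, diagonal Σ): solving W^T Σ^{−1} W c = W^T Σ^{−1} µ turns step-wise static means into a smooth trajectory.

```python
import numpy as np


def generate_trajectory(mu, var, W):
    """Solve W^T Sigma^{-1} W c = W^T Sigma^{-1} mu for the static
    trajectory c, with diagonal Sigma given by per-frame variances."""
    P = np.diag(1.0 / var)  # Sigma^{-1}
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)


# Window matrix stacking a static row and a delta row per frame (T = 3)
T = 3
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0             # static: c_t
    W[2 * t + 1, t] = 1.0         # delta:  c_t - c_{t-1}
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

# Step-wise means [static, delta] per frame (statics 0, 1, 1; deltas 0)
mu = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 0.0])
c = generate_trajectory(mu, np.ones(2 * T), W)
# The delta constraints smooth the 0 -> 1 -> 1 step into a gradual ramp
assert c[0] > 0.0 and c[0] < c[1] < c[2] < 1.0
```

A production implementation would exploit the band structure (e.g. a banded Cholesky solve) instead of forming the dense system; the dense `np.linalg.solve` here is just for clarity.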

Generated speech parameter trajectory

[Figure: static and dynamic feature trajectories — per-state means and variances with the generated trajectory c passing smoothly through them]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: system overview.
Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: source-filter model — generated excitation parameters (log F0 with V/UV) drive a pulse train (voiced) or white noise (unvoiced) excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
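The excitation/filter split above can be sketched as follows. The helper name and the toy exponential impulse response are invented for illustration; a real vocoder derives h(n) from the generated cepstrum or LSP coefficients:

```python
import numpy as np


def excitation(f0, fs, n):
    """Pulse train at f0 Hz when voiced, white noise when unvoiced (f0 <= 0)."""
    if f0 <= 0:
        return np.random.default_rng(0).normal(0.0, 1.0, n)
    e = np.zeros(n)
    period = int(fs / f0)
    e[::period] = 1.0  # one pulse per pitch period
    return e


# x(n) = h(n) * e(n): excitation through a linear time-invariant system h(n)
fs = 16000
e = excitation(100.0, fs, 400)       # voiced: pulses every 160 samples
h = np.exp(-np.arange(64) / 8.0)     # toy impulse response standing in for
                                     # the spectral-envelope filter
x = np.convolve(e, h)[:400]
assert np.count_nonzero(e) == 3      # pulses at samples 0, 160, 320
```

The simplicity of this binary pulse/noise switch is exactly the vocoding issue raised later: mixed voiced/unvoiced sounds need richer excitation models.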

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: average-voice model estimated from training speakers via adaptive training, then adapted to target speakers]

• Train an average voice model (AVM) from training speakers using SAT
• Adapt the AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: new model λ′ obtained by interpolating HMM sets λ1, ..., λ4 with interpolation ratios I(λ′, λ)]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)

  [Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction
  Harmonic effects often cause problems

  [Figure: power spectrum (dB) over 0–8 kHz]

• Phase
  Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

  [Figure: natural vs. generated spectra, 0–8 kHz]

• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training
  Learn the relationship between linguistic & acoustic features

• Synthesis
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

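A toy sketch of how such linguistic features become network inputs. The inventories, feature choices, and helper name here are invented for illustration; a real front-end uses around 50 feature types across all linguistic levels:

```python
import numpy as np

# Hypothetical toy inventories, far smaller than a real front-end's.
phones = ['sil', 'a', 'b', 'k']
pos_tags = ['NOUN', 'VERB', 'ADJ']


def linguistic_features(phone, pos, stress, n_phones_in_word):
    """Binary (one-hot) encoding for categorical contexts plus raw
    numeric contexts, concatenated into one input vector."""
    v_phone = np.eye(len(phones))[phones.index(phone)]
    v_pos = np.eye(len(pos_tags))[pos_tags.index(pos)]
    return np.concatenate([v_phone, v_pos,
                           [float(stress), float(n_phones_in_word)]])


x = linguistic_features('a', 'NOUN', 1, 3)
assert x.shape == (len(phones) + len(pos_tags) + 2,)
```

In the systems described below, such binary and numeric context vectors (plus duration and frame-position features) form the per-frame inputs x.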

HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree with yes/no questions clustering context-dependent states over the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

[Figure: DNN-based TTS pipeline — TEXT → text analysis → input feature extraction (binary & numeric features, duration feature, frame position feature) → DNN (input layer, hidden layers, output layer) applied to the input features of each frame 1 ... T → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame-level inputs]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? ... no

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α=1) vs. DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM (α)      DNN (layers × units)   Neutral   p value   z value
  15.8 (16)    38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
  16.1 (4)     27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
  12.7 (1)     36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)w2(x1)

y

Weights rarr Softmax activation function

Means rarr Linear activation function

Variances rarr Exponential activation function

sumzj =

4

i=1

hiwij

w1(x) =exp(z1)sum2

m=1 exp(zm)

micro1(x) = z3

σ1(x) = exp(z5)

Inputs of activation function

1-dim 2-mix MDN

w2(x) =exp(z2)sum2

m=1 exp(zm)

micro1(x) = z4

σ2(x) = exp(z6)

NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)

w2(x1)

y

x1

micro1(x2 ) micro2(x2 )

σ2(x2 )σ1(x2 )

w1(x2 ) w2(x2 )

y

x2

micro1(xT ) micro2(xT )

σ1(xT )σ2(xT )

w2(xT )

w1(xT )

y

xT

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)

DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)

1ndash16 mix

Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features

− A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

(Figure: a basic RNN unrolled over time — input x_t, output y_t, with recurrent connections in the hidden layer linking t−1 → t → t+1.)

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
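The forward recurrence described above can be sketched in a few lines of NumPy (layer sizes and the single recurrent layer are illustrative assumptions, not the configuration used in the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
dx, dh, dy = 8, 16, 4                   # toy sizes (assumptions)
Wx = rng.standard_normal((dx, dh)) * 0.1
Wh = rng.standard_normal((dh, dh)) * 0.1
Wy = rng.standard_normal((dh, dy)) * 0.1

def rnn_forward(xs):
    """y_t depends on x_1..x_t through the recurrent hidden state h."""
    h = np.zeros(dh)
    ys = []
    for x in xs:                        # forward direction only: no future context
        h = np.tanh(x @ Wx + h @ Wh)    # hidden state carries past information
        ys.append(h @ Wy)
    return np.stack(ys)

ys = rnn_forward(rng.standard_normal((10, dx)))
```

Running the loop backward over a second hidden state and concatenating both gives the bidirectional variant [31] mentioned on the slide.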

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

(Figure: LSTM block — memory cell c_t with tanh input/output squashing; input gate i_t (write), output gate (read), and forget gate (reset), each a sigmoid over [x_t, h_{t−1}] plus a bias b_i, b_o, b_f; cell input bias b_c.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
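The gate structure named on the slide (write / read / reset) can be made concrete as one cell update; a minimal NumPy sketch, with peephole connections omitted and toy sizes assumed:

```python
import numpy as np

def sigm(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
dx, dh = 4, 8                            # toy sizes (assumptions)
# One weight matrix per gate over the concatenation [x_t, h_{t-1}]
Wi, Wf, Wo, Wc = (rng.standard_normal((dx + dh, dh)) * 0.1 for _ in range(4))
bi = bf = bo = bc = np.zeros(dh)

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigm(z @ Wi + bi)                       # input gate: write
    f = sigm(z @ Wf + bf)                       # forget gate: reset
    o = sigm(z @ Wo + bo)                       # output gate: read
    c = f * c_prev + i * np.tanh(z @ Wc + bc)   # linear memory cell
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(dh)
for x in rng.standard_normal((5, dx)):
    h, c = lstm_step(x, h, c)
```

Because the cell update `f * c_prev + i * tanh(...)` is additive rather than a squashed recurrence, information can persist over many steps instead of decaying as in the basic RNN.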

LSTM-based SPSS [33 34]

(Figure: LSTM-based pipeline — TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x_1 … x_T → LSTM → outputs y_1 … y_T → parameter generation → waveform synthesis → SPEECH.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

DNN             LSTM            Stats
w/ Δ    w/o Δ   w/ Δ    w/o Δ   Neutral   z      p
50.0    14.2    –       –       35.8      12.0   < 10⁻¹⁰
–       –       30.2    15.6    54.2      5.1    < 10⁻⁶
15.8    –       34.0    –       50.2      −6.2   < 10⁻⁹
28.4    –       –       33.6    38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g. adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learns the mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Best state outputs w/o dynamic features

(Figure: per-state mean and variance of the static feature — the generated output becomes a step-wise mean vector sequence.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 27 of 79

Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t⊤, Δc_t⊤]⊤,  Δc_t = c_t − c_{t−1}

(c_t: M-dimensional static features; o_t: 2M-dimensional)

The relationship between static and dynamic features can be arranged in matrix form as o = Wc:

[…, c_{t−1}⊤, Δc_{t−1}⊤, c_t⊤, Δc_t⊤, c_{t+1}⊤, Δc_{t+1}⊤, …]⊤
  = W [aside] [..., c_{t−2}⊤, c_{t−1}⊤, c_t⊤, c_{t+1}⊤, ...]⊤

where, frame by frame, W stacks a static block row [… 0 I 0 0 …] selecting c_t and a delta block row [… 0 −I I 0 …] computing c_t − c_{t−1}:

W = ⎡ … 0  I 0 0 … ⎤  (c_{t−1})
    ⎢ … −I I 0 0 … ⎥  (Δc_{t−1})
    ⎢ … 0 0  I 0 … ⎥  (c_t)
    ⎢ … 0 −I I 0 … ⎥  (Δc_t)
    ⎢ … 0 0 0  I … ⎥  (c_{t+1})
    ⎣ … 0 0 −I I … ⎦  (Δc_{t+1})

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 28 of 79
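The block structure of W above can be built directly; a minimal NumPy sketch, assuming first-order Δ only (the slides also use Δ²) and a zero boundary c_0 = 0:

```python
import numpy as np

def delta_window_matrix(T, dim=1):
    """Stack static and delta rows so that o = W c, with
    Δc_t = c_t - c_{t-1} and c_0 := 0 at the boundary (assumption)."""
    I = np.eye(dim)
    W = np.zeros((2 * T * dim, T * dim))
    for t in range(T):
        W[2 * t * dim:(2 * t + 1) * dim, t * dim:(t + 1) * dim] = I       # static row: c_t
        W[(2 * t + 1) * dim:(2 * t + 2) * dim, t * dim:(t + 1) * dim] = I  # delta row: +c_t
        if t > 0:
            W[(2 * t + 1) * dim:(2 * t + 2) * dim, (t - 1) * dim:t * dim] = -I  # delta row: -c_{t-1}
    return W

W = delta_window_matrix(T=4)
c = np.array([1.0, 3.0, 2.0, 2.0])
o = W @ c   # interleaved [c_1, Δc_1, c_2, Δc_2, ...]
```

Stacking one more band per window (e.g. second differences for Δ²) extends the same construction.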

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian,

p(o | q, λ) = N(o; μ_q, Σ_q)

Setting ∂ log N(Wc; μ_q, Σ_q) / ∂c = 0 gives

W⊤ Σ_q⁻¹ W c = W⊤ Σ_q⁻¹ μ_q

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
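Given W and the per-frame Gaussian statistics, the normal equations above can be solved directly. A toy sketch (T = 3, 1-dim statics, diagonal Σ_q, and made-up means/variances — all assumptions for brevity):

```python
import numpy as np

T = 3
# W for static + delta features (Δc_t = c_t − c_{t−1}; Δc_1 = c_1 as a boundary assumption)
W = np.zeros((2 * T, T))
for t in range(T):
    W[2 * t, t] = 1.0
    W[2 * t + 1, t] = 1.0
    if t > 0:
        W[2 * t + 1, t - 1] = -1.0

mu = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])   # per frame: [static mean, delta mean]
var = np.array([1.0, 0.1, 1.0, 0.1, 1.0, 0.1])  # diagonal Σ_q (toy values)
Sinv = np.diag(1.0 / var)

# Normal equations: W^T Σ^{-1} W c = W^T Σ^{-1} μ
A = W.T @ Sinv @ W
b = W.T @ Sinv @ mu
c = np.linalg.solve(A, b)   # smooth trajectory honoring static and delta statistics
```

In practice A is banded, so the system is solved in O(T) with a banded Cholesky factorization rather than a dense solve.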


Speech parameter generation algorithm [9]

(Figure: structure of W⊤Σ_q⁻¹W c = W⊤Σ_q⁻¹μ_q — W stacks identity rows (1, 0, …) for statics and difference rows (−1, 1) for deltas over frames 1…T, so c_1 … c_T are obtained by solving a banded linear system against μ_{q1} … μ_{qT}.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

(Figure: generated trajectory c — static and dynamic feature means and variances, with the generated static parameter sequence passing smoothly through them.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

(Figure: system overview. Training part: spectral and excitation parameters are extracted from the SPEECH DATABASE and, together with labels, used for training HMMs → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation + synthesis filter → SYNTHESIZED SPEECH.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

Generated excitation parameters (log F0 with V/UV) drive a pulse train (voiced) or white noise (unvoiced) excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); the synthesized speech is x(n) = h(n) ∗ e(n).

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
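The pulse/noise source-filter reconstruction above can be sketched as follows; the one-pole "envelope" filter and all parameter values are illustrative assumptions, not a real vocoder:

```python
import numpy as np

fs = 16000
f0 = 200.0                          # voiced pitch in Hz (assumption)
n_frame = fs // 100                 # 10 ms worth of samples

# Excitation e(n): white noise for an unvoiced segment, pulse train for a voiced one
period = int(fs / f0)
voiced = np.zeros(n_frame)
voiced[::period] = 1.0
unvoiced = np.random.default_rng(0).standard_normal(n_frame) * 0.1
e = np.concatenate([unvoiced, voiced])

# Linear time-invariant system h(n) = a^n: x(n) = h(n) * e(n)
# realized recursively as x[k] = e[k] + a * x[k-1]
a = 0.95
x = np.zeros_like(e)
for k in range(len(e)):
    x[k] = e[k] + (a * x[k - 1] if k > 0 else 0.0)
```

Real systems replace the toy filter with the MLSA/LSP filters listed on the next slide, driven by the generated spectral parameters.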

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics: adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

(Figure: adaptive training produces an average-voice model from the training speakers; adaptation transforms it toward target speakers.)

• Train an average voice model (AVM) from training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
→ small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

(Figure: a new HMM set λ′ is obtained by interpolating between representative HMM sets λ_1 … λ_4 with ratios I(λ′, λ_k). λ: HMM set; I(λ′, λ): interpolation ratio.)

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)

(Figure: excitation e(n) switching between white noise for unvoiced and a pulse train for voiced segments.)

• Spectral envelope extraction
Harmonic effects often cause problems

(Figure: power spectrum in dB over 0–8 kHz.)

• Phase
Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic/stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
Statistics do not vary within an HMM state

• Conditional independence assumption
State output probability depends only on the current state

• Weak duration modeling
State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model

− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model

− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model

− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

(Figure: natural vs. generated spectra over 0–8 kHz.)

• Why?

− Details of the spectral (formant) structure disappear
− Using a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering

− Mel-cepstrum
− LSP

• Nonparametric approach

− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics

− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
− Small footprint
− Robustness

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training
Learn the relationship between linguistic & acoustic features

• Synthesis
Map linguistic features to acoustic ones

• Linguistic features used in SPSS

− Phoneme-, syllable-, word-, phrase-, and utterance-level features
− e.g. phone identity, POS, stress of words in a phrase
− Around 50 different types, much more than in ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

(Figure: a binary decision tree over context questions (yes/no splits) partitions the acoustic space.)

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

(Figure: a feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y.)

• The DNN represents the conditional distribution of y given x

• The DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
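The frame-level linguistic-to-acoustic mapping can be sketched as a plain feed-forward pass; ReLU hidden layers with a linear output follow the later experimental setup, while the shrunken layer sizes and output dimensionality are assumptions for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# 449 linguistic inputs (as in the DNN setup); small hidden layers instead of
# 4x1024 (assumption); toy output dim standing in for the acoustic statistics.
sizes = [449, 64, 64, 64, 50]
params = [(rng.standard_normal((d_in, d_out)) * np.sqrt(2.0 / d_in), np.zeros(d_out))
          for d_in, d_out in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = x
    for layer, (Wl, bl) in enumerate(params):
        h = h @ Wl + bl
        if layer < len(params) - 1:     # ReLU on hidden layers, linear output
            h = relu(h)
    return h

x = rng.standard_normal((5, 449))       # 5 frames of linguistic features
y = forward(x)                          # frame-level acoustic predictions
```

Training minimizes the mean squared error between y and the target acoustic features, which is exactly the source of the unimodality limitation discussed later.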

Framework

(Figure: DNN-based pipeline — TEXT → text analysis → input feature extraction; frame-level input features (binary & numeric linguistic features, duration feature, frame position feature) for frames 1…T are fed through the input, hidden, and output layers; the output layer produces statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature); duration prediction, parameter generation, and waveform synthesis produce SPEECH.)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture with non-linear operations
→ integrates feature extraction into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than a fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? ... no

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

(Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α = 1), and DNN (4×512).)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79
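The z statistics in preference tables like the one above typically come from a paired-preference significance test; one common approximation is a sign test on the non-neutral ratings (an assumption — the slides do not state which test was used):

```python
import math

def sign_test_z(n_a, n_b):
    """Normal approximation to a sign test on paired preferences:
    n_a / n_b are counts preferring system A / B; neutral ties are
    ignored (an assumption about the protocol)."""
    n = n_a + n_b
    if n == 0:
        return 0.0
    # Under H0 each non-neutral rating prefers A with probability 1/2,
    # so n_a ~ Binomial(n, 1/2); z = (n_a - n/2) / sqrt(n/4).
    return (n_a - n_b) / math.sqrt(n)

z = sign_test_z(250, 71)   # e.g. 50.0% vs 14.2% of 500 ratings
```

A positive z favors system A, a negative z favors B; the magnitude maps to the p-value through the standard normal tail.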

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

(Figure: with a one-to-many mapping, data samples form two branches in (y1, y2) space; the NN prediction falls between them.)

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− An NN trained with an MSE loss → approximates the conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

(Figure: a 1-dim, 2-mix MDN — the output layer emits activations z_1 … z_6, which are mapped to the mixture weights, means, and variances.)

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs to the activation functions: z_j = Σ_i h_i w_ij

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)    w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
μ_1(x) = z_3                                 μ_2(x) = z_4
σ_1(x) = exp(z_5)                            σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
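The activation-function mapping above is easy to make concrete; a sketch generalizing the slide's 1-dim, 2-mix case to n_mix components, with the negative log-likelihood used for training included (the helper names are my own, not from the slides):

```python
import numpy as np

def mdn_split(z, n_mix):
    """Map raw output-layer activations z (length 3*n_mix) to GMM parameters."""
    logits, mu, log_sigma = np.split(z, [n_mix, 2 * n_mix])
    w = np.exp(logits - logits.max())
    w /= w.sum()                     # softmax -> mixture weights (sum to 1)
    sigma = np.exp(log_sigma)        # exponential -> positive std deviations
    return w, mu, sigma              # means pass through linearly

def mdn_nll(z, y, n_mix=2):
    """Negative log-likelihood of scalar target y under the predicted GMM."""
    w, mu, sigma = mdn_split(z, n_mix)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())

z = np.array([0.2, -0.1, 0.0, 1.0, 0.1, 0.3])   # [z_1 .. z_6] as on the slide
w, mu, sigma = mdn_split(z, 2)
nll = mdn_nll(z, 0.5)
```

Because the loss is a mixture likelihood rather than MSE, the network can represent multimodal frame distributions and predict frame-dependent variances for parameter generation.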

DMDN-based SPSS [27]

micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)

w2(x1)

y

x1

micro1(x2 ) micro2(x2 )

σ2(x2 )σ1(x2 )

w1(x2 ) w2(x2 )

y

x2

micro1(xT ) micro2(xT )

σ1(xT )σ2(xT )

w2(xT )

w1(xT )

y

xT

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)

DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)

1ndash16 mix

Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

[Figure: network with recurrent connections in the hidden layer, unrolled over time — inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — the memory cell c_t is written through an input gate i_t, reset through a forget gate, and read through an output gate to produce h_t; each gate applies a sigmoid to (x_t, h_{t−1}, bias), and the cell input and output pass through tanh. Input gate: write; forget gate: reset; output gate: read]
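The cell arithmetic in the block diagram can be sketched in a few lines of NumPy. This is a generic single-cell LSTM step with illustrative sizes — no peephole connections or projection layer, which production implementations may add:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a vanilla LSTM cell: sigmoid gates over (x_t, h_{t-1})
    control writing to, resetting, and reading from the linear cell c."""
    z = np.concatenate([x_t, h_prev])
    i = sigm(p["Wi"] @ z + p["bi"])      # input gate  (write)
    f = sigm(p["Wf"] @ z + p["bf"])      # forget gate (reset)
    o = sigm(p["Wo"] @ z + p["bo"])      # output gate (read)
    g = np.tanh(p["Wc"] @ z + p["bc"])   # candidate cell input
    c_t = f * c_prev + i * g             # linear memory cell
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
n_in, n_h = 3, 4                         # illustrative sizes
p = {f"W{k}": 0.1 * rng.standard_normal((n_h, n_in + n_h)) for k in "ifoc"}
p.update({f"b{k}": np.zeros(n_h) for k in "ifoc"})
h, c = np.zeros(n_h), np.zeros(n_h)
for _ in range(5):                       # unroll over a short input sequence
    h, c = lstm_step(rng.standard_normal(n_in), h, c, p)
```

Because the cell update is additive (f · c_prev + i · g) rather than a repeated squashing, gradients and information survive over many more steps than in the basic RNN above.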

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x_1, x_2, …, x_T → LSTM → outputs y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH]

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39th mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z       p
50.0       14.2        –           –            35.8      12.0    < 10^-10
–          –           30.2        15.6         54.2      5.1     < 10^-6
15.8       –           34.0        –            50.2      −6.2    < 10^-9
28.4       –           –           33.6         38.0      −1.5    0.138

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
− Flexible (e.g., adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
− Learns a mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.


Using dynamic features

State output vectors include static & dynamic features:

o_t = [c_t^T, ∆c_t^T]^T,   ∆c_t = c_t − c_{t−1}

where c_t is the M-dimensional static feature vector and o_t is 2M-dimensional.

The relationship between static and dynamic features can be arranged in matrix form as o = Wc, where W is a band matrix whose row pairs select the statics ( … 0 I 0 0 … ) and compute the deltas ( … −I I 0 0 … ):

[Figure: o = (…, c_{t−1}, ∆c_{t−1}, c_t, ∆c_t, c_{t+1}, ∆c_{t+1}, …)^T = W (…, c_{t−2}, c_{t−1}, c_t, c_{t+1}, …)^T]
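As a minimal sketch of the arrangement above (1-dimensional static features, so M = 1, and the boundary convention ∆c_1 = c_1), the window matrix W can be built and applied like this:

```python
import numpy as np

def stack_W(T):
    """Window matrix W with o = W c, where o interleaves per-frame
    statics c_t and deltas dc_t = c_t - c_{t-1} (dc_1 = c_1 here)."""
    D = np.eye(T) - np.eye(T, k=-1)      # delta rows: c_t - c_{t-1}
    W = np.zeros((2 * T, T))
    W[0::2] = np.eye(T)                  # static rows: pick out c_t
    W[1::2] = D                          # delta rows
    return W

c = np.array([1.0, 2.0, 4.0, 7.0])      # toy static trajectory
o = stack_W(len(c)) @ c                  # [c1, dc1, c2, dc2, ...]
# → [1., 1., 2., 1., 4., 2., 7., 3.]
```

With higher-order windows (∆² rows) the same construction simply adds a third row per frame.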

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q, λ) subject to o = Wc

If the state-output distribution is a single Gaussian,

p(o | q, λ) = N(o; μ_q, Σ_q)

By setting ∂ log N(Wc; μ_q, Σ_q) / ∂c = 0,

W^T Σ_q^{-1} W c = W^T Σ_q^{-1} μ_q


Speech parameter generation algorithm [9]

W^T Σ_q^{-1} W c = W^T Σ_q^{-1} μ_q

[Figure: band structure of the system — W^T Σ_q^{-1} W is a banded matrix built from the 1/−1 window coefficients; c = (c_1, c_2, …, c_T) and the right-hand side stacks the state means (μ_q1, μ_q2, …, μ_qT). The system can be solved efficiently by exploiting its band structure]
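A dense-solver sketch of the system above, with toy per-frame means and diagonal variances (assumed values; a real implementation exploits the band structure of W^T Σ_q^{-1} W rather than forming and solving it densely):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50
# Statics + deltas window matrix, as in the earlier sketch (M = 1)
D = np.eye(T) - np.eye(T, k=-1)
W = np.zeros((2 * T, T))
W[0::2], W[1::2] = np.eye(T), D

mu = rng.standard_normal(2 * T)          # stacked per-frame means [c, delta]
prec = np.diag(np.full(2 * T, 2.0))      # Sigma_q^{-1}, diagonal for simplicity

A = W.T @ prec @ W                       # left-hand side  W^T S^-1 W
b = W.T @ prec @ mu                      # right-hand side W^T S^-1 mu
c = np.linalg.solve(A, b)                # smooth maximum-likelihood trajectory
```

Because the static rows of W alone form an identity, A is symmetric positive definite, so the solve always succeeds and yields the single smooth trajectory shown in the next figure.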

Generated speech parameter trajectory

[Figure: generated speech parameter trajectory c, with the per-frame means and variances of the static and dynamic features it was derived from]

HMM-based speech synthesis [4]

[Figure: training part — SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs from labels → context-dependent HMMs & state duration models; synthesis part — TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) drive a pulse-train / white-noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
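The reconstruction x(n) = h(n) ∗ e(n) can be sketched directly, with a toy exponentially decaying impulse response standing in for the synthesis filter (in practice h(n) is realized from the generated spectral parameters, e.g. by one of the filters listed below):

```python
import numpy as np

fs, f0, n = 16000, 200.0, 400            # 25 ms of a "voiced" segment
e = np.zeros(n)
e[::int(fs / f0)] = 1.0                  # pulse train at the F0 period
# (an unvoiced segment would instead use white noise: np.random.randn(n))

h = 0.9 ** np.arange(50)                 # toy impulse response for h(n)
x = np.convolve(e, h)[:n]                # x(n) = h(n) * e(n)
```

Switching e(n) between the pulse train and white noise according to the generated V/UV decision reproduces the simple excitation scheme shown in the figure.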

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics: adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: adaptive training of an average-voice model on training speakers, followed by adaptation to target speakers]

• Train an average voice model (AVM) from the training speakers using SAT

• Adapt the AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style → small cost to create new voices

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: a new HMM set λ′ obtained by interpolating among representative HMM sets λ1–λ4 with ratios I(λ′, λ); λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data
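For the mean parameters, interpolation reduces to a weighted combination of the corresponding Gaussians' means with the ratios I(λ′, λ); a toy sketch with assumed values:

```python
import numpy as np

# Mean vectors of one corresponding Gaussian in four representative HMM sets
mus = np.array([[1.0, 0.0],
                [3.0, 2.0],
                [5.0, -2.0],
                [7.0, 4.0]])             # lambda_1 ... lambda_4 (toy values)
ratios = np.array([0.4, 0.3, 0.2, 0.1])  # interpolation ratios I(lambda', lambda_i)

mu_new = ratios @ mus                    # mean of the interpolated voice lambda'
# → [3.0, 0.6]
```

Sweeping the ratios moves the synthetic voice continuously between the representative voices, which is what makes new voices obtainable without adaptation data.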

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)

[Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction
Harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz]

• Phase
Important but usually ignored

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
Statistics do not vary within an HMM state

• Conditional independence assumption
State output probability depends only on the current state

• Weak duration modeling
State duration probability decreases exponentially with time

None of these hold for real speech

Better acoustic modeling

• Piece-wise constant statistics → dynamical models
− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical models
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration models
− Hidden semi-Markov model

Oversmoothing

• Speech parameter generation algorithm
− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: spectra (0–8 kHz) of natural vs. generated speech]

• Why?
− Details of the spectral (formant) structure disappear
− Using a better acoustic model relaxes the issue, but not enough

Oversmoothing compensation

• Postfiltering
− Mel-cepstrum
− LSP

• Nonparametric approaches
− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combining multiple-level statistics
− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics: adaptation; interpolation (eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training: learn the relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS
− Phoneme-, syllable-, word-, phrase-, and utterance-level features
− e.g., phone identity, POS, stress of words in a phrase
− Around 50 different types, far more than in ASR (typically 3–5)

Effective modeling is essential


HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree over the acoustic space, splitting by yes/no context questions]

• Decision tree-clustered HMMs with GMM state-output distributions

DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x

• The DNN replaces the decision trees and GMMs

Framework

[Figure: TEXT → text analysis → input feature extraction → input features (binary & numeric, plus duration and frame-position features) at frames 1 … T → DNN (input, hidden, output layers) → statistics (mean & var) of the speech parameter vector sequence (spectral, excitation & V/UV features) → parameter generation → waveform synthesis → SPEECH; duration prediction determines the number of frames]
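The frame-level mapping in this pipeline is an ordinary feed-forward pass; a sketch with illustrative layer sizes (the 449-dimensional input matches the later experimental setup, while the 127-dimensional output is an assumed acoustic-feature dimensionality, not a figure from the slides):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def dnn_forward(x, weights, biases):
    """ReLU hidden layers, linear output: one frame of linguistic features
    in, statistics of the acoustic features for that frame out."""
    h = x
    for Wl, bl in zip(weights[:-1], biases[:-1]):
        h = relu(Wl @ h + bl)            # hidden layers
    return weights[-1] @ h + biases[-1]  # linear output layer

rng = np.random.default_rng(1)
sizes = [449, 1024, 1024, 1024, 1024, 127]    # illustrative architecture
Ws = [0.01 * rng.standard_normal((m, n))
      for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
y = dnn_forward(rng.standard_normal(449), Ws, bs)
```

Running this once per frame, then feeding the stacked outputs to the parameter generation step, reproduces the frame-by-frame mapping the figure depicts.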

Advantages of NN-based acoustic modeling

• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations → feature extraction integrated into acoustic modeling

• Distributed representation
− Can be exponentially more efficient than a fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39th mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum trajectories over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4×512)]

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10^-6   −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10^-6   −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10^-6   −11.5

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples vs. NN prediction in (y1, y2) space — the NN predicts the conditional mean of a multimodal distribution]

• Unimodality
− Humans can speak in different ways → one-to-many mapping
− An NN trained with an MSE loss approximates the conditional mean

• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the output layer emits z_1 … z_6, which are mapped to mixture weights w_1(x), w_2(x), means μ_1(x), μ_2(x), and variances σ_1(x), σ_2(x) of a GMM over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs to the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m),  w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
μ_1(x) = z_3,  μ_2(x) = z_4
σ_1(x) = exp(z_5),  σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
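The output-layer mapping above (softmax for weights, linear for means, exponential for variances) can be sketched directly; z and the query point are toy values:

```python
import numpy as np

def mdn_params(z, n_mix):
    """Map raw output-layer activations z to GMM parameters for a
    1-dim, n_mix-component mixture density output layer."""
    w = np.exp(z[:n_mix]) / np.exp(z[:n_mix]).sum()   # softmax -> weights
    mu = z[n_mix:2 * n_mix]                           # linear  -> means
    sigma = np.exp(z[2 * n_mix:])                     # exp     -> std devs > 0
    return w, mu, sigma

def mdn_density(y, w, mu, sigma):
    """p(y | x) as a weighted sum of Gaussians."""
    g = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return float((w * g).sum())

z = np.array([0.2, -0.1, 1.0, -1.0, 0.0, 0.3])        # z_1 ... z_6 (2-mix, 1-dim)
w, mu, sigma = mdn_params(z, 2)
p = mdn_density(0.5, w, mu, sigma)
```

Training minimizes −log p(y | x) over the data, so the network can place probability mass on several modes instead of collapsing to the conditional mean.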

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x_1, x_2, …, x_T → DMDN emitting w_k(x_t), μ_k(x_t), σ_k(x_t) for each frame → parameter generation → waveform synthesis → SPEECH]

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM             1 mix: 3.537 ± 0.113   2 mix: 3.397 ± 0.115
DNN             4×1024: 3.635 ± 0.127  5×1024: 3.681 ± 0.109  6×1024: 3.652 ± 0.108  7×1024: 3.637 ± 0.129
DMDN (4×1024)   1 mix: 3.654 ± 0.117   2 mix: 3.796 ± 0.107   4 mix: 3.766 ± 0.113   8 mix: 3.805 ± 0.113   16 mix: 3.791 ± 0.102

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

  DNN               LSTM                            Stats
  w/ ∆    w/o ∆     w/ ∆    w/o ∆     Neutral     z        p
  50.0    14.2       –        –        35.8      12.0   < 10⁻¹⁰
   –        –       30.2     15.6      54.2       5.1   < 10⁻⁶
  15.8      –       34.0       –       50.2      −6.2   < 10⁻⁹
  28.4      –         –      33.6      38.0      −1.5     0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements
    Vocoding / acoustic modeling / oversmoothing compensation

• NN-based SPSS
  − Learn mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

  ô = arg max_o p(o | q, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian:

  p(o | q, λ) = N(o; μq, Σq)

By setting ∂ log N(Wc; μq, Σq) / ∂c = 0:

  W⊤ Σq⁻¹ W c = W⊤ Σq⁻¹ μq
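For a single one-dimensional stream these normal equations are a small banded linear system. The NumPy sketch below is a simplification (one delta window with illustrative coefficients ±0.5, diagonal Σ), not the full multi-stream algorithm of [9]:

```python
import numpy as np

def mlpg(mu, var, T):
    """Solve W' Sigma^-1 W c = W' Sigma^-1 mu for the static trajectory c.
    mu/var interleave per-frame [static, delta] means and variances."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                       # static row: picks c[t]
        W[2 * t + 1, max(t - 1, 0)] -= 0.5      # delta row: 0.5*(c[t+1] - c[t-1])
        W[2 * t + 1, min(t + 1, T - 1)] += 0.5  # (window clamped at the edges)
    P = np.diag(1.0 / var)                      # Sigma^-1 (diagonal)
    return np.linalg.solve(W.T @ P @ W, W.T @ P @ mu)
```

Real systems exploit the band structure of W⊤Σ⁻¹W (e.g., a banded Cholesky factorization) rather than a dense solve.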

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

  W⊤ Σq⁻¹ W c = W⊤ Σq⁻¹ μq

[Figure: band structure of the system — W stacks the static and delta window coefficients (1, −1, 1, 0, …) for each frame, relating the static feature sequence c₁, c₂, …, c_T to the state means μ_q1, μ_q2, …, μ_qT]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: generated speech parameter trajectory c — static and dynamic components with their per-state means and variances]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: HMM-based speech synthesis system.
 Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs (with labels) → context-dependent HMMs & state duration models.
 Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) drive a pulse-train / white-noise excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n)]
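A toy version of this source-filter reconstruction (illustrative parameter values; a real vocoder uses frame-varying filters such as MLSA rather than a fixed h):

```python
import numpy as np

def source_filter(f0, h, fs=16000, frame_shift=80):
    """Excitation e(n): pulse train at F0 for voiced frames, white noise for
    unvoiced frames (F0 = 0); output x(n) = h(n) * e(n) through an LTI filter."""
    rng = np.random.default_rng(0)
    e = np.zeros(len(f0) * frame_shift)
    phase = 1.0                       # emit a pulse at the first voiced sample
    for i, f in enumerate(f0):
        for n in range(frame_shift):
            t = i * frame_shift + n
            if f > 0.0:               # voiced: one impulse per pitch period
                if phase >= 1.0:
                    e[t] = 1.0
                    phase -= 1.0
                phase += f / fs
            else:                     # unvoiced: white noise
                e[t] = 0.1 * rng.standard_normal()
    return np.convolve(e, h)[:len(e)] # x(n) = h(n) * e(n)
```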

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    Adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: adaptive training of an average-voice model (AVM) on training speakers, followed by adaptation to target speakers]

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires only a small amount of data from the target speaker / speaking style
  → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: interpolating among HMM sets λ₁, λ₂, λ₃, λ₄ with interpolation ratios I(λ′, λ) to obtain a new HMM set λ′]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse / noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)

  [Figure: excitation e(n) switching between white noise (unvoiced) and a pulse train (voiced)]

• Spectral envelope extraction
  Harmonic effects often cause problems

  [Figure: power spectrum (dB) over 0–8 kHz]

• Phase
  Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic / stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of them hold for real speech
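The last point follows directly from the HMM self-transition: with self-transition probability a, the state duration d is geometric, p(d) = a^{d−1}(1 − a). A two-line numerical check (a = 0.9 is just an illustrative value):

```python
import numpy as np

a = 0.9  # illustrative self-transition probability
p = np.array([a ** (d - 1) * (1 - a) for d in range(1, 11)])
# Probability mass decreases monotonically with duration — no mode at a
# typical phone length, unlike real speech durations.
assert np.all(np.diff(p) < 0)
```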

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

  [Figure: spectra (0–8 kHz) of natural vs. generated speech]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    Adaptation, interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training
  Learn relationship between linguistic & acoustic features

• Synthesis
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees — yes/no context questions — partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h₁, h₂, h₃ mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces decision trees and GMMs
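A minimal sketch of that replacement (NumPy; the sigmoid-hidden / linear-output choice follows the experimental setup later in the talk, while the layer sizes and random weights here are toy placeholders):

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Map a linguistic feature vector x to acoustic-feature statistics:
    sigmoid hidden layers followed by a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))  # sigmoid hidden layer
    return weights[-1] @ h + biases[-1]          # linear output layer

rng = np.random.default_rng(0)
sizes = [5, 8, 8, 3]  # toy dims: 5 linguistic inputs -> 3 acoustic outputs
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
y = dnn_forward(rng.standard_normal(5), weights, biases)
```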

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

[Figure: DNN-based SPSS framework — TEXT → text analysis → input feature extraction → input features (binary & numeric, plus duration and frame-position features) at frames 1…T → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame alignment]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? → no

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithm

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21]; MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1,024/2,048 units/layer; sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

(w/o grouping questions, numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum coefficient over ~500 frames — natural speech vs. HMM (α = 1) vs. DNN (4 × 512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

  HMM (α)      DNN (layers × units)   Neutral   p value   z value
  15.8 (16)    38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
  16.1 (4)     27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
  12.7 (1)     36.6 (4 × 1,024)       50.7      < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: one-to-many data samples vs. the single NN prediction in (y₁, y₂) space]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the output layer predicts GMM weights w₁(x), w₂(x), means μ₁(x), μ₂(x), and variances σ₁(x), σ₂(x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of activation function:  z_j = Σ_{i=1}^{4} h_i w_ij

  w₁(x) = exp(z₁) / Σ_{m=1}² exp(z_m)     w₂(x) = exp(z₂) / Σ_{m=1}² exp(z_m)
  μ₁(x) = z₃                               μ₂(x) = z₄
  σ₁(x) = exp(z₅)                          σ₂(x) = exp(z₆)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
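In code, the output-layer activations look like this — a 1-dim, 2-mix sketch matching the slide, with z the vector of pre-activations z₁…z₆ ordered weights/means/variances for clarity:

```python
import numpy as np

def mdn_params(z, n_mix=2):
    """Split raw output-layer pre-activations into GMM parameters:
    softmax -> weights, identity -> means, exp -> standard deviations."""
    zw, zm, zs = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:]
    w = np.exp(zw - zw.max())
    w /= w.sum()                 # mixture weights sum to 1
    return w, zm, np.exp(zs)     # means are linear; stddevs stay positive
```

For example, `mdn_params(np.array([0., 0., 1., -1., 0., 0.]))` gives equal weights, means ±1, and unit standard deviations.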

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

[Figure: DMDN-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → inputs x₁, x₂, …, x_T → DMDN outputs per-frame GMM parameters w₁(x_t), w₂(x_t), μ₁(x_t), μ₂(x_t), σ₁(x_t), σ₂(x_t) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

  DNN architecture: 4–7 hidden layers, 1,024 units/hidden layer; ReLU (hidden), linear (output)
  DMDN architecture: 4 hidden layers, 1,024 units/hidden layer; ReLU [28] (hidden), mixture density (output); 1–16 mix
  Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

  HMM           1 mix       3.537 ± 0.113
                2 mix       3.397 ± 0.115
  DNN           4 × 1,024   3.635 ± 0.127
                5 × 1,024   3.681 ± 0.109
                6 × 1,024   3.652 ± 0.108
                7 × 1,024   3.637 ± 0.129
  DMDN          1 mix       3.654 ± 0.117
  (4 × 1,024)   2 mix       3.796 ± 0.107
                4 mix       3.766 ± 0.113
                8 mix       3.805 ± 0.113
                16 mix      3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding / succeeding contexts
    (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 36: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q, λ)  subject to  o = Wc

If the state-output distribution is a single Gaussian:

p(o | q, λ) = N(o; μ_q, Σ_q)

By setting ∂ log N(Wc; μ_q, Σ_q) / ∂c = 0:

Wᵀ Σ_q⁻¹ W c = Wᵀ Σ_q⁻¹ μ_q

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79


Speech parameter generation algorithm [9]

[Figure: the linear system Wᵀ Σ_q⁻¹ W c = Wᵀ Σ_q⁻¹ μ_q written out as banded matrices: W stacks the static (1) and delta (−1, 1) window coefficients for each frame, c = (c₁, …, c_T) is the static feature sequence, and μ_q = (μ_q₁, …, μ_qT) are the state means]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79
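The closed-form solution above can be sketched numerically. The following is a minimal illustration, not the production algorithm: a 1-dimensional static feature with a single [−1, 1] delta window and diagonal covariances, solving Wᵀ Σ_q⁻¹ W c = Wᵀ Σ_q⁻¹ μ_q directly with numpy (real systems use banded/Cholesky solvers and richer delta windows):

```python
import numpy as np

def mlpg(means, variances, T):
    """Maximum-likelihood parameter generation for a 1-D static feature
    with a first-order delta window [-1, 1].
    means, variances: shape (T, 2) -> (static, delta) per frame.
    Returns the smooth static trajectory c of length T."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0              # static row: c_t
        W[2 * t + 1, t] = 1.0          # delta row:  c_t - c_{t-1}
        if t > 0:
            W[2 * t + 1, t - 1] = -1.0
    mu = means.reshape(-1)             # interleaved (static, delta) means
    prec = 1.0 / variances.reshape(-1)  # diagonal Sigma_q^{-1}
    A = W.T @ (prec[:, None] * W)      # W' Sigma^-1 W
    b = W.T @ (prec * mu)              # W' Sigma^-1 mu
    return np.linalg.solve(A, b)

# Two "states": a step in the static mean, tight delta variance around 0
# -> the generated trajectory is smoothed across the step.
means = np.array([[0.0, 0.0]] * 5 + [[1.0, 0.0]] * 5)
variances = np.array([[1.0, 0.01]] * 10)
c = mlpg(means, variances, 10)
```

The tight delta variance penalizes frame-to-frame jumps, which is exactly why the generated trajectory passes smoothly through the step in the state means.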

Generated speech parameter trajectory

[Figure: per-frame static and dynamic feature means and variances, with the generated static trajectory c running smoothly through them]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: system overview. Training part: spectral and excitation parameters are extracted from the SPEECH DATABASE and, together with labels, used to train context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral and excitation parameters → excitation generation and synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: the generated excitation parameters (log F0 with V/UV) switch between a pulse train and white noise to form the excitation e(n); the generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); the synthesized speech is x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
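The source-filter reconstruction on this slide can be sketched as follows. The decaying impulse response `h` is a toy stand-in for the real synthesis filter (e.g. an MLSA filter driven by mel-cepstra, see the next slide):

```python
import numpy as np

def excitation(f0, fs, n_samples, seed=0):
    """Pulse train for voiced frames (f0 > 0), white noise for unvoiced."""
    if f0 > 0.0:
        e = np.zeros(n_samples)
        period = int(fs / f0)
        e[::period] = 1.0  # one impulse per pitch period
        return e
    return np.random.default_rng(seed).standard_normal(n_samples)

def synthesize(f0, fs, h):
    """x(n) = h(n) * e(n): pass the excitation through an LTI system h."""
    e = excitation(f0, fs, len(h) * 4)
    return np.convolve(e, h)[: len(e)]

fs = 16000
h = np.exp(-np.arange(64) / 8.0)  # toy decaying impulse response (assumption)
voiced = synthesize(200.0, fs, h)   # periodic: pulses every fs/200 = 80 samples
unvoiced = synthesize(0.0, fs, h)   # noise-driven
```

The hard voiced/unvoiced switch here is precisely the "simple pulse / noise excitation" whose limitations are discussed on the "Vocoding issues" slide below.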

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics (adaptation, interpolation)
− Small footprint [10, 11]
− Robustness [12]

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: an average-voice model is trained from multiple training speakers by adaptive training, then adapted to target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker / speaking style
→ small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: a new voice λ′ obtained by interpolating HMM sets λ₁ … λ₄ with interpolation ratios I(λ′, λ); λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
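One simple way to realize such interpolation is to combine the per-set Gaussian statistics with the interpolation ratios. The sketch below is an assumption for illustration (several definitions of I(λ′, λ) exist in the literature, e.g. [14]): it interpolates the mean and variance of a single state-output distribution as moment-matched mixtures of the source sets.

```python
import numpy as np

def interpolate_gaussians(means, variances, ratios):
    """Interpolate state-output Gaussians of K HMM sets.
    means, variances: shape (K, D); ratios: shape (K,), summing to 1.
    The interpolated Gaussian matches the first two moments of the
    ratio-weighted mixture of the source Gaussians."""
    ratios = np.asarray(ratios)[:, None]
    mu = (ratios * means).sum(axis=0)
    var = (ratios * (variances + means ** 2)).sum(axis=0) - mu ** 2
    return mu, var

# Two "speakers", 50/50 mix of a 2-dim state-output distribution.
means = np.array([[0.0, 1.0], [2.0, 3.0]])
variances = np.array([[1.0, 1.0], [1.0, 1.0]])
mu, var = interpolate_gaussians(means, variances, [0.5, 0.5])
```

With equal ratios, the interpolated mean lands midway between the two speakers' means, and the variance grows to cover both.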

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse / noise excitation
Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)

[Figure: pulse train (voiced) and white noise (unvoiced) switched to form the excitation e(n)]

• Spectral envelope extraction
Harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple]

• Phase
Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
Statistics do not vary within an HMM state

• Conditional independence assumption
State output probability depends only on the current state

• Weak duration modeling
State duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → Dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → Graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → Explicit duration model
− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: spectra (0–8 kHz) of natural vs generated speech]

• Why?
− Details of spectral (formant) structure disappear
− Using a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
− Mel-cepstrum
− LSP

• Nonparametric approach
− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics
− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics (adaptation; interpolation: eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → Neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training
Learn the relationship between linguistic & acoustic features

• Synthesis
Map linguistic features to acoustic ones

• Linguistic features used in SPSS
− Phoneme, syllable, word, phrase, utterance-level features
− e.g. phone identity, POS, stress of words in a phrase
− Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: a binary decision tree of yes/no questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: a feed-forward network with hidden layers h₁, h₂, h₃ mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
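A sketch of this mapping with hypothetical layer sizes (the actual systems, unit counts and activations are given on the experimental-setup slides; tanh hidden units are used here only for brevity): a plain feed-forward regression network from linguistic to acoustic features.

```python
import numpy as np

rng = np.random.default_rng(0)

def init(sizes):
    """Small random weights and zero biases for each layer."""
    return [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, params):
    """Feed-forward pass: linguistic features x -> acoustic features y.
    Hidden layers are non-linear; the output layer is linear,
    as usual for regression trained with an MSE loss."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    return h @ W + b

# Hypothetical sizes: 300 binary/numeric linguistic inputs, two hidden
# layers, 40 acoustic outputs (e.g. mel-cepstral coefficients).
params = init([300, 64, 64, 40])
y = forward(rng.standard_normal((5, 300)), params)
```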

Framework

[Figure: TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features for each of frames 1 … T, supplied by a duration prediction module) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrated feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture with non-linear operations
→ feature extraction integrated into acoustic modeling

• Distributed representation
− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production
− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No.

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500, comparing natural speech, HMM (α=1) and DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

HMM (α): 15.8 (16)   DNN: 38.5 (4 × 256)    Neutral: 45.7   p < 10⁻⁶   z = −9.9
HMM (α): 16.1 (4)    DNN: 27.2 (4 × 512)    Neutral: 56.8   p < 10⁻⁶   z = −5.1
HMM (α): 12.7 (1)    DNN: 36.6 (4 × 1024)   Neutral: 50.7   p < 10⁻⁶   z = −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples from a one-to-many mapping in the (y₁, y₂) plane; the NN prediction runs between the modes]

• Unimodality
− Humans can speak in different ways → one-to-many mapping
− NN trained with an MSE loss → approximates the conditional mean

• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: a 1-dim, 2-mixture MDN; four hidden units feed an output layer whose activations z₁ … z₆ parameterize the mixture weights w₁(x), w₂(x), means μ₁(x), μ₂(x) and standard deviations σ₁(x), σ₂(x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ⁴ᵢ₌₁ hᵢ w_ij

w₁(x) = exp(z₁) / Σ²ₘ₌₁ exp(zₘ)      w₂(x) = exp(z₂) / Σ²ₘ₌₁ exp(zₘ)
μ₁(x) = z₃                            μ₂(x) = z₄
σ₁(x) = exp(z₅)                       σ₂(x) = exp(z₆)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
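The activation functions on this slide translate directly into code. This reproduces the 1-dim, 2-mixture case (activations z₁ … z₆), plus the mixture negative log-likelihood that serves as the training loss:

```python
import numpy as np

def mdn_split(z):
    """Map the six output activations z1..z6 of a 1-dim, 2-mixture MDN
    to mixture weights (softmax), means (linear) and std devs (exp)."""
    zw, zmu, zsig = z[:2], z[2:4], z[4:6]
    w = np.exp(zw - zw.max())
    w = w / w.sum()              # softmax: weights sum to 1
    return w, zmu, np.exp(zsig)  # means linear, std devs positive

def mdn_nll(y, w, mu, sigma):
    """Negative log-likelihood of a scalar target y under the mixture
    (the quantity minimized when training the MDN)."""
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) \
             / (np.sqrt(2.0 * np.pi) * sigma)
    return -np.log(comp.sum())

# Equal weights, means at -1 and +1, unit variances.
w, mu, sigma = mdn_split(np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0]))
loss = mdn_nll(0.0, w, mu, sigma)
```

Unlike the MSE-trained linear output layer, the mixture can place probability mass on both modes of a one-to-many mapping instead of collapsing to the conditional mean between them.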

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x₁, x₂, …, x_T fed to a deep MDN that outputs mixture weights w, means μ and variances σ for each frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix 3.537 ± 0.113; 2 mix 3.397 ± 0.115
DNN: 4×1024 3.635 ± 0.127; 5×1024 3.681 ± 0.109; 6×1024 3.652 ± 0.108; 7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117; 2 mix 3.796 ± 0.107; 4 mix 3.766 ± 0.113; 8 mix 3.805 ± 0.113; 16 mix 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: a network with recurrent connections in the hidden layer, unrolled over time: input x_t, output y_t, hidden state passed from t−1 to t]

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts
− Information in hidden layers loops through the recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
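A minimal (Elman-style) recurrence illustrating the point above: only previous contexts are visible, so perturbing an early input changes later outputs, but that information must survive repeated tanh squashing to reach them.

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, Wy, h0):
    """Elman RNN: h_t = tanh(Wx x_t + Wh h_{t-1}), y_t = Wy h_t.
    The recurrent term Wh h_{t-1} is what carries past context forward."""
    h, ys = h0, []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        ys.append(Wy @ h)
    return np.array(ys)

rng = np.random.default_rng(1)
T, d_in, d_h, d_out = 4, 3, 5, 2
xs = rng.standard_normal((T, d_in))
Wx = 0.5 * rng.standard_normal((d_h, d_in))
Wh = 0.5 * rng.standard_normal((d_h, d_h))
Wy = 0.5 * rng.standard_normal((d_out, d_h))

ys = rnn_forward(xs, Wx, Wh, Wy, np.zeros(d_h))

# Perturb the first input: the final output changes, because the hidden
# state propagates x_0's influence forward through the recurrence.
xs_shift = xs.copy()
xs_shift[0] += 1.0
ys_shift = rnn_forward(xs_shift, Wx, Wh, Wy, np.zeros(d_h))
```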

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block. The memory cell c_t is written through the input gate, reset through the forget gate, and read through the output gate; each gate is a sigmoid unit driven by x_t and h_{t−1}, and the cell input and output pass through tanh]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
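A sketch of one LSTM step with the three gates above. The toy parameters are hand-crafted (an illustration, not trained weights) so the forget gate saturates open and the input gate saturates shut, showing how the linear cell can preserve its memory across a step:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step. The linear memory cell c is written through the
    input gate, reset through the forget gate, and read through the
    output gate; every gate sees [x_t, h_{t-1}]."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(p["Wi"] @ z + p["bi"])   # input gate  (write)
    f = sigmoid(p["Wf"] @ z + p["bf"])   # forget gate (reset)
    o = sigmoid(p["Wo"] @ z + p["bo"])   # output gate (read)
    c = f * c_prev + i * np.tanh(p["Wc"] @ z + p["bc"])
    h = o * np.tanh(c)
    return h, c

# Hand-crafted parameters (2 cells, 3 inputs): forget gate ~1 (bf large),
# input gate ~0 (bi very negative) -> the cell state is preserved.
p = {k: np.zeros((2, 5)) for k in ("Wi", "Wf", "Wo", "Wc")}
p.update(bi=np.full(2, -100.0), bf=np.full(2, 100.0),
         bo=np.zeros(2), bc=np.zeros(2))
h, c = lstm_step(np.ones(3), np.zeros(2), np.array([0.3, -0.7]), p)
```

This gating is what lets the LSTM hold long-range context that a basic RNN's hidden state would quickly decay or overwrite.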

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame linguistic features x₁, x₂, …, x_T fed to an LSTM-RNN producing acoustic features y₁, y₂, …, y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/ ∆: 50.0    DNN w/o ∆: 14.2                       Neutral: 35.8   z = 12.0   p < 10⁻¹⁰
LSTM w/ ∆: 30.2   LSTM w/o ∆: 15.6                      Neutral: 54.2   z = 5.1    p < 10⁻⁶
DNN w/ ∆: 15.8    LSTM w/ ∆: 34.0                       Neutral: 50.2   z = −6.2   p < 10⁻⁹
DNN w/ ∆: 28.4    LSTM w/o ∆: 33.6                      Neutral: 38.0   z = −1.5   p = 0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
− Flexible (e.g. adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
− Learns a mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 37: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Speech parameter generation algorithm [9]

Introduce dynamic feature constraints:

ô = arg max_o p(o | q, λ)   subject to   o = Wc

If the state-output distribution is a single Gaussian:

p(o | q, λ) = N(o; μq, Σq)

Setting ∂ log N(Wc; μq, Σq) / ∂c = 0 gives:

Wᵀ Σq⁻¹ W c = Wᵀ Σq⁻¹ μq

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 29 of 79
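As a concrete toy illustration of the closed-form solution above, here is a minimal NumPy sketch for a 1-dimensional stream with diagonal covariances and a simple first-difference delta window; the function names and window coefficients are illustrative, not taken from any published toolkit:

```python
import numpy as np

def build_delta_window(T):
    """Stack the static (identity) and delta (first-difference) windows into W (2T x T)."""
    I = np.eye(T)
    D = np.zeros((T, T))
    for t in range(T):
        # simple delta: 0.5 * (c[t+1] - c[t-1]), truncated at the edges
        if t > 0:
            D[t, t - 1] = -0.5
        if t < T - 1:
            D[t, t + 1] = 0.5
    return np.vstack([I, D])

def mlpg(mu, var, T):
    """Solve W' Sigma^-1 W c = W' Sigma^-1 mu for the smooth static trajectory c.

    mu, var: stacked static+delta means and (diagonal) variances, length 2T.
    """
    W = build_delta_window(T)
    P = np.diag(1.0 / var)        # Sigma^-1, diagonal
    A = W.T @ P @ W
    b = W.T @ P @ mu
    return np.linalg.solve(A, b)
```

With very large delta variances (i.e., uninformative delta statistics) the solution collapses to the static means, which is a quick sanity check on the construction.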

Speech parameter generation algorithm [9]

[Figure: the band-structured linear system Wᵀ Σq⁻¹ W c = Wᵀ Σq⁻¹ μq. The window matrix W, whose rows hold the static coefficients (1, 0, …) and delta coefficients (−1, 1), relates the static feature sequence c1, c2, …, cT to the state means μq1, μq2, …, μqT]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

[Figure: generated static and dynamic speech parameter trajectories, showing the per-frame means and variances and the smooth static trajectory c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Diagram: Training part: spectral and excitation parameters are extracted from the SPEECH DATABASE and, together with labels, used to train context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral and excitation parameters → excitation generation + synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Diagram: generated excitation parameters (log F0 with V/UV) switch between a pulse train and white noise to form the excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); the synthesized speech is x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
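The source-filter reconstruction above can be sketched in a few lines of NumPy; the impulse response below is a toy stand-in for the spectral envelope, and the function name is illustrative:

```python
import numpy as np

def make_excitation(f0_hz, fs, length):
    """Pulse train when voiced (f0 > 0), white noise when unvoiced:
    the simple excitation scheme shown on the slide."""
    if f0_hz > 0:
        e = np.zeros(length)
        period = int(round(fs / f0_hz))
        e[::period] = 1.0
        return e
    return 0.1 * np.random.randn(length)

# toy impulse response standing in for the synthesis filter h(n)
h = np.array([1.0, 0.8, 0.5, 0.2])
e = make_excitation(100.0, 16000, 800)   # a 100 Hz voiced segment
x = np.convolve(h, e)                    # x(n) = h(n) * e(n)
```

A real system would instead run the generated cepstral or LSP parameters through one of the filters listed on the next slide (e.g., an MLSA filter).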

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Diagram: an average-voice model is built from the training speakers via adaptive training, then adapted to the target speakers]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires only a small amount of data from the target speaker/speaking style → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Diagram: representative HMM sets λ1–λ4 are interpolated into λ′ with interpolation ratios I(λ′, λ1)–I(λ′, λ4); λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
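One simple instantiation of the interpolation idea is to combine the Gaussian statistics of several HMM sets with normalized ratios; this is a minimal sketch, not the specific scheme of any of the cited papers:

```python
import numpy as np

def interpolate_models(means, variances, ratios):
    """Combine per-model Gaussian statistics with interpolation ratios
    I(lambda', lambda_k), normalized to sum to 1."""
    r = np.asarray(ratios, dtype=float)
    r = r / r.sum()
    mu = sum(rk * m for rk, m in zip(r, means))
    var = sum(rk * v for rk, v in zip(r, variances))
    return mu, var
```

Equal ratios over two voices yield the midpoint voice; eigenvoice, CAT, and multiple-regression approaches additionally estimate the representative models themselves from data.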

Outline

Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation: difficult to model a mix of V/UV sounds (e.g., voiced fricatives)
  [Diagram: pulse train (voiced) / white noise (unvoiced) switched to form the excitation e(n)]

• Spectral envelope extraction: harmonic effects often cause problems
  [Figure: power spectrum in dB over 0–8 kHz]

• Phase: important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic/stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics: statistics do not vary within an HMM state
• Conditional independence assumption: state output probability depends only on the current state
• Weak duration modeling: state duration probability decreases exponentially with time

None of these hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM − Polynomial segment model − Trajectory HMM
• Conditional independence assumption → graphical model
  − Buried Markov model − Autoregressive HMM − Trajectory HMM
• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectra, 0–8 kHz]

• Why?
  − Details of the spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum − LSP
• Nonparametric approach
  − Conditional parameter generation − Discrete HMM-based speech synthesis
• Combining multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
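The global-variance idea can be illustrated with a crude variance-rescaling sketch; actual GV-based generation optimizes a combined likelihood rather than rescaling directly, so the helper below is only a stand-in:

```python
import numpy as np

def match_global_variance(c, target_var):
    """Rescale a generated trajectory so its utterance-level (global) variance
    matches target_var, leaving the mean unchanged. A crude stand-in for
    GV-based oversmoothing compensation."""
    m = c.mean()
    return m + (c - m) * np.sqrt(target_var / c.var())
```

Restoring the natural level of intra-utterance variance sharpens the formant structure that the constrained generation smoothed away.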

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness
• Drawback
  − Quality
• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training: learn the relationship between linguistic & acoustic features
• Synthesis: map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
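A toy encoder shows the flavor of such linguistic input vectors: categorical contexts become binary (one-hot) features, and positional or count contexts become numeric features. The feature choice here is illustrative only; real front-ends emit around 50 feature types spanning phoneme to utterance level:

```python
import numpy as np

def encode_context(phone_id, n_phones, stress, pos_in_phrase):
    """Toy linguistic feature vector: one-hot phone identity (binary features)
    plus numeric features (stress flag, relative position in the phrase)."""
    onehot = np.zeros(n_phones)
    onehot[phone_id] = 1.0
    return np.concatenate([onehot, [float(stress), float(pos_in_phrase)]])
```

Vectors like this (one per frame or per phone state) are what the decision trees, or later the neural networks, consume as input.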


HMM-based acoustic modeling for SPSS [4]

[Figure: a binary decision tree of yes/no questions partitioning the acoustic space]

• Decision tree-clustered HMMs with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: a DNN with hidden layers h1–h3 mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x
• The DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
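The forward pass of such a network is simple to sketch in NumPy: sigmoid hidden layers and a linear output layer producing acoustic feature statistics. This is a minimal sketch with illustrative layer sizes, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_dnn(sizes):
    """Random weights for a feedforward net; `sizes` lists the layer widths."""
    return [(rng.normal(0.0, 0.01, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Sigmoid hidden layers, linear output: linguistic features in,
    acoustic feature statistics out."""
    h = x
    for W, b in params[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden units
    W, b = params[-1]
    return h @ W + b                              # linear output layer
```

Training would minimize the mean squared error between these outputs and the extracted acoustic features, frame by frame.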

Framework

[Diagram: TEXT → text analysis → input feature extraction → input features (binary & numeric, plus duration and frame-position features) at frames 1…T → DNN (input, hidden, and output layers) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame-level alignment]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrated feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations → feature extraction integrated into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in the cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500: natural speech vs. HMM (α=1) vs. DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α) | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256) | 45.7 | < 10⁻⁶ | −9.9
16.1 (4) | 27.2 (4 × 512) | 56.8 | < 10⁻⁶ | −5.1
12.7 (1) | 36.6 (4 × 1,024) | 50.7 | < 10⁻⁶ | −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples vs. the NN prediction in the (y1, y2) plane]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with the MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: a 1-dim, 2-mix MDN; the network's output units z1…z6 produce the GMM weights w1(x), w2(x), means μ1(x), μ2(x), and standard deviations σ1(x), σ2(x) over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: zj = Σ_{i=1..4} hi wij (over the last hidden layer's units)

w1(x) = exp(z1) / (exp(z1) + exp(z2))    w2(x) = exp(z2) / (exp(z1) + exp(z2))
μ1(x) = z3    μ2(x) = z4
σ1(x) = exp(z5)    σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
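The output-layer activations above translate directly into code. The sketch below, for a 1-dimensional target and n_mix components, splits a linear output vector into mixture parameters and evaluates the mixture negative log-likelihood that training would minimize (illustrative helper names):

```python
import numpy as np

def mdn_params(z, n_mix):
    """Split linear network outputs z into GMM weights (softmax),
    means (identity), and standard deviations (exp)."""
    logits = z[:n_mix]
    w = np.exp(logits - logits.max())
    w = w / w.sum()                      # softmax -> mixture weights
    mu = z[n_mix:2 * n_mix]              # linear -> means
    sigma = np.exp(z[2 * n_mix:3 * n_mix])  # exp -> positive std devs
    return w, mu, sigma

def mdn_nll(z, y, n_mix):
    """Negative log-likelihood of a scalar target y under the predicted mixture."""
    w, mu, sigma = mdn_params(z, n_mix)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return -np.log(comp.sum())
```

With all-zero outputs the layer predicts two equal-weight standard Gaussians, a convenient check on the three activation choices.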

DMDN-based SPSS [27]

[Diagram: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → DMDN outputs the mixture parameters w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) at each frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM: 1 mix: 3.537 ± 0.113; 2 mix: 3.397 ± 0.115
DNN: 4×1024: 3.635 ± 0.127; 5×1024: 3.681 ± 0.109; 6×1024: 3.652 ± 0.108; 7×1024: 3.637 ± 0.129
DMDN (4×1024): 1 mix: 3.654 ± 0.117; 2 mix: 3.796 ± 0.107; 4 mix: 3.766 ± 0.113; 8 mix: 3.805 ± 0.113; 16 mix: 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: an RNN unrolled in time; inputs x_{t−1}, x_t, x_{t+1} map to outputs y_{t−1}, y_t, y_{t+1} through hidden units with recurrent connections]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through the recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: an LSTM block; the memory cell c_t is updated through tanh input/output nonlinearities, with sigmoid input (write), forget (reset), and output (read) gates, each driven by x_t, h_{t−1}, and a bias]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
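One step of the vanilla LSTM cell sketched above can be written directly from the gate structure; the parameter dictionary layout is illustrative (one weight matrix and bias per gate, acting on the concatenated input and previous hidden state):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a vanilla LSTM: sigmoid input/forget/output gates
    (write/reset/read) around a linear memory cell."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(p['Wi'] @ z + p['bi'])          # input gate: write
    f = sigmoid(p['Wf'] @ z + p['bf'])          # forget gate: reset
    o = sigmoid(p['Wo'] @ z + p['bo'])          # output gate: read
    c = f * c_prev + i * np.tanh(p['Wc'] @ z + p['bc'])  # linear cell update
    h = o * np.tanh(c)                           # gated cell output
    return h, c
```

Because the cell update is additive and gated, gradients can flow across many time steps, which is what lets the LSTM capture the long-range contexts a basic RNN loses.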

LSTM-based SPSS [33 34]

[Diagram: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev-set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in the cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference (%):
DNN w/ ∆: 50.0 vs. DNN w/o ∆: 14.2; Neutral: 35.8; z = 12.0; p < 10⁻¹⁰
LSTM w/ ∆: 30.2 vs. LSTM w/o ∆: 15.6; Neutral: 54.2; z = 5.1; p < 10⁻⁶
DNN w/ ∆: 15.8 vs. LSTM w/ ∆: 34.0; Neutral: 50.2; z = −6.2; p < 10⁻⁹
DNN w/ ∆: 28.4 vs. LSTM w/o ∆: 33.6; Neutral: 38.0; z = −1.5; p = 0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background: HMM-based statistical parametric speech synthesis (SPSS); Flexibility; Improvements

Statistical parametric speech synthesis with neural networks: Deep neural network (DNN)-based SPSS; Deep mixture density network (DMDN)-based SPSS; Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

Page 38: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Speech parameter generation algorithm [9]

Wgt Σminus1q W c

=

Wgt Σminus1q microq

100 0 0

0

10

0

-11

10 0

-11

1

0

00

-11

00

00

1 0 0

0

10 0

-1 1

10

0

-1 1

1

0

0

0

-1 1

0 0

0

0

100 0 0

0

10

0

-11

10 0

-11

1

0

00

-11

00

00

microq1

microq2

microqT

c1

c2

cT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 30 of 79

Generated speech parameter trajectory

Sta

ticD

ynam

ic

Variance Mean c

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

[Figure: system overview.
Training part: SPEECH DATABASE → spectral & excitation parameter extraction → training HMMs with labels → context-dependent HMMs & state duration models.
Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → spectral & excitation parameters → excitation generation → synthesis filter → SYNTHESIZED SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: generated excitation parameters (log F0 with V/UV) switch between a pulse train and white noise to form the excitation e(n); generated spectral parameters (cepstrum, LSP) define a linear time-invariant system h(n); synthesized speech x(n) = h(n) ∗ e(n).]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
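The source-filter reconstruction x(n) = h(n) ∗ e(n) can be sketched directly. The impulse response below is a made-up decaying exponential standing in for a real synthesis filter (e.g. MLSA), and the F0 value is illustrative:

```python
import numpy as np

fs = 16000
n = np.arange(fs // 100)                 # 10 ms of samples

def excitation(f0, voiced, rng=np.random.default_rng(0)):
    """Pulse train at the pitch period when voiced, white noise otherwise."""
    if voiced:
        period = int(fs / f0)
        e = np.zeros_like(n, dtype=float)
        e[::period] = 1.0
        return e
    return rng.standard_normal(len(n))

h = 0.9 ** np.arange(64)                 # toy impulse response h(n)
e = excitation(200.0, voiced=True)
x = np.convolve(e, h)[:len(n)]           # x(n) = h(n) * e(n)
```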

Synthesis filter

• Cepstrum → LMA filter
• Generalized cepstrum → GLSA filter
• Mel-cepstrum → MLSA filter
• Mel-generalized cepstrum → MGLSA filter
• LSP → LSP filter
• PARCOR → all-pole lattice filter
• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79
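As an illustration of the last mapping (LPC → all-pole filter), a direct-form recursion x[n] = e[n] − Σₖ aₖ x[n−k], with made-up stable coefficients rather than coefficients from a real LPC analysis:

```python
import numpy as np

a = np.array([-0.9, 0.4])             # a1, a2 of A(z) = 1 + a1 z^-1 + a2 z^-2 (toy, stable)

def all_pole(e, a):
    """Filter excitation e through the all-pole synthesis filter 1/A(z)."""
    x = np.zeros(len(e))
    for n in range(len(e)):
        acc = e[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * x[n - k]  # feedback through past outputs
        x[n] = acc
    return x

imp = np.zeros(32)
imp[0] = 1.0
h = all_pole(imp, a)                  # impulse response of the all-pole filter
```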

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: adaptive training of an average-voice model over training speakers, then adaptation to target speakers.]

• Train average voice model (AVM) from training speakers using SAT
• Adapt AVM to target speakers
• Requires small data from target speaker/speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: interpolation among HMM sets λ1 … λ4 with ratios I(λ′, λk) to form a new set λ′.]

λ: HMM set, I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets
• Can obtain new voices w/o adaptation data
• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
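Per Gaussian, interpolating HMM sets reduces to combining their statistics with the ratios I(λ′, λk). A toy sketch with made-up means and (scalar stand-ins for diagonal) variances for three voices, using simple linear interpolation of the first two moments (cf. [14]):

```python
import numpy as np

means = np.array([[1.0, 0.0],     # voice 1 state mean (toy)
                  [3.0, 2.0],     # voice 2
                  [0.0, 4.0]])    # voice 3
covs = np.array([0.5, 1.0, 2.0])  # scalar stand-ins for diagonal covariances
ratios = np.array([0.5, 0.3, 0.2])  # interpolation ratios I(λ', λ_k), sum to 1

mu_new = ratios @ means           # interpolated mean
cov_new = ratios @ covs           # interpolated variance
```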

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse / noise excitation
  Difficult to model mix of V/UV sounds (e.g. voiced fricatives)
  [Figure: switched excitation e(n) — pulse train for voiced regions, white noise for unvoiced.]

• Spectral envelope extraction
  Harmonic effects often cause problems
  [Figure: power (dB) vs frequency (0–8 kHz) with harmonic ripple on the envelope.]

• Phase
  Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state
• Conditional independence assumption
  State output probability depends only on the current state
• Weak duration modeling
  State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
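The weak-duration point can be made concrete: with self-transition probability a, an HMM state stays d frames with P(d) = a^(d−1)·(1−a), a geometric distribution whose mode is always d = 1. A quick check with an illustrative a:

```python
a = 0.9                                          # illustrative self-transition prob.
durations = range(1, 101)
p = [a ** (d - 1) * (1 - a) for d in durations]  # geometric duration distribution

# mode is always d = 1; mean is 1 / (1 - a) = 10 frames
mean_dur = sum(d * pd for d, pd in zip(durations, p)) / sum(p)
```

Real phone durations peak well above one frame, which motivates explicit duration models (hidden semi-Markov models).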

Better acoustic modeling

• Piece-wise constant statistics → Dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM
• Conditional independence assumption → Graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM
• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: spectra (0–8 kHz) of natural vs generated speech; the generated spectrum loses fine detail.]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP
• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic → acoustic mapping

• Training
  Learn relationship between linguistic & acoustic features
• Synthesis
  Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision tree (yes/no context questions) clustering the acoustic space.]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1–h3 mapping linguistic features x to acoustic features y.]

• DNN represents conditional distribution of y given x
• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

[Figure: DNN-based TTS pipeline. TEXT → text analysis → input feature extraction (binary & numeric linguistic features, duration and frame-position features, for frames 1 … T) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH. Duration prediction supplies the frame alignment.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
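The per-frame mapping can be sketched as a plain feed-forward pass. Layer sizes and weights below are illustrative stand-ins (a real system trains them by backpropagation; the first experiments in this talk use sigmoid hidden units, ReLU is used here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [25, 512, 512, 512, 512, 127]   # input, 4 hidden, output (illustrative)
Ws = [rng.standard_normal((m, n)) * 0.01 for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Map one frame's linguistic feature vector to acoustic statistics."""
    h = x
    for W, b in list(zip(Ws, bs))[:-1]:
        h = np.maximum(0.0, h @ W + b)  # hidden layers (ReLU here)
    return h @ Ws[-1] + bs[-1]          # linear output: acoustic feature vector

y = forward(rng.standard_normal(25))    # one frame in, one frame out
```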

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → Integrates feature extraction into acoustic modeling
• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithm
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21]; MSD F0 [22]; MDL [23]
DNN architecture: 1–5 layers; 256/512/1024/2048 units/layer; sigmoid; continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512).]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM (α)     DNN (layers × units)   Neutral   p value   z value
  15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
  16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
  12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: scatter of data samples in (y1, y2); the NN prediction collapses a multimodal distribution onto its conditional mean.]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; output units parameterize mixture weights w1(x), w2(x), means μ1(x), μ2(x), and variances σ1(x), σ2(x).]

With output-layer activations z_j = Σ_{i=1}^{4} h_i w_ij:

• Weights → softmax activation function:
  w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
• Means → linear activation function:
  μ1(x) = z3,  μ2(x) = z4
• Variances → exponential activation function:
  σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
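The three activation choices keep the mixture valid: softmax weights sum to one and exponentiated variances stay positive. A sketch of the 1-dim, 2-mix output layer with made-up activation values z1…z6, plus the negative log-likelihood used for training:

```python
import numpy as np

z = np.array([0.2, -0.4, 1.0, 2.5, -1.0, 0.3])   # toy output activations z1..z6

w = np.exp(z[:2]) / np.exp(z[:2]).sum()           # weights: softmax of z1, z2
mu = z[2:4]                                       # means: linear, z3 and z4
sigma = np.exp(z[4:])                             # std devs: exp of z5, z6 (positive)

def nll(y):
    """Negative log-likelihood of target y under the predicted 2-mix Gaussian."""
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(comp.sum())
```

Minimizing nll over training pairs is what replaces the MSE loss of a plain DNN.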

DMDN-based SPSS [27]

[Figure: pipeline TEXT → text analysis → input feature extraction → duration prediction → DMDN, which outputs per-frame mixture parameters w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) for t = 1 … T → parameter generation → waveform synthesis → SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers; 1,024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers; 1,024 units/hidden layer; ReLU [28] (hidden), mixture density (output); 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM             1 mix    3.537 ± 0.113
                  2 mix    3.397 ± 0.115
  DNN             4×1024   3.635 ± 0.127
                  5×1024   3.681 ± 0.109
                  6×1024   3.652 ± 0.108
                  7×1024   3.637 ± 0.129
  DMDN (4×1024)   1 mix    3.654 ± 0.117
                  2 mix    3.796 ± 0.107
                  4 mix    3.766 ± 0.113
                  8 mix    3.805 ± 0.113
                  16 mix   3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − Fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN with recurrent connections in the hidden layer, unrolled over time: inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1}.]

• Only able to use previous contexts
  → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block. Input gate (write), output gate (read), and forget gate (reset), each a sigmoid of (xt, h_{t−1}, bias), control a linear memory cell ct; tanh nonlinearities on the block input and output produce ht.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
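One step of the standard LSTM cell [32] can be written out directly: three sigmoid gates (write/reset/read) around a linear memory cell. Weights below are random stand-ins, and the toy sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
nx, nh = 4, 8                                  # input / hidden sizes (toy)
Wx = rng.standard_normal((4 * nh, nx)) * 0.1   # stacked gate weights for x_t
Wh = rng.standard_normal((4 * nh, nh)) * 0.1   # stacked gate weights for h_{t-1}
b = np.zeros(4 * nh)

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev):
    z = Wx @ x + Wh @ h_prev + b
    i = sigm(z[0 * nh:1 * nh])                 # input gate (write)
    f = sigm(z[1 * nh:2 * nh])                 # forget gate (reset)
    o = sigm(z[2 * nh:3 * nh])                 # output gate (read)
    g = np.tanh(z[3 * nh:])                    # block input
    c = f * c_prev + i * g                     # linear memory cell
    h = o * np.tanh(c)                         # block output
    return h, c

h, c = lstm_step(rng.standard_normal(nx), np.zeros(nh), np.zeros(nh))
```

The additive cell update c = f·c_prev + i·g is what lets gradients survive over long time spans.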

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → LSTM maps per-frame inputs x1 … xT to acoustic outputs y1 … yT → parameter generation → waveform synthesis → SPEECH.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers; 1,024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer; 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

  DNN w/ Δ   DNN w/o Δ   LSTM w/ Δ   LSTM w/o Δ   Neutral    z      p
  50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
  –          –           30.2        15.6         54.2       5.1   < 10⁻⁶
  15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
  –          28.4        –           33.6         38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learn mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004. (in Japanese)

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014. (Submitted) http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 39: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Generated speech parameter trajectory

Sta

ticD

ynam

ic

Variance Mean c

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 31 of 79

HMM-based speech synthesis [4]

Training part

Synthesis part

Training HMMs

Context-dependent HMMs amp state duration models

Labels

Spectralparameters

Excitationparameters

TEXT

Labels

SYNTHESIZEDSPEECH

Speech signal

Excitation

Parameter generationfrom HMMs

Excitationgeneration

SynthesisFilter

Text analysis

Spectralparameterextraction

SPEECHDATABASE Excitation

parameterextraction

Spectralparameters

Excitationparameters

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

pulse train

white noise

synthesizedspeech

lineartime-invariant

systeme(n)

h(n) x(n) = h(n) lowast e(n)excitation

Generatedexcitation parameter(log F0 with VUV)

Generatedspectral parameter

(cepstrum LSP)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79

Synthesis filter

bull Cepstrum rarr LMA filter

bull Generalized cepstrum rarr GLSA filter

bull Mel-cepstrum rarr MLSA filter

bull Mel-generalized cepstrum rarr MGLSA filter

bull LSP rarr LSP filter

bull PARCOR rarr all-pole lattice filter

bull LPC rarr all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

bull Advantages

minus Flexibility to change voice characteristics

Adaptation Interpolation

minus Small footprint [10 11]minus Robustness [12]

bull Drawback

minus Quality

bull Major factors for quality degradation [3]

minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM)minus Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Adaptation (mimicking voice) [13]

Average-voice model

AdaptiveTraining

Adaptation

Training speakers Target speakers

bull Train average voice model (AVM) from training speakers using SAT

bull Adapt AVM to target speakers

bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

λ1

λ2

λ3

λ4

λprime

I(λprimeλ1)I(λprimeλ2)

I(λprimeλ3)I(λprimeλ4)

λ HMM setI(λprimeλ) Interpolation ratio

bull Interpolate representive HMM sets

bull Can obtain new voices wo adaptation data

bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Vocoding issues

bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)

Unvoiced Voiced

pulse train

white noise

e(n)

excitation

bull Spectral envelope extractionHarmonic effect often cause problem

0

40

80

0 2 4 6 8 [kHz]

Pow

er [d

B]

bull PhaseImportant but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

bull Mixed excitation linear prediction (MELP)

bull STRAIGHT

bull Multi-band excitation

bull Harmonic + noise model (HNM)

bull Harmonic stochastic model

bull LF model

bull Glottal waveform

bull Residual codebook

bull ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics: statistics do not vary within an HMM state

• Conditional independence assumption: state output probability depends only on the current state

• Weak duration modeling: state duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: natural vs. generated spectrograms, 0–8 kHz]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
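The generation step behind these slides solves a least-squares problem: with a window matrix W stacking static and delta constraints, the ML trajectory satisfies (WᵀΣ⁻¹W)c = WᵀΣ⁻¹μ [9]. A 1-dimensional toy sketch (the delta window definition, the interleaved layout, and the edge handling are simplifying assumptions, not the exact setup of the talk):

```python
import numpy as np

def mlpg(mean, var, T):
    """ML parameter generation for a 1-dim feature with a static + delta
    window (delta_t = 0.5 * (c[t+1] - c[t-1])); mean/var are interleaved
    [static_t, delta_t] statistics of length 2T.  Solves
    (W' Sigma^-1 W) c = W' Sigma^-1 mu for the static trajectory c."""
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                 # static window
        if t > 0:
            W[2 * t + 1, t - 1] = -0.5    # delta window (edges truncated)
        if t + 1 < T:
            W[2 * t + 1, t + 1] = 0.5
    P = np.diag(1.0 / np.asarray(var))    # diagonal Sigma^-1
    A = W.T @ P @ W
    b = W.T @ P @ np.asarray(mean)
    return np.linalg.solve(A, b)
```

When the delta variances are huge (deltas uninformative), the solution simply reproduces the static means; tightening the delta variances pulls the trajectory toward smoothness, which is exactly the oversmoothing trade-off described above.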

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation / eigenvoice / CAT / multiple regression
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training: learn relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g., phone identity, POS, stress, # of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
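As a concrete illustration of "binary & numeric" linguistic inputs, a toy encoding (the phone set and feature choice are invented for the example; real systems use the ~50 feature types mentioned above):

```python
# Toy encoding of linguistic features: categorical contexts become
# one-hot ("binary") features; counts and positions stay numeric.
# PHONES and the feature selection are illustrative assumptions.
PHONES = ["sil", "a", "b"]

def encode(phone, stress, syl_pos, n_syls):
    onehot = [1.0 if phone == p else 0.0 for p in PHONES]       # binary part
    return onehot + [float(stress), syl_pos / max(n_syls, 1)]   # numeric part

x = encode("a", stress=1, syl_pos=2, n_syls=4)
```

The resulting vector concatenates the one-hot phone identity with the numeric stress flag and a normalized within-word syllable position.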


HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision tree (yes/no context questions) clustering the acoustic space]

bull Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents conditional distribution of y given x

• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

[Figure: Input layer → hidden layers → output layer. TEXT → text analysis → input feature extraction → input features (binary & numeric linguistic features, duration feature, frame position feature) at frames 1…T → DNN → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame alignment]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
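The frame-by-frame mapping in the figure can be sketched as a plain feed-forward pass (toy sizes and random weights, purely illustrative; a real system trains the weights and feeds hundreds of inputs per frame):

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Frame-level mapping as in the figure: sigmoid hidden layers and a
    linear output producing acoustic feature statistics (toy sizes)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden layer
    return h @ weights[-1] + biases[-1]           # linear output layer

# Toy network: 5 linguistic inputs -> 8 -> 8 -> 3 acoustic outputs
rng = np.random.default_rng(0)
sizes = [5, 8, 8, 3]
Ws = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(sizes, sizes[1:])]
bs = [np.zeros(b) for b in sizes[1:]]
y = dnn_forward(rng.standard_normal(5), Ws, bs)
```

At synthesis time this pass is repeated for every frame t = 1…T, and the outputs are treated as the mean statistics fed to parameter generation.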

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations → integrated feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

(w/o grouping questions over numeric contexts; silence frames removed)

[Figure: 5-th mel-cepstrum over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar # of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)     DNN (layers × units)    Neutral    p value    z value
15.8 (16)   38.5 (4 × 256)          45.7       < 10⁻⁶     −9.9
16.1 (4)    27.2 (4 × 512)          56.8       < 10⁻⁶     −5.1
12.7 (1)    36.6 (4 × 1024)         50.7       < 10⁻⁶     −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: scatter of data samples in (y1, y2) with the NN prediction falling between the modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix mixture density network: the last hidden layer h feeds six linear units z1…z6 that become w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of activation functions: z_j = Σ_{i=1..4} h_i w_{ij}

w1(x) = exp(z1) / (exp(z1) + exp(z2))    w2(x) = exp(z2) / (exp(z1) + exp(z2))
μ1(x) = z3                               μ2(x) = z4
σ1(x) = exp(z5)                          σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
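The slide's activation choices can be sketched directly (a toy 1-dim, 2-mix output layer; the assumption that z holds the six linear outputs in the order weights, means, variances is for illustration):

```python
import numpy as np

def mdn_params(z):
    """Map the 6 linear outputs of a 1-dim, 2-mixture MDN to GMM
    parameters, as on the slide: softmax weights, linear means,
    exponential (always-positive) standard deviations."""
    zw, zmu, zsig = z[:2], z[2:4], z[4:6]
    w = np.exp(zw - zw.max())
    w /= w.sum()                      # softmax -> mixture weights
    return w, zmu, np.exp(zsig)      # weights, means, sigmas

w, mu, sigma = mdn_params(np.array([0.0, 0.0, -1.0, 2.0, 0.0, 0.0]))
```

Here equal weight logits give w = (0.5, 0.5), the means pass through unchanged, and zero log-variance logits give unit standard deviations, so a bimodal target can be represented instead of its conditional mean.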

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → linguistic features x1…xT → DMDN outputs w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) per frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:
  DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
  DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
  Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM    1 mix:   3.537 ± 0.113
       2 mix:   3.397 ± 0.115
DNN    4×1024:  3.635 ± 0.127
       5×1024:  3.681 ± 0.109
       6×1024:  3.652 ± 0.108
       7×1024:  3.637 ± 0.129
DMDN   (4×1024)
       1 mix:   3.654 ± 0.117
       2 mix:   3.796 ± 0.107
       4 mix:   3.766 ± 0.113
       8 mix:   3.805 ± 0.113
       16 mix:  3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features
  − Fixed number of preceding/succeeding contexts (e.g., ±2 phonemes / syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled in time: inputs xt−1, xt, xt+1 → outputs yt−1, yt, yt+1 through recurrent connections]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block: memory cell ct with tanh input/output squashing; input gate (write), forget gate (reset), and output gate (read), each a sigmoid fed by xt and ht−1 with biases bi, bf, bo; output ht]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
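A single LSTM step with the three gates in the write/reset/read roles shown in the block diagram (a standard formulation without peephole connections; shapes, the packed weight layout, and the toy sizes are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: input (write), forget (reset), and output (read)
    gates around a linear memory cell c, as in the slide's diagram."""
    n = h_prev.size
    z = np.concatenate([x, h_prev]) @ W + b   # all four blocks at once
    i = sigmoid(z[:n])             # input gate
    f = sigmoid(z[n:2 * n])        # forget gate
    o = sigmoid(z[2 * n:3 * n])    # output gate
    g = np.tanh(z[3 * n:])         # candidate cell input
    c = f * c_prev + i * g         # linear memory cell update
    h = o * np.tanh(c)             # gated read-out
    return h, c

# Toy sizes: 3 inputs, 2 memory cells
rng = np.random.default_rng(0)
W = rng.standard_normal((5, 8)) * 0.1
b = np.zeros(8)
h, c = lstm_step(np.ones(3), np.zeros(2), np.zeros(2), W, b)
```

Because the cell update is additive (f·c + i·g) rather than repeatedly squashed, gradients and stored information survive far longer than in the basic RNN of the previous slide.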

LSTM-based SPSS [33 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → linguistic features x1…xT → LSTM → acoustic features y1…yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Training / dev data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

DNN (w/ ∆)   DNN (w/o ∆)   LSTM (w/ ∆)   LSTM (w/o ∆)   Neutral   z      p
50.0         14.2          –             –              35.8      12.0   < 10⁻¹⁰
–            –             30.2          15.6           54.2      5.1    < 10⁻⁶
15.8         –             34.0          –              50.2      −6.2   < 10⁻⁹
28.4         –             –             33.6           38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learn mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural networks trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

Page 40: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

HMM-based speech synthesis [4]

[Figure: Training part: SPEECH DATABASE → spectral parameter extraction & excitation parameter extraction → spectral & excitation parameters + labels → training HMMs → context-dependent HMMs & state duration models. Synthesis part: TEXT → text analysis → labels → parameter generation from HMMs → excitation parameters → excitation generation; spectral parameters → synthesis filter → SYNTHESIZED SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 32 of 79

Waveform reconstruction

[Figure: excitation e(n) (pulse train for voiced, white noise for unvoiced) → linear time-invariant system h(n) → synthesized speech x(n) = h(n) ∗ e(n)]

Generated excitation parameters (log F0 with V/UV) drive the excitation; generated spectral parameters (cepstrum, LSP) define h(n).

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
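A toy sketch of this source-filter reconstruction (frame handling and the 3-tap filter are simplified assumptions; real systems use MLSA-type filters as listed on the next slide):

```python
import numpy as np

def excitation(f0, fs, n):
    """The slide's simple e(n): impulse train every fs/f0 samples when
    voiced, white noise when unvoiced (f0 <= 0)."""
    if f0 <= 0.0:
        return np.random.default_rng(0).standard_normal(n)
    e = np.zeros(n)
    e[::max(int(fs / f0), 1)] = 1.0   # one impulse per pitch period
    return e

def synthesize(e, h):
    """x(n) = h(n) * e(n): excitation through an LTI synthesis filter."""
    return np.convolve(e, h)[:len(e)]

# 10-ms voiced frame at 16 kHz, F0 = 100 Hz, toy 3-tap impulse response
x = synthesize(excitation(100.0, 16000, 160), np.array([1.0, 0.5, 0.25]))
```

Each pitch pulse is copied through the impulse response, so the output carries the spectral envelope of h(n) at the generated F0.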

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation
  − Small footprint [10, 11]
  − Robustness [12]

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM)
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Adaptation (mimicking voice) [13]

Average-voice model

AdaptiveTraining

Adaptation

Training speakers Target speakers

bull Train average voice model (AVM) from training speakers using SAT

bull Adapt AVM to target speakers

bull Requires small data from target speakerspeaking stylerarr Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

λ1

λ2

λ3

λ4

λprime

I(λprimeλ1)I(λprimeλ2)

I(λprimeλ3)I(λprimeλ4)

λ HMM setI(λprimeλ) Interpolation ratio

bull Interpolate representive HMM sets

bull Can obtain new voices wo adaptation data

bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Vocoding issues

bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)

Unvoiced Voiced

pulse train

white noise

e(n)

excitation

bull Spectral envelope extractionHarmonic effect often cause problem

0

40

80

0 2 4 6 8 [kHz]

Pow

er [d

B]

bull PhaseImportant but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

bull Mixed excitation linear prediction (MELP)

bull STRAIGHT

bull Multi-band excitation

bull Harmonic + noise model (HNM)

bull Harmonic stochastic model

bull LF model

bull Glottal waveform

bull Residual codebook

bull ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state

bull Conditional independence assumptionState output probability depends only on the current state

bull Weak duration modelingState duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

bull Piece-wise constatnt statistics rarr Dynamical model

minus Trended HMMminus Polynomial segment modelminus Trajectory HMM

bull Conditional independence assumption rarr Graphical model

minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM

bull Weak duration modeling rarr Explicit duration model

minus Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

bull Speech parameter generation algorithm

minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled

Nat

ural

0 4 8Frequency (kHz)

Gen

erat

ed

0 4 8Frequency (kHz)

bull Why

minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

bull Postfiltering

minus Mel-cepstrumminus LSP

bull Nonparametric approach

minus Conditional parameter generationminus Discrete HMM-based speech synthesis

bull Combine multiple-level statistics

minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

bull Advantages

minus Flexibility to change voice characteristics

Adaptation Interpolation eigenvoice CAT multiple regression

minus Small footprintminus Robustness

bull Drawback

minus Quality

bull Major factors for quality degradation [3]

minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees over the acoustic space, splitting with yes/no context questions]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network mapping linguistic features x through hidden layers h1–h3 to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
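To make the frame-level mapping concrete, here is a minimal sketch of such a regression network (all dimensions and parameters are hypothetical toy values, not the ones from the experiments): sigmoid hidden layers and a linear output layer, as in the experimental setup later in the deck.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dnn_forward(x, weights, biases):
    """Frame-level linguistic-to-acoustic mapping: sigmoid hidden layers,
    linear output layer producing the acoustic feature vector y."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = sigmoid(h @ W + b)               # hidden layers h1, h2, ...
    return h @ weights[-1] + biases[-1]      # linear output = acoustic features y

# Toy dimensions (hypothetical): 30 linguistic inputs, 4 x 512 hidden, 127 acoustic outputs
rng = np.random.default_rng(0)
dims = [30, 512, 512, 512, 512, 127]
weights = [rng.standard_normal((i, o)) * 0.01 for i, o in zip(dims[:-1], dims[1:])]
biases = [np.zeros(o) for o in dims[1:]]

x = rng.standard_normal(30)    # one frame of linguistic features
y = dnn_forward(x, weights, biases)
assert y.shape == (127,)
```

In the real systems such a network is trained with an MSE loss against vocoder-extracted acoustic features; the sketch shows only the forward pass.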

Framework

[Figure: DNN-based TTS pipeline — TEXT → text analysis → input feature extraction → input features (binary & numeric) at frames 1…T → DNN (input layer, hidden layers, output layer) → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; a duration prediction module supplies duration and frame-position features]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
  → Integrates feature extraction into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than a fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker

Training / test data: 33,000 & 173 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: 11 categorical features, 25 numeric features

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²

HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]

DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

(w/o grouping questions; numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4 × 512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

HMM (α)        DNN (layers × units)   Neutral   p value   z value
15.8 (16)      38.5 (4 × 256)         45.7      < 10⁻⁶    −9.9
16.1 (4)       27.2 (4 × 512)         56.8      < 10⁻⁶    −5.1
12.7 (1)       36.6 (4 × 1,024)       50.7      < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Limitations of DNN-based acoustic modeling

[Figure: bimodal data samples in (y1, y2) space with the NN prediction falling between the modes]

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− NN trained by MSE loss → approximates the conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− Parameter generation algorithm utilizes variances

Linear output layer → Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
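The unimodality point — that an MSE-trained network approximates the conditional mean — can be illustrated with a toy calculation (no network needed, just the optimal-constant argument): for bimodal targets, the MSE-optimal point prediction is the mean, which sits between the modes and matches neither.

```python
import numpy as np

# Bimodal targets for a single input: a speaker realizes the same text two ways.
y = np.concatenate([np.full(500, -1.0), np.full(500, 1.0)])

# The prediction minimizing mean squared error is the mean of the targets ...
best = y.mean()

# ... which lies between the two modes, matching neither of them.
mse_at_mean = np.mean((y - best) ** 2)
mse_at_mode = np.mean((y - 1.0) ** 2)
assert abs(best) < 1e-9            # prediction sits at 0, between -1 and +1
assert mse_at_mean < mse_at_mode   # yet it is MSE-optimal
```

This is exactly the failure mode the mixture density output layer is introduced to fix: model the full conditional distribution instead of its mean.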


Mixture density network [26]

[Figure: network whose output layer emits w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x), defining a mixture density over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

1-dim, 2-mix MDN; inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)    w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
μ_1(x) = z_3                                μ_2(x) = z_4
σ_1(x) = exp(z_5)                           σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
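The activation functions above can be sketched directly. This is a minimal implementation of the 1-dim, 2-mix case from the slide (the logit values are arbitrary toy numbers): softmax for the weights, identity for the means, exponential for the standard deviations, plus the resulting mixture density.

```python
import numpy as np

def mdn_params(z):
    """Map the six output-layer activations z1..z6 of a 1-dim, 2-mix MDN to
    mixture weights, means, and standard deviations:
    w_k via softmax over (z1, z2); mu_k = z3, z4; sigma_k = exp(z5), exp(z6)."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z[:2] - z[:2].max())      # numerically stable softmax
    w = e / e.sum()                      # mixture weights sum to 1
    mu = z[2:4]                          # linear activation for the means
    sigma = np.exp(z[4:6])               # exponential keeps variances positive
    return w, mu, sigma

def mdn_density(y, w, mu, sigma):
    """p(y|x) = sum_k w_k N(y; mu_k, sigma_k^2)."""
    norm = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(np.sum(w * norm))

w, mu, sigma = mdn_params([0.2, -0.1, 1.0, -1.0, 0.0, 0.5])
assert np.isclose(w.sum(), 1.0) and np.all(sigma > 0)
p = mdn_density(0.0, w, mu, sigma)
assert p > 0.0
```

Training maximizes the log of this density over the data, so the network can place probability mass on several modes rather than averaging them.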

DMDN-based SPSS [27]

[Figure: DMDN-based TTS pipeline — TEXT → text analysis → input feature extraction → duration prediction → DMDN emitting w_k(x_t), μ_k(x_t), σ_k(x_t) for each frame x_1 … x_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1,024 units/hidden layer; ReLU (hidden), linear (output)

DMDN architecture: 4 hidden layers, 1,024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix

Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM              1 mix: 3.537 ± 0.113;  2 mix: 3.397 ± 0.115

DNN              4 × 1,024: 3.635 ± 0.127;  5 × 1,024: 3.681 ± 0.109;  6 × 1,024: 3.652 ± 0.108;  7 × 1,024: 3.637 ± 0.129

DMDN (4 × 1,024) 1 mix: 3.654 ± 0.117;  2 mix: 3.796 ± 0.107;  4 mix: 3.766 ± 0.113;  8 mix: 3.805 ± 0.113;  16 mix: 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features

− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) are used as inputs
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1} mapped to outputs y_{t−1}, y_t, y_{t+1} through recurrent connections]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in hidden layers loops through recurrent connections → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell c_t with input gate (write), forget gate (reset), and output gate (read); gates take x_t and h_{t−1} through sigmoid units, cell input/output through tanh, producing h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
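The gate structure in the figure can be written out as a single-step sketch of the standard LSTM equations (this is the common peephole-free formulation; the toy dimensions and random weights are illustrative only):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One step of a standard LSTM (no peepholes): all gates read [x_t, h_{t-1}];
    the forget gate resets the cell, the input gate writes, the output gate reads."""
    z = np.concatenate([x_t, h_prev]) @ W + b        # all four blocks in one matmul
    n = h_prev.size
    i = sigmoid(z[0 * n:1 * n])                      # input gate (write)
    f = sigmoid(z[1 * n:2 * n])                      # forget gate (reset)
    o = sigmoid(z[2 * n:3 * n])                      # output gate (read)
    g = np.tanh(z[3 * n:4 * n])                      # candidate cell input
    c_t = f * c_prev + i * g                         # linear memory cell
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hid = 8, 4
W = rng.standard_normal((n_in + n_hid, 4 * n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h = np.zeros(n_hid); c = np.zeros(n_hid)
for t in range(5):                                   # run a short input sequence
    h, c = lstm_step(rng.standard_normal(n_in), h, c, W, b)
assert h.shape == (n_hid,) and np.all(np.abs(h) < 1.0)
```

Because the cell update is additive (f·c + i·g) rather than a repeated squashing, gradients survive over many more time steps than in the basic RNN.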

LSTM-based SPSS [33 34]

[Figure: LSTM-based TTS pipeline — TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping x_1, x_2, …, x_T to y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker

Train / dev set data: 34,632 & 100 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: DNN 449, LSTM 289

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)

DNN: 4 hidden layers, 1,024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU

LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g., adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learns a mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Waveform reconstruction

[Figure: source-filter waveform reconstruction — pulse train / white noise excitation e(n), controlled by the generated excitation parameters (log F0 with V/UV), drives a linear time-invariant system h(n) built from the generated spectral parameters (cepstrum, LSP), producing synthesized speech x(n) = h(n) ∗ e(n)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 33 of 79
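The block diagram above reduces to a convolution, x(n) = h(n) ∗ e(n). A minimal numeric sketch (the F0 value and the toy impulse response are illustrative, not from the slides): build a pulse-train excitation for a voiced segment and convolve it with a short impulse response standing in for the synthesis filter.

```python
import numpy as np

fs = 16000                 # sampling rate from the experimental setup (16 kHz)
f0 = 200.0                 # hypothetical F0 for a voiced segment
n = int(0.025 * fs)        # one 25-ms analysis window (400 samples)

# Excitation e(n): pulse train at the pitch period for voiced speech
# (white noise, e.g. rng.standard_normal(n), would replace it when unvoiced).
period = int(round(fs / f0))
e = np.zeros(n)
e[::period] = 1.0

# Toy linear time-invariant system h(n) standing in for the spectral envelope
# (a real system derives it from cepstrum/LSP via e.g. an MLSA filter).
h = 0.95 ** np.arange(64)

# x(n) = h(n) * e(n)  (discrete convolution, truncated to the window)
x = np.convolve(e, h)[:n]
assert x.shape == (n,)
assert np.isclose(x[0], 1.0)     # the first pulse passes through h(0) = 1
```

Swapping the pulse train for noise in unvoiced regions is exactly the simple excitation scheme the later "Vocoding issues" slide criticizes.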

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics: adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Adaptation (mimicking voice) [13]

[Figure: average-voice model built from training speakers via adaptive training, then adapted to target speakers]

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
  → small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Figure: target model λ′ placed among representative HMM sets λ1–λ4 with interpolation ratios I(λ′, λ1) … I(λ′, λ4)]

λ: HMM set; I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
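One simple instance of this idea is interpolating the corresponding Gaussian mean vectors of the representative HMM sets with the ratios I(λ′, λ). The sketch below shows only mean interpolation with toy vectors (covariances can be interpolated similarly; the actual schemes in [14]–[17] differ in detail):

```python
import numpy as np

def interpolate_means(means, ratios):
    """Interpolate corresponding Gaussian mean vectors of representative
    HMM sets lambda_1..lambda_K with ratios I(lambda', lambda_k):
        mu' = sum_k I(lambda', lambda_k) * mu_k
    (one simple interpolation scheme; not the only one used in practice)."""
    ratios = np.asarray(ratios, dtype=float)
    assert np.isclose(ratios.sum(), 1.0), "interpolation ratios should sum to 1"
    return np.tensordot(ratios, np.asarray(means), axes=1)

# Four representative voices, one 3-dim mean vector each (toy numbers).
means = [np.array([1.0, 0.0, 0.0]),
         np.array([0.0, 1.0, 0.0]),
         np.array([0.0, 0.0, 1.0]),
         np.array([1.0, 1.0, 1.0])]
mu_new = interpolate_means(means, [0.25, 0.25, 0.25, 0.25])
assert np.allclose(mu_new, [0.5, 0.5, 0.5])
```

Moving the ratios continuously moves λ′ around the space spanned by the representative voices, which is why no adaptation data is needed.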

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)

[Figure: excitation e(n) switching between white noise (unvoiced) and pulse train (voiced)]

• Spectral envelope extraction
  The harmonic effect often causes problems

[Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple]

• Phase
  Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic / stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model

− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model

− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model

− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: running spectra (0–8 kHz) of natural vs. generated speech]

• Why?

− Details of the spectral (formant) structure disappear
− Use of a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
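The parameter generation algorithm [9] behind this smoothing can be sketched in one dimension: given per-frame means and variances for the static feature and its delta, the ML trajectory solves the linear system (WᵀΣ⁻¹W)c = WᵀΣ⁻¹μ. This is a simplified illustration (delta defined as (c_{t+1} − c_{t−1})/2, toy variances, no ∆² term), not the full algorithm.

```python
import numpy as np

def mlpg_1d(mu_static, mu_delta, var_static, var_delta):
    """1-dim ML parameter generation sketch: find the static trajectory c
    maximizing the Gaussian likelihood of [c; delta c], where
    delta c_t = (c_{t+1} - c_{t-1}) / 2. Solves (W' S^-1 W) c = W' S^-1 mu."""
    T = len(mu_static)
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[t, t] = 1.0                      # static rows: c_t itself
        if 0 < t < T - 1:                  # delta rows (zero at the edges)
            W[T + t, t - 1] = -0.5
            W[T + t, t + 1] = 0.5
    mu = np.concatenate([mu_static, mu_delta])
    prec = 1.0 / np.concatenate([var_static, var_delta])
    A = W.T @ (prec[:, None] * W)          # W' S^-1 W (positive definite here)
    bvec = W.T @ (prec * mu)               # W' S^-1 mu
    return np.linalg.solve(A, bvec)

T = 50
mu_s = np.where(np.arange(T) < T // 2, 0.0, 1.0)   # a step in the static means
c = mlpg_1d(mu_s, np.zeros(T), np.full(T, 0.1), np.full(T, 0.01))
assert c.shape == (T,)
# The delta constraints smooth the step: adjacent-frame jumps stay below
# the unit step present in the means themselves.
assert np.max(np.abs(np.diff(c))) < 1.0
```

The same mechanism that removes frame-to-frame discontinuities also flattens formant detail, which is the oversmoothing the next slide's compensation techniques target.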

Oversmoothing compensation

• Postfiltering

− Mel-cepstrum
− LSP

• Nonparametric approach

− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics

− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → Neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

HMM-based acoustic modeling for SPSS [4]

yes noyes no

yes no

yes no yes no

Acoustic space

bull Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

Acoustic

features y

Linguistic

features x

h1

h2

h3

bull DNN represents conditional distribution of y given x

bull DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

Input layer Output layerHidden layers

TEXT

SPEECH Parametergeneration

Waveformsynthesis

Inpu

t fea

ture

s in

clud

ing

bina

ry amp

num

eric

feat

ures

at f

ram

e 1

Inpu

t fea

ture

s in

clud

ing

bina

ry amp

num

eric

fe

atur

es a

t fra

me T

Textanalysis

Input featureextraction

Statistics (m

ean amp var) of speech param

eter vector sequence

Binaryfeatures

NumericfeaturesDuration

featureFrame position

feature

Spectralfeatures

ExcitationfeaturesVUVfeature

Durationprediction

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Framework

Is this new no

bull NN [19]

bull RNN [20]

Whatrsquos the difference

bull More layers data computational resources

bull Better learning algorithm

bull Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79

Framework

Is this new no

bull NN [19]

bull RNN [20]

Whatrsquos the difference

bull More layers data computational resources

bull Better learning algorithm

bull Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79

Experimental setup

Database US English female speaker

Training test data 33000 amp 173 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic 11 categorical featuresfeatures 25 numeric features

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2

HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]

DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

wo grouping questions numeric contexts silence frames removed

-1

0

1

0 100 200 300 400 500

5-t

h M

el-ce

pstr

um

Frame

Natural speechHMM (α=1)DNN (4x512)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)      DNN (layers × units)   Neutral   p value    z value
15.8 (16)    38.5 (4 × 256)         45.7      < 10⁻⁶     -9.9
16.1 (4)     27.2 (4 × 512)         56.8      < 10⁻⁶     -5.1
12.7 (1)     36.6 (4 × 1024)        50.7      < 10⁻⁶     -11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples spread over two modes in (y1, y2); the NN prediction lies between the modes]

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− NN trained by MSE loss → approximates conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− Parameter generation algorithm utilizes variances

Linear output layer → Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
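The unimodality point can be made concrete with a tiny numerical sketch (illustrative, not from the slides): when the same input can map to two target modes at ±1, the minimum-MSE predictor is the conditional mean, which lies between the modes and matches neither.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-to-many data: for the same input, targets come from two modes (around +1 or -1)
y = rng.choice([-1.0, 1.0], size=(1000, 1)) + 0.05 * rng.normal(size=(1000, 1))

# The minimum-MSE constant predictor is the conditional mean E[y|x],
# which falls near 0 -- between the modes, matching neither of them.
y_hat = y.mean()
```

This is exactly why the slide proposes replacing the linear output layer with a mixture density output layer.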

Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the output layer produces w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x), defining a mixture density over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the output-layer activation functions: z_j = Σ_{i=1..4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1..2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1..2} exp(z_m)
μ1(x) = z3    μ2(x) = z4
σ1(x) = exp(z5)    σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
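The output-layer mapping above is easy to sketch directly. A minimal numpy version of the 1-dim, 2-mix case (function names are illustrative, not from [26]):

```python
import numpy as np

def mdn_params(z):
    """Map the 6 linear outputs of a 1-dim, 2-mix MDN to GMM parameters:
    softmax for weights, identity for means, exp for standard deviations."""
    zw, mu, ls = z[:2], z[2:4], z[4:6]
    w = np.exp(zw - zw.max())        # numerically stable softmax
    w /= w.sum()
    sigma = np.exp(ls)               # exp keeps variances positive
    return w, mu, sigma

def mdn_density(y, w, mu, sigma):
    # p(y|x) = sum_m w_m N(y; mu_m, sigma_m^2)
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return float(np.sum(w * comp))

# Equal weights, modes at -1 and +1, unit variances
w, mu, sigma = mdn_params(np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0]))
p = mdn_density(0.0, w, mu, sigma)
```

Unlike a plain linear output layer, this parameterization can represent both modes of a one-to-many mapping and per-frame variances.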

DMDN-based SPSS [27]

[Figure: DMDN-based SPSS pipeline — TEXT → text analysis → input feature extraction (x1, x2, …, xT) → DMDN outputs per-frame GMM parameters w_m(x_t), μ_m(x_t), σ_m(x_t) → duration prediction, parameter generation, waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)

DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix

Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM            1 mix     3.537 ± 0.113
               2 mix     3.397 ± 0.115
DNN            4×1024    3.635 ± 0.127
               5×1024    3.681 ± 0.109
               6×1024    3.652 ± 0.108
               7×1024    3.637 ± 0.129
DMDN (4×1024)  1 mix     3.654 ± 0.117
               2 mix     3.796 ± 0.107
               4 mix     3.766 ± 0.113
               8 mix     3.805 ± 0.113
               16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features

− Fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) are used as inputs
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: basic RNN — inputs x_{t−1}, x_t, x_{t+1} mapped to outputs y_{t−1}, y_t, y_{t+1} through a hidden layer with recurrent connections]

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from inputs
→ long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
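The recurrence above can be sketched in a few lines. A minimal Elman-style RNN forward pass in numpy (weights and dimensions are arbitrary placeholders, just to show that context propagates through the hidden state):

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h, W_hy, b_y):
    """Minimal unidirectional RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h),
    y_t = W_hy h_t + b_y. Each output depends on all *previous* inputs."""
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        ys.append(W_hy @ h + b_y)
    return np.array(ys)

rng = np.random.default_rng(1)
W_xh, W_hh = rng.normal(size=(4, 2)) * 0.5, rng.normal(size=(4, 4)) * 0.5
W_hy, b_h, b_y = rng.normal(size=(1, 4)) * 0.5, np.zeros(4), np.zeros(1)

x_seq = rng.normal(size=(5, 2))
y = rnn_forward(x_seq, W_xh, W_hh, b_h, W_hy, b_y)

# Perturbing the *first* input changes the *last* output:
# the recurrent connections carry context forward in time.
x_seq2 = x_seq.copy()
x_seq2[0] += 1.0
y2 = rnn_forward(x_seq2, W_xh, W_hh, b_h, W_hy, b_y)
```

Note this plain recurrence also shows the weakness named on the slide: the influence of early inputs shrinks with each tanh/matrix step, which is what motivates the LSTM cell.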

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell c_t updated through an input gate (write), forget gate (reset), and output gate (read); sigmoid gate activations and tanh squashing over inputs x_t and h_{t−1}, with biases b_i, b_f, b_o, b_c]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
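The gating in the figure can be written out directly. A hedged numpy sketch of a single LSTM step without peephole connections (the weight packing below is one common convention, not necessarily the exact formulation in [32]):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4n, d+n); the four row blocks are the
    input gate, forget gate, output gate, and candidate cell input."""
    n = h_prev.size
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[:n])          # input gate: write
    f = sigmoid(z[n:2 * n])     # forget gate: reset
    o = sigmoid(z[2 * n:3 * n]) # output gate: read
    g = np.tanh(z[3 * n:])      # candidate cell input
    c = f * c_prev + i * g      # linear memory cell update
    h = o * np.tanh(c)          # exposed hidden state
    return h, c

rng = np.random.default_rng(2)
d, n = 3, 4
W, b = rng.normal(size=(4 * n, d + n)) * 0.1, np.zeros(4 * n)

h, c = np.zeros(n), np.zeros(n)
for x in rng.normal(size=(10, d)):  # run the cell over a short sequence
    h, c = lstm_step(x, h, c, W, b)
```

The key design choice is the *linear* cell update `c = f * c_prev + i * g`: unlike the plain tanh recurrence, information can persist across many steps unless the forget gate resets it.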

LSTM-based SPSS [33 34]

[Figure: LSTM-based SPSS pipeline — TEXT → text analysis → input feature extraction (x1, x2, …, xT) → LSTM maps each x_t to y_t → duration prediction, parameter generation, waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English, female speaker

Train / dev set data: 34,632 & 100 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: DNN: 449, LSTM: 289

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)

DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU

LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

DNN                  LSTM                 Stats
w/ ∆     w/o ∆       w/ ∆     w/o ∆       Neutral   z      p
50.0     14.2        –        –           35.8      12.0   < 10⁻¹⁰
–        –           30.2     15.6        54.2      5.1    < 10⁻⁶
15.8     –           34.0     –           50.2      -6.2   < 10⁻⁹
28.4     –           –        33.6        38.0      -1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g. adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learns mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

Page 42: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Synthesis filter

• Cepstrum → LMA filter

• Generalized cepstrum → GLSA filter

• Mel-cepstrum → MLSA filter

• Mel-generalized cepstrum → MGLSA filter

• LSP → LSP filter

• PARCOR → all-pole lattice filter

• LPC → all-pole filter

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 34 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics

Adaptation, interpolation

− Small footprint [10, 11]
− Robustness [12]

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

[Figure: average-voice model built from training speakers via adaptive training, then adapted to target speakers]

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
→ small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Figure: interpolation among HMM sets λ1–λ4 yielding λ′; I(λ′, λ) denotes the interpolation ratio]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
Difficult to model mix of V/UV sounds (e.g. voiced fricatives)

[Figure: excitation e(n) switching between pulse train (voiced) and white noise (unvoiced)]

• Spectral envelope extraction
Harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz showing harmonic structure]

• Phase
Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic/stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
Statistics do not vary within an HMM state

• Conditional independence assumption
State output probability depends only on the current state

• Weak duration modeling
State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → Dynamical model

− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → Graphical model

− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → Explicit duration model

− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: spectrograms (0–8 kHz) of natural vs. generated speech]

• Why?

− Details of spectral (formant) structure disappear
− Use of better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
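The smoothing comes from the speech parameter generation algorithm [9], which solves for the static trajectory that best matches the model's static and dynamic statistics. A minimal 1-dimensional numpy sketch with a simple first-order delta (illustrative only, not the exact delta window used in the experiments):

```python
import numpy as np

def mlpg(mu_static, var_static, mu_delta, var_delta):
    """Maximum-likelihood parameter generation for one feature stream,
    with delta_t = c_t - c_{t-1}. Solves (W' S^-1 W) c = W' S^-1 mu
    for the smooth static trajectory c."""
    T = len(mu_static)
    W = np.zeros((2 * T, T))
    W[:T, :] = np.eye(T)                 # static rows: o_t = c_t
    for t in range(T):                   # delta rows: o_{T+t} = c_t - c_{t-1}
        W[T + t, t] = 1.0
        if t > 0:
            W[T + t, t - 1] = -1.0
    mu = np.concatenate([mu_static, mu_delta])
    prec = 1.0 / np.concatenate([var_static, var_delta])  # diagonal S^-1
    A = (W.T * prec) @ W
    b = (W.T * prec) @ mu
    return np.linalg.solve(A, b)

# Step-shaped static means; near-zero delta means impose smoothness,
# so the generated trajectory blurs the sharp transition.
mu_s = np.array([0.0] * 5 + [1.0] * 5)
c = mlpg(mu_s, np.full(10, 0.1), np.zeros(10), np.full(10, 0.1))
```

The same mechanism that removes frame-to-frame noise also blurs sharp formant movements, which is the oversmoothing discussed on this slide.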

Oversmoothing compensation

• Postfiltering

− Mel-cepstrum
− LSP

• Nonparametric approach

− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics

− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics

Adaptation, interpolation (eigenvoice, CAT, multiple regression)

− Small footprint
− Robustness

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → Neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training
Learn relationship between linguistic & acoustic features

• Synthesis
Map linguistic features to acoustic ones

• Linguistic features used in SPSS

− Phoneme, syllable, word, phrase, utterance-level features
− e.g. phone identity, POS, stress of words in a phrase
− Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
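As a toy illustration of what such inputs look like (feature names and dimensions below are hypothetical), categorical contexts become binary one-hot features while counts and positions become numeric features:

```python
import numpy as np

# Hypothetical miniature phone inventory; real systems use full phone sets
# plus dozens of other categorical and numeric context types.
PHONES = ["sil", "a", "b", "c"]

def encode(phone, stress, syl_pos, num_syls):
    """Concatenate binary (one-hot) and numeric linguistic features
    into a single input vector for the acoustic model."""
    onehot = np.zeros(len(PHONES))
    onehot[PHONES.index(phone)] = 1.0                 # binary: phone identity
    numeric = np.array([stress, syl_pos, num_syls], dtype=float)  # numeric
    return np.concatenate([onehot, numeric])

x = encode("a", stress=1, syl_pos=2, num_syls=3)
```

With ~50 context types per frame, effective modeling of this high-dimensional, highly correlated input is exactly the challenge the slide names.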


HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree of yes/no context questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h1–h3 mapping linguistic features x to acoustic features y]

• DNN represents conditional distribution of y given x

• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

[Figure: DNN-based SPSS framework — TEXT → text analysis → input feature extraction (binary & numeric linguistic features, duration feature, frame position feature, for frames 1…T) → DNN (input, hidden, output layers) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → duration prediction, parameter generation, waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79


Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 43: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics: adaptation, interpolation
− Small footprint [10, 11]
− Robustness [12]

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM)
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 35 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Adaptation (mimicking voice) [13]

(Figure: average-voice model built from training speakers via adaptive training, then adapted to target speakers)

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires only a small amount of data from the target speaker / speaking style
→ small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

(Figure: target HMM set λ′ obtained by interpolating among HMM sets λ1…λ4 with ratios I(λ′, λk))

λ: HMM set, I(λ′, λ): interpolation ratio

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
→ estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79
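The interpolation idea above — mixing representative voice models with ratios I(λ′, λk) — can be sketched with a minimal example. This is an illustrative parameter-level interpolation of Gaussian means and variances, not the deck's exact formulation (several interpolation definitions exist); all names are hypothetical.

```python
import numpy as np

def interpolate_gaussians(means, variances, ratios):
    """Linearly interpolate Gaussian parameters of K representative model
    sets with interpolation ratios summing to one (a simple scheme;
    other definitions interpolate distributions rather than parameters)."""
    ratios = np.asarray(ratios, dtype=float)
    assert np.isclose(ratios.sum(), 1.0)
    mean = sum(r * np.asarray(m, dtype=float) for r, m in zip(ratios, means))
    var = sum(r * np.asarray(v, dtype=float) for r, v in zip(ratios, variances))
    return mean, var

# Mixing two "voices" 50/50 yields parameters halfway between them.
m, v = interpolate_gaussians([np.array([0.0, 2.0]), np.array([4.0, 6.0])],
                             [np.array([1.0, 1.0]), np.array([3.0, 1.0])],
                             [0.5, 0.5])
```

With ratios that are not equal, the synthesized voice moves continuously between the representative voices — which is how new voices are obtained without adaptation data.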

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Vocoding issues

• Simple pulse/noise excitation
Difficult to model mixes of voiced/unvoiced sounds (e.g. voiced fricatives)

(Figure: excitation e(n) switching between a pulse train in voiced regions and white noise in unvoiced regions)

• Spectral envelope extraction
Harmonic effects often cause problems

(Figure: power spectrum [dB] over 0–8 kHz)

• Phase
Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
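The simple pulse/noise excitation criticized above can be sketched in a few lines. This is an illustrative toy (frame length, sampling rate, and function names are assumptions, not from the slides); the hard per-frame voiced/unvoiced switch it implements is exactly what makes mixed sounds such as voiced fricatives hard to model.

```python
import numpy as np

def simple_excitation(f0_per_frame, frame_len=80, fs=16000, seed=0):
    """Classic pulse/noise excitation: a pulse train at F0 in voiced
    frames (f0 > 0), white noise in unvoiced frames (f0 == 0)."""
    rng = np.random.default_rng(seed)
    e = np.zeros(len(f0_per_frame) * frame_len)
    next_pulse = 0.0
    for i, f0 in enumerate(f0_per_frame):
        start = i * frame_len
        if f0 > 0:
            period = fs / f0  # pitch period in samples
            while next_pulse < start + frame_len:
                if next_pulse >= start:
                    e[int(next_pulse)] = 1.0
                next_pulse += period
        else:
            e[start:start + frame_len] = rng.standard_normal(frame_len)
            next_pulse = start + frame_len
    return e

# One voiced frame at 100 Hz followed by one unvoiced frame.
e = simple_excitation([100.0, 0.0])
```

Every sample is either pulse or noise — there is no frame where both coexist, which is the modeling gap that mixed-excitation vocoders on the next slide address.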

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
Statistics do not vary within an HMM state

• Conditional independence assumption
State output probability depends only on the current state

• Weak duration modeling
State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model

− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model

− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model

− Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm

− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

(Figure: spectrograms, 0–8 kHz — natural vs. generated speech)

• Why?

− Details of spectral (formant) structure disappear
− Use of a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering

− Mel-cepstrum
− LSP

• Nonparametric approach

− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics

− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics: adaptation, interpolation (eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training
Learn the relationship between linguistic & acoustic features

• Synthesis
Map linguistic features to acoustic ones

• Linguistic features used in SPSS

− Phoneme-, syllable-, word-, phrase-, and utterance-level features
− e.g. phone identity, POS, stress of words in a phrase
− Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
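The linguistic features described above are typically turned into a numeric input vector by one-hot encoding the categorical contexts and appending the numeric ones. A minimal sketch, assuming a toy phone set and illustrative feature names (none of these come from the slides):

```python
import numpy as np

PHONES = ["sil", "a", "b", "k"]  # toy phone inventory (assumption)

def encode_context(phone, stress, syllable_pos, num_syllables):
    """Encode one linguistic context: categorical features become one-hot
    vectors, numeric features (stress flag, positions, counts) are
    appended as raw values. Real systems use ~50 feature types."""
    onehot = np.zeros(len(PHONES))
    onehot[PHONES.index(phone)] = 1.0
    numeric = np.array([float(stress), float(syllable_pos), float(num_syllables)])
    return np.concatenate([onehot, numeric])

# Context: phone "a", stressed, 2nd syllable of a 3-syllable word.
x = encode_context("a", stress=1, syllable_pos=2, num_syllables=3)
```

Stacking many such categorical and numeric contexts is what produces the high-dimensional, sparse input vectors (e.g. the 449- and 289-dimensional inputs in the later experiments).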


HMM-based acoustic modeling for SPSS [4]

(Figure: binary decision tree with yes/no questions clustering the acoustic space)

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

(Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y)

• DNN represents the conditional distribution of y given x

• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
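The frame-level mapping the slide describes — a feed-forward network taking linguistic features in and producing acoustic features out — can be sketched as a bare numpy forward pass. Layer sizes and weights here are toy assumptions; the real systems in the deck use 4+ layers of 256–2048 units.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def dnn_forward(x, weights, biases):
    """Feed-forward acoustic model: hidden layers apply a nonlinearity,
    the output layer is linear because it predicts real-valued acoustic
    features (or their statistics)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(W @ h + b)
    return weights[-1] @ h + biases[-1]

rng = np.random.default_rng(0)
sizes = [10, 16, 16, 4]  # toy: linguistic dim -> hidden -> acoustic dim
Ws = [rng.standard_normal((o, i)) * 0.1 for i, o in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(o) for o in sizes[1:]]
y = dnn_forward(rng.standard_normal(10), Ws, bs)
```

One such forward pass runs per frame; the framework slide below wraps this mapping with duration prediction, parameter generation, and waveform synthesis.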

Framework

(Figure: DNN-based framework — TEXT → text analysis → input feature extraction (with duration prediction) → input features, including binary & numeric features plus duration and frame-position features, at frames 1…T → input layer → hidden layers → output layer → statistics (mean & variance) of the speech parameter vector sequence: spectral features, excitation features, V/UV feature → parameter generation → waveform synthesis → SPEECH)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
→ integrates feature extraction into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithm

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts, silence frames removed

(Figure: 5th mel-cepstrum coefficient over frames 0–500 — natural speech, HMM (α=1), DNN (4×512))

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)          DNN (layers × units)   Neutral   p value   z value
15.8% (16 mix)   38.5% (4 × 256)        45.7%     < 10⁻⁶    −9.9
16.1% (4 mix)    27.2% (4 × 512)        56.8%     < 10⁻⁶    −5.1
12.7% (1 mix)    36.6% (4 × 1024)       50.7%     < 10⁻⁶    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

(Figure: data samples from a bimodal distribution vs. the NN prediction lying between the modes)

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− NN trained by MSE loss → approximates the conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
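The unimodality point above can be verified with a tiny numeric demo: for a one-to-many target, the prediction that minimizes mean squared error is the mean of the modes — a value the data itself never takes. The setup below is illustrative, not from the slides.

```python
import numpy as np

# One input value, two equally likely outputs (a one-to-many mapping):
# half the training targets are -1, half are +1.
targets = np.array([-1.0, -1.0, 1.0, 1.0])

# Sweep constant predictions and pick the one with lowest MSE.
candidates = np.linspace(-2, 2, 401)
losses = [np.mean((targets - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(losses))]
# best lands at 0.0 -- between the modes, matching neither of them.
```

This is exactly the "muffled average" failure mode that motivates replacing the linear output layer with a mixture density output layer, which can represent both modes with their own weights, means, and variances.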


Mixture density network [26]

(Figure: 1-dim, 2-mix MDN — output layer produces w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x))

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs to the output-layer activations: z_j = Σ_i h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3    μ2(x) = z4
σ1(x) = exp(z5)    σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
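The mixture density output layer above maps raw activations to mixture parameters with exactly the three activation functions named on the slide: softmax for weights, identity for means, exponential for standard deviations. A minimal sketch for the slide's 1-dim, 2-mixture case (function name is illustrative):

```python
import numpy as np

def mdn_output_layer(z):
    """Map raw output activations z = (z1..z6) of a 1-dim, 2-mix MDN to
    mixture parameters: weights via softmax, means via identity,
    standard deviations via exp (guaranteeing positivity)."""
    z = np.asarray(z, dtype=float)
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()  # w1(x), w2(x): sum to 1
    mu = z[2:4]                              # mu1(x), mu2(x)
    sigma = np.exp(z[4:6])                   # sigma1(x), sigma2(x) > 0
    return w, mu, sigma

w, mu, sigma = mdn_output_layer([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
```

With equal weight logits the mixture weights come out 0.5/0.5, and the two means sit at the raw activations −1 and +1 — so a bimodal target can be represented directly instead of being averaged.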

DMDN-based SPSS [27]

(Figure: DMDN-based pipeline — TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, …, xT → DMDN outputs w1(xt), w2(xt), μ1(xt), μ2(xt), σ1(xt), σ2(xt) for each frame → parameter generation → waveform synthesis → SPEECH)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mixtures
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM            1 mix: 3.537 ± 0.113    2 mix: 3.397 ± 0.115
DNN            4×1024: 3.635 ± 0.127   5×1024: 3.681 ± 0.109
               6×1024: 3.652 ± 0.108   7×1024: 3.637 ± 0.129
DMDN (4×1024)  1 mix: 3.654 ± 0.117    2 mix: 3.796 ± 0.107
               4 mix: 3.766 ± 0.113    8 mix: 3.805 ± 0.113    16 mix: 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features

− A fixed number of preceding / succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

(Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1}, outputs y_{t−1}, y_t, y_{t+1}, with recurrent connections between hidden states)

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in hidden layers loops through recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
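The basic recurrence above can be written in a few lines: the hidden state mixes the current input with the previous state through the recurrent weight matrix, so each output can depend on all earlier frames. A minimal sketch with toy shapes (all names illustrative):

```python
import numpy as np

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One step of a vanilla RNN: h_t = tanh(Wx x_t + Wh h_{t-1} + b).
    Wh is the recurrent connection that carries previous context."""
    return np.tanh(Wx @ x_t + Wh @ h_prev + b)

def rnn_forward(xs, Wx, Wh, b):
    """Run the recurrence over a sequence, collecting hidden states."""
    h = np.zeros(Wh.shape[0])
    hs = []
    for x_t in xs:
        h = rnn_step(x_t, h, Wx, Wh, b)
        hs.append(h)
    return np.stack(hs)

# 5 frames of 3-dim input, 4-dim hidden state (zero weights for the demo).
hs = rnn_forward(np.ones((5, 3)), np.zeros((4, 3)), np.zeros((4, 4)), np.zeros(4))
```

Because h_t is repeatedly squashed and overwritten at every step, information from distant frames decays quickly — the failure mode that the LSTM's gated memory cell on the next slide is designed to fix.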

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

(Figure: LSTM memory block — linear memory cell c_t with tanh input/output squashing; sigmoid input, forget, and output gates, each fed by x_t, h_{t−1}, and a bias (b_i, b_f, b_o, b_c); input gate = write, output gate = read, forget gate = reset; block output h_t)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
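One step of the memory block in the figure can be sketched directly from its structure: sigmoid input (write), forget (reset), and output (read) gates around a linear memory cell with tanh squashing. The parameter container and names below are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step. `p` holds weight matrices W* applied to the
    concatenation [x; h_prev] and biases b*, as in the figure."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(p["Wi"] @ z + p["bi"])  # input gate  (write)
    f = sigmoid(p["Wf"] @ z + p["bf"])  # forget gate (reset)
    o = sigmoid(p["Wo"] @ z + p["bo"])  # output gate (read)
    g = np.tanh(p["Wc"] @ z + p["bc"])  # squashed cell input
    c = f * c_prev + i * g              # linear memory cell update
    h = o * np.tanh(c)                  # gated block output
    return h, c

# Toy check: 3-dim input, 2-dim state, all-zero weights -> every gate is 0.5,
# so the cell simply halves its previous contents.
p = {k: np.zeros((2, 5)) for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({k: np.zeros(2) for k in ("bi", "bf", "bo", "bc")})
h, c = lstm_step(np.zeros(3), np.zeros(2), np.array([2.0, -2.0]), p)
```

The key difference from the vanilla RNN is the additive cell update c = f·c_prev + i·g: with the forget gate near 1, information can persist across many frames instead of decaying through repeated squashing.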

LSTM-based SPSS [33 34]

(Figure: LSTM-based pipeline — TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, …, xT → LSTM → acoustic features y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

DNN w/ Δ   DNN w/o Δ   LSTM w/ Δ   LSTM w/o Δ   Neutral   z      p
50.0%      14.2%       –           –            35.8%     12.0   < 10⁻¹⁰
–          –           30.2%       15.6%        54.2%     5.1    < 10⁻⁶
15.8%      –           34.0%       –            50.2%     −6.2   < 10⁻⁹
28.4%      –           –           33.6%        38.0%     −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g. adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learns a mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 44: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Adaptation (mimicking voice) [13]

[Figure: adaptive training builds an average-voice model from the training speakers; adaptation then maps it to the target speakers]

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style
  → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14, 15, 16, 17]

[Figure: a new HMM set λ′ is formed between representative HMM sets λ1–λ4 with interpolation ratios I(λ′, λk); λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice, CAT, multiple regression
  → estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Vocoding issues

• Simple pulse/noise excitation
  Difficult to model a mix of V/UV sounds (e.g. voiced fricatives)

  [Figure: excitation e(n) switches between a pulse train (voiced) and white noise (unvoiced)]

• Spectral envelope extraction
  Harmonic effects often cause problems

  [Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple]

• Phase
  Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
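The simple pulse/noise excitation above can be sketched in a few lines; `pulse_noise_excitation` is a hypothetical helper for illustration, not code from the talk:

```python
import numpy as np

def pulse_noise_excitation(f0, fs=16000, frame_shift=0.005, seed=0):
    """Classic excitation model: a pulse train at F0 for voiced frames,
    white noise for unvoiced frames (signalled here by f0 == 0)."""
    rng = np.random.default_rng(seed)
    hop = int(fs * frame_shift)      # samples per frame (80 at 16 kHz / 5 ms)
    e, phase = [], 0.0
    for f in f0:                     # one F0 value per frame
        frame = np.zeros(hop)
        if f > 0:                    # voiced: emit one pulse per pitch period
            for i in range(hop):
                phase += f / fs
                if phase >= 1.0:
                    phase -= 1.0
                    frame[i] = 1.0
        else:                        # unvoiced: white noise
            frame = 0.1 * rng.standard_normal(hop)
            phase = 0.0
        e.append(frame)
    return np.concatenate(e)
```

Feeding this excitation through a spectral-envelope filter yields the vocoded waveform; the hard binary switch between pulse train and noise is exactly what makes mixed sounds such as voiced fricatives hard to model.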

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic/stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
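The weak-duration point can be made concrete: a state with self-transition probability a implies a geometric duration distribution whose mode is always one frame, a poor fit for real phone durations. A small numeric sketch (values are illustrative):

```python
import numpy as np

a = 0.9                           # self-transition probability
d = np.arange(1, 51)              # durations of 1..50 frames
p = a ** (d - 1) * (1 - a)        # implicit HMM state-duration distribution
mean_duration = 1.0 / (1 - a)     # expected duration: 10 frames
# p is monotonically decreasing: the most likely duration is always d = 1,
# regardless of a, which is why explicit duration models (HSMMs) help.
```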

Better acoustic modeling

• Piece-wise constant statistics → Dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → Graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

  [Figure: natural vs. generated spectra over 0–8 kHz; the generated spectra lose fine formant structure]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
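As a toy numeric illustration of the global-variance (GV) statistic mentioned above (a sketch over assumed (T, D) trajectories, not the talk's implementation):

```python
import numpy as np

def global_variance(c):
    """Global variance (GV): per-utterance variance of each dimension of a
    (T, D) speech-parameter trajectory. Generated trajectories typically
    have smaller GV than natural ones; GV-based compensation penalizes
    this gap during parameter generation."""
    return c.var(axis=0)

t = np.linspace(0, 2 * np.pi, 200)
natural = np.stack([np.sin(t), np.cos(t)], axis=1)  # toy (T, D) trajectory
smoothed = 0.5 * natural                            # "oversmoothed" version
ratio = global_variance(smoothed) / global_variance(natural)
# scaling a trajectory by 0.5 shrinks its GV to a quarter in every dimension
```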

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (adaptation; interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Linguistic → acoustic mapping

• Training
  Learn relationship between linguistic & acoustic features

• Synthesis
  Map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme, syllable, word, phrase, utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision tree (yes/no questions) clustering the acoustic space into state-output distributions]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1–h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
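The mapping the DNN computes can be sketched as a plain forward pass; `dnn_forward` and the dimensions below are illustrative assumptions, matching the sigmoid-hidden/linear-output configuration used in the experiments later:

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Map a linguistic feature vector x to acoustic-feature statistics:
    sigmoid hidden layers followed by a linear output layer."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))  # sigmoid hidden layer
    return h @ weights[-1] + biases[-1]          # linear output layer

# Toy dimensions: 30 linguistic inputs -> two hidden layers -> 10 acoustic outputs
rng = np.random.default_rng(0)
dims = [30, 64, 64, 10]
Ws = [0.1 * rng.standard_normal((m, n)) for m, n in zip(dims[:-1], dims[1:])]
bs = [np.zeros(n) for n in dims[1:]]
y = dnn_forward(rng.standard_normal(30), Ws, bs)
```

At synthesis time this forward pass runs once per frame, producing the speech-parameter statistics consumed by the parameter-generation step.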

Framework

[Figure: TEXT → text analysis → input feature extraction (binary & numeric linguistic features, duration & frame-position features, for frames 1…T) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the per-frame inputs]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? ... no

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database             US English female speaker
Training/test data   33,000 & 173 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  11 categorical features,
                     25 numeric features
Acoustic features    0–39 mel-cepstrum,
                     log F0, 5-band aperiodicity, ∆, ∆²
HMM topology         5-state, left-to-right HSMM [21],
                     MSD F0 [22], MDL [23]
DNN architecture     1–5 layers, 256/512/1024/2048 units/layer,
                     sigmoid, continuous F0 [24]
Postprocessing       Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    -9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    -5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    -11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Limitations of DNN-based acoustic modeling

[Figure: scatter of one-to-many data samples in the (y1, y2) plane; the NN prediction runs between the two modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
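A two-line numeric sketch of the unimodality problem: with one-to-many data, the MSE-optimal prediction is the mean of the modes and matches neither of them:

```python
import numpy as np

# Toy one-to-many data: the same input can produce y = -1 or y = +1
# (e.g. two equally valid ways of realizing the same phone).
y = np.array([-1.0, +1.0] * 50)

# The MSE-optimal constant prediction is the conditional mean: it falls
# exactly between the two modes and is itself never observed in the data.
mse_optimal = y.mean()
```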


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; the network outputs w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x), defining a GMM over y]

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_ij

Weights → Softmax activation function:
  w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)

Means → Linear activation function:
  μ1(x) = z3,  μ2(x) = z4

Variances → Exponential activation function:
  σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
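The three activation functions above can be sketched directly; `mdn_output` is an illustrative helper (1-dim output, arbitrary number of mixtures), not Bishop's original code:

```python
import numpy as np

def mdn_output(z, n_mix):
    """Split raw output-layer activations z (length 3*n_mix) into GMM
    parameters: softmax -> weights, linear -> means, exp -> positive
    standard deviations."""
    zw, zmu, zs = z[:n_mix], z[n_mix:2 * n_mix], z[2 * n_mix:]
    w = np.exp(zw - zw.max())
    w /= w.sum()                    # softmax: weights sum to one
    return w, zmu, np.exp(zs)       # means pass through; exp keeps sigma > 0

# 1-dim, 2-mix example matching the slide: z = (z1..z6)
w, mu, sigma = mdn_output(np.array([0.0, 0.0, 1.0, 2.0, 0.0, np.log(2.0)]), 2)
```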

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT feed a deep MDN whose output layer emits GMM weights wk(xt), means μk(xt) & variances σk(xt) for each frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture    4–7 hidden layers, 1024 units/hidden layer,
                    ReLU (hidden), linear (output)
DMDN architecture   4 hidden layers, 1024 units/hidden layer,
                    ReLU [28] (hidden), mixture density (output),
                    1–16 mix
Optimization        AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM        1 mix     3.537 ± 0.113
           2 mix     3.397 ± 0.115
DNN        4×1024    3.635 ± 0.127
           5×1024    3.681 ± 0.109
           6×1024    3.652 ± 0.108
           7×1024    3.637 ± 0.129
DMDN       1 mix     3.654 ± 0.117
(4×1024)   2 mix     3.796 ± 0.107
           4 mix     3.766 ± 0.113
           8 mix     3.805 ± 0.113
           16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts
    (e.g. ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: network with recurrent connections, unrolled over frames t−1, t, t+1, mapping inputs x to outputs y]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections
    → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
    → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block; the memory cell ct is updated from xt and h(t−1) through tanh and sigmoid units, with the input gate (write), forget gate (reset) and output gate (read) controlling the cell, producing ht]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
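One LSTM time step with the write/reset/read gating above can be sketched as follows; the parameter dictionary `p` is an assumed container for illustration, not the talk's implementation:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a standard LSTM block [32]: a linear memory cell c,
    surrounded by multiplicative input (write), forget (reset) and
    output (read) gates."""
    z = np.concatenate([x, h_prev])                       # [x_t, h_{t-1}]
    i = sigmoid(p['Wi'] @ z + p['bi'])                    # input gate
    f = sigmoid(p['Wf'] @ z + p['bf'])                    # forget gate
    o = sigmoid(p['Wo'] @ z + p['bo'])                    # output gate
    c = f * c_prev + i * np.tanh(p['Wc'] @ z + p['bc'])   # gated cell update
    h = o * np.tanh(c)                                    # gated read-out
    return h, c

# Toy sizes: 3 inputs, 4 cells; random parameters for illustration.
rng = np.random.default_rng(0)
nx, nh = 3, 4
p = {k: rng.standard_normal((nh, nx + nh)) for k in ('Wi', 'Wf', 'Wo', 'Wc')}
p.update({k: np.zeros(nh) for k in ('bi', 'bf', 'bo', 'bc')})
h, c = lstm_step(rng.standard_normal(nx), np.zeros(nh), np.zeros(nh), p)
```

Because the cell update is additive (f · c_prev + i · candidate) rather than a repeated squashing, information can persist over many frames, which is what the basic RNN on the previous slide lacks.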

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT feed an LSTM that outputs acoustic features y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database             US English female speaker
Train/dev set data   34,632 & 100 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  DNN: 449
                     LSTM: 289
Acoustic features    0–39 mel-cepstrum,
                     log F0, 5-band aperiodicity (∆, ∆²)
DNN                  4 hidden layers, 1024 units/hidden layer,
                     ReLU (hidden), linear (output),
                     AdaDec [29] on GPU
LSTM                 1 forward LSTM layer,
                     256 units, 128 projection,
                     asynchronous SGD on CPUs [35]
Postprocessing       Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      -6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      -1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements:
    vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis/synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

Page 45: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Adaptation (mimicking voice) [13]

Average-voice model

AdaptiveTraining

Adaptation

Training speakers Target speakers

• Train average voice model (AVM) from training speakers using SAT

• Adapt AVM to target speakers

• Requires only a small amount of data from the target speaker/speaking style → Small cost to create new voices

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 37 of 79

Interpolation (mixing voice) [14 15 16 17]

[Figure: a new voice λ′ obtained by interpolating HMM sets λ1–λ4 with ratios I(λ′, λ1)…I(λ′, λ4); λ: HMM set, I(λ′, λ): interpolation ratio]

• Interpolate representative HMM sets

• Can obtain new voices w/o adaptation data

• Eigenvoice / CAT / multiple regression → estimate representative HMM sets from data
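The interpolation above can be sketched in a few lines. This is a hypothetical toy (the function name and the two "mean vectors" are invented for illustration, and real systems interpolate full HMM parameter sets, not bare lists): it mixes the Gaussian mean vectors of several voice models with ratios I(λ′, λk).

```python
def interpolate_gaussians(models, ratios):
    # Build a new "voice" by interpolating the mean vectors of several
    # speakers' Gaussians with interpolation ratios I(λ', λ_k).
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    dim = len(models[0])
    return [sum(r * m[d] for r, m in zip(ratios, models)) for d in range(dim)]

# Two toy speaker-dependent mean vectors and a 50/50 mix
voice_a = [1.0, 3.0]
voice_b = [3.0, 5.0]
mixed = interpolate_gaussians([voice_a, voice_b], [0.5, 0.5])  # → [2.0, 4.0]
```

With ratios estimated from data (eigenvoice, CAT, multiple regression), the same weighted combination yields new voices without any adaptation data from the target speaker.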

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse / noise excitation: Difficult to model mixes of V/UV sounds (e.g., voiced fricatives)

[Figure: excitation e(n) switching between a pulse train (voiced) and white noise (unvoiced)]

• Spectral envelope extraction: Harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz showing harmonic structure]

• Phase: Important but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic / stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics: Statistics do not vary within an HMM state

• Conditional independence assumption: State output probability depends only on the current state

• Weak duration modeling: State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → Dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → Graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → Explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: spectrograms (0–8 kHz) of natural vs generated speech]

• Why?
  − Details of spectral (formant) structure disappear
  − Use of a better acoustic model relaxes the issue, but not enough
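The generation step behind this slide can be sketched for a single 1-D feature stream. This is a simplified toy, not the actual algorithm of [9] (the function name is invented, and the delta is taken as a plain first difference with zero-mean targets rather than the usual window coefficients): it solves the normal equations (WᵀΣ⁻¹W)y = WᵀΣ⁻¹μ by Gaussian elimination, and the delta constraint visibly smooths the trajectory.

```python
def mlpg_1d(mu_static, var_static, var_delta):
    # Toy maximum-likelihood parameter generation for a 1-D stream with a
    # first-difference "delta" constraint whose target is zero.
    T = len(mu_static)
    ws, wd = 1.0 / var_static, 1.0 / var_delta
    # Normal equations A y = b (A is tridiagonal, stored dense here)
    A = [[0.0] * T for _ in range(T)]
    b = [0.0] * T
    for t in range(T):                 # static rows: y_t ~ N(mu_t, var_static)
        A[t][t] += ws
        b[t] += ws * mu_static[t]
    for t in range(1, T):              # delta rows: y_t - y_{t-1} ~ N(0, var_delta)
        A[t][t] += wd
        A[t - 1][t - 1] += wd
        A[t][t - 1] -= wd
        A[t - 1][t] -= wd
    # Plain Gaussian elimination with back-substitution
    for i in range(T):
        for j in range(i + 1, T):
            f = A[j][i] / A[i][i]
            for k in range(i, T):
                A[j][k] -= f * A[i][k]
            b[j] -= f * b[i]
    y = [0.0] * T
    for i in reversed(range(T)):
        y[i] = (b[i] - sum(A[i][k] * y[k] for k in range(i + 1, T))) / A[i][i]
    return y

mu = [0.0, 1.0, 0.0, 1.0, 0.0]         # deliberately jagged static means
y = mlpg_1d(mu, var_static=1.0, var_delta=0.1)
rough = sum((mu[t] - mu[t - 1]) ** 2 for t in range(1, 5))
smooth = sum((y[t] - y[t - 1]) ** 2 for t in range(1, 5))
# smooth << rough: the generated trajectory is far smoother than the means
```

This is exactly the double-edged effect the slide describes: the constraint removes frame-to-frame discontinuities, but with small delta variances it can also erase fine spectral detail.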

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (Adaptation; Interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → Neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training: Learn relationship between linguistic & acoustic features

• Synthesis: Map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: a decision tree of yes/no context questions partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: a feedforward network with hidden layers h1–h3 mapping linguistic features x to acoustic features y]

• DNN represents conditional distribution of y given x

• DNN replaces decision trees and GMMs
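The mapping on this slide can be sketched as a plain feedforward pass. This is a toy with invented layer sizes and random weights (real systems use hundreds of linguistic inputs and dozens of acoustic outputs per frame): sigmoid hidden layers feed a linear output layer, standing in for the decision trees and GMMs it replaces.

```python
import math
import random

def dnn_forward(x, layers):
    # Forward pass: linguistic features x -> hidden layers (sigmoid)
    # -> acoustic features y (linear output layer).
    h = x
    for li, (W, b) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + bv for row, bv in zip(W, b)]
        if li < len(layers) - 1:                       # hidden: sigmoid
            h = [1.0 / (1.0 + math.exp(-v)) for v in h]
    return h                                            # output: linear

random.seed(0)
def rand_layer(n_in, n_out):
    # Small random weight matrix and zero biases for the toy network
    return ([[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out)

x = [1.0, 0.0, 0.25]            # toy binary + numeric linguistic features
net = [rand_layer(3, 4), rand_layer(4, 2)]
y = dnn_forward(x, net)          # two toy "acoustic" outputs for one frame
```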

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

Input layer — Hidden layers — Output layer

[Figure: TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features, for frames 1…T) → DNN → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; a duration prediction module supplies the frame alignment]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations → Integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? → No

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithm

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

(w/o grouping questions, numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum coefficient over frames 0–500 — natural speech vs HMM (α=1) vs DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value    z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10^-6    −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10^-6    −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10^-6    −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: bimodal data samples in (y1, y2) vs the single unimodal NN prediction between the two clusters]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → Mixture density output layer [26]
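The conditional-mean problem is easy to see numerically. In this toy, one input has targets clustered at two modes; the constant prediction that minimizes mean squared error is their mean, which beats either mode on MSE yet matches neither.

```python
def mse(pred, ys):
    # Mean squared error of a constant prediction against the targets
    return sum((pred - y) ** 2 for y in ys) / len(ys)

targets = [-1.0, -1.0, 1.0, 1.0]    # one-to-many: two modes for one input
mean = sum(targets) / len(targets)   # 0.0, the MSE-optimal constant prediction
# mse(0.0) = 1.0 < mse(1.0) = mse(-1.0) = 2.0:
# the mean wins on MSE but lies between the modes, matching neither.
```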

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — network outputs z1…z6 are mapped to weights w1(x), w2(x), means μ1(x), μ2(x), and variances σ1(x), σ2(x)]

Weights → Softmax activation function
Means → Linear activation function
Variances → Exponential activation function

Inputs of activation function: z_j = Σ_{i=1}^{4} h_i w_ij

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                                μ2(x) = z4
σ1(x) = exp(z5)                           σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
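The output-layer transforms on this slide can be written directly. This is a minimal 1-dim, 2-mix sketch (the function name and the ordering of the raw outputs z1…z6 — weights, then means, then log-scale variance parameters — are assumptions for illustration): softmax for the weights, identity for the means, exponential for the variances.

```python
import math

def mdn_outputs(z):
    # Map 6 raw network outputs to a 1-dim, 2-mixture density:
    # z[0:2] -> softmax -> weights, z[2:4] -> identity -> means,
    # z[4:6] -> exp -> variance parameters (kept positive by construction).
    w_raw, mu, sig_raw = z[0:2], z[2:4], z[4:6]
    m = max(w_raw)
    e = [math.exp(v - m) for v in w_raw]   # numerically stable softmax
    s = sum(e)
    w = [v / s for v in e]
    sigma = [math.exp(v) for v in sig_raw]
    return w, mu, sigma

w, mu, sigma = mdn_outputs([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
# → equal weights 0.5/0.5, means -1 and 1, unit variance parameters
```

The exponential keeps variances positive and the softmax keeps weights on the simplex without any explicit constraints during training.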

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

[Figure: DMDN-based SPSS — text analysis and input feature extraction produce frame-level inputs x1…xT; a deep MDN outputs per-frame mixture weights w1, w2, means μ1, μ2, and variances σ1, σ2; parameter generation and waveform synthesis produce SPEECH; duration prediction supplies the frame alignment]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), Linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), Mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM            1 mix: 3.537 ± 0.113    2 mix: 3.397 ± 0.115
DNN            4×1024: 3.635 ± 0.127   5×1024: 3.681 ± 0.109
               6×1024: 3.652 ± 0.108   7×1024: 3.637 ± 0.129
DMDN (4×1024)  1 mix: 3.654 ± 0.117    2 mix: 3.796 ± 0.107
               4 mix: 3.766 ± 0.113    8 mix: 3.805 ± 0.113
               16 mix: 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled over time — inputs x(t−1), x(t), x(t+1) produce outputs y(t−1), y(t), y(t+1) through recurrent connections]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → Quickly decays over time
  − Prone to being overwritten by new information arriving from inputs → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — a memory cell c(t) with tanh input/output squashing and sigmoid gates, each fed by x(t), h(t−1), and a bias: input gate (write), forget gate (reset), output gate (read)]
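The gate arithmetic in the figure can be sketched for a single unit. This is a toy step function with invented scalar weights (no peepholes or projection, unlike some production variants): sigmoid gates multiply a tanh-squashed input, the linear memory cell, and the tanh-squashed cell output.

```python
import math

def lstm_step(x, h_prev, c_prev, p):
    # One step of a single-unit LSTM, following the slide's gates:
    # input gate = write, forget gate = reset, output gate = read.
    sig = lambda a: 1.0 / (1.0 + math.exp(-a))
    i = sig(p["wi"] * x + p["ui"] * h_prev + p["bi"])        # input (write) gate
    f = sig(p["wf"] * x + p["uf"] * h_prev + p["bf"])        # forget (reset) gate
    o = sig(p["wo"] * x + p["uo"] * h_prev + p["bo"])        # output (read) gate
    g = math.tanh(p["wc"] * x + p["uc"] * h_prev + p["bc"])  # squashed cell input
    c = f * c_prev + i * g                                    # linear memory cell
    h = o * math.tanh(c)                                      # gated cell output
    return h, c

# Invented toy parameters: every weight and bias set to 0.5
params = {k: 0.5 for k in
          ("wi", "ui", "bi", "wf", "uf", "bf", "wo", "uo", "bo", "wc", "uc", "bc")}
h, c = lstm_step(1.0, 0.0, 0.0, params)
```

Because the cell update is additive (f·c_prev + i·g) rather than a repeated squashing, gradients can survive over many steps — the fix for the decay/overwriting problems on the previous slide.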

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

[Figure: LSTM-based SPSS — text analysis and input feature extraction produce frame-level inputs x1…xT; a recurrent network maps them to acoustic outputs y1…yT, followed by parameter generation and waveform synthesis; duration prediction supplies the frame alignment]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), Linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, Asynchronous SGD on CPUs [35]
Postprocessing: Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10^-10
–          –           30.2        15.6         54.2      5.1    < 10^-6
15.8       –           34.0        –            50.2      −6.2   < 10^-9
28.4       –           –           33.6         38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: Vocoding, Acoustic modeling, Oversmoothing compensation

• NN-based SPSS
  − Learns mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (Submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UK Speech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
  • Statistical parametric speech synthesis with neural networks
    • Deep neural network (DNN)-based SPSS
    • Deep mixture density network (DMDN)-based SPSS
    • Recurrent neural network (RNN)-based SPSS
  • Summary
Page 46: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Interpolation (mixing voice) [14 15 16 17]

λ1

λ2

λ3

λ4

λprime

I(λprimeλ1)I(λprimeλ2)

I(λprimeλ3)I(λprimeλ4)

λ HMM setI(λprimeλ) Interpolation ratio

bull Interpolate representive HMM sets

bull Can obtain new voices wo adaptation data

bull Eigenvoice CAT multiple regressionrarr estimate representative HMM sets from data

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 38 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Vocoding issues

bull Simple pulse noise excitationDifficult to model mix of VUV sounds (eg voiced fricatives)

Unvoiced Voiced

pulse train

white noise

e(n)

excitation

bull Spectral envelope extractionHarmonic effect often cause problem

0

40

80

0 2 4 6 8 [kHz]

Pow

er [d

B]

bull PhaseImportant but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79

Better vocoding

bull Mixed excitation linear prediction (MELP)

bull STRAIGHT

bull Multi-band excitation

bull Harmonic + noise model (HNM)

bull Harmonic stochastic model

bull LF model

bull Glottal waveform

bull Residual codebook

bull ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state

bull Conditional independence assumptionState output probability depends only on the current state

bull Weak duration modelingState duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

bull Piece-wise constatnt statistics rarr Dynamical model

minus Trended HMMminus Polynomial segment modelminus Trajectory HMM

bull Conditional independence assumption rarr Graphical model

minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM

bull Weak duration modeling rarr Explicit duration model

minus Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

bull Speech parameter generation algorithm

minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled

Nat

ural

0 4 8Frequency (kHz)

Gen

erat

ed

0 4 8Frequency (kHz)

bull Why

minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

bull Postfiltering

minus Mel-cepstrumminus LSP

bull Nonparametric approach

minus Conditional parameter generationminus Discrete HMM-based speech synthesis

bull Combine multiple-level statistics

minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

bull Advantages

minus Flexibility to change voice characteristics

Adaptation Interpolation eigenvoice CAT multiple regression

minus Small footprintminus Robustness

bull Drawback

minus Quality

bull Major factors for quality degradation [3]

minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

HMM-based acoustic modeling for SPSS [4]

yes noyes no

yes no

yes no yes no

Acoustic space

bull Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

Acoustic

features y

Linguistic

features x

h1

h2

h3

bull DNN represents conditional distribution of y given x

bull DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

[Figure: DNN-based SPSS pipeline. TEXT → text analysis → input feature extraction (binary & numeric linguistic features plus duration and frame-position features for frames 1…T) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; a duration prediction module drives the per-frame input features]

Advantages of NN-based acoustic modeling

• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture with non-linear operations
→ feature extraction integrated into acoustic modeling

• Distributed representation
− Can be exponentially more efficient than a fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
− concept → linguistic → articulatory → waveform


Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques


Experimental setup

Database: US English female speaker
Training & test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Example of speech parameter trajectories

w/o grouping questions, numeric contexts, silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512)]

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

HMM (α)     DNN (layers × units)   Neutral   p value    z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10^-6    -9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10^-6    -5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10^-6    -11.5

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples in (y1, y2) space vs. the NN prediction lying at their mean]

• Unimodality
− Humans can speak in different ways → one-to-many mapping
− An NN trained with MSE loss → approximates the conditional mean

• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
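The unimodality point can be checked numerically: for a one-to-many mapping, the constant prediction minimizing MSE is the mean of the targets, which falls between the two clusters and matches neither (toy data invented for the illustration):

```python
# Two target clusters for the same input x -- a one-to-many mapping.
targets = [-1.0, -1.1, -0.9, 1.0, 1.1, 0.9]

def mse(pred):
    return sum((t - pred) ** 2 for t in targets) / len(targets)

mean = sum(targets) / len(targets)

# The mean beats either cluster center under MSE, yet lies far from both.
assert mse(mean) < mse(-1.0) and mse(mean) < mse(1.0)
print(round(mean, 9))   # -> 0.0
```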


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; the output layer emits w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w_1(x) = exp(z_1) / (exp(z_1) + exp(z_2))    w_2(x) = exp(z_2) / (exp(z_1) + exp(z_2))
µ_1(x) = z_3                                 µ_2(x) = z_4
σ_1(x) = exp(z_5)                            σ_2(x) = exp(z_6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
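The 1-dim, 2-mix output layer above can be written out directly. This sketch assumes the six activations z1…z6 have already been computed by the hidden layers; the input values are invented:

```python
# MDN output layer: softmax -> weights, linear -> means, exp -> variances,
# guaranteeing valid GMM parameters (weights sum to 1, sigmas positive).
import math

def mdn_params(z):
    z1, z2, z3, z4, z5, z6 = z
    denom = math.exp(z1) + math.exp(z2)
    w = [math.exp(z1) / denom, math.exp(z2) / denom]   # mixture weights
    mu = [z3, z4]                                      # component means
    sigma = [math.exp(z5), math.exp(z6)]               # std devs > 0
    return w, mu, sigma

w, mu, sigma = mdn_params([0.2, -0.4, 1.5, -2.0, 0.0, -1.0])
print(w, mu, sigma)
```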

DMDN-based SPSS [27]

[Figure: pipeline. TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, …, xT; a deep MDN maps each xt to w1(xt), w2(xt), µ1(xt), µ2(xt), σ1(xt), σ2(xt); parameter generation and waveform synthesis produce SPEECH]

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix    3.537 ± 0.113
               2 mix    3.397 ± 0.115
DNN            4×1024   3.635 ± 0.127
               5×1024   3.681 ± 0.109
               6×1024   3.652 ± 0.108
               7×1024   3.637 ± 0.129
DMDN (4×1024)  1 mix    3.654 ± 0.117
               2 mix    3.796 ± 0.107
               4 mix    3.766 ± 0.113
               8 mix    3.805 ± 0.113
               16 mix   3.791 ± 0.102

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-span contextual effects

• Frame-by-frame mapping
− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]


Basic RNN

[Figure: RNN with recurrent connections, unrolled over time; inputs x_{t−1}, x_t, x_{t+1} and outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts
− Information in hidden layers loops through recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]
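A toy scalar version of the basic RNN above shows both properties: only past context is available at each frame, and the influence of an input decays as it loops through the recurrent connection (the weights are invented for the sketch):

```python
# Scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1}), y_t = w_out * h_t.
import math

def rnn_forward(xs, w_in=0.5, w_rec=0.9, w_out=1.0):
    h, ys = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)   # recurrent connection
        ys.append(w_out * h)                  # per-frame output y_t
    return ys

# A single impulse at t=0: its effect shrinks at every later frame.
ys = rnn_forward([1.0, 0.0, 0.0, 0.0])
print(ys)   # monotonically decaying positive values
```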

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units
− Input gate: controls writing to the memory cell
− Forget gate: controls resetting the memory cell
− Output gate: controls reading from the memory cell

[Figure: LSTM block; memory cell c_t with tanh input/output non-linearities and sigmoid input, forget, and output gates, each fed by x_t and h_{t−1} plus a bias]
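One step of the gate arithmetic above, sketched with toy scalar weights and no biases (a real LSTM uses weight matrices and bias vectors; the values here are invented):

```python
# Single LSTM step: sigmoid gates control writing to (input gate),
# resetting (forget gate) and reading from (output gate) the linear
# memory cell c_t.
import math

def sigm(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    i = sigm(W["i"][0] * x + W["i"][1] * h_prev)       # input gate (write)
    f = sigm(W["f"][0] * x + W["f"][1] * h_prev)       # forget gate (reset)
    o = sigm(W["o"][0] * x + W["o"][1] * h_prev)       # output gate (read)
    g = math.tanh(W["c"][0] * x + W["c"][1] * h_prev)  # cell candidate
    c = f * c_prev + i * g                             # linear memory cell
    h = o * math.tanh(c)                               # block output h_t
    return h, c

W = {"i": (1.0, 0.5), "f": (1.0, 0.5), "o": (1.0, 0.5), "c": (1.0, 0.5)}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, W)
print(h, c)   # the cell retains part of the first input across zero inputs
```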

LSTM-based SPSS [33, 34]

[Figure: pipeline. TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Experimental setup

Database: US English female speaker
Train / dev data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 449 (DNN), 289 (LSTM)
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10^-10
–          –           30.2        15.6         54.2      5.1    < 10^-6
15.8       –           34.0        –            50.2      -6.2   < 10^-9
28.4       –           –           33.6         38.0      -1.5   0.138

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
− Flexible (e.g., adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
− Learns the mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.


Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Vocoding issues

• Simple pulse/noise excitation
Difficult to model a mix of V/UV sounds (e.g., voiced fricatives)

[Figure: excitation e(n), a pulse train in voiced regions and white noise in unvoiced regions]

• Spectral envelope extraction
Harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz showing harmonic structure]

• Phase
Important, but usually ignored
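The criticized pulse/noise scheme is easy to sketch: a hard V/UV switch selects either a pulse train at the pitch period or white noise, with no way to mix the two within a region (the sample counts and pitch period below are invented):

```python
# Simple pulse/noise excitation e(n): pulse train when voiced, white
# noise when unvoiced -- a binary switch that cannot represent mixed
# excitation such as voiced fricatives.
import random

random.seed(0)

def excitation(n_samples, voiced, pitch_period=80):
    e = []
    for n in range(n_samples):
        if voiced:
            e.append(1.0 if n % pitch_period == 0 else 0.0)  # pulse train
        else:
            e.append(random.gauss(0.0, 1.0))                 # white noise
    return e

v = excitation(160, voiced=True)    # pitch pulses at n = 0 and n = 80
u = excitation(160, voiced=False)
print(sum(1 for s in v if s == 1.0))   # -> 2
```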

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
Statistics do not vary within an HMM state

• Conditional independence assumption
State output probability depends only on the current state

• Weak duration modeling
State duration probability decreases exponentially with time

None of them hold for real speech
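The weak-duration-modeling point follows directly from the self-transition probability a: staying in a state for exactly d frames has probability a^(d−1)(1−a), a geometric distribution whose mode is always d = 1, unlike real phone durations (a = 0.9 is an invented example value):

```python
# P(duration = d) for an HMM state with self-transition probability a:
# a**(d-1) * (1 - a), which decays monotonically -- 1 frame is always
# the most likely duration.

a = 0.9   # hypothetical self-transition probability
p = [a ** (d - 1) * (1 - a) for d in range(1, 6)]

assert all(p[i] > p[i + 1] for i in range(len(p) - 1))   # monotone decay
print([round(x, 4) for x in p])   # -> [0.1, 0.09, 0.081, 0.0729, 0.0656]
```

Hidden semi-Markov models (mentioned on the next slide) replace this implicit geometric distribution with an explicit duration model.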

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
− Trended HMM
− Polynomial segment model
− Trajectory HMM

• Conditional independence assumption → graphical model
− Buried Markov model
− Autoregressive HMM
− Trajectory HMM

• Weak duration modeling → explicit duration model
− Hidden semi-Markov model

Oversmoothing

• Speech parameter generation algorithm
− Dynamic feature constraints make generated parameters smooth
− Often too smooth → sounds muffled

[Figure: spectrograms (0–8 kHz) of natural vs. generated speech]

• Why?
− Details of spectral (formant) structure disappear
− Use of a better AM relaxes the issue, but not enough

Oversmoothing compensation

• Postfiltering
− Mel-cepstrum
− LSP

• Nonparametric approach
− Conditional parameter generation
− Discrete HMM-based speech synthesis

• Combine multiple-level statistics
− Global variance (intra-utterance variance)
− Modulation spectrum (intra-utterance frequency components)

Characteristics of SPSS

• Advantages
− Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
− Small footprint
− Robustness

• Drawback
− Quality

• Major factors for quality degradation [3]
− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)


minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Vocoding issues

• Simple pulse/noise excitation
Difficult to model mixed voiced/unvoiced sounds (e.g., voiced fricatives)

[Figure: excitation e(n) switches between a pulse train (voiced) and white noise (unvoiced)]

• Spectral envelope extraction
Harmonic effects often cause problems

[Figure: power spectrum (dB) over 0–8 kHz showing harmonic ripple]

• Phase
Important, but usually ignored

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 40 of 79
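The simple pulse/noise scheme criticised on this slide can be sketched in a few lines; the function name and scaling here are illustrative, not from the talk:

```python
import numpy as np

def excitation(f0, fs=16000, frame_shift=0.005):
    """Simple pulse/noise excitation e(n): voiced frames (f0 > 0) get a
    pulse train at the pitch period, unvoiced frames (f0 = 0) get white
    noise. This is the basic scheme the slide criticises."""
    rng = np.random.default_rng(0)
    hop = int(fs * frame_shift)                 # samples per frame
    e = np.zeros(len(f0) * hop)
    next_pulse = 0.0                            # carries pitch phase across frames
    for i, f in enumerate(f0):
        start = i * hop
        if f > 0:                               # voiced: pulse train
            period = fs / f
            while next_pulse < start + hop:
                e[int(next_pulse)] = np.sqrt(period)  # per-sample power ~1
                next_pulse += period
        else:                                   # unvoiced: white noise
            e[start:start + hop] = rng.standard_normal(hop)
            next_pulse = start + hop
    return e

# 100 ms voiced at 100 Hz followed by 100 ms unvoiced
f0 = np.concatenate([np.full(20, 100.0), np.zeros(20)])
e = excitation(f0)
```

The hard switch between the two branches is exactly why voiced fricatives, which mix periodic and noise energy, are poorly modelled.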

Better vocoding

• Mixed excitation linear prediction (MELP)
• STRAIGHT
• Multi-band excitation
• Harmonic + noise model (HNM)
• Harmonic stochastic model
• LF model
• Glottal waveform
• Residual codebook
• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
Statistics do not vary within an HMM state
• Conditional independence assumption
State output probability depends only on the current state
• Weak duration modeling
State duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  - Trended HMM
  - Polynomial segment model
  - Trajectory HMM
• Conditional independence assumption → graphical model
  - Buried Markov model
  - Autoregressive HMM
  - Trajectory HMM
• Weak duration modeling → explicit duration model
  - Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  - Dynamic feature constraints make generated parameters smooth
  - Often too smooth → sounds muffled

[Figure: spectra (0–8 kHz) of natural vs. generated speech]

• Why?
  - Details of spectral (formant) structure disappear
  - Use of a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

• Postfiltering
  - Mel-cepstrum
  - LSP
• Nonparametric approach
  - Conditional parameter generation
  - Discrete HMM-based speech synthesis
• Combine multiple-level statistics
  - Global variance (intra-utterance variance)
  - Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79
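The global-variance idea can be illustrated with a crude variance-matching sketch. This is not the maximum-likelihood GV method of the literature, just a minimal post-hoc rescaling assuming a per-dimension target variance estimated from natural speech:

```python
import numpy as np

def gv_postfilter(traj, target_gv):
    """Rescale a generated trajectory so its intra-utterance (global)
    variance matches a target estimated from natural speech."""
    mean = traj.mean(axis=0)
    gv = traj.var(axis=0)
    scale = np.sqrt(target_gv / np.maximum(gv, 1e-12))
    return mean + scale * (traj - mean)

# Oversmoothed trajectory (variance ~0.09) stretched to unit variance
rng = np.random.default_rng(1)
x = rng.standard_normal((200, 3)) * 0.3
y = gv_postfilter(x, np.full(3, 1.0))
```

Restoring the variance re-sharpens formant structure that the averaging in generation smoothed away, which is why GV-style compensation reduces the muffled quality.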

Characteristics of SPSS

• Advantages
  - Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
  - Small footprint
  - Robustness
• Drawback
  - Quality
• Major factors for quality degradation [3]
  - Vocoder (speech analysis & synthesis)
  - Acoustic model (HMM) → neural networks
  - Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

• Background: HMM-based statistical parametric speech synthesis (SPSS); flexibility; improvements
• Statistical parametric speech synthesis with neural networks: deep neural network (DNN)-based SPSS; deep mixture density network (DMDN)-based SPSS; recurrent neural network (RNN)-based SPSS
• Summary

Linguistic rarr acoustic mapping

• Training
Learn the relationship between linguistic & acoustic features
• Synthesis
Map linguistic features to acoustic ones
• Linguistic features used in SPSS
  - Phoneme, syllable, word, phrase, and utterance-level features
  - e.g., phone identity, POS, stress of words in a phrase
  - Around 50 different types: much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision tree (yes/no questions) partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x
• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
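The frame-level mapping above can be sketched directly. The layer sizes and names below are illustrative toys; the sigmoid-hidden/linear-output choice matches the experimental setup later in the talk:

```python
import numpy as np

def init_dnn(sizes, seed=0):
    """Random weights for a feed-forward DNN, e.g. sizes=[449, 1024, 1024, 127]."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def dnn_forward(params, x):
    """Map one frame of linguistic features x to acoustic features:
    sigmoid hidden layers, linear output layer."""
    h = x
    for W, b in params[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))  # sigmoid hidden units
    W, b = params[-1]
    return h @ W + b                             # linear output

params = init_dnn([10, 32, 32, 4])               # toy sizes
y = dnn_forward(params, np.ones(10))
```

At synthesis time this forward pass is run once per frame; the predicted statistics then feed the usual parameter-generation and vocoding stages.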

Framework

[Figure: DNN-based SPSS framework. TEXT → text analysis → input feature extraction (binary & numeric linguistic features at frames 1 … T, plus duration and frame-position features from duration prediction) → DNN (input, hidden, and output layers) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  - Can model high-dimensional, highly correlated features efficiently
  - Layered architecture with non-linear operations → feature extraction integrated into acoustic modeling
• Distributed representation
  - Can be exponentially more efficient than a fragmented representation
  - Better representation ability with fewer parameters
• Layered hierarchical structure in speech production
  - concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, and computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

(w/o grouping questions & numeric contexts, silence frames removed)

[Figure: 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α=1), and DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)   | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256)       | 45.7    | < 10⁻⁶  | -9.9
16.1 (4)  | 27.2 (4 × 512)       | 56.8    | < 10⁻⁶  | -5.1
12.7 (1)  | 36.6 (4 × 1024)      | 50.7    | < 10⁻⁶  | -11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

• Background: HMM-based statistical parametric speech synthesis (SPSS); flexibility; improvements
• Statistical parametric speech synthesis with neural networks: deep neural network (DNN)-based SPSS; deep mixture density network (DMDN)-based SPSS; recurrent neural network (RNN)-based SPSS
• Summary

Limitations of DNN-based acoustic modeling

[Figure: bimodal data samples in the (y1, y2) plane; the NN prediction falls between the two modes]

• Unimodality
  - Humans can speak in different ways → one-to-many mapping
  - An NN trained with an MSE loss → approximates the conditional mean
• Lack of variance
  - DNN-based SPSS uses variances computed from all training data
  - The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
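The unimodality point can be checked numerically: for a one-to-many mapping, the constant prediction that minimises squared error is the conditional mean, which sits between the target modes and matches neither. A minimal sketch:

```python
import numpy as np

# One-to-many data: the same input can produce either of two target modes.
targets = np.array([-1.0, 1.0])

# MSE of every candidate constant prediction against both modes.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((candidates[:, None] - targets[None, :]) ** 2).mean(axis=1)

# The minimiser is the conditional mean (~0): between the modes,
# matching neither of them.
best = candidates[np.argmin(mse)]
```

This is exactly the failure the mixture density output layer is meant to fix: predict a distribution with two modes instead of a single mean.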


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; the output layer emits w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x)]

Inputs of the activation functions: zj = Σ_{i=1}^{4} hi wij

• Weights → softmax activation function:
  w1(x) = exp(z1) / Σ_{m=1}^{2} exp(zm),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(zm)
• Means → linear activation function:
  µ1(x) = z3,  µ2(x) = z4
• Variances → exponential activation function:
  σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
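The output-layer activations above translate directly into code. This sketch covers the 1-dim, 2-mix case from the slide; the function names and the negative log-likelihood loss formulation are standard for MDNs but illustrative here:

```python
import numpy as np

def mdn_output(z):
    """Map the 6 linear outputs z of a 1-dim, 2-mix MDN to GMM parameters,
    using the activations on the slide (softmax / linear / exponential)."""
    w = np.exp(z[0:2]) / np.exp(z[0:2]).sum()  # weights w1, w2 (softmax)
    mu = z[2:4]                                # means mu1, mu2 (linear)
    sigma = np.exp(z[4:6])                     # std devs sigma1, sigma2 (exp)
    return w, mu, sigma

def mdn_nll(z, y):
    """Negative log-likelihood of target y under the predicted mixture,
    the loss that replaces MSE when training an MDN."""
    w, mu, sigma = mdn_output(z)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(comp.sum())

w, mu, sigma = mdn_output(np.zeros(6))         # equal weights, unit variances
```

The softmax guarantees the weights sum to one and the exponential keeps variances positive, so any raw network output z yields a valid GMM.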

DMDN-based SPSS [27]

[Figure: DMDN-based SPSS. TEXT → text analysis → input feature extraction → duration prediction → DMDN predicts mixture weights w, means µ, and variances σ for each frame input x1 … xT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Model         | Configuration | MOS
HMM           | 1 mix         | 3.537 ± 0.113
HMM           | 2 mix         | 3.397 ± 0.115
DNN           | 4×1024        | 3.635 ± 0.127
DNN           | 5×1024        | 3.681 ± 0.109
DNN           | 6×1024        | 3.652 ± 0.108
DNN           | 7×1024        | 3.637 ± 0.129
DMDN (4×1024) | 1 mix         | 3.654 ± 0.117
DMDN (4×1024) | 2 mix         | 3.796 ± 0.107
DMDN (4×1024) | 4 mix         | 3.766 ± 0.113
DMDN (4×1024) | 8 mix         | 3.805 ± 0.113
DMDN (4×1024) | 16 mix        | 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

• Background: HMM-based statistical parametric speech synthesis (SPSS); flexibility; improvements
• Statistical parametric speech synthesis with neural networks: deep neural network (DNN)-based SPSS; deep mixture density network (DMDN)-based SPSS; recurrent neural network (RNN)-based SPSS
• Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features
  - A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  - Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  - Each frame is mapped independently
  - Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
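The "smoothing using dynamic feature constraints" mentioned above is the parameter generation algorithm of [9]: given per-frame means and variances for static and delta features, solve a weighted least-squares problem for a smooth static trajectory. A minimal sketch, with a simplified single central-difference delta window and an illustrative function name:

```python
import numpy as np

def mlpg(mu, var):
    """Generate a smooth static trajectory from per-frame means/variances of
    static + delta features: solve (W' S W) c = W' S m, with S the diagonal
    inverse-variance matrix and m the stacked means.

    mu, var: (T, 2) arrays of [static, delta] means and variances per frame.
    """
    T = mu.shape[0]
    I = np.eye(T)
    D = np.zeros((T, T))                       # central-difference delta window
    for t in range(1, T - 1):
        D[t, t - 1], D[t, t + 1] = -0.5, 0.5
    W = np.vstack([I, D])                      # (2T, T): statics then deltas
    S = np.diag(1.0 / var.T.reshape(-1))       # inverse variances, same order
    m = mu.T.reshape(-1)
    return np.linalg.solve(W.T @ S @ W, W.T @ S @ m)

# Static means all 1, delta means 0 -> the constant trajectory is recovered
c = mlpg(np.column_stack([np.ones(10), np.zeros(10)]), np.ones((10, 2)))
```

Because each frame's statics and deltas are coupled through W, the solved trajectory is smooth even when the per-frame predictions are made independently, which is exactly why this step remains essential for DNN/DMDN systems.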


Basic RNN

[Figure: RNN with recurrent connections, unrolled over time: inputs xt-1, xt, xt+1 → outputs yt-1, yt, yt+1]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  - Information in the hidden layers loops through the recurrent connections → quickly decays over time
  - Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block. The memory cell ct is fed through tanh squashing and guarded by three sigmoid gates, each driven by xt and ht-1 with biases bi, bf, bo, bc: the input gate (write), forget gate (reset), and output gate (read), producing the block output ht]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
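The gate structure in the figure maps onto a short computation. This is a standard LSTM step (without peephole connections) under assumed parameter names W*, R*, b* for the input, recurrent, and bias terms:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step: a linear memory cell c guarded by multiplicative
    input (write), forget (reset), and output (read) gates, each fed by
    the current input x and the previous output h_prev."""
    i = sigm(p["Wi"] @ x + p["Ri"] @ h_prev + p["bi"])     # input gate
    f = sigm(p["Wf"] @ x + p["Rf"] @ h_prev + p["bf"])     # forget gate
    o = sigm(p["Wo"] @ x + p["Ro"] @ h_prev + p["bo"])     # output gate
    g = np.tanh(p["Wc"] @ x + p["Rc"] @ h_prev + p["bc"])  # cell input
    c = f * c_prev + i * g                                 # linear cell update
    h = o * np.tanh(c)                                     # gated output
    return h, c

# All-zero weights: every gate outputs 0.5, so c halves and h = 0.5*tanh(c)
p = {k: np.zeros((2, 3)) if k[0] == "W" else
        np.zeros((2, 2)) if k[0] == "R" else np.zeros(2)
     for k in ["Wi", "Ri", "bi", "Wf", "Rf", "bf",
               "Wo", "Ro", "bo", "Wc", "Rc", "bc"]}
h, c = lstm_step(np.ones(3), np.zeros(2), np.ones(2), p)
```

Because the cell update c = f·c_prev + i·g is linear in c_prev, information can persist over many steps when the forget gate stays near 1, which is what gives the LSTM its longer memory than the basic RNN.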

LSTM-based SPSS [33 34]

[Figure: LSTM-based SPSS. TEXT → text analysis → input feature extraction → duration prediction → LSTM maps inputs x1, x2, …, xT to outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z     | p
50.0     | 14.2      | –         | –          | 35.8    | 12.0  | < 10⁻¹⁰
–        | –         | 30.2      | 15.6       | 54.2    | 5.1   | < 10⁻⁶
15.8     | –         | 34.0      | –          | 50.2    | -6.2  | < 10⁻⁹
28.4     | –         | –         | 33.6       | 38.0    | -1.5  | 0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

• Background: HMM-based statistical parametric speech synthesis (SPSS); flexibility; improvements
• Statistical parametric speech synthesis with neural networks: deep neural network (DNN)-based SPSS; deep mixture density network (DMDN)-based SPSS; recurrent neural network (RNN)-based SPSS
• Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  - Flexible (e.g., adaptation, interpolation)
  - Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  - Learns the mapping from linguistic features to acoustic ones
  - Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79


[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 49: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Better vocoding

• Mixed excitation linear prediction (MELP)

• STRAIGHT

• Multi-band excitation

• Harmonic + noise model (HNM)

• Harmonic/stochastic model

• LF model

• Glottal waveform

• Residual codebook

• ML excitation

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 41 of 79

Limitations of HMMs for acoustic modeling

• Piece-wise constant statistics
  Statistics do not vary within an HMM state

• Conditional independence assumption
  State output probability depends only on the current state

• Weak duration modeling
  State duration probability decreases exponentially with time

None of these assumptions holds for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79
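
The weak-duration point can be made concrete. With self-transition probability a, an HMM state's duration is geometric, P(d) = a^(d−1)·(1 − a), so the most likely duration is always a single frame. A toy sketch (a = 0.9 is an arbitrary illustrative value):

```python
# Geometric state-duration distribution implied by an HMM
# self-transition probability a: P(d) = a**(d-1) * (1 - a).
def hmm_duration_pmf(a, max_d):
    return [a ** (d - 1) * (1 - a) for d in range(1, max_d + 1)]

pmf = hmm_duration_pmf(0.9, 50)
# The mode is always d = 1 and the probability decays monotonically,
# unlike real phone durations, which peak at some typical length.
assert all(pmf[i] > pmf[i + 1] for i in range(len(pmf) - 1))
```

This is why the hidden semi-Markov model on the next slide replaces the implicit geometric duration with an explicit duration distribution.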

Better acoustic modeling

• Piece-wise constant statistics → dynamical model
  − Trended HMM
  − Polynomial segment model
  − Trajectory HMM

• Conditional independence assumption → graphical model
  − Buried Markov model
  − Autoregressive HMM
  − Trajectory HMM

• Weak duration modeling → explicit duration model
  − Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: spectra of natural vs. generated speech (0–8 kHz); the generated spectrum is oversmoothed]

• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better acoustic model relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
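
The generation step the first bullet refers to (the speech parameter generation algorithm [9]) picks the static trajectory that best fits the predicted static and dynamic statistics. Below is a toy one-dimensional sketch under simplifying assumptions (a single static stream, the delta window 0.5·(c[t+1] − c[t−1]) with clamped boundaries, diagonal variances); it is not the full multi-stream algorithm:

```python
import numpy as np

def mlpg(mu, var, T):
    """Toy parameter generation for one static dimension.

    mu, var: per-frame means/variances of [static, delta] features,
    shape (T, 2).  Solves (W' S^-1 W) c = W' S^-1 mu for the static
    trajectory c, where W stacks identity and a delta window.
    """
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[t, t] = 1.0                        # static row
        W[T + t, min(t + 1, T - 1)] += 0.5   # delta: 0.5*(c[t+1]-c[t-1])
        W[T + t, max(t - 1, 0)] -= 0.5
    m = np.concatenate([mu[:, 0], mu[:, 1]])
    p = np.concatenate([1.0 / var[:, 0], 1.0 / var[:, 1]])  # precisions
    A = W.T @ (p[:, None] * W)
    b = W.T @ (p * m)
    return np.linalg.solve(A, b)
```

Fed step-like static means with zero delta means, the solver returns a trajectory pulled toward zero slope, which is exactly the smoothing (and, taken too far, the oversmoothing) the slide describes.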

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics
    (adaptation; interpolation, eigenvoice, CAT, multiple regression)
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training: learn the relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g. phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
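
As a rough illustration of such inputs, categorical contexts can be encoded as binary (one-hot) features and positional contexts as numeric values. The phone and POS inventories below are made-up placeholders, not the deck's actual feature set:

```python
# Hypothetical sketch of turning linguistic context into a model input
# vector: categorical contexts become one-hot (binary) features and
# positional contexts stay numeric.
PHONES = ["sil", "a", "b", "c"]        # toy phone set (assumption)
POS_TAGS = ["NOUN", "VERB", "OTHER"]   # toy POS set (assumption)

def encode_context(phone, pos, stressed, pos_in_syllable):
    onehot_phone = [1.0 if p == phone else 0.0 for p in PHONES]
    onehot_pos = [1.0 if t == pos else 0.0 for t in POS_TAGS]
    return onehot_phone + onehot_pos + [float(stressed), pos_in_syllable]

x = encode_context("a", "NOUN", True, 0.25)
assert len(x) == len(PHONES) + len(POS_TAGS) + 2
```

With ~50 context types the real input vector is far wider, which is why effective modeling of this sparse, high-dimensional input matters.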


HMM-based acoustic modeling for SPSS [4]

[Figure: binary decision tree of yes/no context questions clustering the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network mapping linguistic features x through hidden layers h1, h2, h3 to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
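
A minimal sketch of this frame-wise mapping, assuming sigmoid hidden layers and a linear output as in the experimental setup later in the deck; the layer sizes and random weights here are arbitrary stand-ins, not a trained model:

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Map a linguistic feature vector x through sigmoid hidden
    layers to a linear acoustic output, as on the slide."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))  # sigmoid hidden layer
    return weights[-1] @ h + biases[-1]          # linear output layer

rng = np.random.default_rng(0)
sizes = [9, 16, 16, 16, 4]  # toy layer sizes (assumption)
Ws = [rng.standard_normal((o, i)) * 0.1 for i, o in zip(sizes, sizes[1:])]
bs = [np.zeros(o) for o in sizes[1:]]
y = dnn_forward(rng.standard_normal(9), Ws, bs)
assert y.shape == (4,)
```

In the actual systems the output is the per-frame statistics of the speech parameter vector, and the same network is applied at every frame.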

Framework

[Figure: DNN-based TTS framework: TEXT → text analysis → input feature extraction (binary & numeric linguistic features, duration feature, frame-position feature) → duration prediction → network (input, hidden, output layers) applied to the input features at each frame 1…T → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker

Training / test data: 33,000 & 173 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: 11 categorical features, 25 numeric features

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²

HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]

DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions & numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum trajectory over frames 0–500: natural speech vs. HMM (α = 1) vs. DNN (4 × 512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    -9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    -5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    -11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: one-to-many data samples in the (y1, y2) plane; the NN prediction falls between the modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
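
The unimodality bullet can be checked with a two-line computation: if the same input is paired with targets drawn from two modes, the MSE-optimal prediction is their mean, which lies in neither mode:

```python
# Least-squares demo of the unimodality limitation: for one input,
# targets come from two modes (-1 and +1); the MSE-optimal constant
# prediction is their mean, 0, which matches neither mode.
targets = [-1.0, -1.0, 1.0, 1.0]
prediction = sum(targets) / len(targets)   # minimizes mean squared error
assert prediction == 0.0
mse_mean = sum((t - prediction) ** 2 for t in targets) / len(targets)
mse_mode = sum((t - 1.0) ** 2 for t in targets) / len(targets)
assert mse_mean < mse_mode   # lower error, yet an implausible output
```

A mixture density output layer avoids this by predicting the full conditional distribution instead of a single point.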


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN: four hidden units feed output units z1…z6, which parameterize w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x)]

Inputs of the activation functions: zj = Σᵢ hᵢ wij (sum over the 4 hidden units)

• Weights → softmax activation function:
  w1(x) = exp(z1) / (exp(z1) + exp(z2)),  w2(x) = exp(z2) / (exp(z1) + exp(z2))

• Means → linear activation function:
  µ1(x) = z3,  µ2(x) = z4

• Variances → exponential activation function:
  σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
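
The three activation choices above can be written out directly. This sketch just evaluates the slide's 1-dim, 2-mix formulas for a given output-layer activation vector z:

```python
import math

def mdn_params(z):
    """Map the six output-layer activations z1..z6 of the slide's
    1-dim, 2-mix MDN to mixture weights, means and std devs."""
    z1, z2, z3, z4, z5, z6 = z
    denom = math.exp(z1) + math.exp(z2)
    w = (math.exp(z1) / denom, math.exp(z2) / denom)  # softmax -> sum to 1
    mu = (z3, z4)                                     # linear
    sigma = (math.exp(z5), math.exp(z6))              # exponential -> > 0
    return w, mu, sigma

w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 2.0, 0.0, 0.5])
assert abs(w[0] + w[1] - 1.0) < 1e-12 and sigma[0] > 0
```

The softmax and exponential guarantee valid mixture weights and variances by construction, so the network output is always a well-formed GMM over y.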

DMDN-based SPSS [27]

[Figure: DMDN-based TTS: TEXT → text analysis → input feature extraction → duration prediction → DMDN maps each linguistic feature vector x1, x2, …, xT to frame-level mixture parameters w(x), µ(x), σ(x) over the acoustic features y → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)

DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix

Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

MOS:

HMM              1 mix     3.537 ± 0.113
                 2 mix     3.397 ± 0.115
DNN              4 × 1024  3.635 ± 0.127
                 5 × 1024  3.681 ± 0.109
                 6 × 1024  3.652 ± 0.108
                 7 × 1024  3.637 ± 0.129
DMDN (4 × 1024)  1 mix     3.654 ± 0.117
                 2 mix     3.796 ± 0.107
                 4 mix     3.766 ± 0.113
                 8 mix     3.805 ± 0.113
                 16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g. ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled over time; each input xt feeds output yt through a hidden layer with recurrent connections from step t−1]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
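
A minimal forward pass of such a unidirectional RNN (toy dimensions and random weights, purely illustrative) makes the "only previous contexts" point explicit: the hidden state at time t is a function of x1…xt only:

```python
import numpy as np

def rnn_forward(xs, Wxh, Whh, Why):
    """Unrolled RNN as on the slide: the hidden state at time t sees
    x_t and, through the recurrent weights Whh, the state at t-1."""
    h = np.zeros(Whh.shape[0])
    ys = []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h)  # recurrent update
        ys.append(Why @ h)
    return np.stack(ys)

rng = np.random.default_rng(1)
T, din, dh, dout = 5, 3, 8, 2
ys = rnn_forward(rng.standard_normal((T, din)),
                 rng.standard_normal((dh, din)) * 0.1,
                 rng.standard_normal((dh, dh)) * 0.1,
                 rng.standard_normal((dout, dh)) * 0.1)
assert ys.shape == (T, dout)
```

A bidirectional RNN runs a second such pass over the reversed sequence, giving each output access to future contexts as well.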

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block: a linear memory cell ct is written via a tanh-squashed input of xt and h(t−1) scaled by the sigmoid input gate (write), reset by the sigmoid forget gate, and read out as ht through tanh scaled by the sigmoid output gate; each gate is a sigmoid of xt, h(t−1) and a bias]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
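
One step of such a cell can be sketched as follows; the stacked-weight layout (a single matrix holding all four gate transforms) is an assumed convention, not taken from the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a vanilla LSTM cell: sigmoid input/forget/output
    gates around a tanh-squashed linear memory cell.
    W has shape (4*dh, din+dh); b has shape (4*dh,)."""
    dh = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0 * dh:1 * dh])   # input gate (write)
    f = sigmoid(z[1 * dh:2 * dh])   # forget gate (reset)
    o = sigmoid(z[2 * dh:3 * dh])   # output gate (read)
    g = np.tanh(z[3 * dh:4 * dh])   # candidate cell input
    c = f * c_prev + i * g          # linear memory-cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(2)
din, dh = 4, 6
h, c = lstm_step(rng.standard_normal(din), np.zeros(dh), np.zeros(dh),
                 rng.standard_normal((4 * dh, din + dh)) * 0.1,
                 np.zeros(4 * dh))
assert h.shape == (dh,) and c.shape == (dh,)
```

Because the cell update c = f·c_prev + i·g is linear in c_prev, information can persist across many steps when the forget gate stays near 1, which is what gives the LSTM its longer memory.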

LSTM-based SPSS [33, 34]

[Figure: LSTM-based TTS: TEXT → text analysis → input feature extraction → duration prediction → LSTM maps linguistic feature vectors x1, x2, …, xT to acoustic feature vectors y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker

Training / dev set: 34,632 & 100 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: DNN: 449, LSTM: 289

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)

DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU

LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]

Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

DNN w/∆   DNN w/o ∆   LSTM w/∆   LSTM w/o ∆   Neutral     z        p
50.0      14.2        –          –            35.8      12.0    < 10⁻¹⁰
–         –           30.2       15.6         54.2       5.1    < 10⁻⁶
15.8      –           34.0       –            50.2      -6.2    < 10⁻⁹
28.4      –           –          33.6         38.0      -1.5    0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Audio Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 50: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Limitations of HMMs for acoustic modeling

bull Piece-wise constatnt statisticsStatistics do not vary within an HMM state

bull Conditional independence assumptionState output probability depends only on the current state

bull Weak duration modelingState duration probability decreases exponentially with time

None of them hold for real speech

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 42 of 79

Better acoustic modeling

bull Piece-wise constatnt statistics rarr Dynamical model

minus Trended HMMminus Polynomial segment modelminus Trajectory HMM

bull Conditional independence assumption rarr Graphical model

minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM

bull Weak duration modeling rarr Explicit duration model

minus Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

bull Speech parameter generation algorithm

minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled

Nat

ural

0 4 8Frequency (kHz)

Gen

erat

ed

0 4 8Frequency (kHz)

bull Why

minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

bull Postfiltering

minus Mel-cepstrumminus LSP

bull Nonparametric approach

minus Conditional parameter generationminus Discrete HMM-based speech synthesis

bull Combine multiple-level statistics

minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages

− Flexibility to change voice characteristics
  (adaptation, interpolation, eigenvoice, CAT, multiple regression)
− Small footprint
− Robustness

• Drawback

− Quality

• Major factors for quality degradation [3]

− Vocoder (speech analysis & synthesis)
− Acoustic model (HMM) → neural networks
− Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training: learn the relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS

− Phoneme-, syllable-, word-, phrase-, utterance-level features
− e.g., phone identity, POS, stress of words in a phrase
− Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79
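As a concrete illustration, a frame-level input vector might be assembled as below. The inventories (`PHONES`, `POS_TAGS`) and the particular numeric features are hypothetical stand-ins for the ~50 feature types mentioned above:

```python
import numpy as np

# Hypothetical inventories -- real systems use far larger ones.
PHONES = ["sil", "a", "b", "k", "s"]
POS_TAGS = ["noun", "verb", "other"]

def linguistic_features(phone, pos, stressed, frame_pos, dur_frames):
    """Build one frame-level input vector: one-hot (binary) categorical
    features concatenated with numeric features, as in DNN-based SPSS."""
    phone_1hot = np.eye(len(PHONES))[PHONES.index(phone)]
    pos_1hot = np.eye(len(POS_TAGS))[POS_TAGS.index(pos)]
    numeric = np.array([float(stressed),
                        frame_pos / dur_frames,   # position within phone
                        dur_frames / 100.0])      # coarse duration scale
    return np.concatenate([phone_1hot, pos_1hot, numeric])

x = linguistic_features("a", "noun", True, frame_pos=3, dur_frames=12)
print(x.shape)  # (11,)
```

In practice the categorical part dominates the dimensionality, which is one reason effective modeling of these sparse, high-dimensional inputs matters.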


HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree (yes/no context questions) clustering the acoustic space]

bull Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward DNN mapping linguistic features x through hidden layers h1–h3 to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
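A minimal sketch of the forward pass described above — sigmoid hidden layers and a linear output layer emitting the per-frame acoustic mean. Layer sizes are made up and the weights are random (no training shown):

```python
import numpy as np

rng = np.random.default_rng(0)

def init(sizes):
    """Random weights for a feed-forward net (sizes = layer widths)."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Sigmoid hidden layers, linear output layer.  The output is the
    per-frame mean of the acoustic feature vector, i.e. the statistics
    passed on to the parameter generation step."""
    h = x
    for W, b in params[:-1]:
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))   # sigmoid hidden units
    W, b = params[-1]
    return h @ W + b                             # linear output

params = init([11, 64, 64, 40])   # 11 linguistic dims -> 40 acoustic dims
x = rng.standard_normal(11)
y = forward(params, x)
print(y.shape)  # (40,)
```

Training minimizes the mean squared error between predicted and observed acoustic features over all frames.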

Framework

[Figure: DNN-based SPSS framework — TEXT → text analysis → input feature extraction (binary & numeric linguistic features at each frame, plus duration and frame-position features from duration prediction) → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
  → integrates feature extraction into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithm

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

(w/o grouping questions & numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum trajectories over 500 frames — natural speech vs. HMM (α=1) vs. DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value    z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶     −9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶     −5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶     −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: one-to-many data samples in (y1, y2) space vs. the NN prediction, which falls between the modes]

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− NN trained by MSE loss → approximates the conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
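The conditional-mean behaviour can be demonstrated without any network: the least-squares fit — the MSE minimizer, which an MSE-trained NN approximates — on one-to-many data lands between the modes rather than on either of them. A toy numpy demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)

# One-to-many data: for the same input x, targets cluster near -1 or +1
x = np.ones((200, 1))
y = np.where(rng.random(200) < 0.5, -1.0, 1.0) + 0.05 * rng.standard_normal(200)

# Least-squares solution = MSE minimizer (stands in for an MSE-trained NN)
w, *_ = np.linalg.lstsq(x, y, rcond=None)
pred = float(w[0])
print(round(pred, 2))  # near 0.0: the conditional mean, not either mode
```

The prediction sits around 0, a value the data almost never takes — exactly the unimodality problem the mixture density output layer is meant to fix.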


Mixture density network [26]

[Figure: 1-dim, 2-mix mixture density network — the output layer emits the GMM weights w1(x), w2(x), means µ1(x), µ2(x), and variances σ1(x), σ2(x)]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

Inputs of the activation functions: zj = Σᵢ₌₁⁴ hᵢ wᵢⱼ

w1(x) = exp(z1) / Σₘ₌₁² exp(zₘ)    w2(x) = exp(z2) / Σₘ₌₁² exp(zₘ)
µ1(x) = z3                          µ2(x) = z4
σ1(x) = exp(z5)                     σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
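The activation-function mapping above can be written out directly. A sketch of the 1-dim, 2-mix MDN output layer and its negative log-likelihood loss (the z values fed in are made-up activations):

```python
import numpy as np

def mdn_params(z):
    """Map the 6 output-layer activations of a 1-dim, 2-mix MDN to
    GMM parameters: softmax -> weights, linear -> means, exp -> stddevs."""
    z = np.asarray(z, dtype=float)
    w = np.exp(z[:2] - z[:2].max())
    w = w / w.sum()                 # softmax: w1(x), w2(x)
    mu = z[2:4]                     # linear:  mu1(x), mu2(x)
    sigma = np.exp(z[4:6])          # exp:     sigma1(x), sigma2(x)
    return w, mu, sigma

def mdn_nll(z, y):
    """Negative log-likelihood of target y under the predicted GMM --
    the training loss that replaces MSE."""
    w, mu, sigma = mdn_params(z)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())

w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
print(w, mu, sigma)  # weights [0.5 0.5], means [-1, 1], stddevs [1, 1]
```

Because the loss is a proper likelihood, the network can place probability mass on both modes of a one-to-many mapping and predict per-frame variances instead of using global ones.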

DMDN-based SPSS [27]

[Figure: DMDN-based SPSS — TEXT → text analysis → input feature extraction → duration prediction; for each frame input x1, x2, …, xT the DMDN outputs GMM weights w1, w2, means µ1, µ2, and variances σ1, σ2 over y → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM            1 mix     3.537 ± 0.113
               2 mix     3.397 ± 0.115
DNN            4×1024    3.635 ± 0.127
               5×1024    3.681 ± 0.109
               6×1024    3.652 ± 0.108
               7×1024    3.637 ± 0.129
DMDN (4×1024)  1 mix     3.654 ± 0.117
               2 mix     3.796 ± 0.107
               4 mix     3.766 ± 0.113
               8 mix     3.805 ± 0.113
               16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features

− Fixed number of preceding / succeeding contexts
  (e.g., ±2 phonemes, syllable stress) are used as inputs
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: basic RNN — input x, output y, recurrent connections; unrolled over time steps t−1, t, t+1]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in hidden layers loops through recurrent connections
  → quickly decays over time
− Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — a linear memory cell cₜ is written via the input gate, reset via the forget gate, and read via the output gate; each gate is a sigmoid of (xₜ, hₜ₋₁) plus a bias, with tanh nonlinearities on the cell input and output]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
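The gating described on this slide can be sketched as a single time step. This is a minimal version (no peephole connections; it omits the projection layer used in the experiments later), with made-up sizes and random weights:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step: gates computed from [x, h_prev].  The linear
    memory cell c is written (input gate), reset (forget gate), and
    read (output gate).  W: (4H, D+H), b: (4H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])            # input gate  (write)
    f = sigmoid(z[H:2*H])          # forget gate (reset)
    o = sigmoid(z[2*H:3*H])        # output gate (read)
    g = np.tanh(z[3*H:4*H])        # candidate cell input
    c = f * c_prev + i * g         # linear memory cell update
    h = o * np.tanh(c)             # block output
    return h, c

rng = np.random.default_rng(2)
D, H = 3, 4                        # toy input / hidden sizes
W = rng.standard_normal((4 * H, D + H)) * 0.1
b = np.zeros(4 * H)
h = np.zeros(H); c = np.zeros(H)
for x in rng.standard_normal((5, D)):   # run 5 frames
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Because the cell update is additive (f·c + i·g) rather than a repeated squashing, gradients and information can persist over many frames — the long-range context plain RNNs lose.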

LSTM-based SPSS [33 34]

[Figure: LSTM-based SPSS — TEXT → text analysis → input feature extraction → duration prediction → frame inputs x1, x2, …, xT → LSTM → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

        DNN                LSTM               Stats
w/ ∆    w/o ∆    w/ ∆    w/o ∆    Neutral    z       p
50.0    14.2     –       –        35.8       12.0    < 10⁻¹⁰
–       –        30.2    15.6     54.2       5.1     < 10⁻⁶
15.8    –        34.0    –        50.2       −6.2    < 10⁻⁹
28.4    –        –       33.6     38.0       −1.5    0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
− HMM-based statistical parametric speech synthesis (SPSS)
− Flexibility
− Improvements

Statistical parametric speech synthesis with neural networks
− Deep neural network (DNN)-based SPSS
− Deep mixture density network (DMDN)-based SPSS
− Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g., adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learns a mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

Page 51: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Better acoustic modeling

bull Piece-wise constatnt statistics rarr Dynamical model

minus Trended HMMminus Polynomial segment modelminus Trajectory HMM

bull Conditional independence assumption rarr Graphical model

minus Buried Markov modelminus Autoregressive HMMminus Trajectory HMM

bull Weak duration modeling rarr Explicit duration model

minus Hidden semi-Markov model

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 43 of 79

Oversmoothing

bull Speech parameter generation algorithm

minus Dynamic feature constraints make generated parameters smoothminus Often too smooth rarr sounds muffled

Nat

ural

0 4 8Frequency (kHz)

Gen

erat

ed

0 4 8Frequency (kHz)

bull Why

minus Details of spectral (formant) structure disappearminus Use of better AM relaxes the issue but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79

Oversmoothing compensation

bull Postfiltering

minus Mel-cepstrumminus LSP

bull Nonparametric approach

minus Conditional parameter generationminus Discrete HMM-based speech synthesis

bull Combine multiple-level statistics

minus Global variance (intra-utterance variance)minus Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

bull Advantages

minus Flexibility to change voice characteristics

Adaptation Interpolation eigenvoice CAT multiple regression

minus Small footprintminus Robustness

bull Drawback

minus Quality

bull Major factors for quality degradation [3]

minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

HMM-based acoustic modeling for SPSS [4]

yes noyes no

yes no

yes no yes no

Acoustic space

bull Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

Acoustic

features y

Linguistic

features x

h1

h2

h3

bull DNN represents conditional distribution of y given x

bull DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

Input layer Output layerHidden layers

TEXT

SPEECH Parametergeneration

Waveformsynthesis

Inpu

t fea

ture

s in

clud

ing

bina

ry amp

num

eric

feat

ures

at f

ram

e 1

Inpu

t fea

ture

s in

clud

ing

bina

ry amp

num

eric

fe

atur

es a

t fra

me T

Textanalysis

Input featureextraction

Statistics (m

ean amp var) of speech param

eter vector sequence

Binaryfeatures

NumericfeaturesDuration

featureFrame position

feature

Spectralfeatures

ExcitationfeaturesVUVfeature

Durationprediction

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Framework

Is this new no

bull NN [19]

bull RNN [20]

Whatrsquos the difference

bull More layers data computational resources

bull Better learning algorithm

bull Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79

Framework

Is this new no

bull NN [19]

bull RNN [20]

Whatrsquos the difference

bull More layers data computational resources

bull Better learning algorithm

bull Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79

Experimental setup

Database US English female speaker

Training test data 33000 amp 173 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic 11 categorical featuresfeatures 25 numeric features

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2

HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]

DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

wo grouping questions numeric contexts silence frames removed

-1

0

1

0 100 200 300 400 500

5-t

h M

el-ce

pstr

um

Frame

Natural speechHMM (α=1)DNN (4x512)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones withsimilar of parameters

bull Paired comparison test

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

HMM DNN(α) (layers times units) Neutral p value z value

158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN output layer producing w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x)]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation function: z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                                μ2(x) = z4
σ1(x) = exp(z5)                           σ2(x) = exp(z6)

(1-dim, 2-mix MDN)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
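A minimal numpy sketch of these activation choices (illustrative, not the talk's implementation): the same vector of linear outputs z is split into weights, means, and standard deviations.

```python
import numpy as np

def mdn_params(z):
    """Map the 6 linear outputs z of a 1-dim, 2-mix MDN head to GMM parameters:
    z[0:2] -> weights (softmax), z[2:4] -> means (linear), z[4:6] -> sigmas (exp)."""
    logits = z[:2] - z[:2].max()                # numerically stable softmax
    w = np.exp(logits) / np.exp(logits).sum()
    mu = z[2:4]
    sigma = np.exp(z[4:6])                      # exp keeps deviations positive
    return w, mu, sigma

w, mu, sigma = mdn_params(np.array([0.5, -0.5, -1.0, 1.0, 0.0, -2.0]))
print(w, mu, sigma)   # weights sum to 1; sigmas are strictly positive
```

The softmax guarantees valid mixture weights and the exponential guarantees positive variances, so every forward pass yields a well-formed GMM.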

DMDN-based SPSS [27]

[Figure: DMDN-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → DMDN outputs w1, w2, μ1, μ2, σ1, σ2 for each frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM            1 mix     3.537 ± 0.113
               2 mix     3.397 ± 0.115
DNN            4×1024    3.635 ± 0.127
               5×1024    3.681 ± 0.109
               6×1024    3.652 ± 0.108
               7×1024    3.637 ± 0.129
DMDN (4×1024)  1 mix     3.654 ± 0.117
               2 mix     3.796 ± 0.107
               4 mix     3.766 ± 0.113
               8 mix     3.805 ± 0.113
               16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79
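To make the "fixed time span" concrete, a hypothetical sketch (the helper name is mine, not from the talk) of building frame-level inputs by stacking a fixed ±n context window:

```python
import numpy as np

def stack_context(frames, n=2):
    """Concatenate each frame with its n preceding and n succeeding frames
    (edges padded by repetition) -> one fixed-span input vector per frame."""
    padded = np.pad(frames, ((n, n), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(2 * n + 1)])

x = np.arange(12, dtype=float).reshape(6, 2)   # 6 frames, 2 features each
stacked = stack_context(x)
print(stacked.shape)   # (6, 10): 5 frames x 2 features per input vector
```

However wide n is made, any dependency beyond the window is invisible to the network; recurrent connections remove that hard limit.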


Basic RNN

[Figure: feed-forward network (input x → output y) vs. an unrolled RNN with recurrent connections: x_{t−1}, x_t, x_{t+1} → y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell c_t with tanh input/output squashing and sigmoid gates over x_t and h_{t−1} (biases b_i, b_c, b_f, b_o); input gate: write, output gate: read, forget gate: reset]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
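The gating described above can be written out directly. A self-contained sketch of one LSTM step (no peephole connections; the stacked weight layout is my assumption):

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM forward step. W maps [x; h_prev] to the stacked
    pre-activations [input gate; forget gate; output gate; candidate]."""
    n = len(c_prev)
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o = sigm(z[:n]), sigm(z[n:2 * n]), sigm(z[2 * n:3 * n])
    g = np.tanh(z[3 * n:])          # candidate content
    c = f * c_prev + i * g          # forget gate resets, input gate writes
    h = o * np.tanh(c)              # output gate reads the cell
    return h, c

rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = rng.normal(0.0, 0.1, (4 * n_hid, n_in + n_hid))
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid),
                 W, np.zeros(4 * n_hid))
print(h.shape, c.shape)   # (4,) (4,)
```

Because the cell update is additive (f·c + i·g), gradients and stored information survive over many more steps than in the basic RNN's squashed recurrence.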

LSTM-based SPSS [33 34]

[Figure: LSTM-based SPSS pipeline — TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → recurrent (LSTM) layers → outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

DNN w/ Δ   DNN w/o Δ   LSTM w/ Δ   LSTM w/o Δ   Neutral   z      p
50.0%      14.2%       –           –            35.8%     12.0   < 10⁻¹⁰
–          –           30.2%       15.6%        54.2%     5.1    < 10⁻⁶
15.8%      –           34.0%       –            50.2%     −6.2   < 10⁻⁹
28.4%      –           –           33.6%        38.0%     −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Oversmoothing

• Speech parameter generation algorithm
  − Dynamic feature constraints make generated parameters smooth
  − Often too smooth → sounds muffled

[Figure: spectrograms (0–8 kHz) of natural vs. generated speech; the generated spectrum loses fine formant structure]

• Why?
  − Details of the spectral (formant) structure disappear
  − Using a better AM relaxes the issue, but not enough

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 44 of 79
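The generation step that produces this smoothness is the maximum-likelihood parameter generation solution c = (WᵀΣ⁻¹W)⁻¹WᵀΣ⁻¹μ. A 1-dimensional sketch (with a simplified central-difference delta operator, not the exact published windows):

```python
import numpy as np

def mlpg(mu, var):
    """ML parameter generation for one static+delta stream.
    mu, var: (T, 2) per-frame means/variances of [static, delta]."""
    T = len(mu)
    I = np.eye(T)
    D = (np.eye(T, k=1) - np.eye(T, k=-1)) / 2.0   # delta ~ (c[t+1]-c[t-1])/2
    W = np.vstack([I, D])                           # (2T, T)
    m = np.concatenate([mu[:, 0], mu[:, 1]])
    P = np.diag(1.0 / np.concatenate([var[:, 0], var[:, 1]]))
    A = W.T @ P @ W                                 # W' Sigma^-1 W
    return np.linalg.solve(A, W.T @ P @ m)          # (W' S^-1 W)^-1 W' S^-1 mu

mu = np.zeros((5, 2)); mu[:, 0] = [0.0, 1.0, 0.0, 1.0, 0.0]   # jumpy static means
var = np.ones((5, 2)); var[:, 1] = 0.01                        # tight delta constraint
c = mlpg(mu, var)
print(np.ptp(c))   # the trajectory oscillates less than the raw means do
```

The tight delta variances pull neighbouring frames together, which is exactly the smoothing that removes frame-to-frame discontinuities yet can also flatten formant detail.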

Oversmoothing compensation

• Postfiltering
  − Mel-cepstrum
  − LSP

• Nonparametric approach
  − Conditional parameter generation
  − Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  − Global variance (intra-utterance variance)
  − Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  − Flexibility to change voice characteristics: adaptation, interpolation, eigenvoice, CAT, multiple regression
  − Small footprint
  − Robustness

• Drawback
  − Quality

• Major factors for quality degradation [3]
  − Vocoder (speech analysis & synthesis)
  − Acoustic model (HMM) → neural networks
  − Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Linguistic rarr acoustic mapping

• Training: learn the relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS
  − Phoneme-, syllable-, word-, phrase-, and utterance-level features
  − e.g., phone identity, POS, stress of words in a phrase
  − Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision tree over context questions (yes/no splits) partitioning the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
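Frame-level mapping in this scheme is just a feed-forward pass. A toy sketch (sigmoid hidden layers as in the first experiment; all sizes and weights here are illustrative assumptions):

```python
import numpy as np

def dnn_forward(x, layers):
    """Sigmoid hidden layers, linear output layer: a linguistic feature
    vector in, acoustic feature statistics out, one frame at a time."""
    h = x
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # sigmoid hidden layer
    W, b = layers[-1]
    return W @ h + b                              # linear output layer

rng = np.random.default_rng(2)
sizes = [10, 16, 16, 4]                           # linguistic dim -> acoustic dim
layers = [(rng.normal(0.0, 0.1, (m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]
y = dnn_forward(rng.normal(size=10), layers)
print(y.shape)   # (4,)
```

A single network shared across all contexts replaces the tree-clustered bank of GMMs: every frame's linguistic features pass through the same weights.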

Framework

[Figure: DNN-based SPSS framework — TEXT → text analysis → input feature extraction (binary & numeric linguistic features, duration feature, frame position feature, for frames 1…T) → input layer → hidden layers → output layer → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame alignment]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations → feature extraction integrated into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? → no

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithm

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database US English female speaker

Training test data 33000 amp 173 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic 11 categorical featuresfeatures 25 numeric features

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2

HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]

DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

wo grouping questions numeric contexts silence frames removed

-1

0

1

0 100 200 300 400 500

5-t

h M

el-ce

pstr

um

Frame

Natural speechHMM (α=1)DNN (4x512)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones withsimilar of parameters

bull Paired comparison test

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

HMM DNN(α) (layers times units) Neutral p value z value

158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)w2(x1)

y

Weights rarr Softmax activation function

Means rarr Linear activation function

Variances rarr Exponential activation function

sumzj =

4

i=1

hiwij

w1(x) =exp(z1)sum2

m=1 exp(zm)

micro1(x) = z3

σ1(x) = exp(z5)

Inputs of activation function

1-dim 2-mix MDN

w2(x) =exp(z2)sum2

m=1 exp(zm)

micro1(x) = z4

σ2(x) = exp(z6)

NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)

w2(x1)

y

x1

micro1(x2 ) micro2(x2 )

σ2(x2 )σ1(x2 )

w1(x2 ) w2(x2 )

y

x2

micro1(xT ) micro2(xT )

σ1(xT )σ2(xT )

w2(xT )

w1(xT )

y

xT

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)

DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)

1ndash16 mix

Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UK Speech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

Oversmoothing compensation

• Postfiltering
  - Mel-cepstrum
  - LSP

• Nonparametric approach
  - Conditional parameter generation
  - Discrete HMM-based speech synthesis

• Combine multiple-level statistics
  - Global variance (intra-utterance variance)
  - Modulation spectrum (intra-utterance frequency components)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 45 of 79

Characteristics of SPSS

• Advantages
  - Flexibility to change voice characteristics: adaptation; interpolation, eigenvoice, CAT, multiple regression
  - Small footprint
  - Robustness

• Drawback
  - Quality

• Major factors for quality degradation [3]
  - Vocoder (speech analysis & synthesis)
  - Acoustic model (HMM) → neural networks
  - Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

• Background
  - HMM-based statistical parametric speech synthesis (SPSS)
  - Flexibility
  - Improvements
• Statistical parametric speech synthesis with neural networks
  - Deep neural network (DNN)-based SPSS
  - Deep mixture density network (DMDN)-based SPSS
  - Recurrent neural network (RNN)-based SPSS
• Summary

Linguistic → acoustic mapping

• Training: learn the relationship between linguistic & acoustic features

• Synthesis: map linguistic features to acoustic ones

• Linguistic features used in SPSS
  - Phoneme-, syllable-, word-, phrase-, and utterance-level features
  - e.g. phone identity, POS, stress of words in a phrase
  - Around 50 different types, much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79


HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees ("yes"/"no" questions) clustering the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents conditional distribution of y given x

• DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
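A minimal numpy sketch of this frame-by-frame mapping (toy dimensions and random weights stand in for a trained model; real systems use the layer sizes listed in the experimental setups later):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy dimensions: n_in linguistic features in, n_out acoustic features out.
n_in, n_hid, n_out = 30, 64, 40

# Randomly initialised weights stand in for trained parameters.
W = [rng.normal(0, 0.1, (n_in, n_hid)),
     rng.normal(0, 0.1, (n_hid, n_hid)),
     rng.normal(0, 0.1, (n_hid, n_hid))]
b = [np.zeros(n_hid) for _ in range(3)]
W_out = rng.normal(0, 0.1, (n_hid, n_out))
b_out = np.zeros(n_out)

def dnn_forward(x):
    """Map one frame of linguistic features to acoustic features."""
    h = x
    for Wi, bi in zip(W, b):          # sigmoid hidden layers
        h = sigmoid(h @ Wi + bi)
    return h @ W_out + b_out          # linear output layer

x = rng.normal(size=n_in)             # one frame of linguistic features
y = dnn_forward(x)
print(y.shape)                        # (40,)
```

Each frame is mapped independently; the sequence-level structure comes only from the input features and the later parameter-generation step.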

Framework

Input layer → hidden layers → output layer

[Figure: TEXT → text analysis → input feature extraction (binary & numeric features, duration feature, frame-position feature; driven by duration prediction) → input features including binary & numeric features at frames 1…T → neural network → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79
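The parameter generation step [9] solves for the static trajectory that best fits the predicted static and dynamic (Δ) statistics. A 1-dimensional numpy sketch, assuming a simple delta window Δc_t = (c_{t+1} − c_{t−1})/2 with zeroed edges (real systems handle multiple streams, Δ² features, and boundaries more carefully):

```python
import numpy as np

def mlpg(mu_s, var_s, mu_d, var_d):
    """Solve W' P W c = W' P mu for the static trajectory c.

    mu_s/var_s: per-frame means/variances of the static feature;
    mu_d/var_d: the same for its delta, delta_t = (c_{t+1} - c_{t-1}) / 2.
    """
    T = len(mu_s)
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[t, t] = 1.0                      # static rows
        if 0 < t < T - 1:                  # delta rows (zeroed at the edges)
            W[T + t, t - 1] = -0.5
            W[T + t, t + 1] = 0.5
    P = np.diag(1.0 / np.concatenate([var_s, var_d]))  # precision matrix
    A = W.T @ P @ W
    b = W.T @ P @ np.concatenate([mu_s, mu_d])
    return np.linalg.solve(A, b)

# Step-shaped static means with near-zero delta means: the delta constraint
# smooths the jump instead of copying the means frame by frame.
mu_s = np.array([0.0] * 5 + [1.0] * 5)
c = mlpg(mu_s, np.full(10, 0.04), np.zeros(10), np.full(10, 0.01))
print(c.shape)  # (10,)
```

The solved trajectory rises smoothly from the low to the high region rather than jumping, which is exactly the smoothing behaviour the dynamic-feature constraints provide.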

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  - Can model high-dimensional, highly correlated features efficiently
  - Layered architecture w/ non-linear operations → integrates feature extraction into acoustic modeling

• Distributed representation
  - Can be exponentially more efficient than fragmented representation
  - Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  - concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithm

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English, female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0th–39th mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer; sigmoid; continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

(w/o grouping questions, numeric contexts; silence frames removed)

[Figure: trajectories of the 5th mel-cepstrum coefficient over frames 0–500 for natural speech, HMM (α = 1), and DNN (4 × 512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

HMM (α)      DNN (layers × units)   Neutral   p value   z value
15.8 (16)    38.5 (4 × 256)         45.7      < 10^-6   -9.9
16.1 (4)     27.2 (4 × 512)         56.8      < 10^-6   -5.1
12.7 (1)     36.6 (4 × 1024)        50.7      < 10^-6   -11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

• Background
  - HMM-based statistical parametric speech synthesis (SPSS)
  - Flexibility
  - Improvements
• Statistical parametric speech synthesis with neural networks
  - Deep neural network (DNN)-based SPSS
  - Deep mixture density network (DMDN)-based SPSS
  - Recurrent neural network (RNN)-based SPSS
• Summary

Limitations of DNN-based acoustic modeling

[Figure: bimodal data samples in the (y1, y2) plane vs. the single NN prediction between the modes]

• Unimodality
  - Humans can speak in different ways → one-to-many mapping
  - NN trained by MSE loss → approximates conditional mean

• Lack of variance
  - DNN-based SPSS uses variances computed from all training data
  - Parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
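The conditional-mean point can be checked numerically: under squared error, the best single prediction for a one-to-many target is its mean, which can fall between the actual modes (toy numpy sketch; the two-mode data are illustrative):

```python
import numpy as np

# One-to-many data: for the same input, the target comes from two modes.
targets = np.array([-1.0] * 50 + [1.0] * 50)

# Scan constant predictions and find the one minimising mean squared error.
candidates = np.linspace(-2, 2, 401)
mse = [np.mean((targets - c) ** 2) for c in candidates]
best = candidates[int(np.argmin(mse))]

print(best)  # ≈ 0.0: the conditional mean, between the two modes
```

Neither mode is ever predicted, which is the motivation for replacing the linear output layer with a mixture density output layer.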


Mixture density network [26]

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

For a 1-dim, 2-mix MDN, the output layer computes activations z_j = Σ_{i=1}^{4} h_i w_ij, which are mapped to GMM parameters by:

• Weights → softmax activation function:
  w_1(x) = exp(z_1) / (exp(z_1) + exp(z_2)),  w_2(x) = exp(z_2) / (exp(z_1) + exp(z_2))

• Means → linear activation function:
  μ_1(x) = z_3,  μ_2(x) = z_4

• Variances → exponential activation function:
  σ_1(x) = exp(z_5),  σ_2(x) = exp(z_6)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
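The output-layer mapping above can be sketched directly (toy numpy code; the 6-dimensional activation vector z is hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mdn_params(z):
    """Map the 6 outputs of a 1-dim, 2-mix MDN to GMM parameters.

    z[0:2] -> mixture weights (softmax), z[2:4] -> means (linear),
    z[4:6] -> standard deviations (exponential).
    """
    w = softmax(z[0:2])
    mu = z[2:4]
    sigma = np.exp(z[4:6])
    return w, mu, sigma

def mdn_nll(z, y):
    """Negative log-likelihood of target y under the predicted GMM."""
    w, mu, sigma = mdn_params(z)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())

z = np.array([0.2, -0.4, 1.0, -1.0, 0.1, 0.3])   # hypothetical network outputs
w, mu, sigma = mdn_params(z)
print(w.sum())    # weights sum to 1 (softmax)
print(sigma)      # strictly positive by construction (exponential)
```

Training minimises this negative log-likelihood instead of MSE, so the network can place probability mass on several modes.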

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level linguistic features x_1, x_2, …, x_T; a deep MDN predicts w_1(x_t), w_2(x_t), μ_1(x_t), μ_2(x_t), σ_1(x_t), σ_2(x_t) for each frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output); 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM            1 mix:     3.537 ± 0.113
               2 mix:     3.397 ± 0.115
DNN            4 × 1024:  3.635 ± 0.127
               5 × 1024:  3.681 ± 0.109
               6 × 1024:  3.652 ± 0.108
               7 × 1024:  3.637 ± 0.129
DMDN (4×1024)  1 mix:     3.654 ± 0.117
               2 mix:     3.796 ± 0.107
               4 mix:     3.766 ± 0.113
               8 mix:     3.805 ± 0.113
               16 mix:    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

• Background
  - HMM-based statistical parametric speech synthesis (SPSS)
  - Flexibility
  - Improvements
• Statistical parametric speech synthesis with neural networks
  - Deep neural network (DNN)-based SPSS
  - Deep mixture density network (DMDN)-based SPSS
  - Recurrent neural network (RNN)-based SPSS
• Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features
  - Fixed number of preceding/succeeding contexts (e.g. ±2 phonemes/syllable stress) are used as inputs
  - Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  - Each frame is mapped independently
  - Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN with recurrent connections from the hidden layer to itself; unrolled over time with input x_t and output y_t at each step]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
  - Information in hidden layers loops through recurrent connections → quickly decays over time
  - Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
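A vanilla (unidirectional) RNN step, as a numpy sketch with illustrative sizes, showing how each frame's hidden state depends on all previous frames through the recurrence:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 10, 16

# Illustrative random weights; a real model learns these.
W_xh = rng.normal(0, 0.1, (n_in, n_hid))
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))
b_h = np.zeros(n_hid)

def rnn_forward(xs):
    """h_t = tanh(x_t W_xh + h_{t-1} W_hh + b): each frame sees all past frames."""
    h = np.zeros(n_hid)
    hs = []
    for x in xs:
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)
        hs.append(h)
    return np.stack(hs)

xs = rng.normal(size=(20, n_in))   # 20 frames of input features
hs = rnn_forward(xs)
print(hs.shape)                    # (20, 16)
```

Because the state is repeatedly squashed through tanh and multiplied by W_hh, information from distant frames decays quickly, which motivates the LSTM below.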

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units
  - Input gate: write
  - Output gate: read
  - Forget gate: reset

[Figure: LSTM block with memory cell c_t; sigmoid input, forget, and output gates and tanh nonlinearities, all fed by x_t and h_{t−1}]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
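The cell update described above, as a single-step numpy sketch (illustrative sizes; the random weights are placeholders, and peephole connections are omitted):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
n_in, n_cell = 8, 12

# One weight matrix per gate and for the cell input; inputs are [x_t, h_{t-1}].
Wi, Wf, Wo, Wc = (rng.normal(0, 0.1, (n_in + n_cell, n_cell)) for _ in range(4))
bi = bf = bo = bc = np.zeros(n_cell)

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigmoid(z @ Wi + bi)                     # input gate: write
    f = sigmoid(z @ Wf + bf)                     # forget gate: reset
    o = sigmoid(z @ Wo + bo)                     # output gate: read
    c = f * c_prev + i * np.tanh(z @ Wc + bc)    # linear memory cell update
    h = o * np.tanh(c)
    return h, c

h = c = np.zeros(n_cell)
for x in rng.normal(size=(5, n_in)):  # run five frames through the cell
    h, c = lstm_step(x, h, c)
print(h.shape)  # (12,)
```

The additive cell update (f * c_prev + i * …) is what lets information persist over many frames instead of decaying through repeated squashing.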

LSTM-based SPSS [33 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → frame-level linguistic features x_1, x_2, …, x_T → LSTM → acoustic features y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English, female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0th–39th mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

DNN w/ Δ   DNN w/o Δ   LSTM w/ Δ   LSTM w/o Δ   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10^-10
–          –           30.2        15.6         54.2      5.1    < 10^-6
15.8       –           34.0        –            50.2      -6.2   < 10^-9
28.4       –           –           33.6         38.0      -1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Page 54: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Characteristics of SPSS

bull Advantages

minus Flexibility to change voice characteristics

Adaptation Interpolation eigenvoice CAT multiple regression

minus Small footprintminus Robustness

bull Drawback

minus Quality

bull Major factors for quality degradation [3]

minus Vocoder (speech analysis amp synthesis)minus Acoustic model (HMM) rarr Neural networksminus Oversmoothing (parameter generation)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 46 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

HMM-based acoustic modeling for SPSS [4]

yes noyes no

yes no

yes no yes no

Acoustic space

bull Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

Acoustic

features y

Linguistic

features x

h1

h2

h3

bull DNN represents conditional distribution of y given x

bull DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

Input layer Output layerHidden layers

TEXT

SPEECH Parametergeneration

Waveformsynthesis

Inpu

t fea

ture

s in

clud

ing

bina

ry amp

num

eric

feat

ures

at f

ram

e 1

Inpu

t fea

ture

s in

clud

ing

bina

ry amp

num

eric

fe

atur

es a

t fra

me T

Textanalysis

Input featureextraction

Statistics (m

ean amp var) of speech param

eter vector sequence

Binaryfeatures

NumericfeaturesDuration

featureFrame position

feature

Spectralfeatures

ExcitationfeaturesVUVfeature

Durationprediction

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture with non-linear operations
→ feature extraction integrated into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than a fragmented representation
− Better representation ability with fewer parameters

• Layered hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Framework

Is this new? No:

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79

Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5th mel-cepstrum trajectory over frames 0–500 — natural speech vs. HMM (α=1) vs. DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM (α)     DNN (layers × units)    Neutral    p value    z value
15.8 (16)   38.5 (4 × 256)          45.7       < 10⁻⁶     −9.9
16.1 (4)    27.2 (4 × 512)          56.8       < 10⁻⁶     −5.1
12.7 (1)    36.6 (4 × 1024)         50.7       < 10⁻⁶     −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples forming two clusters in (y1, y2) space; the NN prediction lies between them]

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− NN trained with MSE loss → approximates the conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
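The unimodality limitation can be seen in two lines: for a one-to-many mapping (one input with two equally likely target clusters), the MSE-optimal prediction is the conditional mean, which falls between the clusters. A toy illustration with made-up targets:

```python
import numpy as np

# Toy illustration of the unimodality limitation: for one input with two
# possible outputs, the MSE-optimal (least-squares) prediction collapses
# to the mean of the targets, which lies between the two modes and need
# not resemble any valid output.

y_samples = np.array([-1.0, -1.0, 1.0, 1.0])  # two output modes for one x

y_hat = y_samples.mean()  # MSE-optimal constant prediction
print(y_hat)  # 0.0 -- between the modes, not a sample from either
```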

Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the output layer emits z1…z6, which parameterize mixture weights w1(x), w2(x), means μ1(x), μ2(x), and variances σ1(x), σ2(x) over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                                μ2(x) = z4
σ1(x) = exp(z5)                           σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
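The 1-dim, 2-mix MDN output layer above can be sketched in a few lines (the z values are illustrative):

```python
import numpy as np

# Sketch of the 1-dim, 2-mix MDN output layer: the network's last
# activations z1..z6 become mixture weights (softmax), means (linear)
# and standard deviations (exponential, keeping them positive).

def mdn_output(z):
    z = np.asarray(z, dtype=float)           # z = [z1..z6]
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()  # weights: softmax over z1, z2
    mu = z[2:4]                              # means: mu1 = z3, mu2 = z4
    sigma = np.exp(z[4:6])                   # exp keeps variances positive
    return w, mu, sigma

w, mu, sigma = mdn_output([0.0, 0.0, -1.0, 1.0, 0.0, 0.5])
print(w)   # [0.5 0.5]
print(mu)  # [-1.  1.]
```

The softmax guarantees the weights sum to one and the exponential guarantees positive variances, so every output of the network is a valid GMM over y.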

DMDN-based SPSS [27]

[Figure: DMDN-based SPSS framework. TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → DMDN emits w1, w2, μ1, μ2, σ1, σ2 for each frame → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM:           1 mix 3.537 ± 0.113;  2 mix 3.397 ± 0.115
DNN:           4×1024 3.635 ± 0.127;  5×1024 3.681 ± 0.109;  6×1024 3.652 ± 0.108;  7×1024 3.637 ± 0.129
DMDN (4×1024): 1 mix 3.654 ± 0.117;  2 mix 3.796 ± 0.107;  4 mix 3.766 ± 0.113;  8 mix 3.805 ± 0.113;  16 mix 3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features

− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

[Figure: RNN unrolled in time — recurrent connections propagate the hidden state so that output y_t depends on inputs x_1 … x_t, not on x_t alone]

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in hidden layers loops through recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
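The recurrence described above can be sketched as a basic (Elman-style) RNN step; all sizes and weights are illustrative placeholders:

```python
import numpy as np

# Sketch of a basic RNN: the hidden state h carries context forward
# through the recurrent connections, so the output at frame t depends
# on all inputs up to t rather than on x_t alone.

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 4, 8, 3
Wx = rng.standard_normal((n_in, n_hid)) * 0.1   # input -> hidden
Wh = rng.standard_normal((n_hid, n_hid)) * 0.1  # recurrent connections
Wy = rng.standard_normal((n_hid, n_out)) * 0.1  # hidden -> output

def rnn(xs):
    h = np.zeros(n_hid)
    ys = []
    for x in xs:                      # frame-by-frame, left to right
        h = np.tanh(x @ Wx + h @ Wh)  # recurrent state update
        ys.append(h @ Wy)
    return np.array(ys)

ys = rnn(rng.standard_normal((5, n_in)))
print(ys.shape)  # (5, 3)
```

Because `h` is squashed and re-mixed at every step, information from early frames decays quickly — the motivation for the LSTM cell on the next slide.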

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — the memory cell c_t is updated through tanh squashing and sigmoid gates, each fed by x_t, h_{t−1} and a bias (b_i, b_f, b_c, b_o): input gate i_t (write), forget gate (reset), output gate (read), producing h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
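One LSTM step from the figure can be sketched directly — a linear memory cell guarded by write/reset/read gates (sizes and weights illustrative):

```python
import numpy as np

# Sketch of one LSTM step matching the slide: a linear memory cell c_t
# guarded by input (write), forget (reset) and output (read) gates,
# each a sigmoid over the current input x_t and previous output h_{t-1}.

rng = np.random.default_rng(0)
n_in, n_cell = 4, 8

def params():
    return (rng.standard_normal((n_in + n_cell, n_cell)) * 0.1,
            np.zeros(n_cell))

(Wi, bi), (Wf, bf), (Wc, bc), (Wo, bo) = params(), params(), params(), params()

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([x, h_prev])
    i = sigm(z @ Wi + bi)                      # input gate: write
    f = sigm(z @ Wf + bf)                      # forget gate: reset
    o = sigm(z @ Wo + bo)                      # output gate: read
    c = f * c_prev + i * np.tanh(z @ Wc + bc)  # linear memory cell update
    h = o * np.tanh(c)                         # gated cell readout
    return h, c

h, c = lstm_step(rng.standard_normal(n_in),
                 np.zeros(n_cell), np.zeros(n_cell))
print(h.shape)  # (8,)
```

The key design choice is the additive cell update `c = f*c_prev + i*…`: because the cell is linear, gradients and stored information survive across many frames instead of decaying through repeated tanh squashing.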

LSTM-based SPSS [33 34]

[Figure: LSTM-based SPSS framework. TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT → LSTM → acoustic outputs y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449, LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      −6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      −1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g., adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learns a mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 56: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Linguistic → acoustic mapping

• Training
Learn the relationship between linguistic & acoustic features

• Synthesis
Map linguistic features to acoustic ones

• Linguistic features used in SPSS

− Phoneme-, syllable-, word-, phrase-, and utterance-level features
− e.g., phone identity, POS, stress of words in a phrase
− Around 50 different types; much more than ASR (typically 3–5)

Effective modeling is essential

Heiga Zen, Statistical Parametric Speech Synthesis, June 9th 2014, 48 of 79
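The categorical contexts above (phone identity, POS, stress) and numeric ones (positions, counts) are typically flattened into one input vector per frame. A minimal sketch, with invented toy inventories and feature names (not the actual feature set used in the talk):

```python
# Sketch: encoding linguistic contexts as a network input vector.
# PHONES/POS_TAGS are illustrative inventories, not the real ones.

PHONES = ["sil", "a", "b", "k"]          # phone identity inventory
POS_TAGS = ["noun", "verb", "other"]     # part-of-speech inventory

def one_hot(value, inventory):
    """Binary encoding of a categorical linguistic feature."""
    vec = [0.0] * len(inventory)
    vec[inventory.index(value)] = 1.0
    return vec

def encode_context(phone, pos, stressed, syllable_pos, num_syllables):
    # Binary features: categorical contexts expanded to one-hot codes.
    features = one_hot(phone, PHONES) + one_hot(pos, POS_TAGS)
    features.append(1.0 if stressed else 0.0)
    # Numeric features: positions/counts used directly.
    features += [float(syllable_pos), float(num_syllables)]
    return features

x = encode_context("a", "noun", True, 2, 3)
print(len(x), x)
```

With ~50 context types, the same expansion yields input vectors of a few hundred dimensions, matching the 449/289-dimensional inputs quoted later.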

HMM-based acoustic modeling for SPSS [4]

[Figure: a binary decision tree of yes/no context questions clustering the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

DNN-based acoustic modeling for SPSS [18]

[Figure: a feed-forward network with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• DNN represents the conditional distribution of y given x

• DNN replaces decision trees and GMMs
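The mapping the slide describes can be sketched as a small feed-forward network. A minimal sketch with toy dimensions and randomly initialised weights standing in for trained parameters (the real systems below use 449-dim inputs and 4+ hidden layers of 1024 units):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(a):
    return np.maximum(a, 0.0)

# Toy dimensions, purely illustrative.
D_IN, D_H, D_OUT = 10, 16, 4   # linguistic dims, hidden units, acoustic dims

# Random stand-ins for trained weights.
W1, b1 = rng.standard_normal((D_H, D_IN)) * 0.1, np.zeros(D_H)
W2, b2 = rng.standard_normal((D_OUT, D_H)) * 0.1, np.zeros(D_OUT)

def dnn_predict(x):
    """Map one frame of linguistic features x to acoustic features y."""
    h = relu(W1 @ x + b1)      # hidden layer with non-linearity
    return W2 @ h + b2         # linear output layer (mean prediction)

x = rng.standard_normal(D_IN)
y = dnn_predict(x)
print(y.shape)
```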

Framework

[Figure: the DNN-based synthesis pipeline — TEXT → text analysis → input feature extraction produces input features (binary & numeric, plus duration and frame-position features) for frames 1 … T; a network (input, hidden, and output layers) maps them to statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature); duration prediction, parameter generation, and waveform synthesis yield SPEECH]

Advantages of NN-based acoustic modeling

• Integrating feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture w/ non-linear operations
→ integrates feature extraction into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than a fragmented representation
− Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Framework

Is this new? ... no

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Experimental setup

Database: US English female speaker
Training / test data: 33000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: trajectories of the 5-th mel-cepstrum coefficient over frames 0–500 — natural speech vs. HMM (α = 1) vs. DNN (4 × 512)]

Subjective evaluations

Compared HMM-based systems with DNN-based ones with similar numbers of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)   | DNN (layers × units) | Neutral | p value | z value
15.8 (16) | 38.5 (4 × 256)       | 45.7    | < 10⁻⁶  | −9.9
16.1 (4)  | 27.2 (4 × 512)       | 56.8    | < 10⁻⁶  | −5.1
12.7 (1)  | 36.6 (4 × 1024)      | 50.7    | < 10⁻⁶  | −11.5
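A paired preference test like the one above is commonly scored with a sign test: neutral ratings are dropped and the split between the two systems is compared against a fair coin. A minimal sketch under that assumption (the talk does not state its exact test; the counts below are invented):

```python
import math

def preference_z(n_a, n_b):
    """z statistic of a two-sided sign test on paired preference counts,
    ignoring 'neutral' ratings (normal approximation to Binomial(n, 0.5))."""
    n = n_a + n_b
    return (n_a - n_b) / math.sqrt(n)

# Illustrative counts, not the actual rating counts behind the table above.
z = preference_z(120, 60)
print(round(z, 2))
```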

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples in (y1, y2) space form two clusters for the same input; the NN prediction falls between them]

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− An NN trained with an MSE loss → approximates the conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
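The unimodality point can be checked numerically: if one input has two equally likely targets, the MSE-optimal prediction is their average, which matches neither mode. A tiny grid-search sketch:

```python
# Sketch: why an MSE-trained network approximates the conditional mean.
# One input x with two possible outputs (a one-to-many mapping):
targets = [1.0, -1.0]  # two equally likely realisations for the same x

def mse(pred):
    return sum((pred - t) ** 2 for t in targets) / len(targets)

# Grid search over candidate predictions in [-2, 2].
best = min((p / 100.0 for p in range(-200, 201)), key=mse)
print(best)  # the conditional mean (0.0), not either mode
```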

Mixture density network [26]

[Figure: a 1-dim, 2-mix MDN — the network's six outputs z1 … z6 parameterize a 2-component GMM over y: weights w1(x), w2(x), means μ1(x), μ2(x), variances σ1(x), σ2(x)]

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_ij

• Weights → softmax activation function:
  w1(x) = exp(z1) / (exp(z1) + exp(z2)),  w2(x) = exp(z2) / (exp(z1) + exp(z2))

• Means → linear activation function:
  μ1(x) = z3,  μ2(x) = z4

• Variances → exponential activation function:
  σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
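The three activation rules above (softmax for weights, identity for means, exponential for variances) can be sketched directly for the 1-dim, 2-mix case on the slide:

```python
import numpy as np

def mdn_output_layer(z):
    """Convert 6 raw network outputs z1..z6 into a 1-dim, 2-mixture GMM:
    weights via softmax, means via identity, variances via exp."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z[0:2] - z[0:2].max())      # numerically stabilised softmax
    w = e / e.sum()                        # mixture weights w1, w2 (sum to 1)
    mu = z[2:4]                            # means mu1, mu2
    sigma = np.exp(z[4:6])                 # std devs sigma1, sigma2 (positive)
    return w, mu, sigma

w, mu, sigma = mdn_output_layer([0.0, 0.0, -1.0, 2.0, 0.0, 0.5])
print(w, mu, sigma)
```

The exponential keeps variances positive without constraining the raw outputs, and the softmax keeps the weights a valid distribution.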

DMDN-based SPSS [27]

[Figure: the DMDN pipeline — TEXT → text analysis → input feature extraction → duration prediction gives frame-level inputs x1, x2, …, xT; at each frame the network outputs mixture weights w1(xt), w2(xt), means μ1(xt), μ2(xt), and variances σ1(xt), σ2(xt); parameter generation and waveform synthesis yield SPEECH]

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix    3.537 ± 0.113
               2 mix    3.397 ± 0.115
DNN            4×1024   3.635 ± 0.127
               5×1024   3.681 ± 0.109
               6×1024   3.652 ± 0.108
               7×1024   3.637 ± 0.129
DMDN (4×1024)  1 mix    3.654 ± 0.117
               2 mix    3.796 ± 0.107
               4 mix    3.766 ± 0.113
               8 mix    3.805 ± 0.113
               16 mix   3.791 ± 0.102
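The "mean ± x" entries in MOS tables like this one are commonly the sample mean with a 95% confidence half-width. A minimal sketch under that assumption (the talk does not state how its intervals were computed; the ratings below are invented):

```python
import math

def mos_with_ci(scores, z95=1.96):
    """Mean opinion score with a normal-approximation 95% confidence interval."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)   # sample variance
    half = z95 * math.sqrt(var / n)                        # CI half-width
    return mean, half

# Illustrative ratings on the 1-5 scale, not the experiment's raw data.
mean, half = mos_with_ci([4, 3, 5, 4, 3, 4, 4, 5, 3, 4])
print(f"{mean:.2f} +/- {half:.2f}")
```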

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features

− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
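The dynamic features used for smoothing are deltas of the static trajectory. A minimal sketch of the common first-order delta window 0.5·(c[t+1] − c[t−1]), with edge frames replicated (an illustrative convention; implementations differ):

```python
# Sketch: computing first-order dynamic (delta) features from a
# static parameter trajectory c, the quantity constrained during
# parameter generation to smooth frame-by-frame predictions.

def deltas(c):
    """Delta features with window 0.5 * (c[t+1] - c[t-1]), edges replicated."""
    padded = [c[0]] + list(c) + [c[-1]]   # replicate first/last frame
    return [0.5 * (padded[t + 2] - padded[t]) for t in range(len(c))]

c = [0.0, 1.0, 2.0, 1.0, 0.0]   # toy trajectory
d = deltas(c)
print(d)
```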

Basic RNN

[Figure: an RNN unrolled in time — inputs x_{t−1}, x_t, x_{t+1} feed a hidden layer with recurrent connections, producing outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in hidden layers loops through the recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]
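The recurrence in the figure can be sketched in a few lines: each hidden state depends on the current input and the previous hidden state, so outputs only see past context. Toy dimensions and random weights stand in for trained parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

D_X, D_H, D_Y = 3, 5, 2                        # toy dimensions
Wxh = rng.standard_normal((D_H, D_X)) * 0.1
Whh = rng.standard_normal((D_H, D_H)) * 0.1    # the recurrent connections
Why = rng.standard_normal((D_Y, D_H)) * 0.1

def rnn_forward(xs):
    """Run a unidirectional RNN over a sequence: h_t = tanh(Wxh x_t + Whh h_{t-1}),
    so each output y_t can only use previous contexts."""
    h = np.zeros(D_H)
    ys = []
    for x in xs:
        h = np.tanh(Wxh @ x + Whh @ h)
        ys.append(Why @ h)
    return ys

ys = rnn_forward([rng.standard_normal(D_X) for _ in range(4)])
print(len(ys), ys[0].shape)
```

A bidirectional RNN simply runs a second pass right-to-left and combines both hidden sequences.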

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: an LSTM block — the memory cell c_t is written via the input gate i_t (sigmoid over x_t, h_{t−1}, bias b_i), reset via the forget gate (sigmoid, bias b_f), and read via the output gate (sigmoid, bias b_o); tanh non-linearities on the cell input (bias b_c) and cell output produce h_t]
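The write/reset/read behaviour of the gates can be sketched as one cell update. Toy sizes and random weights stand in for trained parameters (peephole connections and projection layers are omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)
D = 4  # toy cell size; input size kept equal to D for brevity

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

# One weight matrix per gate over the concatenation [x_t, h_{t-1}].
Wi, Wf, Wo, Wc = (rng.standard_normal((D, 2 * D)) * 0.1 for _ in range(4))
bi = bf = bo = bc = np.zeros(D)

def lstm_step(x, h_prev, c_prev):
    """One LSTM step: write via input gate, reset via forget gate, read via output gate."""
    z = np.concatenate([x, h_prev])
    i = sigm(Wi @ z + bi)                        # input gate (write)
    f = sigm(Wf @ z + bf)                        # forget gate (reset)
    o = sigm(Wo @ z + bo)                        # output gate (read)
    c = f * c_prev + i * np.tanh(Wc @ z + bc)    # linear memory cell
    h = o * np.tanh(c)
    return h, c

h, c = lstm_step(rng.standard_normal(D), np.zeros(D), np.zeros(D))
print(h.shape, c.shape)
```

Because the cell update f * c_prev + … is additive rather than repeatedly squashed, information can survive many time steps, addressing the decay problem of the basic RNN.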

LSTM-based SPSS [33, 34]

[Figure: the LSTM pipeline — TEXT → text analysis → input feature extraction → duration prediction gives frame-level inputs x1, x2, …, xT; a recurrent network maps them to acoustic features y1, y2, …, yT; parameter generation and waveform synthesis yield SPEECH]

Experimental setup

Database: US English female speaker
Train / dev set data: 34632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆ | DNN w/o ∆ | LSTM w/ ∆ | LSTM w/o ∆ | Neutral | z    | p
50.0     | 14.2      | –         | –          | 35.8    | 12.0 | < 10⁻¹⁰
–        | –         | 30.2      | 15.6       | 54.2    | 5.1  | < 10⁻⁶
15.8     | –         | 34.0      | –          | 50.2    | −6.2 | < 10⁻⁹
28.4     | –         | –         | 33.6       | 38.0    | −1.5 | 0.138

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g., adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learns a mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 57: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

HMM-based acoustic modeling for SPSS [4]

yes noyes no

yes no

yes no yes no

Acoustic space

bull Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

[Figure: feed-forward DNN with hidden layers h1, h2, h3 mapping linguistic features x to acoustic features y]

• The DNN represents the conditional distribution of y given x

• The DNN replaces the decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79
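The x → y mapping above is plain frame-level regression. A minimal sketch of the forward pass, assuming sigmoid hidden layers and a linear output layer as in the setup described later (the tiny hand-set weights are illustrative only):

```python
import math

def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]

def affine(x, W, b):
    """W is a list of rows; returns W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def dnn_predict(x, hidden, output):
    """Sigmoid hidden layers, linear output layer (regression to acoustic features)."""
    h = x
    for W, b in hidden:
        h = sigmoid(affine(h, W, b))
    W, b = output
    return affine(h, W, b)  # e.g. mel-cepstrum / log F0 / aperiodicity statistics

# Toy 2-input -> 2-hidden -> 1-output network with hand-set weights (assumption).
hidden = [([[0.5, -0.5], [1.0, 1.0]], [0.0, 0.0])]
output = ([[1.0, 1.0]], [0.0])
y = dnn_predict([0.0, 0.0], hidden, output)
# both hidden units give sigmoid(0) = 0.5, so y == [1.0]
```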

Framework

[Figure: DNN-based synthesis framework. TEXT → text analysis → input feature extraction → frame-level input features (binary & numeric features, plus duration and frame-position features) for frames 1…T → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH. Frame-level durations come from a separate duration prediction module.]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

• Integrated feature extraction

− Can model high-dimensional, highly correlated features efficiently
− Layered architecture with non-linear operations
→ feature extraction integrated into acoustic modeling

• Distributed representation

− Can be exponentially more efficient than a fragmented representation
− Better representational ability with fewer parameters

• Layered, hierarchical structure in speech production

− concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79


Framework

Is this new? — no

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English, female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, Δ, Δ²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

(w/o grouping questions; numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum coefficient over frames 0–500, comparing natural speech, HMM (α=1), and DNN (4×512) trajectories]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

HMM (α)    | DNN (layers × units) | Neutral | p value  | z value
15.8 (16)  | 38.5 (4 × 256)       | 45.7    | < 10⁻⁶   | -9.9
16.1 (4)   | 27.2 (4 × 512)       | 56.8    | < 10⁻⁶   | -5.1
12.7 (1)   | 36.6 (4 × 1024)      | 50.7    | < 10⁻⁶   | -11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples in (y1, y2) space vs. the NN prediction, which falls between the modes]

• Unimodality

− Humans can speak in different ways → one-to-many mapping
− An NN trained with an MSE loss → approximates the conditional mean

• Lack of variance

− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
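The unimodality problem above can be illustrated numerically: when the same input has two equally likely targets (a one-to-many mapping), the MSE-optimal prediction is their mean, which matches neither mode. A minimal sketch:

```python
# Two equally likely "speaking styles" for the same linguistic input (toy values).
targets = [-1.0, 1.0]

def mse(pred):
    """Mean squared error of a constant prediction against both targets."""
    return sum((t - pred) ** 2 for t in targets) / len(targets)

# Scan candidate predictions on a grid; the minimizer is the conditional
# mean 0.0 -- a value the speaker never actually produces.
best = min((p / 100.0 for p in range(-200, 201)), key=mse)
```

This is exactly why a mixture density output layer, which can place probability mass on both modes, is the fix proposed on the slide.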


Mixture density network [26]

[Figure: 1-dim, 2-mixture MDN — the network's output layer produces mixture weights w1(x), w2(x), means μ1(x), μ2(x), and variances σ1(x), σ2(x) of a GMM over y]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

With output-layer pre-activations z_j = Σ_{i=1}^{4} h_i w_{ij}:

w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m)    w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
μ1(x) = z3                                μ2(x) = z4
σ1(x) = exp(z5)                           σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
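The output-layer activations above, plus the negative log-likelihood used to train an MDN, can be sketched directly (1-dim, 2-mixture case; function names are illustrative):

```python
import math

def mdn_params(z):
    """Map 6 pre-activations of a 1-dim, 2-mixture MDN output layer to GMM
    parameters: softmax -> weights, identity -> means, exp -> std devs."""
    z1, z2, z3, z4, z5, z6 = z
    denom = math.exp(z1) + math.exp(z2)
    w = (math.exp(z1) / denom, math.exp(z2) / denom)  # sums to 1
    mu = (z3, z4)
    sigma = (math.exp(z5), math.exp(z6))              # always positive
    return w, mu, sigma

def mdn_nll(z, y):
    """Negative log-likelihood of y under the mixture -- the training loss."""
    w, mu, sigma = mdn_params(z)
    p = sum(wk / (math.sqrt(2.0 * math.pi) * sk)
            * math.exp(-0.5 * ((y - mk) / sk) ** 2)
            for wk, mk, sk in zip(w, mu, sigma))
    return -math.log(p)

w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
# equal weights (0.5, 0.5), means (-1, 1), unit std devs: a bimodal prediction
# that an MSE-trained linear output layer could never represent
```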

DMDN-based SPSS [27]

[Figure: DMDN-based synthesis framework. TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, …, xT → DMDN outputs per-frame mixture parameters w_k(x_t), μ_k(x_t), σ_k(x_t) → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mixtures
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

MOS:

HMM        1 mix     3.537 ± 0.113
           2 mix     3.397 ± 0.115
DNN        4×1024    3.635 ± 0.127
           5×1024    3.681 ± 0.109
           6×1024    3.652 ± 0.108
           7×1024    3.637 ± 0.129
DMDN       1 mix     3.654 ± 0.117
(4×1024)   2 mix     3.796 ± 0.107
           4 mix     3.766 ± 0.113
           8 mix     3.805 ± 0.113
           16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNNDMDN-based acoustic modeling

• Fixed time span for input features

− A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
− Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping

− Each frame is mapped independently
− Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1} map to outputs y_{t−1}, y_t, y_{t+1} through a hidden layer with recurrent connections]

• Only able to use previous contexts
→ bidirectional RNN [31]

• Trouble accessing long-range contexts

− Information in the hidden layers loops through the recurrent connections
→ quickly decays over time
− Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — memory cell c_t with tanh squashing on its input and output, and sigmoid input, forget, and output gates, each driven by x_t and h_{t−1}. Input gate: write; output gate: read; forget gate: reset]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
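The write/read/reset roles of the gates can be sketched as one step of a single-unit LSTM (scalar input for brevity; all parameter names here are illustrative, not from any library):

```python
import math

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM. p holds weights (w*) and
    biases (b*) for the three gates and the cell input."""
    i = sigm(p["wi_x"] * x + p["wi_h"] * h_prev + p["bi"])       # input gate: write
    f = sigm(p["wf_x"] * x + p["wf_h"] * h_prev + p["bf"])       # forget gate: reset
    o = sigm(p["wo_x"] * x + p["wo_h"] * h_prev + p["bo"])       # output gate: read
    g = math.tanh(p["wc_x"] * x + p["wc_h"] * h_prev + p["bc"])  # cell input
    c = f * c_prev + i * g   # linear memory-cell update (no repeated squashing)
    h = o * math.tanh(c)     # gated output
    return h, c

# Toy parameters (assumption): every weight and bias set to 0.5.
p = {k: 0.5 for k in ["wi_x", "wi_h", "bi", "wf_x", "wf_h", "bf",
                      "wo_x", "wo_h", "bo", "wc_x", "wc_h", "bc"]}
h, c = lstm_step(1.0, 0.0, 0.0, p)
```

The linear update of `c` (just `f * c_prev + i * g`) is the point of the design: information can persist across many steps without decaying through a squashing non-linearity, unlike the basic RNN of the previous slide.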

LSTM-based SPSS [33 34]

[Figure: LSTM-based synthesis framework. TEXT → text analysis → input feature extraction → duration prediction → frame-level inputs x1, x2, …, xT → unidirectional LSTM → acoustic features y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database: US English, female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

DNN w/ Δ | DNN w/o Δ | LSTM w/ Δ | LSTM w/o Δ | Neutral | z    | p
50.0     | 14.2      | –         | –          | 35.8    | 12.0 | < 10⁻¹⁰
–        | –         | 30.2      | 15.6       | 54.2    | 5.1  | < 10⁻⁶
15.8     | –         | 34.0      | –          | 50.2    | -6.2 | < 10⁻⁹
28.4     | –         | –         | 33.6       | 38.0    | -1.5 | 0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
• HMM-based statistical parametric speech synthesis (SPSS)
• Flexibility
• Improvements

Statistical parametric speech synthesis with neural networks
• Deep neural network (DNN)-based SPSS
• Deep mixture density network (DMDN)-based SPSS
• Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS

− Flexible (e.g., adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS

− Learns the mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 58: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Linguistic rarr acoustic mapping

bull TrainingLearn relationship between linguistc amp acoustic features

bull SynthesisMap linguistic features to acoustic ones

bull Linguistic features used in SPSS

minus Phoneme syllable word phrase utterance-level featuresminus eg phone identity POS stress of words in a phraseminus Around 50 different types much more than ASR (typically 3ndash5)

Effective modeling is essential

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 48 of 79

HMM-based acoustic modeling for SPSS [4]

yes noyes no

yes no

yes no yes no

Acoustic space

bull Decision tree-clustered HMM with GMM state-output distributions

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 49 of 79

DNN-based acoustic modeling for SPSS [18]

Acoustic

features y

Linguistic

features x

h1

h2

h3

bull DNN represents conditional distribution of y given x

bull DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

Input layer Output layerHidden layers

TEXT

SPEECH Parametergeneration

Waveformsynthesis

Inpu

t fea

ture

s in

clud

ing

bina

ry amp

num

eric

feat

ures

at f

ram

e 1

Inpu

t fea

ture

s in

clud

ing

bina

ry amp

num

eric

fe

atur

es a

t fra

me T

Textanalysis

Input featureextraction

Statistics (m

ean amp var) of speech param

eter vector sequence

Binaryfeatures

NumericfeaturesDuration

featureFrame position

feature

Spectralfeatures

ExcitationfeaturesVUVfeature

Durationprediction

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Framework

Is this new no

bull NN [19]

bull RNN [20]

Whatrsquos the difference

bull More layers data computational resources

bull Better learning algorithm

bull Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79

Framework

Is this new no

bull NN [19]

bull RNN [20]

Whatrsquos the difference

bull More layers data computational resources

bull Better learning algorithm

bull Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79

Experimental setup

Database US English female speaker

Training test data 33000 amp 173 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic 11 categorical featuresfeatures 25 numeric features

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2

HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]

DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

wo grouping questions numeric contexts silence frames removed

-1

0

1

0 100 200 300 400 500

5-t

h M

el-ce

pstr

um

Frame

Natural speechHMM (α=1)DNN (4x512)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones withsimilar of parameters

bull Paired comparison test

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

HMM DNN(α) (layers times units) Neutral p value z value

158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)w2(x1)

y

Weights rarr Softmax activation function

Means rarr Linear activation function

Variances rarr Exponential activation function

sumzj =

4

i=1

hiwij

w1(x) =exp(z1)sum2

m=1 exp(zm)

micro1(x) = z3

σ1(x) = exp(z5)

Inputs of activation function

1-dim 2-mix MDN

w2(x) =exp(z2)sum2

m=1 exp(zm)

micro1(x) = z4

σ2(x) = exp(z6)

NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)

w2(x1)

y

x1

micro1(x2 ) micro2(x2 )

σ2(x2 )σ1(x2 )

w1(x2 ) w2(x2 )

y

x2

micro1(xT ) micro2(xT )

σ1(xT )σ2(xT )

w2(xT )

w1(xT )

y

xT

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)

DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)

1ndash16 mix

Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database             US English female speaker
Train / dev set data 34,632 & 100 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  DNN: 449; LSTM: 289
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                  4 hidden layers, 1024 units/hidden layer;
                     ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM                 1 forward LSTM layer; 256 units, 128 projection;
                     asynchronous SGD on CPUs [35]
Postprocessing       Postfiltering in cepstrum domain [25]

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

        DNN              LSTM               Stats
  w/ ∆   w/o ∆     w/ ∆   w/o ∆    Neutral     z          p
  50.0   14.2       –      –        35.8      12.0    < 10⁻¹⁰
   –      –        30.2   15.6      54.2       5.1    < 10⁻⁶
  15.8    –        34.0    –        50.2      −6.2    < 10⁻⁹
   –     28.4       –     33.6      38.0      −1.5      0.138
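The z values in a paired preference test like this are commonly computed with a sign test over the non-neutral ratings; the sketch below assumes that convention (the slides do not state the exact procedure, and the counts used are illustrative):

```python
import math

def sign_test_z(n_a, n_b):
    """Two-sided sign-test z statistic for a paired preference test.

    n_a, n_b: counts of ratings preferring system A and system B
    (neutral ratings are discarded). Under the null hypothesis,
    preferences are Binomial(n, 0.5); the normal approximation gives
    z = (n_a - n/2) / (sqrt(n)/2) = (n_a - n_b) / sqrt(n).
    """
    n = n_a + n_b
    return (n_a - n_b) / math.sqrt(n)

# Illustrative counts (not the slide's raw data):
z = sign_test_z(75, 25)  # 75 vs. 25 preferences -> z = 5.0
```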

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UK Speech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Page 59: Statistical Parametric Speech Synthesis - Google

HMM-based acoustic modeling for SPSS [4]

[Figure: decision trees of yes/no context questions mapping into the acoustic space]

• Decision tree-clustered HMM with GMM state-output distributions

DNN-based acoustic modeling for SPSS [18]

[Figure: DNN with linguistic features x at the input, hidden layers h1, h2, h3, and acoustic features y at the output]

• DNN represents the conditional distribution of y given x

• DNN replaces decision trees and GMMs
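As a concrete picture of this mapping, a frame-level forward pass might look like the following (toy NumPy sketch; the sigmoid-hidden/linear-output choice mirrors the setup described later in the deck, but the layer sizes and weights here are placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes: linguistic input, three hidden layers, acoustic output.
sizes = [10, 16, 16, 16, 8]
Ws = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]

def dnn_forward(x):
    """Map one frame of linguistic features to acoustic-feature statistics.

    Hidden layers apply a sigmoid non-linearity; the output layer is
    linear, so the network predicts the conditional mean of y given x.
    """
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))  # sigmoid hidden layer
    return Ws[-1] @ h + bs[-1]                   # linear output layer

y = dnn_forward(rng.standard_normal(10))
```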

Framework

[Figure: TEXT → text analysis → input feature extraction (binary & numeric features, duration feature, frame-position feature) → input features, including binary & numeric features, at frames 1…T → DNN (input layer, hidden layers, output layer) → statistics (mean & variance) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the frame positions]

Advantages of NN-based acoustic modeling

• Integrated feature extraction

  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture with non-linear operations
    → feature extraction integrated into acoustic modeling

• Distributed representation

  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production

  − concept → linguistic → articulatory → waveform


Framework

Is this new? ... no

• NN [19]

• RNN [20]

What's the difference?

• More layers, data, computational resources

• Better learning algorithms

• Statistical parametric speech synthesis techniques


Experimental setup

Database             US English female speaker
Training / test data 33,000 & 173 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  11 categorical features, 25 numeric features
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology         5-state left-to-right HSMM [21]; MSD F0 [22]; MDL [23]
DNN architecture     1–5 layers, 256/512/1024/2048 units/layer;
                     sigmoid; continuous F0 [24]
Postprocessing       Postfiltering in cepstrum domain [25]

Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500, comparing natural speech, HMM (α = 1), and DNN (4×512)]

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

  HMM (α)     DNN (layers × units)   Neutral   p value    z value
  15.8 (16)   38.5 (4 × 256)          45.7     < 10⁻⁶      −9.9
  16.1 (4)    27.2 (4 × 512)          56.8     < 10⁻⁶      −5.1
  12.7 (1)    36.6 (4 × 1024)         50.7     < 10⁻⁶     −11.5

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: bimodal data samples in the (y1, y2) plane; the NN prediction falls between the two modes]

• Unimodality

  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with an MSE loss approximates the conditional mean

• Lack of variance

  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
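The unimodality problem can be seen with a one-line experiment: when the same input maps to two target modes, the MSE-optimal prediction is their mean, a value the data itself never takes (toy data, not from the deck):

```python
import numpy as np

rng = np.random.default_rng(3)

# One-to-many toy data: for the same input, the target is one of two
# equally likely modes, at -1 and +1.
targets = rng.choice([-1.0, 1.0], size=1000)

# The constant c minimizing the MSE loss E[(t - c)^2] is the sample mean,
# which sits near 0 -- far from both modes the speaker actually produces.
c = targets.mean()
```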


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; the network outputs mixture weights w1(x), w2(x), means µ1(x), µ2(x), and variances σ1(x), σ2(x) of a GMM over y]

Inputs of the activation function: z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights   → softmax activation function:     w1(x) = exp(z1) / Σ_{m=1}^{2} exp(z_m),  w2(x) = exp(z2) / Σ_{m=1}^{2} exp(z_m)
Means     → linear activation function:      µ1(x) = z3,  µ2(x) = z4
Variances → exponential activation function: σ1(x) = exp(z5),  σ2(x) = exp(z6)

NN + mixture model (GMM)
→ NN outputs GMM weights, means & variances
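Following the slide's activation functions, the mixture-density output layer for the 1-dim, 2-mix case can be sketched as follows (NumPy; the raw outputs z are hypothetical placeholders for the last hidden layer's projection):

```python
import numpy as np

rng = np.random.default_rng(4)

def mdn_output_layer(z):
    """Turn raw network outputs z into GMM parameters (1-dim, 2-mix case).

    z[0:2] -> mixture weights via softmax,
    z[2:4] -> means via the identity (linear activation),
    z[4:6] -> standard deviations via exp (keeps them positive).
    """
    e = np.exp(z[0:2] - z[0:2].max())  # numerically stable softmax
    w = e / e.sum()
    mu = z[2:4]
    sigma = np.exp(z[4:6])
    return w, mu, sigma

w, mu, sigma = mdn_output_layer(rng.standard_normal(6))
```

The exponential keeps variances strictly positive and the softmax keeps the weights summing to one, so every network output is a valid GMM, unlike a plain linear output layer.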

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → input features x_1, x_2, …, x_T → deep MDN, which outputs per-frame GMM parameters w_m(x_t), µ_m(x_t), σ_m(x_t) → parameter generation → waveform synthesis → SPEECH]

Experimental setup

• Almost the same as the previous setup

• Differences:

DNN architecture    4–7 hidden layers, 1024 units/hidden layer;
                    ReLU (hidden), linear (output)
DMDN architecture   4 hidden layers, 1024 units/hidden layer;
                    ReLU [28] (hidden), mixture density (output); 1–16 mix
Optimization        AdaDec [29] (a variant of AdaGrad [30]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

  HMM        1 mix     3.537 ± 0.113
             2 mix     3.397 ± 0.115

  DNN        4×1024    3.635 ± 0.127
             5×1024    3.681 ± 0.109
             6×1024    3.652 ± 0.108
             7×1024    3.637 ± 0.129

  DMDN       1 mix     3.654 ± 0.117
  (4×1024)   2 mix     3.796 ± 0.107
             4 mix     3.766 ± 0.113
             8 mix     3.805 ± 0.113
             16 mix    3.791 ± 0.102

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 60: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

DNN-based acoustic modeling for SPSS [18]

Acoustic

features y

Linguistic

features x

h1

h2

h3

bull DNN represents conditional distribution of y given x

bull DNN replaces decision trees and GMMs

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 50 of 79

Framework

Input layer Output layerHidden layers

TEXT

SPEECH Parametergeneration

Waveformsynthesis

Inpu

t fea

ture

s in

clud

ing

bina

ry amp

num

eric

feat

ures

at f

ram

e 1

Inpu

t fea

ture

s in

clud

ing

bina

ry amp

num

eric

fe

atur

es a

t fra

me T

Textanalysis

Input featureextraction

Statistics (m

ean amp var) of speech param

eter vector sequence

Binaryfeatures

NumericfeaturesDuration

featureFrame position

feature

Spectralfeatures

ExcitationfeaturesVUVfeature

Durationprediction

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 51 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Framework

Is this new no

bull NN [19]

bull RNN [20]

Whatrsquos the difference

bull More layers data computational resources

bull Better learning algorithm

bull Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79


Experimental setup

Database: US English, female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Example of speech parameter trajectories

w/o grouping questions & numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum coefficient over frames 0–500, comparing natural speech, HMM (α=1), and DNN (4×512) trajectories]

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar # of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    -9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    -5.1
12.7 (1)    36.6 (4 × 1,024)       50.7      < 10⁻⁶    -11.5

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples form two branches in (y1, y2) space; the NN prediction runs between them]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained by MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − Parameter generation algorithm utilizes variances

Linear output layer → Mixture density output layer [26]
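The conditional-mean behaviour described above is easy to reproduce numerically: fit a constant to one-to-many data under squared error and the optimum lands between the modes, matching neither. A minimal numpy sketch with synthetic two-mode data (not data from the experiments in these slides):

```python
import numpy as np

# One-to-many data: the same input always maps to one of two target modes,
# -1 or +1 ("different ways to speak the same text").
y = np.array([-1.0, +1.0] * 500)

# The constant predictor minimising MSE is the conditional mean of y;
# grid-search it to make the point explicit.
mse = lambda c: np.mean((y - c) ** 2)
candidates = np.linspace(-1.5, 1.5, 301)
best = min(candidates, key=mse)

# The MSE-optimal prediction sits between the modes and matches neither.
print(best)  # ~0: the conditional mean, not a valid mode
```

The same collapse happens for a trained network whenever the target distribution given the input is multimodal, which motivates the mixture density output layer on the next slides.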


Mixture density network [26]

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

1-dim, 2-mix MDN: the output layer has 6 units z1 ... z6, where the inputs of the activation functions are

  z_j = Σ_{i=1}^{4} h_i w_ij

and the activations map them to GMM parameters:

  Weights → softmax activation:       w1(x) = exp(z1) / (exp(z1) + exp(z2)),  w2(x) = exp(z2) / (exp(z1) + exp(z2))
  Means → linear activation:          μ1(x) = z3,  μ2(x) = z4
  Variances → exponential activation: σ1(x) = exp(z5),  σ2(x) = exp(z6)
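The 1-dim, 2-mix case above can be written out directly. A small numpy sketch; the raw outputs z here are hypothetical values, not trained network outputs:

```python
import numpy as np

def mdn_params(z):
    """Map the 6 raw outputs z1..z6 of a 1-dim, 2-mix MDN to GMM parameters,
    using the activations on the slide: softmax for the weights, linear for
    the means, exponential for the (positive) standard deviations."""
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()   # w1(x), w2(x): sum to 1
    mu = z[2:4]                                # mu1(x), mu2(x)
    sigma = np.exp(z[4:6])                     # sigma1(x), sigma2(x) > 0
    return w, mu, sigma

def gmm_nll(y, w, mu, sigma):
    """Negative log-likelihood of a scalar target y under the 2-component
    GMM: the loss an MDN is trained to minimise."""
    dens = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(dens.sum())

# Hypothetical outputs: equal weights, means at -1 and +1, sigma = 0.25.
z = np.array([0.0, 0.0, -1.0, 1.0, np.log(0.25), np.log(0.25)])
w, mu, sigma = mdn_params(z)
print(w, mu, sigma)
# Both modes are likelier than their average, unlike a mean-only predictor:
print(gmm_nll(-1.0, w, mu, sigma) < gmm_nll(0.0, w, mu, sigma))  # True
```

The exponential on the variances guarantees positivity and the softmax guarantees valid mixture weights, so the network output is always a well-formed density.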

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, ..., xT fed to a deep MDN, which outputs GMM parameters w_m(x_t), μ_m(x_t), σ_m(x_t) for each frame → parameter generation → waveform synthesis → SPEECH]

Experimental setup

• Almost the same as the previous setup
• Differences:

DMDN architecture: 4 hidden layers, 1024 units/hidden layer, ReLU [28] (hidden), mixture density (output), 1–16 mix
DNN architecture: 4–7 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output)
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix     3.537 ± 0.113
               2 mix     3.397 ± 0.115
DNN            4×1024    3.635 ± 0.127
               5×1024    3.681 ± 0.109
               6×1024    3.652 ± 0.108
               7×1024    3.637 ± 0.129
DMDN (4×1024)  1 mix     3.654 ± 0.117
               2 mix     3.796 ± 0.107
               4 mix     3.766 ± 0.113
               8 mix     3.805 ± 0.113
               16 mix    3.791 ± 0.102

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) are used as inputs
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]
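The "smoothing using dynamic feature constraints" mentioned above is the speech parameter generation algorithm [9]: pick the static trajectory c maximising N(Wc; μ, Σ), where W stacks static and delta windows. A toy 1-dimensional numpy sketch under simplifying assumptions (unit precisions, a single delta window, made-up means), not the paper's configuration:

```python
import numpy as np

T = 5
mu_s = np.array([0.0, 0.0, 1.0, 1.0, 1.0])   # hypothetical static means (a step)
mu_d = np.zeros(T)                            # delta means: prefer no change
prec = np.ones(2 * T)                         # unit precisions for simplicity

# W: rows 0..T-1 copy the statics; rows T..2T-1 apply the delta window
# delta_c[t] = 0.5 * (c[t+1] - c[t-1]), truncated at the boundaries.
W = np.zeros((2 * T, T))
W[:T, :T] = np.eye(T)
for t in range(T):
    if t - 1 >= 0:
        W[T + t, t - 1] = -0.5
    if t + 1 < T:
        W[T + t, t + 1] = 0.5

mu = np.concatenate([mu_s, mu_d])
# Closed-form ML solution: c = (W' P W)^-1 W' P mu
A = W.T @ (prec[:, None] * W)
c = np.linalg.solve(A, W.T @ (prec * mu))
print(np.round(c, 3))  # a smoothed version of the 0 -> 1 step
```

The delta constraint pulls neighbouring frames toward each other, so the generated trajectory is smoother than the per-frame means; an RNN removes the need for this post-hoc smoothing by making the mapping itself temporal.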


Basic RNN

[Figure: input x_t → hidden layer with recurrent connections → output y_t, unrolled over t−1, t, t+1]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
→ long short-term memory (LSTM) RNN [32]
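The recurrence above can be sketched as a plain Elman-style forward pass; weights here are random placeholders, not a trained model:

```python
import numpy as np

# Forward pass of a basic RNN: through the recurrent connections, the hidden
# state h_t depends on h_{t-1} as well as x_t, so every previous frame can
# influence the current output y_t.
def rnn_forward(xs, Wxh, Whh, Why):
    h = np.zeros(Whh.shape[0])
    ys = []
    for x in xs:                        # frames in temporal order
        h = np.tanh(Wxh @ x + Whh @ h)  # h_t = tanh(Wxh x_t + Whh h_{t-1})
        ys.append(Why @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
Wxh = 0.5 * rng.normal(size=(4, 3))
Whh = 0.5 * rng.normal(size=(4, 4))
Why = 0.5 * rng.normal(size=(2, 4))
xs = rng.normal(size=(6, 3))            # 6 frames of 3-dim inputs

ys = rnn_forward(xs, Wxh, Whh, Why)
# Perturbing an early frame changes the outputs of *later* frames,
# which a frame-by-frame DNN mapping cannot do.
xs2 = xs.copy()
xs2[0] += 1.0
ys2 = rnn_forward(xs2, Wxh, Whh, Why)
print(np.abs(ys2[1:] - ys[1:]).max())   # positive: the change propagated
```

The same loop also illustrates the weakness noted on the slide: the early frame's influence passes through repeated tanh/matrix products, so it decays or is overwritten over long spans, which is what the LSTM's gated memory cell addresses.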

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
  − Input gate: write
  − Output gate: read
  − Forget gate: reset

[Figure: LSTM block with memory cell c_t; sigmoid input, forget, and output gates, each fed by x_t and h_{t−1} with biases b_i, b_f, b_o; tanh non-linearities on the cell input and output]
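One step of the block above can be sketched as follows (a peephole-free variant with random placeholder weights; the systems in these slides may use a different LSTM variant):

```python
import numpy as np

sigm = lambda a: 1.0 / (1.0 + np.exp(-a))

# One step of an LSTM block: input, forget and output gates (sigmoids)
# control writing to, resetting, and reading from the linear memory cell c_t.
def lstm_step(x, h_prev, c_prev, W, b):
    z = W @ np.concatenate([x, h_prev]) + b  # all four parts in one matrix
    n = len(c_prev)
    i = sigm(z[0 * n:1 * n])                 # input gate  (write)
    f = sigm(z[1 * n:2 * n])                 # forget gate (reset)
    o = sigm(z[2 * n:3 * n])                 # output gate (read)
    g = np.tanh(z[3 * n:4 * n])              # candidate cell input
    c = f * c_prev + i * g                   # linear memory cell update
    h = o * np.tanh(c)                       # exposed hidden state
    return h, c

rng = np.random.default_rng(1)
n_in, n_cell = 3, 4
W = 0.1 * rng.normal(size=(4 * n_cell, n_in + n_cell))
b = np.zeros(4 * n_cell)
h, c = np.zeros(n_cell), np.zeros(n_cell)
for x in rng.normal(size=(10, n_in)):        # run a short sequence
    h, c = lstm_step(x, h, c, W, b)
print(h.shape, c.shape)
```

Because the cell update `c = f * c_prev + i * g` is additive rather than squashed through a non-linearity, gradients and information can survive over many more steps than in the basic RNN.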

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, ..., xT fed to an LSTM, producing outputs y1, y2, ..., yT → parameter generation → waveform synthesis → SPEECH]

Experimental setup

Database: US English, female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer, ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection, asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z      p
50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1    < 10⁻⁶
15.8       –           34.0        –            50.2      -6.2   < 10⁻⁹
28.4       –           –           33.6         38.0      -1.5   0.138

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learn the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Audio Speech Lang. Process., 20(6):1713–1724, 2012.

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.


Framework

[Figure: DNN-based SPSS framework. TEXT → text analysis → input feature extraction → input features, including binary & numeric features, duration feature, and frame-position feature, at frames 1 ... T → input layer → hidden layers → output layer → statistics (mean & var) of the speech parameter vector sequence (spectral features, excitation features, V/UV feature) → parameter generation → waveform synthesis → SPEECH; duration prediction supplies the per-frame alignment]
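The per-frame mapping in this framework can be sketched as a plain feed-forward pass. The layer sizes and feature dimensions below are illustrative placeholders rather than the configurations evaluated in these slides, and the weights are random instead of trained:

```python
import numpy as np

relu = lambda a: np.maximum(a, 0.0)

# Frame-level forward pass of the DNN acoustic model: linguistic features for
# each frame (binary context answers, numeric contexts, frame position) go in;
# statistics of the speech parameter vector come out.
def dnn_acoustic_model(linguistic, weights):
    h = linguistic
    for W, b in weights[:-1]:
        h = relu(h @ W.T + b)          # hidden layers
    W, b = weights[-1]
    return h @ W.T + b                 # linear output: acoustic statistics

rng = np.random.default_rng(2)
n_ling, n_hid, n_out, T = 300, 512, 127, 20   # hypothetical sizes
sizes = [n_ling, n_hid, n_hid, n_hid, n_hid, n_out]
weights = [(0.01 * rng.normal(size=(o, i)), np.zeros(o))
           for i, o in zip(sizes[:-1], sizes[1:])]

frames = rng.normal(size=(T, n_ling))          # one input vector per frame
stats = dnn_acoustic_model(frames, weights)
print(stats.shape)                             # (20, 127): per-frame outputs
```

At synthesis time the per-frame outputs are treated as the mean/variance statistics fed to the parameter generation step, followed by waveform synthesis via the vocoder.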


[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 62: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Advantages of NN-based acoustic modeling

• Integrating feature extraction
  − Can model high-dimensional, highly correlated features efficiently
  − Layered architecture w/ non-linear operations
    → integrates feature extraction into acoustic modeling

• Distributed representation
  − Can be exponentially more efficient than a fragmented representation
  − Better representation ability with fewer parameters

• Layered hierarchical structure in speech production
  − concept → linguistic → articulatory → waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 52 of 79


Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 53 of 79


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state, left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 54 of 79

Example of speech parameter trajectories

w/o grouping questions, numeric contexts, silence frames removed

[Figure: 5-th mel-cepstrum trajectories over frames 0–500, comparing natural speech, HMM (α=1), and DNN (4×512)]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM (α)     DNN (layers × units)   Neutral   p value   z value
15.8 (16)   38.5 (4 × 256)         45.7      < 10⁻⁶    -9.9
16.1 (4)    27.2 (4 × 512)         56.8      < 10⁻⁶    -5.1
12.7 (1)    36.6 (4 × 1024)        50.7      < 10⁻⁶    -11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples vs. NN prediction in the (y1, y2) plane; the prediction falls between the data clusters]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − NN trained with MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 58 of 79
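The unimodality point can be checked numerically: for one-to-many data, the constant output minimizing squared error is the mean of the targets, which lands between the modes and matches neither. A minimal sketch (illustrative, not from the slides):

```python
import numpy as np

# One-to-many data: for the same input, targets cluster around -1 and +1.
rng = np.random.default_rng(0)
targets = np.concatenate([rng.normal(-1.0, 0.1, 500),
                          rng.normal(+1.0, 0.1, 500)])

# An MSE-trained predictor approximates the conditional mean, so for this
# input it would output a value near 0: between the modes, matching neither.
mse_optimal_output = targets.mean()
nearest_mode_error = min(abs(mse_optimal_output - (-1.0)),
                         abs(mse_optimal_output - (+1.0)))
```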

Mixture density network [26]

1-dim, 2-mix MDN:

Inputs of the activation functions:  z_j = Σ_{i=1}^{4} h_i w_{ij}

Weights   → softmax activation:      w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m),   w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
Means     → linear activation:       μ_1(x) = z_3,   μ_2(x) = z_4
Variances → exponential activation:  σ_1(x) = exp(z_5),   σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 59 of 79
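The mapping above can be sketched in NumPy for the 1-dim, 2-mix case; `mdn_params` and `mdn_nll` are illustrative names, not from the slides:

```python
import numpy as np

def mdn_params(z):
    """Map the six raw outputs z1..z6 of a 1-dim, 2-mix MDN to mixture
    parameters: softmax -> weights, linear -> means, exp -> std devs."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z[0:2] - z[0:2].max())   # numerically stabilized softmax
    w = e / e.sum()                     # w1(x), w2(x): sum to 1
    mu = z[2:4]                         # mu1(x), mu2(x)
    sigma = np.exp(z[4:6])              # sigma1(x), sigma2(x): always > 0
    return w, mu, sigma

def mdn_nll(y, w, mu, sigma):
    """Negative log-likelihood of a scalar target y under the mixture;
    this is the loss minimized when training an MDN."""
    pdf = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(pdf.sum())

# Example: raw outputs giving equal weights, means at +1 and -1, unit std devs.
w, mu, sigma = mdn_params([0.0, 0.0, 1.0, -1.0, 0.0, 0.0])
```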

DMDN-based SPSS [27]

[Figure: TEXT → text analysis → input feature extraction → x_1, x_2, …, x_T; a deep mixture density network maps each x_t to mixture weights w(x_t), means μ(x_t), and variances σ(x_t); duration prediction provides the frame-level alignment; parameter generation and waveform synthesis produce SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 60 of 79

Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix
Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

HMM            1 mix    3.537 ± 0.113
               2 mix    3.397 ± 0.115
DNN            4×1024   3.635 ± 0.127
               5×1024   3.681 ± 0.109
               6×1024   3.652 ± 0.108
               7×1024   3.637 ± 0.129
DMDN (4×1024)  1 mix    3.654 ± 0.117
               2 mix    3.796 ± 0.107
               4 mix    3.766 ± 0.113
               8 mix    3.805 ± 0.113
               16 mix   3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 64 of 79


Basic RNN

[Figure: RNN with recurrent connections, unrolled over time: inputs x_{t-1}, x_t, x_{t+1} → outputs y_{t-1}, y_t, y_{t+1}]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 65 of 79
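A minimal sketch of the basic (unidirectional) RNN above, with illustrative parameter names; note that only past inputs can influence y_t, which is what the bidirectional variant fixes:

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Unrolled forward pass of a basic RNN:
    h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h),  y_t = W_hy h_t + b_y."""
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x in xs:                      # left-to-right: only previous context
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        ys.append(W_hy @ h + b_y)
    return np.stack(ys)

# Tiny example: 3 input dims, 4 hidden units, 2 output dims, 5 frames.
rng = np.random.default_rng(1)
xs = rng.standard_normal((5, 3))
ys = rnn_forward(xs,
                 rng.standard_normal((4, 3)), rng.standard_normal((4, 4)),
                 rng.standard_normal((2, 4)), np.zeros(4), np.zeros(2))
```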

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units
  − Input gate: write
  − Output gate: read
  − Forget gate: reset

[Figure: LSTM block with memory cell c_t; tanh input and output non-linearities; sigmoid input, forget, and output gates, each fed by x_t and h_{t-1}]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 66 of 79
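The gating above can be written out directly; a sketch of one standard LSTM step (peephole-free, illustrative parameter names), where sigmoid gates multiplicatively control writing to, resetting, and reading from a linear memory cell:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, p):
    """One step of an LSTM cell; p holds weight matrices W_* and biases b_*."""
    z = np.concatenate([x, h_prev])             # [x_t; h_{t-1}]
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate  (write)
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate (reset)
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate (read)
    g = np.tanh(p["W_c"] @ z + p["b_c"])        # candidate cell input
    c = f * c_prev + i * g                      # linear memory cell update
    h = o * np.tanh(c)                          # exposed cell output
    return h, c

# Tiny example: 3 input dims, 4 cell units (so z has 3 + 4 = 7 dims).
rng = np.random.default_rng(2)
p = {k: rng.standard_normal((4, 7)) for k in ("W_i", "W_f", "W_o", "W_c")}
p.update({k: np.zeros(4) for k in ("b_i", "b_f", "b_o", "b_c")})
h, c = lstm_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), p)
```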

LSTM-based SPSS [33, 34]

[Figure: TEXT → text analysis → input feature extraction → x_1, x_2, …, x_T → LSTM → y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH, with duration prediction providing the frame-level alignment]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 67 of 79

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN: 449, LSTM: 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 68 of 79

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

DNN               LSTM              Stats
w/ ∆    w/o ∆     w/ ∆    w/o ∆    Neutral   z      p
50.0    14.2      –       –        35.8      12.0   < 10⁻¹⁰
–       –         30.2    15.6     54.2      5.1    < 10⁻⁶
15.8    –         34.0    –        50.2      -6.2   < 10⁻⁹
28.4    –         –       33.6     38.0      -1.5   0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 69 of 79

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 63: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Advantages of NN-based acoustic modeling

bull Integrating feature extraction

minus Can model high-dimensional highly correlated features efficientlyminus Layered architecture w non-linear operationsrarr Integrated feature extraction to acoustic modeling

bull Distributed representation

minus Can be exponentially more efficient than fragmentedrepresentation

minus Better representation ability with fewer parameters

bull Layered hierarchical structure in speech production

minus concept rarr linguistic rarr articulatory rarr waveform

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 52 of 79

Framework

Is this new no

bull NN [19]

bull RNN [20]

Whatrsquos the difference

bull More layers data computational resources

bull Better learning algorithm

bull Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79

Framework

Is this new no

bull NN [19]

bull RNN [20]

Whatrsquos the difference

bull More layers data computational resources

bull Better learning algorithm

bull Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79

Experimental setup

Database US English female speaker

Training test data 33000 amp 173 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic 11 categorical featuresfeatures 25 numeric features

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2

HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]

DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

wo grouping questions numeric contexts silence frames removed

-1

0

1

0 100 200 300 400 500

5-t

h M

el-ce

pstr

um

Frame

Natural speechHMM (α=1)DNN (4x512)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones withsimilar of parameters

bull Paired comparison test

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

HMM DNN(α) (layers times units) Neutral p value z value

158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)w2(x1)

y

Weights rarr Softmax activation function

Means rarr Linear activation function

Variances rarr Exponential activation function

sumzj =

4

i=1

hiwij

w1(x) =exp(z1)sum2

m=1 exp(zm)

micro1(x) = z3

σ1(x) = exp(z5)

Inputs of activation function

1-dim 2-mix MDN

w2(x) =exp(z2)sum2

m=1 exp(zm)

micro1(x) = z4

σ2(x) = exp(z6)

NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)

w2(x1)

y

x1

micro1(x2 ) micro2(x2 )

σ2(x2 )σ1(x2 )

w1(x2 ) w2(x2 )

y

x2

micro1(xT ) micro2(xT )

σ1(xT )σ2(xT )

w2(xT )

w1(xT )

y

xT

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)

DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)

1ndash16 mix

Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled in time — inputs x_{t−1}, x_t, x_{t+1} feed a hidden layer with recurrent connections, producing outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
− Information in hidden layers loops through the recurrent connections → quickly decays over time
− Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]
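The recurrence behind this slide can be sketched in a few lines of NumPy. This is a minimal Elman-style RNN, not the slide's actual model, and the weight names are illustrative; it makes the limitation concrete: at time t, the hidden state has only seen inputs up to t.

```python
import numpy as np

def rnn_forward(xs, Wx, Wh, b):
    """Minimal (Elman) RNN: the hidden state h_t depends on the current
    input x_t and the previous state h_{t-1}, so at time t the network
    has only seen *previous* context -- the limitation noted above."""
    h = np.zeros(Wh.shape[0])
    hs = []
    for x in xs:                      # left-to-right over the sequence
        h = np.tanh(Wx @ x + Wh @ h + b)
        hs.append(h)
    return np.stack(hs)               # shape (T, n_hidden)

rng = np.random.default_rng(0)
T, n_in, n_hid = 6, 2, 4
hs = rnn_forward(rng.standard_normal((T, n_in)),
                 0.5 * rng.standard_normal((n_hid, n_in)),
                 0.5 * rng.standard_normal((n_hid, n_hid)),
                 np.zeros(n_hid))
```

A bidirectional RNN simply runs a second pass right-to-left and concatenates the two hidden sequences, giving each frame access to future context as well.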

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — a linear memory cell c_t with tanh input/output nonlinearities; sigmoid input ("write"), forget ("reset"), and output ("read") gates, each fed by x_t, h_{t−1}, and a bias b]
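The gating in the figure amounts to a handful of elementwise operations. Below is a single-step sketch following [32] (no peephole connections); the parameter dictionary and its key names are illustrative, not from the slides.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: sigmoid input (write), forget (reset), and output
    (read) gates around a linear memory cell c, as in the figure."""
    i = sigm(p["Wi"] @ x_t + p["Ri"] @ h_prev + p["bi"])      # write gate
    f = sigm(p["Wf"] @ x_t + p["Rf"] @ h_prev + p["bf"])      # reset gate
    o = sigm(p["Wo"] @ x_t + p["Ro"] @ h_prev + p["bo"])      # read gate
    g = np.tanh(p["Wc"] @ x_t + p["Rc"] @ h_prev + p["bc"])   # candidate
    c = f * c_prev + i * g        # linear memory cell update
    h = o * np.tanh(c)            # gated output
    return h, c

rng = np.random.default_rng(1)
n_in, n_hid = 3, 5
p = {k + g: rng.standard_normal((n_hid, n_in if k == "W" else n_hid))
     for k in ("W", "R") for g in "ifoc"}
p.update({"b" + g: np.zeros(n_hid) for g in "ifoc"})
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), p)
```

Because the cell update is additive (c = f·c_prev + i·g) rather than repeatedly squashed, information can persist across many steps instead of decaying as in the basic RNN.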

LSTM-based SPSS [33, 34]

[Figure: pipeline — TEXT → text analysis → input feature extraction (x_1, x_2, …, x_T) → duration prediction → LSTM → parameter generation (y_1, y_2, …, y_T) → waveform synthesis → SPEECH]

Experimental setup

Database: US English female speaker
Train / dev set data: 34,632 & 100 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: DNN 449; LSTM 289
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing: postfiltering in cepstrum domain [25]

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

DNN w/ ∆ 50.0 vs. DNN w/o ∆ 14.2, neutral 35.8 (z = 12.0, p < 10⁻¹⁰)
LSTM w/ ∆ 30.2 vs. LSTM w/o ∆ 15.6, neutral 54.2 (z = 5.1, p < 10⁻⁶)
DNN w/ ∆ 15.8 vs. LSTM w/ ∆ 34.0, neutral 50.2 (z = −6.2, p < 10⁻⁹)
DNN w/o ∆ 28.4 vs. LSTM w/o ∆ 33.6, neutral 38.0 (z = −1.5, p = 0.138)
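The slides report z and p values for these crowd-sourced preference tests without stating the statistic used. One common analysis for such forced-choice data is a two-sided sign test over the non-neutral ratings with a normal approximation; the counts below are purely illustrative, so the numbers will not reproduce the table.

```python
import math

def sign_test_z(n_a, n_b):
    """Two-sided sign test on non-neutral ratings: under the null of no
    preference, wins for A ~ Binomial(n_a + n_b, 0.5); the normal
    approximation gives z = (n_a - n_b) / sqrt(n_a + n_b)."""
    return (n_a - n_b) / math.sqrt(n_a + n_b)

def two_sided_p(z):
    # Two-sided p-value from the standard normal tail.
    return math.erfc(abs(z) / math.sqrt(2.0))

z = sign_test_z(500, 142)   # illustrative counts, not the slide's data
p = two_sided_p(z)
```

With strongly lopsided counts like these, p is far below any conventional threshold, which is why the table can simply report bounds such as p < 10⁻⁶.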

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)


Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
− Flexible (e.g. adaptation, interpolation)
− Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
− Learns a mapping from linguistic features to acoustic ones
− Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.


Advantages of NN-based acoustic modeling

• Integrating feature extraction
− Can model high-dimensional, highly correlated features efficiently
− Layered architecture with non-linear operations → integrates feature extraction into acoustic modeling

• Distributed representation
− Can be exponentially more efficient than a fragmented representation
− Better representation ability with fewer parameters

• Layered, hierarchical structure in speech production
− concept → linguistic → articulatory → waveform

Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques


Experimental setup

Database: US English female speaker
Training / test data: 33,000 & 173 sentences
Sampling rate: 16 kHz
Analysis window: 25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology: 5-state left-to-right HSMM [21]; MSD F0 [22], MDL [23]
DNN architecture: 1–5 layers, 256/512/1024/2048 units/layer; sigmoid, continuous F0 [24]
Postprocessing: postfiltering in cepstrum domain [25]

Example of speech parameter trajectories

(w/o grouping questions, numeric contexts; silence frames removed)

[Figure: 5th mel-cepstrum coefficient vs. frame (0–500) for natural speech, HMM (α = 1), and DNN (4×512)]

Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters.

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

HMM (α = 16) 15.8 vs. DNN (4×256) 38.5, neutral 45.7 (z = −9.9, p < 10⁻⁶)
HMM (α = 4) 16.1 vs. DNN (4×512) 27.2, neutral 56.8 (z = −5.1, p < 10⁻⁶)
HMM (α = 1) 12.7 vs. DNN (4×1024) 36.6, neutral 50.7 (z = −11.5, p < 10⁻⁶)


Limitations of DNN-based acoustic modeling

[Figure: data samples in the (y1, y2) plane form two clusters; the NN prediction falls between them]

• Unimodality
− Humans can speak in different ways → one-to-many mapping
− An NN trained with MSE loss → approximates the conditional mean

• Lack of variance
− DNN-based SPSS uses variances computed from all training data
− The parameter generation algorithm utilizes variances

Linear output layer → mixture density output layer [26]
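The unimodality problem is easy to demonstrate numerically: with a bimodal target, the MSE-optimal prediction is the mean, a value the data never takes. This is a toy sketch, not tied to any speech data.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-to-many mapping: for the same input, the target is -1 or +1 with
# equal probability (two "ways of speaking" the same sentence).
y = rng.choice([-1.0, 1.0], size=10_000)

# The MSE-optimal constant predictor is the sample mean ...
y_hat = y.mean()

# ... which sits near 0, between the two modes: a value never observed
# in the data, mirroring the figure's "NN prediction" point.
```

A mixture density output layer avoids this by predicting a full distribution, so both modes can be represented instead of averaged away.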


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the output units z_1 … z_6 parameterize the mixture weights w_1(x), w_2(x), means µ_1(x), µ_2(x), and standard deviations σ_1(x), σ_2(x) of a GMM over y]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

For the 1-dim, 2-mix MDN:

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m),   w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
µ_1(x) = z_3,   µ_2(x) = z_4
σ_1(x) = exp(z_5),   σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
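The three activation functions can be written out directly for the slide's 1-dim, 2-mix case. Random values stand in for a trained network's hidden activations and output weights; only the activation structure is from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Last hidden layer (4 units) and output-layer weights of a 1-dim,
# 2-mix MDN: six output units z_1..z_6, as in the slide's figure.
h = rng.standard_normal(4)
W = rng.standard_normal((4, 6))
z = h @ W                                  # z_j = sum_i h_i * w_ij

w = np.exp(z[:2]) / np.exp(z[:2]).sum()    # mixture weights: softmax over z_1, z_2
mu = z[2:4]                                # means: linear (z_3, z_4)
sigma = np.exp(z[4:6])                     # std devs: exponential (z_5, z_6), always > 0

# The network thus outputs a full GMM p(y|x) = sum_i w_i(x) N(y; mu_i(x), sigma_i(x)^2).
```

The softmax guarantees the weights sum to one and the exponential guarantees positive variances, so any output vector z yields a valid density.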

DMDN-based SPSS [27]

[Figure: pipeline — TEXT → text analysis → input feature extraction (x_1, x_2, …, x_T) → duration prediction → DMDN, which outputs per-frame GMM parameters w_i(x_t), µ_i(x_t), σ_i(x_t) → parameter generation → waveform synthesis → SPEECH]




Framework

Is this new no

bull NN [19]

bull RNN [20]

Whatrsquos the difference

bull More layers data computational resources

bull Better learning algorithm

bull Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79

Framework

Is this new no

bull NN [19]

bull RNN [20]

Whatrsquos the difference

bull More layers data computational resources

bull Better learning algorithm

bull Statistical parametric speech synthesis techniques

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 53 of 79

Experimental setup

Database US English female speaker

Training test data 33000 amp 173 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic 11 categorical featuresfeatures 25 numeric features

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2

HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]

DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

wo grouping questions numeric contexts silence frames removed

-1

0

1

0 100 200 300 400 500

5-t

h M

el-ce

pstr

um

Frame

Natural speechHMM (α=1)DNN (4x512)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones withsimilar of parameters

bull Paired comparison test

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

HMM DNN(α) (layers times units) Neutral p value z value

158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

[Figure: 1-dim, 2-mix MDN — the output layer emits mixture weights w1(x), w2(x), means µ1(x), µ2(x), and variances σ1(x), σ2(x) for a target y]

Weights   → softmax activation function
Means     → linear activation function
Variances → exponential activation function

Inputs of the activation functions:  z_j = Σ_{i=1}^{4} h_i w_{ij}

w_1(x) = exp(z_1) / Σ_{m=1}^{2} exp(z_m)     w_2(x) = exp(z_2) / Σ_{m=1}^{2} exp(z_m)
µ_1(x) = z_3                                 µ_2(x) = z_4
σ_1(x) = exp(z_5)                            σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
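The slide's 1-dim, 2-mix output layer can be written out directly — softmax over the weight units, identity for the means, exponential for the standard deviations (unit numbering z1…z6 as on the slide):

```python
import numpy as np

def mdn_split(z):
    """Map raw output-layer activations z = (z1..z6) of a 1-dim, 2-mix MDN
    onto valid GMM parameters, as in Bishop's mixture density network."""
    z = np.asarray(z, dtype=float)
    zw, zmu, zsigma = z[0:2], z[2:4], z[4:6]
    w = np.exp(zw - zw.max())          # softmax (shifted for numerical stability)
    w = w / w.sum()                    # mixture weights: positive, sum to 1
    mu = zmu                           # means: linear activation
    sigma = np.exp(zsigma)             # std devs: exponential keeps them > 0
    return w, mu, sigma

w, mu, sigma = mdn_split([0.5, -0.5, 1.2, -0.7, 0.0, -1.0])
print(w, mu, sigma)
```

Whatever the raw activations are, the softmax/linear/exponential split guarantees a well-formed mixture: weights on the simplex and strictly positive variances.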


DMDN-based SPSS [27]

[Figure: DMDN-based TTS pipeline — TEXT → text analysis → input feature extraction → duration prediction → per-frame inputs x1, x2, …, xT, each mapped by the DMDN to mixture weights w_m(xt), means µ_m(xt), and variances σ_m(xt) over y → parameter generation → waveform synthesis → SPEECH]
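Training a (D)MDN minimizes the negative log-likelihood of each target frame under the predicted mixture. A sketch of the per-frame loss for scalar, 2-component output (illustrative numbers, not the experimental configuration):

```python
import math

def gmm_nll(y, w, mu, sigma):
    """-log sum_m w_m N(y; mu_m, sigma_m^2) for one scalar target y."""
    likelihood = sum(
        wm / (math.sqrt(2 * math.pi) * sm)
        * math.exp(-0.5 * ((y - mm) / sm) ** 2)
        for wm, mm, sm in zip(w, mu, sigma)
    )
    return -math.log(likelihood)

# A target sitting on one of the modes is much cheaper than one between them,
# unlike MSE, which would prefer the (invalid) average of the modes.
on_mode = gmm_nll(1.0, w=[0.5, 0.5], mu=[1.0, -1.0], sigma=[0.2, 0.2])
between = gmm_nll(0.0, w=[0.5, 0.5], mu=[1.0, -1.0], sigma=[0.2, 0.2])
print(on_mode < between)
```

This is why the mixture density output layer handles one-to-many mappings: the loss rewards matching any mode, not their average.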


Experimental setup

• Almost the same as the previous setup
• Differences:

DNN architecture:  4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output); 1–16 mix
Optimization:      AdaDec [29] (a variant of AdaGrad [30]) on GPU


Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM       1 mix    3.537 ± 0.113
            2 mix    3.397 ± 0.115
  DNN       4×1024   3.635 ± 0.127
            5×1024   3.681 ± 0.109
            6×1024   3.652 ± 0.108
            7×1024   3.637 ± 0.129
  DMDN      1 mix    3.654 ± 0.117
  (4×1024)  2 mix    3.796 ± 0.107
            4 mix    3.766 ± 0.113
            8 mix    3.805 ± 0.113
            16 mix   3.791 ± 0.102
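The ± figures in tables like this are confidence intervals around the MOS mean. With a normal approximation to the mean of the scores, they come out of standard-error arithmetic; a sketch with made-up ratings (not the study's data):

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a ~95% normal-approximation confidence interval."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)                    # z * standard error
    return mean, half_width

mean, hw = mos_with_ci([4, 3, 5, 4, 3, 4, 2, 5, 4, 3])
print(f"{mean:.3f} +/- {hw:.3f}")
```

With the narrow intervals on the slide (hundreds of ratings), the DMDN systems' advantage over the 1-mix baselines is outside the overlap of the intervals.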


Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
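The "fixed time span" limitation is concrete in how the input vector is built: a fixed window of contexts is stacked around the current unit, so anything outside the window is invisible to the network. A toy sketch with hypothetical one-hot phoneme features (the real linguistic features are far richer):

```python
import numpy as np

PHONES = ["sil", "h", "e", "l", "o"]

def one_hot(ph):
    v = np.zeros(len(PHONES))
    v[PHONES.index(ph)] = 1.0
    return v

def windowed_input(phones, t, span=2):
    """Stack features of phones[t-span .. t+span] into one fixed-size vector,
    padding with 'sil' at the edges - the +-2-phoneme context of the slide."""
    padded = ["sil"] * span + list(phones) + ["sil"] * span
    window = padded[t : t + 2 * span + 1]
    return np.concatenate([one_hot(p) for p in window])

x = windowed_input(["h", "e", "l", "o"], t=1)
print(x.shape)   # 5 phones x 5-dim one-hot
```

However wide the utterance, the network only ever sees 2·span+1 positions per frame; a recurrent model removes that hard cutoff.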


Basic RNN

[Figure: RNN unrolled over time — inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1}, with recurrent connections between successive hidden states]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]
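The recurrence can be made concrete in a few lines: each hidden state is a function of the current input and the previous hidden state, so context flows forward through time — but only through repeated squashing and matrix multiplication, which is why early inputs fade. A sketch (random weights, illustrative shapes only):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hidden = 3, 4
W_x = rng.normal(0, 0.5, (n_hidden, n_in))
W_h = rng.normal(0, 0.5, (n_hidden, n_hidden))
b = np.zeros(n_hidden)

def rnn_forward(xs):
    """h_t = tanh(W_x x_t + W_h h_{t-1} + b); returns all hidden states."""
    h = np.zeros(n_hidden)
    hs = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        hs.append(h)
    return np.stack(hs)

xs = rng.normal(size=(10, n_in))
hs = rnn_forward(xs)
print(hs.shape)
```

Information from x_1 reaches h_10 only through nine applications of tanh(W_h ·), which shrinks its contribution at every step — the decay problem the LSTM cell is designed to fix.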


Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block — a linear memory cell c_t with an input gate i_t (write), an output gate (read), and a forget gate (reset); each gate is a sigmoid of x_t, h_{t−1}, and a bias (b_i, b_o, b_f); the cell input and output pass through tanh, producing h_t]
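The gate equations behind the block diagram, in the standard form of [32] — a sketch only; the exact LSTM variant used in the experiments (peepholes, projection details) is not specified on the slides:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b, n_hidden):
    """One LSTM step: input/forget/output gates (sigmoid) control a linear
    memory cell; the cell input and output pass through tanh."""
    z = W @ np.concatenate([x, h_prev]) + b      # all four blocks at once
    i = sigmoid(z[0 * n_hidden : 1 * n_hidden])  # input gate  (write)
    f = sigmoid(z[1 * n_hidden : 2 * n_hidden])  # forget gate (reset)
    o = sigmoid(z[2 * n_hidden : 3 * n_hidden])  # output gate (read)
    g = np.tanh(z[3 * n_hidden : 4 * n_hidden])  # cell input
    c = f * c_prev + i * g                       # linear memory cell c_t
    h = o * np.tanh(c)                           # hidden output h_t
    return h, c

rng = np.random.default_rng(2)
n_in, n_hidden = 3, 5
W = rng.normal(0, 0.1, (4 * n_hidden, n_in + n_hidden))
b = np.zeros(4 * n_hidden)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden),
                 np.zeros(n_hidden), W, b, n_hidden)
print(h.shape, c.shape)
```

The key design point: c_t is updated additively (f·c_{t−1} + i·g), so gradients and information can survive many steps when the forget gate stays near 1, unlike the purely multiplicative path of the basic RNN.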


LSTM-based SPSS [33 34]

[Figure: LSTM-based TTS pipeline — TEXT → text analysis → input feature extraction x1, x2, …, xT → duration prediction → LSTM maps x1 … xT to acoustic features y1, y2, …, yT → parameter generation → waveform synthesis → SPEECH]


Experimental setup

Database:            US English female speaker
Train/dev set data:  34,632 & 100 sentences
Sampling rate:       16 kHz
Analysis window:     25-ms width, 5-ms shift
Linguistic features: DNN: 449; LSTM: 289
Acoustic features:   0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN:                 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM:                1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing:      Postfiltering in cepstrum domain [25]


Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

          DNN                LSTM               Stats
  w/ ∆     w/o ∆     w/ ∆     w/o ∆     Neutral    z       p
  50.0%    14.2%     –        –         35.8%      12.0    < 10⁻¹⁰
  –        –         30.2%    15.6%     54.2%      5.1     < 10⁻⁶
  15.8%    –         34.0%    –         50.2%      −6.2    < 10⁻⁹
  28.4%    –         –        33.6%     38.0%      −1.5    0.138
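One common way to score such a paired preference test is a normal-approximation sign test on the preference counts with neutral responses dropped; the slides do not spell out the exact statistic, so the sketch below is an assumption (and the counts are hypothetical — only percentages are shown):

```python
import math

def sign_test_z(n_a, n_b):
    """Normal-approximation sign test: preferences for A vs. B, ties dropped.
    Under H0 (no preference) each non-tied vote is Bernoulli(0.5), so
    (n_a - n_b) has standard deviation sqrt(n_a + n_b)."""
    n = n_a + n_b
    return (n_a - n_b) / math.sqrt(n)

# Hypothetical counts for a lopsided comparison, not the slide's raw data.
z = sign_test_z(250, 71)
print(round(z, 2))
```

A |z| near 10 or 12 corresponds to a vanishingly small p, matching the < 10⁻¹⁰-style entries; the last row's z = −1.5 (p = 0.138) is why LSTM with and without dynamic features is reported as statistically indistinguishable.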


Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)


Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)


References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.


References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Audio Speech Lang. Process., 20(6):1713–1724, 2012.


References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.


References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.


References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (Submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.


  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
  • Statistical parametric speech synthesis with neural networks
    • Deep neural network (DNN)-based SPSS
    • Deep mixture density network (DMDN)-based SPSS
    • Recurrent neural network (RNN)-based SPSS
  • Summary
Page 66: Statistical Parametric Speech Synthesis

Framework

Is this new? No:

• NN [19]
• RNN [20]

What's the difference?

• More layers, data, computational resources
• Better learning algorithms
• Statistical parametric speech synthesis techniques


Experimental setup

Database:            US English female speaker
Training/test data:  33,000 & 173 sentences
Sampling rate:       16 kHz
Analysis window:     25-ms width, 5-ms shift
Linguistic features: 11 categorical features, 25 numeric features
Acoustic features:   0–39 mel-cepstrum, log F0, 5-band aperiodicity, ∆, ∆²
HMM topology:        5-state left-to-right HSMM [21], MSD F0 [22], MDL [23]
DNN architecture:    1–5 layers, 256/512/1024/2048 units/layer, sigmoid, continuous F0 [24]
Postprocessing:      Postfiltering in cepstrum domain [25]


Example of speech parameter trajectories

w/o grouping questions, numeric contexts; silence frames removed

[Figure: 5-th mel-cepstrum trajectory over frames 0–500 — natural speech vs. HMM (α=1) vs. DNN (4×512)]


Subjective evaluations

Compared HMM-based systems with DNN-based ones with a similar number of parameters

• Paired comparison test
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM (α)      DNN (layers × units)    Neutral    p value    z value
  15.8% (16)   38.5% (4 × 256)         45.7%      < 10⁻⁶     −9.9
  16.1% (4)    27.2% (4 × 512)         56.8%      < 10⁻⁶     −5.1
  12.7% (1)    36.6% (4 × 1024)        50.7%      < 10⁻⁶     −11.5

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)w2(x1)

y

Weights rarr Softmax activation function

Means rarr Linear activation function

Variances rarr Exponential activation function

sumzj =

4

i=1

hiwij

w1(x) =exp(z1)sum2

m=1 exp(zm)

micro1(x) = z3

σ1(x) = exp(z5)

Inputs of activation function

1-dim 2-mix MDN

w2(x) =exp(z2)sum2

m=1 exp(zm)

micro1(x) = z4

σ2(x) = exp(z6)

NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)

w2(x1)

y

x1

micro1(x2 ) micro2(x2 )

σ2(x2 )σ1(x2 )

w1(x2 ) w2(x2 )

y

x2

micro1(xT ) micro2(xT )

σ1(xT )σ2(xT )

w2(xT )

w1(xT )

y

xT

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)

DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)

1ndash16 mix

Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 67: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Experimental setup

Database US English female speaker

Training test data 33000 amp 173 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic 11 categorical featuresfeatures 25 numeric features

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity ∆∆2

HMM 5-state left-to-right HSMM [21]topology MSD F0 [22] MDL [23]

DNN 1ndash5 layers 25651210242048 unitslayerarchitecture sigmoid continuous F0 [24]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 54 of 79

Example of speech parameter trajectories

wo grouping questions numeric contexts silence frames removed

-1

0

1

0 100 200 300 400 500

5-t

h M

el-ce

pstr

um

Frame

Natural speechHMM (α=1)DNN (4x512)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones withsimilar of parameters

bull Paired comparison test

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

HMM DNN(α) (layers times units) Neutral p value z value

158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN-based acoustic modeling

[Figure: data samples (y1, y2) drawn from a bimodal distribution vs. the NN prediction, which falls between the modes.]

• Unimodality
  − Humans can speak the same text in different ways → one-to-many mapping
  − An NN trained with an MSE loss → approximates the conditional mean
• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes the variances

Linear output layer → mixture density output layer [26]
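The "conditional mean" limitation above can be checked numerically. A minimal sketch with synthetic data (not from the talk): for a one-to-many target drawn from two modes, the prediction minimizing mean-squared error is the average, which lands between the modes where the data density is low.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy one-to-many data: the same input is realized as one of two modes
# (e.g. two different ways of speaking the same text).
y = np.concatenate([rng.normal(-1.0, 0.1, 500),
                    rng.normal(+1.0, 0.1, 500)])

# The constant prediction minimizing MSE is the sample mean:
# it approximates the conditional mean and matches neither mode.
y_hat = y.mean()
mse = np.mean((y - y_hat) ** 2)
```

With these modes at ±1, `y_hat` sits near 0, far from both modes, which is exactly why a unimodal MSE-trained network oversmooths.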


Mixture density network [26]

[Figure: a 1-dim, 2-mix MDN; the output layer predicts the mixture weights w₁(x), w₂(x), means μ₁(x), μ₂(x), and variances σ₁(x), σ₂(x) of a GMM over y.]

Weights → softmax activation function
Means → linear activation function
Variances → exponential activation function

Inputs of the activation functions:  zⱼ = Σᵢ₌₁⁴ hᵢ wᵢⱼ

  w₁(x) = exp(z₁) / Σₘ₌₁² exp(zₘ)     w₂(x) = exp(z₂) / Σₘ₌₁² exp(zₘ)
  μ₁(x) = z₃                          μ₂(x) = z₄
  σ₁(x) = exp(z₅)                     σ₂(x) = exp(z₆)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
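The mixture density output layer above (1-dim target, 2 mixtures, 4 hidden units) can be sketched directly; the weights `W` and hidden activations `h` here are random placeholders, not trained values.

```python
import numpy as np

def mdn_params(h, W):
    # z_j = sum_i h_i w_ij, as on the slide; W has shape (4, 6)
    z = h @ W
    e = np.exp(z[:2] - z[:2].max())
    w = e / e.sum()            # softmax    -> mixture weights
    mu = z[2:4]                # linear     -> means
    sigma = np.exp(z[4:6])     # exponential -> std devs (always positive)
    return w, mu, sigma

def mdn_density(y, w, mu, sigma):
    # GMM likelihood of a scalar target y under the predicted parameters
    comp = np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(w @ comp)

rng = np.random.default_rng(0)
h = rng.normal(size=4)
W = rng.normal(size=(4, 6))
w, mu, sigma = mdn_params(h, W)
p = mdn_density(0.0, w, mu, sigma)
```

Training maximizes this likelihood over the data instead of minimizing MSE, which lets the model keep both modes and per-frame variances.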

DMDN-based SPSS [27]

[Figure: TTS pipeline (TEXT → text analysis → input feature extraction → duration prediction; parameter generation → waveform synthesis → SPEECH). For each frame t = 1, …, T the DMDN maps the input features xₜ to the GMM weights w₁(xₜ), w₂(xₜ), means μ₁(xₜ), μ₂(xₜ), and variances σ₁(xₜ), σ₂(xₜ) over the acoustic features y.]

Experimental setup

• Almost the same as the previous setup
• Differences:

  DNN architecture:  4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
  DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output); 1–16 mix
  Optimization:      AdaDec [29] (a variant of AdaGrad [30]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)
• 173 test sentences, 5 subjects per pair
• Up to 30 pairs per subject
• Crowd-sourced

  HMM            1 mix    3.537 ± 0.113
                 2 mix    3.397 ± 0.115
  DNN            4×1024   3.635 ± 0.127
                 5×1024   3.681 ± 0.109
                 6×1024   3.652 ± 0.108
                 7×1024   3.637 ± 0.129
  DMDN (4×1024)  1 mix    3.654 ± 0.117
                 2 mix    3.796 ± 0.107
                 4 mix    3.766 ± 0.113
                 8 mix    3.805 ± 0.113
                 16 mix   3.791 ± 0.102

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects
• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
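The "dynamic feature constraints" mentioned above refer to delta features appended to the static parameters so that parameter generation can smooth across frames. A sketch of a first-order delta with a common regression window of ±1 frame (the window choice here is illustrative, not the one used in the talk):

```python
import numpy as np

def delta(c):
    # c: (T, D) static features; first-order delta 0.5*(c[t+1] - c[t-1]),
    # with edge frames padded by repetition.
    padded = np.vstack([c[:1], c, c[-1:]])
    return 0.5 * (padded[2:] - padded[:-2])

c = np.arange(10, dtype=float).reshape(-1, 1)  # a linear ramp of frames
d = delta(c)
```

For the ramp, interior deltas come out as the constant slope, which is what the generation algorithm exploits to keep trajectories smooth.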


Basic RNN

[Figure: an RNN unrolled in time; recurrent connections carry the hidden state forward, so output yₜ depends on the current input xₜ and all previous inputs xₜ₋₁, xₜ₋₂, ….]

• Only able to use previous contexts → bidirectional RNN [31]
• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]
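The unrolled computation above can be sketched as an Elman-style forward pass; the dimensions and random weights are illustrative, not from the talk. Unlike a frame-by-frame DNN, the hidden state `h` carries context from every earlier frame.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, b_h, b_y):
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x in x_seq:                                  # one step per frame
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)       # recurrent state update
        ys.append(W_hy @ h + b_y)                    # per-frame output
    return np.array(ys)

rng = np.random.default_rng(0)
n_in, n_h, n_out, T = 3, 5, 2, 10
y = rnn_forward(rng.normal(size=(T, n_in)),
                0.5 * rng.normal(size=(n_h, n_in)),
                0.5 * rng.normal(size=(n_h, n_h)),
                0.5 * rng.normal(size=(n_out, n_h)),
                np.zeros(n_h), np.zeros(n_out))
```

The repeated multiplication by `W_hh` inside the loop is precisely what makes old information decay or get overwritten, motivating the LSTM below.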

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory
• Uses linear memory cells surrounded by multiplicative gate units

[Figure: an LSTM block. The memory cell cₜ receives a tanh-squashed input computed from xₜ, hₜ₋₁, and bias b_c. Three sigmoid gates, each computed from xₜ, hₜ₋₁, and a bias (b_i, b_f, b_o), control the cell: the input gate iₜ (write), the forget gate (reset), and the output gate (read), which produces hₜ through a tanh.]
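One step of a standard LSTM cell can be sketched as follows (a generic formulation, not the talk's implementation; dimensions and weights are illustrative). The additive update of the cell state `c` is what preserves information over long time spans.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    # W packs the four transforms (input gate, forget gate, output gate,
    # cell input), each acting on [x_t, h_{t-1}]; rows = 4 * n_hidden.
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # write / reset / read gates
    c = f * c_prev + i * np.tanh(g)               # gated, additive cell update
    h = o * np.tanh(c)                            # gated read-out
    return h, c

rng = np.random.default_rng(0)
n_in, n_h = 3, 4
W = 0.5 * rng.normal(size=(4 * n_h, n_in + n_h))
b = np.zeros(4 * n_h)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_h), np.zeros(n_h), W, b)
```

With the forget gate near 1 and the input gate near 0, `c` passes through almost unchanged, which is how the cell "remembers" distant context.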

LSTM-based SPSS [33, 34]

[Figure: TTS pipeline (TEXT → text analysis → input feature extraction → duration prediction; parameter generation → waveform synthesis → SPEECH) with an LSTM mapping the input feature sequence x₁, x₂, …, x_T to the acoustic feature sequence y₁, y₂, …, y_T.]

Experimental setup

  Database:             US English female speaker
  Training / dev data:  34,632 & 100 sentences
  Sampling rate:        16 kHz
  Analysis window:      25-ms width, 5-ms shift
  Linguistic features:  DNN: 449; LSTM: 289
  Acoustic features:    0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)
  DNN:                  4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
  LSTM:                 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
  Postprocessing:       Postfiltering in cepstrum domain [25]

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%):

  DNN w/ Δ   DNN w/o Δ   LSTM w/ Δ   LSTM w/o Δ   Neutral   z      p
  50.0       14.2        –           –            35.8      12.0   < 10⁻¹⁰
  –          –           30.2        15.6         54.2      5.1    < 10⁻⁶
  15.8       –           34.0        –            50.2      -6.2   < 10⁻⁹
  28.4       –           –           33.6         38.0      -1.5   0.138

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model
• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation
• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E. Moulines and F. Charpentier. Pitch synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Audio Speech Lang. Process., 20(6):1713–1724, 2012.

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Page 68: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Example of speech parameter trajectories

wo grouping questions numeric contexts silence frames removed

-1

0

1

0 100 200 300 400 500

5-t

h M

el-ce

pstr

um

Frame

Natural speechHMM (α=1)DNN (4x512)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 55 of 79

Subjective evaluations

Compared HMM-based systems with DNN-based ones withsimilar of parameters

bull Paired comparison test

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

HMM DNN(α) (layers times units) Neutral p value z value

158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)w2(x1)

y

Weights rarr Softmax activation function

Means rarr Linear activation function

Variances rarr Exponential activation function

sumzj =

4

i=1

hiwij

w1(x) =exp(z1)sum2

m=1 exp(zm)

micro1(x) = z3

σ1(x) = exp(z5)

Inputs of activation function

1-dim 2-mix MDN

w2(x) =exp(z2)sum2

m=1 exp(zm)

micro1(x) = z4

σ2(x) = exp(z6)

NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)

w2(x1)

y

x1

micro1(x2 ) micro2(x2 )

σ2(x2 )σ1(x2 )

w1(x2 ) w2(x2 )

y

x2

micro1(xT ) micro2(xT )

σ1(xT )σ2(xT )

w2(xT )

w1(xT )

y

xT

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)

DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)

1ndash16 mix

Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 69: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Subjective evaluations

Compared HMM-based systems with DNN-based ones withsimilar of parameters

bull Paired comparison test

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

HMM DNN(α) (layers times units) Neutral p value z value

158 (16) 385 (4 times 256) 457 lt 10minus6 -99161 (4) 272 (4 times 512) 568 lt 10minus6 -51127 (1) 366 (4 times 1 024) 507 lt 10minus6 -115

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 56 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNN-based acoustic modeling

[Figure: scatter of data samples in (y1, y2) vs. the NN prediction, which falls between the modes]

• Unimodality
  − Humans can speak in different ways → one-to-many mapping
  − An NN trained with MSE loss → approximates the conditional mean

• Lack of variance
  − DNN-based SPSS uses variances computed from all training data
  − The parameter generation algorithm utilizes variances

Linear output layer → Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79
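A quick numerical sketch of the unimodality problem (hypothetical data, not from the slides): under MSE loss, the optimal predictor for a one-to-many mapping is the conditional mean, which can land in a low-density region between the modes.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-to-many data: for the same input x (all identical here), the target is
# drawn from one of two modes, y ≈ -1 or y ≈ +1 (e.g. two ways of saying a phrase).
x = np.zeros(1000)
y = rng.choice([-1.0, 1.0], size=1000) + 0.05 * rng.normal(size=1000)

# The MSE-optimal constant predictor is the conditional mean E[y | x].
y_hat = y.mean()
print(round(y_hat, 2))  # ≈ 0.0: between the two modes, where almost no data lives
```

This is exactly why the slides move from a linear output layer to a mixture density output layer, which can represent both modes explicitly.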


Mixture density network [26]

[Figure: 1-dim, 2-mix MDN; the output layer produces w1(x), w2(x), μ1(x), μ2(x), σ1(x), σ2(x)]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

Inputs of the activation functions: z_j = Σ_{i=1}^{4} h_i w_{ij}

w_1(x) = exp(z_1) / (exp(z_1) + exp(z_2))    w_2(x) = exp(z_2) / (exp(z_1) + exp(z_2))
μ_1(x) = z_3                                 μ_2(x) = z_4
σ_1(x) = exp(z_5)                            σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means, and variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
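The activation choices on this slide can be sketched directly. Below is a minimal numpy illustration of the 1-dim, 2-mix case (function and variable names are mine, not from the slides): softmax for the weights, identity for the means, exponential for the standard deviations, plus the resulting GMM negative log-likelihood that an MDN is trained to minimize.

```python
import numpy as np

def mdn_params(z):
    """Map the 6 raw outputs z1..z6 of the last layer to 2-mix GMM parameters."""
    z = np.asarray(z, dtype=float)
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()  # weights: softmax (sum to 1)
    mu = z[2:4]                              # means: linear
    sigma = np.exp(z[4:6])                   # std devs: exponential (always > 0)
    return w, mu, sigma

def mdn_nll(z, y):
    """Negative log-likelihood of a scalar target y under the output GMM."""
    w, mu, sigma = mdn_params(z)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.sum())

w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])
print(w)      # [0.5 0.5] -- equal logits give equal weights
print(sigma)  # [1. 1.]   -- exp(0) = 1
```

Because the loss is a mixture likelihood rather than a squared error, the network can place one component on each mode of a one-to-many mapping instead of averaging them.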

DMDN-based SPSS [27]

[Figure: TTS pipeline (TEXT, text analysis, input feature extraction, duration prediction, parameter generation, waveform synthesis, SPEECH), with a deep MDN predicting the GMM parameters w_k(x_t), μ_k(x_t), σ_k(x_t) for every frame x_1, x_2, …, x_T]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

  DNN   architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)

  DMDN  architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output), 1–16 mix

  Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural, 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM             1 mix     3.537 ± 0.113
                2 mix     3.397 ± 0.115
DNN             4×1024    3.635 ± 0.127
                5×1024    3.681 ± 0.109
                6×1024    3.652 ± 0.108
                7×1024    3.637 ± 0.129
DMDN (4×1024)   1 mix     3.654 ± 0.117
                2 mix     3.796 ± 0.107
                4 mix     3.766 ± 0.113
                8 mix     3.805 ± 0.113
                16 mix    3.791 ± 0.102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79


Basic RNN

[Figure: RNN unrolled over time; inputs x_{t-1}, x_t, x_{t+1}, outputs y_{t-1}, y_t, y_{t+1}, with recurrent connections in the hidden layer]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
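The recurrence behind a basic RNN is a one-liner per frame. A minimal numpy sketch (parameter names are mine; small random weights are used only to make the decay effect visible): each hidden state folds the new input into a transformed copy of the previous state, so the contribution of early frames shrinks with every step.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Vanilla unidirectional RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hs.append(h)
    return np.stack(hs)

rng = np.random.default_rng(0)
# With contractive recurrent weights, information from early frames decays
# quickly as it loops through the recurrence -- the long-range-context
# problem the slide describes.
W_xh = 0.1 * rng.normal(size=(4, 3))
W_hh = 0.1 * rng.normal(size=(4, 4))
b_h = np.zeros(4)

x_seq = rng.normal(size=(20, 3))   # 20 frames of 3-dim input features
hs = rnn_forward(x_seq, W_xh, W_hh, b_h)
print(hs.shape)  # (20, 4): one 4-dim hidden state per frame
```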

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block with linear memory cell c_t; sigmoid input gate (write), forget gate (reset), and output gate (read), each fed by x_t and h_{t-1} with biases b_i, b_f, b_o; tanh nonlinearities on the cell input (bias b_c) and cell output producing h_t]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
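The gate structure in the figure can be written out in a few lines. This is a minimal numpy sketch of a standard LSTM step (no peephole connections or projection layer; parameter names are mine): the cell update is linear in c_{t-1}, which is what lets gradients and information survive over long spans when the forget gate stays open.

```python
import numpy as np

def sigm(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: multiplicative gates control write/read/reset of the cell."""
    z = np.concatenate([x_t, h_prev])                       # gate inputs [x_t; h_{t-1}]
    i = sigm(p["W_i"] @ z + p["b_i"])                       # input gate  (write)
    f = sigm(p["W_f"] @ z + p["b_f"])                       # forget gate (reset)
    o = sigm(p["W_o"] @ z + p["b_o"])                       # output gate (read)
    c = f * c_prev + i * np.tanh(p["W_c"] @ z + p["b_c"])   # linear memory cell
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
n_in, n_cell = 3, 4
p = {k: 0.1 * rng.normal(size=(n_cell, n_in + n_cell)) for k in ("W_i", "W_f", "W_o", "W_c")}
p.update({k: np.zeros(n_cell) for k in ("b_i", "b_f", "b_o", "b_c")})

h = c = np.zeros(n_cell)
for x_t in rng.normal(size=(10, n_in)):  # run 10 frames through the cell
    h, c = lstm_step(x_t, h, c, p)
print(h.shape)  # (4,)
```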

LSTM-based SPSS [33, 34]

[Figure: TTS pipeline (TEXT, text analysis, input feature extraction, duration prediction, parameter generation, waveform synthesis, SPEECH), with an LSTM mapping the input features x_1, x_2, …, x_T to the acoustic features y_1, y_2, …, y_T]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database            US English female speaker
Train/dev set data  34,632 & 100 sentences
Sampling rate       16 kHz
Analysis window     25-ms width, 5-ms shift
Linguistic features DNN: 449, LSTM: 289
Acoustic features   0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM                1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing      Postfiltering in the cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

DNN w/ ∆   DNN w/o ∆   LSTM w/ ∆   LSTM w/o ∆   Neutral   z       p
50.0       14.2        –           –            35.8      12.0    < 10⁻¹⁰
–          –           30.2        15.6         54.2      5.1     < 10⁻⁶
15.8       –           34.0        –            50.2      -6.2    < 10⁻⁹
28.4       –           –           33.6         38.0      -1.5    0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Audio Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural networks trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 70: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)w2(x1)

y

Weights rarr Softmax activation function

Means rarr Linear activation function

Variances rarr Exponential activation function

sumzj =

4

i=1

hiwij

w1(x) =exp(z1)sum2

m=1 exp(zm)

micro1(x) = z3

σ1(x) = exp(z5)

Inputs of activation function

1-dim 2-mix MDN

w2(x) =exp(z2)sum2

m=1 exp(zm)

micro1(x) = z4

σ2(x) = exp(z6)

NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)

w2(x1)

y

x1

micro1(x2 ) micro2(x2 )

σ2(x2 )σ1(x2 )

w1(x2 ) w2(x2 )

y

x2

micro1(xT ) micro2(xT )

σ1(xT )σ2(xT )

w2(xT )

w1(xT )

y

xT

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)

DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)

1ndash16 mix

Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 71: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

[Figure: a network whose output layer parameterizes a 1-dim, 2-mix GMM over y: w1(x), w2(x), µ1(x), µ2(x), σ1(x), σ2(x)]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

Inputs of the activation function (from the last hidden layer h):

  z_j = Σ_{i=1}^{4} h_i w_ij

1-dim, 2-mix MDN:

  w1(x) = exp(z1) / (exp(z1) + exp(z2))    w2(x) = exp(z2) / (exp(z1) + exp(z2))
  µ1(x) = z3                               µ2(x) = z4
  σ1(x) = exp(z5)                          σ2(x) = exp(z6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances
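The three output activations for the 1-dim, 2-mix case can be sketched in a few lines of NumPy (an illustrative sketch, not the slides' implementation): softmax keeps the weights on the simplex, the means pass through linearly, and the exponential keeps the standard deviations positive.

```python
import numpy as np

def mdn_params(z):
    """Map the 6 raw outputs z1..z6 of a 1-dim, 2-mixture MDN
    to mixture weights, means, and standard deviations."""
    z = np.asarray(z, dtype=float)
    w = np.exp(z[:2]) / np.exp(z[:2]).sum()  # softmax -> weights sum to 1
    mu = z[2:4]                              # linear -> unconstrained means
    sigma = np.exp(z[4:6])                   # exponential -> strictly positive
    return w, mu, sigma

w, mu, sigma = mdn_params([0.0, 0.0, -1.0, 1.0, 0.0, 0.5])
print(w)  # [0.5 0.5]
```

With equal pre-activations z1 = z2 the two mixture components receive equal weight, and σ is positive regardless of the raw network output.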

DMDN-based SPSS [27]

[Figure: pipeline TEXT → text analysis → input feature extraction → duration prediction → DMDN predicting {w_m(x_t), µ_m(x_t), σ_m(x_t)} for each frame t = 1, …, T → parameter generation → waveform synthesis → SPEECH]

Experimental setup

• Almost the same as the previous setup

• Differences:

  DNN  architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
  DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output); 1–16 mix
  Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU

Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

  HMM            1 mix    3.537 ± 0.113
                 2 mix    3.397 ± 0.115
  DNN            4×1024   3.635 ± 0.127
                 5×1024   3.681 ± 0.109
                 6×1024   3.652 ± 0.108
                 7×1024   3.637 ± 0.129
  DMDN (4×1024)  1 mix    3.654 ± 0.117
                 2 mix    3.796 ± 0.107
                 4 mix    3.766 ± 0.113
                 8 mix    3.805 ± 0.113
                 16 mix   3.791 ± 0.102

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes, syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]


Basic RNN

[Figure: unrolled RNN with inputs x_{t−1}, x_t, x_{t+1}, outputs y_{t−1}, y_t, y_{t+1}, and recurrent connections between consecutive hidden states]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in the hidden layers loops through the recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]
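A minimal NumPy sketch of the unidirectional RNN above (illustrative, with hypothetical weight names): each output depends only on the current and previous inputs, which is exactly why a bidirectional variant is needed to use future context.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, h0):
    """Unidirectional RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1}), y_t = W_hy h_t.
    Each y_t depends on x_1..x_t only -- no access to future inputs."""
    h, ys = h0, []
    for x_t in x_seq:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        ys.append(W_hy @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
W_xh = rng.standard_normal((4, 3))
W_hh = rng.standard_normal((4, 4))
W_hy = rng.standard_normal((2, 4))
x = rng.standard_normal((5, 3))            # 5 frames of 3-dim input
y = rnn_forward(x, W_xh, W_hh, W_hy, np.zeros(4))

# Causality check: perturbing frame 3 changes outputs at t >= 3, not t < 3.
x2 = x.copy()
x2[3] += 1.0
y2 = rnn_forward(x2, W_xh, W_hh, W_hy, np.zeros(4))
```

The causality check makes the "only previous contexts" limitation observable: outputs before the perturbed frame are bit-for-bit identical.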

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block over inputs x_t and h_{t−1} with biases b_i, b_f, b_o, b_c: a linear memory cell c_t guarded by a sigmoid input gate i_t (write), a sigmoid forget gate (reset), and a sigmoid output gate (read), with tanh nonlinearities on the cell input and output producing h_t]
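One step of the gated cell update can be sketched as follows (a simplified illustration of the standard LSTM equations, without peephole connections or the projection layer used in the experiments below):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: sigmoid gates (input = write, forget = reset,
    output = read) around a linear memory cell c."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    n = h_prev.size
    i = sigmoid(z[:n])           # input gate: how much to write
    f = sigmoid(z[n:2 * n])      # forget gate: how much old cell to keep
    o = sigmoid(z[2 * n:3 * n])  # output gate: how much cell to read out
    g = np.tanh(z[3 * n:])       # candidate value written to the cell
    c = f * c_prev + i * g       # linear cell update: no repeated squashing
    h = o * np.tanh(c)           # gated read-out
    return h, c

rng = np.random.default_rng(1)
n_cell, n_in = 3, 2
W = rng.standard_normal((4 * n_cell, n_in + n_cell))
h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_cell),
                 np.zeros(n_cell), W, np.zeros(4 * n_cell))
```

Because the cell update c = f·c_prev + i·g is additive rather than repeatedly squashed, gradients and stored information survive over many more time steps than in the basic RNN.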

LSTM-based SPSS [33, 34]

[Figure: pipeline TEXT → text analysis → input feature extraction → duration prediction → LSTM mapping the input sequence x_1, x_2, …, x_T to outputs y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH]

Experimental setup

Database:            US English, female speaker
Train / dev set:     34,632 & 100 sentences
Sampling rate:       16 kHz
Analysis window:     25-ms width, 5-ms shift
Linguistic features: 449 (DNN), 289 (LSTM)
Acoustic features:   0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN:                 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU
LSTM:                1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]
Postprocessing:      postfiltering in cepstrum domain [25]

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

  Preference score (%):

       DNN              LSTM
  w/ ∆   w/o ∆     w/ ∆   w/o ∆    Neutral     z        p
  50.0    14.2       –      –        35.8     12.0   < 10⁻¹⁰
    –       –       30.2   15.6      54.2      5.1   < 10⁻⁶
  15.8      –       34.0     –       50.2     −6.2   < 10⁻⁹
    –     28.4        –    33.6      38.0     −1.5     0.138

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 72: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)w2(x1)

y

Weights rarr Softmax activation function

Means rarr Linear activation function

Variances rarr Exponential activation function

sumzj =

4

i=1

hiwij

w1(x) =exp(z1)sum2

m=1 exp(zm)

micro1(x) = z3

σ1(x) = exp(z5)

Inputs of activation function

1-dim 2-mix MDN

w2(x) =exp(z2)sum2

m=1 exp(zm)

micro1(x) = z4

σ2(x) = exp(z6)

NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)

w2(x1)

y

x1

micro1(x2 ) micro2(x2 )

σ2(x2 )σ1(x2 )

w1(x2 ) w2(x2 )

y

x2

micro1(xT ) micro2(xT )

σ1(xT )σ2(xT )

w2(xT )

w1(xT )

y

xT

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)

DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)

1ndash16 mix

Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 73: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Limitations of DNN-based acoustic modeling

y1

y2

Data samples

NN prediction

bull Unimodality

minus Human can speak in different ways rarr one-to-many mappingminus NN trained by MSE loss rarr approximates conditional mean

bull Lack of variance

minus DNN-based SPSS uses variances computed from all training dataminus Parameter generation algorithm utilizes variances

Linear output layer rarr Mixture density output layer [26]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 58 of 79

Mixture density network [26]

micro1(x1) micro2(x1)σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)w2(x1)

y

Weights rarr Softmax activation function

Means rarr Linear activation function

Variances rarr Exponential activation function

sumzj =

4

i=1

hiwij

w1(x) =exp(z1)sum2

m=1 exp(zm)

micro1(x) = z3

σ1(x) = exp(z5)

Inputs of activation function

1-dim 2-mix MDN

w2(x) =exp(z2)sum2

m=1 exp(zm)

micro1(x) = z4

σ2(x) = exp(z6)

NN + mixture model (GMM)rarr NN outputs GMM weights means amp variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79

DMDN-based SPSS [27]

micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)

w2(x1)

y

x1

micro1(x2 ) micro2(x2 )

σ2(x2 )σ1(x2 )

w1(x2 ) w2(x2 )

y

x2

micro1(xT ) micro2(xT )

σ1(xT )σ2(xT )

w2(xT )

w1(xT )

y

xT

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)

DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)

1ndash16 mix

Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

[Figure: RNN with recurrent connections in the hidden layer, unrolled over time: inputs x_{t−1}, x_t, x_{t+1} → outputs y_{t−1}, y_t, y_{t+1}]

• Only able to use previous contexts
  → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from inputs
  → long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79
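The strictly left-to-right dependency that limits a basic RNN to previous contexts is easy to see in a minimal forward pass. This sketch (random weights, illustrative dimensions) unrolls h_t = tanh(W_xh x_t + W_hh h_{t−1} + b_h), y_t = W_hy h_t + b_y:

```python
import numpy as np

def rnn_forward(x_seq, Wxh, Whh, Why, bh, by):
    """Vanilla RNN: h_t = tanh(Wxh x_t + Whh h_{t-1} + bh), y_t = Why h_t + by."""
    h = np.zeros(Whh.shape[0])
    ys = []
    for x_t in x_seq:                          # strictly left-to-right:
        h = np.tanh(Wxh @ x_t + Whh @ h + bh)  # h_t only sees x_1 .. x_t
        ys.append(Why @ h + by)
    return np.stack(ys)

rng = np.random.default_rng(0)
D_in, D_h, D_out, T = 3, 4, 2, 5
params = (rng.standard_normal((D_h, D_in)), rng.standard_normal((D_h, D_h)),
          rng.standard_normal((D_out, D_h)), np.zeros(D_h), np.zeros(D_out))
y = rnn_forward(rng.standard_normal((T, D_in)), *params)
print(y.shape)  # one output vector per input frame
```

A bidirectional RNN [31] would run a second pass right-to-left and combine both hidden states per frame.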

Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: LSTM block; a linear memory cell c_t is written through an input gate i_t, read through an output gate to produce h_t, and reset through a forget gate; each gate is a sigmoid of x_t, h_{t−1}, and a bias (b_i, b_f, b_o), with tanh squashing on the cell input and output]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79
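A single LSTM step matching the gate structure above can be sketched as follows (illustrative NumPy code; the matrix W is assumed to stack the pre-activations of the input, forget, and output gates and the candidate cell input):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: input/forget/output gates around a linear memory cell.
    W maps [x_t; h_prev] to the stacked pre-activations of i, f, o, and the
    candidate cell input g; b is the corresponding stacked bias."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])           # input gate: write
    f = sigmoid(z[H:2 * H])       # forget gate: reset
    o = sigmoid(z[2 * H:3 * H])   # output gate: read
    g = np.tanh(z[3 * H:4 * H])   # candidate cell input
    c = f * c_prev + i * g        # linear memory cell update
    h = o * np.tanh(c)            # exposed hidden state
    return h, c

rng = np.random.default_rng(1)
D, H = 3, 4
W = rng.standard_normal((4 * H, D + H))
b = np.zeros(4 * H)
h, c = lstm_step(rng.standard_normal(D), np.zeros(H), np.zeros(H), W, b)
print(h.shape, c.shape)
```

Because the cell update is linear in c_prev, gradients can flow over many time steps without the rapid decay of the basic RNN.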

LSTM-based SPSS [33 34]

[Figure: TEXT → text analysis → input feature extraction → duration prediction produces input features x_1, x_2, …, x_T; an LSTM maps them frame by frame to acoustic features y_1, y_2, …, y_T; parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database             US English female speaker
Train / dev set data 34,632 & 100 sentences
Sampling rate        16 kHz
Analysis window      25-ms width, 5-ms shift
Linguistic features  DNN: 449, LSTM: 289
Acoustic features    0–39 mel-cepstrum, log F0, 5-band aperiodicity (∆, ∆²)
DNN                  4 hidden layers, 1024 units/hidden layer,
                     ReLU (hidden), linear (output), AdaDec [29] on GPU
LSTM                 1 forward LSTM layer, 256 units, 128 projection,
                     asynchronous SGD on CPUs [35]
Postprocessing       Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

      DNN                LSTM              Stats
  w/ ∆    w/o ∆      w/ ∆    w/o ∆     Neutral     z        p
  50.0    14.2        –       –          35.8    12.0   < 10^-10
   –       –         30.2    15.6        54.2     5.1   < 10^-6
  15.8     –         34.0     –          50.2    −6.2   < 10^-9
  28.4     –          –      33.6        38.0    −1.5     0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
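The z and p values above come from testing whether the non-neutral preferences deviate from a 50/50 split. A minimal sketch with hypothetical vote counts (the `preference_z` helper and the counts are illustrative, not the actual data):

```python
import math

def preference_z(n_a, n_b):
    """Two-sided z-test that preference for A vs B among non-neutral
    votes differs from 50/50 (normal approximation to the binomial)."""
    n = n_a + n_b
    z = (n_a - n_b) / math.sqrt(n)        # (n_a - n/2) / sqrt(n/4), simplified
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return z, p

z, p = preference_z(250, 71)  # hypothetical vote counts
print(f"z = {z:.1f}, p = {p:.2e}")
```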

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural networks trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


Mixture density network [26]

[Figure: 1-dim, 2-mix mixture density network; the output layer predicts, as functions of the input x, the GMM weights w_1(x), w_2(x), means µ_1(x), µ_2(x), and variances σ_1(x), σ_2(x) of the distribution over y]

• Weights → softmax activation function
• Means → linear activation function
• Variances → exponential activation function

Inputs of activation function: z_j = Σ_{i=1}^{4} h_i w_{ij}

1-dim, 2-mix MDN:
  w_1(x) = exp(z_1) / (exp(z_1) + exp(z_2))    w_2(x) = exp(z_2) / (exp(z_1) + exp(z_2))
  µ_1(x) = z_3                                 µ_2(x) = z_4
  σ_1(x) = exp(z_5)                            σ_2(x) = exp(z_6)

NN + mixture model (GMM) → NN outputs GMM weights, means & variances

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 59 of 79
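The 1-dim, 2-mix MDN output layer above maps pre-activations z_1 … z_6 to GMM parameters via softmax, identity, and exponential. A minimal sketch, including the negative log-likelihood such a network is trained to minimize:

```python
import numpy as np

def mdn_params(z):
    """Map the 6 output-layer pre-activations of a 1-dim, 2-mix MDN
    to GMM weights, means, and standard deviations."""
    w = np.exp(z[0:2]) / np.exp(z[0:2]).sum()  # softmax -> mixture weights
    mu = z[2:4]                                # linear  -> means
    sigma = np.exp(z[4:6])                     # exponential -> std devs (> 0)
    return w, mu, sigma

def mdn_nll(z, y):
    """Negative log-likelihood of a scalar target y under the 2-mix GMM."""
    w, mu, sigma = mdn_params(z)
    comp = w * np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(comp.sum())

z = np.array([0.2, -0.1, 1.0, -1.0, 0.0, 0.3])  # illustrative pre-activations
w, mu, sigma = mdn_params(z)
print(w.sum())  # weights sum to 1
print(mdn_nll(z, 0.5))
```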

DMDN-based SPSS [27]

[Figure: DMDN-based pipeline; TEXT → text analysis → input feature extraction → duration prediction produces input features x_1, x_2, …, x_T; at each frame t the DMDN outputs the GMM weights w_1(x_t), w_2(x_t), means µ_1(x_t), µ_2(x_t), and variances σ_1(x_t), σ_2(x_t) of the distribution over y; parameter generation → waveform synthesis → SPEECH]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

• Almost the same as the previous setup

• Differences:

  DNN architecture    4–7 hidden layers, 1024 units/hidden layer,
                      ReLU (hidden), linear (output)
  DMDN architecture   4 hidden layers, 1024 units/hidden layer,
                      ReLU [28] (hidden), mixture density (output), 1–16 mix
  Optimization        AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 75: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

DMDN-based SPSS [27]

micro1(x1) micro2(x1) σ1(x1) σ2(x1)w1(x1) w2(x1)

micro1(x1) micro2(x1)

σ2(x1)σ1(x1)

w1(x1)

w2(x1)

y

x1

micro1(x2 ) micro2(x2 )

σ2(x2 )σ1(x2 )

w1(x2 ) w2(x2 )

y

x2

micro1(xT ) micro2(xT )

σ1(xT )σ2(xT )

w2(xT )

w1(xT )

y

xT

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

micro1(xT ) micro2(xT ) σ1(xT ) σ2(xT )w1(xT ) w2(xT )micro1(x2) micro2(x2) σ1(x2) σ2(x2)w1(x2) w2(x2)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 60 of 79

Experimental setup

bull Almost the same as the previous setup

bull Differences

DNN 4ndash7 hidden layers 1024 unitshidden layerarchitecture ReLU (hidden) Linear (output)

DMDN 4 hidden layers 1024 units hidden layerarchitecture ReLU [28] (hidden) Mixture density (output)

1ndash16 mix

Optimization AdaDec [29] (variant of AdaGrad [30]) on GPU

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 61 of 79

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.


References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.


References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.



Experimental setup

• Almost the same as the previous setup

• Differences:
  − DNN architecture: 4–7 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output)
  − DMDN architecture: 4 hidden layers, 1024 units/hidden layer; ReLU [28] (hidden), mixture density (output); 1–16 mixture components
  − Optimization: AdaDec [29] (a variant of AdaGrad [30]) on GPU


Subjective evaluation

• 5-scale mean opinion score (MOS) test (1: unnatural – 5: natural)

• 173 test sentences, 5 subjects per pair

• Up to 30 pairs per subject

• Crowd-sourced

HMM   (1 mix)             3.537 ± 0.113
HMM   (2 mix)             3.397 ± 0.115
DNN   (4×1024)            3.635 ± 0.127
DNN   (5×1024)            3.681 ± 0.109
DNN   (6×1024)            3.652 ± 0.108
DNN   (7×1024)            3.637 ± 0.129
DMDN  (4×1024, 1 mix)     3.654 ± 0.117
DMDN  (4×1024, 2 mix)     3.796 ± 0.107
DMDN  (4×1024, 4 mix)     3.766 ± 0.113
DMDN  (4×1024, 8 mix)     3.805 ± 0.113
DMDN  (4×1024, 16 mix)    3.791 ± 0.102
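The ± figures on this slide are confidence half-widths around each mean opinion score. A minimal sketch of that computation, using toy ratings rather than the evaluation's raw data:

```python
import math

def mos_with_ci(ratings, z=1.96):
    """Mean opinion score with a normal-approximation 95% confidence
    half-width, the usual way MOS results are reported as mean ± x."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    half_width = z * math.sqrt(var / n)                    # z * standard error
    return mean, half_width

# Toy ratings on the 1-5 naturalness scale (not the paper's data)
mean, hw = mos_with_ci([4, 3, 4, 5, 3, 4, 4, 3, 5, 4])
print(f"{mean:.3f} +/- {hw:.3f}")  # 3.900 +/- 0.457
```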


Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Limitations of DNN/DMDN-based acoustic modeling

• Fixed time span for input features
  − A fixed number of preceding/succeeding contexts (e.g., ±2 phonemes/syllable stress) is used as input
  − Difficult to incorporate long-time-span contextual effects

• Frame-by-frame mapping
  − Each frame is mapped independently
  − Smoothing using dynamic feature constraints is still essential

Recurrent connections → recurrent NN (RNN) [31]
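The fixed-span input construction criticized above can be sketched as follows; `fixed_span_input` and the one-value-per-phoneme features are illustrative, not the system's actual feature set:

```python
def fixed_span_input(phoneme_feats, i, span=2, pad=None):
    """Concatenate the features of the +/-`span` phonemes around position i.

    Mimics the fixed-time-span input of DNN/DMDN acoustic models:
    out-of-range positions are padded, so the network can never see
    context beyond the chosen window, however long the utterance is.
    """
    if pad is None:
        pad = [0.0] * len(phoneme_feats[0])
    window = []
    for j in range(i - span, i + span + 1):
        window.extend(phoneme_feats[j] if 0 <= j < len(phoneme_feats) else pad)
    return window

feats = [[1.0], [2.0], [3.0], [4.0]]          # toy per-phoneme features
print(fixed_span_input(feats, 0))             # [0.0, 0.0, 1.0, 2.0, 3.0]
```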



Basic RNN

[Figure: an RNN unrolled in time; inputs x_{t−1}, x_t, x_{t+1} feed hidden units that produce outputs y_{t−1}, y_t, y_{t+1}, with recurrent connections linking consecutive hidden states.]

• Only able to use previous contexts → bidirectional RNN [31]

• Trouble accessing long-range contexts
  − Information in hidden layers loops through recurrent connections → quickly decays over time
  − Prone to being overwritten by new information arriving from the inputs → long short-term memory (LSTM) RNN [32]
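A scalar sketch of the basic recurrence (all weights are assumed toy values): the hidden state carries information forward, but repeated squashing through the nonlinearity shrinks an early input's influence, illustrating the decay the slide describes:

```python
import math

def rnn_step(x_t, h_prev, w_xh, w_hh, b):
    """One step of a basic (Elman) RNN: the new hidden state depends on the
    current input and the previous hidden state via recurrent weight w_hh."""
    return math.tanh(w_xh * x_t + w_hh * h_prev + b)

# Feed a single impulse followed by zeros: the trace of that first input
# keeps shrinking as it loops through the recurrent connection.
h = 0.0
history = []
for x in [1.0, 0.0, 0.0, 0.0]:
    h = rnn_step(x, h, w_xh=1.0, w_hh=0.5, b=0.0)
    history.append(h)
print(history)
```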


Long short-term memory (LSTM) [32]

• RNN architecture designed to have better memory

• Uses linear memory cells surrounded by multiplicative gate units

[Figure: an LSTM block: a linear memory cell c_t with tanh input and output nonlinearities, guarded by sigmoid gates fed by x_t, h_{t−1} and biases (b_i, b_f, b_o, b_c): the input gate controls writes, the output gate controls reads, and the forget gate resets the cell.]
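The block diagram maps onto the standard LSTM update equations; below is a scalar sketch with assumed toy weights (real layers use weight matrices and vector states):

```python
import math

def sigm(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w):
    """One scalar LSTM step matching the slide's block diagram: a linear
    memory cell c_t guarded by input (write), forget (reset) and output
    (read) gates. `w` holds assumed scalar weights/biases."""
    i = sigm(w['wi'] * x_t + w['ui'] * h_prev + w['bi'])       # input gate
    f = sigm(w['wf'] * x_t + w['uf'] * h_prev + w['bf'])       # forget gate
    o = sigm(w['wo'] * x_t + w['uo'] * h_prev + w['bo'])       # output gate
    g = math.tanh(w['wc'] * x_t + w['uc'] * h_prev + w['bc'])  # cell candidate
    c = f * c_prev + i * g     # linear cell update: eases gradient flow
    h = o * math.tanh(c)       # gated read-out
    return h, c

w = dict(wi=1, ui=0, bi=0, wf=1, uf=0, bf=0, wo=1, uo=0, bo=0, wc=1, uc=0, bc=0)
h, c = lstm_step(1.0, 0.0, 0.0, w)
```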


LSTM-based SPSS [33, 34]

[Figure: the LSTM-based TTS pipeline: TEXT → text analysis / input feature extraction → duration prediction → frame-level linguistic features x_1, x_2, …, x_T → LSTM → acoustic features y_1, y_2, …, y_T → parameter generation → waveform synthesis → SPEECH.]
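The pipeline in this figure can be written as plain function composition; every stage name below is an illustrative placeholder, not a real API:

```python
def synthesize(text, text_analysis, duration_model, acoustic_model, vocoder):
    """The slide's pipeline as function composition:
    TEXT -> text analysis / input feature extraction -> duration prediction
         -> frame-level inputs x_1..x_T -> acoustic model -> y_1..y_T
         -> waveform synthesis -> SPEECH."""
    labels = text_analysis(text)                  # linguistic features per phoneme
    frames = duration_model(labels)               # upsample to frame-level inputs x_t
    params = [acoustic_model(x) for x in frames]  # vocoder parameters y_t per frame
    return vocoder(params)

# Toy stand-ins: two frames per "phoneme", uppercasing as the "acoustic model".
wave = synthesize("ab",
                  text_analysis=list,
                  duration_model=lambda ls: [l for l in ls for _ in range(2)],
                  acoustic_model=str.upper,
                  vocoder="".join)
print(wave)  # AABB
```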


Experimental setup

Database: US English, female speaker

Training / dev set data: 34,632 & 100 sentences

Sampling rate: 16 kHz

Analysis window: 25-ms width, 5-ms shift

Linguistic features: 449 (DNN), 289 (LSTM)

Acoustic features: 0–39 mel-cepstrum, log F0, 5-band aperiodicity (Δ, Δ²)

DNN: 4 hidden layers, 1024 units/hidden layer; ReLU (hidden), linear (output); AdaDec [29] on GPU

LSTM: 1 forward LSTM layer, 256 units, 128 projection; asynchronous SGD on CPUs [35]

Postprocessing: postfiltering in cepstrum domain [25]


Subjective evaluations

• Paired comparison test

• 100 test sentences, 5 ratings per pair

• Up to 30 pairs per subject

• Crowd-sourced

Preference scores (%):

DNN (w/ Δ)   DNN (w/o Δ)   LSTM (w/ Δ)   LSTM (w/o Δ)   Neutral    z      p
50.0         14.2          –             –              35.8      12.0   < 10^-10
–            –             30.2          15.6           54.2       5.1   < 10^-6
15.8         –             34.0          –              50.2      −6.2   < 10^-9
28.4         –             –             33.6           38.0      −1.5   0.138
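The z values above come from a significance test on the preference counts; the slides do not state which test, but a sign test on the non-neutral votes is one common choice for paired-comparison listening tests and can be sketched as:

```python
import math

def sign_test_z(prefer_a, prefer_b):
    """z statistic of a sign test on non-neutral preference counts:
    under H0 (no preference) each non-neutral vote is a fair coin flip,
    so the count of A-votes is Binomial(n, 0.5), approximated as normal."""
    n = prefer_a + prefer_b
    return (prefer_a - n / 2.0) / math.sqrt(n / 4.0)

# Toy counts (not the evaluation's raw data): strong preference for A.
z = sign_test_z(250, 71)
print(round(z, 3))
```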


Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)


Outline

• Background
  − HMM-based statistical parametric speech synthesis (SPSS)
  − Flexibility
  − Improvements

• Statistical parametric speech synthesis with neural networks
  − Deep neural network (DNN)-based SPSS
  − Deep mixture density network (DMDN)-based SPSS
  − Recurrent neural network (RNN)-based SPSS

• Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns a mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)



  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 77: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Subjective evaluation

bull 5-scale mean opinion score (MOS) test (1 unnatural ndash 5 natural)

bull 173 test sentences 5 subjects per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

1 mix 3537 plusmn 0113HMM 2 mix 3397 plusmn 0115

4times1024 3635 plusmn 0127DNN 5times1024 3681 plusmn 0109

6times1024 3652 plusmn 01087times1024 3637 plusmn 0129

1 mix 3654 plusmn 0117DMDN 2 mix 3796 plusmn 0107

(4times1024) 4 mix 3766 plusmn 01138 mix 3805 plusmn 011316 mix 3791 plusmn 0102

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 62 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 78: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

• DNN (w/o dynamic features)

• DNN (w/ dynamic features)

• LSTM (w/o dynamic features)

• LSTM (w/ dynamic features)


Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learn mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)


References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.


References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Acoust. Speech Lang. Process., 20(6):1713–1724, 2012.


References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.


References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.


References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 79: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 80: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Limitations of DNNDMDN-based acoustic modeling

bull Fixed time span for input features

minus Fixed number of preceding succeeding contexts(eg plusmn2 phonemessyllable stress) are used as inputs

minus Difficult to incorporate long time span contextual effect

bull Frame-by-frame mapping

minus Each frame is mapped independentlyminus Smoothing using dynamic feature constraints is still essential

Recurrent connections rarr Recurrent NN (RNN) [31]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 64 of 79

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

bull Vocoding + acoustic model

bull HMM-based SPSS

minus Flexible (eg adaptation interpolation)minus Improvements

Vocoding Acoustic modeling Oversmoothing compensation

bull NN-based SPSS

minus Learn mapping from linguistic features to acoustic onesminus Static network (DNN DMDN) rarr dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79

References I

[1] E Moulines and F CharpentierPitch synchronous waveform processing techniques for text-to-speech synthesis using diphonesSpeech Commun 9453ndash467 1990

[2] A Hunt and A BlackUnit selection in a concatenative speech synthesis system using a large speech databaseIn Proc ICASSP pages 373ndash376 1996

[3] H Zen K Tokuda and A BlackStatistical parametric speech synthesisSpeech Commun 51(11)1039ndash1064 2009

[4] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSimultaneous modeling of spectrum pitch and duration in HMM-based speech synthesisIn Proc Eurospeech pages 2347ndash2350 1999

[5] F Itakura and S SaitoA statistical method for estimation of speech spectral density and formant frequenciesTrans IEICE J53ndashA35ndash42 1970

[6] S ImaiCepstral analysis synthesis on the mel frequency scaleIn Proc ICASSP pages 93ndash96 1983

[7] J OdellThe use of context in large vocabulary speech recognitionPhD thesis Cambridge University 1995

[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 81: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Basic RNN

Output y

Input x

Recurrentconnections

xt

yt yt+1ytminus1

xt+1xtminus1

bull Only able to use previous contextsrarr bidirectional RNN [31]

bull Trouble accessing long-range contextsminus Information in hidden layers loops through recurrent connectionsrarr Quickly decay over time

minus Prone to being overwritten by new information arriving from inputsrarr long short-term memory (LSTM) RNN [32]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 65 of 79

Long short-term memory (LSTM) [32]

bull RNN architecture designed to have better memory

bull Uses linear memory cells surrounded by multiplicative gate units

ct

bi xt h tminus

it

tanh

sigm

tanh

bc

xt

h tminus

Input gate

Forget gate

Memory cell

bo xt h tminus

ht

bf xt h tminus

sigm

sigmBlock

Output gate Input gate Write

Output gate Read

Forget gate Reset

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 66 of 79

LSTM-based SPSS [33 34]

x1 x2

TEX

TTe

xtan

alys

isIn

put f

eatu

reex

tract

ion

Dur

atio

npr

edic

tion

SP

EE

CH

Param

etergeneration

Waveform

synthesis

xT

y1 y2 yT

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 67 of 79

Experimental setup

Database US English female speaker

Train dev set data 34632 amp 100 sentences

Sampling rate 16 kHz

Analysis window 25-ms width 5-ms shift

Linguistic DNN 449features LSTM 289

Acoustic 0ndash39 mel-cepstrumfeatures logF0 5-band aperiodicity (∆∆2)

4 hidden layers 1024 unitshidden layerDNN ReLU (hidden) Linear (output)

AdaDec [29] on GPU

1 forward LSTM layerLSTM 256 units 128 projection

Asynchronous SGD on CPUs [35]

Postprocessing Postfiltering in cepstrum domain [25]

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 68 of 79

Subjective evaluations

bull Paired comparison test

bull 100 test sentences 5 ratings per pair

bull Up to 30 pairs per subject

bull Crowd-sourced

DNN LSTM Statsw ∆ wo ∆ w ∆ wo ∆ Neutral z p

500 142 ndash ndash 358 120 lt 10minus10

ndash ndash 302 156 542 51 lt 10minus6

158 ndash 340 ndash 502 -62 lt 10minus9

284 ndash ndash 336 380 -15 0138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79

Samples

bull DNN (wo dynamic features)

bull DNN (w dynamic features)

bull LSTM (wo dynamic features)

bull LSTM (w dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

BackgroundHMM-based statistical parametric speech synthesis (SPSS)FlexibilityImprovements

Statistical parametric speech synthesis with neural networksDeep neural network (DNN)-based SPSSDeep mixture density network (DMDN)-based SPSSRecurrent neural network (RNN)-based SPSS

SummarySummary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g., adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learns the mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Audio Speech Lang. Process., 20(6):1713–1724, 2012.

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural network trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 85: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

Subjective evaluations

• Paired comparison test
• 100 test sentences, 5 ratings per pair
• Up to 30 pairs per subject
• Crowd-sourced

Preference scores (%; Δ = dynamic features):

          DNN              LSTM
      w/ Δ   w/o Δ    w/ Δ   w/o Δ    Neutral      z         p
      50.0   14.2      –       –       35.8      12.0   < 10⁻¹⁰
       –      –       30.2    15.6     54.2       5.1   < 10⁻⁶
      15.8    –       34.0     –       50.2      −6.2   < 10⁻⁹
      28.4    –        –      33.6     38.0      −1.5     0.138

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 69 of 79
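The rows above report a z-score and p-value per preference test. The slides do not state the exact statistic used, but a minimal sketch of one standard choice for paired preference tests is the two-sided sign test with neutral ratings discarded (the counts below are hypothetical, not the slide's):

```python
import math

def sign_test_z(pref_a: int, pref_b: int) -> float:
    # Normal approximation to the sign test: under H0 ("no preference"),
    # pref_a ~ Binomial(n, 0.5) with n = pref_a + pref_b; neutral
    # ratings are discarded before counting.
    n = pref_a + pref_b
    return (pref_a - n / 2) / math.sqrt(n / 4)

def two_sided_p(z: float) -> float:
    # Two-sided p-value under the standard normal:
    # P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical counts: of 500 ratings, 250 prefer system A, 71 prefer
# system B, and the rest are neutral.
z = sign_test_z(250, 71)
p = two_sided_p(z)
```

With counts this lopsided, z comes out near 10 and the p-value is vanishingly small, which is the same qualitative picture as the significant rows in the table.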

Samples

• DNN (w/o dynamic features)
• DNN (w/ dynamic features)
• LSTM (w/o dynamic features)
• LSTM (w/ dynamic features)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 70 of 79

Outline

Background
  HMM-based statistical parametric speech synthesis (SPSS)
  Flexibility
  Improvements

Statistical parametric speech synthesis with neural networks
  Deep neural network (DNN)-based SPSS
  Deep mixture density network (DMDN)-based SPSS
  Recurrent neural network (RNN)-based SPSS

Summary

Summary

Statistical parametric speech synthesis

• Vocoding + acoustic model

• HMM-based SPSS
  − Flexible (e.g. adaptation, interpolation)
  − Improvements: vocoding, acoustic modeling, oversmoothing compensation

• NN-based SPSS
  − Learn mapping from linguistic features to acoustic ones
  − Static networks (DNN, DMDN) → dynamic ones (LSTM)

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 72 of 79
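The NN-based view above treats synthesis as a learned frame-by-frame mapping from linguistic features to acoustic ones. A toy pure-Python sketch of the static (DNN) case follows; all layer sizes, feature dimensions, and the initialization are made up for illustration, not taken from the talk:

```python
import random

random.seed(0)

def init_layer(n_in, n_out):
    # Small random weights, zero biases (toy initialization).
    w = [[random.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b

def forward(x, w, b, act):
    # One fully connected layer: act(W x + b).
    return [act(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

relu = lambda v: max(0.0, v)   # hidden-layer nonlinearity
identity = lambda v: v         # linear output for regression

# Hypothetical dimensions: 8 linguistic features in (phone identity,
# position within syllable, ...), 5 acoustic parameters out
# (mel-cepstral coefficients, log F0, ...).
W1, b1 = init_layer(8, 16)
W2, b2 = init_layer(16, 5)

def dnn_acoustic_model(linguistic_frame):
    # Static mapping: each frame is predicted independently.
    h = forward(linguistic_frame, W1, b1, relu)
    return forward(h, W2, b2, identity)

acoustic_frame = dnn_acoustic_model([0.0, 1.0, 0.5, 0.2, 0.0, 0.0, 1.0, 0.3])
```

In the dynamic (LSTM) case the interface is the same, but the network carries state across frames, which is what lets it model acoustic trajectories without explicit dynamic-feature constraints.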

References I

[1] E. Moulines and F. Charpentier. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Commun., 9:453–467, 1990.

[2] A. Hunt and A. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In Proc. ICASSP, pages 373–376, 1996.

[3] H. Zen, K. Tokuda, and A. Black. Statistical parametric speech synthesis. Speech Commun., 51(11):1039–1064, 2009.

[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis. In Proc. Eurospeech, pages 2347–2350, 1999.

[5] F. Itakura and S. Saito. A statistical method for estimation of speech spectral density and formant frequencies. Trans. IEICE, J53-A:35–42, 1970.

[6] S. Imai. Cepstral analysis synthesis on the mel frequency scale. In Proc. ICASSP, pages 93–96, 1983.

[7] J. Odell. The use of context in large vocabulary speech recognition. PhD thesis, Cambridge University, 1995.

[8] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Duration modeling for HMM-based speech synthesis. In Proc. ICSLP, pages 29–32, 1998.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. ICASSP, pages 1315–1318, 2000.

[10] Y. Morioka, S. Kataoka, H. Zen, Y. Nankaku, K. Tokuda, and T. Kitamura. Miniaturization of HMM-based speech synthesis. In Proc. Autumn Meeting of ASJ, pages 325–326, 2004 (in Japanese).

[11] S.-J. Kim, J.-J. Kim, and M.-S. Hahn. HMM-based Korean speech synthesis system for hand-held devices. IEEE Trans. Consum. Electron., 52(4):1384–1390, 2006.

[12] J. Yamagishi, Z.-H. Ling, and S. King. Robustness of HMM-based speech synthesis. In Proc. Interspeech, pages 581–584, 2008.

[13] J. Yamagishi. Average-Voice-Based Speech Synthesis. PhD thesis, Tokyo Institute of Technology, 2006.

[14] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Speaker interpolation in HMM-based speech synthesis system. In Proc. Eurospeech, pages 2523–2526, 1997.

[15] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Eigenvoices for HMM-based speech synthesis. In Proc. ICSLP, pages 1269–1272, 2002.

[16] H. Zen, N. Braunschweiler, S. Buchholz, M. Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. IEEE Trans. Audio Speech Lang. Process., 20(6):1713–1724, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T. Nose, J. Yamagishi, T. Masuko, and T. Kobayashi. A style control technique for HMM-based expressive speech synthesis. IEICE Trans. Inf. Syst., E90-D(9):1406–1413, 2007.

[18] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.

[19] O. Karaali, G. Corrigan, and I. Gerson. Speech synthesis with neural networks. In Proc. World Congress on Neural Networks, pages 45–50, 1996.

[20] C. Tuerk and T. Robinson. Speech synthesis using artificial neural networks trained on cepstral coefficients. In Proc. Eurospeech, pages 1713–1716, 1993.

[21] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. A hidden semi-Markov model-based speech synthesis system. IEICE Trans. Inf. Syst., E90-D(5):825–834, 2007.

[22] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi. Multi-space probability distribution HMM. IEICE Trans. Inf. Syst., E85-D(3):455–464, 2002.

[23] K. Shinoda and T. Watanabe. Acoustic modeling based on the MDL criterion for speech recognition. In Proc. Eurospeech, pages 99–102, 1997.

[24] K. Yu and S. Young. Continuous F0 modelling for HMM based statistical parametric speech synthesis. IEEE Trans. Audio Speech Lang. Process., 19(5):1071–1079, 2011.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura. Incorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesis. IEICE Trans. Inf. Syst., J87-D-II(8):1563–1571, 2004.

[26] C. Bishop. Mixture density networks. Technical Report NCRG/94/004, Neural Computing Research Group, Aston University, 1994.

[27] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3872–3876, 2014.

[28] M. Zeiler, M. Ranzato, R. Monga, M. Mao, K. Yang, Q.-V. Le, P. Nguyen, A. Senior, V. Vanhoucke, J. Dean, and G. Hinton. On rectified linear units for speech processing. In Proc. ICASSP, pages 3517–3521, 2013.

[29] A. Senior, G. Heigold, M. Ranzato, and K. Yang. An empirical study of learning rates in deep neural networks for speech recognition. In Proc. ICASSP, pages 6724–6728, 2013.

[30] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, pages 2121–2159, 2011.

[31] M. Schuster and K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Process., 45(11):2673–2681, 1997.

[32] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y. Fan, Y. Qian, F. Xie, and F. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, 2014 (submitted). http://research.microsoft.com/en-us/projects/dnntts

[34] H. Zen, H. Sak, A. Graves, and A. Senior. Statistical parametric speech synthesis using recurrent neural networks. In UKSpeech Conference, 2014.

[35] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Proc. NIPS, 2012.

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79


[8] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraDuration modeling for HMM-based speech synthesisIn Proc ICSLP pages 29ndash32 1998

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 75 of 79

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 90: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

References II

[9] K Tokuda T Yoshimura T Masuko T Kobayashi and T KitamuraSpeech parameter generation algorithms for HMM-based speech synthesisIn Proc ICASSP pages 1315ndash1318 2000

[10] Y Morioka S Kataoka H Zen Y Nankaku K Tokuda and T KitamuraMiniaturization of HMM-based speech synthesisIn Proc Autumn Meeting of ASJ pages 325ndash326 2004(in Japanese)

[11] S-J Kim J-J Kim and M-S HahnHMM-based Korean speech synthesis system for hand-held devicesIEEE Trans Consum Electron 52(4)1384ndash1390 2006

[12] J Yamagishi ZH Ling and S KingRobustness of HMM-based speech synthesisIn Proc Interspeech pages 581ndash584 2008

[13] J YamagishiAverage-Voice-Based Speech SynthesisPhD thesis Tokyo Institute of Technology 2006

[14] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraSpeaker interpolation in HMM-based speech synthesis systemIn Proc Eurospeech pages 2523ndash2526 1997

[15] K Shichiri A Sawabe K Tokuda T Masuko T Kobayashi and T KitamuraEigenvoices for HMM-based speech synthesisIn Proc ICSLP pages 1269ndash1272 2002

[16] H Zen N Braunschweiler S Buchholz M Gales K Knill S Krstulovic and J LatorreStatistical parametric speech synthesis based on speaker and language factorizationIEEE Trans Acoust Speech Lang Process 20(6)1713ndash1724 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 76 of 79

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 91: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

References III

[17] T Nose J Yamagishi T Masuko and T KobayashiA style control technique for HMM-based expressive speech synthesisIEICE Trans Inf Syst E90-D(9)1406ndash1413 2007

[18] H Zen A Senior and M SchusterStatistical parametric speech synthesis using deep neural networksIn Proc ICASSP pages 7962ndash7966 2013

[19] O Karaali G Corrigan and I GersonSpeech synthesis with neural networksIn Proc World Congress on Neural Networks pages 45ndash50 1996

[20] C Tuerk and T RobinsonSpeech synthesis using artificial network trained on cepstral coefficientsIn Proc Eurospeech pages 1713ndash1716 1993

[21] H Zen K Tokuda T Masuko T Kobayashi and T KitamuraA hidden semi-Markov model-based speech synthesis systemIEICE Trans Inf Syst E90-D(5)825ndash834 2007

[22] K Tokuda T Masuko N Miyazaki and T KobayashiMulti-space probability distribution HMMIEICE Trans Inf Syst E85-D(3)455ndash464 2002

[23] K Shinoda and T WatanabeAcoustic modeling based on the MDL criterion for speech recognitionIn Proc Eurospeech pages 99ndash102 1997

[24] K Yu and S YoungContinuous F0 modelling for HMM based statistical parametric speech synthesisIEEE Trans Audio Speech Lang Process 19(5)1071ndash1079 2011

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 77 of 79

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 92: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

References IV

[25] T Yoshimura K Tokuda T Masuko T Kobayashi and T KitamuraIncorporation of mixed excitation model and postfilter into HMM-based text-to-speech synthesisIEICE Trans Inf Syst J87-D-II(8)1563ndash1571 2004

[26] C BishopMixture density networksTechnical Report NCRG94004 Neural Computing Research Group Aston University 1994

[27] H Zen and A SeniorDeep mixture density networks for acoustic modeling in statistical parametric speech synthesisIn Proc ICASSP pages 3872ndash3876 2014

[28] M Zeiler M Ranzato R Monga M Mao K Yang Q-V Le P Nguyen A Senior V Vanhoucke J Dean andG HintonOn rectified linear units for speech processingIn Proc ICASSP pages 3517ndash3521 2013

[29] A Senior G Heigold M Ranzato and K YangAn empirical study of learning rates in deep neural networks for speech recognitionIn Proc ICASSP pages 6724ndash6728 2013

[30] J Duchi E Hazan and Y SingerAdaptive subgradient methods for online learning and stochastic optimizationThe Journal of Machine Learning Research pages 2121ndash2159 2011

[31] M Schuster and K PaliwalBidirectional recurrent neural networksIEEE Trans Signal Process 45(11)2673ndash2681 1997

[32] S Hochreiter and J SchmidhuberLong short-term memoryNeural computation 9(8)1735ndash1780 1997

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 78 of 79

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary
Page 93: Statistical Parametric Speech Synthesis - Google · Deep neural network (DNN)-based SPSS Deep mixture density ... Heiga Zen Statistical Parametric Speech Synthesis June 9th, 2014

References V

[33] Y Fan Y Qian F Xie and F SoongTTS synthesis with bidirectional LSTM based recurrent neural networksIn Proc Interspeech 2014(Submitted) httpresearchmicrosoftcomen-usprojectsdnntts

[34] H Zen H Sak A Graves and A SeniorStatistical parametric speech synthesis using recurrent neural networksIn UKSpeech Conference 2014

[35] J Dean G Corrado R Monga K Chen M Devin Q Le M Mao M Ranzato A Senior P Tucker K Yang andA NgLarge scale distributed deep networksIn Proc NIPS 2012

Heiga Zen Statistical Parametric Speech Synthesis June 9th 2014 79 of 79

  • Background
    • HMM-based statistical parametric speech synthesis (SPSS)
    • Flexibility
    • Improvements
      • Statistical parametric speech synthesis with neural networks
        • Deep neural network (DNN)-based SPSS
        • Deep mixture density network (DMDN)-based SPSS
        • Recurrent neural network (RNN)-based SPSS
          • Summary
            • Summary
              • lstm_samplespdf
                • Background
                  • HMM-based statistical parametric speech synthesis (SPSS)
                  • Flexibility
                  • Improvements
                    • Statistical parametric speech synthesis with neural networks
                      • Deep neural network (DNN)-based SPSS
                      • Deep mixture density network (DMDN)-based SPSS
                      • Recurrent neural network (RNN)-based SPSS
                        • Summary
                          • Summary