Fundamental Frequency Modeling for Neural-Network-based ...tonywangx.github.io/pdfs/E4.pdf ·...

Fundamental Frequency Modeling for Neural-Network-based Statistical Parametric Speech Synthesis

Xin Wang

ID: 20151706, 2018-07-13

Contact: [email protected] (suggestions and discussion)

CONTENTS

❑ Introduction
  • Background
  • Topic
  • Thesis outline

❑ Issues and methods

❑ Summary

Marked items: new results / updated explanation

INTRODUCTION

Text-to-speech (TTS)

❑ TTS pipeline [1, 2]

[Figure: Text → Text analyzer → Linguistic features → Speech synthesizer → Speech]

❑ Synthesizer

[Figure: Statistical parametric speech synthesizer (SPSS) [3, 4], consisting of acoustic models, acoustic features (spectral features and fundamental frequency (F0)), and a vocoder. (c.f. HTS Slides, by HTS Working Group)]

[1] Taylor, P. (2009). Text-to-Speech Synthesis.
[2] Dutoit, T. (1997). An Introduction to Text-to-speech Synthesis.
[3] Tokuda, K., et al. (2013). Speech Synthesis Based on Hidden Markov Models. Proceedings of the IEEE, 101(5), 1234–1252.
[4] Zen, H., et al. (2009). Statistical parametric speech synthesis. Speech Communication, 51, 1039–1064.

Text-to-speech (TTS)

❑ Neural-network-based acoustic models [5, 6, 7]

[Figure: Linguistic features → Acoustic models (neural networks) → Acoustic features (spectral features and F0). Example utterance: 次は新金岡、新金岡です。 ("Next is Shin-Kanaoka, Shin-Kanaoka."). The linguistic features are organized in a phrase tier, a mora tier (e.g. ツ, ギ, ワ), a phone tier (e.g. ts, u, g, i, w), and a frame tier (frame indices such as 162, 163, …, 194, 195, 196, …, 227, 228).]

[5] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[6] Z. H. Ling, et al. Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends. IEEE Signal Processing Magazine, 32(3):35–52, 2015.
[7] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.

Topic

❑ Neural F0 modeling for TTS

[Figure: neural networks map the tiered linguistic features (phrase, mora, phone, and frame tiers) of 次は新金岡、新金岡です。 to spectral features and F0.]

❑ Why F0?

Example dialogues from [8], in which the same reply answers different prompts:

Speaker A: Who made the marmalade?
Speaker B: Marianna made the marmalade.

Speaker A: Bob made the marmalade.
Speaker B: (No,) Marianna made the marmalade.

[8] Nanette Veilleux, et al. 6.911 Transcribing Prosodic Structure of Spoken Utterances with ToBI. January IAP 2006. https://ocw.mit.edu. License: Creative Commons BY-NC-SA.

Topic

❑ Issues to be addressed

[Figure: neural networks map linguistic features x_1, x_2, x_3, …, x_T [10] to F0 features \hat{o}_1, …, \hat{o}_T [9] and spectral features \hat{s}_1, …, \hat{s}_T.]

$$p(o_{1:T}, s_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}\big([o_t, s_t];\ \mathrm{Network}_{\Theta}(x_{1:T}, t),\ \sigma I\big)$$

  • Issue 1: joint modeling?
  • Issue 2: temporal dependency?
  • Issue 3: efficient enough?

[9] M. S. Ribeiro. Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis. PhD thesis, The University of Edinburgh, 2018.
[10] J. Hirschberg. Pitch accent in context: predicting intonational prominence from text. Artificial Intelligence, 63(1):305–340, 1993.

Thesis outline

❑ Conventional approaches [7] (Table 3.1 in thesis)

[Figure: a recurrent neural network (RNN) with a feedforward layer and recurrent layers maps linguistic features x_1, …, x_T to the F0 contour \hat{o}_1, …, \hat{o}_T and spectral features \hat{s}_1, …, \hat{s}_T over T frames (time steps).]

[7] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.

Thesis outline

❑ Three issues
  • Issue 1: Joint modeling?
  • Issue 2: Temporal dependency?
  • Issue 3: Is frame-by-frame processing efficient?

[Figure: the conventional network mapping x_1, …, x_T to \hat{o}_1, …, \hat{o}_T and \hat{s}_1, …, \hat{s}_T over T frames, annotated with the three issues.]

Thesis outline (Chapter 4)

❑ On issue 1: Joint modeling of F0 and spectral features?
  • Investigation using highway networks (novel analysis)
    o Spectral features are prioritized
    o Different input/hidden features for F0 and spectral features
  ✕ Sub-optimal for F0 modeling
  ✓ Only F0 as target

Thesis outline (Chapter 5-6)

❑ On issue 2: Temporal dependency? (novel models & interpretations)
  • Evidence from random sampling
  ✕ RNN ignores temporal dependency
  ✓ Shallow autoregressive model (SAR)
    • Short-term dependency
  ✓ Deep autoregressive model (DAR)
    • Longer dependency & best results & random sampling

Thesis outline (Chapter 7)

❑ On issue 3: Frame-by-frame processing?
  ✕ Inefficient
  ✓ Two-stage model (novel model): efficient & interpretable & multi-level
    • Stage 1: F0 contour modeling (unsupervised learning)
    • Stage 2: Linguistic linking (fast supervised learning)

[Figure: unit-level inputs x_{1_p}, x_{2_p} (Unit 1, Unit 2) are linked to the frame-level F0 contour \hat{o}_1, …, \hat{o}_T.]

Thesis outline

  • Chapter 1-3: Preliminary
  • Chapter 4: Issue 1, joint modeling for F0?
    o Highway networks
    o Tools for analysis
  • Chapter 5-6: Issue 2, temporal dependency?
    o Chapter 5: SAR & extended SAR
    o Chapter 6: DAR & techniques
  • Chapter 7: Issue 3, frame-by-frame?
    o Two-stage F0 model (stage 1: frame-by-frame; stage 2: unit-by-unit)
  • Chapter 8: Summary

Updated results: SAR + log area ratio; additional listening tests; listening test

CONTENTS

❑ Introduction

❑ Issues and methods
  • Issue 1: joint modeling of F0 and spectral features

❑ Summary

ISSUE 1: JOINT LEARNING FOR F0?

Motivation

❑ Common approach [5, 7, 11]
  • Joint (multi-task) learning: one neural network maps linguistic features to both spectral features and F0
    ? Beneficial for both targets
    ? Sharing hidden features
  • True or not? More evidence?

❑ Empirical results against joint learning [5, 12]

Method

  • Model and tools:
    o Highway network [13]
    o Histogram & sensitivity tools

[5] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[7] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3844–3848, 2014.
[12] S. Kang and H. Meng. Statistical parametric speech synthesis using weighted multi-distribution deep belief network. In Proc. Interspeech, pages 1959–1963, 2014.
[13] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. In Proc. Deep Learning Workshop, 2015.

Method

❑ Definition of highway network

Feedforward network (parameters W_i, b_i, W_o, b_o):

$$h_t = \phi(W_i x_t + b_i), \qquad \hat{o}_t = W_o h_t + b_o$$

Highway network:

$$h_t = \phi(W_i x_t + b_i)$$
$$g_t = \mathrm{sigmoid}(W_g x_t + b_g) \quad \text{(highway gate vector)}$$
$$\hat{o}_t = g_t \odot h_t + (1 - g_t) \odot x_t$$

Here \phi is the non-linear activation function and \odot is element-wise multiplication.
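The highway definition can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the activation is taken to be tanh (the experiments mention tanh-based layers), frames are stored as rows so the products are written `x @ W`, and the function name, shapes, and random weights are hypothetical rather than taken from the thesis.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def highway_block(x, W_i, b_i, W_g, b_g):
    """Apply one highway layer to every frame.

    x: (T, D) input frames; W_i, W_g: (D, D); b_i, b_g: (D,).
    Returns the output frames and the gate vectors, both (T, D).
    """
    h = np.tanh(x @ W_i + b_i)        # non-linear transformation h_t (tanh is an assumption)
    g = sigmoid(x @ W_g + b_g)        # highway gate vector g_t, each element in (0, 1)
    o_hat = g * h + (1.0 - g) * x     # element-wise mix of h_t and the input x_t
    return o_hat, g

# Toy data with hypothetical sizes.
rng = np.random.default_rng(0)
T, D = 5, 4
x = rng.standard_normal((T, D))
W_i = 0.1 * rng.standard_normal((D, D))
W_g = 0.1 * rng.standard_normal((D, D))
b_i = np.zeros(D)

# Strongly negative gate bias: g is close to 0, so the block copies its input.
o_closed, g_closed = highway_block(x, W_i, b_i, W_g, np.full(D, -30.0))
# Strongly positive gate bias: g is close to 1, so the block outputs h_t.
o_open, g_open = highway_block(x, W_i, b_i, W_g, np.full(D, 30.0))
```

The two extreme gate biases show why gate values are informative: a gate near 0 means the layer passes its input through untouched, while a gate near 1 means the layer applies its non-linear transformation.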

Method

❑ Highway network for acoustic modeling

[Figure: two architectures built from linear layers and stacked highway blocks (each block has a gate g).
 Single-stream: linguistic features pass through one stack of highway blocks that predicts MGC, BAP, and F0 together.
 Multi-stream: linguistic features pass through separate stacks of highway blocks, one each for MGC, BAP, and F0, each ending in its own linear output layer.]

  • Spectral features
    o MGC: Mel-generalized cepstral coefficients [14]
    o BAP: band aperiodicity coefficients [15, 16]

[14] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai. Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, pages 1043–1046, 1994.
[15] H. Kawahara, J. Estill, and O. Fujimura. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT. In Second International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, 2001.
[16] H. Zen and T. Toda. An overview of Nitech HMM-based speech synthesis system for Blizzard Challenge 2005. In Proc. Interspeech, pages 93–96, 2005.

Method

❑ Analysis tools

1. Histogram of gate vectors g, collected over all frames {g_1, …, g_T}
   • g ≈ 1: non-linear transformation
   • g ≈ 0: no transformation
2. Sensitivity of g to different linguistic features (Sec. 4.3.2)

$$\hat{o} = g \odot h + (1 - g) \odot x$$

[Figure: example histograms of gate values on [0, 1], one panel per highway block, for a network with linguistic features as input and MGC, BAP, and F0 as output.]
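The histogram tool can be sketched as follows. This is a minimal illustration, not the thesis code: the gate values are synthetic (drawn from Beta distributions chosen only to mimic a mostly closed and a mostly open block), and the function names, bin count, and the 0.1 / 0.9 thresholds are assumptions.

```python
import numpy as np

def gate_histogram(gates, n_bins=10):
    """Histogram of all gate values in [0, 1], pooled over frames and dimensions."""
    counts, edges = np.histogram(np.ravel(gates), bins=n_bins, range=(0.0, 1.0))
    return counts, edges

def gate_summary(gates, low=0.1, high=0.9):
    """Fraction of gate values that are nearly closed (< low) or nearly open (> high)."""
    g = np.ravel(gates)
    return {"near_0": float(np.mean(g < low)), "near_1": float(np.mean(g > high))}

# Synthetic gates for two hypothetical blocks (T frames x D dimensions each).
rng = np.random.default_rng(0)
T, D = 1000, 8
mostly_closed = rng.beta(0.5, 5.0, size=(T, D))   # mass near 0: block mostly copies its input
mostly_open = rng.beta(5.0, 0.5, size=(T, D))     # mass near 1: block mostly transforms

counts_closed, edges = gate_histogram(mostly_closed)
counts_open, _ = gate_histogram(mostly_open)
summary_closed = gate_summary(mostly_closed)
summary_open = gate_summary(mostly_open)
```

In an actual analysis the `gates` array would come from running a trained highway network on the test set and collecting g_t at each block; comparing the histograms block by block is what shows where each stream applies non-linear transformations.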

Experiments

❑ Configuration
  • Data: English, 16 hours
  • Features: MGC, BAP, F0 (interpolated F0 + voicing (U/V))
  • Metrics: root mean square error (RMSE), correlation coefficient (CORR)

❑ Three models:
  • Single-stream feedforward network
  • Single-stream highway network
  • Multi-stream highway network

❑ Two tests:
  1. Fixed layer width, varying network depth
  2. Fixed network depth, varying layer width

[Figure: a highway block with input x, gate g, and two transformation layers h^{(1)} and h^{(2)}.]
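The two objective metrics can be computed as in the sketch below. It assumes the reference and generated F0 values are already aligned frame by frame and given in Hz; the function names are hypothetical, and the U/V error helper reflects an assumed reading of the U/V figures as a frame-level mismatch rate.

```python
import numpy as np

def f0_rmse(ref_hz, gen_hz):
    """Root mean square error between two aligned F0 sequences (in Hz)."""
    ref = np.asarray(ref_hz, dtype=float)
    gen = np.asarray(gen_hz, dtype=float)
    return float(np.sqrt(np.mean((ref - gen) ** 2)))

def f0_corr(ref_hz, gen_hz):
    """Pearson correlation coefficient between two aligned F0 sequences."""
    ref = np.asarray(ref_hz, dtype=float)
    gen = np.asarray(gen_hz, dtype=float)
    return float(np.corrcoef(ref, gen)[0, 1])

def uv_error(ref_voiced, gen_voiced):
    """Fraction of frames whose voiced/unvoiced flag differs (assumed definition)."""
    ref = np.asarray(ref_voiced, dtype=bool)
    gen = np.asarray(gen_voiced, dtype=bool)
    return float(np.mean(ref != gen))

# Toy example: the generated contour has the right shape but a constant 5 Hz offset.
ref = [200.0, 210.0, 220.0, 230.0]
gen = [205.0, 215.0, 225.0, 235.0]
rmse = f0_rmse(ref, gen)
corr = f0_corr(ref, gen)
uv = uv_error([1, 1, 0, 0], [1, 0, 0, 0])
```

The toy example illustrates why both metrics are reported: a constant offset leaves the correlation perfect while the RMSE is non-zero, so RMSE measures absolute deviation and CORR measures agreement in contour shape.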

Experiments

❑ Objective results: increasing network depth
  ❖ Network depth: number of tanh-based transformation layers
  • Single-stream network prioritizes MGC?

[Figure: three panels, MGC RMSE, F0 RMSE (Hz), and F0 correlation (0-1), plotted against network depth (2, 4, 8, 14, 20, 40) for the single-stream feedforward, single-stream highway, and multi-stream highway networks.]

Experiments

❑ Objective results: increasing width (depth = 14)
  ❖ MS1: [MGC 256] – [F0 256]
  ❖ MS2: [MGC 382] – [F0 256]
  ❖ MS3: [MGC 512] – [F0 382]
  ❖ MS4: [MGC 768] – [F0 512]
  • Single-stream network prioritizes MGC?

[Figure: three panels, MGC RMSE, F0 RMSE (Hz), and F0 correlation (0-1), plotted against the number of network weights (3.0e6 to 2.5e7) for the single-stream feedforward, single-stream highway, and multi-stream highway networks. Single-stream networks are labeled by layer width (382 to 1024); multi-stream highway networks are labeled MS1 to MS4.]

Experiments

❑ Histogram of g
  • Multi-stream highway (depth 14, 7 blocks), test set
  • g ≈ 1: non-linear transformation; g ≈ 0: no transformation
  • Different hidden features for MGC and F0

[Figure: histograms of gate values g on [0, 1] for highway blocks b.1 to b.7 of the F0, BAP, and MGC sub-networks.]


Experiments

❑ Histogram of g
  • Single-stream highway (depth 14, 7 blocks), test set
  • Similar to MGC sub-net in multi-stream highway
  • Single-stream network prioritizes MGC?

[Figure: histograms of gate values g on [0, 1] for highway blocks b.1 to b.7 of the single-stream network with outputs MGC, BAP, and F0.]

Summary

❑ Answer to issue 1: joint (multi-task) learning is NOT for the sake of F0 modeling!
  ? Beneficial for both F0 and spectral features
  ? They share hidden features
  Negative evidence:
  • Joint learning (single-stream network) prioritizes spectral features
  • They use different hidden features
  • They use different input features (Sec. 4.4.3)
  • Results on English and Japanese corpora (Sec. 4.5)

❑ Only F0 modeling in the following chapters

❑ Is F0 useful for MGC modeling? How to do it? (slides appendix)

CONTENTS

❑ Introduction

❑ Issues and methods
  • Issue 2: temporal dependency modeling of F0 contours

❑ Summary

ISSUE 2: TEMPORAL DEPENDENCY?

Motivation

❑ Baseline RNN model [17]

[Figure: a recurrent neural network (RNN) maps linguistic features x_1, …, x_5 to the F0 contour \hat{o}_1, …, \hat{o}_5.]

$$x_{1:T} = \{x_1, \cdots, x_T\}, \qquad \hat{o}_{1:T} = \{\hat{o}_1, \cdots, \hat{o}_T\}$$

$$\hat{o}_t = H^{(\mathrm{RNN})}_{\Theta}(x_{1:T}, t)$$

  ❖ T: number of frames (time steps)
  • Learn the correlation between o_{t_1} and o_{t_2}, t_2 \neq t_1?

[17] R. Fernandez, et al. Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks. In Proc. Interspeech, pages 2268–2272, 2014.

Motivation

❑ Baseline RNN model as a recurrent mixture density network (RMDN)

Computational part:

$$\mathcal{M}_t = \{\mu_t\}, \quad \text{where } \mu_t = H^{(\mathrm{RNN})}_{\Theta}(x_{1:T}, t)$$

Probabilistic part:

$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, \sigma I)$$

Mean-based generation:

$$\hat{o}_t = \arg\max_{o_t} p(o_t \mid x_{1:T}; \Theta^{*}) = \mu_t$$

❑ Initial answer
  • Evidence from random sampling
  • Temporal dependency is ignored by RNN/RMDN

[Figure: F0 (Hz, 160 to 560) against frame index (100 to 600) for utterance ATR Ximera F009 AOZORAR 03372 T01, comparing natural F0, the RMDN mean-based output, and an RMDN sample.]

[18] C. M. Bishop. Mixture Density Networks. Technical report, Aston University, 2004.
[19] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
[20] M. Schuster. Better generative models for sequential data problems: Bidirectional recurrent mixture density networks. In Proc. NIPS, pages 589–595, 1999.
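The random-sampling evidence can be reproduced on toy data. The sketch below assumes a smooth, made-up mean contour and a fixed standard deviation; it only illustrates the consequence of the frame-wise factorization, namely that each frame is drawn independently, and does not use any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame means mu_t (a smooth toy contour in Hz) and a fixed std.
T = 600
t = np.arange(T)
mu = 260.0 + 60.0 * np.sin(2.0 * np.pi * t / 200.0)
sigma = 20.0

# Mean-based generation: the most likely value of each frame is simply mu_t.
mean_based = mu.copy()

# Random sampling: because the model factorizes over frames, every frame is
# drawn independently around its mean, with no link to its neighbours.
sample = mu + sigma * rng.standard_normal(T)

def roughness(contour):
    """Standard deviation of frame-to-frame differences (small for smooth contours)."""
    return float(np.std(np.diff(contour)))

rough_mean = roughness(mean_based)
rough_sample = roughness(sample)
```

Under these assumptions the sampled contour is far rougher than the mean contour, which is the kind of frame-to-frame jitter that signals a model without temporal dependency between output frames.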

Motivation

❑ Initial answer: temporal dependency is ignored by RNN/RMDN

$$\text{RNN/RMDN:} \quad p(o_{1:T}) = \prod_{t=1}^{T} p(o_t)$$

❑ Better models? Autoregressive (AR) models [23]

$$p(o_{1:T}) = \prod_{t=1}^{T} p(o_t \mid o_{1:t-1})$$

  • Shallow AR models (SAR) & extended SAR
  • Deep AR models (DAR)

[23] B. Frey. Graphical Models for Machine Learning and Digital Communication. A Bradford Book, 1998.

CONTENTS

❑ Introduction

❑ Issues and methods
  ▪ SAR and extension
  ▪ DAR

❑ Summary

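SAR

❑ Definition

$$p(o_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}, x_{1:T})$$

[Figure: x_1, …, x_5 determine the distribution parameters \mathcal{M}_1, …, \mathcal{M}_5 of o_1, …, o_5, and each o_t also depends on the previous K outputs (shown for K = 2 with weights a_1 and a_2).]

  • Time invariant
  • K: hyper-parameter
  • Trainable parameters: \psi = \{a_1, \cdots, a_K, b\}

$$f_{\psi}(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b$$

$$p(o_{1:T} \mid x_{1:T}; \Theta, \psi) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}, x_{1:T}; \Theta, \psi) = \prod_{t=1}^{T} \mathcal{N}\big(o_t;\ \mu_t + f_{\psi}(o_{t-K:t-1}),\ \Sigma_t\big)$$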
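The SAR likelihood can be evaluated directly from its definition. The sketch below is a simplified illustration for one-dimensional F0: it uses a single Gaussian per frame rather than a mixture, treats outputs before the first frame as zero, and takes the means and variances as given arrays instead of network outputs. All names are hypothetical.

```python
import numpy as np

def sar_log_likelihood(o, mu, var, a, b=0.0):
    """Log-likelihood of a 1-D sequence under a shallow AR Gaussian model.

    o, mu, var: arrays of length T (observations, network means, variances).
    a: array of K autoregressive weights; b: bias of the AR term.
    Frames before the start of the sequence are treated as zero (an assumption).
    """
    o = np.asarray(o, dtype=float)
    T, K = len(o), len(a)
    total = 0.0
    for t in range(T):
        ar_term = b
        for k in range(1, K + 1):
            if t - k >= 0:
                ar_term += a[k - 1] * o[t - k]    # a_k * o_{t-k}
        mean_t = mu[t] + ar_term                  # mu_t + f(o_{t-K:t-1})
        total += -0.5 * (np.log(2.0 * np.pi * var[t]) + (o[t] - mean_t) ** 2 / var[t])
    return total

# Tiny worked case: with a_1 = 0.5 the second frame's mean is 1.5 + 0.5 * 1.0 = 2.0,
# so both residuals are zero and only the normalization terms remain.
ll = sar_log_likelihood(o=[1.0, 2.0], mu=[1.0, 1.5], var=[1.0, 1.0], a=[0.5])

# With all AR weights set to zero the model reduces to the frame-independent case.
ll_independent = sar_log_likelihood(o=[1.0, 2.0], mu=[1.0, 2.0], var=[1.0, 1.0], a=[0.0])
```

Setting the weights to zero recovers the RMDN factorization, which makes explicit that SAR adds only K weights and a bias on top of the baseline.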

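SAR

❑ Interpretation 1 (Sec. 5.3)

$$p(o_{1:T} \mid x_{1:T}; \Theta, \psi) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}, x_{1:T}; \Theta, \psi) = \prod_{t=1}^{T} \mathcal{N}\big(o_t;\ \mu_t + f_{\psi}(o_{t-K:t-1}),\ \Sigma_t\big)$$

❑ Interpretation 2 (Sec. 5.3): filters turn o_1, …, o_5 into c_1, …, c_5

$$p(o_{1:T} \mid x_{1:T}; \Theta, \psi) = \prod_{t=1}^{T} p_c(c_t \mid x_{1:T}; \Theta), \qquad c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}$$

  • SAR = linear transformation + RMDN

Training: o_{1:T} is transformed into c_{1:T}, and the RMDN models \prod_{t=1}^{T} p(c_t \mid x_{1:T}; \Theta):

$$c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}$$

Generation: the RMDN generates \hat{c}_{1:T} from x_{1:T}, and the inverse transformation recovers \hat{o}_{1:T}:

$$\hat{o}_t = \hat{c}_t + \sum_{k=1}^{K} a_k \odot \hat{o}_{t-k}$$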
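The filter interpretation amounts to a pair of recursions that undo each other. The sketch below works on a one-dimensional toy sequence with hypothetical filter weights and treats samples before the start of the sequence as zero; it is an illustration of the two equations, not the thesis implementation.

```python
import numpy as np

def sar_filter(o, a):
    """Training direction: c_t = o_t - sum_k a_k * o_{t-k}."""
    o = np.asarray(o, dtype=float)
    c = o.copy()
    for t in range(len(o)):
        for k in range(1, len(a) + 1):
            if t - k >= 0:
                c[t] -= a[k - 1] * o[t - k]
    return c

def sar_inverse_filter(c, a):
    """Generation direction: o_t = c_t + sum_k a_k * o_{t-k}, computed frame by frame."""
    c = np.asarray(c, dtype=float)
    o = np.zeros_like(c)
    for t in range(len(c)):
        o[t] = c[t]
        for k in range(1, len(a) + 1):
            if t - k >= 0:
                o[t] += a[k - 1] * o[t - k]   # uses outputs that were already generated
    return o

# Hypothetical second-order filter (K = 2) and a toy F0-like sequence.
a = np.array([0.8, -0.2])
o = np.array([200.0, 205.0, 212.0, 220.0, 218.0, 210.0])
c = sar_filter(o, a)
o_restored = sar_inverse_filter(c, a)
```

The forward filter is a plain FIR operation that can run over all frames at once, whereas the inverse is recursive because each output depends on previously generated outputs; that asymmetry is why generation is sequential.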

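Extended SAR (eSAR)

❑ Motivation
  • SAR -> non-linear transformation + RMDN?
  • Yes, if f_{\psi}(\cdot) is invertible and `simple'

Training: c_{1:T} = f_{\psi}(o_{1:T}), and the RMDN models \prod_{t=1}^{T} p(c_t \mid x_{1:T}; \Theta).

Generation: the RMDN generates \hat{c}_{1:T} from x_{1:T}, and \hat{o}_{1:T} = f_{\psi}^{-1}(\hat{c}_{1:T}).

$$p_o(o_{1:T} \mid x_{1:T}; \Theta, \psi) = p_c\big(c_{1:T} = f_{\psi}(o_{1:T}) \mid x_{1:T}; \Theta\big) \left| \det \frac{\partial f_{\psi}(o_{1:T})}{\partial o_{1:T}} \right|$$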

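Extended SAR (eSAR)

❑ Definition
  • Volume-preserving [24]: \left| \det \frac{\partial f_{\psi}(o_{1:T})}{\partial o_{1:T}} \right| = 1 for both SAR and eSAR

With c_{1:T} = f_{\psi}(o_{1:T}) and \hat{o}_{1:T} = f_{\psi}^{-1}(\hat{c}_{1:T}):

SAR:

$$c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}, \qquad \hat{o}_t = \hat{c}_t + \sum_{k=1}^{K} a_k \odot \hat{o}_{t-k}, \qquad \det \frac{\partial f_{\psi}(o_{1:T})}{\partial o_{1:T}} = 1$$

eSAR:

$$c_t = o_t - \mathrm{RNN}_{\psi}(o_{1:t-1}, t), \qquad \hat{o}_t = \hat{c}_t + \mathrm{RNN}_{\psi}(\hat{o}_{1:t-1}, t), \qquad \det \frac{\partial f_{\psi}(o_{1:T})}{\partial o_{1:T}} = 1$$

[24] J. M. Tomczak and M. Welling. Improving variational auto-encoders using convex combination linear inverse autoregressive flow. In Proc. Benelearn, pages 162–194, 2017.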
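The volume-preserving property can be checked numerically. In the sketch below the recurrent network is replaced by a small hand-written function of past values only (`causal_shift`, a hypothetical stand-in, not an RNN); any such function yields a lower-triangular Jacobian with a unit diagonal, so the determinant is 1 and the transformation can be inverted frame by frame.

```python
import numpy as np

def causal_shift(o_prev):
    """Hypothetical stand-in for RNN(o_{1:t-1}, t): depends on past values only."""
    if len(o_prev) == 0:
        return 0.0                      # no history at the first frame
    return float(np.tanh(0.5 * o_prev[-1]) + 0.1 * np.mean(o_prev))

def forward(o):
    """c_t = o_t - shift(o_{1:t-1}); all frames could be processed in parallel."""
    return np.array([o[t] - causal_shift(o[:t]) for t in range(len(o))])

def inverse(c):
    """o_t = c_t + shift(o_{1:t-1}); sequential, since o_t needs earlier outputs."""
    o = []
    for t in range(len(c)):
        o.append(c[t] + causal_shift(np.array(o)))
    return np.array(o)

def numerical_jacobian(f, o, eps=1e-6):
    """Central-difference Jacobian d f(o) / d o."""
    T = len(o)
    J = np.zeros((T, T))
    for j in range(T):
        d = np.zeros(T)
        d[j] = eps
        J[:, j] = (f(o + d) - f(o - d)) / (2.0 * eps)
    return J

rng = np.random.default_rng(0)
o = rng.standard_normal(6)
c = forward(o)
J = numerical_jacobian(forward, o)
det_J = float(np.linalg.det(J))
o_restored = inverse(c)
```

Because the determinant is 1, the change-of-variables term drops out of the likelihood, and the density of o_{1:T} equals the RMDN density evaluated at c_{1:T}.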

Extended SAR (eSAR)

❑ Implementation

Training: transformation (normalizing flow), then modeling

$$\mu_t = \mathrm{RNN}_{\psi}(o_{1:t-1}, t) \ \ (\mu_1 = 0), \qquad c_t = o_t - \mu_t, \qquad c_{1:T} \text{ modeled by } p_c(c_{1:T} \mid x_{1:T}; \Theta)$$

Generation: sampling, then de-transformation

$$\hat{c}_{1:T} \sim p_c(c_{1:T} \mid x_{1:T}; \Theta), \qquad \hat{\mu}_t = \mathrm{RNN}_{\psi}(\hat{o}_{1:t-1}, t), \qquad \hat{o}_t = \hat{c}_t + \hat{\mu}_t$$

Summary of SAR

❑ Theoretically appealing
  • Signal processing: SAR + stable filters
    o Real-valued poles
    o Log area ratio (Sec. 5.3.2)
    o Segment-variant filters (slides)
  • Machine learning: eSAR
    o Volume-preserving
    o Non-volume-preserving (slides)
    o …

❑ Performance for F0 modeling (model details later)
  • eSAR 55.70% vs. SAR 44.30% (bar chart, 0% to 100%)

❑ Performance for MGC modeling
  • Better than RNN/RMDN [25, 26]

[25] X. Wang, J. Lorenzo-Trueba, S. Takaki, L. Juvela, and J. Yamagishi. A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis. In Proc. ICASSP, pages 4804–4808, 2018.
[26] X. Wang, S. Takaki, and J. Yamagishi. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP, pages 4895–4899, 2017.

CONTENTS

❑ Introduction

❑ Issues and methods
  ▪ SAR and extension
  ▪ DAR

❑ Summary

DAR

❑ Motivation
  • Are SAR and eSAR sufficiently good?
  • Random sampling on SAR and eSAR?

[Figure: F0 (Hz, 160 to 560) of natural F0 compared with a random sample from SAR, and of natural F0 compared with a random sample from eSAR.]

  • SAR: linear (Sec. 6.1)
  • eSAR: non-linear but a special form

SAR and eSAR transform c_{1:T} = f_{\psi}(o_{1:T}) and model c_{1:T} given x_{1:T}:

$$\prod_{t=1}^{T} p(c_t \mid x_{1:T}; \Theta)$$

❑ Non-linear and non-invertible AR transformation?
  • Network with AR dependency, taking both o_{1:T} and x_{1:T} as input

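DAR

❑ Definition

$$p(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t \mid o_{1:t-1}, x_{1:T}; \Theta)$$

Discrete form:

$$P(o_{1:T} \mid x_{1:T}; \Theta) = \prod_{t=1}^{T} P(o_t \mid o_{1:t-1}, x_{1:T}; \Theta)$$

  • DAR is more general than SAR (Sec. 6.2.2 & slides, toy examples)
    1. Longer-time dependency
    2. Non-linear dependency

[Figure: x_1, …, x_5 determine the distribution parameters \mathcal{M}_1, …, \mathcal{M}_5 of o_1, …, o_5, with previous outputs fed back into the network.]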
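Generation from the discrete factorization is ancestral sampling: draw one frame, feed it back, and draw the next. The sketch below is a toy illustration in which the network is replaced by a hand-written conditional (`toy_conditional`, a hypothetical stand-in) that looks only at the previous output and the current input, whereas the actual model conditions on the whole history through its recurrent state.

```python
import numpy as np

def toy_conditional(prev_bin, x_t, n_bins, width=2.0):
    """Hypothetical stand-in for the network output P(o_t | o_{t-1}, x_t).

    Returns a categorical distribution over n_bins quantized F0 values,
    peaked between the previous output and the current input.
    """
    bins = np.arange(n_bins)
    center = 0.8 * prev_bin + 0.2 * x_t
    logits = -0.5 * ((bins - center) / width) ** 2
    p = np.exp(logits - logits.max())       # numerically stable softmax
    return p / p.sum()

def dar_sample(x, n_bins, rng):
    """Ancestral sampling: each frame is drawn given the previously drawn frame."""
    o = np.zeros(len(x), dtype=int)
    prev = x[0]                             # initial history (an assumption)
    for t in range(len(x)):
        p = toy_conditional(prev, x[t], n_bins)
        o[t] = rng.choice(n_bins, p=p)
        prev = o[t]                         # feed the sampled value back
    return o

rng = np.random.default_rng(0)
n_bins = 256
x = np.full(200, 128.0)                     # toy frame-level input
sample_a = dar_sample(x, n_bins, rng)
sample_b = dar_sample(x, n_bins, rng)
largest_jump = int(np.max(np.abs(np.diff(sample_a))))
```

Because every draw is conditioned on the previous draw, the sampled contour moves in small steps instead of jumping independently from frame to frame, and repeated calls give different but still smooth contours.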

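DAR

❑ Implementation (Sec. 6.3)
  • Quantized F0
  • Hierarchical softmax
  • Data dropout

[Figure: x_1, …, x_5 determine the distribution parameters \mathcal{M}_1, …, \mathcal{M}_5 of o_1, …, o_5, with previous outputs fed back into the network.]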
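F0 quantization can be sketched as below. The slides state only that F0 is quantized (256 bins are used in the experiments); everything else here is an assumption made for illustration: bin 0 is reserved for unvoiced frames, voiced values are spaced uniformly on a log scale, and the 60 to 600 Hz range is hypothetical.

```python
import numpy as np

def quantize_f0(f0_hz, n_bins=256, f_min=60.0, f_max=600.0):
    """Map F0 in Hz to integer bins; 0 marks unvoiced frames (F0 == 0)."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    idx = np.zeros(f0_hz.shape, dtype=int)
    voiced = f0_hz > 0
    log_f = np.log(np.clip(f0_hz[voiced], f_min, f_max))
    scaled = (log_f - np.log(f_min)) / (np.log(f_max) - np.log(f_min))   # in [0, 1]
    idx[voiced] = 1 + np.round(scaled * (n_bins - 2)).astype(int)        # bins 1 .. n_bins-1
    return idx

def dequantize_f0(idx, n_bins=256, f_min=60.0, f_max=600.0):
    """Map bin indices back to F0 in Hz; bin 0 becomes 0 (unvoiced)."""
    idx = np.asarray(idx)
    f0 = np.zeros(idx.shape, dtype=float)
    voiced = idx > 0
    scaled = (idx[voiced] - 1) / (n_bins - 2)
    f0[voiced] = np.exp(np.log(f_min) + scaled * (np.log(f_max) - np.log(f_min)))
    return f0

f0 = np.array([0.0, 60.0, 220.0, 600.0])
bins = quantize_f0(f0)
f0_back = dequantize_f0(bins)
relative_error = abs(f0_back[2] - 220.0) / 220.0
```

With a discrete target the model can output a categorical distribution per frame; one plausible reading of the hierarchical softmax is a voiced/unvoiced decision followed by a choice among the voiced bins, which matches the reserved unvoiced bin assumed in this sketch.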

Experiment

❑ Data and features
  • Data: Japanese, 48 hours
  • Feature: F0 (interpolated F0 value, 1 dim + U/V, 1 dim)
  • Feature for DAR and WaveNet-F0: quantized F0 (256 quantization bins)

❑ Models (layers listed from input to output; input: linguistic features)
  • RNN: FF, FF, bi-LSTM, bi-LSTM, linear → F0
  • RMDN: FF, FF, bi-LSTM, bi-LSTM, linear, GMM (2 mixtures) → F0
  • SAR: FF, FF, bi-LSTM, bi-LSTM, linear, GMM (2 mixtures) → F0
  • eSAR: FF, FF, bi-LSTM, bi-LSTM, linear, GMM (2 mixtures), with three normalizing-flow layers (each a uni-LSTM of size 64 plus a linear layer of size 2) → F0
  • Layer sizes listed for these models: 512, 512, 256, 128
  • DAR: FF, FF, bi-LSTM, uni-LSTM, linear, hierarchical softmax → quantized F0 (layer sizes listed: 512, 512, 256, 128, 256)
  • WaveNet-F0: FF, FF, bi-LSTM, bi-LSTM, linear, WaveNet, hierarchical softmax → quantized F0
  ❖ FF: feedforward with tanh activation function


Experiment

❑ Random sampling

[Figure: F0 (Hz) against frame index (100 to 600) for utterance ATR Ximera F009 AOZORAR 03372 T01. Panels: natural F0 and a random sample from SAR; natural F0 and a random sample from eSAR; natural F0 and DAR sample 1; DAR sample 2; DAR sample 3.]

❑ Mean-based generation

[Figure: F0 (Hz) against frame index (100 to 600) for the same utterance. Six panels compare natural F0 with the output of RNN, RMDN, SAR, eSAR, WaveNet-F0, and DAR.]


Experiment

❑ Mean-based generation
  • 500 test utterances, > 1000 sets of scores

MOS test 1

             Natural   DAR      SAR       RMDN      RNN
MOS          4.243     3.856    3.564     3.628     3.565

p-values of pairwise comparisons (p-value > 0.05 only for SAR vs. RNN):

             Natural   DAR      SAR       RMDN      RNN
Natural      -         <1e-10   <1e-10    <1e-10    <1e-10
DAR          <1e-10    -        <1e-10    <1e-10    <1e-10
SAR          <1e-10    <1e-10   -         0.01785   0.7429
RMDN         <1e-10    <1e-10   0.01785   -         0.00426
RNN          <1e-10    <1e-10   0.7429    0.00426   -

MOS test 2

             Natural   DAR      SAR       eSAR      WaveNet-F0
MOS          4.199     3.971    3.732     3.813     3.803

p-values of pairwise comparisons (p-value > 0.05 only for eSAR vs. WaveNet-F0):

             Natural   DAR        SAR        eSAR       WaveNet-F0
Natural      -         <1e-10     <1e-10     <1e-10     <1e-10
DAR          <1e-10    -          <1e-10     9.062e-08  0.000102
SAR          <1e-10    <1e-10     -          0.007186   0.000176
eSAR         <1e-10    9.062e-08  0.007186   -          0.219733
WaveNet-F0   <1e-10    0.000102   0.000176   0.219733   -

Summary

❑ Full answer to issue 2: temporal dependency is ignored by RNN/RMDN! But a better model can be defined.

               RMDN   SAR          eSAR                       DAR
AR linear?     -      Linear       Non-linear (constrained)   Non-linear
AR time span   -      t-K : t-1    1 : t-1                    1 : t-1
Tractable?     -      Yes          Somewhat                   No
Sampling?      No     No           No                         Yes

CONTENTS

❑ Introduction

❑ Issue 3: frame-by-frame processing

❑ Summary

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Motivation

❑ Inefficient processing

[Figure: linguistic features x_1, …, x_T are mapped to o_1, …, o_T frame by frame. On the frame tier, the phrase-, mora-, and phone-tier labels (e.g. mora ツ, ギ; phones ts, u, g) are repeated for every frame of the same unit (Phone 1, Phone 2, Phone 3).]

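Motivation

❑ More efficient processing?
  • Given the duration of each unit
  • How to do it?

[Figure: unit-level inputs x_{1_p}, x_{2_p}, x_{3_p} (one per phone, e.g. ts, u, g) are to be mapped to the frame-level F0 contour o_1, …, o_T.]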

Method

❑ Two-stage F0 modeling
  • Stage 1: F0 contour modeling
    o Vector-quantization variational auto-encoder (VQ-VAE [27])
    o Compact code space
  • Stage 2: Linguistic linking
    o Sequential classifier
  • Revisit linguistic approaches [28, 29]

[27] A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Proc. NIPS, page to appear, 2017.
[28] K. E. Dusterhoff, A. W. Black, and P. A. Taylor. Using decision trees within the Tilt intonation model to predict F0 contours. In Proc. Eurospeech, pages 1627–1630, 1999.
[29] K. Hirose, K. Sato, Y. Asano, and N. Minematsu. Synthesis of F0 contours using generation process model parameters predicted from unlabeled corpora: Application to emotional speech synthesis. Speech Communication, 46(3):385–404, 2005.

Stage 1: F0 contour modeling

❑ VQ-VAE
  • Goals:
    o Unsupervised learning
    o One code for varied-length unit
    o Multiple linguistic levels
  • Compact code space

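Stage 1: F0 contour modeling

❑ VQ-VAE
  • One code per phone

[Figure: the encoder maps the F0 contour o_1, …, o_T to one latent vector per phone (z_{1_p}, z_{2_p}, z_{3_p}); each latent vector is replaced by a codebook vector (e_{1_p}, e_{2_p}, e_{3_p}), and the decoder reconstructs o_1, …, o_T from the codebook vectors.]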
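The codebook lookup at the heart of this stage can be sketched as follows. The latent vectors, the codebook, and the durations are made-up numbers, the encoder and decoder networks are omitted, and the nearest-neighbour rule with Euclidean distance follows the usual VQ-VAE formulation rather than any detail given on the slides.

```python
import numpy as np

def quantize(z, codebook):
    """Replace each phone-level latent vector by its nearest codebook vector.

    z: (N, D) latent vectors, one per phone; codebook: (M, D).
    Returns the code indices (N,) and the selected code vectors (N, D).
    """
    dist = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)   # (N, M)
    idx = dist.argmin(axis=1)
    return idx, codebook[idx]

def upsample(codes, durations):
    """Repeat each phone-level code for the number of frames in that phone."""
    return np.repeat(codes, durations, axis=0)

# Hypothetical codebook with three entries and latent vectors for three phones.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
z = np.array([[0.1, -0.1], [4.5, 5.2], [0.9, 1.2]])
durations = np.array([3, 5, 2])             # frames per phone

indices, codes = quantize(z, codebook)
frame_codes = upsample(codes, durations)    # frame-level input for a decoder
```

Only the integer indices need to be predicted from text in the second stage, one per phone, which is where the efficiency gain over frame-by-frame prediction comes from.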

Stage 1: F0 contour modeling

❑ VQ-VAE with multiple linguistic tiers
  • One code per phone & one code per mora

[Figure: a phone-level encoder with a phone-level codebook (z_{1_p}, z_{2_p}, z_{3_p} and e_{1_p}, e_{2_p}, e_{3_p}) and a mora-level encoder with a mora-level codebook (z_{1_m}, z_{2_m} and e_{1_m}, e_{2_m}, for Mora 1 and Mora 2) feed one decoder that reconstructs o_1, …, o_T.]

The VQ-VAE encoder(s) convert the F0 contour into code indices, and the VQ-VAE decoder reconstructs the contour from the code indices, the codebooks, and the durations:

$$\text{Phone code indices: } \{l_{1_p}, l_{2_p}, l_{3_p}, \cdots, l_{N_p}\} \qquad \text{Mora code indices: } \{l_{1_m}, l_{2_m}, \cdots, l_{N_m}\}$$

Stage 2: Linguistic linking

[Figure: the linguistic linker maps linguistic features to the phone code indices \{l_{1_p}, l_{2_p}, l_{3_p}, \cdots, l_{N_p}\} and mora code indices \{l_{1_m}, l_{2_m}, \cdots, l_{N_m}\}; the VQ-VAE decoder generates o_1, …, o_T from these indices, the codebooks, and the durations.]

❑ RNN-based sequential classifier
  • Clockwork RNN [30]
  • Highway [31]
  • AR & feedback links
  • Dropout [32] (Sec. 7.2.2)

[30] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber. A Clockwork RNN. In Proc. ICML, pages 1863–1871, 2014.
[31] K. Greff, R. K. Srivastava, and J. Schmidhuber. Highway and residual networks learn unrolled iterative estimation. In Proc. ICLR, 2017.
[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

F0 modeling for TTS

  • Training set
    o Stage 1 (VQ-VAE): F0 → encoders & codebooks → code indices → decoder → F0
    o Stage 2 (linker): linguistic features → linker → code indices
  • Test set
    o Linguistic features → linker → code indices → decoder (with codebooks) → F0

Experiments

  • Experiments on stage 1 (Sec. 7.3.2)
  • Experiments on the whole model

❑ Models
  • Given natural duration
  • Model 1: frame-by-frame linker + VQ-VAE decoder (phone-level)
  • Model 2: phone-by-phone linker + VQ-VAE decoder (phone-level)
  • Model 3: phone-by-phone linker + mora lock + VQ-VAE decoder (phone-mora-level)

Experiments

❑ Objective results

          RMSE   CORR    U/V     Time cost (s/epoch)
Model 1   34.3   0.839   7.96%   1300
Model 2   27.1   0.906   6.36%   54
Model 3   25.5   0.916   4.87%   65

VAE vs DAR

❑ Objective results
  • VQ-VAE encoder needs time and memory

                                        Number of parameters (m)     Time cost (s/epoch)
                RMSE   CORR    U/V      Stage 1  Stage 2  Sum        Stage 1  Stage 2  Sum
DAR             28.3   0.903   3.46%    0.36     1.11     1.48       ~700     ~1300    2000
VAE (model 3)   25.5   0.916   4.87%    0.44     0.67     1.11       1500     65       1565

VAE vs DAR

❑ Subjective test
  • 500 test utterances, > 1000 sets of scores

MOS test

                Natural   DAR      SAR      eSAR     WaveNet-F0   VAE (model 3)
MOS             4.199     3.971    3.732    3.813    3.803        3.993

p-values of pairwise comparisons (p-value > 0.05 for eSAR vs. WaveNet-F0 and for DAR vs. VAE (model 3)):

                Natural   DAR        SAR        eSAR       WaveNet-F0   VAE (model 3)
Natural         -         <1e-10     <1e-10     <1e-10     <1e-10       <1e-10
DAR             <1e-10    -          <1e-10     9.062e-08  0.000102     0.392125
SAR             <1e-10    <1e-10     -          0.007186   0.000176     <1e-10
eSAR            <1e-10    9.062e-08  0.007186   -          0.219733     5.946e-10
WaveNet-F0      <1e-10    0.000102   0.000176   0.219733   -            2.635e-06
VAE (model 3)   <1e-10    0.392125   <1e-10     5.946e-10  2.635e-06    -

Summary

❑ Answer to issue 3: it could be more efficient

❑ How
  • Linguistic features → linker (unit-by-unit) → VQ-VAE decoder with codebooks (frame-by-frame) → F0

❑ Results
  • Multiple linguistic tiers
  • More efficient than DAR: smaller + faster + F0 CORR > 0.91
  • Interpretable latent code spaces (Sec. 7.3.2)
  • Random sampling OK (slides)

CONTENTS

❑ Introduction

❑ Issue 3: frame-by-frame processing

❑ Conclusion

CONCLUSION

Summary

Preliminary (Chapter 1-3)

Issue 1: Joint modeling for F0? (Chapter 4)
  ? Joint modeling of F0 and spectral features?
  ✕ Sub-optimal for F0 modeling
  • Methods:
    o Highway networks
    o Histogram + sensitivity analysis
  • Results:
    o Spectral features are prioritized
    o Different input/hidden features for F0 and spectral features

Issue 2: Temporal dependency? (Chapter 5-6)
  ? Temporal dependency in RNN/RMDN?
  ✕ Ignored by RNN/RMDN
  • Models:
    o SAR: tractable dependency
    o DAR: non-linear + longer dependency
  • Results:
    o DAR: F0 CORR > 0.90, MOS score ⬆
    o DAR supports random sampling!

Issue 3: Frame-by-frame? (Chapter 7)
  ? Frame-by-frame processing
  ✕ Inefficient
  • Two-stage F0 model:
    o F0 contour coding: VQ-VAE + DAR
    o Linguistic linking: unit-by-unit classifier
  • Results:
    o F0 CORR > 0.91 ⬆
    o More efficient (smaller, faster)

Summary (Chapter 8)

New topics?

❑ Towards complete prosody modeling
  • Not only F0 but also duration
  • Easy for joint modeling: linguistic features → linker → latent code per phone and duration of phone

❑ Waveform modeling
  • F0 and waveform are 1-dimensional signals
  • Signal processing methods available: SAR + log area ratio + segment-variant filters

Thank you for your attention

Q&A

Codes, scripts, slides: tonywangx.github.io

THESIS OUTLINE

❑ Neural F0 modeling
  • Chapter 1-3: Introduction, TTS, neural networks
  • Chapter 4: Investigation using highway network
  • Chapter 5: Shallow autoregressive F0 model (SAR), extended SAR
  • Chapter 6: Deep autoregressive F0 model (DAR)
  • Chapter 7: Variational auto-encoder (VAE) + DAR
  • Chapter 8: Summary and application of F0 model

  • Publications: ICASSP 2018; Speech Synthesis Workshop 2016; Speech Communication, Vol. 96, Feb. 2018, pp. 1-9; ICASSP 2017; Interspeech 2017; IEEE Trans. ASLP, accepted, 2018; a journal paper in preparation

  ✓ Meets the basic requirement on publication: 3 journals + 6 conferences

WORK DURING PH.D.

❑ Work not included in Ph.D. thesis
  • Word embedding as input features (IEICE 2018, Interspeech 2016)
  • Impact of training data size (SSW 2016)

❑ Collaborated work
  • DAR using manually-annotated/corrupted data (Interspeech 2018): provided DAR, SAR, and WaveNet-vocoder
  • Voice cloning using found data (Odyssey 2018): provided WaveNet-vocoder and acoustic models
  • Cyborg speech: multilingual speech synthesis (ICASSP 2018): provided acoustic models
  • Speech synthesis from MFCC (ICASSP 2018): provided DAR on MFCC
  • Controllable speech synthesis (Interspeech 2017): provided acoustic models
