Fundamental Frequency Modeling for Neural-Network-based Statistical Parametric Speech Synthesis

Xin Wang
ID: 20151706
2018-07-13
Contact: [email protected] (we welcome critical comments, suggestions, and discussion)
Source: tonywangx.github.io/pdfs/E4.pdf


Page 2: CONTENTS

• Introduction
  o Background
  o Topic
  o Thesis outline
• Issues and methods
• Summary

Legend: new results / updated explanation

Page 3: INTRODUCTION

Text-to-speech (TTS)
• TTS pipeline [1, 2]: Text → Text analyzer → Linguistic features → Speech synthesizer → Speech
• Synthesizer: statistical parametric speech synthesizer (SPSS) [3, 4]
  o Acoustic models
  o Acoustic features: fundamental frequency (F0), spectral features
  o Vocoder
(cf. HTS Slides, by HTS Working Group)

[1] Taylor, P. (2009). Text-to-Speech Synthesis.
[2] Dutoit, T. (1997). An Introduction to Text-to-speech Synthesis.
[3] Tokuda, K., et al. (2013). Speech Synthesis Based on Hidden Markov Models. Proceedings of the IEEE, 101(5), 1234–1252.
[4] Zen, H., et al. (2009). Statistical parametric speech synthesis. Speech Communication, 51, 1039–1064.

Page 4: INTRODUCTION

Text-to-speech (TTS)
• Neural-network-based acoustic models [5, 6, 7]: Linguistic features → Acoustic models (neural networks) → Acoustic features (spectral features, fundamental frequency (F0))

[Figure: linguistic features of the example sentence 次は新金岡、新金岡です。 ("Next is Shin-Kanaoka, Shin-Kanaoka.") arranged on four tiers (phrase tier, mora tier, phone tier, frame tier) and fed to neural networks that output spectral features and F0.]

[5] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[6] Z. H. Ling, et al. Deep learning for acoustic modeling in parametric speech generation: A systematic review of existing techniques and future trends. IEEE Signal Processing Magazine, 32(3):35–52, 2015.
[7] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.

Page 5: INTRODUCTION

Topic
• Neural F0 modeling for TTS
• Why F0? Example dialogues [8]:
  o Speaker A: Who made the marmalade?
    Speaker B: Marianna made the marmalade.
  o Speaker A: Bob made the marmalade.
    Speaker B: (No,) Marianna made the marmalade.

[Figure: the tiered linguistic features of 次は新金岡、新金岡です。 ("Next is Shin-Kanaoka, Shin-Kanaoka.") fed to neural networks that output spectral features and F0, as on page 4.]

[8] Nanette Veilleux, et al. 6.911 Transcribing Prosodic Structure of Spoken Utterances with ToBI. January IAP 2006. https://ocw.mit.edu. License: Creative Commons BY-NC-SA.

Page 6: INTRODUCTION

Topic
• Issues to be addressed

[Figure: neural networks map linguistic features [10] $x_1, x_2, x_3, \cdots, x_T$ to spectral features $\hat{s}_1, \cdots, \hat{s}_T$ and F0 features [9] $\hat{o}_1, \cdots, \hat{o}_T$.]

  $p(o_{1:T}, s_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}([o_t, s_t]; \mathrm{Network}_{\Theta}(x_{1:T}, t), \sigma I)$

• Issue 1: joint modeling?
• Issue 2: temporal dependency?
• Issue 3: efficient enough?

[9] M. S. Ribeiro. Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis. PhD thesis, The University of Edinburgh, 2018.
[10] J. Hirschberg. Pitch accent in context: predicting intonational prominence from text. Artificial Intelligence, 63(1):305–340, 1993.

Page 7: INTRODUCTION

Thesis outline
• Conventional approaches [7] (Table 3.1 in thesis)

[Figure: a recurrent neural network (RNN) with a feedforward layer and recurrent layers maps linguistic features $x_1, \cdots, x_T$ over T frames (time steps) to the F0 contour $\hat{o}_1, \cdots, \hat{o}_T$ and spectral features $\hat{s}_1, \cdots, \hat{s}_T$.]

[7] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.

Page 8: INTRODUCTION

Thesis outline
• Three issues
  o Issue 1: Joint modeling?
  o Issue 2: Temporal dependency?
  o Issue 3: Is frame-by-frame processing efficient?

[Figure: linguistic features $x_1, \cdots, x_T$ mapped over T frames to $\hat{o}_1, \cdots, \hat{o}_T$ and $\hat{s}_1, \cdots, \hat{s}_T$.]

Page 9: INTRODUCTION

Thesis outline (Chapter 4)
• On issue 1: Joint modeling of F0 and spectral features?
  o Investigation using highway networks (novel analysis)
    - Spectral features are prioritized
    - Different input/hidden features for F0 and spectral features
  ✕ Sub-optimal for F0 modeling
  ✓ Only F0 as target

[Figure: linguistic features $x_1, \cdots, x_T$ mapped to the F0 contour $\hat{o}_1, \cdots, \hat{o}_T$ and spectral features $\hat{s}_1, \cdots, \hat{s}_T$.]

Page 10: INTRODUCTION

Thesis outline (Chapter 5-6)
• On issue 2: Temporal dependency?
  o Evidence from random sampling
  ✕ RNN ignores temporal dependency

[Figure: linguistic features $x_1, \cdots, x_T$ mapped to $\hat{o}_1, \cdots, \hat{o}_T$.]

Page 11: INTRODUCTION

Thesis outline (Chapter 5-6)
• On issue 2: Temporal dependency? (novel models & interpretations)
  ✓ Shallow autoregressive model (SAR)
    o Short-term dependency
  ✓ Deep autoregressive model (DAR)
    o Longer dependency & best results & random sampling

[Figure: DAR mapping linguistic features $x_1, \cdots, x_T$ to $\hat{o}_1, \cdots, \hat{o}_T$.]

Page 12: INTRODUCTION

Thesis outline (Chapter 7)
• On issue 3: Frame-by-frame processing?
  ✕ Inefficient

[Figure: frame-level linguistic features $x_1, \cdots, x_T$ (phrase, mora, phone, and frame tiers) mapped frame by frame to $\hat{o}_1, \cdots, \hat{o}_T$.]

Page 13: INTRODUCTION

Thesis outline (Chapter 7)
• On issue 3: Frame-by-frame processing?
  ✓ Two-stage model (novel model): efficient & interpretable & multi-level
    o Stage 1: F0 contour modeling
    o Stage 2: Linguistic linking
    o Labels in the figure: unsupervised learning; fast supervised learning

[Figure: unit-level linguistic inputs (Unit 1, Unit 2) linked to the frame-level F0 contour $\hat{o}_1, \cdots, \hat{o}_T$.]

Page 14: INTRODUCTION

Thesis outline
• Chapter 1-3: Preliminary
• Chapter 4: Issue 1, joint modeling for F0?
  o Highway networks
  o Tools for analysis
• Chapter 5 and Chapter 6: Issue 2, temporal dependency?
  o SAR & extended SAR
  o DAR & techniques
• Chapter 7: Issue 3, frame-by-frame?
  o Two-stage F0 model (stage 1: frame-by-frame; stage 2: unit-by-unit)
• Chapter 8: Summary

Page 15: INTRODUCTION

Updated results
• Chapter 1-3: Preliminary
• Chapter 4: Issue 1, joint modeling for F0?
• Chapter 5 and Chapter 6: Issue 2, temporal dependency?
• Chapter 7: Issue 3, frame-by-frame?
• Chapter 8: Summary
• Updates: SAR + log area ratio; additional listening tests; listening test

Page 16: CONTENTS

• Introduction
• Issues and methods
  o Issue 1: joint modeling of F0 and spectral features
• Summary

Page 17: ISSUE 1: JOINT LEARNING FOR F0?

Motivation
• Common approach [5, 7, 11]: joint (multi-task) learning, where one neural network maps linguistic features to both spectral features and F0
  ? Beneficial for both targets
  ? Sharing hidden features
• Empirical results against joint learning [5, 12]
• True or not? More evidence?

[5] H. Zen, A. Senior, and M. Schuster. Statistical parametric speech synthesis using deep neural networks. In Proc. ICASSP, pages 7962–7966, 2013.
[7] Y. Fan, Y. Qian, F. Xie, and F. K. Soong. TTS synthesis with bidirectional LSTM based recurrent neural networks. In Proc. Interspeech, pages 1964–1968, 2014.
[11] H. Zen and A. Senior. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Proc. ICASSP, pages 3844–3848, 2014.
[12] S. Kang and H. Meng. Statistical parametric speech synthesis using weighted multi-distribution deep belief network. In Proc. Interspeech, pages 1959–1963, 2014.

Page 18: ISSUE 1: JOINT LEARNING FOR F0?

Method
• Joint (multi-task) learning: true or not? More evidence?
  ? Beneficial for both targets
  ? Sharing hidden features
• Model and tools:
  o Highway network [13], mapping linguistic features to spectral features and F0
  o Histogram & sensitivity tools

[13] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. In Proc. Deep Learning Workshop, 2015.

Page 19: ISSUE 1: JOINT LEARNING FOR F0?

Method
• Definition of highway network

Feedforward network (parameters $W_i, b_i, W_o, b_o$), applied to each frame $x_1, x_2, \cdots, x_T$:
  $h_t = \phi(W_i x_t + b_i)$
  $\hat{o}_t = W_o h_t + b_o$

Highway network:
  $h_t = \phi(W_i x_t + b_i)$
  $g_t = \mathrm{sigmoid}(W_g x_t + b_g)$  (highway gate vector)
  $\hat{o}_t = g_t \odot h_t + (1 - g_t) \odot x_t$

Here $\phi$ is the non-linear activation function and $\odot$ is element-wise multiplication.
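The highway equations can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: frames are stored as rows (so the code computes `x @ W` rather than $W x$), the activation $\phi$ is taken to be tanh, and the function name `highway_layer` and all toy values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_i, b_i, W_g, b_g):
    """One highway layer applied to frames x of shape (T, D)."""
    h = np.tanh(x @ W_i + b_i)        # candidate transformation h_t (tanh assumed)
    g = sigmoid(x @ W_g + b_g)        # gate vector g_t, each entry in (0, 1)
    return g * h + (1.0 - g) * x, g   # element-wise mix of h_t and x_t

rng = np.random.default_rng(0)
T, D = 5, 4                           # toy sizes (hypothetical)
x = rng.normal(size=(T, D))
W_i, W_g = rng.normal(size=(D, D)), rng.normal(size=(D, D))
b_i = np.zeros(D)

# A strongly negative gate bias closes the gate (output ~ input);
# a strongly positive one opens it (output ~ non-linear transformation).
out_closed, _ = highway_layer(x, W_i, b_i, W_g, np.full(D, -50.0))
out_open, _ = highway_layer(x, W_i, b_i, W_g, np.full(D, 50.0))
```

With the gate closed the layer passes its input through unchanged, which is why the gate values can later be read as an indicator of how much each block transforms its input.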

Page 20: ISSUE 1: JOINT LEARNING FOR F0?

Method
• Highway network for acoustic modeling
  o Single-stream: linguistic features → linear → highway blocks (gates $g$) → linear → MGC, BAP, F0
  o Multi-stream: linguistic features → linear → one stack of highway blocks per stream → linear → MGC / BAP / F0
• Spectral features
  o MGC: Mel-generalized cepstral coefficients [14]
  o BAP: band aperiodicity coefficients [15, 16]

[14] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai. Mel-generalized cepstral analysis: a unified approach. In Proc. ICSLP, pages 1043–1046, 1994.
[15] H. Kawahara, J. Estill, and O. Fujimura. Aperiodicity extraction and control using mixed mode excitation and group delay manipulation for a high quality speech analysis, modification and synthesis system STRAIGHT. In Second International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, 2001.
[16] H. Zen and T. Toda. An overview of Nitech HMM-based speech synthesis system for Blizzard Challenge 2005. In Proc. Interspeech, pages 93–96, 2005.

Page 21: ISSUE 1: JOINT LEARNING FOR F0?

Method
• Analysis tools
  1. Histogram of gate vectors $\{g_1, \cdots, g_T\}$
  2. Sensitivity of $g$ to different linguistic features (Sec. 4.3.2)

  $\hat{o} = g \odot h + (1 - g) \odot x$
  o $g \approx 1$: non-linear transformation
  o $g \approx 0$: no transformation

[Figure: histograms of gate values $g$ over [0, 1], one per highway block of a single-stream network (linguistic features → linear → highway blocks → linear → MGC, BAP, F0); vertical axes are counts on the order of 10^4 to 10^5.]
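The histogram tool can be sketched as follows. This is a minimal illustration rather than the thesis implementation: the function name `gate_histogram`, the number of bins, and the toy gate values are all assumptions.

```python
import numpy as np

def gate_histogram(gates, num_bins=10):
    """Count gate activations in equal-width bins over [0, 1].

    gates: array of shape (T, D) holding g_1, ..., g_T for one highway block.
    """
    counts, edges = np.histogram(np.ravel(gates), bins=num_bins, range=(0.0, 1.0))
    return counts, edges

# Toy gate values (hypothetical): half nearly closed, half nearly open.
gates = np.array([[0.02, 0.97],
                  [0.05, 0.99],
                  [0.01, 0.93]])
counts, edges = gate_histogram(gates, num_bins=10)
```

Mass near 0 would indicate a block that mostly copies its input, and mass near 1 a block that mostly applies the non-linear transformation.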

Page 22: ISSUE 1: JOINT LEARNING FOR F0?

Experiments
• Configuration
  o Data: English, 16 hours
  o Features: MGC, BAP, F0 (interpolated F0 + voicing (U/V))
  o Metrics: root mean square error (RMSE), correlation coefficient (CORR)
• Three models:
  o Single-stream feedforward network
  o Single-stream highway network
  o Multi-stream highway network
• Two tests:
  1. Fixed layer width, varying network depth
  2. Fixed network depth, varying layer width

[Figure: a highway block with input $x$, gate $g$, and hidden layers $h^{(1)}$ and $h^{(2)}$.]
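The two objective metrics can be computed as in the sketch below. It assumes both trajectories are already aligned frame by frame; the function names and the toy F0 values are hypothetical, and any frame selection (for example, voiced frames only) is left to the caller.

```python
import numpy as np

def rmse(ref, pred):
    """Root mean square error between two equally long trajectories."""
    ref, pred = np.asarray(ref, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((ref - pred) ** 2)))

def corr(ref, pred):
    """Pearson correlation coefficient between two trajectories."""
    return float(np.corrcoef(ref, pred)[0, 1])

ref = np.array([200.0, 210.0, 220.0, 230.0])   # toy natural F0 (Hz)
pred = ref + 10.0                              # toy prediction with a constant offset
f0_rmse = rmse(ref, pred)
f0_corr = corr(ref, pred)
```

The toy case shows why both metrics are reported: a constant offset yields a non-zero RMSE while the correlation stays at its maximum.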

Page 23: ISSUE 1: JOINT LEARNING FOR F0?

Experiments
• Objective results: increasing network depth
  o Network depth: number of tanh-based transformation layers
  o Single-stream network prioritizes MGC?

[Figure: the three network structures: multi-stream highway (separate stacks of highway blocks for MGC, BAP, and F0), single-stream highway (one stack of highway blocks for MGC, BAP, and F0), and single-stream feedforward (feedforward layers for MGC, BAP, and F0).]

[Figure: three panels, MGC RMSE, F0 RMSE (Hz), and F0 correlation (0-1), plotted against network depth (2, 4, 8, 14, 20, 40) for the single-stream feedforward, single-stream highway, and multi-stream highway networks.]

Page 24: ISSUE 1: JOINT LEARNING FOR F0?

Experiments
• Objective results: increasing width (depth = 14)
  o MS1: [MGC 256] – [F0 256]
  o MS2: [MGC 382] – [F0 256]
  o MS3: [MGC 512] – [F0 382]
  o MS4: [MGC 768] – [F0 512]
• Single-stream network prioritizes MGC?

[Figure: three panels, MGC RMSE, F0 RMSE (Hz), and F0 correlation (0-1), plotted against the number of network weights (3.0e6 to 2.5e7) for the single-stream feedforward, single-stream highway, and multi-stream highway networks; points are labeled 382, 482, 582, 782, 882, 1024 for the single-stream networks and MS1 to MS4 for the multi-stream network.]

Page 25: ISSUE 1: JOINT LEARNING FOR F0?

Experiments
• Histogram of $g$
  o Multi-stream highway (depth 14)
  o $g \approx 1$: non-linear transformation; $g \approx 0$: no transformation

[Figure: multi-stream highway network (linguistic features → linear → stacks of highway blocks → linear → F0 / BAP / MGC) with histograms of $g$ over [0, 1] for blocks b.1 to b.7 on the test set; vertical scale up to 6e+05.]

Page 26: ISSUE 1: JOINT LEARNING FOR F0?

Experiments
• Histogram of $g$
  o Multi-stream highway (depth 14, 7 blocks)
  o $g \approx 1$: non-linear transformation; $g \approx 0$: no transformation
  o Different hidden features for MGC and F0

[Figure: multi-stream highway network (linguistic features → linear → stacks of highway blocks → linear → F0 / BAP / MGC) with two rows of histograms of $g$ over [0, 1] for blocks b.1 to b.7 on the test set; vertical scales up to 6e+05 and 7e+05.]

Page 27: ISSUE 1: JOINT LEARNING FOR F0?

Experiments
• Histogram of $g$
  o Single-stream highway (depth 14, 7 blocks)
  o Similar to MGC sub-net in multi-stream highway
  o Single-stream network prioritizes MGC?

[Figure: single-stream highway network (linguistic features → linear → highway blocks → linear → F0, BAP, MGC) with histograms of $g$ over [0, 1] for blocks b.1 to b.7 on the test set; vertical scale up to 8e+05.]

Page 28: ISSUE 1: JOINT LEARNING FOR F0?

Summary
• Answer to issue 1: joint (multi-task) learning is NOT for the sake of F0 modeling!
  ? Beneficial for both F0 and spectral features
  ? They share hidden features
  Negative evidence:
  o Joint learning (single-stream network) prioritizes spectral features
  o They use different hidden features
  o They use different input features (Sec. 4.4.3)
  o Results on English and Japanese corpora (Sec. 4.5)
• Only F0 modeling in the following chapters
• F0 is useful for MGC modeling? How to do? (slides appendix)

Page 29: CONTENTS

• Introduction
• Issues and methods
  o Issue 1: joint modeling of F0 and spectral features
  o Issue 2: temporal dependency modeling of F0 contours
• Summary

Page 30: ISSUE 2: TEMPORAL DEPENDENCY?

Motivation
• Baseline RNN model [17]: linguistic features → recurrent neural network (RNN) → F0 contour
  $x_{1:T} = \{x_1, \cdots, x_T\}$
  $\hat{o}_{1:T} = \{\hat{o}_1, \cdots, \hat{o}_T\}$
  o T: number of frames (time steps)

[17] R. Fernandez, et al. Prosody contour prediction with long short-term memory, bi-directional, deep recurrent neural networks. In Proc. Interspeech, pages 2268–2272, 2014.

Page 31: ISSUE 2: TEMPORAL DEPENDENCY?

Motivation
• Baseline RNN model
  $x_{1:T} = \{x_1, \cdots, x_T\}$
  $\hat{o}_{1:T} = \{\hat{o}_1, \cdots, \hat{o}_T\}$
  $\hat{o}_t = H^{(\mathrm{RNN})}_{\Theta}(x_{1:T}, t)$
  o Learn the correlation between $o_{t_1}$ and $o_{t_2}$, $t_2 \neq t_1$

[Figure: inputs $x_1, \cdots, x_5$ mapped by $H^{(\mathrm{RNN})}_{\Theta}(\cdot)$ to outputs $\hat{o}_1, \cdots, \hat{o}_5$.]

Page 32: ISSUE 2: TEMPORAL DEPENDENCY?

Motivation
• Baseline RNN model viewed as a recurrent mixture density network (RMDN)
  o Computational part: $\mathcal{M}_t = \{\mu_t\}$, where $\mu_t = H^{(\mathrm{RNN})}_{\Theta}(x_{1:T}, t)$
  o Probabilistic part:
    $p(o_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t | x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, \sigma I)$
  o Mean-based generation:
    $\hat{o}_t = \arg\max_{o_t} p(o_t | x_{1:T}; \Theta^{*}) = \mu_t$

[Figure: inputs $x_1, \cdots, x_5$ mapped by $H^{(\mathrm{RNN})}_{\Theta}(\cdot)$ to distribution parameters $\mathcal{M}_1, \cdots, \mathcal{M}_5$, one per output $o_1, \cdots, o_5$.]

[18] C. M. Bishop. Mixture Density Networks. Technical report, Aston University, 2004.
[19] C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, 1995.
[20] M. Schuster. Better generative models for sequential data problems: Bidirectional recurrent mixture density networks. In Proc. NIPS, pages 589–595, 1999.

Page 33: ISSUE 2: TEMPORAL DEPENDENCY?

Motivation
• Initial answer: temporal dependency is ignored by RNN/RMDN
  o Evidence from random sampling

  $p(o_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t | x_{1:T}; \Theta) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t, \sigma I)$

[Figure: F0 (Hz) against frame index for utterance ATR Ximera F009 AOZORAR 03372 T01, built up in three steps: natural F0; natural F0 with the RMDN mean-based output; natural F0 with the RMDN mean-based output and an RMDN sample.]
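A toy NumPy experiment can make the random-sampling argument concrete. It is a sketch under assumptions, not the experiment from the slides: the mean contour `mu`, the standard deviation `sigma`, and the utterance length are made-up values, and `sigma` is used as a per-frame standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 2000
frames = np.arange(T)
mu = 200.0 + 30.0 * np.sin(2 * np.pi * frames / 400.0)  # toy smooth mean contour (Hz)
sigma = 10.0                                             # toy standard deviation (Hz)

mean_based = mu.copy()                      # mean-based generation: o_t = mu_t
sample = mu + sigma * rng.normal(size=T)    # each frame drawn independently around mu_t

def lag1_autocorr(v):
    """Lag-1 autocorrelation of a 1-D sequence."""
    v = v - v.mean()
    return float(np.dot(v[:-1], v[1:]) / np.dot(v, v))

r_residual = lag1_autocorr(sample - mu)                      # deviation from the mean contour
jump_mean = float(np.mean(np.abs(np.diff(mean_based))))      # frame-to-frame change, mean-based
jump_sample = float(np.mean(np.abs(np.diff(sample))))        # frame-to-frame change, sampled
```

Because the factorized model draws every frame independently given the inputs, the deviations from the mean contour are uncorrelated across frames, so a random sample jumps from frame to frame far more than the smooth mean-based output.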

Page 34: ISSUE 2: TEMPORAL DEPENDENCY?

Motivation
• Initial answer: temporal dependency is ignored by RNN/RMDN
  o RNN/RMDN: $p(o_{1:T}) = \prod_{t=1}^{T} p(o_t)$
• Better models? Autoregressive (AR) [23] models
  o AR: $p(o_{1:T}) = \prod_{t=1}^{T} p(o_t | o_{1:t-1})$
  o Shallow AR models (SAR) & extended SAR
  o Deep AR models (DAR)

[23] B. Frey. Graphical Models for Machine Learning and Digital Communication. A Bradford Book, 1998.

Page 35: CONTENTS

• Introduction
• Issues and methods
  o Issue 1: joint modeling of F0 and spectral features
  o Issue 2: temporal dependency modeling of F0 contours
    § SAR and extension
    § DAR
• Summary

Page 36: ISSUE 2: TEMPORAL DEPENDENCY?

SAR
• Definition
  $p(o_{1:T} | x_{1:T}) = \prod_{t=1}^{T} p(o_t | o_{t-K:t-1}, x_{1:T})$

[Figure: inputs $x_1, \cdots, x_5$ mapped to distribution parameters $\mathcal{M}_1, \cdots, \mathcal{M}_5$, with each output $o_t$ also depending on previous outputs.]

Page 37: ISSUE 2: TEMPORAL DEPENDENCY?

SAR
• Definition
  $p(o_{1:T} | x_{1:T}; \Theta, \psi) = \prod_{t=1}^{T} p(o_t | o_{t-K:t-1}, x_{1:T}; \Theta, \psi) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t + f_{\psi}(o_{t-K:t-1}), \Sigma_t)$
  $f_{\psi}(o_{t-K:t-1}) = \sum_{k=1}^{K} a_k \odot o_{t-k} + b$
  o Trainable parameters: $\psi = \{a_1, \cdots, a_K, b\}$
  o Time invariant
  o K: hyper-parameter

[Figure: outputs $o_1, \cdots, o_5$ for K = 2, where each $o_t$ is linked to $o_{t-1}$ with weight $a_1$ and to $o_{t-2}$ with weight $a_2$.]
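The SAR likelihood can be sketched in NumPy for a one-dimensional F0 trajectory. The sketch makes assumptions that are not stated on the slide: frames before the start of the utterance are treated as zero, the variance is a scalar or per-frame value `var`, and the function names are hypothetical.

```python
import numpy as np

def sar_mean_shift(o, a, b):
    """f(o_{t-K:t-1}) = sum_k a_k * o_{t-k} + b for every frame t.

    Frames before the first one are treated as zero (an assumption).
    """
    o = np.asarray(o, float)
    T, K = len(o), len(a)
    shift = np.full(T, b, dtype=float)
    for k in range(1, K + 1):
        shift[k:] += a[k - 1] * o[:T - k]
    return shift

def sar_log_likelihood(o, mu, var, a, b):
    """Sum over frames of log N(o_t; mu_t + f(o_{t-K:t-1}), var)."""
    o = np.asarray(o, float)
    mean = mu + sar_mean_shift(o, a, b)
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var) + (o - mean) ** 2 / var)))

o = np.array([1.0, 2.0, 3.0])        # toy trajectory
a = np.array([0.5, 0.25])            # K = 2 toy AR coefficients
shift = sar_mean_shift(o, a, b=0.0)
ll = sar_log_likelihood(o, mu=o - shift, var=1.0, a=a, b=0.0)
```

Setting all coefficients in `a` and `b` to zero removes the shift, in which case the expression reduces to the frame-independent RMDN likelihood.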

Page 38: ISSUE 2: TEMPORAL DEPENDENCY?

SAR
• Interpretation 1 (Sec. 5.3)
  $p(o_{1:T} | x_{1:T}; \Theta, \psi) = \prod_{t=1}^{T} p(o_t | o_{t-K:t-1}, x_{1:T}; \Theta, \psi) = \prod_{t=1}^{T} \mathcal{N}(o_t; \mu_t + f_{\psi}(o_{t-K:t-1}), \Sigma_t)$
• Interpretation 2 (Sec. 5.3)
  $p(o_{1:T} | x_{1:T}; \Theta, \psi) = \prod_{t=1}^{T} p_c(c_t | x_{1:T}; \Theta)$, where $c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}$

[Figure: $o_1, \cdots, o_5$ pass through filters to give $c_1, \cdots, c_5$.]

Page 39: ISSUE 2: TEMPORAL DEPENDENCY?

SAR
• Interpretation (Sec. 5.3)
  o SAR = linear transformation + RMDN
  o Training: $o_{1:T}$ is transformed into $c_{1:T}$ by $c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}$, and the RMDN models $\prod_{t=1}^{T} p(c_t | x_{1:T}; \Theta)$
  o Generation: the RMDN produces $\hat{c}_{1:T}$ from $x_{1:T}$ according to $\prod_{t=1}^{T} p(c_t | x_{1:T}; \Theta)$, and $\hat{o}_{1:T}$ is recovered by $\hat{o}_t = \hat{c}_t + \sum_{k=1}^{K} a_k \odot \hat{o}_{t-k}$
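The two directions of the linear transformation can be sketched as a pair of filters. This is an illustrative sketch for a one-dimensional trajectory; the function names are hypothetical, and frames before the start of the utterance are assumed to be zero.

```python
import numpy as np

def sar_analysis(o, a):
    """Training direction: c_t = o_t - sum_k a_k * o_{t-k}."""
    o = np.asarray(o, float)
    T = len(o)
    c = o.copy()
    for k in range(1, len(a) + 1):
        c[k:] -= a[k - 1] * o[:T - k]
    return c

def sar_synthesis(c, a):
    """Generation direction: o_t = c_t + sum_k a_k * o_{t-k}, computed recursively."""
    o = np.zeros(len(c))
    for t in range(len(c)):
        past = sum(a[k - 1] * o[t - k] for k in range(1, len(a) + 1) if t - k >= 0)
        o[t] = c[t] + past
    return o

o = np.array([1.0, 2.0, 3.0, 4.0])   # toy trajectory
a = np.array([0.5, 0.25])            # K = 2 toy AR coefficients
c = sar_analysis(o, a)
o_back = sar_synthesis(c, a)
```

The analysis direction needs only observed frames and can be computed for all frames at once, while the synthesis direction is recursive because each generated frame feeds the next ones.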

Page 40: ISSUE 2: TEMPORAL DEPENDENCY?

Extended SAR (eSAR)
• Motivation
  o SAR → non-linear transformation + RMDN?
  o Yes, if $f_{\psi}(o_{1:T})$ is invertible and 'simple'
  o Training: $c_{1:T} = f_{\psi}(o_{1:T})$, and the RMDN models $\prod_{t=1}^{T} p(c_t | x_{1:T}; \Theta)$
  o Generation: the RMDN produces $\hat{c}_{1:T}$ from $x_{1:T}$, and $\hat{o}_{1:T} = f_{\psi}^{-1}(\hat{c}_{1:T})$

  $p_o(o_{1:T} | x_{1:T}; \Theta, \psi) = p_c(c_{1:T} = f_{\psi}(o_{1:T}) | x_{1:T}; \Theta) \left| \det \frac{\partial f_{\psi}(o_{1:T})}{\partial o_{1:T}} \right|$

Page 41: ISSUE 2: TEMPORAL DEPENDENCY?

Extended SAR (eSAR)
• Definition
  $p_o(o_{1:T} | x_{1:T}; \Theta, \psi) = p_c(c_{1:T} = f_{\psi}(o_{1:T}) | x_{1:T}; \Theta) \left| \det \frac{\partial f_{\psi}(o_{1:T})}{\partial o_{1:T}} \right|$
  $c_{1:T} = f_{\psi}(o_{1:T})$, $\hat{o}_{1:T} = f_{\psi}^{-1}(\hat{c}_{1:T})$
  o SAR:
    $c_t = o_t - \sum_{k=1}^{K} a_k \odot o_{t-k}$
    $\hat{o}_t = \hat{c}_t + \sum_{k=1}^{K} a_k \odot \hat{o}_{t-k}$
    $\det \frac{\partial f_{\psi}(o_{1:T})}{\partial o_{1:T}} = 1$
  o eSAR:
    $c_t = o_t - \mathrm{RNN}_{\psi}(o_{1:t-1}, t)$
    $\hat{o}_t = \hat{c}_t + \mathrm{RNN}_{\psi}(\hat{o}_{1:t-1}, t)$
    $\det \frac{\partial f_{\psi}(o_{1:T})}{\partial o_{1:T}} = 1$
  o Volume-preserving [24]

[24] J. M. Tomczak and M. Welling. Improving variational auto-encoders using convex combination linear inverse autoregressive flow. In Proc. Benelearn, pages 162–194, 2017.
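The volume-preserving property can be checked numerically on a toy transformation. In the sketch below, `causal_shift` is a hypothetical stand-in for the RNN in the eSAR definition (a tanh of the running sum of past frames); it is an assumption chosen only because it depends on past frames alone, which is the property that matters here.

```python
import numpy as np

def causal_shift(o, w=0.3):
    """Toy stand-in for RNN(o_{1:t-1}, t): frame t sees only frames before t."""
    past_sum = np.concatenate(([0.0], np.cumsum(o)[:-1]))
    return np.tanh(w * past_sum)

def f(o):
    return o - causal_shift(o)              # c_t = o_t - shift(o_{1:t-1})

def f_inverse(c):
    o = np.zeros(len(c))
    for t in range(len(c)):
        o[t] = c[t] + causal_shift(o)[t]    # entry t uses only o[:t], already generated
    return o

def numerical_jacobian(func, o, eps=1e-6):
    """Central finite-difference Jacobian of func at o."""
    T = len(o)
    J = np.zeros((T, T))
    for j in range(T):
        d = np.zeros(T)
        d[j] = eps
        J[:, j] = (func(o + d) - func(o - d)) / (2 * eps)
    return J

o = np.array([0.5, -1.0, 2.0, 0.3])         # toy trajectory
J = numerical_jacobian(f, o)
det_J = float(np.linalg.det(J))
o_back = f_inverse(f(o))
```

Since $c_t$ depends on $o_t$ with coefficient one and otherwise only on earlier frames, the Jacobian is lower triangular with a unit diagonal, so its determinant is one and the density of $o_{1:T}$ equals the density of $c_{1:T}$ without a correction term.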

Page 42: ISSUE 2: TEMPORAL DEPENDENCY?

Extended SAR (eSAR)
• Implementation
  o Transformation (normflow): $\mu_t = \mathrm{RNN}_{\psi}(o_{1:t-1}, t)$ with $\mu_1 = 0$, and $c_t = o_t - \mu_t$
  o Modeling: $p_c(c_{1:T} | x_{1:T}; \Theta)$

  $p_o(o_{1:T} | x_{1:T}; \Theta, \psi) = p_c(c_{1:T} = f_{\psi}(o_{1:T}) | x_{1:T}; \Theta)$

[Figure: $o_1, \cdots, o_5$ are shifted by $0, \mu_2, \cdots, \mu_5$ to give $c_1, \cdots, c_5$, which are modeled with distribution parameters $\mathcal{M}_1, \cdots, \mathcal{M}_5$ computed from $x_1, \cdots, x_5$.]

Page 43: ISSUE 2: TEMPORAL DEPENDENCY?

Extended SAR (eSAR)
• Implementation
  o Generation: $\hat{c}_{1:T} \sim p_c(c_{1:T} | x_{1:T}; \Theta)$
  o De-transformation: $\hat{\mu}_t = \mathrm{RNN}_{\psi}(\hat{o}_{1:t-1}, t)$, and $\hat{o}_t = \hat{c}_t + \hat{\mu}_t$

  $p_o(o_{1:T} | x_{1:T}; \Theta, \psi) = p_c(c_{1:T} = f_{\psi}(o_{1:T}) | x_{1:T}; \Theta)$

[Figure: $x_1, \cdots, x_5$ give distribution parameters $\mathcal{M}_1, \cdots, \mathcal{M}_5$, from which $\hat{c}_1, \cdots, \hat{c}_5$ are generated and shifted by $\hat{\mu}_2, \cdots, \hat{\mu}_5$ to give $\hat{o}_1, \cdots, \hat{o}_5$.]

Page 44: ISSUE 2: TEMPORAL DEPENDENCY?

Summary of SAR
• Theoretically appealing
  o Signal processing view, SAR → SAR + stable filters: real-valued poles; log area ratio (Sec. 5.3.2); segment-variant filters (slides)
  o Machine learning view, SAR → eSAR: volume-preserving; non-volume-preserving (slides); …
• Performance for F0 modeling (model details later)
  o [Figure: bar on a 0% to 100% scale: eSAR 55.70%, SAR 44.30%]
• Performance for MGC modeling
  o Better than RNN/RMDN [25, 26]

[25] X. Wang, J. Lorenzo-Trueba, S. Takaki, L. Juvela, and J. Yamagishi. A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis. In Proc. ICASSP, pages 4804–4808, 2018.
[26] X. Wang, S. Takaki, and J. Yamagishi. An autoregressive recurrent mixture density network for parametric speech synthesis. In Proc. ICASSP, pages 4895–4899, 2017.

Page 45: CONTENTS

• Introduction
• Issues and methods
  o Issue 1: joint modeling of F0 and spectral features
  o Issue 2: temporal dependency modeling of F0 contours
    § SAR and extension
    § DAR
• Summary

Page 46: ISSUE 2: TEMPORAL DEPENDENCY?

DAR
• Motivation
  o Are SAR and eSAR sufficiently good?

[Figure: two panels of F0 (Hz) against frame index for utterance ATR Ximera F009 AOZORAR 03372 T01: natural F0 with a random sample from SAR, and natural F0 with a random sample from eSAR.]

Page 47: ISSUE 2: TEMPORAL DEPENDENCY?

DAR
• Motivation
  o Random sampling on SAR and eSAR?
  o SAR: linear (Sec. 6.1)
  o eSAR: non-linear but a special form
• Non-linear and non-invertible AR transformation?

[Figure: SAR and eSAR first compute $c_{1:T} = f_{\psi}(o_{1:T})$ and then model $\prod_{t=1}^{T} p(c_t | x_{1:T}; \Theta)$ given $x_{1:T}$; the alternative is a network with AR dependency that takes $o_{1:T}$ and $x_{1:T}$ directly.]

Page 48: ISSUE 2: TEMPORAL DEPENDENCY?

DAR
• Definition

[Figure: inputs $x_1, \cdots, x_5$ mapped to distribution parameters $\mathcal{M}_1, \cdots, \mathcal{M}_5$ for outputs $o_1, \cdots, o_5$.]

Page 49: ISSUE 2: TEMPORAL DEPENDENCY?

DAR
• Definition
  $p(o_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} p(o_t | o_{1:t-1}, x_{1:T}; \Theta)$
  o DAR is more general than SAR (Sec. 6.2.2 & slides, toy examples)
    1. Longer-time dependency
    2. Non-linear dependency

[Figure: inputs $x_1, \cdots, x_5$ mapped to distribution parameters $\mathcal{M}_1, \cdots, \mathcal{M}_5$, with previous outputs $o_1, \cdots, o_5$ fed back into the network.]

Page 50: ISSUE 2: TEMPORAL DEPENDENCY?

DAR
• Implementation (Sec. 6.3)
  $P(o_{1:T} | x_{1:T}; \Theta) = \prod_{t=1}^{T} P(o_t | o_{1:t-1}, x_{1:T}; \Theta)$
  o Quantized F0
  o Hierarchical softmax
  o Data dropout

[Figure: inputs $x_1, \cdots, x_5$ mapped to distribution parameters $\mathcal{M}_1, \cdots, \mathcal{M}_5$, with previous outputs $o_1, \cdots, o_5$ fed back into the network.]

Page 51: ISSUE 2: TEMPORAL DEPENDENCY?

Experiment
• Data and features
  o Data: Japanese, 48 hours
  o Feature: F0 (interpolated F0 value, 1 dim + U/V, 1 dim)
• Models (FF: feedforward with tanh activation function)
  o Shared layers on top of the linguistic features: FF, FF, bi-LSTM, bi-LSTM, linear; layer sizes 512, 512, 256, 128
  o RNN: linear output → F0
  o RMDN: GMM (2 mixtures) → F0
  o SAR: GMM (2 mixtures) → F0
  o eSAR: GMM (2 mixtures) → normflow layers (uni-LSTM 64, linear 2) → F0

Page 52: ISSUE 2: TEMPORAL DEPENDENCY?

Experiment
• Data and features
  o Data: Japanese, 48 hours
  o Feature: F0 (interpolated F0 value, 1 dim + U/V, 1 dim)
  o Feature: quantized F0 (256 quantization bins)
• Models (FF: feedforward with tanh activation function)
  o DAR: linguistic features → FF, FF, bi-LSTM, uni-LSTM, linear → hierarchical softmax → quantized F0
  o WaveNet-F0: linguistic features → FF, FF, bi-LSTM, bi-LSTM, linear → WaveNet → hierarchical softmax → quantized F0
  o Layer sizes listed in the figure: 512, 512, 256, 128, 256

Page 53

ISSUE 2: TEMPORAL DEPENDENCY?

Experiment
❑ Random sampling

[Figure: F0 (Hz) versus frame index for utterance ATR Ximera F009 AOZORAR 03372 T01. Panels: natural F0 vs. random sample from SAR; natural F0 vs. random sample from eSAR; natural F0 vs. DAR sample 1, DAR sample 2, and DAR sample 3.]

Page 54

ISSUE 2: TEMPORAL DEPENDENCY?

Experiment
❑ Mean-based generation

[Figure: F0 (Hz) versus frame index for utterance ATR Ximera F009 AOZORAR 03372 T01. Six panels compare natural F0 with F0 generated by RNN, RMDN, SAR, eSAR, WaveNet-F0, and DAR.]

Page 55

ISSUE 2: TEMPORAL DEPENDENCY?

Experiment
❑ Mean-based generation
• 500 test utterances, >1000 sets of scores

[Figure: two bar charts of mean opinion scores (MOS), labelled "MOS test 1" and "MOS test 2" on the slide; pairs with p-value > 0.05 are marked]

Test with Natural, DAR, SAR, RMDN, RNN

             Natural   DAR      SAR      RMDN     RNN
MOS          4.243     3.856    3.564    3.628    3.565

Pairwise p-values:
             Natural   DAR      SAR      RMDN     RNN
Natural      -         <1e-10   <1e-10   <1e-10   <1e-10
DAR          <1e-10    -        <1e-10   <1e-10   <1e-10
SAR          <1e-10    <1e-10   -        0.01785  0.7429
RMDN         <1e-10    <1e-10   0.01785  -        0.00426
RNN          <1e-10    <1e-10   0.7429   0.00426  -

Test with Natural, DAR, SAR, eSAR, WaveNet-F0

             Natural   DAR        SAR        eSAR       WaveNet-F0
MOS          4.199     3.971      3.732      3.813      3.803

Pairwise p-values:
             Natural   DAR        SAR        eSAR       WaveNet-F0
Natural      -         <1e-10     <1e-10     <1e-10     <1e-10
DAR          <1e-10    -          <1e-10     9.062e-08  0.000102
SAR          <1e-10    <1e-10     -          0.007186   0.000176
eSAR         <1e-10    9.062e-08  0.007186   -          0.219733
WaveNet-F0   <1e-10    0.000102   0.000176   0.219733   -
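The "p-value > 0.05" markers can be reproduced mechanically from the p-value matrix. The sketch below uses the p-values of the test that includes RMDN and RNN; entries reported as "<1e-10" are stored as 1e-10, which is an assumption that only affects bookkeeping. The slide does not name the statistical test, so the code only thresholds the reported values.

```python
ALPHA = 0.05

# Upper triangle of the pairwise p-value matrix (Natural, DAR, SAR, RMDN, RNN).
# "<1e-10" entries are stored as 1e-10 (assumption for bookkeeping only).
P_VALUES = {
    ("Natural", "DAR"): 1e-10,
    ("Natural", "SAR"): 1e-10,
    ("Natural", "RMDN"): 1e-10,
    ("Natural", "RNN"): 1e-10,
    ("DAR", "SAR"): 1e-10,
    ("DAR", "RMDN"): 1e-10,
    ("DAR", "RNN"): 1e-10,
    ("SAR", "RMDN"): 0.01785,
    ("SAR", "RNN"): 0.7429,
    ("RMDN", "RNN"): 0.00426,
}

def non_significant_pairs(p_values, alpha=ALPHA):
    """Return the system pairs whose difference is not significant at level alpha."""
    return sorted(pair for pair, p in p_values.items() if p > alpha)

print(non_significant_pairs(P_VALUES))
```

Under this threshold only SAR and RNN are statistically indistinguishable in that test, while DAR differs significantly from every other system.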

Page 56

ISSUE 2: TEMPORAL DEPENDENCY?

Summary
❑ Full answer to issue 2

Temporal dependency is ignored by RNN/RMDN! But better models can be defined.

                RMDN   SAR         eSAR                       DAR
AR linear?      -      Linear      Non-linear (constrained)   Non-linear
AR time span    -      t-K : t-1   1 : t-1                    1 : t-1
Tractable?      -      Yes         Somewhat                   No
Sampling?       No     No          No                         Yes

[Figure: graphical models of the four models over two time steps, with nodes x1, x2, h1, h2, µ1, µ2, o1, o2; one diagram additionally has nodes c1, c2]
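The "AR time span" row can be read as a statement about which past outputs each conditional distribution sees. The block below is a sketch that follows the notation of the DAR definition; the model parameters are omitted, and the linear versus non-linear form of the dependency is not expressed.

```latex
\begin{align*}
\text{RMDN:}      \quad & p(o_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} p(o_t \mid x_{1:T}) \\
\text{SAR:}       \quad & p(o_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} p(o_t \mid o_{t-K:t-1}, x_{1:T}) \\
\text{eSAR, DAR:} \quad & p(o_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} p(o_t \mid o_{1:t-1}, x_{1:T})
\end{align*}
```

The RMDN line has no past outputs in the condition, which is what "temporal dependency is ignored" refers to.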

Page 57

CONTENTS

❑ Introduction
❑ Issue 1: joint modeling of F0 and spectral features
❑ Issue 2: temporal dependency modeling of F0 contours
❑ Issue 3: frame-by-frame processing
❑ Summary

Page 58

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Motivation
❑ Inefficient processing

[Figure: frame-by-frame mapping from inputs x1, x2, x3, ..., xT to outputs o1, o2, o3, ..., oT across Phone 1, Phone 2, and Phone 3. The linguistic input is shown on four tiers (phrase, mora, phone, frame), with the same label repeated for every frame: phrase tier 1 1 1 1 1 1 1 1 1 1 1; mora tier ツ ツ ツ ツ ツ ツ ツ ツ ギ ギ ギ; phone tier ts ts ts u u u u u g g g.]

Page 59

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Motivation
❑ Inefficient processing

[Figure: frame-by-frame mapping from inputs x1, x2, x3, ..., xT to outputs o1, o2, o3, ..., oT across Phone 1, Phone 2, and Phone 3, with the phrase, mora, phone, and frame tiers repeated for every frame (phrase tier 1 1 1 ...; mora tier ツ ... ギ ...; phone tier ts ts ts u u u u u g g g)]

Page 60

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Motivation
❑ More efficient processing?
• Given duration of each unit

[Figure: one input per phone (x1p, x2p, x3p, carrying the labels ツ / ts / 1, ツ / u / 1, and ギ / g / 1) and the frame-level outputs o1, o2, o3, ..., oT of Phone 1, Phone 2, and Phone 3. How to do it?]

Page 61

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Method
❑ Two-stage F0 modeling
• Stage 1: F0 contour modeling
• Stage 2: Linguistic linking

[Figure: phone-level inputs x1p, x2p, x3p and frame-level outputs o1, o2, o3, ..., oT of Phone 1, Phone 2, and Phone 3]

Page 62

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Method
❑ Two-stage F0 modeling
• Revisit linguistic approaches [28, 29]

[Figure: phone-level inputs x1p, x2p, x3p and frame-level outputs o1, o2, o3, ..., oT of Phone 1, Phone 2, and Phone 3, with the labels: vector-quantization variational auto-encoder (VQ-VAE [27]), compact code space, sequential classifier]

[27] A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning. In Proc. NIPS, page to appear, 2017.
[28] K. E. Dusterhoff, A. W. Black, and P. A. Taylor. Using decision trees within the Tilt intonation model to predict F0 contours. Proc. Eurospeech, pages 1627–1630, 1999.
[29] K. Hirose, K. Sato, Y. Asano, and N. Minematsu. Synthesis of F0 contours using generation process model parameters predicted from unlabeled corpora: Application to emotional speech synthesis. Speech Communication, 46(3):385–404, 2005.

Page 63

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Stage 1: F0 contour modeling
❑ VQ-VAE

Goals
• Unsupervised learning
• One code for varied-length unit
• Multiple linguistic levels

[Figure: frame-level F0 o1, o2, o3, ..., oT of Phone 1, Phone 2, and Phone 3, and a compact code space]

Page 64

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Stage 1: F0 contour modeling
❑ VQ-VAE
• One code per phone

[Figure: the frame-level F0 o1, o2, o3, ..., oT of Phone 1, Phone 2, and Phone 3 passes through the encoder; per-phone nodes z1p, z2p, z3p and e1p, e2p, e3p are connected through the codebook; the decoder outputs o1, o2, o3, ..., oT.]
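In the general VQ-VAE recipe of [27], the quantization step replaces each encoder output by its nearest codebook vector. The sketch below assumes that reading for this slide (z as per-phone encoder outputs, e as the selected codebook vectors); the two-dimensional codebook and the latent values are hypothetical toy numbers, and squared Euclidean distance is used as in the standard formulation.

```python
def quantize(z, codebook):
    """Return (index, vector) of the codebook entry nearest to z."""
    def sq_dist(k):
        return sum((a - b) ** 2 for a, b in zip(z, codebook[k]))
    best = min(range(len(codebook)), key=sq_dist)
    return best, codebook[best]

# Hypothetical 2-dimensional codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

# Hypothetical encoder outputs, one per phone (z1p, z2p, z3p).
phone_latents = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]

code_indices = [quantize(z, codebook)[0] for z in phone_latents]
print(code_indices)
```

Each phone is then represented by a single integer index, regardless of how many frames it spans.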

Page 65

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Stage 1: F0 contour modeling
❑ VQ-VAE with multiple linguistic tiers
• One code per phone & one code per mora

[Figure: frame-level F0 o1, o2, o3, ..., oT spanning Phone 1, Phone 2, Phone 3 and Mora 1, Mora 2. Blocks: phone-level encoder and phone-level codebook (nodes z1p, z2p, z3p and e1p, e2p, e3p), mora-level encoder and mora-level codebook (nodes z1m, z2m and e1m, e2m), and a decoder that outputs o1, o2, o3, ..., oT.]

Page 66

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Stage 1: F0 contour modeling
❑ VQ-VAE with multiple linguistic tiers

[Figure: the VQ-VAE encoder(s) take the frame-level F0 o1, o2, o3, ..., oT (Phone 1 to Phone 3, Mora 1 and Mora 2) and give phone code indices {l1p, l2p, l3p, ..., lNp} and mora code indices {l1m, l2m, ..., lNm}. The VQ-VAE decoder uses the code indices with the codebooks + duration and outputs o1, o2, o3, ..., oT.]
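The "Codebooks + Duration" label suggests that per-unit codes are expanded to frame rate before decoding. A minimal sketch of that expansion follows, assuming the simplest scheme in which each unit's code vector is repeated for every frame of the unit; the slide does not say how the decoder actually consumes the codes, and the codebook values and durations here are hypothetical.

```python
def upsample_codes(code_indices, durations, codebook):
    """Repeat each unit's code vector once per frame of that unit."""
    if len(code_indices) != len(durations):
        raise ValueError("need one duration per code index")
    frames = []
    for idx, dur in zip(code_indices, durations):
        frames.extend([codebook[idx]] * dur)
    return frames

# Hypothetical codebook and phone-level code indices.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
phone_codes = [1, 2, 0]

# Durations in frames; 3, 5, 3 mirrors the "ts ts ts u u u u u g g g" example.
durations = [3, 5, 3]

frame_codes = upsample_codes(phone_codes, durations, codebook)
print(len(frame_codes))
```

With multiple tiers, the same expansion could be applied to the mora-level codes with mora durations, and the two frame-level sequences combined as decoder input.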

Page 67

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Stage 2: Linguistic linking

[Figure: the linguistic linker takes linguistic features and gives phone code indices {l1p, l2p, l3p, ..., lNp} and mora code indices {l1m, l2m, ..., lNm}. The VQ-VAE decoder uses the code indices with the codebooks + duration and outputs the frame-level F0 o1, o2, o3, ..., oT (Phone 1 to Phone 3, Mora 1 and Mora 2).]

Page 68

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Stage 2: Linguistic linking

RNN-based sequential classifier
• Clockwork RNN [30]
• Highway [31]
• AR & feedback links
• Dropout [32] (Sec. 7.2.2)

[Figure: the classifier takes linguistic features and gives phone code indices {l1p, l2p, l3p, ..., lNp} and mora code indices {l1m, l2m, ..., lNm}]

[30] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber. A Clockwork RNN. In Proc. ICML, pages 1863–1871, 2014.
[31] K. Greff, R. K. Srivastava, and J. Schmidhuber. Highway and residual networks learn unrolled iterative estimation. In Proc. ICLR, 2017.
[32] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Page 69

ISSUE 3: FRAME-BY-FRAME PROCESSING?

F0 modeling for TTS

[Figure: training and test pipelines.
Training set, Stage 1 (VQ-VAE): F0 -> encoders & codebooks -> decoder -> F0.
Training set, Stage 2 (linker): linguistic features -> linker -> code indices.
Test set: linguistic features -> linker -> codebooks -> decoder -> F0.]

Page 70

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Experiments on stage 1 (Sec. 7.3.2)
Experiments on whole model
❑ Models
• Given natural duration

          Linker                              VQ-VAE decoder
Model 1   Frame-by-frame linker               phone-level
Model 2   Phone-by-phone linker               phone-level
Model 3   Phone-by-phone linker + Mora lock   phone-mora-level

[Figure: diagrams of Model 1, Model 2, and Model 3, each producing the frame-level F0 o1, o2, o3, ..., oT]

Page 71

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Experiments
❑ Objective results

          RMSE   CORR    U/V     Time cost (s/epoch)
Model 1   34.3   0.839   7.96%   1300
Model 2   27.1   0.906   6.36%   54
Model 3   25.5   0.916   4.87%   65

[Figure: diagrams of Model 1, Model 2, and Model 3, each producing the frame-level F0 o1, o2, o3, ..., oT]
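For readers unfamiliar with the three objective columns, the sketch below shows one common way to compute them. The conventions are assumptions, since the slide does not define them: unvoiced frames are encoded as 0, RMSE and correlation are computed only over frames that are voiced in both the reference and the generated contour, and the U/V value is the fraction of frames whose voicing decision differs.

```python
import math

def f0_metrics(ref_f0, gen_f0):
    """Return (RMSE in Hz, Pearson correlation, U/V error rate).

    Assumes equal-length sequences with 0 marking unvoiced frames and
    at least two frames that are voiced in both sequences.
    """
    uv_errors = sum((r > 0) != (g > 0) for r, g in zip(ref_f0, gen_f0))
    uv_rate = uv_errors / len(ref_f0)

    # Keep only frames voiced in both contours.
    pairs = [(r, g) for r, g in zip(ref_f0, gen_f0) if r > 0 and g > 0]
    n = len(pairs)
    rmse = math.sqrt(sum((r - g) ** 2 for r, g in pairs) / n)

    mean_r = sum(r for r, _ in pairs) / n
    mean_g = sum(g for _, g in pairs) / n
    cov = sum((r - mean_r) * (g - mean_g) for r, g in pairs)
    var_r = sum((r - mean_r) ** 2 for r, _ in pairs)
    var_g = sum((g - mean_g) ** 2 for _, g in pairs)
    corr = cov / math.sqrt(var_r * var_g)
    return rmse, corr, uv_rate

# Hypothetical five-frame example.
rmse, corr, uv_rate = f0_metrics([0, 100, 110, 120, 0], [0, 102, 108, 122, 130])
print(rmse, corr, uv_rate)
```

On the time-cost column, the reported values imply that the phone-by-phone linker of Model 2 needs roughly 1300 / 54 ≈ 24 times less time per epoch than the frame-by-frame linker of Model 1.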

Page 72

ISSUE 3: FRAME-BY-FRAME PROCESSING?

VAE vs DAR
❑ Objective results
• VQ-VAE encoder needs time and memory

               RMSE   CORR    U/V
DAR            28.3   0.903   3.46%
VAE model 3    25.5   0.916   4.87%

Further columns: number of parameters (m) and time cost (s/epoch), each with Stage 1, Stage 2, and Sum.
Values shown (cell positions unclear): ~1300, 65, ~700, 1500, 0.36, 0.44, 1.11, 0.67, 2000, 1565, 1.48, 1.11

[Figure: diagrams of DAR and VAE (model 3) with Stage 1 and Stage 2 marked, per-phone nodes z1p, z2p, z3p, and the frame-level F0 o1, o2, o3, ..., oT]

Page 73

ISSUE 3: FRAME-BY-FRAME PROCESSING?

VAE vs DAR
❑ Subjective test
• 500 test utterances, >1000 sets of scores

[Figure: bar chart of mean opinion scores (MOS test); pairs with p-value > 0.05 are marked]

                Natural   DAR        SAR        eSAR       WaveNet-F0   VAE (model 3)
MOS             4.199     3.971      3.732      3.813      3.803        3.993

Pairwise p-values:
                Natural   DAR        SAR        eSAR       WaveNet-F0   VAE (model 3)
Natural         -         <1e-10     <1e-10     <1e-10     <1e-10       <1e-10
DAR             <1e-10    -          <1e-10     9.062e-08  0.000102     0.392125
SAR             <1e-10    <1e-10     -          0.007186   0.000176     <1e-10
eSAR            <1e-10    9.062e-08  0.007186   -          0.219733     5.946e-10
WaveNet-F0      <1e-10    0.000102   0.000176   0.219733   -            2.635e-06
VAE (model 3)   <1e-10    0.392125   <1e-10     5.946e-10  2.635e-06    -

Page 74

ISSUE 3: FRAME-BY-FRAME PROCESSING?

Summary
❑ Answer to issue 3

It could be more efficient

❑ How

[Figure: linguistic features -> linker (unit-by-unit) -> codebooks and VQ-VAE decoder (frame-by-frame) -> F0]

❑ Results
• Multiple linguistic tiers
• More efficient than DAR: smaller + faster + F0 CORR > 0.91
• Interpretable latent code spaces (Sec. 7.3.2)
• Random sampling OK (slides)

Page 75

CONTENTS

❑ Introduction
❑ Issue 1: joint modeling of F0 and spectral features
❑ Issue 2: temporal dependency modeling of F0 contours
❑ Issue 3: frame-by-frame processing
❑ Conclusion

Page 76

CONCLUSION

Summary

[Progress bar: Preliminary (Chapter 1-3) | Issue 1: Joint modeling for F0? (Chapter 4)]

? Joint modeling of F0 and spectral features?
✕ Sub-optimal for F0 modeling
• Methods:
  o Highway networks
  o Histogram + sensitivity analysis
• Results:
  o Spectral features are prioritized
  o Different input/hidden features for F0 and spectral features

[Figure: inputs x1, x2, x3, ..., xT and outputs ô1, ô2, ô3, ..., ôT and ŝ1, ŝ2, ŝ3, ..., ŝT]

Page 77

CONCLUSION

Summary

[Progress bar: Preliminary (Chapter 1-3) | Issue 1: Joint modeling for F0? (Chapter 4) | Issue 2: Temporal dependency? (Chapter 5-6)]

? Temporal dependency in RNN/RMDN?
✕ Ignored by RNN/RMDN
• Models:
  o SAR: tractable dependency
  o DAR: non-linear + longer dependency
• Results:
  o DAR: F0 CORR > 0.90, MOS score ⬆
  o DAR supports random sampling!

[Figure: DAR, with inputs x1, x2, x3, ..., xT and outputs ô1, ô2, ô3, ..., ôT]

Page 78

CONCLUSION

Summary

[Progress bar: Preliminary (Chapter 1-3) | Issue 1: Joint modeling for F0? (Chapter 4) | Issue 2: Temporal dependency? (Chapter 5-6) | Issue 3: Frame-by-frame? (Chapter 7) | Summary (Chapter 8)]

? Frame-by-frame processing
✕ Inefficient
• Two-stage F0 model:
  o F0 contour coding: VQ-VAE + DAR
  o Linguistic linking: unit-by-unit classifier
• Results:
  o F0 CORR > 0.91 ⬆
  o More efficient (smaller, faster)

[Figure: frame-level inputs x1, x2, x3, ..., xT, phone-level inputs x1p, x2p for Phone 1 and Phone 2, and outputs ô1, ô2, ô3, ..., ôT]

Page 79

CONCLUSION

New topics?
❑ Towards complete prosody modeling
• Not only F0 but also duration
• Easy for joint modeling

[Figure: linker with linguistic features, latent code per phone, and duration of phone]

❑ Waveform modeling
• F0 and waveform are 1-dimensional signals
• Signal processing methods available
  SAR + log area ratio + segment-variant filters

Page 80

Thank you for your attention

Q&A

Codes, scripts, slides: tonywangx.github.io

Page 81

THESIS OUTLINE

❑ Neural F0 modeling
• Chapter 1-3: Introduction, TTS, neural networks
• Chapter 4: Investigation using highway network
• Chapter 5: Shallow autoregressive F0 model (SAR)
• Extended SAR
• Chapter 6: Deep autoregressive F0 model (DAR)
• Chapter 7: Variational auto-encoder (VAE) + DAR
• Chapter 8: Summary and application of F0 model

Publication labels shown next to the chapters:
• ICASSP 2018
• Speech synthesis workshop 2016
• Speech Communication, Vol. 96, Feb. 2018, pp. 1-9
• ICASSP 2017
• Interspeech 2017
• IEEE Trans. ASLP, accepted, 2018
• Prepare a journal paper

✓ Meet basic requirement on publication: 3 journals + 6 conferences

Page 82

WORK DURING PH.D.

❑ Work not included in Ph.D. thesis
• Word embedding as input features (IEICE 2018, Interspeech 2016)
• Impact of training data size (SSW 2016)

❑ Collaborated work
• DAR using manually-annotated/corrupted data (Interspeech 2018)
  Provided DAR, SAR, and WaveNet-vocoder
• Voice cloning using found data (Odyssey 2018)
  Provided WaveNet-vocoder and acoustic models
• Cyborg speech: multilingual speech synthesis (ICASSP 2018)
  Provided acoustic models
• Speech synthesis from MFCC (ICASSP 2018)
  Provided DAR on MFCC
• Controllable speech synthesis (Interspeech 2017)
  Provided acoustic models