Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf ·...
-
Upload
hoangnguyet -
Category
Documents
-
view
214 -
download
0
Transcript of Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf ·...
![Page 1: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/1.jpg)
ComparingRecentWaveformGenerationandAcousticModelingMethodsfor
Neural-network-basedSpeechSynthesis
XinWANG,JaimeLorenzo-Trueba,ShinjiTAKAKI,LauriJuvela,JunichiYAMAGISHI
NationalInstituteofInformatics,Japan&AaltoUniversity,Finland2018-04-17
1contact:[email protected],suggestions,anddiscussion
ICASSP2018Calgary,Canada
![Page 2: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/2.jpg)
qMotivation• Bettermodulesforthestatisticalparametricspeech
synthesis(SPSS)framework?
qMethod• Plugandtestnewacousticmodelsandwaveform
generators
qResults• Bestcombination
• Quality:asgoodasvocodedspeech(at16kHz)
2
OVERVIEW
Autoregressive(AR)acousticmodelsWaveNet-basedvocoder
![Page 3: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/3.jpg)
q Introduction
qModelsandmodules
q Experiments
q Summary
CONTENTS
3
![Page 4: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/4.jpg)
F0
Backgroundq ConventionalTTSpipeline[1]
q SPSSback-end[2,3]
• MGC:Mel-generalizedcepstralcoefficients[4]• BAP:band-aperiodicity
4
INTRODUCTION
SpectralfeaturesAcousticmodels
Waveformgenerators
Text Front-end Back-endLinguisticfeatures Speech
SpeechLinguisticfeatures
[1]T.Dutoit.AnIntroductiontoText-to-speechSynthesis.KluwerAcademicPublishers,Norwell,MA,USA,1997.[2]Tokuda,K.,etal.,(2013).SpeechSynthesisBasedonHiddenMarkovModels.ProceedingsoftheIEEE,101(5),1234–1252.[3]Zen,H.,etal.(2009).Statisticalparametricspeechsynthesis.SpeechCommunication,51,1039–1064.[4]Tokuda,K.,Kobayashi,T.,Masuko,T.,andImai,S.(1994).Mel-generalizedcepstralanalysisaunifiedapproach.InProc.ICSLP,pages1043–1046.
MGC&BAP
![Page 5: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/5.jpg)
Topicofthisworkq BettermodulesforSPSSback-end?
• MGC:Mel-generalizedcepstralcoefficients• BAP:band-aperiodicity
5
INTRODUCTION
F0Acousticmodels
Waveformgenerators
SpeechLinguisticfeatures
Recurrentneuralnetworks(RNNs)
Autoregressive(AR)models
Generaladversarialnetwork(GAN)
…
WORLDvocoder
+Phraserecovery
WaveNet-basedvocoder
…
MGC&BAP
![Page 6: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/6.jpg)
q Introduction
qModelsandmodules
q Experiments
q Summary
CONTENTS
6
![Page 7: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/7.jpg)
7
MODELS &METHODS
Waveformgenerators
Acousticmodels
Linguisticfeatures
MGC&BAP&F0
Speechwaveforms
RNN
WORLD
![Page 8: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/8.jpg)
Acousticmodelsq BaselineRNN
• Sequenceoflinguisticfeatures• Sequenceofgeneratedacousticfeatures
8
MODELS &METHODS
x1 x2 x3 x4 x5
bo1 bo2 bo3 bo4 bo5
{x1, · · · ,xt, · · · }
{bo1, · · · , bot, · · · }
![Page 9: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/9.jpg)
Acousticmodelsq BaselineRNN
9
MODELS &METHODS
x1 x2 x3 x4 x5
M3 M4M2M1 M5
o1 o2 o3 o4 o5
Mt = {µt}, where µt = H(RNN)⇥ (x1:T , t)
H(RNN)⇥ (·)
bot = µt
Neuralnetwork
Probabilisticmodels
p(o1:T |x1:T ;⇥) =TY
t=1
p(ot|x1:T ;⇥) =TY
t=1
N (ot;µt, I)
[5]C.M.Bishop.Neuralnetworksforpatternrecognition.Oxforduniversitypress,1995.
![Page 10: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/10.jpg)
10
MODELS &METHODS
x1 x2 x3 x4 x5
M3 M4M2M1 M5
o1 o2 o3 o4 o5
p(o1:T |x1:T ;⇥) =TY
t=1
p(ot|x1:T ;⇥) =TY
t=1
N (ot;µt, I)
ARmodels
GAN
Acousticmodelsq BaselineRNN
• Limitations1. Conditionalindependence2. Maximum-likelihoodtraining
![Page 11: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/11.jpg)
Acousticmodelsq ShallowAR(SAR)
• Alternativeinterpretation:trainablefilter+RNN
11
MODELS &METHODS
x1 x2 x3 x4 x5
M3 M4M2M1 M5
o1 o2 o3 o4 o5 K=2
p(o1:T |x1:T ;⇥, ) =TY
t=1
p(ot|ot�K:t�1,x1:T ;⇥) =TY
t=1
N (ot;µt + f (ot�K:t�1), I)
[6]X.Wang,S.Takaki,andJ.Yamagishi.Anautoregressiverecurrentmixturedensitynetworkforparametricspeechsynthesis.InProc.ICASSP,pages4895–4899,2017.
![Page 12: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/12.jpg)
Acousticmodelsq DeepAR(DAR)
• Onlyfor(quantized)F0modeling
12
MODELS &METHODS
x1 x2 x3 x4 x5
M3 M4M2M1 M5
o1 o2 o3 o4 o5
p(o1:T |x1:T ;⇥) =TY
t=1
p(ot|o1:t�1,x1:T ;⇥)
[7]X.Wang,S.Takaki,andJ.Yamagishi.AutoregressiveneuralF0modelforstatisticalparametricspeechsynthesis.IEEETransactionsonAudio,SpeechandLanguageProcessing.(Accepted)
![Page 13: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/13.jpg)
Acousticmodelsq GAN
• GAN-basedpost-filter[8]
13
MODELS &METHODS
Acousticmodel
Linguisticfeatures
ResidualgeneratorNoise Discriminator
Acousticfeatures(generated)
+
Acousticfeatures(natural)
0/1
[8]T.Kaneko,H.Kameoka,N.Hojo,Y.Ijima,K.Hiramatsu,andK.Kashino.Generativeadversarialnetwork-basedpostfilterforstatisticalparametricspeechsynthesis.InProc.ICASSP,pages4910–4914,2017.
![Page 14: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/14.jpg)
14
MODELS &METHODSAcousticmodels
• BAPisnotshown
Linguisticfeatures
SAR
WORLD
RNNDAR
F0
Waveformgenerators
Acousticmodels
MGC
GAN
Speechwaveforms
![Page 15: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/15.jpg)
15
MODELS &METHODSAcousticmodels
• BAPisnotshown
Linguisticfeatures
SAR RNNDAR
Waveformgenerators
Acousticmodels
GAN
SAR-Wo SGA-Wo RGA-Wo RNN-Wo
WORLD
F0 MGC
![Page 16: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/16.jpg)
Waveformgeneratorsq Deterministicapproaches
• WORLD[9]
o Binaryvoicingdecisiono Minimumphase
• Alogdomainpulsemodel(PML)[10]
o Source-filtermodel,additiveinlog-domaino Binarynoisymask
• WORLD+phraserecovery
16
MODELS &METHODS
[9]M.Morise,etal.WORLD:Avocoder-basedhigh-qualityspeechsynthesissystemforreal-timeapplications.IEICETrans.onInformationandSystems, 99(7):1877–1884,2016.[10]G.Degottex,etal.Alogdomainpulsemodelforparametricspeechsynthesis.IEEE/ACMTransactionsonAudio,Speech,andLanguageProcessing,2017.[11]D.GriffinandJ.Lim.Signalestimationfrommodifiedshort-timeFouriertransform.IEEETrans.ASSP,32(2):236–243,1984.
Phraserecovery[11]
Generatedwaveform
STFTamplitude
InverseSTFT
“Phrase-recovered”waveform
![Page 17: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/17.jpg)
WaveformgeneratorsqWaveNet-basedvocoder[12,13]
• Howtogenerate(search)awaveform:1. Exploration: sampling2. Exploitation: pickingone-best
17
MODELS &METHODS
[12]A.vandenOord,S.Dieleman,H.Zen,K.Simonyan,O.Vinyals,A.Graves,N.Kalchbrenner,A.Senior,andK.Kavukcuoglu.WaveNet:Agenerativemodelforrawaudio.arXiv preprintarXiv:1609.03499,2016.
[13]A.Tamamori,T.Hayashi,K.Kobayashi,K.Takeda,andT.Toda.Speaker-dependentWaveNetvocoder.InProc.Interspeech,pages1118–1122,2017.
![Page 18: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/18.jpg)
Sampling point
9300.09305.0
9310.09315.0
9320.0Waveform level
0200
400600
8001000
12000.0
0.1
0.2
0.3
0.4
0.5
Sampling point
8900.08905.0
8910.08915.0
8920.0Waveform level
0200
400600
8001000
12000.0
0.1
0.2
0.3
0.4
0.5
• Samplinginunvoicedregion• Pickingone-bestinvoicedregion
MODELS &METHODS
59
WaveformgeneratorsqWaveNet-basedvocoder
probabilityUnvoicedsegment
Voicedregion
• Lessdistortionofharmonics☛ appendix&paper
![Page 19: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/19.jpg)
19
MODELS &METHODSAcousticmodels
Linguisticfeatures
SAR RNNDAR
Waveformgenerators
Acousticmodels
GAN
SAR-Wo SGA-Wo RGA-Wo RNN-Wo
F0 MGC
SAR-PmSAR-PrSAR-Wa
Phraserecovery
PMLWaveNet WORLD
minimum phase
![Page 20: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/20.jpg)
q Introduction
qModelsandmodules
q Experiments
q Summary
CONTENTS
20
![Page 21: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/21.jpg)
Configurationq Data
• Recordingperiod:over1year
q Front-end:OpenJTalk [15]
21[14]Kawai,H.,Toda,T.,Ni,J.,Tsuzaki,M.,andTokuda,K.(2004).Ximera:AnewTTSfromATRbasedoncorpus-based
technologies.InProc.SSW5,pages179–184.[15]HTSWorkingGroup.TheJapaneseTTSSystem‘OpenJTalk’,2015.
Corpus Size Note
ATR XimeraF009voice[14]
~30,000 utterances48hours
Samplingrate:48kHzJapanese,
neutral style,reading
Feature Dimension
Linguistic Phoneidentity,prosodictags... ~390
Acoustic
MGC 60
BAP 25
F0 1
EXPERIMENTS
![Page 22: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/22.jpg)
Configurationq Listeningtest
• Quality:MOS(1-5)• Similarity: rate1-5,naturalreference48kHz• Participants:235nativeJapaneselisteners,1500 setsofresults
q Systems• Commonnetworkconfiguration(cf.thepaper)• Without,norformantenhancement• Samplingrate:48kHz&16kHz,exceptSAR-Wa at16kHz(10bits,𝜇-law)
22
EXPERIMENTS
SAR-Wo SGA-Wo RGA-Wo RNN-WoSAR-PmSAR-PrSAR-WaAbs-WoAbs-PmNatural
WORLDPML+PhraseRecWORLDWaveNetPML WORLD
RNNGANRNN
GANSARSARnaturalnatural
�,�2
SAR
![Page 23: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/23.jpg)
23
EXPERIMENTS
SAR-Wo SGA-Wo RGA-Wo RNN-WoSAR-PmSAR-PrSAR-WaAbs-WoAbs-PmNatural
WORLDPML+PhraseRecWORLDWaveNetPML WORLD
RNNGANRNN
GANSARSARSARSARSARnaturalnatural
16kHz
48kHz
SAR
![Page 24: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/24.jpg)
Qualityscores
24
EXPERIMENTS
SAR-Wo SGA-Wo RGA-Wo RNN-WoSAR-PmSAR-PrSAR-WaAbs-WoAbs-PmNatural
WORLDPML+PhraseRecWORLDWaveNetPML WORLD
RNNGANRNN
GANSARSARSARSARSARnaturalnatural SAR
![Page 25: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/25.jpg)
Similarityscores
25
EXPERIMENTS
SAR-Wo SGA-Wo RGA-Wo RNN-WoSAR-PmSAR-PrSAR-WaAbs-WoAbs-PmNatural
WORLDPML+PhraseRecWORLDWaveNetPML WORLD
RNNGANRNN
GANSARSARSARSARSARnaturalnatural SAR
![Page 26: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/26.jpg)
q Introduction
qModelsandmethods
q Experiments
q Summary
CONTENTS
26
![Page 27: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/27.jpg)
Plugandtestq SAR:
• Avoidconditionalindependenceassumption• Alleviatetheover-smoothingeffect☛ appendix&paper
qWaveNet• Generationmethod:one-bestgeneration+randomsampling• Lessdistortionofharmonics☛ appendix&paper
qWaveNetvocoder+SAR&DAR• Betterthanothercombinations• Worsethannaturalspeech• Closetovocodedspeech
27
SUMMARY
![Page 28: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/28.jpg)
Recentwork☛ appendixq SAR
• Specialcaseofvolume-preservingnormalizingflow[16]
• ExtendedSAR= time-variantfilter+RNN
qWaveNet-vocoders• Trainingbasedongeneratedconditionalfeatures
q Annotatedlinguisticfeatures• Reducethegapbetweennatural&syntheticspeech
Futurework?q Reducethevariabilityofrecordingsq Usecomplex-valuedneuralmodels
28
FURTHER IMPROVEMENT?
[16]D.Rezende andS.Mohamed.Variationalinferencewithnormalizingflows.InProc.ICML,pages1530–1538,2015.
![Page 29: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/29.jpg)
Code,recipes,slidesq Acousticmodels&WaveNet(CUDA/C++)
q SimpleexplanationonWaveNetandacousticmodels
29
MESSAGE
https://github.com/TonyWangX/CURRENNT_MODIFIEDhttps://github.com/TonyWangX/CURRENNT_Recipes
http://tonywangx.github.io/slides.html
![Page 31: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/31.jpg)
Conditionalnetwork
WaveNet-Backend
Learningcurve
Generationmethod
Generationwithotheracousticmodels
Trainingbasedongeneratedfeatures31
APPENDIX - WAVENET
Tutorialslides:http://tonywangx.github.io/pdfs/wavenet.pdf
• Nocherrypicking• Allsamplesbasedongeneratedacousticfeatures
orautomaticallyinferredlinguisticfeatures
![Page 32: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/32.jpg)
Structure
32
+
Wavenetblock 1
Clock rate: 200Hz (frame shift = 5ms)
Linearot�1Wavenetblock 2
Wavenetblock M
…
Linear+tanh Linear+tanh Softmax
s(1)t s(2)t s(M)tet�1
r(1)t
lt
… …
et
Bi-LSTM Linearc1:N concatenationCNNF0
Up-sampling
Clock rate: 16kHz
Conditional feature network
Post-processing network
P (ot|ot�R:t�1, c1:N )
APPENDIX - WAVENET
![Page 33: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/33.jpg)
Conditionalnetworkq Architectureoftheconditionalnetwork
• Trial1
• Trial2
• Trial3
33
Bi-LSTM Linearc1:N Concat.CNN
F0
l1:N
c1:N l1:N
c1:N l1:N
F0MGC
Bi-LSTM LinearCNN
Linear
APPENDIX - WAVENET
![Page 34: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/34.jpg)
Conditionalnetworkq ExperimentsonWaveNet Vocoder
• Givengenerated MGC/F0
34
sample1 sample2 sample3
Natural
Trial1None
Trial 2LSTM+CNN
Trial3LSTM+CNN+skip-F0
APPENDIX - WAVENET
![Page 35: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/35.jpg)
WaveNetbackendq Architecture
qWaveNet-backendonlyusesrandomsampling
35
APPENDIX - WAVENET
WaveNet backendText Textanalyzer
F0model(DAR)
Textualfeatures
Waveform
sample1 sample2 sample3 sample4 sample5
Natural
WaveNet-vocoder
WaveNet-backend
![Page 36: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/36.jpg)
Learningcurve
• WetrainedWaveNetbackendformorethan100epochs36
247002570026700277002870029700307003170032700
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43
-loglikelihoo
d
epoch
Trainset
Val.Set
WaveNetvocoder
247002570026700277002870029700307003170032700
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43
-loglikelihoo
d
epoch
Trainset
Val.Set
WaveNetbackend
APPENDIX - WAVENET
![Page 37: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/37.jpg)
Generationmethod
37
8900.0 9100.0 9300.0 9500.0 9700.0 9900.00
200
400
600
800
1000
wav
efor
m(m
u-la
w)
8900.0 9100.0 9300.0 9500.0 9700.0 9900.0sampling point
0
200
400
600
800
1000
prob
ablit
yWaveformlevels(0-1024)
Waveformlevels(0-1024)
APPENDIX - WAVENET
![Page 38: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/38.jpg)
Generationmethodq ExperimentsonWaveNet vocoder
Sampling point
9300.09305.0
9310.09315.0
9320.0Waveform level
0200
400600
8001000
12000.0
0.1
0.2
0.3
0.4
0.5
Sampling point
8900.08905.0
8910.08915.0
8920.0Waveform level
0200
400600
8001000
12000.0
0.1
0.2
0.3
0.4
0.5
59
APPENDIX - WAVENET
![Page 39: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/39.jpg)
Generationmethodq One-best+randomsampling
• GivengeneratedMGC/F0
• Mix:
• Random:randomsamplingallthetime
39
voicedregion:one-best
unvoicedregion:randomsampling
sample1 sample2 sample3
Natural
Mix
Random
APPENDIX - WAVENET
![Page 40: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/40.jpg)
40
Natural
Voiced:one-bestUnvoiced:sampling
Voiced:samplingUnvoiced:sampling
![Page 41: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/41.jpg)
Generationmethodq One-best+randomsampling
• Mix:
• Mix2:
• …
• Mix4:
41
voicedregion:one-best
unvoicedregion:randomsampling
75%voicedframes:one-best
else:randomsampling
APPENDIX - WAVENET
25%voicedframes:one-best
else:randomsampling
![Page 42: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/42.jpg)
Generationmethodq One-best+randomsampling
INVESTIGATION
42
sample1 sample2 sample3
Natural
Mix
Mix2
Mix3
Mix4
Random
75%
50%
20%
100%
00%
![Page 43: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/43.jpg)
Generationmethodq One-best+randomsampling
43
sample1 sample2 sample3
Natural
WaveNetbackend
Mix
Random
WaveNetvocoder Mix
APPENDIX - WAVENET
![Page 44: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/44.jpg)
44
Natural
Mix
Random
WaveNet-backendMix
WaveNet-backendRandom
![Page 45: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/45.jpg)
45
WaveNet-vocoder+otheracousticmodelsAPPENDIX - WAVENET
sample1 sample2 sample3
Natural
SAR+DAR
SGA+DAR
RGA+DAR
RNN+DAR
extendedSAR+DAR
![Page 46: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/46.jpg)
WaveNet-vocoder:trainingusinggeneratedMGC
46
APPENDIX - WAVENET
sample1 sample2 sample3
Natural
WaveNet-Backend
Trainedonnatural MGC
Trainedongenerated MGCEpoch35
Trainedongenerated MGCEpoch45
Trainedongenerated MGCEpoch55
WaveNet-Vocoder
![Page 47: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/47.jpg)
Generalcomparison
SAR&DAR
SARextension
47
APPENDIX – ACOUSTIC MODELS
Moredetails:http://tonywangx.github.io/pdfs/talk.pdf
![Page 48: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/48.jpg)
Generalcomparisonq Generatedtrajectories
48
0 100 200 300 400 500 600 700 800 900Frame index (utterance ATR Ximera F009 AOZORAR 03372 T01)
�2
�1
0
1
2
3
4
MG
C1
dim
Natural RNN RNN+GAN (RGA) SAR SAR+GAN (SGA)
0 100 200 300 400 500 600 700 800 900Frame index (utterance ATR Ximera F009 AOZORAR 03372 T01)
�0.8
�0.6
�0.4
�0.2
0.0
0.2
0.4
0.6
0.8
1.0
MG
C5
dim
Natural RNN RNN+GAN (RGA) SAR SAR+GAN (SGA)
MGC1st dim
MGC5th dim
APPENDIX – ACOUSTIC MODELS
![Page 49: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/49.jpg)
0 100 200 300 400 500 600 700 800 900Frame index (utterance ATR Ximera F009 AOZORAR 03372 T01)
�0.4
�0.3
�0.2
�0.1
0.0
0.1
0.2
0.3
0.4
MG
C15
dim
Natural RNN SAR
49
MGC15th dim
0 100 200 300 400 500 600 700 800 900Frame index (utterance ATR Ximera F009 AOZORAR 03372 T01)
�0.20
�0.15
�0.10
�0.05
0.00
0.05
0.10
0.15
0.20
MG
C31
dim
Natural RNN SAR
MGC31th dim
Generalcomparisonq Generatedtrajectories
APPENDIX – ACOUSTIC MODELS
![Page 50: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/50.jpg)
0 100 200 300 400 500 600 700 800 900Frame index (utterance ATR Ximera F009 AOZORAR 03372 T01)
�0.4
�0.3
�0.2
�0.1
0.0
0.1
0.2
0.3
0.4
MG
C15
dim
Natural RNN RNN+GAN (RGA) SAR SAR+GAN (SGA)
50
MGC15th dim
0 100 200 300 400 500 600 700 800 900Frame index (utterance ATR Ximera F009 AOZORAR 03372 T01)
�0.20
�0.15
�0.10
�0.05
0.00
0.05
0.10
0.15
0.20
MG
C31
dim
Natural RNN RNN+GAN (RGA) SAR SAR+GAN (SGA)
MGC31th dim
Generalcomparisonq Generatedtrajectories
APPENDIX – ACOUSTIC MODELS
![Page 51: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/51.jpg)
51
Generalcomparisonq GV
APPENDIX – ACOUSTIC MODELS
0 1 2 3 4 5 6 7 8�2
�1
0
1
Natural
RNN
RNN+GAN (RGA)
SAR
SAR+GAN (SGA)
10 12 14 16 18 20 22 24�3.5
�3.0
�2.5
�1.5
26 28 30 32 34 36 38 40 42�4.0
�3.6
�3.0
�2.4
44 46 48 50 52 54 56 58�4.2
�3.8
�3.2
�2.6
Dimension index of MGC
Utte
ranc
e-le
velG
Vof
MG
C
Modulationspectrum(MGC31th)
Globalvariance
![Page 52: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/52.jpg)
52
SAR&DARq ToySARexample
• SARversusRMDNwitharecurrentoutputlayer[11]
• Assumeand,linearactivationfunction
APPENDIX – ACOUSTIC MODELS
[11]H.ZenandH.Sak.Unidirectionallongshort-termmemoryrecurrentneuralnetworkwithrecurrentoutputlayerforlow-latencyspeechsynthesis.InProc.ICASSP,pages4470–4474,2015.
ot 2 R ⌃t = 1
h2h1
RMDNSAR
h2h1
x1 x2x1 x2
o1 o2o1 o2
wµ
a
µ1 µ2µ1 µ2 outputlayer
hiddenlayer
![Page 53: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/53.jpg)
53[11]H.ZenandH.Sak.Unidirectionallongshort-termmemoryrecurrentneuralnetworkwithrecurrentoutputlayerforlow-latencyspeechsynthesis.InProc.ICASSP,pages4470–4474,2015.
SAR&DARq ToySARexample
APPENDIX – ACOUSTIC MODELS
h2h1RMDN
x1 x2
o1 o2
wµµ1 µ2
SARh2h1
x1 x2
o1 o2a
µ1 µ2
µ1 = w>h1 + b
µ2 = w>h2 + b+ wµµ1 = µ̃2 + wµµ1
µ1 = w>h1 + b
µ2 = w>h2 + b
p(o1:2) =N (o1;µ1, 1)N (o2;µ2 + ao1, 1)
p(o1:2) =N (o1;µ1, 1)N (o2; µ̃2 + wµµ1, 1)
![Page 54: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/54.jpg)
54
h2h1RMDN
x1 x2
o1 o2
wµµ1 µ2
SARh2h1
x1 x2
o1 o2a
µ1 µ2
p(o1:2) =N (o1;µ1, 1)N (o2; µ̃2 + wµµ1, 1)
=1
2⇡exp(�1
2(o� µ)>⌃�1(o� µ))
o = [o1, o2]> µ = [µ1, µ2 + aµ1]
>⌃ =
1 aa 1 + a2
�
o = [o1, o2]>
⌃ =
1 00 1
�µ = [µ1, µ̃2 + wµµ1]
>
p(o1:2) =N (o1;µ1, 1)N (o2;µ2 + ao1, 1)
=1
2⇡exp(�1
2(o� µ)>⌃�1(o� µ))
Dependencybetweenor?µt ot
SAR&DARq ToySARexample
APPENDIX – ACOUSTIC MODELS
![Page 55: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/55.jpg)
55
SAR&DARq ToySARexample
APPENDIX – ACOUSTIC MODELS
µc = [µ1, µ2]> ⌃c =
1 00 1
�RMDN
h2h1
x1 x2
µ1 µ2
c2c1
SARh2h1
x1 x2
o1 o2a
µ1 µ2
o = [o1, o2]>
c =
c1c2
�=
o1
o2 � ao1
�=
1 0�a 1
� o1o2
�= Ao
p(o) = p(c) = N (c;µc,⌃c)
p(o) = N (o;µo,⌃o)
µo = [µ1, µ2 + aµ1]>
⌃o =
1 aa 1 + a2
�
![Page 56: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/56.jpg)
56
SAR&DARq SAR:invertiblelinearfeature/modeltransformation
• For
• SARisequivalentto:
APPENDIX – ACOUSTIC MODELS
Training
Generationx1:T
TY
t=1
p(ct;Mt)
TY
t=1
p(ct;Mt)bo1:T
o1:T
bc1:T
c1:T
2
4c1:T,1
· · ·c1:T,D
3
5o1:T =
2
4o1:T,1
· · ·o1:T,D
3
5A(1)
A(D)
…o1:T 2 RD⇥T
…
…
A(1)
A(D)
A(1)�1
A(D)�1
![Page 57: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/57.jpg)
57
SAR&DARq SAR:invertiblelinearfeature/modeltransformation
• For
• SARisequivalentto:
APPENDIX – ACOUSTIC MODELS
Training
Generationx1:T
bo1:T
o1:T
bc1:T
c1:T
2
4c1:T,1
· · ·c1:T,D
3
5o1:T =
2
4o1:T,1
· · ·o1:T,D
3
5o1:T 2 RD⇥T
filtersA1(z)
AD(z)…
filters
…1/A1(z)
1/AD(z)
filter1
filterD…
TY
t=1
p(ct;Mt)
TY
t=1
p(ct;Mt)
![Page 58: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/58.jpg)
58
SAR&DARq SAR:invertiblelinearfeature/modeltransformation
• Onlydueto?
• Dueto,lessmismatchbetweenandRMDN
APPENDIX – ACOUSTIC MODELS
bo1:T
o1:T
bc1:T
c1:T
filtersA1(z)
AD(z)…
filters
…1/A1(z)
1/AD(z)
1 250 500 750 1000Frequency bin (: /1024)
-10
-5
0
5
Mag
nitu
de (d
B)
H1(z)
1 250 500 750 1000Frequency bin (: /1024)
-10
-5
0
5
Mag
nitu
de (d
B)
A1(z)A1(z)
1/A1(z)
1/Ad(z)
{Ad(z), 1/Ad(z)} c1:T
![Page 59: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/59.jpg)
SARextension:normalizingflowq Basicidea
• Jacobianmatrixmustbesimple• f(.)mustbeinvertible
59
x1:T
bo1:T
o1:T
bc1:T
c1:Tc1:T = f�(o1:T )
bo1:T = f�1� (bc1:T )
po(o1:T |x1:T ) = pc(c1:T |x1:T )
����� det@c1:T@o1:T
�����
[13]D.Rezende andS.Mohamed.Variational inferencewithnormalizingflows.InInternationalConferenceonMachineLearning,pages1530–1538,2015.[14]D.P.Kingma,T.Salimans,R.Jozefowicz,X.Chen,I.Sutskever,andM.Welling.Improvedvariational inferencewithinverseautoregressiveflow.InProc.NIPS,pages
4743–4751,2016.
TY
t=1
p(ct;Mt)
TY
t=1
p(ct;Mt)
APPENDIX – ACOUSTIC MODELS
![Page 60: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/60.jpg)
SARextension:normalizingflowq Basicidea
• SimpleforSAR:
60
SAR ARFlow
Transform
De-transform
ct = ot �KX
k=1
ak � ot�k
bot = ct +KX
k=1
ak � bot�k
po(o1:T |x1:T ) = pc(c1:T |x1:T )
����� det@c1:T@o1:T
�����
����� det@c1:T@o1:T
����� = 1
µt = RNN(o1:t�1)
ct = ot �KX
k=1
f (k)(o1:t�k)� ot�k
bot = ct +KX
k=1
f (k)(bo1:t�k)� bot�k
APPENDIX – ACOUSTIC MODELS
![Page 61: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/61.jpg)
61
SARextensionq SARcanbeextended
APPENDIX – ACOUSTIC MODELS
Moredetails:http://tonywangx.github.io/pdfs/talk.pdf
![Page 62: Comparing Recent Waveform Generation and Acoustic …tonywangx.github.io/pdfs/ICASSP18.pdf · Mel-generalized cepstral analysis a unified approach. In Proc. ICSLP, pages 1043 ...](https://reader035.fdocuments.in/reader035/viewer/2022062909/5b5739e77f8b9ac5358ddab5/html5/thumbnails/62.jpg)
62
DARq Sameautoregressiveprincipleq Butnoninvertiblenonlinear
APPENDIX – ACOUSTIC MODELSp-value
NAT DAR SAR RMDN RNNNAT <1e-30 <1.0e-30 <1.0e-30 <1.0e-30DAR <1e-30 1.6e-28 6.3e-19 2.4e-30SAR <1e-30 1.6e-28 0.015 0.949RMDN <1e-30 6.3e-19 0.015 0.014RNN <1e-30 2.4e-30 0.949 0.014
3.00
3.25
3.50
3.75
4.00
4.25
MO
S
NAT RNN RMDN SAR DAR20
40
60
80
100
120
GV
ofF0
atut
tera
nce-
leve
l(H
z)
NAT
DAR
SAR RMDN RNN
MOSscore F0GV
NAT DARSARRMDNRNN
[7]X.Wang,S.Takaki,andJ.Yamagishi.AutoregressiveneuralF0modelforstatisticalparametricspeechsynthesis.IEEETransactionsonAudio,SpeechandLanguageProcessing.(Accepted)