Parametric Speech Synthesis (D3L5 Deep Learning for Speech and Language UPC 2017)
Transcript of Parametric Speech Synthesis (D3L5 Deep Learning for Speech and Language UPC 2017)
[course site]
Day 3 Lecture 5
Parametric Speech Synthesis
Antonio Bonafonte
2
Main TTS Technologies
Concatenative speech synthesis + Unit Selection: concatenate the best pre-recorded speech units. Speech data: 2-10 hours, professional speaker, carefully segmented and annotated.
3
Concatenative
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
Statistical Speech Synthesis
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
5
Main TTS Technologies
Concatenative speech synthesis + Unit Selection: concatenate the best pre-recorded speech units.
Statistical Parametric Speech Synthesis: represent the speech waveform with parameters (e.g. every 5 ms), use a statistical generative model, and reconstruct the waveform from the generated parameters.
Hybrid Systems: concatenative speech synthesis that selects the best units according to a statistical parametric model.
6
Deep architectures … but not deep (yet)
Text to Speech: Textual features → Spectrum of speech (many coefficients)
[Pipeline figure] TXT → designed feature extraction → "hand-crafted" textual features (ft1, ft2, ft3) → regression module → "hand-crafted" speech parameters (s1, s2, s3) → wavegen
7
Textual features (x)
From text to phonemes (pronunciation): disambiguation, pronunciation (e.g.: "Jan. 26")
From phonemes to phonemes+ (enriched with linguistic features)
8
Textual features (x)
● {preceding, succeeding} two phonemes
● Position of current phoneme in current syllable
● # of phonemes in {preceding, current, succeeding} syllable
● {accent, stress} of {preceding, current, succeeding} syllable
● Position of current syllable in current word
● # of {preceding, succeeding} {stressed, accented} syllables in phrase
● # of syllables {from previous, to next} {stressed, accented} syllable
● Guess at part of speech of {preceding, current, succeeding} word
● # of syllables in {preceding, current, succeeding} word
● Position of current word in current phrase
● # of {preceding, succeeding} content words in current phrase
● # of words {from previous, to next} content word
● # of syllables in {preceding, current, succeeding} phrase
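The features above mix categorical values (phoneme identities, part of speech) with counts and positions. A minimal sketch of how such a slide-style feature set could be packed into one numeric input vector for the regression model (the phoneme inventory and the chosen subset of features here are illustrative assumptions, not the lecture's exact setup):

```python
# Hypothetical sketch: encode a few of the linguistic features listed above
# as one fixed-size numeric vector. Categorical features are one-hot encoded;
# counts and positions are kept as plain numbers.

PHONEMES = ["sil", "a", "e", "i", "o", "u", "p", "t", "k", "s", "n"]  # toy inventory

def one_hot(symbol, inventory):
    """One-hot encoding of a categorical feature."""
    vec = [0.0] * len(inventory)
    vec[inventory.index(symbol)] = 1.0
    return vec

def linguistic_vector(phoneme, prev_phoneme, pos_in_syllable,
                      n_syllables_in_word, stressed):
    """Concatenate one-hot (categorical) and numeric features into one vector."""
    return (one_hot(phoneme, PHONEMES)
            + one_hot(prev_phoneme, PHONEMES)
            + [float(pos_in_syllable), float(n_syllables_in_word),
               1.0 if stressed else 0.0])

x = linguistic_vector("a", "t", pos_in_syllable=2,
                      n_syllables_in_word=3, stressed=True)
print(len(x))  # 2 * 11 one-hot dims + 3 numeric dims = 25
```

In a real system this vector is built for every phoneme (or frame) and its dimensionality is much larger, since all the contextual features on the slide are included.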
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
Statistical Speech Synthesis
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
10
Speech features (y)
● Rate: ~5 ms (200 Hz)
● Spectral features (envelope)
● Excitation features (fundamental frequency, pitch)
● Representation that allows reconstruction: vocoders (Straight, Ahocoder, ...)
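The frame-rate arithmetic implied by the bullets above can be made concrete. A small sketch, with assumed numbers: a 5 ms shift yields 200 frames per second, and each frame might hold a few dozen spectral-envelope coefficients plus excitation features (the exact layout and dimensionality depend on the vocoder, e.g. Straight or Ahocoder; the 40-coefficient figure here is an assumption):

```python
# Sketch of vocoder parameterization at the slide's frame rate.
FRAME_SHIFT_MS = 5    # analysis rate from the slide: one frame every 5 ms
SPECTRAL_DIM = 40     # assumed envelope order; vocoder-dependent

def n_frames(duration_s):
    """Number of parameter frames produced for an utterance of given length."""
    return int(round(duration_s * 1000 / FRAME_SHIFT_MS))

def frame_layout():
    """Names of the per-frame features: spectral envelope + excitation."""
    return ([f"mcep_{i}" for i in range(SPECTRAL_DIM)]
            + ["log_f0", "voiced_flag"])

print(n_frames(2.0))        # a 2-second utterance -> 400 frames
print(len(frame_layout()))  # 42 features per frame
```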
14
Regression
[Pipeline figure, repeated] TXT → designed feature extraction (ft1, ft2, ft3) → regression module → speech parameters (s1, s2, s3) → wavegen
15
Phoneme rate vs. frame rate
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
16
Duration Modeling
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
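The phoneme-rate vs. frame-rate mismatch is resolved by the duration model: it predicts how many 5 ms frames each phoneme lasts, and the phoneme-level linguistic features are repeated for that many frames, typically with the frame's relative position inside the phoneme appended. A minimal sketch (function and variable names are assumptions):

```python
# Sketch of the phoneme-rate -> frame-rate mapping used in SPSS pipelines.

def upsample_to_frames(phoneme_feats, durations_in_frames):
    """Repeat each phoneme's feature vector for its predicted duration,
    appending the fractional position of the frame inside the phoneme."""
    frames = []
    for feats, dur in zip(phoneme_feats, durations_in_frames):
        for i in range(dur):
            frames.append(feats + [(i + 0.5) / dur])  # within-phoneme position
    return frames

phoneme_feats = [[1.0, 0.0], [0.0, 1.0]]  # two toy phoneme-level vectors
durations = [3, 2]                         # predicted frames per phoneme
frames = upsample_to_frames(phoneme_feats, durations)
print(len(frames))  # 3 + 2 = 5 frame-level vectors
```

The acoustic model then works entirely at frame rate, one input/output pair per 5 ms frame.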
17
Acoustic Modeling
18
Acoustic Modeling: DNN
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
19
Acoustic Modeling: DNN
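The DNN acoustic model is a feed-forward network mapping each frame's linguistic feature vector x to its acoustic parameter vector y, trained with an MSE loss against the vocoder parameters of recorded speech. A minimal forward-pass sketch (layer sizes and weight initialization are illustrative assumptions, not the lecture's configuration; training is omitted):

```python
import math
import random

random.seed(0)

def layer(n_in, n_out):
    """Random small weights and zero biases for one fully connected layer."""
    w = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def forward(x, w, b, act):
    """One dense layer: act(W x + b)."""
    return [act(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

IN_DIM, HID, OUT_DIM = 25, 16, 42   # assumed sizes: linguistic -> acoustic
W1, b1 = layer(IN_DIM, HID)
W2, b2 = layer(HID, OUT_DIM)

x = [0.0] * IN_DIM                   # toy frame-level linguistic features
x[0] = 1.0
h = forward(x, W1, b1, math.tanh)    # hidden layer, tanh nonlinearity
y = forward(h, W2, b2, lambda v: v)  # linear output = acoustic parameters
print(len(y))  # 42
```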
20
Regression using DNN (problem)
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
21
Mixture density network (MDN)
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
22
Mixture density network (MDN)
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
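The MDN addresses the problem of the previous slide: MSE regression predicts the conditional mean, which averages over multimodal targets. An MDN instead outputs, per frame, the parameters of a Gaussian mixture (weights, means, standard deviations) and is trained to minimize the negative log-likelihood of the observed value. A one-dimensional sketch of that loss (toy numbers, not from the lecture):

```python
import math

def mdn_nll(y, pis, mus, sigmas):
    """Negative log-likelihood of scalar y under a 1-D Gaussian mixture."""
    likelihood = sum(
        pi * math.exp(-0.5 * ((y - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for pi, mu, s in zip(pis, mus, sigmas))
    return -math.log(likelihood)

# Bimodal target: a 2-component mixture puts mass on both modes,
# whereas a single-mean (MSE) model would predict the useless average 0.0.
pis, mus, sigmas = [0.5, 0.5], [-1.0, 1.0], [0.3, 0.3]
print(mdn_nll(1.0, pis, mus, sigmas) < mdn_nll(0.0, pis, mus, sigmas))  # True
```

At synthesis time one can take the mean of the most probable component (or sample), so the generated parameters sit on a mode rather than between modes.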
23
Recurrent Networks: LSTM
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
24
Recurrent Networks: LSTM
Boxplot of mean opinion score
SPSS: models built using HTS
US: Ogmios unit selection system
LSTM: duration and acoustic LSTM models
LSTM-pf: post-filtered
Ahocoded: analysis/synthesis (vocoder effect)
Natural: human recording
Source [MSc Pascual]
25
Multi-speaker
26
Multi-speaker
Boxplot of subjective preference:
-2: multi-output system preferred
+2: speaker-dependent system preferred
27
Multi-speaker
28
Adaptation to new speaker
29
Speaker interpolation
Interpolation: 0 1
Interpolation: 0.25 0.75
Interpolation: 0.5 0.5
Interpolation: 0.75 0.25
Interpolation: 1 0
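The interpolation weights listed above can be read as a linear blend between two voices: the acoustic parameters generated for the two speakers (or their speaker embeddings, in a multi-speaker model) are mixed with weights summing to 1. A minimal sketch with toy vectors:

```python
# Sketch of speaker interpolation: blend two speakers' parameter vectors
# with weights (w_a, 1 - w_a), matching the 0/0.25/0.5/0.75/1 steps above.

def interpolate(params_a, params_b, w_a):
    """Linear interpolation between two speakers' parameter vectors."""
    return [w_a * a + (1 - w_a) * b for a, b in zip(params_a, params_b)]

speaker_a = [1.0, 2.0, 3.0]  # toy acoustic/embedding vector, speaker A
speaker_b = [3.0, 0.0, 1.0]  # speaker B
print(interpolate(speaker_a, speaker_b, 0.5))  # [2.0, 1.0, 2.0]
print(interpolate(speaker_a, speaker_b, 1.0))  # speaker A exactly
```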
Conclusions
30
● Quality of DL-based SPSS is much better than conventional SPSS
● Hybrid concatenative systems also benefit from deep learning (e.g. Apple Siri)
● Some research on adding richer linguistic features for expressivity (e.g. sentiment-analysis features)
● A vocoder is still used in most systems, which degrades quality … (but more on WaveNet tomorrow)
References
31
Heiga Zen (Google), Statistical parametric speech synthesis: from HMM to LSTM-RNN. RTTHSS 2015, rtthss2015.talp.cat/ (slides, video, and references)
Santiago Pascual, Deep learning applied to Speech Synthesis. MSc Thesis, UPC, 2016. veu.talp.cat/doc/MSC_Santiago_Pascual.pdf [eusipco-2016] [ssw9]