Parametric Speech Synthesis (D3L5 Deep Learning for Speech and Language UPC 2017)

[course site] Day 3 Lecture 5 Parametric Speech Synthesis Antonio Bonafonte

Transcript of Parametric Speech Synthesis (D3L5 Deep Learning for Speech and Language UPC 2017)

Page 1: Parametric Speech Synthesis (D3L5 Deep Learning for Speech and Language UPC 2017)

[course site]

Day 3 Lecture 5

Parametric Speech Synthesis

Antonio Bonafonte

Page 2:

Main TTS Technologies

Concatenative speech synthesis + Unit Selection
● Concatenate best pre-recorded speech units
● Speech data: 2-10 hours, professional speaker, carefully segmented and annotated

Page 3:

Concatenative

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

Page 4:

Statistical Speech Synthesis

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

Page 5:

Main TTS Technologies

Concatenative speech synthesis + Unit Selection
● Concatenate best pre-recorded speech units

Statistical Parametric Speech Synthesis
● Represent speech waveform using parameters (e.g. every 5 ms)
● Use a statistical generative model
● Reconstruct waveform from generated parameters

Hybrid Systems
● Concatenative speech synthesis
● Select the best units according to a statistical parametric system

Page 6:

Deep architectures … but not deep (yet)


Text to Speech: Textual features → Spectrum of speech (many coefficients)

[Diagram: TXT → designed feature extraction → ft 1, ft 2, ft 3 → regression module → s1, s2, s3 → wavegen. Label: “Hand-crafted” features]

Page 7:

Textual features (x)

From text to phoneme (pronunciation)
● Disambiguation, pronunciation (e.g.: Jan. 26)

From phoneme to phoneme+ (with linguistic features)

Page 8:

Textual features (x)

● {preceding, succeeding} two phonemes
● Position of current phoneme in current syllable
● # of phonemes at {preceding, current, succeeding} syllable
● {accent, stress} of {preceding, current, succeeding} syllable
● Position of current syllable in current word
● # of {preceding, succeeding} {stressed, accented} syllables in phrase
● # of syllables {from previous, to next} {stressed, accented} syllable
● Guess at part of speech of {preceding, current, succeeding} word
● # of syllables in {preceding, current, succeeding} word
● Position of current word in current phrase
● # of {preceding, succeeding} content words in current phrase
● # of words {from previous, to next} content word
● # of syllables in {preceding, current, succeeding} phrase

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
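A minimal sketch of how context features like those listed on this slide could be turned into a numeric input vector x. The phoneme inventory, the function names and the selection of features are illustrative assumptions, not the encoding used in the lecture. For brevity only one neighbouring phoneme on each side is encoded instead of two.

```python
# Hypothetical, tiny phoneme inventory; a real system would use the full set.
PHONEMES = ["sil", "a", "k", "t"]


def one_hot(symbol, inventory):
    """Return a one-hot list marking the position of symbol in inventory."""
    return [1.0 if symbol == item else 0.0 for item in inventory]


def encode_phoneme(prev_ph, cur_ph, next_ph, pos_in_syllable, syllable_len, stressed):
    """Concatenate categorical (one-hot) and numeric context features."""
    vec = []
    for ph in (prev_ph, cur_ph, next_ph):
        vec.extend(one_hot(ph, PHONEMES))
    vec.append(pos_in_syllable / syllable_len)  # relative position in the syllable
    vec.append(float(syllable_len))             # number of phonemes in the syllable
    vec.append(1.0 if stressed else 0.0)        # stress flag of the current syllable
    return vec


# Phoneme "a" in the stressed syllable "kat", second of three phonemes.
x = encode_phoneme("k", "a", "t", pos_in_syllable=2, syllable_len=3, stressed=True)
print(len(x))  # 3 one-hot blocks of 4 values plus 3 numeric values
```

Categorical features become one-hot blocks and counts or positions stay numeric, so the regression module receives a fixed-length vector for every phoneme.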

Page 9:

Statistical Speech Synthesis

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

Page 10:

Speech features (y)

Page 11:

Speech features (y)

Page 12:

Speech features (y)

Page 13:

Speech features (y)

● Rate: one frame every ~5 ms (200 Hz)

● Spectral features (envelope)

● Excitation features (fundamental frequency, pitch)

● Representation that allows reconstruction: vocoders (Straight, Ahocoder, ...)
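The relation between the 5 ms frame period and the 200 Hz frame rate can be checked with a few lines of arithmetic. This is only a sketch: the 41 values per frame (for example 40 spectral coefficients plus one F0 value) is a hypothetical figure chosen for illustration, not a number taken from the slides.

```python
FRAME_SHIFT_MS = 5  # frame period stated on the slide


def frames_per_second(frame_shift_ms):
    """Frame rate in Hz for a given frame period in milliseconds."""
    return 1000 / frame_shift_ms


def num_frames(duration_ms, frame_shift_ms=FRAME_SHIFT_MS):
    """Number of whole frames that fit in an utterance of duration_ms."""
    return duration_ms // frame_shift_ms


def values_per_second(frame_shift_ms, values_per_frame):
    """How many parameter values must be generated per second of speech."""
    return frames_per_second(frame_shift_ms) * values_per_frame


print(frames_per_second(FRAME_SHIFT_MS))      # 200.0
print(num_frames(2000))                       # 400 frames for 2 seconds
print(values_per_second(FRAME_SHIFT_MS, 41))  # 8200.0 (41 is a hypothetical size)
```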

Page 14:

Regression


[Diagram: TXT → designed feature extraction → ft 1, ft 2, ft 3 → regression module → s1, s2, s3 → wavegen]

Page 15:

Phoneme rate vs. frame rate

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
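Textual features change once per phoneme, while speech features are produced once per frame, so the phoneme-level vectors have to be expanded to frame level. The sketch below shows one possible way to do this, assuming phoneme durations are already known (for example predicted by a duration model). The function name and the extra "relative position" feature are illustrative assumptions.

```python
FRAME_SHIFT_MS = 5  # frame period used for the speech features


def upsample_to_frames(phoneme_features, durations_ms, frame_shift_ms=FRAME_SHIFT_MS):
    """Repeat each phoneme-level vector once per frame of its duration.

    A relative-position value in [0, 1) is appended so that frames
    belonging to the same phoneme are not identical.
    """
    frames = []
    for feats, dur in zip(phoneme_features, durations_ms):
        n = max(1, round(dur / frame_shift_ms))  # at least one frame per phoneme
        for i in range(n):
            frames.append(list(feats) + [i / n])
    return frames


# Two phonemes lasting 20 ms and 10 ms give 4 + 2 = 6 frames.
frames = upsample_to_frames([[1.0, 0.0], [0.0, 1.0]], [20, 10])
print(len(frames))  # 6
print(frames[3])    # [1.0, 0.0, 0.75]
```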

Page 16:

Duration Modeling

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

Page 17:

Acoustic Modeling

Page 18:

Acoustic Modeling: DNN

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

Page 19:

Acoustic Modeling: DNN

Page 20:

Regression using DNN (problem)

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

Page 21:

Mixture density network (MDN)

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

Page 22:

Mixture density network (MDN)

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
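The slide figures are not part of this transcript, so the following is only a generic sketch of the mixture density idea, not the model shown in the lecture. It assumes a one-dimensional target and a network whose raw output holds, per mixture component, one weight logit, one mean and one log standard deviation. That output layout and all function names are assumptions made for illustration.

```python
import math


def gaussian_pdf(y, mean, std):
    """Density of a one-dimensional Gaussian at y."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def softmax(logits):
    """Map arbitrary real values to positive weights that sum to one."""
    mx = max(logits)
    exps = [math.exp(v - mx) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mdn_params(raw_output, n_components):
    """Split a raw output vector into mixture weights, means and std devs."""
    logits = raw_output[:n_components]
    means = raw_output[n_components:2 * n_components]
    log_stds = raw_output[2 * n_components:3 * n_components]
    return softmax(logits), list(means), [math.exp(v) for v in log_stds]


def mixture_density(y, weights, means, stds):
    """p(y | x) as a weighted sum of Gaussian components."""
    return sum(w * gaussian_pdf(y, m, s) for w, m, s in zip(weights, means, stds))


# Hypothetical raw output describing two equally weighted modes at -1 and +1.
raw = [0.0, 0.0, -1.0, 1.0, math.log(0.5), math.log(0.5)]
weights, means, stds = mdn_params(raw, 2)

# Training would minimise this negative log-likelihood of the observed target.
nll = -math.log(mixture_density(1.0, weights, means, stds))
print(weights, means, stds, nll)
```

In this toy case the mean of the mixture is 0, a value where the density is lower than at either mode. A single point estimate can therefore be a poor description of a multimodal target, which is a commonly cited reason for predicting a mixture instead.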

Page 23:

Recurrent Networks: LSTM

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

Page 24:

Recurrent Networks: LSTM

Boxplot of mean opinion score

● SPSS: Models built using HTS
● US: Ogmios unit selection system
● LSTM: Duration and acoustic LSTM models
● LSTM-pf: Post-filtered
● Ahocoded: Analysis/synthesis (vocoder effect)
● Natural: Human recording

Source [MSc Pascual]

Page 25:

Multi-speaker

Page 26:

Multi-speaker

Boxplot of subjective preference:
● -2: multi-output system preferred
● +2: speaker-dependent system preferred

Page 27:

Multi-speaker

Page 28:

Adaptation to new speaker

Page 30:

Conclusions

● The quality of DL-based SPSS is much better than that of conventional SPSS

● Concatenative hybrid systems also benefit from deep learning (e.g. Apple Siri)

● Some research on adding better linguistic features to increase expressivity (e.g. sentiment analysis features)

● A vocoder is still used in most systems, which degrades quality … (but … more on WaveNet tomorrow)

Page 31:

References

Statistical parametric speech synthesis: from HMM to LSTM-RNN. Heiga Zen, Google. rtthss2015.talp.cat/ (slides, video, .. and references)

Deep learning applied to Speech Synthesis, 2016. Santiago Pascual, MSc Thesis, UPC. veu.talp.cat/doc/MSC_Santiago_Pascual.pdf [eusipco-2016] [ssw9]