Parametric Speech Synthesis (D3L5 Deep Learning for Speech and Language UPC 2017)


[course site]

Day 3 Lecture 5

Parametric Speech Synthesis
Antonio Bonafonte

2

Main TTS Technologies

Concatenative speech synthesis + Unit Selection
Concatenate the best pre-recorded speech units.
Speech data: 2-10 hours, professional speaker, carefully segmented and annotated.

3

Concatenative

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

Statistical Speech Synthesis

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

5

Main TTS Technologies

Concatenative speech synthesis + Unit Selection
Concatenate the best pre-recorded speech units.

Statistical Parametric Speech Synthesis
Represent the speech waveform with parameters (e.g. every 5 ms).
Use a statistical generative model.
Reconstruct the waveform from the generated parameters.

Hybrid Systems
Concatenative speech synthesis.
Select the best units according to a statistical parametric system.

6

Deep architectures … but not deep (yet)


Text to Speech: Textual features → Spectrum of speech (many coefficients)

[Block diagram: TXT → designed feature extraction → "hand-crafted" textual features (ft 1, ft 2, ft 3) → regression module → "hand-crafted" speech parameters (s1, s2, s3) → wavegen.]

7

Textual features (x)

From text to phonemes (pronunciation): disambiguation and pronunciation (e.g. "Jan. 26", sketched below)

From phoneme to phoneme+ (with linguistic features)
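As a toy illustration of the disambiguation step above ("Jan. 26"), the code below expands an abbreviation and reads a number as an ordinal. It assumes the third-party num2words package and a hand-made abbreviation table; neither is part of the lecture's actual front end.

```python
# Minimal text-normalization sketch (assumes the num2words package is installed).
from num2words import num2words

ABBREVIATIONS = {"Jan.": "January", "Feb.": "February"}  # toy abbreviation table

def normalize(text):
    """Expand abbreviations and read digits as ordinals (date-like context)."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif token.isdigit():
            out.append(num2words(int(token), to="ordinal"))
        else:
            out.append(token)
    return " ".join(out)

print(normalize("Jan. 26"))  # -> "January twenty-sixth"
```

A real front end also needs pronunciation lexica and grapheme-to-phoneme rules before the linguistic features listed on the next slide can be computed.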

8

Textual features (x)
● {preceding, succeeding} two phonemes
● Position of current phoneme in current syllable
● # of phonemes in {preceding, current, succeeding} syllable
● {accent, stress} of {preceding, current, succeeding} syllable
● Position of current syllable in current word
● # of {preceding, succeeding} {stressed, accented} syllables in phrase
● # of syllables {from previous, to next} {stressed, accented} syllable
● Guess at part of speech of {preceding, current, succeeding} word
● # of syllables in {preceding, current, succeeding} word
● Position of current word in current phrase
● # of {preceding, succeeding} content words in current phrase
● # of words {from previous, to next} content word
● # of syllables in {preceding, current, succeeding} phrase

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
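The features listed above are categorical or count-valued and are flattened into one numeric vector per phoneme before regression. Below is a minimal encoding sketch with a hypothetical, much-reduced phoneme inventory and an arbitrary set of count features, not the exact feature set of the slides.

```python
import numpy as np

# Hypothetical reduced phoneme inventory; a real system uses the full language-dependent set.
PHONEMES = ["sil", "a", "e", "i", "o", "u", "p", "t", "k", "s", "n"]

def one_hot(symbol, inventory):
    v = np.zeros(len(inventory), dtype=np.float32)
    v[inventory.index(symbol)] = 1.0
    return v

def linguistic_vector(context, counts):
    """context: 5 phonemes (2 preceding, current, 2 succeeding);
    counts: numeric positional/count/stress features for this phoneme."""
    parts = [one_hot(p, PHONEMES) for p in context]
    parts.append(np.asarray(counts, dtype=np.float32))
    return np.concatenate(parts)

x = linguistic_vector(
    context=["sil", "o", "a", "s", "i"],   # quinphone context
    counts=[1, 3, 0, 1, 2, 7],             # e.g. positions, syllable counts, stress flags
)
print(x.shape)   # (5 * 11 + 6,) -> (61,)
```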

Statistical Speech Synthesis

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

10

Speech features (y)


● Rate: ~5 ms (200 Hz frame rate)

● Spectral features (envelope)

● Excitation features (fundamental frequency, pitch)

● Representation that allows reconstruction: vocoders (STRAIGHT, Ahocoder, ...); a minimal analysis/resynthesis sketch follows
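As an illustration of the vocoder step at a 5 ms frame rate, here is a minimal analysis/resynthesis sketch assuming the pyworld package (a Python wrapper of the WORLD vocoder, used here only as a stand-in for STRAIGHT or Ahocoder); the file names are hypothetical.

```python
import numpy as np
import soundfile as sf   # assumed available for audio I/O
import pyworld           # assumed: WORLD vocoder, stand-in for STRAIGHT/Ahocoder

x, fs = sf.read("speech.wav")                    # hypothetical mono recording
x = np.ascontiguousarray(x, dtype=np.float64)

# Analysis at a 5 ms frame period (200 frames per second).
f0, t = pyworld.dio(x, fs, frame_period=5.0)     # raw F0 (excitation) track
f0 = pyworld.stonemask(x, f0, t, fs)             # F0 refinement
sp = pyworld.cheaptrick(x, f0, t, fs)            # spectral envelope
ap = pyworld.d4c(x, f0, t, fs)                   # aperiodicity

# Resynthesis from the frame-level parameters only ("vocoder effect").
y = pyworld.synthesize(f0, sp, ap, fs, frame_period=5.0)
sf.write("resynth.wav", y, fs)
```

Comparing resynth.wav with the original gives the quality ceiling of the parametric representation, i.e. the "vocoder effect" measured by the Ahocoded condition in the evaluation later.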

14

Regression


[Block diagram repeated: TXT → designed feature extraction → textual features (ft 1, ft 2, ft 3) → regression module → speech parameters (s1, s2, s3) → wavegen.]

15

Phoneme rate vs. frame rate

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
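A minimal sketch (my own construction, assuming 5 ms frames) of mapping phoneme-rate linguistic vectors plus durations to frame-rate model inputs, with the within-phone position appended as an extra feature:

```python
import numpy as np

FRAME_SHIFT = 0.005  # seconds (200 frames per second)

def upsample_to_frames(phone_feats, durations_s):
    """phone_feats: (num_phones, dim) linguistic vectors;
    durations_s: duration of each phone in seconds.
    Returns (num_frames, dim + 1): features repeated per frame,
    plus the relative position of each frame inside its phone."""
    rows = []
    for feats, dur in zip(phone_feats, durations_s):
        n = max(1, int(round(dur / FRAME_SHIFT)))
        for i in range(n):
            pos = (i + 0.5) / n                      # within-phone position in [0, 1]
            rows.append(np.concatenate([feats, [pos]]))
    return np.vstack(rows)

feats = np.eye(3, dtype=np.float32)                  # 3 toy phones, 3-dim features
frames = upsample_to_frames(feats, [0.050, 0.120, 0.080])
print(frames.shape)                                  # (50, 4): 10 + 24 + 16 frames
```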

16

Duration Modeling

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
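A minimal duration-model sketch in PyTorch (the framework and all layer sizes are my assumptions, not necessarily what is behind the slides): a small feed-forward network mapping one phoneme-level linguistic vector to one predicted duration.

```python
import torch
import torch.nn as nn

IN_DIM = 300   # assumed linguistic feature dimension

duration_model = nn.Sequential(
    nn.Linear(IN_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),          # predicted duration (in frames) per phoneme
)

x = torch.randn(32, IN_DIM)                 # a batch of 32 phoneme feature vectors
d = torch.rand(32, 1) * 20 + 5              # toy target durations (frames)
loss = nn.MSELoss()(duration_model(x), d)
loss.backward()                             # an optimizer step would follow in training
print(loss.item())
```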

17

Acoustic Modeling

18

Acoustic Modeling: DNN

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

19

Acoustic Modeling: DNN
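In the same spirit, a minimal frame-level acoustic DNN: frame-level linguistic features in, one vector of vocoder parameters out per 5 ms frame (e.g. spectral coefficients plus log-F0 and a voiced/unvoiced flag). PyTorch again; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

IN_DIM, OUT_DIM = 301, 62   # assumed: linguistic features + position -> 60 spectral + lf0 + v/uv

acoustic_dnn = nn.Sequential(
    nn.Linear(IN_DIM, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, OUT_DIM),            # one vector of vocoder parameters per 5 ms frame
)

frames = torch.randn(1000, IN_DIM)      # 1000 frames = 5 seconds of speech
targets = torch.randn(1000, OUT_DIM)    # toy vocoder parameters
loss = nn.MSELoss()(acoustic_dnn(frames), targets)
loss.backward()
```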

20

Regression using DNN (problem)

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

21

Mixture density network (MDN)

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

22

Mixture density network (MDN)

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
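The point of the MDN is that, instead of a single least-squares output (which collapses a multimodal target distribution to its conditional mean), the network outputs the weights, means and variances of a Gaussian mixture for each frame and is trained by negative log-likelihood. A minimal diagonal-covariance sketch in PyTorch, my own simplification of the idea:

```python
import math
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    """Predicts a K-component diagonal Gaussian mixture over out_dim acoustic features."""
    def __init__(self, hidden, out_dim, k):
        super().__init__()
        self.out_dim, self.k = out_dim, k
        self.logits = nn.Linear(hidden, k)               # mixture weights (pre-softmax)
        self.mu = nn.Linear(hidden, k * out_dim)         # component means
        self.log_sigma = nn.Linear(hidden, k * out_dim)  # component log standard deviations

    def nll(self, h, y):
        """Negative log-likelihood of targets y (batch, out_dim) given hidden h (batch, hidden)."""
        b = h.size(0)
        log_w = torch.log_softmax(self.logits(h), dim=-1)                # (b, k)
        mu = self.mu(h).view(b, self.k, self.out_dim)
        log_sigma = self.log_sigma(h).view(b, self.k, self.out_dim)
        y = y.unsqueeze(1)                                               # (b, 1, out_dim)
        # Diagonal Gaussian log-density per component, summed over feature dimensions.
        log_prob = (-0.5 * ((y - mu) / log_sigma.exp()) ** 2
                    - log_sigma - 0.5 * math.log(2 * math.pi)).sum(-1)   # (b, k)
        return -torch.logsumexp(log_w + log_prob, dim=-1).mean()

head = MDNHead(hidden=256, out_dim=62, k=4)
h = torch.randn(32, 256)       # frame-level hidden activations from the acoustic network
y = torch.randn(32, 62)        # target vocoder parameters for the same frames
loss = head.nll(h, y)
loss.backward()
```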

23

Recurrent Networks: LSTM

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
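Replacing the frame-wise feed-forward network with a recurrent one lets the model exploit temporal context instead of treating frames independently. A minimal PyTorch LSTM acoustic model (sizes assumed):

```python
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    def __init__(self, in_dim=301, hidden=256, out_dim=62):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)   # vocoder parameters per frame

    def forward(self, x):        # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)      # hidden state at every frame
        return self.proj(h)      # (batch, frames, out_dim)

model = LSTMAcousticModel()
utterance = torch.randn(1, 600, 301)     # one 3-second utterance at 5 ms frames
params = model(utterance)
print(params.shape)                      # torch.Size([1, 600, 62])
```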

24

Recurrent Networks: LSTM

Boxplot of mean opinion score

SPSS: models built using HTS
US: Ogmios unit-selection system
LSTM: duration and acoustic LSTM models
LSTM-pf: post-filtered LSTM output
Ahocoded: analysis/synthesis only (vocoder effect)
Natural: human recording

Source [MSc Pascual]

25

Multi-speaker

26

Multi-speaker

Boxplot of subjective preference: -2 = multi-output system preferred, +2 = speaker-dependent system preferred.
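One common multi-speaker arrangement, and my reading of the "multi-output" system here, shares the hidden layers across speakers and gives each speaker its own output layer; a minimal PyTorch sketch with assumed sizes:

```python
import torch
import torch.nn as nn

class MultiSpeakerAcousticModel(nn.Module):
    """Shared hidden layers, one output branch per speaker (multi-output setup)."""
    def __init__(self, in_dim=301, hidden=256, out_dim=62, num_speakers=4):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, out_dim) for _ in range(num_speakers)]
        )

    def forward(self, x, speaker_id):
        return self.heads[speaker_id](self.shared(x))

model = MultiSpeakerAcousticModel()
frames = torch.randn(200, 301)           # one second of frame-level linguistic features
out = model(frames, speaker_id=2)        # vocoder parameters in speaker 2's voice
print(out.shape)                         # torch.Size([200, 62])
```

Adapting to a new speaker can then amount to adding a fresh output head and training it (and possibly fine-tuning the shared layers) on the new speaker's data.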

27

Multi-speaker

28

Adaptation to new speaker

Conclusions

30

● Quality of DL-based SPSS is much better than conventional SPSS

● Concatenative/hybrid systems also benefit from deep learning (e.g. Apple Siri)

● Some research on adding richer linguistic features to improve expressivity (e.g. sentiment-analysis features)

● A vocoder is still used in most systems, which degrades quality … (but more on WaveNet tomorrow)

References

31

Statistical parametric speech synthesis: from HMM to LSTM-RNN. Heiga Zen, Google. rtthss2015.talp.cat/ (slides, video, and references)

Deep learning applied to Speech Synthesis. Santiago Pascual, MSc Thesis, UPC, 2016. veu.talp.cat/doc/MSC_Santiago_Pascual.pdf [eusipco-2016] [ssw9]