Parametric Speech Synthesis (D3L5 Deep Learning for Speech and Language UPC 2017)
Transcript of Parametric Speech Synthesis (D3L5 Deep Learning for Speech and Language UPC 2017)
[course site]
Day 3 Lecture 5
Parametric Speech Synthesis
Antonio Bonafonte
2
Main TTS Technologies
Concatenative speech synthesis + Unit Selection: concatenate the best pre-recorded speech units. Speech data: 2-10 hours, professional speaker, carefully segmented and annotated.
3
Concatenative
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
Statistical Speech Synthesis
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
5
Main TTS Technologies
Concatenative speech synthesis + Unit Selection: concatenate the best pre-recorded speech units.
Statistical Parametric Speech Synthesis: represent the speech waveform with parameters (e.g. every 5 ms), use a statistical generative model, and reconstruct the waveform from the generated parameters.
Hybrid Systems: concatenative speech synthesis that selects the best units according to a statistical parametric model.
6
Deep architectures … but not deep (yet)
Text to Speech: Textual features → Spectrum of speech (many coefficients)
[Pipeline figure] TXT → designed feature extraction → "hand-crafted" textual features (ft1, ft2, ft3) → regression module → "hand-crafted" speech parameters (s1, s2, s3) → wavegen
7
Textual features (x)
From text to phonemes (pronunciation): disambiguation, pronunciation (e.g.: "Jan. 26")
From phonemes to phonemes+ (enriched with linguistic features)
8
Textual features (x)
● {preceding, succeeding} two phonemes
● Position of current phoneme in current syllable
● # of phonemes in {preceding, current, succeeding} syllable
● {accent, stress} of {preceding, current, succeeding} syllable
● Position of current syllable in current word
● # of {preceding, succeeding} {stressed, accented} syllables in phrase
● # of syllables {from previous, to next} {stressed, accented} syllable
● Guess at part of speech of {preceding, current, succeeding} word
● # of syllables in {preceding, current, succeeding} word
● Position of current word in current phrase
● # of {preceding, succeeding} content words in current phrase
● # of words {from previous, to next} content word
● # of syllables in {preceding, current, succeeding} phrase
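The features above mix categorical values (phoneme identities, part of speech) with counts and positions. A minimal sketch of how such a slide-style feature set could be packed into one numeric input vector for the regression model (the phoneme inventory and the chosen subset of features here are illustrative assumptions, not the lecture's exact setup):

```python
# Hypothetical sketch: encode a few of the linguistic features listed above
# as one fixed-size numeric vector. Categorical features are one-hot encoded;
# counts and positions are kept as plain numbers.

PHONEMES = ["sil", "a", "e", "i", "o", "u", "p", "t", "k", "s", "n"]  # toy inventory

def one_hot(symbol, inventory):
    """One-hot encoding of a categorical feature."""
    vec = [0.0] * len(inventory)
    vec[inventory.index(symbol)] = 1.0
    return vec

def linguistic_vector(phoneme, prev_phoneme, pos_in_syllable,
                      n_syllables_in_word, stressed):
    """Concatenate one-hot (categorical) and numeric features into one vector."""
    return (one_hot(phoneme, PHONEMES)
            + one_hot(prev_phoneme, PHONEMES)
            + [float(pos_in_syllable), float(n_syllables_in_word),
               1.0 if stressed else 0.0])

x = linguistic_vector("a", "t", pos_in_syllable=2,
                      n_syllables_in_word=3, stressed=True)
print(len(x))  # 2 * 11 one-hot dims + 3 numeric dims = 25
```

In a real system this vector is built for every phoneme (or frame) and its dimensionality is much larger, since all the contextual features on the slide are included.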
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
Statistical Speech Synthesis
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
10
Speech features (y)
● Rate: ~5 ms (200 Hz)
● Spectral features (envelope)
● Excitation features (fundamental frequency, pitch)
● Representation that allows reconstruction: vocoders (Straight, Ahocoder, ...)
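The frame-rate arithmetic implied by the bullets above can be made concrete. A small sketch, with assumed numbers: a 5 ms shift yields 200 frames per second, and each frame might hold a few dozen spectral-envelope coefficients plus excitation features (the exact layout and dimensionality depend on the vocoder, e.g. Straight or Ahocoder; the 40-coefficient figure here is an assumption):

```python
# Sketch of vocoder parameterization at the slide's frame rate.
FRAME_SHIFT_MS = 5    # analysis rate from the slide: one frame every 5 ms
SPECTRAL_DIM = 40     # assumed envelope order; vocoder-dependent

def n_frames(duration_s):
    """Number of parameter frames produced for an utterance of given length."""
    return int(round(duration_s * 1000 / FRAME_SHIFT_MS))

def frame_layout():
    """Names of the per-frame features: spectral envelope + excitation."""
    return ([f"mcep_{i}" for i in range(SPECTRAL_DIM)]
            + ["log_f0", "voiced_flag"])

print(n_frames(2.0))        # a 2-second utterance -> 400 frames
print(len(frame_layout()))  # 42 features per frame
```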
14
Regression
[Pipeline figure, repeated] TXT → designed feature extraction (ft1, ft2, ft3) → regression module → speech parameters (s1, s2, s3) → wavegen
15
Phoneme rate vs. frame rate
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
16
Duration Modeling
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
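The phoneme-rate vs. frame-rate mismatch is resolved by the duration model: it predicts how many 5 ms frames each phoneme lasts, and the phoneme-level linguistic features are repeated for that many frames, typically with the frame's relative position inside the phoneme appended. A minimal sketch (function and variable names are assumptions):

```python
# Sketch of the phoneme-rate -> frame-rate mapping used in SPSS pipelines.

def upsample_to_frames(phoneme_feats, durations_in_frames):
    """Repeat each phoneme's feature vector for its predicted duration,
    appending the fractional position of the frame inside the phoneme."""
    frames = []
    for feats, dur in zip(phoneme_feats, durations_in_frames):
        for i in range(dur):
            frames.append(feats + [(i + 0.5) / dur])  # within-phoneme position
    return frames

phoneme_feats = [[1.0, 0.0], [0.0, 1.0]]  # two toy phoneme-level vectors
durations = [3, 2]                         # predicted frames per phoneme
frames = upsample_to_frames(phoneme_feats, durations)
print(len(frames))  # 3 + 2 = 5 frame-level vectors
```

The acoustic model then works entirely at frame rate, one input/output pair per 5 ms frame.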
17
Acoustic Modeling
18
Acoustic Modeling: DNN
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
19
Acoustic Modeling: DNN
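The DNN acoustic model is a feed-forward network mapping each frame's linguistic feature vector x to its acoustic parameter vector y, trained with an MSE loss against the vocoder parameters of recorded speech. A minimal forward-pass sketch (layer sizes and weight initialization are illustrative assumptions, not the lecture's configuration; training is omitted):

```python
import math
import random

random.seed(0)

def layer(n_in, n_out):
    """Random small weights and zero biases for one fully connected layer."""
    w = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def forward(x, w, b, act):
    """One dense layer: act(W x + b)."""
    return [act(sum(wij * xj for wij, xj in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

IN_DIM, HID, OUT_DIM = 25, 16, 42   # assumed sizes: linguistic -> acoustic
W1, b1 = layer(IN_DIM, HID)
W2, b2 = layer(HID, OUT_DIM)

x = [0.0] * IN_DIM                   # toy frame-level linguistic features
x[0] = 1.0
h = forward(x, W1, b1, math.tanh)    # hidden layer, tanh nonlinearity
y = forward(h, W2, b2, lambda v: v)  # linear output = acoustic parameters
print(len(y))  # 42
```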
20
Regression using DNN (problem)
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
21
Mixture density network (MDN)
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
22
Mixture density network (MDN)
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
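The MDN addresses the problem of the previous slide: MSE regression predicts the conditional mean, which averages over multimodal targets. An MDN instead outputs, per frame, the parameters of a Gaussian mixture (weights, means, standard deviations) and is trained to minimize the negative log-likelihood of the observed value. A one-dimensional sketch of that loss (toy numbers, not from the lecture):

```python
import math

def mdn_nll(y, pis, mus, sigmas):
    """Negative log-likelihood of scalar y under a 1-D Gaussian mixture."""
    likelihood = sum(
        pi * math.exp(-0.5 * ((y - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for pi, mu, s in zip(pis, mus, sigmas))
    return -math.log(likelihood)

# Bimodal target: a 2-component mixture puts mass on both modes,
# whereas a single-mean (MSE) model would predict the useless average 0.0.
pis, mus, sigmas = [0.5, 0.5], [-1.0, 1.0], [0.3, 0.3]
print(mdn_nll(1.0, pis, mus, sigmas) < mdn_nll(0.0, pis, mus, sigmas))  # True
```

At synthesis time one can take the mean of the most probable component (or sample), so the generated parameters sit on a mode rather than between modes.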
23
Recurrent Networks: LSTM
H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
24
Recurrent Networks: LSTM
Boxplot of mean opinion score
SPSS: models built using HTS
US: Ogmios unit selection system
LSTM: duration and acoustic LSTM models
LSTM-pf: post-filtered
Ahocoded: analysis/synthesis (vocoder effect)
Natural: human recording
Source [MSc Pascual]
25
Multi-speaker
26
Multi-speaker
Boxplot of subjective preference:
-2: multi-output system preferred
+2: speaker-dependent system preferred
27
Multi-speaker
28
Adaptation to new speaker
29
Speaker interpolation
Interpolation: 0 1
Interpolation: 0.25 0.75
Interpolation: 0.5 0.5
Interpolation: 0.75 0.25
Interpolation: 1 0
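The interpolation weights listed above can be read as a linear blend between two voices: the acoustic parameters generated for the two speakers (or their speaker embeddings, in a multi-speaker model) are mixed with weights summing to 1. A minimal sketch with toy vectors:

```python
# Sketch of speaker interpolation: blend two speakers' parameter vectors
# with weights (w_a, 1 - w_a), matching the 0/0.25/0.5/0.75/1 steps above.

def interpolate(params_a, params_b, w_a):
    """Linear interpolation between two speakers' parameter vectors."""
    return [w_a * a + (1 - w_a) * b for a, b in zip(params_a, params_b)]

speaker_a = [1.0, 2.0, 3.0]  # toy acoustic/embedding vector, speaker A
speaker_b = [3.0, 0.0, 1.0]  # speaker B
print(interpolate(speaker_a, speaker_b, 0.5))  # [2.0, 1.0, 2.0]
print(interpolate(speaker_a, speaker_b, 1.0))  # speaker A exactly
```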
Conclusions
30
● Quality of DL-based SPSS is much better than conventional SPSS
● Hybrid concatenative systems also benefit from deep learning (e.g. Apple Siri)
● Some research on adding richer linguistic features for expressivity (e.g. sentiment-analysis features)
● A vocoder is still used in most systems, which degrades quality … (but more on WaveNet tomorrow)
References
31
Heiga Zen (Google), Statistical parametric speech synthesis: from HMM to LSTM-RNN. RTTHSS 2015, rtthss2015.talp.cat/ (slides, video, and references)
Santiago Pascual, Deep learning applied to Speech Synthesis. MSc Thesis, UPC, 2016. veu.talp.cat/doc/MSC_Santiago_Pascual.pdf [eusipco-2016] [ssw9]