Parametric Speech Synthesis (D3L5 Deep Learning for Speech and Language UPC 2017)


[course site]

Day 3 Lecture 5

Parametric Speech Synthesis
Antonio Bonafonte

2

Main TTS Technologies

Concatenative speech synthesis + Unit Selection
Concatenate the best pre-recorded speech units.
Speech data: 2-10 hours, professional speaker, carefully segmented and annotated.

3

Concatenative

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

Statistical Speech Synthesis

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

5

Main TTS Technologies

Concatenative speech synthesis + Unit Selection
Concatenate the best pre-recorded speech units.

Statistical Parametric Speech Synthesis
Represent the speech waveform with parameters (e.g. every 5 ms).
Use a statistical generative model.
Reconstruct the waveform from the generated parameters.

Hybrid Systems
Concatenative speech synthesis.
Select the best units according to a statistical parametric system.

6

Deep architectures … but not deep (yet)


Text to Speech: Textual features → Spectrum of speech (many coefficients)

[Block diagram: TXT → designed feature extraction → "hand-crafted" textual features (ft 1, ft 2, ft 3) → regression module → "hand-crafted" speech parameters (s1, s2, s3) → wavegen.]

7

Textual features (x)

From text to phonemes (pronunciation): disambiguation and pronunciation (e.g. "Jan. 26", sketched below)

From phoneme to phoneme+ (with linguistic features)
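As a toy illustration of the disambiguation step above ("Jan. 26"), the code below expands an abbreviation and reads a number as an ordinal. It assumes the third-party num2words package and a hand-made abbreviation table; neither is part of the lecture's actual front end.

```python
# Minimal text-normalization sketch (assumes the num2words package is installed).
from num2words import num2words

ABBREVIATIONS = {"Jan.": "January", "Feb.": "February"}  # toy abbreviation table

def normalize(text):
    """Expand abbreviations and read digits as ordinals (date-like context)."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif token.isdigit():
            out.append(num2words(int(token), to="ordinal"))
        else:
            out.append(token)
    return " ".join(out)

print(normalize("Jan. 26"))  # -> "January twenty-sixth"
```

A real front end also needs pronunciation lexica and grapheme-to-phoneme rules before the linguistic features listed on the next slide can be computed.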

8

Textual features (x)
● {preceding, succeeding} two phonemes
● Position of current phoneme in current syllable
● # of phonemes in {preceding, current, succeeding} syllable
● {accent, stress} of {preceding, current, succeeding} syllable
● Position of current syllable in current word
● # of {preceding, succeeding} {stressed, accented} syllables in phrase
● # of syllables {from previous, to next} {stressed, accented} syllable
● Guess at part of speech of {preceding, current, succeeding} word
● # of syllables in {preceding, current, succeeding} word
● Position of current word in current phrase
● # of {preceding, succeeding} content words in current phrase
● # of words {from previous, to next} content word
● # of syllables in {preceding, current, succeeding} phrase

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
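The features listed above are categorical or count-valued and are flattened into one numeric vector per phoneme before regression. Below is a minimal encoding sketch with a hypothetical, much-reduced phoneme inventory and an arbitrary set of count features, not the exact feature set of the slides.

```python
import numpy as np

# Hypothetical reduced phoneme inventory; a real system uses the full language-dependent set.
PHONEMES = ["sil", "a", "e", "i", "o", "u", "p", "t", "k", "s", "n"]

def one_hot(symbol, inventory):
    v = np.zeros(len(inventory), dtype=np.float32)
    v[inventory.index(symbol)] = 1.0
    return v

def linguistic_vector(context, counts):
    """context: 5 phonemes (2 preceding, current, 2 succeeding);
    counts: numeric positional/count/stress features for this phoneme."""
    parts = [one_hot(p, PHONEMES) for p in context]
    parts.append(np.asarray(counts, dtype=np.float32))
    return np.concatenate(parts)

x = linguistic_vector(
    context=["sil", "o", "a", "s", "i"],   # quinphone context
    counts=[1, 3, 0, 1, 2, 7],             # e.g. positions, syllable counts, stress flags
)
print(x.shape)   # (5 * 11 + 6,) -> (61,)
```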

Statistical Speech Synthesis

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

10

Speech features (y)


● Rate: ~5 ms (200 Hz frame rate)

● Spectral features (envelope)

● Excitation features (fundamental frequency, pitch)

● Representation that allows reconstruction: vocoders (STRAIGHT, Ahocoder, ...); a minimal analysis/resynthesis sketch follows
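As an illustration of the vocoder step at a 5 ms frame rate, here is a minimal analysis/resynthesis sketch assuming the pyworld package (a Python wrapper of the WORLD vocoder, used here only as a stand-in for STRAIGHT or Ahocoder); the file names are hypothetical.

```python
import numpy as np
import soundfile as sf   # assumed available for audio I/O
import pyworld           # assumed: WORLD vocoder, stand-in for STRAIGHT/Ahocoder

x, fs = sf.read("speech.wav")                    # hypothetical mono recording
x = np.ascontiguousarray(x, dtype=np.float64)

# Analysis at a 5 ms frame period (200 frames per second).
f0, t = pyworld.dio(x, fs, frame_period=5.0)     # raw F0 (excitation) track
f0 = pyworld.stonemask(x, f0, t, fs)             # F0 refinement
sp = pyworld.cheaptrick(x, f0, t, fs)            # spectral envelope
ap = pyworld.d4c(x, f0, t, fs)                   # aperiodicity

# Resynthesis from the frame-level parameters only ("vocoder effect").
y = pyworld.synthesize(f0, sp, ap, fs, frame_period=5.0)
sf.write("resynth.wav", y, fs)
```

Comparing resynth.wav with the original gives the quality ceiling of the parametric representation, i.e. the "vocoder effect" measured by the Ahocoded condition in the evaluation later.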

14

Regression


[Block diagram repeated: TXT → designed feature extraction → textual features (ft 1, ft 2, ft 3) → regression module → speech parameters (s1, s2, s3) → wavegen.]

15

Phoneme rate vs. frame rate

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
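A minimal sketch (my own construction, assuming 5 ms frames) of mapping phoneme-rate linguistic vectors plus durations to frame-rate model inputs, with the within-phone position appended as an extra feature:

```python
import numpy as np

FRAME_SHIFT = 0.005  # seconds (200 frames per second)

def upsample_to_frames(phone_feats, durations_s):
    """phone_feats: (num_phones, dim) linguistic vectors;
    durations_s: duration of each phone in seconds.
    Returns (num_frames, dim + 1): features repeated per frame,
    plus the relative position of each frame inside its phone."""
    rows = []
    for feats, dur in zip(phone_feats, durations_s):
        n = max(1, int(round(dur / FRAME_SHIFT)))
        for i in range(n):
            pos = (i + 0.5) / n                      # within-phone position in [0, 1]
            rows.append(np.concatenate([feats, [pos]]))
    return np.vstack(rows)

feats = np.eye(3, dtype=np.float32)                  # 3 toy phones, 3-dim features
frames = upsample_to_frames(feats, [0.050, 0.120, 0.080])
print(frames.shape)                                  # (50, 4): 10 + 24 + 16 frames
```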

16

Duration Modeling

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
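A minimal duration-model sketch in PyTorch (the framework and all layer sizes are my assumptions, not necessarily what is behind the slides): a small feed-forward network mapping one phoneme-level linguistic vector to one predicted duration.

```python
import torch
import torch.nn as nn

IN_DIM = 300   # assumed linguistic feature dimension

duration_model = nn.Sequential(
    nn.Linear(IN_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),          # predicted duration (in frames) per phoneme
)

x = torch.randn(32, IN_DIM)                 # a batch of 32 phoneme feature vectors
d = torch.rand(32, 1) * 20 + 5              # toy target durations (frames)
loss = nn.MSELoss()(duration_model(x), d)
loss.backward()                             # an optimizer step would follow in training
print(loss.item())
```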

17

Acoustic Modeling

18

Acoustic Modeling: DNN

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

19

Acoustic Modeling: DNN
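In the same spirit, a minimal frame-level acoustic DNN: frame-level linguistic features in, one vector of vocoder parameters out per 5 ms frame (e.g. spectral coefficients plus log-F0 and a voiced/unvoiced flag). PyTorch again; all dimensions are assumptions.

```python
import torch
import torch.nn as nn

IN_DIM, OUT_DIM = 301, 62   # assumed: linguistic features + position -> 60 spectral + lf0 + v/uv

acoustic_dnn = nn.Sequential(
    nn.Linear(IN_DIM, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, 512), nn.Tanh(),
    nn.Linear(512, OUT_DIM),            # one vector of vocoder parameters per 5 ms frame
)

frames = torch.randn(1000, IN_DIM)      # 1000 frames = 5 seconds of speech
targets = torch.randn(1000, OUT_DIM)    # toy vocoder parameters
loss = nn.MSELoss()(acoustic_dnn(frames), targets)
loss.backward()
```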

20

Regression using DNN (problem)

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

21

Mixture density network (MDN)

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/

22

Mixture density network (MDN)

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
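The point of the MDN is that, instead of a single least-squares output (which collapses a multimodal target distribution to its conditional mean), the network outputs the weights, means and variances of a Gaussian mixture for each frame and is trained by negative log-likelihood. A minimal diagonal-covariance sketch in PyTorch, my own simplification of the idea:

```python
import math
import torch
import torch.nn as nn

class MDNHead(nn.Module):
    """Predicts a K-component diagonal Gaussian mixture over out_dim acoustic features."""
    def __init__(self, hidden, out_dim, k):
        super().__init__()
        self.out_dim, self.k = out_dim, k
        self.logits = nn.Linear(hidden, k)               # mixture weights (pre-softmax)
        self.mu = nn.Linear(hidden, k * out_dim)         # component means
        self.log_sigma = nn.Linear(hidden, k * out_dim)  # component log standard deviations

    def nll(self, h, y):
        """Negative log-likelihood of targets y (batch, out_dim) given hidden h (batch, hidden)."""
        b = h.size(0)
        log_w = torch.log_softmax(self.logits(h), dim=-1)                # (b, k)
        mu = self.mu(h).view(b, self.k, self.out_dim)
        log_sigma = self.log_sigma(h).view(b, self.k, self.out_dim)
        y = y.unsqueeze(1)                                               # (b, 1, out_dim)
        # Diagonal Gaussian log-density per component, summed over feature dimensions.
        log_prob = (-0.5 * ((y - mu) / log_sigma.exp()) ** 2
                    - log_sigma - 0.5 * math.log(2 * math.pi)).sum(-1)   # (b, k)
        return -torch.logsumexp(log_w + log_prob, dim=-1).mean()

head = MDNHead(hidden=256, out_dim=62, k=4)
h = torch.randn(32, 256)       # frame-level hidden activations from the acoustic network
y = torch.randn(32, 62)        # target vocoder parameters for the same frames
loss = head.nll(h, y)
loss.backward()
```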

23

Recurrent Networks: LSTM

H. Zen - RTTHSS 2015, http://rtthss2015.talp.cat/
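Replacing the frame-wise feed-forward network with a recurrent one lets the model exploit temporal context instead of treating frames independently. A minimal PyTorch LSTM acoustic model (sizes assumed):

```python
import torch
import torch.nn as nn

class LSTMAcousticModel(nn.Module):
    def __init__(self, in_dim=301, hidden=256, out_dim=62):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, out_dim)   # vocoder parameters per frame

    def forward(self, x):        # x: (batch, frames, in_dim)
        h, _ = self.lstm(x)      # hidden state at every frame
        return self.proj(h)      # (batch, frames, out_dim)

model = LSTMAcousticModel()
utterance = torch.randn(1, 600, 301)     # one 3-second utterance at 5 ms frames
params = model(utterance)
print(params.shape)                      # torch.Size([1, 600, 62])
```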

24

Recurrent Networks: LSTM

Boxplot of mean opinion score

SPSS: models built using HTS
US: Ogmios unit-selection system
LSTM: duration and acoustic LSTM models
LSTM-pf: post-filtered LSTM output
Ahocoded: analysis/synthesis only (vocoder effect)
Natural: human recording

Source [MSc Pascual]

25

Multi-speaker

26

Multi-speaker

Boxplot of subjective preference: -2 = multi-output system preferred, +2 = speaker-dependent system preferred.
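One common multi-speaker arrangement, and my reading of the "multi-output" system here, shares the hidden layers across speakers and gives each speaker its own output layer; a minimal PyTorch sketch with assumed sizes:

```python
import torch
import torch.nn as nn

class MultiSpeakerAcousticModel(nn.Module):
    """Shared hidden layers, one output branch per speaker (multi-output setup)."""
    def __init__(self, in_dim=301, hidden=256, out_dim=62, num_speakers=4):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, out_dim) for _ in range(num_speakers)]
        )

    def forward(self, x, speaker_id):
        return self.heads[speaker_id](self.shared(x))

model = MultiSpeakerAcousticModel()
frames = torch.randn(200, 301)           # one second of frame-level linguistic features
out = model(frames, speaker_id=2)        # vocoder parameters in speaker 2's voice
print(out.shape)                         # torch.Size([200, 62])
```

Adapting to a new speaker can then amount to adding a fresh output head and training it (and possibly fine-tuning the shared layers) on the new speaker's data.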

27

Multi-speaker

28

Adaptation to new speaker

Conclusions

30

● Quality of DL-based SPSS is much better than conventional SPSS

● Concatenative/hybrid systems also benefit from deep learning (e.g. Apple Siri)

● Some research on adding richer linguistic features to improve expressivity (e.g. sentiment-analysis features)

● A vocoder is still used in most systems, which degrades quality … (but more on WaveNet tomorrow)

References

31

Statistical parametric speech synthesis: from HMM to LSTM-RNN. Heiga Zen, Google. rtthss2015.talp.cat/ (slides, video, and references)

Deep learning applied to Speech Synthesis. Santiago Pascual, MSc Thesis, UPC, 2016. veu.talp.cat/doc/MSC_Santiago_Pascual.pdf [eusipco-2016] [ssw9]