The NAIST Text-to-Speech System for Blizzard Challenge 2015

2015©Shinnosuke TAKAMICHI 09/11/2015


Shinnosuke Takamichi,

Kazuhiro Kobayashi, Kou Tanaka,

Tomoki Toda, Satoshi Nakamura

(NAIST, Japan)


Blizzard Challenge 2015

Languages

– Bengali, Hindi, Malayalam, Marathi, Tamil, & Telugu

– + English

Provided data

– UTF-8-encoded text & 16 kHz-sampled speech waveform

– → We need to develop both natural language processing (front-end) and speech waveform generation (back-end).

2 tasks

– Mono-lingual task (IH1) … 6 Indian languages

– Multi-lingual task (IH2) … Indian languages + English


Overview of our TTS system

(Diagram: system overview)

– Training: provided database (text, speech) → text processing & speech processing → context labels & speech features → HSMM & MS database

– Synthesis: context labels of input text + HSMM & MS database → synthetic speech

Our system

– HMM-based TTS with 4 main modules

– No external data used in any module

New functions

– Parameter trajectory smoothing in the speech processing module

– Modulation Spectrum (MS) in the synthesis module

Text processing module

Pipeline: Text → Text analysis → Context generation → (Discrete) context labels

Text analysis

– Bengali, Hindi, Tamil, & Telugu … Festvox ver. 2.7 recipes [Black et al., 2001.]

– Marathi … Festvox recipe for Hindi

– Malayalam … Rule-based [Nair et al., 2013.]; stress is not extracted.

Context generation (same contexts for all languages)

– Phoneme, syllable, & stress

– Vowel/consonant, articulator, & U/V

– Position of phoneme, syllable, & word

– The number of phonemes, syllables, & words

Speech processing module

Speech is analyzed* along three extraction paths:

– Spectrum extraction → Trajectory smoothing → 61-dim. mel-cepstrum

– F0 extraction → U/V symbol; Continuous F0 extraction → Trajectory smoothing → Continuous F0

– Aperiodicity extraction → Band averaging → 5-band aperiodicity

*STRAIGHT [Kawahara et al.], WORLD [Morise et al.]

Motivation for parameter trajectory smoothing

Motivation

– Remove temporal fluctuations that are difficult to model with HMMs

Examples

– Fluctuating sequence vs. smooth sequence (plot shows mean ± variance)

Modulation spectrum analysis for parameter trajectory smoothing

Modulation spectrum [Takamichi et al., 2014 & 2015.]

– Power spectra of the temporal parameter sequence

– An extension of Global Variance (GV) [Toda et al., 2007.]
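The definition above can be sketched in a few lines. This is a minimal illustration, assuming the MS of one parameter dimension is simply the squared magnitude of the DFT of its frame-by-frame trajectory. The function name `modulation_spectrum`, the DFT length, and the 200 frames/s rate (a 5 ms frame shift) are assumptions for the example, not values from the slides. A plain DFT keeps the sketch dependency-free; a library FFT would normally be used.

```python
import cmath
import math


def modulation_spectrum(seq, n_fft=None):
    """Power spectrum of a temporal parameter sequence (one dimension).

    seq   : list of floats, one value per frame
    n_fft : DFT length (assumption: defaults to len(seq))
    Returns the power at modulation-frequency bins 0 .. n_fft // 2.
    """
    if n_fft is None:
        n_fft = len(seq)
    spectrum = []
    for k in range(n_fft // 2 + 1):
        acc = sum(v * cmath.exp(-2j * math.pi * k * t / n_fft)
                  for t, v in enumerate(seq))
        spectrum.append(abs(acc) ** 2)
    return spectrum


# Toy trajectory: a slow 4 Hz component plus a weaker, fast 80 Hz one.
frame_rate = 200.0  # frames per second (assumed 5 ms frame shift)
n_frames = 200
seq = [math.sin(2 * math.pi * 4 * t / frame_rate)
       + 0.3 * math.sin(2 * math.pi * 80 * t / frame_rate)
       for t in range(n_frames)]
ms = modulation_spectrum(seq)
bin_hz = frame_rate / n_frames  # modulation frequency covered by one bin
peak = max(range(len(ms)), key=ms.__getitem__)
print("strongest modulation frequency: %.1f Hz" % (peak * bin_hz))
```

In this toy case the slow component dominates the spectrum, while the fast component shows up as a smaller peak at a high modulation frequency.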

(Figure: a mel-cepstrum sequence is converted by FFT & power into its modulation spectrum, plotted against modulation frequency. Regions of the plot are annotated "Easy to model with HMMs", "Difficult to model with HMMs", and "Dominant in speech perception".)

Parameter trajectory smoothing (= high modulation freq. removal)

– A 50 Hz-cutoff LPF* is applied to the extracted parameters to remove high modulation frequencies.

*LPF: Low Pass Filter
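A minimal sketch of such smoothing for one parameter trajectory is given below. The slides give only the 50 Hz cutoff and, later in the talk, the fact that a Butterworth LPF is used. The filter order (2), the forward-backward (zero-phase) application, the 200 frames/s rate, and all function names are assumptions made for illustration.

```python
import math


def butterworth2_lowpass(cutoff_hz, rate_hz):
    """Coefficients (b, a) of a 2nd-order Butterworth low-pass biquad."""
    k = math.tan(math.pi * cutoff_hz / rate_hz)  # pre-warped cutoff
    q = 1.0 / math.sqrt(2.0)                     # Butterworth quality factor
    norm = 1.0 / (1.0 + k / q + k * k)
    b0 = k * k * norm
    b = (b0, 2.0 * b0, b0)
    a = (1.0, 2.0 * (k * k - 1.0) * norm, (1.0 - k / q + k * k) * norm)
    return b, a


def _filter_once(x, b, a):
    """Direct-form I filtering, started in steady state at x[0]."""
    x1 = x2 = y1 = y2 = x[0]
    out = []
    for v in x:
        y = b[0] * v + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, v
        y2, y1 = y1, y
        out.append(y)
    return out


def smooth_trajectory(seq, cutoff_hz=50.0, frame_rate_hz=200.0):
    """Zero-phase low-pass smoothing of one parameter trajectory.

    frame_rate_hz is an assumed value (5 ms frame shift); the cutoff must
    stay below frame_rate_hz / 2.
    """
    seq = list(seq)
    if not seq:
        return []
    b, a = butterworth2_lowpass(cutoff_hz, frame_rate_hz)
    forward = _filter_once(seq, b, a)
    backward = _filter_once(forward[::-1], b, a)  # second pass cancels the delay
    return backward[::-1]
```

Filtering forward and then backward keeps the smoothed trajectory time-aligned with the original frames, which matters when the smoothed parameters are later paired with context labels.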


Training module

Inputs: mel-cepstrum, cont. F0, U/V symbol, & aperiodicity → HSMM database & MS database

HSMM training

– ML training of context-dependent HSMM [Yoshimura et al., 1999.]

– MDL-based clustering [Shinoda et al., 2000.]

MS model training

– Mean-normalized MS [Takamichi et al., 2014.]

– ML training of Gaussian distribution
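For a Gaussian, ML training reduces to the sample mean and the (biased) sample variance. The sketch below assumes that each training utterance has already been turned into a fixed-length MS vector and that the covariance is diagonal. Both points, the function name, and the toy numbers are illustrative assumptions rather than details from the slides.

```python
def fit_diagonal_gaussian(vectors):
    """ML estimate (mean, variance) per dimension from equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    var = [sum((v[d] - mean[d]) ** 2 for v in vectors) / n  # ML: divide by n
           for d in range(dim)]
    return mean, var


# Toy MS vectors from three utterances (values are made up for illustration).
ms_vectors = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]
mean, var = fit_diagonal_gaussian(ms_vectors)
print("mean:", mean)
print("variance:", var)
```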


Synthesis module

Input: context labels of input text; models: HSMM database & MS database

– Spectrum generation w/ MS

– Cont. F0 generation

– Aperiodicity generation

– U/V symbol generation

– MS-based post-filter

– Smoothing in silence*

– MLSA filter → synthetic speech

* For reducing unnatural power in silence

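Speech parameter generation algorithm considering MS

(Figure legend: w/ ~50 Hz MS vs. w/o MS)

$$\hat{\boldsymbol{y}} = \operatorname*{arg\,max}_{\boldsymbol{y}} \; \mathrm{Prob}_{\mathrm{HMM}}(\boldsymbol{W}\boldsymbol{y}) \, \mathrm{Prob}_{\mathrm{MS}}(\boldsymbol{s}(\boldsymbol{y}))^{\omega}$$

$\boldsymbol{y}$: speech parameters, $\boldsymbol{W}$: delta window, $\boldsymbol{s}(\boldsymbol{y})$: MS of $\boldsymbol{y}$, $\omega$: weight of the MS term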
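Taking logarithms turns the criterion above into a weighted sum, which is easier to optimize. The following is a sketch only: it reads $\omega$ as an exponent on the MS term, and it assumes plain gradient ascent with a step size $\alpha$. Neither the update rule nor $\alpha$ is stated on the slide.

```latex
\begin{aligned}
L(\boldsymbol{y}) &= \log \mathrm{Prob}_{\mathrm{HMM}}(\boldsymbol{W}\boldsymbol{y})
  + \omega \log \mathrm{Prob}_{\mathrm{MS}}(\boldsymbol{s}(\boldsymbol{y})) \\
\boldsymbol{y}^{(i+1)} &= \boldsymbol{y}^{(i)}
  + \alpha \, \nabla_{\boldsymbol{y}} L(\boldsymbol{y}) \big|_{\boldsymbol{y} = \boldsymbol{y}^{(i)}}
\end{aligned}
```

Because $\boldsymbol{s}(\boldsymbol{y})$ is a power spectrum and therefore nonlinear in $\boldsymbol{y}$, the maximizer has to be found iteratively.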


Speech samples

Audio samples (w/o MS vs. w/ MS) for each language:

– Bengali, Hindi, Malayalam, Marathi, Tamil, & Telugu

EXPERIMENTAL RESULTS


Evaluation of synthesizer

Evaluation

– Naturalness: 5-point MOS score

– Intelligibility: WER of listening tests

– Similarity: 5-point DMOS score
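For reference, WER is conventionally the word-level edit distance between the reference sentence and the listener's transcription, divided by the number of reference words. The sketch below implements that standard definition. How the challenge organizers actually scored the listening-test transcriptions is not described in the slides, so the function name and the whitespace tokenization are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```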

Results shown in this talk

– Naturalness: mean MOS for the RD task in Marathi

– Intelligibility: mean WER in Marathi

– Similarity: mean DMOS for the RD task in Marathi

– + our rank on each of these scores in all languages


5-point MOS score on naturalness

Our place for RD task (our place / #-of-systems)

– Bengali … 6 / 10

– Marathi … 2 / 9

– Hindi … 4 / 10

– Tamil … 6 / 10

– Malayalam … 2 / 10

– Telugu … 2 / 10

(Plot: results in Marathi)

WER for intelligibility

Our place (our place / #-of-systems)

– Bengali … 5 / 10

– Marathi … 1 / 9

– Hindi … 7 / 10

– Tamil … 8 / 10

– Malayalam … 4 / 10

– Telugu … 4 / 10

(Plot: results in Marathi)

5-point DMOS score on similarity

Our place for RD task (our place / #-of-systems)

– Bengali … 5 / 10

– Marathi … 5 / 9

– Hindi … 4 / 10

– Tamil … 5 / 10

– Malayalam … 8 / 10

– Telugu … 5 / 10

(Plot: results in Marathi)

Strengths and weaknesses

Good!

– Naturalness of synthetic speech

– Intelligibility of synthetic speech (in Marathi)

– Small footprint (10 ~ 20 MB)

– Fast training (~ 10 hours for 1 system)

Weak…

– Similarity of synthetic speech

– Slow synthesis (3 minutes for 1 sentence)

• Because generation considering MS needs iteration.

• Possible remedies: early stopping of the iteration; parallelization of the generation algorithm

Is our system open source?

Text processing

– Text analyzer … Yes (Festvox) except Malayalam

– Context generator … Yes (my GitHub*1)

Speech processing

– Speech analyzer … Yes (STRAIGHT & WORLD)

– Spectral smoothing … No, but it uses only a Butterworth LPF.

Training

– HSMM & MS model training … Yes (HTS & SPTK)

Synthesis

– Generation w/ MS … No, but the post-filter is available (HTS).

*1: search “shinnosuke takamichi”

Conclusion

Our challenge

– Mono-lingual task (IH1) for Indian languages

Our TTS synthesizer

– HMM-based TTS with 4 main modules

– Parameter trajectory smoothing in the speech processing module

– Modulation spectrum in the synthesis module

Future work

– Combine with statistical sample-based method [Takamichi et al., 2014.]
