The NAIST Text-to-Speech System for Blizzard Challenge 2015
2015©Shinnosuke TAKAMICHI 09/11/2015
The NAIST Text-to-Speech System
for Blizzard Challenge 2015
Shinnosuke Takamichi,
Kazuhiro Kobayashi, Kou Tanaka,
Tomoki Toda, Satoshi Nakamura
(NAIST, Japan)
Blizzard Challenge 2015
Blizzard Challenge 2015
Languages
– Bengali, Hindi, Malayalam, Marathi, Tamil, & Telugu
– + English
Provided data
– UTF-8-encoded text & 16 kHz-sampled speech waveform
– → We need to develop natural language processing (front-end) and
speech waveform generation (back-end).
2 tasks
– Mono-lingual task (IH1) … 6 Indian languages
– Multi-lingual task (IH2) … Indian languages + English
Overview of our TTS system
[Figure: system overview. Training: the provided database (text, speech) goes through text processing and speech processing, giving context labels and speech features for the HSMM & MS database. Synthesis: context labels of input text and the HSMM & MS database give synthetic speech.]
Our system
– HMM-based TTS with 4 main modules
– No external data used in any module
New functions
– Parameter trajectory smoothing in the speech processing module
– Modulation Spectrum (MS) in the synthesis module
Text processing module
Text → Text analysis → Context generation → (Discrete) context labels
Text analysis
– Bengali, Hindi, Tamil, & Telugu … Festvox ver. 2.7 recipes [Black et al., 2001.]
– Marathi … Festvox recipe for Hindi
– Malayalam … Rule [Nair et al., 2013.]; stress is not extracted.
Context generation (same contexts for all languages)
– Phoneme, syllable, & stress
– Vowel/consonant, articulator, & U/V
– Position of phoneme, syllable, & word
– The number of phonemes, syllables, & words
Speech processing module
[Figure: speech processing flow, starting from speech]
– Spectrum extraction → Trajectory smoothing → 61-dim. mel-cepstrum
– F0 extraction → U/V symbol
– Continuous F0 extraction → Trajectory smoothing → Continuous F0
– Aperiodicity extraction → Band averaging → 5-band aperiodicity
*STRAIGHT [Kawahara et al.], WORLD [Morise et al.]
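The continuous F0 / U/V split and the band averaging can be illustrated with a minimal sketch. The slides do not say how the gaps are filled or where the band edges lie, so the linear interpolation of log-F0, the equal-width bands, and the function names `continuous_f0` and `band_average` are all assumptions made for illustration.

```python
import numpy as np

def continuous_f0(f0):
    """Split a raw F0 track (0 = unvoiced) into a U/V symbol and a continuous F0."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0  # U/V symbol: True where the frame is voiced
    if not voiced.any():
        raise ValueError("no voiced frames to interpolate from")
    frames = np.arange(len(f0))
    # Assumption: interpolate log-F0 linearly across unvoiced gaps.
    # np.interp holds the nearest voiced value at the utterance edges.
    log_f0 = np.interp(frames, frames[voiced], np.log(f0[voiced]))
    return voiced, np.exp(log_f0)

def band_average(aperiodicity, n_bands=5):
    """Average a (frames x bins) aperiodicity matrix into n_bands bands.

    Assumption: equal-width bands; the real band edges are not given in the slides.
    """
    ap = np.asarray(aperiodicity, dtype=float)
    bands = np.array_split(ap, n_bands, axis=1)
    return np.stack([b.mean(axis=1) for b in bands], axis=1)
```

With this split, the U/V decision and the F0 contour can be handled separately, which is why the smoothing step only has to deal with a gap-free sequence.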
Motivation of parameter
trajectory smoothing
Motivation
– Remove temporal fluctuations that are difficult to model with HMMs
Examples
– Fluctuating sequence vs. Smooth sequence
[Figure: fluctuating sequence vs. smooth sequence, shown as mean ± variance]
Modulation spectrum analysis
for parameter trajectory smoothing
Modulation spectrum [Takamichi et al., 2014 & 2015.]
– Power spectra of the temporal parameter sequence
– An extension of Global Variance (GV) [Toda et al., 2007.]
[Figure: mel-cep sequence → FFT & pow. → modulation spectrum plotted against modulation frequency. Annotations: "Easy to model with HMMs", "Difficult to model with HMMs", "Dominant in speech perception".]
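As a minimal sketch of the definition above (power spectrum of the temporal parameter sequence), the modulation spectrum can be computed per feature dimension with an FFT along the time axis. The 200 Hz frame rate (5 ms frame shift) and the function names are assumptions; the slides do not state the frame shift.

```python
import numpy as np

def modulation_spectrum(seq, n_fft=None):
    """Power spectrum along time of a (frames x dims) parameter sequence."""
    seq = np.asarray(seq, dtype=float)
    if n_fft is None:
        n_fft = len(seq)
    spec = np.fft.rfft(seq, n=n_fft, axis=0)  # FFT over the time axis
    return np.abs(spec) ** 2                  # power ("FFT & pow.")

def modulation_frequencies(n_fft, frame_rate_hz=200.0):
    """Modulation frequency (Hz) of each bin; frame_rate_hz is an assumed value."""
    return np.fft.rfftfreq(n_fft, d=1.0 / frame_rate_hz)
```

A trajectory that oscillates 10 times per second shows up as a peak at 10 Hz modulation frequency, independently of the acoustic frequency content of the speech itself.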
Parameter trajectory smoothing
(= High modulation freq. removal)
– Apply a 50 Hz-cutoff LPF to the extracted parameters to remove high modulation frequencies.
*LPF: Low Pass Filter
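A minimal sketch of this smoothing step, using SciPy's Butterworth design (the open-source slide later notes that the smoothing uses only a Butterworth LPF). The filter order, the zero-phase `filtfilt` application, and the 200 Hz frame rate are assumptions; only the 50 Hz cutoff comes from the slides.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def smooth_trajectory(seq, cutoff_hz=50.0, frame_rate_hz=200.0, order=4):
    """Low-pass filter a parameter trajectory along the time axis (axis 0).

    cutoff_hz follows the slides; frame_rate_hz and order are assumed values.
    """
    seq = np.asarray(seq, dtype=float)
    b, a = butter(order, cutoff_hz, btype="low", fs=frame_rate_hz)
    # Assumption: forward-backward filtering so the trajectory is not delayed.
    return filtfilt(b, a, seq, axis=0)
```

Applied to a mel-cepstrum matrix of shape (frames, 61), this filters every coefficient trajectory independently and returns an array of the same shape.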
Training module
[Figure: mel-cepstrum, cont. F0, U/V symbol, and aperiodicity feed training of the HSMM database and MS database]
HSMM training
– ML training of context-dependent HSMM [Yoshimura et al., 1999.]
– MDL-based clustering [Shinoda et al., 2000.]
MS model training
– Mean-normalized MS [Takamichi et al., 2014.]
– ML training of Gaussian distribution
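The MS model training can be sketched as follows. This is an illustration under assumptions, not the authors' recipe: "mean-normalized" is read as subtracting the utterance-level mean before the FFT, the MS is taken in the log-power domain, the Gaussian is diagonal, and utterances are zero-padded to a fixed FFT length. The ML estimate of a Gaussian itself is just the sample mean and the (biased) sample variance.

```python
import numpy as np

def mean_normalized_ms(seq, n_fft=512):
    """Log-power MS of a (frames x dims) sequence after removing its mean.

    Sequences longer than n_fft frames would be cropped by the FFT.
    """
    seq = np.asarray(seq, dtype=float)
    centered = seq - seq.mean(axis=0, keepdims=True)
    power = np.abs(np.fft.rfft(centered, n=n_fft, axis=0)) ** 2
    return np.log(power + 1e-10)  # small floor avoids log(0)

def train_ms_gaussian(utterances, n_fft=512):
    """ML estimate of a diagonal Gaussian over per-utterance MS features."""
    feats = np.stack([mean_normalized_ms(u, n_fft) for u in utterances])
    mean = feats.mean(axis=0)
    var = feats.var(axis=0)  # ML estimate: divides by N, not N - 1
    return mean, var
```

Each training utterance contributes one MS vector per feature dimension, so the model stays very small compared with the HSMMs.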
Synthesis module
[Figure: synthesis flow. Inputs: context labels of input text, HSMM database, MS database. Blocks: spectrum generation w/ MS, cont. F0 generation, aperiodicity generation, U/V symbol generation, MS-based post-filter, smoothing in silence*, MLSA filter. Output: synthetic speech.]
* For reducing unnatural power in silence
Speech parameter generation
algorithm considering MS
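[Figure: comparison of w/ ~50 Hz MS and w/o MS]
$\hat{\boldsymbol{y}} = \operatorname*{argmax}_{\boldsymbol{y}} \; \mathrm{Prob}_{\mathrm{HMM}}(\boldsymbol{W}\boldsymbol{y}) \cdot \mathrm{Prob}_{\mathrm{MS}}(\boldsymbol{s}(\boldsymbol{y}))^{\omega}$
$\boldsymbol{y}$: speech parameters, $\boldsymbol{W}$: delta window, $\boldsymbol{s}(\boldsymbol{y})$: MS of $\boldsymbol{y}$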
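To make the objective concrete, here is a small sketch that evaluates its logarithm for a 1-D trajectory: the HMM log-likelihood of the static-plus-delta features plus the MS log-likelihood scaled by the exponent. Everything beyond the formula is an assumption for illustration: diagonal Gaussians for both terms, a single delta window 0.5 * (y[t+1] - y[t-1]), a mean-normalized log-power MS, and the exponent treated as a plain weight `omega`. It only scores a candidate trajectory; the iterative maximization is not shown.

```python
import numpy as np

def delta_window_matrix(n_frames):
    """W that stacks static and delta features of a 1-D trajectory: shape (2T, T)."""
    static = np.eye(n_frames)
    delta = np.zeros((n_frames, n_frames))
    for t in range(n_frames):
        # Assumed delta window 0.5 * (y[t+1] - y[t-1]), clamped at the edges.
        delta[t, max(t - 1, 0)] -= 0.5
        delta[t, min(t + 1, n_frames - 1)] += 0.5
    return np.vstack([static, delta])

def log_gauss(x, mean, var):
    """Log density of a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def ms(y):
    """Assumed MS: log power spectrum of the mean-removed trajectory."""
    return np.log(np.abs(np.fft.rfft(y - y.mean())) ** 2 + 1e-10)

def objective(y, hmm_mean, hmm_var, ms_mean, ms_var, omega):
    """log Prob_HMM(W y) + omega * log Prob_MS(s(y))."""
    W = delta_window_matrix(len(y))
    return log_gauss(W @ y, hmm_mean, hmm_var) + omega * log_gauss(ms(y), ms_mean, ms_var)
```

Setting `omega` to 0 reduces the score to the HMM term alone, which corresponds to the "w/o MS" condition.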
Speech samples
Language w/o MS w/ MS
Bengali
Hindi
Malayalam
Marathi
Tamil
Telugu
EXPERIMENTAL RESULTS
Evaluation of synthesizer
Evaluation
– Naturalness: 5-point MOS score
– Intelligibility: WER of listening tests
– Similarity: 5-point DMOS score
Result shown in this talk
– Naturalness: mean of MOS score of RD task in Marathi
– Intelligibility: mean of WER in Marathi
– Similarity: mean of DMOS score of RD task in Marathi
– + rank of these scores in all languages
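For readers unfamiliar with the intelligibility measure, WER is commonly computed as the word-level edit distance between what the listener typed and the reference text, divided by the reference length. The sketch below shows that standard definition; how the challenge organisers normalised or scored the transcriptions is not stated in these slides, and `word_error_rate` is a name chosen for illustration.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

Lower is better, and the value can exceed 1.0 when the transcription contains many inserted words.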
5-point MOS score on naturalness
Our place for RD task (our place / #-of-systems)
– Bengali … 6 / 10
– Marathi … 2 / 9
– Hindi … 4 / 10
– Tamil … 6 / 10
– Malayalam … 2 / 10
– Telugu … 2 / 10
[Figure: results in Marathi]
WER for intelligibility
Our place (our place / #-of-systems)
– Bengali … 5 / 10
– Marathi … 1 / 9
– Hindi … 7 / 10
– Tamil … 8 / 10
– Malayalam … 4 / 10
– Telugu … 4 / 10
[Figure: results in Marathi]
5-point DMOS score on similarity
Our place for RD task (our place / #-of-systems)
– Bengali … 5 / 10
– Marathi … 5 / 9
– Hindi … 4 / 10
– Tamil … 5 / 10
– Malayalam … 8 / 10
– Telugu … 5 / 10
[Figure: results in Marathi]
Strengths and weaknesses
Good!
– Naturalness of synthetic speech
– Intelligibility of synthetic speech (in Marathi)
– Small footprint (10 ~ 20 MB)
– Fast training (~ 10 hours for 1 system)
Weak…
– Similarity of synthetic speech
– Slow synthesis (3 minutes for 1 sentence)
• Because generation considering MS needs iteration.
• Possible remedies: early stopping of the iteration; parallelization of the generation algorithm
Is our system open source?
Text processing
– Text analyzer … Yes (Festvox) except Malayalam
– Context generator … Yes (my GitHub*1)
Speech processing
– Speech analyzer … Yes (STRAIGHT & WORLD)
– Spectral smoothing … No, but it uses only Butterworth LPF.
Training
– HSMM & MS model training … Yes (HTS & SPTK)
Synthesis
– Generation w/ MS … No, but post-filter is available (HTS).
*1: search “shinnosuke takamichi”
Conclusion
Our challenge
– Mono-lingual task (IH1) for Indian languages
Our TTS synthesizer
– HMM-based TTS with 4 main modules
– Parameter trajectory smoothing in the speech processing module
– Modulation spectrum in the synthesis module
Future work
– Combine with statistical sample-based method [Takamichi et al., 2014.]