Design of Tree-based Context Clustering for an HMM-based Thai Speech Synthesis System

Outline

• Objectives
• Study of Thai tones
  • Characteristics of Thai tones
  • Categorizations of Thai tones
• Tree-based context clustering
  • Construction of contextual factors
  • Design of decision-tree structures
  • Design of context clustering styles
• Experiments
  • Evaluation of overall tone correctness
  • Evaluation of tone correctness for each tone type
  • Evaluation of syllable duration distortion
• Conclusions

Objectives

To implement an HMM-based speech synthesis system for the Thai language with the highest possible tone correctness.

Study of Thai tones

Characteristics of Thai tones

Thai is a tonal language.

Syllable structure [Nakasakul2002]: Ci(Ci)V(V)T(Cf), where Ci is an initial consonant, V a vowel, T the tone, and Cf a final consonant.

Syllables with a final consonant:

• CiVTCf: รัก r-a-k^-3 (love)
• CiVVTCf: เรื่อย r-va-j^-2 (always)
• CiCiVTCf: เคร่ง khr-e-ng^-2 (strict)
• CiCiVVTCf: เครียด khr-ia-t^-2 (stress)

Syllables without a final consonant:

• CiVT: และ l-x-3 (and)
• CiCiVVT: เพลีย phl-iia-0 (exhausted)
• CiVVT: เสีย s-iia-4 (spoil)
• CiCiVT: ปริ pr-i-1 (break)
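The transcriptions in the examples above can be split back into their syllable parts. The following is a minimal sketch, assuming the label format inferred from those examples only: fields separated by "-", the tone digit last, and a trailing "^" marking the final consonant. The `Syllable` class and `parse_syllable` function are hypothetical names, not part of the original system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Syllable:
    initial: str          # initial consonant or cluster, e.g. "khr"
    vowel: str            # vowel symbol, e.g. "iia"
    final: Optional[str]  # final consonant without the "^" mark, or None
    tone: int             # 0 middle, 1 low, 2 falling, 3 high, 4 rising


def parse_syllable(label: str) -> Syllable:
    """Parse a label such as 'r-a-k^-3' or 'l-x-3' (format assumed from the slide examples)."""
    parts = label.split("-")
    if len(parts) not in (3, 4):
        raise ValueError(f"unexpected label: {label!r}")
    tone = int(parts[-1])
    if not 0 <= tone <= 4:
        raise ValueError(f"tone out of range in {label!r}")
    final = None
    if len(parts) == 4:
        # Assumption: the final consonant is always written with a trailing "^".
        if not parts[2].endswith("^"):
            raise ValueError(f"final consonant not marked with '^' in {label!r}")
        final = parts[2][:-1]
    return Syllable(initial=parts[0], vowel=parts[1], final=final, tone=tone)
```

For instance, `parse_syllable("khr-e-ng^-2")` yields the initial cluster "khr", the vowel "e", the final consonant "ng", and tone 2 (falling).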

Characteristics of Thai tones

F0 contours of the standard Thai tones over normalized duration [Luksaneeyanawin1992]

[Figure: F0 (Hz, axis from 140 to 290) versus duration (0% to 100%) for the five tones: middle (0), low (1), falling (2), high (3), rising (4).]

Thai tone names: สามัญ Middle (0), เอก Low (1), โท Falling (2), ตรี High (3), จัตวา Rising (4)

Categorizations of Thai tones

Abramson divided the tones into two groups:

• static group: tones 0, 1, 3
• dynamic group: tones 2, 4

According to the final trend of the contours:

• upward trend group: tones 3, 4
• downward trend group: tones 0, 1, 2

[Figure: the same F0 contours as above, F0 (Hz) versus normalized duration, for middle (0), low (1), falling (2), high (3), and rising (4).]
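The two categorizations can be written down as plain lookup sets. This is only an illustrative sketch: the group memberships are the ones given on the slides, while the names `TONE_NAMES`, `STATIC`, `DYNAMIC`, `UPWARD`, `DOWNWARD`, and `describe_tone` are hypothetical.

```python
# Tone numbering used throughout the slides.
TONE_NAMES = {0: "middle", 1: "low", 2: "falling", 3: "high", 4: "rising"}

# Static versus dynamic tones.
STATIC = frozenset({0, 1, 3})
DYNAMIC = frozenset({2, 4})

# Grouping by the final trend of the F0 contour.
UPWARD = frozenset({3, 4})
DOWNWARD = frozenset({0, 1, 2})


def describe_tone(tone: int) -> tuple:
    """Return (name, constancy group, trend group) for a tone number 0..4."""
    if tone not in TONE_NAMES:
        raise ValueError(f"unknown tone: {tone}")
    constancy = "static" if tone in STATIC else "dynamic"
    trend = "upward" if tone in UPWARD else "downward"
    return TONE_NAMES[tone], constancy, trend
```

Each categorization partitions the five tones, so every tone receives exactly one label per scheme; for example, `describe_tone(2)` gives `("falling", "dynamic", "downward")`.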

HMM-based speech synthesizer

• Phoneme-based speech unit modeling
• Provides flexible models and efficient adaptation, such as speaker adaptation and speaking style conversion
• In 1994, K. Tokuda et al. proposed an HMM-based speech synthesizer for Japanese

[Block diagram of the system]

Training stage: speech database → speech signal → excitation parameter extraction and spectral parameter extraction → training of HMM (with labels) → context-dependent HMMs

Synthesis stage: text → text analysis → label → parameter generation from HMM → excitation generation and synthesis filter → synthetic speech

Construction of contextual factors

Context clustering addresses the problem of limited training data.

Phoneme level

• {preceding, current, succeeding} phonetic type
• {preceding, current, succeeding} part of syllable structure

Syllable level

• {preceding, current, succeeding} tone type
• the number of phones in {preceding, current, succeeding} syllable
• current phone position in current syllable

Word level

• current syllable position in current word
• part of speech
• the number of syllables in {preceding, current, succeeding} word

Phrase level

• current word position in current phrase
• the number of syllables in {preceding, current, succeeding} phrase

Utterance level

• current phrase position in current sentence
• the number of syllables in current sentence
• the number of words in current sentence
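To make the contextual factors more concrete, the sketch below holds a small subset of them (phonetic type and the syllable-level factors) in one record, serializes the record into a label string, and builds tone type questions as yes/no predicates of the kind a clustering tree can ask. The label layout, the field names, and the question names are all assumptions made for illustration; the slides do not specify the actual label format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class PhoneContext:
    prev_type: str            # phonetic type of the preceding phone (hypothetical codes)
    cur_type: str             # phonetic type of the current phone
    next_type: str            # phonetic type of the succeeding phone
    prev_tone: Optional[int]  # tone of the preceding syllable, None at a boundary
    tone: int                 # tone of the current syllable
    next_tone: Optional[int]  # tone of the succeeding syllable, None at a boundary
    phone_pos: int            # current phone position in current syllable
    phones_in_syllable: int   # number of phones in current syllable


def to_label(ctx: PhoneContext) -> str:
    """Serialize a context into a label string (layout is an assumption)."""
    def t(value: Optional[int]) -> str:
        return "x" if value is None else str(value)
    return (f"{ctx.prev_type}-{ctx.cur_type}+{ctx.next_type}"
            f"/T:{t(ctx.prev_tone)}_{ctx.tone}_{t(ctx.next_tone)}"
            f"/S:{ctx.phone_pos}_{ctx.phones_in_syllable}")


def tone_questions() -> Dict[str, Callable[[PhoneContext], bool]]:
    """Yes/no questions on preceding (L), current (C), and succeeding (R) tone type."""
    questions: Dict[str, Callable[[PhoneContext], bool]] = {}
    for tone in range(5):
        # Bind the loop value through a default argument so each lambda keeps its own tone.
        questions[f"L-Tone=={tone}"] = lambda c, t=tone: c.prev_tone == t
        questions[f"C-Tone=={tone}"] = lambda c, t=tone: c.tone == t
        questions[f"R-Tone=={tone}"] = lambda c, t=tone: c.next_tone == t
    return questions
```

With `ctx = PhoneContext("C", "V", "C", None, 3, 2, 2, 3)`, the question `tone_questions()["C-Tone==3"](ctx)` answers yes, and a clustering tree could use that answer to split the contexts at a node.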



Problem of misshaped F0 contour

[Figure: F0 contours, frequency (Hz, 150 to 250) versus time (5.0 to 6.4 s), of (a) speech synthesized with the single-binary-tree clustering style without tone type questions and (b) natural speech.]



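Design of decision-tree structures

• (a) Single binary tree: one tree shared by all tones
• (b) Simple tone-separated tree: Tone 0, Tone 1, Tone 2, Tone 3, Tone 4
• (c) Constancy-based tone-separated tree: static tones (Tone 0, 1, 3) and dynamic tones (Tone 2, 4)
• (d) Trend-based tone-separated tree: upward trend (Tone 3, 4) and downward trend (Tone 0, 1, 2)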


Design of 8 context clustering styles (a)-(h)

• (a)-(d): the four decision-tree structures above, clustered without tone type questions
• (e)-(h): the same structures with tone type questions added, so that (e) extends (a), (f) extends (b), (g) extends (c), and (h) extends (d)
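The eight styles are the combinations of four tree structures with and without tone type questions. The sketch below tabulates them and looks up which tree would model a syllable of a given tone. It assumes that (b), (c), and (d) correspond to the simple, constancy-based, and trend-based tone-separated structures in that order, and that each structure amounts to a partition of the five tones into separate trees; the names `TREE_STRUCTURES`, `Style`, `STYLES`, and `tree_index` are hypothetical.

```python
from typing import Dict, FrozenSet, List, NamedTuple

# Each structure is modeled as a partition of tones 0..4, one tree per group (assumption).
TREE_STRUCTURES: Dict[str, List[FrozenSet[int]]] = {
    "single": [frozenset(range(5))],
    "simple": [frozenset({t}) for t in range(5)],
    "constancy": [frozenset({0, 1, 3}), frozenset({2, 4})],  # static, dynamic
    "trend": [frozenset({3, 4}), frozenset({0, 1, 2})],      # upward, downward
}


class Style(NamedTuple):
    structure: str
    tone_questions: bool


STYLES: Dict[str, Style] = {
    "a": Style("single", False),
    "b": Style("simple", False),
    "c": Style("constancy", False),
    "d": Style("trend", False),
    "e": Style("single", True),
    "f": Style("simple", True),
    "g": Style("constancy", True),
    "h": Style("trend", True),
}


def tree_index(style_id: str, tone: int) -> int:
    """Index of the tree that models syllables carrying `tone` under a given style."""
    groups = TREE_STRUCTURES[STYLES[style_id].structure]
    for index, group in enumerate(groups):
        if tone in group:
            return index
    raise ValueError(f"tone {tone} is not covered by style {style_id!r}")
```

Under these assumptions, style (a) always uses tree 0, while style (b) sends each tone to its own tree, which leaves less training data per tree.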

System Preparations

1. Sentence structure analysis
2. Word structure analysis
3. Full context labeling
4. Construction of question set for context clustering
5. Feature extraction

[Diagram: the VAJA speech corpus supplies wav files and label files, and the ORCHID text corpus supplies XML files. Feature extraction (mcep, f0) produces parameter files (.cmp), and full context labeling produces full context label files (.lab). Both are passed to HMM training and synthesis, which outputs synthetic speech.]

Experiments

Evaluation of overall tone correctness

Figure 5: F0 contours, frequency (Hz, 150 to 250) versus time (5.0 to 6.4 s), of synthesized speech from the 8 different clustering styles, panels (a)-(h), and of natural speech, panel (i).


Figure 6: Tone error percentages (axis from 0 to 50%) of synthesized speech from 4 different clustering styles, (a)-(d), versus number of training utterances (100, 200, 300, 400, 500, 1000, 1500, 2000, 2500).


Figure 7: Tone error percentages (axis from 0 to 50%) of synthesized speech from 8 different clustering styles, (a)-(h), versus number of training utterances (100, 200, 300, 400, 500, 1000, 1500, 2000, 2500).

Experiments

Evaluation of tone correctness for each tone type

[Pie chart: tone 0, 38%; tone 1, 22%; tone 2, 17%; tone 3, 15%; tone 4, 8%.]

Figure 8: Tone error percentages of synthesized speech from 8 different clustering styles, panels (a)-(h), categorized by tone type (tone 0 to tone 4), versus number of training utterances (100, 300, 500, 1500, 2500).
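A tone error percentage, overall or per tone type, can be computed by comparing the intended tone of each synthesized syllable with the tone judged from the output. The sketch below assumes that the evaluation yields one judged tone per syllable; the slides do not describe the judging procedure, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def tone_error_percentage(intended: Sequence[int], judged: Sequence[int]) -> float:
    """Percentage of syllables whose judged tone differs from the intended tone."""
    if len(intended) != len(judged) or not intended:
        raise ValueError("need two non-empty sequences of equal length")
    errors = sum(1 for i, j in zip(intended, judged) if i != j)
    return 100.0 * errors / len(intended)


def per_tone_error_percentage(intended: Sequence[int],
                              judged: Sequence[int]) -> Dict[int, float]:
    """Error percentage for each intended tone type that occurs in the data."""
    if len(intended) != len(judged) or not intended:
        raise ValueError("need two non-empty sequences of equal length")
    totals = Counter(intended)
    errors = Counter(i for i, j in zip(intended, judged) if i != j)
    return {tone: 100.0 * errors[tone] / totals[tone] for tone in sorted(totals)}
```

Because the tone types are unevenly represented, the overall figure is dominated by the frequent tones, which is one reason to report the per-tone breakdown as well.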

Experiments

Evaluation of syllable duration distortion

Figure 9: Scores (%, axis from 0 to 80) of a paired-comparison test for natural duration among 4 different clustering styles, (e)-(h), versus number of training utterances (100, 300, 500, 1500, 2500).

Bar values as extracted, one series per line (apparently in legend order (e) to (h)):

• 71, 60, 55, 53, 51
• 11, 28, 42, 47, 49
• 61, 57, 49, 49, 48
• 56, 55, 54, 51, 52
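One common way to score a paired-comparison test is to report, for each style, the percentage of comparisons involving that style in which it was preferred. The sketch below implements that scoring as an assumption; the slides do not state the exact formula, and `preference_scores` and the vote format are hypothetical.

```python
from typing import Dict, Iterable, Sequence, Tuple


def preference_scores(styles: Sequence[str],
                      votes: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Score each style as the percentage of its comparisons that it won.

    Each vote is (style_x, style_y, preferred), where preferred is one of the two.
    """
    wins = {s: 0 for s in styles}
    shown = {s: 0 for s in styles}
    for x, y, preferred in votes:
        if preferred not in (x, y):
            raise ValueError(f"preferred style {preferred!r} was not in the pair")
        shown[x] += 1
        shown[y] += 1
        wins[preferred] += 1
    return {s: (100.0 * wins[s] / shown[s] if shown[s] else 0.0) for s in styles}
```

When every style appears in the same number of comparisons, the four scores average 50% and therefore sum to 200, which is consistent with the values extracted from Figure 9 for each corpus size.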

Examples of synthesized speech

Female voice, three examples (1, 2, 3) per row:

• HMM, corpus size (number of training utterances): 100
• HMM, corpus size: 500
• HMM, corpus size: 2500
• VAJA (unit selection)
• Analysis-synthesis speech

Female voice, HMM, by tree structure, without and with the added tone question set:

• (a) and (e)
• (b) and (f)
• (c) and (g)
• (d) and (h)

Conclusions

An analysis of tree-based context clustering for an HMM-based Thai speech synthesis system has been conducted in this paper. Four decision-tree structures were designed according to tone groups and tone types to improve the tone correctness of synthesized speech. The results show that the tone-separated tree structures significantly reduce the tone error percentage of the synthesized speech compared with the single binary tree structure. Using contextual tone information at the syllable level improves tone correctness for all decision-tree structures. Some distortion of syllable duration appears when the simple tone-separated tree context clustering is used with a small amount of training data; however, it is relieved by the constancy-based tone-separated or the trend-based tone-separated tree context clustering. The tone correctness of the average-voice-based speech model and intonation analysis are left for future study.
