Atsuhiro Sakurai (Texas Instruments Japan, Tsukuba R&D Center)


Page 1: Atsuhiro Sakurai (Texas Instruments Japan, Tsukuba R&D Center)

Modeling and Generation of Accentual Phrase F0 Contours Based on Discrete HMMs Synchronized at Mora-Unit Transitions

Atsuhiro Sakurai (Texas Instruments Japan, Tsukuba R&D Center)
Koji Iwano (currently with Tokyo Institute of Technology, Japan)
Keikichi Hirose (Dep. of Frontier Eng., The University of Tokyo, Japan)

Page 2:

Introduction to Corpus-Based Intonation Modeling

• Traditional approach: rules derived from linguistic expertise
– Human-dependent (too complicated and not satisfactory, because the phenomena involved are not completely understood)

• Corpus-based approach: modeling derived from statistical analysis of speech corpora
– Automatic (potential to improve as better speech corpora become available)

Page 3:

Background

• HMMs are widely used in speech recognition, and fast learning algorithms exist

• Macroscopic discrete HMMs associated with accentual phrases can store information such as accent type and prosodic structure

• Morae are extremely important for describing Japanese intonation - sequences of high and low morae can characterize accent types
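The high/low mora sequences mentioned above follow the textbook rule for Tokyo Japanese accent types: a type-1 word starts high and falls immediately; all other types start low and rise, with type 0 never falling and type n (n ≥ 2) falling after the n-th mora. A minimal sketch of that rule (the function name is ours, not from the presentation):

```python
def accent_pattern(accent_type, n_morae):
    """Return the high/low ('H'/'L') mora pattern of a Japanese accent type.

    Textbook rule: type 1 is high only on the first mora; type 0 rises after
    the first mora and never falls; type n (n >= 2) is high on morae 2..n.
    """
    pattern = []
    for m in range(1, n_morae + 1):
        if accent_type == 1:
            pattern.append('H' if m == 1 else 'L')
        elif accent_type == 0:
            pattern.append('L' if m == 1 else 'H')
        else:
            pattern.append('H' if 2 <= m <= accent_type else 'L')
    return ''.join(pattern)
```

For example, a 4-mora type-2 word yields the pattern LHLL, with the fall after the second mora marking the accent.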

Page 4:

Overview of the Method

• Definition of HMM and alphabet:
– Accent types modeled by discrete HMMs

– 2-code mora F0 contour alphabet used as output symbols

– State transitions synchronized with mora transitions

• Classification of HMMs and training:
– HMMs classified according to linguistic attributes

– Training by the usual forward-backward (FB) algorithm

• Generation of F0 contours:
– Best sequence of symbols generated by a modified Viterbi algorithm

Page 5:

The Mora-F0 Alphabet

• Two codes, stylized mora F0 contours and mora-to-mora F0, with 34 symbols each

• Obtained by LBG clustering from a 500-sentence database (ATR continuous speech database, speaker MHT)

• The whole database is labeled using the 2-code symbols.
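The LBG clustering step can be sketched as binary codebook splitting followed by k-means refinement. This is a generic illustration, not the paper's exact implementation (the real alphabet has 34 symbols per code, so the final doubling would be truncated, as done below):

```python
import numpy as np

def lbg_codebook(vectors, target_size, eps=0.01, n_iter=20):
    """Build a codebook by LBG binary splitting with k-means refinement.

    vectors: (N, D) array of feature vectors (e.g. stylized mora-F0 shapes).
    Returns a (target_size, D) array of codewords.
    """
    codebook = vectors.mean(axis=0, keepdims=True)  # start from the global centroid
    while len(codebook) < target_size:
        # split every codeword into a slightly perturbed +/- pair
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(n_iter):  # k-means refinement of the enlarged codebook
            dist = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            nearest = dist.argmin(axis=1)
            for k in range(len(codebook)):
                members = vectors[nearest == k]
                if len(members):  # leave empty clusters untouched
                    codebook[k] = members.mean(axis=0)
    return codebook[:target_size]  # truncate if target is not a power of two
```

Each labeled mora then receives the index of its nearest codeword, giving the discrete symbols the HMMs are trained on.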

Page 6:

The Accentual Phrase HMM

[Figure: an accentual-phrase HMM; each state transition is synchronized with a mora transition]

• Accentual phrases are classified according to:– Accent type

– Position of accentual phrase in the sentence

– (Optional: number of morae, part-of-speech, syntactic structure)

Page 7:

Example: 'Karewa Tookyookara kuru.' (He comes from Tokyo.)

Each mora label is a pair [shape, F0] (shape1, F01; shape2, F02; ...):

  Phrase  Accent type  Position  Label sequence
  M1:     1            1         [ ],[ ],[ ]
  M2:     0            2         [ ],[ ],[ ],[ ],[ ],[ ]
  M3:     1            3         [ ],[ ]

Page 8:

HMM Topologies

(a) Accent types 0 and 1

(b) Other accent types

Page 9:

Training Database

• ATR Continuous Speech Database (500 sentences, speaker MHT)

• Segmented into morae and accentual phrases

• Mora labels using the mora-F0 alphabet: shape (stylized F0 contour) and mora F0

• Accentual phrase labels: number of morae, position in the sentence

Page 10:

Output Code Generation

How to use the HMM for synthesis?

A) Recognition: one output sequence in → likelihood and best path out

B) Synthesis: best output sequence and best path out

Page 11:

Intonation Modeling Using HMM

Viterbi Search for the Recognition Problem:

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    Dmin(t, i_t) = min_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y(t) | i_t)] }
    ψ(t, i_t) = argmin_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y(t) | i_t)] }
  next i_t
next t
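The recognition recursion can be sketched in Python, working with -log costs as on the slide. The array names and the toy model in the usage below are ours, not the paper's:

```python
import numpy as np

def viterbi(neg_log_a, neg_log_b, y):
    """Viterbi search in the -log domain (recognition problem).

    neg_log_a[i, j] = -log a(j | i), an S x S matrix of transition costs.
    neg_log_b[j, k] = -log b(k | j), an S x K matrix of output costs.
    y: observed symbol sequence of length T.
    Returns (best state path, total cost of that path).
    """
    S = neg_log_a.shape[0]
    T = len(y)
    D = np.full((T, S), np.inf)          # Dmin(t, i_t)
    back = np.zeros((T, S), dtype=int)   # backtrack pointers psi(t, i_t)
    D[0] = neg_log_b[:, y[0]]            # equal-cost initial states
    for t in range(1, T):
        for j in range(S):
            cand = D[t - 1] + neg_log_a[:, j]
            back[t, j] = cand.argmin()
            D[t, j] = cand.min() + neg_log_b[j, y[t]]
    path = [int(D[-1].argmin())]
    for t in range(T - 1, 0, -1):        # trace the pointers backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(D[-1].min())
```

With a near-diagonal toy model (two states, each preferring its own symbol), the sequence [0, 0, 1, 1] decodes to the state path [0, 0, 1, 1].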

Page 12:

Intonation Modeling Using HMM

Modified Viterbi Search for the Synthesis Problem:

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    Dmin(t, i_t) = min_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y_max(t) | i_t)] }
    ψ(t, i_t) = argmin_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + [-log b(y_max(t) | i_t)] }
  next i_t
next t

where y_max(t) is the most probable output symbol of state i_t
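The only change from the recognition search is that no symbol sequence is observed: each state contributes the cost of its most probable symbol, and the best output sequence is read off the best state path. A sketch under the same toy conventions as before (names ours):

```python
import numpy as np

def viterbi_synthesis(neg_log_a, neg_log_b, T):
    """Modified Viterbi for synthesis: no observations are given; each state
    emits its most likely symbol y_max, and the best symbol sequence is read
    off the best state path."""
    S = neg_log_a.shape[0]
    best_sym = neg_log_b.argmin(axis=1)   # y_max for each state
    best_cost = neg_log_b.min(axis=1)     # -log b(y_max | state)
    D = np.full((T, S), np.inf)
    back = np.zeros((T, S), dtype=int)
    D[0] = best_cost
    for t in range(1, T):
        for j in range(S):
            cand = D[t - 1] + neg_log_a[:, j]
            back[t, j] = cand.argmin()
            D[t, j] = cand.min() + best_cost[j]
    states = [int(D[-1].argmin())]
    for t in range(T - 1, 0, -1):
        states.append(int(back[t, states[-1]]))
    states = states[::-1]
    return [int(best_sym[s]) for s in states]  # best output sequence
```

With transitions that favor alternating states and near-diagonal outputs, the generated symbol sequence alternates as well.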

Page 13:

Use of Bigram Probabilities

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    Dmin(t, i_t) = min_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + max_k { -log b(y_k(t) | y(t-1), i_t) } }
    ψ(t, i_t) = argmin_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + max_k { -log b(y_k(t) | y(t-1), i_t) } }
  next i_t
next t

where k = 1, ..., K (dimension of y)

Page 14:

Accent Type Modeling Using HMM

[Figure: generated log F0 contours (log(Hz), about 3.65 to 4.15) vs. mora number (0 to 5) for accent types "Type0", "Type1", "Type2", "Type3"]

Page 15:

Phrase Boundary Level Modeling Using HMM

[Figure: generated log F0 contours (log(Hz), about 3.9 to 4.08) vs. mora number (0 to 4) for boundary levels "level1.graph", "level2.graph", "level3.graph"]

The three boundary levels are defined from J-TOBI labels:

  J-TOBI B.I.     3  3  2
  Pause (Y/N)     Y  N  N
  Boundary level  1  2  3

Page 16:

The Effect of Bigrams

[Figure: six panels of generated log F0 contours (log F0 [Hz], -0.4 to 0.4) vs. t [msec] (0 to 500) for accentual phrases "PH1_0", "PH1_1", "PH1_2", each shown as generated without bigrams (.original) and with bigrams (.bigram)]

Page 17:

Comments

• We presented a novel approach to intonation modeling for TTS synthesis based on discrete mora-synchronous HMMs.

• From now on, more features should be included in the HMM modeling (phonetic context, part-of-speech, etc.), and the approach should be compared to rule-based methods.

• Training data scarcity is a major problem to overcome (e.g., by feature clustering or an F0 contour generation model).

Page 18:

Hidden Markov Models (HMM)

A Hidden Markov Model (HMM) is a Finite State Automaton in which both state transitions and outputs are stochastic. At each time period it moves to a new state, generating a new output symbol according to the output distribution of that state.

Symbols: 1, 2, ..., K

[Figure: 4-state left-to-right HMM with self-transitions a11, a22, a33, a44, forward transitions a12, a23, a34, a skip transition a13, and output distributions b(1|i) to b(K|i) at each state i = 1, 2, 3, 4]
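The generative behavior just described can be sketched as a toy simulator (ours, not part of the presentation): at each step the model emits a symbol from the current state's output distribution, then follows a random transition.

```python
import numpy as np

def sample_hmm(a, b, T, rng=None):
    """Generate T steps from a discrete HMM.

    a[i, j]: transition probability from state i to state j (rows sum to 1).
    b[i, k]: probability of emitting symbol k in state i (rows sum to 1).
    Returns (state sequence, symbol sequence), starting from state 0.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    S, K = b.shape
    states, symbols = [], []
    state = 0
    for _ in range(T):
        symbols.append(int(rng.choice(K, p=b[state])))  # emit from current state
        states.append(state)
        state = int(rng.choice(S, p=a[state]))          # then transition
    return states, symbols
```

With deterministic outputs (b the identity matrix), the emitted symbols simply mirror the visited states, which makes the state/output distinction easy to inspect.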

Page 19:

Step 1: Database Construction

• Use the ATR continuous speech database (500 sentences, speaker MHT)
• Segment into mora units
• Assign mora labels
• Extract F0 patterns
• Cluster by the LBG algorithm
• Assign cluster classes to the whole database

Page 20:

Introduction of Bigrams

for t = 2, 3, ..., T
  for i_t = 1, 2, ..., S
    Dmin(t, i_t) = min_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + max_k { -log b(y_k(t) | y(t-1), i_t) } }
    ψ(t, i_t) = argmin_{i_{t-1}} { Dmin(t-1, i_{t-1}) + [-log a(i_t | i_{t-1})] + max_k { -log b(y_k(t) | y(t-1), i_t) } }
  next i_t
next t

where k = 1, ..., K (dimension of y)

Page 21:

Discussion and Future Work

• Training data is scarce
• Integration into a TTS system requires further work:
– take other linguistic information into account (phonemes, number of morae, part of speech, etc.)
– devise ways to overcome the data shortage (clustering, etc.)
– study how to connect the models
