Modelling Prosody for Speech Synthesis: example from Polish Dominika Oliver IGK Colloquium 22 July 2004

Modelling Prosody for Speech Synthesis: example from Polish

Dominika Oliver

IGK Colloquium, 22 July 2004

Outline

Goal: prosodic modelling for TTS

Review of past studies: intonational investigations

Current state: latest modelling results

TTS Cycle

Text Input (raw or annotated)

Text Processing
• Text Normalisation: names, abbreviations, numbers
• Linguistic Analysis: morphology, syntax, semantics

Phonetic Analysis
• Grapheme-to-Phoneme Conversion: rules, dictionary

Prosodic Analysis
• Pitch, Phrasing & Duration Modelling

Speech Synthesis
• Voice Rendering

TTS Cycle

Prosodic analysis/modelling

Prosodic components (focus, stress, duration etc.)

Prosodic phrasing

Intonation: accent types, pitch contour

Overview

Procedure
Resources
Modelling techniques
Modelling prosody
Problems & solutions
Suggested improvements

Procedure

Prosodic modelling shopping list:

Language specific intonation description

Accent type and placement prediction & F0 generation methods

Research and evaluation tool (Festival)

Language specific intonation description

Quantitative analysis of Polish intonation (accent types)

Standard description of Polish intonation (Jassem, 1961, 1984; Demenko, 1999)
Falling: HL, HM, ML, xL
Rising: LM, MH, LH
Level: MM
Rise-fall: LHL

Broad-Narrow Focus/Peak alignment study (Andreeva and Oliver, 2003)

Accent types

Falling

Accent types

Rising

Overview

Procedure
Resources
Modelling techniques
Modelling prosody
Problems & solutions
Suggested improvements

Resources

Speech corpora: PoInt (Polish Intonation Database) (Karpiński, 2001)
350 MB, multi-speaker (~40)
Read and (semi-)spontaneous speech

Transcribed
Syllable-based IPA segmental annotation
Syllable-based prosodic annotation

Resources

PoInt Prosodic transcription

Tone heights: xH, H, M, L, xL

Phrase boundary indication

Resources

Falling

[Figure: waveform and pitch track of a falling-accent utterance, time axis 0 to 3.48 s; syllable tier: fst ci va wem vled vje vi dot ned ur ci ter pen t ne; tone labels: H L |]

Resources

Rising

[Figure: waveform and pitch track of a rising-accent utterance, time axis 0 to 0.71 s; syllable tier: t to prav da; tone labels: L H |]

Resources

Festival TTS (Black & Taylor, 1998)
A general multi-lingual speech synthesis system
Offers a full text-to-speech system
An environment for development and research of speech synthesis techniques

Overview

Procedure
Resources
Modelling techniques
Modelling prosody
Problems & solutions
Suggested improvements

Modelling techniques

Default prosodic assignment from simple text analysis

Hand-built rule-based system: hard to modify and adapt to new domains

Corpus-based approaches (Sproat et al., 1992): learn prosodic variation from large labelled corpora using machine learning techniques

Modelling techniques – accent type/placement prediction

Classification and regression trees (CART) (Breiman, Friedman, Olshen & Stone, 1984, 1993)

In speech synthesis, widely used to model
• segment durations (e.g. Riley, 1992)
• accent prediction (Syrdal, Hirschberg, McGory & Beckman, 2001)
• pitch contour generation (Dusterhoff, 1997; Dusterhoff, Black & Taylor, 1999)

Modelling techniques - F0 prediction

Linear regression (Black & Hunt, 1996)
Used e.g. for F0 contour prediction/generation
Finds the appropriate F0 target per syllable based on available features, trained from data
The predicted variable (p) can be modelled as a sum of a set of weighted real-valued factors:

$p = w_0 + w_1 f_1 + w_2 f_2 + w_3 f_3 + \dots + w_n f_n$

factors ($f_i$): parameterised properties of the data

weights ($w_i$): usually trained using a stepwise least squares technique
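
The weighted-sum model on this slide can be sketched in Python. This is only an illustration under stated assumptions: it assumes NumPy is available, it uses ordinary least squares rather than the stepwise procedure the slide mentions, and the function names and toy numbers are hypothetical, not PoInt data.

```python
import numpy as np

def fit_weights(features, targets):
    """Fit p = w0 + w1*f1 + ... + wn*fn by ordinary least squares.

    Assumption: plain least squares, not the stepwise variant.
    """
    X = np.asarray(features, dtype=float)
    # Prepend a column of ones so that w0 acts as the intercept.
    X = np.hstack([np.ones((X.shape[0], 1)), X])
    y = np.asarray(targets, dtype=float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

def predict(weights, features):
    """Evaluate the weighted sum for one or more feature vectors."""
    X = np.atleast_2d(np.asarray(features, dtype=float))
    return weights[0] + X @ weights[1:]

# Hypothetical toy data: two real-valued factors per syllable and an
# F0 target generated exactly from p = 100 + 20*f1 - 5*f2.
toy_features = [[0.0, 1.0], [1.0, 0.0], [1.0, 2.0], [2.0, 1.0], [0.5, 0.5]]
toy_targets = [100 + 20 * f1 - 5 * f2 for f1, f2 in toy_features]

w = fit_weights(toy_features, toy_targets)
```

Because the toy targets are noise-free, the fit recovers the generating weights; with real F0 data the residual error would be what the RMSE figures later in the talk measure.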

Prerequisite

F0 normalisation (Ladd, 1995; Clark, 2003)

(PoInt: 40 speakers, mixed sex)

$f0^{n}_{i} = \frac{f0_{i} - \mu}{\sigma}$

where $\mu$ is the f0 mean and $\sigma$ is the f0 standard deviation of the utterance

The rescaling uses the standard deviation $\sigma_D$ and mean $\mu_D$ of f0 in the database:

$f0'_{i} = f0^{n}_{i} \, \sigma_D + \mu_D$
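
The two normalisation steps can be illustrated with a short Python sketch. It assumes population (not sample) standard deviation, does not guard against a constant-F0 utterance, and uses hypothetical function names and values throughout.

```python
from statistics import mean, pstdev

def normalise_f0(utterance_f0):
    """z-score F0 values using the utterance's own mean and std."""
    mu = mean(utterance_f0)
    sigma = pstdev(utterance_f0)  # assumption: population std
    return [(f0 - mu) / sigma for f0 in utterance_f0]

def rescale_f0(normalised_f0, db_mean, db_std):
    """Map normalised values onto the database-wide F0 distribution."""
    return [z * db_std + db_mean for z in normalised_f0]

# Hypothetical values: one utterance and assumed database statistics.
utterance = [180.0, 220.0, 200.0, 160.0, 240.0]
z = normalise_f0(utterance)
rescaled = rescale_f0(z, db_mean=150.0, db_std=30.0)
```

After rescaling, every utterance shares the database mean and spread, which is what lets data from male and female speakers be pooled in one model.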

Overview

Procedure
Resources
Modelling techniques
Modelling prosody
Problems & solutions
Suggested improvements

Modelling

Steps
Building the utterance structure of the database speech files
Incorporating database intonation labelling
Extracting features for accent prediction and F0 generation
Building CART model: PoInt intonation labels
Building LR model: 3 points per syllable
Incorporating model parameters into voice description

Modelling - accent type/placement prediction

Model based on PoInt: multiple speakers (male, female)
Accent inventory (L, H, M)
Accent prediction method: CART
Features (31)
• POS window
• Position of candidate syllable in word and sentence
• Stress information window etc.
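
As an illustration of what such a feature vector might look like, here is a small Python sketch that builds window and position features for one candidate syllable. The dictionary keys, the window size of one, and the per-syllable POS window are all assumptions for the example; the slide does not list the 31 features actually used.

```python
def window(values, index, size=1, pad="NONE"):
    """Return values[index-size .. index+size], padding past the edges."""
    return [values[i] if 0 <= i < len(values) else pad
            for i in range(index - size, index + size + 1)]

def syllable_features(syllables, index):
    """Build a feature dict for the candidate syllable at `index`.

    Each syllable is a dict with hypothetical keys: 'pos' (part of speech
    of the containing word), 'stress' (0/1), 'pos_in_word' and 'word_len'
    (syllable position within its word and the word's syllable count).
    """
    syl = syllables[index]
    prev_pos, cur_pos, next_pos = window([s["pos"] for s in syllables], index)
    prev_st, cur_st, next_st = window(
        [s["stress"] for s in syllables], index, pad=0)
    return {
        "pos_prev": prev_pos, "pos": cur_pos, "pos_next": next_pos,
        "stress_prev": prev_st, "stress": cur_st, "stress_next": next_st,
        "syl_in_word": syl["pos_in_word"],
        "word_final": syl["pos_in_word"] == syl["word_len"] - 1,
        "syl_in_sentence": index,
        "syls_to_end": len(syllables) - 1 - index,
    }

# Hypothetical three-syllable utterance: a one-syllable word followed
# by a two-syllable word with initial stress.
utt = [
    {"pos": "pron", "stress": 0, "pos_in_word": 0, "word_len": 1},
    {"pos": "noun", "stress": 1, "pos_in_word": 0, "word_len": 2},
    {"pos": "noun", "stress": 0, "pos_in_word": 1, "word_len": 2},
]
feats = syllable_features(utt, 1)
```

One such dictionary per syllable, paired with its accent label, is the kind of row a tree learner would be trained on.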

Results – accent prediction

Train set (total 963, correct 897, 93.146%)

Accents   NONE   H    L    M    Total     Accuracy
NONE      839    0    3    0    839/842   99.60%
H         15     5    5    4    5/29      17.20%
L         7      1    37   5    37/50     74%
M         13     1    12   16   16/42     38%

Test set (total 1070, correct 996, 93.084%)

Accents   NONE   H    L    M    Total     Accuracy
NONE      953    3    6    4    953/966   98.70%
H         11     8    2    6    8/27      29.63%
L         4      1    29   12   29/46     63.04%
M         11     3    11   6    6/31      19.40%
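
The accuracy figures on this slide follow directly from the confusion counts. The Python sketch below recomputes them for the train set, assuming rows are reference labels and columns are predicted labels (the slide does not say which is which); the function names are hypothetical.

```python
LABELS = ["NONE", "H", "L", "M"]

# Train-set counts from the slide; rows are assumed to be the
# reference labels and columns the predicted labels.
TRAIN = [
    [839, 0, 3, 0],
    [15, 5, 5, 4],
    [7, 1, 37, 5],
    [13, 1, 12, 16],
]

def per_class_accuracy(matrix):
    """Diagonal count divided by the row total, for each label."""
    return {label: row[i] / sum(row)
            for i, (label, row) in enumerate(zip(LABELS, matrix))}

def overall_accuracy(matrix, skip=()):
    """Correct / total, optionally leaving out the rows named in `skip`."""
    rows = [(i, row) for i, row in enumerate(matrix) if LABELS[i] not in skip]
    correct = sum(row[i] for i, row in rows)
    total = sum(sum(row) for _, row in rows)
    return correct / total

train_overall = overall_accuracy(TRAIN)                    # 897 / 963
train_accented = overall_accuracy(TRAIN, skip=("NONE",))   # 58 / 121
```

On these counts the overall figure is dominated by the NONE class: restricted to the accented rows, accuracy is 58/121, a little under half.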

Modelling - F0 prediction/generation

F0 generation: linear regression

Features
• accent type
• POS window
• Position of candidate syllable in word and sentence
• Stress information window etc.

Results – F0 shape prediction

            Train                  Test
Position    RMSE    Correlation    RMSE    Correlation
start       48.56   0.46           50.17   0.40
mid         55.87   0.49           59.13   0.49
end         58.99   0.45           54.50   0.48
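
For reference, the two evaluation measures reported here can be computed as in the following Python sketch. The data are hypothetical F0 targets chosen for easy arithmetic, not values from the experiment, and correlation is assumed to mean Pearson's r.

```python
import math

def rmse(predicted, observed):
    """Root mean squared error between paired values."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

def pearson(predicted, observed):
    """Pearson correlation coefficient between paired values."""
    n = len(observed)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o)
              for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)

# Hypothetical F0 targets at one syllable position.
observed = [200.0, 180.0, 220.0, 160.0]
predicted = [190.0, 190.0, 210.0, 170.0]

error = rmse(predicted, observed)
r = pearson(predicted, observed)
```

The two measures are complementary: RMSE reports how far predictions are from the targets in absolute terms, while correlation reports whether they move in the same direction.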

Overview

Procedure
Resources
Modelling techniques
Modelling prosody
Problems & solutions
Suggested improvements

Potential problems

Data
• not enough tokens to learn from
• annotation inconsistencies (noisy data, messy accent class assignment)

Inappropriate technique / suboptimal feature set

Potential data problems

Potential data problems

Potential data problems

PoInt Analysis

Peak alignment

Addressing data issues

F0 tracking errors

Identifying outliers / annotation inconsistencies

Re-classifying accent types

When everything else fails – blame it on the data

Labelling errors
Unmarked disfluencies/wrong reading
Phonemic labelling
Missing phrasing
No indication of sentence mode in annotation

Inconsistent labelling
Misleading transcription description
No independent labellers

Data fixes

Automatically identifying outliers / annotation inconsistencies
Statistical analysis of acoustic parameters

Manual data inspection
Insertion of phrase boundaries
Marking of disfluencies
Aligning speech with text
Deriving Gold Standard (hard)
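
One simple way to flag outliers automatically is a z-score test on an acoustic parameter, sketched below in Python. The threshold, the choice of parameter and the values are assumptions for illustration; the slide does not specify which statistical analysis was used.

```python
from statistics import mean, pstdev

def flag_outliers(values, threshold=2.5):
    """Return indices whose z-score magnitude exceeds `threshold`."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing to flag
    return [i for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# Hypothetical F0 range values for syllables sharing one accent label;
# the 310 stands in for a pitch-tracking error such as an octave jump.
f0_range = [42.0, 38.0, 45.0, 40.0, 36.0, 44.0, 41.0, 39.0, 310.0, 43.0]
suspects = flag_outliers(f0_range)
```

Flagged items would then go to manual inspection rather than being dropped outright, since a large deviation can also be a genuine but rare accent realisation.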

Accent classification studies

Hierarchical clustering (Klabbers & van Santen, 2004)
Linear regression (Keller & Zellner Keller, 2003)
EM bagging & boosting (Sun, 2002)
HMMs (Kumpf & King, 2004)
(Blackburn, Vonwiller & King, 1993)
(Batliner et al., 1999, 2001)
(Maragoudakis, 2003; Zervas, 2004)
(Chan, Feng, Heinen & Niederjohn, 1994)

Accent type re-classification

Two-stage procedure

Self-organising maps (Kohonen, 1982, 1995; Kaski, 1997; Vesanto & Alhoniemi, 2000)
• create a set of prototype vectors representative of the data
• project the prototypes onto a low-dimensional space

Hierarchical agglomerative clustering (HAC)
• method for finding good candidates for map unit clusters: cut the dendrogram where there is a large distance between two clusters
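
The two-stage idea (SOM prototypes first, then HAC over the prototypes) can be sketched in Python with NumPy. This is a toy version under several assumptions: a 3x3 map, a linearly decaying learning rate and neighbourhood, average linkage, and a fixed number of clusters in place of cutting the dendrogram at a large distance gap. It is not the configuration used in the study.

```python
import numpy as np

def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small SOM; return one prototype vector per map unit."""
    rng = np.random.default_rng(seed)
    # Initialise prototypes from randomly chosen data points.
    protos = data[rng.integers(0, len(data), rows * cols)].astype(float)
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)],
                    dtype=float)
    steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = t / steps
            lr = lr0 * (1.0 - frac)
            sigma = sigma0 * (1.0 - frac) + 0.1
            # Best-matching unit: prototype closest to the sample.
            bmu = np.argmin(((protos - x) ** 2).sum(axis=1))
            # Gaussian neighbourhood on the map grid around the BMU.
            grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2 * sigma ** 2))
            protos += lr * h[:, None] * (x - protos)
            t += 1
    return protos

def hac(points, n_clusters):
    """Average-linkage agglomerative clustering; one label per point."""
    clusters = [[i] for i in range(len(points))]
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    labels = np.empty(len(points), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Hypothetical 2-D data drawn around three centres.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(c, 0.3, size=(40, 2)) for c in (-3.0, 0.0, 3.0)])

protos = train_som(data)
unit_labels = hac(protos, n_clusters=3)
# Each data point inherits the cluster of its best-matching map unit.
bmus = np.argmin(((data[:, None] - protos[None]) ** 2).sum(axis=2), axis=1)
data_labels = unit_labels[bmus]
```

Clustering the handful of prototypes instead of every data point keeps the second stage cheap and makes it less sensitive to individual noisy tokens.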

Acoustic data parameterisation

Accent type classification: (Demenko, 1999)

1. Difference between start F0 (first vowel) and F0 extreme value (on a vowel or consonant): $x_1 = F_p - F_{ev}$

2. Difference between F0 extreme value and end point F0: $x_2 = F_e - F_k$

3. Difference between F0 max and F0 min: $x_3 = F_{max} - F_{min}$

4. Difference between utterance mean F0 and mean F0 for all utterances by the same voice: $x_4 = F_{sr} - F_{srg}$

5. Difference between utterance min F0 and global mean min F0 for the same voice: $x_5 = F_{min} - F_{min\,g}$
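
The five parameters can be computed from F0 samples roughly as in the Python sketch below. Several details are assumptions, since the slide leaves them open: the accent span is given as a list whose first and last samples are the start and end points, the extreme value is taken to be the sample furthest from the start, and x3 is computed over the accent span. All names and values are hypothetical.

```python
from statistics import mean

def accent_parameters(accent_f0, utterance_f0,
                      voice_mean_f0, voice_mean_min_f0):
    """Compute x1..x5 for one accent from F0 samples.

    Assumptions: `accent_f0` spans the accent, its first and last samples
    are the start and end points, the extreme value is the sample furthest
    from the start, and x3 is taken over the accent span.
    """
    start, end = accent_f0[0], accent_f0[-1]
    extreme = max(accent_f0, key=lambda f0: abs(f0 - start))
    return {
        "x1": start - extreme,
        "x2": extreme - end,
        "x3": max(accent_f0) - min(accent_f0),
        "x4": mean(utterance_f0) - voice_mean_f0,
        "x5": min(utterance_f0) - voice_mean_min_f0,
    }

# Hypothetical rise-fall shaped accent inside a short utterance.
accent = [180.0, 210.0, 240.0, 200.0, 150.0]
utterance = [170.0, 175.0] + accent + [140.0, 130.0]
params = accent_parameters(accent, utterance,
                           voice_mean_f0=185.0, voice_mean_min_f0=120.0)
```

In this toy case x1 is negative (the contour rises from the start to the peak) and x2 is positive (it falls from the peak to the end), which is the sign pattern a rise-fall shape would produce under these definitions.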

Accent type re-classification

Clusters description

Accent label   HAC 1   HAC 2   HAC 3   Total
HH             3       30      7       40
HL             5       15      108     128
HM             5       51      37      93
HxH            0       6       2       8
LH             149     18      2       169
LL             58      2       3       63
LM             53      9       2       64
LxL            81      3       3       87
MH             60      65      9       134
ML             88      24      48      160
MM             42      54      14      110
Total          544     277     235     1056

Accent type re-classification

Clusters characteristics

Accent type re-classification

New results – Accent placement prediction

train data: Accent 89/103 86.40%

test data: Accent 88/97 90.70%

New results – Accent type prediction

train data

Accents   Total   Accuracy   Previous
H         13/24   54.20%     17.20%
L         49/55   89.10%     74.00%
M         14/30   46.70%     38.00%

test data

Accents   Total   Accuracy   Previous
H         12/24   50.00%     29.00%
L         42/44   95.50%     63.00%
M         13/29   44.80%     19.00%

Evaluation

Self-organising maps: a potential method for categorisation

The results are relatively successful and consistent

Data pre-processing is the most critical phase

The automatic training phase requires solid and consistent (manual) preparation

Overview

Procedure
Resources
Modelling techniques
Modelling prosody
Problems & solutions
Suggested improvements

Need for better data

Based on problems encountered
Further analysis of clusters
A large amount of data from a single speaker (primary need)
A large amount of prosodic variation
A balanced set of pitch events
Clear speech which can be easily tracked
Complex prosodic structure

Suggested improvements

Model modification
More data, e.g. Peak Alignment study
Separate models for different sentence types (yes/no questions, statements)
Re-estimation of parameters based on new intonationally rich data

Next

Closer inspection of automatically assigned accent classes (clusters)

Evaluation: perception experiments

The End