
Human Language Technologies – Text-to-Speech

Sixth Speech Synthesis Workshop, Bonn, Germany. August 22-24, 2007 © 2007 IBM Corporation

Automatic Exploration of Corpus-Specific Properties for Expressive Text-to-Speech.

(A Case Study in Emphasis.)

Raul Fernandez and Bhuvana Ramabhadran, IBM T.J. Watson Research Center


Outline

• Motivation
• Review of Expressive TTS Architecture
• Expression Mining: Emphasis
• Evaluation


Expressive TTS

We have shown that corpus-based approaches to expressive concatenative TTS (CTTS) manage to convey expressiveness, provided the corpus is designed to contain the desired expression(s).

There are, however, shortcomings to this approach:

• Adding new expressions, or increasing the size of the repository for an existing one, is expensive and time-consuming.

• The footprint of the system grows as new expressions are added.

Without abandoning this framework, we propose to partially address these limitations with an approach that exploits the properties of the existing databases to maximize the expressive range of the TTS system.



Some observations about data and listeners…

Production variability:

Speakers produce subtle expressive variations, even when they’re asked to speak in a mostly-neutral style.

[Figure: schematic of expressive categories: Anger, Fear, Sad, Neutral.]

Perceptual confusability/redundancy:

Several studies have shown that there’s an overlap in the way listeners interpret the prosodic-acoustic realizations of different expressions.


Expression Mining

Goals:

Exploit the variability present in a given dataset to increase the expressive range of the TTS engine.

Augment the corpus-based approach to expressive synthesis with expression mining.

Challenge:

Automatic annotation of instances in the corpus where an expression of interest occurs.

(The approach may still require collecting a smaller expression-specific corpus to bootstrap data-driven learning algorithms.)

Case study: Emphasis.


The Expressive Framework of the IBM TTS System

The IBM Expressive Text-to-Speech system consists of:

• a rule-based front-end for text analysis

• acoustic models (decision trees) for generating candidate synthesis units

• prosody models (decision trees) for generating pitch and duration targets

• a module to carry out a Viterbi search

• a waveform-generation module to concatenate the selected units

Expressiveness is achieved in this framework by associating symbolic attribute vectors with the synthesis units. These attribute values are able to influence:

• target prosody generation

• unit-selection search


Attributes

• Style (the default attribute): Good News, Apologetic, Uncertain, …

• Emphasis: 0, 1

• ? (e.g., voice quality = {breathy, …}, etc.)


How do attributes influence the search?

- Corpus is tagged a priori.

- At run time: Input is tagged at the word level (e.g., via user-provided mark-up) with annotations indicating the desired attribute. Annotations are propagated down to the unit level.

- A component of the target cost function penalizes label substitutions:

            Neutral    Good news    Bad news
Neutral       0          0.5          0.6
Good news     0.3        0            1.0
Bad news      0.5        1.0          0
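The substitution penalty can be read off the table as a simple lookup. A minimal sketch (the function name is hypothetical, and whether rows index the target label or the candidate unit's label is an assumption here):

```python
# Target-cost component penalizing expressive-label substitutions.
# Values are taken from the table above; note the matrix is asymmetric
# (e.g., a good-news unit in a neutral slot costs 0.5, but a neutral
# unit in a good-news slot costs 0.3).
SUB_COST = {
    "neutral":   {"neutral": 0.0, "good_news": 0.5, "bad_news": 0.6},
    "good_news": {"neutral": 0.3, "good_news": 0.0, "bad_news": 1.0},
    "bad_news":  {"neutral": 0.5, "good_news": 1.0, "bad_news": 0.0},
}

def attribute_substitution_cost(target_label: str, unit_label: str) -> float:
    """Penalty added to a candidate unit's target cost when its
    expressive label differs from the requested one."""
    return SUB_COST[target_label][unit_label]

cost = attribute_substitution_cost("neutral", "good_news")  # 0.5
```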


- Additionally, the style attribute has style-specific prosody models (for pitch and duration) associated with it. Therefore, prosody targets are produced according to the style requested.

[Diagram: normalized text and the target style are routed to the style-specific prosody model (Style 1, 2, or 3), whose output generation produces the prosody targets.]
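This routing can be sketched as a dictionary dispatch keyed on style. The models below are toy stand-ins (the real ones are decision trees), and all names and numbers are hypothetical:

```python
# Sketch of style-keyed prosody-target generation: the requested style
# selects a style-specific prosody model, which maps normalized text to
# per-word (pitch, duration) targets. Model internals are toy stand-ins
# (the real models are decision trees); all numbers are hypothetical.
def neutral_prosody(words):
    return [(120.0, 250.0) for _ in words]   # flat pitch (Hz), duration (ms)

def good_news_prosody(words):
    return [(160.0, 220.0) for _ in words]   # raised pitch, shorter durations

PROSODY_MODELS = {
    "neutral": neutral_prosody,
    "good_news": good_news_prosody,
}

def prosody_targets(normalized_text: str, style: str):
    """Route normalized text to the prosody model of the target style."""
    return PROSODY_MODELS[style](normalized_text.split())

targets = prosody_targets("congratulations on the new job", "good_news")
```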


Mining Emphasis

[Diagram: the Emphasis Corpus (~1K sentences) feeds a Statistical Learner that produces a Trained Emphasis Classifier; the classifier annotates the Baseline Corpus (~10K sentences) to yield a Baseline Corpus with Emphasis Labels, from which the TTS system with emphasis is built.]
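The pipeline can be sketched end to end. The single-feature threshold learner below is a toy stand-in for the real statistical learner, and all names and numbers are hypothetical:

```python
# End-to-end sketch of the mining pipeline: a classifier trained on the
# small explicitly-emphasized corpus (~1K sentences) auto-labels words
# in the large baseline corpus (~10K sentences); the TTS voice is then
# built from both. The one-feature threshold learner is a toy stand-in
# for the real statistical learner; all numbers are hypothetical.
def train_emphasis_classifier(labeled_words):
    """labeled_words: [(feature_value, is_emphasized), ...].
    Returns a decision threshold midway between the two classes."""
    pos = [f for f, emphasized in labeled_words if emphasized]
    neg = [f for f, emphasized in labeled_words if not emphasized]
    return (min(pos) + max(neg)) / 2.0

def mine_emphasis(threshold, unlabeled_features):
    """Auto-label the baseline corpus with emphasis decisions."""
    return [(f, f >= threshold) for f in unlabeled_features]

# Toy feature, e.g. a word's normalized pitch peak.
emphasis_corpus = [(0.9, True), (0.8, True), (0.2, False), (0.3, False)]
threshold = train_emphasis_classifier(emphasis_corpus)
baseline_labels = mine_emphasis(threshold, [0.1, 0.95, 0.4])
```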


Training Materials

Two sets of recordings, one from a female and one from a male speaker of US English.

Approximately 1K sentences in the script.

Approximately 20% of the words in the script carry emphasis.

Recordings are single-channel, 22.05 kHz.

Examples (emphasized words in capitals):

To hear DIRECTIONS to this destination say YES.

I'd LOVE to hear how it SOUNDS.

It is BASED on the information that the company gathers, but not DEPENDENT on it.


Modeling Emphasis – Classification Scheme

- Modeled at the word level.

- Feature set: prosodic features derived from (i) pitch (absolute and speaker-normalized), (ii) duration, and (iii) energy measures.

- Individual classifiers are trained and their results stacked (this marginally improves the generalization performance estimated through 10-fold cross-validation).

[Diagram: prosodic features feed k-nearest-neighbor, SVM, and naïve-Bayes classifiers; their intermediate output probabilities are combined into the final output probabilities.]
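The stacked scheme can be sketched as follows. The base models and meta-weights below are toy stand-ins for the trained k-NN, SVM, and naïve-Bayes models; all names and numbers are hypothetical:

```python
# Sketch of the stacked classification scheme: three base classifiers
# each map a word's prosodic feature vector (pitch, duration, energy
# here) to an intermediate emphasis probability, and a meta-layer
# combines them into the final output probability.
def knn_like(x):
    return 1.0 if x[0] > 0.5 else 0.0               # hard vote on pitch cue

def svm_like(x):
    return min(1.0, max(0.0, x[1]))                 # clipped duration cue

def bayes_like(x):
    return min(1.0, max(0.0, 0.5 * (x[0] + x[2])))  # pitch/energy blend

BASE_MODELS = [knn_like, svm_like, bayes_like]
META_WEIGHTS = [0.4, 0.3, 0.3]  # a trained meta-learner would fit these

def emphasis_probability(features):
    intermediate = [model(features) for model in BASE_MODELS]
    return sum(w * p for w, p in zip(META_WEIGHTS, intermediate))

p_emph = emphasis_probability((0.9, 0.8, 0.7))
```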


Modeling Emphasis – Classification Results

MALE

TP Rate   FP Rate   Prec.   F-Meas.   Class
0.82      0.06      0.78    0.80      emphasis
0.94      0.18      0.95    0.94      not-emphasis

Correctly Classified Instances: 91.2%

FEMALE

TP Rate   FP Rate   Prec.   F-Meas.   Class
0.80      0.06      0.75    0.77      emphasis
0.93      0.18      0.94    0.94      not-emphasis

Correctly Classified Instances: 89.9%
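As a sanity check, the F-measures are consistent with the reported precision P and recall R (the TP rate), since F = 2PR/(P + R):

```python
# The table's F-measures follow from precision and recall (TP rate):
# F = 2PR / (P + R). A quick consistency check against the tables.
def f_measure(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

male_emphasis_f = f_measure(0.78, 0.82)    # ~0.80, matching the table
female_emphasis_f = f_measure(0.75, 0.80)  # ~0.77, matching the table
```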


What does it find in the corpus?

I think they will diverge from bonds, and they may even go up.


Please say the full name of the person you want to call.


There's a long fly ball to deep center field. Going, going. It's gone, a home run.


Listening Tests – Stimuli and Conditions

Sent. Type   Emphasis in Text?   Synthesis Sources
A            N                   Baseline Neutral Units
B            Y                   Training Corpus w/ Explicit Emphasis
C            Y                   Training Corpus w/ Explicit Emphasis + Baseline Corpus w/ Mined Emphasis

Condition 1 Pair: one Type-A sentence vs. one Type-B sentence (in random order).

Condition 2 Pair: one Type-A sentence vs. one Type-C sentence (in random order).


Listening Tests – Setup

List 1:

Condition 1 (12 pairs): B1 vs A1, A2 vs B2, A3 vs B3, …, B12 vs A12

Condition 2 (12 pairs): A1 vs C1, A2 vs C2, C3 vs A3, …, C12 vs A12

(+ shuffle)

List 2: the same pairs with the members of each pair in reverse order (A1 vs B1, C2 vs A2, B3 vs A3, …, A12 vs B12), also shuffled.
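The list construction above can be sketched in a few lines (a minimal sketch; the function name and seed are hypothetical):

```python
import random

# Sketch of the playlist construction: List 2 holds the same stimulus
# pairs as List 1 with the members of each pair in reverse order, and
# each list is shuffled independently.
def build_lists(pairs, seed=0):
    rng = random.Random(seed)
    list1 = list(pairs)
    list2 = [(second, first) for (first, second) in pairs]  # reversed pairs
    rng.shuffle(list1)
    rng.shuffle(list2)
    return list1, list2

condition1_pairs = [("B1", "A1"), ("A2", "B2"), ("A3", "B3")]
list1, list2 = build_lists(condition1_pairs)
```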


Listening Tests – Task Description

A total of 31 participants listen to a playlist (16 to List 1; 15 to List 2).

For each pair of stimuli, the listeners are asked to select which member of the pair contains emphasis-bearing words.

No information is given about which words may be emphasized.

Listeners may opt to listen to a pair repeatedly.


Listening Tests – Results

Condition   Neutral (A)   Emphatic (B/C)
1           61.6%         38.4%
2           48.7%         51.3%


Conclusions

When only the limited expressive corpus is considered, listeners actually prefer the neutral baseline. A possible explanation is that biasing the search heavily toward a small corpus introduces artifacts that interfere with the perception of emphasis.

However, when the small expressive corpus is augmented with automatic annotations, the perception of intended emphasis increases significantly, by 13% (p < 0.001).

Although further work is needed to reliably convey emphasis, we have demonstrated the advantages of automatically mining the dataset to augment the search space of expressive synthesis units.
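A check of the kind behind a p < 0.001 figure can be sketched with an exact binomial test against the 50% chance level. The counts below are hypothetical, purely for illustration; they are not the study's raw data:

```python
from math import comb

# Illustrative significance check: an exact two-sided binomial test of
# whether listeners' preference for the emphasis-bearing rendering
# differs from the 50% chance level. Counts are hypothetical.
def binomial_two_sided_p(k, n, p=0.5):
    """Sum the probabilities of all outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    return min(1.0, sum(x for x in pmf if x <= pmf[k] * (1 + 1e-12)))

p_val = binomial_two_sided_p(120, 180)  # well below 0.001
```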


Future Work

Explore alternative feature sets to improve automatic emphasis classification.

Extend the proposed framework to automatically detect more complex expressions in a “neutral” database and augment the search space for our expressive systems (e.g., good news; apologies; uncertainty).

Explore how the perceptual confusion between different labels can be exploited to increase the range of expressiveness of the TTS system.

[Diagram: expressive label sets (N, A, GN, UN) shown with overlaps.]
