
Trainable Videorealistic Speech Animation

Tony Ezzat, Gadi Geiger, Tomaso Poggio

CBCL / AI Lab, MIT

Outline

• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation

Overview

[Diagram: Video Database, Visual Speech Processing; example words “Air”, “Badge”]

2 Themes:
• Videorealism
• Machine Learning

Mary101

Audio Analysis

[Diagram: Video Database, Audio Database; example word “Badge”]

Audio is also recorded to help label the video.

Audio Synthesis

[Diagram: Video Database, Audio Database, and a crossed-out Audio Speech Processing block; example word “Badge”]

No Audio Synthesis!

What is the Input REALLY?


Input: Phone Stream

/SIL B B B AE AE JH JH SIL SIL/

Where the phone stream comes from:
• Real audio of “Badge”: Speech Recognition, Forced Viterbi Alignment, Manual Labelling
• TTS for “Badge”
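The per-frame phone stream on this slide can be collapsed into (phone, duration) segments, a form that is often convenient for later processing. This is only a minimal sketch, assuming one label per video frame; the helper name `phone_segments` is hypothetical and not part of the original system.

```python
from itertools import groupby

def phone_segments(stream):
    """Collapse a per-frame phone stream into (phone, n_frames) pairs."""
    labels = stream.strip("/").split()
    return [(phone, len(list(run))) for phone, run in groupby(labels)]

# The example stream shown on the slide: ten frames of the word "badge".
segments = phone_segments("/SIL B B B AE AE JH JH SIL SIL/")
```

Each pair gives a phone and the number of consecutive frames it occupies, so the durations add up to the length of the stream.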

Pre- and Post-Processing

Pre-Processing
• Remove head movement using planar perspective warping

Post-Processing
• Mask out mouth
• Track & recomposite into background sequence
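Planar perspective warping maps image coordinates through a 3x3 homography. The sketch below only shows how such a matrix acts on pixel coordinates; how the matrix is estimated is not stated on the slide, so `H_shift` and the helper `apply_homography` are illustrative assumptions.

```python
import numpy as np

def apply_homography(H, points):
    """Map (N, 2) pixel coordinates through a 3x3 planar perspective transform."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((pts.shape[0], 1))])  # to homogeneous coordinates
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]                 # perspective divide

# Hypothetical example: a pure translation by (5, -2) pixels.
H_shift = np.array([[1.0, 0.0, 5.0],
                    [0.0, 1.0, -2.0],
                    [0.0, 0.0, 1.0]])
corners = np.array([[0.0, 0.0], [10.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
warped_corners = apply_homography(H_shift, corners)
```

Removing head movement would then mean warping every frame with the inverse of the transform that describes the head's motion relative to a reference frame.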


/SIL B B B AE AE JH JH SIL SIL/

Tracking & Compositing

Outline

• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation
• More Results

Video Rewrite (Bregler, Covell, Slaney 1997)

Hello: /H-E-L/ + /E-L-OW/

• Triphone basis units
• Reorder them to new utterance
• Pixel blending at join points

Coarticulation: /utu/ vs /iti/

• Sampling coarticulation: 20000 triphones ~ 3 hrs!

Video Rewrite Issues (Bregler, Covell, Slaney 1997)

• Model of speech is entire video corpus
  - No capacity to learn/model/distill
  - Not a parsimonious representation

• Poor capacity for novel image synthesis
  - Poor smoothing at join points
  - Cannot stretch/shrink to match audio
  - Discrete number of paths
  - Cannot fill in missing data

Outline

• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation
• More Results

Extracting Prototypes

46 prototypes extracted using PCA and K-means clustering
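One way to picture the PCA plus K-means step is sketched below. It is not the original implementation: the toy data, the number of components, and the choice to take the real frame nearest each cluster center as the prototype are all assumptions of this example, and the helper names are hypothetical.

```python
import numpy as np

def pca_project(X, n_components):
    """Project rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def kmeans(Z, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    for _ in range(n_iter):
        d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):      # an empty cluster keeps its old center
                centers[j] = Z[labels == j].mean(axis=0)
    return centers, labels

def pick_prototypes(X, n_components, k):
    """Return, for each cluster center, the index of the nearest real frame."""
    Z = pca_project(X, n_components)
    centers, _ = kmeans(Z, k)
    d = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=0)

# Toy stand-in for flattened mouth images: 200 frames of 64 pixels each.
frames = np.random.default_rng(1).normal(size=(200, 64))
prototype_idx = pick_prototypes(frames, n_components=5, k=4)
```

With the real corpus, `k` would be 46 and each row of `frames` would be one flattened mouth image.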

Multidimensional Morphable Model

[Figure: prototype images $I_1, \ldots, I_4$, optical flows $C_2, \ldots, C_4$, and model parameters $(\alpha, \beta)$]

MMM Background

Tommy Poggio/MIT

David Beymer

Mike Jones

Vinay Kumar

Volker Blanz/MPI Saarbrücken

Thomas Vetter/University of Basel

Tim Cootes/Manchester

Michael Black/Brown

1D Morphing (Beier & Neely 1992)

$$\beta_1 \, \mathrm{WARP}(I_1, \alpha_1 F_1) + \beta_2 \, \mathrm{WARP}(I_2, \alpha_2 F_2)$$

[Plots: weights ranging between 0 and 1]

Optical Flow

C = {dx(x,y), dy(x,y)}


(Beymer, Shashua, Poggio 93) (Chen & Williams 93)

1D Morphing w/ Optical Flow

• Forward warping A to B
• Forward warping B to A
• Blending
• Hole filling

Parameterize using $(\alpha, \beta)$
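The four steps listed on this slide (forward warp A to B, forward warp B to A, blend, fill holes) can be sketched roughly as follows. This is a simplified illustration, not the original code: it uses nearest-pixel forward warping, a very naive row-wise hole filler, and a single scalar `alpha` (warp position) and `beta` (blend weight) for the two-image case. All function names are hypothetical.

```python
import numpy as np

def forward_warp(img, flow):
    """Push each pixel of img along flow (dx, dy); returns (warped, filled_mask)."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xd = np.rint(xs + flow[..., 0]).astype(int)
    yd = np.rint(ys + flow[..., 1]).astype(int)
    ok = (xd >= 0) & (xd < w) & (yd >= 0) & (yd < h)   # drop pixels leaving the frame
    warped = np.zeros_like(img, dtype=float)
    filled = np.zeros((h, w), dtype=bool)
    warped[yd[ok], xd[ok]] = img[ys[ok], xs[ok]]
    filled[yd[ok], xd[ok]] = True
    return warped, filled

def fill_holes(warped, filled):
    """Naive hole filling: copy the nearest filled pixel to the left in each row."""
    out = warped.copy()
    for y in range(out.shape[0]):
        last = None
        for x in range(out.shape[1]):
            if filled[y, x]:
                last = out[y, x]
            elif last is not None:
                out[y, x] = last
    return out

def morph(img_a, img_b, flow_ab, flow_ba, alpha, beta):
    """Warp A a fraction alpha toward B, B a fraction (1 - alpha) toward A, then blend."""
    wa, fa = forward_warp(img_a, alpha * flow_ab)
    wb, fb = forward_warp(img_b, (1.0 - alpha) * flow_ba)
    return (1.0 - beta) * fill_holes(wa, fa) + beta * fill_holes(wb, fb)
```

Setting `alpha = beta = 0` returns image A and `alpha = beta = 1` returns image B; intermediate values give the in-between frames.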

MMM Definition

46 image prototypes from corpus

46 optical flows between prototypes

[Figure: prototype images $I_1, \ldots, I_4$ and optical flows $C_2, \ldots, C_4$]

$\alpha$ is 46-dimensional

$\beta$ is 46-dimensional

MMM Synthesis

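[Figure: prototype images $I_1, \ldots, I_4$ and optical flows $C_2, \ldots, C_4$]

$$C_1^{synth} = \sum_{i=1}^{N} \alpha_i C_i$$

$$C_i^{synth} = W(C_1^{synth} - C_i, \; C_i)$$

$$I_i^{warp} = W(I_i, \; C_i^{synth})$$

$$I^{morph}(\alpha, \beta) = \sum_{i=1}^{N} \beta_i I_i^{warp}$$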
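The synthesis equations on this slide can be traced step by step in a small sketch. It assumes that $W(X, C)$ is a forward warp of the field $X$ along the flow $C$, that each $C_i$ is the flow from $I_1$ to $I_i$ (so the first flow is zero), and that holes are simply left at zero; none of these details are spelled out on the slide, and the function names are hypothetical.

```python
import numpy as np

def forward_warp(field, flow):
    """Push every pixel of `field` (H x W or H x W x K) along `flow` (H x W x 2)."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xd = np.rint(xs + flow[..., 0]).astype(int)
    yd = np.rint(ys + flow[..., 1]).astype(int)
    ok = (xd >= 0) & (xd < w) & (yd >= 0) & (yd < h)
    out = np.zeros(field.shape, dtype=float)        # holes stay zero in this sketch
    out[yd[ok], xd[ok]] = field[ys[ok], xs[ok]]
    return out

def mmm_synthesize(images, flows, alpha, beta):
    """images: (N, H, W); flows: (N, H, W, 2); alpha, beta: (N,) weight vectors."""
    c1_synth = np.tensordot(alpha, flows, axes=1)                 # sum_i alpha_i C_i
    morph = np.zeros(images.shape[1:], dtype=float)
    for i in range(len(images)):
        ci_synth = forward_warp(c1_synth - flows[i], flows[i])    # W(C1_synth - C_i, C_i)
        morph += beta[i] * forward_warp(images[i], ci_synth)      # beta_i * W(I_i, Ci_synth)
    return morph

# Toy example: two one-row "images" whose bright pixel sits 2 pixels apart.
images = np.zeros((2, 1, 8)); images[0, 0, 2] = 1.0; images[1, 0, 4] = 1.0
flows = np.zeros((2, 1, 8, 2)); flows[1, ..., 0] = 2.0            # flow from I_1 to I_2
at_second = mmm_synthesize(images, flows, np.array([0.0, 1.0]), np.array([0.5, 0.5]))
at_first = mmm_synthesize(images, flows, np.array([1.0, 0.0]), np.array([0.5, 0.5]))
```

In the toy example, `alpha` moves the bright pixel between the two prototype positions while `beta` only controls how the two warped textures are mixed.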

Fine, but what about speech?

Mary101 Speech Model

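[Figure: MMM space $(\alpha, \beta)$ spanned by prototypes $I_1, \ldots, I_4$ and flows $C_2, \ldots, C_4$, with phoneme clusters /SIL/, /F/, /AE/]

• Each phoneme represents a cluster in MMM space
• The speech trajectory passes close to the clusters but is also smooth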

MMM Analysis

[Figure: prototype images $I_1, \ldots, I_4$ and optical flows $C_2, \ldots, C_4$]

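MMM Analysis (Cont'd)

[Figure: prototype images $I_1, \ldots, I_4$, optical flows $C_2, \ldots, C_4$, novel image $I_{novel}$ and its flow $C_{novel}$]

$$\min_{\alpha} \left\| C_{novel} - \sum_{i=1}^{N} \alpha_i C_i \right\|$$

Re-orient + Warp

$$\min_{\beta} \left\| I_{novel} - \sum_{i=1}^{N} \beta_i I_i^{warped} \right\| \quad \text{subject to} \quad \beta_i > 0 \;\; \forall i, \quad \sum_{i=1}^{N} \beta_i = 1$$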
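The two fitting problems on this slide could be solved in many ways. The sketch below uses ordinary least squares for the flow weights and projected gradient descent onto the probability simplex for the texture weights. The solver choice is an assumption of this example (the slide does not name one), the projection allows weights equal to zero rather than strictly positive, and all names and toy data are hypothetical.

```python
import numpy as np

def fit_alpha(proto_flows, novel_flow):
    """Unconstrained least squares for the flow weights alpha."""
    A = proto_flows.reshape(len(proto_flows), -1).T       # one column per prototype
    alpha, *_ = np.linalg.lstsq(A, novel_flow.ravel(), rcond=None)
    return alpha

def project_to_simplex(v):
    """Euclidean projection onto {b : b_i >= 0, sum(b) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def fit_beta(warped_protos, novel_image, n_iter=500):
    """Least squares for the texture weights beta, constrained to the simplex."""
    A = warped_protos.reshape(len(warped_protos), -1).T
    b = novel_image.ravel()
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)        # 1 / Lipschitz constant
    beta = np.full(A.shape[1], 1.0 / A.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ beta - b)
        beta = project_to_simplex(beta - step * grad)
    return beta

# Toy data: three random "prototypes" and a novel sample built from known weights.
rng = np.random.default_rng(0)
true_w = np.array([0.2, 0.5, 0.3])
proto_flows = rng.normal(size=(3, 8, 8, 2))
proto_images = rng.normal(size=(3, 8, 8))
alpha = fit_alpha(proto_flows, np.tensordot(true_w, proto_flows, axes=1))
beta = fit_beta(proto_images, np.tensordot(true_w, proto_images, axes=1))
```

On this synthetic data both fits recover the weights used to build the novel sample; on real images the residual would not be zero.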

MMM Analysis Parameters

[Plots: flow and texture parameters for the words “badge” and “lavish”]

Comparison of Real and Synthesized Images

[Image pairs: Real vs. Synthetic]

• Tongue is not perfect
• Slight blurring

Analysis of Entire Recorded Corpus

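[Figure: each frame of the video corpus is analyzed with the MMM (prototype images $I_1, \ldots, I_4$, optical flows $C_2, \ldots, C_4$)]

$$z_1 = (\alpha_1, \beta_1), \quad z_2 = (\alpha_2, \beta_2), \quad \ldots, \quad z_{30000} = (\alpha_{30000}, \beta_{30000})$$

Video Corpus: /b/ /jh/ /ae/ …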

Phonetic Clusters

Represent each phone $p$ with $\mu_p$, $\Sigma_p$

One set for flows, another set for textures

[Figure: clusters for /t/, /w/, /m/, /aa/, /b/]
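Computing a mean and covariance per phone from labelled frames might look like the sketch below. Using a diagonal covariance is an assumption of this example, and the function name and toy data are hypothetical; in practice the same routine would be run once on the flow parameters and once on the texture parameters.

```python
import numpy as np

def phone_statistics(params, labels):
    """params: (T, D) MMM parameters per frame; labels: length-T phone labels."""
    labels = np.asarray(labels)
    stats = {}
    for phone in np.unique(labels):
        rows = params[labels == phone]
        mu = rows.mean(axis=0)
        var = rows.var(axis=0)          # per-dimension variance (diagonal covariance)
        stats[str(phone)] = (mu, np.diag(var))
    return stats

# Toy example: three frames with 2-D parameters.
toy_params = np.array([[0.0, 0.0], [2.0, 2.0], [4.0, 0.0]])
toy_stats = phone_statistics(toy_params, ["b", "b", "ae"])
```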

Trajectory Synthesis

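$$\min_{y} \; \underbrace{(y - \mu)^T \Sigma^{-1} (y - \mu)}_{\text{Phonetic Targets}} + \underbrace{\lambda \, \|\Delta y\|^2}_{\text{Smoothness}}$$

$$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}, \quad
\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_T \end{bmatrix}, \quad
\Sigma = \begin{bmatrix} \Sigma_{P_1} & & & \\ & \Sigma_{P_2} & & \\ & & \ddots & \\ & & & \Sigma_{P_T} \end{bmatrix}$$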

/SIL B B B AE AE JH JH SIL SIL/

Smoothness

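Higher orders of smoothness: $\Delta\Delta$, $\Delta\Delta\Delta$, … (order 2, 3, …)

$$\Delta = \begin{bmatrix} -I & I & & & \\ & -I & I & & \\ & & \ddots & \ddots & \\ & & & -I & I \end{bmatrix}$$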
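For a single scalar parameter track, minimizing $(y - \mu)^T \Sigma^{-1} (y - \mu) + \lambda \|\Delta^k y\|^2$ has a closed-form solution: setting the gradient to zero gives $(\Sigma^{-1} + \lambda (\Delta^k)^T \Delta^k)\, y = \Sigma^{-1} \mu$. The sketch below solves that linear system with NumPy. The target values, unit variances, and $\lambda$ are made up for illustration, and the function name is hypothetical.

```python
import numpy as np

def synthesize_trajectory(mu, var, lam, order=1):
    """Minimize (y-mu)^T Sigma^{-1} (y-mu) + lam * ||Delta^order y||^2 for one scalar track."""
    T = len(mu)
    sigma_inv = np.diag(1.0 / np.asarray(var, dtype=float))
    # Rows of np.diff(I) are [-1, 1] differences; n=order stacks the operator.
    delta = np.diff(np.eye(T), n=order, axis=0)           # (T - order) x T
    lhs = sigma_inv + lam * delta.T @ delta
    return np.linalg.solve(lhs, sigma_inv @ np.asarray(mu, dtype=float))

# Hypothetical per-phone targets for one MMM parameter.
targets = {"SIL": 0.0, "B": -1.0, "AE": 2.0, "JH": 0.5}
phones = "SIL B B B AE AE JH JH SIL SIL".split()
mu = np.array([targets[p] for p in phones])
var = np.ones(len(phones))
y = synthesize_trajectory(mu, var, lam=1.0, order=1)
```

With `lam=0` the result is the piecewise-constant target sequence itself; larger `lam` or a higher `order` trades closeness to the targets for smoothness.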

Setting $\Delta$, $\lambda$

Cross-validation:

• flow: $\Delta$ order 4, $\lambda = 250$: septic splines
• texture: $\Delta$ order 5, $\lambda = 100$: nintic splines

Setting Phonetic Clusters

Use sample estimates?

/t/

/b/

Problem: Underarticulation!

Adjusting Phonetic Clusters

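Use gradient descent to tweak the phonetic clusters

Compare synthesized trajectory $y = \{\alpha_t, \beta_t\}$ with original trajectory $z = \{\alpha_t, \beta_t\}$

$$E = (z - y)^T (z - y)$$

$$\frac{\partial E}{\partial \mu_i} = \frac{\partial E}{\partial y} \, \frac{\partial y}{\partial \mu_i}$$

$$\mu^{new} = \mu^{old} - \eta \, \frac{\partial E}{\partial \mu}$$

[Figure: clusters /t/ and /b/]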
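The update rule on this slide can be tried out on a toy scalar track. In the sketch below the synthesized trajectory is a linear function of the per-phone means, $y = J\mu$, so $\partial E / \partial \mu = -2 J^T (z - y)$. Only the means are adjusted here, the step size $\eta$ is chosen from the largest singular value of $J$ so the descent is stable, and the data, $\lambda$, and function names are assumptions of this example rather than values from the talk.

```python
import numpy as np

def synthesis_matrix(frame_phone, n_phones, var, lam):
    """Return J such that the synthesized track is y = J @ mu (first-order smoothness)."""
    T = len(frame_phone)
    P = np.eye(n_phones)[frame_phone]                     # (T, K) one-hot phone labels
    sigma_inv = np.diag(1.0 / var[frame_phone])
    delta = np.diff(np.eye(T), axis=0)
    M = np.linalg.solve(sigma_inv + lam * delta.T @ delta, sigma_inv)
    return M @ P

def train_phone_means(z, J, mu_init, n_iter=2000):
    """Gradient descent on E = (z - y)^T (z - y) with y = J @ mu."""
    eta = 1.0 / (2.0 * np.linalg.norm(J, 2) ** 2)         # stable step size
    mu = mu_init.copy()
    for _ in range(n_iter):
        grad = -2.0 * J.T @ (z - J @ mu)                  # dE/dmu via the chain rule
        mu = mu - eta * grad
    return mu

# Toy "original" track z for /SIL B B B AE AE JH JH SIL SIL/ (phones 0..3).
frame_phone = np.array([0, 1, 1, 1, 2, 2, 3, 3, 0, 0])
z = np.array([0.0, -1.5, -2.0, -1.5, 2.5, 3.0, 0.5, 1.0, 0.0, 0.1])
J = synthesis_matrix(frame_phone, 4, np.ones(4), lam=1.0)
mu_sample = np.array([z[frame_phone == k].mean() for k in range(4)])  # sample estimates
mu_trained = train_phone_means(z, J, mu_sample)
```

Starting from the sample means, the trained means give a synthesized track that is at least as close to `z`, which is the effect the training step is meant to have on underarticulation.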

Phones Before/After Training

[Figure: /t/ and /b/ clusters, before and after training]

Trajectories Before/After Training

[Plots: trajectories of $\alpha_{12}$ and $\beta_{28}$]

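Coarticulation Model

[Figure: MMM space with prototype images $I_1, \ldots, I_4$, optical flows $C_2, \ldots, C_4$, and clusters /B/, /U/, /T/, /I/]

Coarticulation is controlled by the width of the cluster regions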

Coarticulation

/utu/

/iti/ /ata/

/ubu/

/ibi/ /aba/

Big Picture

Analysis:
• Pre-process
• Construct MMM $\{I_i, C_i\}$
• Analyze corpus: $\{\alpha_t, \beta_t\}$
• Train phonetic models $\{\mu_p, \Sigma_p\}$ (/SIL/, /F/, /AE/)

Synthesis:
• /SIL B B B AE AE JH JH SIL SIL/
• Trajectory synthesis: $\{\alpha_t, \beta_t\}$
• MMM: Synthesize!
• Post-process

Results

Mary101:

8 minutes of training data

1-syllable words: 132 training / 20 test

2-syllable words: 136 training / 20 test

46-prototype MMM

Sentences not even included in training.

Comments So Far

“She looks like she’s been Botox’ed” -- Nobel Laureate

“Has she had a frontal lobotomy?” -- ATT executive

Send your comments to [email protected]

Visual Turing Tests

We win!

Experiment             % correct   P<
Single presentation    52.1%       0.3
Double presentation    46.6%       0.5

Visual Intelligibility

Still some work to do…

Correct Phoneme ID

Experiment     % correct on N   % correct on S   P<
Words+Sents    30.01%           21.19%           0.001
Words          38.55%           28.07%           0.001
Sents          24.38%           16.52%           0.01

Stay Tuned!

Acknowledgments:

Association Christian Benoit
NSF
NTT
ITRI

Mary101

Dynasty Models
Craig Milanesi
Dave Konstine
Joanne Flood
Jay Benoit
Marypat Fitzgerald
Casey Johnson
Vinay Kumar
Sayan Mukherjee
Chao Wang
Adlar Kim
Danielle Suh
Osamu Yoshimi
Volker Blanz
Thomas Vetter
Demetri Terzopoulos
Jenny Shapiro/BMG
Rehema Ellis/NBC
Kevin Chang