Trainable Videorealistic Speech Animation9.520/spring10/Classes/tony_9520class0… · ·...

Trainable Videorealistic Speech Animation

Tony Ezzat, Gadi Geiger, Tomaso Poggio

CBCL / AI Lab, MIT

Outline

• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation

Overview

Video Database

“Air” “Badge”

Visual Speech Processing

“Badge”

2 Themes: Videorealism, Machine Learning

Mary101

Audio Analysis

Video Database

“Badge”

Audio Database

Audio is also recorded to help label the video

Audio Synthesis

Video Database

“Badge”

Audio Database

“Badge”

Audio Speech Processing (X)

No Audio Synthesis!

What is the Input REALLY?

“Badge”

Input: Phone Stream

/SIL B B B AE AE JH JH SIL SIL/ Real Audio

Speech Recognition

Forced Viterbi Alignment

Manual Labelling

“Badge”
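A per-frame phone stream like the one on this slide can be derived from a time-aligned phonetic transcript. The following is a minimal sketch, assuming the aligner returns (phone, start, end) segments in seconds. The function name `phone_stream`, the segment times, and the 10 fps frame rate are illustrative assumptions, not values from the slides.

```python
def phone_stream(segments, fps):
    """Expand (phone, start_sec, end_sec) segments into one label per video frame."""
    if not segments:
        return []
    n_frames = round(segments[-1][2] * fps)
    stream = []
    for frame in range(n_frames):
        t = (frame + 0.5) / fps  # sample each frame at its midpoint
        for phone, start, end in segments:
            if start <= t < end:
                stream.append(phone)
                break
    return stream


# Hypothetical alignment for the word "badge"
badge = [("SIL", 0.0, 0.1), ("B", 0.1, 0.4), ("AE", 0.4, 0.6),
         ("JH", 0.6, 0.8), ("SIL", 0.8, 1.0)]
print(" ".join(phone_stream(badge, fps=10)))  # SIL B B B AE AE JH JH SIL SIL
```

Sampling at frame midpoints keeps the labels stable when a segment boundary falls exactly on a frame edge.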

Pre-Processing and Post-Processing

• Remove head movement using planar perspective warping
• Mask out mouth
• Track & recomposite into background sequence

/SIL B B B AE AE JH JH SIL SIL/

Tracking & Compositing

Outline

• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation
• More Results

Video Rewrite (Bregler, Covell, Slaney 1997)

Hello: /H-E-L/ + /E-L-OW/

Triphone basis units
Reorder them to new utterance
Pixel blending at join points

Coarticulation: /utu/

• Sampling coarticulation: 20000 triphones, ~3 hrs!

Video Rewrite Issues (Bregler, Covell, Slaney 1997)

• Model of speech is entire video corpus
  - No capacity to learn/model/distill
  - Not a parsimonious representation

• Poor capacity for novel image synthesis
  - Poor smoothing at join points
  - Cannot stretch/shrink to match audio
  - Discrete number of paths
  - Cannot fill in missing data

Outline

• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation
• More Results

Extracting Prototypes

46 prototypes extracted using PCA and K-means clustering
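One way to picture the prototype extraction is the sketch below. It assumes the frames are flattened to vectors, projected onto a few principal components, clustered with plain K-means, and that the real frame nearest each cluster center is kept as a prototype. The function names, the number of components, and the nearest-frame rule are assumptions for illustration only.

```python
import numpy as np


def pca_project(images, n_components):
    """Project flattened images (one per row) onto their top principal components."""
    centered = images - images.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def kmeans(points, k, n_iters=50, seed=0):
    """Plain Lloyd iterations; returns the cluster centers."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # leave an empty cluster's center where it is
                centers[j] = members.mean(axis=0)
    return centers


def pick_prototypes(images, k, n_components):
    """Return indices of the real frames closest to each K-means center."""
    coords = pca_project(images, n_components)
    centers = kmeans(coords, k)
    dists = np.linalg.norm(coords[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=0)
```

With the corpus frames stacked as rows of `images`, a call such as `pick_prototypes(images, k=46, n_components=15)` would return 46 frame indices; the value 15 is arbitrary here.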

Multidimensional Morphable Model (MMM)

(α, β)

MMM Background

Tommy Poggio/MIT, David Beymer, Mike Jones, Vinay

Volker Blanz/MPI Saarbrücken

Thomas Vetter/University of Basel

Tim Cootes/Manchester

Michael Black/Brown

1D Morphing

(Beier & Neely 1992)

\[ \beta_1 \, \mathrm{WARP}(I_1, \alpha_1 F_1) + \beta_2 \, \mathrm{WARP}(I_2, \alpha_2 F_2) \]

Optical Flow

C = {dx(x,y), dy(x,y)}

Optical Flow

(Beymer, Shashua, Poggio 93) (Chen & Williams 93)

1D Morphing w/Optical Flow

Forward warping A to B

Forward warping B to A

Blending

Hole filling

Parameterize using (α, β)
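The four steps on this slide (forward warping A to B, forward warping B to A, blending, hole filling) can be sketched on 1-D signals instead of images. This is a toy illustration under stated assumptions: flows are per-sample displacements, holes are filled by linear interpolation, and `alpha` controls warp position while `beta` controls the blend. None of the function names come from the slides.

```python
import numpy as np


def forward_warp(signal, flow, scale):
    """Push sample i to position i + scale * flow[i]; unfilled slots stay NaN."""
    out = np.full(signal.shape, np.nan)
    targets = np.rint(np.arange(signal.size) + scale * flow).astype(int)
    valid = (targets >= 0) & (targets < signal.size)
    out[targets[valid]] = signal[valid]
    return out


def fill_holes(warped):
    """Fill NaN holes by interpolating between the nearest filled samples."""
    idx = np.arange(warped.size)
    known = ~np.isnan(warped)
    return np.interp(idx, idx[known], warped[known])


def morph(a, b, flow_ab, flow_ba, alpha, beta):
    """Warp A toward B by alpha, B toward A by 1 - alpha, then blend with beta."""
    warp_a = fill_holes(forward_warp(a, flow_ab, alpha))
    warp_b = fill_holes(forward_warp(b, flow_ba, 1.0 - alpha))
    return (1.0 - beta) * warp_a + beta * warp_b


# Toy example: a bump at index 2 morphing into a bump at index 6
a = np.zeros(10)
a[2] = 1.0
b = np.zeros(10)
b[6] = 1.0
halfway = morph(a, b, np.full(10, 4.0), np.full(10, -4.0), alpha=0.5, beta=0.5)
print(int(halfway.argmax()))  # 4
```

Keeping `alpha` and `beta` separate mirrors the (α, β) parameterization: shape and texture can be interpolated independently.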

MMM Definition

46 Image prototypes from Corpus

46 optical flows between prototypes

α is 46-dimensional; β is 46-dimensional

MMM Synthesis

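\[ C_1^{synth} = \sum_{i=1}^{N} \alpha_i C_i \]

\[ C_i^{synth} = W(C_1^{synth} - C_i,\; C_i) \]

\[ I_i^{warp} = W(I_i,\; C_i^{synth}) \]

\[ I^{morph}(\alpha, \beta) = \sum_{i=1}^{N} \beta_i I_i^{warp} \]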

Fine, but what about speech?

Mary101 Speech Model

/SIL/ /F/ /AE/

Each phoneme represents a cluster in MMM space (α, β)

Speech trajectory passes close to the clusters but is also smooth

MMM Analysis

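MMM Analysis (Cont'd)

Novel image $I^{novel}$, novel flow $C^{novel}$

\[ \alpha = \arg\min_{\alpha} \Big\| C^{novel} - \sum_{i} \alpha_i C_i \Big\|^2 \]

Re-orient + Warp

\[ \beta = \arg\min_{\beta} \Big\| I^{novel} - \sum_{i} \beta_i I_i^{warped} \Big\|^2 \quad \text{subject to } \beta_i > 0 \;\; \forall i \]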
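MMM analysis recovers the parameters of a novel image. The sketch below covers only the flow parameters and assumes they are found by an unconstrained least-squares fit of the prototype flows to the novel flow. The function name and array layout are hypothetical, and the constrained fit for the texture parameters is not shown.

```python
import numpy as np


def fit_flow_weights(novel_flow, prototype_flows):
    """Least-squares alpha minimizing ||novel_flow - sum_i alpha_i * C_i||^2.

    prototype_flows has shape (N, 2, H, W): N flow fields with dx and dy planes.
    """
    n = prototype_flows.shape[0]
    basis = prototype_flows.reshape(n, -1).T  # one column per prototype flow
    alpha, *_ = np.linalg.lstsq(basis, novel_flow.ravel(), rcond=None)
    return alpha
```

If the prototype flows are linearly dependent, `np.linalg.lstsq` returns the minimum-norm solution, which is one reasonable tie-break but not necessarily the one used in the original system.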

MMM Analysis Parameters

lavish

Texture

Comparison of Real and Synthesized Images

Tongue is not perfect

Slight blurring

Real | Synthetic | Real | Synthetic

Analysis of Entire Recorded Corpus

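\[
\begin{aligned}
z_1 &= (\alpha_1, \beta_1) \\
z_2 &= (\alpha_2, \beta_2) \\
&\;\;\vdots \\
z_{30000} &= (\alpha_{30000}, \beta_{30000})
\end{aligned}
\]

Video Corpus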

/b/ /jh/ /ae/

Phonetic Clusters

Represent each phone with $\mu_p$, $\Sigma_p$

One set for flows, another set for textures
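Sample estimates of the per-phone cluster parameters could be computed as in the sketch below. It assumes each analyzed frame carries a phone label and that covariances are kept diagonal; both the diagonal choice and the function name are assumptions for illustration. The same routine would be run once on the flow parameters and once on the texture parameters.

```python
import numpy as np


def phone_statistics(params, phone_labels):
    """Return {phone: (mean, diagonal covariance)} from per-frame MMM parameters."""
    labels = np.asarray(phone_labels)
    stats = {}
    for phone in sorted(set(phone_labels)):
        rows = params[labels == phone]  # all frames labeled with this phone
        stats[phone] = (rows.mean(axis=0), np.diag(rows.var(axis=0)))
    return stats
```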

Trajectory Synthesis

\[ \min_{y} \; \underbrace{(y - \mu)^T \Sigma^{-1} (y - \mu)}_{\text{Phonetic Targets}} \; + \; \lambda \, \underbrace{\| \Delta y \|^2}_{\text{Smoothness}} \]

/SIL B B B AE AE JH JH SIL SIL/
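Because the objective is quadratic in the trajectory, setting its gradient to zero gives the linear system (Σ⁻¹ + λ ΔᵀΔ) y = Σ⁻¹ μ. The sketch below solves it for a single scalar parameter per frame, assuming diagonal covariances and a first-order difference operator. The function name and the toy target values are illustrative assumptions.

```python
import numpy as np


def synthesize_trajectory(targets, variances, lam):
    """Solve (Sigma^-1 + lam * D^T D) y = Sigma^-1 mu for a 1-D parameter track."""
    mu = np.asarray(targets, dtype=float)
    inv_cov = np.diag(1.0 / np.asarray(variances, dtype=float))
    delta = np.diff(np.eye(mu.size), axis=0)  # rows of the form (-1, 1)
    lhs = inv_cov + lam * delta.T @ delta
    return np.linalg.solve(lhs, inv_cov @ mu)


# Hypothetical per-frame targets for /SIL B B B AE AE JH JH SIL SIL/
targets = np.array([0.0, 1.0, 1.0, 1.0, 3.0, 3.0, 2.0, 2.0, 0.0, 0.0])
smooth = synthesize_trajectory(targets, np.ones(10), lam=5.0)
```

With `lam=0` the solution reproduces the targets exactly; larger values pull the trajectory away from the targets toward a smoother curve.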

Smoothness

\[ \Delta = \begin{bmatrix} -1 & 1 & & \\ & -1 & 1 & \\ & & \ddots & \ddots \end{bmatrix} \]

Higher orders of smoothness: ΔΔ, ΔΔΔ, … (order 2, 3, …)
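Higher-order smoothness operators can be built by applying the first-order difference repeatedly. A minimal sketch, assuming a scalar parameter per frame and a hypothetical helper name:

```python
import numpy as np


def difference_operator(n_frames, order):
    """Order-k difference matrix of shape (n_frames - order, n_frames)."""
    op = np.eye(n_frames)
    for _ in range(order):
        op = np.diff(op, axis=0)  # each pass differences adjacent rows
    return op


first = difference_operator(5, order=1)   # rows of the form (-1, 1)
second = difference_operator(5, order=2)  # rows of the form (1, -2, 1)
```

An order-k operator maps any polynomial of degree below k to zero, so raising the order penalizes only higher-frequency wiggles in the trajectory.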

Setting

Cross-validation:

flow: order 4, λ = 250: septic splines
texture: order 5, λ = 100: nonic splines

Setting Phonetic Clusters

Use sample estimates?

Problem: Underarticulation!

Adjusting Phonetic Clusters

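Use gradient descent to tweak the phonetic clusters

\[ E = (z - y)^T (z - y) \]

\[ \frac{\partial E}{\partial \mu} = \frac{\partial E}{\partial y} \, \frac{\partial y}{\partial \mu} \]

\[ \mu^{new} = \mu^{old} - \eta \, \frac{\partial E}{\partial \mu} \]

Compare synthesized trajectory $y = \{\alpha_t, \beta_t\}$ with original trajectory $z = \{\alpha_t, \beta_t\}$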
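The training loop can be illustrated on a toy problem. The sketch below assumes a scalar parameter per frame, identity covariances, and a first-order smoothness term, so the synthesized trajectory is a fixed linear function of the phone means and the gradient of E is available in closed form. Only the means are adjusted here; the step size, iteration count, and all names are illustrative assumptions.

```python
import numpy as np


def synthesis_matrix(n_frames, lam):
    """With identity covariance, y = (I + lam * D^T D)^-1 @ per-frame targets."""
    delta = np.diff(np.eye(n_frames), axis=0)
    return np.linalg.inv(np.eye(n_frames) + lam * delta.T @ delta)


def train_phone_means(z, frame_phones, init_means, lam, eta=0.05, n_steps=200):
    """Gradient descent on E = (z - y)^T (z - y) with respect to the phone means."""
    phones = sorted(init_means)
    # assign[t, p] = 1 when frame t is labeled with phone p
    assign = np.array([[1.0 if f == p else 0.0 for p in phones]
                       for f in frame_phones])
    means = np.array([init_means[p] for p in phones], dtype=float)
    b = synthesis_matrix(len(z), lam) @ assign  # y = b @ means
    errors = []
    for _ in range(n_steps):
        residual = z - b @ means
        errors.append(float(residual @ residual))
        grad = -2.0 * b.T @ residual            # dE/dmeans via the chain rule
        means = means - eta * grad
    return dict(zip(phones, means)), errors


# Hypothetical original trajectory and sample-estimate starting means
frame_phones = "SIL B B B AE AE JH JH SIL SIL".split()
z = np.array([0.0, 1.0, 1.0, 1.0, 3.0, 3.0, 2.0, 2.0, 0.0, 0.0])
init = {"SIL": 0.0, "B": 1.0, "AE": 3.0, "JH": 2.0}
trained, errors = train_phone_means(z, frame_phones, init, lam=2.0)
```

Starting from the sample means, the smoothed trajectory undershoots the original, and the updates move the means so that the synthesized trajectory matches the recorded one more closely.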

Phones Before/After Training

/b/ before

Trajectories Before/After Training

Coarticulation

/B/ /U/ /T/

Coarticulation controlled by width of cluster regions

Coarticulation

/iti/ /ata/

/ibi/ /aba/

Big Picture

Trajectory Synthesis

Construct $\{I_i, C_i\}$

Analyze Corpus $\{\alpha_t, \beta_t\}$

Train phonetic models $\{\mu_p, \Sigma_p\}$

/SIL/ /F/

Post-process

Pre-process

/SIL B B B AE AE JH JH SIL SIL/

Synthesize! $\{\alpha_t, \beta_t\}$

Results

Mary101:

8 minutes of training data

1-syllable words: 132 training/20 test
2-syllable words: 136 training/20 test

46-prototype MMM

Sentences not even included in training.

Comments So Far

“She looks like she’s been Botox’ed” -- Nobel Laureate

“Has she had a frontal lobotomy?” -- ATT executive

Send your comments to tonebone@ai.mit.edu

Visual Turing Tests

We win!

Experiment             % correct   P<
Single presentation    52.1%       0.3
Double presentation    46.6%       0.5

Visual Intelligibility

Still some work to do…

Correct Phoneme ID

Experiment     %correct on N   %correct on S   P<
Words+Sents    30.01%          21.19%          0.001
Words          38.55%          28.07%          0.001
Sents          24.38%          16.52%          0.01

Stay Tuned!

Acknowledgments: Association Christian Benoit, NSF, NTT, ITRI

Mary101
Dynasty Models
Craig Milanesi
Dave Konstine
Joanne Flood
Jay Benoit
Marypat Fitzgerald
Casey Johnson
Vinay
Mukherjee
Chao Wang
Kim
Danielle Suh
Osamu Yoshimi
Volker Blanz
Thomas Vetter
Demetri Terzopoulos
Jenny Shapiro/BMG
Rehema Ellis/NBC
Kevin Chang
