Trainable Videorealistic Speech Animation
Tony Ezzat, Gadi Geiger, Tomaso Poggio
CBCL / AI Lab, MIT
Outline
• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation
Overview
[Diagram: Video Database (“Air”, “Badge”) → Visual Speech Processing → “Badge”]
2 Themes: Videorealism, Machine Learning
Mary101
Audio Analysis
[Diagram: Video Database and Audio Database (“Air”, “Badge”) → Visual Speech Processing → “Badge”]
Audio is also recorded, to help label the video
Audio Synthesis
[Diagram: Video Database (“Air”, “Badge”) → Visual Speech Processing → “Badge”; Audio Database → Audio Speech Processing → “Badge”, crossed out]
No Audio Synthesis!
What is the Input REALLY?
[Diagram: “Badge” → Visual Speech Processing]
Input: Phone Stream
/SIL B B B AE AE JH JH SIL SIL/ → Visual Speech Processing
[Diagram: the phone stream comes either from real audio (speech recognition, forced Viterbi alignment, or manual labelling) or from the text “Badge” via TTS]
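A phone stream like the one above has one label per video frame. As a minimal sketch, assuming the aligner or TTS front end returns timed segments, the per-frame stream could be built as follows. The function name `phone_stream`, the segment format, and the frame rate are illustrative assumptions, not part of the Mary101 system.

```python
def phone_stream(segments, fps):
    """Expand timed phone segments into one phone label per video frame.

    segments: list of (phone, start_sec, end_sec), assumed sorted and contiguous.
    fps: video frame rate (an assumption; use the rate of the recorded corpus).
    """
    n_frames = round(segments[-1][2] * fps)
    stream = []
    for frame in range(n_frames):
        t = (frame + 0.5) / fps  # time at the centre of this frame
        for phone, start, end in segments:
            if start <= t < end:
                stream.append(phone)
                break
        else:
            stream.append("SIL")  # frames not covered by any segment count as silence
    return stream


# Hypothetical alignment of the word "badge"
badge = [("SIL", 0.0, 0.1), ("B", 0.1, 0.4), ("AE", 0.4, 0.6),
         ("JH", 0.6, 0.8), ("SIL", 0.8, 1.0)]
print("/" + " ".join(phone_stream(badge, fps=10)) + "/")
```

Sampling at the frame centre avoids ambiguity when a segment boundary falls exactly on a frame edge.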
Pre- and Post-Processing
Pre-Processing: remove head movement using planar perspective warping
Post-Processing: mask out mouth; track & recomposite into background sequence
[Diagram: /SIL B B B AE AE JH JH SIL SIL/ → Visual Speech Processing → Tracking & Compositing]
Outline
• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation
• More Results
Video Rewrite (Bregler, Covell, Slaney 1997)
Hello: /H-E-L/ + /E-L-OW/
• Triphone basis units
• Reorder them to new utterance
• Pixel blending at join points
• Coarticulation: /utu/ vs /iti/
• Sampling coarticulation: 20000 triphones ~ 3 hrs!
Video Rewrite Issues (Bregler, Covell, Slaney 1997)
• Model of speech is entire video corpus
  No capacity to learn/model/distill
  Not a parsimonious representation
• Poor capacity for novel image synthesis
  Poor smoothing at join points
  Cannot stretch/shrink to match audio
  Discrete number of paths
  Cannot fill in missing data
Outline
• Problem Setting
• Previous Work
• Our Approach
• Results
• Evaluation
• More Results
Extracting Prototypes
46 prototypes extracted using PCA and K-means clustering
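The slide does not spell out how PCA and K-means are combined. One plausible reading, sketched here under that assumption, is to project the corpus images onto a few principal components, cluster the projections, and take the real image nearest each cluster centre as a prototype. All function names and the choice of `n_components` are illustrative.

```python
import numpy as np


def sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)


def pca_project(X, n_components):
    """Project mean-centred rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T


def kmeans(Z, k, n_iter=50, seed=0):
    """Plain Lloyd iterations; returns the k cluster centres."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = sq_dists(Z, centers).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centre if a cluster empties
                centers[j] = Z[labels == j].mean(axis=0)
    return centers


def extract_prototypes(X, k, n_components):
    """Return, for each cluster centre, the index of the nearest corpus image.

    X: (n_images, n_pixels) array of flattened mouth images.
    """
    Z = pca_project(X, n_components)
    centers = kmeans(Z, k)
    return sq_dists(centers, Z).argmin(axis=1)
```

Picking the nearest real image, rather than the centre itself, keeps every prototype a sharp frame from the corpus. With the slide's numbers this would be called with `k=46`.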
Multidimensional Morphable Model
[Figure: prototype images I_1 … I_4 linked by optical flows C_2 … C_4; the model is parameterized by (α, β)]
MMM Background
Tommy Poggio/MIT
David Beymer
Mike Jones
Vinay Kumar
Volker Blanz/MPI Saarbrücken
Thomas Vetter/University of Basel
Tim Cootes/Manchester
Michael Black/Brown
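1D Morphing (Beier & Neely 1992)
$\beta_1 \, \mathrm{WARP}(I_1, \alpha_1 F_1) + \beta_2 \, \mathrm{WARP}(I_2, \alpha_2 F_2)$
[Figure: images I_1 and I_2 with marked features (x), warped toward each other and blended]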
Optical Flow
C = {dx(x,y), dy(x,y)}
(Beymer, Shashua, Poggio 93) (Chen & Williams 93)
1D Morphing w/ Optical Flow
Forward warping A to B
Forward warping B to A
Blending
Hole filling
Parameterize using (α, β)
MMM Definition
46 image prototypes from corpus
[Figure: prototype images I_1 … I_4 linked by optical flows C_2 … C_4]
46 optical flows between prototypes
α is 46-dimensional
β is 46-dimensional
MMM Synthesis
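[Figure: prototype images I_1 … I_4 linked by optical flows C_2 … C_4, with the synthesized flow C_1^{synth}]
$C_1^{synth} = \sum_{i=1}^{N} \alpha_i C_i$
$C_i^{synth} = W(C_1^{synth} - C_i,\; C_i)$
$I_i^{warp} = W(I_i,\; C_i^{synth})$
$I^{morph}(\alpha, \beta) = \sum_{i=1}^{N} \beta_i I_i^{warp}$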
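The four synthesis steps can be sketched in NumPy as follows. This is an illustration only. The warp operator W is approximated by nearest-pixel forward splatting, hole filling is reduced to "keep the original value", and prototype 1 is assumed to be the reference with zero flow. The names `forward_warp` and `mmm_synthesize` are hypothetical.

```python
import numpy as np


def forward_warp(values, flow):
    """Push each pixel of `values` along `flow` (nearest-pixel splatting).

    values: (H, W) or (H, W, K) array; flow: (H, W, 2) array holding (dx, dy).
    Target pixels that receive nothing keep their original value, a crude
    stand-in for the hole-filling step.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    tx = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    ty = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    out = values.copy()
    out[ty, tx] = values[ys, xs]
    return out


def mmm_synthesize(images, flows, alpha, beta):
    """images: (N, H, W) prototypes; flows: (N, H, W, 2) flows C_i from the
    reference prototype to prototype i; alpha, beta: (N,) parameters."""
    c1_synth = np.tensordot(alpha, flows, axes=1)            # C_1^synth
    morph = np.zeros(images.shape[1:], dtype=float)
    for i in range(len(images)):
        # Re-orient the synthesized flow so that it starts at prototype i
        ci_synth = forward_warp(c1_synth - flows[i], flows[i])
        warped = forward_warp(images[i].astype(float), ci_synth)  # I_i^warp
        morph += beta[i] * warped                            # blend textures
    return morph
```

A useful sanity check: setting α and β to the same one-hot vector reproduces that prototype exactly, because its re-oriented flow is zero.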
Fine, but what about speech?
Mary101 Speech Model
[Figure: MMM space (α, β) with prototype images I_1 … I_4, flows C_2 … C_4, and phoneme clusters /SIL/, /F/, /AE/]
Each phoneme represents a cluster in MMM space
Speech is a trajectory that passes close to the clusters but is also smooth
MMM Analysis
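[Figure: prototype images I_1 … I_4 linked by optical flows C_2 … C_4]
MMM Analysis (Cont'd)
[Figure: novel image I_{novel} and its optical flow C_{novel} placed in the MMM]
$\min_{\alpha} \left\| C_{novel} - \sum_{i=1}^{N} \alpha_i C_i \right\|$
Re-orient + Warp
$\min_{\beta} \left\| I_{novel} - \sum_{i=1}^{N} \beta_i I_i^{warped} \right\| \quad \text{subject to } \beta_i > 0 \;\; \forall i, \quad \sum_{i=1}^{N} \beta_i = 1$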
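Both analysis steps are small optimization problems. In the sketch below, the flow parameters come from ordinary least squares. The texture parameters come from projected gradient descent onto the probability simplex, which is one of several possible solvers and not necessarily the one used for Mary101. The projection allows β_i = 0, a slight relaxation of the strict inequality. All function names are illustrative.

```python
import numpy as np


def analyze_flow(c_novel, flows):
    """alpha minimizing || C_novel - sum_i alpha_i C_i || (least squares)."""
    A = flows.reshape(len(flows), -1).T        # one column per prototype flow
    alpha, *_ = np.linalg.lstsq(A, c_novel.ravel(), rcond=None)
    return alpha


def project_to_simplex(v):
    """Euclidean projection of v onto {b : b_i >= 0, sum_i b_i = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)


def analyze_texture(i_novel, warped, n_iter=2000):
    """beta minimizing || I_novel - sum_i beta_i I_i^warped || on the simplex.

    warped: (N, H, W) prototype textures already re-oriented and warped.
    """
    A = warped.reshape(len(warped), -1).T
    b = i_novel.ravel()
    step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / largest eigenvalue of A^T A
    beta = np.full(A.shape[1], 1.0 / A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ beta - b)
        beta = project_to_simplex(beta - step * grad)
    return beta
```

Running these two functions on every corpus frame yields the per-frame parameters (α_t, β_t) used in the rest of the talk.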
MMM Analysis Parameters
[Figure: flow and texture parameter trajectories for the words “badge” and “lavish”]
Comparison of Real and Synthesized Images
[Figure: pairs of real and synthetic images]
Tongue is not perfect
Slight blurring
Analysis of Entire Recorded Corpus
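$z_1 = (\alpha_1, \beta_1), \quad z_2 = (\alpha_2, \beta_2), \quad \ldots, \quad z_{30000} = (\alpha_{30000}, \beta_{30000})$
[Figure: each frame of the video corpus is analyzed into a point in the MMM; example frames labeled /b/, /jh/, /ae/]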
Phonetic Clusters
Represent each phone p with $\mu_p$, $\Sigma_p$
One set for flows, another set for textures
[Figure: phonetic clusters for /t/, /w/, /m/, /aa/, /b/]
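Given the analyzed parameters for every frame and the phone label of every frame, the sample estimates of μ_p and Σ_p are simply per-phone means and covariances. The sketch below assumes NumPy arrays and a hypothetical function name. It would be run once on the α (flow) parameters and once on the β (texture) parameters.

```python
import numpy as np


def phone_statistics(params, phones):
    """Sample mean and covariance of the MMM parameters for each phone.

    params: (T, D) array, one row of parameters per corpus frame.
    phones: length-T sequence of phone labels, one per frame.
    Each phone needs at least two frames for the covariance to be defined.
    """
    params = np.asarray(params, dtype=float)
    labels = np.asarray(phones)
    stats = {}
    for p in sorted(set(phones)):
        rows = params[labels == p]
        mean = rows.mean(axis=0)
        cov = np.atleast_2d(np.cov(rows, rowvar=False))  # unbiased estimate
        stats[p] = (mean, cov)
    return stats
```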
Trajectory Synthesis
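$\min_{y} \; (y - \mu)^T \Sigma^{-1} (y - \mu) + \lambda \left\| \Delta y \right\|^2$
First term: phonetic targets. Second term: smoothness.
$y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_T \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_{P_1} \\ \mu_{P_2} \\ \vdots \\ \mu_{P_T} \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{P_1} & & & \\ & \Sigma_{P_2} & & \\ & & \ddots & \\ & & & \Sigma_{P_T} \end{bmatrix}$
where P_t is the phone at frame t of the input stream, e.g. /SIL B B B AE AE JH JH SIL SIL/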
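Because the objective is quadratic in y, setting its gradient to zero gives the linear system (Σ⁻¹ + λ ΔᵀΔ) y = Σ⁻¹ μ. The sketch below solves it for a single scalar MMM parameter with a first-order difference operator. The per-phone target values are made up for illustration, and the function name is hypothetical.

```python
import numpy as np


def synthesize_trajectory(mu, var, lam):
    """Minimize (y - mu)^T Sigma^-1 (y - mu) + lam * ||Delta y||^2.

    mu, var: length-T arrays with the target mean and variance of the phone
    active at each frame (one scalar parameter, so Sigma is diagonal).
    """
    mu = np.asarray(mu, dtype=float)
    T = len(mu)
    delta = np.diff(np.eye(T), axis=0)         # rows (-1, 1): y[t+1] - y[t]
    sigma_inv = np.diag(1.0 / np.asarray(var, dtype=float))
    A = sigma_inv + lam * delta.T @ delta
    return np.linalg.solve(A, sigma_inv @ mu)


# Hypothetical per-phone targets for one parameter
targets = {"SIL": 0.0, "B": -1.0, "AE": 2.0, "JH": 0.5}
phones = "SIL B B B AE AE JH JH SIL SIL".split()
mu = np.array([targets[p] for p in phones])
y = synthesize_trajectory(mu, np.ones(len(phones)), lam=1.0)
```

With λ = 0 the trajectory jumps between the targets. Larger λ trades target accuracy for smoothness, and a larger variance lets the trajectory stray further from that phone's target.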
Smoothness
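Higher orders of smoothness: $\Delta\Delta, \; \Delta\Delta\Delta, \; \ldots$ (order 2, 3, …)
$\Delta = \begin{bmatrix} -I & I & & & \\ & -I & I & & \\ & & \ddots & \ddots & \\ & & & -I & I \end{bmatrix}$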
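The difference operator and its higher orders are easy to build explicitly. In this sketch, `np.diff` applied to an identity matrix gives the scalar operator, and a Kronecker product with an identity block gives the block form with (−I, I) rows. The function names are illustrative.

```python
import numpy as np


def difference_operator(T, order=1):
    """(T - order) x T matrix applying `order` successive first differences."""
    return np.diff(np.eye(T), n=order, axis=0)


def block_difference_operator(T, dim, order=1):
    """Same operator for stacked dim-dimensional frames y = [y_1; ...; y_T]."""
    return np.kron(difference_operator(T, order), np.eye(dim))


print(difference_operator(4, 1))   # rows of the form (-1, 1)
print(difference_operator(5, 2))   # rows of the form (1, -2, 1)
```

An order-k operator annihilates polynomials of degree below k, so penalizing it leaves constant, linear, and higher trends in the trajectory unpenalized as k grows.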
Setting $\Delta$, $\lambda$
Cross-validation:
flow: $\Delta$ order 4, $\lambda$ = 250: septic splines
texture: $\Delta$ order 5, $\lambda$ = 100: nonic (ninth-degree) splines
Setting Phonetic Clusters
Use sample estimates?
[Figure: sample-estimate clusters for /t/ and /b/]
Problem: Underarticulation!
Adjusting Phonetic Clusters
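Use gradient descent to tweak the phonetic cluster parameters
Compare synthesized trajectory $y = \{\alpha_t, \beta_t\}$ with original trajectory $z = \{\alpha_t, \beta_t\}$
$E = (z - y)^T (z - y)$
$\frac{\partial E}{\partial \mu_i} = \frac{\partial E}{\partial y} \frac{\partial y}{\partial \mu_i}$
$\mu^{new} = \mu^{old} - \eta \frac{\partial E}{\partial \mu}$
[Figure: clusters for /t/ and /b/]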
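Since the synthesized trajectory is linear in the targets, y = Mμ with M = (Σ⁻¹ + λ ΔᵀΔ)⁻¹ Σ⁻¹, the chain rule gives ∂E/∂μ = −2 Mᵀ(z − y). The sketch below applies this to the means of one scalar parameter, with variances held fixed and a first-order Δ. These simplifications and all function names are assumptions made for illustration.

```python
import numpy as np


def smoothing_matrix(phones, var, lam):
    """M such that the synthesized trajectory is y = M @ mu."""
    T = len(phones)
    delta = np.diff(np.eye(T), axis=0)
    sigma_inv = np.diag([1.0 / var[p] for p in phones])
    return np.linalg.solve(sigma_inv + lam * delta.T @ delta, sigma_inv)


def synthesize(means, phones, M):
    return M @ np.array([means[p] for p in phones])


def train_means(z, phones, means, var, lam, eta=0.1, n_steps=500):
    """Gradient descent on E = (z - y)^T (z - y) with respect to phone means.

    z: original (analyzed) trajectory; means, var: dicts keyed by phone.
    """
    M = smoothing_matrix(phones, var, lam)
    means = dict(means)                        # do not modify the caller's dict
    for _ in range(n_steps):
        y = synthesize(means, phones, M)
        grad_frames = -2.0 * M.T @ (z - y)     # dE/dmu for each frame's target
        grad = {p: 0.0 for p in means}
        for t, p in enumerate(phones):         # frames sharing a phone share a mean
            grad[p] += grad_frames[t]
        for p in means:
            means[p] -= eta * grad[p]          # mu_new = mu_old - eta * dE/dmu
    return means
```

Starting from the sample estimates, this pushes each mean outward until the smoothed trajectory reaches the articulation seen in the original data, which counteracts the underarticulation problem noted earlier.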
Phones Before/After Training
[Figure: clusters for /t/ and /b/, before and after training]
Trajectories Before/After Training
[Figure: trajectories of $\alpha_{12}$ and $\beta_{28}$, before and after training]
Coarticulation Model
[Figure: MMM with prototype images I_1 … I_4, flows C_2 … C_4, and cluster regions for /B/, /U/, /T/, /I/]
Coarticulation controlled by width of cluster regions
Coarticulation
/utu/
/iti/ /ata/
/ubu/
/ibi/ /aba/
Big Picture
• Pre-process
• Construct MMM: $\{I_i, C_i\}$
• Analyze corpus: $\{\alpha_t, \beta_t\}$
• Train phonetic models: $\{\mu_p, \Sigma_p\}$ (/SIL/, /F/, /AE/, …)
• Synthesize! Phone stream /SIL B B B AE AE JH JH SIL SIL/ → Trajectory Synthesis → $\{\alpha_t, \beta_t\}$ → MMM Synthesis
• Post-process
Results
Mary101:
8 minutes of training data
1-syllable words: 132 training / 20 test
2-syllable words: 136 training / 20 test
46-prototype MMM
Sentences not even included in training.
Comments So Far
“She looks like she’s been Botox’ed” -- Nobel Laureate
“Has she had a frontal lobotomy?” -- ATT executive
Send me your comments to
Visual Turing Tests
We win!
Experiment            % correct   P<
Single presentation   52.1%       0.3
Double presentation   46.6%       0.5
Visual Intelligibility
Still some work to do…….
Correct Phoneme ID
Experiment    % correct on N   % correct on S   P<
Words+Sents   30.01%           21.19%           0.001
Words         38.55%           28.07%           0.001
Sents         24.38%           16.52%           0.01
Stay Tuned!
Acknowledgments: Association Christian Benoit, NSF, NTT, ITRI
Mary101
Dynasty Models
Craig Milanesi
Dave Konstine
Joanne Flood
Jay Benoit
Marypat Fitzgerald
Casey Johnson
Vinay Kumar
Sayan Mukherjee
Chao Wang
Adlar Kim
Danielle Suh
Osamu Yoshimi
Volker Blanz
Thomas Vetter
Demetri Terzopoulos
Jenny Shapiro/BMG
Rehema Ellis/NBC
Kevin Chang