
“ Pixels that Sound ”

Find pixels that correspond (correlate !?) to sound

Kidron, Schechner, Elad, CVPR 2005


Audio-Visual Analysis: Applications

• Lip reading – detection of lips (or person)

Slaney, Covell (2000)

Bregler, Konig (1994)

• Analysis and synthesis of music from motion

Murphy, Andersen, Jensen (2003)

• Source separation based on vision

Li, Dimitrova, Li, Sethi (2003)

Smaragdis, Casey (2003)

Nock, Iyengar, Neti (2002)

Fisher, Darrell, Freeman, Viola (2001)

Hershey, Movellan (1999)

• Tracking

Vermaak, Gangnet, Blake, Pérez (2001)

• Biological systems

Gutfreund, Zheng, Knudsen (2002)

Problem: Different Modalities

camera

microphone

audio-visual analysis

Visual data

25 frames/sec

Each frame: 576 x 720 pixels

Audio data

44.1 kHz, few bands

Not stereophonic

Kidron, Schechner, Elad, Pixels that Sound


Previous Work

• Pointwise correlation

Nock, Iyengar, Neti (2002)

Hershey, Movellan (1999)

Ill-posed (lack of data)

• Canonical Correlation Analysis (CCA)

Smaragdis, Casey (2003)

Li, Dimitrova, Li, Sethi (2003)

Slaney, Covell (2000)

Cluster of pixels - linear superposition

• Mutual Information (MI)

Fisher et al. (2001)

Cutler, Davis (2000)

Bregler, Konig (1994)

Not typical

highly complex

Video (Pixel #1, Pixel #2, Pixel #3) → Projection

Audio (Band #1, Band #2) → Projection

CCA: optimal visual components

Visual Projection

1D variable

Projection

Video features

• Pixel intensities

• Transform coefficients (wavelet)

• Image differences

v


Audio Projection

1D variable

Projection

Audio features

• Average energy per frame

• Transform coefficients per frame

a
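As an illustration of the first audio feature, the sketch below computes the average energy per video frame for a mono signal. It assumes the 44.1 kHz sample rate and 25 frames/sec quoted earlier; the function name `energy_per_frame` and the synthetic 440 Hz test tone are hypothetical and not taken from the slides.

```python
import numpy as np

def energy_per_frame(audio, sample_rate=44100, fps=25):
    """Average energy of a mono signal over each video-frame interval."""
    samples_per_frame = sample_rate // fps  # 1764 samples at 44.1 kHz, 25 fps
    n_frames = len(audio) // samples_per_frame
    # Drop the incomplete tail, then reshape to (frames, samples per frame).
    trimmed = np.asarray(audio[: n_frames * samples_per_frame], dtype=float)
    blocks = trimmed.reshape(n_frames, samples_per_frame)
    return np.mean(blocks ** 2, axis=1)

# One second of a synthetic 440 Hz tone with unit amplitude (illustrative input).
t = np.arange(44100) / 44100.0
a = energy_per_frame(np.sin(2 * np.pi * 440 * t))
print(a.shape)  # (25,)
```

Each entry of `a` is one audio sample per video frame, so the audio feature ends up on the same time axis as the visual features.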


Canonical Correlation

Video, Audio

Representation

Projections(per time window)

Random variables(time dependent)

Correlation coefficient


CCA Formulation

yield an eigenvalue problem:

Knutsson, Borga, Landelius (1995)

Canonical correlation projections

Largest Eigenvalue

equivalent to

Corresponding Eigenvectors
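A minimal NumPy sketch of CCA posed as an eigenvalue problem, in its standard textbook form: the squared canonical correlations are the eigenvalues of C_vv^-1 C_va C_aa^-1 C_av, and the eigenvector of the largest eigenvalue gives the visual projection. The function name `cca_first_pair`, the toy data, and the assumption that both covariance matrices are invertible (many more frames than features) are illustrative and not taken from the slides.

```python
import numpy as np

def cca_first_pair(V, A):
    """Return (rho, w_v, w_a) for the largest canonical correlation.

    V: (frames, visual features), A: (frames, audio features).
    Assumes both covariance matrices are invertible.
    """
    V = V - V.mean(axis=0)
    A = A - A.mean(axis=0)
    n = V.shape[0]
    C_vv = V.T @ V / (n - 1)
    C_aa = A.T @ A / (n - 1)
    C_va = V.T @ A / (n - 1)
    # Eigenproblem: C_vv^-1 C_va C_aa^-1 C_av w_v = rho^2 w_v
    M = np.linalg.solve(C_vv, C_va) @ np.linalg.solve(C_aa, C_va.T)
    eigvals, eigvecs = np.linalg.eig(M)
    k = np.argmax(eigvals.real)          # largest eigenvalue
    w_v = eigvecs[:, k].real             # corresponding eigenvector
    # The matching audio projection is proportional to C_aa^-1 C_av w_v.
    w_a = np.linalg.solve(C_aa, C_va.T @ w_v)
    rho = np.corrcoef(V @ w_v, A @ w_a)[0, 1]
    return rho, w_v, w_a

# Toy data: one visual feature and one audio band share a hidden signal z.
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
V = rng.standard_normal((500, 3)); V[:, 0] += 3 * z
A = rng.standard_normal((500, 2)); A[:, 1] += 3 * z
rho, w_v, w_a = cca_first_pair(V, A)
print(round(rho, 3), np.argmax(np.abs(w_v)), np.argmax(np.abs(w_a)))
```

The sketch only works when there are more frames than features; with fewer frames than pixels the call to `np.linalg.solve` on C_vv would fail or be numerically meaningless.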


Visual Data

t (frames)

Spatial location (pixel intensities)

Rank Deficiency

t (frames)

Spatial location (pixel intensities)

=

Estimation of Covariance

Rank deficient
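The rank deficiency can be checked numerically. The sketch below uses toy sizes (30 frames, 200 pixels, both hypothetical) and random data in place of real video; the point is only that a sample covariance built from fewer frames than pixels cannot have full rank.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, n_pixels = 30, 200   # toy sizes; a 576 x 720 frame has 414,720 pixels
V = rng.standard_normal((n_frames, n_pixels))
V -= V.mean(axis=0)            # remove the temporal mean of every pixel

C_vv = V.T @ V / (n_frames - 1)        # (pixels x pixels) sample covariance
rank = np.linalg.matrix_rank(C_vv)
print(rank, "of", n_pixels)            # rank is at most n_frames - 1
```

Because the rank is bounded by the number of frames minus one, the covariance matrix is singular whenever the time window is shorter than the number of visual features.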


Ill-Posedness

Prior solutions:

• Use many more frames → poor temporal resolution.

• Aggressive spatial pruning → poor spatial resolution.

• Trivial regularization

Impossible to invert!

A General Problem

Small amount of data

The problem is ILL-POSED

Over fitting is likely

Large number of weights


An Equivalent Problem

Minimizing

Maximizing
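One generic way to see why a minimization can stand in for maximizing correlation, assuming zero-mean projections x and y that are normalized to unit length. The symbols x, y, and ρ are introduced here for illustration, and this is not necessarily the exact objective used in the talk:

```latex
\begin{aligned}
\|x - y\|_2^2 &= \|x\|_2^2 + \|y\|_2^2 - 2\,x^{\top}y \\
              &= 2 - 2\rho, \qquad \rho = x^{\top}y \quad (\|x\|_2 = \|y\|_2 = 1).
\end{aligned}
```

Under these assumptions the squared distance between the two projections is smallest exactly when the correlation coefficient ρ is largest.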


Single Audio Band

(The denominator is non-zero)

Minimizing

Known data

A has a single column, and

Time

a(ti): a(1), a(2), a(30)

Full correlation if V w = a (w: the pixel weights)

Underdetermined system!

Detected correlated pixels

“Out of clutter, find simplicity.

From discord, find harmony.”

Albert Einstein


Sparse Solution

• Non-convex

• Exponential complexity

ℓ0-norm minimum

The ℓ1-norm criterion

• Sparse

• Convex

• Polynomial complexity

in common situations

ℓ1-norm minimum

Donoho, Elad (2005)

The Minimum Norm Solution

Energy spread

ℓ2-norm minimum

Solving using ℓ2-norm (pseudo-inverse, SVD, QR)
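To illustrate the energy spread of the minimum-norm solution, the sketch below solves a toy underdetermined system with the pseudo-inverse. It assumes the single-band constraint has the form V w = a; the sizes, the three "true" pixels, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_frames, n_pixels = 30, 200           # fewer equations than unknowns
V = rng.standard_normal((n_frames, n_pixels))

# Hypothetical ground truth: only three pixels are related to the sound.
w_true = np.zeros(n_pixels)
w_true[[20, 75, 140]] = [1.0, -0.5, 0.8]
a = V @ w_true

# Minimum l2-norm solution of V w = a via the pseudo-inverse.
w_l2 = np.linalg.pinv(V) @ a
residual = np.linalg.norm(V @ w_l2 - a)
n_significant = int(np.sum(np.abs(w_l2) > 1e-3 * np.abs(w_l2).max()))
print(residual < 1e-8, n_significant)
```

The solution satisfies the constraint, yet its weight is spread over many pixels instead of staying on the three that generated the audio, so it does not localize the source.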


Linear programming

Fully correlated

Sparse

No parameters to tweak

Polynomial

Audio-visual events

Maximum correlation: Eigenproblem

Minimum objective function G
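A minimal sketch of how an ℓ1-norm minimum can be found by linear programming, assuming the single-band problem is "minimize the ℓ1-norm of w subject to V w = a". The helper `min_l1_solution`, the toy sizes, and the three "true" pixels are hypothetical; `scipy.optimize.linprog` is used as a generic LP solver, not as the solver used in the talk.

```python
import numpy as np
from scipy.optimize import linprog

def min_l1_solution(V, a):
    """Minimize ||w||_1 subject to V w = a, written as a linear program."""
    n = V.shape[1]
    # Split w = p - q with p, q >= 0; then ||w||_1 = sum(p) + sum(q) at the optimum.
    c = np.ones(2 * n)
    A_eq = np.hstack([V, -V])
    res = linprog(c, A_eq=A_eq, b_eq=a, bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:n] - res.x[n:]

rng = np.random.default_rng(1)
n_frames, n_pixels = 30, 200
V = rng.standard_normal((n_frames, n_pixels))
w_true = np.zeros(n_pixels)
w_true[[20, 75, 140]] = [1.0, -0.5, 0.8]   # hypothetical sounding pixels
a = V @ w_true

w_l1 = min_l1_solution(V, a)
print(np.flatnonzero(np.abs(w_l1) > 1e-6))  # indices carrying weight
```

There is no regularization weight or threshold to choose here: the only inputs are the data V and a.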


Multiple Audio Bands - Solution

ℓ1-ball

Non-convex constraint

• Convex

• Linear

The optimization problem:

ℓ1 ball

Multiple Audio Bands

Optimization over each face is:

S1

S2

S3 S4

No parameters to tweak

• Each face: linear programming
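The face-by-face idea can be sketched as follows. The problem form is an assumption made for illustration: minimize the ℓ1-norm of the visual weights w_v subject to V w_v = A w_a, with the audio weights w_a constrained to the ℓ1 sphere. That constraint is non-convex, but on each face of the ℓ1 ball (a fixed sign pattern) it becomes linear, so each face is a linear program and the best face wins. With two bands there are four faces. The function name and toy data are hypothetical.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def sparse_av_weights(V, A):
    """Assumed problem: min ||w_v||_1  s.t.  V w_v = A w_a,  ||w_a||_1 = 1."""
    n_pix, n_bands = V.shape[1], A.shape[1]
    best = None
    for signs in itertools.product([1.0, -1.0], repeat=n_bands):
        s = np.array(signs)
        # On this face w_a = s * u with u >= 0 and sum(u) = 1 (a linear constraint).
        # LP variables: [p, q, u], all >= 0, with w_v = p - q.
        c = np.concatenate([np.ones(2 * n_pix), np.zeros(n_bands)])
        A_eq = np.block([
            [V, -V, -A * s],
            [np.zeros((1, 2 * n_pix)), np.ones((1, n_bands))],
        ])
        b_eq = np.concatenate([np.zeros(V.shape[0]), [1.0]])
        res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
        if res.success and (best is None or res.fun < best[0]):
            w_v = res.x[:n_pix] - res.x[n_pix:2 * n_pix]
            best = (res.fun, w_v, s * res.x[2 * n_pix:])
    if best is None:
        raise RuntimeError("no face produced a feasible solution")
    return best[1], best[2]

rng = np.random.default_rng(2)
n_frames, n_pixels = 30, 100
A = rng.standard_normal((n_frames, 2))       # two audio bands
w_a_true = np.array([0.7, -0.3])             # lies on the l1 sphere
V = rng.standard_normal((n_frames, n_pixels))
V[:, 10] = A @ w_a_true                      # pixel 10 follows the audio exactly
w_v, w_a = sparse_av_weights(V, A)
print(np.flatnonzero(np.abs(w_v) > 1e-6), np.round(w_a, 3))
```

The number of faces doubles with every added band, which is manageable only because the slides describe the audio as having few bands. Faces with opposite sign patterns give mirrored solutions, so half of them could be skipped.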


Sharp & Dynamic, Despite Distraction

Frames shown: 9, 42, 68, 115, 146, 169, 51, 106, 83, 177

• Sparse

• Localization on the proper elements

• False alarm – temporally inconsistent

• Handling dynamics

Performing in Audio Noise

ℓ2-norm: Energy Spread

Movie #1, Movie #2

Frame 83, Frame 146

ℓ1-norm: Localization

Movie #1, Movie #2

Frame 83, Frame 146

The “Chorus Ambiguity”

Who’s talking?

Synchronized talk

Not unique (ambiguous)

Possible solutions:

• Left

• Right

• Both

The “Chorus Ambiguity”

-norm-norm

feature 1

feature 2

feature 1

feature 2

Both