Hands On: Multimedia Methods for Large Scale Video Analysis (Lecture)
Dr. Gerald Friedland, [email protected]
1
Today
2
•Recap: Some more Machine Learning
•Multimedia Systems
•An example Multimedia System
Recap: Architecture of Content Analysis Algorithms
3
Recap: Some More Machine Learning
4
•k-Nearest Neighbors
•Neural Networks
•SVMs
•HMMs
k-Nearest Neighbors
5
Another Magic Duo
6
•Histograms are the most widely used image models in practice
•Nearest Neighbors (with Euclidean distance) is the most commonly used technique for comparing visual features
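As a sketch of this duo, the snippet below builds normalized intensity histograms and compares them with Euclidean-distance nearest neighbor. The toy images, bin count, and helper names are invented for illustration, not taken from the lecture.

```python
# Histogram features + nearest neighbor with Euclidean distance.
import numpy as np

def histogram_feature(image, bins=16):
    """Reduce an image to a normalized intensity histogram."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist / hist.sum()

def nearest_neighbor(query, database):
    """Index of the database histogram closest in Euclidean distance."""
    dists = [np.linalg.norm(query - h) for h in database]
    return int(np.argmin(dists))

# Toy "images": uniform patches at three different intensities.
db_images = [np.full((8, 8), 10), np.full((8, 8), 100), np.full((8, 8), 200)]
db_hists = [histogram_feature(im) for im in db_images]

query = histogram_feature(np.full((8, 8), 100))   # same intensity as image 1
print(nearest_neighbor(query, db_hists))          # -> 1
```

The same pattern scales to color histograms: one histogram per channel, concatenated into a single feature vector.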
Neural Networks (MLPs)
7
Linear Separation
8
Support Vector Machines
9
Hidden Markov Models
10
•a’s: state transition probabilities
•b’s: observation likelihoods
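To illustrate what the a's and b's do, here is a minimal forward pass for a two-state HMM. All probabilities are invented toy numbers, not from the lecture.

```python
# Minimal HMM forward algorithm: a's (transitions) and b's (observation
# likelihoods) jointly score an observation sequence.
import numpy as np

pi = np.array([0.6, 0.4])              # initial state probabilities
A = np.array([[0.7, 0.3],              # a[i][j] = P(next state j | state i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],              # b[i][k] = P(observation k | state i)
              [0.2, 0.8]])

def forward(obs):
    """Probability of the observation sequence under the model."""
    alpha = pi * B[:, obs[0]]          # initialize with priors and first b
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate through a's, weight by b's
    return alpha.sum()

print(round(forward([0, 0, 1]), 4))    # -> 0.1362
```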
Hidden Markov Models
11
Multimedia: Definition
12
Entry: multimedia
Function: noun, plural but singular or plural in construction
Date: 1950
A technique (as the combining of sound, video, and text) for expressing ideas (as in communication, entertainment, or art) in which several media are employed; also: something (as software) using or facilitating such a technique.
(Merriam-Webster online dictionary)
Multimedia Content Analysis
Automatic analysis of the content (semantics) contained in data directly encoded for human perception (audio, images, video, touch) and its associated metadata (natural text, computer-encoded data).
13
Multimodal Integration
14
•... is a field of cognitive psychology.
•Before 1960: Unimodal approach
•Initial results in the 1960s, recently hyped again (2003+)
Multimodal Integration
15
Human psychology suggests:
•Multiple sensory inputs increase the speed of the output (Hershenson 1962)
•Uncertainty in sensory domains results in increased reliance on multisensory integration (Alais & Burr 2004)
Multimodal Integration
16
In computer science:
•How to create systems that benefit from multimodal integration in ways similar to the brain, i.e. they are
– more accurate, robust, and/or faster than the unimodal state of the art, and/or
– offer qualitative improvements over unimodal approaches
Recap: Architecture of Content Analysis Algorithms
17
Generic Scheme of a Classification Algorithm
18
Some signal is observed and reduced...
...to the essentials relevant to the problem, ...
...statistical models are used to compute a score (e.g. probabilities) for the given observations, ...
... so that a decision function can decide on the classification.
Signal →(reduce dimensions)→ Features →(build abstraction)→ Models →(generate score)→ Decision →(output decision)→ Result
Feature-Level Integration
19
Features are integrated before the model layer using a function ‘+’.
For example, concatenation: an n-dimensional vector ‘+’ an m-dimensional vector = an (n+m)-dimensional vector
Signal1 →(reduce dimensions)→ Features ┐
                                       ‘+’ →(build abstraction)→ Models →(generate score)→ Decision →(output decision)→ Result
Signal2 →(reduce dimensions)→ Features ┘
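Concatenation as the integration function ‘+’ is a one-liner in practice. The vectors below are toy values standing in for, say, audio and video features.

```python
# Feature-level integration by concatenation: an n-dim vector joined
# with an m-dim vector gives one (n+m)-dim feature for a single model.
import numpy as np

audio_feat = np.array([0.2, 0.5, 0.1])           # n = 3 (e.g. some MFCCs)
video_feat = np.array([0.7, 0.3])                # m = 2 (e.g. activity values)

combined = np.concatenate([audio_feat, video_feat])
print(combined.shape)                            # -> (5,)
```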
Model-Level Integration
20
Output scores are integrated using a function ‘+’.
For example weighted combined log-likelihoods.
Signal1 →(reduce dimensions)→ Features →(build abstraction)→ Models →(generate scores)→ ┐
                                                                                        ‘+’ → combined score → Decision →(output decision)→ Result
Signal2 →(reduce dimensions)→ Features →(build abstraction)→ Models →(generate scores)→ ┘
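Weighted combination of per-modality log-likelihoods can be sketched as follows. The scores and the weight are invented for illustration; in a real system the weight would be tuned on held-out data.

```python
# Model-level integration: combine per-modality log-likelihoods with a
# weight, then decide by taking the best-scoring class.
import numpy as np

log_lik_audio = np.array([-12.0, -9.5, -11.2])   # one score per class/speaker
log_lik_video = np.array([-4.1, -5.0, -3.8])
w = 0.7                                          # audio weight; video gets 1-w

combined = w * log_lik_audio + (1 - w) * log_lik_video
decision = int(np.argmax(combined))
print(decision)                                  # -> 1
```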
Decision-Level Integration
21
Output decisions are fused using a function ‘+’.
For example majority voting.
WARNING: Meta-data fusion in general is a difficult research problem.
Signal1 →(reduce dimensions)→ Features →(build abstraction)→ Models →(generate score)→ Decision ┐
                                                                                                ‘+’ →(output decision)→ Result
Signal2 →(reduce dimensions)→ Features →(build abstraction)→ Models →(generate score)→ Decision ┘
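Decision-level fusion by majority voting can be sketched in a few lines; the speaker labels are illustrative.

```python
# Decision-level integration by majority voting over the decisions of
# several modalities (or classifiers).
from collections import Counter

def majority_vote(decisions):
    """Return the most common decision; ties go to the first seen."""
    return Counter(decisions).most_common(1)[0][0]

print(majority_vote(["spk1", "spk2", "spk1"]))   # -> spk1
```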
Remarks
• Signal-level integration is rarely feasible because of intractable data dimensionality.
• Multi-level integration is also possible.
• In reality, a classification algorithm is more complicated than this scheme (e.g. feedback loops).
• The integration function ‘+’ may also be learned automatically.
22
Example System
Dialocalization: Acoustic Speaker Diarization and Visual Localization as Joint Optimization Problem
23
G. Friedland, C. Yeo, H. Hung: Dialocalization: Acoustic Speaker Diarization and Visual Localization as Joint Optimization Problem, ACM Transactions on Multimedia Computing, Communications, and Applications, Vol. 6, No. 4, Article 27, November 2010
Current Common Sense
• Localization (Computer Vision Task): localization in space.
• Speaker Diarization (Speech Processing Task): localization in time.
Example: Speaker Diarization
25
Audiotrack:
Segmentation:
Clustering:
Speaker localization on the timeline: “who spoke when”.
Speaker Diarization...
➡tries to answer the question: “who spoke when?”
➡using a single microphone input
➡without prior knowledge of anything (#speakers, language, text, etc...)
26
Single Audio Stream
27
Audio Signal → Feature Extraction (MFCC) → Speech/Non-Speech Detector (speech-only MFCC) → Diarization Engine (Segmentation + Clustering) → Metadata
Bottom-Up Algorithm
28
Initialization → (Re-)Training → (Re-)Alignment → Merge two Clusters? (Yes: back to (Re-)Training; No: End)
Start with too many clusters (initialized randomly).
Purify clusters by comparing and merging similar clusters.
Resegment and repeat until no more merging is needed.
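The bottom-up loop can be sketched as below. The real engine models clusters with GMMs and merges by a BIC-style criterion; here a plain centroid distance and a made-up threshold stand in, just to show the control flow (over-initialize, merge closest pair, stop when no merge is warranted).

```python
# Sketch of bottom-up (agglomerative) clustering as used in diarization.
import numpy as np

def bottom_up(segments, threshold):
    """segments: list of feature vectors. Returns clusters as index lists."""
    clusters = [[i] for i in range(len(segments))]  # start with too many clusters
    while len(clusters) > 1:
        # Compare all cluster pairs; remember the closest one.
        best, best_d = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = np.mean([segments[i] for i in clusters[a]], axis=0)
                cb = np.mean([segments[i] for i in clusters[b]], axis=0)
                d = np.linalg.norm(ca - cb)
                if d < best_d:
                    best, best_d = (a, b), d
        if best_d > threshold:          # "Merge two clusters? No" -> End
            break
        a, b = best                     # "Yes" -> merge, then repeat
        clusters[a] += clusters.pop(b)
    return clusters

# Toy 1-D segments: two speakers around 0.0 and 5.1.
segs = [np.array([0.0]), np.array([0.1]), np.array([5.0]), np.array([5.2])]
print(len(bottom_up(segs, threshold=1.0)))          # -> 2
```

The re-training and re-alignment steps of the real algorithm (Viterbi resegmentation against the current models) are omitted here for brevity.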
Current Accuracy
29
Single-Stream System (ICSI)   Devset 07   Eval07   VACE (AMI)
Speech/Non-Speech Error          6.4%      6.8%     12.2%
Speaker Error                   11.3%     14.9%     19.89%
Diarization Error Rate          17.57%    21.24%    32.09%
ICSI Speaker Diarization Engine as it participated in NIST RT07.
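Under NIST's scoring, the Diarization Error Rate decomposes (up to rounding) into the speech/non-speech error plus the speaker error, which the VACE column above illustrates exactly:

```python
# DER decomposition check for the VACE (AMI) column of the table.
sns_error = 12.2      # speech/non-speech error (%)
spk_error = 19.89     # speaker error (%)

der = sns_error + spk_error
print(round(der, 2))  # -> 32.09
```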
Goals
30
• Improve robustness while...
• ...increasing, or at least keeping, the speed.
• Need to identify speakers, e.g. by association with a face.
Idea: Multimodality could help
Multimodal Speaker Diarization...
➡tries to answer the question: “who spoke when?”
➡using a single microphone and single camera input
➡without prior knowledge of anything (#speakers, language, text, etc...)
31
AMI Meeting Room Setup
32
AMI Meetings: Real-World Problems
33
• Close-view still not good enough for face detection
• People lean backward and forward, stand up, walk around, leave the room, etc...
Even more Problems: Single Camera View
34
• Very low resolution per participant
• Partial occlusions
Audio/Visual Correlation Assumptions
35
• Camera captures all participants, most of the time.
• Speaker locations have limited spatial variance.
• Speakers have more visual activity than non-speakers.
Multimodal Diarization
36
Audio Signal → Feature Extraction (MFCC) → Speech/Non-Speech Detector → Diarization Engine (Segmentation + Clustering) → "Who spoke when"
Video Signal → Feature Extraction → Video Activity Events (speech regions only) → Diarization Engine
Video Feature Extraction
37
MPEG-4 Video → Divide Frames into n Regions → Detect Skin Blocks → Avg. Motion Vectors → n-dimensional activity vector (window size: 400 ms)
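A sketch of this feature under simplifying assumptions: the motion magnitudes are synthetic rather than decoded from an MPEG-4 stream, the n regions are plain vertical strips, and the skin-block detection step is omitted.

```python
# Block-based video activity feature: average motion magnitude per region
# over one analysis window (~400 ms of frames).
import numpy as np

def activity_vector(motion_mags, n_regions):
    """motion_mags: (frames, height, width) motion magnitudes in one window.
    Returns an n_regions-dimensional average-activity vector."""
    frames, h, w = motion_mags.shape
    cols = np.array_split(np.arange(w), n_regions)   # vertical strips as regions
    return np.array([motion_mags[:, :, c].mean() for c in cols])

rng = np.random.default_rng(1)
window = rng.random((10, 16, 32))                    # 10 frames ~ 400 ms
vec = activity_vector(window, n_regions=4)
print(vec.shape)                                     # -> (4,)
```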
Model-Level Integration
38
Audio → MFCC → GMMs → likelihoods ┐
                                  ‘+’ → Decision → Result
Video → Activity → GMMs → likelihoods ┘
Multimodal Diarization Results
39
12 meetings from the AMI corpus (“VACE meetings”)
Multimodal vs Unimodal
40
Error / System    Four Cameras   Random
Speaker Error        68.80%      75.00%
Video features alone perform poorly!
Warning: Designing multimodal algorithms may require integrated thinking. Blackbox combination of unimodal approaches may not work.
Agglomerative Clustering
41
Cepstral audio features ‘+’ video activities in each region → models containing MFCC and video activity vectors
Who Spoke When?
42
Cepstral audio features ‘+’ video activities in each region → which model fits best? → Speaker X
Where is the Speaker?
43
Speaker X from diarization ‘+’ all possible activity locations for speakers → which activity location fits best?
Speaker Localization
44
Audio Signal → Feature Extraction (MFCC) → Speech/Non-Speech Detector → Diarization Engine (Segmentation + Clustering) → "who spoke when"
Video Signal → Feature Extraction → Video Activity Events (speech regions only) → Invert Visual Models → "where the speaker was"
Speaker Localization and Diarization
45
Conclusion I
46
Speaker Diarization = Speaker Localization
No need to treat them as separate problems!
Conclusion II
47
Multimodal diarization with video results in:
• higher accuracy at low computational overhead
• speaker localization as a by-product
= “Multimodal Synergy”
Conclusion III
48
It is possible to create a machine learning system that benefits from multimodal integration such that
– it is more accurate than the unimodal state of the art, and it
– offers qualitative improvements over unimodal approaches (here: more semantic output)
Next Week (Project Meeting)
•Benjamin Elizalde on ICSI’s TRECVID MED 2012 System
Next Week (Lecture)
50
•How to estimate computational needs