Hands On: Multimedia Methods for Large Scale Video ...fractor/fall2012/cs294-13-2012.pdf · Hands...


Hands On: Multimedia Methods for Large Scale Video Analysis (Lecture)

Dr. Gerald Friedland, [email protected]

Today

•Recap: Some more Machine Learning

•Multimedia Systems

•An example Multimedia System

Recap: Architecture of Content Analysis Algorithms

Recap: Some More Machine Learning

•k-Nearest Neighbors

•Neural Networks

•SVMs

•HMMs

k-Nearest Neighbors


Another Magic Duo

•Histograms are the most practically used image models

•Nearest Neighbors (with Euclidean Distance) is the most widely used technique for comparing visual features
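The pairing of histograms and nearest neighbors can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the function names, the four-bin grayscale histogram, and the toy pixel lists are all assumptions.

```python
import math
from collections import Counter

def gray_histogram(pixels, bins=4, max_value=256):
    """Normalized intensity histogram of a flat list of pixel values."""
    counts = [0] * bins
    for p in pixels:
        counts[min(p * bins // max_value, bins - 1)] += 1
    total = len(pixels)
    return [c / total for c in counts]

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(query, labeled, k=3):
    """Majority label among the k training histograms closest to the query."""
    nearest = sorted(labeled, key=lambda item: euclidean(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy "images" given as pixel lists (hypothetical data).
training = [
    (gray_histogram([10, 20, 30, 40]), "dark"),
    (gray_histogram([5, 15, 60, 70]), "dark"),
    (gray_histogram([200, 220, 240, 250]), "bright"),
    (gray_histogram([190, 210, 230, 255]), "bright"),
]
label = knn_classify(gray_histogram([12, 25, 33, 61]), training, k=3)
```

With real images the histogram would typically be computed per color channel; the distance and voting steps stay the same.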

Neural Networks (MLPs)

Linear Separation

Support Vector Machines

Hidden Markov Models

•a’s: State transition probabilities

•b’s: Observation likelihoods
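To make the roles of the a's and b's concrete, here is a small sketch of the forward algorithm for a discrete HMM. The two-state model and its numbers are hypothetical, and the slide itself does not say which HMM algorithm was shown.

```python
def forward_likelihood(initial, a, b, observations):
    """P(observations | model) computed with the forward algorithm.

    initial[i] : probability of starting in state i
    a[i][j]    : transition probability from state i to state j
    b[i][o]    : probability of observing symbol o in state i
    """
    n = len(initial)
    alpha = [initial[i] * b[i][observations[0]] for i in range(n)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[i] * a[i][j] for i in range(n)) * b[j][obs]
            for j in range(n)
        ]
    return sum(alpha)

# Hypothetical 2-state model over the symbols 0 and 1.
initial = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
b = [[0.8, 0.2], [0.3, 0.7]]
likelihood = forward_likelihood(initial, a, b, [0, 0, 1])
```

Summing the likelihood over every possible observation sequence of a fixed length gives 1, which is a quick sanity check for the a and b tables.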


Multimedia: Definition

Entry: multimedia
Function: noun plural but singular or plural in construction
Date: 1950

A technique (as the combining of sound, video, and text) for expressing ideas (as in communication, entertainment, or art) in which several media are employed; also: something (as software) using or facilitating such a technique.

(Merriam-Webster online dictionary)


Multimedia Content Analysis

Automatic analysis of the content (semantics) contained in data directly encoded for human perception (audio, images, video, touch) and its associated meta data (natural text, computer-encoded data).


Multimodal Integration

•... is a field of cognitive psychology.

•Before 1960: Unimodal approach

• Initial results in the 1960’s, recently hyped again (2003+)

Human psychology suggests:

•Multiple sensory inputs increase the speed of the output (Hershenson 1962)

•Uncertainty in sensory domains results in increased dependency on multisensory integration (Alais & Burr 2004)

In computer science:

•How to create systems that benefit from multimodal integration in similar ways the brain does, i.e. systems that

– are more accurate, robust, and/or faster than the unimodal state of the art, and/or

– offer qualitative improvements over unimodal approaches

Recap: Architecture of Content Analysis Algorithms


Generic Scheme of a Classification Algorithm

Signal → (reduce dimensions) → Features → (build abstraction) → Models → (generate score) → Decision → (output decision) → Result

Some signal is observed and reduced to the essentials relevant to the problem; statistical models are used to compute a score (e.g. probabilities) for the given observations, so that a decision function can decide on the classification.
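The scheme can be written as three small functions chained together. This is only a sketch: the features (mean and energy), the class prototypes standing in for trained models, and the distance-based score are all assumptions chosen to keep the example short.

```python
def extract_features(signal):
    """Reduce dimensions: summarize a raw signal by its mean and energy."""
    mean = sum(signal) / len(signal)
    energy = sum(x * x for x in signal) / len(signal)
    return [mean, energy]

def score(features, models):
    """Generate one score per class: negative squared distance to a prototype."""
    return {
        name: -sum((f - p) ** 2 for f, p in zip(features, prototype))
        for name, prototype in models.items()
    }

def decide(scores):
    """Decision function: pick the best-scoring class."""
    return max(scores, key=scores.get)

# Hypothetical class prototypes; in practice these are learned from training data.
models = {"silence": [0.0, 0.01], "speech": [0.0, 1.0]}
signal = [0.9, -1.1, 1.0, -0.8]
result = decide(score(extract_features(signal), models))
```

Each arrow in the scheme corresponds to one function call, which is what makes the integration points discussed on the following slides easy to locate.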


Feature-Level Integration

Features are integrated before the model layer using a function ‘+’.

For example, concatenation: n-dimensional vector ‘+’ m-dimensional vector = (n+m)-dimensional vector.

Diagram: Signal1 → (reduce dimensions) → Features and Signal2 → (reduce dimensions) → Features are combined by ‘+’, then → (build abstraction) → Models → (generate score) → Decision → (output decision) → Result.
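Concatenation as the ‘+’ function is short enough to show directly. The feature values below are hypothetical.

```python
def fuse_features(audio_features, video_features):
    """The '+' function as concatenation: n dims '+' m dims -> n+m dims."""
    return list(audio_features) + list(video_features)

audio = [0.2, 1.5, -0.3]   # n = 3 (hypothetical audio features)
video = [0.7, 0.1]         # m = 2 (hypothetical video features)
fused = fuse_features(audio, video)
```

The fused vector is then passed to a single model, so the two streams usually need a common frame rate and comparable value ranges before they are concatenated.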


Model-Level Integration

Output scores are integrated using a function ‘+’.

For example, weighted combined log-likelihoods.

Diagram: Signal1 → (reduce dimensions) → Features → (build abstraction) → Models → (generate scores), and likewise for Signal2; the scores are combined by ‘+’ → (combined score) → Decision → (output decision) → Result.
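A weighted combination of log-likelihoods might look like the sketch below. The one-dimensional Gaussians, the weight alpha = 0.9, and the speaker parameters are assumptions for illustration, not values from the lecture.

```python
import math

def gaussian_log_likelihood(x, mean, variance):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * variance) + (x - mean) ** 2 / variance)

def combined_score(audio_ll, video_ll, alpha=0.9):
    """The '+' function: weighted sum of the per-modality log-likelihoods."""
    return alpha * audio_ll + (1 - alpha) * video_ll

# One (mean, variance) pair per modality and speaker; all values hypothetical.
speakers = {
    "A": {"audio": (0.0, 1.0), "video": (2.0, 1.0)},
    "B": {"audio": (3.0, 1.0), "video": (0.0, 1.0)},
}
audio_obs, video_obs = 0.4, 1.6

scores = {
    name: combined_score(
        gaussian_log_likelihood(audio_obs, *m["audio"]),
        gaussian_log_likelihood(video_obs, *m["video"]),
    )
    for name, m in speakers.items()
}
best = max(scores, key=scores.get)
```

Each modality keeps its own model here; only the scores meet, which lets the weight express how much each stream is trusted.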


Decision-Level Integration

Output decisions are fused using a function ‘+’.

For example, majority voting.

WARNING: Meta-data fusion in general is a difficult research problem.

Diagram: Signal1 → (reduce dimensions) → Features → (build abstraction) → Models → (generate score) → Decision, and likewise for Signal2; the two output decisions are fused by ‘+’ → (output decision) → Result.
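Majority voting over per-modality decisions can be sketched as follows. The labels are hypothetical, and the tie-breaking rule (first label seen wins) is a choice made for this example.

```python
from collections import Counter

def majority_vote(decisions):
    """The '+' function: most frequent label; ties go to the label seen first."""
    return Counter(decisions).most_common(1)[0][0]

# Hypothetical decisions from an audio, a video, and a text classifier.
fused = majority_vote(["speech", "music", "speech"])
```

With only two modalities every disagreement is a tie, so decision-level fusion of two streams usually needs confidence values or a fixed priority instead of plain voting.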


Remarks

• Signal-level integration is unlikely because of intractable data dimensionality.

• Multi-Level integration is also possible.

• In reality, a classification algorithm is more complicated than this scheme (e.g. feedback loops).

• The integration function ‘+’ may also be learned automatically.
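One simple way to learn the integration function is to pick the fusion weight that performs best on labeled development data. The sketch below assumes a weighted-sum ‘+’ and a small grid of candidate weights; the scores and labels are invented for illustration.

```python
def fused_decision(audio_scores, video_scores, alpha):
    """Pick the class with the best weighted sum of the two modality scores."""
    return max(
        audio_scores,
        key=lambda c: alpha * audio_scores[c] + (1 - alpha) * video_scores[c],
    )

def learn_alpha(dev_set, candidates):
    """Choose the weight with the highest accuracy on labeled development data."""
    def accuracy(alpha):
        hits = sum(fused_decision(a, v, alpha) == label for a, v, label in dev_set)
        return hits / len(dev_set)
    return max(candidates, key=accuracy)

# (audio scores, video scores, true label); all values hypothetical.
dev_set = [
    ({"A": -1.0, "B": -2.0}, {"A": -3.0, "B": -1.0}, "A"),
    ({"A": -3.0, "B": -1.0}, {"A": -1.0, "B": -4.0}, "B"),
    ({"A": -1.5, "B": -2.5}, {"A": -2.0, "B": -2.0}, "A"),
    ({"A": -1.0, "B": -1.2}, {"A": -6.0, "B": -1.0}, "B"),
]
alpha = learn_alpha(dev_set, [0.0, 0.25, 0.5, 0.75, 1.0])
```

On this toy data neither modality alone classifies every example correctly, while a weight between the two extremes does.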


Example System

Dialocalization: Acoustic Speaker Diarization and Visual Localization as Joint Optimization Problem

G. Friedland, C. Yeo, H. Hung: Dialocalization: Acoustic Speaker Diarization and Visual Localization as Joint Optimization Problem, ACM Transactions on Multimedia Computing, Communications, and Applications, Vol. 6, No. 4, Article 27, November 2010

Current Common Sense

• Localization (Computer Vision Task): Localization in space.

• Speaker Diarization (Speech Processing Task)

Example: Speaker Diarization

Speaker localization on timeline: “who spoke when”.

Figure (timeline rows): Audiotrack, Segmentation, Clustering.

Speaker Diarization...

➡tries to answer the question: “who spoke when?”

➡using a single microphone input

➡without prior knowledge of anything (#speakers, language, text, etc...)


Single Audio Stream

Diagram: Audio Signal → Feature Extraction → MFCC → Speech/Non-Speech Detector → MFCC (speech only) → Diarization Engine (Segmentation, Clustering) → Metadata

Bottom-Up Algorithm

• Start with too many clusters (initialized randomly)

• Purify clusters by comparing and merging similar clusters

• Resegment and repeat until no more merging needed

Flowchart: Initialization → (Re-)Training → (Re-)Alignment → Merge two Clusters? Yes: back to (Re-)Training. No: End.

Figure: the timeline segments are relabeled after each pass, from an initial sequence over Cluster1, Cluster2, and Cluster3 to a final sequence over Cluster1 and Cluster2.
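The loop of training, alignment, and merging can be sketched on one-dimensional toy data. This is a strong simplification and not the ICSI engine: clusters are modeled by their mean, initialization is round-robin instead of random so the run is repeatable, and a fixed distance threshold stands in for the merge criterion, which the slides do not specify.

```python
def train(points, labels):
    """(Re-)Training: one mean per non-empty cluster."""
    means = {}
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        means[c] = sum(members) / len(members)
    return means

def align(points, means):
    """(Re-)Alignment: assign every point to the closest cluster mean."""
    return [min(means, key=lambda c: abs(p - means[c])) for p in points]

def closest_pair(means):
    """Return (distance, a, b) for the two most similar clusters, or None."""
    ids = sorted(means)
    pairs = [
        (abs(means[a] - means[b]), a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    ]
    return min(pairs) if pairs else None

def bottom_up(points, initial_clusters=4, merge_threshold=1.0):
    # Initialization: deliberately start with too many clusters.
    labels = [i % initial_clusters for i in range(len(points))]
    while True:
        labels = align(points, train(points, labels))
        pair = closest_pair(train(points, labels))
        if pair is None or pair[0] >= merge_threshold:
            return labels  # "Merge two clusters?" answered with no: end
        _, keep, drop = pair
        labels = [keep if l == drop else l for l in labels]

# Hypothetical 1-D features from two speakers, interleaved in time.
points = [0.0, 0.1, 5.0, 5.1, 0.2, 4.9, 0.05, 5.2]
labels = bottom_up(points)
```

Every pass either stops or removes one cluster, so the loop always terminates.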

Current Accuracy

Single-Stream System       ICSI Devset 07    Eval07    VACE (AMI)
Speech/Non Speech Error    6.4%              6.8%      12.2%
Speaker Error              11.3%             14.9%     19.89%
Diarization Error Rate     17.57%            21.24%    32.09%

ICSI Speaker Diarization Engine as participated in NIST RT07.

Goals

• Improve Robustness while...

• ...increasing or at least keeping the speed.

• Need to identify speakers, e.g. by association with face.

Idea: Multimodality could help

Multimodal Speaker Diarization...

➡tries to answer the question: “who spoke when?”

➡using a single microphone and single camera input

➡without prior knowledge of anything (#speakers, language, text, etc...)

AMI Meeting Room Setup

AMI Meetings: Real-World Problems

• Close-view still not good enough for face detection

• People lean back and forward, stand up, walk around, leave the room, etc...

Even more Problems: Single Camera View

• Very low resolution per participant

• Partial occlusions

Audio/Visual Correlation Assumptions

• Camera captures all participants, most of the time.

• Speaker locations have limited spatial variance.

• Speakers have more visual activity than non-speakers.

Multimodal Diarization

Diagram: Audio Signal → Feature Extraction → MFCC → Speech/Non-Speech Detector → MFCC (only speech) → Diarization Engine (Segmentation, Clustering) → "Who spoke when". In parallel: Video Signal → Feature Extraction → Video Activity (only speech regions) → Diarization Engine. The diagram also carries an "Events" label.

Video Feature Extraction

Diagram: MPEG-4 Video → Divide Frames into n Regions → Detect Skin Blocks → Avg. Motion Vectors → n-dimensional activity vector

Window size: 400 ms
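A rough sketch of the activity vector for a single frame follows. It assumes the frame is split into n vertical strips and that each block arrives with its motion vector and a skin flag already computed; the slide does not say how regions are laid out, and the averaging over the 400 ms window is left out.

```python
import math

def activity_vector(blocks, frame_width, n_regions):
    """Average motion magnitude of skin blocks in each of n vertical strips.

    blocks: iterable of (x, dx, dy, is_skin) tuples, one per block, where x
    is the block's horizontal position in pixels and (dx, dy) its motion vector.
    """
    sums = [0.0] * n_regions
    counts = [0] * n_regions
    for x, dx, dy, is_skin in blocks:
        if not is_skin:
            continue  # only skin-colored blocks contribute
        region = min(int(x * n_regions / frame_width), n_regions - 1)
        sums[region] += math.hypot(dx, dy)
        counts[region] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Hypothetical blocks from one 200-pixel-wide frame.
blocks = [
    (10, 3.0, 4.0, True),    # region 0, magnitude 5
    (20, 0.0, 1.0, True),    # region 0, magnitude 1
    (30, 9.0, 9.0, False),   # ignored: not skin
    (150, 0.0, 2.0, True),   # region 1, magnitude 2
]
vector = activity_vector(blocks, frame_width=200, n_regions=2)
```

Because MPEG-4 already stores motion vectors in the compressed stream, a feature of this kind needs little extra computation.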

Model-Level Integration

Diagram: Audio → MFCC → GMMs → likelihoods and Video → Activity → GMMs → likelihoods are combined by ‘+’ → Decision → Result.

Multimodal Diarization Results


12 Meetings from AMI corpus “VACE Meetings”

Multimodal vs Unimodal

Error/System     Four Cameras    Random
Speaker Error    68.80%          75.00%

Video features alone perform poorly!

Warning: Designing multimodal algorithms may require integrated thinking. Blackbox combination of unimodal approaches may not work.

Agglomerative Clustering

Diagram: Cepstral audio features ‘+’ video activities in each region → models containing MFCC and video activity vectors.

Who Spoke When?

Diagram: Cepstral audio features ‘+’ video activities in each region → Speaker X?

Which model fits best?

Where is the Speaker?

Diagram: Speaker from diarization ‘+’ all possible activity locations for speakers → Speaker X?

Which activity location fits best?
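One simple reading of "which activity location fits best" is to pick the region with the highest average activity while a given speaker is talking. This sketch rests on the assumption from the earlier slide that speakers show more visual activity than non-speakers; it is not the model inversion used in the actual system, and the labels and activity values are hypothetical.

```python
def locate_speaker(speaker_labels, activity_vectors, speaker):
    """Return the region with the highest mean activity while `speaker` talks."""
    frames = [v for label, v in zip(speaker_labels, activity_vectors) if label == speaker]
    if not frames:
        return None  # this speaker never appears in the diarization output
    n = len(frames[0])
    mean_activity = [sum(f[r] for f in frames) / len(frames) for r in range(n)]
    return max(range(n), key=mean_activity.__getitem__)

# Per-window diarization output and 2-region activity vectors (hypothetical).
labels = ["A", "A", "B", "A", "B"]
activity = [[3.0, 0.5], [2.5, 0.4], [0.2, 4.0], [2.8, 1.0], [0.3, 3.5]]
region_a = locate_speaker(labels, activity, "A")
region_b = locate_speaker(labels, activity, "B")
```

The diarization output supplies the time segments, and the video activity supplies the location, which is the sense in which localization comes as a by-product.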

Speaker Localization

Diagram: Audio Signal → Feature Extraction → MFCC → Speech/Non-Speech Detector → MFCC (only speech) → Diarization Engine (Segmentation, Clustering) → "who spoke when". In parallel: Video Signal → Feature Extraction → Video Activity (only speech regions) → Diarization Engine. The diarization output then feeds Invert Visual Models → "where the speaker was". The diagram also carries an "Events" label.

Speaker Localization and Diarization

Conclusion I


Speaker Diarization = Speaker Localization

No need to treat as separate problem!

Conclusion II

Multimodal diarization with video results in:

• higher accuracy at low computational overhead

• speaker localisation as a by-product

= “Multimodal Synergy”

Conclusion III

It is possible to create a machine learning system that benefits from multimodal integration such that

– it is more accurate than the unimodal state of the art and

– it offers qualitative improvements over unimodal approaches (here: more semantic output)

Next Week (Project Meeting)

•Benjamin Elizalde on ICSI's TRECVID MED 2012 System

Next Week (Lecture)

•How to estimate computational needs