Hands On: Multimedia Methods for Large Scale Video Analysis (Lecture)
Dr. Gerald Friedland, [email protected]
1
Today
2
•Recap: Some more Machine Learning
•Multimedia Systems
•An example Multimedia System
Recap: Architecture of Content Analysis Algorithms
3
Recap: Some More Machine Learning
4
•k-Nearest Neighbors
•Neural Networks
•SVMs
•HMMs
k-Nearest Neighbors
5
Another Magic Duo
6
•Histograms are the most widely used image models in practice
•Nearest Neighbors (with Euclidean distance) is the most commonly used technique for comparing visual features
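As a sketch of this duo, the snippet below builds normalized intensity histograms and compares them with Euclidean-distance nearest neighbor. The toy images, bin count, and helper names are invented for illustration, not taken from the lecture.

```python
# Histogram features + nearest neighbor with Euclidean distance.
import numpy as np

def histogram_feature(image, bins=16):
    """Reduce an image to a normalized intensity histogram."""
    hist, _ = np.histogram(image, bins=bins, range=(0, 256))
    return hist / hist.sum()

def nearest_neighbor(query, database):
    """Index of the database histogram closest in Euclidean distance."""
    dists = [np.linalg.norm(query - h) for h in database]
    return int(np.argmin(dists))

# Toy "images": uniform patches at three different intensities.
db_images = [np.full((8, 8), 10), np.full((8, 8), 100), np.full((8, 8), 200)]
db_hists = [histogram_feature(im) for im in db_images]

query = histogram_feature(np.full((8, 8), 100))   # same intensity as image 1
print(nearest_neighbor(query, db_hists))          # -> 1
```

The same pattern scales to color histograms: one histogram per channel, concatenated into a single feature vector.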
Neural Networks (MLPs)
7
Linear Separation
8
Support Vector Machines
9
Hidden Markov Models
10
•a’s: state transition probabilities
•b’s: observation likelihoods
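To illustrate what the a's and b's do, here is a minimal forward pass for a two-state HMM. All probabilities are invented toy numbers, not from the lecture.

```python
# Minimal HMM forward algorithm: a's (transitions) and b's (observation
# likelihoods) jointly score an observation sequence.
import numpy as np

pi = np.array([0.6, 0.4])              # initial state probabilities
A = np.array([[0.7, 0.3],              # a[i][j] = P(next state j | state i)
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],              # b[i][k] = P(observation k | state i)
              [0.2, 0.8]])

def forward(obs):
    """Probability of the observation sequence under the model."""
    alpha = pi * B[:, obs[0]]          # initialize with priors and first b
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # propagate through a's, weight by b's
    return alpha.sum()

print(round(forward([0, 0, 1]), 4))    # -> 0.1362
```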
Hidden Markov Models
11
Multimedia: Definition
12
Entry: multimedia
Function: noun, plural but singular or plural in construction
Date: 1950
A technique (as the combining of sound, video, and text) for expressing ideas (as in communication, entertainment, or art) in which several media are employed; also: something (as software) using or facilitating such a technique.
(Merriam-Webster online dictionary)
Multimedia Content Analysis
Automatic analysis of the content (semantics) contained in data directly encoded for human perception (audio, images, video, touch) and its associated metadata (natural text, computer-encoded data).
13
Multimodal Integration
14
•... is a field of cognitive psychology.
•Before 1960: Unimodal approach
•Initial results in the 1960s, recently hyped again (2003+)
Multimodal Integration
15
Human psychology suggests:
•Multiple sensory inputs increase the speed of the output (Hershenson 1962)
•Uncertainty in sensory domains results in increased reliance on multisensory integration (Alais & Burr 2004)
Multimodal Integration
16
In computer science:
•How to create systems that benefit from multimodal integration in ways similar to the brain, i.e. they are
– more accurate, robust, and/or faster than the unimodal state of the art, and/or
– offer qualitative improvements over unimodal approaches
Recap: Architecture of Content Analysis Algorithms
17
Generic Scheme of a Classification Algorithm
18
Some signal is observed and reduced...
...to the essentials relevant to the problem, ...
...statistical models are used to compute a score (e.g. probabilities) for the given observations, ...
... so that a decision function can decide on the classification.
Signal →(reduce dimensions)→ Features →(build abstraction)→ Models →(generate score)→ Decision →(output decision)→ Result
Feature-Level Integration
19
Features are integrated before the model layer using a function ‘+’.
For example, concatenation: an n-dimensional vector ‘+’ an m-dimensional vector = an (n+m)-dimensional vector
Signal1 →(reduce dimensions)→ Features ┐
                                       ‘+’ →(build abstraction)→ Models →(generate score)→ Decision →(output decision)→ Result
Signal2 →(reduce dimensions)→ Features ┘
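Concatenation as the integration function ‘+’ is a one-liner in practice. The vectors below are toy values standing in for, say, audio and video features.

```python
# Feature-level integration by concatenation: an n-dim vector joined
# with an m-dim vector gives one (n+m)-dim feature for a single model.
import numpy as np

audio_feat = np.array([0.2, 0.5, 0.1])           # n = 3 (e.g. some MFCCs)
video_feat = np.array([0.7, 0.3])                # m = 2 (e.g. activity values)

combined = np.concatenate([audio_feat, video_feat])
print(combined.shape)                            # -> (5,)
```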
Model-Level Integration
20
Output scores are integrated using a function ‘+’.
For example weighted combined log-likelihoods.
Signal1 →(reduce dimensions)→ Features →(build abstraction)→ Models →(generate scores)→ ┐
                                                                                        ‘+’ → combined score → Decision →(output decision)→ Result
Signal2 →(reduce dimensions)→ Features →(build abstraction)→ Models →(generate scores)→ ┘
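Weighted combination of per-modality log-likelihoods can be sketched as follows. The scores and the weight are invented for illustration; in a real system the weight would be tuned on held-out data.

```python
# Model-level integration: combine per-modality log-likelihoods with a
# weight, then decide by taking the best-scoring class.
import numpy as np

log_lik_audio = np.array([-12.0, -9.5, -11.2])   # one score per class/speaker
log_lik_video = np.array([-4.1, -5.0, -3.8])
w = 0.7                                          # audio weight; video gets 1-w

combined = w * log_lik_audio + (1 - w) * log_lik_video
decision = int(np.argmax(combined))
print(decision)                                  # -> 1
```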
Decision-Level Integration
21
Output decisions are fused using a function ‘+’.
For example majority voting.
WARNING: Meta-data fusion in general is a difficult research problem.
Signal1 →(reduce dimensions)→ Features →(build abstraction)→ Models →(generate score)→ Decision ┐
                                                                                                ‘+’ →(output decision)→ Result
Signal2 →(reduce dimensions)→ Features →(build abstraction)→ Models →(generate score)→ Decision ┘
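Decision-level fusion by majority voting can be sketched in a few lines; the speaker labels are illustrative.

```python
# Decision-level integration by majority voting over the decisions of
# several modalities (or classifiers).
from collections import Counter

def majority_vote(decisions):
    """Return the most common decision; ties go to the first seen."""
    return Counter(decisions).most_common(1)[0][0]

print(majority_vote(["spk1", "spk2", "spk1"]))   # -> spk1
```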
Remarks
• Signal-level integration is rarely feasible because of intractable data dimensionality.
• Multi-level integration is also possible.
• In reality, a classification algorithm is more complicated than this scheme (e.g. feedback loops).
• The integration function ‘+’ may also be learned automatically.
22
Example System
Dialocalization: Acoustic Speaker Diarization and Visual Localization as Joint Optimization Problem
23
G. Friedland, C. Yeo, H. Hung: Dialocalization: Acoustic Speaker Diarization and Visual Localization as Joint Optimization Problem, ACM Transactions on Multimedia Computing, Communications, and Applications, Vol. 6, No. 4, Article 27, November 2010
Current Common Sense
• Localization (Computer Vision Task): localization in space.
• Speaker Diarization (Speech Processing Task): localization in time.
Example: Speaker Diarization
25
Audiotrack:
Segmentation:
Clustering:
Speaker localization on the timeline: “who spoke when”.
Speaker Diarization...
➡tries to answer the question: “who spoke when?”
➡using a single microphone input
➡without prior knowledge of anything (#speakers, language, text, etc...)
26
Single Audio Stream
27
Audio Signal → Feature Extraction (MFCC) → Speech/Non-Speech Detector (speech-only MFCC) → Diarization Engine (Segmentation + Clustering) → Metadata
Bottom-Up Algorithm
28
Initialization → (Re-)Training → (Re-)Alignment → Merge two Clusters? (Yes: back to (Re-)Training; No: End)
Start with too many clusters (initialized randomly).
Purify clusters by comparing and merging similar clusters.
Resegment and repeat until no more merging is needed.
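The bottom-up loop can be sketched as below. The real engine models clusters with GMMs and merges by a BIC-style criterion; here a plain centroid distance and a made-up threshold stand in, just to show the control flow (over-initialize, merge closest pair, stop when no merge is warranted).

```python
# Sketch of bottom-up (agglomerative) clustering as used in diarization.
import numpy as np

def bottom_up(segments, threshold):
    """segments: list of feature vectors. Returns clusters as index lists."""
    clusters = [[i] for i in range(len(segments))]  # start with too many clusters
    while len(clusters) > 1:
        # Compare all cluster pairs; remember the closest one.
        best, best_d = None, np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                ca = np.mean([segments[i] for i in clusters[a]], axis=0)
                cb = np.mean([segments[i] for i in clusters[b]], axis=0)
                d = np.linalg.norm(ca - cb)
                if d < best_d:
                    best, best_d = (a, b), d
        if best_d > threshold:          # "Merge two clusters? No" -> End
            break
        a, b = best                     # "Yes" -> merge, then repeat
        clusters[a] += clusters.pop(b)
    return clusters

# Toy 1-D segments: two speakers around 0.0 and 5.1.
segs = [np.array([0.0]), np.array([0.1]), np.array([5.0]), np.array([5.2])]
print(len(bottom_up(segs, threshold=1.0)))          # -> 2
```

The re-training and re-alignment steps of the real algorithm (Viterbi resegmentation against the current models) are omitted here for brevity.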
Current Accuracy
29
Single-Stream System (ICSI)   Devset 07   Eval07   VACE (AMI)
Speech/Non-Speech Error          6.4%      6.8%     12.2%
Speaker Error                   11.3%     14.9%     19.89%
Diarization Error Rate          17.57%    21.24%    32.09%
ICSI Speaker Diarization Engine as it participated in NIST RT07.
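Under NIST's scoring, the Diarization Error Rate decomposes (up to rounding) into the speech/non-speech error plus the speaker error, which the VACE column above illustrates exactly:

```python
# DER decomposition check for the VACE (AMI) column of the table.
sns_error = 12.2      # speech/non-speech error (%)
spk_error = 19.89     # speaker error (%)

der = sns_error + spk_error
print(round(der, 2))  # -> 32.09
```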
Goals
30
• Improve robustness while...
• ...increasing, or at least keeping, the speed.
• Need to identify speakers, e.g. by association with a face.
Idea: Multimodality could help
Multimodal Speaker Diarization...
➡tries to answer the question: “who spoke when?”
➡using a single microphone and single camera input
➡without prior knowledge of anything (#speakers, language, text, etc...)
31
AMI Meeting Room Setup
32
AMI Meetings: Real-World Problems
33
• Close-view still not good enough for face detection
• People lean backward and forward, stand up, walk around, leave the room, etc...
Even more Problems: Single Camera View
34
• Very low resolution per participant
• Partial occlusions
Audio/Visual Correlation Assumptions
35
• Camera captures all participants, most of the time.
• Speaker locations have limited spatial variance.
• Speakers have more visual activity than non-speakers.
Multimodal Diarization
36
Audio Signal → Feature Extraction (MFCC) → Speech/Non-Speech Detector → Diarization Engine (Segmentation + Clustering) → "Who spoke when"
Video Signal → Feature Extraction → Video Activity Events (speech regions only) → Diarization Engine
Video Feature Extraction
37
MPEG-4 Video → Divide Frames into n Regions → Detect Skin Blocks → Avg. Motion Vectors → n-dimensional activity vector (window size: 400 ms)
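A sketch of this feature under simplifying assumptions: the motion magnitudes are synthetic rather than decoded from an MPEG-4 stream, the n regions are plain vertical strips, and the skin-block detection step is omitted.

```python
# Block-based video activity feature: average motion magnitude per region
# over one analysis window (~400 ms of frames).
import numpy as np

def activity_vector(motion_mags, n_regions):
    """motion_mags: (frames, height, width) motion magnitudes in one window.
    Returns an n_regions-dimensional average-activity vector."""
    frames, h, w = motion_mags.shape
    cols = np.array_split(np.arange(w), n_regions)   # vertical strips as regions
    return np.array([motion_mags[:, :, c].mean() for c in cols])

rng = np.random.default_rng(1)
window = rng.random((10, 16, 32))                    # 10 frames ~ 400 ms
vec = activity_vector(window, n_regions=4)
print(vec.shape)                                     # -> (4,)
```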
Model-Level Integration
38
Audio → MFCC → GMMs → likelihoods ┐
                                  ‘+’ → Decision → Result
Video → Activity → GMMs → likelihoods ┘
Multimodal Diarization Results
39
12 meetings from the AMI corpus (“VACE meetings”)
Multimodal vs Unimodal
40
Error / System    Four Cameras   Random
Speaker Error        68.80%      75.00%
Video features alone perform poorly!
Warning: Designing multimodal algorithms may require integrated thinking. Blackbox combination of unimodal approaches may not work.
Agglomerative Clustering
41
Cepstral audio features ‘+’ video activities in each region → models containing MFCC and video activity vectors
Who Spoke When?
42
Cepstral audio features ‘+’ video activities in each region → which model fits best? → Speaker X
Where is the Speaker?
43
Speaker X from diarization ‘+’ all possible activity locations for speakers → which activity location fits best?
Speaker Localization
44
Audio Signal → Feature Extraction (MFCC) → Speech/Non-Speech Detector → Diarization Engine (Segmentation + Clustering) → "who spoke when"
Video Signal → Feature Extraction → Video Activity Events (speech regions only) → Invert Visual Models → "where the speaker was"
Speaker Localization and Diarization
45
Conclusion I
46
Speaker Diarization = Speaker Localization
No need to treat them as separate problems!
Conclusion II
47
Multimodal diarization with video results in:
• higher accuracy at low computational overhead
• speaker localization as a by-product
= “Multimodal Synergy”
Conclusion III
48
It is possible to create a machine learning system that benefits from multimodal integration such that
– it is more accurate than the unimodal state of the art, and it
– offers qualitative improvements over unimodal approaches (here: more semantic output)
Next Week (Project Meeting)
•Benjamin Elizalde on ICSI’s TRECVID MED 2012 System
Next Week (Lecture)
50
•How to estimate computational needs