Multimedia Segmentation and Summarization
Dr. Jia-Ching Wang
Honorary Fellow, ECE Department, UW-Madison
Outline
Introduction
Speaker Segmentation
Video Summarization
Conclusion
What is Multimedia?
Image
Video
Speech
Audio
Text
Multimedia Everywhere
Fax machines: transmission of binary images
Digital cameras: still images
iPod / iPhone & MP3
Digital camcorders: video sequences with audio
Digital television broadcasting
Compact disk (CD), Digital video disk (DVD)
Personal video recorder (PVR, TiVo)
Images on the World Wide Web
Video streaming, video conferencing
Video on cell phones, PDAs
High-definition televisions (HDTV)
Medical imaging: X-ray, MRI, ultrasound
Military imaging: multi-spectral, satellite, microwave
What is Multimedia Content?
Multimedia content: the syntactic and semantic information inherent in digital material.
Example: text document
Syntactic content: chapter, paragraph
Semantic content: key words, subject, types of text document, etc.
Example: video document
Syntactic content: scene cuts, shots
Semantic content: motion, summary, index, caption, etc.
Why Do We Need to Know Multimedia Content?
Information processing (archiving, indexing, delivering, accessing, and other processing) requires in-depth knowledge of the content to optimize performance.
How to Know Multimedia Content?
Multimedia content analysis: the computerized understanding of the semantic/syntactic content of a multimedia document
Multimedia content analysis usually involves
Segmentation: segmenting the multimedia document into units
Classification: classifying each unit into a predefined type
Annotation: annotating the multimedia document
Summarization: summarizing the multimedia document
Multimedia Segmentation and Summarization
Multimedia segmentation
Syntactic content
Multimedia summarization
Semantic/syntactic content
The result of the temporal segmentation can benefit the video summarization
Multimedia Segmentation
Image segmentation
Video segmentation: scene change, shot change
Audio segmentation: audio class change
Speech segmentation: speaker change detection
Text segmentation: word segmentation, sentence segmentation, topic change detection
Multimedia Summarization
Image summarization: region of interest
Video summarization: storyboard, highlight
Audio summarization: main theme in music, chorus in song, event sound in environmental sound stream
Speech summarization: speech abstract
Text summarization: abstract
What is Speaker Segmentation?
It can also be called speaker change detection (SCD)
Assumption: no two speaker streams overlap
[Figure: an audio stream divided into consecutive segments labeled speaker1, speaker2, speaker3]
Supervised vs. Unsupervised SCD
Supervised manner: acoustic data are made up of distinct speakers who are known a priori
Recognition based solution
Unsupervised manner: no prior knowledge about the number and identities of speakers
Metric-based criterion
Model selection-based criterion
Supervised Speaker Segmentation -- Gaussian Mixture Model
Gaussian mixture modeling (GMM)
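$$p(x) = \sum_{i=1}^{M} c_i \, \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}$$
$x$ is a $d$-dimensional random vector. $c_i$, $i = 1, \dots, M$, is the mixture weight. $\mu_i$ is the mean vector. $\Sigma_i$ is the covariance matrix.
$$D_t = \arg\max_{d = 1, 2, \dots, D} P_d(x_t)$$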
Incoming audio stream is classified into one of D classes in a maximum likelihood manner at time t
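A minimal sketch of the maximum-likelihood rule above, assuming each of the D classes already has a trained GMM given as (weights, means, covariances). The function names and the toy two-class models are illustrative assumptions and do not come from the slides.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, covs):
    """Return log p(x) for p(x) = sum_i c_i * N(x; mu_i, Sigma_i)."""
    d = x.shape[0]
    log_terms = []
    for c, mu, sigma in zip(weights, means, covs):
        diff = x - mu
        _, logdet = np.linalg.slogdet(sigma)          # log |Sigma_i|
        maha = diff @ np.linalg.solve(sigma, diff)    # (x - mu_i)^T Sigma_i^-1 (x - mu_i)
        log_terms.append(np.log(c) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
    return float(np.logaddexp.reduce(log_terms))      # log of the weighted sum

def classify_frame(x, class_models):
    """Return the index of the class model with the highest likelihood for x."""
    scores = [gmm_log_likelihood(x, *model) for model in class_models]
    return int(np.argmax(scores))

# Toy class models (assumed values): (weights, means, covariances) per class.
speaker_a = ([0.5, 0.5], [np.array([0.0, 0.0]), np.array([1.0, 1.0])], [np.eye(2), np.eye(2)])
speaker_b = ([1.0], [np.array([5.0, 5.0])], [np.eye(2)])
models = [speaker_a, speaker_b]

frames = np.array([[0.2, 0.1], [4.8, 5.3], [0.9, 1.2]])
labels = [classify_frame(f, models) for f in frames]
print(labels)  # [0, 1, 0]
```

Working in the log domain avoids underflow when the feature dimension is large, which is why the sketch sums mixture components with `np.logaddexp` rather than exponentiating each term.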
Supervised Speaker Segmentation -- Hidden Markov Model
Unsupervised Speaker Segmentation -- Sliding Window Strategy & Detection Criterion
Metric-based criterion (The dissimilarities between the acoustic feature vectors are measured)
Kullback-Leibler distance
Mahalanobis distance
Bhattacharyya distance
Model selection-based criterion
Bayesian information criterion (BIC)
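The three metric-based distances can be sketched as follows, assuming each sliding window is summarized by a single Gaussian (mean and covariance of its feature vectors). The Gaussian summary, the averaged covariance used in the Mahalanobis distance, and all function names are assumptions made for illustration; the slides do not specify them.

```python
import numpy as np

def gaussian_stats(frames):
    """Mean vector and covariance matrix of one window (rows are feature vectors)."""
    return frames.mean(axis=0), np.cov(frames, rowvar=False)

def kl_divergence(mu_p, cov_p, mu_q, cov_q):
    """KL(p || q) between two Gaussians."""
    d = mu_p.shape[0]
    diff = mu_q - mu_p
    inv_q = np.linalg.inv(cov_q)
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (np.trace(inv_q @ cov_p) + diff @ inv_q @ diff - d + logdet_q - logdet_p)

def symmetric_kl(mu1, cov1, mu2, cov2):
    """Symmetrized Kullback-Leibler distance: KL(p || q) + KL(q || p)."""
    return float(kl_divergence(mu1, cov1, mu2, cov2) + kl_divergence(mu2, cov2, mu1, cov1))

def mahalanobis(mu1, cov1, mu2, cov2):
    """Mahalanobis distance between the means, using the averaged covariance (assumption)."""
    diff = mu1 - mu2
    avg_cov = 0.5 * (cov1 + cov2)
    return float(np.sqrt(diff @ np.linalg.solve(avg_cov, diff)))

def bhattacharyya(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between two Gaussians."""
    diff = mu1 - mu2
    avg_cov = 0.5 * (cov1 + cov2)
    _, logdet_avg = np.linalg.slogdet(avg_cov)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    return float(0.125 * diff @ np.linalg.solve(avg_cov, diff)
                 + 0.5 * (logdet_avg - 0.5 * (logdet1 + logdet2)))

# Analytic toy case: unit covariances, means 2 apart along the first axis.
mu_a, cov_a = np.array([0.0, 0.0]), np.eye(2)
mu_b, cov_b = np.array([2.0, 0.0]), np.eye(2)
print(symmetric_kl(mu_a, cov_a, mu_b, cov_b))   # 4.0
print(mahalanobis(mu_a, cov_a, mu_b, cov_b))    # 2.0
print(bhattacharyya(mu_a, cov_a, mu_b, cov_b))  # 0.5
```

In a sliding-window detector, `gaussian_stats` would be applied to the two adjacent windows and the resulting distance compared against a threshold, which has to be tuned for the data.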
Bayesian Information Criterion
Model selection: choose one among a set of candidate models $M_i$, $i = 1, 2, \dots, m$, and corresponding model parameters to represent a given data set $D = (D_1, D_2, \dots, D_N)$.
Model posterior probability
$$P(M_i \mid D) = \frac{P(D \mid M_i) P(M_i)}{P(D)} \propto P(D \mid M_i)$$
Bayesian information criterion: maximized log data likelihood for the given model with a model complexity penalty
Bayesian information criterion of model $M_i$
$$\mathrm{BIC}(M_i) \equiv \log P(M_i \mid D) \approx \log P(D \mid \hat{\theta}_i, M_i) - \frac{d_i}{2} \log N$$
where $d_i$ is the number of independent parameters in the model parameter set $\hat{\theta}_i$
For a single $d$-dimensional Gaussian with maximum-likelihood covariance $\hat{\Sigma}$:
$$\mathrm{BIC}(M_i) = -\frac{dN}{2} \log 2\pi - \frac{N}{2} \log |\hat{\Sigma}| - \frac{dN}{2} - \frac{1}{2} \left( d + \frac{1}{2} d(d+1) \right) \log N$$
Unsupervised Segmentation Using Bayesian Information Criterion
First model
$$M_1: x_1, x_2, \dots, x_N \sim N(\mu, \Sigma)$$
Second model
$$M_2: x_1, x_2, \dots, x_b \sim N(\mu_1, \Sigma_1); \quad x_{b+1}, x_{b+2}, \dots, x_N \sim N(\mu_2, \Sigma_2)$$
Bayesian information criterion
$$\mathrm{BIC}(M_1) = -\frac{dN}{2} \log 2\pi - \frac{N}{2} \log |\hat{\Sigma}| - \frac{dN}{2} - \frac{1}{2} \left( d + \frac{1}{2} d(d+1) \right) \log N$$
$$\mathrm{BIC}(M_2) = -\frac{dN}{2} \log 2\pi - \frac{b}{2} \log |\hat{\Sigma}_1| - \frac{N-b}{2} \log |\hat{\Sigma}_2| - \frac{dN}{2} - 2 \cdot \frac{1}{2} \left( d + \frac{1}{2} d(d+1) \right) \log N$$
$$\Delta\mathrm{BIC}(b) = \mathrm{BIC}(M_2) - \mathrm{BIC}(M_1)$$
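A hedged sketch of BIC-based change detection inside one analysis window. It takes the difference BIC(M2) - BIC(M1) for full-covariance Gaussians, in which the constant terms cancel and only the log-determinants and the extra penalty of the second Gaussian remain. A positive value favors the two-speaker model at candidate point b. The `penalty_weight` tuning factor, the `margin` that keeps both segments long enough to estimate a covariance, and the synthetic data are assumptions, not part of the slides.

```python
import numpy as np

def log_det_cov(frames):
    """Log-determinant of the maximum-likelihood covariance of a segment."""
    cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
    return np.linalg.slogdet(cov)[1]

def delta_bic(frames, b, penalty_weight=1.0):
    """BIC(M2) - BIC(M1) for a hypothesized change after the first b frames."""
    n, d = frames.shape
    extra_params = d + 0.5 * d * (d + 1)   # parameters added by the second Gaussian
    gain = 0.5 * (n * log_det_cov(frames)
                  - b * log_det_cov(frames[:b])
                  - (n - b) * log_det_cov(frames[b:]))
    return float(gain - penalty_weight * 0.5 * extra_params * np.log(n))

def detect_change(frames, margin=20):
    """Return (best b, score) if the best score is positive, else (None, score)."""
    candidates = list(range(margin, len(frames) - margin))
    scores = [delta_bic(frames, b) for b in candidates]
    best = int(np.argmax(scores))
    if scores[best] > 0:
        return candidates[best], scores[best]
    return None, scores[best]

# Synthetic stream (assumed data): 100 frames of one "speaker", then 100 of another.
rng = np.random.default_rng(0)
stream = np.vstack([rng.normal(0.0, 1.0, (100, 2)), rng.normal(4.0, 1.0, (100, 2))])
change_point, score = detect_change(stream)
print(change_point, score)
```

Because each segment needs enough frames for a stable covariance estimate, short speaker turns are hard to catch with this criterion.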
Disadvantages of Conventional Unsupervised Speaker Change Detection
Disadvantages:
For metric-based methods, it is not easy to choose a suitable threshold
For BIC, it is not easy to detect speaker segments shorter than 2 seconds
Proposed Method -- Misclassification Error Rate
Sliding window pairs
Feature vector distribution
Same speaker Different speakers
Mathematical Analysis
Mathematical Analysis (continued)
Discussion
Generative and discriminant classifiers are both applicable
Key point: discriminant classifiers have the benefit that less data is required
We can use a smaller scanning window size
The ability to detect short speaker change segments increases
Speaker Segmentation Using Misclassification Error Rate
Steps
Preprocessing
Framing, Feature extraction
Hypothesized speaker change point selection
Forcing 2-class labels
Training a discriminant hyperplane
Inside data recognition & calculating misclassification error rate
Accept/reject the hypothesized speaker change point
Significance
The unsupervised speaker segmentation problem is solved by supervised classification
[Figure: frames before the hypothesized speaker change point are tagged +1 and frames after it are tagged -1; features are extracted from both sides, a discriminant classifier is trained on the tagged frames, and the misclassification error rate is used to accept or reject the hypothesized speaker change point]
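The steps on this slide can be sketched as below. This is an illustration under stated assumptions, not the author's implementation: a least-squares linear discriminant stands in for the discriminant hyperplane, the acceptance rule "error rate below a threshold" and the threshold value 0.25 are assumed, and the feature vectors are synthetic.

```python
import numpy as np

def misclassification_error_rate(left, right):
    """Force labels +1/-1 on the two windows, fit a hyperplane, and score the same data."""
    features = np.vstack([left, right])
    labels = np.concatenate([np.ones(len(left)), -np.ones(len(right))])
    augmented = np.hstack([features, np.ones((len(features), 1))])   # bias column
    weights, *_ = np.linalg.lstsq(augmented, labels, rcond=None)     # least-squares hyperplane
    predicted = np.where(augmented @ weights >= 0, 1.0, -1.0)        # inside data recognition
    return float(np.mean(predicted != labels))

def accept_change_point(frames, t, window, threshold=0.25):
    """Accept the hypothesized change at frame t if the two windows are easy to separate."""
    left, right = frames[t - window:t], frames[t:t + window]
    return misclassification_error_rate(left, right) < threshold

# Synthetic feature vectors (assumed data).
rng = np.random.default_rng(0)
speaker1 = rng.normal(0.0, 1.0, (100, 2))
speaker2 = rng.normal(4.0, 1.0, (100, 2))
speaker1_more = rng.normal(0.0, 1.0, (100, 2))

error_diff = misclassification_error_rate(speaker1, speaker2)       # different speakers
error_same = misclassification_error_rate(speaker1, speaker1_more)  # same speaker
accepted = accept_change_point(np.vstack([speaker1, speaker2]), t=100, window=100)
print(error_diff, error_same, accepted)
```

When both windows come from the same speaker the forced labels are arbitrary, so no hyperplane separates them well and the error rate stays near chance. When the windows come from different speakers the error rate drops.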
Experimental Results
Method     F-score   Precision   Recall
Proposed   71.8      70.2        81.3
BIC        63.3      54.4        75.7
Video Summarization
Dynamic vs. Static Video Summarization
Dynamic video summarization
Sport highlight, movie trailer
Static video summarization
Storyboard
– Visual-based approach
– Incorporation of the semantic Information
Static Video Summarization -- Visual Based Approach
Example
Problem
Is the summarization ratio adjustable?
How to generate effective storyboard under a given summarization ratio?
How to Generate Effective Storyboard
Question: assume there are n frames and the summarization ratio is r/n. How do we select the best r frames?
Complexity:
There are C(n,r) different choices
How to Generate Effective Storyboard
From a visual viewpoint
Most visually distinct frames should be extracted
Dissimilarity between two frames is measured by low-level visual features
How to select the best r frames from n frames
Solution: maximize the overall pairwise dissimilarities
Complexity: C(n,r) x C(r,2)
Infeasible: C(n,r) is usually huge
Fact
Human beings usually browse a storyboard in a sequential way
Optimal solution in a sequential sense
Maximize the sum of dissimilarities between sequentially adjacent images in a storyboard
How to Maximize the Dissimilarity Sum of the Extracted Images
Lattice-based representative frame extraction approach
Extract key component from temporal sequence
Dynamic programming can be applied
Example: how to select the best 4 images from an 8-image sequence
How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images
Original images: O(1), O(2), O(3), O(4), O(5), O(6), O(7), O(8)
Extracted images: E(1), E(2), E(3), E(4)
E(1) ← O(i); E(2) ← O(j); E(3) ← O(k); E(4) ← O(l); where i < j < k < l
[Figure: lattice plotting the Original Sequence (1 to 8) against the Extracted Sequence (1 to 4)]
Each legal left-to-right path represents a way to extract images
Each transition results in an adjacent dissimilarity
In this example, the adjacent dissimilarity sum of the extracted images is D[ O(1),O(3) ] + D[ O(3),O(4) ] + D[ O(4),O(7) ]
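A minimal dynamic-programming sketch of the lattice search: it finds the left-to-right path whose transitions have the largest adjacent dissimilarity sum. The scalar "features", the absolute-difference dissimilarity, and the function names are assumptions for illustration; the exhaustive search is included only to check the result on this small example.

```python
from itertools import combinations
import numpy as np

def adjacent_sum(dissim, chosen):
    """Sum of D[O(a), O(b)] over consecutive pairs of the chosen (0-based) indices."""
    return sum(dissim[a][b] for a, b in zip(chosen, chosen[1:]))

def select_frames(dissim, r):
    """Pick r of the n frames, in order, maximizing the adjacent dissimilarity sum."""
    n = len(dissim)
    # best[k, j]: best sum when the (k+1)-th extracted image is original frame j.
    best = np.full((r, n), -np.inf)
    back = np.zeros((r, n), dtype=int)
    best[0, :] = 0.0
    for k in range(1, r):
        for j in range(k, n):
            for i in range(k - 1, j):              # legal predecessor: i < j
                candidate = best[k - 1, i] + dissim[i][j]
                if candidate > best[k, j]:
                    best[k, j], back[k, j] = candidate, i
    j = int(np.argmax(best[r - 1]))
    total = float(best[r - 1, j])
    chosen = [j]
    for k in range(r - 1, 0, -1):                  # follow back-pointers to the first image
        j = int(back[k, j])
        chosen.append(j)
    return chosen[::-1], total

def exhaustive_best(dissim, r):
    """Best adjacent dissimilarity sum over all C(n, r) ordered subsets."""
    return max(adjacent_sum(dissim, c) for c in combinations(range(len(dissim)), r))

# Toy 8-frame sequence (assumed data): one scalar feature per frame.
features = np.array([0.0, 0.1, 5.0, 5.2, 0.2, 9.0, 9.1, 1.0])
dissim = np.abs(features[:, None] - features[None, :])
chosen, total = select_frames(dissim, 4)
print(chosen, total, exhaustive_best(dissim, 4))
```

The triple loop visits each lattice transition once, so the cost grows polynomially with n and r instead of with C(n,r).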
How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images
[Figure: two lattices, each plotting the Original Sequence (1 to 8) against the Extracted Sequence (1 to 4)]
Complexity Comparison
Select 4 images from an 8-image sequence
Lattice-based approach: 45 dissimilarity comparisons
Optimal approach: 420 dissimilarity comparisons
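One way to arrive at the two counts, sketched under assumptions: the optimal approach is taken to evaluate all C(r,2) pairwise dissimilarities for each of the C(n,r) subsets, and the lattice is taken to restrict the k-th extracted image to original positions k through n-r+k, with one comparison per legal transition. The function names are illustrative.

```python
from math import comb

def exhaustive_comparisons(n, r):
    """C(n, r) candidate subsets, each needing C(r, 2) pairwise dissimilarities."""
    return comb(n, r) * comb(r, 2)

def lattice_comparisons(n, r):
    """Number of legal transitions O(i) -> O(j), i < j, between adjacent lattice columns."""
    total = 0
    for k in range(1, r):                               # from E(k) to E(k+1), 1-based
        for i in range(k, n - r + k + 1):               # legal positions of E(k)
            total += len(range(i + 1, n - r + k + 2))   # legal positions of E(k+1) after i
    return total

print(lattice_comparisons(8, 4))     # 45
print(exhaustive_comparisons(8, 4))  # 420
```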
Segment-Based Solution
[Figure: segment-based lattice with axes labeled Original Sequence and Extracted Sequence; tick labels run 1 to 24 and 1 to 12]
Experimental Results
Incorporation of the Semantic Information
Conventional
The static summarized images are extracted in accordance with low-level visual features
Disadvantage
It is difficult to catch the main story without the support of semantically significant information
We present a semantic-based static video summarization
Each extracted image has an annotation
Related images are connected by edges
Using 'who', 'what', 'where', 'when' to list all extracted images
The Proposed Architecture
Shot annotation: mapping visual content to text
Concept expansion: provides an alternative view and dependency information when measuring the relation of two annotations
Relational graph construction
Concept Tree Construction
The concept tree denotes the dependency structure of the expanded words
Meronym
'Wheel' is a meronym of 'automobile'.
Holonym
'Tree' is a holonym of 'bark', of 'trunk' and of 'limb'.
Pencil used for Draw
Salesperson location of Store
Motorist capable of Drive
Eat breakfast Effect of Full stomach
Concept Tree Reorganization
Who: names of people, subset of "person" in WordNet
Where: "social group," "building," and "location" in WordNet
What: all the other words which do not belong to "who" and "where"
When: searching for time-period phrases
Relational Graph Construction -- Relation of Two Concept Trees
The relation of the two concept trees I and J: Relation(I, J), built from Relation_root(I, J) and Relation_child(I, J)
The relation of the two roots: Relation_root(I, J) = sent(I, J) / (the number of the sentences)
The relation of the two children: Relation_child(I, J), built from the per-type terms ident_type(I, J) / (the number of the pairs)
Relational Graph Construction -- Remove Unimportant Vertices and Edges
Remove edges with smaller weighting, i.e. lower relation
Remove vertices with smaller term frequency – inverse document frequency (TF-IDF)
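A small sketch of the two pruning rules, with several assumptions: the graph is stored as plain dictionaries, both cutoffs are user-chosen thresholds, and TF-IDF uses the common tf x log(N/df) form. How a TF-IDF score is assigned to each vertex is not specified on the slide, so the vertex scores below are given directly as toy values.

```python
import math

def tf_idf(term, document, corpus):
    """Term frequency in one annotation times log inverse document frequency."""
    tf = document.count(term) / len(document)
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log(len(corpus) / df) if df else 0.0

def prune_graph(vertex_scores, edge_weights, min_score, min_weight):
    """Drop low-scoring vertices, then drop weak edges and edges touching dropped vertices."""
    kept_vertices = {v for v, score in vertex_scores.items() if score >= min_score}
    kept_edges = {
        (a, b): w
        for (a, b), w in edge_weights.items()
        if w >= min_weight and a in kept_vertices and b in kept_vertices
    }
    return kept_vertices, kept_edges

# Toy annotations (assumed data): a word shared by every shot scores zero.
corpus = [["president", "speech"], ["president", "crowd"], ["president", "flag"]]
common = tf_idf("president", corpus[0], corpus)
rare = tf_idf("speech", corpus[0], corpus)

# Toy relational graph (assumed scores and weights).
vertex_scores = {"s1": 0.4, "s2": 0.05, "s3": 0.3}
edge_weights = {("s1", "s2"): 0.9, ("s1", "s3"): 0.6, ("s2", "s3"): 0.1}
kept_vertices, kept_edges = prune_graph(vertex_scores, edge_weights, min_score=0.1, min_weight=0.5)
print(sorted(kept_vertices), kept_edges)  # ['s1', 's3'] {('s1', 's3'): 0.6}
```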
The Final Relational Graph
Comparison with conventional storyboard
Conclusion
A novel speaker segmentation criterion is proposed: misclassification error rate
The unsupervised speaker segmentation problem is solved by supervised classification with label-forcing
The discriminant classifier allows the proposed approach to use a smaller scanning window size
The ability to detect short speaker change segments increases
Two new static video summarization approaches are proposed
Lattice-based representative frame extraction
Merely using low-level visual features
The summarization ratio is adjustable
Under a given summarization ratio, the dissimilarity sum from sequentially adjacent images is maximized
Concept-organized representative frame extraction
Incorporating semantic information
Mining the four kinds of concept entities: who, what, where, and when
People can efficiently grasp the comprehensive structure of the story and understand the main points of the contents
Future Work
Multimedia segmentation
Speech segmentation
Audio segmentation
Video segmentation
Multimedia summarization
Video summarization: static, dynamic
Speech summarization
Audio summarization
Thank all of you for your attendance!