
Multimedia Segmentation and Summarization

Dr. Jia-Ching Wang

Honorary Fellow, ECE Department, UW-Madison


Outline

Introduction

Speaker Segmentation

Video Summarization

Conclusion


What is Multimedia?

Image

Video

Speech

Audio

Text


Multimedia Everywhere

Fax machines: transmission of binary images

Digital cameras: still images

iPod / iPhone & MP3

Digital camcorders: video sequences with audio

Digital television broadcasting

Compact disk (CD), Digital video disk (DVD)

Personal video recorder (PVR, TiVo)

Images on the World Wide Web

Video streaming, video conferencing

Video on cell phones, PDAs

High-definition televisions (HDTV)

Medical imaging: X-ray, MRI, ultrasound

Military imaging: multi-spectral, satellite, microwave


What is Multimedia Content?

Multimedia content: the syntactic and semantic information inherent in digital material.

Example: text document

Syntactic content: chapter, paragraph

Semantic content: key words, subject, types of text document, etc.

Example: video document

Syntactic content: scene cuts, shots

Semantic content: motion, summary, index, caption, etc.


Why Do We Need to Know Multimedia Content?

Information processing, in terms of archiving, indexing, delivering, accessing, and other processing, requires in-depth knowledge of content to optimize performance.


How to Know Multimedia Content?

Multimedia content analysis: the computerized understanding of the semantic/syntactic content of a multimedia document

Multimedia content analysis usually involves

Segmentation: segmenting the multimedia document into units

Classification: classifying each unit into a predefined type

Annotation: annotating the multimedia document

Summarization: summarizing the multimedia document


Multimedia Segmentation and Summarization

Multimedia segmentation

Syntactic content

Multimedia summarization

Semantic/syntactic content

The result of the temporal segmentation can benefit the video summarization


Multimedia Segmentation

Image segmentation

Video segmentation: scene change, shot change

Audio segmentation: audio class change

Speech segmentation: speaker change detection

Text segmentation: word segmentation, sentence segmentation, topic change detection


Multimedia Summarization

Image summarization: region of interest

Video summarization: storyboard, highlight

Audio summarization: main theme in music, chorus in song, event sound in environmental sound stream

Speech summarization: speech abstract

Text summarization: abstract


What is Speaker Segmentation?

It can also be called speaker change detection (SCD)

Assumption: no two speaker streams overlap

[Figure: an audio stream divided into consecutive segments labeled speaker1, speaker2, speaker3]


Supervised vs. Unsupervised SCD

Supervised manner: acoustic data are made up of distinct speakers who are known a priori

Recognition based solution

Unsupervised manner: no prior knowledge about the number and identities of speakers

Metric-based criterion

Model selection-based criterion


Supervised Speaker Segmentation -- Gaussian Mixture Model

Gaussian mixture modeling (GMM)

$$p(x) = \sum_{i=1}^{M} c_i \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) \right\}$$

x is a d-dimensional random vector. $c_i$, i=1,…,M, is the mixture weight; $\mu_i$, the mean vector; $\Sigma_i$, the covariance matrix.

$$D_t = \arg\max_{d = 1, 2, \ldots, D} P_d(x_i)$$

The incoming audio stream is classified into one of D classes in a maximum likelihood manner at time t
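The maximum likelihood decision above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the covariance matrices are taken to be diagonal to keep the code short, the two speaker models are made-up numbers, and the function names (`log_gaussian_diag`, `gmm_log_likelihood`, `classify`) are hypothetical, not taken from the slides.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption: diagonal)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + quad)

def gmm_log_likelihood(x, gmm):
    """log p(x) for a GMM given as a list of (weight c_i, mean, variances)."""
    terms = [math.log(c) + log_gaussian_diag(x, mu, var) for c, mu, var in gmm]
    m = max(terms)  # log-sum-exp trick for numerical stability
    return m + math.log(sum(math.exp(t - m) for t in terms))

def classify(x, class_gmms):
    """Return the index of the class whose GMM gives the highest likelihood."""
    scores = [gmm_log_likelihood(x, g) for g in class_gmms]
    return max(range(len(scores)), key=scores.__getitem__)

# Two hypothetical speaker models, each with M = 2 components in d = 2 dimensions.
speaker_a = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
speaker_b = [(0.7, [5.0, 5.0], [1.0, 1.0]), (0.3, [6.0, 4.0], [2.0, 2.0])]

print(classify([0.2, 0.4], [speaker_a, speaker_b]))
print(classify([5.5, 4.8], [speaker_a, speaker_b]))
```

In a supervised segmenter of this kind, a speaker change would be reported wherever the winning class index changes between consecutive frames or windows.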


Supervised Speaker Segmentation -- Hidden Markov Model


Unsupervised Speaker Segmentation -- Sliding Window Strategy & Detection Criterion

Metric-based criterion (The dissimilarities between the acoustic feature vectors are measured)

Kullback-Leibler distance

Mahalanobis distance

Bhattacharyya distance

Model selection-based criterion

Bayesian information criterion (BIC)
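As an illustration of a metric-based criterion applied to a sliding window pair, the sketch below fits one Gaussian to each of two adjacent windows and measures their symmetric Kullback-Leibler distance at every candidate boundary. The diagonal-covariance simplification, the window length, the toy feature stream, and all function names are assumptions made for this example.

```python
import math

def fit_diag_gaussian(frames):
    """ML mean and per-dimension variance of a window (variance floored)."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(d)]
    var = [max(sum((f[k] - mean[k]) ** 2 for f in frames) / n, 1e-8)
           for k in range(d)]
    return mean, var

def kl_diag(m0, v0, m1, v1):
    """KL(N0 || N1) for Gaussians with diagonal covariances."""
    return 0.5 * sum(a / b + (q - p) ** 2 / b - 1.0 + math.log(b / a)
                     for p, a, q, b in zip(m0, v0, m1, v1))

def symmetric_kl(left, right):
    """Symmetric KL distance between the Gaussians fitted to two windows."""
    m0, v0 = fit_diag_gaussian(left)
    m1, v1 = fit_diag_gaussian(right)
    return kl_diag(m0, v0, m1, v1) + kl_diag(m1, v1, m0, v0)

def distance_curve(frames, win):
    """Distance between the adjacent windows at each candidate boundary t."""
    return {t: symmetric_kl(frames[t - win:t], frames[t:t + win])
            for t in range(win, len(frames) - win + 1)}

# Toy 1-D feature stream: 20 frames of one "speaker", then 20 of another.
frames = ([[0.0 if i % 2 == 0 else 0.2] for i in range(20)]
          + [[3.0 if i % 2 == 0 else 3.2] for i in range(20)])
curve = distance_curve(frames, win=10)
print(max(curve, key=curve.get))  # boundary with the largest distance
```

A detector built on such a curve still needs a threshold on the distance to decide which peaks count as speaker changes.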


Bayesian Information Criterion

Model selection: choose one among a set of candidate models $M_i$, i=1,2,...,m, and corresponding model parameters $\theta_i$ to represent a given data set D = (D1, D2, …, DN).

Model posterior probability

$$P(M_i \mid D) = \frac{P(D \mid M_i)\, P(M_i)}{P(D)} \propto P(D \mid M_i)$$

Bayesian information criterion: maximized log data likelihood for the given model with model complexity penalty

Bayesian information criterion of model $M_i$

$$\mathrm{BIC}(M_i) = \log P(D \mid M_i, \hat{\theta}_i) - \frac{d_i}{2} \log N \approx \log P(M_i \mid D)$$

where $d_i$ is the number of independent parameters in the model parameter set $\theta_i$

For a single d-dimensional Gaussian fitted to N feature vectors, the parameter count is $d + \frac{1}{2} d(d+1)$, giving

$$\mathrm{BIC} = -\frac{dN}{2} \log 2\pi - \frac{N}{2} \log |\hat{\Sigma}| - \frac{dN}{2} - \frac{1}{2}\left(d + \frac{1}{2} d(d+1)\right) \log N$$


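Unsupervised Segmentation Using Bayesian Information Criterion

First model

$$M_1: x_1, x_2, \ldots, x_N \sim N(\mu, \Sigma)$$

$$\mathrm{BIC}(M_1) = -\frac{dN}{2} \log 2\pi - \frac{N}{2} \log |\hat{\Sigma}| - \frac{dN}{2} - \frac{1}{2}\left(d + \frac{1}{2} d(d+1)\right) \log N$$

Second model

$$M_2: x_1, x_2, \ldots, x_b \sim N(\mu_1, \Sigma_1); \quad x_{b+1}, x_{b+2}, \ldots, x_N \sim N(\mu_2, \Sigma_2)$$

$$\mathrm{BIC}(M_2) = -\frac{dN}{2} \log 2\pi - \frac{b}{2} \log |\hat{\Sigma}_1| - \frac{N-b}{2} \log |\hat{\Sigma}_2| - \frac{dN}{2} - \left(d + \frac{1}{2} d(d+1)\right) \log N$$

Bayesian information criterion

$$\Delta\mathrm{BIC}(b) = \mathrm{BIC}(M_2) - \mathrm{BIC}(M_1)$$

A positive $\Delta\mathrm{BIC}(b)$ favors the two-Gaussian model, i.e. a speaker change at frame b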
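A minimal sketch of the BIC test for one candidate change point b follows. It assumes one full-covariance Gaussian per segment with maximum likelihood estimates, so the constant terms cancel and ΔBIC(b) = BIC(M2) − BIC(M1) reduces to (N/2) log|Σ̂| − (b/2) log|Σ̂1| − ((N−b)/2) log|Σ̂2| − ½(d + ½d(d+1)) log N. The function names and the toy data are hypothetical.

```python
import math

def log_det_cov(frames):
    """log|Sigma_hat| of the ML covariance estimate, via a Cholesky factor."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(d)]
    cov = [[sum((f[a] - mean[a]) * (f[b] - mean[b]) for f in frames) / n
            for b in range(d)] for a in range(d)]
    chol = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = cov[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            # math.sqrt raises ValueError if the covariance is singular
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(d))

def delta_bic(frames, b):
    """BIC(two Gaussians split at b) minus BIC(one Gaussian)."""
    n, d = len(frames), len(frames[0])
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * math.log(n)
    return (0.5 * n * log_det_cov(frames)
            - 0.5 * b * log_det_cov(frames[:b])
            - 0.5 * (n - b) * log_det_cov(frames[b:])
            - penalty)

# Toy 2-D data: a repeating square pattern, optionally shifted after frame 20.
square = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
same = [square[i % 4] for i in range(40)]
diff = same[:20] + [[x + 5.0, y + 5.0] for x, y in same[20:]]

print(delta_bic(same, 20) > 0)  # one speaker: the penalty dominates
print(delta_bic(diff, 20) > 0)  # two speakers: the split is preferred
```

Each segment needs enough frames for a non-singular covariance estimate, which is one reason short speaker segments are hard to detect with this criterion.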


Disadvantages of Conventional Unsupervised Speaker Change Detection

Disadvantage:

For metric-based methods, it’s not easy to decide a suitable threshold

For BIC, it’s not easy to detect speaker segments shorter than 2 seconds


Proposed Method -- Misclassification Error Rate

Sliding window pairs

Feature vector distribution: same speaker vs. different speakers


Mathematical Analysis


Discussion

Generative and discriminant classifiers are both applicable

Key point: discriminant classifiers have the benefit that less data is required

We can have a smaller scanning window size

The ability to detect short speaker change segments increases


Speaker Segmentation Using Misclassification Error Rate

Steps

Preprocessing

Framing, Feature extraction

Hypothesized speaker change point selection

Forcing 2-class labels

Training a discriminant hyperplane

Inside data recognition & calculating misclassification error rate

Accept/reject the hypothesized speaker change point

Significance

The unsupervised speaker segmentation problem is solved by supervised classification

[Figure: flowchart. Features are extracted from the windows on either side of the hypothesized speaker change point and tagged +1 and -1; a discriminant classifier is trained on the tagged data; the misclassification error rate is used to accept or reject the hypothesized speaker change point.]
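The steps on this slide can be sketched as follows. This is a simplified illustration, not the presenter's implementation: the discriminant classifier here is a stand-in (a hyperplane bisecting the two class means), the acceptance threshold of 0.3 is an arbitrary assumption, and the function names and toy data are hypothetical.

```python
def mean_vector(frames):
    n = len(frames)
    return [sum(f[k] for f in frames) / n for k in range(len(frames[0]))]

def train_hyperplane(pos, neg):
    """Stand-in discriminant: hyperplane bisecting the two class means."""
    mp, mn = mean_vector(pos), mean_vector(neg)
    w = [a - b for a, b in zip(mp, mn)]
    mid = [(a + b) / 2 for a, b in zip(mp, mn)]
    bias = -sum(wi * mi for wi, mi in zip(w, mid))
    return w, bias

def predict(w, bias, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias >= 0 else -1

def misclassification_error_rate(left, right):
    """Force labels +1 (left) and -1 (right), train, then test on the same data."""
    w, bias = train_hyperplane(left, right)
    errors = (sum(predict(w, bias, x) != 1 for x in left)
              + sum(predict(w, bias, x) != -1 for x in right))
    return errors / (len(left) + len(right))

def is_change_point(frames, t, win, threshold=0.3):
    """Accept the hypothesized change point t if the two windows separate well."""
    left, right = frames[t - win:t], frames[t:t + win]
    return misclassification_error_rate(left, right) < threshold

# Toy 2-D features: a repeating pattern, shifted after frame 20.
square = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
frames = ([square[i % 4] for i in range(20)]
          + [[x + 5.0, y + 5.0] for x, y in (square[i % 4] for i in range(20))])

print(is_change_point(frames, 20, win=8))  # windows from different speakers
print(is_change_point(frames, 10, win=8))  # windows from the same speaker
```

The intuition is that forced labels on same-speaker data cannot be learned, so the error rate stays near chance, while windows from different speakers are separable and give a low error rate.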


Experimental Results

Method     F-score   Precision   Recall
Proposed   71.8      70.2        81.3
BIC        63.3      54.4        75.7


Video Summarization

Dynamic vs. Static Video Summarization

Dynamic video summarization

Sport highlight, movie trailer

Static video summarization

Storyboard

– Visual-based approach

– Incorporation of the semantic Information


Static Video Summarization -- Visual Based Approach

Example

Problem

Is the summarization ratio adjustable?

How to generate an effective storyboard under a given summarization ratio?


How to Generate Effective Storyboard

Question: Assume there are n frames and the summarization ratio is r/n. How do we select the best r frames?

Complexity:

There are C(n,r) different choices


How to Generate Effective Storyboard

From a visual viewpoint

Most visually distinct frames should be extracted

Dissimilarity between two frames is measured by low level visual features

How to select the best r frames from n frames

Solution: maximize the overall pairwise dissimilarities

Complexity: C(n,r) x C(r,2)

Infeasible: C(n,r) is usually huge

Fact

Human beings usually browse a storyboard in a sequential way

Optimal solution in a sequential sense

Maximize the sum of dissimilarities between sequentially adjacent images in a storyboard


How to Maximize the Dissimilarity Sum of the Extracted Images

Lattice-based representative frame extraction approach

Extract key component from temporal sequence

Dynamic programming can be applied

Example: how to select the best 4 images from an 8-image sequence


How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images

Original images: O(1), O(2), O(3), O(4), O(5), O(6), O(7), O(8)

Extracted images: E(1), E(2), E(3), E(4)

E(1) ← O(i); E(2) ← O(j); E(3) ← O(k); E(4) ← O(l); where i < j < k < l

[Figure: lattice with the original sequence (1 to 8) on one axis and the extracted sequence (1 to 4) on the other]

Each legal left-to-right path represents a way to extract images

Each transition results in an adjacent dissimilarity

In this example, the adjacent dissimilarity sum of the extracted images is D[ O(1),O(3) ] + D[ O(3),O(4) ] + D[ O(4),O(7) ]
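The lattice search can be sketched with dynamic programming: the best score for choosing O(j) as the k-th extracted image is the best score over all earlier choices O(i), i < j, for the (k-1)-th image plus D[O(i), O(j)]. The code below is a minimal sketch under assumptions: frames are represented by single numbers, the dissimilarity is an absolute difference, and the names `select_frames` and `brightness` are hypothetical.

```python
def select_frames(frames, r, dissimilarity):
    """Pick r frames, in order, maximizing the sum of adjacent dissimilarities."""
    n = len(frames)
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= len(frames)")
    neg_inf = float("-inf")
    best = [[neg_inf] * n for _ in range(r)]  # best[k][j]: E(k+1) is O(j+1)
    back = [[-1] * n for _ in range(r)]       # back-pointers for the path
    for j in range(n - r + 1):
        best[0][j] = 0.0
    for k in range(1, r):
        # Legal lattice nodes: the k-th pick must leave room for the rest.
        for j in range(k, n - r + k + 1):
            for i in range(k - 1, j):
                score = best[k - 1][i] + dissimilarity(frames[i], frames[j])
                if score > best[k][j]:
                    best[k][j] = score
                    back[k][j] = i
    j = max(range(r - 1, n), key=lambda idx: best[r - 1][idx])
    total = best[r - 1][j]
    path = [j]
    for k in range(r - 1, 0, -1):
        j = back[k][j]
        path.append(j)
    path.reverse()
    return path, total

# Toy example: each "frame" is one brightness value.
brightness = [0, 1, 9, 2, 8, 3, 7, 4]
path, total = select_frames(brightness, 4, lambda a, b: abs(a - b))
print(path, total)  # zero-based indices of the chosen frames and their score
```

For 8 frames and 4 picks this search evaluates the dissimilarity 45 times (15 per transition), instead of enumerating every subset.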


How to Maximize the Adjacent Dissimilarity Sum of the Extracted Images

[Figures: two lattices, each with the original sequence (1 to 8) on one axis and the extracted sequence (1 to 4) on the other]


Complexity Comparison

Select 4 images from an 8-image sequence

Lattice-based approach: 45 dissimilarity comparisons

Optimal approach: 420 dissimilarity comparisons
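Both counts can be reproduced with a short calculation. The exhaustive count follows the C(n,r) x C(r,2) formula given earlier. The lattice count is an assumption about how the 45 arises: each of the r-1 transitions considers every ordered pair of nodes inside a band of n-r+1 positions. The function names are hypothetical.

```python
from math import comb

def exhaustive_comparisons(n, r):
    """Every r-subset of n frames, all pairwise dissimilarities within it."""
    return comb(n, r) * comb(r, 2)

def lattice_comparisons(n, r):
    """Assumed lattice count: r-1 transitions, each over a band of width n-r+1."""
    width = n - r + 1
    return (r - 1) * width * (width + 1) // 2

print(exhaustive_comparisons(8, 4))  # 70 subsets x 6 pairs
print(lattice_comparisons(8, 4))     # 3 transitions x 15 pairs
```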


Segment-Based Solution

[Figure: segment-based lattice with the original sequence (1 to 24) and the extracted sequence (1 to 12)]


Experimental Results


Incorporation of the Semantic Information

Conventional

The static summarized images are extracted in accordance with low level visual features

Disadvantage

It’s difficult to catch the main story without the support of semantically significant information

We present a semantic-based static video summarization approach

Each extracted image has an annotation

Related images are connected by edges

Using ‘who’ ‘what’ ‘where’ ‘when’ to list all extracted images


The Proposed Architecture

Shot annotation: mapping visual content to text

Concept expansion: it provides an alternative view and dependency information while measuring the relation of two annotations

Relational graph construction


Concept Tree Construction

The concept tree denotes the dependent structure of the expanded words

Meronym

'Wheel' is a meronym of 'automobile'.

Holonym

'Tree' is a holonym of 'bark', of 'trunk' and of 'limb'

Pencil used for Draw

Salesperson location of Store

Motorist capable of Drive

Eat breakfast Effect of Full stomach


Concept Tree Reorganization

Who: names of people, subset of "person" in WordNet

Where: "social group," "building," and "location" in WordNet

What: all the other words which do not belong to "who" and "where"

When: searching for time-period phrases


Relational Graph Construction -- Relation of Two Concept Trees

The relation of the two concept trees

The relation of the two roots

The relation of the two children

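[Equations garbled in extraction. Recoverable structure: Relation(I, J) is computed from Relation_root(I, J) and Relation_child(I, J). Relation_root(I, J) is based on sent(I, J), normalized by the number of the sentences. Relation_child(I, J) is based on per-type terms ident_type(I, J), normalized by the number of the pairs.]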


Relational Graph Construction -- Remove Unimportant Vertices and Edges

Remove edges with smaller weighting, i.e. lower relation

Remove vertices with smaller term frequency – inverse document frequency (TF-IDF)
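The two pruning rules can be sketched as a simple filter over a weighted graph. Everything specific here is an assumption for illustration: the TF-IDF variant (term frequency times log of inverse document frequency), the thresholds, the toy annotations, and the names `tf_idf` and `prune_graph`.

```python
import math

def tf_idf(term, doc, corpus):
    """One common TF-IDF variant: (count / doc length) * log(N / doc frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

def prune_graph(vertices, edges, min_relation, min_tfidf):
    """Drop low-scoring vertices, then weak edges and edges that lost an endpoint."""
    kept_vertices = {v for v, score in vertices.items() if score >= min_tfidf}
    kept_edges = {(a, b): w for (a, b), w in edges.items()
                  if w >= min_relation and a in kept_vertices and b in kept_vertices}
    return kept_vertices, kept_edges

# Hypothetical shot annotations, treated as small documents.
corpus = [["car", "road", "driver"], ["car", "wheel"], ["tree", "road"]]
print(tf_idf("wheel", corpus[1], corpus))

# Hypothetical vertex scores and relation weights between annotated shots.
vertices = {"shot1": 0.55, "shot2": 0.20, "shot3": 0.40}
edges = {("shot1", "shot2"): 0.9, ("shot1", "shot3"): 0.6, ("shot2", "shot3"): 0.1}
kept_vertices, kept_edges = prune_graph(vertices, edges,
                                        min_relation=0.5, min_tfidf=0.3)
print(sorted(kept_vertices), kept_edges)
```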


The Final Relational Graph

Comparison with conventional storyboard


Conclusion

A novel speaker segmentation criterion is proposed: misclassification error rate

The unsupervised speaker segmentation problem is solved by supervised classification with label-forcing

The discriminant classifier allows the proposed approach to use a smaller scanning window size

The ability to detect short speaker change segments increases

Two new static video summarization approaches are proposed

Lattice-based representative frame extraction

Merely using low level visual features

The summarization ratio is adjustable

Under a given summarization ratio, the dissimilarity sum from sequential adjacent images is maximized

Concept-organized representative frame extraction

Incorporating semantic information

Mining the four kinds of concept entities: who, what, where, and when

People can efficiently grasp the comprehensive structure of the story and understand the main points of the contents


Future Work

Multimedia segmentation

Speech segmentation

Audio segmentation

Video segmentation

Multimedia summarization

Video summarization: static, dynamic

Speech summarization

Audio summarization

Thank all of you for your attendance!
