CVHCI-Lecture-Shot boundary TV genre v3

Content-based Image and Video Retrieval, Lecture, Summer Semester (SS) 2009: Shot Boundary Detection & TV Genre Classification. Hazım Kemal Ekenel, [email protected]; Rainer Stiefelhagen, [email protected]; CV-HCI Research Group: http://isl.ira.uka.de/cvhci; 18.05.2009


Page 1: CVHCI-Lecture-Shot boundary TV genre v3

Content-based Image and Video Retrieval

Lecture, Summer Semester (SS) 2009

Shot Boundary Detection & TV Genre Classification

Hazım Kemal Ekenel, [email protected]

Rainer Stiefelhagen, [email protected]

CV-HCI Research Group: http://isl.ira.uka.de/cvhci

18.05.2009

Page 2: CVHCI-Lecture-Shot boundary TV genre v3

Outline

Shot Boundary Detection

Definition

Types of shot boundary

Detection methods

TV Genre Classification

Features

Sample systems


Page 3: CVHCI-Lecture-Shot boundary TV genre v3

Shot Boundary Detection


Page 4: CVHCI-Lecture-Shot boundary TV genre v3

Video


Page 5: CVHCI-Lecture-Shot boundary TV genre v3

Digital Video Processing

Hugely important technology for archiving, content analysis, the Internet, etc.

Need for tools to support automatic browsing and retrieval of large amounts of broadcast video.

Some jargon:

A Frame is 1/25 of a second of video (for PAL).

A Shot is a sequence of frames captured by a single camera in a single continuous action.

A Shot Boundary is the transition between two shots. It can be abrupt (cut) or gradual (fade, dissolve, wipe, morph).

A Scene is a logical grouping of shots into a semantic unit.

Page 6: CVHCI-Lecture-Shot boundary TV genre v3

Shots and Scenes

[Figure: a video sequence is divided into scenes; each scene groups several shots; each shot is a run of frames (F), and shot boundaries separate consecutive shots.]

Page 7: CVHCI-Lecture-Shot boundary TV genre v3

Types of Transitions

Identity class: Neither of the two shots involved is modified, and no additional edit frames are added. Hard cuts.

Spatial class: Some spatial transformations are applied to the two shots involved. Wipe, page turn, slide, and iris effects.

Chromatic class: Some color space transformations are applied to the two shots involved. Fade and dissolve effects.

Spatio-Chromatic class: Some spatial as well as some color space transformations are applied to the two shots involved. Morphing effects.

Page 8: CVHCI-Lecture-Shot boundary TV genre v3

Types of Transitions

[Figure: example frame sequences for a cut, a fade out/in, a dissolve, and a wipe.]

Page 9: CVHCI-Lecture-Shot boundary TV genre v3

Why do we need Shot Boundary Detection ?

Shots are the basic units of a video. They are required for further video analysis, such as person tracking, identification, and high-level feature detection.

They provide cues about high-level semantics.

In video production each transition type is chosen carefully to support the content and context.

For example, dissolves occur much more often in feature films and documentaries than in news, sports and shows. The opposite is true for wipes.

Page 10: CVHCI-Lecture-Shot boundary TV genre v3

Hard Cuts

The most common transition type.

Direct concatenation of two shots $S_1(x,y,t)$ and $S_2(x,y,t)$:

$$S(x,y,t) = \bigl(1 - u_{-1}(t - t_{hardcut})\bigr)\, S_1(x,y,t) + u_{-1}(t - t_{hardcut})\, S_2(x,y,t)$$

$t_{hardcut}$: Time stamp of the first frame after the hard cut

$u_{-1}(t)$: The unit step function

Produces a temporal visual discontinuity.

How to measure the discontinuity?


Page 11: CVHCI-Lecture-Shot boundary TV genre v3

Features to Measure Visual Discontinuity

Pixel differences

Statistical differences

Histograms

Compression differences

Edge differences

Motion vectors


Page 12: CVHCI-Lecture-Shot boundary TV genre v3

Pixel differences

Two common approaches:

(1) Calculate the pixel-to-pixel difference and compare the sum with a threshold

(2) Count the number of pixels that change in value by more than some threshold and compare the total number against a second threshold

Sensitive to camera & object motion! Countermeasures: use average filtering; use motion compensation
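The two approaches can be sketched in a few lines of dependency-free Python. This is only an illustration: frames are assumed to be small nested lists of gray values, and the function names, the toy frames and all three thresholds are hypothetical, not taken from the lecture.

```python
def sum_abs_diff(frame_a, frame_b):
    """Approach (1): sum of absolute pixel-to-pixel differences."""
    return sum(abs(a - b)
               for row_a, row_b in zip(frame_a, frame_b)
               for a, b in zip(row_a, row_b))


def changed_pixel_count(frame_a, frame_b, pixel_threshold):
    """Approach (2): number of pixels that change by more than pixel_threshold."""
    return sum(1
               for row_a, row_b in zip(frame_a, frame_b)
               for a, b in zip(row_a, row_b)
               if abs(a - b) > pixel_threshold)


# Tiny 2x3 grayscale frames (0..255); the values are made up for illustration.
shot1_frame1 = [[10, 12, 11], [200, 198, 201]]
shot1_frame2 = [[11, 12, 10], [199, 198, 203]]   # same shot, slight noise
shot2_frame1 = [[90, 95, 92], [30, 28, 33]]      # first frame after a cut

SUM_THRESHOLD = 100      # hypothetical threshold for approach (1)
PIXEL_THRESHOLD = 20     # hypothetical per-pixel threshold for approach (2)
COUNT_THRESHOLD = 3      # hypothetical second threshold for approach (2)

for prev, curr in [(shot1_frame1, shot1_frame2), (shot1_frame2, shot2_frame1)]:
    cut_by_sum = sum_abs_diff(prev, curr) > SUM_THRESHOLD
    cut_by_count = changed_pixel_count(prev, curr, PIXEL_THRESHOLD) > COUNT_THRESHOLD
    print(cut_by_sum, cut_by_count)
```

Both measures compare pixels at identical positions, which is why a moving camera or object can push them over the threshold without any cut.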


Page 13: CVHCI-Lecture-Shot boundary TV genre v3

Camera Motion & Object Motion


Page 14: CVHCI-Lecture-Shot boundary TV genre v3

Absolute Pixel Differences with & w/o Motion Compensation

[Figure: frame 66 and frame 69, with their absolute difference without motion compensation and with motion compensation.]

Page 15: CVHCI-Lecture-Shot boundary TV genre v3

Motion Estimation

Adjacent frames are similar and changes are due to object or camera motion


Page 16: CVHCI-Lecture-Shot boundary TV genre v3

Optical Flow

Assumptions:

• Color constancy: a point in frame t-1 looks the same in frame t

– For grayscale images, this is brightness constancy

• Small motion: points do not move very far

[Figure: frame t-1 and frame t]

Page 17: CVHCI-Lecture-Shot boundary TV genre v3

Optical Flow Constraint Equation

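A point at $(x, y)$ at time $t$ moves to $(x + u\,\delta t,\; y + v\,\delta t)$ at time $t + \delta t$.

Optical flow: velocities $(u, v)$

Displacement: $(\delta x, \delta y) = (u\,\delta t,\; v\,\delta t)$

• Assume brightness of patch remains same in both images:

$$I(x + u\,\delta t,\; y + v\,\delta t,\; t + \delta t) = I(x, y, t)$$

• Assume small motion (Taylor expansion of the left-hand side up to first order):

$$I(x, y, t) + \delta x\,\frac{\partial I}{\partial x} + \delta y\,\frac{\partial I}{\partial y} + \delta t\,\frac{\partial I}{\partial t} = I(x, y, t)$$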

Page 18: CVHCI-Lecture-Shot boundary TV genre v3

Optical Flow Constraint Equation

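$$\delta x\,\frac{\partial I}{\partial x} + \delta y\,\frac{\partial I}{\partial y} + \delta t\,\frac{\partial I}{\partial t} = 0$$

Divide by $\delta t$ and take the limit $\delta t \to 0$:

$$\frac{dx}{dt}\,\frac{\partial I}{\partial x} + \frac{dy}{dt}\,\frac{\partial I}{\partial y} + \frac{\partial I}{\partial t} = 0$$

Constraint equation:

$$I_x u + I_y v + I_t = 0$$

NOTE: $(u, v)$ must lie on a straight line in the $(u, v)$ plane.

We can compute $I_x, I_y, I_t$ using gradient operators!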
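One constraint per pixel leaves two unknowns, so a common way to get a single (u, v) is to stack the constraint over many pixels and solve it in the least-squares sense. The sketch below does this for one global translation, as an assumption-laden illustration rather than the lecture's method: central differences stand in for the gradient operators, the frame difference stands in for I_t, and the synthetic test pattern and the function name `estimate_global_flow` are hypothetical.

```python
import math


def estimate_global_flow(frame0, frame1):
    """Least-squares solution of I_x*u + I_y*v + I_t = 0 over all interior pixels."""
    sxx = sxy = syy = sxt = syt = 0.0
    height, width = len(frame0), len(frame0[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            ix = (frame0[y][x + 1] - frame0[y][x - 1]) / 2.0   # spatial gradient in x
            iy = (frame0[y + 1][x] - frame0[y - 1][x]) / 2.0   # spatial gradient in y
            it = frame1[y][x] - frame0[y][x]                   # temporal gradient
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("gradients do not constrain both flow components")
    # Normal equations: [sxx sxy; sxy syy] [u v]^T = [-sxt -syt]^T
    u = (-sxt * syy + syt * sxy) / det
    v = (-syt * sxx + sxt * sxy) / det
    return u, v


def pattern(x, y):
    """Smooth synthetic image, chosen only so that both gradients vary."""
    return math.sin(0.3 * x) + math.cos(0.2 * y)


TRUE_U, TRUE_V = 0.2, -0.1   # small motion, in pixels per frame
SIZE = 20
frame0 = [[pattern(x, y) for x in range(SIZE)] for y in range(SIZE)]
frame1 = [[pattern(x - TRUE_U, y - TRUE_V) for x in range(SIZE)] for y in range(SIZE)]

u, v = estimate_global_flow(frame0, frame1)
print(round(u, 2), round(v, 2))
```

The estimate is only approximate because the derivation keeps first-order terms; with larger displacements the small-motion assumption breaks down and the error grows.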

Page 19: CVHCI-Lecture-Shot boundary TV genre v3

A sample optical flow output

[Figure: image I and a rotated copy of image I; their absolute difference without motion compensation and with motion compensation; illustration of the optical flow between them.]

Page 20: CVHCI-Lecture-Shot boundary TV genre v3

Motion Estimation Methods

Feature/Region Matching: Motion is estimated by correlating/matching features (e.g., edges) or regional intensities (e.g., block of pixels) from one frame to another.

Block Matching

Phase Correlation

Gradient-based Methods: Motion is estimated by using spatial and temporal changes (gradients) of the image intensity distribution and the displacement vector field.

Lucas-Kanade
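As an illustration of the region-matching family, here is a minimal exhaustive block-matching sketch. It is a sketch under stated assumptions: the sum of absolute differences (SAD) is used as the matching cost, and the function names, block size, search range and test frames are all hypothetical.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(abs(frame_a[ay + j][ax + i] - frame_b[by + j][bx + i])
               for j in range(size) for i in range(size))


def match_block(prev, curr, x, y, size, search):
    """Displacement (dx, dy) within +/-search that best maps the block at
    (x, y) in prev onto curr, found by exhaustive search."""
    height, width = len(curr), len(curr[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            nx, ny = x + dx, y + dy
            if nx < 0 or ny < 0 or nx + size > width or ny + size > height:
                continue   # candidate block would leave the frame
            cost = sad(prev, curr, x, y, nx, ny, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]


# Synthetic 8x8 frames: curr is prev shifted by 2 pixels right and 1 pixel down.
prev = [[10 * y + x for x in range(8)] for y in range(8)]
curr = [[prev[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(8)]
        for y in range(8)]

motion_vector = match_block(prev, curr, x=2, y=2, size=3, search=2)
print(motion_vector)
```

The resulting vector can be used to shift the block before differencing, which is the motion compensation mentioned for pixel differences.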


Page 21: CVHCI-Lecture-Shot boundary TV genre v3

Statistical differences

Divide image into regions

Compute statistical measures from these regions (e.g., mean, standard deviation …)

Compare the obtained statistical measures
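The three steps can be sketched as follows. This is a minimal illustration assuming a regular grid of regions, mean and standard deviation as the measures, and a single hypothetical tolerance; none of these choices is prescribed by the slide.

```python
from statistics import mean, pstdev


def region_stats(frame, rows, cols):
    """Split the frame into rows x cols regions and return (mean, std) per region."""
    height, width = len(frame), len(frame[0])
    stats = []
    for r in range(rows):
        for c in range(cols):
            pixels = [frame[y][x]
                      for y in range(r * height // rows, (r + 1) * height // rows)
                      for x in range(c * width // cols, (c + 1) * width // cols)]
            stats.append((mean(pixels), pstdev(pixels)))
    return stats


def changed_regions(frame_a, frame_b, rows, cols, tolerance):
    """Count regions whose mean or standard deviation differs by more than tolerance."""
    pairs = zip(region_stats(frame_a, rows, cols), region_stats(frame_b, rows, cols))
    return sum(1 for (mean_a, std_a), (mean_b, std_b) in pairs
               if abs(mean_a - mean_b) > tolerance or abs(std_a - std_b) > tolerance)


# Made-up 4x4 frames; only the bottom-right quarter differs.
frame_a = [[10, 10, 200, 200], [10, 10, 200, 200], [50, 50, 50, 50], [50, 50, 50, 50]]
frame_b = [[10, 10, 200, 200], [10, 10, 200, 200], [50, 50, 120, 120], [50, 50, 120, 120]]

print(changed_regions(frame_a, frame_b, rows=2, cols=2, tolerance=5))
```

A boundary decision could then compare the number of changed regions against a further threshold, which is again an assumption of this sketch.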


Page 22: CVHCI-Lecture-Shot boundary TV genre v3

Histogram comparison

The most common method used to detect shot boundaries.

Provides a good trade-off between accuracy and speed.

The simplest histogram method computes gray-level or color histograms of the two images. If the bin-wise difference between the two histograms is above a threshold, a shot boundary is assumed.

Several extensions are available: using regions, region weighting, different distance metrics …
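The simplest variant can be sketched as below. The bin count, the threshold, the sum of absolute bin differences as the distance, and the toy frames are assumptions made for the example.

```python
def gray_histogram(frame, bins=8, max_value=256):
    """Gray-level histogram with equal-width bins."""
    hist = [0] * bins
    for row in frame:
        for pixel in row:
            hist[pixel * bins // max_value] += 1
    return hist


def histogram_difference(hist_a, hist_b):
    """Sum of absolute bin-wise differences."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))


def detect_cuts(frames, threshold, bins=8):
    """Indices i where a boundary is assumed between frames[i - 1] and frames[i]."""
    hists = [gray_histogram(frame, bins) for frame in frames]
    return [i for i in range(1, len(hists))
            if histogram_difference(hists[i - 1], hists[i]) > threshold]


dark_a = [[10, 20], [30, 25]]
dark_b = [[20, 10], [25, 30]]        # same content, pixels moved around
bright = [[200, 220], [210, 250]]    # a different shot

frames = [dark_a, dark_b, bright, bright]
cuts = detect_cuts(frames, threshold=4)
print(cuts)
```

Because a histogram ignores pixel positions, `dark_a` and `dark_b` produce identical histograms, which illustrates why this measure is less sensitive to motion than direct pixel differences.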


Page 23: CVHCI-Lecture-Shot boundary TV genre v3

Compression differences

Use differences in the discrete cosine transform (DCT) coefficients of JPEG compressed frames as the measure of frame similarity.

Avoids the need to decompress the frames.

Page 24: CVHCI-Lecture-Shot boundary TV genre v3

Edges/Contours

The edges of the objects in the last frame before the hard cut usually cannot be found in the first frame after the hard cut, and the edges of the objects in the first frame after the hard cut in turn cannot be found in the last frame before the hard cut.

Use the Edge Change Ratio (ECR) to detect hard cuts!

[Figure: hard cut]

Page 25: CVHCI-Lecture-Shot boundary TV genre v3

Edge Change Ratio (ECR)

$$ECR_n = \max\left(\frac{X_n^{in}}{p_n},\; \frac{X_{n-1}^{out}}{p_{n-1}}\right)$$

$p_n$: The number of edge pixels in frame n

$X_n^{in}$: The number of entering edge pixels in frame n

$X_{n-1}^{out}$: The number of exiting edge pixels in frame n-1

To make the measure more robust to object motion:

Edge pixels in one image which have edge pixels nearby in the other image (e.g. within 6 pixels) are not regarded as entering or exiting edge pixels.
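A minimal sketch of the ECR computation, assuming the edge detector has already run and each frame is given as a set of (x, y) edge-pixel coordinates. "Nearby" is interpreted here as a square neighbourhood of the given radius, and the handling of frames without any edges is likewise an assumption of this sketch.

```python
def dilate(edge_pixels, radius):
    """All positions within `radius` (square neighbourhood) of some edge pixel."""
    return {(x + dx, y + dy)
            for (x, y) in edge_pixels
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)}


def edge_change_ratio(edges_prev, edges_curr, radius=6):
    """ECR_n = max(X_in / p_n, X_out / p_(n-1)) for two sets of edge pixels."""
    if not edges_prev or not edges_curr:
        # Assumption: no edges in both frames means no change, otherwise full change.
        return 0.0 if not edges_prev and not edges_curr else 1.0
    near_prev = dilate(edges_prev, radius)
    near_curr = dilate(edges_curr, radius)
    entering = sum(1 for p in edges_curr if p not in near_prev)
    exiting = sum(1 for p in edges_prev if p not in near_curr)
    return max(entering / len(edges_curr), exiting / len(edges_prev))


# Vertical edge lines, 20 pixels long, at different x positions.
line_at_10 = {(10, y) for y in range(20)}
line_at_12 = {(12, y) for y in range(20)}   # same edge, moved by 2 pixels
line_at_40 = {(40, y) for y in range(20)}   # unrelated edge

print(edge_change_ratio(line_at_10, line_at_12))
print(edge_change_ratio(line_at_10, line_at_40))
print(edge_change_ratio(line_at_10, line_at_12 | line_at_40))
```

The slightly moved edge contributes no entering or exiting pixels, while the unrelated edge counts fully, which is the intended robustness to object motion.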

Page 26: CVHCI-Lecture-Shot boundary TV genre v3

Edge Change Ratio (ECR)

[Figure: comparison of the edge images of two frames.]

Page 27: CVHCI-Lecture-Shot boundary TV genre v3

Motion

Use motion vectors to determine discontinuity.

[Figure: image I, a rotated copy of image I, and an illustration of the optical flow between them.]

Page 28: CVHCI-Lecture-Shot boundary TV genre v3

Fade Detection

A fade sequence S(x,y,t) of duration T is obtained by scaling the pixel intensities/colors of a video sequence S1(x,y,t) by a temporally monotone scaling function f(t):

S(x,y,t) = f(t) * S1(x,y,t)

Fade in: f(0) = 0 and f(T) = 1

Fade out: f(0) = 1 and f(T) = 0

Often f(t) is linear:

Fade in: f(t) = t/T, Fade out: f(t) = (T-t)/T

Page 29: CVHCI-Lecture-Shot boundary TV genre v3

Fade Detection - Standard deviation of pixel intensities

Var(S(x,y,t)) = Var(f(t) * S1(x,y,t))

= f(t)^2 * Var(S1(x,y,t))

= f(t)^2 * Var(S1(x,y)) (treating S1 as approximately static during the fade)

σ(S(x,y,t)) = f(t) * σ(S1(x,y))

Method:

Detect the monochrome frames

Search in both directions for a linear increase in the pixels’ intensity/color standard deviation
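The method can be sketched on a sequence of per-frame standard deviations. This is an illustrative sketch only: the monochrome threshold, the ramp length, the linearity tolerance and all function names are hypothetical choices.

```python
from statistics import pstdev


def scale_frame(frame, factor):
    """Apply the fade scaling f(t) = factor to every pixel."""
    return [[factor * pixel for pixel in row] for row in frame]


def frame_std(frame):
    """Standard deviation of all pixel intensities of one frame."""
    return pstdev([pixel for row in frame for pixel in row])


def is_linear_ramp(values, tolerance):
    """True if the values increase with roughly equal steps."""
    steps = [b - a for a, b in zip(values, values[1:])]
    return (len(steps) >= 2 and all(step > 0 for step in steps)
            and max(steps) - min(steps) <= tolerance)


def find_fades(stds, mono_threshold=1.0, length=4, tolerance=1.0):
    """Search around each monochrome frame for a linear rise of the standard deviation."""
    fades = []
    for i, std in enumerate(stds):
        if std > mono_threshold:
            continue                                      # not a monochrome frame
        after = stds[i:i + length + 1]                    # forward in time
        before = stds[max(0, i - length):i + 1][::-1]     # backward in time
        if len(after) == length + 1 and is_linear_ramp(after, tolerance):
            fades.append(("fade-in", i))
        if len(before) == length + 1 and is_linear_ramp(before, tolerance):
            fades.append(("fade-out", i))
    return fades


# Scaling a frame by f scales its standard deviation by f.
source = [[0, 100], [100, 200]]
half_faded = scale_frame(source, 0.5)

# Made-up standard deviations: a shot fades out to black, then the next fades in.
stds = [40, 40, 30, 20, 10, 0, 10, 20, 30, 40, 40]
fades = find_fades(stds)
print(fades)
```

Index 5 is the monochrome frame; the linear ramps on both sides of it are reported as the end of a fade out and the start of a fade in.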


Page 30: CVHCI-Lecture-Shot boundary TV genre v3

Dissolve Detection

A dissolve sequence D(x,y,t) of duration T is a mixture of two video sequences S1(x,y,t) and S2(x,y,t), where the first sequence is fading out while the second is fading in:

D(x,y,t) = f1(t) * S1(x,y,t) + f2(t) * S2(x,y,t)

f1(t) = (T – t) / T = 1 - f2(t)

f2(t) = t / T

Method

Train support vector machines


Page 31: CVHCI-Lecture-Shot boundary TV genre v3

Fade out/in vs. Dissolve

[Figure: example frame sequences of a fade out/in (FOI) and of a dissolve.]

Page 32: CVHCI-Lecture-Shot boundary TV genre v3

Shot Boundary Detection @ TRECVID Evaluations

A video retrieval evaluation campaign from the National Institute of Standards and Technology (NIST), US.

Promote progress in content-based analysis, detection, and retrieval in large amounts of digital video

Combine multiple errorful sources of evidence

Achieve greater effectiveness, speed, and usability

Confront systems with unfiltered data and realistic tasks

Measure systems against human abilities

Page 33: CVHCI-Lecture-Shot boundary TV genre v3

Shot Boundary Detection @ TRECVID Evaluations

Evaluated each year from 2001 to 2007

57 different research groups worldwide

Page 34: CVHCI-Lecture-Shot boundary TV genre v3

Cut vs. Gradual Transition Performance

[Figure: TRECVID performance plots for cuts and for gradual transitions.]

Page 35: CVHCI-Lecture-Shot boundary TV genre v3

A short break


Page 36: CVHCI-Lecture-Shot boundary TV genre v3

TV Genre Classification

Multimedia content annotation

Key issue in current convergence of audiovisual entertainment and information media

Good information and communication technologies are available, but multimedia classification is not mature enough

Lack of good automatic algorithms

Main challenge: combine and map low-level descriptors and high-level concepts

Page 37: CVHCI-Lecture-Shot boundary TV genre v3

Sample Genres


Page 38: CVHCI-Lecture-Shot boundary TV genre v3

Subgenres


Page 39: CVHCI-Lecture-Shot boundary TV genre v3

Sample Feature - Scene Length

[Figure: scene length plots for a news cast, sports (tennis), commercials, and a cartoon.]

Page 40: CVHCI-Lecture-Shot boundary TV genre v3

Sample Feature - Audio Statistics: Wave Forms

[Figure: audio wave forms for a news cast, sports (race), sports (tennis), commercials, and a cartoon.]

Page 41: CVHCI-Lecture-Shot boundary TV genre v3

Sample Feature - Audio Statistics: Frequency Spectrum

[Figure: frequency spectra (amplitude axis) for a news cast, sports (race), sports (tennis), and commercials.]

Page 42: CVHCI-Lecture-Shot boundary TV genre v3

A sample system

TV Genre Classification Using Multimodal Information and Multilayer Perceptrons

Credit: Tomas Semela

Montagnuolo, M., Messina, A.: TV Genre Classification Using Multimodal Information and Multilayer Perceptrons, AI*IA, LNAI 4733, pp. 730-741, 2007

Page 43: CVHCI-Lecture-Shot boundary TV genre v3

Feature Sets

Modality information in the broadcast domain concerns:

Physical properties perceived by users, like colours, shapes and motion

Structural-syntactic information, e.g. relationships between frames, shots and scenes

Cognitive information related to high-level semantic concepts like faces

Aural analysis of noise and speech

resulting in a feature vector

$$PV = (V_c, S, C, A)$$

Page 44: CVHCI-Lecture-Shot boundary TV genre v3

Feature Sets

Low-level visual feature vector component

Color represented by hue (H), saturation (S) and value (V)

Luminance (Y) represented by a grey scale [16, 233]

Textures described through contrast (C) and directionality (D) (Tamura's features)

Temporal activity information (T) based on the displaced frame difference (DFD)

65-bin histogram for each feature

Last bin collects undefined values

Page 45: CVHCI-Lecture-Shot boundary TV genre v3

Feature Sets

Computed on a frame-by-frame basis

Accumulated over the number of frames

Each histogram modeled by a 10-component Gaussian mixture model

Each component being a Gaussian distribution with three parameters: weight $w_i$, mean $\mu_i$ and standard deviation $\sigma_i$

$$\varphi_{\mu_i,\sigma_i^2}(x) = \frac{1}{\sigma_i\sqrt{2\pi}}\; e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$$

Page 46: CVHCI-Lecture-Shot boundary TV genre v3

Feature Sets

Gaussian mixture model

$$\sum_{i=1}^{10} w_i\,\varphi_{\mu_i,\sigma_i^2}$$

[Figure: example of a 4-component Gaussian mixture with different means and standard deviations]

resulting in a 210-dimensional feature vector

$$V_c = (H, S, V, Y, C, D, T)$$
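The dimension follows from the counts on the slides: 7 features, each modeled by 10 components with 3 parameters, gives 7 x 10 x 3 = 210. The sketch below evaluates a mixture and flattens placeholder parameters into such a vector. Fitting the mixtures to the histograms is not shown, and the parameter ordering inside the vector is an assumption of this example.

```python
import math


def gaussian(x, mu, sigma):
    """Density of a single Gaussian component."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def mixture(x, components):
    """Weighted sum of components; each component is a (weight, mean, std) triple."""
    return sum(w * gaussian(x, mu, sigma) for w, mu, sigma in components)


def visual_feature_vector(models):
    """Concatenate the mixture parameters of H, S, V, Y, C, D and T."""
    return [param for name in "HSVYCDT" for comp in models[name] for param in comp]


# Placeholder parameters (not fitted to any data): 10 equally weighted components.
placeholder = [(0.1, float(k), 1.0) for k in range(10)]
models = {name: placeholder for name in "HSVYCDT"}

v_c = visual_feature_vector(models)
print(len(v_c))

# Numerical check that the placeholder mixture is a valid density.
area = sum(mixture(-10 + 0.01 * k, placeholder) for k in range(3001)) * 0.01
```

Describing each 65-bin histogram by 30 mixture parameters gives a smoother and more compact representation than the raw bins.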

Page 47: CVHCI-Lecture-Shot boundary TV genre v3

Structural feature vector component

Extracted using a shot detection module

S1 captures information about the rhythm of the video:

$$S_1 = \frac{1}{F_r N_s} \sum_{i=1}^{N_s} \Delta s_i$$

• $F_r$ is the frame rate (i.e. 25 fps)

• $N_s$ is the total number of shots

• $\Delta s_i$ is the length of the $i$-th shot, measured as the number of frames within the shot.

Page 48: CVHCI-Lecture-Shot boundary TV genre v3

Structural feature vector component

$S_2$ describes how shot lengths are distributed along the video

represented by a 65-bin histogram

64 bins for shot lengths in [0, 30 s]

65th bin for shots longer than 30 s

histogram normalized by $N_s$ so the area sums to one

resulting in a 66-dimensional feature vector

$$S = (S_1, S_2)$$
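A small sketch of how the structural component could be assembled from a list of shot lengths. It assumes that S1 is the average shot length in seconds (sum of shot lengths in frames divided by frame rate times number of shots) and that the 64 regular bins have equal width; the function name and the sample shot lengths are hypothetical.

```python
def structural_features(shot_lengths, frame_rate=25.0, bins=64, max_seconds=30.0):
    """Return [S1] + S2 for shot lengths given in frames."""
    n_shots = len(shot_lengths)
    # S1: average shot length in seconds.
    s1 = sum(shot_lengths) / (frame_rate * n_shots)
    # S2: 64 equal-width bins over [0, 30 s] plus one bin for longer shots.
    hist = [0.0] * (bins + 1)
    for frames in shot_lengths:
        seconds = frames / frame_rate
        if seconds > max_seconds:
            hist[bins] += 1
        else:
            # Assumption: a shot of exactly 30 s falls into the 64th bin.
            hist[min(int(seconds / max_seconds * bins), bins - 1)] += 1
    s2 = [count / n_shots for count in hist]   # normalized so the bins sum to one
    return [s1] + s2


# Made-up shots of 2 s, 4 s, 10 s and 40 s at 25 fps.
features = structural_features([50, 100, 250, 1000])
print(len(features), features[0])
```

Fast-cut material such as commercials would push the mass of S2 toward the first bins, while long static shots end up in the last bin.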

Page 49: CVHCI-Lecture-Shot boundary TV genre v3

Cognitive feature vector component

Built by applying face detection. Leads to three features:

$$C_1 = \frac{N_f}{D_p}$$

$N_f$: total number of faces

$D_p$: total number of frames

$C_2$ describes how faces are distributed along the video

expressed by an 11-bin histogram

the $i$-th bin contains the number of frames with $i$ faces,

the 11th bin contains the number of frames with 10 or more faces

Page 50: CVHCI-Lecture-Shot boundary TV genre v3

Cognitive feature vector component

$C_3$ describes how faces are positioned along the video

9-bin histogram where the $i$-th bin represents the $i$-th position in the frame

Positions are top-left, top-right, bottom-left, bottom-right, left, right, top, bottom and center

All histograms normalized by $N_f$ so their area sums to one

resulting in a 21-dimensional feature vector

$$C = (C_1, C_2, C_3)$$
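The cognitive component can be sketched from per-frame face detections. Several details are assumptions of this example and not taken from the slides: the face detector is replaced by hand-written position labels, bin k of the first histogram counts frames with k faces (k = 0..9) with the last bin for 10 or more, and each histogram is divided by its own total so that it sums to one (for the position histogram that total equals the number of faces).

```python
POSITIONS = ["top-left", "top-right", "bottom-left", "bottom-right",
             "left", "right", "top", "bottom", "center"]


def normalize(hist):
    """Scale a histogram so that its bins sum to one (left unchanged if empty)."""
    total = sum(hist)
    return [value / total for value in hist] if total else hist


def cognitive_features(faces_per_frame):
    """faces_per_frame holds, for every frame, the position label of each detected face."""
    n_frames = len(faces_per_frame)
    n_faces = sum(len(faces) for faces in faces_per_frame)
    c1 = n_faces / n_frames              # average number of faces per frame
    c2 = [0.0] * 11                      # frames by number of faces
    c3 = [0.0] * 9                       # faces by position in the frame
    for faces in faces_per_frame:
        c2[min(len(faces), 10)] += 1
        for position in faces:
            c3[POSITIONS.index(position)] += 1
    return [c1] + normalize(c2) + normalize(c3)


# Made-up detections for four frames.
detections = [["center"], ["left", "right"], [], ["center"]]
features = cognitive_features(detections)
print(len(features))
```

A talk show would typically concentrate C3 in a few positions, whereas a football broadcast would spread small faces over many positions or detect none at all.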

Page 51: CVHCI-Lecture-Shot boundary TV genre v3

Aural feature vector component

Derived by audio analysis of the TV programme

Audio signal segmented into seven classes: speech, silence, noise, music, pure speaker, speaker plus noise, speaker plus music

$A_1$: duration values for the seven classes, normalized by the total duration of the video

$A_2$: the average speech rate, computed from speech content transcriptions using a speech-to-text engine

resulting in an 8-dimensional feature vector

$$A = (A_1, A_2)$$

Page 52: CVHCI-Lecture-Shot boundary TV genre v3

Genre Classification

$p$ is the TV programme to be classified

$\Omega = \{\omega_1, \omega_2, \ldots, \omega_{N_\omega}\}$ is the set of available genres

The feature vector of $p$ is derived as described in the previous slides

Each part of the feature vector of $p$ is the input of a neural network

Page 53: CVHCI-Lecture-Shot boundary TV genre v3

Genre Classification

• Each neural network has an output vector

$$\Phi^{(p,n)} = \{\phi_1^{(p,n)}, \ldots, \phi_{N_\omega}^{(p,n)}\}, \quad n = 1, \ldots, 4$$

• $\phi_i^{(p,n)}$ can be interpreted as the membership value of $p$ to genre $i$, according to the pattern vector part $n$

• Outputs are combined into a resulting vector

$$\Phi^{(p)} = \{\phi_1^{(p)}, \ldots, \phi_{N_\omega}^{(p)}\}$$

where:

$$\phi_i^{(p)} = \frac{1}{4} \sum_{n=1}^{4} \phi_i^{(p,n)}$$

• The genre $j$ is selected corresponding to the maximum element of $\Phi^{(p)}$
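The combination step amounts to averaging the four output vectors element-wise and taking the arg max. A minimal sketch follows; the membership values are invented for illustration and the ordering of the genres within the vectors is an assumption.

```python
GENRES = ["cartoons", "football", "talk show", "weather forecast",
          "news", "music videos", "commercials"]


def combine_outputs(network_outputs):
    """Average the per-genre membership values over all networks."""
    n_networks = len(network_outputs)
    return [sum(values) / n_networks for values in zip(*network_outputs)]


def classify(network_outputs, genres=GENRES):
    """Return the genre with the largest combined membership value."""
    combined = combine_outputs(network_outputs)
    best = max(range(len(combined)), key=combined.__getitem__)
    return genres[best], combined


# Invented outputs of the four networks for one programme.
outputs = [
    [0.1, 0.7, 0.1, 0.0, 0.1, 0.2, 0.3],   # visual
    [0.2, 0.6, 0.1, 0.0, 0.1, 0.1, 0.4],   # structural
    [0.1, 0.5, 0.3, 0.1, 0.2, 0.1, 0.1],   # cognitive
    [0.0, 0.8, 0.1, 0.0, 0.1, 0.1, 0.2],   # aural
]

genre, combined = classify(outputs)
print(genre)
```

Averaging gives each modality the same influence; a weighted combination would be a different design and is not what the slides describe.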

Page 54: CVHCI-Lecture-Shot boundary TV genre v3

Experimental Results - Dataset

About 110 hours of complete TV programs

Genres: cartoons, football, talk show, weather forecast, news, music videos, commercials

Each TV program manually annotated

Dataset split into K = 6 disjoint subsets of equal size

K-fold cross validation is used
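The evaluation protocol can be sketched as follows. How the programmes were assigned to the six subsets is not stated on the slide, so the round-robin assignment below is an assumption, as are the function name and the stand-in data.

```python
def k_fold_splits(items, k=6):
    """Yield (train, test) pairs; each of the k disjoint subsets is the test set once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [item for j, fold in enumerate(folds) if j != i for item in fold]
        yield train, folds[i]


programmes = list(range(12))   # stand-ins for annotated TV programmes
splits = list(k_fold_splits(programmes, k=6))

for train, test in splits:
    # A classifier would be trained on `train` and evaluated on `test` here.
    print(len(train), len(test))
```

The reported accuracy would then be the average over the six test subsets, so every programme is used for testing exactly once.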


Page 55: CVHCI-Lecture-Shot boundary TV genre v3

Sample Clips from the Data Set

[Figure: sample clips for cartoon, commercial, football, music, news, talk show, and weather forecast.]

Page 56: CVHCI-Lecture-Shot boundary TV genre v3

Experimental Results - Settings

All networks have one hidden layer and seven output neurons with sigmoid activation functions in the range of [0,1]

All hidden neurons have symmetric sigmoid activation functions in the range of [-1,1]

Aural network has 8 input neurons and 32 hidden neurons

Cognitive network has 21 input neurons and 32 hidden neurons

Structural network has 65 input neurons and 8 hidden neurons

Visual network has 210 input neurons and 16 hidden neurons
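The forward pass of such a network can be sketched in plain Python. The sketch assumes tanh as the symmetric sigmoid (one common choice, not confirmed by the slides) and the logistic function for the outputs. The weights are random placeholders, so the outputs carry no meaning; training is not shown.

```python
import math
import random


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One hidden layer: tanh hidden units in [-1, 1], logistic outputs in [0, 1]."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return [1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
            for row, b in zip(w_out, b_out)]


def random_network(n_in, n_hidden, n_out, rng):
    """Untrained placeholder weights, only to make the sketch runnable."""
    def layer(rows, cols):
        return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]
    return layer(n_hidden, n_in), [0.0] * n_hidden, layer(n_out, n_hidden), [0.0] * n_out


rng = random.Random(0)
# (input neurons, hidden neurons) per network as listed above; 7 outputs each.
LAYOUT = {"aural": (8, 32), "cognitive": (21, 32), "structural": (65, 8), "visual": (210, 16)}
networks = {name: random_network(n_in, n_hidden, 7, rng)
            for name, (n_in, n_hidden) in LAYOUT.items()}

aural_features = [0.5] * 8   # made-up input vector
outputs = mlp_forward(aural_features, *networks["aural"])
print(len(outputs))
```

Each of the seven outputs plays the role of a membership value for one genre, which is what makes the later averaging across networks possible.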


Page 57: CVHCI-Lecture-Shot boundary TV genre v3

Experimental Results

Obtained an average accuracy of 92 %

In some cases even greater than 95 %

Some news and talk shows, as well as commercials and music clips, are confused with each other

The music genre shows the most scattered results due to structural, visual and cognitive inhomogeneity

Page 58: CVHCI-Lecture-Shot boundary TV genre v3

Experimental Results – Comparison


Page 59: CVHCI-Lecture-Shot boundary TV genre v3

Experimental Results - Comparison


Page 60: CVHCI-Lecture-Shot boundary TV genre v3

References

Rainer Lienhart: Reliable Transition Detection in Videos: A Survey and Practitioner's Guide. International Journal of Image and Graphics (IJIG), Vol. 1, No. 3, pp. 469-486, 2001.

M. Montagnuolo, A. Messina: TV Genre Classification Using Multimodal Information and Multilayer Perceptrons. AI*IA, LNAI 4733, pp. 730-741, 2007.

Page 61: CVHCI-Lecture-Shot boundary TV genre v3

Questions?