VIDEO SEGMENTATION - UniFI · 2010-05-10

Page 1

Video segmentation

Page 2

Video segmentation

• Segmentation is the process of breaking a video into its constituent basic elements, the shots, and into their higher-level aggregates, such as episodes or scenes.

• A common definition of shot is: “a sequence of frames that was (or appears to be) continuously captured from the same camera”. A shot-break is the transition from one shot to the next. Shot segmentation is therefore the process of detecting transitions between two consecutive shots.

Page 3

• The traditional approach to segmentation is to preview the whole video and then annotate the shots and their boundaries with textual labels. A fully manual segmentation of a movie may require approximately 10 hours of work for one hour of video.

• A less expensive approach uses edit decision lists created by video producers during post-production. However, final changes to the video stream can cause misalignments with the edit decision lists, and a large part of existing videos does not come with any edit decision list.

• Automatic segmentation is a viable approach to produce reliable shot segmentation. Segmentation into episodes is highly dependent on the type of video and the context information available.

Page 4

Shot segmentation and edit effects

Hard cut

Dissolve (combined fade-out fade-in)

Wipe

Matte

• There are two types of shot transitions: sharp shot transitions (cuts) and gradual shot transitions (fades, dissolves, wipes and mattes)

Page 5

• Edit effects are used differently in different types of video:

SPORTS VIDEO: shots with large camera zoom-in; shots with large, fast-moving objects (close-ups); shots with invasive edit effects (partial mattes)

NEWS VIDEO: shots with little motion (affecting only part of the frame, approx. 1/4); shots with almost no motion

Page 6

COMMERCIALS: different edit effects and shots of different duration, depending on targets and goals

Telecom: 14 shots (9 cuts, 4 horizontal wipes, 1 flip wipe); 1 very fast shot (5 frames), 1 shot with large motion, 12 shots with little or almost no motion

Golia: 29 shots (29 cuts); 27 very fast shots (5 frames or less), 3 shots with fast motion, 1 shot with almost no motion

Kia: 12 shots (12 cuts); 2 shots with large fast-moving objects, 2 static shots

Findus: 10 shots (10 cuts); 2 shots with large camera zoom-in, 1 shot with camera rotation, 7 shots with little or almost no motion

Page 7

• Methods for edit effect detection and shot segmentation work either in the uncompressed or in the compressed domain:
– In the uncompressed domain, solutions are based on the evaluation of a similarity measure between successive images. When two images are sufficiently dissimilar, there may be a cut. Gradual transitions are found by using cumulative difference measures.
– In the compressed domain, methods do not perform decoding/re-encoding, but exploit the fact that the encoded video stream already contains a rich set of precomputed features, such as motion vectors (MVs) and block averages (DC coefficients), that can be used for temporal video segmentation.

• Shot segmentation problems are in any case concerned with:
– object motion (a person moves into the camera shot, ...)
– camera motion (panning, zooming, ...)
– lighting changes (camera flash, lightning, ...)
– some types of shot boundary (dissolves, fades, ...)
– digital effects (swirls, morphing, ...)

• To reduce false shot change detections:
– Algorithmic solutions
– Threshold values (e.g. higher values)
– Empirical restrictions (e.g. a shot must be longer than 100 frames)
– ...

Page 8

Cut detection

• A cut is defined as a sharp transition between one shot and the following one. It is obtained by simply joining two different shots without inserting any other photographic effect.

• Automatic cut detection is based on information extracted from the shots that contribute to the cut (brightness and color distribution changes, motion, edges, ...).

Page 9

• Cuts generally correspond to an abrupt change in the brightness pattern for two consecutive images.

• Therefore, cuts between shots with small motion and constant illumination can be easily detected by looking for sharp brightness changes. The principle behind this approach is that, since two consecutive frames in a shot do not change significantly in their background and object content, their overall brightness distribution differs little.

• However, detection is difficult in the presence of continuous object motion, camera movements, or changes of illumination within the shot. Research has therefore concentrated on developing algorithms that amplify visual properties of the shots in order to detect discontinuities in those properties.

Page 10

Cut detection - uncompressed domain

Page 11

Pairwise pixel comparison (full frame, pixelwise, intensity based)

• Pairwise comparison is simply based on the differences between gray levels Ixy(ft), Ixy(ft+1) of corresponding pixels (pointwise gray-level difference) in two consecutive frames ft and ft+1:

Dcut(x, y) = | Ixy(ft) - Ixy(ft+1) |

The overall difference between a frame of size X x Y and the following one is the normalized sum of the pointwise differences:

D(t, t+1) = (1 / XY) Σx=1..X Σy=1..Y | Ixy(ft) - Ixy(ft+1) |

• Pairwise comparison can be extended to color frames, by calculating the pointwise color difference Dp_cut in each color channel p and summing such differences: Dcut = Σp Dp_cut
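A minimal sketch of the pointwise comparison in Python with NumPy (the function names and the uint8 frame assumption are illustrative, not taken from the slides):

```python
import numpy as np

def pairwise_pixel_difference(frame_t, frame_t1):
    """Mean absolute gray-level difference D(t, t+1) between two frames.

    frame_t, frame_t1: 2-D uint8 arrays of the same size X x Y.
    """
    diff = np.abs(frame_t.astype(np.int16) - frame_t1.astype(np.int16))
    return diff.mean()  # (1/XY) * sum over x,y of |Ixy(ft) - Ixy(ft+1)|

def color_pairwise_difference(frame_t, frame_t1):
    """Extension to color frames: sum of the per-channel differences."""
    return sum(pairwise_pixel_difference(frame_t[..., p], frame_t1[..., p])
               for p in range(frame_t.shape[-1]))
```

A cut would then be hypothesized whenever this measure exceeds a chosen threshold, as discussed on the following slides.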

Page 12

A sequence break is detected if the number of pixels that have changed exceeds a certain threshold.

Pairwise pixel comparison example

Page 13

Consecutive frame differences (full frame, intensity based)

• The average intensity difference is applied to two consecutive color frames according to the following procedure:

– Compute the normalized sum St of pixel intensity values for each frame ft of size M x N:

St = (1 / MN) Σx=0..M-1 Σy=0..N-1 Ixy(ft)

– Evaluate the inter-frame difference Dcut between frames ft-1, ft and ft+1 in the following manner:

d = (St - St+1) / (St-1 - St)

Page 14

Color histogram comparison (full frame, color histogram based)

• The histogram comparison method is simply based on the differences between the values of corresponding brightness histogram bins in two consecutive frames:

d(f, f') = Σj=1..N | H(f, j) - H(f', j) |

• A sequence break is detected whenever a predefined threshold τ is exceeded (peaks reveal cuts). The threshold can be obtained by computing all the frame-to-frame differences and their mean µ and variance σ. The threshold is calculated as:

τ = µ + ασ

where α is typically a small number.
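A sketch of this full-frame histogram comparison with the τ = µ + ασ threshold; the 64-bin choice anticipates the next slide, while the value α = 3 and the function names are illustrative assumptions:

```python
import numpy as np

def histogram_difference(frame_a, frame_b, bins=64):
    """Sum of absolute bin-to-bin differences between brightness histograms."""
    h_a, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    h_b, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    return np.abs(h_a - h_b).sum()

def detect_cuts(frames, alpha=3.0, bins=64):
    """Flag frame pairs whose histogram difference exceeds tau = mu + alpha*sigma."""
    d = np.array([histogram_difference(frames[i], frames[i + 1], bins)
                  for i in range(len(frames) - 1)])
    tau = d.mean() + alpha * d.std()  # global threshold from the video statistics
    return [i for i, di in enumerate(d) if di > tau]  # cut between frames i and i+1
```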

Page 15

• For color images, this equation can be applied to each individual color channel. A 64-bin histogram (2 bits for each color channel) has been suggested in order to obtain fairly accurate results.

• Peaks of the function for color images are sharper than for gray-level histograms.

Color histogram comparison example

Page 16

• Implementations of the color histogram methods differ in a number of factors, including:
– The color space used to represent the pixel values.
– Threshold calculation: the threshold can be global or local, and can be determined using several methods.
– Differencing criterion: several methods and metrics can be used to compute the difference of two histograms. Some of the most used criteria are reported in the following.

Page 17

Histogram intersection

• Histogram intersection is applied to the values of corresponding brightness histogram bins in two consecutive frames:

d(f, f') = Σj=0..N min( H(f, j), H(f', j) )

• Since the intersection of two identical frames is equal to the number N of pixels in the frame, the dissimilarity metric is defined from the minima of the function:

D = N - d

(Figure: intersection A ∩ B of two histograms A and B)
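A minimal sketch of the intersection measure and the derived dissimilarity, assuming 8-bit gray frames and unnormalized histograms (the 64-bin choice is again illustrative):

```python
import numpy as np

def histogram_intersection(h_a, h_b):
    """d(f, f') = sum over j of min(H(f, j), H(f', j))."""
    return np.minimum(h_a, h_b).sum()

def intersection_dissimilarity(frame_a, frame_b, bins=64):
    """D = N - d, where N is the number of pixels in a frame."""
    h_a, _ = np.histogram(frame_a, bins=bins, range=(0, 256))
    h_b, _ = np.histogram(frame_b, bins=bins, range=(0, 256))
    return frame_a.size - histogram_intersection(h_a, h_b)
```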

Page 18

Normalized χ2 test

• The normalized χ2 test amplifies the distance between color histogram bins of two consecutive frames:

d(f, f') = (1 / N²) Σj=0..N ( H(f, j) - H(f', j) )² / max( H(f, j), H(f', j) )

• Measures are not taken at full video rate, but at sampled frames (typically from 3 to 10 frames per second).

• A modification of the original χ2 test that has been proposed is:

d(f, f') = Σj=0..N ( H(f, j) - H(f', j) )² / H(f', j)
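The two measures above in NumPy; the guard against empty bins (replacing zero denominators by 1) is an implementation detail assumed here, not stated on the slide:

```python
import numpy as np

def chi2_normalized(h_a, h_b):
    """(1/N^2) * sum_j (Ha(j) - Hb(j))^2 / max(Ha(j), Hb(j))."""
    h_a = h_a.astype(np.float64)
    h_b = h_b.astype(np.float64)
    denom = np.maximum(np.maximum(h_a, h_b), 1.0)  # avoid division by empty bins
    return ((h_a - h_b) ** 2 / denom).sum() / len(h_a) ** 2

def chi2_modified(h_a, h_b):
    """sum_j (Ha(j) - Hb(j))^2 / Hb(j)."""
    h_a = h_a.astype(np.float64)
    h_b = h_b.astype(np.float64)
    return ((h_a - h_b) ** 2 / np.maximum(h_b, 1.0)).sum()
```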

Page 19

Edge differences (whole frame, edge based)

• This method considers edge images and gray level information. It is based on the observation that during scene breaks new edges appear far from the old edges, and old edges disappear in locations far from the new edges.

• Cuts are detected by counting the number of entering edges (ρin) and exiting edges (ρout) in two consecutive frames, using a fixed threshold over a temporal window.

Page 20

• Processing steps:
– Perform image smoothing (Gaussian filtering)
– Compute the image gradient and threshold it
– Extract edges (Canny filtering and dilation)
– Detect dissimilarity from the peaks of ρ = max(ρin, ρout)
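A sketch of these steps using OpenCV; the blur size, Canny thresholds and dilation radius are illustrative parameter choices, not values given in the slides:

```python
import cv2
import numpy as np

def edge_change_ratio(prev_gray, curr_gray, radius=3,
                      canny_lo=50, canny_hi=150):
    """rho = max(rho_in, rho_out) between two consecutive gray (uint8) frames."""
    # Smooth, extract edge maps, and dilate them by `radius` pixels
    kernel = np.ones((2 * radius + 1, 2 * radius + 1), np.uint8)
    e_prev = cv2.Canny(cv2.GaussianBlur(prev_gray, (5, 5), 1.5), canny_lo, canny_hi)
    e_curr = cv2.Canny(cv2.GaussianBlur(curr_gray, (5, 5), 1.5), canny_lo, canny_hi)
    d_prev = cv2.dilate(e_prev, kernel)  # tolerance region around the old edges
    d_curr = cv2.dilate(e_curr, kernel)  # tolerance region around the new edges

    n_prev = max(np.count_nonzero(e_prev), 1)
    n_curr = max(np.count_nonzero(e_curr), 1)
    # Entering edges: new edge pixels far from any old edge
    rho_in = np.count_nonzero((e_curr > 0) & (d_prev == 0)) / n_curr
    # Exiting edges: old edge pixels far from any new edge
    rho_out = np.count_nonzero((e_prev > 0) & (d_curr == 0)) / n_prev
    return max(rho_in, rho_out)
```

Peaks of ρ over time would then be compared against a fixed threshold, as stated above.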

Page 21

Using subframes

• The use of subframes minimizes the influence of local changes in illumination and motion: each frame is divided into subframes (typically 16, in a 4x4 grid).

(Figure: corresponding subframes i of frames ft and ft+1)

Page 22

Likelihood ratio (subframe, intensity based)

• The likelihood ratio is computed by considering corresponding subframes i (blocks) of two consecutive frames and the second order statistics of their intensity values.

• If mi(f) and σi(f) are respectively the mean value and the variance of the intensity in the i-th block of frame f, then the likelihood ratio for a block is defined as:

di(f, f') = [ ( σi(f) + σi(f') ) / 2 + ( ( mi(f') - mi(f) ) / 2 )² ]² / ( σi(f) · σi(f') )

• A sequence break is detected if most of the blocks into which the image has been partitioned exhibit likelihood ratios greater than a predefined threshold.
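A block-wise sketch of this measure; the 4x4 grid, the guard against zero variance and the example threshold are assumptions made for illustration:

```python
import numpy as np

def block_likelihood_ratio(block_a, block_b):
    """Likelihood ratio d_i for two corresponding intensity blocks."""
    ma, mb = block_a.mean(), block_b.mean()
    va, vb = block_a.var(), block_b.var()
    num = ((va + vb) / 2.0 + ((mb - ma) / 2.0) ** 2) ** 2
    return num / max(va * vb, 1e-6)  # equals 1 for identical blocks

def fraction_of_changed_blocks(frame_a, frame_b, grid=(4, 4), threshold=3.0):
    """Fraction of blocks whose likelihood ratio exceeds the threshold."""
    rows, cols = grid
    h, w = frame_a.shape
    bh, bw = h // rows, w // cols
    hits = 0
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * bh, (r + 1) * bh), slice(c * bw, (c + 1) * bw))
            if block_likelihood_ratio(frame_a[sl].astype(float),
                                      frame_b[sl].astype(float)) > threshold:
                hits += 1
    return hits / (rows * cols)  # a break is declared when most blocks exceed it
```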

Page 23

Bin to bin histogram difference (subframe, histogram based)

• The bin to bin histogram difference can be computed for each image subframe k; N = 9 subframes have been suggested. Cuts are detected by averaging the bin to bin differences computed at each subframe and applying an appropriate threshold to the resulting difference D.

Page 24

χ2 test (subframe, histogram based)

• Corresponding subframes i of consecutive frames ft, ft+1 are compared by considering their color histograms. The equation can thus be rewritten as:

di(f, f') = Σj=0..N ( Hi(f, j) - Hi(f', j) )² / Hi(f', j)

• The 8 largest difference values are discarded and only the 8 remaining ones are retained.

Page 25

Color histogram moment comparison (subframe, histogram based)

• Performs color histogram differences between two corresponding subframes of consecutive frames, plus a comparison of the statistical moments of the histograms, up to the third order.

• Each frame in the sequence is partitioned into subframes i. Since horizontal panning and motion are statistically more frequent, the number of subframes is set higher in the horizontal direction than in the vertical one.

• The interblock difference is then defined over the color channels p as: Di = Σp di

The global difference D is obtained from this measure by discarding the n worst values. A shot change is detected within a temporal window centred at t with an amplitude of 5 frames.

Page 26

• Let Hi(ft) be the histogram of subframe i for one color channel of RGB in frame ft. The difference between the corresponding subframes of two consecutive frames f and f' is defined as follows:

di(f, f') = Σj=1..N | Hi(f, j) - Hi(f', j) | + aᵀ | mi(f) - mi(f') |

where mi(f) = [m1, m2, m3] is the moment vector of histogram Hi(f) for the color channel and a = [a1, a2, a3] is the vector of scale parameters. The scale factor a1 is adaptively tuned depending on the absolute value of m1(f').

The k-th order moment is defined as the average of the k-th power of the deviation from the average, where µk = Σx x^k H(x) is the k-th raw moment:

mk = Σx (x - µ1)^k H(x)

µ1 = arithmetic average; m2 = µ2 - µ1² (variance); m3 = µ3 - 3µ2µ1 + 2µ1³
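A sketch of the moment vector and the per-subframe difference above; treating m1 as the arithmetic average µ1 and using a = (1, 1, 1) are assumptions made here for illustration (the slide tunes a1 adaptively):

```python
import numpy as np

def histogram_moments(hist):
    """Moment vector [m1, m2, m3] of a histogram H(x), with m1 taken as the mean."""
    h = hist.astype(np.float64)
    h = h / max(h.sum(), 1.0)  # normalize so sums act as averages
    x = np.arange(len(h))
    mu1, mu2, mu3 = ((x ** k * h).sum() for k in (1, 2, 3))  # raw moments mu_k
    m2 = mu2 - mu1 ** 2                                      # variance
    m3 = mu3 - 3 * mu2 * mu1 + 2 * mu1 ** 3
    return np.array([mu1, m2, m3])

def subframe_moment_difference(h_a, h_b, a=(1.0, 1.0, 1.0)):
    """d_i = sum_j |Ha(j) - Hb(j)| + a^T |m(Ha) - m(Hb)| for one color channel."""
    bin_term = np.abs(h_a.astype(np.float64) - h_b.astype(np.float64)).sum()
    moment_term = np.dot(np.asarray(a),
                         np.abs(histogram_moments(h_a) - histogram_moments(h_b)))
    return bin_term + moment_term
```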

Page 27

Remarks and comments on cut detection

• Histogram-based methods vs pixelwise methods
– The histogram-based methods minimize the sensitivity to camera movements (such as panning and zooming) and are not significantly affected by histogram dimensionality. They offer better performance than intensity-based pixelwise methods.
– Precision values obtainable are close to 90%.
– Most of the histogram-based solutions are however sensitive to fast camera movements, large moving objects, and fast moving objects. Abrupt changes of brightness also have a negative impact on algorithm performance.

• Histogram intersection
– Histogram intersection is the simplest approach among the histogram-based methods and requires low computational effort. It may lead to wrong estimations, since exchanging pixel positions leaves the histogram unchanged while the image pattern may vary largely.
– In non-critical cases, histogram intersection is to be preferred to the χ2 test. If the number of color codes is high and the L*u*v* or MTM color space is used, it outperforms the χ2 test method.

• χ2 test method
– The χ2 test, like the pointwise absolute difference method, gives false cut detections in scenes where fast motion is present. This is mainly due to the fact that a two-frame window is used.

Page 28

• Color histogram moments method
– The method based on color histogram moments uses a window of five frames to observe changes in brightness-histogram differences with an adaptive threshold. A comparative analysis has shown superior performance.
– Misses and false detections of this method occur in the presence of very dark shots or very fast motion (a large object that rapidly obscures the camera view within 3 to 5 frames).

• Edge-change method
– The performance of the edge-change method is ruled by three parameters:
• the edge detector smoothing factor;
• the edge detector threshold;
• the radius r of the neighbourhood in which ρ is evaluated.
Low values of r make the algorithm very sensitive to shifts in edges due to noise and non-rigid motion. Large values of r make the values of the ρ parameter lower, which makes cut detection more difficult and unstable.
– The edge-change method is strongly impaired by low contrast between two consecutive frames.

Page 29

• Locally adaptive vs global fixed thresholding
– The choice of threshold is a critical point for almost all of the techniques. Setting appropriate thresholds may require a pre-analysis of the video to be segmented.
– Global thresholding computes statistics over the whole video; it fails in the presence of a large variety of behaviors and is usually inadequate.
– Local thresholding improves performance: e.g. a window is centered around each frame and the mean value is calculated. The threshold at any frame is then calculated as a multiple of the local window average and a constant factor k, dependent on the frame difference (see the sketch after this list).

• Full frame vs subframe
– Full frame based methods: very resistant to motion, but tend to be poor at detecting changes between similar shots.
– Subframe based methods: minimize the influence of local changes in illumination and motion; adequately discriminant; the choice of the block size influences behavior.
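A sketch of the locally adaptive threshold just described, applied to a precomputed series of frame-to-frame differences; the window half-width and the factor k are illustrative values:

```python
import numpy as np

def adaptive_cut_detection(differences, half_window=15, k=3.0):
    """Flag frames whose difference exceeds k times the local window average.

    differences: 1-D sequence of frame-to-frame difference values
    half_window: number of frames taken on each side of the current frame
    k: multiplicative constant applied to the local average
    """
    d = np.asarray(differences, dtype=float)
    cuts = []
    for i, di in enumerate(d):
        lo, hi = max(0, i - half_window), min(len(d), i + half_window + 1)
        local = np.concatenate([d[lo:i], d[i + 1:hi]])  # window without the frame itself
        if local.size and di > k * local.mean():
            cuts.append(i)
    return cuts
```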

Page 30

Cut detection - compressed domain

Page 31

For JPEG encoded video

JPEG processing chain (block diagram): RGB → YCrCb conversion → 8x8 blocks → shifted 8x8 image block → 8x8 DCT coefficients → 8x8 quantized DCT coefficients → RLE sequence → Huffman coding (the DC coefficient of each block is coded with reference to the DC coefficient of the previous block)

Page 32

• Shot boundaries can be detected using the DCT coefficients of JPEG compressed video:
– For each video frame, a subset of the 8x8 pixel blocks is considered.
– For each block only a subset of the 64 DCT coefficients (the most significant coefficients) is taken. These DCT coefficients are considered as representative of the frame content.
– Cuts are detected by evaluating the normalized inner product between the coefficient vectors cf, cf+k of two frames shifted by k on the temporal axis:

d(f, f+k) = 1 - ( cf · cf+k ) / ( |cf| |cf+k| )
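A sketch of the distance measure above; how the coefficient vectors are assembled from the compressed stream (which blocks and which coefficients are kept) is left outside this fragment:

```python
import numpy as np

def dct_vector_distance(c_f, c_fk):
    """d(f, f+k) = 1 - (c_f . c_f+k) / (|c_f| |c_f+k|) for two DCT coefficient vectors."""
    c_f = np.asarray(c_f, dtype=float).ravel()
    c_fk = np.asarray(c_fk, dtype=float).ravel()
    denom = np.linalg.norm(c_f) * np.linalg.norm(c_fk)
    if denom == 0.0:
        return 0.0  # degenerate case: zero vectors, treat as identical
    return 1.0 - float(np.dot(c_f, c_fk)) / denom
```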

Page 33

For MPEG encoded video

MPEG suggests that encoding the differences between adjacent still pictures is a fruitful approach to compression. It assumes that:
- A moving picture is simply a succession of still pictures.
- The differences between adjacent still pictures are generally small.

Main MPEG features:
– Transform-domain-based compression (intra-frame coding)
o DCT, quantization and run-length encoding
– Block-based motion compensation
o Similar blocks of pixels common to two or more successive frames are replaced by a pointer (motion vector) that references one of the blocks.
o 16x16 pixel macroblocks (MBs)
o Predictive encoding is done with reference to an anchor frame
– Interpolative techniques (inter-frame coding)
o Bidirectional interpolation (forward-predicted and backward-predicted)

Page 34

MPEG GoP

• A video sequence is divided into Groups of Pictures (GoPs). The smaller the GoP is, the better the performance is with respect to motion, although compression is lower (more I frames are present):

– Four types of frames:
• I (intra coded)
• P (predictive forward coded)
• B (bi-directional coded)
• D frames

I and P frames are anchor frames
I frames have no reference to other frames
P frames have forward reference to I or P frames
D frames only use the DC component (low resolution, rarely used)
I frames have distance M with respect to P frames and N with respect to I frames; N is typically a multiple of M

Example: M=3, N=9

Page 35

MPEG macroblocks and frames

• Each video frame (here 64x64 pixels) is grouped into 16 macroblocks, each covering a region of 16x16 pixels. Macroblocks are necessary for motion compensation.

• Three types of macroblocks are possible:
I: encoded independently of other macroblocks
P: the region itself is not encoded, but rather the motion vector and error block with respect to the previous frame
B: same as above, except that the motion vector and error block can be encoded from the previous or the next frame
Skipped macroblocks encode the case of zero motion (the macroblock of the previous frame is copied)

• Frames in their turn have types I, P and B. Different frame types allow different macroblock types:
P frames: intra-coded MBs or forward-predicted MBs
B frames: intra-coded, forward- and/or backward-predicted MBs, or skipped MBs

Page 36

Block motion compensation

• Example: the match of the shaded macroblock of the current frame is found in the previous frame at position (24, 4); the motion vector for the current macroblock is then (8, -4).

(Figure: a current B frame with a forward motion vector and a backward motion vector pointing to the best matching macroblocks in the neighbouring anchor frames)

• Each macroblock is encoded separately for luminance and chrominance components.

Page 37

MPEG processing chain

Page 38

(Figure: the block diagram of the MPEG encoder, with the I frame, P frame and B frame paths)

Page 39

Using I-frame Histograms

• Exploits the fact that MPEG frames and macroblocks may be of type I, B or P.

• I frames are extracted from the MPEG video. For each I-frame, a histogram is evaluated by considering the first (DC) coefficient of each 8x8 DCT block. Histograms of consecutive I-frames are then compared according to a statistical test.

• Experiments suggest that the χ2 test provides the most satisfactory solution.

Page 40

Using MPEG motion vectors and pairwise comparison

• Exploits the fact that in MPEG each B-frame is predicted and interpolated from its preceding and succeeding I-frames and P-frames by using motion compensation algorithms, and only the residual error is encoded. If there is a large discontinuity between two frames, this causes large residual errors in the frame blocks, and MPEG then directly transforms the original pixel values into DCT coefficients instead.

• The presence of a small number of motion vectors in B-frames is used as a clue in detecting video cuts. The pairwise comparison technique is then applied to DCT coefficients of I-frames.

• Processing is reduced by approximately 1/6 with respect to using JPEG compressed video.

Page 41

Remarks on cut detection in the compressed domain

• Video segmentation based on MPEG is sensitive to the MPEG encoder that is used. Different encoders may:
– use different methods to perform motion compensation and to calculate motion vectors,
– use different quantization tables for DCT coefficient compression,
– use a preferred direction for predictive encoding.

• Results of the comparison of MPEG algorithms by Kasturi ['96] and Boreczky and Kasturi ['96] indicate that they have a much higher rate of false detections when dealing with cuts, compared to histogram-based algorithms in the non-compressed domain. Moreover, their computational cost is the highest of all the algorithms.

Page 42

Fade, dissolve, wipe and matte detection

• Fades and dissolves make the boundary between two shots spread across a number of frames. They therefore have both a starting and an ending frame:

– Fading is an optical process which determines the progressive darkening of a shot until the last frame becomes completely black (fade-out) or, conversely, the gradual transition from black to full light (fade-in).

– A dissolve is a superimposition of a fade-out and a fade-in: the first shot fades out to black while the following one fades in to full light.

• If made with optical machines, these effects come in sequences of standard duration (16, 24, 32, 48 or 96 frames) with a linear variation of pixel brightness. With electronic equipment these effects can be very fast (similar to cuts).

• Semantic meaning is associated with fades and dissolves:
– In movies, fades reflect a change of context (a sharp change of place or time; the end of an episode).
– Dissolves are used in movies as "....." is used in a text. They convey the idea of shifting the action in time and place and are commonly used for flashbacks.
– In documentaries, they smooth changes from one description to another and make the presentation flow.

Page 43

Example of Fade out followed by Fade in with semantic meaning (falling asleep and …. after a while …. waking up)

From “Chinatown” movie

Page 44

Examples of dissolve (moderate length and fast dissolve)

Page 45

Histogram differences and twin thresholding

• Gradual transitions are usually detected using a twin thresholding mechanism and a histogram difference metric, as used for cut detection.

• Two thresholds are used, to detect cuts (the higher one, τb) and special effects (the lower one, τs):
– A cut is detected whenever the τb threshold is exceeded.
– If the τs threshold is exceeded and τb is not, the frame at which this happens is identified as a potential starting frame for a gradual transition.

• Threshold τb is set automatically, according to the video statistics. Threshold τs does not vary much among different video sources; a suggestion is to assume for this threshold a value considerably greater than the mean value of the frame-to-frame difference (from 8 to 10). Setting tolerance values around the thresholds may add robustness to this technique.

Page 46

• Twin comparison method
– If τb < diff: a cut is detected
– If τs < diff < τb: accumulate the differences in δ
– If diff < τs: do nothing
– If the accumulated value δ is greater than τb, a gradual transition is detected.
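A sketch of these rules applied to a series of frame-to-frame differences; closing an open candidate when the difference falls back below τs, and the return format, are assumptions of this fragment:

```python
def twin_comparison(differences, tau_b, tau_s):
    """Return (cuts, graduals): frame indices of cuts and (start, end) index
    pairs of detected gradual transitions, following the rules above."""
    cuts, graduals = [], []
    start, acc = None, 0.0
    for i, d in enumerate(differences):
        if d > tau_b:                      # sharp change: a cut
            cuts.append(i)
            start, acc = None, 0.0
        elif d > tau_s:                    # candidate gradual transition
            if start is None:
                start = i
            acc += d                       # accumulate differences in delta
        else:                              # below tau_s: close any open candidate
            if start is not None and acc > tau_b:
                graduals.append((start, i - 1))
            start, acc = None, 0.0
    if start is not None and acc > tau_b:  # candidate still open at the end
        graduals.append((start, len(differences) - 1))
    return cuts, graduals
```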

Page 47

Production model

• A mathematical approximation is used for fades and dissolves, which are modeled as chromatic scaling operations. If G(x,y,t) is a grey scale sequence and ls is the length of the sequence, a fade-out and a fade-in are respectively modeled, for t in [t0, t0 + ls], as:

E(x,y,t) = G(x,y,t) · (1 - t/ls) + vec(0)    (fade-out)

E(x,y,t) = vec(0) + G(x,y,t) · (t/ls)    (fade-in)

where vec(0) represents black.

• The first order difference image, obtained by differentiating the model equation, is a constant image, proportional to the fade rate.

• The presence of such a constant image is used to detect the chromatic change associated with fading.
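A small sketch that synthesizes the two model equations for a static frame G and exposes the first-order difference images used as the fading clue (the static-content assumption and the function names are mine, not from the slides):

```python
import numpy as np

def fade_out(g, ls):
    """Frames E(x,y,t) = G(x,y) * (1 - t/ls) of a fade-out of length ls."""
    g = g.astype(np.float64)
    return [g * (1.0 - t / ls) for t in range(ls + 1)]

def fade_in(g, ls):
    """Frames E(x,y,t) = G(x,y) * (t/ls) of a fade-in of length ls."""
    g = g.astype(np.float64)
    return [g * (t / ls) for t in range(ls + 1)]

def first_order_differences(frames):
    """Temporal difference images E(t+1) - E(t). Under the model above (with a
    static G) every difference equals -G/ls (fade-out) or +G/ls (fade-in), so it
    stays the same from frame to frame and is proportional to the fade rate 1/ls."""
    return [frames[t + 1] - frames[t] for t in range(len(frames) - 1)]
```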

Page 48

Production model with full color information

• Exploiting full color information makes it possible to distinguish between fades, dissolves and other gradual transition effects. Since the most important characteristic of a fade is the linear change of pixel brightness, the basic idea is to adopt color spaces that separate brightness from color and do not present instability (e.g. L*u*v*). In this case, during a fade, while L* changes, the values of u* and v* remain approximately constant.

• The algorithm for fade detection is then based on verifying a pseudo-linear variation of the L* values and the constancy of the u* and v* values.

Page 49

Wipes

• Wipes are a category of effects in which an image, the last frame of a shot, is progressively pushed out of the screen by the appearing one, which is the first frame of the following shot. They can be distinguished as horizontal, vertical and flip wipes.

• Wipes are generally fast transition effects (10-15 frames) and therefore produce a large interframe difference during the effect. This typically generates a train of peaks in the cut detection measure, spanning the duration T of the effect, that can be used as a clue to reveal the presence of wipes.

Page 50

Mattes

• Mattes are a progressive darkening of the image by a dark mask of varying shape. Typically this transition is as fast as a wipe.

• Accordingly, mattes can be detected in the same way as wipes, by checking for the presence of a train of peaks in the cut detection measure. To distinguish mattes from wipes, the central frame of the transition can be converted to gray levels and the corresponding histogram H(x) analyzed.