Graph-based Object Detection and Tracking in H.264/AVC Bitstreams for Surveillance Video


    GRAPH-BASED OBJECT DETECTION AND TRACKING IN H.264/AVC BITSTREAMS FOR

    SURVEILLANCE VIDEO

    Houari Sabirin, Jaeil Kim and Munchurl Kim

    Department of Information and Communications Engineering,

    Korea Advanced Institute of Science and Technology, Daejeon, Korea

    [email protected], [email protected], [email protected]

    ABSTRACT

    In this paper we present a novel method to detect and track

    moving objects in H.264/AVC bitstreams by processing

    motion vector and residue information. The encoded blocks

    with nonzero motion vectors and residues are first detected

    as moving object candidates. A spatio-temporal graph in

    video sequences is then constructed to represent groups of

    blocks in each frame and their associations to the other

    groups of blocks in subsequent frames. Identification and

    refinement of ROIs for moving objects being tracked are

    done by graph matching and adaptive ROI-size adjustment.

The experimental results show that the proposed method can correctly identify real moving objects from frame to frame, can effectively detect small-sized objects and objects with small motion vectors and residues, and can recognize moving objects even under occlusion.

Index Terms: object detection and tracking, graph theory, H.264/AVC, surveillance video

    1. INTRODUCTION

Object detection and tracking in the compressed bitstream domain has been an interesting and challenging topic in surveillance video analysis because moving objects are detected not directly from the visible pixel data but from the encoded data that represent the motion and pixel differences caused by the moving objects. The problem is how to precisely locate and identify the moving object regions and their resulting trajectories, usually relying on the limited information available in the compressed bitstreams.

Especially in the H.264/AVC bitstream domain, some research has been conducted to automatically detect and track moving objects of interest. Techniques based on partial decoding proposed in [1] and [2] utilize additional information such as object colors to distinguish an object of interest from other objects, but these methods incur additional computational complexity. Another method using partial decoding, proposed in [3], detects moving vehicles in traffic recordings, which may not be suitable for general applications. A method using the bit sizes of block partitions has shown good precision in detecting moving objects [4]. While the shapes of the detected objects approximate real object boundaries well at the precision of 4×4 block units, the method does not identify different detected objects. Similar results are given in [5], where the moving objects are detected via motion vector processing, but no identifications are made on the detected objects. The method proposed in [6] provides the labels and trajectories of the detected objects. However, it assumes that there is no noise due to illumination changes or improper encoding, which is not usually the case in real applications.

On the other hand, graph theory has long been one of the effective tools for object segmentation in computer vision. Graph cut algorithms have been popularly utilized for image segmentation in video sequences, observing the similarities and dissimilarities, in terms of energy, between pixels represented as vertices, and have shown effective performance in segmenting objects from background [7]. Graph-based object tracking has also been utilized to correctly identify two corresponding sets of graphs between two consecutive frames in video sequences [8]. Graph-based object detection in the pixel domain has also been studied for sports video in [9] to determine the trajectories of moving objects. These observations suggest that graph-based object detection and tracking may also be applicable in the compressed domain.

In this paper we propose a novel method of object detection and tracking in the compressed domain using a graph-based approach. Firstly, the blocks that have non-zero motion vectors and residues are detected as moving object candidates. Secondly, groups of the detected blocks are represented as spatial graphs in each frame. The groups of detected blocks in each frame are then temporally connected to the groups of detected blocks in the next frame, which constitutes a spatio-temporal graph over the whole set of block groups. Thirdly, the temporal connections of the spatial graphs are checked to remove the block groups that are not part of real moving objects and to track the segmented block groups as moving objects by their attribute similarities.

    ___________________________

    This work was supported by the R&D program of MKE/IITA

    [A1100-0801-3015, Development of Open-IPTV Technologies for

    Wired and Wireless Networks]


This paper is organized as follows. We first define a spatio-temporal graph with graph attributes in Section 2. Section 3 describes a method of removing noisy objects in the proposed spatio-temporal graph, and a method of tracking moving objects is described in Section 4. Region refinement for the detected block groups is discussed in Section 5, and the experimental results are presented in Section 6. Finally, Section 7 concludes our work.

    2. SPATIO-TEMPORAL GRAPH

In H.264/AVC, each MB is encoded in a block partition mode among the 16×16 ~ 4×4 block partitions for Inter prediction coding or among the 4×4, 8×8 and 16×16 block modes for Intra prediction coding. Since the object regions are represented by groups of 4×4 blocks which may include non-zero motion vectors and/or non-zero residues, the blocks having non-zero motion vectors or non-zero residues are detected in 4×4 units and clustered into groups. Note that the motion vectors of the detected 4×4 blocks are copied from their respective block partitions in the MBs.

A block group in a frame is defined as a set of detected blocks whose block boundaries touch. Each block group is considered a moving object candidate and forms one subgraph in which a vertex represents a 4×4 block and an edge connects a pair of touching blocks. Thus one frame may contain several block groups that represent the moving object candidates (i.e. the groups can be real moving objects or noise).
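As an illustration of this clustering step, the following sketch groups detected 4×4 blocks into block groups by 4-connected flood fill. The grid layout and the function name `cluster_blocks` are assumptions of this sketch, not part of the original method description.

```python
from collections import deque

def cluster_blocks(detected):
    """Group detected 4x4 blocks into block groups (subgraphs).

    `detected` is a 2D boolean grid, one entry per 4x4 block, True where
    the block has a non-zero motion vector or non-zero residue.
    Blocks whose boundaries touch (4-connectivity) fall into one group.
    """
    rows, cols = len(detected), len(detected[0])
    label = [[-1] * cols for _ in range(rows)]
    groups = []
    for i in range(rows):
        for j in range(cols):
            if detected[i][j] and label[i][j] < 0:
                # BFS flood fill: collect every touching detected block.
                queue, group = deque([(i, j)]), []
                label[i][j] = len(groups)
                while queue:
                    y, x = queue.popleft()
                    group.append((y, x))
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and detected[ny][nx] and label[ny][nx] < 0):
                            label[ny][nx] = len(groups)
                            queue.append((ny, nx))
                groups.append(group)
    return groups  # each group is one moving-object candidate
```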

Now, we define a spatial graph, which simply represents the whole set of subgraphs in a frame. Notice that each subgraph can be regarded as a super-vertex and that there is no connection between super-vertices in the spatial graph. In general, the super-vertices in a frame have corresponding super-vertices in the following frame. Therefore, a spatio-temporal graph is defined by temporally connecting the super-vertices to their corresponding super-vertices between pairs of consecutive frames in a video sequence. Note that the spatio-temporal graph does not grow in time; instead, it slides forward from frame to frame. Next, this graph-based representation for defining moving object candidates is explained in detail.

Let $G = \{g_1, \ldots, g_N; N \geq 0\}$ be the set of spatial graphs in a frame, where each spatial graph $g_n = (V, E, a)$ is an undirected attributed graph that represents a moving object candidate. Here $N$ is the number of detected moving objects. The vertex set $V = \{v_1, v_2, \ldots, v_{|g_n|}\}$ denotes the blocks in a block group, and an edge indicator $E_{u,v} \in \{0, 1\}$ between two vertices $u$ and $v$ denotes whether the two adjacent blocks are connected. The order $|g_n|$ is the number of blocks in the group. The attribute of a vertex is defined as $a_{g_n}(v) = \{c(v), D(v), M(v), e(v)\}$, where the elements denote the location, direction, magnitude and energy of the block, respectively, which characterize the corresponding object. By this definition, each detected object is represented as a subgraph $g_n$ in the spatial graph of a frame. These attributes will be used to track the objects of interest by correctly identifying them in video sequences.

The location $c(v) = (i, j)$, $i \leq I$, $j \leq J$, indicates the x and y coordinates of the block relative to the top-left edge of the frame. The direction is a real number ranging from $-\pi$ to $\pi$, calculated from the motion vector of the block as $D(v) = \tan^{-1}\!\left(mv_{ij}^{y} / mv_{ij}^{x}\right)$, where $mv_{ij}^{x}$ and $mv_{ij}^{y}$ are the x and y components of that motion vector. The magnitude, which indicates how far the block is moving, is given by $M(v) = \|mv_{ij}\|$. The energy of the block is a nonnegative real number calculated from the average of the squared residues in the block, given by $e(v) = \frac{1}{K}\sum_{k} r_{ijk}^{2}$. Here $r_{ijk}$ is the residue of the $k$-th pixel of the block at $(i, j)$ and $K$ is the number of pixels in which the residue is not zero.
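A minimal sketch of how these four vertex attributes could be computed for one block, assuming the motion vector and the 16 residue values have already been extracted from the bitstream; `vertex_attributes` and its argument layout are hypothetical, and `atan2` stands in for the $\tan^{-1}$ above so the full $(-\pi, \pi]$ range is covered.

```python
import math

def vertex_attributes(i, j, mv, residues):
    """Attributes a(v) = (c(v), D(v), M(v), e(v)) of one 4x4 block.

    i, j     : block coordinates relative to the top-left of the frame
    mv       : (mv_x, mv_y) motion vector copied from the block partition
    residues : iterable of the 16 residue values of the block
    """
    c = (i, j)                        # location
    d = math.atan2(mv[1], mv[0])      # direction in (-pi, pi]
    m = math.hypot(mv[0], mv[1])      # magnitude |mv|
    nonzero = [r for r in residues if r != 0]
    # energy: mean of squared residues over the K non-zero pixels
    e = sum(r * r for r in nonzero) / len(nonzero) if nonzero else 0.0
    return c, d, m, e
```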

Between consecutive frames, a spatio-temporal graph is constructed by defining a weighted graph whose vertices are composed of the subgraphs of the spatial graphs from five consecutive frames. We define a weighted spatio-temporal graph $\bar{G} = (\bar{V}, \bar{E}, w)$ where the vertices are defined as

$$\bar{V} = \left\{ v^{f}_{n_0}, \ldots, v^{f}_{N_0},\; v^{f-1}_{n_{-1}}, \ldots, v^{f-1}_{N_{-1}},\; v^{f-2}_{n_{-2}}, \ldots, v^{f-2}_{N_{-2}},\; v^{f-3}_{n_{-3}}, \ldots, v^{f-3}_{N_{-3}},\; v^{f-4}_{n_{-4}}, \ldots, v^{f-4}_{N_{-4}} \right\} \quad (1)$$

and the edges, i.e. the relations between two vertices in two consecutive frames, are defined as

$$\bar{E} = \left\{ \left(v^{f}_{n_0}, v^{f-1}_{n_{-1}}\right), \left(v^{f-1}_{n_{-1}}, v^{f-2}_{n_{-2}}\right), \left(v^{f-2}_{n_{-2}}, v^{f-3}_{n_{-3}}\right), \left(v^{f-3}_{n_{-3}}, v^{f-4}_{n_{-4}}\right) \right\} \quad (2)$$

where $n_0$, $n_{-1}$, $n_{-2}$, $n_{-3}$ and $n_{-4}$ are the indices of the vertices in frame $f$ to frame $f-4$, respectively, and $N_0$, $N_{-1}$, $N_{-2}$, $N_{-3}$ and $N_{-4}$ are the total numbers of vertices in the corresponding frames. Thus vertex $v^{f}_{n}$ denotes the subgraph $g_n$ in frame $f$. The weight $w$ of an edge is determined by calculating the similarity in distance between two vertices, given by

$$w\!\left(v^{f-N}_{n_{-N}}, v^{f-(N+1)}_{n_{-(N+1)}}\right) = \left\| c\!\left(v^{f-N}_{n_{-N}}\right) - c\!\left(v^{f-(N+1)}_{n_{-(N+1)}}\right) \right\| \quad (3)$$

where $f-N$ and $f-(N+1)$ denote the indices of two adjacent frames, and $N = 0, 1, 2, 3$. The centroid $c(v)$ is the mean of the locations of all subvertices in $v$ (the vertices of subgraph $g$). Fig. 1 illustrates an example of a spatio-temporal graph $\bar{G}$.
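The following sketch builds the weighted spatio-temporal graph over a five-frame window, assuming each subgraph is given as a list of block locations. The Euclidean centroid distance is used as the weight of (3), and the names `centroid` and `build_st_graph` are illustrative only.

```python
def centroid(subgraph):
    """Mean location of all subvertices (blocks) in a subgraph."""
    xs = [p[0] for p in subgraph]
    ys = [p[1] for p in subgraph]
    return sum(xs) / len(xs), sum(ys) / len(ys)

def build_st_graph(frames):
    """Spatio-temporal graph over a sliding window of five frames.

    `frames` lists the five frames newest first (f, f-1, ..., f-4); each
    frame is a list of subgraphs, each subgraph a list of block
    locations. Returns weighted edges ((t, n), (t+1, m), w), where w is
    the centroid distance of equation (3) between candidate pairs.
    """
    edges = []
    for t in range(len(frames) - 1):
        for n, g_cur in enumerate(frames[t]):
            for m, g_prev in enumerate(frames[t + 1]):
                (x1, y1), (x2, y2) = centroid(g_cur), centroid(g_prev)
                w = ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5
                edges.append(((t, n), (t + 1, m), w))
    return edges
```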

    3. GRAPH PRUNING AND PROJECTION


Object detection and tracking in the compressed domain often suffers from falsely detected blocks that are not part of moving objects, due to intensity changes and the movement of background clutter such as shaking trees, as well as due to fine quantization during encoding. To prevent such false blocks from being detected as parts of moving objects (and, furthermore, from being tracked), noise filtering is applied by pruning the spatio-temporal graph $\bar{G}$.

By assuming that the position of a moving object in a frame is very close to that of the corresponding moving object in the next frame (within 1 block, or 4 pixels, away), we can remove the subgraphs $g$ resulting from noisy blocks by pruning the vertices and edges of the spatio-temporal graph $\bar{G}$ whose edge weights are larger than 4 pixels. Fig. 2 illustrates an example of the edge weights of a spatio-temporal graph for two consecutive frames.

Fig. 3 shows an example of graph pruning to remove noisy subgraphs for the Speedway sequence. In Fig. 3(a), the subgraphs are produced by moving objects as well as background clutter (shaking trees as noise). The spatio-temporal graph $\bar{G}$ constructed from five consecutive frames reveals that some subgraphs are isolated while the others are clustered into groups, namely G1, G2 and G3, as shown in Fig. 3(b). Further observation shows that only groups G2 and G3 have edges across the five consecutive frames. Therefore, by graph pruning, we can prune all vertices except those in groups G2 and G3, which are determined to be the real moving objects. Fig. 3(c) shows the result of graph pruning, where only the subgraphs of the real objects survive.
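A minimal sketch of the pruning rule, assuming the edge list produced above: edges heavier than 4 pixels are discarded, and vertices left without any temporal correspondence are treated as noise. The 4-pixel default mirrors the one-block assumption; the function name is hypothetical.

```python
def prune_st_graph(edges, max_weight=4.0):
    """Prune the spatio-temporal graph (a minimal sketch).

    An edge heavier than `max_weight` (one 4x4 block, i.e. 4 pixels of
    centroid displacement) is treated as a false temporal
    correspondence and removed. A vertex that loses all of its edges is
    isolated noise (e.g. background clutter) and is pruned as well.
    """
    light = [(u, v, w) for (u, v, w) in edges if w <= max_weight]
    survivors = set()
    for u, v, _ in light:
        survivors.add(u)   # vertex keeps at least one light edge
        survivors.add(v)
    return light, survivors
```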

In other cases, improper motion compensation or insignificant frame differences may cause blocks that are supposed to represent moving objects to contain zero motion vectors or no residue data. In this situation, graph pruning may remove vertices that actually represent moving objects. To handle this problem, a graph projection is performed after graph pruning to recover the missing vertices.

To avoid improper projection of noisy block groups (subgraphs), the graph projection is performed after graph pruning and only when the number of subgraphs decreases or becomes zero in two consecutive frames. We first label the vertices of the spatio-temporal graph $\bar{G}$ in two consecutive frames. Let the vertex in frame $f-1$ be $v^{f-1}_{m}$, $m \leq N_{-1}$, and the corresponding missing vertex to be found by projection in frame $f$ be $v^{f}_{n}$, $n \leq N_{0}$, where $N_{0}$ and $N_{-1}$ are the numbers of vertices in the current and previous frames, respectively. The missing vertex in the current frame is projected from the previous frame. Therefore $v^{f}_{n} = v^{f-1}_{m}$, where its position is calculated as

$$c\!\left(v^{f}_{n}\right) = c\!\left(v^{f-1}_{m}\right) + \alpha \cdot mv\!\left(v^{f-1}_{m}\right) \quad (4)$$

where $mv(v)$ is the motion vector of $v$ and $\alpha = 0.5$ is a regulator constant that prevents the projected vertex from shifting too far from the actual object position.
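Equation (4) amounts to a one-line projection, sketched below under the assumption that the centroid and mean motion vector of the matched vertex in frame $f-1$ are available.

```python
ALPHA = 0.5  # regulator constant alpha from equation (4)

def project_vertex(c_prev, mv_prev, alpha=ALPHA):
    """Project a missing vertex into the current frame, per eq. (4).

    c_prev  : centroid of the matched vertex in frame f-1
    mv_prev : mean motion vector of that vertex
    The projected position is c + alpha * mv; alpha = 0.5 keeps the
    projection close to the previous object position.
    """
    return (c_prev[0] + alpha * mv_prev[0],
            c_prev[1] + alpha * mv_prev[1])
```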

    4. GRAPH-BASED OBJECT TRACKING

The attributes of the vertices in a spatio-temporal graph $\bar{G}$ are used to track the detected objects by correctly identifying them in video sequences. Object tracking is performed by vertex matching between the current frame and a past reference frame based on attribute similarity. For vertex matching, the attributes of the vertices are compared. The reference frame for vertex matching can be selected from the frames preceding the current frame, depending on the change in the order of the spatial graph.

    4.1. Adjacent vertex matching

Vertex matching is performed by simply matching two vertices with similar location attribute values in two consecutive frames, using the previous frame $f-1$ as the reference frame for the current frame $f$.

Fig. 2. (a) An example graph in which red circles represent vertices from the current frame and blue circles represent vertices from the previous frame. (b) The edge weights of the same graph.

Fig. 3. (a) A frame from the Speedway sequence with the superimposed spatio-temporal graph $\bar{G}$. (b) The vertices and edges from five consecutive frames. (c) The result of graph pruning.

Fig. 1. An example of a spatio-temporal graph $\bar{G}$ constructed over five consecutive frames ($f$, $f-1$, $f-2$, $f-3$, $f-4$). The edges show the correspondences between two vertices in two consecutive frames.

  • 8/2/2019 Graph-based Object Detection and Tracking in H.264/AVC Bitstreams for Surveillance Video

    4/7

The matching between the vertices in frames $f-1$ and $f$ is determined by finding two similar vertices whose edge weight is smaller than an adaptive threshold. The threshold is determined by the block-unit size and the magnitude of the motion vectors, in order to handle the significant position changes caused by fast object motion.
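A sketch of adjacent vertex matching under these definitions; since the exact form of the adaptive threshold is not spelled out in the text, this sketch assumes block-unit size plus the mean motion-vector magnitude, and the data layout is hypothetical.

```python
def adjacent_match(curr, prev, block=4.0):
    """Greedy adjacent vertex matching between frames f and f-1.

    curr/prev map vertex id -> (centroid, mean |mv|).  The threshold
    adapts to the block-unit size plus the motion-vector magnitude, so
    fast objects with large displacement can still be matched.
    """
    matches = {}
    for n, (c_n, mag_n) in curr.items():
        best, best_w = None, float("inf")
        for m, (c_m, _) in prev.items():
            w = ((c_n[0] - c_m[0]) ** 2 + (c_n[1] - c_m[1]) ** 2) ** 0.5
            if w < best_w:
                best, best_w = m, w
        if best is not None and best_w <= block + mag_n:  # adaptive threshold
            matches[n] = best
    return matches
```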

    4.2. Conditional vertex matching

Under certain conditions where vertex attributes cannot be obtained, for example in the case of occlusion, vertex matching is performed by taking into account the change in the order of the spatial graph and the selection of a different reference frame.

The change in the orders of the spatial graphs between two frames is defined as $\delta \in \{-1, 0, 1\}$, where $-1$ denotes a decrement of the number of vertices, $0$ denotes no change in the number of vertices, and $1$ denotes an increment of the number of vertices. Since $\delta$ does not explicitly determine the number of detected objects, we need to know whether the number of objects in a frame has really changed due to occlusion or not.

Let $S(v) = (s_0, s_1)$, $s \in \{0, 1\}$, denote the status given to a vertex to indicate whether occlusion has occurred in a frame. Here $S_0$ is the default status, indicating that neither occlusion-just-happened (OJH) nor occlusion-just-finished (OJF) occurred in the frame, and $S_1$ is the status indicating that either OJH or OJF occurred in the frame. One vertex is restricted to have only one status per frame. The statuses are determined as follows (a small sketch of this update rule follows the list):

- Default status $S_0 = 1$ is initially set in the start frame for all vertices.
- OJH status $S_1 = 1$ is set when the distance between two vertices in the frame prior to the occlusion is smaller than a block-unit size and $\delta = -1$.
- OJF status $S_1 = 0$ is set when the distance between two objects in the frame after the occlusion has ended is smaller than a block-unit size and $\delta = +1$.
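A minimal sketch of this status update, assuming the inter-vertex distance and the order change $\delta$ have already been computed; the tuple encoding of $(S_0, S_1)$ is an assumption of the sketch.

```python
def update_status(dist, delta, status, block=4.0):
    """Update the occlusion status (S0, S1) of a vertex, Section 4.2.

    dist   : distance between the two candidate vertices (pixels)
    delta  : change in the order of the spatial graph (-1, 0, +1)
    status : current (s0, s1); s0=1 is the default status, s1=1 means
             occlusion-just-happened (OJH) is in effect.
    """
    s0, s1 = status
    if dist < block and delta == -1:
        return (0, 1)   # OJH: two nearby vertices merged into one
    if dist < block and delta == +1:
        return (1, 0)   # OJF: occlusion ended, back to default
    return (s0, s1)     # otherwise keep the current status
```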

The reference frame for conditional vertex matching is selected based on the last frame in which the objects were occluded. Based on this condition, we perform vertex matching by selecting one preceding frame $f-\tau$ as the reference frame. The weight between vertices in frames $f$ and $f-\tau$ is defined as

$$w\!\left(v^{f}_{n}, v^{f-\tau}_{n}\right) = S_0\!\left(v^{f}_{n}\right)\left\| c\!\left(v^{f}_{n}\right) - c\!\left(v^{f-\tau}_{n}\right) \right\| + S_1\!\left(v^{f}_{n}\right)\left( \left| D'\!\left(v^{f}_{n}\right) - D'\!\left(v^{f-\tau}_{n}\right) \right| + \left| e'\!\left(v^{f}_{n}\right) - e'\!\left(v^{f-\tau}_{n}\right) \right| \right) \quad (5)$$

where $S_0(v^{f}_{n})$ is the status of $v^{f}_{n}$ in the default status and $S_1(v^{f}_{n})$ is the status of $v^{f}_{n}$ in the OJH status. Here, the weight of an edge now takes into account the direction and the energy of a vertex as similarity features. The direction and energy of $v$ are calculated as the means of the directions and energies of all subvertices in $v$. Since the ranges of the direction values and the energy values are significantly different, we rescale the values to balance the difference and to make a fair comparison between the two attributes. Therefore, in (5), the direction $D'$ and energy $e'$ are defined as the base-10 logarithms of their original values.
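Equation (5) could be evaluated as sketched below, assuming each vertex carries its centroid and its mean direction and energy; the epsilon guarding the base-10 logarithms is an addition of this sketch, not part of the paper.

```python
import math

def conditional_weight(v_f, v_ref, s0, s1):
    """Edge weight of equation (5) between frames f and f-tau.

    Each vertex is (centroid, mean_direction, mean_energy).  Direction
    and energy are compared on a base-10 log scale to balance their
    ranges; eps guards against log of zero (a sketch assumption).
    """
    (c1, d1, e1), (c2, d2, e2) = v_f, v_ref
    eps = 1e-9
    dist = math.hypot(c1[0] - c2[0], c1[1] - c2[1])
    d_term = abs(math.log10(abs(d1) + eps) - math.log10(abs(d2) + eps))
    e_term = abs(math.log10(e1 + eps) - math.log10(e2 + eps))
    return s0 * dist + s1 * (d_term + e_term)
```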

Detecting a new object is relatively simple. When a new vertex of the spatio-temporal graph is detected in subsequent frames and neither adjacent vertex matching nor conditional vertex matching can find a similar vertex in the reference frame, the vertex is identified as a new object.

    5. ROI REFINEMENT

In this stage, we define a region of interest (ROI) for a moving object as the rectangle that encloses its block group (subgraph), and refine the ROI size by controlling the width and height of the ROI so that the refined ROI fits the real object size, adapting to the changes in the order of each subgraph in the spatial graphs.

Recalling that the subgraph $g_n$ represents the $n$-th object in a frame, we define the ROI of the $n$-th object in frame $f$ as $O^{f}_{n} = (c(g_n), \omega, \eta)$, where $\omega$ and $\eta$ are the width and the height of the ROI, respectively, and $c(g_n)$ is the centroid of the subgraph. The width is determined from the number of vertices along the horizontal direction of $g_n$ and the height from the vertical direction. The centroid is calculated as the mean of the locations of the vertices in the subgraph.

The refinement is performed by observing the size and centroid of the ROI every five frames. That is, every five frames the size of the ROI is computed and compared to the refined ROI of the preceding check. The refinement of the ROI size is then performed according to the following condition:

$$\hat{O}^{f}_{n} = \begin{cases} \frac{3}{4}\,\hat{O}^{f-1}_{n}, & \text{if } O^{f}_{n} : \hat{O}^{f-1}_{n} < 3:4 \\[2pt] \frac{4}{3}\,\hat{O}^{f-1}_{n}, & \text{if } O^{f}_{n} : \hat{O}^{f-1}_{n} > 4:3 \\[2pt] \frac{1}{2}\left(O^{f}_{n} + \hat{O}^{f-1}_{n}\right), & \text{otherwise} \end{cases} \quad (6)$$

where $\hat{O}^{f}_{n}$ is the refined ROI and $\hat{O}^{f-1}_{n}$ is the previously refined ROI from the preceding five frames.
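A sketch of the size-refinement rule of (6), applied independently to width and height; the branch ratios follow the reconstruction above and should be read as an approximation, and `refine_roi_size` is a hypothetical helper.

```python
def refine_roi_size(size_f, size_prev):
    """ROI size refinement per equation (6), checked every five frames.

    size_f    : (width, height) measured in the current frame
    size_prev : previously refined (width, height)
    If the size jumps past the 3:4 (or 4:3) ratio it is pulled back
    toward the previous refined size; otherwise the two sizes are
    averaged to smooth out small fluctuations.
    """
    refined = []
    for s, p in zip(size_f, size_prev):
        if s < 0.75 * p:            # shrank past the 3:4 ratio
            refined.append(0.75 * p)
        elif s > (4.0 / 3.0) * p:   # grew past the 4:3 ratio
            refined.append((4.0 / 3.0) * p)
        else:                       # otherwise: average the two sizes
            refined.append(0.5 * (s + p))
    return tuple(refined)
```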

In many cases, the ROIs may have different centroids from frame to frame due to changes in the number and position of the vertices in the subgraphs. As a result, the positions of the ROIs may fluctuate. To reduce large fluctuations in the ROI positions, the centroids of the ROIs are controlled by restricting their movement relative to those of the corresponding ROIs in the previous frame. The changes in the centroids of the ROIs are restricted to within a 4-pixel distance. If the centroid of an ROI moves beyond the 4-pixel distance, the ROI displacement is retracted to within the 4-pixel distance. By doing so, a reliable position for the ROI can be ensured within the real moving object area.
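The 4-pixel centroid restriction can be implemented as a simple clamp, sketched below; `clamp_centroid` is a hypothetical helper.

```python
def clamp_centroid(c_new, c_prev, limit=4.0):
    """Restrict ROI centroid movement to `limit` pixels per frame.

    If the centroid moved farther than the limit, its displacement is
    retracted onto the 4-pixel radius around the previous centroid.
    """
    dx, dy = c_new[0] - c_prev[0], c_new[1] - c_prev[1]
    dist = (dx * dx + dy * dy) ** 0.5
    if dist <= limit:
        return c_new
    scale = limit / dist
    return (c_prev[0] + dx * scale, c_prev[1] + dy * scale)
```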


In the case of occlusion, when two subgraphs in a frame are merged, they are represented by only one ROI. To keep tracking the detected objects even under occlusion, we observe the attributes of the vertices in the occluded subgraph (block group) and cluster the vertices with similar attributes that represent each occluded object. We can therefore reconstruct the ROIs of both subgraphs during the occlusion.

Fig. 4 illustrates ROI reconstruction during occlusion. At one frame prior to the occlusion, the ROI sizes of the objects about to be occluded are stored in a so-called ROI memory. During the occlusion, the vertices can be clustered according to their attribute similarities based on the attribute values of both subgraphs prior to the occlusion. Therefore, the reconstruction of the ROIs during the occlusion can be done by simply assigning the ROIs of both objects prior to occlusion to the locations of the clusters of vertices, as shown in Fig. 4. After the occlusion is finished, the ROIs of both objects are again determined normally as the rectangles that enclose the block groups of each detected object.
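A minimal sketch of the attribute-based clustering used to split a merged subgraph into its two occluded objects, assuming the pre-occlusion mean direction and energy of each object were stored in the ROI memory; the distance measure is an assumption of the sketch.

```python
def split_occluded(vertices, ref_a, ref_b):
    """Assign the vertices of a merged subgraph to two occluded objects.

    vertices : list of (direction, energy) attributes in the merged group
    ref_a/b  : mean (direction, energy) of each object before occlusion
    Each vertex joins the object whose pre-occlusion attributes it
    resembles most; the stored ROI sizes are then placed at the two
    resulting cluster locations.
    """
    cluster_a, cluster_b = [], []
    for idx, (d, e) in enumerate(vertices):
        da = abs(d - ref_a[0]) + abs(e - ref_a[1])
        db = abs(d - ref_b[0]) + abs(e - ref_b[1])
        (cluster_a if da <= db else cluster_b).append(idx)
    return cluster_a, cluster_b
```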

Fig. 4. Illustration of reconstructing the ROIs of two objects ($O^{f}_{1}$, $O^{f}_{2}$) before, at the start of, and during occlusion: dashed rectangles are the ROIs of the encapsulated subgraphs, and the ROI memory stores the ROI sizes and attributes prior to occlusion.

6. EXPERIMENTAL RESULTS

We use three test video sequences for the experiments: the Speedway, PETS2001, and Shinji sequences of 352×288, 384×288, and 320×240 pixel resolution, respectively. All the test sequences are encoded by the H.264/AVC reference software Joint Model 15.1 [10] with quantization parameter 32 in the Baseline profile. The simulation platform for the experiments is a PC with a 2.4 GHz CPU and 2 GB RAM.

Fig. 5 shows the tracking results of our proposed method with a superimposed snapshot of five Speedway sequence frames taken every ten frames. The detected object regions, as ROIs, are shown in the superimposed snapshot as rectangular boxes. For better visibility, simple brightness and contrast adjustments were made on the superimposed snapshot. From the superimposed ROIs, we can observe the speeds of the two detected objects and the size changes of their respective ROIs. When an object moves fast, as for Object 1, the motion displacement becomes large; the ROI approximation is therefore less accurate and includes non-object area. On the other hand, when an object moves slowly, as for Object 2, the ROI approximation becomes quite accurate, tightly encompassing the object region. In general, the proposed method can detect and localize objects of small size such as Object 1, as shown in Fig. 5.

Fig. 5. Snapshot of five superimposed frames from the Speedway sequence.

Fig. 6 shows a superimposed snapshot of five PETS2001 sequence frames taken every five frames, for which the proposed method also works well regardless of object size.

Fig. 6. Snapshot of five superimposed frames from the PETS2001 sequence.

Fig. 7 shows a series of snapshots of PETS2001 sequence frames taken every ten frames, highlighting the detection and tracking performance of the proposed method under occlusion.

Fig. 7. A series of snapshots of PETS2001 sequence frames during the occlusion of two objects.



The rectangular boxes in the snapshots indicate the ROI regions of the moving objects detected and identified by the proposed method. Object 1 (a person, marked as 1 in the red rectangular box) and Object 2 (a car, marked as 2 in the green rectangular box) are separate in the first snapshot. As shown in the subsequent snapshots, Object 1 is occluded by Object 2 in the third and fourth snapshots, and the two are then successfully detected and identified as separate moving objects in the fifth and sixth snapshots. Although there are several frames where the ROI of Object 2 is larger than the real object size, the rectangular boxes as ROIs are visually acceptable for distinguishing the different moving objects.

Fig. 8 shows a superimposed snapshot of five Shinji sequence frames taken every forty frames. Most of the ROIs are obtained by projection of the vertices of the spatio-temporal graph, since blocks are missing due to zero motion vectors and/or no residues. As the object moves closer to the camera, the real object and the ROI grow in size. The proposed method successfully detects and locates the moving object under this change in size, as shown in Fig. 8.

Fig. 8. Snapshot of five superimposed frames from the Shinji sequence.

    7. CONCLUSIONS

We have presented a graph-based method for detecting and tracking moving objects in the H.264/AVC bitstream domain by constructing a spatio-temporal graph from the detected blocks with non-zero motion vectors and/or non-zero residues. The detected blocks are clustered into block groups, and the block groups are represented as subgraphs which constitute a spatial graph in each frame. The temporal connections between the spatial graphs of two frames create a spatio-temporal graph in which the edge between two super-vertices represents the correspondence of the same object in the two frames. The spatial graph enables the representation of moving objects in each frame, even for objects of small size, and the identification of ROIs for the detected objects during occlusion. The spatio-temporal graph can be used to recognize whether the detected blocks are real or false moving objects, based on the edge weights between super-vertices, by graph pruning. The spatio-temporal graph also enables objects of interest to be accurately identified from frame to frame, even when the detected objects are occluded, and to be detected and tracked under changes in size.

    8. REFERENCES

[1] W. You, M. S. H. Sabirin, and M. Kim, "Moving Object Tracking in H.264/AVC Bitstream," in Multimedia Content Analysis and Mining 2007, N. Sebe, Y. Liu, Y. Zhuang, and T. Huang (Eds.), Springer-Verlag, Berlin, Heidelberg, pp. 483-492.

[2] W. You, M. S. H. Sabirin, and M. Kim, "Real-time detection and tracking of multiple objects with partial decoding in H.264/AVC bitstream domain," in Real-Time Image and Video Processing 2009, vol. 7244, no. 1, 2009, 72440D.

[3] C. Käs, M. Brulin, H. Nicolas, and C. Maillet, "Compressed domain aided analysis of traffic surveillance videos," in Distributed Smart Cameras (ICDSC 2009), Third ACM/IEEE International Conference on, pp. 1-8, Aug. 30-Sept. 2, 2009.

[4] C. Poppe, S. De Bruyne, T. Paridaens, P. Lambert, and R. Van de Walle, "Moving object detection in the H.264/AVC compressed domain for video surveillance applications," J. Vis. Commun. Image Represent., vol. 20, no. 6, pp. 428-437, Aug. 2009.

[5] S. K. Kapotas and A. N. Skodras, "Moving object detection in the H.264 compressed domain," in Imaging Systems and Techniques (IST), 2010 IEEE International Conference on, pp. 325-328, July 1-2, 2010.

[6] C. Käs and H. Nicolas, "An Approach to Trajectory Estimation of Moving Objects in the H.264 Compressed Domain," in Proceedings of the 3rd Pacific Rim Symposium on Advances in Image and Video Technology (PSIVT '09), T. Wada, F. Huang, and S. Lin (Eds.), Springer-Verlag, Berlin, Heidelberg, pp. 318-329.

[7] J. Mooser, S. You, and U. Neumann, "Real-Time Object Tracking for Augmented Reality Combining Graph Cuts and Optical Flow," in Mixed and Augmented Reality (ISMAR 2007), 6th IEEE and ACM International Symposium on, pp. 145-152, Nov. 13-16, 2007.

[8] Z. Guanling, W. Yuping, and D. Nanping, "Graph based visual object tracking," in Computing, Communication, Control, and Management (CCCM 2009), ISECS International Colloquium on, vol. 1, pp. 99-102, Aug. 8-9, 2009.

[9] V. Pallavi, J. Mukherjee, A. K. Majumdar, and S. Sural, "Graph-Based Multiplayer Detection and Tracking in Broadcast Soccer Videos," IEEE Transactions on Multimedia, vol. 10, no. 5, pp. 794-805, Aug. 2008.


[10] Dolby Laboratories Inc., Fraunhofer-Institute HHI, and Microsoft Corporation, "H.264/14496-10 AVC Reference Software," http://iphome.hhi.de/suehring/tml/.