
Pattern Recognition 33 (2000) 725–740

A region-level motion-based graph representation and labeling for tracking a spatial image partition

Marc Gelgon a,b,*, Patrick Bouthemy a

a IRISA/INRIA, Campus universitaire de Beaulieu, 35042 Rennes Cedex, France
b Nokia Research Center, Tampere, Finland

Received 15 March 1999

Abstract

This paper addresses two image sequence analysis issues under a common framework. These tasks are, first, motion-based segmentation and, second, updating and tracking over time of a spatial partition of an image. By spatial partition, we mean that constituent regions satisfy an intensity, color or texture-based homogeneity criterion. Several issues in dynamic scene analysis or in image sequence coding can motivate this kind of development. A general-purpose methodology involving a region-level motion-based graph representation of the partition is presented. This graph is built from the topology of the spatial segmentation map. A statistical motion-based labeling of its nodes is carried out and formalized within a Markovian approach. Groups of spatial regions with consistent motion are identified using this labeling framework, leading to a motion-based segmentation that is useful both in itself and for propagating the spatial partition over time. Results on synthetic and real-world image sequences are shown, and provide a validation of the proposed approach. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Image sequence analysis; Motion-based segmentation; Partition tracking; Markov random fields

1. Introduction and related work

Image segmentation, regardless of the segmentation criterion, is among the most fundamental tasks faced in image analysis. The problem of performing this segmentation on a whole set of successive frames is also frequently met. In this paper, we tackle two problems under a common framework. First, we aim at motion-based segmentation of an image sequence [1]. Second, we address the problem of updating and tracking spatial image partitions over time [2]. By image partition, we mean a set of disjoint regions, the union of which forms the image. This partition can result from an intensity-based, color-based or texture-based segmentation.

This study is supported in part by DGA (Délégation Générale pour l'Armement, French Ministry of Defense) through a student grant.

* Corresponding author. Nokia Research Centre, Tampere, Finland.

E-mail addresses: [email protected] (M. Gelgon), [email protected] (P. Bouthemy)

Studies in motion analysis have shown that motion-based segmentation would benefit from including not only motion, but also the intensity cue, particularly to retrieve region boundaries accurately. Hence, knowledge of the spatial partition can improve the reliability of the motion-based segmentation. Conversely, if the motion-based partition of an image is recovered and properly exploited, temporal tracking of a spatial partition of this image can be done in a more efficient way than if spatial regions were tracked individually. As a consequence of these two remarks, we propose a scheme that builds both a spatial and a motion-based partition of an image, and that tracks both of them over time. Depending on the application goal, the output partition relevant for the user may be either the spatial partition or the motion-based one.

Such a scheme requires the construction of a relevant structure exploiting the motion information which relates two successive image partitions. This paper mainly focuses on this stage, which consists of a region-level motion-based graph representation and labeling. To this end, region-level contextual information has to be formalized and exploited. The introduction of a region-level motion-based valued graph is proposed. The application presented here is concerned with texture-based segmentation in infra-red image sequences as well as grey-level and color segmentation in the visible domain, with motion as inter-frame transformation. Besides, such an updating-tracking scheme can itself facilitate the determination of the spatial partition map at each instant, in terms of quality of results and saving of computational time, by providing an appropriate prediction step.

0031-3203/00/$20.00 © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved. PII: S0031-3203(99)00083-7

Several issues can motivate this kind of development. For instance, in a surveillance task, extracting small moving objects is easier within spatially homogeneous tracked regions than directly from the image. In this case, the sequence of spatial partitions is the relevant output. In other cases, it is the motion-based regions that form meaningful entities in terms of content understanding. The results we present include these two cases. Besides interpretation purposes, object-based coding applications can benefit from such an achievement. The expanding field of content-based video indexing may also be targeted, as will be mentioned in the conclusion.

We now review some previous approaches concerned with segmentation taking both intensity and motion into account, with region grouping, or with region tracking. Taking both intensity and motion information into account in segmentation procedures is, among other reasons, motivated by the ability of intensity cues to locate boundaries accurately and to cope with image areas with poor intensity gradient information. These are often shortcomings for segmentation exploiting only motion information. On the other hand, motion-based segmentation generally leads to a semantic description of the image, involving fewer and often more significant regions than a spatial segmentation. In several approaches, intensity is involved at pixel level through a spatial segmentation, providing a set of regions that are handled by a region-level motion-based scheme. In [3–6], a spatial segmentation stage is followed by a motion-based region-merging phase. In [3], regions are grouped by iterating estimation of the dominant motion and grouping of regions that conform to that motion, while in [4], a k-medoid clustering algorithm is used. Other methods involve, in contrast, motion-based intermediate regions. A variety of methods have been proposed in this direction, generally carrying out region grouping also on a motion-based criterion. A k-means clustering algorithm in motion parameter space was used in [7]. With clustering methods in particular, determination of the number of clusters is a key issue. This problem was addressed in [8] with an MDL-based approach. An explicit region-level merging procedure has been embedded in a Markovian framework in [9,10]. Differing from two-level approaches, [11] proposes to perform spatial segmentation as an iterative process operating with progressive graphs. Given the partition at the current iteration, the adjacency graph is built and labeled on a spatial criterion, using stochastic dynamics and exploiting the desired connectivity of regions to reduce the space to be searched. The labeled graph provides an initial partition for the next iteration. By this means, the final partition displays both accurate boundaries and a reduced number of regions. Another possibility is to introduce spatial and motion information both at pixel level. In [12], both types of constraints, along with geometrical ones, are included in the same energy function in a Markovian–Bayesian scheme.

The handling over time of image segmentation maps has already received some attention. Short-term approaches were chosen for motion-based segmentation in [13], and in [5] for grey-level segmentation. A longer-term view is introduced in [14], where temporal integration of frames is achieved by recursive registration of successive images. However, no formal tracking stage was introduced in these works. Occlusions and crossings have been coped with in [15], but considering only a small number of regions, and by tracking them independently. A 2D mesh model of an object of interest was employed in [16] to track its motion, intensity and boundary.

The hierarchical region segmentation schemes discussed above are often applied to the segmentation of two frames only [4,8]. When applied to a whole sequence, in [3,5] only the first pair of frames is concerned with the layer of intermediate regions. Whereas this can be justified since in [3,5] it is the motion-based partition which is the output partition of interest, the problem stated here requires an intermediate spatial region partition for all images. Besides, some of these methods are not incremental, i.e. they cannot straightforwardly benefit, at t+1, from the segmentation map obtained at t. The method presented here makes a contribution in this direction, because it is incremental at both levels of the hierarchy (spatial and motion partitions).

This paper is organized as follows. Section 2 introduces the proposed tracking algorithm and its advantages over an elementary tracking method. Several alternative spatial segmentation schemes are described in Section 3. Section 4 details the central stage of the approach, i.e., how a motion-based graph representation of the image can be built. The use of this graph for tracking is the scope of Section 5. Results obtained both on a patch of natural textures undergoing synthetic motions, and on various real-world image sequences are presented in Section 6. Finally, Section 7 contains concluding remarks.


Fig. 1. The various structures involved in the method and how they are related to one another.

2. Principles of the proposed approach

We present in this section the main features of our original method for motion-based segmentation and partition tracking. Fig. 1 gives an overview of its general framework. A spatial partition for the first frame of the image sequence is first required. It can be part of the given input data, or built from the statistical Markovian segmentation we propose here, which relies on texture, grey-level or color as possible alternative criteria. A spatial region graph is then derived from the spatial image partition $P$ (region adjacency graph, Fig. 1b). The nodes of this graph are then considered as sites of a region-level Markov random field (MRF), and are assigned motion labels using a statistical regularization scheme. A 2D motion model is estimated within each region, and the optimal motion label configuration is sought using an energy minimization approach, such that regions undergoing similar (resp. different) motions are given the same (resp. different) labels (Fig. 1c). This label map is considered in turn as a region-level and motion-based partition $P_m$, from which a second graph (tracking graph) is derived. This graph, valued by motion information measured on the resulting regions (Fig. 1e), is the one used for tracking. Mechanisms are included for building predicted label maps at both pixel and region level. These predictions also exploit temporal filtering of measured motion parameters. The predicted label maps provide initial label configurations at their respective levels that are close to the optimal configurations, which ensures a fast energy minimization step.

Fig. 2. Diagram of an elementary short-term tracking scheme, relying on alternate partition prediction and updating phases.

The comparison between our algorithm and an elementary (or short-term) tracking scheme (sketched in Fig. 2) highlights its advantages. The latter, such as the one we defined in Ref. [13], relies on alternate image partition prediction and updating phases. Given a partition determined at time t, motion is estimated within each region; a predicted segmentation map is then projected at t+1 using the estimated affine motion model in each region, then updated.

With this simple short-term tracking scheme, the initial label configuration used for the energy minimization process delivering the image segmentation at time t+1, and supplied by means of the described prediction technique, is generally close to the optimal one. It enables the deterministic relaxation step, performing the minimization of the considered energy function, to converge quickly to an adequate local minimum, i.e. to update the partition in a satisfactory way. Moreover, provided no special event (like occlusion or crossing) occurs, the label associated with each region remains identical from t to t+1 in a straightforward manner. Yet, occlusions are not handled since such a scheme has no "memory". If a region disappears and then reappears, it will be assigned a new, different label. Furthermore, this method does not involve region-level contextual information. This last point causes the algorithm to under-perform, for instance, in the case of a region where motion cannot be accurately measured or in the case of oversegmentation. In addition, the global evolution of the partition structure is not accounted for. While keeping the advantages of the simple tracking scheme, our approach brings in new benefits. Firstly, in poorly textured areas, insufficient intensity gradient information is available for the differential motion estimation method to supply accurate motion estimates. Let us recall that this was not such a critical issue in Ref. [13], since the pixel-level segmentation criterion was not texture, grey-level or color but directly motion. In this case as in general, if several regions can be jointly considered for motion estimation, they are likely to provide more intensity gradient information, which helps deliver more accurate motion estimates. The use of polynomial motion models of higher degree than affine models may also become feasible in that case. Secondly, as far as long-term tracking is concerned, involving region representation, recursive filtering and an explicit formalized temporal evolution model, the tracking graph structure is much simpler than the spatial region graph, while retaining all the useful information.

Another important feature of our method is the following. In contrast with approaches that carry out a progressive and irreversible simplification of the partition topology through merging, the labeling approach presented here keeps track of the spatial regions composing a motion-based grouping, so that nodes that are identically labeled at a given iteration or at a given instant can subsequently be labeled differently during the energy minimization. Also, region-level contextual information can be taken into account through a contribution to the energy function.

The following sections present the different stages of the proposed scheme. This paper does not discuss thorough long-term tracking of regions or their trajectography.

3. Spatial segmentation scheme

For the spatial image segmentation stage, we propose three approaches, depending on the sequences to be processed. The region homogeneity criterion can be texture, grey-level or color. The appropriate choice naturally depends on the nature of the image content and the availability of color. As a general rule, color should be used rather than grey-level.

During the partition updating phase, the number of regions p is determined on-line. We outline here the main features of the spatial segmentation algorithm, indicating its specificities for each homogeneity criterion.

The segmentation method operates within a Bayesian estimation framework. Let $E = \{E_s, s \in S\}$ be the label field defined on the set $S$ of sites $s$ corresponding to the image discrete grid, i.e. sites are pixels. Let $O = \{O_s, s \in S\}$ be the observation field. Let $e = \{e_s, s \in S\}$ (resp. $o = \{o_s, s \in S\}$) be a realization of $E$ (resp. $O$). Given a neighbourhood system, $(E, O)$ is modeled as a Markov random field. The optimal label field $\hat{e}$ is derived according to the maximum a posteriori (MAP) criterion. Owing to the equivalence between Gibbs distributions and MRFs, this optimal label configuration in fact results from $\hat{e} = \arg\min_{e \in \Omega} U(e, o)$, where $\Omega$ is the set of all possible realizations of $E$ and $U(e, o)$ is the so-called energy function encompassing the interactions between labels and observations and prior information on the label field [17].

Let $C$ be the set of all cliques $c$ (a clique is a subset of sites which are mutual neighbors). We use a second-order neighborhood, but only two-site cliques are considered. The energy function $U(e, o)$ is expressed as a sum of two terms, which both break into the sum of local potentials defined on cliques:

$$U(e, o) = U_1(e, o) + U_2(e).$$

3.1. Texture-based data-driven term

$U_1(e, o)$ expresses the relation between the observations at hand and the labels to be determined. It is given by

$$U_1(e, o) = \sum_{s} V_1(o(B_s), o(R_{e_s})), \quad (1)$$

where $V_1(o(B_s), o(R_{e_s}))$ conveys the likelihood of a particular label being assigned to site $s$, given the observations $o$. $o(B_s)$ is the set of observation vectors in a local window centered at $s$, and $o(R_{e_s})$ is the set of observation vectors corresponding to sites currently labeled $e_s$ and forming region $R_{e_s}$. We outline the definition of this potential for texture and color (next subsection). The potential $V_1$ for the grey-level segmentation method is derived by simplification of the color segmentation potential below.

The unsupervised texture segmentation method described in Ref. [18] is employed. No prior information is required about the nature of the textures. In order to select the appropriate set of texture features to build the observation vectors $\{o_s = [o_s^1, \ldots, o_s^m], s \in S\}$, two classes of images are considered. For significantly textured images, like the infrared images of Fig. 6 or the Brodatz texture patchwork of Fig. 5, statistical features extracted from co-occurrence matrices were added to grey-level and variance features. The potential $V_1$ is defined as follows:

$$V_1(o(B_s), o(R_{e_s})) = \sum_{i=1}^{m} \begin{cases} +1 & \text{if } d(o^{(i)}(R_{e_s}), o^{(i)}(B_s)) > a^{(i)}, \\ -1 & \text{if } d(o^{(i)}(R_{e_s}), o^{(i)}(B_s)) < a^{(i)}, \end{cases} \quad (2)$$

where $d(\cdot, \cdot)$ stands for the Kolmogorov–Smirnov distance between the distributions estimated respectively on the local window and on the entire current region $R_{e_s}$, and $i$ indexes the components of the observation vector. Thresholds $a^{(i)}$ are predetermined constants.
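As an illustration of Eq. (2), the following minimal sketch computes a two-sample Kolmogorov–Smirnov statistic per feature and sums the resulting +1/-1 votes. The function names, the array layout (one row per pixel, one column per feature) and the handling of the equality case (counted as -1, since Eq. (2) leaves it unspecified) are assumptions made for this example, not details taken from Ref. [18].

```python
import numpy as np

def ks_distance(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))

def texture_potential(window_feats, region_feats, thresholds):
    """Sketch of Eq. (2): +1 per feature whose KS distance exceeds its threshold, else -1.

    window_feats, region_feats: arrays of shape (n_samples, m), one column per feature.
    thresholds: sequence of m predetermined constants.
    """
    total = 0
    for i, thr in enumerate(thresholds):
        d = ks_distance(region_feats[:, i], window_feats[:, i])
        total += 1 if d > thr else -1  # equality treated as -1 (assumption)
    return total
```

A window drawn from the same distribution as the region yields a negative potential (low energy), so the site tends to keep the region's label.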

3.2. Color-based data-driven term

Designating by (r, g, b) the red, green and blue components of a pixel, we have selected a representation proposed in Ref. [19]. The three selected axes and the quantization used are as follows:

$$rg = r - g \quad \text{(16 quantization levels)}, \quad (3)$$

$$by = 2b - r - g \quad \text{(16 quantization levels)}, \quad (4)$$

$$wg = r + g + b \quad \text{(8 quantization levels)}. \quad (5)$$

The choice of this color space was driven by the satisfactory results obtained relative to its complexity [19], although we are aware that recent studies have proposed more effective representations of color. The introduction of color should not be considered as a major contribution of the paper, but rather as an interesting alternative to grey-level segmentation, with a view to building a motion-based partition. Let $i, j, k$ be three indices on the three chosen color axes, and let us call $C_{i,j,k}(s)$ and $H_{i,j,k}(R_{e_s})$, respectively, the local and global three-dimensional color histograms measured on the above attributes. The potential $V_1$ is defined as follows:

$$V_1(e_s, o_s, o(R_{e_s})) = \sum_{i} \sum_{j} \sum_{k} \left| H_{i,j,k}(R_{e_s}) - C_{i,j,k}(s) \right|^2. \quad (6)$$
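The color term of Eqs. (3)-(6) can be sketched as follows. The uniform bin edges, the 8-bit input range and the normalization of histograms by pixel count are assumptions of this example; the text only fixes the three axes and the number of quantization levels.

```python
import numpy as np

LEVELS = (16, 16, 8)  # quantization levels for rg, by, wg (Eqs. 3-5)

def color_histogram(rgb):
    """Normalized 3D histogram over (rg, by, wg) for an (n, 3) array of 8-bit RGB pixels."""
    rgb = np.asarray(rgb, dtype=float).reshape(-1, 3)
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    axes = np.stack([r - g, 2 * b - r - g, r + g + b], axis=1)
    # Value ranges of the three axes for 8-bit inputs (assumed uniform binning).
    ranges = [(-255, 255), (-510, 510), (0, 765)]
    hist, _ = np.histogramdd(axes, bins=LEVELS, range=ranges)
    return hist / max(len(rgb), 1)

def color_potential(window_rgb, region_rgb):
    """Sketch of Eq. (6): sum of squared differences between the two histograms."""
    h = color_histogram(region_rgb)   # global histogram H of the region
    c = color_histogram(window_rgb)   # local histogram C around the site
    return float(np.sum(np.abs(h - c) ** 2))
```

With normalized histograms the potential lies between 0 (identical color distributions) and 2 (disjoint single-bin distributions).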

3.3. Regularization term

The regularization term $U_2(e)$ reflects the a priori constraint on the label map. We have

$$U_2(e) = \sum_{\langle s,t \rangle \in C} V_2(s, t) \quad \text{where} \quad V_2(s, t) = \mu \, (1 - 2\delta(e_s - e_t)), \quad (7)$$

$\mu$ is a predetermined positive constant, and $\delta(e_s - e_t) = 1$ if $e_s = e_t$, 0 otherwise. By penalizing local configurations where two neighboring labels are different, homogeneous regions are globally favored.
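A small sketch of the regularization energy of Eq. (7) on a rectangular label map is given below. It enumerates the two-site cliques of a second-order (8-neighbor) system; the function name and the parameter name `mu` for the positive constant are hypothetical.

```python
import numpy as np

def regularization_energy(labels, mu=1.0):
    """Sum over two-site cliques of mu * (1 - 2 * [e_s == e_t]) on a 2D label map."""
    e = np.asarray(labels)
    pairs = [
        (e[:, :-1], e[:, 1:]),      # horizontal cliques
        (e[:-1, :], e[1:, :]),      # vertical cliques
        (e[:-1, :-1], e[1:, 1:]),   # diagonal cliques
        (e[:-1, 1:], e[1:, :-1]),   # anti-diagonal cliques
    ]
    energy = 0.0
    for a, b in pairs:
        same = np.count_nonzero(a == b)
        # Each clique contributes -mu when labels agree, +mu when they differ.
        energy += mu * ((a.size - same) - same)
    return energy
```

A uniform map reaches the minimum of this term, which is why the data-driven term is needed to keep distinct regions apart.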

3.4. Energy minimization

Energy minimization is performed using a modified ICM algorithm [20]. A binary stability label is attached to each site, all of which are initially unstable. A site is randomly selected among the unstable sites. The set $\Lambda_s$ of candidate labels that may be assigned to site $s$ includes the labels currently assigned in the neighborhood $\nu(s)$ of site $s$, the current label $e_s$ and an outlier label $\psi$. This last label enables the creation of new regions [13,18]. The local energy evaluation $\Delta U_s$ when considering a candidate label $r$ is given by

• For $r \neq \psi$,

$$\Delta U_s(r) = \sum_{i=1}^{m} \begin{cases} +1 & \text{if } d(o^{(i)}(R(e_s = r)), o^{(i)}(B_s)) > a^{(i)} \\ -1 & \text{if } d(o^{(i)}(R(e_s = r)), o^{(i)}(B_s)) < a^{(i)} \end{cases} + \sum_{c_s} V_2(c_s), \quad (8)$$

where $c_s$ designates the subset of cliques $c$ containing $s$.

• For $r = \psi$, we extend the definition of $\Delta U_s(r)$ as follows:

$$\Delta U_s(\psi) = \sum_{t \in \nu(s)} \mu \, [1 - 2\delta(\psi - e_t)] + \phi. \quad (9)$$

$\hat{r}$, the optimal label among those labels, is supplied by

$$\hat{r} = \arg\min_{r \in \Lambda_s} \Delta U_s(r). \quad (10)$$

Besides, in the case of grey-level or color, a multiscale strategy is employed [21]. Once the relaxation process is completed, new labels are attributed to the connected subsets of sites carrying the outlier label whose size exceeds a preset threshold.

4. Building of a region-level motion-based graph

The central part of the algorithm is now introduced. Given the spatial partition $P = \{R_k, k = 1, \ldots, p\}$, containing $p$ regions, an irregular graph is derived from its topology. We denote it by $G$, the nodes $N_k$ of which correspond to the regions $R_k$ of the spatial partition. Let arcs $A_j$ join in $G$ the nodes associated with adjacent regions in the spatial partition:

$$G = \{\{N_1, \ldots, N_p\}, \{A_1, \ldots, A_q\}\}. \quad (11)$$

We aim at assigning a motion label to every node in the graph, with a view to partitioning this graph into node subsets corresponding to groupings of regions of coherent motion. Each grouping is hence identified by its label. The labeling of the graph is formalized within a Markovian framework. To this purpose, we identify the nodes of the graph with the sites of a region-level MRF. The cliques are deduced in a straightforward manner from the arcs of the graph. Let $\nu = \{\nu_1, \ldots, \nu_p\}$ be the set of sites and $\Gamma = \{c_1, \ldots, c_q\}$ be the set of binary cliques. We now focus on the definition of a suitable energy function for our region grouping objective.
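The graph of Eq. (11) can be derived from a label map along the following lines. This is only a sketch: the choice of 4-connectivity to decide whether two regions are adjacent, and the representation of arcs as sorted pairs of region labels, are assumptions of the example.

```python
import numpy as np

def region_adjacency_graph(partition):
    """Return (nodes, arcs) for a 2D integer label map.

    Nodes are the region labels; an arc (u, v) with u < v is created whenever
    two pixels of regions u and v are horizontal or vertical neighbors.
    """
    p = np.asarray(partition)
    nodes = sorted(int(v) for v in np.unique(p))
    arcs = set()
    for a, b in ((p[:, :-1], p[:, 1:]), (p[:-1, :], p[1:, :])):
        diff = a != b  # pixel pairs that straddle a region boundary
        for u, v in zip(a[diff].tolist(), b[diff].tolist()):
            arcs.add((min(u, v), max(u, v)))
    return nodes, sorted(arcs)
```

Each arc then plays the role of one binary clique of the region-level MRF.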

4.1. Energy function definition and minimization

As in the case of the pixel-level energy function for the spatial segmentation stage, the region-level energy function $U'$ is split up into several terms. It involves an observation-label interaction term, a geometric interaction term and a regularization term. However, the interaction term is here also defined over a binary clique. Binary cliques are chosen because grouping regions according to motion consistency is done by considering pairs of neighboring regions. The energy function is expressed as

$$U'(e', o') = \sum_{c_j \in \Gamma} V'_1(e'(c_j), o'(c_j)) + \sum_{c_j \in \Gamma} V'_2(e'(c_j)) + U'_3(e'), \quad (12)$$

where $e'(c_j)$ stands for the pair of labels attached to the clique $c_j$ ($c_j = \{\nu_k, \nu_{k'}\}$), and $o'$ for the region-level observations, which we will examine below. This is an elegant and flexible way to formalize the merging of regions of similar motions. Potential $V'_1$ will express a discrepancy measure between the two motion model fields attached to the sites $\nu_k$ and $\nu_{k'}$ composing clique $c_j$. $V'_2$ takes into account the geometric degree of adjacency between adjacent regions, and $U'_3$ favours a reduced number of regions. The motion estimation technique and the chosen discrepancy measure are now presented.

4.1.1. Parametric motion estimation

The inter-frame transformation between frame $I_t$ at time $t$ and frame $I_{t+1}$ at time $t+1$ is modeled by a set of 2D affine motion models, one per region. The displacement vector at pixel site $s = (x, y)$ in region $R_k$, whose gravity center is $g_k = (x_g^k, y_g^k)$, is expressed as

$$d_{(\Theta_k)_t^{t+1}}(s) = \begin{pmatrix} a_0^k + a_2^k (x - x_g^k) + a_3^k (y - y_g^k) \\ a_1^k + a_4^k (x - x_g^k) + a_5^k (y - y_g^k) \end{pmatrix}, \quad (13)$$

in which the motion parameter vector $(\Theta_k)_t^{t+1} = (a_0^k, \ldots, a_5^k)$ is estimated on each region $R_k$, $k = 1, \ldots, p$, using the robust multi-resolution estimator described in Ref. [22]. An M-estimator criterion is minimized by means of an iterative reweighted least-squares technique embedded in a multiresolution framework. If $\hat{\Theta}_k^n$ designates the estimate of $\Theta_k$ at iteration $n$, we have $\Theta_k = \hat{\Theta}_k^n + \Delta\Theta_k^n$, and the estimate of the increment $\Delta\Theta_k^n$ is given by

$$(\widehat{\Delta\Theta}_k^n)_t^{t+1} = \arg\min_{\Delta\Theta_k^n} \sum_{s \in R_k(t)} \rho(r(s, \Delta\Theta_k^n)), \quad (14)$$

where

$$r(s, \Delta\Theta_k^n) = I(s + d_{\hat{\Theta}_k}(s), t+1) - I(s, t) + \nabla I(s + d_{\hat{\Theta}_k}(s), t+1) \cdot d_{\Delta\Theta_k}(s), \quad (15)$$

and $\rho(\cdot)$ is Tukey's function. Then, $\hat{\Theta}_k^{n+1} = \hat{\Theta}_k^n + \widehat{\Delta\Theta}_k^n$ and the process is iterated. This method only involves the computation of the spatio-temporal derivatives of the intensity function. An estimate of the covariance matrix associated with the motion parameter vector is also provided.
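To make the affine model of Eq. (13) and the role of the robust function concrete, here is a brief sketch. It evaluates the displacement field of a six-parameter model and gives the standard form of Tukey's biweight function together with the weight it induces in iteratively reweighted least squares. The scale constant `c` and all function names are assumptions of this example; the estimator of Ref. [22] itself (multiresolution scheme, residual computation) is not reproduced.

```python
import numpy as np

def affine_displacement(theta, points, center):
    """Eq. (13): displacement of each (x, y) point for theta = (a0, ..., a5)."""
    a0, a1, a2, a3, a4, a5 = theta
    pts = np.asarray(points, dtype=float)
    dx = pts[:, 0] - center[0]
    dy = pts[:, 1] - center[1]
    return np.stack([a0 + a2 * dx + a3 * dy, a1 + a4 * dx + a5 * dy], axis=1)

def tukey_rho(r, c):
    """Tukey's biweight function: quadratic-like near 0, constant beyond |r| = c."""
    r = np.asarray(r, dtype=float)
    inner = (c**2 / 6.0) * (1.0 - (1.0 - (r / c) ** 2) ** 3)
    return np.where(np.abs(r) <= c, inner, c**2 / 6.0)

def tukey_weight(r, c):
    """IRLS weight rho'(r) / r: residuals beyond c are given zero weight."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= c, (1.0 - (r / c) ** 2) ** 2, 0.0)
```

Pixels whose residual exceeds `c`, for example those belonging to a small object moving differently from the region, receive zero weight and therefore do not bias the estimated parameters.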

4.1.2. Construction of a motion-based distance between regions

Owing to the robustness of the estimator used, our motion measurement is rather insensitive to minor errors in region border determination and to secondary motions due to small mobile objects, if any, within the region. A possible way of comparing the motions of two regions involves estimating a motion model on the union of the regions. This can provide valuable information, but induces a combinatorial computational cost.

In order to characterize the difference between the estimated motions within two neighbouring regions $R_k$ and $R_{k'}$, we prefer considering the two motion fields issued from the motion models estimated within each region. Let us note that, in doing so, we do not resort to "displaced frame difference"-type criteria. We extend these fields over the support corresponding to the union of the two regions. The discrepancy between these two extended fields, denoted by $D(c_j)$, is expressed as the average, over the union of the two regions, of a weighted distance $\epsilon$ between the velocity vectors that form these fields:

$$D(c_j) = \frac{1}{\mathrm{card}(R_k \cup R_{k'})} \sum_{s \in (R_k \cup R_{k'})} \epsilon(d_{\Theta_k}(s), d_{\Theta_{k'}}(s)). \quad (16)$$
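Eq. (16) can be illustrated with the sketch below, which extends two affine fields over the union of the two regions and averages a per-pixel distance. Since the exact weighting of the distance is not specified here, the example simply uses the Euclidean norm of the vector difference; that choice, like the function names, is an assumption.

```python
import numpy as np

def affine_field(theta, points, center):
    """Affine displacement field of Eq. (13) evaluated at the given (x, y) points."""
    a0, a1, a2, a3, a4, a5 = theta
    dx, dy = points[:, 0] - center[0], points[:, 1] - center[1]
    return np.stack([a0 + a2 * dx + a3 * dy, a1 + a4 * dx + a5 * dy], axis=1)

def motion_discrepancy(theta_k, center_k, theta_l, center_l, union_points):
    """Sketch of Eq. (16): mean distance between two motion fields over the union.

    union_points: (n, 2) array of pixel coordinates of both regions together.
    The per-pixel distance is the plain Euclidean norm (assumption).
    """
    pts = np.asarray(union_points, dtype=float)
    diff = affine_field(theta_k, pts, center_k) - affine_field(theta_l, pts, center_l)
    return float(np.mean(np.linalg.norm(diff, axis=1)))
```

Because only the two already estimated models are evaluated, no additional motion estimation on the union of regions is required.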

$V'_1$ aims at assigning identical motion labels to nodes when the attached motions are similar, and different labels when motions are strongly different. A binary penalty value resulting from a test on the hypothesis that two estimated motions correspond to identical underlying motions is defined in Ref. [10]. In contrast, a progressive transition is introduced here, and potential $V'_1$ is defined as in relation (17). This function is plotted in Fig. 3; it is expressed as follows:

$$V'_1(e'(c_j), o'(c_j)) = \begin{cases} \dfrac{1}{1 + e^{\kappa\tau\,(D(c_j) - \tau)}} & \text{if } e'_k = e'_{k'}, \\[2ex] 1 - \dfrac{1}{1 + e^{\kappa\tau\,(D(c_j) - \tau)}} & \text{if } e'_k \neq e'_{k'}. \end{cases} \quad (17)$$

Fig. 3. $V'_1$ potential as a function of the difference $D(c_j)$ between the two motion model fields, for identical labels and different labels. Parameter values: $\tau = 2$ and $\kappa = 7$.

4.1.3. Regularization terms

$V'_2$ corresponds to the regularization term. To take into account the "degree" of adjacency between two regions, two geometrical features are computed per region pair $(R_k, R_{k'})$: the length of the common border, denoted by $\xi_{k,k'}$, and the distance between the region gravity centers (Fig. 4). They are combined into a geometrical "compacity factor" $\eta_{k,k'}$ of the region pair:

$$\eta_{k,k'} = \frac{\xi_{k,k'}}{\xi_{k,k'} + \|g_k - g_{k'}\|_2}. \quad (18)$$

This factor takes part in the definition of the potential $V'_2$:

$$V'_2(e'(c_j)) = \begin{cases} -\beta \, \eta_{k,k'}, \ \beta > 0 & \text{if } e'_k = e'_{k'}, \\ 0 & \text{if } e'_k \neq e'_{k'}. \end{cases} \quad (19)$$
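The compacity factor and the associated potential of Eqs. (18) and (19) reduce to a few lines. In this sketch the distance between gravity centers is taken as a plain (non-squared) Euclidean distance, which keeps the ratio dimensionless; this reading, and the names used, are assumptions.

```python
import math

def compacity(border_length, center_k, center_l):
    """Eq. (18): common border length over (border length + distance between centers)."""
    dist = math.dist(center_k, center_l)
    return border_length / (border_length + dist)

def v2_prime(label_k, label_l, factor, beta):
    """Eq. (19): reward (negative energy) for giving the same label to compact pairs."""
    return -beta * factor if label_k == label_l else 0.0
```

The factor tends to 1 for two regions sharing a long border relative to the distance between their centers, so such pairs are more strongly encouraged to share a motion label.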

Fig. 4. Measure of the adjacency degree between two neighboring regions $R_k$ and $R_{k'}$, based on the length $\xi_{k,k'}$ of the common boundary and the distance between the gravity centers.

The third energy term expresses a prior constraint on the partition structure, consisting of a penalty proportional to the number of motion-based region groups. Denoting by $E$ the set of all different labels assigned to nodes, $\#E$ the number of elements in this set, and setting a constant $\lambda$, this energy term is defined as follows:

$$U'_3(e') = \lambda \cdot \#E. \quad (20)$$

The prior introduced here is that there should be few regions. The idea that, generally, the fewer the models needed to explain the data the better is discussed extensively in Ref. [23]. This term does not affect adjustment of motion boundaries, but attempts to reduce the appearance of spurious motion-based regions composed of a single spatial region.

The relatively small number of regions allows us to utilize an energy minimization technique based on the HCF method [20]. During the labeling process, motion models are not re-estimated on groups, but estimated per spatial region once and for all at the start. For the first frame of the sequence, all regions are initially given different motion labels. Sites are visited according to their rank in an instability stack [20]. Candidate labels at a given site include the current label at this site and the labels currently assigned to the neighbor sites. An extraneous label is also proposed. For each candidate label, the local energy variation involved is computed. For the extraneous label, the potentials defined in Eqs. (17) and (19) are calculated considering $e'_k \neq e'_{k'}$ (the computation of these potentials does not require knowledge of the precise labels). The label giving rise to the largest decrease in local energy is then selected. The addition of the extraneous label to the list of candidate labels makes possible a correct on-line determination of the number of relevant motion entities. We arbitrarily chose to label disconnected site subsets with different labels. If necessary, this choice can easily be switched on or off.

According to the final label values, the labeled graph is partitioned into subsets of identically labeled nodes. A second graph $G_m$, called the "tracking graph", can be deduced from this partition. A node in $G_m$ is associated with each subset and, if at least one arc in $G$ joins the two subsets, then the two corresponding nodes in $G_m$ are linked by an arc. The next section describes how the partition tracking stage can exploit this graph $G_m$.
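The derivation of the tracking graph from the labeled region graph can be sketched as follows. The data layout (arcs as pairs of region identifiers, motion labels in a dictionary keyed by region) is an assumption of the example, which also relies on each motion label designating one connected subset of nodes.

```python
def tracking_graph(arcs, motion_label):
    """Build (nodes, links) of the tracking graph from the labeled region graph.

    arcs: iterable of (u, v) pairs of adjacent spatial regions.
    motion_label: dict mapping each spatial region to its motion label.
    """
    nodes = sorted(set(motion_label.values()))
    links = set()
    for u, v in arcs:
        a, b = motion_label[u], motion_label[v]
        if a != b:
            # At least one arc of the region graph joins the two subsets.
            links.add((min(a, b), max(a, b)))
    return nodes, sorted(links)
```

The resulting graph has one node per motion-based grouping, so it is typically much smaller than the region graph it is derived from.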

5. Partition tracking using the graph $G_m$

Tracking of spatial regions aims first at establishing a correspondence between these regions in successive frames. It can also increase the reliability, efficiency and consistency of features attached to the regions to be tracked, such as geometry and motion. To this end, the tracking graph is introduced. Its purpose is two-fold. First, it makes it possible to maintain label consistency over time, both for pixel-level and region-level labeling. Second, it may improve the reliability of motion estimates through temporal recursive filtering.

5.1. Region-level label map prediction

We first examine how $G$ can be predicted at $t+1$. We seek to build a relevant label configuration to initialize the motion-based region-level relaxation at $t+1$. Given a spatial partition $P_t$ at time $t$, the spatial partition $P_{t+1}$ at $t+1$ can be split into two subsets. Let $P_{t+1|t}$ include the spatial regions that already exist in $P_t$, and $\bar{P}_{t+1|t}$ include the spatial regions that emerged at $t+1$. We have $P_{t+1} = P_{t+1|t} \cup \bar{P}_{t+1|t}$.

Prior to motion model estimation at $t+1$, no information is available to favor any particular labeling for spatial regions created at $t+1$. We hence attach a new initial motion label to the corresponding nodes. On the other hand, the prior belief that the motion-based region grouping should be maintained from $t$ to $t+1$ suggests that, for regions that survive from $t$ to $t+1$, node labels have to be initialized at $t+1$ with the label obtained at $t$. If we denote by $\tilde{e}'_{k,t+1}$ the predicted label attached to the node corresponding to region $R_k$,

$$\tilde{e}'_{k,t+1} = \begin{cases} \tilde{e}'_{k,t} & \text{if } R_k \in P_{t+1|t}, \\ \text{a new label} & \text{if } R_k \in \bar{P}_{t+1|t}. \end{cases} \quad (21)$$
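The initialization rule of Eq. (21) amounts to the following sketch: surviving regions inherit their previous motion label and newly created regions each receive a fresh one. Representing labels as integers and drawing fresh labels above the current maximum are assumptions of this example.

```python
from itertools import count

def predict_motion_labels(prev_labels, regions_next):
    """Eq. (21): keep the label of surviving regions, give new regions a new label.

    prev_labels: dict mapping region identifier to its motion label at time t.
    regions_next: iterable of region identifiers present at time t+1.
    """
    fresh = count(max(prev_labels.values(), default=-1) + 1)
    return {
        k: prev_labels[k] if k in prev_labels else next(fresh)
        for k in regions_next
    }
```

This configuration only initializes the relaxation at t+1; the groupings remain free to change during energy minimization.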

Owing to the scheme defined in Section 4, in which spatial regions are not irreversibly merged, the groupings determined for a given frame can be called into question for the next one. The label configuration is then appropriately updated during the energy minimization step, with the number of motion-based regions being updated jointly.

Affine motion models are then estimated on each motion-based region group. We consider the temporal evolution of the parameters of the motion models as a stochastic process. For each region grouping $R'_n$, the six estimated motion parameters are considered uncorrelated and supply the measurements to six independent Kalman filters. A first-order derivative temporal evolution model is selected here, as in Ref. [15] for a similar usage. The evolution of the state of a motion parameter $a_i$ can be approximated by the following dynamic system:

$$
\begin{bmatrix} a_i \\ \dot a_i \end{bmatrix}(t+1) =
\begin{bmatrix} 1 & dt \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} a_i \\ \dot a_i \end{bmatrix}(t) +
\begin{bmatrix} e_1 \\ e_2 \end{bmatrix},
\tag{22}
$$

$$
a_i^{\text{measured}}(t) = a_i(t) + f(t).
\tag{23}
$$

Process noise is modeled by the zero-mean Gaussian vector $(e_1, e_2)^T$. This vector is characterized by its covariance matrix $R$, hence by the variance $\sigma_R^2$:

$$
R = \sigma_R^2
\begin{bmatrix} dt^3/3 & dt^2/2 \\ dt^2/2 & dt \end{bmatrix}.
\tag{24}
$$

Measurement noise $f(t)$, also modeled as zero-mean Gaussian noise, is characterized by its variance $\sigma_f^2$.
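As an illustration of Eqs. (22)-(24), the following Python sketch runs one scalar Kalman filter of this kind on a single motion parameter. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `process_noise` and `kalman_step`, the unit time step, and the initial state and covariance are hypothetical, and both variances are set to 0.01 as in the experiments reported in Section 7.

```python
def process_noise(dt, var_r):
    """Process noise covariance of Eq. (24) for time step dt."""
    return [[var_r * dt ** 3 / 3, var_r * dt ** 2 / 2],
            [var_r * dt ** 2 / 2, var_r * dt]]


def kalman_step(x, P, z, dt=1.0, var_r=0.01, var_f=0.01):
    """One predict/update cycle for the state x = [a, a_dot].

    Returns the filtered state, its covariance and the predicted parameter
    value, which could initialize the motion estimator on the next frame.
    """
    a, a_dot = x
    Q = process_noise(dt, var_r)
    # Prediction: x_pred = F x and P_pred = F P F^T + Q, F = [[1, dt], [0, 1]].
    a_pred = a + dt * a_dot
    p00 = P[0][0] + dt * (P[0][1] + P[1][0]) + dt * dt * P[1][1] + Q[0][0]
    p01 = P[0][1] + dt * P[1][1] + Q[0][1]
    p10 = P[1][0] + dt * P[1][1] + Q[1][0]
    p11 = P[1][1] + Q[1][1]
    # Update with the scalar measurement of Eq. (23), i.e. H = [1, 0].
    s = p00 + var_f                      # innovation variance
    k0, k1 = p00 / s, p10 / s            # Kalman gain
    innovation = z - a_pred
    x_new = [a_pred + k0 * innovation, a_dot + k1 * innovation]
    P_new = [[(1 - k0) * p00, (1 - k0) * p01],
             [p10 - k1 * p00, p11 - k1 * p01]]
    return x_new, P_new, a_pred


# Assumed initial state and covariance; measurements follow a noise-free ramp.
x, P = [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]
for t in range(20):
    x, P, a_pred = kalman_step(x, P, z=0.5 * t)
print(x)
```

Six such filters, one per affine parameter, would be run independently for each region grouping, in line with the assumption that the parameters are uncorrelated.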


At time $t$, the Kalman filter provides a predicted motion model at $t+1$ that can serve as an initial value for the motion estimator (measurement) at $t+1$ on coherent region groups (Fig. 1e). The state vector is initialized from the first three measurements. The predicted and then filtered parameters of the recursive filters at a given node of $G_m$ can be identically attributed to all nodes of G in the corresponding node subset. The filtered parameter vectors are passed on to their respective regions in the spatial partition, in order to provide a prediction for this spatial partition at $t+1$.

5.2. Pixel-level label map prediction

We now explain the spatial partition prediction technique. Let $\hat\Theta^{t|t}_{k}$ stand for the filtered motion parameter vector from $t$ to $t+1$. Given the estimated spatial label map $\hat e_t$ at time $t$, the predicted spatial label field $\tilde e_{t+1}$ is derived from a motion-oriented propagation of labels [13]:

$$
\tilde e_{t+1}(s) = e_t\Big(s + d^{\,t+1}_{t,\,\hat\Theta^{t|t}_{e_s}}(s)\Big).
\tag{25}
$$

Since $s + d_{\Theta}(s)$ usually points to a location with non-integer coordinates, the label is assigned to the four nearest sites on the image grid. Sites that receive no label or multiple labels are respectively assigned "uncovered" and "occlusion" labels. Both labels are considered neutral by the relaxation algorithm. The numbers of iterations required in the pixel-level and region-level energy minimization steps are automatically determined by the algorithm and depend on the complexity of both partitions.
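The propagation step described above can be sketched in Python as follows. This is an illustrative sketch only: the function name `propagate_labels`, the numeric codes chosen for the two neutral labels, and the representation of each motion model as a callable returning a displacement are assumptions, and "multiple labels" is interpreted here as several distinct labels landing on the same site.

```python
import math

UNCOVERED, OCCLUSION = -1, -2   # hypothetical codes for the neutral labels


def propagate_labels(label_map, models):
    """Project each label along the motion of its region (cf. Eq. (25)).

    label_map : 2D list of region labels at time t
    models    : dict mapping label -> function (x, y) -> (dx, dy), standing in
                for the filtered motion model of the region carrying that label
    """
    h, w = len(label_map), len(label_map[0])
    received = [[set() for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            lab = label_map[y][x]
            dx, dy = models[lab](x, y)
            xf, yf = x + dx, y + dy
            # A non-integer target hands the label to its nearest grid sites
            # (four of them when both coordinates are non-integer).
            for yy in {math.floor(yf), math.ceil(yf)}:
                for xx in {math.floor(xf), math.ceil(xf)}:
                    if 0 <= yy < h and 0 <= xx < w:
                        received[yy][xx].add(lab)
    predicted = []
    for row in received:
        out = []
        for labs in row:
            if not labs:
                out.append(UNCOVERED)     # no label arrived at this site
            elif len(labs) > 1:
                out.append(OCCLUSION)     # competing labels arrived
            else:
                out.append(next(iter(labs)))
        predicted.append(out)
    return predicted


# Region 0 is static; region 1 moves one pixel to the left.
label_map = [[0, 0, 0, 1, 1, 1]]
models = {0: lambda x, y: (0.0, 0.0), 1: lambda x, y: (-1.0, 0.0)}
predicted = propagate_labels(label_map, models)
print(predicted)
```

In this toy case the site where both regions land is marked as occlusion and the site vacated by region 1 as uncovered, leaving the relaxation step to decide their final labels.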

6. Extension to multiple motion models per spatial region

Though there is in general some spatial intensity gradient, or texture variation, on motion boundaries, it may not always be significant enough to give rise to a spatial boundary. This is illustrated in Fig. 8, in which the woman on the right is swinging her left arm upwards. Spatial segmentation alone seems inappropriate to retrieve the arm. As a result, a single motion descriptor cannot correctly describe the apparent motion, and the partition prediction is inaccurate in this particular area. A single motion model may also be inadequate when the complexity of the apparent motion, due for instance to significant depth variation, depth discontinuities or non-rigid motions, is beyond the descriptive ability of the 2D affine motion model used here.

We propose to alleviate this issue as follows. On every spatial region $R_k$, we detect sub-regions that do not conform to the estimated motion model $\hat\Theta_k$, using the Markovian multiscale technique described in Ref. [24]. Only sub-regions of a significant size are kept.

The graph G must thus be built upon the topology of the spatial partition augmented with these additional detected sub-regions, each of which is given its own node. For each region $R_k$, there may exist one or several non-conforming sub-regions $R^{nc}_{k,z}$, where $z$ is the sub-region index and $nc$ is a superscript denoting non-conforming. A motion model $\hat\Theta_{k,z}$ is estimated on every sub-region. Node labeling proceeds as presented in Section 4. Motion models can then be estimated on coherent region sets. Let $s$ be a pixel site in $R_k$, $R_k$ being composed of several sub-regions: one sub-region $R^{c}_{k}$ conforming to the dominant motion in $R_k$, and possibly several sub-regions $R^{nc}_{k,z}$, non-conforming to this motion model. The motion-oriented propagation of labels in $R_k$ simply takes into account which sub-region $s$ belongs to:

$$
\tilde e_{t+1}(s) =
\begin{cases}
e_t\Big(s + d^{\,t+1}_{t,\,\hat\Theta^{t|t}_{R_k}}(s)\Big) & \text{if } s \in R^{c}_{k},\\[4pt]
e_t\Big(s + d^{\,t+1}_{t,\,\hat\Theta^{t|t}_{R^{nc}_{k,z}}}(s)\Big) & \text{if } s \in R^{nc}_{k,z}.
\end{cases}
$$
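The case distinction above amounts to choosing, for each site, the motion model of the sub-region that contains it. A minimal Python sketch follows; the name `model_for_site`, the set-based representation of sub-regions and the use of pure translations as motion models are all assumptions made to keep the example short.

```python
def model_for_site(site, dominant_model, nonconforming):
    """Select the motion model used to propagate the label at `site`.

    dominant_model : model estimated on the conforming part of the region
    nonconforming  : list of (sites, model) pairs, one per non-conforming
                     sub-region, where `sites` is a set of (x, y) tuples
    """
    for sites, model in nonconforming:
        if site in sites:
            return model
    return dominant_model


# Hypothetical region: background-like dominant motion plus one moving arm.
dominant = (0.0, 1.0)                       # (dx, dy) translation
arm = ({(5, 5), (5, 6)}, (0.0, -2.0))       # sub-region sites and its model
print(model_for_site((5, 6), dominant, [arm]))
print(model_for_site((0, 0), dominant, [arm]))
```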

7. Results

The proposed scheme has been validated on both synthetic and real-world image sequences. Parameters a(i) were set to 0.2 once and for all, a value inferred empirically from the definition of the Kolmogorov-Smirnov distance. The pixel-level regularization constant, k, can reasonably be set between 0.2 and 0.4 (it is set to 0.3 in practice). Control over the number of spatial regions is mainly left to $\phi$, whose setting depends on the selected segmentation criterion (texture, grey-level or color). For a given criterion, the same value has proved satisfactory for most tested sequences. q is the most important parameter, since it controls the tolerated discrepancy between motions. As a general rule, one should use small values of q for spatial partition tracking and for video coding. For dynamic content analysis (interpretation), there may be different motion-based partitions that are all meaningful from some point of view. It is hard to give a general rule since, for instance, discriminating particularly slow motion requires a very low value of q, which may create undesired regions due to parallax effects in some other sequence. The values of q chosen for the sequences shown in this section are given in the table below. Since motion is the most important criterion and regularization should in practice determine the label only in ambiguous cases, we set b and j both to 0.1. The measurement and process noise variances for temporal filtering are taken constant and both equal to 0.01. Such a parametrized model is satisfactory for the sequences tested, but if prior knowledge were available about the dynamics of the tracked objects, or if some scheme were added to learn these parameters, values better suited to each sequence could be chosen.


             Spatial segmentation    Motion-based grouping    Temporal filtering
Parameter    a(i)       k            b          j             $\sigma_f^2$    $\sigma_R^2$
Value        0.2        0.3          0.1        0.1           0.01            0.01

Sequence     Patchwork   Power station   Renata       Interview    Mobi    Car
q            0.2         0.2             2            2            2       2
Criterion    Texture     Texture         Grey-level   Grey-level   Color   Color

Fig. 5. Patchwork sequence: (a) true region numbering; original image with estimated vector subfields corresponding to the motion models and motion boundaries superimposed at time (b) t = 4, (c) t = 16, (d) t = 20 and (e) t = 38.


Fig. 6. "Power station" sequence (infra-red, by courtesy of SAT): original images with (a) superimposed texture region boundaries, (b) spatial segmentation maps and (c) motion-based region groupings for frames 1 and 15.

The method was first applied to synthetic sequences. The different regions of a 256×256 textured image, Patchwork, made up of natural textures taken from the Brodatz album, are imparted different and time-varying affine motions. Image intensities are first quantized on 20 levels. The texture features used are the mean value, the local variance, and a statistic extracted from co-occurrence matrices of grey-level values calculated on 7×7 pixel local windows, namely correlation [18]. The true region labels are given in Fig. 5a. Regions 1 and 2, in the foreground, undergo horizontal translational motion, first accelerating then slowing down. Region 3 is imparted combined translation, rotation and divergence, while region 4 undergoes combined divergence and translation. A first increasing, then decreasing divergence is applied to region 5. The determined motion boundaries and the estimated motion model fields superimposed on the original images are shown in Fig. 5 for various frames. Three groupings are initially formed: respectively (1), (2), (3, 4, 5) (Fig. 5b). The region-level label configuration then varies along the sequence. Indeed, as the motion in region 3 becomes strongly different from the motion imparted to regions 4 and 5, region 3 becomes a separate motion entity (Fig. 5c), in accordance with the ground truth. A new motion-based region is further created (Fig. 5d), because of the increasing strength of the divergence applied to region 5. Close to the end of the sequence, regions 2 and 5 become almost static, and thus form a grouping with the very slowly moving region 4. In this example, regions undergoing similar motions are correctly grouped and region-level labeling is consistent over the sequence. The number of groupings is also updated in agreement with the ground truth. The accuracy of the retrieved boundaries and of the estimated motion models is satisfactory.
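To make the kinds of motion imparted to the Patchwork regions concrete, the sketch below builds an affine velocity field from translation, divergence and rotation components. The function name `affine_velocity`, the choice of the image center as origin and the numerical values are assumptions for illustration; the actual synthetic motion parameters are not given in the text.

```python
def affine_velocity(x, y, tx=0.0, ty=0.0, div=0.0, rot=0.0,
                    center=(128.0, 128.0)):
    """Velocity (u, v) at (x, y) for an affine field with the given
    translation (tx, ty), divergence `div` and rotation (curl) `rot`.

    The field is written so that du/dx + dv/dy = div and
    dv/dx - du/dy = rot, relative to an assumed center of motion.
    """
    xc, yc = center
    u = tx + 0.5 * div * (x - xc) - 0.5 * rot * (y - yc)
    v = ty + 0.5 * rot * (x - xc) + 0.5 * div * (y - yc)
    return u, v


# Pure horizontal translation, as for regions 1 and 2 (illustrative value).
print(affine_velocity(10.0, 20.0, tx=1.5))
# Combined translation and divergence, as for region 4 (illustrative values).
print(affine_velocity(200.0, 128.0, tx=0.5, div=0.02))
```

A time-varying motion, such as the increasing then decreasing divergence applied to region 5, would correspond to changing `div` from frame to frame.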

The Power Station infra-red image sequence (Fig. 6, of size 500×236 pixels) comes from the surveillance domain, in which it is of interest to structure the frames into regions of homogeneous appearance, so as to adapt and facilitate the subsequent detection of small moving objects. Since only camera motion is present, motion discontinuities are only due to differences in depth relative to the camera, or to different surface orientations. Texture is taken as the segmentation criterion, all parameters being as for the Patchwork sequence. Results are shown in Fig. 6 for two frames. Motion grouping can be observed between regions that are located at similar depths. Labels and boundaries are consistently maintained through the sequence.

In the Renata sequence (Fig. 7, size 360×288 pixels), the woman is going right, while being tracked by the camera. The spatial regions are shown in Figs. 7a-c and the motion-based region boundaries in Figs. 7d, e and f, for frames 9, 19 and 29, respectively. Again, labels of spatial regions are temporally consistent, and the motion entity (the woman) is correctly extracted, which demonstrates the efficiency of the updating-tracking method.

Fig. 7. "Renata" sequence: (a)-(c) spatial regions and (d)-(f) original images with motion-based boundaries superimposed for frames 9, 19 and 29.

In the Interview sequence (Fig. 8, of size 337×268 pixels), the woman on the right is getting up, while bringing her left arm upwards. Meanwhile, the camera is tilting upwards, more slowly than the woman's motion, causing a downwards apparent motion for the rest of the scene. At the beginning of the sequence, the woman is getting up quickly, then progressively more slowly. This result illustrates how the motion-based grouping can be improved by creating sub-regions on a motion criterion. Fig. 8a contains the spatial partition. It can be seen that the left arm is included in the same region as a large area of background, and that the hand is attached to the lower part of the body. The assumption of a single motion per spatial region is here clearly broken at least twice. In Fig. 8b, the result of sub-region detection is shown. Both problems, related to the arm and the hand, are alleviated, and the arm and the hand are correctly grouped and delimited after the motion grouping process (Fig. 8c), as well as most of the rest of the body. Some dark background is nevertheless attached to the moving body. Because this background is uniform and the moving occluding contour is the only available information, the resulting estimated motion model is similar to that of the woman. Figs. 8d and e show a comparison of the motion model fields as estimated on spatial regions, and as estimated on motion-based regions. The rotational motion of the arm is correctly extracted in the latter, whereas in the former it is formed of almost translational piecewise sub-fields, which are a poorer description.

In Fig. 9, we show some results of a comparison of motion-based partitions between the approach proposed in this paper and a segmentation method described in Ref. [25]. We selected this method because it represents a class of techniques that take a different approach, in the sense that it relies on motion only.

Fig. 8. "Interview" sequence: (a) original image (frame 74) with spatial region boundaries, (b) spatial region boundaries with sub-region boundaries, (c) motion-based region contours, (d) motion model fields as estimated on spatial regions and sub-regions and (e) as estimated on motion-based region groupings.

Fig. 9a is to be compared with Fig. 9c, and Fig. 9b with Fig. 9d. Our approach localizes boundaries more accurately (e.g. right arm, head), but is less effective in regions where motion estimation or sub-region detection cannot be achieved correctly.

The effect of temporal filtering can be illustrated on the following example. The temporal evolution of the vertical translation parameter of the region group corresponding to the woman is considered. Comparison of the measured and filtered parameters (Fig. 10) shows how the estimated motion can be temporally stabilized by filtering. Temporally smooth apparent motion variations are indeed physically more realistic than the more erratic raw motion estimates. The use of Kalman filtering is often beneficial. Occasionally, though, it may be less effective, because the chosen evolution model and its parameters may not be well suited to the real parameter evolution.

For the Interview sequence, processing each pair of frames takes around 80 s using non-optimized code on an UltraSparc: 4 s are devoted to motion model estimation and 4 s to the region-level computations (distance calculations and energy minimization), while the remaining time (about 72 s) is spent updating the spatial segmentation.

In the Mobi sequence (Fig. 11, size 337×268 pixels, MPEG-1 decompressed, color), the train is pushing a rolling spotted ball leftwards, while the calendar is pulled upwards. The camera is panning and tracks the train. The original image for the first frame, the color boundaries and the motion-based boundaries are respectively shown in Figs. 11a, b and c. It can be seen that the motion-based groups obtained mainly correspond to meaningful motion entities (background, ball, train), and that they are accurately retrieved.

The Car sequence¹ (Fig. 12) is also a color sequence. The original image for the first frame, the color boundaries and the motion-based boundaries are respectively shown in Figs. 12a, b and c.

¹ We would like to thank INA (Institut National de l'Audiovisuel, Département Innovation, France) for providing this sequence.


Fig. 9. "Interview" sequence: comparison of the region-level motion-based partition method with a direct pixel-level motion-based segmentation technique [25]. Motion region boundaries (a) at time t = 30 and (b) t = 62 for the technique presented in this paper, (c) at time t = 30 and (d) t = 62 for the pixel-level motion-based technique.

Fig. 10. Vertical translation motion parameter ($a_1$) associated with the motion-based region group corresponding to the woman: comparison between measured and filtered estimates. X-axis: frame number.

8. Conclusion

A global method for motion-based segmentation and for updating and tracking a spatial image partition has been presented, through the definition of a motion-based graph representation of the spatial partition as the key tool for prediction and tracking. Once a motion model has been estimated on every spatial region, region grouping is formalized as an energy minimization problem, taking motion, geometric and contextual information into account. Motion-based region boundaries and the number of region groups are jointly determined and updated along the sequence. Promising results have been obtained on image sequences of relatively high complexity, providing a good structuring of the content in terms of moving elements.

In comparison with usual region tracking techniques, the original introduction of region-level context allows motion estimation accuracy and map prediction coherence and quality to be improved. Also, considering that the general task at hand is to track a given spatial partition, an extension of the method to the case in which a spatial region has to be described by several motion models has been proposed.

Fig. 11. "Mobi" sequence: for the first frame of the sequence, the figure shows the original image (a), the spatial boundaries (b) and motion boundaries (c).

Fig. 12. "Car" sequence: for the first frame of the sequence, the figure shows the original image (a), the spatial boundaries (b) and motion boundaries (c).

In Ref. [26], we have proposed an approach to cope with occlusions and crossings. Should some regions be occluded, tracking could then rely only on the predicted motion and predicted geometry of these regions, and benefit further from the simplicity of structure of the motion-based graph $G_m$ relative to the spatial graph G, as demonstrated on the Interview, Mobi and Renata sequences, for instance. This also provides a high-level representation and interpretation of the dynamic content of the image sequence.

In the context of content-based video indexing, this work has contributed to structuring a video in terms of relevant spatio-temporal regions [27]. This leads to video summaries and to moving objects indexed by their motion. The underlying spatial segmentation directly provides texture or color information for each region, and the distance used to compare local and region statistics could also be employed to compare queries and extracted regions. Besides, a motion descriptor is attributed to every region, permitting queries combining texture and motion.

References

[1] M. Gelgon, P. Bouthemy, A region-level graph labeling approach to motion-based segmentation, Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, Puerto Rico, June 1997, pp. 514-519.

[2] M. Gelgon, P. Bouthemy, A region-level motion-based graph representation and labeling for tracking a spatial image partition, Proceedings of IAPR Workshop on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), Venice, May 1997, Lecture Notes in Computer Science, vol. 1223, Springer, Berlin.

[3] S. Ayer, P. Schroeter, J. Bigün, Segmentation of moving objects by robust motion parameter estimation over multiple frames, Proceedings of Third European Conference on Computer Vision, Stockholm, May 1994, pp. 316-327.


[4] F. Dufaux, F. Moscheni, A. Lippman, Spatio-temporal segmentation based on motion and static segmentation, Proceedings of Second IEEE International Conference of Image Processing, Washington, October 1995, pp. 306-309.

[5] V. Garcia-Garduno, C. Labit, On the tracking of regions over time for very low bit rate image sequence coding, Proceedings of Picture Coding Symposium PCS'94, Sacramento, CA, September 1994, pp. 257-260.

[6] L. Wu, J. Benois-Pineau, Ph. Delagnes, D. Barba, Spatio-temporal segmentation of image sequences for object-oriented low bit-rate image coding, Signal Process. Image Commun. 8 (1996) 513-543.

[7] J.Y.A. Wang, E.H. Adelson, Representing moving images with layers, IEEE Trans. Image Process. 3 (5) (1994) 625-638.

[8] H. Zheng, D. Blostein, Motion-based object segmentation and estimation using the MDL principle, IEEE Trans. Image Process. 4 (9) (1995) 1223-1235.

[9] C. Hennebert, V. Rebuffel, P. Bouthemy, A hierarchical approach for scene segmentation based on 2D motion, Proceedings of the 13th International Conference on Pattern Recognition, Vienna, August 1996, pp. 218-222.

[10] W. Xiong, C. Graffigne, A hierarchical method for detection of moving objects, Proceedings of First IEEE International Conference of Image Processing, Austin, November 1994, pp. 795-799.

[11] J. Wang, Stochastic relaxation on partitions with connected components and its application to image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 20 (6) (1998) 619-636.

[12] M.J. Black, Combining intensity and motion for incremental segmentation and tracking over long image sequences, Proceedings of Second European Conference on Computer Vision, Santa Margherita Ligure, Italy, May 1992, pp. 485-493.

[13] P. Bouthemy, E. François, Motion segmentation and qualitative dynamic scene analysis from an image sequence, Int. J. Comput. Vision 10 (2) (1993) 157-182.

[14] M. Irani, B. Rousso, S. Peleg, Detecting and tracking multiple moving objects using temporal integration, Proceedings of Second European Conference on Computer Vision, Santa Margherita Ligure, Italy, May 1992, pp. 282-287.

[15] F. Meyer, P. Bouthemy, Region-based tracking using affine motion models in long image sequences, CVGIP: Image Understanding 60 (2) (1994) 119-140.

[16] C. Toklu, A.T. Erdem, M.I. Sezan, A.M. Tekalp, Tracking motion and intensity variations using hierarchical 2D mesh modeling for synthetic object transfiguration, Graphical Models Image Process. 58 (6) (1996) 553-573.

[17] S. Geman, D. Geman, Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell. 6 (6) (1984) 721-741.

[18] C. Kervrann, F. Heitz, A Markov random field model-based approach to unsupervised texture segmentation using local and global spatial statistics, IEEE Trans. Image Process. 4 (6) (1995) 856-862.

[19] M.J. Swain, D. Ballard, Color indexing, Int. J. Comput. Vision 7 (1) (1991).

[20] P.B. Chou, C.M. Brown, The theory and practise of Bayesian image modelling, Int. J. Comput. Vision 4 (1990) 185-210.

[21] F. Heitz, P. Pérez, P. Bouthemy, Multiscale minimization of global energy functions in some visual recovery problems, CVGIP: Image Understanding 59 (1) (1994) 125-134.

[22] J.-M. Odobez, P. Bouthemy, Robust multiresolution estimation of parametric motion models, J. Visual Commun. Image Representation 6 (4) (1995) 348-365.

[23] Y.G. Leclerc, Constructing simple stable descriptions for image partitioning, Int. J. Comput. Vision 3 (1989) 73-102.

[24] J.-M. Odobez, P. Bouthemy, Separation of moving regions from background in an image sequence acquired with a mobile camera, In: H.H. Li, S. Sun, H. Derin (Eds.), Video Data Compression for Multimedia Computing, Kluwer Academic Publisher, Dordrecht, 1997, pp. 283-311.

[25] J.M. Odobez, P. Bouthemy, Direct incremental model-based image motion segmentation for video analysis, Signal Processing 66 (3) (1998) 143-156.

[26] M. Gelgon, P. Bouthemy, J.-P. Le Cadre, Associating and estimating trajectories of multiple moving regions with a probabilistic multi-hypothesis tracking approach, First International Symposium of Physics in Image Processing, Paris, January 1999.

[27] M. Gelgon, P. Bouthemy, Determining a structured spatio-temporal representation of video content for efficient visualisation and indexing, Fifth European Conference on Computer Vision (ECCV'98), Freiburg, Germany, June 1998, pp. 595-609 (II).

