
An Object Category Specific mrf for Segmentation

M. Pawan Kumar1, Philip H.S. Torr1, and Andrew Zisserman2

1 Department of Computing, Oxford Brookes University, Oxford, OX33 1HX
{pkmudigonda,philiptorr}@brookes.ac.uk
http://cms.brookes.ac.uk/computervision

2 Department of Engineering Science, University of Oxford, Oxford, OX1 3PJ
[email protected]
http://www.robots.ox.ac.uk/~vgg

Abstract. In this chapter we present a principled Bayesian method for detecting and segmenting instances of a particular object category within an image, providing a coherent methodology for combining top down and bottom up cues. The work draws together two powerful formulations: pictorial structures (ps) and Markov random fields (mrfs), both of which have efficient algorithms for their solution. The resulting combination, which we call the object category specific mrf, suggests a solution to the problem that has long dogged mrfs, namely that they provide a poor prior for specific shapes. In contrast, our model provides a prior that is global across the image plane using the ps. We develop an efficient method, ObjCut, to obtain segmentations using this model. Novel aspects of this method include an efficient algorithm for sampling the ps model, and the observation that the expected log likelihood of the model can be increased by a single graph cut. Results are presented on two object categories, cows and horses. We compare our methods to the state of the art in object category specific image segmentation and demonstrate significant improvements.

1 Introduction

Image segmentation has seen renewed interest in the field of computer vision, in part due to the arrival of new efficient algorithms to perform the segmentation [5], and in part due to the resurgence of interest in object category recognition [2,8,17]. Segmentation fell from favour partly due to an excess of papers attempting to solve ill posed problems with no means of judging the result. Interleaved object recognition and segmentation [4,17] is both well posed and of practical use. Well posed in that the result of the segmentation can be quantitatively judged, e.g. how many pixels have been correctly and incorrectly assigned to the object. Of practical use because (a) the more accurately the image can be

J. Ponce et al. (Eds.): Toward Category-Level Object Recognition, LNCS 4170, pp. 596–616, 2006. © Springer-Verlag Berlin Heidelberg 2006


An Object Category Specific mrf for Segmentation 597

segmented the more certain the recognition results will be, and (b) image editing tools can be designed that provide a "power assist" to cut out applications like 'Magic Wand', e.g. "I know this is a horse, please segment it for me, without the pain of having to manually delineate the boundary."

Markov Random Fields (mrfs) provide a useful model of images for segmentation and their prominence has been increased by the availability of efficient publicly available code for their solution. Boykov and Jolly [5], and more recently Rother et al. [21], strikingly demonstrate that with a minimum of user assistance objects can be rapidly segmented. However samples from the Gibbs distribution defined by the mrf very rarely give rise to realistic shapes and on their own mrfs are ill suited to segmenting objects. What is required is a way to inject prior knowledge of object shape into the mrf. Within this chapter we derive a Bayesian way of doing this in which the prior knowledge is provided by a Pictorial Structure (ps). Pictorial Structures [6] and the related Constellation of Parts model [8] have proven to be very successful for the task of object recognition. In addition, ps have been highly effective in localizing object parts, e.g. human limbs. We cast the problem of object category specific segmentation as that of estimating an mrf (representing bottom up information) which is influenced by a set of latent variables, the ps (representing top down information), encouraging the mrf to resemble the object. Unlike mrfs, which model the prior using pairwise potentials, the ps model provides a prior over the shape of the segmentation that is global across the image plane.

Many different approaches for segmentation using a global shape prior have been reported in the literature. Huang et al. [13] describe an iterative algorithm which alternates between fitting an active contour to an image and segmenting it on the basis of the shape of the active contour. However, the computational inefficiency of the algorithm restricts the application of this method. Freedman et al. [9] describe an efficient algorithm based on mincut which uses a shape prior for segmentation. Both these methods, however, require manual initialization.

Borenstein and Ullman [4] describe an automatic algorithm for segmenting instances of a particular object category from images using a patch-based approach. Leibe and Schiele [17] provide a probabilistic formulation for this while incorporating spatial information of the relative location of the patches. However, in order to deal with intra-class shape and appearance variation, as well as large deformations of articulated objects, these methods have to resort to the computational inefficiency of hundreds of exemplars.

In this chapter, we propose a novel probabilistic mrf model which overcomes the above problems. We develop an efficient method, ObjCut, to obtain segmentations using this model. The basis of our method is two new theoretical/algorithmic contributions: (1) we make the (not obvious) observation that the expectation of the log likelihood of an mrf with respect to some latent variables can be efficiently optimized with respect to the labels of the mrf by a single graph cut optimization; (2) we provide a highly efficient algorithm for marginalizing or optimizing the latent variables when they are a ps following a Potts model.


598 M.P. Kumar, P.H.S. Torr, and A. Zisserman

The chapter is organized as follows. In Section 2 the probabilistic mrf model of the image is described in broad terms. Section 3 gives an overview of an efficient method for solving this model for figure-ground labellings. In Section 4 the layered pictorial structures (lps) model is described, which extends the ps model so that it handles partial self occlusion. The important issue of automatic initialization of the lps is addressed in Section 5. The ObjCut algorithm is described in Section 6. Results are shown for two object categories, namely cows and horses, and a comparison with other methods is given in Section 7.

2 Object Category Specific MRF

In this section we describe the model that forms the basis of our work. We formally specify and build upon previous work on segmentation, providing a Bayesian graphical model for work that has previously been specified in terms of energy functions [5]. Notably there are three issues to be addressed in this section: (i) how to make the segmentation conform to object and background appearance models, (ii) how to encourage the segmentation to follow edges within the image, (iii) how to encourage the segmentation to look like an object.

Given an image D containing an instance of a known object category, e.g. cows, we wish to segment the image into figure, i.e. pixels belonging to the object, and ground, i.e. the background. Taking a Bayesian perspective, we define a set of binary labels, m, one label mx for each pixel x, that optimizes the posterior probability given by the Gibbs distribution

p(m|D) = p(D|m)p(m)/p(D) = (1/Z1) exp(−Ψ1(m)). (1)

Here Z1 is the normalizing constant (or partition function), i.e.

Z1 = ∫ p(D|m)p(m) dm, (2)

which ensures that the probabilities sum up to one. The energy is defined by the summation of clique potentials and has the form:

Ψ1(m) = Σ_x ( φ(D|mx) + Σ_y ψ(mx, my) ), (3)

where y is a neighbouring pixel of x. The likelihood term φ(D|mx) is the emission model for one or more pixels and is given by

φ(D|mx) = { −log p(x ∈ figure|Hobj)  if mx = 1,
            −log p(x ∈ ground|Hbkg)  if mx = 0,

where Hobj and Hbkg are the rgb distributions for foreground and background respectively. The prior ψ(mx, my) takes the form of an Ising model:

ψ(mx, my) = { P  if mx ≠ my,
              0  if mx = my,



which (equally) favours any two neighbouring pixels having the same label, thus ensuring smoothness.

In the mrfs used for image segmentation, a contrast term is used to favour pixels with similar colour having the same label [3,5], thereby pushing the boundary to lie on image edges. This is done by reducing the cost within the Ising model for two labels being different in proportion to the difference in intensities of their corresponding pixels, e.g. by subtracting an energy term

γ(x, y) = λ(1 − exp(−g²(x, y)/(2σ²)) · 1/dist(x, y)),

where g²(x, y) measures the difference in the rgb values of pixels x and y, and dist(x, y) gives the spatial distance between x and y [3,5]. In our experiments, we use P = λ = 0.1 and σ = 5. Together with the prior, this makes the pairwise terms discontinuity preserving [14]. However, this has previously not been given a proper Bayesian formulation, which we now address. It cannot be included in the prior, for the prior term cannot include the data. Rather it leads to a pairwise linkage between neighbouring labels and their pixels as shown in the graphical model given in Figure 1. The posterior probability is now given by p(m|D) = (1/Z2) exp(−Ψ2(m)), where

Ψ2(m) = Σ_x ( φ(D|mx) + Σ_y ( φ(D|mx, my) + ψ(mx, my) ) ), (4)

and Z2 = ∫ p(D|m)p(m) dm. The contrast term of the energy function is given by:

φ(D|mx, my) = { −γ(x, y)  if mx ≠ my,
                0          if mx = my.
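As a concrete illustration, the pairwise terms above can be sketched as follows. The parameter values P = λ = 0.1 and σ = 5 are those quoted in the text; the rgb tuples and the neighbourhood distance convention are assumptions made for illustration only.

```python
import numpy as np

def contrast_term(rgb_x, rgb_y, dist_xy, lam=0.1, sigma=5.0):
    """gamma(x, y): reduces the cost of a label discontinuity across an
    image edge. g2 is the squared difference of the rgb values."""
    g2 = np.sum((np.asarray(rgb_x, float) - np.asarray(rgb_y, float)) ** 2)
    return lam * (1.0 - np.exp(-g2 / (2.0 * sigma ** 2)) / dist_xy)

def pairwise_energy(m_x, m_y, rgb_x, rgb_y, dist_xy, P=0.1):
    """phi(D|m_x, m_y) + psi(m_x, m_y): Ising prior plus contrast term."""
    if m_x == m_y:
        return 0.0
    return P - contrast_term(rgb_x, rgb_y, dist_xy)
```

With P = λ, the cost of a discontinuity between two very similar neighbouring pixels stays at P, while across a strong edge at unit distance it falls towards zero, which is what pushes the segmentation boundary onto image edges.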

mrf-based segmentation techniques which use mincut [14] have achieved excellent results [3,5] with manual initialization. However, due to the lack of a shape model, these methods do not work so well for automatic segmentation of instances of specific object categories. We would like to use the power of the mincut algorithm for interleaved object recognition and segmentation. In some sense the result of the recognition will replace the user interventions. In order to achieve this we introduce a stronger shape model to the mrf. This shape model will supply a set of latent variables, Θ, which will favour segmentations of a specific shape, as shown in the graphical model depicted in Figure 1. We call this new mrf model the object category specific mrf, which has the following energy function:

Ψ3(m, Θ) = Σ_x ( φ(D|mx) + φ(mx|Θ) + Σ_y ( φ(D|mx, my) + ψ(mx, my) ) ), (5)

with posterior p(m, Θ|D) = (1/Z3) exp(−Ψ3(m, Θ)), where

Z3 = ∫ p(D|m, Θ)p(m, Θ) dm dΘ, (6)

is the partition function. The function φ(mx|Θ) is chosen so that if we were given an estimate of the location and shape of the object, then pixels falling



Fig. 1. Graphical model representation of the object category specific mrf. The connections introducing the contrast term are shown in blue. Note that some of these connections (going diagonally) are not shown for the sake of clarity of the image. The labels m lie in a plane. Together with the pixels shown below this plane, these form the contrast-dependent mrf used for segmentation. In addition to these, the object category specific mrf makes use of an underlying shape parameter in the form of an lps model (shown lying above the plane). The lps model guides the segmentation towards a realistic shape closely resembling the object of interest.

near to that shape would be more likely to have the object label than pixels falling far from the shape. It has the form:

φ(mx|Θ) = − log p(mx|Θ). (7)

In this work, we choose to define p(mx|Θ) as

p(mx = figure|Θ) = 1/(1 + exp(μ · dist(x, Θ))), (8)

and p(mx = ground|Θ) = 1 − p(mx = figure|Θ), where dist(x, Θ) is the spatial distance of a pixel x from the shape defined by Θ (being negative if inside the shape). The parameter μ determines how much the points outside the shape are penalized compared to the points inside the shape. We use μ = 0.2 in our experiments. Note that the energy function Ψ3(m, Θ) can still be minimized via mincut [14].
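A minimal sketch of the shape-prior terms of equations (7) and (8), with μ = 0.2 as in the text. The signed distance to the shape defined by Θ is taken as a plain number here; computing it from an actual lps configuration is outside the scope of this sketch.

```python
import math

def p_figure(signed_dist, mu=0.2):
    """p(m_x = figure | Theta): sigmoid of the signed distance of pixel x
    from the shape (negative inside the shape), equation (8)."""
    return 1.0 / (1.0 + math.exp(mu * signed_dist))

def phi_shape(m_x, signed_dist, mu=0.2):
    """phi(m_x | Theta) = -log p(m_x | Theta), equation (7)."""
    p = p_figure(signed_dist, mu)
    return -math.log(p if m_x == 1 else 1.0 - p)
```

A pixel well inside the shape (signed distance strongly negative) gets a figure probability close to 1, so labelling it figure incurs almost no cost while labelling it ground is heavily penalized.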

We combine the Contrast Dependent mrf with the layered pictorial structures (lps) model (see Section 4). However we observe that the methodology below is completely general and could be combined with any sort of latent shape model.

The optimal figure-ground labelling should be obtained by integrating out the latent variable Θ. The surprising result of this work is that this rather intractable looking integral can in fact be optimized by a simple and computationally efficient set of operations. In order to do this, we need to demonstrate two things: (i) Given an estimate of m we can sample efficiently for Θ. This we shall demonstrate for the case of lps, and in §5.1 we describe a new algorithm for efficient calculation of the marginal distribution for a non regular Potts model (i.e. when the labels are not specified by an underlying grid of parameters, complementing



the result of Felzenszwalb and Huttenlocher [7]). (ii) Given the distribution of Θ we can efficiently optimize m so as to increase the posterior. For an mrf this is not immediately obvious. However we shall demonstrate this in the next section.

3 Roadmap of the Solution

For the problem of segmentation the parameters m are of immediate interest and the em framework provides a natural way to deal with the latent parameters Θ [11] by treating them as missing data. We are interested in maximizing the log posterior density of m, which is given by

log p(m|D) = log p(Θ,m|D) − log p(Θ|m,D), (9)

where p(Θ, m|D) = (1/Z3) exp(−Ψ3(m, Θ)). The em framework iteratively refines the estimate of m by marginalizing the latent parameters Θ. Given the current guess of the labelling m′, we treat Θ as a random variable with the distribution p(Θ|m′, D). Averaging over Θ yields

log p(m|D) = E(log p(Θ,m|D)) − E(log p(Θ|m,D)), (10)

where E is the averaging over Θ under the distribution p(Θ|m′, D).

The key result of em is that the second term on the right side of equation (10), i.e.

E(log p(Θ|m, D)) = ∫ (log p(Θ|m, D)) p(Θ|m′, D) dΘ, (11)

is maximized when m = m′. Our goal then is to choose a labelling m′′ (different from m′) which maximizes

E(log p(Θ, m|D)) = ∫ (log p(Θ, m|D)) p(Θ|m′, D) dΘ. (12)

This labelling increases E(log p(Θ, m|D)) (the first term of equation (10)) and decreases E(log p(Θ|m, D)) (the second term of equation (10), which is maximized when m = m′). Thus, it increases the posterior p(m|D). The expression in equation (12) is called Q(m|m′), the expected complete-data log-likelihood, in the em literature.

In §5.1 it will be shown that we can efficiently sample from a ps, which suggests a sampling based solution to maximizing (12). Let the set of s samples be Θ1 . . . Θs, with weights p(Θi|m′, D) = wi; then the minimization corresponding to equation (12) can be written as

m̂ = arg min_m Σ_{i=1}^{s} wi Ψ3(m, Θi) − C. (13)

Here C = Σ_i wi log Z3 is a constant which can be ignored during minimization. This is the key equation of our approach. Section 6 describes an efficient method



for minimizing the energy function (13). We observe that this energy function is a weighted linear sum of the energies Ψ3(m, Θ), which, being a linear combination, can also be optimized using mincut [14]. This demonstrates the interesting result that for Markov random fields with latent variables, it is computationally feasible to optimize Q(m|m′).
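The weighted minimization of equation (13) can be sketched as follows. `psi3` is a hypothetical callable standing in for Ψ3(m, Θ), and the exhaustive search over candidate labellings is a toy stand-in for the single graph cut used in practice.

```python
def expected_energy(m, samples, weights, psi3):
    """Weighted sum of energies: sum_i w_i * Psi3(m, Theta_i), equation (13)."""
    return sum(w * psi3(m, theta) for w, theta in zip(weights, samples))

def best_labelling(candidates, samples, weights, psi3):
    """arg min over candidate labellings of the expected energy.
    In ObjCut this minimization is performed by a single graph cut."""
    return min(candidates,
               key=lambda m: expected_energy(m, samples, weights, psi3))
```

The point of the observation in the text is that the weighted sum is itself a submodular pairwise energy of the same form as Ψ3, so no enumeration of labellings is actually needed.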

The em algorithm often converges to a local minimum of the energy function and its success depends on the initial labelling m0 (i.e. the labelling m′ at the first iteration). In the last section a generative graphical model for pixel by pixel segmentation was set up. However, it would be computationally extravagant to attempt to minimize this straight off. Rather, an initialization stage is adopted in which we get a rough estimate of the object's posterior extracted from a set of image features Z, defined in § 4.1. Image features (such as textons and edges) can provide high discrimination at low computational cost. We approximate the initial distribution p0(Θ|m, D) as g(Θ|Z), where Z are some image features chosen to localize the object in a computationally efficient manner. Thus, the weights wi required to evaluate equation (13) on the first em iteration are obtained by sampling from the distribution g(Θ|Z), defined in Section 4.

The next section describes the lps model in detail. In the remainder of the chapter, we describe an efficient method to obtain the samples from the posterior of a ps model required for the marginalization in equation (13), and the ObjCut algorithm which re-estimates the labelling m by minimizing equation (13). These methods are applicable to any articulated object category which can be modelled using an lps. We demonstrate the results on two quadrupeds, namely cows and horses.

4 Layered Pictorial Structures

Pictorial structures (ps) are compositions of 2D patterns, termed parts, under a probabilistic model for their shape, appearance and spatial layout. When calculating the likelihood of model parameters, a typical assumption under the ps model is that the parts do not (partially) occlude each other [15]. Thus, the estimated poses of similar parts of the ps model tend to overlap. For example, the two forelegs of a cow tend to explain the same pixels in the image, as this provides a high likelihood.

In the layered pictorial structures (lps) model introduced in [16] (and in a similar model described in [1]), in addition to shape and appearance, each part pi is also assigned a layer number li which determines its relative depth. Several parts can have the same layer number if they are at the same depth. A part pi can partially or completely occlude part pj if and only if li > lj. The parts of an lps are defined as rigidly moving components of the object. In the case of side views of quadrupeds, this results in 2 layers containing a total of 10 parts: head, torso and 8 half limbs (see Figure 2). The parts are obtained as described in § 4.2.

An lps can also be viewed as an mrf with the sites of the mrf corresponding to parts. Each site takes one of nL part labels which encode the putative poses of



Fig. 2. Layered pictorial structure of a cow. The various parts belonging to layers 2 and 1 are shown in the top left and right image respectively. Pairwise potentials defined for every pair of parts as shown in equation (16) only allow valid configurations of a cow. Three such configurations are shown in the bottom row.

the part. The pose of the ith site is defined by ti = (xi, yi, θi, σi), where (xi, yi) is the location, θi is the orientation and σi is the scale.

For a given part label ti and image D, the ith part corresponds to the set of pixels Di ⊂ D which are used to calculate features zi = (z1, z2). Let nP be the number of parts. The shape and appearance parameters for part pi, represented as si and ai, are used to compute z1 and z2 respectively. The feature z2(Di) is initially obtained using texture exemplars as described in § 4.1. Once an initial estimate of the model is obtained, the location of the object is used to estimate the rgb histograms for object and background as described in § 5.2 (and texture exemplars are no longer used). Assuming that Di does not include pixels accounted for by pj, lj > li, we get

p(Z|Θ) = ∏_{i=1}^{nP} p(zi|ai, si), (14)

where Z = {z1 . . . znP} are the image features.

lps, like ps, are characterized by pairwise only dependencies between the sites.

These are modelled as a prior on the relative poses of parts:

p(Θ) ∝ exp( − Σ_{i=1}^{nP} Σ_{j=1, j≠i}^{nP} α(ti, tj) ). (15)

Note that we use a completely connected mrf. In our approach, the pairwise potentials α(ti, tj) are given by a Potts model, i.e.

α(ti, tj) = { d1  if valid configuration,
              d2  otherwise,



where d1 < d2. In other words, all valid configurations are considered equally likely and have a higher prior. A configuration is valid provided the relative shape parameters of the two poses lie within a box, i.e. if

tmin_ij ≤ |ti − tj| ≤ tmax_ij,

where tmin_ij = {xmin_ij, ymin_ij, θmin_ij, σmin_ij} and tmax_ij = {xmax_ij, ymax_ij, θmax_ij, σmax_ij} are learnt using training video sequences as described in § 4.2. The posterior of the model parameters is given by

g(Θ|Z) ∝ ∏_{i=1}^{nP} p(zi|ai, si) exp( − Σ_{j≠i} α(ti, tj) ). (16)
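The box-constraint validity test behind the Potts potential α(ti, tj) can be sketched component-wise as follows; the pose tuples and bound values used in the example are illustrative, not the bounds learnt from the training videos.

```python
def alpha(t_i, t_j, t_min, t_max, d1=1.0, d2=5.0):
    """Potts pairwise potential: d1 for a valid configuration, d2 otherwise.
    Poses are (x, y, theta, sigma); the configuration is valid when every
    component of |t_i - t_j| lies inside the learnt box [t_min, t_max]."""
    diff = [abs(a - b) for a, b in zip(t_i, t_j)]
    valid = all(lo <= d <= hi for d, lo, hi in zip(diff, t_min, t_max))
    return d1 if valid else d2
```

Since d1 < d2, every valid relative placement of two parts receives the same (higher) prior, exactly as the text states.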

We now describe how we model the likelihood of the parts of the lps.

4.1 Feature Likelihood for Parts

We define the features Z extracted from the pixels D. As described earlier, we use two types of features zi(Di) = (z1(Di), z2(Di)) for the shape and appearance of the part respectively. Assuming independence of the two features, the likelihood based on the whole data is approximated as

p(zi|ai, si) = p(z1|si)p(z2|ai) (17)

where p(z1|si) = exp(−z1) and p(z2|ai) = exp(−z2).

Outline (z1(Di)): In order to handle the variability in shape among members of an object class (e.g. horses), it is necessary to represent the part outline by a set of exemplar curves. Chamfer distances are computed for each exemplar for each pose ti. The first feature z1(Di) is the minimum of the truncated chamfer distances over all the exemplars of pi at pose ti. Truncated chamfer distance measures the similarity between two shapes U = (u1, u2, . . . , un) and V = (v1, v2, . . . , vm). It is the mean of the distances between each point ui ∈ U and its closest point in V:

dcham = (1/n) Σ_i min{ min_j ||ui − vj||, τ1 }, (18)

where τ1 is a threshold for truncation which reduces the effect of outliers and missing edges. Edge orientation is included by computing the chamfer score only for edges with similar orientation, in order to make the distance function more robust [10]. We use 8 orientation groups for edges.
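Equation (18) can be sketched directly as follows; this naive version omits the distance-transform speedup and the orientation grouping mentioned in the text, and the truncation threshold value is an illustrative assumption.

```python
import numpy as np

def truncated_chamfer(U, V, tau1=20.0):
    """d_cham: mean over points u_i in U of the truncated distance to the
    closest point in V (equation (18)). U and V are lists of 2D points."""
    U = np.asarray(U, dtype=float)
    V = np.asarray(V, dtype=float)
    # pairwise distances between every u_i and every v_j
    d = np.linalg.norm(U[:, None, :] - V[None, :, :], axis=2)
    return float(np.mean(np.minimum(d.min(axis=1), tau1)))
```

In practice the inner minimization is read off a precomputed distance transform of the edge image, as described in § 5.1, rather than computed pairwise.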

Texture (z2(Di)): Similar to the outline of a part, we represent the texture of an object by a set of exemplars. We use the VZ classifier [23], which obtains a texton dictionary by clustering the vectorized raw intensities of the N × N neighbourhood of each pixel in the exemplars. We use N = 3. The exemplars are then modelled as histograms of pixel texton labellings [18]. The feature z2(Di) is defined as the minimum χ2 distance of the histogram of texton labellings for Di with the histograms modelling the exemplars. We now describe how the lps parameters are learnt so as to handle intra-class variability in shape and appearance.
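The texture feature just described, a minimum χ2 distance to a set of texton-histogram exemplars, can be sketched as follows; the histograms are illustrative inputs rather than the output of an actual VZ texton dictionary.

```python
import numpy as np

def chi2(h1, h2, eps=1e-10):
    """Chi-squared distance between two texton histograms."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def z2_feature(hist_Di, exemplar_hists):
    """z2(D_i): minimum chi^2 distance to any texture exemplar."""
    return min(chi2(hist_Di, h) for h in exemplar_hists)
```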



4.2 Learning the LPS

The various parameters of the lps model are learnt using the method described in [16], which divides a scene in a video into rigidly moving components and provides the segmentation of each frame. We use this approach on 20 cow videos of 45 frames each1. Correspondence between parts learnt from two different videos is established using shape context with continuity constraints [22]. This gives us the multiple shape exemplars for each part required to compute the feature z1 and multiple texture exemplars for calculating z2, along with the layer numbers of all parts (see Figure 3). Furthermore, this provides us with an estimate of |ti − tj| (after normalizing the size of all cows to 230 × 130), for each frame and for all pairs of parts pi and pj. This is used to compute the parameters tmin_ij and tmax_ij that define valid configurations.

Fig. 3. Correspondence using shape context matching. Outlines of the two cows are shown in the first row. Lines are drawn to indicate corresponding points. The second and third row show the multiple exemplars of the head and the torso part obtained using this method.

To obtain the shape exemplars and texture exemplars for horses, we use 20 segmented images of horses2. A point to point correspondence is established over the outline of a cow from a training video to the outlines of the horses using shape context with continuity constraints [22]. Using this correspondence and the learnt parts of the cow, the parts of the horse are determined (see Figure 4). The part correspondence thus obtained maps the parameters tmin_ij and tmax_ij that were learnt for cows to horses. In the next section, we describe an efficient algorithm for matching the lps model in the image.

1 Courtesy Derek Magee, University of Leeds.
2 Courtesy Eran Borenstein, Weizmann Institute of Science.



Fig. 4. Correspondence using shape context matching. Outlines of a horse and a cow are shown in the first row. Lines are drawn to indicate corresponding points. The second and third row show the multiple exemplars of the head and the torso part obtained using this method.

5 Sampling the LPS

Given an image, our objective is to match the lps model to the image to obtain samples from the distribution g(Θ|Z). We develop a novel algorithm for efficient sampling which generalizes the method described in [7] to non-grid based mrfs. We achieve this sampling in two stages: (i) Initialization, where we fit a ps model of the object (using texture to measure z2(Di)) to a given image D as described in [15] without considering the layer numbers of the parts, and (ii) Refinement, where the initial estimate is refined by (a) using a better appearance model, i.e. the rgb distribution for the object and background, and (b) using the layering of the lps parts.

5.1 Initial Estimation of Poses

We find the initial estimate of the poses of the ps for an image D in two stages: (i) part detection, or finding putative poses for each part along with the corresponding likelihoods, and (ii) estimating posteriors of the putative poses.

Part detection: The putative poses of the parts are found using a tree cascade of classifiers as described in [15]. In our experiments, we constructed a 3-level tree by clustering the templates using a cost function based on chamfer distance. We use 20 exemplars per part and search over discrete rotations between −π/4 and π/4 in intervals of 0.1 radians, and scales between 0.7 and 1.3 in intervals of 0.1.

The edge image of D is found using edge detection with embedded confidence [19]. The feature z1(Di) (truncated chamfer distance) is computed efficiently by using a distance transform of the edge image. The feature z2(Di) is computed only at level 3 of the tree cascade, by efficiently determining the nearest neighbour of the histogram of texton labellings of Di among the histograms of the texture exemplars using the method described in [12] (modified for χ2 distance instead of Euclidean distance).

Associated with each node of the cascade is a threshold used to reject bad poses. The putative poses ti of parts pi are found by traversing through the tree cascade starting from the root node. The likelihoods p(Di|ai, si) are approximated by feature likelihoods p(zi|ai, si) as shown in equation (17).

Next, an initial estimate of the model is obtained by computing the posteriors of the putative poses. The pose of each part in the initial estimate is given by the putative pose which has the highest posterior.

Estimating posteriors: We use loopy belief propagation (lbp) to find the posterior probability of pi having a part label ti. lbp is a message passing algorithm proposed by Pearl [20]. It is a Viterbi-like algorithm for graphical models with loops.

The message that pi passes to its neighbour pj at iteration t is a vector of length equal to the number of discrete part labels nL of pj and is given by:

$$ m_{ij}^{t}(t_j) \leftarrow \sum_{t_i} p(\mathbf{z}_i|a_i, s_i) \exp(-\alpha(t_i, t_j)) \prod_{s \neq i, s \neq j} m_{si}^{t-1}(t_i). \qquad (19) $$

The beliefs (posteriors) after T iterations are calculated as:

$$ b_i^{T}(t_i) = p(\mathbf{z}_i|a_i, s_i) \prod_{s \neq i} m_{si}^{T}(t_i), \qquad (20) $$

and

$$ b_{ij}^{T}(t_i, t_j) = p(\mathbf{z}_i|a_i, s_i)\, p(\mathbf{z}_j|a_j, s_j) \prod_{s \neq i, s \neq j} m_{si}^{T}(t_i)\, m_{sj}^{T}(t_j). \qquad (21) $$
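The message and belief updates of equations (19)–(21) can be sketched as a small sum-product routine. The names and data structures below are hypothetical, and the pairwise matrix plays the role of exp(−α(ti, tj)); messages are normalized for numerical stability, which does not change the normalized beliefs.

```python
import numpy as np

def lbp(unary, edges, pairwise, n_iters=10):
    """Sum-product loopy belief propagation, after equations (19)-(20).

    unary:    dict node -> likelihood vector p(z_i | t_i) over labels.
    edges:    list of undirected node pairs (i, j).
    pairwise: dict (i, j) -> matrix of exp(-alpha(t_i, t_j)).
    Returns per-node beliefs b_i(t_i), normalized to sum to 1.
    """
    nbrs = {i: [] for i in unary}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    # All messages are initialized to 1 (eq. 19 at t = 0).
    msgs = {(i, j): np.ones(len(unary[j]))
            for a, b in edges for i, j in ((a, b), (b, a))}
    for _ in range(n_iters):
        new = {}
        for i, j in msgs:
            pw = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            # Product of the unary term and messages into i from all
            # neighbours except j (the product in eq. 19).
            prod = unary[i].copy()
            for s in nbrs[i]:
                if s != j:
                    prod = prod * msgs[(s, i)]
            m = pw.T @ prod           # sum over t_i in eq. (19)
            new[(i, j)] = m / m.sum()
        msgs = new
    beliefs = {}
    for i in unary:
        b = unary[i].copy()
        for s in nbrs[i]:
            b = b * msgs[(s, i)]      # eq. (20)
        beliefs[i] = b / b.sum()
    return beliefs
```

On a singly connected graph the beliefs agree with the exact marginals, which is easy to check against brute-force enumeration on a two-node model.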

All messages are initialized to 1. The algorithm is said to have converged when the rate of change of all beliefs falls below a certain threshold. The time complexity of this algorithm is O(nP·nL^2), where nP is the number of parts in the lps and nL is the number of putative poses per part. This makes sampling infeasible for large nL, which is the case with the smaller parts of the lps model such as the half-limbs. Thus, we develop an efficient novel algorithm for lbp for the case where the pairwise potentials are given by a Potts model as shown in equation (16). The algorithm exploits the fact that the number of pairs of part labels n′L, one for each of the two parts pi and pj, which form a valid configuration is much smaller than the total number of such pairs, nL^2, i.e. n′L ≪ nL^2. Note that a similar method is described in [7] which takes advantage of fast convolutions using fft. However, it is restricted to mrfs with regularly discretized labels, i.e. the labels lie on a grid, which is not true for the putative poses of the parts of the lps.

Let Ci(tj) be the set of part labels of pi which form a valid pairwise configuration with tj. The part labels Ci(tj) are computed just once before running lbp.


We define

$$ T(i, j) = \sum_{t_i} p(\mathbf{z}_i|a_i, s_i) \prod_{s \neq i, s \neq j} m_{si}^{t-1}(t_i), \qquad (22) $$

which is independent of the part label tj of pj and needs to be calculated only once before pi passes a message to pj. It is clear from equation (19) that if no part label ti forms a valid configuration with tj, then the message mij(tj) is simply exp(−d2)T(i, j). To compute the contribution of the labels ti ∈ Ci(tj) to mij(tj), we define

$$ S(i, t_j) = \sum_{t_i \in C_i(t_j)} p(\mathbf{z}_i|a_i, s_i) \prod_{s \neq i, s \neq j} m_{si}^{t-1}(t_i), \qquad (23) $$

which is computationally inexpensive to calculate since Ci(tj) consists of very few part labels. The message mij(tj) at iteration t is then calculated as

$$ m_{ij}^{t}(t_j) \leftarrow \exp(-d_1)\, S(i, t_j) + \exp(-d_2)\left(T(i, j) - S(i, t_j)\right). \qquad (24) $$
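Equations (22)–(24) translate almost directly into code. The sketch below (with hypothetical names) computes T(i, j) once, then S(i, tj) over the few valid labels, and checks the result against the naive O(nL^2) message of equation (19) specialized to the Potts model.

```python
import numpy as np

def potts_messages(lik, prod, valid, d1, d2):
    """Message computation of equations (22)-(24) for a Potts model.

    lik:   lik[ti] = p(z_i | a_i, s_i), likelihood of pose ti of part p_i.
    prod:  prod[ti] = product of incoming messages m^{t-1}_{si}(ti), s != i, j.
    valid: valid[tj] = list C_i(t_j) of poses ti compatible with tj.
    d1, d2: Potts costs for valid and invalid configurations (d1 < d2).
    Returns the message vector over the poses tj of p_j.
    """
    contrib = lik * prod
    T = contrib.sum()                        # eq. (22): shared by every tj
    msg = np.empty(len(valid))
    for tj, C in enumerate(valid):
        S = contrib[C].sum() if C else 0.0   # eq. (23): few valid labels only
        msg[tj] = np.exp(-d1) * S + np.exp(-d2) * (T - S)   # eq. (24)
    return msg

def naive_messages(lik, prod, valid, d1, d2):
    """O(n_L^2) reference implementation of eq. (19), for checking."""
    msg = np.zeros(len(valid))
    for tj in range(len(valid)):
        for ti in range(len(lik)):
            d = d1 if ti in valid[tj] else d2
            msg[tj] += lik[ti] * prod[ti] * np.exp(-d)
    return msg
```

Since each C_i(t_j) holds only a handful of labels, the fast version does O(n′L) work per message on top of the one-off O(nL) sum, in place of O(nL^2).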

Fig. 5. Two example cow images are shown in the first row. The second row shows the initial estimate obtained for the poses of the parts (see § 5.1). The half-limbs tend to overlap since layer numbers are not used. Refined estimates of the poses, obtained using the rgb distribution of the foreground and background together with the lps model, are shown in the third row (see § 5.2). The fourth row shows the segmentation obtained using the ObjCut algorithm (see § 6).


Our method speeds up lbp by a factor of nearly nL. The extension to the generalized Potts model is trivial. The beliefs computed using lbp allow us to determine the map estimate, which provides the initial estimate of the poses of the parts. Figure 5 (row 2) shows the initial estimate obtained for two cow images. The initial estimate is refined using the lps model as described below.

5.2 Layerwise Refinement

Once the initial estimate of the parts is obtained using texture, we refine it by using the colour of the object and background together with the lps model. The colours of the object and background are represented as histograms Hobj and Hbkg of rgb values learnt using the initial estimate. The feature z2(Di) is now redefined such that

$$ p(\mathbf{z}_2|a_i) = \prod_{x \in \mathbf{D}_i} \frac{p(x|H_{obj})}{p(x|H_{bkg})}, \qquad (25) $$

i.e. no longer using texture exemplars. The refined estimates of the poses are obtained by compositing the parts of the lps in descending order of their layer numbers, as follows. When considering layer li, putative poses of the parts pj belonging to li are found using the tree cascade of classifiers around the initial estimate of pj. In our experiments, we consider locations which are at most at a distance of 15% of the size of the object as given by the initial estimate. When computing the likelihood of a part at a given pose, pixels which have already been accounted for by a previous layer are not considered. The posteriors over the putative poses are computed using the efficient lbp algorithm.
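A minimal sketch of the ratio in equation (25), working in the log domain to avoid underflow over many pixels. The function name `colour_log_ratio`, the 8-bins-per-channel quantization, and the smoothing constant `eps` are assumptions for illustration, not the chapter's implementation.

```python
import numpy as np

def colour_log_ratio(pixels, h_obj, h_bkg, bins=8, eps=1e-6):
    """Log of the product-of-ratios in equation (25) for a part's pixels.

    pixels: (N, 3) uint8 RGB values inside the part's domain D_i.
    h_obj, h_bkg: normalized 3-D RGB histograms (bins per channel),
                  learnt from the initial estimate of figure and ground.
    """
    # Quantize each channel into `bins` equal-width buckets.
    idx = (pixels // (256 // bins)).astype(int)
    # Look up the histogram mass of every pixel; eps guards empty bins.
    p_obj = h_obj[idx[:, 0], idx[:, 1], idx[:, 2]] + eps
    p_bkg = h_bkg[idx[:, 0], idx[:, 1], idx[:, 2]] + eps
    return np.sum(np.log(p_obj) - np.log(p_bkg))
```

A positive value means the pixels are better explained by the object histogram than by the background one.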

5.3 Sampling the lps

One might argue that if the map estimate of the poses has a very high posterior compared to other configurations of poses, then equation (13) can be approximated using only the map estimate Θ∗ instead of the samples Θ1 . . . Θs. However, we found that this is not the case, especially when the rgb distribution of the background is similar to that of the object. Figure 5 (row 3) shows the map estimate of the refined poses of the parts using the initial estimate shown in Figure 5 (row 2). Note that the legs of the first cow in Figure 5 (row 3) are detected incorrectly, since parts of the background have roughly the same colour as the cow. Thus, it is necessary to use multiple samples of the lps model.

We describe the method of sampling for 2 layers. The extension to an arbitrary number of layers is trivial. To obtain a sample Θi, parts belonging to layer 2 are considered first. The posterior for sample Θi is approximated using lbp as

$$ g(\Theta_i|\mathbf{Z}) = \frac{\prod_{ij} b_{ij}(t_i, t_j)}{\prod_i b_i(t_i)^{q_i - 1}}, \qquad (26) $$

where qi is the number of neighbouring parts of part i. Note that the posterior is exact only for a singly connected graph. However, using this approximation lbp


has been shown to converge to stationary points of the Bethe free energy [24]. The posterior is then sampled for the poses of the parts, one part at a time (i.e. Gibbs sampling), such that the pose of the part being sampled forms a valid configuration with the poses of the parts previously sampled. The process is repeated to obtain multiple samples Θi, which do not include the poses of parts belonging to layer 1. This method of sampling is efficient since the sets Ci(tj) are pre-computed and contain very few part labels. The best nS samples, i.e. those with the highest belief, are chosen.
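The part-by-part sampling of valid configurations can be sketched as follows. The interface is hypothetical: `posteriors` stands in for the per-part pose/belief pairs produced by lbp, and the `compatible` predicate stands in for the valid-configuration test encoded by the sets Ci(tj).

```python
import random

def sample_configuration(parts, posteriors, compatible, rng=random):
    """Draw one sample of part poses, one part at a time, keeping only
    poses that form a valid configuration with the parts already
    sampled (the Gibbs-like sweep of Section 5.3).

    parts:      ordered list of part identifiers.
    posteriors: dict part -> list of (pose, belief) pairs.
    compatible: predicate (part_a, pose_a, part_b, pose_b) -> bool.
    Returns dict part -> sampled pose, or None if sampling dead-ends.
    """
    sample = {}
    for p in parts:
        # Restrict to poses valid against everything sampled so far.
        cands = [(pose, w) for pose, w in posteriors[p]
                 if all(compatible(p, pose, q, sample[q]) for q in sample)]
        if not cands:
            return None
        poses, weights = zip(*cands)
        # Sample the pose proportionally to its belief.
        sample[p] = rng.choices(poses, weights=weights, k=1)[0]
    return sample
```

Repeating this sweep yields multiple candidate configurations, from which the highest-belief samples are retained.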

To obtain the poses of parts in layer 1 for sample Θi, we fix the poses of the parts belonging to layer 2 as given by Θi and calculate the posterior over the poses of parts in layer 1 using lbp. We sample this posterior for poses of parts such that they form a valid configuration with the poses of the parts in layer 2 and with those previously sampled. As in the case of layer 2, multiple samples are obtained and the best nS samples are chosen. The process is repeated for all samples Θi for layer 2, resulting in a total of nS^2 samples.

However, since computing the likelihood of the parts in layer 1 for each Θ is inefficient, we approximate it by using only those poses whose overlap with layer 2 is below a threshold τ. Figure 6 shows some of the samples obtained using the above method for the cows shown in Figure 5. These samples are the input for the ObjCut algorithm.

Fig. 6. Posteriors over the putative poses of parts are calculated using lbp. The posterior is then sampled to obtain instances of the object (see § 5.3). The half-limbs are detected correctly in some samples.

6 Estimation – The ObjCut Algorithm

Given an image D containing an instance of a known object category, and the samples Θ1 . . . Θs of the lps parameters, we wish to obtain the segmentation of the object, i.e. infer the labels m. We now present the ObjCut algorithm, which provides reliable segmentation using both (a) modelled and (b) unmodelled deformations of articulated object categories.


Modelled deformations. These are taken into account by the lps model, which uses multiple shape exemplars for each part and allows for all valid configurations of the object category using the pairwise potentials α(ti, tj). The various samples Θi localize the parts of the object in the image. They also provide us with refined estimates of the histograms Hobj and Hbkg which model the appearance of the figure and ground.

Unmodelled deformations. These are accounted for by merging pixels surrounding the object which are similar in appearance to the object. Only those pixels which lie in a ‘band’ surrounding the outline of the object are considered. The width of the band is 10% of the size of the object as specified by the sample of the lps. Points lying inside the object are given preference over the points surrounding the object. As is the case with mrf-based segmentations, boundaries are preferred around image edges (using the contrast term φ(D|mx, my)).

The segmentation is obtained by minimizing equation (13) using the mincut algorithm. The various terms in equation (13) are defined as follows. The weights wi are approximated as wi ≈ g(Θi|Z). The data likelihood term φ(D|mx) is computed using equation (4). The contrast term is given by equations (4) and (5). The function φ(mx|Θi) is defined by equation (7). Table 1 summarizes the main steps of obtaining the segmentation using the ObjCut algorithm.
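The structure of this minimization can be sketched with stand-in potentials. The unary and pairwise terms below are hypothetical placeholders for the terms defined in equations (4)–(7), and the optimum is found by brute force on a toy problem; the chapter instead obtains it with a single mincut, which finds the same optimum for this submodular energy.

```python
import itertools

def objcut_energy(m, weights, shape_priors, data_term, contrast, edges):
    """Stand-in for the energy of equation (13): a data term, a
    sample-weighted shape prior phi(m_x | Theta_i), and a
    contrast-sensitive pairwise term (all illustrative placeholders)."""
    e = 0.0
    for x, label in enumerate(m):
        e += data_term[x][label]
        # Shape prior averaged over samples with weights w_i ~ g(Theta_i|Z).
        e += sum(w * sp[x][label] for w, sp in zip(weights, shape_priors))
    for x, y in edges:
        if m[x] != m[y]:
            # Penalize label changes; cheaper across strong image edges.
            e += contrast[(x, y)]
    return e

def minimize_brute_force(n_pixels, *args):
    """Exhaustive minimization over binary labellings (toy scale only)."""
    best = min(itertools.product([0, 1], repeat=n_pixels),
               key=lambda m: objcut_energy(m, *args))
    return list(best)
```

On a two-pixel example where the data term favours conflicting labels, a large contrast weight forces both pixels to the cheaper joint labelling, which is the behaviour a single mincut reproduces at image scale.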

The figure-ground labelling m obtained as described above can be used iteratively to refine the segmentation using the em algorithm. However, we found that this does not result in a significant improvement over the initial segmentations, as the samples Θ1 . . . Θs do not change much from one iteration to the next.

Table 1. The ObjCut algorithm

1. Given an image D, an object category is chosen, e.g. cows or horses.
2. Initial estimate of the pose of the pictorial structure (PS) using edges and texture:
   (a) A set of candidate poses ti = (xi, yi, θi, σi) for each part is identified using a tree cascade of classifiers [15].
   (b) An initial estimate of the PS is found using the efficient LBP algorithm described in §5.1.
3. Improved estimation of the pose of the layered pictorial structure (LPS), taking into account occlusion and colour:
   (a) Update the appearance model of both foreground and background as described in §5.2.
   (b) Generate a new set of candidate poses for each part by densely sampling pose space around the estimate found in step 2 above.
   (c) Estimate the pose of the LPS using efficient LBP and layering as described in §5.2.
4. Obtain the samples Θ1, · · · , ΘS from the posterior g(Θ|Z) of the LPS (§5.3).
5. OBJCUT:
   (a) Compute the weights wi = g(Θi|Z).
   (b) Minimize the energy in equation (13) using a single MINCUT operation to obtain the segmentation m.


Fig. 7. Segmentation results I. The first two images in each row show some of the samples of the lps model. The segmentation obtained using the object category specific mrf is shown in the last column. Most of the errors were caused by the tail (which was not a part of the lps model) and by parts of the background which were close to, and similar in colour to, the object.


7 Results

We present several results of the ObjCut algorithm for two object categories, namely cows and horses, and compare it with a state-of-the-art method and with ground truth. In all our experiments, we used the same values of the parameters. Figure 5 (row 4) shows the results of the ObjCut algorithm for two cow images. Figures 7 and 8 show the segmentation of various images of cows and horses respectively. To obtain a comparison with ground truth, all 8 cow images and 5 horse images were manually segmented. For the cow images, out of the

Fig. 8. Segmentation results II. The lps model for the horse, learnt using segmented horse images, was used to segment previously unseen images. Most of the errors were caused by unmodelled parts, i.e. the mane and the tail.


Fig. 9. Comparison with Leibe and Schiele. The first two images of each row show some of the samples obtained by matching the lps model to the image. The third image is the segmentation obtained using the ObjCut algorithm and the fourth image is the segmentation obtained using [17]. Note that ObjCut provides a better segmentation of the torso and head without detecting extra half-limbs.

125,362 foreground pixels and 472,670 background pixels present in the ground truth, 120,127 (95.82%) and 466,611 (98.72%) respectively were present in the segmentations obtained. Similarly, for the horse images, out of the 79,860 foreground pixels and 151,908 background pixels present in the ground truth, 71,397 (89.39%) and 151,185 (99.52%) respectively were present in the segmentations obtained. In the case of horses, most errors are due to the unmodelled mane and tail parts. The results indicate that, by considering both modelled and unmodelled deformations, excellent segmentations are obtained by ObjCut.
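The reported percentages follow directly from the pixel counts; small differences in the last digit can arise from rounding.

```python
def pixel_accuracy(correct, total):
    """Percentage of ground-truth pixels recovered in the segmentation."""
    return 100.0 * correct / total

# Recomputing the figures reported above for the cow images:
cow_fg = pixel_accuracy(120127, 125362)   # foreground, ~95.82%
cow_bg = pixel_accuracy(466611, 472670)   # background, ~98.72%
```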

Figure 9 shows a comparison of the segmentation results obtained using ObjCut with a state-of-the-art method for object category specific segmentation described by Leibe and Schiele [17]. A similar approach was described in [4]. The ObjCut algorithm provides better segmentations using a significantly smaller number of exemplars, by exploiting the ability of mincut to provide excellent segmentations given a good initialization obtained by the lps.

Figure 10 shows the effects of using only the shape or only the appearance information, discarding the other completely. Shape alone undersegments the image, as different samples assign different poses to the half-limbs. Thus, the segmentation mainly includes the head and the torso (which do not change considerably among the various samples). Using only appearance results in some parts of the background, which have a similar colour to the object, being pulled into the segmentation. The results indicate that good segmentations depend on combining both shape and appearance, as is the case with the ObjCut algorithm.


Fig. 10. Effects of shape and appearance information. The first column shows an image containing a cow. The segmentation results obtained by using only the rgb histograms for the object and the background provided by the lps model are shown in the second column. The results obtained by using only the shape prior provided by the lps model are shown in the third column. The fourth column shows the segmentations obtained using the ObjCut algorithm. The results indicate that a good segmentation is obtained only when both shape and appearance information are used.

8 Summary and Conclusions

We presented a new model, called the object category specific mrf, which combines the lps and mrf models to perform object category specific segmentation. The method needs to be extended to handle multiple visual aspects of an object category and to deal with partial occlusion by other objects.

Acknowledgments

We thank Dan Huttenlocher for fruitful discussions on efficient lbp. This work was supported in part by the IST Programme of the European Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the authors’ views.

References

1. A. Agarwal and B. Triggs. Tracking articulated motion using a mixture of autoregressive models. In ECCV, pages III: 54–65, 2004.

2. S. Agarwal and D. Roth. Learning a sparse representation for object detection. In ECCV, page IV: 113 ff., 2002.


3. A. Blake, C. Rother, M. Brown, P. Perez, and P.H.S. Torr. Interactive image segmentation using an adaptive GMMRF model. In ECCV, pages I: 428–441, 2004.

4. E. Borenstein and S. Ullman. Class-specific, top-down segmentation. In ECCV, page II: 109 ff., 2002.

5. Y. Boykov and M.P. Jolly. Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In ICCV, pages I: 105–112, 2001.

6. P.F. Felzenszwalb and D.P. Huttenlocher. Efficient matching of pictorial structures. In CVPR, pages II: 66–73, 2000.

7. P.F. Felzenszwalb and D.P. Huttenlocher. Fast algorithms for large state space HMMs with applications to web usage analysis. In NIPS, 2003.

8. R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, pages II: 264–271, 2003.

9. D. Freedman and T. Zhang. Interactive graph cut based segmentation with shape priors. In CVPR, pages I: 755–762, 2005.

10. D.M. Gavrilla. Pedestrian detection from a moving vehicle. In ECCV, pages II: 37–49, 2000.

11. A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman and Hall, 1995.

12. J. Goldstein, J. Platt, and C. Burges. Indexing high-dimensional rectangles for fast multimedia identification. Technical Report MSR-TR-2003-38, Microsoft Research, 2003.

13. R. Huang, V. Pavlovic, and D.N. Metaxas. A graphical model framework for coupling MRFs and deformable models. In CVPR, pages II: 739–746, 2004.

14. V. Kolmogorov and R. Zabih. What energy functions can be minimized via graph cuts? IEEE PAMI, 26(2):147–159, 2004.

15. M.P. Kumar, P.H.S. Torr, and A. Zisserman. Extending pictorial structures for object recognition. In BMVC, pages II: 789–798, 2004.

16. M.P. Kumar, P.H.S. Torr, and A. Zisserman. Learning layered pictorial structures from video. In ICVGIP, pages 148–153, 2004.

17. B. Leibe and B. Schiele. Interleaved object categorization and segmentation. In BMVC, pages II: 264–271, 2003.

18. T. Leung and J. Malik. Recognizing surfaces using three-dimensional textons. In ICCV, pages 1010–1017, 1999.

19. P. Meer and B. Georgescu. Edge detection with embedded confidence. PAMI, 23:1351–1365, December 2001.

20. J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.

21. C. Rother, V. Kolmogorov, and A. Blake. GrabCut: interactive foreground extraction using iterated graph cuts. In SIGGRAPH, pages 309–314, 2004.

22. A. Thayananthan, B. Stenger, P.H.S. Torr, and R. Cipolla. Shape context and chamfer matching in cluttered scenes. In CVPR, pages I: 127–133, 2003.

23. M. Varma and A. Zisserman. Texture classification: Are filter banks necessary? In CVPR, pages II: 691–698, 2003.

24. J. Yedidia, W. Freeman, and Y. Weiss. Bethe free energy, Kikuchi approximations, and belief propagation algorithms. Technical Report TR2001-16, MERL, 2001.