    Int J Comput Vis · DOI 10.1007/s11263-009-0228-y

    Shape-Based Object Localization for Descriptive Classification

    Geremy Heitz · Gal Elidan · Benjamin Packer · Daphne Koller

    Received: 12 October 2007 / Accepted: 2 March 2009
    © Springer Science+Business Media, LLC 2009

    Abstract Discriminative tasks, including object categorization and detection, are central components of high-level computer vision. However, sometimes we are interested in a finer-grained characterization of the object’s properties, such as its pose or articulation. In this paper we develop a probabilistic method (LOOPS) that can learn a shape and appearance model for a particular object class, and be used to consistently localize constituent elements (landmarks) of the object’s outline in test images. This localization effectively projects the test image into an alternative representational space that makes it particularly easy to perform various descriptive tasks. We apply our method to a range of object classes in cluttered images and demonstrate its effectiveness in localizing objects and performing descriptive classification, descriptive ranking, and descriptive clustering.

    Keywords Probabilistic graphical models · Deformable shape models · Object recognition · Markov random fields

    Authors G. H., G. E. and B. P. contributed equally to this manuscript.

    G. Heitz (✉) · B. Packer · D. Koller
    Stanford University, Stanford, CA 94305, USA
    e-mail: [email protected]

    B. Packer
    e-mail: [email protected]

    D. Koller
    e-mail: [email protected]

    G. Elidan
    Department of Statistics, Hebrew University of Jerusalem, Jerusalem, 91905, Israel
    e-mail: [email protected]

    1 Introduction

    Discriminative questions such as “What is it?” (categorization) and “Where is it?” (detection) are central to machine vision and have received much attention in recent years. In many cases, however, we are also interested in more refined descriptive questions. We define a descriptive query to be one that considers characteristics of an object that vary between instances within the object category. Examples include questions such as “Is the giraffe standing upright or bending over to drink?”, “Is the cheetah running?”, or “Find me all lamps that have a beige, rectangular lampshade”. These questions relate to the object’s pose, articulation, or localized appearance.

    In principle, it is possible to convert such descriptive queries into discriminative classification tasks given an appropriately labeled training set. But to take such an approach, we would need to construct a specialized training set (often a large one) for each descriptive question we want to answer, at significant human effort. Intuitively, it seems that we should be able to avoid this requirement. After all, we can convey to a person the difference between a standing and a bending giraffe using one or two training instances of each. The key, of course, is that the person has a good representational model of what giraffes look like, one in which the salient differences between the different descriptive labels are easily captured and learned.

    In this paper, we focus on the aspect of object shape, an important characteristic of many objects that can be used as the basis for many descriptive distinctions (Felzenszwalb and Huttenlocher 2000), including the examples described above. We introduce a method that automatically learns a probabilistic characterization of an object class’s shape and appearance, allowing instances in that class to be encoded in this new representational space. We provide an algorithm for converting new test images containing the object into this representation, and show that this ability allows us to answer difficult descriptive questions, such as the ones above, using a very small number of training instances. Importantly, the specific descriptive questions are not known at the time the probabilistic model is learned, and the same model can be used for answering multiple questions.

    Concretely, we propose an approach called LOOPS: Localizing Object Outlines using Probabilistic Shape. Our LOOPS model combines two components: a model of the object’s deformable shape, encoded as a joint probability distribution over the positions of a set of automatically selected landmark points on the outline of the object; and a set of appearance-based boosted detectors that are aimed at individually localizing each of these landmarks. Both components are automatically learned from a set of labeled training contours, of the type obtained from the LabelMe dataset (http://labelme.csail.mit.edu). The location of these landmarks in an image, plus the landmark-localized appearance, provide an alternative representation from which many descriptive tasks can be easily answered. A key technical challenge, which we address in this paper, is to correspond this model to a novel image that often contains a large degree of clutter and object deformation or articulation.

    Contour-based methods such as active shape/appearance models (AAMs) (Cootes et al. 1998; Sethian 1998) were also developed with the goal of localizing an object shape model to an image. However, these methods typically require good initial guesses and are applied to images with significantly less clutter than real-life photographs. As a result, AAMs have not been successfully used for class-level object recognition/analysis. Some works use some form of geometry as a means toward an end goal of object classification or detection (Fergus et al. 2003; Hillel et al. 2005; Opelt et al. 2006b; Shotton et al. 2005). Since, for example, a misplaced leg has a negligible effect on classification, these works neither attempt to optimize localization nor evaluate their ability to do so. Other works do attempt to accurately localize objects in cluttered photographs but only allow for relatively rigid configurations (e.g., Berg et al. 2005; Ferrari et al. 2008), and cannot capture large deformations such as the articulation of the giraffe’s neck (see Fig. 1).

    To allow robust localization of landmarks in cluttered images with significant deformation, we propose a hybrid method that combines discrete global and continuous local search steps. In the discrete phase, we use a discrete space defined by a limited set of candidate assignments for each landmark; we use a Markov random field (MRF) to define an energy function over this set of assignments, and effectively search this multi-modal, combinatorial space by applying state-of-the-art probabilistic inference methods over this MRF. This global search step allows us to find a rough but close-to-optimal solution, despite the many local optima resulting from the clutter and the large deformations allowed by our model. This localization can then be refined using a continuous hill-climbing approach, allowing a very good solution to be found without requiring a good initialization or multiple random restarts.

    Fig. 1 Example deformations of the giraffe object class. The axes show the location of instances along the two principal components in our dataset. The horizontal axis corresponds to articulation of the neck, while the vertical axis corresponds to articulations of the legs. The ellipse indicates the set of instances within one standard deviation of the mean. We show four typical examples of giraffes that illustrate the outer extremes along the first two modes of variation



    Preliminary investigations showed that a simpler approach that does a purely local search, similar to the active appearance models of Cootes et al. (1998), was unable to deal with the challenges of our data.

    To evaluate the performance of our method, we consider a variety of object classes in cluttered images and explore the space of applications facilitated by the LOOPS model in two orthogonal directions. The first concerns the machine learning task: we present results for classification, search (ranking), and clustering. The second direction varies the components that are extracted from the LOOPS outlines for these tasks: we show examples that use the entire object shape, a subcomponent of the object shape, and the appearance of a specific part of the object. Given enough task-specific data, each of these applications might be accomplished by other methods. However, we show that the LOOPS framework allows us to approach these tasks with a single class-based model, without the need for task-specific training.

    2 Overview of the LOOPS Method

    As discussed, having a representation of the constituent elements of an object should aid in answering descriptive questions. For example, to decide whether a giraffe is standing upright or bending down to drink, we might look at the characteristics of the model elements that correspond to the head, neck, or legs. In this section we briefly describe our representation over these automatically learned elements (landmarks). We also provide an overview of the LOOPS method for learning an object class model over these landmarks and using it to localize corresponded outlines in test images. A flowchart of our method is depicted in Fig. 2.

    We start by representing the shape of an object class via an ordered set of N landmark points that together constitute a piecewise linear contour (see Fig. 1). It is important that these landmarks be corresponded across instances—landmark ‘17’ in one instance should correspond to the same meaningful point (for example, the nose) as landmark ‘17’ in another instance. Obtaining training instances labeled with corresponded landmarks, however, requires painstaking supervision. Instead, we would like to use simple outlines such as those in the LabelMe dataset (http://labelme.csail.mit.edu), which requires much less supervision and for which many examples are available from a variety of classes. We must therefore automatically augment the simple training outlines with a corresponded labeling. That is, we want to specify a relatively small number of landmarks that still represent the shape faithfully, and consistently position them on all training outlines. This stage transforms arbitrary outlines into useful training data as depicted in Fig. 2(a). We describe our method for performing this automatic correspondence in Sect. 4. Once we have our set of training outlines, each with N corresponded landmarks, we can construct a distribution over the geometry of the objects’ outline as depicted in Fig. 2 (middle) and augment this with appearance-based features to form a LOOPS model, as described in Sect. 5. With a model for an object class in hand, we now face the real computational challenge that our goal poses: how to localize the landmarks of the model in test images in the face of clutter and large deformations and articulations (fourth box in Fig. 2). Our solution involves a two-stage scheme: we first use state-of-the-art inference from the field of graphical models to achieve a coarse solution in a discretized search space, and then refine this solution using greedy optimization.

    Fig. 2 A flowchart depicting the stages of our LOOPS method



    The localization of outlines in test images is described in detail in Sect. 6. Finally, once the localization is complete, we can readily perform a range of descriptive tasks (classification, ranking, clustering), based on the predicted location of landmarks in test images as well as appearance characteristics in the vicinity of those landmarks. We demonstrate how this is carried out for several descriptive tasks in Sect. 8.

    3 Related Work

    Our method relies on a joint probabilistic model over landmark locations that unifies boosted detectors and global shape toward corresponded object localization. Our boosted detectors are constructed, for the most part, based on the state-of-the-art object recognition methods of Opelt et al. (2006b) and Torralba et al. (2005) (see Sect. 5.2 for details). In this section, we concentrate on the most relevant works that employ some notion of geometry or “shape” for various image classification tasks.

    Shape Based Classification Several recent works study the ability to classify the shape of an outline directly (Basri et al. 1998; Grauman and Darrell 2005; Belongie et al. 2000; Sebastian et al. 2004; Felzenszwalb and Schwartz 2007). Such techniques often take the form of a matching problem, in which the cost between two instances is the cost of “deforming” one shape to match the other. Most of these methods assume that the object of interest is either represented as a contour or has already been segmented out from the image. While many of these approaches could aid classification of objects in cluttered scenes, few of them address the more serious challenge of extracting the object’s outline from such an image. One exception is the work of Thayananthan et al. (2003), which uses shape-contexts for object localization in cluttered scenes. Their method demonstrates matching of rigid templates and a manually-constructed model of the hand, but does not attempt to learn models from images.

    Contour Propagation Methods Some of the best-studied methods in machine vision for object localization and boundary detection rely on shape to guide the search for a known object. In medical imaging, active contour techniques such as snakes (Cremers et al. 2002; Caselles et al. 1995), active shape models (Cootes et al. 1995), or more recently level set and fast marching methods (Sethian 1998) combine simple image features with relatively sophisticated shape models. In medical imaging applications, where the shape variation is relatively low and a good starting point can easily be found, these methods have achieved impressive results.

    Attempting to apply these methods to recognizing object classes in cluttered scenes, however, faces two problems. The first is that we generally cannot find a good starting point, and even if we do, localization is complicated by the many local maxima present in real images. Secondly, edge cues alone are typically insufficient to identify the object in question. In an attempt to overcome this problem, active appearance models were introduced by Cootes et al. (1998). While this method achieved success in the context of medical imaging or face outlining in images with little additional clutter, it has not (to the best of our knowledge) been successfully applied to the kind of cluttered images that we consider in our experimental evaluation. Indeed, our own early experiments confirmed that a contour front propagation approach with several starting points is generally unable to accurately locate outlines in complex images, and simple methods for finding a good starting point—such as first detecting a bounding box—are insufficient.

    Contour Outlining in Real Images Ferrari et al. match test images to a contour-based shape model constructed from a single hand-drawn outline (Ferrari et al. 2006) and learn a similar model automatically from a set of training images with supervised bounding boxes (Ferrari et al. 2007, 2008). Their algorithm achieves impressive results at multiple scales and in a fair amount of clutter using only detected edges in the image. However, they do not, in these papers, attempt to use the discovered shapes for any descriptive classification task. In experiments below, we compare to this method, and show that our method produces more useful outlines for the descriptive classification tasks that we target. Furthermore, our method is more appropriate for classes in which the shape is more deformable, or for cluttered images where features other than edge fragments might be useful.

    “Parts”-Based Methods Following the generative constellation model of Fergus et al. (2003), several works make use of explicit shape by relying on a model of parts (patches) and their geometrical arrangement (e.g., Hillel et al. 2005; Fergus et al. 2005). Most of these models are trained for detection or classification accuracy, rather than accurate localization of the constituent “parts”, and do not provide evidence that the discovered parts have any consistent meaning or localization. In fact, even with a “correct” set of parts, localization need not be accurate to aid detection/classification. To see this, consider an example where the giraffe abdomen and legs are localized correctly but the head and neck are incorrectly pointing upwards. In this case, classification and bounding box detection are still probably correct, but the localization cannot be used to determine the pose of the giraffe. Indeed, instead of trying to localize parts, the constellation approach averages over localization. While this is the correct strategy when the goal is detection/classification, it fails to provide a single most-likely assignment of parts that could be used for the further processing that we consider.

    The work of Crandall et al. (2005) uses a parts-based model with manually annotated parts. They evaluate the accuracy of localizing these parts in test images. Their follow-up work (Crandall and Huttenlocher 2006) automatically learns the parts but, similar to the constellation method, uses these only as cues for object detection. Our work learns object landmarks automatically and explicitly uses them not only for detection, but for further descriptive querying.

    Additional Use of Shape for Classification, Detection and Segmentation Foreground/background segmentation is another task that is closely related to object outlining. The class-based segmentation works of Kumar et al. (2005) and Borenstein et al. (2004) rely on class-specific image features to segment objects of the relevant class. These approaches, however, aim to maximize pixel (or region) label prediction and do not provide a corresponded outline. Winn and Shotton (2006) attempt simultaneous detection and segmentation using a CRF over object parts. They evaluate their segmentation quality, but do not consider the accuracy of their part localization. In experiments below, we use existing code from Prasad and Fitzgibbon (2006), which is based on the work of Kumar et al. (2005), to produce uncorresponded segmentation outlines that are compared to de-corresponded outlines produced by our method.

    Many recent state-of-the-art methods rely on implicit shape to aid in object recognition. For instance, Torralba et al. (2005) use image patches, and Opelt et al. (2006b) and Shotton et al. (2005) use edge fragments as weak detectors with offsets from the object centroid that are boosted to produce a strong detector. These works combine features in order to optimize the aggregate decision, and typically do not aim to segment/outline objects. Leordeanu et al. (2007) and Berg et al. (2005) similarly combine local image features with pairwise relationship features to solve model matching as a quadratic assignment problem. While such models give a correspondence between model components and the image, the training regime optimizes recognition rather than the accuracy of these correspondences. Despite the training regimes of the above models, the implicit shape used by these models can be used to produce both a detection and a localization through intelligent use of the “weak” detectors and their offsets. Indeed, both Leibe et al. (2004) and Opelt et al. (2006a) produce both a detection and a soft segmentation. While these methods tend to show appealing anecdotal results, the merit of their localizations (both corresponded and uncorresponded) is hard to measure, as no quantitative evaluation of segmentation quality or “part” localization is provided.

    4 Automatic Landmark Selection

    In this section we describe our method for transforming simple outlines into corresponded outlines over a relatively small and consistent set of landmarks, as in Fig. 3. Many works have studied the problem of automatic landmark selection for point distribution models (PDMs) for use in ASMs or AAMs. In particular, Hill and Taylor (1996) developed a robust method based on optimization of a cost function over a sparse polygonal representation of the outlines. Any robust method for correspondence between training outlines and salient landmark selection should work for our purposes, and for completeness we present our method, which is simple and was robust and accurate for all classes that we considered. Our method builds on an intuition similar to that of Hill and Taylor (1996).

    Intuitively, a good correspondence trades off between two objectives.

    Fig. 3 Example training instances with an outline defined by a piecewise linear contour connecting 42 landmarks. Note that the landmarks are not labeled in the training data but automatically learned using the method described in Sect. 4. These instances demonstrate the effectiveness of the correspondence method (e.g., the larger green landmark in the middle of the back is landmark #7 in all instances) as well as imperfections (e.g., misalignments near the feet). The most systematic problems occur in articulated instances, such as these. While the three landmarks at the lower back of the giraffe are consistent across these instances, some of the landmarks along the head must “slide” in order to keep the overall alignment correct


    Fig. 4 Our arc-length correspondence procedure. (a) The first outline, O^{(1)}, with landmark 0 marked. (b), (c) The second outline, O^{(2)}, with two different choices for landmark 0, along with the correspondence cost for that choice

    The first is that corresponding points should be “equally spaced” around the boundary of the object, and the second is that the nonrigid transformation that aligns the landmark points should be small, in some meaningful sense. In our case, we achieve the correspondence using a simple two-step process: find a correspondence between high-resolution outlines, then prune down to a small number of “salient” landmarks.

    Arc-Length Correspondence Our correspondence algorithm searches for a correspondence between pairs of instances that achieves the lowest cost. Suppose we have a candidate correspondence between two outlines, represented by the vectors of points O^{(1)} and O^{(2)}. We first analytically determine the transformation of O^{(2)} that minimizes its least-mean-square distance to O^{(1)}. Applying this transformation to each landmark produces Õ^{(2)}. Our score penalizes changes in the offset vectors between distant landmarks on our two outlines. In particular, we compute our cost as:

    \mathrm{Cost}(O^{(1)}, \tilde{O}^{(2)}) = \sum_{i,j} \bigl\| \delta^{(1)}_{ij} - \tilde{\delta}^{(2)}_{ij} \bigr\|^2,

    where δ^{(m)}_{ij} is the vector offset between landmark i and landmark j on contour m. The sum over i, j includes all landmarks i together with the landmark j that has the largest geodesic distance along the contour from landmark i. We note that our correspondence is scale-invariant since the target contour is affine transformed before the cost is computed.

    In our search for the best correspondence according to this score, we leverage the fact that our outlines are ordered one-dimensionally. We begin by sampling a large number (500) of equally spaced landmarks along each of the training outlines. We fix a “base” contour (randomly selected), and a landmark numbering scheme for that base. We can now correspond the entire dataset by corresponding each outline to our base. Figure 4 shows the cost assigned by our algorithm for two sample correspondences from a target outline to the base outline. Furthermore, because of our ordering assumption and our fixed choice of landmarks (the 500 equally spaced points), there are only 500 possible correspondences. This can be seen because landmark 0 in the base outline only has 500 choices for a corresponding landmark in each target outline. We can thus simply compute the cost for all 500 choices, and select the best one. Once we have selected the correspondence for each target contour, we have a fully corresponded dataset.

    The entire process is simple and takes only a few minutes to choose the best among 10 random base instances and correspond each to the 20 training outlines. While a more sophisticated approach might lead to strictly more accurate correspondences, this approach was more than sufficient for our purposes.

    Landmark Pruning The next step is to reduce the number of landmarks used to represent each outline in a way that takes into account all training contours. Our goal is to remove points that contribute little to the outline of any instance. Toward this end, we greedily prune landmarks one at a time. At any given time in this process, each landmark can be scored by the error its removal would introduce into the dataset. Suppose that our candidate for removal is landmark i at location l_i. If i is removed from instance m, the outline in the vicinity of landmark i will be approximated by the line segment between landmarks i − 1 and i + 1 at locations l^{(m)}_{i-1} and l^{(m)}_{i+1}. We define dist_{O^{(m)}}(l^{(m)}_{i-1}, l^{(m)}_{i+1}) to be the mean segment-to-outline squared distance, and let the cost of removing landmark i be the average of these distances across all M instances in the entire dataset:

    C_i = \frac{1}{M} \sum_m \mathrm{dist}_{O^{(m)}}\bigl(l^{(m)}_{i-1}, l^{(m)}_{i+1}\bigr).

    At each step, we remove the landmark whose cost is lowest, and terminate the process when the cost of the next removal is above a fixed threshold (2 pixels²).

    Figure 3 shows an example of the correspondence of giraffe outlines. Note that the fully automatic choice of landmarks is both sparse and similar to what a human labeler might choose. In the next section we show how to use corresponded outlines to learn a shape and appearance model that will later be used (see Sect. 6) to detect objects and precisely localize these landmarks in cluttered images.

    5 The LOOPS Model

    Once we have corresponded the training outlines so that each is specified by N (corresponded) landmarks, we are ready to construct a distribution over possible assignments of these landmarks to pixels in an image. Toward this end, the LOOPS model combines two components: an explicit representation of the object’s shape (2D silhouette), and a set of image-based features. We define the shape of a class of objects via the locations of the N object landmarks, each of which is assigned to one of the pixels in the image. We represent such an assignment as a 2N vector of image coordinates which we denote by L. Using the language of Markov random fields (Pearl 1988), the LOOPS model defines a conditional probability distribution over L:

    P(L \mid I, w, \mu, \Sigma) = \frac{1}{Z(I)}\, P_{\mathrm{Shape}}(L; \mu, \Sigma) \prod_i \exp\bigl(w_i F^{\mathrm{det}}_i(l_i; I)\bigr) \times \prod_{i,j} \exp\bigl(w_{ij} F^{\mathrm{grad}}_{ij}(l_i, l_j; I)\bigr),    (1)

    where μ, Σ, w are model parameters, and i and j index landmarks of the model. P_Shape encodes the (unnormalized) distribution over the object shape (outline), F^{det}_i(l_i; I) is a landmark-specific detector, and F^{grad}_{ij}(l_i, l_j; I) encodes a preference for aligning outline segments along image edges.

    The shape model and the detector features are learned in parallel, as shown in Fig. 2(b). Below, we describe these features and how they are learned. The weights w are set as follows: w_i = 5, w_{ij} = 1, for all i, j. In principle, we could learn these weights from data, for example using a standard conjugate gradient approach. This process, however, requires an expensive inference step at each iteration. Our preliminary experiments indicated that our outlining results are relatively robust to a range of these weights and that learning the weights provides no clear benefit. We therefore avoid this computational burden and simply use a fixed setting of the weights.
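    To make the scoring concrete, the following sketch evaluates the unnormalized log of (1) for a candidate assignment, assuming the shape term, per-landmark detector responses, and pairwise gradient features are supplied as callables; the fixed weights match the values above, and all names and signatures are illustrative.

        # Illustrative evaluation of the unnormalized log-score of (1) for one assignment L.
        # log_shape, det_features, grad_features and edges are assumed to come from the model.
        def loops_log_score(L, log_shape, det_features, grad_features, edges, w_i=5.0, w_ij=1.0):
            """L: list/array of N pixel locations; log_shape(L) -> log P_Shape(L);
            det_features[i](l) -> F_det_i(l; I); grad_features(li, lj) -> F_grad_ij(li, lj; I)."""
            score = log_shape(L)
            score += w_i * sum(det_features[i](L[i]) for i in range(len(L)))
            score += w_ij * sum(grad_features(L[i], L[j]) for i, j in edges)
            return score   # equals log P(L | I) up to the constant -log Z(I)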

    We note that our MRF formulation is quite general, and allows for both flexible weighting of the features and for the incorporation of additional features. For instance, we might want to capture the notion that internal line segments (lines entirely contained within the object) should have low color variability. This can naturally be posed as a pairwise feature over landmarks on opposite sides of the object.

    5.1 Object Shape Model

    We model the shape component of (1) as a multivariate Gaussian distribution over landmark locations with mean μ and covariance Σ. The Gaussian parametric form has many attractive properties, and has been used successfully to model shape distributions in a variety of applications (e.g., Cootes et al. 1995; Anguelov et al. 2005). In our context, one particularly useful property is that the Gaussian distribution decomposes into a product of quadratic terms over pairs of variables:

    P_{\mathrm{Shape}}(L \mid \mu, \Sigma) = \frac{1}{Z} \exp\Bigl(-\tfrac{1}{2}(x - \mu)^{\top} \Sigma^{-1} (x - \mu)\Bigr)
                                           = \frac{1}{Z} \prod_{i,j} \exp\Bigl(-\tfrac{1}{2}(x_i - \mu_i)\, \Sigma^{-1}_{ij}\, (x_j - \mu_j)\Bigr)
                                           = \frac{1}{Z} \prod_{i,j} \phi_{i,j}(x_i, x_j; \mu, \Sigma),    (2)

    where Z is the normalization factor for the Gaussian density. We can see from (2) that we can specify potentials φ_{i,j} over only singletons and pairs of variables and still manage to reconstruct the full likelihood of the shape. This allows (1) to take an appealing form in which all terms are defined over at most a pair of variables.

    As we discuss below in Sect. 6, the procedure to locate the model landmarks in an image first involves discrete global inference using the LOOPS model, followed by a refinement stage that takes local steps. Even if we limit ourselves to pairwise terms, performing discrete inference in a densely connected MRF may be computationally impractical. Unfortunately, a general multivariate Gaussian includes pairwise terms between all landmarks. Thus, during the discrete inference stage, we limit the number of pairwise elements by approximating our shape distribution with a sparse multivariate Gaussian. During the final refinement stage, we use the full multivariate Gaussian distribution.

    The sparse shape approximation can be selected in various ways, trading off accuracy of the distribution against computational resources. In our implementation, we include only two types of terms: terms correlating neighboring landmarks along the outline, and a linear number of terms (one per landmark) encoding long-range dependencies that promote stability of the shape. We greedily select the latter terms over pairs of landmarks for which the relative location has the least variance (averaged across the x and y coordinates).
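    A minimal sketch of this edge selection follows, under the assumption that “least variance of the relative location” means the variance, across training instances, of the offset vector between the two landmarks (averaged over x and y), and that the long-range terms amount to one per landmark; names are illustrative.

        # Sketch of the sparse edge selection: contour neighbors plus one low-variance
        # long-range partner per landmark (an assumed reading of the text above).
        import numpy as np

        def select_sparse_edges(train_landmarks):
            """train_landmarks: (M, N, 2) array of corresponded training shapes."""
            M, N, _ = train_landmarks.shape
            edges = {tuple(sorted((i, (i + 1) % N))) for i in range(N)}     # outline neighbors
            for i in range(N):
                offsets = train_landmarks - train_landmarks[:, i:i + 1, :]  # relative locations
                var = offsets.var(axis=0).mean(axis=1)                      # variance over x and y
                var[[i, (i - 1) % N, (i + 1) % N]] = np.inf                 # exclude self, neighbors
                j = int(np.argmin(var))                                     # most stable partner
                edges.add((min(i, j), max(i, j)))
            return sorted(edges)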

    Estimation of the maximum likelihood parameters of the full Gaussian distribution can be solved analytically using the corresponded training instances. If we desire a sparse Gaussian distribution, however, there are multiple options for which approximation to use. We use a standard iterative projected gradient approach (Boyd and Vandenberghe 2004) to minimize the Kullback-Leibler divergence (Cover and Thomas 1991) between the sparse distribution and the full maximum likelihood distribution:

    (\mu_S, \Sigma^{-1}_S) = \arg\min_{\mu,\, \Sigma^{-1} \in \Delta} \mathrm{KL}\bigl(\mathcal{N}(\mu, \Sigma^{-1}) \,\big\|\, \mathcal{N}_D(\mu_D, \Sigma^{-1}_D)\bigr),

    where μ_D and Σ_D are the (dense) ML parameters, and Δ is the set of all inverse covariances that meet the constraints defined by our choice of included edges. Note that removal of a potential (edge) between a pair of landmarks is equivalent to constraining the corresponding entry in Σ^{-1} to be zero.


    5.2 Landmark Detector Features

    To construct our detector features F^{det}, we build on the demonstrated efficacy of discriminative methods in identifying salient regions (parts) of objects (e.g., Fergus et al. 2003; Hillel et al. 2005; Torralba et al. 2005; Opelt et al. 2006b). The extensive work in this area suggests that the best results are obtained by combining a large number of features. However, incorporating all of these features directly into (1) is problematic because setting such a large number of weights corresponding to these features requires a significant amount of tuning. Even if we chose to learn the weights, this problem would not be alleviated: training a conditional MRF requires that we run inference multiple times as part of the gradient descent process; models with more parameters require many more iterations of gradient descent to converge, making the learning cost prohibitive.

    Our strategy builds on the success of boosting in state-of-the-art object detection methods (Opelt et al. 2006b; Torralba et al. 2005). Specifically, we use boosting to learn a strong detector (classifier) H_i for each landmark i. We then define the feature value in the conditional MRF for the assignment of landmark i to pixel l_i to be:

    F^{\mathrm{det}}_i(l_i; I) = H_i(l_i).

    Any set of image features can be used in this approach; we use features that are based on our shape model as well as other features that have proven useful for the task of object detection (see Fig. 5 for examples of each):

    • Shape Templates. This contour-based feature is a contiguous segment of the mean shape emanating from a particular landmark for a given pixel radius in both directions. This feature is entirely shape dependent and can only be used in the case where training outlines are provided. See Elidan et al. (2006a) for further details on this feature. Because we expect the edges in the image to roughly correspond to the object outline, we match a shape template to actual edges in the image using chamfer distance matching (Borgefors 1988).

    • Boundary Fragments. This feature is made of randomly selected edge chains (contiguous groups of edge pixels) within the bounding box of the training objects, and is matched using chamfer distance. We construct these using a protocol similar to Opelt et al. (2006b), except that we neither cluster fragments nor combine pairs of fragments into a single feature.

    • Filter Response Patches. This feature is represented by a patch from a filtered version of the image (filters include bars and oriented Gaussians). Patches are randomly extracted from within the object bounding boxes in the training images. The patch is matched to a new image using normalized cross-correlation. We construct the patches similarly to Torralba et al. (2005).

    • SIFT Descriptors. Interesting keypoints are extracted from each image, and a SIFT descriptor (Lowe 2003) is computed for each keypoint. In order to limit the number of SIFT features, each descriptor is scored based on how common it is in the training set, and the best scoring descriptors are kept.

    Each landmark detector is trained by constructing a strong boosted classifier from a set of the weak single-feature detectors described above. To build such detectors, we rely on boosting and adapt the protocol of Torralba et al. (2005). We now provide a brief description of the essentials of the process.

    Boosting provides a straightforward way to perform feature selection and learn additive classifiers of the form

    H_i(p) = \sum_{t=1}^{T} \alpha_t\, h^t_i(p),

    where each h^t_i(p) is a weak feature detector for landmark i (described below), T is the number of boosting rounds, and H_i(p) is a strong classifier whose output is proportional to the log-odds of landmark i being at pixel p.

    Fig. 5 (Color online) Examples of weak detectors (clockwise from the top left): a shape template feature, representing a segment of the mean contour near the nose of the plane; a boundary fragment generated from an edge chain; a SIFT feature; a filter response patch extracted from a random location within the object bounding box. All features are shown along with their offset vector (yellow arrow) that “votes” for the location of a landmark at the tail of the airplane (pink square)


    A weak detector h_i is a feature of one of the types listed above (e.g., a filter response patch) providing information on the location of landmark i. It is defined in terms of a feature vector of the relevant type and an offset v between the pixel p at which the feature is computed and the landmark position. Thus, if a weak detector h_i produces a high response on a test image at pixel p, then this detector will “vote” for the presence of the landmark at p + v. Figure 5 illustrates each type of weak detector together with a sample landmark offset.

    A weak detector is applied to an image I in the following manner. First, the detector’s feature is evaluated at each pixel in the image and shifted by the offset for the corresponding landmark i to produce the response image R_i. For our patch features, we make R_i sparse by zeroing out all but the top K (50) responses, and blur it to provide a softer voting mechanism in which small perturbations are not penalized. The sparsification of the response map allows only the most confident feature matches. A more detailed explanation of this process can be found in Murphy et al. (2006). In all experiments, we blur the response image using a Gaussian blur with a standard deviation of 10 pixels. We now let h_i(p) be a regression stump over the response R_i(p) that indicates the probability of the landmark occurring at this pixel.
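    A minimal sketch of this post-processing follows, assuming the raw per-pixel (already offset-shifted) response image has been computed and that SciPy is available for the Gaussian blur; the top-K value and blur width follow the numbers quoted above.

        # Sketch: sparsify a weak detector's response map to its top-K values, then blur
        # to obtain a softer voting map. The raw response image is assumed as input.
        import numpy as np
        from scipy.ndimage import gaussian_filter

        def soft_vote_map(response, top_k=50, blur_sigma=10.0):
            response = response.astype(float)
            flat = response.ravel()
            if flat.size > top_k:
                kth = flat.size - top_k
                thresh = np.partition(flat, kth)[kth]           # K-th largest response value
                response = np.where(response >= thresh, response, 0.0)
            return gaussian_filter(response, sigma=blur_sigma)  # soft votes around strong matches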

    Once we have computed the responses for each weak detector, we learn the strong detector H_i(p) using Real AdaBoost (Schapire and Singer 1999) with regression stumps (the h_i’s) as in Torralba et al. (2005), Murphy et al. (2006). For each landmark i, our positive training set consists of the response vectors corresponding to assignments of that landmark in each of the training instances. In practice, because our landmark locations are noisy (due to labeling and correspondence error), we add vectors from a small (3×3) grid of pixels around the true assignment. Our negative set is constructed from the responses at random pixels that are at least a minimum distance (10 pixels) away from the true answer.

    5.3 Gradient Features

    Our model also includes terms that estimate the likelihood of the edges connecting the landmarks. In general, we expect that the image will contain high gradient magnitudes at the object boundary. For neighboring landmarks i and j with pixel assignments (l_i, l_j), we define the feature:

    F^{\mathrm{grad}}_{ij}(l_i, l_j; I) = \sum_{r \in \overline{l_i l_j}} \bigl| g(r)^{\top} n(l_i, l_j) \bigr|,

    where r varies over all of the points along the line segment from l_i to l_j, g(r) is the image gradient at point r, and n(l_i, l_j) is the normal to the edge (between l_i and l_j). A high value of the feature encourages the boundary of the object to lie along segments of high gradient.

    6 Localization of Object Classes

    We now address our central computational challenge: assigning the landmarks of a LOOPS model to test image pixels while allowing for large deformations and articulations. Recall that the conditional MRF defines a distribution (1) over assignments of model landmarks to pixels. This allows us to outline objects by using probabilistic inference to find the most probable such assignment:

    L^* = \arg\max_L\; P(L \mid I, w).

    Because, in principle, each landmark can be assigned to any pixel, finding L^* is computationally prohibitive. One option is to use an approach analogous to active shape models, using a greedy method to deform the model from a fixed starting point. However, unlike most applications of active shape/appearance models (e.g., Cootes et al. 1998), our images have significant clutter, and such an approach will quickly get trapped in an inferior local maximum. A possible solution to this problem is to consider a series of starting points. Preliminary experiments along these lines (not shown for lack of space), however, showed that such an approach requires a computationally prohibitive number of starting points to effectively localize even rigid objects. Furthermore, large articulations were not captured even with the “correct” starting point (placing the mean shape in the center of the true location).

    To overcome these limitations, we propose an alternative two-step method, depicted in Fig. 6: we first approximate our problem and find a coarse solution using discrete inference; we then refine our solution using continuous optimization and the full objective defined by (1).

    Discrete Inference Obviously, we cannot perform inference over the entire space, as even a modestly sized image (200 × 300 pixels) results in a search space of size 60,000^N, where N is the number of model landmarks and each landmark can be assigned to any of the 60,000 pixels. Thus, we start by pruning the domain of each landmark to a relatively small number of pixels. To do so both effectively and efficiently (without requiring inference), we first assume that landmarks will fall on “interesting” points in the image, and consider only points found by the SIFT interest operator (Lowe 2003). We adjust the settings of the SIFT interest operator to produce between 1000 and 2000 descriptors per image (we set the number of scales to 10), allowing us to have a large number of locations to choose from. Following this step, we make use of the MRF appearance-based feature functions F^{det}_i to identify the most promising candidate pixel assignments for each landmark. While we cannot expect these features to identify the single correct pixel for each landmark, we might expect the correct answer to lie towards the higher end of the response values.


    Fig. 6 (a) An outline defined by the top detection for each landmark independently, (b) an outline inferred by inference over our discrete conditional MRF, and (c) a refined outline after coordinate-ascent optimization

    Thus, for each landmark, we use the corresponding F^{det}_i to score all the SIFT keypoint pixels in the image and choose the top 25 local optima as candidate assignments. Figure 6(a) shows the top assignment for each landmark according to the detectors for a single example test image.
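    A small sketch of this pruning step follows, assuming the SIFT keypoint locations and a per-landmark detector score function are already available; for simplicity it takes the 25 highest-scoring keypoints rather than the top 25 local optima described above.

        # Sketch: prune each landmark's domain to a short list of candidate pixels.
        # "Top local optima" is simplified here to the top-scoring SIFT keypoints.
        import numpy as np

        def candidate_assignments(keypoints, detector_scores, n_candidates=25):
            """keypoints: (K, 2) array of SIFT keypoint pixels; detector_scores: one callable
            per landmark mapping a pixel to its F_det_i value."""
            candidates = []
            for score_fn in detector_scores:
                scores = np.array([score_fn(p) for p in keypoints])
                top = np.argsort(scores)[::-1][:n_candidates]     # highest-scoring keypoints
                candidates.append(keypoints[top])
            return candidates                # one (n_candidates, 2) array per landmark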

    Given a set of candidates for each landmark, we must now find a single, consistent joint assignment L^* to all landmarks together. However, even in this pruned space, the inference problem is quite daunting. Thus, we further approximate our objective by sparsifying the multivariate Gaussian shape distribution to include only terms corresponding to neighboring landmarks plus a linear number of terms (see Sect. 5.1). The only pairwise feature functions we use are over neighboring pairs of landmarks (as described in Sect. 5.3), which does not add to the density of the MRF construction, thus allowing the inference procedure to be tractable.

    Finally, we find L^* by performing approximate max-product inference, using the recently developed and empirically successful Residual Belief Propagation (RBP) of Elidan et al. (2006b). Figure 6(b) shows an example of an outline found after the discrete inference stage.

    Refinement Given the best coarse assignment L^* predicted in the discrete stage, we perform a refinement stage in which we reintroduce the entire pixel domain and use the full shape distribution. This allows us to more accurately adapt to the image statistics while also considering shapes that fit a better distribution than was available at the discrete stage. Refinement is accomplished using a greedy hill-climbing algorithm in which we iterate across each landmark, moving it to the best candidate location using one of two types of moves, while holding the other landmarks fixed. In a local move, each landmark can pick the best pixel location in a small window around its current location. In a global move, each landmark can move to its mean location given all the other landmark assignments. Determining the conditional mean location is straightforward given the Gaussian form of the joint shape model. If we seek the conditional mean of the location of landmark n, denoted x_n, given the locations of landmarks 1 . . . n − 1, denoted by x_-, we begin with the joint distribution:

    \begin{bmatrix} x_- \\ x_n \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_- \\ \mu_n \end{bmatrix}, \begin{bmatrix} \Sigma_- & \beta_{-n} \\ \beta^{\top}_{-n} & \sigma^2_n \end{bmatrix} \right).

    In this case, the conditional mean is given by

    \mathrm{E}[x_n \mid x_-] = \mu_n + \beta^{\top}_{-n} (\Sigma_-)^{-1} (x_- - \mu_-).

    Interestingly, in a typical refinement, the global moves dominate the early iterations, correcting large mistakes made by the discrete stage that resulted in an unlikely shape. In the later iterations, local moves do most of the work by carefully adapting to the local image characteristics. Figure 6(c) shows an example of an outline found after the refinement stage.
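    The conditional mean used for a global move follows directly from the joint Gaussian; the sketch below computes it with NumPy under the assumption that the shape model stores each landmark as a consecutive (x, y) pair of the 2N-dimensional mean and covariance.

        # Sketch of a global move: conditional mean of landmark n's 2D location given the
        # other landmarks, for mu of length 2N and sigma of shape (2N, 2N).
        import numpy as np

        def conditional_mean(mu, sigma, n, other_locs):
            """other_locs: (N-1, 2) current locations of all landmarks except n, in index order."""
            N = len(mu) // 2
            n_idx = [2 * n, 2 * n + 1]
            o_idx = [k for i in range(N) if i != n for k in (2 * i, 2 * i + 1)]
            beta = sigma[np.ix_(o_idx, n_idx)]                   # cross-covariance block
            sigma_oo = sigma[np.ix_(o_idx, o_idx)]
            diff = np.asarray(other_locs, float).ravel() - mu[o_idx]
            return mu[n_idx] + beta.T @ np.linalg.solve(sigma_oo, diff)   # E[x_n | x_-]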

    7 Experimental Evaluation of LOOPS Outlining

    We begin with a quantitative evaluation of the accuracy of the LOOPS outlines. In particular, we evaluate the ability of our model to produce accurate outlines in which the model’s landmarks are positioned consistently across test images. In the following section, we will demonstrate the range of capabilities of the LOOPS model by showing how these correspondences can be used for a series of descriptive tasks.

    As mentioned in Sect. 3, there are existing methods that also seek to produce an accurate outline of object classes in cluttered images. In order to evaluate the LOOPS outlines, we compare to two such state-of-the-art methods. The first is the OBJCUT model of Prasad and Fitzgibbon (2006). Briefly, this method uses an exemplar-based shape model of the object class together with a texture model to find the best match of exemplar to the image. The second method we compare to is that of Ferrari et al. (2008). This approach uses adjacent contour segments as features for a detector. Note that unlike both OBJCUT and our LOOPS method, this method only requires bounding box supervision for the training images rather than full outlines. In order to put this method on equal footing with the others, we also use a version of this method that sees the “ideal” edgemap of the training images, which is derived from the manual outline. Under this setting, the learned contours are clean sections of the true outline. In order to differentiate, we call the original method the kAS Detector (bounding box), and the supervised outline version the kAS Detector (outline). The matched contour segments for each detection are returned as the detected shape of the object.

    We compare LOOPS to these three methods on four object classes: ‘giraffe’, ‘cheetah’, ‘airplane’, and ‘lamp’.


    Fig. 7 Example outlines for the three methods. The full set of resulting outlines for each method can be found at http://ai.stanford.edu/~gaheitz/Research/Loops

    Images for each class were downloaded from Google images and were chosen to exhibit a range of articulation and deformation, which will be addressed in the experiments below. Randomly selected example outlines are shown in Fig. 7, and the full dataset can be found at http://ai.stanford.edu/~gaheitz/Research/Loops along with all images used in this paper. Both competing methods were updated to fit our data with help from the original code developers (P. Kumar, V. Ferrari; personal communications). The code and parameters were adjusted through rounds of communication until they were satisfied that the methods had achieved reasonable results. We thank them for their friendly and helpful communications.

    In all experiments, we trained our LOOPS model on 20 randomly chosen training images and tested on the others. We report average results over 5 random train/test partitions. The features used to construct the boosted classifier for each landmark (see Sect. 5.2) include shape templates of lengths 3, 5, 7, 10, 15, 20, 30, 50, 75 for each landmark as well as 400 boundary fragments, 400 SIFT features and 600 filter features that are shared between all landmarks.

    In order to quantitatively evaluate these results, we measured the symmetric root mean squared (rms) distance between the produced outlines and the hand-labeled groundtruth. These numbers are reported in Table 1. Based on this metric, LOOPS clearly produces more accurate outlines than the competing methods. In addition to these evaluations, we include a quantitative analysis for the descriptive experiments in Sect. 8.
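    A sketch of this metric as described in the Table 1 caption follows, assuming both outlines are already resampled into dense point sets; taking the rms over the pooled nearest-point distances in both directions is one reasonable reading of “symmetric”, and the helper is illustrative.

        # Sketch of the symmetric rms outline distance, reported as a percentage of the
        # groundtruth bounding-box diagonal. Outlines are assumed to be dense (P, 2) point sets.
        import numpy as np

        def symmetric_rms(pred_pts, gt_pts):
            def nearest_dists(a, b):
                d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)   # pairwise distances
                return d.min(axis=1)                                         # nearest point in b
            dists = np.concatenate([nearest_dists(pred_pts, gt_pts),
                                    nearest_dists(gt_pts, pred_pts)])
            rms = np.sqrt((dists ** 2).mean())
            diag = np.linalg.norm(gt_pts.max(axis=0) - gt_pts.min(axis=0))  # gt bbox diagonal
            return 100.0 * rms / diag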

    In our data, we know the class of the object in each image, and the object is generally in the middle of the image, covering a majority of the pixels. The kAS Detector, however, was originally developed for object detection, which is often unnecessary in our data.



    Due to its success as a detector, its inclusion as a pre-processing stage to the LOOPS outlining is a promising direction for data for which detecting the object is more difficult. Indeed, our experiments have shown that LOOPS struggles with the pure detection task on more difficult detection data for which the kAS Detector was designed. We tested LOOPS as a detector on the data of Ferrari et al. (2007), and found that a kAS Detector learned from only bounding box annotations outperforms the LOOPS detector by 11%, averaged across the 5 classes. According to Ferrari et al. (2007), the kAS Detector achieves detection rates (at 0.4 false positives per image) of 84%, 83%, 58%, 83%, and 75% for the classes ‘mug’, ‘bottle’, ‘giraffe’, ‘applelogo’, and ‘swan’, respectively. LOOPS achieves corresponding detection rates of 70%, 61%, 83%, 75%, and 83%. Note that LOOPS does outperform the kAS Detector on the ‘giraffe’ and ‘swan’ classes, which have the most distinct shape.

    While in some cases producing outlines is sufficient, there are also cases where we care about the precision of the localized landmarks. In the next section we show experiments that use the localized landmarks for classification. Towards this goal, we now evaluate the accuracy of the model landmarks in test images.

    Table 1 Normalized symmetric root mean squared (rms) distance between the produced outline and the hand-labeled groundtruth, for the three competitors. Outlines are converted into a high-resolution point set, and the number reported is the rms of the distance from each point on the outline to the nearest point on the groundtruth (and vice versa), as a percentage of the groundtruth bounding box diagonal. Note that while the LOOPS and OBJCUT methods require full object outlines for the training images, the kAS Detector can be trained with only bounding box supervision (fourth column) or with full supervision (fifth column)

    Class      LOOPS   OBJCUT   kAS Detector (bounding box)   kAS Detector (outline)
    Airplane    1.9      5.5               3.8                        3.6
    Cheetah     5.0     12.3              11.7                       10.5
    Giraffe     2.9     10.5               8.7                        8.1
    Lamp        2.9      7.3               5.8                        5.3

    We consider the ‘airplane’, ‘bass’, ‘buddha’ and ‘rooster’ classes from the Caltech 101 dataset (Fei-Fei et al. 2004) as well as the significantly more challenging ‘bison’, ‘deer’, ‘elephant’, ‘giraffe’, ‘llama’ and ‘rhino’ classes from the mammal dataset of Fink and Ullman (2007).¹ These images have more cluttered backgrounds than the Caltech images as well as foreground objects that blend into these backgrounds.

    To measure our ability to accurately localize landmarks, we need a groundtruth labeling for the landmarks that is corresponded with our model. Thus, for the purposes of this evaluation only, we did not use our automatic method for corresponding training outlines, but rather labeled the Caltech and Mammal classes with manually corresponded landmarks. We then train a LOOPS model on these highly-supervised instances and evaluate our results using a scale-independent landmark success rate: a landmark is successfully localized if its distance from the manually labeled location is less than 5% of the diagonal of the object’s bounding box in pixels.
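    A short sketch of this success-rate computation, assuming the predicted and groundtruth landmarks are given as corresponded (N, 2) arrays and that the bounding box is taken from the groundtruth landmarks.

        # Sketch: scale-independent landmark success rate (fraction of landmarks within 5%
        # of the groundtruth bounding-box diagonal of their labeled location).
        import numpy as np

        def landmark_success_rate(pred, gt, frac=0.05):
            diag = np.linalg.norm(gt.max(axis=0) - gt.min(axis=0))
            errors = np.linalg.norm(pred - gt, axis=1)           # per-landmark pixel error
            return float((errors < frac * diag).mean())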

    Figure 8 compares the landmark success rates of our LOOPS model to two baseline approaches. In the first, we pick each landmark location using its corresponding detector by assigning the landmark to the pixel location in the image with the highest detector response. This Landmark approach uses no shape knowledge beyond the vote offset that is encoded in the weak detectors. As a second baseline, we create a single boosted Centroid detector that attempts to localize the center of the object, and places the landmarks relative to the predicted centroid using the mean object shape.

    The evaluation of error on a per-landmark basis is particularly biased in favor of the Landmark method, which is trained specifically to localize each landmark, without attempting to also fit the shape of the object. Nevertheless, the relative advantage of the LOOPS model, which takes into account the global shape, is evident.

    ¹ Classes were selected based on the number of images available. For classes with almost enough images, we augmented the dataset with additional ones from Google images. All images can be found at http://ai.stanford.edu/~gaheitz/Research/Loops.

    Fig. 8 Landmark success rates for the Caltech and Mammal classes. Shown is the average fraction of landmarks that were localized within 5% of the bounding box diagonal from the groundtruth for our LOOPS method as well as the Centroid and Landmark baselines



    Table 2 Normalized symmetric root mean squared (rms) distance between the LOOPS outline and the hand-labeled groundtruth

                  Centroid   Landmark   LOOPS
                                        Discrete   Refined

    Caltech
      Airplane      2.6        2.0        1.6        1.5
      Bass          5.9        4.3        4.0        4.1
      Buddha        7.5        4.8        4.1        4.0
      Rooster       6.5        4.7        3.8        3.8
    Mammals
      Bison         2.7        2.7        2.0        2.0
      Deer          5.9        6.6        4.5        4.4
      Elephant      3.3        3.3        2.5        2.5
      Giraffe       7.1        4.5        3.4        3.3
      Llama         3.7        3.2        3.5        3.1
      Rhino         3.3        3.1        2.6        2.4

    We note that while other works quantitatively evaluate the accuracy of their outlines (e.g., Ferrari et al. 2008), our evaluation explicitly measures the accuracy of the localized correspondences.

    Figure 6 provides a representative example of the localization at each stage of our algorithm. It illustrates how the discrete inference prediction (b) builds on the landmark detectors (a) but also takes into account global shape. Following a reasonably accurate rough landmark-level localization, our refinement is able to further improve the accuracy of the outline (c).

    To get an absolute sense of the quality of our corresponded localizations, Table 2 shows the symmetric root mean squared distance between the predicted and groundtruth outlines. The improvement over the baselines is obvious, and most errors are on the order of 10 pixels or less, a sufficiently precise localization for images that are 300 pixels in width. We note that our worst performance, for the ‘buddha’ class, is in part due to inconsistent human labeling that includes the base of the statue in some images but not in others. Figure 9 provides a qualitative sense of the outlines predicted by LOOPS, showing several examples from each of the Mammal classes, and Fig. 10 shows several examples from each of the Caltech classes.
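    The symmetric rms measure reported in Tables 1 and 2 can be sketched as follows (using SciPy's KD-tree for nearest-point lookups; the dense point sampling of each outline is assumed to be done beforehand):

    import numpy as np
    from scipy.spatial import cKDTree

    def symmetric_rms_distance(pred_pts, gt_pts):
        """Symmetric rms distance between two densely sampled outlines,
        reported as a percentage of the groundtruth bounding-box diagonal."""
        d_pred_to_gt = cKDTree(gt_pts).query(pred_pts)[0]   # each predicted point to nearest gt point
        d_gt_to_pred = cKDTree(pred_pts).query(gt_pts)[0]   # and vice versa
        rms = np.sqrt(np.mean(np.concatenate([d_pred_to_gt, d_gt_to_pred]) ** 2))
        diag = np.linalg.norm(gt_pts.max(axis=0) - gt_pts.min(axis=0))
        return 100.0 * rms / diag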

    8 Descriptive Queries with LOOPS Outlines

    Recall that our original goal was to perform descriptive queries on the image data. In this section we consider tasks that attempt to explore the space of possible applications facilitated by the LOOPS model. Broadly speaking, this space varies along two principal directions. The first direction concerns variation of the machine learning application. We will present results for classification, search (ranking), and a clustering application. The second direction varies the

    Fig. 9 Example LOOPS localizations from the Mammal classes. The top three for each class are successful localizations, while the fourth (below the line) is a failure. The full set of resulting outlines can be found at http://ai.stanford.edu/~gaheitz/Research/Loops

    components that are extracted from the LOOPS outlines for these tasks. We will show examples that use the entire object shape, those that use a part of the object shape, and those that use the local appearance of a shape-localized part of the object. While each particular example shown below might be accomplished by other methods, the LOOPS framework allows us to approach any of these tasks with the same underlying object model.

    8.1 User-Defined Landmark Queries

    We begin by allowing the user to query the objects in our images using refined shape-based queries. In particular, we use an interface in which the user selects one or more landmarks in the model and provides a query about those landmarks in test images. For example, to answer the question “Where is the giraffe’s head?” the user might select a landmark in the head and ask for its location in test images.



    Fig. 10 Example LOOPS localizations for the Caltech classes. The top three for each class are successful localizations, while the fourth (below the line) is a failure. The full set of resulting outlines can be found at http://ai.stanford.edu/~gaheitz/Research/Loops

    Fig. 11 Refined object descriptions using pairs and triplets of landmarks. In the top row, the user selects two landmarks from the model on either side of the lamp base (highlighted on the left) and asks for the distance between them in test images. Results are displayed in order of this distance, increasing from left to right. In the bottom row, the user selects a landmark on the front and hind legs of the cheetah as well as a landmark on the stomach in between them and asks for the angle formed by the three points. Test images are displayed in order from widest angle on the left to smallest angle on the right

    The user might also ask for the distance between two landmarks, or the angle between three landmarks. Figure 11 shows how this can be used to find images of lamps whose bases vary in width, as well as cheetahs with varying angles of their legs. While one might be able to engineer a task-specific solution for any particular query using another method, the localization of corresponded landmarks of the LOOPS model allows a range of such queries to be addressed in a natural and simple manner.
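    Because the localized landmarks are corresponded with the model, such queries reduce to simple geometry on the predicted outline. A sketch (the landmark indices are whatever the user selects in the interface):

    import numpy as np

    def landmark_distance(outline, i, j):
        """Distance between two corresponded landmarks of a localized outline."""
        return np.linalg.norm(outline[i] - outline[j])

    def landmark_angle(outline, i, j, k):
        """Angle (in radians) at landmark j formed by landmarks i and k."""
        u, v = outline[i] - outline[j], outline[k] - outline[j]
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return np.arccos(np.clip(cos, -1.0, 1.0))

    # e.g., rank test images by lamp-base width, as in the top row of Fig. 11:
    # ranked = sorted(outlines, key=lambda o: landmark_distance(o, LEFT_BASE, RIGHT_BASE))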

    8.2 Shape-Based Classification

    Our goal is to use the predicted LOOPS outlines to distinguish between configurations of an object. To accomplish this, we first train the joint shape and appearance model and perform inference to localize outlines in the test images, all without knowledge of the task at hand or any labels. We then

    incorporate knowledge of the desired descriptive task, such as class labels, and perform the task using the corresponded localizations in the test images.

    We now show how to perform descriptive classification with minimal supervision in the form of example images. Rather than specifying relationships between particular landmarks, the user simply provides classification labels for a small subset of the instances used to train the LOOPS model (as few as one per class), and the entire outline is used. We present three descriptive classification tasks for three quite different object classes (below we will also consider multiple classification tasks applied to the same object class):

    1. Giraffes standing vs. bending down.
    2. Cheetahs running vs. standing.
    3. Airplanes taking off vs. flying horizontally.



    The first two of these tasks depend on the articulation of a target object, while the third task demonstrates the ability of LOOPS to capture the pose of an object. With these experiments we hope to demonstrate two points. The first is that an accurate shape allows for better classification than using image features alone. To show this, we compare to generative and discriminative classifiers that operate directly on the image features. Our second point is that the LOOPS model in particular produces outlines that are better suited to these tasks than existing methods that produce object outlines. We compare to two existing methods and show that the classifications produced by LOOPS are superior.

    We begin our comparison with two baseline methods that operate directly on the image features. The first baseline is a generative Naive Bayes classifier that uses a SIFT dictionary for constructing features. For each SIFT keypoint in each image, we create a vector by concatenating the location, saliency score, scale, orientation and SIFT descriptor. Across the training images, we cluster these vectors into bins using K-means clustering, thus creating a SIFT-based dictionary (we chose 200 bins as this achieves the best overall classification results). We then train a generative Naive Bayes model, where, for each label (e.g., standing = 0, bending down = 1), we learn a multinomial over the SIFT descriptor dictionary entries. When all of the training data is labeled, the multinomial for each label may be learned in closed form; when some of the training data is unlabeled, the EM algorithm is used to train the model by filling in the missing labels.
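    For the fully labeled case, this baseline amounts to a bag-of-words model; a rough sketch using scikit-learn stand-ins (not the original implementation, and omitting the EM variant for unlabeled data):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.naive_bayes import MultinomialNB

    def build_dictionary(all_keypoint_vectors, num_words=200):
        """Cluster the concatenated [location, saliency, scale, orientation, SIFT]
        vectors from the training images into a visual dictionary."""
        return KMeans(n_clusters=num_words).fit(np.vstack(all_keypoint_vectors))

    def bag_of_words(dictionary, keypoint_vectors):
        """Histogram of dictionary-word assignments for a single image."""
        words = dictionary.predict(keypoint_vectors)
        return np.bincount(words, minlength=dictionary.n_clusters)

    def train_naive_bayes(dictionary, image_keypoints, labels):
        """One multinomial over dictionary entries per label (closed form).
        With partially labeled data, EM would be used instead."""
        X = np.array([bag_of_words(dictionary, kp) for kp in image_keypoints])
        return MultinomialNB().fit(X, labels)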

    Our second baseline uses a discriminative approach, based on the Centroid detector described above, that is similar to the detector used by Torralba et al. (2005). After predicting the centroid of the object, we use the vector of feature responses at that location in order to classify the object. Specifically, we train a second boosted classifier to differentiate between the response vectors of the two classes using the labeled training instances. We note that, unlike the generative baseline, this discriminative method is unable to make use of the unsupervised instances in the training set.

    When using the LOOPS model, we train the joint shape and appearance model independently of the classification tasks and without knowledge of the labels. We then use the labels to train a classifier for predicting the label given a corresponded localization. To do this in a manner that is invariant to scaling, translation and rotation, we first align the training outlines to the mean using the procrustes method (Dryden and Mardia 1998). (In the airplane task, where the rotation is the goal itself, we normalize only for the translation and scale of the instances, while keeping their original orientations.) Once our instances are aligned, we use principal component analysis to represent our outlines as a vector of coefficients that quantify the variation along the most prominently varying axes. We then learn a standard logistic regression classifier over these vectors; this classifier

    is trained to maximize the likelihood of the training labels given the training outlines. To classify a test image, we align the corresponded outline produced by LOOPS to the mean training outline and apply the classifier to the resulting PCA vector.
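    A compact sketch of this pipeline, with SciPy's Procrustes alignment and scikit-learn's PCA and logistic regression as stand-ins (note that scipy.spatial.procrustes removes rotation, so for the airplane task one would substitute a translation-and-scale-only normalization):

    import numpy as np
    from scipy.spatial import procrustes
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression

    def train_shape_classifier(train_outlines, labels, mean_outline, max_components=10):
        """Align corresponded outlines to the mean shape, project onto PCA
        coefficients, and fit a logistic regression over those coefficients."""
        aligned = [procrustes(mean_outline, o)[1] for o in train_outlines]
        X = np.array([a.ravel() for a in aligned])
        k = min(max_components, X.shape[0], X.shape[1])
        pca = PCA(n_components=k).fit(X)
        clf = LogisticRegression().fit(pca.transform(X), labels)
        return pca, clf

    def classify_outline(test_outline, mean_outline, pca, clf):
        """Align a LOOPS test outline to the mean and apply the trained classifier."""
        x = procrustes(mean_outline, test_outline)[1].ravel()
        return clf.predict(pca.transform(x[None, :]))[0]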

    In order to get a sense of what contributes to the errors made by the LOOPS method, we also include classification results using the groundtruth outlines of the test instances. In this method, called GROUND in the graphs below, the training and test groundtruth outlines are mutually corresponded using our automatic landmark selection algorithm (see Sect. 4). These corresponded outlines are then used in the same way as the LOOPS outputs. This measure shows the signal present in the “ideal” localizations (according to human observers), and serves as a rough indication of how well the LOOPS method would perform if its localizations matched human annotations.

    Figure 12 (left column) shows the classification results for the three tasks as a function of the number N of supervised training instances per class (x-axis). For all three tasks, LOOPS + Logistic (solid blue) outperforms both baselines. Importantly, by making use of the shape of the outline predicted in a cluttered image, we surpass the fully supervised baselines (rightmost on the graphs) with as little as a single supervised instance (leftmost on the graphs). Furthermore, our classification results were similarly impressive when using both a nearest-neighbor classifier and a support vector machine. This indicates that once the data instances have been projected into the correct space (that of the corresponded outline), many simple machine learning algorithms can easily solve this problem. For the airplane and giraffe classes, the LOOPS + Logistic method perfectly classifies all test instances, even with a single training instance per class. Our advantage relative to the Naive Bayes classifier is least pronounced in the case of the cheetah task. This is not surprising, as most running cheetahs are on blurry or blank backgrounds, and a SIFT descriptor based approach is expected to perform well.

    Convinced that outlines are superior to image features for these descriptive classification tasks, we now turn to the question of how LOOPS compares to existing methods in producing outlines that will be used for these tasks. We compare to the two existing methods described above, OBJCUT and the kAS Detector (both versions). Because the OBJCUT method has no notion of correspondence between outline points, we use a classification technique that ignores the correspondences of the kAS and LOOPS outlines. In particular, for each shape produced by all four methods, we rescale the shape to have a bounding-box diagonal of 100 pixels, and center the shape around the origin. We can then compute the chamfer distance between any pair of shapes normalized in this manner.
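    A sketch of this normalization and the chamfer distance used for the comparison (centroid-centering is our assumption for “centered around the origin”):

    import numpy as np
    from scipy.spatial import cKDTree

    def normalize_shape(points, target_diag=100.0):
        """Center a shape's points at the origin and rescale it so that its
        bounding-box diagonal is `target_diag` pixels."""
        pts = points - points.mean(axis=0)
        diag = np.linalg.norm(pts.max(axis=0) - pts.min(axis=0))
        return pts * (target_diag / diag)

    def chamfer_distance(shape_a, shape_b):
        """Symmetric chamfer distance between two normalized point sets."""
        return (cKDTree(shape_b).query(shape_a)[0].mean() +
                cKDTree(shape_a).query(shape_b)[0].mean())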

    We build nearest-neighbor classifiers for the descriptive task in question using this chamfer distance as our distance metric.


    Fig. 12 (Left column) Results for three single-axis descriptive classification tasks. Compared are the LOOPS outlining method joined by a logistic regression classifier (LOOPS + Logistic), a Naive Bayes classifier and a discriminative appearance based Centroid classifier. We also include the GROUND + Logistic results, in which manually labeled outlines from the training and test set are used and that approximately upper bounds the achievable performance when relying on outlines. On both the giraffe and airplane tasks, the combination of LOOPS outlining and a logistic regression classifier (LOOPS + Logistic) achieves perfect performance (GROUND + Logistic), even with a single training instance for each class. For the cheetah task, where the images are blurred, the LOOPS method only comes near to perfect performance and its advantage over the Naive Bayes classifier is less pronounced. (Right column) A comparison of LOOPS to three other shape-based object detectors. For the purposes of these tasks, the LOOPS outlines are far superior

    Classification accuracy for these three methods as a function of the number N of supervised training instances is shown in Fig. 12 (right column). We can see that LOOPS significantly outperforms both the OBJCUT and kAS Detectors for these tasks. For this data, OBJCUT is generally as good as the kAS Detector with outline supervision, and both tend to outperform the kAS Detector with only bounding box supervision.

    Once we have our output outlines, an important benefit of the LOOPS method is that we can in fact perform multiple descriptive tasks with the same object model. We demonstrate this with a pair of classification tasks for the lamp object class. The tasks differ in which “part” of the object we consider for classification.

    In particular, we learn a LOOPS model for table lamps, and consider the following two descriptive classification tasks:

    1. Triangular vs. Rectangular Lamp Shade.
    2. Thin vs. Fat Lamp Base.

    While we do not explicitly annotate a subset of the landmarks to consider, we show that by including a few examples in the labeled set, our classifiers can learn to consider only the relevant portion of the shape. The setup for each task is the same as described in the previous section. We stress that the test localizations predicted by LOOPS are the same for both tasks.


    Fig. 13 (Left column) Classification results for the two descriptive tasks for the table lamp object class. Compared are the LOOPS outlining method joined by a logistic regression classifier (LOOPS + Logistic), a Naive Bayes classifier and a discriminative appearance based Centroid classifier. We also include the GROUND + Logistic results, in which manually labeled outlines from the training and test set are used and that approximately upper bounds the achievable performance when relying on outlines. In both tasks, the LOOPS model is clearly superior to the baselines and is not far from perfect (GROUND + Logistic) performance. (Right column) A comparison of LOOPS to three other shape-based object detectors. For the purposes of these tasks, the LOOPS outlines are far superior

    The only things that change are the label set and the logistic regression classifier that attempts to predict labels based on the output localization. In this case, the baseline methods must be completely retrained for each task, which requires EM learning in the case of Naive Bayes, and boosting in the case of Centroid.

    As can be seen in Fig. 13 (left column), along both axes we are able to achieve effective classification, while both feature-based baselines perform barely above random. In addition, we again outperform the shape-based baselines, as can be seen in the right column of Fig. 13. The consequences of this result are promising: we can do most of the work (learning a LOOPS model and predicting corresponding outlines) once, and then easily perform several descriptive classification tasks. All that is needed for each task is a small number of labels and an extremely efficient training of a simple logistic regression classifier. Figure 14 shows several qualitative examples of successes and failures for the above tasks.

    We note that in the lamp tasks, the LOOPS + Logistic method sometimes outperforms the GROUND + Logistic “upper bound” for a small number of training instances. While this seems counterintuitive, in practice a different choice of landmarks or inaccuracies in localizations can lead to fluctuations in the classification accuracy.

    8.3 Shape Similarity Search

    The second application area that we consider is similarity search, which involves the ranking of test instances on their similarity to a search query. A shopping website, for example, might wish to allow a user to organize the examples in a database according to similarity to a query product. The similarity measure can be any feature that is easily extracted from the LOOPS outline or from the image based on the corresponded outline. We demonstrate similarity using the entire shape, a user-specified component of the shape, and an appearance feature (color) localized to a certain part of the object.

    The experimental setup is as follows: offline, we train a LOOPS model for the object class and localize corresponded outlines in each of the test set images. Recall that in LOOPS this is done once, and all tasks are carried out given the same localized outlines. Online, a user indicates an instance from the test set to serve as a “query” image and a similarity metric to use. We search our database for the images that are most similar based on the specified similarity metric, and return the ranked list back to the user.

    As an example, we consider the case of the lamp dataset used above. Users select a particular lamp instance, a subset of landmarks (possibly all), and whether to use shape or color.


    Fig. 14 Example classifications for our five tasks, for the case of one labeled training instance per label. Underneath each image we report the label produced by LOOPS + Logistic, and for errors we give the groundtruth label in parentheses. The first two columns show a correct labeling for each label, and the third column shows a mistake for each task (where available). In rows three and five (cheetah and lamp bases), we can see that the mistake is caused by a failure of LOOPS to correctly outline the object. In the fourth row (lamp shades), we see that despite a very accurate LOOPS localization, we still incorrectly classify the lamp shade. This is due to the inherent limitations of the logistic approach with a single training instance

    Each instance in the dataset is then ranked based on Euclidean distance to the query in shape PCA space.
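    Ranking then reduces to sorting the database by distance in the coefficient space; a short sketch (the PCA coefficients are computed as in the classification experiments above):

    import numpy as np

    def rank_by_shape_similarity(query_coeffs, database_coeffs):
        """Indices of database instances ordered by increasing Euclidean
        distance from the query in shape PCA space."""
        distances = np.linalg.norm(database_coeffs - query_coeffs, axis=1)
        return np.argsort(distances)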

    Figure 15 shows some example queries. In the top row we show a full-shape search, where the first (left-most) instance is the query instance. The results are shown left to right in increasing distance from the query instance. For the full-shape query, we see that both triangular lampshades and wide bases are returned. Suppose that the user decides to

    focus only on the lampshade; the second row shows the results of a search using only the landmarks in the selected region. This search returns lamps with triangular shades, but with bases of varying width. The third row displays the results of a search where the user instead decides to select only the base of the lamp, which returns lamps with wide bases. Finally, the bottom row shows a search based on the color of the shade.


    Fig. 15 (Color online) Object similarity search using the LOOPS output. In the top row we show the top matches in the test database using the full shape of the query object (left column). Similarity is computed as the negative sum of the squared distances between the corresponding PCA coefficients of the procrustes aligned shapes. In the second row we refine our search by using only the landmarks that correspond to the lamp shade. The results now show only triangular lamp shades. The third row shows a search based on the lamp base, which returns wide bases. In the bottom row, we use color similarity of the lamp shade to rank the search results. The top matches show lamps with faded yellow shades

    In this case we compute the similarity between two instances as the negative distance between the mean LAB color vectors2 in the region outlined by the selected landmarks. The top-ranked results all contain off-white shades similar in color to the query image.
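    A sketch of this color-based similarity, using scikit-image for the RGB-to-LAB conversion and polygon rasterization (the selected landmark subset is assumed to form a closed region):

    import numpy as np
    from skimage.color import rgb2lab
    from skimage.draw import polygon

    def mean_lab_color(image_rgb, region_landmarks):
        """Mean LAB color inside the polygon spanned by the selected landmarks
        of the localized outline (landmarks given as (x, y) coordinates)."""
        lab = rgb2lab(image_rgb)
        rr, cc = polygon(region_landmarks[:, 1], region_landmarks[:, 0], lab.shape[:2])
        return lab[rr, cc].mean(axis=0)

    def color_similarity(query_mean_lab, other_mean_lab):
        """Negative Euclidean distance between mean LAB vectors (larger = more similar)."""
        return -np.linalg.norm(query_mean_lab - other_mean_lab)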

    Similarity search allows users to browse a dataset with the goal of finding instances with certain characteristics. In all of the examples considered above, by projecting the images into LOOPS outlines, similarity search desiderata were easily specified and effectively taken into account. Importantly, the similarity of interest in all of these cases is hard to specify without a predicted outline.

    8.4 Descriptive Clustering

    Finally, we consider clustering a database by leveraging the LOOPS predicted outlines. As an example, suppose we have a large database of airplane images, and we wish to group our images into “similar looking” groups of airplanes. Clustering based on shape might produce clusters corresponding to passenger jets, fighter jets, and small propeller airplanes.

    2 The LAB color space represents colors as three real numbers, where the L component indicates “lightness” and the A and B dimensions represent the “color”. The space is derived from a nonlinear compression of the XYZ color space.

    In this section, we consider an even more interesting outline and appearance based clustering where the feature vector for each airplane includes the mean color values in the LAB color space for all pixels inside the airplane boundary. We cluster the database of examples on this feature vector using K-means clustering. Figure 16(a) shows one cluster that results from this procedure for a database of 770 images from the Caltech airplanes image set (Fei-Fei et al. 2004). Such results might be obtained from any method that produces a precise outline (or segmentation) of the object.
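    Given per-image mean LAB features computed inside the outline (or inside any landmark-defined region, as in the tail example below), the clustering step itself is a standard K-means; a sketch using scikit-learn (the number of clusters is illustrative):

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_by_region_color(mean_lab_features, num_clusters=10):
        """Cluster images by the mean LAB color of a chosen region of the
        LOOPS outline (the whole airplane, or just the tail landmarks)."""
        features = np.vstack(mean_lab_features)  # one 3-vector per image
        kmeans = KMeans(n_clusters=num_clusters).fit(features)
        return kmeans.labels_, kmeans.cluster_centers_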

    Imagine, however, that instead of clustering using the appearance of the entire airplane, we are interested in the appearance of the tail. This may be a useful scenario because patterns and colors on the tail are generally more informative about the airline or country of origin. In order to focus on the airplane tails, we can specify a subset of the landmarks in the LOOPS model to indicate the region of interest (as in the lamp example above). Since a localization is a correspondence between the model’s landmarks and a test image, this will automatically allow us to zoom in on the airplane tails, as shown in Fig. 16(b) for the same cluster of Fig. 16(a).


    Fig. 16 (Color online) Clustering of a large airplane database. For each cluster we show the top 36 instances based on the distance to the cluster centroid. (a) shows an example cluster when using the color features averaged across the entire airplane. (b) depicts the zoomed-in tails of the cluster in (a). If we instead cluster using the color of the tails alone, we obtain more coherent groups of airplane tails. (c) and (d) show the tail clusters that contain the first two examples of (a) and (b)

    Despite the fact that the cluster looks coherent when considering the whole plane, the tails are very heterogeneous in appearance.

    The fact that the outlines produced by LOOPS are consistently corresponded allows us to cluster the appearance of the tail itself. That is, LOOPS allows us to rely on the localized appearance of different parts of the object.


    Specifying the tail as the landmark subset of interest, we compute the same feature vectors only for the pixels contained in the hull of the tail. We then re-cluster our database using these tail feature vectors. Figures 16(c) and (d) show the zoomed-in tails for the two clusters that contain the first two instances from Fig. 16(a). The coherence of the tail appearance is much more apparent in this case, and both clusters group many tails from the same airlines.

    In order to perform such coherent clustering of airplane tails, one needs first to accurately localize the tail in test images. Even more than the table lamp ranking task presented above, this example highlights the ability of LOOPS to leverage localized appearance, opening the door for many additional shape and appearance based descriptive tasks.

    9 Discussion and Future Work

    In this work we presented the Localizing Object Outlines using Probabilistic Shape (LOOPS) approach for obtaining accurate, corresponded outlines of objects in test images, with the goal of performing a variety of descriptive tasks. Our LOOPS approach relies on learning a probabilistic model in which shape is combined with discriminative detectors into a coherent model. We overcome the computational challenge of localizing such a model in cluttered images and under large deformations and articulation by making use of a hybrid discrete-then-continuous optimization approach. We directly and quantitatively evaluated the ability of our method to consistently localize constituent elements of the model (landmarks) and showed how such a localization can be used to perform descriptive classification, shape and localized appearance based search, and clustering that focuses on a particular part of the object. Importantly, we showed that by relying on a model-to-image correspondence, our performance is superior to discriminative and generative competitors, often with as little as a single labeled example.

    In theory, some existing detection methods (e.g., Ferrari et al. 2008; Berg et al. 2005; Opelt et al. 2006b; Shotton et al. 2005) lend themselves to some of the descriptive tasks described above, by producing outlines during the detection process. However, in the experiments above, we demonstrated shortcomings with two state-of-the-art methods in this regard. Because our LOOPS model was targeted towards producing these types of outlines rather than detecting the presence or absence of the object, it was able to obtain significantly more accurate shape-based classifications. Furthermore, in practice, no other work that considered object classes in cluttered images demonstrated a combination of accurate localization and shape analysis that would solve these problems.

    Our contribution is thus threefold. First, we design our LOOPS model with the primary goal of corresponded localization in cluttered scenes. The landmarks are chosen automatically so that they will both be salient and appear consistently across instances, and both the training and correspondence steps are geared specifically toward the goal of accurate localization. Second, we present a hybrid global-discrete then local-continuous approach to the computational challenge of corresponding a model to a novel image. This allows us to achieve consistent correspondence in cluttered images even in the face of large deformations that will hinder alternatives such as active shape or appearance models. Third, we demonstrate that accurate localization is of value for a range of descriptive tasks, including those that are based on appearance.

    There are several interesting directions in which our LOOPS approach can be extended. We would like to automatically learn coherent parts of objects (e.g., the neck of the giraffe) as a set of landmarks that articulate together, and accurately localize both landmarks and part joints in test images. Intuitively, learning a distribution over part articulation (e.g., the legs of a running mammal are synchronized) can help localization. More importantly, localizing parts opens the door for novel descriptive tasks that consider variation in the number of parts (e.g., a ceiling fan with 4 or 5 blades) or in their presence/absence (e.g., a cup with and without a handle). Equally exciting is the prospect of prediction at the scene level. A natural extension of our model is a hierarchical variant that views each detected object as a landmark in a higher level scene model. One can imagine how the geometry of such a model could capture relative spatial locations and orientations so that we can answer questions such as whether a man is walking the dog, or whether the dog is chasing the man.

    Acknowledgements This work was supported by the DARPA Transfer Learning program under contract number FA8750-05-2-0249. We would also like to thank Vittorio Ferrari and Pawan Kumar for providing us code and helping us to get their methods working on our data.

    References

    Anguelov, D., Srinivasan, P., Koller, D., Thrun, S., Rodgers, J., & Davis, J. (2005). SCAPE: shape completion and animation of people. In SIGGRAPH ’05: ACM SIGGRAPH 2005 papers (pp. 408–416). New York: ACM. doi:10.1145/1186822.1073207.

    Basri, R., Costa, L., Geiger, D., & Jacobs, D. (1998). Determining the similarity of deformable shapes. Vision Research, 38, 2365–2385.

    Belongie, S., Malik, J., & Puzicha, J. (2000). Shape context: A new descriptor for shape matching and object recognition. In Neural Information Processing Systems (pp. 831–837).

    Berg, A., Berg, T., & Malik, J. (2005). Shape matching and object recognition using low distortion correspondence. In IEEE Computer Society conference on computer vision and pattern recognition (CVPR).



    Borenstein, E., Sharon, E., & Ullman, S. (2004). Combining top-down and bottom-up segmentation. In IEEE Computer Society conference on computer vision and pattern recognition (CVPR) (p. 46). Los Alamitos: IEEE Computer Society. ISBN 0-7695-2158-4.

    Borgefors, G. (1988). Hierarchical chamfer matching: A parametric edge matching algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(6), 849–865. ISSN 0162-8828. doi:10.1109/34.9107.

    Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge: Cambridge University Press.

    Caselles, V., Kimmel, R., & Sapiro, G. (1995). Geodesic active contours. In International conference on computer vision (pp. 694–699).

    Cootes, T. F., Taylor, C. J., Cooper, D. H., & Graham, J. (1995). Active shape models: their training and application. Computer Vision and Image Understanding, 61(1), 38–59. ISSN 1077-3142. doi:10.1006/cviu.1995.1004.

    Cootes, T. F., Edwards, G. J., & Taylor, C. J. (1998). Active appearance models. In European conference on computer vision (Vol. 2, pp. 484–498).

    Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. New York: Wiley.

    Crandall, D. J., & Huttenlocher, D. P. (2006). Weakly supervised learning of part-based spatial models for visual object recognition. In A. Leonardis, H. Bischof, & A. Pinz (Eds.), Lecture notes in computer science: Vol. 3951. European conference on computer vision (Vol. 1, pp. 16–29). Berlin: Springer.

    Crandall, D., Felzenszwalb, P., & Huttenlocher, D. (2005). Spatial priors for part-based recognition using statistical models. In Proceedings of the 2005 IEEE Computer Society conference on computer vision and pattern recognition (CVPR’05) (Vol. 1).

    Cremers, D., Tischhäuser, F., Weickert, J., & Schnörr, C. (2002). Diffusion snakes: Introducing statistical shape knowledge into the Mumford-Shah functional. International Journal of Computer Vision, 50(3), 295–313. ISSN 0920-5691. doi:10.1023/A:1020826424915.

    Dryden, I., & Mardia, K. (1998). Statistical shape analysis. New York: Wiley.

    Elidan, G., Heitz, G., & Koller, D. (2006a). Learning object shape: From cartoons to images. In IEEE Computer Society conference on computer vision and pattern recognition (CVPR).

    Elidan, G., McGraw, I., & Koller, D. (2006b). Residual belief propagation: Informed scheduling for asynchronous message passing. In Uncertainty in artificial intelligence.

    Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In IEEE Computer Society conference on computer vision and pattern recognition (CVPR).

    Felzenszwalb, P. F., & Huttenlocher, D. P. (2000). Efficient matching of pictorial structures. In IEEE Computer Society conference on computer vision and pattern recognition (CVPR) (pp. 66–73).

    Felzenszwalb, P. F., & Schwartz, J. D. (2007). Hierarchical matching of deformable shapes. In IEEE Computer Society conference on computer vision and pattern recognition (CVPR).

    Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised scale-invariant learning. In IEEE Computer Society conference on computer vision and pattern recognition (CVPR) (Vol. 2, pp. 264–271).

    Fergus, R., Perona, P., & Zisserman, A. (2005). A sparse object category model for efficient learning and exhaustive recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, San Diego (Vol. 1, pp. 380–397).

    Ferrari, V., Tuytelaars, T., & Van Gool, L. (2006). Object detection by contour segment networks. In European conference on computer vision (ECCV).

    Ferrari, V., Jurie, F., & Schmid, C. (2007). Accurate object detection with deformable shape models learnt from images. In IEEE conference on computer vision and pattern recognition, June 2007. New York: IEEE.

    Ferrari, V., Fevrier, L., Jurie, F., & Schmid, C. (2008). Groups of adjacent contour segments for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5), 36–51.

    Fink, M., & Ullman, S. (2007). From aardvark to zorro: A benchmark for mammal image classification. International Journal of Computer Vision, 77, 143–156.

    Grauman, K., & Darrell, T. (2005). Pyramid match kernels: Discriminative classification with sets of image features. In International conference on computer vision, October 2005.

    Hill, A., & Taylor, C. (1996). A method of non-rigid correspondence for automatic landmark identification. In Proceedings of the British machine vision conference.

    Hillel, A. B., Hertz, T., & Weinshall, D. (2005). Efficient learning of relational object class models. In International conference on computer vision (pp. 1762–1769), Washington, DC, USA. Los Alamitos: IEEE Computer Society. ISBN 0-7695-2334-X.

    Kumar, M. P., Torr, P. H. S., & Zisserman, A. (2005). OBJ CUT. In IEEE Computer Society conference on computer vision and pattern recognition (CVPR).

    Leibe, B., Leonardis, A., & Schiele, B. (2004). Combined object categorization and segmentation with an implicit shape model. In ECCV’04 workshop on statistical learning in computer vision (pp. 17–32), Prague, Czech Republic, May 2004.

    Leordeanu, M., Hebert, M., & Sukthankar, R. (2007). Beyond local appearance: Category recognition from pairwise interactions of simple features. In IEEE Computer Society conference on computer vision and pattern recognition (CVPR).