Robust model-based scene interpretation by multilayered context information


Computer Vision and Image Understanding 105 (2007) 167–187. doi:10.1016/j.cviu.2006.09.004


Sungho Kim, In So Kweon

Department of EECS, Korea Advanced Institute of Science and Technology, 373-1 Gusong-dong, Yuseong-gu, Daejeon, Republic of Korea

Received 26 December 2005; accepted 25 September 2006. Available online 6 December 2006.

Abstract

In this paper, we present a new graph-based framework for collaborative place, object, and part recognition in indoor environments. We consider a scene to be an undirected graphical model composed of a place node, object nodes, and part nodes with undirected links. Our key contribution is the introduction of collaborative place and object recognition (which we call the hierarchical context in this paper) instead of object-only recognition or a causal relation of place to objects. We unify the hierarchical context and the well-known spatial context into a complete hierarchical graphical model (HGM). In the HGM, object and part nodes contain labels and related pose information instead of only a label, for robust inference of objects. The most difficult problems of the HGM are learning and inferring variable graph structures. We learn the HGM in a piecewise manner instead of by joint graph learning for tractability. Since the inference includes variable structure estimation with the marginal distribution of each node, we approximate the pseudo-likelihood of the marginal distribution using multimodal sequential Monte Carlo with weights updated by belief propagation. Data-driven multimodal hypotheses and context-based pruning provide the correct inference. For successful recognition, issues related to 3D object recognition are also considered and several state-of-the-art methods are incorporated. The proposed system greatly reduces false alarms using the spatial and hierarchical contexts. We demonstrate the feasibility of HGM-based collaborative place, object, and part recognition in actual large-scale environments for guidance applications (12 places, 112 3D objects).
© 2006 Elsevier Inc. All rights reserved.

Keywords: Hierarchical context; Spatial context; Collaborative; Place–object–part recognition; Piecewise learning; Multi-modal sequential Monte Carlo; Hierarchical graphical model

1. Introduction

Consider visitors to a new environment. They have no prior information about the environment, so they may need a guidance system to acquire the place and related object information. This paper is concerned with the problem of recognizing places and objects in real environments, as shown in Fig. 1. The scope of the place and object recognition is to identify places and objects with parts, in the form of place labels and object labels with poses. This can be regarded as scene interpretation at the semantic level. It is important in guidance applications to recognize places and objects in uncontrolled environments, where the camera may move arbitrarily and lighting conditions also change.

We are aware of two feasible approaches to place and object recognition. One baseline method regards these as separate problems. The other method directly relates places and objects using a Bayesian framework [1] (dotted arrow in Fig. 2). Place is first recognized using gist information from filter responses, and the place information then provides a Bayesian prior distribution over object label, scale, and position [1]. In contrast, we interrelate place, object, and part recognition using an undirected graphical model [3]. Place information can provide contextually related object priors, but conversely, ambiguous places can be discriminated by recognizing contextually related objects. This is the key concept of collaborative place and object recognition.

Fig. 1. Our place and object recognition system can guide visitors with a wearable camera system. A USB camera carried on the head can provide image data to a notebook computer in a carry bag. Processed information is delivered to the visitors as audiovisual data.


Fig. 2. Previous approaches regard place recognition and object recognition as separate problems or directly (causally) related problems. Our approach (solid line) regards them as interrelated problems.


Likewise, object information can provide contextually related part priors, but conversely, parts can provide evidence for the existence of a specific object. This work is motivated by the recent neurophysiological findings of Bar [4], in which objects are strongly related to specific scenes or places: a drier is strongly related to a bathroom, a drill is strongly related to a workshop, and so on.

We then extend the concept of interrelation between place and object to the object and part level, so that we unify the hierarchical context and the well-known spatial context [5] among objects and parts into a hierarchical graphical model (HGM), as shown in Fig. 3. Originally, the term "context" was used in linguistics to represent verbal meanings derived from relationships within a sentence. We use the term throughout this paper to represent information coming from "visual relationships" in images.


Fig. 3. Unification of visual context: (a) A scene contains place, objects, parts, and pixel information, and (b) they are integrated into a hierarchical graphical model (HGM). Red solid lines represent hierarchical contextual relationships, blue dotted lines represent the well-known spatial context, and black lines represent measurements. (For interpretation of color mentioned in this figure the reader is referred to the web version of the article.)

There is one place node, an object layer composed of multiple object nodes, and a part layer composed of multiple part nodes. In this HGM, the bidirectional interaction properties of nodes within layers (dotted lines in Fig. 3b) and nodes between layers (solid lines in Fig. 3b) are important.

In this work, we develop a piecewise approximated learning method instead of holistic minimization of the joint likelihood for tractable learning. We simultaneously infer the variable structure and marginal distribution using a multimodal sequential Monte Carlo method (MM-SMC) with weight calculation by belief propagation (BP).

The fundamental issues (see Fig. 4) related to 3D objects are considered and a robust and scalable 3D object representation is incorporated into the nodes of the HGM.



Fig. 4. Fundamental issues in 3D object recognition are visual variations with changing viewpoint, scale, illumination, occlusion and blurring. These issues arise concurrently in working environments.


The issue of figure/ground segmentation is bypassed by using semantic local features, and using background information in the form of visual context solves the issue of background clutter.

The structure of the remainder of this paper is as follows: Section 2 describes related work on HGMs for place, object, and part recognition. Section 3 describes our HGM in detail, including bidirectional place and object recognition. Learning and inference methods are presented in Sections 4 and 5, respectively. The model is evaluated and demonstrated using a large real data set in Section 6. Section 7 presents our conclusions.

2. Related work

In this section, we introduce related work focusing on place and object recognition in real environments. There are two approaches to place and object recognition, according to whether visual relationships are modeled or not.

2.1. Independent approach to place and object recognition

Place recognition or scene classification is an active research area. Vogel and Schiele proposed a natural scene retrieval system based on a semantic modeling step [6]. They classify image regions into semantic labels, such as "grass" and "rocks". Co-occurrence vectors of labels can then provide scene categories such as coasts and rivers. Fei-Fei and Perona applied latent semantic analysis, a text categorization technique, to the scene categorization problem and obtained satisfactory results for 13 categories such as office, kitchen, and street [7]. However, the place recognition researchers did not consider object recognition information explicitly, so place recognition can be ambiguous in similar environments, as shown in Fig. 24a.

In the object recognition problem, local feature-based approaches show groundbreaking performance for textured objects [8–12]. Of these, the SIFT (scale invariant feature transform) feature, proposed by Lowe, shows robust performance in object recognition [10]. This feature can cope with partial view changes, scale changes, illumination changes, and occlusions, as shown in Fig. 4. However, local feature-based object recognition cannot discriminate between ambiguous objects, as shown in Fig. 23a.

2.2. Graphical model for visual relation

Unlike independent place and object modeling, there are several modeling methods that depend on visual relationships. Basically, we can categorize visual relations into spatial contexts and hierarchical contexts. The first is visual interaction in image space, such as between pixels, parts, and objects. The second is the interaction between abstraction levels, such as place–objects and object–parts. Many researchers have mathematically modeled this relational information, using a semantic network [2] or a graphical model [3]. A semantic network can represent concepts in declarative knowledge, and inference is performed using rule-based procedural knowledge. Although the semantic network intuitively provides a method suitable for our problem, we adopt a graphical model because it provides a principled method for modeling visual relations (graph theory) and uncertainty (probability theory). The graphical model is the combination of probability theory and graph theory. We can subdivide the previous work on graphical models according to which of five conditions it uses: single layer, multilayer, linking method, graph structure, and node information, as shown in Fig. 5a.

2.2.1. Single layer

Most undirected graphical models (non-causal models) have a single layer to describe spatial context.


Fig. 5. (a) Aspects of graphical modeling for visual relations. The graphical model can be divided into single layer or multilayer, by the linking method between layers, graph structure, and node information. (b) Our proposed scheme consists of multiple layers, doubly directed linking, and a variable graph with pose-encoded MRFs for collaborative place, object, and part recognition.


There are generative models of the Markov random field (MRF) [13] and discriminative models of the conditional/discriminative random field (CRF or DRF) [14]. Recently, CRF-related methods have been proposed, such as Mutual Boosting [15] and the boosted random field (BRF) [16], for easy learning and discriminative inference of hidden labels.

2.2.2. Multilayer

Multiple hidden layers can be used to incorporate larger spatial interactions by multiresolution [17,18], or through different semantic abstraction levels, such as scenes, objects, and parts [1,19,20].

2.2.3. Linking method

The most important and difficult element in multilayered graphical models is the linkage of the multiple layers. The simplest method is to produce each layer as a fully undirected model with the same type and number of nodes in each layer, such as mCRF using the Product-of-Experts method [17]. Other popular methods are directly linked methods. One is the top-down Bayesian network, or generative model [1,18,19,21], and the other is the bottom-up, or discriminative, model, such as the hierarchical random field (HRF) formed by directly linking two CRFs [20].

2.2.4. Graph structure

The graphical models above assume a fixed graphical model, which is a rather simple problem because only the marginal distribution of each node has to be estimated. If the graph structure varies from image to image, then this becomes a very challenging problem. Either the number of nodes is fixed and only the links can be variable, as in dynamic trees [22], or both nodes and links can be variable [23,24]. For variable node estimation, transdimensional Markov chain Monte Carlo (TD-MCMC) is frequently used since mathematical convergence is guaranteed [23].

2.2.5. Node information

Traditionally, most graphical models estimate only labels as hidden variables. Recently, position information has been encoded into dynamic trees, and this shows better performance for image labeling [22]. In DRF, domain information of the patch location is reflected in the label interaction [14]. We extend the location information to object pose and part pose for accurate inference.

In our problem, we use three layers composed of place, object, and part. Each layer is conditionally linked. The place node contains only the place label. Part nodes and object nodes have labels with poses. Therefore, each layer has a pose-encoded MRF structure, as shown in Fig. 5b.

3. Hierarchical graphical model for spatial and hierarchical context

Consider the real computer desk scene shown in Fig. 3a. We wish to recognize the place and objects with parts from the image. A desk provides the prior of contextually related objects (hierarchical context); a desk usually contains computer sets. All objects are spatially correlated (spatial context); a monitor, keyboard, and mouse co-occur on a desk. Each object is also composed of spatially related visual parts (hierarchical context); a mouse has buttons and a pad. These visual parts are spatially related (spatial context); the mouse buttons and pad coexist.

The problem is to model such visual contexts for successful place and object recognition. As described in Section 2, direct extension of the simple Bayesian formula proposed by Torralba [1] may be a starting point. However, this approach cannot model important contextual properties, such as spatial and hierarchical interaction. The Markov random field (MRF) or discriminative random field (DRF), frequently used for image segmentation, may be more suitable for modeling spatial interaction [13,14]. Kumar and Hebert extended the DRF to two directly linked layers for longer contextual influence in segmentation [20]. Tu et al. proposed an image parsing method using bottom-up/top-down information with a Markov chain Monte Carlo framework [23]. However, this did not explicitly model the spatial interaction of visual objects or parts. In this section, we propose an HGM to reflect spatial and hierarchical visual contexts simultaneously at an identification level. Identification level means that we deal with specific scenes that include specific objects in a certain working environment.


Fig. 6. Pseudo-likelihood approximation of the original undirected graphical model for tractability of the global partition function.


3.1. Mathematical formulation for place, object, and part recognition

Although the directed graphical model of a Bayesian network may be computationally feasible, causal relationships between visual objects are undesirable, since the failure of one component leads to a cascading failure to recognize other objects [25]. Conceptually, the spatial context and the hierarchical context for place, object, and part recognition can be modeled as an undirected graphical model, as shown in Fig. 3b. A graphical model is a useful tool because it can model a complex system with a set of probabilistic modules [3]. Nodes represent random variables and arcs represent probabilistic interactions between nodes. White nodes represent hidden variables and black nodes represent visual observations. The arcs between hidden nodes are represented by compatibility or correlation functions. Thick solid arcs represent the hierarchical context and dotted arcs represent the spatial context in Fig. 3b. For example, the detection of each of a monitor, keyboard, and mouse reinforces the others, which leads to stable recognition. In the HGM, contextual facilitation mechanisms are reflected everywhere. A scene node is related to the object nodes and the object nodes are also related to the part nodes. Scene and objects are processed in parallel (thin solid lines). The scene node receives evidence from holistic image features and each part receives evidence from an individual image feature. The black nodes represent visual features extracted from an image.

Let $x^S \in \{0, 1, 2, \ldots, N_S\}$ be a place label, where 0 is the unknown place; $x^O = \{n^O, \theta^O\}$ is the object label ($n^O \in \{1, 2, \ldots, N_O\}$) with pose ($\theta^O$: a similarity transform from model to image); and $x^P = \{n^P, \theta^P\}$ is the part label ($n^P \in \{0, 1, 2, \ldots, N_O\}$) with pose (observed). The pose information is very important for successful object recognition. Note that most graph-based methods use only label information [17,18]. Since the objective of this work is to recognize a place and multiple objects with related parts from a given image, the ideal inference can be performed by maximizing the a posteriori probability (MAP):

\{x^S, x^O, x^P\} = \arg\max_{x^S, x^O, x^P} \; p(x^S, x^O, x^P \mid y) \qquad (1)

If the graph structure is known (fixed) and all the compatibilities are given, we can estimate the joint probability distribution from Eq. (2), where $\psi$ denotes the compatibility, or correlation, functions between hidden nodes, $\phi$ denotes the compatibility between the hidden nodes and the observation nodes, $x$ represents hidden variables to estimate, and $y$ represents observed features.

The scene layer or place node receives evidence from image features and objects. The object layer receives information from parts, the scene, and neighboring objects. The part layer receives evidence from features, objects, and neighboring parts. However, there are two computational problems in Eq. (2). First, the computation of the global normalization factor, or partition function Z, is intractable due to its complexity, and second, the graph structure is not fixed during learning and inference because we do not know the number of object nodes before they are manually given or recognized. The first problem caused by the global partition function can be alleviated using the pseudo-likelihood or conditional independence in layer linking [26]. The original HGM in Fig. 3b can be approximated as shown in Fig. 6. For simplicity, the object layer and the part layer are each represented as a single node. Since each layer is conditioned on its neighboring layers, the global partition function is not required. Mathematically, Eq. (2) can be rewritten as

p(x^S, x^O, x^P \mid y) \approx p(x^S \mid x^O, y)\, p(x^O \mid x^S, x^P)\, p(x^P \mid x^O, y) \qquad (3)

The second problem, of variable structure, can be formulated by deriving each conditional probability in Eq. (3). If we condition on the object nodes ($x^O_t$) and observation nodes ($y$), the likelihood of the part layer ($x^P_t$) can be expressed by Eq. (4) by considering the incoming messages, where $p(x^P_t \mid y)$ can be regarded as a bottom-up message, $p(x^P_t \mid x^O_t)$ can be regarded as a top-down message, and $\alpha$ is the normalization factor. Fig. 6c is the graphical representation of Eq. (4). We add the time index $t$ to indicate the recursive nature.

p(x^P_t \mid x^O_t, y) = \alpha\, p(x^P_t \mid y)\, p(x^P_t \mid x^O_t) \qquad (4)

By Bayes’ rule, Eq. (4) becomes

p(x^P_t \mid x^O_t, y) = \alpha\, [p(y \mid x^P_t)\, p(x^P_t)]\, p(x^P_t \mid x^O_t) \qquad (5)


Although inference given an image is static, we consider a dynamic inference to reflect contextual influences in the HGM. For example, if a part label is influenced by a measurement, then the part–part context and the object–part context follow sequentially. Therefore, the prior $p(x^P_t)$ can be predicted from the previous state ($x^P_{t-1}$) (see Fig. 7c and Eq. (6)),

p(x^P_t \mid x^O_t, y) = \alpha\, p(y \mid x^P_t)\, p(x^P_t \mid x^O_t) \int p(x^P_t \mid x^P_{t-1})\, p(x^P_{t-1} \mid x^O_{t-1}, y)\, dx^P_{t-1} \qquad (6)

If we consider the transdimensional state jump (pruning unnecessary nodes) and pairwise bilateral interaction,

p(x^P_t \mid x^P_{t-1}) = \prod_i p(x^P_{it} \mid x^P_{i(t-1)}) \prod_{ij \in E_P} \psi(x^P_{it}, x^P_{jt}) \qquad (7)

$E_P$ represents the set of part-node edges. We summarize Eqs. (5)–(7) and, following graphical models in general, change the conditional likelihood to a potential $\psi(x_i, x_j)$ indicating preferred pairs of values of directly linked variables $x_i$ and $x_j$. The current likelihood can then be estimated from the measurement, object context, spatial context, and dynamic prediction prior. Therefore, Eq. (4) can be rewritten as

p(x^P_t \mid x^O_t, y) = \alpha \prod_i \phi(y_i, x^P_{it})\, \psi(x^P_{it}, x^O_{it}) \prod_{ij \in E_P} \psi(x^P_{it}, x^P_{jt}) \int \prod_i p(x^P_{it} \mid x^P_{i(t-1)})\, p(x^P_{t-1} \mid x^O_{t-1}, y)\, dx^P_{t-1} \qquad (8)

Likewise, the conditional likelihoods of the object layer and place layer can be represented by Eqs. (9) and (10), with corresponding simple graphical models as shown in Fig. 7b and a, derived from Fig. 6b and a, respectively. Note that all three layers (part, object, and place) consist of incoming contextual messages of a recursive nature. Note also that the transdimensional state transition probabilities, such as $p(x^P_{it} \mid x^P_{i(t-1)})$ and $p(x^O_{it} \mid x^O_{i(t-1)})$, can treat the variable graph structures by recursive inference. Consequently, the second problem of handling variable structure can be solved.


Fig. 7. Modified pseudo-likelihood using Bayes' rule and previous prior states to reflect sequential contextual influence.

p(x^O_t \mid x^S_t, x^P_t) = \alpha \prod_i \psi(x^P_t, x^O_{it})\, \psi(x^O_{it}, x^S_t) \prod_{ij \in E_O} \psi(x^O_{it}, x^O_{jt}) \int \prod_i p(x^O_{it} \mid x^O_{i(t-1)})\, p(x^O_{t-1} \mid x^S_{t-1}, x^P_{t-1})\, dx^O_{t-1} \qquad (9)

p(x^S_t \mid x^O_t, y) = \alpha\, \phi(y \mid x^S_t)\, \psi(x^S_t, x^O_t) \int p(x^S_t \mid x^S_{t-1})\, p(x^S_{t-1} \mid x^O_{t-1}, y)\, dx^S_{t-1} \qquad (10)

The place and object recognition problem in such graphical models is learning the graphical model compatibilities and inferring the graph structure and marginal distributions from an image. We present a computationally feasible (approximate) method of learning and inference in Sections 4 and 5, respectively.

3.2. Generalized robust invariant feature

We briefly introduce the method of building visual features suitable for context-based place, object, and part recognition. Basically, we generalize SIFT [10] in terms of region detection and description [37]. A structure-based local visual feature is detected by extending the DoG feature detector (SIFT) with the Harris corner detector:

x = \max_{x \in W} \{\mathrm{DoG}(x, \sigma)\ \text{or}\ \mathrm{HM}(x, \sigma)\}, \qquad \sigma = \max_{\sigma} \{\mathrm{DoG}(x, \sigma)\},

\text{where } \mathrm{DoG}(x, \sigma) = |I(x) * G(\sigma_{n-1}) - I(x) * G(\sigma_n)|, \quad \mathrm{HM}(x, \sigma) = \det(C(x, \sigma)) - \alpha\, \mathrm{trace}^2(C(x, \sigma)) \qquad (11)

where DoG is the difference of Gaussians, HM representsthe Harris measure, and C is the first order moment matrix.
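As an illustration of Eq. (11), the following NumPy/SciPy sketch computes the two complementary responses and keeps local maxima of each; the scale set, window size, thresholds, and the Harris constant alpha = 0.04 are illustrative assumptions rather than the values used in G-RIF.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def dog_response(img, sigma_prev, sigma_next):
    # |I * G(sigma_{n-1}) - I * G(sigma_n)|: blob-like (convex) structures
    return np.abs(gaussian_filter(img, sigma_prev) - gaussian_filter(img, sigma_next))

def harris_measure(img, sigma=1.5, alpha=0.04):
    # HM = det(C) - alpha * trace^2(C): corner-like interior structures
    Iy, Ix = np.gradient(gaussian_filter(img, 1.0))
    Ixx = gaussian_filter(Ix * Ix, sigma)
    Iyy = gaussian_filter(Iy * Iy, sigma)
    Ixy = gaussian_filter(Ix * Iy, sigma)
    return Ixx * Iyy - Ixy ** 2 - alpha * (Ixx + Iyy) ** 2

def local_maxima(resp, window=7, thresh=None):
    thresh = 0.2 * resp.max() if thresh is None else thresh
    mask = (resp == maximum_filter(resp, size=window)) & (resp > thresh)
    ys, xs = np.nonzero(mask)
    return list(zip(xs, ys))

def detect_parts(img, sigmas=(1.6, 2.3, 3.2, 4.5)):
    """Union of Harris maxima (corner parts) and DoG maxima (convex parts), Eq. (11).
    The scale of each DoG point is the level that maximizes its DoG response."""
    parts = [(x, y, 0.0, "corner") for x, y in local_maxima(harris_measure(img))]
    dog_stack = np.stack([dog_response(img, a, b) for a, b in zip(sigmas[:-1], sigmas[1:])])
    for x, y in local_maxima(dog_stack.max(axis=0)):
        scale = sigmas[1:][int(dog_stack[:, y, x].argmax())]
        parts.append((x, y, scale, "convex"))
    return parts
```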

Fig. 8a shows the decomposition of an object into convex parts and corner-like interior parts using Eq. (11). Each part is characterized by a set of localized histograms of the edge density field, orientation field, and hue field, as shown in Fig. 8b. Fig. 8c shows that the proposed part detector can extract complementary object structures.

4. Piecewise learning of HGM

4.1. Background of piecewise learning

Ideally, we could learn the model parameters by maximizing the conditional likelihoods in Eqs. (8)–(10) from the given training images. This can be solved by gradient ascent with the evaluation of marginals of the hidden variables [14,17,18], but it is computationally intractable due to the complexity of the partition function. Instead of such conventional learning, we use piecewise learning, introduced by Sutton and McCallum [27]. Piecewise training divides the undirected graphical model into pieces, each of which is trained independently. This is reported to be a successful heuristic for training large graphical models.


Fig. 8. (a) Object decomposition by radial symmetry parts and corner-like interior parts. (b) Part description by localized histograms of the edge density field, orientation field and hue field. (c) (left) Convex parts, (middle) corner-like parts, (right) complementary parts.



To train such large models efficiently, the basic concept of piecewise learning is to divide the full model into pieces that are trained independently, combining the weights learned from each piece at inference time. According to Sutton and McCallum's proof and experiments on language data sets [27], the piecewise training method provides an upper bound on the log partition function, produces greater accuracy than pseudo-likelihood, and performs comparably to global training using belief propagation. In computer vision, Freeman et al. [28] trained a graph model using piecewise training of compatibility functions for a super-resolution application.

4.2. Piecewise learning in place node

For piecewise training related to place nodes, we first divide the graph of Fig. 7a into two separate subgraphs of measurement and message from objects, as shown in Fig. 9.


Fig. 9. Types of piecewise learning in the place node: (a) evidence of place label, (b) compatibility between place label and objects.

4.2.1. Evidence of place label ($\phi(y, x^S)$)

We basically represent a place using histograms of learned visual words. First, we extract all local features from the labeled training set of images. Through a learning phase, in which an entropy-based MDL criterion is used, we obtain an optimal set of visual words and the class-conditional distribution of visual words for inference [29]. Fig. 10 summarizes the steps for learning optimal visual words. There is only one parameter (ε) that controls the size of the visual words. Physically, ε is a similarity threshold used to measure descriptor distance in feature space. Through an iterative learning process, we can obtain the optimal visual words in terms of the entropy-based MDL criterion. If a novel object is presented, categorization is conducted using the detected features and learned visual words.


Fig. 10. Visual word selection procedure for evidence of place label.


Kim and Kweon [29] introduce an entropy-based MDL criterion for simultaneous classification and visual word learning. The original MDL criterion is not suitable since we have to find universal visual words for all places and sufficient classification accuracy [30]. If the classification is discriminative, then the entropy of the class a posteriori probability should be low. Therefore, we propose an entropy-based MDL criterion for simultaneous classification and visual word learning by combining MDL with the entropy of the class a posteriori probability, where $I$ denotes training images belonging to only one category, $V = \{v_i\}$ denotes the visual words, $N$ is the size of the training samples, and $\ell(V)$ is the parameter size for the visual words. Each visual word has parameters $\theta_i = \{\mu_i, \sigma_i^2\}$ (mean, variance). Let $L = \{(I_i, c_i)\}_{i=1}^{N}$ be a set of labeled training images, where $c_i \in \{1, 2, \ldots, C\}$ is the class label. Then, the entropy-based MDL criterion is defined by

V = \arg\min_{\varepsilon} \left\{ \sum_{i=1}^{N} H(c \mid I_i, \theta(V)(\varepsilon)) + \lambda\, \ell(V) \frac{\log(N)}{2} \right\} \qquad (12)

where $\lambda$ represents the weight of the complexity term. The entropy $H$ is defined as

H(c \mid I_i, \theta(V)) = -\sum_{c_i = 1}^{C} p(c_i \mid I_i, \theta(V)) \log_2 p(c_i \mid I_i, \theta(V)) \qquad (13)

In Eq. (12), the first term represents the overall entropy for the training image set; the lower the entropy, the better the guaranteed classification accuracy. The second term acts as a penalty on learning: if the size of the visual words increases, the model requires more parameters. Therefore, if we minimize Eq. (12), we can find the optimal set of visual words for successful classification with moderate model complexity. Note that we require only one parameter, ε, for minimization, as ε controls the size of the visual words automatically. Details are given in the next section.

Since ε is the distance threshold between the normalized features, we can obtain initial clusters with automatic sizing. We then obtain refined cluster parameters ($\theta(V) = \{\theta(i)\}_{i=1}^{V}$) with k-means clustering. After the ε-NN-based visual word generation, we estimate the class-conditional visual word distribution for the entropy calculation. The Laplacian smoothing-based estimation is defined by

p(v_t \mid c_j) = \frac{1 + \sum_{\{I_i \in c_j\}} N(t, i)}{V + \sum_{s=1}^{V} \sum_{\{I_i \in c_j\}} N(s, i)} \qquad (14)

where $N(t, i)$ represents the number of occurrences of the visual word ($v_t$) in the training image ($I_i$), and $V$ represents the size of the visual words. The physical meaning of this equation is the empirical likelihood of visual words for a given class [31].

Finally, we calculate the a posteriori probability $p(c_i \mid I_i, \theta(V))$, which is used for the entropy calculation in Eq. (12). Using Bayes' rule, this can be stated as

p(c_i \mid I_i, \theta(V)) = \alpha\, p(I_i \mid c_i, \theta(V)) = \alpha\, p(\{y\}_i \mid c_i, \theta(V)) \qquad (15)

where the image is approximated with local features ($I_i = \{y\}_i$). Using the naive Bayes method (assuming independent features),

p(\{y\}_i \mid c_i, \theta(V)) = \prod_j \sum_{t=1}^{V} p(y_j \mid v_t)\, p(v_t \mid c_i) \qquad (16)

\text{where } p(y_j \mid v_t) = \exp\{ -\| y_j - \mu_t \|^2 / 2\sigma_t^2 \}.

From the calculations defined by Eqs. (13)–(16), wecan evaluate Eq. (12). We learn the optimal set of visualwords by changing e and evaluating Eq. (12) iteratively.We found that the optimal e is 0.4 for scene visualwords. The evidence of place, /(y,xS), is the same asEq. (16).
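The following sketch mirrors the iterative selection loop of Eqs. (12)–(16) under simplifying assumptions: the ε-NN clustering is a single greedy pass without k-means refinement, each word gets an isotropic variance tied to ε, and the parameter count ℓ(V) is taken as two values per word; λ and the candidate ε grid are placeholders.

```python
import numpy as np

def epsilon_clusters(feats, eps):
    """Greedy eps-NN clustering: a feature starts a new visual word unless an
    existing center lies within eps (k-means refinement omitted)."""
    centers = []
    for f in feats:
        if not centers or min(np.linalg.norm(f - c) for c in centers) >= eps:
            centers.append(f.copy())
    return np.array(centers)

def assign_words(feats, centers):
    d = np.linalg.norm(feats[:, None, :] - centers[None, :, :], axis=-1)
    return d.argmin(axis=1)

def mdl_score(images, labels, eps, n_classes, lam=1.0):
    """Entropy-based MDL criterion of Eq. (12) for one value of eps."""
    centers = epsilon_clusters(np.vstack(images), eps)
    V, sigma2 = len(centers), (eps / 2.0) ** 2          # assumed per-word variance
    # class-conditional word distribution with Laplacian smoothing, Eq. (14)
    counts = np.ones((V, n_classes))
    for img, c in zip(images, labels):
        for t in assign_words(img, centers):
            counts[t, c] += 1
    p_w_given_c = counts / counts.sum(axis=0, keepdims=True)
    total_entropy = 0.0
    for img, c in zip(images, labels):
        # naive-Bayes posterior of Eqs. (15)-(16), then its entropy, Eq. (13)
        log_post = np.zeros(n_classes)
        for y in img:
            lik = np.exp(-np.sum((y - centers) ** 2, axis=1) / (2 * sigma2))  # p(y | v_t)
            log_post += np.log(lik @ p_w_given_c + 1e-12)
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        total_entropy += -(post * np.log2(post + 1e-12)).sum()
    penalty = lam * (2 * V) * np.log(len(images)) / 2.0  # l(V): 2 parameters per word
    return total_entropy + penalty

# pick the eps minimizing Eq. (12); the paper reports eps = 0.4 for scene words
# best_eps = min([0.2, 0.3, 0.4, 0.5, 0.6], key=lambda e: mdl_score(imgs, labs, e, C))
```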

4.2.2. Compatibility of place label and object label ($\psi(x^S, x^O)$)

The conditional likelihood of place labels for given objects can be estimated by counting the number of object appearances at places. We add a Dirichlet smoothing prior to the count matrix so that we do not assign zero likelihood to pairs that do not appear in the training data. Fig. 11 shows the probability look-up table for place–object, given labeled images (12 places with 112 objects).
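A minimal sketch of how such a smoothed look-up table might be built; the pseudocount value and the annotation format (one (place_id, object list) pair per labeled training image) are assumptions for illustration.

```python
import numpy as np

def place_object_counts(annotations, n_places=12, n_objects=112, pseudocount=1.0):
    """Smoothed place-object co-occurrence counts (the look-up table of Fig. 11).
    The Dirichlet pseudocount keeps unseen pairs from getting zero likelihood."""
    counts = np.full((n_places, n_objects), pseudocount)
    for place_id, object_ids in annotations:
        for obj in object_ids:
            counts[place_id, obj] += 1.0
    return counts

# psi(x^S, x^O): place likelihood given an object -> normalize each column;
# Section 4.3.1 reuses the same table, row-normalized, for object given place.
# table = place_object_counts(train_annotations)
# p_place_given_object = table / table.sum(axis=0, keepdims=True)
# p_object_given_place = table / table.sum(axis=1, keepdims=True)
```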

4.3. Piecewise learning in the object layer

The graphical model of the object layer shown in Fig. 7b can be subdivided into three subgraphs, as shown in Fig. 12. It consists of the object likelihood given the place, the object–object compatibility, and the part likelihood for the given object.

4.3.1. Compatibility between object and place ($\psi(x^O, x^S)$)

Compatibility between object and place is obtained directly from the probability look-up table of place and object shown in Fig. 11. From the table, we can predict the object likelihood given a specific place by renormalizing each row in the table.

4.3.2. Compatibility between objects ($\psi(x^O, x^{O'})$)

We estimate the compatibility matrix empirically by counting co-occurrences of objects. Initially, we count co-occurrences of objects for each training image. We also add a Dirichlet smoothing prior to the count matrix so that we do not assign zero likelihood to pairs that do not appear in the training data. Then the matrix inner product is normalized using the diagonal values so that the self-compatibility is unity. Fig. 13 shows the object–object compatibility matrix. Objects that co-occur spatially are clustered along the diagonal. The object–object compatibility is used in Eq. (9) for spatial interaction of objects.
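One possible reading of this construction: the inner product of per-image object-presence vectors is normalized by its diagonal so that self-compatibility is unity; the pseudocount and this particular normalization are illustrative assumptions.

```python
import numpy as np

def object_compatibility(annotations, n_objects=112, pseudocount=0.1):
    """Object-object compatibility (Fig. 13) from per-image co-occurrence.
    C_ij / sqrt(C_ii * C_jj) is one reading of "normalized using diagonal values
    so that the self-compatibility is unity"."""
    presence = np.zeros((len(annotations), n_objects))
    for row, (_, object_ids) in enumerate(annotations):
        presence[row, object_ids] = 1.0
    cooc = presence.T @ presence + pseudocount        # smoothed inner product
    d = np.sqrt(np.diag(cooc))
    return cooc / np.outer(d, d)                      # unity on the diagonal
```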


Fig. 11. Compatibility table of places and objects.


Fig. 12. Types of piecewise learning in the object layer: (a) compatibility between object and place, (b) compatibility between objects, (c) compatibility between part and object.

Fig. 13. Object–object compatibility matrix obtained from the labeled training set.


4.3.3. Compatibility of part and object ($\psi(x^P, x^O)$)

Physically, $\psi(x^P, x^O)$ represents a top-down message from the object model to the parts in terms of the image positions of the parts.

We adopt a part-based object representation, specifically a constellation model (CM) [32], because part-based object representation is robust to local photometric and geometric variations, background clutter, and occlusion. There are two forms of CM according to the parameterization method of the parts. The well-known mathematical model proposed by Fergus et al. [32] is a fully parameterized CM. The appearances and positions of parts belonging to an object are modeled jointly, or fully parameterized, in multidimensional Gaussian distributions. Assume a part is $x_i$ and the number of parts is $N$; then the object can be modeled as a full covariance-based joint PDF, $p(x_1, x_2, \ldots, x_N)$. The degrees of freedom (DOF) of the required parameters are $O(N^2)$. Therefore this model can represent 3–6 parts of an object.

Instead of this representation, we use a common-frame constellation model (CFCM) representation scheme, as this provides advantages in terms of computational efficiency and redundancy removal by sharing object view parameters. If we fix the object ID and viewpoint, each part can share the viewing parameters or object frame, $\theta = [\text{object ID}, \text{pose}]$, as shown in Fig. 14 [33]. The term "common frame" means the object reference coordinates, which change according to the viewing conditions. Then, the mathematical representation can be reduced to the product form, conditioned on an object parameter: $\prod_{i=1}^{N} p(x_i \mid \theta)$. In this scheme, the order is reduced to $O(N)$, which is useful during object learning and recognition. This model can handle hundreds of parts. We refer to this part-based object representation as a CFCM because each part shares the object parameters (object ID, pose).

For scalable 3D object representation, we use the concept of sharing in part appearances and in views [34]. Fig. 14 shows the basic object representation scheme using shared appearance features and view clustering in CFCMs, which is motivated by two key papers [35,36]. We use a robust local feature, called the generalized robust invariant feature (G-RIF) [37], because this feature is a generalized version of SIFT [10] in terms of interest point detection (DoG + Harris) and descriptor (edge orientation, density, hue).

The local appearance library is constructed initially using ε-nearest neighbor clustering at random feature points, where ε is simply the threshold for the appearance distance. The initial cluster centers are refined using k-means clustering. Finally, object model CFCMs are constructed sequentially, according to feature matching between the input and stored models. If there are few matches, then a new CFCM is generated from the input image. As shown in Fig. 14, a CFCM contains a set of parts that have appearance indices to the appearance library and part poses relative to the object frame. We assume a default variance for part position until view clustering. To measure the spatial matching, we use a similarity transform because G-RIF is invariant under the similarity transform. If the spatial matching error is larger than a predefined threshold (δ), we create a new CFCM with shared features (see Fig. 15); otherwise only new parts are added to the model CFCM and the corresponding part poses are updated. We obtained optimal values of ε = 0.2 and δ = 15 using the entropy-based MDL criterion of Eq. (12).
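A sketch of the CFCM bookkeeping and the sequential view-clustering rule described above; the feature matching and similarity-transform estimation are left as stand-in callables (match_fn, pose_fn), and only the δ test from the text is made explicit.

```python
from dataclasses import dataclass, field
from typing import List
import numpy as np

@dataclass
class Part:
    appearance_idx: int            # index into the shared appearance library
    pose: np.ndarray               # (x, y, scale, orientation) relative to the object frame

@dataclass
class CFCM:
    object_id: int
    parts: List[Part] = field(default_factory=list)

def learn_cfcms(training_views, match_fn, pose_fn, delta=15.0):
    """Sequential view-clustered CFCM construction (Fig. 15): a training view either
    updates an existing CFCM of the same object or, if its similarity-transform
    matching error exceeds delta, spawns a new view-tuned CFCM."""
    models: List[CFCM] = []
    for object_id, view_parts in training_views:        # view_parts: list of Part
        best, best_err = None, np.inf
        for m in (m for m in models if m.object_id == object_id):
            matches = match_fn(view_parts, m.parts)      # appearance-library correspondences
            err = pose_fn(matches) if matches else np.inf  # spatial (similarity) error
            if err < best_err:
                best, best_err = m, err
        if best is None or best_err > delta:
            models.append(CFCM(object_id, list(view_parts)))     # new view cluster
        else:
            known = {q.appearance_idx for q in best.parts}
            best.parts.extend(p for p in view_parts if p.appearance_idx not in known)
    return models
```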


Fig. 14. Part-based scalable object representation model using an appearance library and a view-clustered common-frame constellation model (CFCM). A CFCM contains constituent parts that have indices to the appearance library and part poses relative to the object (common) frame. The appearance library contains all the links to the parts in the CFCMs.

Fig. 15. Scalable object learning method using appearance sharing and view clustering in CFCMs. Local feature appearance is represented in terms of an appearance library and new CFCM models are generated according to feature matching.


Given such part-based object representation and learning, we can estimate the likelihood $\psi(x^P, x^O)$ as follows. At inference on an arbitrary test image, $x^O$ contains both the object label and the pose ($\theta^O$) of the object frame, and $x^P$ contains the observed part pose. Therefore, we predict the part positions ($v_M = \theta^O(v(p_i))$) using the property of the common-frame constellation model, where $p_i$ represents a part in a CFCM and $v(\cdot)$ represents the pose term of $(\cdot)$. The positions of the model parts depend on the estimated object pose ($\theta^O$). Finally, $\psi(x^P, x^O)$ is defined by Eq. (17), where $\sigma_v$ is the standard deviation of the part position learned during CFCM construction.

\psi(x^P, x^O) = \exp\left( -\frac{\| v(x^P) - \theta^O(v(p_i)) \|^2}{2\sigma_v^2} \right) \qquad (17)

In this likelihood estimation, it is essential to estimate the object pose $\theta^O$ accurately. For a given input image and a candidate object model, conventional methods usually find corresponding points using a Euclidean feature distance. However, if there are similar parts within an object, we cannot obtain the correct pose by a similarity transform between the input and the model. We propose a novel visual distance measure based on the concept of saliency, extending Lowe's distance ratio, which considers only the first and second nearest neighbors [10].


Fig. 16. The concept of saliency-based visual distance measure for stable matching.


If an image feature ($f^I_i$) and a corresponding model feature ($f^M_j$) are sufficiently similar, and $f^I_i$ is quite different from the rest of the input features and $f^M_j$ is also quite different from the rest of the model features, then the match is very reliable (see Fig. 16). Based on this concept, we propose a novel saliency-based distance measure ($D_P$) defined by

D_P(f^I_i, f^M_j) = D_E(f^I_i, f^M_j) \cdot \left( C(f^I_i) + C(f^M_j) \right),

\text{where } C(f^I_i) = \frac{D_E^{1\mathrm{st\,NN}}(f^I_i, \{f^I\})}{D_E^{2\mathrm{nd\,NN}}(f^I_i, \{f^I\})}, \quad C(f^M_j) = \frac{D_E^{1\mathrm{st\,NN}}(f^M_j, \{f^M\})}{D_E^{2\mathrm{nd\,NN}}(f^M_j, \{f^M\})} \qquad (18)

where $D_E$ represents the Euclidean distance and $C(f^I_i)$ represents the contrast or saliency of $f^I_i$ within the input feature set, defined as the relative distance ratio between the nearest and second nearest neighbors. The smaller $D_P$ is, the more stable the matches we produce. The effect of saliency-based pose estimation is shown in Fig. 26.

4.4. Piecewise learning in part layer

The graphical model of the part layer is shown in Fig. 7c, which can be subdivided into three subgraphs, as shown in Fig. 17. It consists of part–object compatibility, part–part compatibility, and the feature likelihood for a given part.

4.4.1. Compatibility between part and object ($\psi(x^P, x^O)$)

The message to the part layer from the object layer, shown in Fig. 17a, is the same as Eq. (17).


Fig. 17. Types of piecewise learning in the part layer: (a) compatibility between part and object, (b) compatibility between parts, (c) evidence of part.

4.4.2. Part–part compatibility ($\psi(x^P, x^{P'})$)

We assume a pairwise clique, as shown in Fig. 17b. In MRF, the compatibility is defined as the energy of label smoothness [13]. In DRF, it is defined as the interaction energy of the part position, including label smoothness, and the parameters are learned by gradient ascent [14]. Instead of such modeling, we apply the gestalt laws of proximity and similarity, because parts within an object are geometrically very close and have the same object labels, as shown in Fig. 18. We model such a property as the part–part compatibility defined by Eq. (19), where the first term reflects the weight for spatial distance and the second term reflects the weight for the same object labeling. If two parts have different labels, we assign a very small weight. $\sigma_D$ is defined as the standard deviation of distances between part pairs in the given training set. The labeling similarity weight is calculated using $\phi(y, x^p)$, defined in the following section. $ID(\cdot)$ represents the part label of part $(\cdot)$.

\psi(x^P_T, x^P_S \mid ID(x^P_T) = i, ID(x^P_S) = i) \propto \exp\left( -\frac{\| v(x^P_T) - v(x^P_S) \|^2}{2\sigma_D^2} \right) \cdot \frac{w(ID(x^P_T) = i, ID(x^P_S) = i)}{\sum_j w(ID(x^P_T) = j, ID(x^P_S) = j)}

\psi(x^P_T, x^P_S \mid ID(x^P_T) = i, ID(x^P_S) = j) \ll 1, \quad (i \neq j) \qquad (19)

where w(ID(x^P_T) = i, ID(x^P_S) = i) = \phi(y_T, ID(x^P_T) = i)\, \phi(y_S, ID(x^P_S) = i).
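A sketch of Eq. (19) for a single part pair; the per-label appearance likelihoods stand in for φ(y, x^p) of Eq. (20), and σ_D and the small cross-label weight are illustrative constants rather than the learned values.

```python
import numpy as np

def psi_part_part(pos_t, pos_s, label_t, label_s,
                  label_evidence_t, label_evidence_s,
                  sigma_d=20.0, eps=1e-3):
    """Gestalt proximity/similarity compatibility of Eq. (19).
    label_evidence_* are per-label appearance likelihoods phi(y, x^p) over all labels."""
    if label_t != label_s:
        return eps                                    # different object labels: tiny weight
    proximity = np.exp(-np.sum((pos_t - pos_s) ** 2) / (2.0 * sigma_d ** 2))
    w = label_evidence_t * label_evidence_s           # w(ID=i, ID=i) for every label i
    similarity = w[label_t] / (w.sum() + 1e-12)       # normalized same-label weight
    return proximity * similarity
```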

Fig. 18. The concept of part–part compatibility is based on the proximity and similarity of the gestalt principle. If a part pair has a small distance and the same part labeling, the part–part compatibility is strong.

Fig. 19. Basic concept of belief propagation.


4.4.3. Evidence of part ($\phi(y, x^P)$)

The likelihood of an observed feature's appearance at a given part node is calculated from Eq. (20). Since the part node is given, we know which appearance library entry is used from the memorized linking ($\psi(x^p, A)$), where $A$ denotes the appearance library. We assume noisy measurement for the likelihood; $A$ and $\sigma_A$ are learned during ε-nearest-neighbor clustering.

\phi(y, x^p) = G(y \mid A)\, \psi(x^p, A), \quad \text{where } G(A_p, y) = \exp\left\{ -\frac{\| A(y) - A \|^2}{2\sigma_A^2} \right\} \qquad (20)

5. Approximate inference for online graph construction

5.1. Multimodal sequential Monte Carlo (MM-SMC)

The inference in this paper is a very difficult problem because we have to estimate both the graph structure and the a posteriori probability distribution, as different images have different numbers of objects and parts. If the graph structure is fixed and the a posteriori probability distribution is available from Eq. (1), then maximizing the a posteriori probability (MAP) can give an exact inference. In practice, it is not trivial to estimate the a posteriori probability, even for a fixed graph, due to the loopy graph structure. Most researchers use approximate methods to estimate the a posteriori probability. One is to use variational inference methods [38], which factorize an a posteriori probability distribution into a simpler distribution. Another is to use Monte Carlo (sampling) approximation, such as Markov chain Monte Carlo [39], where a Markov chain is constructed that has the desired a posteriori probabilities as its limit distributions. A third approach is to use loopy belief propagation [40] to approximate the a posteriori probability marginals for loopy graphs.

If the graph structure varies from image to image, the problem cannot be solved by directly applying these approximate inference methods. Recently, extended versions of Monte Carlo methods have been proposed to handle graph variability and a posteriori probability approximation: the reversible jump MCMC (RJ-MCMC) sampler [23] and the transdimensional SMC (TD-SMC) sampler [41]. RJ-MCMC is theoretically more rigorous (no normalization), but it requires a very long time for the inference of high-dimensional states. TD-SMC is similar to RJ-MCMC but is more flexible because TD-SMC has no constraint on reversibility. The complexity is the same as for RJ-MCMC.

In this paper, we use several approximation schemes for the simultaneous inference of graph structure and marginals. First, we approximate the original a posteriori probability (Eq. (2)) as three conditional likelihoods or pseudo-likelihoods using Eq. (3). This simplifies the global partition function into subpartition functions. Second, we further approximate the joint pseudo-likelihoods

(conditional likelihoods) in Eqs. (8)–(10) by loopy belief propagation (LBP) to estimate the a posteriori probability marginals [40]. BP provides an approximate solution for the loopy graphical model and empirically shows successful performance. Through LBP, we obtain an approximate solution by maximizing the a posteriori probability marginal (MPM). In the original BP, the belief ($b_i$) at node $i$ is formulated as Eq. (21) (see also Fig. 19). In the message calculation, we approximate $\phi_j(x_j) \prod_{k \in N(j) \setminus i} m_{kj}(x_j)$ by $b^{t-1}_j(x_j)$, estimated in the previous time step, and we use max-product instead of sum-product for accurate estimation.

b_i(x_i) = k\, \phi_i(x_i) \prod_{j \in N(i)} m_{ji}(x_i), \qquad m_{ji}(x_i) \leftarrow \sum_{x_j} \psi_{ij}(x_i, x_j)\, \phi_j(x_j) \prod_{k \in N(j) \setminus i} m_{kj}(x_j) \qquad (21)
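For a discrete toy graph, the max-product message and belief update of Eq. (21), with the previous-iteration belief substituted for the sender's evidence-times-incoming-messages product, look roughly as follows; the compatibility and evidence tables are assumed given.

```python
import numpy as np

def max_product_message(psi_ij, phi_j, prev_belief_j):
    """Message m_ji(x_i) of Eq. (21): phi_j * prod_k m_kj is approximated by the
    belief b_j^{t-1} from the previous iteration, and the sum is replaced by max."""
    # psi_ij[x_i, x_j]; the max runs over the sender's states x_j
    return np.max(psi_ij * (phi_j * prev_belief_j)[None, :], axis=1)

def update_belief(phi_i, incoming_messages):
    """Belief b_i(x_i) = k * phi_i(x_i) * prod_j m_ji(x_i), normalized to sum to one."""
    b = phi_i * np.prod(np.vstack(incoming_messages), axis=0)
    return b / b.sum()
```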

Third, we approximate the a posteriori probability marginal with a multimodal sequential Monte Carlo representation. The multimodal scheme can handle the uncertainty of the graph structure and the marginal distribution simultaneously [42]. Since the marginal in each node cannot be represented parametrically due to non-Gaussian measurements and occlusions, we use sequential Monte Carlo (SMC). Eqs. (8)–(10) have a recursive filtering nature, so SMC is suitable for our problem. Therefore we call our approximate inference multimodal sequential Monte Carlo (MM-SMC). Physically, the MM-SMC does not commit to a conclusion immediately, but allows multiple probable hypotheses (particles). These particles stay alive until longer-range feedback (scene context) takes effect. A node has multiple particles which contain hypotheses with weights. Each weight is updated by belief propagation (BP) [44].

5.2. Hypothesize and prune for structure estimation in MM-SMC

We propose a hypothesize and prune method rather than a Markov chain to fully utilize the contextual influence during inference. As described above, a Markov chain, such as RJ-MCMC, is very slow and does not fully utilize the spatial and hierarchical context. In the hypothesize and prune method, we initialize a graph structure which contains the true graph structure using the bottom-up method (see Fig. 20a). Then the distribution of each graph node is represented by a set of samples (Monte Carlo) (see Fig. 20b).


Fig. 20. Simultaneous estimation of graph structure and marginal distribution using the MM-SMC method: (a) hypothesis of graph structure, (b) sample generation for each marginal distribution, (c) weight calculation using LBP in MM-SMC, and (d) on-the-fly structure pruning.


Each sample weight is updated using the BP in MM-SMC. During this process, the spatial context and hierarchical context are activated. Contextually consistent graph nodes receive strong weights and contextually inconsistent graph nodes receive very weak weights (see Fig. 20c). After each contextual influence, we conduct on-the-fly structure pruning to remove the wrong graph nodes (see Fig. 20d). Note that several nodes can be removed simultaneously instead of one by one as in RJ-MCMC. This is the reason hypothesize-and-prune is computationally more efficient than RJ-MCMC.

Fig. 21. Data-driven initialization of graph structure in the part layer using the G-RIF part detector (a), and in the object layer using the Hough transform in pose space (b) and in image space (c).

5.3. Implementation details of MM-SMC

5.3.1. Particle design

A particle in a place node represents a probable place label. Particles in an object node represent possible multi-view CFCMs, which contain object IDs with poses (similarity transform parameters: position, scale, orientation). Particles in a part node represent probable object IDs to which they belong, with observed poses (position, scale, orientation). All the pose parameters for objects and parts are estimated deterministically from image structure (object pose: similarity transform between model and image; part pose: image feature) to reduce dimensionality.


Fig. 22. The diagram of importance sampling-based LBP in MM-SMC at the object layer.



5.3.2. Representation of MM-SMC

We formally represent the MM-SMC, which is a nonparametric form of a multimodal distribution. Assume the number of modes is M; note that initially we do not know the true number of objects, which must be estimated through the hypothesize and prune strategy (so the number should be larger than the true number of objects). The following scheme is almost the same for the part layer.

Fig. 23. Toy example of the collaborative property. (a) Object recognition using only object-related features [4]. (b) Application of bidirectional interaction between place and objects. Dotted arrows represent objects-to-place messages to disambiguate the place. Solid arrows represent place-to-object messages to disambiguate blurred objects (drier, drill).

Fig. 24. Place recognition using only measurement (a) and both measurement and message from objects (b).


An object node collects messages from the part layer ($x^P$), the scene layer ($x^S$), neighboring objects, and messages from the previous stage (see Eq. (24)). Let $O = \{N, M, P, X, W, C\}$ denote the particle representation of the object layer, with $N$ the total number of view-tuned particles (CFCMs), $M$ the number of mixture components, $P = \{p_m\}_{m=1}^{M}$ the mixture component (object) weights, $X = \{x^{o(i)}\}_{i=1}^{N}$ the particles, $W = \{w^{(i)}\}_{i=1}^{N}$ the particle weights, and $C = \{c^{(i)}\}_{i=1}^{N}$ the object indicators, i.e., $c^{(i)} \in \{1, 2, \ldots, M\}$. If particle $i$ belongs to mixture component $m$, then $c^{(i)} = m$. $O_m = \{i \in \{1, \ldots, N\} : c^{(i)} = m\}$ is the set of indices of the particles belonging to the $m$-th mixture component (object). When $M$ is the number of objects (or graph nodes) in a scene, each mixture component has view-tuned particles with different poses. The approximate multimodal particles can be represented by

b(x^O) = \sum_{m=1}^{M} p_m\, b(x^O_m) \approx \sum_{m=1}^{M} p_m \sum_{i \in O_m} w^{(i)}\, \delta_{x^{o(i)}} \qquad (22)

5.3.3. Bottom-up graph hypothesis (node birth) and particle generation

We generate particles using importance sampling, especially a data-driven proposal function, for fast and accurate convergence [42]. In the part layer, the initial part graph is hypothesized by the G-RIF part detector, as shown in Fig. 21a [37]. The G-RIF can provide corner and convex parts that are complementary. A particle in a part node has an object label with the observed part pose. Particles (different object labels with the same part pose) in each part node are generated automatically using the visual appearance library, because visual codebooks contain all the links to the object models (CFCMs). We use ε-nearest neighbor search for the appearance library (ε = 0.2) and the number of part particles in each part node is 5–10. In the object layer, the data-driven proposal function ($q$) for object particles is defined by Eq. (23).

Fig. 25. Examples of scenes (a), and related objects (b).


The Hough transform of the part particles generates CFCMs in pose space, as shown in Fig. 21b and c. We usually have 2–6 object particles per node. Note that these particles are very useful and can provide quick convergence because they are generated in a data-driven way.

q(x^O_t \mid x^S_t, x^P_t) \approx \mathrm{Hough}(x^O_t \mid x^P_t) \qquad (23)

5.3.4. Weight calculation by LBP in MM-SMC

Importance weight in each node is updated by combin-ing incoming messages, as shown in Fig. 22. We calculateeach message using Eq. (24) and weights are updated usingEq. (25), b(Æ) represents belief or weight of the consideringparticles.

M1ðxoðiÞÞ ¼ max

kfbðfxp

ðkÞgÞwðfxpðkÞg; xo

ðiÞÞg;

M2ðxoðiÞÞ ¼ max

kfbðxs

ðkÞÞwðxsðkÞ; x

oðiÞÞg;

M3ðxoðiÞÞ ¼

Yl2CðmÞ

MlmðxoðiÞÞ;

where MlmðxoðiÞÞ ¼ max

kfbðxo

ðkÞÞwðxoðkÞ; x

oðiÞÞg

ð24Þ

$$
w_{\mathrm{new}}(i) = \frac{\tilde{w}(i)}{\sum_{j \in O_m} \tilde{w}(j)}, \qquad
\tilde{w}(i) = M_1(x^o(i)) \cdot M_2(x^o(i)) \cdot M_3(x^o(i)) \,/\, q(x^o(i)),
$$
$$
\pi_m^{\mathrm{new}} \approx \frac{\pi_m \tilde{w}_m}{\sum_{n=1}^{M} \pi_n \tilde{w}_n}, \qquad
\tilde{w}_m = \sum_{i \in O_m} \tilde{w}(i)
\qquad (25)
$$
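A hedged sketch of how Eqs. (24) and (25) could be combined in code, reusing the ObjectLayerParticles container sketched earlier: the three incoming messages are multiplied, divided by the proposal density, and the result is renormalized per mixture component before the mixture weights are updated. The message and proposal callables are assumed inputs standing in for the actual belief-propagation messages.

import numpy as np

def update_object_weights(particles, msg_part, msg_scene, msg_neighbors, proposal):
    # Unnormalized weights w~(i) = M1 * M2 * M3 / q  (Eq. (25), middle line)
    N = len(particles.X)
    w_tilde = np.empty(N)
    for i in range(N):
        m1 = msg_part(particles, i)       # max_k b({x^p(k)}) psi({x^p(k)}, x^o(i))
        m2 = msg_scene(particles, i)      # max_k b(x^s(k)) psi(x^s(k), x^o(i))
        m3 = msg_neighbors(particles, i)  # product of messages from neighboring objects
        w_tilde[i] = m1 * m2 * m3 / max(proposal(particles, i), 1e-12)

    # Normalize particle weights within each mixture component (Eq. (25), first line)
    new_Pi = np.empty_like(particles.Pi)
    for m in range(len(particles.Pi)):
        idx = particles.component_indices(m)
        s = w_tilde[idx].sum()
        particles.W[idx] = w_tilde[idx] / s if s > 0 else 0.0
        new_Pi[m] = particles.Pi[m] * s

    # Update and renormalize the mixture (object) weights (Eq. (25), last line)
    particles.Pi = new_Pi / new_Pi.sum()
    return particles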

Fig. 26. (Left top) Performance evaluation using Euclidean distance and the proposed perceptual distance. Pairs of images show the pose estimation using the corresponding distance measures. White arrows indicate incorrect pose estimation using Euclidean distance and black arrows indicate pose estimation using our method. (For interpretation of the colors mentioned in this figure, the reader is referred to the web version of the article.)


5.3.5. Graph pruning (node death) and resampling

After calculating the importance weights using MM-SMC, we conduct two steps of sample selection for graph structure estimation and the marginal distribution. First, multimodal components with very small weights (mixture weights $\pi_m$ below a small threshold) are removed by contextual influences; nodes that are contextually inconsistent receive little weight. Second, we sample from the surviving multimodal densities using optimal resampling [43]. If a sample is reselected, then $p(x^O_{i,t} \mid x^O_{i,t-1})$ equals unity during MM-SMC. Note that our inference is static, not temporal tracking, where $p(x^O_{i,t} \mid x^O_{i,t-1})$ would be the motion prior. Competing particles survive until longer-range contextual feedback messages are activated. Through the data-driven graph structure hypothesis and pruning by contextual influences, we can produce the correct inference.
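The pruning and resampling step might look like the following sketch: mixture components whose weight falls below a small threshold are deleted (node death), and the particles of each surviving component are resampled. The threshold, the per-component particle count, and the use of plain multinomial resampling in place of the optimal resampling of [43] are simplifications of our own.

import numpy as np

def prune_and_resample(particles, prune_threshold=0.02, n_per_component=10, rng=None):
    """Drop weak mixture components, then resample inside each surviving one."""
    rng = rng or np.random.default_rng()
    keep = np.flatnonzero(particles.Pi > prune_threshold)  # node death for the rest

    new_X, new_W, new_C, new_Pi = [], [], [], []
    for new_m, m in enumerate(keep):
        idx = particles.component_indices(m)
        # Multinomial resampling within the component (assumes W[idx] sums to 1).
        picked = rng.choice(idx, size=n_per_component, p=particles.W[idx])
        new_X.append(particles.X[picked])
        new_W.append(np.full(n_per_component, 1.0 / n_per_component))
        new_C.append(np.full(n_per_component, new_m))
        new_Pi.append(particles.Pi[m])

    particles.X = np.concatenate(new_X)
    particles.W = np.concatenate(new_W)
    particles.C = np.concatenate(new_C)
    particles.Pi = np.asarray(new_Pi) / np.sum(new_Pi)
    return particles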

5.3.6. Overall inference algorithm

Our inference system conducts the hypothesis and pruning steps in the place, object, and part layers in parallel to produce the true graph structure and a maximum posterior marginal (MPM) solution. The hypotheses are driven by bottom-up information and the pruning is governed by contextual influences.
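The per-layer loop can be summarized by a very short sketch; the paper runs such a loop for the place, object, and part layers in parallel, which we simplify here to a fixed number of sequential sweeps over a single layer, with the three stages passed in as callables (for example, wrappers around the proposal, weight-update, and pruning sketches above).

def run_layer_inference(particles, propose, update_weights, prune, n_sweeps=3):
    # One layer of the Section 5.3.6 scheme: bottom-up hypotheses (node birth),
    # contextual weight update, then contextual pruning (node death).
    for _ in range(n_sweeps):
        particles = propose(particles)         # data-driven hypothesis generation
        particles = update_weights(particles)  # combine contextual messages (Eq. (25))
        particles = prune(particles)           # prune weak components and resample
    return particles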

6. Experimental results

6.1. Validation of bidirectional reinforcement property

First, we tested the bidirectional place and object recognition method on ambiguous examples. If we use object information only, as shown in Fig. 23a, we fail to discriminate between the two objects because their local features are almost identical. However, with the bidirectional interaction method we can discriminate between these objects simultaneously, as shown in Fig. 23b.

In the second experiment, we prepared place images taken in front of an elevator. Given a test elevator scene, as shown in Fig. 24a, the measurement alone produced incorrect place recognition. However, when the same test scene was processed using bidirectional place–object recognition, our system produced the correct recognition, as shown in Fig. 24b. In the diagram, the center graph shows the measurement message, the right graph shows the message from the objects, and the left graph represents the combined message for place recognition.

6.2. Large-scale experiment for building guidance

We validated the proposed scene interpretation method in terms of false alarm reduction. A total of 620 object images (112 objects) were acquired in 12 topological places, such as offices, corridors, etc. (a total of 228 images), as shown in Fig. 25. Note that the images are completely general and were obtained in uncontrolled environments. We trained on this labeled training set with the piecewise learning methods described in Section 4. We used 114 test images that were not used in training; the test set contains 645 object images. After training, the size of the appearance library was reduced by 33.3%, from 72,083 to 48,063 entries (ε = 0.2). After the shared-feature-based view clustering, the number of CFCMs was reduced from 5.5 to 2.4 per object (d = 15).

Fig. 27. Performance evaluation according to various contexts: (a) performance when adding higher contexts one by one (C1: L1M1+L2M1, C2: C1+L1M3, C3: C2+L1M2, C4: C3+L2M3, C5: C4+Scene; vertical axis: DR and FAR rates, 0–1), (b) the effect of each individual context combined with C1 (C1, C1+L1M3, C1+L1M2, C1+L2M3, C1+Scene; vertical axis: DR and FAR rates, 0–1). Notations are indicated in the text.

Fig. 28. Scene interpretation results: (a) without scene context, (b) with scene context.

6.3. False alarm reduction by saliency-based pose estimation

We evaluate the influence of the proposed saliency-based perceptual distance on the pose calculation between model objects and images. As performance measures, we use the detection rate (DR: correct name and correct pose) and the false alarm rate (FAR: incorrect name or incorrect pose). Fig. 26 shows the power of the proposed perceptual distance to reduce false alarms generated by incorrect poses. In this test, we use the complete inference model incorporating scene, object, and part context; the white arrows indicate pose estimation using the Euclidean distance and the yellow arrows indicate the results of using the proposed distance. Note that repeated parts can cause incorrect poses, but our method can estimate object pose using salient parts (note the air cleaner at the bottom of Fig. 26).
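As a small illustration of the DR/FAR measure (the field names below are our own assumption), each test object image can be scored as follows: a reported object with the correct name and a correct pose counts toward the detection rate, while a reported object with a wrong name or a wrong pose counts toward the false alarm rate, and a missed object counts toward neither.

def detection_and_false_alarm_rate(results):
    # results: list of dicts with boolean 'reported', 'name_correct', 'pose_correct'
    n = len(results)
    dr = sum(r["reported"] and r["name_correct"] and r["pose_correct"]
             for r in results) / n
    far = sum(r["reported"] and not (r["name_correct"] and r["pose_correct"])
              for r in results) / n
    return dr, far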

6.4. Overall performance evaluation

In the second evaluation, we validate the power of the proposed context-based inference scheme for false alarm removal. The previous results show only a partial effect of scene context. There are four contexts: part context (L1M3: neighboring messages), top-down object context (L1M2), object context from neighboring objects (L2M3), and scene context (L2M2, L3M2). The basic scene interpretation block is C1, which consists of part evidence (L1M1) and the bottom-up message for objects (L2M1). We call this the baseline method, which uses almost no context; it is equivalent to SIFT with Hough-transform grouping using a very small bin threshold [10]. In the first test, we evaluate the proposed system by adding the four kinds of messages one by one. Fig. 27a shows the DR and FAR results. The basic scene interpretation block produces 26% false alarms for the 645 test object images. However, by adding higher contexts to this basic block, the FAR is reduced to 0.15% (only 1 false alarm) with full contexts. In the next test, we check the contribution of each context to false alarm reduction. We conduct tests similar to the previous ones, but combine each context with the basic inference block (C1) separately. Fig. 27b summarizes the evaluation results. The contributions are ordered as part context > top-down object context > scene context > object context from neighboring objects. Fig. 28 shows real scene interpretation results without and with scene context. The proposed system removes false alarms effectively and provides topological place information.

Fig. 29. Various scene interpretation results in an indoor environment.

Fig. 29 shows examples of scene interpretation in an indoor environment. The proposed method provides object and place information simultaneously. As can be seen, the related information is extracted stably in various places; in this test, all the places are recognized correctly. Consequently, the system can be used in vision-based guidance for visitors or in mobile robot applications. The average inference time is 50 s on an AMD 4800+ machine in a MATLAB environment. Most of the inference time is spent on database search (over 35 s); the core inference itself is fast.

7. Conclusions and discussion

We have presented a collaborative place, object, and part recognition system that uses spatial and hierarchical contexts in an HGM. Our HGM has a three-layered structure composed of place, object, and part layers. The object layer and part layer have a pose-encoded MRF model to reflect the spatial context, and these layers are indirectly linked to reflect the hierarchical context. We train the graph using piecewise learning for tractability, and we incorporate a robust and scalable object representation scheme to handle visual variations. For successful inference (simultaneous graph structure estimation and marginal estimation), we adopt a pseudo-likelihood strategy and MM-SMC. Because of the data-driven hypothesis generation and contextual-influence-based pruning, we can produce accurate inference. In addition, we have presented a novel visual distance measure based on the concept of saliency, which yields correct point correspondences for estimating object poses. We have validated the false alarm reduction by applying the system to real scene data; the proposed system produces only one false alarm for 645 test objects.

In the future, we will investigate upgrading the current scene interpretation from the identification level to the categorization level based on the HGM.

Acknowledgment

This research was partially supported by the Korean Ministry of Science and Technology for the National Research Laboratory Program (Grant No. M1-0302-00-0064), by MIC & IITA through the Korean IT leading R&D support program, and by Microsoft Research Asia.

References

[1] A. Torralba, K. Murphy, W. Freeman, M. Rubin, Context-based vision system for place and object recognition, in: IEEE International Conference on Computer Vision, 2003, pp. 273–280.
[2] H. Niemann, G.F. Sagerer, S. Schroder, F. Kummert, ERNEST: a semantic network system for pattern understanding, IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (9) (1990) 883–905.
[3] M.I. Jordan, Learning in Graphical Models, MIT Press, Cambridge, MA, 1999.
[4] M. Bar, Visual objects in context, Nature Reviews Neuroscience 5 (2004) 617–629.
[5] M. Bar, S. Ullman, Spatial context in recognition, Perception 25 (1996) 324–352.
[6] J. Vogel, B. Schiele, Natural scene retrieval based on a semantic modeling step, in: International Conference on Image and Video Retrieval, 2004.
[7] L. Fei-Fei, P. Perona, A Bayesian hierarchical model for learning natural scene categories, in: IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[8] C. Schmid, R. Mohr, C. Bauckhage, Evaluation of interest point detectors, International Journal of Computer Vision 37 (2) (2000) 151–172.
[9] K. Mikolajczyk, C. Schmid, Scale and affine invariant interest point detectors, International Journal of Computer Vision 60 (1) (2004) 63–86.
[10] D.G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[11] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2003, pp. 257–263.
[12] S. Belongie, J. Malik, J. Puzicha, Shape matching and object recognition using shape contexts, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (4) (2002) 509–522.
[13] S.Z. Li, Markov Random Field Modeling in Image Analysis, Springer, Berlin, 2001.
[14] S. Kumar, M. Hebert, Multiclass discriminative fields for part-based object detection, in: Snowbird Learning Workshop, Utah, 2004.
[15] M. Fink, P. Perona, Mutual boosting for contextual inference, Neural Information Processing Systems (2003).
[16] A. Torralba, K.P. Murphy, W.T. Freeman, Contextual models for object detection using boosted random fields, Neural Information Processing Systems (2004).
[17] X. He, R.S. Zemel, M.A. Carreira-Perpinan, Multiscale conditional random fields for image labeling, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[18] S. Todorovic, M.C. Nechyba, Interpretation of complex scenes using generative dynamic-structure models, in: CVPR Workshop on Generative-Model Based Vision, 2004.
[19] K. Murphy, A. Torralba, W.T. Freeman, Using the forest to see the trees: a graphical model relating features, objects, and scenes, Neural Information Processing Systems (2003).
[20] S. Kumar, M. Hebert, A hierarchical field framework for unified context-based classification, in: IEEE International Conference on Computer Vision, 2005.
[21] E.B. Sudderth, A. Torralba, W.T. Freeman, A.S. Willsky, Learning hierarchical models of scenes, objects, and parts, in: IEEE International Conference on Computer Vision, 2005.
[22] A.J. Storkey, C.K.I. Williams, Image modeling with position-encoding dynamic trees, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (7) (2003) 859–871.


[23] Z. Tu, X. Chen, A.L. Yuille, S.-C. Zhu, Image parsing: unifying segmentation, detection, and recognition, International Journal of Computer Vision 63 (2) (2005) 113–140.
[24] Z. Khan, T. Balch, F. Dellaert, MCMC-based particle filtering for tracking a variable number of interacting targets, IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (11) (2005) 1805–1819.
[25] B. Epshtein, S. Ullman, Identifying semantically equivalent object fragments, in: IEEE International Conference on Computer Vision, 2005.
[26] P. Carbonetto, N. de Freitas, K. Barnard, A statistical model for general contextual object recognition, in: European Conference on Computer Vision, 2004.
[27] C. Sutton, A. McCallum, Piecewise training of undirected models, in: 21st Conference on Uncertainty in Artificial Intelligence, 2005.
[28] W.T. Freeman, E.C. Pasztor, O.T. Carmichael, Learning low-level vision, International Journal of Computer Vision 40 (1) (2000) 24–57.
[29] S. Kim, I.S. Kweon, Simultaneous classification and visual word selection using entropy-based minimum description length, in: 18th IEEE International Conference on Pattern Recognition, 2006.
[30] A. Vailaya, M.A.T. Figueiredo, A.K. Jain, H.-J. Zhang, Image classification for context-based indexing, IEEE Transactions on Image Processing 10 (1) (2001) 117–130.
[31] G. Csurka, C.R. Dance, L. Fan, J. Willamowski, C. Bray, Visual categorization with bags of keypoints, in: ECCV International Workshop on Statistical Learning in Computer Vision, 2004.
[32] R. Fergus, P. Perona, A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003, pp. 264–271.
[33] P. Moreels, M. Maire, P. Perona, Recognition by probabilistic hypothesis construction, in: Proceedings of the European Conference on Computer Vision (ECCV), 2004, pp. 55–68.
[34] S. Kim, I.S. Kweon, Scalable representation and learning for 3D object recognition using shared feature-based view clustering, Lecture Notes in Computer Science 3852 (2006) 561–570.
[35] E. Murphy-Chutorian, J. Triesch, Shared features for scalable appearance-based object recognition, in: IEEE Workshop on Applications of Computer Vision (WACV), 2005.
[36] D.G. Lowe, Local feature view clustering for 3D object recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001, pp. 682–688.
[37] S. Kim, K.-J. Yoon, I.S. Kweon, Object recognition using generalized robust invariant feature and Gestalt law of proximity and similarity, in: CVPR Workshop: 5th IEEE Workshop on Perceptual Organization in Computer Vision, 2006.
[38] A.J. Storkey, Dynamic trees: a structured variational method giving efficient propagation rules, in: C. Boutilier, M. Goldszmidt (Eds.), Uncertainty in Artificial Intelligence, 2000, pp. 566–573.
[39] W.R. Gilks, S. Richardson, D.J. Spiegelhalter, Markov Chain Monte Carlo in Practice, Chapman and Hall, London, 1996.
[40] J.S. Yedidia, W.T. Freeman, Y. Weiss, Understanding belief propagation and its generalizations, Exploring Artificial Intelligence in the New Millennium (2003) 239–269.
[41] B.-N. Vo, S. Singh, A. Doucet, Sequential Monte Carlo methods for multi-target filtering with random finite sets, IEEE Transactions on Aerospace and Electronic Systems 41 (4) (2005) 1224–1245.
[42] J. Vermaak, A. Doucet, P. Perez, Maintaining multi-modality through mixture tracking, in: IEEE International Conference on Computer Vision, 2003, pp. 1110–1116.
[43] P. Fearnhead, P. Clifford, On-line inference for hidden Markov models via particle filters, Journal of the Royal Statistical Society B 65 (Part 4) (2003) 887–899.
[44] T.S. Lee, D. Mumford, Hierarchical Bayesian inference in the visual cortex, Journal of the Optical Society of America A 20 (7) (2003) 1434–1448.