Simultaneous place and object recognition using collaborative context information


Image and Vision Computing 27 (2009) 824–833


Short communication


Sungho Kim *, In So Kweon
Department of EECS, Korea Advanced Institute of Science and Technology, 373-1, Guseong-dong, Yuseong-gu, Daejeon, Republic of Korea


Article history: Received 28 June 2006; Received in revised form 12 July 2008; Accepted 25 July 2008

Keywords: Place recognition; Object recognition; Bidirectional context; Disambiguation

doi:10.1016/j.imavis.2008.07.010
* Corresponding author. Tel.: +82 42 869 5465. E-mail addresses: [email protected] (S. Kim), [email protected] (I.S. Kweon).

In this paper, we present a practical place and object recognition method for guiding visitors in building environments. Due to motion blur or camera noise, places or objects can be ambiguous. The first key contribution of this work is the modeling of bidirectional interaction between places and objects for simultaneous reinforcement. The second key contribution is the unification of visual context, including scene context, object context, and temporal context. The last key contribution is a practical demonstration of the proposed system for visitors in a large-scale building environment.

© 2008 Elsevier B.V. All rights reserved.

1. Introduction

Let us imagine that a visitor is looking around a complex building. He might need a guide to get place and object information. This can be realized by using a wearable computer and recent computer vision technology. A web camera on the head receives video data, and a wearable computer processes the data to provide place and object recognition results to the person in the form of image and sound in a head-mounted display (HMD). Visitors with such human computer interaction (HCI) devices can get information on objects and find specific places quickly.

Although general-purpose place and object categorization is not possible with current state-of-the-art vision technology, recognition or identification of places and objects in certain environments is realizable because of the development of robust local features [7,8] and strong classifiers such as SVM and AdaBoost [17]. However, there are several sources of ambiguity caused by motion blur, camera noise, and environmental similarity. Fig. 9(a) shows an example of ambiguous objects due to blurring, and Fig. 1 shows an example of place ambiguity caused by similar environments. Recently, Torralba et al. and Murphy et al. proposed context-based place and object recognition [9,14]. In [14], Torralba et al. utilized gist information from the whole scene; this gist provides a strong prior on object labels and positions. In [9], Murphy et al. developed a tree-structured scene and object recognition method by incorporating gist and boosting information. These approaches attempted to solve the ambiguity of objects using scene information (Fig. 9(a)).

However, no previous work has tried to disambiguate place labels explicitly. Only Torralba et al. incorporated temporal context, modeled as a hidden Markov model (HMM) [14]. In this paper, we focus on a disambiguation method for simultaneous place and object recognition using bidirectional contextual interaction between places and objects [1]. The human visual system (HVS) can recognize a place immediately using a small amount of spatial information. If the place is ambiguous, the HVS can discriminate the place using object information in the scene (see the indicated red regions in Fig. 1). Motivated by this bidirectional interaction, we present a more robust place and object recognition method.

2. Place recognition in video

2.1. Graphical model-based formulation

Conventionally, place labels in video sequences can be estimated by the well-known HMM [14]. We extend the HMM by incorporating the bidirectional context of objects. The extension is best understood through the graphical model, specifically the Bayesian network shown in Fig. 2. A place node at time t is affected by three kinds of information: the measurement message (likelihood) from an image, the top-down message from objects, and the temporal message from the previous state. Let Q_t \in \{1, 2, \ldots, N_p\} represent the place label at time t, z^G_t the whole-image features, \tilde{O}_t the object label vector, and T(q', q) the place transition matrix. The Bayesian formula for this graphical model can be represented by the following equation:

Fig. 1. Ambiguity of places by similar environments (which floor are we at?). This can be disambiguated by recognizing specific objects (pictures on the wall).


p(Q_t = q \mid z^G_{1:t}, \tilde{O}_{1:t}) \propto p(z^G_t \mid Q_t = q)\, p(Q_t = q \mid \tilde{O}_{1:t})\, p(Q_t = q \mid z^G_{1:t-1}, \tilde{O}_{1:t-1}),   (1)

where p(Q_t = q \mid z^G_{1:t-1}, \tilde{O}_{1:t-1}) = \sum_{q'} T(q', q)\, p(Q_{t-1} = q' \mid z^G_{1:t-1}, \tilde{O}_{1:t-1}).

p(z^G_t \mid Q_t = q) represents the bottom-up message (measurement) from the whole image (M1), p(Q_t = q \mid \tilde{O}_t) represents the top-down message coming from the object labels (M2), and p(Q_t = q \mid z^G_{t-1}, \tilde{O}_{t-1}) represents the temporal message from the previous state (M4). M2 is calculated by combining messages from related objects using the scene–object compatibility matrix. The important question raised by Eq. (1) is how to utilize the individual messages. Combining all the messages is not always a good idea in terms of performance and computational complexity. We can think of three kinds of situations: no temporal context is available (e.g., initialization or the kidnapped case), the static context (bottom-up, top-down) is useless due to blurring, or static and temporal context are both available and necessary. Since we do not know the situation a priori, we propose the stochastic place estimation scheme of Eq. (2). \gamma is the probability of reinitialization (mode 1), in which the temporal context is blocked. \alpha is the probability of normal tracking (mode 2), in which the static object context is suppressed, reducing the computational load. Otherwise, mode 3 is activated for place estimation. For each frame, a mode is selected according to these selection probabilities. In this paper, we set the parameters to \gamma = 0.1 and \alpha = 0.8 by manual tuning.

Mode 1 (\gamma):\quad p(Q_t = q \mid z^G_{1:t}, \tilde{O}_{1:t}) \propto M1 \cdot M2,
Mode 2 (\alpha):\quad p(Q_t = q \mid z^G_{1:t}, \tilde{O}_{1:t}) \propto M1 \cdot M4,
Mode 3 (1 - \gamma - \alpha):\quad p(Q_t = q \mid z^G_{1:t}, \tilde{O}_{1:t}) \propto M1 \cdot M2 \cdot M4.   (2)
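To make the mode-selection scheme concrete, the following NumPy sketch draws one of the three modes per frame and combines the corresponding messages. It is an illustration under our own naming (place_belief, the 1e-12 guard), not code from the paper, and it assumes the messages M1, M2, and M4 have already been computed as arrays over the N_p place labels.

import numpy as np

def place_belief(m1, m2, m4, gamma=0.1, alpha=0.8, rng=np.random.default_rng()):
    """Stochastic place estimation of Eq. (2).

    m1, m2, m4: length-N_p arrays holding the measurement (M1), object
    top-down (M2), and temporal (M4) messages for each place label.
    """
    u = rng.random()
    if u < gamma:                 # mode 1: reinitialization, temporal context blocked
        belief = m1 * m2
    elif u < gamma + alpha:       # mode 2: normal tracking, static object context skipped
        belief = m1 * m4
    else:                         # mode 3: all three messages combined
        belief = m1 * m2 * m4
    return belief / (belief.sum() + 1e-12)   # normalize to a distribution over places

With gamma = 0.1 and alpha = 0.8, roughly one frame in ten re-initializes from static context alone, which matches the manually tuned values quoted above.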

2.2. Modeling of measurement (M1)

There are two kinds of place measurement methods depending on the feature type. Torralba et al. proposed a very effective place measurement using a set of filter bank responses [14].


Fig. 2. Conventional graphical model (left) and extended graphical model (right) for place recognition in video. Belief at the place node (center circle) gets information from the image measurement (M1), object information (M2), and the previous message (M4).

In this paper, our place–object recognition system is based on local features [7]. The generalized robust invariant feature (G-RIF) is a generalized version of SIFT [8] that decomposes a scene into convex parts and corner parts, which are described by localized histograms of edge, orientation, and hue. G-RIF improves recognition performance by 20% over SIFT on COIL-100 [7]. For place classification, we utilize the bag-of-keypoints method proposed by Csurka et al. [2]. Although SVM-based classification performs better than naive Bayes [17], we use a naive Bayes classifier to show the effect of context: the incoming messages from objects or previous places can compensate for the rather weak classifier.

Kim and Kweon [5] introduced an entropy-based minimum description length (MDL) criterion for simultaneous classification and visual word learning, as shown in Fig. 3. The original MDL criterion is not suitable here, since we have to find universal visual words for all places while retaining sufficient classification accuracy [15]. If the classification is discriminative, then the entropy of the class a posteriori probability should be low. Therefore, we propose an entropy-based MDL criterion for simultaneous classification and visual word learning by combining MDL with the entropy of the class a posteriori probability, where I denotes training images belonging to only one category, V = \{v_i\} denotes the visual words, N is the size of the training set, and f(V) is the number of parameters of the visual words. Each visual word has parameters \theta_i = \{\mu_i, \sigma_i^2\} (mean, variance). The plain MDL criterion is only useful for class-specific learning of the visual words (low distortion with minimal complexity). Let L = \{(I_i, q_i)\}_{i=1}^{N} be a set of labeled training images, where q_i \in \{1, 2, \ldots, N_P\} is the place label. Then the entropy-based MDL criterion is defined as Eq. (3), where \lambda is the weight of the complexity term:

\hat{V} = \arg\min_{V} \left\{ \sum_{i=1}^{N} H(q \mid I_i, \hat{\Theta}(V)) + \lambda \cdot \frac{f(V)\,\log(N)}{2} \right\},   (3)

where the entropy H is defined as

H(q \mid I_i, \hat{\Theta}(V)) = -\sum_{q=1}^{N_P} p(q \mid I_i, \hat{\Theta}(V)) \log_2 p(q \mid I_i, \hat{\Theta}(V)).   (4)

Fig. 3. Visual word selection procedure for evidence of the place label: control \varepsilon, generate visual words automatically, estimate the class-conditional word distributions, and evaluate the entropy-based MDL criterion.


In Eq. (3), the first term represents the overall entropy for the training image set; the lower the entropy, the better the guaranteed classification accuracy. The second term acts as a penalty on learning: if the number of visual words increases, the model requires more parameters. Therefore, if we minimize Eq. (3), we can find the optimal set of visual words for successful classification with moderate model complexity. Note that we require only one parameter \varepsilon for the minimization, since \varepsilon controls the size of the visual vocabulary automatically. Details are given in Section 2.3.
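As a concrete illustration of Eqs. (3) and (4), the sketch below scores one candidate vocabulary. It is not the authors' code; the function name, the numerical clipping, and the assumption that the class posteriors p(q | I_i, Theta_hat(V)) are already available are ours.

import numpy as np

def entropy_mdl_score(posteriors, num_params, lam=1.0):
    """Entropy-based MDL criterion of Eq. (3).

    posteriors: (N, N_p) array, row i holding p(q | I_i, Theta_hat(V)).
    num_params: f(V), the parameter count of the candidate visual words.
    lam:        lambda, the weight of the complexity term.
    """
    N = posteriors.shape[0]
    p = np.clip(posteriors, 1e-12, 1.0)
    entropy_per_image = -(p * np.log2(p)).sum(axis=1)     # Eq. (4)
    return entropy_per_image.sum() + lam * num_params * np.log(N) / 2.0

Sweeping epsilon, rebuilding the vocabulary, and keeping the epsilon with the lowest score reproduces the selection loop of Fig. 3.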

Since \varepsilon is the distance threshold between the normalized features, we can obtain initial clusters with automatic sizing, and then refined cluster parameters \hat{\Theta}(V) = \{\theta^{(i)}\}_{i=1}^{V} with k-means clustering. After the \varepsilon-NN-based visual word generation, we estimate the class-conditional visual word distribution for the entropy calculation. The Laplacian smoothing-based estimate is given by [2]

p(v_t \mid q_j) = \frac{1 + \sum_{I_i \in q_j} N(t, i)}{V + \sum_{s=1}^{V} \sum_{I_i \in q_j} N(s, i)},   (5)

where N(t, i) represents the number of occurrences of the visual word v_t in the training image I_i, and V represents the size of the visual vocabulary. The physical meaning of this equation is the empirical likelihood of visual words for a given class.
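A one-line NumPy version of Eq. (5) follows; the count-matrix layout is an assumption on our part.

import numpy as np

def class_conditional_word_dist(counts):
    """Laplacian-smoothed p(v_t | q_j) of Eq. (5).

    counts: (N_p, V) array; counts[j, t] is the number of occurrences of
    visual word v_t summed over all training images of place q_j.
    """
    V = counts.shape[1]
    return (1.0 + counts) / (V + counts.sum(axis=1, keepdims=True))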

Finally, we can calculate the likelihood p(I_t \mid Q_t = q, \hat{\Theta}(V)), which is used for the entropy calculation in Eq. (3). This can be stated as Eq. (6), where the image is approximated by a set of local features (I_t \approx z^G_t):

p(I_t \mid Q_t = q_i, \hat{\Theta}(V)) \approx a\, p(z^G_t \mid Q_t = q_i, \hat{\Theta}(V)).   (6)

Using the naive Bayes method (assuming independent features),

p(z^G_t \mid Q_t = q_i, \hat{\Theta}(V)) = \prod_{j} \sum_{k=1}^{V} p(z_j \mid v_k)\, p(v_k \mid q_i),   (7)

where p(z_j \mid v_k) = \exp\{-\|z_j - \mu_k\|^2 / 2\sigma_k^2\}.

From the calculations defined by Eqs. (4)–(7), we can evaluate Eq. (3). We learn the optimal set of visual words by changing \varepsilon and evaluating Eq. (3) iteratively. We found that the optimal \varepsilon is 0.4 for scene visual words. The measurement M1 is the same as Eq. (7).
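The naive Bayes measurement of Eqs. (6) and (7) can then be evaluated as sketched below. The Gaussian word assignment p(z_j | v_k) follows the expression given after Eq. (7), while the log-domain accumulation and the rescaling are our own numerical choices.

import numpy as np

def place_measurement_m1(features, word_means, word_vars, word_dist):
    """Naive Bayes place likelihood of Eq. (7), used as the measurement M1.

    features:   (F, d) local descriptors z_j extracted from the current frame.
    word_means: (V, d) visual word means mu_k.
    word_vars:  (V,)   visual word variances sigma_k^2.
    word_dist:  (N_p, V) class-conditional distribution p(v_k | q_i) from Eq. (5).
    """
    d2 = ((features[:, None, :] - word_means[None, :, :]) ** 2).sum(-1)   # (F, V)
    p_z_given_v = np.exp(-d2 / (2.0 * word_vars[None, :]))                # p(z_j | v_k)
    per_feature = p_z_given_v @ word_dist.T                               # sum_k p(z_j|v_k) p(v_k|q_i)
    log_lik = np.log(per_feature + 1e-300).sum(axis=0)                    # product over features, in logs
    return np.exp(log_lik - log_lik.max())                                # unnormalized M1 over places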

2.3. Modeling of object message (M2)

Direct computation of object messages from multiple objects is not easy. If we use a graphical model [3,18], we can estimate approximate messages. Since the number N_O of objects and the distribution of object nodes are given, the incoming message M2 to a place node is expressed as

M2(Q_t = q) = p(Q_t = q \mid \tilde{O}_t) = \prod_{i=1}^{N_O} p(Q_t = q \mid O^i_t),   (8)

where p(Q_t = q \mid O^i_t) \propto \max_k \{\psi(q, O^i_t(k))\, p(O^i_t(k))\}.

O^i_t(k) is a multi-view hypothesis for the 3D object O^i_t, where k is the multi-view index, and p(O^i_t(k)) is the estimated probability of that object hypothesis. Details of the computation are introduced in Section 3.2.

Fig. 4. Compatibility table of places and objects.

\psi(q, O^i_t(k)) is the compatibility matrix of the place label and object label. It is estimated by counting co-occurrences in place-labeled object data. We add a Dirichlet smoothing prior to the count matrix so that we do not assign zero likelihood to pairs that do not appear in the training data. Fig. 4 shows the resulting probability look-up table for place–object pairs, given the labeled images (12 places with 112 objects). Black represents strong compatibility between place and object. Eq. (8) is an approximated version of belief propagation; the max-product rule is used instead of the sum-product rule for increased estimation accuracy. Physically, the individual maximum message p(Q_t = q \mid O^i_t) from each object is combined to generate the object-to-scene message M2.
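A rough illustration of Eq. (8) and of the Dirichlet-smoothed compatibility estimate described above is given next; the array layouts, the add-one prior, the final normalization, and the function names are our assumptions, not the paper's implementation.

import numpy as np

def learn_compatibility(cooccurrence, prior=1.0):
    """psi(q, o): place-object compatibility from co-occurrence counts,
    with an additive (Dirichlet) smoothing prior so unseen pairs keep
    non-zero likelihood."""
    counts = cooccurrence + prior                     # (N_p, N_o)
    return counts / counts.sum(axis=1, keepdims=True)

def object_to_place_message(psi, object_hypotheses):
    """Top-down message M2 of Eq. (8) using the max-product rule.

    psi:               (N_p, N_o) compatibility matrix.
    object_hypotheses: one list per detected object; each entry holds
                       (object_id, prob) multi-view hypotheses O_t^i(k).
    """
    m2 = np.ones(psi.shape[0])
    for hypotheses in object_hypotheses:
        per_place = np.max([psi[:, obj] * prob for obj, prob in hypotheses], axis=0)
        m2 *= per_place                               # product over independent objects
    return m2 / m2.sum()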

2.4. Modeling of temporal message (M4)

The computation of the temporal message from the previous place to the probable current place is defined in Eq. (9). It is the same equation used in the HMM [14]. The place transition matrix T(q', q) is learned by counting frequencies along the physical path. This term prevents abrupt jumps between places during recognition.

M4(Q_t = q) = p(Q_t = q \mid z^G_{t-1}, \tilde{O}_{t-1}) = \sum_{q'} T(q', q)\, p(Q_{t-1} = q' \mid z^G_{t-1}, \tilde{O}_{t-1}).   (9)
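The temporal message and the frequency-counted transition matrix can be written as follows; the add-one smoothing of the counts and the function names are illustrative additions of ours.

import numpy as np

def learn_transitions(place_sequence, num_places, prior=1.0):
    """Estimate T(q', q) by counting transitions along the physical path."""
    T = np.full((num_places, num_places), prior)
    for q_prev, q in zip(place_sequence[:-1], place_sequence[1:]):
        T[q_prev, q] += 1.0
    return T / T.sum(axis=1, keepdims=True)

def temporal_place_message(T, prev_posterior):
    """Temporal message M4 of Eq. (9): propagate the previous place
    posterior through the transition matrix (T[q_prev, q] = T(q', q))."""
    return prev_posterior @ T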

3. Object recognition in video

3.1. Graphical model-based formulation

In general, an object node in video can get information from the measurement (M1), the message from the scene (M2), and the information from the previous state (M4), as shown in Fig. 5(c), which is a combined version of (a) and (b). Murphy et al. proposed a mathematical framework for combining M1 and M2 with a tree-structured graphical model as in Fig. 5(a) [9]. Vermaak et al. proposed a method for combining M1 and M4 to track multiple objects as in Fig. 5(b) [16].

To our knowledge, this is the first attempt to unify these messages within a graphical model. We assume independent objects for a simple derivation (see [4] for interacting objects). This is reasonable since the objects are conditioned on a scene. Therefore, we consider only one object for a simple mathematical formulation. Let X_t = (O_t, \theta_t) represent a hybrid state composed of an object label and its pose at time t, where the pose is the similarity transformation of an object view. According to the derivation in [12], the complex object probability, given a measurement and place, can be approximated by particles (Monte Carlo) as

p(X_t \mid z_{1:t}, Q_{1:t}) \approx \sum_{i=1}^{N} w^i_t\, \delta(X_t - X^i_t),

where w^i_t \propto w^i_{t-1}\, \frac{p(z_t \mid X^i_t)\, p(X^i_t \mid Q_t)\, p(X^i_t \mid X^i_{t-1})}{g(X^i_t \mid X^i_{t-1}, z_t)}.   (10)

As seen in Eq. (10), the weight is updated by importance sampling, where p(z_t \mid X^i_t) represents the measurement (M1), p(X^i_t \mid Q_t) represents the scene context (M2), and p(X^i_t \mid X^i_{t-1}) represents the temporal context (M4) from the previous state.
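A compact form of the weight update in Eq. (10) is sketched below, assuming the four factors have already been evaluated per particle; the names and the numerical guard are ours.

import numpy as np

def update_particle_weights(prev_w, meas, place_msg, motion_prior, proposal):
    """Importance-sampling weight update of Eq. (10).

    All arguments are length-N arrays (one entry per particle):
    meas         p(z_t | X_t^i)            measurement (M1)
    place_msg    p(X_t^i | Q_t)            scene context (M2)
    motion_prior p(X_t^i | X_{t-1}^i)      temporal context (M4)
    proposal     g(X_t^i | X_{t-1}^i, z_t) density each particle was drawn from
    """
    w = prev_w * meas * place_msg * motion_prior / np.maximum(proposal, 1e-300)
    return w / w.sum()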



Fig. 5. Graphical models for object recognition in video. In our combined model (c), belief at the object node (center circle) gets information from the image measurement (M1), place information (M2), and the previous message (M4).


Fig. 6. Scalable 3D object representation scheme. The local appearance codebook is shared between objects, part parameters are shared in a CFCM, and multiple views are clustered to a CFCM.


The importance (or proposal) function g(X^i_t \mid X^i_{t-1}, z_t) is defined in Eq. (11), in almost the same form as introduced in [11]. The performance of the particle filter depends on the modeling of the proposal function g. If we define the proposal using only the prior motion, the system cannot cope with dynamic object appearances and disappearances. Therefore, we utilize a mixture proposal for our problem: if g_static denotes the proposal defined by static object recognition and g_temp denotes the proposal defined by the conventional prior motion, then the final proposal is given by Eq. (11). Similar to place recognition, we propose three kinds of sample generation modes since we do not know the situation a priori: reinitialization (r), where \beta = 0; normal tracking (a), where \beta = 1; and hybrid tracking (1 - r - a), where 0 < \beta < 1. If the temporal context is unavailable, we generate samples from static object recognition only. In normal tracking mode, we use the conventional proposal function. If static and temporal context are both available, we draw samples from the hybrid (mixture) density.

g(X^i_t \mid X^i_{t-1}, z_t) = (1 - \beta)\, g_{static}(X^i_t \mid z_t) + \beta\, g_{temp}(X^i_t \mid X^i_{t-1}).   (11)
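The mixture proposal of Eqs. (11) and (12) can be sampled per particle as sketched below. The object pose is assumed to be a small NumPy array, and the motion noise scale, the detection list format, and the function names are illustrative assumptions rather than the paper's implementation.

import numpy as np

def sample_mixture_proposal(particles, static_detections, beta,
                            motion_sigma=0.05, rng=np.random.default_rng()):
    """Draw particles from g = (1 - beta) * g_static + beta * g_temp (Eq. (11)).

    particles:         list of (object_id, pose) states from time t-1.
    static_detections: list of (object_id, pose) hypotheses produced by the
                       static recognizer on the current frame (g_static).
    """
    new_particles = []
    for obj_id, pose in particles:
        if rng.random() < beta:
            # g_temp (Eq. (12)): keep the object label, diffuse the pose
            # with Gaussian prior motion u_t.
            noisy_pose = pose + rng.normal(scale=motion_sigma, size=np.shape(pose))
            new_particles.append((obj_id, noisy_pose))
        else:
            # g_static: re-seed the particle from a detection on the current frame.
            new_particles.append(static_detections[rng.integers(len(static_detections))])
    return new_particles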


Fig. 7. Hypothesis and test (verification)-based object recognition procedures.

3.2. Modeling of the proposal function g(X^i_t \mid X^i_{t-1}, z_t)

The proposal function in Eq. (11) consists of temporal context (prior motion) from the previous state and static context from the input image. The temporal context is modeled as Eq. (12), where we assume that object labels and pose are independent.

g_{temp}(X^i_t \mid X^i_{t-1}) = p_{temp}(O^{(i)}_t \mid O^{(i)}_{t-1})\, p_{temp}(\theta^{(i)}_t \mid \theta^{(i)}_{t-1}),   (12)

where O^{(i)}_t = O^{(i)}_{t-1} for t > 0 and \theta^{(i)}_t = \theta^{(i)}_{t-1} + u_t, with u_t a Gaussian prior motion. The temporal context simply corresponds to the normal tracking state.


The static proposal function g_{static}(X^i_t \mid z_t) is defined through a Hough transformation using corresponding local features (G-RIF) between image and model, exploiting the local pose information carried by the features. We use the scalable 3D object representation and recognition scheme proposed in [6], based on shared features and view clustering. Fig. 6 shows the scalable 3D object representation scheme for local feature-based object recognition (simultaneous labeling and pose estimation). The bottom table shows a library of parts with the corresponding appearances. A local feature used in this work contains an appearance vector and local pose information (region size, dominant orientation, part location relative to the reference frame); the appearance of an individual part is described by G-RIF. The appearance codebook is generated by clustering a set of local features extracted from training images. Each training image is represented by a CFCM, in which a part has an index (ID) into the appearance codebook (feature sharing in a CFCM) and local part pose information. CFCMs that are related by a similarity transformation are then merged into a single CFCM (multi-view sharing).

Fig. 8. Multiple object recognition procedures using the hypothesis-test framework.

In the proposed 3D object representation scheme, we use sharing-based redundancy reduction strategies. Parts belonging to an object share an object frame, so the parameter complexity is a linear function of the number of parts. Because the training images comprise many views and objects, there exist redundant parts and views. We can reduce these redundancies by applying a sharing (clustering) concept to both parts and views. The solid arrow in Fig. 6 indicates the learned, view-clustered CFCM.

Through learning, any 3D object can be represented by a set of view-clustered CFCMs. Each learned CFCM contains object parts that have part pose and link indices to the part library (appearance). Likewise, each element in the library contains all the links to the parts in the CFCMs. Note that the local appearance codebook is shared between objects and each CFCM contains part poses conditioned on an object frame.


The information of part pose is available in [13,8] and is important for bottom-up object recognition and verification. We can use this fact to generate hypotheses during object recognition.

We can fully utilize the scalable object representation built by the shared feature-based view clustering method in object recognition by using the well-known hypothesis and verification framework. However, we modify it to recognize multiple objects in the proposed object representation scheme.

If S represents a set of scene features, D represents a set of database entries (shared feature library plus CFCMs), and H represents the hypothesized CFCMs that best describe the scene, then the object recognition problem can be formulated in mixture form (the first line of Eq. (13), assuming multiple objects in a scene). \pi_m is the mixture weight of object m, estimated on-line from the set of CFCMs belonging to m, and \theta_m is the optimally transformed CFCM for object m. If we assume uniform priors, the equation reduces to the second line of Eq. (13). We select the best hypothesis \theta_m, which has the maximum conditional probability p_m(S_m \mid \theta^{(i)}_m, D).

p(H \mid S, D) = \sum_{m=1}^{M} \pi_m\, p_m(\theta_m \mid S_m, D) \propto \sum_{m=1}^{M} \pi_m\, p_m(S_m \mid \theta_m, D),   (13)

where p_m(S_m \mid \theta_m, D) = \max_{i \in I_m} \{p_m(S_m \mid \theta^{(i)}_m, D)\} and \sum_{m=1}^{M} \pi_m = 1. We can model p_m(S_m \mid \theta^{(i)}_m, D) by a Gaussian noise model of appearance and pose using Eq. (14). We assume that the appearance and pose of each part are independent; in addition, since the features in a CFCM are conditioned on a common frame, they can be handled independently. y_{app} is the shared feature closest to the scene feature x_{app}, y_{loc} is the position of a part hypothesized by \theta^{(i)}_m, and \sigma_{app} and \sigma_{loc} are estimated from training data.

Fig. 9. (a) Object recognition using only object-related features (images from [1]). (b) Application of bidirectional interaction between places and objects. Dotted arrows represent object-to-place messages to disambiguate the place; solid arrows represent place-to-object messages to disambiguate the blurred objects (dryer, drill).

Note that we can obtain the p(O^i_t(k)) of Section 2.3 from the equation

p_m(S_m \mid \theta^{(i)}_m, D) = \prod_{x \in S_m} p_{app}(x \mid \theta^{(i)}_m, D)\, p_{pose}(x \mid \theta^{(i)}_m, D),   (14)

where p_{app}(x \mid \theta^{(i)}_m, D) \propto \exp(-\|x_{app} - y_{app}\|^2 / \sigma_{app}^2) and p_{pose}(x \mid \theta^{(i)}_m, D) \propto \exp(-\|x_{loc} - y_{loc}\|^2 / \sigma_{loc}^2).
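Eq. (14) reduces to a product of Gaussian terms over the matched parts. The sketch below scores one transformed CFCM hypothesis, with the matrix shapes and function name assumed by us.

import numpy as np

def hypothesis_score(scene_app, scene_loc, model_app, model_loc,
                     sigma_app, sigma_loc):
    """Appearance/pose likelihood of one CFCM hypothesis, Eq. (14).

    scene_app/scene_loc: (F, d)/(F, 2) matched scene descriptors and positions.
    model_app/model_loc: the corresponding shared-library descriptors and the
                         part positions predicted by the transformed CFCM.
    """
    p_app = np.exp(-((scene_app - model_app) ** 2).sum(1) / sigma_app ** 2)
    p_loc = np.exp(-((scene_loc - model_loc) ** 2).sum(1) / sigma_loc ** 2)
    return float(np.prod(p_app * p_loc))   # product over independent parts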

Fig. 7 summarizes the object recognition procedure graphically. We can obtain all possible matching pairs by an \varepsilon-NN (nearest neighbor) search in the feature library. Because each feature library entry contains all possible links to CFCMs, we can form all the matching pairs between input image features and parts in CFCMs. From these, hypotheses are generated by the Hough transform in the CFCM ID, scale (11 bins), and orientation (8 bins) space [8], and grouped by object ID. Colors in Fig. 8(b) represent object groups. Each Hough particle (hypothesis) in Fig. 8(b) is shown in Fig. 8(c). Then we decide whether to accept or reject the hypothesized object based on the bin size with an optimal threshold [10]. Finally, we select the optimal hypotheses, using Eq. (14), that best match the object features in a scene. Fine object poses are refined by the similarity transformation between the input image features and the selected CFCM.
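The Hough-voting and accept/reject step can be sketched as below; the match-tuple format, the vote threshold, and the per-object bookkeeping are our own simplifications of the procedure described above.

from collections import defaultdict

def hough_hypotheses(matches, threshold=3):
    """Group epsilon-NN matches into CFCM hypotheses by coarse Hough voting
    and keep, per object ID, the hypothesis whose bin collects enough votes.

    matches: iterable of (object_id, cfcm_id, scale_bin, orientation_bin)
             tuples produced by the search in the shared feature library.
    """
    bins = defaultdict(int)
    for obj_id, cfcm_id, s_bin, o_bin in matches:
        bins[(obj_id, cfcm_id, s_bin, o_bin)] += 1       # one vote per matched feature
    best = {}
    for (obj_id, cfcm_id, _, _), votes in bins.items():
        if votes >= threshold and votes > best.get(obj_id, (None, 0))[1]:
            best[obj_id] = (cfcm_id, votes)              # accepted hypothesis per object
    return best

The surviving hypotheses would then be re-scored with Eq. (14) and refined by a similarity transformation, as described above.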

3.3. Modeling of measurement M1

Given the sampled object parameters X^i_t = (O^i_t, \theta^i_t), we use a color histogram in normalized r–g space as the measurement M1 = p(z_t \mid X^i_t). The model color histogram is acquired through the proposal function explained previously, since the recognized object provides an object label together with object boundary information. We use the \chi^2 distance in the kernel recipe for the measurements [17].
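A possible realization of this measurement follows, with the histogram bin count, the kernel bandwidth, and the helper names chosen by us; only the normalized r-g representation and the chi-square comparison follow the text.

import numpy as np

def rg_histogram(pixels, bins=16):
    """2-D histogram over normalized r = R/(R+G+B) and g = G/(R+G+B)."""
    rgb = pixels.astype(float) + 1e-12         # (P, 3) array of RGB pixels
    s = rgb.sum(axis=1)
    r, g = rgb[:, 0] / s, rgb[:, 1] / s
    hist, _, _ = np.histogram2d(r, g, bins=bins, range=[[0, 1], [0, 1]])
    return hist.ravel()

def color_measurement_m1(obs_hist, model_hist, bandwidth=0.1):
    """M1 = p(z_t | X_t^i): chi-square distance between the observed and the
    model r-g histograms, mapped through an exponential kernel [17]."""
    p = obs_hist / obs_hist.sum()
    q = model_hist / model_hist.sum()
    chi2 = 0.5 * ((p - q) ** 2 / (p + q + 1e-12)).sum()
    return float(np.exp(-chi2 / bandwidth))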


Fig. 10. (a) Place recognition using only measurement (M1). (b) Both measurement and message from objects (M2).

Fig. 11. Composition of training and test database.


Table 1
The composition of training images and test images

Role       Scene (640 × 480)              Object
           # of places    # of scenes     # of objects    # of views
Training   10             120             80              209
Test       10             7208            80              12,315


Fig. 12. Place recognition results using (a) measurement, (b) temporal context, (c) static context, and (d) unified context for the test sequences.


3.4. Modeling of place message M2

The message from the place to a specific object is calculated using Eq. (15). We again use approximate belief propagation with the max-product rule.

M2 = p(O^i_t, \theta^i_t \mid Q_t) \propto \max_q \{\psi(O^i_t, Q_t = q)\, p(Q_t = q)\}.   (15)

3.5. Modeling of temporal message M4

The temporal message M4 = p(X^i_t \mid X^i_{t-1}) in Eq. (11) is the same as the proposal function in Eq. (12). If an object particle is sampled from the temporal context only, its weight is simply the product of the measurement and the message from the place, due to the cancellation in Eq. (11).

4. Experimental results

4.1. Validation of bidirectional reinforcement property

First, we test the bidirectional place and object recognition method on ambiguous examples. If we use object information only, as in Fig. 9(a), we fail to discriminate the two objects since their local features are similar. However, if we utilize the bidirectional interaction method, we can discriminate those objects simultaneously, as in Fig. 9(b).

In the second experiment, we prepare place images taken in front of the elevator on each floor (see the example pair in Fig. 1). Given a test elevator scene as in Fig. 10(a), the measurement alone recognizes the wrong place. However, if the same test scene is processed using bidirectional place–object recognition, we obtain the correct recognition, as in Fig. 10(b). In the diagram, the center graph shows the measurement message, the right graph shows the message from the objects, and the left graph represents the combined message for place recognition (see Eq. (2), Mode 1).

4.2. Large scale experiment for building guidance

We validate the proposed place–object recognition system in the Department of Electrical Engineering building, guiding visitors from the first floor to the third floor. The training data statistics and related images are summarized in Table 1 and Fig. 11. We used 120 images captured at arbitrary poses and manually labeled and segmented 80 objects in 10 places. Note that the images are very blurry. After learning, the number of object features was reduced from 42,433 to 30,732 and the number of scene features from 106,119 to 62,610 (equal to the size of the visual vocabulary). The number of learned views was reduced from 260 to 209 (2.61 views/object).

In the first experiment, we evaluate the performance of place recognition. As noted above, there are four kinds of message combinations: measurement only (M1), temporal context (HMM: M1 + M4), static context only (M1 + M2), and unified context (proposed: M1 + M2 + M4). Fig. 12 shows the evaluation results for every 100th frame of the 7208 test frames. As can be seen, the unified context-based place recognition outperforms the others.



Fig. 13. (a) Evaluation of object recognition in terms of detection rate and relative processing time. (b) Results with only static context (top) and with the proposed method (bottom).

Fig. 14. Examples of bidirectional place and object recognition results in the lobby and washstand scenes.


Fig. 13 summarizes the overall evaluation for static context only, full context (static + temporal), and temporal context only. We checked both the detection rate of objects and the relative processing time. The full context-based method shows a better detection rate than static context only, with less processing time. Fig. 14 shows partial examples of bidirectional place and object recognition sequences. Note that the proposed system, using static and temporal context, shows successful place and object recognition in video under temporal occlusions and with a dynamic number of objects.

5. Conclusions and discussion

In this paper, we presented modeling methods for several kinds of visual context, focusing on bidirectional interaction between place and object recognition in video. We first modeled object-to-place messages for the disambiguation of places using object information; the unified context-based approach shows improved place recognition. We also modeled the place message together with the measurement and temporal context to recognize objects within an importance sampling-based framework. This structure of object recognition in video can disambiguate objects with reduced computational complexity. We demonstrated the synergy of the bidirectional interaction-based place–object recognition system in an indoor environment for guiding visitors. The system can be applied directly to various areas of interaction between humans and computers. The proposed method can be upgraded to the category level to provide higher level place and object information to humans.

Acknowledgements

This research has been supported by the Korean Ministry of Science and Technology through the National Research Laboratory Program (Grant No. M1-0302-00-0064).

References

[1] M. Bar, Visual objects in context, Nature Reviews Neuroscience 5 (2004) 617–629.
[2] G. Csurka, C. Dance, C. Bray, L. Fan, Visual categorization with bags of keypoints, in: Workshop on Statistical Learning in Computer Vision, 2004.
[3] M.I. Jordan (Ed.), Learning in Graphical Models, MIT Press, Cambridge, MA, 1999.
[4] Z. Khan, T. Balch, F. Dellaert, MCMC-based particle filtering for tracking a variable number of interacting targets, PAMI 27 (11) (2005) 1805–1918.
[5] S. Kim, I.-S. Kweon, Simultaneous classification and visual word selection using entropy-based minimum description length, in: International Conference on Pattern Recognition, 2006, pp. 650–653.
[6] S. Kim, I.S. Kweon, Scalable representation for 3D object recognition using feature sharing and view clustering, Pattern Recognition 41 (11) (2008) 754–773.
[7] S. Kim, K.-J. Yoon, I.S. Kweon, Object recognition using generalized robust invariant feature and Gestalt law of proximity and similarity, in: IEEE CVPR Workshop on Perceptual Organization in Computer Vision, 2006.
[8] D.G. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 60 (2) (2004) 91–110.
[9] K. Murphy, A. Torralba, W.T. Freeman, Using the forest to see the trees: a graphical model relating features, objects, scenes, in: NIPS, 2004.
[10] E. Murphy-Chutorian, J. Triesch, Shared features for scalable appearance-based object recognition, in: WACV, 2005, pp. 16–21.
[11] K. Okuma, A. Taleghani, N. de Freitas, J.J. Little, D.G. Lowe, A boosted particle filter: multitarget detection and tracking, in: ECCV (1), 2004, pp. 28–39.
[12] B. Ristic, S. Arulampalam, N. Gordon, Beyond the Kalman Filter: Particle Filters for Tracking Applications, Artech House, London, 2004.
[13] C. Schmid, R. Mohr, C. Bauckhage, Evaluation of interest point detectors, International Journal of Computer Vision 37 (2) (2000) 151–172.
[14] A. Torralba, K.P. Murphy, W.T. Freeman, M.A. Rubin, Context-based vision system for place and object recognition, in: ICCV (1), 2003, pp. 273–280.
[15] A. Vailaya, M.A.T. Figueiredo, A.K. Jain, H.-J. Zhang, Image classification for content-based indexing, IEEE Transactions on Image Processing 10 (1) (2001) 117–130.
[16] J. Vermaak, A. Doucet, P. Perez, Maintaining multi-modality through mixture tracking, in: ICCV (2), 2003, pp. 1110–1116.
[17] C. Wallraven, B. Caputo, A. Graf, Recognition with local features: the kernel recipe, in: ICCV (1), 2003, pp. 257–264.
[18] J.S. Yedidia, W.T. Freeman, Y. Weiss, Understanding belief propagation and its generalizations, in: Exploring Artificial Intelligence in the New Millennium, 2003, pp. 239–269.