
Pattern Recognition Letters 27 (2006) 811–821

3D target recognition using cooperative feature map binding under Markov Chain Monte Carlo

Sungho Kim *, In-So Kweon

Robotics & Computer Vision Laboratory, Korea Advanced Institute of Science and Technology, 373-1 Guseong-dong, Yuseong-gu, Daejeon, Republic of Korea

Received 10 September 2004; received in revised form 19 September 2005; available online 19 January 2006

* Corresponding author. Tel./fax: +82 42 869 5465. E-mail addresses: [email protected] (S. Kim), [email protected] (I.-S. Kweon).

Communicated by T.K. Ho

Abstract

A robust and effective feature map integration method is presented for infrared (IR) target recognition. Noise in an IR image makes a target recognition system unstable in pose estimation and shape matching. A cooperative feature map binding under computational Gestalt theory shows robust shape matching properties in noisy conditions. The pose of a 3D target is estimated using a Markov Chain Monte Carlo (MCMC) method, a statistical global optimization tool in which the noise-robust shape matching is used. In addition, bottom-up information accelerates the recognition of 3D targets by providing initial values to the MCMC scheme. Experimental results show that cooperative feature map binding by analyzing spatial relationships has a crucial role in robust shape matching, which is statistically optimized using the MCMC framework.

Keywords: 3D infrared target recognition; Robust to noise; Cooperative feature map binding; Meaningful shape matching; MCMC optimization

1. Introduction

The performance of an IR target recognition system for unmanned aerial vehicles largely depends on image quality, target representation, and the matching paradigm. The issue of target representation is how to cope with the geometrical variations caused by the 3D target pose. There are two approaches to this problem, namely, view-based representation and model-based representation. The view-based approach stores all possible target views (Murase and Nayar, 1995). In recent work, each target view is represented as a sum of visual parts (Nair and Aggarwal, 2000; Lowe, 2004). These representations are biologically plausible and suitable for target indexing, but do not provide accurate target information, such as the 3D pose.


The model-based approach represents a 3D target as a 3D computer aided design (CAD) model or voxels, and handles the target pose by controlling the pose parameters of the 3D model (Jain and Dorai, 2000). This representation is suitable for obtaining accurate pose information for artificial IR targets.

The main issue in target matching is how to obtain a correct match between a rendered 3D CAD model and a 2D image in a model-based representation under a noisy environment. There are two kinds of noise: thermal noise in the sensor itself, and atmospheric factors such as humidity and temperature, which affect atmospheric transmittance. The matching should be robust to these noise sources. Fig. 1 shows two kinds of IR images acquired under different humidity and temperature conditions (day and night) at the same site. Note the enormous visual differences in appearance.

There are many descriptor-based matching methods, such as shape context, curvature scale space, and moments

Fig. 1. Examples of real FLIR images taken at (a) day and (b) night.


(Zhang and Lu, 2004). But these methods assume that the target objects are already segmented, which is impractical in a real working environment. One successful target recognition method represents the target as a 3D CAD model and recognizes it by matching either edge magnitudes (Der and Chellappa, 1997) or edge orientations (Olson and Huttenlocher, 1997). However, these approaches are not only unstable under noise, because of single-feature-map-based matching, but also very inefficient, as they must search the full pose space, the scale, and the image region. There is a probabilistic method that handles incomplete data corrupted by noise (Hornegger and Niemann, 2000). This method may be an optimal solution, but is very complex to use. There is also a search space reduction approach using multiple hypotheses from angle cues (Shimshoni and Ponce, 2000). This approach is not so effective, as it relies on a simple bottom-up cue.

In this paper, we use a 3D CAD model-based representation suitable for artificial targets such as cars and buildings. Fig. 2 summarizes the issues and the proposed methods for dealing with them. A novel shape-matching method is proposed, motivated by feature map binding (Treisman, 1998) and computational Gestalt theory (Desolneux et al., 2004), which are human visual perception properties. This matching shows robust properties under noise. The target pose is optimized using Markov Chain Monte Carlo (MCMC) (Dick et al., 2002), a global optimization tool that is known to outperform the genetic algorithm (Doucet et al., 2001). The pose search problem is alleviated by providing bottom-up indexing cues to the MCMC.

Fig. 2. The main issue and the proposed noise-robust scheme in 3D target recognition. (Diagram: the noise sources (thermal noise, humidity, temperature) degrade pose estimation and shape matching; the noise-robust scheme comprises cooperative feature map binding, meaningful shape matching, and MCMC.)

The structure of this paper is as follows. In Section 2, we describe our 2D shape matching method, which is the core component of 3D target recognition. In Section 3, we show how to extend the 2D shape matching to a 3D target recognition system using MCMC, where the initial parameters are estimated from bottom-up inference. We demonstrate the power of our shape matching on various noisy images, and efficient 3D target recognition results using a single image, in Section 4. We conclude in Section 5.

2. Noise-robust 2D shape matching

It is very important, but difficult, to robustly match a 2D shape model (or rendered 3D CAD model) to IR images, since IR images are sensitive to thermal noise, humidity, and temperature, as shown in Fig. 1. (How can one match a 2D roof model to the boxed regions, which show completely different contrast and intensity distributions in a cluttered background?) In this section, we propose a noise-robust shape-matching scheme by incorporating both computational Gestalt theory (Desolneux et al., 2004) and feature integration theory (Treisman, 1998).

2.1. ε-meaningful event

Since IR images are highly noisy, it is reasonable to assume that the intensity distribution of each pixel is random and independent. If a pattern nevertheless appears, then the event is meaningful and significant. Recently, Desolneux et al. (2004) modeled this phenomenon mathematically using the concept of the Helmholtz principle and applied it to the computational Gestalt problem.


Motivated by noisy IR images and this theoretical work, we propose an ε-meaningful shape-matching method to detect meaningful patterns. We briefly summarize the basic concepts of the Helmholtz principle and the ε-meaningful event (Desolneux et al., 2004).

Helmholtz principle: This principle provides a suitable mathematical tool for modeling a computational Gestalt. It assumes that an image has a random distribution of pixel values or orientations. If some pixels break the randomness, then these grouped pixels form a certain pattern, called a Gestalt.

ε-meaningful event: A certain configuration is ε-meaningful if the expected number of its occurrences in an image is less than ε.

When ε = 1, we can say that the event is meaningful. Note that the concept of meaningfulness depends on the expected number of occurrences, not on the probability of occurrence itself. A very small probability of occurrence does not mean that few events occur; for example, an event with probability 10⁻⁴ per pixel is still expected about nine times in a 300 × 300 image, so it is not meaningful. Since an image is quantized into a set of pixels, there is a measurement precision and an image size, and depending on these, an event may or may not be meaningful. If the expected number of occurrences is large, then the event is not meaningful.

2.2. Cooperative feature map binding

What information should be encoded to increase the robustness to image noise during shape matching? Treisman's feature integration theory (FIT) provides biological evidence on how various feature maps are utilized during perception (Treisman, 1998). According to FIT, different properties of the visual input are encoded in separate feature maps, which are combined in perception by integrating the separate feature maps through spatial attention.

Fig. 3. Diagram for ε-meaningful shape matching: x_i represents the position of the shape model template.

This spatial attention makes feature map binding possible. In our case, the attention takes place at each point of the rendered 2D shape model.

Since we deal only with IR images, especially FLIR (forward-looking IR) images, all the available local information is captured in just three independent maps, namely, a pixel intensity map, a gradient magnitude map, and a gradient orientation map, defined in Eqs. (1)-(3). By effectively integrating these maps, we can enhance the robustness of shape matching. More specific modeling is explained in the following section.

Pixel intensity: $u(x, y)$,  (1)

Gradient: $u'(x, y) = \left(\frac{\partial u}{\partial x}, \frac{\partial u}{\partial y}\right)(x, y)$,  (2)

Gradient orientation: $\theta(x, y) = \frac{1}{\|\nabla u(x, y)\|}\left(-\frac{\partial u}{\partial y}, \frac{\partial u}{\partial x}\right)(x, y)$.  (3)
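For concreteness, the two derived maps can be computed as follows; a minimal Python sketch using OpenCV's Sobel operator (Section 2.4 mentions Sobel masks), where the function name is our own choice and the orientation is returned as an angle rather than the unit vector of Eq. (3).

```python
import cv2
import numpy as np

def feature_maps(img: np.ndarray):
    """Gradient magnitude map (GMM) and gradient orientation map (GOM),
    per Eqs. (2)-(3), from a grayscale FLIR image."""
    u = img.astype(np.float64)                      # pixel intensity u(x, y)
    du_dx = cv2.Sobel(u, cv2.CV_64F, 1, 0, ksize=3)
    du_dy = cv2.Sobel(u, cv2.CV_64F, 0, 1, ksize=3)
    gmm = np.hypot(du_dx, du_dy)                    # C(x) = ||u'(x)||
    gom = np.arctan2(du_dx, -du_dy)                 # angle of (-du/dy, du/dx)
    return gmm, gom
```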

2.3. ε-meaningful shape matching by feature map binding

In this section, we present the details of a noise-robust shape-matching method based on the concept of an ε-meaningful event and feature integration theory. We use only the gradient magnitude map (GMM, Eq. (2)) and the gradient orientation map (GOM, Eq. (3)) as independent feature maps, since they are robust to illumination and noise. If we assume that the size of an image is N × N and each pixel is random, then the GMM and GOM are also random. If a priori shape information (edge position and edge orientation) is given, we can measure the structural alignment to this pattern.

As shown in Fig. 3, we can think of a matching at an attended spatial point x_i that satisfies both the image gradient and orientation constraints. If the expected number of such shape matchings is smaller than ε, then the shape-matching event is meaningful.


Fig. 4. Flow diagram of ε-meaningful shape matching (see also Fig. 3): the GMM and GOM are computed from the FLIR image and P(C(x) ≥ μ) is estimated; the a priori 2D shape model (position and orientation per pixel) is slid over the image; Eq. (4) is calculated, the maximal meaningful match is selected, and Eq. (5) is checked.


If the length of the rendered 2D shape is l, then the probability of the event that the values of the GMM, or contrast C(x), are larger than a certain value, and that the orientation differences O(x) between the orientation of the a priori shape model and that of the GOM are within a certain precision, along the shape model, is defined in Eq. (4). We assume that the precision of orientation is 1/8 (within 45°). Since the GMM and GOM at each pixel are independent, and the probability P(C(x_i) ≥ μ) is estimated empirically from the image itself, the left-hand side simplifies to the right-hand side:

$P\left[C(x_1) \ge \mu,\ O(x_1) \le \tfrac{\pi}{4}\right] \cdot P\left[C(x_2) \ge \mu,\ O(x_2) \le \tfrac{\pi}{4}\right] \cdots P\left[C(x_l) \ge \mu,\ O(x_l) \le \tfrac{\pi}{4}\right] = H(x, \mu)^l$,  (4)

where $H(x, \mu) = \frac{1}{8} \cdot \frac{\#\{x \mid C(x) \ge \mu\}}{N^2}$, $C(x) = \|u'_I(x)\|$, and $O(x) = |\theta_I(x) - \theta_M(x)|$, with I for the input image, M for the model, and $x = (x, y)$.

In order to measure a meaningful event in shape matching, we have to compute the expected number of occurrences of the event given a prior and an observed image. This number is called the number of false alarms (NFA), since we initially assume a random distribution. If an event deviates significantly from this assumption, then this event can be regarded as a false alarm. The NFA of shape matching is given in Definition 1.

Definition 1 (Number of false alarms, NFA). Let T be a shape template of length l, and let n × n be the bounding-box size of the shape template (assumed square). Let μ be the minimal contrast and p the orientation precision (p = 1/8). Assume that at least k₀ points match in both contrast and orientation. The number of false alarms of this event is defined by

$\mathrm{NFA}(x, k_0, \mu) = (N - n)^2 \cdot \sum_{k = k_0}^{l} \binom{l}{k} H(x, \mu)^k \left(1 - H(x, \mu)\right)^{l - k}$,  (5)

where $(N - n)^2$ is the number of all possible placements of the shape template within the image.

Definition 2 (ε-meaningful shape matching). We call a matching between an image and a certain model an ε-meaningful shape matching if

$f(x, k_0, \mu) = \mathrm{NFA}(x, k_0, \mu) \le \varepsilon$,  (6)

where $\varepsilon \ll 1$.

Eq. (6) is an important measure for meaningful shape matching: the smaller this value, the more meaningful the shape matching. The maximally meaningful shape matching is selected where Eq. (6) attains its minimum within an image region. In addition, we can select the minimal k₀ and μ that yield a meaningful shape matching. Here is an example of how to set the threshold on the minimal number of matching points k₀. If N = 300, l = 50, n = 20, μ = 10 (or P(C ≥ μ) = 10000/(300 × 300)), and ε = 1, then k₀ is calculated as 7. In this situation, if there are at least 7 matching points, the shape matching is ε-meaningful.
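This worked example can be verified directly from Eq. (5); a minimal Python sketch (helper names are ours) that searches for the smallest k₀ whose NFA drops below ε:

```python
from math import comb

def nfa(N, n, l, k0, H):
    """Number of false alarms of Eq. (5): (N - n)^2 placements times the
    binomial tail probability of >= k0 aligned points out of l."""
    tail = sum(comb(l, k) * H**k * (1 - H)**(l - k) for k in range(k0, l + 1))
    return (N - n) ** 2 * tail

# Worked example from the text: N = 300, l = 50, n = 20,
# P(C >= mu) = 10000/(300*300), orientation precision 1/8, eps = 1.
N, n, l, eps = 300, 20, 50, 1.0
H = (1.0 / 8.0) * 10000 / (300 * 300)   # H(x, mu) from Eq. (4)
k0 = next(k for k in range(l + 1) if nfa(N, n, l, k, H) <= eps)
print(k0)  # -> 7: at least 7 matched points make the match 1-meaningful
```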

2.4. Implementation details of FIT-based ε-meaningful shape matching

The FIT-based meaningful shape matching follows the flow shown in Fig. 4. The 2D shape model is given a priori by rendering the 3D CAD model. The GMM and GOM are calculated independently from an FLIR image using Sobel masks. Then P(C(x_i) ≥ μ) is estimated empirically from the GMM. The minimal value of μ is set to 30 (see Section 2.5). We calculate Eq. (4) by sliding the shape model over the image, select the maximal meaningful match, and finally check Eq. (5) as the selection threshold.
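A minimal Python sketch of this flow under stated assumptions: the template is represented as point offsets with per-point orientations (a representation of our own choosing), the image is assumed square, and the orientation test ignores angle wrap-around for brevity. This is an unoptimized illustration of the Fig. 4 loop, not the authors' implementation.

```python
import numpy as np
from math import comb

def meaningful_match(gmm, gom, model_pts, model_ori,
                     mu=30.0, prec=np.pi / 4, eps=1.0):
    """Slide a 2-D shape model over the image and return the maximally
    meaningful position (NFA, top-left offset), or None if no position
    is eps-meaningful."""
    N = gmm.shape[0]                       # assume an N x N image
    l = len(model_pts)                     # template length l in Eq. (4)
    n = int(model_pts.max()) + 1           # template bounding-box size
    # H(x, mu): orientation precision 1/8 times the empirical P(C >= mu)
    H = (1.0 / 8.0) * float(np.mean(gmm >= mu))
    # Binomial tail of Eq. (5), precomputed for every possible match count
    tail = [sum(comb(l, k) * H**k * (1 - H)**(l - k)
                for k in range(k0, l + 1)) for k0 in range(l + 1)]
    best = None
    for y in range(N - n):
        for x in range(N - n):
            ys, xs = model_pts[:, 0] + y, model_pts[:, 1] + x
            hits = (gmm[ys, xs] >= mu) & \
                   (np.abs(gom[ys, xs] - model_ori) <= prec)
            score = (N - n) ** 2 * tail[int(hits.sum())]   # NFA, Eq. (5)
            if score <= eps and (best is None or score < best[0]):
                best = (score, (y, x))
    return best
```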

2.5. Performance evaluation

We evaluated the performance of the proposed shape matching in three respects: noise robustness, the receiver operating characteristic (ROC) property, and shape matching on real data. In the experiments, we used two datasets: the Fort Carson RSTA data, and real data acquired using an FSI Prism SP camera. The RSTA dataset comprises 35 car scenes with exact shape templates. The methods compared are GMM only, GOM only, and GMM with GOM, corresponding to Der and Chellappa (1997), Olson and Huttenlocher (1997), and the method proposed in this paper, respectively. Although the first two are not implemented exactly as in the authors' original algorithms, the features used are the same.

First, we tested the robustness to Gaussian noise by adding Gaussian noise with standard deviation 8. Fig. 5 shows the shape matching results.

Fig. 5. Shape matching examples on Fort Carson RSTA Data: (a) original FLIR image; (b) GMM only; (c) GOM only and (d) proposed GMM + GOM.


They reveal the robustness of the feature map binding using the computational Gestalt model: binding the GMM with the GOM outperforms single-map-based shape matching.

Second, we evaluated the three methods using an ROC curve.

Fig. 6. Evaluation of feature map-based shape matching using ROC (detection rate versus false positive rate for GMM only, GOM only, and the proposed GMM + GOM).

Detection is tested on the RSTA dataset by changing the gradient magnitude threshold (μ) of the GMM and the orientation precision threshold (p) of the GOM. Fig. 6 shows the improved ROC performance of the proposed feature map binding: the proposed shape matching is better than the single-map-based methods. The optimal thresholds corresponding to the equal error rate (false rejection rate = false detection rate) are 30 in gradient magnitude and 45° in gradient orientation.

Finally, we tested the perceptual shape matching on real images acquired using an FLIR camera (FSI Prism SP). The test images were captured over a 24-h period. Fig. 7 shows several examples of shape matching on the temperature-varying IR sequences. The detection of a roof is very robust over a dynamic temperature range of 14.3–33.4 °C.

3. Sensor-driven MCMC-based 3D target recognition

This section extends the FIT-based ε-meaningful shape matching to the recognition of 3D targets. As discussed in Section 1, it is important to robustly estimate the 3D target pose under noise. Since we use a model-based 3D target representation, we have to find the optimal pose parameters.

Fig. 7. Shape matching results for temperature-varying FLIR sequences. The proposed method is very robust to temperature changes.

Fig. 8. Structure of MCMC for the ATR problem (flow: sensor-driven initial parameter estimation; sample generation by proposal function; render 3D model; shape matching; accept/reject the sample).


If we know an initial target pose or matching points, then a linear solution such as nonstochastic pose optimization may be suitable (Drummond and Cipolla, 2002). However, if we do not know the target ID or target pose in advance, the problem is more difficult. We therefore propose a stochastic optimization, MCMC, which can provide a globally optimal solution in a multidimensional parameter space under noise.

3.1. Parameters to estimate

The targets to be recognized are constructed objects such as buildings, bridges, or factories sensed with an IR camera. We regard 3D automatic target recognition (ATR) as the estimation of parameters such as name (θ_ID) and pose (θ_C: θ_yaw, θ_pitch, θ_roll) relative to camera coordinates in the 3D world, as well as position (θ_P: θ_x, θ_y) and scale or distance (θ_D) in the 2D image (seven dimensions in total). The 3D ATR problem can then be formulated as the Bayesian estimation of P(θ|S), where θ is the parameter set explained above and S denotes the input image. We propose an ATR algorithm composed of two stages, namely parameter initialization (Stage I) and optimization (Stage II), as shown in Algorithm 1. To achieve fast optimization, we insert a parameter initialization stage into the MCMC and make use of spatial attention to combine low-level feature maps, which cooperatively leads to robust target recognition.

3.2. Structure of ATR using sensor-driven MCMC

To find an optimal solution of P(θ|S), a maximum a posteriori (MAP) method is generally used. But it is difficult to obtain a correct posterior over a high-dimensional parameter space (in our case, seven dimensions). We bypass this problem by using a statistical technique, drawing samples with a Markov Chain Monte Carlo (MCMC) technique (Green, 1996). The Monte Carlo (or sampling) method approximates the posterior distribution with weighted particles or samples. A Markov chain means that the transition probability of samples is a function of the most recent sample value only. The theoretical advantage of MCMC is that its samples are guaranteed to asymptotically approximate those from the posterior. The MCMC method is a theoretically well-founded and suitable global optimization tool for combining bottom-up and top-down information, and shows superiority to genetic algorithms and simulated annealing, although there are some analogies to the Monte Carlo method (Doucet et al., 2001).

A sensor-driven MCMC for ATR has the structure shown in Fig. 8. It comprises sensor-driven initial parameter estimation, MCMC sample generation, 3D model rendering, shape matching (see Fig. 3 and Section 2), and decision. Proposal samples generated from the bottom-up process achieve fast optimization, i.e., they reduce the burn-in time.

A bottom-up or sensor-driven process accumulates evidence computed from local structures and estimates initial parameters such as object ID, pose, position, and scale.

Fig. 9. Binding of CEM and HCM in bottom-up process.


Based on this, MCMC samples are generated by a jumping distribution that represents the state-transition probability. From each sample, a 3D shape model is rendered. The final decision, object recognition, is made after iterative sample generation and global shape matching. The decision information is fed back to the bottom-up process for additional object recognition in the same scene. Algorithm 1 summarizes the overall recognition steps; it is a detailed implementation of the Metropolis–Hastings algorithm (Robert and Casella, 1999).

Algorithm 1. ATR based on feature map integration

Stage I: Initialization by bottom-up process
  Step 1: Extract local Zernike moments.
  Step 2: Estimate the likelihood by voting to multiview models.
  Step 3: Sort the candidate parameters θ⁰ = {θ⁰_ID, θ⁰_C, θ⁰_P, θ⁰_D}.

Stage II: Optimization by top-down process
  Step 1: Extract the GMM and GOM.
  Step 2: Set the initial point θ⁰ = {θ⁰_ID, θ⁰_C, θ⁰_P, θ⁰_D} from Stage I.
  Step 3: Optimize the parameters by MCMC sampling with feature map binding:
    For t = 0 to T
      Draw a candidate point θ* from the jumping distribution J_t(θ* | θ^(t−1)).
      Render the 3D CAD model based on θ*.
      Calculate the cost function f(θ*) by focusing on the rendered model and the integrated feature maps (GMM + GOM).
      Calculate the ratio r = [f(θ*) J_t(θ^(t−1) | θ*)] / [f(θ^(t−1)) J_t(θ* | θ^(t−1))].
      Accept θ^t = θ* with probability min(r, 1); otherwise set θ^t = θ^(t−1).
    End for
  Step 4: If f(θ_T) < ε, recognition is finished; otherwise reject θ⁰ and go to Step 2 with the next candidate θ⁰.

Fig. 10. Binding of GMM and GOM in top-down process.
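For concreteness, the Stage II loop can be sketched in Python as follows. The helper `fitness` is a hypothetical stand-in for rendering the 3D CAD model at θ and scoring the match; we assume it returns a positive value where larger is better (e.g. the reciprocal of the NFA of Eq. (8)). The proposal implements the J3 refinement jump of Table 1, which is symmetric, so the J_t terms cancel in the acceptance ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def propose_j3(theta: dict) -> dict:
    """J3 refinement jump of Table 1: symmetric uniform perturbations of
    pose (yaw, pitch, roll), scale/distance, and image position."""
    t = dict(theta)
    t["yaw"]   += rng.uniform(-30, 30)
    t["pitch"] += rng.uniform(-30, 30)
    t["roll"]  += rng.uniform(-10, 10)
    t["dist"]  += rng.uniform(-t["dist"] / 5, t["dist"] / 5)
    t["x"]     += rng.uniform(-40, 40)
    t["y"]     += rng.uniform(-40, 40)
    return t

def mh_pose(theta0: dict, fitness, T: int = 200):
    """Metropolis-Hastings loop of Algorithm 1, Stage II, Step 3.
    `theta0` comes from the Stage I voting; `fitness(theta)` renders the
    model at theta and returns a positive matching score."""
    theta, f = theta0, fitness(theta0)
    for _ in range(T):
        cand = propose_j3(theta)
        fc = fitness(cand)
        # symmetric proposal: the J_t terms cancel in the ratio r
        if rng.uniform() < min(fc / f, 1.0):
            theta, f = cand, fc
    return theta, f
```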


3.3. Stage I: Bottom-up process

Target parameters are initialized by the bottom-up process, which accumulates evidence computed from local structures, as shown in Fig. 9. The core component is the extraction of local Zernike moments from integrated feature maps comprising the Canny edge map (CEM) and the Harris corner map (HCM) (Kim and Kweon, 2005). The CEM is well known to be accurate and robust to thermal noise, and the HCM has high repeatability under environmental changes.

Table 1
Jumping types and corresponding distributions

Jump type | Function | Parameters | Jumping distribution
J1 | Object addition | θ_ID, θ_C, θ_D, θ_P | Depends on bottom-up information
J2 | Object deletion | θ_ID, θ_C, θ_D, θ_P | Depends on top-down result
J3 | Fine tuning of parameters | δθ_C, δθ_D, δθ_P | θ_C = {θ_yaw, θ_pitch, θ_roll}: δθ_yaw ∈ U(−30, 30), δθ_pitch ∈ U(−30, 30), δθ_roll ∈ U(−10, 10); δθ_D ∈ U(θ_D − θ_D/5, θ_D + θ_D/5); θ_P = {θ_x, θ_y}: δθ_x ∈ U(−40, 40), δθ_y ∈ U(−40, 40)

Fig. 11. Parameter optimization by the top-down process: (a) CAD model overlaid with initial parameters; (b) after 10 iterations; (c) after 40 iterations for a visible object; and (d) IR target.

We can bind these two feature maps by extracting an edge patch around each corner. This reduces the computational complexity by up to two orders of magnitude (e.g., number of feature points in a normal scene: original 320 × 240 image; CEM, thousands; CEM + HCM, hundreds). A 3D target is represented as a set of aspect views, and each target view is represented as a set of local Zernike moments (labeled with target ID, aspect view angle, and spatial location) extracted in scale space. By voting in each parameter space, we can estimate the likelihood of each parameter.


The initial target ID is directly estimated from the feature voting in the target ID space. Since we already know the probable initial target ID, the search space for the other parameters is reduced enormously. The only difference is that the voting spaces depend on the parameters. For example, to estimate the initial pose θ_C, we vote the nearest match pairs into the corresponding pose space (azimuth, elevation) and select the maximum. Given an initial target ID and pose, the initial target scale θ_D and position θ_P are estimated by voting in scale space. Details of the voting procedure are available in (Kim and Kweon, 2005).
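As an illustration of the CEM–HCM binding, the following Python sketch extracts edge patches around strong Harris corners using OpenCV; the thresholds, patch size, and function name are our own illustrative choices, not the paper's.

```python
import cv2
import numpy as np

def bind_cem_hcm(gray: np.ndarray, patch: int = 16):
    """Bind the Canny edge map (CEM) and Harris corner map (HCM) by
    keeping only edge patches around strong corners. `gray` is an
    8-bit grayscale image; thresholds are illustrative."""
    cem = cv2.Canny(gray, 50, 150)                        # edge map
    hcm = cv2.cornerHarris(np.float32(gray), 2, 3, 0.04)  # corner response
    corners = np.argwhere(hcm > 0.01 * hcm.max())         # strong corners
    half = patch // 2
    patches = []
    for y, x in corners:
        if half <= y < gray.shape[0] - half and half <= x < gray.shape[1] - half:
            patches.append(cem[y - half:y + half, x - half:x + half])
    # typically hundreds of patches instead of thousands of raw edge pixels
    return corners, patches
```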

3.4. Stage II: Top-down process

Fig. 10 shows the top-down process, comprising model parameter prediction by a jumping distribution and global 2D shape matching, as explained above. The posterior P(θ|S) is approximated statistically by MCMC sampling.

Fig. 12. FLIR and CCD targets to recognize: cars, tower, building, and box.

Fig. 13. The composition of test images: targets in DB, targets not in DB, and natural scenes.

Based on the initial parameters obtained in the bottom-up process, the next samples are generated according to the jumping distribution J_t(θ^t | θ^(t−1)), also referred to as the proposal or candidate-generation function. Generally, random samples are generated to avoid local maxima. However, we utilize the bottom-up information and the top-down verification result for suitable sample generation. In this paper, we use three jumping types, namely object addition, deletion, and refinement, as shown in Table 1.

The first type (J1) inserts a new object and its parameters depending on the result of the bottom-up process; object addition occurs when the voting value in the bottom-up process is greater than a predefined threshold (Stage I). The second type (J2) removes a tested model and its parameters, as determined by the result of top-down recognition; object deletion occurs when the final MCMC result (the NFA) is larger than ε. A jump of the third type (J3) is similar to Eq. (7).


The next state depends on the current state and a random gain. This gain has a uniform distribution (U) within a range of 30° because the view sphere is quantized at this resolution. Here, θ_C^0 is initialized by the result of the bottom-up process.

$\theta_C^t = \theta_C^{t-1} + \Delta\theta_C$,  (7)

where $\theta_C = [\,\theta_{\mathrm{yaw}}\ \theta_{\mathrm{pitch}}\ \theta_{\mathrm{roll}}\,]^{\mathrm{T}}$ and $\Delta\theta_C \sim U(-15, 15)$.

Shape matching is performed by focusing on a shape model that combines the GMM and GOM. We measure the suitability of the generated parameter samples by calculating the meaningfulness index of the shape matching, as explained in Section 2. Eq. (6) can be reformulated as Eq. (8) by incorporating the parameters of the 3D target model:

$f(x, k_0, \mu \mid \theta) = \mathrm{NFA}(x, k_0, \mu) \le \varepsilon$.  (8)

4. Experimental results

4.1. 3D object recognition test using a CCD sensor

First, we tested the algorithm on objects captured with a CCD camera. We built a database of quantized views, as explained above.

Fig. 14. Successful target recognition results on the KAIST FLIR dataset and Fort Carson RSTA dataset.

Table 2
Performance comparison (SR: success rate, FR: false alarm rate)

Test set | Ours | Der | Olson
Acquired data (6 targets in DB (61); targets not in DB (25); natural scene (10)):
(SR, FR) | (90/96, 1/35) | (85/96, 3/35) | (78/96, 8/35)
(SR, FR)% | (93.8, 2.85) | (88.5, 8.57) | (81.3, 22.85)
RSTA data (3 targets in DB (18); targets not in DB (30); natural scene (7)):
(SR, FR) | (45/55, 5/37) | (39/55, 12/37) | (35/55, 11/37)
(SR, FR)% | (81.8, 13.51) | (70.9, 32.43) | (63.6, 29.72)

Fig. 11(a)–(c) show the optimization process. After 40 iterations, optimal object parameters are estimated by the top-down process. Fig. 11(d) shows another top-down optimization result, for a milk pack. Note that a very accurate alignment is possible using only a single camera, bottom-up information, and a 3D shape model, via the MCMC statistical method. The overall computation time averaged 2 s (0.5 s for the bottom-up process) on an AMD 2400+ processor.

4.2. Target recognition test of FLIR imagery

We tested the algorithm using two data sets: our own data set (KAIST FLIR) and the RSTA data set (Der and Chellappa, 1997). The targets to recognize are shown in Fig. 12. The sensor is a FLIR Prism SP with 320 × 240 resolution and an NTSC interface. The models contain some background information, which provides scene context. The 3D CAD models were acquired by manual measurements.

The test images are shown in Fig. 13. They include three types of object, to enable accurate performance evaluation in practical cases: the system has to recognize the targets in its database with a high recognition rate, and be able to reject clutter objects and natural scenes.


Fig. 14 shows several successful recognition results. Table 2 compares our results with the methods of Der and Chellappa (1997) and Olson and Huttenlocher (1997). We use the optimal parameters described in Section 2. In this test, the 3D target view is limited in range. Our method outperforms the other two, with a false alarm rate of 2.85% on the acquired data, and shows optimal matching as well as more robustness to noise (Der and Chellappa, 1997; Olson and Huttenlocher, 1997). However, recognition failures occurred when target structures were severely distorted by noise or when the image contrast was too low.

5. Conclusion

We have proposed a novel ATR paradigm based on the human visual system, especially cooperative feature map binding, utilizing both bottom-up and top-down processes, and demonstrated the system's performance through several experiments. The test results on several IR images demonstrate efficient optimal matching and robustness to noise, as well as the feasibility of the proposed recognition paradigm.

Acknowledgements

This research was supported by the Korean Ministry of Science and Technology under the National Research Laboratory Program (Grant number M1-0302-00-0064), Korea.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.patrec.2005.11.008.

References

Der, S.Z., Chellappa, R., 1997. Probe-based automatic target recognition in infrared imagery. IEEE Trans. Image Process. 6 (1), 92–102.

Desolneux, A., Moisan, L., Morel, J.-M., 2004. Gestalt theory and computer vision. In: Carsetti, A. (Ed.), Seeing, Thinking and Knowing. Kluwer Academic Publishers, Dordrecht, The Netherlands, pp. 71–101.

Dick, A.R., Torr, P.H.S., Cipolla, R., 2002. A Bayesian estimation of building shape using MCMC. In: Proceedings of the 7th European Conference on Computer Vision (2), pp. 852–866.

Doucet, A., Freitas, N.D., Gordon, N., 2001. Sequential Monte Carlo Methods in Practice. Springer, New York.

Drummond, T., Cipolla, R., 2002. Real-time tracking of complex structures. IEEE Trans. Pattern Anal. Machine Intell. 24 (7), 932–946.

Fort Carson RSTA Data Collection, Colorado State University Computer Vision Group. Available from: <http://www.cs.colostate.edu/~vision/ft_carson/>.

Green, P., 1996. Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination. Chapman and Hall, London.

Hornegger, J., Niemann, H., 2000. Probabilistic modeling and recognition of 3-D objects. Internat. J. Comput. Vision 39 (3), 229–251.

Jain, A.K., Dorai, C., 2000. 3D object recognition: Representation and matching. Statist. Comput. 10, 167–182.

Kim, S., Kweon, I.S., 2005. Automatic model-based 3D object recognition by combining feature matching with tracking. Machine Vision Applications, published online 5 July 2005.

Lowe, D.G., 2004. Distinctive image features from scale-invariant keypoints. Internat. J. Comput. Vision 60 (2), 91–110.

Murase, H., Nayar, S., 1995. Visual learning and recognition of 3-D objects from appearance. Internat. J. Comput. Vision 14 (1), 5–24.

Nair, D., Aggarwal, J.K., 2000. Bayesian recognition of targets by parts in second generation forward looking infrared images. Image Vision Comput. 18, 849–864.

Olson, C.F., Huttenlocher, D.P., 1997. Automatic target recognition by matching oriented edge pixels. IEEE Trans. Image Process. 6 (1), 103–113.

Robert, C.P., Casella, G., 1999. Monte Carlo Statistical Methods. Springer, New York.

Shimshoni, I., Ponce, J., 2000. Probabilistic 3D object recognition. Internat. J. Comput. Vision 35 (1), 51–70.

Treisman, A., 1998. Feature binding, attention and object perception. Philos. Trans.: Biol. Sci. 353 (1373), 1295–1306.

Zhang, D., Lu, G., 2004. Review of shape representation and description techniques. Pattern Recognition 37, 1–19.