
Pattern Recognition 41 (2008) 726–741, www.elsevier.com/locate/pr

Object recognition using a generalized robust invariant feature and Gestalt's law of proximity and similarity

Sungho Kim∗, Kuk-Jin Yoon, In So Kweon
Korea Advanced Institute of Science and Technology, 373-1 Guseong-dong, Yuseong-gu, Daejeon, Korea

Received 24 April 2006; received in revised form 19 May 2007; accepted 22 May 2007

Abstract

In this paper, we propose a new context-based method for object recognition. We first introduce a neuro-physiologically motivated visual part detector. We found that the optimal form of the visual part detector is a combination of a radial symmetry detector and a corner-like structure detector. A general context descriptor, named G-RIF (generalized robust invariant feature), is then proposed, which encodes edge orientation, edge density and hue information in a unified form. Finally, a context-based voting scheme is proposed. This proposed method is inspired by the function of the human visual system called figure-ground discrimination. We use the proximity and similarity between features to support each other. The contextual feature descriptor and contextual voting method, which use contextual information, enhance the recognition performance enormously in severely cluttered environments.
© 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Background clutter; Interior context; Complementary feature; Contextual voting; Gestalt law

1. Introduction

How to cope with image variations caused by photometric and geometric distortions is one of the main issues in object recognition. It is generally accepted that the local invariant feature-based approach is very successful in this regard. This approach is generally composed of visual part detection, description and classification.

The first step in the local feature-based approach is visual part detection. Lindeberg [1] proposed a pioneering method on blob-like image structure detection in scale space. Shokoufandeh et al. [2] extended this feature to the wavelet domain. Schmid et al. [3] compared various interest point detectors and concluded that the scale-reflected Harris corner detector is most robust to image variations. Mikolajczyk and Schmid [4] also compared visual part extractors and found that the Harris–Laplacian-based part detector is suitable for most applications. Recently, several visual descriptors have been proposed [5–8]. Most approaches try to encode local visual information such as spatial

∗ Corresponding author. Tel./Fax: +82 42 869 5465.
E-mail addresses: [email protected] (S. Kim), [email protected] (K.-J. Yoon), [email protected] (I.S. Kweon).

0031-3203/$30.00 © 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved. doi:10.1016/j.patcog.2007.05.014

orientation or edge. Based on these local visual features, several object recognition methods, such as the probabilistic voting method [9] and constellation model-based approaches [10,11], have been introduced.

However, those approaches occasionally fail to recognize objects with few local features in highly cluttered backgrounds. Several researchers have tried to solve the background clutter problem. Stein and Hebert [12] proposed a background-invariant object recognition method by combining an object segmentation scheme with SIFT. Since this method relies on prior figure/ground information obtained by manual segmentation or stereo matching, it cannot be used in general environments. Recently, object-specific figure-ground segmentation methodologies that combine bottom-up and top-down cues have been proposed [13–15]. Although these approaches show somewhat clear figure-ground segmentation results using object-specific information, they do not address how to reduce the influence of background clutter for successful object recognition in a non-segmentation-based approach. Only Lowe [5] tried to discard background clutter by using a simple distance ratio, which in practice is useful. Torralba et al. [16] introduced a completely different approach that exploits background clutter information for object recognition. The


Fig. 1. Clutter reduction strategy: an object is decomposed into convex parts and corner parts, feature weights are aggregated by similarity of labeling and proximity (distance), and the aggregated weights are used for feature voting.

background information is called place context or exterior context in that method. The exterior context information is very useful for practical applications such as intelligent mobile robotics systems.

Our research interest is how to efficiently extract an object's interior contextual information to enhance object recognition performance in strongly cluttered environments. We concentrate on the properties of a human visual receptive field and propose a computationally efficient local context coding method. We also propose an object recognition method based on the human visual system's (HVS) figure-ground discrimination mechanism, which is accomplished by grouping interior contextual information. Fig. 1 summarizes the clutter reduction strategy proposed in this paper. The first key idea is to decompose an object into convex parts and corner parts in order to include fewer background pixels. The second key idea is to aggregate feature weights according to similarity and proximity. Parts of an object tend to share the same attributes and aggregate together. We can discard background clutter by voting the aggregated weights.

2. Background of human visual system

2.1. Simple cells/complex cells in V1

Simple cells and complex cells exist in the primary visual cortex (V1), which detects low-level visual features. It is well known that the response of simple cells in V1 can be modeled by a set of Gabor filters as [17]

G(x, y, \theta, \phi) = e^{-(x'^2 + \gamma y'^2)/2\sigma^2} \cos(2\pi x'/\lambda + \phi),    (1)

where x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ. According to recent neuro-physiological findings [18], the range of the effective spatial phase parameter is 0 ≤ φ ≤ π/2 due to symmetry. An important finding is that the distribution of spatial phase is bimodal, with cells clustering near the 0 phase (even symmetry) and the π/2 phase (odd symmetry).
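As a concrete illustration of Eq. (1), the short sketch below (Python/NumPy; our own illustration, not code from the paper, and the parameter names are our choice) evaluates the Gabor model on a pixel grid for the two phase values singled out above, φ = 0 (even symmetry) and φ = π/2 (odd symmetry).

```python
import numpy as np

def gabor_kernel(size, theta, lam, phi, sigma, gamma=1.0):
    """Evaluate the simple-cell Gabor model of Eq. (1) on a size x size grid."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    x_p = x * np.cos(theta) + y * np.sin(theta)     # rotated coordinate x'
    y_p = -x * np.sin(theta) + y * np.cos(theta)    # rotated coordinate y'
    envelope = np.exp(-(x_p ** 2 + gamma * y_p ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_p / lam + phi)
    return envelope * carrier

# The two phase clusters reported for V1 simple cells [18]:
even_kernel = gabor_kernel(size=21, theta=0.0, lam=8.0, phi=0.0, sigma=4.0)        # even symmetry
odd_kernel = gabor_kernel(size=21, theta=0.0, lam=8.0, phi=np.pi / 2, sigma=4.0)   # odd symmetry
```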

Complex cell response at any point and orientation combines the simple cell responses. There are two kinds of models: weighted linear summation and MAX operation [17]. However, the MAX operation model has the most support since neurons performing a MAX operation have been found in the visual

Fig. 2. (a) Perfect global object description. (b) Issues in local part-based object description.

cortex [19]. The role of the simple cell can be regarded as a tuning process, that is, extracting all responses by changing Gabor parameters. The role of a complex cell can be regarded as a MAX operation over the tuning responses, selecting the maximal response. The former gives selectivity to the object structure and the latter gives invariance to geometric variations such as location and scale changes.
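To make the tune-MAX idea concrete, the following sketch (again our own illustration, reusing the gabor_kernel routine sketched above; the function name and bank layout are assumptions) models a complex-cell map as the pointwise MAX over a bank of tuned simple-cell (Gabor) responses.

```python
import numpy as np
from scipy.signal import fftconvolve

def complex_cell_map(image, kernels):
    """Complex-cell model: pointwise MAX over tuned simple-cell responses.
    `kernels` is a list of Gabor kernels tuned to different orientations,
    scales and phases (the tuning step); the MAX gives location invariance."""
    responses = [np.abs(fftconvolve(image, k, mode="same")) for k in kernels]
    return np.max(responses, axis=0)
```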

2.2. Receptive field in V4

Through the simple and the complex cells, orientation response maps are generated, and fine orientation adaptation occurs on the receptive field within the attended convex part in V4 [20]. The computational method for this orientation adaptation phenomenon is steering filtering [21]. The adapted orientation is calculated from the maximum response spanned by basis responses (tan⁻¹(I_y/I_x)). There are also color blobs in a hypercolumn where the opponent color information is stored. Hue is invariant to affine illumination changes and highlights. Orientation information and color information are combined in V4 [22].

How does the HVS encode the receptive field responses within the attended convex part? Few facts are known on this point, but it is certain that larger receptive fields are used, representing a broader orientation band [23].

3. Perceptual visual part detection

Appearance- or view-based object recognition methods were proposed in the early 1990s for face recognition and have since become popular in the object recognition community. However, global appearance representation using a perfect support window as shown in Fig. 2(a) cannot recognize objects in cluttered environments because of imperfect figure-ground segmentation. A bypassing method is to approximate an object as the sum of its sub-windows. The main issue in this approach is how to select the location, shape and size of a sub-window, as shown in Fig. 2(b). For successful object recognition, visual parts have to satisfy the following requirements. First, background information should be excluded as much as possible. Second, they must be perceptually meaningful. Finally, they should be robust to the effects of photometric and geometric distortions. Mahamud and Hebert [24] suggested a greedy search method to find visual parts that satisfy the above requirements.


Fig. 3. Gabor filters are approximated by derivatives of Gaussian: the π/2-phase Gabor by the 1st derivative of Gaussian (a) and the 0-phase Gabor by the 2nd derivative of Gaussian (b). The 1st and 2nd derivatives of Gaussian are further approximated by a discrete difference (c) and by a difference of two Gaussian kernels (d), respectively, for computational efficiency.

Image structure-based part detectors are more efficient than regular grid-based parts [3,5]. Although these methods work well, they do not exploit the full structural information of interesting objects.

Recently, Serre et al. [17] proposed a neurocomputational object recognition method by precisely modeling the properties of simple and complex cells. The basic concepts of the model are tuning and MAX operation on both feature detection and recognition. Although this model is a biologically good model, it is computationally inefficient because of the enormous tuning process of the location, size and phase of the Gabor filter.

In this section, we present a visual part detection method by applying the tune-MAX [17] to the approximated Gabor filter. Serre and Riesenhuber modeled the Gabor filter using a lot of filter banks while changing scale and orientation. Furthermore, they fixed the phase to 0, which is another limitation. As described in Section 2.1, the distribution of spatial phase in the receptive field is bimodal, 0 (even symmetry) and π/2 (odd

Fig. 4. (Left two) Interest points are localized spatially by tune (arrow)-MAX (dot) of the 1st and 2nd derivatives of Gaussian, respectively. (Right two) Region sizes around interest points are selected by tune-MAX of convexity in scale space.

symmetry). So, we can approximate the Gabor function by two bases generated by the Gaussian derivatives shown in Fig. 3(a) and (b). The 1st and 2nd derivatives of Gaussian, which approximate odd and even symmetry, respond to edge structures and convex (or bar) structures, respectively.

The location and size invariance is acquired by the MAX operation of various tuning responses from the 1st and 2nd derivatives of Gaussian. Fig. 4 shows the complex cell responses using the approximated filters. The arrows represent


Fig. 5. Computationally efficient perceptual part detection scheme: a Gaussian scale-space image pyramid feeds an efficient simple cell model (tuning of locations and scales, with I_x, I_y obtained by subtracting within a scale-space image and DoG by subtracting between scale-space images) and an efficient complex cell model (location and scale selection of local maxima in (x, y, scale) by the MAX operation).

the tuning process, and the dot and circle represent the feature position and region size selected by the MAX operation.

(1) Location tuning using the 1st derivative of Gaussian:
    • Select maximal response in all orientations within a 3 × 3 complex cell (pixel).
    • Suitable method: Harris corner or KLT corner extraction (both eigenvalues are large).
(2) Location tuning using the 2nd derivative of Gaussian:
    • Select maximal response in all orientations within a 3 × 3 complex cell.
    • Suitable method: Laplacian or DoG gravity center (radial symmetry point).
(3) Scale tuning using the convexity:
    • Select maximal response in directional scale space [25].
    • For computational efficiency, a convexity measure such as DoG is suitable. This is related to the properties of the V4 receptive field, where convex parts are used to represent visual information [26].

For the efficient computation of the tune-MAX, we utilize three approximation schemes: the scale-space-based image pyramid [5], discrete approximation of the 1st derivative of Gaussian by subtracting neighboring pixels in a Gaussian-smoothed image, and approximation of the 2nd derivative of Gaussian (Laplacian) by the difference of Gaussian (DoG). The image pyramid is built by consecutive smoothing with a Gaussian filter and ends when the image size is very small. As shown in Fig. 3(c) and (d), the kernel approximations for the Gabor bases are almost identical to the true kernel functions. Fig. 5 shows the structure of our part detection method, which computes the 1st and 2nd derivatives of Gaussian by subtracting neighboring pixels in a scale-space image and by subtracting between scale-space images, respectively. We calculate the local max during scale selection. Fig. 6 shows a sample result of the proposed perceptual part detector. Note that the proposed method extracts complementary visual parts. We can get corner-like parts through the left path, and radial symmetry parts through the right path in Fig. 5 (see Eq. (2)). As shown in Fig. 7, the proposed object decomposition method is also supported by cognitive properties: the HVS attends to gravity centers and high curvature points [27], and objects are deconstructed into perceptual parts that are convex [28,29].

\mathbf{x} = \max_{\mathbf{x} \in W} \{ \mathrm{DoG}(\mathbf{x}, \sigma) \ \text{or} \ \mathrm{HM}(\mathbf{x}, \sigma) \}, \qquad \sigma = \max_{\sigma} \{ \mathrm{DoG}(\mathbf{x}, \sigma) \},    (2)

where DoG(x, σ) = |I(x) ∗ G(σ_{n−1}) − I(x) ∗ G(σ_n)| and HM(x, σ) = det(C) − κ trace²(C). C is defined as

C(\mathbf{x}, \sigma) = \sigma^2 \cdot G(\mathbf{x}, 3\sigma) \cdot \begin{bmatrix} I_x^2(\mathbf{x}, \sigma) & I_x I_y(\mathbf{x}, \sigma) \\ I_x I_y(\mathbf{x}, \sigma) & I_y^2(\mathbf{x}, \sigma) \end{bmatrix},    (3)

where I_x(x, σ) = {S([x+1, y], σ) − S([x−1, y], σ)}/2, I_y(x, σ) = {S([x, y+1], σ) − S([x, y−1], σ)}/2, and S(x, σ) = I(x) ∗ G(σ).
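A minimal sketch of the detection rule in Eqs. (2) and (3) is given below (Python/NumPy/SciPy; our own simplification with an arbitrary scale list, no octave handling, no sub-pixel refinement and only a trivial response threshold), keeping the two complementary paths: DoG extrema for convex parts and the Harris measure HM for corner parts, with the scale chosen where |DoG| is maximal.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def detect_parts(image, sigmas=(1.6, 2.2, 3.1, 4.4, 6.2), kappa=0.04):
    """Complementary part detection: DoG extrema (convex parts) and
    Harris-like corners (corner parts); scale selected by max |DoG| (Eq. (2))."""
    smoothed = [gaussian_filter(image.astype(float), s) for s in sigmas]
    dogs = [np.abs(smoothed[i] - smoothed[i + 1]) for i in range(len(sigmas) - 1)]
    keypoints = []
    for i, dog in enumerate(dogs):
        sigma, S = sigmas[i], smoothed[i]
        # 1st derivatives by central differences, as in Eq. (3)
        Ix = (np.roll(S, -1, axis=1) - np.roll(S, 1, axis=1)) / 2.0
        Iy = (np.roll(S, -1, axis=0) - np.roll(S, 1, axis=0)) / 2.0
        # Harris measure HM = det(C) - kappa * trace(C)^2 with a G(3*sigma) window
        Ixx = sigma ** 2 * gaussian_filter(Ix * Ix, 3 * sigma)
        Iyy = sigma ** 2 * gaussian_filter(Iy * Iy, 3 * sigma)
        Ixy = sigma ** 2 * gaussian_filter(Ix * Iy, 3 * sigma)
        harris = (Ixx * Iyy - Ixy ** 2) - kappa * (Ixx + Iyy) ** 2
        for response, kind in ((dog, "convex"), (harris, "corner")):
            # spatial MAX operation over a 3 x 3 complex-cell neighbourhood
            peaks = (response == maximum_filter(response, size=3)) & (response > 0)
            for y, x in zip(*np.nonzero(peaks)):
                # scale tuning: keep only the scale where |DoG| peaks (Eq. (2))
                if i == int(np.argmax([d[y, x] for d in dogs])):
                    keypoints.append((x, y, sigma, kind))
    return keypoints
```

In practice, contrast and edge-response thresholds would also be applied before accepting a keypoint, as SIFT does [5].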

4. General contextual part descriptor

The description of visual parts is very important for object identification or classification. Two important factors in descriptor design are selectivity and invariance. Selectivity means that different parts or part classes can be discriminated robustly. Invariance means that the same parts or part classes can be detected regardless of photometric and geometric distortions. Descriptors must satisfy both properties appropriately. Direct use of pixel data shows very high selectivity but very low invariance. A PCA-based descriptor shows high selectivity but low invariance due to its sensitivity to the translation of interest points [7]. Histogram-based descriptors show a proper compromise of selectivity and invariance [5,8,30].


Fig. 6. The proposed part detector can extract convex and corner parts simultaneously; (1st row) geometric structure, (2nd row) man-made object, (3rd row) natural scene. (a) Convex parts; (b) corner parts; (c) combined parts.

Fig. 7. Object decomposition comparison between SIFT [5] and our model.

Then, what information has to be represented? We found the solution in the properties of the receptive fields in V1 and V4. As we discussed in Section 2.2, we can mimic the role of the V4 receptive field to represent visual parts. In V4, the edge density map, orientation field, and hue field coexist in the attended convex part. These independent feature maps are detected in V1 (in particular, edge orientation and edge density are extracted using the approximated Gabor with π/2 phase).

Now the question becomes: how to encode the independent feature maps? We utilize the fact that larger receptive fields are used with a broader orientation band [23] and that independent feature maps are combined to make more informative features [22]. Fig. 8(a) shows a possible pattern of receptive fields in V4. The density of the black circles depicts the level of attention of the HVS, in which 86% of fixation occurs around the center receptive field [27]. Each black circle stores the visual distributions of edge density, orientation field, and hue field of the pixels around the circle. Fig. 8(b) shows how it works. We set the number of receptive fields within a part to 21, as shown in Fig. 8(a). Each receptive field stores the information in the form of a histogram, which shows a good balance between selectivity and invariance by controlling the bin size, and is partially supported by the computational model of multidimensional receptive field histograms [31]. We can control the resolution of each receptive field, such as the number of edge orientation bins and hue bins, except for edge density, which is a scalar. In this paper, we set the number of orientation bins to 4 and that of hue bins to 4 (see Fig. 8(b)). Each localized sensor gathers information about the edge density, edge orientation, and hue color of its receptive field. The histogram of an individual sensor is generated by simply counting the corresponding bins according to feature values weighted by the attention strength and sensitivity of the sensor (modeling of pixel context). The scalar edge density is generated from edge magnitudes: each localized sensor accumulates edge magnitudes, and the edge density is obtained by normalization with the total gradient magnitude. This process is linear in the number of pixels within a convex part. Each pixel in a receptive field affects neighboring visual sensors. After all the receptive field histograms are generated, we normalize each histogram vector. The feature dimension for the edge orientation is 21 ∗ 4 and that for the hue is 21 ∗ 4. The feature dimension for the edge density is 21 ∗ 1. After we multiply these histograms with component weights (α + β + γ = 1), we integrate the three kinds of features as in the right column of Fig. 8(b). Finally, we renormalize the feature (total feature dimension: 21 ∗ (4 + 1 + 4) = 189) so that the feature's energy equals 1.
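The descriptor assembly described above can be summarized by the following sketch (our own illustration; the circular receptive-field layout, attention weighting and soft binning of the real system are simplified to plain weighted histograms, and the field_pixels layout is an assumption).

```python
import numpy as np

N_FIELDS, N_ORI_BINS, N_HUE_BINS = 21, 4, 4     # 21 * (4 + 1 + 4) = 189 dimensions

def build_grif(field_pixels, alpha=1/3, beta=1/3, gamma=1/3):
    """field_pixels: list of 21 arrays of (edge_magnitude, edge_orientation, hue)
    rows, one array per receptive field; orientation in [0, pi), hue in [0, 2*pi)."""
    total_mag = sum(p[:, 0].sum() for p in field_pixels) + 1e-12
    blocks = []
    for pix in field_pixels:
        mag, ori, hue = pix[:, 0], pix[:, 1], pix[:, 2]
        ori_hist, _ = np.histogram(ori, bins=N_ORI_BINS, range=(0, np.pi), weights=mag)
        hue_hist, _ = np.histogram(hue, bins=N_HUE_BINS, range=(0, 2 * np.pi), weights=mag)
        density = np.array([mag.sum() / total_mag])          # scalar edge density
        ori_hist /= ori_hist.sum() + 1e-12                   # per-field normalization
        hue_hist /= hue_hist.sum() + 1e-12
        blocks.append(np.concatenate([alpha * ori_hist, beta * density, gamma * hue_hist]))
    feat = np.concatenate(blocks)                            # 189-dimensional vector
    return feat / (np.linalg.norm(feat) + 1e-12)             # unit-energy renormalization

assert build_grif([np.random.rand(50, 3) for _ in range(N_FIELDS)]).shape == (189,)
```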

It is essential to align the receptive field patterns (shown as black circles in Fig. 8(a)) to the dominant orientation of an attended part if there is image rotation. We compared four dominant orientation detection methods: eigenvector [32], weighted steerable filter [21], maximum of orientation histogram [5], and Radon transform [33]. The eigenvector method calculates the dominant orientation from an eigenvector of a Hessian matrix. The weighted steerable filter calculates the dominant orientation using weighted gradient components (x-direction and y-direction). The orientation histogram


Fig. 8. (a) Plausible receptive field model in V4; (b) three kinds of localized histograms are integrated to describe a local region.

Table 1
Comparison of dominant orientation detection methods

Method                      Eigenvector [32]   Steerable filter [21]   Orientation histogram [5]   Radon transform [33]
# of correct / # of total   12/20              16/20                   9/20                        10/20
Matching rate (%)           60                 80                      45                          50

method uses the maximal orientation bin generated from pixel orientations. Variance projection in the Radon transform can provide dominant orientation information. Table 1 shows the comparison results. In this experiment, we prepare a normal object image and an image rotated in-plane by 45°. We count an orientation estimate as correct if the orientation error is within 5°. We found that the weighted steerable filter method showed the best matching rate on rotated image pairs.
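As one plausible reading of the weighted steerable-filter estimate (our interpretation of [21], not the authors' code), the dominant orientation can be taken as the magnitude-weighted average gradient direction of the part, computed with doubled angles so that gradients differing by π reinforce rather than cancel.

```python
import numpy as np

def dominant_orientation(patch, sigma=None):
    """Gaussian- and magnitude-weighted average gradient direction of a patch,
    one plausible reading of the weighted steerable-filter estimate [21]."""
    patch = patch.astype(float)
    h, w = patch.shape
    sigma = sigma or h / 3.0
    gy, gx = np.gradient(patch)
    yy, xx = np.mgrid[0:h, 0:w]
    # Gaussian attention weight centred on the part, times gradient magnitude
    weight = np.exp(-((yy - h / 2.0) ** 2 + (xx - w / 2.0) ** 2) / (2.0 * sigma ** 2))
    weight *= np.hypot(gx, gy)
    # doubled-angle averaging removes the pi ambiguity of gradient orientations
    angle2 = 2.0 * np.arctan2(gy, gx)
    return 0.5 * np.arctan2((weight * np.sin(angle2)).sum(),
                            (weight * np.cos(angle2)).sum())
```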

Each receptive field encodes a histogram-based orientation distribution in [0, π], a hue distribution in [0, 2π] and a scalar edge density in [0, 1]. The orientation distribution could be defined over [0, 2π], but the orientation can be sensitive to phase reversal due to different illumination angles. The coordinates of the receptive field are aligned to the dominant orientation of the convex visual part, which is calculated efficiently by steering filtering [21]. This encoding scheme can control the level of local context information, namely aperture size and feature level, as in Table 2. The aperture size in Table 2 is related to Fig. 8(a) and the type of feature in Table 2 is related to Fig. 8(b). The proposed encoding scheme is a general form of the contextual feature representations of [5,8,30]. We can select a suitable level of context depending on the application. In general, as the aperture size becomes larger, the recall is lower and the precision is higher. Fig. 9 shows an example of visual part matching using the proposed part detector and general context descriptor (G-RIF: generalized robust invariant feature). Most visual parts correspond to one another by a simple Euclidean distance measure. More specific details of the implementation and invariant properties can be found in Ref. [34].

Table 2
Level of local context

Context           Contents of information
Aperture size     L1:1
                  L2:1, L2:2, L2:3, L2:4, L2:5, L2:6, L2:7, L2:8
                  L3:1, L3:2, L3:3, L3:4, L3:5, L3:6, L3:7, L3:8, L3:9, L3:10, L3:11, L3:12
                  L4:1, L4:2, L4:3, L4:4
Type of feature   (1) Edge orientation [0, π], quantization level: 4
                  (2) Edge density [0, 1]
                  (3) Hue information [0, 2π], quantization level: 4
                  etc.: (1) + (2), (1) + (3), (2) + (3), (1) + (2) + (3)

5. Neighboring context-based object recognition

A conventional classifier based on local informative features is direct voting of nearest neighbors as in Fig. 11 [9]. Although there are more sophisticated classifiers, such as SVM [35], Adaboost [36], Bayesian decision theory [11], and strong spatial constraint-based indexing [37], those methods are basically based on the concept of nearest neighbors and their voting.

Basically, we consider two properties of the part relation: the same labeling and proximity, as shown in Fig. 10. Parts belonging to an object share the same object labels. Furthermore, those parts are spatially very close. Proper weights are assigned to those parts according to the probability of the same labeling and proximity. Our research interest is how to utilize the part relation in the nearest-neighbor voting scheme (top-down


Fig. 9. The matching results on rotated objects using the perceptual part detector and local contextual descriptor [G-RIF: aperture size L3 and feature type (1) + (2), valid only in this test example].

Fig. 10. Similarity (same label) and proximity of the part–part context.

verification is beyond the scope). Contextually supported parts get stronger weights with a certain label. We can estimate object labels by voting the weights. This is why the method is called weight-aggregated voting (Fig. 11).

The proposed object recognition method, named neighboring context-based voting, can be modeled as follows:

L = \arg\max_l P(l|X) \approx \arg\max_l \sum_{i=1}^{N_X} P(x_i|l),    (4)

where local feature x_i belongs to the input feature set X, l is an object label and N_X is the number of input local features. The posterior P(l|X) is approximated by the Bayes rule and a uniform prior on labels. τ is the threshold of the feature-label map. We use the following binary probability model to design P(x_i|l):

P(x_i|l) = \begin{cases} 1 & i \in l, \ SM(i, l) \ge \tau, \\ 0 & \text{otherwise}, \end{cases}    (5)

where SM(i, l) is called a feature-label map, which represents the strength of the match between an input feature and its label based on neighboring context information. It is calculated as follows:

SM(i, l) = \sum_{j \in N(i)} w(i, j)\, c(j, l),    (6)

where N(i) is the support region (or neighbors) of feature i. As shown in Fig. 12, neighboring context information is aggregated by summing over the neighborhood using the above equation.

w(i, j) represents the support (contextual) weight at site j in the support region. The support from the neighborhood is valid when the neighboring visual parts have the same label (or object index) as the site of interest. This is the same concept as the Gestalt properties, i.e., proximity and similarity. c(j, l) is the goodness of the match pair (j, l).

The support weight is proportional to the probability p(O_j^l | O_i^l), where O_i^l represents the event that the label of site i is l:

w(i, j) = k \cdot p(O_j^l | O_i^l).    (7)

When we consider a fixed label l, it can be rewritten as

w(i, j) = k \exp\left\{ -\frac{\mathrm{dist}(x_i, x_j)^2}{\sigma^2} \right\} \cdot p(l_{x_i} = l,\ l_{x_j} = l),    (8)

where the first term represents the proximity between the two parts and the second term represents the similarity of the same label. dist(x_i, x_j) denotes the distance between parts i and j, l_{x_i} represents the label of feature x_i, and k is a normalization factor. The probability of both features coming from the same object l is defined as

p(l_{x_i} = l,\ l_{x_j} = l) = \frac{S(l_{x_i} = l)\, S(l_{x_j} = l)}{\sum_{m=1}^{N_L} \sum_{n=1}^{N_L} S(l_{x_i} = n)\, S(l_{x_j} = m)},    (9)

where S(l_{x_i} = l) is the similarity that a local feature is mapped to a label l. This is equivalent to the χ² kernel value between the input feature x_i and its nearest neighbor x^l_{1NN} whose label is l:

S(l_{x_i} = l) = K_{\chi^2}(x_i, x^l_{1NN}).    (10)

The χ² similarity kernel is defined as [35]:

K_{\chi^2}(x, y) = \exp\{-\gamma\, \chi^2(x, y)\},    (11)

where γ is normally set to 3–4. The goodness of match, c(j, l), is also defined as S(l_{x_j} = l).

Although the feature-label map SM(i, l) is calculated using the above similarity kernel, it can be formulated in recursive form by simply reusing the feature-label map in the next iteration, rather than the similarity kernel.
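Putting Eqs. (4)–(11) together, one non-recursive pass of the weight-aggregated voting could be sketched as follows (our own illustration; the normalization constant k, the membership test i ∈ l of Eq. (5), and the recursive reuse of SM are simplified away, and the parameter values are placeholders).

```python
import numpy as np

def chi2_kernel(x, y, gamma=3.5):
    """Chi-square similarity kernel of Eq. (11)."""
    return np.exp(-gamma * 0.5 * np.sum((x - y) ** 2 / (x + y + 1e-12)))

def contextual_vote(feats, positions, scales, model_feats, model_labels,
                    n_labels, tau=0.5, prox_factor=2.0):
    """One simplified pass of neighboring context-based voting (Eqs. (4)-(11))."""
    n = len(feats)
    # S(l_xi = l): chi-square similarity to the nearest model feature of label l (Eq. (10))
    S = np.zeros((n, n_labels))
    for i, f in enumerate(feats):
        for l in range(n_labels):
            cands = [m for m, lab in zip(model_feats, model_labels) if lab == l]
            if cands:
                S[i, l] = max(chi2_kernel(f, m) for m in cands)
    votes = np.zeros(n_labels)
    for l in range(n_labels):
        # p(l_xi = l, l_xj = l) of Eq. (9), for all pairs (i, j)
        p_same = (S[:, l:l + 1] * S[:, l]) / (S.sum(1, keepdims=True) * S.sum(1) + 1e-12)
        for i in range(n):
            sm = 0.0                                    # SM(i, l) of Eq. (6)
            for j in range(n):
                d2 = np.sum((positions[i] - positions[j]) ** 2)
                proximity = np.exp(-d2 / (prox_factor * scales[i]) ** 2)   # proximity term of Eq. (8)
                sm += proximity * p_same[i, j] * S[j, l]                   # w(i, j) * c(j, l)
            votes[l] += 1.0 if sm >= tau else 0.0        # binary model of Eq. (5)
    return int(np.argmax(votes))                         # Eq. (4)
```

Here feats and model_feats would be G-RIF histograms (non-negative, unit energy), positions an (n, 2) array of part locations, and scales the detected part scales; the neighborhood N(i) of Eq. (6) is realized softly through the proximity term rather than a hard cut-off.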


Fig. 11. (a) Conventional classifier: direct voting of the nearest neighbor. (b) Novel classifier: neighboring context-based voting using a learned feature-label map.

Fig. 12. Concept of feature-label map (strength of match) generation using neighboring context information.

Fig. 13(a) shows the feature-label map calculated from the similarity kernel and Fig. 13(b) shows the feature-label map after applying the proposed method: the contextually weighted feature-label map after two recursions. We run the iteration until the labels do not change. In the experiment, it converged within five iterations. Fig. 14 presents a close-up view of Fig. 13 and displays several feature locations. Note the power of the neighboring contextual information, which enhances the strength of figural features (here, f.id: 527, 536) and suppresses the background features (here, f.id: 525).

The baseline distance measure for voting is the Euclidean feature distance. In this case, P(x_i|l) = 1 if the label of the nearest neighbor of x_i is l. Fig. 15(a) and (b) show the feature labeling and voting results, respectively. As you can see, many background clutter features are labeled incorrectly. The next, improved distance measure for clutter rejection is the distance ratio obtained by comparing the distance of the nearest to that of the second-closest neighbor [5]. Fig. 15(c) and (d) show the feature labeling and voting results, respectively. The incorrectly labeled features are partially removed. Finally, we apply the weight-aggregation-based distance measure to the same test image. Fig. 15(e) and (f) show the feature labeling and voting results, respectively. Note that all the object-related features are labeled correctly. In this test, we tune the related distance thresholds so that the number of correctly labeled features is almost the same.

6. Evaluation of G-RIF

We dub the proposed perceptual feature (the perceptual part detector with the generalized descriptor of pixel information) G-RIF. In this section, we evaluate G-RIF in terms of object recognition. We adopt a new feature comparison measure in terms of object labeling. We use the accuracy of the detection rate, which is widely used in classification or labeling problems. Although there are several suitable open object databases such as COIL-100 and the Caltech DB, we evaluate the proposed method using our own database because our research goal is to measure the properties of features in terms of scale change, view angle change, planar rotation, illumination intensity change, and occlusion. The total number of objects is 104; related test images are shown in Fig. 16. The DB and test images are acquired using a SONY F717 digital camera and resized to 320 × 240.

We compare the performance of the proposed G-RIF with SIFT, the state-of-the-art feature [5]. We evaluate the features using the nearest neighbor classifier with direct voting (NNC-voting), which is used commonly in local feature-based object recognition approaches. NNC-based voting is a very similar concept to winner-take-all (WTA). We use the binary program offered by Lowe [5] for an accurate comparison. We set α, β, γ in Fig. 8(b) to 1/3 each for color images and to 1/2, 1/2, 0 for gray images.

Fig. 17 shows the performance of the proposed perceptual part detectors. We use the same descriptor (edge orientation only) with the same Euclidean distance threshold (0.2) used in NNC-based voting. The proposed perceptual part detector outperforms the single part detectors in most test sets. The maximal recognition rate is higher than that of the part detector of SIFT (radial symmetry part) by 15%. Fig. 18 shows the performance of the proposed part descriptor compared with the SIFT


Fig. 13. (a) Initial feature-label map obtained from similarity kernel. (b) Contextually weighted feature-label map using the proposed method.

Fig. 14. Close-up view of Fig. 13 around Label 41 and corresponding feature locations.

descriptor. We used the same radial symmetry part detector to show the power of the descriptors only. The full contextual descriptor (edge orientation + edge density + hue field) shows the best performance except on the illumination intensity change set. In this case, the performance is fair compared to the other contextual descriptors. Under severe illumination intensity change or different light sources, it is reasonable to use the contextual descriptor of edge orientation with edge density.

Fig. 19 summarizes the performance of SIFT, the perceptual part detector with the edge orientation descriptor, and G-RIF (both parts with the general descriptor). G-RIF always outperforms SIFT in all test sets. This good performance originates from the effective use of image structures (radial symmetry points together with high curvature points) by the proposed visual part detector and the effective spatial coding of multiple features in a unified way. The dimension of G-RIF is 189 (21 ∗ 4 + 21 ∗ 4 + 21) and that of SIFT is 128 (16 ∗ 8). The average extraction time of G-RIF is 0.15 s and that of SIFT is 0.11 s for a 320 ∗ 240 image on an AMD 2400+. This difference is due to the number of visual parts: G-RIF extracts twice as many parts as SIFT does.

We also compared G-RIF and SIFT on the COIL-100 DB [38,39]. Since the DB is composed of multiple views with an interval of 5°, we use the frontal view (0°) as the model image and tested five different views up to 45°. We use the nearest neighbor classifier (NNC) based simple voting as the classifier. Fig. 20 shows that G-RIF gives improved performance (by 20%) at the 45° viewing angle.

7. Evaluation of contextual voting

Although there are several open object databases, such as the Amsterdam DB, PASCAL DB, ETH-80 DB, and Caltech DB, we evaluate the proposed method using the COIL-100 DB for feature comparison. We also use our DB and the CMU DB (available at http://www.cs.cmu.edu/~stein/BSIFT), which contain background clutter, because our research goal is to recognize


Fig. 15. Feature labeling and voting using nearest neighbor (a) and (b), distance ratio (c) and (d), and the proposed weight aggregation (e) and (f).

general objects in severely cluttered environments. Fig. 21 (top) shows sample objects in our DB, composed of 21 textureless 2D objects, 26 textured 2D objects, 26 textureless 3D objects and 31 textured 3D objects. We stored each model's image in the canonical viewpoint. Because our research issue is how the contextual information enhances recognition, fixing a 3D


Fig. 16. We use 104 frontal views for DB generation (right) and a 20-image test set (five scales, four view angles, four rotations, four intensity changes, three occlusions) per object (left).

Fig. 17. Evaluation of part detectors (radial symmetry part, high curvature part, perceptual part) for object size, view angle, planar rotation, illumination change, and occlusion: the proposed part detector shows the best performance for all test sets.

viewpoint is reasonable. To effectively validate the power of the context-based object recognition method, we made test images as in Fig. 21 (bottom). One hundred and four objects are placed on a highly cluttered background. The DB and test images are acquired using a SONY F717 digital camera and are resized to 320 × 240.

First, we compare the performance of the proposed G-RIF with SIFT, which is a state-of-the-art descriptor [5,6]. Typically, visual part detectors are evaluated only in terms of their repeatability. However, the evaluation should be in the context of a task, such as the accuracy of object labeling in this paper. We evaluate the features using the direct voting-based classifier, which is used commonly, as in Fig. 11(a). We use the binary program offered by Lowe [5]. The accuracy of the detection rate, which is widely used in classification or labeling problems, is used as the comparison measure. Fig. 22 shows the evaluation results on our DB obtained by changing the similarity threshold of direct voting. G-RIF shows better performance than SIFT


Fig. 18. Evaluation of part descriptors (orientation only, orientation + hue, orientation + edge density, orientation + edge density + hue) for object size, view angle, planar rotation, illumination change, and occlusion: the full descriptor shows the best performance in almost all cases.

Fig. 19. Summary of visual features (SIFT, perceptual part + edge orientation, G-RIF) for object size, view angle, planar rotation, illumination change, and occlusion: G-RIF shows the best performance; both parts + orientation and SIFT follow G-RIF.

and reveals the complementary properties of the G-RIF right path (gravity center point) and the G-RIF left path (high curvature point).

The 2nd and 3rd columns in Table 3 show the overall recognition results using the optimal classifier threshold (0.8). The recognition rate using G-RIF is higher than that of the SIFT feature


by 5.77%. This good performance originates from the complementary properties of the proposed visual part detector and the effective spatial coding of multiple features in a unified way. In this test, we set the aperture level up to L3 and the feature level to (1) + (2). Next, we compared our weight-aggregation-based voting with the two baseline methods, nearest neighbor and distance ratio, using the same G-RIF. The neighborhood size is set to 2 times the part scale. Overall test results are shown in Table 3 (3rd, 4th, 5th columns). The weight-aggregation-based classifier shows the best recognition performance among the baseline methods. This good performance originates from the effective use of the neighboring context. Fig. 23 shows several successful examples using G-RIF that are incorrect using the SIFT feature, although our feature dimension is 105 (84 for edge orientation + 21 for edge density) and SIFT's is 128.

Next, we compared our context-based voting (Fig. 11(b)) to the conventional nearest-neighbor based direct voting (Fig. 11(a)). We used the same visual feature (G-RIF) proposed in Sections 3 and 4 and our DB. The neighborhood size is set to two times the part scale. Overall test results are shown in Table 3 (3rd and 5th columns). The contextual voting-based classifier shows

Fig. 20. G-RIF vs. SIFT on the COIL-100 DB: G-RIF is more robust than SIFT to viewing angle changes, by 20%. We used the same voting-based classifier.

Fig. 21. (Top) examples of KAIST-104 general object models, (bottom) corresponding examples of KAIST-104 test objects on a highly cluttered background.

better recognition performance than direct voting by 5.77%. This good performance originates from the effective use of the neighboring context. The contextual influence was explained in Fig. 14. Fig. 24 shows three kinds of recognition examples with both methods. Fig. 24(a) is a normal case: both methods succeeded for textured objects. Fig. 24(b) shows ambiguous recognition results by direct voting but stable recognition results by contextual voting. Finally, Fig. 24(c) shows a failure case for direct voting, but success with contextual voting. Fig. 25 shows other results of the Fig. 24(c) case. Due to many ambiguous features on the cluttered background, the direct voting method labels the wrong object. However, the context-based voting method can reduce the effect of those cluttered features using neighboring contextual information. Finally, we evaluated

Table 3
Comparison of labeling performance for KAIST-104 DB

Feature              SIFT      G-RIF     G-RIF     G-RIF
Distance             Nearest   Nearest   Ratio     Weight
# correct / # test   62/104    68/104    70/104    74/104
Success rate (%)     59.61     65.38     67.31     71.15

Fig. 22. Comparison of NNC classification accuracy (correct detection rate vs. similarity threshold of direct voting) between G-RIF (left, right, both paths) and SIFT.


Fig. 23. Correct detection using G-RIF (failure cases using SIFT).

Fig. 24. Recognition performance of the nearest neighbor-based direct voting method (left column) and the contextual voting method (right column): (a) normal case, (b) ambiguous case and (c) failure using direct voting but success using ours.


Fig. 25. Correct detection using contextual voting (failure cases using direct voting).

Fig. 26. Recognition accuracy (correct detection rate) vs. area of background clutter for weight aggregation, distance ratio, and nearest neighbor: as the area of background clutter increases, contextual voting shows relatively higher detection accuracy than direct voting.

contextual and direct voting using the CMU database. This database is composed of 110 separate objects and 25 background images. We generated test images by changing the objects' sizes on different backgrounds, which is equivalent to changing the background size. Fig. 26 shows the evaluation results. Contextual voting shows an equal or better recognition rate than direct voting. Note that the neighboring contextual information plays a more important role for object labeling in a severely cluttered background than in a less cluttered background.

8. Conclusions

In this paper, we propose a biologically motivated visual part selection method that extracts complementary visual parts. In addition, we propose a general contextual descriptor that encodes multi-cue information in a unified form. Recognition performance was improved by using the proposed feature. We also propose a simple and powerful object recognition method based on neighboring contextual information. Neighboring contextual information suppresses the strength of background features and enhances the strength of figural features in the feature-label map. The proposed method shows better recognition performance than other methods in severe environments. We will extend the context-based voting method to general object-based image understanding such as labeled figure-ground segmentation.

Acknowledgments

This work was supported in part by the Korea Science and Engineering Foundation (KOSEF) Grant funded by the Korea government (MOST) (No. M1-0302-00-0064), and by the MIC for the project "Development of Cooperative Network-based Humanoids' Technology" of Korea.

References

[1] T. Lindeberg, Detecting salient blob-like image structures and their scales with a scale-space primal sketch: a method for focus-of-attention, Int. J. Comput. Vision 11 (3) (1993) 283–318.
[2] A. Shokoufandeh, I. Marsic, S.J. Dickinson, View-based object matching, in: International Conference on Computer Vision, 1998, pp. 588–595.
[3] C. Schmid, R. Mohr, C. Bauckhage, Evaluation of interest point detectors, Int. J. Comput. Vision 37 (2) (2000) 151–172.
[4] K. Mikolajczyk, C. Schmid, Scale & affine invariant interest point detectors, Int. J. Comput. Vision 60 (1) (2004) 63–86.
[5] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2) (2004) 91–110.
[6] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors, IEEE Trans. Pattern Anal. Mach. Intell. 27 (10) (2005) 1615–1630.
[7] Y. Ke, R. Sukthankar, PCA-SIFT: a more distinctive representation for local image descriptors, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004, pp. 506–513.
[8] M. Tico, P. Kuosmanen, Fingerprint matching using an orientation-based minutia descriptor, IEEE Trans. Pattern Anal. Mach. Intell. 25 (8) (2003) 1009–1014.
[9] C. Schmid, A structured probabilistic model for recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. II, June 1999, pp. 485–490.
[10] R. Fergus, P. Perona, A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003, pp. 264–271.
[11] P. Moreels, M. Maire, P. Perona, Recognition by probabilistic hypothesis construction, in: European Conference on Computer Vision, 2004, pp. 55–68.
[12] A. Stein, M. Hebert, Incorporating background invariance into feature-based object recognition, in: Workshop on Applications of Computer Vision (WACV'05), 2005, pp. 37–44.
[13] S. Kumar, M. Hebert, A hierarchical field framework for unified context-based classification, in: International Conference on Computer Vision, 2005, pp. 1284–1291.
[14] B. Leibe, A. Leonardis, B. Schiele, Combined object categorization and segmentation with an implicit shape model, in: Workshop on Statistical Learning in Computer Vision, 2004.
[15] S.X. Yu, J. Shi, Object-specific figure-ground segregation, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003, pp. 39–45.
[16] A. Torralba, K.P. Murphy, W.T. Freeman, M.A. Rubin, Context-based vision system for place and object recognition, in: International Conference on Computer Vision, 2003, pp. 273–280.
[17] T. Serre, L. Wolf, T. Poggio, Object recognition with features inspired by visual cortex, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005.
[18] D.L. Ringach, Spatial structure and symmetry of simple-cell receptive fields in macaque primary visual cortex, J. Neurophysiol. 88 (2001) 455–463.
[19] T.J. Gawne, J.M. Martin, Responses of primate visual cortical V4 neurons to simultaneously presented stimuli, J. Neurophysiol. 88 (2002) 1128–1135.
[20] G.M. Boynton, Adaptation and attentional selection, Nature Neurosci. 7 (1) (2004) 8–10.
[21] O. Chomat, V. Colin de Verdière, D. Hall, J.L. Crowley, Local scale selection for Gaussian based description techniques, in: European Conference on Computer Vision, 2000, pp. 117–133.
[22] R.H. Wurtz, E.R. Kandel, Perception of motion, depth and form, Principles of Neural Science, 2000, pp. 548–571.
[23] M. Kouh, M. Riesenhuber, Investigating shape representation in area V4 with HMAX: orientation and grating selectivities, Technical Report AIM2003-021, Massachusetts Institute of Technology, 2003.
[24] S. Mahamud, M. Hebert, The optimal distance measure for object detection, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003.
[25] K. Okada, D. Comaniciu, Scale selection for anisotropic scale-space: application to volumetric tumor characterization, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2004, pp. 594–601.
[26] A. Pasupathy, C.E. Connor, Shape representation in area V4: position-specific tuning for boundary conformation, J. Neurophysiol. 86 (5) (2001) 2505–2519.
[27] D. Reisfeld, H. Wolfson, Y. Yeshurun, Context-free attentional operators: the generalized symmetry transform, Int. J. Comput. Vision 14 (2) (1995) 119–130.
[28] L.J. Latecki, R. Lakämper, Convexity rule for shape decomposition based on discrete contour evolution, Comput. Vision Image Understanding 73 (3) (1999) 441–454.
[29] G. Loy, A. Zelinsky, Fast radial symmetry transform for detecting points of interest, IEEE Trans. Pattern Anal. Mach. Intell. 25 (8) (2003) 959–973.
[30] S. Belongie, J. Malik, J. Puzicha, Shape matching and object recognition using shape contexts, IEEE Trans. Pattern Anal. Mach. Intell. 24 (4) (2002) 509–522.
[31] B. Schiele, J.L. Crowley, Recognition without correspondence using multidimensional receptive field histograms, Int. J. Comput. Vision 36 (1) (2000) 31–50.
[32] J.V. Lucas, W.V. Piet, Estimators for orientation and anisotropy in digitized images, in: 1st Conference on Advanced School for Computing and Imaging (ASCI), 1995, pp. 442–450.
[33] K. Jafari-Khouzani, H. Soltanian-Zadeh, Radon transform orientation estimation for rotation invariant texture analysis, IEEE Trans. Pattern Anal. Mach. Intell. 27 (6) (2005) 1004–1008.
[34] S. Kim, I.-S. Kweon, Biologically motivated perceptual feature: generalized robust invariant feature, in: Asian Conference on Computer Vision, 2006, pp. 305–314.
[35] C. Wallraven, B. Caputo, A. Graf, Recognition with local features: the kernel recipe, in: International Conference on Computer Vision, 2003, pp. 257–264.
[36] L. Zhang, S.Z. Li, Z.Y. Qu, X. Huang, Boosting local feature based classifiers for face recognition, in: CVPR Workshop on Face Processing in Video (FPIV'04), 2004, p. 87.
[37] A. Shokoufandeh, D. Macrini, S. Dickinson, K. Siddiqi, S.W. Zucker, Indexing hierarchical structures using graph spectra, IEEE Trans. Pattern Anal. Mach. Intell. 27 (7) (2005) 1125–1140.
[38] S.K. Nayar, S.A. Nene, H. Murase, Real-time 100 object recognition system, IEEE Trans. Pattern Anal. Mach. Intell. 18 (1996) 1186–1198.
[39] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-100), Technical Report, February 1996.

About the Author—SUNGHO KIM received the Ph.D. degree in electrical engineering and computer science from Korea Advanced Institute of Science and Technology, Korea, in 2007. He has been a senior researcher at the Agency for Defense Development, concentrating on the human object perception mechanism and its computational model.

About the Author—KUK-JIN YOON received the Ph.D. degree in electrical engineering and computer science from Korea Advanced Institute of Science and Technology, Korea, in 2006. He has been a post-doctoral fellow in the "MOVI" research team, INRIA. His research interest is depth recovery from stereo image pairs.

About the Author—IN SO KWEON received the Ph.D. degree in robotics from Carnegie Mellon University, Pittsburgh, PA, in 1990. During 1991 and 1992, he was a visiting scientist in the Information Systems Laboratory at Toshiba Research and Development Center, where he worked on behavior-based mobile robots and motion vision research. Since 1992, he has been a Professor of Electrical Engineering at KAIST. His current research interests include human visual perception, object recognition, real-time tracking, vision-based mobile robot localization, volumetric 3D reconstruction, and camera calibration. He is a member of the IEEE and the Korea Robotics Society (KRS).