Probabilistic 3D object recognition and pose estimation using multiple interpretations generation


Zhaojin Lu1 and Sukhan Lee1,2,*

1School of Information and Communication Engineering, Sungkyunkwan University, Seoul, South Korea

2Department of Interaction Science, Sungkyunkwan University, Seoul, South Korea

*Corresponding Author: [email protected]

Received June 3, 2011; revised September 10, 2011; accepted October 7, 2011; posted October 11, 2011 (Doc. ID 148720); published November 18, 2011

This paper presents a probabilistic object recognition and pose estimation method using multiple interpretation generation in cluttered indoor environments. How to handle pose ambiguity and uncertainty is the main challenge in most recognition systems. In order to solve this problem, we approach it in a probabilistic manner. First, given a three-dimensional (3D) polyhedral object model, the parallel and perpendicular line pairs, which are detected from stereo images and 3D point clouds, generate pose hypotheses as multiple interpretations, with ambiguity from partial occlusion and fragmentation of 3D lines especially taken into account. Unlike previous methods, each pose interpretation is represented as a region instead of a point in pose space, reflecting the measurement uncertainty. Then, for each pose interpretation, more features around the estimated pose are further utilized as additional evidence for computing the probability using the Bayesian principle in terms of likelihood and unlikelihood. Finally, a fusion strategy is applied to the top ranked interpretations with high probabilities, which are further verified and refined to give a more accurate pose estimation in real time. The experimental results show the performance and potential of the proposed approach in real cluttered domestic environments. © 2011 Optical Society of America

OCIS codes: 100.0100, 100.5010, 150.0150, 330.0330.

1. INTRODUCTION

Three-dimensional (3D) object recognition and pose estimation is a difficult problem in computer vision and has been intensively investigated for many years across widespread applications. In particular, 3D object recognition is an indispensable component of manipulation and SLAM in robotics. How to deal with 3D object recognition in a domestic environment under varying illumination, perspective viewpoint, distance, partial occlusion, background, etc., is still an open problem.

Many researchers have proposed various 3D object recognition approaches [1–3]. Among them, the model-based recognition method is the most general one [4,5]. The method computes the hypothesized model pose by finding correspondences between the model features and image features, and the final pose is verified using additional image features. The most challenging part of this process is the effective representation of features and the identification of corresponding features in the unstructured environment with change of illumination, viewpoint, clutter, partial occlusion, and so on. Many features have been employed for recognition, such as point, 2D/3D line, appearance, and so forth.

In real applications, sometimes, visual recognition mayneed to rely on features that are not powerful enough for aunique and crisp decision. For instance, visual recognitionof such home appliances as table, dish washers, refrigerators,TV, milk box, book, etc., representing objects of a polyhedralshape with little texture, may need to rely on line-based fea-tures representing the boundaries of the objects and/or theirparts. In this case, the quality of detected features may varyand are sometimes very poor, depending on the illumination,viewpoint, and distance at the time of detection, resultingin, possibly, a lot of uncertainties. Furthermore, besides un-

certainties, depending on the complexity of the feature usedfor matching, there may arise much ambiguity in decision-making, e.g., the parallel and intersecting 3D lines used in thispaper can produce many possibilities for the object pose, asdescribed by multiple interpretations. This paper aims at pro-posing a method for simultaneously dealing with both ambi-guities and uncertainties in object recognition and poseestimation in one computational framework, based on multi-ple interpretation generation and Bayesian probabilistic rea-soning with likelihood and unlikelihood computation. It isour opinion that the above vision problem, often found inpractice, and the proposed approach to dealing with uncer-tainties and ambiguities in one computational framework issure to contribute to the advancement in the field of visionboth theoretically and application-wise, especially, in termsof widening its applications to many practical circumstances.

There are three challenges in constructing such a 3D object recognition system. The first is determining how to generate multiple interpretations to cover all possible locations of the target object in 3D space. In our approach, 3D lines are estimated from stereo images and 3D point clouds as described in [6]. That is, first, 2D lines are extracted from one of the stereo images. Then, the 3D points corresponding to each 2D line are collected from the 3D point cloud obtained from the stereo image. Finally, the 3D line is estimated from the collected 3D points with the outliers identified and removed. The estimated 3D lines are, of course, subject to errors. However, the effect of these errors on the overall process is kept under control by representing them as the uncertainties or variances associated with the 3D lines. We then group the 3D lines into two types of feature sets, pairs of parallel lines and pairs of perpendicular lines; both usually appear in man-made objects.
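The 3D line estimation step lends itself to a short illustration. The Python sketch below fits a line to the 3D points gathered for one 2D line by a PCA fit with a simple distance-based outlier rejection; this is our own illustrative estimator, not necessarily the one used in [6], and the threshold value is a placeholder.

```python
import numpy as np

def fit_3d_line(points, outlier_thresh=0.01, max_iters=5):
    """Fit a 3D line to the points supporting one 2D line.

    points: (N, 3) array of 3D points collected from the stereo point cloud.
    Returns (centroid, direction) of the fitted line; direction is a unit vector.
    Points farther than outlier_thresh (same units as the points, e.g. meters)
    from the current fit are discarded and the fit is repeated.
    """
    pts = np.asarray(points, dtype=float)
    centroid, direction = pts.mean(axis=0), np.array([1.0, 0.0, 0.0])
    for _ in range(max_iters):
        centroid = pts.mean(axis=0)
        # Principal direction of the centered points = line direction.
        _, _, vt = np.linalg.svd(pts - centroid, full_matrices=False)
        direction = vt[0]
        # Point-to-line distances for outlier rejection.
        diff = pts - centroid
        dist = np.linalg.norm(diff - np.outer(diff @ direction, direction), axis=1)
        inliers = dist < outlier_thresh
        if inliers.all() or inliers.sum() < 2:
            break
        pts = pts[inliers]
    return centroid, direction
```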

Every pairing of an image feature set with a model feature set contributes a pose hypothesis as an interpretation consisting of a 6 degree of freedom rigid transformation. Typically, each image feature set corresponds to multiple model feature sets, which results in multiple interpretations. Thanks to the adopted 3D line features, which are invariant to translation, orientation, and viewpoint, the total number of interpretations is much smaller than in conventional 2D feature based approaches, and each interpretation is less likely to be corrupted by spurious features.
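To make the feature-set construction concrete, the following sketch groups fitted 3D lines into candidate parallel and perpendicular pairs by comparing their directions, with a rough proximity check; the tolerances and the proximity criterion are placeholders of ours, not thresholds stated in the paper.

```python
import numpy as np
from itertools import combinations

def pair_lines(lines, angle_tol_deg=10.0, max_gap=0.5):
    """Group 3D lines into candidate parallel / perpendicular pairs.

    lines: list of (centroid, unit_direction) tuples, e.g. from fit_3d_line.
    Returns (parallel_pairs, perpendicular_pairs) as lists of index pairs.
    """
    parallel, perpendicular = [], []
    for (i, (ci, di)), (j, (cj, dj)) in combinations(enumerate(lines), 2):
        if np.linalg.norm(ci - cj) > max_gap:   # too far apart to share one object
            continue
        cosang = abs(np.clip(np.dot(di, dj), -1.0, 1.0))
        angle = np.degrees(np.arccos(cosang))    # angle folded into [0, 90] deg
        if angle < angle_tol_deg:
            parallel.append((i, j))
        elif abs(angle - 90.0) < angle_tol_deg:
            perpendicular.append((i, j))
    return parallel, perpendicular
```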

The second challenge is determining how to verify each interpretation with additional image features as supporting evidence. Most of the initial hypothesized interpretations are inaccurate because correspondences between the model feature sets and image feature sets are incorrect. Thus, our approach ranks interpretations in a probabilistic manner using the Bayesian rule. To ensure the estimated probability is reliable, the probabilities associated with these multiple interpretations are computed by exploiting all available evidence in conjunction with the corresponding poses, such as 3D lines around the poses in 3D space; the color feature is also taken into account as photometric evidence. As a means of combining multiple pieces of evidence around the poses, a Bayesian posterior probability is computed based on the unlikelihood-to-likelihood ratio, where the likelihood is evaluated between the model feature sets and image feature sets, and the unlikelihood is evaluated by analyzing the context information around the estimated pose. The probability estimation is largely robust to environmental change. In order to take into account the uncertainty in the values measured in the image, we represent each interpretation as a region in the pose space rather than a point in that space. This approach is similar to that in [7], but the difference is that [7] approximates the uncertainty region as a uniform probability density function (PDF), whereas we approximate the uncertainty region as a Gaussian PDF, which is more appropriate because it is a good model of the phenomenon. Consequently, each interpretation is represented as a Gaussian PDF with a certain probability weight.

The final challenge is determining how to refine top ranked interpretations to provide a more accurate pose. To do this, we make use of the information inherent in interpretations, meaning interpretations should yield compatible poses if they correspond to the same object. Then, we verify that the interpretations support each other by determining how much their pose uncertainty regions intersect in terms of the Mahalanobis distance between each pair of Gaussian PDFs. We fuse sets of interpretations that support each other and output a small number of fused interpretations with higher probabilities and smaller uncertainties. Compared to traditional algorithms, our method yields a more accurate pose refinement with a less expensive verification computation. For example, an approach such as the modified Gold's graduated assignment algorithm in [8] requires a number of iterations using deterministic annealing to yield an optimal pose. By fusing the compatible set of interpretations, we are able to find the precise pose in real time. To summarize, a flowchart of the proposed algorithm is given in Fig. 1.

The remainder of the paper is structured as follows. In Section 2 we discuss related work. Details of multiple interpretations generation are described in Section 3. In Section 4 we derive the probability computation in terms of likelihood and unlikelihood. In Section 5 we present fusion based pose refinement. Section 6 demonstrates the experimental results, and conclusions are given in Section 7.

2. RELATED WORK

Since 3D object recognition is one of the most difficult and important problems in computer vision, many 3D object recognition approaches have been proposed in the last decades since Roberts' pioneering work on 3D polyhedral object recognition from 2D images [9]. Fischler and Bolles' RANSAC approach [10], Beis and Lowe's invariants indexing approach [11], and Costa's relational indexing approach [12] hypothesize poses from initial feature matching correspondences and verify those hypotheses based on the presence of additional supporting evidence. These approaches cannot be applied in real time when the number of model and image features becomes large. David et al. [13] proposed an approach in which the recognition and pose estimation are solved simultaneously by minimizing an energy function. However, the

[Fig. 1 flowchart: capture image from stereo camera → 2D rectified color image and 3D point clouds → 2D lines and 3D lines → 3D parallel and perpendicular lines (selected using the object model: 3D geometric wireframe and surface texture) → multiple interpretations generation and pose estimation → visibility test → assign pose uncertainty to each interpretation → probability computation by supported 3D lines and by color feature in 2D → probabilistic multi-cue integration → pose verification and refinement → final recognition decision]
Fig. 1. Flow chart of the proposed method. First, images are captured by the stereo camera, followed by 3D line extraction; 3D parallel lines and perpendicular lines are selected based on the model constraints. Second, multiple interpretations are generated by matching the image features with those of the model. If the generated interpretations satisfy the visibility test [36], then probability and pose distribution are computed. Finally, a set of top ranked interpretations are further verified and refined for the final recognition decision.

energy function may not converge to a minimum value in the functional minimization method due to the high nonlinearity of the cost function.

Recently, a number of appearance-based approaches to 3D object recognition have been proposed, in which multiple 2D views are sampled as a representation of 3D objects. The first appearance-based systems found in the literature used principal component analysis (PCA) as a feature extraction technique to reduce the dimensionality of the object classes or models. Vicente [14] compared the performance of PCA with independent component analysis [15] for 3D object recognition. Sun [16] proposed a multiview probabilistic model, which considers not only similar features in multiview images, but also the 3D relationship between the multiple views or multiple parts of one view. Moreover, Ekvall [17] proposed an interesting approach for object recognition and pose estimation using color co-occurrence histograms (CCH) and geometric modeling; they used CCH to estimate a partial pose, which was subsequently improved by the geometric model. However, these methods cannot provide accurate pose estimation since they do not use 3D models, and they are sensitive to illumination change, clutter, and partial occlusion. In addition to appearance features, the use of local feature descriptors has become popular. The Harris corner detector [18] has seen widespread use; however, it is not robust to scale change. Schmid and Mohr [19] have proposed a rotationally invariant feature descriptor with the Harris corner detector. Lowe [20] extended this work to scale invariant and partially affine invariant features with his SIFT approach. Excellent results have been obtained by approaches using local features when objects have significant distinctive texture. However, there are many objects with little texture.

Most of the aforementioned approaches utilize 2D images, which are sensitive to changes in illumination, viewpoint, scale, and so on. More recently, the use of range images has become popular as a way of overcoming the limitation of 2D images. In range images, 3D shapes are represented by local features. Spin images [21] and 3D shape context [22] are examples of methods in which surface points are described by the shape distribution of a local neighborhood. However, these methods mostly deal with dense and accurate depth data, which are different from stereo vision based images.

A number of studies have also used 3D lines for 3D object recognition. 3D lines are invariant features and are easy to detect from the boundaries of both textured and nontextured objects. Zhang and Faugeras [23] were the first to address the line matching problem. There is, however, an implicit assumption that the corresponding points are the midpoints of the corresponding line segment pairs, which, in fact, are difficult to find. Guerra and Pascucci [24] present a Hausdorff distance based method for matching two sets of 3D line segments where line correspondences are not known. Kamgar-Parsi [25] relies on the repeated use of matching sets of equal length line segments. While the algorithm works for all basic line matching cases, partially overlapping line segments are not considered in the case of finite–finite line matching.

Shimshoni and Ponce [7] proposed a probabilistic approach for 3D object recognition, which is similar to ours in a number of ways. Both approaches first hypothesize poses using a small set of local correspondences, then sort the hypotheses probabilistically, and finally refine the top ranked poses to achieve more accurate estimation. Significant differences between the two approaches are that ours uses 3D line features instead of 2D line features, and we also employ an invariant 3D polyhedral model rather than 2D models sampled from a viewing sphere using the probabilistic peaking effect. Thus our approach generates fewer hypotheses as interpretations, and each interpretation can provide more accurate pose estimation due to the invariance of 3D lines. Furthermore, the color feature is taken into account and incorporated as additional information, whereas [7] does not use this robust cue. Our approach also has some similarities to David's [8] approach, in that it represents an interpretation as a point in pose space and uses a graduated assignment algorithm for pose refinement. In contrast, we represent an interpretation as a region in pose space that is approximated as a Gaussian PDF, which is more appropriate because it is a good model of the phenomenon, and we use fusion to obtain a more accurate pose by testing in real time whether two interpretations support each other.

3. MULTIPLE INTERPRETATIONS GENERATION

The motivation of probabilistic multiple interpretations is to specifically focus on 3D object recognition. The feature selected as weak evidence for the initial object recognition is incomplete and ambiguous, meaning that the feature may generate a number of matches that are not the target object we are searching for or, even though the feature represents the target object, the feature is unable to localize the target object uniquely, as Fig. 2 shows. Furthermore, we need to consider the case where there may be multiple similar objects present in the scene. How to incorporate the above two factors into object recognition is a matter of interest. As stated above, the initial matches generated by the initial feature are rather ambiguous and incomplete in terms of uniquely identifying where the target object is. In order to generate true interpretations, each match is interpreted in terms of possible object poses. These newly identified interpretations are then subject to further evaluation with additional evidence so as to determine the probability and verify that the candidate represents the target object.

Parallel line pairs and perpendicular line pairs are typical combinations of line features of man-made objects in domestic environments. In order to achieve robustness, we model each interpretation in a probabilistic manner as Fig. 2 shows. We would like to compute $P(x, H_m, O \mid F)$ [26], which is obtained basically from placing the model of target object $O$ at location $x$ given feature set $F$, where $H_m$ denotes the hypothesis that the $m$th model feature set of model $O$ matches the measured image feature set $F$. More specifically, $P(x, H_m, O \mid F)$ can be represented via the Bayesian principle as

$$P(x, H_m, O \mid F) = P(H_m, O \mid F)\, P(x \mid F, H_m, O), \qquad (1)$$

where $P(H_m, O \mid F)$ is the probability (i.e., a positive real value $\le 1$) that $F$ represents hypothesis $H_m$ of target object $O$, which is defined as a polyhedral shape. Most domestic objects can be approximated as a polyhedral shape (as in Fig. 2); even a cylindrical object (e.g., a juice can) can be approximated as a cuboid.

On the other hand, $P(x \mid F, H_m, O)$ represents the probability that hypothesis $H_m$ of the object $O$, given $F$, is located at $x$, which is a six-dimensional vector including translation and orientation in Euclidean space. Since $x$ is a random variable due to the uncertainty of the image features, $P(x \mid F, H_m, O)$ defines a PDF along $x$. Therefore, the pose of an interpretation is represented as a region instead of a point in the pose space. The probability of an interpretation, $P(H_m, O \mid F)$, is the posterior probability of the object conditioned directly on the detected feature set. In the case where the feature set is a parallel line pair, due to the partial occlusion and fragmentation of 3D lines, the image feature set $F$ may not uniquely match $H_m$ of target object $O$. We could use the midpoints to establish the correspondence between each image line and model line, but this is not always valid in the case of a short line superposed on a longer line (think of the short line as a fragment of the longer line). A solution to this challenge is to generate multiple representations for each image feature set. Specifically, in order to estimate the pose by matching $F$ against $H_m$, we need to extend each 3D line in $F$ to the same length as $H_m$ such that four pairs of endpoints (each $F$ has two parallel or perpendicular 3D lines) can be put in correspondence. However, the extension of $F$ is not unique: $F$ can be represented by a series of extended feature sets $F_k$ (Fig. 3) that uniformly sample the dynamic range with an interval $s$, where the dynamic range is computed from the length difference between $F$ and $H_m$, and the interval $s$ depends on the size of the model and the dynamic range. Incorporating $F_k$ into (1) yields

$$P(x, H_m, O \mid F) = \sum_{k=1}^{K} P(x, H_m, O \mid F_k)\, P(F_k \mid F), \qquad (2)$$

where $P(F_k \mid F)$ is the probability that $F_k$ correctly represents $F$, which is uniformly distributed as mentioned above. $P(x, H_m, O \mid F_k)$ can be factored as $P(H_m, O \mid F_k)\, P(x \mid F_k, H_m, O)$, similar to the derivation in (1), where $P(x, H_m, O \mid F_k)$ is called a subinterpretation, obtained by matching the extended image feature set $F_k$ against the $m$th model feature set of hypothesis $H_m$. We approximate $P(x \mid F_k, H_m, O)$ as a Gaussian PDF, where the covariance matrix of the uncertainty is computed from the error of the estimated 3D lines [6]. Therefore, $P(x, H_m, O \mid F)$ is computed by summing over the subinterpretations, which forms a mixture of Gaussians as Fig. 3 illustrates. Although the complexity increases linearly with the number of subinterpretations, the recognition performance is greatly enhanced. Finally, in order to compute the pose, the pose hypothesis of a subinterpretation is generated by corresponding the endpoints of each 3D line in $H_m$ and $F_k$ one by one. In total we have four pairs of corresponding endpoints. Thus, the transformation mapping a model feature set $H_m$ to the extended image feature set $F_k$ is

$$F_k = T_m^k H_m, \qquad (3)$$

where $T_m^k$ is a $4 \times 4$ homogeneous transformation matrix with twist representation [27], and the corresponding twist parameters are represented as a vector $\theta_m^k = (\omega_x\ \omega_y\ \omega_z\ t_x\ t_y\ t_z)$, where $(\omega_x\ \omega_y\ \omega_z)$ represents rotation and $(t_x\ t_y\ t_z)$ represents translation. For the sake of clarity in the following sections, the general form of a subinterpretation is characterized as

$$I_m^k = \{\pi_m^k,\ N(\theta_m^k, \Sigma_m^k)\}, \qquad (4)$$

where $\pi_m^k$ denotes the probability weight $P(H_m, O \mid F_k)$, $N(\theta_m^k, \Sigma_m^k)$ denotes a Gaussian PDF of the pose distribution $P(x \mid F_k, H_m, O)$, and the covariance matrix $\Sigma_m^k$ characterizes the pose uncertainty, estimated as shown in Fig. 4 according to the range resolution of the stereo camera [6].
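For illustration only, the sketch below recovers the rigid transformation of Eq. (3) from the four corresponding endpoints by a least-squares (Kabsch) alignment instead of the twist parameterization of [27] that the paper uses, and shows how a subinterpretation of Eq. (4) might be stored; the names and the dictionary layout are hypothetical.

```python
import numpy as np

def pose_from_endpoints(model_pts, image_pts):
    """Least-squares rigid transform T (4x4) mapping model endpoints to image endpoints.

    model_pts, image_pts: (4, 3) arrays of corresponding endpoints of a
    parallel or perpendicular line pair (model feature set H_m vs. extended
    image feature set F_k).
    """
    mu_m, mu_i = model_pts.mean(axis=0), image_pts.mean(axis=0)
    H = (model_pts - mu_m).T @ (image_pts - mu_i)          # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                                      # proper rotation (no reflection)
    t = mu_i - R @ mu_m
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# A subinterpretation I_m^k = {pi, N(theta, Sigma)} could then be stored as, e.g.:
# sub_interp = {"weight": pi_mk, "mean_pose": theta_mk, "cov": Sigma_mk}
# where Sigma_mk would come from propagating the 3D-line uncertainty [6].
```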

Fig. 2. (Color online) Multiple interpretations: (a) target object, (b) parallel line pairs based interpretations (parallel line pair superimposed upon the object model $O$), (c) perpendicular line pairs based interpretations. $H_1, \dots, H_m$ represent hypotheses from different feature sets of the model.

Fig. 3. (Color online) Illustration of subinterpretations, where $F_1, F_2, \dots, F_K$ are extended feature sets (represented by the dashed blue line).

Fig. 4. (Color online) Pose uncertainty estimation: given two 3D lines $L_1$, $L_2$ with the four corresponding endpoints $(p_1^1, p_2^1)$ and $(p_1^2, p_2^2)$, respectively. The error bound of each line is modeled as an elliptic cylinder. Therefore, the pose uncertainty is estimated based on the centroid point $p_{cm}$.

4. PROBABILITY COMPUTATION

The second challenge of the proposed approach is determining how to rank the generated multiple interpretations probabilistically. During the multiple interpretations generation stage, the initial feature sets $F$ are selected as a weak classifier, which means that given $F$, there are many possible poses that can be estimated due to the ambiguity and uncertainty of $F$. Thus, in order to decrease the number of interpretations in the refinement stage, only top ranked interpretations are selected, which can lead to a more precise pose. For this purpose, a matching probability between the model (transformed by an estimated pose) and the image is computed. Since each interpretation is represented by a number of subinterpretations, instead of computing the probability of each interpretation, we compute the probability of each individual subinterpretation separately. In other words, rather than compute $P(H_m, O \mid F)$, we would like to compute $P(H_m, O \mid F_k)$, using Bayes' law:

$$P(H_m, O \mid F_k) = \frac{P(F_k \mid H_m, O)\, P(H_m, O)}{P(F_k)} = \frac{1}{1 + \alpha}, \qquad (5)$$

where $\alpha = P(F_k \mid \bar{H}_m, \bar{O}) / P(F_k \mid H_m, O)$, and we assume $P(H_m, O)$ and $P(\bar{H}_m, \bar{O})$ are equal because no prior knowledge is available. More specifically, $P(F_k \mid H_m, O)$ represents the likelihood that feature set $F_k$ appears conditioned on the given hypothesis $H_m$ of the object $O$ being present in the scene. $P(F_k \mid \bar{H}_m, \bar{O})$ represents the unlikelihood that feature set $F_k$ appears conditioned on the absence of the object $O$. The terms "likelihood" and "unlikelihood" used in this paper are similar to the "likelihood ratio" in [28]. However, in order to emphasize the importance of Bayesian probabilistic reasoning with likelihood and unlikelihood computation in our method, we use the above two terms. In order to make our approach robust to both textured and textureless objects, both 3D line features and color features are used as additional evidence for the probability computation.

A. 3D Line-Based Probability Computation

To ensure the estimated probability reliably discriminates true interpretations from false interpretations, all the neighboring 3D lines around the estimated pose should be involved as additional evidence in the probability computation. Let $\mathcal{N}(\theta_m^k)$ denote the set of neighboring 3D lines around estimated pose $\theta_m^k$, as shown in Fig. 5(a). Thus, during the actual computation of $P(H_m, O \mid F_k)$ in (5), $F_k$ is alternatively represented as $\mathcal{N}(\theta_m^k)$; the probability of each subinterpretation is therefore a function of the pose. More details about the definition of supporting evidence are shown in Fig. 5(b). Let $L_j$ be the $j$th line segment of the subinterpretation, where $j \in [1, N_r]$ and $N_r$ is the number of visible line segments [solid black 3D lines in Fig. 5(a)] of the subinterpretation. Whether a line feature $l_{ji}$ belongs to $L_j$ is determined by the distance between the midpoint of the line feature and $L_j$. The angle $\theta_{ji}$ between the line feature $l_{ji}$ and $L_j$ is also utilized to find all line features that belong to the interpretation. Two threshold values, specified a priori, are utilized to remove nonrelevant line features, namely $\bar{d}$ and $\bar{\theta}$, representing the distance threshold and angle threshold, respectively. Because of line fragmentation, $L_j$ might own several line features. Thus one can find line features $l_{ji}$, $i \in [1, N_j]$, where $N_j$ is the number of line features that belong to $L_j$.
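A rough sketch of this gating step, assuming measured 3D lines are given as endpoint pairs (a representation of ours); the thresholds $\bar{d}$ and $\bar{\theta}$ are left as parameters with placeholder defaults.

```python
import numpy as np

def support_lines_for_segment(ref_a, ref_b, measured, d_bar=0.03, theta_bar_deg=15.0):
    """Collect measured line segments that support one reference (model) segment L_j.

    ref_a, ref_b: endpoints of the reference segment (3-vectors).
    measured: list of (p, q) endpoint pairs of measured 3D line features.
    Returns indices of segments whose midpoint lies within d_bar of L_j and
    whose direction is within theta_bar_deg of L_j's direction.
    """
    ref_len = np.linalg.norm(ref_b - ref_a)
    ref_dir = (ref_b - ref_a) / ref_len
    support = []
    for idx, (p, q) in enumerate(measured):
        mid = 0.5 * (p + q)
        # Distance from the midpoint to the finite reference segment.
        s = np.clip(np.dot(mid - ref_a, ref_dir), 0.0, ref_len)
        dist = np.linalg.norm(mid - (ref_a + s * ref_dir))
        # Angle between the measured line and the reference line.
        m_dir = (q - p) / np.linalg.norm(q - p)
        cosang = abs(np.clip(np.dot(m_dir, ref_dir), -1.0, 1.0))
        angle = np.degrees(np.arccos(cosang))
        if dist < d_bar and angle < theta_bar_deg:
            support.append(idx)
    return support
```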

1. Line-Based Likelihood Computation

In order to compute the likelihood $P(F_k \mid H_m, O)$ given an interpretation, we opt to use not only the error distance but also the coverage of the line features over the subinterpretation. The error distance of the $i$th line feature with respect to the $j$th reference line segment is denoted by $d_{ji}$ and defined by the distance between the midpoint of the line feature and the reference line.

As mentioned above, each reference line might possess several line features within its threshold. In order to compute the coverage of each reference line, we project each line feature onto the corresponding reference line. As shown in Fig. 5(c), the green portion of the reference line $L_j$ represents the coverage of the line features with respect to the reference line. Subsequently, the error $e_j$ and the coverage $c_j$ associated with each $j$th reference line are computed as

$$e_j = \min\left\{E_{\max},\ \frac{1}{N_j}\sum_{i=1}^{N_j}\left(\mu\,\frac{d_{ji}^2}{\bar{d}^2} + (1-\mu)\,\frac{\tan^2\theta_{ji}}{\tan^2\bar{\theta}}\right)\right\},\qquad
c_j = \max\left\{C_{\min},\ \sum_{i=1}^{N_j}\frac{l_{ji}}{L_j}\right\}, \qquad (6)$$

where the parameters $E_{\max}$ and $C_{\min}$ ensure that "good" poses are not penalized too severely when a model line is fully occluded in the image. These parameters are easily set by observing the values of $e_j$ and $c_j$, which are generated for poor poses. It should be noted that when calculating each error $e_j$, the distance and the angle error are normalized by the threshold values $\bar{d}$ and $\bar{\theta}$. In particular, the coefficient $\mu$ is utilized to impose relative weight between the distance error and the angle error. Therefore, given a set of reference lines, in order to

Fig. 5. (Color online) Generated interpretation and supporting 3D line evidence, where (a) shows neighboring 3D measured lines that belong to the estimated pose (blue 3D lines are neighboring but red ones are not), (b) shows the geometric constraints required of the supporting 3D line evidence, and (c) illustrates the coverage of the supporting line features.

balance the distance error and coverage of every measured line of the generated interpretation, the average distance error $e$ and coverage $c$ are computed as

$$e = \frac{1}{N_r}\sum_{j=1}^{N_r} e_j, \qquad c = \frac{1}{N_r}\sum_{j=1}^{N_r} c_j. \qquad (7)$$

Finally, the line-based likelihood is computed as

$$P_{\mathrm{line}}(F_k \mid H_m, O) = c\,(1 - e^2). \qquad (8)$$

Note that the likelihood is proportional to the coverage while being parabolic in the distance error, because the distance error is more sensitive than the coverage in the likelihood computation.
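A compact sketch of Eqs. (6)–(8), assuming the per-segment distances $d_{ji}$, angles $\theta_{ji}$, and projected coverage fractions $l_{ji}/L_j$ have already been gathered by a gating step like the one above; the parameter defaults ($\mu$, $E_{\max}$, $C_{\min}$) are placeholders, not the paper's values.

```python
import numpy as np

def line_likelihood(per_segment, d_bar, theta_bar, mu=0.7, e_max=1.0, c_min=0.05):
    """Line-based likelihood P_line(F_k | H_m, O) = c * (1 - e^2), Eqs. (6)-(8).

    per_segment: one entry per visible model segment L_j, each a dict
        {"d": distances d_ji, "theta": angles theta_ji (radians),
         "cov": projected coverage fractions l_ji / L_j}.
    """
    errors, coverages = [], []
    for seg in per_segment:
        d = np.asarray(seg["d"], dtype=float)
        theta = np.asarray(seg["theta"], dtype=float)
        cov = np.asarray(seg["cov"], dtype=float)
        if d.size == 0:
            # Fully occluded model line: fall back to the caps so the pose is
            # not penalized too severely (the role of E_max and C_min).
            errors.append(e_max)
            coverages.append(c_min)
            continue
        e_j = min(e_max, np.mean(mu * d**2 / d_bar**2
                                 + (1.0 - mu) * np.tan(theta)**2 / np.tan(theta_bar)**2))
        c_j = max(c_min, cov.sum())
        errors.append(e_j)
        coverages.append(c_j)
    e, c = np.mean(errors), np.mean(coverages)   # Eq. (7)
    return c * (1.0 - e**2)                       # Eq. (8)
```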

2. Line-Based Unlikelihood Computation

We define the unlikelihood as the detection of a particular feature set in the absence of the target object. Most previous approaches compute the unlikelihood based on either learning or empirical data [29,30], where the learning-based approach requires a large number of manually labeled training data (typically hundreds or thousands of images), and empirical data are obtained from some typical scenes; this results in problems with respect to accuracy and robustness. We propose an approach for unlikelihood evaluation in a computational way in terms of a distribution value, which is defined by analyzing the spatial distribution of the support evidence. For instance, consider two interpretations that have the same overall coverage but whose support evidence has different distributions, as in Fig. 6, where interpretations (a) and (b) have the same coverage (covered by thick red lines). (a) is actually more robust than (b) because the lines are equally distributed in (a), providing stronger geometric constraints. Therefore, if the support evidence is distributed more equally around the estimated pose, then this estimated pose is more likely to be the true object. Otherwise, the estimated pose might be generated from the background or other non-objects with a similar initial feature set. Hence, we compute the unlikelihood in order to evaluate whether the estimated pose is reasonable.

If the overall coverage is fixed, then the distribution value is maximized when the $c_j$ [defined in (6)] are equal. Under this criterion, the problem of distribution value computation is identical to entropy computation, which is well defined in standard information theory. It follows that distribution value maximization is equivalent to entropy maximization, which is also equivalent to infomax. The equivalence between distribution value maximization and infomax makes sense because if the support evidence around the estimated pose can provide the most information about the target object, then the estimated pose is more likely to be correct, remembering that the unlikelihood is defined under the assumption that the object is absent from the scene. So if the object is absent from or invisible in the scene, then the support evidence may not provide much related information about the target object.

The unlikelihood of each subinterpretation is computed as follows. Let $\tilde{c}_j$ denote the normalized $c_j$, i.e., $\tilde{c}_j = c_j / \sum_{s=1}^{N_r} c_s$, such that $\sum_{j=1}^{N_r} \tilde{c}_j = 1$. Then, the distribution value of the support evidence $\mathcal{N}(\theta_m^k)$ around the estimated pose $\theta_m^k$ is computed as

$$D(\mathcal{N}(\theta_m^k)) = \sum_{j=1}^{N_r} \left(-\tilde{c}_j \cdot \log(\tilde{c}_j)\right). \qquad (9)$$

In order to represent the unlikelihood in terms of the distribution value $D(\mathcal{N}(\theta_m^k))$, we need to normalize the value of $D(\mathcal{N}(\theta_m^k))$ to the range $[0, 1]$, because the maximum value of $D(\mathcal{N}(\theta_m^k))$ is $\max\{D(\mathcal{N}(\theta_m^k))\} = \sum_{j=1}^{N_r}\left(-\frac{1}{N_r}\log\frac{1}{N_r}\right) \equiv \log(N_r)$. Hence, the normalized distribution value $\tilde{D}(\mathcal{N}(\theta_m^k))$ is represented as

$$\tilde{D}(\mathcal{N}(\theta_m^k)) = \frac{D(\mathcal{N}(\theta_m^k))}{\log(N_r)}. \qquad (10)$$

Finally, the line-based unlikelihood is computed as

$$P_{\mathrm{line}}(F_k \mid \bar{H}_m, \bar{O}) = 1 - \tilde{D}(\mathcal{N}(\theta_m^k)). \qquad (11)$$

The smaller the value of $P_{\mathrm{line}}(F_k \mid \bar{H}_m, \bar{O})$ is, the more likely the estimated pose $\theta_m^k$ is correct. In order to justify the effectiveness of the distribution value computation, we captured a number of images for testing; half of them contain the target object, and the other half contain non-objects. Multiple interpretations generation was performed for both sets of images. The statistical distribution of the distribution value is shown in Fig. 7. From this we can see that the distribution values of the non-object and target object can be discriminated well.
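Since Eqs. (9)–(11) reduce to a normalized entropy of the per-segment coverages, the computation can be sketched directly; the handling of empty or single-segment cases is our own choice.

```python
import numpy as np

def line_unlikelihood(coverages):
    """Line-based unlikelihood 1 - D~(N(theta)) of Eqs. (9)-(11).

    coverages: per-segment coverages c_j (one per visible model line).
    The more evenly the support evidence is spread over the model lines,
    the higher the normalized entropy and the lower the unlikelihood.
    """
    c = np.asarray(coverages, dtype=float)
    n_r = c.size
    if n_r <= 1 or c.sum() == 0.0:
        return 1.0                                  # no informative support evidence
    c_tilde = c / c.sum()                            # normalized coverages, sum to 1
    # Entropy with the 0 * log(0) = 0 convention.
    d = -np.sum(np.where(c_tilde > 0, c_tilde * np.log(c_tilde), 0.0))
    d_norm = d / np.log(n_r)                         # Eq. (10): divide by max entropy log(N_r)
    return 1.0 - d_norm                              # Eq. (11)
```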

B. Color Based Probability Computation

Color is an important component for distinguishing between objects. If the color information of an object is neglected, a very important source of distinction may be lost. Therefore, in order to improve the robustness of the probability

Fig. 6. (Color online) Illustration of distribution and coverage of support line evidence.

Fig. 7. (Color online) Statistical analysis of the distribution value for the target object and non-object.

computation of the subinterpretations, we adopt not only geometric evidence in the form of 3D lines but also photometric evidence in the form of a color feature. When we are to compute the probability weight for a subinterpretation, the pose of the target object for the chosen interpretation is already known. Thus we are able to represent all possible perspective images of the model in 2D. Actually, the recognition results are inclined to be distracted by background regions that are similar to the target appearance [31]. Hence, the importance of the background cannot be ignored. Therefore, similar to the line features, we compute a similarity as the likelihood between the target model and each hypothetical observation, and a dissimilarity as the unlikelihood between the background model and each hypothetical observation, where the background model is approximated as the local region around the hypothetical observation. The subinterpretations that are similar to the target but different from the surrounding background are assigned a high probability weight.

We adopt hue, saturation, value (HSV) color histograms to encode the color information. Because HSV decouples the intensity (i.e., value) from color (i.e., hue and saturation), it is robust to illumination change. Despite the robustness of HSV color, it does not contain any information about the spatial layout of the color distribution. Thus, in order to take into account the spatial layout of the color distribution, we adopt a multipart HSV color histogram by defining the region as the sum of $r$ subregions, as Fig. 8 illustrates. Let the multipart $N$-bin normalized color histograms $H_t^j(u)$ and $H_b^j(u)$ represent the target model and background model (the background is the local region around the target), respectively. The color histogram of each hypothetical observation from each subinterpretation is represented by $H_o^j(u)$, where $u = 1, \dots, N$ and $j = 1, \dots, r$. In fact, $H(u)$ represents three channels. More specifically, the hue channel is quantized to 18 levels, and the saturation and brightness channels are quantized to three levels each. Therefore, the quantized HSV space has $N = 18 \times 3 \times 3 = 162$ histogram bins. Similar to the line-based probability computation, the $F_k$ defined in Eq. (5) is alternatively represented as the multipart HSV color histogram $H_o^j(u)$. We want to favor interpretations whose color histograms are similar to the target model, which should correspond to a large likelihood, while penalizing interpretations whose color histograms are similar to the background model, which should correspond to a large unlikelihood. The color based likelihood and unlikelihood are then computed by a Gaussian [32] with variance $\sigma$ as

$$P_{\mathrm{color}}(F_k \mid H_m, O) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{d^2(H_t(u), H_o(u))}{2\sigma^2}\right),$$
$$P_{\mathrm{color}}(F_k \mid \bar{H}_m, \bar{O}) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{1 - d^2(H_b(u), H_o(u))}{2\sigma^2}\right), \qquad (12)$$

where $\sigma$ is a constant determined in practice, and $d$ is the Bhattacharyya distance [33] between two multipart HSV color histograms, which is defined as

$$d(H_1, H_2) = \sum_{j=1}^{r} c_j \sqrt{1 - \sum_{u=1}^{N} \sqrt{H_1^j(u)\, H_2^j(u)}}, \qquad (13)$$

where $c_j$ is the coefficient weight for each subregion. In this paper we divide up the regions equally into $2 \times 2$ subregions, such that $r = 4$. We experimentally found that the optimal values of $c_j$ are $c_j = 0.25$.
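A sketch of Eqs. (12) and (13) given precomputed per-subregion histograms; building the 162-bin multipart HSV histograms themselves is omitted, and the value of $\sigma$ here is a placeholder (the paper only states that it is determined in practice).

```python
import numpy as np

def multipart_bhattacharyya(h1, h2, weights=None):
    """Eq. (13): weighted Bhattacharyya distance between two multipart histograms.

    h1, h2: (r, N) arrays, one normalized N-bin HSV histogram per subregion
            (here r = 4 subregions, N = 18 * 3 * 3 = 162 bins).
    weights: per-subregion coefficients c_j (default: equal, c_j = 1/r).
    """
    h1, h2 = np.asarray(h1, dtype=float), np.asarray(h2, dtype=float)
    r = h1.shape[0]
    w = np.full(r, 1.0 / r) if weights is None else np.asarray(weights, dtype=float)
    rho = np.sum(np.sqrt(h1 * h2), axis=1)              # Bhattacharyya coefficient per subregion
    return float(np.sum(w * np.sqrt(np.clip(1.0 - rho, 0.0, None))))

def color_likelihood(h_target, h_obs, sigma=0.3):
    """Eq. (12), first line: Gaussian of the target-observation distance."""
    d = multipart_bhattacharyya(h_target, h_obs)
    return np.exp(-d**2 / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)

def color_unlikelihood(h_background, h_obs, sigma=0.3):
    """Eq. (12), second line: Gaussian of (1 - d^2) for the background model."""
    d = multipart_bhattacharyya(h_background, h_obs)
    return np.exp(-(1.0 - d**2) / (2.0 * sigma**2)) / (np.sqrt(2.0 * np.pi) * sigma)
```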

C. Probabilistic Multicue Integration

Since our approach uses multiple cues together to improve the robustness of the system performance, the way in which the multiple cues are fused is also very important, under the assumption that the observations from each cue are statistically independent. Thus, the likelihood and unlikelihood in Eq. (5) can be integrated as

$$P(F_k \mid H_m, O) = P_{\mathrm{line}}(F_k \mid H_m, O)\, P_{\mathrm{color}}(F_k \mid H_m, O),$$
$$P(F_k \mid \bar{H}_m, \bar{O}) = P_{\mathrm{line}}(F_k \mid \bar{H}_m, \bar{O})\, P_{\mathrm{color}}(F_k \mid \bar{H}_m, \bar{O}). \qquad (14)$$

Therefore, the posterior probability in Eq. (5) can be further represented as

$$P(H_m, O \mid F_k) = \frac{1}{1 + \alpha} = \frac{1}{1 + \dfrac{P_{\mathrm{line}}(F_k \mid \bar{H}_m, \bar{O})}{P_{\mathrm{line}}(F_k \mid H_m, O)} \cdot \dfrac{P_{\mathrm{color}}(F_k \mid \bar{H}_m, \bar{O})}{P_{\mathrm{color}}(F_k \mid H_m, O)}}. \qquad (15)$$

In this paper, we take only 3D lines and color cues into consideration for the probability computation. However, the inclusion of extra cues into this framework is straightforward.
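Eqs. (14) and (15) then amount to multiplying the per-cue ratios; a direct sketch (the eps guard against degenerate likelihoods is ours):

```python
def multicue_posterior(p_line, p_line_bar, p_color, p_color_bar, eps=1e-12):
    """Eqs. (14)-(15): fuse line and color cues into the posterior 1 / (1 + alpha).

    p_*     : likelihoods   P(F_k | H_m, O) from each cue
    p_*_bar : unlikelihoods P(F_k | H_m_bar, O_bar) from each cue
    """
    alpha = (p_line_bar / max(p_line, eps)) * (p_color_bar / max(p_color, eps))
    return 1.0 / (1.0 + alpha)
```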

5. POSE VERIFICATION AND REFINEMENT

The final challenge of the proposed approach is to apply pose refinement to a few top ranked interpretations according to the estimated probabilities. Recall that we represent an interpretation as a number of subinterpretations during the multiple interpretations generation stage. The main purpose of the subinterpretations is to decrease the ambiguity due to partial occlusion or fragmentation of the 3D line features. After the probability computation stage, we know the relative importance among the subinterpretations in an interpretation, and these subinterpretations have an OR relationship, which means that only one subinterpretation is correct in an interpretation. Therefore, in the refinement stage, we only choose the subinterpretation with the highest probability to represent an interpretation. In other words, an interpretation is represented as a weighted Gaussian PDF. We use a fusion strategy that fuses pairs of interpretations that support each other, so that the fused pose is more likely to be the correct location of the target object in space.

For the sake of simplicity, let us assume that two independent image feature sets $f_1$ and $f_2$ and two respective

Fig. 8. (Color online) Multipart HSV color histogram: each region is divided into four subregions. (a) is the target model $H_t^j(u)$, (b) is the hypothetical observation $H_o^j(u)$, and (c) is the background $H_b^j(u)$, which is used for the dissimilarity measurement.

hypotheses $h_1$ and $h_2$ are given. Specifically, we have two interpretations, $P(x, h_1, O \mid f_1)$ and $P(x, h_2, O \mid f_2)$, and their corresponding general form representations are given as $I_1 = \{\pi_1, N(\theta_1, \Sigma_1)\}$ and $I_2 = \{\pi_2, N(\theta_2, \Sigma_2)\}$, respectively, as defined in (4). Therefore, the fusion of two interpretations is performed by fusing two weighted Gaussian PDFs. Now, we are interested in the possibility of fusing the two interpretations and, if possible, how to fuse them into $P(x, h_1, h_2, O \mid f_1, f_2)$, or how to get its corresponding general form $I = \{\pi, N(\theta, \Sigma)\}$.

For $f_1$ and $f_2$ to be feature sets of the same object, the pose of the object $x$ must be in the intersection of the pose uncertainty regions of the two interpretations. But if $f_1$ and $f_2$ do not belong to the same object (e.g., one of them is from the background) due to clutter, then they do not support each other. Therefore, before fusing two interpretations, we need to check the supporting relationship between them. For this purpose, the supporting relationship is determined by the Mahalanobis distance between the two pose distributions, $N(\theta_1, \Sigma_1)$ and $N(\theta_2, \Sigma_2)$,

$$d(I_1, I_2) = \sqrt{(\theta_1 - \theta_2)^T \left(\frac{\Sigma_1 + \Sigma_2}{2}\right)^{-1} (\theta_1 - \theta_2)}. \qquad (16)$$

If $d(I_1, I_2) \le \delta_{th}$, then $I_1$ and $I_2$ are deemed to support each other. We have found that good performance can be achieved when $\delta_{th} = 3$. The next problem is how to fuse two interpretations in terms of probabilities and pose distributions. Note that $P(x, h_1, h_2, O \mid f_1, f_2) = P(h_1, h_2, O \mid f_1, f_2)\, P(x \mid f_1, f_2, h_1, h_2, O)$. Similar to (5), the fused probability $P(h_1, h_2, O \mid f_1, f_2)$ is computed as

$$P(h_1, h_2, O \mid f_1, f_2) = \frac{P(f_1, f_2 \mid h_1, h_2, O)}{P(f_1, f_2)} = \frac{1}{1 + \alpha_{12}}, \qquad (17)$$

where $\alpha_{12} = \dfrac{P(f_1 \mid \bar{h}_1, \bar{O})}{P(f_1 \mid h_1, O)} \cdot \dfrac{P(f_2 \mid \bar{h}_2, \bar{O})}{P(f_2 \mid h_2, O)}$, and all terms in $\alpha_{12}$ have already been computed in Section 4. In addition to the probability fusion, the pose distribution fusion should also be computed. As mentioned above, we assume $(f_1, h_1)$ and $(f_2, h_2)$ are independent. Therefore, the fused pose distribution $P(x \mid f_1, f_2, h_1, h_2, O)$ can be approximated as the product of the two independent pose distributions,

$$P(x \mid f_1, f_2, h_1, h_2, O) = P(x \mid f_1, h_1, O)\, P(x \mid f_2, h_2, O). \qquad (18)$$

Both $P(x \mid f_1, h_1, O)$ and $P(x \mid f_2, h_2, O)$ have their corresponding weights. Therefore, to adapt to the reliability of each interpretation [34], (18) is slightly modified:

$$P(x \mid f_1, f_2, h_1, h_2, O) = P(x \mid f_1, h_1, O)^{\omega_1}\, P(x \mid f_2, h_2, O)^{\omega_2}, \qquad (19)$$

where $\omega_1 = \pi_1 / (\pi_1 + \pi_2)$ and $\omega_2 = 1 - \omega_1$ are normalized weights. Finally, the general form of the fused pose $N(\theta, \Sigma)$ is computed as

$$\Sigma^{-1} = \omega_1 \Sigma_1^{-1} + \omega_2 \Sigma_2^{-1}, \qquad \theta = \Sigma\left(\omega_1 \Sigma_1^{-1} \theta_1 + \omega_2 \Sigma_2^{-1} \theta_2\right). \qquad (20)$$
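A sketch of the support test of Eq. (16) and the weight-adapted fusion of Eqs. (19) and (20), treating the pose as a plain 6-vector with a 6 × 6 covariance; handling the rotation components as ordinary vector entries is only a rough approximation for large angular differences, and the fused probability of Eq. (17) is not repeated here.

```python
import numpy as np

def support_each_other(theta1, cov1, theta2, cov2, delta_th=3.0):
    """Eq. (16): Mahalanobis test between two pose Gaussians N(theta_i, Sigma_i)."""
    diff = theta1 - theta2
    avg_inv = np.linalg.inv(0.5 * (cov1 + cov2))
    d = np.sqrt(diff @ avg_inv @ diff)
    return d <= delta_th

def fuse_interpretations(pi1, theta1, cov1, pi2, theta2, cov2):
    """Eqs. (19)-(20): weight-adapted fusion of two supporting interpretations.

    Returns the fused pose mean and covariance; pi1 and pi2 are the
    probability weights of the two interpretations.
    """
    w1 = pi1 / (pi1 + pi2)
    w2 = 1.0 - w1
    inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    info = w1 * inv1 + w2 * inv2                  # fused information matrix, Eq. (20)
    cov = np.linalg.inv(info)
    theta = cov @ (w1 * inv1 @ theta1 + w2 * inv2 @ theta2)
    return theta, cov
```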

6. EXPERIMENTAL RESULTS

In order to validate our approach, we applied this method to the recognition of 3D objects in a cluttered domestic environment. About 30 daily used domestic objects were employed for the experiments, such as a refrigerator, milk box, biscuit box, cup, book, and so forth. Some selected typical objects are shown in Fig. 9. All images were captured with a stereo camera at a resolution of 640 × 480 pixels.

A. Recognition Results with Multiple Interpretations Generation

First, we tested the proposed algorithm for both textured and textureless objects. Figure 10 illustrates recognition results for different kinds of selected objects. The poses in both 3D space and their projection on 2D images are illustrated. All of the objects were recognized correctly within the seven top ranked interpretations. In order to show the recognition capability of our method for nonbox objects (e.g., cylindrical objects), we tested several cylindrical objects as in Fig. 11. Some classes of nonbox objects can be handled as long as their boundaries can be approximated by edges; for example, a cylindrical bottle can be approximated as a cuboid.

In the interest of analyzing the feasibility and effectiveness of the probability computation, we chose two cluttered domestic scenes, as shown in Figs. 12 and 13. In Fig. 12, several ambiguous interpretations are generated because the parallel line pair comes from two different objects (one line from the cup and another line from the box). In Fig. 13, multiple interpretations are generated from both the target object and non-objects because the non-objects have an initial feature set similar to that of the target object. However, our approach can discriminate these ambiguities through the estimated probabilities using additional evidence such as supporting 3D lines and the color feature around the estimated pose. In order to illustrate the efficiency of color for the probability computation, we show a very challenging scenario in Fig. 14, in which three objects with similar geometric shape and size are present, which results in ambiguous multiple interpretations. However, thanks to the color differences of their surface patches, our approach can discriminate them with the color feature. This

Fig. 9. (Color online) Selected textured/textureless daily used objects, which include a milk box, biscuit box, refrigerator, trash can, dishwasher, book, etc.

Fig. 10. (Color online) Recognition results of multiple interpretations generation; results in both 2D images and 3D point clouds are illustrated. Each row shows selected interpretations for one object. From the first to the sixth row, the objects are: kitchen refuse bin, refrigerator, biscuit box, milk box, tissue box, and dishwasher. The first column shows the true recognition result and correct pose estimation, and the second and third columns show the multiple interpretations with incorrect results. (Media 1)

Fig. 11. (Color online) Recognition of 3D nonbox objects: sweet corn bottle (top left), Pocari Sweat can (top right), Gatorade can (bottom left), and coffee cup (bottom right). The appearance of the target object is shown in the top middle of each image, and the model of the recognized object is overlaid in pink on the original image.

Fig. 12. (Color online) Probability computation of interpretations. (a) is an original 2D image overlaid with 2D lines; the right-most box on the white table is the target object. (b)–(d) are three generated interpretations with estimated poses (green) in 3D space. The probabilities of the interpretations are 0.32, 0.27, and 0.76, respectively, where only (d) is the true interpretation of the target object. The figure is best viewed in color and with PDF magnification.

demonstrates that the initial interpretations are generated as a weak classifier without the loss of any possible candidates, and these interpretations can be verified probabilistically using additional evidence as a strong classifier in order to eliminate the spurious interpretations. Finally, a small number of reasonable interpretations with high probability are selected.

B. Pose Estimation Accuracy

In addition to obtaining recognition results, we measured the pose estimation accuracy, where the pose of the object is estimated with respect to the stereo camera. The ground truth of the object pose is obtained manually by marking the corners of the object in a 2D image (which are equal to the vertices in 3D) and then applying the POSIT algorithm [35] to compute the transformation matrix with the given calibrated stereo camera parameters. Table 1 contains the pose values of the correct interpretations shown in the first column of Fig. 10. Both the estimated pose and the ground truth pose are shown, where $(\hat{t}_x, \hat{t}_y, \hat{t}_z, \hat{\phi}_{\mathrm{roll}}, \hat{\phi}_{\mathrm{pitch}}, \hat{\phi}_{\mathrm{yaw}})$ and $(t_x, t_y, t_z, \phi_{\mathrm{roll}}, \phi_{\mathrm{pitch}}, \phi_{\mathrm{yaw}})$ represent the estimated pose and the ground truth pose, respectively. More specifically, $x$ denotes the horizontal direction, $y$ denotes the vertical direction, and $z$ denotes the depth direction (optical axis of the camera). The pose error is computed as the absolute value of the difference between the estimated pose and the ground truth pose, i.e., $\Delta t_x = \|\hat{t}_x - t_x\|$. It is important to note that the $z$ axis has the biggest translation error, and also that the error is proportional to the distance between the camera and the object, because the range resolution of the stereo camera is much better at closer ranges. Furthermore,

Fig. 15 illustrates the pose errors in Table 1, and it is evident that the refrigerator and dishwasher have a greater error than the other objects due to the large distance between the object and the camera.

C. Performance Evaluation

Previous approaches to multiple hypothesis based object recognition rely mostly on 2D information [7,8], without addressing real 3D pose estimation. In contrast, our approach makes use of features in 3D space directly, not only to recognize the object ID but also to determine the object pose in 3D. In this sense, a direct comparison with previous approaches may be somewhat less meaningful. However, in order to evaluate the performance of the proposed algorithm, three parameters were measured:

1. Detection probability β: the probability that the complete set of generated interpretations contains at least one correct match between the model and image.

2. Ranking index n: among all the interpretations ranked probabilistically, the rank of the first correct interpretation that leads to the true location of the target object.

3. Computation time t: the average computation time per model per image.

Selected images shown in Fig. 16 are used to perform the experiment. For each specific object, there are three typical scenarios in the domestic environment ranging from easy to difficult. In case 1, a single object appears in the scene without any partial occlusion, which is the easiest case, as shown in the first row of Fig. 16. In case 2, an object with partial occlusion but no similar objects is present, as shown in the second row of Fig. 16. In case 3, the most difficult case, partial occlusion and similar objects coexist with the target object, as

Fig. 14. (Color online) Color efficiency for objects with similar geometric shape. (a) Shows the correct recognition and pose estimation of the true object with a probability of 0.92. Both (b) and (c) are incorrect interpretations, with probabilities of 0.57 and 0.64, respectively. Surface templates for the color features are attached in each image, where (a) is Seoul milk, (b) is Maeil milk, and (c) is Namyang milk.

Fig. 13. (Color online) Multiple interpretations generation and probability assignment. Top row: the left image is the original 2D image, the right image is the 3D point cloud, and multiple interpretations are generated where the target object is the milk box that is bounded by a white ellipse; two more sets of interpretations are for non-objects bounded by the dashed yellow ellipses. Bottom row: selecting one interpretation from each set in the top right image, the probabilities of each are 0.37, 0.89, and 0.42, respectively, where the highest probability correctly indicates the true object. The figure is best viewed in color and with PDF magnification.

[Fig. 15 bar chart: absolute translation error (mm) and absolute rotation error (deg) for the refuse bin, refrigerator, biscuit box, milk box, tissue box, and dishwasher]
Fig. 15. (Color online) Pose accuracy evaluation.

shown in the third row of Fig. 16. Actually, for each scenario of each object, 25 images shown in Fig. 16 were captured randomly from different viewpoints, distances, and illumination. The estimated value of the detection probability is β ≡ 1 in all three cases, as a result of the adopted weak initial feature set that prevents the loss of any possible candidates of the target object. The estimated values of the ranking index n are shown in Fig. 17. The horizontal axis represents the index

of the object models; 20 objects were employed for this evaluation. From this we see that, for the simple scene that includes only a single object, the correct interpretation can be found within only the top three interpretations. By examining only the top five interpretations for case 2, the correct matching result can be obtained. Even for the most difficult scene (case 3), where many similar objects coexist with the target object, correct recognition can be achieved within only the top eight interpretations, suggesting that the probability computation is working as expected. Therefore, only a few interpretations with high probabilistic ranking are required in the fusion stage. Furthermore, in this experiment, the proposed method required, on average, a computation time of less than 500 ms even for the very cluttered scene (Table 2).

7. CONCLUSION
3D object recognition and pose estimation are basic prerequisites for home service robotics. In this paper, an approach for probabilistic recognition based on multiple interpretations for dealing with uncertainties and ambiguities was proposed. Our approach determines the recognized object pose using probabilistic multiple interpretations, which are generated from 3D line feature sets such as parallel line pairs and perpendicular line pairs. An interpretation is represented as a weighted Gaussian PDF, which is a region instead of a point in pose space. The probability of each interpretation is computed efficiently in terms of both likelihood and unlikelihood, which is robust to occlusion and clutter. The top ranked interpretations are further verified and refined with a fusion strategy in a closed form. The fused interpretations are more reliable, with high probabilities, leading to more accurate pose estimation. Experiments show that the proposed approach can robustly recognize an object in a cluttered domestic environment in real time.

Fig. 16. (Color online) Selected images for performance evaluation. The first row is case 1, with only a single object in the foreground but a still-cluttered background; the second row is case 2, where the object is partially occluded; the third row is case 3, the most difficult case, where partial occlusion and several similar objects coexist with the target object. Both 2D and 3D recognition results are shown for each image. For clarity, the 3D results are enlarged to fit the window. The figure is best viewed in color and with PDF magnification. (See Media 1.)

Table 1. Pose Accuracy Analysis; Test Objects Are Shown in the First Column of Fig. 10. Each cell gives the estimated/true value followed by the absolute error Δ in parentheses; translations in mm, rotations in deg.

Object: t̂x/tx (Δtx), t̂y/ty (Δty), t̂z/tz (Δtz); φ̂roll/φroll (Δφroll), φ̂pitch/φpitch (Δφpitch), φ̂yaw/φyaw (Δφyaw)
Refuse Bin: −237/−217 (20), −86/−96 (10), 1566/1615 (49); 18.3/12.9 (5.4), 38.3/41.8 (3.5), −154.7/−157.7 (3.0)
Refrigerator: −1066/−1035 (31), 632/659 (27), 3069/3158 (89); 1.9/−5.7 (7.5), −2.1/−6.1 (4.0), −169.9/−166.4 (3.5)
Biscuit Box: 98/91 (7), 115/123 (8), 968/996 (28); −21.4/−24.1 (2.7), −54.0/−55.4 (1.4), −152.3/−154.9 (2.6)
Milk Box: −87/−79 (8), 145/139 (6), 820/791 (29); −17.3/−13.8 (3.5), −42.8/−40.8 (2.1), −153.3/−156.6 (3.3)
Tissue Box: −503/−492 (11), −98/−91 (7), 1388/1421 (33); 6.0/1.6 (4.4), 18.9/20.9 (2.0), −154.5/−152.4 (2.1)
Dishwasher: −579/−561 (18), 139/159 (20), 2963/2910 (53); 4.7/11.0 (6.3), 40.7/45.0 (4.4), −176.6/−180.5 (3.9)
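The Δ columns above are the per-component absolute differences between the estimated and true poses. The minimal sketch below shows how such errors could be computed; the wrapping of angle differences to (−180°, 180°] is an assumption about the convention, since the paper does not state it.

```python
# Hedged sketch: per-component absolute pose errors, as tabulated in Table 1.
# Angle differences are wrapped to (-180, 180] degrees; this wrapping convention
# is an assumption, since the table itself does not state one.
def angle_diff(a, b):
    d = (a - b) % 360.0
    return d - 360.0 if d > 180.0 else d

def pose_errors(estimated, truth):
    """estimated, truth: dicts with tx, ty, tz in mm and roll, pitch, yaw in deg."""
    errors = {k: abs(estimated[k] - truth[k]) for k in ("tx", "ty", "tz")}
    errors.update({k: abs(angle_diff(estimated[k], truth[k]))
                   for k in ("roll", "pitch", "yaw")})
    return errors

# Example with the Milk Box row of Table 1.
est = dict(tx=-87, ty=145, tz=820, roll=-17.3, pitch=-42.8, yaw=-153.3)
gt  = dict(tx=-79, ty=139, tz=791, roll=-13.8, pitch=-40.8, yaw=-156.6)
print(pose_errors(est, gt))   # ~ {tx: 8, ty: 6, tz: 29, roll: 3.5, pitch: 2.0, yaw: 3.3}
```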

Fig. 17. (Color online) Performance statistics: correct ranking index n versus the index of the object model for cases 1–3, showing that correct recognition can be obtained from only a small number of top ranked interpretations.

Table 2. Average Computation Time for Each Image

Type of Scene: Required time (ms)
Case 1 (single object): 347
Case 2 (partial occlusion): 360
Case 3 (coexists with similar objects): 469


ACKNOWLEDGMENTS
This research was performed for the Intelligent Robotics Development Program, one of the 21st Century Frontier R&D Programs (F0005000-2010-32), and was supported in part by the KORUS-Tech Program (KT-2008-SW-AP-FSO-0004) funded by the Ministry of Knowledge Economy (MKE), and by the Priority Research Centers Program through the National Research Foundation of Korea, funded by the Ministry of Education, Science, and Technology (MEST) (2011-0018397). This work was also partially supported by MEST, Korea, under the World Class University Program supervised by the Korea Science and Engineering Foundation (KOSEF) (R31-2010-000-10062-0), and by MKE, Korea, under ITRC NIPA-2010-(C1090-1021-0008) (NTIS-2010-(1415109527)).
