BOP: Benchmark for 6D Object Pose Estimation

Tomáš Hodaň1∗, Frank Michel2∗, Eric Brachmann3, Wadim Kehl4

Anders Glent Buch5, Dirk Kraft5, Bertram Drost6, Joel Vidal7, Stephan Ihrke2

Xenophon Zabulis8, Caner Sahin9, Fabian Manhardt10, Federico Tombari10

Tae-Kyun Kim9, Jiří Matas1, Carsten Rother3

1CTU in Prague, 2TU Dresden, 3Heidelberg University, 4Toyota Research Institute, 5University of Southern Denmark, 6MVTec Software, 7Taiwan Tech

8FORTH Heraklion, 9Imperial College London, 10TU Munich

Abstract. We propose a benchmark for 6D pose estimation of a rigid object from a single RGB-D input image. The training data consists of a texture-mapped 3D object model or images of the object in known 6D poses. The benchmark comprises: i) eight datasets in a unified format that cover different practical scenarios, including two new datasets focusing on varying lighting conditions, ii) an evaluation methodology with a pose-error function that deals with pose ambiguities, iii) a comprehensive evaluation of 15 diverse recent methods that captures the status quo of the field, and iv) an online evaluation system that is open for continuous submission of new results. The evaluation shows that methods based on point-pair features currently perform best, outperforming template matching methods, learning-based methods and methods based on 3D local features. The project website is available at bop.felk.cvut.cz.

1 Introduction

Estimating the 6D pose, i.e. 3D translation and 3D rotation, of a rigid object has become an accessible task with the introduction of consumer-grade RGB-D sensors. An accurate, fast and robust method that solves this task will have a big impact in application fields such as robotics or augmented reality.

Many methods for 6D object pose estimation have been published recently, e.g. [34,24,18,2,36,21,27,25], but it is unclear which methods perform well and in which scenarios. The most commonly used dataset for evaluation was created by Hinterstoisser et al. [14], which was not intended as a general benchmark and has several limitations: the lighting conditions are constant and the objects are easy to distinguish, unoccluded and located around the image center. Since then, some of the limitations have been addressed. Brachmann et al. [1] added ground-truth annotation for occluded objects in the dataset of [14]. Hodaň et al. [16] created a dataset that features industry-relevant objects with symmetries and similarities, and Drost et al. [8] introduced a dataset containing objects with reflective surfaces. However, the datasets have different formats and no standard evaluation methodology has emerged. New methods are usually compared with only a few competitors on a small subset of datasets.

∗Authors have been leading the project jointly.



[Fig. 1 panels: example test images and texture-mapped 3D object models from T-LESS (30 objects), LM/LM-O (15 objects), IC-MI/IC-BIN (6 objects), TYO-L (21 objects), TUD-L (3 objects, new) and RU-APC (14 objects).]

Fig. 1. A collection of benchmark datasets. Top: Example test RGB-D images where the second row shows the images overlaid with 3D object models in the ground-truth 6D poses. Bottom: Texture-mapped 3D object models. At training time, a method is given an object model or a set of training images with ground-truth object poses. At test time, the method is provided with one test image and an identifier of the target object. The task is to estimate the 6D pose of an instance of this object.

This work makes the following contributions:

1. Eight datasets in a unified format, including two new datasets focusing on varying lighting conditions, are made available (Fig. 1). The datasets contain: i) texture-mapped 3D models of 89 objects with a wide range of sizes, shapes and reflectance properties, ii) 277K training RGB-D images showing isolated objects from different viewpoints, and iii) 62K test RGB-D images of scenes with graded complexity. High-quality ground-truth 6D poses of the modeled objects are provided for all images.

2. An evaluation methodology based on [17] that includes the formulation of an industry-relevant task, and a pose-error function which deals well with pose ambiguity of symmetric or partially occluded objects, in contrast to the commonly used function by Hinterstoisser et al. [14].

3. A comprehensive evaluation of 15 methods on the benchmark datasets using the proposed evaluation methodology. We provide an analysis of the results, report the state of the art, and identify open problems.

4. An online evaluation system at bop.felk.cvut.cz that allows for continuous submission of new results and provides up-to-date leaderboards.


1.1 Related Work

The progress of research in computer vision has been strongly influenced by challenges and benchmarks, which make it possible to evaluate and compare methods and to better understand their limitations. The Middlebury benchmark [31,32] for depth from stereo and optical flow estimation was one of the first that gained large attention. The PASCAL VOC challenge [10], based on a photo collection from the internet, was the first to standardize the evaluation of object detection and image classification. It was followed by the ImageNet challenge [29], which has been running for eight years, starting in 2010, and has pushed image classification methods to new levels of accuracy. The key was a large-scale dataset that enabled training of deep neural networks, which then quickly became a game-changer for many other tasks [23]. With the increasing maturity of computer vision methods, recent benchmarks have moved to real-world scenarios. A great example is the KITTI benchmark [11] focusing on problems related to autonomous driving. It showed that methods ranking high on established benchmarks, such as Middlebury, perform below average when moved outside the laboratory conditions.

Unlike the PASCAL VOC and ImageNet challenges, the task considered in this work requires a specific set of calibrated modalities that cannot be easily acquired from the internet. In contrast to KITTI, it was not necessary to record large amounts of new data. By combining existing datasets, we have covered many practical scenarios. Additionally, we created two datasets with varying lighting conditions, which is an aspect not covered by the existing datasets.

2 Evaluation Methodology

The proposed evaluation methodology formulates the 6D object pose estimation task and defines a pose-error function which is compared with the commonly used function by Hinterstoisser et al. [13].

2.1 Formulation of the Task

Methods for 6D object pose estimation report their predictions on the basis of two sources of information. Firstly, at training time, a method is given a training set T = {T_o}_{o=1}^n, where o is an object identifier. Training data T_o may have different forms, e.g. a 3D mesh model of the object or a set of RGB-D images showing object instances in known 6D poses. Secondly, at test time, the method is provided with a test target defined by a pair (I, o), where I is an image showing at least one instance of object o. The goal is to estimate the 6D pose of one of the instances of object o visible in image I.

If multiple instances of the same object model are present, then the pose of an arbitrary instance may be reported. If multiple object models are shown in a test image, and annotated with their ground-truth poses, then each object model may define a different test target. For example, if a test image shows three object models, each in two instances, then we define three test targets. For each test target, the pose of one of the two object instances has to be estimated.
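To make the protocol concrete, a minimal Python sketch of how a method could be queried per test target; the names and the estimate() interface are hypothetical illustrations, not part of the benchmark API:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TestTarget:
    image_id: int   # identifies the test image I
    object_id: int  # identifier o of the target object

def run_method(method, targets: List[TestTarget], images: Dict[int, dict]):
    """For each test target, the method reports a single 6D pose (R, t)
    of an arbitrary visible instance of the target object."""
    estimates = []
    for target in targets:
        image = images[target.image_id]              # RGB-D test image I
        R, t = method.estimate(image, target.object_id)
        estimates.append((target, R, t))
    return estimates
```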


This task reflects the industry-relevant bin-picking scenario where a robot needs to grasp a single arbitrary instance of the required object, e.g. a component such as a bolt or nut, and perform some operation with it. It is the simplest variant of the 6D localization task [17] and a common denominator of its other variants, which deal with a single instance of multiple objects, multiple instances of a single object, or multiple instances of multiple objects. It is also the core of the 6D detection task, where no prior information about the object presence in the test image is provided [17].

2.2 Measuring Error

A 3D object model is defined as a set of vertices in R^3 and a set of polygons that describe the object surface. The object pose is represented by a 4×4 matrix P = [R, t; 0, 1], where R is a 3×3 rotation matrix and t is a 3×1 translation vector. The matrix P transforms a 3D homogeneous point x_m in the model coordinate system to a 3D point x_c in the camera coordinate system: x_c = P x_m.
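A minimal NumPy sketch of this transformation (illustrative, not part of the benchmark code):

```python
import numpy as np

def apply_pose(R: np.ndarray, t: np.ndarray, pts_m: np.ndarray) -> np.ndarray:
    """Transform Nx3 model points into the camera coordinate system.

    Equivalent to x_c = P x_m with P = [R, t; 0, 1] applied to homogeneous points.
    """
    P = np.eye(4)
    P[:3, :3] = R          # 3x3 rotation
    P[:3, 3] = t.ravel()   # 3x1 translation
    pts_h = np.hstack([pts_m, np.ones((pts_m.shape[0], 1))])  # to homogeneous
    return (P @ pts_h.T).T[:, :3]
```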

Visible Surface Discrepancy. To calculate the error of an estimated pose P̂ w.r.t. the ground-truth pose P̄ in a test image I, an object model M is first rendered in the two poses. The result of the rendering is two distance maps¹ Ŝ and S̄. As in [17], the distance maps are compared with the distance map S_I of the test image I to obtain the visibility masks V̂ and V̄, i.e. the sets of pixels where the model M is visible in the image I (Fig. 2). Given a misalignment tolerance τ, the error is calculated as:

eVSD(Ŝ, S̄, S_I, V̂, V̄, τ) = avg_{p ∈ V̂ ∪ V̄} { 0  if p ∈ V̂ ∩ V̄ ∧ |Ŝ(p) − S̄(p)| < τ
                                               { 1  otherwise.                          (1)
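A minimal NumPy sketch of Eq. (1), assuming the distance maps and visibility masks have already been computed (function and variable names are illustrative):

```python
import numpy as np

def evsd(S_est, S_gt, V_est, V_gt, tau):
    """Visible Surface Discrepancy, following Eq. (1).

    S_est, S_gt: distance maps (HxW) of the model rendered in the estimated
    and the ground-truth pose; V_est, V_gt: boolean visibility masks (HxW);
    tau: misalignment tolerance in the units of the distance maps.
    """
    union = V_est | V_gt
    if not union.any():
        return 1.0  # assumption made for this sketch: empty union counts as maximal error
    inter = V_est & V_gt
    ok = inter & (np.abs(S_est - S_gt) < tau)  # pixels with cost 0
    return 1.0 - ok.sum() / union.sum()        # average cost over the union
```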

Properties of eVSD. The object pose can be ambiguous, i.e. there can be multiple poses that are indistinguishable. This is caused by the existence of multiple fits of the visible part of the object surface to the entire object surface. The visible part is determined by self-occlusion and occlusion by other objects, and the multiple surface fits are induced by global or partial object symmetries.

Pose error eVSD is calculated only over the visible part of the model surface and thus the indistinguishable poses are treated as equivalent. This is a desirable property which is not provided by the pose-error functions commonly used in the literature [17], including eADD and eADI discussed below. Like the commonly used pose-error functions, eVSD does not consider color information.

Definition (1) is different from the original definition in [17], where the pixel-wise cost linearly increases to 1 as |Ŝ(p) − S̄(p)| increases to τ. The new definition is easier to interpret and does not penalize small distance differences that may be caused by imprecisions of the depth sensor or of the ground-truth pose.

¹ A distance map stores at a pixel p the distance from the camera center to the 3D point x_p that projects to p. It can be readily computed from the depth map, which stores at p the Z coordinate of x_p and which can be obtained by a Kinect-like sensor.
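The depth-to-distance conversion mentioned in the footnote can be sketched as follows, assuming a pinhole camera with intrinsics fx, fy, cx, cy:

```python
import numpy as np

def depth_to_distance_map(depth, fx, fy, cx, cy):
    """Convert a depth map (Z coordinates, HxW) to a distance map, i.e. the
    Euclidean distance from the camera center to the 3D point at each pixel."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.sqrt(x**2 + y**2 + depth**2)
```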



Fig. 2. Quantities used in the calculation of eVSD. Left: Color channels RGB_I (only for illustration) and distance map S_I of a test image I. Right: Distance maps Ŝ and S̄ are obtained by rendering the object model M at the estimated pose P̂ and the ground-truth pose P̄ respectively. V̂ and V̄ are masks of the model surface that is visible in I, obtained by comparing Ŝ and S̄ with S_I. Distance differences S_Δ(p) = Ŝ(p) − S̄(p), ∀p ∈ V̂ ∩ V̄, are used for the pixel-wise evaluation of the surface alignment.

[Fig. 3 data, per panel: eVSD (bold) and eADI/θAD in mm: (a) 0.04, 3.7/15.2; (b) 0.08, 3.6/10.9; (c) 0.11, 3.2/13.4; (d) 0.19, 1.0/6.4; (e) 0.28, 1.4/7.7; (f) 0.34, 2.1/6.4; (g) 0.40, 2.1/8.6; (h) 0.44, 4.8/21.7; (i) 0.47, 4.8/9.2; (j) 0.54, 6.9/10.8; (k) 0.57, 6.9/8.9; (l) 0.64, 21.0/21.7; (m) 0.66, 4.4/6.5; (n) 0.76, 8.8/9.9; (o) 0.89, 49.4/11.1; (p) 0.95, 32.8/10.8.]

Fig. 3. Comparison of eVSD (bold, τ = 20 mm) with eADI/θAD (mm) on example pose estimates sorted by increasing eVSD. Top: Cropped and brightened test images overlaid with renderings of the model at i) the estimated pose P̂ in blue, and ii) the ground-truth pose P̄ in green. Only the part of the model surface that falls into the respective visibility mask is shown. Bottom: Difference maps S_Δ. Case (b) is analyzed in Fig. 2.

Criterion of Correctness. An estimated pose P̂ is considered correct w.r.t. the ground-truth pose P̄ if the error eVSD < θ. If multiple instances of the target object are visible in the test image, the estimated pose is compared to the ground-truth instance that minimizes the error. The choice of the misalignment tolerance τ and the correctness threshold θ depends on the target application. For robotic manipulation, where a robotic arm operates in 3D space, both τ and θ need to be low, e.g. τ = 20 mm, θ = 0.3, which is the default setting in the evaluation presented in Sec. 5. The requirement is different for augmented reality applications. Here the surface alignment in the Z dimension, i.e. along the optical axis of the camera, is less important than the alignment in the X and Y dimensions. The tolerance τ can therefore be relaxed, but θ needs to stay low.
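Building on the evsd sketch above, the correctness criterion could be checked as follows; this is an illustrative sketch in which gt_renderings is an assumed list of per-instance ground-truth renderings:

```python
def is_correct(S_est, V_est, gt_renderings, tau, theta=0.3):
    """The estimate is correct if eVSD < theta w.r.t. the ground-truth instance
    that minimizes the error. `gt_renderings` is a list of (S_gt, V_gt) pairs,
    one per ground-truth instance of the target object; tau is in the same
    units as the distance maps (e.g. 20 for millimeters)."""
    errors = [evsd(S_est, S_gt, V_est, V_gt, tau)
              for S_gt, V_gt in gt_renderings]
    return min(errors) < theta
```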


Comparison to Hinterstoisser et al. In [14], the error is calculated as the average distance from vertices of the model M in the ground-truth pose P̄ to vertices of M in the estimated pose P̂. The distance is measured to the position of the same vertex if the object has no indistinguishable views (eADD), otherwise to the position of the closest vertex (eADI). The estimated pose P̂ is considered correct if e ≤ θAD = 0.1d, where e is eADD or eADI, and d is the object diameter, i.e. the largest distance between any pair of model vertices.

Error eADI can be unintuitively low because of the many-to-one vertex matching established by the search for the closest vertex. This is shown in Fig. 3, which compares eVSD and eADI on example pose estimates of objects that have indistinguishable views. Overall, (f)-(n) yield low eADI scores and satisfy the correctness criterion of Hinterstoisser et al. These estimates are not considered correct by our criterion. Estimates (a)-(e) are considered correct and (o)-(p) are considered wrong by both criteria.
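For reference, a minimal NumPy/SciPy sketch of eADD and eADI as defined above (illustrative, not the code used for the evaluation):

```python
import numpy as np
from scipy.spatial import cKDTree

def e_add(R_est, t_est, R_gt, t_gt, pts):
    """eADD: average distance between corresponding model vertices (Nx3 pts)."""
    p_est = pts @ R_est.T + t_est
    p_gt = pts @ R_gt.T + t_gt
    return np.linalg.norm(p_est - p_gt, axis=1).mean()

def e_adi(R_est, t_est, R_gt, t_gt, pts):
    """eADI: average distance to the closest vertex (for objects with
    indistinguishable views); many-to-one vertex matches are allowed."""
    p_est = pts @ R_est.T + t_est
    p_gt = pts @ R_gt.T + t_gt
    nn_dist, _ = cKDTree(p_est).query(p_gt, k=1)
    return nn_dist.mean()
```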

3 Datasets

We collected six publicly available datasets, some of which we reduced to remove redundancies² and re-annotated to ensure a high quality of the ground truth. Additionally, we created two new datasets focusing on varying lighting conditions, since this variation is not present in the existing datasets. An overview of the datasets is in Fig. 1 and a detailed description follows.

3.1 Training and Test Data

The datasets consist of texture-mapped 3D object models and of training and test RGB-D images annotated with ground-truth 6D object poses. The 3D object models were created using KinectFusion-like systems for 3D surface reconstruction [26,33]. All images are of approximately VGA resolution.

For training, a method may use the 3D object models and/or the training images. While 3D models are often available or can be generated at a low cost, capturing and annotating real training images requires a significant effort. The benchmark is therefore focused primarily on the more practical scenario where only the object models, which can be used to render synthetic training images, are available at training time. All datasets contain already synthesized training images. Methods are allowed to synthesize additional training images, but this option was not utilized for the evaluation in this paper. Only T-LESS and TUD-L include real training images of isolated, i.e. non-occluded, objects.

To generate the synthetic training images, objects from the same dataset were rendered from the same range of azimuth/elevation covering the distribution of object poses in the test scenes. The viewpoints were sampled from a sphere, as in [14], with the sphere radius set to the distance of the closest object instance in the test scenes. The objects were rendered with fixed lighting conditions and a black background.
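A minimal sketch of such viewpoint sampling, using a uniform azimuth/elevation grid for simplicity (the sampling in [14] is based on a subdivided icosahedron, so this is only an approximation of the procedure):

```python
import numpy as np

def sample_viewpoints(radius, n_azimuth=36, n_elevation=9,
                      elev_range=(0.0, np.pi / 2)):
    """Illustrative sampling of camera positions on a sphere around the object,
    parameterized by azimuth and elevation."""
    azimuths = np.linspace(0.0, 2 * np.pi, n_azimuth, endpoint=False)
    elevations = np.linspace(elev_range[0], elev_range[1], n_elevation)
    cams = []
    for el in elevations:
        for az in azimuths:
            cams.append(radius * np.array([np.cos(el) * np.cos(az),
                                           np.cos(el) * np.sin(az),
                                           np.sin(el)]))
    return np.stack(cams)  # (n_elevation * n_azimuth, 3) camera centers
```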

² Identifiers of the selected images are available on the project website.


Dataset | Objects | Training images/obj. (Real / Synt.) | Test images (Used / All) | Test targets (Used / All)
LM [14] | 15 | – / 1313 | 3000 / 18273 | 3000 / 18273
LM-O [1] | 8 | – / 1313 | 200 / 1214 | 1445 / 8916
IC-MI [34] | 6 | – / 1313 | 300 / 2067 | 300 / 2067
IC-BIN [7] | 2 | – / 2377 | 150 / 177 | 200 / 238
T-LESS [16] | 30 | 1296 / 2562 | 2000 / 10080 | 9819 / 49805
RU-APC [28] | 14 | – / 2562 | 1380 / 5964 | 1380 / 5911
TUD-L (new) | 3 | >11000 / 1827 | 600 / 23914 | 600 / 23914
TYO-L (new) | 21 | – / 2562 | – / 1680 | – / 1669
Total | 89 | | 7450 / 62155 | 16951 / 110793

Table 1. Parameters of the datasets. Note that if a test image shows multiple object models, each model defines a different test target – see Sec. 2.1.


The test images are real images from a structured-light sensor – Microsoft Kinect v1 or Primesense Carmine 1.09. The test images originate from indoor scenes with varying complexity, ranging from simple scenes with a single isolated object instance to very challenging scenes with multiple instances of several objects and a high amount of clutter and occlusion. Poses of the modeled objects were annotated manually. While LM, IC-MI and RU-APC provide annotation for instances of only one object per image, the other datasets provide ground truth for all modeled objects. Details of the datasets are in Tab. 1.

3.2 The Dataset Collection

LM/LM-O [14,1]. LM (a.k.a. Linemod) has been the most commonly used dataset for 6D object pose estimation. It contains 15 texture-less household objects with discriminative color, shape and size. Each object is associated with a test image set showing one annotated object instance with significant clutter but only mild occlusion. LM-O (a.k.a. Linemod-Occluded) provides ground-truth annotation for all other instances of the modeled objects in one of the test sets. This introduces challenging test cases with various levels of occlusion.

IC-MI/IC-BIN [34,7]. IC-MI (a.k.a. Tejani et al.) contains models of two texture-less and four textured household objects. The test images show multiple object instances with clutter and slight occlusion. IC-BIN (a.k.a. Doumanoglou et al., scenario 2) includes test images of two objects from IC-MI, which appear in multiple locations with heavy occlusion in a bin-picking scenario. We have removed test images with low-quality ground-truth annotations from both datasets, and refined the annotations for the remaining images in IC-BIN.

T-LESS [16]. It features 30 industry-relevant objects with no significant texture or discriminative color. The objects exhibit symmetries and mutual similarities in shape and/or size, and a few objects are a composition of other objects. T-LESS includes images from three different sensors and two types of 3D object models. For our evaluation, we only used RGB-D images from the Primesense sensor and the automatically reconstructed 3D object models.


RU-APC [28]. This dataset (a.k.a. Rutgers APC) includes 14 textured products from the Amazon Picking Challenge 2015 [6], each associated with test images of a cluttered warehouse shelf. The camera was equipped with LED strips to ensure constant lighting. From the original dataset, we omitted ten objects which are non-rigid or poorly captured by the depth sensor, and included only one from the four images captured from the same viewpoint.

TUD-L/TYO-L. Two new datasets with household objects captured under different settings of ambient and directional light. TUD-L (TU Dresden Light) contains training and test image sequences that show three moving objects under eight lighting conditions. The object poses were annotated by manually aligning the 3D object model with the first frame of the sequence and propagating the initial pose through the sequence using ICP. TYO-L (Toyota Light) contains 21 objects, each captured in multiple poses on a table-top setup, with four different table cloths and five different lighting conditions. To obtain the ground-truth poses, manually chosen correspondences were utilized to estimate rough poses, which were then refined by ICP. The images in both datasets are labeled by categorized lighting conditions.

4 Evaluated Methods

The evaluated methods cover the major research directions of the 6D object pose estimation field. This section provides a review of the methods, together with a description of the setting of their key parameters. If not stated otherwise, the image-based methods used the synthetic training images.

4.1 Learning-Based Methods

Brachmann-14 [1]. For each pixel of an input image, a regression forest predicts the object identity and the location in the coordinate frame of the object model, a so-called "object coordinate". Simple RGB and depth difference features are used for the prediction. Each object coordinate prediction defines a 3D-3D correspondence between the image and the 3D object model. A RANSAC-based optimization schema samples sets of three correspondences to create a pool of pose hypotheses. The final hypothesis is chosen, and iteratively refined, to maximize the alignment of predicted correspondences, as well as the alignment of observed depth with the object model. The main parameters of the method were set as follows: maximum feature offset: 20 px, features per tree node: 1000, training patches per object: 1.5M, number of trees: 3, size of the hypothesis pool: 210, refined hypotheses: 25. Real training images were used for TUD-L and T-LESS.
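For illustration, the hypothesis-generation step of such a RANSAC scheme, i.e. recovering a rigid pose from three (or more) sampled 3D-3D correspondences, can be sketched with the Kabsch algorithm; this is a generic sketch, not the authors' implementation:

```python
import numpy as np

def pose_from_correspondences(pts_model, pts_camera):
    """Least-squares rigid pose (R, t) from Nx3 3D-3D correspondences
    (N >= 3, non-collinear), such that pts_camera ~ R @ pts_model + t."""
    mu_m = pts_model.mean(axis=0)
    mu_c = pts_camera.mean(axis=0)
    H = (pts_model - mu_m).T @ (pts_camera - mu_c)   # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # fix reflection
    R = Vt.T @ D @ U.T
    t = mu_c - R @ mu_m
    return R, t
```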

Brachmann-16 [2]. The method of [1] is extended in several ways. Firstly, the random forest is improved using an auto-context algorithm to support pose estimation from RGB-only images. Secondly, the RANSAC-based optimization hypothesizes not only with regard to the object pose but also with regard to the object identity in cases where it is unknown which objects are visible in the input image. Both improvements were disabled for the evaluation since we deal with RGB-D input, and it is known which objects are visible in the image. Thirdly, the random forest predicts for each pixel a full, three-dimensional distribution over object coordinates, capturing uncertainty information. The distributions are estimated using mean-shift in each forest leaf, and can therefore be heavily multi-modal. The final hypothesis is chosen, and iteratively refined, to maximize the likelihood under the predicted distributions. The 3D object model is not used for fitting the pose. The parameters were set as: maximum feature offset: 10 px, features per tree node: 100, number of trees: 3, number of sampled hypotheses: 256, pixels drawn in each RANSAC iteration: 10K, inlier threshold: 1 cm.

Tejani-14 [34]. Linemod [14] is adapted into a scale-invariant patch descriptor and integrated into a regression forest with a new template-based split function. This split function is more discriminative than simple pixel tests and is accelerated via binary bit-operations. The method is trained on positive samples only, i.e. rendered images of the 3D object model. During the inference, the class distributions at the leaf nodes are iteratively updated, providing occlusion-aware segmentation masks. The object pose is estimated by accumulating pose regression votes from the estimated foreground patches. The baseline evaluated in this paper implements [34] but omits the iterative segmentation/refinement step and does not perform ICP. The features and forest parameters were set as in [34]: number of trees: 10, maximum depth of each tree: 25, number of features in both the color gradient and the surface normal channel: 20, patch size: 1/2 the image, rendered images used to train each forest: 360.

Kehl-16 [22]. Scale-invariant RGB-D patches are extracted from a regular grid attached to the input image, and described by features calculated using a convolutional auto-encoder. At training time, a codebook is constructed from descriptors of patches from the training images, with each codebook entry holding information about the 6D pose. For each patch descriptor from the test image, k-nearest neighbors from the codebook are found, and a 6D vote is cast using neighbors whose distance is below a threshold t. After the voting stage, the 6D hypothesis space is filtered to remove spurious votes. Modes are identified by mean-shift and refined by ICP. The final hypothesis is verified in color, depth and surface normals to suppress false positives. The main parameters of the method with the used values: patch size: 32×32 px, patch sampling step: 6 px, k-nearest neighbors: 3, threshold t: 2, number of extracted modes from the pose space: 8. Real training images were used for T-LESS.

4.2 Template Matching Methods

Hodan-15 [18]. A template matching method that applies an efficient cascade-style evaluation to each sliding window location. A simple objectness filter is applied first, rapidly rejecting most locations. For each remaining location, a set of candidate templates is identified by a voting procedure based on hashing, which makes the computational complexity largely unaffected by the total number of stored templates. The candidate templates are then verified as in Linemod [14] by matching feature points in different modalities (surface normals, image gradients, depth, color). Finally, object poses associated with the detected templates are refined by particle swarm optimization (PSO). The templates were generated by applying the full circle of in-plane rotations with a 10° step to a portion of the synthetic training images, resulting in 11–23K templates per object. Other parameters were set as described in [18]. We also present results without the last refinement step (Hodan-15-nr).

4.3 Methods Based on Point-Pair Features

Drost-10 [9]. A method based on matching oriented point pairs between the point cloud of the test scene and the object model, and grouping the matches using a local voting scheme. At training time, point pairs from the model are sampled and stored in a hash table. At test time, reference points are fixed in the scene, and a low-dimensional parameter space for the voting scheme is created by restricting to those poses that align the reference point with the model. Point pairs between the reference point and other scene points are created, similar model point pairs are searched for using the hash table, and a vote is cast for each matching point pair. Peaks in the accumulator space are extracted and used as pose candidates, which are refined by coarse-to-fine ICP and re-scored by the relative amount of visible model surface. Note that color information is not used. It was evaluated using the function find_surface_model from HALCON 13.0.2 [12]. The sampling distances for model and scene were set to 3% of the object diameter, 10% of points were used as the reference points, and the normals were computed using the mls method. Points further than 2 m were discarded.
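For reference, the oriented point-pair feature that this family of methods builds on can be sketched as follows (the quantization step sizes are illustrative):

```python
import numpy as np

def point_pair_feature(p1, n1, p2, n2):
    """Point-pair feature F(m1, m2) = (||d||, angle(n1, d), angle(n2, d),
    angle(n1, n2)) for two oriented points, as used in [9].
    Assumes p1 != p2 and non-zero normals."""
    d = p2 - p1
    dist = np.linalg.norm(d)

    def angle(a, b):
        a = a / np.linalg.norm(a)
        b = b / np.linalg.norm(b)
        return np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))

    return dist, angle(n1, d), angle(n2, d), angle(n1, n2)

def quantize(feature, dist_step, angle_step=np.radians(12)):
    """Illustrative discretization of the feature, used as a hash-table key."""
    d, a1, a2, a3 = feature
    return (int(d / dist_step), int(a1 / angle_step),
            int(a2 / angle_step), int(a3 / angle_step))
```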

Drost-10-edge. An extension of [9] which additionally detects 3D edges in the scene and favors poses in which the model contours are aligned with the edges. A multi-modal refinement minimizes the surface distances and the distances of reprojected model contours to the detected edges. The evaluation was performed using the same software and parameters as Drost-10, but with the parameter train_3d_edges activated during the model creation.

Vidal-18 [35]. The point cloud is first sub-sampled by clustering points based on the surface normal orientation. Inspired by improvements of [15], the matching strategy of [9] was improved by mitigating the effect of the feature discretization step. Additionally, an improved non-maximum suppression of the pose candidates from different reference points removes spurious matches. The 500 most-voted pose candidates are sorted by a surface fitting score and the 200 best candidates are refined by projective ICP. For the final 10 candidates, the consistency of the object surface and silhouette with the scene is evaluated. The sampling distance for model, scene and features was set to 5% of the object diameter, and 20% of the scene points were used as the reference points.


4.4 Methods Based on 3D Local Features

Buch-16 [3]. A RANSAC-based method that iteratively samples three feature correspondences between the object model and the scene. The correspondences are obtained by matching 3D local shape descriptors and are used to generate a 6D pose candidate, whose quality is measured by the consensus set size. The final pose is refined by ICP. The method achieved state-of-the-art results on earlier object recognition datasets captured by LIDAR, but suffers from a cubic complexity in the number of correspondences. The number of RANSAC iterations was set to 10000, allowing only for a limited search in cluttered scenes. The method was evaluated with several descriptors: 153d SI [19], 352d SHOT [30], 30d ECSAD [20], and 1536d PPFH [5]. None of the descriptors utilize color.

Buch-17 [4]. This method is based on the observation that a correspondence between two oriented points on the object surface is constrained to cast votes in a 1-DoF rotational subgroup of the full group of poses, SE(3). The time complexity of the method is thus linear in the number of correspondences. Kernel density estimation is used to efficiently combine the votes and generate a 6D pose estimate. As Buch-16, the method relies on 3D local shape descriptors and refines the final pose estimate by ICP. The parameters were set as in the paper: 60 angle tessellations were used for casting rotational votes, and the translation/rotation bandwidths were set to 10 mm/22.5°.

5 Evaluation

The methods reviewed in Sec. 4 were evaluated by their original authors on the datasets described in Sec. 3, using the evaluation methodology from Sec. 2.

5.1 Experimental Setup

Fixed Parameters. The parameters of each method were fixed for all objects and datasets. The distribution of object poses in the test scenes was the only dataset-specific information used by the methods. The distribution determined the range of viewpoints from which the object models were rendered to obtain synthetic training images.

Pose Error. The error of a 6D object pose estimate is measured with the pose-error function eVSD defined in Sec. 2.2. The visibility masks were calculated as in [17], with the occlusion tolerance δ set to 15 mm. Only the ground-truth poses in which at least 10% of the object is visible were considered in the evaluation.

Performance Score. The performance is measured by the recall score, i.e. the fraction of test targets for which a correct object pose was estimated. Recall scores per dataset and per object are reported. The overall performance is given by the average of the per-dataset recall scores. We thus treat each dataset as a separate challenge and avoid the overall score being dominated by larger datasets.
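A minimal sketch of this scoring scheme (the input format is an assumption made for the example):

```python
from collections import defaultdict

def average_recall(results):
    """Overall score as the average of per-dataset recall scores.

    `results` is a list of (dataset_name, is_correct) pairs, one entry per
    test target.
    """
    per_dataset = defaultdict(list)
    for dataset, correct in results:
        per_dataset[dataset].append(correct)
    recalls = {d: sum(v) / len(v) for d, v in per_dataset.items()}
    return sum(recalls.values()) / len(recalls), recalls
```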


Subsets Used for the Evaluation. We reduced the number of test images to remove redundancies and to encourage participation of new, in particular slow, methods. From the total of 62K test images, we sub-sampled 7K, reducing the number of test targets from 110K to 17K (Tab. 1). Full datasets with identifiers of the selected test images are on the project website. TYO-L was not used for the evaluation presented in this paper, but it is a part of the online evaluation.

5.2 Results

Accuracy. Tab. 2 and 3 show the recall scores of the evaluated methods per dataset and per object respectively, for the misalignment tolerance τ = 20 mm and the correctness threshold θ = 0.3. The ranking of the methods according to the recall score is mostly stable across the datasets. Methods based on point-pair features perform best. Vidal-18 is the top-performing method with an average recall of 74.6%, followed by Drost-10-edge, Drost-10, and the template matching method Hodan-15, all with an average recall above 67%. Brachmann-16 is the best learning-based method, with 55.4%, and Buch-17-ppfh is the best method based on 3D local features, with 54.0%. Scores of Buch-16-si and Buch-16-shot are inferior to the other variants of these methods and are not presented.

Fig. 4 shows the average of the per-dataset recall scores for different values of τ and θ. If the misalignment tolerance τ is increased from 20 mm to 80 mm, the scores increase only slightly for most methods. Similarly, the scores increase only slowly for θ > 0.3. This suggests that poses estimated by most methods are either of a high quality or totally off, i.e. it is a hit or miss.

Speed. The average running times per test target are reported in Tab. 2. However, the methods were evaluated on different computers³ and thus the presented running times are not directly comparable. Moreover, the methods were optimized primarily for the recall score, not for speed. For example, we evaluated Drost-10 with several parameter settings and observed that the running time can be lowered by a factor of ~5 to 0.5 s with only a relatively small drop of the average recall score from 68.1% to 65.8%. However, in Tab. 2 we present the result with the highest score. Brachmann-14 could be sped up by sub-sampling the 3D object models, and Hodan-15 by using fewer object templates. A study of such speed/accuracy trade-offs is left for future work.

Open Problems. Occlusion is a big challenge for current methods, as shown by scores dropping swiftly already at low levels of occlusion (Fig. 4, right). The big gap between the LM and LM-O scores provides further evidence: all methods perform at least 30% better on LM than on LM-O, which includes the same but partially occluded objects. Inspection of estimated poses on T-LESS test images confirms the weak performance for occluded objects. Scores on TUD-L show that varying lighting conditions present a serious challenge for methods that rely on

³ Specifications of computers used for the evaluation are on the project website.


# | Method | LM | LM-O | IC-MI | IC-BIN | T-LESS | RU-APC | TUD-L | Average | Time (s)
1. | Vidal-18 | 87.83 | 59.31 | 95.33 | 96.50 | 66.51 | 36.52 | 80.17 | 74.60 | 4.7
2. | Drost-10-edge | 79.13 | 54.95 | 94.00 | 92.00 | 67.50 | 27.17 | 87.33 | 71.73 | 21.5
3. | Drost-10 | 82.00 | 55.36 | 94.33 | 87.00 | 56.81 | 22.25 | 78.67 | 68.06 | 2.3
4. | Hodan-15 | 87.10 | 51.42 | 95.33 | 90.50 | 63.18 | 37.61 | 45.50 | 67.23 | 13.5
5. | Brachmann-16 | 75.33 | 52.04 | 73.33 | 56.50 | 17.84 | 24.35 | 88.67 | 55.44 | 4.4
6. | Hodan-15-nr | 69.83 | 34.39 | 84.67 | 76.00 | 62.70 | 32.39 | 27.83 | 55.40 | 12.3
7. | Buch-17-ppfh | 56.60 | 36.96 | 95.00 | 75.00 | 25.10 | 20.80 | 68.67 | 54.02 | 14.2
8. | Kehl-16 | 58.20 | 33.91 | 65.00 | 44.00 | 24.60 | 25.58 | 7.50 | 36.97 | 1.8
9. | Buch-17-si | 33.33 | 20.35 | 67.33 | 59.00 | 13.34 | 23.12 | 41.17 | 36.81 | 15.9
10. | Brachmann-14 | 67.60 | 41.52 | 78.67 | 24.00 | 0.25 | 30.22 | 0.00 | 34.61 | 1.4
11. | Buch-17-ecsad | 13.27 | 9.62 | 40.67 | 59.00 | 7.16 | 6.59 | 24.00 | 22.90 | 5.9
12. | Buch-17-shot | 5.97 | 1.45 | 43.00 | 38.50 | 3.83 | 0.07 | 16.67 | 15.64 | 6.7
13. | Tejani-14 | 12.10 | 4.50 | 36.33 | 10.00 | 0.13 | 1.52 | 0.00 | 9.23 | 1.4
14. | Buch-16-ppfh | 8.13 | 2.28 | 20.00 | 2.50 | 7.81 | 8.99 | 0.67 | 7.20 | 47.1
15. | Buch-16-ecsad | 3.70 | 0.97 | 3.67 | 4.00 | 1.24 | 2.90 | 0.17 | 2.38 | 39.1

Table 2. Recall scores (%) for τ = 20 mm and θ = 0.3. The recall score is the percentage of test targets for which a correct object pose was estimated. The methods are sorted by their average recall score, calculated as the average of the per-dataset recall scores. The right-most column shows the average running time per test target.

Method | LM objects 1-15 | LM-O objects 1, 5, 6, 8, 9, 10, 11, 12 | TUD-L objects 1-3
1. Vidal-18 | 89 96 91 94 92 96 89 89 87 97 59 69 93 92 90 | 66 81 46 65 73 43 26 64 | 79 88 74
2. Drost-10-edge | 77 97 94 40 98 94 83 96 45 94 68 66 72 88 79 | 47 82 46 75 42 44 36 57 | 85 88 90
3. Drost-10 | 86 83 89 84 93 87 86 92 66 96 53 67 79 91 80 | 62 75 39 70 57 46 26 57 | 73 90 74
4. Hodan-15 | 91 97 79 97 91 97 73 69 90 97 81 79 99 74 95 | 54 66 40 26 73 37 44 68 | 27 63 48
5. Brachmann-16 | 92 93 76 84 86 90 44 72 85 79 46 67 94 60 66 | 64 65 44 68 71 3 32 61 | 81 95 91
6. Hodan-15-nr | 91 57 40 89 66 87 59 49 92 90 65 63 71 54 79 | 47 35 24 12 63 9 32 53 | 12 52 20
7. Buch-17-ppfh | 77 65 0 94 84 60 24 59 75 67 24 39 75 47 62 | 59 63 18 35 60 17 5 30 | 55 89 63
8. Kehl-16 | 60 52 81 25 79 68 17 68 42 91 45 42 78 83 46 | 39 47 24 30 48 14 13 49 | 0 23 0
9. Buch-17-si | 40 43 1 63 81 47 12 8 36 43 18 3 46 19 43 | 54 63 11 2 16 9 1 3 | 2 74 48
10. Brachmann-14 | 74 70 77 75 88 66 11 81 69 66 50 75 92 75 49 | 50 48 27 44 60 6 30 62 | 0 0 0
11. Buch-17-ecsad | 31 2 2 19 66 3 3 0 9 49 1 0 3 7 6 | 29 29 0 0 7 8 1 0 | 1 62 10
12. Buch-17-shot | 3 4 11 9 9 4 1 3 2 10 1 0 10 12 14 | 2 7 0 0 1 1 1 0 | 1 33 17
13. Tejani-14 | 36 0 36 0 1 0 1 11 1 70 27 0 0 0 0 | 26 2 0 1 0 0 10 0 | 0 0 0
14. Buch-16-ppfh | 11 0 1 22 3 7 2 7 18 12 4 3 9 12 14 | 4 0 0 2 11 1 1 1 | 2 0 0
15. Buch-16-ecsad | 2 0 0 9 5 0 0 4 5 8 0 0 17 3 5 | 1 3 0 2 2 0 0 0 | 0 1 0

Method | IC-MI objects 1-6 | IC-BIN objects 2, 4 | T-LESS objects 1-18
1. Vidal-18 | 80 100 100 98 100 94 | 100 93 | 43 46 68 65 69 71 76 76 92 69 68 84 55 47 54 85 82 79
2. Drost-10-edge | 78 100 100 100 90 96 | 100 84 | 53 44 61 67 71 73 75 89 92 72 64 81 53 46 55 85 88 78
3. Drost-10 | 76 100 98 100 96 96 | 100 74 | 34 46 63 63 68 64 54 48 59 54 51 69 43 45 53 80 79 68
4. Hodan-15 | 100 100 100 74 98 100 | 100 81 | 66 67 72 72 61 60 52 61 86 72 56 55 54 21 59 81 81 79
5. Brachmann-16 | 42 98 70 88 64 78 | 84 29 | 8 10 21 4 46 19 52 22 12 7 3 3 0 0 0 5 3 54
6. Hodan-15-nr | 100 100 92 62 60 94 | 93 59 | 64 67 71 73 62 57 49 56 85 70 57 55 60 23 60 82 81 77
7. Buch-17-ppfh | 88 100 94 100 100 88 | 100 50 | 1 7 0 5 25 16 4 35 37 48 4 10 4 0 0 12 34 49
8. Kehl-16 | 22 100 70 72 96 30 | 71 17 | 7 10 18 24 23 10 0 2 11 17 5 1 0 9 12 56 52 22
9. Buch-17-si | 62 100 94 62 52 34 | 97 21 | 0 1 17 17 9 3 1 4 0 8 2 0 0 0 0 20 26 12
10. Brachmann-14 | 96 100 66 72 46 92 | 28 20 | 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 1 2
11. Buch-17-ecsad | 66 88 0 56 34 0 | 95 23 | 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 8
12. Buch-17-shot | 52 88 38 36 40 4 | 66 11 | 0 0 1 0 1 5 0 2 1 0 0 1 0 1 1 2 1 3
13. Tejani-14 | 42 36 0 40 26 74 | 4 16 | 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
14. Buch-16-ppfh | 28 34 20 6 24 8 | 4 1 | 1 6 3 1 24 4 10 13 10 13 3 8 1 0 0 5 32 13
15. Buch-16-ecsad | 4 4 8 4 2 0 | 5 3 | 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 2

Method | T-LESS objects 19-30 | RU-APC objects 1-14
1. Vidal-18 | 57 43 62 69 85 66 43 58 62 69 69 85 | 39 38 42 54 53 43 4 82 32 0 48 47 20 8
2. Drost-10-edge | 55 47 55 56 84 59 47 69 61 80 84 89 | 0 20 35 47 35 39 0 89 28 0 48 21 15 3
3. Drost-10 | 53 35 60 61 81 57 28 51 32 60 81 71 | 0 11 29 45 33 29 26 71 10 0 47 9 0 0
4. Hodan-15 | 59 27 57 50 74 59 47 72 45 73 74 85 | 4 36 59 24 47 46 52 97 28 28 34 52 17 0
5. Brachmann-16 | 38 1 39 19 61 1 16 27 17 13 6 5 | 6 64 25 21 32 41 47 37 1 0 18 40 0 5
6. Hodan-15-nr | 58 27 55 50 73 60 49 72 40 72 76 85 | 4 39 50 24 41 15 43 91 25 33 31 39 16 1
7. Buch-17-ppfh | 31 25 36 35 71 46 64 51 4 44 49 58 | 16 5 17 51 27 6 57 24 8 10 55 5 11 0
8. Kehl-16 | 35 5 26 27 71 36 28 51 34 54 86 69 | 19 14 46 38 54 40 4 80 3 5 3 37 7 5
9. Buch-17-si | 11 21 18 11 37 4 52 53 3 35 32 53 | 24 49 16 39 3 4 32 54 14 9 43 15 17 5
10. Brachmann-14 | 0 0 0 0 1 0 1 1 0 0 0 0 | 6 80 42 19 31 33 52 89 19 1 0 40 7 0
11. Buch-17-ecsad | 16 11 16 8 27 20 51 31 0 32 22 3 | 1 2 0 1 3 8 23 34 5 8 2 0 3 1
12. Buch-17-shot | 6 6 8 2 28 3 17 13 0 11 7 6 | 0 0 0 0 0 0 0 1 0 0 0 0 0 0
13. Tejani-14 | 0 0 0 0 0 0 0 0 0 0 0 2 | 1 0 0 3 9 0 0 5 0 0 0 3 0 0
14. Buch-16-ppfh | 3 3 8 8 16 2 24 4 5 11 6 1 | 0 0 6 19 2 12 34 8 0 0 38 2 5 0
15. Buch-16-ecsad | 2 1 3 0 10 0 12 1 2 4 1 1 | 0 3 5 0 1 1 11 13 0 0 3 2 0 1

Table 3. Recall scores (%) per object for τ = 20 mm and θ = 0.3.



Fig. 4. Left, middle: Average of the per-dataset recall scores for the misalignment tolerance τ fixed to 20 mm and 80 mm, and a varying value of the correctness threshold θ. The curves do not change much for τ > 80 mm. Right: The recall scores w.r.t. the visible fraction of the target object. If more instances of the target object were present in the test image, the largest visible fraction was considered.

synthetic training RGB images, which were generated with fixed lighting. Methods relying only on depth information (e.g. Vidal-18, Drost-10) are noticeably more robust under such conditions. Note that Brachmann-16 achieved a high score on TUD-L despite relying on RGB images because it used real training images, which were captured under the same range of lighting conditions as the test images. Methods based on 3D local features and learning-based methods have very low scores on T-LESS, which is likely caused by the object symmetries and similarities. All methods perform poorly on RU-APC, which is likely because of a higher level of noise in the depth images.

6 Conclusion

We have proposed a benchmark for 6D object pose estimation that includes eight datasets in a unified format, an evaluation methodology, a comprehensive evaluation of 15 recent methods, and an online evaluation system open for continuous submission of new results. With this benchmark, we have captured the status quo in the field and will be able to systematically measure its progress in the future. The evaluation showed that methods based on point-pair features perform best, outperforming template matching methods, learning-based methods and methods based on 3D local features. As open problems, our analysis identified occlusion, varying lighting conditions, and object symmetries and similarities.

Acknowledgements

We gratefully acknowledge Manolis Lourakis, Joachim Staib, Christoph Kick, Juil Sock and Pavel Haluza for their help. This work was supported by CTU student grant SGS17/185/OHK3/3T/13, Technology Agency of the Czech Republic research program TE01020415 (V3C – Visual Computing Competence Center), and the GAČR project No. 16-072105: Complex network methods applied to ancient Egyptian data in the Old Kingdom (2700–2180 BC).


References

1. Brachmann, E., Krull, A., Michel, F., Gumhold, S., Shotton, J., Rother, C.: Learning 6D object pose estimation using 3D object coordinates. In: ECCV (2014)
2. Brachmann, E., Michel, F., Krull, A., Yang, M.Y., Gumhold, S., Rother, C.: Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In: CVPR (2016)
3. Buch, A.G., Petersen, H.G., Krüger, N.: Local shape feature fusion for improved matching, pose estimation and 3D object recognition. SpringerPlus (2016)
4. Buch, A.G., Kiforenko, L., Kraft, D.: Rotational subgroup voting and pose clustering for robust 3D object recognition. In: ICCV (2017)
5. Buch, A.G., Kraft, D.: Local point pair feature histogram for accurate 3D matching. In: BMVC (2018)
6. Correll, N., Bekris, K.E., Berenson, D., Brock, O., Causo, A., Hauser, K., Okada, K., Rodriguez, A., Romano, J.M., Wurman, P.R.: Lessons from the Amazon Picking Challenge. ArXiv e-prints (2016)
7. Doumanoglou, A., Kouskouridas, R., Malassiotis, S., Kim, T.K.: Recovering 6D object pose and predicting next-best-view in the crowd. In: CVPR (2016)
8. Drost, B., Ulrich, M., Bergmann, P., Hartinger, P., Steger, C.: Introducing MVTec ITODD – a dataset for 3D object recognition in industry. In: ICCVW (2017)
9. Drost, B., Ulrich, M., Navab, N., Ilic, S.: Model globally, match locally: Efficient and robust 3D object recognition. In: CVPR (2010)
10. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL visual object classes (VOC) challenge. IJCV (2010)
11. Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The KITTI vision benchmark suite. In: CVPR (2012)
12. MVTec HALCON: https://www.mvtec.com/halcon/
13. Hinterstoisser, S., Cagniart, C., Ilic, S., Sturm, P., Navab, N., Fua, P., Lepetit, V.: Gradient response maps for real-time detection of texture-less objects. TPAMI (2012)
14. Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., Navab, N.: Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In: ACCV (2012)
15. Hinterstoisser, S., Lepetit, V., Rajkumar, N., Konolige, K.: Going further with point pair features. In: ECCV (2016)
16. Hodaň, T., Haluza, P., Obdržálek, Š., Matas, J., Lourakis, M., Zabulis, X.: T-LESS: An RGB-D dataset for 6D pose estimation of texture-less objects. In: WACV (2017)
17. Hodaň, T., Matas, J., Obdržálek, Š.: On evaluation of 6D object pose estimation. In: ECCVW (2016)
18. Hodaň, T., Zabulis, X., Lourakis, M., Obdržálek, Š., Matas, J.: Detection and fine 3D pose estimation of texture-less objects in RGB-D images. In: IROS (2015)
19. Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3D scenes. TPAMI 21(5) (1999)
20. Jørgensen, T.B., Buch, A.G., Kraft, D.: Geometric edge description and classification in point cloud data with application to 3D object recognition. In: VISAPP (2015)


21. Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N.: SSD-6D: Making RGB-based 3D detection and 6D pose estimation great again. In: ICCV (2017)
22. Kehl, W., Milletari, F., Tombari, F., Ilic, S., Navab, N.: Deep learning of local RGB-D patches for 3D object detection and 6D pose estimation. In: ECCV (2016)
23. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS (2012)
24. Krull, A., Brachmann, E., Michel, F., Ying Yang, M., Gumhold, S., Rother, C.: Learning analysis-by-synthesis for 6D pose estimation in RGB-D images. In: ICCV (2015)
25. Michel, F., Kirillov, A., Brachmann, E., Krull, A., Gumhold, S., Savchynskyy, B., Rother, C.: Global hypothesis generation for 6D object pose estimation. In: CVPR (2017)
26. Newcombe, R.A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A.J., Kohli, P., Shotton, J., Hodges, S., Fitzgibbon, A.: KinectFusion: Real-time dense surface mapping and tracking. In: ISMAR (2011)
27. Rad, M., Lepetit, V.: BB8: A scalable, accurate, robust to partial occlusion method for predicting the 3D poses of challenging objects without using depth. In: ICCV (2017)
28. Rennie, C., Shome, R., Bekris, K.E., De Souza, A.F.: A dataset for improved RGBD-based object detection and pose estimation for warehouse pick-and-place. Robotics and Automation Letters (2016)
29. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. IJCV (2015)
30. Salti, S., Tombari, F., Di Stefano, L.: SHOT: Unique signatures of histograms for surface and texture description. Computer Vision and Image Understanding (2014)
31. Scharstein, D., Szeliski, R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV (2002)
32. Scharstein, D., Pal, C.: Learning conditional random fields for stereo. In: CVPR (2007)
33. Steinbrücker, F., Sturm, J., Cremers, D.: Volumetric 3D mapping in real-time on a CPU. In: ICRA (2014)
34. Tejani, A., Tang, D., Kouskouridas, R., Kim, T.K.: Latent-class Hough forests for 3D object detection and pose estimation. In: ECCV (2014)
35. Vidal, J., Lin, C.Y., Martí, R.: 6D pose estimation using an improved method based on point pair features. In: ICCAR (2018)
36. Wohlhart, P., Lepetit, V.: Learning descriptors for object recognition and 3D pose estimation. In: CVPR (2015)