Geometrically Consistent Plane Extraction for …trung/lib/exe/fetch.php?... · 2016-07-31 ·...

6
Geometrically Consistent Plane Extraction for Dense Indoor 3D Maps Segmentation Trung T. Pham 1 , Markus Eich 2 , Ian Reid 1 , Gordon Wyeth 2 Abstract— Modern SLAM systems with a depth sensor are able to reliably reconstruct dense 3D geometric maps of indoor scenes. Representing these maps in terms of meaningful entities is a step towards building semantic maps for autonomous robots. One approach is to segment the 3D maps into semantic objects using Conditional Random Fields (CRF), which requires large 3D ground truth datasets to train the classification model. Additionally, the CRF inference is often computationally expensive. In this paper, we present an unsupervised geometric- based approach for the segmentation of 3D point clouds into objects and meaningful scene structures. We approximate an input point cloud by an adjacency graph over surface patches, whose edges are then classified as being either on or off. We devise an effective classifier which utilises both global planar surfaces and local surface convexities for edge classification. More importantly, we propose a novel global plane extraction algorithm for robustly discovering the underlying planes in the scene. Our algorithm is able to enforce the extracted planes to be mutually orthogonal or parallel which conforms usually with human-made indoor environments. We reconstruct 654 3D indoor scenes from NYUv2 sequences to validate the efficiency and effectiveness of our segmentation method. I. INTRODUCTION For a robot to interact within an unstructured environment, it will often need to build a rich map of its surroundings, in order to plan and to act in a world which is semantically labelled. Although SLAM (simultaneous localisation and mapping) has seen a significant progress, in the last decades, the reconstructed map is usually simply a set of 3D points in which information about objects, surfaces, boundaries are not preserved or even extracted. In this work, we present progress towards building more meaningful maps for indoor SLAM systems using a depth sensor. Unlike 2D (or 2.5D) images, full 3D scenes fused from multiple views contain much more valuable information about objects and scene structures. Work in [1] has shown that a classifier that has access to full 3D models results in better semantic segmentations than using single-view point clouds. However, to exploit full scene context (rather than simple smoothness priors), modern 3D semantic segmen- tation methods (e.g., [2], [1], [3]) resort to sophisticated probabilistic graphical models to encode complex contextual relationships of different objects and geometric features. 1 Trung T. Pham and Ian Reid are with the ARC Centre of Excellence for Robotic Vision, University of Adelaide, Adelaide, Australia. Contact: [email protected] 2 Markus Eich and Gordon Wyeth are with the ARC Centre of Excellence for Robotic Vision, Queensland University of Technology (QUT), Brisbane, Australia. The authors acknowledge the support of the Australian Research Council through the Centre of Excellence for Robotic Vision (CE140100016) and Laureate Fellowship (FL130100102) to IDR. Prior to testing, one has to train these graphical models using a large amount of 3D ground truth annotated point clouds. Unfortunately, manually annotating 3D data requires much more human effort and time than that of 2D images. Moreover, the training and inference processes of these models are usually computationally expensive, and mostly unsuitable for real-time systems. In this work, we present an unsupervised geometric-based method to segment 3D point clouds into objects, such as chairs or tables, and into scene structures like walls and floors, purely using geometric information. We leave the semantic class recognition as a separated post-processing step, which is not the focus of this work. In particular, we employ a graph clustering algorithm, in which the graph is constructed over local surface patches, and edges connecting adjacent regions are classified into the same or different objects. Effectively, this graph clustering process will seg- ment the input point cloud into objects. We propose an simple edge classifier which effectively exploits global scene planar surfaces (e.g., wall, floor, table top) and local surface convexities. Fig. 1 shows the overview of our segmentation approach. An important contribution in this work lies on a novel global plane fitting algorithm which is able to extract geomet- rically consistent planes in 3D scenes. Indoor environments often admit special structures, for instance, walls are orthog- onal to the floor. Therefore, a robust plane fitting algorithm should guarantee such geometric consistencies. Note that it is not possible to enforce such regularities if one extracts planes individually, for instance using sequential RANSAC. In contrast to this, our global approach extracts all underlying planes jointly, and balances goodness-of-fit and consistent arrangement between extracted planes. We formulate the plane extraction task as minimizing a global pairwise en- ergy function which jointly considers plane fidelities and geometric consistencies between planes. We approximate the continuous space of planes by a set of plane candidates, and introduce an energy function over binary variables whose values indicate which of all possible plane proposals are selected. We use the local sub-modularization approximation method [4] for efficient inference. In summary, our main contribution is twofold: 1) We present a simple graph clustering based segmentation method for full 3D indoor scenes, in which the edge classifier considers both global planar surfaces and local surface convexities; and more importantly 2) We propose a novel plane fitting algorithm which is capable of reliably extracting consistent planar surfaces in the scenes. We reconstructed a

Transcript of Geometrically Consistent Plane Extraction for …trung/lib/exe/fetch.php?... · 2016-07-31 ·...

Page 1: Geometrically Consistent Plane Extraction for …trung/lib/exe/fetch.php?... · 2016-07-31 · Geometrically Consistent Plane Extraction for Dense Indoor 3D Maps Segmentation ...

Geometrically Consistent Plane Extraction forDense Indoor 3D Maps Segmentation

Trung T. Pham1, Markus Eich2, Ian Reid1, Gordon Wyeth2

Abstract— Modern SLAM systems with a depth sensor areable to reliably reconstruct dense 3D geometric maps of indoorscenes. Representing these maps in terms of meaningful entitiesis a step towards building semantic maps for autonomousrobots. One approach is to segment the 3D maps into semanticobjects using Conditional Random Fields (CRF), which requireslarge 3D ground truth datasets to train the classificationmodel. Additionally, the CRF inference is often computationallyexpensive. In this paper, we present an unsupervised geometric-based approach for the segmentation of 3D point clouds intoobjects and meaningful scene structures. We approximate aninput point cloud by an adjacency graph over surface patches,whose edges are then classified as being either on or off. Wedevise an effective classifier which utilises both global planarsurfaces and local surface convexities for edge classification.More importantly, we propose a novel global plane extractionalgorithm for robustly discovering the underlying planes in thescene. Our algorithm is able to enforce the extracted planesto be mutually orthogonal or parallel which conforms usuallywith human-made indoor environments. We reconstruct 654 3Dindoor scenes from NYUv2 sequences to validate the efficiencyand effectiveness of our segmentation method.

I. INTRODUCTION

For a robot to interact within an unstructured environment,it will often need to build a rich map of its surroundings, inorder to plan and to act in a world which is semanticallylabelled. Although SLAM (simultaneous localisation andmapping) has seen a significant progress, in the last decades,the reconstructed map is usually simply a set of 3D pointsin which information about objects, surfaces, boundaries arenot preserved or even extracted. In this work, we presentprogress towards building more meaningful maps for indoorSLAM systems using a depth sensor.

Unlike 2D (or 2.5D) images, full 3D scenes fused frommultiple views contain much more valuable informationabout objects and scene structures. Work in [1] has shownthat a classifier that has access to full 3D models results inbetter semantic segmentations than using single-view pointclouds. However, to exploit full scene context (rather thansimple smoothness priors), modern 3D semantic segmen-tation methods (e.g., [2], [1], [3]) resort to sophisticatedprobabilistic graphical models to encode complex contextualrelationships of different objects and geometric features.

1Trung T. Pham and Ian Reid are with the ARC Centre of Excellencefor Robotic Vision, University of Adelaide, Adelaide, Australia. Contact:[email protected]

2Markus Eich and Gordon Wyeth are with the ARC Centre of Excellencefor Robotic Vision, Queensland University of Technology (QUT), Brisbane,Australia.

The authors acknowledge the support of the Australian Research Councilthrough the Centre of Excellence for Robotic Vision (CE140100016) andLaureate Fellowship (FL130100102) to IDR.

Prior to testing, one has to train these graphical modelsusing a large amount of 3D ground truth annotated pointclouds. Unfortunately, manually annotating 3D data requiresmuch more human effort and time than that of 2D images.Moreover, the training and inference processes of thesemodels are usually computationally expensive, and mostlyunsuitable for real-time systems.

In this work, we present an unsupervised geometric-basedmethod to segment 3D point clouds into objects, such aschairs or tables, and into scene structures like walls andfloors, purely using geometric information. We leave thesemantic class recognition as a separated post-processingstep, which is not the focus of this work. In particular, weemploy a graph clustering algorithm, in which the graph isconstructed over local surface patches, and edges connectingadjacent regions are classified into the same or differentobjects. Effectively, this graph clustering process will seg-ment the input point cloud into objects. We propose ansimple edge classifier which effectively exploits global sceneplanar surfaces (e.g., wall, floor, table top) and local surfaceconvexities. Fig. 1 shows the overview of our segmentationapproach.

An important contribution in this work lies on a novelglobal plane fitting algorithm which is able to extract geomet-rically consistent planes in 3D scenes. Indoor environmentsoften admit special structures, for instance, walls are orthog-onal to the floor. Therefore, a robust plane fitting algorithmshould guarantee such geometric consistencies. Note that itis not possible to enforce such regularities if one extractsplanes individually, for instance using sequential RANSAC.In contrast to this, our global approach extracts all underlyingplanes jointly, and balances goodness-of-fit and consistentarrangement between extracted planes. We formulate theplane extraction task as minimizing a global pairwise en-ergy function which jointly considers plane fidelities andgeometric consistencies between planes. We approximate thecontinuous space of planes by a set of plane candidates, andintroduce an energy function over binary variables whosevalues indicate which of all possible plane proposals areselected. We use the local sub-modularization approximationmethod [4] for efficient inference.

In summary, our main contribution is twofold: 1) Wepresent a simple graph clustering based segmentation methodfor full 3D indoor scenes, in which the edge classifierconsiders both global planar surfaces and local surfaceconvexities; and more importantly 2) We propose a novelplane fitting algorithm which is capable of reliably extractingconsistent planar surfaces in the scenes. We reconstructed a

Page 2: Geometrically Consistent Plane Extraction for …trung/lib/exe/fetch.php?... · 2016-07-31 · Geometrically Consistent Plane Extraction for Dense Indoor 3D Maps Segmentation ...

Fig. 1. A conceptual overview of our approach. We take a sequence of RGB-D images as input to reconstruct a dense indoor scene. Using our globalplane fitting algorithm, we extract a set of geometrically consistent planes from the scene, which together with local surface convexities are used for scenesegmentation via a graph clustering algorithm.

total of 654 3D scenes using NYU v2 RGB-D sequencesto validate the effectiveness of our approach. Experimentalresults show that our approach greatly outperforms manypopular unsupervised 3D scene segmentation algorithms.The source code is available online at https://github.com/MarkusEich/cpf_segmentation

II. RELATED WORK

A. Unsupervised Methods

Unsupervised segmentation of 3D point clouds into ob-jects has been studied extensively in computer vision androbotics [5], [6], [7], [8]. This is a preliminary step towardsemantic scene understanding. Early approaches [5], [6]rely on RANSAC to fit and remove planes (and possiblyother primitive shapes such as spheres, cones) from thepoint clouds, then isolate objects using Euclidean clustering.One drawback of these approaches is that the results arenon-deterministic and sensitive to the choice of parameters.Moreover running RANSAC sequentially to detect multipleplanes is widely known sub-optimal since the inaccuracies indetecting the first planes will heavily affect the subsequentplanes. In other words, there is no guarantee that the detectedplanes will correspond to the underlying planar surfaces inthe scenes. Our plane fitting algorithm favourably returnsplanes corresponding to the indoor scene structures such asfloor, walls, ceilings. Planar surfaces such as backs of thelounge chairs will not be extracted.

Multi-model fitting algorithms (e.g., [9], [10]) have beenwidely studied to fit multiple models simultaneously, how-ever almost all of these works fail to consider geometricpriors during fitting. Some exceptions such as [11], [12] en-code geometric priors using multi-label higher-order MarkovRandom Fields (MRF) which is hard to optimise. In contrast,our proposed plane fitting algorithm detects all the planesjointly, and at the same time favours consistent (orthogonal,parallel) arrangements of the extracted planes. Unlike [11],[12], our formulated energy is binary and pairwise, which ismuch simpler and admits efficient approximate optimisationalgorithms (e.g., [4], [13]).

Recently, Stein et al. [14] presented a simple but efficientalgorithm for object segmentation without relying on planefitting. The idea is to classify every adjacent pairs of localsurface patches into convexity or concavity using normalsestimated from the surface patches. Concavity relations in-dicate the pairs of surface patches coming from different

objects. Effectively, the convexity/concavity classificationwill partition the scene into different objects and objectparts. Similar idea has been done in [15]. However, pointclouds from full room-scale scenes are often noisy as a depthsensor hardly captures correct depth information for distantobjects. Therefore, normal estimation in such data can beimprecise, leading to wrong convex/concave classification.Our approach, in contrast, makes use of both global scenestructures (e.g., supporting planes) as well as local convexityinformation for segmentation.

B. Supervised Methods

Supervised 3D semantic segmentation using joint seg-mentation and recognition, has attracted much interest re-cently [16], [1], [2], [3]. Similar to 2D semantic labellingtechniques, 3D methods learn a classification model fromtraining data to predict a semantic class for every 3Delement (e.g., 3D point, surfel). Graphical models, like CRFis often employed to capture scene features and complexrelationships between different class labels. Unlike 2D (or2.5D) images, full-scene 3D point clouds inherently carryrich contextual information, such as correlations betweenobject classes, which is beneficial for the label predictiontask. However, encoding such contextual information re-quires complex graphical models, leading to difficult learningand inference problems. For example, the label inferencedone in [1] takes over a minute to complete. Nevertheless,near real-time performance can be achieved if the graphicalmodels consider only Gaussian potentials (e.g., [17], [18]),which limits the capability of encoding rich scene context.

In contrast to the above-mentioned approaches, our workfocuses only on the segmentation task, however the methodis completely unsupervised and requires no training. Oncesegmented, each segment can then be analysed for shape,structure and spatial relationships [19] for semantic labelling.Deep-learning based 3D object recognition systems like [20]can also be applied.

III. GRAPH CLUSTERING BASED 3D SEGMENTATION

Given a 3D indoor map reconstructed from a sequenceof RGB-D images, our system first approximates the 3Dgeometric map by a set of surface patches and extractsa set of geometrically consistent supporting planes usinga global energy minimization algorithm (see Sec. IV). Tosegment the scene, we build an adjacency graph over thesurface patches, then classify each pair of surface patches

Page 3: Geometrically Consistent Plane Extraction for …trung/lib/exe/fetch.php?... · 2016-07-31 · Geometrically Consistent Plane Extraction for Dense Indoor 3D Maps Segmentation ...

into the same or different objects using local concavity andthe extracted planes. This classification process will result ina set of connected components, each of which correspondsto either an object (i.e., chair), object part (i.e., top table) ora scene structure (i.e., wall, floor).

A. Surface Patch Approximation

As the reconstructed point cloud contains several millionsof points, we represent the point cloud by a collection ofsmall supervoxels to reduce the computational cost. We usedthe supervoxel segmentation method proposed in [21], whichover-segments the input point cloud into a set of surfacepatches V = {vi}Ni=1, of which each contains a centroidci, a normal vector ni and curvature ui. The method returnsalso an adjacency graph G = {V, E} defined over the surfacepatches V; E = {eij} is the set of edges connecting adjacentsurface patches. We use the resulting graph for our scenesegmentation.

B. Graph based Scene Segmentation

Given the adjacency graph G = {V, E}, object sepa-ration can be formulated as a graph partitioning problem.Specifically, edges are classified into ON or OFF based ontheir object memberships. This means that edges connectingpairs of nodes coming from the same objects are labelledas ON, otherwise as OFF. Removing OFF edges from thegraph results in a number of connected components, whichrepresent individual objects or object parts.

In a recent work [22], edges are classified as either convex(ON) or concave (OFF). The edge convexity (or concavity)is computed using the normals estimated at the two points.However, the estimated normals can be very imprecise,especially at object boundaries or when the point cloudis noisy. This can lead to incorrect convexity (concavity)classification, and subsequently results in unreliable objectpartitioning. Alternatively, in this work we classify theseedges into ON or OFF using the scene supporting planesin addition to local convexity information. Intuitively, if twoconnected surface patches belong to the same 3D plane, theyare likely from the same object. If only one of the pair ison the plane, clearly they belong to two different objects.In the case where the two neighbour surface patches do notbelong to any plane, we need to consider their local convexityrelation for classification.

Now assume that we have extracted a set of K supportingplanes from the input scene {p1, p2, . . . , pK}, and surfacepatches have been assigned to these planes. We denote theassignment using labelling variables {li}N1 ; li ∈ [0,K]; li =0 means that surface patch i does not belong to any plane.Our proposed edge classification function f is defined for alleij ∈ E as follow:

f(eij) =

1 if li = lj & li 6= 0

1 if li = lj = 0 & gcv(vi, vj)

0 otherwise,(1)

where 1 is ON, 0 is OFF. gcv(vi, vj) is a convexity clas-sifier. We refer readers to [22] for detailed convex/concave

classification. After classification, we remove all OFF edgesfrom the adjacency graph G, resulting in a number of sub-graphs, each of which represents either a scene structure oran individual object.

IV. GEOMETRICALLY CONSISTENT PLANE EXTRACTION

Detecting planar surfaces from point clouds is actually acommon problem in computer vision and robotics. In fact,RANSAC has been widely used to fit a plane model into aset of 3D points. Indoor scenes usually compose of a variablenumber of planes; a standard approach to detecting them allis to run RANSAC sequentially until no more plane can befound. In contrast to this, we present a novel algorithm todetect jointly all planes in a scene. Our method prefers everypair of planes to be either orthogonal or parallel to eachother. These geometric regularities agree well with indoorscene layouts or human made structures.

To that end, our plane extraction algorithm first samplesa collection of potential plane hypotheses and then selects asubset that minimizes a global energy function, which jointlyencodes plane qualities and pairwise consistencies.

A. Plane Hypothesis Generation

A simple method for generating plane candidates is to fit3D plane models onto randomly sampled minimum subsetsof points. However, to ensure the adequacy of the hypothesisset, one would need to sample a huge number of mini-mal subsets, which is time-consuming and also increasesthe computational effort for the energy minimization latter.Here we propose to generate plane candidates using surfacepatches. As mentioned earlier, the input point cloud hasbeen over-segmented into a set of surface patches, of whicheach represents a local surface of the scene with a centroid,a normal vector and curvature. We select surface patcheswith small curvatures to generate planes using centroidsand normal vectors. We further enrich the hypothesis set byrandomly sampling subsets of the centroids.

B. Model Selection

Given M plane candidates H = {p1, p2, . . . , pM}, weseek a subset that simultaneously fits the data well, andcontains a regular arrangement of planes. We formulate theplane selection problem as an energy minimization task:

P∗ = argminP⊂H

E(P), (2)

where the energy E(P) simultaneously encodes goodness-of-fit and geometric consistency. In this work, our fittingenergy is defined as:

E(P) =∑

pm∈PD(pm) +

∑pm,pn∈P

V (pm, pn). (3)

D(pm) measures the fidelity of plane pm. V (pm, pn) modelsthe geometric consistency between pm and pn.

Specifically, the function D(pm) evaluates how well theplane pm fits to the input point cloud. Given a noise thresholdσ, we first compute a set of surface patches (inliers) associ-ated with plane pm, i.e., Ipm = {vi ∈ V | r(ci, pm) < σ},

Page 4: Geometrically Consistent Plane Extraction for …trung/lib/exe/fetch.php?... · 2016-07-31 · Geometrically Consistent Plane Extraction for Dense Indoor 3D Maps Segmentation ...

where r(ci, pm) is the distance of the surface patch centroidci to plane pm. We then define:

D(pm) =

{1− |Ipm |

λ if |Ipm| < λ

exp(λ−|Ipm |

λ ) if |Ipm| ≥ λ

(4)

where |.| is the cardinality function, and λ is the minimumnumber of inliers of a good plane. We set σ = 5 cm andλ = 50.

The function V (pm, pn) will take care of the geometricconsistency between planes. In this work, we seek the regularinter-plane relations as orthogonality and parallelism. Hence,for each irregular pair of planes, a penalty cost is paid whichis

V (pm, pn) = 1− exp(−A(pm, pn)

α). (5)

A(pm, pn) = min(90 − A(pm, pn),A(pm, pn)), andA(pm, pn) ∈ [0, 90] is the angle between pm and pn. αis standard (angle) deviation. We set α = 10.

Furthermore, we enforce the overlap between two planesas small as possible to avoid redundant planes. The finalpairwise energy becomes:

V (pm, pn) = 1− exp(−A(pm, pn)

α) +

|Ipm∩ Ipn

|min(|Ipm

|, |Ipn|).

(6)

Above, we have defined our fitting energy function. Nextwe will show how to efficiently solve optimisation problem(2), which is NP-hard in general. Minimizing the energyfunction (3) corresponds to minimizing the following binaryfunction:

E(X) =

M∑m=1

D(pm)xm +∑

pm,pn∈PV (pm, pn)xmxn (7)

where X = [x1, x2, . . . , xM ] is a vector of binary variables,in which xm = 1 indicates that the plane candidate pmis selected. Although the energy (7) is, in general, non-submodular [23], and thus the minimization is still a NP-hardproblem, the energy (7) is binary, which can be optimisedefficiently using the local submodularisation method [4].

C. Point-to-Plane Assignment

We have now extracted a set of K planes P∗ ={p1, p2, . . . , pK} from the scene, and each plane pk ∈ P∗ isassociated with a set of inliers Ipk

. In practice there couldbe a single surface patch belonging to two different planes,which makes the scene segmentation difficult. By assumingthe underlying scene surface is smooth, the inlier sets can berefined by optimizing the objective function of assignmentvariables L = [l1, l2, . . . , lN ] (one for each surface patch):

E(L) =

N∑n=1

U(vn, ln) +∑

<i,j>∈ES(li, lj). (8)

Each ln takes value in [0,K], which indicates surface patchvn belonging a plane pln (ln = 0 indicates vn coming from anon-plane structure.) U(vn, ln) is the data cost and S(li, lj)

is the smoothness cost. The data cost function is constructedas:

U(vn, ln) =

{r(cn, pln) if ln ∈ [1, L]

2σ if ln = 0,(9)

where r(cn, pln) measures distance from centroid point cnto plane pln , σ is the noise threshold described earlier. Thesmoothness cost is defined as the Potts model

V (li, lj) =

{1 if li = lj

0 otherwise.(10)

Since the number of surface patches N and number ofplanes K are relatively small, the energy (8) can be optimisedvery efficiently using the fast graph-cut based expansionmove algorithm [24].

V. EXPERIMENTS

A. Dense 3D Point Cloud Reconstruction

In order to show the robustness of our approach, wereconstruct 654 3D geometric models (3D scenes) using theraw NYUv2 dataset [25]. Specifically, we select 200 framesaround each ground truth labelled image to reconstruct thescene using efficient InfiniTAM [26], which is an open sourceimplementation of Kinect-Fusion [27].

B. Performance Evaluation

We tested our scene segmentation approach on the re-constructed scenes. As there is no proper 3D benchmarkingavailable, quantitative evaluation of our method is difficult.Thus we re-project the 3D segmentation results to 2D groundtruth images for performance evaluation. However it isimportant to note that due to errors in camera pose estimationduring reconstruction as well as errors in calibration betweendepth camera and rgb camera, the re-projected images oftendo not align properly with the ground truth images. Thiswould leads to lower performance compared to methodsworking directly with 2D images. Also, as our segmenta-tion method is unsupervised, we only compare our methoddirectly against geometric-based methods. Specifically, ourbaseline is the popular RANSAC based method, and therecent convexity based object partitioning method [14], de-noted as LCCP. Note that in [14] the authors have shown thatLCCP works better many machine-learning based methods.

For quantitative comparison, we use the popular over-lapping criteria proposed in [28]. Specifically, for a givenimage, let G = {G1, G2, . . . , GM} be a set of ground truthannotated regions, and S = {S1, S2, . . . , SN} be a set ofpredicted segments, respectively. The overlap and weightedoverlap scores are computed as follow:

O =1

M

∑m

maxSn

(|Gm ∩ Sn||Gm ∪ Sn|

), (11)

WO =1∑

m |Gm|∑m

|Gm|maxSn

(|Gm ∩ Sn||Gm ∪ Sn|

), (12)

Page 5: Geometrically Consistent Plane Extraction for …trung/lib/exe/fetch.php?... · 2016-07-31 · Geometrically Consistent Plane Extraction for Dense Indoor 3D Maps Segmentation ...

(a) 3D scene (b) RANSAC

(c) LCCP [14] (d) OURS

Fig. 2. Qualitative comparison results using bedroom-0019 sequence in NYUv2 dataset. Notice that LCCP method fails to separate bed, bedside tableand blinds, while the yellow cluster returned by RANSAC does not correspond to any true object.

(a) 3D scene (b) RANSAC

(c) LCCP [14] (d) OURS

Fig. 3. Qualitative comparison results using livingroom-0088 sequence in NYUv2 dataset. Again the yellow clustered returned by RANSAC spreads overmultiple objects, while LCCP is unable to partition the left wall and cabinet.

(a) 3D scene (b) RANSAC (c) LCCP [14] (d) OURS

Fig. 4. Qualitative comparison results using diningroom-0001 sequence in NYUv2 dataset. Notice that RANSAC fits a plane to the back of the loungechair, while LCCP merges table and chairs.

where |Gm| is the pixel area of Gm. The scores are averagedover 654 scenes.

Tab. I reports the comparison results. It can be seen thatour method outperforms all the other methods. Moreover, the

last column in Tab. I also shows that our method is significantfaster than RANSAC based method while slightly slowerthan the LCCP. Samples of qualitative comparison resultsare shown in Figs 2, 3, 4.

Page 6: Geometrically Consistent Plane Extraction for …trung/lib/exe/fetch.php?... · 2016-07-31 · Geometrically Consistent Plane Extraction for Dense Indoor 3D Maps Segmentation ...

Methods Weighted Overlap (%) Unweighted Overlap (%) Running timeRANSAC Unsupervised 43.58 20.03 5-10 mins

LCCP [14] Unsupervised 45.90 24.95 2-4 secondsOURS Unsupervised 47.56 30.27 3-6 seconds

TABLE ICOMPARISON OF DIFFERENT SEGMENTATION METHODS USING 654 SCENES RECONSTRUCTED FROM NYUV2 SEQUENCES.

C. Running Time Analysis and Discussion

We have demonstrated the efficacy of our 3D scene seg-mentation framework using offline reconstructed data. Theultimate goal is, however, a real-time joint segmentation andreconstruction system. While our running time in the aboveexperiment ranges from 3 to 6 seconds, the segmentationwere performed on large scenes reconstructed from 200frames. The computational cost will significantly reduce ifone considers only a small number of frames, and propagatesthe segmentations incrementally. For example, in [1] only8-9 frames are used. In fact, with single-view point clouds,our method takes approximately 70ms. Furthermore, a majorcomputational cost of our method devotes for surface patchapproximation. This cost will be zero if one resorts to SLAMsystems like ElastisFusion [29] where dense surfel-basedmaps are reconstructed directly.

VI. CONCLUSIONSWe have introduced a simple but effective model-free

framework for 3D scene segmentation, which productivelyexploits geometric regularities of 3D indoor scenes. The maincontribution of this work is a novel algorithm for jointlyextracting all possible underlying planes in 3D scenes usingan energy minimization approach. Importantly our algorithmensures that the extracted planes are geometrically consistentwith the scene structures by enforcing orthogonal and parallelconstraints between pairs of planes. Furthermore, we showedhow these extracted planes can be closely integrated withlocal surface convexities for scene segmentation using asimple graph clustering algorithm. The experimental resultsdemonstrate that our proposed approach greatly outperformsthe standard RANSAC based method, as well as state-of-the-art unsupervised scene segmentation methods.

REFERENCES

[1] A. Anand, H. S. Koppula, T. Joachims, and A. Saxena, “Contextuallyguided semantic labeling and search for three-dimensional pointclouds,” Int. J. Rob. Res., vol. 32, no. 1, pp. 19–34, Jan. 2013.

[2] O. Kahler and I. Reid, “Efficient 3d scene labeling using fields oftrees,” in ICCV, 2013.

[3] T. T. Pham, I. Reid, Y. Latif, and S. Gould, “Hierarchical higher-orderregression forest fields: An application to 3d indoor scene labelling,”in ICCV, 2015.

[4] L. Gorelick, Y. Boykov, O. Veksler, I. B. Ayed, and A. Delong,“Submodularization for binary pairwise energies,” in CVPR, 2014.

[5] R. Schnabel, R. Wahl, and R. Klein, “Efficient RANSAC for point-cloud shape detection,” Computer Graphics Forum, vol. 26, no. 2, pp.214–226, June 2007.

[6] A. Nuchter and J. Hertzberg, “Towards semantic maps for mobilerobots,” Robotics and Autonomous Systems, vol. 56, no. 11, pp.915–926, 2008.

[7] J. Poppinga, N. Vaskevicius, A. Birk, and K. Pathak, “Fast planedetection and polygonalization in noisy 3d range images,” in IROS,2008.

[8] A. Richtsfeld, T. Morwald, J. Prankl, M. Zillich, and M. Vincze,“Segmentation of unknown objects in indoor environments,” in IROS,2012.

[9] H. Isack and Y. Boykov, “Energy-based geometric multi-model fitting,”International Journal of Computer Vision, vol. 97, no. 2, pp. 123–147,2012.

[10] T. T. Pham, T. J. Chin, J. Yu, and D. Suter, “The random cluster modelfor robust geometric fitting,” IEEE Transactions on Pattern Analysisand Machine Intelligence, vol. 36, no. 8, pp. 1658–1671, Aug 2014.

[11] T. T. Pham, T. J. Chin, K. Schindler, and D. Suter, “Interactinggeometric priors for robust multimodel fitting,” IEEE Transactions onImage Processing, vol. 23, no. 10, pp. 4601–4610, Oct 2014.

[12] A. Monszpart, N. Mellado, G. Brostow, and N. Mitra, “RAPter:Rebuilding man-made scenes with regular arrangements of planes,”ACM SIGGRAPH 2015, 2015.

[13] T. T. Pham, S. Hamid Rezatofighi, I. Reid, and T.-J. Chin, “Efficientpoint process inference for large-scale object detection,” in CVPR,2016.

[14] S. C. Stein, F. Worgotter, M. Schoeler, J. Papon, and T. Kulvicius,“Convexity based object partitioning for robot applications,” ICRA,2014.

[15] K. Tateno, F. Tombari, and N. Navab, “When 2.5d is not enough:Simultaneous reconstruction, segmentation and recognition on denseSLAM,” in ICRA, 2016.

[16] J. P. C. Valentin, S. Sengupta, J. Warrell, A. Shahrokni, and P. H. S.Torr, “Mesh based semantic modelling for indoor and outdoor scenes,”in CVPR, 2013.

[17] D. Wolf, J. Prankl, and M. Vincze, “Fast semantic segmentation of3d point clouds using a dense crf with learned parameters,” in ICRA,2015.

[18] A. Hermans, G. Floros, and B. Leibe, “Dense 3D Semantic Mappingof Indoor Scenes from RGB-D Images,” in International Conferenceon Robotics and Automation, 2014.

[19] M. Eich and F. Kirchner, “Reasoning about geometry: An ApproachUsing Spatial-Descriptive Ontologies,” in Workshop AILog, 19th Eu-ropean Conference on Artificial Intelligence,(ECAI-10), 2010.

[20] D. Maturana and S. Scherer, “VoxNet: A 3D Convolutional NeuralNetwork for Real-Time Object Recognition,” in IROS, 2015.

[21] J. Papon, A. Abramov, M. Schoeler, and F. Worgotter, “Voxel cloudconnectivity segmentation - supervoxels for point clouds,” in CVPR,2013.

[22] S. Stein, M. Schoeler, J. Papon, and F. Worgotter, “Object partitioningusing local convexity,” in CVPR, 2014.

[23] V. Kolmogorov and C. Rother, “Minimizing nonsubmodular functionswith graph cuts-a review,” IEEE Transactions on Pattern Analysis andMachine Intelligence, vol. 29, no. 7, pp. 1274–1279, July 2007.

[24] Y. Boykov, O. Veksler, and R. Zabih, “Fast approximate energyminimization via graph cuts,” IEEE Trans. Pattern Anal. Mach. Intell.,vol. 23, no. 11, pp. 1222–1239, Nov. 2001.

[25] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor Segmenta-tion and Support Inference from RGBD Images,” in ECCV, 2012.

[26] O. Kahler, V. Prisacariu, C. Ren, X. Sun, P. Torr, and D. Murray,“Very High Frame Rate Volumetric Integration of Depth Images onMobile Devices,” IEEE Transactions on Visualization and ComputerGraphics, 2015.

[27] R. a. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim,A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon,“KinectFusion: Real-time dense surface mapping and tracking,” inISMAR, 2011.

[28] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor Segmenta-tion and Support Inference from RGBD Images,” in ECCV, 2012.

[29] T. Whelan, S. Leutenegger, R. S. Moreno, B. Glocker, and A. Davison,“Elasticfusion: Dense slam without a pose graph,” in RSS, 2015.