
  • Handling Perceptual Clutter for Robot Vision with Partial Model-Based Interpretations

    Grace Tsai and Benjamin Kuipers

    Abstract— For a robot to act in the world, it needs to build and maintain a simple and concise model of that world, from which it can derive safe opportunities for action and hazards to avoid. Unfortunately, the world itself is infinitely complex, containing aspects (“clutter”) that are not well described, or even well approximated, by the simple model. An adequate explanatory model must therefore explicitly delineate the clutter that it does not attempt to explain. As the robot searches for the best model to explain its observations, it faces a three-way trade-off among the coverage of the model, the degree of accuracy with which the model explains the observations, and the simplicity of the model. We present a likelihood function that addresses this trade-off. We demonstrate and evaluate this likelihood function in the context of a mobile robot doing visual scene understanding. Our experimental results on a corpus of RGB-D videos of cluttered indoor environments demonstrate that this method is capable of creating a simple and concise planar model of the major structures (ground plane and walls) in the environment, while separating out for later analysis segments of clutter represented by 3D point clouds.

    I. INTRODUCTION

    An indoor navigating robot must perceive its local environment in order to act. Visual perception has become popular because it captures a lot of information at low cost. When using vision as a sensor for a robot, the input to visual perception is a temporally continuous stream of images, not simply a single image or a collection of images. The output of visual scene understanding has to be a concise interpretation of the local environment that is useful for the robot to make plans. Moreover, visual processing must be an on-line and efficient process, as opposed to a batch process, so that the robot's interpretation can be updated as visual observations become available.

    Many Visual SLAM methods [1], [11], [9] have been proposed to construct a 3D point cloud of the local environment in real time. A more concise, large-granularity model that would be useful to a robot in planning must then be constructed from the point cloud. Methods [4], [2], [3] have been proposed to extract planar models from a 3D point cloud. However, due to the need for multiple iterations through the entire set of images, these methods are off-line and computationally intensive, making them inapplicable to real-time visual scene understanding for robots.

    Our previous work is an efficient on-line method for scene understanding. We presented a concise planar representation, the Planar Semantic Model (PSM) [14], to represent the geometric structure of the local indoor environment. We treat the scene understanding problem as a dynamic and incremental process because the view of the local environment changes while the robot travels within it. We presented an efficient on-line generate-and-test method with Bayesian filtering to construct the PSM from a stream of monocular images [14]. However, our previous method assumes that the indoor environment is empty and that the PSM is capable of modeling everything in the environment. If the environment is not empty, observations that are not part of the planar structures may mislead the Bayesian filter into converging to the wrong hypothesis.

    Both authors are with the Dept. of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, U.S.A.

    This work has taken place in the Intelligent Robotics Lab in the Computer Science and Engineering Division of the University of Michigan. Research of the Intelligent Robotics Lab is supported in part by grants from the National Science Foundation (CPS-0931474, IIS-1111494, and IIS-1252987).

    The focus of this paper is handling clutter in visual scene understanding. We define “clutter” as regions of the local environment that cannot be represented by the given model. Since clutter is not structured by the given model, clutter is represented by 3D point clouds, where each point cloud corresponds to a 2D segment in the image space. The interpretation of the local environment consists of the model and clutter. The model “explains” a subset of the visual observations, and clutter is the collection of observations that are not explained by the model.

    There is existing work on indoor scene understanding that interprets the scene with a planar model and clutter. Hedau et al. [5] and Wang et al. [16] model clutter using a classifier that links image features to clutter. Dependence on prior training for clutter is difficult to generalize to different indoor environments, since the appearance of clutter is highly variable. Lee et al. [10] and Taylor et al. [12] treat clutter as observations that are not geometrically explainable by the planar model, which is our definition of clutter, and find a model that fits most of the observations. However, these methods may be difficult to apply in real-time applications because they involve pixel-level evaluation or the evaluation of a large set of hypotheses. Moreover, since these methods handle only a single image, a temporally coherent interpretation of the scene may be difficult to achieve if each frame is processed independently.

    This paper builds on the on-line generate-and-test framework [14] to incrementally construct an interpretation of the local environment with a PSM model and clutter. Specifically, we propose a likelihood function that allows a good PSM hypothesis to explain only a subset of the observed features. The likelihood function tests each hypothesis based on a three-way trade-off among the coverage, accuracy, and simplicity of the model. Our experimental results on a variety of RGB-D videos of cluttered indoor environments demonstrate that our method is capable of interpreting the

  • [Fig. 1 block diagram. Pipeline blocks: RGB-D Image → Extract Ground Plane / Extract Features → Estimate Pose → Refine Hypotheses → Generate Hypotheses → Test Hypotheses.]

    Fig. 1. Framework. Given the current RGB-D frame t, we first find the relation between the camera and the local 3D world coordinate, and then estimate the 3-dof robot pose in the 3D world coordinate. At the same time, we extract useful features F_t (e.g. vertical-plane features V_t and point-set features C_t) from frame t. We then follow the on-line generate-and-test framework for indoor scene understanding [14]. The main contribution of this paper is the proposed likelihood function for testing the hypotheses, which allows a good hypothesis to explain only a subset of the features F_t. In addition, we demonstrate a method to generate and refine the hypotheses using the vertical-plane features V_t.

    environment with a PSM model and separating out for later analysis segments of clutter represented by 3D point clouds.

    Our work directly addresses a central dilemma in model-based scene interpretation. On one hand, it is worth describing much of the environment using a model that is as simple as possible, and identifying the parts of the environment that require more sophisticated models, instead of having one highly complex model describe the whole environment. On the other hand, when testing multiple model hypotheses, a hypothesis is preferred if it explains as many observations as possible and leaves only a small portion of the observations unexplained. Our proposed likelihood function makes explicit the trade-off among explanation coverage, accuracy, and simplicity. While this paper demonstrates a method to extract the geometric structure of the environment and separate out clutter, the same approach can be applied to further interpret clutter with object models.

    II. METHOD

    This paper demonstrates a method that incrementally and dynamically constructs an interpretation of the local environment with a concise model (Section II-C) and clutter, the parts of the environment that cannot be represented by the model, using a stream of RGB-D images. Figure 1 illustrates our framework. At each frame t, we first extract the ground plane G and compute a transformation [R_t^g, T_t^g] that transforms the 3D points from the image coordinate to the local 3D coordinate (Section II-A). Points that lie on the ground plane are now considered as “explained” by the ground G. The transformation [R_t^g, T_t^g] captures 3 dof of the full 6-dof camera pose: the pan and roll angles and the height of the camera with respect to the ground plane at frame t. We call the remaining 3 dof the robot pose, which is represented as (x_t^r, y_t^r, θ_t^r) in the 3D world coordinate. The robot pose is estimated by aligning RGB image features and the 3D non-ground-plane points between frames t−1 and t (Section II-B).

    At the same time, we extract a set of features F_t from the 3D points that are not explained by the ground G. While other research focuses on grouping an RGB-D image or a set of 3D points into primitive shapes [7], in this paper we group the 3D points into a set of vertical-plane features V_t and a set of point-set features C_t. Vertical-plane features suggest the existence of walls, while point-set features suggest the existence of clutter.

    Given the robot pose and the features F_t = {V_t, C_t}, the on-line generate-and-test framework first uses the vertical-plane features V_t to refine the precision of the existing set of PSM hypotheses {M}_{t−1} (Section II-E). Then, a set of new hypotheses is generated and added to the hypothesis set {M}'_{t−1} by transforming existing hypotheses into child hypotheses that describe the same environment in more detail, based on the vertical-plane features V_t (Section II-F). Finally, we use a Bayesian filter to test the hypotheses. While our previous work requires that a good hypothesis explain all the observed features, in this paper we propose a new likelihood function that allows a good hypothesis to explain only a subset of the features (Section II-G). The output interpretation of the environment is the PSM hypothesis with the maximum posterior probability, plus clutter: the 3D points that cannot be explained by that PSM hypothesis.

    A. Extract Ground Plane

    At each frame, we extract the 3D ground plane using both the RGB image and the depth information. Figure 2 illustrates how the ground plane is extracted. First, we collect the pixels whose local surface normals differ from the approximated normal vector N_appx of the ground plane^1 by less than φ_ground. The local surface normal of each pixel is computed using the efficient algorithm proposed by Holz et al. [7]. From those pixels, we use RANSAC to fit the dominant plane that has a normal vector within φ_ground of the approximated normal N_appx. From the inlier pixels, we perform a morphological close operation on the RGB image to locate a smooth and bounded region for the ground plane.

    Once the ground plane is extracted, we compute the transformation [R_t^g, T_t^g] between the camera coordinate and the local 3D coordinate, where the x-y plane is the ground plane and the z-axis points up. The origin of the local 3D coordinate is set to the projected location of the camera center on the ground plane. Mathematically, R_t^g is the rotation matrix that rotates the normal vector of the ground plane to [0, 0, 1], and T_t^g = [0, 0, h_t] is determined by the distance h_t between the camera center and the ground plane. In this paper, we also define the world coordinate to be the local 3D coordinate of the first frame.

    ^1 For a front-facing camera, N_appx = [0, 1, 0] in the image coordinate, and in our experiments we set φ_ground = π/6.

  • Fig. 2. Extract ground plane from a single RGB-D image. (Best viewed in color.) (a) Collect candidate ground points: from the input RGB-D image, collect candidate ground-plane points (blue) with local surface normal close to [0, 1, 0] in the image coordinate. Notice that since the depth information is noisy, candidate points may not lie on the ground plane, and some ground-plane points may not be identified as candidate points. (b) Fit plane using RANSAC and (c) obtain inlier points: use RANSAC to fit the dominant plane from the candidate points; inlier points (pixels) are marked in green. From the ground-plane equation, we determine the transformation between the image coordinate and the local 3D coordinate. In (b), points are transformed to the local 3D coordinate. (d) Smooth ground-plane region: perform a morphological close operation on the image to obtain a smooth, bounded ground-plane region.

    The transformation between the local 3D coordinate and the world coordinate can be inferred from the robot pose (Section II-B).
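    To make Section II-A concrete, the sketch below shows one possible implementation of the normal-filtered RANSAC plane fit and of the transformation [R_t^g, T_t^g]. It is a minimal NumPy sketch under stated assumptions, not the authors' implementation: the iteration count, the inlier distance threshold, and the function names are ours, and the morphological close step is omitted; only N_appx = [0, 1, 0] and φ_ground = π/6 come from the paper.

        import numpy as np

        PHI_GROUND = np.pi / 6                   # angular tolerance used in the paper
        N_APPX = np.array([0.0, 1.0, 0.0])       # approximate ground normal, front-facing camera

        def fit_ground_plane(points, normals, iters=200, inlier_thresh=0.03):
            """RANSAC fit of the dominant ground plane (Section II-A).

            points  : (N, 3) 3D points in the camera (image) coordinate
            normals : (N, 3) per-pixel surface normals (e.g. from Holz et al. [7])
            Returns (n, d, inlier_idx) for the plane n . x + d = 0.
            `iters` and `inlier_thresh` are assumed values, not given in the paper.
            """
            # Step 1: candidate pixels whose normal is within PHI_GROUND of N_APPX.
            cos_ang = np.clip(np.abs(normals @ N_APPX), 0.0, 1.0)
            cand = np.where(np.arccos(cos_ang) < PHI_GROUND)[0]

            rng = np.random.default_rng(0)
            best_n, best_d, best_inliers = N_APPX, 0.0, np.array([], dtype=int)
            if len(cand) < 3:
                return best_n, best_d, best_inliers
            for _ in range(iters):
                # Step 2: plane through 3 random candidate points.
                p0, p1, p2 = points[rng.choice(cand, size=3, replace=False)]
                n = np.cross(p1 - p0, p2 - p0)
                if np.linalg.norm(n) < 1e-9:
                    continue
                n = n / np.linalg.norm(n)
                if np.arccos(np.clip(abs(n @ N_APPX), 0.0, 1.0)) > PHI_GROUND:
                    continue                     # reject planes outside the angular tolerance
                d = -float(n @ p0)
                # Step 3: keep the plane with the most candidate inliers.
                inliers = cand[np.abs(points[cand] @ n + d) < inlier_thresh]
                if len(inliers) > len(best_inliers):
                    best_n, best_d, best_inliers = n, d, inliers
            return best_n, best_d, best_inliers

        def ground_transform(n, d):
            """[R_t^g, T_t^g]: R_t^g rotates the ground normal onto [0, 0, 1];
            T_t^g = [0, 0, h_t], with h_t the camera height above the plane."""
            if n[2] < 0:                         # orient the normal consistently upward
                n, d = -n, -d
            z = np.array([0.0, 0.0, 1.0])
            v, c = np.cross(n, z), float(n @ z)
            V = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
            s2 = float(v @ v)
            R = np.eye(3) if s2 < 1e-12 else np.eye(3) + V + V @ V * ((1 - c) / s2)
            return R, np.array([0.0, 0.0, abs(d)])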

    B. Estimate Pose

    Given the ground plane and the transformation [R_t^g, T_t^g] from the camera coordinate to the local 3D coordinate, the robot pose (x_t^r, y_t^r, θ_t^r) captures the remaining three degrees of freedom of the camera pose in the 3D world coordinate. We start by estimating the pose change between frames t and t−1 by aligning sparse feature correspondences between the two RGB-D images. We extract Harris corner features in the RGB image at frame t−1, and obtain their corresponding image locations at frame t using KLT tracking. Then, we find a rigid-body transformation [R_t^p, T_t^p] that aligns the 3D locations of the correspondences in the two frames. Note that since we only have three degrees of freedom, R_t^p is a rotation matrix about the z-axis and T_t^p is a translation vector in the x-y plane. With this initial estimate of the robot pose^2, we use the Iterative Closest Point (ICP) algorithm to refine [R_t^p, T_t^p] using all the 3D points that are not on the ground plane. Finally, the robot pose in the world coordinate can be computed from the pose change.

    ^2 If there are not enough sparse feature correspondences to compute the initial estimate, we assume the robot is moving with constant motion and use the pose change between frames t−1 and t−2 as our initial estimate.
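    The 3-dof alignment step can be illustrated with a 2D Procrustes (Kabsch) solve on the x-y components of the matched 3D points, since the rotation is about the z-axis and the translation lies in the x-y plane. This is a sketch of one standard way to compute the initial [R_t^p, T_t^p]; the paper does not specify the solver, and the ICP refinement is not shown.

        import numpy as np

        def planar_rigid_align(P_prev, P_curr):
            """Estimate a 3-dof pose change (rotation about z, translation in x-y)
            from matched 3D points expressed in their local 3D coordinates (z up).

            P_prev, P_curr : (N, 3) corresponding points at frames t-1 and t.
            Returns (R_p, T_p): a 3x3 z-axis rotation and an x-y translation.
            """
            A = P_prev[:, :2]          # x-y only; height is fixed by the ground plane
            B = P_curr[:, :2]
            ca, cb = A.mean(axis=0), B.mean(axis=0)
            H = (A - ca).T @ (B - cb)  # 2x2 cross-covariance
            U, _, Vt = np.linalg.svd(H)
            D = np.diag([1.0, np.linalg.det(Vt.T @ U.T)])   # guard against reflection
            R2 = Vt.T @ D @ U.T        # rotation taking frame t-1 points onto frame t points
            t2 = cb - R2 @ ca
            R_p = np.eye(3)
            R_p[:2, :2] = R2
            T_p = np.array([t2[0], t2[1], 0.0])
            return R_p, T_p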

    Fig. 3. Extract features from the RGB-D frame shown in Figure 2. (Best viewed in color.) Once the ground plane is extracted, we extract features from the points that are not on the ground plane. (a) Extract vertical-plane features: we first extract vertical-plane features by fitting line segments to the points in the ground-plane map. Points are colored according to the vertical-plane feature that they generated, and points that cannot be grouped into line segments are marked in black. (b) Extract point-set features: we extract point-set features by clustering the remaining points based on their Euclidean distances. In general, vertical-plane features suggest the existence of walls, and point-set features suggest the existence of clutter. However, some vertical-plane features may come from a clutter region (e.g. the blue trash can), and some (red and orange) point-set features may be noise from a wall plane.

    C. Planar Semantic Model

    We use the Planar Semantic Model (PSM) proposed in our previous work [14] to represent the structure of an indoor environment. In this paper, we use the following notation to represent a PSM M:

        M      = {G, W_1, W_2, W_3, ..., W_n}
        W_i    = ⟨α_i, d_i, S_{i1}, S_{i2}, ..., S_{im_i}⟩
        S_{ij} = ⟨x^i_{1,j}, y^i_{1,j}, u^i_{1,j}, x^i_{2,j}, y^i_{2,j}, u^i_{2,j}⟩   (1)

    The PSM is defined on a ground-plane map, a 2D slice of the world coordinate along the ground plane. The PSM consists of the ground plane G and a set of walls {W_1, W_2, W_3, ..., W_n}. The location of a wall W_i is specified by a line on the ground-plane map parametrized by (α_i, d_i), where α_i ∈ (−π/2, π/2] is the orientation of the line and d_i ∈ R is the directed distance from the origin to the line. A wall W_i consists of a set of wall segments {S_{i1}, S_{i2}, ..., S_{im_i}} delimiting where the wall is present and where there is an opening. A wall segment S_{ij} is represented by two endpoints. An endpoint is defined by its location (x^i_{•,j}, y^i_{•,j}) on the ground-plane map and its type u^i_{•,j}, which represents different levels of understanding of the bound of the wall segment.

    D. Extract Features

    After extracting the ground plane, we group the points (pixels) from the current RGB-D frame into a set of vertical-plane features V = {v_j | j = 1, ..., n_v} and a set of point-set features C = {c_j | j = 1, ..., n_c}. Figure 3 shows an example of the features. A vertical-plane feature v is a plane segment that is perpendicular to the ground plane, and a point-set feature is a cluster of 3D points that cannot be grouped into vertical planar segments. Vertical-plane features suggest the existence of walls, and point-set features suggest the existence of clutter. However, not all vertical-plane features belong to the PSM; for example, a box on the ground also generates several vertical-plane features. Conversely, a point-set feature that is close to a wall may simply be a small object sticking out from the wall, or noise; for example, a door knob may become a point-set feature. In this paper, we use the vertical-plane features to generate and refine the hypotheses, and we use all features to test the hypotheses.

    Similar to a PSM wall, a vertical-plane feature v corresponds to a line segment in the ground-plane map:

        v = ⟨α^v, d^v, x^v_1, y^v_1, x^v_2, y^v_2⟩   (2)

    where α^v ∈ (−π/2, π/2] and d^v ∈ R. This is the same line parametrization used to represent a wall plane, and (x^v_1, y^v_1) and (x^v_2, y^v_2) are the two end-points of the line that denote where the vertical plane is visible. We first extract vertical-plane features from the 3D points that are not on the ground plane. (Points are represented in the 3D world coordinate.) We project the points onto the ground-plane map and use J-linkage [13] to fit a set of line segments. Each line segment forms a vertical-plane feature. We then remove the points that form the vertical-plane features and cluster the remaining points based on their 3D Euclidean distances. A cluster that consists of more than 100 points forms a point-set feature. Mathematically, a point-set feature c is represented by

        c = ⟨x^c, y^c, P^c⟩   (3)

    where P^c is the set of 3D points in the cluster, and (x^c, y^c) is the ground-plane map projection of the 3D mean location of the points. Points that fail to form features are considered noise and are discarded.
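    The point-set features of Eq. (3) can be obtained with a simple fixed-radius Euclidean clustering of the leftover points. The sketch below is one such implementation; the 100-point minimum is from the paper, while the linking distance and the use of a k-d tree are assumptions (the J-linkage line fitting for vertical-plane features is not shown).

        import numpy as np
        from scipy.spatial import cKDTree

        def extract_point_set_features(points, cluster_dist=0.1, min_points=100):
            """Cluster leftover 3D points (world coordinate) into point-set features.

            points       : (N, 3) points not on the ground plane and not assigned to
                           any vertical-plane feature
            cluster_dist : Euclidean linking distance (assumed; not given in the paper)
            min_points   : clusters smaller than this are discarded as noise (100 in the paper)
            Returns a list of (x_c, y_c, P_c) tuples as in Eq. (3).
            """
            tree = cKDTree(points)
            unvisited = set(range(len(points)))
            features = []
            while unvisited:
                # Grow one connected component by repeated fixed-radius neighbor queries.
                seed = unvisited.pop()
                frontier, cluster = [seed], [seed]
                while frontier:
                    idx = frontier.pop()
                    for nb in tree.query_ball_point(points[idx], cluster_dist):
                        if nb in unvisited:
                            unvisited.remove(nb)
                            frontier.append(nb)
                            cluster.append(nb)
                if len(cluster) >= min_points:
                    P_c = points[cluster]
                    x_c, y_c = P_c.mean(axis=0)[:2]   # ground-plane map projection of the mean
                    features.append((x_c, y_c, P_c))
            return features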

    E. Refine Hypotheses

    We use the information from the current frame to refine the precision of each hypothesis. The generic Extended Kalman Filter (EKF) is used to estimate the parameters of each wall and the location of each occluding endpoint [14].^3

    Vertical-plane features V_t are potential measurements for the walls that are visible in frame t. We associate the visible walls with the vertical-plane features based on the similarity of their corresponding lines on the ground-plane map. If a wall W_i is associated with a vertical-plane feature v_j, we use the EKF to update the wall parameters (α_i, d_i) based on the location of the vertical-plane feature (α^{v_j}, d^{v_j}).

    Once the wall parameters are updated, we refine the location of each occluding endpoint. Potential measurements for an occluding endpoint are the end-points of the vertical-plane features whose parameters are similar to those of its corresponding wall. We project these end-points onto the line of the corresponding wall and find the projected point nearest to the endpoint. If the distance between the nearest point and the endpoint is less than 0.2 meters, the projected point is taken as the measurement for that endpoint. An endpoint is updated using the EKF if its measurement is available.

    ^3 An occluding endpoint is a type of endpoint that is associated with only one observed wall [14].
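    The paper does not spell out the EKF equations, so the following is only a generic sketch of the wall update: with a direct measurement of the associated feature's line parameters (α^{v_j}, d^{v_j}), the observation model is the identity and the update reduces to a standard Kalman step. The noise covariance R_meas is an assumed tuning parameter.

        import numpy as np

        def ekf_update_wall(x, P, z, R_meas):
            """One Kalman update of a wall state x = [alpha_i, d_i] given a measurement
            z = [alpha_v, d_v] from an associated vertical-plane feature.

            x, P   : wall state (2,) and its covariance (2, 2)
            z      : measured line parameters of the feature (2,)
            R_meas : measurement noise covariance (2, 2), an assumed tuning parameter
            """
            H = np.eye(2)                       # direct measurement of (alpha, d)
            y = z - H @ x                       # innovation
            # Wrap the angle residual into (-pi/2, pi/2] to respect the line parametrization.
            y[0] = (y[0] + np.pi / 2) % np.pi - np.pi / 2
            S = H @ P @ H.T + R_meas            # innovation covariance
            K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
            x_new = x + K @ y
            P_new = (np.eye(2) - K @ H) @ P
            return x_new, P_new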

    F. Generate Hypotheses

    We combine vertical-plane features V_t to generate PSM hypotheses. In the first frame, a set of simple hypotheses is generated. A simple hypothesis is either a PSM with two parallel walls or one with at most three walls where each wall intersects its adjacent walls. These simple hypotheses are generated by combining vertical-plane features under certain constraints [15]. Every 10 frames, we transform each existing hypothesis into a set of child hypotheses describing the same environment in more detail. These child hypotheses are essential for incrementally modeling the environment as the robot travels. For example, when the robot is in a long corridor, the wall at the end of the corridor, or openings along the side walls, may not be perceivable from a distance, so hypotheses capturing these details can only be generated once the robot is close to them. Given an existing hypothesis, its child hypotheses are generated by adding openings to the walls [14] or by combining it with a simple hypothesis generated at the current frame.

    G. Test Hypotheses

    Hypotheses are tested using a recursive Bayesian filtering framework [15], [14]. Starting from a uniform prior, at each frame t we compute the likelihood p(F_t | M_i) of each hypothesis M_i to update its posterior probability p(M_i | F_{1:t}). In our previous work, we assumed that a hypothesis explains all the observed features in the current frame. This assumption works well in empty environments because all the observed features actually lie on the walls and the ground plane. However, in a cluttered environment, features may be part of clutter, which cannot be explained by the PSM, as shown in Figure 3. Thus, in this paper we design a likelihood function that allows a good hypothesis to explain only a subset of the observed features F_t = {V_t, C_t}.
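    Before detailing the likelihood itself, the recursive update it feeds is simply p(M_i | F_{1:t}) ∝ p(F_t | M_i) p(M_i | F_{1:t-1}), normalized over the current hypothesis set. A minimal sketch (the dictionary representation is an assumption):

        def update_posteriors(posteriors, likelihoods):
            """One step of the recursive Bayesian filter over PSM hypotheses.

            posteriors  : dict {hypothesis_id: p(M_i | F_{1:t-1})}, uniform at the first frame
            likelihoods : dict {hypothesis_id: p(F_t | M_i)} for the current frame
            Returns the normalized posteriors p(M_i | F_{1:t}).
            """
            unnormalized = {h: posteriors[h] * likelihoods[h] for h in posteriors}
            total = sum(unnormalized.values())
            if total == 0.0:     # degenerate case: every hypothesis explained nothing in view
                return posteriors
            return {h: p / total for h, p in unnormalized.items()}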

    The likelihood of hypothesis M_i consists of three terms:

        p(F_t | M_i) = p_c(F_t | M_i) p_a(F_t | M_i) p_s(F_t | M_i).   (4)

    These terms describe a three-way trade-off among the feature coverage of the hypothesis p_c(F_t | M_i), the accuracy of the explained features p_a(F_t | M_i), and the simplicity of the hypothesis p_s(F_t | M_i).

    The coverage term p_c(F_t | M_i) measures the fraction of features that M_i explains, regardless of the accuracy of the explanations. Formally,

        p_c(F_t | M_i) = ω_v |V_t^i| / |V_t| + ω_c |C_t^i| / |C_t|   (5)

    where V_t^i ⊆ V_t are the vertical-plane features and C_t^i ⊆ C_t are the point-set features that are explained by M_i. (One can define one's own metric to determine whether a feature is explained; our metric is presented in the appendix.) ω_v and ω_c are the importances of explaining vertical-plane features and point-set features, respectively (ω_v + ω_c = 1).^4

    ^4 In an indoor environment, a vertical-plane feature is more likely to be part of a wall plane, while a point-set feature is more likely to be clutter. Thus, we set ω_v > ω_c. In our experiments, ω_v = 0.7 and ω_c = 0.3.
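    Eq. (5) translates directly into code once the explained-feature counts are available (the appendix defines when a feature counts as explained). A sketch, with the paper's weights as defaults; the handling of an empty feature set is our assumption:

        def coverage_term(V_explained, V_total, C_explained, C_total, w_v=0.7, w_c=0.3):
            """Coverage term p_c(F_t | M_i) of Eq. (5).

            V_explained, C_explained : |V_t^i|, |C_t^i|, features explained by M_i
            V_total, C_total         : |V_t|, |C_t|, observed features in frame t
            w_v, w_c                 : importance weights (0.7 and 0.3 in the paper)
            """
            v_frac = V_explained / V_total if V_total else 0.0
            c_frac = C_explained / C_total if C_total else 0.0
            return w_v * v_frac + w_c * c_frac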

  • Fig. 4 panels: (a) frame t; (b) vertical-plane features extracted from the observation of frame t; (c) M1: a simple 1-wall model that perfectly explains some features; (d) M2: a simple 1-wall model that moderately explains all features; (e) M3: a complex 4-wall-segment model that perfectly explains all features; (f) the three-way trade-off among the likelihood factors, tabulated below.

        (f) Three-way trade-off among likelihood factors
        Hypothesis                                M1      M2      M3
        coverage   p_c(F_t|M_i)                   0.50    1.00    1.00
        accuracy   p_a(F_t|M_i) (σ = 0.20)        1.00    0.88    1.00
        accuracy   p_a(F_t|M_i) (σ = 0.05)        1.00    0.14    1.00
        simplicity p_s(F_t|M_i) (γ = 0)           0.50    0.50    0.50
        simplicity p_s(F_t|M_i) (γ = 1)           0.88    0.88    0.27

        Likelihood p(F_t|M_i) = p_c(F_t|M_i) p_a(F_t|M_i) p_s(F_t|M_i)
        CASE 1: σ = 0.20, γ = 0                   0.25    0.44    0.50
        CASE 2: σ = 0.20, γ = 1                   0.44    0.77    0.27
        CASE 3: σ = 0.05, γ = 0                   0.25    0.07    0.50
        CASE 4: σ = 0.05, γ = 1                   0.44    0.12    0.27

    Fig. 4. Example of the three-way trade-off among the likelihood factors. (Best viewed in color.) Assume we extract four vertical-plane features from frame t (a), as shown in (b), and three valid hypotheses (thick blue lines) are generated: (c) hypothesis M1 models the environment with one wall that perfectly explains only two of the features (green), leaving the other two unexplained (red); (d) hypothesis M2 models the environment with one wall that explains all the features but explains them poorly (yellow); (e) hypothesis M3 models the environment with four wall segments that explain all four features perfectly. As shown in (f), the likelihood function is a three-way trade-off among feature coverage of the hypothesis, accuracy of the explained features, and simplicity of the hypothesis. Depending on the parameters in each term, the likelihood function can prefer any of the hypotheses. If σ is small, the accuracy term is important, since the Gaussian function in p_a(F_t|M_i) gives large penalties to poor explanations. Conversely, if σ is large, the accuracy term becomes less important, because the Gaussian function is less discriminative between poor and good explanations. In this example, p_a(F_t|M2) increases dramatically as σ increases. The decay rate γ controls the preference towards simpler hypotheses (Fig. 5). When γ = 0, there is no preference regarding the simplicity of the hypothesis. As γ increases, the likelihood function increases its preference towards simpler hypotheses. In CASE 1, coverage is the most important factor, and since M3 has slightly better accuracy than M2, M3 has the highest likelihood. In contrast to CASE 1, CASE 2 considers both coverage and simplicity important, and thus M2 has a higher likelihood than the complex hypothesis M3. In CASE 3, where only coverage and accuracy are considered, M3 is the best because it perfectly explains all the features, while M1 and M2 have issues with coverage and accuracy, respectively. In all of the first three cases, M1 is not preferred because it has lower coverage. However, when all three factors are considered (CASE 4), M1 is the best.

    The accuracy term p_a(F_t | M_i) measures the accuracy of the features explained by M_i. Since different hypotheses may explain different subsets of the features F_t, we compute the weighted RMS error error(F_t, M_i) over the features that hypothesis M_i explains,

        error(F_t, M_i) = sqrt( [ ω_v Σ_{v_j ∈ V_t^i} ε_p(v_j, M_i)² + ω_c Σ_{c_j ∈ C_t^i} ε_c(c_j, M_i)² ] / [ ω_v |V_t^i| + ω_c |C_t^i| ] )   (6)

    where ε_p(v_j, M_i) and ε_c(c_j, M_i) are the errors of M_i explaining features v_j and c_j. (One can define one's own error metric; our metric is presented in the appendix.) The RMS error is modeled by a Gaussian distribution with zero mean and variance σ².^5 Mathematically,

        p_a(F_t | M_i) ∝ { 0,                                    if |V_t^i| + |C_t^i| = 0
                         { exp( −error(F_t, M_i)² / (2σ²) ),     otherwise.                 (7)

    Note that both the coverage and the accuracy terms require a hypothesis to explain at least one feature in order to obtain a non-zero likelihood. In other words, the Bayesian filter will filter out a hypothesis that explains nothing in view.

    ^5 In our experiments, σ² = 0.04. This value is selected to account for the noise of the sensor and the accumulated error of the pose estimation.
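    Eqs. (6)-(7) can likewise be transcribed directly, given the per-feature errors ε_p and ε_c for the features the hypothesis explains. A sketch, using σ² = 0.04 from footnote 5:

        import math

        def accuracy_term(v_errors, c_errors, w_v=0.7, w_c=0.3, sigma2=0.04):
            """Accuracy term p_a(F_t | M_i) of Eqs. (6)-(7), up to proportionality.

            v_errors : errors eps_p(v_j, M_i) of the explained vertical-plane features
            c_errors : errors eps_c(c_j, M_i) of the explained point-set features
            sigma2   : Gaussian variance (0.04 in the experiments, footnote 5)
            """
            n_weighted = w_v * len(v_errors) + w_c * len(c_errors)
            if n_weighted == 0:
                return 0.0                       # a hypothesis that explains nothing is filtered out
            sq = w_v * sum(e * e for e in v_errors) + w_c * sum(e * e for e in c_errors)
            rms = math.sqrt(sq / n_weighted)     # weighted RMS error of Eq. (6)
            return math.exp(-rms * rms / (2.0 * sigma2))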

    The simplicity term p_s(F_t | M_i) measures the amount of information that M_i uses to explain the features. This is a regularization term that prevents the Bayesian filter from over-fitting the features. p_s(F_t | M_i) is modeled by a decreasing generalized logistic function with decay rate γ,

        p_s(F_t | M_i) = 1 / ( 1 + exp( γ (|M_i|_t − n_γ^max) ) )   (8)

    where |M_i|_t is the number of walls of M_i that are visible in frame t and n_γ^max is where the maximum decay occurs.^6

    ^6 In our experiments, γ = 0.3 and n_γ^max = 10, to ensure that the likelihoods for having one to five walls are similar, because these are the common numbers of walls visible in a frame given the field of view of our depth camera (57° horizontally).

    Fig. 5. The simplicity term p_s(F_t|M_i) of the likelihood function for different values of γ with fixed n_γ^max = 3. (Best viewed in color.) The horizontal axis is |M_i|_t, the number of walls in view, and the vertical axis is p_s(F_t|M_i). In the extreme case γ = 0, the likelihood function has no preference towards simpler hypotheses. As γ increases, the Bayesian filter is more likely to converge to a simpler hypothesis.
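    Eq. (8) and the product of Eq. (4) complete the likelihood. A sketch, with γ = 0.3 and n_γ^max = 10 from footnote 6; the helper names are ours:

        import math

        def simplicity_term(n_walls_in_view, gamma=0.3, n_max=10):
            """Simplicity term p_s(F_t | M_i) of Eq. (8): a decreasing logistic in the
            number of walls of M_i visible in frame t (gamma = 0.3, n_max = 10 in the paper)."""
            return 1.0 / (1.0 + math.exp(gamma * (n_walls_in_view - n_max)))

        def likelihood(p_c, p_a, p_s):
            """Eq. (4): the likelihood is the product of coverage, accuracy, and simplicity."""
            # e.g. CASE 2 of Fig. 4 for M2: likelihood(1.00, 0.88, 0.88) ≈ 0.77
            return p_c * p_a * p_s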

    Figure 4 demonstrates the trade-offs among the three terms (coverage p_c(F_t|M), accuracy p_a(F_t|M), and simplicity p_s(F_t|M)) in the likelihood function. The importance of each term is controlled by two parameters: the variance σ² of the Gaussian function in the accuracy term p_a(F_t|M) and the decay rate γ in the simplicity term p_s(F_t|M). The variance σ² mainly controls the importance of the accuracy term. A large variance σ² in the accuracy term means that the difference between good and poor explanations is small, and

  • Fig. 6 panels: (a) M4, a good hypothesis; (b) M5, a mediocre hypothesis; (c) likelihood comparisons with and without the clutter concept, tabulated below.

        (c) Likelihood comparisons with and without the clutter concept
                               Full Explanation        Partial Explanation
        Hypothesis              M4        M5            M4        M5
        coverage  p_c           1         1             0.46      0.94
        accuracy  p_a           0.12      0.25          0.95      0.29
        simplicity p_s                 (same for M4 and M5)
        Likelihood              0.12      0.25          0.44      0.31

    Fig. 6. The environment in Figs. 2 and 3 demonstrates the importance of allowing a hypothesis to explain only a subset of the features. (Best viewed in color.) Both M4 and M5 use one wall (thick blue line) to represent the environment, so the simplicity term is the same for both hypotheses. If we force both hypotheses to explain all the features, then M5 has a higher likelihood than M4 because it has a lower RMS error (ω_v = 0.7 and σ = 0.2). If we allow the hypotheses to explain only a subset of the features, M4 obtains the higher likelihood. Green features are perfectly explained (zero error) by the hypothesis, yellow features are explained with errors, and red features are not explained. See text for more details.

    thus the likelihood function mostly reflects the feature coverage. Conversely, a small variance σ² means that the penalty for a poor explanation is high, and thus the accuracy of the explanation is highly important. The decay rate γ controls the preference toward having a simpler model, as shown in Fig. 5. In the extreme case γ = 0, the simplicity term is the same for all hypotheses, and thus the likelihood does not prefer simpler hypotheses. As γ increases, the likelihood function increasingly prefers a simpler hypothesis with reasonable coverage and accuracy.

    On one hand, the parameters in the likelihood function allow the user to set their preferences based on the application. If the purpose of exploration and mapping is to determine free space for safe navigation, then one set of parameter values might be preferred, while if the purpose is to create an architectural CAD model of the environment, a different set of values may be preferable. On the other hand, the hypothesis that the Bayesian filter converges to may seem sensitive to the parameters. In fact, for the purpose of navigation, it is reasonable to converge to any of the three hypotheses in Fig. 4. Both M3 and M1 specify the same part of the environment as free space, but they specify the free space in different ways. M2 is less precise in specifying the boundaries of free space, and the accuracy of its explanation p_a(F_t|M2) requires a robot to be more cautious about the boundaries. In other words, the accuracy term provides a confidence measure of free space for navigation algorithms [8]. In a simple empty environment like this, most hypotheses are reasonable. However, in a more complex or cluttered environment, many bizarre hypotheses will be generated, and the likelihood function allows us to converge to a reasonable hypothesis (see Section III).

    Figure 6 is an example that shows why it is important to allow a good hypothesis to explain only a subset of the observed features. If we require a hypothesis to explain all the observed features (p_c(F_t|M4) = p_c(F_t|M5) = 1), then the mediocre hypothesis M5 dominates M4 because it has a lower RMS error. However, if a hypothesis can distinguish between clutter and non-clutter features, then the higher-quality partial explanation M4 dominates the nearly complete but lower-quality hypothesis M5.

    III. RESULTS

    We evaluated our approach using four RGB-D video datasets in various indoor environments.^7 The videos were collected by a Kinect sensor mounted on a wheeled platform pointing forward. The relative pose between the camera and the ground plane is fixed within each video, but differs between videos. In Datasets CORNER and LAB, the robot traveled about 1.5 meters in a very cluttered corner. In Dataset INTERSECTION, the robot made a right turn around an empty L-intersection. In Dataset CORRIDOR, the robot traveled about 5 meters in a long corridor with objects on the sides.

    Figure 8 shows the results on these videos. Our method converges to a reasonable hypothesis describing each environment in 3D. Once we obtain the best hypothesis, we separate out clutter: observations that were not explained by the hypothesis. At each frame, each hypothesis has its own partition of explained and unexplained features. The unexplained observations are the 3D points that contribute to these unexplained features at each frame. We further cluster these unexplained observations into a set of 3D regions based on their Euclidean distances. In most cases, a cluttered region is either an object or a pile of objects, but in some situations (INTERSECTION) the cluttered region may be part of the building, such as a pillar along the wall.

    To evaluate our method quantitatively, every 10 frames we manually labeled the ground-truth classification of the planes (i.e. the walls, ground, and ceiling) and the ground-truth classification of clutter in the projected image space. We define the Plane Accuracy of a hypothesis as the percentage of pixels with the correct plane classification, and the Scene Accuracy of a hypothesis as the percentage of pixels with the correct scene interpretation (PSM + clutter). When computing these accuracies, only non-ceiling pixels with valid depth data are considered, because the ceiling is not modeled in the PSM and pixels with invalid depth data are not used to compute the likelihood. The average accuracies of the maximum a posteriori hypotheses are reported in Fig. 7. To illustrate the importance of partial explanation, we ran an experiment without the concept of clutter (p_c(F_t|M_i) = 1 and p_a(F_t|M_i) computed over all the features). Among all datasets, only INTERSECTION converges to the correct hypothesis, and the overall Plane Accuracy without the clutter concept is 67.11%. Thus, allowing partial explanation is key to handling cluttered environments.

    Besides locating free space for navigation, our output interpretation is an important step towards reasoning about

    ^7 Publicly available at http://www.eecs.umich.edu/~gstsai/release/Umich_indoor_corridor_2014_dataset.html.

  •     Dataset           CORNER    LAB       CORRIDOR   INTERSECTION   Overall
        Clutterness       34.84%    20.80%    9.50%      10.22%         16.50%
        Plane Accuracy    98.18%    99.31%    97.83%     98.25%         98.49%
        Scene Accuracy    92.82%    95.10%    91.76%     98.16%         94.83%

    Fig. 7. Quantitative evaluation. The average accuracies of the maximum a posteriori hypotheses at the evaluated frames are reported. Plane Accuracy measures the accuracy of the hypothesized PSM model, and Scene Accuracy measures the accuracy of the whole interpretation (PSM + clutter).

    objects in the local environment. Our interpretation segments out regions that may contain objects, for use by object-reasoning methods. A clutter region is a smaller problem for object reasoning compared to the whole scene. For example, we applied a functionality-based object classification [6] to the clutter region containing the chairs in LAB and determined that this region is “sittable” among other functionality classes (“table-like”, “cup-like”, and “layable”).

    IV. CONCLUSION

    We addressed the dilemma of model-based scene interpretation. On one hand, it is valuable to interpret the environment with a simple model and separate out the regions that cannot be described by the simple model (clutter), as opposed to having a fine-grained model that captures all the details in the environment. On the other hand, when testing multiple model hypotheses, it is preferable to select a hypothesis that explains more of the observations and leaves fewer observations unexplained. We proposed a likelihood function that handles this dilemma. The likelihood function makes explicit a three-way trade-off among coverage of the observed features, accuracy of the explanation, and simplicity of the hypothesis.

    We demonstrated the likelihood function in the context of an on-line generate-and-test framework for visual scene understanding [14]. Our experimental results on a variety of RGB-D videos demonstrate that our method is capable of interpreting cluttered environments with a simple model (PSM) and clutter. Our output interpretation not only provides information about free space for navigation, but also separates out segments of clutter, represented by 3D point clouds, that require further analysis with more complex models. Our approach can be applied to further interpret clutter with other models (e.g. object models).

    APPENDIX

    Explaining vertical-plane features: A vertical-plane feature v is explained by hypothesis M if v can be explained by a wall segment S_{ij} in M. Feature v is explained by M if the error ε_p(v, M) is less than a threshold (0.1 in our experiments). The error metric ε_p(v, M) is computed as follows. First, we find the wall plane W_match ∈ M that best matches v by computing the displacement dist(v, W_i) between v and each wall W_i based on their corresponding lines in the ground-plane map. For efficiency, we only consider walls that have a similar angle to the vertical-plane feature. Two angles are similar if min(|α_i − α^v|, π − |α_i − α^v|) < 0.0873. If no wall is within the angle constraint, v is not explained by M. Each wall has a coordinate frame for computing the displacement: the origin of the frame is at the weighted center of the line of v and the line of wall W_i, and the x-axis is along the weighted average direction of the two lines, where the weight of each line is proportional to its length. The displacement dist(v, W_i) of the two lines is defined as the maximum difference along the y-axis. W_match is the wall with minimum dist(v, W_i). If the entire extent of v lies within a single wall segment of W_match, then ε_p(v, M) = dist(v, W_match); otherwise, v is not explained by M.

    Explaining point-set features: We define the error metric ε_c(c, M) as the shortest distance from the point-set feature location (x^c, y^c) to a wall W_i in M. Feature c is not explained by M if the error is larger than the same threshold used for vertical-plane features (0.1). Feature c is also not explained if the projected location of the feature does not lie within the bounds of any wall segment of W_i. Moreover, we take into consideration the distribution of the points P^c in c: point-set feature c is explained only if 70% of its points are within the threshold distance of wall W_i.
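    As an illustration of the point-set test above, the sketch below checks one wall against a feature c = (x^c, y^c, P^c). It reuses the Wall, WallSegment, and Endpoint classes from the PSM sketch in Section II-C; the (α, d) sign convention and the segment-bound check are assumptions, while the 0.1 threshold and the 70% rule come from the appendix.

        import numpy as np

        def point_to_line_distance(px, py, alpha, d):
            """Distance from (px, py) to a wall line on the ground-plane map.  Here alpha
            is the line's orientation and d its directed distance from the origin, so the
            line is taken as {-x*sin(alpha) + y*cos(alpha) = d} (an assumed convention)."""
            return abs(-px * np.sin(alpha) + py * np.cos(alpha) - d)

        def segment_contains_projection(seg, x, y, alpha):
            """True if the projection of (x, y) onto the wall line falls between the
            projections of the segment's two endpoints."""
            dx, dy = np.cos(alpha), np.sin(alpha)          # direction of the wall line
            t = x * dx + y * dy
            t1 = seg.p1.x * dx + seg.p1.y * dy
            t2 = seg.p2.x * dx + seg.p2.y * dy
            return min(t1, t2) <= t <= max(t1, t2)

        def explains_point_set(wall, x_c, y_c, P_c, thresh=0.1, frac=0.7):
            """Appendix test of whether one wall of M explains a point-set feature.
            `wall` is a Wall from the PSM sketch above; P_c is an (N, 3) point array."""
            if point_to_line_distance(x_c, y_c, wall.alpha, wall.d) > thresh:
                return False
            if not any(segment_contains_projection(s, x_c, y_c, wall.alpha)
                       for s in wall.segments):
                return False
            dists = np.array([point_to_line_distance(p[0], p[1], wall.alpha, wall.d)
                              for p in P_c])
            return float(np.mean(dists <= thresh)) >= frac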

    REFERENCES

    [1] A. J. Davison. Real-time simultaneous localization and mapping with a single camera. ICCV, 2003.
    [2] A. Flint, D. Murray, and I. Reid. Manhattan scene understanding using monocular, stereo, and 3d features. ICCV, 2011.
    [3] A. Furlan, S. Miller, D. G. Sorrenti, L. Fei-Fei, and S. Savarese. Free your camera: 3d indoor scene understanding from arbitrary camera motion. BMVC, 2013.
    [4] Y. Furukawa, B. Curless, S. Seitz, and R. Szeliski. Reconstructing building interiors from images. ICCV, 2009.
    [5] V. Hedau, D. Hoiem, and D. Forsyth. Recovering the spatial layout of cluttered rooms. ICCV, 2009.
    [6] L. Hinkle and E. Olson. Predicting object functionality using physical simulations. IROS, 2013.
    [7] D. Holz, S. Holzer, R. B. Rusu, and S. Behnke. Real-time plane segmentation using RGB-D cameras. RoboCup 2011: Robot Soccer World Cup XV, 2012.
    [8] J. J. Park, C. Johnson, and B. Kuipers. Robot navigation with model predictive equilibrium point control. IROS, 2012.
    [9] G. Klein and D. Murray. Parallel tracking and mapping for small AR workspaces. ISMAR, 2007.
    [10] D. C. Lee, M. Hebert, and T. Kanade. Geometric reasoning for single image structure recovery. CVPR, 2009.
    [11] P. Newman, D. Cole, and K. Ho. Outdoor SLAM using visual appearance and laser ranging. ICRA, 2006.
    [12] C. J. Taylor and A. Cowley. Parsing indoor scenes using RGB-D imagery. RSS, 2012.
    [13] R. Toldo and A. Fusiello. Robust multiple structure estimation with J-linkage. ECCV, 2008.
    [14] G. Tsai and B. Kuipers. Dynamic visual understanding of the local environment for an indoor navigating robot. IROS, 2012.
    [15] G. Tsai, C. Xu, J. Liu, and B. Kuipers. Real-time indoor scene understanding using Bayesian filtering with motion cues. ICCV, 2011.
    [16] H. Wang, S. Gould, and D. Koller. Discriminative learning with latent variables for cluttered indoor scene understanding. ECCV, 2010.

  • Fig. 8 rows (top to bottom): CORNER, LAB, INTERSECTION, CORRIDOR.

    Fig. 8. Evaluation of our approach on interpreting indoor environments with a Planar Semantic Model and clutter. (Best viewed in color.) The first column is the image from our dataset. (The depth images are not shown in this figure.) The second column is the 3D interpretation obtained by our method, visualized in the ground-plane map. For the PSM, the blue lines are the wall planes and the big dots are the endpoints; the endpoints are color-coded by type (green: dihedral; yellow: occluding; red: indefinite) [14]. Clutter, the observations that are not explained by the model, is represented by a 3D point cloud, visualized by the tiny red dots. Notice that some of the clutter points are caused by errors in the pose estimation. The pose estimation is less accurate in CORRIDOR because a large portion of the pixels have unreliable depth measurements due to distance. The third column is the image projection of the PSM: the ground is green and each wall is shown in a different color. The parts of the images that are not painted are too far away from the robot to obtain reliable depth data, so the robot will start modeling those regions as it gets closer. The fourth column shows the clutter in the image. We cluster the clutter points in 3D into regions, and each cluster corresponds to a 2D segment in the image space; these segments are shown in different colors. In general, each clutter region consists of an object or a pile of objects, but in INTERSECTION the clutter region is actually a pillar along the wall. Our interpretation does not model the objects, since the PSM can only represent the ground plane and walls. However, if object models are given, our method can be applied to factor each clutter region into known objects and points that remain unexplained.