
Analysis of Goal-directed Manipulation in Clutter using Scene Graph Belief Propagation

Karthik Desingh Anthony Opipari Odest Chadwicke Jenkins

Abstract— Goal-directed manipulation requires a robot to perform a sequence of manipulation actions in order to achieve a goal. Sequential pick-and-place actions can be planned using a symbolic task planner that requires the current scene and the goal scene to be represented symbolically. In addition to the symbolic representation, an object's pose in the robot's world is essential to perform manipulation. An axiomatic scene graph is a standard representation that consists of symbolic spatial relations between objects along with their poses. Perceiving an environment to estimate its axiomatic scene graph is a challenging problem, especially in cluttered environments with objects stacking on top of and occluding each other. As a step towards perceiving cluttered scenes, we analyze the clutteredness of a scene by measuring the visibility of objects. The more an object is visible to the robot, the higher the chances of perceiving its pose and manipulating it. As the visibility of an object is not directly computable, a measure of uncertainty on object pose estimation is necessary. We propose a belief propagation approach to perceive the scene as a collection of object pose beliefs. These pose beliefs can provide the uncertainty to allow or discard the planned actions from the symbolic planner. We propose a generative inference system that performs nonparametric belief propagation over scene graphs. The problem is formulated as a pairwise Markov Random Field (MRF) where each hidden node (a continuous pose variable) is an observed object's pose and the edges (discrete relation variables) denote the relations between the objects. A robot experiment demonstrates the necessity of beliefs to perform goal-directed manipulation.

I. INTRODUCTION

Autonomous robots must robustly perform goal-directed manipulation to achieve tasks specified by a human user. Goal-directed autonomy is a challenging field of research where robots must interpret a goal state and achieve it by performing a sequence of manipulation actions. If the initial and goal states are represented symbolically, classical sequential planning algorithms [6], [20] can be used to plan a sequence of actions that achieves the goal state. In order to execute a planned action, the robot also requires the object's pose in the robot's world. An axiomatic scene graph is a standard representation that encompasses the symbolic spatial relations between objects along with their poses. However, perceiving an axiomatic scene graph is challenging, especially in cluttered environments where objects are occluded and only partially visible to the sensor. A classical planner assumes full perception of the scene while planning a sequence of actions. This assumption demands an accurate estimation of the scene from a perception system. This

K. Desingh, A. Opipari, and O. C. Jenkins are with the Department of Computer Science, University of Michigan, Ann Arbor, MI, USA, 48109-2121.

Fig. 1: The robot is observing a scene with objects in a tray. The goal of the robot is to perform a sequence of manipulation actions and achieve a goal scene. The bottom row shows object bounding boxes provided by an object detector, and the initial and goal scene graphs with object relations (child node on parent node) known a priori. The colors of the nodes in the scene graphs correspond to objects whose detections have the same color. The numbers below the leaf nodes of the initial scene graph correspond to their percentage of visibility to the robot at this viewpoint.

is an unrealistic assumption given the complexity of typical human environments. An object's pose estimation depends on the observability/visibility of the object, which in turn affects the success of the manipulation. However, a perception system cannot measure the visibility of an object directly and hence should be able to provide a measure of uncertainty in the pose estimation. In this paper we propose a belief propagation based approach to estimate not only the poses of the objects but also a measure of uncertainty. This measure of uncertainty is used to allow or discard the actions planned by the symbolic planner.

An example task is shown in Fig. 1, where a robot is given a tray full of objects and commanded to achieve a goal scene state where all the objects are on the table. If the scene graph is provided to the robot to simplify the goal-directed manipulation task, a planner can be used to generate a plan. As the planner does not have any information about the objects' poses in the world, all leaf nodes are equally likely


to be part of the first pick-and-place action. But in the real world, only some of these leaf node objects are visible. The objects whose visibility percentages are shown in red are visible less than 30%. Hence, the perception system must output high uncertainty for these objects. In this fashion, the perception system helps prune all the implausible actions.

We formulate the problem as a graphical model to perform belief propagation. The graphical model approach has been proven to perform well for articulated 3D human pose estimation [15] and visual hand tracking problems [17]. In such problems, the inference leverages the probabilistic constraints posed on the articulations of body parts by the human skeletal structure. These structural constraints are modeled precisely with domain knowledge and do not change from scene to scene for the human body. In the proposed framework, however, we infer the poses of the objects using their scene-specific relationships. For example, objects A and B can have an on relationship in the current scene, but this relationship is not universal. In contrast, the constraint that objects A and B do not interpenetrate each other is a universal physical relationship. The potentials defined on these relationships generalize to different scene graphs, which change from scene to scene as actions are performed. Ideally, an inference algorithm for goal-directed manipulation should produce an axiomatic scene graph [18]. This paper, however, focuses only on estimating the object poses. The problem is formulated as a Markov random field (MRF) where each hidden node represents an observed object's pose (a continuous variable), each observed node holds the information about the object from the observation (discrete), and the edges of the graph denote the known relationships between object poses.

The main contribution of this paper is a generative scene understanding approach designed to support goal-directed manipulation. Our approach models inter-object relations in the scene and produces beliefs over object poses through nonparametric belief propagation (NBP) [16]. We detail the adaptation of NBP to the scene estimation domain. In addition, we propose a measure to analyze the clutteredness of a scene in terms of objects' visibility.

II. RELATED WORK

Our aim is to compute a scene estimate that allows a robot to use sequential planning algorithms, such as STRIPS [6] and SHRDLU [20]. A scene graph representation is one way to describe the world. However, the information on objects' poses should be perceived precisely to complement the planning algorithms in order to let the robot make a physical change in the world. Chao et al. [1] perform taskable symbolic goal-directed manipulation by associating observed robot percepts with knowledge categories. This method uses background subtraction to adaptively build appearance models of objects and obtain percepts, but is sensitive to lighting and object color. Narayanaswamy et al. [12] perform scene estimation and goal-directed robot manipulation in cluttered scenes for flexible assembly of structures. In contrast, we use the scene graph prior to run a perception system that estimates the object poses while

maintaining the scene graph relations. Sui et al. [18] attempt to estimate a scene in a generative way; however, physical interactions are indirectly constrained by collision checks on object geometries. We avoid these expensive collision checks on object geometries and instead use potential functions that softly constrain inter-object collisions. Dogar et al. [5] use physics simulations and reasoning to accomplish object grasping and manipulation in clutter. However, their work does not include object-object interactions in their clutter environments. The method proposed in this paper does not depend on a physics engine and also avoids explicit collision checking. Collet et al. [4] and Papazov et al. [13] apply a bottom-up approach of using local features and 3D geometry to estimate the poses of objects for manipulation. In our work, we produce not only the object pose estimates but also beliefs over the object poses, which give the confidence in the estimation. ten Pas and Platt [19] propose a model-free approach to localize graspable points in highly unstructured scenes of diverse unknown objects, which is directed towards pick-and-drop applications and does not apply to the pick-and-place context. Model-based robot manipulation as described in [3] can benefit from the scene graph based belief propagation proposed in this paper.

Probabilistic graphical model representations such as Markov random fields (MRFs) are widely used in computer vision problems where the variables take discrete labels such as foreground/background. Many algorithms have been proposed to compute the joint probability of the graphical model. Belief propagation algorithms are guaranteed to converge on tree-structured graphs. For graph structures with loops, Loopy Belief Propagation (LBP) [10] is empirically proven to perform well for discrete variables. The problem becomes nontrivial when the variables take continuous values. Nonparametric Belief Propagation (NBP) by Sudderth et al. [16] and Particle Message Passing (PAMPAS) by Isard et al. [8] provide sampling approaches to perform belief propagation with hidden variables that take continuous values. Both of these approaches approximate a continuous valued function as a mixture of weighted Gaussians and use local Gibbs sampling to approximate the product of mixtures. This has been used effectively in applications such as human pose estimation [15] and hand tracking [17] by modeling the graph as a tree-structured particle network. This approach has not been applied to scene understanding problems where a scene is composed of household objects with no strict constraints on their interactions. In this paper we propose a framework that can compute beliefs over object poses with relaxed constraints on their relations using a scene graph.

Model-based generative methods [11] are increasingly being used to solve scene estimation problems, where heuristics from discriminative approaches such as Convolutional Neural Networks (CNNs) [14], [7] are utilized to infer object poses. These approaches do not account for object-object interactions and rely significantly on the effectiveness of the recognition. Our framework does not rely entirely on the effectiveness of training the recognition system and can handle noisy priors as long as the priors have 100% recall.


Fig. 2: (a) Observed detections; (b) scene graph prior; (c) Markov random field; (d) initial belief samples; (e) final belief samples; (f) final estimate. The graphical model in (c) is constructed using the scene graph prior along with the observed 2D object detections derived from the RGBD sensor data. In the constructed graphical model, the edges denote the type of relation between the objects: black for support relations, cyan for non-colliding relations. The colors in (b) correspond to the object bounding box colors in (a). Inference on this graph is initialized with the object bounding boxes and the corresponding point clouds. Initial belief samples are shown in (d), and the inference iteratively propagates these beliefs to (e). The final estimate is shown in (f), obtained with a post-processing step using Iterative Closest Point (ICP) (discussed later in the paper).

Recent works such as [21] propose systems that can work on in-the-wild image data and refine object detections along with their relations. However, these methods do not consider continuous poses in their estimation and work in the pixel domain. Chua et al. [2] propose a scene grammar representation and belief propagation over factor graphs, whose objective is to generate scenes with multiple objects satisfying the scene grammars. We specifically deal with RGBD observations and infer continuous pose variables, whereas Chua et al. [2] applies to RGB observations and discrete variables.

III. METHODOLOGY

A. Problem Statement

A robot observes a scene using an RGBD sensor, which gives an RGB image $I$ and a point cloud $P$. An object detector takes $I$ as input and produces object bounding boxes $B = \{b_1, b_2, \ldots, b_k\}$ and corresponding confidence scores $C = \{c_1, c_2, \ldots, c_k\}$, with $k$ being the number of detections. We make the assumption that the detector has 100% recall. Using these detections, an undirected graph $G = (V, E)$ is constructed with nodes $V$ and edges $E$. For each unique object label from the object detection result, there exists a corresponding observed node in the graph $G$. Let $Y = \{y_s \mid y_s \in V\}$ denote the set of observed variables, where $y_s = (l_s, B_s, C_s)$, with detections $B_s \subseteq B$ and confidences $C_s \subseteq C$ from the object detector corresponding to object label $l_s \in L$. Each observed node is connected to a hidden node that represents the pose of the underlying object. Let $X = \{x_s \mid x_s \in V\}$ denote the set of hidden variables, where $x_s \in \mathbb{R}^d$ with $d$ being the dimension of the pose of the object. The graph $G$ consists of $N$ hidden nodes if there are $N$ objects with labels $L = \{l_1, l_2, \ldots, l_N\}$ present in the scene. However, $k \geq N$, as there could be multiple detections with the same label. $G$ represents a scene with the observed and hidden nodes. Scene estimation involves finding this graph structure along with the inference on the hidden nodes. In this paper we assume that the graph structure is known a priori. This known graph structure, in the form of a scene graph, provides the edges $E$ encoding the relations between the hidden nodes. The edges in our problem represent the relations between the objects in the scene. More precisely, we have two types of edges: one indicating that an object is supporting/supported by another object (black in Fig. 2), and the other indicating that an object is not in contact with another object (cyan in Fig. 2). The joint probability of this graph, considering only second order cliques, is given as:

$$p(x, y) = \frac{1}{Z} \prod_{(s,t) \in E} \psi_{s,t}(x_s, x_t) \prod_{s \in V} \phi_s(x_s, y_s) \qquad (1)$$

where $\psi_{s,t}(x_s, x_t)$ is the pairwise potential between nodes $x_s$ and $x_t$, $\phi_s(x_s, y_s)$ is the unary potential between the hidden node $x_s$ and observed node $y_s$, and $Z$ is a normalizing factor. The pairwise potential can be modeled using the type of edge to perform the inference over this graph. We use Nonparametric Belief Propagation (NBP) [16] to pass messages over continuous variables and perform inference on a loopy graph structure such as ours (see Fig. 2).
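To make the construction concrete, the following sketch (with hypothetical helper types and names; not the authors' implementation) shows one way the observed nodes and typed edges of such a graph could be represented in code:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ObservedNode:
    label: str                               # object label l_s
    boxes: List[Tuple[int, int, int, int]]   # detections B_s as (x, y, w, h)
    scores: List[float]                      # confidences C_s

@dataclass
class SceneMRF:
    observed: Dict[str, ObservedNode] = field(default_factory=dict)
    # edges keyed by (label_s, label_t); value is the relation type
    edges: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def add_detection(self, label, box, score):
        # k >= N: several detections may share the same label
        node = self.observed.setdefault(label, ObservedNode(label, [], []))
        node.boxes.append(box)
        node.scores.append(score)

    def add_relation(self, label_s, label_t, relation):
        # "support" / "non_contact" mirror the black / cyan edges in Fig. 2
        assert relation in ("support", "non_contact")
        self.edges[(label_s, label_t)] = relation

# usage: one hidden pose variable x_s per unique label in mrf.observed
mrf = SceneMRF()
mrf.add_detection("oats", (120, 40, 80, 150), 0.91)
mrf.add_detection("lemon_tea", (150, 20, 60, 70), 0.84)
mrf.add_relation("oats", "lemon_tea", "support")   # oats supports lemon tea
```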

After convergence over iterations, the maximum likelihood estimate of the marginal belief gives the pose estimate $x_s^{est}$ of the object corresponding to node $s$ in the graph $G$. The collection of all these pose estimates forms the scene estimate.

B. Nonparametric Belief Propagation

Loopy belief propagation in the context of continuous variables is shown in Algorithm 1. Computing message updates in the continuous domain is nontrivial. A message update in a continuous domain at iteration $n$ from node $t \to s$ is:

$$m^n_{ts}(x_s) \leftarrow \int_{x_t \in \mathbb{R}^d} \psi_{st}(x_s, x_t)\, \phi_t(x_t, y_t) \prod_{u \in \rho(t) \setminus s} m^{n-1}_{ut}(x_t)\, dx_t \qquad (2)$$

where $\rho(t)$ is the set of neighbor nodes of $t$. The marginal belief over each hidden node at iteration $n$ is given by:

$$bel^n_s(x_s) \propto \phi_s(x_s, y_s) \prod_{t \in \rho(s)} m^n_{ts}(x_s) \qquad (3)$$

We approximate each message $m^n_{ts}(x_s)$ as a mixture of weighted Gaussian components given as:

$$m^n_{ts}(x_s) = \sum_{i=1}^{M} w_s^{(i)}\, \mathcal{N}\!\left(x_s;\, \mu_s^{(i)}, \Lambda_s^{(i)}\right) \qquad (4)$$

where $M$ is the number of Gaussian components, $w_s^{(i)}$ is the weight associated with the $i$th component, and $\mu_s^{(i)}$ and $\Lambda_s^{(i)}$ are the mean and covariance of the $i$th component, respectively. Fixing $\Lambda_s^{(i)}$ to a constant $\Lambda$ simplifies the approximation without affecting the performance of the system [16].
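As an illustration of this message representation, a minimal sketch of a fixed-covariance Gaussian mixture message and its pointwise evaluation per Eq. 4 might look as follows; the class and parameter names are our own:

```python
import numpy as np

class GaussianMixtureMessage:
    """Message m_ts(x_s): M weighted Gaussians sharing a fixed covariance Lambda."""
    def __init__(self, means, weights, cov):
        self.means = np.asarray(means)      # (M, d) component means mu^(i)
        self.weights = np.asarray(weights)  # (M,) weights w^(i), summing to 1
        self.cov = np.asarray(cov)          # (d, d) shared covariance Lambda

    def evaluate(self, x):
        """Pointwise value of the mixture density at pose x (shape (d,))."""
        d = self.means.shape[1]
        diff = self.means - x
        mahal = np.sum(diff @ np.linalg.inv(self.cov) * diff, axis=1)
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(self.cov))
        return float(np.sum(self.weights * np.exp(-0.5 * mahal) / norm))

# 15 components over 3D positions, matching the experiments later in the paper
M, d = 15, 3
msg = GaussianMixtureMessage(
    means=np.random.randn(M, d),
    weights=np.full(M, 1.0 / M),
    cov=np.eye(d) * 0.01,
)
density = msg.evaluate(np.zeros(d))
```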

Page 4: Analysis of Goal-directed Manipulation in Clutter using ... · Analysis of Goal-directed Manipulation in Clutter using Scene Graph Belief Propagation Karthik Desingh Anthony Opipari

NBP provides a Gibbs sampling based method to compute an approximation of the product $\prod_{u \in \rho(t) \setminus s} m^{n-1}_{ut}(x_t)$ that results in the same form as Eq. 4. Assuming that $\phi_t(x_t, y_t)$ is pointwise computable, the product $\phi_t(x_t, y_t) \prod_{u \in \rho(t) \setminus s} m^{n-1}_{ut}(x_t)$ is computed as part of the sampling procedure. The pairwise term $\psi_{st}(x_s, x_t)$ should be approximated as a marginal influence function $\zeta(x_t)$ to make the right side of Eq. 2 independent of $x_s$. The marginal influence function is given by:

$$\zeta(x_t) = \int_{x_s \in \mathbb{R}^d} \psi_{st}(x_s, x_t)\, dx_s \qquad (5)$$

If the marginal influence function is also pointwise computable, then the entire product $\zeta(x_t)\, \phi_t(x_t, y_t) \prod_{u \in \rho(t) \setminus s} m^{n-1}_{ut}(x_t)$ can be computed as part of the sampling procedure proposed in NBP. Refer to the papers describing NBP [16] and PAMPAS [8] for further details on how changes in the nature of the potentials affect the message update computation (as in Eq. 2). The marginal influence function provides the influence of $x_s$ for sampling $x_t$. However, the function can be ignored if the pairwise potential function is based on the distance between the variables, which is true in our case.

C. Potential Functions

1) Unary potential: The unary potential $\phi_t(x_t, y_t)$ models the likelihood by measuring how well a pose $x_t$ explains the observation $y_t$. The observation $y_t = (l_t, B_t, C_t)$ provides the 2D bounding boxes $B_t$ corresponding to $x_t$. Let $b_t$ be a bounding box sampled from $B_t$ using $C_t$ as the corresponding weights, and let $p_t$ be the subset of the original point cloud corresponding to $b_t$. The hypothesized object pose $x_t$ is used to position the geometric object model for object $l_t$ and generate a synthetic point cloud $p_t^*$ that can be matched with the observation $p_t$. The synthetic point cloud is constructed using the object geometric model available a priori. The likelihood is calculated as

$$\phi_t(x_t, y_t) = e^{-\lambda_r D(p_t, p_t^*)} \qquad (6)$$

where $\lambda_r$ is a scaling factor and $D(p_t, p_t^*)$ is the sum of 3D Euclidean distances between each point in $p_t$ and $p_t^*$, associated by their pixel locations in the observed bounding box $b_t$ on $I$.
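A minimal sketch of this unary potential is given below, assuming the observed and rendered clouds are already index-aligned arrays and that the exponent carries a negative sign so that larger residuals yield lower likelihood; `lam_r` stands in for the scaling factor $\lambda_r$:

```python
import numpy as np

def unary_potential(observed_cloud, rendered_cloud, lam_r=1.0):
    """phi_t(x_t, y_t) per Eq. 6: likelihood decays with the summed
    point-to-point distance between the observed cloud p_t and the
    synthetic cloud p_t* rendered at the hypothesized pose. Both arrays
    are (n, 3), pre-associated by pixel location in the bounding box."""
    dists = np.linalg.norm(observed_cloud - rendered_cloud, axis=1)
    return np.exp(-lam_r * np.sum(dists))
```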

2) Pairwise potential: The pairwise potential gives information about how compatible two object poses are, given the support relation in the form of edges. We consider three different support cases: 1) object s is not directly in physical contact with object t, 2) object s is supporting object t, and 3) object t is supporting object s. This support structure is provided as input to the system, as shown in Fig. 2. A binary value $\{0, 1\}$ is assigned based on the compatibility of the object poses and their support relations. Object geometries are available to the system for modeling this potential. We consider the pose to be 3-dimensional, so that $x_s = \{x_s^x, x_s^y, x_s^z\}$ and $x_t = \{x_t^x, x_t^y, x_t^z\}$. The dimensions of the object models are denoted as $d_s = \{d_s^x, d_s^y, d_s^z\}$ and $d_t = \{d_t^x, d_t^y, d_t^z\}$ for the objects associated with the nodes s and t respectively. The potentials between the nodes s and t in the graph $G$ can be computed using the simple rules shown in the cases below.

Case 1: Object s is not in physical contact with object t

$$\psi_{ts}(x_s, x_t) = \begin{cases} 1, & \text{if } \Delta x > \frac{d_s^x + d_t^x}{2} \ \text{ or } \ \Delta y > \frac{d_s^y + d_t^y}{2} \ \text{ or } \ \Delta z > \frac{d_s^z + d_t^z}{2} \\ 0, & \text{otherwise} \end{cases} \qquad (7)$$

where $\Delta x = |x_s^x - x_t^x|$ (and similarly for $\Delta y$ and $\Delta z$), and $d_s^x$ denotes the size of the geometry associated with node s in the x direction. The size of the geometry is used to avoid collisions, and hence an approximation such as $d_s^x = d_s^y = \max(d_s^x, d_s^y)$ can be used.

Case 2: Object s supports object t

$$\psi_{ts}(x_s, x_t) = \begin{cases} 1, & \text{if } \Delta x < \frac{1}{2} d_s^x \ \text{ and } \ \Delta y < \frac{1}{2} d_s^y \ \text{ and } \ \left|\Delta z - \frac{1}{2}(d_s^z + d_t^z)\right| < \Delta d \\ 0, & \text{otherwise} \end{cases} \qquad (8)$$

where $\Delta d$ denotes a threshold in the z direction within which the objects are considered to be touching each other.

Case 3: Object t supports object s

$$\psi_{ts}(x_s, x_t) = \begin{cases} 1, & \text{if } \Delta x < \frac{1}{2} d_t^x \ \text{ and } \ \Delta y < \frac{1}{2} d_t^y \ \text{ and } \ \left|\Delta z - \frac{1}{2}(d_s^z + d_t^z)\right| < \Delta d \\ 0, & \text{otherwise} \end{cases} \qquad (9)$$
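A direct transcription of these three rules into code could look as follows (a sketch; the relation labels and the `delta_d` default are illustrative, not from the paper):

```python
import numpy as np

def pairwise_potential(x_s, x_t, d_s, d_t, relation, delta_d=0.01):
    """Binary compatibility psi_ts(x_s, x_t) per Eqs. 7-9. Poses and model
    dimensions are 3-vectors (x, y, z); delta_d is the touching threshold."""
    dx, dy, dz = np.abs(np.asarray(x_s) - np.asarray(x_t))
    if relation == "non_contact":                     # Eq. 7
        return float(dx > (d_s[0] + d_t[0]) / 2
                     or dy > (d_s[1] + d_t[1]) / 2
                     or dz > (d_s[2] + d_t[2]) / 2)
    if relation == "s_supports_t":                    # Eq. 8
        return float(dx < d_s[0] / 2 and dy < d_s[1] / 2
                     and abs(dz - (d_s[2] + d_t[2]) / 2) < delta_d)
    if relation == "t_supports_s":                    # Eq. 9
        return float(dx < d_t[0] / 2 and dy < d_t[1] / 2
                     and abs(dz - (d_s[2] + d_t[2]) / 2) < delta_d)
    raise ValueError(relation)
```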

However, in the sampling based algorithm it is computationally preferable to sample directly for cases 2 and 3. In case 2, given $x_t$, the $x_s^x$ and $x_s^y$ components of $x_s$ can be sampled using $\mathcal{N}(x_t^x, \frac{1}{2} d_t^x)$ and $\mathcal{N}(x_t^y, \frac{1}{2} d_t^y)$ respectively, and $x_s^z$ is computed as $x_t^z + \frac{d_s^z + d_t^z}{2} + \mathcal{N}(0, \sigma)$ with a small noise. Similar to case 2, in case 3 the $x_s^x$ and $x_s^y$ components of $x_s$ can be sampled using $\mathcal{N}(x_t^x, \frac{1}{2}(d_t^x + d_s^x))$ and $\mathcal{N}(x_t^y, \frac{1}{2}(d_t^y + d_s^y))$, and $x_s^z$ is computed as $x_t^z - \frac{d_s^z + d_t^z}{2} + \mathcal{N}(0, \sigma)$ with a small noise.
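The direct sampling for cases 2 and 3 described above could be sketched as follows, with an assumed noise level `sigma` and the z-offset signs taken from the text:

```python
import numpy as np

def sample_supported_pose(x_t, d_s, d_t, case, sigma=0.002, rng=np.random):
    """Draw x_s ~ psi_st(x_s, x_t) directly for the two support cases,
    instead of rejection sampling. Standard deviations and the signs of
    the z offsets follow the text above for cases 2 and 3."""
    if case == 2:    # object s supports object t
        xs = rng.normal(x_t[0], d_t[0] / 2)
        ys = rng.normal(x_t[1], d_t[1] / 2)
        zs = x_t[2] + (d_s[2] + d_t[2]) / 2 + rng.normal(0.0, sigma)
    elif case == 3:  # object t supports object s
        xs = rng.normal(x_t[0], (d_t[0] + d_s[0]) / 2)
        ys = rng.normal(x_t[1], (d_t[1] + d_s[1]) / 2)
        zs = x_t[2] - (d_s[2] + d_t[2]) / 2 + rng.normal(0.0, sigma)
    else:
        raise ValueError("direct sampling applies to cases 2 and 3 only")
    return np.array([xs, ys, zs])
```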

Algorithm 1 gives a high-level overview of nonparametric belief propagation, Algorithm 2 details how the message update is performed using a Gibbs sampling approach, and Algorithm 3 details how the belief is computed, which yields the maximum likelihood estimate.


Algorithm 1: Overall Belief Propagation

Given node potentials $\phi(x_s, y_s)\ \forall s \in V$, pairwise potentials $\psi(x_s, x_t)\ \forall (s,t) \in E$, and initial messages $m^0_{st}\ \forall (s,t) \in E$ for every edge, the algorithm iteratively updates all messages and computes the beliefs until the graph $G$ converges.

1 For $n \in [1 : \mathrm{MAX\_BELIEF\_ITERATIONS}]$
(a) Message update: update messages from iteration $(n-1) \to n$ using Eqs. 2 to 5:
$m^n_{ts}(x_s) \leftarrow \int_{x_t \in \mathbb{R}^d} \zeta(x_t)\, \phi_t(x_t, y_t) \prod_{u \in \rho(t) \setminus s} m^{n-1}_{ut}(x_t)\, dx_t$
(b) Belief update (optional): update the belief at every iteration if necessary; it does not affect the message update in step 1(a) of the next iteration:
$bel^n_s(x_s) \propto \phi_s(x_s, y_s) \prod_{t \in \rho(s)} m^n_{ts}(x_s)$
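A compact sketch of this outer loop follows; the `graph` object and its `initial_message`, `update_message`, and `compute_belief` methods are hypothetical stand-ins for the edge initialization and for Algorithms 2 and 3:

```python
def run_belief_propagation(graph, max_iters=15):
    """Minimal sketch of Algorithm 1: alternate message updates (Eq. 2)
    and optional belief updates (Eq. 3) for a fixed iteration budget."""
    messages = {edge: graph.initial_message(edge) for edge in graph.edges}
    beliefs = {}
    for n in range(max_iters):
        # (a) message update: every directed edge t -> s reads last
        # iteration's incoming messages at t, excluding the one from s
        new_messages = {}
        for (t, s) in graph.edges:
            incoming = [messages[(u, t)] for u in graph.neighbors(t) if u != s]
            new_messages[(t, s)] = graph.update_message(t, s, incoming)  # Alg. 2
        messages = new_messages
        # (b) optional belief update; does not feed back into step (a)
        for s in graph.nodes:
            incoming = [messages[(t, s)] for t in graph.neighbors(s)]
            beliefs[s] = graph.compute_belief(s, incoming)               # Alg. 3
    return beliefs
```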

Algorithm 2: Message update

Given input messages $m^{n-1}_{ut}(x_t) = \{\mu^{(i)}_{ut}, \Lambda^{(i)}_{ut}, w^{(i)}_{ut}\}_{i=1}^{M}$ for each $u \in \rho(t) \setminus s$, and methods to compute the functions $\psi_{ts}(x_t, x_s)$ and $\phi_t(x_t, y_t)$ pointwise, the algorithm computes $m^n_{ts}(x_s) = \{\mu^{(i)}_{ts}, \Lambda^{(i)}_{ts}, w^{(i)}_{ts}\}_{i=1}^{M}$.

1 Draw $M$ independent samples $\{x_t^{(i)}\}_{i=1}^{M}$ from the product $\zeta(x_t)\, \phi_t(x_t, y_t) \prod_{u \in \rho(t) \setminus s} m^{n-1}_{ut}(x_t)$ using the Gibbs sampler in NBP.
(a) In our specific case $\zeta(x_t)$ is 1.0.
2 For each $x_t^{(i)}$, sample $x_s^{(i)} \sim \psi_{st}(x_s, x_t = x_t^{(i)})$.
(a) Using Eqs. (7, 8, 9), rejection sampling is performed to sample $x_s^{(i)}$.
3 Construct $m^n_{ts}(x_s) = \{\mu^{(i)}_{ts}, \Lambda^{(i)}_{ts}, w^{(i)}_{ts}\}_{i=1}^{M}$:
(a) $\mu^{(i)}_{ts}$ is the sampled component $x_s^{(i)}$.
(b) Kernel density estimators can be used to select the appropriate kernel size $\Lambda^{(i)}_{ts}$; we use the "rule of thumb" estimator [9].
(c) $w^{(i)}_{ts}$ is initialized to $1/M$; however, if the pairwise potential is a density function, then importance weights from the selection of $\mu^{(i)}_{ts}$ can be used.

IV. EXPERIMENTS

In this section we first discuss the results of the proposed belief propagation system qualitatively. Second, we introduce a measure of the clutteredness of a scene in terms of the visibility of its objects. Last, we discuss a goal-directed manipulation experiment performed with the proposed belief propagation framework.

A. Belief Propagation Experiments

We initially tested the proposed framework on 6 scenes with 6 objects in different configurations for each scene. The scenes included objects in stacked, partially occluded, and completely occluded configurations. In Fig. 3, we show the qualitative results of the pose estimation for a mildly cluttered scene. The belief samples after 15 iterations shown in Fig. 3d cannot be directly used for manipulation, as they represent samples drawn from a distribution of poses. However, when these samples go through a post-processing

Algorithm 3: Belief update

Given incoming messages $m^n_{ts}(x_s) = \{\mu^{(i)}_{ts}, \Lambda^{(i)}_{ts}, w^{(i)}_{ts}\}_{i=1}^{M}$ for each $t \in \rho(s)$, and a method to compute the function $\phi_s(x_s, y_s)$ pointwise, the algorithm computes $bel^n_s(x_s) \propto \phi_s(x_s, y_s) \prod_{t \in \rho(s)} m^n_{ts}(x_s)$.

1 Draw $T$ independent samples $\{x_s^{(i)}\}_{i=1}^{T}$ from the product $\phi_s(x_s, y_s) \prod_{t \in \rho(s)} m^n_{ts}(x_s)$ using the Gibbs sampler in NBP.
(a) If $\phi_s(x_s, y_s)$ is modeled as a mixture of Gaussians, then it can be part of the product.
(b) If $\phi_s(x_s, y_s)$ can only be computed pointwise, then the variant proposed in NBP [16] can be used to compute the product using Gibbs sampling.
2 $bel^n_s(x_s) \sim \{\mu^{(i)}_s, \Lambda^{(i)}_s, w^{(i)}_s\}_{i=1}^{T}$ is used to compute the scene estimate.
3 (Optional) To perform robot grasping and manipulation, a single estimate is required. We draw $K$ samples from the belief product in step 2 to initialize the ICP algorithm. From the ICP fits, we compute the weighted mean and covariance; these are the estimates used in the manipulation experiments.

step with ICP, they are transformed into a reliable single pose estimate for each object that fits the point cloud. The ICP fits for each belief sample are used to compute a weighted mean, which acts as the single final estimate for manipulation. For every sample output from the ICP fit, the distance to all the other samples in the ICP output is computed. The distances are normalized to sum to one. The weight used in the weighted mean for each sample is $1 - \text{distance}$. The weighted variance measures the uncertainty in perception for manipulation. If this variance is higher than a threshold (0.25 cm²) in all dimensions, then no manipulation action is performed with this final pose estimate. For these mildly cluttered scenes with 6 objects, the average error in the pose of the final scene estimate is 0.91 cm.
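The post-processing just described could be sketched as below; we interpret the gating rule as requiring the weighted variance to stay below the threshold in every dimension, and assume poses are in meters so that 0.25 cm² is 0.25e-4 m²:

```python
import numpy as np

def pose_confidence(icp_fits, var_threshold=0.25e-4):
    """Fuse ICP-refined belief samples (n, 3) into one estimate. Each
    sample is weighted by (1 - normalized distance to all other samples),
    and the weighted variance gates whether manipulation is allowed."""
    fits = np.asarray(icp_fits)
    # total distance of each fit to all others, normalized to sum to one
    dists = np.array([np.sum(np.linalg.norm(fits - f, axis=1)) for f in fits])
    dists = dists / np.sum(dists)
    weights = 1.0 - dists
    weights = weights / np.sum(weights)
    mean = np.average(fits, axis=0, weights=weights)
    var = np.average((fits - mean) ** 2, axis=0, weights=weights)
    actionable = bool(np.all(var < var_threshold))
    return mean, var, actionable
```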

B. Analysis of Clutteredness

We analyze the visibility of the objects in the scene and measure its clutteredness. This analysis is done using the ground truth poses of the objects. If a scene consists of $N$ objects, the entire scene is rendered to a depth map $S$ using OpenGL rendering. Each of the $N$ objects is rendered at its ground truth pose in isolation to generate $N$ depth maps $O_{1:N}$. A rendered image has a non-zero depth value at pixels where objects are present. Let $J_i$ be the set of all pixels in the rendered depth map with non-zero depth value for object $O_i$. Then the visibility $V^o_i$ of object $O_i$ and the scene visibility $V^s$ are given by the equation below.

$$V^o_i = \frac{v^o_i}{K^o_i}, \qquad V^s = \frac{\sum_{i=1}^{N} v^o_i}{\sum_{i=1}^{N} K^o_i} \qquad (10)$$

where $v^o_i$ is the total number of pixels that have the same depth value in the isolated depth map of object $O_i$ and in the full scene $S$, and $K^o_i$ is the total number of pixels in $J_i$. See Fig. 5 to get a sense of the scene visibility and object visibility.
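Given the rendered depth maps, Eq. 10 reduces to pixel counting; a sketch follows, with an assumed tolerance `eps` standing in for exact depth equality:

```python
import numpy as np

def visibility(scene_depth, object_depths, eps=1e-4):
    """Eq. 10: per-object and scene visibility from rendered depth maps.
    scene_depth is the full-scene render S; object_depths is a list of N
    isolated renders O_i of the same size. A pixel of O_i is visible in S
    when the two depth values agree (within eps); K_i counts O_i's footprint."""
    v, K = [], []
    for obj_depth in object_depths:
        footprint = obj_depth > 0                      # J_i: non-zero depth pixels
        agree = np.abs(obj_depth - scene_depth) < eps  # unoccluded pixels
        v.append(int(np.sum(footprint & agree)))       # v_i
        K.append(int(np.sum(footprint)))               # K_i
    V_obj = [vi / Ki if Ki else 0.0 for vi, Ki in zip(v, K)]
    V_scene = sum(v) / sum(K) if sum(K) else 0.0
    return V_obj, V_scene
```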


Fig. 3: Qualitative results showing the estimation of the proposed framework: (a) RGBD observation with detections; (b) scene graph prior; (c) initial belief samples; (d) final belief samples; (e) final estimate; (f) ground truth. This experiment uses 15 Gaussian components for each message and is run for 15 iterations to obtain the final belief samples. The final estimate is achieved by post-processing the final belief samples with ICP registration, followed by a weighted average over the converged poses.

Fig. 4: Columns show (from left): observation, initialized belief, belief after 5 iterations, and ground truth from a different view. The first row shows a complete occlusion case: the pink object (color of the bounding box in the first column, or of the object in columns 2-4) is behind the blue object and supporting the orange object. The bounding box for the pink object drives the initial sampling; in later iterations, samples are drawn using the support relations. The second row demonstrates a partial occlusion case: the blue object (Downy detergent) is behind the orange object in the front. The bounding box initializes the belief propagation ambiguously; however, the pairwise potential resolves these cases by sampling away from objects without support relations.

C. Goal-directed Manipulation Experiment

To evaluate the proposed framework's success in robotic manipulation tasks, we use the weighted mean pose and the weighted variance that result from the post-processing step to perform a scene reorganization task. The reorganization task is provided a goal scene with desired place locations for the objects. Specifically, the robot's task is to unload a tray full of objects onto the table at the desired place locations. To accomplish this task, a sequence of pick-and-place actions is executed by the robot using the estimated poses of each object. After all estimated object poses with high confidence (with a threshold variance of 0.25 cm²) have been acted on, we reapply the scene estimation to produce updated pose estimates. We iterate until the entire task is accomplished, as sketched below.
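The perceive-act loop just described can be sketched as follows; the `robot` and `perceive` interfaces are hypothetical placeholders for the real system:

```python
def unload_tray(robot, perceive, goal_places, var_threshold=0.25e-4):
    """Sketch of the experiment's perceive-act loop. `perceive` is assumed
    to return, per remaining object, a pose estimate and its weighted
    variance; only low-variance estimates are acted on, then the scene is
    re-perceived until every object reaches its goal placement."""
    remaining = set(goal_places)
    while remaining:
        estimates = perceive()                    # scene graph BP + ICP step
        acted = False
        for obj in list(remaining):
            pose, var = estimates[obj]
            if (var < var_threshold).all():       # confident enough to grasp
                robot.pick_and_place(obj, pose, goal_places[obj])
                remaining.discard(obj)
                acted = True
        if not acted:
            raise RuntimeError("no confident estimates; scene unchanged")
    # loop ends when all objects sit at their goal locations
```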

In Fig. 6, we report the results of an object manipulation experiment on a heavily cluttered scene. This scene contains 10 objects stacked on a tray, with a mixture of visible and non-visible objects under partial and complete occlusions. The experiment required 4 sequences of perception and manipulation stages. Sequence 1: the robot begins by perceiving the scene, which provides accurate pose estimates for the coffee (yellow) and ginger tea (cyan) along with high confidence, i.e., variance less than 0.25 cm². These two estimates are used by the robot to manipulate these objects from the tray onto the table. It should be noted from Fig. 5(a) that these two objects were 100% visible. However, the perception system was not confident about the coffee mate, which had 86% visibility. The perception system did not consider any other leaf node for a manipulation action, as their belief samples did not converge due to low visibility and had high uncertainty. Sequence 2: the robot perceives the scene again, which provides accurate poses with high confidence not only for the leaf nodes (lemon tea (red) and granola bar (pink)) but also for the parent node of the lemon tea, which is the oats container (purple). It should be noted from Fig. 5(b) that only the fully visible (100%) leaf nodes are manipulated, along with the 62% visible non-leaf node. This is because the planner's sequence of actions covers the manipulation of the entire scene graph at any point in time. Sequence 3: the robot perceives the scene again. At this stage, all the objects on the tray are leaf nodes except for the protein box, which is visible to the sensor for the first time with a visibility of 5%. Only three objects get accurate pose estimates with high enough confidence for manipulation. Sequence 4: we perceive a scene that contains only two objects, whose visibility is 100%. Their poses are estimated with high confidence for manipulation.

In addition to the manipulation results, we would like to emphasize the advantage of using the scene graph in the inference. The first row of Fig. 4 shows a case where the location of a completely occluded object supporting a visible object is correctly estimated. Similarly, the second row of Fig. 4 demonstrates how the scene graph enables a correct estimate of the location of an object partially occluded by another one.

V. CONCLUSION

We analyzed an approach to goal-directed manipulation, especially in cluttered scenes. We observed that visibility can affect the perception system. To prevent the planner from acting on pose estimates taken directly from the perception system regardless of their reliability, a confidence measure is calculated. We proposed a nonparametric scene graph belief propagation method to estimate a scene as a collection of object poses and their interactions.


Fig. 5: Visibility of the nodes in the scene graph as viewed by the robot. Arrows towards the left side of each scene correspond to non-leaf nodes, and those towards the right side correspond to leaf nodes. These scenes are the same as in Fig. 6, with the same color coding as in the scene graphs. The visibility is computed using the measure described in Eq. 10. Percentages in red correspond to values less than 30%. The scene complexity is provided in the caption of each scene.

This problem is formulated as a graph inference problem on a Markov random field. We show the benefit of the belief propagation approach to manipulation by presenting qualitative results of the pose estimates used for robot manipulation. We also show how to measure the clutteredness of a scene in terms of object visibilities. In future work we would like to infer the graph structure, i.e., the scene graph, which is currently provided as input to the system.

REFERENCES

[1] C. Chao, M. Cakmak, and A. L. Thomaz. Towards grounding concepts for transfer in goal learning from demonstration. In Proceedings of the Joint IEEE International Conference on Development and Learning and on Epigenetic Robotics (ICDL-EpiRob). IEEE, 2011.

[2] J. Chua and P. F. Felzenszwalb. Scene grammars, factor graphs, and belief propagation. arXiv preprint arXiv:1606.01307, 2016.

[3] M. Ciocarlie, K. Hsiao, E. G. Jones, S. Chitta, R. B. Rusu, and I. A. Sucan. Towards reliable grasping and manipulation in household environments. In Experimental Robotics, pages 241-252. Springer Berlin Heidelberg, 2014.

[4] A. Collet, M. Martinez, and S. S. Srinivasa. The MOPED framework: Object recognition and pose estimation for manipulation. Int. J. Rob. Res., 30(10):1284-1306, Sept. 2011.

[5] M. Dogar, K. Hsiao, M. Ciocarlie, and S. Srinivasa. Physics-based grasp planning through clutter. In Robotics: Science and Systems, 2012.

[6] R. E. Fikes and N. J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3):189-208, 1972.

[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580-587, 2014.

[8] M. Isard. PAMPAS: Real-valued graphical models for computer vision. In Computer Vision and Pattern Recognition (CVPR), volume 1. IEEE, 2003.

[9] B. W. Silverman. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London, 1986.

[10] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pages 467-475. Morgan Kaufmann Publishers Inc., 1999.

[11] V. Narayanan and M. Likhachev. Discriminatively-guided deliberative perception for pose estimation of multiple 3D object instances. In Robotics: Science and Systems, 2016.

[12] S. Narayanaswamy, A. Barbu, and J. M. Siskind. A visual language model for estimating object pose and structure in a generative visual domain. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 4854-4860. IEEE, 2011.

[13] C. Papazov, S. Haddadin, S. Parusel, K. Krieger, and D. Burschka. Rigid 3D geometry matching for grasping of known objects in cluttered scenes. The International Journal of Robotics Research, 2012.

[14] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, pages 91-99. Curran Associates, Inc., 2015.

[15] L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard. Tracking loose-limbed people. In Computer Vision and Pattern Recognition (CVPR), volume 1. IEEE, 2004.

[16] E. B. Sudderth, A. T. Ihler, W. T. Freeman, and A. S. Willsky. Nonparametric belief propagation. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2003.

[17] E. B. Sudderth, M. I. Mandel, W. T. Freeman, and A. S. Willsky. Visual hand tracking using nonparametric belief propagation. In Computer Vision and Pattern Recognition Workshop (CVPRW), pages 189-189. IEEE, 2004.

[18] Z. Sui, L. Xiang, O. C. Jenkins, and K. Desingh. Goal-directed robot manipulation through axiomatic scene estimation. The International Journal of Robotics Research, 36(1):86-104, 2017.

[19] A. ten Pas and R. Platt. Localizing handle-like grasp affordances in 3D point clouds. In International Symposium on Experimental Robotics (ISER), 2014.

[20] T. Winograd. Understanding natural language. Cognitive Psychology, 3(1):1-191, 1972.

[21] D. Xu, Y. Zhu, C. B. Choy, and L. Fei-Fei. Scene graph generation by iterative message passing. arXiv preprint arXiv:1701.02426, 2017.


Fig. 6: Goal-directed manipulation experiment: the robot is given the task of reconfiguring the scene to achieve a predefined goal configuration. In this experiment, the robot takes 4 sequences to finish the task. Each sequence starts with the proposed system perceiving the scene and providing object pose estimates and a measure of uncertainty. The robot performs pick-and-place actions on the objects whose estimates have high confidence from the perception system. In the last run, the robot fails to pick up an object as it was not reachable, though the estimate was precise with respect to the ground truth.