Combining Grasp Pose Detection with Object Detection

Andreas ten Pas†, Kate Saenko‡, Robert Platt†
† College of Computer and Information Science, Northeastern University
‡ Department of Computer Science, University of Massachusetts, Lowell

1 Introduction

Recently, researchers have proposed various grasp detection methods that can be used to localize grasp configurations without estimating object pose [17, 7, 12, 16, 15, 2, 3, 1, 5, 10, 8, 20, 4]. These methods take as input a noisy and partially occluded RGBD image or point cloud and produce as output pose estimates of viable grasps. The underlying idea is to treat grasp perception analogously to object detection in computer vision. Given large amounts of grasp training data, a classifier or regression system is trained to detect parts of an image or a point cloud that can be grasped. Because these methods detect grasps independently of object identity, they typically generalize grasp knowledge to new objects well. Recently, we have proposed a variation on grasp detection that detects grasps in SE(3) with high accuracy. This method achieved a 93% grasp success rate in dense clutter [4] (grasps succeeded as a fraction of grasp attempts) – one of the highest reported.

A key drawback of grasp pose detection in general is that it does not immediately provide a way to grasp a specific object of interest. Grasp detectors will typically find any viable grasp – not necessarily one on the desired object. This suggests that grasp detection might be integrated with object detection. One might think about this as a mixture of two experts: a grasp detection expert that estimates the probability that a grasp candidate is a true grasp; and an object detection expert that estimates the probability that the grasp lies on the object of interest. Grasps would be ranked according to the product of these two probabilities. There are a couple of reasons why the “mixture of experts” approach might compare favorably with more traditional approaches based on estimating the pose of a particular object instance (e.g. [14]). First, any grasp method that incorporates our grasp pose detection algorithm immediately inherits a high grasp success rate. Second, the mixture-of-experts method is able to benefit from training data from a variety of sources – not just training data derived from the particular object instance being grasped. By using training data from these different sources, the mixture-of-experts method might generalize grasp knowledge to new objects better and even perform better when the object instance to be grasped is known in advance.

2 Combined Grasp and Object Detection

Figure 1: A mixture of a grasp detection expert and an object detection expert is used to rank grasps.

In order to detect grasps on a specific object of interest, our method takes the product of an object detector and a grasp detector. This is illustrated in Figure 1. The object detector estimates the probability that a grasp candidate is located on the object of interest, and the grasp detector estimates the probability that a grasp candidate is a good grasp. Given a set of grasp candidates, we use the product of these two experts to rank the grasps.
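The ranking rule itself is straightforward. Below is a minimal sketch of the product-of-experts scoring, assuming the two detectors have already produced per-candidate probabilities; the function and argument names are illustrative, not the authors' interface.

import numpy as np

def rank_by_product(candidates, grasp_probs, object_probs):
    # grasp_probs[i]: estimated probability that candidate i is a viable grasp.
    # object_probs[i]: estimated probability that candidate i lies on the target object.
    combined = np.asarray(grasp_probs) * np.asarray(object_probs)
    order = np.argsort(-combined)  # highest combined score first
    return [candidates[i] for i in order], combined[order]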

Generating Grasp Candidates Our system generates a large number of grasp candidates (several thousand) by searching for 6-DOF hand configurations that satisfy certain conditions related to the geometry of a grasp [20, 4]. These become candidates both for the grasp detector and for the object detector.
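The specific geometric conditions of [20, 4] are not reproduced here, but the following sketch conveys the kind of test applied to each sampled hand pose, assuming the cloud has been transformed into the candidate hand frame; the dimensions and the particular conditions are assumptions for illustration only.

import numpy as np

def passes_geometric_conditions(points, finger_width=0.01, hand_depth=0.06, aperture=0.08):
    # points: Nx3 cloud points in the candidate hand frame
    # (x = approach direction, y = gripper closing direction).
    x, y = points[:, 0], points[:, 1]
    in_depth = (x > 0.0) & (x < hand_depth)

    # The closing region between the fingers should contain some points ...
    in_closing_region = in_depth & (np.abs(y) < aperture / 2.0)

    # ... while the volume swept by the fingers themselves should be empty,
    # so that the hand does not collide with the cloud.
    in_finger_volume = in_depth & (np.abs(y) >= aperture / 2.0) & \
                       (np.abs(y) < aperture / 2.0 + finger_width)

    return in_closing_region.any() and not in_finger_volume.any()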

Object Detection Our object detector is a fully convolutional variation of Alexnet [9] that has been trained to detect eleven different objects from our lab. We started from the Alexnet reference model and weights in Caffe [6], and fine-tuned its last inner product layer. We obtained ∼11k RGB training images using the following semi-automated procedure. First, objects were placed in front of the robot on a table in a small number (three or four) of different configurations. Then, for each object, the robot took ∼1k RGBD images from different perspectives by using an RGBD camera (an Asus Xtion Pro) mounted to the robot gripper. We synthetically augmented the training set by randomly cropping, rotating, and adding zero-mean Gaussian noise with a small standard deviation. Our network was trained using stochastic gradient descent with momentum on ∼500k images generated this way and validated on ∼5k images.
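As an illustration, here is a minimal sketch of the augmentation step described above; the rotation range, crop fraction, and noise level are assumed values, not those used to build the actual training set.

import numpy as np
from scipy.ndimage import rotate

def augment(image, rng, max_angle=10.0, crop_frac=0.9, noise_sigma=2.0):
    # image: H x W x 3 uint8 RGB training image; rng: np.random.RandomState.
    # Random rotation about the image center.
    out = rotate(image, rng.uniform(-max_angle, max_angle), reshape=False, mode='nearest')
    # Random crop covering a fixed fraction of the rotated image.
    h, w = out.shape[:2]
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    top, left = rng.randint(0, h - ch + 1), rng.randint(0, w - cw + 1)
    out = out[top:top + ch, left:left + cw]
    # Zero-mean Gaussian noise with a small standard deviation.
    out = out.astype(np.float32) + rng.normal(0.0, noise_sigma, out.shape)
    return np.clip(out, 0, 255).astype(np.uint8)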



Figure 2: (a) The projections of the high probability pixels of the object detection heatmap into the point cloud. (b) Grasps detected for the projections. (c) Scores assigned to each grasp according to the estimated object of interest probability in the heatmap. (d) Object detection heatmap.

We converted our trained classifier into a detector by replacing the last three inner product layers with fully convolutional layers and transplanted the network’s learned parameters into these new layers [13], resulting in a fully convolutional version of Alexnet. Given an input image, this network creates an object detection heatmap. The intensity of a pixel in this heatmap corresponds to the estimated probability that a grasp centered at that point belongs to the object of interest (see Figure 2(d)).
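For concreteness, this conversion can be done with Caffe-style "net surgery" along the lines of the sketch below; the file names, the layer names, and the assumption that exactly fc6, fc7, and fc8 are replaced are illustrative, and the rewritten prototxt definitions are not shown.

import caffe

# Fine-tuned classification net, and a second prototxt in which the inner
# product layers have been rewritten as convolutions (placeholder names).
net = caffe.Net('alexnet_classifier.prototxt', 'alexnet_finetuned.caffemodel', caffe.TEST)
net_conv = caffe.Net('alexnet_fully_conv.prototxt', 'alexnet_finetuned.caffemodel', caffe.TEST)

# Transplant the learned inner product parameters into the new convolutional
# layers; the numbers are unchanged, only their shape is reinterpreted.
for ip_name, conv_name in [('fc6', 'fc6-conv'), ('fc7', 'fc7-conv'), ('fc8', 'fc8-conv')]:
    net_conv.params[conv_name][0].data.flat = net.params[ip_name][0].data.flat  # weights
    net_conv.params[conv_name][1].data[...] = net.params[ip_name][1].data       # biases

net_conv.save('alexnet_fully_conv.caffemodel')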

Grasp Detection Grasp candidates are classified either as grasps or not using a variation of the Lenet architecture [11]. Our approach takes a multi-view representation [19] of the volume that would be contained within the grasp and classifies it using Lenet. The network is pretrained using ∼200k grasp examples obtained from approximately 400 CAD models contained in the 3DNET dataset [21]. Then, the network is finetuned using data from approximately 55 objects from the BigBird dataset [18]. See [4] for more details.
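The exact image encoding of [4] is not reproduced here, but the sketch below conveys the idea of projecting the points inside the grasp volume onto a few viewing planes to form a small multi-channel input image for the classifier; the resolution, extent, and binary occupancy channels are assumptions.

import numpy as np

def multi_view_grasp_image(points, size=60, extent=0.10):
    # points: Nx3 cloud points inside the grasp volume, in the hand frame.
    # Map coordinates in [-extent/2, extent/2] to pixel indices in [0, size).
    idx = np.clip(((points / extent) + 0.5) * size, 0, size - 1).astype(int)
    image = np.zeros((3, size, size), dtype=np.float32)
    for channel, (a, b) in enumerate([(0, 1), (0, 2), (1, 2)]):  # xy, xz, yz views
        image[channel, idx[:, a], idx[:, b]] = 1.0  # binary occupancy per view
    return image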

Integrated Object-Grasp Detector Figure 2 illustrates the end-to-end process of detecting grasps on an object of interest. First, we use the object detector to create a heatmap that identifies the approximate location of the object of interest (Figure 2(d)). Second, we generate several thousand grasp samples in the vicinity of likely object locations using data from the point cloud [20, 4] (Figure 2(a) shows the regions of interest in magenta; Figure 2(b) shows the sampled grasp candidates). Third, we prune low-scoring grasp candidates and rank the remainder based on their object detection scores (Figures 2(c) and (d) show the locations of the detected grasps relative to an RGB image and the corresponding object detection heatmap). Finally, we select a grasp to execute using the utility function described in [4].
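Put together, the pipeline looks roughly like the skeleton below; every callable argument is an illustrative stand-in for one of the components described above, not the authors' actual interface.

def detect_grasps_on_object(rgb_image, cloud, object_heatmap, sample_candidates,
                            grasp_probability, object_probability,
                            roi_threshold=0.5, grasp_threshold=0.5):
    # 1. Object detection heatmap over the RGB image (Figure 2(d)).
    heatmap = object_heatmap(rgb_image)

    # 2. Sample grasp candidates from the point cloud near the
    #    high-probability heatmap regions (Figure 2(a), (b)).
    candidates = sample_candidates(cloud, heatmap, roi_threshold)

    # 3. Prune candidates the grasp detector rejects, then rank the
    #    remainder by their object detection score (Figure 2(c)).
    kept = [c for c in candidates if grasp_probability(c) >= grasp_threshold]
    ranked = sorted(kept, key=object_probability, reverse=True)

    # 4. The grasp to execute is chosen from the ranked list with the utility
    #    function of [4] (not shown here).
    return ranked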

3 Experiments and Discussion

We evaluated our method on the Baxter research robot. We used Baxter's right 7-DOF arm, with the off-the-shelf parallel-jaw gripper, constrained to a 3 to 7 cm aperture. Two Asus Xtion Pro RGBD cameras were mounted to Baxter's waist. In each trial of the experiment, we randomly selected 6 out of the 11 objects and placed them on a table in front of the robot, with at least 1 cm distance between the objects and within the workspace of the robot's right arm. The user then entered the name of one of the objects on the table, and the robot used our method to detect and grasp that object. This continued until all objects had been grasped. The robot was allowed to attempt to identify and grasp an object at most 3 times. We measured the object detection success rate (grasps attempted on the desired object as a fraction of total attempts) and the grasp success rate (grasps succeeded as a fraction of grasp attempts). Over the 10 trials conducted for this experiment, we obtained a 94% object detection success rate (4 object detection failures out of 75 attempts) and a 90% grasp success rate (5 grasp failures out of 54 attempts; in the remaining attempts, the inverse kinematics solver did not find a solution).

We view the results reported above as promising. Our objective is to grasp objects of interest reliably from dense clutter where standard object segmentation methods are infeasible. This paper takes the first step by proposing and demonstrating a method that does not require precise object segmentation. Our object heatmap can be viewed as an approximate semantic segmentation that suffices to distinguish on-target grasps from those off-target. As this paper shows, we obtain good results for objects that are close (1 cm), but not in dense clutter. In the future, we hope to replicate these results in dense clutter. To accomplish this, we hope to use better semantic segmentation methods [13] and to jointly detect grasps and objects in a way that makes each detection more accurate.


Acknowledgements

This work was supported in part by NSF under Grant No. IIS-1427081, NASA under Grant No. NNX16AC48A and NNX13AQ85G, and ONR under Grant No. N000141410047. Kate Saenko was supported by NSF Award IIS-1212928.

References

[1] Renaud Detry, Carl Henrik Ek, Marianna Madry, and Danica Kragic. Learning a dictionary of prototypical grasp-predicting parts from grasping experience. In IEEE Int’l Conf. on Robotics and Automation, pages 601–608, 2013.

[2] David Fischinger and Markus Vincze. Empty the basket – a shape based learning approach for grasping piles of unknown objects. In IEEE/RSJ Int’l Conf. on Intelligent Robots and Systems, pages 2051–2057, 2012.

[3] David Fischinger, Markus Vincze, and Yun Jiang. Learning grasps for unknown objects in cluttered scenes. In IEEE Int’l Conf. on Robotics and Automation, pages 609–616, 2013.

[4] Marcus Gualtieri, Andreas ten Pas, Kate Saenko, and Robert Platt. High precision grasp pose detection in dense clutter. arXiv preprint arXiv:1603.01564, 2016.

[5] Alexander Herzog, Peter Pastor, Mrinal Kalakrishnan, Ludovic Righetti, Tamim Asfour, and Stefan Schaal. Template-based learning of grasp selection. In IEEE Int’l Conf. on Robotics and Automation, pages 2379–2384, 2012.

[6] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Int’l Conf. on Multimedia, pages 675–678, 2014.

[7] Yun Jiang, Stephen Moseson, and Ashutosh Saxena. Efficient grasping from RGBD images: Learning using a new rectangle representation. In IEEE Int’l Conf. on Robotics and Automation, pages 3304–3311, 2011.

[8] Daniel Kappler, Jeannette Bohg, and Stefan Schaal. Leveraging big data for grasp planning. In IEEE Int’l Conf. on Robotics and Automation, pages 4304–4311, 2015.

[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.

[10] Oliver Kroemer, Emre Ugur, Erhan Oztop, and Jan Peters. A kernel-based approach to direct action perception. In IEEE Int’l Conf. on Robotics and Automation, pages 2605–2610, 2012.

[11] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[12] Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. The Int’l Journal of Robotics Research, 34(4-5):705–724, 2015.

[13] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.

[14] John Oberlin and Stefanie Tellex. Autonomously acquiring instance-based object models from experience. In International Symposium on Robotics Research (ISRR), 2015.

[15] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. arXiv preprint arXiv:1509.06825, 2015.

[16] Joseph Redmon and Anelia Angelova. Real-time grasp detection using convolutional neural networks. In IEEE Int’l Conf. on Robotics and Automation, pages 1316–1322, 2015.

[17] Ashutosh Saxena, Justin Driemeyer, and Andrew Ng. Robotic grasping of novel objects using vision. Int’l Journal of Robotics Research, 27(4):157–173, 2008.

[18] Ashutosh Singh, Jin Sha, Karthik Narayan, Tudor Achim, and Pieter Abbeel. BigBIRD: A large-scale 3D database of object instances. In IEEE Int’l Conf. on Robotics and Automation, pages 509–516, 2014.

[19] Hang Su, Subhransu Maji, Evangelos Kalogerakis, and Erik Learned-Miller. Multi-view convolutional neural networks for 3D shape recognition. In Proceedings of the IEEE Int’l Conf. on Computer Vision, pages 945–953, 2015.

[20] Andreas ten Pas and Robert Platt. Using geometry to detect grasp poses in 3D point clouds. In Proceedings of the Int’l Symp. on Robotics Research, 2015.

[21] Walter Wohlkinger, Aitor Aldoma, Radu B. Rusu, and Markus Vincze. 3DNet: Large-scale object class recognition from CAD models. In IEEE Int’l Conf. on Robotics and Automation, pages 5384–5391, 2012.
