Supervised Policy Fusion with Application to Regrasping

Yevgen Chebotar*, Karol Hausman*, Oliver Kroemer, Gaurav S. Sukhatme, Stefan Schaal

I. INTRODUCTION

Robust and stable grasping is one of the key requirements for successful robotic manipulation. Although there has been a lot of progress in the area of grasping [1], the state-of-the-art approaches may still result in failures. Ideally, the robot would detect failures quickly enough to be able to correct them. In addition, the robot should be able to learn from its mistakes to avoid such failures in the future. To address these challenges, we propose using early grasp stability prediction during the initial phases of the grasp. We also present a regrasping behavior that corrects failed grasps based on this prediction and improves over time.

In our previous work [2], we presented a first step towards an autonomous regrasping behavior using spatio-temporal tactile features and reinforcement learning. We were able to show that simple regrasping strategies can be learned using linear policies if enough data is provided. However, these strategies do not generalize well to classes of objects other than those they were trained on. The main reason for this shortcoming is that the policies are not expressive enough to capture the richness of different shapes and physical properties of the objects. A potential solution for learning a more complex and generalizable regrasping strategy is to employ a more complex policy class and gather a large amount of real-robot data with a variety of objects to learn the policy parameters. The main weakness of such a solution is that, in addition to requiring large amounts of data, these complex policies often result in the learner becoming stuck in poor local optima [3]. In this paper, we propose learning a complex, high-dimensional regrasping policy in a supervised fashion. Our method uses simple linear policies to guide the general policy, both to avoid poor local minima and to learn the general policy from smaller amounts of data.

The idea of using supervised learning in policy search was explored in [4], where the authors use trajectory optimization to direct the policy learning process and apply the learned policies to various manipulation tasks. A similar approach was proposed in [5], where the authors use deep spatial autoencoders to learn the state representation and unify a set of linear-Gaussian controllers to generalize to unseen situations. In our work, we use the idea of unifying simple strategies to generate a complex, generic policy. Here, however, we use simple linear policies learned through reinforcement learning, rather than optimized trajectories, as the examples that the general policy learns from.

II. TECHNICAL APPROACH

To describe a time series of tactile data, we employ spatio-temporal feature descriptors extracted using Spatio-Temporal Hierarchical Matching Pursuit (ST-HMP), which have been shown to perform well in temporal tactile data classification tasks [6].

All the authors are with the Department of Computer Science, University of Southern California, Los Angeles. [email protected] *Both authors contributed equally to this work.

Fig. 1: Objects and experimental setup used for learning the grasp stability predictor and the regrasping behavior. Top-left: the cylinder. Top-right: the box. Bottom-left: the ball. Bottom-right: the novel object.

When combined with learning regrasping behaviors for multiple objects, the ST-HMP approach leads to a large number of parameters to learn for the regrasping mapping function, which is usually a hard task for policy search algorithms [3]. Thus, in this work, we add several modifications to make this process feasible. In particular, we divide the learning process into two stages: i) learning linear policies for individual objects and ii) learning a high-dimensional policy to generalize between objects.

Once a grasp is predicted to fail by the grasp stability predictor, the robot has to place the object down and regrasp it using the information acquired during the initial grasp. In order to achieve this goal, we learn a mapping from the tactile features of the initial grasp to the grasp adjustment, i.e. the change in position and orientation between the initial grasp and the regrasp. The parameters of this mapping function for individual objects are learned using reinforcement learning. We define the policy π(θ) as a Gaussian distribution over mapping parameters θ with a mean μ and a covariance matrix Σ. To reduce the dimensionality of the input features, we perform a principal component analysis (PCA) [7] on the ST-HMP descriptors and use only the largest principal components. The mapping function is a linear combination of these PCA features: (x, y, z, α, β, γ) = Wφ with W ∈ R^{6×n} and φ ∈ R^n, where W contains the learned weights θ = (w_{x,1}, ..., w_{x,n}, ..., w_{γ,n}) of the features φ, and n is the number of principal components. The reward R(θ) is computed by estimating the success of the adjusted grasp using the grasp stability predictor. For optimizing the linear policy for individual objects, we use the relative entropy policy search (REPS) algorithm [8].
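
For concreteness, the following sketch illustrates this stage. The variable names, number of principal components, and stand-in reward values are hypothetical, and the update uses a fixed temperature instead of the dual optimization REPS performs to enforce its KL bound; it is a simplified REPS-style weighted refit, not the algorithm of [8] itself.

```python
import numpy as np
from sklearn.decomposition import PCA

N_COMPONENTS = 10          # number of principal components n (assumed value)
DIM_OUT = 6                # (x, y, z, alpha, beta, gamma)
DIM_THETA = DIM_OUT * N_COMPONENTS

# Reduce ST-HMP descriptors to the n largest principal components.
st_hmp_train = np.random.randn(200, 512)   # placeholder ST-HMP descriptors
pca = PCA(n_components=N_COMPONENTS).fit(st_hmp_train)

def grasp_adjustment(theta, st_hmp_descriptor):
    """Linear mapping (x, y, z, alpha, beta, gamma) = W @ phi."""
    phi = pca.transform(st_hmp_descriptor.reshape(1, -1)).ravel()  # phi in R^n
    W = theta.reshape(DIM_OUT, N_COMPONENTS)                       # W in R^{6 x n}
    return W @ phi

# Gaussian policy pi(theta) = N(mu, Sigma) over the mapping weights.
mu = np.zeros(DIM_THETA)
Sigma = np.eye(DIM_THETA)

def reps_style_update(thetas, rewards, eta=1.0):
    """Weighted maximum-likelihood refit of the Gaussian policy.
    REPS obtains eta by solving a dual problem under a KL constraint;
    a fixed temperature is used here purely for illustration."""
    w = np.exp((rewards - rewards.max()) / eta)
    w /= w.sum()
    new_mu = w @ thetas
    centered = thetas - new_mu
    new_Sigma = centered.T @ (centered * w[:, None]) + 1e-6 * np.eye(DIM_THETA)
    return new_mu, new_Sigma

# One policy-search iteration: sample weight vectors, execute the adjusted
# grasps, score them with the grasp stability predictor (stand-in rewards
# here), and refit the policy.
thetas = np.random.multivariate_normal(mu, Sigma, size=30)
rewards = np.random.rand(30)            # stand-in for predicted grasp success
mu, Sigma = reps_style_update(thetas, rewards)

# Query the current policy mean for a new tactile descriptor.
correction = grasp_adjustment(mu, np.random.randn(512))
```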

After the individual linear policies have been learned, we train a larger, high-dimensional policy in a supervised manner using the outputs of the individual policies. This is similar to the guided policy search approach proposed in [9].

TABLE I: Performance of the individual and combined regrasping policies (% of successful grasps).

Object        Individual policies (# regrasps)        Combined policy
              0       1       2       3
Cylinder      41.8    83.5    90.3    97.1             92.3
Box           40.7    85.4    93.7    96.8             87.6
Ball          52.9    84.8    91.2    95.1             91.4
New object    40.1    -       -       -                80.7

In our case, the guidance of the general policy comes from the individual policies that can be efficiently learned for separate objects. As the general policy class, we choose a neural network with a large number of parameters. Such a policy has enough representational richness to incorporate regrasping behavior for many different objects. However, learning its parameters directly requires a very large number of experiments, whereas supervised learning with already learned individual policies speeds up the process significantly.

To generate training data for learning the general policy, we sample grasp corrections from the already learned individual policies using previously collected data. Input features and resulting grasp corrections are combined in a "transfer" dataset, which is used to transfer the behaviors to the general policy. In order to increase the amount of information provided to the general policy, we increase the number of its input features by extracting a larger number of PCA components from the ST-HMP features. Using different features in the general policy than in the original individual policies is one of the advantages of our setting. The individual policies provide outputs of the desired behavior, while the general policy can have a different set of input features.
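
A minimal sketch of this supervision step is given below; the dataset sizes, PCA dimensionalities, network architecture, and the object lookup are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

# Placeholder ST-HMP descriptors from previously collected unsuccessful grasps.
st_hmp_failed = np.random.randn(1000, 512)

# Small PCA used by the individual linear policies and a larger PCA that
# provides richer input features to the general policy (assumed sizes).
pca_small = PCA(n_components=10).fit(st_hmp_failed)
pca_large = PCA(n_components=50).fit(st_hmp_failed)

# One learned weight matrix W (6 x 10) per object from the RL stage.
individual_policies = {
    "cylinder": np.random.randn(6, 10),
    "box": np.random.randn(6, 10),
    "ball": np.random.randn(6, 10),
}

def object_of(i):
    """Stand-in for looking up which object a grasp sample came from."""
    return ["cylinder", "box", "ball"][i % 3]

# Build the "transfer" dataset: large-PCA features as inputs, grasp
# corrections queried from the matching individual policy as targets.
X, Y = [], []
for i, descriptor in enumerate(st_hmp_failed):
    phi_small = pca_small.transform(descriptor.reshape(1, -1)).ravel()
    correction = individual_policies[object_of(i)] @ phi_small  # (x, y, z, a, b, g)
    X.append(pca_large.transform(descriptor.reshape(1, -1)).ravel())
    Y.append(correction)
X, Y = np.array(X), np.array(Y)

# Train the general policy, a neural network, by supervised regression.
general_policy = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=2000)
general_policy.fit(X, Y)
```

The key point is that the targets come from the already learned linear policies, queried through their own small PCA, while the general policy is free to use a richer input representation.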

III. EXPERIMENTAL RESULTS

In our experiments, we use a Barrett arm and hand that is equipped with three biomimetic tactile sensors (BioTacs) [10]. For extracting ST-HMP features, the BioTac electrode values are laid out in a 2D tactile image according to their spatial arrangement on the sensor.
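
As an illustration only, the sketch below scatters the 19 BioTac electrode readings into a small image grid; the electrode-to-pixel coordinates are made up, since the actual spatial arrangement is a property of the sensor and is not given here.

```python
import numpy as np

# Hypothetical (row, col) positions of the 19 BioTac electrodes on an image
# grid; the true spatial arrangement is sensor specific.
ELECTRODE_PIXELS = [
    (0, 1), (0, 2), (0, 3),
    (1, 0), (1, 1), (1, 2), (1, 3), (1, 4),
    (2, 0), (2, 1), (2, 2), (2, 3), (2, 4),
    (3, 0), (3, 1), (3, 2), (3, 3), (3, 4),
    (4, 2),
]

def tactile_image(electrode_values, shape=(5, 5)):
    """Scatter the 19 electrode readings into a 2D tactile image."""
    img = np.zeros(shape)
    for value, (r, c) in zip(electrode_values, ELECTRODE_PIXELS):
        img[r, c] = value
    return img

# A time series of such images, one per finger and time step, is what the
# ST-HMP descriptor is computed from.
frame = tactile_image(np.random.randn(19))
```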

First, we evaluate the individual regrasping policies. The robot performs a randomly generated top grasp using the force grip controller [11] and lifts the object. At the final stage of the experiment, the robot performs extensive shaking motions in all directions to check whether the grasp is stable. The robot uses the stability prediction to self-supervise the learning process.

To evaluate the results of the policy search, we perform 100 random grasps using the final policies on each of the objects that they were learned on. The robot has three attempts to regrasp each object using the learned policy. Table I shows the percentage of successful grasps on each object after each regrasp. Already after one regrasp, the robot is able to correct the majority of the failed grasps, increasing the success rate from 41.8% to 83.5% on the cylinder, from 40.7% to 85.4% on the box, and from 52.9% to 84.8% on the ball. Moreover, allowing additional regrasps increases the success rate to 90.3% for two and 97.1% for three regrasps on the cylinder, to 93.7% and 96.8% on the box, and to 91.2% and 95.1% on the ball. These results indicate that the robot is able to learn a tactile-based regrasping strategy for individual objects.
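
The evaluation protocol can be summarized by a short loop; the robot-side routines used below (execute_random_top_grasp, predict_stability, place_down_and_regrasp, lift_and_shake) are placeholders for the procedures described above, not an actual API.

```python
def evaluate_grasp(robot, policy, pca, max_regrasps=3):
    """Grasp, then regrasp up to three times whenever the stability
    predictor expects the current grasp to fail (placeholder routines)."""
    tactile = robot.execute_random_top_grasp()          # initial grasp
    for _ in range(max_regrasps):
        if robot.predict_stability(tactile):            # grasp expected to hold
            break
        phi = pca.transform(tactile.reshape(1, -1)).ravel()
        correction = policy(phi)                        # (x, y, z, alpha, beta, gamma)
        tactile = robot.place_down_and_regrasp(correction)
    return robot.lift_and_shake()                       # final stability check
```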

After training the individual policies, we create a "transfer" dataset with grasp corrections obtained from the individual linear regrasping policies for all objects. For each set of tactile features, we query the respective previously learned linear policy for the corresponding grasp correction. We take the input features for the individual policies from the unsuccessful grasps in the open-source BiGS dataset [12]. The grasps in BiGS were collected in an analogous experimental setup and can directly be used for creating the "transfer" dataset. Here, we employ a neural network to mimic the behavior of the individual policies.

Table I shows the performance of the generalized policy on the single objects. Interestingly, the combined policy achieves better performance on each of the single objects than the respective linear policies learned specifically for these objects after one regrasp. Furthermore, in the cases of the cylinder and the ball, the performance of the generalized policy is better than that of the linear policies evaluated after two regrasps. This shows that the general policy generalizes well between the single policies. In addition, by utilizing the knowledge obtained from the single policies, the generalized policy performs better on the objects that the single policies were trained on.

We test the performance of the generalized policy on a novel, more complex object (see the bottom-right corner in Fig. 1), which was not present during learning. The generalized policy improves the grasping performance significantly, which shows its ability to generalize to more complex objects. Nevertheless, there are some difficulties when the robot regrasps a part of the object that differs from the part touched in the initial grasp. In this case, the regrasp is incorrect for the new part of the object, i.e. the yaw adjustment is suboptimal for the box part of the object due to the round grasping surface (the lid) in the initial grasp.

During the experiments, we were able to observe many intuitive corrections made by the robot using the learned regrasping policy. The robot was able to identify when one of the fingers was only barely touching the object's surface, causing the object to rotate in the hand. In this case, the regrasp resulted in either rotating or translating the gripper such that all of its fingers were firmly touching the object. Another noticeable trend learned through reinforcement learning was that the robot would regrasp the middle part of the object, which was closer to the center of mass and, hence, more stable for grasping. On the box object, the robot learned to change its grasp such that its fingers were aligned with the box's sides. These results indicate that not only can the robot learn a set of linear regrasping policies for individual objects, but also that it can use them as the basis for guiding the generalized regrasping behavior.

IV. CONCLUSIONS

Our experiments indicate that the combined policy learned using our method achieves better performance on each of the single objects than the respective linear policies learned using reinforcement learning specifically for these objects after one regrasp. Moreover, the general policy achieves approximately an 80% success rate after one regrasp on a novel object that was not present during training. These results show that our supervised policy learning method applied to regrasping can generalize to more complex objects.

REFERENCES

[1] J. Bohg, A. Morales, T. Asfour, and D. Kragic. Data-driven grasp synthesis: a survey. IEEE Transactions on Robotics, 30(2):289–309, 2014.

[2] Y. Chebotar, K. Hausman, Z. Su, G. S. Sukhatme, and S. Schaal. Self-supervised regrasping using spatio-temporal tactile features and reinforcement learning. In IROS, 2016.

[3] M. P. Deisenroth, G. Neumann, and J. Peters. A survey on policy search for robotics. Foundations and Trends in Robotics, 2(1-2):1–142, 2013.

[4] S. Levine, N. Wagener, and P. Abbeel. Learning contact-rich manipulation skills with guided policy search. In ICRA, 2015.

[5] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel. Deep spatial autoencoders for visuomotor learning. CoRR, 2015.

[6] M. Madry, L. Bo, D. Kragic, and D. Fox. ST-HMP: Unsupervised spatio-temporal feature learning for tactile data. In ICRA, 2014.

[7] I. T. Jolliffe. Principal Component Analysis. Springer, New York, 1986.

[8] J. Peters, K. Mülling, and Y. Altun. Relative entropy policy search. In AAAI, 2010.

[9] S. Levine and V. Koltun. Guided policy search. In ICML, 2013.

[10] N. Wettels, V. J. Santos, R. S. Johansson, and G. E. Loeb. Biomimetic tactile sensor array. Advanced Robotics, 22(8):829–849, 2008.

[11] Z. Su, K. Hausman, Y. Chebotar, A. Molchanov, G. E. Loeb, G. S. Sukhatme, and S. Schaal. Force estimation and slip detection/classification for grip control using a biomimetic tactile sensor. In Humanoids, 2015.

[12] Y. Chebotar, K. Hausman, Z. Su, A. Molchanov, O. Kroemer, G. Sukhatme, and S. Schaal. BiGS: BioTac grasp stability dataset. In ICRA 2016 Workshop on Grasping and Manipulation Datasets, 2016. URL http://bigs.robotics.usc.edu/.