Combining active learning and reactive control for robot grasping


Transcript of Combining active learning and reactive control for robot grasping

Robotics and Autonomous Systems 58 (2010) 1105–1116


Combining active learning and reactive control for robot grasping

O.B. Kroemer a,*, R. Detry b, J. Piater b, J. Peters a

a Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany
b Université de Liège, INTELSIG Lab, Department of Electrical Engineering and Computer Science, Belgium

Article info

Article history: Available online 12 June 2010

Keywords: Robot grasping, Reinforcement learning, Reactive motion control, Imitation learning

Abstract

Grasping an object is a task that inherently needs to be treated in a hybrid fashion. The system must decide both where and how to grasp the object. While selecting where to grasp requires learning about the object as a whole, the execution only needs to reactively adapt to the context close to the grasp's location. We propose a hierarchical controller that reflects the structure of these two sub-problems, and attempts to learn solutions that work for both. A hybrid architecture is employed by the controller to make use of various machine learning methods that can cope with the large amount of uncertainty inherent to the task. The controller's upper level selects where to grasp the object using a reinforcement learner, while the lower level comprises an imitation learner and a vision-based reactive controller to determine appropriate grasping motions. The resulting system is able to quickly learn good grasps of a novel object in an unstructured environment, by executing smooth reaching motions and preshaping the hand depending on the object's geometry. The system was evaluated both in simulation and on a real robot.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Robots possess great potential for being employed in domestic environments, where they could perform various tasks such as tidying up rooms, taking out the garbage, or serving dinner. Although these chores are variations of a basic pick-and-place task, robots still struggle with them.

One of the key challenges for roboticists is the large variability inherent in the tasks and environments that a robot may encounter. Preparing a robot completely beforehand for all possible situations is probably impossible as it is prohibitively difficult to foresee all scenarios. Such a preparation is also inefficient, as only a few of the situations will be required by the robot. Due to these limitations, it is important to design robots that can adapt and learn from their own experiences.

Grasping an unknown object is an example of a task that is made particularly difficult by the large variety of objects (see Fig. 2). Many approaches have been proposed for robot grasping. Early works [1,2] found analytical solutions to the problem, but these approaches require precise information about the environment (e.g., external forces, surface properties) that may not be accessible. Supervised learning can be used to train robots how to recognize good grasping points [3], but requires a considerable initial input from a human supervisor.

* Corresponding author.
E-mail addresses: [email protected] (O.B. Kroemer), [email protected] (R. Detry), [email protected] (J. Piater), [email protected] (J. Peters).

doi:10.1016/j.robot.2010.06.001

Active and reinforcement learning methods have focused on exploring the object to acquire a complete affordance model [4,5], but not on optimizing grasps. However, finding good grasp locations is only a part of the problem.

The robot grasping task can be decomposed into two problems: (i) deciding where to grasp the object, and (ii) determining how to perform the grasping movement. These two sub-problems are closely related and must be addressed together in order to perform a successful grasp. The choice of where to grasp an object sets the context for determining how to grasp it. However, the execution of the grasp ultimately determines whether the grasp location was well chosen.

In this paper, we propose a hierarchical controller that reflects the structure of these two task components, as shown in Fig. 1. The upper level decides where to grasp the object, and the lower level determines how to perform the grasping movements given the context of these grasp parameters and the scene. The upper level subsequently receives a reward based on the grasp execution, and takes this into consideration when selecting future grasps.

The system employs a hybrid architecture that uses reinforcement learning, imitation learning, and reactive control. The core of the upper level is a reinforcement learning approach that uses the successfulness of evaluated grasps to determine future grasps. It is crucial that its state-action space is low dimensional for faster convergence [6,7], and that information from other sources (e.g., demonstrated grasps) can easily be incorporated. To reduce the action space, the reinforcement learner specifies a grasp as a six-dimensional hand pose in the object's reference frame, and all remaining variables inherent to the grasping movements are handled by a lower level controller.

The lower level controller is responsible for action execution. A straightforward method of acquiring an arbitrary motion policy is by imitation learning. One approach to imitation learning is to transform a demonstrated trajectory into a standard dynamical systems motor primitive (DMP) [8,9]. This policy is adapted, in a task-specific manner, to the grasp parameters specified by the reinforcement learner. The resulting DMP is augmented by a reactive controller that takes the geometry of the object and scene into consideration. The resulting action is executed by the robot, which returns a corresponding reward to the upper level of the controller.

Fig. 1. The controller architecture consists of an upper level based on reinforcement learning and a bottom level based on reactive control. Both levels are supported by supervised/imitation learning. The World and Supervisor are external elements of the system.

The complete hybrid controller is illustrated in Fig. 1. It uses its own experiences to quickly converge on good grasping locations. The grasping motions are taught by demonstration and adapted to different grasp locations and the surrounding geometry. A key feature of this hybrid approach is that the reactive controller is incorporated in the reinforcement learner's action–reward feedback loop. Thus, the hybrid system will learn an appropriate grasping action together with a corresponding grasp location, and solve both of the sub-problems.

In the following sections, we discuss the proposed controller in a top-down manner. The active learner and the reactive bottom level of the controller are detailed in Sections 2 and 3 respectively. In Section 4, the system is evaluated both in simulation and on the robot platform shown in Fig. 2.

2. High level active learner

The high level controller chooses where on the object to apply the next grasp, and improves the grasp locations using the acquired data. The reinforcement learning approach is inspired by the grasp learning exhibited by infants [10–12], requiring relatively little prior knowledge and making few assumptions. Young infants have a grasp reflex that allows them to crudely grasp objects [10]. They learn to improve their grasps through trial and error, allowing them to later be able to perform precision grips. The reactive controller of the hybrid system represents a vision-based grasp reflex. The initial grasps may be crude, but the learning system will adapt to the object and can learn to perform precision grasps.

To keep the number of assumptions low, we define the state as the object being grasped, and learn a model for each object. The robot's grasps are learned in the object's reference frame, allowing the object to be repositioned in the workspace. Similar to a young infant [10], learning to grasp an object is treated as context independent and only based on the task constraints it has encountered. Thus, if an object has always been presented as hanging on a string, both the robot and infant would initially not know that grasping it from below does not work when the object is on a table [10]. The robot will assign an expected reward to the grasp that reflects both situations and how often it has encountered each.

Fig. 2. The robot used in our experiments and an example of a grasping task in a cluttered environment.

Another infant-like feature is that the robot has no vision-grasp mapping. Infants under nine months do not orientate their hands to the orientation of object parts [12]. The robot also does not assume that the geometry of a cup's handle will imply a certain orientation of the hand as appropriate. Instead, it will try different orientations and find one that is well suited for it. Hence, several object properties do not need to be modeled explicitly, e.g., friction. Ultimately, the reinforcement learning approach is highly adaptive and is applicable to a wide range of situations.

In contrast, supervised learning of grasps has focused on methods using internal models of the world [13,14], or mappings between visual features of objects and grasps [3]. These approaches are more characteristic of adult human grasping, and thus require large amounts of prior information.

To converge quickly to high rewarding grasp locations, the system must balance the exploitation of good grasping points and the exploration of new, possibly better, ones. From a machine learning perspective, this selecting of grasps can be interpreted as a continuum-armed bandits problem [15].

The continuum-armed bandit problem is a generalization of the traditional n-armed bandit problem [6] where the agent must choose from a continuous range of locally dependent actions, instead of a finite number. Under this interpretation, the action is given by the grasp applied and the reward is a measure of the success of this grasp.

To date, most methods [16,17] that solve the continuum-armed bandit problem are based on discretizing the space. For high-dimensional domains, such as robot grasping, any discrete segmenting will scale badly due to the "curse of dimensionality" [7].

The hard segmentation will result in unnatural borders and make the use of prior knowledge complicated. We propose a sample-based reinforcement learner that models the distribution of expected rewards over the continuous space of actions using Gaussian process regression (GPR) [18]. The proposed learner then searches for the most promising grasp to evaluate next, using a method inspired by mean-shift [19]. The resulting policy is called Continuum Gaussian Bandits (CGB), and is outlined in Fig. 3.

Fig. 3. The algorithm models the merit function with GPR, and finds a set of local maxima using a parallel search. The candidate action with the greatest merit is evaluated and the results are stored.

The following four sections detail the active learner and present the employed policy (Section 2.1), the modeling of the expected rewards (Section 2.2), how the learner selects the next grasp (Section 2.3), and then the method for implementing this selection on the reward model (Sections 2.4 and 2.5). Finally, Section 2.6 explains how supervised data can be incorporated into the active learner as prior knowledge.

2.1. Upper confidence bound policy

Choosing where to grasp a novel object suffers from an exploration–exploitation problem. The traditional machine learning framework for studying this dilemma is the n-armed bandits problem, wherein an agent must repeatedly choose from a finite set of n possible actions to maximize the accumulated reward.

Among the more successful strategies [6] are upper confidence bound (UCB) policies. While there are different versions of UCB policies [6,20], the principal idea is to assign each action two variables, i.e., the expected reward µ for taking that action, and a confidence bound ±σ indicating the range in which the actual mean reward is. Both µ and σ indicate how desirable executing the action is. A high expected reward µ is valuable in the sense of exploitation and receiving rewards, while a large confidence bound σ indicates an informative action that is good for exploration. Using the exploration variable σ leads to a more structured exploration than regular randomized policies (e.g., ε-greedy [6]). UCB policies also provide performance guarantees, and have an upper bound on the expected regret that scales only logarithmically with the number of trials [20].

The sum of the expected reward µ and the standard deviation σ indicates how desirable executing the action is overall. We call the value µ + σ the merit of an action. A UCB policy always selects the action for which this merit value is the greatest [20]. Intuitively, a UCB policy optimistically chooses the action which could be the best, and will thus only converge to an action when it knows that no other action could be better.

Adapting a UCB policy to the continuum-armed bandits requires a new approach that scales to the high-dimensional spaces of grasping tasks. The first step towards realizing this approach is to create a sample-based model of the exploration σ and exploitation µ variables.
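
To make the UCB selection rule concrete, the following minimal sketch (not from the paper; written in Python with illustrative names) maintains a per-action reward estimate µ and a confidence term σ that shrinks with the number of samples, and always executes the action with the highest merit µ + σ.

```python
import numpy as np

class SimpleUCBBandit:
    """Minimal UCB-style policy for an n-armed bandit (illustrative only)."""

    def __init__(self, n_actions, prior_sigma=1.0):
        self.counts = np.zeros(n_actions)      # times each action was tried
        self.means = np.zeros(n_actions)       # running mean reward per action
        self.prior_sigma = prior_sigma         # confidence for untried actions

    def merit(self):
        # Confidence shrinks roughly as 1/sqrt(count); untried actions keep the prior.
        sigma = np.where(self.counts > 0,
                         self.prior_sigma / np.sqrt(np.maximum(self.counts, 1)),
                         self.prior_sigma)
        return self.means + sigma              # merit = expected reward + confidence

    def select(self):
        return int(np.argmax(self.merit()))    # optimistic action choice

    def update(self, action, reward):
        self.counts[action] += 1
        self.means[action] += (reward - self.means[action]) / self.counts[action]
```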

2.2. Expected reward and confidence modeling with Gaussian process regression

Modeling the upper confidence bound for continuous actions requires the expected reward function and its standard deviation to be approximated. A well-suited approach that satisfies these requirements is Gaussian process regression (GPR) [18].

Rather than mapping inputs to specific output values, GPR returns a Gaussian distribution of the expected rewards. This Gaussian distribution is characterized by its mean µ(x) and standard deviation σ(x), where the standard deviation is a confidence bound on the expected reward. This technique is non-parametric, which implies that µ(x) and σ(x) are functions that directly incorporate all previous samples. Non-parametric methods are very adaptable, and apply few constraints on the model. The GPR approach incorporates a prior that keeps the mean and variance bounded in regions without data. Unexplored regions will thus have a large confidence bound σ(x) and small expected rewards µ(x). Sampling from these regions will shift µ(x) towards the actual expected reward at x, but decrease the confidence bound σ(x).

We employ the standard Gaussian kernels k(x, y) = σ_a^2 exp(−0.5 (x − y)^T W (x − y)), where W is a diagonal matrix of kernel widths. The parameter σ_a affects the convergence rate of the policy, as explained in Section 2.6.

For grasping, the vectors x ∈ R^6 and y ∈ R^6 each contain three position and three orientation parameters of grasps, which describe the final position of the hand in the object's reference frame. Working in the object's reference frame allows the object to be repositioned and reorientated in the workspace without altering the grasp parameters. Additional grasp parameters are excluded to keep the number of parameters minimal, and thus allow for rapid learning. All of the other motion parameters are handled by the reactive low level controller, which modifies these parameters depending on the object and the scene, as well as the parameters in x.

The proposed UCB policy will base its decisions on the merit function M(x) = µ(x) + σ(x), where µ(x) and σ(x) are the expected reward and standard deviation at grasp x respectively. The standard GPR model [18] for the mean µ, variance σ², and standard deviation σ, are

$$\mu(x) = k(x, Y)^T \left(K + \sigma_s^2 I\right)^{-1} t,$$

$$\sigma(x) = \sqrt{\sigma^2(x)} = \sqrt{k(x, x) - k(x, Y)^T \left(K + \sigma_s^2 I\right)^{-1} k(x, Y)},$$

where [K]_{i,j} = k(y_i, y_j) is the Gram matrix, the kernel vector k decomposes as [k(x, Y)]_j = k(x, y_j), the hyperparameter σ_s^2 indicates the noise variance, and the N previous data points are stored in Y = [y_1, ..., y_N] with corresponding rewards t = [t_1, ..., t_N].

Both the mean and variance equations can be rewritten as the weighted sum of Gaussians, giving

$$\mu(x) = \sum_{j=1}^{N} k(x, y_j)\,\alpha_j,$$

$$\sigma^2(x) = k(x, x) - \sum_{i=1}^{N} \sum_{j=1}^{N} k'\!\left(x, 0.5\,(y_i + y_j)\right)\gamma_{ij},$$

where k'(x, y) = σ_a^2 exp(−(x − y)^T W (x − y)), and the constants are defined as α_j = [(K + σ_s^2 I)^{-1} t]_j and γ_{ij} = [(K + σ_s^2 I)^{-1}]_{i,j} exp(−0.25 (y_i − y_j)^T W (y_i − y_j)).


Different upper confidence intervals σ have been used in UCB policies [21], and can be used by modeling them with a second GPR [22]. The previous rewards t occur in the exploitation term µ(x), but not in the standard deviation σ(x) as it represents the exploration, which is independent of the rewards. A similar merit function has previously been employed for multi-armed bandits in metric spaces, wherein GPR was used to share knowledge between discrete bandits [23].

Having chosen a UCB policy framework and a GPR merit model, the implementation of the policy has to be adapted to the merit function.
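
The merit model above can be sketched directly from these equations. The following Python snippet is an illustration, not the authors' implementation; the function and variable names are ours. It computes µ(x), σ(x), and the merit M(x) = µ(x) + σ(x) for a candidate grasp x given previously evaluated grasps Y and rewards t, using the Gaussian kernel with σ_a = 0.75 as in Section 2.6.

```python
import numpy as np

def gaussian_kernel(x, y, W, sigma_a=0.75):
    """k(x, y) = sigma_a^2 exp(-0.5 (x-y)^T W (x-y)) with diagonal widths W (a vector)."""
    d = x - y
    return sigma_a**2 * np.exp(-0.5 * d @ (W * d))

def gpr_merit(x, Y, t, W, sigma_a=0.75, sigma_s=0.1):
    """Return (mu, sigma, merit) of a candidate grasp x given past grasps Y and rewards t."""
    N = len(Y)
    K = np.array([[gaussian_kernel(Y[i], Y[j], W, sigma_a) for j in range(N)]
                  for i in range(N)])
    k_vec = np.array([gaussian_kernel(x, Y[j], W, sigma_a) for j in range(N)])
    A = np.linalg.inv(K + sigma_s**2 * np.eye(N))     # (K + sigma_s^2 I)^-1
    mu = k_vec @ A @ np.asarray(t)
    var = gaussian_kernel(x, x, W, sigma_a) - k_vec @ A @ k_vec
    sigma = np.sqrt(max(var, 0.0))
    return mu, sigma, mu + sigma                      # merit M(x) = mu(x) + sigma(x)

# Y: list of 6D grasp poses tried so far, t: their rewards in [0, 1],
# W: vector of kernel widths, sigma_s: assumed observation-noise level.
```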

2.3. UCB policy for GPR model

Given a model of the UCB merit function, the system requires a suitable method for determining the action with the highest merit. Executing this grasping action will acquire the greatest combination of reward and information.

The merit function will most likely not be concave and will contain an unknown number of maxima with varying magnitudes [18]. Determining the global maximum of the merit function analytically is therefore usually intractable [18]. However, numerically, we can determine a set of locally optimal grasps. Such sets of grasps will contain many maxima of the merit function, especially near the previous data points. Given a set of local maxima, the merit of each candidate grasp is evaluated and the robot executes the grasp with the highest merit.

The method for finding the local maxima was inspired by mean-shift [19], which is commonly used for both mode detection of kernel densities and clustering. Mean-shift converges onto the local maxima of a given point by iteratively applying

$$x_{n+1} = \frac{\sum_{j=1}^{N} y_j\, k(x_n, y_j)}{\sum_{j=1}^{N} k(x_n, y_j)}, \qquad (1)$$

where k(x_n, y_j) is the kernel function, and y_j are the N previously tested maxima candidates as before. The monotonic convergence via a smooth trajectory can be proven for mean-shift [19]. To find all of the local maxima, mean-shift initializes the update sequence with all previous data points. The global maximum is then determined from the set of local maxima, which is guaranteed to include the global maximum [24].

The intuition behind this approach for grasping is that all of the previous grasp attempts are locally re-optimized based on the current empirical knowledge, as modeled by the merit function. Subsequently, we choose the best of these optimized grasps to execute and evaluate.

Mean-shift is however limited to kernel densities and does not work directly in cases of regression, because the α_j and γ_{ij} weights are not always positive [19]. In particular, the standard update rule (1) cannot be used, nor can we guarantee that the global maximum will be one of the detected maxima. However, the global maximum is only excluded from the set of found maxima if it is isolated from all previous samples by regions of low merit.

As Eq. (1) is not applicable in our regression framework, a new update step had to be developed, which monotonically converges upon the local maximum of our merit function.
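
For comparison, a direct implementation of the mean-shift update of Eq. (1) is only a few lines; the sketch below is illustrative and assumes a positive kernel such as the Gaussian kernel of Section 2.2.

```python
import numpy as np

def mean_shift_mode(x0, Y, kernel, n_iters=100, tol=1e-6):
    """Iterate the mean-shift update of Eq. (1) from a starting point x0.

    Y is the set of previous sample points and `kernel(x, y)` a positive kernel;
    all names here are illustrative."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        weights = np.array([kernel(x, y) for y in Y])        # k(x_n, y_j)
        x_new = (weights[:, None] * np.asarray(Y)).sum(0) / weights.sum()
        if np.linalg.norm(x_new - x) < tol:                  # converged to a mode
            break
        x = x_new
    return x

# Starting one mean-shift run from every previous sample yields the set of local
# maxima from which a UCB policy would pick the candidate with the highest merit.
```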

2.4. Local maxima detection for GPR

Given the model in Section 2.2, the merit function takes the form

$$M(x) = \sum_{j=1}^{N} k(x, y_j)\,\alpha_j + \sqrt{k(x, x) - \sum_{i=1}^{N} \sum_{j=1}^{N} k'\!\left(x, 0.5\,(y_i + y_j)\right)\gamma_{ij}}.$$

To use the policy described in Section 2.3 with this merit function, a monotonically converging update rule is required that can determine local maxima. We propose an update rule consisting of the current gradient of the merit function, divided by a local upper bound of the merit's second derivative; specifically, we propose

$$x_{n+1} = \frac{\partial_x \mu + \partial_x \sigma}{q(\mu) + q(\sigma^2)\big/\sqrt{p(\sigma^2)}} + x_n = s + x_n, \qquad (2)$$

where

$$\partial_x \mu = \sum_{j=1}^{N} W (y_j - x_n)\, k(x_n, y_j)\,\alpha_j$$

and

$$\partial_x \sigma = \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{\gamma_{ij}}{2\sigma}\, 2W\!\left(\frac{y_i + y_j}{2} - x_n\right) k'\!\left(x_n, \frac{y_i + y_j}{2}\right).$$

The function q(·) returns a local upper bound on the absolute second derivative of the input within the x_n to x_{n+1} range. Similarly, p(·) returns a local lower bound on the absolute value of the input.

This form of update rule displays the desired convergence qualities, as explained in Section 2.5. The rule is only applicable because the Gaussian kernels have bounded derivatives resulting in finite q(µ) and q(σ²), and any real system will have a positive variance giving a real non-zero √p(σ²).

To calculate the local upper and lower bounds, we first define a region of possible x_{n+1} values to consider. Therefore, we introduce a maximum step size m > 0, where steps with larger magnitudes must be truncated; i.e., ‖x_{n+1} − x_n‖ ≤ m. Having defined a local neighborhood, q(µ), q(σ²), and p(σ²) need to be evaluated.

In Section 2.2, µ and σ² were represented as the linear weighted sums of Gaussians. Given a linear sum, the rules of superposition can be applied to evaluate q(µ), q(σ²), and p(σ²). Thus, the upper bound of a function in the region is given by the sum of the local upper bounds of each Gaussian, i.e.,

$$q_m\!\left(\sum_{j=1}^{N} k(x, y_j)\,\alpha_j\right) \le \sum_{j=1}^{N} q_m\!\left(k(x, y_j)\,\alpha_j\right).$$

As Gaussians monotonically tend to zero with increasing distance from their mean, determining an upper bound value for them individually is trivial. In the cases of q(µ) and q(σ²), the magnitudes of the second derivatives can be bounded by a Gaussian; i.e.,

$$\left\|\partial_x^2 k(x, y_j)\right\| < \sigma_a^2 \exp\!\left(-(x - y_j)^T W (x - y_j)/6\right),$$

which can then be used to determine the local upper bound.

We have thus defined an update step and its implementation, which can be used to detect the modes of a Gaussian process in a regression framework. The final algorithm has a time complexity of O(N³), similar to all other exact GPR methods [22]. However, this complexity scales linearly with the number of dimensions, while discretization methods scale exponentially, making the proposed GPR method advantageous when the problem dimensionality is greater than three. The mode detection algorithm can be easily parallelized for efficient implementations on multiple computers or GPUs as an anytime algorithm.

This section concludes the details of the proposed reinforcement learner, which is outlined in Algorithm 1 in Fig. 3. As shown, the final algorithm is quite compact and straightforward. It consists of modeling the expected rewards using GPR, and applying a parallel search to determine a maximum to evaluate next. The mode detection behavior is analyzed in the next section. Incorporating supervised data from other data sources is described in Section 2.6, which completes the upper level of the controller design.
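
A simplified sketch of the resulting mode search is given below. It follows the gradient-over-curvature form of Eq. (2) but, for brevity, assumes that the caller supplies a merit gradient and a local curvature bound (stand-ins for the exact q(·) and p(·) bounds of this section) and truncates steps to the maximum step size m.

```python
import numpy as np

def cgb_local_maximum(x0, merit_grad, curvature_bound, max_step=0.1,
                      n_iters=200, tol=1e-6):
    """Gradient-over-curvature ascent in the spirit of Eq. (2).

    `merit_grad(x)` returns the gradient of the merit M(x); `curvature_bound(x)`
    returns a local upper bound u on the magnitude of its second derivative. Both
    are placeholders for the bounds derived in Section 2.4 (simplified here)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = merit_grad(x)
        u = curvature_bound(x)
        step = g / u                              # s = u^-1 g, as analyzed in Section 2.5
        norm = np.linalg.norm(step)
        if norm > max_step:                       # truncate to the local region
            step *= max_step / norm
        if norm < tol:                            # stationary point reached
            break
        x = x + step
    return x

# Running this search from every previously evaluated grasp (in parallel) yields the
# candidate set; the grasp with the highest merit M(x) is executed next.
```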

2.5. Mode detection convergence analysis

Having specified the method for determining maxima of a GPR in Section 2.4, Lyapunov's direct method can be used to show that the method converges monotonically to stationary points. The underlying principle is that an increased lower bound on the merit reduces the set of possible system states and, therefore, a continually increasing merit leads to convergence. The following one-dimensional analysis will show that only an upper bound on the magnitude of the second derivative is required for a converging update rule.

The increase in merit is given by M(x_{n+1}) − M(x_n). Given an upper bound u of the second derivative between x_n and x_{n+1}, and the gradient g = ∂_x M(x_n), the gradient in the region can be linearly bounded as

$$g - \|x - x_n\|\, u \le \partial_x M(x) \le g + \|x - x_n\|\, u.$$

Considering the case g ≥ 0 and therefore x_{n+1} ≥ x_n, the change in merit is lower bounded by

$$M(x_{n+1}) - M(x_n) = \int_{x_n}^{x_{n+1}} \partial_x M(x)\, dx \ge \int_{x_n}^{x_{n+1}} g - (x - x_n)\, u\; dx.$$

This term is maximal when the linear integrand reaches zero; i.e., g − (x_{n+1} − x_n)u = 0. This limit results in a shift of the form s = x_{n+1} − x_n = u^{-1}g, as was proposed in Eq. (2); evaluating the integral with this step shows that the merit increases by at least g²/(2u). The same update rule can be found by using a negative gradient and updating x in the negative direction. The merit thus always increases, unless the local gradient is zero or u is infinite. A zero gradient indicates that the local stationary point has been found, and the variable u is finite for any practical Gaussian process. In some cases, the initial point may be within the region of attraction of a point at infinity, which can be tested for by determining the distance from the previous data points.

The intuition underlying the results of the analysis is that at each step, the system assumes the gradient will shift towards zero at the maximum possible rate within the region. The estimate of the maximum is then moved to the first point where a zero gradient is possible. This concept can easily be generalized to higher-dimensional problems. The update rule guarantees that the gradient cannot shift sign within the update step, and thus ensures that the system will not overshoot nor oscillate about the stationary point. The update rule x_{n+1} = u^{-1}g + x_n therefore guarantees that the algorithm monotonically converges on the local stationary point.

2.6. Incorporating supervised data

Having fully designed the central reinforcement learner, the upper level controller still requires a method for allowing prior task information to be incorporated into the merit function to help reduce the search space.

Similar to how a child learns a new task by observing a parent before trying it themselves [10], a robot can use human demonstrations of good grasps to define its starting search region. However, whether these grasps are suitable for the robot is initially unknown.

GPR makes incorporating prior information fairly straightforward. If the supervised data has a reward associated with it, the data can be directly added to the data set. If the region suggested by the demonstration returns only low rewards, the system will begin searching neighboring areas where the merit is still high due to uncertainty. Thus, it defines an initial search region with soft boundaries that can move during the learning process.

The parameter σ_a of the merit function specifies how conservative the policy is in expanding these boundaries; i.e., a higher value will encourage more exploration, while a lower value will converge faster. Hence, it can be seen as a learning rate. With the rewards in the grasping task set to be within the range 0 to 1, the parameter is set to 0.75 to encourage exploration but also allow for a reasonable rate of convergence.

The robot experiment was initialized with search regions defined by 7, 10, and 25 demonstrated grasps for the box, watering can, and paddle respectively. The width parameters W of the Gaussian kernel were also optimized on these initial parameters.

This section concludes the discussion of the upper level controller. It takes the rewards of grasps, the pose of the object, and, optionally, demonstrated data as inputs, and returns the next grasp location to attempt. This grasp location is passed to the robot via a lower level controller, which generates the complete grasping motions based on these parameters.
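
A possible way to seed the learner with demonstrations, consistent with the description above but not taken from the paper's implementation, is simply to append the demonstrated grasps to the GPR data set; the nominal reward used for unevaluated demonstrations below is an assumption.

```python
import numpy as np

def seed_with_demonstrations(Y, t, demo_grasps, demo_rewards=None, prior_reward=0.5):
    """Add demonstrated grasps to the GPR data set used by the merit function.

    If the demonstrations come with evaluated rewards, use them directly; otherwise a
    nominal prior reward is assumed (the 0.5 default here is illustrative, not from
    the paper). Y and t are the grasp poses and rewards of Section 2.2."""
    demo_grasps = [np.asarray(g, dtype=float) for g in demo_grasps]
    if demo_rewards is None:
        demo_rewards = [prior_reward] * len(demo_grasps)
    return list(Y) + demo_grasps, list(t) + list(demo_rewards)

# The seeded data set defines a soft initial search region: the merit is high near the
# demonstrations, but the policy can still drift away if they yield low rewards.
```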

3. Low level reactive imitation controller

While the upper level of the controller selects grasp locations, the lower level is responsible for the execution of the grasp, including the reaching and finger motions. It is important that the system is adaptive at this level, as the success of the grasps depends on the execution. The finger motions should particularly adapt to the geometry of the object, a process known as preshaping. The robot's motions are learned from human demonstrations, and subsequently modified to incorporate the grasp information from the active learner and the scene geometry from the vision system.

A common approach to the grasp execution problem is to rely on specially designed sensors (e.g., laser scanner, RFID) to get accurate and complete representations of the object and environment [13,25], followed by lengthy planning phases in simulation [26]. We restrict the robot to only using stereo cameras, and a fast reactive sensor-based controller [27].

Although densely sampling sensors such as time-of-flight cameras and laser range finders are favored for reactive obstacle avoidance [28], the sparser information of stereo vision systems has also been used for these purposes [29,30]. Robot grasping research has focused on coarse object representations of novel objects [31–34], and using additional sensor arrays when in close proximity to the object [35,36]. Learning to grasp objects is also often done in simulation [31,14], which allows for many virtual grasp attempts on a model of the object. In contrast, the proposed hybrid system relies on relatively few real-world grasps and does not rely on having accurate dynamics and contact models.

For the lower level system, we propose a sensor-based robot controller that can perform human inspired motions, including preshaping of the hand, smooth and adaptive motion trajectories, and obstacle avoidance, using only stereo vision to detect the environment. Unlike previous approaches, we work with a sparse visual representation of objects, which maintains a high level of geometric details. The controller uses potential field methods [27], which treat the robot's state as a particle in a force field; i.e., the robot is attracted to a goal state, and repelled from obstacles.

The attractor field needs to be capable of encoding complex trajectories and adapting to different grasp locations. We therefore use the dynamical system motor primitive (DMP) [37,9] framework. The DMPs are implemented as passive dynamical systems superimposed with external forces; i.e.,

$$\ddot{y} = \alpha_z\!\left(\beta_z \tau^{-2}(g - y) - \tau^{-1}\dot{y}\right) + a\tau^{-2} f(x), \qquad (3)$$

where α_z and β_z are constants, τ controls the duration of the primitive, a is an amplitude, f(x) is a nonlinear function, and g is the goal for the state variable y. The variable x ∈ [0, 1] is the state of a canonical system ẋ = −τx, which acts as a shared clock amongst different DMPs; i.e., it ensures that the finger and arm motions are synchronized. The function f(x) encodes the trajectory for reaching the goal state, and takes the form

$$f(x) = \frac{\sum_{j=1}^{M} \psi_j(x)\, w_j\, x}{\sum_{i=1}^{M} \psi_i(x)},$$


where ψ(x) are M Gaussian basis functions, and w are weights. The weights w are acquired by imitation learning, using locally weighted regression [37,8]. The DMPs treat the goal state g as an adjustable variable and ensure that this state is always reached. However, their capability to generalize can be further improved by using a task-specific reference frame based on the active learner's grasp parameters, as detailed in Section 3.2. This adaptation of the action to different goals allows the object to be repositioned and reorientated in the robot's workspace.

More important is the choice of the scene's visual representation, which is used to augment the attractor field and forms the basis of the detractor field. The scene description needs to be in 3D, work at a fine scale to maintain geometric details, and represent the scenes sparsely to reduce the number of calculations required per time step. The Early Cognitive Vision system of Pugeault et al. [38,39] (see Fig. 4) fulfills these requirements by extracting edge features from the observed scene. The system subsequently localizes and orientates these edges in 3D space [40], with the resulting features known as early cognitive vision descriptors (ECVD) [38]. By using a large amount of small ECVDs, any arbitrary object/scene can be represented. Given an ECVD model of an object, the object's position and orientation can be determined [41] and the ECVDs of the object model can be superimposed into the scene representation.

Fig. 4. The left image shows the ECVD representation of the scene on the right. The paddle is the object to be grasped, while the surrounding objects are clutter. The coordinate frame of the third finger (the lower finger in the image) and the variables used in Section 3 are shown. The x–y–z coordinate system is located at the base of the finger, with z orthogonal to the palm, and y in the direction of the finger. The marked ECVD on the left signifies the jth descriptor, with its position at v_j = (v_jx, v_jy, v_jz)^T, and edge direction e_j = (e_jx, e_jy, e_jz)^T of unit length. The position of the finger tip is given by p = (p_x, p_y, p_z)^T.

As a hybrid system, the lower level controller supplies a complex adaptive action policy that the upper level can indirectly modify. The top level controller only needs to modify the action for a given object, which can be done more efficiently than having to learn the entire action. To allow for quick learning, the actions given by the reactive controller should be repeatable, while still adaptive. By making the rewards for grasps depend on the reactive controller, the reinforcement learner finds both good grasp locations as well as matching grasp executions.

In Sections 3.1 and 3.2, we describe the DMPs for grasping, followed by their augmentation using the ECVD based detractor field in Section 3.3.
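
For illustration, a single DMP of the form of Eq. (3) can be integrated with simple Euler steps as sketched below; the gain values α_z and β_z are common textbook defaults rather than the values used on the robot, and all names are ours.

```python
import numpy as np

def integrate_dmp(y0, g, weights, centers, widths, tau=1.0, a=1.0,
                  alpha_z=25.0, beta_z=6.25, dt=0.001, n_steps=1000):
    """Euler integration of the DMP of Eq. (3); the constants are illustrative."""
    y, yd, x = float(y0), 0.0, 1.0
    trajectory = []
    for _ in range(n_steps):
        psi = np.exp(-widths * (x - centers) ** 2)       # Gaussian basis functions
        f = (psi @ weights) * x / (psi.sum() + 1e-10)    # forcing term f(x)
        ydd = alpha_z * (beta_z * (g - y) / tau**2 - yd / tau) + a * f / tau**2
        yd += ydd * dt                                   # integrate acceleration
        y += yd * dt                                     # integrate velocity
        x += -tau * x * dt                               # canonical system x' = -tau x
        trajectory.append(y)
    return np.array(trajectory)
```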

3.1. Attractor fields based on dynamical systems motor primitives (DMPs)

Generating the grasp execution begins with defining an attractor field as a DMP, which encodes the desired movements given no obstacles. The principal features that need to be defined for these DMPs are the goal positions, and the generic shape of the trajectories.

The high level grasp controller gives the goal location and orientation of the hand, but not the fingers. Using the ECVDs, the goal position of each finger is approximated by first estimating a locally linearized contact plane for the object in the finger coordinate system (see Fig. 4). The purpose of this step is to get the fingers close to the object's surface during preshaping to allow for more control of the object during grasping. It is not intended to infer exact surface properties or whether the grasp is suitable. If the selected surface is unsuitable for grasping, a low reward will be received and the upper level controller will adapt its policy accordingly.

A contact plane is approximated for each finger to allow for a range of object shapes. The influence of the ith ECVD is weighted by w_i = exp(−σ_x^{-2} v_ix^2 − σ_y^{-2} v_iy^2 − σ_z^{-2} v_iz^2), where σ_x, σ_y, and σ_z are length constants that reflect the finger's length and width, and v_i is the position of the ECVD in the finger reference frame. The hand orientation is such that the Z direction of the finger should be approximately parallel to the contact plane, which reduces the problem to describing the plane as a line in the 2D X–Y space. The X–Y gradient of the plane is approximated by φ = (Σ_{i=1}^N w_i)^{-1} Σ_{i=1}^N w_i arctan(e_iy/e_ix), where N is the number of vision descriptors, and e_i is the direction of the ith edge. The desired Y position of the fingertip is then given by

$$p_y = \frac{\sum_{i=1}^{N} \left(w_i v_{iy} - \tan(\phi)\, w_i v_{ix}\right)}{\sum_{i=1}^{N} w_i},$$

which can be converted to joint angles using the inverse kinematics of the hand. The proposed method selects the goal postures of the fingers in a deterministic manner, which depends on the object's geometry as well as the grasp parameters specified by the active learner. Thus, the hybrid system's active learner indirectly selects the posture of the fingers through a reactive mechanism based on the visual model of the object.
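
The finger goal computation can be sketched as follows (illustrative Python; the length constants are placeholders for the finger dimensions, and arctan2 is used as a numerically safer stand-in for arctan(e_iy/e_ix)).

```python
import numpy as np

def finger_goal_y(ecvd_pos, ecvd_dir, sigma=(0.05, 0.08, 0.05)):
    """Estimate the desired fingertip Y position from ECVDs in the finger frame.

    ecvd_pos: (N, 3) descriptor positions v_i; ecvd_dir: (N, 3) unit edge directions e_i.
    The length constants sigma are assumed placeholders, not values from the paper."""
    v, e = np.asarray(ecvd_pos), np.asarray(ecvd_dir)
    sx, sy, sz = sigma
    w = np.exp(-(v[:, 0] / sx) ** 2 - (v[:, 1] / sy) ** 2 - (v[:, 2] / sz) ** 2)
    phi = np.sum(w * np.arctan2(e[:, 1], e[:, 0])) / np.sum(w)  # weighted plane gradient
    p_y = np.sum(w * v[:, 1] - np.tan(phi) * w * v[:, 0]) / np.sum(w)
    return p_y
```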

The next step defines the reaching and grasping trajectories. Many beneficial traits of human movements, including smooth motions and small overshoots for obstacle avoidance [42,43,11], can be transferred to the robot through imitation learning. To demonstrate grasping motions, we used a VICON motion tracking system to record human movements during a grasping task. The grasped object can be different to the robot's. VICON markers were only required at the hand and finger tips. The tracking system samples the human's motions, generating position q, velocity q̇, and acceleration q̈ data, as well as the samples' time stamps.

Fig. 5. The diagram shows the change in coordinate systems for the reaching DMPs. The axes Xw–Yw–Zw are the world coordinate system, and Xp–Yp–Zp is the coordinate system in which the DMP is specified. The trajectory of the DMP is shown by the curved line, starting at point s, and ending at point g. Xp is parallel to the approach direction of the hand, the arrow a. The axis Yp is perpendicular to Xp, and pointing from s towards g.

The weights w_i of the DMP are then given by

$$w_i = \left(\sum_{k=1}^{T} \psi_i(x_k)\, x_k^2\right)^{-1} \sum_{j=1}^{T} \psi_i(x_j)\, x_j \left(\tau^2 \ddot{q}_j - \alpha_z\!\left(\beta_z (g - q_j) - \tau \dot{q}_j\right)\right) a^{-1},$$

where x_j is the state of the canonical system corresponding to the jth time stamp. The solution is closed form and easily calculated. Further information on imitation learning of DMPs can be found in Ijspeert's paper [8]. As the reaching trajectories are encoded in task space, the correspondence problem of the arm did not arise.

The DMPs are provably stable [9] and the goal state, as specified by the upper level controller, will always be achieved. Alterations added by the reactive controllers must stay within the bounds of the framework to ensure that this stability is maintained.
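
The closed-form weight estimate can be implemented directly, as in the sketch below (illustrative names and default constants; not the authors' code).

```python
import numpy as np

def fit_dmp_weights(q, qd, qdd, x, centers, widths, g, tau=1.0, a=1.0,
                    alpha_z=25.0, beta_z=6.25):
    """Fit DMP weights from a demonstration via the closed-form expression above.

    q, qd, qdd: demonstrated positions, velocities and accelerations (length T);
    x: the canonical-system values at the sample time stamps. The constants are
    illustrative defaults, not values from the paper."""
    q, qd, qdd, x = map(np.asarray, (q, qd, qdd, x))
    # Regression target: the forcing term implied by the demonstration and Eq. (3).
    f_target = (tau**2 * qdd - alpha_z * (beta_z * (g - q) - tau * qd)) / a
    weights = np.zeros(len(centers))
    for i, (c, h) in enumerate(zip(centers, widths)):
        psi = np.exp(-h * (x - c) ** 2)                 # basis activation over time
        weights[i] = np.sum(psi * x * f_target) / (np.sum(psi * x**2) + 1e-10)
    return weights
```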

3.2. Transformed dynamical motor primitives for grasping

While DMPs generalize to arbitrary goal positions, the grasps' approach direction cannot be arbitrarily defined, and the amplitude of the trajectory is unnecessarily sensitive to changes in the start position y_0 and the goal position g if y_0 ≈ g during training. These limitations can be overcome by including a preprocessor that modifies the DMPs' hyperparameters.

The system can maintain the correct approach direction by using a task-specific coordinate system. Due to the translation invariance of DMPs, only a rotation R ∈ SO(3) between the two coordinate systems needs to be determined. The majority of the reaching motions will lie in a plane defined by the start and goal locations, and the final approach direction. These components of the plane are supplied by the high level controller, with the approach direction defined by the final hand orientation.

The first new in-plane axis x_p is set to be along the approach direction of the grasp; i.e., x_p = −a as shown in Fig. 5. The approach direction is thus easily defined and only requires that the Y_p and Z_p DMPs reach their goal before the X_p primitive. The second axis, y_p, must be orthogonal to x_p and also in the plane, as shown in Fig. 5. It is set to y_p = b^{-1}((g − s) − x_p (g − s)^T x_p), where b^{-1} is a normalization term, and s and g are the motion's 3D start and goal positions respectively. The third axis vector is given by z_p = x_p × y_p. The DMPs can thus be specified by the preprocessor in the Xp–Yp–Zp coordinate system, and mapped to the Xw–Yw–Zw world reference frame by multiplying by R^T = [x_p, y_p, z_p]^T.

Fig. 6. This is a demonstration of the effects of transforming the amplitude variable a of DMPs. The hashed black lines represent boundaries. The dotted black line shows the trained trajectory of the DMP going to 0.05. If the goal is then placed at 0.1 and the workspace is limited to ±0.075 (top boundary), the dashed black line is the standard generalization to a larger goal, while the solid plot uses the new amplitude. If the goal is −0.05, and needs to be reached from above (lower right boundary), then the dashed grey line is the standard generalization to a negative goal, and the solid grey trajectory uses the new amplitude. Both of the new trajectories were generated with η = 0.25.

The change of coordinate system is a fundamental step for the hybrid system. It places the reactive controller, together with all of its modifications, within the reinforcement learner's action–reward feedback loop. Therefore, the system learns pairings of grasp locations and grasp executions that lead to high rewards.

The second problem relates to the scaling of motions with ranges greater than ‖y_0 − g‖, which are required to move around the outside of objects. The standard form is a = g − y_0 [37], which leads to motions that are overly sensitive to changes in g and y_0 if g ≈ y_0 during training. The preprocessor can reduce the sensitivity by using a more robust scaling term, for which we propose the amplitude

$$a = \left\|\eta\,(g - y_0) + (1 - \eta)\,(g_T - y_{0T})\right\|,$$

where g_T and y_{0T} are the goal and start positions of the training data respectively, and η ∈ [0, 1] is a weighting hyperparameter. This amplitude is always between the training amplitude and the standard generalization value a = g − y_0, and η controls how conservative the generalization is to new goals (see Fig. 6). By taking the absolute value of the amplitude, the approach direction is never reversed (see Fig. 6). The amplitude previously proposed by Park et al. [33] corresponds to the special case of η = 0. Example generalizations of a reaching trajectory are shown in Fig. 7.

The described transformations allow a single DMP to perform a larger range of grasps, which implies that fewer DMPs are required in total. Using different DMPs for different sections of the object or workspace should be avoided as it creates unnecessary discontinuities in the rewards, which can slow down the hybrid system's learning process. Only one grasp had to be learned for the entire robot experiment, which was then adapted to the various situations.
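
The preprocessor described in this section amounts to two small computations, sketched below under the stated definitions (illustrative Python, not the authors' code): building the rotation R^T = [x_p, y_p, z_p]^T from the approach direction and the start and goal positions, and evaluating the proposed amplitude.

```python
import numpy as np

def grasp_dmp_frame(approach, start, goal):
    """Build the task-specific DMP frame: x_p along the negated approach direction,
    y_p in the plane of start, goal and approach, z_p completing the frame."""
    x_p = -np.asarray(approach, dtype=float)
    x_p /= np.linalg.norm(x_p)
    d = np.asarray(goal, dtype=float) - np.asarray(start, dtype=float)
    y_p = d - x_p * (d @ x_p)                    # remove the component along x_p
    y_p /= np.linalg.norm(y_p)
    z_p = np.cross(x_p, y_p)
    return np.vstack([x_p, y_p, z_p])            # rows form R^T = [x_p, y_p, z_p]^T

def robust_amplitude(g, y0, g_train, y0_train, eta=0.25):
    """a = ||eta (g - y0) + (1 - eta) (g_T - y0_T)||; eta = 0.25 as in Figs. 6 and 7."""
    return np.linalg.norm(eta * (np.asarray(g) - np.asarray(y0))
                          + (1 - eta) * (np.asarray(g_train) - np.asarray(y0_train)))
```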

3.3. Detractor fields based on ECVDs

Detractor fields refine the motions generated by the DMPs to avoid obstacles during the reaching motion and ensure that the finger tips do not collide with the object during the hand's approach.

The detractor field is based on ECVDs, which represent small line segments of an object's edges localized in 3D, as shown in Fig. 4. The detractive forces of multiple ECVDs describing a single line should not superimpose, nor should the field stop DMPs from reaching their ultimate goals.

Fig. 7. Workspace trajectories where the x and y values are governed by two synchronized DMPs. The semicircle indicates the goal positions, with desired approach directions indicated by the light grey straight lines. The approach direction DMP was trained on an amplitude of one, and η = 0.25.

The system therefore uses a Nadaraya–Watson model [22] of the form

$$u_a = -v(x)\, \frac{\sum_{i=1}^{N} r_i\, c_{ai}}{\sum_{j=1}^{N} r_j},$$

to generate a suitable detractor field, where r_i is a weight assigned to the ith ECVD, v(x) is the strength of the overall field, x is the state of the DMPs' canonical system, c_{ai} is the detracting force for a single descriptor, and the subscript a specifies if the detractor field is for the finger motions or the reaching movements.

The weight of an ECVD for collision avoidance is given by r_i = exp(−(v_i − p)^T h (v_i − p)), where v_i is the position of the ith ECVD in the local coordinate system, h is a vector of positive length scale hyperparameters, and p is the finger tip position, as shown in Fig. 4. The detractor puts more importance on ECVDs in the vicinity of the finger.

The reaching and finger movements react differently to edges and employ different types of basis functions c_i for their respective potential fields. For the fingers, the individual potential fields are logistic sigmoid functions about the edge of each ECVD of the form ρ(1 + exp(d_i σ_c^{-2}))^{-1}, where d_i = ‖(p − v_i) − e_i (p − v_i)^T e_i‖ is the distance from the finger to the edge, ρ ≥ 0 is a scaling parameter, and σ_c ≥ 0 is a length parameter. Differentiating the potential field results in a force of

$$c_{fi} = \rho\left(1 + \exp\!\left(d_i \sigma_c^{-2}\right)\right)^{-2} \exp\!\left(d_i \sigma_c^{-2}\right).$$

As the sigmoid is monotonically increasing, the detractor always forces the fingers open further to move their tips around the ECVDs and ensure that they approach the object from the outside. A similar potential function can be employed to force the hand closed when near ECVDs pertaining to the scene rather than the object.

The reaching motion uses the Gaussian basis functions of the form ϱ exp(−0.5 d_i^T d_i σ_d^{-2}), where d_i = (q − v_i) − e_i (q − v_i)^T e_i is the distance from the end effector position, q, to the edge, and ϱ ≥ 0 and σ_d ≥ 0 are scale and length parameters respectively. Differentiating the potential with respect to d_i gives a force term in the Y direction of

$$c_{hi} = \varrho\, d_{i,Y}\, \sigma_d^{-2} \exp\!\left(-0.5\, d_i^T d_i\, \sigma_d^{-2}\right),$$

which thus applies a radial force from the edge with an exponentially decaying magnitude.

The strength factor v(x) controls the precision of the movements, ensuring that the detractor forces tend to zero at the end of a movement and do not obstruct the DMPs from achieving their goal states. Therefore, the strength of the detractors is coupled to the canonical system of the DMP. Hence, v(x) = (Σ_{j=1}^M ψ_j)^{-1} Σ_{i=1}^M ψ_i w_i x, where x is the value of the canonical system, ψ are its basis functions, and w specify the varying strength of the field during the trajectory.

Modeling the human tendency towards more precise movements during the last 30% of a motion [42], the strength function, v(x), was set to give the highest strengths during the first 70% of the motion for the reaching trajectories, and the last 30% for the finger movements. Setting the strength in this manner is also beneficial to the reinforcement learner. The reward of the learner depends mainly on the final position of the hand, and the closing of the fingers. If these parts of the motion are more repeatable, then it is easier for the upper level controller to learn.

The detractor fields of both the grasping and reaching components have been defined, and are superimposed into the DMP framework as

$$\ddot{y} = \left(\alpha_z\!\left(\beta_z \tau^{-2}(g - y) - \tau^{-1}\dot{y}\right) + a\tau^{-2} f(x)\right) - \tau^{-2} u_a,$$

which represents the entire ECVD and DMP based potential field.

Combining the ECVD based DMPs with the new coordinate system for reaching and motion amplitude, we have fully defined the low level controller. Its main contribution is to learn a grasping movement by imitation and then to reactively adapt these motions to new situations in a manner suited to the task and specified by the upper level controller.
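
As an illustration of the detractor computation for the fingers, the sketch below evaluates the weights r_i, the distances d_i, the sigmoid-derived forces c_fi, and the Nadaraya–Watson combination; all parameter values are placeholders, and the `strength` argument stands in for v(x).

```python
import numpy as np

def finger_detractor(p, ecvd_pos, ecvd_dir, h, rho=1.0, sigma_c=0.02, strength=1.0):
    """Detractor term for the finger motions, following the Nadaraya-Watson form above.

    p: finger tip position; ecvd_pos/ecvd_dir: descriptor positions and unit edge
    directions; h: per-axis length-scale weights. Parameter values are placeholders."""
    v, e = np.asarray(ecvd_pos), np.asarray(ecvd_dir)
    diff = p - v                                             # vectors from ECVDs to tip
    r = np.exp(-np.einsum('ij,j,ij->i', v - p, h, v - p))    # weights r_i
    # Distance from the finger tip to each edge (component orthogonal to the edge).
    d = np.linalg.norm(diff - e * np.einsum('ij,ij->i', diff, e)[:, None], axis=1)
    # Numerically stable form of rho * exp(d/sigma_c^2) / (1 + exp(d/sigma_c^2))^2.
    z = np.exp(-d / sigma_c**2)
    c_f = rho * z / (1.0 + z) ** 2
    return -strength * np.sum(r * c_f) / np.sum(r)           # u_a, scaled by v(x)
```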

4. Evaluations

The following sections evaluate the system both in simulation and on a real robot platform. The first part of the evaluation (Section 4.1) tests the upper level controller against other continuum UCB policies on a simulated benchmark problem. The real-world evaluation, presented in Section 4.2, demonstrates the complete controller working on a real robot grasping novel objects in cluttered environments.

4.1. Comparative UCB analysis

This section focuses on the reinforcement learner and shows that the CGB algorithm (see Fig. 3) performs well in practice, and can be scaled to the more complex domain of grasp learning. The comparison is between four UCB policies, including our proposed method, on a 1D benchmark example of the continuum-armed bandits problem. The policies were tested on the same set of 100 randomly generated 7th order spline reward functions. The rewards were superimposed with uniform noise of width 0.1, but restricted to a range of [0, 1]. The space of bandits was also restricted to a range between 0 and 1. None of the policies were informed of the length of the experiment in advance, and each policy was tuned to achieve high rewards.

4.1.1. Compared methods

The tested competing policies are UCBC [16], CAB1 [17], and Zooming [44]. These algorithms represent standard UCB policy implementations for continuum bandits in the literature. A key issue for any policy that uses discretizations is selecting the number of discrete bandits to use. Employing a coarser structure will lead to faster convergence, but the expected rewards upon convergence are also further from the optimal. Balancing this trade-off is therefore important for a policy's success.

Fig. 8. The expected rewards over 100 experiments are shown for the four compared methods. The results were filtered for clarity. Due to the differences in experiment lengths, the x-axis uses a logarithmic scale. The dashed horizontal line represents the maximum expected reward given the noise.

Table 1. These results pertain to the first 50 grasp attempts in the benchmark problem. The table shows the mean computation times for the different algorithms, and how they would scale to six dimensions, given the computational complexity of the algorithms [17,16,44]. Similarly, the table shows the amount of time needed to initialize the systems by trying each of the initial grasps once.

                              UCBC      CAB1      Zoom      CGB
Mean reward                   0.6419    0.4987    0.6065    0.9122
1D computation time           46 µs     47 µs     27 µs     2.9 s
6D computation time           4.6 s     6.7 ms    5.6 ms    17.6 s
1D initialization run time    10 min    12 min    24 min    4 min
6D initialization run time    1.9 yrs   1.2 days  4.2 days  24 min

The UCBC policy of Auer [16] divides the bandit space into regular intervals and treats each interval as a bandit in a discrete UCB policy. After choosing an interval, a uniform distribution over the region selects the bandit to attempt. The number of intervals sets the coarseness of the system, and was tuned to 10.

Instead of using entire intervals, the CAB1 policy of Kleinberg [17] selects specific grasps at uniform grid points. A discrete UCB policy is then applied to these points, for which we chose UCB1 [20], as suggested in [17]. The discretization trade-off is dealt with by resetting the system at fixed intervals with larger numbers of bandits, thus ensuring that the points become denser as the experiment continues.

structure to discretize the bandits. In contrast to CAB1, the grid isnot uniform and additional bandits can be introduced at any timein high rewarding regions. A discrete policy is then applied to thisset of active bandits. Similar to CAB1, the zooming algorithmworksin time intervals and resets its grid after fixed numbers of trials.Our proposed Continuum Gaussian Bandits (CGB) method was

initialized with 4 equispaced points. Demonstrated data was notused in order to test its performance without the benefits of suchdata. All four methods were initially run for 55 trials, as shown inFig. 8. The CAB1, UCBC, and Zooming methods extended to 1000trials to demonstrate their convergence behavior.

4.1.2. Results

The expected rewards for the four UCB policies during the experiment can be seen in Fig. 8. The computation and run times were also acquired for the experiments for comparison, and estimated for the six-dimensional problem, as shown in Table 1.

Apart from our proposed policy, Zooming was the most successful over the 1000 trials at achieving high rewards, as it adapts its grid to the reward function. However, only CGB consistently determined the high rewarding regions and converged on them. In several trials, the reward function had two distinct peaks with near-optimal rewards, and the CGB policy converged onto both.

The convergence of UCB policies is frequently described by the merit's percentage of exploitation µ(x*)/(µ(x*) + σ(x*)), where x* is the current action selected by the policy. This value is initially zero and increases as the policy returns to previously explored actions with high rewards. The 97.5% exploitation mark was reached by the CGB policy on average at the 33rd trial. Another measure of convergence is found by directly comparing the different maxima found by CGB. The policy converges when the expected value µ(x*) of the selected action is greater than the highest merit value µ(x) + σ(x) of the other candidate actions. This criterion is based on the fact that the merit function µ(x) + σ(x) tends to µ(x) as the exploration of an action is exhausted. Using this criterion, the policy converged on average at the 37th trial.
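
Both convergence measures are easy to evaluate from the GPR model, as in the following illustrative sketch (names and shapes are assumptions, not the paper's code).

```python
import numpy as np

def convergence_checks(mu, sigma):
    """Evaluate the two convergence criteria described above for a CGB-style policy.

    mu, sigma: expected rewards and confidence bounds of the locally optimal grasps
    returned by the parallel search."""
    mu, sigma = np.asarray(mu), np.asarray(sigma)
    merit = mu + sigma
    best = int(np.argmax(merit))                          # action the policy would pick
    exploitation = mu[best] / (mu[best] + sigma[best])    # percentage of exploitation
    others = np.delete(merit, best)
    converged = others.size == 0 or mu[best] >= others.max()
    return exploitation, converged
```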

As parametric policies, the standard methods assume that the optimal solution can be represented by their fixed features and corresponding parameters. These policies can therefore only converge to an optimal solution if it is representable by these features. Both CAB1 and the Zooming algorithm will converge onto the true optimum, but only as the number of samples tends to infinity, as indicated in Fig. 8.

In terms of computation times, the previous methods were faster than the proposed method, although CGB and UCBC exhibit similar orders of magnitude. One reason for CGB being slower is that this implementation performs the parallel search for maxima sequentially. Parallelizing this search would reduce the expected 6D computation time of CGB to 0.65 s.

Most of the system's time is, however, used to perform the actions (i.e., the run times). For this comparison we focused on the time required to initialize the systems by trying each initial grasp once. Not only is the proposed method the fastest in terms of run times (see Table 1), the comparison also shows that implementing the other methods for grasping is not practical due to the curse of dimensionality.
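The six-dimensional run times in Table 1 follow from the number of initial grasps each discretization requires. As a rough check, the snippet below reproduces the UCBC estimate, assuming the grid resolution of 10 per dimension given above and the roughly one minute per grasp attempt implied by the one-dimensional figures; the per-grasp duration is an assumption for illustration.

```python
# Rough scaling check for the UCBC initialization in Table 1 (illustrative only).
intervals_per_dim = 10      # UCBC coarseness, tuned to 10 (see Section 4.1.1)
minutes_per_grasp = 1.0     # assumed: 10 initial grasps took about 10 min in 1D

def init_time_minutes(dims):
    """Initialization requires trying every grid point once."""
    return (intervals_per_dim ** dims) * minutes_per_grasp

print(init_time_minutes(1))                    # 10 min, matching the 1D entry
print(init_time_minutes(6) / (60 * 24 * 365))  # roughly 1.9 years, matching the 6D entry
```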

The UCBC algorithm has both longer computation and run times than CAB1 and the Zooming algorithm. However, as CAB1 and the Zooming algorithm increase the number of active actions throughout the experiment, these would ultimately exhibit computation and run times greater than UCBC.

The memory requirements of the previous methods increase exponentially with the dimensionality, and CGB will only require more memory than UCBC once it has performed a million grasps. The memory requirements of CGB scale with the number of samples, and sufficient memory should be made available depending on the difficulty of the learned task.

In cases where large numbers of samples have been accumulated, suitable implementations of GPR (e.g., sparse GPs [45]) reduce the computational complexity. The loss of accuracy incurred by such implementations is comparable to the accuracy limits inherent to discretization methods, making these methods suitable alternatives to standard GPR.
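The cost that sparse approximations address comes from factorizing the n×n kernel matrix of the accumulated samples. The sketch below shows a standard exact GP posterior in numpy to make that cost explicit; the squared-exponential kernel, noise level, and data are illustrative choices, not those used in the experiments.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.2):
    """Squared-exponential kernel between two sets of 1D inputs."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(X, y, X_star, noise=0.05):
    """Exact GP posterior mean and std. The Cholesky factorization costs O(n^3),
    which sparse pseudo-input approximations [45] typically reduce to O(n m^2)
    for m pseudo-inputs."""
    K = rbf_kernel(X, X) + noise ** 2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                                # O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    K_s = rbf_kernel(X, X_star)
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = rbf_kernel(X_star, X_star) - v.T @ v
    return mean, np.sqrt(np.clip(np.diag(var), 0.0, None))

# Illustrative data: noisy rewards observed at a handful of grasp parameters.
X = np.array([0.1, 0.35, 0.6, 0.85])
y = np.array([0.2, 0.55, 0.9, 0.4])
mean, std = gp_posterior(X, y, np.linspace(0.0, 1.0, 50))
```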

Ultimately, the experiment shows that the proposed method outperforms the other methods in a low-dimensional setting, and is the only practical method for higher dimensions due to the curse of dimensionality.

4.2. Robot grasping task

Having shown that the proposed Gaussian Bandits algorithm is an efficient UCB policy, the robotics evaluation focuses on including the lower level controller for improved actions in a robot grasping scenario. This experiment involves the complete system being implemented on a real robot platform. The following sections detail the running of the experiment (Section 4.2.1) and its results (Section 4.2.2).

In this experiment, we implement only the methods proposed in this paper. The methods described in Section 4.1.1 were not tested on the real system, as their discretizations make them highly impractical.


(A) Preshaping. (B) Grasping. (C) Lifting.

Fig. 9. The three main phases of a basic grasp are demonstrated. (A) Preshaping the hand poses the fingers to match the object's geometry. (B) Grasping closes the three fingers at the same rate to secure the object. (C) The object is lifted and the fingers adjust to the additional weight. The objects at the bottom of A and B are clutter.

(A) Flat. (B) Slanted. (C) Cylindrical Handle. (D) Arched Handle.

(E) Knob. (F) Extreme Point.

Fig. 10. Various preshapes are shown. (A) and (B) show the system adjusting to different plane angles. (C) and (D) demonstrate the preshaping for different types of handles. (E) shows the preshaping for a circular disc structure, such as a door knob, with the fingers placed closely behind the object. (F) shows a case where the object was out of reach of two fingers, but the hand still hooks the object with one finger.

4.2.1. Grasping experiment

The robot is a basic hand–eye system consisting of a 7 degrees of freedom Mitsubishi PA-10 arm, a Barrett hand, and a Videre stereo camera. The robot only uses sensors essential for the task and forgoes additional hardware such as tactile sensors and laser range finders. The robot's task was to learn several good grasps of novel objects through trial and error. All grasps were executed on the real robot and not in simulation.

Each trial begins by estimating the object's position and orientation to convert between world and object reference frames, and to project the ECVD model of the object into the scene representation. The stereo camera allows the object position and orientation to be reliably estimated using the pose estimation method of Detry et al. [41].

The CGB algorithm then determines the parameters of the next grasp, which the reactive lower level controller uses to modify the grasping action. If the robot grasps the object, it attempts to lift the object from the table, thus ensuring that the table is not supporting the object (Fig. 9). Trials are given rewards depending on how little the fingers moved while lifting the object, thereby encouraging more stable grasps. The rewards are not deterministic due to errors in pose estimation and effects caused by the placement of the object.
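The structure of a single trial can be summarized as a short loop: estimate the object pose, let the upper level select grasp parameters, execute the grasp reactively, lift, and reward the attempt based on how little the fingers move during lifting. The sketch below outlines this loop as a structural illustration only; every function name (estimate_object_pose, select_grasp, execute_reactive_grasp, and so on) is a hypothetical placeholder for the corresponding component described in this paper, and the reward shaping is an assumption rather than the exact reward used in the experiments.

```python
import numpy as np

def run_trial(learner, robot, vision):
    """One grasp trial of the hybrid controller (placeholder interfaces only)."""
    # 1. Pose estimation converts between world and object frames and
    #    projects the ECVD model into the scene representation.
    object_pose = vision.estimate_object_pose()

    # 2. The upper level (CGB) proposes the next grasp parameters.
    grasp_params = learner.select_grasp()

    # 3. The lower level reactively executes the grasp and lifts the object.
    finger_start = robot.execute_reactive_grasp(grasp_params, object_pose)
    finger_end = robot.lift_object()

    # 4. Reward: the less the fingers move while lifting, the more stable the grasp.
    finger_motion = np.linalg.norm(finger_end - finger_start)
    reward = np.exp(-10.0 * finger_motion)  # assumed shaping, for illustration only

    # 5. The observed (noisy) reward updates the upper level learner.
    learner.update(grasp_params, reward)
    return reward
```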

The robot's task was made more difficult by adding clutter to the scene. After each grasp attempt, the hand reverses along the same approach direction, but without employing the detractor fields or preshaping of the hand, to determine if collisions would have occurred if the reactive controller had not been used.

The system was run three times on a table tennis paddle to show that it is repeatable. To show that the system can adapt to various scenarios and objects, the experiment was also run twice on both a toy watering can and a wooden box.

The experiments for learning to grasp the paddle consisted of 55 trials each, while only 40 trials were required for each of the watering can and box experiments. Overall, 325 different grasp attempts were executed with the combined active and reactive system.

4.2.2. Results

The active learner and reactive controller were successfully integrated, and the complete system converged onto high rewarding grasp regions in all of the trials. The imitation learning was straightforward, requiring only one demonstration and allowing for continuous smooth motions to be implemented. Examples of the estimated finger goal locations can be seen in Fig. 10. The preshaping adapted to a range of geometries, and consistently placed the fingers close enough to the object for a controlled grasp to be executed. This preshaping gave more control over the object when grasping, leading to higher rewards and allowing for more advanced grasps to be performed (see Fig. 11).

The detractor field and preshaping of the hand allowed the system to work in cluttered environments, which was not a trivial task.


Fig. 11. A controlled grasp, made possible by the hybrid system's preshaping ability. (A) The preshaping matches the geometry of the object. When grasping, the two fingers on the left pinch the paddle. The finger on the right turns the paddle clockwise about the pinched point. (B) The grasping ends when the paddle becomes aligned with all three fingertips.

Fig. 12. The graph shows the expected reward of the attempted grasps over the run of the experiment for the three different objects. All values are averaged over the runs of the experiment, with error bars of +/− two standard deviations. The dashed horizontal line indicates the upper confidence bound of a point at infinity.

The hand came into contact with the clutter in an estimated 8.3% of the grasp attempts, but never with more than a glancing contact. These contacts were usually with visually occluded parts of the objects, and thus not fully modeled by the ECVDs. Accumulating the scene representation from multiple views would solve this problem. During the reversing phases, when the reactive controller is deactivated, the hand collided with one or more pieces of clutter during 85.4% of the attempts. Thus, the reactive control decreases the number of contacts with the clutter by a factor of ten. The fingers always opened sufficiently to accept the object without colliding with it.

The rewards during the experiment are shown in Fig. 12. In all of the experiments, the proposed hybrid system found suitable grasps for the object. The watering can and box experiments converged faster than the paddle experiments, due to their initial search region being smaller. While all experiments acquired low rewards for the initial grasps, the soft boundaries allowed the system to explore beyond these regions and find neighboring regions of better grasps.

Among the most important results of this experiment is that the central loop of the hybrid controller works in practice. The system did not just quickly learn a graspable location on an object; rather, the hybrid system quickly learned an entire fluid motion for grasping the object, including preshaping. The system took a single demonstrated action and learned modifications that generalized the action to three different objects. The learning process was significantly hastened by the hybrid approach, as the reactive controller allowed the dimensionality of the reinforcement learner to be kept relatively low, while simultaneously performing complicated grasping motions.

The upper and lower levels divide the grasping problem into two sub-problems: determining where to grasp an object and deciding how to correctly execute the grasp. By incorporating the reactive controller in the learning loop, the hybrid system learned an action that solves both of these sub-problems.

5. Conclusion

We have presented a hierarchical hybrid controller that can efficiently determine good grasps of objects and execute them. The upper level controller is based on reinforcement learning to allow the robot to learn from its own experiences, but it is also capable of incorporating supervised data from other sources if available. Grasp execution is handled by a lower level controller based on imitation learning and reactive control. This hybrid structure allowed the system to learn both good grasp locations and corresponding grasp executions simultaneously, while keeping the dimensionality of the learning problem low.

We have shown that the presented algorithms and learning architectures work well both in simulation and on a real robot. In simulation, the active learner outperformed several standard UCB policies designed for the continuum-armed bandit problem. The entire system was successfully implemented on a real robot platform, which consistently found highly rewarding grasps for various objects.

Acknowledgement

The project receives funding from the European Community's Seventh Framework Programme under grant agreement no. ICT-248273 GeRT.

References

[1] A. Bicchi, V. Kumar, Robotic grasping and contact: a review, in: Proceedings of IEEE International Conference on Robotics and Automation, vol. 1, IEEE, 2002, pp. 348–353.


[2] M.T. Mason, J.K. Salisbury, Robot Hands and the Mechanics of Manipulation, MIT Press, Cambridge, MA, 1985.

[3] A. Saxena, J. Driemeyer, J. Kearns, C. Osondu, A. Ng, Learning to Grasp Novel Objects Using Vision, in: Springer Tracts in Advanced Robotics, vol. 39, Springer, Berlin, Heidelberg, 2008, pp. 33–42 (Chapter 4).

[4] M. Salganicoff, L.H. Ungar, R. Bajcsy, Active learning for vision-based robot grasping, Machine Learning 23 (2–3) (1996) 251–278.

[5] A. Morales, E. Chinellato, A.H. Fagg, A.P. Pobil, An active learning approach for assessing robot grasp reliability, in: Proc. of International Conference on Intelligent Robots and Systems, IEEE, 2004, pp. 485–490.

[6] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning), MIT Press, 1998.

[7] R.E. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press, Princeton, New Jersey, USA, 1961.

[8] J.A. Ijspeert, J. Nakanishi, S. Schaal, Movement imitation with nonlinear dynamical systems in humanoid robots, in: Proc. of International Conference on Robotics and Automation, IEEE, 2002, pp. 1398–1403.

[9] S. Schaal, J. Peters, J. Nakanishi, A. Ijspeert, Learning movement primitives, in: Proc. of International Symposium on Robotics Research, Springer, 2004.

[10] E. Oztop, N.S. Bradley, M.A. Arbib, Infant grasp learning: a computational model, Experimental Brain Research (2004) 480–503.

[11] E. Oztop, M. Kawato, Models for the control of grasping, in: Sensorimotor Control of Grasping: Physiology and Pathophysiology, Cambridge University Press, 2009.

[12] D.A. Rosenbaum, Human Motor Control, Academic Press, San Diego, 1991.

[13] A. Morales, T. Asfour, P. Azad, S. Knoop, R. Dillmann, Integrated grasp planning and visual object localization for a humanoid robot with five-fingered hands, in: Proc. of International Conference on Intelligent Robots and Systems, IEEE, 2006, pp. 5663–5668.

[14] D. Kragic, A.T. Miller, P.K. Allen, Real-time tracking meets online grasp planning, in: Proc. of International Conference on Robotics and Automation, IEEE, 2001, pp. 2460–2465.

[15] R. Agrawal, The continuum-armed bandit problem, SIAM Journal on Control and Optimization 33 (6) (1995) 1926–1951.

[16] P. Auer, R. Ortner, C. Szepesvári, Improved rates for the stochastic continuum-armed bandit problem, in: Proc. of the Conference on Learning Theory, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 454–468.

[17] R. Kleinberg, Nearly tight bounds for the continuum-armed bandit problem, in: Advances in Neural Information Processing Systems, vol. 17, MIT Press, 2005, pp. 697–704.

[18] C.E. Rasmussen, C.K.I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2005.

[19] D. Comaniciu, P. Meer, Mean shift: a robust approach toward feature space analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (5) (2002) 603–619.

[20] P. Auer, N. Cesa-Bianchi, P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning 47 (2–3) (2002) 235–256.

[21] O. Teytaud, S. Gelly, M. Sebag, Anytime many-armed bandits, in: Proc. of Conférence sur l'Apprentissage Automatique, Grenoble, France, 2007.

[22] C.M. Bishop, Pattern Recognition and Machine Learning, 1st ed., Springer, 2007.

[23] N. Srinivas, A. Krause, S.M. Kakade, M. Seeger, Gaussian process bandits without regret: an experimental design approach, Computing Research Repository abs/0912.3995, 2009.

[24] R. Martinez-Cantin, Active map learning for robots: insights into statistical consistency, Ph.D. Thesis, University of Zaragoza, 2008.

[25] Z. Xue, A. Kasper, J.M. Zoellner, R. Dillmann, An automatic grasp planning system for service robots, in: Proc. of International Conference on Advanced Robotics, 2009.

[26] D. Bertram, J. Kuffner, R. Dillmann, T. Asfour, An integrated approach to inverse kinematics and path planning for redundant manipulators, in: Proc. of International Conference on Robotics and Automation, IEEE, 2006, pp. 1874–1879.

[27] M.W. Spong, S. Hutchinson, M. Vidyasagar, Robot Modeling and Control, John Wiley and Sons, 2005.

[28] M. Khatib, Sensor-based motion control for mobile robots, Ph.D. Thesis, LAAS–CNRS, Toulouse, France, 1996.

[29] K. Sabe, M. Fukuchi, J.-S. Gutmann, T. Ohashi, K. Kawamoto, T. Yoshigahara, Obstacle avoidance and path planning for humanoid robots using stereo vision, in: Proc. of International Conference on Robotics and Automation, IEEE, 2004, pp. 592–597.

[30] S. Lenser, M. Veloso, Visual sonar: fast obstacle avoidance using monocular vision, in: Proc. of International Conference on Intelligent Robots and Systems, 2003.

[31] J. Tegin, S. Ekvall, D. Kragic, J. Wikander, B. Iliev, Demonstration based learning and control for automatic grasping, Intelligent Service Robotics (2008) 23–30.

[32] A.T. Miller, S. Knoop, H.I. Christensen, P.K. Allen, Automatic grasp planning using shape primitives, in: Proc. of International Conference on Robotics and Automation, IEEE, 2003, pp. 1824–1829.

[33] D.-H. Park, H. Hoffmann, P. Pastor, S. Schaal, Movement reproduction and obstacle avoidance with dynamic movement primitives and potential fields, in: Proc. of International Conference on Humanoid Robots, IEEE, 2008.

[34] F. Bley, V. Schmirgel, K.-F. Kraiss, Mobile manipulation based on generic object knowledge, in: Proc. of Symposium on Robot and Human Interactive Communication, IEEE, 2006, pp. 411–416.

[35] K. Hsiao, P. Nangeroni, M. Huber, A. Saxena, A.Y. Ng, Reactive grasping using optical proximity sensors, in: Proc. of International Conference on Robotics and Automation, IEEE Press, Piscataway, NJ, USA, 2009, pp. 4230–4237.

[36] J. Steffen, R. Haschke, H. Ritter, Experience-based and tactile-driven dynamic grasp control, in: Proc. of International Conference on Intelligent Robots and Systems, IEEE, 2007, pp. 2938–2943.

[37] A. Ijspeert, J. Nakanishi, S. Schaal, Learning attractor landscapes for learning motor primitives, in: Advances in Neural Information Processing Systems, vol. 15, MIT Press, Cambridge, MA, 2003, pp. 1547–1554.

[38] N. Pugeault, Early Cognitive Vision: Feedback Mechanisms for the Disambiguation of Early Visual Representation, VDM Verlag Dr. Mueller, 2008.

[39] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press, New York, NY, USA, 2003.

[40] N. Krueger, M. Lappe, F. Woergoetter, Biologically motivated multi-modal processing of visual primitives, The Interdisciplinary Journal of Artificial Intelligence and the Simulation of Behaviour 1 (5) (2004) 417–427.

[41] R. Detry, N. Pugeault, J. Piater, Probabilistic pose recovery using learned hierarchical object models, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 107–120.

[42] M. Jeannerod, The study of hand movements during grasping: a historical perspective, in: Sensorimotor Control of Grasping: Physiology and Pathophysiology, Cambridge University Press, Cambridge, UK, 2009, pp. 127–140.

[43] M. Jeannerod, Grasping objects: the hand as a pattern recognition device, in: Perspectives of Motor Behaviour and its Neural Basis, S. Karger AG, 1997, pp. 19–32.

[44] R. Kleinberg, A. Slivkins, E. Upfal, Multi-armed bandits in metric spaces, in: R.E. Ladner, C. Dwork (Eds.), Proc. of Symposium on Theory of Computing, ACM, 2008, pp. 681–690.

[45] E. Snelson, Z. Ghahramani, Sparse Gaussian processes using pseudo-inputs, in: Proc. of Conference on Neural Information Processing Systems, 2005.

O.B. Kroemer is a Ph.D. student at the Robot Learning Lab at the Max Planck Institute for Biological Cybernetics (MPI). He graduated from the Cambridge University Engineering Department and holds an M.Eng. in Instrumentation and Control Engineering. His research interests include robotics, reinforcement learning, and machine learning.

R. Detry earned his engineering degree from the University of Liége in 2006. He is currently a Ph.D. student at the University of Liége, working under the supervision of Prof. Justus Piater on multimodal object representations for autonomous, sensorimotor interaction. He has contributed probabilistic representations and methods for multi-sensory learning that allow a robotic agent to infer useful behaviors from perceptual observations.

J. Piater received the Ph.D. degree from the University of Massachusetts, Amherst, in 2001, where he held a Fulbright graduate student fellowship. He subsequently spent two years on a European Marie Curie individual fellowship at INRIA Rhône-Alpes, France, before joining the University of Liége, Belgium, where he is a professor of computer science and directs the Computer Vision Research Group. His research interests include computer vision and machine learning, with a focus on visual learning, closed-loop interaction of sensorimotor systems, and video analysis. He has published more than 80 technical papers in scientific journals and at international conferences. He is a member of the IEEE Computer Society.

J. Peters heads the Robot Learning Lab at the Max Planck Institute for Biological Cybernetics (MPI) while being an invited researcher at the Computational Learning and Motor Control Lab at the University of Southern California (USC). Before joining MPI, he graduated from the University of Southern California with a Ph.D. in Computer Science in March of 2007. Jan Peters studied Electrical Engineering, Computer Science, and Mechanical Engineering. He holds two German M.S. degrees in Informatics and in Electrical Engineering (from Hagen University and Munich University of Technology) and two M.S. degrees in Computer Science and Mechanical Engineering from USC. During his graduate studies, Jan Peters has been a visiting researcher at the Department of Robotics at the German Aerospace Research Center (DLR) in Oberpfaffenhofen, Germany, at Siemens Advanced Engineering (SAE) in Singapore, at the National University of Singapore (NUS), and at the Department of Humanoid Robotics and Computational Neuroscience at the Advanced Telecommunication Research (ATR) center in Kyoto, Japan. His research interests include robotics, nonlinear control, machine learning, reinforcement learning, and motor skill learning.