
Robot Skills: Imitation and Exploration Learning
Dynamic Movement Primitives and Reinforcement Learning for a Ping Pong playing Robot

Carlos Cardoso

Abstract— Table tennis is a task with a long history as a robotics challenge. In table tennis the existence of complex interactions between the physics of the ball and of the robot makes it hard to create an analytical model that can match the skill of a human player. In addition, handcrafted models can be very sensitive to the parametrization of the environment, requiring manual tuning for any alterations. By using a machine learning approach the robot can adapt autonomously, learning the task and model simultaneously.

This dissertation can be divided into two parts. First we discuss regression with GPs and follow with Cost Regularized Kernel Regression (CrKR), a state-of-the-art Reinforcement Learning (RL) algorithm. We explore strategies for online sparsification of the collected samples to overcome the regression's O(n³) computational complexity, allowing the robot to learn beyond a fixed budget of samples. In the second part we address Imitation Learning (IL). Expanding from the state-of-the-art Dynamic Movement Primitives (DMP) framework we developed Quadratic Program Motion Primitives (QPMP), our novel method to imitate trajectories. QPMP takes advantage of the existence of extremely fast solvers for finding the minimum of convex functions and adds increased flexibility to the existing methods with the definition of custom optimization constraints on the generated trajectory.

We present our results in a table tennis simulation and on a robotic platform. By combining demonstrations by a human expert and autonomous learning, we achieve a robot that can easily adapt to its environment and task.

I. INTRODUCTION AND RELATED WORK

The perception of robots as humanity's future helpers, capable of performing tasks with the same dexterity as a human, has been driving the research of robots with increasing mechanical complexity. Today robots take the place of humans in the most tedious and dangerous jobs. They can already be seen vacuuming people's homes and searching for life on Mars. However, these are still relatively simple robots, with a high degree of specialization for a single task. More general robots with many degrees of freedom and a large number of sensors have become common in research (e.g. iCub, PR2) and are also showing signs of increasing industrial use (e.g. Baxter). However, this increase in mechanical sophistication has not been matched by control methods: many tasks that are relatively simple for humans, such as driving a car or repairing a broken object, today remain an immense challenge for robots. The increased complexity of the robots leads to an increase in the required human effort to model the robot and its environment for each new task.

An appealing alternative to extensive hand modeling is to learn the task and the model together by demonstration or by trial and error. One practical way to allow robots to learn motor skills is through the "programming by demonstration" paradigm [1].

Fig. 1: A screenshot of a table tennis swing in our simulation environment.

The main idea is that the motor skill is demonstrated to the robot (typically through kinesthetic teaching) and a representation of the movement is learned from the recorded data. One key issue is therefore to find a proper representation for such movement primitives. One of the most promising approaches draws from the theory of dynamical systems, giving rise to solutions such as the Stable Estimator of Dynamical Systems (SEDS), sequenced Linear Dynamical Systems, Implicit Dynamical Systems [2] and, most notably, Dynamic Movement Primitives (DMPs [3]). This latter formulation in particular has proven to be a very effective tool for imitation learning, has therefore been widely used in robotics, and has inspired many extensions to add velocity goals [4] and to allow uncertainty and way-points in the execution [5]. However, the DMP approach still has a few drawbacks. One is that, despite allowing quick adaptation to new start/end positions, it may generate undesired motion profiles (especially in terms of accelerations) that are not under full control. Another one is that it lacks the possibility to impose constraints on the motion, such as joint position and velocity limits.

In section II we introduce Gaussian Process Regression (GPR) and CrKR to learn the mapping between a system state and a desirable robot state according to a cost function. This mapping is found through autonomous exploration by the robot. We also propose an online sparsification based on Adaptive Non-Maximal Suppression (ANMS) filtering.

We follow with trajectory imitation from expert demonstrations in section III. These trajectories will have a shape matching those of the demonstrations used for initializing CrKR. We describe the DMP framework and introduce our new solution, QPMP, based on the minimization of a convex function. In section VI we present experiments demonstrating the properties of our methods in a simulation environment and on our robotic platform.


II. REINFORCEMENT LEARNING OF MOTION PARAMETERS

In this section we introduce GPs, a simple class of models of functions suitable for probabilistic inference in both regression and classification problems. A Reinforcement Learning method related to Gaussian Processes, CrKR, is presented in conjunction with a novel strategy to continue learning beyond the point where the cubic time complexity in the number of samples becomes a limitation.

A. Gaussian Process Regression

A GP is by definition any distribution over functions such that any finite set of function values f(x_1), f(x_2), ..., f(x_N) has a joint Gaussian distribution [6].

The GP before training is specified by its mean function

E[f(x)] = \mu(x)    (1)

and by its covariance function, or kernel:

Cov[f(x), f(x')] = k(x, x').    (2)

In practice it is common to assume that the mean function is zero everywhere. The choice of the covariance function is of greater importance, since it implicitly encodes assumptions about the underlying function to be modeled.

The learning problem for GPR can be seen as, given n training points (x_i, y_i), i = 1, ..., n, finding the latent function f(x_i) that transforms the input x_i into a prediction y_i = f(x_i) + ε_i, where ε_i is Gaussian noise with zero mean and variance σ²(x_i), i.e. ε ∼ N(0, σ²(x_i)).

The prediction of a GPR is found by calculating the mean and the variance at a new evaluation point x_*:

\bar{f}(x_*) = k_*^T (K + \sigma_n^2 I)^{-1} y
\sigma_f^2(x_*) = k(x_*, x_*) - k_*^T (K + \sigma_n^2 I)^{-1} k_*    (3)

where K is the matrix in which each entry corresponds to the covariance kernel evaluated on the input training data, K_{i,j} = k(x_i, x_j), k_* is the vector in which each entry is the covariance function evaluated between the new point x_* and the training data, σ_n² is the variance of the input data, and y is the vector of the training outputs.

The kernel defines how the model generalizes to new data. In this way it provides a prior and specifies the kind of structure that can be captured by the GP. The specification of a kernel that represents the particular structure of the problem being modeled is one of the main challenges in applying GPs for model learning. In practice, however, the SE kernel has become the de facto default kernel when the structure of the function is unknown or hard to specify. It is defined by only two parameters that are easy to interpret. The lengthscale or bandwidth of the kernel, l, determines the smoothness of the function and how far away from the training data the GP can extrapolate. The output variance σ² determines the average distance of the function away from its mean; in practice it acts as a scale factor.

k_{SE}(x, x') = \sigma^2 \exp\left(-\frac{(x - x')^2}{2 l^2}\right)    (4)

For multidimensional problems we can multiply many SE kernels, each with a bandwidth adjusted to the corresponding dimension. This multiplication of kernels with different bandwidths makes the so-called SE-ARD kernel in eq. 5.

k_{SE-ARD}(x, x') = \sigma_f^2 \exp\left(-\frac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{l_d^2}\right)    (5)

The high flexibility of the SE-ARD kernel means that more data may be needed to learn, the so-called "curse of dimensionality" [7]. In principle a kernel with more prior structure is preferable, as it will learn faster, the "blessing of abstraction" [8].
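As a concrete illustration of the predictive equations (3) and the SE-ARD kernel (5), the following is a minimal NumPy sketch, not the implementation used in this work; the Cholesky-based solve and the toy 1-D data are assumptions made here for clarity.

import numpy as np

def se_ard_kernel(X1, X2, lengthscales, sigma_f):
    # Squared-exponential kernel with one lengthscale per input dimension (eq. 5).
    A = X1 / lengthscales
    B = X2 / lengthscales
    sq_dist = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return sigma_f**2 * np.exp(-0.5 * sq_dist)

def gpr_predict(X_train, y_train, X_test, lengthscales, sigma_f, sigma_n):
    # Posterior mean and variance of the GP at the test points (eq. 3).
    K = se_ard_kernel(X_train, X_train, lengthscales, sigma_f)
    K_star = se_ard_kernel(X_train, X_test, lengthscales, sigma_f)
    # Cholesky factorisation is a numerically stable way to apply (K + sigma_n^2 I)^-1.
    L = np.linalg.cholesky(K + sigma_n**2 * np.eye(len(X_train)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    var = se_ard_kernel(X_test, X_test, lengthscales, sigma_f).diagonal() - np.sum(v**2, 0)
    return mean, var

# Toy usage: a noisy 1-D sine.
X = np.linspace(0.0, 5.0, 20)[:, None]
y = np.sin(X).ravel() + 0.05 * np.random.randn(20)
mu, var = gpr_predict(X, y, np.array([[2.5]]), np.array([1.0]), 1.0, 0.05)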

B. Marginal Likelihood Maximization.

A very important feature of GPs is the ability to compute the marginal likelihood of a test data set given the model, also called the evidence. Models can be compared using the marginal likelihood, which allows automatically tuning the parameters of a model and its fit to the data.

p(y|x, \theta) = (2\pi)^{-n/2} |K|^{-1/2} \exp\left(-\tfrac{1}{2} y^T K^{-1} y\right)    (6)

We can write it in log form and note that it is differentiable:

\log p(y|x, \theta) = -\tfrac{1}{2} y^T K^{-1} y - \tfrac{1}{2} \log|K| - \tfrac{n}{2} \log 2\pi

\frac{\partial}{\partial \theta_j} \log p(y|x, \theta) = \tfrac{1}{2} y^T K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} y - \tfrac{1}{2} \mathrm{tr}\left(K^{-1} \frac{\partial K}{\partial \theta_j}\right)
= \tfrac{1}{2} \mathrm{tr}\left((\alpha \alpha^T - K^{-1}) \frac{\partial K}{\partial \theta_j}\right), \quad \text{where } \alpha = K^{-1} y    (7)

The open parameters of GPR are the hyper-parameters of the kernel; in the particular case of the SE kernel these are σ and l. The usual practice is to maximize the log marginal likelihood using common optimization tools such as quasi-Newton methods, or to take advantage of its differentiability (eq. (7)) and use gradient methods [9].
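A minimal sketch of how eqs. (6) and (7) can be evaluated is given below, assuming the noise-inclusive kernel matrix K_y and its derivative with respect to one hyper-parameter are already computed; any quasi-Newton or gradient-based optimizer could then be run on these two quantities.

import numpy as np

def log_marginal_likelihood(K_y, y):
    # Log form of eq. 6, using a Cholesky factor for the determinant and the solve.
    L = np.linalg.cholesky(K_y)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))  # log|K_y|
    n = len(y)
    return -0.5 * y @ alpha - 0.5 * log_det - 0.5 * n * np.log(2.0 * np.pi)

def lml_gradient(K_y, dK_dtheta, y):
    # Gradient w.r.t. one hyper-parameter, eq. 7: 1/2 tr((alpha alpha^T - K^-1) dK/dtheta).
    K_inv = np.linalg.inv(K_y)
    alpha = K_inv @ y
    return 0.5 * np.trace((np.outer(alpha, alpha) - K_inv) @ dK_dtheta)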

C. Cost-regularized Kernel Regression

Gaussian processes assume that the training points are sampled from the underlying process with Gaussian noise. However, some tasks are interactive in nature and must be actively explored by the robot system. CrKR is a kernelized version of the reward-weighted regression [10] that is suitable for RL [10] of complex tasks with continuous input and output spaces. A common application of CrKR is to learn a small set of continuous meta-parameters that generalize a motor primitive [10], [4].

CrKR has been successfully applied to many robot learning tasks such as dart throwing, ball throwing and table tennis on many different robots, outperforming techniques like finite difference gradients and reward-weighted regression [11].

The inference is performed in a way that is similar to GPR: in eq. 3, the term σ²I is replaced by λC to make a cost-weighted regression.


Algorithm 1 Cost-regularized Kernel Regression algorithm

Preparation steps:
  Determine the initial state s_0, output γ_0 and cost c_0 of the demonstration.
  Initialize the corresponding matrices S, Γ, C.
  Choose a kernel k, initialize K.
  Set the exploration/exploitation trade-off λ.
for all iterations j do
  Determine the state s_j specifying the situation.
  Calculate the meta-parameters γ_j by:
    Determine the mean of each meta-parameter i:
      γ̄_i(s_j) = k(s_j)^T (K + λC)^{-1} Γ_i
    Determine the variance:
      σ²(s_j) = k(s_j, s_j) − k(s_j, S)^T (K + λC)^{-1} k(s_j, S)
    Draw the meta-parameters from a Gaussian distribution:
      γ_j ∼ N(γ̄(s_j), σ²(s_j))
  Execute the motor primitive using the new meta-parameters.
  Calculate the cost c_j at the end of the episode.
  Update the matrices S, Γ, C with the new sample.
end for

Fig. 2 (y vs. x; curves: original function, CrKR mean, CrKR 95% confidence interval, samples and sample cost): CrKR after learning from an initial demonstration and performing five exploration iterations. The pink region shows the 95% confidence interval for the CrKR exploration; the error bars show the cost of each tryout.

The open parameter λ expresses an exploration/exploitation trade-off and also a scaling of the cost function to what can be seen as analogous to the variance of a GP. This allows expressions similar to GPR (eq. 3) to be used in an interactive task. Given a state of the system s_j, the policy for our action is the predicted mean γ̄_j(s_j) combined with the predicted variance σ²(s_j) to add exploration.
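The sketch below illustrates one CrKR prediction step of Algorithm 1. It is illustrative only: the kernel is passed in as a callable over two sets of points (for instance the SE-ARD sketch above with its hyper-parameters bound), and the stored states, meta-parameters and costs are assumed to be kept in plain arrays.

import numpy as np

def crkr_predict(kernel, S, Gamma, C, lam, s_query, rng=None):
    # S: (n, d) visited states, Gamma: (n, m) meta-parameters, C: (n, n) diagonal cost matrix.
    rng = rng or np.random.default_rng()
    K = kernel(S, S)                             # Gram matrix of the stored states
    k_s = kernel(S, s_query[None, :]).ravel()    # covariances to the query state
    W = np.linalg.inv(K + lam * C)               # cost-regularised weighting, replaces (K + sigma_n^2 I)^-1
    mean = Gamma.T @ W @ k_s                     # one predicted mean per meta-parameter
    var = kernel(s_query[None, :], s_query[None, :])[0, 0] - k_s @ W @ k_s
    # Exploration: draw the meta-parameters around the predicted mean.
    return rng.normal(mean, np.sqrt(max(var, 0.0)))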

D. Selection of the Cost Function

Great care must be taken in the choice of a cost function appropriate to the task. The cost function is employed as a predictive variance used to guide the exploration. Cost functions that represent some notion of closeness to the target tend to perform better than binary success/failure functions in robotic tasks. Some desirable properties of the learned policy, like low accelerations, may be discovered by the autonomous agent without needing to be explicitly included in the cost function. For easier reasoning we decided to use the saturating cost function in eq. 8.

c(x) = 1 - \exp\left(-\frac{1}{2 a^2} \left\| x - x_{target} \right\|^2\right)    (8)

This cost function is locally quadratic but saturates for large deviations from the desired target. In [12] it was shown that this saturating cost mimics well some properties of human reasoning.
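For reference, eq. 8 reduces to a one-line function; the sketch assumes x and the target are NumPy vectors and a is the width parameter.

import numpy as np

def saturating_cost(x, x_target, a):
    # Locally quadratic near the target, saturating towards 1 for large deviations (eq. 8).
    return 1.0 - np.exp(-0.5 / a**2 * np.sum((x - x_target)**2))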

E. Cost-weighted Sparsification

As in GPR, the prediction step involves the inversion of a square matrix (K + λC)^{-1} of order n, where n is the number of samples. The time complexity of most matrix inversion algorithms is O(n³) and the space for storing the matrix is O(n²). The high computational complexity becomes especially relevant when dealing with online learning and real-time applications. Nguyen-Tuong et al. [13] proposed an online sparsification scheme with a fixed budget of samples. For the subsequent samples, an independence measure is used to choose whether or not to replace the least informative data point. This solution also allows a time decay to adapt to the changing dynamics of the system. This online sparsification method, although designed with GPR in mind, is straightforward to apply to CrKR. The time decay concept can be replaced by a cost decay where high cost samples are rejected.

The independence measure is stated as:

\delta = k(d_{m+1}, d_{m+1}) - k^T a, \quad a = K^{-1} k

This independence measure is computationally very expensive to recalculate for each new sample if we keep a high number of samples in our budget. The solution found in [13] uses incremental updates to reduce the complexity of checking which sample to replace in the dictionary.
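A direct, non-incremental sketch of this test is shown below; it solves the full linear system for every candidate, which is exactly the cost that the incremental updates of [13] are designed to avoid.

import numpy as np

def independence_measure(kernel, D, d_new):
    # D: (m, d) dictionary points, d_new: (d,) candidate point.
    K = kernel(D, D)
    k_vec = kernel(D, d_new[None, :]).ravel()
    a = np.linalg.solve(K, k_vec)                      # a = K^-1 k
    delta = kernel(d_new[None, :], d_new[None, :])[0, 0] - k_vec @ a
    return delta  # small delta: the candidate is nearly dependent on the dictionary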

F. ANMS Filter

A common approach to overcome the cubic complexity of regression algorithms is to find a representative subset of the data points. Since in our case the points have associated weights, we can formulate the problem as finding the sparsest set of N low cost points. This is a problem similar to the selection of keypoints in images for computer vision applications. We experimented with ANMS [14], [15], a filter commonly used to obtain a good spatial feature distribution. ANMS works by sorting the entire dataset according to the weights. Then, starting from the lowest cost data point, all data points are added in sequence by order of cost. For each point, the distance to the closest point already evaluated is saved and added to a list. The N points with the highest distances in this list are selected for the regression. We provide a pseudo-code implementation of the algorithm in Algorithm 2.

Algorithm 2 Naive Implementation of ANMS Filtering

Preparation steps:
  Sort P by cost; let p_1, p_2, ... denote the points in that order.
  result = [(∞, p_1)]   // (radius, keypoint)
for all p_i ∈ P do
  r_i = distance to the nearest point already in result
  result.push_back((r_i, p_i))
end for
Sort result by radius, return the first k entries in result.

Compared to the sparsification methods previously discussed, this method allows retaining a larger number of training samples and using a subset for the regression. This avoids discarding samples that were not relevant when collected but can become necessary due to the collection of good samples in different locations of the input space. The filter can also work with any distance measure, increasing the flexibility in the choice of the kernel used for regression. The naive implementation shown in Algorithm 2 has a quadratic run time complexity. Sub-quadratic algorithms that make use of approximate nearest neighbor search exist [14].
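A runnable version of the naive filter of Algorithm 2 could look as follows; Euclidean distance is assumed here, but as noted above any distance measure can be substituted.

import numpy as np

def anms_filter(points, costs, k):
    # points: (n, d) samples, costs: (n,) sample costs; returns the indices of the k kept points.
    order = np.argsort(costs)                  # cheapest samples first
    radii = np.empty(len(points))
    radii[order[0]] = np.inf                   # the cheapest point always survives
    for rank, i in enumerate(order[1:], start=1):
        accepted = points[order[:rank]]        # all cheaper points processed so far
        radii[i] = np.min(np.linalg.norm(accepted - points[i], axis=1))
    # Keep the k points whose nearest cheaper neighbour is farthest away.
    return np.argsort(-radii)[:k]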

III. TRAJECTORY IMITATION AND GENERALIZATION

Learning from Demonstration (LfD) or Programming by Demonstration (PbD), also called IL or Apprenticeship Learning, forgoes decomposing and programming the desired behaviours explicitly into the robot system. By instead relying on demonstrations made by a human expert we can achieve a more expedient deployment of a robot to a new task, also allowing the programming of robots by people without a traditional programming background. In the context of playing table tennis we want to find the motor trajectory that strikes the ball with the highest probability of success. An appealing alternative to trying to define the optimality criteria for such a trajectory is to simply imitate a demonstration by a human expert. In this section we describe DMP, a state-of-the-art framework for IL of motor trajectories, and introduce our novel solution to this problem based on constrained quadratic optimization. The methods proposed in this section were published in [16].

IV. IMITATION WITH DMPS

As previously introduced, DMP is a widely adopted formalism to represent motor actions in a way that allows flexible adjustment without custom parameter tuning. After a primitive is learned from a demonstration (or a series of demonstrations) it can then be generalized to different initial and goal states while maintaining its overall shape. The original DMPs have two formulations, one for discrete movements and another for periodic motion. Here we focus on the discrete formulation, which is needed to represent point-to-point tasks.

A. Original Formulation

The DMP system controls the motion of a scalar variable y through a point attractor g that makes the trajectory converge asymptotically to the goal and a nonlinear forcing term f that encodes the characteristics of the demonstration. The two terms are coupled by a canonical system z that acts as a replacement of time.

\tau^2 \ddot{y} = \alpha_y (\beta_y (g - y) - v) + f(z, g)
\tau \dot{y} = v, \qquad \tau \dot{z} = -\alpha_z z    (9)

The gains αy, βy are chosen to make the 2nd order system critically damped, i.e. βy = αy/4. The temporal scaling term τ allows the primitive to execute the movement faster or slower while preserving its shape.

Let us consider a demonstrated trajectory d(t). The objective of learning a DMP is to compute an approximation of f such that the observed profile of the trajectory is as close as possible to the demonstration. Rewriting the first equation of (9) and replacing the motion variable y by the demonstration d we have:

f_{target}(z, g) = \tau^2 \ddot{d} - \alpha_y (\beta_y (g - d) - \tau \dot{d})    (10)

This term can be represented by a normalized linear combination of Gaussian basis functions [17], as follows:

f_{target}(z, g) = \frac{\sum_{i=1}^{N} \psi_i(z) w_i}{\sum_{i=1}^{N} \psi_i(z)} \, \xi(z, g)
\psi_i(z) = \exp\left(-\frac{1}{2 \sigma_i^2} (z - c_i)^2\right)
\xi(z, g) = z \,(g - y(0))    (11)

The Gaussian basis functions ψi have centers ci placed along the exponential of the canonical system so that they are equally spaced in time. The scaling term ξ(z, g) makes the accelerations converge to zero near the goal and normalizes their values according to the amplitude of the movement.

The weights wi can be learned from the samples of the original demonstration at sample times z_p, p ∈ {0, ..., P}, using for example Locally Weighted Regression [3]:

w_i = \frac{s^T \Gamma_i f_{target}}{s^T \Gamma_i s}    (12)

where

s = [\xi(z_0) \dots \xi(z_P)]^T
f_{target} = [f_{target}(z_0, g) \dots f_{target}(z_P, g)]^T
\Gamma_i = \mathrm{diag}(\psi_i(z_0), \dots, \psi_i(z_P))    (13)

The canonical system has αz > 0 so that it converges asymptotically to zero. This ensures that, if the weights are bounded, the forcing function's interference eventually vanishes as the movement ends, leaving the point attractor free to converge to the goal. The initial value of z is usually set to 1, but any positive value can be chosen.
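To make the roles of the attractor, the forcing term and the canonical system concrete, the sketch below Euler-integrates a one-dimensional discrete DMP (eqs. 9 and 11). The gain values, basis placement and step count are simplified assumptions for illustration, and the weight-fitting step of eq. 12 is not shown.

import numpy as np

def integrate_dmp(w, centers, widths, y0, g, tau, alpha_y=25.0, alpha_z=3.0, dt=0.001):
    beta_y = alpha_y / 4.0                     # critically damped 2nd-order system
    y, v, z = y0, 0.0, 1.0                     # position, scaled velocity, phase
    traj = []
    for _ in range(int(tau / dt)):
        psi = np.exp(-0.5 * ((z - centers) / widths) ** 2)
        # Normalised basis expansion scaled by z*(g - y0), as in eq. 11.
        f = (psi @ w) / (psi.sum() + 1e-10) * z * (g - y0)
        v_dot = (alpha_y * (beta_y * (g - y) - v) + f) / tau   # from tau^2*ydotdot in eq. 9
        y_dot = v / tau
        z_dot = -alpha_z * z / tau
        v, y, z = v + v_dot * dt, y + y_dot * dt, z + z_dot * dt
        traj.append(y)
    return np.array(traj)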

B. Hitting Movements using DMPs

For some situations, such as striking a moving ball or throwing darts, it is necessary to adapt to an end goal with a desired final velocity v_f different from zero. A DMP formulation that adapts to goals with velocity was proposed by Kober et al. [18]. The proposed system used a shifting goal with linear velocity. This system however had some drawbacks: first, fast changes of the goal position and velocity lead to errors in the final velocity; second, when the final and starting positions are close to each other the system generates large accelerations. To overcome these shortcomings Mulling et al. [4] proposed an improved system, which is defined as follows:

\tau \dot{v} = \alpha_y (\beta_y (g - y) + \dot{g} \tau - v) + \ddot{g} \tau^2 + \eta f(z)
\tau \dot{y} = v    (14)

Here the goal follows a polynomial that starts and ends with zero acceleration and with the correct starting and final positions and velocities, computed as:

g = \sum_{j=0}^{5} b_j \left(-\frac{\tau \ln(z)}{\alpha_z}\right)^j
\dot{g} = \sum_{j=1}^{5} j b_j \left(-\frac{\tau \ln(z)}{\alpha_z}\right)^{j-1}
\ddot{g} = \sum_{j=2}^{5} (j^2 - j) b_j \left(-\frac{\tau \ln(z)}{\alpha_z}\right)^{j-2}    (15)

The 5th order polynomial coefficients are found by setting the boundary conditions such that it has the desired initial and final positions and velocities. The accelerations are set to zero at the start and at the end of the movement.
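Since −τ ln(z)/αz is simply the elapsed time for the canonical system of eq. 9 with z(0) = 1, the coefficients b_j can be obtained by writing the polynomial directly in time and solving the six boundary conditions as a small linear system. The sketch below is a generic construction under that assumption, not necessarily the procedure used in [4].

import numpy as np

def poly5_goal_coeffs(t0, tf, p0, v0, pf, vf):
    # Rows of the linear system: position, velocity and acceleration of the
    # polynomial sum_j b_j * t^j evaluated at a given time t.
    def rows(t):
        pos = [t**j for j in range(6)]
        vel = [j * t**(j - 1) if j >= 1 else 0.0 for j in range(6)]
        acc = [j * (j - 1) * t**(j - 2) if j >= 2 else 0.0 for j in range(6)]
        return pos, vel, acc
    p0_row, v0_row, a0_row = rows(t0)
    pf_row, vf_row, af_row = rows(tf)
    A = np.array([p0_row, v0_row, a0_row, pf_row, vf_row, af_row])
    b = np.array([p0, v0, 0.0, pf, vf, 0.0])   # zero acceleration at both ends
    return np.linalg.solve(A, b)               # coefficients b_0 ... b_5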

V. QUADRATIC OPTIMIZATION FOR MOTION PRIMITIVES

The approach in Mulling et al. [4], implemented through (14) and (15), achieves the final goal position and velocity with accuracy. However, the polynomial trajectory is computed independently of the demonstrated trajectory and may lead to acceleration profiles significantly different from the demonstration. Moreover, the acceleration vanishes at the end of the movement. This distorts the demonstrated trajectory at the most important phase in hitting movements. To have greater control over the resulting trajectory profiles, we formulate the problem as a convex QP. The imitation problem is addressed by computing the trajectory that best imitates the acceleration profile of the demonstration while constrained to start and end at specific positions and velocities. Additional constraints can be incorporated, for instance via-points and limits on the joint positions, velocities or accelerations during the whole movement trajectory.

Fig. 3: Here we can see how the weights combine with the Gaussian functions to generate the acceleration of the motor primitive. In the first panel we plotted the thirty Gaussian functions with centers shifted in time and the corresponding weights found by our optimization. Each Gaussian was multiplied by its weight, as shown in the second panel. Finally, by adding together all the Gaussians we get the acceleration profile of the motor primitive, as seen in the third panel.

A. Formulation

We use a dynamical system like (14), but instead of having separate reference trajectory and forcing terms, we generate a single desired trajectory directly, taking all constraints into account (imitation of the demonstration, initial and final states, joint limits, waypoints), and plug it into a reference tracking dynamical system. This is no longer a DMP but it still ensures robustness to noise and perturbations. To compute the desired trajectory we use a Gaussian kernel expansion as in the DMP forcing term (11). Let g(t), t ∈ [t_0, t_f], denote the desired trajectory to be executed between t_0 and t_f. Its acceleration is represented as:

\ddot{g}(t) = \sum_{i=1}^{N} w_i \psi_i(t)    (16)

The Gaussian basis functions ψi(t) are the same ones used for DMPs but time-warped to fit the desired time interval. Since we do not need a canonical system with exponential decay, we can have the basis functions spaced linearly in time and with constant width. The expression for the ψi(t) is the same as in (11), but now the input is time t instead of phase z, and the centers ci and variance σ are shifted and warped according to the transformation from the demonstration time interval to the imitation time interval.

The desired trajectory position and velocity can be obtained by integration:

\dot{g}(t) = v_0 + \sum_{i=1}^{N} w_i \psi'_i(t)
g(t) = p_0 + (t - t_0) v_0 + \sum_{i=1}^{N} w_i \psi''_i(t)    (17)

where

\psi'_i(t) = \int_{t_0}^{t} \psi_i(\tau) \, d\tau, \qquad \psi''_i(t) = \int_{t_0}^{t} \psi'_i(\tau) \, d\tau

(the cumulative functions ψ'_i(t) and ψ''_i(t) are, in practice, approximated by numerical integration).

Given a demonstration d(t) and the desired initial and final times, positions and velocities, respectively (t_0, p_0, v_0, t_f, p_f, v_f), the optimization problem is thus written as:

\min_{w_i} \; \int_{t_0}^{t_f} \left( \sum_{i=1}^{N} w_i \psi_i(t) - \tau^2 \ddot{d}(t) \right)^2 dt
\text{subject to} \quad p_0 + (t_f - t_0) v_0 + \sum_{i=1}^{N} w_i \psi''_i(t_f) = p_f
\qquad\qquad\qquad v_0 + \sum_{i=1}^{N} w_i \psi'_i(t_f) = v_f    (18)

where τ is the time scale due to the different durations of the demonstration and the imitation.

After the trajectory is computed, it can be used as a shifting goal in the acceleration-based controller:

\dot{v} = \alpha_y (\beta_y (g - y) + \dot{g} - v) + \ddot{g}
\dot{y} = v    (19)

The time scaling factor is included directly in the solution of the minimization problem. Every time a trajectory has to be computed, the new ψi(t) terms are recomputed to cover the desired time span, so it is no longer necessary to scale by τ in the dynamic system.

Note that the computed trajectory plays the role of both the forcing term and the shifting goal. Because the trajectory already incorporates the imitation, there is no need to use two different entities to implement the dynamical controller.

B. Solution

To solve the optimization problem we rearrange (18) as a QP with linear constraints in standard form. Then, any existing modern QP solver can be used to obtain the result.


Expanding the cost function in (18) and defining \psi_{ij} = \int \psi_{ij}(t)\,dt, \theta_j = \int \psi_j(t)\,\ddot{d}(t)\,dt and D^2 = \int \ddot{d}^2(t)\,dt (in practice the terms \psi_{ij} and \theta_j are approximated by numerical integration), we get:

J = \int_{t_0}^{t_f} \left( \sum_{j=1}^{N} w_j \psi_j(t) - \tau^2 \ddot{d}(t) \right)^2 dt = \sum_{j=1}^{N} \sum_{i=1}^{N} w_{ij} \psi_{ij} - 2 \tau^2 \sum_{j=1}^{N} w_j \theta_j + \tau^4 D^2

where w_{ij} = w_i w_j, \psi_{ij}(t) = \psi_i(t) \psi_j(t) and the integrals are between t_0 and t_f.

Because the last term does not depend on the w_i's, it can be left out of the optimization cost J. Therefore, representing it in matrix form we get:

J = w^T \Psi w - 2 \tau^2 w^T \theta    (20)

where

w = [w_1 \dots w_N]^T
\theta = [\theta_1 \dots \theta_N]^T
\Psi = \begin{bmatrix} \psi_{11} & \dots & \psi_{1N} \\ \vdots & \ddots & \vdots \\ \psi_{N1} & \dots & \psi_{NN} \end{bmatrix}    (21)

The constraints of (18) can be written as:

w^T \psi''(t_f) = p_f - p_0 - (t_f - t_0) v_0
w^T \psi'(t_f) = v_f - v_0    (22)

where

\psi''(t_f) = [\psi''_1(t_f) \dots \psi''_N(t_f)]^T, \qquad \psi'(t_f) = [\psi'_1(t_f) \dots \psi'_N(t_f)]^T    (23)

Now we can write (18) in standard QP form:

\min_{w} \; \tfrac{1}{2} w^T (2\Psi) w - 2 \tau^2 \theta^T w
\text{subject to} \quad \begin{pmatrix} \psi''(t_f)^T \\ \psi'(t_f)^T \end{pmatrix} w = \begin{pmatrix} p_f - p_0 - (t_f - t_0) v_0 \\ v_f - v_0 \end{pmatrix}    (24)

These values can be used in a standard QP solver. For adapting to new targets only p_f, p_0, v_f, v_0 and Ψ change in the resulting QP. The basis functions Ψ must be recomputed for changes in the final time. The θ vector, obtained from the original demonstration, remains unchanged and does not have to be recalculated. This vector is therefore the invariant that defines the primitive.

The minimization must be solved each time we want to execute a new motor action. However, existing QP solvers can find a solution very fast. If we are controlling an arm with many DOF, solving for each individual DOF is an intrinsically parallel problem. Additional constraints can also be added to ensure that joint accelerations, velocities and positions are bounded within safe physical limits. For a discretization of the time interval at instants t_i the following constraints can be added:

\ddot{q}_{min} \le w^T \psi(t_i) \le \ddot{q}_{max}
\dot{q}_{min} \le v_0 + w^T \psi'(t_i) \le \dot{q}_{max}
q_{min} \le p_0 + (t_i - t_0) v_0 + w^T \psi''(t_i) \le q_{max}    (25)

Constraints can also be added to limit jerk:

\dddot{q}_{min} \le w^T \psi(t_i) - w^T \psi(t_{i-1}) \le \dddot{q}_{max}    (26)

However, each of these limitations requires adding two constraints for each sample point and may increase the computation time. In our case the problem is still solved in the order of tenths of a second for trajectories with 1000 points. Some solvers allow the use of lazy constraints. We experimented with a lazy constraint version running in the proprietary solver Gurobi. The solver introduces the constraints only when the current iteration's solution violates them. In this way most constraints do not have to be used, reducing the computation for each iteration of the solver.
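As an illustration of the construction above, the sketch below assembles and solves the QPMP problem with cvxpy, used here as a stand-in for the Gurobi-based lazy-constraint setup described in the text. The integral cost of (18) is approximated by a discrete least-squares term over the sampled times, the end-point constraints of (22) are imposed exactly, and the position limits of (25) are added as plain box constraints; the basis matrices (values of ψi and of their running integrals at the sample times) and the demonstrated acceleration are assumed to be precomputed.

import numpy as np
import cvxpy as cp

def solve_qpmp(Phi, Phi_d, Phi_dd, ddot_demo, tau,
               t, t0, p0, v0, tf, pf, vf, q_min=None, q_max=None):
    # Phi, Phi_d, Phi_dd: (T, N) values of psi_i, psi'_i, psi''_i at the times t (length T).
    N = Phi.shape[1]
    w = cp.Variable(N)
    # Imitation cost: match the time-scaled demonstrated acceleration profile (eq. 18).
    cost = cp.sum_squares(Phi @ w - tau**2 * ddot_demo)
    # End-point equality constraints of eq. 22.
    constraints = [Phi_dd[-1] @ w == pf - p0 - (tf - t0) * v0,
                   Phi_d[-1] @ w == vf - v0]
    if q_min is not None and q_max is not None:
        # Position limits over the whole trajectory, as in eq. 25 (added eagerly, not lazily).
        pos = p0 + (t - t0) * v0 + Phi_dd @ w
        constraints += [pos >= q_min, pos <= q_max]
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return w.value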

VI. EXPERIMENTAL RESULTS

In this section we present the experimental results obtained throughout this dissertation. First the DMP and QPMP presented in section III are evaluated for simple trajectories in one DOF. More specifically, we show the possibilities provided by the constrained optimization compared to traditional DMPs. Next we evaluate learning with CrKR and the ANMS sparsification scheme described in section II. The sparsification allows continuous learning instead of being limited to a fixed budget of samples.

In section VII we present the experimental setup and show results obtained in a simulated environment.

In the final section (VIII), we describe the robotic platform setup, including the state tracking system. We show the architecture of the software and of the physical system.

A. Evaluation of QPMP and DMP

To test our proposed methods we generated a ball hitting movement, defined in joint space, with our 6 DOF arm to serve as a demonstration both for a DMP with a polynomial shifting goal (the system proposed in [4]) and for our QP formulation. For both the DMP and the QP we use Euler integration with a 0.001 s time step in all experiments. After the final time both the DMP and the QP accelerations are set to zero. We performed three different experiments (Sections VI-B, VI-C and VI-D) that consisted in adapting the demonstrated motion to different execution conditions, comparing our QP method to the DMP. Moreover, we discuss how some important features of the DMP approach are also preserved, namely the rejection of noise and disturbances during execution (Section VI-E). Finally, we discuss how both approaches deal with noise in the demonstration (Section VI-F). Each experiment shows one useful feature of our proposed method. For the sake of clarity we present these results in separate experiments, but in a real application all features can be used simultaneously. The results presented in this section have been published in [16].

B. Adaptation to different initial/final states and movement duration

Both the DMP and the QPMP allow adapting the demonstrated primitive to new initial and final states composed of position, velocity and time. As can be seen in Fig. 4, the DMP and the QP generate similar trajectories. However, the QP distributes the acceleration difference evenly throughout the trajectory. This allows a more accurate replication of the demonstrated movement shape, especially near the end of the trajectory. The non-zero final accelerations may increase the velocity error at the final instant. When a precise final velocity is critical to the task, a new constraint can be added to switch off the accelerations in a predetermined final fraction of the trajectory.

Fig. 4 (panels: (a) Angle [rad], (b) Velocity [rad/s] ×10¹, (c) Acceleration [rad/s²] ×10², all vs. time [s]; curves: demo, scaled demo, DMP, QP): Trajectory generated by adapting the demonstration to a different initial position and a different final velocity. The final time is set to be at 0.8 s instead of the 1.0 s of the demonstration. The movement generated by the global optimization (without additional constraints) imitates the shape of the demonstration better, especially in the final part of the motion.

C. Inclusion of joint limits

Taking into account the physical limits of the joints is of crucial importance in practical operations, as every robot has position, velocity and acceleration limits that cannot be violated. Moreover, in some specific cases there might be the need to set limits that are more conservative than the maximum, nominal ones; therefore, it is extremely helpful to have the possibility of changing these limits on-the-fly, depending on the task requirements, without the overall shape of the movement deviating too much from the demonstration. As shown in Fig. 5, with the DMP formulation the adaptation to new goals can generate trajectories that violate joint limits (even if the demonstration was bounded). Conversely, with our approach these limits can be included as constraints of the QP by adding the inequalities in eq. 25. The results in Fig. 5 show that our proposed method can generate a trajectory that is feasible and still a good approximation of the original demonstration.

D. Addition of intermediate way-points

An interesting possibility of the QP is adding intermediate way-points while trying to maintain the shape of the movement. The way-points can be added in the same way as the final goal but at an intermediate time; they can be defined as a specific position and velocity occurring at a desired time instant.


Fig. 5 (panels: (a) Angle [rad], (b) Velocity [rad/s] ×10¹, (c) Acceleration [rad/s²] ×10², all vs. time [s]; curves: limit, demo, DMP, QP): Trajectory generation with joint limits. Here the joint limits were added as constraints in the QP. The position has a physical limit at ±2.0 rad. To adapt to a new end goal (v_f = 10 rad/s, p_f = −1.0 rad) the QP can take the limits and a safety margin into account to generate a good imitation, while the DMP generates an infeasible trajectory.

Among many possibilities, one practical application could be to perform obstacle avoidance while keeping the demonstrated shape of the movement. Fig. 6 shows how the QP formulation allows generating a trajectory with the same initial and final position as the demonstration, but with a different way-point in the middle; notably, the shape of the trajectory highly resembles the demonstrated one. Although a small discontinuity in the acceleration profile seems to occur near the way-point, this can be mitigated by adding a jerk limiting constraint to the QP problem.

Fig. 6 (panels: (a) Angle [rad], (b) Velocity [rad/s] ×10¹, (c) Acceleration [rad/s²] ×10², all vs. time [s]; curves: demo, DMP, QP, waypoint): Intermediate via-points can be added to a trajectory in the same way as the final goal. Here a via-point is specified with p_waypoint = 1.0 rad and v_waypoint = 1.0 rad/s at time t = 0.5 s. The possibility of adding via-points while following the shape of a single primitive does not exist in the previous formulations of DMP.

E. Rejection of noise and disturbances

The acceleration-based controller of the DMP (see eq. 19) is stable when following a polynomial goal [4]. In our case we are following an arbitrary reference instead of a polynomial: after transients, the system acts as an all-pass filter and does not diverge when given bounded inputs. We performed an experiment in which we introduced i) noise in the sensory measurements (positions and velocities) and ii) a step disturbance in the acceleration, which can simulate for example the presence of an unexpected load. The plots in Fig. 7 show how the controller is able to reject the noise and recover the desired trajectory when the disturbance vanishes.

Fig. 7 (panels: (a) Angle [rad], (b) Velocity [rad/s] ×10¹, (c) Acceleration [rad/s²] ×10², all vs. time [s]; curves: demo, DMP, QP): The QP is disturbed by adding 1% noise to the velocity and position sensor measurements, and by adding an acceleration of 100 rad/s² during t = [0.1, 0.3] s. The system manages to reject the noise and recovers from the disturbance, accurately reaching the position and velocity goal.

F. Effect of noise in the demonstration

The demonstration of a movement usually consists of a position signal that is then differentiated to obtain the acceleration profile. The noise in the position signal is amplified by differentiation and may result in a very noisy acceleration profile unsuitable for imitation. DMPs use the demonstrated position and velocity in addition to the acceleration for learning the forcing term. This helps to mitigate the jerky motion caused by noise in the acceleration. Our QPMP system loses some of the high frequency noise content by representing the demonstrated accelerations as Gaussian basis functions; however, if the position signal is very noisy, the stored demonstration will still have very high accelerations, being unsuitable for imitation. Preprocessing the demonstration offline through a smoothing filter generates a smooth acceleration profile suitable for imitation. We used a Savitzky-Golay filter [19] to smooth the position signal, achieving good results in the imitation (Fig. 8).
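For example, SciPy's Savitzky-Golay filter can both smooth the demonstrated position and return its derivatives directly; the signal and the filter parameters below are illustrative choices only.

import numpy as np
from scipy.signal import savgol_filter

dt = 0.001
t = np.arange(0.0, 1.0, dt)
# Illustrative noisy demonstration: a smooth position signal plus measurement noise.
pos = np.sin(2.0 * np.pi * t) + 0.01 * np.random.randn(t.size)

pos_smooth = savgol_filter(pos, window_length=51, polyorder=3)
vel_smooth = savgol_filter(pos, window_length=51, polyorder=3, deriv=1, delta=dt)
acc_smooth = savgol_filter(pos, window_length=51, polyorder=3, deriv=2, delta=dt)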

Fig. 8 (panels: (a) Angle [rad], (b) Velocity [rad/s] ×10¹, (c) Acceleration [rad/s²] ×10², all vs. time [s]; curves: filtered demo, DMP with filtered demo, QP with filtered demo): Differentiation of the position signal with smoothing by a Savitzky-Golay filter results in better imitation for both the QPMP and the DMP. To the demonstration's position signal we added noise with an amplitude of 0.01 rad; this is amplified by differentiation to 10 rad/s in the velocity and to 10000 rad/s² in the acceleration. The QP and the DMP used the acceleration smoothed by the Savitzky-Golay filter. Filtering is very desirable for the DMP to avoid jerky motion, and necessary for any useful imitation with the QP.

VII. EXPERIMENTAL SETUP AND SIMULATION

In order to test the algorithms described in this work we designed a robot table tennis experiment. In this experiment the goal is to hit a table tennis ball thrown by a human player by means of a serial manipulator with a paddle attached.


The experiments were done in a simulated environment implemented using Gazebo, a multi-robot simulator with dynamics simulation and advanced 3D graphics.

A. Reinforcement Learning

To learn the mapping between the state of the incoming ball and the goal for our robotic manipulator we applied CrKR and the ANMS filtering described in section II. In this case the input to the system is the measured state of the ball's position and velocity in Cartesian 3D space. The output is the position and velocity of each joint of the robotic manipulator at a desired hit time. This goal can then be passed to a trajectory generation algorithm that handles the creation of an appropriate trajectory to achieve the given manipulator state. During the execution of the swing a game state tracker stores information to evaluate the cost of the swing, storing:
• the ball state at the goal time t_hit;
• the Cartesian position of the paddle, obtained by computing the forward kinematics with the measured state of the joints at time t_hit;
• the state of the ball at the next bounce from the ground or the table after time t_hit.

The swing is classified as successful if the ball, in the next bounce after time t_hit, is traveling away from the robot and bounces on the table. Only the successful trials were used as data points to update the regression. The position of the ball and of the paddle at time t_hit were used to compute the cost c_j associated with the trial according to the saturating cost function (eq. 8).

B. Imitation Learning

Since our simulation model matches the robotic platform, a trajectory was captured by kinesthetic teaching to serve as the basis for the motion primitive. The trajectories were then calculated using QPMP independently for each DOF. The robot joint limits were added to the optimization as lazy constraints to add safety to the trajectory execution without significantly increasing the computation time for most trajectories. The solver will fail to generate the trajectory if it is not feasible due to the constraints. In these situations no trajectory is executed and the trial is discarded.

The table acts as a ground plane that significantly limits the configurations that the serial manipulator can achieve without collision. To ensure that no collision occurs with the table, we sample the forward kinematics of the generated trajectory using OpenRAVE. The table was included in the robot model so that any situation where the arm would collide with the table is detected internally as a self-collision.

C. Results

The simulation was run starting from an initial set of five demonstrations and a motion primitive. The system throws the balls starting with position and velocity sampled from a uniform distribution that was tuned to generate trajectories lying mostly within the bounds of the robot's reachability. The state of the learning is regularly verified by running a fixed set of 50 trials that have the same distribution as the training samples.

Fig. 9 (hit rate (%) vs. learning epoch, for λ = 0.01, 0.1, 1.0, 2.0): The success rate by epoch while learning in the Gazebo simulator for different λ. After each training epoch of 50 random samples an evaluation is performed on a fixed test set of 50 trials. A trial is considered successful if the ball is hit and bounces on the table while moving away from the robot. The cost is calculated by expression 8. The plots are averaged over 5 different runs for each value of the CrKR exploration/exploitation trade-off λ.

The results obtained can be seen in Fig. 9. We achieved a success rate of 78% according to the criteria previously described. It should be noted that this rate is dependent on the trajectories of the balls thrown to the robot; some trajectories in the test set may be unreachable for the serial manipulator, so a 100% success rate is not possible. The learned policy can be visualized by varying one of the input state dimensions and plotting the resulting policy for the joint positions in Cartesian space by computing the forward kinematics. Another important consideration is that, due to timings and errors in the physics of the simulation, there is some variability in our test set. The same policy executed on the same test can result in a slightly different outcome.

VIII. ROBOTIC PLATFORM

For the experiments with the real robot the state of the ball cannot be measured directly as in the simulation. To track the ball position we used a vision system based on Motion Capture (Mo-cap) cameras. The ball position captured by the vision system must then be filtered to get a state estimation that includes velocity and is in the reference frame of the robot. To get the resulting state we used an Extended Kalman Filter (EKF). In addition to the filter, a reflection coefficient is applied to the prediction if it crosses the plane of the table. To handle false positives and detection errors we included a Mahalanobis distance validation gate. This state is then sent to the same learning software used in the simulation.
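A sketch of such a validation gate is shown below: the squared Mahalanobis distance of the innovation is compared against a chi-square threshold. The predicted measurement and the innovation covariance are assumed to come from the EKF update step, and the confidence level is an illustrative choice.

import numpy as np
from scipy.stats import chi2

def validate_measurement(z, z_pred, S, confidence=0.99):
    # z: measured ball position, z_pred: EKF-predicted measurement, S: innovation covariance.
    innovation = z - z_pred
    d2 = innovation @ np.linalg.solve(S, innovation)   # squared Mahalanobis distance
    return d2 <= chi2.ppf(confidence, df=len(z))        # accept only inside the gate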


A. Experiments on the Robotic Platform

Our robotic platform has a Biorob V3 X6 serial manipulator set on top of a table, with three attached spherical markers for pose tracking. The measured pose of the serial manipulator is used to translate the ball position from the camera's reference frame to the robot's reference frame, automatically adjusting for any alterations in the robot location. To the end effector of the robot we attached a 3D printed paddle. This setup can be seen in Fig. 10.

Fig. 10: The real setup showing the Biorob arm and the six cameras of the Optitrack Mo-cap system. Three markers are attached to the base of the robot to automatically translate to the robot reference frame when the robot changes position relative to the cameras. In the upper right corner we can see the 3D printed paddle. The paddle does not need markers since its position can be found using the robot state and its forward kinematics.

With this setup we performed preliminary experiments with low speed trajectories learned from demonstrations with QPMP. We tested the tracking and the automatic reference frame adjustment, resulting in the plot seen in Fig. 11. At this point we are tuning the control for QPMP at higher speeds to validate the results obtained in simulation. Videos and material related to these results can be seen in [20].

Fig. 11: EKF predictions based on collected samples of the ping pong ball bouncing on the table. The blue dots are samples of the ball position captured by the Optitrack system at a rate of 120 FPS. The green triangles show the prediction for the ball position 0.5 s ahead. In the initial predictions the filter is still being initialized. The samples not pictured were those rejected by the Mahalanobis validation gate.

IX. CONCLUSIONS

In this work we explored methods for both imitation learning and exploration learning of motor tasks. Expanding from the known methods used for RL in this class of robotics problems, we proposed the addition of a sparsification method based on ANMS. This sparsification is capable of choosing online a low cost and representative subset of the collected samples. Using this subset for regression allows the robot to continue learning past a set budget of samples, beyond which the matrix inversion becomes too computationally expensive. Also taking modern developments in DMPs as a starting point, we take a novel approach by posing the problem of imitation as a constrained global optimization. Our experiments show that our QPMP method maintains the flexibility of the DMP formulation (i.e. adaptation to different start and end positions, velocities and time) while adding important additional constraints on the trajectory such as the joint limits of the robot and intermediate way-points.

REFERENCES

[1] C. G. Atkeson and S. Schaal, "Robot learning from demonstration," in Proceedings of the Fourteenth Int. Conference on Mach. Learning, ser. ICML '97. Morgan Kaufmann Publishers Inc., 1997, pp. 12–20.

[2] R. Krug and D. Dimitrov, "Representing movement primitives as implicit dynamical systems learned from multiple demonstrations," in ICAR 2013, Nov. 2013, pp. 1–8.

[3] A. Ijspeert, J. Nakanishi, and S. Schaal, "Movement imitation with nonlinear dynamical systems in humanoid robots," ICRA 2002, pp. 1398–1403, 2002.

[4] K. Mulling, J. Kober, O. Kroemer, and J. Peters, "Learning to select and generalize striking movements in robot table tennis," The Int. J. of Robotics Research, vol. 32, no. 3, pp. 263–279, 2013.

[5] A. Paraschos, C. Daniel, J. Peters, and G. Neumann, "Probabilistic movement primitives," in NIPS, 2013, pp. 2616–2624.

[6] C. E. Rasmussen, "Gaussian processes for machine learning," 2006.

[7] R. Bellman, "Dynamic programming and Lagrange multipliers," Proceedings of the National Academy of Sciences of the United States of America, vol. 42, no. 10, p. 767, 1956.

[8] N. D. Goodman, T. D. Ullman, and J. B. Tenenbaum, "Learning a theory of causality," Psychol. Rev., vol. 118, no. 1, p. 110, 2011.

[9] C. Plagemann, K. Kersting, P. Pfaff, and W. Burgard, "Heteroscedastic Gaussian process regression for modeling range sensors in mobile robotics," in Snowbird learning workshop, 2007.

[10] J. Kober, A. Wilhelm, E. Oztop, and J. Peters, "Reinforcement learning to adjust parametrized motor primitives to new situations," Autonomous Robots, vol. 33, no. 4, pp. 361–379, Nov. 2012.

[11] J. Kober, "Learning motor skills: from algorithms to robot experiments," it-Information Technology, vol. 56, no. 3, pp. 141–146, 2014.

[12] K. P. Kording and D. M. Wolpert, "The loss function of sensorimotor learning," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. 26, pp. 9839–9842, 2004.

[13] D. Nguyen-Tuong and J. Peters, "Incremental online sparsification for model learning in real-time robot control," Neurocomputing, vol. 74, no. 11, pp. 1859–1867, May 2011.

[14] S. Gauglitz, L. Foschini, M. Turk, and T. Hollerer, "Efficiently selecting spatially distributed keypoints for visual tracking," ICIP, pp. 1869–1872, 2011.

[15] M. Brown, R. Szeliski, and S. Winder, "Multi-image matching using multi-scale oriented patches," 2005.

[16] C. Cardoso, L. Jamone, and A. Bernardino, "A novel approach to dynamic movement imitation based on quadratic programming," in Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015, pp. 906–911.

[17] A. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal, "Dynamical Movement Primitives: Learning Attractor Models for Motor Behaviors," Neural Comput., vol. 25, no. 2, pp. 328–373, 2013.

[18] J. Kober, K. Mulling, O. Kromer, C. H. Lampert, B. Scholkopf, and J. Peters, "Movement templates for learning of hitting and batting," in ICRA 2010, May 2010, pp. 853–858.

[19] A. Savitzky and M. Golay, "Smoothing and differentiation of data by simplified least squares procedures," Anal. Chem., vol. 36, pp. 1627–39, 1964.

[20] C. Cardoso, "Robot skills: Imitation and active learning," https://github.com/carlos-cardoso/robot-skills.git, 2016.