Dynamic Movement Primitives in Latent Space of Time-Dependent Variational Autoencoders

Nutan Chen, Maximilian Karl, Patrick van der Smagt

Abstract— Dynamic movement primitives (DMPs) are powerful for the generalization of movements from demonstration. However, high-dimensional movements, as they are found in robotics, make finding efficient DMP representations difficult. Typically, they are either used in configuration or Cartesian space, but both approaches do not generalize well. Additionally, limiting DMPs to single demonstrations restricts their generalization capabilities.

In this paper, we explore a method that embeds DMPs into the latent space of a time-dependent variational autoencoder framework. Our method enables the representation of high-dimensional movements in a low-dimensional latent space. Experimental results show that our framework generalizes well in the latent space, e.g., when switching between movements or changing goals, and that it generates close-to-optimal movements when reproducing demonstrations.

I. INTRODUCTION

The representation of movement via Dynamic Movement Primitives (DMP) is a promising approach and is widely used for learning by demonstration and reinforcement learning [1]. The generalization of movements by DMP is robust, adapting to different speeds, shifts, stretches, and goal positions through changes of the DMP parameters. A probabilistic version, dubbed ProMP [2], has been proposed which is capable of representing movement variance.

High-dimensional movement data increases the difficulty of learning, of inverse kinematics, and of representing redundant degrees of freedom. In earlier work, Gaussian Process Latent Variable Models (GPLVM) [3], [4] and denoising autoencoders [5] have been investigated for movement representation in latent space using DMP. These approaches efficiently reduced the dimensionality of the data representation and enabled new movements to be generated by simply changing parameters in the latent space. However, these methods only represent one, or at most two, different movement types; for more movements, separate models have to be trained. ProMP can combine and switch between different movements, and its dimension-reduction variant, dimensionality-reduction ProMP (DR-ProMP) [6], was investigated for high-dimensional movements based on Expectation Maximization (EM). However, ProMP-based methods blend movements via points which the trajectories pass through, and they require multiple demonstrations for each movement type to construct a sufficient probabilistic model.

In previous work [7] we have shown that, in a reinforcement-learning setting, variational autoencoders (VAE) can build more meaningful latent spaces than autoencoders or PCA. Following that line of thought, here we propose to use a VAE rolled out in time to reduce the dimensionality for DMPs. To this end, we use a technique called Deep Variational Bayes Filtering (DVBF) [8]. Our approach is an incremental advancement of DVBF and DMP: we explore DVBF to learn movements from multiple demonstrations, while benefiting from the latent representation of a time-dependent VAE. Integrating DMPs with DVBF adds constraints to the latent space, and therefore forces the distribution of the movements in the latent space to be meaningful. In our approach we train a model with a shared latent space for various similar as well as different movements. In contrast to ProMP, our method switches the movement completely, and does so by representing different movements in different “parts” of the latent space.

1 These authors are with the Faculty for Informatics, Technische Universität München, 80333 Germany; nutan.chen(at)gmail.com, karlma(at)in.tum.de. PvdS is also with fortiss.

II. METHOD

Our approach consists of a modified DVBF, which represents the high-dimensional data in a latent space, and DMPs, which learn movements from demonstrations in the latent space.

A. Dynamic Movement Primitives

DMPs are generally trained from a demonstration of a movement in its joint space or Cartesian space, which can then be re-used and adapted to the environment [1]. A DMP is a point attractor system written as a second-order dynamic model,

$$\tau^2 \ddot{y} = \alpha\big(\beta(g - y) - \tau \dot{y}\big) + f, \qquad (1)$$

where $\tau$ is a time constant and $\alpha > 0$ and $\beta > 0$ are damping constants. The trajectory $y$ is attracted to the goal position $g$ by the difference term $(g - y)$. Usually, the last frame of the demonstration is set as the goal during training.

The forcing term $f$ encodes the trajectory dynamics, which drives the system to the goal. Following [1], a basic version of $f$ is chosen as a linear combination of basis functions $\psi_i$,

$$f_t = \frac{\sum_{i=1}^{N} \psi_i(t)\, w_i}{\sum_{i=1}^{N} \psi_i(t)}, \qquad (2)$$

where $t$ represents the time steps. We focus on the discrete case, but an extension to rhythmic dynamical systems is straightforward by modifying the basis functions $\psi$. We write a discrete $\psi$ as [1]

$$\psi_i(t) = \exp\Big[-\frac{1}{2\sigma_i^2}\,(t - c_i)^2\Big], \qquad (3)$$


where the constants $c_i$ and $\sigma_i$ are the centers and widths of the Gaussian basis functions, respectively.
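For illustration, (2) and (3) reduce to a few lines of NumPy. This is a minimal sketch for a one-dimensional trajectory, not code from the paper; the function names and the small numerical guard in the denominator are our own:

```python
import numpy as np

def basis(t, centers, widths):
    """Gaussian basis functions psi_i(t) of Eq. (3)."""
    return np.exp(-0.5 * ((t - centers) / widths) ** 2)

def forcing_term(t, w, centers, widths):
    """Normalized, weighted combination of basis functions, Eq. (2)."""
    psi = basis(t, centers, widths)
    return psi @ w / (psi.sum() + 1e-10)  # guard against a vanishing denominator
```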

During training, given the demonstration $y_{\text{demo}}$ and its computed derivatives, the target values of the forcing term result in

$$f_{\text{target}} = \tau^2 \ddot{y}_{\text{demo}} - \alpha\big(\beta(g - y_{\text{demo}}) - \tau \dot{y}_{\text{demo}}\big). \qquad (4)$$

With pairs $\{(f_1, f_{\text{target},1}), \ldots, (f_n, f_{\text{target},n})\}$ of length $n$, an optimizer can be used to minimize the discrepancy between $f$ and $f_{\text{target}}$:

$$w^\star = \arg\min_w \sum_{i=1}^{n} L\big[f_i, f_{\text{target},i}\big], \qquad (5)$$

where $L$ is a loss function.
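With a squared loss, (5) admits a closed-form least-squares solution. The following sketch, reusing basis from above, fits the weights under that assumption (the numerical derivative computation and the function name are ours):

```python
import numpy as np

def fit_dmp_weights(y_demo, dt, alpha, beta, tau, centers, widths):
    """Fit w by least squares on f_target, cf. Eqs. (4)-(5)."""
    g = y_demo[-1]                                # goal: last demonstrated frame
    yd = np.gradient(y_demo, dt)                  # numerical first derivative
    ydd = np.gradient(yd, dt)                     # numerical second derivative
    f_target = tau**2 * ydd - alpha * (beta * (g - y_demo) - tau * yd)  # Eq. (4)

    steps = np.arange(len(y_demo))
    Psi = np.stack([basis(t, centers, widths) for t in steps])   # (n, N)
    Phi = Psi / (Psi.sum(axis=1, keepdims=True) + 1e-10)         # normalized features
    w, *_ = np.linalg.lstsq(Phi, f_target, rcond=None)           # squared-loss Eq. (5)
    return w
```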

B. Variational Autoencoder

In this section, we give a brief overview of variational autoencoders (VAE), which form the basic framework of the time-dependent VAE for DMP.

1) Variational inference: Variational inference [9] is a method to approximate the intractable posterior distribution $p(z|x)$ through a tractable approximate variational distribution $q_\phi(z)$. Here $x \in \mathbb{R}^d$ and $z \in \mathbb{R}^{d'}$ are the observed data and its corresponding latent representation, respectively. As the dissimilarity function, the Kullback-Leibler (KL) divergence between the approximate distribution $q_\phi$ and the target distribution $p$ is minimized to obtain the variational parameters $\phi$ of the optimal approximate distribution $q_\phi$. The marginal log-likelihood is written as

$$\log p(x) = \mathbb{E}_{q_\phi(z)}\Big[\log \frac{p(x, z)}{q_\phi(z)}\Big] + \mathrm{KL}\big(q_\phi(z) \,\|\, p(z|x)\big). \qquad (6)$$

$\log p(x)$ does not depend on $q_\phi$, and the KL term is non-negative; therefore, maximizing the first term, the variational lower bound $\mathcal{L}_{\text{bound}}(q)$, minimizes the KL divergence.

2) Variational autoencoder: A variational autoencoder (VAE) [10] is a type of autoencoder [11] based on variational inference. The posterior distribution is defined as $p_\theta(z|x) \propto p_\theta(x|z)\,p(z)$. The prior $p(z)$ is defined as an isotropic Gaussian distribution $\mathcal{N}(0, I_D)$. The observation model $p_\theta(x|z)$ is defined by a parameterized distribution: a generative neural network takes $z$ as input and outputs the parameters of the distribution. This distribution is chosen Gaussian for continuous data, or Bernoulli for binary data. $\theta$ are the weights of the neural network.

The approximate distribution $q_\phi(z|x)$ is also a neural network, which takes $x$ as input, outputs the parameters of the distribution over $z$, and has the weights $\phi$. It is called the inference network or recognition network.

$\phi$ and $\theta$ can be jointly updated by optimization through back-propagation. The VAE maximizes the variational lower bound to optimize the parameters $\theta$ and $\phi$:

$$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}\Big[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\Big] = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \mathrm{KL}\big(q_\phi(z|x) \,\|\, p(z)\big), \qquad (7)$$

where the first term can be considered a reconstruction loss, and the second is a regularization term which constrains the approximate posterior $q_\phi(z|x)$ to be close to the prior $p(z)$.
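As a concrete reference point, the bound (7) for a diagonal Gaussian encoder and a standard normal prior can be written in a few lines of PyTorch. This is a generic sketch rather than the authors' implementation; the unit-variance Gaussian decoder (squared-error reconstruction) is our simplifying assumption:

```python
import torch
import torch.nn as nn

class VAE(nn.Module):
    """Minimal VAE maximizing the lower bound of Eq. (7)."""
    def __init__(self, d, d_latent, hidden=200):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d, hidden), nn.ReLU())
        self.enc_mu = nn.Linear(hidden, d_latent)
        self.enc_logvar = nn.Linear(hidden, d_latent)
        self.dec = nn.Sequential(nn.Linear(d_latent, hidden), nn.ReLU(),
                                 nn.Linear(hidden, d))

    def elbo(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        recon = -((x - self.dec(z)) ** 2).sum(-1)                 # Gaussian log-lik., up to scale
        kl = 0.5 * (mu**2 + logvar.exp() - logvar - 1).sum(-1)    # KL(q(z|x) || N(0, I))
        return (recon - kl).mean()
```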

C. DMP in latent space

The VAE, as formulated in the previous section, has no internal state, and therefore cannot represent temporal dependencies in the input data. There are several ways of dealing with this; one possibility is extending the VAE to a recurrent neural network [12], [13]. While such models have merit in, e.g., anomaly detection [14], their prediction capabilities are not as good as expected [15]. Furthermore, it is not clear how a control signal can be included.

A different, very promising approach is obtained by rolling a VAE out in time. This approach, called Deep Variational Bayes Filtering (DVBF), was published in [8]. In this paper we use DVBF to embed DMPs in a VAE (see Fig. 1). By using this approach, we can directly learn all parameters, including the DMP parameters, using back-propagation.

Fig. 1: VAE-DMP information flow for the generative movement.

1) Problem formulation: Given a movement $x_{1:n} = \{x_1, x_2, \ldots, x_n\}$, where $n$ is the length of the movement, we want to model the movement in latent space $z_{1:n} = \{z_1, z_2, \ldots, z_n\}$. In order to embed the DMP into DVBF, (1) is formulated as the transition function from $z_t$ to $z_{t+1}$ in the latent space:

$$\begin{aligned}
\tau \ddot{z}_{t+1} &= \alpha\big(\beta(z_{\text{goal}} - z_t) - \dot{z}_t\big) + f_t + \epsilon \\
\dot{z}_{t+1} &= \ddot{z}_{t+1}\, dt + \dot{z}_t \\
z_{t+1} &= \dot{z}_{t+1}\, dt + z_t
\end{aligned} \qquad (8)$$

where $dt$ is the step duration, $z_t$ is the movement in latent space at step $t$, $z_{\text{goal}}$ is the goal in the latent space corresponding to the final frame $x_n$ of the movement in joint space, $f_t \in \mathbb{R}^{d'}$ is modified from (2) by changing the dimension to fit the latent space, $\epsilon = w_\epsilon \hat{\epsilon}$ with $\hat{\epsilon} \sim \mathcal{N}(0, \Sigma_\epsilon)$ is the system noise, and $w_\epsilon \in \mathbb{R}^{d' \times d'}$ is the weight for the noise, so that the scale of the noise is learned autonomously. Eq. (8) can be simplified as

$$\begin{bmatrix} z_{t+1} \\ \dot{z}_{t+1} \end{bmatrix} = A \begin{bmatrix} z_t \\ \dot{z}_t \end{bmatrix} + b, \qquad (9)$$


where the transition matrix is computed as

$$A = \begin{bmatrix} -dt\,\alpha\beta\,\frac{1}{\tau} & 1 - dt\,\alpha\,\frac{1}{\tau} \\ -\alpha\beta\,\frac{1}{\tau} & -\alpha\,\frac{1}{\tau} \end{bmatrix} dt + I, \qquad (10)$$

and the control input is defined as

$$b = \begin{bmatrix} dt \\ 1 \end{bmatrix} \frac{1}{\tau}\big(\alpha\beta\, z_{\text{goal}} + f_t + \epsilon\big)\, dt, \qquad (11)$$

where $I$ is the identity matrix. Instead of generating $f_{\text{target}}$ as in (4) and (5) to find its parameters, we embed the DMP transition into DVBF and train its parameters through backpropagation.
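Written out, (9)-(11) amount to one semi-implicit Euler step of (8). A minimal sketch (the function name is ours):

```python
import numpy as np

def latent_dmp_step(z, z_dot, z_goal, f_t, eps, alpha, beta, tau, dt):
    """One latent transition z_t -> z_{t+1}, following Eqs. (8)-(11)."""
    z_ddot = (alpha * (beta * (z_goal - z) - z_dot) + f_t + eps) / tau
    z_dot_next = z_dot + dt * z_ddot       # second line of Eq. (8)
    z_next = z + dt * z_dot_next           # third line of Eq. (8)
    return z_next, z_dot_next
```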

2) Inference model: The encoder network is only used for the initial and goal states. We take a diagonal Gaussian distribution with mean $\mu_t$ and covariance $\Sigma_t$; the encoding process is $q_\phi(z|x) = \mathcal{N}(z \,|\, \mu_t, \mathrm{diag}(\sigma_t^2))$. The mean and covariance are encoded by a neural network,

$$\mu_t^{\text{enc}} = w_\mu^{\text{enc}}\, h_\phi(x_t) + b_\mu^{\text{enc}}, \qquad \big(\sigma_t^{\text{enc}}\big)^2 = \big(w_\sigma^{\text{enc}}\, h_\phi(x_t) + b_\sigma^{\text{enc}}\big)^2, \qquad (12)$$

where $h_\phi$ denotes an activation function, and $w_\mu^{\text{enc}}$, $w_\sigma^{\text{enc}}$, $b_\mu^{\text{enc}}$ and $b_\sigma^{\text{enc}}$ are parameters. Based on a deep neural network, we parameterize the mean and variance of a Gaussian distribution.

We can obtain another representation of the latent space of the initial and goal frames through a transition function $z^\star_t = g(z_t)$. The transition function $g$ can be, e.g., a multilayer perceptron (MLP). In the sequel, we will use $z^\star$ in lieu of $z$. This transition allows the Gaussian-shaped latent space to be replaced by other kinds of shapes.

During training, $\epsilon_{t+1}$ is predicted from $z_t$ and $x_{t+1}$; after training, for generating movements, $\epsilon$ is sampled from the prior, without $z_t$ or $x_{t+1}$. $\epsilon$ follows a diagonal Gaussian distribution,

$$\epsilon_{t+1} \sim p_\chi(\epsilon_{t+1} \,|\, x_{t+1}, z_t) = \mathcal{N}\big(\mu_t^{\text{noise}}, \Sigma_t^{\text{noise}}\big), \qquad (13)$$

where $\chi$ are the weights of the neural network; the mean $\mu_t^{\text{noise}}$ and covariance $\Sigma_t^{\text{noise}}$ are encoded by a neural network.

3) Generative model: We segment a movement into subsequences of length $l$. The generative model includes decoders with $l = 1$,

$$q_\phi(z_t|x_t) = \mathcal{N}\big(\mu_t^{\text{enc}}, \Sigma_t^{\text{enc}}\big), \qquad (14)$$

and with $l > 1$,

$$q_{\hat{\phi}}(z_{t+1} \,|\, z_t, f_t, \epsilon_t). \qquad (15)$$

Consequently, the generated observation $\hat{x}$ is

$$\hat{x}_t \sim p_\theta(x_t|z_t) = \mathcal{N}\big(\mu_t^{\text{dec}}, \Sigma_t^{\text{dec}}\big). \qquad (16)$$

(15) is the transition from $z_t$ to $z_{t+1}$, which is described in the next subsection.

We take a Gaussian distribution with mean and constant covariance. Reconstruction (“decoding”) of $x$ is done by another neural network,

$$\mu_t^{\text{dec}} = w_\mu^{\text{dec}}\, h_\theta(z_t) + b_\mu^{\text{dec}}, \qquad \big(\sigma_t^{\text{dec}}\big)^2 = \big(b_\sigma^{\text{dec}}\big)^2, \qquad (17)$$

where $h_\theta$ is the activation function, and $w_\mu^{\text{dec}}$, $w_\sigma^{\text{dec}}$, $b_\mu^{\text{dec}}$ and $b_\sigma^{\text{dec}}$ are parameters.

Fig. 2: Neural network structure of the time-dependent VAE at time step $t$. The network structure is similar to a regular autoencoder, with the addition of the DMP transition function in the center. The nodes $\epsilon_{t-1}$ are stochastic, with mean and variance created by the neural network. The autoencoder takes $z_{t-1}$ together with $x_t$ as input.

Reconstruction through $z$ from both (14) and (15) enforces the $z_{t+1}$ from the latent transition to approximate the $z_{t+1}$ from the encoder. Therefore, the training model contains both $z$'s of subsequence length 1, which allows $x_t$ of every time step to be reconstructed from $z_t$ of (14), and of subsequence length $n$.

4) Transition model: In the latent space, we predict the local transformation parameters of $f$ and $\epsilon$ (see Fig. 2). The transition is described as

$$z_{t+1} = f(z_t, f_t, \epsilon_t), \qquad (18)$$

where $f$ is the function of (9) and $\epsilon$ is sample-specific noise. With meaningful transition priors, this avoids overfitting and yields meaningful manifolds in the latent space.

The initial $\dot{z}$ at the start of the sequence can be predicted directly from the encoder, with $x_{1:m}$ as the input and $\{z, \dot{z}\}$ as the output, where $1 < m \leq n$. Alternatively, another approach initializes $\dot{z}$ through the first step $z_1$ and its previous step $z_0$ using (8).

D. Learning

The learning process uses stochastic gradient variational Bayes. Based on DVBF, the lower bound is rewritten as

$$\mathcal{L}(x_{1:n}, \theta, \phi \,|\, f_{1:n}) = \mathbb{E}_{q_\phi}\big[\log p_\theta(x_{1:n}|z_{1:n}) - \log q_\phi(\epsilon_{1:n}|x_{1:n}) + \log p(\epsilon_{1:n})\big] \qquad (19)$$
$$= \mathbb{E}_{q_\phi}\big[\log p_\theta(x_{1:n}|z_{1:n})\big] - \mathrm{KL}\big(q_\phi(\epsilon_{1:n}|x_{1:n}) \,\|\, p(\epsilon_{1:n})\big), \qquad (20)$$


where $\epsilon_{1:n} = \{\epsilon_1, \epsilon_2, \ldots, \epsilon_n\}$ and $f_{1:n} = \{f_1, f_2, \ldots, f_n\}$. $f$ is encoded into $z$ in (19).

During the training process, $\epsilon$ can act as a shortcut to encode all of the dynamical information; as a result, $f$ does not learn meaningful values. An annealing schedule improves the training of $f$ and smooths out these local minima [16]. (19) is then written as

$$\mathcal{L}(x_{1:n}, \theta, \phi \,|\, f_{1:n}) = \mathbb{E}_{q_\phi}\big[c_{t_a} \log p_\theta(x_{1:n}|z_{1:n}) - \log q_\phi(\epsilon_{1:n}|x_{1:n}) + c_{t_a} \log p(\epsilon_{1:n})\big], \qquad (21)$$

where $c_{t_a} = \min(1, 0.01 + t_a/T_a)$ and $t_a$ increases linearly from 0 after every training epoch until $c_{t_a}$ equals 1.
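The schedule itself is a one-liner; a sketch, assuming $t_a$ counts training epochs:

```python
def anneal_coefficient(t_a, T_a):
    """Annealing weight c_ta of Eq. (21): ramps linearly from 0.01 to 1."""
    return min(1.0, 0.01 + t_a / T_a)
```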

Having only a small number of demonstrations increases the difficulty of training, since Monte Carlo estimation is used in (18) as in [10]. Additionally, a long demonstration sequence increases the training time and weakens the structure of the latent space at the start of the sequence, because of vanishing and exploding gradients. Therefore, we split a sequence up into overlapping subsequences to increase the batch size and shorten the sequence. Conversely, with too-short subsequences the model has difficulty recognizing the dynamics of the movement, and consequently the latent space might be unstructured. Thus the length of the subsequences, $l_{\text{sub}}$, is a hyperparameter that can be searched during training.
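Such a splitting might look as follows; the stride parameter is our own generalization, since the paper does not specify the overlap:

```python
def split_subsequences(x, l_sub, stride=1):
    """Split one demonstration x of shape (n, d) into overlapping
    windows of length l_sub; a larger stride reduces the overlap."""
    return [x[i:i + l_sub] for i in range(0, len(x) - l_sub + 1, stride)]
```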

E. Multi-demonstration model

We extend VAE-DMP by training a single, shared latent space with multiple motion sequences. With multiple motion sequences [17], the model (a) is not limited by the number of sequences, and (b) is more adaptable to new movements, e.g., movement switching or goal changing.

Instead of learning the weights $w$ for each type of movement individually, we use a neural network which recognizes the correct weights from the observations. That way we obtain a continuous set of weights, where the mapping between movement type and weights is trained by backpropagation. This neural network has the following shape:

$$w(x_{1:n}) = g_2\big(g_1(x_1), g_1(x_2), g_1(x_3), \ldots, g_1(x_n)\big), \qquad (22)$$

where $g_1$ and $g_2$ are each fully connected neural networks. We will call this whole network MLPw.

F. Movement switching

The first generalization of the model is switching the movements after training by simply switching the weight $w$ of $f$ and the goal $z_g$. We have $\eta_i \in [0, 1]$ for movement $i$. The weight and goal of the new movement at step $t$ are

$$w^\star_t = \sum_i \frac{\eta_{i,t}\, w_{i,t}}{\sum_i \eta_{i,t}}, \qquad z^\star_{g,t} = \sum_i \frac{\eta_{i,t}\, z_{i,g,t}}{\sum_i \eta_{i,t}}. \qquad (23)$$

Switching from movement A to movement B, $\eta_A$ is changed from 1 to 0, and $\eta_B$ from 0 to 1. Given the starting point and the duration of the switching, $\eta_A$ and $\eta_B$ can be determined.

Goal changing enables the latent value to transfer from movement A to movement B, while the force term enables the movement to follow the demonstration trajectory of B after switching. If we only change the goal but not the force term, the movement also switches from A to B; however, the trajectory is quite random after switching.
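In code, (23) is a normalized convex combination applied to both the weights and the goal; the linear $\eta$ schedule below mirrors the switching experiment in Sec. III (the function names and the schedule shape are our assumptions):

```python
import numpy as np

def blend(etas, values):
    """Normalized convex combination of per-movement quantities, Eq. (23)."""
    etas = np.asarray(etas, dtype=float)
    return np.tensordot(etas, np.asarray(values), axes=1) / etas.sum()

def eta_schedule(t, t_start=20, t_end=38):
    """Linear switch from movement A to movement B."""
    a = np.clip((t_end - t) / (t_end - t_start), 0.0, 1.0)
    return np.array([a, 1.0 - a])      # [eta_A, eta_B]
```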

G. Goal changing

The flexible properties of DMPs are retained in VAE-DMP. For instance, it is able to reach new set points by simply changing the goal after training, while keeping the invariance properties. Following [1], to keep the invariance properties, an extra term $(g - y_0)$ is multiplied onto the right-hand side of (2). Alternatively, it can be multiplied by $(g - y_0)/(|g_{\text{fit}} - y_{0,\text{fit}}| + \eta)$ to avoid problems when $y_0$ and $g$ coincide, where “fit” refers to the training data and $\eta$ is a very small positive constant.
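As a sketch, the rescaled forcing term reads (the concrete value of $\eta$ is our own choice):

```python
import numpy as np

def scaled_forcing(f_t, g_new, y0, g_fit, y0_fit, eta=1e-6):
    """Forcing term of Eq. (2) rescaled for a new goal, cf. Sec. II-G."""
    return f_t * (g_new - y0) / (np.abs(g_fit - y0_fit) + eta)
```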

Other reformulations of DMP, e.g., for obstacle avoidance [18], can be implemented in VAE-DMP straightforwardly, but here we only focus on goal changing.

III. EXPERIMENTS

Experiments with two data sets are performed to evaluate VAE-DMP. The first data set contains optical tracking data of human movements. The second data set is obtained from 6-DoF robot arm simulations.

A. High-dimensional human movement

The data set for the first set of experiments is the CMU Graphics Lab Motion Capture Database, which is a part of the KIT Whole-Body Human Motion Database.¹ These data are downsampled and preprocessed as described in [5], and the resulting data consist of multiple movements, each of which is coded in 70 time steps of 50-dimensional vectors. The data are normalized to zero mean for every joint. Since we focus on learning relative movements, the body center is fixed. We split the movements up into subsequences with $l_{\text{sub}} = 10$ time steps.

The DMPs are set to critical damping by setting $\beta = \alpha/4$ [1]. The VAE architecture has 5 layers with $d$ inputs, 200 hidden neurons, $d'$ latents, 200 hidden neurons, and $d$ outputs, where rectifier and identity activations are used for the hidden layers and the output layer, respectively. The structure of the VAE is shown in Fig. 2. The neural network $g_2$ inside the MLPw network has $70 \times h$ inputs and $50 \times d'$ outputs. The $g_1$ network takes 50 inputs and outputs $h$, where $h$ is 10 for the humanoid experiments and 5 for the robot experiment. Both $g_1$ and $g_2$ have no hidden layers. $g_1$ uses softmax as activation function and $g_2$ uses the identity activation function. The structure and sizes of the MLPw network can be seen in Fig. 3. The hyperparameters of DMP and VAE are chosen by grid search, based on the reconstruction error and the training time.

1) Learning different movements: In this task we train multiple demonstrations of walking, kicking, taichi, and punching from 5 subjects (viz. subjects 35, 74, 49, 120, and 143), all in one and the same model, and compare the results to training each movement type in a single, specialised model (called “VAE-DMP single”). We also compare the results to our previous “AE-DMP single” [5], in which we train one movement per autoencoder.

¹ https://motion-database.humanoids.kit.edu


Fig. 3: Structure of MLPw with input and output sizes. The $g_1$ networks take the 50-dimensional input $x_t$ and have an output with $h$ dimensions. $g_2$ has $70 \times h$ inputs and $50 \times d'$ outputs. $h$ is 10 for the humanoid experiments and 5 for the robot experiments. $d'$ is the number of latent dimensions.

Fig. 4: Reconstruction error of the movement modeling, compared for AE-DMP single, VAE-DMP 2D/3D/5D/7D, and VAE-DMP single. The error is the MSE ± SD of every single subject, averaged over all joints over a whole movement, in radians. The left and middle panels show MSE and SD averaged over 5 subjects. The right panel shows the mean output SD (directly predicted from the encoder) of the VAE-DMP reconstruction; it is a constant value for every joint during the whole movement, so the output SD of VAE-DMP 2D, 3D, 5D, and 7D has no variance.

Fig. 4 shows the pose reconstruction error of our previous AE-DMP single with a 5D latent space, of VAE-DMP with 2D, 3D, 5D, and 7D latent spaces, and of VAE-DMP single with a 5D latent space. For VAE-DMP single we used 120 rather than 200 hidden units in each hidden layer. Noticeable is the improvement over our previous model, AE-DMP single. When sufficiently many latent variables are chosen (in this case, 5 or 7), the error is much lower, even for the approach where all poses are represented in one model. The lowest reconstruction error is obtained by VAE-DMP single, with 7D VAE-DMP a close second.

Fig. 5: Movement distribution in the 2D latent space. $z_1$ and $z_2$ are the two latent dimensions, and every body pose in the joint space is generated from its corresponding latent state.

Fig. 6: Trajectories of five human movements in the 2D latent space using (a) VAE and (b) VAE-DMP. The movements are colored as in Fig. 5.

Although the reconstruction accuracy of VAE-DMP with a 2D latent space is inferior to that with a 7D latent space, we can more easily plot the former. Fig. 5 therefore illustrates the distribution of five movements in the latent space of VAE-DMP 2D. The patterns of the various generated movements can be seen. Walking is periodic, while punching is a single line in latent space. Kicking is a large-range movement, so it covers a large range in the latent space, while balancing involves only relatively slight movement, and its range in the latent space is small.

The latent space of VAE-DMP is more meaningful than that of VAE (see Fig. 6). In the VAE latent space, a sequence of movement may spread far, with varying spacings, for complex movements such as taichi, since the VAE does not encode time information into the latent space. Accordingly, big gaps between two time steps may cause difficulties for DMPs. In addition, a large geometric distance in joint space may result in a small distance in the VAE latent space, as for kicking. By correctly encoding the geometric distance from the joint space, VAE-DMP improves the multi-demonstration modeling ability compared with VAE.


Fig. 7: Nine movements represented in the 3D latent space of a single VAE-DMP.

Fig. 8: Smoothness of the latent space. The plot represents values of $dx/dz$, which vary between 4.7 and 7.4 for a 2D VAE-DMP. Higher values, indicated by lighter areas, mean that a step in latent space corresponds to a larger body movement in $x$. The colored points correspond to the demonstrated movements, as in Fig. 5. The data are evaluated at $120 \times 120$ grid points in the 2D latent space.

Training with four additional walking movements from other subjects (viz. subjects 86, 91, 114, and 139), we obtain the results shown in Fig. 7; in this case, a VAE-DMP 3D is used. The five walking movements form one cluster, disjunct from the other movements. The leg movements cluster approximately on the positive $z_3$ axis, while the arm movements cluster on the negative $z_3$ axis.

2) Latent space smoothness: The latent space of a VAE-DMP codes the learned movements in a compact, multivariate Gaussian space. As seen in Fig. 5, the demonstration trajectories do not span the whole latent space. But what movements are coded between the demonstrated trajectories? To verify that no discontinuities or “large jumps” occur when sampling a trajectory in latent space, we evaluate $dx/dz$ over the whole latent space of a VAE-DMP. The result for a VAE-DMP 2D is shown in Fig. 8. In the whole latent space, $dx/dz$ varies between 4.7 and 7.4 and is indeed very smooth. Similarly smooth latent spaces are seen in VAE-DMP 3D and VAE-DMP 5D.

Fig. 9: Body postures in the 2D latent space, in the space spanned by the multivariate Gaussian. The area covering a variance of $\sigma = 1$ is plotted.

The joint-space movements generated in the 2D latent space are illustrated in Fig. 9. As $z$ represents a multivariate normal ($\mu = 0$, $\sigma = 1$) distribution, we can immediately compute the probability of a certain movement from its $z$. As the plot shows, the movements between the demonstrated movements are smoothly interpolated. But the movements beyond them also represent viable postures. Only when we move far away from the mean, starting between $2\sigma$ and $3\sigma$, does the latent space represent nonsensical postures. Of course, we can increase the training data set to obtain a larger confidence interval.
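This evaluation can be reproduced with a simple finite-difference sweep over the latent grid. The sketch below assumes a decode function wrapping the trained decoder and our own choice of grid extent; the paper specifies only the 120 x 120 resolution:

```python
import numpy as np

def latent_smoothness(decode, grid=120, extent=3.0, delta=1e-2):
    """Finite-difference estimate of dx/dz on a 2D latent grid, cf. Fig. 8.
    `decode` maps a 2D latent vector to a joint-space pose vector."""
    zs = np.linspace(-extent, extent, grid)
    out = np.zeros((grid, grid))
    for i, z1 in enumerate(zs):
        for j, z2 in enumerate(zs):
            z = np.array([z1, z2])
            dx1 = decode(z + np.array([delta, 0.0])) - decode(z)
            dx2 = decode(z + np.array([0.0, delta])) - decode(z)
            out[i, j] = np.hypot(np.linalg.norm(dx1), np.linalg.norm(dx2)) / delta
    return out
```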

3) Movement switching: A model is trained with multiple demonstrations of walking and jogging; $i = 1$ and $i = 2$ represent jogging and walking in $\eta_i$, respectively. The movements are switched from step 20 until step 38 (see Fig. 10). The start and the duration of the switching can be changed. The model has a 5D latent space, and the two latent dimensions with the largest variance are plotted. After the movement adapts from jogging to walking, it follows the demonstration trajectory of walking. The generated latent values need not reproduce the latent values of the demonstration precisely; the reconstruction ability is more important.

B. Robot simulation for goal changing

In this experiment we simulate a 6-DoF KUKA robot using the Robotics Toolbox [19]. In the data set, the length of a demonstration is 76 steps. The subsequence length is 10 time steps.


Fig. 10: Movement switching from walking to jogging. (a) Movement switching in latent space. (b) Movement switching observed in state space; the figures in the top three rows are plotted every 6 time steps. We choose the left hip to represent the joints, since it exactly follows the walking and jogging rhythm.

For the representation of the movement in the VAE, we use 5 layers with sizes $d$, 100, $d' = 2$, 100, $d$. The input dimension of $g_1$ is $d$ and the output dimension $h$ is 5. $g_2$ takes $n \times h$ inputs and has $50 \times d'$ outputs.

The demonstration $x_i$ is generated by moving the end-effector linearly from a starting point $p_0$ to subsequent points $p_i$, $i \in \{1, 2, \ldots, 5\}$, in Cartesian space. $x_1$ is the demonstration for DMP, while $\{x_1, x_2, \ldots, x_5\}$ are the demonstrations for VAE-DMP (see Fig. 11). $p_6$ and $p_7$ are the results of changing the goal to $[x = 0.5, y = 0.5]$ and $[x = 0.8, y = 0.8]$, respectively, using VAE-DMP with the given force $f_1$. All movements were planar, keeping $z$ constant at 0.3.

For the method comparison, we use the same version of DMP for both VAE-DMP and DMP. As described in Sec. II-G, to keep the invariance properties [1] when changing the goal, the forcing terms $f$ in both VAE-DMP and DMP are multiplied by a scaling term, specifically $(z_{\text{goal}} - z_0)$ for VAE-DMP and $(g - y_0)$ for DMP.

Fig. 12 shows the results of both DMP and VAE-DMP when changing the goal to $[0.8, 0.8]$. The optimal trajectory is the movement to the new goal with the end-effector moving linearly from $p_0$. The optimal trajectory is not contained in the training data set; only its final frame is given as the new goal.

Fig. 11: Robot data set.

It can be seen that VAE-DMP is able to generate a movement close to the optimal trajectory for the changed goal. DMP, in contrast, can only keep the invariance of the demonstration for every joint independently, the effect of which is most clearly seen in joints 2 and 4; in this case, the robot joints are very likely to move out of range. VAE-DMP, on the other hand, correlates the joints: in this experiment, it learns the invariance of the end-effector instead of every single joint. DMP requires manual selection of the important feature to learn (e.g., the joint angles of a robot or the Cartesian position of the end-effector), whereas VAE-DMP is able to learn the optimal trajectory autonomously and reproduces natural movements.

In joint 5, the start and end angles are almost the same in the demonstration. This makes generalization with DMP difficult, as the force term is almost zero. In VAE-DMP, the joints are correlated in the latent space, so the beginning and ending values are not the same.

A video of the results accompanies this paper; it can be downloaded at https://brml.org/projects/dvbf.

IV. CONCLUSIONS

In this paper, we presented a novel approach that embeds DMPs into a time-dependent variational autoencoder. Using a new approach called Deep Variational Bayes Filtering, we can embed DMPs in the latent space of a variational autoencoder, while simultaneously representing a large range of different movements in one single DVBF network. Thus we can also create smooth transitions between different movements. While DMPs can only keep the invariance per single joint independently, our approach allows the model to learn the invariant features of the trajectory in an unsupervised manner.

Our next step is to validate this approach on larger parts of the KIT Whole-Body Human Motion Database. Furthermore, we plan to investigate its use in humanoid motion planning.

V. ACKNOWLEDGEMENTS

This work has been supported in part by the TACMAN project, EC Grant agreement no. 610967, within the FP7 framework programme. The data used in this project can be found at https://motion-database.humanoids.kit.edu or at http://mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217.

Fig. 12: Results of goal changing. The six panels show the angle of joints 1-6 over the time steps, for the demonstration, the optimal trajectory, DMP, VAE-DMP, and the new goal.

REFERENCES

[1] A. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal, “Dynamical movement primitives: learning attractor models for motor behaviors,” Neural Computation, vol. 25, no. 2, pp. 328–373, 2013.

[2] A. Paraschos, C. Daniel, J. Peters, and G. Neumann, “Probabilistic movement primitives,” in Advances in Neural Information Processing Systems, 2013, pp. 2616–2624.

[3] N. D. Lawrence and A. J. Moore, “Hierarchical Gaussian process latent variable models,” in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 481–488.

[4] S. Bitzer and S. Vijayakumar, “Latent spaces for dynamic movement primitives,” in 9th IEEE-RAS International Conference on Humanoid Robots, Dec. 2009, pp. 574–581.

[5] N. Chen, J. Bayer, S. Urban, and P. van der Smagt, “Efficient movement representation by embedding dynamic movement primitives in deep autoencoders,” in 15th IEEE-RAS International Conference on Humanoid Robots, 2015, pp. 434–440.

[6] A. Colome, G. Neumann, J. Peters, and C. Torras, “Dimensionality reduction for probabilistic movement primitives,” in Proceedings of the International Conference on Humanoid Robots (HUMANOIDS), 2014.

[7] H. van Hoof, N. Chen, M. Karl, P. van der Smagt, and J. Peters, “Stable reinforcement learning with autoencoders for tactile and visual data,” IROS, 2016.

[8] M. Karl, M. Soelch, J. Bayer, and P. van der Smagt, “Deep variational Bayes filters: Unsupervised learning of state space models from raw data,” arXiv:1605.06432, 2016.

[9] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[10] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” CoRR, vol. abs/1312.6114, 2013.

[11] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proceedings of the 25th International Conference on Machine Learning, 2008, pp. 1096–1103.

[12] J. Bayer and C. Osendorfer, “Learning stochastic recurrent networks,” in NIPS Workshop on Advances in Variational Inference, 2014.

[13] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio, “A recurrent latent variable model for sequential data,” in Advances in Neural Information Processing Systems, 2015, pp. 2980–2988.

[14] M. Sölch, J. Bayer, M. Ludersdorfer, and P. van der Smagt, “Variational inference for on-line anomaly detection in high-dimensional time series,” ICML Workshop, 2016.

[15] L. Theis, A. v. d. Oord, and M. Bethge, “A note on the evaluation of generative models,” International Conference on Learning Representations, 2016.

[16] S. Mandt, J. McInerney, F. Abrol, R. Ranganath, and D. Blei, “Variational tempering,” in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016, pp. 704–712.

[17] A. Wilson, A. Fern, S. Ray, and P. Tadepalli, “Multi-task reinforcement learning: a hierarchical Bayesian approach,” in Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 1015–1022.

[18] H. Hoffmann, P. Pastor, D.-H. Park, and S. Schaal, “Biologically-inspired dynamical systems for movement generation: automatic real-time goal adaptation and obstacle avoidance,” in IEEE International Conference on Robotics and Automation, 2009, pp. 2587–2592.

[19] P. I. Corke, Robotics, Vision & Control: Fundamental Algorithms in MATLAB. Springer, 2011, ISBN 978-3-642-20143-1.