Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The...

55
CENTER FOR MACHINE PERCEPTION CZECH TECHNICAL UNIVERSITY IN PRAGUE RESEARCH REPORT Safe Autonomous Reinforcement Learning Ph. D. Thesis Proposal Martin Pecka [email protected] August 26, 2015 Supervisor: Tomas Svoboda, Karel Zimmermann The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the SGS15/081/OHK3/1T/13 of the CTU in Prague and by the project GA14-13876S of the Czech Grant Agency. Center for Machine Perception, Department of Cybernetics Faculty of Electrical Engineering, Czech Technical University Technick´ a 2, 166 27 Prague 6, Czech Republic fax +420 2 2435 7385, phone +420 2 2435 7637, www: http://cmp.felk.cvut.cz

Transcript of Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The...

Page 1: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

CENTER FOR

MACHINE PERCEPTION

CZECH TECHNICAL

UNIVERSITY IN PRAGUE

RESEARCHREPORT

Safe Autonomous ReinforcementLearning

Ph. D. Thesis Proposal

Martin Pecka

[email protected]

August 26, 2015

Supervisor: Tomas Svoboda, Karel Zimmermann

The autohors of this work were supported by the EC projectEU-FP7-ICT-609763 TRADR, by the SGS15/081/OHK3/1T/13 ofthe CTU in Prague and by the project GA14-13876S of the Czech

Grant Agency.

Center for Machine Perception, Department of CyberneticsFaculty of Electrical Engineering, Czech Technical University

Technicka 2, 166 27 Prague 6, Czech Republicfax +420 2 2435 7385, phone +420 2 2435 7637, www: http://cmp.felk.cvut.cz

Page 2: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the
Page 3: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

Safe Autonomous Reinforcement Learning

Martin Pecka

August 26, 2015

Abstract

In the thesis we propose, we focus on equipping existing Reinforce-ment Learning algorithms with different kinds of safety constraintsimposed on the exploration scheme.

Common Reinforcement Learning algorithms are (sometimes im-plicitly) assumed to work in an ergodic1, or even “restartable” environ-ment. However, these conditions are not achievable in field robotics,where the expensive robots can’t simply be replaced by a new func-tioning unit when they perform a “deadly” action. Even so, Rein-forcement Learning offers many advantages over supervised learningthat are useful in the robotics domain. It may reduce the amount ofannotated training data needed to train a task, or, for example, elimi-nate the need of acquiring a model of the whole system. Thus we notethere is a need for something that would allow for using ReinforcementLearning safely in non-ergodic and dangerous environments. Definingand recognizing safe and unsafe states/actions is a difficult task itself.Even when there is a safety classifier, it still remains to incorporatethe safety measures into the Reinforcement Learning process so thatefficiency and convergence of the algorithm is not lost. The proposedthesis deals both with safety-classifier creation and the usage of Rein-forcement Learning and safety measures together.

The available safe exploration methods range from simple algo-rithms for simple environments to sophisticated methods based onprevious experience, state prediction or machine learning. Pitifully,the methods suitable for our field robotics case usually require a pre-cise model of the system, which is however very difficult (or evenimpossible) to obtain from sensory input in unknown environment.

In our previous work, for the safety classifier we proposed a ma-chine learning approach utilizing a cautious simulator. For the con-nection of Reinforcement Learning and safety we further examine a

1There is no “black hole” state from which the agent could not escape by performingany action.

1

Page 4: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

modified Gradient Policy Search algorithm. To test the difference ofsafety-enabled learning with normal Reinforcement Learning, we ex-amine the task called Adaptive Traversability and try to devise itssafe and efficient form.

It still remains to examine more methods of connecting Reinforce-ment Learning and safety. Also, we want to try several options for thesafety classifier and compare their precision and capabilities to rep-resent various safety functions. We also need to focus on solving allthese problems with increasing number of dimensions or in continuousspaces instead of discrete ones.

2

Page 5: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

Contents

1 Introduction 6

2 Prerequisites 62.1 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . 62.2 Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . 8

2.2.1 Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.2 Gradient Policy Search . . . . . . . . . . . . . . . . . . 9

3 Problem Description 103.1 Safety definition . . . . . . . . . . . . . . . . . . . . . . . . . . 113.2 Maintaining safety . . . . . . . . . . . . . . . . . . . . . . . . 113.3 Experimental evaluation . . . . . . . . . . . . . . . . . . . . . 12

3.3.1 Adaptive Traversability . . . . . . . . . . . . . . . . . . 12

4 Related work 134.1 State-action space division . . . . . . . . . . . . . . . . . . . . 13

4.1.1 Safety through labeling . . . . . . . . . . . . . . . . . . 144.1.2 Implicit safety . . . . . . . . . . . . . . . . . . . . . . . 16

4.2 Maintaining safety . . . . . . . . . . . . . . . . . . . . . . . . 184.2.1 Safety function-based approaches . . . . . . . . . . . . 184.2.2 Optimal control approaches . . . . . . . . . . . . . . . 204.2.3 Approaches benefiting from prior knowledge . . . . . . 22

4.3 Adaptive Traversability . . . . . . . . . . . . . . . . . . . . . . 24

5 Our Previous Work 255.1 Safety Definition . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.1.1 Safety decoupled from rewards . . . . . . . . . . . . . . 265.2 Learning safety . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5.2.1 Simulator-based safety function learning . . . . . . . . 285.2.2 The learning algorithm . . . . . . . . . . . . . . . . . . 295.2.3 Experimental evaluation . . . . . . . . . . . . . . . . . 33

5.3 Maintaining safety . . . . . . . . . . . . . . . . . . . . . . . . 365.3.1 Egg curling—a testbed for maintaining safety . . . . . 38

5.4 Adaptive Traversability . . . . . . . . . . . . . . . . . . . . . . 395.4.1 Experimental evaluation . . . . . . . . . . . . . . . . . 415.4.2 Comparison to “blinded operator” control . . . . . . . 44

6 Future Work 466.1 Safety as Simulator Parameters Uncertainty . . . . . . . . . . 46

3

Page 6: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

References 47

4

Page 7: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

Acknowledgements

I would like to thank very much to my supervisor doc. Ing. Tomas Svoboda,Ph.D., for a balanced mixture of liberal and directive leadership, as well asfor his questions “from a distance”.

I would also like to thanks my co-supervisor Ing. Karel Zimmermann,Ph.D. for his attention, knowledge and many wise insights.

Last, but not least, I would also like to thank to the whole T. Svoboda’sgroup for a pleasurable and inspiring atmosphere. It is an honor to workthere.

5

Page 8: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

1 Introduction

Many robotic tasks are tackled by Reinforcement Learning (RL) with itera-tive state-action space exploration—e.g. RC helicopter acrobacy [1, 58], adap-tive traversability [65], racing car control [42], quadrupedal locomotion [40]etc. RL essentially needs to sample the state-action space following a socalled exploration strategy 2.

While manually-driven exploration is often prohibitively time consum-ing, autonomous exploration is usually only applied to inherently safe sys-tems (pendulum) or to simulators [58]. We propose methods for making au-tonomous exploration safe even for complicated field robotics systems, wherethe robot-terrain interaction (RTI) can not be fully determined in advance.We test it on the task of autonomous control of the morphology and speedvector of the USAR (Urban Search and Rescue) mobile robotic platformdepicted in Figure 2.

Further organization of this proposal is the following: Section 2 presentsthe basic theory needed for understanding the main part of the thesis; Sec-tion 3 then states the problems to be solved. In Section 4 we show, what isthe current state of the art, and in Section 5 we summarize what we havealready done. Section 6 then closes the proposal with a lookout to the future.

2 Prerequisites

This section serves just as a quick glance at the basic theory we make useof, and which is essential to understand the given problem and solution. Forevery area, we point to more references which cover the topics in detail.

2.1 Markov Decision Processes

Markov Decision Processes (MDPs) are the standard model for deliberatingabout reinforcement learning problems [6]. They provide a lot of simplifica-tions, but are sufficiently robust to describe a large set of real-world problems.

The simplest discrete stochastic MDP comprises of: [34]

• a finite set of states S

2 It is important to note the two different meanings of the word “exploration”. One ofthe meanings could be described as “environment exploration”, and is about building amap of a previously unknown environment. The other meaning is more like “state-actionspace exploration” and it is more about exploring the transition function, rewards, safetyfunction and so on. In the proposed thesis we focus on the latter meaning.

6

Page 9: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

• a finite set of actions A

• a stochastic transition model P : Pt(s, a, s′) = Pr(st+1 = s′ | st =

s, at = a) for each s, s′ ∈ S, a ∈ A, where Pr stands for probability

• and, for RL purposes, also an immediate reward function 3 R : S×A→R

To interpret this definition, we say that at every time instant t the agent(robot) happens to be in a state s, and by executing action a it gets to a newstate s′. Furthermore, executing a particular action in a particular state maybring a reward to the agent (defined by R). For an example of a simple MDP,refer to Figure 1.

Egg on table Egg broken

Push gently

Push hard

ε

1-ε

11

1

Figure 1: A simple MDP. It has two states called Egg on table and Eggbroken, between which it can transition using actions Push gently and Pushhard. When the egg is on the table, by pushing gently, there is a smallprobability ε that the egg will fall off the table and break; otherwise, withprobability 1− ε, the egg remains on the table. On the other hand, once theegg is broken, executing any of the two actions has no effect on the state.Such state is called a final state.

The most important and interesting property of MDPs is the Markovproperty. A look at the definition of the transition model shows that the next

3different definitions also use simpler R : S → R if the reward only depends on thereached state, or more complex R : S × A × S → R if the reward depends on both theaction and the reached state

7

Page 10: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

state only depends on the current state and the chosen action. Particularly,the next state is independent of all the previous states and actions but thecurrent ones.

As we presented above, there is not only one, but (at least) three waysto define the reward function. The choice of a particular definition is usuallyproblem-dependent. Although this inconsistency may seem as a complica-tion, it can be shown that all these definitions are equal (in the means ofexpressive power).

Other extensions of MDPs to continuous states, time or actions are be-yond the scope of this proposal. However, some of the referenced papers makeuse of these continuous extensions, which proved to be useful for practicalapplications.

MDP solvers usually make use of the technique called dynamic program-ming [33, 7, 55], which has polynomial time complexity in the number ofstates and actions. For a more detailed description of MDPs and their solvers,the reader can refer e.g. to the comprehensive work of Geramifard [23].

2.2 Reinforcement Learning

Having an MDP, the task of reinforcement learning “is to find a policy πmapping states to actions, that maximizes some long-run measure of rein-forcement” [34]. The “long-run” may have different meanings, but there aretwo favorite optimality models: the first one is the finite horizon model,where the term J =

∑ht=0 rt is maximized (h is a predefined time horizon

and rt is the reward obtained in time instant t while executing policy π). Thequantity J is called the utility or expected return. The dependency of rt onthe policy is no longer obvious from this notation, but this is the conventionused in literature when it is clear which policy is used. This model repre-sents the behavior of the robot which only depends on a predefined numberof future states and actions.

The other optimality model is called discounted infinite horizon, whichmeans we maximize the discounted sum J =

∑∞t=0 γ

trt with γ ∈ (0, 1) beingthe discount factor. The infinite horizon tries to find a policy that is the bestone taking into account the whole future. Please note the hidden dependencyon the policy π (and the starting state s0)—it is the policy that decides onwhich action to take, which in turn specifies what will the reward be.

In the reinforcement learning setting, it is usually assumed that the tran-sition function is unknown in advance, so we cannot sample trajectories inany other way than executing actions in the real. The same holds for thereward function, which is usually given by the environment in reaction tothe executed action [23]. This is where reinforcement learning substantially

8

Page 11: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

differs from planning.Many algorithms have been devised to find the optimal policy. They

are generally divided into two groups: value iteration algorithms, and policyiteration.

We’ll shortly present Q-learning, a representative of the value iterationgroup, and also the Gradient Policy Search algorithm from the latter group.Other methods as well as solvers for them are described in an overview com-posed by Barto [5].

2.2.1 Q-learning

In Q-learning [62], Q function is built to answer the question: “how good itis to execute action a in state s?” Formally, Q : S ×A → R. A determin-istic policy π is then built as π(s) = arg maxa Q(s, a). A discussion on theperformance of policies constructed this way is given in [64].

Learning of the Q function is an iterative process which utilizes the re-current formula [34]

Q(s, a) := Q(s, a) + α[R(s, a) + γmax

a′Q(s′, a′)−Q(s, a)

]with α ∈ 〈0, 1〉 being the learning rate. Before starting the learning process,the Q function is initialized (either to zeros or randomly).

The agent is allowed to explore the state-action space randomly, or usinga more suitable exploration policy. When the algorithm explores enoughstates and actions, the policy induced from the final Q-function is said to beoptimal. What is interesting on this algorithm is that it is completely model-free. Really, the transition function P is never used here for computations,it is only used “empirically” when executing actions. However, to assureconvergence to the theoretically optimal policy, it is needed to sample eachstate-action pair infinitely often [34].

2.2.2 Gradient Policy Search

In this policy iteration instance, parametrized stochastic policies are used [4].As an example we can define a Gaussian policy π parametrized by a real-valued vector θ as πθ(u|x) = N (θT

(x1

), σ). Please note the change in notation—

we adjusted it here to suit the common practice. States are now denoted byx, actions by u, and πθ(u|x) is the probability of selecting action u given theagent is in state x. Finally, N stands here for the Gaussian distribution, andwe extend the feature vector x with one more dimension to allow the lastelement of θ act as a constant bias factor.

9

Page 12: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

The previously described Value Iteration has as its “elementary unit” astate and it computes the “utility” of all states. On the other hand, Pol-icy Iteration algorithms evaluate the utility of whole policies, on a set oftrajectories. It can be computed e.g. as the estimate of

J(θ) = Jπθ(θ) = Eτ∈Tπθ

|τ |∑k=0

γkrτk

where Tπθ is the set of all trajectories possible for the given policy, γ is thediscount factor, and rτk are the rewards corresponding to the k-th step oftrajectory τ [39].

Using an elegant trick [26], the gradient OθJ is derived and it is shownthat it also doesn’t depend on the transition function (but only on the visitedtrajectories). The gradient is of the form

OθJ(θ) =1

||Tθ||∑τ∈Tθ

|τ |∑k=0

(uτk − θT

(xτk1

))(xτk1

) |τ |∑k=0

γkrτk

Having this update rule, a normal gradient ascent is performed. In the

initialization phase, a previously known (non-optimal) policy may be usedas the starting point, or just a uniformly random one.

Some problems are however so complicated that using an algorithm thatdoes not make use of the transition function would result in an enormousneed of training samples. When the transition function is not known inadvance, there is the option of estimating it on-the-fly together with learningthe policy—e.g. the PILCO algorithm [15] uses this approach, or model-based policy search methods [16].

3 Problem Description

To formally define the problem we try to solve, let us assume we have anagent/robot in a partially unknown environment which behaves like MDP.The MDP can have both continuous states and actions. The task is to trainan optimal policy for a given task (defined by rewards), while keeping therobot away from unsafe states both during learning and during execution ofthe learned policy. The seeking for an optimal policy (in whatever sense)ensures a trivial safety function rejecting all actions is not a viable option.

We believe restricting the area of interest only on MDPs is no real draw-back. It allows for uncertain action results (which is a need to model real-world behavior), and the restriction that the following state only depends on

10

Page 13: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

the previous one and on the selected action seems to be plausible in mostcases. There can, of course, be series of actions that are safe independently,but unsafe when executed sequentially (they can e.g. lead to overheating). Ifsuch a sequence is identified, it can always be modeled as another dimensionof the state and we are back to MDPs.

This section presents the problems to be solved; in Section 4, an overviewof the state of the art is given, and in Section 5 we give a description ofour achievements so far. In detail, the problem we solve can be dividedinto three main parts—safety definition (Section 3.1), safety implementation(Section 3.2) and experimental evaluation (Section 3.3).

3.1 Safety definition

It is needed to specify more precisely what exactly is the safety we wantto maintain. Unfortunately, there is no unified definition that would satisfyall use cases; thus, several different approaches are found in the literature.See Section 4 for an overview. An intuitive (but vague) definition could bee.g.: “State-space exploration is considered safe if it doesn’t lead the agentto unrecoverable and unwanted states.”

Practically, for the TRADR robotic platform described in Figure 2, wedo not want (at least) the following to happen during exploration:• fall down from a height more than a few centimeters• top over• hit hard an obstacle• physical contact between the laser scanner and terrain or other solid

parts of the environment• physical contact of the omnicamera and terrain/environment• physical contact of the antenna and terrain/environment• slip down on an inclined terrain (this is dangerous because the results

are unpredictable)All these requirements can be easily “translated” into state variables, so byplacing these safety requirements we define a subset of the state space whichis considered safe. The ultimate goal of our thesis is to restrict the state-action space so that unsafe states are never visited.

3.2 Maintaining safety

Having the safety defined, we can proceed to asking how should the safety bemaintained. The first observation is that the robot is not allowed to go to anew state and evaluate the safety function only after that, as would be usualin other reinforcement learning settings. Otherwise it could enter an unsafe

11

Page 14: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

state and, by definition, there is no recovery from such state. So we are leftwith two options: either to train a state-action safety function (which wouldwork, since we always know the current state and the selected action), orto estimate the transition function to do a one-step lookahead. The safetyfunction is then evaluated on the predicted state, and the result determinesthe given action’s safety.

The first option, training a state-action safety function, does not seemsuitable. In the training phase, only “positive” examples (safe state-actionpairs) can be given. And training a binary classifier on data from onlyone class could be very tricky or even impossible. Either it would be toobenevolent, or the generalization would be very poor, or we would need toshow it all the points on the “boundary” between safe and unsafe, which alsoseems like an impossible task. So we concentrate on the last option—utilizingan estimate of the transition function.

Another problem arises when we already know which actions are safe andwhich are not. If the exploration strategy from reinforcement learning wouldsuggest to execute an unsafe action, what should happen? It is obviousthe unsafe action cannot be executed; but which other action to select inorder not to break the exploration strategy? In the thesis we will deal withthis problem and develop modifications of existing reinforcement learningalgorithms that take safety into account while preserving efficiency.

3.3 Experimental evaluation

All concepts devised in the previous sections need to be experimentally eval-uated. It will also be a task of the thesis to define good experiments whichwould show the advantages of safe exploration. As a “plain” (not safe bydefinition) reinforcement learning task to compare to we chose the Adap-tive Traversability problem evaluated on the TRADR robotic platform (seeFigure 2 for a brief description of the robot).

3.3.1 Adaptive Traversability

The task of Adaptive Traversability is to autonomously control the four flip-pers of the TRADR robotic platform (purple in Figure 2) and its speedvector based on proprioceptive and local exteroceptive data, while still ap-proximately following the path given by a human operator. The benefitof such control is decreasing the “cognitive load” of the human operator—otherwise there are too many degrees of freedom. Adapting the flipper pose(and possibly also decreasing speed or a slight change in direction) helps the

12

Page 15: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

Figure 2: Left: An illustration of the TRADR robotic platform. Coloredoverlays denote the important parts of the robot: red - panoramic camera;green - rotating laser scanner; purple - articulated subtracks (flippers); yel-low - additional equipment (attached on demand). Right column: furtherpictures from training missions.

human operator to concentrate on the task he or she is fulfilling, while the“inferior” duties are left for AT.

4 Related work

Now it is a good time to look around on what others do in the field ofSafe Exploration. Here we present and categorize the existing works thattouch some of our problems. Unfortunately, safe exploration is a ratheryoung field, so there is no settled journal or conference dedicated wholly tosafe exploration. The most of relevant literature is found in reinforcementlearning and optimal control journals and conferences.

4.1 State-action space division

Each work related to safe exploration needs to define safety. Because thefield has still been settling down, there is however no generally recognizeddefinition that would suit all needs. This leads to fragmentation, and theresult is that almost every research group uses its own definition of safety.

13

Page 16: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

I-shape V-shape L-shape U-shapesoft

U-shapehard

Figure 3: Flipper configurations: Five distinct flipper configurations aredistinguished. Red denotes low flipper compliance, green denotes high flippercompliance: (i) I-shape with unfolded flippers (useful for traversing holesor stairs), (ii) V-shape with flippers folded in order to provide the best ob-servation capabilities to the robot, (iii) L-shape with front flippers raised(suitable for climbing up), (iv) U-shape soft, pushing the flippers downwith low pressure—low compliance threshold set (suitable for smooth climb-ing down), and (v) U-shape hard, pushing the flippers down with highpressure—high compliance threshold set (allows lifting the robot’s body).

Figure 4: Digital elevation map (DEM): an example of the Digital Eleva-tion Map together with the robot. Color encodes terrain heights in particularbins. These heights are directly used as state description features.

The following categorization is thus our subjective view which tries to bringsome order to the variety of definitions found in literature.

We divide safety definitions into Safety through labeling (Section 4.1.1)and Implicit safety (Section 4.1.2).

4.1.1 Safety through labeling

The largely most used definition of safety is labeling the states/actions withone of several labels indicating the level of safety in that state/action. Whatvaries from author to author is the number and names of these labels.

To start with, Hans et al. [27] have the most granular division of state/actionspace, and we thus chose to use it as the all-encompassing division of statesand actions for this comparison. Their definitions are as follows (slightly

14

Page 17: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

reformulated):

• an (s, a, r, s′) tuple (transition) is fatal if the reward r is less thana certain threshold (s is the original state, a is an action and s′ is thestate obtained after executing a in state s, yielding the reward r),

• an action a is fatal in state s if there is non-zero probability of leadingto a fatal transition,

• state s is called supercritical if there exists no policy that wouldguarantee no fatal transition occurs when the agent starts in state s,

• action a is supercritical in state s if it can lead to a supercriticalstate,

• state s is called critical if there is a supercritical or fatal action in thatstate (and the state itself is not supercritical),

• action a is critical in state s if it leads to a critical state (and theaction itself is neither supercritical nor fatal in s),

• state s is called safe if it is neither critical nor supercritical,

• action a is safe in state s if it is neither critical, nor supercritical, norfatal in state s,

• and finally a policy is safe if for all critical states it leads to a safe statein a finite number of non-fatal transitions (and if it only executes safeactions in safe states).

To be able to suit the other definitions to the scheme of Hans et al., itis needed to define one more category. A state s is called fatal if it is anundesired or unrecoverable state, e.g. if the robot is considered broken inthat state. The fatal transition can then be redefined as a transition endingin a fatal state. Opposite to the precisely defined terms in Hans’ definition,the meaning of words “undesired” and “unrecoverable” here is vague andstrongly context-dependent.

Continuing on, Geibel [22] defines only two categories—fatal and goalstates. “Fatal states are terminal states. This means, that the existence ofthe agent ends when it reaches a fatal state” [22]. This roughly correspondsto our defined set of fatal states. Goal states are the rest of final statesthat correspond to successful termination. Since Geibel only considers ter-minal states for safety, his goal states correspond to a subset of safe states.

15

Page 18: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

The other categories of Hans et al. need not be represented, since they aremeaningless for final states.

An extension of Geibel’s fatal and goal states is a division presented byGarcia and Fernandez [21]. Their error and non-error states correspond tofatal and goal states, but the authors add another division of the space—theknown and unknown states, where known states are those already visited(and known have empty intersection with error). They then mention a pre-requisite on the MDP that if an action leads to a known error/non-errorstate, then its slight modification must also lead to an error/non-error state(a metric over the state space is required).

In the work of Ertle et al. [20], again the two basic regions are considered—they are called desired and hazardous (corresponding to safe and fatal).However, due to the used learning technique, one more region emerges—theundesired region. It contains the whole hazardous region and a “small span”comprising of desired states, and denotes the set of states where no training(safe) samples are available, because it would be dangerous to acquire thosesamples. In particular, they say that “The hazards must be ‘encircled’ bythe indications of the undesired approaching so that it becomes clear whicharea [. . . ] is undesired” [20].

A summary of the labeling-based definitions is given in Figure 5. Aftera short examination of the figure, it can be seen that no two authors agreeon a single definition (and it is not only a word-play, the definitions containdifferent parts of the state-action space).

4.1.2 Implicit safety

Apart labeling, safety can be defined by some properties the states or policiesmust possess to be called either safe or unsafe. We present a recent one whichutilizes the notion of ergodicity.

An MDP is called ergodic iff for every state there exists a policy thatgets the agent to any other state [46]. In other words, every mistake canbe remedied in such MDP. Moldovan and Abbeel [46] then define δ-safepolicies as policies guaranteeing that from any state the agent can get tothe starting state with probability at least δ (using a return policy, whichis different from the δ-safe one). Stated this way, the safety constraint mayseem intractable, or at least impractical—it is even proved that expressingthe set of δ-safe policies is NP-hard [46]. An approximation of the constraintcan be expressed in the terms of two other MDP problems which are easilysolved [46]; that still leads to δ-safe policies, but the exploration performancemay be suboptimal.

In our view, safety through ergodicity imposes too much constraints on

16

Page 19: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

FA

TA

L

SU

PE

RC

RIT

ICA

L

CR

ITIC

AL

SA

FE

FATAL

ERROR

HAZAR-

GOAL

NON-ERROR

UNDESIRED

Defined by Hans et al. Hans et al.

Geibel

Garcıa et al.

Ertle et al.

ACTIONSSTATES

DOUS

SAFECRI-

TICALUNSAFE

DESIRED

Our

Figure 5: A summary of the definitions of safety. The basic division is takenfrom Hans [27] and fatal states are added. States are drawn with solidbackground and white-headed arrows (_) denote the possible actions inthe states. Actions are rendered with striped background and black-headedarrows ( ) end in states where it is possible to end up using the action.

17

Page 20: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

the problems the agent can learn. It sometimes happens that a robot has tolearn some task after which it is not able to return to the initial state (e.g.drive down a hill it cannot climb upwards; a human operator then carriesthe robot back to the starting position). But the inability to “return home”in no means indicates the robot is in an unsafe state. Another case could bee.g. when a resource like battery level is a part of the state. Then it doesn’teven make sense to want the robot to be able to return to a state with morecharged battery.

4.2 Maintaining safety

Our categorization of existing safe exploration techniques is based on thework of Garcia and Fernandez [21]. The basic division is as follows: labeling-based approaches (Section 4.2.1), approaches utilizing the expected return orits variance (Section 4.2.2), and approaches benefiting from prior knowledge(Section 4.2.3).

4.2.1 Safety function-based approaches

The algorithms utilizing some kind of state/action labeling (refer to Sec-tion 4.1.1 for the various labeling types) usually make use of two basiccomponents—a safety function (or risk function) and a backup policy. Thetask of the safety function is to estimate the safety of a state or action. Inthe simplest case, the safety function can just provide the labeling of thegiven state or action; or it can return a likelihood that the state or action issafe; and in the best case, it would answer with a likelihood to be safe plusa variance (certainty) of its answer. The backup policy is a policy that is ableto lead the agent out of the critical states back to the safe area. It is notobvious how to get such a policy, but the authors show some ways how toget one.

In the work of Hans et al. [27], the safety function is learned during theexploration by collecting the so-called min-reward samples—this is the min-imum reward ever obtained for executing a particular action in a particularstate. The backup policy is then told to either exist naturally (e.g. a knownsafe, but suboptimal controller), or it can also be learned. To train thebackup policy, an RL task with altered Bellman equations is used:

Q∗min(s, a) = maxs′

min[R(s, a, s′),max

a′Q∗min(s′, a′)

]A policy derived from the computed Q∗min function is then taken as thebackup policy (as it maximizes the minimum reward obtained, and the fatal

18

Page 21: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

transitions are defined by low reward). They define a policy to be safe, if itexecutes only safe actions in safe states and produces non-fatal transitionsin critical states. To learn such safe policy, Hans et al. then suggest a level-based exploration scheme. This scheme is based on the idea that it is betterto be always near the known safe space when exploring. All unknown actionsfrom one “level” are explored, and their resulting states are queued to thenext “level”. For exploration of unknown actions he proposes that the ac-tion should be considered critical until proved otherwise, so the explorationscheme uses the backup policy after every unknown action execution. Tofollow this algorithm, the agent needs some kind of “path planning” to beable to get to the queued states and continue exploration from them.

The PI-SRL algorithm of Garcia and Fernandez [21] is a way to safeguardthe classical policy iteration algorithm. Since the labels error/non-error areonly for final states, the risk function here is extended by a so called Case-based memory, which is in short a constant-sized memory for storing thehistorical (s, a, V(s)) samples and is able to find nearest neighbors for a givenquery (using e.g. the Euclidean distance). V(s) stands here for the Valuefunction, which is a state-only analogue of Q function4. In addition to theerror and non-error states, he adds the definition of known and unknownstates, where known states are those that have a neighbor in the case-basedmemory closer than a threshold. A safe policy is then said to be a policy thatalways leads to known non-error final states. To find such policy, the policyiteration is initialized with the safe backup policy and exploration is donevia adding a small amount of Gaussian noise to the actions. This approachis suitable for continuous state- and action-spaces.

Another approach is presented in the work of Geibel [22], where the riskand objective functions are treated separately. So the risk function onlyclassifies the states (again only final states) as either fatal or goal, and therisk of a policy (risk function) is then computed as the expected risk followingthe policy (where fatal states have risk 1 and goal states have risk 0). Thetask is then said to be to maximize the objective function (e.g. discountedinfinite horizon) w.r.t. the condition that the risk of the considered policiesis less than a safety threshold. The optimization itself is done using modifiedQ-learning, where the optimized objective function is a linear combinationof the original objective function and the risk function. By changing theweights in the linear combination the algorithm can be controlled to behavemore safely or in a more risk-neutral way.

4Formally, it is the Q-function which is defined in the terms of V, not the other way wepresent it here. But since we do not talk about the Value function in any other contexthere, we simplified it to such statement.

19

Page 22: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

A generalization of the idea of Geibel to take the risk and reward functionsseparately can be found in the work of Kim et al. [36]. In this work, theconstrained RL task is treated as a Constrained MDP and the algorithmCBEETLE for solving Constrained MDPs is shown. The advantage of thiswork is that it allows for several independent risk (cost) functions and doesnot need to convert them to the same scale.

A similar approach of using constrained MDPs to solve the problem isgiven by Moldovan and Abbeel [46]. They do, however, use the ergodicitycondition to tell safe and unsafe states apart. Moreover, this approach isonly shown to work for toy examples like the grid world with only severalthousands of discrete states, which may not be sufficient for real roboticstasks.

The idea of having several risk functions is further developed by Ertleet al. [20]. The agent is told to have several behaviors and a separate safetyfunction is trained for each behavior. This approach allows for modularityand sharing of the learned safety functions among different types of agents.More details on this work will be provided in the next section, because itbelongs to learning with teachers.

As was presented in this section, the labeling-based approaches providea number of different ways to reach safety in exploration. Some of them are,however, limited in several ways—either they provide only estimates insteadof guarantees, they need to visit the unsafe states in order to learn how toavoid them, or they need the state-space to be metric.

4.2.2 Optimal control approaches

Techniques in this category utilize variations of the expected value-variancesafety criterion. The most basic one is treating the rewards as costs (whena reward is denoted by rt, the corresponding cost is denoted by ct). StandardRL methods can then be used to solve the safe exploration task, as describede.g. in [13] for discounted infinite horizon.

The RL objective function

J = E

[∞∑t=0

γtct

](1)

is called the risk-neutral objective. To make this objective risk-sensitive, we

20

Page 23: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

specify a risk factor α and rewrite the objective as: [28]

J =1

αlogE

[exp

∞∑t=0

γtct

)](2)

' E

[∞∑t=0

γtct

]+α

2Var

[∞∑t=0

γtct

](3)

which is also called the expected value-variance criterion. This approachis a part of theory using exponential utility functions, which is popular inoptimal control [44]. A safe policy by this criterion can be viewed as a pol-icy that minimizes the number of critical actions (because fatal transitionsare expected to yield much larger costs than safe transitions, increasing thevariance significantly).

To complete this enumeration, the worst-case objective function (alsocalled the minimax objective) used e.g. by [28] is defined as

J = sup

[∞∑t=0

γtct

]. (4)

However, unless a threshold is set, this definition leads only to the safestpossible policies, which are not necessarily safe at all. Expressing the safetyusing costs is however natural for some RL tasks (e.g. when learning thefunction of a dynamic controller of an engine, the engine’s temperature canbe treated as a cost).

As can be seen, the objective functions containing expectations cannot infact assure that no unsafe state will be encountered. On the other hand, theminimax objective provides absolute certainty of the safety. However, it mayhappen that some of the unsafe states can only be reached with a negligibleprobability. In such cases, the α-value criterion defined by Heger [28] can beused—it only takes into account rewards that can be reached with probabilitygreater than α. In the work of Mihatsch and Neuneier [44], a scheme ispresented that allows to “interpolate” between risk-neutral and worst-casebehavior by changing a single parameter.

The work of Delage and Mannor [18] takes into account the uncertaintyof parameters of the MDP. It is often the case that the parameters of theMDP are only estimated from a limited number of samples, causing theparameter uncertainty. He then proposes a possibility that the agent may“invest” some cost to lower the uncertainty in the parameters (by receivingsome observations from other sources than exploration). A completely newresearch area then appears—to decide whether it is more valuable to pay the

21

Page 24: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

cost for observations, or to perform exploration by itself. We further extendthis idea in our work.

An approximation scheme for dealing with transition matrix uncertaintyis also presented by Nilim and El Ghaoui [48]. It considers a robust MDPproblem and provides a worst-case, but also robust policy (with respect tothe transition matrix uncertainty).

A theory generalizing these approaches can be found in the work of Schnei-der [59]. The theory states that the optimal control decision is based on threeterms—the deterministic, cautionary and probing terms.

The deterministic term assumes the model is perfect and at-tempts to control for the best performance. Clearly, this may leadto disaster if the model is inaccurate. Adding a cautionary termyields a controller that considers the uncertainty in the modeland chooses a control for the best expected performance. Finally,if the system learns while it is operating, there may be somebenefit to choosing controls that are suboptimal and/or risky inorder to obtain better data for the model and ultimately achievebetter long-term performance. The addition of the probing termdoes this and gives a controller that yields the best long-termperformance.[59]

To conclude this section, we think most of these methods are not wellsuited for exploration that is really safe—the expected value-variance andsimilar criteria provide no warranties on the actual safety. On the otherhand, the worst-case approaches seem to be too strict. Only the parameteruncertainty methods provide both reasonable performance and guaranteeswhen a low level of uncertainty can be achieved.

4.2.3 Approaches benefiting from prior knowledge

The last large group of safe exploration techniques are the ones benefitingfrom various kinds of prior knowledge (other than the parameters of theMDP). We consider this group the most promising for safe exploration, be-cause “it is impossible to avoid undesirable situations in high-risk environ-ments without a certain amount of prior knowledge about the task”[21].

The first option how to incorporate prior knowledge into exploration is toinitialize the search using the prior knowledge. In fact, several works alreadymentioned in previous sections use prior knowledge—namely the approacheswith a backup policy (Hans et al. [27], Garcia and Fernandez [21]). Also,Garcia and Fernandez suggest that the initial estimate of the value functionor Q-function can be done by providing prior knowledge, which results in

22

Page 25: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

much faster convergence (since the agent does no more have to explore reallyrandom actions, the estimate of the value function already “leads it” theright way) [21].

A different approach is using the methods of reachability analysis [45] tosolve safe exploration. Gillula and Tomlin in their work [25, 2] define a setof keep-out states (corresponding to unsafe in our labeling) and then a setcalled Pre(τ) is defined as a set of all states from which it is possible to get toa keep-out state in less than τ steps. Reachability analysis is used to computethe Pre(τ) set. Safe states are then all states not in Pre(τ) for a desiredτ . To compute the Pre(τ) set, a partial transition model of the system isused—it composes from a known part of the model (e.g. derived differentialequations), and an unknown, but deterministic, part, which stands for allinfluences not modeled by the first term (wind, friction. . . ). The estimate ofthe unknown portion is then validated online during learning, and wheneverthe estimate is poor, a backup policy is used and the estimate is reset. Thesystem must use safe actions in the Pre(τ) set, while it can do whateveraction desired outside.

Another option how to incorporate prior knowledge is by using Learningfrom Demonstration (LfD) methods. Due to the limited space, we will notgive the basics of LfD—an overview of the state-of-the-art methods is forexample in [3]. For our overview, it is sufficient to state that LfD methodscan derive a policy from a set of demonstrations provided by a teacher.What is important is that the teacher does not necessarily have to havethe same geometrical and physical properties as the trainee (although ithelps the process if possible). It is therefore possible to use LfD to teacha 5-joint arm to play tennis, while using 3-joint human arm as the source ofdemonstrations (but the learned policy may be suboptimal; RL should thenbe used to optimize the policy).

In Apprenticeship Learning [1], the reward function is learned using LfD.The human pilot flies a helicopter at his best, and both system dynamicsand the reward function are learned from the demonstrations. It is howeverapparent that the performance of the agent is no longer objectively optimal,but that it depends on the abilities of the human pilot.

Another way of incorporating prior knowledge into the learning processis to manually select which demonstrations will be provided, as in the workof Ertle et al. [20]. It is suggested that more teacher demonstrations shouldcome from the areas near the unsafe set, in order to teach the agent preciselywhere the border between safe and unsafe is located.

The last technique described in this section is interleaving autonomousexploration with teacher demonstrations. As in the previous case, someteacher demonstrations are provided in advance, and then the exploration

23

Page 26: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

part starts utilizing the teacher-provided information. After some time, orin states very different from all other known states, the agent requests theteacher to provide more examples [3, 11]. The idea behind this algorithm isthat it is impossible to think out in advance what all demonstrations will theagent need in order to learn the optimal safe policy.

Finishing this section, the algorithms utilizing prior knowledge seem to bethe most promising out of all the presented approaches. They provide botha speedup of the learning process (by discarding the low-reward areas) anda reasonable way to specify the safety conditions (via LfD, partial transitionfunction, or interleaving).

4.3 Adaptive Traversability

Many approaches focus on optimal robot motion control in an environmentwith a known map, leading rather to the research field of trajectory planning.In contrary to planning [12, 8], AT can easily be exploited in a previouslyunknown environment and hence provide a crucial support to the actualprocedure of map creation. We rather perceive AT as an independent com-plement to trajectory planning and in no way a substitution.

Many authors [24, 12, 43, 9, 37, 29, 61] estimate terrain traversabilityonly from exteroceptive measurements (e.g. laser scans) and plan the (flip-per) motion in advance. In our experience, when the robot is teleoperated ina real environment, it is not possible to plan the flipper trajectory in advancefrom the exteroceptive measurements only. There are mainly three reasons:(i) it is not known in advance, which way is the operator going to lead therobot, (ii) exteroceptive measurements are usually only partially observable,(iii) analytic modeling of Robot-Terrain Interaction (RTI) in a real envi-ronment cannot be inferred from exteroceptive measurements only, becausethe robot can for example slip or the terrain may deform. Especially, Hoet al. [29] directly predict the terrain deformation from exteroceptive mea-surements only to estimate traversability. However when the terrain underthe robot collapses unexpectedly, its profile must be updated without extero-ceptive measurements. Hence, rather reactive control based on all availablemeasurements is needed. Therefore, planning the flipper motion in advanceis not a viable solution for teleoperated USAR missions, where the robot isoften teleoperated in an unknown, dynamically changing environment.

An ample amount of work [63, 38, 19, 41, 31, 35] has been devoted tothe recognition of manually defined discrete classes (e.g. surface type, ex-pected power consumption or slippage coefficient). However, such classes areoften weakly connected to the way the robot can actually interact with theterrain. Few papers describe the estimation of RTI directly, for example,

24

Page 27: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

Kim et al. [37] estimate whether the terrain is traversable or not, and Ojedaet al. [49] estimate power consumption on a specific terrain type. In theliterature, the RTI properties are usually specified explicitly [49, 30, 37] orimplicitly (e.g. state estimation correction coefficient [57, 56]). Neverthe-less, RTI properties do not directly determine the optimal reactive control,therefore we tackle the AT task by reinforcement learning.

Contrary to e.g. [1], we rather focus on a model-free reinforcement learn-ing technique and instead of using Value-based algorithms, we use the Q-learning technique. The Q-function is approximated either by (i) regressionforests, which are known to provide good performance when a huge trainingset is available [60] while allowing for online learning, or (ii) Gaussian pro-cesses, which are an efficient solution in the context of reinforcement learningfor control [17].

5 Our Previous Work

In the last two years, we have published two works related to robot safety. Inthe first one [52], we explored the state of the art in robot safety and tackledthe problem of safety definition. Selected results of the work are presentedin Section 5.1. The second work [51] then proposes a representation andlearning algorithm for a safety function. Results of the second work areshown in Section 5.2.

In the time of submission of this thesis proposal, we also finish thework on a journal article, which we intend to submit probably to the IEEETransactions on Robotics journal. The name of the article is Safe Adap-tive Traversability with Incomplete Data. In that work we explore ways ofhandling situations where the reinforcement learning decision procedure ismissing some important exteroceptive data (e.g. the laser scanner is re-flected from some surface and its measurement is thus lost). Some of theseyet unpublished results are shown in Section 5.

5.1 Safety Definition

After going through the published approaches to safety definition (Section 4.1),we have made a few observations that led us to a new definition of safetythat should suit all the needs.

The first observation is that creating labels for actions or transitions isunnecessary. If we need to talk about the “level of safety” of an action, wecan use the worst label out of all possible results(=states) of that action.Moreover, as “it is impossible to completely avoid error states” [53], we can

25

Page 28: Safe Autonomous Reinforcement Learningptak.felk.cvut.cz/tradr/share/thesis_proposal.pdf · The autohors of this work were supported by the EC project EU-FP7-ICT-609763 TRADR, by the

ignore the effects of the action which have only small probability (lower thana safety threshold)—we will call such effects the negligible effects.

A second remark is that the fatal and supercritical sets can be merged.In the work of Hans et al. we have not found any situation where distin-guishing between supercritical and fatal would bring any benefit. Specifi-cally, the authors state that: “Our objective is to never observe supercriticalstates” [27], which effectively involves avoiding fatal transitions, too. Andsince we avoid both supercritical and fatal, we can as well avoid their union.

Putting these observations together, we proposed a novelty definition ofsafety for stochastic MDPs, which is a simplification of the model of Hanset al. and a generalization of the other models. The definition is as follows,and is also depicted in Figure 5:

• A state is unsafe if it means the agent is damaged/destroyed/stuck etc., or if it is highly probable that it will get to such a state regardless of further actions taken.

• A state is critical if there is a possible and non-negligible action leading to an unsafe state.

• A state is safe if no available action leads to an unsafe state (however, there may be an action leading to a critical state); a minimal labeling sketch following this definition is given below.
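The sketch assumes a finite MDP whose unsafe set (damaged/destroyed/stuck states together with their inescapable closure) is already known; the dictionary-based transition representation and the threshold name delta are assumptions made only for this example.

def label_states(states, actions, P, unsafe, delta=0.05):
    """P[(x, a)] is a dict {x_next: probability}; `unsafe` is the given set of
    unsafe states.  Effects with probability below `delta` are negligible."""
    labels = {}
    for x in states:
        if x in unsafe:
            labels[x] = "unsafe"
            continue
        # an action is dangerous if it reaches an unsafe state
        # with non-negligible probability
        dangerous = any(P[(x, a)].get(x_next, 0.0) >= delta
                        for a in actions for x_next in unsafe)
        labels[x] = "critical" if dangerous else "safe"
    return labels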

To illustrate the definition on a real example, please refer to Figure 6. In Figure 6a, the UGV is in a safe state, because all actions it can take lead again to safe states (supposing that the movement actions do not move the robot more than a few centimeters). On the other hand, the robot depicted in Figure 6b is in a critical state, because going forward would make the robot fall over and break. If the robot executed the action "go forward" once more, it would get to an unsafe state. Right after executing the action it would still not be broken; however, it would start falling, and that is unsafe, because the robot is not equipped to withstand such a fall and is therefore almost certain to break when it meets the ground.

(a) A safe state.

(b) A critical state: if the robot went still forward, it would fall down and probably break.

Figure 6: An illustration of safe and critical states.

5.1.1 Safety decoupled from rewards

Following the definition, we would like to emphasize that maintaining safety is not necessarily the same as thresholding rewards (which is the most common way to "enable" safety). As an example, let us have a robotic rescuer which gets rewards for saving victims. It was trained in avoiding obstacles, planning efficient paths and so on, and the learned policy would advise going directly to a victim if it notices one. However, the robot was not trained for saving victims in case of fire, so the policy could easily lead the robot to a place with high temperature, which would inevitably destroy it.

If the robot had a safety function implemented, it could simply warn: "going to the hot place is unsafe". Surely the robot could be trained for saving victims in burning buildings, and then it would probably succeed. However, decoupling safety and rewards brings an important simplification to the learning process. Instead of learning how to behave optimally in all combinations of task and environment (victim–fire, victim–rain, victim–acid, exploration–fire, exploration–rain, exploration–acid), it is sufficient to determine the safety functions for the cases of fire, rain or acid, and then to train the optimal behavior for saving victims or for exploration. In our view, the reward-independent safety definition is much more useful.

The decoupled safety is obviously only useful if it can be "plugged into" a given learned policy. Otherwise, we can still make use of a separate safety function during learning, but we would need to train the task again with the given safety function.

The example also illustrates the difference between optimally solving tasks and maintaining safety. The controller for optimal task solution can be represented as a function (mapping) f : S → A, whereas the safety-maintaining controller is rather a set-valued mapping f : S → P(A). This is one more argument in favor of decoupling safety and task-oriented rewards.

5.2 Learning safety

In our previous paper [51] we defined a safety function for policies as

S(θ) = min_{x ∼ π(θ)} s(x),

where s : X → 〈0, 1〉 is the state safety function. We then presented an algorithm for learning the state safety function. The task is to construct a safety function closest to the real safety margins, and not to visit any unsafe states during the training. Before presenting the algorithm itself, let us describe its basic components.
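For illustration, a minimal sketch of estimating S(θ) for a stochastic policy by sampling the states it visits; the rollout helper and the number of rollouts are assumptions made for the example.

def policy_safety(theta, rollout, state_safety, n_rollouts=20):
    """Estimate S(theta) as the minimum of s(x) over states visited by
    pi(theta).  `rollout(theta)` returns the list of states visited in one
    trial; `state_safety(x)` returns s(x) in <0, 1>."""
    visited = [x for _ in range(n_rollouts) for x in rollout(theta)]
    return min(state_safety(x) for x in visited)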

5.2.1 Simulator-based safety function learning

A cautious simulator is the main component that differentiates our work from other safe exploration concepts. We use the simulator to predict safe states among the set of unvisited states (it may be e.g. a simple physical simulator). Cautious means that if the simulator labels a state as safe, it is also safe in the real world. On the other hand, the simulator is allowed to wrongly label states that are safe in the real world as unsafe. Having a cautious simulator is key to the success of our algorithm, and creating such a simulator is (much) easier than constructing a fully plausible physical simulator. We suppose that running the simulator is still (computationally) expensive, so we try to minimize the number of its uses, and we prohibit extensive sampling of the whole state space using the simulator.

Next, we need an experienced human operator who is capable of executing safe trajectories on the robot in the real world. We suppose that this operator has much more (prior or sensory) information than the robot has, and thus he or she can assess the safety of intended actions before executing them. These safe trajectories will be used to initialize the safety function. If we discover an area in the state space that is misclassified by the safety function as unsafe, the operator can reach this area manually, which forces the algorithm to correct the safety estimates for that region.

Combining the simulator and operator results, we can construct the safety function. Such a function takes the state of the robot (the extracted features) and labels it either safe or unsafe (by returning a number in the 〈0, 1〉 interval, where values greater than smin are considered safe). This component is implemented using a Neyman-Pearson SVM classifier.

Since this work concentrated on learning the safety function only, it uses a naive and inefficient approach for selecting the best safe policy with which the learning should continue: it takes a random policy and verifies its intersection with the safe areas by sampling the states the policy can visit.

5.2.2 The learning algorithm

The algorithm that combines all these components into a safe exploration scheme is shown in Algorithm 1 and is described in detail in the following sections. In Table 1 we present the basic definitions used in the algorithm.

Initialization On line 1 we first require the operator to generate some real-world trajectories. It is generally not necessary for them to be generated by the operator; they can also be substituted by a first run of the simulator or by prior knowledge (e.g. if a small part of the safe states can be analytically expressed). It is important for this initial set to be sufficiently large; if it were not, the initial estimate of the safety function would be very poor. All the generated points are inserted into Xreal, which is represented either as a set of points or as a spatial search tree (depending on the expected number of elements). Then we update the training set T (according to its definition given in Table 1) and update the SVM model of the safety function (S0). A description of the SVM update is given below.


Algorithm 1 The safety function training algorithm

1. Xreal = operator-generated initial trajectories
2. Update T, S0 := updateSVM(T)
3. i := 0
4. while learning should continue do
5.   Generate an optimal policy πi safe on Si, or use the operator "as a policy"
6.   Drive using πi, record visited states xnew
7.   Xreal = Xreal ∪ xnew
8.   Update T, S′i := updateSVM(T)
9.   Perturb xnew several times, add the perturbed states to Xsim_safe or Xsim_unsafe depending on the result of simulation
10.  Update T, Si+1 := updateSVM(T)
11.  i++
12. end while

Variable      Definition
n             The dimensionality of the feature space
X             Rn, the feature (state) space
A             X × A → P(X), the set of actions
Xreal         ⊂ X, already visited states
Xsim_safe     ⊂ Xsim, states labeled safe by Sim
Xsim_unsafe   ⊂ Xsim, states labeled unsafe by Sim
T             {Xreal × {safe}} ∪ {Xsim_safe × {safe}} ∪ {Xsim_unsafe × {unsafe}}, the training set for the SVM
Sim           X × A → {safe, unsafe}, the simulator
πi            X → P(A), a stochastic safe policy
Si            X → {safe, unsafe}, a safety function (SVM)

Table 1: Notation used in the algorithm.
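The main loop of Algorithm 1 can be summarized in code as follows; the component callables (SVM update, policy generation, driving, perturbation, simulation) are passed in as placeholders and their names are assumptions of this sketch.

def safe_exploration_training(operator_trajectories, update_svm, drive,
                              generate_safe_policy, perturb, simulate,
                              n_iterations=3):
    """Sketch of Algorithm 1: learn the safety function S without visiting
    unsafe states; the component functions are supplied by the caller."""
    X_real = list(operator_trajectories)       # line 1: operator demonstrations
    X_sim_safe, X_sim_unsafe = [], []
    S = update_svm(X_real, X_sim_safe, X_sim_unsafe)           # line 2
    for i in range(n_iterations):              # line 4: stopping criterion
        pi = generate_safe_policy(S)           # line 5: safe w.r.t. current Si
        x_new = drive(pi)                      # line 6: execute on the robot
        X_real.extend(x_new)                   # line 7
        S = update_svm(X_real, X_sim_safe, X_sim_unsafe)       # line 8
        for x in x_new:                        # line 9: local perturbations
            for x_p in perturb(x):
                (X_sim_safe if simulate(x_p) else X_sim_unsafe).append(x_p)
        S = update_svm(X_real, X_sim_safe, X_sim_unsafe)       # line 10
    return S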


The stopping criterion Line 4 represents the stopping criterion. It can be either a subjective measure (trading off safety function accuracy against the time available for experimenting), or a qualitative measure (e.g. the algorithm is no longer able to simulate more unvisited states, or the safety function has not changed for some time).

Generating an optimal safe policy On line 5 a safe policy is generated based on Si. As we mentioned earlier, finding the optimal safe policy is not the main topic of this algorithm, and thus we use a trivial sampling-based method of verifying the safety of randomly generated policies.

Policy execution The next step is to execute the safe policy (line 7 and further). This may need some additional work, such as setting the robot to an initial position, changing the environment and so on. After the policy is executed, the newly visited states are added to Xreal and an update of T and the SVM is run.

Simulation The loop starting on line 9 specifies that we sample some perturbed states and simulate them in the simulator. Here is one important point: we assume that the further a perturbed state is from the current (real) state of the robot, the less precise the simulation is. Therefore we always perturb only in a small local neighborhood of the current state. How to perturb depends on the type of the features; it can be done e.g. by Euclidean vector shifts. The magnitude of the shifts is one of the free parameters of this algorithm.
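A minimal sketch of such a perturbation by Euclidean shifts; the magnitude sigma and the number of samples stand in for the free parameters mentioned above and their default values are arbitrary.

import numpy as np

def perturb(x, sigma=0.05, n_samples=10, rng=None):
    """Sample perturbed states in a small Euclidean neighborhood of x
    (Gaussian shifts; sigma is the free magnitude parameter)."""
    rng = rng if rng is not None else np.random.default_rng()
    x = np.asarray(x, dtype=float)
    return [x + rng.normal(scale=sigma, size=x.shape) for _ in range(n_samples)]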

Once the simulations are done, we record the simulated states into Xsim_safe or Xsim_unsafe depending on the results of the simulations (which are either binary classes or numbers from 〈0, 1〉). Then the training set and the SVM are updated again (which is described in the next section).

This simulation and perturbation step can also be run just after initialization, before the algorithm enters the learning loop. This way the initial estimate S0 will be better.

Updating the safety function (updateSVM) Representation and modification of the safety function are the key points of our algorithm. We need the safety function to generalize the set Xreal ∪ Xsim_safe in continuous space, not containing any point from Xsim_unsafe.

From our assumptions it follows that a generalization over this set does not necessarily denote only safe regions (because we defined as safe only the visited states and the states tagged safe by Sim). However, if we assume continuity of the safety function, it can be approximated very well.

To describe the representation of the safety function, we first define an auxiliary set Tpruned ⊂ T, which is basically the set of all visited or simulated states. To avoid serious problems in the computation of the safety function, we need to prune Tpruned in such a way that there are no points from Xsim near any point from Xreal. This in fact ensures that visited states have "priority" over merely simulated states, which allows us to remedy states misclassified by Sim as unsafe, although they are safe in reality. Again, the distance function is a free parameter of this algorithm.
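A minimal sketch of this pruning step using a k-d tree; the Euclidean metric and the radius value stand in for the free distance function mentioned above.

import numpy as np
from scipy.spatial import cKDTree

def prune_simulated(X_real, X_sim, radius=0.1):
    """Drop simulated points lying within `radius` of any visited (real)
    point, so that real experience overrides the simulator labels."""
    tree = cKDTree(np.asarray(X_real))
    return [x for x in X_sim
            if not tree.query_ball_point(np.asarray(x), r=radius)]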

Now Tpruned contains states of which no two cover each other, and each is tagged either safe or unsafe. Finding a representation of Si is now a binary classification task. To ensure the safety of the estimated safety function, the classification has to be done in such a way that it never classifies an unsafe state as safe. This can easily be achieved by using Neyman-Pearson classification [47] with the false negative rate limit set close to zero (assuming negative = safe).

One possible implementation of this classification scheme is the 2ν-SVM presented in [14], utilizing LIBSVM [10]. There is a set of kernel functions that can be used with SVMs, and which one to choose again depends on the expected structure of the safety function. Preferring SVMs over other binary classification tools has one good reason: SVMs minimize structural risk (error on test data) rather than only minimizing the error on training data. This should provide us with a robustly estimated safety function.
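The 2ν-SVM itself is provided by LIBSVM; purely as an illustration, the sketch below approximates the asymmetric error requirement with a heavily weighted standard SVM in scikit-learn (not the classifier used in [51]), penalizing misclassified unsafe examples much more than misclassified safe ones.

from sklearn.svm import SVC

def fit_safety_classifier(X, y, unsafe_weight=100.0):
    """X: feature vectors, y: 1 for unsafe, 0 for safe.  A large weight on the
    unsafe class pushes the false-negative rate (unsafe labeled safe) towards
    zero, roughly mimicking a Neyman-Pearson constraint."""
    clf = SVC(kernel="rbf", C=10.0, class_weight={1: unsafe_weight, 0: 1.0})
    clf.fit(X, y)
    return clf   # clf.predict([x]) == 0  ->  state considered safe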

Remarks The goal of this algorithm is to find a safety function closest to the real safety margins of the robot. The approximation of the real safety by the safety function should get better as the number of visited states increases, which can be confirmed by taking into account how the training set for the SVM is built and how SVMs operate (assuming the kernel function is rich enough to represent the safety function).

We can also conclude that the number of simulator runs is lower than if we sampled the state space regularly. Furthermore, our approach has the advantage that it is always sufficient to simulate in the local neighborhood of the state the robot is in, allowing for better simulations than if we ran the simulator in distant states.


5.2.3 Experimental evaluation

For the experimental evaluation of the algorithm described in Section 5.2, we have chosen the task of controlling the front flippers when driving down a step (both flippers at the same angle). This action is interesting because for different step heights there are different safe flipper configurations, and from a particular height up, there is no safe flipper configuration.

So the state space consists of all possible step heights (also drop heights; measured at the point where the flipper is attached to the main track). The robot generates multiple data points when driving down a step: first for height 0, then for the maximum height, and then for all the heights until it finishes climbing down the step (however, we assume only limited sampling capabilities, and this is why the data in Fig. 9 are that sparse). The action space then covers all possible flipper angles the robot can set when climbing down the step. For simplicity, we assume the robot can switch quickly between two different flipper configurations.

The policies are from the deterministic linear policy class of the form π(x) = θ0 + θ1x. The reinforcement learning objective we minimize is J(θ) = θ1² (to prefer policies with less flipper motion, e.g. to save power). We seek a safety function that would discriminate which flipper configurations are safe for which step heights. The safety function is represented by a 2C-SVM (equivalent to the 2ν-SVM) with a Radial Basis Function kernel.
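A minimal sketch of the naive policy selection used here: sample random linear policies, keep those whose state-action pairs the safety function labels safe, and among them prefer the smallest J(θ) = θ1². The sampling ranges and the helper names are assumptions of this example.

import numpy as np

def select_safe_linear_policy(safety_fn, step_heights, n_candidates=1000,
                              rng=None):
    """Pick theta = (theta0, theta1) minimizing J = theta1**2 among policies
    pi(x) = theta0 + theta1 * x that `safety_fn(x, action)` labels safe on
    all sampled step heights x."""
    rng = rng if rng is not None else np.random.default_rng()
    best = None
    for _ in range(n_candidates):
        theta0, theta1 = rng.uniform(0.0, 1.5), rng.uniform(-1.0, 1.0)
        safe = all(safety_fn(x, theta0 + theta1 * x) for x in step_heights)
        if safe and (best is None or theta1 ** 2 < best[1] ** 2):
            best = (theta0, theta1)
    return best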

For executing the simulations, we created a simple model of the robot for use with the Gazebo simulator [50]. Gazebo is a physical simulation library; however, our model contains only the basic physical properties. Namely, we have created plausible collision links for the real robot links (simple enough to allow for fast collision checking). For each of the links we have estimated the weight, center-of-mass position and inertial properties (with an error estimated at about 10%; no precise and lengthy measurements). Specifically, we have not estimated or set any properties regarding the motors, friction, slippage or other dynamic properties.

Similarly, we put into the simulator a rough terrain representation that is created directly by triangulating the point cloud (either from the laser scanner or from the point map). Such a map is by no means smooth, rigid or regular: it contains triangles with wrongly estimated normals or even corner positions, and it is non-continuous. Creating a more sophisticated map is an option for improving the estimated safety function, but it is difficult, and we want to show that this algorithm works well even with the cluttered map and the simplistic robot model. Thus, the task environment can be considered unstructured.

The simulation is then done in the following manner: first, we get the triangulated map and place the robot at the position corresponding to the real world. Then we shift it forward 30 cm, set the desired flipper angle and let the robot "fall" onto the ground, adjusting the flipper angle according to the given policy. If the flipper policy is safe, then the robot only falls a few millimeters and remains in a stable state, and we can mark all passed state-action pairs as safe. The policy is considered unsafe if the robot touches the terrain with its fragile parts, if it turns over, or if it ends up too far from the desired [x, y] coordinate; then the simulator tags all the state-action pairs as unsafe. For a reference on how the robot looks when visualized by Gazebo, refer to Figure 7.

Figure 7: Robot simulation in the Gazebo simulator. All flippers are in a configuration corresponding to flipper angle 0 rad, and the white arrows symbolize some basic flipper configurations. Lifting the flippers up thus decreases the flipper angle. The triangulated terrain is also shown in the image. The red and green segments denote detected robot-terrain collisions.

Fortunately, physical simulations in this setting proved to satisfy the requirement on the cautiousness of the simulator. To further ensure cautiousness, we perturb each simulated state several times and return the ratio of safe simulations to all simulations as the final result (thus our simulator returns values from 〈0, 1〉). Here the great advantage of our algorithm showed up: the simple physical model, as well as the triangulated map, are a matter of hours to create. If we were to create a precise physical model (of both the robot and the terrain), it would still have cases where it fails, and it would have required much more effort. Moreover, there are properties of the terrain that cannot be modeled in advance, and our perturbation approach can overcome some of them.
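A minimal sketch of this perturbation-voting wrapper around a raw simulator; simulate_once and the perturbation helper are placeholders, and the default threshold mirrors the smin = 0.7 used in Figure 9.

def cautious_simulate(state, simulate_once, perturb, n_samples=10, s_min=0.7):
    """Return the fraction of perturbed simulations judged safe (a value in
    <0, 1>); the state is treated as safe only if the ratio exceeds s_min."""
    results = [simulate_once(x) for x in perturb(state, n_samples=n_samples)]
    ratio = sum(1 for safe in results if safe) / len(results)
    return ratio, ratio > s_min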

It is important to notice that the simulations are performed in a space much larger than the feature space (which is 1-dimensional). The simulations are performed with full 3D models (triangle meshes) incorporating the physical influence of forces. So what we do is simulate the problem in its full description, and then map the result of the simulation to the problem projected onto a 2D subspace consisting of features and actions. If the projection is chosen wisely, there should be no problem with this dimension shrinking (only the simulator could be considered more cautious than necessary).

To verify the safe exploration algorithm in practice, we drove the robot over several steps of different heights, running the algorithm after each trial. After each teleoperated trial there was an autonomous test of the generated policy. We always chose the policy that intersects the largest area of safe states.

Evaluation of the experiment: during the realization phase, the robot never tried to enter an unsafe state (neither from the estimated unsafe set, nor from the real unsafe states). It always managed to add new points to the safety function representation and to enlarge the area of the state space covered by the safe region. The safe and optimal policy did not change during the experiment; it was always the constant policy π = 1.1 + 0x.

The progress of the safety function, as well as its support vectors, is shown in Figure 9; note how the safety function's safe area grew gradually with each iteration.

After the final iteration, we compared the learned safety function to the limits that an experienced operator would allow for the robot. (Admittedly, in more complex instances of safe exploration, getting such limits is impractical.) The comparison is shown in Figure 9, and Figure 8 provides a graphical understanding of the data points. It is evident from the figure that we have succeeded in keeping the false negative (FN) rate at 0 (here FN denotes unsafe states classified as safe).

Figure 8: Poses of the robot to better illustrate the meaning of the data points in Figure 9. The robot icons are placed approximately with their center on the data point (the left column represents drop height 0). The "ghost" flippers for angle 1 rad denote that the robot pushes to reach that angle, but the applied power is not sufficient (the flippers are compliant). The red bars illustrate the place where the drop height was measured.

Using the classifier terminology, we can specify true negatives (TN) as the number of safe states classified safe, false positives (FP) as the number of safe states classified unsafe, and true positives (TP) as the number of unsafe states classified unsafe. Then we may define accuracy as (TP + TN)/(TP + TN + FP + FN) and precision as TP/(TP + FP). With these terms defined, we may say that the objective of the safe exploration algorithm is to achieve precision as close to 1 as possible, which means to minimize the difference between the estimated and real safety functions, while preserving FN = 0.

During the three model updates, the values of accuracy in the individual steps were [0.70, 0.82, 0.81], and precision was [0.42, 0.66, 0.69]. Another interesting metric can be seen when we superimpose the last (best) SVM model S2 over the sets of points visited in the previous model updates. This shows how the model gets gradually better: accuracy [0.77, 0.82, 0.81], precision [0.48, 0.66, 0.69]. We note that compared to the first model S0, the last model S2 classifies several previously unsafe-classified points as safe, increasing both accuracy and precision. For the second model S1 there is no change when superimposed with S2.

Figure 9: The progress of learning the SVMs for the safety model (iterations 1, 2 and 3 from the top). The pink area is considered safe by the SVM (the blue solid line is its boundary). The dashed black line denotes the safety boundary estimated by an experienced operator (just for evaluation purposes). Data points from Xreal are represented as brown dots, Xsim_safe as plus signs and Xsim_unsafe as crosses. The safety of the Xsim data points is coded by color using the shown color scale (we used the safety threshold smin = 0.7). Encircled points are the support vectors. The thin red, green and blue lines represent the manually driven trajectories, and the magenta line at the bottom is the trajectory executed using πi. Note that manually visiting the green and azure points in the last step would greatly improve the safety function estimate.

5.3 Maintaining safety

With the safety requirements clearly formulated as a state-action safety function that we can evaluate for the current state and any selected action, we can proceed further to finding a way to efficiently combine this safety function with existing reinforcement learning methods.

At first glance, just summing up the task-oriented rewards and some kind of safety rewards could seem like a good idea. But what magnitude should the safety rewards have? If they were −∞ for unsafe states, then the reward propagation done by reinforcement learning would propagate these −∞s to the whole state space (unless there is a state from which it is impossible to get to any critical state in any time horizon). If −∞ is not a good option, which number is? Finding a good ratio of safety and task-oriented rewards seems to be hard (guess)work, so we leave it only as a last resort method. Moreover, the impact of this approach on the trained policies' optimality is unclear.

Our conclusion is that it is really necessary to develop an efficient (and optimal) algorithm for solving problems of the type

∆θ∗ = argmax_{∆θ : S(x, π_{θ+∆θ}(x)) ≠ unsafe} ∆θ · ∂J/∂θ,

where S is the safety classifier. The development of such an algorithm is still in progress.
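As one illustration of the kind of update this formula asks for, the sketch below greedily shrinks a gradient step until the safety classifier accepts the resulting policy on a set of probe states; this simple line search is only one possible (and not necessarily optimal) instantiation, and all helper names are assumptions.

import numpy as np

def safe_gradient_step(theta, grad_J, is_safe, probe_states,
                       step=0.1, shrink=0.5, max_tries=10):
    """Take the largest step along grad_J (scaled by `step`) such that the
    policy pi(theta + delta) is labeled safe on all probe states."""
    delta = step * np.asarray(grad_J)
    for _ in range(max_tries):
        candidate = np.asarray(theta) + delta
        if all(is_safe(x, candidate) for x in probe_states):
            return candidate
        delta *= shrink          # back off until the step is safe
    return np.asarray(theta)     # no safe improvement found; keep theta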

5.3.1 Egg curling—a testbed for maintaining safety

A practical need for developing novel algorithms is a toy example that allows us to quickly test ideas and run quick simulations. We found such a problem, which helps us reason about the safety-related issues.

Let us have a dining table and a homogeneous and spherical egg lying on top of it. The distance of the egg from the table edge is known, but its weight and the friction coefficient of the table surface are not. We push the egg in the direction of the edge of the table. The task is to get it as close to the edge as possible, but in a safe way that ensures the egg does not fall over the edge. We can push the egg as many times as needed, unless it falls down.

So the state x is described by the distance to the edge, actions u are represented by the force with which we push, and rewards r are either inversely linear or inversely exponential in the remaining distance.

The true transition model can be e.g.

f(x, u) = x + log(2u),

and a cautious simulator is then the parametrized form

f_p(x, u) = x + log(pu),

with the parameter p unknown. Some prior knowledge can provide us with the information that p ∈ 〈0.1, 100〉.

Having this simulator, we can simulate various values of p and then select only actions that are safe for all of these values. This way we ensure safety of the whole process.
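A minimal sketch of this worst-case selection; simulate, is_safe and reward are placeholders for f_p, the "egg stays on the table" test and r = 1/x′, respectively.

def robust_safe_action(x, candidate_actions, p_values,
                       simulate, is_safe, reward):
    """Keep only actions whose simulated outcome is safe for every plausible
    simulator parameter p, then pick the one with the best worst-case reward."""
    best_u, best_worst = None, float("-inf")
    for u in candidate_actions:
        outcomes = [simulate(x, u, p) for p in p_values]
        if not all(is_safe(x_next) for x_next in outcomes):
            continue                 # unsafe under at least one plausible p
        worst = min(reward(x_next) for x_next in outcomes)
        if worst > best_worst:
            best_u, best_worst = u, worst
    return best_u                    # None if no action is safe for all p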

Figure 10: The egg curling setup. The egg (white) is on top of a table (gray) at distance x from an edge. We push the egg with power u and, after the egg stops, we receive a reward r = 1/x′ corresponding to the new distance from the edge.

On the other hand, the already performed experiments tell us a lot about the real physics of the system, so if we model the parameter space e.g. using GPs, we can quickly find out that values p > 10 are very improbable. Incorporating this knowledge then allows us to be more courageous in the following trials.
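A minimal sketch of the simplest form of this narrowing: discard parameter values that are inconsistent (beyond a tolerance) with the transitions observed so far; a GP posterior over p would be the smoother alternative mentioned above, and the names and tolerance here are assumptions.

import numpy as np

def prune_parameters(p_values, observed, tol=0.05):
    """Keep only simulator parameters p that reproduce every observed
    transition (x, u, x_next) of the model x_next = x + log(p * u)
    within the tolerance `tol`."""
    return [p for p in p_values
            if all(abs((x + np.log(p * u)) - x_next) <= tol
                   for x, u, x_next in observed)]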

In this section, we have presented one possible use case of the egg-curling testbed, on which we explore the possibilities of transition-model estimation. Other modifications can easily be made to support other ideas.

5.4 Adaptive Traversability

For AT, we model the mutual state of the robot and the local neighboring terrain as an n-dimensional feature vector x ∈ Rn consisting of:

• exteroceptive features: We merge the individual scans from one sweep of the rotating laser scanner into a 3D map using the ICP algorithm [54]. From the map we construct an Octomap [32] with cube size 5 cm, and this Octomap is then cropped to the close neighborhood of the robot (50 cm × 200 cm). In this cut, we merge all neighboring fours of cubes, which results in a local representation of the terrain with x/y subsampled to 10 cm × 10 cm tiles (bins) and vertical resolution 5 cm. This is what we call a Digital Elevation Map (DEM), see Figure 4. The heights in the bins are directly used as exteroceptive features.

• proprioceptive features: Robot speed (actual and desired), roll, pitch, flipper angles, compliance thresholds, actual current in the flippers and the actual flipper configuration.

The robot has many degrees of freedom, but only some of them are relevant to the AT task. We assume that the speed and heading of the robot are controlled by the operator5, and hence AT is used to control the pose of the four flippers and their compliance, which yields 8 DOF to be controlled. Further simplification of the action space is allowed by observations made during experiments: only 4 discrete (laterally symmetric) flipper configurations are enough for most terrains, and 2 different levels of compliance are also sufficient (the arm has to be in a "transport" position when the robot moves, so we ignore its DOF here). Thus we defined 5 flipper configurations denoted by c ∈ C = {1 . . . 5}; their representation is shown in Figure 3.

We define the reward function r(c, x) : (C × Rn) → R, which assigns a real-valued reward for achieving state x while using configuration c. It is expressed as a weighted sum of (i) a user-denoted binary penalty (reward) specifying that the state is (not) dangerous, (ii) a high pitch/roll angle penalty (preventing the robot's flip-over), (iii) a penalty for switching the configurations excessively, (iv) a robot forward speed reward (for making progress in traversing), and (v) the motion roughness penalty6 measured by accelerometers.

A single experiment then consists of an environment with an obstacle, the human operator controlling the speed vector, and the robot controlling the flipper poses automatically (or manually when gathering training data).

Natural or disaster environments (such as a forest or collapsed buildings) yield many challenges, including incomplete or incorrect data due to reflective surfaces such as water, occluded views, the presence of smoke, or deformable terrain such as deep snow or piles of rubble (conditions common for USAR). Simple interpolation of incomplete data is often insufficient since it does not reflect all available measurements (e.g. robot pitch or the torque in the flipper engines restricts the shape of incompletely measured terrain). Our proposed solution to AT correctly marginalizes over the incomplete data and estimates the safety of possible flipper configurations. Due to space constraints, we do not give the description here; it will appear after our journal paper is published.


Figure 11: Obstacles: Elevation maps (DEM) of the testing obstacles constructed during the experiments; (1) simple obstacle, (2) soft-terrain side-roll obstacle, (3) long obstacle, (4) double obstacle, (5) non-traversable obstacle. The robot went left to right, following the horizontal centerline of each DEM.

5.4.1 Experimental evaluation

Our proposed solution to AT was tested on five challenging obstacles created from logs and stones in an outdoor forest environment; see the examples in Figure 12 and the elevation maps (DEM) of the testing obstacles computed online by the robot in Figure 11. Each obstacle was traversed multiple times with autonomous flipper control (AC); obstacles 1, 2 and 5 were also traversed with manual flipper control (MC) by the expert operator for the purpose of quantitative comparison. We emphasize that the complexity of the testing obstacles was selected in order to challenge the robot's hardware capabilities. One of the testing obstacles even proved to be too complex to be traversed with either AC or MC.

Adaptive Traversability (AT) was trained in controlled lab conditions using only two artificial obstacles created from EUR pallets7. The first training obstacle was just a single pallet; the second consisted of stairs created from three pallets, see Figure 12.

5 At this time, we simplify the task to just controlling the flippers; control of the speed vector is upcoming.
6 Or smoothness reward.
7 Type EUR 1: 800 mm × 1200 mm × 140 mm, see https://en.wikipedia.org/wiki/EUR-pallet

(a) Training objects. (b) Testing objects.

Figure 12: Obstacles: (a) Three EUR pallets (800 mm × 1200 mm × 140 mm) with one non-standard pallet and a concrete shoal used for the training part. (b) Natural obstacles in an outdoor forest environment used for the testing part.

We trained the Q-function in three episodes of collecting training data. In the first two episodes, the training data were collected with manual flipper control. To speed up the learning procedure, reasonably negative (but not dangerous) training samples were also provided. In the last episode, the training data were collected autonomously by the robot. Each training sample was accompanied by its reward. The best results were achieved for the reward function defined as a weighted sum of (i) manually annotated labels (either positive, equal to 1, or negative, equal to −1), (ii) a thresholded exponential penalty for the pitch angle, and (iii) a roughness-of-motion penalty defined as √(a_y² + a_z²). In order to reduce oscillations between configurations with similar Q-values, we (i) introduced an additional penalty for changing the mode and (ii) evaluated the Q-values over a 1-second-long time interval. To obtain a larger training dataset, the training samples were synthetically perturbed. We mainly perturbed the height of the obstacles while changing the pitch of the robot correspondingly.
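A minimal sketch of such a weighted reward combination; the weights and the exact form of the thresholded pitch penalty are illustrative assumptions, not the tuned values used in the experiments.

import numpy as np

def at_reward(label, pitch, a_y, a_z, mode_changed,
              w=(1.0, 1.0, 0.5, 0.2), pitch_limit=0.5):
    """Weighted sum of (i) a manual +1/-1 label, (ii) a thresholded
    exponential pitch penalty, (iii) a roughness penalty sqrt(a_y^2 + a_z^2),
    plus an extra penalty for switching the flipper mode."""
    pitch_pen = np.exp(abs(pitch) - pitch_limit) if abs(pitch) > pitch_limit else 0.0
    roughness = np.sqrt(a_y ** 2 + a_z ** 2)
    return (w[0] * label - w[1] * pitch_pen - w[2] * roughness
            - w[3] * float(mode_changed))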

We tested adaptive traversability on five challenging natural obstacles in a forest. Both AC and MC allowed the robot to traverse obstacles 1–4. Obstacle 5 consisted of two logs located in parallel, with a mutual distance approximately equal to the length of the robot with folded flippers. Such an obstacle turned out not to be traversable with either the autonomous or the manual flipper control. For obstacles 1 and 2, a quantitative comparison of the autonomous and manual flipper control is provided in Tables 2 and 3. To compare the AC and MC traversal quality, six different metrics were proposed and evaluated: (i) average pitch angle (the sum of the absolute values of the pitch angle divided by the number of samples), (ii) average roll angle, (iii) traversal time (the start and end points are defined spatially), (iv) average current in the flipper engines (corresponds to flipper torque), (v) power consumption during the whole experiment, and (vi) the number of configuration changes.

(a) Obst. 1: pitch + flipper modes. (b) Obst. 1: roll + flipper modes. (c) Obst. 2: pitch + flipper modes. (d) Obst. 2: roll + flipper modes. (Flipper modes shown: V-shape, L-shape, U-shape soft, U-shape hard.)

Figure 13: Pitch, roll, and flipper modes along the obstacle traversal: All graphs show the used flipper modes and the pitch and roll angles reached by the robot during the obstacle traversal. Graphs (a) and (b) correspond to obstacle 1, graphs (c) and (d) to obstacle 2. Autonomous control (AC) is depicted in blue and manual control (MC) in red. Autonomously selected modes are shown in the first color bar and manual ones in the second.

Table 2 shows that the average pitch, average roll and the number of flipper mode changes of AC and MC on obstacle 1 are comparable. However, the power consumption and the average current are both lower for AC. This is achieved by more efficient mode selection, such as using the U-shape-soft mode for going down the obstacle; see the flipper modes and the pitch and roll angle plots of AC and MC in Figure 13a,b.

Table 2: Comparison of autonomous and manual robot control on obstacle 1 (simple obstacle).

     Pitch [°]  Roll [°]  Time [s]  Current [A]  Changes [−]  Consumption [Ah]
AC   11.2       1.8       35.7      3.4          3            0.07
MC   11.3       2.8       36.8      5.4          2            0.10

Table 3: Comparison of autonomous and manual robot control on obstacle 2 (contains soft terrain and side-roll).

     Pitch [°]  Roll [°]  Time [s]  Current [A]  Changes [−]  Consumption [Ah]
AC   10.2       10.6      75.3      3.9          10           0.17
MC   17.1       17.1      132.1     4.6          4            0.33

Table 3 demonstrates that AC outperformed MC in most of the evaluated metrics. The most significant difference can be seen in the actual time taken. MC often required the experienced operator to stop the robot to change the flipper configuration (due to high cognitive load), while AC was continuous and proceeded as the robot was driven forward. Therefore, the traversal time of AC is almost half of the MC time. In addition, since our definition of the reward function also contains a penalty for being at extreme angles (accounting for robot safety), AC achieved smaller pitch, roll, as well as flipper current: the ground/obstacle contacts were less frequent and less intense. The power consumption of AC compared to MC was hence much lower, enabling the robot to last longer while carrying out the mission.

5.4.2 Comparison to “blinded operator” control

We performed one more experiment, in which we compared the traversal times etc. for an experienced operator, an inexperienced operator, and a "blinded operator", i.e. an operator who only has the same sensory input available as the robot. This is quite usual in USAR teleoperation, because the robot gets far away into dangerous areas, and the only way to drive it is by looking at the camera and laser streams on a laptop (shown in Figure 14(b)).

(a) The obstacle and the robot with the arm. (b) The view a "blinded operator" has.

Figure 14: The experiment comparing the experienced, inexperienced and blinded operators' performance with AC and MC.

The experiment was again conducted in a forest, this time only on a single and simple obstacle, but with a heavy robotic arm attached to the robot body, which considerably raises the center of gravity and thus decreases the stability of the whole system (shown in Figure 14(a)). There was an experienced operator, an inexperienced one (this operator had not driven the robot for a few years), and the experienced "blinded" operator. There were five traversals for each combination of operator and AC/MC.

Table 4: Results of the second AT experiment. Please note that the "Exp MC" (experienced operator manually controlling everything) part of the experiment was done with higher compliance to make the task competitive; hence the higher pitch and roll, and the lower power consumption.

Exp. type   Pitch [°]    Roll [°]    Current [A]  Duration [s]   Mode changes
Exp MC      13.9 ± 5.7   5.9 ± 4.3   7.4 ± 0.3    50.5 ± 14.0    5.4 ± 1.1
Exp AC      5.5 ± 0.4    2.5 ± 0.3   8.3 ± 0.4    43.9 ± 7.6     21.8 ± 10.4
Unexp MC    5.8 ± 1.1    4.1 ± 0.5   9.5 ± 0.5    41.3 ± 3.2     4.5 ± 0.6
Unexp AC    5.7 ± 1.5    2.9 ± 0.5   8.9 ± 1.0    36.1 ± 8.9     19.0 ± 6.2
Blind MC    6.2 ± 0.6    2.5 ± 0.8   7.6 ± 1.0    63.8 ± 11.9    6.0 ± 1.2
Blind AC    5.6 ± 0.3    2.8 ± 0.2   7.3 ± 0.4    64.0 ± 16.5    24.0 ± 20.7

Evaluating the results shown in Table 4, we can see that AC almost always did comparably to or even better than MC. An interesting phenomenon is the shorter traversal time for the inexperienced operator than for the experienced one. We noticed during the experiment that the inexperienced operator did not care that much about taking potentially unsafe actions, thus allowing himself a higher speed.

Both experiments show that Adaptive Traversability does the same or an even better job than full manual control, while the operator can concentrate on other tasks (e.g. inspecting the camera images) or go faster.

6 Future Work

We have shown that the field of safe exploration is not yet fully covered and not all of its problems are satisfyingly solved; some of the problems have not even been well defined yet. Our future work will therefore focus on a precise and formal specification of the tasks to be solved, and on finding the missing algorithms for optimal safe exploration.

We see a possible way forward in utilizing Gaussian processes for transition function estimation. There is also the possibility of changing the gradient policy search algorithm to incorporate safety, or to work with parameter uncertainty. Adaptive Traversability still needs the speed vector to be controlled autonomously as well. And there is still a lot of work to be done in creating benchmarks for comparing different SE or AT solutions.

6.1 Safety as Simulator Parameters Uncertainty

As was shown in the egg-curling example, in some cases it seems feasible to create a simulator which is capable of plausible simulations, but whose parameters we do not know exactly. Suppose we know that there exists a set of parameters with which the simulator faithfully simulates the real behavior of the system; it then becomes a representation of the transition function. Then there is one more possible definition of safety: safe states can be only such states that are simulated as safe by all instances of the simulator with all "allowed" parameter values. As the agent gathers more real data, some of the simulator parameters become implausible and can be discarded. The more parameter values are discarded, the more the safety function approaches the real safety of the system and the fewer false positives it yields. And, by definition, such a safety function is really safe at all times (it is a cautious simulator).


References

[1] Pieter Abbeel, Adam Coates, Morgan Quigley, and Andrew Y. Ng. An application of reinforcement learning to aerobatic helicopter flight. In Proceedings of the 2006 Conference on Advances in Neural Information Processing Systems 19, volume 19, page 1, 2007.

[2] Anayo K. Akametalu, Melanie N. Zeilinger, Jeremy H. Gillula, and Claire J. Tomlin. Reachability-Based Safe Learning with Gaussian Processes. CDC, 2014.

[3] Brenna D. Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5):469–483, May 2009.

[4] J. A. Bagnell. Learning decisions: Robustness, uncertainty, and approximation. PhD thesis, Carnegie Mellon University, PA, USA, 2004.

[5] A. G. Barto. Reinforcement learning: An introduction. MIT Press, 1998.

[6] Richard Bellman. A Markovian decision process. Technical report, DTIC Document, 1957.

[7] Dimitri P. Bertsekas. Dynamic programming: deterministic and stochastic models. Prentice-Hall, 1987.

[8] Michael Brunner, Bernd Bruggemann, and Dirk Schulz. Towards autonomously traversing complex obstacles with mobile robots with adjustable chassis. In Carpathian Control Conference (ICCC), 2012 13th International, pages 63–68, May 2012.

[9] Bruno Cafaro, Mario Gianni, Fiora Pirri, Manuel Ruiz, and Arnab Sinha. Terrain traversability in rescue environments. In Safety, Security, and Rescue Robotics (SSRR), 2013 IEEE Int. Symposium on, pages 1–8, Oct 2013.

[10] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27:27, 2011.

[11] Sonia Chernova and Manuela Veloso. Confidence-based policy learning from demonstration using Gaussian mixture models. In AAMAS '07 Proceedings, page 1. ACM Press, 2007.


[12] F. Colas, S. Mahesh, F. Pomerleau, Ming Liu, and R. Siegwart. 3D path planning and execution for search and rescue ground robots. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 722–727, Nov 2013.

[13] Stefano P. Coraluppi and Steven I. Marcus. Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes. Automatica, 1999.

[14] M. A. Davenport. Controlling false alarms with support vector machines. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages V–V, 2006.

[15] Marc Deisenroth and Carl E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.

[16] Marc Peter Deisenroth. A Survey on Policy Search for Robotics. Foundations and Trends in Robotics, 2(1):1–142, 2011.

[17] M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for data-efficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.

[18] Erick Delage and Shie Mannor. Percentile optimization in uncertain Markov decision processes with application to efficient exploration. In Proceedings of the 24th International Conference on Machine Learning (ICML '07), pages 225–232, New York, New York, USA, 2007. ACM Press.

[19] E. M. DuPont, C. A. Moore, and R. G. Roberts. Terrain classification for mobile robots traveling at various speeds: An eigenspace manifold approach. In Proc. IEEE Int. Conf. Robotics and Automation (ICRA 2008), pages 3284–3289, 2008.

[20] Philipp Ertle, Michel Tokic, Richard Cubek, Holger Voos, and Dirk Soffker. Towards learning of safety knowledge from human demonstrations. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5394–5399. IEEE, October 2012.

[21] Javier Garcia and Fernando Fernandez. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45:515–564, 2012.


[22] P. Geibel. Reinforcement learning with bounded risk. In IEEE International Conference on Machine Learning, pages 162–169, 2001.

[23] Alborz Geramifard. Practical reinforcement learning using representation learning and safe exploration for large scale Markov decision processes. PhD thesis, University of Alberta, 2012.

[24] Mario Gianni, Federico Ferri, Matteo Menna, and Fiora Pirri. Adaptive robust three-dimensional trajectory tracking for actively articulated tracked vehicles (AATVs). Journal of Field Robotics, 2015.

[25] Jeremy H. Gillula and Claire J. Tomlin. Guaranteed safe online learning of a bounded system. In 2011 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 2979–2984. IEEE, September 2011.

[26] P. W. Glynn. Likelihood ratio gradient estimation: an overview. In Proceedings of the 19th Conference on Winter Simulation, number 27, pages 366–375. ACM, 1987.

[27] Alexander Hans, Daniel Schneegaß, A. M. Schafer, and S. Udluft. Safe exploration for reinforcement learning. In Proceedings of the European Symposium on Artificial Neural Networks, number 4, pages 23–25, 2008.

[28] Matthias Heger. Consideration of risk in reinforcement learning. In 11th International Machine Learning Conference, 1994.

[29] Ken Ho, T. Peynot, and S. Sukkarieh. A near-to-far non-parametric learning approach for estimating traversability in deformable terrain. In Intelligent Robots and Systems (IROS), 2013 IEEE/RSJ International Conference on, pages 2827–2833, Nov 2013.

[30] Ken Ho, Thierry Peynot, and Salah Sukkarieh. Traversability estimation for a planetary rover via experimental kernel learning in a Gaussian process framework. In International Conference on Robotics and Automation (ICRA), pages 3475–3482, 2013.

[31] M. Hoffmann, K. Stepanova, and M. Reinstein. The effect of motor action and different sensory modalities on terrain classification in a quadruped robot running with multiple gaits. Robotics and Autonomous Systems, 62:1790–1798, 2014.


[32] Armin Hornung, Kai M. Wurm, Maren Bennewitz, Cyrill Stachniss, and Wolfram Burgard. OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots, 2013. Software available at http://octomap.github.com.

[33] Ron A. Howard. Dynamic Programming and Markov Processes. Technology Press of Massachusetts Institute of Technology, 1960.

[34] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[35] Mrinal Kalakrishnan, Jonas Buchli, Peter Pastor, and Stefan Schaal. Learning locomotion over rough terrain using terrain templates. In Intelligent Robots and Systems (IROS 2009), IEEE/RSJ International Conference on, pages 167–172. IEEE, 2009.

[36] Dongho Kim, Kee-Eung Kim, and Pascal Poupart. Cost-Sensitive Exploration in Bayesian Reinforcement Learning. In Proceedings of Neural Information Processing Systems (NIPS), 2012.

[37] Dongshin Kim, Jie Sun, Sang Min Oh, James M. Rehg, and Aaron F. Bobick. Traversability classification using unsupervised on-line visual learning for outdoor robot navigation. In Proc. of the International Conference on Robotics and Automation (ICRA), pages 518–525, 2006.

[38] Kisung Kim, Kwangjin Ko, Wansoo Kim, Seungnam Yu, and Changsoo Han. Performance comparison between neural network and SVM for terrain classification of legged robot. In Proc. SICE Annual Conf. 2010, pages 1343–1348, 2010.

[39] Jens Kober and Jan Peters. Reinforcement learning in robotics: a survey. International Journal of Robotics Research, 2013.

[40] N. Kohl and P. Stone. Policy gradient reinforcement learning for fast quadrupedal locomotion. In IEEE International Conference on Robotics and Automation (ICRA '04), 3(5):2619–2624, 2004.

[41] P. Komma, C. Weiss, and A. Zell. Adaptive Bayesian filtering for vibration-based terrain classification. In Proc. IEEE Int. Conf. Robotics and Automation (ICRA '09), pages 3307–3313, 2009.


[42] Sascha Lange. Autonomous reinforcement learning on raw visual input data in a real world application. The 2010 International Joint Conference on Neural Networks (IJCNN), 2012.

[43] Steven Martin, Liz Murphy, and Peter Corke. Building large scale traversability maps using vehicle experience. In International Symposium on Experimental Robotics, pages 891–905, 2013.

[44] Oliver Mihatsch and Ralph Neuneier. Risk-sensitive reinforcement learning. Machine Learning, 49(2-3):267–290, 2002.

[45] Ian M. Mitchell, Alexandre M. Bayen, and Claire J. Tomlin. A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games. IEEE Transactions on Automatic Control, 50(7):947–957, 2005.

[46] Teodor Mihai Moldovan and Pieter Abbeel. Safe Exploration in Markov Decision Processes. In Proceedings of the 29th International Conference on Machine Learning, May 2012.

[47] J. Neyman and E. S. Pearson. Joint Statistical Papers. University of California Press, Berkeley, CA, 1966.

[48] Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices. Operations Research, 53(5):780–798, October 2005.

[49] Lauro Ojeda, Johann Borenstein, Gary Witus, and Robert Karlsen. Terrain characterization and classification with a mobile robot. Journal of Field Robotics, 23:103–122, 2006.

[50] OSRF. Gazebo, 2002.

[51] M. Pecka, K. Zimmermann, and T. Svoboda. Safe Exploration for Reinforcement Learning in Real Unstructured Environments. In Paul Wohlhart and Vincent Lepetit, editors, 20th Computer Vision Winter Workshop, Graz, Austria, 2015. TU Graz.

[52] Martin Pecka and Tomas Svoboda. Safe Exploration Techniques for Reinforcement Learning: An Overview. In Lecture Notes in Computer Science. Springer, 2014.

[53] Peter Geibel and Fritz Wysotzki. Risk-Sensitive Reinforcement Learning Applied to Control under Constraints. Journal of Artificial Intelligence Research, 24:81–108, September 2011.


[54] Francois Pomerleau, Francis Colas, Roland Siegwart, and Stephane Magnenat. Comparing ICP Variants on Real-World Data Sets. Autonomous Robots, 34(3):133–148, February 2013.

[55] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.

[56] Michal Reinstein and Matej Hoffmann. Dead reckoning in a dynamic quadruped robot based on multimodal proprioceptive sensory information. IEEE Transactions on Robotics, 29(2):563–571, 2013.

[57] Michal Reinstein, Vladimir Kubelka, and Karel Zimmermann. Terrain adaptive odometry for mobile skid-steer robots. In Robotics and Automation (ICRA), 2013 IEEE International Conference on, pages 4706–4711, 2013.

[58] S. Ross and J. A. Bagnell. Agnostic system identification for model-based reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012.

[59] Jeff G. Schneider. Exploiting model uncertainty estimates for safe dynamic control learning. Neural Information Processing Systems, 9:1047–1053, 1996.

[60] Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake. Real-time human pose recognition in parts from single depth images. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1297–1304. IEEE, 2011.

[61] Emre Ugur, Mehmet R. Dogar, Maya Cakmak, and Erol Sahin. The learning and use of traversability affordance using range images on a mobile robot. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 1721–1726, April 2007.

[62] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, May 1992.

[63] C. Weiss, H. Frohlich, and A. Zell. Vibration-based terrain classification using support vector machines. In Proc. IEEE/RSJ Int. Intelligent Robots and Systems Conf., pages 4429–4434, 2006.


[64] Ronald J. Williams and Leemon C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. Technical report, Northeastern University, College of Computer Science, 1993.

[65] Karel Zimmermann, Petr Zuzanek, Michal Reinstein, and Vaclav Hlavac. Adaptive Traversability of Unknown Complex Terrain with Obstacles for Mobile Robots. In IEEE International Conference on Robotics and Automation, pages 5177–5182, 2014.
