
ESCUELA TÉCNICA SUPERIOR DE INGENIEROS INDUSTRIALES

Departamento de Ingeniería de Sistemas y Automática

Master Thesis

Reinforcement Learning on the Lego Mindstorms NXT Robot. Analysis and Implementation.

Author: Ángel Martínez-Tenor

Supervisors: Dr. Juan Antonio Fernández-Madrigal, Dr. Ana Cruz-Martín

Master Degree in Mechatronics Engineering

Málaga. May 13, 2013


Extended Abstract

This Master Thesis deals with the use of reinforcement learning (RL) methods in a small mobile robot. In particular, we are interested in those methods that do not require any model of the system and are based on a set of rewards defining the task to learn. RL allows the robot to find quasi-optimal actuation strategies based on the robot-environment interaction. This work proposes the implementation of a learning method for an obstacle-avoidance wandering task based on the Q-learning algorithm for the Lego Mindstorms NXT robot. A state-of-the-art review and a research proposal are included.

Keywords. Q-learning, reinforcement learning, artificial intelligence, Mindstorms, mobile robot, TD method (Temporal Difference), MDP (Markov Decision Process).

State of the art

The present work pursues a mobile robot that learns by itself the effects of its actions. Reinforcement learning (RL) methods, based on stochastic actuation models, will be employed as the mechanism that allows the robot to feed back the effects of its actions, thus leading its own development.

There is a growing interest in mobile robot applications in which the environment is not previously prepared for the robot, as occurs in home service robotics, where human presence is common [1] [2] [3] [4]. Most service mobile robots are explicitly preprogrammed with high-level control architectures, requiring advanced software development techniques [5]. It would be desirable to have methods that allow the robot itself to evolve from the most basic control level up to higher control architectures; this would save engineers from spending too much time developing and using these advanced techniques.

The autonomous learning concept dates back to Alan Turing's idea of robots learning in a way similar to a child [6]. This topic evolved into the current concept of developmental robotics [1], a today-emergent paradigm analogous to that of development in human psychology.

In autonomous learning, mechanisms for decision-making under uncertainty are generally employed [7] [2] [1]. These mechanisms represent cognitive processes capable of selecting a sequence of actions that we expect to lead to a specific outcome. Examples of these processes can be found in economic and military decisions [8]. In particular, Markov Decision Processes represent the most exploited subfield. They are Dynamic Bayesian Networks (DBN) in which the nodes represent system states and the arcs represent actions along with their expected reward. A selected action is executed at each step, reaching a new state and obtaining a reward, both stochastically [8] [7].

The objective in a decision-making problem is to calculate a policy, a stationary function defining which action a is executed when the agent is in state s. Classic algorithms such as Value iteration and Policy iteration are used to converge to the optimal policy up to a given maximum error. However, a detailed stochastic model of the system is required for obtaining reliable results.


The alternative, model-free decision-making process has no information about the transition T(s, a, s′) and reward R(s, a, s′) functions of the DBN. The solution is: knowing that we are in state s, execute action a and read which state s′ we reach and which associated reward R(s′|s, a) we obtain. In other words, the lack of a model is solved by making observations. This is the basis of the RL concept.

Hence, an RL problem is a continuous process which evolves from an initial policy to a near-optimal policy by executing actions, making observations, and updating its policy values [8] [7]. In the last decades, RL has helped to solve many problems in different fields, such as game theory, control engineering and statistics. That makes RL an interesting candidate as a basic learning mechanism for mobile robots in the context of developmental robotics.

Many RL publications have demonstrated the efficacy of Watkins' Q-learning algorithm [9], within the Temporal difference learning group of methods, among others [8] [2] [10]. In parallel to Temporal difference learning, Monte Carlo methods, Dynamic programming and Exhaustive search can also be found in RL approaches [8]. All these methods have a solid theoretical basis and have been addressed by many researchers in the last two decades. Moreover, there is a tendency towards extending and combining these techniques, resulting in hierarchical Q-learning, neural Q-learning, partially observable Markov decision processes, policy gradient RL, quantum-inspired RL, and actor-critic methods [2] [3].

In robotics, however, RL has mostly been used for solving isolated problems [11] [4] [12], such as navigation. In particular, the use of the Q-learning algorithm as a learning method in a real robot interacting with its environment, although it represents an under-exploited topic, can be found in some Lego Mindstorms NXT applications, such as obstacle avoidance [13], line following [14], walking [15], phototaxis behavior [16], and pattern-based navigation [17]. There are also more advanced RL works in real robots, including the Lego Mindstorms NXT, which are out of the scope of this work.

A review of the above publications reveals that the difficulty of convergence to quasi-optimal solutions in RL, including Q-learning, is a well-known problem. The learning process often remains incomplete or falls into local maxima when the Q-learning parameters are not properly tuned [10]. In a real scenario, these problems are aggravated by operating in real time. In that case, the Q-learning parameters define whether the learning process can produce appropriate results in a few minutes or in several months instead.

Thus, real-robot Q-learning parameter tuning can be quite different from simulation parameter tuning. Besides, real robot-based works do not address the step of evolving from Q-learning off-line simulations to robot implementations from a global perspective. As a result, the Q-learning parameters need to be analyzed and tuned for each task/system combination. The lack of a flexible generic mechanism for this is the main motivation of this proposal. The Q-learning parameters are summarized here [10] [8]:

• Reward, the instant recompense obtained in each learning step. The task an agent has to learn is indirectly defined by a set of rewards.

• Exploitation/exploration strategy, the rule that selects the action to perform at each learning step. This strategy determines whether the agent exploits its learned policy or explores new, unknown actions.

• Learning rate, or how much the new knowledge influences the learning process at the expense of previously acquired knowledge [18].

• Discount rate, the forecast degree of the learning process. It sets how the current reward will influence the Q-matrix values and policy values, both representing the internal state of the Q-learning algorithm, in the next steps with respect to future steps. This parameter defines whether the learning process will generate strategies with small or large look-ahead sequences of steps, and the amount of steps needed for learning an adequate policy.

Research proposal

The proposed work addresses the design and implementation of a Q-learning-based method in a small mobile robot so as to autonomously develop an obstacle-avoidance task. The educational robot Lego Mindstorms NXT meets both the hardware and the software requirements of this work.

Our setup proposal employs a differential-vehicle configuration. This framework allows us to perform parallel studies based on both robot and environment models. In addition, this work involves solving computing problems caused by the lack of real-number processing in the robot: a proper quantization of sensor values and available actions, and a fixed-point or floating-point notation system, must be developed.

The study will begin with a simple wandering task employing a reduced number of states and actions. This configuration will allow us to implement an off-line learning process model in which the Q-learning parameters can be analyzed and tuned for the learning process to converge within a small amount of time. In the next step, the resulting parameters will be used in the real robot. An analysis of divergences with the off-line learning process will be carried out to detect the implications of both the real-time nature of experimentation in a physical environment and the above-mentioned numerical problem. Afterwards, the number of states and actions of the system will be increased in order to evaluate its learning capacity in a complete obstacle-avoidance navigation task. As a result, this work leads to the design of a modified Q-learning-based method for a real small robot, focused on simplicity and applicability, and with enough flexibility for adapting the resulting techniques to other tasks and different systems.


Contents

1 Introduction
  1.1 Description of the work
  1.2 Objectives
  1.3 Methodology
  1.4 Scheduling
  1.5 Materials
  1.6 Content of the thesis

2 Preliminary study. Reinforcement learning and the Q-learning algorithm
  2.1 A brief introduction
    2.1.1 Model-based decision-making processes
    2.1.2 Model-free decision-making processes. The Q-learning algorithm
  2.2 Q-learning: The algorithm
  2.3 Q-learning practical implementations on Lego Mindstorms NXT robots
  2.4 Other reinforcement learning techniques

3 Simple wandering task (I). Design and Simulation
  3.1 Mobile robot configuration
  3.2 Task definition: 4 states + 4 actions
  3.3 Models
    3.3.1 Reward function
    3.3.2 Transition function and system model
  3.4 Simulation. Q-learning parameters tuning
    3.4.1 Reward tuning
    3.4.2 Exploitation-exploration strategy
    3.4.3 Learning rate and discount rate tuning

4 Simple wandering task (II): Robot implementation
  4.1 Translation of simulation code into NXC
    4.1.1 NXT_io library
    4.1.2 Main program
  4.2 Q-learning algorithm with CPU limitations
    4.2.1 Overflow analysis
    4.2.2 Precision Analysis
    4.2.3 Memory limits
  4.3 Implementation results

5 Complex wandering task
  5.1 Wander-2 task
  5.2 Obstacle avoidance task learning: wander-3 implementation and results
  5.3 Summary

6 Conclusions and future work

A Implementation details
  A.1 Debugging, compilation, and NXT communication
  A.2 Datalogs
  A.3 Brick keys events
  A.4 Online information through the display and the loudspeaker

B Octave/Matlab scripts, functions, and files generated by the robot
  B.1 modelData.m (wander-1)
  B.2 simulateSimpleTask_ConceptualModel.m (wander-1)
  B.3 simulateSimpleTask_FrequentistModel.m (wander-1)
  B.4 Q_learning_simpleTask_Conceptual.m (wander-1)
  B.5 Q_learning_simpleTask_Frequentist.m (wander-1)

C Source codes developed for robot implementations (NXC)
  C.1 NXT_io.h
  C.2 simple_task_learning.nxc (wander-1)
  C.3 5s_learning.nxc (wander-2)
  C.4 final_learning.nxc (wander-3)
  C.5 simpleTaskDataCollector.nxc (wander-1)

Bibliography


List of Figures

1.1 Robot Lego Mindstorms NXT.

2.1 MDP robotic example.
2.2 Q-learning algorithm.

3.1 NXT basic mobile configuration.
3.2 NXT Two-contact bumper.
3.3 NXT final setup.
3.4 Reward function.
3.5 Wander-1 transition function T(s, a, s′) represented as P(s′|s, a) in the conceptual model.
3.6 Wander-1 transition function T(s, a, s′) represented as P(s′|s, a). Frequentist model.
3.7 T(s, a, s′) differences between conceptual and frequentist model.
3.8 Q-learning algorithm used in all simulations.
3.9 Wander-1 simulation: results of learning rate α and discount rate γ combinations for the conceptual model.
3.10 Wander-1 simulation: results of learning rate α and discount rate γ combinations for the conceptual model.
3.11 Wander-1 simulation: learning rate α and discount rate γ results for the frequentist model.
3.12 Wander-1 simulation: learning rate α and discount rate γ results for the frequentist model.

4.1 Q-learning algorithm in NXC.
4.2 Function observeState() written in NXC.
4.3 Function obtainReward() written in NXC.
4.4 Q-learning algorithm implemented for the analysis of precision.
4.5 Q-matrix differences between offline and NXT simulation.
4.6 Step in which the optimal policy was learned for different FP, γ and α in the NXT robot.
4.7 Q-matrix values with FP=1000 and FP=10000. Q-matrix values are multiplied by FP.
4.8 Differences in Q-matrix values between FP=1000 and FP=10000.
4.9 NXT memory restrictions for states/actions: example of memory needed (left), number of states and actions available (center and right).
4.10 Scenario A for the wander-1 task implementation.
4.11 Wander-1 learning implementation. Summary of experiments (* Divided by FP).
4.12 Wander-1 learning implementation. Examples of the resulting Q-matrix.
4.13 Wander-1 task implementation. Robot during the learning process.
4.14 Wander-1 task implementation. Robot exploiting the learned policy.

5.1 Angle restriction of the ultrasonic sensor for detecting frontal obstacles.
5.2 ObserveState() routine for the wander-2 task.
5.3 Scenario B for the wander-2 implementation.
5.4 Wander-2 learning implementation: Summary of experiments for scenario A (* Divided by FP).
5.5 Wander-2 learning implementation: Summary of experiments for scenario B (* Divided by FP).
5.6 Wander-2 task implementation. Robot during the learning process.
5.7 Wander-2 task implementation. Robot exploiting the learned policy.
5.8 Implementation of the exploration/exploitation strategy for the wander-3 task.
5.9 Scenario C for the wander-3 task implementation.
5.10 Results of the wander-3 task implementation test in scenario A.
5.11 Results of the wander-3 task implementation tests in Scenario C.
5.12 Wander-3 task implementation. Robot exploiting the learned policy (top to bottom, left to right).

A.1 Brick Command Center v3.3.
A.2 Datafile generated by the wander-2 (five-states) task program on the robot.
A.3 NXT display with the learning information of the wander-1 task.


List of Tables

3.1 Wander-1 task description.
3.2 Adjustment of rewards.
3.3 Wander-1 optimal policy.
3.4 Best Q-learning parameters selected after simulation tests.
3.5 Simulation results with the selected parameters. Frequentist Model.

4.1 Additional implemented techniques for experimenting with the Lego robot.
4.2 Content of the NXT_io.h library.
4.3 Rewards definition in the robot.
4.4 Limits in the Q-matrix values to avoid overflow when using a fixed-point notation system.
4.5 Q-matrix reference values obtained from Octave simulation. The optimal policy (values in bold) was learned in step 298.
4.6 Q-matrix values from the NXT simulation. The optimal policy (values in bold) was learned in step 298 (note: Q-matrix values have been multiplied by FP).
4.7 Parameters of the wander-1 task implementation.

5.1 Wander-2 task description.
5.2 Wander-2 task optimal policy.
5.3 Wander-2 task implementation parameters.
5.4 Wander-3 task. Description of states and actions.
5.5 Wander-3 task distance ranges considered in the ultrasonic sensor.
5.6 Wander-3 task implementation parameters.


Chapter 1

Introduction

This chapter contains a description of a thesis developed for the MSc degree in Mechatronics Engineering of the University of Malaga. The objectives, methodology, project phases and materials involved in this work are summarized here. The content of the other chapters is briefly described in the last section.

1.1 Description of the work

This master thesis aims at the design and implementation, on the educational robot Lego Mindstorms NXT, of reinforcement learning methods capable of automatically learning the best way to develop specific tasks.

The selected learning method has the advantage of not requiring a model of the environment in which the robot will operate, but is based on the implementation of a series of rewards which define the objective of the task we want to learn. In general, reinforcement learning (RL) methods allow finding quasi-optimal actuation strategies based on environment-robot interaction.

Among reinforcement learning methods, this work focuses on Watkins' Q-learning algorithm [9]. Publications on this topic show the simplicity and efficacy of Q-learning algorithm implementations compared to other methods. Most publications, however, address the theoretical and mathematical sides of the methodology, with studies that are generally based on simulations. Works based on real implementations generally combine reinforcement learning with other advanced techniques, resulting in very complex methods, such as hierarchical Q-learning, neural Q-learning, policy gradient reinforcement learning, quantum-inspired RL, actor-critic methods, and combinations of Monte Carlo and TD methods.

In the present work we are interested in the use of the Q-learning algorithm as a learning technique on a real robot in continuous interaction with its environment, in particular when the robot has important software and hardware limitations like the Lego Mindstorms NXT. A review of the state of the art on the use of RL with Lego robots has shown that the learning of isolated tasks is generally pursued, such as floor-marked path following, obstacle avoidance or phototaxis behavior.

Here, we will focus on a minimalist implementation of a Q-learning method that does not go into advanced or combined techniques, leaving them outside the scope of our study. In order to address this problem correctly, we will take into account the following considerations:

- According to the context of this work, belonging to the Master Degree in Mechatronics Engineering, it will be interesting to employ the same differential vehicle configuration as the one used in classes. This framework will allow us, firstly, to develop on a well-known configuration, and secondly, to design Q-learning practice exercises for future Master students.


- This work also involves the resolution of computing problems caused by the inability of the robot CPU to process real numbers. Hence, a study leading to the right quantization of sensor values and available actions is essential, along with an analysis of the loss of precision due to the fixed-point or floating-point notation systems needed by the Q-learning algorithm; a brief numerical sketch of this issue is given below.
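To illustrate the fixed-point issue on a single Q-learning update, the following Octave/Matlab sketch compares a floating-point update with one stored as integers scaled by a factor FP (a scale factor of this kind is used later in chapter 4); all numerical values and names here are illustrative assumptions, not the thesis code:

    % Minimal illustrative sketch (not the thesis code): effect of a fixed-point
    % scale factor FP on a single Q-learning update. All values are assumptions.
    FP = 1000;                  % scale factor: Q values stored as round(Q*FP)
    alpha = 0.4; gamma = 0.7;   % assumed learning and discount rates
    R = 1; V_sp = 2.5;          % assumed reward and value of the reached state
    Q_float = 1.2;              % floating-point Q(s,a)
    Q_fixed = round(Q_float * FP);                   % integer representation
    Q_float = (1 - alpha)*Q_float + alpha*(R + gamma*V_sp);
    Q_fixed = round((1 - alpha)*Q_fixed) + round(alpha*(R + gamma*V_sp)*FP);
    fprintf('float: %.4f  fixed/FP: %.4f  error: %.2e\n', ...
            Q_float, Q_fixed/FP, Q_float - Q_fixed/FP);

The rounding error of a single step is small, but it accumulates over many learning steps, which is why the precision analysis of chapter 4 is needed.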

We will begin this work by learning a simple wandering task with a reduced number of states and actions. This set-up will allow us to implement an off-line model of the learning process, in which the Q-learning parameters can be studied. These parameters can be adjusted in order to favor an optimal or suboptimal learning convergence within a relatively small amount of time. After that, we implement and execute the obtained solution on the robot, so as to analyze the divergences with the off-line process. This analysis will focus, on the one hand, on the real-time nature of the experimentation in a physical environment, and, on the other hand, on the above-mentioned numerical problems. The study of this last topic is particularly interesting considering that it would allow us to evaluate the use of these techniques in other microcontroller-based systems.

Once the above problems are solved, we will proceed to gradually enlarge the number of states and actions of our study, adding ultrasonic, wheel-encoder and contact sensors to our system. In this way, we evaluate the capacity of learning a relatively complex action sequence for an obstacle-avoidance task.

1.2 Objectives

The main objective of this thesis is the implementation of a Q-learning-based method for the robot Lego Mindstorms NXT to learn an obstacle-avoidance wandering task.

The difficulty of converging to optimal or quasi-optimal solutions in reinforcement learning, including the Q-learning algorithm, is a well-known problem, since the learning process itself may remain insufficient or fall into local maxima. This occurs easily when the Q-learning parameters, or even the rewards, are not properly tuned. In a real robot these problems are intensified by the constraint of real time: the above parameters, along with the selected exploration/exploitation strategy, will determine whether the process of learning a reasonable policy can be completed in a few minutes or delayed up to several months. In order to achieve the goal of our work efficiently, avoiding unnecessary backtracking, we have established the following milestones:

1. Simple wandering task learning (simulation). Using two frontal contact sensors and four movement actions (stop, forward, left-turn and right-turn), the robot will learn how to wander, dodging obstacles after touching them. As an example, it is expected that in a state corresponding to a collision with the left front contact, the robot learns that turning right is the action most likely to lead to its optimal policy. A numerical computing environment (Octave/Matlab) will be used to achieve this milestone, analyzing and tuning the Q-learning parameters. The aim of this first stage is to obtain an algorithm that learns the optimal policy in a relatively low number of steps in most experiments. The findings will be used as a reference in subsequent studies.

2. Simple wandering task learning implementation in the Lego Mindstorms NXT. This stage requires an adaptation of the Q-learning algorithm so that it can be processed by a microcontroller. It implies the usage of fixed-point or floating-point numbers. Thus, it will be necessary to analyze the effects of this on our learning process because of the loss of precision in the Q-matrix values. The milestone here is that the robot learns the optimal policy in a relatively small period of time so that we can perform a comparative study between simulated and experimental data. The implementation of some auxiliary tools is required for proper experimentation: debugging routines, datalogs, etc.

3. Shaping from the learning of the basic wandering task to the final wandering task by adding an ultrasonic sensor and incrementing the number of states. This will be gradually implemented following the previous milestones in each shaping step.

1.3 Methodology

The solution developed in this work focuses on simplicity and applicability. Besides, we will emphasize having enough flexibility for adapting the resulting techniques so as to learn other tasks on different systems that also share the limits found in this study. Off-line simulations will be implemented as scripts and functions in Octave/Matlab, while real experiments will be developed in the NXC (Not eXactly C) programming language for the robot. The employment of the basic differential vehicle configuration and the NXC language will allow us to maintain the framework generally used in the rest of the courses of the Master in Mechatronics Engineering.

1.4 Scheduling

This thesis has been developed following a number of stages:

1. Q-learning theoretical and related work study.

2. Simple wandering task learning (simulation).

3. Simple wandering task learning implementation on the Lego Mindstorms NXT robot.

4. Complex wandering task learning.

1.5 Materials

The hardware and software used in this thesis are enumerated here:

• Robot Lego Mindstorms NXT (Kit NXT Education Base Set 9797): 32-bit ARM7 CPU at 48 MHz. Firmware v1.31. Flash memory (program): 256 KB. RAM memory (variables): 64 KB (see figure 1.1).

• Personal computer: Intel Core 2 Duo CPU P8400 @ 2.26 GHz. OS: Ubuntu 12.10 / Windows XP.

• Octave 0.10.1 / Matlab 7.10 (R2010a).

• NBC 1.2.1.r4 and NeXTTool 1.2.1.3 / Brick Command Center v3.3 (Build 3.3.8.9).


Figure 1.1: Robot Lego Mindstorms NXT.

1.6 Content of the thesis

The content of this work is summarized as follows:

• Chapter 2 provides an introduction to reinforcement learning and the Q-learning algorithm, along with a review of the state of the art of the present proposal.

• The developed work begins in chapter 3 with the study of the parameters involved in the process of learning a simple simulated obstacle-avoidance wandering task.

• Chapter 4 translates the results of the previous chapter to the robot and performs a comparative analysis between the Q-learning simulation and the real robot implementation.

• Knowing the implications and limitations of the Q-learning algorithm in a simple wandering task, chapter 5 shows how the previously tuned learning method behaves when learning more complex tasks. This is done by adding new states and actions.

• Chapter 6 contains the conclusions of this thesis and some proposed further work.

• Appendix A shows some relevant implementation details, which improve the performance of the experiments with the robot.

• Finally, appendixes B and C collect all the code written for the thesis: the Octave/Matlab simulation scripts and the robot applications written in NXC.


Chapter 2

Preliminary study. Reinforcement learning and the Q-learning algorithm

This chapter is an introduction to reinforcement learning and the Q-learning algorithm, which constitute the basic approach of this work. A description of the Q-learning parameters and a review of the state of the art in these techniques are included.

2.1 A brief introduction

An introduction and practical approach to the reinforcement learning problem and the Q-learning algorithm make up this section. Basic statistics and probability theory concepts are required for a correct understanding of this topic. For a more detailed introduction to reinforcement learning, we suggest [8] and [7].

We begin by introducing the concept of decision-making under uncertainty as a cognitive process capable of selecting a set of actions that we expect will lead to specific outcomes. Examples of the application of these processes can be found in economic and military decisions, for instance. The decision-making concept distinguishes between stochastic and deterministic processes, according to whether there is randomness at some point or not. Another distinction of special interest for our work concerns the availability of a model of the system, resulting in model-based or model-free decision-making processes.

2.1.1 Model-based decision-making processes

To begin with, it is necessary to introduce the concept of Markov Decision Process (MDP), defined as a Dynamic Bayesian Network (DBN) [19] whose nodes define the system states, and whose arcs define the actions along with their associated rewards (see fig. 2.1). Formally, an MDP is a tuple (S, A, T, R), where S is a finite set of states, A a finite set of actions, T the transition function, and R the reward function defined as R : S × A × S′ → ℝ.

Some relevant characteristics of MDPs are:

• The Markov property is applicable to these networks, meaning that every state has no history dependence except for its prior state:

\[ P(s'_k \mid a, s_{0:k-1}) = P(s'_k \mid a, s_{k-1}) \qquad (2.1) \]

• When executing a sequence of actions, a special DBN called a Markov chain is obtained, where actions and rewards are no longer variable.


Figure 2.1: MDP robotic example.

• Each time an action is executed, stochasticity will be present in both the new state reached and the obtained reward.

• We assume total observability; consequently the current state will always be perfectly known.
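As a concrete illustration (not taken from the thesis), a tiny MDP tuple (S, A, T, R) with two states and two actions could be encoded in Octave/Matlab as follows; all probabilities and rewards are made-up values:

    % Illustrative encoding of a tiny MDP (S,A,T,R); all numbers are made up.
    NS = 2; NA = 2;                 % |S| and |A|
    T = zeros(NS, NA, NS);          % T(s,a,sp) = P(sp | s,a)
    T(1,1,:) = [0.9 0.1];           % action 1 in state 1: mostly stays in state 1
    T(1,2,:) = [0.2 0.8];
    T(2,1,:) = [0.5 0.5];
    T(2,2,:) = [0.1 0.9];
    R = [ 1.0  0.0 ;                % R(s,a): expected reward of action a in state s
         -1.0  0.5 ];

Such a representation is what the model-based algorithms described below (value iteration and policy iteration) operate on.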

An important concept in machine learning with MDPs is that of a policy. We call a (stationary) policy a function π : S → A defining the action a to execute when the agent is in state s. Thus, a policy is the core of a decision-making process. The term policy value, or Vπ, indicates the goodness of a policy, measured through the expected rewards obtained when it is executed. Among the different implementations of policy values, we will use the total expected discounted reward for our reinforcement learning problem, in which the policy value is the expected reward gathered after infinite steps of the decision process, decreasing the importance of future rewards as we move forward in time. This can be expressed as:

\[ V_\pi(s_0) = E\left[R(s_0,\pi(s_0),s_1)\right] + \gamma \cdot E\left[R(s_1,\pi(s_1),s_2)\right] + \gamma^2 \cdot E\left[R(s_2,\pi(s_2),s_3)\right] + \cdots \qquad (2.2) \]

being γ ∈ (0, 1), and resulting in an exponentially weighted average. A more succinct expression can be obtained by separating the first term in (2.2) in order to define Qπ:

\[ Q_\pi(s,a) = \sum_{s' \in succ(s)} T(s,a,s') \cdot E\left[R(s,a,s')\right] + \sum_{s' \in succ(s)} \gamma \cdot T(s,a,s') \cdot V_\pi(s') \qquad (2.3) \]

It can be demonstrated that at least one optimal policy π∗ exists, that is, one having the greatest value V∗π. In the case that two or more optimal policies appear, their values will be identical. The following expressions serve to calculate both the optimal policy value and the optimal policy:

\[ V^*_\pi(s_0) = \max_a Q^*_\pi(s_0,a) \qquad (2.4) \]

\[ \pi^*(s_0) = \arg\max_a Q^*_\pi(s_0,a) \qquad (2.5) \]

These expressions, called the Bellman equations, can be used recursively for improving an arbitrary initial policy. Practical implementations involve an intermediate definition: the expected reward of executing action a when the agent is in state s:

\[ R(s,a) = E\left[R(s,a,s')\right] = \sum_{s' \in succ(s)} T(s,a,s') \cdot E\left[R(s,a,s')\right] \qquad (2.6) \]


Simplifying (2.3) through (2.6):

\[ Q_\pi(s,a) = R(s,a) + \sum_{s' \in succ(s)} \gamma \cdot T(s,a,s') \cdot V_\pi(s') \qquad (2.7) \]

Therefore,

\[ V^*_\pi(s_0) = \max_a \left[ R(s_0,a) + \sum_{s' \in succ(s_0)} \gamma \cdot T(s_0,a,s') \cdot V^*_\pi(s') \right] \qquad (2.8) \]

Classic algorithms for finding optimal policies using the Bellman equations are Value iteration and Policy iteration, which converge to the optimal policy up to a certain error. Unfortunately, there is no guarantee that either of these algorithms converges in a short time. This is one of the reasons for the growing interest in improving these algorithms for their use in robots. The value iteration algorithm is:

1. $\forall a, \forall s: \quad Q_k(s,a) = R(s,a) + \gamma \cdot \sum_{s'} T(s,a,s') \cdot V_{k-1}(s')$

2. $\forall s: \quad V_k(s) = \max_a Q_k(s,a), \quad \pi_k(s) = \arg\max_a Q_k(s,a)$

3. Repeat the loop until $|V_k(s) - V_{k-1}(s)| < \epsilon, \; \forall s$
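The value iteration loop above can be sketched in Octave/Matlab as follows. This is an illustrative implementation, not one of the thesis scripts; it assumes T is given as an NS×NA×NS array of transition probabilities and R as an NS×NA matrix of expected rewards (as in the small illustrative MDP encoding shown earlier):

    % Illustrative value iteration (not a thesis script).
    % T(s,a,sp) = P(sp|s,a), NSxNAxNS array; R(s,a) = expected reward, NSxNA matrix.
    function [V, Policy] = value_iteration(T, R, gamma, epsilon)
      [NS, NA] = size(R);
      V = zeros(NS, 1);
      Policy = ones(NS, 1);
      while true
        Q = zeros(NS, NA);
        for s = 1:NS
          for a = 1:NA
            Q(s,a) = R(s,a) + gamma * squeeze(T(s,a,:))' * V;   % step 1
          end
        end
        [Vnew, Policy] = max(Q, [], 2);                         % step 2
        if max(abs(Vnew - V)) < epsilon                         % step 3
          V = Vnew;
          break;
        end
        V = Vnew;
      end
    end

For instance, calling [V, Policy] = value_iteration(T, R, 0.9, 1e-4) on the toy model above returns its optimal value function and policy.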

2.1.2 Model-free decision-making processes. The Q-learning algorithm

Equation (2.6) can also be expressed as:

\[ R(s,a) = E_{s' \in succ(s)}\left[R(s,a,s')\right] \equiv E_{s'}\left[R(s' \mid s,a)\right] \qquad (2.9) \]

Similarly:

\[ \gamma \cdot \sum_{s' \in succ(s)} T(s,a,s') \cdot V_\pi(s') \equiv \gamma \cdot E_{s'}\left[V_\pi(s')\right] \qquad (2.10) \]

Hence, Qk(s, a) has the same behavior as an expectation:

\[ Q_k(s,a) = E_{s'}\left[R(s' \mid s,a)\right] + \gamma \cdot E_{s'}\left[V_\pi(s')\right] \qquad (2.11) \]

We will use this expression for deciding what to do when the model is not available. In that case, there is no information about the transition function T(s, a, s′), but we can estimate Qk(s, a) from observations by modifying the value iteration algorithm like this:

\[ V_k(s) = \max_a Q_k(s,a) \qquad (2.12) \]

\[ \pi_k(s) = \arg\max_a Q_k(s,a) \qquad (2.13) \]

\[ Q_k(s,a) = E_{s'}\left[ R(s' \mid s,a) + \gamma \cdot V_{k-1}(s') \right] \qquad (2.14) \]

Qk(s, a) can be obtained as an average of some observations gathered from the real system:

\[ Q_k(s,a) = \frac{\sum_i O_i}{n}, \qquad O_i = R(s' \mid s,a) + \gamma \cdot V_{k-1}(s') \qquad (2.15) \]

Along with the transition function T(s, a, s′), the complete reward function R(s, a) is not available in a model-free decision-making process. In other words, the lack of a model and of reward knowledge is solved by making observations. This is the basis of the reinforcement learning concept.

Expression (2.15) is similar to a batch sample average. However, a reinforcement learning problem is a continuous process evolving from an initial policy to a pseudo-optimal policy by sequentially executing actions, making observations, and updating its policy values. Therefore, a recursive or sequential estimation of the average must be employed instead, based on:

\[ \nu_k = \frac{1}{k} \sum_{i=1}^{k} O_i \qquad (2.16) \]

\[ \nu_{k-1} = \frac{1}{k-1} \sum_{i=1}^{k-1} O_i = \left( \frac{1}{k} \sum_{i=1}^{k} O_i - \frac{O_k}{k} \right) \frac{k}{k-1} = \left( \nu_k - \frac{O_k}{k} \right) \frac{k}{k-1}, \]

and solving for \( \nu_k \):

\[ \nu_k = \frac{k-1}{k} \cdot \nu_{k-1} + \frac{1}{k} \cdot O_k \qquad (2.17) \]

Defining \( \alpha_k = \frac{1}{k} \):

\[ \nu_k = (1 - \alpha_k) \cdot \nu_{k-1} + \alpha_k \cdot O_k \qquad (2.18) \]

Finally, applying this \( \nu_k \) expression to Qk(s, a):

\[ Q_k(s,a) = (1 - \alpha_k) \cdot Q_{k-1}(s,a) + \alpha_k \left[ R(s,a,s') + \gamma \cdot V_{k-1}(s') \right] \qquad (2.19) \]

where \( V_{k-1}(s') = \max_{a'} Q_{k-1}(s',a') \) according to the Bellman equations.

Equation (2.19) represents the general form of the Q-learning algorithm, and it will be used repeatedly throughout this work.
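As a quick numerical check of the sequential average (2.16)–(2.18), the following Octave/Matlab snippet (illustrative only) updates ν recursively with α_k = 1/k and compares the result with the batch mean of the same observations:

    % Illustrative check of eq. (2.18): the recursive average with alpha_k = 1/k
    % reproduces the batch sample average of the observations O_i.
    O = [0.5, -0.2, 1.0, 0.1, 0.7];    % assumed sequence of observations
    nu = 0;                            % nu_0
    for k = 1:length(O)
      alpha_k = 1 / k;
      nu = (1 - alpha_k)*nu + alpha_k*O(k);   % eq. (2.18)
    end
    fprintf('recursive: %.4f   batch mean: %.4f\n', nu, mean(O));

Both quantities coincide, which is what allows Q-learning to replace the batch average (2.15) with the incremental update (2.19).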

2.2 Q-learning: The algorithm

A practical Q-learning algorithm structure employed in this thesis is shown in figure 2.2. As stated before, the success of a reinforcement learning process is subject to the accurate choice of its parameters. Here we summarize these parameters; a more detailed explanation of this topic can be found in [10].

• Reward. The task we want an agent to learn in a reinforcement learning problem is defined by assigning proper rewards as a function of the current state, the action executed, and the newly reached state.

• Exploitation/exploration strategy. It decides whether the agent should exploit its currently learned policy or experiment with unknown actions at each learning step.

• Learning rate (α). It establishes how new knowledge influences the global learning process. When α = 1, only brand new knowledge is taken into account (that of the current step). On the contrary, when α = 0, new actions do not affect the learning process at all (no learning).

• Discount rate (γ). It regulates the look-ahead of the learning process by determining how much the current reward will influence the Q-matrix values in the next steps. When γ = 0, only immediate rewards are taken into account, hence our agent will only be able to learn strategies with a sequence of one step. On the other hand, when γ values are close to 1, the learning process will allow strategies with larger sequences of steps, although that involves a longer learning process for obtaining a reasonable policy.
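As an example of one common exploitation/exploration rule, the following Octave/Matlab sketch implements an ε-greedy action selection. This is only an illustration: the strategy actually used in this thesis is selected and tuned in section 3.4.2, and the function name and ε parameter are assumptions.

    % Illustrative epsilon-greedy action selection (one possible strategy; the
    % strategy used in the thesis is tuned in section 3.4.2).
    function a = epsilon_greedy(Q, s, epsilon, n_actions)
      if rand() < epsilon
        a = randi(n_actions);        % explore: random action
      else
        [~, a] = max(Q(s, :));       % exploit: best known action in state s
      end
    end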


% Q-learning algorithm parameters
N_STATES, N_ACTIONS, INITIAL_POLICY, GAMMA

% Experiment parameters
N_STEPS, STEP_TIME

% System parameters

% Variables
s, a, sp     % (s,a,s')
R            % Obtained reward
alpha        % Learning rate parameter
Policy       % Current policy
V            % Value function
Q            % Q-matrix
step         % Current step

% Initial state
V = 0, Q = 0, Policy = INITIAL_POLICY
s = observe_state()    % sensors-based observation

for step = 1:N_STEPS   %-------------------- Main loop --------------------

    a = Exploitation_exploration_strategy()

    robot_execute_action(a), wait(STEP_TIME)

    sp = observe_state(), R = obtain_Reward()

    Q(s,a) = (1-alpha)*Q(s,a) + alpha*(R + GAMMA*V(sp))

    V(s)_update()
    Policy(s)_update()
    alpha_update()
    s = sp             % update current state

end                    %------------------ Main loop end ------------------

Figure 2.2: Q-learning algorithm.

2.3 Q-learning practical implementations on Lego Mindstorms NXT robots

A brief review of relevant scientific publications about implementations of Q-learning on Lego Mindstorms NXT robots follows:

• Obstacle avoidance task learning [13]: This paper presents a design of 36 encoding states and 3 actions employing 1 ultrasonic and 2 contact sensors. Programming language: LeJOS. A comparative study shows better results for traditional Q-learning than for Q-learning with a Self-Organizing Map.

• Line follower task learning [14]: Q-learning algorithm implemented in Matlab using USB communication with the robot. Definition of 4 states encoding 2 light sensors and 3 available actions. This design includes an ultrasonic sensor for reactive behavior only.

• Walking and line follower tasks learning [15]: SARSA algorithm on the NXT robot through the LegOS programming language. A four-legged mounting moved by 2 servomotors and 1 light sensor aimed at a grey-gradient floor for the walking task; a four-wheel vehicle setup and 2 light sensors for the line follower task.


• Phototaxis behavior [16]: Implementation of three different layers on the NXT robot: low-level navigation, exploration heuristic, and Q-learning layer. Almost every function of the robot is executed on a computer that communicates with the robot via Bluetooth. This setup includes 11 encoding states based on 2 light and 2 contact sensors, and 8 actions representing wheel power levels.

• Other Q-learning-related works on the NXT robot can be found in several project proposals in robotics courses, including wandering, wall following, pursuit, and maintaining a straight path.

All these works are focused on learning very specific tasks, usually taking advantage of heuristics or other techniques to improve the learning process.

2.4 Other reinforcement learning techniques

Although advanced reinforcement learning techniques have been left out of the scope of this work due to the limitations of the Lego robot, it is important to mention that the majority of publications on this topic cover techniques also belonging to the temporal difference (TD) learning methods [8]. These methods are based on estimates of the final reward for each state, which are updated in every iteration of the learning process.

Along with temporal difference learning techniques, works including Monte Carlo methods, dynamic programming and exhaustive search [8] can easily be found in reinforcement learning studies. All these techniques have a solid theoretical basis and have been addressed by many researchers. Also, there is a growing interest in extending and combining these techniques, resulting in new methods such as hierarchical Q-learning, neural Q-learning, RL with partially observable Markov decision processes (POMDP), Peng and Naïve (McGovern) Q-learning, policy gradient RL, emotionally motivated RL, quantum-inspired RL, quantum parallelized hierarchical RL, combinations of Monte Carlo and TD methods, curiosity and RL, and actor-critic methods [7] [22] [2].


Chapter 3

Simple wandering task (I). Design and Simulation

This chapter addresses the study of the Q-learning algorithm in a simulated scenario for a simple obstacle-avoidance wandering task, from now on called wander-1. It includes the particular robot configuration, the model of the system, and a sensitivity analysis of the parameters of the Q-learning algorithm. The work described here will be used as a reference for the real robot implementation in the next chapter.

3.1 Mobile robot configuration

As discussed in the introduction chapter, we have employed the standard differential-drive mobile configuration suggested in the Lego Mindstorms NXT 9797 kit as the base assembly for our robot. This setup includes two driving wheels separated by 11 cm, a caster wheel, and an ultrasonic sensor or sonar (see fig. 3.1).

Since the 9797 kit only provides one sonar for obstacle detection, our proposal includes adding two contact sensors located on the front side of the robot (see fig. 3.2). These sensors will allow the learning process to find out the most convenient turning direction to execute in order to dodge an object after colliding with it. Besides, the sonar will only be able to detect objects located just in front of the robot; the contact sensors can mitigate this problem.

Using the 9797 kit pieces, we have built a solid two-contact bumper accessory which can be easily coupled at the front of the basic setup, in place of the light sensor shown in the 9797 kit guide (see fig. 3.3).

This configuration has been employed in all the tasks developed in this work, although the ultrasonic sensor has not been necessary for the wander-1 learning described in chapters 3 and 4.

3.2 Task definition: 4 states + 4 actions

To achieve the main objective of this thesis efficiently, we begin by studying the reinforcement learning of a simple wandering task, wander-1, using a reduced number of states and actions. This configuration will allow us to implement an off-line learning process model so that the Q-learning algorithm parameters can be easily separated and analyzed.

With 2 frontal contact sensors and 4 movement actions (stop, left-turn, right-turn and move forward), we intend the robot to learn how to wander dodging obstacles after touching them (see table 3.1). As stated in the introduction chapter, we expect that in a state with the left front contact activated, the robot learns that turning to the right is the action that will lead to its optimal policy. The ultrasonic sensor is not needed in the wander-1 design.


Figure 3.1: NXT basic mobile configuration.
Figure 3.2: NXT Two-contact bumper.

Figure 3.3: NXT final setup.


Task objective: Wandering, evading obstacles after touching them

States:
  s1  no contact pressed
  s2  left bumper contact pressed
  s3  right bumper contact pressed
  s4  both contacts pressed

Actions:
  a1  stop
  a2  left-turn
  a3  right-turn
  a4  move forward

Table 3.1: Wander-1 task description.
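For illustration, the state encoding of table 3.1 can be expressed as a simple mapping from the two bumper readings. The Octave/Matlab sketch below is illustrative only; the routine actually run on the robot is the NXC observeState() of figure 4.2:

    % Illustrative mapping from the two bumper contacts to the states of table 3.1
    % (the robot itself uses the NXC routine observeState() of figure 4.2).
    function s = observe_state_wander1(left_pressed, right_pressed)
      if ~left_pressed && ~right_pressed
        s = 1;     % s1: no contact pressed
      elseif left_pressed && ~right_pressed
        s = 2;     % s2: left bumper contact pressed
      elseif ~left_pressed && right_pressed
        s = 3;     % s3: right bumper contact pressed
      else
        s = 4;     % s4: both contacts pressed
      end
    end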

3.3 Models

An estimation of the rewards and the transition function is necessary so as to simulate a realistic reinforcement learning process. Both the reward and the transition functions described below are based on preliminary tests performed on the real robot.

3.3.1 Reward function

We will take advantage of the servomotor encoders of the wheels of the Lego robot to calculate the rewards: at each step, we will measure the relative increase of the encoders, assigning a larger reward when both wheels have rotated forward a greater angle (above a fixed threshold).

We intended this simple reward to be enough for the robot to learn the obstacle-avoidance wandering task: the lack of forward motion, which happens when the robot collides with a wall or an object, is interpreted as an absence of positive reward, leading the learning process to select another action. However, preliminary implementation tests showed that, once the robot collided with obstacles fixed to the ground, the chance of keeping both wheels rotating forward, sliding on the floor, while receiving positive reward, was so high that we were forced to abandon this single criterion.

We then decided to add a soft penalization when one frontal contact is activated, and a large penalization in case both contacts are on. In those situations, any positive reward due to the servomotor encoder measurements is discarded.
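As an illustration of this encoder-plus-contacts reward design (the routine executed on the robot is the NXC obtainReward() of figure 4.3), a minimal Octave/Matlab sketch could look as follows; the encoder threshold and the helper name are assumptions for illustration, while the penalty values mirror those of fig. 3.4:

    % Illustrative encoder-based reward with contact penalties (cf. fig. 4.3 for
    % the real NXC routine). The threshold and the function name are assumptions.
    function R = encoder_reward_sketch(d_enc_left, d_enc_right, left_contact, right_contact)
      THRESHOLD = 10;          % assumed minimum forward rotation per step (degrees)
      if left_contact && right_contact
        R = -0.7;              % both contacts: large penalization
      elseif left_contact || right_contact
        R = -0.2;              % one contact: soft penalization
      elseif d_enc_left > THRESHOLD && d_enc_right > THRESHOLD
        R = 1;                 % both wheels rotated forward: large reward
      else
        R = 0;                 % otherwise: no reward
      end
    end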

The final reward function of the wander-1 learning simulation has been implemented as shown in the Octave/Matlab code of fig. 3.4.

function R = obtainReward_simpleTask(s, a, sp)
% Input:  s: state (not used here), a: action, sp: new reached state
% Output: Gathered Reward
R = 0;
switch sp
    case 1              % No contact: R depends directly on the executed action
        switch a
            case 1
                R = 0;      % Robot stopped: no reward
            case 2
                R = 0.1;    % Turn: small reward
            case 3
                R = 0.1;    % Turn: small reward
            case 4
                R = 1;      % Forward motion: large reward
        end
    case 2              % Left bumper contact: small penalization
        R = -0.2;
    case 3              % Right bumper contact: small penalization
        R = -0.2;
    case 4              % Both bumper contacts: large penalization
        R = -0.7;
end

Figure 3.4: Reward function.

The reason for the particular values assigned to R will be explained in section 3.4.

3.3.2 Transition function and system model

In order to perform any off-line simulation, we need some information about the system and the environment. Regarding the environment, a 70x70 cm enclosure will be used for the wander-1 task.

The off-line model has been implemented as a function that simulates the execution of the selected action a from the state s, resulting in a new reached state s′. This is the transition function T(s, a, s′) of the system.

The simulations used a conceptual model, which was later replaced by a more detailed frequentist model based on the data collected by the robot. Both models were used and studied separately in the wander-1 simulation.


A: Conceptual model

In this model the transition function is the result of a preliminary and simplified analysis performed on the robot within its environment, by observing its general behavior and extracting intuitive deductions about it. Figure 3.5 shows the transition function as a conditional probability chart in which each cell represents P(s′|s, a).

Figure 3.5: Wander-1 transition function T (s, a, s′) represented as P (s′|s, a) in the conceptual model.

As an example, when the simulated robot is in state 3 (right bumper contact) and it executes action 3 (right-turn), the probability of reaching s′ = 4 (both bumper contacts) has been considered to be 0.9.

The simulator code of appendix B.2 implements the above transition function in an intuitive way so as to allow performing minor adjustments in case we want to explore some states more deeply. This code was later replaced by a generic algorithm that generates the transition function from frequentist data obtained by the robot in the real environment.

B: Frequentist model

The NXC code of appendix C.5 was written for collecting the necessary data for designinga more realistic frequentist model of the robot in its environment. At each step, an action isselected according to a discrete uniform distribution.

After compiling and running C.5 in the 70x70 cm environment, we obtained the file listed in appendix B.1, containing the variable data(s, a, s′) with the statistical information of the system. As an example, data(1, 2, 3) = 12 means that in state s1 the action a2 led to s3 a total of 12 times. A simple algorithm, listed in appendix B.3, was then used to generate the transition function T(s′|s, a) from this statistical data, shown in figure 3.6.
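
The normalization performed by that algorithm can be summarized by the following illustrative Octave/Matlab sketch (not the appendix B.3 code itself): every row of counts is divided by its total, so that each slice becomes a proper conditional distribution.

function T = countsToTransition(data)
  % data(s,a,sp): number of times action a taken in state s led to state sp
  [nS, nA, ~] = size(data);
  T = zeros(size(data));
  for s = 1:nS
    for a = 1:nA
      total = sum(data(s, a, :));
      if total > 0
        T(s, a, :) = data(s, a, :) / total;   % T(s,a,sp) = P(sp|s,a)
      end
    end
  end
end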

The differences between the conceptual and frequentist models are shown in figure 3.7. Logically, the frequentist model offers a closer approximation to the real situation than the conceptual one. Moreover, 30% of the values of the transition function differ by more than 0.2 (up to 0.85) when comparing both models. Even though in this first study we are dealing with a basic task in a simple environment, we have shown that an intuitive conceptual model can be completely inaccurate. Section 3.4 will demonstrate that both models result in different Q-learning parameters for learning the same task.


Figure 3.6: Wander-1 transition function T (s, a, s′) represented as P (s′|s, a). Frequentist model.

Figure 3.7: T (s, a, s′) differences between conceptual and frequentist model.

3.4 Simulation. Q-learning parameters tuning

For implementing the simulation, we need to translate the Q-learning pseudo-code shown in section 2.2 into a script in our simulation language. The numerical computation platforms Octave and Matlab were employed in this work. Minor modifications have been made to the scripts and functions so as to maintain compatibility with both programming languages. Thus, every experiment can be reproduced by executing the Octave/Matlab scripts shown in this report without any change. The structure of the code is similar to the C programming language, without taking advantage of the matrix computation capabilities of Octave/Matlab. This structure facilitates comparative studies between Octave/Matlab and NXC, the language used in the robot. The main loop containing the Q-learning algorithm implemented and employed in all simulations is shown in fig. 3.8.

That code has been extracted from Q_learning_simpleTask_Frequentist.m, listed in appendix B.5. Analogously, Q_learning_simpleTask_Conceptual.m is shown in appendix B.4. We call the simulate_simpleTask(s,a) and obtainReward_simpleTask(s,a,sp) functions, both described in the previous section.

The next step concerns the adjustment of the parameters of the learning process. Since any Q-learning parameter can modify the behavior of the rest, we fixed the reward values and the exploitation-exploration strategy before tuning the learning rate α and discount rate γ. This procedure allows us to obtain the optimal values of α and γ by performing experiments under the same conditions, once a valid strategy that explores all the Q-matrix cells and a


for step = 1:N_STEPS

  % selectAction (Exploitation/exploration strategy)
  if ceil(unifrnd(0,100)) > 70   % simple e-greedy
    a = ceil(unifrnd(0,4));
  else
    a = Policy(s);
  end

  % executeAction(), wait() and observeState() simulated:
  sp = simulateSimpleTask_FrequentistModel(N_STATES,s,a,T);
  R = obtainReward_simpleTask(s,a,sp);

  % update Q-matrix
  Q(s,a) = (1 - alpha) * Q(s,a) + alpha * (R + GAMMA * V(sp));

  % V_update(), Policy_update(). Refer to equations (2.4) and (2.5)
  V(s) = Q(s,1);
  Policy(s) = 1;
  for i = 2:N_ACTIONS
    if(Q(s,i)>V(s))
      V(s) = Q(s,i);
      Policy(s) = i;
    end
  end

  s = sp; % update state

Figure 3.8: Q-learning algorithm used in all simulations.

suitable set of rewards that satisfies the requirements discussed in subsection 3.4.1 have been settled.

3.4.1 Reward tuning

The rewards assigned to the wander-1 learning process were normalized so that the largest reward is 1, for a better analysis of the variations of the Q-matrix values. Besides, this will help to overcome the limitations of the robot related to the fixed-point numeric system employed, as discussed later on.

Since the objective of the task is defined by the reward function, we must be careful in the choice of the set of instant rewards so as to avoid falling into local maxima [10]. As a result, the highest reward, obtained when the robot moves forward without colliding, is set ten times higher than the reward assigned when the robot turns. Preliminary tests show that as we increase this secondary reward, the chance of learning a turning policy increases, even when no obstacles are found in the path of the robot.

After a set of simulations, the reward values of table 3.2 were chosen. This set of normalized rewards was tuned to obtain the optimal policy in all the tests. As an example, although minor variations of these values can also be valid, simulations with R = 0.2 for the event 'turned without colliding' sometimes resulted in non-optimal turning policies.

3.4.2 Exploitation-exploration strategy

We have selected a fixed ε-greedy strategy, meaning that the agent will usually exploit the best policy learned at each step, but with a probability ε of exploring an action chosen at random.


R      Event
1      moved forward (no collision)
0.1    turned (no collision)
-0.2   one bumper collided
-0.7   both bumpers collided

Table 3.2: Adjustment of rewards.

The selected value ε = 30% fits the wander-1 learning process, allowing the robot to explore all the possible combinations of states and actions in all the tests. Improvements to this strategy will be discussed later.

3.4.3 Learning rate and discount rate tuning

In chapter 2 it was demonstrated that the theoretical learning rate of the Q-learning algorithm should be αk = 1/k, with k being the current step, so as to approximate an average estimation (a quick numerical check of this is sketched after equation (3.1)). However, preliminary simulations, and especially real robot tests, showed that the number of steps needed to learn an accurate policy had a strong dependence on the first states explored, sometimes resulting in the inability to learn even a simple policy after hours. Therefore we have chosen a different profile for αk. On the other hand, the CPU of the robot will limit the number of decimal places we can employ for our variables. We will analyze and justify this topic in the next chapter. The final values used for αk in this study are:

α = {0.01, 0.02, 0.05, 0.1, 0.2, 0.4} (3.1)
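
As a quick numerical check of the averaging claim mentioned above (an illustrative Octave/Matlab sketch, not part of the thesis scripts), the incremental update with αk = 1/k reproduces the sample mean of the observed rewards:

% With alpha_k = 1/k the incremental update converges to the sample average.
rewards = [1 0.1 0.1 1 -0.2 1];    % arbitrary reward sequence for this example
Q = 0;
for k = 1:numel(rewards)
  Q = (1 - 1/k) * Q + (1/k) * rewards(k);
end
disp([Q, mean(rewards)])           % both print 0.5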

Analogously, the discount rate values to study different depths in the look-ahead strategy are:

γ = {0, 0.2, 0.4, 0.6, 0.8, 0.9} (3.2)

We have found that values beyond γ = 0.9 can cause numerical overflow problems in the robot. This topic will also be discussed in the next chapter. Other relevant features of the simulations described in this chapter are summarized in the following:

• The task to learn is so simple that it allows us to predict the optimal policy before running any experiment. That policy is the one shown in table 3.3.

State                Action
s1 (no obstacle)     a4 (move forward)
s2 (left-contact)    a3 (right-turn)
s3 (right-contact)   a2 (left-turn)
s4 (both contacts)   a2 or a3

Table 3.3: Wander-1 optimal policy.

• We will use, as the main indicator of the goodness of the learning process, the last iteration in which the agent passed from a non-optimal policy to an optimal one.

• An additional custom Q-estimator has been defined as the ratio between the highest and the second highest value of the Q-matrix row corresponding to state s1 (no contact). We selected this indicator after checking that almost all failed tests learned a policy different from a4 (moving forward) in s1, due to the proximity of their Q-matrix values (a small sketch of both indicators is given after this list).


• Each simulation has been performed with a specific selection of learning rate α and discount rate γ, making a total of 36 simulations.

• Each simulation consisted of 40 experiments, corresponding to 40 complete learning processes. Each learning process was run for 2000 steps.

• The resulting indicators shown here correspond to the average of the 40 indicators obtained, one per experiment.
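
The sketch below (illustrative Octave/Matlab, not the thesis scripts) shows how both indicators can be computed; isOptimal is a hypothetical logical array recording, at every step, whether the current greedy policy matches table 3.3, and Q is the learned Q-matrix:

% Last step in which the agent switched from a non-optimal to an optimal policy.
switches = find(~isOptimal(1:end-1) & isOptimal(2:end));
lastLearningStep = switches(end) + 1;

% Custom Q-estimator for s1 (no contact): ratio between the highest and the
% second highest Q-value of that row; values near 1 indicate a fragile policy.
row = sort(Q(1,:), 'descend');
Qestimator = row(1) / row(2);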

The Octave/Matlab script of appendix B.4 was used for simulating the conceptual model.The results are shown in figures 3.9 and 3.10. Likewise, the results of running the simulationwith the frequentist model listed in appendix B.5 are shown in figures 3.11 and 3.12.

Figure 3.9: Wander-1 simulation: results of learning rate α and discount rate γ combinations for the conceptualmodel.

Figure 3.10: Wander-1 simulation: results of learning rate α and discount rate γ combinations for the conceptualmodel.

Figure 3.11: Wander-1 simulation: learning rate α and discount rate γ results for the frequentist model.


Figure 3.12: Wander-1 simulation: learning rate α and discount rate γ results for the frequentist model.

According to the number of steps needed for learning the optimal policy, simulations on the conceptual model showed that there is a large window for choosing the discount rate (γ ≥ 0.2) and the learning rate (α ≤ 0.1) so as to obtain the optimal policy efficiently.

The additional Q-estimator indicator showed that γ ≥ 0.8 combined with α ≥ 0.05 resulted in Q-matrix values so similar to each other that minor losses of precision could lead to non-optimal policies. Hence, we will have to take this indicator into account when working on the robot.

As for the more realistic frequentist model, simulations narrowed the above windows, requiring γ ≥ 0.6, and revealed some patterns in this numeric sensitivity study that had not appeared before: according to the number of steps needed for obtaining an optimal learning, we highlight the presence of two well-distinguished minima at [α = 0.02, γ = 0.8] and [α = 0.02, γ = 0.9].

Since the frequentist model is based on real robot and environment data, collected with a step time of 250 ms (appendix C.5), the simulation gave a reference of 65 seconds (260 steps) as the lowest average time needed to learn the optimal policy (remember that each value is an average of 40 experiments). This value, much lower than we expected and hard to improve, will be a demanding reference in later experimentation with the robot.

The above simulations were repeated many more times, obtaining similar results. These results let us choose between γ = 0.8 and γ = 0.9 interchangeably. However, as we are interested in learning more complex tasks, we prefer working with higher discount rates, which will allow the robot to have a longer look-ahead horizon in its strategies.

To sum up, we will select the parameters of table 3.4 as the starting point in the implementation of the Q-learning algorithm for the wander-1 task in the real robot.

Parameter                  Value
Rewards                    {1, 0.1, -0.2, -0.7}
Exploitation/exploration   ε-greedy, 30% exploration
Learning rate α            0.02
Discount rate γ            0.9

Table 3.4: Best Q-learning parameters selected after simulation tests.


Indicator                            Result
Success in learning optimal policy   40 of 40 experiments (100%)
Average number of steps needed       260 (69 seconds)

Table 3.5: Simulation results with the selected parameters. Frequentist model.


Chapter 4

Simple wandering task (II): Robot implementation

The present chapter describes the transition from the Q-learning simulation to the real robot implementation. After describing a library and the main program for the robot, this chapter proceeds with an analysis of the limitations of the platform when using this learning method. We restrict the parameter values so as to avoid overflows and relevant losses of precision. In the last section we show the results of the wander-1 learning process in the robot.

4.1 Translation of simulation code into NXC

First of all, the most relevant difference between a real Q-learning implementation and the simulation, to keep in mind when working on the robot, is time. Preliminary tests were performed with a step time of one second, meaning that the robot executed the selected action for that time at each step; then it observed the reached state and obtained the associated reward. The CPU of the robot (48 MHz) allows us to neglect the processing time of each step, making it unnecessary to stop the movement after each action and thus favoring a smoother movement. After a few tests, we decided to lower the step time to 250 ms to speed up experimentation.

The following software was employed when working with the robot:

• NBC 1.2.1.r4 and NeXTTool 1.2.1.3. under Ubuntu 12.10.

• Bricx Command Center v3.3 (Build 3.3.8.9) under Windows XP (just for verifying that the tests presented in this work can be reproduced on other platforms without making any change in the source code).

Additional techniques developed to improve the experimentation with the robot are describedin appendix A and summarized in table 4.1.

Concept                                             Section
Debug, compilation and NXT communications           A.1
Datalogs (Q-matrix, learned policy...)              A.2
Brick keys events                                   A.3
Online information through display & loudspeaker    A.4

Table 4.1: Additional implemented techniques for experimenting with the Lego robot.


The programs for the robot have been written in Not eXactly C (NXC), a programming language created by John Hansen and strongly based on C [20]. Two files were developed with further implementations on other systems in mind: <MainProgram>.nxc and the library NXT_io.h, briefly described in the following.

4.1.1 NXT_io library

NXT_io.h has been written in this work to isolate the hardware from the Q-learning algorithm; it is a library specific to the NXT robot and hardly portable to other systems. This library also includes some constants and functions needed in the learning process implemented in this work, hence it has been employed extensively. The only parameter of the library to be modified in our experiments is SPEED, the power of the driving wheels, as will be discussed later. The complete code can be consulted in appendix C.1, and its content is enumerated in table 4.2.

NXT_io.h
NXT actuators parameters: SPEED, L_WHEEL, LEFT_BUMPER ...
NXT loudspeaker parameters: VOL, Notes (C5, A5...), Notes length (HALF, QUARTER...)
void NXT_mapSensors(void)              /* Sonar and contact sensors mapping */
byte getSonarDistance(void)
void executeNXT(string command)        /* 'stop', 'turnRight' ... */
void showMessage(const string &m, byte row, bool wait)
NXTcheckMemory(long memoryNeeded, string FILE)
bool pauseResumeButton(void), exploitePolicyButton(void), saveAndStopButton(void)
void pauseNXT(void)

Table 4.2: Content of the NXT_io.h library.

4.1.2 Main program

The main program for the robot contains, among other functions and characteristics, the Q-learning algorithm, its parameters, and the definition of the states and actions. This program can be easily translated into other programming languages, and it will be modified in each learning task.

The whole code, simple_task_learning.nxc, the core of this section, is shown in appendix C.2. Here, the most relevant issues involved in implementing the Q-learning algorithm in NXC are described. The main loop, where the learning process takes place, is shown in fig. 4.1.

Zero-valued indices have been omitted in the state/action codification for better compatibility with the Octave/Matlab code.

The NXT is a microcontroller-based robot without an FPU (Floating-Point Unit). Therefore, the NXC programming language does not support the real numbers needed in the learning algorithm. This is why the expression that updates Q in figure 4.1 has been translated from the theoretical equation (2.19), used in the Octave/Matlab simulations and repeated here for convenience:

Q(s, a) = (1 − α) · Q(s, a) + α · [R + γ · V(sp)]        (Octave/Matlab)        (4.1)

to:

Q[s][a] · FP = [(FP − α · FP) · Q[s][a] · FP] / FP + [α · FP · (R · FP + (γ · FP · V[sp] · FP) / FP)] / FP        (4.2)


for(step=1; step<N_STEPS+1; step++)
{
    a = selectAction(s);      // Exploitation/exploration strategy
    executeAction(a);
    Wait(STEP_TIME);
    sp = observeState();
    R = obtainReward(s,a,sp);

    // Update Q-matrix
    Q[s][a]=((FP-alpha)*Q[s][a])/FP + (alpha*(R+(GAMMA*V[sp])/FP))/FP;

    // Update V and Policy
    V[s] = Q[s][1];
    Policy[s] = 1;
    for(i=2; i<N_ACTIONS+1; i++)
    {
        if(Q[s][i] > V[s])
        {
            V[s] = Q[s][i];
            Policy[s] = i;
        }
    }
    s = sp; // Update state

Figure 4.1: Q-learning algorithm in NXC.

where FP is a constant used to emulate real numbers with integers when using a fixed-point notation. In the next section we will show how this notation system, with scaling factor 1/10000 (FP = 10000), is valid for performing the experiments on the robot.

Since the reward R, the learning rate α, and the discount rate γ are real numbers, all of them must be expressed in our notation system (multiplied by FP). Q[s][a] and V[s] are represented in this notation too. Thus equation (4.2) can be expressed as:

Q[s][a]_FP = [(FP − α_FP) · Q[s][a]_FP] / FP + [α_FP · (R_FP + (γ_FP · V[sp]_FP) / FP)] / FP        (4.3)

The above expression (4.3) solves the computation of Q[s][a] in our fixed-point notation, in which each product must be subsequently divided by FP. The operations have been grouped so as to minimize the impact of overflows, as will be discussed in the next section.
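
To make the fixed-point arithmetic concrete, the following Octave/Matlab sketch emulates one update of expression (4.3) with truncating integer divisions (fix), as the NXC code of fig. 4.1 does with long variables, and compares it against the floating-point update (4.1). The numeric values of Q and V are arbitrary and chosen only for this illustration.

% Fixed-point emulation of one Q-update, eq. (4.3). Illustrative sketch only.
FP = 10000;               % scaling factor: 4 decimal places
alphaFP = 0.02 * FP;      % alpha_FP = 200
gammaFP = 0.9  * FP;      % gamma_FP = 9000
R_FP    = 1    * FP;      % reward R = 1
Q_FP    = 0.5  * FP;      % current Q(s,a) = 0.5 (arbitrary)
V_FP    = 0.8  * FP;      % current V(s')  = 0.8 (arbitrary)

% Each product is divided by FP with truncation, as in integer arithmetic
Qnew_FP = fix(((FP - alphaFP) * Q_FP) / FP) ...
        + fix((alphaFP * (R_FP + fix((gammaFP * V_FP) / FP))) / FP);

% Floating-point reference, eq. (4.1)
Qnew = (1 - 0.02) * 0.5 + 0.02 * (1 + 0.9 * 0.8);

disp([Qnew_FP / FP, Qnew])   % both give 0.5244 in this example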

A short description of the functions called from the main loop of our program is given here:

• selectAction(s): The same ε-greedy function used in simulation.

• executeAction(a): Calls the NXT_io servomotor functions: executeNXT("forward")...

• observeState(): Returns the codification of the states based on sensor values, as shown in figure 4.2.

• obtainReward(s,a,sp): Unlike the reward function used in simulation, and after several tests, we decided to use a practical function which does not need to be redefined after a change of states and/or actions. Based on the wheel encoders and contact sensors, the reward function was defined as shown in table 4.3 and implemented as in figure 4.3.


byte observeState(void)
{
    // Returns the state of the robot by encoding the information measured from
    // the contacts sensors. In case the number of states or their definitions
    // change, this function must be updated.

    // States discretization:
    // s1: left_contact=0, right_contact=0
    // s2: left_contact=1, right_contact=0
    // s3: left_contact=0, right_contact=1
    // s4: left_contact=1, right_contact=1

    byte state;
    state = 1 + (LEFT_BUMPER + 2 * RIGHT_BUMPER); // defined in 'NXT_io.h'
    return(state);
}

Figure 4.2: Function observeState() written in NXC.

long obtainReward(byte s, byte a, byte sp)
{
    // Input: s,a,s' not used directly here since we look at the motion of
    //        wheels and sensors (that would be a and s'). Encoders and
    //        contact sensors result in a better function R(s,a,s')
    // Output: Obtained Reward

    long R;         // Fixed-point number
    long treshold;

    // Reference: 1 second at SPEED 100 gives above 700 degrees (full battery)
    treshold=30;    // Valid for [40 < SPEED < 80]
    R=0;

    if(MotorRotationCount(L_WHEEL) > treshold &&
       MotorRotationCount(R_WHEEL) > treshold) R = FP;           // R=1
    else if(MotorRotationCount(R_WHEEL) > treshold) R = FP / 10; // R=0.1
    else if(MotorRotationCount(L_WHEEL) > treshold) R = FP / 10; // R=0.1

    if(LEFT_BUMPER || RIGHT_BUMPER)
    {
        if(LEFT_BUMPER && RIGHT_BUMPER) R=-FP/2-FP/5; // R = -0.7
        else R=-FP/5;                                 // R = -0.2
    }

    ResetRotationCount(L_WHEEL);
    ResetRotationCount(R_WHEEL);
    return(R);
}

Figure 4.3: Function obtainReward() written in NXC.

The code of fig. 4.3 generates the reward function without coding all the (s, a, s′) combinations that take place in the learning process. Nevertheless, it depends on the variables (s, a, s′), resulting in a theoretically valid reward function. All the rewards are returned in fixed-point notation, as stated above. It is important to mention that the threshold for detecting whether or not a wheel moved forward was tuned and tested for wheel powers from 40 to 80. Outside this range, the above threshold should be readjusted.


R      Event
1      Both wheels moved forward above a threshold. No contact activated
0.1    Only one wheel moved forward above a threshold. No contact activated
-0.2   One contact activated
-0.7   Both contacts activated

Table 4.3: Rewards definition in the robot.

4.2 Q-learning algorithm with CPU limitations

This section analyzes the limitations of our learning process when working with the real robot. These limitations include overflows and precision losses in computation, and also storage capacity, which leads to a maximum number of states and actions that can be implemented without exceeding the memory of the robot.

4.2.1 Overflow analysis

In order to detect overflows, we must analyze the three factors involving fixed-point numbers in expression (4.3):

(a) (FP − α_FP) · Q[s][a]_FP

(b) γ_FP · V[sp]_FP

(c) α_FP · (R_FP + (γ_FP · V[sp]_FP) / FP)

In this work R has been normalized to 1 (R_FP ≤ FP) and the best learning processes have been obtained with α ≪ 1. Thus the greatest factor is approximately FP · Q[s][a]_FP, from (a). Note that the same restriction applies to (b) when γ ≈ 1.

On the other hand, since the NXT robot has a 32-bit CPU, NXC signed long variables can store values up to 2^31 − 1 = 2,147,483,647 = MaxLongNXC. This integer capacity is limited in our notation system: the integer part of any cell of Q must be ≤ MaxLongNXC/FP (fixed-point variables are already multiplied by FP). As an example, if we select FP = 10000, the greatest integer part involved in the learning should not exceed 214,748 to avoid overflows.

The largest Q-matrix values were obtained by simulating the frequentist model from chapter 3. The worst-case scenario occurs in the unlikely event that the system always obtains the greatest reward at every step, since the higher the reward, the higher the Q-matrix values (refer to eq. (4.1)). We have run simulations of 30000 steps for the following values of α and γ:

α = {0.01, 0.02, 0.05, 0.1, 0.2, 0.4} (4.4)

γ = {0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99} (4.5)

The results of this analysis are summarized in table 4.4, where Qmax/(FP · R) represents the value towards which the Q-matrix converges. Different values of the parameter α for a fixed γ resulted in the same convergence value, thus α has been left out of table 4.4.

γ               0.5    0.6    0.7     0.8    0.9    0.95    0.99
Qmax/(FP · R)   2      2.5    3.33    5      10     20      100

Table 4.4: Limits in the Q-matrix values to avoid overflow when using a fixed-point notation system.


The data of table 4.4, along with the restriction Q ≤ MaxLongNXC/FP, allow us to determine whether our system will cause overflows for any set of γ, R and FP. As an example, we check the optimal values from chapter 3: γ = 0.9, R = 1, and FP = 10000. From table 4.4, Qmax = 10 · R · FP = 100,000. These parameters satisfy the above requirement, since Qmax is lower than MaxLongNXC/FP (100,000 < 214,748).

The importance of normalizing the rewards to 1 can be seen with a simple example: if the rewards rose to R = 10, the Q-matrix values would be ten times the above values, and the requirement MaxQvalue = 1,000,000 ≤ MaxLongNXC/FP would not be satisfied.
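
The check above can be condensed into a short rule of thumb. The illustrative Octave/Matlab sketch below (not thesis code) uses the fact that, under a constant maximum reward at every step, the Q-values converge towards R/(1 − γ), which is consistent with the Qmax/(FP · R) row of table 4.4:

% Overflow rule of thumb for the fixed-point Q-learning update (illustrative).
gamma = 0.9;  R = 1;  FP = 10000;
MaxLongNXC = 2^31 - 1;                       % largest NXC signed long
Qmax_FP = (R / (1 - gamma)) * FP;            % 100000: worst-case fixed-point Q-value
noOverflow = (Qmax_FP <= MaxLongNXC / FP)    % true, since 100000 <= 214748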

Analogously, with other values for γ, R, and FP, we can deduce the following:

• FP = 100000 (5 decimal places) will result in overflow, since the condition Qmax = X · R · 100,000 ≤ 21,474, with X being any value from table 4.4, requires reducing R to 0.1 and γ to 0.5, leading to a poor look-ahead learning process.

• FP = 10000 (4 decimal places) allows γ values up to 0.95 (γ = 0.96 causes overflows in the worst-case scenario).

• FP = 10000 and γ = 0.9 allow reward values up to 2 without causing overflow.

• FP = 1000 (3 decimal places) allows γ values up to 0.999, although this is associated with serious precision problems, as will be described in the next subsection.

• FP = 100 has practically no γ restrictions, but also results in precision problems.

Table 4.4 also helps to establish the limits for 16-bit systems (here MaxIntNXC = 2^15 − 1 = 32,767). Using the same procedure as above, we found that only FP = 100 with γ < 0.7 can be used to avoid overflow. Finally, no 8-bit system (MaxIntNXC = 255) can hold a correct Q-learning algorithm using the notation system employed here.

4.2.2 Precision analysis

The study of numeric precision begins with a simulation of the frequentist model explained in the previous chapter and listed in appendix B.3, collecting the following data: the step in which the optimal policy was learned and kept, the resulting Q-matrix, and two arrays with the history of the executed actions and of the states reached at each step. We will use these results, shown in table 4.5, as a reference for comparison with the data obtained by the robot later on.

Table 4.5: Q-matrix reference values obtained from Octave simulation. The optimal policy (values in bold) waslearned in step 298.

We have modified the previous simple_task_learning.nxc code to use the arrays history_a and history_sp from the simulation. In this case, the action executed by the robot is selected by accessing the proper history_a value, and the same applies to the reached state; meanwhile, the robot remains stopped. In summary, we are reproducing inside the NXT robot, with all its limitations, the same learning process performed in simulation. This allows us to compare the resulting Q-matrices of the robot and of the offline simulation.


// Modified Q-learning algorithm, main loop (precision study) ---------------
for(step=1; step<N_STEPS+1; step++)
{
    //a = selectAction(s);     // Exploitation/exploration strategy
    a = history_a[step-1];
    //executeAction(a);
    //Wait(STEP_TIME);
    //sp = observeState();
    sp = history_sp[step-1];
    R = simulateReward(s,a,sp);

    // Update Q-matrix
    Q[s][a]=((FP-alpha)*Q[s][a])/FP + (alpha*(R+(GAMMA*V[sp])/FP))/FP;

    // Update V and Policy
    V[s] = Q[s][1];
    Policy[s] = 1;
    for(i=2; i<N_ACTIONS+1; i++)
    {
        if(Q[s][i] > V[s])
        {
            V[s] = Q[s][i];
            Policy[s] = i;
        }
    }
    // Update step
    s = sp;

Figure 4.4: Q-learning algorithm implemented for the analysis of precision.

The main learning loop of the robot implemented for this purpose is the one shown in figure4.4. Running the code on the robot, and keeping the same parameters selected so far, we obtainthe values shown in table 4.6.

Table 4.6: Q-matrix values from the NXT simulation. The optimal policy (values in bold) was learned in step 298 (note: Q-matrix values have been multiplied by FP).

There is no need to repeat the learning process on the robot several times, since the fixed sequence of states and actions always leads to the same Q-matrix after 2000 steps. Now both simulations, the one in Octave with 64-bit floating-point arithmetic and the one on the NXT with 32-bit fixed-point arithmetic, have comparable Q-matrices. Their differences are shown in figure 4.5.

Our results show that both tests reached the optimal policy at the same step (298). On the other hand, the greatest difference in Q-matrix values after 2000 steps was -0.02605, which represents a deviation of 2.6% of the greatest possible reward in a single step. In terms of relative deviations, the greatest difference was detected in Q(3, 1), with a 3.5% error.

These differences were so small that they did not affect the learning process at all, with the optimal policy reached in the exact same iteration. Hence, we will keep using the previous set of parameters (γ = 0.9, α = 0.02, FP = 10000).


Figure 4.5: Q-matrix differences between offline and NXT simulation.

Nevertheless, we wanted to know the precision error committed by the robot when working with other parameters. A set of experiments involving different γ, α, and FP was performed on the robot using the same procedure. Thus we can study how sensitive other combinations of γ and α are to the fixed-point notation by changing FP alone.

Figure 4.6 compares the results of these tests according to the step in which the optimal policy was reached and maintained. For convenience, we have employed powers of 10 as scaling factors for the fixed-point notation system. From the obtained data, we can deduce the following:

Figure 4.6: Step in which the optimal policy was learned for different FP, γ and α in the NXT robot.

• In 32-bit systems without an FPU, as in our case, fixed-point notation systems with scale factors other than 1/1000 and 1/10000 (3 or 4 decimal places) should not be used, either because of lack of precision or because of overflow.

• The parameter α should be smaller than 0.1 to reach the optimal policy in fewer iterations. The best results of this analysis occurred with α = 0.02.

• When α = 0.01, better results were obtained with γ = 0.9 than with γ = 0.75. Although our standard case, α = 0.02, was not affected by this, it is expected that more complex tasks will be more sensitive to this parameter, which should lead to the choice of a large γ.


• FP = 1000 gave better results than FP = 10000 because of precision errors (bold text in fig. 4.6): the similarity among the Q-matrix values fortuitously produced the optimal policy at an earlier step than expected.

• 16-bit systems, restricted to FP = 100 as discussed in section 4.2.1, require a fixed α = 0.1. Smaller values cause precision problems of such a magnitude that they can hardly change the Q-matrix values, leaving most of them unchanged. On the other hand, larger values result in serious convergence problems.

At this point we can perform a precision analysis of the values obtained in the Q-matrix. In order to get a better understanding of this topic, the number of steps of the learning process has been reduced to 100. Thus, we can analyze the deviations produced in the first part of the learning. Figures 4.7 and 4.8 show the results of these tests.

Figure 4.7: Q-matrix values with FP=1000 and FP=10000. Q-matrix values are multiplied by FP.

The main finding of this comparative study was that the smallest valid learning rate which avoids severe losses of precision is limited by the number of decimal places set in our fixed-point system. Figure 4.8 shows that as we decrease the number of decimals, the error due to the loss of precision rises, this error being greater when α is small. In our example, when α = 0.02, the average error obtained with 3 decimals with respect to 4 decimals after 100 steps was 2.5%, with a maximum of 4.4%. This is five times greater than the one obtained when working with α = 0.1. Thus, although previous tests gave better results when working with α = 0.02 and FP = 1000, the error of their Q-matrix values was relatively high. This fact is important when designing more complex tasks with FP = 1000, where α = 0.1 should be preferred to improve the accuracy of the learning process.

On the other hand, the choice of γ = 0.75 or γ = 0.9 had no effect on this comparative study.


Figure 4.8: Differences in Q-matrix values between FP=1000 and FP=10000.

4.2.3 Memory limits

The greatest number of states and actions allowed for our robot has been obtained after addressing the following considerations:

• The size of the Q-matrix is limited by the 64 kB memory of the robot, which must also store the CPU stack and other global variables.

• The cell size of the Q-matrix will be 32 bits, as 16-bit cells would result in a less efficient and more limited learning process, as explained in the two previous sections.

• Later on it will be necessary to allocate memory for auxiliary variables, including anexploration matrix introduced in the next chapter.

Figure 4.9 shows the size occupied by the different variables in a particular case, along with some pairs of values for the number of states and actions that can be employed without exceeding the memory of the robot.

The conclusion is that the number of states and actions will not represent any problem in this work: the most complex task implemented here has 16 states and 9 actions, which requires less than 2% of the total memory available.
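
As a quick sanity check, the following Octave/Matlab sketch (illustrative only, not the thesis code) estimates the footprint of the learning structures for the largest task; the exploration matrix of the next chapter is assumed to use the same 4-byte cells:

% Memory estimate for the Q-learning structures (illustrative sketch).
nStates  = 16;  nActions = 9;                 % wander-3, the largest task here
bytesQ           = nStates * nActions * 4;    % 576 bytes (4-byte long cells)
bytesExploration = nStates * nActions * 4;    % 576 bytes (assumed same cell size)
totalKB = (bytesQ + bytesExploration) / 1024  % about 1.1 kB of the 64 kB available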

4.3 Implementation results

At this point, we have obtained all the information about the parameters that best suit the robot, as well as their restrictions when combined in the same learning structure. Table 4.7 summarizes this data, which represents the input parameters for the wander-1 implementation.


Figure 4.9: NXT memory restrictions for states/actions: example of memory needed (left), number of statesand actions available (center and right).

Concept                         Value
Robot speed                     80 (in a range [0,100])
Step time                       250 ms
Number of steps                 2000
Exploitation/exploration rate   ε-greedy, 30% exploration
Discount rate γ                 0.9
Learning rate α                 0.02
FP                              10000 (4 decimals in fixed-point notation)
Q-matrix cell size              4 bytes (long)

Table 4.7: Parameters of the wander-1 task implementation.

The full NXC source code was explained in section 4.1 and can be consulted in appendixC.2. The real tests were performed in a 70x70 cm square enclosure shown in figure 4.10.

Figure 4.10: Scenario A for the wander-1 task implementation.


As a reference, the frequentist model simulations gave an average of 260 steps needed for learning the task (recall section 3.5). However, after running six real tests on the robot, the results were unexpectedly better than the simulated ones. Figure 4.11 collects these results.

Figure 4.11: Wander-1 learning implementation. Summary of experiments (* Divided by FP).

Figure 4.12: Wander-1 learning implementation. Examples of the resulting Q-matrix.

Figures 4.13 and 4.14 show the robot during the learning process and exploiting the learned policy for the wander-1 task. It seems that the real environment offered a better learning model for the robot than the one designed for the wander-1 simulation. Since all the tests with the real robot learned the optimal policy efficiently, we can consider that we are ready to move on to more complex task implementations.

Figure 4.13: Wander-1 task implementation. Robot during the learning process.


Figure 4.14: Wander-1 task implementation. Robot exploiting the learned policy.


Chapter 5

Complex wandering task

In chapter 4 we discussed a successful and efficient learning process for a simple wandering task. One interesting conclusion was that the real-world scenario resulted in a slightly better model for the robot than those created for the simulations. After the work described in sections 4.1 and 4.2, the time spent designing and simulating the wander-1 task model, including the data collectors, turned out to be much longer than the time needed to implement and test the learning algorithm on the robot.

This chapter addresses two new learning processes designed and studied directly on the robot: one with 5 states and 4 actions called wander-2 (section 5.1), and a more complex implementation with 16 states and 9 actions called wander-3 (section 5.2). The last section summarizes some improvements that can be made to the basic Q-learning algorithm related to this work.

5.1 Wander-2 task

A wandering task with five states, wander-2, has been proposed and designed to take advantage of the ultrasonic sensor mounted, but not employed, in the previous task. Thus, with 1 ultrasonic sensor, 2 frontal contact sensors and 4 movement actions (stop, left-turn, right-turn and move forward), we intend the robot to learn how to wander avoiding obstacles. As in the wander-1 task learning, we expect that in a state with the right-front contact activated the robot learns that turning to the left leads to its optimal policy; also, if a wall or obstacle is detected close to the robot, a turning action will most likely be the best decision to achieve the optimal policy.

Neither rewards nor actions changed in this new setting; just a fifth state has been added, reached when the sonar measures a distance above a given threshold and there is no collision. If the measured distance is below this threshold without colliding, the agent remains in state s1. This new discretization of states and actions is shown in table 5.1.

Another characteristic of this task is related to the measured distance. Considering the separation between the ultrasonic sensor and the front contacts (13 cm), the threshold distance has been tuned to 25 cm, and the power of the wheels has been reduced from 80 to 50. This empirical combination of threshold and speed allows the robot to get close enough to an obstacle to maximize the area of the wandering path. However, after a few tests, we detected some accuracy problems of the ultrasonic sensor that may affect the learning process. Addressing these issues will help in understanding the remaining tests described in this chapter; thus we enumerate them here:

• When the robot is facing a straight wall, it is unable to detect it if the angle formed by the wall and the line representing the sonar propagation beam (same orientation as


Task objective
Wandering avoiding obstacles (5 states)

States
s1   No contact & obstacle near
s2   Left bumper contact
s3   Right bumper contact
s4   Both contacts
s5   No contact & obstacle far

Actions
a1   Stop
a2   Left-turn
a3   Right-turn
a4   Move forward

Table 5.1: Wander-2 task description.

the robot) is below approximately 42 degrees. In other words, starting from a setup in which the robot faces a wall with an orientation perpendicular to it, the obstacle will be detected by the sonar provided that the robot does not change its orientation by more than 90 − 42 = 48 degrees. This situation is depicted in fig. 5.1.

Figure 5.1: Angle restriction of the ultrasonic sensor for detecting frontal obstacles.

• The highest value detected by the ultrasonic sensor is about 147 cm. Distances above that were not detected (the sensor returns 255 cm). We also found that, frequently, distances greater than 142 cm were not detected either. There were no problems measuring the shortest distances.

The wander-2 task remains simple enough to guess its optimal policy intuitively; that policy is the one shown in table 5.2. Again, it has been used for constructing an indicator of the goodness of the learning process.

State                Action
s1 (obstacle near)   a2 or a3 (Left or Right-turn)
s2 (left-contact)    a3 (Right-turn)
s3 (right-contact)   a2 (Left-turn)
s4 (both contacts)   a2 or a3 (Left or Right-turn)
s5 (obstacle far)    a4 (Move forward)

Table 5.2: Wander-2 task optimal policy.

Regarding the robot implementation of the learning process, a few changes have been madefrom the wander-1 code, including the new observeState() routine shown in fig 5.2.


byte observeState(void)
{
    // Returns the state of the robot by encoding the information measured from
    // the ultrasonic and the contacts sensors. In case the number of states
    // or their definitions change, this function must be updated.

    // States discretization:
    // s1: left_contact=0, right_contact=0   Obstacle near
    // s2: left_contact=1, right_contact=0
    // s3: left_contact=0, right_contact=1
    // s4: left_contact=1, right_contact=1
    // s5: left_contact=0, right_contact=0   Non Obstacle near

    byte state;
    byte sonarDistance;
    byte sonarState;

    sonarDistance = getSonarDistance();
    if(sonarDistance <= DISTANCE_0)
        // DISTANCE_0 is the threshold distance defined in 'NXT_io.h' used to
        // distinguish between states s1 and s5
        sonarState = 0;
    else
        sonarState = 1;

    state = 1 + (LEFT_BUMPER + 2 * RIGHT_BUMPER); // defined in 'NXT_io.h'
    if(state==1 && sonarState==1) state=5;
    return(state);
}

Figure 5.2: ObserveState() routine for the wander-2 task.

The rest of the code can be consulted in appendix C.3. Table 5.3 summarizes the parametersemployed in all wander-2 task experiments.

Parameter                    Value
Robot speed                  50 (in a range [0,100])
Step time                    250 ms
Number of steps              2000
Exploitation/exploration     ε-greedy, 30% exploration
Discount rate γ              0.9
Learning rate α              0.02
FP                           10000 (4 decimals in fixed-point notation)
Q-matrix cell size           4 bytes (long)

Table 5.3: Wander-2 task implementation parameters.

The present task was tested in two scenarios:

• Scenario A: The same 70x70cm square enclosure for wander-1, already shown in figure4.10.

• Scenario B: 105x70cm enclosure with a 35x8cm obstacle, as shown in figure 5.3.

Figures 5.4 and 5.5 display the results of the tests in both scenarios A and B. The learningprocess of the wander-2 task in scenario A reached the optimal policy in an average number


Figure 5.3: Scenario B for the wander-2 implementation.

of steps of 531 (133 seconds), ranging from 128 steps (32 s) to 1154 steps (289 s). These results imply that the learning process, although successful in the four experiments performed, needed almost three times more iterations than the learning process of the wander-1 task, which has just one state fewer. Besides, the dispersion of the results found in the wander-2 learning was larger.

However, the number of states and actions used so far is so small that the maneuvers of the robot lacked precision, especially when avoiding certain corners. The learning process cannot improve this behavior unless we increase the number of states and actions. The next section addresses a more complex task with 16 states and 9 actions, which involves a large increase in the size of the Q-matrix, from the 16 or 20 elements of the previous tasks to a new Q-matrix of 144 elements.

Figure 5.4: Wander-2 learning implementation: Summary of experiments for scenario A (* Divided by FP).

Figure 5.5: Wander-2 learning implementation: Summary of experiments for scenario B (* Divided by FP).


Figures 5.6 and 5.7 show the robot during the learning process and exploiting the learnedpolicy for the wander-2 task.

Figure 5.6: Wander-2 task implementation. Robot during the learning process.

Figure 5.7: Wander-2 task implementation. Robot exploiting the learned policy.

5.2 Obstacle avoidance task learning: wander-3 implementation and results

The final developed task, wander-3, defines 4 distance ranges for the ultrasonic sensor. Each distance range, combined with the 4 possible combinations of the contact sensors, results in 16 states. On the other hand, each wheel will be able to move forward, move


backward or remain stopped independently of the other wheel; thus the robot has 9 different actions to execute at each learning step. Our goal in this task is that the robot learns to wander avoiding obstacles with smoother and more precise movements than those obtained in the previous tasks. For example, it will be able to turn in one direction in three ways: by moving one wheel forward, the other one backward, or both wheels at the same time. The task definition along with the discretization of states and actions is shown in table 5.4.

Task objective
Wandering avoiding obstacles (16 states & 9 actions)

States
s1    No contact & obstacle range 0
s2    Left contact & obstacle range 0
s3    Right contact & obstacle range 0
s4    Both contacts & obstacle range 0
s5    No contact & obstacle range 1
s6    Left contact & obstacle range 1
s7    Right contact & obstacle range 1
s8    Both contacts & obstacle range 1
s9    No contact & obstacle range 2
s10   Left contact & obstacle range 2
s11   Right contact & obstacle range 2
s12   Both contacts & obstacle range 2
s13   No contact & obstacle range 3
s14   Left contact & obstacle range 3
s15   Right contact & obstacle range 3
s16   Both contacts & obstacle range 3

Actions
a1   Stop
a2   Left-turn (both wheels)
a3   Right-turn (both wheels)
a4   Move forward
a5   Left wheel forward
a6   Right wheel forward
a7   Left wheel backward
a8   Right wheel backward
a9   Move backward

Table 5.4: Wander-3 task. Description of states and actions.

The new distance ranges of the sonar are found in table 5.5. The rest of the parametersremain the same as in the previous wander-2 task.

Range   Distance (cm)
0       < 25
1       25 − 50
2       50 − 75
3       > 75

Table 5.5: Wander-3 task distance ranges considered in the ultrasonic sensor.

This task has the disadvantage of being too complex to guess the optimal policy intuitively,hence we will execute the tests for 2000 and 15000 steps (> 1 hour) of learning.

Regarding the implementation on the robot, besides rewriting the functions observeState() and executeAction(byte a) to include the new states and actions, it has been necessary to change the exploitation/exploration strategy so as to avoid some actions remaining unexplored after the learning process. The code in fig. 5.8 shows this modification.

The modified ε-greedy strategy favors the selection of the least explored action when the robot is in a given state. A simple exploration matrix, parallel to the Q-matrix, contains the information needed for implementing this strategy.


byte selectAction (byte s)
// Input: Current State
// Output: Action selected
{
    byte selectedAction;
    byte i;

    if(Random(100)<70) // e-Greedy
    {
        // 70% probability of exploiting the current learned policy.
        selectedAction = Policy[s];
    }
    else
    {
        // 30% of exploring
        if(Random(100)<60)
        // Improvement to simple e_Greedy: enhances exploring actions possibly
        // not too often visited. When exploring there is a 60% probability of
        // selecting the least explored action for the current state.
        {
            selectedAction = 1;
            for(i=2; i<N_ACTIONS+1; i++)
            {
                if (exploration[s][i] < exploration[s][selectedAction])
                    // exploration[state][action] is the exploration matrix
                    // used to count the number of times that the cell
                    // Q[s][a] has been explored (and thus updated).
                    selectedAction = i;
            }
        }
        else selectedAction = Random(N_ACTIONS)+1; // (1,2,3...N_ACTIONS)
        // When exploring, there is a 40% probability of selecting a random
        // action
    }
    return(selectedAction);
}

Figure 5.8: Implementation of the exploration/exploitation strategy for the wander-3 task.

Table 5.6 summarizes the input parameters used in the learning process of the wander-3 task. The complete code used in these tests can be consulted in appendix C.4.

Parameter                    Value
Robot speed                  50 (in a range [0,100])
Step time                    250 ms
Number of steps              2000 & 15000
Exploitation/exploration     ε-greedy, 30% exploration (60% the least explored)
Discount rate γ              0.9
Learning rate α              0.02
FP                           10000 (4 decimals in our fixed-point notation)
Q-matrix cell size           4 bytes (long)

Table 5.6: Wander-3 task implementation parameters.

The experiments were performed in three different scenarios:

• Scenario A: 70x70cm square enclosure used in both wander-1 and wander-2 (figure 4.10).


• Scenario B: 105x70cm enclosure with 35x8cm obstacle in the center, already used in thewander-2 task (figure 5.3).

• Scenario C: A new 200x125cm enclosure with four 135 degree corners and two smallobstacles (see figure 5.9).

Figure 5.9: Scenario C for the wander-3 task implementation.

Figure 5.10 shows the result obtained from a test performed in scenario A for over 9000 steps (≈ 40 minutes), run to compare the wander-3 task with the previous tasks in the same scenario. Figure 5.11 shows the results of the three tests performed in scenario C, one with 2000 steps (≈ 8 minutes) and the others with 15000 steps (> 60 minutes). Data from scenario B were not collected since the learning process turned out to be very similar to the one in scenario A.

Figure 5.10: Results of the wander-3 task implementation test in scenario A.

A total of six videos for the three tasks covered in this work, showing the evolution of the learning process and the exploitation of the learned policy, have been recorded to support further analysis of the learning methods developed. Figure 5.12 shows some scenes of the robot


Figure 5.11: Results of the wander-3 task implementation tests in Scenario C.

exploiting the learned policy of the wander-3 task. The following conclusions are based on the study of these recordings along with the interpretation of the data displayed in figures 5.10 and 5.11.

Figure 5.12: Wander-3 task implementation. Robot exploiting the learned policy (top to bottom, left to right).

• Tests performed in scenario A show that the time needed to learn an accurate policy for the wander-3 task was much longer than in the two previous tasks. This behavior


was expected, since the number of states and actions is now higher. However, in the test shown in figure 5.10, some well-explored states were not able to achieve what might be considered a reasonable policy, even after 9000 steps. As an example, state s3 (right contact pressed) resulted in action a3 (right-turn), a policy that will hardly lead to high rewards in either the short or the long run.

• A special case was detected in scenario A. State s13 (no contact pressed and obstacle far) resulted in a learned action a2, i.e., left-turn. Later tests revealed that the failure to detect the wall due to the sonar angle issue, explained in the previous section, was involved in this problem: it is likely that moving forward from s13 resulted in the colliding state s11 for the 35 times recorded in the exploration matrix, and thus left-turn actions represented a better choice in the long run.

• States s8, s12 and s16 did not learn an accurate action. In scenario A they were not even explored, which is reasonable since they involve both frontal bumpers colliding while the sonar distance lies outside range 0. In scenario C, s12 was reached only up to 6 times in 15000 steps, too low a count even when exploring all available actions.

• The learning processes of 2000 steps in scenario C generally gave better results than the previous tests in scenario A with more than 9000 steps. We can thus state that larger and more complex scenarios offer better models of learning for the wander-3 task than a minimalist simple scenario. The simple scenario A has many redundant states in the wander-3 setup, making the learning inefficient. This issue could be alleviated by advanced techniques such as state approximation, which are out of the scope of this work.

• Although the basis for a correct policy is reached in 2000 steps, figure 5.11 shows that when passing from 2000 to 15000 steps the selected actions evolve into a more accurate policy. In particular, experiment #3 resulted in the pattern {a4 (move forward), a3 (right-turn), a2 (left-turn), a1 (stop)} repeated in {s5, s6, s7, s8}, {s9, s10, s11, s12} and {s13, s14, s15, s16}, all representing the states {no contact, left contact, right contact, both contacts} when the distance measured by the sonar belongs to ranges 1, 2 and 3, respectively. The learning process was able to establish that the robot should follow a specific policy when the obstacle was detected in range 0, and another one when it was located in ranges {1, 2, 3}. Hence, the tests result in a consistent clustering of three states into one set.

• Among the tests performed in scenario C with 15000 steps, there are some minor differences in the learned policies. In particular, state s2 (object near and left bumper contacted) usually resulted in action a3 (right-turn), as in previous tasks, but sometimes led to a7 (left wheel backwards), as shown in figure 5.11. This particular behavior, reproduced in several tests, also constitutes a consistent policy, since it quickly leads to a non-collision state, usually from s2 to s1. The robot often executes a two-wheel turn (a2 or a3) in the next step, avoiding the obstacle, and thus follows a strategy with a longer look-ahead sequence of steps.

• Every test achieved sequences of movements including turns and backward displacements that prevent the robot from getting stuck, which generally occurs when state s4 is reached.

• Another special case was detected when working with wander-3 in scenario B. After 1000 steps, the robot had learned a policy that turned whenever it reached state s13, in which the measured distance belonged to range 3 (the furthest). Thus, the robot kept wandering in a small area between the walls and the obstacle. The problem here was that the robot could barely explore any action when it was oriented towards the furthest wall, because it only reached that state once or twice every 40 steps (approx.). To solve this problem, the robot was moved to an open space where the so far rarely explored state was frequently reached. A few minutes later, the robot had learned to move forward from state s13. It was then returned to scenario B, where it described a path around the whole enclosure.

• Moving forward actions usually result in anything but a straight path: the robot makes small turns to the right, and the lower the battery, the larger the drift. This could explain why in most of the wander-2 and wander-3 experiments the learned policies circle the scenarios clockwise. The alternative anticlockwise policy would lead to sequences of actions involving more turns and less moving forward, since the drift would cause the robot to face the external walls of the scenarios more often.

• Finally, some tests performed in scenario C resulted in turning policies (a3) when the robot reached states representing sonar distances in ranges 0 and 2 (s1 and s9). By turning at those distances from the obstacles, the robot sometimes avoided wandering along the narrow paths between the wall and the obstacles of this scenario, locations where sonar measurement errors were frequent and some collisions were inevitable.

5.3 Summary

Taking into account all the results discussed in the previous sections, we summarize here the improvements used in the basic Q-learning algorithm:

• Exploration-exploitation strategy: An exploration matrix, parallel to the Q-matrix, that stores the number of times each Q-matrix cell has been updated was created to favor the least explored actions.

• Learning rate parameter (α): Chapters 3 and 4 showed that a fixed α = 0.02 resulted in an efficient learning process and also reduced the precision problems on the NXT robot. Besides, this value was large enough to adapt the policy to changes in the environment within relatively short periods (a numerical sketch of the resulting fixed-point update is given after this list).

• Discount rate (γ): The impact on the numerical problem of loss of precision, together with the results obtained from simulation, gives γ = 0.9 as the value that best fits this work.

• Rewards: The individual reward values were normalized and tuned to avoid falling into non-optimal policies and overflows.
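To make these choices concrete, the following Octave/Matlab fragment mirrors the fixed-point update actually executed on the NXT (FP = 10000, with α and γ scaled by FP, as in the NXC listings of appendix C). It is only an illustrative sketch with a hypothetical transition, not an additional script of this work; the fix() calls emulate the truncating integer division of the robot.

% Illustrative sketch (not an additional script of this work): the fixed-point
% Q-update of the robot, with fix() emulating the NXC integer divisions.
FP    = 10000;             % 4 decimal places
alpha = (FP*2)/100;        % 0.02 in fixed point
GAMMA = (FP*90)/100;       % 0.9  in fixed point
R     = FP;                % example reward 1.0 (both wheels moved forward)

Q = zeros(5,4);            % Q-matrix (wander-2 sizes: 5 states, 4 actions)
V = zeros(5,1);            % value function
E = zeros(5,4);            % exploration matrix: update counts per (s,a)

s = 5; a = 4; sp = 5;      % hypothetical transition: open space, move forward

% Same structure as the NXC update line in appendix C:
Q(s,a) = fix(((FP-alpha)*Q(s,a))/FP) + fix((alpha*(R + fix((GAMMA*V(sp))/FP)))/FP);
E(s,a) = E(s,a) + 1;                % least-explored cells keep low counts
[V(s), greedy_a] = max(Q(s,:));     % updated value and greedy action for s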


Chapter 6

Conclusions and future work

This master thesis has addressed the design and implementation of a learning method based on the Q-learning algorithm for an obstacle-avoidance wandering task on the educational robot Lego Mindstorms NXT. This learning method has been able to learn optimal or pseudo-optimal policies in three tasks of different complexity while maintaining the same parameter values.

Simulations based on a frequentist model, built from the data obtained by the robot in its environment, resulted in a set of parameters that was later used successfully on the real robot. The use of a fixed-point notation system was also evaluated and verified, since the robot is based on a microcontroller without an FPU. The design of a very simple wandering task (wander-1) allowed us to carry out a comparative study between simulation and the real implementation of the learning method. The limitations of our real agent regarding overflows and precision losses were identified in this study.

The non-sonar experiments presented in chapter 4 were aimed at learning to wander inside small scenarios, moving forward whenever no contact with obstacles was detected, and turning in the appropriate direction after colliding, usually with the same bumper. The results show that, in most tests, the number of steps needed to learn the wander-1 task in the real-world scenario is slightly smaller than in simulation. Thus, we can state that the real environment offered a slightly better model of learning than the simulator. Besides, based on the studies described in chapter 4 regarding losses of precision and overflows, we can state that the time needed to implement the learning algorithm on the robot is shorter than the time needed in simulation.

The experiments including the sonar sensor described in chapter 5, wander-2 and wander-3, are aimed at maximizing the paths followed by the robot in the scenarios, especially the wander-3 task, in which more precise movement sequences are available. The learned policies allow the robot to maximize the number of steps spent moving forward, thereby receiving the greatest rewards, while avoiding penalties by turning whenever a move-forward action would lead to a collision.

To sum up, we have obtained a Q-learning-based method for the NXT robot that is flexible enough to be adapted to other tasks on different systems, and that has served to explore thoroughly the limitations of small robotic platforms for practical reinforcement learning. Further work could also evaluate the findings of this research on other small-scale systems.

The work presented in this master thesis could become part of a larger project oriented to learning more complex tasks. To carry out such a proposal, approximation techniques such as neurodynamic programming [21] will be essential in order to approximate states; otherwise our reinforcement learning problem would become intractable due to the large number of states that could arise. This was shown in the learning process of the final wandering task of this work (wander-3), in which, although pseudo-optimal policies were obtained, several redundant states appeared that reduced the efficiency of the learning process. Other kinds of techniques, such as hierarchical reinforcement learning [22], could be necessary in order to learn more complex tasks.
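As an illustration of what such approximation techniques involve (not implemented in this thesis), the following Octave/Matlab sketch replaces the Q-matrix with a linear approximator Q(s,a) ≈ w'*phi(s,a). With the one-hot features used here it reduces to the tabular wander-1 case, whereas richer features would let similar (redundant) states share experience. It reuses the simulator of appendix B.2 and the reward function obtainReward_simpleTask referenced in the scripts of B.4 and B.5.

% Minimal sketch (not part of this thesis) of Q-learning with a linear
% approximator Q(s,a) ~= w'*phi(s,a). One-hot features reproduce the tabular
% wander-1 case; richer features would generalize over similar states.
N_STATES = 4; N_ACTIONS = 4;
phi = @(s,a) full(sparse((a-1)*N_STATES+s, 1, 1, N_STATES*N_ACTIONS, 1));
w = zeros(N_STATES*N_ACTIONS, 1);   % weight vector replaces the Q-matrix
alpha = 0.02; gamma = 0.9; s = 1;
for step = 1:2000
  if 100*rand > 70                                % 30% exploration, as in B.4
    a = randi(N_ACTIONS);
  else
    [qbest, a] = max(arrayfun(@(b) w'*phi(s,b), 1:N_ACTIONS));
  end
  sp = simulateSimpleTask_ConceptualModel(s,a);   % simulator of B.2
  R  = obtainReward_simpleTask(s,a,sp);           % reward used in B.4/B.5
  q_next = max(arrayfun(@(b) w'*phi(sp,b), 1:N_ACTIONS));
  delta  = R + gamma*q_next - w'*phi(s,a);        % TD error
  w = w + alpha*delta*phi(s,a);                   % gradient-style update
  s = sp;                                         % update state
end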

Finally, our proposal for future work includes adding two techniques which we consider would have a great impact on the learning method:

• An improved exploration-exploitation strategy: the implementation of an algorithm that governs the exploration/exploitation rate of the simple ε-greedy strategy used in this work. This algorithm should be based on both the Q-matrix and the exploration matrix. Related strategies, including Boltzmann methods, should be addressed here (a minimal sketch is given after this list).

• Adding a stage at the end of the learning loop that checks for convergence or for potential changes in the model. This is a generalization of the previous strategy. The design of algorithms based on the Q-matrix and the exploration matrix could lead to a greater α and higher exploration rates both at the beginning of the learning process and after detecting a relevant change in the model. On the other hand, as the learning process converges to stable policies, α and the exploration rate should be reduced.
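As a reference for this future line, a Boltzmann (softmax) selection rule could take a form like the following Octave/Matlab function (not evaluated in this thesis); the temperature tau would be scheduled, for instance, from the exploration matrix or from a convergence test on the Q-matrix.

% Possible Boltzmann (softmax) action selection (future work, not evaluated
% in this thesis). Higher-valued actions are chosen more often; the
% temperature tau controls the amount of exploration.
function a = selectActionBoltzmann(Q, s, tau)
  q = Q(s,:) / tau;
  q = q - max(q);                  % numerical stability
  p = exp(q) / sum(exp(q));        % softmax probabilities over the actions
  a = find(rand < cumsum(p), 1);   % sample an action according to p
end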


Appendix A

Implementation details

Since this work involves performing many tests on the robot, it has been necessary to implement some techniques to support debugging and data collection for subsequent analyses. This appendix briefly describes some of these techniques, already implemented in the experiments developed in chapters 4 and 5 and listed in appendix C.

A.1 Debugging, compilation, and NXT communication

As stated in chapter 4, the code implemented on the robot has been developed in the Not eXactly C programming language [20]. A standard PC running two operating systems was employed for software development: MS Windows XP and Ubuntu 12.10. The IDE Bricx Command Center v3.3 (Build 3.3.8.9) under Windows XP has been used for debugging, compiling, exporting the .rxe binary files to the robot, and importing the resulting data files. Figure A.1 shows a screenshot of this program.

Figure A.1: Bricx Command Center v3.3.

When working under Linux, the compiler NBC 1.2.1.r4 has been used for debugging and sending the binary files to the robot by typing in a terminal: nbc sourceFileName.nxc -d -S=usb. In that case, the results were imported from the robot with the communication tool NeXTTool 1.2.1.3 by typing: nexttool /COM=USB0 -upload="*.log".

Any test presented in this work can be reproduced in both operating systems without making any change in the source code.


A.2 Datalogs

Once all the steps of the learning process have been executed, the function void NXTsaveData(long memoryNeeded) is called. This routine, after checking that the output file has been successfully created with the required size, saves the following information in it:

0. Learning process name and number of steps executed.

1. Learned policy.

2. Resulting value function.

3. Resulting Q-matrix.

4. Exploration matrix.

5. Last step in which the policy changed into the optimal policy.

The source code implementing each of the three tasks developed in the present work contains this function. An example of a resulting datafile generated by the robot is shown in fig. A.2.

% 5 States Task NXT
NumberOfSteps=2001;
Policy = [3 3 2 3 4 ];
V = [25758 29021 13687 166 44375 ];
Q = [
1310 3474 25758 3325
658 1702 29021 2716
-320 13687 560 826
-448 -421 166 -4182
3044 24080 27686 44375
];
exploration = [
14 26 186 117 9 157 1312 53 15 184 6 7 378 95 87 1070
];
% Optimal Policy learned in step: 495

Figure A.2: Datafile generated by the wander-2 (five-states) task program on the robot.
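Since the file uses Octave/Matlab syntax, a convenient way to analyse it on the PC is to rename it to a .m script and execute it, as done with modelData.m in appendix B. The fragment below is only an illustrative sketch with hypothetical local file names:

% Quick analysis of a datafile generated by the robot (illustrative only).
% Renaming the .log file to a .m script makes its assignments available
% directly in the workspace.
copyfile('5s.log', 'wander2_run.m');   % hypothetical local file names
wander2_run;                           % defines NumberOfSteps, Policy, V, Q, exploration

[Vmax, greedy] = max(Q, [], 2);        % greedy policy recomputed from Q
disp(greedy');                         % should match the stored Policy
disp(sum(exploration(:)));             % total number of Q-matrix updates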

A.3 Brick key events

In order to study the evolution of any learning process, three user events that interrupt the learning process by pressing the buttons of the brick have been included in the programs. These are used for:

1. Exploiting the learned policy so far, without disturbing the learning process, and with the opportunity to continue learning after another button event.

2. Pausing and resuming the learning process, which also allows us to check the information shown on the NXT display and to move the robot to a new location.


3. Ending the current process and generating the data file, as though the last learning step had been reached.

These button events are implemented in the following lines of code, placed just at the end of the main learning loop:

if (saveAndStopButton()) break; // Right button (NXT_io.h)

if (exploitePolicyButton()) exploitePolicy(s); // Left button (NXT_io.h)

if (pauseResumeButton()) // Center button (NXT_io.h)
{
    Wait(1000);
    executeAction(INITIAL_POLICY); // INITIAL_POLICY: wheels stopped
    pauseNXT();
}

Also, the robot enters a pause state after finishing any learning process, waiting for button events. This feature has been used to execute the learned policy after the whole process has finished.

A.4 Online information through the display and the loudspeaker

The function void showMessage(const string &m, byte row, bool wait), implemented in the library NXT_io.h, is called from the main code at several points to display the current state of the learning process (see fig. A.3). This function is also used in paused mode to display the available user events.

Figure A.3: NXT display with the learning information of the wander-1 task.

The complementary transmission of information through the NXT loudspeaker gives the user a better understanding of the learning process without interrupting the robot. We have defined note and note-length constants in alphabetic music notation in NXT_io.h. Preliminary tests played high-pitched notes when large rewards were received, and lower-pitched notes in case of large penalizations. Any change in the learned policy was also signalled with two high-pitched notes.

The use of sound emissions accelerated and improved the adjustment of rewards, speeds, distance ranges and, above all, error detection. The code listed in this work contains a reduced set of events taking advantage of the loudspeaker, such as the instant in which the learned policy turns into an optimal policy.


Appendix B

Octave/Matlab scripts, functions, and files generated by the robot

B.1 modelData.m (wander-1)

In this section of the appendix we include the content of a file generated by the robot after running the program simpleTaskDataCollector.nxc.

% March 12, 2013. Simple task model data.
% Exported by robot NXT running compiled 'simpleTaskDataCollector.nxc'
% Parameters: MotorPower = 80; stepTime = 0.250 s; N_STEPS = 7200 (30 minutes)

% data(s,a,s'):
data(1,:,:) = [255 8 13 1255 24 17 0255 23 20 0255 70 70 29];
data(2,:,:) = [11 106 3 621 44 26 977 37 0 16 76 2 32];
data(3,:,:) = [13 0 92 381 0 33 037 16 52 610 1 62 30];
data(4,:,:) = [0 5 2 765 5 57 105 61 3 60 2 3 63];
% Example:
% data(1,2,3)=12 means that in s1, the action a2 led to s3 12 times of
% 208+22+12+0=242. That is: P(s3'|s1,a2) = 0.049

% The above data can be translated into the transition matrix T in Matlab/Octave:
% for s=1:4
%   for a=1:4
%     for sp=1:4
%       T(s,a,sp) = data(s,a,sp)/sum(data(s,a,:));
%     end
%   end
% end

B.2 simulateSimpleTask_ConceptualModel.m (wander-1)

This code shows the Octave/Matlab function that simulates the conceptual model of wander-1 (referred to as SimpleTask in the code).

%-------------------------------------------------------------------------------
% Task Objective: Wander evading obstacles after touching them.
% States:  s1: Non-contact, s2: Left bumper contact,
%          s3: Right bumper contact, s4: Both contacts
% Actions: a1: Stop, a2: Left-turn, a3: Right-turn, a4: Move forward
%-------------------------------------------------------------------------------

function sp = simulateSimpleTask_ConceptualModel(s,a)

uncertainty = rand-0.02;

switch a

case 1 % The state doesn’t change when the robot is stoppedsp = s;

case 2 % Left-Turnswitch s

case 1if uncertainty < 0.02 sp = 3;elseif uncertainty < 0.1 sp = 2;else sp = 1;end

case 2if uncertainty < 0.05 sp = 3;elseif uncertainty < 0.1 sp = 2;else sp = 4;end

case 3if uncertainty < 0.02 sp = 3;elseif uncertainty < 0.1 sp = 2;else sp = 1;end

case 4if uncertainty < 0.05 sp = 3;elseif uncertainty < 0.1 sp = 2;else sp = 1;end

end

case 3 % Right-Turnswitch s

case 1if uncertainty < 0.02 sp = 2;elseif uncertainty < 0.1 sp = 3;


else sp = 1;end

case 2if uncertainty < 0.02 sp = 3;elseif uncertainty < 0.1 sp = 2;else sp = 1;end

case 3if uncertainty < 0.05 sp = 3;elseif uncertainty < 0.1 sp = 2;else sp = 4;end

case 4if uncertainty < 0.05 sp = 3;elseif uncertainty < 0.1 sp = 2;else sp = 1;end

end

case 4 % Move forwardswitch s

case 1if uncertainty < 0.05 sp = 4;elseif uncertainty < 0.1 sp = 3;elseif uncertainty < 0.15 sp = 2;else sp = 1;end

case 2if uncertainty < 0.2 sp = 1;else sp = 4;end

case 3if uncertainty < 0.2 sp = 1;else sp = 4;end

case 4sp = 4;end

endend

B.3 simulateSimpleTask_FrequentistModel.m (wander-1)

We include here the Octave/Matlab function that simulates the frequentist model of wander-1 (referred to as SimpleTask in the code), given its transition function.

%-------------------------------------------------------------------------------
% Task Objective: Wander evading obstacles after touching them.
% States:  s1: Non-contact, s2: Left bumper contact,
%          s3: Right bumper contact, s4: Both contacts
% Actions: a1: Stop, a2: Left-turn, a3: Right-turn, a4: Move forward
%-------------------------------------------------------------------------------

function sp=simulateSimpleTask_FrequentistModel(N_STATES,s,a,T)

uncertainty=rand; %-0.1;


accum=0;

for i=1:N_STATES
  accum = accum + T(s,a,i);
  if uncertainty < accum
    sp = i;
    break;
  end
end
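As a quick cross-check between both models (not part of the original scripts), the conceptual simulator of B.2 can be sampled to estimate its transition probabilities and compare them with the frequentist matrix T, assuming T has already been built from modelData.m as in appendix B.5:

% Cross-check (illustrative only): estimate the transition probabilities of
% the conceptual model of B.2 by Monte Carlo and compare them with the
% frequentist matrix T obtained from the data collected by the robot (B.1).
N_STATES = 4; N_ACTIONS = 4; N_SAMPLES = 5000;
counts = zeros(N_STATES, N_ACTIONS, N_STATES);
for s = 1:N_STATES
  for a = 1:N_ACTIONS
    for k = 1:N_SAMPLES
      sp = simulateSimpleTask_ConceptualModel(s,a);
      counts(s,a,sp) = counts(s,a,sp) + 1;
    end
  end
end
T_conceptual = counts / N_SAMPLES;       % empirical P(s'|s,a)
% Largest absolute difference with the frequentist model:
max(abs(T_conceptual(:) - T(:)))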

B.4 Q_learning_simpleTask_Conceptual.m (wander-1)

This is the Octave/Matlab main script of the learning process of the wander-1 task (referred to as SimpleTask in the code) using the conceptual model.

%-------------------------------------------------------------------------------
% Task Objective: Wander evading obstacles after touching them.
% States:  s1: Non-contact, s2: Left bumper contact,
%          s3: Right bumper contact, s4: Both contacts
% Actions: a1: Stop, a2: Left-turn, a3: Right-turn, a4: Move forward
%-------------------------------------------------------------------------------

clear all;

% Q-learning algorithm parametersN_STATES = 4;N_ACTIONS = 4;INITIAL_POLICY = 1; % Stopped

INITIAL_ALPHA = 0.02; % Parameter of alpha-gamma studyGAMMA = 0.8; % Parameter of alpha-gamma study

% Experiment parametersN_EXPERIMENTS = 40; % Parameter of alpha-gamma study

N_STEPS = 2000; % Number of steps of the learning processSTEP_TIME = 500; % (ms) (not needed in simulation)

% Variablesglobal s,global a, global sp; % (s,a,s’)global R; % Obtained Rewardglobal alpha; % Learning rate parameter (fixed here)global Policy; % Current Policyglobal V; % Value functionglobal Q; % Q-matrixglobal step; % Current step

% Auxiliar variables (for analysis purposes only)global aux_learned; % True: Optimal Policy learnedglobal aux_step; % Last step in which policy changed into optimal policy

for (exp = 1:N_EXPERIMENTS) % For analysis purposes:% 1 experiment: 1 complete learning process

% Initial state


% V = 0, Q = 0, Policy = INITIAL_POLICY, s = 1 (no contact)aux_learned = false;

for i=1:N_STATESV(i) = 0;Policy(i) = INITIAL_POLICY;

for j=1:N_ACTIONSQ(i,j) = 0;

endends = 1;alpha = INITIAL_ALPHA;

% Q-learning algorithm Main loop--------------------------------------------% Input: alpha, GAMMA, N_STEPS, T (transition function)% Output: Q(s,a), V(s), Policy(s)sprintf(’Executing ...’)

for step = 1:N_STEPS

% selectAction (Exploitation/exploration strategy)if ceil(unifrnd(0,100)) > 70 % Simple e-greedy

a = ceil(unifrnd(0,4));else

a = Policy(s);end

% executeAction(),wait() and observeState() simulated:sp = simulateSimpleTask_ConceptualModel(s,a);R = obtainReward_simpleTask(s,a,sp);

% update Q-matrixQ(s,a) = (1 - alpha) * Q(s,a) + alpha * (R + GAMMA * V(sp));

% V_update(), Policy_update(). Refer to equations (2.4) and (2.5)V(s) = Q(s,1);Policy(s) = 1;for i = 2:N_ACTIONS

if(Q(s,i)>V(s))V(s) = Q(s,i);Policy(s) = i;

endend

s = sp; % update state

% Save data for further analysis: If the agent reaches an optimal% policy from a non-optimal one (aux_learned == false), the current% is saved as ’aux_step’, and it will be used as an indicator of the% goodness of the learning processif (Policy == [4,3,2,2] || Policy ==[4,3,2,3]) % Optimal policies

if (aux_learned == false)aux_step = step;aux_learned = true;

endelse aux_learned = false;end%aux_Policy_history(:,step) = Policy;

end % End of the current learning iteration---------------


% Learning process ended

% Octave/Matlab Analysis sentencesPolicy % show learned policyQ % show Q-matrixif (aux_learned == true)

sprintf(’Optimal Policy learned in step %u’, aux_step)Qsorted = sort(Q,2,’descend’);estimator = Qsorted(1,1)/Qsorted(1,2)

elsesprintf(’No Optimal Policy learned’)estimator = 0.5aux_step = N_STEPS

endarray_estimator(exp) = estimator;array_step(exp) = aux_step;

end % End of the current experiment --------------------------------------

sprintf(’Estimator (mean): %f’,mean(array_estimator))sprintf(’Step learned (mean): %f’,mean(array_step))

B.5 Q_learning_simpleTask_Frequentist.m (wander-1)

In this part of the appendix we show the Octave/Matlab main script of the learning process of the wander-1 task (referred to as SimpleTask in the code) using the frequentist model.

%-------------------------------------------------------------------------------
% Task Objective: Wander evading obstacles after touching them.
% States:  s1: Non-contact, s2: Left bumper contact,
%          s3: Right bumper contact, s4: Both contacts
% Actions: a1: Stop, a2: Left-turn, a3: Right-turn, a4: Move forward
%-------------------------------------------------------------------------------

clear all;

% Q-learning algorithm parametersN_STATES = 4;N_ACTIONS = 4;INITIAL_POLICY = 1; % Stopped

INITIAL_ALPHA = 0.02; % Parameter of alpha-gamma studyGAMMA = 0.9; % Parameter of alpha-gamma study

% Experiment parametersN_EXPERIMENTS = 40; % Parameter of alpha-gamma study

N_STEPS = 2000; % Number of steps of the learning processSTEP_TIME = 250; % (ms) (not needed in simulation)

% Variablesglobal s,global a, global sp; % (s,a,s’)global R; % Obtained Rewardglobal alpha; % Learning rate parameter (fixed here)global Policy; % Current Policy


global V; % Value functionglobal Q; % Q-matrixglobal step; % Current step

% Auxiliar variables (for analysis purposes only)global aux_learned; % True: Optimal Policy learnedglobal aux_step; % Last step in which policy changed into optimal policyglobal aux_V_history % Array of V’s obtained in each experiment

% Transition function (from data collected by the robot)modelData; % ’modelData.m’ is created in NXT robot at the end of

% ’DataCollector.nxc’ program.% It contains frequentist results ’data(s,a,sp)’

% Obtain Transition function from the above variable ’data(s,a,sp)’for s=1:4

for a=1:4for sp=1:4T(s,a,sp) = data(s,a,sp)/sum(data(s,a,:));end

endend

for (exp = 1:N_EXPERIMENTS) % For analysis purposes:% 1 experiment: 1 complete learning process

% Initial state% V = 0, Q = 0, Policy = INITIAL_POLICY, s = 1 (no contact)aux_learned = false;

for i=1:N_STATESV(i) = 0;Policy(i) = INITIAL_POLICY;for j=1:N_ACTIONS

Q(i,j) = 0;end

ends = 1;alpha = INITIAL_ALPHA;

% Q-learning algorithm Main loop--------------------------------------------% Input: alpha, GAMMA, N_STEPS, T (transition function)% Output: Q(s,a), V(s), Policy(s)sprintf(’Executing ...’)

for step = 1:N_STEPS

% selectAction (Exploitation/exploration strategy)if ceil(unifrnd(0,100)) > 70 % simple e-greedy

a = ceil(unifrnd(0,4));else

a = Policy(s);end

% executeAction(),wait() and observeState() simulated:sp = simulateSimpleTask_FrequentistModel(N_STATES,s,a,T);R = obtainReward_simpleTask(s,a,sp);

% update Q-matrixQ(s,a) = (1 - alpha) * Q(s,a) + alpha * (R + GAMMA * V(sp));


% V_update(), Policy_update(). Refer to equations (2.4) and (2.5)V(s) = Q(s,1);Policy(s) = 1;for i = 2:N_ACTIONS

if(Q(s,i)>V(s))V(s) = Q(s,i);Policy(s) = i;

endend

s = sp; % update state

% Save data for further analysis: If the agent reaches an optimal% policy from a non-optimal one (aux_learned == false), the current% is saved as ’aux_step’, and it will be used as an indicator of the% goodness of the learning processif (Policy == [4,3,2,2] || Policy ==[4,3,2,3]) % Optimal policies

if (aux_learned == false)aux_step = step;aux_learned = true;

endelse aux_learned = false;end%aux_Policy_history(:,step) = Policy;

end % End of the current learning iteration---------------

% Learning process ended

% Octave/Matlab Analysis sentencesPolicy % show learned policyQ % show Q-matrixif (aux_learned == true)

sprintf(’Optimal Policy learned in step %u’, aux_step)Qsorted = sort(Q,2,’descend’);estimator = Qsorted(1,1)/Qsorted(1,2)

elsesprintf(’No Optimal Policy learned’)estimator = 0.5aux_step = N_STEPS

endarray_estimator(exp) = estimator;array_step(exp) = aux_step;aux_V_history(:,exp) = V;

end % End of the current experiment --------------------------------------

sprintf(’\n INPUT: alpha: %f, gamma: %f ’,INITIAL_ALPHA,GAMMA)sprintf(’Estimator (average): %f’,mean(array_estimator))sprintf(’Step learned (average): %f’,mean(array_step))sprintf(’V (average):’)sprintf(’ %f’,mean(aux_V_history,2))


Appendix C

Source code developed for the robot implementations (NXC)

C.1 NXT_io.h

This section of the appendix shows the content of the library written in NXC and used in all the experiments performed with the real robot in this work.

/* References ----------------------------------------------------------------*/#include "NXCDefs.h"#include "NBCCommon.h"

/* NXT Actuators parameters ----------------------------------------------------

/*------------------------------------------------------------------------(NXT Port = connected device)Sensors: 2 = Right_bumper, 3 = left bumper, 4 = Ultrasonic (not used)Actuators: B = Right wheel motor, C = Left wheel motor,------------------------------------------------------------------------*/

#define SPEED 50 // Robot Speed (fixed)#define L_WHEEL OUT_C // Left wheel#define R_WHEEL OUT_B // Right wheel

#define LEFT_BUMPER SENSOR_3#define RIGHT_BUMPER SENSOR_2#define SONAR_STATES 4#define DISTANCE_0 25#define DISTANCE_1 50#define DISTANCE_2 75

/* NXT Loudspeaker parameters ------------------------------------------------*/#define VOL 1 // volume

// Musical Notes#define C5 523#define C#5 554#define D5 587#define D#5 622#define E5 659#define F5 698#define F#5 740#define G5 784#define G#5 831#define A5 880


#define A#5 932#define B5 988#define C6 1047#define C#6 1109#define D6 1175#define D#6 1245#define E6 1319#define F6 1397#define F#6 1480#define G6 1568#define G#6 1661#define A6 1760#define A#6 1865#define B6 1976#define C7 2093#define C#7 2217#define D7 2349#define E7 2489

// Musical Note lengths#define HALF 250#define QUARTER 125#define EIGHTH 63#define SIXTEENTH 32

/* Global variables ----------------------------------------------------------*/mutex msgmutex; // Mutex needed for NXT display

/* Functions prototypes-------------------------------------------------------*/

void NXT_mapSensors(void);byte getSonarDistance(void);void executeNXT(string command);void showMessage(const string &m, byte row, bool wait);bool NXTcheckMemory(long memoryNeeded, string FILE);bool exploitePolicyButton(void);bool saveAndStopButton(void);bool anyButton(void);void pauseNXT(void);

/* Functions definitions------------------------------------------------------*/

void NXT_mapSensors(){SetSensorLowspeed(IN_4);SetSensorTouch(IN_2);SetSensorTouch(IN_3);}

//-------------------------------------------------------------------------byte getSonarDistance(void)

{return SensorUS(IN_4);}

//-------------------------------------------------------------------------void executeNXT(string command)


{switch(command)

{case "stop":

{Off(L_WHEEL);Off(R_WHEEL);break;}

case "turnLeft":{OnRev(L_WHEEL, SPEED);OnFwd(R_WHEEL, SPEED);break;}

case "turnRight":{OnFwd(L_WHEEL, SPEED);OnRev(R_WHEEL, SPEED);break;}

case "forward":{OnFwd(L_WHEEL, SPEED);OnFwd(R_WHEEL, SPEED);break;}

case "forwardLeft":{OnFwd(L_WHEEL, SPEED);Off(R_WHEEL);break;}

case "forwardRight":{Off(L_WHEEL);OnFwd(R_WHEEL, SPEED);break;}

case "backLeft":{OnRev(L_WHEEL, SPEED);Off(R_WHEEL);break;}

case "backRight":{Off(L_WHEEL);OnRev(R_WHEEL, SPEED);break;}

case "back":{OnRev(L_WHEEL, SPEED);OnRev(R_WHEEL, SPEED);break;}

}}


//-------------------------------------------------------------------------void showMessage(const string &m, byte row, bool wait)

{// NXT Displaybyte aux,i;string line;

Acquire(msgmutex);aux = StrLen(m)/16;for (i=0;i<=aux;i++)

{line =SubStr(m,16*i,16);TextOut(0, 58-7*(row+i),line);}

if(wait) Wait(1000);Release(msgmutex);}

//-------------------------------------------------------------------------bool NXTcheckMemory(long memoryNeeded, string FILE)

{// Input: NXT required memory// Output: True = NXT required memory availableunsigned long aux;string lin;

DeleteFile(FILE);aux=FreeMemory();

lin=StrCat(NumToStr(memoryNeeded)," of ",NumToStr(aux));showMessage(lin,3,true);if (aux<=memoryNeeded)

{showMessage("Not enough Memory. Delete unnecesary NXT files",3,true);return(false);}

return(true);}

//-------------------------------------------------------------------------bool pauseResumeButton(void)

{if(ButtonPressed(BTNCENTER, true)==0)

return false;else return true;

}

//-------------------------------------------------------------------------bool exploitePolicyButton(void)

{if(ButtonPressed(BTNLEFT, true)==0)

return false;else return true;

}

//-------------------------------------------------------------------------bool saveAndStopButton(void)

{if(ButtonPressed(BTNRIGHT, true)==0)

return false;


else return true;}

//-------------------------------------------------------------------------bool anyButton(void)

{bool aux;aux = (pauseResumeButton() || exploitePolicyButton() || saveAndStopButton());return aux;}

//-------------------------------------------------------------------------void pauseNXT(void)

{showMessage("Paused. Press any Button...", 5,false);while(!anyButton())

{Wait(50);}

}

C.2 simple_task_learning.nxc (wander-1)

In this part of the appendix we show the code implementing the learning process for the wander-1 task (referred to as SimpleTask in the code).

/*------------------------------------------------------------------------------Task Objective: Wander evading obstacles after touching them.States: s1: Non-contact, s2: Left bumper contact,

s3: Right bumper contact, s4: Both contactActions: a1: Stop, a2: Left-turn, a3: Right-turn, s4: Move forward------------------------------------------------------------------------------*/

/*------------------------------------------------------------------------(NXT Port = connected device)Sensors: 2 = Right_bumper, 3 = left bumper, 4 = Ultrasonic (not used here)Actuators: B = Right wheel motor, C = Left wheel motor,------------------------------------------------------------------------*/

/* References ----------------------------------------------------------------*/#include "NXT_io.h"

/* FileSystem parameters */#define NAME "Simple Task NXT"#define FILE "SimpleTask.log" // used in ’NXT_io.h’

/* Constants definition ------------------------------------------------------*/

#define FP 10000 // FIXED POINT: 10000 = 4 decimal places

/* Q-learning algorithm parameters */#define N_STATES 4 // from 1 to 4 (0 unusued)#define N_ACTIONS 4 // from 1 to 4 (0 unusued)#define INITIAL_POLICY 1#define GAMMA (FP*90)/100 // FIXED POINT (0.9)#define INITIAL_ALPHA (FP*2)/100 // FIXED POINT (0.02)


// NOTE: Maximun value alowed in long variables: 2147483647. Chosen FP,GAMMA// and R will never cause overflow in the robot

/* Experiment parameters */#define STEP_TIME 250 // (ms)#define N_STEPS 2000 // Experiment: 8 minutes approx.

/* Global variables ----------------------------------------------------------*/// Local main arrays will be defined here to avoid exceding NXC limitationsbyte Policy[N_STATES+1]; // Current Policylong V[N_STATES+1]; // Value functionlong Q[N_STATES+1][N_ACTIONS+1]; // Q-matrixlong step; // Current step

// Auxiliar variables (for analysis purposes only)bool aux_learned; // True: Optimal Policy learnedlong aux_step; // Last step in which policy changed into optimal policystring policyString;string display;

/* Functions prototypes ------------------------------------------------------*//* Note: Older Bricx Command Center versions will throw error here.

In that case, erasing this section will solve the problem.NBC 1.2.1.r4 compiler under Linux works OK in any case*/

long getMemoryNeeded(long explicitMemory,byte nstates,byte nactions,long nsteps);

void NXTsaveData(long memoryNeeded);byte selectAction(byte s); // Exploitation/exploration strategyvoid executeAction(byte a);byte observeState(void);long obtainReward(byte s, byte a, byte sp);void exploitePolicy(byte s);

/* Functions definitions------------------------------------------------------*/

//------------------------------------------------------------------------------long getMemoryNeeded(long explicitMemory,byte nstates,byte nactions,long nsteps)

{// Input: if explicitMemory==0// nstates,nactions and nsteps are used to obtain memoryNeeded// Output: NXT memory needed

long memoryNeeded;

if(explicitMemory == 0) // when MemoryNeeded is not explicitly defined{memoryNeeded = 100; // Available for string comments: %..., %....memoryNeeded += (nstates+1)*(5+1)+2; // Policy rowmemoryNeeded += (nstates+1)*(10+1)+2; // Value function rowmemoryNeeded += ((nstates+1)*(10+1)+2)*(nactions+1); // Q matrix rows}

else memoryNeeded = explicitMemory;return memoryNeeded;}

//------------------------------------------------------------------------------void NXTsaveData(long memoryNeeded)

{


byte handle, res;unsigned long memoryUsed, size;string smu, sti, lin;byte s, a;string val;

res=CreateFile(FILE,memoryNeeded,handle);if (res != LDR_SUCCESS)

{showMessage("Error creating file!", 3, true);PlayToneEx(F5,HALF,VOL,FALSE);Wait(HALF);PlayToneEx(C5,HALF,VOL,FALSE);Wait(HALF);Stop(true);}

memoryUsed = 0;

// 0- Save name & step

lin = StrCat("% ",NAME);WriteLnString(handle,lin,size);memoryUsed += size;

lin = StrCat("NumberOfSteps=");val = NumToStr(step);lin += StrCat(val,";");WriteLnString(handle,lin,size);memoryUsed += size;

// 1 - Save policylin ="Policy = [";for (s = 1; s < N_STATES+1; s++)

{val = NumToStr(Policy[s]);lin += StrCat(val," ");}

lin += "];";WriteLnString(handle,lin,size);memoryUsed += size;

// 2 - Save Value Functionlin ="V = [";for (s = 1; s < N_STATES+1; s++)

{val = NumToStr(V[s]);lin += StrCat(val," ");}

lin += "];";WriteLnString(handle,lin,size);memoryUsed += size;

// 3 - Save Q-Matrixlin ="Q = [";WriteLnString(handle,lin,size);memoryUsed += size;for (s = 1; s < N_STATES+1; s++)

{lin ="";for (a = 1; a < N_ACTIONS+1; a++)


{val = NumToStr(Q[s][a]);lin += StrCat(val," ");}

WriteLnString(handle,lin,size);memoryUsed += size;}

lin = "];";WriteLnString(handle,lin,size);memoryUsed += size;

// 4 - Save last step in which the policy changed into the optimal policyif(aux_learned == true)

{lin ="% Optimal Policy learned in step: ";val = NumToStr(aux_step);lin += val;WriteLnString(handle,lin,size);memoryUsed += size;}else{lin ="%No optimal policy learned ";WriteLnString(handle,lin,size);memoryUsed += size;}

CloseFile(handle);lin=StrCat("bytes:",NumToStr(memoryUsed));showMessage(lin, 3, false);}

// -----------------------------------------------------------------------------byte selectAction (byte s)

// Input: Current State// Output: Action selected{byte a;if(Random(100)>70) // Simple e-Greedy

// 70% probaility of exploiting the current learned policy.a=Random(N_ACTIONS+1); // (1,2,3 or 4)

elsea = Policy[s];

return(a);}

//------------------------------------------------------------------------------void executeAction(byte a)

// Input: selected action{switch(a)

{case 1:

{executeNXT("stop"); // defined in ’NXT_io.h’break;}

case 2:{


executeNXT("turnLeft");break;}

case 3:{executeNXT("turnRight");break;}

case 4:{executeNXT("forward");break;

}}

}

//------------------------------------------------------------------------------byte observeState(void)

{// Returns the state of the robot by encoding the information measured from// the contacts sensors. In case the number of states or their definitions// change, this function must be updated.

// States discretization:// s1: left_contact=0,right_contact=0// s2: left_contact=1,right_contact=0// s3: left_contact=0,right_contact=1// s4: left_contact=1,right_contact=1

byte state;state = 1 + (LEFT_BUMPER + 2 * RIGHT_BUMPER); // defined in ’NXT_io.h’return(state);}

// -----------------------------------------------------------------------------long obtainReward(byte s, byte a, byte sp)

{// Input: s,a,s’ not used directly here since we look at the motion of// wheels and sensors (that would be a and s’). Encoders and// contact sensors result in a better function R(s,a,s’)// Output: Obtained Reward

long R; // Fixed-point numberlong treshold;

// Reference: 1 second at SPEED 100 gives above 700 degrees (full batery)treshold=30; // Valid for [40 < SPEED < 80]R=0;

if(MotorRotationCount(L_WHEEL) > treshold &&MotorRotationCount(R_WHEEL) > treshold) R = FP; // R=1

else if(MotorRotationCount(R_WHEEL) > treshold) R = FP / 10; // R=0.1else if(MotorRotationCount(L_WHEEL) > treshold) R = FP / 10; // R=0.1

if(LEFT_BUMPER || RIGHT_BUMPER){if(LEFT_BUMPER && RIGHT_BUMPER) R=-FP/2-FP/5; // R = -0.7else R=-FP/5; // R = -0.2}

ResetRotationCount(L_WHEEL);


ResetRotationCount(R_WHEEL);return(R);}

// -----------------------------------------------------------------------------void exploitePolicy(byte s)

{/* Auxiliary: For analysis purposes only. When the left button of the NXT brick

is pressed, the learning process breaks into an exploitation loop. NoneQ-matrix value or step is modified. When the left button is pressed again,the learning process continues as the same point as before */

int a;int sp;ClearScreen();showMessage("Exploiting the learned policy",1,false);Wait(500);while(!exploitePolicyButton())

{a = Policy[s]; // exploitation onlyexecuteAction(a);Wait(STEP_TIME);sp = observeState();s = sp; // update state}

ClearScreen();executeAction(INITIAL_POLICY); // stop the robot}

// -----------------------------------Main function ----------------------------task main()

{/* Local variables */byte s,a,sp; // (s,a,s’)long R; // Obtained Reward [FIXED POINT]long alpha; // Learning rate [FIXED POINT]

byte i,j,k;

long explicitMemory; // Auxiliar: define memory needed for NXT file savinglong memoryNeeded;

// Start notificationPlayToneEx(E5,QUARTER,VOL,FALSE);Wait(HALF);PlayToneEx(E5,QUARTER,VOL,FALSE);Wait(HALF);

aux_learned = false; // auxiliar for further analysisexplicitMemory = 0; // 0: memoryNeeded will be calculed in getMemoryNeeded()memoryNeeded = getMemoryNeeded(explicitMemory,N_STATES,N_ACTIONS,N_STEPS);Stop(!NXTcheckMemory(memoryNeeded, FILE));

ResetRotationCount(L_WHEEL);ResetRotationCount(R_WHEEL);NXT_mapSensors();display=StrCat(NAME," started ");showMessage(display, 3,true);

/* Initial State */


for(i=1; i<N_STATES+1; i++) // Q[0][x] & Q[x][0] unusued{V[i]=0;Policy[i] = INITIAL_POLICY;for(j=1; j<N_ACTIONS+1; j++)

{Q[i][j]=0;}

}alpha = INITIAL_ALPHA;s = observeState();

// Q-learning algorithm Main loop ------------------------------------------for(step=1; step<N_STEPS+1; step++)

{a = selectAction(s); // Exploitation/exploration strategyexecuteAction(a);Wait(STEP_TIME);sp = observeState();R = obtainReward(s,a,sp);

// Update Q-matrixQ[s][a]=((FP-alpha)*Q[s][a])/FP + (alpha*(R+(GAMMA*V[sp])/FP))/FP;

// Update V and PolicyV[s] = Q[s][1];Policy[s] = 1;for(i=2; i<N_ACTIONS+1; i++)

{if(Q[s][i] > V[s])

{V[s] = Q[s][i];Policy[s] = i;}

}s = sp; // Update state

// Display informationClearScreen();showMessage(NAME, 0, false);display = StrCat("STEP: ",NumToStr(step));showMessage(display, 1, false);showMessage("Policy:", 2, false);policyString = "";display="";for(i=1; i<N_STATES+1; i++)

{display += StrCat("s",NumToStr(i),"a",NumToStr(Policy[i])," ");policyString += StrCat(NumToStr(Policy[i])," ");}

showMessage(display, 3,false);

// Saving data for further analysis (only valid when the optimal policy is known)if ((policyString == "4 3 2 2 ") || (policyString == "4 3 2 3 "))

{if (aux_learned == false)

{aux_step = step;aux_learned = true;PlayToneEx(E5,HALF,VOL,FALSE);


Wait(HALF);PlayToneEx(A5,HALF,VOL,FALSE);Wait(HALF);}

}else aux_learned = false;

// User eventsif (saveAndStopButton()) break; // Right button (NXT_io.h)

if (exploitePolicyButton()) exploitePolicy(s); // Left button (NXT_io.h)

if (pauseResumeButton()) // Center button (NXT_io.h){Wait(1000);executeAction(INITIAL_POLICY); // INITIAL_POLICY: wheels stoppedpauseNXT();}

} // --------------------- Main loop ending --------------------------

executeAction(INITIAL_POLICY); // Stop the robotNXTsaveData(memoryNeeded);

PlayToneEx(E5,HALF,VOL,FALSE); // Ending notificationWait(HALF);PlayToneEx(G5,HALF,VOL,FALSE);Wait(HALF);PlayToneEx(B5,HALF,VOL,FALSE);Wait(HALF);

while(true){showMessage("Finished. Exit:Orange Button, Show learned: Left button",

5,false);PlayToneEx(B5,SIXTEENTH,VOL,FALSE);Wait(300);if (exploitePolicyButton())

exploitePolicy(s);if (pauseResumeButton())

break;}

}

C.3 5s_learning.nxc (wander-2)

This code shows the implementation of the learning process for the wander-2 task (referred to as 5 States in the code).

/*------------------------------------------------------------------------------Task Objective: Wander evading obstacles.States: s1: Non-contact & obstacle near, s2: Left bumper contact,

s3: Right bumper contact, s4: Both contacts4: Non-contact & non obstacle near

Actions: a1: Stop, a2: Left-turn, a3: Right-turn, s4: Move forward------------------------------------------------------------------------------*/


/*------------------------------------------------------------(NXT Port = connected device)Sensors: 2 = Right_bumper, 3 = left bumper, 4 = UltrasonicActuators: B = Right wheel motor, C = Left wheel motor,------------------------------------------------------------*/

/* References ----------------------------------------------------------------*/#include "NXT_io.h"

/* FileSystem parameters */#define NAME "5 States Task NXT"#define FILE "5s.log"

/* Constants definition ------------------------------------------------------*/

#define FP 10000 // FIXED POINT: 10000 = 4 decimal places

/* Q-learning algorithm parameters */#define N_STATES 5 // from 1 to 5 (0 unusued)#define N_ACTIONS 4 // from 1 to 4 (0 unusued)#define INITIAL_POLICY 1#define GAMMA (FP*90)/100 // FIXED POINT (0.9)#define INITIAL_ALPHA (FP*2)/100 // FIXED POINT (0.02)

// NOTE: Maximun value alowed in long variables: 2147483647. Chosen FP,GAMMA// and R will never cause overflow in the robot

/* Experiment parameters */#define STEP_TIME 250 // (ms)#define N_STEPS 2000 // Experiment: 8 minutes approx.

/* Global variables ----------------------------------------------------------*/// Local main arrays will be defined here to avoid exceding NXC limitationsbyte Policy[N_STATES+1]; // Current Policylong V[N_STATES+1]; // Value functionlong Q[N_STATES+1][N_ACTIONS+1]; // Q-matrixlong step; // Current stepint exploration[N_STATES+1][N_ACTIONS+1]; // Exploration matix

// Auxiliar variables (for analysis purposes only)bool aux_learned; // True: Optimal Policy learnedlong aux_step; // Last step in which policy changed into optimal policystring policyString;string display;

/* Functions prototypes ------------------------------------------------------*//* Note: Older Bricx Command Center versions will throw error here.

In that case, erasing this section will solve the problem.NBC 1.2.1.r4 compiler under Linux works OK in any case*/

long getMemoryNeeded(long explicitMemory,byte nstates,byte nactions,long nsteps);

void NXTsaveData(long memoryNeeded);byte selectAction(byte s); // Exploitation/exploration strategyvoid executeAction(byte a);byte observeState(void);long obtainReward(byte s, byte a, byte sp);void exploitePolicy(byte s);


/* Functions definitions------------------------------------------------------*/

long getMemoryNeeded(long explicitMemory,byte nstates,byte nactions,long nsteps){// Input: if explicitMemory==0// nstates,nactions and nsteps are used to obtain memoryNeeded// Output: NXT memory needed

long memoryNeeded;

if(explicitMemory == 0) // when MemoryNeeded is not explicitly defined{memoryNeeded = 100; // Available for string comments: %..., %....memoryNeeded += (nstates+1)*(5+1)+2; // Policy rowmemoryNeeded += (nstates+1)*(10+1)+2; // Value function rowmemoryNeeded += ((nstates+1)*(10+1)+2)*(nactions+1);// Q matrix rowsmemoryNeeded += ((nstates+1)*(5+1)+2)*(nactions+1); // Q explor. rows}

else memoryNeeded = explicitMemory;return memoryNeeded;}

//------------------------------------------------------------------------------void NXTsaveData(long memoryNeeded)

{byte handle, res;unsigned long memoryUsed, size;string smu, sti, lin;byte s, a;string val;

res=CreateFile(FILE,memoryNeeded,handle);if (res != LDR_SUCCESS)

{showMessage("Error creating file!", 3, true);PlayToneEx(F5,HALF,VOL,FALSE);Wait(HALF);PlayToneEx(C5,HALF,VOL,FALSE);Wait(HALF);Stop(true);}

memoryUsed = 0;

// 0- Save name & step

lin = StrCat("% ",NAME);WriteLnString(handle,lin,size);memoryUsed += size;

lin = StrCat("NumberOfSteps=");val = NumToStr(step);lin += StrCat(val,";");WriteLnString(handle,lin,size);memoryUsed += size;

// 1 - Save policylin ="Policy = [";for (s = 1; s < N_STATES+1; s++)

{val = NumToStr(Policy[s]);


lin += StrCat(val," ");}

lin += "];";WriteLnString(handle,lin,size);memoryUsed += size;

// 2 - Save Value functionlin ="V = [";for (s = 1; s < N_STATES+1; s++)

{val = NumToStr(V[s]);lin += StrCat(val," ");}

lin += "];";WriteLnString(handle,lin,size);memoryUsed += size;

// 3 - Save Q-matrixlin ="Q = [";WriteLnString(handle,lin,size);memoryUsed += size;for (s = 1; s < N_STATES+1; s++)

{lin ="";for (a = 1; a < N_ACTIONS+1; a++)

{val = NumToStr(Q[s][a]);lin += StrCat(val," ");}

WriteLnString(handle,lin,size);memoryUsed += size;}

lin = "];";WriteLnString(handle,lin,size);memoryUsed += size;

// 4 - Save exploration matrixlin ="exploration = [";WriteLnString(handle,lin,size);memoryUsed += size;for (s = 1; s < N_STATES+1; s++)

{lin ="";for (a = 1; a < N_ACTIONS+1; a++)

{val = NumToStr(exploration[s][a]);lin += StrCat(val," ");}

WriteLnString(handle,lin,size);memoryUsed += size;}

lin = "];";WriteLnString(handle,lin,size);memoryUsed += size;

// 5 - Save last step in which the policy changed into the optimal policyif(aux_learned == true)

{


lin ="% Optimal Policy learned in step: ";val = NumToStr(aux_step);lin += val;WriteLnString(handle,lin,size);memoryUsed += size;}else{lin ="%No optimal policy learned ";WriteLnString(handle,lin,size);memoryUsed += size;}

CloseFile(handle);lin=StrCat("bytes:",NumToStr(memoryUsed));showMessage(lin, 3, false);}

// -----------------------------------------------------------------------------
byte selectAction (byte s)
// Input:  Current state
// Output: Action selected
{
  byte a;

  // Simple e-Greedy: ~70% probability of exploiting the current learned policy
  if(Random(100)>70)
    a = Random(N_ACTIONS) + 1;  // explore: random action in 1..N_ACTIONS
                                // (Random(n) returns a value in 0..n-1)
  else
    a = Policy[s];              // exploit

  return(a);
}

//------------------------------------------------------------------------------
void executeAction(byte a)
// Input: selected action
{
  switch(a)
  {
    case 1: executeNXT("stop");       break;   // defined in 'NXT_io.h'
    case 2: executeNXT("turnLeft");   break;
    case 3: executeNXT("turnRight");  break;
    case 4: executeNXT("forward");    break;
  }
}

//------------------------------------------------------------------------------
byte observeState(void)
{
  // Returns the state of the robot by encoding the information measured from
  // the ultrasonic and the contact sensors. In case the number of states
  // or their definitions change, this function must be updated.

  // States discretization:
  //   s1: left_contact=0, right_contact=0   (obstacle near)
  //   s2: left_contact=1, right_contact=0
  //   s3: left_contact=0, right_contact=1
  //   s4: left_contact=1, right_contact=1
  //   s5: left_contact=0, right_contact=0   (no obstacle near)

  byte state;
  byte sonarDistance;
  byte sonarState;

  sonarDistance = getSonarDistance();
  if(sonarDistance <= DISTANCE_0)
    // DISTANCE_0 is the threshold distance defined in 'NXT_io.h' used to
    // distinguish between states s1 and s5
    sonarState = 0;
  else
    sonarState = 1;

  state = 1 + (LEFT_BUMPER + 2 * RIGHT_BUMPER);  // defined in 'NXT_io.h'
  if(state==1 && sonarState==1) state=5;
  return(state);
}

// -----------------------------------------------------------------------------
long obtainReward(byte s, byte a, byte sp)
{
  // Input:  s, a, s' are not used directly here, since we look at the motion of
  //         the wheels and at the sensors (that would be a and s'). Encoders and
  //         contact sensors result in a better function R(s,a,s')
  // Output: Obtained reward

  long R;
  long treshold;

  // Reference: 1 second at SPEED 100 gives above 700 degrees (full battery)
  treshold = 30;  // Valid for [40 < SPEED < 80]
  R = 0;

  if(MotorRotationCount(L_WHEEL) > treshold &&
     MotorRotationCount(R_WHEEL) > treshold)      R = FP;       // R =  1
  else if(MotorRotationCount(R_WHEEL) > treshold) R = FP / 10;  // R =  0.1
  else if(MotorRotationCount(L_WHEEL) > treshold) R = FP / 10;  // R =  0.1

  if(LEFT_BUMPER || RIGHT_BUMPER)
  {
    if(LEFT_BUMPER && RIGHT_BUMPER) R = -FP/2 - FP/5;  // R = -0.7
    else                            R = -FP/5;         // R = -0.2
  }

  ResetRotationCount(L_WHEEL);
  ResetRotationCount(R_WHEEL);

  return(R);
}

// -----------------------------------------------------------------------------
void exploitePolicy(byte s)
{
  /* Auxiliary: for analysis purposes only. When the left button of the NXT brick
     is pressed, the learning process breaks into an exploitation loop. No
     Q-matrix value or step is modified. When the left button is pressed again,
     the learning process continues at the same point as before */

  int a;
  int sp;

  ClearScreen();
  showMessage("Exploiting the learned policy",1,false);
  Wait(500);
  while(!exploitePolicyButton())
  {
    a = Policy[s];                  // Exploitation
    executeAction(a);
    Wait(STEP_TIME);
    sp = observeState();
    s = sp;                         // update state
  }
  ClearScreen();
  executeAction(INITIAL_POLICY);    // stop the robot
}

// ----------------------------------- Main function ----------------------------
task main()
{
  /* Local variables */
  byte s, a, sp;        // (s,a,s')
  long R;               // Obtained reward [FIXED POINT]
  long alpha;           // Learning rate   [FIXED POINT]
  byte i, j, k;

  long explicitMemory;  // Auxiliary: defines the memory needed for NXT file saving
  long memoryNeeded;

  // Start notification
  PlayToneEx(E5,QUARTER,VOL,FALSE); Wait(HALF);
  PlayToneEx(E5,QUARTER,VOL,FALSE); Wait(HALF);

  aux_learned = false;   // auxiliary for further analysis
  explicitMemory = 0;    // 0: memoryNeeded will be calculated in getMemoryNeeded()
  memoryNeeded = getMemoryNeeded(explicitMemory, N_STATES, N_ACTIONS, N_STEPS);
  Stop(!NXTcheckMemory(memoryNeeded, FILE));

  ResetRotationCount(L_WHEEL);
  ResetRotationCount(R_WHEEL);
  NXT_mapSensors();
  display = StrCat(NAME," started ");
  showMessage(display, 3, true);


  /* Initial state */
  for(i=1; i<N_STATES+1; i++)     // Q[0][x] & Q[x][0] unused
  {
    V[i] = 0;
    Policy[i] = INITIAL_POLICY;
    for(j=1; j<N_ACTIONS+1; j++)
    {
      exploration[i][j] = 0;
      Q[i][j] = 0;
    }
  }
  alpha = INITIAL_ALPHA;
  s = observeState();

  // Q-learning algorithm. Main loop ------------------------------------------
  for(step=1; step<N_STEPS+1; step++)
  {
    PlayToneEx(E5,EIGHTH,VOL,FALSE);

    a = selectAction(s);          // Exploitation/exploration strategy
    executeAction(a);
    Wait(STEP_TIME);
    sp = observeState();
    R = obtainReward(s,a,sp);
    if (exploration[s][a]<INT_MAX)
      exploration[s][a]++;

    // Update Q-matrix
    Q[s][a] = ((FP-alpha)*Q[s][a])/FP + (alpha*(R+(GAMMA*V[sp])/FP))/FP;

    // Update V and Policy
    V[s] = Q[s][1];
    Policy[s] = 1;
    for(i=2; i<N_ACTIONS+1; i++)
    {
      if(Q[s][i] > V[s])
      {
        V[s] = Q[s][i];
        Policy[s] = i;
      }
    }
    s = sp;                       // Update state

    // Display information
    ClearScreen();
    showMessage(NAME, 0, false);
    display = StrCat("STEP: ",NumToStr(step));
    showMessage(display, 1, false);
    showMessage("Policy:", 2, false);
    policyString = "";
    display = "";
    for(i=1; i<N_STATES+1; i++)
    {
      display += StrCat("s",NumToStr(i),"a",NumToStr(Policy[i])," ");
      policyString += StrCat(NumToStr(Policy[i])," ");
    }
    showMessage(display, 3, false);

    // Saving data for further analysis (only valid when the optimal policy is known)
    if ( (policyString == "2 3 2 2 4 ") || (policyString == "2 3 2 3 4 ") ||
         (policyString == "3 3 2 2 4 ") || (policyString == "3 3 2 3 4 ") )
    {
      if (aux_learned == false)
      {
        aux_step = step;
        aux_learned = true;
        PlayToneEx(E5,HALF,VOL,FALSE); Wait(HALF);
        PlayToneEx(A5,HALF,VOL,FALSE); Wait(HALF);
      }
    }
    else aux_learned = false;

    // User events
    if (saveAndStopButton()) break;                 // Right button (NXT_io.h)

    if (exploitePolicyButton()) exploitePolicy(s);  // Left button (NXT_io.h)

    if (pauseResumeButton())                        // Center button (NXT_io.h)
    {
      Wait(1000);
      executeAction(INITIAL_POLICY);                // INITIAL_POLICY: wheels stopped
      pauseNXT();
    }
  } // --------------------------- Main loop ending ---------------------------

  executeAction(INITIAL_POLICY);                    // Stop the robot
  NXTsaveData(memoryNeeded);

  PlayToneEx(E5,HALF,VOL,FALSE); Wait(HALF);        // Ending notification
  PlayToneEx(G5,HALF,VOL,FALSE); Wait(HALF);
  PlayToneEx(B5,HALF,VOL,FALSE); Wait(HALF);

  while(true)
  {
    showMessage("Finished. Exit:Orange Button, Show learned: Left button",
                5, false);
    PlayToneEx(B5,SIXTEENTH,VOL,FALSE);
    Wait(300);
    if (exploitePolicyButton())
      exploitePolicy(s);
    if (pauseResumeButton())
      break;
  }
}

C.4 final learning.nxc (wander-3)

This is the code implementing the learning process for the wander-3 task (referred to as Final Task in the code).
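For reference, the Q-matrix update in the main loop of this listing (and of the previous one) is the standard Q-learning rule implemented in fixed-point arithmetic with FP = 10000, i.e. four decimal places. Dividing out the FP scaling, the line Q[s][a] = ((FP-alpha)*Q[s][a])/FP + (alpha*(R+(GAMMA*V[sp])/FP))/FP corresponds to

\[
Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) \;+\; \alpha\,\bigl(R + \gamma\,V(s')\bigr),
\qquad V(s') = \max_{a'} Q(s',a'),
\]

with \(\alpha = 0.02\) (INITIAL_ALPHA) and \(\gamma = 0.9\) (GAMMA) in this listing; V and Policy are kept consistent with Q by the small maximization loop that follows the update.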


/*------------------------------------------------------------------------------
  Task Objective: Wander evading obstacles.
  States:  16, based on 2 contacts and 4 distance ranges of the sonar
  Actions: 9: 3 actions for each of the 2 servomotors (stop, move fwd, move back)
------------------------------------------------------------------------------*/

/*------------------------------------------------------------------------
  (NXT Port = connected device)
  Sensors:   2 = Right bumper, 3 = Left bumper, 4 = Ultrasonic
  Actuators: B = Right wheel motor, C = Left wheel motor
------------------------------------------------------------------------*/

/* References ----------------------------------------------------------------*/
#include "NXT_io.h"

/* FileSystem parameters */
#define NAME "Final Task NXT"
#define FILE "final.log"           // used in 'NXT_io.h'

/* Constants definition ------------------------------------------------------*/
#define FP 10000                   // FIXED POINT: 10000 = 4 decimal places

/* Q-learning algorithm parameters */
#define N_STATES 16                // from 1 to 16 (0 unused)
#define N_ACTIONS 9                // from 1 to 9 (0 unused)
#define INITIAL_POLICY 1
#define GAMMA (FP*90)/100          // FIXED POINT (0.9)
#define INITIAL_ALPHA (FP*2)/100   // FIXED POINT (0.02)

// NOTE: Maximum value allowed in long variables: 2147483647. The chosen FP,
// GAMMA and R will never cause overflow in the robot

/* Experiment parameters */
#define STEP_TIME 250              // (ms)
#define N_STEPS 15000              // Experiment time: >60 minutes

/* Global variables ----------------------------------------------------------*/
// Local main arrays are defined here to avoid exceeding NXC limitations
byte Policy[N_STATES+1];                    // Current policy
long V[N_STATES+1];                         // Value function
long Q[N_STATES+1][N_ACTIONS+1];            // Q-matrix
long step;                                  // Current step
int exploration[N_STATES+1][N_ACTIONS+1];   // Exploration matrix

// Auxiliary variables (for analysis purposes only)
bool aux_learned;       // True: optimal policy learned
long aux_step;          // Last step in which the policy changed into the optimal policy
string policyString;
string display;

/* Functions prototypes ------------------------------------------------------*/
/* Note: Older Bricx Command Center versions will throw an error here.
   In that case, erasing this section will solve the problem.
   The NBC 1.2.1.r4 compiler under Linux works fine in any case */

long getMemoryNeeded(long explicitMemory, byte nstates, byte nactions, long nsteps);
void NXTsaveData(long memoryNeeded);
byte selectAction(byte s);        // Exploitation/exploration strategy
void executeAction(byte a);


byte observeState(void);
long obtainReward(byte s, byte a, byte sp);
void exploitePolicy(byte s);

/* Functions definitions------------------------------------------------------*/

long getMemoryNeeded(long explicitMemory, byte nstates, byte nactions, long nsteps)
{
  // Input:  if explicitMemory==0, nstates, nactions and nsteps are used
  //         to obtain memoryNeeded
  // Output: NXT memory needed

  long memoryNeeded;   // restored declaration (needed to compile; present in the previous listing)

  if(explicitMemory == 0)  // when memoryNeeded is not explicitly defined
  {
    memoryNeeded = 100;                                   // Available for string comments: %..., %...
    memoryNeeded += (nstates+1)*(5+1)+2;                  // Policy row
    memoryNeeded += (nstates+1)*(10+1)+2;                 // Value function row
    memoryNeeded += ((nstates+1)*(10+1)+2)*(nactions+1);  // Q matrix rows
    memoryNeeded += ((nstates+1)*(5+1)+2)*(nactions+1);   // Q explor. rows
  }
  else memoryNeeded = explicitMemory;

  return memoryNeeded;
}

//------------------------------------------------------------------------------
void NXTsaveData(long memoryNeeded)
{
  byte handle, res;
  unsigned long memoryUsed, size;
  string smu, sti, lin;
  byte s, a;
  string val;

  res = CreateFile(FILE, memoryNeeded, handle);
  if (res != LDR_SUCCESS)
  {
    showMessage("Error creating file!", 3, true);
    PlayToneEx(F5,HALF,VOL,FALSE); Wait(HALF);
    PlayToneEx(C5,HALF,VOL,FALSE); Wait(HALF);
    Stop(true);
  }

  memoryUsed = 0;

  // 0 - Save name & step
  lin = StrCat("% ",NAME);
  WriteLnString(handle,lin,size);
  memoryUsed += size;

  lin = StrCat("NumberOfSteps=");
  val = NumToStr(step);
  lin += StrCat(val,";");
  WriteLnString(handle,lin,size);
  memoryUsed += size;

  // 1 - Save policy
  lin = "Policy = [";
  for (s = 1; s < N_STATES+1; s++)
  {
    val = NumToStr(Policy[s]);
    lin += StrCat(val," ");
  }
  lin += "];";
  WriteLnString(handle,lin,size);
  memoryUsed += size;

  // 2 - Save Value function
  lin = "V = [";
  for (s = 1; s < N_STATES+1; s++)
  {
    val = NumToStr(V[s]);
    lin += StrCat(val," ");
  }
  lin += "];";
  WriteLnString(handle,lin,size);
  memoryUsed += size;

  // 3 - Save Q-matrix
  lin = "Q = [";
  WriteLnString(handle,lin,size);
  memoryUsed += size;
  for (s = 1; s < N_STATES+1; s++)
  {
    lin = "";
    for (a = 1; a < N_ACTIONS+1; a++)
    {
      val = NumToStr(Q[s][a]);
      lin += StrCat(val," ");
    }
    WriteLnString(handle,lin,size);
    memoryUsed += size;
  }
  lin = "];";
  WriteLnString(handle,lin,size);
  memoryUsed += size;

  // 4 - Save exploration matrix
  lin = "exploration = [";
  WriteLnString(handle,lin,size);
  memoryUsed += size;
  for (s = 1; s < N_STATES+1; s++)
  {
    lin = "";
    for (a = 1; a < N_ACTIONS+1; a++)
    {
      val = NumToStr(exploration[s][a]);
      lin += StrCat(val," ");
    }
    WriteLnString(handle,lin,size);
    memoryUsed += size;
  }
  lin = "];";
  WriteLnString(handle,lin,size);
  memoryUsed += size;

  // 5 - Save last step in which the policy changed into the optimal policy
  //     (not needed in this task)
  if(aux_learned == true)
  {
    lin = "% Optimal Policy learned in step: ";
    val = NumToStr(aux_step);
    lin += val;
    WriteLnString(handle,lin,size);
    memoryUsed += size;
  }
  else
  {
    lin = "%No optimal policy learned ";
    WriteLnString(handle,lin,size);
    memoryUsed += size;
  }

  CloseFile(handle);
  lin = StrCat("bytes:",NumToStr(memoryUsed));
  showMessage(lin, 3, false);
}

// -----------------------------------------------------------------------------
byte selectAction (byte s)
// Input:  Current state
// Output: Action selected
{
  byte selectedAction;
  byte i;

  if(Random(100)<70)   // e-Greedy
  {
    // 70% probability of exploiting the current learned policy
    selectedAction = Policy[s];
  }
  else
  {
    // 30% of exploring
    if(Random(100)<60)
    {
      // Improvement to simple e-Greedy: it enhances exploring actions possibly
      // not visited often enough. When exploring, there is a 60% probability of
      // selecting the least explored action for the current state.
      selectedAction = 1;
      for(i=2; i<N_ACTIONS+1; i++)
      {
        if (exploration[s][i] < exploration[s][selectedAction])
          // exploration[state][action] is the exploration matrix used to count
          // the number of times that the cell Q[s][a] has been explored
          // (and thus updated)
          selectedAction = i;
      }
    }
    else
      // When exploring, there is a 40% probability of selecting a random action
      selectedAction = Random(N_ACTIONS) + 1;  // random action in 1..N_ACTIONS
  }
  return(selectedAction);
}

//------------------------------------------------------------------------------
void executeAction(byte a)
// Input: selected action
{
  switch(a)
  {
    case 1: executeNXT("stop");          break;
    case 2: executeNXT("turnLeft");      break;
    case 3: executeNXT("turnRight");     break;
    case 4: executeNXT("forward");       break;
    case 5: executeNXT("forwardLeft");   break;
    case 6: executeNXT("forwardRight");  break;
    case 7: executeNXT("backLeft");      break;
    case 8: executeNXT("backRight");     break;
    case 9: executeNXT("back");          break;
  }
}

//------------------------------------------------------------------------------
byte observeState(void)   // Needs to be updated when N_STATES is modified
{
  // Returns the state of the robot by encoding the information measured from
  // the ultrasonic and the contact sensors. In case the number of states
  // or their definitions change, this function must be updated.

  // States discretization:
  //   s1:  left_contact=0, right_contact=0, distance_obstacle<=25cm
  //   s2:  left_contact=1, right_contact=0, distance_obstacle<=25cm
  //   s3:  left_contact=0, right_contact=1, distance_obstacle<=25cm
  //   s4:  left_contact=1, right_contact=1, distance_obstacle<=25cm

  //   s5:  left_contact=0, right_contact=0, distance_obstacle<=50cm
  //   s6:  left_contact=1, right_contact=0, distance_obstacle<=50cm
  //   s7:  left_contact=0, right_contact=1, distance_obstacle<=50cm
  //   s8:  left_contact=1, right_contact=1, distance_obstacle<=50cm

  //   s9:  left_contact=0, right_contact=0, distance_obstacle<=75cm
  //   s10: left_contact=1, right_contact=0, distance_obstacle<=75cm
  //   s11: left_contact=0, right_contact=1, distance_obstacle<=75cm
  //   s12: left_contact=1, right_contact=1, distance_obstacle<=75cm

  //   s13: left_contact=0, right_contact=0, distance_obstacle>75cm
  //   s14: left_contact=1, right_contact=0, distance_obstacle>75cm
  //   s15: left_contact=0, right_contact=1, distance_obstacle>75cm
  //   s16: left_contact=1, right_contact=1, distance_obstacle>75cm

  byte state;
  byte sonarDistance;
  byte sonarState;

  sonarDistance = getSonarDistance();

  // DISTANCE_0, DISTANCE_1 and DISTANCE_2 are defined in 'NXT_io.h'
  if(sonarDistance <= DISTANCE_0)       sonarState = 0;
  else if (sonarDistance <= DISTANCE_1) sonarState = 1;
  else if (sonarDistance <= DISTANCE_2) sonarState = 2;
  else                                  sonarState = 3;

  state = 1 + sonarState*SONAR_STATES + (LEFT_BUMPER + 2 * RIGHT_BUMPER);
  // LEFT_BUMPER and RIGHT_BUMPER defined in 'NXT_io.h'

  return(state);
}

// -----------------------------------------------------------------------------
long obtainReward(byte s, byte a, byte sp)
{
  // Input:  s, a, s' are not used directly here, since we look at the motion of
  //         the wheels and at the sensors (that would be a and s'). Encoders and
  //         contact sensors result in a better function R(s,a,s')
  // Output: Obtained reward

  long R;
  long treshold;

  // Alternative: using MotorBlockTachoCount()
  // Reference: 1 second at SPEED 100 gives above 700 degrees (full battery)
  treshold = 30;  // Valid for [40 < SPEED < 80]
  R = 0;

  if(MotorRotationCount(L_WHEEL) > treshold &&
     MotorRotationCount(R_WHEEL) > treshold)      R = FP;       // R =  1
  else if(MotorRotationCount(R_WHEEL) > treshold) R = FP / 10;  // R =  0.1
  else if(MotorRotationCount(L_WHEEL) > treshold) R = FP / 10;  // R =  0.1

  if(LEFT_BUMPER || RIGHT_BUMPER)
  {
    if(LEFT_BUMPER && RIGHT_BUMPER) R = -FP/2 - FP/5;  // R = -0.7
    else                            R = -FP/5;         // R = -0.2
  }

  ResetRotationCount(L_WHEEL);
  ResetRotationCount(R_WHEEL);

  return(R);
}

// -----------------------------------------------------------------------------
void exploitePolicy(byte s)
{
  /* Auxiliary: for analysis purposes only. When the left button of the NXT brick
     is pressed, the learning process breaks into an exploitation loop. No
     Q-matrix value or step is modified. When the left button is pressed again,
     the learning process continues at the same point as before */

  int a;
  int sp;

  ClearScreen();
  showMessage("Exploiting the learned policy",1,false);
  Wait(500);
  while(!exploitePolicyButton())
  {
    a = Policy[s];                  // Exploitation
    executeAction(a);
    Wait(STEP_TIME);
    sp = observeState();
    s = sp;                         // update state
  }
  ClearScreen();
  executeAction(INITIAL_POLICY);    // stop the robot
}

// ----------------------------------- Main function ----------------------------
task main()
{
  /* Local variables */
  byte s, a, sp;        // (s,a,s')
  long R;               // Obtained reward [FIXED POINT]
  long alpha;           // Learning rate   [FIXED POINT]
  byte i, j, k;

  long explicitMemory;  // Auxiliary: defines the memory needed for NXT file saving
  long memoryNeeded;

  // Start notification
  PlayToneEx(E5,QUARTER,VOL,FALSE); Wait(HALF);
  PlayToneEx(E5,QUARTER,VOL,FALSE); Wait(HALF);

  aux_learned = false;   // auxiliary for further analysis
  explicitMemory = 0;    // 0: memoryNeeded will be calculated in getMemoryNeeded()
  memoryNeeded = getMemoryNeeded(explicitMemory, N_STATES, N_ACTIONS, N_STEPS);
  Stop(!NXTcheckMemory(memoryNeeded, FILE));

  ResetRotationCount(L_WHEEL);
  ResetRotationCount(R_WHEEL);
  NXT_mapSensors();
  display = StrCat(NAME," started ");
  showMessage(display, 3, true);

  /* Initial state */
  for(i=1; i<N_STATES+1; i++)     // Q[0][x] & Q[x][0] unused
  {
    V[i] = 0;
    Policy[i] = INITIAL_POLICY;
    for(j=1; j<N_ACTIONS+1; j++)
    {
      exploration[i][j] = 0;
      Q[i][j] = 0;
    }
  }
  alpha = INITIAL_ALPHA;
  s = observeState();

  // Q-learning algorithm. Main loop ------------------------------------------
  for(step=1; step<N_STEPS+1; step++)
  {
    PlayToneEx(E5,EIGHTH,VOL,FALSE);

    a = selectAction(s);          // Exploitation/exploration strategy
    executeAction(a);
    Wait(STEP_TIME);
    sp = observeState();
    R = obtainReward(s,a,sp);
    if (exploration[s][a]<INT_MAX)
      exploration[s][a]++;

    // Update Q-matrix
    Q[s][a] = ((FP-alpha)*Q[s][a])/FP + (alpha*(R+(GAMMA*V[sp])/FP))/FP;

    // Update V and Policy
    V[s] = Q[s][1];
    Policy[s] = 1;
    for(i=2; i<N_ACTIONS+1; i++)
    {
      if(Q[s][i] > V[s])
      {
        V[s] = Q[s][i];
        Policy[s] = i;
      }
    }
    s = sp;                       // Update state

    // Display information
    ClearScreen();
    showMessage(NAME, 0, false);
    display = StrCat("STEP: ",NumToStr(step));
    showMessage(display, 1, false);
    showMessage("Policy:", 2, false);
    policyString = "";
    display = "";
    for(i=1; i<N_STATES+1; i++)
    {
      display += StrCat("s",NumToStr(i),"a",NumToStr(Policy[i])," ");
      policyString += StrCat(NumToStr(Policy[i])," ");
    }
    showMessage(display, 3, false);

    // User events
    if (saveAndStopButton()) break;                 // Right button (NXT_io.h)

    if (exploitePolicyButton()) exploitePolicy(s);  // Left button (NXT_io.h)

    if (pauseResumeButton())                        // Center button (NXT_io.h)
    {
      Wait(1000);
      executeAction(INITIAL_POLICY);                // INITIAL_POLICY: wheels stopped
      pauseNXT();
    }
  } // --------------------------- Main loop ending ---------------------------

  executeAction(INITIAL_POLICY);                    // Stop the robot
  NXTsaveData(memoryNeeded);

  PlayToneEx(E5,HALF,VOL,FALSE);                    // Ending notification


  Wait(HALF);
  PlayToneEx(G5,HALF,VOL,FALSE); Wait(HALF);
  PlayToneEx(B5,HALF,VOL,FALSE); Wait(HALF);

  while(true)
  {
    showMessage("Finished. Exit:Orange Button, Show learned: Left button",
                5, false);
    PlayToneEx(B5,SIXTEENTH,VOL,FALSE);
    Wait(300);
    if (exploitePolicyButton())
      exploitePolicy(s);
    if (pauseResumeButton())
      break;
  }
}

C.5 simpleTaskDataCollector.nxc (wander-1)

Finally, this section of the appendix shows the source code implemented for the robot to collect information from the wander-1 task (referred to as SimpleTask in the code). This program has been used for the modeling purposes described in Section 3.3.2.
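The program below only counts transitions: results[s][a][s'] stores how many times the outcome s' was observed after executing action a in state s during purely random exploration. As a minimal post-processing sketch (performed off-line, outside this listing; the exact estimator used in Section 3.3.2 may differ), these counts can be normalized into a frequentist estimate of the transition model of the MDP,

\[
\hat{T}(s' \mid s, a) \;=\; \frac{\mathrm{results}[s][a][s']}{\sum_{s''} \mathrm{results}[s][a][s'']},
\]

which is well defined whenever the state-action pair (s,a) has been visited at least once, i.e. exploration[s][a] > 0.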

/*------------------------------------------------------------------------------
  Task Objective: Wander evading obstacles after touching them.
  States:  s1: No contact,           s2: Left bumper contact,
           s3: Right bumper contact, s4: Both contacts
  Actions: a1: Stop, a2: Left-turn, a3: Right-turn, a4: Move forward
------------------------------------------------------------------------------*/

/*------------------------------------------------------------------------
  (NXT Port = connected device)
  Sensors:   2 = Right bumper, 3 = Left bumper, 4 = Ultrasonic (not used here)
  Actuators: B = Right wheel motor, C = Left wheel motor
------------------------------------------------------------------------*/

/* References ----------------------------------------------------------------*/
#include "NXT_io.h"

/* FileSystem parameters */
#define FILE "modelData.m"

/* Constants definition ------------------------------------------------------*/

/* Q-learning algorithm parameters */
#define N_STATES 4        // from 1 to 4 (0 unused)
#define N_ACTIONS 4       // from 1 to 4 (0 unused)
#define INITIAL_POLICY 1  // stopped

/* Experiment parameters */
#define STEP_TIME 250     // (ms)
#define N_STEPS 7200      // Total collector time: 7200*0.250s = 1800s (30')

/* Global variables ----------------------------------------------------------*/


int exploration[N_STATES+1][N_ACTIONS+1];          // Exploration matrix
int results[N_STATES+1][N_ACTIONS+1][N_STATES+1];  // Collector results matrix

/* Functions prototypes ------------------------------------------------------*/

long getMemoryNeeded(long explicitMemory, byte nstates, byte nactions, long nsteps);
void NXTsaveData(long memoryNeeded);
byte selectAction(byte s);   // Exploration only
void executeAction(byte a);
byte observeState(void);

/* Functions definitions------------------------------------------------------*/

void NXTsaveData(long memoryNeeded)
{
  byte handle, res;
  unsigned long memoryUsed, size;
  string smu, sti, lin;
  byte s, a, sp;
  string val;

  res = CreateFile(FILE, memoryNeeded, handle);
  if (res != LDR_SUCCESS)
  {
    showMessage("Error creating file!", true);
    PlayToneEx(F5,HALF,VOL,FALSE); Wait(HALF);
    PlayToneEx(C5,HALF,VOL,FALSE); Wait(HALF);
    Stop(true);
  }

  memoryUsed = 0;

  // 1 - Save the resulting frequentist matrix T(s,a,s')
  for (s = 1; s < N_STATES+1; s++)
  {
    lin = StrCat("T(",NumToStr(s),",:,:) = [");
    WriteLnString(handle,lin,size);
    memoryUsed += size;
    for (a = 1; a < N_ACTIONS+1; a++)
    {
      lin = "";
      for (sp = 1; sp < N_STATES+1; sp++)
      {
        val = NumToStr(results[s][a][sp]);
        lin += StrCat(val," ");
      }
      WriteLnString(handle,lin,size);
      memoryUsed += size;
    }
    lin = "];";
    WriteLnString(handle,lin,size);
  }
  memoryUsed += size;
  CloseFile(handle);
  lin = StrCat("bytes:",NumToStr(memoryUsed));
  showMessage(lin, true);
}


// -----------------------------------------------------------------------------
byte selectAction (byte s)
// Input:  Current state
// Output: Selected action
{
  byte selectedAction;
  selectedAction = Random(N_ACTIONS) + 1;  // Random exploration (1, 2, 3 or 4)
  return(selectedAction);
}

//------------------------------------------------------------------------------
void executeAction(byte a)
// Input: selected action
{
  switch(a)
  {
    case 1: executeNXT("stop");       break;   // defined in 'NXT_io.h'
    case 2: executeNXT("turnLeft");   break;
    case 3: executeNXT("turnRight");  break;
    case 4: executeNXT("forward");    break;
  }
}

//------------------------------------------------------------------------------
byte observeState(void)
{
  // Output: state reached
  byte state;

  state = 1 + (LEFT_BUMPER + 2 * RIGHT_BUMPER);  // defined in 'NXT_io.h'
  // States discretization:
  //   s1: left_contact=0, right_contact=0
  //   s2: left_contact=1, right_contact=0
  //   s3: left_contact=0, right_contact=1
  //   s4: left_contact=1, right_contact=1

  return(state);
}

// ----------------------------------- Main function ----------------------------
task main()
{
  /* Local variables */
  byte s, a, sp;     // (s,a,s')
  long step;         // Current step
  byte i, j, k;
  long memoryNeeded;

  memoryNeeded = (N_STATES+1)*(N_STATES+1)*(N_ACTIONS+1)*(5+1) + (N_STATES+1)*4;
  Stop(!NXTcheckMemory(memoryNeeded, FILE));
  NXT_mapSensors();

  /* Initial state */
  for(i=1; i<N_STATES+1; i++)     // index 0 unused in all the arrays
  {
    for(j=1; j<N_ACTIONS+1; j++)
    {
      exploration[i][j] = 0;
      for(k=1; k<N_STATES+1; k++)
      {
        results[i][j][k] = 0;
      }
    }
  }

  s = observeState();

  // Data collector main loop ------------------------------------------
  for(step=1; step<N_STEPS+1; step++)
  {
    a = selectAction(s);            // exploration strategy
    executeAction(a);
    Wait(STEP_TIME);
    sp = observeState();
    if (exploration[s][a]<INT_MAX) exploration[s][a]++;   // INT_MAX = 32767
    if (results[s][a][sp]<INT_MAX) results[s][a][sp]++;
    s = sp;                         // update state
    if (saveAndStopButton()) break; // (right button of the NXT brick)
  }

  executeAction(1);                 // Stop the robot
  NXTsaveData(memoryNeeded);

  PlayToneEx(E5,HALF,VOL,FALSE);    // Ending notification
  Wait(HALF);
  PlayToneEx(A5,HALF,VOL,FALSE);
  Wait(HALF);
}


Bibliography

[1] M. Lungarella, G. Metta, R. Pfeifer, and G. Sandini. Developmental robotics: a survey. Connection Science, 15:151–190, 2003.

[2] M. Wiering and M. van Otterlo. Reinforcement Learning: State-of-the-Art. Adaptation, Learning, and Optimization. Springer, 2012.

[3] Y. Song, Y. Li, C. Li, and G. Zhang. An efficient initialization approach of Q-learning for mobile robots. International Journal of Control, Automation and Systems, 10:166–172, 2012.

[4] C. Gaskett. Learning for Robot Control. Ph.D. Thesis, Research School of Information Sciences and Engineering, Department of Systems Engineering, The Australian National University, 2002.

[5] J.A. Fernandez-Madrigal, A. Cruz-Martín, C. Galindo, and J. Gonzalez-Jimenez. Heterogeneity as a Corner Stone of Software Development in Robotics, chapter 2, pages 13–22. Software Engineering and Development. Nova Publishers, September 2009.

[6] A.M. Turing. Intelligent machinery. Report AMT/C/11, The National Physical Laboratory, 1948.

[7] L.P. Kaelbling, M.L. Littman, and A.W. Moore. Reinforcement learning: A survey. J. Artif. Intell. Res. (JAIR), 4:237–285, 1996.

[8] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning Series. MIT Press, 1998.

[9] C. Watkins. Learning from Delayed Rewards. Ph.D. Thesis, King's College, Cambridge, England, 1989.

[10] I. Millington. Artificial Intelligence for Games, chapter 7. Elsevier Science, 2006.

[11] D. Dong, C. Chen, J. Chu, and T.J. Tarn. Robust quantum-inspired reinforcement learning for robot navigation. IEEE/ASME Transactions on Mechatronics, 17(1):86–97, February 2012.

[12] M.A. Jaradat, M. Al-Rousan, and L. Quadan. Reinforcement based mobile robot navigation in dynamic environment. Robotics and Computer-Integrated Manufacturing, 27:135–149, 2011.

[13] G.J. Ferrer. Encoding robotic sensor states for Q-learning using the self-organizing map. J. Comput. Sci. Coll., 25(5):133–139, May 2010.

[14] V.R. Cruz-Alvarez, E. Hidalgo-Pena, and H.G. Acosta-Mesa. A line follower robot implementation using Lego's Mindstorms kit and Q-learning. Acta Universitaria, 22(0), 2012.


[15] P. Vamplew. Lego Mindstorms robots as a platform for teaching reinforcement learning. In International Conference on Artificial Intelligence in Science and Technology, pages 21–25, 2004.

[16] C.A. Kroustis and M.C. Casey. Combining heuristics and Q-learning in an adaptive light-seeking robot. Technical Report CS-08-01, University of Surrey, Department of Computing, 2008.

[17] D. Dos-Santos, R. Penalver, and W. Pereira. Autonomous navigation robotic system to recognize irregular patterns. In ICINCO (2), pages 295–301, 2004.

[18] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. J. Mach. Learn. Res., 5:1–25, December 2004.

[19] K.P. Murphy. Dynamic Bayesian networks. Probabilistic Graphical Models, M. Jordan, 2002.

[20] J. Hansen. Not eXactly C (NXC) Programmer's Guide. Version 1.0.1 b33, 2007.

[21] D.P. Bertsekas and J.N. Tsitsiklis. Neuro-dynamic Programming. Athena Scientific, Belmont, MA, 1996.

[22] A.G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77, 2003.
