The University of New South Wales
School of Computer Science and Engineering
Hierarchical Reinforcement Learning: A Hybrid Approach
Malcolm Ross Kinsella Ryan
A Thesis submitted as a requirement for the Degree of
Doctor of Philosophy
September 2002
Supervisor: Prof. Claude Sammut
Abstract
In this thesis we investigate the relationships between the symbolic and sub-
symbolic methods used for controlling agents by artificial intelligence, focusing
in particular on methods that learn. In light of the strengths and weaknesses of
each approach, we propose a hybridisation of symbolic and subsymbolic methods
to capitalise on the best features of each.
We implement such a hybrid system, called Rachel, which incorporates
techniques from Teleo-Reactive Planning, Hierarchical Reinforcement Learning
and Inductive Logic Programming. Rachel uses a novel representation of be-
haviours, Reinforcement-Learnt Teleo-operators (RL-Tops), which defines the
behaviour in terms of its desired consequences but leaves the implementation of
the policy to be learnt by reinforcement learning. An RL-Top is an abstract,
symbolic description of the purpose of a behaviour, and is used by Rachel both
as a planning operator and as the definition of a reward function by which the
behaviour can be learnt.
Two new hierarchical reinforcement learning algorithms are introduced: Planned
Hierarchical Semi-Markov Q-Learning (P-HSMQ) and Teleo-Reactive Q-Learning
(TRQ). The former is an extension of the Hierarchical Semi-Markov Q-Learning
algorithm to use computer-generated plans in place of task-hierarchies (which
are commonly provided by the trainer). The latter is a further elaboration of
the algorithm to include more intelligent behaviour termination. The knowl-
edge contained in the plan is used to determine when an executing behaviour
is no longer appropriate, and can be prematurely terminated, resulting in more
efficient policies.
Incomplete descriptions of the effects of behaviours can lead the planner
to make false assumptions in building plans. Because behaviours are learnt
rather than hand-coded, not every effect of an action can be known in advance. Rachel
implements a “reflector” which monitors for such unexpected and unwanted side-
effects. Using ILP it learns to predict when they will occur, and so repair its
plans to avoid them.
Together, the components of Rachel form a learning system which is able
to receive abstract descriptions of behaviours, build plans to discover which of
them may be useful to achieve its goals, learn concrete policies and optimal
choices of behaviour through trial and error, discover and predict any unwanted
side-effects that result and repair its plans to avoid them. It is a demonstration
that different approaches to AI, symbolic and sub-symbolic, can be elegantly
combined into a single agent architecture.
Declaration
I hereby declare that this submission is my own work and to the best of my knowledge it contains no material previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis.
I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Malcolm Ross Kinsella Ryan
Contents
1 Introduction
  1.1 Picture this:
  1.2 Human intelligence: A Cognitive Model
    1.2.1 Types of Knowledge
    1.2.2 Interaction between the types
  1.3 Declarative and Procedural Knowledge in Artificial Intelligence
    1.3.1 Declarative and Procedural Knowledge in Control
  1.4 A Hybrid approach
    1.4.1 Putting declarative knowledge into reinforcement learning
    1.4.2 Putting procedural knowledge into symbolic planning
    1.4.3 Getting declarative knowledge out of procedural learning
    1.4.4 Rachel: A hybrid planning/reinforcement learning system
  1.5 Contributions of this thesis
  1.6 Overview

2 Background - Reinforcement Learning
  2.1 Reinforcement Learning
    2.1.1 The Reinforcement Learning Model
    2.1.2 Markov Decision Processes
    2.1.3 Q-Learning
    2.1.4 The curse of dimensionality
  2.2 Hierarchical Reinforcement Learning in Theory
    2.2.1 A Motivating Example
    2.2.2 Limiting the Agent’s Choices
    2.2.3 Providing Local Goals
    2.2.4 Semi-Markov Decision Processes: A Theoretical Framework
    2.2.5 Learning behaviours
  2.3 Hierarchical Reinforcement Learning in Practice
    2.3.1 Semi-Markov Q-Learning
    2.3.2 Hierarchical Semi-Markov Q-Learning
    2.3.3 MAXQ-Q
    2.3.4 Q-Learning with Hierarchies of Abstract Machines
  2.4 Termination Improvement
  2.5 Producing the hierarchy
  2.6 Other work
    2.6.1 Model-based Reinforcement Learning
    2.6.2 Other hybrid learning algorithms
  2.7 Reinforcement learning in this thesis
  2.8 Summary

3 Background - Symbolic Planning
  3.1 The symbolic planning model
  3.2 Building Plans
    3.2.1 The Strips representation
    3.2.2 Means-ends planning
    3.2.3 Extensions to the Strips representation
  3.3 Handling Incomplete Action Models
  3.4 Learning Action Models
    3.4.1 Generating data
    3.4.2 What is learnt
    3.4.3 How to learn
  3.5 Inductive Logic Programming
  3.6 Other related work
    3.6.1 Explanation Based Learning
  3.7 Planning in this thesis
  3.8 Summary

4 A Hybrid Representation
  4.1 Representing states
    4.1.1 Instruments: Representing primitive state
    4.1.2 Fluents: Representing abstract state
  4.2 Representing goals
  4.3 Representing actions
    4.3.1 Reinforcement-Learnt Teleo-operators
    4.3.2 State abstraction
    4.3.3 Parameterised Behaviours
    4.3.4 Hierarchies of behaviours
  4.4 Summary

5 Rachel: Planning and Acting
  5.1 The Planner
    5.1.1 Semi-universal planning
    5.1.2 Variable binding
    5.1.3 The planning algorithm
    5.1.4 Computational complexity
  5.2 The Actor
    5.2.1 Planned Hierarchical Semi-Markov Q-Learning
    5.2.2 Termination Improvement
    5.2.3 Teleo-Reactive Q-Learning
  5.3 Multiple levels of hierarchy
    5.3.1 Hierarchical planning
    5.3.2 Hierarchical learning: P-HSMQ
    5.3.3 Hierarchical learning: TRQ
  5.4 Summary

6 Rachel: Reflection
  6.1 The Frame Assumption
  6.2 Example Domain - Taxi-world
  6.3 Detecting and identifying side-effects
    6.3.1 Plan Execution Failures
    6.3.2 Diagnosing the failure
  6.4 Gathering examples
  6.5 Inducing a description
    6.5.1 Input to Aleph
    6.5.2 Modifications to Aleph
    6.5.3 Output from Aleph
    6.5.4 Adding Incrementality
  6.6 Incorporating side-effects into plans
    6.6.1 Exploratory planning
  6.7 Summary

7 Experiment Results
  7.1 Experiments in the gridworld domain
    7.1.1 Domain description
    7.1.2 Experiment 1: Planning vs HSMQ vs P-HSMQ
    7.1.3 Experiment 2: P-HSMQ vs TRQ
    7.1.4 Experiment 3: The effect of the η parameter
    7.1.5 Discussion of the gridworld experiments
  7.2 Experiments in the taxi-car domain
    7.2.1 Domain description
    7.2.2 Experiment 4: Reflection
    7.2.3 Experiment 5: Second-order side-effects
    7.2.4 The bigger taxi world
    7.2.5 Experiment 6: The effect of the training set size
    7.2.6 Experiment 7: The effect of the pool size
    7.2.7 Experiment 8: The effect of exploratory planning
    7.2.8 Discussion of the taxi-car experiments
  7.3 Experiments in the soccer domain
    7.3.1 Domain description
    7.3.2 Related work
    7.3.3 Domain representation
    7.3.4 Experiment 9: HSMQ vs P-HSMQ vs TRQ
    7.3.5 Experiment 10: Learning primitive policies
    7.3.6 Experiment 11: Reflection
    7.3.7 Discussion of the soccer experiments
  7.4 Summary

8 Conclusions and Future Work
  8.1 Summary of Rachel
  8.2 Summary of Experimental results
  8.3 Future Work
    8.3.1 Better Planning
    8.3.2 Better Acting and Learning
    8.3.3 Better Reflecting
  8.4 Conclusion

References
List of Figures
1.1 The three parts of the Rachel architecture.
2.1 An illustration of a reinforcement-learning agent.
2.2 An example world
2.3 Two different internal policies for the behaviour Go(hall, bedroom2).
2.4 A simple navigation task illustrating the advantage of termination improvement.
3.1 An illustration of a planning agent.
3.2 The example world again
3.3 A plan for fetching the coffee.
3.4 A universal plan for fetching the coffee.
5.1 The three parts of the Rachel architecture.
5.2 Two linear plans to fetch both the book and the coffee.
5.3 A semi-universal plan to fetch both the book and the coffee.
5.4 The example world with a bump.
5.5 A plan for fetching the coffee and the book.
5.6 A plan for fetching either the coffee or the book.
5.7 A narrow bridge over a chasm.
6.1 The Taxi-Car Domain.
6.2 A plan for the Deliver behaviour in the Taxi world.
6.3 A plan for the Deliver behaviour which avoids running out of fuel.
6.4 A plan for the Deliver behaviour using exploratory planning.
7.1 The first experimental domain - the Grid-world.
7.2 A comparison of learning rates for four approaches to the grid-world problem.
7.3 Average trial lengths for Experiment 1.
7.4 A comparison of learning rates for TRQ and P-HSMQ
7.5 Final policy performance for Experiment 2.
7.6 A comparison of learning rates for TRQ with different values of η.
7.7 Convergence times for TRQ with different values of η.
7.8 The second experimental domain - The Taxi-Car.
7.9 A “naive” plan for the Deliver behaviour in the Taxi world.
7.10 A hand-crafted plan which adds refueling to the naive plan.
7.11 A comparison of learning performance in the Taxi-world.
7.12 The accuracy of maintained (old) and induced (new) hypotheses for each iteration of the reflector.
7.13 The 25 × 25 taxi-world
7.14 ILP in a noisy world.
7.15 The effect of training set size on reflection.
7.16 The effect of pool size on reflection.
7.17 The effect of exploratory planning.
7.18 The soccer domain.
7.19 Learning in the soccer-world, with hard-coded behaviours.
7.20 Learning in the soccer-world, with learnt behaviours.
7.21 Part of the plan for Shoot1(bot(1)).
8.1 A plan with subgoal splitting.
List of Tables
6.1 The instruments used in the Taxi-car domain.
6.2 Fluents used in the Taxi-car domain.
6.3 The four types of agent behaviour in the Taxi-world.
6.4 Classifying states as positive and negative examples of a side-effect.
6.5 Input to Aleph: Positive and negative examples
6.6 Input to Aleph: the background file
7.1 Instruments used in the Grid-world domain.
7.2 Fluents used in the Grid-world domain.
7.3 Behaviours available in the Grid-world.
7.4 Instruments used in the Taxi-car domain.
7.5 Fluents used in the Taxi-car domain.
7.6 Behaviours available in the Taxi-Car domain.
7.7 The success-rates of final policies learnt in the taxi-world.
7.8 The fuel factor for each reflective approach to Experiment 6.
7.9 The fuel factor for each reflective approach to Experiment 7.
7.10 Instruments used in the soccer domain.
7.11 Objects in the Soccer domain.
7.12 Fluents used in the Soccer domain.
7.13 Granularity 0 behaviours in the soccer domain.
7.14 Granularity 1 behaviours in the soccer domain.
7.15 Granularity 2 behaviours in the soccer domain.
7.16 Granularity 2 behaviours in the soccer domain, cont.
7.17 Discretisation of instruments in the soccer domain.
7.18 Progress estimators used in the soccer domain.
7.19 The side-effects detected in the soccer-world.
List of Algorithms
1 Watkin’s Q-Learning
2 SMDP Q-Learning
3 HSMQ-Learning
4 HAMQ-Learning
5 Rachel’s planning algorithm: Iterative Deepening
6 Rachel’s planning algorithm: Adding new nodes
7 Planned HSMQ-Learning
8 TRQ-Learning
9 TRQ-Learning: Persisting with a behaviour
10 Hierarchical planning: Iterative Deepening
11 Hierarchical planning: Adding new nodes
12 Planned HSMQ-Learning with multiple levels of hierarchy
13 Teleo-Reactive Q-Learning
14 Execute a behaviour
15 TRQ Update
16 Discard lessons from interrupted behaviours
17 Reflector: Detecting side-effects
18 Reflector: Trimming a condition
19 Exploratory planning: Adding new nodes
20 Exploratory planning: Adding maintenance conditions
21 Termination-improved execution of a policy learnt by P-HSMQ
Dedication
This work is dedicated to the memory of my father,
Reginald Kinsella Ryan (1932-1988)
for letting me play on the miraculous “TV Typewriter” he was building in his study. If only he could have seen how far things would come.
I love you, Dad.
Acknowledgements
This thesis would never have been possible without the behind-the-scenes effort of an enormous crowd of people. I’d really like to thank them all. First of all, I must say a big thankyou to Mum and John who housed me and fed me (especially when the money was tight), who put up with all the mood-swings and never stopped loving me. Thankyou so much. I promise I’ll get a real job and a place of my own real soon.
Thanks must also go to my supervisor, Claude Sammut, who inspired me to do a PhD, and to my mentor Mark Pendrith, who warned me not to. You probably both think I listened too much to the other, but I have learnt a lot from both of you. I really valued your advice, even if I did not always take it.
I am grateful for the friendly co-operation of the international A.I. community. It is easy to feel isolated, doing a PhD in Australia when all the action is going on overseas, but many productive email discussions with Tom Dietterich, Stuart Russell, Nils Nilsson, Ronald Parr, David Andre and Scott Benson have helped me to keep up with the action and understand how my work fits into the bigger picture. Thankyou all for your time and valuable assistance.
The UNSW AI Lab has always been a great place to study. There has always been an air of co-operation and collaboration. This work has been greatly improved through innumerable discussions with others in the lab. My thanks particularly go to: Mark Reid, Phil Preston, Jamie Westendorp, Waleed Kadous, Mark Peters, Joe Thurbon, Bernhard Hengst and Michael McGarity. We’ve also had a lot of fun together. My thanks also to all the helpful people in the Computer Services Group: Tiina Muukkonen, Zain Rahmat, Slade Matthews, Simon Bowden, and Geoff Oakley.
Doing a PhD is a real emotional roller-coaster, and I have had my fair share of joyful highs and depressing lows. I could never have made it through without the support of my brothers and sisters in Christ. My heartfelt thanks go to all of you: to Ross Salter who was brave enough to show me his attempts at Pascal, to Sarah Davis for many drive-home conversations, to Liz Anthony for daring to be real, to my beloved “little sisters” Vickie and Su Williams (now Deenick and Nguyen) for unstinting love and hospitality, to my one-time housemates Dave Hore and Alex Zunica for putting up with my midnight cookery, to Christina Cook for asking tough questions, to Rowan Kemp for being a math-geek with a gift for story-telling, and to Rob and Jude Graves and Wendy Bebbington for recognising and encouraging my feral/hippy tendencies. Especial thanks need to go to Andrew Cameron and Anthony Sell for helping me through my darkest hours. I love you all dearly and praise God that he gave me each of you.
I have one word of advice that every PhD student ought to hear: Sit up straight. If I have learnt nothing else from my experience as a PhD student, it is this. Take care of your back, and make sure you have a friendly physiotherapist. I have two: Philip Richardson and Peter Hunt, and I am very grateful to both of them.
The following people deserve a more dubious kind of acknowledgement. They have provided many hours of pleasurable distraction, when I probably should have been working on my thesis. The culprits are: the UNSW Circus Society (particularly Mark Aiken, Kim Isaacs, Brock Misfud, Lucy Young and Bek Earnshaw), the authors of INTERCAL (Don Woods, James Lyon, Louis Howell and Eric S. Raymond), the various members of Agora Nomic past and present (particularly my long-standing fellow conspirator Steve Gardner), the authors of Piled Higher and Deeper1, and the entire Nethack DevTeam. Without your “help” my time as a PhD student might have been a lot shorter, but also a lot duller.
Finally I must acknowledge my Father God and my Lord Jesus Christ, for making this astonishing world in which to live, and this astonishing creature that I call myself. Every day I marvel at your power, your creativity and your love.

“You are worthy, our Lord and God, to receive glory and honor and power, for you created all things, and by your will they were created and have their being.”
Revelation 4:11 (NIV)
1 http://www.phdcomics.com
Mathematical terminology

Symbol                      Meaning

States
s                           A primitive state of the world.
st                          The state of the world at time t.
f(P1, . . . , Pk)           A fluent: a predicate describing a state of the world.
f ∧ g                       The logical conjunction (“and”) of fluents f and g.
s |= f1 ∧ f2 ∧ . . . ∧ fk   State s models fluents f1 . . . fk,
                            i.e. the fluents are true in state s.

Actions
a                           A primitive action.
B                           A behaviour, a temporally abstract action.
B(P1, . . . , Pk)           A parameterised behaviour.
a                           An action, either primitive or temporally-abstract.
P                           The set of all primitive actions.
B                           The set of all behaviours.
A                           The set of all possible actions,
                            both primitive and temporally abstract.
Root                        The root behaviour of a learning task.
                            A description of the agent’s main goal.
B.pre                       The precondition of behaviour B.
                            A conjunction of fluents which describes the set of states
                            in which the behaviour can be executed (its application space).
B.post                      The post-condition of behaviour B.
                            A conjunction of fluents which describes the set of states
                            which form the goal of the behaviour.
B.sfx                       The side-effects of behaviour B.
                            A set of conjunctions of fluents which cannot be guaranteed
                            to remain true while executing B.
B.plan                      The plan decomposition of behaviour B.
B.gran                      The granularity of behaviour B.

Plans
P                           A plan.
N                           A node of a plan.
N                           The set of all nodes.
Nt                          The set of all nodes active at time t.
N.cond                      The condition of node N.
                            A conjunction of fluents.
N.B                         The behaviour dictated by node N.
N.type                      The type of node N: either policy or exploratory.

Other
P (X | Y )                  The probability of event X given Y.
E {X | Y }                  The expected value of random variable X given Y.
T (s′|s, a)                 The transition probability function for primitive action a.
T (s′, k|s, B)              The transition probability function for behaviour B.
R (r|s, a)                  The reward probability function for primitive action a.
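To make the notation concrete, the behaviour and plan-node attributes above can be mirrored in a small record structure. The following Python sketch is purely illustrative: the field names follow the B.* and N.* notation of the table, but the class names and the string encoding of fluents are assumptions of this example, not Rachel’s actual implementation.

```python
from dataclasses import dataclass, field

# A fluent is encoded here as a string such as "in(hall)";
# a conjunction of fluents f1 ∧ ... ∧ fk is a frozenset of such strings.
Fluent = str
Conjunction = frozenset

@dataclass
class Behaviour:
    """Illustrative record mirroring the B.* notation."""
    name: str
    pre: Conjunction                 # B.pre  - states where B can be executed
    post: Conjunction                # B.post - states forming the goal of B
    sfx: set = field(default_factory=set)  # B.sfx - conjunctions not guaranteed to persist
    plan: object = None              # B.plan - plan decomposition, if any
    gran: int = 0                    # B.gran - granularity of B

@dataclass
class PlanNode:
    """Illustrative record mirroring the N.* notation."""
    cond: Conjunction                # N.cond - a conjunction of fluents
    behaviour: Behaviour             # N.B    - behaviour dictated by this node
    type: str = "policy"             # N.type - "policy" or "exploratory"

def active(node: PlanNode, state: set) -> bool:
    """A node is active in state s when s |= every fluent in its condition."""
    return node.cond <= state

go = Behaviour("Go(hall, bedroom2)",
               pre=Conjunction({"in(hall)"}),
               post=Conjunction({"in(bedroom2)"}))
node = PlanNode(cond=Conjunction({"in(hall)"}), behaviour=go)
print(active(node, {"in(hall)", "holding(coffee)"}))  # True
```

The subset test in `active` is one way to realise the s |= f1 ∧ . . . ∧ fk relation when fluents are ground terms; parameterised fluents would need unification instead.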
Chapter 1
Introduction
1.1 Picture this:
A quiet suburban street, late in the afternoon. A car is parked at the curb with
two people inside, one young and one old. They are talking. The younger one
sits in the driver’s seat. She seems nervous. Her companion is calming and
reassuring her. Two familiar cardboard squares are prominently displayed on the
front and rear of the vehicle, each bearing a large letter ’L’. They are learner’s
plates. This is her first driving lesson.
Lesson number one is simple: Start the car and pull away from the curb.
Her instructor reminds her of the procedure: start the engine, select first gear,
indicate, check the mirrors and then accelerate away smoothly. She repeats these
instructions to herself as she turns the key. With her foot firmly on the clutch,
she hunts for first gear. The stick finally settles into place and slowly she releases
the clutch - but not slowly enough. The car lurches forward suddenly and then
comes to an abrupt halt. Her instructor reassures her and reminds her of the
importance of finding the “friction point” and using the accelerator as well as
the clutch. She tries again. On the third attempt the car hops forward once or
twice, but doesn’t stall. Buoyed by her success, the pupil moves on to the next
step.
The scene changes . . . Five years have passed and she now drives to work
every day. Changing gears is second-nature to her — she doesn’t even think
about it. In fact, she has made this trip so many times she swears she could
do it blindfolded. Along the way she has learnt things her instructor never told
her: how much further she can go when the fuel gauge says “empty”, and how
to steer with her knees and do her make-up simultaneously in slow traffic. And
next weekend she will be passing her knowledge on, giving her younger brother
his first lesson.
What has effected this enormous transformation? How did our subject go from
being unable to even start the car moving, to driving large distances every day?
The answer is simple: Practice.
Aristotle, in his Politics, said “The things we learn to do, we learn by do-
ing.” He was referring to ethical behaviour, but the same could be said of any
human activity, whether it be walking, talking, driving, juggling or playing the
piano. Expertise comes with experience, and experimentation will always be our
greatest teacher.
And yet experience alone is not enough. Our subject’s driving practice was
not attempted haphazardly but was informed by advice, structured by reason
and refined by reflection. Without any background knowledge of how the car
works and the purpose and meaning of the various dials and controls, our young
driver might still be randomly pressing buttons, turning knobs and pushing levers
years later, without ever having left the curbside.
From the Turing Test to the artificial humanoid, human intelligence has al-
ways been our inspiration for artificial intelligence. Some would say it is our
only measure of what intelligence really is. Even those who regard AI as an
engineering discipline draw ideas from human examples. In this thesis we shall
be exploring how the familiar combination of knowledge, reasoning and practice
can be implemented in an artificial system.
The focus of this work will be on the problem of control. A computer-
controlled “agent” interacts with an external environment to achieve certain
goals. It may be a robot interacting with an office environment, or a software
agent exploring the web, or any of a number of similar models. The agent can
sense certain features of the environment and can use its actuators to make
changes in it. Our object will be to create a control policy for the agent which
allows it to achieve its goals. As in the example that opened this chapter, that
policy will be constructed by a combination of knowledge, reasoning, practice
and reflection.
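The sense-act cycle just described can be sketched in a few lines. The environment and the hand-coded policy below are hypothetical stand-ins, intended only to show the shape of the control loop, not any system described in this thesis.

```python
class Environment:
    """A hypothetical one-dimensional world: the agent starts at
    position 0 and its goal is to reach position 3."""
    def __init__(self):
        self.position = 0

    def sense(self):
        # The agent can sense certain features of the environment...
        return self.position

    def act(self, action):
        # ...and can use its actuators to make changes in it.
        self.position += 1 if action == "right" else -1

    def goal_reached(self):
        return self.position == 3

def policy(state):
    """A trivial control policy mapping sensed state to action."""
    return "right"

# The control loop: sense the environment, choose an action
# according to the policy, act, and repeat until the goal is reached.
env = Environment()
steps = 0
while not env.goal_reached():
    state = env.sense()
    env.act(policy(state))
    steps += 1
print(steps)  # 3
```

In this thesis the interesting question is where such a policy comes from; here it is hard-coded, whereas the chapters that follow construct it from knowledge, reasoning and practice.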
1.2 Human intelligence: A Cognitive Model
To better understand the aspects of human intelligence from which we will be
drawing our inspiration, let us first look briefly into a psychological model of
human problem-solving from Cognitive Psychology.1 We will illustrate this model
with our earlier example.
Consider first the situation our driver finds herself in when first we meet her.
Even before getting into the driver’s seat she has received extensive training
about driving. Some of it was in the form of instruction from her teacher. Some
of it was from observing others drive. She knows what most of the controls
are for. She has memorised the road rules and been tested on them, and she
can recite the sequence of steps she is about to attempt. Yet with all of this
knowledge, she cannot drive the car.
After many years of practice her driving is smooth and mostly fault-free, but
when she comes to teach her younger sibling she will encounter a new frustration.
He will ask questions like: “How do you release the clutch without stalling?” and
“Exactly when are you supposed to change gears?” She can only answer vaguely
at best, communicating her knowledge in abstract terms. In the end she says,
“You develop a feel for it”. She has the ability but she cannot put it precisely
into words.
1.2.1 Types of Knowledge
Cognitive psychologists recognise these two experiences as evidence of two dis-
tinct kinds of knowledge which they call declarative and procedural (Anderson,
1976; Cohen & Squire, 1980). Declarative knowledge is “knowing what”. It is
knowledge of which we are consciously aware. We can communicate it and reason
with it, but it is abstract and does not on its own convey ability to do the things
it describes.
Procedural knowledge is “knowing how”. It is implicit and unconscious. It
tells us how to do things, down to the intricacies of sensing and movement, but
being implicit it is difficult to communicate. Even if it could be explained, it is
of limited use to others as it is tailored intimately to the person possessing it.
1As a model of human rationality, Cognitive Psychology is one of many, and it necessarily has its opponents. While I am persuaded by many of its claims, I am not presenting it here as the only such model, or indeed the best. It is merely a source of inspiration. Even a poor model of human intelligence can inspire a good model for artificial intelligence.
1. Introduction 4
1.2.2 Interaction between the types
Having distinguished these different kinds of knowledge, how do they relate?
What are their roles in the learning process? Cognitive psychology describes
three phases of skill acquisition in which these two kinds of knowledge each play
a role (Anderson, 1995):
1. The cognitive phase, in which the person acquires the necessary declarative
knowledge to attempt to solve the problem. This knowledge is used to form
a plan of attack for solving the problem.
2. The associative phase, in which the declarative knowledge is gradually
converted into procedural knowledge through practice. Errors in the initial
declarative knowledge are also discovered and corrected, refining the plan
if needs be.
3. The autonomous phase, in which procedural knowledge has taken over and
the skill becomes more and more automatic, requiring significantly less
attention to perform.
The relationship between declarative and procedural knowledge is circular.
Initial declarative knowledge helps us acquire procedural knowledge by guiding
our practice, and the experience gained through performing the skill enables us
to add to and refine that initial knowledge.
1.3 Declarative and Procedural Knowledge in
Artificial Intelligence
If human intelligence manifests itself as both declarative and procedural knowl-
edge, one might expect to find a similar dichotomy in AI research, and indeed
this is the case. There has been a long-standing philosophical dispute about
the “correct” way to construct AI, with the protagonists falling into two broad
camps: the “classicists” and the “connectionists”. Many parallels can be drawn
between these two viewpoints and the two kinds of intelligence we describe.
Classical AI (also “Symbolic AI” or “Good Old-Fashioned AI”) has its roots
in the earliest AI traditions from the 1960s and 70s (Winograd, 1972, Minsky,
1974, Bobrow & Winograd, 1977). Newell and Simon (1976) summarise this
position as the concept that the mind is a physical symbol system, that is, “a
system that produces through time an evolving collection of symbol structures”.
They distinguish this from a general-purpose computer by indicating that sym-
bols designate objects in the world.
Practically, symbolic AI has primarily concerned itself with abstract logical
reasoning, using symbolic logic (or its equivalent) to express knowledge about
the environment, to reason about it, and to decide how to interact with it. It
often deals with abstractions, assuming that complex descriptions of objects and
actions such as “pick up the big, red pyramid” can easily be translated from the
abstract to the concrete.
In this way, symbolic AI bears similarity to declarative knowledge. Its lan-
guage is similar to that which we use when we reason declaratively (at least on
paper) and so is more-or-less transparent to our intellects. This enables us to
communicate with such systems directly, encoding our knowledge and interpret-
ing results relatively easily – to a limit.
Beyond a certain point of complexity, our knowledge becomes increasingly
difficult to express symbolically, reasoning with such knowledge becomes very
complex, and the results of such systems require more and more expert knowledge
to interpret. Simply put, there are limits on how much of the world we can
comprehend and explain.
Symbolic AI has been criticised for failing to represent the full complexity
of the world (Dreyfus, 1979, Smolensky, 1989). The assumed mapping between
the abstract symbols and the real-world objects and actions they represent turns
out to be very hard to implement. For example, recognising an “apple” from
a picture can be quite difficult – the apple can be red, green, yellow or even
brown. It can be lit in many different ways. It may sit in amongst many other
fruit, or be hanging on a tree, it may be a cartoon illustration rather than a
realistic photograph, and so on and so on. This problem of mapping between
the input signals from hardware sensors to the symbols of reasoning and planning
is known as the symbol grounding problem (Harnad, 1990, Ziemke, 1997) and is
a significant stumbling block for classical AI.
In light of these failings, Connectionist AI (“subsymbolic AI” or “New Fan-
gled AI”) provided a dramatic rethink of our approach to building AI systems
(Rumelhart, 1989, Churchland, 1990). Rather than explicitly attempting to
model our thinking patterns, connectionist approaches instead attempt to model
our brains. They are inspired by the neural structure of the brain, with a mul-
titude of simple computation elements (neurons) connected together into a self-
organising network.
From this starting point connectionist approaches have diversified into a large
number of strongly statistical techniques. What they share in common is the
absence of any explicit internal representation beyond the immediate numeric
inputs and outputs from sensors and actuators.
One of the strong philosophical claims of this field is that despite the fact
that symbols are not explicitly represented, there are emergent “sub-symbols”
(Smolensky, 1989) – patterns of activation within the connectionist networks
which generalise the low-level input data into an implicit high-level representation.
This has been a point of contention with practitioners from the symbolic
camp (e.g. Rosenberg, 1990), who argue that these sub-symbols are not guaranteed
to exist, may not follow our concepts of logical behaviour, and are not
discernible to external observers. Practically speaking, connectionist systems
often lack comprehensibility as they are without an accessible high-level expla-
nation for their behaviour.
Sub-symbolic AI thus has many things in common with unconscious proce-
dural knowledge. While connectionism isn’t always procedural, in the sense of
dealing with action, it does deal directly with input from the world and has no
explicit abstract representations. It is often able to model much more complex
and subtle relationships in the world, but at the loss of comprehensibility.
It is not our intention to provide a rigorous discussion of the philosophical
merits of either approach – other, wiser, minds have attempted that (see Haugeland
(1997) for a good anthology) and our intentions are more practical. We shall
focus on one problem area in particular, that of control, and aim to show how
the dispute can be resolved in a way that takes advantage of the strengths of
both approaches. Whether this kind of reconciliation can be extended to the
wider problem of AI is a question we shall leave for the philosophers.
1.3.1 Declarative and Procedural Knowledge in Control
Since we are interested in the problem of control, how does the Symbolic/Sub-
symbolic distinction manifest itself in AI approaches to this field?
The symbolic approach to control
The typical symbolic approach to control is Symbolic Planning 2 (Allen, Hendler,
& Tate, 1990; Ghallab & Milani, 1996). Symbolic planning is characterised by:
• A first-order logic representation for describing the agent’s state,
• A logical model of the agent’s actions in terms of cause-and-effect,
• Explicitly codified goals,
• A formal reasoning process to determine a sequence of actions which will
cause the goals to become true.
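The cause-and-effect action model in such a characterisation is typically written as STRIPS-style operators. As a rough sketch only – the blocks-world predicates and set-based state representation below are invented for illustration, not taken from any particular planner – an operator and its application might look like:

```python
# A minimal STRIPS-style operator sketch (all predicate names invented).
# The symbolic state is a set of ground facts; an operator has
# preconditions, an add-list and a delete-list.

def applicable(state, pre):
    """An operator may fire only when every precondition holds."""
    return pre <= state

def apply_op(state, pre, add, delete):
    """Cause-and-effect model: delete old facts, assert new ones."""
    if not applicable(state, pre):
        raise ValueError("preconditions not satisfied")
    return (state - delete) | add

# Hypothetical operator: pick up a clear block from the table.
state  = {"on_table(a)", "clear(a)", "hand_empty"}
pre    = {"on_table(a)", "clear(a)", "hand_empty"}
add    = {"holding(a)"}
delete = {"on_table(a)", "clear(a)", "hand_empty"}

new_state = apply_op(state, pre, add, delete)
```

A planner chains such operators together, using the add- and delete-lists to reason about which sequence of actions will make the goal facts true.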
The advantages of this approach are:
• It can make use of background knowledge provided by the trainer in the
form of the action model,
• The resulting plans are logically sound and the correctness of the planning
process is easily verified,
• The plans are comprehensible. The reasons behind a particular choice of
action can be extracted from the plan.
The disadvantages are:
• Important concrete details of the effects of actions, such as duration,
stochasticity (and other kinds of non-determinacy), continuity and depen-
dencies on the finer details of the state are hard to specify in a symbolic
model.
• The planning process doesn’t scale down to fine-grained problems. Logical
reasoning about cause and effect needs to be supplemented by numeric and
probabilistic reasoning which complicate the planning process and obscure
the plans.
2Practitioners of symbolic planning generally refer to it just as “planning”. There are, however, sub-symbolic techniques that are also called “planning”, so we shall use the name “symbolic planning” where necessary, to preserve the distinction.
The sub-symbolic approach to control
The sub-symbolic approach to control is typified by Reinforcement Learning
(Sutton & Barto, 1998). The characteristics of this approach are:
• States are represented as vectors of numerical values
• Goals are specified implicitly in reward functions. Actions have a numeric
“value” depending on how well they achieve those goals.
• General-purpose cause-and-effect models of actions are usually abandoned,
or else they are numeric models which attach probabilities to different
outcomes.
• As the effects of actions are difficult for the programmer to discover and
describe, policies are learnt via interaction with the environment, rather
than deduced from models.
The advantages of this approach are:
• It can handle subtle and complex environments which cannot adequately
be described by high-level logical models.
• It can learn by trial-and-error those things which cannot be specified by
the trainer.
The disadvantages are:
• Policies are opaque. Interpreting the reason for a particular action is vir-
tually impossible.
• When background knowledge does exist, it is not easy to incorporate into
the agent’s policy.
• Without background knowledge, this approach doesn’t scale up to large
problems.
Many of the strengths and weaknesses of these two approaches are based
on what we shall call the granularity of the actions involved. The symbolic
approach is best suited to coarse-grained states and actions – large scale features
and changes in the world which can be regarded at a high level of abstraction.
The sub-symbolic approach is better suited to fine-grained states and actions
– low-level aspects of the world which often require concrete numerical detail to
be described accurately.
1.4 A Hybrid approach
In our opinion basing an approach to AI on a single kind of representation,
symbolic or sub-symbolic, is going to falter. Each representation has limits
to what it can express. Just as human intelligence is based on an interaction
between declarative and procedural knowledge, so also artificial intelligence is
going to need to incorporate both symbolic and sub-symbolic techniques if it is
going to overcome these limitations.
In this thesis we propose a hybridisation of symbolic and sub-symbolic ap-
proaches to artificial intelligence for control. We shall combine symbolic planning
with reinforcement learning to produce a system that capitalises on the strengths
of each to overcome the weaknesses of the other. Let us examine briefly how such a
hybrid would achieve this, first from the point of view of reinforcement learning,
then from the point of view of planning.
1.4.1 Putting declarative knowledge into reinforcement
learning
The search for a general-purpose reinforcement learning algorithm that can be
applied to arbitrary learning problems of any size has been largely fruitless. Al-
gorithms do not scale up without careful hand-tailoring to particular problems.
In response to this realisation, some researchers have directed their attention
toward building better special-purpose solutions instead, solutions that deliber-
ately incorporate domain-specific background knowledge in a systematic way to
improve learning performance.
Hierarchical reinforcement learning (HRL) (Sutton, Precup, & Singh, 1999;
Dietterich, 2000a; Parr & Russell, 1998), is one such technique that is proving
quite effective. It implements the familiar intuition that a complex task can be
more easily solved if it can be decomposed into a set of simpler tasks. Solutions
are found for the simpler problems, and then recombined to solve the original
task. This intuition has successfully been used to simplify problems in several
areas of artificial intelligence, such as hierarchical planning (Sacerdoti, 1973; Iba,
1989) and structured induction (Shapiro, 1987).
Reinforcement learning generally attempts to find a policy mapping primi-
tive states directly to primitive actions. Generally every possible primitive action
can be explored in every possible primitive state, resulting in a massive search
space of possible policies. HRL tries to cut down this search space by intro-
ducing high-level structure into policies. The trainer specifies a set of abstract
behaviours which are temporally abstract (have significant duration) and spa-
tially abstract (they map from one set of states to another). Policies are learnt
in terms of behaviours. Behaviours are decomposed (in various ways) into prim-
itive state/action mappings.
Behaviours have two important advantages:
• They limit the policy space, by specifying particular policy mappings for
large portions of the state-space. This is done either by hard-coded restric-
tions (limiting choices, or even specifying particular parts of the policy) or
else by local goals.
• They are temporally abstract. Choosing to execute a behaviour generally
means committing to it long-term. A policy involving many hundreds of
primitive actions could possibly be specified as only a short sequence of
behaviours. Learning a behaviour-based policy requires fewer decisions to
be made, so the search space is reduced significantly.
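The second advantage can be made concrete with a small sketch. Once a behaviour is chosen, its internal policy runs until a local termination condition holds; the one-dimensional toy world and all names below are invented for illustration:

```python
# Executing a temporally abstract behaviour: one high-level decision
# commits the agent to many primitive steps (toy example only).

def run_behaviour(state, policy, terminated, step, max_steps=1000):
    """Follow the behaviour's internal policy until its local
    termination condition holds; return the resulting state."""
    for _ in range(max_steps):
        if terminated(state):
            break
        state = step(state, policy(state))
    return state

# Invented behaviour: "walk right until position 5 is reached".
final = run_behaviour(
    state=0,
    policy=lambda s: +1,          # always move right
    terminated=lambda s: s >= 5,  # local goal
    step=lambda s, a: s + a,      # deterministic toy dynamics
)
```

Five primitive actions are executed, but the high-level learner made only a single decision.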
However, it is not simply a matter of adding a collection of behaviours and
instantly getting superior performance. Behaviours only work if they reduce the
size of the policy space. Providing an agent with a large repertoire of behaviours
in which many alternatives exist for every decision may make learning harder
rather than easier. Time will be wasted exploring inappropriate behaviours.
Most common monolithic (i.e., non-hierarchical) RL approaches are model-
free – that is they do not require models of actions, nor do they endeavour to
build them. Hierarchical reinforcement learning algorithms have inherited this
characteristic, insofar as they do not attempt to build or use models of behaviours
(with some exceptions). This is true for the same reasons as for monolithic RL:
like primitive actions, behaviours have complex, stochastic effects which cannot
easily be specified or modeled. It is easier to evaluate them based on goal-specific
criteria, rather than build general-purpose models.
However, when the repertoire of behaviours is large, some means of limiting
the set of choices is necessary. It is assumed that the trainer has some
background knowledge of the tasks and knows which behaviours might be ap-
propriate in different parts of the problem. It may be that this set of behaviours
is small enough that no further limits need to be applied, but as more ambitious
problems are tackled the repertoire of necessary behaviours will increase, and
further background knowledge will need to be implemented to limit the set of
choices available in a certain situation to those that may be appropriate, ignoring
those that were included for different parts of the problem, or different problems
altogether.
Most existing algorithms implement such knowledge in the form of a task-
hierarchy (Dietterich, 2000a) or similar structure. A task-hierarchy is essentially
a function which maps a situation to a set of behaviours which might be appro-
priate in that situation. The set of choices available to the learning algorithm is
thus kept to a minimum, and the size of the policy space is kept under control.
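Viewed this way, a task-hierarchy is just a function from situations to candidate behaviours. A minimal sketch, with entirely invented situations and behaviour names:

```python
# A task-hierarchy seen as a mapping from the current situation to
# the set of behaviours worth considering there (names invented).

HIERARCHY = {
    "far_from_door":  {"go_to_door"},
    "at_closed_door": {"open_door"},
    "at_open_door":   {"walk_through", "open_door"},
}

def available_behaviours(situation):
    """Restrict the learner's choice to plausibly useful behaviours."""
    return HIERARCHY.get(situation, set())

# The learner now selects among at most two behaviours here,
# not among the whole repertoire.
choices = available_behaviours("at_open_door")
```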
Task hierarchies have hitherto been constructed manually by the trainer, but
this need not be the case. In this thesis we shall investigate the possibility of
automatically constructing them, based on abstract models of behaviours. The
concrete details of a behaviour’s effects may be difficult to model, but its intended
purpose is often much simpler and more easily expressed. Given an abstract
model of a behaviour’s purpose, and a similar definition of the agent’s goal, we can
use symbolic planning techniques to reason which actions might be appropriate
in different situations, but rather than choosing a particular behaviour in this
way, we shall determine a set of appropriate behaviours, and then use learning
to select a particular one. In this way we can use abstract knowledge from the
model to limit the set of choices, and then concrete knowledge gathered from
experience to make the selection.
An additional issue in HRL is that of temporal abstraction and commitment
to behaviours. It is recognised that behaviour-based policies are not necessar-
ily optimally efficient, and that better policies can be generated by “cutting
corners”, relaxing the commitment to completing a behaviour once it has been
started, in favour of switching to a better behaviour sooner. While this indeed
produces more efficient policies, it removes one of the principal advantages of
behaviours, which is their temporal abstraction. The more often the agent can
make decisions, the more decisions it will have to make, and the longer it will
take to learn the right ones. For this reason most advocates of this approach
reserve it for optimising a behaviour-based policy that has already been learnt.
Modeling the purpose of behaviours can also aid us in finding the best tradeoff
in this problem. A symbolic model allows the agent to reason about why an
executing behaviour is appropriate. Once the behaviour has been started, this
condition can be monitored. As long as the behaviour remains appropriate it is
reasonable to persist with it. However, if, for some reason, the behaviour becomes
inappropriate, we have valid grounds for interrupting it and selecting another.
Policies learnt in this way will be more efficient than those that blindly continue
executing behaviours which have long since become pointless. Meanwhile they
keep the number of occasions on which a new decision needs to be made to a
minimum, and thus learning is not excessively delayed.
We shall investigate hybrid algorithms using both symbolic planning and
hierarchical reinforcement learning which implement both these improvements
to stand-alone HRL.
1.4.2 Putting procedural knowledge into symbolic plan-
ning
Analysing the problem from the opposite viewpoint, much research has also gone
into scaling symbolic planning to more complex fine-grained problems. Hierar-
chy is also of value here, and significant work has been done to create hierarchical
planners, which first decompose plans into large steps, and then refine them
recursively into smaller and smaller steps until a solution can be found in terms of
the most primitive actions.
These techniques can considerably reduce the search time involved in building
a plan, but still require a logical model of the effects of actions. Often the default
assumption is that the primitive actions are actually already moderately complex
behaviours, which can be cleanly modeled, or else the problem domain is such
that primitive actions are noiseless and deterministic.
These problems have led to recent developments in planners that acknowledge
that the world is noisy and that their models are likely to be incomplete.
As a result the expectation that actions operate instantaneously and predictably
has been abandoned. Planners are now designed to include contingencies in their
plans for when things go awry, and plan execution is monitored to ensure that
any such execution failures are detected and handled as they arise.
All the same, maintaining the logical description of actions limits the ability
to express fine-grained actions in realistic settings. Symbolic techniques still rely
on a programmer to provide behaviours which can be described at a medium-
to-high level of abstraction.
In this thesis we shall explore an alternative option. Rather than starting with
hard-coded behaviours which can be described abstractly, we shall instead start
with an abstract definition of a desired behaviour, and then use reinforcement
learning to learn a concrete implementation. Thus the planner does not have to
deal directly with primitive actions, but can instead work at the level of abstraction
at which it operates best.
Learning behaviours is not without its drawbacks. It cannot be predicted
exactly how the learnt behaviour will operate, how efficiently it will work nor
what side-effects it may produce. The standard planning process of finding the
shortest possible sequence of behaviours to reach a goal may not produce the
best possible plan. A certain amount of trial-and-error exploration of different
possible plans will be necessary to find the best possible solution. To this end,
we will use hierarchical reinforcement learning to select between different paths
to the goal based on concrete experience with the different possibilities.
1.4.3 Getting declarative knowledge out of procedural
learning
Our aim so far has implicitly been to find the most efficient policy to achieve
an agent’s goals. We have discussed how a combination of symbolic and sub-
symbolic AI might achieve this more effectively than either approach alone. How-
ever we have a second aim running alongside this. We would also like the agent
to be able to improve its body of declarative knowledge through analysis of the
results of procedural learning. In other words, by “reflecting” on its experiences
executing behaviours, the agent should be able to repair incorrect or incomplete
parts of its symbolic model. The advantages of this are two-fold:
1. It makes the knowledge that is implicitly contained in its policies explicitly
available for reasoning and planning. Incorrect or incomplete plans can
hamper the agent’s ability to improve its policy. Repairing such faults will
allow better policies to be found.
2. Explicit declarative knowledge can be more easily communicated to other
agents, including the trainer. If the reasons for certain decisions can be
modeled declaratively, then it is to our advantage to do so, to make the
agent’s decisions more comprehensible.
Extracting symbolic knowledge from experience has been the object of much
research (Benson, 1995; Shen, 1993; Wang, 1996; Oates & Cohen, 1996; Lorenzo
& Otero, 2000; Gil, 1994; desJardins, 1994). There are many different things
that can be learnt and modeled. In this work we have chosen to focus on one
particular element which has received comparatively little attention: learning the
side-effects of behaviours. These can be detected as the results of plan execution
failures, and then the agent can learn to predict and avoid them using symbolic
machine learning tools.
1.4.4 Rachel: A hybrid planning/reinforcement learning
system
[Figure 1.1 shows the three components of Rachel and the data flowing between them: the Planner builds plans from the RL-Top model; the Actor executes plans and learns policies; the Reflector monitors execution traces and learns side-effect descriptions, which feed back into the model.]
Figure 1.1: The three parts of the Rachel architecture.
We have implemented this proposed hybrid of planning and reinforcement
learning in a system we call Rachel. Rachel consists of three parts: a planner,
an actor and a reflector. All three components share a common symbolic
representation of the world in terms of a set of fluents which describe the state
and teleo-operators which describe potentially useful high-level behaviours.
The planner, a simple means-ends problem solver based on the Teleo-reactive
formalism of (Nilsson, 1994), builds abstract plans for achieving the agent’s
goals. These plans serve as task-hierarchies for the actor, which uses hierarchical
reinforcement learning to select between alternative paths in the plan and to
build concrete policies for abstract behaviours. The actor executes these policies
in the world, monitored by the reflector. When plan execution fails to proceed
as expected, the reflector diagnoses the fault and gathers examples of its cause.
Given enough examples, it uses the Inductive Logic Programming algorithm
Aleph to produce a symbolic description of the cause, which is then fed back
into the planner to make more accurate plans.
1.5 Contributions of this thesis
Work towards the goal of unifying the symbolic and sub-symbolic approaches
to artificial intelligence is still in still its infancy, and there are many aspects to
be considered. The work in this thesis focuses on the problem of learning and
control. The principal contributions made are as follows:
1. The design of a shared representation of states, actions and goals for
both reinforcement learning and planning, particularly the Reinforcement-
Learnt Teleo-Operator (RL-Top) formalism which combines the represen-
tations of abstract behaviours from both fields.
2. A hybrid planning/reinforcement learning architecture Rachel which shows
how reinforcement learning can be used to ground abstract behaviours in
planning, and how symbolic plans can be used turn background knowl-
edge into high-level structure to help solve complex reinforcement learning
problems.
3. Two different hierarchical reinforcement learning algorithms that incor-
porate background knowledge from plans: 1) Planned Hierarchical Semi-
Markov Q-Learning (P-HSMQ) which extends the HSMQ algorithm (Dietterich,
2000b) to use plan-built task hierarchies, and 2) Teleo-Reactive Q-
Learning (TRQ), a more complex algorithm which implements teleo-reactive
execution semantics to improve the termination of behaviours.
4. An examination of how the symbolic representation of the domain can
help diagnose problems in the learnt policy, focusing on the detection of
side-effects not present in the original behaviour descriptions.
5. An extension to Rachel incorporating code to detect such side-effects and
overcome them by refining the symbolic knowledge base using inductive
logic programming.
6. Experimental verification of the effectiveness of this system in domains of
various complexity, ranging from simple grid-based problems to complex
control tasks.
1.6 Overview
The remaining chapters are arranged as follows: Chapters 2 and 3 review the
necessary background to this work, in reinforcement learning and symbolic plan-
ning respectively.
Chapter 4 introduces a hybrid representation for the agent control problem,
which combines elements from both the preceding chapters. At the heart of this
new representation is the concept of a Reinforcement Learnt Teleo-Operator
(RL-Top) which models a reinforcement-learnt behaviour as a symbolic plan-
ning operator.
In Chapter 5 we explain the first two elements of the Rachel architecture:
the planner and the actor. We shall derive two different algorithms for incor-
porating plans into hierarchical reinforcement learning: Planned Hierarchical
Semi-Markov Q-Learning (P-HSMQ) and Teleo-Reactive Q-Learning (TRQ).
In Chapter 6 we tackle the problem of incompletely specified world models,
focusing on the frame problem and how the existence of unexpected side-effects
can make planning less effective. We add a third element to Rachel, the re-
flector, which automatically diagnoses such problems and learns to avoid them
using ILP.
Chapter 7 presents empirical testing of the various algorithms, compared
with existing techniques. We show that the combination of planning, learning
and reflecting can automatically produce results which would otherwise require
a significant degree of hand-crafted background knowledge.
Finally, in Chapter 8 we draw it all together, reflect on what has been
achieved, and suggest a variety of areas for extensions and improvement.
Chapter 2
Background - Reinforcement
Learning
As this thesis describes a hybrid system, it draws on material from several major
subfields of artificial intelligence: reinforcement learning, symbolic planning and
knowledge refinement. In this chapter and the next we review those fields with
an eye to explaining the work that is to come. Each field is quite large in itself
and it is not possible to discuss them comprehensively. Instead, we shall focus on
those aspects of each domain that provide the necessary background material for
this thesis. This material has been split into two chapters. This chapter presents
the subsymbolic approach to control as performed by reinforcement learning and
hierarchical reinforcement learning, and in the next chapter we shall deal with
the symbolic approach to control performed in symbolic planning and model
learning.
2.1 Reinforcement Learning
In the past decade reinforcement learning has established itself as an important
method employed in artificial intelligence research for learning to control an
agent. It is a statistical approach that treats learning how best to behave
as an online optimisation problem with an initially unknown value function.
Its foundations are in the mathematics of dynamic programming.
Many different approaches have been put forward, far too many to cover
here, but most share a common formulation of the reinforcement learning prob-
lem as established in the work of Watkins (1989) and Sutton (1988). This work
2. Background - Reinforcement Learning 19
set the foundation for the field, establishing a sound theoretical model for re-
inforcement learning and introducing practical learning methods. We describe
this foundation, both practical and theoretical, and the Q-Learning algorithm
that embodies it, illustrating principles that have been extended to a
host of more complex algorithms.
2.1.1 The Reinforcement Learning Model
[Figure 2.1 shows an RL agent interacting with its environment: the agent’s policy, shaped by its goals and the reward signal, maps the observed state to an action, and the environment returns the resulting state.]
Figure 2.1: An illustration of a reinforcement-learning agent.
Reinforcement learning models an agent interacting with an environment,
trying to optimise its choice of action according to some reward criterion, as
illustrated in Figure 2.1. The agent operates over a sequence of discrete time-
steps (t, t + 1, t + 2, . . .). At each step it observes the state of the environment
st and selects an appropriate action at. Executing the action produces a change
in the state of the environment to st+1. It is assumed that the sets of possible
states S and available actions A are both finite. This is not always the case in
practice, but it greatly simplifies the theory, so we shall follow this convention.
The mapping of states to actions is done by an internal policy π. The initial
policy is arbitrarily chosen, generally random, and improved based on
the agent’s experiences. Each experience 〈st, at, st+1〉 is evaluated according to
some fixed reward function, yielding a reinforcement value rt ∈ ℝ. The agent’s
objective is to modify its policy to maximise its long-term reward. There are
several possible definitions of “long-term reward” but the one most commonly
employed is the expected discounted return given by:
Rt = rt + γ rt+1 + γ² rt+2 + · · · = ∑_{i=0}^{∞} γ^i rt+i        (2.1)
where γ is the discount rate that specifies the relative weight of future rewards,
with 0 ≤ γ < 1. Should the agent reach some terminal state sT , then the infinite
sum is cut short: all subsequent rewards rT+1, rT+2, . . . are considered to be zero.
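For a finite episode the expected discounted return of Equation 2.1 can be computed directly. The reward sequence below is invented, using the common goal-achievement convention of reward 1 at the goal and 0 elsewhere:

```python
# Discounted return R_t = sum_i gamma^i * r_{t+i} over a finite
# episode; rewards after the terminal state are taken to be zero.

def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Invented episode: no reward until the goal is reached at step 3.
R = discounted_return([0, 0, 0, 1])

# A shorter path to the goal earns a strictly higher return,
# so maximising the return prefers the shortest path.
R_short = discounted_return([0, 0, 1])
```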
Semantically speaking, these reward signals are an expression of the agent’s
goals. A positive reward generally indicates a result that the agent considers
favourable, a negative reward unfavourable (although it must be noted that it
is the relative value of a reward that determines its goodness, not the absolute).
There is no standard form that this function should take, except that it should
obey the Markov property (below). A common formulation in goal-achievement
tasks is to give the agent a positive reward rt = 1 when it achieves its goal and
rt = 0 otherwise. Under the discounted return above, the optimal policy is then
the one that takes the shortest path to the goal.
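The discounted return of Equation 2.1 is easily computed for a finite reward sequence. The following Python sketch (the function name and reward sequences are illustrative, not from the thesis) shows why the 0/1 goal reward makes the shortest path optimal:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_i gamma^i * r_{t+i} (Equation 2.1).

    Rewards after a terminal state are zero, so a finite list
    suffices for episodic tasks.
    """
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Goal-achievement task: reward 1 on reaching the goal, 0 elsewhere.
short_path = [0, 0, 1]       # goal reached on the third step
long_path = [0, 0, 0, 0, 1]  # goal reached on the fifth step

# With 0 <= gamma < 1, the earlier reward is discounted less,
# so the shorter path has the higher return.
assert discounted_return(short_path, 0.9) > discounted_return(long_path, 0.9)
```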
There is some disagreement about from where these reward signals should
be considered to originate in the model shown in Figure 2.1. Some would say
that they are an expression of the agent’s goals and thus belong inside the agent.
Others argue that they are immutable criteria provided to the agent in advance
by its creator, and thus belong in the environment. I choose to place them in a
compromise position: inside the agent but outside the learning sub-system. This
expresses the fact that a complex agent may elect to pursue different goals at
different times, and thus may change the way it evaluates its progress. However,
for the time being we shall assume that the agent has a fixed goal and thus a
fixed reward function.
Note however that the learning algorithms assume no prior knowledge of the
reinforcement function, when rewards will occur or what values they will take.
All learning is done by trial and error: an action is performed and the resulting
state transition and reward are observed, and used to update the policy. As
the environment may well be stochastic, with transitions and rewards occurring
probabilistically, the same transition may need to be observed many times over
before the best choice can be made.
2.1.2 Markov Decision Processes
If we are to construct algorithms that learn policies with any guarantee of per-
formance, some theoretical restrictions need to be placed on this model. One
measure of the complexity of the problem is the amount of information nec-
essary to make accurate predictions about the outcomes of an agent’s actions.
Actions could have non-deterministic or stochastic effects on the state that may
depend on information hidden from the agent or on events long past. All these
possibilities complicate the process of prediction and thus make learning difficult.
To avoid this difficulty most reinforcement learning algorithms make a strong
assumption about the structure of the environment. They assume that it oper-
ates as a Markov Decision Process (MDP). An MDP describes a process that
has no hidden state or dependence on history. The outcomes of every action, in
terms of state transition and reward, obey fixed probability distributions that
depend only on the current state and the action performed.
Formally an MDP can be described as a tuple 〈S, A, T, R〉 where S is a finite
set of states, A is a finite set of actions, T : S × A × S → [0, 1] is a transition
function and R : S × A × ℝ → [0, 1] is a reward function, with:

T(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a)    (2.2)
R(r|s, a) = P(r_t = r | s_t = s, a_t = a)    (2.3)
which express the probability of ending up in state s′ and receiving reinforcement
r after executing action a in state s, respectively. These probabilities must be
independent of any criteria other than the values of s and a. This is called
the Markov Property. An in-depth treatment of the theory of Markov Decision
Processes can be found in (Bellman, 1957), (Bertsekas, 1987), (Howard, 1960)
or (Puterman, 1994).
Given this simplifying assumption, the best action to choose in any state
depends on that state alone. This means that the agent’s policy can be expressed
as a purely reactive mapping of states to actions, π : S → A. Furthermore every
state s can be assigned a value V π(s) that denotes the expected discounted
return if the policy π is followed:
V^π(s) = E{ R_t | ε(π, s, t) }    (2.4)
       = E{ Σ_{i=0}^{∞} γ^i r_{t+i} | ε(π, s, t) }    (2.5)
       = ∫_{−∞}^{+∞} r R(r|s, π(s)) dr + γ Σ_{s′∈S} T(s′|s, π(s)) V^π(s′)    (2.6)
where ε(π, s, t) denotes the event of policy π being initiated in state s at time t.
V^π is called the state-value function for policy π.
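Equation 2.6 characterises V^π as a fixed point, which can be found by simple iteration. A minimal Python sketch, assuming a finite MDP in which the reward integral is collapsed to an expected one-step reward R[s][a] (the data layout and function name are our illustrative assumptions):

```python
def evaluate_policy(states, policy, T, R, gamma, tol=1e-8):
    """Iterate Equation 2.6 to a fixed point.

    T[s][a][s2] : transition probability P(s2 | s, a)
    R[s][a]     : expected one-step reward for doing a in s
    policy[s]   : the action pi(s)
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v = R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in states)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Two-state example: 'go' moves s0 to the absorbing state s1 with reward 1.
T = {"s0": {"go": {"s0": 0.0, "s1": 1.0}},
     "s1": {"stay": {"s0": 0.0, "s1": 1.0}}}
R = {"s0": {"go": 1.0}, "s1": {"stay": 0.0}}
V = evaluate_policy(["s0", "s1"], {"s0": "go", "s1": "stay"}, T, R, 0.9)
```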
The optimal policy can now be defined simply as the policy π* that maximises
V^π(s) for all states s. The Markov property guarantees that such a globally
optimal policy exists (Sutton & Barto, 1998), although it may not be unique.
We define the optimal state-value function V*(s) as the state-value function of
the policy π*:

V*(s) = V^{π*}(s) = max_π V^π(s)    (2.7)
We can also define an optimal state-action value function Q*(s, a) in terms
of V*(s) as:

Q*(s, a) = E{ r_t + γ V*(s_{t+1}) | s_t = s, a_t = a }    (2.8)

This function expresses the expected discounted return if action a is executed in
state s and the optimal policy is followed thereafter. If such a function is known
then the optimal policy can be extracted from it simply:

π*(s) = argmax_{a∈A} Q*(s, a)    (2.9)
Thus the reinforcement learning problem can be transformed from learning
the optimal policy π* to learning the optimal state-action value function Q*.
This turns out to be a relatively simple dynamic programming problem. The
simplest and most commonly used solution is Watkins’ Q-Learning.
2.1.3 Q-Learning
Q-Learning (Watkins, 1989, Watkins & Dayan, 1992) is an online incremental
learning algorithm that learns an optimal policy for a given MDP by building
an approximate state-action value function Q(s, a) that converges to the optimal
function Q* in Equation 2.8 above. It is a simple algorithm which avoids
the complexities of modeling the functions R and T of the MDP by learning
Q directly from its experiences. It has significant practical limitations, but is
theoretically sound and has provided a foundation for many more complex algo-
rithms. Pseudocode for this algorithm is given in Algorithm 1.
Algorithm 1 Watkins’ Q-Learning
function Q-Learning
t← 0
Observe state st
while st is not a terminal state do
Choose action at ← π(st) according to an exploration policy
Execute at
Observe resulting state st+1 and reward rt
Q(s_t, a_t) ←α r_t + γ max_{a∈A} Q(s_{t+1}, a)
t← t + 1
end while
end Q-Learning
The approximate Q-function is stored in a table. Its initial values may be
arbitrarily chosen; typically they are all zero or else randomly assigned. At each
time-step an action is performed according to the policy dictated by the current
Q-function:
a_t = π(s_t) = argmax_{a∈A} Q(s_t, a)    (2.10)
The result of executing this action is used to update Q(st, at), according to
the temporal-difference rule:
Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t + γ max_{a∈A} Q(s_{t+1}, a))    (2.11)

where α is a learning rate, 0 ≤ α ≤ 1.
(The above expression is somewhat cumbersome. There are two operations
being described simultaneously which are not clearly differentiated. The first
operation is the temporal-difference step, which estimates the value of Q(st, at)
as:
Q_new = r_t + γ max_{a∈A} Q(s_{t+1}, a)
This value is the input to the second operation, which updates the existing value
of Q(st, at) towards this target value, using an exponentially weighted rolling
average with learning rate α:
Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α Q_new
To simplify the equations we shall henceforth use the short-hand notation:
X ←α Y
to indicate that the value of X is adjusted towards the target value Y via an
exponentially weighted rolling average with learning rate α, that is:
X ← (1− α)X + αY
Thus Equation 2.11 shall be written as:
Q(s_t, a_t) ←α r_t + γ max_{a∈A} Q(s_{t+1}, a)    (2.12)
This is not standard notation, however I believe it captures the important ele-
ments of the formula more clearly and concisely.)
The approximate state-action value function Q is proven to converge to the
optimal function Q* (and hence π to π*) given certain technical restrictions on
the learning rates (Σ_{t=1}^{∞} α_t = ∞ and Σ_{t=1}^{∞} α_t^2 < ∞) and the requirement that all
state-action pairs continue to be updated indefinitely (Watkins & Dayan, 1992,
Tsitsiklis, 1994, Jaakkola, Jordan, & Singh, 1994). This second requirement
means that in executing the learnt policy the agent must also do a certain pro-
portion of non-policy actions for the purposes of exploration. Exploration is
important in all the algorithms that follow. The standard approach to explo-
ration, followed in this work, is the ε-greedy algorithm which simply takes an
exploratory action with some small probability ε, and a policy action otherwise.
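Algorithm 1 together with ε-greedy exploration can be sketched in a few lines of Python. The environment interface (reset/actions/step) and the toy corridor domain below are our assumptions for illustration, not part of the thesis:

```python
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-Learning (Algorithm 1) with epsilon-greedy exploration."""
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q-table, initialised to zero
    for _ in range(episodes):
        s, terminal = env.reset(), False
        while not terminal:
            acts = env.actions(s)
            if rng.random() < epsilon:                 # exploratory action
                a = rng.choice(acts)
            else:                                      # greedy (policy) action
                a = max(acts, key=lambda a2: Q[(s, a2)])
            s2, r, terminal = env.step(s, a)
            # Temporal-difference update (Equation 2.12);
            # no bootstrap term at terminal states.
            target = r if terminal else r + gamma * max(
                Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q

class Corridor:
    """Toy domain: cells 0..4; reaching cell 4 gives reward 1 and ends the episode."""
    def reset(self):
        return 0
    def actions(self, s):
        return ["left", "right"]
    def step(self, s, a):
        s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
        return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

Q = q_learning(Corridor(), episodes=500)
# The learnt greedy policy moves right in every cell.
```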
There exist a number of other reinforcement learning algorithms that offer
alternative approaches to learning within this framework, but Q-Learning re-
mains the de facto baseline upon which other research is built and against which
it is compared. For a more thorough examination of the alternatives, the reader
is referred to the comprehensive treatment in (Sutton & Barto, 1998).
One algorithm that any future researcher should certainly consider is SARSA(λ)
(Sutton, 1996). This is rapidly gaining ground as a contender to Q-Learning as
a baseline reinforcement learning algorithm.
2.1.4 The curse of dimensionality
While Q-Learning and related reinforcement learning algorithms have strong
theoretical convergence properties, they often perform very poorly in practice
(Bellman, 1961). Optimal policies can be found for toy problems, but the algo-
rithms generally fail to scale up to realistic control problems. Without doing a
full analysis of the algorithm, we can observe certain factors which contribute to
this failure.
To find the optimal policy, a Q-value must be learnt for every state-action
pair. This means, first of all, that every such pair needs to be explored at least
once. So convergence time is at best O(|S| · |A|). Real-world problems typically
have large multi-dimensional state spaces. |S| is exponential in the number of
dimensions, so each extra dimension added to a problem multiplies the time it
takes.
Furthermore states are generally only accessible from a handful of close neigh-
bours, so the distance between any pair of states in terms of action steps also
increases with the size and dimensionality of the space. Yet a change in the
value of one state may have consequences for the policy in a far distant state.
As information can only propagate from one state to another through individ-
ual state transitions, the further apart two states are, the longer it will take for
this information to be propagated. Thus the diameter of the state space is an
additional factor in the time required to reach convergence.
A general-purpose solution to this problem has not yet been found. There
have been many attempts to represent the table of Q-values more compactly
by using one variety of function approximator or another, such as neural net-
works (Sutton, 1987, Rummery & Niranjan, 1994), CMACs (Sutton, 1995,
Santamaría, Sutton, & Ram, 1998), or locally weighted regression (Atkeson, Moore,
& Schaal, 1997). These have met with mixed success. Sometimes the result-
ing state-abstraction has enabled the learning algorithm to converge in times
faster by several orders of magnitude for a particular domain (e.g. Tesauro,
1994, Zhang & Dietterich, 1995, Baxter, Tridgell, & Weaver, 1998), but no such
approach has proven to be a general-purpose solution. What works well in one
domain will often fail spectacularly in another. Significant theoretical results
have been produced for off-line evaluation of stationary policies using both lin-
ear function approximators (Tsitsiklis & Roy, 1997), and also general agnostic
function approximators (Papavassiliou & Russell, 1999), but practical results
based on these methods are still outstanding. For a summary of different func-
tion approximation methods applied to RL, see (Kaelbling, Littman, & Moore,
1996).
As a result of these difficulties researchers have turned from seeking general-
purpose to special-purpose solutions. It has been recognised that a number
of the most successful applications of reinforcement learning have used signif-
icant task-specific background knowledge tacitly incorporated into the agent’s
representation of its states and actions. Focus is shifting towards creating an
architecture by which this tacit information can become explicit and can be
represented in a systematic way. The aim is to create systems that can bene-
fit from the programmer’s task-specific knowledge whilst maintaining desirable
theoretical properties of convergence.
2.2 Hierarchical Reinforcement Learning in Theory
Significant attention has recently been given to hierarchical decomposition as
a means to this end. “Hierarchical reinforcement learning” (HRL) is the name
given to a class of learning algorithms that share a common approach to scaling
up reinforcement learning. Their origins lie partly with behaviour-based tech-
niques for robot programming (Brooks, 1986; Maes, 1990; Mataric, 1996) and
partly with the hierarchical methods used in symbolic planning (Korf, 1987, Iba,
1989, Knoblock, 1991), particularly HTN planning (Tate, 1975, Sacerdoti, 1977,
Erol, Hendler, & Nau, 1994). What they have in common with these techniques
is the intuition that a complex problem can be solved by decomposing it into a
collection of smaller problems.
Hierarchical reinforcement learning accelerates learning by forcing a structure
on the policies it learns. The reactive state-to-action mapping of Q-learning is
replaced by a hierarchy of temporally-abstract actions. These are actions that
operate over several time-steps. Like a subroutine or procedure call, once a
temporally abstract action is executed it continues to control the agent until it
terminates, at which point control is restored to the main policy. These actions
(variously called subtasks, behaviours, macros, options, or abstract machines de-
pending on the particular algorithm in question) must themselves be further
decomposed into one-step actions that the agent can execute. We shall hence-
forth refer to one-step actions as primitive actions and temporally-abstract ac-
tions as behaviours. Policies learnt using primitive actions alone shall be called
monolithic to distinguish them from hierarchical or behaviour-based policies.
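The common core of these temporally-abstract actions can be captured in a small data structure, in the spirit of the options framework of Sutton et al. (1999): an applicability (initiation) set, an internal policy, and a termination condition. This Python sketch uses illustrative names and a made-up one-dimensional domain, not any API from the thesis:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Behaviour:
    """A temporally-abstract action: where it may start, what it does,
    and when it stops. Field names are illustrative."""
    name: str
    applicable_in: Callable[[int], bool]   # may the behaviour start in s?
    policy: Callable[[int], str]           # internal state -> primitive action
    terminates_in: Callable[[int], bool]   # has the behaviour finished in s?

def run_behaviour(b, s, step):
    """Execute b until it terminates, like a subroutine call: once chosen,
    it keeps control of the agent. `step(s, a)` is an assumed one-step model."""
    assert b.applicable_in(s), "behaviour chosen outside its applicability space"
    trace = [s]
    while not b.terminates_in(s):
        s = step(s, b.policy(s))
        trace.append(s)
    return trace

# A behaviour that walks right along a number line until it reaches cell 5.
go_to_5 = Behaviour("GoTo5",
                    applicable_in=lambda s: s <= 5,
                    policy=lambda s: "right",
                    terminates_in=lambda s: s == 5)
step = lambda s, a: s + 1 if a == "right" else s - 1
trace = run_behaviour(go_to_5, 2, step)   # traverses 2, 3, 4, 5 in one "jump"
```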
How does this decomposition aid us? There are two different ways. One,
it allows us to limit the choices available to the agent, even to the point of
hard-coding parts of the policy; and two, it allows us to specify local goals for
certain parts of the policy. Different HRL algorithms implement these features
in different ways. Some implement one and not the other. We shall postpone
describing specific algorithms until Section 2.3, and for the moment present these
features in more general terms, with the aid of an example.
2.2.1 A Motivating Example
Figure 2.2: An example world
Figure 2.2 shows an example world we shall use to illustrate the concepts in
this thesis. Imagine that the learning agent is a house-hold robot in a house with
the layout shown. Its purpose is to fetch objects from one room to another. It
can sense its location to the precision of the grid cells shown,
and its primitive actions enable it to navigate from a cell to any of its eight
neighbours, with a small probability of error.
If the robot is in the same cell as an object, it can pick it up and carry it.
There are two objects in the world that we are interested in. In the kitchen in
the north-west corner of the map is a machine which dispenses a cup of coffee.
In the second bedroom there is a book, also indicated on the map. The robot
starts at its docking location in the study. Its goal will vary from example to
example as we consider different aspects of HRL (and later, of planning).
In this world we have 15,000 states (75×50 cells, with two different states for
each object, depending on whether the robot is holding it or not) and 9 primitive
actions (each of the 8 compass directions, plus the pickup action). This is not in
itself a complex world, and most goals will be relatively easy to complete, but
it is certainly one that can be made simpler by providing an appropriate set of
behaviours.
The obvious behaviours to specify are: Go(Room1, Room2) which moves the
robot between two neighbouring rooms, and Get(Object, Room) which moves
towards and picks up the specified object when the robot is in the same room
as it. We will discuss how these behaviours are implemented as we examine
individual techniques.
2.2.2 Limiting the Agent’s Choices
Since learning time is dominated by the number of state-action pairs that need to
be explored, the obvious way to accelerate the process is to cut down the number
of such pairs. Using background knowledge we can identify action choices which
are plainly unhelpful and eliminate them from the set of possible policies. There
is a variety of ways in which this can be done.
Limiting Available Primitive Actions
The simplest solution is to hard-code portions of the policy. Some or all of the
internal operation of a behaviour can be written by hand by the trainer. This
removes the need for the agent to do any kind of learning at all for significant
portions of the state space, which will immediately improve performance. This
assumes however that the trainer is able to do this. Part of the point of learn-
ing policies is to relieve the trainer of the need to specify them, so this may
be of limited use. Still, there are some situations in which simple behaviours
might be wholly or partially specified, and algorithms have been designed to
take advantage of this.
Less drastically, the internal policy of a behaviour could be learnt using only
a limited subset of all available primitive actions. This is useful if the trainer
knows that certain primitive actions are only suitable for particular behaviours
and not for others. From the example, the Go() behaviours could reasonably be
limited to only use the primitive actions which move the robot, and ignore the
pickup action, which would be of no use to that behaviour.
Limiting Available Behaviours
Likewise, limits can be placed on which behaviours are available to the agent at
different times. Behaviours are generally limited in scope, so they often can only
be executed from a subset of all possible states. For instance the Get() behaviour
can only be applied when the agent is in the same room as the target object.
The set of states in which a behaviour can be applied is called its applicability
space. Learning algorithms should not allow the agent to choose a behaviour in
a state in which it is not applicable.
However this may not be limiting enough. As more ambitious problems are
tackled, the repertoire of behaviours available to an agent is likely to become
large, and many behaviours will have overlapping applicability spaces. It is of
no use to limit the internal policy choices of behaviours if choosing between the
behaviours becomes just as difficult.
To this end, most HRL algorithms implement some kind of task hierarchy
to limit the choice of behaviours to those that are appropriate to the agent’s
situation. Consider the situation in the example world when the robot is in the
hall with the goal of fetching both the book and the coffee. There are six ap-
plicable behaviours: Go(hall, study), Go(hall, dining), Go(hall, bedroom1),
Go(hall, bathroom), Go(hall, bedroom2), and Go(hall, lounge). Of these, only
two are appropriate: Go(hall, dining), if the agent decides to fetch the coffee
first, and Go(hall, bedroom2) if the agent decides to fetch the book. Exploring
the others is a waste of time. The trainer, who specified the behaviours, should
realise this and incorporate it into the task hierarchy, limiting the agent’s choices
in this situation to one of these two behaviours. The larger an agent’s repertoire
of behaviours becomes, the more critical this kind of background knowledge is
going to become.
Committing to Behaviours
Finally, choices are limited by requiring long-term commitment to a behaviour.
It is conceivable that a learning algorithm could be written which implemented
hard-coded behaviours but allowed the agent to choose a different behaviour on
every time step. Such an algorithm would hardly be any better than learning
a primitive policy directly, and could easily be worse. Long-term commitment
to behaviours has two benefits. First, a single behaviour can traverse a long
sequence of states in a single “jump”, effectively reducing the diameter of the
state-space and propagating rewards more quickly. In the grid-world, for exam-
ple, fetching both the coffee and the book takes 126 primitive actions, but can
be done with a sequence of just 10 behaviours.
Second, a behaviour can “funnel” the agent into a particular set of terminat-
ing states. These states are then the launching points for new behaviours. If no
behaviour ever terminates in a given state, then no policy needs to be learnt for
that state. Again, referring to the grid-world, each Go() behaviour terminates
in one of the six cells surrounding a doorway, in one of four possible configurations
of what the robot is holding. There are 10 doors, so this yields 240 states.
Each Get() behaviour terminates in the same location as the target object with
2 possible configurations of what the agent is holding, yielding a further 4 states.
Plus 1 starting state gives a total of 245 states in which the agent needs to learn
to choose a behaviour, out of a possible 15,000. This is a significant reduction
in the size of the policy-space and will result in much faster learning.
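The arithmetic behind the 245-state figure can be checked directly (the variable names are ours; the counts are those given above):

```python
# Go() behaviours terminate in the 6 cells around each of the 10 doors,
# in any of 4 holding configurations ({coffee?} x {book?}).
go_states = 10 * 6 * 4          # = 240

# Each Get() behaviour terminates at its object's location, with 2
# possible configurations of the other object; there are 2 objects.
get_states = 2 * 2              # = 4

start_states = 1                # the docking location in the study
choice_states = go_states + get_states + start_states
assert choice_states == 245     # versus 15,000 states in the full problem
```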
Flexible Limitations
Limiting the policy space in this fashion will clearly have an effect on optimality.
If the optimal policy does not fit the hierarchical structure, then any policy
produced by a hierarchical reinforcement learner will be sub-optimal. This may
well be satisfactory, but if not, it is possible to some degree to have the best
of both worlds by imposing structure on the policy during the early phase of
learning and relaxing it later. This allows the agent to learn a near-to-optimal
policy quickly and then refine it to optimality in the long-term. Such techniques
shall be described in more detail in Section 2.4.
2.2.3 Providing Local Goals
So far we have assumed that all choices the agent makes, at any point in the
hierarchy, are made to optimise the one global reward function. Such a policy
is said to be hierarchically optimal. A hierarchically optimal policy is the best
possible policy within the confines of the hierarchical structure imposed by the
trainer.
Hierarchical optimality, however, contradicts part of the intuition of behaviour-
based decomposition of problems. The idea that a problem can be decomposed
into several independent subparts which can be solved separately and recombined
no longer holds true. The solution to each subpart must be made to optimise
the whole policy, and thus depends on the solutions to every other subpart. The
internal policy for a behaviour depends on its role in the greater task.
Consider, for example, the behaviour Go(hall, bedroom2) in the grid-world
problem. Figure 2.3 shows two possible policies for this behaviour. Assume,
for the moment that diagonal movement is impossible. Which of these policies
is hierarchically optimal? The answer depends on the context in which it is being
used. If the agent’s overall goal was to reach the room as soon as possible, then
the policy in Figure 2.3(a) is preferable. If, on the other hand, the goal is to
pick up the book, then the policy in Figure 2.3(b) is better, as it will result in a
shorter overall path to the book.
Furthermore, the same behaviour may have different internal policies in dif-
ferent parts of the problem. For instance, if the agent’s goal is to fetch the book,
carry it to another room and then return to the bedroom, then the first in-
stance of Go(hall, bedroom2) will use the policy in Figure 2.3(b) and the second
instance will use the policy in Figure 2.3(a).
An alternative is to define local goals for each behaviour in terms of a
behaviour-specific reward function. The behaviour’s internal policy is learnt
to optimise this local reward, rather than the global reward. This is called re-
cursive optimality and is a weaker form than hierarchical optimality. Recursively
optimal policies make best use of the behaviours provided to them, but cannot
control what the behaviours themselves do, and so cannot guarantee policies
that are as efficient as hierarchically optimal policies.
The advantages of this approach, however, are several. First of all, learning
an internal policy using a local reward function is likely to be much faster than
learning with a global one. The behaviour can be learnt independently, without
reference to the others. Local goals are generally simpler than global goals, and
local rewards occur sooner than global ones. So each individual behaviour will
be learnt more quickly.¹

¹ It has been suggested (Dietterich, personal communication) that subgoal hints could be provided to a hierarchically optimal learner. A temporary reward shaping mechanism, which adds extra components to the reward function, could encourage the agent to achieve a particular behaviour’s subgoal. Such extra rewards would be phased out over time so that the
(a) A policy which optimises the number of steps to enter the room
(b) A policy which optimises the number of steps to reach the book
Figure 2.3: Two different internal policies for the behaviour Go(hall, bedroom2).
Furthermore, local goals often allow state abstraction. Elements of the state
that are irrelevant to a local reward function can be ignored when learning the
behaviour. So, for example, if the Go(hall, bedroom2) behaviour had a local
reward function which rewarded the agent for arriving in the bedroom, then the
internal policy for the behaviour could ignore what the robot is carrying. This
would reduce the size of the state space for this behaviour by a factor of four.
Finally, local goals allow re-use. Once a behaviour has been learnt in one
context, it can be used again in other contexts without having to re-learn its
internal policy. This is useful not only in a life-long learning agent, but also
when the same behaviour is employed several different times within the one
policy.
The decision whether or not to include local goals is a trade-off between
optimality and learning speed. In the ideal case, when local rewards exactly
match the projected global rewards, the policies learnt will be identical. However
this is unlikely to occur, and so we must decide which measure of performance
is more important to us. In practice different researchers have chosen different
approaches, as will become apparent in Section 2.3.
2.2.4 Semi-Markov Decision Processes: A Theoretical Framework
So far we have described hierarchical reinforcement learning in abstract terms.
We have assumed that choosing between behaviours can be done in much the
same way as choosing primitive actions in monolithic reinforcement learning, to
optimise the expected discounted return. There is, however, a fundamental dif-
ference between monolithic and hierarchical reinforcement learning: behaviours
are temporally extended where primitive actions are not. Executing a behaviour
will produce a sequence of state-transitions, yielding a sequence of rewards. The
MDP model that was explained in Section 2.1.2 is limited insofar as it assumes
each action will take a single time-step. A new theoretical model is needed to
take this difference into account.
Semi-Markov Decision Processes are an extension of the MDP model to in-
clude a concept of duration, allowing multiple-step (and indeed continuous time)
final policy is hierarchically optimal. However, to us this seems to be an inelegant way of achieving the same result as learning both a recursively optimal and a hierarchically optimal policy simultaneously, and passing control from one to the other.
actions. Formally an SMDP is a tuple 〈S, B, T, R〉, where S is a set of states, B
is a set of temporally-abstract actions, T : S × B × S × ℝ → [0, 1] is a transition
function (including duration of execution), and R : S × B × ℝ → [0, 1] is a reward
function:

T(s′, k|s, B) = P(B_t terminates in s′ at time t + k | s_t = s, B_t = B)    (2.13)
R(r|s, B) = P(r_t = r | s_t = s, B_t = B)    (2.14)
T and R must both obey the Markov property, i.e. they can only depend on the
behaviour and the state in which it was started.
A policy is a mapping π : S → B from states to behaviours. A state-value
function can be given as:
V^π(s) = ∫_{−∞}^{+∞} r R(r|s, π(s)) dr + Σ_{s′,k} T(s′, k|s, π(s)) γ^k V^π(s′)    (2.15)
Semi-Markov Decision Processes are designed to model any continuous-time
discrete-event system. Their purpose in hierarchical reinforcement learning is
more constrained. Executing a behaviour results in a sequence of primitive
actions being performed. The value of the behaviour is equal to the value of that
sequence. Thus if behaviour B is initiated in state st and terminates sometime
later in state st+k then the SMDP reward value r is equal to the accumulation
of the one-step rewards received while executing B:
r = r_t + γ r_{t+1} + γ^2 r_{t+2} + · · · + γ^{k−1} r_{t+k−1}    (2.16)
Thus the state-value function in Equation 2.15 above becomes:

V^π(s) = E{ Σ_{i=0}^{∞} γ^i r_{t+i} | ε(π, s, t) }    (2.17)
which is identical to the state-value function for primitive policies shown previ-
ously in Equation 2.4. We can define an optimal behaviour-based policy π* with
the optimal state-value function V^{π*} as:

V^{π*}(s) = max_π V^π(s)    (2.18)
Since the value measure V^π for a behaviour-based policy π is identical to
the value measure V^π for a primitive policy, we know that π* yields the optimal
primitive policy over the limited set of policies that our hierarchy allows.
2.2.5 Learning behaviours
Learning internal policies of behaviours can be expressed along the same lines.
Formally, let B.π be the policy of behaviour B, and B.A be the set of sub-actions
(either behaviours or primitives) available to B. Let Root indicate the root
behaviour, with reward function equal to that of the original (MDP) learning
task. The recursively optimal policy has:
B.π*(s) = argmax_{a∈B.A} B.Q*(s, a)    (2.19)
where B.Q*(s, a) is the optimal state-action value function for behaviour B,
according to its local reinforcement function B.r (defined by the trainer in
accordance with the behaviour’s goals).
In contrast, the hierarchically optimal policy has
B.π*(stack, s) = argmax_{a∈B.A} Root.Q*(stack, s, a)    (2.20)
where stack = {Root, . . . , B} is the calling stack of behaviours and Root.Q* is
the state-action value function according to the root reinforcement function. The
stack is a necessary part of the input to a hierarchically optimal policy, as the
behaviour may operate differently in different calling contexts. (Hierarchically
optimal policies do not allow local goals for behaviours, so B.r and B.Q* are not
defined.)
2.3 Hierarchical Reinforcement Learning in Practice
We have discussed the expected benefits of hierarchical reinforcement learning in
abstract terms without referring to any particular algorithm, to show what mo-
tivates its exploration. Historically a large number of different implementations
have been proposed (e.g. Dayan & Hinton, 1992, Lin, 1993, Kaelbling, 1993)
but only recently have they been developed into a strong theoretical framework
that has been commonly agreed upon. Even so, there are several current imple-
mentations that differ significantly in which elements they emphasise and how
they approach the problem. We shall focus on four of the most recent offerings:
SMDP Q-Learning, HSMQ-Learning, MAXQ-Q and HAMQ-Learning.
2.3.1 Semi-Markov Q-Learning
The simplest algorithm extends Watkins’ Q-Learning to include temporally ab-
stract behaviours with hard-coded internal policies. Such an approach is exam-
ined by Sutton et al. (1999) and they call these behaviours options. Assuming
these options obey the Semi-Markov property, an optimal policy can be learnt
in a manner analogous to Watkins’ Q-Learning. The algorithm is called SMDP
Q-Learning and is shown as Algorithm 2.
Algorithm 2 SMDP Q-Learning
function SMDPQ
    t ← 0
    Observe state s_t
    while s_t is not a terminal state do
        Choose behaviour B_t ← π(s_t) according to an exploration policy
        totalReward ← 0
        discount ← 1
        k ← 0
        while B_t has not terminated do
            Execute B_t
            Observe reward r
            totalReward ← totalReward + discount × r
            discount ← discount × γ
            k ← k + 1
        end while
        Observe state s_{t+k}
        Q(s_t, B_t) ←α totalReward + discount × max_{B∈B} Q(s_{t+k}, B)
        t ← t + k
    end while
end SMDPQ
Just as primitive Q-Learning learns a state-action value function, so SMDP
Q-Learning learns a state-behaviour value function Q : S × B → ℝ, which is an
approximation to the optimal state-behaviour value function Q?:
Q*(s, B) = E{ Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V*(s_{t+k}) | ε(s, B, t) }    (2.21)
where k is the duration of the behaviour B, and ε(s, B, t) indicates the event of
executing behaviour B in state s at time t.
The optimal policy is defined as before:
π*(s) = argmax_{B∈B} Q*(s, B)    (2.22)
The approximation Q(s, B) can be learnt via the update rule (analogous to the
Q-Learning update rule in Equation 2.12):
Q(s_t, B_t) ←α R_t + γ^k max_{B∈B} Q(s_{t+k}, B)    (2.23)
where k is the duration of Bt and Rt is a discounted accumulation of all single-
step reinforcement values received while executing the behaviour:
R_t = Σ_{i=0}^{k−1} γ^i r_{t+i}    (2.24)
SMDP Q-Learning can be shown to converge to the optimal behaviour-based
policy under circumstances similar to those for 1-step Q-Learning (Parr, 1998).
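The update at the heart of Algorithm 2 can be made concrete with a short Python sketch. The environment and behaviour interfaces here (`reset`, `is_terminal`, `step`, `policy`, `terminated`) are hypothetical names chosen for this illustration, not part of the thesis's framework:

```python
import random

def smdp_q_learning(env, behaviours, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=100):
    """Tabular SMDP Q-Learning over temporally abstract behaviours (options)."""
    Q = {}                                  # Q[(state, behaviour name)] -> value
    def q(s, b):
        return Q.get((s, b.name), 0.0)
    for _ in range(episodes):
        s = env.reset()
        while not env.is_terminal(s):
            # choose a behaviour epsilon-greedily
            if random.random() < epsilon:
                B = random.choice(behaviours)
            else:
                B = max(behaviours, key=lambda b: q(s, b))
            # run B to termination, accumulating the discounted return R_t
            total_reward, discount, s2 = 0.0, 1.0, s
            while True:
                s2, r = env.step(B.policy(s2))
                total_reward += discount * r
                discount *= gamma           # discount = gamma^k after k steps
                if B.terminated(s2) or env.is_terminal(s2):
                    break
            # SMDP backup: R_t + gamma^k max_B' Q(s_{t+k}, B')
            target = total_reward + discount * max(q(s2, b) for b in behaviours)
            Q[(s, B.name)] = q(s, B) + alpha * (target - q(s, B))
            s = s2
    return Q
```

Note that the inner loop accumulates the discounted return and the discount factor γ^k simultaneously, exactly as Algorithm 2 does.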
2.3.2 Hierarchical Semi-Markov Q-Learning
Hierarchical Semi-Markov Q-Learning (HSMQ) (Dietterich, 2000b) is a recur-
sively optimal learning algorithm that learns reactive behaviour-based policies,
with a trainer-specified task hierarchy. As shown in Algorithm 3 it is a simple
elaboration of the SMDP Q-Learning algorithm. The SMDPQ update rule given
in Equation 2.23 is applied recursively with local reward functions at each level
of the hierarchy. TaskHierarchy is a function which returns a set of available
actions (behaviours or primitives) that can be used by a particular behaviour in
a given state. This hierarchy is hand-coded by the trainer based on knowledge
of which actions are appropriate on what occasions.
HSMQ-Learning can be proven to converge to a recursively optimal policy
with the same kinds of requirements as SMDP Q-Learning, provided also that
the exploration policy for behaviours is greedy in the limit (Singh, Jaakkola,
Littman, & Szepesvari, 2000).
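The recursive structure of HSMQ (shown as Algorithm 3) can be sketched in Python as follows. The behaviour interface (`is_primitive`, `terminated`, a local reward function `r`) and the `hierarchy(s, B)` function standing in for TaskHierarchy are assumptions of this illustration:

```python
import random

def hsmq(env, hierarchy, behaviour, s, Q, gamma=0.9, alpha=0.1, epsilon=0.1):
    """One invocation of HSMQ-Learning. Returns the list of (s, a, s')
    transitions generated while `behaviour` was executing."""
    if behaviour.is_primitive:
        s2 = env.execute(behaviour)         # execute primitive, observe s'
        return [(s, behaviour, s2)]
    def q(b, st, a):
        return Q.get((b.name, st, a.name), 0.0)
    seq = []
    while not behaviour.terminated(s):
        actions = hierarchy(s, behaviour)
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: q(behaviour, s, x))
        sub = hsmq(env, hierarchy, a, s, Q, gamma, alpha, epsilon)
        # accumulate the behaviour's *local* reward over the sub-sequence
        total = sum((gamma ** k) * behaviour.r(si, ai, sj)
                    for k, (si, ai, sj) in enumerate(sub))
        k, s2 = len(sub), sub[-1][2]
        # recursive SMDPQ backup with the local reward function B.r
        target = total + (gamma ** k) * max(q(behaviour, s2, b)
                                            for b in hierarchy(s2, behaviour))
        Q[(behaviour.name, s, a.name)] = (q(behaviour, s, a)
                                          + alpha * (target - q(behaviour, s, a)))
        seq += sub
        s = s2
    return seq
```

Each level of the hierarchy thus runs the SMDPQ update with its own reward function, which is what makes the learnt policy recursively rather than hierarchically optimal.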
2.3.3 MAXQ-Q
A more sophisticated algorithm for learning recursively optimal policies is Di-
etterich’s MAXQ-Q (Dietterich, 2000a). The policies it learns are equivalent to
those of HSMQ, but it uses a special decomposition of the state-action value
Algorithm 3 HSMQ-Learning
function HSMQ(state s_t, action a_t)
returns sequence of state transitions {⟨s_t, a_t, s_{t+1}⟩, . . .}
    if a_t is primitive then
        Execute action a_t
        Observe next state s_{t+1}
        return {⟨s_t, a_t, s_{t+1}⟩}
    else
        sequence S ← {}
        behaviour B ← a_t
        A_t ← TaskHierarchy(s_t, B)
        while B is not terminated do
            Choose action a_t ← B.π(s_t) from A_t
                according to an exploration policy
            sequence S′ ← HSMQ(s_t, a_t)
            k ← 0
            totalReward ← 0
            for each ⟨s, a, s′⟩ ∈ S′ do
                totalReward ← totalReward + γ^k B.r(s, a, s′)
                k ← k + 1
            end for
            Observe next state s_{t+k}
            A_{t+k} ← TaskHierarchy(s_{t+k}, B)
            B.Q(s_t, a_t) ←α totalReward + γ^k max_{a∈A_{t+k}} B.Q(s_{t+k}, a)
            S ← S + S′
            t ← t + k
        end while
        return S
    end if
end HSMQ
function in order to learn them more efficiently. MAXQ-Q relies on the obser-
vation that the value of a behaviour B as part of its parent behaviour P can be
split into two parts: the reward expected while executing B, and the discounted
reward of continuing to execute P after B has terminated. That is:
P.Q(s, B) = P.I(s, B) + P.C(s, B) (2.25)

where P.I(s, B) is the expected total discounted reward (according to the reward
function of the parent behaviour P) received while executing behaviour B from
initial state s, and P.C(s, B) is the expected total reward of continuing to execute
P after B has terminated, discounted appropriately to take into account the time
spent in B. (Again with rewards calculated according to the behaviour P.)
Furthermore, the I function can itself be recursively decomposed into I and
C via the rule:

P.I(s, B) = max_{a∈B.A} B.Q(s, a) (2.26)
This decomposition has several advantages for learning recursively optimal
Q-values; in particular, the I and C functions can each exploit state abstractions
that would not apply to the combined Q function. The explanation is complex
and beyond the scope of this review. For full details and pseudocode
see (Dietterich, 2000a).
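As a minimal illustration of how the decomposition evaluates, the recursion above can be written directly in Python. The dictionary-based tables are assumptions of this sketch; MAXQ-Q itself learns the completion values C (Dietterich, 2000a):

```python
def maxq_v(B, s, children, C, prim_r):
    """Recursive MAXQ value decomposition:
        V(B, s) = prim_r[(B, s)]                    if B is primitive
                = max_a [ V(a, s) + C[(B, s, a)] ]  otherwise
    where C[(P, s, a)] is the completion value of child a inside parent P."""
    if B not in children:                   # primitive action: expected reward
        return prim_r[(B, s)]
    return max(maxq_v(a, s, children, C, prim_r) + C[(B, s, a)]
               for a in children[B])
```

For example, with a two-level hierarchy `root → nav → {left, right}`, the value of `root` in a state is the best child's value plus the stored completion value for that child.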
2.3.4 Q-Learning with Hierarchies of Abstract Machines
Q-Learning with Hierarchies of Abstract Machines (HAMQ) (Parr, 1998; Parr &
Russell, 1998) is an hierarchically optimal learning algorithm that uses a more
elaborate model to structure the policy space. Behaviours are implemented as
hierarchies of abstract machines (HAMs) which resemble finite-state machines, in
that they include an internal machine state. The state of the machine dictates
the action it may take. Actions include: 1) performing primitive actions, 2)
calling other machines as subroutines, 3) making choices, 4) terminating and
returning control to the calling behaviour. Transitions between machine states
may be deterministic, stochastic or may rely on the state of the environment.
Learning takes place at choice states only, where the behaviour must decide which
of several internal state transitions to make. HAMs represent a compromise
between hard-coded policies and fully-learnt policies. Some transitions can be
hard-coded into the machine while others can be learnt. Thus they allow for
background knowledge in the form of partial solutions to be specified.
Behaviours in HAMQ are merely a typographic convenience. In effect they
are compiled into a single abstract machine, consisting of action nodes and
choice nodes only. Algorithm 4 shows pseudocode for learning in such a
machine.
Andre and Russell (2000) have extended the expressive
power of HAMs by introducing parameterisation, aborts and interrupts, and
memory variables. These “Programmable HAMs” allow quite complex program-
matic description of behaviours, while also providing room for exploration and
optimisation of alternatives.
2.4 Termination Improvement
Start Goal
Figure 2.4: A simple navigation task illustrating the advantage of termination improvement. The circles show the overlapping applicability spaces for a collection of hard-coded navigation behaviours. Each behaviour moves the agent towards the central landmark location (the black dots). The heavy line indicates the standard policy with commitment to behaviours. The lighter line indicates the path taken by a termination-improved policy.
In Section 2.2.2 above, we discussed the importance of long-term commitment
to behaviours. Without this, much of the benefit of using temporally abstract
Algorithm 4 HAMQ-Learning
function HAMQ
    t ← 0
    node ← starting node
    totalReward ← 0
    k ← 0
    choice a ← null
    choice state s ← null
    choice node n ← null
    while s is not a terminal state do
        if node is an action node then
            Execute action
            Observe reward r
            totalReward ← totalReward + γ^k r
            k ← k + 1
            node ← node.next
        else (node is a choice node)
            Observe state s′
            if n ≠ null then
                Q(n, s, a) ←α totalReward + γ^k max_{a′∈A} Q(node, s′, a′)
                totalReward ← 0
                k ← 0
            end if
            n ← node
            s ← s′
            Choose transition a ← π(n, s) according to an exploration policy
            node ← a.destination
        end if
    end while
end HAMQ
actions is lost. However, commitment can also be an obstacle to producing optimal
policies. Consider the situation illustrated in Figure 2.4. The task is to navigate
to the indicated goal location. Behaviours are represented by dotted circles
and black dots indicating the application space and terminal states respectively.
The heavy line shows a path from the starting location to the goal, using the
behaviours provided. The path travels from one termination state to the next,
indicating that each behaviour is being executed all the way to completion.
Compare this with the path shown by the lighter line. In this case each be-
haviour is executed only until a more appropriate behaviour becomes applicable.
“Cutting corners” in this way results in a significantly shorter path, and a policy
much closer to the optimal one.
This example is taken from the work of Sutton, Singh, Precup, and Ravindran
(1999), who call this process termination improvement. They show how to pro-
duce such corner-cutting policies using hard-coded behaviours. Having already
learnt an optimal policy π using these behaviours, they transform it into an im-
proved interrupted policy π′ by prematurely interrupting an executing behaviour
B whenever Q(s, B) < V (s), i.e. when there is a better alternative behaviour
available. The resulting policy is guaranteed to be of equal or greater efficiency
than the original.
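The interruption rule can be stated in a few lines of Python. This is a sketch of the policy transformation only, assuming a learnt table Q over (state, behaviour) pairs; the function and variable names are illustrative:

```python
def improved_step(s, current, behaviours, Q):
    """Termination-improvement rule: interrupt the executing behaviour
    whenever Q(s, B) < V(s), i.e. a strictly better behaviour is available."""
    v = max(Q.get((s, b), 0.0) for b in behaviours)  # V(s) = max_B Q(s, B)
    if current is None or Q.get((s, current), 0.0) < v:
        # switch to a greedy behaviour
        return max(behaviours, key=lambda b: Q.get((s, b), 0.0))
    return current                                   # otherwise keep committing
```

Because the executing behaviour is kept whenever it is still among the best, the transformed policy never scores worse than the committed one.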
A similar approach can be applied to policies learnt using MAXQ-Q (Di-
etterich, 2000a). While MAXQ-Q is a recursively optimal learning algorithm,
it nevertheless learns a value for each primitive action using the global reward
function. In normal execution, actions are chosen on the basis of the local Q-
value assigned to each by its calling behaviour. However once such a recursively
optimal policy has been learnt, it can be improved by switching to selecting
primitive actions based on their global Q-value instead. There is no longer any
commitment to behaviours. Execution reverts to the reactive semantics of mono-
lithic Q-learning, and the hierarchy serves only as a means to assign Q-values
to primitives. This is called the hierarchical greedy policy, and is also guaran-
teed to be of equal or greater efficiency than the recursive policy. Furthermore,
by continuing to update these Q-values via polling execution (Kaelbling, 1993;
Dietterich, 1998), this policy can be further improved.
In both these algorithms it is important that the transformation is applied only
once an uninterrupted policy has already been learnt. Without this delay the
advantages of using temporally abstract actions would be lost.
2.5 Producing the hierarchy
As stated earlier, typically the hierarchy of behaviours is defined by a human
trainer. Many researchers have pointed to the desirability of automating this task
(e.g. Boutilier, Dean, & Hanks, 1999; Hauskrecht, Meuleau, Kaelbling, Dean, &
Boutilier, 1998). This work is one approach to that problem.
Another quite different approach is the HEXQ algorithm (Hengst, 2002).
This algorithm is an extension of MAXQ-Q which attempts to automatically
decompose a problem into a collection of subproblems. Sub-problems are created
corresponding to particular variables in the state-vector. Variables that change
infrequently inspire behaviours which aim to cause those variables to change.
A similar approach is used by acQuire (McGovern & Barto, 2001). It uses
exploration to identify “bottlenecks” in the state-space – states which are part of
many trajectories through the space. Bottleneck states are selected as subgoals
for new behaviours.
Both these approaches implement a kind of blind behaviour invention, based
only on the dynamics of the world without any background knowledge. In this
thesis we take a much less radical approach, keeping the trainer but expressing
her knowledge in a more flexible form. Ultimately it would be good to have
systems which can both accept information from a trainer and discover it from
the world.
2.6 Other work
A few other reinforcement learning techniques need to be mentioned due to
their apparent similarity to the work in this thesis. In each case the similarity is
mostly superficial, but it is worth describing each to acknowledge the possibilities
of different approaches and distinguish them from my own.
2.6.1 Model-based Reinforcement Learning
Not all reinforcement learning algorithms are model-free like Q-learning. There
also exist algorithms which attempt to learn models of the transition and reward
functions T (s′|s, a) and R (r|s, a), as described in Section 2.1.2 above, and then
use these models to create policies, using value iteration (Sutton, 1990). Such
techniques have been less popular in practice as learning accurate models of T
and R has been found to be harder than learning Q-values directly.
Nevertheless these techniques have attracted sufficient interest to be applied
to the hierarchical reinforcement learning problem, and model-based hierarchi-
cal reinforcement learning algorithms exist (e.g. H-Dyna (Singh, 1992), SMDP
Planning (Sutton et al., 1999), abstract MDPs (Hauskrecht et al., 1998) and
discrete event models (Mahadevan, Khaleeli, & Marchalleck, 1997)).
While this thesis also involves the application of models to hierarchical re-
inforcement learning, the style and application of those models is significantly
different. Model-based HRL algorithms learn concrete, numerical models of ac-
tions’ effects in order to generate policies. In this work we will be using abstract,
symbolic models of actions’ purposes in order to guide exploration.
2.6.2 Other hybrid learning algorithms
Other systems exist which combine symbolic background knowledge with re-
inforcement learning in quite a different fashion to the behaviour-based model
presented here. One approach is to incorporate the symbolic knowledge into
the representation of the Q-function. For instance, the RATLE system (Maclin
& Shavlik, 1996) uses knowledge-based neural nets for this purpose. These are
recurrent neural-networks which are structured by the trainer, using a simple
programming language to represent background knowledge (Maclin, 1995).
Similarly, Relational Reinforcement Learning (Dzeroski, Raedt, & Blockeel,
1998) employs inductive logic programming to represent the Q-function as a logic
program. Symbolic background knowledge can be incorporated into the learnt
representation.
This is a radically different use of symbolic background knowledge. Both
approaches are attempting to represent concrete numeric information (the Q-
function) using abstract symbolic methods. This is an inversion of the work
in this thesis, in which we use numeric methods to represent abstract symbolic
concepts.
2.7 Reinforcement learning in this thesis
Having presented a broad overview of hierarchical reinforcement learning, we will
focus for the rest of this thesis on a few particular issues. Both new algorithms
we present will be recursively optimal algorithms based on HSMQ. HSMQ was
chosen over MAXQ-Q for its relative simplicity, although there is no obvious
reason why the algorithms could not be adapted to use the MAXQ value function
decomposition.
Recursive optimality was chosen over hierarchical optimality as the resulting
independence and re-usability of behaviours more naturally suits the symbolic-
planning framework we will be using, but again the automatic hierarchy con-
struction methods we will employ could just as well be used for hierarchically
optimal policies.
The particular issues that we shall focus on are task hierarchies and termi-
nation improvement. We aim to show how symbolic background knowledge can
automatically supply what a trainer would otherwise have to encode by hand.
2.8 Summary
This chapter showed how the agent learning problem can be cast in a subsym-
bolic fashion as an online dynamic programming problem. We presented the
standard MDP and SMDP models for reinforcement learning and hierarchical
reinforcement learning, and showed how these models can be used to produce
a variety of different algorithms depending on the choice of optimality criterion
and execution semantics. In the next chapter we will show how the same prob-
lem can be approached in quite a different fashion using the tools of symbolic
planning and knowledge refinement.
Chapter 3
Background - Symbolic Planning
In the last chapter we discussed the sub-symbolic approach to agent control. We
began with a technique that was primarily designed to learn policies without
human guidance, and we showed how it could be modified to allow it to use
high-level domain knowledge from a trainer.
The symbolic planning approach to control has the opposite problem. It
is first and foremost a technique for reasoning about action based on abstract
models of behaviour provided by a trainer. The difficulty lies in handling the
low-level intricacies of the domain, which cannot be captured in an abstract
model. In this chapter we shall first present a simple, commonly-used approach
to symbolic planning, means-ends analysis using the Strips formalism, and then
describe ways in which it has been extended to handle more complex worlds.
3.1 The symbolic planning model
At a fundamental level the problem faced by symbolic planning is much the same
as that of reinforcement learning. An agent interacts with an environment to
achieve certain goals. The objective is to produce a policy which maps states to
actions so as to achieve the agent’s goals as efficiently as possible. The method
of creating those policies, however, is significantly different.
Symbolic planning aims to construct policies (or plans) from a model of the
world provided by the trainer. The emphasis is therefore on creating a language
in which a trainer can easily specify this model, and on a reasoning process which
is logically sound. Learning by trial-and-error has also been studied, but as a
later addition to a well-established field.
3. Background - Symbolic Planning 47
Figure 3.1: An illustration of a planning agent. (Inside the agent, the planner combines a world model and goals to produce a policy, which maps states observed from the environment to actions.)
Thus the essence of the symbolic approach is the language it uses. States,
actions and goals are usually represented in the language of first-order logic (or
its equivalent). Fundamental to this is the idea of using fluents to describe
features of the agent’s state. A fluent describes a set of states in which certain
properties or relationships hold.
Consider the example world used in the previous chapter (reproduced in
Figure 3.2). What are the important abstract features of the agent’s state? The
agent’s location is one. Another is the location of the book, and of the coffee.
So the fluent location(Object, Location) might be defined to represent that
Object is in Location. Object could be robot, book or coffee. Location
could be any of the rooms. The fluent holding(varObject) will be used to
signify that the robot has the Object in its possession.
Another important feature is the topology of the world – which rooms are
connected. This can be encoded using the fluent door(Room1, Room2) to indicate
that Room1 is connected to Room2.
States can now be described by a conjunction of fluents. We shall use the
notation:
s |= f1 ∧ f2 ∧ . . . ∧ fk

to say that primitive state s models fluents f1 . . . fk, i.e. the fluents are true in
Figure 3.2: The example world again. (Rooms: study, closet, lounge, hall, bathroom, bedroom1, bedroom2, dining, kitchen and laundry. The robot starts in the study, the book is in bedroom2 and the coffee is in the kitchen.)
state s. Thus the initial state of the world, shown in Figure 3.2 is given by:
s |= location(robot, study)
∧ location(book, bedroom2) ∧ location(coffee, kitchen)
∧ door(lounge, hall) ∧ door(hall, lounge)
∧ door(hall, study) ∧ door(study, hall)
∧ door(study, closet) ∧ door(closet, study)
∧ door(hall, bathroom) ∧ door(bathroom, hall)
∧ door(hall, bedroom1) ∧ door(bedroom1, hall)
∧ door(hall, bedroom2) ∧ door(bedroom2, hall)
∧ door(hall, dining) ∧ door(dining, hall)
∧ door(dining, kitchen) ∧ door(kitchen, dining)
∧ door(dining, laundry) ∧ door(laundry, dining)
(For the sake of brevity, door fluents will henceforth be omitted from expressions
unless they are of particular importance.) Note that while this description may
not uniquely identify the state s (as there are many states that match this
description), we shall assume that it is operationally complete, that is, it encodes
all the necessary information to determine how to achieve the agent’s goals from
this state. We shall use the notation Fluents(s) to indicate the set of all fluents
which are satisfied in a given state of the world s.
Fluents are also used to represent the agent's goals. It is typically assumed
that the agent’s goal is to arrive at any of a particular subset of states of the
world, and that set can be represented as a conjunction of fluents. So, for
instance, a goal in the example world may be for the agent to fetch a cup of
coffee and bring it into the dining room. This goal can be represented as:
G = location(robot, dining) ∧ holding(coffee)
Any state s which satisfies G is a goal state.
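Representing fluents as tuples and states as sets of fluents makes the satisfaction relation s |= f1 ∧ . . . ∧ fk a simple subset test. A minimal sketch (the tuple encoding is an assumption of this illustration):

```python
# Fluents encoded as tuples: (predicate, arg1, ..., argN).
def models(state, condition):
    """s |= f1 ∧ ... ∧ fk  iff every fluent in the conjunction holds in s."""
    return set(condition) <= set(state)

initial = {
    ('location', 'robot', 'study'),
    ('location', 'book', 'bedroom2'),
    ('location', 'coffee', 'kitchen'),
}
goal = {('location', 'robot', 'dining'), ('holding', 'coffee')}
```

Under this encoding, `models(s, goal)` returns True exactly for the goal states.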
Finally we need to model the agent’s actions using the fluents. Actions are
described in terms of their effects, and the preconditions required to produce
those effects. Two kinds of effects are typically distinguished: deliberate and
accidental. The deliberate effects of an action are called its post-conditions,
the accidental effects are called side-effects. In planning, the agent should rely
only on the post-conditions of an action as its useful effects. Side-effects
are specified so that the agent may know that they might possibly occur,
and can ensure that its plans do not rely on their absence. This division is
mostly semantic – the environment makes no distinction between deliberate and
accidental effects – but it is a useful one when it comes to planning the agent's
behaviour.
Consider the Get(coffee, kitchen) action in the sample world. What are
the abstract effects of this behaviour? The deliberate effect is that the robot is
holding the coffee, i.e.:
post-condition = holding(coffee)
Under what conditions will the behaviour achieve this effect? It is expected to
work so long as both the agent and the coffee are in the kitchen, so:
precondition = location(coffee, kitchen) ∧ location(robot, kitchen)
It also has a side-effect:
side-effect = ¬location(coffee, kitchen)
as once it has been picked up, the coffee is no longer considered to be in the
room.
It should be noted at this point that these descriptions of states and actions
are significantly more abstract than the primitive states and actions used in the
formulation of the reinforcement learning problem for this same world. This is
typically the case. Symbolic planning is rarely done at such a low level. Actions
are assumed to be implemented as deterministic processes that can be described
in high-level terms. The underlying implementation issues are largely ignored, so
that actions can be considered to be deterministic, discrete, and instantaneous
and to satisfy the Markov property (i.e. involving no hidden state). Recent
planning research has relaxed these assumptions somewhat, as will be elaborated
in Section 3.2.3.
3.2 Building Plans
Given such a world model it is possible to build a policy or plan by which the
agent can achieve its goals from its initial state. A plan is a sequence of actions
to be executed in order. Each action achieves the preconditions of the next until
the final action reaches the goal. Figure 3.3 shows an illustration of a plan for
the example world. The goal, in the top node of the plan, is for the robot to be
in the dining room, holding a cup of coffee. The nodes below it represent sets of
states, as described by the conjunction of fluents in each. The arrows are actions
leading from one set of states to the next. Such a plan can be constructed by a
logical inference process from the action models.
The particulars of this planning process vary depending on, among other things,
the representation used for the action models. We do not intend to give a
comprehensive description of the state of the art in planning. For that, we refer
the reader to (Allen et al., 1990) and (Ghallab & Milani, 1996).
This thesis is not an attempt to produce an improved symbolic planner,
but rather to create a hybrid of planning and reinforcement learning. As such,
we shall focus on a simple and well-established planning algorithm (Means-
Ends analysis (Newell & Simon, 1972)) and an equally basic representation (the
Strips model (Fikes & Nilsson, 1971)). We shall consider only one recent im-
provement to this combination (Teleo-reactive planning (Nilsson, 1994)). There
have been many others, but for the sake of simplicity, we shall avoid discussing
these and focus on the techniques used in this thesis. We will reserve discussion
of how more complex planning techniques might be applied to the problem until
Figure 3.3: A plan for fetching the coffee. The goal is in the top node; each arrow is an action leading up from one set of states to the next:

    location(robot, dining) ∧ holding(coffee)    [goal]
      ↑ Go(kitchen, dining)
    location(robot, kitchen) ∧ holding(coffee)
      ↑ Get(coffee, kitchen)
    location(robot, kitchen) ∧ location(coffee, kitchen)
      ↑ Go(dining, kitchen)
    location(robot, dining) ∧ location(coffee, kitchen)
      ↑ Go(hall, dining)
    location(robot, hall) ∧ location(coffee, kitchen)
      ↑ Go(study, hall)
    location(robot, study) ∧ location(coffee, kitchen)    [initial]
the future work section in Chapter 8.
3.2.1 The Strips representation
One of the earliest and most enduring symbolic representations of actions is the
Strips representation originally proposed by Fikes and Nilsson. It represents
actions as operators with three principal components:
1. A precondition. A list of fluents that must be true for the action to be
executed.
2. An add list. A list of fluents that are true after the action has been exe-
cuted. This describes the post-conditions of the action.
3. A delete list. A list of fluents that the action may cause to become false.
These are the side-effects of the action.
Formally, an operator ⟨A.pre, A.add, A.del⟩ for an action A models the fact that
if A is executed in a state s with:

s |= ∧_{f∈A.pre} f    (3.1)

then the resulting state s′ will satisfy:

s′ |= ∧_{f∈A.add} f ∧ ∧_{f∈Fluents(s)−A.del} f    (3.2)

where Fluents(s) is the set of all fluents satisfied by s.
The first term of this equation indicates that the post-conditions of A are true
in s′. The second term indicates that any undeleted fluents from state s also
remain true in s′. This is called the frame assumption. Without this assumption
we would have to explicitly describe every fluent that was not affected by the
action, in addition to those that were. Frame assumptions are a difficult but
necessary part of reasoning about action, and much research has gone into them
(Shoham & McDermott, 1988; Hayes, 1973). We shall return to this issue later,
when we discuss learning action models in Chapter 6.
The Strips representation places important restrictions on the description
of actions. A particular operator has a single precondition for all its effects, and
both pre- and post-conditions are simple conjunctions of fluents. An action that
has multiple conditional effects must be described as several different operators.
(Henceforth we shall use preconditions, add-lists and delete-lists interchangeably
as sets of fluents and as logical conjunctions. It ought to be clear from the context
which is intended in any instance.)
It is convenient to describe families of similar operators as schemata, with
certain constants replaced by variables. Thus actions in the example world might
be described by the operator schemata:
Go(Room1, Room2)
Pre: location(robot, Room1) ∧ door(Room1, Room2)
Add: location(robot, Room2)
Del: location(robot, Room1)
Get(Object, Room)
Pre: location(Object, Room) ∧ location(robot, Room)
Add: holding(Object)
Del: location(Object, Room)
(We follow the Prolog convention of beginning variable names with capital letters
and constants with lowercase letters.) Variables in the schema name must be
bound in the process of planning in order to fully instantiate the operator. Only
fully instantiated operators can be executed.
3.2.2 Means-ends planning
Equation 3.2 allows us to form two reasoning rules. The first, called progression,
allows us to predict the effects of actions. If state s satisfies the fluents in F_before
and A.pre ⊆ F_before, then the state s′ after executing A will satisfy:

F_after = (F_before − A.del) ∪ A.add    (3.3)

Alternatively, the regression rule tells us that if A.add ⊆ F_after then action A
can be used to achieve F_after if executed in a state satisfying:

F_before = (F_after − A.add) ∪ A.pre    (3.4)

provided that A.del ∩ F_after = ∅. This second rule is important, as it is the
foundation of the planning algorithm we will employ.
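Both rules are easy to state operationally. A minimal Python sketch, treating preconditions, add lists and delete lists as sets of fluent tuples (the `Op` record and the fluent encoding are assumptions of this illustration; `dele` is used because `del` is a Python keyword):

```python
from collections import namedtuple

# A STRIPS operator: precondition, add list and delete list (sets of fluents).
Op = namedtuple('Op', ['name', 'pre', 'add', 'dele'])

def progress(f_before, op):
    """Progression (Eq 3.3): predict the fluents holding after executing op."""
    assert op.pre <= f_before, "precondition not satisfied"
    return (f_before - op.dele) | op.add

def regress(f_after, op):
    """Regression (Eq 3.4): what must hold before op so f_after holds after.
    Only valid when op deletes nothing in f_after."""
    assert not (op.dele & f_after), "op would undo part of the condition"
    return (f_after - op.add) | op.pre
```

For example, regressing the coffee-fetching goal through the final Go action recovers the condition used in the worked example below (door fluents omitted for brevity).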
Planning with Strips operators can be considered as a kind of search. A
path needs to be found between the initial state and the goal. Actions describe
transitions from one set of states to another, using the above rules of progression
and regression. Standard AI search techniques can be applied to this problem
in a variety of ways.
Perhaps the simplest approach is means-ends analysis, also known as regres-
sion planning (Newell & Simon, 1972; Georgeff, 1987). This involves a simple
breadth-first search through the state space, starting from the goal and work-
ing backwards towards the initial state, using the regression rule above to select
actions.
Here is a simple example from the example world. Let our goal be to reach
the dining room with the coffee. I.e.:
G = {location(robot, dining), holding(coffee)}
To achieve this goal we apply the regression rule above. First we find an
action which achieves one or more of the fluents in the goal. In this case, the
Go(kitchen, dining) action will serve. Then we regress the goal to find the
conditions that need to be true before the action is executed:

F_before = (F_after − Go(kitchen, dining).add) ∪ Go(kitchen, dining).pre
        = ({location(robot, dining), holding(coffee)}
           − {location(robot, dining)}) ∪ {location(robot, kitchen)}
        = {location(robot, kitchen), holding(coffee)}
The search then continues for an action which satisfies this regressed condition,
until such time as a condition is found which is satisfied by the initial state.
There are, of course, other actions which could have been chosen here in place
of Go(kitchen, dining). Building the shortest plan is thus equivalent to finding
the shortest path through a graph, and is done by breadth-first search or an
equivalent algorithm.
This algorithm is sound and complete, assuming the model itself is likewise.
The plans it generates are correct, and if a plan exists within the limitations of
the representation then it will be found.
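A minimal breadth-first regression planner following this scheme might look as follows. The operator encoding and the toy fluents in the test of this sketch are illustrative assumptions, with door fluents omitted as elsewhere in this chapter; the search assumes a finite set of fluents so that the visited-set guarantees termination:

```python
from collections import namedtuple, deque

Op = namedtuple('Op', ['name', 'pre', 'add', 'dele'])

def means_ends(initial, goal, ops):
    """Breadth-first regression planning: search backwards from the goal,
    regressing it through operators until the initial state satisfies it.
    Returns the shortest plan as a list of operator names, or None."""
    frontier = deque([(frozenset(goal), [])])
    seen = {frozenset(goal)}
    while frontier:
        cond, plan = frontier.popleft()
        if cond <= initial:                 # initial state satisfies cond
            return plan
        for op in ops:
            # op is useful if it achieves part of cond and deletes none of it
            if (op.add & cond) and not (op.dele & cond):
                before = frozenset((cond - op.add) | op.pre)
                if before not in seen:
                    seen.add(before)
                    frontier.append((before, [op.name] + plan))
    return None
```

Because the search proceeds level by level, the first condition found to be satisfied by the initial state yields a shortest plan, mirroring the breadth-first argument above.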
3.2.3 Extensions to the Strips representation
The Strips representation and the means-ends planning algorithm have been
recognised to have a number of limitations. Many improvements have been made
on these techniques in the years since they were proposed. Three improvements we
shall focus on are:
• Durative actions: the ability to model actions that are not instantaneous.
• Universal plans: the ability to construct plans that contain contingencies
for handling random execution failures.
• Hierarchical planning : the ability to improve planning efficiency by con-
structing plans at different levels of abstraction.
Representing durative actions: Teleo-reactive operators
Many accounts of the Strips planning process described above assume that
the operators are instantaneous and deterministic, and so can be executed with-
out any kind of monitoring. It is assumed that each action will terminate suc-
cessfully, achieving the precondition for the next action without any need for
verification. The original authors, however, realised that this was not so. In the
real world, actions take time to execute and may fail to perform as intended,
and this needs to be taken into account.
A neglected element of the original Strips research is the PLANEX plan-
execution monitor (Fikes, 1971, Fikes, Hart, & Nilsson, 1972). This program
monitored the state of the world at each stage of the plan to ensure that it
matched the preconditions for the subsequent actions. If ever this was not true,
PLANEX would search backwards through the plan until it found a step in the
plan for which the preconditions were satisfied. Thus it could recover from small
errors by simply going back a few steps in the plan. If none of the prior steps
were satisfied, it would then resort to constructing a new plan.
The Teleo-reactive formalism developed by Nilsson and Benson (Nilsson,
1994; Benson & Nilsson, 1994; Benson, 1996) extends the Strips representa-
tion to more explicitly capture these features. An action in this formalism is
represented by a teleo-operator (or Top). A Top is much like a Strips opera-
tor. It has a pre-image, a post-condition and a set of side-effects, much like the
precondition, add and delete lists of a Strips operator. However the semantic
interpretation of these parts contains an important difference. A Strips oper-
ator is assumed to model an instantaneous action, whereas a Top represents
the continuous execution of a durative action until it reaches its post-condition.
Termination may not be immediate, but is guaranteed to eventually occur. To
achieve that post-condition, the Top must be initiated in a state satisfying its
pre-image. The Top is then guaranteed to remain inside that pre-image until
it terminates on achieving the post-condition. The side-effect list contains those
fluents whose truth value may change during the execution of the Top.
This definition is convenient as it allows temporally abstract actions to be
used in planning without any significant changes to the planning algorithm.
Means-ends analysis can be applied directly to teleo-operators to produce plans
containing durative actions.
Execution of a teleo-reactive plan must be monitored. An action must be
terminated when it achieves its post-condition. Furthermore, monitoring allows
the detection and handling of random execution failures. Execution of a teleo-
reactive plan resembles the circuit semantics of behaviour-based programming
(Brooks, 1986; Maes, 1990; Kaelbling & Rosenschein, 1990). The plan is treated
as an ordered sequence of production rules from states to actions. For example,
the plan in Figure 3.3 can be represented as:
location(robot, dining), holding(coffee) → terminate
location(robot, kitchen), holding(coffee) → Go(kitchen, dining)
location(robot, kitchen), location(coffee, kitchen) → Get(coffee, kitchen)
location(robot, dining), location(coffee, kitchen) → Go(dining, kitchen)
location(robot, hall), location(coffee, kitchen) → Go(hall, dining)
location(robot, study), location(coffee, kitchen) → Go(study, hall)
(For brevity’s sake the door fluents have been omitted.) This plan is executed
reactively with continuous feedback. At any instant the action executing is
dictated by the first rule in the list with its left-hand side satisfied. Such a rule
is said to be active. Execution is expected to proceed up the list as each action
is performed, but it is recognised that actions may occasionally fail or external
events may cause unexpected changes in the state of the world. Such execution
failures are accommodated as execution immediately jumps up or down the list
to the appropriate rule. Say, for example, the robot was executing the second
rule in the list, carrying the coffee from the kitchen into the dining room, when
the coffee was accidentally spilled. Because the plan is constantly monitored, execution would
immediately drop back to the rule below it and the robot would go and fetch
another cup of coffee.
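The rule-list semantics can be sketched directly: at every tick the first rule whose left-hand side holds in the current state dictates the action. The conditions below are the first four rules of the plan above, with states represented as sets of ground fluents.

```python
# A teleo-reactive plan as an ordered list of (condition, action) rules,
# scanned top-down for the first rule whose left-hand side is satisfied.
PLAN = [
    ({"location(robot,dining)", "holding(coffee)"}, "terminate"),
    ({"location(robot,kitchen)", "holding(coffee)"}, "Go(kitchen,dining)"),
    ({"location(robot,kitchen)", "location(coffee,kitchen)"}, "Get(coffee,kitchen)"),
    ({"location(robot,dining)", "location(coffee,kitchen)"}, "Go(dining,kitchen)"),
]

def active_action(state, plan=PLAN):
    """Return the action of the first active rule, or None if none matches."""
    for condition, action in plan:
        if condition <= state:
            return action
    return None
```

Because the scan is repeated at every tick, an execution failure simply causes a different rule to become the first active one; no explicit failure handling is needed.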
Universal Plans
Plan monitoring provides some robustness to execution failures, but can only
handle situations that already exist in the plan. What would happen in the
above example if the agent were to accidentally enter the laundry rather than
the kitchen? The plan contains no productions which match this scenario, so
execution would fail.
To make truly robust plans contingencies need to be added to handle such
circumstances. One possibility is to create plans which contain paths to the goal
from every possible state. Such plans are called universal (Schoppers, 1987).
Universal plans, combined with plan monitoring, allow the agent to recover from
arbitrary execution failures, as the plan will contain an appropriate course of
action regardless of what circumstances may arise. A universal plan is best pic-
tured as a tree, with the goal at the root and many paths converging towards
it. A universal plan for the coffee-fetching task is illustrated in Figure 3.4.
The teleo-reactive execution formalism can also be applied to executing such
plan trees. Just as the list of rules was scanned from top to bottom for the first
active rule, so the plan tree can be searched in a breadth-first fashion to find the
shallowest active node. This node dictates the action to be executed. (If ever
there are two active nodes at the same depth then ties are broken randomly).
So, in answer to the above scenario, if the robot were to accidentally enter
the laundry while executing the Go(dining, kitchen) behaviour then control
would pass down to the left child node, and the robot would begin executing
Go(laundry, dining) in order to recover from the mistake.
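Breadth-first selection of the shallowest active node can be sketched over a small fragment of the tree in Figure 3.4. The exact node conditions and child structure here are assumptions, and the random tie-breaking between equally deep nodes is omitted.

```python
from collections import deque

# A fragment of a universal plan tree: nodes are (condition, action, children).
laundry = ({"location(robot,laundry)", "location(coffee,kitchen)"},
           "Go(laundry,dining)", [])
hall = ({"location(robot,hall)", "location(coffee,kitchen)"},
        "Go(hall,kitchen)", [])
dining = ({"location(robot,dining)", "location(coffee,kitchen)"},
          "Go(dining,kitchen)", [laundry, hall])
kitchen = ({"location(robot,kitchen)", "location(coffee,kitchen)"},
           "Get(coffee,kitchen)", [dining])
carrying = ({"location(robot,kitchen)", "holding(coffee)"},
            "Go(kitchen,dining)", [kitchen])
ROOT = ({"location(robot,dining)", "holding(coffee)"}, "terminate", [carrying])

def shallowest_active(state, root=ROOT):
    """Breadth-first search for the shallowest node whose condition holds."""
    frontier = deque([root])
    while frontier:
        condition, action, children = frontier.popleft()
        if condition <= state:
            return action
        frontier.extend(children)
    return None
```

In the scenario above, a robot that strays into the laundry activates the `Go(laundry,dining)` node, recovering without replanning.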
In practice universal planning is rarely done. It is costly to perform as it
requires exhaustive enumeration of all possibilities. A simpler alternative is to
make best use of the information gathered in the normal planning process. While
searching for a plan from a particular state, many false paths are explored. These
paths can be stored as contingencies in a plan tree with little extra cost. If an
execution failure places the agent in a state not already covered by any node of
the tree, then the tree can be expanded until the new state is covered. We shall
call such plans semi-universal.
[Figure 3.4 appears here: a tree of plan nodes rooted at the goal (location(robot, dining), holding(coffee)), with actions such as Go(kitchen, dining), Get(coffee, kitchen), Go(dining, kitchen), Go(laundry, dining), Go(hall, kitchen), Go(study, hall), Go(lounge, hall) and Go(closet, study) on paths converging towards it.]
Figure 3.4: A universal plan for fetching the coffee. Dotted arrows show where additional nodes have been omitted to save space.
Hierarchical Planning
In a moderately complex environment the depth and the branching factor of the
plan trees can become quite large, making the planning process intractable. The
planner can waste much time and effort ordering the minute details of the plan.
More progress could be made by establishing the broad strokes of the plan first,
and then filling in the details later. For example, a plan to travel from Sydney
to London would not be constructed on a per-footstep basis. Rather the flights
would be arranged first, then the details of catching each flight filled in later. A large
number of false paths could be avoided in this fashion. This is the intuition
behind hierarchical planning (Sacerdoti, 1974; Rosenschein, 1981; Korf, 1985;
Iba, 1989).
A hierarchical planner typically defines operators at various levels of ab-
straction. The most abstract macro-operators describe large-scale movements
through the state space. A macro-operator A has an internal plan A.plan which
implements A in terms of finer-grain operators. This proceeds recursively until
ultimately all operators are described in terms of concrete actions that can be
directly executed.
A further advantage of hierarchical planning is that each macro-operator can
be treated as an independent planning problem. Not only does this reduce the
overall number of action ordering considerations, but it also allows for re-use.
Once a plan has been constructed for a macro-operator it can be stored in a plan
library and used whenever the macro-operator is needed.
The teleo-reactive planning formalism can be extended to incorporate hierar-
chical plans. Teleo-operators can be written to describe macro-actions which are
in fact implemented as plans of finer-grain actions. Executing a macro-action A
simply means executing its internal plan, A.plan, according to the teleo-reactive
semantics. As long as the external plan continues to recommend A, then A’s
internal plan continues to execute. Should the external plan require that A stop
executing at any time, then A.plan and all actions below it in the hierarchy are
immediately interrupted.
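This recursive execution can be sketched as repeated resolution: on every tick the hierarchy is re-expanded from the top, so a change of recommendation in an outer plan automatically interrupts everything beneath it. The macro-action and its internal plan below are hypothetical.

```python
# Hypothetical macro-action: FetchCoffee and its internal teleo-reactive
# plan, an ordered list of (condition, action) rules.
MACROS = {
    "FetchCoffee": [
        ({"holding(coffee)"}, "terminate"),
        ({"location(robot,kitchen)"}, "Get(coffee,kitchen)"),
        ({"location(robot,dining)"}, "Go(dining,kitchen)"),
    ],
}

def resolve(action, state):
    """Expand macro-actions until a primitive action (or None) remains.
    Because this runs afresh on every tick, interrupting an outer plan
    implicitly interrupts the whole stack of internal plans below it."""
    while action in MACROS:
        action = next((a for cond, a in MACROS[action] if cond <= state), None)
    return action
```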
3.3 Handling Incomplete Action Models
One of the most significant obstacles to applying planning algorithms to real
world problems is the need for the action model to fully specify the outcome of
each action, both in terms of what it does and does not do. The correctness
of Strips (and Top) planning relies on each operator to specify every fluent
that could possibly be changed by the action – not only the immediate intended
effects, but also those that are unintended, and anything that might logically
follow from either. Even in a domain of only moderate complexity this can
result in quite lengthy descriptions. Omissions can lead to incorrect plans that
do not achieve their goals, or the inability to produce a plan at all.
This situation can be improved by augmenting the world model with a theory
of logical relationships between fluents. A more powerful planner could apply
an action to indirectly achieve fluents implied by its post-condition, but not
included in it. This would allow post-conditions to be expressed more compactly,
and side-effects similarly. Implementations of such ideas exist (Reiter, 1987) but
have significant limitations in how they can be applied without threatening the
soundness of the planning process.
Ultimately it is inevitable that some aspects of a complex world model will
be omitted. The desire for autonomy in our agents leads us to consider how they
might overcome this obstacle of their own accord.
3.4 Learning Action Models
The problem of incompletely specified world models has driven research into
agents that can learn missing information autonomously and correct errors in
the model through experience of interacting with the world. There is no single
standard approach to this problem, rather it has been attacked from a variety of
angles by assorted researchers. The problem can either be regarded as learning
an action model from scratch (Benson, 1995; Shen, 1993; Wang, 1996; Oates &
Cohen, 1996; Lorenzo & Otero, 2000) or correcting an existing trainer-specified
model (Gil, 1994; desJardins, 1994). Further distinctions can be drawn on the
basis of how the input data for learning is generated, what kind of information
is learnt (including how it is represented and how noise is handled) and how
learning is done.
3.4.1 Generating data
If the model is incomplete, then the agent must learn through interaction with
its environment. There are several possible ways it can do this, ranging from
passive to active. Data can be generated by:
• Observing an expert controlling the agent
• Executing plans generated by the agent and observing their success or
failure.
• Deliberate exploration and experimentation to test particular parts of the
model.
Observing an expert is the most passive source of information. This is the
necessary starting point for most systems that learn without a prior model (eg:
Benson, 1996; Lorenzo & Otero, 2000; Wang, 1996; Oates & Cohen, 1996), as
they have no other means to direct their exploration of the world (other than
choosing actions randomly). Many examples of the effects of actions can
be gathered in this way, particularly if the expert chooses training examples
deliberately with an eye to their learning potential. However the wealth of
information can lead to a lack of focus in the learning process. Every new effect
observed can spawn a new learning task to predict its cause. The agent has no
way to determine which effects should be regarded as important or unimportant.
The semantic division between deliberate post-conditions and accidental side-
effects is not available.
If a partial model is already available, then the agent has more structure
within which to categorise its experiences. It can build plans in the existing
model and then test to see whether they operate as expected. If the model
is not accurate enough then failures will occur, either in the planning or in
the execution phase. Particular kinds of failures in the plans are indicative of
particular kinds of errors in the model. Evidence gathered from such failures
can be used to successively refine the world model until the plan is successful.
Systems that implement such learning are Trail (Benson, 1995; Benson, 1996),
Live (Shen, 1993; Shen, 1994), Observer (Wang, 1996), Expo (Gil, 1994) and
CAP (Hume & Sammut, 1991).
It is possible to go further than this and allow the agent to explicitly set out
to perform experiments to test certain parts of its world model. This can include
deliberately attempting to use an action outside of its learnt precondition to see
if it is too specific, or else in different parts of the precondition to test where it
might be faulty. In this way the actor can gather examples that may not arise
in the ordinary course of execution. This is a particular focus of Expo, Live
and CAP.
3.4.2 What is learnt
There are essentially only two things to learn: the effects of an action and the
conditions that are necessary to produce them. However the representation used
for actions strongly influences exactly what kinds of models can be learnt. Most
systems use Strips-like operators and so are limited to learning an action’s
post-conditions and their associated precondition. A post-condition is identified
as any fluent that changes truth value on the execution of the action. As the dis-
tinction between a post-condition and a side-effect is a semantic one, not present
in the world, all such effects are generally treated as post-conditions, unless in-
telligent data gathering allows a better classification. To find the precondition
for this effect the agent can form a generalisation of all the states in which the
action produced the effect.
Recent work has turned to more complex representations. Trail uses teleo-
operators to model actions. This allows multiple operators to be learnt for each
action, based on different post-conditions. To learn the pre-image of each oper-
ator the agent must generalise over the sequences of states in which the action
is executed, distinguishing those that lead to the post-condition in question and
those that do not.
A more sophisticated representation is used by Lorenzo and Otero. They
represent the effects of actions in the language of the Situation Calculus (Mc-
Carthy & Hayes, 1969). This language expresses the effects of behaviours in
terms of a logic program, with independent rules describing when each of the
post-conditions and side-effects of a behaviour may occur. This representation
is more expressive than the simple Strips notation, and allows the planner to
construct more accurate plans, only taking into account particular effects when
they are expected to occur. Lorenzo and Otero’s (apparently nameless) system
learns such logic programs.
3.4.3 How to learn
Discovering the effects of actions is not difficult. It merely requires observing
what fluents change when the action is executed. Distinguishing which effects
are relevant and which are irrelevant is harder, and learning the necessary pre-
conditions to produce these effects is harder still. How this is done depends on
the assumptions made in the chosen representation.
The Strips representation assumes that actions are deterministic and can
only be executed when their precondition is true. Thus learning preconditions
in the Strips model is generally done by making simple generalisations over the
states in which the actions work. If an action does not work in a given state within
its precondition then the precondition must be too general. It can be made more specific by
comparing the unsuccessful case with an earlier successful case. As execution
is assumed to be noise free, the explanation for the failure must lie in the
difference between the two cases.
Two approaches that go beyond this are Benson’s Trail and Lorenzo and
Otero’s system. They both assume that there may be noise in the data, that
actions may sometimes fail or succeed randomly. Learning preconditions in such
a domain requires more examples, both positive and negative, to be gathered.
Both of these systems use Inductive Logic Programming (variants of Dinus
and Progol respectively) to produce descriptions of the preconditions from the
noisy data.
3.5 Inductive Logic Programming
We need to briefly explain what Inductive Logic Programming (ILP) is, and how
it applies to this work. ILP is a kind of machine learning. Like other machine
learning algorithms, such as decision tree learning or nearest-neighbour methods,
ILP endeavours to build a “classifier” which can accurately classify positive and
negative examples of a target concept. Unlike these other approaches, ILP uses
first-order logic programs to represent facts and hypotheses. Thus it is useful for
learning relationships between different elements of an example, where attribute-
value learners cannot.
Formally, we are given a set of training examples E, consisting of true E+
and false E− ground instances of an unknown target concept, and background
knowledge B defining predicates which provide additional information about the
arguments of the examples in E. The aim is to find a hypothesis H which
accurately classifies the examples in E. I.e.:
H ∪ B |= e if e ∈ E+
H ∪ B ⊭ e if e ∈ E−
Many hypotheses may fit this requirement for a given training set. The aim
is to find one which will generalise in such a way as to accurately classify unseen
examples. This is principally achieved by trying to find the simplest possible
hypothesis which classifies the examples. Measures of simplicity are manifold
and vary from algorithm to algorithm.
A simple example (drawn from (Lavrac & Dzeroski, 1994)) will help to illus-
trate the process. Say our target concept is to learn the “daughter” relationship.
Our training set consists of a set of ground instances classified as positive and
negative examples:
E+ : daughter(mary, ann)
daughter(eve, tom)
E− : daughter(tom, ann)
daughter(eve, ann)
Furthermore, we have background knowledge which describes the family rela-
tionships and gender of the people in the example set:
B : parent(ann, mary)
parent(ann, tom)
parent(tom, eve)
parent(tom, ian)
female(ann)
female(mary)
female(eve)
From this input we aim to generate a hypothesis H which accurately classifies
the training set. Hypotheses are typically expressed as sets of Horn clauses
(implications whose body is a conjunction of first-order literals and whose head
is a single positive literal). For this
example, a single clause will suffice:
H : daughter(X, Y) ← female(X), parent(Y, X)
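The induced clause can be checked against the training set mechanically. The sketch below only tests the coverage of H given B over the examples above; searching the space of candidate clauses is the ILP algorithm's job.

```python
# Background knowledge B as ground facts.
parent = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}

def daughter(x, y):
    """The hypothesis H: daughter(X, Y) <- female(X), parent(Y, X)."""
    return x in female and (y, x) in parent

positives = [("mary", "ann"), ("eve", "tom")]
negatives = [("tom", "ann"), ("eve", "ann")]

# H covers every positive example and rejects every negative one.
covered = all(daughter(x, y) for x, y in positives)
rejected = not any(daughter(x, y) for x, y in negatives)
```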
We are not endeavouring to produce a new ILP algorithm in this thesis;
rather we shall be using it as a tool, so we shall avoid delving into its internal
implementation and describe it only from an external perspective. For a more
comprehensive tutorial on ILP, see (Lavrac & Dzeroski, 1994). There are many
different ILP algorithms, with many different features. Some that are important
are:
• Noise handling: The ability to handle misclassifications of examples in
E is an important part of any machine learning algorithm. There is a
trade-off between producing an accurate classifier and producing a simple
hypothesis.
• Background knowledge: The background knowledge B can be either exten-
sional or intensional. Extensional background must be enumerated as a
list of grounded atoms (as in the example above). Intensional background
knowledge can be expressed as a logic program, and thus can be both more
complex and more succinct.
• Incrementality: The learning process can be classified as batch or
incremental. A batch-mode learning algorithm is designed to be run once with
all available examples to produce a single hypothesis. An incremental al-
gorithm is designed to be run repeatedly, refining its hypothesis as more
and more examples become available.
We will be using ILP for action-model learning. This is appropriate for this
task as the language of action models is inherently first-order and relational.
The algorithm we use will need to be noise-resistant as the effects of actions
will often be non-deterministic. Background knowledge will be drawn from the
fluents used in planning, some of which will have intensional definitions, so an
algorithm that allows intensional background knowledge is preferable. Finally,
the process of learning an action model is intrinsically incremental. As the agent
explores the world and gathers more examples, we wish to be able to refine the
hypotheses it forms.
Unfortunately no single algorithm could be found which satisfied all these re-
quirements. In particular, incremental algorithms are rare and only just becom-
ing the object of much scrutiny (Shapiro, 1981; Muggleton & Buntine, 1988;
Taylor, 1996; Westendorp, 2003). Instead we settled on the batch learning ILP
algorithm Aleph (Srinivasan, 2001a), which has good noise-handling and uses
intensional background knowledge. In Chapter 6 we describe how this algorithm
was adapted to provide a very simple kind of incrementality.
3.6 Other related work
As in the previous chapter, there are some other areas of research which bear
some external similarity to the work in this thesis. These deserve mention, if
only to show how they differ from what we are doing.
3.6.1 Explanation Based Learning
Much of the research into action model learning stemmed from another area of
research called explanation based learning (EBL). Explanation based learning is
a process for learning rules to speed up logical inference in a problem solver (eg:
LEX (Mitchell, Utgoff, & Banerji, 1984), SOAR (Laird, Rosenbloom, & Newell,
1986) and Prodigy (Minton, 1988)). It does this by observing patterns in the
reasoning process, generalising them and attempting to apply them to similar
situations in other problems.
EBL can be applied to the symbolic planning process to automatically gen-
erate macro-operators by analogy from one planning problem to another (Car-
bonell, 1984). While this bears some similarity to the action model learning
described above, the content of what is being learnt is significantly different.
EBL algorithms learn to summarise information already present in the agent’s
knowledge. So, for example, if the action model already contains operators which
can be applied to solve a particular subpart of a plan then EBL could be used to
tag that sequence of operators as a potentially useful macro, and even generalise
it for use in a variety of similar situations. It cannot, however, add new effects
to operators which are not present in the model. This is the key difference:
EBL learns about the agent’s internal model by observing the planning process.
Action model learning learns about the external world, by interacting with it.
For an excellent review of explanation based learning research, see (Diet-
terich, 1996).
3.7 Planning in this thesis
As mentioned above, the representation and algorithm used for planning in this
thesis will be kept fairly simple, so as not to distract from the main thrust of the
work, which is the hybridisation of reinforcement learning and planning. We shall
be using teleo-operators to represent actions and constructing (semi-)universal
plans using means ends analysis. We will concentrate first on using a single level
of planning, and then extend our algorithms to incorporate hierarchical planning
using operators at various levels of granularity.
Action model learning will be limited to improving an existing model by
learning to predict unexpected side-effects that arise while executing plans, with
a limited amount of deliberate exploration. Both the operator representation
and the planning algorithm will be extended to allow conditional side-effects
and build plans which can avoid them.
We will be using the ILP algorithm Aleph to learn the circumstances under
which side-effects arise. As the actions we will be modelling are reinforcement-
learnt behaviours, the model learning process will need to be noise-tolerant and
incremental.
3.8 Summary
This chapter has explained how an agent can build plans to solve control prob-
lems using a symbolic model of the effects of actions provided by a trainer. We
have discussed how such a model might be represented, and how to build plans
which include durative actions, fault-tolerance and hierarchical structure. When
the action model is incomplete, methods exist for improving and repairing it
based on experience drawn from interacting with the world.
In the next chapter we shall consider how the work from this chapter and the
last can be combined into a single representation for both reinforcement learning
and symbolic planning, combining the strengths of each to overcome the other’s
weaknesses.
Chapter 4
A Hybrid Representation
In the introduction to this thesis we argued that a hybrid approach, combin-
ing symbolic planning and reinforcement learning, would allow us to produce
agents which could interact with complex environments more effectively than
those based on either approach alone. The background chapters have described
the representations used by each approach. These representations have much
in common, sharing a common task, but they also have significant differences.
In this chapter we aim to resolve these differences to produce a common repre-
sentation for states, goals and abstract behaviours which can be used for both
planning and reinforcement learning.
4.1 Representing states
The first important element of each approach is the representation of the agent’s
state. State is the combination of all those properties of the agent and the world
which are necessary to determine appropriate action. We will need to distinguish
between primitive and abstract representations of state. The primitive descrip-
tion of state contains all the concrete details which uniquely identify a particular
situation. Abstract state descriptions will be used to represent sets of primitive
states in terms of the high-level features and relationships they model.
Primitive state is used for concrete decision making in reinforcement learning,
and as the basis for defining fluents for abstract state descriptions. Abstract
state is used to represent the pre- and post-conditions of behaviours, and also
the agent’s goals. Planning will use the agent’s abstract state to decide which
behaviours are appropriate.
4. A Hybrid Representation 69
4.1.1 Instruments: Representing primitive state
In the theory presented in Chapters 2 and 3 we treated primitive states anony-
mously, simply assuming that there was a finite set of discrete states S =
{s1, s2, . . .}. In practice however, these states are generally constructed from a
collection of different sensors and other sources of information about the world.
So a single state is typically represented as a vector of different state variables.
For example, in the grid-world problem each primitive state would be a vector
of four elements: the robot’s x and y coordinates on the map, the location of
the coffee and the location of the book. Each of these variables can take on
a finite number of possible values, resulting in a finite set of primitive states.
(In some domains the state variables are continuous-valued rather than discrete.
In these cases we shall assume that an appropriate discretisation exists. While
recognising that this assumption may result in hidden-state problems, we do not
intend to address them in this work.)
We shall represent state variables symbolically as named functions called in-
struments. An instrument i returns the current value of its related state-variable.
Families of related instruments can be represented as parameterised schemata.
For instance, in the example world from the previous chapters, rather than defin-
ing separate instruments location coffee and location book to return the lo-
cations of the coffee and the book, a single instrument schema location(Object)
is defined which returns the location of the Object. An instrument schema must
yield an output for every instantiation of its parameters.
Explicitly named instruments and instrument schemata allow the possibility
of creating parameterised behaviours with state-representations as functions of
the parameters. This will be explained in more detail in Section 4.3.3 below.
In addition to the various sensors available to the agent, one special instru-
ment schema is required to define parameterised behaviours. This is the identity
function id(X) = X which outputs the value of its input parameter X regardless
of the state.
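Instruments can be modelled as named functions of the primitive state vector. This is a sketch only; the state layout below is an assumption based on the grid-world example.

```python
# A primitive state as a vector of state variables (hypothetical layout
# for the grid world: robot coordinates plus object locations).
state = {"robot_x": 3, "robot_y": 1, "coffee": "kitchen", "book": "study"}

def location(obj, s):
    """Instrument schema location(Object): the current location of Object.
    A schema must yield an output for every instantiation of its parameter."""
    return s[obj]

def identity(x, s):
    """The id(X) schema: returns its parameter regardless of the state."""
    return x
```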
4.1.2 Fluents: Representing abstract state
Fluents are first-order predicates that represent abstract features of the state.
Fluents may be defined intensionally, as relationships between various instrument
values, or extensionally as facts about the world that are independent of the
primitive state. An example of the former is the location fluent, which is
defined in terms of the instruments that output the robot’s x,y-coordinates.
The door fluent on the other hand is extensionally defined.
Fluents have mode and type information which will be used in the planning
and reflecting processes. Each parameter of a fluent is marked as either an
input or an output. A fluent may be queried with any of its output variables
existentially quantified, but all input variables must be bound.
Mode information has particular importance when building plans. A plan
node may contain fluents with unbound variables, but only if the modes of the
fluents permit. So for example, the fluent location may have mode:
location(+Object,−Room)
indicating that Object is an input variable and Room is an output variable. So
a plan node could contain a condition like:
location(robot, Room)
in which Object is bound but Room is unbound, but a node condition like:
location(Object, kitchen)
would be invalid, as Object is an input variable, and must be bound.
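A sketch of this mode check, assuming the Prolog-style convention that variables are capitalised and constants are lower-case:

```python
# Modes for each fluent: '+' marks input arguments (must be bound),
# '-' marks outputs (may be left as unbound variables).
MODES = {"location": ("+", "-")}   # location(+Object, -Room)

def is_variable(term):
    """Prolog-style convention: variables start with an upper-case letter."""
    return term[:1].isupper()

def valid_condition(fluent, args):
    """A plan-node condition is valid only if no input argument is unbound."""
    return all(mode != "+" or not is_variable(arg)
               for mode, arg in zip(MODES[fluent], args))
```

Under these modes, `location(robot, Room)` passes the check while `location(Object, kitchen)` is rejected.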
Type information defines the types of constants that can be used to bind
each variable. Type information is only used in reflection and is discussed in
detail in Section 6.5.1.
4.2 Representing goals
Having defined a language for describing the agent’s state, we now need to de-
scribe its goals. Symbolic planning is primarily focused on goals of achievement,
that is problems in which the goal is to achieve a certain state of the world (or
one of a set of states). The aim is to produce the most efficient policy to achieve
this.
Goals of achievement are also common in reinforcement learning tasks, but
the reward-based goal specification is much more general than that and can
include goals of maintenance (in which a certain condition is to be maintained)
and other kinds of optimal behaviour. Symbolic planning, on the other hand,
is focused primarily on goals of achievement, with less attention to the other
varieties. For this work we shall focus on goals of achievement alone, as in
practice they are the most common kind of goal, and they will require less
complexity in the hybrid model. We will reserve discussion of other kinds of
goals to the future work section in Chapter 8.
Therefore we shall assume that our goal is to achieve one of a set of goal
states G ⊂ S, as economically as possible. Furthermore we will require that this
G can be described as a conjunction of fluents G such that:
s ∈ G iff s |= G
To represent this goal as a reinforcement learning task it is necessary to
define a reward function. Many different reward functions might be used for this
purpose (see (Koenig & Simmons, 1993) for a good summary of such functions).
The principal differences between them lie in the relative values of the initial and
final Q-values. If the Q-values are initialised below their expected optimal values,
this is called pessimistic initialisation. Initialising them above the expected
optimal values is called optimistic initialisation. There are known advantages
and disadvantages to each (Hauskrecht et al., 1998), but ultimately the choice
is fairly arbitrary. For this work, we have chosen the pessimistic option by
initialising all Q-values to zero and using a reward function that rewards the
agent when it reaches the goal:
r(s, a, s′) =
    1 if s′ |= G
    0 otherwise    (4.1)
We could have just as easily chosen an optimistic function which punishes the
agent for every action which does not reach the goal. This would affect the time taken
to learn the policy, but not the correctness of any of the algorithms presented.
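As a concrete illustration, the reward function of Equation 4.1 might be sketched as follows. The `satisfies` predicate is a stand-in for the model check s′ |= G; the dictionary state encoding is assumed for illustration:

```python
# A minimal sketch of the pessimistic goal-of-achievement setup: Q-values are
# initialised to zero, and the only reward is +1 on entering a goal state.

def make_achievement_reward(satisfies):
    """Build r(s, a, s') from a goal predicate on the successor state."""
    def r(s, a, s_next):
        return 1.0 if satisfies(s_next) else 0.0
    return r

# Toy example: goal is "robot in kitchen".
goal = lambda s: s["robot"] == "kitchen"
r = make_achievement_reward(goal)
assert r({"robot": "hall"}, "e", {"robot": "kitchen"}) == 1.0
assert r({"robot": "hall"}, "e", {"robot": "hall"}) == 0.0
```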
4.3 Representing actions
We shall distinguish between two different kinds of actions: primitive actions and
abstract actions (behaviours). Primitive actions are concrete, discrete, single-
timestep actions which obey the Markov property. They are assumed to be
low-level operations and so are not modeled. The set of primitive actions is
denoted by P and is assumed to be finite. The set of abstract actions is denoted
by B, and the set of all actions primitive and abstract is A = P ∪ B.
The representation of abstract actions is the key to the hybridisation of plan-
ning and learning. Each behaviour has a defined purpose, specified by the trainer.
This purpose dictates when the behaviour might be used and what it achieves.
In planning jargon, these correspond to the pre-image and post-conditions of the
behaviour. In hierarchical reinforcement learning, they are the application space
and the local reward function. Our aim is to produce a single representation
which serves both these purposes.
4.3.1 Reinforcement-Learnt Teleo-operators
A Reinforcement-Learning Teleo-operator (RL-Top) is a representation of a
durative behaviour with a fixed purpose, but with a policy that is learnt by
reinforcement learning. It is based on the Teleo-operator symbolic representation
of actions (Nilsson, 1994, Benson & Nilsson, 1994). An RL-Top for a behaviour
B has a pre-image B.pre and a post-condition B.post, each of which is specified
as a list of fluents. However, unlike a teleo-operator, an RL-Top does not model
the actual operation of the behaviour, but its intended operation.
An RL-Top 〈B.pre, B.post〉 represents the fact that the intended operation
of B will result in B.post becoming true if the behaviour is initiated from a
state satisfying B.pre. The action is durative, i.e. it takes several time steps
to complete, so B.pre should be maintained until such time as B.post becomes
true. An RL-Top does not include a delete-list, and the representation makes no
claims about the side-effects of the behaviour. It will be assumed that behaviours
have no side-effects that do not immediately follow from their post-conditions,
until such time as such effects are observed. (More on this in Chapter 6.)
For example, the Go(hall, dining) behaviour from the gridworld in Sec-
tion 2.2.1 might be described as an RL-Top:
Go(hall, dining)
pre: location(robot, hall)
post: location(robot, dining)
This describes a behaviour which is applicable whenever the robot is in the hall,
and has a local goal which is satisfied when the robot enters the dining room.
The other possible effects of this behaviour are not defined.
This symbolic representation can be used to make plans which dictate the
appropriate actions to use in a situation, assuming they operate according to
their intended purpose. It can also be used to learn concrete policies for the
behaviours, by converting B.pre and B.post into a behaviour-specific reward
function.
Each behaviour B has an internal policy B.π which implements it. This
policy is a mapping from states to actions: B.π : S → B.A, where B.A is a
subset of A, specific to B. (We shall assume initially that B.A ⊂ P. Later we will
expand this to allow multiple levels of hierarchy.) The set B.A is specified by the
trainer for each behaviour B, based on her determination of which primitives are
appropriate for that behaviour. With primitive behaviours added the definition
of Go(hall, dining) becomes:
Go(hall, dining)
pre: location(robot, hall)
post: location(robot, dining)
P: { n, ne, e, se, s, sw, w, nw }
Our aim is for each behaviour to learn a policy which satisfies its intended
purpose. We do so by using a recursively optimal reinforcement learning algo-
rithm with the local reward functions:
B.r(s, a, s′) =
    1 if s′ |= B.post
    −1 if s′ ⊭ B.post and s′ ⊭ B.pre
    0 otherwise    (4.2)
for each behaviour B. This function is chosen to encourage the behaviour to find
the shortest path to its post-condition without exiting its pre-image.1 Just as
the main goal of the agent is limited to being a goal of achievement, so also we
limit the kinds of behaviours we might use. This is not seen to be too drastic
a limitation, at least for a first attempt at a hybrid model. We discuss the
relaxation of this requirement in the future work section of Chapter 8.
1 This is not guaranteed. In a stochastic world it may select a short path with a small probability of failure over a much longer path with no possibility of failure at all. This is deemed to be an acceptable trade-off.
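A sketch of the local reward function of Equation 4.2, with hypothetical predicates standing in for the fluent tests of the pre-image and post-condition:

```python
# Sketch of an RL-Top's local reward: +1 for reaching the post-condition,
# -1 for leaving the pre-image without reaching it, 0 otherwise.

def make_local_reward(pre, post):
    def r(s, a, s_next):
        if post(s_next):
            return 1.0
        if not pre(s_next):      # left the pre-image without success
            return -1.0
        return 0.0
    return r

# Go(hall, dining): pre = robot in hall, post = robot in dining.
r = make_local_reward(lambda s: s == "hall", lambda s: s == "dining")
assert r("hall", "e", "dining") == 1.0    # reached the post-condition
assert r("hall", "n", "study") == -1.0    # exited the pre-image
assert r("hall", "e", "hall") == 0.0      # still inside the pre-image
```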
4.3.2 State abstraction
As discussed in Section 2.2.3, one advantage of providing local goals to be-
haviours is that it allows state-abstractions to be tailored to individual be-
haviours. The RL-Top representation allows for this, in a simple fashion. Each
behaviour has a view which is a set of instruments used to identify the prim-
itive state representation for that behaviour. A behaviour’s view may omit
certain state-variables that are irrelevant to its operation. For example, the
Go(hall, dining) behaviour above depends only on the x and y coordinates of
the robot, and not on the locations of the book or the coffee, so these latter
instruments do not need to be included in its view. The behaviour is thus:
Go(hall, dining)
pre: location(robot, hall)
post: location(robot, dining)
view: { x, y }
P: { n, ne, e, se, s, sw, w, nw }
It is currently up to the trainer to determine which instruments to include
in a behaviour’s view when specifying the behaviour, based on her knowledge of
the problem domain. More sophisticated automatic state abstraction is left for
future work.
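The effect of a view can be illustrated with a short sketch; the instrument names and the dictionary encoding of the full state are assumptions for illustration:

```python
# Sketch: a behaviour's view projects the full instrument reading onto just the
# instruments that behaviour needs, so each behaviour learns over a smaller
# state space.

def project(state, view):
    """Return the abstract state a behaviour sees: a tuple of the view's
    instrument values, in a fixed order."""
    return tuple(state[i] for i in view)

full_state = {"x": 3, "y": 7, "book_room": "bedroom2", "coffee_room": "kitchen"}
view = ("x", "y")   # Go(hall, dining) ignores the book and coffee locations
assert project(full_state, view) == (3, 7)
```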
4.3.3 Parameterised Behaviours
There are a large number of possible variations of the Go() behaviour which
specify different initial and goal rooms. As a planning operator, it would be
natural to express all of these behaviours as an operator schema.
Teleo-operators are naturally specified as schemata, with variables which re-
late their pre-image and post-condition. Parameterised behaviours have also
been used in hierarchical reinforcement learning, although only in a fairly sim-
ple fashion. Dietterich uses parameterised behaviours in the task hierarchies for
MAXQ, but parameterisation is little more than a way of having several be-
haviours with a similar name. Each parameterisation is treated as a separate
behaviour. 2
2 More sophisticated parameterisation methods are also included in Alisp (Andre & Russell, 2002), but were unknown at the time this thesis was submitted.
Having a symbolic representation of behaviours and instruments allows us
to use a more complex relationship between parameters and behaviours. The
parameters of an RL-Top can also occur as variables in the view, the pre-image
and the post-condition. So, for example, the family of Go() behaviours can be
represented as a schema:
Go(From, To)
pre: location(robot, From)∧ door(From, To)
post: location(robot, To)
view: { x, y, id(To) }
P: { n, ne, e, se, s, sw, w, nw }
Notice that the special instrument id (the identity function) has been added to
the view. It is needed, so that the state-representation includes the identity of
the target room. In rooms such as the hall, which have many exits, the policy
Go.π will need this extra information to determine which direction to go. (The
identity of the originating room is not needed, as it is uniquely identified by
x and y.)
For a more complex example, consider a world with various objects in it and
a robot that can move and turn in all directions. A possible behaviour in such a
world might be Approach(Object). The policy for such a behaviour would rely
on the distance and angle to the target object, but not the identity of the ob-
ject itself. Instrument schemata such as distance(Object) and angle(Object)
could be included in the view for such a behaviour, which allow it to obtain the
important state information and abstract away the identity of the object.
Parameterised behaviours must have all parameters bound before they can
be executed. This binding may be done in the planning stage, or at runtime in
the execution of a plan. Examples of both will be provided when planning is
discussed.
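As an illustration, instantiating the Go(From, To) schema into a ground behaviour might look like the sketch below. The dictionary layout is invented for illustration, not Rachel's actual data structure:

```python
# Sketch of instantiating the Go(From, To) schema. The special instrument
# id(To) injects the bound target room into the view, so the learnt policy
# can distinguish destinations.

def instantiate_go(frm, to):
    return {
        "name": f"Go({frm},{to})",
        "pre":  [("location", "robot", frm), ("door", frm, to)],
        "post": [("location", "robot", to)],
        # id(to) adds the target-room identity to the state representation:
        "view": ("x", "y", ("id", to)),
        "actions": ["n", "ne", "e", "se", "s", "sw", "w", "nw"],
    }

b = instantiate_go("hall", "dining")
assert b["post"] == [("location", "robot", "dining")]
assert ("id", "dining") in b["view"]
```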
4.3.4 Hierarchies of behaviours
So far we have assumed that there is only one intermediate level of behaviours in
our hierarchy. The main task is solved in terms of behaviours and behaviours in
terms of primitive actions. However both hierarchical reinforcement learning and
hierarchical planning allow more than one level of temporal abstraction. Coarse-
grained behaviours can be represented in terms of finer-grained behaviours, which
are represented in behaviours finer still, until at the bottom of the hierarchy
appear the primitive actions.
To construct hierarchical plans, we need to identify which behaviours are
available at which levels of abstraction. We have taken the simplest possible
approach to this and allowed the trainer to specify a numerical granularity for
each behaviour. The main task will be represented as a behaviour with granu-
larity zero, the coarsest level. Behaviours with granularities 1, 2, 3, . . . represent
successively finer grained behaviours.
The levels of granularity are determined by the trainer, so that the hierar-
chical planner (to be described in Section 5.3.1) knows which behaviours are
available to it when decomposing a task. Each behaviour B has a plan B.plan
associated with it. A behaviour of granularity g is decomposed into a plan of
behaviours with granularity g + 1, if this is possible. Behaviours from this plan
may then be included in the internal policy for B.
The gridworld example is not really complex enough to require multiple levels
of granularity. Later we will introduce a simulated soccer domain which includes
three levels of granularity. At level zero there is the main task Score which is
applicable everywhere and has the goal of scoring a goal.
This is decomposed into a plan using mid-level behaviours with granularity
one. Behaviours at this level represent tactical decisions: when to pass, when
to shoot, etc. These are decomposed in turn into behaviours with granularity
two, representing simpler low level activities such as capturing the ball, turning,
dribbling, and the like.
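The granularity bookkeeping might be sketched as follows. The soccer behaviour names come from the text; the data layout is illustrative:

```python
# Sketch of granularity levels: each behaviour carries an integer granularity,
# and a behaviour at level g is decomposed into a plan over level g+1
# behaviours.

behaviours = {
    "Score":   {"granularity": 0},   # main task
    "Pass":    {"granularity": 1},   # tactical decisions
    "Shoot":   {"granularity": 1},
    "GetBall": {"granularity": 2},   # low-level skills
    "Dribble": {"granularity": 2},
}

def available_for(parent, behaviours):
    """Behaviours usable when decomposing `parent`: the next-finer level."""
    g = behaviours[parent]["granularity"] + 1
    return sorted(b for b, v in behaviours.items() if v["granularity"] == g)

assert available_for("Score", behaviours) == ["Pass", "Shoot"]
assert available_for("Pass", behaviours) == ["Dribble", "GetBall"]
```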
4.4 Summary
In this chapter we have combined the elements of the previous chapters into
a single representation for states, goals and actions which can be used for both
learning and planning. The heart of its design is the Reinforcement-Learnt Teleo-
operator which models the purpose of a behaviour symbolically, and provides a
mapping to convert the symbolic representation into a local reward function for
learning the behaviour. The next chapter will present algorithms which use this
representation for planning and learning.
Chapter 5
Rachel: Planning and Acting
In this chapter and the next we shall present a description of Rachel, a hy-
brid planning/reinforcement learning system that implements the reinforcement-
learnt teleo-operator formalism of the previous chapter. Rachel consists of
three interacting components: the planner which builds plans, the actor which
executes them, and the reflector, which reflects on the outcomes of execution and
refines the world model. Figure 5.1 shows a block diagram of these components
and the relationships between them. The description of this system has been
split into two parts: in this chapter we shall focus discussion on the interaction
between the planner and the actor, leaving the reflector for the next.
Figure 5.1: The three parts of the Rachel architecture. The planner builds plans from the RL-Top model; the actor executes plans and learns policies; the reflector monitors execution traces and learns side-effect descriptions, which refine the model.
The relationship between the planner and the actor is straightforward – the
planner builds plans, and the actor executes them. However the manner in which
plans are constructed and executed is somewhat unconventional, so as to in-
corporate reinforcement learning. We shall present an algorithm for the planner
which deliberately produces plans with alternative paths, and two different al-
gorithms for the actor which learn to choose among these alternatives, and learn
primitive policies for behaviours.
5.1 The Planner
Rachel implements a custom planner to build teleo-reactive plan trees. Given a
goal G it produces a plan to achieve that goal. The planner uses the means-ends
analysis technique described in Section 3.2.2, with some variations.
5.1.1 Semi-universal planning
The first variation is that the plans it produces are semi-universal. That is,
the planner builds paths by backward-chaining from the goal, and stores all the
paths it generates, regardless of whether they include the current state or not,
stopping once it covers the current state. While the “unsuccessful” paths are not
useful in the agent’s current state, they may become active during the execution
of the plan, should one of the behaviours on the correct path fail unexpectedly.
This is quite likely, given that the behaviours are being learnt as the agent
goes.
Furthermore, it is normal in the construction of universal plans to keep only
one path from any particular state, and to discard any new step which does not
add to the set of states covered by the plan. This is because most planners are
only looking for the single shortest path to the goal from each state. Rachel,
on the other hand, is actively looking for alternative paths, of any length, and
so does not discard such redundant steps. It does, however, avoid creating paths
which contain loops. So a new step which does not add anything to the path it
extends is discarded.
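The keep/discard policy for new plan steps can be illustrated with a sketch in which node conditions are sets of fluents and implication is tested by superset inclusion; this encoding is assumed here for illustration:

```python
# Sketch of Rachel's redundant-step policy: a new step is kept even if its
# condition duplicates one elsewhere in the tree (an alternative path), but
# is discarded if it is implied by a condition already on the path it
# extends, since that would create a loop.

def creates_loop(new_condition, path_conditions, implies):
    """`implies(a, b)` is a stand-in for the logical test a => b."""
    return any(implies(new_condition, c) for c in path_conditions)

# With conditions as frozensets of fluents, implication is superset inclusion.
implies = lambda a, b: a >= b
path = [frozenset({"at_hall"}), frozenset({"at_dining"})]
assert creates_loop(frozenset({"at_hall", "holding_coffee"}), path, implies)
assert not creates_loop(frozenset({"at_study"}), path, implies)
```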
So consider the grid-world problem from earlier chapters. Say, for example,
the agent’s goal was to get both the coffee and the book, starting from its initial
location in the study. A linear planner would produce one of the two plans
Figure 5.2: Two linear plans to fetch both the book and the coffee. [Plan diagrams omitted: each plan is a chain of Go and Get behaviours annotated with regressed node conditions.]
Figure 5.3: A semi-universal plan to fetch both the book and the coffee. The dashed arrows indicate where nodes have been omitted from the diagram, to save space. [Plan diagram omitted.]
shown in Figure 5.2. Both of these are valid alternative ways to achieve the
goal. Most planners would choose the one on the left, as the shorter of the two.
Rachel’s planner instead produces the semi-universal plan shown in Figure 5.3,
which contains both the linear plans. The job of choosing one or other of these
alternatives is not up to the planner. Instead, it is delegated to the actor, which
will do so using hierarchical reinforcement learning.
Notice, however, that the semi-universal plan has only been expanded seven
levels deep, so not all of the second of the linear plans is included. This is because
the planner is not completely universal. It does not try to exhaustively search
for all possible paths. Rather, it uses iterative deepening to successively expand
the depth of the tree until the current state is covered. In this case, the left-hand
branch covers the robot’s starting state, so the plan is not expanded any further.
That is, until the agent arrives in a state which is not covered by this plan, in
which case the plan will be further expanded until either the state is covered or
the plan cannot be grown any further.
Rachel’s plan also contains many “irrelevant” alternatives (only some of
which are included in the figure). These alternatives are not truly irrelevant
however. They are kept in case any of the behaviours fail unexpectedly, resulting
in a state which is not covered by either of the linear plans, e.g. if the robot
accidentally entered the laundry while navigating from the dining room to the
kitchen. A linear planner would have to re-plan in this instance. A universal
planner, like Rachel’s, can take advantage of the fact that it has already found
a contingency to handle this situation.
5.1.2 Variable binding
Another variation in Rachel’s planner is in the handling of variables in goals
and parameterised behaviours. The binding of variables may be done at planning
time, or may be delayed until run-time. The difference is apparent in the fol-
lowing example. Consider the agent in the example world from earlier chapters,
with the goal:
G = location(robot, Room) ∧ location(coffee, Room)
The unbound variable Room is treated as being existentially quantified. So this
goal is satisfied when the robot is in the same room as the coffee. A behaviour
B can be used to achieve this goal if there exists a substitution σ for variables in G and the
parameters of B such that:
σ(B.post) ∩ σ(G) ≠ ∅    (5.1)
If this is the case, then the regressed condition is given by
Fbefore = (σ(G)− σ(B.post)) ∪ σ(B.pre) (5.2)
which is an extension of Equation 3.4 to include variable binding. If variable
binding is required to be done at plan time, then σ must bind all the parameters
of B to constant values. If run-time binding is allowed then parameters of B may
be bound to variables, however the variables must be guaranteed to be bound
by the run-time evaluation of Fbefore. The mode information for fluents is used to
ensure Fbefore is Prolog-ordered, i.e. that there is an ordering of the fluents which
ensures that every unbound variable occurs as the output of a fluent before it
occurs as an input.
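The Prolog-ordering test can be sketched as a greedy check; the encoding of fluents and modes below is invented for illustration:

```python
# Sketch of the Prolog-ordering check: a regressed condition is valid for
# run-time binding if its fluents can be ordered so that every variable is
# produced as an output ('-') before it is consumed as an input ('+').

def prolog_ordered(fluents):
    """fluents: list of (name, [(arg, mode), ...]) with mode '+' or '-'.
    Greedily emit any fluent whose input variables are already bound."""
    bound, remaining = set(), list(fluents)
    while remaining:
        ready = next((f for f in remaining
                      if all(not a[:1].isupper() or a in bound
                             for a, m in f[1] if m == "+")), None)
        if ready is None:
            return False           # some input variable can never be bound
        bound.update(a for a, m in ready[1] if a[:1].isupper())
        remaining.remove(ready)
    return True

# location(robot, From), door(From, Room), location(coffee, Room):
fs = [("door", [("From", "+"), ("Room", "-")]),
      ("location", [("robot", "+"), ("From", "-")]),
      ("location", [("coffee", "+"), ("Room", "-")])]
assert prolog_ordered(fs)

# If Room could only be an input, the set would not be orderable:
fs_bad = [("location", [("robot", "+"), ("Room", "+")])]
assert not prolog_ordered(fs_bad)
```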
So in the example, the behaviour Go(From, To) can be used to achieve the
goal G above with the binding σ = {To/Room}. The regressed condition is then:
Fbefore = (σ(G)− σ(Go(From, To).post)) ∪ σ(Go(From, To).pre)
= ({location(robot, Room), location(coffee, Room)}
−{location(robot, Room)}) ∪ {location(robot, From), door(From, Room)}
= {location(robot, From), door(From, Room), location(coffee, Room)}
If the fluents location and door can each be used with the second argument
as an output, then this is a valid Prolog-ordered set and the variables Room
and From can be bound at run-time. If not, then a constant binding of these
variables might be necessary. In its present form, the planner requires the trainer
to specify when a particular variable is to be bound at plan-time, and to specify
a set of candidate values for that variable.
5.1.3 The planning algorithm
Algorithms 5 and 6 show pseudocode for the planning process. Plans are grown
by iterative deepening. The GrowPlan function grows an existing plan one
step deeper. It calls the PlanStep function at each leaf node of the required
depth, to generate new steps. PlanStep implements the regression process,
Algorithm 5 Rachel’s planning algorithm: Iterative Deepening
function GrowPlan(plan P)
ExpandNode(P.root, P.depth + 1, {})
end GrowPlan
function ExpandNode(node N, depth d, explored E)
E ← E ∪ {N.cond}
if d > 0 then
for each child C ∈ N.children do
ExpandNode(C, d− 1, E)
end for
else
PlanStep(N, E)
end if
end ExpandNode
finding a behaviour which achieves the input condition and regressing it to get
the new node condition.
Further extensions to the planner will be described in later sections, including
hierarchical planning (Section 5.3.1), conditional side-effects (Section 6.6) and
exploratory plan steps (section 6.6.1).
For all the examples in this thesis, we shall assume the plans used have been
grown to the maximum possible depth, except where noted. This is only possible
as the examples involve relatively simple planning tasks. For more complex tasks,
planning can be interleaved with acting – the actor can explore the paths that
the planner has already discovered while the planner searches for more.
5.1.4 Computational complexity
It should be noted that the computation time of the above algorithm for plan
construction can potentially be exponential in the number of different situa-
tions (conjunctions of state fluents) the agent can encounter. This is a serious
practical problem once the agent’s domain becomes moderately complex. It is
allayed somewhat by the addition of hierarchy in Section 5.3.1, but still ought to
be addressed.
The advantage of planning depends on the relative costs of computation
and exploration. For any real-world problem, computation will be orders of
Algorithm 6 Rachel’s planning algorithm: Adding new nodes
function PlanStep(node N, explored E)
for each behaviour B′ do
\\Find which of the node conditions B′ achieves, if any
(achieved , unachieved )← Achieved(B′, N.cond)
if achieved = {} then
Skip to the next behaviour
end if
\\Check for interference
if B′.post ∧ N.cond⇒⊥ then
Skip to next behaviour
end if
\\Construct the new node’s condition
newCondition ← unachieved ∧ B′.pre
if newCondition ⇒⊥ then
Skip to the next behaviour
end if
\\Check if the new condition has already been explored
for each condition C ∈ E do
if newCondition ⇒ C then
Skip to the next behaviour
end if
end for
\\Add the new node tree
Nnew .cond← newCondition
Nnew .parent← N
Nnew .B← B′
N.children← N.children ∪ {Nnew}
end for
end PlanStep
magnitude faster than actions execute, so it is worth doing a considerable amount
of computation in order to avoid unprofitable exploration. The exact nature of
this trade-off will vary from one application to another. A more sophisticated
planner would allow the trainer to tune the amount of search it does relative
to the expected advantage. We discuss this idea, along with other possible
improvements to the planner, in Section 8.3.1.
5.2 The Actor
The actor is the part of Rachel which directly controls the agent’s actions.
Its job is to choose one of the alternative behaviours offered by the plan, and
execute it, learning a primitive policy for that behaviour according to the local
reward function given by that behaviour’s RL-Top (Equation 4.2). The choice of
behaviours is to be optimised according to the global reward function r(s, a, s′)
(Equation 4.1).
We shall describe two different algorithms for the actor, which implement
different execution semantics. The first, Planned Hierarchical Semi-Markov Q-
Learning, will use the standard subroutine-semantics of HSMQ, MAXQ-Q and
others, consulting the plan only when a new behaviour needs to be selected. The
second algorithm, Teleo-Reactive Q-Learning uses semantics based on those for
teleo-reactive plan execution (Section 3.2.3). This allows for more reactivity to
changes in the world, but requires a more complex algorithm, as will be shown.
We concentrate in this section on algorithms which use a single intermediate
level of hierarchy only. The algorithms will be extended to include multiple levels
of hierarchy in Section 5.3, later in the chapter.
5.2.1 Planned Hierarchical Semi-Markov Q-Learning
The simplest use of plans to inform hierarchical learning is as a replacement for
the task-hierarchy. Where the HSMQ algorithm (Algorithm 3 on page 38) con-
sults the function TaskHierarchy to determine the set of available behaviours
at each choice point, Planned Hierarchical Semi-Markov Q-Learning (P-HSMQ)
uses the plan instead. Algorithm 7 shows pseudocode for this process.
The ActiveBehaviours function returns the set of behaviours dictated by
the active nodes of the plan (with duplicates removed). One of these behaviours
Algorithm 7 Planned HSMQ-Learning
function P-HSMQ(goal G)
plan P ← BuildPlan(G)
t← 0
Observe state st
Bt ← ActiveBehaviours(P, st)
while st 6|= G do
T ← t
Choose behaviour B← π(st) from Bt according to an exploration policy
sequence S ← {}
while st |= B.pre ∧ st 6|= B.post do
Choose primitive action at ← B.π(st)
according to an exploration policy
Execute action at
Observe next state st+1
B.Q(st, at) ←α B.r(st, at, st+1) + γ max_{a ∈ B.P} B.Q(st+1, a)
S ← S + 〈st, at, st+1〉
t← t + 1
end while
k ← 0
totalReward ← 0
for each 〈s, a, s′〉 ∈ S do
totalReward ← totalReward + γkr(s, a, s′)
k ← k + 1
end for
Bt ← ActiveBehaviours(P, st)
Q(sT, B) ←α totalReward + γ^k max_{B′ ∈ Bt} Q(st, B′)
end while
end P-HSMQ
is then selected by the reinforcement learning algorithm for execution. A be-
haviour B learns its policy as it executes, using its local reward function, until
it terminates, either successfully (satisfying B.post) or unsuccessfully (prema-
turely leaving B.pre). The experiences gathered while executing the behaviour
are then evaluated using the global reward function and used to update its global
Q-value.
This algorithm has the same theoretical properties as the HSMQ algorithm
on which it is based, and is guaranteed to converge to a recursively optimal
policy within the restrictions placed on it by the plan.
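The core of the P-HSMQ global update, the discounted sum of rewards over a behaviour's execution backed up by the best available next behaviour, can be sketched as follows. The function and parameter names are illustrative, not Rachel's actual interface:

```python
# Sketch of the P-HSMQ global update: after behaviour B runs for k steps, the
# Q-value at its initiation state moves toward the discounted reward total
# plus the discounted value of the best behaviour available afterwards.

def phsmq_update(Q, s_init, B, trace, r, gamma, alpha, next_behaviours, s_final):
    total, k = 0.0, 0
    for (s, a, s_next) in trace:           # discounted global reward over the run
        total += (gamma ** k) * r(s, a, s_next)
        k += 1
    backup = max(Q.get((s_final, B2), 0.0) for B2 in next_behaviours)
    target = total + (gamma ** k) * backup
    Q[(s_init, B)] = (1 - alpha) * Q.get((s_init, B), 0.0) + alpha * target
    return Q

Q = {}
trace = [("s0", "e", "s1"), ("s1", "e", "goal")]   # two-step run reaching the goal
r = lambda s, a, sn: 1.0 if sn == "goal" else 0.0
phsmq_update(Q, "s0", "Go", trace, r, 0.9, 0.5, ["Go"], "goal")
assert abs(Q[("s0", "Go")] - 0.5 * 0.9) < 1e-9
```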
5.2.2 Termination Improvement
While the P-HSMQ algorithm is effective, it does not make full use of the infor-
mation available to it in the plans it builds. It only checks the appropriateness of
a behaviour when it is initiated, and always executes it until completion, ignoring
any effects that might cause the action to no longer be appropriate.
Consider the following scenario: We add a further complexity to the example
grid-world from earlier chapters. There is a bump on the floor near the west
end of the hall, as indicated by the shaded area shown in Figure 5.4. When the
robot moves onto the bump while carrying a cup of coffee, there is a 10% chance
that it spills.
The agent’s goal is to fetch the coffee and the book and take them both
into the lounge. It is faster, in this scenario, to fetch the coffee first and then
the book. Suppose the agent has already visited the kitchen and is carrying the
coffee. It re-enters the hall and starts executing Go(hall, bedroom2), as dictated
by the plan (Figure 5.5), but as it passes over the bump it spills the coffee. Once
the P-HSMQ algorithm has begun executing a behaviour it is committed to
completing it, so the robot continues down the hall to the bedroom. Only once
it enters the room does it re-evaluate its choice of behaviour, and realises that
it needs to return to the kitchen to fetch another cup.
A more efficient solution would be to return to the kitchen as soon as the
coffee is spilt, but to do so the agent would need to relax its commitment to
behaviours, and be able to interrupt a behaviour before it terminates. This is a
kind of termination improvement as described in Section 2.4. As pointed out in
that section, the difficulty with termination improvement is that the more often
the agent is able to make choices about its behaviours, the more complex the
Figure 5.4: The example world revisited. A “bump” has been added at the western end of the hall (indicated by the shaded squares). When the robot moves onto the bump there is a 10% chance that it spills the coffee.
policy is to learn. Optimally we would like the agent to only reconsider its choice
when the current behaviour is no longer worthwhile.
A plan provides us with a way to make this decision. Each node dictates
the conditions which make its corresponding behaviour appropriate. So long
as the node is active, the behaviour should continue executing. Should the
node become inactive, then the behaviour may no longer be appropriate and
should be interrupted. In the plan in Figure 5.5 the node which dictates the
Go(hall, bedroom2) behaviour has the condition:
location(robot, hall) ∧ holding(coffee) ∧ location(book, bedroom2)
So as long as the robot is in the hall and carrying the coffee, this behaviour will
continue executing. However if the robot should spill the coffee, holding(coffee)
will no longer be satisfied and the node will become inactive (even though the
Go behaviour has not terminated). This is a good indication that the behaviour
should be interrupted, and another chosen from the newly active nodes (shown
with broken outlines in Figure 5.5). Notice that the Go(hall, bedroom2) behaviour is again
in the set of available behaviours, but is dictated by a different node, as part of
Figure 5.5: A plan for fetching the coffee and the book. Dotted arrows indicate
places where steps have been omitted, for the sake of brevity. The highlighted
nodes are those that are active before and after the coffee is spilled.
a sequence that fetches the book first, and then the coffee. A better alternative
in this situation is the Go(hall, dining) behaviour which takes the robot back
towards the kitchen to fetch another coffee.
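The monitoring rule described above (keep executing a behaviour only while its dictating node's condition holds) can be sketched in a few lines. The names used here (node_active, the fluent strings) are hypothetical illustrations, not Rachel's implementation:

```python
# Minimal sketch of teleo-reactive node monitoring (illustrative only).
# A node's condition is modelled as a set of fluents that must all hold
# in the current state for the node to remain active.

def node_active(condition, state):
    """A node is active while every fluent in its condition holds."""
    return condition <= state

# Condition of the node dictating Go(hall, bedroom2):
condition = {"location(robot, hall)", "holding(coffee)",
             "location(book, bedroom2)"}

state = set(condition)
print(node_active(condition, state))   # node active: keep executing

state.discard("holding(coffee)")       # the coffee is spilt
print(node_active(condition, state))   # node inactive: interrupt the behaviour
```

When the check fails, the executing behaviour is interrupted and a new behaviour is chosen from the nodes that are active in the new state.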
This process of executing a behaviour as long as its node is active follows
the teleo-reactive semantics for plan execution described in Section 3.2.3. What
follows is an HRL algorithm which implements these semantics, called Teleo-
Reactive Q-Learning.
5.2.3 Teleo-Reactive Q-Learning
Teleo-Reactive Q-Learning (TRQ), as shown in algorithm 8, is an adaptation
of P-HSMQ which implements the teleo-reactive execution semantics described
above.
There are two important issues that need to be dealt with in implementing
this algorithm. They are: (1) maintaining the Semi-Markov Property of inter-
rupted behaviours (necessary for the correctness of the SMDPQ update rule);
and (2) ensuring that behaviours’ internal policies are fully explored in spite of
interruptions. Each of these issues is detailed below.
Maintaining the Semi-Markov property
The correctness of the SMDPQ-Learning update rule used in TRQ requires that
the behaviours executed obey the Semi-Markov property. That is, that the
outcomes of executing a behaviour – its duration, the rewards it receives and
the state it terminates in – depend only on the identity of the behaviour and the
state in which it is initiated. The teleo-reactive execution semantics violate this
condition. If there are two different nodes active in the same state, dictating
the same behaviour but with different interruption criteria, then the outcome of
executing the behaviour will depend on which node is chosen.
Returning to the example world, consider the goal:
G = location(robot, bedroom2) ∧ holding(Object)
i.e. the robot is to be in the bedroom and holding something. The variable
Object is considered to be existentially quantified, so the goal is satisfied if the
robot is carrying either the coffee or the book. The plan for this goal is shown in
Figure 5.6. Suppose the agent decides to fetch the coffee first, and having done
Algorithm 8 TRQ-Learning
function TRQ-1(goal G)
plan P ← BuildPlan(G)
t ← 0
Observe state s_t
N_t ← ActiveNodes(P, s_t)
while s_t ⊭ G do
T ← t
Choose node N ← π(s_t) from N_t according to an exploration policy
sequence S ← {}
B ← N.B
while N ∈ N_t do
Choose primitive action a_t ← B.π(s_t) according to an exploration policy
Execute action a_t
Observe next state s_{t+1}
B.Q(s_t, a_t) ←α B.r(s_t, a_t, s_{t+1}) + γ max_{a∈B.A} B.Q(s_{t+1}, a)
S ← S + 〈s_t, a_t, s_{t+1}〉
t ← t + 1
N_t ← ActiveNodes(P, s_t)
end while
k ← 0
totalReward ← 0
for each 〈s, a, s′〉 ∈ S do
totalReward ← totalReward + γ^k r(s, a, s′)
k ← k + 1
end for
Q(s_T, N) ←α totalReward + γ^k max_{N′∈N_t} Q(s_t, N′)
if s_t ⊨ B.pre ∧ s_t ⊭ B.post then
\\ Behaviour B has been interrupted prematurely
with probability η do
Persist(B)
end with
end if
end while
end TRQ-1
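As a concrete, deliberately tiny illustration of the control flow in Algorithm 8, the sketch below runs a single-node TRQ-style loop on a one-dimensional corridor. All of the names (Node, trq_episode, the corridor itself) are hypothetical; this is not the thesis implementation, only the shape of the nested behaviour-level and node-level updates:

```python
import random

# Toy sketch of the TRQ-1 loop on a 1-D corridor (illustrative only).
# States are integers 0..5; the goal is state 5. A single plan node dictates
# a "move along the corridor" behaviour and is active while the agent is
# left of the goal.

random.seed(0)
ALPHA, GAMMA, ACTIONS = 0.5, 0.9, [-1, +1]

class Node:
    def __init__(self):
        self.Q = {}             # node-level Q-values, keyed by start state
        self.behaviour_Q = {}   # behaviour's internal Q-values, keyed by (s, a)
    def active(self, s):
        return 0 <= s < 5       # the node's condition

def trq_episode(node):
    s = 0
    while s != 5:                        # while the goal is not satisfied
        start, rewards = s, []
        while node.active(s):            # execute while the node is active
            if random.random() < 0.2:    # epsilon-greedy exploration
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS,
                        key=lambda b: node.behaviour_Q.get((s, b), 0.0))
            s2 = max(0, min(5, s + a))
            r = 1.0 if s2 == 5 else -0.01
            best = max(node.behaviour_Q.get((s2, b), 0.0) for b in ACTIONS)
            q = node.behaviour_Q.get((s, a), 0.0)
            node.behaviour_Q[(s, a)] = q + ALPHA * (r + GAMMA * best - q)
            rewards.append(r)
            s = s2
        # SMDP-style update for the node from the discounted return
        total = sum(GAMMA ** i, for_ in []) if False else \
                sum(GAMMA ** i * r for i, r in enumerate(rewards))
        q = node.Q.get(start, 0.0)
        node.Q[start] = q + ALPHA * (total + GAMMA ** len(rewards) * 0.0 - q)
    return s

node = Node()
for _ in range(200):
    trq_episode(node)
# The behaviour has learnt to move towards the goal:
print(node.behaviour_Q.get((4, +1), 0.0) > node.behaviour_Q.get((4, -1), 0.0))
```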
Algorithm 9 TRQ-Learning: Persisting with a behaviour
function Persist(behaviour B)
while s_t ⊨ B.pre ∧ s_t ⊭ B.post do
Choose primitive action a_t ← B.π(s_t) according to an exploration policy
Execute action a_t
Observe next state s_{t+1}
B.Q(s_t, a_t) ←α B.r(s_t, a_t, s_{t+1}) + γ max_{a∈B.A} B.Q(s_{t+1}, a)
t ← t + 1
end while
end Persist
so, it returns to the hall. When the agent enters the hall it is presented with a
choice: There are two active nodes in the plan which are highlighted in the figure.
Both nodes indicate the same behaviour to be executed: Go(hall, bedroom2).
Figure 5.6: A plan for fetching either the coffee or the book. The two highlighted
nodes are active when the agent is in the hall and holding the coffee.
While executing the behaviour the robot passes over the bump and spills the
coffee. What happens next depends on which node was being executed. The
right-hand node is still activated, so if it was selected the agent would continue
executing the behaviour. The left-hand node however is no longer active. If
this node was the one selected then the behaviour would be interrupted and a
new choice made. This shows that the termination of the behaviour depends
on more than just its starting state. More information is needed to satisfy the
Semi-Markov property.
The solution is to explicitly recognise the difference between these two cases
and treat them as distinct alternatives for the agent to choose between. We
treat each node of the plan as a separate alternative. A particular node always
executes the same behaviour and terminates under the same conditions, so it
will satisfy the Semi-Markov property.
Teleo-reactive Q-Learning assigns Q-values to nodes of the plan rather than
to behaviours. We define an optimal state-node value function:
Q*(s, N) = E { Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V*(s_{t+k}) | ε(s, N, t) }     (5.3)
where ε(s, N, t) indicates the event of executing behaviour N.B in state s at time
t, until N is no longer active, and k is the duration of this execution.
We learn an approximate value function Q(s, N), by a version of the SMDPQ-
Learning update rule:
Q(s_t, N_t) ←α R_t + γ^k max_{N∈N} Q(s_{t+k}, N)     (5.4)
Execution then consists of finding all active nodes in the plan and selecting
the one with the best Q-value. The behaviour corresponding to this policy node
is then executed and learnt in the usual way. When the node is no longer active
the behaviour is interrupted, whether it has terminated or not, and the gathered
experience is used to update the value of the node.
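The node-level update of Equation 5.4 can be sketched as follows; the function name, the string-valued states and nodes, and the particular reward sequence are all hypothetical, chosen only to show the shape of the SMDPQ-style update:

```python
# Sketch of the SMDPQ-style node update (in the spirit of Equation 5.4).
def update_node_value(Q, start_state, node, rewards, next_state, active_nodes,
                      alpha=0.1, gamma=0.9):
    """Update Q(s, N) from the rewards gathered while node N was active."""
    k = len(rewards)
    # Discounted return accumulated over the k steps the node was active.
    total = sum(gamma ** i * r for i, r in enumerate(rewards))
    # Bootstrap from the best node active in the state where N was left.
    best_next = max((Q.get((next_state, n), 0.0) for n in active_nodes),
                    default=0.0)
    old = Q.get((start_state, node), 0.0)
    Q[(start_state, node)] = old + alpha * (total + gamma ** k * best_next - old)
    return Q[(start_state, node)]

Q = {}
v = update_node_value(Q, "hall", "Go(hall,bedroom2)", [-0.01, -0.01, 1.0],
                      "bedroom2", ["Get(book,bedroom2)"])
print(round(v, 4))   # → 0.0791
```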
Since the execution of a node in a plan obeys the Semi-Markov property,
this update rule is guaranteed to result in convergence to an optimal policy (in
terms of node selection) provided that the underlying primitive policies for nodes
converge. Ensuring this convergence is the issue we will examine next.
Persisting with interrupted behaviours
In a recursively optimal HRL algorithm such as TRQ the policy of a behaviour
is independent of its calling context. A behaviour aims to learn a policy to
maximise its local rewards, regardless of its parent task. To ensure that this
happens, behaviours must be allowed to fully explore their application spaces.
To be precise, the correct convergence of the Q-values (and thus the policy) for
a state s cannot be guaranteed unless all states reachable from s are adequately
explored (infinitely often in the limit).
The teleo-reactive semantics alone will not guarantee this. If a node N in a
plan has condition N.cond and dictates behaviour B then the execution of B is
limited to the set of states that satisfy N.cond. States within the application
space of B which do not satisfy N.cond will never be explored (unless there is
another node which also dictates B under less restrictive conditions).
This, of course, can have a seriously detrimental effect on learning the internal
policy of B. Convergence to the optimal policy is no longer guaranteed, even if
the optimal policy lies within the set of states satisfying N.cond. To illustrate
this, consider the narrow bridge example in Figure 5.7.
Figure 5.7: A narrow bridge over a chasm. (a) On the left is the optimal policy
for crossing the bridge. (b) On the right is the policy learnt if the behaviour is
interrupted whenever it moves to the right-hand side of the bridge.
The bridge links two sides of a chasm. It is 10 cells long but only 2 cells
wide. The agent can move in any of the 4 cardinal directions, but each movement
includes a probability of error p_e, resulting in the agent immediately falling off
the bridge. Let us define a behaviour Cross:
Cross
Pre: on bridge
Post: on far side
According to the local reward function for RL-Tops the behaviour gets reward
+1 for reaching the other side and penalty −1 for falling off the bridge. The
optimal policy is to head directly for the other side, as shown in Figure 5.7(a).
The numbers in each cell show the optimal state-value V*(s) for the cell, with
γ set to 1 and p_e at 0.1. Notice that for the first half of the bridge the values
are negative, indicating that the expected probability of falling off the bridge is
greater than the probability of reaching the other side. Still, there are no better
alternative policies for this behaviour.
Now consider what would happen if this behaviour were only ever used when
dictated by a node which required that the agent remain on the left-hand side of
the bridge, and interrupted the behaviour whenever it strayed to the right hand
side of the bridge. The optimal policy for Cross does not violate this condition,
but learning that policy is no longer possible. Since actions on the right side of
the bridge are never explored, the Q-values for all those states remain at their
initial value of zero. Since moving to the right-hand side of the bridge appears
to yield a future return of zero, the agent prefers to move to the right of the
bridge instead of moving forward, for the first three cells of the bridge. The convergent
policy for this situation is shown in Figure 5.7(b).
While this is a pathological example, it is clear that to guarantee the cor-
rect convergence of the internal policies of behaviours, they must be allowed to
explore their application spaces thoroughly. To achieve this, the TRQ algorithm occa-
sionally elects to execute a behaviour to completion rather than interrupt it. If
a behaviour is interrupted before it terminates, then the algorithm may decide,
with probability η to ignore the interruption and persist with the behaviour until
it succeeds or fails. Setting the value of η is a tradeoff between taking advan-
tage of interruptions (when η is low), and faster convergence of the behaviours’
policies (when η is high). A greedy-in-the-limit schedule of η values is perhaps
the best way to resolve this.
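The persistence rule can be sketched as a simple decision function. The dictionary-based Behaviour representation and the function name are hypothetical; only the rule itself (persist with probability η when the interruption is premature) comes from the text:

```python
import random

# Sketch of the persistence rule: when a behaviour is interrupted before it
# terminates, ignore the interruption with probability eta and run the
# behaviour to completion.

def on_interruption(behaviour, state, eta, rng=random.random):
    """Return 'persist' or 'interrupt' for a prematurely interrupted behaviour."""
    premature = behaviour["pre"](state) and not behaviour["post"](state)
    if premature and rng() < eta:
        return "persist"          # off-policy exploration at the behaviour level
    return "interrupt"

cross = {"pre": lambda s: s["on_bridge"], "post": lambda s: s["on_far_side"]}
state = {"on_bridge": True, "on_far_side": False}
print(on_interruption(cross, state, eta=1.0))   # always persists when eta = 1
print(on_interruption(cross, state, eta=0.0))   # never persists when eta = 0
```

With an intermediate η the agent trades responsiveness to interruptions against exploration of the behaviour's full application space, as discussed above.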
Once the algorithm begins to persist with a behaviour, all of the hierarchy
above that behaviour is ignored. The value of the node that was dictating the
behaviour is updated as if the behaviour was interrupted. Experiences gathered
while persisting with a behaviour are only used to update that behaviour’s internal
Q-values, and are ignored by the level above. In this way, persistence is a
kind of off-policy exploration, at the behaviour level.
When a persisting behaviour terminates, either successfully or unsuccessfully,
control returns to the plan and a new plan node is selected according to the next
state.
Proof of convergence
With these two conditions the policies learnt by the TRQ algorithm should be
guaranteed to converge with probability 1 to a recursively optimal policy within
the limits of the plan. The proof of this statement would follow the same lines as the
proof for MAXQ-Q (Dietterich, 2000a) and of SMDP Q-learning (Parr, 1998).
We will outline it here, without entering into the details.
The proof is inductive. We start by proving the convergence of the behaviours
of finest granularity, which learn policies directly in terms of primitive actions.
We then proceed to prove convergence for successively coarser levels of granu-
larity until we reach the top of the hierarchy.
The base case is straightforward. The behaviours of finest granularity use Q-
learning to learn a policy using primitive actions. The world is a Markov Decision
Problem, so given an appropriate schedule of learning rates, these behaviours will
converge to optimal policies with probability 1.
The recursive case is based on Proposition 4.5 from (Bertsekas & Tsitsiklis,
1996), which describes the necessary conditions to prove the convergence of an
iteration of the form:
r_{t+1}(i) ←α_t(i) (U r_t)(i) + w_t(i) + u_t(i)
for a mapping U with random noise term wt and a decaying error term ut. It
relies on two important factors:
1. The SMDP Q-Learning update rule (Equation 2.23) is a weighted max-
norm pseudo-contraction. That is, the mapping U :
(UQ)(s, N_t) = R_t + γ^k max_{N∈N} Q(s_{t+k}, N)
satisfies the inequality:
‖UQ_t − Q*‖_ξ ≤ β‖Q_t − Q*‖_ξ
for some positive weight vector ξ and scalar β ∈ [0, 1). This fact is proven
by Parr (1998).
2. The error term B.u_t(s, N) given by:

B.u_t(s, N) = ( ∫_{−∞}^{+∞} r R(r | s, π(s)) dr + Σ_{s′,k} T(s′, k | s, N) γ^k max_{N′∈B} B.Q(s′, N′) )
            − ( ∫_{−∞}^{+∞} r R*(r | s, π(s)) dr + Σ_{s′,k} T*(s′, k | s, N) γ^k max_{N′∈B} B.Q(s′, N′) )
converges to zero with probability 1. This term represents the difference
between doing an update with current internal policy of N.B (with out-
comes given by R and T ), and doing an update with the optimal internal
policy of N.B (with outcomes given by R? and T ?).
This condition follows from the inductive hypothesis. If B is of granularity
g then N.B is of granularity g+1, so we can assume that the internal policy
of N.B converges with probability 1. Therefore T (s′, k|s, N) converges
to T ?(s′, k|s, N) and B.ut(s, N) converges to zero with probability 1, as
required. (The complete proof requires a particular upper bound on the
rate of convergence, but we shall not concern ourselves with such details
here.)
5.3 Multiple levels of hierarchy
So far the explanations of Rachel have assumed only one level of hierarchy
between the main goal and the primitive policies. In certain domains it may be
desirable to have two or more levels of hierarchy. In this section we describe how
the algorithms presented above can be extended to multiple levels of hierarchy.
The key element is the idea of behaviour granularity, as presented in Sec-
tion 4.3.4. The main task is treated as a behaviour of granularity zero, with
post-condition equal to the goal and a pre-image that is true in all states (op-
tionally a smaller pre-image may be used to treat some states as failure states).
We shall refer to this behaviour as Root. We then extend the planner to al-
low it to decompose a behaviour of granularity g into a plan of behaviours of
granularity g + 1. We can do this recursively until we reach a behaviour that
cannot be further decomposed, and so learns a primitive policy instead. Execut-
ing the resulting hierarchy of plans will require an extended version of the TRQ
algorithm.
5.3.1 Hierarchical planning
Algorithm 10 Hierarchical planning: Iterative Deepening
function GrowPlan(behaviour B)
ExpandNode(B, B.plan.root, B.plan.depth + 1, {})
end GrowPlan
function ExpandNode(behaviour B, node N, depth d, explored E)
E ← E ∪ {N.cond}
if d > 0 then
for each child C ∈ N.children do
ExpandNode(B, C, d− 1, E)
end for
else
PlanStep(B, N, E)
end if
end ExpandNode
Three modifications need to be made to the planning algorithm to accom-
modate hierarchical planning. All three are fairly simple:
1. Instead of taking a goal expression as input, we take the behaviour which
is to be decomposed. Its post-condition will serve as the goal.
2. In choosing a behaviour to add to the plan in PlanStep we restrict our
search to behaviours of the appropriate granularity (i.e. one more than the
granularity of the behaviour being decomposed).
3. When constructing a plan for behaviour B there is no point in adding nodes
which lie outside of B.pre, as they will never be executed. So conditions
from B.pre are added to every node. If that results in a node that can
never be satisfied, then it is pruned.
The pseudocode for the resulting algorithm is shown in Algorithms 10 and 11.
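The third modification (conjoining the parent behaviour's pre-image with each node condition and pruning unsatisfiable nodes) can be sketched as follows. Conditions are modelled here, purely for illustration, as dictionaries mapping fluent names to required values:

```python
# Sketch of modification 3: add the parent behaviour's pre-image to each new
# node condition, pruning nodes whose combined condition can never hold.
# The dict-of-fluents representation is a hypothetical simplification.

def conjoin(pre, cond):
    """Return the combined condition, or None if it is unsatisfiable."""
    merged = dict(pre)
    for fluent, value in cond.items():
        if merged.get(fluent, value) != value:
            return None           # contradictory values: prune this node
        merged[fluent] = value
    return merged

pre = {"fuel_empty": False}                       # the parent's pre-image
print(conjoin(pre, {"taxi_at": "red"}))           # kept, with pre added
print(conjoin(pre, {"fuel_empty": True}))         # pruned (None)
```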
Algorithm 11 Hierarchical planning: Adding new nodes
function PlanStep(behaviour B, node N, explored E)
for each behaviour B′ with granularity B.gran + 1 do
\\Find which of the node conditions B′ achieves, if any
(achieved , unachieved )← Achieved(B′, N.cond)
if achieved = {} then
Skip to the next behaviour
end if
\\Check for interference
if B′.post ∧ N.cond⇒⊥ then
Skip to next behaviour
end if
\\Construct the new node’s condition
newCondition ← B.pre ∧ unachieved ∧ B′.pre
if newCondition ⇒⊥ then
Skip to the next behaviour
end if
\\Check if the new condition has already been explored
for each condition C ∈ E do
if newCondition ⇒ C then
Skip to the next behaviour
end if
end for
\\Add the new node tree
Nnew .cond← newCondition
Nnew .parent← N
Nnew .B← B′
N.children← N.children ∪ {Nnew}
end for
end PlanStep
5.3.2 Hierarchical learning: P-HSMQ
Extending P-HSMQ to multiple levels of hierarchy is straightforward. It is sim-
ply a matter of making the algorithm operate recursively, starting with the Root
behaviour and selecting a behaviour from its plan and then executing that be-
haviour in the same fashion. Eventually there will be a behaviour for which
the plan does not cover the current state, and the agent will have to resort to
choosing a primitive action instead. The pseudocode is shown in Algorithm 12.
Note that primitive actions are allowed at any level of the hierarchy, when
no behaviour is available. This means that the agent can use plans that are only
partially complete, and fill in the remainder of the policy with primitive actions.
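The recursive selection rule (use an active behaviour from the plan when one covers the current state, else fall back to a primitive action) can be sketched as below. The dictionary representation of behaviours and the example states are hypothetical:

```python
# Sketch of recursive action selection in multi-level P-HSMQ: descend through
# the hierarchy of plans until no behaviour covers the state, then fall back
# to a primitive action.

def choose(behaviour, state):
    """Recurse down the hierarchy until a primitive action is reached."""
    active = [b for b in behaviour.get("plan", []) if b["condition"](state)]
    if not active:
        return behaviour["primitives"][0]   # no active node: use a primitive
    return choose(active[0], state)         # recurse into the sub-behaviour

go = {"condition": lambda s: s == "hall", "plan": [],
      "primitives": ["step_west"]}
root = {"plan": [go], "primitives": ["noop"]}
print(choose(root, "hall"))      # plan covers the state: recurse into Go
print(choose(root, "closet"))    # no active behaviour: primitive fallback
```

This mirrors the point made above: a partially complete plan still yields a policy, with primitive actions filling the gaps.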
5.3.3 Hierarchical learning: TRQ
An extra complexity arises when extending the TRQ algorithm to multiple levels
of hierarchy. The possibility arises that a behaviour B1 may be interrupted while
it was in the process of executing sub-behaviour B2. The question then arises,
how should we update the Q-value B1.Q(s, B2) (the value of B2 according to the
local reward for B1)?
The simplest answer is that the experiences gathered while executing B2
should be discarded. This will not affect the correctness of the algorithm, pro-
vided that B1 is occasionally permitted to persist beyond the interruption, as
described in Section 5.2.3 above. This is the solution employed in the pseudocode
of Algorithm 13.
There are two sources of behaviour interruption in the multi-level TRQ al-
gorithm. Suppose {Root, B1, B2, . . . , Bn} is the stack of behaviours executing at
time t. The teleo-reactive semantics require that plans are monitored reactively
at all levels of the hierarchy, so if the node of Bk.plan executing at time t is no
longer active at time t+1 then all behaviours Bk+1, . . . , Bn must be interrupted.
Alternatively, Bk may elect to persist with Bk+1 in spite of the interruption.
In this case, learning must be suspended on all behaviours Root, B1, . . . , Bk−1
which are above Bk in the stack. In either case, the interrupted behaviours must
discard any gathered experiences.
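The first rule (when the node executing at level k becomes inactive, every behaviour below it is interrupted and its gathered experience discarded) can be sketched over a hypothetical stack of behaviour frames:

```python
# Sketch of interruption propagation in multi-level TRQ: behaviours below the
# interrupted level discard their lessons and clear their active nodes.
# The stack-of-dicts representation is illustrative only.

def interrupt_below(stack, k):
    """Interrupt behaviours k+1..n in the executing stack."""
    for frame in stack[k + 1:]:
        frame["lesson"] = []          # discard gathered experience
        frame["active_node"] = None
    return stack

stack = [{"name": n, "lesson": ["e"], "active_node": "N"}
         for n in ("Root", "B1", "B2")]
interrupt_below(stack, 0)             # Root's executing node became inactive
print([f["lesson"] for f in stack])   # → [['e'], [], []]
```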
Algorithm 12 Planned HSMQ-Learning with multiple levels of hierarchy
function P-HSMQ(behaviour B) returns sequence S
t← 0
sequence S = {}
Observe state st
Bt ← ActiveBehaviours(B.plan, st)
while st |= B.pre ∧ st 6|= B.post do
T ← t
if Bt = ∅ then
Choose primitive aT ← π(st) from B.P
according to an exploration policy
Execute action aT
Observe state st+1
sequence S′ ← {〈st, at, st+1〉}
else
Choose behaviour BT ← π(st) from Bt
according to an exploration policy
sequence S′ ← P-HSMQ(BT )
aT ← BT
end if
k ← 0
totalReward ← 0
for each 〈s, a, s′〉 ∈ S′ do
totalReward ← totalReward + γkB.r(s, a, s′)
k ← k + 1
end for
S ← S + S′
t← t + k
Observe state st
At ← ActiveBehaviours(B.plan, st)
if At = ∅ then
At ← B.P
end if
B.Q(s_T, a_T) ←α totalReward + γ^k max_{a′∈A_t} B.Q(s_t, a′)
end while
return S
end P-HSMQ
Algorithm 13 Teleo-Reactive Q-Learning
function TRQ(behaviour B)
while B has not terminated do
Observe state st
〈st, at, st+1〉 ← TRQ-Execute(st, B)
TRQ-Update(B, 〈st, at, st+1〉)
end while
end TRQ
Algorithm 14 Execute a behaviour
function TRQ-Execute(state st, behaviour B)
returns experience 〈st, at, st+1〉
node N← B.activeNode
\\Select a new node if necessary
if N = null then
N t ← ActiveNodes(B.plan, st)
if N t 6= ∅ then
Choose node N← B.π(st) from N t
end if
B.lesson← {}
end if
\\Execute the active node
B.activeNode← N
if N = null then
Choose primitive at ← B.π(st)
Execute action at
Observe next state st+1
else
〈st, at, st+1〉 ← TRQ-Execute(st, N.B)
TRQ-Update(N.B, 〈st, at, st+1〉)
B.lesson← B.lesson + 〈st, at, st+1〉
end if
return 〈st, at, st+1〉
end TRQ-Execute
Algorithm 15 TRQ Update
function TRQ-Update(behaviour B, experience 〈st, at, st+1〉)
N← B.activeNode
At ← ActiveNodes(B.plan, st)
if At = ∅ then
At ← B.P
end if
if N = null then
\\Update executed primitive
B.Q(st, at)α←− B.r(st, at, st+1) + γ maxa∈At
B.Q(st+1, a)
else if node N /∈ At then
k ← 0
totalReward ← 0
for each 〈s, a, s′〉 ∈ B.lesson do
totalReward ← totalReward + γkr(s, a, s′)
k ← k + 1
end for
B.Q(s_T, N) ←α totalReward + γ^k max_{a∈A_t} B.Q(s_t, a)
B.activeNode← null
if st+1 |= N.B.pre and st+1 6|= N.B.post then
\\N.B has been interrupted due to a side-effect.
with probability η do
Interrupt(st+1, Root)
TRQ(N.B)
else
Interrupt(st+1, N.B)
end with
end if
end if
end TRQ-Update
Algorithm 16 Discard lessons from interrupted behaviours
function Interrupt(state s, behaviour B)
N← B.activeNode
if N 6= null then
Interrupt(s, N.B)
end if
B.activeNode← null
B.lesson← {}
end Interrupt
5.4 Summary
In this chapter we have presented algorithms for planning with reinforcement-
learnt behaviours and for learning with plan-constructed task-hierarchies. We
have shown how having a plan can inform the execution process allowing intel-
ligent decisions of when to prematurely interrupt a behaviour, leading to better
policies. In the next chapter we shall introduce the third element of the Rachel
architecture, the reflector, which observes the learnt behaviours of the actor and
uses its observations to better inform the planner, thereby closing the loop.
Chapter 6
Rachel: Reflection
The third and final component of Rachel is the reflector. This component
monitors the execution of plans by the actor, detects and records undesirable
side-effects, and induces pre-conditions under which they might be avoided. This
information is in turn fed back into the planner which we shall modify to incor-
porate it into new plans.
The need for the reflector is due to assumptions made in the planner. The
planner initially assumes that behaviours are side-effect free. If a given fluent F
is true when a behaviour is initiated then the planner assumes it will remain true
until the behaviour terminates, unless it is known to interfere with the pre-image
or post-conditions of the behaviour. This is called the frame assumption. In this
chapter we shall investigate what happens when this assumption does not turn
out to be true, and how the reflector enables Rachel to overcome this problem.
6.1 The Frame Assumption
It is a commonly recognised problem in the field of symbolic planning that to
build plans we need to model not only what fluents are affected by a behaviour,
but also what fluents are unaffected. Without such knowledge, we cannot be sure
that conditions achieved by one behaviour will not be undone by a subsequent
step in the plan. Taking the grid-world as an example, the plans we build to
fetch the book and the coffee rely on the fact that the Go() behaviour does
not affect the truth of the holding() fluent (or, at least, is unlikely to do so). With
this knowledge, it is safe to build plans which first Get() the book or the coffee
and then Go() to another location. Teleo-reactive plans can compensate for low-
probability failures in this area (as in the coffee-spilling scenario in Section 5.2.2),
but if such a side-effect consistently recurs, then the plan will be useless unless
it takes the effect into account.
The difficulty lies in the fact that there are typically many more things that
a behaviour does not affect than things it does. Enumerating them all is often
impractical. This is known as the frame problem (Shoham & McDermott, 1988;
Hayes, 1973). The most convenient solution, used by many planning algorithms,
is the Strips assumption (Fikes & Nilsson, 1971). This assumption states that
the only fluents affected by a planning operator are those explicitly mentioned
in its add and delete lists.
The planner described in Chapter 5 relies on an even stronger frame assump-
tion. Unlike Strips-operators, Rachel’s behaviours initially have no concrete
implementation. Instead, they must be learnt. As a result, it is hard to specify
in advance what the side-effects of a behaviour will ultimately be, once it has
learnt a policy. The local reward function for a behaviour, defined in Equa-
tion 4.2, does not reward or punish any effects other than outright success or
failure. The doctrine of recursive-optimality says it should be this way: the local
policy of a behaviour should be based only on the goals of the behaviour, and be
independent of the context in which it is being used. If the reward function does
not exclude an effect, then we cannot say for sure that it will not end up being
part of the learnt policy.
Nevertheless, to build plans with these behaviours we must make assump-
tions about what side-effects they will have. Rachel’s planner makes the most
optimistic assumption: that a behaviour has no effect other than those described
in its post-condition. If a fluent F is true when a behaviour B is initiated, then
the planner assumes that it will remain true until the behaviour terminates.
Furthermore, if the fluent does not conflict with the post-condition of B, then
the fluent will also be true in the terminal state.
This is a convenient assumption to make when planning, in the absence of a
learnt policy for the behaviour, but since the local reward function for B makes
no attempt to enforce it, in practice it may turn out to be untrue.
Two kinds of errors can result from a bad frame assumption:
1. A plan can be generated with too many alternative paths, some of which
are not viable in practice. If there are two apparent paths to the goal, but
one of them is rendered useless by an unforeseen side-effect, both of them will
be optimistically included in the plan. This situation is recoverable, as the
actor’s ability to do hierarchical reinforcement learning will enable it to
learn which path to take and which to avoid. The only loss is time wasted
exploring the unprofitable path.
2. A plan can be generated with too few alternatives, omitting some be-
haviours which are in fact necessary. The purpose of the planner is to
prune the set of behaviours available to be explored by the actor to only
those that appear relevant. However, if the planner is too optimistic about
the effects of a behaviour then it may prune other important behaviours
from this set. The actor cannot recover from this situation on its own, as
it can only explore those alternative behaviours offered by the plan. To fix
this, the agent must notice that it is being too optimistic in its applica-
tion of the frame assumption, and somehow revise its plans to contain the
necessary alternatives.
It is this second problem which we will be particularly focusing on in this
chapter. To make it clearer, we will introduce a new example domain: the
taxi-world.
6.2 Example Domain - Taxi-world
The “taxi-car domain” was originally posed by Dietterich (1998) to
study the MAXQ algorithm. A number of variants of this problem have appeared
in his papers. The particular variant we shall explore operates as follows.
Figure 6.1: The Taxi-Car Domain.
The world consists of a 5× 5 grid as shown in Figure 6.1. The agent controls
a taxi that can move around the grid in any of the four cardinal directions
as the walls allow. Five cells in the grid have been labelled as locations of
particular interest. They are red, green, blue, yellow and fuel. At one of
these locations is a passenger who waits for the taxi to take her to another
location. The passenger’s starting location and destination are randomly chosen
and vary from one trial to the next. It is the job of the agent to learn a policy
to navigate to the passenger’s location, pick her up, navigate to her destination
and drop her off there.
There is an added complication: the taxi begins with a randomly determined
amount of fuel. Each movement the taxi makes uses up one unit of fuel. If the
taxi runs out of fuel, then it has failed in its task. There is not always enough
starting fuel to complete the task assigned to the agent. There is, however, the
possibility of refilling the tank by visiting the Fuel location and executing the
refill action.
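The fuel dynamics described above can be sketched as a toy transition function. The dictionary-based state, the action names, and the full-tank value of 16 (taken from Table 6.1) are illustrative choices, not the thesis code:

```python
# Toy sketch of the taxi-world fuel dynamics (illustrative only).
def step(state, action):
    """Each movement uses one unit of fuel; refilling requires the fuel cell."""
    s = dict(state)
    if action in ("north", "south", "east", "west"):
        if s["fuel"] == 0:
            s["failed"] = True    # out of fuel: the task has failed
        else:
            s["fuel"] -= 1
    elif action == "refill" and s["loc"] == "fuel":
        s["fuel"] = 16            # full tank (16 per Table 6.1)
    return s

s = {"loc": "fuel", "fuel": 1, "failed": False}
s = step(s, "refill")
print(s["fuel"])                  # → 16
s = step(s, "north")
print(s["fuel"])                  # → 15
```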
The agent is equipped with the instruments shown in Table 6.1. Fluents are
shown in Table 6.2. Some fluents, like psgr loc() and psgr dest() are simple
wrappers to certain instruments; others have more complex definitions.
Table 6.1: The instruments used in the Taxi-car domain.

x : the current X position of the taxi
y : the current Y position of the taxi
passenger location : one of red, green, blue, yellow, fuel or taxi
passenger destination : one of red, green, blue, yellow, or fuel
fuel : the amount of fuel in the tank, between 16 (full) and 0 (empty)
The agent’s goal is described as a root behaviour Deliver:
Deliver(L) : L ∈ Locations
gran: 0
view: {x, y, passenger location, passenger destination, fuel}
pre: fuel(F) ∧ gt(F, 0)
post: psgr dest(L) ∧ psgr loc(L)
P: { north, south, east, west, pickup, putdown, fill}
Four sub-behaviours are defined to simplify this task: GoTo, Get, Put, and
Refuel. Teleo-operator descriptions of these behaviours are given in Table 6.3.
Table 6.2: Fluents used in the Taxi-car domain.

psgr loc(Location)             the passenger’s location
psgr in taxi                   the passenger is in the taxi
psgr dest(Destination)         the passenger’s destination
taxi loc(Location)             the taxi’s location
fuel(Fuel)                     the fuel level
distance(Location, Distance)   the Manhattan distance to a given location
gt(X, Y)                       X is greater than Y
rgte(X, Y, R)                  X is greater than or equal to Y × R
Only one of these sub-behaviours, GoTo, needs to be learnt; the other three are
just wrappers for a single primitive action: pickup, putdown and fill respectively.
Table 6.3: The four types of agent behaviour in the Taxi-world.

GoTo(L) : L ∈ Locations
  gran: 1
  view: { x, y, id(L) }
  pre: true
  post: taxi loc(L)
  P: { north, south, east, west }

Get(L) : L ∈ Locations
  gran: 1
  pre: taxi loc(L) ∧ psgr loc(L)
  post: psgr in taxi
  P: { pickup }

Put(L) : L ∈ Locations
  gran: 1
  pre: taxi loc(L) ∧ psgr in taxi
  post: psgr loc(L)
  P: { putdown }

Refuel
  gran: 1
  pre: taxi loc(bowser)
  post: fuel(15)
  P: { fill }
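The teleo-operator descriptions above are essentially records with granularity, pre-image, post-condition and action fields. A minimal sketch of such a record follows; the class and field names are ours, and representing conditions as strings is a simplification of Rachel's first-order terms.

```python
from dataclasses import dataclass

# Hypothetical container for a teleo-operator description like those in
# Table 6.3. Rachel's real representation is first-order; this is a sketch.
@dataclass
class TeleoOperator:
    name: str        # behaviour name, possibly parameterised
    gran: int        # granularity level in the hierarchy
    pre: str         # pre-image, a formula over fluents
    post: str        # post-condition the behaviour achieves
    actions: tuple   # primitive actions (or sub-behaviours) available to it

GOTO = TeleoOperator("GoTo(L)", 1, "true", "taxi_loc(L)",
                     ("north", "south", "east", "west"))
REFUEL = TeleoOperator("Refuel", 1, "taxi_loc(bowser)", "fuel(15)", ("fill",))
```

The key point of the representation is that only pre and post are symbolic; the policy that gets from one to the other is left to reinforcement learning.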
Figure 6.2 shows the decomposition of the Deliver behaviour into a plan
of sub-behaviours, using the hierarchical planning algorithm described in Sec-
tion 5.3.1. There is an important flaw in this plan: it never recommends the
Refuel behaviour. The GoTo() behaviour has a side-effect which is not part of
its definition. It uses up fuel. In some cases, it may use up all the fuel. For the
sake of the example, we have deliberately omitted this fact from its description.
As a result, the planner erroneously assumes there is no effect. Thus, according
to the planner, the Refuel behaviour is never necessary. This is an example of a
plan with too few alternatives, as described above.
Figure 6.2: A plan for the Deliver behaviour in the Taxi world. [A tree of plan nodes decomposing Deliver into GoTo(L), Get(L), GoTo(D) and Put(D), with node conditions built from psgr loc, psgr dest, psgr in taxi, taxi loc, fuel(F) and gt(F, 0).]
The mistake becomes plain as the agent endeavours to execute the plan and
learn the behaviours. The taxi will frequently run out of fuel while executing
the GoTo behaviour and the agent will be unable to learn a way to avoid this, as
the Refuel behaviour is never available to it. Even when GoTo has converged to
an optimal policy (from the small fraction of attempts which succeed before the
fuel runs out) the agent will unavoidably fail in a significant number of trials.
The aim of the reflector is to rectify this problem. Doing so will require four
steps:
• Detecting when side-effects occur and diagnosing what went wrong.
• Gathering positive and negative examples of the side-effect.
• Inducing a description of when the side-effect is likely to occur.
• Incorporating the learnt information into a new plan which includes steps
to avoid the side-effect.
Each of these steps is described in detail below.
6.3 Detecting and identifying side-effects
In a complex world, a behaviour that executes for any length of time is likely
to affect the truth of a large number of different fluents. Learning to predict
even a single change is time consuming; learning every possible change is totally
impractical. So the agent needs to focus its attention on the important effects
and ignore the others. Arguably the most important effects are those that cause
the agent’s plans to fail. These effects point to an important deficiency in the
agent’s world model. So Rachel’s reflector endeavours only to learn to predict
those effects that result in actual plan execution failures.
6.3.1 Plan Execution Failures
A plan execution failure can be characterised as follows. The agent is executing
plan P. It observes two successive states st−1 and st. Node N is activated in state
st−1 and dictates behaviour B. A plan failure occurs in state st under one of two
circumstances, either:
st ⊭ N.cond
st ⊨ B.pre
st ⊭ B.post                    (6.1)

or:

st ⊨ B.post
st ⊭ N.parent.cond             (6.2)
Equation 6.1 describes a side-effect that occurs in the middle of executing be-
haviour B. The pre-image is still satisfied, so the behaviour has not failed, but
the node condition is no longer true, so a side-effect has occurred. Equation 6.2
describes a side-effect that occurs at the point of completion of B. The be-
haviour has succeeded, its post-condition is satisfied. The parent node ought to
be satisfied but is not, due to some other condition becoming false.
Both of these cases describe what are termed negative side-effects (side ef-
fects that result in unexpected failure). Positive side-effects (which result in
unexpected success) are ignored by the reflector. So, for example, the execution
of a behaviour will be interrupted if its parent node becomes active, even though
the condition of the child node is still true. This is a positive side-effect and is
ignored.
Also ignored are any plan failures that are due to failure of the learnt
behaviour (i.e. when the behaviour violates its pre-image). These are to be
corrected by the existing reinforcement learning process, not by reflection.
6.3.2 Diagnosing the failure
Having detected that a plan failure has taken place, the reflector must identify
the actual side-effect. In both of the cases above, there is a node Nf of the plan
that is expected to be satisfied in st but is not. In case 1 Nf = N, in case 2
Nf = N.parent.
One or more of the fluents in this node’s condition is false. We wish to find
which they are. If there are no variables in the node condition, then this is
simply a matter of testing each in turn. However, variables complicate matters.
The planner may include existentially bound variables in node conditions. If two
or more fluents have shared variables then they cannot be tested in isolation.
The reason is simple:
Say, s ⊨ foo(a) ∧ bar(b)
then s ⊨ ∃X : foo(X)
and s ⊨ ∃X : bar(X)
but s ⊭ ∃X : (foo(X) ∧ bar(X))
This means that we cannot always identify a single fluent as the cause of the
side-effect. Instead we try to determine the shortest possible sequence of fluents
which cannot be satisfied simultaneously.
The first step is to split the node’s condition up into conjunctions of inde-
pendent fluents. Two fluents are dependent if they share one or more variables,
or if they both depend on a common third fluent.
Nf .cond = C1 ∧ C2 ∧ . . . ∧ Ck
Now each condition Ci can be tested in isolation. Any that cannot be satisfied
in state st are sources for side-effects, others are ignored. Those that do fail are
trimmed to remove any irrelevant fluents. If

Ci = ∃V1, . . . , Vm : (f1 ∧ . . . ∧ fn)

and for some j < n,

st ⊭ ∃V1, . . . , Vm : (f1 ∧ . . . ∧ fj)

then fluents fj+1, . . . , fn do not affect the failure of Ci and can be discarded. We
do this to make the side-effect as general as possible.
Finally all trimmed conditions are added to the set of known side-effects for
the executing behaviour B. Note that each side-effect is recorded as a conjunction
of fluents that under certain circumstances will fail to hold. This is the standard
form for a delete-list on a planning operator.
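Splitting a condition into independent conjunctions amounts to computing the connected components of the fluents under the shared-variable relation, which a simple union-find handles. The sketch below uses our own tuple representation of fluents; it is not Rachel's code.

```python
# Sketch of a SplitCondition step: group fluents into independent
# conjunctions. A fluent is written (name, variables); constants are
# simply omitted from the variable tuple. (Representation is ours.)

def split_condition(fluents):
    """Partition fluents into connected components under shared variables."""
    parent = {}

    def find(v):
        while parent.get(v, v) != v:
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    # Union each fluent with every variable it mentions; fluents that
    # share a variable (directly or transitively) end up in one component.
    for i, (_, vars_) in enumerate(fluents):
        for v in vars_:
            union(("f", i), ("v", v))

    groups = {}
    for i, f in enumerate(fluents):
        groups.setdefault(find(("f", i)), []).append(f)
    return list(groups.values())

# The N3 condition from the taxi example: gt(F, 0) shares F with fuel(F),
# so those two form one component; the other fluents stand alone.
n3 = [("psgr_dest", ("L",)), ("psgr_in_taxi", ()),
      ("fuel", ("F",)), ("gt", ("F",))]   # gt(F, 0): 0 is a constant
groups = split_condition(n3)              # three independent conjunctions
```

This matches the decomposition of N3.cond into C1, C2 and C3 worked through later in the chapter.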
Pseudocode for this operation is shown in Algorithm 17. We shall illustrate
it with an example from the taxi-car domain. Suppose the taxi has picked up the
passenger and is on its way to the destination, executing the GoTo(L) behaviour
(where variable L is bound at run-time by evaluating psgr dest(L) in the node’s
condition). The node that is executing is the third from the top; we shall call it
N3.
N3.cond = ∃L, F : (psgr dest(L) ∧ psgr in taxi ∧ fuel(F) ∧ gt(F, 0))
Algorithm 17 Reflector: Detecting side-effects

function DetectSideEffects(behaviour B, node N, state s)
    if s ⊨ B.post then
        if s ⊨ N.parent.cond then
            return
        else
            Nf ← N.parent
        end if
    else
        if s ⊨ N.cond or s ⊭ B.pre then
            return
        else
            Nf ← N
        end if
    end if
    candidateEffects ← SplitCondition(Nf.cond)
    for each C ∈ candidateEffects do
        if s ⊭ C then
            C ← TrimCondition(s, C)
            B.sfx ← B.sfx ∪ C
        end if
    end for
end DetectSideEffects
Algorithm 18 Reflector: Trimming a condition

function TrimCondition(state s, conjunction C) returns a trimmed conjunction
    newCondition ← true
    for each fluent f ∈ C do
        newCondition ← newCondition ∧ f
        if s ⊭ newCondition then
            break
        end if
    end for
    return newCondition
end TrimCondition
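Algorithm 18's prefix test can be rendered directly in code. The satisfiability test holds(state, prefix) is assumed; the stand-in definition below is purely illustrative and not a real entailment procedure.

```python
def trim_condition(holds, state, fluents):
    """Keep the shortest prefix of `fluents` that cannot be satisfied in
    `state`; any later fluents cannot affect the failure and are dropped.
    `holds(state, prefix)` is an assumed satisfiability test."""
    prefix = []
    for f in fluents:
        prefix.append(f)
        if not holds(state, prefix):
            break
    return prefix

# Crude stand-in for entailment, just for the taxi example: fuel(F) alone
# is always satisfiable, but adding gt(F, 0) fails when the tank is empty.
def holds(state, prefix):
    if prefix == ["fuel(F)"]:
        return True
    return state["fuel"] > 0

trimmed = trim_condition(holds, {"fuel": 0},
                         ["fuel(F)", "gt(F, 0)", "extra(F)"])
# → ['fuel(F)', 'gt(F, 0)']; the trailing fluent is trimmed away
```

The result is the most general conjunction that still fails, which is what the planner needs for a delete-list.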
Now suppose the fuel runs out in state st, before the taxi reaches its desti-
nation. The pre-image of GoTo(L) is true, so a case 1 side-effect has occurred:
st ⊭ N3.cond
st ⊨ GoTo(L).pre
st ⊭ GoTo(L).post
So the failed node Nf = N3. Now Nf .cond can be split into three independent
conjunctions:
Nf .cond = C1 ∧ C2 ∧ C3
C1 = ∃L : psgr dest(L)
C2 = psgr in taxi
C3 = ∃F : (fuel(F) ∧ gt(F, 0))
Now st ⊨ C1 and st ⊨ C2, so they can be ignored. However st ⊭ C3 so this
conjunction is the source of a side-effect. To trim C3 we test successively longer
prefixes, until we find one that cannot be satisfied:
st ⊨ ∃F : fuel(F)
st ⊭ ∃F : (fuel(F) ∧ gt(F, 0))
So ∃F : (fuel(F) ∧ gt(F, 0)) is added to the known side-effects of GoTo(L). If
there had been any further fluents in C3 after gt(F, 0), they would have been
trimmed.
6.4 Gathering examples
The standard Strips representation handles side-effects fairly simplistically.
They are all-or-nothing. If a behaviour has a particular side-effect, then it can
never be used to achieve a goal which would conflict with that side-effect. There
is no way to represent a side-effect which only occurs under certain circum-
stances. We have discovered that the GoTo behaviour has the side-effect that it
sometimes runs out of fuel. Does this mean that we can never use it when fuel
is an issue? Preferably not. What we want to know is when the behaviour is
likely to run out of fuel, and what steps we can take to prevent it.
With this in mind, Rachel implements a conditional model of behaviours’
side-effects. Each discovered side-effect has an associated condition describing
when it will not occur. If a behaviour B has a side-effect which causes the
failure of some conjunction of fluents C, then maintains(B, C) is the set of
states in which B can be executed without causing C to fail. This is called the
maintenance condition for C.
Formally, if B is initiated in a state st with
st |= B.pre ∧ C ∧maintains(B, C)
then it will terminate in a state st+k, with
st+k |= B.post ∧ C
and every intermediate state st+1, . . . , st+k−1 will satisfy
st+i |= B.pre ∧ C
We wish to learn a description of maintains(B, C) as a disjunction of con-
junctions of fluents. Such a description can be used by the planner to ensure
that the side-effects do not occur.
To learn such a description we need to collect positive and negative examples
of maintains(B, C). The reflector does this by monitoring the actor’s execution.
Whenever the actor is executing behaviour B, the reflector accumulates a trace
of states T . When the behaviour terminates, the reflector attempts to classify
the states in T as either positive or negative examples.
Suppose B begins executing in state st and terminates in state st+k (success-
fully, unsuccessfully or due to an interruption). The states T = {st, . . . , st+k−1}
are then classified. A state st+i is a positive example of maintains(B, C) if:
st+k |= B.post
st+j |= C for all j, i ≤ j ≤ k
that is, if the behaviour succeeds, and every state in {st+i, . . . , st+k} preserves
the condition C.
A state st+i is classified as a negative example if:
st+i |= C
st+j ⊭ C for some j, i < j ≤ k
that is, if C is satisfied in st+i but executing B causes it to no longer be satisfied.
Notice that not all states will match one of these two conditions. States in
which the condition is already false are ignored, as are states which lead neither
to a side-effect, nor to successful termination. We have no interest in classifying
such states either way.
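The classification rules above can be sketched as follows, assuming a test C(s) for the maintained condition and a flag saying whether the final state of the trace satisfied B.post. All names here are ours.

```python
def classify_trace(states, C, success):
    """Label each non-terminating state of a trace as '+', '-' or None.
    `C(s)` tests the maintained condition; `success` says whether the
    final state satisfied B.post. (Function and flag names are ours.)"""
    labels = []
    n = len(states)
    for i in range(n - 1):       # the terminating state itself is not labelled
        later_fail = any(not C(states[j]) for j in range(i + 1, n))
        if success and C(states[i]) and not later_fail:
            labels.append("+")   # behaviour succeeded and C held to the end
        elif C(states[i]) and later_fail:
            labels.append("-")   # C held here but was later violated
        else:
            labels.append(None)  # C already false, or uninformative
    return labels
```

Running this on the fuel levels of traces A and B of Table 6.4 (with C testing fuel > 0) reproduces the labels shown there.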
An example of the classification process is given in Table 6.4. It shows five
possible execution traces from executing the GoTo(X) behaviour in the taxi-world,
and classifies each state as a positive or negative example of
maintains(GoTo, ∃F : (fuel(F) ∧ gt(F, 0))).
Classified examples are stored in two lists, E+ and E−, for positive and nega-
tive examples respectively. Each list has a maximum size, n+max and n−max
respectively. Once a list is full, new examples replace randomly selected old examples.
This allows us to keep a “decaying window” containing a mixture of old and new
examples.
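The fixed-size lists with random replacement can be sketched as below; the class and method names are ours.

```python
import random

class ExamplePool:
    """Fixed-size example pool: fills up, then each new arrival replaces a
    randomly chosen old example, giving a decaying mix of old and new."""
    def __init__(self, max_size, seed=0):
        self.max_size = max_size
        self.items = []
        self.rng = random.Random(seed)

    def add(self, example):
        if len(self.items) < self.max_size:
            self.items.append(example)
        else:
            # Replace a uniformly random slot: older examples survive with
            # geometrically decaying probability as new ones arrive.
            self.items[self.rng.randrange(self.max_size)] = example
```

One such pool would be kept for the positive and one for the negative examples of each monitored side-effect.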
6.5 Inducing a description
Periodically, the actor may request a new side-effect description from the reflec-
tor. Rachel currently does this after a fixed number of trials, specified by the
user. The reflector selects a behaviour B and a condition C about which to learn
from the list of effects it is currently monitoring. To be selected, sufficient
positive and negative examples of the effect must have been gathered to form a
training set. The size of the training set is given by n+train and n−train for
positive and negative examples respectively. At present, these sizes are up to
the judgement of the trainer. The larger they are, the more accurate the induced
description is likely to be, but the longer it will take to gather enough
examples. We shall investigate this tradeoff empirically in the experimental
work later in this chapter.

Two training sets E+train ⊆ E+ and E−train ⊆ E− are randomly selected. This
training data must then be used to induce a symbolic description of
maintains(B, C).
We wish to learn a description that we can incorporate into future plans. This
learning process must have the following features:
1. The hypotheses must be expressed in the language of fluents used by the
planner.
2. Hypotheses can only be based on features of the current state and the
behaviour being executed (including its parameters). It may be necessary
(as in the Taxi-world example) to relate different fluent values to each
other.
Table 6.4: Classifying states as positive and negative examples of a side-effect.
Examples are drawn from execution of the GoTo behaviour in the taxi domain.
They are classified as positive or negative examples of
maintains(GoTo, ∃F : (fuel(F) ∧ gt(F, 0))).

state   fluents                                                    classification

A) Behaviour terminates successfully
s1      psgr in taxi, psgr dest(red), fuel(5)                      +
s2      psgr in taxi, psgr dest(red), fuel(4)                      +
s3      psgr in taxi, psgr dest(red), fuel(3)                      +
s4      psgr in taxi, psgr dest(red), taxi loc(red), fuel(2)       none

B) Behaviour interrupted due to running out of fuel
s5      psgr in taxi, psgr dest(blue), fuel(3)                     −
s6      psgr in taxi, psgr dest(blue), fuel(2)                     −
s7      psgr in taxi, psgr dest(blue), fuel(1)                     −
s8      psgr in taxi, psgr dest(blue), fuel(0)                     none

C) Behaviour runs out of fuel and terminates simultaneously
s9      psgr in taxi, psgr dest(green), fuel(3)                    −
s10     psgr in taxi, psgr dest(green), fuel(2)                    −
s11     psgr in taxi, psgr dest(green), fuel(1)                    −
s12     psgr in taxi, psgr dest(green), taxi loc(green), fuel(0)   none

D) Behaviour runs out of fuel but persists to completion
s13     psgr in taxi, psgr dest(red), fuel(3)                      −
s14     psgr in taxi, psgr dest(red), fuel(2)                      −
s15     psgr in taxi, psgr dest(red), fuel(1)                      −
s16     psgr in taxi, psgr dest(red), fuel(0)                      none
s17     psgr in taxi, psgr dest(red), fuel(0)                      none
s18     psgr in taxi, psgr dest(red), taxi loc(red), fuel(0)       none

E) Behaviour interrupted due to an unrelated side-effect
s19     psgr in taxi, psgr dest(yellow), fuel(5)                   none
s20     psgr in taxi, psgr dest(yellow), fuel(4)                   none
s21     psgr loc(blue), psgr dest(yellow), fuel(3)                 none
3. Since the agent’s environment is stochastic and its behaviours are being
learnt as they are used, the training data is inevitably noisy. The learning
algorithm must be able to handle this noise.
4. Each conjunction in the description will produce an extra branch in
the resulting plan-tree, so shorter descriptions are preferable to keep the
complexity of planning to a minimum.
Items 1) and 2) above suggest that Inductive Logic Programming (ILP) is
the appropriate tool for this learning task. The planner operates in terms of
first-order fluents. States and side-effects, which form the input to the learn-
ing algorithm, are described in this language, and a first-order description of
maintains(B, C) can be directly incorporated into plans. ILP naturally operates
in this language.
We chose the ILP system Aleph (Srinivasan, 2001a) for this purpose.
Aleph is able to use general Prolog programs as intensional background knowl-
edge. This means that it can use Rachel’s fluent definitions directly. It also
has robust noise handling features that are able to overcome the noisiness of the
training data. (Some modifications were required, however, to prevent it from
over-fitting, as outlined below.)
Aleph is designed to be a framework for exploring different ideas in ILP,
rather than just a single algorithm. It allows the user to customise many aspects
of the ILP process, which allows it to emulate a variety of ILP systems. Under its
default settings, which we used, it operates similarly to Progol (Muggleton, 1995).
The standard operation cycles through four steps (as described in the Aleph
manual):
1. Select Example. Select an uncovered positive example to be generalised.
If none exist, stop.
2. Saturate. Construct the most specific clause that entails the example
selected, and is within the language restrictions provided. This is usually a
definite clause with many literals, and is called the “bottom clause.”
3. Reduce. Search through all clauses containing a subset of the literals in the
bottom clause, from general to specific, to find the clause with the best
accuracy. Add this clause to the theory.
Table 6.5: Input to Aleph: Positive and negative examples

Positive examples:          Negative examples:

maintains(s1).              maintains(s4).
maintains(s2).              maintains(s5).
maintains(s3).              maintains(s6).
...                         ...
4. Cover Removal. Remove all positive examples which are covered by this
clause, and return to step 1.
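The select/saturate/reduce/cover-removal cycle has the shape of a standard covering loop. In the schematic below, best_clause stands in for the saturation and reduction steps, and the fall-back to a singleton fact mirrors the behaviour of unmodified Aleph described later in this section. All names are illustrative, not Aleph's internals.

```python
def covering_loop(positives, negatives, best_clause):
    """Schematic of Aleph's default induction loop. `best_clause(seed,
    pos, neg)` stands in for saturate + reduce: it returns a clause
    (here, any callable classifier) generalising the seed, or None."""
    theory, uncovered = [], list(positives)
    while uncovered:
        seed = uncovered[0]                                  # 1. select example
        clause = best_clause(seed, uncovered, negatives)     # 2-3. saturate, reduce
        if clause is None or not clause(seed):
            # Fall back to the example itself as a singleton fact, as
            # unmodified Aleph does when no accurate clause is found.
            clause = (lambda s: (lambda e: e == s))(seed)
        theory.append(clause)
        uncovered = [e for e in uncovered if not clause(e)]  # 4. cover removal
    return theory

# Toy run: even numbers are the "real" concept; 7 is noise and ends up
# covered only by a singleton fact.
even_rule = lambda e: e % 2 == 0
theory = covering_loop([2, 4, 6, 7], [1, 3],
                       lambda seed, pos, neg: even_rule if seed % 2 == 0 else None)
```

The singleton fall-back is exactly the behaviour Rachel's modifications remove, since a clause naming one time-stamped state is useless for planning.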
6.5.1 Input to Aleph
Aleph requires three files as input: one of positive examples, one of negative
examples and one of background. The positive and negative example files are
listings of the target predicate (maintains) for each state in E+train and E−train
respectively. Each predicate is time-stamped to identify the particular state
that it describes, as shown in Table 6.5.
Table 6.6 shows an abbreviated example from the background file for the
taxi-car domain. This file has four parts. The first part contains the parameter
settings for Aleph. Four important parameters for our purposes are:
• minacc This sets the minimum accuracy requirement for a clause to become
part of the theory.
• i This is an upper bound on the number of “links” required to connect
two literals in a clause, by unbound variables. For example psgr loc(L) ∧
distance(L, 2) has one link, psgr loc(L) ∧ distance(L, D) ∧ fuel(F) ∧
gt(F, D) has three.
• clauselength This is the maximum number of literals allowed in a clause.
• inducetime This is a new parameter, introduced by the modifications
described below. It is an upper limit on the amount of time Aleph spends
in the induction loop.
The second part of the background describes the language that Aleph will
use to build its theory. Mode and type information is given for the target concept,
and the fluents that will be used to describe it. Aleph ensures that every
clause it generates obeys these mode and type requirements, to avoid the cost of
generating and testing nonsensical hypotheses. This part of the background file
also contains Prolog clauses defining all the fluents in terms of the instrument
values which follow.
The third part contains an extensional listing of all instrument values for
every example state, as Prolog facts. The final part contains an extensional
listing of a special params fluent which lists the parameters of the behaviour
executing at the time the side-effect occurred. Fluents and instruments are
time-stamped to relate them to a particular state (even those like gt that are in
fact independent of state).
6.5.2 Modifications to Aleph
Two simple but important modifications have been made to Aleph in order
to use it for this purpose. Both changes relate to how Aleph handles positive
examples for which no adequate description can be generated.
The Aleph algorithm, as described above, loops through its four steps until
its theory covers every positive example in the training set. There is no mecha-
nism in existing versions of Aleph to allow it to treat some positive examples
as noise. If the reduction step cannot find a clause with sufficient accuracy then
it resorts to adding the example itself to the theory as a single fact. So a theory
generated from noisy data might resemble:
maintains(S) :-
psgr_loc(L), dist(L, D), fuel(F), gt(F, D).
maintains(S) :-
psgr_dest(blue).
maintains(s23).
maintains(s59).
maintains(s101).
This indicates that states s23, s59 and s101 were not covered by the rest of
the theory, and did not generate clauses which satisfied the minimum accuracy
requirements. They are effectively noise.
Table 6.6: Input to Aleph: the background file
%%% PART 1: Parameter Settings
:- set(minacc, 0.5).
:- set(i, 4).
:- set(clauselength, 5).
:- set(inducetime, 1800).
%%% PART 2: Fluent definitions
:- modeh(1,maintains(+state)).
:- determination(maintains/1, psgr_loc/2).
:- modeb(*, psgr_loc(+state, -location)).
psgr_loc(State, Location) :-
passenger_location(State, Location).
:- determination(maintains/1, psgr_dest/2).
:- modeb(*, psgr_dest(+state, -location)).
psgr_dest(State, Destination) :-
passenger_destination(State, Destination).
:- determination(maintains/1, gt/3).
:- modeb(*, gt(+state, +int, +int)).
gt(State, A, B) :-
A > B.
...
%%% PART 3: Instrument values
xpos(s1, 5).
xpos(s2, 4).
xpos(s3, 3).
...
%%% PART 4: Behaviour Parameters
:- determination(maintains/1, params/2).
:- modeb(*, params(+state, -location)).
params(s1, red).
params(s2, red).
params(s3, red).
...
Descriptions that name particular states are not at all useful to Rachel. A
particular time-stamped state occurred once and will never happen again. So we
have modified Aleph to filter out these singleton hypotheses. Positive examples
which cannot be covered are ignored.
Furthermore it has been noticed that the probability of generating such hy-
potheses increases as the algorithm progresses. It generally happens that a fairly
comprehensive theory is established after Aleph has considered the first few
positive examples, and then a large amount of time is spent adding each of the
exceptional states as singletons. Much effort can be saved by cutting off the cov-
ering process early. Rachel’s modified Aleph uses the simplest possible cutoff:
a time limit is placed on the induction process. After a set time (specified in
seconds by the inducetime parameter) the search process is stopped and the set
of hypotheses constructed so far is returned as the theory. This may be crude
but it proves in practice to be effective.
6.5.3 Output from Aleph
Aleph outputs a number of clauses of the form:
maintains(S) :-
f_1(S, ...), f_2(S, ...), ..., f_k(S, ...).
All fluents in the body of a clause, f_1, . . . , f_k, are linked by the state variable
S. (This is guaranteed, as maintains is the only term that outputs a state, and
every other fluent requires a state as input.) The other parameters of each fluent
may be bound to constant values, or to (possibly shared) variables.
Before adopting this theory as a new description of the side-effect, the reflec-
tor evaluates it against all the examples in E+ and E−. If a previous description
has been learnt, it is similarly evaluated, otherwise the null-description (which
says that the side-effect never happens) is evaluated. The new hypothesis is
adopted if and only if it satisfies a minimum accuracy threshold and is
more accurate than the old hypothesis. Accuracy is measured in terms of the
total number of classification errors produced by the theory.
If the new hypothesis is adopted, then the reflector converts it into a list
of conjunctions by taking the body of each clause and stripping off the state
variables. This set of conjunctions is then sent to the planner, so that it can be
used in rebuilding plans.
6.5.4 Adding Incrementality
Aleph is a batch learner. It takes a batch of examples, and produces a theory to
describe them. Our learning problem, on the other hand, is incremental. We have
a continuous stream of new examples being created, and wish to incrementally
revise our theory to match new evidence as it arrives. Some extra work needs to
be done to use Aleph in this way.
The reflector handles this by keeping fixed size pools of positive and negative
training examples, E+max and E−max, for each side-effect being learnt. Training
examples are drawn randomly from these pools. Examples are added to the
pools as they arrive, until each pool reaches its maximum size (n+max and n−max
respectively). After this, incoming examples randomly replace examples already
in the set. This random replacement technique provides a “decaying window”
of old and new examples.
Reflection first happens when each example pool is full. It is subsequently
repeated when at least half the examples in each of the sets have been replaced.
(Positive and negative examples may arrive at different rates.)
After reflection, the new theory produced by Aleph is tested on all the
examples in the pool, as is theory from the previous step. Whichever theory is
more accurate is retained. Accuracy is measured by the sum of the number of
false positives (examples in E−max that are covered by the theory) and the number
of false negatives (examples in E+max that are uncovered).
The initial theory is that the side-effect in question never happens, i.e. the
empty theory. The result of the first batch is only kept if it proves more accurate
than this default theory.
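The error counting and theory comparison just described can be sketched as follows. Here a theory is a list of clause-predicates, a state is covered if any clause matches it, and all names are ours.

```python
def error_count(theory, pos_pool, neg_pool):
    """Total classification errors of a theory on the example pools:
    false positives (negative examples the theory covers) plus false
    negatives (positive examples it leaves uncovered)."""
    covers = lambda e: any(clause(e) for clause in theory)
    false_pos = sum(1 for e in neg_pool if covers(e))
    false_neg = sum(1 for e in pos_pool if not covers(e))
    return false_pos + false_neg

def better_theory(new, old, pos_pool, neg_pool):
    """Keep whichever theory makes fewer errors; ties favour the old one."""
    new_err = error_count(new, pos_pool, neg_pool)
    old_err = error_count(old, pos_pool, neg_pool)
    return new if new_err < old_err else old

# Toy theory: "maintains iff even". The empty list plays the role of the
# initial theory, under which the side-effect never happens.
even_theory = [lambda e: e % 2 == 0]
```

Note that the empty theory covers nothing, so it is only displaced once an induced theory explains the positive examples better than "the side-effect never happens".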
6.6 Incorporating side-effects into plans
Repairing a plan when a new side-effect description arrives is a complicated
business. It requires that the planner investigate every plan in its library for
nodes that use the affected behaviour, test whether the side-effect in question
affects that use, and if so rebuild the portion of the plan-tree rooted at that
node. The current implementation of Rachel takes the simpler course of simply
throwing out all existing plan trees and rebuilding them from scratch whenever
a new side-effect description arrives. This is terribly inefficient, but since new
descriptions in practice arrive only rarely, it is not much of a problem. More
intelligent plan repair is left for future work.
The PlanStep function presented earlier in Section 5.3.1 needs to be ex-
tended to include the ability to plan with behaviours that have conditional side-
effects. The extension is fairly simple: before the newly created node is added
to the tree, each of the side-effects of the employed behaviour are checked to see
if they interfere with the desired operation of the behaviour.
A side-effect cannot interfere with the post-conditions of a behaviour, as
such failures are never recorded as side-effects. If it interferes in any way, it will
be by contradicting the “unachieved” conditions of the parent node, which are
carried over to the new node. The frame-assumption says that it is safe to do
this, but side-effects violate this assumption. So it is necessary to test each of
the discovered side-effects against the maintained conditions to ensure that there
are no conflicts.
If behaviour B has a side-effect that results in the failure of a condition C,
then it conflicts with the maintained conditions if:
¬C ⇒ ¬unachieved
If this is the case, then one of the maintenance conditions for C must be added
to Nnew.cond in order to ensure that the side-effect does not occur. Each of
the conjunctions learnt by the reflector results in a new plan node. Pseudocode
for the extended plan-step operation is shown in Algorithm 19.
The params fluent in the maintenance conditions plays a special role. Rather
than being added to the newly created node, it places a restriction on the pa-
rameters of the behaviour. The parameters of the behaviour are unified with the
corresponding parameters in the params fluent. If the unification is not possible,
then that maintenance condition cannot be used.
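This special handling of params can be sketched as a one-step unification against ground behaviour parameters. The representation below is our own (Rachel's real unification is Prolog's): a fluent is (name, args), Prolog-style variables start with an upper-case letter, and constants must match exactly.

```python
def apply_params_restriction(maint_condition, behaviour_params):
    """Unify the params(...) fluent of a maintenance condition with the
    behaviour's actual (ground) parameters. A variable binds to anything;
    constants must be equal. Returns the remaining fluents with the
    bindings applied, or None if unification fails."""
    bindings, rest = {}, []
    for name, args in maint_condition:
        if name != "params":
            rest.append((name, args))
            continue
        for formal, actual in zip(args, behaviour_params):
            if formal[0].isupper():          # Prolog-style variable
                bindings[formal] = actual
            elif formal != actual:
                return None                  # constants clash: no unification
    substitute = lambda a: bindings.get(a, a)
    return [(n, tuple(substitute(a) for a in args)) for n, args in rest]

# The chapter's example: params(L) unifies L with blue, so the condition
# becomes distance(blue, D) ∧ fuel(F) ∧ gt(F, D); params(fuel) cannot
# unify with blue and so yields no node.
cond1 = [("params", ("L",)), ("distance", ("L", "D")),
         ("fuel", ("F",)), ("gt", ("F", "D"))]
restricted = apply_params_restriction(cond1, ("blue",))
```

The params fluent itself is consumed by the unification rather than added to the node, matching the description above.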
Continuing with our example, we know the GoTo behaviour has a side-effect
which causes ∃F : (fuel(F) ∧ gt(F, 0)) to fail. Suppose Aleph learns two main-
tenance conditions:
params(L) ∧ distance(L, D) ∧ fuel(F) ∧ gt(F, D)
and:
params(fuel)
which say that the taxi can safely reach any location for which it has more than
enough fuel, or the fuel location no matter how far it is. (This is not necessarily
a likely outcome of reflection, but we shall use it for the purposes of the example.)
Suppose the planner is attempting to achieve the goal:
taxi loc(blue) ∧ fuel(F) ∧ gt(F, 0)
Now we know that the GoTo(blue) behaviour will achieve the taxi loc(blue)
condition. The remainder of the goal must be maintained from the previous
node. A new node condition is constructed:

fuel(F) ∧ gt(F, 0)
However GoTo()’s side-effect conflicts with this condition. So one of the two
maintenance conditions must be added.
The first maintenance condition contains the fluent params(L). So L must be
unified with blue, the parameter of the behaviour. Under this unification the
maintenance condition becomes:
distance(blue, D) ∧ fuel(F) ∧ gt(F, D)
This condition is added to the new node condition above to get:
fuel(F) ∧ gt(F, 0) ∧ distance(blue, D) ∧ gt(F, D)
A node is added to the plan with this new condition.
Attempting to use the second maintenance condition however results in fail-
ure. The fuel parameter of the params fluent in the condition cannot be unified
with the blue parameter of the behaviour, so this alternative does not permit
the creation of a new node.
Figure 6.3 shows a partial plan for the Deliver(D) behaviour (with the location
D unbound), which uses the maintenance conditions above to avoid running out
of fuel. Notice that the Refuel behaviour has now been included in the plan. It
is worth examining in detail why this is so.
Previously, before the side-effect of GoTo and its corresponding maintenance
conditions were known, the node dictating GoTo(D) had the condition:
psgr dest(D) ∧ psgr in taxi ∧ fuel(F) ∧ gt(F, 0)
The PlanStep function would consider Refuel as a possible extension to this
node, as it achieves fuel(16) which would satisfy the fluents fuel(F) ∧
gt(F, 0). This would leave the unachieved conditions:
psgr dest(D) ∧ psgr in taxi
Algorithm 19 Exploratory planning: Adding new nodes

function PlanStep(behaviour B, node N, explored E)
    for each behaviour B′ with granularity B.gran + 1 do
        \\ Find which of the node conditions B′ achieves, if any
        (achieved, unachieved) ← Achieved(B′, N.cond)
        if achieved = {} then
            Skip to the next behaviour
        end if
        \\ Check for interference
        if B′.post ∧ N.cond ⇒ ⊥ then
            Skip to the next behaviour
        end if
        \\ Construct the new node’s condition
        newCondition ← B.pre ∧ unachieved ∧ B′.pre
        choose either
            type ← policy
            newCondition ← AddMaintenance(newCondition, B′)
        or
            type ← exploratory
        end choose
        if newCondition ⇒ ⊥ then
            Skip to the next behaviour
        end if
        \\ Check if the new condition has already been explored
        for each condition C ∈ E do
            if newCondition ⇒ C then
                Skip to the next behaviour
            end if
        end for
        \\ Add the new node to the tree
        Nnew.cond ← newCondition
        Nnew.parent ← N
        Nnew.B ← B′
        Nnew.type ← type
        N.children ← N.children ∪ {Nnew}
    end for
end PlanStep
Algorithm 20 Exploratory planning: Adding maintenance conditions
function AddMaintenance(condition N, behaviour B) returns augmented condition
\\Check for any side-effects that will conflict with
\\this node and add appropriate maintenance conditions.
for each condition C ∈ B.sfx do
if ¬C ⇒ ¬N then
pick M ∈ maintains(B, C)
N ← N ∧M
end if
end for
return N
end AddMaintenance
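A minimal Python rendering of this loop may make the data flow clearer. Here a condition is a set of literal strings, `conflicts` stands in for the ¬C ⇒ ¬N test, and `maintains` maps each side-effect to its learnt maintenance conditions; all of these names are illustrative assumptions, not Rachel's internals:

```python
def add_maintenance(condition, side_effects, conflicts, maintains):
    """Sketch of AddMaintenance (Algorithm 20): for each side-effect
    C of the behaviour that would falsify the node condition, conjoin
    one of the learnt maintenance conditions for C."""
    result = set(condition)
    for c in side_effects:
        if conflicts(c, result):              # the "¬C ⇒ ¬N" test
            candidates = maintains.get(c, [])
            if candidates:                    # "pick M": take the first
                result |= set(candidates[0])
    return result
```

In the taxi example, the fuel side-effect of GoTo would conjoin distance(D, Dist) ∧ gt(F, Dist) onto a node that already requires fuel(F) ∧ gt(F, 0).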
Figure 6.3: A plan for the Deliver behaviour which avoids running out of fuel.
Add to this the pre-image of Refuel and the pre-image of the parent behaviour
Deliver(D) and the condition for the new node would be:
psgr_dest(D) ∧ psgr_in_taxi ∧ taxi_loc(fuel) ∧ fuel(F) ∧ gt(F, 0)
However this condition is implied by the condition of its parent (above), so it
would never be activated. Therefore it is pruned from the plan.
However things change after the reflector has detected the side-effect on GoTo
and learnt the maintenance conditions to avoid it. The condition of the GoTo(D)
node is now:
psgr_dest(D) ∧ psgr_in_taxi ∧ fuel(F) ∧ gt(F, 0)
∧ distance(D, Dist) ∧ gt(F, Dist)
Once again, PlanStep will consider Refuel as a candidate behaviour, to achieve
fuel(F) ∧ gt(F, 0). Notice that in doing so, Refuel binds the variable F to 16,
so the term gt(F, Dist) becomes gt(16, Dist). The unachieved conditions are
now:
psgr_dest(D) ∧ psgr_in_taxi ∧ distance(D, Dist) ∧ gt(16, Dist)
Add to this the pre-images of Refuel and Deliver(D) and we get a new node with
condition:
psgr_dest(D) ∧ psgr_in_taxi ∧ distance(D, Dist) ∧ gt(16, Dist)
∧ taxi_loc(fuel) ∧ fuel(F) ∧ gt(F, 0)
This condition is not implied by any of the ancestor nodes, so it is added to the
plan.
The plan in Figure 6.3 shows only part of the full plan. The entire plan
is 15 levels deep and contains a total of 186 nodes. Many of these describe
different possible orderings on moving, getting the passenger, and refueling. The
size of this plan could be drastically cut down by introducing partially-ordered
planning, but this has its own set of difficulties, as described in Chapter 8.
6.6.1 Exploratory planning
The extended PlanStep function in Algorithm 19 contains a further complexity
that remains to be explained. It adds a type property to nodes that can either
be policy or exploratory. Exploratory nodes do not take side-effects into account.
Figure 6.4 shows the plan from Figure 6.3 with exploratory planning. A single
exploratory node has been added which uses the GoTo(D) behaviour ignoring its
side-effects.
Figure 6.4: A plan for the Deliver behaviour using exploratory planning. The node with the broken outline is exploratory.
Exploratory nodes exist to allow the agent to explore behaviours where they
might otherwise be prohibited. Without exploration of this kind, the agent
would be helpless to fix any over-specialised maintenance conditions produced
by the reflection. If the reflector underestimates the size of the maintenance
region for a particular side-effect then the policy nodes of the plan will be overly
restrictive and the behaviour will not be used as widely as it might be. Without
some form of exploration such a mistake is incurable. The actor will never use
the behaviour outside the limits of the plan, and so the necessary examples
of successful execution will never be generated. Without any new examples
the reflector is likely to continue producing the same over-specialised side-effect
description.
Exploratory nodes indicate when behaviours might be explored, while still
limiting choices to applicable behaviours. When the actor chooses a policy be-
haviour, it can choose only from the policy nodes of the plan, but when it chooses
an exploratory action it is free to choose from nodes of either type.
By occasionally exploring a behaviour even when it is expected to cause a
side-effect, counter-examples can be generated for overly restrictive maintenance
conditions, which may result in a more general description the next time the
reflector considers the side-effect.
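The choice rule can be sketched in Python; the node representation (dicts with behaviour and type fields) is an assumption for illustration:

```python
import random

def choose_behaviour(active_nodes, exploring, rng=random):
    """Policy steps may only draw on policy nodes; exploratory steps
    may draw on active nodes of either type.  Returns None if no node
    in the permitted pool is active."""
    if exploring:
        pool = list(active_nodes)
    else:
        pool = [n for n in active_nodes if n['type'] == 'policy']
    if not pool:
        return None
    return rng.choice(pool)['behaviour']
```

An exploratory step may thus occasionally run a behaviour despite an expected side-effect, generating the counter-examples the reflector needs.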
6.7 Summary
In this chapter we have explained the operation of the reflector, the third element
of the Rachel architecture. We have shown how the reflector detects execution
failures in the agent’s plan, diagnoses their causes and learns to predict when
they will occur, using Inductive Logic Programming. This new knowledge can
then be integrated into future plans, allowing the agent to avoid such unwanted
effects in the future.
This completes our discussion of the implementation of Rachel. In the
next chapter we attempt an empirical investigation into the benefits Rachel
provides.
Chapter 7
Experiment Results
In this chapter we shall exhibit the performance of the Rachel architecture in
three experimental domains: the simple gridworld and taxi-car domains from
earlier chapters, and a third more complex simulation based on the Robocup
robotic soccer competition.
7.1 Experiments in the gridworld domain
7.1.1 Domain description
The gridworld experiments were conducted in the world described earlier in
Sections 2.2.1 and 5.2.2. The map is reproduced in Figure 7.1. We represent the
plan of a house as a 75 × 50 grid. The agent is a robot located in one of these
cells. There are also two objects in the house, the coffee and the book, with
starting locations as indicated on the map. Walls and doors divide the map into
a collection of rooms, as shown on the map.
Primitive representation
The primitive state of the agent is represented by four instruments: the robot's x
and y coordinates, and the locations of each of the objects. Each object has only
two possible locations: either it is in its starting location or else the robot is
carrying it. The instruments representing this state are shown in Table 7.1.
There are nine primitive actions available to the agent: one for each of the
eight directions of movement (n, ne, e, se, s, sw, w, nw), plus the pickup action.
The movement actions move the robot to one of the eight neighbouring cells,
7. Experiment Results 133
Figure 7.1: The first experimental domain - the Grid-world.
Table 7.1: Instruments used in the Grid-world domain.
x            the current X position of the robot
y            the current Y position of the robot
loc(Object)  the location of Object:
             0, if it is in its original location;
             1, if it is being carried by the robot
provided there is no wall blocking the movement. These actions have a 5%
chance of error, in which case the agent moves at right angles to its desired
heading.
The pickup action will pick up any object in the same location as the robot.
If there is no object in the robot's cell, this action does nothing. The robot can
carry both objects simultaneously without problem. It cannot drop objects.
In the second and third experiments in this domain, we shall introduce a
“bump” into the world, as indicated by the shaded area on the map. If the
robot enters a shaded cell while carrying the coffee there is a 10% chance that
it will spill the coffee. If spilt, the coffee returns to its initial location.
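The movement and spill dynamics described above can be re-created in a short Python sketch (coordinates, the wall representation, and the right-angle slip rule are illustrative assumptions):

```python
import random

# Offsets for the eight compass movement actions.
MOVES = {'n': (0, -1), 'ne': (1, -1), 'e': (1, 0), 'se': (1, 1),
         's': (0, 1), 'sw': (-1, 1), 'w': (-1, 0), 'nw': (-1, -1)}

def step(pos, action, blocked, bump_cells, carrying_coffee, rng=random):
    """One movement action: 5% chance of slipping at right angles to
    the desired heading, walls block the move, and entering a bump
    cell while carrying the coffee spills it 10% of the time."""
    dx, dy = MOVES[action]
    if rng.random() < 0.05:
        dx, dy = (-dy, dx) if rng.random() < 0.5 else (dy, -dx)
    new = (pos[0] + dx, pos[1] + dy)
    if new in blocked:
        new = pos
    spilled = (carrying_coffee and new in bump_cells
               and rng.random() < 0.10)
    return new, spilled
```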
Symbolic representation
The symbolic representation of the Grid-world describes the rooms and their
topology. The fluents, shown in Table 7.2, represent the locations of the objects
and the robot in terms of the rooms they are in. The loc(Object) instrument
translates directly to the holding(Object) fluent. The connections between
rooms are given by the door() fluent.
Table 7.2: Fluents used in the Grid-world domain.
location(Object, Room)  true when Object is in Room;
                        Object may be one of robot, book or coffee
holding(Object)         true when the robot is holding Object
door(From, To)          true if there is a door linking rooms From and To
Using these fluents, we describe the agent's goal, and the behaviours it uses to
achieve this, shown in Table 7.3. The goal is to fetch both the coffee and the book,
and take them into the lounge. This is represented by the root behaviour Fetch.
Two sub-behaviours, Go() and Get() are also provided. The Go() behaviour is
designed to take the robot from its current room to a neighbouring one. The
Get() behaviour can be executed once the robot is in the same room as an object,
and is designed to locate the object and pick it up.
7.1.2 Experiment 1: Planning vs HSMQ vs P-HSMQ
The aim of the first experiment is to demonstrate the advantage of using a
combination of planning and hierarchical reinforcement learning over either one
alone. To this end, we shall compare three different approaches to learning the
Fetch behaviour above:
1. HSMQ-learning (Algorithm 1 in Section 2.3.2) with no pruning (all appli-
cable behaviours are available)
2. Executing the plan directly with reinforcement learning only at the bot-
tom of the hierarchy (learning primitive policies for behaviours). Choices
between behaviours in the plan were resolved in favour of the shallower
node, breaking ties randomly.
Table 7.3: Behaviours available in the Grid-world.
Fetch
  gran: 0
  view: { x, y, loc(book), loc(coffee) }
  pre:  true
  post: location(robot, lounge) ∧ holding(book) ∧ holding(coffee)
  P:    { n, ne, e, se, s, sw, w, nw, pickup }

Go(From, To)
  gran: 1
  view: { x, y, id(To) }
  pre:  location(robot, From)
  post: location(robot, To)
  P:    { n, ne, e, se, s, sw, w, nw }

Get(Object)
  gran: 1
  view: { x, y, id(Object) }
  pre:  location(Object, Room) ∧ location(robot, Room)
  post: holding(Object)
  P:    { n, ne, e, se, s, sw, w, nw, pickup }
3. P-HSMQ-learning (Algorithm 12 in Section 5.3.2) with a plan-based task-
hierarchy
Each approach was run twenty times, with each run consisting of 1000 consecutive trials. A trial begins with the agent empty-handed at its starting location
in the study, and ends when the agent arrives in the lounge with both
the coffee and the book.
The learning parameters were set as follows: The learning rate α was 0.1.
The discount factor γ was 0.95. Exploration was done in an ε-greedy fashion
with a 1 in 10 chance of the agent choosing an exploratory action (both at the
level of primitive actions, and in the choice of behaviours). Exploratory actions
were chosen using recency-based exploration (Thrun, 1992).
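These settings correspond to the usual one-step Q-learning backup with ε-greedy selection. The sketch below uses uniform random exploratory choice rather than the recency-based scheme actually employed, and assumes a tabular Q-function keyed by (state, action):

```python
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # values used in Experiment 1

def epsilon_greedy(q, state, actions, rng=random):
    """With probability EPSILON take an exploratory action,
    otherwise the greedy one (ties broken by action order)."""
    if rng.random() < EPSILON:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

def q_update(q, s, a, reward, s_next, actions):
    """One-step Q-learning backup with alpha=0.1, gamma=0.95."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + ALPHA * (reward + GAMMA * best_next - old)
```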
Results
Figure 7.2 shows the results of this experiment (note: this graph is plotted on
a log-log scale, to highlight the differences in the early trials).
Figure 7.2: A comparison of learning rates for three approaches to the grid-world
problem: (1) using HSMQ to select from every applicable behaviour, (2) using
a plan to select behaviours, (3) using P-HSMQ to select behaviours from
alternatives provided by a plan. Error bars represent one standard deviation.
Clearly the plan-based approaches converge much more rapidly than the unplanned approach.
Measuring the average number of experiences required before trial-length falls
below 500 shows that approach 1 takes 72324 primitive actions, approach 2
takes 29956, and approach 3 takes 35151. In both cases the difference
is significant with 99% confidence. The reason for this difference is apparent.
Approach 1 spent much of its time in early trials learning behaviours never
featured in its final policy. Approaches 2 and 3 restricted their exploration to a
smaller set of behaviours, which were more relevant to the task at hand. The
long term performance of approach 1 is also poorer, as it continues to explore
behaviours which do not contribute to the goal.
Figure 7.3: Average trial lengths for each individual run of the three approaches
in Experiment 1. Error bars indicate one standard deviation.
The exploratory actions performed by approaches 1 and 3 hide differences
between the final policies learnt by each. To resolve this, the learnt policies
from each of the three approaches were run for a further 1000 trials without
any further learning or exploration. The results of these trials are shown in
Figure 7.3.
This graph shows the average trial lengths for each repeat run. Notice that
the results from both HSMQ approaches fall into two clusters, one below 250 and
one above 280. These correspond to the two high-level solutions to the problem:
either getting the coffee first then the book (the shorter solution) or else getting
the book then the coffee (the longer solution). Both approaches converged to a
policy which implemented one of these two solutions well (indicated by the small
deviation per run). Approach 1, without the plan, chose the shorter solution in
8 out of 20 runs. Approach 3, with the plan, chose the shorter solution in 13
out of 20 runs. Contrast this with approach 2, which does no learning at the
behaviour level. It does not settle on one solution or the other, but selects one
randomly for each trial. This is shown by the much greater standard deviation
in these runs.
Ideally both approaches 1 and 3 ought always to converge to the better of
the two solutions. The failure to do so is probably due to lack of exploration,
and the “lock-in” effect that occurs when Q-values are pessimistically initialised.
This is expected to be more of a problem with the unplanned approach as it has
more options to explore and thus will take longer to find the better solution.
Even so, the combination of planning and hierarchical learning shown in
approach 3 appears, on average, to converge to a better and more stable solution
than either planning or hierarchical learning alone.
7.1.3 Experiment 2: P-HSMQ vs TRQ
Our second experiment investigates the issue of termination improvement. For
this experiment we introduce the “bump” into the world to examine how the
side-effect it causes affects the performance of each algorithm. We wish to
compare the performance of the P-HSMQ (Algorithm 12) and TRQ (Algorithm 13)
algorithms, and how the final policy produced by each compares to that produced
by a standard termination-improvement technique.
Twenty independent runs were performed with each of P-HSMQ and TRQ.
Learning and exploration rates were set as in Experiment 1. In the TRQ
approach, the probability of taking exploratory actions η was set to 0.1. Each run
consisted of 1000 learning trials, followed by 1000 test trials (in which learning
is disabled and both ε and η are set to zero).
Each policy learnt using P-HSMQ was also tested for 1000 trials using ter-
mination improvement (following the algorithm in (Sutton et al., 1999)) rather
than the usual subroutine semantics of HSMQ. The algorithm for termination-
improved execution is shown in Algorithm 21.
Algorithm 21 Termination-improved execution of a policy learnt by P-HSMQ
function TermImp(behaviour B)
t← 0
Observe state st
while st ⊨ B.pre ∧ st ⊭ B.post do
st+1 ← TermImpStep(B, st)
end while
end TermImp
function TermImpStep(behaviour B, state st) returns state st+1
Bt ← ActiveBehaviours(B.plan, st)
if Bt = ∅ then
Choose primitive at ← π(st) from B.P
according to an exploration policy
Execute at
Observe state st+1
else
Choose behaviour Bt ← π(st) from Bt
according to an exploration policy
st+1 ← TermImpStep(Bt, st)
end if
return st+1
end TermImpStep
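Algorithm 21 can be paraphrased in Python. The Behaviour class and the environment interface below are illustrative assumptions; the essential point is that the set of active nodes is re-evaluated after every primitive step, so a sub-behaviour is abandoned as soon as its node ceases to be active:

```python
class Behaviour:
    """A minimal stand-in for a behaviour: pre/post conditions, a
    primitive policy, and the sub-behaviours offered by its plan."""
    def __init__(self, pre, post, policy, sub=None):
        self.pre, self.post, self.policy = pre, post, policy
        self.sub = sub or []

    def active(self, state):
        # Stand-in for ActiveBehaviours(B.plan, s)
        return [b for b in self.sub if b.pre(state) and not b.post(state)]

def term_imp_step(behaviour, state, env_step):
    """Descend through active sub-behaviours, execute one primitive
    action at the bottom, then return so all choices are re-made on
    the next step."""
    active = behaviour.active(state)
    if not active:
        return env_step(state, behaviour.policy(state))
    return term_imp_step(active[0], state, env_step)  # greedy choice

def term_imp(behaviour, state, env_step, limit=10000):
    """Run the behaviour until its pre fails or its post holds."""
    steps = 0
    while behaviour.pre(state) and not behaviour.post(state) and steps < limit:
        state = term_imp_step(behaviour, state, env_step)
        steps += 1
    return state, steps
```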
(a) performance against number of experiences
(b) performance against number of trials (first 300 trials only)
Figure 7.4: A comparison of learning rates for TRQ and P-HSMQ in the grid-world
problem with the bump. Error bars represent one standard deviation.
Results
Figure 7.4 shows the results of this experiment, plotted as performance versus
number of experiences and performance versus number of trials. As can be seen
from the first of the two graphs, both algorithms took approximately the same
number of experiences to converge. If we count the number of experiences
before each approach produced a trial with fewer than 500 steps, we see that TRQ
took an average of 773276 steps while P-HSMQ took only 4% more, at 803526
steps.
There is, however, a significant difference in the number of trials required for
convergence. P-HSMQ took an average of 71 trials before it produced a trial less
than 500 steps long; TRQ took 103. A T-test shows this difference has more
than 99% significance. So P-HSMQ converged with fewer, longer trials, whereas
TRQ took many more trials, but they were significantly shorter.
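The T-tests reported here compare per-run statistics such as these trial counts. A Welch's t statistic, which does not assume equal variances, can be computed directly; this is a generic sketch, not Rachel's analysis code:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples.  For twenty
    runs per condition, |t| roughly above 2.7 corresponds to a
    difference significant at the 99% level."""
    vx, vy = variance(xs), variance(ys)   # sample variances
    return (mean(xs) - mean(ys)) / sqrt(vx / len(xs) + vy / len(ys))
```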
Turning to the results produced in the testing phase, Figure 7.5 shows the
average trial length for each run of each approach, plus the results for the
termination-improved P-HSMQ policies. The data has been split into two cases:
Figure 7.5(a) shows the average performance in those trials in which the coffee
was not spilled (approximately 90% of the trials) and Figure 7.5(b) shows the
average performance for trials in which the coffee was spilled.
Four important facts are noticeable:
1. The runs of TRQ appear to fall into two distinct sets. Runs 4, 5, 8, 10 and
13 produced significantly better policies than the others (in both graphs).
Examining the policies produced shows that in these runs the agent learnt
to fetch the coffee first and then the book, whereas in the other runs the
agent learnt to fetch the book first, and then the coffee. Fetching the book
first requires that the agent traverses the length of the hallway 3 times,
instead of just 1 if the coffee is fetched first.
2. The results of P-HSMQ in the no-spill case show that it consistently per-
forms as well as the worse of the two policies learnt by TRQ. Examining
the policies produced shows that in all 20 runs the agent learnt to fetch
the book first, the poorer policy.
3. As expected, the policies produced by the P-HSMQ algorithm suffered
when the robot spilled the coffee. These trials were significantly longer
than the average performance of even the worse of the two sets of TRQ
policies (with greater than 99% confidence, according to a T-test).
4. Termination improvement improves this situation significantly. The per-
formance of the termination-improved P-HSMQ policies is not significantly
different to the worse of the two TRQ policies.
7.1.4 Experiment 3: The effect of the η parameter
In the final experiment with the grid-world domain we investigate the effect
of the η parameter on the performance of the TRQ algorithm. The problem
specified in Experiment 2 above was repeated for eleven different values of η,
ranging from 0 to 1. For each value, 20 runs of 1000 learning trials followed by
1000 test trials were performed.
Results
As in Experiment 2 above, we have plotted the performance data for this
experiment in terms of both performance versus number of experiences (Figure 7.6(a))
and performance versus number of trials (Figure 7.6(b)). Results are only shown
for three of the eleven values of η tested. The rest are summarised in Figure 7.7,
which plots the average number of experiences and the average number of trials
executed before a trial was completed in fewer than 500 steps.
Once again, there is no significant difference in the total number of experi-
ences required for convergence for any of the values of η used. However, small
values of η do show a higher variance. An F-test comparing results with η = 0
and η = 1 shows that the standard deviation of the former, 136459, is significantly
greater than the latter, 36983, with greater than 99% confidence.
On the other hand, the number of trials taken before convergence is seen
to vary significantly with η, falling from an average of 123 when η = 0 to a
minimum of 72 when η = 0.7. A T-test shows this difference is significant
with greater than 99% confidence. There is a slight rise as η increases further.
When η = 1 the average is 76 trials before convergence. A T-test rates this
difference at 85% significance.
(a) when the coffee is not spilled
(b) when the coffee is spilled
Figure 7.5: Final policy performance for Experiment 2, comparing policies
produced by TRQ, P-HSMQ and P-HSMQ with termination improvement. Error
bars show one standard deviation.
(a) performance against number of experiences
(b) performance against number of trials (first 300 trials only)
Figure 7.6: A comparison of learning rates for TRQ with different values of η.
Error bars show one standard deviation.
(a) average number of experiences before convergence
(b) average number of trials before convergence
Figure 7.7: Convergence times for TRQ with different values of η.
7.1.5 Discussion of the gridworld experiments
The grid-world experiments verify our expectations that pruning the task-hierarchy
is advantageous, resulting in faster convergence than exploring all behaviours,
and that the additional reactivity of Teleo-Reactive Q-learning allows better
policies. In this example, the extra cost of TRQ is minimised as there are few
redundancies in the plan, i.e. there are no cases where two or more active nodes
dictate the same behaviour. (In more complex worlds, we would expect TRQ
to take longer to converge as redundant nodes in the plan would have to be
explored individually.)
In both Experiments 1 and 2 the final policies learnt are sometimes sub-
optimal. The task has two possible solutions at the behaviour level, which differ
according to the order in which the objects are fetched. The two solutions differ
significantly in performance – fetching the book first causes the agent to take
around 80-100 more steps to reach the goal. Theory suggests that this path
should not be taken. In practice, the problem is probably due to insufficient
exploration. If one path performs better initially, it will receive the bulk of the
agent’s attention and will improve more quickly. Once such a decision has been
made, the chances of the agent switching to the alternative path are low, as it
must be explored a large number of times in order to rival the initial choice. This
is particularly the case in recursively optimal reinforcement learning. Behaviours
must be explored extensively in order for their internal policies to improve, before
a decision is made as to which behaviour is optimal.
Experiment 2 seems to show that P-HSMQ is more sensitive to this effect
than TRQ. In all of the twenty runs P-HSMQ converged to the longer of the
two possible solutions, whereas TRQ was able to find the better solution in five
runs. One possible explanation is that by committing to behaviours P-HSMQ
amplifies the drawbacks of early unconverged behaviours. Getting the coffee
first requires that the agent crosses the hall twice while carrying the coffee –
once to execute Go(hall, bedroom2) to fetch the book, and then once to execute
Go(hall, lounge) to finish the task. If the coffee is fetched second, then the hall
is only crossed once with the coffee in hand, when the final Go(hall, lounge) is
executed. Once both these behaviours have optimised their policies, the prob-
ability of spilling the coffee is much the same in either case, but initially these
behaviours will involve a lot of random exploration and the probability of spillage
will be high. As a result, in the early phase of learning it is likely to seem better
to fetch the book first, minimising the probability of a spill. Once this decision
is made, it will “lock in” until enough exploration of the alternative proves it to
be sub-optimal. This is true for both algorithms, but since P-HSMQ attaches a
much greater cost to spillage than TRQ, it is likely to require much more explo-
ration to recover. This possibly explains why TRQ was able to find the better
path more often than P-HSMQ.
Experiment 3 shows an apparent insensitivity of the TRQ algorithm to the
value of its parameter η. This suggests that the time taken for convergence in
this task is dominated by the amount of time required to learn primitive policies
for behaviours, and not by the time spent learning higher-level policy decisions.
Ultimately the same amount of low-level learning is required regardless of the value
of η, so the same amount of time is taken.
Alternatively, our measure of convergence might be too loose. We showed
above that most runs did not converge to the optimal behaviour-based policy.
If we used a tighter measure of convergence which would not be satisfied by the
sub-optimal path, then we might expect the value of η to have a stronger effect.
Further experiments need to be run to confirm this.
7.2 Experiments in the taxi-car domain
7.2.1 Domain description
Figure 7.8: The second experimental domain - The Taxi-Car.
The second set of experiments, focused on the action of the reflector, will be
conducted in the taxi-world from Section 6.2. The map of the world is reproduced
in Figure 7.8. The agent controls a simulated taxicab which has to navigate
between different locations in the world, picking up passengers and delivering
them to their desired destination. The agent has a limited fuel supply and a
critical part of the learning problem will be knowing when and how to refuel.
Primitive representation
The primitive state of the world is defined by the following factors: the x-y
position of the taxi, the location of the passenger, the passenger's desired
destination and the amount of fuel in the taxi. These elements are represented by
the instruments shown in Table 7.4.
Initially the taxi is placed at a random position in the world. The passenger is
randomly placed at one of the five locations, with a different random location as
its destination. The fuel tank is randomly set between half full and full, i.e. 8-16
units, inclusive.
There are seven primitive actions available to the agent: four controlling the
movement of the taxi (north, south, east, west), two for picking up and putting
down the passenger (pickup, putdown), and one for refilling the fuel tank (fill).
The movement primitives move the agent one cell in the direction specified,
unless there is a wall in the way, in which case they do nothing. These actions
have a 5% chance of error, in which case the agent moves at right angles to its
desired heading. Each movement action, whether successful or not, uses 1 unit
of fuel.
The pickup action only operates when the taxi is in the same location as the
passenger, in which case the passenger’s location is set to taxi. The putdown
action only operates when the passenger is in the taxi and the taxi is at one of
the five locations, in which case the passenger’s location is set to the location of
the taxi. Under other conditions these actions have no effect. Neither of these
actions has any effect on fuel.
The fill action only operates when the taxi is at the fuel() location, in which
case it sets the fuel tank to full (16 units). Otherwise it does nothing.
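The primitive dynamics above can be gathered into one Python sketch (the state encoding, map representation and right-angle slip rule are illustrative assumptions):

```python
import random

DIRS = {'north': (0, -1), 'south': (0, 1), 'east': (1, 0), 'west': (-1, 0)}
FULL_TANK = 16

def taxi_step(state, action, walls, locations, rng=random):
    """One primitive action.  `state` is a dict with 'pos', 'fuel',
    'psgr' (a location name, or 'taxi' when aboard) and 'dest';
    `locations` maps location names to cells.  Returns the successor."""
    s = dict(state)
    if action in DIRS:
        dx, dy = DIRS[action]
        if rng.random() < 0.05:            # 5% slip at right angles
            dx, dy = (-dy, dx) if rng.random() < 0.5 else (dy, -dx)
        new = (s['pos'][0] + dx, s['pos'][1] + dy)
        if new not in walls:
            s['pos'] = new
        s['fuel'] -= 1                     # fuel is spent even if blocked
    elif action == 'pickup':
        if s['psgr'] != 'taxi' and locations.get(s['psgr']) == s['pos']:
            s['psgr'] = 'taxi'
    elif action == 'putdown':
        here = [n for n, p in locations.items() if p == s['pos']]
        if s['psgr'] == 'taxi' and here:
            s['psgr'] = here[0]
    elif action == 'fill':
        if locations.get('fuel') == s['pos']:
            s['fuel'] = FULL_TANK
    return s
```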
Symbolic representation
The symbolic representation of the state of the taxi-world is given by the fluents
in Table 7.5. Some fluents, like psgr_loc() and psgr_dest(), are simple
wrappers to certain instruments; others have more complex definitions. Two
fluents, distance() and rgte(), are not used in the definition of behaviours. They
are included to make the hypothesis language for describing side-effects more
expressive.
Table 7.4: Instruments used in the Taxi-car domain.
x                      the current X position of the taxi
y                      the current Y position of the taxi
passenger location     one of red, green, blue, yellow, fuel or taxi
passenger destination  one of red, green, blue, yellow, or fuel
fuel                   the amount of fuel in the tank,
                       between 16 (full) and 0 (empty)
Table 7.5: Fluents used in the Taxi-car domain.
psgr_loc(Location)            the passenger's location
psgr_in_taxi                  the passenger is in the taxi
psgr_dest(Destination)        the passenger's destination
taxi_loc(Location)            the taxi's location
fuel(Fuel)                    the fuel level
distance(Location, Distance)  the Manhattan distance to a given location
gt(X, Y)                      X is greater than Y
rgte(X, Y, R)                 X is greater than or equal to Y × R
The five behaviours defining the taxi-car learning problem are shown in
Table 7.6. The root task is given by the granularity zero behaviour Deliver() which
sets the main goal:
psgr_loc(L) ∧ psgr_dest(L)
Four granularity one behaviours are provided to achieve this goal: GoTo(), Get(),
Put() and Refuel. Of these, only GoTo() needs to be learnt; the other three are
simply high-level wrappers to the primitive actions pickup, putdown and fill,
respectively.
The model of the GoTo() behaviour is missing an important fact - it uses up
fuel. A better model might include extra preconditions to ensure an adequate
amount of fuel is available to reach the goal. As it is, the behaviour pays no
attention to fuel, and is not penalised if the fuel happens to run out. That
concern is left to the parent behaviour. This means that the GoTo() behaviour
does not need to include fuel in its view, simplifying its state-space, but will
also cause trouble for the planner. In the absence of any other information,
the planner assumes the behaviour has no effect on fuel whatsoever. Using this
assumption, it builds the plan shown in Figure 7.9, which we shall refer to as
the “naive” plan.
Figure 7.9: A “naive” plan for the Deliver behaviour in the Taxi world, assuming the GoTo() behaviour has no effect on fuel.
As already discussed in Chapter 6, when the agent executes this naive plan
a plan failure will sometimes occur. This happens when the taxi runs out of
fuel unexpectedly, while executing one or other of the nodes which dictate the
GoTo() behaviour. This triggers the reflector, which diagnoses the side-effect
that caused the plan-failure and attempts to learn a condition under which it
can be avoided. In the experiments that follow we shall investigate the effects of
reflection on the agent's performance.
Table 7.6: Behaviours available in the Taxi-Car domain.

Deliver()
  gran: 0   view: { x, y, passenger location, passenger destination, fuel }
  pre: true   post: psgr_loc(L) ∧ psgr_dest(L)
  P: { north, south, east, west, pickup, putdown, fill }

GoTo(L) : L ∈ Locations
  gran: 1   view: { x, y, id(L) }
  pre: true   post: taxi_loc(L)
  P: { north, south, east, west }

Refuel
  gran: 1
  pre: taxi_loc(bowser)   post: fuel(16)
  P: { fill }

Get(L) : L ∈ Locations
  gran: 1
  pre: taxi_loc(L) ∧ psgr_loc(L)   post: psgr_in_taxi
  P: { pickup }

Put(L) : L ∈ Locations
  gran: 1
  pre: taxi_loc(L) ∧ psgr_in_taxi   post: psgr_loc(L)
  P: { putdown }
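The behaviour definitions in Table 7.6 can be captured in a simple record type. The following is an illustrative sketch only; the class and field names are ours, not Rachel's actual implementation:

```python
from dataclasses import dataclass

# Illustrative sketch of a behaviour record in the style of Table 7.6.
# Class and field names are hypothetical, not Rachel's actual code.
@dataclass(frozen=True)
class Behaviour:
    name: str
    gran: int        # granularity level in the hierarchy
    pre: str         # precondition, as a fluent expression
    post: str        # postcondition (the goal the behaviour achieves)
    actions: tuple   # primitive actions available to the learner

TAXI_BEHAVIOURS = [
    Behaviour("Deliver", 0, "true", "psgr_loc(L) & psgr_dest(L)",
              ("north", "south", "east", "west", "pickup", "putdown", "fill")),
    Behaviour("GoTo", 1, "true", "taxi_loc(L)",
              ("north", "south", "east", "west")),
    Behaviour("Refuel", 1, "taxi_loc(bowser)", "fuel(16)", ("fill",)),
    Behaviour("Get", 1, "taxi_loc(L) & psgr_loc(L)", "psgr_in_taxi", ("pickup",)),
    Behaviour("Put", 1, "taxi_loc(L) & psgr_in_taxi", "psgr_loc(L)", ("putdown",)),
]

# Only GoTo offers a real choice of actions and so needs a learnt policy;
# the other granularity-one behaviours wrap a single primitive action.
needs_learning = [b.name for b in TAXI_BEHAVIOURS
                  if b.gran == 1 and len(b.actions) > 1]
```

The distinction drawn in the text falls out directly: only GoTo has more than one primitive action to choose between.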
7.2.2 Experiment 4: Reflection
The aim of the first taxi-world experiment is to compare Rachel's performance
in the taxi-world with and without reflection, and also against a
hand-crafted task-hierarchy which includes the possibility of using the Refuel
behaviour. Our aim is to show how reflection might be used to identify and
repair important omissions from the world model.
We compare four different approaches to the problem:
1. TRQ without reflection (using the naive plan),
2. TRQ with reflection,
3. TRQ with a hand-crafted plan (described below),
4. TRQ with all instantiations of all behaviours available (whenever they are
applicable).
The hand-crafted plan is shown in Figure 7.10. Essentially it consists of
the naive plan plus an alternative branch which allows the agent to use the
Figure 7.10: A hand-crafted plan which adds refueling to the naive plan. [Plan graph: the naive plan's nodes GoTo(L), Get(L), GoTo(D) and Put(D), plus an alternative branch through GoTo(fuel) and Refuel, with conditions on psgr_loc, psgr_dest, psgr_in_taxi, taxi_loc and fuel(F) ∧ gt(F, 0).]
GoTo(fuel) and Refuel behaviours whenever they are applicable. This is not
a well-formed plan, insofar as it could not be produced by the planner, but it is
still executable. The side-effects produced by executing this plan are ignored.
Twenty independent runs were made of each approach, each 5000 trials long.
In approach 2, reflection was focused on a single side-effect: learning when the
fuel ran out. Other side-effects were detected, but ignored for the sake of this
experiment. Data was passed from the actor to the reflector in batches, after
every 500 trials. The training set and example pool sizes were set as follows:
n+train = n−train = 100, n+max = n−max = 1000
The reflector used the default parameters for Aleph, with the following
exceptions:
1. The minimum acceptable accuracy of a clause, minacc, was set to 0.5.
2. The upper bound on layers of new variables, i, was set to 4.
3. The upper bound on the number of literals in a clause, clauselength, was set to 5.
4. The custom-added limitation on time spent learning (see Section 6.5.2), inducetime, was set to 1800 (30 minutes).
In all four approaches the parameters to the reinforcement learning algorithm
are the same: the learning rate α is 0.5, the discount factor γ is 0.95 and the
exploration is recency-based, with a 10% probability of taking an exploratory
action.
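For reference, the backup performed by a tabular Q-learner under these parameters can be sketched as follows. This is a generic sketch, not Rachel's TRQ or P-HSMQ implementation, and plain ε-greedy selection stands in for the recency-based exploration actually used:

```python
import random

# A minimal tabular Q-learning update using the parameters quoted in the
# text: alpha = 0.5, gamma = 0.95, 10% exploratory actions.
ALPHA, GAMMA, EPSILON = 0.5, 0.95, 0.1

def q_update(Q, s, a, reward, s_next, actions):
    """One backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + ALPHA * (reward + GAMMA * best_next - old)

def choose_action(Q, s, actions, rng=random):
    """Epsilon-greedy selection; true recency-based exploration would
    instead prefer actions not tried recently."""
    if rng.random() < EPSILON:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

Q = {}
q_update(Q, "s0", "north", 1.0, "s1", ["north", "south"])
```

After a single rewarded step from an empty table the updated entry is simply α times the reward, 0.5 here.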
After learning, a further 5000 trials were performed for each run, with learn-
ing disabled, to judge the performance of the resulting policy.
Results
Figure 7.11 shows the results of this experiment. It shows the average number
of successful trials in each 100-trial period during learning. Plainly the naive
plan provides a significant advantage over exploring all possible behaviours, but
it falls far short of the performance gained by the hand-crafted hierarchy. This
is borne out by the testing phase, the results of which are shown in Table 7.7.
The runs which used reflection also quickly out-performed the naive plan,
soon after the first side-effect description was induced on the 500th trial. They
quickly converged to policies almost as successful as those using the hand-crafted
hierarchy. The final performance, however, was still not quite as good, with
only 91% success for the trials using reflection, compared to 95% for the hand-
crafted approach. A T-test confirms that this is a significant difference with 99%
confidence.
Figure 7.11: A comparison of learning performance in the Taxi-world, comparing (1) TRQ without reflection, (2) TRQ with reflection, (3) TRQ with a hand-crafted plan, (4) TRQ with all behaviours. Error-bars show one standard deviation. [Plot: success rate (%) against number of trials, 0–5000.]
Table 7.7: The success-rates of final policies learnt in the taxi-world.

Approach         Average Success Rate   Std. dev.
Naive plan       54.78                  4.96
Reflection       91.13                  3.19
Hand-crafted     95.17                  2.31
All behaviours   11.31                  4.61
It is worth examining what kinds of descriptions were learnt by the reflector.
In the 5000 trials of the learning phase, the reflector had an opportunity to
run ten times. In most cases, the result of reflection was only ever adopted
as an improved description of the side-effect three to five times out of the ten.
Figure 7.12 shows the change in accuracy of the side-effect description over these
ten iterations. It shows the average accuracy of the old (maintained) and new
(induced) hypotheses computed on all the examples in the pool. The initial
default description, which assumes the side-effect never happens, rates at only
around 65% accuracy, so it is immediately replaced with a hypothesis based on the
first induction (82% accurate, on average), but after several iterations the newly
created hypotheses cannot better the maintained hypothesis, and the accuracy
converges to approximately 89%.
Figure 7.12: The accuracy of maintained (old) and induced (new) hypotheses for each iteration of the reflector. [Plot: accuracy (%) against batch number, 1–10.]
There is a noticeable change in the hypotheses themselves. The first induced
hypothesis in each run had between 1 and 4 clauses, with a mean of 2.5. The
final hypothesis in each case contained just a single clause of the form:
maintains(S) :-
params(L), distance(L, D), fuel(F), rgte(F, D, ###).
where ### is some constant between 2.2 and 2.6. This clause expresses the fact
that the side-effect will not occur if the amount of fuel remaining is more than
the Manhattan distance to the target location multiplied by a certain factor.
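The learnt clause amounts to a simple numeric test. A sketch of it in Python (the helper names are ours, and the 2.4 factor is illustrative; the actual factor was induced per run, between 2.2 and 2.6):

```python
# Sketch of the learnt maintenance condition: the out-of-fuel side-effect
# is avoided when fuel >= fuel_factor * manhattan_distance(taxi, target).
# Helper names and the default factor are illustrative assumptions.
def manhattan(a, b):
    """Manhattan distance between two grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def maintains(taxi_pos, target_pos, fuel, fuel_factor=2.4):
    """rgte(F, D, R): F >= D * R, i.e. enough fuel to reach the target."""
    return fuel >= fuel_factor * manhattan(taxi_pos, target_pos)
```

For a target 5 grid squares away, 13 units of fuel satisfy the condition while 11 do not, since 2.4 × 5 = 12.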
7.2.3 Experiment 5: Second-order side-effects
In the previous experiment we limited reflection to operate only on the problem
of running out of fuel. The initial “naive” plan does not present us with any other
side-effects, so this may seem like a trivial limitation. However this overlooks the
fact that the plans generated after reflection contain many additional conditions
which are a source of numerous “second-order” side-effects. In this extension of
the above experiment we remove the limitation and allow the reflector to operate
on second-order and later side-effects.
Results
The experimental set up was the same as above. Twenty runs were performed
each consisting of 5000 trials. Of these twenty runs, only five were able to execute
to completion. The other fifteen terminated prematurely, as the planner ran out
of memory due to the size of the plans being generated – in each case involving
over 5000 nodes.
The five cases that did execute to completion were able to do so because
a side-effect description was learnt that prevented any plan whatsoever from
being built. Without exploratory planning to overcome this, the agent reverted to
learning a policy with monolithic Q-learning, with little success.
7.2.4 The bigger taxi world
The small size of the taxi-world in Figure 7.8 means that the GoTo() behaviour
will be learnt very quickly. By the time the reflector has gathered enough ex-
amples of the side-effect to induce a description, the target concept has become
stationary. This more or less negates the need for repeated reflection, as can be
seen from the results of Experiment 4.
To better investigate the interaction between learning behaviours and reflect-
ing on their side-effects, a larger example world is required. For this experiment,
we have scaled up the taxi-world from a 5 × 5 grid to a 25 × 25 grid, shown
in Figure 7.13. The distances between locations are five times greater, so the
initial fuel for the taxi has also been scaled. The full tank now holds a maximum
Figure 7.13: The 25 × 25 taxi-world. [Grid map: the four passenger locations R, G, B and Y, and the fuel bowser F.]
of 80 units of fuel. The taxi starts with a fuel level in the range 40 to 80.
The dynamics of the world are otherwise unchanged, and the same instruments,
actions, fluents and behaviours are used.
In the next three experiments we shall show the effects on reflection caused
by varying the training set size, varying the pool size and doing exploratory
planning, using this larger taxi-world as our test bed.
7.2.5 Experiment 6: The effect of the training set size
The effects of varying the absolute size of the training set for ILP problems are
well established, and this problem does not add any new results worth discussing. The
effect of varying the relative sizes of the positive and negative training sets, on
the other hand, is significant and worth exploring.
Aleph uses coverage as a measure of the goodness of a hypothesis. That is, it
counts the number P of positive examples and the number N of negative examples
from the training set which are covered by a given hypothesis, and uses the
difference P − N as a score for that hypothesis. By varying the sizes of the two
training sets E+train and E−train we can bias this estimate.
In a noisy world, the space of examples is likely to fall into three sets, those
that are definitely positive, those that are definitely negative, and those that
are a mixture of positive and negative (as illustrated in Figure 7.14(a)). When
choosing a hypothesis we are attempting to draw a boundary line, calling every-
thing on one side “positive” and everything on the other side “negative”. This
boundary will presumably lie somewhere in the “mixed” area. A more general
hypothesis will cover more examples in this area, both positive and negative. A
more specific hypothesis will cover fewer examples.
The coverage heuristic says that a more general hypothesis will be chosen if
for every extra negative example it covers, it also covers at least one extra positive
example. Thus it is sensitive to the density of positive and negative examples. If
there are more positive examples overall, then a more general hypothesis will be
chosen (Figure 7.14(b)). On the other hand, if there are more negative examples,
then the more specific hypothesis will score higher (Figure 7.14(c)). This is a
primitive form of cost-sensitive ILP (Srinivasan, 2001b).
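The bias can be seen by recomputing the worked numbers from Figure 7.14. This is a minimal sketch of the coverage score alone, not of Aleph's full clause search:

```python
# The coverage score from the text: positives covered minus negatives covered.
def coverage(pos_covered, neg_covered):
    return pos_covered - neg_covered

# Positives over-represented (Figure 7.14(b)): the general hypothesis wins.
specific_b = coverage(12, 0)   # specific clause: 12 - 0 = 12
general_b = coverage(20, 4)    # general clause:  20 - 4 = 16

# Negatives over-represented (Figure 7.14(c)): the specific hypothesis wins.
specific_c = coverage(6, 0)    # specific clause:  6 - 0 = 6
general_c = coverage(10, 8)    # general clause:  10 - 8 = 2
```

The same boundary choice flips depending only on the relative densities of positive and negative examples in the training set.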
As we discussed in Section 6.6.1, over-specialised side-effect descriptions cause
difficulty for Rachel. If the reflector underestimates the size of the maintenance
region for a particular side-effect then the policy nodes of the plan will be overly
Figure 7.14: ILP in a noisy world, (a) showing an area of mixed positive and negative examples, (b) if positive examples are over-represented, the coverage heuristic favours general hypotheses (20 − 4 = 16 beats 12 − 0 = 12), (c) if negative examples are over-represented, the coverage heuristic favours specific hypotheses (6 − 0 = 6 beats 10 − 8 = 2). [Diagram of example distributions omitted.]
restrictive and the behaviour will not be used as widely as it might be. Thus we
would expect better results if the reflector were biased towards producing more
general results.
To investigate this effect, we conducted experiments in the big taxi world,
varying the size of the negative training set while keeping the positive training
set constant. Twenty runs of 5000 trials were made with each of the following
settings:
1. No reflection
2. No reflection, with the hand-crafted hierarchy in Figure 7.10 above.
3. Reflection with n+train = 500, n−train = 100
4. Reflection with n+train = 500, n−train = 200
5. Reflection with n+train = 500, n−train = 300
6. Reflection with n+train = 500, n−train = 400
7. Reflection with n+train = 500, n−train = 500
Each run used the TRQ algorithm with η = 0.1, α = 0.5, γ = 0.95 and
ε = 0.1. The pool sizes n+max and n−max were both set to 5000. The parameters
for Aleph were the same as in Experiment 4 above, except that the minimum
accuracy was set by the equation:
minacc = (n+train + 1) / (n+train + n−train)
so that the resulting hypothesis must always be more accurate than the default
hypothesis (that the side-effect never happens).
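For reference, the bound evaluates as follows under Experiment 6's settings. The function name is ours, introduced only for illustration:

```python
from fractions import Fraction

# minacc = (n+_train + 1) / (n+_train + n-_train): an accepted clause must
# beat the default hypothesis (that the side-effect never happens), which
# is right on exactly n+_train of the n+_train + n-_train training examples.
def minacc(n_pos, n_neg):
    return Fraction(n_pos + 1, n_pos + n_neg)

# Experiment 6 keeps n+_train fixed at 500 and varies n-_train:
bounds = {n_neg: float(minacc(500, n_neg))
          for n_neg in (100, 200, 300, 400, 500)}
```

With balanced training sets (500/500) the bound is 0.501, just above chance; with a small negative set (500/100) it rises to 0.835.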
With more training examples than previous experiments, there are more
opportunities for Aleph to build clauses that cover only two or three examples.
Each such clause increases the branching factor of the resulting plan, and so the
size of the plan increases dramatically, exhausting available memory. To alleviate
this problem, the reflector only kept the three best clauses from the hypotheses
generated by Aleph, evaluating them in terms of coverage on the whole pool of
examples.
Second-order side-effects were ignored and exploratory planning was not used
for this experiment.
Figure 7.15: The effect of training set size on reflection, showing: (1) "naive" TRQ without reflection, (2) TRQ with a hand-crafted hierarchy, (3) TRQ with reflection using negative training set sizes 100, 200 and 500. Error bars show one standard deviation. [Plot: success rate (%) against number of trials, 0–5000.]
Results
The results of this experiment are shown in Figure 7.15. To avoid cluttering
the graph, the results of approaches 5 and 6 have been omitted, but they follow
the pattern established by those already shown. As in previous experiments,
the hand-crafted hierarchy significantly out-performs the “naive” approach with
more than twice the success rate after 5000 trials.
The approaches using reflection all show a marked change around 2100 tri-
als. This is when sufficient examples were collected for the first batch to be
sent to Aleph. In each case, the reflection resulted in a significant increase in
performance as the agent began to use the Refuel behaviour. At the 3000 trial
mark each reflective approach significantly outperforms the naive approach with
greater than 99% confidence.
However the final results are not so rosy. The naive approach continues
to improve after the reflective approaches have flattened out, and at the 5000
trial mark it is performing comparably to reflective approaches 5 and 6 and
significantly better than approach 7. However reflective approaches 3 and 4
are still performing significantly better. Neither approach, however, does as well
as the hand-crafted approach, which has a significantly greater final success rate
than any other. (All differences are significant with greater than 99% confidence,
according to a two-tailed T-test.)
What is the determining factor that decides the outcome of the different
reflective approaches? Looking at the descriptions learnt by each approach we
see a familiar pattern. Each description contains several clauses, but the one
that covers by far the majority of examples is of the form:
maintains(S) :-
params(L), distance(L, D), fuel(F), rgte(F, D, ###).
where ### is replaced by some constant we shall call the “fuel factor”. It is
this constant that distinguishes the different reflective approaches, as shown in
Table 7.8. The approaches with larger values of n−train chose more specialised
hypotheses with fuel factors around 2.5. Those approaches with smaller values
of n−train chose more general hypotheses with smaller fuel factors.
Table 7.8: The fuel factor for each reflective approach to Experiment 6.

Approach   n+train   n−train   av. fuel factor
3          500       100       1.61
4          500       200       2.18
5          500       300       2.38
6          500       400       2.49
7          500       500       2.65
The larger the fuel factor, the more fuel the agent estimates it will need to
reach its destination. If it does not have enough fuel, it may attempt to go
and refuel, but only if it thinks it can reach the fuel location without running
out of fuel. If the fuel factor is very large, even this appears impossible, and
the plan can provide no further contingencies. The agent resorts to monolithic
Q-learning with its primitive actions, which will take a much longer time to
produce a working policy. This explains why the approaches that produce more
specialised hypotheses perform more poorly in the long run.
7.2.6 Experiment 7: The effect of the pool size
The aim of this experiment is to study the effect of changing the size of the pool
of examples kept by the reflector. Twenty runs of 5000 trials were made with
each of the following settings:
1. n+max = n−max = 2500
2. n+max = n−max = 5000
3. n+max = n−max = 7500
4. n+max = n−max = 10000
5. n+max = n−max = 12500
6. n+max = n−max = 15000
Batches of data were passed from the actor to the reflector every 100 trials.
The training set sizes in each case were kept constant, regardless of pool size:
n+train = 500 and n−train = 100. The other settings for the actor and the reflector
were the same as in approach 7 of Experiment 6 above. Second-order side-effects
were ignored. Exploratory planning was not used for this experiment.
Results
The results of this experiment are shown in Figure 7.16. Plainly reflecting too
early can have a highly detrimental effect on learning. The approach that used
the smallest pool size, collecting only 2500 positive and negative examples,
performed very poorly, much worse than if no reflection had been done at all. All
of the other reflective approaches appear to perform better than the naive
approach, with a general trend towards better results the longer reflection is
delayed. All the reflective approaches with 7500 or more examples in each pool
show final policies significantly better than the naive approach (with 99%
confidence). The degree of improvement slows as the number of examples in-
creases. The results of using pool sizes of 12500 or 15000 examples are roughly
comparable.
Examining the side-effect descriptions produced from the different sized pools
gives us an insight into why the smallest pool size is so catastrophically worse
than the others. The approaches with pools of 5000 examples or more produced
Figure 7.16: The effect of pool size on reflection, showing: (1) "naive" TRQ without reflection, (2) TRQ with reflection using pool sizes ranging from 2500 to 15000. Error bars show one standard deviation. [Plot: success rate (%) against number of trials, 0–5000.]
Table 7.9: The fuel factor for each reflective approach to Experiment 7.

Approach   n+max    n−max    av. fuel factor
1          2500     2500     –
2          5000     5000     2.23
3          7500     7500     2.02
4          10000    10000    1.71
5          12500    12500    1.71
6          15000    15000    1.46
the familiar descriptions comparing distance and fuel as in previous experiments.
The fuel-factors for these approaches are shown in Table 7.9. However the de-
scriptions learnt from the pool of 2500 examples were quite different. In fourteen
of the twenty runs of this approach, the learnt concept was:
maintains(S) :-
psgr_loc(L).
That is, the side effect can be avoided so long as the passenger is at a location
(and not in the taxi). Why is this concept produced? The GoTo() behaviour
is used twice in the naive plan, once to go to the passenger's starting location
and once to go to the passenger's destination. Early on in the learning process
the policy for GoTo() is likely to be so bad that it uses up almost all the fuel
in the first of these two movements. The second one almost always fails. The
simplest way to distinguish between these two cases is to examine the position
of the passenger. There is a high correlation between the passenger being in the
taxi and the taxi running out of fuel. So this description is adopted.
Why is this description so damaging? It says that the passenger must not be
in the taxi in order for the taxi to go to any location without running out of fuel.
This makes delivering the passenger to her destination impossible. The planner
cannot construct a plan which satisfies its goals and maintains this condition,
so the learner reverts to learning a policy in terms of primitive actions, which
reduces its performance enormously. Furthermore, if behaviours are no longer
executed then there is no source of new examples to contradict the poor side-
effect description, so the agent cannot recover from its mistake.
7.2.7 Experiment 8: The effect of exploratory planning
The final experiment in the taxi-world investigates the effects of exploratory
planning. As shown above, reflecting too early can significantly impinge on
performance of the agent. Learning a condition for avoiding a side-effect that
is too specific results in a plan which never uses the affected behaviour. If a
behaviour is never used, then it never produces counter-examples from which
a better condition might be learnt. Exploratory planning allows the agent to
occasionally explore such behaviours, even if their side-effects would make them
inappropriate.
Figure 7.17: The effect of exploratory planning, showing: (1) "naive" TRQ without reflection, (2) TRQ with reflection using an example pool of size 2500 without exploration, (3) TRQ with reflection using an example pool of size 2500 with exploration, (4) TRQ with reflection using an example pool of size 12500 without exploration. Error bars show one standard deviation. [Plot: success rate (%) against number of trials, 0–5000.]
In this experiment we repeated the first case of the previous experiment, with
n+train = n−train = 100 and n+max = n−max = 1000, except with exploratory planning
activated. Once again, we performed 20 runs of 5000 trials, and we compared
the performance against the results from Experiment 7 above.
Results
Figure 7.17 shows the results of this experiment, plotted along with three re-
sults from the previous experiment. The difference between the results with
exploratory planning and without is pronounced. As we saw above, without ex-
ploratory planning reflecting with only 2500 examples in the pool leads to very
poor performance. However with exploratory planning, this same pool size pro-
duces results which are statistically indistinguishable from a pool of five times
the size, and achieves these results much sooner.
The cost is in terms of computational effort. The plans built by exploratory
planning had on average about 900 nodes apiece, only 40 of which were policy
nodes. Not only does it take more time to build these plans, but traversing the
tree to find the active nodes on each timestep takes considerably more time.
7.2.8 Discussion of the taxi-car experiments
Reflection has obvious advantages and disadvantages. As these experiments
show, provided it is not done prematurely and is sufficiently optimistic in its
predictions, it can effectively repair mistakes in plans, resulting in much better
performance. However if it is too pessimistic in its conclusions, either due to a
poor balance of positive and negative examples, or to drawing conclusions too
early, before behaviours have converged, it can easily cut off whole paths of a
plan which may in fact have been worth following. We have shown that this prob-
lem can be alleviated to some degree by exploratory planning and incremental
learning.
The chief problem with reflection is managing the size of the plans it produces.
Each clause produced to describe a side-effect increases the branching factor
of the resulting plan, and plan size tends to grow exponentially. Exploratory
planning increases this further still. This affects both the time it takes to build
plans and the time it takes to execute them. In these experiments we have used
some rather ad-hoc methods to limit this growth by placing arbitrary limits
on plan depth and on the number of clauses in a side-effect description. More
investigation is needed to find more principled ways to trade off plan size against
performance.
7.3 Experiments in the soccer domain
In the final series of experiments we shall apply Rachel to a far more complex
domain: a robot soccer simulation. Our aim is to show how Rachel performs
in a more realistic control task than the simple grid-worlds examined so far.
Furthermore, the soccer domain illustrates the application of multiple levels of
hierarchy, which has not been necessary in earlier examples.
Figure 7.18: The soccer domain.
7.3.1 Domain description
The soccer simulator used for these experiments is the SoccerBots simulation
(Figure 7.18), which is part of the TeamBots package available from http://www.teambots.org/.
TeamBots is a collection of Java applications and packages for multi-agent re-
search, maintained by Tucker Balch. The SoccerBots package is a simulation of
the RoboCup small-size robot league. We shall briefly outline its salient features
below. For full details, consult the documentation available on the web site.
The SoccerBots program simulates a 152.5cm by 274cm field with a goal at
either end. Up to ten robots may be placed on the field. Robots are circular and
12cm in diameter. The simulated robots can move at 0.3 meters/sec and turn
at 360 degrees/sec. Each robot belongs to one of two teams: those going “east”
and those going “west”.
The ball is 40mm in diameter and it can move at a maximum of 0.5 meters/sec
when kicked. It decelerates at 0.1 meters/sec/sec. Ball collisions with walls and
robots are perfectly elastic. To prevent deadlocks, whenever the ball has been
stationary for more than 30 seconds it is moved to a random position on the
field.
For the experiments described below, only three robots were used. Rachel
controlled two eastward-heading robots (with yellow and white markings) with
one stationary opponent randomly placed on the field. Rachel learnt to control
both robots to move the ball into the eastern goal. Note that we are controlling
the two robots with a single agent. The agent is able to observe the state of both
robots, and control the movements of both simultaneously.
7.3.2 Related work
As far as we are aware, this is the first published application of reinforcement
learning to this particular simulator. There have, however, been experiments
done with similar simulations. Stone and Sutton (2001) applied reinforcement
learning to the RoboCup simulation league simulator. In this work the agent
learnt to play a game of “3-vs-2 keep-away”. Three “keeper” robots learnt to keep
the ball away from two “taker” robots. Each player learnt a policy independently,
based on its own state and using its own actions. Actions were drawn from a set
of hand-crafted “skills”, such as “go to ball”, “pass ball” etc.
Wiering, Salustowicz, and Schmidhuber (1999) have also applied multi-agent
reinforcement learning to a custom soccer simulation. In their work, the agents
played 3-on-3 soccer, with each player acting independently but sharing a com-
mon policy. Again, actions were drawn from a set of hand-crafted behaviours.
Two things set this present work apart from these examples. The task we
are attempting is made more difficult by the fact that we wish to learn policies
based on more primitive actions. We shall employ behaviours similar to the skills
described above, but instead of providing them in advance, the agent will have
to learn them from primitive actions such as “move forward”, “turn left”, “turn
right”, etc.
The second difference is in the approach to controlling multiple soccer play-
ers. Multi-agent reinforcement learning is significantly more difficult than single-
agent learning. We shall avoid this problem by treating the pair of soccer players
as a single composite agent.
Table 7.10: Instruments used in the soccer domain.

x(Bot)                          the current X position of Bot
y(Bot)                          the current Y position of Bot
distance(Bot, Object)           the distance from Bot to Object
angle(Bot, Object)              the bearing from Bot to Object
angle(Bot, Object1, Object2)    the difference in bearing between Object1 and Object2, according to Bot
delta(distance(Bot, Object))    the change in distance between Bot and Object since the previous time-step
delta(angle(Bot, Object))       the change in bearing between Bot and Object since the previous time-step
can_kick(Bot)                   1 if Bot can kick the ball, 0 otherwise
7.3.3 Domain representation
Primitive representation
The state of the soccer world is given by the position, heading and velocity of
each of the robots, and the position, heading and velocity of the ball. Each of
these values can be measured in various ways: in absolute coordinates, or
relative to each other, or to another landmark on the field. Which of these
representations is appropriate will vary from behaviour to behaviour. The set of
available instruments in Table 7.10 provides the means to compute most of them.
The x(Bot) and y(Bot) instruments give absolute x,y-coordinates for Bot,
which must be one of the two controlled robots, called bot(0) and bot(1). The
coordinate system is in metres from the centre of the field, which is (0, 0). The
simulator does not provide absolute positioning for the other objects in the world.
It could be computed, but it does not turn out to be useful, so such instruments
are not provided.
The distance(Bot, Object) and angle(Bot, Object) instruments give the
distance and bearing to Object from the point of view of Bot. Object can be
any of the items in Table 7.11, including both objects in the world (such as
the ball) and landmarks (such as the goal). Distances are measured in metres.
Bearings are in degrees between 0 deg and 360 deg. 0 deg corresponds to the
direction the robot is heading. In some cases it is more important to know
the angle between two objects, rather than the absolute bearings of each. The
Table 7.11: Objects in the Soccer domain.

bot(0), bot(1)   The two bots.
teammate         A symbol used by each bot to represent the other.
opponent         The opponent bot.
ball             The ball.
goal(ours)       The goal the bots are guarding.
goal(theirs)     The opponent's goal, towards which the bots will shoot.
angle(Bot, Object1, Object2) instrument provides this information.
In the simulation, the robots’ sensors do not provide information about ve-
locities, so this information must be computed from successive distance() and
angle() values. The delta() instrument takes another instrument as a param-
eter, and returns the difference between the values of the instrument on the
current and the previous timestep.
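A minimal sketch of how such a delta() instrument might be computed by caching the previous reading (illustrative only, not the simulator's API):

```python
# Sketch of a delta() instrument: the simulated sensors report no
# velocities, so the change in a reading between successive timesteps
# is computed by remembering the previous value. Names are illustrative.
class Delta:
    def __init__(self):
        self.prev = None

    def __call__(self, value):
        """Return the change since the last call (0.0 on the first call)."""
        d = 0.0 if self.prev is None else value - self.prev
        self.prev = value
        return d

d_dist = Delta()
d_dist(5.0)         # first reading: no previous value, so delta is 0.0
step = d_dist(4.7)  # the object closed 0.3m since the last timestep
```

One such cache would be kept per wrapped instrument, so that delta(distance(...)) and delta(angle(...)) track their own histories independently.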
Finally there is a special-purpose instrument can kick(Bot) which simply
returns 1 or 0, indicating whether or not the Bot is in a position to kick the ball.
This can only happen if the ball is within 2cm of a “kicking spot” on the front
of the robot.
Control of the robots consists of a command sent every 100ms of simulated
time. Each robot has two effectors, one controlling movement and one controlling
kicking. The movement effector, denoted move(Bot, m) can take one of four
values: m ∈ {forward, left, right, stop}. The kick effector kick(Bot, k) can take one
of two values: k ∈ {true, false}. If a bot does not receive a movement command
on a particular timestep, then it assumes the value stop. If it does not receive
a kick command, it assumes the value false. So there are a total of 64 different
primitive actions (8 possibilities for each robot, yielding 8² = 64 combinations) that
could be executed. For the most part, the behaviours will limit their attention
to a subset of these which control a single robot while the other is stationary.
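The arithmetic of this joint action space can be checked with a short enumeration. The list names below are our own, purely illustrative:

```python
from itertools import product

# Per-robot command values, as described in the text.
MOVES = ["forward", "left", "right", "stop"]
KICKS = [True, False]

# Each robot independently chooses one of 4 moves x 2 kicks = 8 commands.
per_robot = list(product(MOVES, KICKS))
assert len(per_robot) == 8

# A joint primitive action is one command per bot: 8 x 8 = 64 combinations.
joint_actions = list(product(per_robot, per_robot))
assert len(joint_actions) == 64
```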
Symbolic representation
The symbolic representation of the soccer-world describes the layout of the field
in more abstract terms. Six fluents are used, as shown in Table 7.12. They
define such high-level concepts as when a goal has been scored, which robot (if
any) is currently controlling the ball, whether a target is close enough to kick
the ball to it, etc. As with the instruments above, the Object variables in the
fluents can be instantiated with any of the objects in Table 7.11.
Table 7.12: Fluents used in the Soccer domain.
score(Value)                          Value is 1 when we score a goal,
                                      −1 when the opponents score.
controlling ball(Bot)                 True if the Bot is within 30cm of the ball.
recently controlling ball(Bot)        True if controlling ball(Bot) was true
                                      on any of the last 10 timesteps.
near edge(Object)                     True if Object is within 10cm
                                      of the edge of the field.
within kicking distance(Bot, Object)  True if the Bot is within 60cm
                                      of the Object.
lined up(Bot, Object1, Object2)       True if the angle between Object1 and
                                      Object2 is less than 90 degrees.
One special fluent, recently controlling ball(Bot) needs particular ex-
planation. This fluent is true when Bot is controlling the ball and remains true
for ten time-steps thereafter. This is needed to define behaviours in which a bot
deliberately loses control of the ball, such as when it shoots a goal, or passes to an-
other bot. The preconditions of such behaviours use recently controlling ball(Bot)
to allow the bot to kick the ball and then lose control briefly before the ball
reaches its target. This causes a minor violation of the Markov assumption for
these behaviours, but does not prove troublesome.
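The ten-timestep persistence of this fluent can be implemented as a simple countdown, as in the following sketch (class and variable names are ours, not Rachel's):

```python
class StickyFluent:
    """Tracks a boolean condition and keeps reporting True for a fixed
    number of timesteps after the condition last held."""

    def __init__(self, hold_steps=10):
        self.hold_steps = hold_steps
        self.remaining = 0

    def update(self, condition_now):
        if condition_now:
            # Condition holds: reset the countdown.
            self.remaining = self.hold_steps
            return True
        if self.remaining > 0:
            # Condition lapsed recently: still report True while counting down.
            self.remaining -= 1
            return True
        return False

# recently_controlling_ball(bot): true while the bot controls the ball,
# and for 10 timesteps after it loses control.
recently = StickyFluent(hold_steps=10)
assert recently.update(True)          # bot controls the ball
for _ in range(10):
    assert recently.update(False)     # remains true for 10 more steps
assert not recently.update(False)     # then expires
```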
Eleven behaviours are used in the soccer world, at three levels of granularity.
At level 0 is the main task Score (Table 7.13) which has access to the complete
state representation and all primitives. The goal of the behaviour is to kick the
ball into the opponent's goal. It can be applied everywhere, except for a small
set of failure states in which the agent has scored an own-goal.
Three behaviours with granularity 1 define strategic decisions (Table 7.14).
Which bot is going to fetch the ball? Will it then attempt to shoot a goal
itself, or pass the ball to its teammate? The CaptureBall1(Bot), Shoot1(Bot) and
Pass1(FromBot, ToBot) behaviours implement these strategies. Each of these
behaviours controls only one of the two bots, and limits its view to data from
that bot’s viewpoint.
The seven remaining behaviours, with granularity 2, define simple movements
Table 7.13: Granularity 0 behaviours in the soccer domain.

Score
  gran: 0
  view: { x(bot(0)), y(bot(0)), can kick(bot(0)),
          distance(bot(0), goal(ours)), angle(bot(0), goal(ours)),
          distance(bot(0), goal(theirs)), angle(bot(0), goal(theirs)),
          distance(bot(0), teammate), angle(bot(0), teammate),
          distance(bot(0), opponent), angle(bot(0), opponent),
          distance(bot(0), ball), angle(bot(0), ball),
          delta(distance(bot(0), ball)), delta(angle(bot(0), ball)),
          angle(bot(0), ball, goal(theirs)),
          x(bot(1)), y(bot(1)), can kick(bot(1)),
          distance(bot(1), goal(ours)), angle(bot(1), goal(ours)),
          distance(bot(1), goal(theirs)), angle(bot(1), goal(theirs)),
          distance(bot(1), teammate), angle(bot(1), teammate),
          distance(bot(1), opponent), angle(bot(1), opponent),
          distance(bot(1), ball), angle(bot(1), ball),
          delta(distance(bot(1), ball)), delta(angle(bot(1), ball)),
          angle(bot(1), ball, goal(theirs)) }
  pre:  ¬score(−1)
  post: score(1)
  P: {(move(bot(0), m1), kick(bot(0), k1), move(bot(1), m2), kick(bot(1), k2)) |
      m1, m2 ∈ {forward, left, right, stop}, k1, k2 ∈ {true, false}}
around the field (Tables 7.15 & 7.16). Again, each behaviour controls only one
bot, while the other remains stationary. Kicking is only possible in those behaviours
that particularly need it: in ClearBall(Bot) to get the ball off the edge of the
field, in Shoot2(Bot) to shoot a goal and in Pass2(FromBot, ToBot) to pass to
the other bot. The primitive state representation is tailored to each behaviour,
omitting any unnecessary instruments.
Even using a hierarchical decomposition, the soccer domain is a very large
and complex search space. Two additional measures were needed to make
learning possible: function approximation and progress estimation.
Table 7.14: Granularity 1 behaviours in the soccer domain.

CaptureBall1(Bot)
  gran: 1
  view: { x(Bot), y(Bot),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  ¬controlling ball(Bot)
  post: controlling ball(Bot)
  P: {move(Bot, forward), move(Bot, left), move(Bot, right), move(Bot, stop)}

Shoot1(Bot)
  gran: 1
  view: { x(Bot), y(Bot), can kick(Bot),
          distance(Bot, goal(theirs)), angle(Bot, goal(theirs)),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          distance(ball, goal(theirs)), delta(distance(ball, goal(theirs))),
          angle(Bot, ball, goal(theirs)), delta(angle(Bot, ball, goal(theirs))),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  recently controlling ball(Bot)
  post: score(1)
  P: {(move(Bot, m), kick(Bot, k)) |
      m ∈ {forward, left, right, stop}, k ∈ {true, false}}

Pass1(FromBot, ToBot)
  gran: 1
  view: { x(FromBot), y(FromBot), can kick(FromBot),
          x(ToBot), y(ToBot),
          distance(FromBot, ToBot), delta(distance(FromBot, ToBot)),
          angle(FromBot, ToBot), delta(angle(FromBot, ToBot)),
          distance(FromBot, ball), delta(distance(FromBot, ball)),
          angle(FromBot, ball), delta(angle(FromBot, ball)),
          distance(FromBot, opponent), angle(FromBot, opponent) }
  pre:  recently controlling ball(FromBot)
  post: controlling ball(ToBot)
  P: {(move(FromBot, m), kick(FromBot, k)) |
      m ∈ {forward, left, right, stop}, k ∈ {true, false}}
Table 7.15: Granularity 2 behaviours in the soccer domain.

Approach(Bot, Object)
  gran: 2
  view: { x(Bot), y(Bot),
          distance(Bot, Object), delta(distance(Bot, Object)),
          angle(Bot, Object), delta(angle(Bot, Object)),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  ¬within kicking distance(Bot, Object)
  post: within kicking distance(Bot, Object)
  P: {move(Bot, forward), move(Bot, left), move(Bot, right), move(Bot, stop)}

CaptureBall2(Bot)
  gran: 2
  view: { x(Bot), y(Bot),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  within kicking distance(Bot, ball) ∧ ¬controlling ball(Bot)
  post: controlling ball(Bot)
  P: {move(Bot, forward), move(Bot, left), move(Bot, right), move(Bot, stop)}

TurnWithBall(Bot, Object)
  gran: 2
  view: { x(Bot), y(Bot),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          angle(Bot, ball, Object), delta(angle(Bot, ball, Object)),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  within kicking distance(Bot, ball) ∧ ¬lined up(Bot, ball, Object)
  post: lined up(Bot, ball, Object)
  P: {move(Bot, forward), move(Bot, left), move(Bot, right), move(Bot, stop)}

ClearBall(Bot)
  gran: 2
  view: { x(Bot), y(Bot), can kick(Bot),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  controlling ball(Bot) ∧ near edge(ball)
  post: ¬near edge(ball)
  P: {(move(Bot, m), kick(Bot, k)) |
      m ∈ {forward, left, right, stop}, k ∈ {true, false}}
Table 7.16: Granularity 2 behaviours in the soccer domain, cont.

Dribble(Bot, Object)
  gran: 2
  view: { x(Bot), y(Bot),
          distance(Bot, Object), angle(Bot, Object),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          distance(ball, Object), delta(distance(ball, Object)),
          angle(Bot, ball, Object), delta(angle(Bot, ball, Object)),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  controlling ball(Bot) ∧ ¬near edge(ball)
  post: within kicking distance(ball, Object)
  P: {move(Bot, forward), move(Bot, left), move(Bot, right), move(Bot, stop)}

Shoot2(Bot)
  gran: 2
  view: { x(Bot), y(Bot), can kick(Bot),
          distance(Bot, goal(theirs)), angle(Bot, goal(theirs)),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          distance(ball, goal(theirs)), delta(distance(ball, goal(theirs))),
          angle(Bot, ball, goal(theirs)), delta(angle(Bot, ball, goal(theirs))),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  lined up(Bot, ball, goal(theirs)) ∧ controlling ball(Bot)
        ∧ within kicking distance(ball, goal(theirs))
  post: score(1)
  P: {(move(Bot, m), kick(Bot, k)) |
      m ∈ {forward, left, right, stop}, k ∈ {true, false}}

Pass2(FromBot, ToBot)
  gran: 2
  view: { x(FromBot), y(FromBot), can kick(FromBot),
          x(ToBot), y(ToBot),
          distance(FromBot, ToBot), delta(distance(FromBot, ToBot)),
          angle(FromBot, ToBot), delta(angle(FromBot, ToBot)),
          distance(FromBot, ball), delta(distance(FromBot, ball)),
          angle(FromBot, ball), delta(angle(FromBot, ball)),
          distance(FromBot, opponent), angle(FromBot, opponent) }
  pre:  recently controlling ball(FromBot)
        ∧ within kicking distance(ball, ToBot)
  post: controlling ball(ToBot)
  P: {(move(FromBot, m), kick(FromBot, k)) |
      m ∈ {forward, left, right, stop}, k ∈ {true, false}}
Function approximation
Most of the instruments listed in Table 7.3.3 return continuous values. These
must somehow be discretised to form a finite table of Q-values. Furthermore,
even once they are discretised there will be a very large number of discrete table
entries, even for the simplest behaviours. Some form of generalisation is needed
to make learning possible.
We used CMACs (Albus, 1975; Santamaría et al., 1998) for this purpose.
Each behaviour represented its Q-values using a CMAC with one tiling per in-
strument in its view, discretised only along that dimension. That is, if behaviour
B has instruments i1, i2, . . . , ik in its view then:
    Q(s, a) = Σ_{j=1}^{k} Q(j, i_j(s), a)
where Q(j, x, a) is the contribution to the Q-value given by the value of
instrument i_j. Q(j, x, a) is stored as a table based on a discretisation of x. The
discretisations used for each instrument are given in Table 7.17. These values
were updated using gradient-descent.
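A minimal sketch of this per-instrument decomposition and its gradient-descent update follows. The class and variable names are ours, not Rachel's, and the toy discretisers stand in for the real tilings of Table 7.17:

```python
from collections import defaultdict

class OneTilingPerInstrumentQ:
    """Q(s, a) = sum_j Q(j, bin_j(s), a): one 1-D table per instrument,
    each discretised only along its own dimension."""

    def __init__(self, discretisers, alpha=0.1):
        # discretisers: list of functions mapping a state to a bin index.
        self.discretisers = discretisers
        self.alpha = alpha
        self.tables = [defaultdict(float) for _ in discretisers]

    def value(self, state, action):
        return sum(table[(d(state), action)]
                   for d, table in zip(self.discretisers, self.tables))

    def update(self, state, action, target):
        # Gradient descent on the squared error (target - Q)^2: each
        # active table entry moves by (alpha / k) times the error.
        error = target - self.value(state, action)
        step = self.alpha * error / len(self.tables)
        for d, table in zip(self.discretisers, self.tables):
            table[(d(state), action)] += step

# Two toy instruments: x-position and distance-to-ball, coarsely binned.
q = OneTilingPerInstrumentQ(
    discretisers=[lambda s: int(s["x"] * 10), lambda s: int(s["dist"])],
    alpha=0.5)
s = {"x": 0.3, "dist": 1.7}
for _ in range(100):
    q.update(s, "forward", 1.0)
print(q.value(s, "forward"))  # converges towards the target 1.0
```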
Table 7.17: Discretisation of instruments in the soccer domain.
Instrument Discretisation
x(Bot)                        100 equal intervals from −1.5m to 1.5m
y(Bot)                        60 equal intervals from −0.9m to 0.9m
distance(Bot, Object)         115 equal intervals from 0m to 3.45m
angle(Bot, Object)            10 equal intervals from 0 deg to 360 deg
angle(Bot, Object1, Object2)  10 equal intervals from 0 deg to 360 deg
delta(distance(Bot, Object))  3 intervals: equal to 0, less than 0, greater than 0
delta(angle(Bot, Object))     3 intervals: equal to 0, less than 0, greater than 0
can kick(Bot)                 2 discrete values: 0 or 1
Progress estimation
The standard reward function used by Rachel (Equation 4.2) provides no feed-
back on progress towards the goal. The agent receives zero reinforcement until it
reaches its goal. As a result, initial exploration is little more than a random walk
through the state-space until the goal is reached. The size and connectivity of
the soccer domain make this impractical, so some additional element is needed
to encourage actions which get the agent closer to the goal and discourage ac-
tions that move it further away. The reward functions used in the soccer world
are augmented to include a progress estimator (Mataric, 1994) to provide this
information:
    B.r(s, a, s') =  1           if s' ⊨ B.post
                    −1           if s' ⊭ B.post and s' ⊭ B.pre
                     PE(s, s')   otherwise
PE (s, s′) is a function which returns a positive value when s′ is estimated to be
closer to the goal than s and a negative value when it is further away (or zero
otherwise). Its value is constrained to the range (−(1− γ), 1− γ) so that even
an infinite sequence of such rewards cannot have a better Q-value than eventual
success, or a worse Q-value than eventual failure. 1
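This augmented reward function, with the progress estimate clipped to (−(1−γ), 1−γ), can be sketched directly. The boolean arguments below stand in for Rachel's fluent tests, and all names are our own:

```python
GAMMA = 0.95
BOUND = 1.0 - GAMMA  # progress rewards are confined to (-(1-gamma), 1-gamma)

def behaviour_reward(post_holds, pre_holds, progress):
    """Reward for one step of a behaviour: +1 on reaching the postcondition,
    -1 on falling outside both pre- and postconditions (failure), otherwise
    a clipped progress estimate."""
    if post_holds:
        return 1.0
    if not pre_holds:
        return -1.0
    # Clip the progress estimate so that no infinite discounted sum of
    # progress rewards can outweigh eventual success or failure:
    # sum_t gamma^t * (1 - gamma) = 1.
    return max(-BOUND, min(BOUND, progress))

print(behaviour_reward(post_holds=True, pre_holds=False, progress=0.0))   # 1.0
print(behaviour_reward(post_holds=False, pre_holds=False, progress=0.0))  # -1.0
print(behaviour_reward(post_holds=False, pre_holds=True, progress=0.3))   # clipped to 1-gamma
```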
As with behaviours and instruments, Rachel allows progress estimators to take
parameters, so the soccer domain implements only a single progress estimator,
reward approach(Object, Target), which returns a positive reward of (1− γ) if
Object gets closer to Target, and −(1−γ) if it gets further away. The variables
Object and Target are instantiated appropriately for each behaviour, as shown
in Table 7.18.
Table 7.18: Progress estimators used in the soccer domain.
Behaviour Progress estimator
Score                      reward approach(ball, goal(theirs))
CaptureBall1(Bot)          reward approach(Bot, ball)
Shoot1(Bot)                reward approach(ball, goal(theirs))
Pass1(FromBot, ToBot)      reward approach(ball, ToBot)
Approach(Bot, Object)      reward approach(Bot, Object)
CaptureBall2(Bot)          reward approach(Bot, ball)
TurnWithBall(Bot, Object)  none
ClearBall(Bot)             none
Dribble(Bot, Object)       reward approach(ball, Object)
Shoot2(Bot)                reward approach(ball, goal(theirs))
Pass2(FromBot, ToBot)      reward approach(ball, ToBot)
1 It has been pointed out by a reviewer that this progress estimator does not support policy
invariance, as studied in (Ng, Harada, & Russell, 1999). This indeed appears to be the case.
We were unaware of this body of work at the time of writing.
7.3.4 Experiment 9: HSMQ vs P-HSMQ vs TRQ
In the first experiment in the soccer world, we compared three approaches:
1. HSMQ with all applicable behaviours
2. P-HSMQ
3. TRQ with η = 0.1
To further simplify the problem, this first experiment was run with hand-
coded policies for all the granularity 2 behaviours. So the agent’s task is simply
to learn to choose between these behaviours.
Each approach was run ten times, with each run consisting of 1000 consecu-
tive trials. A trial begins with the two bots, the opponent and the ball placed
randomly on the field, and ends when the ball moves into one of the two goals.
An upper limit of 5000 steps was placed on the length of each trial. Trials that
exceeded this limit were counted as failures. This was done because sometimes
the hand-crafted behaviours could get stuck in certain positions, unable to ter-
minate either successfully or unsuccessfully. Tests showed that any trial that
was likely to complete without getting stuck would finish well under the 5000
step limit.
The learning parameters were set as follows: The learning rate α was 0.1.
The discount factor γ was 0.95. Exploration was done in an ε-greedy fashion
with a 1 in 10 chance of the agent choosing an exploratory action (both at the
level of primitive actions, and in the choice of behaviours). Exploratory actions
were chosen randomly.
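The ε-greedy rule used here (a 1-in-10 chance of exploring) can be sketched as follows; the dictionary-based Q-table access is a simplification, and the names are our own:

```python
import random

ALPHA = 0.1    # learning rate
GAMMA = 0.95   # discount factor
EPSILON = 0.1  # 1-in-10 chance of an exploratory action

def epsilon_greedy(q_values, actions, rng=random):
    """Pick the greedy action with probability 1 - EPSILON, otherwise
    a uniformly random one. q_values maps actions to Q estimates."""
    if rng.random() < EPSILON:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values.get(a, 0.0))

rng = random.Random(0)
q = {"forward": 0.4, "left": 0.1, "right": 0.0, "stop": -0.2}
picks = [epsilon_greedy(q, list(q), rng) for _ in range(1000)]
# Roughly 90% + 10%/4 of picks should be the greedy action "forward".
print(picks.count("forward") / 1000)
```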
Results
The results of this experiment can be seen in Graphs 7.19(a) and (b) which show
the average success rate and trial length respectively. From the outset the two
plan based approaches are significantly better than the unplanned approach, and
this is still the case after 1000 trials. The final success rate for both planned
algorithms is greater than 99%, whereas the unplanned approach only achieves
an average of 89%, significantly less (with 99% confidence).
It is worth noting that only 5% of the failures caused by the unplanned ap-
proach were due to it kicking an own goal. The other 95% were due to exceeding
[Graphs: (a) success rate (%) vs. number of trials; (b) trial length (successful
trials only) vs. number of trials. Series: HSMQ w/ all behaviours, P-HSMQ, TRQ.]
Figure 7.19: Learning in the soccer world, with hard-coded behaviours, using
HSMQ, P-HSMQ and TRQ. (Error bars show 1 standard deviation.)
the time limit. On the other hand, neither planned approach ever reached the
5000 step limit.
The second graph, showing trial lengths, plots only the average length of
successful trials, over a 100-trial window. As can be seen, even on only the
successful trials, the unplanned approach takes significantly longer to score a
goal. In the last 100 trials, the average length for the unplanned approach was
906 steps, significantly greater than the corresponding 481.99 for P-HSMQ and
357.65 for TRQ (with 99% confidence). The P-HSMQ approach shows a greater
average trial length than TRQ, but the variance on both results is high, so the
significance of this difference is low.
Perhaps the most notable feature of these graphs is the flatness of the results
for both planned approaches. There appears to be no net change in the quality
of their policies in the whole time they are executed. It appears that once the
planner has eliminated the inappropriate behaviour choices, there is very little
the reinforcement learning algorithms can do to improve on this, at least within
the time frame of the experiment.
7.3.5 Experiment 10: Learning primitive policies
In the previous experiment, the lowest-level behaviours had hand-crafted policies.
Now we shall remove this assistance. In this experiment we shall compare the
two approaches P-HSMQ and TRQ, learning both behaviour choices with the
hierarchy and primitive policies for the behaviours. Each approach was run ten
times, with each run consisting of 1000 consecutive trials. The experimental
set-up and learning parameters were as per the previous experiment.
The unplanned approach was not used for this experiment, as it could not
complete anywhere near the required number of trials in a reasonable amount of
time.
Results
Graphs 7.20(a) and (b) show the results of this experiment. Both approaches
show significant improvement over the 1000 trials, with TRQ reaching a 91%
success rate, and P-HSMQ 86%. A T-test gives this difference a 96% significance.
The TRQ algorithm does appear to do better on the whole, improving earlier and
with less variance than P-HSMQ. On the graph of trial lengths we see a familiar
[Graphs: (a) success rate (%) vs. number of trials; (b) trial length vs. number
of trials. Series: P-HSMQ, TRQ.]
Figure 7.20: Learning in the soccer world, with learnt behaviours, using P-HSMQ
and TRQ. (Error bars show 1 standard deviation.)
pattern – early trials of P-HSMQ are longer and show much greater variance
than those of TRQ, although both approaches converge to comparable values
towards the end.
7.3.6 Experiment 11: Reflection
In the final experiment we shall attempt to apply reflection to the soccer domain,
to see whether it can improve the agent’s performance. For this experiment we
took data from the TRQ approach in the previous experiment. The experiences
from the last 100 trials of each run were passed through the reflector’s side-effect
detection process to determine what side-effects were occurring.
Each run showed over forty different side-effects of varying frequency (listed in
Table 7.19). Of these, one common and potentially interesting effect was the fail-
ure of Dribble(bot(1)) to maintain the condition lined up(bot(1), ball, goal(theirs)).
The reflector was used to learn a maintenance condition for this effect, which
was incorporated into the agent’s plans. The agent was then allowed to run for
a further 100 trials using the new plans, but keeping the Q-values saved from
the previous experiment. This was done ten times, once for each of the ten runs
in the previous experiment. For the sake of comparison, 100 extra trials without
the new plans were also performed.
The reflector was run with a training set of 500 positive and negative exam-
ples, and a pool of the same size2. The default settings for Aleph were used,
with the following exceptions:
1. The minimum acceptable accuracy of a clause, minacc, was set to 0.5.
2. The upper bound on layers of new variables, i was set to 4.
3. The upper bound on the number of literals in a clause, clauselength was
set to 5.
4. The custom-added limitation on time spent learning (see Section 6.5.2),
inducetime was set to 1 hour.
5. Only the single best clause (in terms of coverage) was kept.
2 Experiment 7 would indicate that a larger pool size would be necessary, but this is only
if the agent is learning the behaviours from scratch. The data for this experiment was drawn
from the end of the previous experiment, by which time the behaviours were well established,
so the pool size can be small.
Results
First it is worth noting the sheer number of side-effects that do occur in this
domain. In just 100 trials, forty-three different side-effects were noted by the
reflector, as listed in Table 7.19 in order of frequency. The effect we have chosen
to examine was at the top of this list, occurring 1968 times in all. Just below it
is an identical effect, but involving bot(0) rather than bot(1).
This proliferation of side-effects may at first appear to be a serious problem,
but in practice many of them are irrelevant artifacts of total-order planning.
Consider the plan for Shoot1() for instance, part of which is shown in Figure 7.21.
It involves ordering the behaviours CaptureBall2(), Dribble(), TurnWithBall() and
Shoot2(). When the agent is controlling the ball in the middle of the field, there
are several possible alternatives to choose from, given by different paths in the
plan, but when the agent loses control of the ball there is only one appropriate
behaviour, which is CaptureBall2().
Although it is the only available behaviour in the situation, there are still sev-
eral different nodes which use CaptureBall2(), each with different conditions de-
pending on which behaviours it plans to execute next. If lined up(bot(0), ball, goal(theirs))
is already true, then the plan hopes to maintain it so that TurnWithBall() need
not be executed. If it isn’t true, then that condition is maintained instead. If
CaptureBall2() fails to maintain either condition, a plan execution failure has
occurred, and a side-effect is recorded. But after the side-effect occurs, the
CaptureBall2() behaviour is still the only available behaviour, so execute will
continue until it succeeds. Even if we knew when the side-effect was going to
occur, it would not improve matters.
However the same side-effect on the Dribble() behaviour has more important
consequences. It affects whether we choose to execute TurnWithBall() before
or after Dribble(). If the side-effect is particularly common, then we would be
wasting time using TurnWithBall() before we were close to the goal. In this case,
reflection could potentially prune away this alternative and result in improved
performance.
In practice learning even this side-effect has minimal effect on performance,
and what effect it has is slightly detrimental. The average success rate for the
runs performed without reflection is 87%, with reflection it dropped slightly to
80%. A two-tailed T-test shows this difference to be 97% significant. The average
length of the successful trials did not change significantly: 624 timesteps without
Table 7.19: The side-effects detected in the soccer-world.
Behaviour Affected condition No. occurrences
Dribble()        lined up(⊥1, ball, goal(theirs))                1968
Dribble()        lined up(⊥0, ball, goal(theirs))                1864
Dribble()        ¬lined up(⊥1, ball, goal(theirs))               1432
Dribble()        ¬lined up(⊥0, ball, goal(theirs))               1405
CaptureBall2()   recently controlling ball(Bot)                  1401
ClearBall()      lined up(⊥0, ball, goal(theirs))                 942
ClearBall()      lined up(⊥1, ball, goal(theirs))                 899
CaptureBall2()   lined up(⊥1, ball, goal(theirs))                 767
CaptureBall2()   lined up(⊥0, ball, goal(theirs))                 744
CaptureBall2()   within kicking distance(ball, goal(theirs))      707
CaptureBall2()   ¬near edge(ball)                                 657
CaptureBall2()   near edge(ball)                                  564
TurnWithBall()   controlling ball(⊥0)                             544
TurnWithBall()   controlling ball(⊥1)                             536
ClearBall()      ¬lined up(⊥0, ball, goal(theirs))                506
ClearBall()      ¬lined up(⊥1, ball, goal(theirs))                498
Pass2()          recently controlling ball(Bot)                   417
TurnWithBall()   ¬near edge(ball)                                 372
ClearBall()      controlling ball(⊥1)                             362
TurnWithBall()   within kicking distance(ball, goal(theirs))      360
CaptureBall2()   within kicking distance(ball, ⊥0)                354
TurnWithBall()   near edge(ball)                                  327
ClearBall()      controlling ball(⊥0)                             308
Approach()       ¬near edge(ball)                                 292
Approach()       near edge(ball)                                  252
Approach()       within kicking distance(Bot, Object)             249
CaptureBall2()   within kicking distance(ball, ⊥1)                234
CaptureBall2()   ¬lined up(⊥1, ball, goal(theirs))                222
CaptureBall2()   ¬lined up(⊥0, ball, goal(theirs))                217
ClearBall()      within kicking distance(ball, ⊥0)                183
TurnWithBall()   recently controlling ball(Bot)                   161
Approach()       lined up(⊥1, ball, goal(theirs))                 155
Approach()       lined up(⊥0, ball, goal(theirs))                 146
Approach()       recently controlling ball(⊥0)                    141
Approach()       recently controlling ball(⊥1)                    124
ClearBall()      within kicking distance(ball, ⊥1)                 74
Approach()       ¬controlling ball(Bot)                            60
TurnWithBall()   ¬controlling ball(⊥1)                             55
TurnWithBall()   ¬controlling ball(⊥0)                             48
Approach()       ¬lined up(⊥0, ball, goal(theirs))                 36
Approach()       ¬lined up(⊥1, ball, goal(theirs))                 21
Dribble()        recently controlling ball(Bot)                     2
Approach()       controlling ball(Bot)                              2
[Plan diagram: a lattice of condition nodes over the fluents controlling_ball,
lined_up, within_kicking_distance, near_edge and recently_controlling_ball for
bot(1), linked by the behaviours CaptureBall2(bot(1)), Dribble(bot(1), goal(theirs)),
TurnWithBall(bot(1), goal(theirs)) and Shoot2(bot(1)), leading to score(1).]
Figure 7.21: Part of the plan for Shoot1(bot(1)).
reflection and 680 timesteps with. The variance in these values is so large that
this difference is not significant.
The cause of this drop in performance is not immediately apparent. Analysis
of the events preceding the failures shows that they were all due to the ball going
into the wrong goal, never because of reaching the 5000 step time-out. In 77%
of cases the agent was attempting to clear the ball into the centre of the field
at the time, which is to be expected and is no different to the trials without
reflection.
There is a marked rise in the number of times the agent executed the Shoot1(bot(1))
behaviour directly, being unable to decompose it into a finer granularity be-
haviour. Without reflection this occurs only 96 times; with reflection it occurs
1219 times. This appears to indicate that, on many occasions, learning a de-
scription of the side-effect only served to limit the application of the Dribble()
behaviour without providing an alternative action. This is understandably dis-
advantageous, but it is not clear that it was necessarily the cause for the extra
failures. Of all the failed trials in the run with reflection, only 46% showed this
problem. Likewise 45% of the successful trials also had this same symptom. So
the cause-and-effect relationship is far from clear.
7.3.7 Discussion of the soccer experiments
The soccer domain is an excellent example of the importance of the task-hierarchy
in hierarchical reinforcement learning. In spite of much hand-crafted background
knowledge in the form of behaviours definitions, progress estimators and function
approximators, it is still very difficult to learn well without a good task-hierarchy.
There are many possible behaviours in this world, which can be parameterised
in various ways. Exploring every possibility is not very productive, as we have
seen. Even when the behaviours are hard-coded, learning to choose between
them without a task-hierarchy is difficult. When the behaviours themselves have to be learnt, it
is well and truly impossible, unless some extra background knowledge is provided
to direct the agent towards the appropriate behaviours. This knowledge could
be encoded directly by hand, but we have shown that it can also be presented
in a more flexible form as a symbolic model. As we have seen, both the TRQ
and P-HSMQ algorithms can use such a model to learn successful policies.
This world does, however, highlight one of the weaknesses of planning and
reflection. There are a great many side-effects that occur, and no simple way to
distinguish the important ones from the unimportant ones. Also, many of those
side-effects appear as near-repetitions, very similar to others in the list with only
a few parameters changed. For example, both lined up(bot(0), ball, goal(theirs))
and lined up(bot(1), ball, goal(theirs)) are recorded as being affected by the
Dribble() behaviour. Obviously it would be more advantageous to treat these as
a single side-effect. Rachel’s reflector does not as yet know how to do this.
7.4 Summary
In this chapter we have experimentally verified the claimed advantages of Rachel’s
hybrid architecture over straightforward hierarchical reinforcement learning. We
have seen how a symbolic model of desired behaviours can be built for three dif-
ferent test domains of varying levels of complexity, and how the information in
this model can significantly improve the agent’s ability to learn to perform those
behaviours.
We have also compared the P-HSMQ and TRQ algorithms and demonstrated
that the latter algorithm can produce significantly better policies in domains
where unexpected side-effects are possible, and early termination of behaviours
is desirable. In such domains the TRQ algorithm appears to produce policies sim-
ilar to those produced by standard termination improvement algorithms, without
requiring separate learning and execution phases.
Finally we have investigated the reflection mechanism thoroughly and seen
that while it is able to predict side-effects and plan to avoid them, it is quite
sensitive to variations in training- and test-set sizes. Exploratory planning can
reduce this sensitivity somewhat, but at the expense of producing much larger
plans.
This concludes our investigation of the Rachel architecture. In the final
chapter, we shall summarise the results we have obtained, draw conclusions and
speculate on future extensions to this work.
Chapter 8
Conclusions and Future Work
In this thesis we have presented a hybrid learning system which combines the
symbolic and subsymbolic approaches to artificial intelligence for control. Our
aim in doing so was to capitalise on the strengths of each approach, creating an
agent which can solve more complex tasks than either approach alone.
8.1 Summary of Rachel
Rachel is an architecture for a learning agent, which combines symbolic planning, hierarchical reinforcement learning and inductive logic programming. As with
other hierarchical reinforcement learning algorithms, it aims to learn a collection
of behaviours and combine them to achieve certain goals. However, unlike other
HRL algorithms, it allows the trainer to specify behaviours in terms of abstract
symbolic descriptions in the form of Reinforcement-Learnt Teleo-operators (RL-
Tops). Rachel is then able to interpret these RL-Tops as both operators for
symbolic planning, and also sub-task descriptions for recursively optimal rein-
forcement learning.
The hybrid representation has several advantages:
• Automatic construction of task hierarchies. Having an explicit sym-
bolic model of the purposes of its behaviours, Rachel uses symbolic plan-
ning to determine which behaviours are likely to help it achieve its goals
in a particular state, and which are not. The plan it creates acts as a
task hierarchy for the hierarchical reinforcement learning algorithm, limit-
ing exploration to behaviours that are likely to be useful, and thus making
learning faster. Plans can also include multiple levels of hierarchy, to make
planning and learning easier.
• Choosing optimal paths in plans. Unlike other planning systems,
Rachel does not assume that the shortest plan it produces is the best.
Rather, it searches for all possible paths to the goal and then uses hierar-
chical reinforcement learning to select the optimal one.
• Learning concrete policies for behaviours. Again, unlike other plan-
ning systems, Rachel does not rely on the trainer to provide fully imple-
mented behaviours. Rather, it uses reinforcement learning to learn primi-
tive policies which achieve the behaviours’ goals.
• Intelligent interruption of behaviours. The Teleo-reactive Q-Learning
algorithm uses the knowledge contained in the plan it executes to intelli-
gently interrupt behaviours when they are no longer appropriate. This
results in better policies than comparable HRL algorithms, without the
performance loss of fully reactive execution.
• Diagnosing execution failures. When a behaviour is interrupted, the
plan also provides information about what went wrong. The symbolic
representation allows Rachel to diagnose the nature of the failure and
gather evidence of what caused it.
• Learning how to avoid unwanted side-effects. Based on this evidence,
Rachel’s reflector is able to induce a description of the cause of the side-
effect in a symbolic form which can be incorporated back into the plans,
so that the unwanted effect may be avoided in the future.
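The RL-Top representation underlying these advantages can be pictured as a record pairing symbolic pre- and postconditions with a learnt policy, the postcondition doubling as a reward function. The Python sketch below is purely illustrative; the class and field names are ours, not Rachel's actual interface:

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet

State = FrozenSet[str]  # a state modelled as the set of fluents true in it

@dataclass
class RLTop:
    """A Reinforcement-Learnt Teleo-operator: symbolic purpose, learnt policy."""
    name: str
    pre: State   # fluents that must hold for the behaviour to be applicable
    post: State  # fluents the behaviour is intended to achieve
    q: Dict = field(default_factory=dict)  # Q-table to be filled in by RL

    def applicable(self, state: State) -> bool:
        return self.pre <= state

    def achieved(self, state: State) -> bool:
        return self.post <= state

    def reward(self, state: State) -> float:
        # The symbolic postcondition doubles as the reward function
        # by which the behaviour's policy is learnt.
        return 1.0 if self.achieved(state) else 0.0

go = RLTop("Go(dining,kitchen)",
           pre=frozenset({"location(robot,dining)"}),
           post=frozenset({"location(robot,kitchen)"}))
```

Used as a planning operator, `pre` and `post` drive the symbolic search; used as a learning sub-task, `reward` scores states during reinforcement learning.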
8.2 Summary of Experimental Results
Rachel has been successfully tested in three domains: two simple grid-worlds
and the SoccerBot soccer simulator. For each domain we have built a symbolic
model of the desired behaviours at multiple levels of granularity. We have shown
how plans built from these models can significantly improve the agent’s ability
to learn to perform its required tasks.
The Teleo-reactive Q-Learning algorithm has also been shown to produce
significantly better policies in domains where unexpected side-effects are possible,
and early termination of behaviours is desirable. It is able to reproduce the
advantages of existing termination improvement algorithms, with minimal extra
learning time and without requiring separate learning and execution phases.
The results from the reflection mechanism have been more mixed. Side-effect
detection works effectively, but in a complex domain like the soccer simulator
there can be a great variety of side-effects and distinguishing those that are
important from those that are not is a difficult exercise.
Predicting side-effects can be spectacularly successful when it works and dis-
astrously bad when it goes wrong. Which of these results will occur depends
heavily on the training- and test-set sizes, and on the time it takes for the actor
to learn reasonable policies for behaviours. Reflecting prematurely, while be-
haviours are still being learnt, can lead to bad predictions which in turn prevent
the agent from improving the behaviours. Exploratory planning can relieve this
problem, but at the expense of much greater planning time and larger plan trees.
Nevertheless the results are largely positive. The synthesis of planning and
reinforcement learning has allowed Rachel to learn to behave successfully in a
world as complex as the soccer simulator, which could not have been achieved
with hierarchical reinforcement learning alone, without significant human inter-
vention.
8.3 Future Work
Throughout this work the emphasis has been on building a hybrid tool out
of simple and well-studied building blocks. As such, there is much room for
improvement in each of the components, to incorporate more complex means of
planning, acting and reflecting. Some suggested avenues for further work are
outlined below.
8.3.1 Better Planning
The planner is the most obvious starting point for improvement. The means-ends
planning algorithm is primitive and costly. More modern planning techniques
abound, which could well be adapted to Rachel’s needs.
Independent subgoals
Since Rachel’s planner is universal, it performs poorly in domains which have
multiple sub-goals which can be performed in arbitrary order. As it stands, it
will expand each sequence into a separate path of the plan, duplicating a lot of
effort and building a plan that is bigger than it needs to be. A more intelligent
planner might recognise independent subgoals and build separate sub-trees for
each. Such a planner is used in Trail (Benson, 1996). The coffee-and-book
task, for example, might produce a plan such as that shown in Figure 8.1.
Figure 8.1: A possible plan for fetching the coffee and the book with subgoal splitting. The double line indicates that the second node has been split into three subgoals to be achieved independently.
Partial-order planners such as UCPOP (Weld, 1994) and UMCP (Erol et al., 1994)
might be used to do this kind of decomposition of independent goals. However,
they are not designed to build plans with multiple alternatives (as this is not a
commonly required feature in planning systems).
More expressive plans
Most work in planning is directed at finding a plan which achieves certain goal conditions. Reinforcement learning, on the other hand, recognises that
goals of achievement are only one kind of control task. Other kinds of tasks in-
clude “maintenance” tasks, where a goal condition must be achieved and actively
maintained, and “cyclic” tasks in which the agent cycles through a sequence of
goal states. We are not aware of any existing work in planning to tackle such
tasks, but there is no intrinsic reason why symbolic methods could not be applied
to them.
State-space planning, such as that used by Rachel, is merely a search for
paths in a directed graph. Each node of the graph is a set of fluents; each edge is a behaviour which can be executed when its source node is satisfied, causing its target node to become satisfied. With this model, solving a maintenance task is
simply finding a cycle of nodes in the graph which all satisfy the goal condition.
Solving a cyclic task requires finding a path through the successive goal nodes. If
a planner of this kind could be built, then it could well be applied to building a
HAM-like structure for reinforcement learning. An algorithm like HAMQ, or a
recursively-optimal variant thereof, could be applied to learn policies which fit
this structure.
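The cycle-finding idea could be prototyped as ordinary graph search: restrict the plan graph to the nodes that satisfy the goal condition and look for a cycle within that subgraph. A minimal sketch, in which the dictionary encoding of the graph is our own assumption:

```python
from typing import Dict, List, Optional, Set

def find_maintenance_cycle(graph: Dict[str, List[str]],
                           goal_nodes: Set[str]) -> Optional[List[str]]:
    """Find a cycle lying entirely within the goal-satisfying nodes.

    `graph` maps each plan node to the nodes reachable by one behaviour.
    A maintenance policy corresponds to any cycle whose nodes all satisfy
    the goal condition; we search for one by depth-first search.
    """
    def dfs(node, start, path, visited):
        for nxt in graph.get(node, []):
            if nxt not in goal_nodes:
                continue                      # stay inside the goal subgraph
            if nxt == start:
                return path + [start]         # closed the cycle
            if nxt not in visited:
                found = dfs(nxt, start, path + [nxt], visited | {nxt})
                if found:
                    return found
        return None

    for start in goal_nodes:
        cycle = dfs(start, start, [start], {start})
        if cycle:
            return cycle
    return None

# Nodes a and b both satisfy the goal and form a two-step cycle; c does not.
g = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
```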
On a more ambitious front, recent investigations into planning have attempted to incorporate much more complex languages for expressing plans, including conditional effects (Anderson, Weld, & Smith, 1998), sensing actions
(Weld, Anderson, & Smith, 1998) and even loop structures (Lin & Levesque,
1998). Recent research in hierarchical reinforcement learning has also explored
such complex languages for structuring policies (Lagoudakis & Littman, 2002; Andre & Russell, 2000; Shapiro, Langley, & Shachter, 2001). Potentially these
approaches could be combined to build quite complex structures and learn poli-
cies within them.
8.3.2 Better Acting and Learning
Incorporating other reinforcement learning research
It is common practice among researchers studying reinforcement learning to use
Q-Learning as a starting point for all work, only deviating from this standard
algorithm as far as necessary to implement the particular ideas they propose.
This work is no exception. There is value in this – Q-Learning is a well understood algorithm and serves as a solid baseline – but it means that little is known
about how different variations combine and interact. One worthwhile direction
for further development would be to combine the TRQ algorithm with other
recent developments in reinforcement learning.
In particular, it would be interesting to see how the MAXQ function decom-
position (Dietterich, 2000a) might be applied, and how it affects learning. By
separating the Q-value TRQ assigns to a node into immediate and continuation
values, it may be possible to at least partially reduce the extra learning time
TRQ takes over P-HSMQ.
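One way such a decomposition might look in code is sketched below. This is our own illustration of the general MAXQ idea of splitting a node's value into an in-behaviour component and a completion component; it is neither Dietterich's algorithm nor Rachel's implementation:

```python
from collections import defaultdict

class DecomposedQ:
    """Q(s, a) = V(a, s) + C(s, a): value earned while executing the
    behaviour, plus the completion value earned after it terminates."""

    def __init__(self, alpha=0.1, gamma=0.9):
        self.V = defaultdict(float)  # V[(a, s)]: reward while executing a from s
        self.C = defaultdict(float)  # C[(s, a)]: discounted reward after a ends
        self.alpha, self.gamma = alpha, gamma

    def q(self, s, a):
        return self.V[(a, s)] + self.C[(s, a)]

    def update(self, s, a, internal_reward, s_next, steps, next_actions):
        # Move the in-behaviour value towards the observed internal reward.
        self.V[(a, s)] += self.alpha * (internal_reward - self.V[(a, s)])
        # Move the completion value towards the discounted best continuation.
        best_next = max((self.q(s_next, a2) for a2 in next_actions), default=0.0)
        target = (self.gamma ** steps) * best_next
        self.C[(s, a)] += self.alpha * (target - self.C[(s, a)])
```

Because the completion term is stored separately, a behaviour's internal value `V` can be reused wherever the behaviour appears in the plan, which is the source of the hoped-for saving in learning time.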
Committing to paths in the plan
Another possible avenue for investigation is in the area of commitment to paths in
the plan. When designing P-HSMQ and TRQ we deliberately chose to maximise
reactivity by allowing the agent to choose a new behaviour from all active nodes
in the plan, whenever the executing behaviour terminated. If the plan contained
a path to the goal several behaviours long, then the agent would make a choice at every step of the plan whether to continue with that path or choose another.
A possible alternative would be to design the algorithm so that once the agent
had started on a path it executed it to completion, unless an execution failure
occurred along the way. This is taking behaviour-commitment to the extreme –
committing to a whole sequence of behaviours – and would have all the associated
advantages and disadvantages we have already discussed.
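The committed alternative could be sketched as a control loop which selects a path once and runs it to completion unless a behaviour fails; `choose` and `execute` below are hypothetical stand-ins for the plan interface, not Rachel's actual one:

```python
def run_committed(plan, state, choose, execute, is_goal, max_steps=100):
    """Execute a chosen path to completion unless a behaviour fails.

    Reactive execution (as in TRQ) would call `choose` again after every
    behaviour termination; here we commit to the whole path once selected
    and only return to `choose` on goal failure or path exhaustion.
    """
    for _ in range(max_steps):
        if is_goal(state):
            return state
        path = choose(plan, state)          # pick a whole path, once
        for behaviour in path:
            state, ok = execute(behaviour, state)
            if not ok:                      # execution failure: re-choose
                break
            if is_goal(state):
                return state
    return state
```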
8.3.3 Better Reflecting
The reflector is possibly the most fruitful area for expansion of the Rachel
architecture. We have limited ourselves, in this thesis, to learning only one of
the many possible aspects of the system, in a fairly simple fashion. There are
many possible improvements in both what is learnt, and how it is learnt.
Learning preconditions
In this work we have focused on detecting, diagnosing and learning to predict
side-effects, but these are not the only aspects of the system for which symbolic
models could be learnt. Another possibility is to learn revised preconditions for
behaviours. A behaviour may never learn a policy which successfully achieves
its goals for some or all of its specified application space. A behaviour that con-
sistently fails is as detrimental as a behaviour that consistently causes unwanted
side-effects. In the long run, it is worth the agent’s while to recognise that a
given behaviour isn’t working, and revise its symbolic model of the behaviour to
limit its application to those states in which it works.
Of course, great care would have to be taken to establish that the behaviour
had already been well explored and was not still in the process of improving, or
else learning would be hampered.
Inventing behaviours
Another possible kind of reflection might come from failure in the planning pro-
cess. If the planner builds a plan that covers all but a certain subset of states,
then the reflector might invent a behaviour to fill in the gap. Starting in one of
the uncovered states it could explore randomly until it arrived in a state covered
by the plan. A postcondition could be chosen by finding a set of fluents which
are satisfied by the covered state but not by any of the uncovered states. A
precondition could be induced by generalising over all of the uncovered states.
The new behaviour could then be introduced into plans and a policy learnt as
usual.
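The construction just described can be summarised in a few lines. The function below is our own reading of the proposal, with sets of fluents standing in for states:

```python
def invent_behaviour(uncovered, covered_state):
    """Sketch of the proposed gap-filling reflection.

    uncovered: list of states (sets of fluents) the plan fails to cover.
    covered_state: the plan-covered state that random exploration reached.
    Returns (precondition, postcondition) for the invented behaviour.
    """
    # Postcondition: fluents satisfied by the covered state but by none
    # of the uncovered states.
    post = {f for f in covered_state
            if all(f not in s for s in uncovered)}
    # Precondition: a crude generalisation over the uncovered states --
    # here simply the fluents they all share.
    pre = set.intersection(*map(set, uncovered)) if uncovered else set()
    return pre, post
```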
Deliberate experimentation
In Section 3.4.1 we made a distinction between passive and active learning. Pas-
sive learning is done by observing data from an independent controller. Active
learning deliberately controls the agent in a certain way to produce examples for
learning. Rachel’s reflector is moderately active. Exploratory planning allows
it to deliberately choose to explore behaviours it wouldn’t normally, so as to
generate examples for reflection. However, this decision is still made randomly,
and there is no guarantee that it will generate the kinds of examples that would
be most useful.
A more active approach would be for the agent to do deliberate experimen-
tation – to choose a set of states that it particularly wanted to explore and
actively head for that location in order to test a particular behaviour. Early
ILP systems, such as MIS (Shapiro, 1981), Marvin (Sammut & Banerji, 1986)
and CIGOL (Muggleton & Buntine, 1988), have used the trainer as an oracle to
provide the truth-value for any critical examples that might influence the final
hypothesis. Such systems might be revived for this domain, but rather than
using the trainer as an oracle, the critical examples could be used as fodder for
experimentation.
Another possible reason for experimentation would be to deliberately practice
a behaviour that seems to be performing poorly. Other behaviours could be used
to set up appropriate initial conditions, and then the behaviour could be repeated
several times over to improve its policy.
Incremental ILP
We used the Aleph ILP algorithm for Rachel's reflector because it handled intensional background knowledge and noisy data. However, it is a batch-mode
learner and we had to modify it to learn side-effect descriptions incrementally.
The modifications we made were far from the ideal solution. There is a definite
lack of research into truly incremental ILP. Late in the work we came across the
Hillary algorithm by Iba, Wogulis, and Langley (1988) which is an incremental
ILP algorithm, but no follow-up work appears to have been done on this until the
recent publication of the NILE algorithm (Westendorp, 2003). This is certainly
an area in which much more research could be done.
8.4 Conclusion
The overarching goal of this work was to demonstrate the possibility of reconciling symbolic and statistical approaches to artificial intelligence. The gulf is not too deep; it can be bridged. Rachel is only a small step in that direction,
but I hope it shows that such a resolution is possible, and that as in human
beings, so also in computers, multiple kinds of representations can co-exist and
complement one another. As we expand our horizons to more and more com-
plex tasks, the ability to represent problems at multiple levels of abstraction will
become paramount. Different approaches have different strengths, and it will be
by combining those strengths that artificial intelligence will flourish.
References
Albus, J. S. (1975). A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Transactions of the ASME, 220–227.
Allen, J., Hendler, J., & Tate, A. (Eds.). (1990). Readings in planning. San Mateo, CA: Morgan Kaufmann.
Anderson, C., Weld, D., & Smith, D. (1998). Conditional effects in Graphplan. In Proceedings of the fourth international conference on AI planning systems.
Anderson, J. R. (1976). Language, memory and thought. Hillsdale, NJ: Erlbaum.
Anderson, J. R. (1995). Cognitive psychology and its implications (4th ed.). W. H. Freeman.
Andre, D., & Russell, S. (2002). State abstraction for programmable reinforcement learning agents. In Proceedings of the eighteenth national conference on artificial intelligence.
Andre, D., & Russell, S. J. (2000). Programmable reinforcement learning agents. In Advances in neural information processing systems 12: Proceedings of the 1999 conference (pp. 1019–1025). San Francisco, CA: Morgan Kaufmann.
Aristotle. (350 BC). Politics. Athens.
Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11(1-5), 11–73.
Baxter, J., Tridgell, A., & Weaver, L. (1998). KnightCap: A chess program that learns by combining TD(λ) with game-tree search. In Proceedings of the fifteenth international conference on machine learning (pp. 28–36). San Francisco, CA: Morgan Kaufmann.
Bellman, R. (1957). Dynamic programming. Princeton University Press.
Bellman, R. (1961). Adaptive control processes: A guided tour. Princeton University Press.
Benson, S. (1995). Inductive learning of reactive action models. In Proceedings of the twelfth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Benson, S. (1996). Learning action models for reactive autonomous agents. Unpublished doctoral dissertation, Department of Computer Science, Stanford University.
Benson, S., & Nilsson, N. J. (1994). Reacting, planning and learning in an autonomous agent. In K. Furukawa, D. Michie, & S. Muggleton (Eds.), Machine intelligence 14. Oxford: The Clarendon Press.
Bertsekas, D. P. (1987). Dynamic programming: Deterministic and stochastic models. Englewood Cliffs, NJ: Prentice-Hall.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.
Bobrow, D. G., & Winograd, T. (1977). An overview of KRL, a knowledge representation language. Cognitive Science, 1, 3–46.
Boutilier, C., Dean, T., & Hanks, S. (1999). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11, 1–94.
Brooks, R. A. (1986). A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, RA-2(1), 14–23.
Carbonell, J. G. (1984). Learning by analogy: Formulating and generalizing plans from past experience. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (pp. 137–161). Berlin, Heidelberg: Springer.
Churchland, P. M. (1990). On the nature of theories: A neurocomputational perspective. In C. W. Savage (Ed.), Scientific theories: Minnesota studies in the philosophy of science (Vol. 14). Minneapolis: University of Minnesota Press.
Cohen, N. J., & Squire, L. R. (1980). Preserved learning and retention of pattern analysing skills in amnesia: Dissociation of knowing how and knowing what. Science, 210, 207–210.
Dayan, P., & Hinton, G. E. (1992). Feudal reinforcement learning. Advances in Neural Information Processing Systems, 5, 271–278.
desJardins, M. (1994). Knowledge development methods for planning systems. In Planning and learning: On to real applications: Papers from the 1994 AAAI fall symposium (pp. 34–40). Menlo Park, CA: AAAI Press.
Dietterich, T. G. (1996). Machine learning. ACM Computing Surveys, 28(4es), 3.
Dietterich, T. G. (1998). The MAXQ method for hierarchical reinforcement learning. In Proceedings of the fifteenth international conference on machine learning (pp. 118–126). San Francisco, CA: Morgan Kaufmann.
Dietterich, T. G. (2000a). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227–303.
Dietterich, T. G. (2000b). An overview of MAXQ hierarchical reinforcement learning. In B. Y. Choueiry & T. Walsh (Eds.), Proceedings of the symposium on abstraction, reformulation and approximation SARA 2000, Lecture notes in artificial intelligence (pp. 26–44). New York, NY: Springer Verlag.
Dreyfus, H. L. (1979). What computers can't do: A critique of artificial reason (2nd ed.). New York: Harper and Row.
Dzeroski, S., Raedt, L. D., & Blockeel, H. (1998). Relational reinforcement learning. In Proceedings of the fifteenth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Erol, K., Hendler, J. A., & Nau, D. S. (1994). UMCP: A sound and complete procedure for hierarchical task-network planning. In Artificial intelligence planning systems (pp. 249–254).
Fikes, R. E. (1971). Monitored execution of robot plans produced by STRIPS. In Information processing 71, Proceedings of the IFIP congress (Vol. 1, pp. 189–194).
Fikes, R. E., Hart, P. E., & Nilsson, N. J. (1972). Learning and executing generalized robot plans. Artificial Intelligence, 3, 251–288.
Fikes, R. E., & Nilsson, N. J. (1971). STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2, 189–208.
Georgeff, M. P. (1987). Planning. In Annual review of computing science (Vol. 2, pp. 359–400). Annual Reviews Inc.
Ghallab, M., & Milani, A. (Eds.). (1996). New directions in AI planning. Amsterdam, Netherlands: IOS Press.
Gil, Y. (1994). Learning by experimentation: Incremental refinement of incomplete planning domains. In Proceedings of the eleventh international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Harnad, S. (1990). The symbol grounding problem. Physica D, 42, 335–346.
Haugeland, J. (Ed.). (1997). Mind design II. Cambridge, MA: Bradford/MIT Press.
Hauskrecht, M., Meuleau, N., Kaelbling, L. P., Dean, T., & Boutilier, C. (1998). Hierarchical solution of Markov decision processes using macro-actions. In Uncertainty in artificial intelligence (pp. 220–229).
Hayes, P. J. (1973). The frame problem and related problems in artificial intelligence. In Artificial and human thinking (pp. 45–59). Jossey-Bass Inc. and Elsevier Scientific Pub. Co.
Hengst, B. (2002). Discovering hierarchy in reinforcement learning with HEXQ. In Proceedings of the nineteenth international conference on machine learning (pp. 243–250). San Francisco, CA: Morgan Kaufmann.
Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge, MA: The MIT Press.
Hume, D., & Sammut, C. (1991). Using inverse resolution to learn relations from experiments. In Proceedings of the eighth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Iba, G. A. (1989). A heuristic approach to the discovery of macro-operators. Machine Learning, 3, 285–317.
Iba, W., Wogulis, J., & Langley, P. (1988). Trading off simplicity and coverage in incremental concept learning. In Proceedings of the fifth international conference on machine learning (pp. 73–79). Ann Arbor, MI: Morgan Kaufmann.
Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. In Advances in neural information processing systems (Vol. 6). The MIT Press.
Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the tenth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Kaelbling, L. P., Littman, M. L., & Moore, A. P. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Kaelbling, L. P., & Rosenschein, S. J. (1990). Action and planning in embedded agents. In Robotics and autonomous systems (Vol. 6, pp. 35–48).
Knoblock, C. A. (1991). Search reduction in hierarchical problem solving. In Proceedings of the ninth national conference on artificial intelligence (AAAI-91) (Vol. 2, pp. 686–691). Anaheim, CA: AAAI Press/MIT Press.
Koenig, S., & Simmons, R. G. (1993). Complexity analysis of real-time reinforcement learning. In National conference on artificial intelligence (pp. 99–107).
Korf, R. E. (1985). Macro-operators: A weak method for learning. Artificial Intelligence, 26, 35–77.
Korf, R. E. (1987). Planning as search: A quantitative approach. Artificial Intelligence, 33(1), 65–88.
Lagoudakis, M., & Littman, M. (2002). Algorithm selection using reinforcement learning. In Proceedings of the nineteenth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Laird, J. E., Rosenbloom, P. S., & Newell, A. (1986). Chunking in Soar: The anatomy of a general learning mechanism. Machine Learning, 1(1), 11–46.
Lavrac, N., & Dzeroski, S. (1994). Inductive logic programming: Techniques and applications. Ellis Horwood.
Lin, F., & Levesque, H. J. (1998). What robots can do: Robot programs and effective achievability. Artificial Intelligence, 101(1-2), 201–226.
Lin, L.-J. (1993). Reinforcement learning for robots using neural networks. Unpublished doctoral dissertation, School of Computer Science, Carnegie Mellon University.
Lorenzo, D., & Otero, R. P. (2000). Using ILP to learn logic programs for reasoning about actions. In Proceedings of the tenth international conference on inductive logic programming.
Maclin, R., & Shavlik, J. W. (1996). Creating advice-taking reinforcement learners. Machine Learning, 22, 251–282.
Maclin, R. F. (1995). Learning from instruction and experience: Methods for incorporating procedural domain theories into knowledge-based neural networks. Unpublished doctoral dissertation, University of Wisconsin-Madison.
Maes, P. (1990). How to do the right thing. Connection Science Journal, Special Issue on Hybrid Systems, 1.
Mahadevan, S., Khaleeli, N., & Marchalleck, N. (1997). Designing agent controllers using discrete-event Markov models. In Working notes of the AAAI fall symposium on model-directed autonomous systems. Cambridge, MA.
Mataric, M. J. (1994). Reward functions for accelerated learning. In Proceedings of the eleventh international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Mataric, M. J. (1996). Behaviour based control: Examples from navigation, learning and group behaviour. Journal of Experimental and Theoretical Artificial Intelligence, 9(2-3).
McCarthy, J., & Hayes, P. J. (1969). Some philosophical problems from the standpoint of artificial intelligence. In Machine intelligence (Vol. 4, pp. 463–502).
McGovern, A., & Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the eighteenth international conference on machine learning (pp. 361–368). San Francisco, CA: Morgan Kaufmann.
Minsky, M. (1974). A framework for representing knowledge (Tech. Rep. No. Memo 306). MIT AI Lab.
Minton, S. (1988). Learning search control knowledge: An explanation-based approach. Dordrecht: Kluwer Academic Publishers.
Mitchell, T. M., Utgoff, P. E., & Banerji, R. (1984). Learning by experimentation: Acquiring and refining problem-solving heuristics. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (pp. 163–190). Berlin, Heidelberg: Springer.
Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 13(3-4), 245–286.
Muggleton, S., & Buntine, W. (1988). Machine invention of first-order predicates by inverting resolution. In Proceedings of the fifth international conference on machine learning (pp. 167–192). Ann Arbor, MI: Morgan Kaufmann.
Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice-Hall.
Newell, A., & Simon, H. A. (1976). Computer science as empirical inquiry: Symbols and search. In Communications of the Association for Computing Machinery (Vol. 19, pp. 113–126).
Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the sixteenth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Nilsson, N. J. (1994). Teleo-reactive programs for agent control. Journal of Artificial Intelligence Research, 1, 139–158.
Oates, T., & Cohen, P. R. (1996). Searching for planning operators with context-dependent and probabilistic effects. In H. Shrobe & T. Senator (Eds.), Proceedings of the thirteenth national conference on artificial intelligence and the eighth innovative applications of artificial intelligence conference, Vol. 2 (pp. 865–868). Menlo Park, CA: AAAI Press.
Papavassiliou, V. A., & Russell, S. J. (1999). Convergence of reinforcement learning with general function approximators. In Proceedings of the seventeenth international joint conference on artificial intelligence. San Francisco, CA: Morgan Kaufmann.
Parr, R. (1998). Hierarchical control and learning for Markov decision processes. Unpublished doctoral dissertation, University of California at Berkeley.
Parr, R., & Russell, S. (1998). Reinforcement learning with hierarchies of machines. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems (Vol. 10). The MIT Press.
Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York, NY: John Wiley & Sons, Inc.
Reiter, R. (1987). A logic for default reasoning. In M. L. Ginsberg (Ed.), Readings in nonmonotonic reasoning (pp. 68–93). Los Altos, CA: Morgan Kaufmann.
Rosenberg, J. F. (1990). Connectionism and cognition. In Acta analytica (pp. 33–36). Dubrovnik.
Rosenschein, S. J. (1981). Plan synthesis: A logical perspective. In Proceedings of the seventh international joint conference on artificial intelligence (pp. 331–337). San Francisco, CA: Morgan Kaufmann.
Rumelhart, D. E. (1989). The architecture of mind: A connectionist approach. In M. I. Posner (Ed.), Foundations of cognitive science. Cambridge, MA: Bradford/MIT Press.
Rummery, G. A., & Niranjan, M. (1994). Online Q-learning using connec-tionist systems (Tech. Rep. No. CUED/F-INFENG/TR 166). CambridgeUniversity Engineering Department.
Sacerdoti, E. D. (1973). Planning in a hierarchy of abstraction spaces. In Pro-ceedings of the third international joint conference on artificial intelligence.San Franciso, CA: Morgan Kaufmann.
Sacerdoti, E. D. (1974). Planning in a hierarchy of abstraction spaces. ArtificialIntelligence, 5 (2), 115-135.
Sacerdoti, E. D. (1977). A structure for plans and behaviour. Amsterdam,London, New York: Elsevier/North-Holland.
Sammut, C., & Banerji, R. (1986). Learning concepts by asking questions.In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machinelearning: An artificial intelligence approach (Vol. 2, p. 167-192). Los Altos,CA: Morgan Kaufmann.
Santamarıa, J. C., Sutton, R. S., & Ram, A. (1998). Experiments with rein-forcement learning in problems with continuous state and action spaces.Adaptive Behaviour, 6 (2).
Schoppers, M. (1987). Universal plans for reactive robots in unpredicatble sys-tems. In Proceedings of the tenth international joint conference on artificialintelligence. San Franciso, CA: Morgan Kaufmann.
Shapiro, A. D. (1987). Structured induction in expert systems. Addison Wesley,London.
Shapiro, D., Langley, P., & Shachter, R. (2001). Using background knowledgeto speed reinforcement learning in physical agents. Proceedings of the FifthInternational Conference on Autonomous Agents, 254-261.
Shapiro, E. (1981). An algorithm that infers theories from facts. In Proceedingsof the seventh international joint conference on artificial intelligence (p.446-452). San Franciso, CA: Morgan Kaufmann.
Shapiro, E. Y. (1981). Inductive inference of theories from facts (Tech. Rep. No.192). Dept of CS, Yale University.
Shen, W.-M. (1993). Discovery as autonomous learning from the environment. Machine Learning, 12.
Shen, W.-M. (1994). Autonomous learning from the environment. W. H. Freeman/Computer Science Press.
Shoham, Y., & McDermott, D. V. (1988). Problems in formal temporal reasoning. Artificial Intelligence, 36 (1), 49–61.
Singh, S. P. (1992). Reinforcement learning with a hierarchy of abstract models. In Proceedings of the tenth national conference on artificial intelligence. Cambridge, MA: MIT Press.
Singh, S. P., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38 (3), 287–308.
Smolensky, P. (1989). Connectionist modeling: Neural computation / mental connections. In L. Nadel, L. A. Cooper, P. Culicover, & R. M. Harnish (Eds.), Neural connections, mental computation. Cambridge, MA: Bradford/MIT Press.
Srinivasan, A. (2001a). The Aleph manual. http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/.
Srinivasan, A. (2001b). Extracting context-sensitive models in inductive logic programming. Machine Learning, 44 (3), 301–324.
Stone, P., & Sutton, R. S. (2001). Scaling reinforcement learning toward RoboCup soccer. In Proceedings of the eighteenth international conference on machine learning (pp. 537–544). San Francisco, CA: Morgan Kaufmann.
Sutton, R. S. (1987). Implementation details for the TD(λ) procedure for the case of vector predictions and backpropagation (Tech. Rep. No. TN87-509.1). GTE Laboratories.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S. (1990). Integrated architectures for learning, planning and reacting based on approximating dynamic programming. In Proceedings of the seventh international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Sutton, R. S. (1995). Generalisation in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems.
Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems (Vol. 8, pp. 1038–1044). The MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: The MIT Press.
Sutton, R. S., Precup, D., & Singh, S. P. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112 (1-2), 181–211.
Sutton, R. S., Singh, S., Precup, D., & Ravindran, B. (1999). Improved switching among temporally abstract actions. In Advances in neural information processing systems (Vol. 11). The MIT Press.
Tate, A. (1975). Using goal structure to direct search in a problem solver. Unpublished doctoral dissertation, University of Edinburgh.
Taylor, K. (1996). Autonomous learning by incremental induction and revision. Unpublished doctoral dissertation, Australian National University.
Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215–219.
Thrun, S. B. (1992). The role of exploration in learning control. In D. White & D. Sofge (Eds.), Handbook of intelligent control: Neural, fuzzy and adaptive approaches. Van Nostrand Reinhold.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning.Machine Learning, 16 (3).
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. In Advances in neural information processing systems 9: Proceedings of the 1996 conference. San Francisco, CA: Morgan Kaufmann.
Wang, X. (1996). Planning while learning operators. In B. Drabble (Ed.), Proceedings of the 3rd international conference on artificial intelligence planning systems (AIPS-96) (pp. 229–236). AAAI Press.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Unpublished doctoral dissertation, King's College, Cambridge, England.
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8 (3), 279–292.
Weld, D. S. (1994). An introduction to least commitment planning. AI Magazine, 15 (4), 27–61.
Weld, D. S., Anderson, C. R., & Smith, D. E. (1998). Extending Graphplan to handle uncertainty and sensing actions. In AAAI/IAAI (pp. 897–904).
Westendorp, J. H. (2003). Noise-resistant incremental relational learning using possible worlds. In S. Matwin & C. Sammut (Eds.), ILP 2002 (Vol. 2583, pp. 317–332). Springer-Verlag.
Wiering, M., Salustowicz, R., & Schmidhuber, J. (1999). Reinforcement learning soccer teams with incomplete world models. Autonomous Robots.
Winograd, T. (1972). Understanding natural language. Cognitive Psychology, 1, 1–191.
Zhang, W., & Dietterich, T. G. (1995). A reinforcement learning approach to job-shop scheduling. In Proceedings of the fourteenth international joint conference on artificial intelligence. San Francisco, CA: Morgan Kaufmann.
Ziemke, T. (1997). Rethinking grounding. In Proceedings of new trends in cognitive science: Does representation need reality. Vienna.