The University of New South Wales
School of Computer Science and Engineering
Hierarchical Reinforcement Learning: A Hybrid Approach
Malcolm Ross Kinsella Ryan
A Thesis submitted as a requirement for the Degree of
Doctor of Philosophy
September 2002
Supervisor: Prof. Claude Sammut
Abstract
In this thesis we investigate the relationships between the symbolic and sub-
symbolic methods used for controlling agents by artificial intelligence, focusing
in particular on methods that learn. In light of the strengths and weaknesses of
each approach, we propose a hybridisation of symbolic and subsymbolic methods
to capitalise on the best features of each.
We implement such a hybrid system, called Rachel, which incorporates
techniques from Teleo-Reactive Planning, Hierarchical Reinforcement Learning
and Inductive Logic Programming. Rachel uses a novel representation of be-
haviours, Reinforcement-Learnt Teleo-operators (RL-Tops), which defines the
behaviour in terms of its desired consequences but leaves the implementation of
the policy to be learnt by reinforcement learning. An RL-Top is an abstract,
symbolic description of the purpose of a behaviour, and is used by Rachel both
as a planning operator and as the definition of a reward function by which the
behaviour can be learnt.
Two new hierarchical reinforcement learning algorithms are introduced: Planned
Hierarchical Semi-Markov Q-Learning (P-HSMQ) and Teleo-Reactive Q-Learning
(TRQ). The former is an extension of the Hierarchical Semi-Markov Q-Learning
algorithm to use computer-generated plans in place of task-hierarchies (which
are commonly provided by the trainer). The latter is a further elaboration of
the algorithm to include more intelligent behaviour termination. The knowl-
edge contained in the plan is used to determine when an executing behaviour
is no longer appropriate, and can be prematurely terminated, resulting in more
efficient policies.
Incomplete descriptions of the effects of behaviours can lead the planner
to make false assumptions in building plans. Because behaviours are learnt
rather than hand-coded, not every effect of an action can be known in advance. Rachel
implements a “reflector” which monitors for such unexpected and unwanted side-
effects. Using ILP it learns to predict when they will occur, and so repair its
plans to avoid them.
Together, the components of Rachel form a learning system which is able
to receive abstract descriptions of behaviours, build plans to discover which of
them may be useful to achieve its goals, learn concrete policies and optimal
choices of behaviour through trial and error, discover and predict any unwanted
side-effects that result and repair its plans to avoid them. It is a demonstration
that different approaches to AI, symbolic and sub-symbolic, can be elegantly
combined into a single agent architecture.
Declaration
I hereby declare that this submission is my own work and to the best of my knowledge it contains no material previously published or written by another person, nor material which to a substantial extent has been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis.
I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Malcolm Ross Kinsella Ryan
Contents
1 Introduction
  1.1 Picture this:
  1.2 Human intelligence: A Cognitive Model
    1.2.1 Types of Knowledge
    1.2.2 Interaction between the types
  1.3 Declarative and Procedural Knowledge in Artificial Intelligence
    1.3.1 Declarative and Procedural Knowledge in Control
  1.4 A Hybrid approach
    1.4.1 Putting declarative knowledge into reinforcement learning
    1.4.2 Putting procedural knowledge into symbolic planning
    1.4.3 Getting declarative knowledge out of procedural learning
    1.4.4 Rachel: A hybrid planning/reinforcement learning system
  1.5 Contributions of this thesis
  1.6 Overview

2 Background - Reinforcement Learning
  2.1 Reinforcement Learning
    2.1.1 The Reinforcement Learning Model
    2.1.2 Markov Decision Processes
    2.1.3 Q-Learning
    2.1.4 The curse of dimensionality
  2.2 Hierarchical Reinforcement Learning in Theory
    2.2.1 A Motivating Example
    2.2.2 Limiting the Agent’s Choices
    2.2.3 Providing Local Goals
    2.2.4 Semi-Markov Decision Processes: A Theoretical Framework
    2.2.5 Learning behaviours
  2.3 Hierarchical Reinforcement Learning in Practice
    2.3.1 Semi-Markov Q-Learning
    2.3.2 Hierarchical Semi-Markov Q-Learning
    2.3.3 MAXQ-Q
    2.3.4 Q-Learning with Hierarchies of Abstract Machines
  2.4 Termination Improvement
  2.5 Producing the hierarchy
  2.6 Other work
    2.6.1 Model-based Reinforcement Learning
    2.6.2 Other hybrid learning algorithms
  2.7 Reinforcement learning in this thesis
  2.8 Summary

3 Background - Symbolic Planning
  3.1 The symbolic planning model
  3.2 Building Plans
    3.2.1 The Strips representation
    3.2.2 Means-ends planning
    3.2.3 Extensions to the Strips representation
  3.3 Handling Incomplete Action Models
  3.4 Learning Action Models
    3.4.1 Generating data
    3.4.2 What is learnt
    3.4.3 How to learn
  3.5 Inductive Logic Programming
  3.6 Other related work
    3.6.1 Explanation Based Learning
  3.7 Planning in this thesis
  3.8 Summary

4 A Hybrid Representation
  4.1 Representing states
    4.1.1 Instruments: Representing primitive state
    4.1.2 Fluents: Representing abstract state
  4.2 Representing goals
  4.3 Representing actions
    4.3.1 Reinforcement-Learnt Teleo-operators
    4.3.2 State abstraction
    4.3.3 Parameterised Behaviours
    4.3.4 Hierarchies of behaviours
  4.4 Summary

5 Rachel: Planning and Acting
  5.1 The Planner
    5.1.1 Semi-universal planning
    5.1.2 Variable binding
    5.1.3 The planning algorithm
    5.1.4 Computational complexity
  5.2 The Actor
    5.2.1 Planned Hierarchical Semi-Markov Q-Learning
    5.2.2 Termination Improvement
    5.2.3 Teleo-Reactive Q-Learning
  5.3 Multiple levels of hierarchy
    5.3.1 Hierarchical planning
    5.3.2 Hierarchical learning: P-HSMQ
    5.3.3 Hierarchical learning: TRQ
  5.4 Summary

6 Rachel: Reflection
  6.1 The Frame Assumption
  6.2 Example Domain - Taxi-world
  6.3 Detecting and identifying side-effects
    6.3.1 Plan Execution Failures
    6.3.2 Diagnosing the failure
  6.4 Gathering examples
  6.5 Inducing a description
    6.5.1 Input to Aleph
    6.5.2 Modifications to Aleph
    6.5.3 Output from Aleph
    6.5.4 Adding Incrementality
  6.6 Incorporating side-effects into plans
    6.6.1 Exploratory planning
  6.7 Summary

7 Experiment Results
  7.1 Experiments in the gridworld domain
    7.1.1 Domain description
    7.1.2 Experiment 1: Planning vs HSMQ vs P-HSMQ
    7.1.3 Experiment 2: P-HSMQ vs TRQ
    7.1.4 Experiment 3: The effect of the η parameter
    7.1.5 Discussion of the gridworld experiments
  7.2 Experiments in the taxi-car domain
    7.2.1 Domain description
    7.2.2 Experiment 4: Reflection
    7.2.3 Experiment 5: Second-order side-effects
    7.2.4 The bigger taxi world
    7.2.5 Experiment 6: The effect of the training set size
    7.2.6 Experiment 7: The effect of the pool size
    7.2.7 Experiment 8: The effect of exploratory planning
    7.2.8 Discussion of the taxi-car experiments
  7.3 Experiments in the soccer domain
    7.3.1 Domain description
    7.3.2 Related work
    7.3.3 Domain representation
    7.3.4 Experiment 9: HSMQ vs P-HSMQ vs TRQ
    7.3.5 Experiment 10: Learning primitive policies
    7.3.6 Experiment 11: Reflection
    7.3.7 Discussion of the soccer experiments
  7.4 Summary

8 Conclusions and Future Work
  8.1 Summary of Rachel
  8.2 Summary of Experimental results
  8.3 Future Work
    8.3.1 Better Planning
    8.3.2 Better Acting and Learning
    8.3.3 Better Reflecting
  8.4 Conclusion

References
List of Figures
1.1 The three parts of the Rachel architecture.
2.1 An illustration of a reinforcement-learning agent.
2.2 An example world
2.3 Two different internal policies for the behaviour Go(hall, bedroom2).
2.4 A simple navigation task illustrating the advantage of termination improvement.
3.1 An illustration of a planning agent.
3.2 The example world again
3.3 A plan for fetching the coffee.
3.4 A universal plan for fetching the coffee.
5.1 The three parts of the Rachel architecture.
5.2 Two linear plans to fetch both the book and the coffee.
5.3 A semi-universal plan to fetch both the book and the coffee.
5.4 The example world with a bump.
5.5 A plan for fetching the coffee and the book.
5.6 A plan for fetching either the coffee or the book.
5.7 A narrow bridge over a chasm.
6.1 The Taxi-Car Domain.
6.2 A plan for the Deliver behaviour in the Taxi world.
6.3 A plan for the Deliver behaviour which avoids running out of fuel.
6.4 A plan for the Deliver behaviour using exploratory planning.
7.1 The first experimental domain - the Grid-world.
7.2 A comparison of learning rates for four approaches to the grid-world problem.
7.3 Average trial lengths for Experiment 1.
7.4 A comparison of learning rates for TRQ and P-HSMQ
7.5 Final policy performance for Experiment 2.
7.6 A comparison of learning rates for TRQ with different values of η.
7.7 Convergence times for TRQ with different values of η.
7.8 The second experimental domain - The Taxi-Car.
7.9 A “naive” plan for the Deliver behaviour in the Taxi world.
7.10 A hand-crafted plan which adds refueling to the naive plan.
7.11 A comparison of learning performance in the Taxi-world.
7.12 The accuracy of maintained (old) and induced (new) hypotheses for each iteration of the reflector.
7.13 The 25 × 25 taxi-world
7.14 ILP in a noisy world.
7.15 The effect of training set size on reflection.
7.16 The effect of pool size on reflection.
7.17 The effect of exploratory planning.
7.18 The soccer domain.
7.19 Learning in the soccer-world, with hard-coded behaviours.
7.20 Learning in the soccer-world, with learnt behaviours.
7.21 Part of the plan for Shoot1(bot(1)).
8.1 A plan with subgoal splitting.
List of Tables
6.1 The instruments used in the Taxi-car domain.
6.2 Fluents used in the Taxi-car domain.
6.3 The four types of agent behaviour in the Taxi-world.
6.4 Classifying states as positive and negative examples of a side-effect.
6.5 Input to Aleph: Positive and negative examples
6.6 Input to Aleph: the background file
7.1 Instruments used in the Grid-world domain.
7.2 Fluents used in the Grid-world domain.
7.3 Behaviours available in the Grid-world.
7.4 Instruments used in the Taxi-car domain.
7.5 Fluents used in the Taxi-car domain.
7.6 Behaviours available in the Taxi-Car domain.
7.7 The success-rates of final policies learnt in the taxi-world.
7.8 The fuel factor for each reflective approach to Experiment 6.
7.9 The fuel factor for each reflective approach to Experiment 7.
7.10 Instruments used in the soccer domain.
7.11 Objects in the Soccer domain.
7.12 Fluents used in the Soccer domain.
7.13 Granularity 0 behaviours in the soccer domain.
7.14 Granularity 1 behaviours in the soccer domain.
7.15 Granularity 2 behaviours in the soccer domain.
7.16 Granularity 2 behaviours in the soccer domain, cont.
7.17 Discretisation of instruments in the soccer domain.
7.18 Progress estimators used in the soccer domain.
7.19 The side-effects detected in the soccer-world.
List of Algorithms
1 Watkin’s Q-Learning
2 SMDP Q-Learning
3 HSMQ-Learning
4 HAMQ-Learning
5 Rachel’s planning algorithm: Iterative Deepening
6 Rachel’s planning algorithm: Adding new nodes
7 Planned HSMQ-Learning
8 TRQ-Learning
9 TRQ-Learning: Persisting with a behaviour
10 Hierarchical planning: Iterative Deepening
11 Hierarchical planning: Adding new nodes
12 Planned HSMQ-Learning with multiple levels of hierarchy
13 Teleo-Reactive Q-Learning
14 Execute a behaviour
15 TRQ Update
16 Discard lessons from interrupted behaviours
17 Reflector: Detecting side-effects
18 Reflector: Trimming a condition
19 Exploratory planning: Adding new nodes
20 Exploratory planning: Adding maintenance conditions
21 Termination-improved execution of a policy learnt by P-HSMQ
Dedication
This work is dedicated to the memory of my father,
Reginald Kinsella Ryan (1932-1988)
for letting me play on the miraculous “TV Typewriter” he was building in his study. If only he could have seen how far things would come.
I love you, Dad.
Acknowledgements
This thesis would never have been possible without the behind-the-scenes effort of an enormous crowd of people. I’d really like to thank them all. First of all, I must say a big thankyou to Mum and John who housed me and fed me (especially when the money was tight), who put up with all the mood-swings and never stopped loving me. Thankyou so much. I promise I’ll get a real job and a place of my own real soon.
Thanks must also go to my supervisor, Claude Sammut, who inspired me to do a PhD, and to my mentor Mark Pendrith, who warned me not to. You probably both think I listened too much to the other, but I have learnt a lot from both of you. I really valued your advice, even if I did not always take it.
I am grateful for the friendly co-operation of the international A.I. community. It is easy to feel isolated, doing a PhD in Australia when all the action is going on overseas, but many productive email discussions with Tom Dietterich, Stuart Russell, Nils Nilsson, Ronald Parr, David Andre and Scott Benson have helped me to keep up with the action and understand how my work fits into the bigger picture. Thankyou all for your time and valuable assistance.
The UNSW AI Lab has always been a great place to study. There has always been an air of co-operation and collaboration. This work has been greatly improved through innumerable discussions with others in the lab. My thanks particularly go to: Mark Reid, Phil Preston, Jamie Westendorp, Waleed Kadous, Mark Peters, Joe Thurbon, Bernhard Hengst and Michael McGarity. We’ve also had a lot of fun together. My thanks also to all the helpful people in the Computer Services Group: Tiina Muukkonen, Zain Rahmat, Slade Matthews, Simon Bowden, and Geoff Oakley.
Doing a PhD is a real emotional roller-coaster, and I have had my fair share of joyful highs and depressing lows. I could never have made it through without the support of my brothers and sisters in Christ. My heartfelt thanks go to all of you: to Ross Salter who was brave enough to show me his attempts at Pascal, to Sarah Davis for many drive-home conversations, to Liz Anthony for daring to be real, to my beloved “little sisters” Vickie and Su Williams (now Deenick and Nguyen) for unstinting love and hospitality, to my one-time housemates Dave Hore and Alex Zunica for putting up with my midnight cookery, to Christina Cook for asking tough questions, to Rowan Kemp for being a math-geek with a gift for story-telling, and to Rob and Jude Graves and Wendy Bebbington for recognising and encouraging my feral/hippy tendencies. Especial thanks need to go to Andrew Cameron and Anthony Sell for helping me through my darkest hours. I love you all dearly and praise God that he gave me each of you.
I have one word of advice that every PhD student ought to hear: Sit up straight. If I have learnt nothing else from my experience as a PhD student, it is this. Take care of your back, and make sure you have a friendly physiotherapist. I have two: Philip Richardson and Peter Hunt, and I am very grateful to both of them.
The following people deserve a more dubious kind of acknowledgement. They have provided many hours of pleasurable distraction, when I probably should have been working on my thesis. The culprits are: the UNSW Circus Society (particularly Mark Aiken, Kim Isaacs, Brock Misfud, Lucy Young and Bek Earnshaw), the authors of INTERCAL (Don Woods, James Lyon, Louis Howell and Eric S. Raymond), the various members of Agora Nomic past and present (particularly my long-standing fellow conspirator Steve Gardner), the authors of Piled Higher and Deeper1, and the entire Nethack DevTeam. Without your “help” my time as a PhD student might have been a lot shorter, but also a lot duller.
Finally I must acknowledge my Father God and my Lord Jesus Christ, for making this astonishing world in which to live, and this astonishing creature that I call myself. Every day I marvel at your power, your creativity and your love.

“You are worthy, our Lord and God, to receive glory and honor and power, for you created all things, and by your will they were created and have their being.”
Revelation 4:11 (NIV)
1 http://www.phdcomics.com
Mathematical terminology

Symbol                      Meaning

States
s                           A primitive state of the world.
st                          The state of the world at time t.
f(P1, . . . , Pk)           A fluent: a predicate describing a state of the world.
f ∧ g                       The logical conjunction (“and”) of fluents f and g.
s |= f1 ∧ f2 ∧ . . . ∧ fk   State s models fluents f1 . . . fk,
                            i.e. the fluents are true in state s.

Actions
a                           A primitive action.
B                           A behaviour, a temporally abstract action.
B(P1, . . . , Pk)           A parameterised behaviour.
a                           An action, either primitive or temporally-abstract.
P                           The set of all primitive actions.
B                           The set of all behaviours.
A                           The set of all possible actions,
                            both primitive and temporally abstract.
Root                        The root behaviour of a learning task.
                            A description of the agent’s main goal.
B.pre                       The precondition of behaviour B.
                            A conjunction of fluents which describes the set of states
                            in which the behaviour can be executed (its application space).
B.post                      The post-condition of behaviour B.
                            A conjunction of fluents which describes the set of states
                            which form the goal of the behaviour.
B.sfx                       The side-effects of behaviour B.
                            A set of conjunctions of fluents which cannot be guaranteed
                            to remain true while executing B.
B.plan                      The plan decomposition of behaviour B.
B.gran                      The granularity of behaviour B.

Plans
P                           A plan.
N                           A node of a plan.
N                           The set of all nodes.
Nt                          The set of all nodes active at time t.
N.cond                      The condition of node N.
                            A conjunction of fluents.
N.B                         The behaviour dictated by node N.
N.type                      The type of node N: either policy or exploratory.

Other
P (X | Y )                  The probability of event X given Y.
E {X | Y }                  The expected value of random variable X given Y.
T (s′|s, a)                 The transition probability function for primitive action a.
T (s′, k|s, B)              The transition probability function for behaviour B.
R (r|s, a)                  The reward probability function for primitive action a.
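To make the notation concrete, the behaviour and plan-node attributes above can be mirrored in a small record structure. The following Python sketch is purely illustrative: the field names follow the B.* and N.* notation of the table, but the class names and the string encoding of fluents are assumptions of this example, not Rachel’s actual implementation.

```python
from dataclasses import dataclass, field

# A fluent is encoded here as a string such as "in(hall)";
# a conjunction of fluents f1 ∧ ... ∧ fk is a frozenset of such strings.
Fluent = str
Conjunction = frozenset

@dataclass
class Behaviour:
    """Illustrative record mirroring the B.* notation."""
    name: str
    pre: Conjunction                 # B.pre  - states where B can be executed
    post: Conjunction                # B.post - states forming the goal of B
    sfx: set = field(default_factory=set)  # B.sfx - conjunctions not guaranteed to persist
    plan: object = None              # B.plan - plan decomposition, if any
    gran: int = 0                    # B.gran - granularity of B

@dataclass
class PlanNode:
    """Illustrative record mirroring the N.* notation."""
    cond: Conjunction                # N.cond - a conjunction of fluents
    behaviour: Behaviour             # N.B    - behaviour dictated by this node
    type: str = "policy"             # N.type - "policy" or "exploratory"

def active(node: PlanNode, state: set) -> bool:
    """A node is active in state s when s |= every fluent in its condition."""
    return node.cond <= state

go = Behaviour("Go(hall, bedroom2)",
               pre=Conjunction({"in(hall)"}),
               post=Conjunction({"in(bedroom2)"}))
node = PlanNode(cond=Conjunction({"in(hall)"}), behaviour=go)
print(active(node, {"in(hall)", "holding(coffee)"}))  # True
```

The subset test in `active` is one way to realise the s |= f1 ∧ . . . ∧ fk relation when fluents are ground terms; parameterised fluents would need unification instead.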
Chapter 1
Introduction
1.1 Picture this:
A quiet suburban street, late in the afternoon. A car is parked at the curb with
two people inside, one young and one old. They are talking. The younger one
sits in the driver’s seat. She seems nervous. Her companion is calming and
reassuring her. Two familiar cardboard squares are prominently displayed on the
front and rear of the vehicle, each bearing a large letter ’L’. They are learner’s
plates. This is her first driving lesson.
Lesson number one is simple: Start the car and pull away from the curb.
Her instructor reminds her of the procedure: start the engine, select first gear,
indicate, check the mirrors and then accelerate away smoothly. She repeats these
instructions to herself as she turns the key. With her foot firmly on the clutch,
she hunts for first gear. The stick finally settles into place and slowly she releases
the clutch - but not slowly enough. The car lurches forward suddenly and then
comes to an abrupt halt. Her instructor reassures her and reminds her of the
importance of finding the “friction point” and using the accelerator as well as
the clutch. She tries again. On the third attempt the car hops forward once or
twice, but doesn’t stall. Buoyed by her success, the pupil moves on to the next
step.
The scene changes . . . Five years have passed and she now drives to work
every day. Changing gears is second-nature to her — she doesn’t even think
about it. In fact, she has made this trip so many times she swears she could
do it blindfolded. Along the way she has learnt things her instructor never told
her: how much further she can go when the fuel gauge says “empty”, and how
to steer with her knees and do her make-up simultaneously in slow traffic. And
next weekend she will be passing her knowledge on, giving her younger brother
his first lesson.
What has effected this enormous transformation? How did our subject go from
being unable to even start the car moving, to driving large distances every day?
The answer is simple: Practice.
Aristotle, in his Politics, said “The things we learn to do, we learn by do-
ing.” He was referring to ethical behaviour, but the same could be said of any
human activity, whether it be walking, talking, driving, juggling or playing the
piano. Expertise comes with experience, and experimentation will always be our
greatest teacher.
And yet experience alone is not enough. Our subject’s driving practice was
not attempted haphazardly but was informed by advice, structured by reason
and refined by reflection. Without any background knowledge of how the car
works and the purpose and meaning of the various dials and controls, our young
driver might still be randomly pressing buttons, turning knobs and pushing levers
years later, without ever having left the curbside.
From the Turing Test to the artificial humanoid, human intelligence has al-
ways been our inspiration for artificial intelligence. Some would say it is our
only measure of what intelligence really is. Even those who regard AI as an
engineering discipline draw ideas from human examples. In this thesis we shall
be exploring how the familiar combination of knowledge, reasoning and practice
can be implemented in an artificial system.
The focus of this work will be on the problem of control. A computer-
controlled “agent” interacts with an external environment to achieve certain
goals. It may be a robot interacting with an office environment, or a software
agent exploring the web, or any of a number of similar models. The agent can
sense certain features of the environment and can use its actuators to make
changes in it. Our object will be to create a control policy for the agent which
allows it to achieve its goals. As in the example that opened this chapter, that
policy will be constructed by a combination of knowledge, reasoning, practice
and reflection.
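The sense-act cycle just described can be sketched in a few lines. The environment and the hand-coded policy below are hypothetical stand-ins, intended only to show the shape of the control loop, not any system described in this thesis.

```python
class Environment:
    """A hypothetical one-dimensional world: the agent starts at
    position 0 and its goal is to reach position 3."""
    def __init__(self):
        self.position = 0

    def sense(self):
        # The agent can sense certain features of the environment...
        return self.position

    def act(self, action):
        # ...and can use its actuators to make changes in it.
        self.position += 1 if action == "right" else -1

    def goal_reached(self):
        return self.position == 3

def policy(state):
    """A trivial control policy mapping sensed state to action."""
    return "right"

# The control loop: sense the environment, choose an action
# according to the policy, act, and repeat until the goal is reached.
env = Environment()
steps = 0
while not env.goal_reached():
    state = env.sense()
    env.act(policy(state))
    steps += 1
print(steps)  # 3
```

In this thesis the interesting question is where such a policy comes from; here it is hard-coded, whereas the chapters that follow construct it from knowledge, reasoning and practice.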
1.2 Human intelligence: A Cognitive Model
To better understand the aspects of human intelligence from which we will be
drawing our inspiration, let us first look briefly into a psychological model of
human problem-solving from Cognitive Psychology.1 We will illustrate this model
with our earlier example.
Consider first the situation our driver finds herself in when first we meet her.
Even before getting into the driver’s seat she has received extensive training
about driving. Some of it was in the form of instruction from her teacher. Some
of it was from observing others drive. She knows what most of the controls
are for. She has memorised the road rules and been tested on them, and she
can recite the sequence of steps she is about to attempt. Yet with all of this
knowledge, she cannot drive the car.
After many years of practice her driving is smooth and mostly fault-free, but
when she comes to teach her younger sibling she will encounter a new frustration.
He will ask questions like: “How do you release the clutch without stalling?” and
“Exactly when are you supposed to change gears?” She can only answer vaguely
at best, communicating her knowledge in abstract terms. In the end she says,
“You develop a feel for it”. She has the ability but she cannot put it precisely
into words.
1.2.1 Types of Knowledge
Cognitive psychologists recognise these two experiences as evidence of two dis-
tinct kinds of knowledge which they call declarative and procedural (Anderson,
1976; Cohen & Squire, 1980). Declarative knowledge is “knowing what”. It is
knowledge of which we are consciously aware. We can communicate it and reason
with it, but it is abstract and does not on its own convey ability to do the things
it describes.
Procedural knowledge is “knowing how”. It is implicit and unconscious. It
tells us how to do things, down to the intricacies of sensing and movement, but
being implicit it is difficult to communicate. Even if it could be explained, it is
of limited use to others as it is tailored intimately to the person possessing it.
1As a model of human rationality, Cognitive Psychology is one of many, and it necessarily has its opponents. While I am persuaded by many of its claims, I am not presenting it here as the only such model, or indeed the best. It is merely a source of inspiration. Even a poor model of human intelligence can inspire a good model for artificial intelligence.
1. Introduction 4
1.2.2 Interaction between the types
Having distinguished these different kinds of knowledge, how do they relate?
What are their roles in the learning process? Cognitive psychology describes
three phases of skill acquisition in which these two kinds of knowledge each play
a role (Anderson, 1995):
1. The cognitive phase, in which the person acquires the necessary declarative
knowledge to attempt to solve the problem. This knowledge is used to form
a plan of attack for solving the problem.
2. The associative phase, in which the declarative knowledge is gradually
converted into procedural knowledge through practice. Errors in the initial
declarative knowledge are also discovered and corrected, refining the plan
if needs be.
3. The autonomous phase, in which procedural knowledge has taken over and
the skill becomes more and more automatic, requiring significantly less
attention to perform.
The relationship between declarative and procedural knowledge is circular.
Initial declarative knowledge helps us acquire procedural knowledge by guiding
our practice, and the experience gained through performing the skill enables us
to add to and refine that initial knowledge.
1.3 Declarative and Procedural Knowledge in
Artificial Intelligence
If human intelligence manifests itself as both declarative and procedural knowl-
edge, one might expect to find a similar dichotomy in AI research, and indeed
this is the case. There has been a long-standing philosophical dispute about
the “correct” way to construct AI, with the protagonists falling into two broad
camps: the “classicists” and the “connectionists”. Many parallels can be drawn
between these two viewpoints and the two kinds of intelligence we describe.
Classical AI (also “Symbolic AI” or “Good Old-Fashioned AI”) has its roots
in the earliest AI traditions from the 1960s and 70s (Winograd, 1972, Minsky,
1974, Bobrow & Winograd, 1977). Newell and Simon (1976) summarise this
position as the concept that the mind is a physical symbol system, that is, “a
system that produces through time an evolving collection of symbol structures”.
They distinguish this from a general-purpose computer by indicating that sym-
bols designate objects in the world.
Practically, symbolic AI has primarily concerned itself with abstract logical
reasoning, using symbolic logic (or its equivalent) to express knowledge about
the environment, to reason about it, and to decide how to interact with it. It
often deals with abstractions, assuming that complex descriptions of objects and
actions such as “pick up the big, red pyramid” can easily be translated from the
abstract to the concrete.
In this way, symbolic AI bears similarity to declarative knowledge. Its lan-
guage is similar to that which we use when we reason declaratively (at least on
paper) and so is more-or-less transparent to our intellects. This enables us to
communicate with such systems directly, encoding our knowledge and interpret-
ing results relatively easily – to a limit.
Beyond a certain point of complexity, our knowledge becomes increasingly
difficult to express symbolically, reasoning with such knowledge becomes very
complex, and the results of such systems require more and more expert knowledge
to interpret. Simply put, there are limits on how much of the world we can
comprehend and explain.
Symbolic AI has been criticised for failing to represent the full complexity
of the world (Dreyfus, 1979, Smolensky, 1989). The assumed mapping between
the abstract symbols and the real-world objects and actions they represent turns
out to be very hard to implement. For example, recognising an “apple” from
a picture can be quite difficult – the apple can be red, green, yellow or even
brown. It can be lit in many different ways. It may sit in amongst many other
fruit, or be hanging on a tree, it may be a cartoon illustration rather than a
realistic photograph, and so on and so on. This problem of mapping between
the input signals from hardware sensors to the symbols of reasoning and planning
is known as the symbol grounding problem (Harnad, 1990, Ziemke, 1997) and is
a significant stumbling block for classical AI.
In light of these failings, Connectionist AI (“subsymbolic AI” or “New Fan-
gled AI”) provided a dramatic rethink of our approach to building AI systems
(Rumelhart, 1989, Churchland, 1990). Rather than explicitly attempting to
model our thinking patterns, connectionist approaches instead attempt to model
our brains. They are inspired by the neural structure of the brain, with a mul-
titude of simple computation elements (neurons) connected together into a self-
organising network.
From this starting point connectionist approaches have diversified into a large
number of strongly statistical techniques. What they share in common is the
absence of any explicit internal representation beyond the immediate numeric
inputs and outputs from sensors and actuators.
One of the strong philosophical claims of this field is that despite the fact
that symbols are not explicitly represented, there are emergent “sub-symbols”
(Smolensky, 1989) – patterns of activation within the connectionist networks
which generalise the low-level input data into an implicit high-level representation.
This has been a point of contention with practitioners from the symbolic
camp (e.g. Rosenberg, 1990), who argue that these sub-symbols are not guaranteed
to exist, may not follow our concepts of logical behaviour, and are not
discernible to external observers. Practically speaking, connectionist systems
often lack comprehensibility as they are without an accessible high-level expla-
nation for their behaviour.
Sub-symbolic AI thus has many things in common with unconscious proce-
dural knowledge. While connectionism isn’t always procedural, in the sense of
dealing with action, it does deal directly with input from the world and has no
explicit abstract representations. It is often able to model much more complex
and subtle relationships in the world, but at the loss of comprehensibility.
It is not our intention to provide a rigorous discussion of the philosophical
merits of either approach – other, wiser, minds have attempted that (see Haugeland
(1997) for a good anthology) and our intentions are more practical. We shall
focus on one problem area in particular, that of control, and aim to show how
the dispute can be resolved in a way that takes advantage of the strengths of
both approaches. Whether this kind of reconciliation can be extended to the
wider problem of AI is a question we shall leave for the philosophers.
1.3.1 Declarative and Procedural Knowledge in Control
Since we are interested in the problem of control, how does the Symbolic/Sub-
symbolic distinction manifest itself in AI approaches to this field?
The symbolic approach to control
The typical symbolic approach to control is Symbolic Planning 2 (Allen, Hendler,
& Tate, 1990; Ghallab & Milani, 1996). Symbolic planning is characterised by:
• A first-order logic representation for describing the agent’s state,
• A logical model of the agent’s actions in terms of cause-and-effect,
• Explicitly codified goals,
• A formal reasoning process to determine a sequence of actions which will
cause the goals to become true.
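The cause-and-effect action model in such a characterisation is typically written as STRIPS-style operators. As a rough sketch only – the blocks-world predicates and set-based state representation below are invented for illustration, not taken from any particular planner – an operator and its application might look like:

```python
# A minimal STRIPS-style operator sketch (all predicate names invented).
# The symbolic state is a set of ground facts; an operator has
# preconditions, an add-list and a delete-list.

def applicable(state, pre):
    """An operator may fire only when every precondition holds."""
    return pre <= state

def apply_op(state, pre, add, delete):
    """Cause-and-effect model: delete old facts, assert new ones."""
    if not applicable(state, pre):
        raise ValueError("preconditions not satisfied")
    return (state - delete) | add

# Hypothetical operator: pick up a clear block from the table.
state  = {"on_table(a)", "clear(a)", "hand_empty"}
pre    = {"on_table(a)", "clear(a)", "hand_empty"}
add    = {"holding(a)"}
delete = {"on_table(a)", "clear(a)", "hand_empty"}

new_state = apply_op(state, pre, add, delete)
```

A planner chains such operators together, using the add- and delete-lists to reason about which sequence of actions will make the goal facts true.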
The advantages of this approach are:
• It can make use of background knowledge provided by the trainer in the
form of the action model,
• The resulting plans are logically sound and the correctness of the planning
process is easily verified,
• The plans are comprehensible. The reasons behind a particular choice of
action can be extracted from the plan.
The disadvantages are:
• Important concrete details of the effects of actions, such as duration,
stochasticity (and other kinds of non-determinacy), continuity and depen-
dencies on the finer details of the state are hard to specify in a symbolic
model.
• The planning process doesn’t scale down to fine-grained problems. Logical
reasoning about cause and effect needs to be supplemented by numeric and
probabilistic reasoning which complicate the planning process and obscure
the plans.
2Practitioners of symbolic planning generally refer to it just as “planning”. There are, however, sub-symbolic techniques that are also called “planning”, so we shall use the name “symbolic planning” where necessary, to preserve the distinction.
The sub-symbolic approach to control
The sub-symbolic approach to control is typified by Reinforcement Learning
(Sutton & Barto, 1998). The characteristics of this approach are:
• States are represented as vectors of numerical values
• Goals are specified implicitly in reward functions. Actions have a numeric
“value” depending on how well they achieve those goals.
• General-purpose cause-and-effect models of actions are usually abandoned,
or else they are numeric models which attach probabilities to different
outcomes.
• As the effects of actions are difficult for the programmer to discover and
describe, policies are learnt via interaction with the environment, rather
than deduced from models.
The advantages of this approach are:
• It can handle subtle and complex environments which cannot adequately
be described by high-level logical models.
• It can learn by trial-and-error those things which cannot be specified by
the trainer.
The disadvantages are:
• Policies are opaque. Interpreting the reason for a particular action is vir-
tually impossible.
• When background knowledge does exist, it is not easy to incorporate into
the agent’s policy.
• Without background knowledge, this approach doesn’t scale up to large
problems.
Many of the strengths and weaknesses of these two approaches are based
on what we shall call the granularity of the actions involved. The symbolic
approach is best suited to coarse-grained states and actions – large scale features
and changes in the world which can be regarded at a high level of abstraction.
The sub-symbolic approach is better suited to fine-grained states and actions
– low-level aspects of the world which often require concrete numerical detail to
be described accurately.
1.4 A Hybrid approach
In our opinion basing an approach to AI on a single kind of representation,
symbolic or sub-symbolic, is going to falter. Each representation has limits
to what it can express. Just as human intelligence is based on an interaction
between declarative and procedural knowledge, so also artificial intelligence is
going to need to incorporate both symbolic and sub-symbolic techniques if it is
going to overcome these limitations.
In this thesis we propose a hybridisation of symbolic and sub-symbolic ap-
proaches to artificial intelligence for control. We shall combine symbolic planning
with reinforcement learning to produce a system that capitalises on the strengths
of each to overcome the weaknesses of the other. Let us examine briefly how such a
hybrid would achieve this, first from the point of view of reinforcement learning,
then from the point of view of planning.
1.4.1 Putting declarative knowledge into reinforcement
learning
The search for a general-purpose reinforcement learning algorithm that can be
applied to arbitrary learning problems of any size has been largely fruitless. Al-
gorithms do not scale up without careful hand-tailoring to particular problems.
In response to this realisation, some researchers have directed their attention
toward building better special-purpose solutions instead, solutions that deliber-
ately incorporate domain-specific background knowledge in a systematic way to
improve learning performance.
Hierarchical reinforcement learning (HRL) (Sutton, Precup, & Singh, 1999;
Dietterich, 2000a; Parr & Russell, 1998), is one such technique that is proving
quite effective. It implements the familiar intuition that a complex task can be
more easily solved if it can be decomposed into a set of simpler tasks. Solutions
are found for the simpler problems, and then recombined to solve the original
task. This intuition has successfully been used to simplify problems in several
areas of artificial intelligence, such as hierarchical planning (Sacerdoti, 1973; Iba,
1989) and structured induction (Shapiro, 1987).
Reinforcement learning generally attempts to find a policy mapping primi-
tive states directly to primitive actions. Generally every possible primitive action
can be explored in every possible primitive state, resulting in a massive search
space of possible policies. HRL tries to cut down this search space by intro-
ducing high-level structure into policies. The trainer specifies a set of abstract
behaviours which are temporally abstract (have significant duration) and spa-
tially abstract (they map from one set of states to another). Policies are learnt
in terms of behaviours. Behaviours are decomposed (in various ways) into prim-
itive state/action mappings.
Behaviours have two important advantages:
• They limit the policy space, by specifying particular policy mappings for
large portions of the state-space. This is done either by hard-coded restric-
tions (limiting choices, or even specifying particular parts of the policy) or
else by local goals.
• They are temporally abstract. Choosing to execute a behaviour generally
means committing to it long-term. A policy involving many hundreds of
primitive actions could possibly be specified as only a short sequence of
behaviours. Learning a behaviour-based policy requires fewer decisions to
be made, so the search space is reduced significantly.
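The second advantage can be made concrete with a small sketch. Once a behaviour is chosen, its internal policy runs until a local termination condition holds; the one-dimensional toy world and all names below are invented for illustration:

```python
# Executing a temporally abstract behaviour: one high-level decision
# commits the agent to many primitive steps (toy example only).

def run_behaviour(state, policy, terminated, step, max_steps=1000):
    """Follow the behaviour's internal policy until its local
    termination condition holds; return the resulting state."""
    for _ in range(max_steps):
        if terminated(state):
            break
        state = step(state, policy(state))
    return state

# Invented behaviour: "walk right until position 5 is reached".
final = run_behaviour(
    state=0,
    policy=lambda s: +1,          # always move right
    terminated=lambda s: s >= 5,  # local goal
    step=lambda s, a: s + a,      # deterministic toy dynamics
)
```

Five primitive actions are executed, but the high-level learner made only a single decision.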
However, it is not simply a matter of adding a collection of behaviours and
instantly getting superior performance. Behaviours only work if they reduce the
size of the policy space. Providing an agent with a large repertoire of behaviours
in which many alternatives exist for every decision may make learning harder
rather than easier. Time will be wasted exploring inappropriate behaviours.
Most common monolithic (i.e., non-hierarchical) RL approaches are model-
free – that is they do not require models of actions, nor do they endeavour to
build them. Hierarchical reinforcement learning algorithms have inherited this
characteristic, insofar as they do not attempt to build or use models of behaviours
(with some exceptions). This is true for the same reasons as for monolithic RL:
like primitive actions, behaviours have complex, stochastic effects which cannot
easily be specified or modeled. It is easier to evaluate them based on goal-specific
criteria, rather than build general-purpose models.
However, when the repertoire of behaviours is large, some means of limiting
the set of choices is necessary. It is assumed that the trainer has some
background knowledge of the tasks and knows which behaviours might be ap-
propriate in different parts of the problem. It may be that this set of behaviours
is small enough that no further limits need to be applied, but as more ambitious
problems are tackled the repertoire of necessary behaviours will increase, and
further background knowledge will need to be implemented to limit the set of
choices available in a certain situation to those that may be appropriate, ignoring
those that were included for different parts of the problem, or different problems
altogether.
Most existing algorithms implement such knowledge in the form of a task-
hierarchy (Dietterich, 2000a) or similar structure. A task-hierarchy is essentially
a function which maps a situation to a set of behaviours which might be appro-
priate in that situation. The set of choices available to the learning algorithm is
thus kept to a minimum, and the size of the policy space is kept under control.
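Viewed this way, a task-hierarchy is just a function from situations to candidate behaviours. A minimal sketch, with entirely invented situations and behaviour names:

```python
# A task-hierarchy seen as a mapping from the current situation to
# the set of behaviours worth considering there (names invented).

HIERARCHY = {
    "far_from_door":  {"go_to_door"},
    "at_closed_door": {"open_door"},
    "at_open_door":   {"walk_through", "open_door"},
}

def available_behaviours(situation):
    """Restrict the learner's choice to plausibly useful behaviours."""
    return HIERARCHY.get(situation, set())

# The learner now selects among at most two behaviours here,
# not among the whole repertoire.
choices = available_behaviours("at_open_door")
```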
Task hierarchies have hitherto been constructed manually by the trainer, but
this need not be the case. In this thesis we shall investigate the possibility of
automatically constructing them, based on abstract models of behaviours. The
concrete details of a behaviour’s effects may be difficult to model, but its intended
purpose is often much simpler and more easily expressed. Given an abstract
model of a behaviour’s purpose, and a similar definition of the agent’s goal, we can
use symbolic planning techniques to reason which actions might be appropriate
in different situations, but rather than choosing a particular behaviour in this
way, we shall determine a set of appropriate behaviours, and then use learning
to select a particular one. In this way we can use abstract knowledge from the
model to limit the set of choices, and then concrete knowledge gathered from
experience to make the selection.
An additional issue in HRL is that of temporal abstraction and commitment
to behaviours. It is recognised that behaviour-based policies are not necessar-
ily optimally efficient, and that better policies can be generated by “cutting
corners”, relaxing the commitment to completing a behaviour once it has been
started, in favour of switching to a better behaviour sooner. While this indeed
produces more efficient policies, it removes one of the principal advantages of
behaviours, which is their temporal abstraction. The more often the agent can
make decisions, the more decisions it will have to make, and the longer it will
take to learn the right ones. For this reason most advocates of this approach
reserve it for optimising a behaviour-based policy that has already been learnt.
Modeling the purpose of behaviours can also aid us in finding the best tradeoff
in this problem. A symbolic model allows the agent to reason about why an
executing behaviour is appropriate. Once the behaviour has been started, this
condition can be monitored. As long as the behaviour remains appropriate it is
reasonable to persist with it. However, if, for some reason, the behaviour becomes
inappropriate, we have valid grounds for interrupting it and selecting another.
Policies learnt in this way will be more efficient than those that blindly continue
executing behaviours which have long since become pointless. Meanwhile they
keep the number of occasions on which a new decision needs to be made to a
minimum, and thus learning is not excessively delayed.
We shall investigate hybrid algorithms using both symbolic planning and
hierarchical reinforcement learning which implement both these improvements
to stand-alone HRL.
1.4.2 Putting procedural knowledge into symbolic plan-
ning
Analysing the problem from the opposite viewpoint, much research has also gone
into scaling symbolic planning to more complex fine-grained problems. Hierar-
chy is also of value here, and significant work has been done to create hierarchical
planners, which first decompose plans into large steps, and then refine them
recursively into smaller and smaller steps until a solution can be found in terms of
the most primitive actions.
These techniques can considerably reduce the search time involved in building
a plan, but still require a logical model of the effects of actions. Often the default
assumption is that the primitive actions are actually already moderately complex
behaviours, which can be cleanly modeled, or else the problem domain is such
that primitive actions are noiseless and deterministic.
These problems have led to recent developments in planners that acknowledge
that the world is noisy and that their models are likely to be incomplete.
As a result the expectation that actions operate instantaneously and predictably
has been abandoned. Planners are now designed to include contingencies in their
plans for when things go awry, and plan execution is monitored to ensure that
any such execution failures are detected and handled as they arise.
All the same, maintaining the logical description of actions limits the ability
to express fine-grained actions in realistic settings. Symbolic techniques still rely
on a programmer to provide behaviours which can be described at a medium-
to-high level of abstraction.
In this thesis we shall explore an alternative option. Rather than starting with
hard-coded behaviours which can be described abstractly, we shall instead start
with an abstract definition of a desired behaviour, and then use reinforcement
learning to learn a concrete implementation. Thus the planner does not have to
deal directly with primitive actions, but can instead work at the level of abstraction
at which it operates best.
Learning behaviours is not without its drawbacks. It cannot be predicted
exactly how the learnt behaviour will operate, how efficiently it will work nor
what side-effects it may produce. The standard planning process of finding the
shortest possible sequence of behaviours to reach a goal may not produce the
best possible plan. A certain amount of trial-and-error exploration of different
possible plans will be necessary to find the best possible solution. To this end,
we will use hierarchical reinforcement learning to select between different paths
to the goal based on concrete experience with the different possibilities.
1.4.3 Getting declarative knowledge out of procedural
learning
Our aim so far has implicitly been to find the most efficient policy to achieve
an agent’s goals. We have discussed how a combination of symbolic and sub-
symbolic AI might achieve this more effectively than either approach alone. How-
ever we have a second aim running alongside this. We would also like the agent
to be able to improve its body of declarative knowledge through analysis of the
results of procedural learning. In other words, by “reflecting” on its experiences
executing behaviours, the agent should be able to repair incorrect or incomplete
parts of its symbolic model. The advantages of this are two-fold:
1. It makes the knowledge that is implicitly contained in its policies explicitly
available for reasoning and planning. Incorrect or incomplete plans can
hamper the agent’s ability to improve its policy. Repairing such faults will
allow better policies to be found.
2. Explicit declarative knowledge can be more easily communicated to other
agents, including the trainer. If the reasons for certain decisions can be
modeled declaratively, then it is to our advantage to do so, to make the
agent’s decisions more comprehensible.
Extracting symbolic knowledge from experience has been the object of much
research (Benson, 1995; Shen, 1993; Wang, 1996; Oates & Cohen, 1996; Lorenzo
& Otero, 2000; Gil, 1994; desJardins, 1994). There are many different things
that can be learnt and modeled. In this work we have chosen to focus on one
particular element which has received comparatively little attention: learning the
side-effects of behaviours. These can be detected as the results of plan execution
failures, and then the agent can learn to predict and avoid them using symbolic
machine learning tools.
1.4.4 Rachel: A hybrid planning/reinforcement learning
system
[Figure 1.1 shows the three components of Rachel and the data flowing between them: the Planner builds plans from the RL-Top model; the Actor executes plans and learns policies; the Reflector monitors execution traces and learns side-effect descriptions, which feed back into the model.]
Figure 1.1: The three parts of the Rachel architecture.
We have implemented this proposed hybrid of planning and reinforcement
learning in a system we call Rachel. Rachel consists of three parts: a planner,
an actor and a reflector. All three components share a common symbolic
representation of the world in terms of a set of fluents which describe the state
and teleo-operators which describe potentially useful high-level behaviours.
The planner, a simple means-ends problem solver based on the Teleo-reactive
formalism of (Nilsson, 1994), builds abstract plans for achieving the agent’s
goals. These plans serve as task-hierarchies for the actor, which uses hierarchical
reinforcement learning to select between alternative paths in the plan and to
build concrete policies for abstract behaviours. The actor executes these policies
in the world, monitored by the reflector. When plan execution fails to proceed
as expected, the reflector diagnoses the fault and gathers examples of its cause.
Given enough examples, it uses the Inductive Logic Programming algorithm
Aleph to produce a symbolic description of the cause, which is then fed back
into the planner to make more accurate plans.
1.5 Contributions of this thesis
Work towards the goal of unifying the symbolic and sub-symbolic approaches
to artificial intelligence is still in still its infancy, and there are many aspects to
be considered. The work in this thesis focuses on the problem of learning and
control. The principal contributions made are as follows:
1. The design of a shared representation of states, actions and goals for
both reinforcement learning and planning, particularly the Reinforcement-
Learnt Teleo-Operator (RL-Top) formalism which combines the represen-
tations of abstract behaviours from both fields.
2. A hybrid planning/reinforcement learning architecture Rachel which shows
how reinforcement learning can be used to ground abstract behaviours in
planning, and how symbolic plans can be used turn background knowl-
edge into high-level structure to help solve complex reinforcement learning
problems.
3. Two different hierarchical reinforcement learning algorithms that incor-
porate background knowledge from plans: 1) Planned Hierarchical Semi-
Markov Q-Learning (P-HSMQ) which extends the HSMQ algorithm (Dietterich,
2000b) to use plan-built task hierarchies, and 2) Teleo-Reactive Q-
Learning (TRQ), a more complex algorithm which implements teleo-reactive
execution semantics to improve the termination of behaviours.
4. An examination of how the symbolic representation of the domain can
help diagnose problems in the learnt policy, focusing on the detection of
side-effects not present in the original behaviour descriptions.
5. An extension to Rachel incorporating code to detect such side-effects and
overcome them by refining the symbolic knowledge base using inductive
logic programming.
6. Experimental verification of the effectiveness of this system in domains of
various complexity, ranging from simple grid-based problems to complex
control tasks.
1.6 Overview
The remaining chapters are arranged as follows: Chapters 2 and 3 review the
necessary background to this work, in reinforcement learning and symbolic plan-
ning respectively.
Chapter 4 introduces a hybrid representation for the agent control problem,
which combines elements from both the preceding chapters. At the heart of this
new representation is the concept of a Reinforcement Learnt Teleo-Operator
(RL-Top) which models a reinforcement-learnt behaviour as a symbolic plan-
ning operator.
In Chapter 5 we explain the first two elements of the Rachel architecture:
the planner and the actor. We shall derive two different algorithms for incor-
porating plans into hierarchical reinforcement learning: Planned Hierarchical
Semi-Markov Q-Learning (P-HSMQ) and Teleo-Reactive Q-Learning (TRQ).
In Chapter 6 we tackle the problem of incompletely specified world models,
focusing on the frame problem and how the existence of unexpected side-effects
can make planning less effective. We add a third element to Rachel, the re-
flector, which automatically diagnoses such problems and learns to avoid them
using ILP.
Chapter 7 presents empirical testing of the various algorithms, compared
with existing techniques. We show that the combination of planning, learning
and reflecting can automatically produce results which would otherwise require
a significant degree of hand-crafted background knowledge.
Finally, in Chapter 8 we draw it all together, reflect on what has been
achieved, and suggest a variety of areas for extensions and improvement.
Chapter 2
Background - Reinforcement
Learning
As this thesis describes a hybrid system, it draws on material from several major
subfields of artificial intelligence: reinforcement learning, symbolic planning and
knowledge refinement. In this chapter and the next we review those fields with
an eye to explaining the work that is to come. Each field is quite large in itself
and it is not possible to discuss them comprehensively. Instead, we shall focus on
those aspects of each domain that provide the necessary background material for
this thesis. This material has been split into two chapters. This chapter presents
the subsymbolic approach to control as performed by reinforcement learning and
hierarchical reinforcement learning, and in the next chapter we shall deal with
the symbolic approach to control performed in symbolic planning and model
learning.
2.1 Reinforcement Learning
In the past decade reinforcement learning has established itself as an important
method employed in artificial intelligence research for learning to control an
agent. It is a statistical approach that treats learning how best to behave
as an online optimisation problem with an initially unknown value function.
Its foundations are in the mathematics of dynamic programming.
Many different approaches have been put forward, far too many to cover
here, but most share a common formulation of the reinforcement learning prob-
lem as established in the work of Watkins (1989) and Sutton (1988). This work
2. Background - Reinforcement Learning 19
set the foundation for the field, establishing a sound theoretical model for re-
inforcement learning and introducing practical learning methods. We describe
this foundation, both practical and theoretical, and the Q-Learning algorithm
that embodies it, illustrating principles that have been extended to a
host of more complex algorithms.
2.1.1 The Reinforcement Learning Model
[Figure 2.1 shows an RL agent interacting with its environment: the agent’s policy, shaped by its goals and the reward signal, maps the observed state to an action, and the environment returns the resulting state.]
Figure 2.1: An illustration of a reinforcement-learning agent.
Reinforcement learning models an agent interacting with an environment,
trying to optimise its choice of action according to some reward criterion, as
illustrated in Figure 2.1. The agent operates over a sequence of discrete time-
steps (t, t + 1, t + 2, . . .). At each step it observes the state of the environment
st and selects an appropriate action at. Executing the action produces a change
in the state of the environment to st+1. It is assumed that the sets of possible
states S and available actions A are both finite. This is not always the case in
practice, but it greatly simplifies the theory, so we shall follow this convention.
The mapping of states to actions is done by an internal policy π. The initial
policy is arbitrarily chosen, generally random, and improved based on
the agent’s experiences. Each experience 〈st, at, st+1〉 is evaluated according to
some fixed reward function, yielding a reinforcement value rt ∈ ℝ. The agent’s
objective is to modify its policy to maximise its long-term reward. There are
several possible definitions of “long-term reward” but the one most commonly
employed is the expected discounted return given by:
Rt = rt + γ rt+1 + γ² rt+2 + · · · = ∑_{i=0}^{∞} γ^i rt+i        (2.1)
where γ is the discount rate that specifies the relative weight of future rewards,
with 0 ≤ γ < 1. Should the agent reach some terminal state sT , then the infinite
sum is cut short: all subsequent rewards rT+1, rT+2, . . . are considered to be zero.
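For a finite episode the expected discounted return of Equation 2.1 can be computed directly. The reward sequence below is invented, using the common goal-achievement convention of reward 1 at the goal and 0 elsewhere:

```python
# Discounted return R_t = sum_i gamma^i * r_{t+i} over a finite
# episode; rewards after the terminal state are taken to be zero.

def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Invented episode: no reward until the goal is reached at step 3.
R = discounted_return([0, 0, 0, 1])

# A shorter path to the goal earns a strictly higher return,
# so maximising the return prefers the shortest path.
R_short = discounted_return([0, 0, 1])
```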
Semantically speaking, these reward signals are an expression of the agent’s
goals. A positive reward generally indicates a result that the agent considers
favourable, a negative reward unfavourable (although it must be noted that it
is the relative value of a reward that determines its goodness, not the absolute).
There is no standard form that this function should take, except that it should
obey the Markov property (below). A common formulation in goal-achievement
tasks is to give the agent a positive reward rt = 1 when it achieves its goal and
rt = 0 otherwise. Under the discounted return above, the optimal policy is then
the one that takes the shortest path to the goal.
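The discounted return of Equation 2.1 is easily computed for a finite reward sequence. The following Python sketch (the function name and reward sequences are illustrative, not from the thesis) shows why the 0/1 goal reward makes the shortest path optimal:

```python
def discounted_return(rewards, gamma):
    """R_t = sum_i gamma^i * r_{t+i} (Equation 2.1).

    Rewards after a terminal state are zero, so a finite list
    suffices for episodic tasks.
    """
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Goal-achievement task: reward 1 on reaching the goal, 0 elsewhere.
short_path = [0, 0, 1]       # goal reached on the third step
long_path = [0, 0, 0, 0, 1]  # goal reached on the fifth step

# With 0 <= gamma < 1, the earlier reward is discounted less,
# so the shorter path has the higher return.
assert discounted_return(short_path, 0.9) > discounted_return(long_path, 0.9)
```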
There is some disagreement about from where these reward signals should
be considered to originate in the model shown in Figure 2.1. Some would say
that they are an expression of the agent’s goals and thus belong inside the agent.
Others argue that they are immutable criteria provided to the agent in advance
by its creator, and thus belong in the environment. I choose to place them in a
compromise position: inside the agent but outside the learning sub-system. This
expresses the fact that a complex agent may elect to pursue different goals at
different times, and thus may change the way it evaluates its progress. However,
for the time being we shall assume that the agent has a fixed goal and thus a
fixed reward function.
Note however that the learning algorithms assume no prior knowledge of the
reinforcement function, when rewards will occur or what values they will take.
All learning is done by trial and error: an action is performed and the resulting
state transition and reward are observed, and used to update the policy. As
the environment may well be stochastic, with transitions and rewards occurring
probabilistically, the same transition may need to be observed many times over
before the best choice can be made.
2.1.2 Markov Decision Processes
If we are to construct algorithms that learn policies with any guarantee of per-
formance, some theoretical restrictions need to be placed on this model. One
measure of the complexity of the problem is the amount of information nec-
essary to make accurate predictions about the outcomes of an agent’s actions.
Actions could have non-deterministic or stochastic effects on the state that may
depend on information hidden from the agent or on events long past. All these
possibilities complicate the process of prediction and thus make learning difficult.
To avoid this difficulty most reinforcement learning algorithms make a strong
assumption about the structure of the environment. They assume that it oper-
ates as a Markov Decision Process (MDP). An MDP describes a process that
has no hidden state or dependence on history. The outcomes of every action, in
terms of state transition and reward, obey fixed probability distributions that
depend only on the current state and the action performed.
Formally an MDP can be described as a tuple 〈S, A, T, R〉 where S is a finite
set of states, A is a finite set of actions, T : S × A × S → [0, 1] is a transition
function and R : S × A × ℝ → [0, 1] is a reward function, with:

T(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a)    (2.2)
R(r|s, a) = P(r_t = r | s_t = s, a_t = a)    (2.3)
which express the probability of ending up in state s′ and receiving reinforcement
r after executing action a in state s, respectively. These probabilities must be
independent of any criteria other than the values of s and a. This is called
the Markov Property. An in-depth treatment of the theory of Markov Decision
Processes can be found in (Bellman, 1957), (Bertsekas, 1987), (Howard, 1960)
or (Puterman, 1994).
Given this simplifying assumption, the best action to choose in any state
depends on that state alone. This means that the agent’s policy can be expressed
as a purely reactive mapping of states to actions, π : S → A. Furthermore every
state s can be assigned a value V π(s) that denotes the expected discounted
return if the policy π is followed:
V^π(s) = E{ R_t | ε(π, s, t) }    (2.4)
       = E{ Σ_{i=0}^{∞} γ^i r_{t+i} | ε(π, s, t) }    (2.5)
       = ∫_{−∞}^{+∞} r R(r|s, π(s)) dr + γ Σ_{s′∈S} T(s′|s, π(s)) V^π(s′)    (2.6)
where ε(π, s, t) denotes the event of policy π being initiated in state s at time t.
V^π is called the state-value function for policy π.
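Equation 2.6 characterises V^π as a fixed point, which can be found by simple iteration. A minimal Python sketch, assuming a finite MDP in which the reward integral is collapsed to an expected one-step reward R[s][a] (the data layout and function name are our illustrative assumptions):

```python
def evaluate_policy(states, policy, T, R, gamma, tol=1e-8):
    """Iterate Equation 2.6 to a fixed point.

    T[s][a][s2] : transition probability P(s2 | s, a)
    R[s][a]     : expected one-step reward for doing a in s
    policy[s]   : the action pi(s)
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            v = R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in states)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Two-state example: 'go' moves s0 to the absorbing state s1 with reward 1.
T = {"s0": {"go": {"s0": 0.0, "s1": 1.0}},
     "s1": {"stay": {"s0": 0.0, "s1": 1.0}}}
R = {"s0": {"go": 1.0}, "s1": {"stay": 0.0}}
V = evaluate_policy(["s0", "s1"], {"s0": "go", "s1": "stay"}, T, R, 0.9)
```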
The optimal policy can now be defined simply as the policy π* that maximises
V^π(s) for all states s. The Markov property guarantees that such a globally
optimal policy exists (Sutton & Barto, 1998), although it may not be unique.
We define the optimal state-value function V*(s) as the state-value function of
the policy π*:

V*(s) = V^{π*}(s) = max_π V^π(s)    (2.7)
We can also define an optimal state-action value function Q*(s, a) in terms
of V*(s) as:

Q*(s, a) = E{ r_t + γ V*(s_{t+1}) | s_t = s, a_t = a }    (2.8)

This function expresses the expected discounted return if action a is executed in
state s and the optimal policy is followed thereafter. If such a function is known
then the optimal policy can be extracted from it simply:

π*(s) = argmax_{a∈A} Q*(s, a)    (2.9)
Thus the reinforcement learning problem can be transformed from learning
the optimal policy π* to learning the optimal state-action value function Q*.
This turns out to be a relatively simple dynamic programming problem. The
simplest and most commonly used solution is Watkins’ Q-Learning.
2.1.3 Q-Learning
Q-Learning (Watkins, 1989, Watkins & Dayan, 1992) is an online incremental
learning algorithm that learns an optimal policy for a given MDP by building
an approximate state-action value function Q(s, a) that converges to the optimal
function Q* in Equation 2.8 above. It is a simple algorithm which avoids
the complexities of modeling the functions R and T of the MDP by learning
Q directly from its experiences. It has significant practical limitations, but is
theoretically sound and has provided a foundation for many more complex algo-
rithms. Pseudocode for this algorithm is given in Algorithm 1.
Algorithm 1 Watkins’ Q-Learning
function Q-Learning
t← 0
Observe state st
while st is not a terminal state do
Choose action at ← π(st) according to an exploration policy
Execute at
Observe resulting state st+1 and reward rt
Q(s_t, a_t) ←α r_t + γ max_{a∈A} Q(s_{t+1}, a)
t← t + 1
end while
end Q-Learning
The approximate Q-function is stored in a table. Its initial values may be
arbitrarily chosen; typically they are all zero or else randomly assigned. At each
time-step an action is performed according to the policy dictated by the current
Q-function:
a_t = π(s_t) = argmax_{a∈A} Q(s_t, a)    (2.10)
The result of executing this action is used to update Q(st, at), according to
the temporal-difference rule:
Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α (r_t + γ max_{a∈A} Q(s_{t+1}, a))    (2.11)

where α is a learning rate, 0 ≤ α ≤ 1.
(The above expression is somewhat cumbersome. There are two operations
being described simultaneously which are not clearly differentiated. The first
operation is the temporal-difference step, which estimates the value of Q(st, at)
as:
Q_new = r_t + γ max_{a∈A} Q(s_{t+1}, a)
This value is the input to the second operation, which updates the existing value
of Q(st, at) towards this target value, using an exponentially weighted rolling
average with learning rate α:
Q(s_t, a_t) ← (1 − α) Q(s_t, a_t) + α Q_new
To simplify the equations we shall henceforth use the short-hand notation:
X ←α Y
to indicate that the value of X is adjusted towards the target value Y via an
exponentially weighted rolling average with learning rate α, that is:
X ← (1− α)X + αY
Thus Equation 2.11 shall be written as:
Q(s_t, a_t) ←α r_t + γ max_{a∈A} Q(s_{t+1}, a)    (2.12)
This is not standard notation, however I believe it captures the important ele-
ments of the formula more clearly and concisely.)
The approximate state-action value function Q is proven to converge to the
optimal function Q* (and hence π to π*) given certain technical restrictions on
the learning rates (Σ_{t=1}^{∞} α_t = ∞ and Σ_{t=1}^{∞} α_t^2 < ∞) and the requirement that all
state-action pairs continue to be updated indefinitely (Watkins & Dayan, 1992,
Tsitsiklis, 1994, Jaakkola, Jordan, & Singh, 1994). This second requirement
means that in executing the learnt policy the agent must also do a certain pro-
portion of non-policy actions for the purposes of exploration. Exploration is
important in all the algorithms that follow. The standard approach to explo-
ration, followed in this work, is the ε-greedy algorithm which simply takes an
exploratory action with some small probability ε, and a policy action otherwise.
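Algorithm 1 together with ε-greedy exploration can be sketched in a few lines of Python. The environment interface (reset/actions/step) and the toy corridor domain below are our assumptions for illustration, not part of the thesis:

```python
import random
from collections import defaultdict

def q_learning(env, episodes, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-Learning (Algorithm 1) with epsilon-greedy exploration."""
    rng = random.Random(seed)
    Q = defaultdict(float)   # Q-table, initialised to zero
    for _ in range(episodes):
        s, terminal = env.reset(), False
        while not terminal:
            acts = env.actions(s)
            if rng.random() < epsilon:                 # exploratory action
                a = rng.choice(acts)
            else:                                      # greedy (policy) action
                a = max(acts, key=lambda a2: Q[(s, a2)])
            s2, r, terminal = env.step(s, a)
            # Temporal-difference update (Equation 2.12);
            # no bootstrap term at terminal states.
            target = r if terminal else r + gamma * max(
                Q[(s2, a2)] for a2 in env.actions(s2))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q

class Corridor:
    """Toy domain: cells 0..4; reaching cell 4 gives reward 1 and ends the episode."""
    def reset(self):
        return 0
    def actions(self, s):
        return ["left", "right"]
    def step(self, s, a):
        s2 = min(s + 1, 4) if a == "right" else max(s - 1, 0)
        return s2, (1.0 if s2 == 4 else 0.0), s2 == 4

Q = q_learning(Corridor(), episodes=500)
# The learnt greedy policy moves right in every cell.
```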
There exist a number of other reinforcement learning algorithms that offer
alternative approaches to learning within this framework, but Q-Learning re-
mains the de facto baseline upon which other research is built and against which
it is compared. For a more thorough examination of the alternatives, the reader
is referred to the comprehensive treatment in (Sutton & Barto, 1998).
One algorithm that any future researcher should certainly consider is SARSA(λ)
(Sutton, 1996). This is rapidly gaining ground as a contender to Q-Learning as
a baseline reinforcement learning algorithm.
2.1.4 The curse of dimensionality
While Q-Learning and related reinforcement learning algorithms have strong
theoretical convergence properties, they often perform very poorly in practice
(Bellman, 1961). Optimal policies can be found for toy problems, but the algo-
rithms generally fail to scale up to realistic control problems. Without doing a
full analysis of the algorithm, we can observe certain factors which contribute to
this failure.
To find the optimal policy, a Q-value must be learnt for every state-action
pair. This means, first of all, that every such pair needs to be explored at least
once. So convergence time is at best O(|S| · |A|). Real-world problems typically
have large multi-dimensional state spaces. |S| is exponential in the number of
dimensions, so each extra dimension added to a problem multiplies the time it
takes.
Furthermore states are generally only accessible from a handful of close neigh-
bours, so the distance between any pair of states in terms of action steps also
increases with the size and dimensionality of the space. Yet a change in the
value of one state may have consequences for the policy in a far distant state.
As information can only propagate from one state to another through individ-
ual state transitions, the further apart two states are, the longer it will take for
this information to be propagated. Thus the diameter of the state space is an
additional factor in the time required to reach convergence.
A general-purpose solution to this problem has not yet been found. There
have been many attempts to represent the table of Q-values more compactly
by using one variety of function approximator or another, such as neural net-
works (Sutton, 1987, Rummery & Niranjan, 1994), CMACs (Sutton, 1995,
Santamaría, Sutton, & Ram, 1998), or locally weighted regression (Atkeson, Moore,
& Schaal, 1997). These have met with mixed success. Sometimes the result-
ing state-abstraction has enabled the learning algorithm to converge in times
faster by several orders of magnitude for a particular domain (e.g. Tesauro,
1994, Zhang & Dietterich, 1995, Baxter, Tridgell, & Weaver, 1998), but no such
approach has proven to be a general-purpose solution. What works well in one
domain will often fail spectacularly in another. Significant theoretical results
have been produced for off-line evaluation of stationary policies using both lin-
ear function approximators (Tsitsiklis & Roy, 1997), and also general agnostic
function approximators (Papavassiliou & Russell, 1999), but practical results
based on these methods are still outstanding. For a summary of different func-
tion approximation methods applied to RL, see (Kaelbling, Littman, & Moore,
1996).
As a result of these difficulties researchers have turned from seeking general-
purpose to special-purpose solutions. It has been recognised that a number
of the most successful applications of reinforcement learning have used signif-
icant task-specific background knowledge tacitly incorporated into the agent’s
representation of its states and actions. Focus is shifting towards creating an
architecture by which this tacit information can become explicit and can be
represented in a systematic way. The aim is to create systems that can bene-
fit from the programmer’s task-specific knowledge whilst maintaining desirable
theoretical properties of convergence.
2.2 Hierarchical Reinforcement Learning in Theory
Significant attention has recently been given to hierarchical decomposition as
a means to this end. “Hierarchical reinforcement learning” (HRL) is the name
given to a class of learning algorithms that share a common approach to scaling
up reinforcement learning. Their origins lie partly with behaviour-based tech-
niques for robot programming (Brooks, 1986; Maes, 1990; Mataric, 1996) and
partly with the hierarchical methods used in symbolic planning (Korf, 1987, Iba,
1989, Knoblock, 1991), particularly HTN planning (Tate, 1975, Sacerdoti, 1977,
Erol, Hendler, & Nau, 1994). What they have in common with these techniques
is the intuition that a complex problem can be solved by decomposing it into a
collection of smaller problems.
Hierarchical reinforcement learning accelerates learning by forcing a structure
on the policies it learns. The reactive state-to-action mapping of Q-learning is
replaced by a hierarchy of temporally-abstract actions. These are actions that
operate over several time-steps. Like a subroutine or procedure call, once a
temporally abstract action is executed it continues to control the agent until it
terminates, at which point control is restored to the main policy. These actions
(variously called subtasks, behaviours, macros, options, or abstract machines de-
pending on the particular algorithm in question) must themselves be further
decomposed into one-step actions that the agent can execute. We shall hence-
forth refer to one-step actions as primitive actions and temporally-abstract ac-
tions as behaviours. Policies learnt using primitive actions alone shall be called
monolithic to distinguish them from hierarchical or behaviour-based policies.
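The common core of these temporally-abstract actions can be captured in a small data structure, in the spirit of the options framework of Sutton et al. (1999): an applicability (initiation) set, an internal policy, and a termination condition. This Python sketch uses illustrative names and a made-up one-dimensional domain, not any API from the thesis:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Behaviour:
    """A temporally-abstract action: where it may start, what it does,
    and when it stops. Field names are illustrative."""
    name: str
    applicable_in: Callable[[int], bool]   # may the behaviour start in s?
    policy: Callable[[int], str]           # internal state -> primitive action
    terminates_in: Callable[[int], bool]   # has the behaviour finished in s?

def run_behaviour(b, s, step):
    """Execute b until it terminates, like a subroutine call: once chosen,
    it keeps control of the agent. `step(s, a)` is an assumed one-step model."""
    assert b.applicable_in(s), "behaviour chosen outside its applicability space"
    trace = [s]
    while not b.terminates_in(s):
        s = step(s, b.policy(s))
        trace.append(s)
    return trace

# A behaviour that walks right along a number line until it reaches cell 5.
go_to_5 = Behaviour("GoTo5",
                    applicable_in=lambda s: s <= 5,
                    policy=lambda s: "right",
                    terminates_in=lambda s: s == 5)
step = lambda s, a: s + 1 if a == "right" else s - 1
trace = run_behaviour(go_to_5, 2, step)   # traverses 2, 3, 4, 5 in one "jump"
```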
How does this decomposition aid us? There are two different ways. One,
it allows us to limit the choices available to the agent, even to the point of
hard-coding parts of the policy; and two, it allows us to specify local goals for
certain parts of the policy. Different HRL algorithms implement these features
in different ways. Some implement one and not the other. We shall postpone
describing specific algorithms until Section 2.3, and for the moment present these
features in more general terms, with the aid of an example.
2.2.1 A Motivating Example
Figure 2.2: An example world
Figure 2.2 shows an example world we shall use to illustrate the concepts in
this thesis. Imagine that the learning agent is a house-hold robot in a house with
the layout shown. Its purpose is to fetch objects from one room to another. It
can sense its location to the precision of the grid cells shown,
and its primitive actions enable it to navigate from a cell to any of its eight
neighbours, with a small probability of error.
If the robot is in the same cell as an object, it can pick it up and carry it.
There are two objects in the world that we are interested in. In the kitchen in
the north-west corner of the map is a machine which dispenses a cup of coffee.
In the second bedroom there is a book, also indicated on the map. The robot
starts at its docking location in the study. Its goal will vary from example to
example as we consider different aspects of HRL (and later, of planning).
In this world we have 15,000 states (75×50 cells, with two different states for
each object, depending on whether the robot is holding it or not) and 9 primitive
actions (each of the 8 compass directions, plus the pickup action). This is not in
itself a complex world, and most goals will be relatively easy to complete, but
it is certainly one that can be made simpler by providing an appropriate set of
behaviours.
The obvious behaviours to specify are: Go(Room1, Room2) which moves the
robot between two neighbouring rooms, and Get(Object, Room) which moves
towards and picks up the specified object when the robot is in the same room
as it. We will discuss how these behaviours are implemented as we examine
individual techniques.
2.2.2 Limiting the Agent’s Choices
Since learning time is dominated by the number of state-action pairs that need to
be explored, the obvious way to accelerate the process is to cut down the number
of such pairs. Using background knowledge we can identify action choices which
are plainly unhelpful and eliminate them from the set of possible policies. There
is a variety of ways in which this can be done.
Limiting Available Primitive Actions
The simplest solution is to hard-code portions of the policy. Some or all of the
internal operation of a behaviour can be written by hand by the trainer. This
removes the need for the agent to do any kind of learning at all for significant
portions of the state space, which will immediately improve performance. This
assumes however that the trainer is able to do this. Part of the point of learn-
ing policies is to relieve the trainer of the need to specify them, so this may
be of limited use. Still, there are some situations in which simple behaviours
might be wholly or partially specified, and algorithms have been designed to
take advantage of this.
Less drastically, the internal policy of a behaviour could be learnt using only
a limited subset of all available primitive actions. This is useful if the trainer
knows that certain primitive actions are only suitable for particular behaviours
and not for others. From the example, the Go() behaviours could reasonably be
limited to only use the primitive actions which move the robot, and ignore the
pickup action, which would be of no use to that behaviour.
Limiting Available Behaviours
Likewise, limits can be placed on which behaviours are available to the agent at
different times. Behaviours are generally limited in scope, so they often can only
be executed from a subset of all possible states. For instance the Get() behaviour
can only be applied when the agent is in the same room as the target object.
The set of states in which a behaviour can be applied is called its applicability
space. Learning algorithms should not allow the agent to choose a behaviour in
a state in which it is not applicable.
However this may not be limiting enough. As more ambitious problems are
tackled, the repertoire of behaviours available to an agent is likely to become
large, and many behaviours will have overlapping applicability spaces. It is of
no use to limit the internal policy choices of behaviours if choosing between the
behaviours becomes just as difficult.
To this end, most HRL algorithms implement some kind of task hierarchy
to limit the choice of behaviours to those that are appropriate to the agent’s
situation. Consider the situation in the example world when the robot is in the
hall with the goal of fetching both the book and the coffee. There are six ap-
plicable behaviours: Go(hall, study), Go(hall, dining), Go(hall, bedroom1),
Go(hall, bathroom), Go(hall, bedroom2), and Go(hall, lounge). Of these, only
two are appropriate: Go(hall, dining), if the agent decides to fetch the coffee
first, and Go(hall, bedroom2) if the agent decides to fetch the book. Exploring
the others is a waste of time. The trainer, who specified the behaviours, should
realise this and incorporate it into the task hierarchy, limiting the agent’s choices
in this situation to one of these two behaviours. The larger an agent’s repertoire
of behaviours becomes, the more critical this kind of background knowledge is
going to become.
Committing to Behaviours
Finally, choices are limited by requiring long-term commitment to a behaviour.
It is conceivable that a learning algorithm could be written which implemented
hard-coded behaviours but allowed the agent to choose a different behaviour on
every time step. Such an algorithm would hardly be any better than learning
a primitive policy directly, and could easily be worse. Long-term commitment
to behaviours has two benefits. First, a single behaviour can traverse a long
sequence of states in a single “jump”, effectively reducing the diameter of the
state-space and propagating rewards more quickly. In the grid-world, for exam-
ple, fetching both the coffee and the book takes 126 primitive actions, but can
be done with a sequence of just 10 behaviours.
Second, a behaviour can “funnel” the agent into a particular set of terminat-
ing states. These states are then the launching points for new behaviours. If no
behaviour ever terminates in a given state, then no policy needs to be learnt for
that state. Again, referring to the grid-world, each Go() behaviour terminates
in one of the six cells surrounding a doorway, in one of four possible configurations
of what the robot is holding. There are 10 doors, so this yields 240 states.
Each Get() behaviour terminates in the same location as the target object with
2 possible configurations of what the agent is holding, yielding a further 4 states.
Plus 1 starting state gives a total of 245 states in which the agent needs to learn
to choose a behaviour, out of a possible 15,000. This is a significant reduction
in the size of the policy-space and will result in much faster learning.
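The arithmetic behind the 245-state figure can be checked directly (the variable names are ours; the counts are those given above):

```python
# Go() behaviours terminate in the 6 cells around each of the 10 doors,
# in any of 4 holding configurations ({coffee?} x {book?}).
go_states = 10 * 6 * 4          # = 240

# Each Get() behaviour terminates at its object's location, with 2
# possible configurations of the other object; there are 2 objects.
get_states = 2 * 2              # = 4

start_states = 1                # the docking location in the study
choice_states = go_states + get_states + start_states
assert choice_states == 245     # versus 15,000 states in the full problem
```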
Flexible Limitations
Limiting the policy space in this fashion will clearly have an effect on optimality.
If the optimal policy does not fit the hierarchical structure, then any policy
produced by a hierarchical reinforcement learner will be sub-optimal. This may
well be satisfactory, but if not, it is possible to some degree to have the best
of both worlds by imposing structure on the policy during the early phase of
learning and relaxing it later. This allows the agent to learn a near-to-optimal
policy quickly and then refine it to optimality in the long-term. Such techniques
shall be described in more detail in Section 2.4.
2.2.3 Providing Local Goals
So far we have assumed that all choices the agent makes, at any point in the
hierarchy, are made to optimise the one global reward function. Such a policy
is said to be hierarchically optimal. A hierarchically optimal policy is the best
possible policy within the confines of the hierarchical structure imposed by the
trainer.
Hierarchical optimality, however, contradicts part of the intuition of behaviour-
based decomposition of problems. The idea that a problem can be decomposed
into several independent subparts which can be solved separately and recombined
no longer holds true. The solution to each subpart must be made to optimise
the whole policy, and thus depends on the solutions to every other subpart. The
internal policy for a behaviour depends on its role in the greater task.
Consider, for example, the behaviour Go(hall, bedroom2) in the grid-world
problem. Figure 2.3 shows two possible policies for this behaviour. Assume,
for the moment that diagonal movement is impossible. Which of these policies
is hierarchically optimal? The answer depends on the context in which it is being
used. If the agent’s overall goal was to reach the room as soon as possible, then
the policy in Figure 2.3(a) is preferable. If, on the other hand, the goal is to
pick up the book, then the policy in Figure 2.3(b) is better, as it will result in a
shorter overall path to the book.
Furthermore, the same behaviour may have different internal policies in dif-
ferent parts of the problem. For instance, if the agent’s goal is to fetch the book,
carry it to another room and then return to the bedroom, then the first in-
stance of Go(hall, bedroom2) will use the policy in Figure 2.3(b) and the second
instance will use the policy in Figure 2.3(a).
An alternative is to define local goals for each behaviour in terms of a
behaviour-specific reward function. The behaviour’s internal policy is learnt
to optimise this local reward, rather than the global reward. This is called re-
cursive optimality and is a weaker form than hierarchical optimality. Recursively
optimal policies make best use of the behaviours provided to them, but cannot
control what the behaviours themselves do, and so cannot guarantee policies
that are as efficient as hierarchically optimal policies.
The advantages of this approach, however, are several. First of all, learning
an internal policy using a local reward function is likely to be much faster than
learning with a global one. The behaviour can be learnt independently, without
reference to the others. Local goals are generally simpler than global goals, and
local rewards occur sooner than global ones. So each individual behaviour will
be learnt more quickly.¹

¹ It has been suggested (Dietterich, personal communication) that subgoal hints could be provided to a hierarchically optimal learner. A temporary reward shaping mechanism, which adds extra components to the reward function, could encourage the agent to achieve a particular behaviour’s subgoal. Such extra rewards would be phased out over time so that the
(a) A policy which optimises the number of steps to enter the room
(b) A policy which optimises the number of steps to reach the book
Figure 2.3: Two different internal policies for the behaviour Go(hall, bedroom2).
Furthermore, local goals often allow state abstraction. Elements of the state
that are irrelevant to a local reward function can be ignored when learning the
behaviour. So, for example, if the Go(hall, bedroom2) behaviour had a local
reward function which rewarded the agent for arriving in the bedroom, then the
internal policy for the behaviour could ignore what the robot is carrying. This
would reduce the size of the state space for this behaviour by a factor of four.
Finally, local goals allow re-use. Once a behaviour has been learnt in one
context, it can be used again in other contexts without having to re-learn its
internal policy. This is useful not only in a life-long learning agent, but also
when the same behaviour is employed several different times within the one
policy.
The decision whether or not to include local goals is a trade-off between
optimality and learning speed. In the ideal case, when local rewards exactly
match the projected global rewards, the policies learnt will be identical. However
this is unlikely to occur, and so we must decide which measure of performance
is more important to us. In practice different researchers have chosen different
approaches, as will become apparent in Section 2.3.
2.2.4 Semi-Markov Decision Processes: A Theoretical Framework
So far we have described hierarchical reinforcement learning in abstract terms.
We have assumed that choosing between behaviours can be done in much the
same way as choosing primitive actions in monolithic reinforcement learning, to
optimise the expected discounted return. There is, however, a fundamental dif-
ference between monolithic and hierarchical reinforcement learning: behaviours
are temporally extended where primitive actions are not. Executing a behaviour
will produce a sequence of state-transitions, yielding a sequence of rewards. The
MDP model that was explained in Section 2.1.2 is limited insofar as it assumes
each action will take a single time-step. A new theoretical model is needed to
take this difference into account.
Semi-Markov Decision Processes are an extension of the MDP model to in-
clude a concept of duration, allowing multiple-step (and indeed continuous time)
final policy is hierarchically optimal. However, to us this seems to be an inelegant way of achieving the same result as learning both a recursively optimal and a hierarchically optimal policy simultaneously, and passing control from one to the other.
actions. Formally an SMDP is a tuple 〈S, B, T, R〉, where S is a set of states, B
is a set of temporally-abstract actions, T : S × B × S × ℝ → [0, 1] is a transition
function (including duration of execution), and R : S × B × ℝ → [0, 1] is a reward
function:

T(s′, k|s, B) = P(B_t terminates in s′ at time t + k | s_t = s, B_t = B)    (2.13)
R(r|s, B) = P(r_t = r | s_t = s, B_t = B)    (2.14)
T and R must both obey the Markov property, i.e. they can only depend on the
behaviour and the state in which it was started.
A policy is a mapping π : S → B from states to behaviours. A state-value
function can be given as:
V^π(s) = ∫_{−∞}^{+∞} r R(r|s, π(s)) dr + Σ_{s′,k} T(s′, k|s, π(s)) γ^k V^π(s′)    (2.15)
Semi-Markov Decision Processes are designed to model any continuous-time
discrete-event system. Their purpose in hierarchical reinforcement learning is
more constrained. Executing a behaviour results in a sequence of primitive
actions being performed. The value of the behaviour is equal to the value of that
sequence. Thus if behaviour B is initiated in state st and terminates sometime
later in state st+k then the SMDP reward value r is equal to the accumulation
of the one-step rewards received while executing B:
r = r_t + γ r_{t+1} + γ^2 r_{t+2} + · · · + γ^{k−1} r_{t+k−1}    (2.16)
Thus the state-value function in Equation 2.15 above becomes:

V^π(s) = E{ Σ_{i=0}^{∞} γ^i r_{t+i} | ε(π, s, t) }    (2.17)
which is identical to the state-value function for primitive policies shown previ-
ously in Equation 2.4. We can define an optimal behaviour-based policy π* with
the optimal state-value function V^{π*} as:

V^{π*}(s) = max_π V^π(s)    (2.18)
Since the value measure V^π for a behaviour-based policy π is identical to
the value measure V^π for a primitive policy, we know that π* yields the optimal
primitive policy over the limited set of policies that our hierarchy allows.
2.2.5 Learning behaviours
Learning internal policies of behaviours can be expressed along the same lines.
Formally, let B.π be the policy of behaviour B, and B.A be the set of sub-actions
(either behaviours or primitives) available to B. Let Root indicate the root
behaviour, with reward function equal to that of the original (MDP) learning
task. The recursively optimal policy has:
B.π*(s) = argmax_{a∈B.A} B.Q*(s, a)    (2.19)
where B.Q*(s, a) is the optimal state-action value function for behaviour B,
according to its local reinforcement function B.r (defined by the trainer in
accordance with the behaviour’s goals).
In contrast, the hierarchically optimal policy has
B.π*(stack, s) = argmax_{a∈B.A} Root.Q*(stack, s, a)    (2.20)
where stack = {Root, . . . , B} is the calling stack of behaviours and Root.Q* is
the state-action value function according to the root reinforcement function. The
stack is a necessary part of the input to a hierarchically optimal policy, as the
behaviour may operate differently in different calling contexts. (Hierarchically
optimal policies do not allow local goals for behaviours, so B.r and B.Q* are not
defined.)
2.3 Hierarchical Reinforcement Learning in Practice
We have discussed the expected benefits of hierarchical reinforcement learning in
abstract terms without referring to any particular algorithm, to show what mo-
tivates its exploration. Historically a large number of different implementations
have been proposed (e.g. Dayan & Hinton, 1992, Lin, 1993, Kaelbling, 1993)
but only recently have they been developed into a strong theoretical framework
that has been commonly agreed upon. Even so, there are several current imple-
mentations that differ significantly in which elements they emphasise and how
they approach the problem. We shall focus on four of the most recent offerings:
SMDP Q-Learning, HSMQ-Learning, MAXQ-Q and HAMQ-Learning.
2.3.1 Semi-Markov Q-Learning
The simplest algorithm extends Watkins’ Q-Learning to include temporally ab-
stract behaviours with hard-coded internal policies. Such an approach is exam-
ined by Sutton et al. (1999) and they call these behaviours options. Assuming
these options obey the Semi-Markov property, an optimal policy can be learnt
in a manner analogous to Watkins’ Q-Learning. The algorithm is called SMDP
Q-Learning and is shown as Algorithm 2.
Algorithm 2 SMDP Q-Learning
function SMDPQ
    t ← 0
    Observe state s_t
    while s_t is not a terminal state do
        Choose behaviour B_t ← π(s_t) according to an exploration policy
        totalReward ← 0
        discount ← 1
        k ← 0
        while B_t has not terminated do
            Execute B_t
            Observe reward r
            totalReward ← totalReward + discount × r
            discount ← discount × γ
            k ← k + 1
        end while
        Observe state s_{t+k}
        Q(s_t, B_t) ←α totalReward + discount × max_{B∈B} Q(s_{t+k}, B)
        t ← t + k
    end while
end SMDPQ
Just as primitive Q-Learning learns a state-action value function, so SMDP
Q-Learning learns a state-behaviour value function Q : S × B → ℝ, which is an
approximation to the optimal state-behaviour value function Q?:
Q*(s, B) = E{ Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V*(s_{t+k}) | ε(s, B, t) }    (2.21)
where k is the duration of the behaviour B, and ε(s, B, t) indicates the event of
executing behaviour B in state s at time t.
The optimal policy is defined as before:
π*(s) = argmax_{B∈B} Q*(s, B)    (2.22)
The approximation Q(s, B) can be learnt via the update rule (analogous to the
Q-Learning update rule in Equation 2.12):
Q(s_t, B_t) ←α R_t + γ^k max_{B∈B} Q(s_{t+k}, B)    (2.23)
where k is the duration of Bt and Rt is a discounted accumulation of all single-
step reinforcement values received while executing the behaviour:
R_t = Σ_{i=0}^{k−1} γ^i r_{t+i}    (2.24)
SMDP Q-Learning can be shown to converge to the optimal behaviour-based
policy under circumstances similar to those for 1-step Q-Learning (Parr, 1998).
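The update at the heart of Algorithm 2 can be made concrete with a short Python sketch. The environment and behaviour interfaces here (`reset`, `is_terminal`, `step`, `policy`, `terminated`) are hypothetical names chosen for this illustration, not part of the thesis's framework:

```python
import random

def smdp_q_learning(env, behaviours, gamma=0.9, alpha=0.1, epsilon=0.1, episodes=100):
    """Tabular SMDP Q-Learning over temporally abstract behaviours (options)."""
    Q = {}                                  # Q[(state, behaviour name)] -> value
    def q(s, b):
        return Q.get((s, b.name), 0.0)
    for _ in range(episodes):
        s = env.reset()
        while not env.is_terminal(s):
            # choose a behaviour epsilon-greedily
            if random.random() < epsilon:
                B = random.choice(behaviours)
            else:
                B = max(behaviours, key=lambda b: q(s, b))
            # run B to termination, accumulating the discounted return R_t
            total_reward, discount, s2 = 0.0, 1.0, s
            while True:
                s2, r = env.step(B.policy(s2))
                total_reward += discount * r
                discount *= gamma           # discount = gamma^k after k steps
                if B.terminated(s2) or env.is_terminal(s2):
                    break
            # SMDP backup: R_t + gamma^k max_B' Q(s_{t+k}, B')
            target = total_reward + discount * max(q(s2, b) for b in behaviours)
            Q[(s, B.name)] = q(s, B) + alpha * (target - q(s, B))
            s = s2
    return Q
```

Note that the inner loop accumulates the discounted return and the discount factor γ^k simultaneously, exactly as Algorithm 2 does.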
2.3.2 Hierarchical Semi-Markov Q-Learning
Hierarchical Semi-Markov Q-Learning (HSMQ) (Dietterich, 2000b) is a recur-
sively optimal learning algorithm that learns reactive behaviour-based policies,
with a trainer-specified task hierarchy. As shown in Algorithm 3 it is a simple
elaboration of the SMDP Q-Learning algorithm. The SMDPQ update rule given
in Equation 2.23 is applied recursively with local reward functions at each level
of the hierarchy. TaskHierarchy is a function which returns a set of available
actions (behaviours or primitives) that can be used by a particular behaviour in
a given state. This hierarchy is hand-coded by the trainer based on knowledge
of which actions are appropriate on what occasions.
HSMQ-Learning can be proven to converge to a recursively optimal policy
with the same kinds of requirements as SMDP Q-Learning, provided also that
the exploration policy for behaviours is greedy in the limit (Singh, Jaakkola,
Littman, & Szepesvari, 2000).
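The recursive structure of HSMQ (shown as Algorithm 3) can be sketched in Python as follows. The behaviour interface (`is_primitive`, `terminated`, a local reward function `r`) and the `hierarchy(s, B)` function standing in for TaskHierarchy are assumptions of this illustration:

```python
import random

def hsmq(env, hierarchy, behaviour, s, Q, gamma=0.9, alpha=0.1, epsilon=0.1):
    """One invocation of HSMQ-Learning. Returns the list of (s, a, s')
    transitions generated while `behaviour` was executing."""
    if behaviour.is_primitive:
        s2 = env.execute(behaviour)         # execute primitive, observe s'
        return [(s, behaviour, s2)]
    def q(b, st, a):
        return Q.get((b.name, st, a.name), 0.0)
    seq = []
    while not behaviour.terminated(s):
        actions = hierarchy(s, behaviour)
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda x: q(behaviour, s, x))
        sub = hsmq(env, hierarchy, a, s, Q, gamma, alpha, epsilon)
        # accumulate the behaviour's *local* reward over the sub-sequence
        total = sum((gamma ** k) * behaviour.r(si, ai, sj)
                    for k, (si, ai, sj) in enumerate(sub))
        k, s2 = len(sub), sub[-1][2]
        # recursive SMDPQ backup with the local reward function B.r
        target = total + (gamma ** k) * max(q(behaviour, s2, b)
                                            for b in hierarchy(s2, behaviour))
        Q[(behaviour.name, s, a.name)] = (q(behaviour, s, a)
                                          + alpha * (target - q(behaviour, s, a)))
        seq += sub
        s = s2
    return seq
```

Each level of the hierarchy thus runs the SMDPQ update with its own reward function, which is what makes the learnt policy recursively rather than hierarchically optimal.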
2.3.3 MAXQ-Q
A more sophisticated algorithm for learning recursively optimal policies is Di-
etterich’s MAXQ-Q (Dietterich, 2000a). The policies it learns are equivalent to
those of HSMQ, but it uses a special decomposition of the state-action value
Algorithm 3 HSMQ-Learning
function HSMQ(state s_t, action a_t)
returns sequence of state transitions {⟨s_t, a_t, s_{t+1}⟩, . . .}
    if a_t is primitive then
        Execute action a_t
        Observe next state s_{t+1}
        return {⟨s_t, a_t, s_{t+1}⟩}
    else
        sequence S ← {}
        behaviour B ← a_t
        A_t ← TaskHierarchy(s_t, B)
        while B is not terminated do
            Choose action a_t ← B.π(s_t) from A_t
                according to an exploration policy
            sequence S′ ← HSMQ(s_t, a_t)
            k ← 0
            totalReward ← 0
            for each ⟨s, a, s′⟩ ∈ S′ do
                totalReward ← totalReward + γ^k B.r(s, a, s′)
                k ← k + 1
            end for
            Observe next state s_{t+k}
            A_{t+k} ← TaskHierarchy(s_{t+k}, B)
            B.Q(s_t, a_t) ←α totalReward + γ^k max_{a∈A_{t+k}} B.Q(s_{t+k}, a)
            S ← S + S′
            t ← t + k
        end while
        return S
    end if
end HSMQ
function in order to learn them more efficiently. MAXQ-Q relies on the obser-
vation that the value of a behaviour B as part of its parent behaviour P can be
split into two parts: the reward expected while executing B, and the discounted
reward of continuing to execute P after B has terminated. That is:
P.Q(s, B) = P.I(s, B) + P.C(s, B) (2.25)

where P.I(s, B) is the expected total discounted reward (according to the reward
function of the parent behaviour P) received while executing behaviour B from
initial state s, and P.C(s, B) is the expected total reward of continuing to execute
P after B has terminated, discounted appropriately to take into account the time
spent in B. (Again with rewards calculated according to the behaviour P.)
Furthermore, the I function can itself be recursively decomposed into I and
C via the rule:

P.I(s, B) = max_{a∈B.A} B.Q(s, a) (2.26)
This decomposition has several advantages for learning recursively optimal
Q-values; in particular, the I and C functions can each exploit state abstractions
that would not apply to the combined Q function. The explanation is complex
and beyond the scope of this review. For full details and pseudocode
see (Dietterich, 2000a).
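As a minimal illustration of how the decomposition evaluates, the recursion above can be written directly in Python. The dictionary-based tables are assumptions of this sketch; MAXQ-Q itself learns the completion values C (Dietterich, 2000a):

```python
def maxq_v(B, s, children, C, prim_r):
    """Recursive MAXQ value decomposition:
        V(B, s) = prim_r[(B, s)]                    if B is primitive
                = max_a [ V(a, s) + C[(B, s, a)] ]  otherwise
    where C[(P, s, a)] is the completion value of child a inside parent P."""
    if B not in children:                   # primitive action: expected reward
        return prim_r[(B, s)]
    return max(maxq_v(a, s, children, C, prim_r) + C[(B, s, a)]
               for a in children[B])
```

For example, with a two-level hierarchy `root → nav → {left, right}`, the value of `root` in a state is the best child's value plus the stored completion value for that child.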
2.3.4 Q-Learning with Hierarchies of Abstract Machines
Q-Learning with Hierarchies of Abstract Machines (HAMQ) (Parr, 1998; Parr &
Russell, 1998) is an hierarchically optimal learning algorithm that uses a more
elaborate model to structure the policy space. Behaviours are implemented as
hierarchies of abstract machines (HAMs) which resemble finite-state machines, in
that they include an internal machine state. The state of the machine dictates
the action it may take. Actions include: 1) performing primitive actions, 2)
calling other machines as subroutines, 3) making choices, 4) terminating and
returning control to the calling behaviour. Transitions between machine states
may be deterministic, stochastic or may rely on the state of the environment.
Learning takes place at choice states only, where the behaviour must decide which
of several internal state transitions to make. HAMs represent a compromise
between hard-coded policies and fully-learnt policies. Some transitions can be
hard-coded into the machine while others can be learnt. Thus they allow for
background knowledge in the form of partial solutions to be specified.
Behaviours in HAMQ are merely a typographic convenience. In effect they
are compiled into a single abstract machine, consisting of action nodes and
choice nodes only. Algorithm 4 shows pseudocode for learning in such a
machine.
Andre and Russell (2000) have extended the expressive
power of HAMs by introducing parameterisation, aborts and interrupts, and
memory variables. These “Programmable HAMs” allow quite complex program-
matic description of behaviours, while also providing room for exploration and
optimisation of alternatives.
2.4 Termination Improvement
Start Goal
Figure 2.4: A simple navigation task illustrating the advantage of termination improvement. The circles show the overlapping applicability spaces for a collection of hard-coded navigation behaviours. Each behaviour moves the agent towards the central landmark location (the black dots). The heavy line indicates the standard policy with commitment to behaviours. The lighter line indicates the path taken by a termination-improved policy.
In Section 2.2.2 above, we discussed the importance of long-term commitment
to behaviours. Without this, much of the benefit of using temporally abstract
Algorithm 4 HAMQ-Learning
function HAMQ
    t ← 0
    node ← starting node
    totalReward ← 0
    k ← 0
    choice a ← null
    choice state s ← null
    choice node n ← null
    while s is not a terminal state do
        if node is an action node then
            Execute action
            Observe reward r
            totalReward ← totalReward + γ^k r
            k ← k + 1
            node ← node.next
        else (node is a choice node)
            Observe state s′
            if n ≠ null then
                Q(n, s, a) ←α totalReward + γ^k max_{a′∈A} Q(node, s′, a′)
                totalReward ← 0
                k ← 0
            end if
            n ← node
            s ← s′
            Choose transition a ← π(n, s) according to an exploration policy
            node ← a.destination
        end if
    end while
end HAMQ
actions is lost. However, commitment can also be an obstacle to producing optimal
policies. Consider the situation illustrated in Figure 2.4. The task is to navigate
to the indicated goal location. Behaviours are represented by dotted circles
and black dots indicating the application space and terminal states respectively.
The heavy line shows a path from the starting location to the goal, using the
behaviours provided. The path travels from one termination state to the next,
indicating that each behaviour is being executed all the way to completion.
Compare this with the path shown by the lighter line. In this case each be-
haviour is executed only until a more appropriate behaviour becomes applicable.
“Cutting corners” in this way results in a significantly shorter path, and a policy
much closer to the optimal one.
This example is taken from the work of Sutton, Singh, Precup, and Ravindran
(1999), who call this process termination improvement. They show how to pro-
duce such corner-cutting policies using hard-coded behaviours. Having already
learnt an optimal policy π using these behaviours, they transform it into an im-
proved interrupted policy π′ by prematurely interrupting an executing behaviour
B whenever Q(s, B) < V (s), i.e. when there is a better alternative behaviour
available. The resulting policy is guaranteed to be of equal or greater efficiency
than the original.
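The interruption rule can be stated in a few lines of Python. This is a sketch of the policy transformation only, assuming a learnt table Q over (state, behaviour) pairs; the function and variable names are illustrative:

```python
def improved_step(s, current, behaviours, Q):
    """Termination-improvement rule: interrupt the executing behaviour
    whenever Q(s, B) < V(s), i.e. a strictly better behaviour is available."""
    v = max(Q.get((s, b), 0.0) for b in behaviours)  # V(s) = max_B Q(s, B)
    if current is None or Q.get((s, current), 0.0) < v:
        # switch to a greedy behaviour
        return max(behaviours, key=lambda b: Q.get((s, b), 0.0))
    return current                                   # otherwise keep committing
```

Because the executing behaviour is kept whenever it is still among the best, the transformed policy never scores worse than the committed one.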
A similar approach can be applied to policies learnt using MAXQ-Q (Di-
etterich, 2000a). While MAXQ-Q is a recursively optimal learning algorithm,
it nevertheless learns a value for each primitive action using the global reward
function. In normal execution, actions are chosen on the basis of the local Q-
value assigned to each by its calling behaviour. However once such a recursively
optimal policy has been learnt, it can be improved by switching to selecting
primitive actions based on their global Q-value instead. There is no longer any
commitment to behaviours. Execution reverts to the reactive semantics of mono-
lithic Q-learning, and the hierarchy serves only as a means to assign Q-values
to primitives. This is called the hierarchical greedy policy, and is also guaran-
teed to be of equal or greater efficiency than the recursive policy. Furthermore,
by continuing to update these Q-values via polling execution (Kaelbling, 1993;
Dietterich, 1998), this policy can be further improved.
In both these algorithms it is important that the transformation is applied only
once an uninterrupted policy has already been learnt. Without this delay the
advantages of using temporally abstract actions would be lost.
2.5 Producing the hierarchy
As stated earlier, typically the hierarchy of behaviours is defined by a human
trainer. Many researchers have pointed to the desirability of automating this task
(e.g. Boutilier, Dean, & Hanks, 1999; Hauskrecht, Meuleau, Kaelbling, Dean, &
Boutilier, 1998). This work is one approach to that problem.
Another quite different approach is the HEXQ algorithm (Hengst, 2002).
This algorithm is an extension of MAXQ-Q which attempts to automatically
decompose a problem into a collection of subproblems. Sub-problems are created
corresponding to particular variables in the state-vector. Variables that change
infrequently inspire behaviours which aim to cause those variables to change.
A similar approach is used by acQuire (McGovern & Barto, 2001). It uses
exploration to identify “bottlenecks” in the state-space – states which are part of
many trajectories through the space. Bottleneck states are selected as subgoals
for new behaviours.
Both these approaches implement a kind of blind behaviour invention, based
only on the dynamics of the world without any background knowledge. In this
thesis we take a much less radical approach, keeping the trainer but expressing
her knowledge in a more flexible form. Ultimately it would be good to have
systems which can both accept information from a trainer and discover it from
the world.
2.6 Other work
A few other reinforcement learning techniques need to be mentioned due to
their apparent similarity to the work in this thesis. In each case the similarity is
mostly superficial, but it is worth describing each to acknowledge the possibilities
of different approaches and distinguish them from my own.
2.6.1 Model-based Reinforcement Learning
Not all reinforcement learning algorithms are model-free like Q-learning. There
also exist algorithms which attempt to learn models of the transition and reward
functions T (s′|s, a) and R (r|s, a), as described in Section 2.1.2 above, and then
use these models to create policies, using value iteration (Sutton, 1990). Such
techniques have been less popular in practice as learning accurate models of T
and R has been found to be harder than learning Q-values directly.
Nevertheless these techniques have attracted sufficient interest to be applied
to the hierarchical reinforcement learning problem, and model-based hierarchi-
cal reinforcement learning algorithms exist (e.g. H-Dyna (Singh, 1992), SMDP
Planning (Sutton et al., 1999), abstract MDPs (Hauskrecht et al., 1998) and
discrete event models (Mahadevan, Khaleeli, & Marchalleck, 1997)).
While this thesis also involves the application of models to hierarchical re-
inforcement learning, the style and application of those models is significantly
different. Model-based HRL algorithms learn concrete, numerical models of ac-
tions’ effects in order to generate policies. In this work we will be using abstract,
symbolic models of actions’ purposes in order to guide exploration.
2.6.2 Other hybrid learning algorithms
Other systems exist which combine symbolic background knowledge with re-
inforcement learning in quite a different fashion to the behaviour-based model
presented here. One approach is to incorporate the symbolic knowledge into
the representation of the Q-function. For instance, the RATLE system (Maclin
& Shavlik, 1996) uses knowledge-based neural nets for this purpose. These are
recurrent neural-networks which are structured by the trainer, using a simple
programming language to represent background knowledge (Maclin, 1995).
Similarly, Relational Reinforcement Learning (Dzeroski, Raedt, & Blockeel,
1998) employs inductive logic programming to represent the Q-function as a logic
program. Symbolic background knowledge can be incorporated into the learnt
representation.
This is a radically different use of symbolic background knowledge. Both
approaches are attempting to represent concrete numeric information (the Q-
function) using abstract symbolic methods. This is an inversion of the work
in this thesis, in which we use numeric methods to represent abstract symbolic
concepts.
2.7 Reinforcement learning in this thesis
Having presented a broad overview of hierarchical reinforcement learning, we will
focus for the rest of this thesis on a few particular issues. Both new algorithms
we present will be recursively optimal algorithms based on HSMQ. HSMQ was
chosen over MAXQ-Q for its relative simplicity, although there is no obvious
reason why the algorithms could not be adapted to use the MAXQ value function
decomposition.
Recursive optimality was chosen over hierarchical optimality as the resulting
independence and re-usability of behaviours more naturally suits the symbolic-
planning framework we will be using, but again the automatic hierarchy con-
struction methods we will employ could just as well be used for hierarchically
optimal policies.
The particular issues that we shall focus on are task hierarchies and termi-
nation improvement. We aim to show how symbolic background knowledge can
automatically supply what a trainer would otherwise have to encode by hand.
2.8 Summary
This chapter showed how the agent learning problem can be cast in a subsym-
bolic fashion as an online dynamic programming problem. We presented the
standard MDP and SMDP models for reinforcement learning and hierarchical
reinforcement learning, and showed how these models can be used to produce
a variety of different algorithms depending on the choice of optimality criterion
and execution semantics. In the next chapter we will show how the same prob-
lem can be approached in quite a different fashion using the tools of symbolic
planning and knowledge refinement.
Chapter 3
Background - Symbolic Planning
In the last chapter we discussed the sub-symbolic approach to agent control. We
began with a technique that was primarily designed to learn policies without
human guidance, and we showed how it could be modified to allow it to use
high-level domain knowledge from a trainer.
The symbolic planning approach to control has the opposite problem. It
is first and foremost a technique for reasoning about action based on abstract
models of behaviour provided by a trainer. The difficulty lies in handling the
low-level intricacies of the domain, which cannot be captured in an abstract
model. In this chapter we shall first present a simple, commonly-used approach
to symbolic planning, means-ends analysis using the Strips formalism, and then
describe ways in which it has been extended to handle more complex worlds.
3.1 The symbolic planning model
At a fundamental level the problem faced by symbolic planning is much the same
as that of reinforcement learning. An agent interacts with an environment to
achieve certain goals. The objective is to produce a policy which maps states to
actions so as to achieve the agent’s goals as efficiently as possible. The method
of creating those policies, however, is significantly different.
Symbolic planning aims to construct policies (or plans) from a model of the
world provided by the trainer. The emphasis is therefore on creating a language
in which a trainer can easily specify this model, and on a reasoning process which
is logically sound. Learning by trial-and-error has also been studied, but as a
later addition to a well-established field.
3. Background - Symbolic Planning 47
Figure 3.1: An illustration of a planning agent. (Inside the agent, the planner combines a world model and goals to produce a policy, which maps states observed from the environment to actions.)
Thus the essence of the symbolic approach is the language it uses. States,
actions and goals are usually represented in the language of first-order logic (or
its equivalent). Fundamental to this is the idea of using fluents to describe
features of the agent’s state. A fluent describes a set of states in which certain
properties or relationships hold.
Consider the example world used in the previous chapter (reproduced in
Figure 3.2). What are the important abstract features of the agent’s state? The
agent’s location is one. Another is the location of the book, and of the coffee.
So the fluent location(Object, Location) might be defined to represent that
Object is in Location. Object could be robot, book or coffee. Location
could be any of the rooms. The fluent holding(varObject) will be used to
signify that the robot has the Object in its possession.
Another important feature is the topology of the world – which rooms are
connected. This can be encoded using the fluent door(Room1, Room2) to indicate
that Room1 is connected to Room2.
States can now be described by a conjunction of fluents. We shall use the
notation:
s |= f1 ∧ f2 ∧ . . . ∧ fk

to say that primitive state s models fluents f1 . . . fk, i.e. the fluents are true in
Figure 3.2: The example world again. (Rooms: study, closet, lounge, hall, bathroom, bedroom1, bedroom2, dining, kitchen and laundry. The robot starts in the study, the book is in bedroom2 and the coffee is in the kitchen.)
state s. Thus the initial state of the world, shown in Figure 3.2 is given by:
s |= location(robot, study)
∧ location(book, bedroom2) ∧ location(coffee, kitchen)
∧ door(lounge, hall) ∧ door(hall, lounge)
∧ door(hall, study) ∧ door(study, hall)
∧ door(study, closet) ∧ door(closet, study)
∧ door(hall, bathroom) ∧ door(bathroom, hall)
∧ door(hall, bedroom1) ∧ door(bedroom1, hall)
∧ door(hall, bedroom2) ∧ door(bedroom2, hall)
∧ door(hall, dining) ∧ door(dining, hall)
∧ door(dining, kitchen) ∧ door(kitchen, dining)
∧ door(dining, laundry) ∧ door(laundry, dining)
(For the sake of brevity, door fluents will henceforth be omitted from expressions
unless they are of particular importance.) Note that while this description may
not uniquely identify the state s (as there are many states that match this
description), we shall assume that it is operationally complete, that is, it encodes
all the necessary information to determine how to achieve the agent’s goals from
this state. We shall use the notation Fluents(s) to indicate the set of all fluents
which are satisfied in a given state of the world s.
Fluents are also used to represent the agent's goals. It is typically assumed
that the agent’s goal is to arrive at any of a particular subset of states of the
world, and that set can be represented as a conjunction of fluents. So, for
instance, a goal in the example world may be for the agent to fetch a cup of
coffee and bring it into the dining room. This goal can be represented as:
G = location(robot, dining) ∧ holding(coffee)
Any state s which satisfies G is a goal state.
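Representing fluents as tuples and states as sets of fluents makes the satisfaction relation s |= f1 ∧ . . . ∧ fk a simple subset test. A minimal sketch (the tuple encoding is an assumption of this illustration):

```python
# Fluents encoded as tuples: (predicate, arg1, ..., argN).
def models(state, condition):
    """s |= f1 ∧ ... ∧ fk  iff every fluent in the conjunction holds in s."""
    return set(condition) <= set(state)

initial = {
    ('location', 'robot', 'study'),
    ('location', 'book', 'bedroom2'),
    ('location', 'coffee', 'kitchen'),
}
goal = {('location', 'robot', 'dining'), ('holding', 'coffee')}
```

Under this encoding, `models(s, goal)` returns True exactly for the goal states.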
Finally we need to model the agent’s actions using the fluents. Actions are
described in terms of their effects, and the preconditions required to produce
those effects. Two kinds of effects are typically distinguished: deliberate and
accidental. The deliberate effects of an action are called its post-conditions,
the accidental effects are called side-effects. In planning, the agent should rely
only on the post-conditions of an action as its useful effects. Side-effects
are specified so that the agent may know that they might possibly occur,
and can ensure that its plans do not rely on their absence. This division is
mostly semantic – the environment makes no distinction between deliberate and
accidental effects – but it is a useful one when it comes to planning the agent's
behaviour.
Consider the Get(coffee, kitchen) action in the sample world. What are
the abstract effects of this behaviour? The deliberate effect is that the robot is
holding the coffee, i.e.:
post-condition = holding(coffee)
Under what conditions will the behaviour achieve this effect? It is expected to
work so long as both the agent and the coffee are in the kitchen, so:
precondition = location(coffee, kitchen) ∧ location(robot, kitchen)
It also has a side-effect:
side-effect = ¬location(coffee, kitchen)
as once it has been picked up, the coffee is no longer considered to be in the
room.
It should be noted at this point that these descriptions of states and actions
are significantly more abstract than the primitive states and actions used in the
formulation of the reinforcement learning problem for this same world. This is
typically the case. Symbolic planning is rarely done at such a low level. Actions
are assumed to be implemented as deterministic processes that can be described
in high-level terms. The underlying implementation issues are largely ignored, so
that actions can be considered to be deterministic, discrete, and instantaneous
and to satisfy the Markov property (i.e. involving no hidden state). Recent
planning research has relaxed these assumptions somewhat, as will be elaborated
in Section 3.2.3.
3.2 Building Plans
Given such a world model it is possible to build a policy or plan by which the
agent can achieve its goals from its initial state. A plan is a sequence of actions
to be executed in order. Each action achieves the preconditions of the next until
the final action reaches the goal. Figure 3.3 shows an illustration of a plan for
the example world. The goal, in the top node of the plan, is for the robot to be
in the dining room, holding a cup of coffee. The nodes below it represent sets of
states, as described by the conjunction of fluents in each. The arrows are actions
leading from one set of states to the next. Such a plan can be constructed by a
logical inference process from the action models.
The particulars of this planning process vary depending on, among other things,
the representation used for the action models. We do not intend to give a
comprehensive description of the state of the art in planning. For that, we refer
the reader to (Allen et al., 1990) and (Ghallab & Milani, 1996).
This thesis is not an attempt to produce an improved symbolic planner,
but rather to create a hybrid of planning and reinforcement learning. As such,
we shall focus on a simple and well-established planning algorithm (Means-
Ends analysis (Newell & Simon, 1972)) and an equally basic representation (the
Strips model (Fikes & Nilsson, 1971)). We shall consider only one recent im-
provement to this combination (Teleo-reactive planning (Nilsson, 1994)). There
have been many others, but for the sake of simplicity, we shall avoid discussing
these and focus on the techniques used in this thesis. We will reserve discussion
of how more complex planning techniques might be applied to the problem until
Figure 3.3: A plan for fetching the coffee. The goal is in the top node; each arrow is an action leading up from one set of states to the next:

    location(robot, dining) ∧ holding(coffee)    [goal]
      ↑ Go(kitchen, dining)
    location(robot, kitchen) ∧ holding(coffee)
      ↑ Get(coffee, kitchen)
    location(robot, kitchen) ∧ location(coffee, kitchen)
      ↑ Go(dining, kitchen)
    location(robot, dining) ∧ location(coffee, kitchen)
      ↑ Go(hall, dining)
    location(robot, hall) ∧ location(coffee, kitchen)
      ↑ Go(study, hall)
    location(robot, study) ∧ location(coffee, kitchen)    [initial]
the future work section in Chapter 8.
3.2.1 The Strips representation
One of the earliest and most enduring symbolic representations of actions is the
Strips representation originally proposed by Fikes and Nilsson. It represents
actions as operators with three principal components:
1. A precondition. A list of fluents that must be true for the action to be
executed.
2. An add list. A list of fluents that are true after the action has been exe-
cuted. This describes the post-conditions of the action.
3. A delete list. A list of fluents that the action may cause to become false.
These are the side-effects of the action.
Formally, an operator ⟨A.pre, A.add, A.del⟩ for an action A models the fact that
if A is executed in a state s with:

s |= ∧_{f∈A.pre} f    (3.1)

then the resulting state s′ will satisfy:

s′ |= ∧_{f∈A.add} f ∧ ∧_{f∈Fluents(s)−A.del} f    (3.2)

where Fluents(s) is the set of all fluents satisfied by s.
The first term of this equation indicates that the post-conditions of A are true
in s′. The second term indicates that any undeleted fluents from state s also
remain true in s′. This is called the frame assumption. Without this assumption
we would have to explicitly describe every fluent that was not affected by the
action, in addition to those that were. Frame assumptions are a difficult but
necessary part of reasoning about action, and much research has gone into them
(Shoham & McDermott, 1988; Hayes, 1973). We shall return to this issue later,
when we discuss learning action models in Chapter 6.
The Strips representation places important restrictions on the description
of actions. A particular operator has a single precondition for all its effects, and
both pre- and post-conditions are simple conjunctions of fluents. An action that
has multiple conditional effects must be described as several different operators.
(Henceforth we shall use preconditions, add-lists and delete-lists interchangeably
as sets of fluents and as logical conjunctions. It ought to be clear from the context
which is intended in any instance.)
It is convenient to describe families of similar operators as schemata, with
certain constants replaced by variables. Thus actions in the example world might
be described by the operator schemata:
Go(Room1, Room2)
Pre: location(robot, Room1) ∧ door(Room1, Room2)
Add: location(robot, Room2)
Del: location(robot, Room1)
Get(Object, Room)
Pre: location(Object, Room) ∧ location(robot, Room)
Add: holding(Object)
Del: location(Object, Room)
(We follow the Prolog convention of beginning variable names with capital letters
and constants with lowercase letters.) Variables in the schema name must be
bound in the process of planning in order to fully instantiate the operator. Only
fully instantiated operators can be executed.
3.2.2 Means-ends planning
Equation 3.2 allows us to form two reasoning rules. The first, called progression,
allows us to predict the effects of actions. If state s satisfies the fluents in F_before
and A.pre ⊆ F_before, then the state s′ after executing A will satisfy:

F_after = (F_before − A.del) ∪ A.add    (3.3)

Alternatively, the regression rule tells us that if A.add ⊆ F_after then action A
can be used to achieve F_after if executed in a state satisfying:

F_before = (F_after − A.add) ∪ A.pre    (3.4)

provided that A.del ∩ F_after = ∅. This second rule is important, as it is the
foundation of the planning algorithm we will employ.
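Both rules are easy to state operationally. A minimal Python sketch, treating preconditions, add lists and delete lists as sets of fluent tuples (the `Op` record and the fluent encoding are assumptions of this illustration; `dele` is used because `del` is a Python keyword):

```python
from collections import namedtuple

# A STRIPS operator: precondition, add list and delete list (sets of fluents).
Op = namedtuple('Op', ['name', 'pre', 'add', 'dele'])

def progress(f_before, op):
    """Progression (Eq 3.3): predict the fluents holding after executing op."""
    assert op.pre <= f_before, "precondition not satisfied"
    return (f_before - op.dele) | op.add

def regress(f_after, op):
    """Regression (Eq 3.4): what must hold before op so f_after holds after.
    Only valid when op deletes nothing in f_after."""
    assert not (op.dele & f_after), "op would undo part of the condition"
    return (f_after - op.add) | op.pre
```

For example, regressing the coffee-fetching goal through the final Go action recovers the condition used in the worked example below (door fluents omitted for brevity).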
Planning with Strips operators can be considered as a kind of search. A
path needs to be found between the initial state and the goal. Actions describe
transitions from one set of states to another, using the above rules of progression
and regression. Standard AI search techniques can be applied to this problem
in a variety of ways.
Perhaps the simplest approach is means-ends analysis, also known as regres-
sion planning (Newell & Simon, 1972; Georgeff, 1987). This involves a simple
breadth-first search through the state space, starting from the goal and work-
ing backwards towards the initial state, using the regression rule above to select
actions.
Here is a simple example from the example world. Let our goal be to reach
the dining room with the coffee. I.e.:
G = {location(robot, dining), holding(coffee)}
To achieve this goal we apply the regression rule above. First we find an
action which achieves one or more of the fluents in the goal. In this case, the
Go(kitchen, dining) action will serve. Then we regress the goal to find the
conditions that need to be true before the action is executed:

F_before = (F_after − Go(kitchen, dining).add) ∪ Go(kitchen, dining).pre
        = ({location(robot, dining), holding(coffee)}
           − {location(robot, dining)}) ∪ {location(robot, kitchen)}
        = {location(robot, kitchen), holding(coffee)}
The search then continues for an action which satisfies this regressed condition,
until such time as a condition is found which is satisfied by the initial state.
There are, of course, other actions which could have been chosen here in place
of Go(kitchen, dining). Building the shortest plan is thus equivalent to finding
the shortest path through a graph, and is done by breadth-first search or an
equivalent algorithm.
This algorithm is sound and complete, assuming the model itself is likewise.
The plans it generates are correct, and if a plan exists within the limitations of
the representation then it will be found.
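A minimal breadth-first regression planner following this scheme might look as follows. The operator encoding and the toy fluents in the test of this sketch are illustrative assumptions, with door fluents omitted as elsewhere in this chapter; the search assumes a finite set of fluents so that the visited-set guarantees termination:

```python
from collections import namedtuple, deque

Op = namedtuple('Op', ['name', 'pre', 'add', 'dele'])

def means_ends(initial, goal, ops):
    """Breadth-first regression planning: search backwards from the goal,
    regressing it through operators until the initial state satisfies it.
    Returns the shortest plan as a list of operator names, or None."""
    frontier = deque([(frozenset(goal), [])])
    seen = {frozenset(goal)}
    while frontier:
        cond, plan = frontier.popleft()
        if cond <= initial:                 # initial state satisfies cond
            return plan
        for op in ops:
            # op is useful if it achieves part of cond and deletes none of it
            if (op.add & cond) and not (op.dele & cond):
                before = frozenset((cond - op.add) | op.pre)
                if before not in seen:
                    seen.add(before)
                    frontier.append((before, [op.name] + plan))
    return None
```

Because the search proceeds level by level, the first condition found to be satisfied by the initial state yields a shortest plan, mirroring the breadth-first argument above.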
3.2.3 Extensions to the Strips representation
The Strips representation and the means-ends planning algorithm have been
recognised to have a number of limitations. Many improvements have been made
on these techniques in the years since they were proposed. Three improvements we
shall focus on are:
• Durative actions: the ability to model actions that are not instantaneous.
• Universal plans: the ability to construct plans that contain contingencies
for handling random execution failures.
• Hierarchical planning : the ability to improve planning efficiency by con-
structing plans at different levels of abstraction.
Representing durative actions: Teleo-reactive operators
Many accounts of the Strips planning process described above assume that
the operators are instantaneous and deterministic, and so can be executed with-
out any kind of monitoring. It is assumed that each action will terminate suc-
cessfully, achieving the precondition for the next action without any need for
verification. The original authors, however, realised that this was not so. In the
real world, actions take time to execute and may fail to perform as intended,
and this needs to be taken into account.
A neglected element of the original Strips research is the PLANEX plan-
execution monitor (Fikes, 1971, Fikes, Hart, & Nilsson, 1972). This program
monitored the state of the world at each stage of the plan to ensure that it
matched the preconditions for the subsequent actions. If ever this was not true,
PLANEX would search backwards through the plan until it found a step in the
plan for which the preconditions were satisfied. Thus it could recover from small
errors by simply going back a few steps in the plan. If none of the prior steps
were satisfied, it would then resort to constructing a new plan.
The Teleo-reactive formalism developed by Nilsson and Benson (Nilsson,
1994; Benson & Nilsson, 1994; Benson, 1996) extends the Strips representa-
tion to more explicitly capture these features. An action in this formalism is
represented by a teleo-operator (or Top). A Top is much like a Strips opera-
tor. It has a pre-image, a post-condition and a set of side-effects, much like the
precondition, add and delete lists of a Strips operator. However the semantic
interpretation of these parts contains an important difference. A Strips oper-
ator is assumed to model an instantaneous action, whereas a Top represents
the continuous execution of a durative action until it reaches its post-condition.
Termination may not be immediate, but is guaranteed to eventually occur. To
achieve that post-condition, the Top must be initiated in a state satisfying its
pre-image. The Top is then guaranteed to remain inside that pre-image until
it terminates on achieving the post-condition. The side-effect list contains those
fluents whose truth value may change during the execution of the Top.
This definition is convenient as it allows temporally abstract actions to be
used in planning without any significant changes to the planning algorithm.
Means-ends analysis can be applied directly to teleo-operators to produce plans
containing durative actions.
Execution of a teleo-reactive plan must be monitored. An action must be
terminated when it achieves its post-condition. Furthermore, monitoring allows
the detection and handling of random execution failures. Execution of a teleo-
reactive plan resembles the circuit semantics of behaviour-based programming
(Brooks, 1986; Maes, 1990; Kaelbling & Rosenschein, 1990). The plan is treated
as an ordered sequence of production rules from states to actions. For example,
the plan in Figure 3.3 can be represented as:
location(robot, dining), holding(coffee) → terminate
location(robot, kitchen), holding(coffee) → Go(kitchen, dining)
location(robot, kitchen), location(coffee, kitchen) → Get(coffee, kitchen)
location(robot, dining), location(coffee, kitchen) → Go(dining, kitchen)
location(robot, hall), location(coffee, kitchen) → Go(hall, dining)
location(robot, study), location(coffee, kitchen) → Go(study, hall)
(For brevity’s sake the door fluents have been omitted.) This plan is executed
reactively with continuous feedback. At any instant the action executing is
dictated by the first rule in the list with its left-hand side satisfied. Such a rule
is said to be active. Execution is expected to proceed up the list as each action
is performed, but it is recognised that actions may occasionally fail or external
events may cause unexpected changes in the state of the world. Such execution
failures are accommodated as execution immediately jumps up or down the list
to the appropriate rule. Say, for example, the robot was executing the second
rule in the list, carrying the coffee from the kitchen into the dining room, when
the coffee was accidentally spilled. Because the plan is constantly monitored, execution would
immediately drop back to the rule below it and the robot would go and fetch
another cup of coffee.
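The rule-list semantics can be sketched directly: at every tick the first rule whose left-hand side holds in the current state dictates the action. The conditions below are the first four rules of the plan above, with states represented as sets of ground fluents.

```python
# A teleo-reactive plan as an ordered list of (condition, action) rules,
# scanned top-down for the first rule whose left-hand side is satisfied.
PLAN = [
    ({"location(robot,dining)", "holding(coffee)"}, "terminate"),
    ({"location(robot,kitchen)", "holding(coffee)"}, "Go(kitchen,dining)"),
    ({"location(robot,kitchen)", "location(coffee,kitchen)"}, "Get(coffee,kitchen)"),
    ({"location(robot,dining)", "location(coffee,kitchen)"}, "Go(dining,kitchen)"),
]

def active_action(state, plan=PLAN):
    """Return the action of the first active rule, or None if none matches."""
    for condition, action in plan:
        if condition <= state:
            return action
    return None
```

Because the scan is repeated at every tick, an execution failure simply causes a different rule to become the first active one; no explicit failure handling is needed.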
Universal Plans
Plan monitoring provides some robustness to execution failures, but can only
handle situations that already exist in the plan. What would happen in the
above example if the agent were to accidentally enter the laundry rather than
the kitchen? The plan contains no productions which match this scenario, so
execution would fail.
To make truly robust plans contingencies need to be added to handle such
circumstances. One possibility is to create plans which contain paths to the goal
from every possible state. Such plans are called universal (Schoppers, 1987).
Universal plans, combined with plan monitoring, allow the agent to recover from
arbitrary execution failures, as the plan will contain an appropriate course of
action regardless of what circumstances may arise. A universal plan is best pic-
tured as a tree, with the goal at the root and many paths converging towards
it. A universal plan for the coffee-fetching task is illustrated in Figure 3.4.
The teleo-reactive execution formalism can also be applied to executing such
plan trees. Just as the list of rules was scanned from top to bottom for the first
active rule, so the plan tree can be searched in a breadth-first fashion to find the
shallowest active node. This node dictates the action to be executed. (If ever
there are two active nodes at the same depth then ties are broken randomly).
So, in answer to the above scenario, if the robot were to accidentally enter
the laundry while executing the Go(dining, kitchen) behaviour then control
would pass down to the left child node, and the robot would begin executing
Go(laundry, dining) in order to recover from the mistake.
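Breadth-first selection of the shallowest active node can be sketched over a small fragment of the tree in Figure 3.4. The exact node conditions and child structure here are assumptions, and the random tie-breaking between equally deep nodes is omitted.

```python
from collections import deque

# A fragment of a universal plan tree: nodes are (condition, action, children).
laundry = ({"location(robot,laundry)", "location(coffee,kitchen)"},
           "Go(laundry,dining)", [])
hall = ({"location(robot,hall)", "location(coffee,kitchen)"},
        "Go(hall,kitchen)", [])
dining = ({"location(robot,dining)", "location(coffee,kitchen)"},
          "Go(dining,kitchen)", [laundry, hall])
kitchen = ({"location(robot,kitchen)", "location(coffee,kitchen)"},
           "Get(coffee,kitchen)", [dining])
carrying = ({"location(robot,kitchen)", "holding(coffee)"},
            "Go(kitchen,dining)", [kitchen])
ROOT = ({"location(robot,dining)", "holding(coffee)"}, "terminate", [carrying])

def shallowest_active(state, root=ROOT):
    """Breadth-first search for the shallowest node whose condition holds."""
    frontier = deque([root])
    while frontier:
        condition, action, children = frontier.popleft()
        if condition <= state:
            return action
        frontier.extend(children)
    return None
```

In the scenario above, a robot that strays into the laundry activates the `Go(laundry,dining)` node, recovering without replanning.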
In practice universal planning is rarely done. It is costly to perform as it
requires exhaustive enumeration of all possibilities. A simpler alternative is to
make best use of the information gathered in the normal planning process. While
searching for a plan from a particular state, many false paths are explored. These
paths can be stored as contingencies in a plan tree with little extra cost. If an
execution failure places the agent in a state not already covered by any node of
the tree, then the tree can be expanded until the new state is covered. We shall
call such plans semi-universal.
[Figure 3.4 appears here: a tree of plan nodes rooted at the goal (location(robot, dining), holding(coffee)), with actions such as Go(kitchen, dining), Get(coffee, kitchen), Go(dining, kitchen), Go(laundry, dining), Go(hall, kitchen), Go(study, hall), Go(lounge, hall) and Go(closet, study) on paths converging towards it.]
Figure 3.4: A universal plan for fetching the coffee. Dotted arrows show where additional nodes have been omitted to save space.
Hierarchical Planning
In a moderately complex environment the depth and the branching factor of the
plan trees can become quite large, making the planning process intractable. The
planner can waste much time and effort ordering the minute details of the plan.
More progress could be made by establishing the broad strokes of the plan first,
and then filling in the details later. For example, a plan to travel from Sydney
to London would not be constructed on a per-footstep basis. Rather the flights
would be arranged first, then the details of catching each flight filled in later. A large
number of false paths could be avoided in this fashion. This is the intuition
behind hierarchical planning (Sacerdoti, 1974; Rosenschein, 1981; Korf, 1985;
Iba, 1989).
A hierarchical planner typically defines operators at various levels of ab-
straction. The most abstract macro-operators describe large-scale movements
through the state space. A macro-operator A has an internal plan A.plan which
implements A in terms of finer-grain operators. This proceeds recursively until
ultimately all operators are described in terms of concrete actions that can be
directly executed.
A further advantage of hierarchical planning is that each macro-operator can
be treated as an independent planning problem. Not only does this reduce the
overall number of action ordering considerations, but it also allows for re-use.
Once a plan has been constructed for a macro-operator it can be stored in a plan
library and used whenever the macro-operator is needed.
The teleo-reactive planning formalism can be extended to incorporate hierar-
chical plans. Teleo-operators can be written to describe macro-actions which are
in fact implemented as plans of finer-grain actions. Executing a macro-action A
simply means executing its internal plan, A.plan, according to the teleo-reactive
semantics. As long as the external plan continues to recommend A, then A’s
internal plan continues to execute. Should the external plan require that A stop
executing at any time, then A.plan and all actions below it in the hierarchy are
immediately interrupted.
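This recursive execution can be sketched as repeated resolution: on every tick the hierarchy is re-expanded from the top, so a change of recommendation in an outer plan automatically interrupts everything beneath it. The macro-action and its internal plan below are hypothetical.

```python
# Hypothetical macro-action: FetchCoffee and its internal teleo-reactive
# plan, an ordered list of (condition, action) rules.
MACROS = {
    "FetchCoffee": [
        ({"holding(coffee)"}, "terminate"),
        ({"location(robot,kitchen)"}, "Get(coffee,kitchen)"),
        ({"location(robot,dining)"}, "Go(dining,kitchen)"),
    ],
}

def resolve(action, state):
    """Expand macro-actions until a primitive action (or None) remains.
    Because this runs afresh on every tick, interrupting an outer plan
    implicitly interrupts the whole stack of internal plans below it."""
    while action in MACROS:
        action = next((a for cond, a in MACROS[action] if cond <= state), None)
    return action
```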
3.3 Handling Incomplete Action Models
One of the most significant obstacles to applying planning algorithms to real
world problems is the need for the action model to fully specify the outcome of
each action, both in terms of what it does and does not do. The correctness
of Strips (and Top) planning relies on each operator to specify every fluent
that could possibly be changed by the action – not only the immediate intended
effects, but also those that are unintended, and anything that might logically
follow from either. Even in a domain of only moderate complexity this can
result in quite lengthy descriptions. Omissions can lead to incorrect plans that
do not achieve their goals, or the inability to produce a plan at all.
This situation can be improved by augmenting the world model with a theory
of logical relationships between fluents. A more powerful planner could apply
an action to indirectly achieve fluents implied by its post-condition, but not
included in it. This would allow post-conditions to be expressed more compactly,
and side-effects similarly. Implementations of such ideas exist (Reiter, 1987) but
have significant limitations in how they can be applied without threatening the
soundness of the planning process.
Ultimately it is inevitable that some aspects of a complex world model will
be omitted. The desire for autonomy in our agents leads us to consider how they
might overcome this obstacle of their own accord.
3.4 Learning Action Models
The problem of incompletely specified world models has driven research into
agents that can learn missing information autonomously and correct errors in
the model through experience of interacting with the world. There is no single
standard approach to this problem, rather it has been attacked from a variety of
angles by assorted researchers. The problem can either be regarded as learning
an action model from scratch (Benson, 1995; Shen, 1993; Wang, 1996; Oates &
Cohen, 1996; Lorenzo & Otero, 2000) or correcting an existing trainer-specified
model (Gil, 1994; desJardins, 1994). Further distinctions can be drawn on the
basis of how the input data for learning is generated, what kind of information
is learnt (including how it is represented and how noise is handled) and how
learning is done.
3.4.1 Generating data
If the model is incomplete, then the agent must learn through interaction with
its environment. There are several possible ways it can do this, ranging from
passive to active. Data can be generated by:
• Observing an expert controlling the agent
• Executing plans generated by the agent and observing their success or
failure.
• Deliberate exploration and experimentation to test particular parts of the
model.
Observing an expert is the most passive source of information. This is the
necessary starting point for most systems that learn without a prior model (eg:
Benson, 1996; Lorenzo & Otero, 2000; Wang, 1996; Oates & Cohen, 1996), as
they have no other means to direct their exploration of the world (other than
choosing actions randomly). Many examples of the effects of actions can
be gathered in this way, particularly if the expert chooses training examples
deliberately with an eye to their learning potential. However the wealth of
information can lead to a lack of focus in the learning process. Every new effect
observed can spawn a new learning task to predict its cause. The agent has no
way to determine which effects should be regarded as important or unimportant.
The semantic division between deliberate post-conditions and accidental side-
effects is not available.
If a partial model is already available, then the agent has more structure
within which to categorise its experiences. It can build plans in the existing
model and then test to see whether they operate as expected. If the model
is not accurate enough then failures will occur, either in the planning or in
the execution phase. Particular kinds of failures in the plans are indicative of
particular kinds of errors in the model. Evidence gathered from such failures
can be used to successively refine the world model until the plan is successful.
Systems that implement such learning are Trail (Benson, 1995; Benson, 1996),
Live (Shen, 1993; Shen, 1994), Observer (Wang, 1996), Expo (Gil, 1994) and
CAP (Hume & Sammut, 1991).
It is possible to go further than this and allow the agent to explicitly set out
to perform experiments to test certain parts of its world model. This can include
deliberately attempting to use an action outside of its learnt precondition to see
if it is too specific, or else in different parts of the precondition to test where it
might be faulty. In this way the actor can gather examples that may not arise
in the ordinary course of execution. This is a particular focus of Expo, Live
and CAP.
3.4.2 What is learnt
There are essentially only two things to learn: the effects of an action and the
conditions that are necessary to produce them. However the representation used
for actions strongly influences exactly what kinds of models can be learnt. Most
systems use Strips-like operators and so are limited to learning an action’s
post-conditions and their associated precondition. A post-condition is identified
as any fluent that changes truth value on the execution of the action. As the dis-
tinction between a post-condition and a side-effect is a semantic one, not present
in the world, all such effects are generally treated as post-conditions, unless in-
telligent data gathering allows a better classification. To find the precondition
for this effect the agent can form a generalisation of all the states in which the
action produced the effect.
Recent work has turned to more complex representations. Trail uses teleo-
operators to model actions. This allows multiple operators to be learnt for each
action, based on different post-conditions. To learn the pre-image of each oper-
ator the agent must generalise over the sequences of states in which the action
is executed, distinguishing those that lead to the post-condition in question and
those that do not.
A more sophisticated representation is used by Lorenzo and Otero. They
represent the effects of actions in the language of the Situation Calculus (Mc-
Carthy & Hayes, 1969). This language expresses the effects of behaviours in
terms of a logic program, with independent rules describing when each of the
post-conditions and side-effects of a behaviour may occur. This representation
is more expressive than the simple Strips notation, and allows the planner to
construct more accurate plans, only taking into account particular effects when
they are expected to occur. Lorenzo and Otero’s (apparently nameless) system
learns such logic programs.
3.4.3 How to learn
Discovering the effects of actions is not difficult. It merely requires observing
what fluents change when the action is executed. Distinguishing which effects
are relevant and which are irrelevant is harder, and learning the necessary pre-
conditions to produce these effects is harder still. How this is done depends on
the assumptions made in the chosen representation.
The Strips representation assumes that actions are deterministic and can
only be executed when their precondition is true. Thus learning preconditions
in the Strips model is generally done by making simple generalisations over the
states in which the actions work. If an action does not work in a given state within
its precondition then the precondition must be too general. It can be made more specific by
comparing the unsuccessful case with an earlier successful case. As execution
is assumed to be noise free, the explanation for the failure must lie in the
difference between the two cases.
Two approaches that go beyond this are Benson’s Trail and Lorenzo and
Otero’s system. They both assume that there may be noise in the data, that
actions may sometimes fail or succeed randomly. Learning preconditions in such
a domain requires more examples, both positive and negative, to be gathered.
Both of these systems use Inductive Logic Programming (variants of Dinus
and Progol respectively) to produce descriptions of the preconditions from the
noisy data.
3.5 Inductive Logic Programming
We need to briefly explain what Inductive Logic Programming (ILP) is, and how
it applies to this work. ILP is a kind of machine learning. Like other machine
learning algorithms, such as decision tree learning or nearest-neighbour methods,
ILP endeavours to build a “classifier” which can accurately classify positive and
negative examples of a target concept. Unlike these other approaches, ILP uses
first-order logic programs to represent facts and hypotheses. Thus it is useful for
learning relationships between different elements of an example, where attribute-
value learners cannot.
Formally, we are given a set of training examples E, consisting of true E+
and false E− ground instances of an unknown target concept, and background
knowledge B defining predicates which provide additional information about the
arguments of the examples in E. The aim is to find a hypothesis H which
accurately classifies the examples in E. I.e.:
H ∪ B |= e if e ∈ E+
H ∪ B ⊭ e if e ∈ E−
Many hypotheses may fit this requirement for a given training set. The aim
is to find one which will generalise in such a way as to accurately classify unseen
examples. This is principally achieved by trying to find the simplest possible
hypothesis which classifies the examples. Measures of simplicity are manifold
and vary from algorithm to algorithm.
A simple example (drawn from (Lavrac & Dzeroski, 1994)) will help to illus-
trate the process. Say our target concept is to learn the “daughter” relationship.
Our training set consists of a set of ground instances classified as positive and
negative examples:
E+ : daughter(mary, ann)
daughter(eve, tom)
E− : daughter(tom, ann)
daughter(eve, ann)
Furthermore, we have background knowledge which describes the family rela-
tionships and gender of the people in the example set:
B : parent(ann, mary)
parent(ann, tom)
parent(tom, eve)
parent(tom, ian)
female(ann)
female(mary)
female(eve)
From this input we aim to generate a hypothesis H which accurately classifies
the training set. Hypotheses are typically expressed as sets of Horn clauses
(implications whose body is a conjunction of first-order literals and whose head
is a single positive literal). For this
example, a single clause will suffice:
H : daughter(X, Y) ← female(X), parent(Y, X)
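The induced clause can be checked against the training set mechanically. The sketch below only tests the coverage of H given B over the examples above; searching the space of candidate clauses is the ILP algorithm's job.

```python
# Background knowledge B as ground facts.
parent = {("ann", "mary"), ("ann", "tom"), ("tom", "eve"), ("tom", "ian")}
female = {"ann", "mary", "eve"}

def daughter(x, y):
    """The hypothesis H: daughter(X, Y) <- female(X), parent(Y, X)."""
    return x in female and (y, x) in parent

positives = [("mary", "ann"), ("eve", "tom")]
negatives = [("tom", "ann"), ("eve", "ann")]

# H covers every positive example and rejects every negative one.
covered = all(daughter(x, y) for x, y in positives)
rejected = not any(daughter(x, y) for x, y in negatives)
```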
We are not endeavouring to produce a new ILP algorithm in this thesis;
rather we shall be using it as a tool, so we shall avoid delving into its internal
implementation and describe it only from an external perspective. For a more
comprehensive tutorial on ILP, see (Lavrac & Dzeroski, 1994). There are many
different ILP algorithms, with many different features. Some that are important
are:
• Noise handling: The ability to handle misclassifications of examples in
E is an important part of any machine learning algorithm. There is a
trade-off between producing an accurate classifier and producing a simple
hypothesis.
• Background knowledge: The background knowledge B can be either exten-
sional or intensional. Extensional background must be enumerated as a
list of grounded atoms (as in the example above). Intensional background
knowledge can be expressed as a logic program, and thus can be both more
complex and more succinct.
• Incrementality: The learning process can be classified as batch or
incremental. A batch-mode learning algorithm is designed to be run once with
all available examples to produce a single hypothesis. An incremental al-
gorithm is designed to be run repeatedly, refining its hypothesis as more
and more examples become available.
We will be using ILP for action-model learning. This is appropriate for this
task as the language of action models is inherently first-order and relational.
The algorithm we use will need to be noise-resistant as the effects of actions
will often be non-deterministic. Background knowledge will be drawn from the
fluents used in planning, some of which will have intensional definitions, so an
algorithm that allows intensional background knowledge is preferable. Finally,
the process of learning an action model is intrinsically incremental. As the agent
explores the world and gathers more examples, we wish to be able to refine the
hypotheses it forms.
Unfortunately no single algorithm could be found which satisfied all these re-
quirements. In particular, incremental algorithms are rare and only just becom-
ing the object of much scrutiny (Shapiro, 1981; Muggleton & Buntine, 1988;
Taylor, 1996; Westendorp, 2003). Instead we settled on the batch learning ILP
algorithm Aleph (Srinivasan, 2001a), which has good noise-handling and uses
intensional background knowledge. In Chapter 6 we describe how this algorithm
was adapted to provide a very simple kind of incrementality.
3.6 Other related work
As in the previous chapter, there are some other areas of research which bear
some external similarity to the work in this thesis. These deserve mention, if
only to show how they differ from what we are doing.
3.6.1 Explanation Based Learning
Much of the research into action model learning stemmed from another area of
research called explanation based learning (EBL). Explanation based learning is
a process for learning rules to speed up logical inference in a problem solver (eg:
LEX (Mitchell, Utgoff, & Banerji, 1984), SOAR (Laird, Rosenbloom, & Newell,
1986) and Prodigy (Minton, 1988)). It does this by observing patterns in the
reasoning process, generalising them and attempting to apply them to similar
situations in other problems.
EBL can be applied to the symbolic planning process to automatically gen-
erate macro-operators by analogy from one planning problem to another (Car-
bonell, 1984). While this bears some similarity to the action model learning
described above, the content of what is being learnt is significantly different.
EBL algorithms learn to summarise information already present in the agent’s
knowledge. So, for example, if the action model already contains operators which
can be applied to solve a particular subpart of a plan then EBL could be used to
tag that sequence of operators as a potentially useful macro, and even generalise
it for use in a variety of similar situations. It cannot, however, add new effects
to operators which are not present in the model. This is the key difference:
EBL learns about the agent’s internal model by observing the planning process.
Action model learning learns about the external world, by interacting with it.
For an excellent review of explanation based learning research, see (Diet-
terich, 1996).
3.7 Planning in this thesis
As mentioned above, the representation and algorithm used for planning in this
thesis will be kept fairly simple, so as not to distract from the main thrust of the
work, which is the hybridisation of reinforcement learning and planning. We shall
be using teleo-operators to represent actions and constructing (semi-)universal
plans using means ends analysis. We will concentrate first on using a single level
of planning, and then extend our algorithms to incorporate hierarchical planning
using operators at various levels of granularity.
Action model learning will be limited to improving an existing model by
learning to predict unexpected side-effects that arise while executing plans, with
a limited amount of deliberate exploration. Both the operator representation
and the planning algorithm will be extended to allow conditional side-effects
and build plans which can avoid them.
We will be using the ILP algorithm Aleph to learn the circumstances under
which side-effects arise. As the actions we will be modelling are reinforcement-
learnt behaviours, the model learning process will need to be noise-tolerant and
incremental.
3.8 Summary
This chapter has explained how an agent can build plans to solve control prob-
lems using a symbolic model of the effects of actions provided by a trainer. We
have discussed how such a model might be represented, and how to build plans
which include durative actions, fault-tolerance and hierarchical structure. When
the action model is incomplete, methods exist for improving and repairing it
based on experience drawn from interacting with the world.
In the next chapter we shall consider how the work from this chapter and the
last can be combined into a single representation for both reinforcement learning
and symbolic planning, combining the strengths of each to overcome the other’s
weaknesses.
Chapter 4
A Hybrid Representation
In the introduction to this thesis we argued that a hybrid approach, combin-
ing symbolic planning and reinforcement learning, would allow us to produce
agents which could interact with complex environments more effectively than
those based on either approach alone. The background chapters have described
the representations used by each approach. These representations have much
in common, sharing a common task, but they also have significant differences.
In this chapter we aim to resolve these differences to produce a common repre-
sentation for states, goals and abstract behaviours which can be used for both
planning and reinforcement learning.
4.1 Representing states
The first important element of each approach is the representation of the agent’s
state. State is the combination of all those properties of the agent and the world
which are necessary to determine appropriate action. We will need to distinguish
between primitive and abstract representations of state. The primitive descrip-
tion of state contains all the concrete details which uniquely identify a particular
situation. Abstract state descriptions will be used to represent sets of primitive
states in terms of the high-level features and relationships they model.
Primitive state is used for concrete decision making in reinforcement learning,
and as the basis for defining fluents for abstract state descriptions. Abstract
state is used to represent the pre- and post-conditions of behaviours, and also
the agent’s goals. Planning will use the agent’s abstract state to decide which
behaviours are appropriate.
4. A Hybrid Representation 69
4.1.1 Instruments: Representing primitive state
In the theory presented in Chapters 2 and 3 we treated primitive states anony-
mously, simply assuming that there was a finite set of discrete states S =
{s1, s2, . . .}. In practice however, these states are generally constructed from a
collection of different sensors and other sources of information about the world.
So a single state is typically represented as a vector of different state variables.
For example, in the grid-world problem each primitive state would be a vector
of four elements: the robot’s x and y coordinates on the map, the location of
the coffee and the location of the book. Each of these variables can take on
a finite number of possible values, resulting in a finite set of primitive states.
(In some domains the state variables are continuous-valued rather than discrete.
In these cases we shall assume that an appropriate discretisation exists. While
recognising that this assumption may result in hidden-state problems, we do not
intend to address them in this work.)
We shall represent state variables symbolically as named functions called in-
struments. An instrument i returns the current value of its related state-variable.
Families of related instruments can be represented as parameterised schemata.
For instance, in the example world from the previous chapters, rather than defin-
ing separate instruments location coffee and location book to return the lo-
cations of the coffee and the book, a single instrument schema location(Object)
is defined which returns the location of the Object. An instrument schema must
yield an output for every instantiation of its parameters.
Explicitly named instruments and instrument schemata allow the possibility
of creating parameterised behaviours with state-representations as functions of
the parameters. This will be explained in more detail in Section 4.3.3 below.
In addition to the various sensors available to the agent, one special instru-
ment schema is required to define parameterised behaviours. This is the identity
function id(X) = X which outputs the value of its input parameter X regardless
of the state.
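Instruments can be modelled as named functions of the primitive state vector. This is a sketch only; the state layout below is an assumption based on the grid-world example.

```python
# A primitive state as a vector of state variables (hypothetical layout
# for the grid world: robot coordinates plus object locations).
state = {"robot_x": 3, "robot_y": 1, "coffee": "kitchen", "book": "study"}

def location(obj, s):
    """Instrument schema location(Object): the current location of Object.
    A schema must yield an output for every instantiation of its parameter."""
    return s[obj]

def identity(x, s):
    """The id(X) schema: returns its parameter regardless of the state."""
    return x
```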
4.1.2 Fluents: Representing abstract state
Fluents are first-order predicates that represent abstract features of the state.
Fluents may be defined intensionally, as relationships between various instrument
values, or extensionally as facts about the world that are independent of the
primitive state. An example of the former is the location fluent, which is
defined in terms of the instruments that output the robot’s x,y-coordinates.
The door fluent on the other hand is extensionally defined.
Fluents have mode and type information which will be used in the planning
and reflecting processes. Each parameter of a fluent is marked as either an
input or an output. A fluent may be queried with any of its output variables
existentially quantified, but all input variables must be bound.
Mode information has particular importance when building plans. A plan
node may contain fluents with unbound variables, but only if the modes of the
fluents permit. So for example, the fluent location may have mode:
location(+Object,−Room)
indicating that Object is an input variable and Room is an output variable. So
a plan node could contain a condition like:
location(robot, Room)
in which Object is bound but Room is unbound, but a node condition like:
location(Object, kitchen)
would be invalid, as Object is an input variable, and must be bound.
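A sketch of this mode check, assuming the Prolog-style convention that variables are capitalised and constants are lower-case:

```python
# Modes for each fluent: '+' marks input arguments (must be bound),
# '-' marks outputs (may be left as unbound variables).
MODES = {"location": ("+", "-")}   # location(+Object, -Room)

def is_variable(term):
    """Prolog-style convention: variables start with an upper-case letter."""
    return term[:1].isupper()

def valid_condition(fluent, args):
    """A plan-node condition is valid only if no input argument is unbound."""
    return all(mode != "+" or not is_variable(arg)
               for mode, arg in zip(MODES[fluent], args))
```

Under these modes, `location(robot, Room)` passes the check while `location(Object, kitchen)` is rejected.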
Type information defines the types of constants that can be used to bind
each variable. Type information is only used in reflection and is discussed in
detail in Section 6.5.1.
4.2 Representing goals
Having defined a language for describing the agent’s state, we now need to de-
scribe its goals. Symbolic planning is primarily focused on goals of achievement,
that is problems in which the goal is to achieve a certain state of the world (or
one of a set of states). The aim is to produce the most efficient policy to achieve
this.
Goals of achievement are also common in reinforcement learning tasks, but
the reward-based goal specification is much more general than that and can
include goals of maintenance (in which a certain condition is to be maintained)
and other kinds of optimal behaviour. Symbolic planning, on the other hand,
is focused primarily on goals of achievement, with less attention to the other
varieties. For this work we shall focus on goals of achievement alone, as in
practice they are the most common kind of goal, and they will require less
complexity in the hybrid model. We will reserve discussion of other kinds of
goals to the future work section in Chapter 8.
Therefore we shall assume that our goal is to achieve one of a set of goal
states G ⊂ S, as economically as possible. Furthermore we will require that this
G can be described as a conjunction of fluents G such that:
s ∈ G iff s |= G
To represent this goal as a reinforcement learning task it is necessary to
define a reward function. Many different reward functions might be used for this
purpose (see (Koenig & Simmons, 1993) for a good summary of such functions).
The principal differences between them lie in the relative values of the initial and
final Q-values. If the Q-values are initialised below their expected optimal values,
this is called pessimistic initialisation. Initialising them above the expected
optimal values is called optimistic initialisation. There are known advantages
and disadvantages to each (Hauskrecht et al., 1998), but ultimately the choice
is fairly arbitrary. For this work, we have chosen the pessimistic option by
initialising all Q-values to zero and using a reward function that rewards the
agent when it reaches the goal:
r(s, a, s′) =
    1 if s′ |= G
    0 otherwise    (4.1)
We could have just as easily chosen an optimistic function which punishes the
agent for every action which does not reach the goal. This would affect the time taken
to learn the policy, but not the correctness of any of the algorithms presented.
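As a concrete illustration, the reward function of Equation 4.1 might be sketched as follows. The `satisfies` predicate is a stand-in for the model check s′ |= G; the dictionary state encoding is assumed for illustration:

```python
# A minimal sketch of the pessimistic goal-of-achievement setup: Q-values are
# initialised to zero, and the only reward is +1 on entering a goal state.

def make_achievement_reward(satisfies):
    """Build r(s, a, s') from a goal predicate on the successor state."""
    def r(s, a, s_next):
        return 1.0 if satisfies(s_next) else 0.0
    return r

# Toy example: goal is "robot in kitchen".
goal = lambda s: s["robot"] == "kitchen"
r = make_achievement_reward(goal)
assert r({"robot": "hall"}, "e", {"robot": "kitchen"}) == 1.0
assert r({"robot": "hall"}, "e", {"robot": "hall"}) == 0.0
```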
4.3 Representing actions
We shall distinguish between two different kinds of actions: primitive actions and
abstract actions (behaviours). Primitive actions are concrete, discrete, single-
timestep actions which obey the Markov property. They are assumed to be
low-level operations and so are not modeled. The set of primitive actions is
denoted by P and is assumed to be finite. The set of abstract actions is denoted
by B, and the set of all actions primitive and abstract is A = P ∪ B.
The representation of abstract actions is the key to the hybridisation of plan-
ning and learning. Each behaviour has a defined purpose, specified by the trainer.
This purpose dictates when the behaviour might be used and what it achieves.
In planning jargon, these correspond to the pre-image and post-conditions of the
behaviour. In hierarchical reinforcement learning, they are the application space
and the local reward function. Our aim is to produce a single representation
which serves both these purposes.
4.3.1 Reinforcement-Learnt Teleo-operators
A Reinforcement-Learning Teleo-operator (RL-Top) is a representation of a
durative behaviour with a fixed purpose, but with a policy that is learnt by
reinforcement learning. It is based on the Teleo-operator symbolic representation
of actions (Nilsson, 1994, Benson & Nilsson, 1994). An RL-Top for a behaviour
B has a pre-image B.pre and a post-condition B.post, each of which is specified
as a list of fluents. However, unlike a teleo-operator, an RL-Top does not model
the actual operation of the behaviour, but its intended operation.
An RL-Top 〈B.pre, B.post〉 represents the fact that the intended operation
of B will result in B.post becoming true if the behaviour is initiated from a
state satisfying B.pre. The action is durative, i.e. it takes several time steps
to complete, so B.pre should be maintained until such time as B.post becomes
true. An RL-Top does not include a delete-list, and the representation makes no
claims about the side-effects of the behaviour. It will be assumed that behaviours
have no side-effects that do not immediately follow from their post-conditions,
until such time as such effects are observed. (More on this in Chapter 6.)
For example, the Go(hall, dining) behaviour from the gridworld in Sec-
tion 2.2.1 might be described as an RL-Top:
Go(hall, dining)
pre: location(robot, hall)
post: location(robot, dining)
This describes a behaviour which is applicable whenever the robot is in the hall,
and has a local goal which is satisfied when the robot enters the dining room.
The other possible effects of this behaviour are not defined.
This symbolic representation can be used to make plans which dictate the
appropriate actions to use in a situation, assuming they operate according to
their intended purpose. It can also be used to learn concrete policies for the
behaviours, by converting B.pre and B.post into a behaviour-specific reward
function.
Each behaviour B has an internal policy B.π which implements it. This
policy is a mapping from states to actions: B.π : S → B.A, where B.A is a
subset of A, specific to B. (We shall assume initially that B.A ⊂ P. Later we will
expand this to allow multiple levels of hierarchy.) The set B.A is specified by the
trainer for each behaviour B, based on her determination of which primitives are
appropriate for that behaviour. With primitive behaviours added the definition
of Go(hall, dining) becomes:
Go(hall, dining)
pre: location(robot, hall)
post: location(robot, dining)
P: { n, ne, e, se, s, sw, w, nw }
Our aim is for each behaviour to learn a policy which satisfies its intended
purpose. We do so by using a recursively optimal reinforcement learning algo-
rithm with the local reward functions:
B.r(s, a, s′) =
    1 if s′ |= B.post
    −1 if s′ ⊭ B.post and s′ ⊭ B.pre
    0 otherwise    (4.2)
for each behaviour B. This function is chosen to encourage the behaviour to find
the shortest path to its post-condition without exiting its pre-image.1 Just as
the main goal of the agent is limited to being a goal of achievement, so also we
limit the kinds of behaviours we might use. This is not seen to be too drastic
a limitation, at least for a first attempt at a hybrid model. We discuss the
relaxation of this requirement in the future work section of Chapter 8.
1 This is not guaranteed. In a stochastic world it may select a short path with a small probability of failure over a much longer path with no possibility of failure at all. This is deemed to be an acceptable trade-off.
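A sketch of the local reward function of Equation 4.2, with hypothetical predicates standing in for the fluent tests of the pre-image and post-condition:

```python
# Sketch of an RL-Top's local reward: +1 for reaching the post-condition,
# -1 for leaving the pre-image without reaching it, 0 otherwise.

def make_local_reward(pre, post):
    def r(s, a, s_next):
        if post(s_next):
            return 1.0
        if not pre(s_next):      # left the pre-image without success
            return -1.0
        return 0.0
    return r

# Go(hall, dining): pre = robot in hall, post = robot in dining.
r = make_local_reward(lambda s: s == "hall", lambda s: s == "dining")
assert r("hall", "e", "dining") == 1.0    # reached the post-condition
assert r("hall", "n", "study") == -1.0    # exited the pre-image
assert r("hall", "e", "hall") == 0.0      # still inside the pre-image
```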
4.3.2 State abstraction
As discussed in Section 2.2.3, one advantage of providing local goals to be-
haviours is that it allows state-abstractions to be tailored to individual be-
haviours. The RL-Top representation allows for this, in a simple fashion. Each
behaviour has a view which is a set of instruments used to identify the prim-
itive state representation for that behaviour. A behaviour’s view may omit
certain state-variables that are irrelevant to its operation. For example, the
Go(hall, dining) behaviour above depends only on the x and y coordinates of
the robot, and not on the locations of the book or the coffee, so these latter
instruments do not need to be included in its view. The behaviour is thus:
Go(hall, dining)
pre: location(robot, hall)
post: location(robot, dining)
view: { x, y }
P: { n, ne, e, se, s, sw, w, nw }
It is currently up to the trainer to determine which instruments to include
in a behaviour’s view when specifying the behaviour, based on her knowledge of
the problem domain. More sophisticated automatic state abstraction is left for
future work.
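The effect of a view can be illustrated with a short sketch; the instrument names and the dictionary encoding of the full state are assumptions for illustration:

```python
# Sketch: a behaviour's view projects the full instrument reading onto just the
# instruments that behaviour needs, so each behaviour learns over a smaller
# state space.

def project(state, view):
    """Return the abstract state a behaviour sees: a tuple of the view's
    instrument values, in a fixed order."""
    return tuple(state[i] for i in view)

full_state = {"x": 3, "y": 7, "book_room": "bedroom2", "coffee_room": "kitchen"}
view = ("x", "y")   # Go(hall, dining) ignores the book and coffee locations
assert project(full_state, view) == (3, 7)
```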
4.3.3 Parameterised Behaviours
There are a large number of possible variations of the Go() behaviour which
specify different initial and goal rooms. As a planning operator, it would be
natural to express all of these behaviours as an operator schema.
Teleo-operators are naturally specified as schemata, with variables which re-
late their pre-image and post-condition. Parameterised behaviours have also
been used in hierarchical reinforcement learning, although only in a fairly sim-
ple fashion. Dietterich uses parameterised behaviours in the task hierarchies for
MAXQ, but parameterisation is little more than a way of having several be-
haviours with a similar name. Each parameterisation is treated as a separate
behaviour. 2
2 More sophisticated parameterisation methods are also included in Alisp (Andre & Russell, 2002), but were unknown at the time this thesis was submitted.
Having a symbolic representation of behaviours and instruments allows us
to use a more complex relationship between parameters and behaviours. The
parameters of an RL-Top can also occur as variables in the view, the pre-image
and the post-condition. So, for example, the family of Go() behaviours can be
represented as a schema:
Go(From, To)
pre: location(robot, From)∧ door(From, To)
post: location(robot, To)
view: { x, y, id(To) }
P: { n, ne, e, se, s, sw, w, nw }
Notice that the special instrument id (the identity function) has been added to
the view. It is needed, so that the state-representation includes the identity of
the target room. In rooms such as the hall, which have many exits, the policy
Go.π will need this extra information to determine which direction to go. (The
identity of the originating room is not needed, as it is uniquely identified by
x and y.)
For a more complex example, consider a world with various objects in it and
a robot that can move and turn in all directions. A possible behaviour in such a
world might be Approach(Object). The policy for such a behaviour would rely
on the distance and angle to the target object, but not the identity of the ob-
ject itself. Instrument schemata such as distance(Object) and angle(Object)
could be included in the view for such a behaviour, which allow it to obtain the
important state information and abstract away the identity of the object.
Parameterised behaviours must have all parameters bound before they can
be executed. This binding may be done in the planning stage, or at runtime in
the execution of a plan. Examples of both will be provided when planning is
discussed.
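As an illustration, instantiating the Go(From, To) schema into a ground behaviour might look like the sketch below. The dictionary layout is invented for illustration, not Rachel's actual data structure:

```python
# Sketch of instantiating the Go(From, To) schema. The special instrument
# id(To) injects the bound target room into the view, so the learnt policy
# can distinguish destinations.

def instantiate_go(frm, to):
    return {
        "name": f"Go({frm},{to})",
        "pre":  [("location", "robot", frm), ("door", frm, to)],
        "post": [("location", "robot", to)],
        # id(to) adds the target-room identity to the state representation:
        "view": ("x", "y", ("id", to)),
        "actions": ["n", "ne", "e", "se", "s", "sw", "w", "nw"],
    }

b = instantiate_go("hall", "dining")
assert b["post"] == [("location", "robot", "dining")]
assert ("id", "dining") in b["view"]
```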
4.3.4 Hierarchies of behaviours
So far we have assumed that there is only one intermediate level of behaviours in
our hierarchy. The main task is solved in terms of behaviours and behaviours in
terms of primitive actions. However both hierarchical reinforcement learning and
hierarchical planning allow more than one level of temporal abstraction. Coarse-
grained behaviours can be represented in terms of finer-grained behaviours, which
are represented in behaviours finer still, until at the bottom of the hierarchy
appear the primitive actions.
To construct hierarchical plans, we need to identify which behaviours are
available at which levels of abstraction. We have taken the simplest possible
approach to this and allowed the trainer to specify a numerical granularity for
each behaviour. The main task will be represented as a behaviour with granu-
larity zero, the coarsest level. Behaviours with granularities 1, 2, 3, . . . represent
successively finer grained behaviours.
The levels of granularity are determined by the trainer, so that the hierar-
chical planner (to be described in Section 5.3.1) knows which behaviours are
available to it when decomposing a task. Each behaviour B has a plan B.plan
associated with it. A behaviour of granularity g is decomposed into a plan of
behaviours with granularity g + 1, if this is possible. Behaviours from this plan
may then be included in the internal policy for B.
The gridworld example is not really complex enough to require multiple levels
of granularity. Later we will introduce a simulated soccer domain which includes
three levels of granularity. At level zero there is the main task Score which is
applicable everywhere and has the goal of scoring a goal.
This is decomposed into a plan using mid-level behaviours with granularity
one. Behaviours at this level represent tactical decisions: when to pass, when
to shoot, etc. These are decomposed in turn into behaviours with granularity
two, representing simpler low level activities such as capturing the ball, turning,
dribbling, and the like.
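The granularity bookkeeping might be sketched as follows. The soccer behaviour names come from the text; the data layout is illustrative:

```python
# Sketch of granularity levels: each behaviour carries an integer granularity,
# and a behaviour at level g is decomposed into a plan over level g+1
# behaviours.

behaviours = {
    "Score":   {"granularity": 0},   # main task
    "Pass":    {"granularity": 1},   # tactical decisions
    "Shoot":   {"granularity": 1},
    "GetBall": {"granularity": 2},   # low-level skills
    "Dribble": {"granularity": 2},
}

def available_for(parent, behaviours):
    """Behaviours usable when decomposing `parent`: the next-finer level."""
    g = behaviours[parent]["granularity"] + 1
    return sorted(b for b, v in behaviours.items() if v["granularity"] == g)

assert available_for("Score", behaviours) == ["Pass", "Shoot"]
assert available_for("Pass", behaviours) == ["Dribble", "GetBall"]
```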
4.4 Summary
In this chapter we have combined the elements of the previous chapters into
a single representation for states, goals and actions which can be used for both
learning and planning. The heart of its design is the Reinforcement-Learnt Teleo-
operator which models the purpose of a behaviour symbolically, and provides a
mapping to convert the symbolic representation into a local reward function for
learning the behaviour. The next chapter will present algorithms which use this
representation for planning and learning.
Chapter 5
Rachel: Planning and Acting
In this chapter and the next we shall present a description of Rachel, a hy-
brid planning/reinforcement learning system that implements the reinforcement-
learnt teleo-operator formalism of the previous chapter. Rachel consists of
three interacting components: the planner which builds plans, the actor which
executes them, and the reflector, which reflects on the outcomes of execution and
refines the world model. Figure 5.1 shows a block diagram of these components
and the relationships between them. The description of this system has been
split into two parts: in this chapter we shall focus discussion on the interaction
between the planner and the actor, leaving the reflector for the next.
Figure 5.1: The three parts of the Rachel architecture. The planner builds plans from the RL-Top model; the actor executes plans and learns policies; the reflector monitors execution traces and learns side-effect descriptions, which refine the model.
The relationship between the planner and the actor is straightforward – the
planner builds plans, and the actor executes them. However the manner in which
plans are constructed and executed is somewhat unconventional, so as to in-
corporate reinforcement learning. We shall present an algorithm for the planner
which deliberately produces plans with alternative paths, and two different al-
gorithms for the actor which learn to choose among these alternatives, and learn
primitive policies for behaviours.
5.1 The Planner
Rachel implements a custom planner to build teleo-reactive plan trees. Given a
goal G it produces a plan to achieve that goal. The planner uses the means-ends
analysis technique described in Section 3.2.2, with some variations.
5.1.1 Semi-universal planning
The first variation is that the plans it produces are semi-universal. That is,
the planner builds paths by backward-chaining from the goal, and stores all the
paths it generates, regardless of whether they include the current state or not,
stopping once it covers the current state. While the “unsuccessful” paths are not
useful in the agent’s current state, they may become active during the execution
of the plan, should one of the behaviours on the correct path fail unexpectedly.
This is quite likely, given that the behaviours are being learnt as the agent
goes.
Furthermore, it is normal in the construction of universal plans to keep only
one path from any particular state, and to discard any new step which does not
add to the set of states covered by the plan. This is because most planners are
only looking for the single shortest path to the goal from each state. Rachel,
on the other hand, is actively looking for alternative paths, of any length, and
so does not discard such redundant steps. It does, however, avoid creating paths
which contain loops. So a new step which does not add anything to the path it
extends is discarded.
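The keep/discard policy for new plan steps can be illustrated with a sketch in which node conditions are sets of fluents and implication is tested by superset inclusion; this encoding is assumed here for illustration:

```python
# Sketch of Rachel's redundant-step policy: a new step is kept even if its
# condition duplicates one elsewhere in the tree (an alternative path), but
# is discarded if it is implied by a condition already on the path it
# extends, since that would create a loop.

def creates_loop(new_condition, path_conditions, implies):
    """`implies(a, b)` is a stand-in for the logical test a => b."""
    return any(implies(new_condition, c) for c in path_conditions)

# With conditions as frozensets of fluents, implication is superset inclusion.
implies = lambda a, b: a >= b
path = [frozenset({"at_hall"}), frozenset({"at_dining"})]
assert creates_loop(frozenset({"at_hall", "holding_coffee"}), path, implies)
assert not creates_loop(frozenset({"at_study"}), path, implies)
```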
So consider the grid-world problem from earlier chapters. Say, for example,
the agent’s goal was to get both the coffee and the book, starting from its initial
location in the study. A linear planner would produce one of the two plans
Figure 5.2: Two linear plans to fetch both the book and the coffee. [Plan diagrams omitted: each plan is a chain of Go and Get behaviours annotated with regressed node conditions.]
Figure 5.3: A semi-universal plan to fetch both the book and the coffee. The dashed arrows indicate where nodes have been omitted from the diagram, to save space. [Plan diagram omitted.]
shown in Figure 5.2. Both of these are valid alternative ways to achieve the
goal. Most planners would choose the one on the left, as the shorter of the two.
Rachel’s planner instead produces the semi-universal plan shown in Figure 5.3,
which contains both the linear plans. The job of choosing one or other of these
alternatives is not up to the planner. Instead, it is delegated to the actor, which
will do so using hierarchical reinforcement learning.
Notice, however, that the semi-universal plan has only been expanded seven
levels deep, so not all of the second of the linear plans is included. This is because
the planner is not completely universal. It does not try to exhaustively search
for all possible paths. Rather, it uses iterative deepening to successively expand
the depth of the tree until the current state is covered. In this case, the left-hand
branch covers the robot’s starting state, so the plan is not expanded any further.
That is, until the agent arrives in a state which is not covered by this plan, in
which case the plan will be further expanded until either the state is covered or
the plan cannot be grown any further.
Rachel’s plan also contains many “irrelevant” alternatives (only some of
which are included in the figure). These alternatives are not truly irrelevant
however. They are kept in case any of the behaviours fail unexpectedly, resulting
in a state which is not covered by either of the linear plans, e.g. if the robot
accidentally entered the laundry while navigating from the dining room to the
kitchen. A linear planner would have to re-plan in this instance. A universal
planner, like Rachel’s, can take advantage of the fact that it has already found
a contingency to handle this situation.
5.1.2 Variable binding
Another variation in Rachel’s planner is in the handling of variables in goals
and parameterised behaviours. The binding of variables may be done at planning
time, or may be delayed until run-time. The difference is apparent in the fol-
lowing example. Consider the agent in the example world from earlier chapters,
with the goal:
G = location(robot, Room) ∧ location(coffee, Room)
The unbound variable Room is treated as being existentially quantified. So this
goal is satisfied when the robot is in the same room as the coffee. A behaviour
B can be used to achieve this goal if there exists a substitution σ for variables in G and the
parameters of B such that:
σ(B.post) ∩ σ(G) ≠ ∅    (5.1)
If this is the case, then the regressed condition is given by
Fbefore = (σ(G)− σ(B.post)) ∪ σ(B.pre) (5.2)
which is an extension of Equation 3.4 to include variable binding. If variable
binding is required to be done at plan time, then σ must bind all the parameters
of B to constant values. If run-time binding is allowed then parameters of B may
be bound to variables, however the variables must be guaranteed to be bound
by the run-time evaluation of Fbefore. The mode information for fluents is used to
ensure Fbefore is Prolog-ordered, i.e. that there is an ordering of the fluents which
ensures that every unbound variable occurs as the output of a fluent before it
occurs as an input.
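The Prolog-ordering test can be sketched as a greedy check; the encoding of fluents and modes below is invented for illustration:

```python
# Sketch of the Prolog-ordering check: a regressed condition is valid for
# run-time binding if its fluents can be ordered so that every variable is
# produced as an output ('-') before it is consumed as an input ('+').

def prolog_ordered(fluents):
    """fluents: list of (name, [(arg, mode), ...]) with mode '+' or '-'.
    Greedily emit any fluent whose input variables are already bound."""
    bound, remaining = set(), list(fluents)
    while remaining:
        ready = next((f for f in remaining
                      if all(not a[:1].isupper() or a in bound
                             for a, m in f[1] if m == "+")), None)
        if ready is None:
            return False           # some input variable can never be bound
        bound.update(a for a, m in ready[1] if a[:1].isupper())
        remaining.remove(ready)
    return True

# location(robot, From), door(From, Room), location(coffee, Room):
fs = [("door", [("From", "+"), ("Room", "-")]),
      ("location", [("robot", "+"), ("From", "-")]),
      ("location", [("coffee", "+"), ("Room", "-")])]
assert prolog_ordered(fs)

# If Room could only be an input, the set would not be orderable:
fs_bad = [("location", [("robot", "+"), ("Room", "+")])]
assert not prolog_ordered(fs_bad)
```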
So in the example, the behaviour Go(From, To) can be used to achieve the
goal G above with the binding σ = {To/Room}. The regressed condition is then:
Fbefore = (σ(G)− σ(Go(From, To).post)) ∪ σ(Go(From, To).pre)
= ({location(robot, Room), location(coffee, Room)}
−{location(robot, Room)}) ∪ {location(robot, From), door(From, Room)}
= {location(robot, From), door(From, Room), location(coffee, Room)}
If the fluents location and door can each be used with the second argument
as an output, then this is a valid Prolog-ordered set and the variables Room
and From can be bound at run-time. If not, then a constant binding of these
variables might be necessary. In its present form, the planner requires the trainer
to specify when a particular variable is to be bound at plan-time, and to specify
a set of candidate values for that variable.
5.1.3 The planning algorithm
Algorithms 5 and 6 show pseudocode for the planning process. Plans are grown
by iterative deepening. The GrowPlan function grows an existing plan one
step deeper. It calls the PlanStep function at each leaf node of the required
depth, to generate new steps. PlanStep implements the regression process,
Algorithm 5 Rachel’s planning algorithm: Iterative Deepening
function GrowPlan(plan P)
ExpandNode(P.root, P.depth + 1, {})
end GrowPlan
function ExpandNode(node N, depth d, explored E)
E ← E ∪ {N.cond}
if d > 0 then
for each child C ∈ N.children do
ExpandNode(C, d− 1, E)
end for
else
PlanStep(N, E)
end if
end ExpandNode
finding a behaviour which achieves the input condition and regressing it to get
the new node condition.
Further extensions to the planner will be described in later sections, including
hierarchical planning (Section 5.3.1), conditional side-effects (Section 6.6) and
exploratory plan steps (section 6.6.1).
For all the examples in this thesis, we shall assume the plans used have been
grown to the maximum possible depth, except where noted. This is only possible
as the examples involve relatively simple planning tasks. For more complex tasks,
planning can be interleaved with acting – the actor can explore the paths that
the planner has already discovered while the planner searches for more.
5.1.4 Computational complexity
It should be noted that the computation time of the above algorithm for plan
construction can potentially be exponential in the number of different situa-
tions (conjunctions of state fluents) the agent can encounter. This is a serious
practical problem once the agent’s domain becomes moderately complex. It is
allayed somewhat by the addition of hierarchy in Section 5.3.1, but still ought to
be addressed.
The advantage of planning depends on the relative costs of computation
and exploration. For any real-world problem, computation will be orders of
Algorithm 6 Rachel’s planning algorithm: Adding new nodes
function PlanStep(node N, explored E)
for each behaviour B′ do
\\Find which of the node conditions B′ achieves, if any
(achieved , unachieved )← Achieved(B′, N.cond)
if achieved = {} then
Skip to the next behaviour
end if
\\Check for interference
if B′.post ∧ N.cond⇒⊥ then
Skip to next behaviour
end if
\\Construct the new node’s condition
newCondition ← unachieved ∧ B′.pre
if newCondition ⇒⊥ then
Skip to the next behaviour
end if
\\Check if the new condition has already been explored
for each condition C ∈ E do
if newCondition ⇒ C then
Skip to the next behaviour
end if
end for
\\Add the new node tree
Nnew .cond← newCondition
Nnew .parent← N
Nnew .B← B′
N.children← N.children ∪ {Nnew}
end for
end PlanStep
magnitude faster than actions execute, so it is worth doing a considerable amount
of computation in order to avoid unprofitable exploration. The exact nature of
this trade-off will vary from one application to another. A more sophisticated
planner would allow the trainer to tune the amount of search it does relative
to the expected advantage. We discuss this idea, along with other possible
improvements to the planner, in Section 8.3.1.
5.2 The Actor
The actor is the part of Rachel which directly controls the agent’s actions.
Its job is to choose one of the alternative behaviours offered by the plan, and
execute it, learning a primitive policy for that behaviour according to the local
reward function given by that behaviour’s RL-Top (Equation 4.2). The choice of
behaviours is to be optimised according to the global reward function r(s, a, s′)
(Equation 4.1).
We shall describe two different algorithms for the actor, which implement
different execution semantics. The first, Planned Hierarchical Semi-Markov Q-
Learning, will use the standard subroutine-semantics of HSMQ, MAXQ-Q and
others, consulting the plan only when a new behaviour needs to be selected. The
second algorithm, Teleo-Reactive Q-Learning uses semantics based on those for
teleo-reactive plan execution (Section 3.2.3). This allows for more reactivity to
changes in the world, but requires a more complex algorithm, as will be shown.
We concentrate in this section on algorithms which use a single intermediate
level of hierarchy only. The algorithms will be extended to include multiple levels
of hierarchy in Section 5.3, later in the chapter.
5.2.1 Planned Hierarchical Semi-Markov Q-Learning
The simplest use of plans to inform hierarchical learning is as a replacement for
the task-hierarchy. Where the HSMQ algorithm (Algorithm 3 on page 38) con-
sults the function TaskHierarchy to determine the set of available behaviours
at each choice point, Planned Hierarchical Semi-Markov Q-Learning (P-HSMQ)
uses the plan instead. Algorithm 7 shows pseudocode for this process.
The ActiveBehaviours function returns the set of behaviours dictated by
the active nodes of the plan (with duplicates removed). One of these behaviours
Algorithm 7 Planned HSMQ-Learning
function P-HSMQ(goal G)
plan P ← BuildPlan(G)
t← 0
Observe state st
Bt ← ActiveBehaviours(P, st)
while st 6|= G do
T ← t
Choose behaviour B← π(st) from Bt according to an exploration policy
sequence S ← {}
while st |= B.pre ∧ st 6|= B.post do
Choose primitive action at ← B.π(st)
according to an exploration policy
Execute action at
Observe next state st+1
B.Q(st, at) ←α B.r(st, at, st+1) + γ max_{a ∈ B.P} B.Q(st+1, a)
S ← S + 〈st, at, st+1〉
t← t + 1
end while
k ← 0
totalReward ← 0
for each 〈s, a, s′〉 ∈ S do
totalReward ← totalReward + γkr(s, a, s′)
k ← k + 1
end for
Bt ← ActiveBehaviours(P, st)
Q(sT, B) ←α totalReward + γ^k max_{B′ ∈ Bt} Q(st, B′)
end while
end P-HSMQ
is then selected by the reinforcement learning algorithm for execution. A be-
haviour B learns its policy as it executes, using its local reward function, until
it terminates, either successfully (satisfying B.post) or unsuccessfully (prema-
turely leaving B.pre). The experiences gathered while executing the behaviour
are then evaluated using the global reward function and used to update its global
Q-value.
This algorithm has the same theoretical properties as the HSMQ algorithm
on which it is based, and is guaranteed to converge to a recursively optimal
policy within the restrictions placed on it by the plan.
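The core of the P-HSMQ global update, the discounted sum of rewards over a behaviour's execution backed up by the best available next behaviour, can be sketched as follows. The function and parameter names are illustrative, not Rachel's actual interface:

```python
# Sketch of the P-HSMQ global update: after behaviour B runs for k steps, the
# Q-value at its initiation state moves toward the discounted reward total
# plus the discounted value of the best behaviour available afterwards.

def phsmq_update(Q, s_init, B, trace, r, gamma, alpha, next_behaviours, s_final):
    total, k = 0.0, 0
    for (s, a, s_next) in trace:           # discounted global reward over the run
        total += (gamma ** k) * r(s, a, s_next)
        k += 1
    backup = max(Q.get((s_final, B2), 0.0) for B2 in next_behaviours)
    target = total + (gamma ** k) * backup
    Q[(s_init, B)] = (1 - alpha) * Q.get((s_init, B), 0.0) + alpha * target
    return Q

Q = {}
trace = [("s0", "e", "s1"), ("s1", "e", "goal")]   # two-step run reaching the goal
r = lambda s, a, sn: 1.0 if sn == "goal" else 0.0
phsmq_update(Q, "s0", "Go", trace, r, 0.9, 0.5, ["Go"], "goal")
assert abs(Q[("s0", "Go")] - 0.5 * 0.9) < 1e-9
```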
5.2.2 Termination Improvement
While the P-HSMQ algorithm is effective, it does not make full use of the infor-
mation available to it in the plans it builds. It only checks the appropriateness of
a behaviour when it is initiated, and always executes it until completion, ignoring
any effects that might cause the action to no longer be appropriate.
Consider the following scenario: We add a further complexity to the example
grid-world from earlier chapters. There is a bump on the floor near the west
end of the hall, as indicated by the shaded area shown in Figure 5.4. When the
robot moves onto the bump while carrying a cup of coffee, there is a 10% chance
that it spills.
The agent’s goal is to fetch the coffee and the book and take them both
into the lounge. It is faster, in this scenario, to fetch the coffee first and then
the book. Suppose the agent has already visited the kitchen and is carrying the
coffee. It re-enters the hall and starts executing Go(hall, bedroom2), as dictated
by the plan (Figure 5.5), but as it passes over the bump it spills the coffee. Once
the P-HSMQ algorithm has begun executing a behaviour it is committed to
completing it, so the robot continues down the hall to the bedroom. Only once
it enters the room does it re-evaluate its choice of behaviour, and realises that
it needs to return to the kitchen to fetch another cup.
A more efficient solution would be to return to the kitchen as soon as the
coffee is spilt, but to do so the agent would need to relax its commitment to
behaviours, and be able to interrupt a behaviour before it terminates. This is a
kind of termination improvement as described in Section 2.4. As pointed out in
that section, the difficulty with termination improvement is that the more often
the agent is able to make choices about its behaviours, the more complex the
Figure 5.4: The example world revisited. A “bump” has been added at the western end of the hall (indicated by the shaded squares). When the robot moves onto the bump there is a 10% chance that it spills the coffee.
policy is to learn. Optimally we would like the agent to only reconsider its choice
when the current behaviour is no longer worthwhile.
A plan provides us with a way to make this decision. Each node dictates
the conditions which make its corresponding behaviour appropriate. So long
as the node is active, the behaviour should continue executing. Should the
node become inactive, then the behaviour may no longer be appropriate and
should be interrupted. In the plan in Figure 5.5 the node which dictates the
Go(hall, bedroom2) behaviour has the condition:
location(robot, hall) ∧ holding(coffee) ∧ location(book, bedroom2)
So as long as the robot is in the hall and carrying the coffee, this behaviour will
continue executing. However if the robot should spill the coffee, holding(coffee)
will no longer be satisfied and the node will become inactive (even though the
Go behaviour has not terminated). This is a good indication that the behaviour
should be interrupted, and another chosen from the newly active nodes (shown
with broken outlines in Figure 5.5). Notice that the Go(hall, bedroom2) behaviour is again
in the set of available behaviours, but is dictated by a different node, as part of
Figure 5.5: A plan for fetching the coffee and the book. Dotted arrows indicate
places where steps have been omitted, for the sake of brevity. The highlighted
nodes are those that are active before and after the coffee is spilled.
a sequence that fetches the book first, and then the coffee. A better alternative
in this situation is the Go(hall, dining) behaviour which takes the robot back
towards the kitchen to fetch another coffee.
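The monitoring rule described above (keep executing a behaviour only while its dictating node's condition holds) can be sketched in a few lines. The names used here (node_active, the fluent strings) are hypothetical illustrations, not Rachel's implementation:

```python
# Minimal sketch of teleo-reactive node monitoring (illustrative only).
# A node's condition is modelled as a set of fluents that must all hold
# in the current state for the node to remain active.

def node_active(condition, state):
    """A node is active while every fluent in its condition holds."""
    return condition <= state

# Condition of the node dictating Go(hall, bedroom2):
condition = {"location(robot, hall)", "holding(coffee)",
             "location(book, bedroom2)"}

state = set(condition)
print(node_active(condition, state))   # node active: keep executing

state.discard("holding(coffee)")       # the coffee is spilt
print(node_active(condition, state))   # node inactive: interrupt the behaviour
```

When the check fails, the executing behaviour is interrupted and a new behaviour is chosen from the nodes that are active in the new state.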
This process of executing a behaviour as long as its node is active follows
the teleo-reactive semantics for plan execution described in Section 3.2.3. What
follows is an HRL algorithm which implements these semantics, called Teleo-
Reactive Q-Learning.
5.2.3 Teleo-Reactive Q-Learning
Teleo-Reactive Q-Learning (TRQ), as shown in algorithm 8, is an adaptation
of P-HSMQ which implements the teleo-reactive execution semantics described
above.
There are two important issues that need to be dealt with in implementing
this algorithm. They are: (1) maintaining the Semi-Markov Property of inter-
rupted behaviours (necessary for the correctness of the SMDPQ update rule);
and (2) ensuring that behaviours’ internal policies are fully explored in spite of
interruptions. Each of these issues is detailed below.
Maintaining the Semi-Markov property
The correctness of the SMDPQ-Learning update rule used in TRQ requires that
the behaviours executed obey the Semi-Markov property. That is, that the
outcomes of executing a behaviour – its duration, the rewards it receives and
the state it terminates in – depend only on the identity of the behaviour and the
state in which it is initiated. The teleo-reactive execution semantics violate this
condition. If there are two different nodes active in the same state, dictating
the same behaviour but with different interruption criteria, then the outcome of
executing the behaviour will depend on which node is chosen.
Returning to the example world, consider the goal:
G = location(robot, bedroom2) ∧ holding(Object)
i.e. the robot is to be in the bedroom and holding something. The variable
Object is considered to be existentially quantified, so the goal is satisfied if the
robot is carrying either the coffee or the book. The plan for this goal is shown in
Figure 5.6. Suppose the agent decides to fetch the coffee first, and having done
Algorithm 8 TRQ-Learning
function TRQ-1(goal G)
plan P ← BuildPlan(G)
t ← 0
Observe state s_t
N_t ← ActiveNodes(P, s_t)
while s_t ⊭ G do
T ← t
Choose node N ← π(s_t) from N_t according to an exploration policy
sequence S ← {}
B ← N.B
while N ∈ N_t do
Choose primitive action a_t ← B.π(s_t) according to an exploration policy
Execute action a_t
Observe next state s_{t+1}
B.Q(s_t, a_t) ←α B.r(s_t, a_t, s_{t+1}) + γ max_{a∈B.A} B.Q(s_{t+1}, a)
S ← S + 〈s_t, a_t, s_{t+1}〉
t ← t + 1
N_t ← ActiveNodes(P, s_t)
end while
k ← 0
totalReward ← 0
for each 〈s, a, s′〉 ∈ S do
totalReward ← totalReward + γ^k r(s, a, s′)
k ← k + 1
end for
Q(s_T, N) ←α totalReward + γ^k max_{N′∈N_t} Q(s_t, N′)
if s_t ⊨ B.pre ∧ s_t ⊭ B.post then
\\ Behaviour B has been interrupted prematurely
with probability η do
Persist(B)
end with
end if
end while
end TRQ-1
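As a concrete, deliberately tiny illustration of the control flow in Algorithm 8, the sketch below runs a single-node TRQ-style loop on a one-dimensional corridor. All of the names (Node, trq_episode, the corridor itself) are hypothetical; this is not the thesis implementation, only the shape of the nested behaviour-level and node-level updates:

```python
import random

# Toy sketch of the TRQ-1 loop on a 1-D corridor (illustrative only).
# States are integers 0..5; the goal is state 5. A single plan node dictates
# a "move along the corridor" behaviour and is active while the agent is
# left of the goal.

random.seed(0)
ALPHA, GAMMA, ACTIONS = 0.5, 0.9, [-1, +1]

class Node:
    def __init__(self):
        self.Q = {}             # node-level Q-values, keyed by start state
        self.behaviour_Q = {}   # behaviour's internal Q-values, keyed by (s, a)
    def active(self, s):
        return 0 <= s < 5       # the node's condition

def trq_episode(node):
    s = 0
    while s != 5:                        # while the goal is not satisfied
        start, rewards = s, []
        while node.active(s):            # execute while the node is active
            if random.random() < 0.2:    # epsilon-greedy exploration
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS,
                        key=lambda b: node.behaviour_Q.get((s, b), 0.0))
            s2 = max(0, min(5, s + a))
            r = 1.0 if s2 == 5 else -0.01
            best = max(node.behaviour_Q.get((s2, b), 0.0) for b in ACTIONS)
            q = node.behaviour_Q.get((s, a), 0.0)
            node.behaviour_Q[(s, a)] = q + ALPHA * (r + GAMMA * best - q)
            rewards.append(r)
            s = s2
        # SMDP-style update for the node from the discounted return
        total = sum(GAMMA ** i, for_ in []) if False else \
                sum(GAMMA ** i * r for i, r in enumerate(rewards))
        q = node.Q.get(start, 0.0)
        node.Q[start] = q + ALPHA * (total + GAMMA ** len(rewards) * 0.0 - q)
    return s

node = Node()
for _ in range(200):
    trq_episode(node)
# The behaviour has learnt to move towards the goal:
print(node.behaviour_Q.get((4, +1), 0.0) > node.behaviour_Q.get((4, -1), 0.0))
```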
Algorithm 9 TRQ-Learning: Persisting with a behaviour
function Persist(behaviour B)
while s_t ⊨ B.pre ∧ s_t ⊭ B.post do
Choose primitive action a_t ← B.π(s_t) according to an exploration policy
Execute action a_t
Observe next state s_{t+1}
B.Q(s_t, a_t) ←α B.r(s_t, a_t, s_{t+1}) + γ max_{a∈B.A} B.Q(s_{t+1}, a)
t ← t + 1
end while
end Persist
so, it returns to the hall. When the agent enters the hall it is presented with a
choice: There are two active nodes in the plan which are highlighted in the figure.
Both nodes indicate the same behaviour to be executed: Go(hall, bedroom2).
Figure 5.6: A plan for fetching either the coffee or the book. The two highlighted
nodes are active when the agent is in the hall and holding the coffee.
While executing the behaviour the robot passes over the bump and spills the
coffee. What happens next depends on which node was being executed. The
right-hand node is still activated, so if it was selected the agent would continue
executing the behaviour. The left-hand node however is no longer active. If
this node was the one selected then the behaviour would be interrupted and a
new choice made. This shows that the termination of the behaviour depends
on more than just its starting state. More information is needed to satisfy the
Semi-Markov property.
The solution is to explicitly recognise the difference between these two cases
and treat them as distinct alternatives for the agent to choose between. We
treat each node of the plan as a separate alternative. A particular node always
executes the same behaviour and terminates under the same conditions, so it
will satisfy the Semi-Markov property.
Teleo-reactive Q-Learning assigns Q-values to nodes of the plan rather than
to behaviours. We define an optimal state-node value function:
Q*(s, N) = E { Σ_{i=0}^{k−1} γ^i r_{t+i} + γ^k V*(s_{t+k}) | ε(s, N, t) }     (5.3)
where ε(s, N, t) indicates the event of executing behaviour N.B in state s at time
t, until N is no longer active, and k is the duration of this execution.
We learn an approximate value function Q(s, N), by a version of the SMDPQ-
Learning update rule:
Q(s_t, N_t) ←α R_t + γ^k max_{N∈N} Q(s_{t+k}, N)     (5.4)
Execution then consists of finding all active nodes in the plan and selecting
the one with the best Q-value. The behaviour corresponding to this policy node
is then executed and learnt in the usual way. When the node is no longer active
the behaviour is interrupted, whether it has terminated or not, and the gathered
experience is used to update the value of the node.
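The node-level update of Equation 5.4 can be sketched as follows; the function name, the string-valued states and nodes, and the particular reward sequence are all hypothetical, chosen only to show the shape of the SMDPQ-style update:

```python
# Sketch of the SMDPQ-style node update (in the spirit of Equation 5.4).
def update_node_value(Q, start_state, node, rewards, next_state, active_nodes,
                      alpha=0.1, gamma=0.9):
    """Update Q(s, N) from the rewards gathered while node N was active."""
    k = len(rewards)
    # Discounted return accumulated over the k steps the node was active.
    total = sum(gamma ** i * r for i, r in enumerate(rewards))
    # Bootstrap from the best node active in the state where N was left.
    best_next = max((Q.get((next_state, n), 0.0) for n in active_nodes),
                    default=0.0)
    old = Q.get((start_state, node), 0.0)
    Q[(start_state, node)] = old + alpha * (total + gamma ** k * best_next - old)
    return Q[(start_state, node)]

Q = {}
v = update_node_value(Q, "hall", "Go(hall,bedroom2)", [-0.01, -0.01, 1.0],
                      "bedroom2", ["Get(book,bedroom2)"])
print(round(v, 4))   # → 0.0791
```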
Since the execution of a node in a plan obeys the Semi-Markov property,
this update rule is guaranteed to result in convergence to an optimal policy (in
terms of node selection) provided that the underlying primitive policies for nodes
converge. Ensuring this convergence is the issue we will examine next.
Persisting with interrupted behaviours
In a recursively optimal HRL algorithm such as TRQ the policy of a behaviour
is independent of its calling context. A behaviour aims to learn a policy to
maximise its local rewards, regardless of its parent task. To ensure that this
happens, behaviours must be allowed to fully explore their application spaces.
To be precise, the correct convergence of the Q-values (and thus the policy) for
a state s cannot be guaranteed unless all states reachable from s are adequately
explored (infinitely often in the limit).
The teleo-reactive semantics alone will not guarantee this. If a node N in a
plan has condition N.cond and dictates behaviour B then the execution of B is
limited to the set of states that satisfy N.cond. States within the application
space of B which do not satisfy N.cond will never be explored (unless there is
another node which also dictates B under less restrictive conditions).
This, of course, can have a seriously detrimental effect on learning the internal
policy of B. Convergence to the optimal policy is no longer guaranteed, even if
the optimal policy lies within the set of states satisfying N.cond. To illustrate
this, consider the narrow bridge example in Figure 5.7.
Figure 5.7: A narrow bridge over a chasm. (a) On the left is the optimal policy
for crossing the bridge. (b) On the right is the policy learnt if the behaviour is
interrupted whenever it moves to the right-hand side of the bridge.
The bridge links two sides of a chasm. It is 10 cells long but only 2 cells
wide. The agent can move in any of the 4 cardinal directions, but each movement
includes a probability of error p_e, resulting in the agent immediately falling off
the bridge. Let us define a behaviour Cross:
Cross
Pre: on bridge
Post: on far side
According to the local reward function for RL-Tops the behaviour gets reward
+1 for reaching the other side and penalty −1 for falling off the bridge. The
optimal policy is to head directly for the other side, as shown in Figure 5.7(a).
The numbers in each cell show the optimal state-value V*(s) for the cell, with
γ set to 1 and p_e at 0.1. Notice that for the first half of the bridge the values
are negative, indicating that the expected probability of falling off the bridge is
greater than the probability of reaching the other side. Still, there are no better
alternative policies for this behaviour.
Now consider what would happen if this behaviour were only ever used when
dictated by a node which required that the agent remain on the left-hand side of
the bridge, and interrupted the behaviour whenever it strayed to the right hand
side of the bridge. The optimal policy for Cross does not violate this condition,
but learning that policy is no longer possible. Since actions on the right side of
the bridge are never explored, the Q-values for all those states remain at their
initial value of zero. Since moving to the right-hand side of the bridge appears
to yield a future return of zero, the agent prefers to move to the right of the
bridge instead of moving forward, for the first three cells of the bridge. The convergent
policy for this situation is shown in Figure 5.7(b).
While this is a pathological example, it is clear that to guarantee the cor-
rect convergence of the internal policies of behaviours, they must be allowed to
explore their application spaces thoroughly. To achieve this, the TRQ algorithm occa-
sionally elects to execute a behaviour to completion rather than interrupt it. If
a behaviour is interrupted before it terminates, then the algorithm may decide,
with probability η to ignore the interruption and persist with the behaviour until
it succeeds or fails. Setting the value of η is a tradeoff between taking advan-
tage of interruptions (when η is low), and faster convergence of the behaviours’
policies (when η is high). A greedy-in-the-limit schedule of η values is perhaps
the best way to resolve this.
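The persistence rule can be sketched as a simple decision function. The dictionary-based Behaviour representation and the function name are hypothetical; only the rule itself (persist with probability η when the interruption is premature) comes from the text:

```python
import random

# Sketch of the persistence rule: when a behaviour is interrupted before it
# terminates, ignore the interruption with probability eta and run the
# behaviour to completion.

def on_interruption(behaviour, state, eta, rng=random.random):
    """Return 'persist' or 'interrupt' for a prematurely interrupted behaviour."""
    premature = behaviour["pre"](state) and not behaviour["post"](state)
    if premature and rng() < eta:
        return "persist"          # off-policy exploration at the behaviour level
    return "interrupt"

cross = {"pre": lambda s: s["on_bridge"], "post": lambda s: s["on_far_side"]}
state = {"on_bridge": True, "on_far_side": False}
print(on_interruption(cross, state, eta=1.0))   # always persists when eta = 1
print(on_interruption(cross, state, eta=0.0))   # never persists when eta = 0
```

With an intermediate η the agent trades responsiveness to interruptions against exploration of the behaviour's full application space, as discussed above.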
Once the algorithm begins to persist with a behaviour, all of the hierarchy
above that behaviour is ignored. The value of the node that was dictating the
behaviour is updated as if the behaviour was interrupted. Experiences gathered
while persisting with a behaviour are only used to update that behaviour’s internal
Q-values, and are ignored by the level above. In this way, persistence is a
kind of off-policy exploration, at the behaviour level.
When a persisting behaviour terminates, either successfully or unsuccessfully,
control returns to the plan and a new plan node is selected according to the next
state.
Proof of convergence
With these two conditions the policies learnt by the TRQ algorithm should be
guaranteed to converge with probability 1 to a recursively optimal policy within
the limits of the plan. The proof of this statement would follow the same lines as the
proof for MAXQ-Q (Dietterich, 2000a) and of SMDP Q-learning (Parr, 1998).
We will outline it here, without entering into the details.
The proof is inductive. We start by proving the convergence of the behaviours
of finest granularity, which learn policies directly in terms of primitive actions.
We then proceed to prove convergence for successively coarser levels of granu-
larity until we reach the top of the hierarchy.
The base case is straightforward. The behaviours of finest granularity use Q-
learning to learn a policy using primitive actions. The world is a Markov Decision
Problem, so given an appropriate schedule of learning rates, these behaviours will
converge to optimal policies with probability 1.
The recursive case is based on Proposition 4.5 from (Bertsekas & Tsitsiklis,
1996), which describes the necessary conditions to prove the convergence of an
iteration of the form:
r_{t+1}(i) ←α_t(i) (U r_t)(i) + w_t(i) + u_t(i)
for a mapping U with random noise term wt and a decaying error term ut. It
relies on two important factors:
1. The SMDP Q-Learning update rule (Equation 2.23) is a weighted max-
norm pseudo-contraction. That is, the mapping U :
(UQ)(s, N_t) = R_t + γ^k max_{N∈N} Q(s_{t+k}, N)
satisfies the inequality:
‖UQ_t − Q*‖_ξ ≤ β‖Q_t − Q*‖_ξ
for some positive weight vector ξ and scalar β ∈ [0, 1). This fact is proven
by Parr (1998).
2. The error term B.u_t(s, N) given by:

B.u_t(s, N) = ( ∫_{−∞}^{+∞} r R(r | s, π(s)) dr + Σ_{s′,k} T(s′, k | s, N) γ^k max_{N′∈B} B.Q(s′, N′) )
            − ( ∫_{−∞}^{+∞} r R*(r | s, π(s)) dr + Σ_{s′,k} T*(s′, k | s, N) γ^k max_{N′∈B} B.Q(s′, N′) )
converges to zero with probability 1. This term represents the difference
between doing an update with current internal policy of N.B (with out-
comes given by R and T ), and doing an update with the optimal internal
policy of N.B (with outcomes given by R? and T ?).
This condition follows from the inductive hypothesis. If B is of granularity
g then N.B is of granularity g+1, so we can assume that the internal policy
of N.B converges with probability 1. Therefore T (s′, k|s, N) converges
to T ?(s′, k|s, N) and B.ut(s, N) converges to zero with probability 1, as
required. (The complete proof requires a particular upper bound on the
rate of convergence, but we shall not concern ourselves with such details
here.)
5.3 Multiple levels of hierarchy
So far the explanations of Rachel have assumed only one level of hierarchy
between the main goal and the primitive policies. In certain domains it may be
desirable to have two or more levels of hierarchy. In this section we describe how
the algorithms presented above can be extended to multiple levels of hierarchy.
The key element is the idea of behaviour granularity, as presented in Sec-
tion 4.3.4. The main task is treated as a behaviour of granularity zero, with
post-condition equal to the goal and a pre-image that is true in all states (op-
tionally a smaller pre-image may be used to treat some states as failure states).
We shall refer to this behaviour as Root. We then extend the planner to al-
low it to decompose a behaviour of granularity g into a plan of behaviours of
granularity g + 1. We can do this recursively until we reach a behaviour that
cannot be further decomposed, and so learns a primitive policy instead. Execut-
ing the resulting hierarchy of plans will require an extended version of the TRQ
algorithm.
5.3.1 Hierarchical planning
Algorithm 10 Hierarchical planning: Iterative Deepening
function GrowPlan(behaviour B)
ExpandNode(B, B.plan.root, B.plan.depth + 1, {})
end GrowPlan
function ExpandNode(behaviour B, node N, depth d, explored E)
E ← E ∪ {N.cond}
if d > 0 then
for each child C ∈ N.children do
ExpandNode(B, C, d− 1, E)
end for
else
PlanStep(B, N, E)
end if
end ExpandNode
Three modifications need to be made to the planning algorithm to accom-
modate hierarchical planning. All three are fairly simple:
1. Instead of taking a goal expression as input, we take the behaviour which
is to be decomposed. Its post-condition will serve as the goal.
2. In choosing a behaviour to add to the plan in PlanStep we restrict our
search to behaviours of the appropriate granularity (i.e. one more than the
granularity of the behaviour being decomposed).
3. When constructing a plan for behaviour B there is no point in adding nodes
which lie outside of B.pre, as they will never be executed. So conditions
from B.pre are added to every node. If that results in a node that can
never be satisfied, then it is pruned.
The pseudocode for the resulting algorithm is shown in Algorithms 10 and 11.
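The third modification (conjoining the parent behaviour's pre-image with each node condition and pruning unsatisfiable nodes) can be sketched as follows. Conditions are modelled here, purely for illustration, as dictionaries mapping fluent names to required values:

```python
# Sketch of modification 3: add the parent behaviour's pre-image to each new
# node condition, pruning nodes whose combined condition can never hold.
# The dict-of-fluents representation is a hypothetical simplification.

def conjoin(pre, cond):
    """Return the combined condition, or None if it is unsatisfiable."""
    merged = dict(pre)
    for fluent, value in cond.items():
        if merged.get(fluent, value) != value:
            return None           # contradictory values: prune this node
        merged[fluent] = value
    return merged

pre = {"fuel_empty": False}                       # the parent's pre-image
print(conjoin(pre, {"taxi_at": "red"}))           # kept, with pre added
print(conjoin(pre, {"fuel_empty": True}))         # pruned (None)
```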
Algorithm 11 Hierarchical planning: Adding new nodes
function PlanStep(behaviour B, node N, explored E)
for each behaviour B′ with granularity B.gran + 1 do
\\Find which of the node conditions B′ achieves, if any
(achieved , unachieved )← Achieved(B′, N.cond)
if achieved = {} then
Skip to the next behaviour
end if
\\Check for interference
if B′.post ∧ N.cond⇒⊥ then
Skip to next behaviour
end if
\\Construct the new node’s condition
newCondition ← B.pre ∧ unachieved ∧ B′.pre
if newCondition ⇒⊥ then
Skip to the next behaviour
end if
\\Check if the new condition has already been explored
for each condition C ∈ E do
if newCondition ⇒ C then
Skip to the next behaviour
end if
end for
\\Add the new node tree
Nnew .cond← newCondition
Nnew .parent← N
Nnew .B← B′
N.children← N.children ∪ {Nnew}
end for
end PlanStep
5.3.2 Hierarchical learning: P-HSMQ
Extending P-HSMQ to multiple levels of hierarchy is straightforward. It is sim-
ply a matter of making the algorithm operate recursively, starting with the Root
behaviour and selecting a behaviour from its plan and then executing that be-
haviour in the same fashion. Eventually there will be a behaviour for which
the plan does not cover the current state, and the agent will have to resort to
choosing a primitive action instead. The pseudocode is shown in Algorithm 12.
Note that primitive actions are allowed at any level of the hierarchy, when
no behaviour is available. This means that the agent can use plans that are only
partially complete, and fill in the remainder of the policy with primitive actions.
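The recursive selection rule (use an active behaviour from the plan when one covers the current state, else fall back to a primitive action) can be sketched as below. The dictionary representation of behaviours and the example states are hypothetical:

```python
# Sketch of recursive action selection in multi-level P-HSMQ: descend through
# the hierarchy of plans until no behaviour covers the state, then fall back
# to a primitive action.

def choose(behaviour, state):
    """Recurse down the hierarchy until a primitive action is reached."""
    active = [b for b in behaviour.get("plan", []) if b["condition"](state)]
    if not active:
        return behaviour["primitives"][0]   # no active node: use a primitive
    return choose(active[0], state)         # recurse into the sub-behaviour

go = {"condition": lambda s: s == "hall", "plan": [],
      "primitives": ["step_west"]}
root = {"plan": [go], "primitives": ["noop"]}
print(choose(root, "hall"))      # plan covers the state: recurse into Go
print(choose(root, "closet"))    # no active behaviour: primitive fallback
```

This mirrors the point made above: a partially complete plan still yields a policy, with primitive actions filling the gaps.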
5.3.3 Hierarchical learning: TRQ
An extra complexity arises when extending the TRQ algorithm to multiple levels
of hierarchy. The possibility arises that a behaviour B1 may be interrupted while
it was in the process of executing sub-behaviour B2. The question then arises,
how should we update the Q-value B1.Q(s, B2) (the value of B2 according to the
local reward for B1)?
The simplest answer is that the experiences gathered while executing B2
should be discarded. This will not affect the correctness of the algorithm, pro-
vided that B1 is occasionally permitted to persist beyond the interruption, as
described in Section 5.2.3 above. This is the solution employed in the pseudocode
of Algorithm 13.
There are two sources of behaviour interruption in the multi-level TRQ al-
gorithm. Suppose {Root, B1, B2, . . . , Bn} is the stack of behaviours executing at
time t. The teleo-reactive semantics require that plans are monitored reactively
at all levels of the hierarchy, so if the node of Bk.plan executing at time t is no
longer active at time t+1 then all behaviours Bk+1, . . . , Bn must be interrupted.
Alternatively, Bk may elect to persist with Bk+1 in spite of the interruption.
In this case, learning must be suspended on all behaviours Root, B1, . . . , Bk−1
which are above Bk in the stack. In either case, the interrupted behaviours must
discard any gathered experiences.
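The first rule (when the node executing at level k becomes inactive, every behaviour below it is interrupted and its gathered experience discarded) can be sketched over a hypothetical stack of behaviour frames:

```python
# Sketch of interruption propagation in multi-level TRQ: behaviours below the
# interrupted level discard their lessons and clear their active nodes.
# The stack-of-dicts representation is illustrative only.

def interrupt_below(stack, k):
    """Interrupt behaviours k+1..n in the executing stack."""
    for frame in stack[k + 1:]:
        frame["lesson"] = []          # discard gathered experience
        frame["active_node"] = None
    return stack

stack = [{"name": n, "lesson": ["e"], "active_node": "N"}
         for n in ("Root", "B1", "B2")]
interrupt_below(stack, 0)             # Root's executing node became inactive
print([f["lesson"] for f in stack])   # → [['e'], [], []]
```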
Algorithm 12 Planned HSMQ-Learning with multiple levels of hierarchy
function P-HSMQ(behaviour B) returns sequence S
t← 0
sequence S = {}
Observe state st
Bt ← ActiveBehaviours(B.plan, st)
while st |= B.pre ∧ st 6|= B.post do
T ← t
if Bt = ∅ then
Choose primitive aT ← π(st) from B.P
according to an exploration policy
Execute action aT
Observe state st+1
sequence S′ ← {〈st, at, st+1〉}
else
Choose behaviour BT ← π(st) from Bt
according to an exploration policy
sequence S′ ← P-HSMQ(BT )
aT ← BT
end if
k ← 0
totalReward ← 0
for each 〈s, a, s′〉 ∈ S′ do
totalReward ← totalReward + γkB.r(s, a, s′)
k ← k + 1
end for
S ← S + S′
t← t + k
Observe state st
At ← ActiveBehaviours(B.plan, st)
if At = ∅ then
At ← B.P
end if
B.Q(s_T, a_T) ←α totalReward + γ^k max_{a′∈A_t} B.Q(s_t, a′)
end while
return S
end P-HSMQ
Algorithm 13 Teleo-Reactive Q-Learning
function TRQ(behaviour B)
while B has not terminated do
Observe state st
〈st, at, st+1〉 ← TRQ-Execute(st, B)
TRQ-Update(B, 〈st, at, st+1〉)
end while
end TRQ
Algorithm 14 Execute a behaviour
function TRQ-Execute(state st, behaviour B)
returns experience 〈st, at, st+1〉
node N← B.activeNode
\\Select a new node if necessary
if N = null then
N t ← ActiveNodes(B.plan, st)
if N t 6= ∅ then
Choose node N← B.π(st) from N t
end if
B.lesson← {}
end if
\\Execute the active node
B.activeNode← N
if N = null then
Choose primitive at ← B.π(st)
Execute action at
Observe next state st+1
else
〈st, at, st+1〉 ← TRQ-Execute(st, N.B)
TRQ-Update(N.B, 〈st, at, st+1〉)
B.lesson← B.lesson + 〈st, at, st+1〉
end if
return 〈st, at, st+1〉
end TRQ-Execute
Algorithm 15 TRQ Update
function TRQ-Update(behaviour B, experience 〈st, at, st+1〉)
N← B.activeNode
At ← ActiveNodes(B.plan, st)
if At = ∅ then
At ← B.P
end if
if N = null then
\\Update executed primitive
B.Q(st, at)α←− B.r(st, at, st+1) + γ maxa∈At
B.Q(st+1, a)
else if node N /∈ At then
k ← 0
totalReward ← 0
for each 〈s, a, s′〉 ∈ B.lesson do
totalReward ← totalReward + γkr(s, a, s′)
k ← k + 1
end for
B.Q(s_T, N) ←α totalReward + γ^k max_{a∈A_t} B.Q(s_t, a)
B.activeNode← null
if st+1 |= N.B.pre and st+1 6|= N.B.post then
\\N.B has been interrupted due to a side-effect.
with probability η do
Interrupt(st+1, Root)
TRQ(N.B)
else
Interrupt(st+1, N.B)
end with
end if
end if
end TRQ-Update
Algorithm 16 Discard lessons from interrupted behaviours
function Interrupt(state s, behaviour B)
N← B.activeNode
if N 6= null then
Interrupt(s, N.B)
end if
B.activeNode← null
B.lesson← {}
end Interrupt
5.4 Summary
In this chapter we have presented algorithms for planning with reinforcement-
learnt behaviours and for learning with plan-constructed task-hierarchies. We
have shown how having a plan can inform the execution process allowing intel-
ligent decisions of when to prematurely interrupt a behaviour, leading to better
policies. In the next chapter we shall introduce the third element of the Rachel
architecture, the reflector, which observes the learnt behaviours of the actor and
uses its observations to better inform the planner, thereby closing the loop.
Chapter 6
Rachel: Reflection
The third and final component of Rachel is the reflector. This component
monitors the execution of plans by the actor, detects and records undesirable
side-effects, and induces pre-conditions under which they might be avoided. This
information is in turn fed back into the planner which we shall modify to incor-
porate it into new plans.
The need for the reflector is due to assumptions made in the planner. The
planner initially assumes that behaviours are side-effect free. If a given fluent F
is true when a behaviour is initiated then the planner assumes it will remain true
until the behaviour terminates, unless it is known to interfere with the pre-image
or post-conditions of the behaviour. This is called the frame assumption. In this
chapter we shall investigate what happens when this assumption does not turn
out to be true, and how the reflector enables Rachel to overcome this problem.
6.1 The Frame Assumption
It is a commonly recognised problem in the field of symbolic planning that to
build plans we need to model not only what fluents are affected by a behaviour,
but also what fluents are unaffected. Without such knowledge, we cannot be sure
that conditions achieved by one behaviour will not be undone by a subsequent
step in the plan. Taking the grid-world as an example, the plans we build to
fetch the book and the coffee rely on the fact that the Go() behaviour does
not affect the truth of the holding() fluent (or, at least, is unlikely to do so). With
this knowledge, it is safe to build plans which first Get() the book or the coffee
and then Go() to another location. Teleo-reactive plans can compensate for low-
probability failures in this area (as in the coffee-spilling scenario in Section 5.2.2),
but if such a side-effect consistently recurs, then the plan will be useless unless
it takes the effect into account.
The difficulty lies in the fact that there are typically many more things that
a behaviour does not affect than things it does. Enumerating them all is often
impractical. This is known as the frame problem (Shoham & McDermott, 1988;
Hayes, 1973). The most convenient solution, used by many planning algorithms,
is the Strips assumption (Fikes & Nilsson, 1971). This assumption states that
the only fluents affected by a planning operator are those explicitly mentioned
in its add and delete lists.
The planner described in Chapter 5 relies on an even stronger frame assump-
tion. Unlike Strips-operators, Rachel’s behaviours initially have no concrete
implementation. Instead, they must be learnt. As a result, it is hard to specify
in advance what the side-effects of a behaviour will ultimately be, once it has
learnt a policy. The local reward function for a behaviour, defined in Equa-
tion 4.2, does not reward or punish any effects other than outright success or
failure. The doctrine of recursive-optimality says it should be this way: the local
policy of a behaviour should be based only on the goals of the behaviour, and be
independent of the context in which it is being used. If the reward function does
not exclude an effect, then we cannot say for sure that it will not end up being
part of the learnt policy.
Nevertheless, to build plans with these behaviours we must make assump-
tions about what side-effects they will have. Rachel’s planner makes the most
optimistic assumption: that a behaviour has no effect other than those described
in its post-condition. If a fluent F is true when a behaviour B is initiated, then
the planner assumes that it will remain true until the behaviour terminates.
Furthermore, if the fluent does not conflict with the post-condition of B, then
the fluent will also be true in the terminal state.
This is a convenient assumption to make when planning, in the absence of a
learnt policy for the behaviour, but since the local reward function for B makes
no attempt to enforce it, in practice it may turn out to be untrue.
Two kinds of errors can result from a bad frame assumption:
1. A plan can be generated with too many alternative paths, some of which
are not viable in practice. If there are two apparent paths to the goal, but
one of them is rendered useless by an unforeseen side-effect, both of them will
be optimistically included in the plan. This situation is recoverable, as the
actor’s ability to do hierarchical reinforcement learning will enable it to
learn which path to take and which to avoid. The only loss is time wasted
exploring the unprofitable path.
2. A plan can be generated with too few alternatives, omitting some be-
haviours which are in fact necessary. The purpose of the planner is to
prune the set of behaviours available to be explored by the actor to only
those that appear relevant. However, if the planner is too optimistic about
the effects of a behaviour then it may prune other important behaviours
from this set. The actor cannot recover from this situation on its own, as
it can only explore those alternative behaviours offered by the plan. To fix
this, the agent must notice that it is being too optimistic in its applica-
tion of the frame assumption, and somehow revise its plans to contain the
necessary alternatives.
It is this second problem which we will be particularly focusing on in this
chapter. To make it clearer, we will introduce a new example domain: the
taxi-world.
6.2 Example Domain - Taxi-world
The “taxi-car domain” was originally posed by Dietterich (1998) to
study the MAXQ algorithm. A number of variants of this problem have appeared
in his papers. The particular variant we shall explore operates as follows.
Figure 6.1: The Taxi-Car Domain.
The world consists of a 5× 5 grid as shown in Figure 6.1. The agent controls
a taxi that can move around the grid in any of the four cardinal directions
as the walls allow. Five cells in the grid have been labelled as locations of
particular interest. They are red, green, blue, yellow and fuel. At one of
these locations is a passenger who waits for the taxi to take her to another
location. The passenger’s starting location and destination are randomly chosen
and vary from one trial to the next. It is the job of the agent to learn a policy
to navigate to the passenger’s location, pick her up, navigate to her destination
and drop her off there.
There is an added complication: the taxi begins with a randomly determined
amount of fuel. Each movement the taxi makes uses up one unit of fuel. If the
taxi runs out of fuel, then it has failed in its task. There is not always enough
starting fuel to complete the task assigned to the agent. There is, however, the
possibility of refilling the tank by visiting the Fuel location and executing the
refill action.
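The fuel dynamics described above can be sketched as a toy transition function. The dictionary-based state, the action names, and the full-tank value of 16 (taken from Table 6.1) are illustrative choices, not the thesis code:

```python
# Toy sketch of the taxi-world fuel dynamics (illustrative only).
def step(state, action):
    """Each movement uses one unit of fuel; refilling requires the fuel cell."""
    s = dict(state)
    if action in ("north", "south", "east", "west"):
        if s["fuel"] == 0:
            s["failed"] = True    # out of fuel: the task has failed
        else:
            s["fuel"] -= 1
    elif action == "refill" and s["loc"] == "fuel":
        s["fuel"] = 16            # full tank (16 per Table 6.1)
    return s

s = {"loc": "fuel", "fuel": 1, "failed": False}
s = step(s, "refill")
print(s["fuel"])                  # → 16
s = step(s, "north")
print(s["fuel"])                  # → 15
```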
The agent is equipped with the instruments shown in Table 6.1. Fluents are
shown in Table 6.2. Some fluents, like psgr loc() and psgr dest() are simple
wrappers to certain instruments; others have more complex definitions.
Table 6.1: The instruments used in the Taxi-car domain.

x : the current X position of the taxi
y : the current Y position of the taxi
passenger location : one of red, green, blue, yellow, fuel or taxi
passenger destination : one of red, green, blue, yellow, or fuel
fuel : the amount of fuel in the tank, between 16 (full) and 0 (empty)
The agent’s goal is described as a root behaviour Deliver:
Deliver(L) : L ∈ Locations
gran: 0
view: {x, y, passenger location, passenger destination, fuel}
pre: fuel(F) ∧ gt(F, 0)
post: psgr dest(L) ∧ psgr loc(L)
P: { north, south, east, west, pickup, putdown, fill}
Four sub-behaviours are defined to simplify this task: GoTo, Get, Put, and
Refuel. Teleo-operator descriptions of these behaviours are given in Table 6.3.
Table 6.2: Fluents used in the Taxi-car domain.

psgr loc(Location)             the passenger’s location
psgr in taxi                   the passenger is in the taxi
psgr dest(Destination)         the passenger’s destination
taxi loc(Location)             the taxi’s location
fuel(Fuel)                     the fuel level
distance(Location, Distance)   the Manhattan distance to a given location
gt(X, Y)                       X is greater than Y
rgte(X, Y, R)                  X is greater than or equal to Y × R
Only one of these sub-behaviours, GoTo, needs to be learnt; the other three are
just wrappers for a single primitive action: pickup, putdown and fill respectively.
Table 6.3: The four types of agent behaviour in the Taxi-world.

GoTo(L) : L ∈ Locations
  gran: 1
  view: { x, y, id(L) }
  pre: true
  post: taxi loc(L)
  P: { north, south, east, west }

Get(L) : L ∈ Locations
  gran: 1
  pre: taxi loc(L) ∧ psgr loc(L)
  post: psgr in taxi
  P: { pickup }

Put(L) : L ∈ Locations
  gran: 1
  pre: taxi loc(L) ∧ psgr in taxi
  post: psgr loc(L)
  P: { putdown }

Refuel
  gran: 1
  pre: taxi loc(bowser)
  post: fuel(15)
  P: { fill }
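The teleo-operator descriptions above are essentially records with granularity, pre-image, post-condition and action fields. A minimal sketch of such a record follows; the class and field names are ours, and representing conditions as strings is a simplification of Rachel's first-order terms.

```python
from dataclasses import dataclass

# Hypothetical container for a teleo-operator description like those in
# Table 6.3. Rachel's real representation is first-order; this is a sketch.
@dataclass
class TeleoOperator:
    name: str        # behaviour name, possibly parameterised
    gran: int        # granularity level in the hierarchy
    pre: str         # pre-image, a formula over fluents
    post: str        # post-condition the behaviour achieves
    actions: tuple   # primitive actions (or sub-behaviours) available to it

GOTO = TeleoOperator("GoTo(L)", 1, "true", "taxi_loc(L)",
                     ("north", "south", "east", "west"))
REFUEL = TeleoOperator("Refuel", 1, "taxi_loc(bowser)", "fuel(15)", ("fill",))
```

The key point of the representation is that only pre and post are symbolic; the policy that gets from one to the other is left to reinforcement learning.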
Figure 6.2 shows the decomposition of the Deliver behaviour into a plan
of sub-behaviours, using the hierarchical planning algorithm described in Sec-
tion 5.3.1. There is an important flaw in this plan: it never recommends the
Refuel behaviour. The GoTo() behaviour has a side-effect which is not part of
its definition. It uses up fuel. In some cases, it may use up all the fuel. For the
sake of the example, we have deliberately omitted this fact from its description.
As a result, the planner erroneously assumes there is no effect. Thus, according
to the planner, the Refuel behaviour is never necessary. This is an example of a
plan with too few alternatives, as described above.
Figure 6.2: A plan for the Deliver behaviour in the Taxi world. [A tree of plan nodes decomposing Deliver into GoTo(L), Get(L), GoTo(D) and Put(D), with node conditions built from psgr loc, psgr dest, psgr in taxi, taxi loc, fuel(F) and gt(F, 0).]
The mistake becomes plain as the agent endeavours to execute the plan and
learn the behaviours. The taxi will frequently run out of fuel while executing
the GoTo behaviour and the agent will be unable to learn a way to avoid this, as
the Refuel behaviour is never available to it. Even when GoTo has converged to
an optimal policy (from the small fraction of attempts which succeed before the
fuel runs out) the agent will unavoidably fail in a significant number of trials.
The aim of the reflector is to rectify this problem. Doing so will require four
steps:
• Detecting when side-effects occur and diagnosing what went wrong.
• Gathering positive and negative examples of the side-effect.
• Inducing a description of when the side-effect is likely to occur.
• Incorporating the learnt information into a new plan which includes steps
to avoid the side-effect.
Each of these steps is described in detail below.
6.3 Detecting and identifying side-effects
In a complex world, a behaviour that executes for any length of time is likely
to affect the truth of a large number of different fluents. Learning to predict
even a single change is time consuming; learning every possible change is totally
impractical. So the agent needs to focus its attention on the important effects
and ignore the others. Arguably the most important effects are those that cause
the agent’s plans to fail. These effects point to an important deficiency in the
agent’s world model. So Rachel’s reflector endeavours only to learn to predict
those effects that result in actual plan execution failures.
6.3.1 Plan Execution Failures
A plan execution failure can be characterised as follows. The agent is executing
plan P. It observes two successive states st−1 and st. Node N is activated in state
st−1 and dictates behaviour B. A plan failure occurs in state st under one of two
circumstances, either:
st ⊭ N.cond
st ⊨ B.pre
st ⊭ B.post                    (6.1)

or:

st ⊨ B.post
st ⊭ N.parent.cond             (6.2)
Equation 6.1 describes a side-effect that occurs in the middle of executing be-
haviour B. The pre-image is still satisfied, so the behaviour has not failed, but
the node condition is no longer true, so a side-effect has occurred. Equation 6.2
describes a side-effect that occurs at the point of completion of B. The be-
haviour has succeeded, its post-condition is satisfied. The parent node ought to
be satisfied but is not, due to some other condition becoming false.
Both of these cases describe what are termed negative side-effects (side ef-
fects that result in unexpected failure). Positive side-effects (which result in
unexpected success) are ignored by the reflector. So, for example, the execution
of a behaviour will be interrupted if its parent node becomes active, even though
the condition of the child node is still true. This is a positive side-effect and is
ignored.
Also ignored are any plan failures that are due to failure of the learnt
behaviour (i.e. when the behaviour violates its pre-image). These are to be
corrected by the existing reinforcement learning process, not by reflection.
6.3.2 Diagnosing the failure
Having detected that a plan failure has taken place, the reflector must identify
the actual side-effect. In both of the cases above, there is a node Nf of the plan
that is expected to be satisfied in st but is not. In case 1 Nf = N, in case 2
Nf = N.parent.
One or more of the fluents in this node’s condition is false. We wish to find
which they are. If there are no variables in the node condition, then this is
simply a matter of testing each in turn. However, variables complicate matters.
The planner may include existentially bound variables in node conditions. If two
or more fluents have shared variables then they cannot be tested in isolation.
The reason is simple:
Say, s ⊨ foo(a) ∧ bar(b)
then s ⊨ ∃X : foo(X)
and s ⊨ ∃X : bar(X)
but s ⊭ ∃X : (foo(X) ∧ bar(X))
This means that we cannot always identify a single fluent as the cause of the
side-effect. Instead we try to determine the shortest possible sequence of fluents
which cannot be satisfied simultaneously.
The first step is to split the node’s condition up into conjunctions of inde-
pendent fluents. Two fluents are dependent if they share one or more variables,
or if they both depend on a common third fluent.
Nf .cond = C1 ∧ C2 ∧ . . . ∧ Ck
Now each condition Ci can be tested in isolation. Any that cannot be satisfied
in state st are sources for side-effects, others are ignored. Those that do fail are
trimmed to remove any irrelevant fluents. If

Ci = ∃V1, . . . , Vm : (f1 ∧ . . . ∧ fn)

and for some j < n,

st ⊭ ∃V1, . . . , Vm : (f1 ∧ . . . ∧ fj)

then fluents fj+1, . . . , fn do not affect the failure of Ci and can be discarded. We
do this to make the side-effect as general as possible.
Finally all trimmed conditions are added to the set of known side-effects for
the executing behaviour B. Note that each side-effect is recorded as a conjunction
of fluents that under certain circumstances will fail to hold. This is the standard
form for a delete-list on a planning operator.
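Splitting a condition into independent conjunctions amounts to computing the connected components of the fluents under the shared-variable relation, which a simple union-find handles. The sketch below uses our own tuple representation of fluents; it is not Rachel's code.

```python
# Sketch of a SplitCondition step: group fluents into independent
# conjunctions. A fluent is written (name, variables); constants are
# simply omitted from the variable tuple. (Representation is ours.)

def split_condition(fluents):
    """Partition fluents into connected components under shared variables."""
    parent = {}

    def find(v):
        while parent.get(v, v) != v:
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    # Union each fluent with every variable it mentions; fluents that
    # share a variable (directly or transitively) end up in one component.
    for i, (_, vars_) in enumerate(fluents):
        for v in vars_:
            union(("f", i), ("v", v))

    groups = {}
    for i, f in enumerate(fluents):
        groups.setdefault(find(("f", i)), []).append(f)
    return list(groups.values())

# The N3 condition from the taxi example: gt(F, 0) shares F with fuel(F),
# so those two form one component; the other fluents stand alone.
n3 = [("psgr_dest", ("L",)), ("psgr_in_taxi", ()),
      ("fuel", ("F",)), ("gt", ("F",))]   # gt(F, 0): 0 is a constant
groups = split_condition(n3)              # three independent conjunctions
```

This matches the decomposition of N3.cond into C1, C2 and C3 worked through later in the chapter.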
Pseudocode for this operation is shown in Algorithm 17. We shall illustrate
it with an example from the taxi-car domain. Suppose the taxi has picked up the
passenger and is on its way to the destination, executing the GoTo(L) behaviour
(where variable L is bound at run-time by evaluating psgr dest(L) in the node’s
condition). The node that is executing is the third from the top; we shall call it
N3.
N3.cond = ∃L, F : (psgr dest(L) ∧ psgr in taxi ∧ fuel(F) ∧ gt(F, 0))
Algorithm 17 Reflector: Detecting side-effects

function DetectSideEffects(behaviour B, node N, state s)
    if s ⊨ B.post then
        if s ⊨ N.parent.cond then
            return
        else
            Nf ← N.parent
        end if
    else
        if s ⊨ N.cond or s ⊭ B.pre then
            return
        else
            Nf ← N
        end if
    end if
    candidateEffects ← SplitCondition(Nf.cond)
    for each C ∈ candidateEffects do
        if s ⊭ C then
            C ← TrimCondition(s, C)
            B.sfx ← B.sfx ∪ C
        end if
    end for
end DetectSideEffects
Algorithm 18 Reflector: Trimming a condition

function TrimCondition(state s, conjunction C) returns a trimmed conjunction
    newCondition ← true
    for each fluent f ∈ C do
        newCondition ← newCondition ∧ f
        if s ⊭ newCondition then
            break
        end if
    end for
    return newCondition
end TrimCondition
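Algorithm 18's prefix test can be rendered directly in code. The satisfiability test holds(state, prefix) is assumed; the stand-in definition below is purely illustrative and not a real entailment procedure.

```python
def trim_condition(holds, state, fluents):
    """Keep the shortest prefix of `fluents` that cannot be satisfied in
    `state`; any later fluents cannot affect the failure and are dropped.
    `holds(state, prefix)` is an assumed satisfiability test."""
    prefix = []
    for f in fluents:
        prefix.append(f)
        if not holds(state, prefix):
            break
    return prefix

# Crude stand-in for entailment, just for the taxi example: fuel(F) alone
# is always satisfiable, but adding gt(F, 0) fails when the tank is empty.
def holds(state, prefix):
    if prefix == ["fuel(F)"]:
        return True
    return state["fuel"] > 0

trimmed = trim_condition(holds, {"fuel": 0},
                         ["fuel(F)", "gt(F, 0)", "extra(F)"])
# → ['fuel(F)', 'gt(F, 0)']; the trailing fluent is trimmed away
```

The result is the most general conjunction that still fails, which is what the planner needs for a delete-list.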
Now suppose the fuel runs out in state st, before the taxi reaches its desti-
nation. The pre-image of GoTo(L) is true, so a case 1 side-effect has occurred:
st ⊭ N3.cond
st ⊨ GoTo(L).pre
st ⊭ GoTo(L).post
So the failed node Nf = N3. Now Nf .cond can be split into three independent
conjunctions:
Nf .cond = C1 ∧ C2 ∧ C3
C1 = ∃L : psgr dest(L)
C2 = psgr in taxi
C3 = ∃F : (fuel(F) ∧ gt(F, 0))
Now st ⊨ C1 and st ⊨ C2, so they can be ignored. However st ⊭ C3 so this
conjunction is the source of a side-effect. To trim C3 we test successively longer
prefixes, until we find one that cannot be satisfied:
st ⊨ ∃F : fuel(F)
st ⊭ ∃F : (fuel(F) ∧ gt(F, 0))
So ∃F : (fuel(F) ∧ gt(F, 0)) is added to the known side-effects of GoTo(L). If
there had been any further fluents in C3 after gt(F, 0), they would have been
trimmed.
6.4 Gathering examples
The standard Strips representation handles side-effects fairly simplistically.
They are all-or-nothing. If a behaviour has a particular side-effect, then it can
never be used to achieve a goal which would conflict with that side-effect. There
is no way to represent a side-effect which only occurs under certain circum-
stances. We have discovered that the GoTo behaviour has the side-effect that it
sometimes runs out of fuel. Does this mean that we can never use it when fuel
is an issue? Preferably not. What we want to know is when the behaviour is
likely to run out of fuel, and what steps we can take to prevent it.
With this in mind, Rachel implements a conditional model of behaviours’
side-effects. Each discovered side-effect has an associated condition describing
when it will not occur. If a behaviour B has a side-effect which causes the
failure of some conjunction of fluents C, then maintains(B, C) is the set of
states in which B can be executed without causing C to fail. This is called the
maintenance condition for C.
Formally, if B is initiated in a state st with
st |= B.pre ∧ C ∧maintains(B, C)
then it will terminate in a state st+k, with
st+k |= B.post ∧ C
and every intermediate state st+1, . . . , st+k−1 will satisfy
st+i |= B.pre ∧ C
We wish to learn a description of maintains(B, C) as a disjunction of con-
junctions of fluents. Such a description can be used by the planner to ensure
that the side-effects do not occur.
To learn such a description we need to collect positive and negative examples
of maintains(B, C). The reflector does this by monitoring the actor’s execution.
Whenever the actor is executing behaviour B, the reflector accumulates a trace
of states T . When the behaviour terminates, the reflector attempts to classify
the states in T as either positive or negative examples.
Suppose B begins executing in state st and terminates in state st+k (success-
fully, unsuccessfully or due to an interruption). The states T = {st, . . . , st+k−1}
are then classified. A state st+i is a positive example of maintains(B, C) if:
st+k |= B.post
st+j |= C for all j, i ≤ j ≤ k
that is, if the behaviour succeeds, and every state in {st+i, . . . , st+k} preserves
the condition C.
A state st+i is classified as a negative example if:
st+i |= C
st+j ⊭ C for some j, i < j ≤ k
that is, if C is satisfied in st+i but executing B causes it to no longer be satisfied.
Notice that not all states will match one of these two conditions. States in
which the condition is already false are ignored, as are states which lead neither
to a side-effect, nor to successful termination. We have no interest in classifying
such states either way.
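The classification rules above can be sketched as follows, assuming a test C(s) for the maintained condition and a flag saying whether the final state of the trace satisfied B.post. All names here are ours.

```python
def classify_trace(states, C, success):
    """Label each non-terminating state of a trace as '+', '-' or None.
    `C(s)` tests the maintained condition; `success` says whether the
    final state satisfied B.post. (Function and flag names are ours.)"""
    labels = []
    n = len(states)
    for i in range(n - 1):       # the terminating state itself is not labelled
        later_fail = any(not C(states[j]) for j in range(i + 1, n))
        if success and C(states[i]) and not later_fail:
            labels.append("+")   # behaviour succeeded and C held to the end
        elif C(states[i]) and later_fail:
            labels.append("-")   # C held here but was later violated
        else:
            labels.append(None)  # C already false, or uninformative
    return labels
```

Running this on the fuel levels of traces A and B of Table 6.4 (with C testing fuel > 0) reproduces the labels shown there.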
An example of the classification process is given in Table 6.4. It shows five
possible execution traces from executing the GoTo(X) behaviour in the taxi-world,
and classifies each state as a positive or negative example of
maintains(GoTo, ∃F : (fuel(F) ∧ gt(F, 0))).
Classified examples are stored in two lists, E+ and E−, for positive and nega-
tive examples respectively. Each list has a maximum size, n+max and n−max
respectively. Once a list is full, new examples replace randomly selected old examples.
This allows us to keep a “decaying window” containing a mixture of old and new
examples.
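The fixed-size lists with random replacement can be sketched as below; the class and method names are ours.

```python
import random

class ExamplePool:
    """Fixed-size example pool: fills up, then each new arrival replaces a
    randomly chosen old example, giving a decaying mix of old and new."""
    def __init__(self, max_size, seed=0):
        self.max_size = max_size
        self.items = []
        self.rng = random.Random(seed)

    def add(self, example):
        if len(self.items) < self.max_size:
            self.items.append(example)
        else:
            # Replace a uniformly random slot: older examples survive with
            # geometrically decaying probability as new ones arrive.
            self.items[self.rng.randrange(self.max_size)] = example
```

One such pool would be kept for the positive and one for the negative examples of each monitored side-effect.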
6.5 Inducing a description
Periodically, the actor may request a new side-effect description from the reflec-
tor. Rachel currently does this after a fixed number of trials, specified by the
user. The reflector selects a behaviour B and a condition C about which to learn
from the list of effects it is currently monitoring. To be selected, sufficient
positive and negative examples of the effect must have been gathered to form a
training set. The size of the training set is given by n+train and n−train for
positive and negative examples respectively. At present, these sizes are up to
the judgement of the trainer. The larger they are, the more accurate the induced
description is likely to be, but the longer it will take to gather enough
examples. We shall investigate this tradeoff empirically in the experimental
work later in this chapter.

Two training sets E+train ⊆ E+ and E−train ⊆ E− are randomly selected. This
training data must then be used to induce a symbolic description of
maintains(B, C).
We wish to learn a description that we can incorporate into future plans. This
learning process must have the following features:
1. The hypotheses must be expressed in the language of fluents used by the
planner.
2. Hypotheses can only be based on features of the current state and the
behaviour being executed (including its parameters). It may be necessary
(as in the Taxi-world example) to relate different fluent values to each
other.
Table 6.4: Classifying states as positive and negative examples of a side-effect.
Examples are drawn from execution of the GoTo behaviour in the taxi domain.
They are classified as positive or negative examples of
maintains(GoTo, ∃F : (fuel(F) ∧ gt(F, 0))).

state   fluents                                                    classification

A) Behaviour terminates successfully
s1      psgr in taxi, psgr dest(red), fuel(5)                      +
s2      psgr in taxi, psgr dest(red), fuel(4)                      +
s3      psgr in taxi, psgr dest(red), fuel(3)                      +
s4      psgr in taxi, psgr dest(red), taxi loc(red), fuel(2)       none

B) Behaviour interrupted due to running out of fuel
s5      psgr in taxi, psgr dest(blue), fuel(3)                     −
s6      psgr in taxi, psgr dest(blue), fuel(2)                     −
s7      psgr in taxi, psgr dest(blue), fuel(1)                     −
s8      psgr in taxi, psgr dest(blue), fuel(0)                     none

C) Behaviour runs out of fuel and terminates simultaneously
s9      psgr in taxi, psgr dest(green), fuel(3)                    −
s10     psgr in taxi, psgr dest(green), fuel(2)                    −
s11     psgr in taxi, psgr dest(green), fuel(1)                    −
s12     psgr in taxi, psgr dest(green), taxi loc(green), fuel(0)   none

D) Behaviour runs out of fuel but persists to completion
s13     psgr in taxi, psgr dest(red), fuel(3)                      −
s14     psgr in taxi, psgr dest(red), fuel(2)                      −
s15     psgr in taxi, psgr dest(red), fuel(1)                      −
s16     psgr in taxi, psgr dest(red), fuel(0)                      none
s17     psgr in taxi, psgr dest(red), fuel(0)                      none
s18     psgr in taxi, psgr dest(red), taxi loc(red), fuel(0)       none

E) Behaviour interrupted due to an unrelated side-effect
s19     psgr in taxi, psgr dest(yellow), fuel(5)                   none
s20     psgr in taxi, psgr dest(yellow), fuel(4)                   none
s21     psgr loc(blue), psgr dest(yellow), fuel(3)                 none
3. Since the agent’s environment is stochastic and its behaviours are being
learnt as they are used, the training data is inevitably noisy. The learning
algorithm must be able to handle this noise.
4. Each conjunction in the description will produce an extra branch in
the resulting plan-tree, so shorter descriptions are preferable to keep the
complexity of planning to a minimum.
Items 1) and 2) above suggest that Inductive Logic Programming (ILP) is
the appropriate tool for this learning task. The planner operates in terms of
first-order fluents. States and side-effects, which form the input to the learn-
ing algorithm, are described in this language, and a first-order description of
maintains(B, C) can be directly incorporated into plans. ILP naturally operates
in this language.
We chose the ILP system Aleph (Srinivasan, 2001a) for this purpose.
Aleph is able to use general Prolog programs as intensional background knowl-
edge. This means that it can use Rachel’s fluent definitions directly. It also
has robust noise handling features that are able to overcome the noisiness of the
training data. (Some modifications were required, however, to prevent it from
over-fitting, as outlined below.)
Aleph is designed to be a framework for exploring different ideas in ILP,
rather than just a single algorithm. It allows the user to customise many aspects
of the ILP process, which allows it to emulate a variety of ILP systems. Under its
default settings, which we used, it operates similarly to Progol (Muggleton, 1995).
The standard operation cycles through four steps (as described in the Aleph
manual):
1. Select Example. Select an uncovered positive example to be generalised.
If none exist, stop.
2. Saturate. Construct the most specific clause that entails the example
selected, and is within the language restrictions provided. This is usually a
definite clause with many literals, and is called the “bottom clause.”
3. Reduce. Search through all clauses containing a subset of the literals in the
bottom clause, from general to specific, to find the clause with the best
accuracy. Add this clause to the theory.
Table 6.5: Input to Aleph: Positive and negative examples

Positive examples:          Negative examples:

maintains(s1).              maintains(s4).
maintains(s2).              maintains(s5).
maintains(s3).              maintains(s6).
...                         ...
4. Cover Removal. Remove all positive examples which are covered by this
clause, and return to step 1.
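The select/saturate/reduce/cover-removal cycle has the shape of a standard covering loop. In the schematic below, best_clause stands in for the saturation and reduction steps, and the fall-back to a singleton fact mirrors the behaviour of unmodified Aleph described later in this section. All names are illustrative, not Aleph's internals.

```python
def covering_loop(positives, negatives, best_clause):
    """Schematic of Aleph's default induction loop. `best_clause(seed,
    pos, neg)` stands in for saturate + reduce: it returns a clause
    (here, any callable classifier) generalising the seed, or None."""
    theory, uncovered = [], list(positives)
    while uncovered:
        seed = uncovered[0]                                  # 1. select example
        clause = best_clause(seed, uncovered, negatives)     # 2-3. saturate, reduce
        if clause is None or not clause(seed):
            # Fall back to the example itself as a singleton fact, as
            # unmodified Aleph does when no accurate clause is found.
            clause = (lambda s: (lambda e: e == s))(seed)
        theory.append(clause)
        uncovered = [e for e in uncovered if not clause(e)]  # 4. cover removal
    return theory

# Toy run: even numbers are the "real" concept; 7 is noise and ends up
# covered only by a singleton fact.
even_rule = lambda e: e % 2 == 0
theory = covering_loop([2, 4, 6, 7], [1, 3],
                       lambda seed, pos, neg: even_rule if seed % 2 == 0 else None)
```

The singleton fall-back is exactly the behaviour Rachel's modifications remove, since a clause naming one time-stamped state is useless for planning.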
6.5.1 Input to Aleph
Aleph requires three files as input: one of positive examples, one of negative
examples and one of background. The positive and negative example files are
listings of the target predicate (maintains) for each state in E+train and E−train
respectively. Each predicate is time-stamped to identify the particular state
that it describes, as shown in Table 6.5.
Table 6.6 shows an abbreviated example from the background file for the
taxi-car domain. This file has four parts. The first part contains the parameter
settings for Aleph. Four important parameters for our purposes are:
• minacc This sets the minimum accuracy requirement for a clause to become
part of the theory.
• i This is an upper bound on the number of “links” required to connect
two literals in a clause, by unbound variables. For example psgr loc(L) ∧
distance(L, 2) has one link, psgr loc(L) ∧ distance(L, D) ∧ fuel(F) ∧
gt(F, D) has three.
• clauselength This is the maximum number of literals allowed in a clause.
• inducetime This is a new parameter, introduced by the modifications
described below. It is an upper limit on the amount of time Aleph spends
in the induction loop.
The second part of the background describes the language that Aleph will
use to build its theory. Mode and type information is given for the target concept,
and the fluents that will be used to describe it. Aleph ensures that every
clause it generates obeys these mode and type requirements, to avoid the cost of
generating and testing nonsensical hypotheses. This part of the background file
also contains Prolog clauses defining all the fluents in terms of the instrument
values which follow.
The third part contains an extensional listing of all instrument values for
every example state, as Prolog facts. The final part contains an extensional
listing of a special params fluent which lists the parameters of the behaviour
executing at the time the side-effect occurred. Fluents and instruments are
time-stamped to relate them to a particular state (even those like gt that are in
fact independent of state).
6.5.2 Modifications to Aleph
Two simple but important modifications have been made to Aleph in order
to use it for this purpose. Both changes relate to how Aleph handles positive
examples for which no adequate description can be generated.
The Aleph algorithm, as described above, loops through its four steps until
its theory covers every positive example in the training set. There is no mecha-
nism in existing versions of Aleph to allow it to treat some positive examples
as noise. If the reduction step cannot find a clause with sufficient accuracy then
it resorts to adding the example itself to the theory as a single fact. So a theory
generated from noisy data might resemble:
maintains(S) :-
psgr_loc(L), dist(L, D), fuel(F), gt(F, D).
maintains(S) :-
psgr_dest(blue).
maintains(s23).
maintains(s59).
maintains(s101).
This indicates that states s23, s59 and s101 were not covered by the rest of
the theory, and did not generate clauses which satisfied the minimum accuracy
requirements. They are effectively noise.
Table 6.6: Input to Aleph: the background file
%%% PART 1: Parameter Settings
:- set(minacc, 0.5).
:- set(i, 4).
:- set(clauselength, 5).
:- set(inducetime, 1800).
%%% PART 2: Fluent definitions
:- modeh(1,maintains(+state)).
:- determination(maintains/1, psgr_loc/2).
:- modeb(*, psgr_loc(+state, -location)).
psgr_loc(State, Location) :-
passenger_location(State, Location).
:- determination(maintains/1, psgr_dest/2).
:- modeb(*, psgr_dest(+state, -location)).
psgr_dest(State, Destination) :-
passenger_destination(State, Destination).
:- determination(maintains/1, gt/3).
:- modeb(*, gt(+state, +int, +int)).
gt(State, A, B) :-
A > B.
...
%%% PART 3: Instrument values
xpos(s1, 5).
xpos(s2, 4).
xpos(s3, 3).
...
%%% PART 4: Behaviour Parameters
:- determination(maintains/1, params/2).
:- modeb(*, params(+state, -location)).
params(s1, red).
params(s2, red).
params(s3, red).
...
Descriptions that name particular states are not at all useful to Rachel. A
particular time-stamped state occurred once and will never happen again. So we
have modified Aleph to filter out these singleton hypotheses. Positive examples
which cannot be covered are ignored.
Furthermore it has been noticed that the probability of generating such hy-
potheses increases as the algorithm progresses. It generally happens that a fairly
comprehensive theory is established after Aleph has considered the first few
positive examples, and then a large amount of time is spent adding each of the
exceptional states as singletons. Much effort can be saved by cutting off the cov-
ering process early. Rachel’s modified Aleph uses the simplest possible cutoff:
a time limit is placed on the induction process. After a set time (specified in
seconds by the inducetime parameter) the search process is stopped and the set
of hypotheses constructed so far is returned as the theory. This may be crude
but it proves in practice to be effective.
6.5.3 Output from Aleph
Aleph outputs a number of clauses of the form:
maintains(S) :-
f_1(S, ...), f_2(S, ...), ..., f_k(S, ...).
All fluents in the body of a clause, f_1, . . . , f_k, are linked by the state variable
S. (This is guaranteed, as maintains is the only term that outputs a state, and
every other fluent requires a state as input.) The other parameters of each fluent
may be bound to constant values, or to (possibly shared) variables.
Before adopting this theory as a new description of the side-effect, the reflec-
tor evaluates it against all the examples in E+ and E−. If a previous description
has been learnt, it is similarly evaluated, otherwise the null-description (which
says that the side-effect never happens) is evaluated. The new hypothesis is
adopted if and only if it satisfies a minimum accuracy threshold and is
more accurate than the old hypothesis. Accuracy is measured in terms of the
total number of classification errors produced by the theory.
If the new hypothesis is adopted, then the reflector converts it into a list
of conjunctions by taking the body of each clause and stripping off the state
variables. This set of conjunctions is then sent to the planner, so that it can be
used in rebuilding plans.
6.5.4 Adding Incrementality
Aleph is a batch learner. It takes a batch of examples, and produces a theory to
describe them. Our learning problem, on the other hand, is incremental. We have
a continuous stream of new examples being created, and wish to incrementally
revise our theory to match new evidence as it arrives. Some extra work needs to
be done to use Aleph in this way.
The reflector handles this by keeping fixed size pools of positive and negative
training examples, E+max and E−max, for each side-effect being learnt. Training
examples are drawn randomly from these pools. Examples are added to the
pools as they arrive, until each pool reaches its maximum size (n+max and n−max
respectively). After this, incoming examples randomly replace examples already
in the set. This random replacement technique provides a “decaying window”
of old and new examples.
Reflection first happens when each example pool is full. It is subsequently
repeated when at least half the examples in each of the sets have been replaced.
(Positive and negative examples may arrive at different rates.)
After reflection, the new theory produced by Aleph is tested on all the
examples in the pool, as is theory from the previous step. Whichever theory is
more accurate is retained. Accuracy is measured by the sum of the number of
false positives (examples in E−max that are covered by the theory) and the number
of false negatives (examples in E+max that are uncovered).
The initial theory is that the side-effect in question never happens, i.e. the
empty theory. The result of the first batch is only kept if it proves more accurate
than this default theory.
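The error counting and theory comparison just described can be sketched as follows. Here a theory is a list of clause-predicates, a state is covered if any clause matches it, and all names are ours.

```python
def error_count(theory, pos_pool, neg_pool):
    """Total classification errors of a theory on the example pools:
    false positives (negative examples the theory covers) plus false
    negatives (positive examples it leaves uncovered)."""
    covers = lambda e: any(clause(e) for clause in theory)
    false_pos = sum(1 for e in neg_pool if covers(e))
    false_neg = sum(1 for e in pos_pool if not covers(e))
    return false_pos + false_neg

def better_theory(new, old, pos_pool, neg_pool):
    """Keep whichever theory makes fewer errors; ties favour the old one."""
    new_err = error_count(new, pos_pool, neg_pool)
    old_err = error_count(old, pos_pool, neg_pool)
    return new if new_err < old_err else old

# Toy theory: "maintains iff even". The empty list plays the role of the
# initial theory, under which the side-effect never happens.
even_theory = [lambda e: e % 2 == 0]
```

Note that the empty theory covers nothing, so it is only displaced once an induced theory explains the positive examples better than "the side-effect never happens".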
6.6 Incorporating side-effects into plans
Repairing a plan when a new side-effect description arrives is a complicated
business. It requires that the planner investigate every plan in its library for
nodes that use the affected behaviour, test whether the side-effect in question
affects that use, and if so rebuild the portion of the plan-tree rooted at that
node. The current implementation of Rachel takes the simpler course of simply
throwing out all existing plan trees and rebuilding them from scratch whenever
a new side-effect description arrives. This is terribly inefficient, but since new
descriptions in practice arrive only rarely, it is not much of a problem. More
intelligent plan repair is left for future work.
The PlanStep function presented earlier in Section 5.3.1 needs to be ex-
tended to include the ability to plan with behaviours that have conditional side-
effects. The extension is fairly simple: before the newly created node is added
to the tree, each of the side-effects of the employed behaviour are checked to see
if they interfere with the desired operation of the behaviour.
A side-effect cannot interfere with the post-conditions of a behaviour, as
such failures are never recorded as side-effects. If it interferes in any way, it will
be by contradicting the “unachieved” conditions of the parent node, which are
carried over to the new node. The frame-assumption says that it is safe to do
this, but side-effects violate this assumption. So it is necessary to test each of
the discovered side-effects against the maintained conditions to ensure that there
are no conflicts.
If behaviour B has a side-effect that results in the failure of a condition C,
then it conflicts with the maintained conditions if:
¬C ⇒ ¬unachieved
If this is the case, then one of the maintenance conditions for C must be added
to Nnew.cond in order to ensure that the side-effect does not occur. Each of
the conjunctions learnt by the reflector results in a new plan node. Pseudocode
for the extended plan-step operation is shown in Algorithm 19.
The params fluent in the maintenance conditions plays a special role. Rather
than being added to the newly created node, it places a restriction on the pa-
rameters of the behaviour. The parameters of the behaviour are unified with the
corresponding parameters in the params fluent. If the unification is not possible,
then that maintenance condition cannot be used.
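This special handling of params can be sketched as a one-step unification against ground behaviour parameters. The representation below is our own (Rachel's real unification is Prolog's): a fluent is (name, args), Prolog-style variables start with an upper-case letter, and constants must match exactly.

```python
def apply_params_restriction(maint_condition, behaviour_params):
    """Unify the params(...) fluent of a maintenance condition with the
    behaviour's actual (ground) parameters. A variable binds to anything;
    constants must be equal. Returns the remaining fluents with the
    bindings applied, or None if unification fails."""
    bindings, rest = {}, []
    for name, args in maint_condition:
        if name != "params":
            rest.append((name, args))
            continue
        for formal, actual in zip(args, behaviour_params):
            if formal[0].isupper():          # Prolog-style variable
                bindings[formal] = actual
            elif formal != actual:
                return None                  # constants clash: no unification
    substitute = lambda a: bindings.get(a, a)
    return [(n, tuple(substitute(a) for a in args)) for n, args in rest]

# The chapter's example: params(L) unifies L with blue, so the condition
# becomes distance(blue, D) ∧ fuel(F) ∧ gt(F, D); params(fuel) cannot
# unify with blue and so yields no node.
cond1 = [("params", ("L",)), ("distance", ("L", "D")),
         ("fuel", ("F",)), ("gt", ("F", "D"))]
restricted = apply_params_restriction(cond1, ("blue",))
```

The params fluent itself is consumed by the unification rather than added to the node, matching the description above.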
Continuing with our example, we know the GoTo behaviour has a side-effect
which causes ∃F : (fuel(F) ∧ gt(F, 0)) to fail. Suppose Aleph learns two main-
tenance conditions:
params(L) ∧ distance(L, D) ∧ fuel(F) ∧ gt(F, D)
and:
params(fuel)
which say that the taxi can safely reach any location for which it has more than
enough fuel, or the fuel location no matter how far it is. (This is not necessarily
a likely outcome of reflection, but we shall use it for the purposes of the example.)
Suppose the planner is attempting to achieve the goal:
taxi loc(blue) ∧ fuel(F) ∧ gt(F, 0)
Now we know that the GoTo(blue) behaviour will achieve the taxi loc(blue)
condition. The remainder of the goal must be maintained from the previous
node. A new node condition is constructed:

fuel(F) ∧ gt(F, 0)
However GoTo()’s side-effect conflicts with this condition. So one of the two
maintenance conditions must be added.
The first maintenance condition contains the fluent params(L). So L must be
unified with blue, the parameter of the behaviour. Under this unification the
maintenance condition becomes:
distance(blue, D) ∧ fuel(F) ∧ gt(F, D)
This condition is added to the new node condition above to get:
fuel(F) ∧ gt(F, 0) ∧ distance(blue, D) ∧ gt(F, D)
A node is added to the plan with this new condition.
Attempting to use the second maintenance condition however results in fail-
ure. The fuel parameter of the params fluent in the condition cannot be unified
with the blue parameter of the behaviour, so this alternative does not permit
the creation of a new node.
Figure 6.3 shows a partial plan for the Deliver(D) behaviour (with the location
D unbound), which uses the maintenance conditions above to avoid running out
of fuel. Notice that the Refuel behaviour has now been included in the plan. It
is worth examining in detail why this is so.
Previously, before the side-effect of GoTo and its corresponding maintenance
conditions were known, the node dictating GoTo(D) had the condition:
psgr dest(D) ∧ psgr in taxi ∧ fuel(F) ∧ gt(F, 0)
The PlanStep function would consider Refuel as a possible extension to this
node, as it achieves fuel(16) which would satisfy the fluents fuel(F) ∧
gt(F, 0). This would leave the unachieved conditions:
psgr dest(D) ∧ psgr in taxi
Algorithm 19 Exploratory planning: Adding new nodes

function PlanStep(behaviour B, node N, explored E)
    for each behaviour B′ with granularity B.gran + 1 do
        \\ Find which of the node conditions B′ achieves, if any
        (achieved, unachieved) ← Achieved(B′, N.cond)
        if achieved = {} then
            Skip to the next behaviour
        end if
        \\ Check for interference
        if B′.post ∧ N.cond ⇒ ⊥ then
            Skip to the next behaviour
        end if
        \\ Construct the new node’s condition
        newCondition ← B.pre ∧ unachieved ∧ B′.pre
        choose either
            type ← policy
            newCondition ← AddMaintenance(newCondition, B′)
        or
            type ← exploratory
        end choose
        if newCondition ⇒ ⊥ then
            Skip to the next behaviour
        end if
        \\ Check if the new condition has already been explored
        for each condition C ∈ E do
            if newCondition ⇒ C then
                Skip to the next behaviour
            end if
        end for
        \\ Add the new node to the tree
        Nnew.cond ← newCondition
        Nnew.parent ← N
        Nnew.B ← B′
        Nnew.type ← type
        N.children ← N.children ∪ {Nnew}
    end for
end PlanStep
Algorithm 20 Exploratory planning: Adding maintenance conditions
function AddMaintenance(condition N, behaviour B) returns augmented condition
\\Check for any side-effects that will conflict with
\\this node and add appropriate maintenance conditions.
for each condition C ∈ B.sfx do
if ¬C ⇒ ¬N then
pick M ∈ maintains(B, C)
N ← N ∧M
end if
end for
return N
end AddMaintenance
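A minimal Python rendering of this loop may make the data flow clearer. Here a condition is a set of literal strings, `conflicts` stands in for the ¬C ⇒ ¬N test, and `maintains` maps each side-effect to its learnt maintenance conditions; all of these names are illustrative assumptions, not Rachel's internals:

```python
def add_maintenance(condition, side_effects, conflicts, maintains):
    """Sketch of AddMaintenance (Algorithm 20): for each side-effect
    C of the behaviour that would falsify the node condition, conjoin
    one of the learnt maintenance conditions for C."""
    result = set(condition)
    for c in side_effects:
        if conflicts(c, result):              # the "¬C ⇒ ¬N" test
            candidates = maintains.get(c, [])
            if candidates:                    # "pick M": take the first
                result |= set(candidates[0])
    return result
```

In the taxi example, the fuel side-effect of GoTo would conjoin distance(D, Dist) ∧ gt(F, Dist) onto a node that already requires fuel(F) ∧ gt(F, 0).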
Figure 6.3: A plan for the Deliver behaviour which avoids running out of fuel.
Add to this the pre-image of Refuel and the pre-image of the parent behaviour
Deliver(D) and the condition for the new node would be:
psgr_dest(D) ∧ psgr_in_taxi ∧ taxi_loc(fuel) ∧ fuel(F) ∧ gt(F, 0)
However this condition is implied by the condition of its parent (above), so it
would never be activated. Therefore it is pruned from the plan.
However things change after the reflector has detected the side-effect on GoTo
and learnt the maintenance conditions to avoid it. The condition of the GoTo(D)
node is now:
psgr_dest(D) ∧ psgr_in_taxi ∧ fuel(F) ∧ gt(F, 0)
∧ distance(D, Dist) ∧ gt(F, Dist)
Once again, PlanStep will consider Refuel as a candidate behaviour, to achieve
fuel(F) ∧ gt(F, 0). Notice that in doing so, Refuel binds the variable F to 16,
so the term gt(F, Dist) becomes gt(16, Dist). The unachieved conditions are
now:
psgr_dest(D) ∧ psgr_in_taxi ∧ distance(D, Dist) ∧ gt(16, Dist)
Add to this the pre-images of Refuel and Deliver(D) and we get a new node with
condition:
psgr_dest(D) ∧ psgr_in_taxi ∧ distance(D, Dist) ∧ gt(16, Dist)
∧ taxi_loc(fuel) ∧ fuel(F) ∧ gt(F, 0)
This condition is not implied by any of the ancestor nodes, so it is added to the
plan.
The plan in Figure 6.3 shows only part of the full plan. The entire plan
is 15 levels deep and contains a total of 186 nodes. Many of these describe
different possible orderings on moving, getting the passenger, and refueling. The
size of this plan could be drastically cut down by introducing partially-ordered
planning, but this has its own set of difficulties, as described in Chapter 8.
6.6.1 Exploratory planning
The extended PlanStep function in Algorithm 19 contains a further complexity
that remains to be explained. It adds a type property to nodes that can either
be policy or exploratory. Exploratory nodes do not take side-effects into account.
Figure 6.4 shows the plan from Figure 6.3 with exploratory planning. A single
exploratory node has been added which uses the GoTo(D) behaviour ignoring its
side-effects.
Figure 6.4: A plan for the Deliver behaviour using exploratory planning. The node with the broken outline is exploratory.
Exploratory nodes exist to allow the agent to explore behaviours where they
might otherwise be prohibited. Without exploration of this kind, the agent
would be helpless to fix any over-specialised maintenance conditions produced
by the reflection. If the reflector underestimates the size of the maintenance
region for a particular side-effect then the policy nodes of the plan will be overly
restrictive and the behaviour will not be used as widely as it might be. Without
some form of exploration such a mistake is incurable. The actor will never use
the behaviour outside the limits of the plan, and so the necessary examples
of successful execution will never be generated. Without any new examples
the reflector is likely to continue producing the same over-specialised side-effect
description.
Exploratory nodes indicate when behaviours might be explored, while still
limiting choices to applicable behaviours. When the actor chooses a policy be-
haviour, it can choose only from the policy nodes of the plan, but when it chooses
an exploratory action it is free to choose from nodes of either type.
By occasionally exploring a behaviour even when it is expected to cause a
side-effect, counter-examples can be generated for overly restrictive maintenance
conditions, which may result in a more general description the next time the
reflector considers the side-effect.
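The choice rule can be sketched in Python; the node representation (dicts with behaviour and type fields) is an assumption for illustration:

```python
import random

def choose_behaviour(active_nodes, exploring, rng=random):
    """Policy steps may only draw on policy nodes; exploratory steps
    may draw on active nodes of either type.  Returns None if no node
    in the permitted pool is active."""
    if exploring:
        pool = list(active_nodes)
    else:
        pool = [n for n in active_nodes if n['type'] == 'policy']
    if not pool:
        return None
    return rng.choice(pool)['behaviour']
```

An exploratory step may thus occasionally run a behaviour despite an expected side-effect, generating the counter-examples the reflector needs.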
6.7 Summary
In this chapter we have explained the operation of the reflector, the third element
of the Rachel architecture. We have shown how the reflector detects execution
failures in the agent’s plan, diagnoses their causes and learns to predict when
they will occur, using Inductive Logic Programming. This new knowledge can
then be integrated into future plans, allowing the agent to avoid such unwanted
effects in the future.
This completes our discussion of the implementation of Rachel. In the
next chapter we attempt an empirical investigation into the benefits Rachel
provides.
Chapter 7
Experiment Results
In this chapter we shall exhibit the performance of the Rachel architecture in
three experimental domains: the simple gridworld and taxi-car domains from
earlier chapters, and a third more complex simulation based on the Robocup
robotic soccer competition.
7.1 Experiments in the gridworld domain
7.1.1 Domain description
The gridworld experiments were conducted in the world described earlier in
Sections 2.2.1 and 5.2.2. The map is reproduced in Figure 7.1. We represent the
plan of a house as a 75 × 50 grid. The agent is a robot located in one of these
cells. There are also two objects in the house, the coffee and the book, with
starting locations as indicated on the map. Walls and doors divide the map into
a collection of rooms, as shown on the map.
Primitive representation
The primitive state of the agent is represented by four instruments: the robot's x
and y coordinates, and the locations of each of the objects. Each object has only
two possible locations: either it is in its starting location or else the robot is
carrying it. The instruments representing this state are shown in Table 7.1.
There are nine primitive actions available to the agent: one for each of the
eight directions of movement (n, ne, e, se, s, sw, w, nw), plus the pickup action.
The movement actions move the robot to one of the eight neighbouring cells,
7. Experiment Results 133
Figure 7.1: The first experimental domain - the Grid-world.
Table 7.1: Instruments used in the Grid-world domain.
x            the current X position of the robot
y            the current Y position of the robot
loc(Object)  the location of Object:
             0, if it is in its original location;
             1, if it is being carried by the robot
provided there is no wall blocking the movement. These actions have a 5%
chance of error, in which case the agent moves at right angles to its desired
heading.
The pickup action will pick up any object in the same location as the robot.
If there is no object in the robot's cell, this action does nothing. The robot can
carry both objects simultaneously without problem. It cannot drop objects.
In the second and third experiments in this domain, we shall introduce a
“bump” into the world, as indicated by the shaded area on the map. If the
robot enters a shaded cell while carrying the coffee there is a 10% chance that
it will spill the coffee. If spilt, the coffee returns to its initial location.
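The movement and spill dynamics described above can be re-created in a short Python sketch (coordinates, the wall representation, and the right-angle slip rule are illustrative assumptions):

```python
import random

# Offsets for the eight compass movement actions.
MOVES = {'n': (0, -1), 'ne': (1, -1), 'e': (1, 0), 'se': (1, 1),
         's': (0, 1), 'sw': (-1, 1), 'w': (-1, 0), 'nw': (-1, -1)}

def step(pos, action, blocked, bump_cells, carrying_coffee, rng=random):
    """One movement action: 5% chance of slipping at right angles to
    the desired heading, walls block the move, and entering a bump
    cell while carrying the coffee spills it 10% of the time."""
    dx, dy = MOVES[action]
    if rng.random() < 0.05:
        dx, dy = (-dy, dx) if rng.random() < 0.5 else (dy, -dx)
    new = (pos[0] + dx, pos[1] + dy)
    if new in blocked:
        new = pos
    spilled = (carrying_coffee and new in bump_cells
               and rng.random() < 0.10)
    return new, spilled
```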
Symbolic representation
The symbolic representation of the Grid-world describes the rooms and their
topology. The fluents, shown in Table 7.2, represent the locations of the objects
and the robot in terms of the rooms they are in. The loc(Object) instrument
translates directly to the holding(Object) fluent. The connections between
rooms are given by the door() fluent.
Table 7.2: Fluents used in the Grid-world domain.
location(Object, Room)  true when Object is in Room;
                        Object may be one of robot, book or coffee
holding(Object)         true when the robot is holding Object
door(From, To)          true if there is a door linking rooms From and To
Using these fluents, we describe the agent's goal, and the behaviours it uses to
achieve this, shown in Table 7.3. The goal is to fetch both the coffee and the book,
and take them into the lounge. This is represented by the root behaviour Fetch.
Two sub-behaviours, Go() and Get() are also provided. The Go() behaviour is
designed to take the robot from its current room to a neighbouring one. The
Get() behaviour can be executed once the robot is in the same room as an object,
and is designed to locate the object and pick it up.
7.1.2 Experiment 1: Planning vs HSMQ vs P-HSMQ
The aim of the first experiment is to demonstrate the advantage of using a
combination of planning and hierarchical reinforcement learning over either one
alone. To this end, we shall compare three different approaches to learning the
Fetch behaviour above:
1. HSMQ-learning (Algorithm 1 in Section 2.3.2) with no pruning (all appli-
cable behaviours are available)
2. Executing the plan directly with reinforcement learning only at the bot-
tom of the hierarchy (learning primitive policies for behaviours). Choices
between behaviours in the plan were resolved in favour of the shallower
node, breaking ties randomly.
Table 7.3: Behaviours available in the Grid-world.
Fetch
  gran: 0
  view: { x, y, loc(book), loc(coffee) }
  pre:  true
  post: location(robot, lounge) ∧ holding(book) ∧ holding(coffee)
  P:    { n, ne, e, se, s, sw, w, nw, pickup }

Go(From, To)
  gran: 1
  view: { x, y, id(To) }
  pre:  location(robot, From)
  post: location(robot, To)
  P:    { n, ne, e, se, s, sw, w, nw }

Get(Object)
  gran: 1
  view: { x, y, id(Object) }
  pre:  location(Object, Room) ∧ location(robot, Room)
  post: holding(Object)
  P:    { n, ne, e, se, s, sw, w, nw, pickup }
3. P-HSMQ-learning (Algorithm 12 in Section 5.3.2) with a plan-based task-
hierarchy
Each approach was run twenty times, with each run consisting of 1000 consecutive trials. A trial begins with the agent empty-handed at its starting location
in the study, and ends when the agent arrives in the lounge with both
the coffee and the book.
The learning parameters were set as follows: The learning rate α was 0.1.
The discount factor γ was 0.95. Exploration was done in an ε-greedy fashion
with a 1 in 10 chance of the agent choosing an exploratory action (both at the
level of primitive actions, and in the choice of behaviours). Exploratory actions
were chosen using recency-based exploration (Thrun, 1992).
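These settings correspond to the usual one-step Q-learning backup with ε-greedy selection. The sketch below uses uniform random exploratory choice rather than the recency-based scheme actually employed, and assumes a tabular Q-function keyed by (state, action):

```python
import random

ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1   # values used in Experiment 1

def epsilon_greedy(q, state, actions, rng=random):
    """With probability EPSILON take an exploratory action,
    otherwise the greedy one (ties broken by action order)."""
    if rng.random() < EPSILON:
        return rng.choice(actions)
    return max(actions, key=lambda a: q.get((state, a), 0.0))

def q_update(q, s, a, reward, s_next, actions):
    """One-step Q-learning backup with alpha=0.1, gamma=0.95."""
    best_next = max(q.get((s_next, a2), 0.0) for a2 in actions)
    old = q.get((s, a), 0.0)
    q[(s, a)] = old + ALPHA * (reward + GAMMA * best_next - old)
```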
Results
Figure 7.2 shows the results of this experiment (note: this graph is plotted on
a log-log scale, to highlight the differences in the early trials).
Figure 7.2: A comparison of learning rates for three approaches to the grid-world
problem: (1) using HSMQ to select from every applicable behaviour, (2) using
a plan to select behaviours, (3) using P-HSMQ to select behaviours from
alternatives provided by a plan. Error bars represent one standard deviation.
Clearly the plan-based approaches converge much more rapidly than the unplanned approach.
Measuring the average number of experiences required before trial-length falls
below 500 shows that approach 1 takes 72324 primitive actions, approach 2
takes 29956, and approach 3 takes 35151. In both cases the difference
is significant with 99% confidence. The reason for this difference is apparent.
Approach 1 spent much of its time in early trials learning behaviours never
featured in its final policy. Approaches 2 and 3 restricted their exploration to a
smaller set of behaviours, which were more relevant to the task at hand. The
long term performance of approach 1 is also poorer, as it continues to explore
behaviours which do not contribute to the goal.
Figure 7.3: Average trial lengths for each individual run of the three approaches
in Experiment 1. Error bars indicate one standard deviation.
The exploratory actions performed by approaches 1 and 3 hide differences
between the final policies learnt by each. To resolve this, the learnt policies
from each of the three approaches were run for a further 1000 trials without
any further learning or exploration. The results of these trials are shown in
Figure 7.3.
This graph shows the average trial lengths for each repeat run. Notice that
the results from both HSMQ approaches fall into two clusters, one below 250 and
one above 280. These correspond to the two high-level solutions to the problem:
either getting the coffee first then the book (the shorter solution) or else getting
the book then the coffee (the longer solution). Both approaches converged to a
policy which implemented one of these two solutions well (indicated by the small
deviation per run). Approach 1, without the plan, chose the shorter solution in
8 out of 20 runs. Approach 3, with the plan, chose the shorter solution in 13
out of 20 runs. Contrast this with approach 2, which does no learning at the
behaviour level. It does not settle on one solution or the other, but selects one
randomly for each trial. This is shown by the much greater standard deviation
in these runs.
Ideally both approaches 1 and 3 ought always to converge to the better of
the two solutions. The failure to do so is probably due to lack of exploration,
and the “lock-in” effect that occurs when Q-values are pessimistically initialised.
This is expected to be more of a problem with the unplanned approach as it has
more options to explore and thus will take longer to find the better solution.
Even so, the combination of planning and hierarchical learning shown in
approach 3 appears, on average, to converge to a better and more stable solution
than either planning or hierarchical learning alone.
7.1.3 Experiment 2: P-HSMQ vs TRQ
Our second experiment investigates the issue of termination improvement. For
this experiment we introduce the “bump” into the world to examine how the
side-effect it causes affects the performance of each algorithm. We wish to
compare the performance of the P-HSMQ (Algorithm 12) and TRQ (Algorithm 13)
algorithms, and how the final policy produced by each compares to that produced
by a standard termination-improvement technique.
Twenty independent runs were performed with each of P-HSMQ and TRQ.
Learning and exploration rates were set as in Experiment 1. In the TRQ
approach, the probability of taking exploratory actions η was set to 0.1. Each run
consisted of 1000 learning trials, followed by 1000 test trials (in which learning
is disabled and both ε and η are set to zero).
Each policy learnt using P-HSMQ was also tested for 1000 trials using ter-
mination improvement (following the algorithm in (Sutton et al., 1999)) rather
than the usual subroutine semantics of HSMQ. The algorithm for termination-
improved execution is shown in Algorithm 21.
Algorithm 21 Termination-improved execution of a policy learnt by P-HSMQ
function TermImp(behaviour B)
t← 0
Observe state st
while st ⊨ B.pre ∧ st ⊭ B.post do
st+1 ← TermImpStep(B, st)
end while
end TermImp
function TermImpStep(behaviour B, state st) returns state st+1
Bt ← ActiveBehaviours(B.plan, st)
if Bt = ∅ then
Choose primitive at ← π(st) from B.P
according to an exploration policy
Execute at
Observe state st+1
else
Choose behaviour Bt ← π(st) from Bt
according to an exploration policy
st+1 ← TermImpStep(Bt, st)
end if
return st+1
end TermImpStep
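Algorithm 21 can be paraphrased in Python. The Behaviour class and the environment interface below are illustrative assumptions; the essential point is that the set of active nodes is re-evaluated after every primitive step, so a sub-behaviour is abandoned as soon as its node ceases to be active:

```python
class Behaviour:
    """A minimal stand-in for a behaviour: pre/post conditions, a
    primitive policy, and the sub-behaviours offered by its plan."""
    def __init__(self, pre, post, policy, sub=None):
        self.pre, self.post, self.policy = pre, post, policy
        self.sub = sub or []

    def active(self, state):
        # Stand-in for ActiveBehaviours(B.plan, s)
        return [b for b in self.sub if b.pre(state) and not b.post(state)]

def term_imp_step(behaviour, state, env_step):
    """Descend through active sub-behaviours, execute one primitive
    action at the bottom, then return so all choices are re-made on
    the next step."""
    active = behaviour.active(state)
    if not active:
        return env_step(state, behaviour.policy(state))
    return term_imp_step(active[0], state, env_step)  # greedy choice

def term_imp(behaviour, state, env_step, limit=10000):
    """Run the behaviour until its pre fails or its post holds."""
    steps = 0
    while behaviour.pre(state) and not behaviour.post(state) and steps < limit:
        state = term_imp_step(behaviour, state, env_step)
        steps += 1
    return state, steps
```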
(a) performance against number of experiences
(b) performance against number of trials (first 300 trials only)
Figure 7.4: A comparison of learning rates for TRQ and P-HSMQ in the grid-world
problem with the bump. Error bars represent one standard deviation.
Results
Figure 7.4 shows the results of this experiment, plotted as performance versus
number of experiences and performance versus number of trials. As can be seen
from the first of the two graphs, both algorithms took approximately the same
number of experiences to converge. If we count the number of experiences
before each approach produced a trial with fewer than 500 steps, we see that TRQ
took an average of 773276 steps while P-HSMQ took only 4% more, at 803526
steps.
There is, however, a significant difference in the number of trials required for
convergence. P-HSMQ took an average of 71 trials before it produced a trial less
than 500 steps long; TRQ took 103. A T-test shows this difference has more
than 99% significance. So P-HSMQ converged with fewer, longer trials, whereas
TRQ took many more trials, but they were significantly shorter.
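The T-tests reported here compare per-run statistics such as these trial counts. A Welch's t statistic, which does not assume equal variances, can be computed directly; this is a generic sketch, not Rachel's analysis code:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(xs, ys):
    """Welch's t statistic for two independent samples.  For twenty
    runs per condition, |t| roughly above 2.7 corresponds to a
    difference significant at the 99% level."""
    vx, vy = variance(xs), variance(ys)   # sample variances
    return (mean(xs) - mean(ys)) / sqrt(vx / len(xs) + vy / len(ys))
```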
Turning to the results produced in the testing phase, Figure 7.5 shows the
average trial length for each run of each approach, plus the results for the
termination-improved P-HSMQ policies. The data has been split into two cases:
Figure 7.5(a) shows the average performance in those trials in which the coffee
was not spilled (approximately 90% of the trials) and Figure 7.5(b) shows the
average performance for trials in which the coffee was spilled.
Four important facts are noticeable:
1. The runs of TRQ appear to fall into two distinct sets. Runs 4, 5, 8, 10 and
13 produced significantly better policies than the others (in both graphs).
Examining the policies produced shows that in these runs the agent learnt
to fetch the coffee first and then the book, whereas in the other runs the
agent learnt to fetch the book first, and then the coffee. Fetching the book
first requires that the agent traverses the length of the hallway 3 times,
instead of just 1 if the coffee is fetched first.
2. The results of P-HSMQ in the no-spill case show that it consistently per-
forms as well as the worse of the two policies learnt by TRQ. Examining
the policies produced shows that in all 20 runs the agent learnt to fetch
the book first, the poorer policy.
3. As expected, the policies produced by the P-HSMQ algorithm suffered
when the robot spilled the coffee. These trials were significantly longer
than the average performance of even the worse of the two sets of TRQ
policies (with greater than 99% confidence, according to a T-test).
4. Termination improvement improves this situation significantly. The per-
formance of the termination-improved P-HSMQ policies is not significantly
different to the worse of the two TRQ policies.
7.1.4 Experiment 3: The effect of the η parameter
In the final experiment with the grid-world domain we investigate the effect
of the η parameter on the performance of the TRQ algorithm. The problem
specified in Experiment 2 above was repeated for eleven different values of η,
ranging from 0 to 1. For each value, 20 runs of 1000 learning trials followed by
1000 test trials were performed.
Results
As in Experiment 2 above, we have plotted the performance data for this
experiment in terms of both performance versus number of experiences (Figure 7.6(a))
and performance versus number of trials (Figure 7.6(b)). Results are only shown
for three of the eleven values of η tested. The rest are summarised in Figure 7.7,
which plots the average number of experiences and the average number of trials
executed before a trial was completed in fewer than 500 steps.
Once again, there is no significant difference in the total number of experi-
ences required for convergence for any of the values of η used. However, small
values of η do show a higher variance. An F-test comparing results with η = 0
and η = 1 shows that the standard deviation of the former, 136459, is significantly
greater than the latter, 36983, with greater than 99% confidence.
On the other hand, the number of trials taken before convergence is seen
to vary significantly with η, falling from an average of 123 when η = 0 to a
minimum of 72 when η = 0.7. A T-test shows this difference is significant
with greater than 99% confidence. There is a slight rise as η increases further.
When η = 1 the average is 76 trials before convergence. A T-test rates this
difference at 85% significance.
(a) when the coffee is not spilled
(b) when the coffee is spilled
Figure 7.5: Final policy performance for Experiment 2, comparing policies
produced by TRQ, P-HSMQ and P-HSMQ with termination improvement. Error
bars show one standard deviation.
(a) performance against number of experiences
(b) performance against number of trials (first 300 trials only)
Figure 7.6: A comparison of learning rates for TRQ with different values of η.
Error bars show one standard deviation.
(a) average number of experiences before convergence
(b) average number of trials before convergence
Figure 7.7: Convergence times for TRQ with different values of η.
7.1.5 Discussion of the gridworld experiments
The grid-world experiments verify our expectations that pruning the task-hierarchy
is advantageous, resulting in faster convergence than exploring all behaviours,
and that the additional reactivity of Teleo-Reactive Q-learning allows better
policies. In this example, the extra cost of TRQ is minimised as there are few
redundancies in the plan, i.e. there are no cases where two or more active nodes
dictate the same behaviour. (In more complex worlds, we would expect TRQ
to take longer to converge as redundant nodes in the plan would have to be
explored individually.)
In both Experiments 1 and 2 the final policies learnt are sometimes sub-
optimal. The task has two possible solutions at the behaviour level, which differ
according to the order in which the objects are fetched. The two solutions differ
significantly in performance – fetching the book first causes the agent to take
around 80-100 more steps to reach the goal. Theory suggests that this path
should not be taken. In practice, the problem is probably due to insufficient
exploration. If one path performs better initially, it will receive the bulk of the
agent’s attention and will improve more quickly. Once such a decision has been
made, the chances of the agent switching to the alternative path are low, as it
must be explored a large number of times in order to rival the initial choice. This
is particularly the case in recursively optimal reinforcement learning. Behaviours
must be explored extensively in order for their internal policies to improve, before
a decision is made as to which behaviour is optimal.
Experiment 2 seems to show that P-HSMQ is more sensitive to this effect
than TRQ. In all of the twenty runs P-HSMQ converged to the longer of the
two possible solutions, whereas TRQ was able to find the better solution in five
runs. One possible explanation is that by committing to behaviours P-HSMQ
amplifies the drawbacks of early unconverged behaviours. Getting the coffee
first requires that the agent crosses the hall twice while carrying the coffee –
once to execute Go(hall, bedroom2) to fetch the book, and then once to execute
Go(hall, lounge) to finish the task. If the coffee is fetched second, then the hall
is only crossed once with the coffee in hand, when the final Go(hall, lounge) is
executed. Once both these behaviours have optimised their policies, the prob-
ability of spilling the coffee is much the same in either case, but initially these
behaviours will involve a lot of random exploration and the probability of spillage
will be high. As a result, in the early phase of learning it is likely to seem better
to fetch the book first, minimising the probability of a spill. Once this decision
is made, it will “lock in” until enough exploration of the alternative proves it to
be sub-optimal. This is true for both algorithms, but since P-HSMQ attaches a
much greater cost to spillage than TRQ, it is likely to require much more explo-
ration to recover. This possibly explains why TRQ was able to find the better
path more often than P-HSMQ.
Experiment 3 shows an apparent insensitivity of the TRQ algorithm to the
value of its parameter η. This suggests that the time taken for convergence in
this task is dominated by the amount of time required to learn primitive policies
for behaviours, and not by the time spent learning higher-level policy decisions.
Ultimately the same amount of low-level learning is required regardless of the value
of η, so the same amount of time is taken.
Alternatively, our measure of convergence might be too loose. We showed
above that most runs did not converge to the optimal behaviour-based policy.
If we used a tighter measure of convergence which would not be satisfied by the
sub-optimal path, then we might expect the value of η to have a stronger effect.
Further experiments need to be run to confirm this.
7.2 Experiments in the taxi-car domain
7.2.1 Domain description
Figure 7.8: The second experimental domain - The Taxi-Car.
The second set of experiments, focused on the action of the reflector, will be
conducted in the taxi-world from Section 6.2. The map of the world is reproduced
in Figure 7.8. The agent controls a simulated taxicab which has to navigate
between different locations in the world, picking up passengers and delivering
them to their desired destination. The agent has a limited fuel supply and a
critical part of the learning problem will be knowing when and how to refuel.
Primitive representation
The primitive state of the world is defined by the following factors: the x-y
position of the taxi, the location of the passenger, the passenger's desired
destination and the amount of fuel in the taxi. These elements are represented by
the instruments shown in Table 7.4.
Initially the taxi is placed at a random position in the world. The passenger is
randomly placed at one of the five locations, with a different random location as
its destination. The fuel tank is randomly set between half full and full, i.e. 8-16
units, inclusive.
There are seven primitive actions available to the agent: four controlling the
movement of the taxi (north, south, east, west), two for picking up and putting
down the passenger (pickup, putdown), and one for refilling the fuel tank (fill).
The movement primitives move the agent one cell in the direction specified,
unless there is a wall in the way, in which case they do nothing. These actions
have a 5% chance of error, in which case the agent moves at right angles to its
desired heading. Each movement action, whether successful or not, uses 1 unit
of fuel.
The pickup action only operates when the taxi is in the same location as the
passenger, in which case the passenger’s location is set to taxi. The putdown
action only operates when the passenger is in the taxi and the taxi is at one of
the five locations, in which case the passenger’s location is set to the location of
the taxi. Under other conditions these actions have no effect. Neither of these
actions has any effect on fuel.
The fill action only operates when the taxi is at the fuel() location, in which
case it sets the fuel tank to full (16 units). Otherwise it does nothing.
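The primitive dynamics above can be gathered into one Python sketch (the state encoding, map representation and right-angle slip rule are illustrative assumptions):

```python
import random

DIRS = {'north': (0, -1), 'south': (0, 1), 'east': (1, 0), 'west': (-1, 0)}
FULL_TANK = 16

def taxi_step(state, action, walls, locations, rng=random):
    """One primitive action.  `state` is a dict with 'pos', 'fuel',
    'psgr' (a location name, or 'taxi' when aboard) and 'dest';
    `locations` maps location names to cells.  Returns the successor."""
    s = dict(state)
    if action in DIRS:
        dx, dy = DIRS[action]
        if rng.random() < 0.05:            # 5% slip at right angles
            dx, dy = (-dy, dx) if rng.random() < 0.5 else (dy, -dx)
        new = (s['pos'][0] + dx, s['pos'][1] + dy)
        if new not in walls:
            s['pos'] = new
        s['fuel'] -= 1                     # fuel is spent even if blocked
    elif action == 'pickup':
        if s['psgr'] != 'taxi' and locations.get(s['psgr']) == s['pos']:
            s['psgr'] = 'taxi'
    elif action == 'putdown':
        here = [n for n, p in locations.items() if p == s['pos']]
        if s['psgr'] == 'taxi' and here:
            s['psgr'] = here[0]
    elif action == 'fill':
        if locations.get('fuel') == s['pos']:
            s['fuel'] = FULL_TANK
    return s
```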
Symbolic representation
The symbolic representation of the state of the taxi-world is given by the fluents
in Table 7.5. Some fluents, like psgr_loc() and psgr_dest(), are simple
wrappers to certain instruments; others have more complex definitions. Two
fluents, distance() and rgte(), are not used in the definition of behaviours. They
are included to make the hypothesis language for describing side-effects more
expressive.
Table 7.4: Instruments used in the Taxi-car domain.
x                      the current X position of the taxi
y                      the current Y position of the taxi
passenger location     one of red, green, blue, yellow, fuel or taxi
passenger destination  one of red, green, blue, yellow, or fuel
fuel                   the amount of fuel in the tank,
                       between 16 (full) and 0 (empty)
Table 7.5: Fluents used in the Taxi-car domain.
psgr_loc(Location)            the passenger's location
psgr_in_taxi                  the passenger is in the taxi
psgr_dest(Destination)        the passenger's destination
taxi_loc(Location)            the taxi's location
fuel(Fuel)                    the fuel level
distance(Location, Distance)  the Manhattan distance to a given location
gt(X, Y)                      X is greater than Y
rgte(X, Y, R)                 X is greater than or equal to Y × R
The five behaviours defining the taxi-car learning problem are shown in
Table 7.6. The root task is given by the granularity zero behaviour Deliver() which
sets the main goal:
psgr_loc(L) ∧ psgr_dest(L)
Four granularity one behaviours are provided to achieve this goal: GoTo(), Get(),
Put() and Refuel. Of these, only GoTo() needs to be learnt; the other three are
simply high-level wrappers to the primitive actions pickup, putdown and fill,
respectively.
The model of the GoTo() behaviour is missing an important fact - it uses up
fuel. A better model might include extra preconditions to ensure an adequate
amount of fuel is available to reach the goal. As it is, the behaviour pays no
attention to fuel, and is not penalised if the fuel happens to run out. That
concern is left to the parent behaviour. This means that the GoTo() behaviour
does not need to include fuel in its view, simplifying its state-space, but will
also cause trouble for the planner. In the absence of any other information,
the planner assumes the behaviour has no effect on fuel whatsoever. Using this
assumption, it builds the plan shown in Figure 7.9, which we shall refer to as
the “naive” plan.
Figure 7.9: A “naive” plan for the Deliver behaviour in the Taxi world, assuming the GoTo() behaviour has no effect on fuel.
As already discussed in Chapter 6, when the agent executes this naive plan
a plan failure will sometimes occur. This happens when the taxi runs out of
fuel unexpectedly, while executing one or other of the nodes which dictate the
GoTo() behaviour. This triggers the reflector, which diagnoses the side-effect
that caused the plan-failure and attempts to learn a condition under which it
can be avoided. In the experiments that follow we shall investigate the effects of
reflection on the agent's performance.
Table 7.6: Behaviours available in the Taxi-Car domain.

Deliver()
  gran: 0   view: { x, y, passenger location, passenger destination, fuel }
  pre: true   post: psgr_loc(L) ∧ psgr_dest(L)
  P: { north, south, east, west, pickup, putdown, fill }

GoTo(L) : L ∈ Locations
  gran: 1   view: { x, y, id(L) }
  pre: true   post: taxi_loc(L)
  P: { north, south, east, west }

Refuel
  gran: 1
  pre: taxi_loc(bowser)   post: fuel(16)
  P: { fill }

Get(L) : L ∈ Locations
  gran: 1
  pre: taxi_loc(L) ∧ psgr_loc(L)   post: psgr_in_taxi
  P: { pickup }

Put(L) : L ∈ Locations
  gran: 1
  pre: taxi_loc(L) ∧ psgr_in_taxi   post: psgr_loc(L)
  P: { putdown }
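The behaviour definitions in Table 7.6 can be captured in a simple record type. The following is an illustrative sketch only; the class and field names are ours, not Rachel's actual implementation:

```python
from dataclasses import dataclass

# Illustrative sketch of a behaviour record in the style of Table 7.6.
# Class and field names are hypothetical, not Rachel's actual code.
@dataclass(frozen=True)
class Behaviour:
    name: str
    gran: int        # granularity level in the hierarchy
    pre: str         # precondition, as a fluent expression
    post: str        # postcondition (the goal the behaviour achieves)
    actions: tuple   # primitive actions available to the learner

TAXI_BEHAVIOURS = [
    Behaviour("Deliver", 0, "true", "psgr_loc(L) & psgr_dest(L)",
              ("north", "south", "east", "west", "pickup", "putdown", "fill")),
    Behaviour("GoTo", 1, "true", "taxi_loc(L)",
              ("north", "south", "east", "west")),
    Behaviour("Refuel", 1, "taxi_loc(bowser)", "fuel(16)", ("fill",)),
    Behaviour("Get", 1, "taxi_loc(L) & psgr_loc(L)", "psgr_in_taxi", ("pickup",)),
    Behaviour("Put", 1, "taxi_loc(L) & psgr_in_taxi", "psgr_loc(L)", ("putdown",)),
]

# Only GoTo offers a real choice of actions and so needs a learnt policy;
# the other granularity-one behaviours wrap a single primitive action.
needs_learning = [b.name for b in TAXI_BEHAVIOURS
                  if b.gran == 1 and len(b.actions) > 1]
```

The distinction drawn in the text falls out directly: only GoTo has more than one primitive action to choose between.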
7.2.2 Experiment 4: Reflection
The aim of the first taxi-world experiment is to compare Rachel's performance
in the taxi-world with and without reflection, and also against a
hand-crafted task-hierarchy which includes the possibility of using the Refuel
behaviour. Our aim is to show how reflection might be used to identify and
repair important omissions from the world model.
We compare four different approaches to the problem:
1. TRQ without reflection (using the naive plan),
2. TRQ with reflection,
3. TRQ with a hand-crafted plan (described below),
4. TRQ with all instantiations of all behaviours available (whenever they are
applicable).
The hand-crafted plan is shown in Figure 7.10. Essentially it consists of
the naive plan plus an alternative branch which allows the agent to use the
Figure 7.10: A hand-crafted plan which adds refueling to the naive plan. [Plan graph: the naive plan's nodes GoTo(L), Get(L), GoTo(D) and Put(D), plus an alternative branch through GoTo(fuel) and Refuel, with conditions on psgr_loc, psgr_dest, psgr_in_taxi, taxi_loc and fuel(F) ∧ gt(F, 0).]
GoTo(fuel) and Refuel behaviours whenever they are applicable. This is not
a well-formed plan, insofar as it could not be produced by the planner, but it is
still executable. The side-effects produced by executing this plan are ignored.
Twenty independent runs were made of each approach, each 5000 trials long.
In approach 2, reflection was focused on a single side-effect: learning when the
fuel ran out. Other side-effects were detected, but ignored for the sake of this
experiment. Data was passed from the actor to the reflector in batches, after
every 500 trials. The training set and example pool sizes were set as follows:
n+train = n−train = 100, n+max = n−max = 1000
The reflector used the default parameters for Aleph, with the following
exceptions:
1. The minimum acceptable accuracy of a clause, minacc, was set to 0.5.
2. The upper bound on layers of new variables, i, was set to 4.
3. The upper bound on the number of literals in a clause, clauselength, was set to 5.
4. The custom-added limitation on time spent learning (see Section 6.5.2), inducetime, was set to 1800 (30 minutes).
In all four approaches the parameters to the reinforcement learning algorithm
are the same: the learning rate α is 0.5, the discount factor γ is 0.95 and the
exploration is recency-based, with a 10% probability of taking an exploratory
action.
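For reference, the backup performed by a tabular Q-learner under these parameters can be sketched as follows. This is a generic sketch, not Rachel's TRQ or P-HSMQ implementation, and plain ε-greedy selection stands in for the recency-based exploration actually used:

```python
import random

# A minimal tabular Q-learning update using the parameters quoted in the
# text: alpha = 0.5, gamma = 0.95, 10% exploratory actions.
ALPHA, GAMMA, EPSILON = 0.5, 0.95, 0.1

def q_update(Q, s, a, reward, s_next, actions):
    """One backup: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + ALPHA * (reward + GAMMA * best_next - old)

def choose_action(Q, s, actions, rng=random):
    """Epsilon-greedy selection; true recency-based exploration would
    instead prefer actions not tried recently."""
    if rng.random() < EPSILON:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

Q = {}
q_update(Q, "s0", "north", 1.0, "s1", ["north", "south"])
```

After a single rewarded step from an empty table the updated entry is simply α times the reward, 0.5 here.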
After learning, a further 5000 trials were performed for each run, with learn-
ing disabled, to judge the performance of the resulting policy.
Results
Figure 7.11 shows the results of this experiment. It shows the average number
of successful trials in each 100-trial period during learning. Plainly the naive
plan provides a significant advantage over exploring all possible behaviours, but
it falls far short of the performance gained by the hand-crafted hierarchy. This
is borne out by the testing phase, the results of which are shown in Table 7.7.
The runs which used reflection also quickly out-performed the naive plan,
soon after the first side-effect description was induced on the 500th trial. They
quickly converged to policies almost as successful as those using the hand-crafted
hierarchy. The final performance, however, was still not quite as good, with
only 91% success for the trials using reflection, compared to 95% for the hand-
crafted approach. A T-test confirms that this is a significant difference with 99%
confidence.
Figure 7.11: A comparison of learning performance in the Taxi-world, comparing (1) TRQ without reflection, (2) TRQ with reflection, (3) TRQ with a hand-crafted plan, (4) TRQ with all behaviours. Error-bars show one standard deviation. [Plot: success rate (%) against number of trials, 0–5000.]
Table 7.7: The success-rates of final policies learnt in the taxi-world.

Approach         Average Success Rate   Std. dev.
Naive plan       54.78                  4.96
Reflection       91.13                  3.19
Hand-crafted     95.17                  2.31
All behaviours   11.31                  4.61
It is worth examining what kinds of descriptions were learnt by the reflector.
In the 5000 trials of the learning phase, the reflector had an opportunity to
run ten times. In most cases, the result of reflection was only ever adopted
as an improved description of the side-effect three to five times out of the ten.
Figure 7.12 shows the change in accuracy of the side-effect description over these
ten iterations. It shows the average accuracy of the old (maintained) and new
(induced) hypotheses computed on all the examples in the pool. The initial
default description, which assumes the side-effect never happens, rates at only
around 65% accuracy, so it is immediately replaced with a hypothesis based on the
first induction (82% accurate, on average), but after several iterations the newly
created hypotheses cannot better the maintained hypothesis, and the accuracy
converges to approximately 89%.
Figure 7.12: The accuracy of maintained (old) and induced (new) hypotheses for each iteration of the reflector. [Plot: accuracy (%) against batch number, 1–10.]
There is a noticeable change in the hypotheses themselves. The first induced
hypothesis in each run had between 1 and 4 clauses, with a mean of 2.5. The
final hypothesis in each case contained just a single clause of the form:
maintains(S) :-
params(L), distance(L, D), fuel(F), rgte(F, D, ###).
where ### is some constant between 2.2 and 2.6. This clause expresses the fact
that the side-effect will not occur if the amount of fuel remaining is more than
the Manhattan distance to the target location multiplied by a certain factor.
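The learnt clause amounts to a simple numeric test. A sketch of it in Python (the helper names are ours, and the 2.4 factor is illustrative; the actual factor was induced per run, between 2.2 and 2.6):

```python
# Sketch of the learnt maintenance condition: the out-of-fuel side-effect
# is avoided when fuel >= fuel_factor * manhattan_distance(taxi, target).
# Helper names and the default factor are illustrative assumptions.
def manhattan(a, b):
    """Manhattan distance between two grid positions."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def maintains(taxi_pos, target_pos, fuel, fuel_factor=2.4):
    """rgte(F, D, R): F >= D * R, i.e. enough fuel to reach the target."""
    return fuel >= fuel_factor * manhattan(taxi_pos, target_pos)
```

For a target 5 grid squares away, 13 units of fuel satisfy the condition while 11 do not, since 2.4 × 5 = 12.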
7.2.3 Experiment 5: Second-order side-effects
In the previous experiment we limited reflection to operate only on the problem
of running out of fuel. The initial “naive” plan does not present us with any other
side-effects, so this may seem like a trivial limitation. However this overlooks the
fact that the plans generated after reflection contain many additional conditions
which are a source of numerous “second-order” side-effects. In this extension of
the above experiment we remove the limitation and allow the reflector to operate
on second-order and later side-effects.
Results
The experimental set up was the same as above. Twenty runs were performed
each consisting of 5000 trials. Of these twenty runs, only five were able to execute
to completion. The other fifteen terminated prematurely, as the planner ran out
of memory due to the size of the plans being generated – in each case involving
over 5000 nodes.
The five cases that did execute to completion were able to do so because
a side-effect description was learnt that prevented any plan whatsoever from
being built. Without exploratory planning to overcome this, the agent reverted to
learning a policy with monolithic Q-learning, with little success.
7.2.4 The bigger taxi world
The small size of the taxi-world in Figure 7.8 means that the GoTo() behaviour
will be learnt very quickly. By the time the reflector has gathered enough ex-
amples of the side-effect to induce a description, the target concept has become
stationary. This more or less negates the need for repeated reflection, as can be
seen from the results of Experiment 4.
To better investigate the interaction between learning behaviours and reflect-
ing on their side-effects, a larger example world is required. For this experiment,
we have scaled up the taxi-world from a 5 × 5 grid to a 25 × 25 grid, shown
in Figure 7.13. The distances between locations are five times greater, so the
initial fuel for the taxi has also been scaled. The full tank now holds a maximum
Figure 7.13: The 25 × 25 taxi-world. [Grid map: the four passenger locations R, G, B and Y, and the fuel bowser F.]
of 80 units of fuel. The taxi starts with a fuel level in the range 40 to 80.
The dynamics of the world are otherwise unchanged, and the same instruments,
actions, fluents and behaviours are used.
In the next three experiments we shall show the effects on reflection caused
by varying the training set size, varying the pool size and doing exploratory
planning, using this larger taxi-world as our test bed.
7.2.5 Experiment 6: The effect of the training set size
The effects of varying the absolute size of the training set for ILP problems are
well established, and this problem does not add any new results worth discussing. The
effect of varying the relative sizes of the positive and negative training sets, on
the other hand, is significant and worth exploring.
Aleph uses coverage as a measure of the goodness of a hypothesis. That is, it
counts the number P of positive examples and the number N of negative examples
from the training set which are covered by a given hypothesis, and uses the
difference P − N as a score for that hypothesis. By varying the sizes of the two
training sets E+train and E−train we can bias this estimate.
In a noisy world, the space of examples is likely to fall into three sets, those
that are definitely positive, those that are definitely negative, and those that
are a mixture of positive and negative (as illustrated in Figure 7.14(a)). When
choosing a hypothesis we are attempting to draw a boundary line, calling every-
thing on one side “positive” and everything on the other side “negative”. This
boundary will presumably lie somewhere in the “mixed” area. A more general
hypothesis will cover more examples in this area, both positive and negative. A
more specific hypothesis will cover fewer examples.
The coverage heuristic says that a more general hypothesis will be chosen if
for every extra negative example it covers, it also covers at least one extra positive
example. Thus it is sensitive to the density of positive and negative examples. If
there are more positive examples overall, then a more general hypothesis will be
chosen (Figure 7.14(b)). On the other hand, if there are more negative examples,
then the more specific hypothesis will score higher (Figure 7.14(c)). This is a
primitive form of cost-sensitive ILP (Srinivasan, 2001b).
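The bias can be seen by recomputing the worked numbers from Figure 7.14. This is a minimal sketch of the coverage score alone, not of Aleph's full clause search:

```python
# The coverage score from the text: positives covered minus negatives covered.
def coverage(pos_covered, neg_covered):
    return pos_covered - neg_covered

# Positives over-represented (Figure 7.14(b)): the general hypothesis wins.
specific_b = coverage(12, 0)   # specific clause: 12 - 0 = 12
general_b = coverage(20, 4)    # general clause:  20 - 4 = 16

# Negatives over-represented (Figure 7.14(c)): the specific hypothesis wins.
specific_c = coverage(6, 0)    # specific clause:  6 - 0 = 6
general_c = coverage(10, 8)    # general clause:  10 - 8 = 2
```

The same boundary choice flips depending only on the relative densities of positive and negative examples in the training set.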
As we discussed in Section 6.6.1, over-specialised side-effect descriptions cause
difficulty for Rachel. If the reflector underestimates the size of the maintenance
region for a particular side-effect then the policy nodes of the plan will be overly
Figure 7.14: ILP in a noisy world, (a) showing an area of mixed positive and negative examples, (b) if positive examples are over-represented, the coverage heuristic favours general hypotheses (20 − 4 = 16 beats 12 − 0 = 12), (c) if negative examples are over-represented, the coverage heuristic favours specific hypotheses (6 − 0 = 6 beats 10 − 8 = 2). [Diagram of example distributions omitted.]
restrictive and the behaviour will not be used as widely as it might be. Thus we
would expect better results if the reflector were biased towards producing more
general results.
To investigate this effect, we conducted experiments in the big taxi world,
varying the size of the negative training set while keeping the positive training
set constant. Twenty runs of 5000 trials were made with each of the following
settings:
1. No reflection
2. No reflection, with the hand-crafted hierarchy in Figure 7.10 above.
3. Reflection with n+train = 500, n−train = 100
4. Reflection with n+train = 500, n−train = 200
5. Reflection with n+train = 500, n−train = 300
6. Reflection with n+train = 500, n−train = 400
7. Reflection with n+train = 500, n−train = 500
Each run used the TRQ algorithm with η = 0.1, α = 0.5, γ = 0.95 and
ε = 0.1. The pool sizes n+max and n−max were both set to 5000. The parameters
for Aleph were the same as in Experiment 4 above, except that the minimum
accuracy was set by the equation:
minacc = (n+train + 1) / (n+train + n−train)
so that the resulting hypothesis must always be more accurate than the default
hypothesis (that the side-effect never happens).
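For reference, the bound evaluates as follows under Experiment 6's settings. The function name is ours, introduced only for illustration:

```python
from fractions import Fraction

# minacc = (n+_train + 1) / (n+_train + n-_train): an accepted clause must
# beat the default hypothesis (that the side-effect never happens), which
# is right on exactly n+_train of the n+_train + n-_train training examples.
def minacc(n_pos, n_neg):
    return Fraction(n_pos + 1, n_pos + n_neg)

# Experiment 6 keeps n+_train fixed at 500 and varies n-_train:
bounds = {n_neg: float(minacc(500, n_neg))
          for n_neg in (100, 200, 300, 400, 500)}
```

With balanced training sets (500/500) the bound is 0.501, just above chance; with a small negative set (500/100) it rises to 0.835.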
With more training examples than previous experiments, there are more
opportunities for Aleph to build clauses that cover only two or three examples.
Each such clause increases the branching factor of the resulting plan, and so the
size of the plan increases dramatically, exhausting available memory. To alleviate
this problem, the reflector only kept the three best clauses from the hypotheses
generated by Aleph, evaluating them in terms of coverage on the whole pool of
examples.
Second-order side-effects were ignored and exploratory planning was not used
for this experiment.
Figure 7.15: The effect of training set size on reflection, showing: (1) "naive" TRQ without reflection, (2) TRQ with a hand-crafted hierarchy, (3) TRQ with reflection using negative training set sizes 100, 200 and 500. Error bars show one standard deviation. [Plot: success rate (%) against number of trials, 0–5000.]
Results
The results of this experiment are shown in Figure 7.15. To avoid cluttering
the graph, the results of approaches 5 and 6 have been omitted, but they follow
the pattern established by those already shown. As in previous experiments,
the hand-crafted hierarchy significantly out-performs the “naive” approach with
more than twice the success rate after 5000 trials.
The approaches using reflection all show a marked change around 2100 tri-
als. This is when sufficient examples were collected for the first batch to be
sent to Aleph. In each case, the reflection resulted in a significant increase in
performance as the agent began to use the Refuel behaviour. At the 3000 trial
mark each reflective approach significantly outperforms the naive approach with
greater than 99% confidence.
However the final results are not so rosy. The naive approach continues
to improve after the reflective approaches have flattened out, and at the 5000
trial mark it is performing comparably to reflective approaches 5 and 6 and
significantly better than approach 7. However reflective approaches 3 and 4
are still performing significantly better. Neither approach, however, does as well
as the hand-crafted approach, which has a significantly greater final success rate
than any other. (All differences are significant with greater than 99% confidence,
according to a two-tailed T-test.)
What is the determining factor that decides the outcome of the different
reflective approaches? Looking at the descriptions learnt by each approach we
see a familiar pattern. Each description contains several clauses, but the one
that covers by far the majority of examples is of the form:
maintains(S) :-
params(L), distance(L, D), fuel(F), rgte(F, D, ###).
where ### is replaced by some constant we shall call the “fuel factor”. It is
this constant that distinguishes the different reflective approaches, as shown in
Table 7.8. The approaches with larger values of n−train chose more specialised
hypotheses with fuel factors around 2.5. Those approaches with smaller values
of n−train chose more general hypotheses with smaller fuel factors.
Table 7.8: The fuel factor for each reflective approach to Experiment 6.

Approach   n+train   n−train   av. fuel factor
3          500       100       1.61
4          500       200       2.18
5          500       300       2.38
6          500       400       2.49
7          500       500       2.65
The larger the fuel factor, the more fuel the agent estimates it will need to
reach its destination. If it does not have enough fuel, it may attempt to go
and refuel, but only if it thinks it can reach the fuel location without running
out of fuel. If the fuel factor is very large, even this appears impossible, and
the plan can provide no further contingencies. The agent resorts to monolithic
Q-learning with its primitive actions, which will take a much longer time to
produce a working policy. This explains why the approaches that produce more
specialised hypotheses perform more poorly in the long run.
7.2.6 Experiment 7: The effect of the pool size
The aim of this experiment is to study the effect of changing the size of the pool
of examples kept by the reflector. Twenty runs of 5000 trials were made with
each of the following settings:
1. n+max = n−max = 2500
2. n+max = n−max = 5000
3. n+max = n−max = 7500
4. n+max = n−max = 10000
5. n+max = n−max = 12500
6. n+max = n−max = 15000
Batches of data were passed from the actor to the reflector every 100 trials.
The training set sizes in each case were kept constant, regardless of pool size:
n+train = 500 and n−train = 100. The other settings for the actor and the reflector
were the same as in approach 7 of Experiment 6 above. Second-order side-effects
were ignored. Exploratory planning was not used for this experiment.
Results
The results of this experiment are shown in Figure 7.16. Plainly reflecting too
early can have a highly detrimental effect on learning. The approach that used
the smallest pool size, collecting only 2500 positive and negative examples,
performed very poorly, much worse than if no reflection had been done at all. All
of the other reflective approaches appear to perform better than the naive
approach, with a general trend towards better results the longer reflection is
delayed. All the reflective approaches with 7500 or more examples in each pool
show final policies significantly better than the naive approach (with 99%
confidence). The degree of improvement slows as the number of examples in-
creases. The results of using pool sizes of 12500 or 15000 examples are roughly
comparable.
Examining the side-effect descriptions produced from the different sized pools
gives us an insight into why the smallest pool size is so catastrophically worse
than the others. The approaches with pools of 5000 examples or more produced
Figure 7.16: The effect of pool size on reflection, showing: (1) "naive" TRQ without reflection, (2) TRQ with reflection using pool sizes ranging from 2500 to 15000. Error bars show one standard deviation. [Plot: success rate (%) against number of trials, 0–5000.]
Table 7.9: The fuel factor for each reflective approach to Experiment 7.

Approach   n+max    n−max    av. fuel factor
1          2500     2500     –
2          5000     5000     2.23
3          7500     7500     2.02
4          10000    10000    1.71
5          12500    12500    1.71
6          15000    15000    1.46
the familiar descriptions comparing distance and fuel as in previous experiments.
The fuel-factors for these approaches are shown in Table 7.9. However the de-
scriptions learnt from the pool of 2500 examples were quite different. In fourteen
of the twenty runs of this approach, the learnt concept was:
maintains(S) :-
psgr_loc(L).
That is, the side effect can be avoided so long as the passenger is at a location
(and not in the taxi). Why is this concept produced? The GoTo() behaviour
is used twice in the naive plan, once to go to the passenger's starting location
and once to go to the passenger's destination. Early on in the learning process
the policy for GoTo() is likely to be so bad that it uses up almost all the fuel
in the first of these two movements. The second one almost always fails. The
simplest way to distinguish between these two cases is to examine the position
of the passenger. There is a high correlation between the passenger being in the
taxi and the taxi running out of fuel. So this description is adopted.
Why is this description so damaging? It says that the passenger must not be
in the taxi in order for the taxi to go to any location without running out of fuel.
This makes delivering the passenger to her destination impossible. The planner
cannot construct a plan which satisfies its goals and maintains this condition,
so the learner reverts to learning a policy in terms of primitive actions, which
reduces its performance enormously. Furthermore, if behaviours are no longer
executed then there is no source of new examples to contradict the poor side-
effect description, so the agent cannot recover from its mistake.
7.2.7 Experiment 8: The effect of exploratory planning
The final experiment in the taxi-world investigates the effects of exploratory
planning. As shown above, reflecting too early can significantly impinge on
performance of the agent. Learning a condition for avoiding a side-effect that
is too specific results in a plan which never uses the affected behaviour. If a
behaviour is never used, then it never produces counter-examples from which
a better condition might be learnt. Exploratory planning allows the agent to
occasionally explore such behaviours, even if their side-effects would make them
inappropriate.
Figure 7.17: The effect of exploratory planning, showing: (1) "naive" TRQ without reflection, (2) TRQ with reflection using an example pool of size 2500 without exploration, (3) TRQ with reflection using an example pool of size 2500 with exploration, (4) TRQ with reflection using an example pool of size 12500 without exploration. Error bars show one standard deviation. [Plot: success rate (%) against number of trials, 0–5000.]
In this experiment we repeated the first case of the previous experiment, with
n+train = n−train = 100 and n+max = n−max = 1000, except with exploratory planning
activated. Once again, we performed 20 runs of 5000 trials, and we compared
the performance against the results from Experiment 7 above.
Results
Figure 7.17 shows the results of this experiment, plotted along with three re-
sults from the previous experiment. The difference between the results with
exploratory planning and without is pronounced. As we saw above, without ex-
ploratory planning reflecting with only 2500 examples in the pool leads to very
poor performance. However with exploratory planning, this same pool size pro-
duces results which are statistically indistinguishable from a pool of five times
the size, and achieves these results much sooner.
The cost is in terms of computational effort. The plans built by exploratory
planning had on average about 900 nodes apiece, only 40 of which were policy
nodes. Not only does it take more time to build these plans, but traversing the
tree to find the active nodes on each timestep takes considerably more time.
7.2.8 Discussion of the taxi-car experiments
Reflection has obvious advantages and disadvantages. As these experiments
show, provided it is not done prematurely and is sufficiently optimistic in its
predictions, it can effectively repair mistakes in plans, resulting in much better
performance. However if it is too pessimistic in its conclusions, either due to a
poor balance of positive and negative examples, or to drawing conclusions too
early, before behaviours have converged, it can easily cut off whole paths of a
plan which may in fact have been worth following. We have shown that this prob-
lem can be alleviated to some degree by exploratory planning and incremental
learning.
The chief problem with reflection is managing the size of the plans it produces.
Each clause produced to describe a side-effect increases the branching factor
of the resulting plan, and plan size tends to grow exponentially. Exploratory
planning increases this further still. This affects both the time it takes to build
plans and the time it takes to execute them. In these experiments we have used
some rather ad-hoc methods to limit this growth by placing arbitrary limits
on plan depth and on the number of clauses in a side-effect description. More
investigation is needed to find more principled ways to trade off plan size against
performance.
7.3 Experiments in the soccer domain
In the final series of experiments we shall apply Rachel to a far more complex
domain: a robot soccer simulation. Our aim is to show how Rachel performs
in a more realistic control task than the simple grid-worlds examined so far.
Furthermore, the soccer domain illustrates the application of multiple levels of
hierarchy, which has not been necessary in earlier examples.
Figure 7.18: The soccer domain.
7.3.1 Domain description
The soccer simulator used for these experiments is the SoccerBots simulation
(Figure 7.18), which is part of the TeamBots package available from http://www.teambots.org/.
TeamBots is a collection of Java applications and packages for multi-agent re-
search, maintained by Tucker Balch. The SoccerBots package is a simulation of
the RoboCup small-size robot league. We shall briefly outline its salient features
below. For full details, consult the documentation available on the web site.
The SoccerBots program simulates a 152.5cm by 274cm field with a goal at
either end. Up to ten robots may be placed on the field. Robots are circular and
12cm in diameter. The simulated robots can move at 0.3 meters/sec and turn
at 360 degrees/sec. Each robot belongs to one of two teams: those going “east”
and those going “west”.
The ball is 40mm in diameter and it can move at a maximum of 0.5 meters/sec
when kicked. It decelerates at 0.1 meters/sec/sec. Ball collisions with walls and
robots are perfectly elastic. To prevent deadlocks, whenever the ball has been
stationary for more than 30 seconds it is moved to a random position on the
field.
For the experiments described below, only three robots were used. Rachel
controlled two eastward-heading robots (with yellow and white markings) with
one stationary opponent randomly placed on the field. Rachel learnt to control
both robots to move the ball into the eastern goal. Note that we are controlling
the two robots with a single agent. The agent is able to observe the state of both
robots, and control the movements of both simultaneously.
7.3.2 Related work
As far as we are aware, this is the first published application of reinforcement
learning to this particular simulator. There have, however, been experiments
done with similar simulations. Stone and Sutton (2001) applied reinforcement
learning to the RoboCup simulation league simulator. In this work the agent
learnt to play a game of “3-vs-2 keep-away”. Three “keeper” robots learnt to keep
the ball away from two “taker” robots. Each player learnt a policy independently,
based on its own state and using its own actions. Actions were drawn from a set
of hand-crafted “skills”, such as “go to ball”, “pass ball” etc.
Wiering, Salustowicz, and Schmidhuber (1999) have also applied multi-agent
reinforcement learning to a custom soccer simulation. In their work, the agents
played 3-on-3 soccer, with each player acting independently but sharing a com-
mon policy. Again, actions were drawn from a set of hand-crafted behaviours.
Two things set this present work apart from these examples. The task we
are attempting is made more difficult by the fact that we wish to learn policies
based on more primitive actions. We shall employ behaviours similar to the skills
described above, but instead of providing them in advance, the agent will have
to learn them from primitive actions such as “move forward”, “turn left”, “turn
right”, etc.
The second difference is in the approach to controlling multiple soccer play-
ers. Multi-agent reinforcement learning is significantly more difficult than single-
agent learning. We shall avoid this problem by treating the pair of soccer players
as a single composite agent.
Table 7.10: Instruments used in the soccer domain.

x(Bot)                          the current X position of Bot
y(Bot)                          the current Y position of Bot
distance(Bot, Object)           the distance from Bot to Object
angle(Bot, Object)              the bearing from Bot to Object
angle(Bot, Object1, Object2)    the difference in bearing between Object1 and Object2, according to Bot
delta(distance(Bot, Object))    the change in distance between Bot and Object since the previous time-step
delta(angle(Bot, Object))       the change in bearing between Bot and Object since the previous time-step
can_kick(Bot)                   1 if Bot can kick the ball, 0 otherwise
7.3.3 Domain representation
Primitive representation
The state of the soccer world is given by the position, heading and velocity of
each of the robots, and the position, heading and velocity of the ball. Each of
these values can be measured in various ways: in absolute coordinates, or
relative to each other, or to another landmark on the field. Which of these
representations is appropriate will vary from behaviour to behaviour. The set of
available instruments in Table 7.10 provides the means to compute most of them.
The x(Bot) and y(Bot) instruments give absolute x,y-coordinates for Bot,
which must be one of the two controlled robots, called bot(0) and bot(1). The
coordinate system is in metres from the centre of the field, which is (0, 0). The
simulator does not provide absolute positioning for the other objects in the world.
It could be computed, but it does not turn out to be useful, so such instruments
are not provided.
The distance(Bot, Object) and angle(Bot, Object) instruments give the
distance and bearing to Object from the point of view of Bot. Object can be
any of the items in Table 7.11, including both objects in the world (such as
the ball) and landmarks (such as the goal). Distances are measured in metres.
Bearings are in degrees between 0 deg and 360 deg. 0 deg corresponds to the
direction the robot is heading. In some cases it is more important to know
the angle between two objects, rather than the absolute bearings of each. The
Table 7.11: Objects in the Soccer domain.

bot(0), bot(1)   The two bots.
teammate         A symbol used by each bot to represent the other.
opponent         The opponent bot.
ball             The ball.
goal(ours)       The goal the bots are guarding.
goal(theirs)     The opponent's goal, towards which the bots will shoot.
angle(Bot, Object1, Object2) instrument provides this information.
In the simulation, the robots’ sensors do not provide information about ve-
locities, so this information must be computed from successive distance() and
angle() values. The delta() instrument takes another instrument as a param-
eter, and returns the difference between the values of the instrument on the
current and the previous timestep.
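A minimal sketch of how such a delta() instrument might be computed by caching the previous reading (illustrative only, not the simulator's API):

```python
# Sketch of a delta() instrument: the simulated sensors report no
# velocities, so the change in a reading between successive timesteps
# is computed by remembering the previous value. Names are illustrative.
class Delta:
    def __init__(self):
        self.prev = None

    def __call__(self, value):
        """Return the change since the last call (0.0 on the first call)."""
        d = 0.0 if self.prev is None else value - self.prev
        self.prev = value
        return d

d_dist = Delta()
d_dist(5.0)         # first reading: no previous value, so delta is 0.0
step = d_dist(4.7)  # the object closed 0.3m since the last timestep
```

One such cache would be kept per wrapped instrument, so that delta(distance(...)) and delta(angle(...)) track their own histories independently.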
Finally there is a special-purpose instrument can kick(Bot) which simply
returns 1 or 0, indicating whether or not the Bot is in a position to kick the ball.
This can only happen if the ball is within 2cm of a “kicking spot” on the front
of the robot.
Control of the robots consists of a command sent every 100ms of simulated
time. Each robot has two effectors, one controlling movement and one controlling
kicking. The movement effector, denoted move(Bot, m) can take one of four
values: m ∈ {forward, left, right, stop}. The kick effector kick(Bot, k) can take one
of two values: k ∈ {true, false}. If a bot does not receive a movement command
on a particular timestep, then it assumes the value stop. If it does not receive
a kick command, it assumes the value false. So there are a total of 64 different
primitive actions (8 possibilities for each robot, yielding 8² = 64 combinations) that
could be executed. For the most part, the behaviours will limit their attention
to a subset of these which control a single robot while the other is stationary.
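The arithmetic of this joint action space can be checked with a short enumeration. The list names below are our own, purely illustrative:

```python
from itertools import product

# Per-robot command values, as described in the text.
MOVES = ["forward", "left", "right", "stop"]
KICKS = [True, False]

# Each robot independently chooses one of 4 moves x 2 kicks = 8 commands.
per_robot = list(product(MOVES, KICKS))
assert len(per_robot) == 8

# A joint primitive action is one command per bot: 8 x 8 = 64 combinations.
joint_actions = list(product(per_robot, per_robot))
assert len(joint_actions) == 64
```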
Symbolic representation
The symbolic representation of the soccer-world describes the layout of the field
in more abstract terms. Six fluents are used, as shown in Table 7.12. They
define such high-level concepts as when a goal has been scored, which robot (if
any) is currently controlling the ball, whether a target is close enough to kick
the ball to it, etc. As with the instruments above, the Object variables in the
fluents can be instantiated with any of the objects in Table 7.11.
Table 7.12: Fluents used in the Soccer domain.
score(Value)                          Value is 1 when we score a goal,
                                      −1 when the opponents score.
controlling ball(Bot)                 True if the Bot is within 30cm of the ball.
recently controlling ball(Bot)        True if controlling ball(Bot) was true
                                      on any of the last 10 timesteps.
near edge(Object)                     True if Object is within 10cm
                                      of the edge of the field.
within kicking distance(Bot, Object)  True if the Bot is within 60cm
                                      of the Object.
lined up(Bot, Object1, Object2)       True if the angle between Object1 and
                                      Object2 is less than 90 degrees.
One special fluent, recently controlling ball(Bot) needs particular ex-
planation. This fluent is true when Bot is controlling the ball and remains true
for ten time-steps thereafter. This is needed to define behaviours in which a bot
deliberately loses control of the ball, such as when it shoots a goal, or passes to an-
other bot. The preconditions of such behaviours use recently controlling ball(Bot)
to allow the bot to kick the ball and then lose control briefly before the ball
reaches its target. This causes a minor violation of the Markov assumption for
these behaviours, but does not prove troublesome.
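The ten-timestep persistence of this fluent can be implemented as a simple countdown, as in the following sketch (class and variable names are ours, not Rachel's):

```python
class StickyFluent:
    """Tracks a boolean condition and keeps reporting True for a fixed
    number of timesteps after the condition last held."""

    def __init__(self, hold_steps=10):
        self.hold_steps = hold_steps
        self.remaining = 0

    def update(self, condition_now):
        if condition_now:
            # Condition holds: reset the countdown.
            self.remaining = self.hold_steps
            return True
        if self.remaining > 0:
            # Condition lapsed recently: still report True while counting down.
            self.remaining -= 1
            return True
        return False

# recently_controlling_ball(bot): true while the bot controls the ball,
# and for 10 timesteps after it loses control.
recently = StickyFluent(hold_steps=10)
assert recently.update(True)          # bot controls the ball
for _ in range(10):
    assert recently.update(False)     # remains true for 10 more steps
assert not recently.update(False)     # then expires
```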
Eleven behaviours are used in the soccer world, at three levels of granularity.
At level 0 is the main task Score (Table 7.13) which has access to the complete
state representation and all primitives. The goal of the behaviour is to kick the
ball into the opponent's goal. It can be applied everywhere, except for a small
set of failure states in which the agent has scored an own-goal.
Three behaviours with granularity 1 define strategic decisions (Table 7.14).
Which bot is going to fetch the ball? Will it then attempt to shoot a goal
itself, or pass the ball to its teammate? The CaptureBall1(Bot), Shoot1(Bot) and
Pass1(FromBot, ToBot) behaviours implement these strategies. Each of these
behaviours controls only one of the two bots, and limits its view to data from
that bot’s viewpoint.
The seven remaining behaviours, with granularity 2, define simple movements
Table 7.13: Granularity 0 behaviours in the soccer domain.

Score
  gran: 0
  view: { x(bot(0)), y(bot(0)), can kick(bot(0)),
          distance(bot(0), goal(ours)), angle(bot(0), goal(ours)),
          distance(bot(0), goal(theirs)), angle(bot(0), goal(theirs)),
          distance(bot(0), teammate), angle(bot(0), teammate),
          distance(bot(0), opponent), angle(bot(0), opponent),
          distance(bot(0), ball), angle(bot(0), ball),
          delta(distance(bot(0), ball)), delta(angle(bot(0), ball)),
          angle(bot(0), ball, goal(theirs)),
          x(bot(1)), y(bot(1)), can kick(bot(1)),
          distance(bot(1), goal(ours)), angle(bot(1), goal(ours)),
          distance(bot(1), goal(theirs)), angle(bot(1), goal(theirs)),
          distance(bot(1), teammate), angle(bot(1), teammate),
          distance(bot(1), opponent), angle(bot(1), opponent),
          distance(bot(1), ball), angle(bot(1), ball),
          delta(distance(bot(1), ball)), delta(angle(bot(1), ball)),
          angle(bot(1), ball, goal(theirs)) }
  pre:  ¬score(−1)
  post: score(1)
  P: {(move(bot(0), m1), kick(bot(0), k1), move(bot(1), m2), kick(bot(1), k2)) |
      m1, m2 ∈ {forward, left, right, stop}, k1, k2 ∈ {true, false}}
around the field (Tables 7.15 & 7.16). Again, each behaviour controls only one
bot, while the other remains stationary. Kicking is only possible in those behaviours
that particularly need it: in ClearBall(Bot) to get the ball off the edge of the
field, in Shoot2(Bot) to shoot a goal and in Pass2(FromBot, ToBot) to pass to
the other bot. The primitive state representation is tailored to each behaviour,
omitting any unnecessary instruments.
Even using a hierarchical decomposition, the soccer domain is a very large
and complex search space. Two additional measures were needed to make
learning possible: function approximation and progress estimation.
Table 7.14: Granularity 1 behaviours in the soccer domain.

CaptureBall1(Bot)
  gran: 1
  view: { x(Bot), y(Bot),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  ¬controlling ball(Bot)
  post: controlling ball(Bot)
  P: {move(Bot, forward), move(Bot, left), move(Bot, right), move(Bot, stop)}

Shoot1(Bot)
  gran: 1
  view: { x(Bot), y(Bot), can kick(Bot),
          distance(Bot, goal(theirs)), angle(Bot, goal(theirs)),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          distance(ball, goal(theirs)), delta(distance(ball, goal(theirs))),
          angle(Bot, ball, goal(theirs)), delta(angle(Bot, ball, goal(theirs))),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  recently controlling ball(Bot)
  post: score(1)
  P: {(move(Bot, m), kick(Bot, k)) |
      m ∈ {forward, left, right, stop}, k ∈ {true, false}}

Pass1(FromBot, ToBot)
  gran: 1
  view: { x(FromBot), y(FromBot), can kick(FromBot),
          x(ToBot), y(ToBot),
          distance(FromBot, ToBot), delta(distance(FromBot, ToBot)),
          angle(FromBot, ToBot), delta(angle(FromBot, ToBot)),
          distance(FromBot, ball), delta(distance(FromBot, ball)),
          angle(FromBot, ball), delta(angle(FromBot, ball)),
          distance(FromBot, opponent), angle(FromBot, opponent) }
  pre:  recently controlling ball(FromBot)
  post: controlling ball(ToBot)
  P: {(move(FromBot, m), kick(FromBot, k)) |
      m ∈ {forward, left, right, stop}, k ∈ {true, false}}
Table 7.15: Granularity 2 behaviours in the soccer domain.

Approach(Bot, Object)
  gran: 2
  view: { x(Bot), y(Bot),
          distance(Bot, Object), delta(distance(Bot, Object)),
          angle(Bot, Object), delta(angle(Bot, Object)),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  ¬within kicking distance(Bot, Object)
  post: within kicking distance(Bot, Object)
  P: {move(Bot, forward), move(Bot, left), move(Bot, right), move(Bot, stop)}

CaptureBall2(Bot)
  gran: 2
  view: { x(Bot), y(Bot),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  within kicking distance(Bot, ball) ∧ ¬controlling ball(Bot)
  post: controlling ball(Bot)
  P: {move(Bot, forward), move(Bot, left), move(Bot, right), move(Bot, stop)}

TurnWithBall(Bot, Object)
  gran: 2
  view: { x(Bot), y(Bot),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          angle(Bot, ball, Object), delta(angle(Bot, ball, Object)),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  within kicking distance(Bot, ball) ∧ ¬lined up(Bot, ball, Object)
  post: lined up(Bot, ball, Object)
  P: {move(Bot, forward), move(Bot, left), move(Bot, right), move(Bot, stop)}

ClearBall(Bot)
  gran: 2
  view: { x(Bot), y(Bot), can kick(Bot),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  controlling ball(Bot) ∧ near edge(ball)
  post: ¬near edge(ball)
  P: {(move(Bot, m), kick(Bot, k)) |
      m ∈ {forward, left, right, stop}, k ∈ {true, false}}
Table 7.16: Granularity 2 behaviours in the soccer domain, cont.

Dribble(Bot, Object)
  gran: 2
  view: { x(Bot), y(Bot),
          distance(Bot, Object), angle(Bot, Object),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          distance(ball, Object), delta(distance(ball, Object)),
          angle(Bot, ball, Object), delta(angle(Bot, ball, Object)),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  controlling ball(Bot) ∧ ¬near edge(ball)
  post: within kicking distance(ball, Object)
  P: {move(Bot, forward), move(Bot, left), move(Bot, right), move(Bot, stop)}

Shoot2(Bot)
  gran: 2
  view: { x(Bot), y(Bot), can kick(Bot),
          distance(Bot, goal(theirs)), angle(Bot, goal(theirs)),
          distance(Bot, ball), delta(distance(Bot, ball)),
          angle(Bot, ball), delta(angle(Bot, ball)),
          distance(ball, goal(theirs)), delta(distance(ball, goal(theirs))),
          angle(Bot, ball, goal(theirs)), delta(angle(Bot, ball, goal(theirs))),
          distance(Bot, teammate), angle(Bot, teammate),
          distance(Bot, opponent), angle(Bot, opponent) }
  pre:  lined up(Bot, ball, goal(theirs)) ∧ controlling ball(Bot)
        ∧ within kicking distance(ball, goal(theirs))
  post: score(1)
  P: {(move(Bot, m), kick(Bot, k)) |
      m ∈ {forward, left, right, stop}, k ∈ {true, false}}

Pass2(FromBot, ToBot)
  gran: 2
  view: { x(FromBot), y(FromBot), can kick(FromBot),
          x(ToBot), y(ToBot),
          distance(FromBot, ToBot), delta(distance(FromBot, ToBot)),
          angle(FromBot, ToBot), delta(angle(FromBot, ToBot)),
          distance(FromBot, ball), delta(distance(FromBot, ball)),
          angle(FromBot, ball), delta(angle(FromBot, ball)),
          distance(FromBot, opponent), angle(FromBot, opponent) }
  pre:  recently controlling ball(FromBot)
        ∧ within kicking distance(ball, ToBot)
  post: controlling ball(ToBot)
  P: {(move(FromBot, m), kick(FromBot, k)) |
      m ∈ {forward, left, right, stop}, k ∈ {true, false}}
Function approximation
Most of the instruments listed in Table 7.3.3 return continuous values. These
must somehow be discretised to form a finite table of Q-values. Furthermore,
even once they are discretised there will be a very large number of discrete table
entries, even for the simplest behaviours. Some form of generalisation is needed
to make learning possible.
We used CMACs (Albus, 1975; Santamaría et al., 1998) for this purpose.
Each behaviour represented its Q-values using a CMAC with one tiling per in-
strument in its view, discretised only along that dimension. That is, if behaviour
B has instruments i1, i2, . . . , ik in its view then:
    Q(s, a) = Σ_{j=1}^{k} Q(j, i_j(s), a)
where Q(j, x, a) is the contribution to the Q-value given by the value of
instrument i_j. Q(j, x, a) is stored as a table based on a discretisation of x. The
discretisations used for each instrument are given in Table 7.17. These values
were updated using gradient-descent.
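A minimal sketch of this per-instrument decomposition and its gradient-descent update follows. The class and variable names are ours, not Rachel's, and the toy discretisers stand in for the real tilings of Table 7.17:

```python
from collections import defaultdict

class OneTilingPerInstrumentQ:
    """Q(s, a) = sum_j Q(j, bin_j(s), a): one 1-D table per instrument,
    each discretised only along its own dimension."""

    def __init__(self, discretisers, alpha=0.1):
        # discretisers: list of functions mapping a state to a bin index.
        self.discretisers = discretisers
        self.alpha = alpha
        self.tables = [defaultdict(float) for _ in discretisers]

    def value(self, state, action):
        return sum(table[(d(state), action)]
                   for d, table in zip(self.discretisers, self.tables))

    def update(self, state, action, target):
        # Gradient descent on the squared error (target - Q)^2: each
        # active table entry moves by (alpha / k) times the error.
        error = target - self.value(state, action)
        step = self.alpha * error / len(self.tables)
        for d, table in zip(self.discretisers, self.tables):
            table[(d(state), action)] += step

# Two toy instruments: x-position and distance-to-ball, coarsely binned.
q = OneTilingPerInstrumentQ(
    discretisers=[lambda s: int(s["x"] * 10), lambda s: int(s["dist"])],
    alpha=0.5)
s = {"x": 0.3, "dist": 1.7}
for _ in range(100):
    q.update(s, "forward", 1.0)
print(q.value(s, "forward"))  # converges towards the target 1.0
```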
Table 7.17: Discretisation of instruments in the soccer domain.
Instrument Discretisation
x(Bot)                        100 equal intervals from −1.5m to 1.5m
y(Bot)                        60 equal intervals from −0.9m to 0.9m
distance(Bot, Object)         115 equal intervals from 0m to 3.45m
angle(Bot, Object)            10 equal intervals from 0 deg to 360 deg
angle(Bot, Object1, Object2)  10 equal intervals from 0 deg to 360 deg
delta(distance(Bot, Object))  3 intervals: equal to 0, less than 0, greater than 0
delta(angle(Bot, Object))     3 intervals: equal to 0, less than 0, greater than 0
can kick(Bot)                 2 discrete values: 0 or 1
Progress estimation
The standard reward function used by Rachel (Equation 4.2) provides no feed-
back on progress towards the goal. The agent receives zero reinforcement until it
reaches its goal. As a result, initial exploration is little more than a random walk
through the state-space until the goal is reached. The size and connectivity of
the soccer domain make this impractical, so some additional element is needed
to encourage actions which get the agent closer to the goal and discourage ac-
tions that move it further away. The reward functions used in the soccer world
are augmented to include a progress estimator (Mataric, 1994) to provide this
information:
    B.r(s, a, s') =  1           if s' ⊨ B.post
                    −1           if s' ⊭ B.post and s' ⊭ B.pre
                     PE(s, s')   otherwise
PE (s, s′) is a function which returns a positive value when s′ is estimated to be
closer to the goal than s and a negative value when it is further away (or zero
otherwise). Its value is constrained to the range (−(1− γ), 1− γ) so that even
an infinite sequence of such rewards cannot have a better Q-value than eventual
success, or a worse Q-value than eventual failure. 1
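This augmented reward function, with the progress estimate clipped to (−(1−γ), 1−γ), can be sketched directly. The boolean arguments below stand in for Rachel's fluent tests, and all names are our own:

```python
GAMMA = 0.95
BOUND = 1.0 - GAMMA  # progress rewards are confined to (-(1-gamma), 1-gamma)

def behaviour_reward(post_holds, pre_holds, progress):
    """Reward for one step of a behaviour: +1 on reaching the postcondition,
    -1 on falling outside both pre- and postconditions (failure), otherwise
    a clipped progress estimate."""
    if post_holds:
        return 1.0
    if not pre_holds:
        return -1.0
    # Clip the progress estimate so that no infinite discounted sum of
    # progress rewards can outweigh eventual success or failure:
    # sum_t gamma^t * (1 - gamma) = 1.
    return max(-BOUND, min(BOUND, progress))

print(behaviour_reward(post_holds=True, pre_holds=False, progress=0.0))   # 1.0
print(behaviour_reward(post_holds=False, pre_holds=False, progress=0.0))  # -1.0
print(behaviour_reward(post_holds=False, pre_holds=True, progress=0.3))   # clipped to 1-gamma
```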
As with behaviours and instruments, Rachel allows progress estimators to take
parameters, so the soccer domain implements only a single progress estimator,
reward approach(Object, Target), which returns a positive reward of (1− γ) if
Object gets closer to Target, and −(1−γ) if it gets further away. The variables
Object and Target are instantiated appropriately for each behaviour, as shown
in Table 7.18.
Table 7.18: Progress estimators used in the soccer domain.
Behaviour Progress estimator
Score                      reward approach(ball, goal(theirs))
CaptureBall1(Bot)          reward approach(Bot, ball)
Shoot1(Bot)                reward approach(ball, goal(theirs))
Pass1(FromBot, ToBot)      reward approach(ball, ToBot)
Approach(Bot, Object)      reward approach(Bot, Object)
CaptureBall2(Bot)          reward approach(Bot, ball)
TurnWithBall(Bot, Object)  none
ClearBall(Bot)             none
Dribble(Bot, Object)       reward approach(ball, Object)
Shoot2(Bot)                reward approach(ball, goal(theirs))
Pass2(FromBot, ToBot)      reward approach(ball, ToBot)
1 It has been pointed out by a reviewer that this progress estimator does not support policy
invariance, as studied in (Ng, Harada, & Russell, 1999). This indeed appears to be the case.
We were unaware of this body of work at the time of writing.
7.3.4 Experiment 9: HSMQ vs P-HSMQ vs TRQ
In the first experiment in the soccer world, we compared three approaches:
1. HSMQ with all applicable behaviours
2. P-HSMQ
3. TRQ with η = 0.1
To further simplify the problem, this first experiment was run with hand-
coded policies for all the granularity 2 behaviours. So the agent’s task is simply
to learn to choose between these behaviours.
Each approach was run ten times, with each run consisting of 1000 consecu-
tive trials. A trial begins with the two bots, the opponent and the ball placed
randomly on the field, and ends when the ball moves into one of the two goals.
An upper limit of 5000 steps was placed on the length of each trial. Trials that
exceeded this limit were counted as failures. This was done because sometimes
the hand-crafted behaviours could get stuck in certain positions, unable to ter-
minate either successfully or unsuccessfully. Tests showed that any trial that
was likely to complete without getting stuck would finish well under the 5000
step limit.
The learning parameters were set as follows: The learning rate α was 0.1.
The discount factor γ was 0.95. Exploration was done in an ε-greedy fashion
with a 1 in 10 chance of the agent choosing an exploratory action (both at the
level of primitive actions, and in the choice of behaviours). Exploratory actions
were chosen randomly.
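The ε-greedy rule used here (a 1-in-10 chance of exploring) can be sketched as follows; the dictionary-based Q-table access is a simplification, and the names are our own:

```python
import random

ALPHA = 0.1    # learning rate
GAMMA = 0.95   # discount factor
EPSILON = 0.1  # 1-in-10 chance of an exploratory action

def epsilon_greedy(q_values, actions, rng=random):
    """Pick the greedy action with probability 1 - EPSILON, otherwise
    a uniformly random one. q_values maps actions to Q estimates."""
    if rng.random() < EPSILON:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values.get(a, 0.0))

rng = random.Random(0)
q = {"forward": 0.4, "left": 0.1, "right": 0.0, "stop": -0.2}
picks = [epsilon_greedy(q, list(q), rng) for _ in range(1000)]
# Roughly 90% + 10%/4 of picks should be the greedy action "forward".
print(picks.count("forward") / 1000)
```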
Results
The results of this experiment can be seen in Graphs 7.19(a) and (b) which show
the average success rate and trial length respectively. From the outset the two
plan based approaches are significantly better than the unplanned approach, and
this is still the case after 1000 trials. The final success rate for both planned
algorithms is greater than 99%, whereas the unplanned approach only achieves
an average of 89%, significantly less (with 99% confidence).
It is worth noting that only 5% of the failures caused by the unplanned ap-
proach were due to it kicking an own goal. The other 95% were due to exceeding
[Graphs: (a) success rate (%) vs. number of trials; (b) trial length (successful
trials only) vs. number of trials. Series: HSMQ w/ all behaviours, P-HSMQ, TRQ.]
Figure 7.19: Learning in the soccer world, with hard-coded behaviours, using
HSMQ, P-HSMQ and TRQ. (Error bars show 1 standard deviation.)
the time limit. On the other hand, neither planned approach ever reached the
5000 step limit.
The second graph, showing trial lengths, plots only the average length of
successful trials, over a 100-trial window. As can be seen, even on only the
successful trials, the unplanned approach takes significantly longer to score a
goal. In the last 100 trials, the average length for the unplanned approach was
906 steps, significantly greater than the corresponding 481.99 for P-HSMQ and
357.65 for TRQ (with 99% confidence). The P-HSMQ approach shows a greater
average trial length than TRQ, but the variance on both results is high, so the
significance of this difference is low.
Perhaps the most notable feature of these graphs is the flatness of the results
for both planned approaches. There appears to be no net change in the quality
of their policies in the whole time they are executed. It appears that once the
planner has eliminated the inappropriate behaviour choices, there is very little
the reinforcement learning algorithms can do to improve on this, at least within
the time frame of the experiment.
7.3.5 Experiment 10: Learning primitive policies
In the previous experiment, the lowest-level behaviours had hand-crafted policies.
Now we shall remove this assistance. In this experiment we shall compare the
two approaches P-HSMQ and TRQ, learning both behaviour choices with the
hierarchy and primitive policies for the behaviours. Each approach was run ten
times, with each run consisting of 1000 consecutive trials. The experimental
set-up and learning parameters were as per the previous experiment.
The unplanned approach was not used for this experiment, as it could not
complete anywhere near the required number of trials in a reasonable amount of
time.
Results
Graphs 7.20(a) and (b) show the results of this experiment. Both approaches
show significant improvement over the 1000 trials, with TRQ reaching a 91%
success rate, and P-HSMQ 86%. A T-test gives this difference a 96% significance.
The TRQ algorithm does appear to do better on the whole, improving earlier and
with less variance than P-HSMQ. On the graph of trial lengths we see a familiar
[Graphs: (a) success rate (%) vs. number of trials; (b) trial length vs. number
of trials. Series: P-HSMQ, TRQ.]
Figure 7.20: Learning in the soccer world, with learnt behaviours, using P-HSMQ
and TRQ. (Error bars show 1 standard deviation.)
pattern – early trials of P-HSMQ are longer and show much greater variance
than those of TRQ, although both approaches converge to comparable values
towards the end.
7.3.6 Experiment 11: Reflection
In the final experiment we shall attempt to apply reflection to the soccer domain,
to see whether it can improve the agent’s performance. For this experiment we
took data from the TRQ approach in the previous experiment. The experiences
from the last 100 trials of each run were passed through the reflector’s side-effect
detection process to determine what side-effects were occurring.
Each run showed over forty different side-effects of varying frequency (listed in
Table 7.19). Of these, one common and potentially interesting effect was the fail-
ure of Dribble(bot(1)) to maintain the condition lined up(bot(1), ball, goal(theirs)).
The reflector was used to learn a maintenance condition for this effect, which
was incorporated into the agent’s plans. The agent was then allowed to run for
a further 100 trials using the new plans, but keeping the Q-values saved from
the previous experiment. This was done ten times, once for each of the ten runs
in the previous experiment. For the sake of comparison, 100 extra trials without
the new plans were also performed.
The reflector was run with a training set of 500 positive and negative exam-
ples, and a pool of the same size2. The default settings for Aleph were used,
with the following exceptions:
1. The minimum acceptable accuracy of a clause, minacc, was set to 0.5.
2. The upper bound on layers of new variables, i was set to 4.
3. The upper bound on the number of literals in a clause, clauselength was
set to 5.
4. The custom-added limitation on time spent learning (see Section 6.5.2),
inducetime was set to 1 hour.
5. Only the single best clause (in terms of coverage) was kept.
2 Experiment 7 would indicate that a larger pool size would be necessary, but this is only
if the agent is learning the behaviours from scratch. The data for this experiment was drawn
from the end of the previous experiment, by which time the behaviours were well established,
so the pool size can be small.
Results
First it is worth noting the sheer number of side-effects that do occur in this
domain. In just 100 trials, forty-three different side-effects were noted by the
reflector, as listed in Table 7.19 in order of frequency. The effect we have chosen
to examine was at the top of this list, occurring 1968 times in all. Just below it
is an identical effect, but involving bot(0) rather than bot(1).
This proliferation of side-effects may at first appear to be a serious problem,
but in practice many of them are irrelevant artifacts of total-order planning.
Consider the plan for Shoot1() for instance, part of which is shown in Figure 7.21.
It involves ordering the behaviours CaptureBall2(), Dribble(), TurnWithBall() and
Shoot2(). When the agent is controlling the ball in the middle of the field, there
are several possible alternatives to choose from, given by different paths in the
plan, but when the agent loses control of the ball there is only one appropriate
behaviour, which is CaptureBall2().
Although it is the only available behaviour in the situation, there are still sev-
eral different nodes which use CaptureBall2(), each with different conditions de-
pending on which behaviours it plans to execute next. If lined up(bot(0), ball, goal(theirs))
is already true, then the plan hopes to maintain it so that TurnWithBall() need
not be executed. If it isn’t true, then that condition is maintained instead. If
CaptureBall2() fails to maintain either condition, a plan execution failure has
occurred, and a side-effect is recorded. But after the side-effect occurs, the
CaptureBall2() behaviour is still the only available behaviour, so execute will
continue until it succeeds. Even if we knew when the side-effect was going to
occur, it would not improve matters.
However the same side-effect on the Dribble() behaviour has more important
consequences. It affects whether we choose to execute TurnWithBall() before
or after Dribble(). If the side-effect is particularly common, then we would be
wasting time using TurnWithBall() before we were close to the goal. In this case,
reflection could potentially prune away this alternative and result in improved
performance.
In practice learning even this side-effect has minimal effect on performance,
and what effect it has is slightly detrimental. The average success rate for the
runs performed without reflection is 87%, with reflection it dropped slightly to
80%. A two-tailed T-test shows this difference to be 97% significant. The average
length of the successful trials did not change significantly: 624 timesteps without
Table 7.19: The side-effects detected in the soccer-world.
Behaviour Affected condition No. occurrences
Dribble()        lined up(⊥1, ball, goal(theirs))                1968
Dribble()        lined up(⊥0, ball, goal(theirs))                1864
Dribble()        ¬lined up(⊥1, ball, goal(theirs))               1432
Dribble()        ¬lined up(⊥0, ball, goal(theirs))               1405
CaptureBall2()   recently controlling ball(Bot)                  1401
ClearBall()      lined up(⊥0, ball, goal(theirs))                 942
ClearBall()      lined up(⊥1, ball, goal(theirs))                 899
CaptureBall2()   lined up(⊥1, ball, goal(theirs))                 767
CaptureBall2()   lined up(⊥0, ball, goal(theirs))                 744
CaptureBall2()   within kicking distance(ball, goal(theirs))      707
CaptureBall2()   ¬near edge(ball)                                 657
CaptureBall2()   near edge(ball)                                  564
TurnWithBall()   controlling ball(⊥0)                             544
TurnWithBall()   controlling ball(⊥1)                             536
ClearBall()      ¬lined up(⊥0, ball, goal(theirs))                506
ClearBall()      ¬lined up(⊥1, ball, goal(theirs))                498
Pass2()          recently controlling ball(Bot)                   417
TurnWithBall()   ¬near edge(ball)                                 372
ClearBall()      controlling ball(⊥1)                             362
TurnWithBall()   within kicking distance(ball, goal(theirs))      360
CaptureBall2()   within kicking distance(ball, ⊥0)                354
TurnWithBall()   near edge(ball)                                  327
ClearBall()      controlling ball(⊥0)                             308
Approach()       ¬near edge(ball)                                 292
Approach()       near edge(ball)                                  252
Approach()       within kicking distance(Bot, Object)             249
CaptureBall2()   within kicking distance(ball, ⊥1)                234
CaptureBall2()   ¬lined up(⊥1, ball, goal(theirs))                222
CaptureBall2()   ¬lined up(⊥0, ball, goal(theirs))                217
ClearBall()      within kicking distance(ball, ⊥0)                183
TurnWithBall()   recently controlling ball(Bot)                   161
Approach()       lined up(⊥1, ball, goal(theirs))                 155
Approach()       lined up(⊥0, ball, goal(theirs))                 146
Approach()       recently controlling ball(⊥0)                    141
Approach()       recently controlling ball(⊥1)                    124
ClearBall()      within kicking distance(ball, ⊥1)                 74
Approach()       ¬controlling ball(Bot)                            60
TurnWithBall()   ¬controlling ball(⊥1)                             55
TurnWithBall()   ¬controlling ball(⊥0)                             48
Approach()       ¬lined up(⊥0, ball, goal(theirs))                 36
Approach()       ¬lined up(⊥1, ball, goal(theirs))                 21
Dribble()        recently controlling ball(Bot)                     2
Approach()       controlling ball(Bot)                              2
[Plan diagram: a lattice of condition nodes over the fluents controlling_ball,
lined_up, within_kicking_distance, near_edge and recently_controlling_ball for
bot(1), linked by the behaviours CaptureBall2(bot(1)), Dribble(bot(1), goal(theirs)),
TurnWithBall(bot(1), goal(theirs)) and Shoot2(bot(1)), leading to score(1).]
Figure 7.21: Part of the plan for Shoot1(bot(1)).
reflection and 680 timesteps with. The variance in these values is so large that
this difference is not significant.
The cause of this drop in performance is not immediately apparent. Analysis
of the events preceding the failures shows that they were all due to the ball going
into the wrong goal, never because of reaching the 5000 step time-out. In 77%
of cases the agent was attempting to clear the ball into the centre of the field
at the time, which is to be expected and is no different to the trials without
reflection.
There is a marked rise in the number of times the agent executed the Shoot1(bot(1))
behaviour directly, being unable to decompose it into a finer granularity be-
haviour. Without reflection this occurs only 96 times; with reflection it occurs
1219 times. This appears to indicate that, on many occasions, learning a de-
scription of the side-effect only served to limit the application of the Dribble()
behaviour without providing an alternative action. This is understandably dis-
advantageous, but it is not clear that it was necessarily the cause for the extra
failures. Of all the failed trials in the run with reflection, only 46% showed this
problem. Likewise 45% of the successful trials also had this same symptom. So
the cause-and-effect relationship is far from clear.
7.3.7 Discussion of the soccer experiments
The soccer domain is an excellent example of the importance of the task-hierarchy
in hierarchical reinforcement learning. In spite of much hand-crafted background
knowledge in the form of behaviours definitions, progress estimators and function
approximators, it is still very difficult to learn well without a good task-hierarchy.
There are many possible behaviours in this world, which can be parameterised
in various ways. Exploring every possibility is not very productive, as we have
seen. Even when the behaviours are hard-coded, learning to choose between
them without a task-hierarchy is difficult. When the behaviours themselves have to be learnt, it
is well and truly impossible, unless some extra background knowledge is provided
to direct the agent towards the appropriate behaviours. This knowledge could
be encoded directly by hand, but we have shown that it can also be presented
in a more flexible form as a symbolic model. As we have seen, both the TRQ
and P-HSMQ algorithms can use such a model to learn successful policies.
This world does, however, highlight one of the weaknesses of planning and
reflection. There are a great many side-effects that occur, and no simple way to
distinguish the important ones from the unimportant ones. Also, many of those
side-effects appear as near-repetitions, very similar to others in the list with only
a few parameters changed. For example, both lined up(bot(0), ball, goal(theirs))
and lined up(bot(1), ball, goal(theirs)) are recorded as being affected by the
Dribble() behaviour. Obviously it would be more advantageous to treat these as
a single side-effect. Rachel’s reflector does not as yet know how to do this.
7.4 Summary
In this chapter we have experimentally verified the claimed advantages of Rachel’s
hybrid architecture over straightforward hierarchical reinforcement learning. We
have seen how a symbolic model of desired behaviours can be built for three dif-
ferent test domains of varying levels of complexity, and how the information in
this model can significantly improve the agent’s ability to learn to perform those
behaviours.
We have also compared the P-HSMQ and TRQ algorithms and demonstrated
that the latter algorithm can produce significantly better policies in domains
where unexpected side-effects are possible, and early termination of behaviours
is desirable. In such domains the TRQ algorithm appears to produce policies sim-
ilar to those produced by standard termination improvement algorithms, without
requiring separate learning and execution phases.
Finally we have investigated the reflection mechanism thoroughly and seen
that while it is able to predict side-effects and plan to avoid them, it is quite
sensitive to variations in training- and test-set sizes. Exploratory planning can
reduce this sensitivity somewhat, but at the expense of producing much larger
plans.
This concludes our investigation of the Rachel architecture. In the final
chapter, we shall summarise the results we have obtained, draw conclusions and
speculate on future extensions to this work.
Chapter 8
Conclusions and Future Work
In this thesis we have presented a hybrid learning system which combines the
symbolic and subsymbolic approaches to artificial intelligence for control. Our
aim in doing so was to capitalise on the strengths of each approach, creating an
agent which can solve more complex tasks than either approach alone.
8.1 Summary of Rachel
Rachel is an architecture for a learning agent, which combines symbolic planning, hierarchical reinforcement learning and inductive logic programming. As with
other hierarchical reinforcement learning algorithms, it aims to learn a collection
of behaviours and combine them to achieve certain goals. However, unlike other
HRL algorithms, it allows the trainer to specify behaviours in terms of abstract
symbolic descriptions in the form of Reinforcement-Learnt Teleo-operators (RL-
Tops). Rachel is then able to interpret these RL-Tops as both operators for
symbolic planning, and also sub-task descriptions for recursively optimal rein-
forcement learning.
The hybrid representation has several advantages:
• Automatic construction of task hierarchies. Having an explicit sym-
bolic model of the purposes of its behaviours, Rachel uses symbolic plan-
ning to determine which behaviours are likely to help it achieve its goals
in a particular state, and which are not. The plan it creates acts as a
task hierarchy for the hierarchical reinforcement learning algorithm, limit-
ing exploration to behaviours that are likely to be useful, and thus making
learning faster. Plans can also include multiple levels of hierarchy, to make
planning and learning easier.
• Choosing optimal paths in plans. Unlike other planning systems,
Rachel does not assume that the shortest plan it produces is the best.
Rather, it searches for all possible paths to the goal and then uses hierar-
chical reinforcement learning to select the optimal one.
• Learning concrete policies for behaviours. Again, unlike other plan-
ning systems, Rachel does not rely on the trainer to provide fully imple-
mented behaviours. Rather, it uses reinforcement learning to learn primi-
tive policies which achieve the behaviours’ goals.
• Intelligent interruption of behaviours. The Teleo-reactive Q-Learning
algorithm uses the knowledge contained in the plan it executes to intelli-
gently interrupt behaviours when they are no longer appropriate. This
results in better policies than comparable HRL algorithms, without the
performance loss of fully reactive execution.
• Diagnosing execution failures. When a behaviour is interrupted, the
plan also provides information about what went wrong. The symbolic
representation allows Rachel to diagnose the nature of the failure and
gather evidence of what caused it.
• Learning how to avoid unwanted side-effects. Based on this evidence,
Rachel’s reflector is able to induce a description of the cause of the side-
effect in a symbolic form which can be incorporated back into the plans,
so that the unwanted effect may be avoided in the future.
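The RL-Top representation underlying these advantages can be pictured as a record pairing symbolic pre- and postconditions with a learnt policy, the postcondition doubling as a reward function. The Python sketch below is purely illustrative; the class and field names are ours, not Rachel's actual interface:

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet

State = FrozenSet[str]  # a state modelled as the set of fluents true in it

@dataclass
class RLTop:
    """A Reinforcement-Learnt Teleo-operator: symbolic purpose, learnt policy."""
    name: str
    pre: State   # fluents that must hold for the behaviour to be applicable
    post: State  # fluents the behaviour is intended to achieve
    q: Dict = field(default_factory=dict)  # Q-table to be filled in by RL

    def applicable(self, state: State) -> bool:
        return self.pre <= state

    def achieved(self, state: State) -> bool:
        return self.post <= state

    def reward(self, state: State) -> float:
        # The symbolic postcondition doubles as the reward function
        # by which the behaviour's policy is learnt.
        return 1.0 if self.achieved(state) else 0.0

go = RLTop("Go(dining,kitchen)",
           pre=frozenset({"location(robot,dining)"}),
           post=frozenset({"location(robot,kitchen)"}))
```

Used as a planning operator, `pre` and `post` drive the symbolic search; used as a learning sub-task, `reward` scores states during reinforcement learning.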
8.2 Summary of Experimental Results
Rachel has been successfully tested in three domains: two simple grid-worlds
and the SoccerBot soccer simulator. For each domain we have built a symbolic
model of the desired behaviours at multiple levels of granularity. We have shown
how plans built from these models can significantly improve the agent’s ability
to learn to perform its required tasks.
The Teleo-reactive Q-Learning algorithm has also been shown to produce
significantly better policies in domains where unexpected side-effects are possible,
and early termination of behaviours is desirable. It is able to reproduce the
advantages of existing termination improvement algorithms, with minimal extra
learning time and without requiring separate learning and execution phases.
The results from the reflection mechanism have been more mixed. Side-effect
detection works effectively, but in a complex domain like the soccer simulator
there can be a great variety of side-effects and distinguishing those that are
important from those that are not is a difficult exercise.
Predicting side-effects can be spectacularly successful when it works and dis-
astrously bad when it goes wrong. Which of these results will occur depends
heavily on the training- and test-set sizes, and on the time it takes for the actor
to learn reasonable policies for behaviours. Reflecting prematurely, while be-
haviours are still being learnt, can lead to bad predictions which in turn prevent
the agent from improving the behaviours. Exploratory planning can relieve this
problem, but at the expense of much greater planning time and larger plan trees.
Nevertheless the results are largely positive. The synthesis of planning and
reinforcement learning has allowed Rachel to learn to behave successfully in a
world as complex as the soccer simulator, which could not have been achieved
with hierarchical reinforcement learning alone, without significant human inter-
vention.
8.3 Future Work
Throughout this work the emphasis has been on building a hybrid tool out
of simple and well-studied building blocks. As such, there is much room for
improvement in each of the components, to incorporate more complex means of
planning, acting and reflecting. Some suggested avenues for further work are
outlined below.
8.3.1 Better Planning
The planner is the most obvious starting point for improvement. The means-ends
planning algorithm is primitive and costly. More modern planning techniques
abound, which could well be adapted to Rachel’s needs.
Independent subgoals
Since Rachel’s planner is universal, it performs poorly in domains which have
multiple sub-goals which can be performed in arbitrary order. As it stands, it
will expand each sequence into a separate path of the plan, duplicating a lot of
effort and building a plan that is bigger than it needs to be. A more intelligent
planner might recognise independent subgoals and build separate sub-trees for
each. Such a planner is used in Trail (Benson, 1996). The coffee-and-book
task, for example, might produce a plan such as that shown in Figure 8.1.
Figure 8.1: A possible plan for fetching the coffee and the book with subgoal splitting. The double line indicates that the second node has been split into three subgoals to be achieved independently.
Partial-order planners such as UCPOP (Weld, 1994) and UMCP (Erol et al., 1994)
might be used to do this kind of decomposition of independent goals. However,
they are not designed to build plans with multiple alternatives (as this is not a
commonly required feature in planning systems).
More expressive plans
Most work in planning is directed at finding a plan which achieves certain goal conditions. Reinforcement learning, on the other hand, recognises that
goals of achievement are only one kind of control task. Other kinds of tasks in-
clude “maintenance” tasks, where a goal condition must be achieved and actively
maintained, and “cyclic” tasks in which the agent cycles through a sequence of
goal states. We are not aware of any existing work in planning to tackle such
tasks, but there is no intrinsic reason why symbolic methods could not be applied
to them.
State-space planning, such as that used by Rachel, is merely a search for
paths in a directed graph. Each node of the graph is a set of fluents; each edge is a behaviour which can be executed when its source node is satisfied, causing its target node to become satisfied. With this model, solving a maintenance task is
simply finding a cycle of nodes in the graph which all satisfy the goal condition.
Solving a cyclic task requires finding a path through the successive goal nodes. If
a planner of this kind could be built, then it could well be applied to building a
HAM-like structure for reinforcement learning. An algorithm like HAMQ, or a
recursively-optimal variant thereof, could be applied to learn policies which fit
this structure.
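The cycle-finding idea could be prototyped as ordinary graph search: restrict the plan graph to the nodes that satisfy the goal condition and look for a cycle within that subgraph. A minimal sketch, in which the dictionary encoding of the graph is our own assumption:

```python
from typing import Dict, List, Optional, Set

def find_maintenance_cycle(graph: Dict[str, List[str]],
                           goal_nodes: Set[str]) -> Optional[List[str]]:
    """Find a cycle lying entirely within the goal-satisfying nodes.

    `graph` maps each plan node to the nodes reachable by one behaviour.
    A maintenance policy corresponds to any cycle whose nodes all satisfy
    the goal condition; we search for one by depth-first search.
    """
    def dfs(node, start, path, visited):
        for nxt in graph.get(node, []):
            if nxt not in goal_nodes:
                continue                      # stay inside the goal subgraph
            if nxt == start:
                return path + [start]         # closed the cycle
            if nxt not in visited:
                found = dfs(nxt, start, path + [nxt], visited | {nxt})
                if found:
                    return found
        return None

    for start in goal_nodes:
        cycle = dfs(start, start, [start], {start})
        if cycle:
            return cycle
    return None

# Nodes a and b both satisfy the goal and form a two-step cycle; c does not.
g = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
```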
On a more ambitious front, recent investigations into planning have attempted to incorporate much more complex languages for expressing plans, including conditional effects (Anderson, Weld, & Smith, 1998), sensing actions
(Weld, Anderson, & Smith, 1998) and even loop structures (Lin & Levesque,
1998). Recent research in hierarchical reinforcement learning has also explored
such complex languages for structuring policies (Lagoudakis & Littman, 2002; Andre & Russell, 2000; Shapiro, Langley, & Shachter, 2001). Potentially these
approaches could be combined to build quite complex structures and learn poli-
cies within them.
8.3.2 Better Acting and Learning
Incorporating other reinforcement learning research
It is common practice among researchers studying reinforcement learning to use
Q-Learning as a starting point for all work, only deviating from this standard
algorithm as far as necessary to implement the particular ideas they propose.
This work is no exception. There is value in this – Q-Learning is a well understood algorithm and serves as a solid baseline – but it means that little is known
about how different variations combine and interact. One worthwhile direction
for further development would be to combine the TRQ algorithm with other
recent developments in reinforcement learning.
In particular, it would be interesting to see how the MAXQ function decom-
position (Dietterich, 2000a) might be applied, and how it affects learning. By
separating the Q-value TRQ assigns to a node into immediate and continuation
values, it may be possible to at least partially reduce the extra learning time
TRQ takes over P-HSMQ.
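One way such a decomposition might look in code is sketched below. This is our own illustration of the general MAXQ idea of splitting a node's value into an in-behaviour component and a completion component; it is neither Dietterich's algorithm nor Rachel's implementation:

```python
from collections import defaultdict

class DecomposedQ:
    """Q(s, a) = V(a, s) + C(s, a): value earned while executing the
    behaviour, plus the completion value earned after it terminates."""

    def __init__(self, alpha=0.1, gamma=0.9):
        self.V = defaultdict(float)  # V[(a, s)]: reward while executing a from s
        self.C = defaultdict(float)  # C[(s, a)]: discounted reward after a ends
        self.alpha, self.gamma = alpha, gamma

    def q(self, s, a):
        return self.V[(a, s)] + self.C[(s, a)]

    def update(self, s, a, internal_reward, s_next, steps, next_actions):
        # Move the in-behaviour value towards the observed internal reward.
        self.V[(a, s)] += self.alpha * (internal_reward - self.V[(a, s)])
        # Move the completion value towards the discounted best continuation.
        best_next = max((self.q(s_next, a2) for a2 in next_actions), default=0.0)
        target = (self.gamma ** steps) * best_next
        self.C[(s, a)] += self.alpha * (target - self.C[(s, a)])
```

Because the completion term is stored separately, a behaviour's internal value `V` can be reused wherever the behaviour appears in the plan, which is the source of the hoped-for saving in learning time.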
Committing to paths in the plan
Another possible avenue for investigation is in the area of commitment to paths in
the plan. When designing P-HSMQ and TRQ we deliberately chose to maximise
reactivity by allowing the agent to choose a new behaviour from all active nodes
in the plan, whenever the executing behaviour terminated. If the plan contained
a path to the goal several behaviours long, then the agent would make a choice at every step of the plan whether to continue with that path or choose another.
A possible alternative would be to design the algorithm so that once the agent
had started on a path it executed it to completion, unless an execution failure
occurred along the way. This is taking behaviour-commitment to the extreme –
committing to a whole sequence of behaviours – and would have all the associated
advantages and disadvantages we have already discussed.
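The committed alternative could be sketched as a control loop which selects a path once and runs it to completion unless a behaviour fails; `choose` and `execute` below are hypothetical stand-ins for the plan interface, not Rachel's actual one:

```python
def run_committed(plan, state, choose, execute, is_goal, max_steps=100):
    """Execute a chosen path to completion unless a behaviour fails.

    Reactive execution (as in TRQ) would call `choose` again after every
    behaviour termination; here we commit to the whole path once selected
    and only return to `choose` on goal failure or path exhaustion.
    """
    for _ in range(max_steps):
        if is_goal(state):
            return state
        path = choose(plan, state)          # pick a whole path, once
        for behaviour in path:
            state, ok = execute(behaviour, state)
            if not ok:                      # execution failure: re-choose
                break
            if is_goal(state):
                return state
    return state
```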
8.3.3 Better Reflecting
The reflector is possibly the most fruitful area for expansion of the Rachel
architecture. We have limited ourselves, in this thesis, to learning only one of
the many possible aspects of the system, in a fairly simple fashion. There are
many possible improvements in both what is learnt, and how it is learnt.
Learning preconditions
In this work we have focused on detecting, diagnosing and learning to predict
side-effects, but these are not the only aspects of the system for which symbolic
models could be learnt. Another possibility is to learn revised preconditions for
behaviours. A behaviour may never learn a policy which successfully achieves
its goals for some or all of its specified application space. A behaviour that con-
sistently fails is as detrimental as a behaviour that consistently causes unwanted
side-effects. In the long run, it is worth the agent’s while to recognise that a
given behaviour isn’t working, and revise its symbolic model of the behaviour to
limit its application to those states in which it works.
Of course, great care would have to be taken to establish that the behaviour
had already been well explored and was not still in the process of improving, or
else learning would be hampered.
Inventing behaviours
Another possible kind of reflection might come from failure in the planning pro-
cess. If the planner builds a plan that covers all but a certain subset of states,
then the reflector might invent a behaviour to fill in the gap. Starting in one of
the uncovered states it could explore randomly until it arrived in a state covered
by the plan. A postcondition could be chosen by finding a set of fluents which
are satisfied by the covered state but not by any of the uncovered states. A
precondition could be induced by generalising over all of the uncovered states.
The new behaviour could then be introduced into plans and a policy learnt as
usual.
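The construction just described can be summarised in a few lines. The function below is our own reading of the proposal, with sets of fluents standing in for states:

```python
def invent_behaviour(uncovered, covered_state):
    """Sketch of the proposed gap-filling reflection.

    uncovered: list of states (sets of fluents) the plan fails to cover.
    covered_state: the plan-covered state that random exploration reached.
    Returns (precondition, postcondition) for the invented behaviour.
    """
    # Postcondition: fluents satisfied by the covered state but by none
    # of the uncovered states.
    post = {f for f in covered_state
            if all(f not in s for s in uncovered)}
    # Precondition: a crude generalisation over the uncovered states --
    # here simply the fluents they all share.
    pre = set.intersection(*map(set, uncovered)) if uncovered else set()
    return pre, post
```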
Deliberate experimentation
In Section 3.4.1 we made a distinction between passive and active learning. Pas-
sive learning is done by observing data from an independent controller. Active
learning deliberately controls the agent in a certain way to produce examples for
learning. Rachel’s reflector is moderately active. Exploratory planning allows
it to deliberately choose to explore behaviours it wouldn’t normally, so as to
generate examples for reflection. However, this decision is still made randomly,
and there is no guarantee that it will generate the kinds of examples that would
be most useful.
A more active approach would be for the agent to do deliberate experimen-
tation – to choose a set of states that it particularly wanted to explore and
actively head for that location in order to test a particular behaviour. Early
ILP systems, such as MIS (Shapiro, 1981), Marvin (Sammut & Banerji, 1986)
and CIGOL (Muggleton & Buntine, 1988), have used the trainer as an oracle to
provide the truth-value for any critical examples that might influence the final
hypothesis. Such systems might be revived for this domain, but rather than
using the trainer as an oracle, the critical examples could be used as fodder for
experimentation.
Another possible reason for experimentation would be to deliberately practice
a behaviour that seems to be performing poorly. Other behaviours could be used
to set up appropriate initial conditions, and then the behaviour could be repeated
several times over to improve its policy.
Incremental ILP
We used the Aleph ILP algorithm for Rachel's reflector because it handled intensional background knowledge and noisy data. However, it is a batch-mode
learner and we had to modify it to learn side-effect descriptions incrementally.
The modifications we made were far from the ideal solution. There is a definite
lack of research into truly incremental ILP. Late in the work we came across the
Hillary algorithm by Iba, Wogulis, and Langley (1988) which is an incremental
ILP algorithm, but no follow-up work appears to have been done on this until the
recent publication of the NILE algorithm (Westendorp, 2003). This is certainly
an area in which much more research could be done.
8.4 Conclusion
The overarching goal of this work was to demonstrate the possibility of reconciling symbolic and statistical approaches to artificial intelligence. The gulf is not too deep; it can be bridged. Rachel is only a small step in that direction,
but I hope it shows that such a resolution is possible, and that as in human
beings, so also in computers, multiple kinds of representations can co-exist and
complement one another. As we expand our horizons to more and more com-
plex tasks, the ability to represent problems at multiple levels of abstraction will
become paramount. Different approaches have different strengths, and it will be
by combining those strengths that artificial intelligence will flourish.
References
Albus, J. S. (1975). A new approach to manipulator control: The cerebellar model articulation controller (CMAC). Transactions of the ASME, 220–227.
Allen, J., Hendler, J., & Tate, A. (Eds.). (1990). Readings in planning. San Mateo, CA: Morgan Kaufmann.
Anderson, C., Weld, D., & Smith, D. (1998). Conditional effects in Graphplan. In Proceedings of the fourth international conference on AI planning systems.
Anderson, J. R. (1976). Language, memory and thought. Hillsdale, NJ: Erlbaum.
Anderson, J. R. (1995). Cognitive psychology and its implications (4th ed.). W. H. Freeman.
Andre, D., & Russell, S. (2002). State abstraction for programmable reinforcement learning agents. In Proceedings of the eighteenth national conference on artificial intelligence.
Andre, D., & Russell, S. J. (2000). Programmable reinforcement learning agents. In Advances in neural information processing systems 12: Proceedings of the 1999 conference (pp. 1019–1025). San Francisco, CA: Morgan Kaufmann.
Aristotle. (350 BC). Politics. Athens.
Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Locally weighted learning. Artificial Intelligence Review, 11(1-5), 11–73.
Baxter, J., Tridgell, A., & Weaver, L. (1998). KnightCap: A chess program that learns by combining TD(λ) with game-tree search. In Proceedings of the fifteenth international conference on machine learning (pp. 28–36). San Francisco, CA: Morgan Kaufmann.
Bellman, R. (1957). Dynamic programming. Princeton University Press.
Bellman, R. (1961). Adaptive control processes: A guided tour. Princeton University Press.
Benson, S. (1995). Inductive learning of reactive action models. In Proceedings of the twelfth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Benson, S. (1996). Learning action models for reactive autonomous agents. Unpublished doctoral dissertation, Department of Computer Science, Stanford University.
Benson, S., & Nilsson, N. J. (1994). Reacting, planning and learning in an autonomous agent. In K. Furukawa, D. Michie, & S. Muggleton (Eds.), Machine intelligence 14. Oxford: The Clarendon Press.
Bertsekas, D. P. (1987). Dynamic programming: Deterministic and stochastic models. Englewood Cliffs, NJ: Prentice-Hall.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Belmont, MA: Athena Scientific.
Bobrow, D. G., & Winograd, T. (1977). An overview of KRL, a knowledge representation language. Cognitive Science, 1, 3–46.
Boutilier, C., Dean, T., & Hanks, S. (1999). Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11, 1–94.
Brooks, R. A. (1986). A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, RA-2(1), 14–23.
Carbonell, J. G. (1984). Learning by analogy: Formulating and generalizing plans from past experience. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (pp. 137–161). Berlin, Heidelberg: Springer.
Churchland, P. M. (1990). On the nature of theories: A neurocomputational perspective. In C. W. Savage (Ed.), Scientific theories: Minnesota studies in the philosophy of science (Vol. 14). Minneapolis: University of Minnesota Press.
Cohen, N. J., & Squire, L. R. (1980). Preserved learning and retention of pattern analysing skills in amnesia: Dissociation of knowing how and knowing what. Science, 210, 207–210.
Dayan, P., & Hinton, G. E. (1992). Feudal reinforcement learning. Advances in Neural Information Processing Systems, 5, 271–278.
desJardins, M. (1994). Knowledge development methods for planning systems. In Planning and learning: On to real applications: Papers from the 1994 AAAI fall symposium (pp. 34–40). Menlo Park, CA: AAAI Press.
Dietterich, T. G. (1996). Machine learning. ACM Computing Surveys, 28(4es), 3.
Dietterich, T. G. (1998). The MAXQ method for hierarchical reinforcement learning. In Proceedings of the fifteenth international conference on machine learning (pp. 118–126). San Francisco, CA: Morgan Kaufmann.
Dietterich, T. G. (2000a). Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13, 227–303.
Dietterich, T. G. (2000b). An overview of MAXQ hierarchical reinforcement learning. In B. Y. Choueiry & T. Walsh (Eds.), Proceedings of the symposium on abstraction, reformulation and approximation SARA 2000, Lecture notes in artificial intelligence (pp. 26–44). New York, NY: Springer Verlag.
Dreyfus, H. L. (1979). What computers can't do: A critique of artificial reason (2nd ed.). New York: Harper and Row.
Dzeroski, S., Raedt, L. D., & Blockeel, H. (1998). Relational reinforcement learning. In Proceedings of the fifteenth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Erol, K., Hendler, J. A., & Nau, D. S. (1994). UMCP: A sound and complete procedure for hierarchical task-network planning. In Artificial intelligence planning systems (pp. 249–254).
Fikes, R. E. (1971). Monitored execution of robot plans produced by STRIPS. In Information processing 71, Proceedings of the IFIP congress (Vol. 1, pp. 189–194).
Fikes, R. E., Hart, P. E., & Nilsson, N. J. (1972). Learning and executing generalized robot plans. Artificial Intelligence, 3, 251–288.
Fikes, R. E., & Nilsson, N. J. (1971). STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2, 189–208.
Georgeff, M. P. (1987). Planning. In Annual review of computing science (Vol. 2, pp. 359–400). Annual Reviews Inc.
Ghallab, M., & Milani, A. (Eds.). (1996). New directions in AI planning. Amsterdam, Netherlands: IOS Press.
Gil, Y. (1994). Learning by experimentation: Incremental refinement of incomplete planning domains. In Proceedings of the eleventh international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Harnad, S. (1990). The symbol grounding problem. Physica D, 42, 335–346.
Haugeland, J. (Ed.). (1997). Mind design II. Cambridge, MA: Bradford/MIT Press.
Hauskrecht, M., Meuleau, N., Kaelbling, L. P., Dean, T., & Boutilier, C. (1998). Hierarchical solution of Markov decision processes using macro-actions. In Uncertainty in artificial intelligence (pp. 220–229).
Hayes, P. J. (1973). The frame problem and related problems in artificial intelligence. In Artificial and human thinking (pp. 45–59). Jossey-Bass Inc. and Elsevier Scientific Pub. Co.
Hengst, B. (2002). Discovering hierarchy in reinforcement learning with HEXQ. In Proceedings of the nineteenth international conference on machine learning (pp. 243–250). San Francisco, CA: Morgan Kaufmann.
Howard, R. A. (1960). Dynamic programming and Markov processes. Cambridge, MA: The MIT Press.
Hume, D., & Sammut, C. (1991). Using inverse resolution to learn relations from experiments. In Proceedings of the eighth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Iba, G. A. (1989). A heuristic approach to the discovery of macro-operators. Machine Learning, 3, 285–317.
Iba, W., Wogulis, J., & Langley, P. (1988). Trading off simplicity and coverage in incremental concept learning. In Proceedings of the fifth international conference on machine learning (pp. 73–79). Ann Arbor, MI: Morgan Kaufmann.
Jaakkola, T., Jordan, M. I., & Singh, S. P. (1994). On the convergence of stochastic iterative dynamic programming algorithms. In Advances in neural information processing systems (Vol. 6). The MIT Press.
Kaelbling, L. P. (1993). Hierarchical learning in stochastic domains: Preliminary results. In Proceedings of the tenth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Kaelbling, L. P., Littman, M. L., & Moore, A. P. (1996). Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 237–285.
Kaelbling, L. P., & Rosenschein, S. J. (1990). Action and planning in embedded agents. In Robotics and autonomous systems (Vol. 6, pp. 35–48).
Knoblock, C. A. (1991). Search reduction in hierarchical problem solving. In Proceedings of the ninth national conference on artificial intelligence (AAAI-91) (Vol. 2, pp. 686–691). Anaheim, CA: AAAI Press/MIT Press.
Koenig, S., & Simmons, R. G. (1993). Complexity analysis of real-time reinforcement learning. In National conference on artificial intelligence (pp. 99–107).
Korf, R. E. (1985). Macro-operators: A weak method for learning. Artificial Intelligence, 26, 35–77.
Korf, R. E. (1987). Planning as search: A quantitative approach. Artificial Intelligence, 33(1), 65–88.
Lagoudakis, M., & Littman, M. (2002). Algorithm selection using reinforcement learning. In Proceedings of the nineteenth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Laird, J. E., Rosenbloom, P. S., & Newell, A. (1986). Chunking in Soar: The anatomy of a general learning mechanism. Machine Learning, 1(1), 11–46.
Lavrac, N., & Dzeroski, S. (1994). Inductive logic programming: Techniques and applications. Ellis Horwood.
Lin, F., & Levesque, H. J. (1998). What robots can do: Robot programs and effective achievability. Artificial Intelligence, 101(1-2), 201–226.
Lin, L.-J. (1993). Reinforcement learning for robots using neural networks. Unpublished doctoral dissertation, School of Computer Science, Carnegie Mellon University.
Lorenzo, D., & Otero, R. P. (2000). Using ILP to learn logic programs for reasoning about actions. In Proceedings of the tenth international conference on inductive logic programming.
Maclin, R., & Shavlik, J. W. (1996). Creating advice-taking reinforcement learners. Machine Learning, 22, 251–282.
Maclin, R. F. (1995). Learning from instruction and experience: Methods for incorporating procedural domain theories into knowledge-based neural networks. Unpublished doctoral dissertation, University of Wisconsin-Madison.
Maes, P. (1990). How to do the right thing. Connection Science Journal, Special Issue on Hybrid Systems, 1.
Mahadevan, S., Khaleeli, N., & Marchalleck, N. (1997). Designing agent controllers using discrete-event Markov models. In Working notes of the AAAI fall symposium on model-directed autonomous systems. Cambridge, MA.
Mataric, M. J. (1994). Reward functions for accelerated learning. In Proceedings of the eleventh international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Mataric, M. J. (1996). Behaviour based control: Examples from navigation, learning and group behaviour. Journal of Experimental and Theoretical Artificial Intelligence, 9(2-3).
McCarthy, J., & Hayes, P. J. (1969). Some philosophical problems from the standpoint of artificial intelligence. In Machine intelligence (Vol. 4, pp. 463–502).
McGovern, A., & Barto, A. G. (2001). Automatic discovery of subgoals in reinforcement learning using diverse density. In Proceedings of the eighteenth international conference on machine learning (pp. 361–368). San Francisco, CA: Morgan Kaufmann.
Minsky, M. (1974). A framework for representing knowledge (Tech. Rep. No. Memo 306). MIT AI Lab.
Minton, S. (1988). Learning search control knowledge: An explanation-based approach. Dordrecht: Kluwer Academic Publishers.
Mitchell, T. M., Utgoff, P. E., & Banerji, R. (1984). Learning by experimentation: Acquiring and refining problem-solving heuristics. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (pp. 163–190). Berlin, Heidelberg: Springer.
Muggleton, S. (1995). Inverse entailment and Progol. New Generation Computing, Special issue on Inductive Logic Programming, 13(3-4), 245–286.
Muggleton, S., & Buntine, W. (1988). Machine invention of first-order predicates by inverting resolution. In Proceedings of the fifth international conference on machine learning (pp. 167–192). Ann Arbor, MI: Morgan Kaufmann.
Newell, A., & Simon, H. A. (1972). Human problem solving. Englewood Cliffs, NJ: Prentice-Hall.
Newell, A., & Simon, H. A. (1976). Computer science as empirical inquiry: Symbols and search. In Communications of the Association for Computing Machinery (Vol. 19, pp. 113–126).
Ng, A. Y., Harada, D., & Russell, S. (1999). Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the sixteenth international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Nilsson, N. J. (1994). Teleo-reactive programs for agent control. Journal of Artificial Intelligence Research, 1, 139–158.
Oates, T., & Cohen, P. R. (1996). Searching for planning operators with context-dependent and probabilistic effects. In H. Shrobe & T. Senator (Eds.), Proceedings of the thirteenth national conference on artificial intelligence and the eighth innovative applications of artificial intelligence conference, Vol. 2 (pp. 865–868). Menlo Park, CA: AAAI Press.
Papavassiliou, V. A., & Russell, S. J. (1999). Convergence of reinforcement learning with general function approximators. In Proceedings of the seventeenth international joint conference on artificial intelligence. San Francisco, CA: Morgan Kaufmann.
Parr, R. (1998). Hierarchical control and learning for Markov decision processes. Unpublished doctoral dissertation, University of California at Berkeley.
Parr, R., & Russell, S. (1998). Reinforcement learning with hierarchies of machines. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems (Vol. 10). The MIT Press.
Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. New York, NY: John Wiley & Sons, Inc.
Reiter, R. (1987). A logic for default reasoning. In M. L. Ginsberg (Ed.), Readings in nonmonotonic reasoning (pp. 68–93). Los Altos, CA: Morgan Kaufmann.
Rosenberg, J. F. (1990). Connectionism and cognition. In Acta analytica (pp. 33–36). Dubrovnik.
Rosenschein, S. J. (1981). Plan synthesis: A logical perspective. In Proceedings of the seventh international joint conference on artificial intelligence (pp. 331–337). San Francisco, CA: Morgan Kaufmann.
Rumelhart, D. E. (1989). The architecture of mind: A connectionist approach. In M. I. Posner (Ed.), Foundations of cognitive science. Cambridge, MA: Bradford/MIT Press.
Rummery, G. A., & Niranjan, M. (1994). Online Q-learning using connec-tionist systems (Tech. Rep. No. CUED/F-INFENG/TR 166). CambridgeUniversity Engineering Department.
Sacerdoti, E. D. (1973). Planning in a hierarchy of abstraction spaces. In Pro-ceedings of the third international joint conference on artificial intelligence.San Franciso, CA: Morgan Kaufmann.
Sacerdoti, E. D. (1974). Planning in a hierarchy of abstraction spaces. ArtificialIntelligence, 5 (2), 115-135.
Sacerdoti, E. D. (1977). A structure for plans and behaviour. Amsterdam,London, New York: Elsevier/North-Holland.
Sammut, C., & Banerji, R. (1986). Learning concepts by asking questions.In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machinelearning: An artificial intelligence approach (Vol. 2, p. 167-192). Los Altos,CA: Morgan Kaufmann.
Santamarıa, J. C., Sutton, R. S., & Ram, A. (1998). Experiments with rein-forcement learning in problems with continuous state and action spaces.Adaptive Behaviour, 6 (2).
Schoppers, M. (1987). Universal plans for reactive robots in unpredicatble sys-tems. In Proceedings of the tenth international joint conference on artificialintelligence. San Franciso, CA: Morgan Kaufmann.
Shapiro, A. D. (1987). Structured induction in expert systems. Addison Wesley,London.
Shapiro, D., Langley, P., & Shachter, R. (2001). Using background knowledgeto speed reinforcement learning in physical agents. Proceedings of the FifthInternational Conference on Autonomous Agents, 254-261.
Shapiro, E. (1981). An algorithm that infers theories from facts. In Proceedingsof the seventh international joint conference on artificial intelligence (p.446-452). San Franciso, CA: Morgan Kaufmann.
Shapiro, E. Y. (1981). Inductive inference of theories from facts (Tech. Rep. No.192). Dept of CS, Yale University.
Shen, W.-M. (1993). Discovery as autonomous learning from the environment. Machine Learning, 12.
Shen, W.-M. (1994). Autonomous learning from the environment. W. H. Freeman/Computer Science Press.
Shoham, Y., & McDermott, D. V. (1988). Problems in formal temporal reasoning. Artificial Intelligence, 36 (1), 49–61.
Singh, S. P. (1992). Reinforcement learning with a hierarchy of abstract models. In Proceedings of the tenth national conference on artificial intelligence. Cambridge, MA: MIT Press.
Singh, S. P., Jaakkola, T., Littman, M. L., & Szepesvári, C. (2000). Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38 (3), 287–308.
Smolensky, P. (1989). Connectionist modeling: Neural computation / mental connections. In L. Nadel, L. A. Cooper, P. Culicover, & R. M. Harnish (Eds.), Neural connections, mental computation. Cambridge, MA: Bradford/MIT Press.
Srinivasan, A. (2001a). The Aleph manual. http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/.
Srinivasan, A. (2001b). Extracting context-sensitive models in inductive logic programming. Machine Learning, 44 (3), 301–324.
Stone, P., & Sutton, R. S. (2001). Scaling reinforcement learning toward RoboCup soccer. In Proceedings of the eighteenth international conference on machine learning (pp. 537–544). San Francisco, CA: Morgan Kaufmann.
Sutton, R. S. (1987). Implementation details for the TD(λ) procedure for the case of vector predictions and backpropagation (Tech. Rep. No. TN87-509.1). GTE Laboratories.
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3, 9–44.
Sutton, R. S. (1990). Integrated architectures for learning, planning and reacting based on approximating dynamic programming. In Proceedings of the seventh international conference on machine learning. San Francisco, CA: Morgan Kaufmann.
Sutton, R. S. (1995). Generalisation in reinforcement learning: Successful examples using sparse coarse coding. Advances in Neural Information Processing Systems.
Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In D. S. Touretzky, M. C. Mozer, & M. E. Hasselmo (Eds.), Advances in neural information processing systems (Vol. 8, pp. 1038–1044). The MIT Press.
Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA: The MIT Press.
Sutton, R. S., Precup, D., & Singh, S. P. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112 (1-2), 181–211.
Sutton, R. S., Singh, S., Precup, D., & Ravindran, B. (1999). Improved switching among temporally abstract actions. In Advances in neural information processing systems (Vol. 11). The MIT Press.
Tate, A. (1975). Using goal structure to direct search in a problem solver. Unpublished doctoral dissertation, University of Edinburgh.
Taylor, K. (1996). Autonomous learning by incremental induction and revision. Unpublished doctoral dissertation, Australian National University.
Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6, 215–219.
Thrun, S. B. (1992). The role of exploration in learning control. In D. White & D. Sofge (Eds.), Handbook of intelligent control: Neural, fuzzy and adaptive approaches. Van Nostrand Reinhold.
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and Q-learning.Machine Learning, 16 (3).
Tsitsiklis, J. N., & Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. In Advances in neural information processing systems 9: Proceedings of the 1996 conference. San Francisco, CA: Morgan Kaufmann.
Wang, X. (1996). Planning while learning operators. In B. Drabble (Ed.), Proceedings of the 3rd international conference on artificial intelligence planning systems (AIPS-96) (pp. 229–236). AAAI Press.
Watkins, C. J. C. H. (1989). Learning from delayed rewards. Unpublished doctoral dissertation, King's College, Cambridge, England.
Watkins, C. J. C. H., & Dayan, P. (1992). Q-learning. Machine Learning, 8 (3), 279–292.
Weld, D. S. (1994). An introduction to least commitment planning. AI Magazine, 15 (4), 27–61.
Weld, D. S., Anderson, C. R., & Smith, D. E. (1998). Extending Graphplan to handle uncertainty and sensing actions. In AAAI/IAAI (pp. 897–904).
Westendorp, J. H. (2003). Noise-resistant incremental relational learning using possible worlds. In S. Matwin & C. Sammut (Eds.), ILP 2002 (Vol. 2583, pp. 317–332). Springer-Verlag.
Wiering, M., Salustowicz, R., & Schmidhuber, J. (1999). Reinforcement learning soccer teams with incomplete world models. Autonomous Robots.
Winograd, T. (1972). Understanding natural language. Cognitive Psychology, 1, 1–191.
Zhang, W., & Dietterich, T. G. (1995). A reinforcement learning approach to job-shop scheduling. In Proceedings of the fourteenth international joint conference on artificial intelligence. San Francisco, CA: Morgan Kaufmann.
Ziemke, T. (1997). Rethinking grounding. In Proceedings of new trends in cognitive science: Does representation need reality. Vienna.