
Approximate Dynamic Programming with

Adaptive Critics

and

The Algebraic Perceptron as a Fast Neural Network

related to Support Vector Machines

by

Thomas Hanselmann, Dipl. El.-Ing. ETH

This thesis is presented for the degree of

Doctor of Philosophy

of

The University of Western Australia

School of Electrical, Electronic and Computer Engineering,

School of Mathematics and Statistics

The University of Western Australia

2003


Acknowledgements

To all of my supervisors, Dr. Anthony Zaknich, Dr. Lyle Noakes, Dr. Andrey Savkin and Dr. Yianni Attikiouzel, whom I had at some stages during my studies, I would like to express my deep gratitude for their guidance, encouragement and support. Additional special thanks go to Dr. Anthony Zaknich, who stayed on voluntarily as my primary supervisor without any payment from the University, and also a very special thank you goes to Dr. Lyle Noakes for his many mathematical discussions, ideas and great optimism which beats the most evil Murphy.

I have enjoyed my stay in the CIIPS research group and would like to thank all its members for their friendship. In particular, I want to thank Dr. Chris deSilva, a former CIIPS member, for his excellent courses and lecture notes in advanced signal processing. Also, I would like to thank my friends James Young, Martin Masek and Brad Finch for all the good times, especially the badminton. I am also very grateful to Mahmoud El-Hirbawy and Frederick Chee, whom I got to know better only after submitting my thesis, during our time as Associate Lecturers.

To my good friend Thomas Meier, I am very grateful for many discussions and questions I had in the early stages of my thesis, which he started solving with the words ‘let’s make an example’, and the outcome was often simple, but still non-trivial. Many thanks go also to my good friends Andreas Haeberli and Ivo Hasler, who taught me more than they ever might think. I also would like to thank Elena and Thomas, Christine and Andreas, Doris and Ivo and Adrian Rothenfluh for their hospitality and generosity on my conference trips around the world.

I would like to thank my girlfriend Ped for her enormous love and great support during my studies despite the many thousand miles separating us most of the time, and I wish for her to finish her Ph.D. very soon! Many special thanks go to my father, whose generosity is immense and enabled me to live in a very nice place during the last year of my studies. Also many thanks to my brother, who dealt with many bureaucrats and forms during my absence from Switzerland. Many memories and thanks go to my late mother and grandparents, who gave me everything to be a Hanselmann (to be pronounced as ‘handsome man’) and only occasionally to be a hassle man.

Finally, I would like to acknowledge the financial support provided by the University Postgraduate Award (UPA) from the University of Western Australia, the Australian Research Council (ARC) Scholarship by Dr. Andrey Savkin and an ad-hoc Scholarship from the Centre for Intelligent Information Processing Systems (CIIPS) Research Group by Dr. Anthony Zaknich.

Special Acknowledgements and Thanks to the Reviewers

I would like to thank all the reviewers of my thesis for the time they spent, and I am particularly grateful for the valuable comments made by Dr. Paul J. Werbos and Dr. Danil V. Prokhorov and for their excellent and supportive feedback. I feel sorry not to be able to delve into all the suggestions, at least for this thesis, as there are other more mundane affairs and university deadlines that prevent me from doing so. Nevertheless, if I get a chance in my future career to pursue these topics, I would love to tackle some of the issues raised in this thesis and follow up on the advice given to me.


Abstract

This thesis treats two aspects of intelligent control: the first part is about long-term optimization by approximating dynamic programming, and the second part considers a specific class of fast neural networks related to support vector machines (SVMs).

The first part relates to approximate dynamic programming, especially in the framework of adaptive critic designs (ACDs). Dynamic programming can be used to find an optimal decision or control policy over a long-term period. However, in practice it is difficult, and often impossible, to calculate a dynamic programming solution, due to the ‘curse of dimensionality’. The adaptive critic design framework addresses this issue and tries to find a good solution by approximating the dynamic programming process for a stationary environment.

In an adaptive critic design there are three modules: the plant or environment to be controlled, a critic to estimate the long-term cost, and an action or controller module to produce the decision or control strategy. Even though there have been many publications on the subject over the past two decades, there are some points that have had less attention. While most of the publications address the training of the critic, one of the points that has not received systematic attention is the training of the action module.¹ Normally, training starts with an arbitrary, hopefully stable, decision policy whose long-term cost is then estimated by the critic. Often the critic is a neural network that has to be trained, using temporal differences and Bellman’s principle of optimality. Once the critic network has converged, a policy improvement step is carried out by gradient descent to adjust the parameters of the controller network. Then the critic is retrained to give the new long-term cost estimate. However, it would be preferable to focus more on extremal policies earlier in the training. Therefore, the Calculus of Variations is investigated, although the idea of using the Euler equations to train the actor is eventually discarded. Instead, an adaptive critic formulation for a continuous plant with a short-term cost given as an integral cost density is made, and the chain rule is applied to calculate the total derivative of the short-term cost with respect to the actor weights. This differs from the discrete systems usually used in adaptive critics, which work with total ordered derivatives. This idea is then extended to second order derivatives so that Newton’s method can be applied to speed up convergence. Based on this, an almost concurrent actor and critic training is proposed. The equations are developed for an arbitrary non-linear system and short-term cost density function, and they were tested on a linear quadratic regulator (LQR) setup. With this approach, the solution for the actor and critic weights can be achieved in only a few actor-critic training cycles.

¹ Werbos comments that most treatments of action nets or policies assume either enumerative maximization, which is good only for small problems (with the exception of games such as Backgammon or Go [1]), or gradient-based training. The latter is prone to difficulties with local minima due to the non-convex nature of the cost-to-go function. With incremental methods, such as backpropagation through time, calculus of variations and model-predictive control, the danger of non-convexity of the cost-to-go function with respect to the control is much smaller than with respect to the critic parameters when the sampling times are small. Therefore, getting the critic right has priority. But with larger sampling times, when the control represents a more complex plan, non-convexity becomes more serious.
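As a rough illustration of the alternating critic/actor training cycle sketched above, the following Python fragment fits a quadratic critic by temporal differences under a fixed linear policy and then improves the policy by a simple gradient step. The plant, cost function, quadratic critic and finite-difference actor update are invented placeholders for illustration; they are not the formulation developed later in the thesis.

```python
import numpy as np

def plant_step(x, u, dt=0.01):
    # Illustrative linear plant x(t+1) = A x(t) + B u(t); purely a placeholder.
    A = np.array([[1.0, dt], [0.0, 1.0]])
    B = np.array([[0.0], [dt]])
    return A @ x + B @ u

def short_term_cost(x, u):
    return (x.T @ x + 0.1 * u.T @ u).item()

def critic(W, x):
    # Quadratic critic J(x) ~ x^T W x approximating the long-term cost.
    return (x.T @ W @ x).item()

def actor_critic_cycle(W, K, x0, gamma=0.95, lr_c=1e-3, lr_a=1e-4, steps=200, eps=1e-5):
    # 1) Critic training: temporal-difference fit under the fixed policy u = -K x.
    x = x0.copy()
    for _ in range(steps):
        u = -K @ x
        x_next = plant_step(x, u)
        target = short_term_cost(x, u) + gamma * critic(W, x_next)
        td_error = critic(W, x) - target
        W -= lr_c * td_error * np.outer(x, x)   # dJ/dW = x x^T for a quadratic critic
        x = x_next
    # 2) Policy improvement: crude finite-difference gradient of the one-step
    #    cost plus discounted critic value with respect to the actor parameters K.
    x = x0.copy()
    for _ in range(steps):
        base = short_term_cost(x, -K @ x) + gamma * critic(W, plant_step(x, -K @ x))
        grad = np.zeros_like(K)
        for i in range(K.shape[0]):
            for j in range(K.shape[1]):
                Kp = K.copy()
                Kp[i, j] += eps
                perturbed = short_term_cost(x, -Kp @ x) + gamma * critic(W, plant_step(x, -Kp @ x))
                grad[i, j] = (perturbed - base) / eps
        K -= lr_a * grad
        x = plant_step(x, -K @ x)
    return W, K
```

A call such as actor_critic_cycle(np.eye(2), np.array([[0.5, 0.5]]), np.array([[1.0], [0.0]])) runs one such cycle; in the thesis the cycles are repeated until actor and critic are mutually consistent.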


Some other, more minor issues in the adaptive critic framework are also investigated, such as the influence of the discounting factor in the Bellman equation on total ordered derivatives and the interpretation of targets in backpropagation through time as moving and fixed targets; the relation between simultaneous recurrent networks and dynamic programming is stated, and a reinterpretation of the recurrent generalized multilayer perceptron (GMLP) as a recurrent generalized finite impulse response MLP (GFIR-MLP) is made.

Another subject investigated in this area is that of a hybrid dynamical system, characterized by a continuous plant and a set of basic feedback controllers, which are used to control the plant by finding a switching sequence that selects one basic controller at a time. The special but important case is considered where the plant is linear but with some uncertainty in the state space and in the observation vector, and the cost function is quadratic. This is a form of robust control for which a dynamic programming solution has to be calculated. Due to the special form, the recursive dynamic programming solution can be approximated by a certain form of adaptive critic design, sometimes called Q-learning. However, extra care has to be taken to avoid instability due to the approximation errors and the recursive procedure of dynamic programming, which tends to amplify errors considerably.
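To give a flavour of this controller-switching view, the following sketch shows a switching decision and one approximate dynamic-programming backup over a finite set of basic controllers. The function names and the plain Q-backup are illustrative assumptions, not the robust formulation actually analysed in Chapter 6.

```python
import numpy as np

def select_controller(Q_models, x):
    # Q_models[i](x) ~ estimated long-term cost of engaging basic controller i
    # in state x (e.g. each modelled by a piecewise quadratic form).
    # The switching rule picks the cheapest controller at the current state.
    return int(np.argmin([q(x) for q in Q_models]))

def q_backup(x, i, controllers, Q_models, plant_step, stage_cost, gamma=1.0):
    # One backward-recursion step: the cost of applying controller i now,
    # plus the best estimated cost-to-go from the successor state.
    u = controllers[i](x)
    x_next = plant_step(x, u)
    return stage_cost(x, u) + gamma * min(q(x_next) for q in Q_models)
```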

The problem of fast learning with limited data is addressed in the second part. This area has been investigated for many decades but only recently started to blossom in full with Vapnik’s statistical learning theory and a class of algorithms, the so-called support vector machines, which make use of very high-dimensional feature spaces and linear separations therein. In this thesis, the special case of binary pattern recognition is investigated. An algorithm, called the algebraic perceptron algorithm, is introduced. It extends the well-known perceptron algorithm to achieve a linear dichotomy in high-dimensional spaces, of a dimensionality such as 350, which corresponds to polynomial separating curves in the input space. This is achieved through inner-product kernels, similar to support vector machines. However, in contrast to SVMs, the algebraic perceptron will not find an optimal solution for separating the two classes. Nevertheless, it can be optimized towards an optimal solution if necessary. Sometimes, especially with many densely placed data points, it can even achieve a better solution than the theoretically superior optimal support vector machine, whose solution is often tricky to calculate for large data sets.
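For readers unfamiliar with kernelized perceptrons, the generic dual-form update below shows how an inner-product kernel turns the classical perceptron rule into a polynomial separator. It is only a sketch of the general idea and is not the algebraic perceptron algorithm defined in Chapter 9.

```python
import numpy as np

def poly_kernel(x, y, degree=3):
    # Polynomial inner-product kernel: the decision function is linear in the
    # induced feature space but polynomial in the original input space.
    return (1.0 + np.dot(x, y)) ** degree

def kernel_perceptron(X, t, degree=3, epochs=100):
    """X: (n, d) training inputs, t: (n,) labels in {-1, +1}.
    Returns dual coefficients alpha defining f(x) = sum_i alpha_i t_i k(x_i, x)."""
    n = X.shape[0]
    alpha = np.zeros(n)
    K = np.array([[poly_kernel(X[i], X[j], degree) for j in range(n)] for i in range(n)])
    for _ in range(epochs):
        mistakes = 0
        for i in range(n):
            if t[i] * np.sum(alpha * t * K[:, i]) <= 0:  # misclassified or on the boundary
                alpha[i] += 1.0                          # perceptron-style dual update
                mistakes += 1
        if mistakes == 0:                                # separable in the feature space
            break
    return alpha
```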

There are many interesting geometrical interpretations and possibilities that can be used to extend the algebraic perceptron. One suggested application is to use it as a tool to decompose more complicated objects into simpler ones. This potential is demonstrated on an artificially created binary object of overlapping ellipses.


Contents

Preface
  Background
  Contributions
  Thesis Outline

I Approximation of Dynamic Programming via ACDs

1 Introduction
  1.1 Calculus of Variations
  1.2 Dynamic Programming
    1.2.1 Hamiltonian Notation for Dynamic Programming
    1.2.2 Approximate Value Iteration
    1.2.3 Approximate Policy Iteration
    1.2.4 Approximate Value versus Policy Iteration

2 Calculus of Variations for Optimal Control
  2.1 Problem Statement for Uncontrolled Functionals
  2.2 Euler Equations and Boundary Conditions
    2.2.1 Piecewise Smooth Extremals: Weierstrass-Erdmann corner conditions
  2.3 Constrained Extremals
    2.3.1 The Lagrange Multiplier Method
  2.4 Variational Approach Applied to Control Problems
    2.4.1 The Hamiltonian Notation
      2.4.1.1 Boundary Conditions
        2.4.1.1.1 Problems with fixed final time (δtf = 0)
        2.4.1.1.2 Problems with free final time (δtf is arbitrary)
      2.4.1.2 Pontryagin's Minimum Principle
    2.4.2 Summary of Variational Approach Applied to Control Problems
      2.4.2.1 Additional Necessary Conditions
      2.4.2.2 State Variable Inequality Constraints
  2.5 Solving the Variational Approach with Neural Networks
    2.5.1 Direct Application of Calculus of Variations
    2.5.2 Hamiltonian Formulation
    2.5.3 Some Remarks to the Neural Networks Approach

3 Adaptive Critic Designs
  3.1 Conventional Training
    3.1.1 Training the Critic via HDP and DHP
    3.1.2 Training the Actor
  3.2 Training with Consideration of the Euler Equations
    3.2.1 Actor Training
    3.2.2 Critic Training
  3.3 J − λ Consistency
  3.4 J − λ Regularization
  3.5 Why Continuous Time ACD

4 DP, ACD, SRN, Total Ordered Derivatives
  4.1 Discretization of the Training Equations
  4.2 Discrete Dynamic Programming
  4.3 Simultaneous Recurrent Network
    4.3.1 BPTT and Chain Rule for Derivatives
      4.3.1.1 Derivation of BPTT(h)
    4.3.2 Fixed and Moving Targets
    4.3.3 Real-Time Recurrent Learning (RTRL)
    4.3.4 Some Notes on BPTT and RTRL
    4.3.5 Recursive Generalized FIR-MLP
  4.4 Continuous Version of 'Ordered' Total Derivatives
    4.4.1 Continuous Adaptive Critics
    4.4.2 Second Order Adaptation for Actor Training
      4.4.2.1 Newton's method
    4.4.3 Almost Concurrent Actor and Critic Adaptation
    4.4.4 Some Remarks

5 Testing of Euler Equations with Adaptive Critics
  5.1 LQR system with Euler equations
    5.1.1 Optimal LQR-control
    5.1.2 Pretraining of Actor with Euler Equations
      5.1.2.1 Training-Algorithm
      5.1.2.2 Problems
    5.1.3 Conventional ACD Training
    5.1.4 Comparison Conventional ACD Training versus Euler Training
  5.2 Improved adaptation laws for LQR systems
    5.2.1 Numerical Example
      5.2.1.1 rank(K) = dim(x)
      5.2.1.2 rank(K) < dim(x)
  5.3 Continuous Version of Total Derivatives for Euler-ACDs
  5.4 Results for Continuous ACDs with Newton's Method
    5.4.1 rank(K) = dim(x)
    5.4.2 rank(K) < dim(x)

6 Hybrid Dynamical System
  6.1 Introduction
  6.2 Problem Formulation
    6.2.1 Linear Output Feedback Basic Controllers
  6.3 Solving the Dynamic Programming Equation
    6.3.1 The first steps backwards
      6.3.1.1 Calculation of boundary extrema
    6.3.2 Special Remarks
    6.3.3 Experiment
      6.3.3.1 Parameters
    6.3.4 Results
      6.3.4.1 Results without smoothing
      6.3.4.2 Results with local smoothing

II A class of fast and cheap neural networks

7 Introduction
  7.1 Statistical Learning Theory
    7.1.1 The model
    7.1.2 Imitation versus Identification
    7.1.3 Minimizing the Risk Functional from Empirical Data
      7.1.3.1 Pattern Recognition
      7.1.3.2 Regression Estimation
      7.1.3.3 Density Estimation
      7.1.3.4 Empirical Risk Minimization
    7.1.4 Identification of Stochastic Objects
      7.1.4.1 Ill-posed Problems
      7.1.4.2 Well-posed Problems in Tikhonov's Sense
      7.1.4.3 Tikhonov's Regularization Method
    7.1.5 Important Ideas of Statistical Learning Theory
      7.1.5.1 Consistency of ERM Principle
        7.1.5.1.1 Entropy of the set of indicator functions
        7.1.5.1.2 Conditions for uniform convergence
        7.1.5.1.3 Key Theorem of Learning Theory
        7.1.5.1.4 Three milestones of learning theory
        7.1.5.1.5 Bounds on the rate of convergence
      7.1.5.2 VC-dimension
      7.1.5.3 Structural Risk Minimization (SRM)
        7.1.5.3.1 Principle of Structural Risk Minimization

8 Support Vector Machines
  8.1 Geometry of SVMs for Pattern Classification
    8.1.1 The linear separable case
    8.1.2 The linear non-separable case
    8.1.3 The Dual Objective Formulation
    8.1.4 Lifting the Dimension
      8.1.4.1 Implications of lifting the dimension
      8.1.4.2 Eliminating the equality constraint through lifting and projection
        8.1.4.2.1 Approximate solution by lifting heuristic
    8.1.5 Balanced Classifiers
  8.2 Transforms to High-dimensional Feature Space
    8.2.1 Inner-Product Kernels
  8.3 SVMs for Function Approximation

9 Algebraic Perceptron
  9.1 Algorithm
    9.1.1 Iterative Algorithm
    9.1.2 Recursive Algorithm
    9.1.3 Algebraic Perceptron Convergence Proof
    9.1.4 Special Initialization
    9.1.5 Other Theoretical Considerations
      9.1.5.1 Other Kernels
      9.1.5.2 Large Data Sets
  9.2 Comparison Algebraic Perceptron and SVM
    9.2.1 Comparison Results and Experimental Observations
  9.3 Voting Algebraic Perceptron
    9.3.1 Algorithm
    9.3.2 Example
      9.3.2.1 Results

10 Optimizing an Algebraic Perceptron Solution
  10.1 Optimization based on the primal objective
    10.1.1 Maximizing the Margin ρmin
  10.2 Optimization based on the dual objective
    10.2.1 Summary of Adatron Optimization
  10.3 Conclusions about Optimizing the AP

11 Decomposition Algorithm based on AP
  11.1 Region Growing Algebraic Perceptron
    11.1.1 Notation and Notes
    11.1.2 RGAP-Algorithm
      11.1.2.1 Training
      11.1.2.2 Adding Intersections
      11.1.2.3 Intersection RGAP: IRGAP
      11.1.2.4 Prediction
  11.2 Example of three overlapping ellipses
  11.3 Discussion and Improvements
    11.3.1 Speed up techniques
    11.3.2 Improving robustness
    11.3.3 Generalization to binary objects with genus greater than one
    11.3.4 Structural complexity control
  11.4 Extension to SVM algorithms
  11.5 Conclusion about the RGAP-Framework

12 Conclusions and Outlook

List of Publications

A Notation
  A.1 Derivatives
  A.2 Chain Rule

B Calculation of some useful derivatives
  B.1 System equations
  B.2 Useful one-step derivatives
    B.2.1 Total derivatives
    B.2.2 Total derivatives involving φ
    B.2.3 Partial derivatives
  B.3 Calculation of dJ(t)/du(t)
  B.4 Derivation of dc/dw

C Terms and Definitions
  C.1 Probability Model of a Random Experiment

References


List of Figures

3.1 Graphical representation of an ACD. Here, not only the state but also its derivative are inputs to the critic. A single network having dedicated areas for the functional blocks of plant, actor and critic can be used. In [2], Prokhorov showed that it is advantageous to input an extended state vector, consisting of all available information of state, control and even model reference inputs, to the critic. This is intuitively clear, because when adapting backwards, long-term information from J or λ has to be “squeezed” through subspaces of lower dimensions when only partial state, control or reference information is used. However, sometimes it makes sense not to use all the information, e.g. because it might simply be difficult to access, especially in technical systems that might be built of certain encapsulated blocks. Here, only the state derivative was used, and extensions are straightforward. Similar to this graph, Prokhorov has combined critics approximating J and λ in [3].

3.2 Graphical representation of the training of the regularization network and the J and λ approximators. Together they achieve a GDHP-style training.

4.1 Discrete Dynamic Programming model. Assuming a stationary policy and one-step costs E, there is a total cost function Jπ for every policy π. An optimal policy πopt will have a minimal total cost function Jπopt.

4.2 The SRN model. It has a total of N nodes, where the first m nodes are the external inputs x1, ..., xm and the nodes indexed by m + 1 through N are internal nodes. The output nodes can be arbitrarily selected from the internal nodes; they can be regarded simply as internal nodes without limitation. In addition to the external inputs, the internal nodes have their delayed values as further inputs, as well as the current output of lower indexed nodes. An additional node x0 ≡ 1 can be added to provide some bias, or, alternatively, x1(t) := 1 can be chosen.


4.3 A simple flow diagram that demonstrates the forward influence of a function block, evaluated at consecutive time steps, onto a target quantity. To work out the sensitivities with respect to certain quantities, such as state vectors at a certain time or parameters, backpropagation through time can be used (dashed paths). It is basically applying the chain rule for derivatives.

4.4 Another interpretation of the formulation of total derivatives can be achieved by looking at the definition of the target. Tar(k, h) is the target calculated at the fixed time k and includes contributions up to the time k + h. This can be seen as a fixed target. On the other hand, if a target Tar(t, h) is used with a starting time t equal to the time at which the derivative has to be calculated, it can be seen as a moving target.

4.5 The recurrent generalized FIR-MLP is distinguished from the SRN by having FIR tap-lines instead of only a single weight. Therefore it uses fewer (linear) nodes to generate an equivalent SRN.

4.6 The connection between neighboring trajectories due to a slight change in the weights. Multiplying all the vectors by δt makes clear that the order of derivatives with respect to time and weights can be exchanged; see equation (4.51).

5.1 The optimal cost-to-go function J(x;Kopt) over the state space x.

5.2 Cost function after some time causes parameters to oscillate around Kgood. Even though Kgood is not that close to the optimal parameter matrix Kopt, the cost-to-go function J(x;Kgood) is very close to the optimal one (see figure 5.3).

5.3 Difference between J(x;Kgood) and the optimal J(x;Kopt). The saddle point behaviour is difficult for learning with random initialization points. Some would like to increase and others to decrease the current J(x;Kgood) while converging to the optimal J.

5.4 While training from a random initial K, the J(x;Ktemp) function at certain stages becomes a narrow but long and flat valley. This gives problems of convergence speed, oscillations, or even divergence.

5.5 Trajectory of the parameters −K; k11: blue, k12: green, k21: red, k22: cyan. The optimal values are dotted. After 768 iterations the values of the parameters are those of −Kgood. In the next iteration −K = −[0.5109, −4.3997; 0.3506, −0.4627]; then further adaptation was too large and the system diverged.

5.6 Trajectory of state space x(t) and control u(t) for parameters −Kgood and initial state x0 = [1.7877, −1.0512]; x1: blue, x2: green, u1: red, u2: cyan. The dotted lines correspond to the optimal parameters −Kopt.

5.7 Euler training. The parameters are learned very fast until training breaks down as it comes closer to the optimal values.

5.8 Euler training, using the same time-scale as the conventional training.


5.9 Euler, then conventional training. Here the optimal parameters can be retrieved, and a reduction in the overall training time can be achieved due to the fast Euler pretraining stage.

5.10 Conventional training. Here the optimal parameters can be retrieved. However, training is slow.

5.11 Trajectory of the parameters K for the system given in section 5.2.1.1. The solid lines represent the time actor training via Newton's method is performed. During the time indicated by the dashed lines, actor parameters are frozen and critic weights are adapted. After four actor-critic cycles the parameters are learned to within an error better than 10⁻⁵.

5.12 Trajectory of critic parameters Wc. The solid lines represent the time critic training is performed. After the first actor-critic cycle the actor-critic consistency is achieved, and the proposed linear critic updates due to actor changes can be applied. This is shown by the black lines, which represent a jump towards the optimal values, especially for the non-zero w11, w22, at the second actor-critic cycle.

5.13 Trajectory of the parameters K for the system given in section 5.2.1.2. The solid lines represent the time actor training via Newton's method is performed. During the time indicated by the dashed lines, actor parameters are frozen and critic weights are adapted. After four actor-critic cycles the parameters are learned to within an error better than 10⁻⁵.

5.14 Trajectory of critic parameters Wc (note: w12 = w21). The solid lines represent the time critic training is performed. After the first actor-critic cycle the actor-critic consistency is achieved, and the proposed linear critic updates due to actor changes can be applied. This is shown by the black lines, which represent a jump towards the optimal values, given by (5.64), especially for the non-zero w11, w22, at the second actor-critic cycle.

6.1 Situation with one possible maximum at 2xif inside the corresponding region and one outside at 1xif.

6.2 Problematic situation with the maximum at the boundary of the two decision regions.

6.3 Solution without local smoothing at first iteration backwards.

6.4 Solution without local smoothing at second iteration backwards.

6.5 Solution without local smoothing at third iteration backwards.

6.6 Solution without local smoothing at tenth iteration backwards. A few iterations further, the situation becomes completely unstable (not shown). It is clearly visible that the corners start bending. This is because of numerical inaccuracy in determining the supremum in equation (6.25), because for the corner grid points x, the corresponding xf lie outside the supported grid region for both controllers.


6.7 Solution with local smoothing (r = 3) at first iteration backwards.

6.8 Solution with local smoothing (r = 3) at second iteration backwards.

6.9 Solution with local smoothing (r = 3) at third iteration backwards.

6.10 Solution with local smoothing (r = 3) at tenth iteration backwards.

6.11 Solution with local smoothing (r = 8) at first iteration backwards.

6.12 Solution with local smoothing (r = 8) at second iteration backwards.

6.13 Solution with local smoothing (r = 8) at third iteration backwards.

6.14 Solution with local smoothing (r = 8) at tenth iteration backwards.

6.15 Control trajectories u1(t) and u2(t) in red and blue, and state magnitude |x(t)| in green (r = 3).

6.16 Control trajectories u1(t) and u2(t) in red and blue, and state magnitude |x(t)| in green (r = 8).

6.17 Control trajectories x(t) and x(t) in red and green (r = 3).

6.18 Control trajectories x(t) and x(t) in red and green (r = 8).

6.19 Solution with local smoothing (r = 8) at tenth iteration backwards with a supported grid in [−5, 5] × [−5, 5]. Even though the number of grid points has not been increased, the solution does not change significantly, whereas in dynamic programming a fine quantization would need to be maintained, increasing the calculation time rapidly.

6.20 Control trajectories u1(t) and u2(t) in red and blue, and state magnitude |x(t)| in green (r = 8). There is not much difference to the smaller grid; however, it could capture systems outside the previous supported grid of [−2, 2] × [−2, 2].

7.1 A model of the learning process from examples. Some generator process produces vectors x, and a target operator or supervisor sets a value y (possibly vectorial) for every vector x. During the learning process the learning machine tries to return an estimated value ŷ that is close to y, given the same input x and the corresponding target.

7.2 The bound on the (guaranteed) risk is the sum of the empirical risk and the confidence interval. The empirical risk decreases with increasing structure index k, because the capacity parameter, the VC dimension hk, is increasing, while the confidence interval increases. Having a huge capacity of functions of the learning machine at hand, it is intuitively clear that the training error (the empirical risk) gets smaller, while the danger of overtraining decreases the confidence of having the 'true' underlying function that was responsible for generating the observed sample data, and hence increases the confidence interval.


8.1 Clearly, w∗ = (x2 − x1)/||x2 − x1|| is not the true normal to the optimal hyperplane. Adding more (support vector) data between x1 and x3 will achieve a w∗ that is closer to w.

8.2 Lifting causes a rotation of the optimal hyperplane by the angle φ. Above, IR² is lifted by λ. Below, a projection construction in IR² is given by flipping the optimal hyperplane, with normal w, obtained from the lifted data onto IR². If the two data points x1 and x2 were lying on R1, no rotation would occur, i.e. φ = 0, and the projected part wp of the lifted solution w would point in the same direction as the original solution w. Furthermore, if the original data were distributed symmetrically with respect to the origin, lifting and optimizing in the lifted space would not cause a rotation, the angle θ would be 90°, and hence the separating hyperplanes from original and lifted symmetrical data would be parallel. This is easy to see with two data points but is more difficult to understand when many data points are involved, as the meaning of symmetry becomes unclear. Nevertheless, it helps to develop a heuristic to work out the original hyperplane as described in the text.

8.3 Lifting IRⁿ to IRⁿ⁺¹. Projecting the lifted data onto the sphere S in IRⁿ⁺¹ yields the data for the algebraic perceptron. The spherical mapping always enforces a zero bias, meaning the separating hyperplane goes through the origin.

8.4 The XOR problem is not separable in IR² but it is in IR⁶. It is clear that a second degree polynomial, say P(x; y) := (1 + x1y1 + x2y2)² = 0, in IR² can separate the two classes for an appropriate coefficient vector y. The expansion in monomials can be interpreted as a linear vector space with basis {1, x1, x2, x1x2, x1², x2²} and hence has dimension 6.

9.1 The results of one random training set used in Table 9.1, displayed graphically. The first two rows correspond to SVMlight with default arguments and c = 2500, respectively. The last row corresponds to the algebraic perceptron. Columns from left to right are for 1000, 2000 and 4800 training points. Light and dark grey are misclassified back- and foreground pixels, respectively; correctly classified object pixels are black.

9.2 The problem of channel equalization. A binary source symbol s(t) is transmitted through a noise-free channel H(z), yielding y(t). Gaussian noise e(t) is added to simulate a distorted and noisy transmission, which has to be equalized to find estimates ŝ(t) as close as possible to the original symbol s(t − τ), with a possible delay τ.

9.3 400 random training points (blue) with zero-mean Gaussian distribution and variance σ² around the 16 noise-free transmitted states [y(t) y(t−1)] (red). The optimal Bayesian decision boundary (blue) is achieved by placing Gaussian kernels centered at the blue training points with kernel radius σ; diamond and cross shaped points correspond to −1 and +1 labelled points.


9.4 Algebraic perceptron separation via a polynomial kernel of degree m = 15. Convergence is achieved after 80083 iterations and the algebraic perceptron separation is given by the black line. Critical points, which are points classified correctly in less than 60% of all iterations, are colored magenta.

9.5 Histogram of correct classification during the algebraic perceptron algorithm. It is assumed that critical points have a frequency of around 0.5 times the number of iterations, or even less. Low frequency numbers can also result if the algorithm converges in only a few hundred iterations, preventing more accurate statistics from being gathered.

9.6 Removing the critical points achieves a much better separation in a much shorter time. The solution shown here is averaged over 10 runs of the algebraic perceptron with different initializations z0. The number of iterations drops significantly, by a factor of a couple of thousand. To achieve better generalization, a lower degree kernel could now be applied to the training set with some critical points removed.

9.7 Algebraic perceptron solution via a polynomial kernel of degree m = 10. No convergence is achieved and the algorithm is stopped after 100000 iterations. Nevertheless, the corresponding histogram given in figure 9.8 has a similar form as before. Critical points, which are points classified correctly in less than 60% of all iterations, are colored magenta.

9.8 Averaged histogram of correct classification during the algebraic perceptron algorithm for m = 10. Even though no convergence of the algebraic perceptron algorithm could be achieved, the histogram of correctly classified points looks similar to before and can be used to determine the critical points in figure 9.7.

9.9 The algebraic perceptron solution for m = 20 is similar to that for m = 15; however, convergence is achieved four times as fast. Critical points are similar to those with m = 15. Therefore, using a higher kernel order allows the histogram for critical points to be determined faster. After removal of the critical points, a lower order kernel should be used to avoid overfitting and bad generalization; see figure 9.11.

9.10 Histogram of correctly classified points for m = 20.

9.11 Algebraic perceptron solution for m = 20 with some critical points removed. Overfitting now occurs, as there are no longer any central data points, which were identified as critical points. Therefore, after having identified critical points it would make sense to lower the kernel degree m.

9.12 Histogram of correctly classified points for m = 20 with some critical points removed.

11.1 Individual ellipses which compose the original image (row 2, column 2). The last image shows the resampled picture.


11.2 Decomposition achieved by the RGAP algorithm; it finds four instead of the original three ellipses. The fourth ellipse is shown against the two closest original ellipses (rows 2 and 3, columns 2 and 1, respectively). A fourth ellipse is found due to the undersampling. Light and dark gray pixels are wrong fore- and background, respectively.

11.3 With different initialization seeds an almost perfect retrieval of the original ellipses could be achieved. Light and dark gray pixels are wrong fore- and background, respectively.

11.4 a) Shark-fin-like shape, modelled from ellipses and intersections. One problem is obvious: the less bent an edge is, the larger the corresponding ellipse. b) Another foreground object, i.e. a (−1)-object, which would intersect the background ellipses, would prevent the algorithm from correctly identifying the background ellipses and possibly stop their growth.


List of Tables

5.1 Comparison between the Euler and Conventional Training

9.1 Results for the 'Scissors' test set for a polynomial kernel of degree 25, averaged over 10 randomly resampled training sets of sizes 1000 and 2000 points, and the full data set of 4800 points. For the last column the training set for the algebraic perceptron was split into 6 chunks of 800 training points each. On each of the chunks the algorithm was run and the selected support vectors were merged; then a final run of the algebraic perceptron on all the selected support vectors achieved the results presented. 'Best values' are bold.


Preface

Background

The wider general field of study involves many current areas and topics related to the brain, mind and consciousness. These are mainly Neuroscience (dealing with the structure, development, chemistry, function, pharmacology and pathology of the nervous system); Cognitive science (the study of the precise nature of different mental tasks and the operations of the brain that enable these tasks to be performed, including branches of psychology and connectionism, which model human thought and behavior in terms of parallel distributed networks of neuron-like units, with learning mediated by changes in the strength of the connections between these elements [4]); Artificial Intelligence (traditional rule-based expert systems and fuzzy systems); Neurocontrol (defined by Werbos in [5] as the use of well-specified neural networks, artificial or natural, to emit actual control signals, which can be seen as an extension to conventional control theory); Computer science (as the tool for any artificial implementations in general, and for expert systems like linguistic, medical or vision systems specifically); Biology (behaviorism and models of biological neurons and networks); Physics (especially thermo-statistical mechanics, nonlinear dynamical systems and chaos theory); and Mathematics (logic systems, information and coding theory).

Since all these fields, or at least parts thereof, concern the brain, mind and consciousness, it is useful to define some more specific terms. Brain Theory comprises many different theories about how the structures of the brain can perform such diverse functions as perception, memory, control of movement, and higher mental function. As such, it includes both attempts to extend notions of computing and applications of modern electronic computers to explore the performance of complex models. An example of the former is the study of cooperative computation between different structures in the brain as a new paradigm for computing that transcends classical notions associated with serial execution of symbolic programs. For the latter, computational neuroscience makes systematic use of mathematical analysis and computer simulation to model the structure and function of living brains, building on earlier work in both neural modelling and biological control theory [4].

Brain theory and neural computation address the analysis and design of modular, often hierarchical learning systems. More exactly, brain theory seeks to enhance the understanding of human thought and the neural basis of human and animal behavior. It requires empirical data to shape and constrain modelling and in return provides concepts and hypotheses to shape and constrain experimentation, whereas neural computation develops new strategies for building “intelligent” machines or adaptive robots.

The brain can be seen as a neurocontroller in the most general sense. On the highest level it does some decision making based on its perception of a very noisy environment and on experience acquired in the past. It tries to make these decisions such that the future interactions resulting from the decisions made and the expected development of the environment are best tuned. In this very rough model of the brain, there are at least three basic adaptive subsystems involved [6]:

• An Action- or Motor system, outputting the control signals to the environment.

• An “Emotional”, “Evaluation” or “Critic” system, used to assess long-term costs of decisions.

• An “Expectations” or “System Identification” component, modelling the perceived environment.

But this can also be seen as an optimal control scheme modelled by the Bellman equation and implemented as an approximation by the adaptive critic design (ACD) family. An excellent treatment of ACDs was done by Prokhorov in his Ph.D. thesis [2]. Another approach to the same field is often undertaken from a stochastic decision making process viewpoint. The action system is often called an (embedded) agent. This approach is used in many disciplines like AI, robotics, systems and control engineering and theoretical computer science. The formal framework is a Markov decision process with the concepts of reinforcement learning or dynamic programming, which is treated in the book by Puterman [7]. An excellent work is also the Ph.D. thesis by Satinder P. Singh [8]. Another good source is the book by Sutton and Barto [9], and for a survey of reinforcement learning see [10]. The environment is often described by stochastic state transitions and their probability distribution, but the approach can also be model free. Many researchers have published a lot within this framework and there are many algorithms that converge to an optimal solution or close to an optimal solution [11, 12, 13]. However, these algorithms work on a finite set of states. If a continuous state space is used, convergence proofs are few and exist only for special cases like linear quadratic regulators [14, 15].

In this thesis the first notation was used, based on the adaptive critic framework. While work with the adaptive critic designs was the original baseline for an intelligent brain-like system, the work with the Calculus of Variations resulted in solving a two point boundary value problem, known to be difficult. These problems, compounded by the training of neural networks within the adaptive critic framework, caused me great difficulty; I got stuck and was desperately looking for help. I found help in the department of mathematics from Lyle Noakes. Lyle was very helpful and had an algorithm, called the leap-frog algorithm, that solves a distant two point boundary value problem and is proven to converge almost always [16, 17]. However, this algorithm needs fixed end points and therefore cannot be applied simply to the adaptive critic design. But as life and thesis development are of a non-linear nature, Lyle asked me whether I knew about the perceptron algorithm, because I was doing work with neural networks. He mentioned that he had extended it to work with polynomial as well as linear separations. From the very beginning I had a hunch that this was related to support vector machines, which at the time I did not understand in detail. As I was in search of fast and especially accurate neural networks for the adaptive critics, I was keen to read about SVMs and the perceptron algorithm. This was the beginning of part II in this thesis. Besides this, Andrey Savkin gave me an interesting problem of a hybrid dynamical system (HDS), where a linear but uncertain plant was controlled by switching between linear feedback controllers. He had worked out the existence of a switching sequence for this problem via dynamic programming. While an approximate and fast solution could be found on the basis of Q-learning and piecewise quadratic functions, the feeling was that concentrating on the switching boundary might be more promising. Eventually, this problem could be transformed into a classification problem and make use of some SVM-based classifier. This gave me another incentive to work on the algebraic perceptron. At this stage, I had to stop work on it, as there was no time to do further investigation into the HDS system.

Contributions

In part I investigation of the Euler equations was conducted in the hope of yielding addi-tional training equations to speed up convergence in an adaptive critic design. Althoughthe outcome was not too pleasing, the insight achieved was still worth reporting.A second approach based on a continuous formulation for the ACD was more successful.There, second order steepest descent adaptation based on Newton’s method for zero searchwas applied to a linear quadratic regulator (LQR) system, but the equations also hold forany non-linear system. This approach fits very well into the adaptive critic designs and ismost suitable in conjunction with Global Dual Heuristic Programming (GDHP), the mostadvanced ACD. Also, an almost concurrent actor-critic adaptation has been proposed tokeep Bellman’s optimality principle satisfied when changing the underlying control policydue to actor training.Another contribution in the ACD framework, was in the field of robust control for a hybriddynamical system. A linear plant with uncertain states and observation and a quadraticperformance index was controlled by switching between linear feedback controllers. Basedon a proof of the existence of a dynamic programming solution for a switching sequence,a piecewise quadratic approximation of the action-value function (Q-function) was madeand demonstrated on an example to be stable.In part II a fast neural network for classification was investigated and developed. It is anextension to the perceptron algorithm and related to support vector machines and is calledthe algebraic perceptron. It can be used as a fast ‘preprocessor’ to identify significant data

Page 22: Approximate Dynamic Programming with Adaptive …...of adaptive critic designs (ACDs). Dynamic programming can be used to flnd an optimal decision or control policy over a long-term

PREFACE THESIS OUTLINE 4

(support vectors), and for data reduction. An extension, the voting algebraic perceptron, can to some extent also handle noisy, non-separable data. Furthermore, a framework has been outlined where the basic algebraic perceptron algorithm can be used to decompose complicated figures into simpler ones.

Thesis Outline

Part I is dedicated to approximate dynamic programming within the adaptive critic framework. Chapter 1 gives a short introduction to the calculus of variations and dynamic programming. Approximate value and policy iteration, the basic operations to train networks in the adaptive critic design, are outlined. In chapter 2 the theory of the calculus of variations for optimal control is established, giving some theoretical background. Chapter 3 treats the basic forms of adaptive critics and introduces a first simple approach to extend training with the Euler equations from the calculus of variations. Also, some notes and ideas about consistency and regularization between a J-critic and λ-critic are given. Chapter 4 makes a connection between the continuous and discrete case and investigates two algorithms, backpropagation through time (BPTT) and realtime recurrent learning (RTRL). Some interpretations of targets to be used in training neural networks are made. These algorithms are based on calculating total derivatives and are connected to the training of simultaneous recurrent networks, which have the same objective function as discrete dynamic programming. Chapter 4 concludes with the derivation of total ordered derivatives for the continuous case and introduces the second order adaptation for actor training. Chapter 5 applies the ideas developed in chapters 3 and 4 to a simple LQR system. Chapter 6 is dedicated to the hybrid dynamical system of a continuous linear plant with uncertain state and noisy observation and with quadratic performance index. Linear feedback control is applied by switching between linear feedback controllers. The switching sequence results from a dynamic programming solution, whose existence was proven by Andrey Savkin. Here, the emphasis is on achieving an approximate dynamic programming solution efficiently.

Part II treats support vector machines and the algebraic perceptron as a related class of neural network. Chapter 7 gives an introduction and some background material on statistical learning theory, which is the theoretical foundation for support vector machines. Chapter 8 introduces the basic SVM algorithms and some theoretical issues about lifting. Chapter 9 introduces the algebraic perceptron in its iterative and recurrent form. It also gives an example of its potential by comparing it with the standard SVM algorithm. Chapter 10 discusses the optimization of the non-optimal solution achieved by the standard algebraic perceptron. Chapter 11 introduces a decomposition algorithm based on the algebraic perceptron. Finally, chapter 12 concludes this work and gives an outlook for possible future work.


Part I

Approximation of Dynamic Programming via Adaptive Critic Designs


Chapter 1

Introduction

The primary family of Adaptive Critic Designs (ACDs) consists of the designs called HDP (Heuristic Dynamic Programming), DHP (Dual Heuristic Programming) and GDHP (Global Dual Heuristic Programming), which approximate dynamic programming by approximating the cost-to-go function of dynamic programming (DP), its derivative, or a combination thereof. In the framework of ACDs, the part that estimates these quantities is called a critic. Also, there exist action-dependent versions in the family of ACDs, where the control u(t) is also an input to the critic, e.g. ADHDP.

GDHP is a combination of HDP and DHP and gives an estimate of the cost-to-go function J(x(t);w) and its derivative \partial J(x(t);w)/\partial x(t), where w is a parameter vector. The difficult part of GDHP is the adaptation of the weight vector w in such a way that the estimated derivative is integrable and its integral is consistent with the estimated cost-to-go function.

The goal of dynamic programming (DP) and its approximations is to try to calculate the optimal cost-to-go function. A typical problem would be to go from a state x(t_0) at time t_0 to a final state x(t_f) at time t_f, such that the relevant cost is minimal. A natural measure for the cost would be the sum of immediate costs in a discrete system, or the integral of some immediate cost-density in the continuous case.

This is essentially the topic of the Calculus of Variations (COV). Therefore, the next section will state the problem from this perspective. The notation follows Vincent and Grantham [18], so the function x(t) in the next section defines a continuous and piecewise differentiable function that corresponds to the state trajectory x(t) in the ACD. A more extensive mathematical treatment can be found in [19] and its extension [20]. The cost-to-go function J = J[x(\cdot)] = \int_{t_0}^{t_f} \phi(x,\dot{x},t)\,dt is the integral of some immediate cost-density \phi, which at a time t depends on the state x and its derivative \dot{x}.

However, from a conceptual control-theory viewpoint, there is a difference between DP-based approaches and COV methods. The former are based on the Bellman optimality


principle and hold also when random disturbances are involved, whereas COV methods are based on deterministic trajectory evaluations.

1.1 Calculus of Variations

The calculus of variations is concerned with finding a function of time, such as a trajectory path, to minimize (or maximize) a path-dependent cumulative cost integral. This cost integral is a function of a function, which is called a functional, defined by (1.1)¹

J[x(\cdot)] = \int_{t_0}^{t_f} \phi(x,\dot{x},t)\,dt \qquad (1.1)

with x(t_0) = x_0 and x(t_f) = x_f specified and fixed, and \phi(\cdot) a specified continuous scalar-valued function \in C^{(2)}.

Suppose x^*(t) extremizes J[x(\cdot)]. Let \eta(t) be a continuous and differentiable function on [t_0, t_f] satisfying \eta(t_0) = \eta(t_f) = 0, and for an arbitrary small \alpha let x(t) = x^*(t) + \alpha\eta(t) denote a variation in x^*(t). The difference \Delta J of the functional of a variation x(\cdot) of x^*(\cdot) and the functional of x^*(t) is denoted by \Delta J(\alpha) := J[x^*(\cdot)+\alpha\eta(\cdot)] - J[x^*(\cdot)]. Developing \Delta J(\alpha) in a Taylor series at x^*(t) and taking the limit of \Delta J(\alpha)/\alpha yields the first variation,

\delta J := \lim_{\alpha\to 0} \frac{\Delta J(\alpha)}{\alpha} = \int_{t_0}^{t_f} \left( \left[\frac{\partial\phi}{\partial x}\right]^T \eta + \left[\frac{\partial\phi}{\partial\dot{x}}\right]^T \dot{\eta} \right) dt
= \left[\frac{\partial\phi}{\partial\dot{x}}\right]^T \eta \Big|_{t_0}^{t_f} + \int_{t_0}^{t_f} \left( \left[\frac{\partial\phi}{\partial x}\right]^T - \frac{d}{dt}\left[\frac{\partial\phi}{\partial\dot{x}}\right]^T \right) \eta\, dt
= \int_{t_0}^{t_f} \left( \left[\frac{\partial\phi}{\partial x}\right]^T - \frac{d}{dt}\left[\frac{\partial\phi}{\partial\dot{x}}\right]^T \right) \eta\, dt, \quad \text{since } \eta(t_0)=\eta(t_f)=0.

As \delta J \overset{!}{=} 0 for an extremum, this implies that x^* with fixed endpoints x^*(t_0) and x^*(t_f) has to satisfy the Euler equations, defined by (1.2) and (1.3).

\frac{d}{dt}\left[\frac{\partial\phi}{\partial\dot{x}}\right] - \frac{\partial\phi}{\partial x} = 0, \quad \text{or more explicitly} \qquad (1.2)

\frac{\partial^2\phi}{\partial\dot{x}\,\partial t} + \frac{\partial^2\phi}{\partial\dot{x}\,\partial x}\,\dot{x} + \frac{\partial^2\phi}{\partial\dot{x}^2}\,\ddot{x} - \frac{\partial\phi}{\partial x} = 0 \qquad (1.3)

This is due to the fundamental lemma of the calculus of variations, which states that if \beta(t) is continuous on [t_0, t_f] and satisfies \int_{t_0}^{t_f} \beta^T(t)\eta(t)\,dt = 0 for all \eta(t) that are continuous on [t_0, t_f] and satisfy \eta(t_0) = \eta(t_f) = 0, then \beta(t) \equiv 0 on [t_0, t_f].

¹This is only a special case of a COV method with fixed start and endpoint; more elaborate versions are looked at in chapter 2. Also, the case of a fixed terminal state is more difficult for an ACD, but at this stage it is important to get a simple introduction to the COV theory.

If a function x(t) satisfies (1.2) it is called an extremal for the corresponding calculus of variations problem (1.1). However, the solution may not satisfy the boundary conditions on x(t) or may not minimize the functional, as it could also maximize it, or yield an inflection or a saddle point. Worst of all, it could also be a local minimum (or maximum). Werbos emphasizes the need for copying mechanisms at all levels of system design, as a kind of backtracking, to avoid getting stuck in local minima within a general non-convex cost functional. But on the other hand, the optimizing function x^*(t) must belong to the family satisfying the Euler equations, which greatly limits the search for the optimal x^*(t).
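As a brief illustration, added here for clarity (the specific integrand is an assumption, not an example from the thesis), the Euler equation can be evaluated for the simplest non-trivial scalar integrand:

```latex
% Minimal worked example (illustration only): take \phi(x,\dot{x}) = \dot{x}^2
% on [t_0, t_f] with fixed endpoints x(t_0)=x_0, x(t_f)=x_f.
\frac{\partial\phi}{\partial x} = 0, \qquad
\frac{\partial\phi}{\partial\dot{x}} = 2\dot{x}
\quad\Longrightarrow\quad
\frac{d}{dt}\bigl(2\dot{x}\bigr) - 0 = 0
\;\Longleftrightarrow\;
\ddot{x} = 0 .
% The extremals are straight lines x(t) = a + b t; the two constants are fixed by
% the boundary conditions, and here the extremal is indeed the minimizer of J.
```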

When the function \phi defined by (1.4)

\phi(x(t),\dot{x}(t),t) = \phi(x(t),\dot{x}(t)) \qquad (1.4)

does not depend directly on the independent variable t, the Euler equations have a first integral defined by (1.5).

\dot{x}^T(t)\,\frac{\partial\phi(x(t),\dot{x}(t))}{\partial\dot{x}(t)} - \phi(x(t),\dot{x}(t)) = \text{const.} \qquad (1.5)

1.2 Dynamic Programming

Following [21] the cost-to-go, J, for a system in state x(t) at time t is defined by (1.6), with the remaining time t_r (1.7) and the optimal cost-to-go (1.8).

J(x(t),u(t),t_r) = S(x(t_f),t_f) + \int_{t}^{t_f} L(x(\tau),u(\tau),\tau)\,d\tau, \quad \text{with} \qquad (1.6)

t_r = t_f - t, \qquad (1.7)

J^{opt}(x(t),t_r) = \min_{u(t)} J(x(t),u(t),t_r) \qquad (1.8)

Breaking up the integral in (1.6) into two segments [t, t+\delta t] and [t+\delta t, t_f] and using the principle of optimality leads to (1.9) and (1.10).

J^{opt}(x(t),t_r) = \min_{u(t)} \left\{ S(x(t_f),t_f) + \int_{t+\delta t}^{t_f} L(x(\tau),u(\tau),\tau)\,d\tau + \int_{t}^{t+\delta t} L(x(\tau),u(\tau),\tau)\,d\tau \right\} \qquad (1.9)

= \min_{u(t)} \left\{ J^{opt}(x(t+\delta t),t_r-\delta t) + \int_{t}^{t+\delta t} L(x(\tau),u(\tau),\tau)\,d\tau \right\} \qquad (1.10)


Approximations for sufficiently small \delta t are defined by (1.11) to (1.13).

x(t+\delta t) \doteq x(t) + \dot{x}\,\delta t \qquad (1.11)

\int_{t}^{t+\delta t} L(x(\tau),u(\tau),\tau)\,d\tau \doteq L(x(t),u(t),t)\,\delta t \qquad (1.12)

J^{opt}(x(t+\delta t),t_r-\delta t) \doteq J^{opt}(x(t),t_r) + \left[\frac{\partial J^{opt}}{\partial x(t)}\right]^T \dot{x}\,\delta t - \frac{\partial J^{opt}}{\partial t_r}\,\delta t \qquad (1.13)

This can be summarized by (1.14)

J^{opt}(x(t),t_r) \doteq \min_{u(t)} \left\{ J^{opt}(x(t),t_r) + \left[\frac{\partial J^{opt}}{\partial x(t)}\right]^T \dot{x}\,\delta t - \frac{\partial J^{opt}}{\partial t_r}\,\delta t + L(x(t),u(t),t)\,\delta t \right\} \qquad (1.14)

The optimal cost-to-go function J^{opt}(x(t),t_r) does not depend on u(t); therefore solving for \partial J^{opt}/\partial t_r yields the so-called Hamilton-Jacobi-Bellman partial differential equation (HJB, for short) (1.15),

\frac{\partial J^{opt}}{\partial t_r} = \min_{u(t)} \left\{ L(x(t),u(t),t) + \left[\frac{\partial J^{opt}}{\partial x(t)}\right]^T \dot{x} \right\} \qquad (1.15)

with the boundary condition

J^{opt}(x(t),0) = S(x(t_f),t_f). \qquad (1.16)
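As a small worked illustration added here (the scalar plant and quadratic cost are assumptions, not taken from the thesis), the stationary form of the HJB equation can be solved in closed form:

```latex
% Hedged illustration: scalar linear plant \dot{x} = a x + b u with cost density
% L = q x^2 + r u^2, r > 0, and a stationary cost-to-go (\partial J^{opt}/\partial t_r = 0).
% Trying J^{opt}(x) = p\,x^2 in (1.15) gives
0 = \min_u \bigl\{ q x^2 + r u^2 + 2 p x\,(a x + b u) \bigr\},
\qquad u^{*} = -\frac{b\,p}{r}\,x ,
% and substituting u^* back turns the HJB equation into the scalar algebraic
% Riccati equation
q + 2 a p - \frac{b^2 p^2}{r} = 0 ,
% whose positive root p gives the optimal cost-to-go and the linear feedback law.
```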

1.2.1 Hamiltonian Notation for Dynamic Programming

Defining a function, the so-called Hamiltonian H, as

H(x(t),u(t),J^{opt}_x,t) := L(x(t),u(t),t) + J^{opt\,T}_x(x(t),t)\, f(x(t),u(t)), \qquad (1.17)

with

H(x(t),u^{opt}(t),J^{opt}_x,t) = \min_{u(t)\in A} H(x(t),u(t),J^{opt}_x,t), \qquad (1.18)

and where

J^{opt}_x := \frac{\partial J^{opt}(x(t),t)}{\partial x(t)}, \qquad (1.19)

J^{opt}_t := \frac{\partial J^{opt}(x(t),t)}{\partial t}, \qquad (1.20)

and using the 'forward' time t = t_f - t_r, the Hamilton-Jacobi-Bellman equation can be written as

0 = J^{opt}_t + H(x(t),u^{opt}(t),J^{opt}_x,t) \qquad (1.21)

with boundary condition

J^{opt}(x(t_f),t_f) = S(x(t_f),t_f). \qquad (1.22)


1.2.2 Approximate Value Iteration

There are two basic methods to establish the cost-to-go function J(.) given by (1.6) for a particular fixed u(t), or the optimal cost-to-go function J^{opt} given by (1.8): value iteration and policy iteration. A third way is to solve the HJB partial differential equation (1.15) directly, but this may often be difficult. From a practical viewpoint, one is interested in stationary cost-to-go functions which are not explicitly time-dependent. In practice, one might use difference equations instead of differential equations, say x(t+1) = F(x(t),u(t)), and instead of a cost density L(.) one would have a one-step cost L_n(.) = \int_{t}^{t+\Delta t} L(.)\,dt', such that the cost-to-go function has the form J(.) = \sum_{n=0}^{N} L_n(.), with N \to \infty for stationary problems.

Consider the approximate value iteration (1.23) subject to (1.24), the system constraints.

J^{opt}_{k+1}(x(t)) = \min_{u(t)\in A} \left\{ \int_{t}^{t+\Delta t} L(x(t'),u(t'))\,dt' + J^{opt}_{k}(x(t+\Delta t)) \right\}, \qquad (1.23)

\dot{x}(t) = f(x(t),u(t)). \qquad (1.24)

The variable k denotes an iteration index and should not be confused with the time variable t. For the start condition one can assume a particular, suitable functional form of J_0(x(t)), or simply set it to zero, J_0(x(t)) := 0.
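A minimal sketch of what (1.23)-(1.24) look like in practice is given below. It assumes a discretized scalar plant, a quadratic one-step cost and a table-lookup critic on a state grid; the plant, cost and grids are illustrative assumptions only, not taken from the thesis.

```python
import numpy as np

# Approximate value iteration (1.23)-(1.24) on an assumed discretized scalar example.
dt = 0.1
xs = np.linspace(-2.0, 2.0, 81)                     # state grid
us = np.linspace(-1.0, 1.0, 21)                     # admissible control set A
f = lambda x, u: -0.5 * x + u                       # assumed system dynamics f(x, u)
L = lambda x, u: (x**2 + 0.1 * u**2) * dt           # one-step cost over dt

J = np.zeros_like(xs)                               # start condition J_0 := 0
for k in range(200):                                # value-iteration sweeps
    X, U = np.meshgrid(xs, us, indexing="ij")
    Xn = np.clip(X + f(X, U) * dt, xs[0], xs[-1])   # x(t + dt)
    Jn = np.interp(Xn.ravel(), xs, J).reshape(Xn.shape)  # J_k(x(t + dt)) by interpolation
    J_new = np.min(L(X, U) + Jn, axis=1)            # minimize over u in A
    if np.max(np.abs(J_new - J)) < 1e-6:
        break
    J = J_new
```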

1.2.3 Approximate Policy Iteration

In policy iteration one assumes an initial policy, say u_0(t), and performs a policy evaluation (iterating over m until convergence, m = M), followed by a policy improvement step, which together form a policy iteration, k → k+1. The policy evaluation is defined by (1.25) and the policy improvement step is defined by (1.26).

J^{opt}_{m+1}(x(t)) = \int_{t}^{t+\Delta t} L(x(t'),u_k(t'))\,dt' + J^{opt}_{m}(x(t+\Delta t)), \qquad \text{(policy evaluation)} \quad (1.25)

u_{k+1}(t) = \arg\min_{u(t)\in A} \left\{ \int_{t}^{t+\Delta t} L(x(t'),u(t'))\,dt' + J^{opt}_{M}(x(t+\Delta t)) \right\} \qquad \text{(policy improvement)} \quad (1.26)

If u_{k+1}(t) = u_k(t) the iteration is stopped.
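For comparison, a minimal sketch of approximate policy iteration (1.25)-(1.26), using the same kind of assumed discretized scalar plant and quadratic cost as the value-iteration sketch above, is:

```python
import numpy as np

# Approximate policy iteration (1.25)-(1.26) on an assumed discretized scalar example.
dt = 0.1
xs = np.linspace(-2.0, 2.0, 81)
us = np.linspace(-1.0, 1.0, 21)
f = lambda x, u: -0.5 * x + u
L = lambda x, u: (x**2 + 0.1 * u**2) * dt
step = lambda x, u: np.clip(x + f(x, u) * dt, xs[0], xs[-1])

u_idx = np.full(xs.shape, us.size // 2, dtype=int)        # initial policy u_0(x) ~ 0
for k in range(50):                                        # policy iterations k -> k+1
    J = np.zeros_like(xs)
    for m in range(1000):                                  # policy evaluation (1.25)
        J_new = L(xs, us[u_idx]) + np.interp(step(xs, us[u_idx]), xs, J)
        if np.max(np.abs(J_new - J)) < 1e-8:
            break
        J = J_new
    X, U = np.meshgrid(xs, us, indexing="ij")              # policy improvement (1.26)
    Q = L(X, U) + np.interp(step(X, U).ravel(), xs, J).reshape(X.shape)
    u_idx_new = np.argmin(Q, axis=1)
    if np.array_equal(u_idx_new, u_idx):                   # u_{k+1} = u_k: stop
        break
    u_idx = u_idx_new
```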

1.2.4 Approximate Value versus Policy Iteration

While approximate value iteration guesses an initial approximation for J and performs policy minimization according to (1.23), it is much harder to guess a good first estimate for a 'real-world' cost-to-go function J_0 than an initial guess of a stabilizing policy u_0 as in policy iteration. Value iteration generally uses an infinite number of iterations, whereas policy iteration converges in a finite number of iterations, under fairly general assumptions, at least in a stochastic shortest path setup. One assumption is the existence of at least


one proper policy, meaning that a positive probability exists to reach the termination state after at most n stages, regardless of the initial state. Another is that for every improper policy the corresponding J is infinite for at least one state [11]. Value iteration has another disadvantage: it has to be able to accurately represent intermediate cost-to-go functions.

Later, it is seen that the so-called Q-learning, a form of HDP which approximates the value function, can lead to instability due to approximation errors, a result that has also been highlighted by Boyan and Moore [22]. Sutton [23] uses a special, quite powerful CMAC network, a somewhat sparse and coarse-coded 'lookup-table' (local, like RBF) network, and the SARSA training algorithm, which circumvents solving the dynamic programming problem by approximating value functions as accurately as possible, but implicitly pays attention to the 'correct' policy, rather than the 'exact' cost-to-go function.
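For reference, a minimal sketch of the on-policy SARSA update mentioned above is given below. It assumes a tabular critic Q[s, a] indexed by integer states and actions; the environment object `env` with reset()/step() methods is a placeholder assumption, not a real API.

```python
import numpy as np

# One SARSA training episode with a tabular action-value critic Q (assumed setup).
def sarsa_episode(env, Q, policy, alpha=0.1, gamma=0.95):
    s = env.reset()
    a = policy(Q, s)
    done = False
    while not done:
        s_next, cost, done = env.step(a)           # one-step cost (negative reward)
        a_next = policy(Q, s_next)
        # on-policy temporal-difference target: cost + gamma * Q(s', a')
        td_error = cost + gamma * Q[s_next, a_next] - Q[s, a]
        Q[s, a] += alpha * td_error
        s, a = s_next, a_next
    return Q
```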


Chapter 2

Calculus of Variations for Optimal Control

In section 1.1 an introduction to the Calculus of Variations was given, based on [18]. However, a much more elaborate treatment of the Calculus of Variations and its application to Optimal Control is made in Kirk's excellent book [24]. The following is a summary of the calculus of variations for optimal control, following Kirk's book but modified appropriately to suit the purpose of the later chapters.

2.1 Problem Statement for Uncontrolled Functionals

Uncontrolled shall mean that there is no control and the long-term cost-to-go functional J is seen as a function of the state x. Both the final time t_f and the final state x(t_f) are considered free, whereas the start time t_0 and the start state x(t_0) = x_0 are given. Up to section 2.3 it is assumed that the state components are independent. The cost functional has the form defined by (2.1).

J(x) = \int_{t_0}^{t_f} \phi(x(t),\dot{x}(t),t)\,dt \qquad (2.1)


Assuming two neighboring trajectories, x^*(t) and x(t) = x^*(t) + \delta x(t), the difference in their functionals is given by (2.2) and its first-order approximations (2.3) and (2.4).

\Delta J = \int_{t_0}^{t_f+\delta t_f} \phi(x(t),\dot{x}(t),t)\,dt - \int_{t_0}^{t_f} \phi(x^*(t),\dot{x}^*(t),t)\,dt \qquad (2.2)

\doteq \int_{t_0}^{t_f} \left\{ \delta x^T(t)\,\frac{\partial\phi(x^*(t),\dot{x}^*(t),t)}{\partial x} + \delta\dot{x}^T(t)\,\frac{\partial\phi(x^*(t),\dot{x}^*(t),t)}{\partial\dot{x}} \right\} dt + \int_{t_f}^{t_f+\delta t_f} \phi(x(t),\dot{x}(t),t)\,dt \qquad (2.3)

\doteq \delta x^T(t_f)\,\frac{\partial\phi(x^*(t_f),\dot{x}^*(t_f),t_f)}{\partial\dot{x}} + \phi(x^*(t_f),\dot{x}^*(t_f),t_f)\,\delta t_f + \int_{t_0}^{t_f} \delta x^T(t) \left\{ \frac{\partial\phi(x^*(t),\dot{x}^*(t),t)}{\partial x} - \frac{d}{dt}\left[\frac{\partial\phi(x^*(t),\dot{x}^*(t),t)}{\partial\dot{x}}\right] \right\} dt \qquad (2.4)

These equations are obtained by expanding x(t) into a Taylor series, integrating by parts using the fact that \delta x(t_0) = 0, and approximating the "short time" integral by its integrand times \delta t_f, as well as expanding the integrand into a Taylor series. The symbol \doteq means equal up to first-order terms. Therefore, the right-hand side is \delta J.

To apply the fundamental lemma of the calculus of variations, one has to eliminate all dependent variables, i.e. one has to express the functional solely in terms of independent variables. This can be expressed by having only general variations.

The general variation \delta_G J of the functional J is made up of the sum of all the variations in the independent (free) parameters. Here, \delta_G J is the sum of the variation \delta_x J resulting from the difference \delta_x x(t) in the interval [t_0, t_f] and the difference \delta_t J in the endpoints of two "neighboring" functions, say x^*(t) and x(t). The subscript in the \delta-symbol indicates from which parameter the respective part of the variation stems. The variation \delta x(t) = x(t) - x^*(t) has meaning only in the interval [t_0, t_f], since x^*(t) is not defined for t \in (t_f, t_f+\delta t_f].

The variation \delta x(t_f) is neither zero nor free but depends on \delta t_f. Taking a general variation \delta x_f := \delta_G x(t_f) = \delta_x x(t_f) + \delta_t x(t_f) as the sum of the variations stemming from the variation solely in x(t_f) and a variation in t_f, one of these variations can be eliminated by setting \delta_G x(t_f) = 0. Setting \delta_G x(t_f) = 0 implies \delta_G J \doteq 0, where J now depends on x(t_f) as there is only a general variation in x(t_f). But \delta_G J \doteq 0 implies that x(t_f) \doteq x^*(t_f), and therefore \dot{x}(t_f) \doteq \dot{x}^*(t_f). The Taylor expansion of \delta_t x(t_f), using the previous result, yields \delta_t x(t_f) \doteq \dot{x}(t_f)\,\delta t_f \doteq \dot{x}^*(t_f)\,\delta t_f. Together these equations can easily be solved for \delta x(t_f), as defined by (2.5).

\delta x(t_f) \doteq \delta x_f - \dot{x}^*(t_f)\,\delta t_f \qquad (2.5)


The variation in J can be formulated by (2.6).

\delta J(x^*,\delta x) = \delta x_f^T\,\frac{\partial\phi(x^*(t_f),\dot{x}^*(t_f),t_f)}{\partial\dot{x}} + \left[\phi(x^*(t_f),\dot{x}^*(t_f),t_f) - \dot{x}^{*T}(t_f)\,\frac{\partial\phi(x^*(t_f),\dot{x}^*(t_f),t_f)}{\partial\dot{x}}\right]\delta t_f + \int_{t_0}^{t_f} \delta x^T(t)\left\{ \frac{\partial\phi(x^*(t),\dot{x}^*(t),t)}{\partial x} - \frac{d}{dt}\left[\frac{\partial\phi(x^*(t),\dot{x}^*(t),t)}{\partial\dot{x}}\right] \right\} dt \qquad (2.6)

Consider the two cases:

• t_f and x(t_f) are unrelated ⇒ \delta t_f and \delta x_f are independent of one another and arbitrary, so their coefficients must each be zero, which implies (2.7).

\frac{\partial\phi(x^*(t_f),\dot{x}^*(t_f),t_f)}{\partial\dot{x}} = 0 \quad\text{and}\quad \phi(x^*(t_f),\dot{x}^*(t_f),t_f) = 0 \qquad (2.7)

• t_f and x(t_f) are related, let's say by x(t_f) = \theta(t_f); then \delta x_f \doteq \frac{d\theta(t_f)}{dt}\,\delta t_f. Collecting terms and using the same argumentation, but only for an arbitrary \delta t_f, yields the so-called transversality condition (2.8).

\left[\frac{\partial\phi(x^*(t_f),\dot{x}^*(t_f),t_f)}{\partial\dot{x}}\right]^T \left[\frac{d\theta(t_f)}{dt} - \dot{x}^*(t_f)\right] + \phi(x^*(t_f),\dot{x}^*(t_f),t_f) = 0 \qquad (2.8)

2.2 Euler Equations and Boundary Conditions

In summary, for an extremal x^* of the functional J the first variation \delta J has to be zero, which is equivalent to the requirement that the Euler equations (2.9) and the boundary conditions (2.10) be satisfied.

\frac{\partial\phi(x^*(t),\dot{x}^*(t),t)}{\partial x} - \frac{d}{dt}\left[\frac{\partial\phi(x^*(t),\dot{x}^*(t),t)}{\partial\dot{x}}\right] = 0 \qquad (2.9)

\delta x_f^T\,\frac{\partial\phi(x^*(t_f),\dot{x}^*(t_f),t_f)}{\partial\dot{x}} + \left[\phi(x^*(t_f),\dot{x}^*(t_f),t_f) - \dot{x}^{*T}(t_f)\,\frac{\partial\phi(x^*(t_f),\dot{x}^*(t_f),t_f)}{\partial\dot{x}}\right]\delta t_f = 0 \qquad (2.10)

2.2.1 Piecewise Smooth Extremals: Weierstrass-Erdmann corner conditions

In the case that the admissible extremals x^* may only be piecewise differentiable, i.e. have a finite number of discontinuities in the first derivative at times t_i (i = 1..N), the functional J = \sum_{i=1}^{N} J_i can be thought of as the sum of functionals J_i with extremals x_i^*(t), t \in [t_{i-1}, t_i], that belong to the class C^1. This yields the boundary conditions (2.11).


0 = \delta x_i^T\left[\frac{\partial\phi(x^*(t_i),\dot{x}^*(t_i^-),t_i)}{\partial\dot{x}} - \frac{\partial\phi(x^*(t_i),\dot{x}^*(t_i^+),t_i)}{\partial\dot{x}}\right] + \left\{\left[\phi(x^*(t_i),\dot{x}^*(t_i^-),t_i) - \dot{x}^{*T}(t_i^-)\,\frac{\partial\phi(x^*(t_i),\dot{x}^*(t_i^-),t_i)}{\partial\dot{x}}\right] - \left[\phi(x^*(t_i),\dot{x}^*(t_i^+),t_i) - \dot{x}^{*T}(t_i^+)\,\frac{\partial\phi(x^*(t_i),\dot{x}^*(t_i^+),t_i)}{\partial\dot{x}}\right]\right\}\delta t_i \qquad (2.11)

The Weierstrass-Erdmann corner conditions are a special case of these boundary conditions. They are obtained in the case when t_i and x^*(t_i) are unrelated and hence \delta t_i and \delta x_i are independently arbitrary. The boundary conditions are then (2.12) and (2.13).

\frac{\partial\phi(x^*(t_i),\dot{x}^*(t_i^-),t_i)}{\partial\dot{x}} = \frac{\partial\phi(x^*(t_i),\dot{x}^*(t_i^+),t_i)}{\partial\dot{x}} \qquad (2.12)

\phi(x^*(t_i),\dot{x}^*(t_i^-),t_i) - \dot{x}^{*T}(t_i^-)\,\frac{\partial\phi(x^*(t_i),\dot{x}^*(t_i^-),t_i)}{\partial\dot{x}} = \phi(x^*(t_i),\dot{x}^*(t_i^+),t_i) - \dot{x}^{*T}(t_i^+)\,\frac{\partial\phi(x^*(t_i),\dot{x}^*(t_i^+),t_i)}{\partial\dot{x}} \qquad (2.13)

In the case when t_i and x(t_i) are related, say by x(t_i) = \theta(t_i), so that \delta x(t_i) \doteq \frac{d\theta(t_i)}{dt}\,\delta t_i, the boundary conditions are (2.14).

\left[\frac{\partial\phi(x^*(t_i),\dot{x}^*(t_i^-),t_i)}{\partial\dot{x}}\right]^T \left[\frac{d\theta(t_i)}{dt} - \dot{x}^*(t_i^-)\right] + \phi(x^*(t_i),\dot{x}^*(t_i^-),t_i) = \left[\frac{\partial\phi(x^*(t_i),\dot{x}^*(t_i^+),t_i)}{\partial\dot{x}}\right]^T \left[\frac{d\theta(t_i)}{dt} - \dot{x}^*(t_i^+)\right] + \phi(x^*(t_i),\dot{x}^*(t_i^+),t_i) \qquad (2.14)

2.3 Constrained Extremals

So far it has been assumed that the components of the extremal solutions x^* are independent. For control problems, the situation is more complicated, as the state trajectory is determined by the control u. Thus there are n + m functions, x and u, but only the m = dim(u) controls are independent and the n = dim(x) state trajectories are dependent on the m controls.

There are two methods to handle the control situation. Assume the extremal is given by w, which has n + m components, and n = dim(a) constraints, expressed in the form a(w(t),t) = 0 or a(w(t),\dot{w}(t),t) = 0. The first kind of constraints are so-called point constraints, whereas constraints of the second type are called differential equation constraints.

• The Elimination Method simply involves expressing all the dependent variables and variations thereof in terms of the independent ones. Then the Euler equations (2.9) can be used to solve this "new" problem. However, it is almost always impossible


to eliminate all the dependent variables and then solve the differential equations. However, it can be used when a simplification in the equation system is achieved. Even when only part of the dependent variables can be eliminated, it might be worthwhile as the "problem size", in terms of the number of total equations, shrinks. The remaining problem might then be solved with the Lagrange Multiplier Method.

• The Lagrange Multiplier Method uses a trick: it builds an augmented integrand function \phi_a of the functional by adding a term which is zero at the extremal w(t). This term is simply the inner product of the so-called Lagrange multipliers p(t) with the point and/or differential constraints, a(w(t),t) = 0 and/or a(w(t),\dot{w}(t),t) = 0. Therefore, if the constraints are satisfied, J_a = J for any function p.

2.3.1 The Lagrange Multiplier Method

As mentioned above, one builds an augmented functional J_a (2.15) by adjoining the n = dim(a) constraining relations a(w,\dot{w},t) = 0 to J:

J_a(w,p) = \int_{t_0}^{t_f} \left\{ \phi(w(t),\dot{w}(t),t) + p^T(t)\,a(w(t),\dot{w}(t),t) \right\} dt \qquad (2.15)

Building the variation of the functional in terms of variations in w(t), \dot{w}(t) and p(t) yields (2.16).

\delta J_a(w,\delta w,p,\delta p) = \int_{t_0}^{t_f} \left\{ \delta w^T(t)\left[\frac{\partial\phi(w(t),\dot{w}(t),t)}{\partial w} + \frac{\partial a^T(w(t),\dot{w}(t),t)}{\partial w}\,p(t)\right] + \delta\dot{w}^T(t)\left[\frac{\partial\phi(w(t),\dot{w}(t),t)}{\partial\dot{w}} + \frac{\partial a^T(w(t),\dot{w}(t),t)}{\partial\dot{w}}\,p(t)\right] + a^T(w(t),\dot{w}(t),t)\,\delta p(t) \right\} dt \qquad (2.16)

Integrating by parts the terms containing \delta\dot{w}(t) yields (2.19) (remember \delta w(t_0) = \delta w(t_f) = 0). On an extremal the variation must be zero, i.e. \delta J_a(w^*,p) = 0. Additionally, the n constraints a(w^*,\dot{w}^*(t),t) are zero for all times t \in [t_0,t_f]. Therefore, one can choose the n Lagrange multipliers arbitrarily. The trick is to choose them such that n dependent components of \delta w(t) are zero throughout the interval [t_0,t_f]. The remaining (n+m) - n = m components of \delta w(t) are then independent and the fundamental lemma of the calculus of variations can be applied. This means that the integrands for the m remaining equations (2.19) must be zero. Together with the n carefully chosen Lagrange multipliers, this can be simplified by building an augmented integrand function that unifies the m+n equations in the augmented Euler equations (2.20).

In the next few paragraphs the results are given for the augmented Euler equations for point and differential equation constraints. However, even if the derivation differs slightly, the results are basically the same. This is not surprising, as point constraints can be identified as a special case of the differential equation constraints. The only difference is in the


constraints themselves, where the argument \dot{w}(t) for point constraints is simply left out.

The augmented integrand functions for point and differential equation constraints are (2.17) and (2.18).

\phi_a(w(t),\dot{w}(t),p(t),t) := \phi(w(t),\dot{w}(t),t) + p^T(t)\,a(w(t),t) \qquad (2.17)

\phi_a(w(t),\dot{w}(t),p(t),t) := \phi(w(t),\dot{w}(t),t) + p^T(t)\,a(w(t),\dot{w}(t),t) \qquad (2.18)

The augmented first variation is (after integration by parts) (2.19).

\delta J_a(w,\delta w,p,\delta p) = \int_{t_0}^{t_f} \left\{ \delta w^T(t)\left[\frac{\partial\phi(w(t),\dot{w}(t),t)}{\partial w} + \frac{\partial a^T(w(t),\dot{w}(t),t)}{\partial w}\,p(t) - \frac{d}{dt}\left[\frac{\partial\phi(w(t),\dot{w}(t),t)}{\partial\dot{w}} + \frac{\partial a^T(w(t),\dot{w}(t),t)}{\partial\dot{w}}\,p(t)\right]\right] + a^T(w(t),\dot{w}(t),t)\,\delta p(t) \right\} dt \qquad (2.19)

The n + m augmented Euler equations are (2.20).

\frac{\partial\phi_a(w^*(t),\dot{w}^*(t),p^*(t),t)}{\partial w} - \frac{d}{dt}\left[\frac{\partial\phi_a(w^*(t),\dot{w}^*(t),p^*(t),t)}{\partial\dot{w}}\right] = 0 \qquad (2.20)

These are subject to the 2(n + m) boundary conditions (2.21) and (2.22).

w(t_0) = w_0 \qquad (2.21)

w(t_f) = w_f \qquad (2.22)

Note that the n+m second-order Euler equations will give rise to 2(n+m) integration constants, plus n Lagrange multipliers, which yields a total of 3n + 2m unknowns. On the other side, one has (n + m) boundary conditions at each of t_0 and t_f, plus n constraints, which allow the elimination of 3n + 2m quantities.

Isoperimetric constraints are constraints of the form (2.23) (note: the upper boundary is t and not t_f).

\int_{t_0}^{t} e(w(t),\dot{w}(t),t)\,dt = \text{constant} \qquad (2.23)

A set of new variables z(t) := \int_{t_0}^{t} e(w(t),\dot{w}(t),t)\,dt can easily be defined. Differentiation with respect to t yields \dot{z}(t) = e(w(t),\dot{w}(t),t), which can be transformed into the differential equation constraints a(w(t),\dot{w}(t),t) := e(w(t),\dot{w}(t),t) - \dot{z}(t). The r = dim(z) additional augmented Euler equations from the additional isoperimetric


constraints are (2.24).

\frac{\partial\phi_a(w^*(t),\dot{w}^*(t),p^*(t),z^*(t),t)}{\partial z} - \frac{d}{dt}\left[\frac{\partial\phi_a(w^*(t),\dot{w}^*(t),p^*(t),z^*(t),t)}{\partial\dot{z}}\right] = 0 \qquad (2.24)

At this point there are a total of (n + m + r) equations involving the (n + m + r + r) functions w^*, p^*, z^*, and some additional r differential equations \dot{z}(t) = e(w(t),\dot{w}(t),t), whose solution must satisfy the boundary conditions z(t_f) = constant. Since \phi_a does not contain z(t), it follows that \partial\phi_a/\partial z \equiv 0. Also, \partial\phi_a/\partial\dot{z} = -p^*(t). Therefore, the r additional Euler equations always yield (2.25).

\dot{p}(t) = 0 \qquad (2.25)

This implies that the Lagrange Multipliers are constant.

2.4 Variational Approach Applied to Control Problems

In control problems a very general task is to find an admissible control u^* that causes the system (2.26) to follow an admissible trajectory x^* that minimizes the performance measure (2.27, 2.28).

\dot{x}(t) = f(x(t),u(t),t) \qquad (2.26)

J(u(\cdot)) = s(x(t_f),t_f) + \int_{t_0}^{t_f} \phi(x(t),u(t),t)\,dt \qquad (2.27)

\overset{\text{assuming } s \text{ is differentiable}}{=} \int_{t_0}^{t_f} \left\{ \phi(x(t),u(t),t) + \frac{ds(x(t),t)}{dt} \right\} dt + s(x(t_0),t_0) \qquad (2.28)

This can be interpreted as a cost-to-go function for starting at state x(t_0) at a time t_0 and ending up in a final state x(t_f) at a time t_f when applying the control u(t) in between. \phi is the immediate cost-density and s(x(t_f),t_f) is a final termination cost associated with the final state. Because x(t_0) and t_0 are fixed, minimization does not affect the s(x(t_0),t_0) term. However, the system constraints, a := f - \dot{x}, have to be taken into account. They can be treated as differential constraints and then used in conjunction with the results achieved in the previous sections. This yields an augmented functional with Lagrange multipliers p(t), (2.29) and (2.30), respectively, with the augmented cost-density (2.31).

J_a(u(\cdot)) = \int_{t_0}^{t_f} \left\{ \phi(x(t),u(t),t) + \dot{x}^T(t)\,\frac{\partial s(x(t),t)}{\partial x} + \frac{\partial s(x(t),t)}{\partial t} + p^T(t)\left[f(x(t),u(t),t) - \dot{x}(t)\right] \right\} dt \qquad (2.29)

= \int_{t_0}^{t_f} \phi_a(x(t),\dot{x}(t),u(t),p(t),t)\,dt \qquad (2.30)


\phi_a(x(t),\dot{x}(t),u(t),p(t),t) := \phi(x(t),u(t),t) + \dot{x}^T(t)\,\frac{\partial s(x(t),t)}{\partial x} + \frac{\partial s(x(t),t)}{\partial t} + p^T(t)\left[f(x(t),u(t),t) - \dot{x}(t)\right] \qquad (2.31)

To determine the variation \delta J_a on an extremal u^*, the variations \delta x, \delta\dot{x}, \delta u, \delta p and \delta t_f are introduced using the prior result of equation (2.6), where \phi_a is used instead of \phi, as defined by (2.32).

\delta J(u^*) = \delta x_f^T\,\frac{\partial\phi_a(x^*(t_f),\dot{x}^*(t_f),u^*(t_f),p^*(t_f),t_f)}{\partial\dot{x}} + \left[\phi_a(x^*(t_f),\dot{x}^*(t_f),u^*(t_f),p^*(t_f),t_f) - \dot{x}^{*T}(t_f)\,\frac{\partial\phi_a(x^*(t_f),\dot{x}^*(t_f),u^*(t_f),p^*(t_f),t_f)}{\partial\dot{x}}\right]\delta t_f
+ \int_{t_0}^{t_f} \left\{ \delta x^T(t)\left[\frac{\partial\phi_a(x^*(t),\dot{x}^*(t),u^*(t),p^*(t),t)}{\partial x} - \frac{d}{dt}\left[\frac{\partial\phi_a(x^*(t),\dot{x}^*(t),u^*(t),p^*(t),t)}{\partial\dot{x}}\right]\right] + \delta u^T(t)\,\frac{\partial\phi_a(x^*(t),\dot{x}^*(t),u^*(t),p^*(t),t)}{\partial u} + \delta p^T(t)\,\frac{\partial\phi_a(x^*(t),\dot{x}^*(t),u^*(t),p^*(t),t)}{\partial p} \right\} dt \qquad (2.32)

This result is achieved because \dot{u}(t) and \dot{p}(t) do not appear in \phi_a, and therefore partial instead of total derivatives can be used to take account of the time-dependent variations \delta u(t) and \delta p(t).

If only the terms involving s are considered in the integral in (2.32), this yields (2.33) and (2.34).

\frac{\partial}{\partial x}\left[\dot{x}^{*T}(t)\,\frac{\partial s(x^*(t),t)}{\partial x} + \frac{\partial s(x^*(t),t)}{\partial t}\right] - \frac{d}{dt}\,\frac{\partial}{\partial\dot{x}}\left[\dot{x}^{*T}(t)\,\frac{\partial s(x^*(t),t)}{\partial x}\right] = \frac{\partial^2 s(x^*(t),t)}{\partial x^2}\,\dot{x}^*(t) + \frac{\partial^2 s(x^*(t),t)}{\partial t\,\partial x} - \frac{d}{dt}\left[\frac{\partial s(x^*(t),t)}{\partial x}\right] \qquad (2.33)

\overset{\text{assuming second partial derivatives are continuous}}{=} \frac{\partial^2 s(x^*(t),t)}{\partial x^2}\,\dot{x}^*(t) + \frac{\partial^2 s(x^*(t),t)}{\partial t\,\partial x} - \left[\frac{\partial^2 s(x^*(t),t)}{\partial x^2}\,\dot{x}^*(t) + \frac{\partial^2 s(x^*(t),t)}{\partial x\,\partial t}\right] = 0 \qquad (2.34)

It must be mentioned that to make the last term in (2.33) cancel the first two, it is assumed that the second partial derivatives are continuous, so that the order of differentiation can be exchanged after applying the chain rule.

Assuming s is well behaved, as implied above, the integral in (2.32) becomes (2.35).

\int_{t_0}^{t_f} \left\{ \delta x^T(t)\left[\frac{\partial\phi(x^*(t),u^*(t),t)}{\partial x} + \frac{\partial f^T(x^*(t),u^*(t),t)}{\partial x}\,p^*(t) - \frac{d}{dt}\left[-p^*(t)\right]\right] + \delta u^T(t)\left[\frac{\partial\phi(x^*(t),u^*(t),t)}{\partial u} + \frac{\partial f^T(x^*(t),u^*(t),t)}{\partial u}\,p^*(t)\right] + \delta p^T(t)\left[f(x^*(t),u^*(t),t) - \dot{x}^*(t)\right] \right\} dt \qquad (2.35)

As before, it is concluded that the integral must vanish on an extremal regardless of the boundary conditions. This means the "system constraints", which form the state equations (2.36), must be satisfied by an extremal to make the coefficient of the arbitrary variation \delta p(t) zero.

\dot{x}^*(t) = f(x^*(t),u^*(t),t) \qquad (2.36)

The arbitrary Lagrange multipliers p^*(t) are used to make the coefficient of the variation \delta x(t) zero. This yields the so-called costate equations with the costate p(t) (2.37).

\dot{p}^*(t) = -\frac{\partial f^T(x^*(t),u^*(t),t)}{\partial x}\,p^*(t) - \frac{\partial\phi(x^*(t),u^*(t),t)}{\partial x} \qquad (2.37)

For the remaining variation δu(t), the zero-coefficient condition yields (2.38).

0 = \frac{\partial\phi(x^*(t),u^*(t),t)}{\partial u} + \frac{\partial f^T(x^*(t),u^*(t),t)}{\partial u}\,p^*(t) \qquad (2.38)

These are necessary conditions for an extremal, but not sufficient, as one still has the terms outside the integral in (2.32), for which (2.39) must hold.

0 = \delta x_f^T\left[\frac{\partial s(x^*(t_f),t_f)}{\partial x} - p^*(t_f)\right] + \delta t_f\left[\phi(x^*(t_f),u^*(t_f),t_f) + \frac{\partial s(x^*(t_f),t_f)}{\partial t} + p^{*T}(t_f)\,f(x^*(t_f),u^*(t_f),t_f)\right] \qquad (2.39)

In summary, the necessary conditions are given by (2.36, 2.37, 2.38), which consist of a set of 2n first-order differential equations (2.36, 2.37) and a set of m algebraic relations (2.38). All of these equations must be satisfied throughout the interval [t_0, t_f]. The solution of the state and costate equations will contain 2n integration constants. These constants can be evaluated by solving the n boundary conditions x(t_0) = x_0 and an additional set of n (or n + 1, depending on whether t_f is specified or free) relationships from equation (2.39). Again, as expected, this is a two-point boundary-value problem, but the n second-order Euler equations have been transformed into 2n first-order differential equations.
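A minimal numerical sketch of such a two-point boundary-value problem is given below. It is an illustration added here, not from the thesis: a scalar plant \dot{x} = u with cost \int_0^T (x^2 + u^2)\,dt, fixed x(0), free final state, s = 0 and fixed final time T, so that (2.38) gives u^* = -p/2 and the boundary condition is p(T) = 0.

```python
import numpy as np
from scipy.integrate import solve_bvp

# State/costate TPBVP from (2.36)-(2.38) for the assumed scalar example above.
T, x0 = 2.0, 1.0

def odes(t, y):
    x, p = y                      # state and costate
    return np.vstack((-p / 2.0,   # xdot = u* = -p/2      (state equation 2.36)
                      -2.0 * x))  # pdot = -dH/dx = -2x   (costate equation 2.37)

def bc(ya, yb):
    return np.array([ya[0] - x0,  # x(0) = x0
                     yb[1]])      # p(T) = ds/dx = 0      (free final state)

t = np.linspace(0.0, T, 50)
sol = solve_bvp(odes, bc, t, np.zeros((2, t.size)))
u_opt = -sol.sol(t)[1] / 2.0      # recover the optimal control along the trajectory
```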

2.4.1 The Hamiltonian Notation

In the following developments it is convenient to use a function H, called the Hamiltonian, defined by (2.40).

H(x(t),u(t),t) := \phi(x(t),u(t),t) + p^T(t)\,f(x(t),u(t),t) \qquad (2.40)

Using this notation, the necessary conditions (2.36, 2.37, 2.38) for an extremal can be stated as (t \in [t_0,t_f]) (2.41) to (2.44).


\dot{x}^*(t) = \frac{\partial H(x^*(t),u^*(t),p^*(t),t)}{\partial p} \qquad (2.41)

\dot{p}^*(t) = -\frac{\partial H(x^*(t),u^*(t),p^*(t),t)}{\partial x} \qquad (2.42)

0 = \frac{\partial H(x^*(t),u^*(t),p^*(t),t)}{\partial u} \qquad (2.43)

and

\delta x_f^T\left[\frac{\partial s(x^*(t_f),t_f)}{\partial x} - p^*(t_f)\right] + \delta t_f\left[H(x^*(t_f),u^*(t_f),p^*(t_f),t_f) + \frac{\partial s(x^*(t_f),t_f)}{\partial t}\right] = 0 \qquad (2.44)

2.4.1.1 Boundary Conditions

It is convenient to distinguish boundary conditions with fixed and free final time t_f. This is done in the following two subsections, where the remaining cases of boundary conditions are classified.

2.4.1.1.1 Problems with fixed final time (δtf = 0)

• Final state specified: x∗(tf ) = xf .

• Final state free: \delta t_f = 0 \Rightarrow \frac{\partial s(x^*(t_f),t_f)}{\partial x} - p^*(t_f) = 0.

• Final state lying on the surface defined by m(x(t)) = 0, where k = dim(m) < n = dim(x). In other words, the final state lies on the hypersurface m(x(t)) = 0, defined by the intersection of the k hypersurfaces m_i(x(t)) = 0, (1 \le i \le k). The \delta x(t_f) must be tangent to the k hypersurfaces at any point x^*(t_f), or equivalently, normal to the k gradient vectors \partial m_i(x^*(t_f))/\partial x, assumed to be linearly independent. This, and the simplified (\delta t_f = 0) equation (2.44), yield (2.45) and (2.46).

\delta x^T(t_f)\,\frac{\partial m_i(x^*(t_f))}{\partial x} = 0, \quad \forall i,\ 1 \le i \le k \qquad (2.45)

\delta x^T(t_f)\left[\frac{\partial s(x^*(t_f),t_f)}{\partial x} - p^*(t_f)\right] = 0 \qquad (2.46)

This means that the factor of \delta x(t_f) in the boundary conditions must be a linear combination of the gradient vectors above, as defined by (2.47).

\frac{\partial s(x^*(t_f),t_f)}{\partial x} - p^*(t_f) = \sum_{i=1}^{k} d_i\,\frac{\partial m_i(x^*(t_f))}{\partial x} = \frac{\partial m^T(x^*(t_f))}{\partial x}\,d \qquad (2.47)

Thus, it is necessary to determine 2n constants of integration from the state and costate equations, and k coefficients d_i. On the other hand, there are n equations (2.47), and


n and k boundary conditions, x(t_0) = x_0 and m(x(t_f)) = 0 respectively, to solve for the unknowns.

2.4.1.1.2 Problems with free final time (\delta t_f is arbitrary)

• Final state specified, \delta x_f = 0. Besides the 2n boundary conditions x^*(t_0) = x_0 and x^*(t_f) = x_f there is an additional one to solve for the free time parameter, as defined by (2.48).

H(x^*(t_f),u^*(t_f),p^*(t_f),t_f) + \frac{\partial s(x^*(t_f),t_f)}{\partial t} = 0 \qquad (2.48)

• Final state free, \delta x_f and \delta t_f are arbitrary and independent. Besides the n boundary conditions x(t_0) = x_0, there are n + 1 further conditions (2.49) and (2.50).

p^*(t_f) = \frac{\partial s(x^*(t_f),t_f)}{\partial x} \qquad (2.49)

H(x^*(t_f),u^*(t_f),p^*(t_f),t_f) + \frac{\partial s(x^*(t_f),t_f)}{\partial t} = 0 \qquad (2.50)

• Final state lies on the moving point \theta(t). Here \delta x_f and \delta t_f are related by \delta x_f \doteq \frac{d\theta(t_f)}{dt}\,\delta t_f, which yields the 2n boundary conditions (2.51) and (2.52).

H(x^*(t_f),u^*(t_f),p^*(t_f),t_f) + \frac{\partial s(x^*(t_f),t_f)}{\partial t} + \left[\frac{\partial s(x^*(t_f),t_f)}{\partial x} - p^*(t_f)\right]^T \frac{d\theta(t_f)}{dt} = 0 \qquad (2.51)

x^*(t_f) = \theta(t_f) \qquad (2.52)

• Final state lying on the surface defined by m(x(t)) = 0. Since \delta x(t_f) must be in the tangent plane of m(x^*(t_f)) = 0 and \delta x_f = \delta x(t_f) is independent of \delta t_f, there is one condition from (2.44). For all the other conditions the same argumentation can be applied as in the case \delta t_f = 0, as defined by (2.53) to (2.57).

H(x^*(t_f),u^*(t_f),p^*(t_f),t_f) + \frac{\partial s(x^*(t_f),t_f)}{\partial t} = 0 \qquad (2.53)

x^*(t_0) = x_0 \qquad (2.54)

\frac{\partial s(x^*(t_f),t_f)}{\partial x} - p^*(t_f) = \sum_{i=1}^{k} d_i\,\frac{\partial m_i(x^*(t_f))}{\partial x} \qquad (2.55)

= \frac{\partial m^T(x^*(t_f))}{\partial x}\,d \qquad (2.56)

m(x^*(t_f)) = 0 \qquad (2.57)

• Final state lying on the moving surface defined by m(x(t),t) = 0. Here \delta t_f does influence the admissible values of \delta x_f. To remain on the surface m(x(t),t) = 0 the


value of \delta x depends on \delta t_f. However, it is possible to augment the vector x(t) by an (n+1)-th element x_{n+1}(t) \equiv t. With the same argumentation as in the above case (that tangent vectors must be perpendicular to the gradient vectors, assumed to be linearly independent), the result is (2.58).

\begin{bmatrix} \frac{\partial s(x^*(t_f),t_f)}{\partial x} - p^*(t_f) \\ H(x^*(t_f),u^*(t_f),p^*(t_f),t_f) + \frac{\partial s(x^*(t_f),t_f)}{\partial t} \end{bmatrix}^T \begin{bmatrix} \delta x_f \\ \delta t_f \end{bmatrix} = \sum_{i=1}^{k} d_i \begin{bmatrix} \frac{\partial m_i(x^*(t_f),t_f)}{\partial x} \\ \frac{\partial m_i(x^*(t_f),t_f)}{\partial t} \end{bmatrix}^T \begin{bmatrix} \delta x_f \\ \delta t_f \end{bmatrix} \qquad (2.58)

Together with all the other boundary conditions, this leads to a total set of 2n + k + 1 equations, with 2n constants of integration and the parameters d_1, .., d_k and t_f, defined by (2.59) to (2.62).

\frac{\partial s(x^*(t_f),t_f)}{\partial x} - p^*(t_f) = \sum_{i=1}^{k} d_i\,\frac{\partial m_i(x^*(t_f),t_f)}{\partial x} = \frac{\partial m^T(x^*(t_f),t_f)}{\partial x}\,d \qquad (2.59)

H(x^*(t_f),u^*(t_f),p^*(t_f),t_f) + \frac{\partial s(x^*(t_f),t_f)}{\partial t} = \sum_{i=1}^{k} d_i\,\frac{\partial m_i(x^*(t_f),t_f)}{\partial t} = \frac{\partial m^T(x^*(t_f),t_f)}{\partial t}\,d \qquad (2.60)

m(x^*(t_f),t_f) = 0 \qquad (2.61)

x^*(t_0) = x_0 \qquad (2.62)

2.4.1.2 Pontryagin’s Minimum Principle

A control u^* causes the functional J to have a local minimum if (2.63) holds for all admissible controls u sufficiently close to u^*. For a control u = u^* + \delta u, close enough to the optimal control u^*, the increment in J can be expressed by (2.64).

J(u) - J(u^*) = \Delta J \ge 0 \qquad (2.63)

\Delta J(u^*,\delta u) = \delta J(u^*,\delta u) + o(\delta u) \qquad (2.64)

If there were unbounded controls, the same argumentation can be used as before: the variation \delta u is arbitrary and thus the variation \delta J(u,\delta u) must be zero for an extremal u and all admissible variations \delta u with a sufficiently small norm. As long as the extremal u^*(t) is in the interior of a limited control region for all times t \in [t_0,t_f], \delta u is arbitrary. However, if the control u^* is on a boundary for some time t \in [t_1,t_2] \subset [t_0,t_f], \delta u cannot be chosen arbitrarily anymore. If only admissible variations \delta u within [t_1,t_2], i.e. pointing into the interior of the admissible region, are considered, a necessary condition for u^* to


minimize J is \delta J(u^*,\delta u) \ge 0. For variations \delta u which are non-zero only for t \notin [t_1,t_2], it is necessary that \delta J(u^*,\delta u) = 0. Considering all admissible variations \delta u with ||\delta u|| small enough, the sign of \Delta J is determined by \delta J. A necessary condition for u^* to minimize J is (2.65).

\delta J(u^*,\delta u) \ge 0 \qquad (2.65)

To apply this modification to the Hamiltonian system (2.41, 2.42, 2.43), where it was assumed that the control values are unconstrained, it is possible to approximate \Delta J(u^*,\delta u) \doteq \delta J(u^*,\delta u) from (2.32). If the state equations (2.41) are satisfied, p^*(t) is selected such that the coefficient of \delta x(t) is zero, and the boundary conditions (2.44) are satisfied, one has (2.66) and (2.67).

\Delta J(u^*,\delta u) \doteq \int_{t_0}^{t_f} \left[\frac{\partial H(x^*(t),u^*(t),p^*(t),t)}{\partial u}\right]^T \delta u(t)\,dt \qquad (2.66)

\doteq \int_{t_0}^{t_f} \left[ H(x^*(t),u^*(t)+\delta u(t),p^*(t),t) - H(x^*(t),u^*(t),p^*(t),t) \right] dt \qquad (2.67)

If u^*(t) + \delta u(t) is in a sufficiently small neighborhood of u^* (||\delta u|| < \beta), then the higher-order terms are small, the integral dominates them, and it approximates \Delta J exactly in the limit \beta \to 0. Together with \Delta J(u^*,\delta u) \ge 0 for any sufficiently close u(t) := u^*(t)+\delta u(t), (2.68) must be valid for all admissible \delta u(t), and thus controls u(t), for all times t \in [t_0,t_f]. This states that an optimal control must minimize the Hamiltonian, and (2.68) is referred to as Pontryagin's minimum principle.

H(x^*(t),u(t),p^*(t),t) \ge H(x^*(t),u^*(t),p^*(t),t) \qquad (2.68)

2.4.2 Summary of Variational Approach Applied to Control Problems

The Hamiltonian is given by the sum of the short-term cost \phi and the dot-product between the costate and the "system constraints" (2.69).

H(x(t),u(t),t) := \phi(x(t),u(t),t) + p^T(t)\,f(x(t),u(t),t) \qquad (2.69)

The state and costate equations are given by (2.70) and (2.71). The inequality (2.72) is Pontryagin's minimum principle. All these equations and inequalities must be satisfied for all t \in [t_0,t_f]. Additionally, the boundary conditions (2.73) must hold.


\dot{x}^*(t) = \frac{\partial H(x^*(t),u^*(t),p^*(t),t)}{\partial p} \qquad (2.70)

\dot{p}^*(t) = -\frac{\partial H(x^*(t),u^*(t),p^*(t),t)}{\partial x} \qquad (2.71)

H(x^*(t),u^*(t),p^*(t),t) \le H(x^*(t),u(t),p^*(t),t) \quad \forall\ \text{admissible } u(t) \qquad (2.72)

and the boundary conditions:

\delta x_f^T\left[\frac{\partial s(x^*(t_f),t_f)}{\partial x} - p^*(t_f)\right] + \delta t_f\left[H(x^*(t_f),u^*(t_f),p^*(t_f),t_f) + \frac{\partial s(x^*(t_f),t_f)}{\partial t}\right] = 0 \qquad (2.73)

Note:

• u^*(t) is a control that causes H(x^*(t),u(t),p^*(t),t) to assume its global minimum.

• Equations (2.70, 2.71, 2.72, 2.73) constitute a set of necessary but in general not sufficient conditions.

• If the boundaries of the restricting regions are lifted more and more, such that the optimal control is no longer constrained by the boundaries, a necessary condition for u^*(t) is (2.74).

\frac{\partial H(x^*(t),u^*(t),p^*(t),t)}{\partial u} = 0 \quad \forall\ t \in [t_0,t_f] \qquad (2.74)

If the Hessian \partial^2 H(x^*(t),u^*(t),p^*(t),t)/\partial u^2 is positive definite, this is sufficient for a local minimum. It is clear that if the Hamiltonian can be expressed in the form (2.75), which is quadratic in u(t), this is sufficient for a global minimum.

H(x^*(t),u^*(t),p^*(t),t) = b(x(t),p(t),t) + c^T(x(t),p(t),t)\,u(t) + \tfrac{1}{2}\,u^T(t)\,R(t)\,u(t) \qquad (2.75)

In this case the Hessian is R(t) and the optimal control is (2.76).

u^*(t) = -R^{-1}(t)\,c(x^*(t),p^*(t),t) \qquad (2.76)
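For clarity, the one-line derivation behind (2.76), added here, is simply the stationarity condition (2.74) applied to the quadratic form (2.75):

```latex
% With H = b + c^T u + (1/2) u^T R u, condition (2.74) reads
\frac{\partial H}{\partial u} = c(x(t),p(t),t) + R(t)\,u(t) = 0
\quad\Longrightarrow\quad
u^{*}(t) = -R^{-1}(t)\,c(x^{*}(t),p^{*}(t),t),
% which is a global minimizer of H in u whenever R(t) is positive definite.
```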

2.4.2.1 Additional Necessary Conditions

Pontryagin et al. have also derived some other necessary conditions for optimality. Similarly to the Euler equations, derived for an only implicit time-dependence (\phi = \phi(x(t),\dot{x}(t))), the Hamiltonian becomes:

• constant if the final time t_f is fixed, that is: H(x^*(t),u^*(t),p^*(t)) = const \ \forall\ t \in [t_0,t_f].


• zero if the final time t_f is free: H(x^*(t),u^*(t),p^*(t)) = 0 \ \forall\ t \in [t_0,t_f].

2.4.2.2 State Variable Inequality Constraints

Consider state constraints of the form (2.77), where a is a function of the states and possibly time, which has continuous first and second partial derivatives with respect to x(t).

a(x(t),t) \ge 0, \quad \text{with } l = \dim(a) \le m = \dim(u) \qquad (2.77)

A new component of x is defined by (2.78) with the help of the vector Heaviside function given by (2.79).

\dot{x}_{n+1} := f_{n+1}(x(t)) := a^T(x(t),t)\,\mathrm{diag}(\Theta(-a))\,a(x(t),t) \qquad (2.78)

\Theta(a)_i = \begin{cases} 0 : & a_i < 0 \\ 1 : & a_i \ge 0 \end{cases}, \quad i = 1,..,l, \quad \text{as the vector Heaviside function} \qquad (2.79)

Then, x_{n+1}(t) can be written as an integral (2.80), requiring the two boundary conditions x_{n+1}(t_0) = 0 and x_{n+1}(t_f) = 0 to be satisfied.

x_{n+1}(t) = \int_{t_0}^{t} \dot{x}_{n+1}(t)\,dt + x_{n+1}(t_0) \qquad (2.80)
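A small illustration of the constraint-violation density in (2.78)-(2.79), added here and not part of the thesis, shows how only violated constraint components contribute:

```python
import numpy as np

# f_{n+1}: components with a_i(x,t) <= 0 contribute a_i^2, satisfied ones contribute 0.
def f_n_plus_1(a_values):
    a = np.asarray(a_values, dtype=float)
    theta_minus_a = (-a >= 0.0).astype(float)      # vector Heaviside Theta(-a)
    return float(a @ np.diag(theta_minus_a) @ a)   # a^T diag(Theta(-a)) a

# Example: a = (0.3, -0.2) -> only the violated component contributes, giving 0.04.
print(f_n_plus_1([0.3, -0.2]))
```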

Since \dot{x}_{n+1}(t) \ge 0 \ \forall\ t, satisfaction of the boundary conditions implies that \dot{x}_{n+1}(t) \equiv 0 \ \forall\ t \in [t_0,t_f], but this occurs only if the constraints are satisfied throughout the time interval [t_0,t_f]. Thus, to minimize (2.81), subject to the state equation constraints \dot{x}(t) = f(x(t),u(t),t), admissibility constraints on the control variables and the state inequality constraints a(x(t),t), the Hamiltonian (where f_{n+1} is given in (2.78)) has to be extended to (2.82).

J(u) = s(x(t_f),t_f) + \int_{t_0}^{t_f} \phi(x(t),u(t),t)\,dt \qquad (2.81)

H(x(t),u(t),p(t),p_{n+1}(t),t) := \phi(x(t),u(t),t) + p^T(t)\,f(x(t),u(t),t) + p_{n+1}(t)\,f_{n+1}(x(t)) \qquad (2.82)


This leads to the necessary conditions (2.83) to (2.87) for optimality (t \in [t_0,t_f]):

\dot{x} = f(x(t),u(t),t) \qquad (2.83)

\dot{x}_{n+1} = f_{n+1}(x(t),t) \qquad (2.84)

\dot{p}(t) = -\frac{\partial H(x^*(t),u^*(t),p^*(t),p^*_{n+1}(t),t)}{\partial x} \qquad (2.85)

\dot{p}_{n+1}(t) = -\frac{\partial H(x^*(t),u^*(t),p^*(t),p^*_{n+1}(t),t)}{\partial x_{n+1}} = 0 \qquad (2.86)

H(x^*(t),u^*(t),p^*(t),p^*_{n+1}(t),t) \le H(x^*(t),u(t),p^*(t),p^*_{n+1}(t),t) \quad \forall\ \text{admissible } u(t) \qquad (2.87)

2.5 Solving the Variational Approach with Neural Networks

In the general case, the set of n second-order Euler equations, or its 2n first-order counterpart in the Hamiltonian notation, with the two-point boundary conditions is difficult to solve and in most cases only numerically solvable. Even the numerical solution can be difficult, especially for higher-dimensional systems where n is large, or for long time intervals. Here, another approach using function approximators for the controller and function approximators for the Lagrange multipliers is proposed. Also, the relation between Dynamic Programming and the Calculus of Variations can be utilized to combine the two fields for practical algorithms.

2.5.1 Direct Application of Calculus of Variations

Firstly, the original problem of the Calculus of Variations is restated. The goal is to minimize (2.88) subject to the boundary conditions x(t_0) = x_0 and x(t_f) = x_f, which means that the variations \delta x(t_0) and \delta x(t_f) must be zero.

J[x(\cdot)] = \int_{t_0}^{t_f} \phi(x(t),\dot{x}(t),t)\,dt \qquad (2.88)

To apply the results of section 2.1, it is necessary that the components of x be independent. The goal is now to find a control law that drives the system in such a way that the objective (2.88) is minimized by applying the Calculus of Variations. It is necessary to consider constraints stemming from the system equations. Just applying the Calculus of Variations to the original problem, which consists of minimizing the functional J[x(\cdot)] = \int_{t_0}^{t_f} \phi(x(t),\dot{x}(t),t)\,dt, implicitly establishes the relation between x(t) and \dot{x}(t). If a mapping can be found between x(t) and \dot{x}(t) that explicitly states this relationship, then no further constraints are imposed at all. It is assumed that a mapping f : x \overset{u}{\mapsto} \dot{x} can be found that states the relationship explicitly, where u can be seen as a set of parameters. Clearly, if this mapping can form any relationship between x and \dot{x} with suitable u, there are no constraints for the variational problem. However, if the function f with its parameters u does not allow certain mappings, this would impose constraints on the variational problem. To distinguish these two cases the term controllability could be used, which is a property between the input (control) and the state space. Here¹, if an explicit relationship can be achieved for the implicitly given system constraints, then the system is "strongly controllable". For a linear continuous-time system \dot{x}(t) = A(t)x(t) + B(t)u(t), if rank(B(t)) = dim(x(t)), hence dim(u(t)) = dim(x(t)), and the controls u are arbitrarily selectable, this is a sufficient criterion to achieve this. In general, however, this freedom is not available, the system constraints have to be imposed and the Lagrange multiplier method has to be used. Also, if the state components are not independent, this dependence has to be eliminated from the state. But all in all, the optimization problem can be seen in terms of the state x by adapting parameters for a mapping g : x \overset{w_a}{\mapsto} u with some parameters w_a, eventually subjecting it to some specific constraints.

In the most general case, when there are also some constraints, the augmented version of the cost-density function (2.89) has to be used and the Euler equations enforced by making the 'Euler error' E_E (2.90) zero.

\phi_a(x^*(t),\dot{x}^*(t),p^*(t),t) = \phi(x^*(t),\dot{x}^*(t),t) + p^T(t)\,a(x^*(t),\dot{x}^*(t),t) \qquad (2.89)

E_E := \frac{\partial\phi_a(x^*(t),\dot{x}^*(t),p^*(t),t)}{\partial x} - \frac{d}{dt}\left[\frac{\partial\phi_a(x^*(t),\dot{x}^*(t),p^*(t),t)}{\partial\dot{x}}\right] \qquad (2.90)

Depending on the complexity of \phi_a, the total derivative with respect to t could be simplified, or it could be approximated by finite differences (2.91).

E_E \doteq \frac{\partial\phi_a(x^*(t),\dot{x}^*(t),p^*(t),t)}{\partial x} - \frac{1}{\delta t}\left[\frac{\partial\phi_a(x^*(t+\delta t),\dot{x}^*(t+\delta t),p^*(t+\delta t),t+\delta t)}{\partial\dot{x}} - \frac{\partial\phi_a(x^*(t),\dot{x}^*(t),p^*(t),t)}{\partial\dot{x}}\right] \qquad (2.91)

As the Lagrange multipliers p(t) are not known, it might be possible to approximate them by p(x(t); w_{lm}). The error E_E could be used for a steepest descent algorithm, where the parameters of the controller and the Lagrange multiplier approximator are concurrently adapted according to (2.92) and (2.93).

\dot{w}_a = -\eta\,\frac{\partial^+ E_E^T}{\partial w_a}\,E_E \qquad (2.92)

\dot{w}_{lm} = -\eta\,\frac{\partial^+ E_E^T}{\partial w_{lm}}\,E_E \qquad (2.93)

¹The definition of controllability given here is a more general than practical one; it differs from the "normal" definition, as here it is not required that the system state be driven to zero in a finite time interval (see for example [21]). Therefore, the term "strongly controllable" is used because of the strong condition, rank(B(t)) = dim(x(t)), to be enforced for controllability.


However, it might be too difficult to train an additional neural network, and the concurrent adaptation might lead to instability.
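A rough sketch of this adaptation idea, added here and not the thesis implementation, is shown below. A controller u = g(x; w_a) and a multiplier estimate p = p(x; w_lm), both taken linear purely for illustration, are nudged down the gradient of the squared Euler error formed with the finite-difference approximation (2.91); phi_a_dx and phi_a_dxdot, the partial derivatives of the augmented cost density, are problem-specific callables and therefore assumptions of this sketch.

```python
import numpy as np

def euler_error(x, x_next, dt, w_a, w_lm, phi_a_dx, phi_a_dxdot):
    u, p = w_a @ x, w_lm @ x                       # g(x; w_a), p(x; w_lm)
    u_n, p_n = w_a @ x_next, w_lm @ x_next         # one time step later
    return (phi_a_dx(x, u, p)
            - (phi_a_dxdot(x_next, u_n, p_n) - phi_a_dxdot(x, u, p)) / dt)

def adapt(x, x_next, dt, w_a, w_lm, phi_a_dx, phi_a_dxdot, eta=1e-3, eps=1e-5):
    e0 = euler_error(x, x_next, dt, w_a, w_lm, phi_a_dx, phi_a_dxdot)
    for w in (w_a, w_lm):                          # concurrent descent (2.92)-(2.93)
        grad = np.zeros_like(w)
        for idx in np.ndindex(w.shape):            # numerical dE_E/dw
            w[idx] += eps
            e1 = euler_error(x, x_next, dt, w_a, w_lm, phi_a_dx, phi_a_dxdot)
            w[idx] -= eps
            grad[idx] = (e1 - e0) @ e0 / eps       # (dE_E/dw_idx)^T E_E
        w -= eta * grad
    return w_a, w_lm
```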

2.5.2 Hamiltonian Formulation

Here, the optimization problem is seen in terms of the control u. Therefore the problem is formulated in the Hamiltonian notation (2.94) to (2.96), where \tilde{p}(t) and \tilde{f}(x(t),u(t)) are the augmented Lagrange multipliers and differential constraints, respectively.

H(x(t),u(t)) := \phi(x(t),u(t)) + \tilde{p}^T(t)\,\tilde{f}(x(t),u(t)) \qquad (2.94)

\tilde{p}(t) := \begin{bmatrix} p(t) \\ p_{n+1}(t) \end{bmatrix} \qquad (2.95)

\tilde{f}(x(t),u(t)) := \begin{bmatrix} f(x(t),u(t)) \\ f_{n+1}(x(t),t) \end{bmatrix} \qquad (2.96)

The constraints are made up from the "ordinary system constraints" (2.97) and some state inequality constraints (2.98) with (2.99), where the vector Heaviside function is given by (2.79).

a_f := \dot{x} - f(x(t),u(t)) = 0 \qquad (2.97)

a(x(t),t) \ge 0 \qquad (2.98)

f_{n+1}(x(t),t) := a^T(x(t),t)\,\mathrm{diag}(\Theta(-a(x(t),t)))\,a(x(t),t) \qquad (2.99)

It is desired that \dot{x}_{n+1}(t) be equal to f_{n+1}(x(t),t), which leads to an augmented equality constraint a_{n+1} := \dot{x}_{n+1} - f_{n+1} = 0. This means that all inequality constraints can be treated as one additional equality constraint. It is evident that \dot{x}_{n+1}(t) \ge 0 for all t, and \dot{x}_{n+1}(t) = 0 if and only if all the inequality constraints are satisfied at time t. For notational simplicity, the ~ sign used to indicate augmented states and costates is left out (but it should be clear that the first n components of the vectors x(t) and p(t) are from the ordinary system, and that x_{n+1}(t) = \int_{t_0}^{t} \dot{x}_{n+1}(x(t),u(t),t)\,dt + x_{n+1}(t_0) and p_{n+1} are the augmented state and costate, respectively). Requiring the boundary conditions x_{n+1}(t_0) = x_{n+1}(t_f) = 0 implies that \dot{x}_{n+1}(t) \equiv 0 for all t \in [t_0,t_f], and thus all inequality constraints are satisfied throughout the interval [t_0,t_f].

The primary goal is to obtain a controller g(x(t); w_a) that provides an estimate of the mapping from the state x(t) to the controls u(t) = u(x(t)). This is done by solving the Hamiltonian problem given above. However, as the Lagrange multipliers are not known, another estimator network might be used to approximate them: p(x(t),u(t); w_{lm}). To train these networks, a steepest descent algorithm with the errors (2.100) and (2.101) might be used (assuming no control boundaries, i.e. \partial H/\partial u = 0), yielding steepest descent


updates according to (2.102) to (2.105).

E_p := \dot{p}(t) + \frac{\partial H(x,u,p,t)}{\partial x} \doteq \frac{p(t+\delta t) - p(t)}{\delta t} + \frac{\partial\phi}{\partial x} + \frac{\partial f^T}{\partial x}\,p \qquad (2.100)

E_a := \frac{\partial H(x,u,p,t)}{\partial u} = \frac{\partial\phi}{\partial u} + \frac{\partial f^T}{\partial u}\,p \qquad (2.101)

\dot{w}_p = -\eta\,\frac{\partial^+ E_p^T}{\partial w_p}\,E_p \qquad (2.102)

\dot{w}_a = -\eta\,\frac{\partial^+ E_a^T}{\partial w_a}\,E_a \qquad (2.103)

\frac{\partial^+ E_p^T}{\partial w_p} = \frac{\partial p^T}{\partial w_p}\,\frac{\partial^+ E_p^T}{\partial p} = \frac{1}{\delta t}\left[\frac{\partial p^T}{\partial w_p}\right]_{t}^{t+\delta t} + \frac{\partial p^T}{\partial w_p}\,\frac{\partial f}{\partial x} \qquad (2.104)

\frac{\partial^+ E_a^T}{\partial w_a} = \frac{\partial g^T}{\partial w_a}\,\frac{\partial^+ E_a^T}{\partial u} = \frac{\partial g^T}{\partial w_a}\left[\frac{\partial^2\phi}{\partial u^2} + \frac{\partial p^T}{\partial u}\,\frac{\partial f}{\partial u} + p^T\,\frac{\partial^2 f}{\partial u^2}\right] \qquad (2.105)

In the case where the control u is also constrained, the Hamiltonian could be chosen for the error E_a (2.106) and used to update the controller weights (2.107).

E_a := H(x(t),u(t),p(t),t) \qquad (2.106)

\dot{w}_a := -\eta\,\frac{\partial^+ E_a}{\partial w_a}\,E_a \qquad (2.107)

Because of the steepest descent algorithm, this should at least converge to a local minimum. Adapting iteratively, or even concurrently, between p(x,u; w_{lm}) and g(x; w_a) might help avoid local minima.
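A compact sketch of the two residuals (2.100)-(2.101), added here purely for illustration (not the thesis implementation), assumes linear approximators p = W_p x and u = g(x) = W_a x and generic callables phi_dx, phi_du, f_dx, f_du supplying the partial derivatives of \phi and f:

```python
import numpy as np

def residuals(x, x_next, dt, W_p, W_a, phi_dx, phi_du, f_dx, f_du):
    u = W_a @ x                                   # controller output g(x; w_a)
    p, p_next = W_p @ x, W_p @ x_next             # multiplier estimates at t and t+dt
    E_p = (p_next - p) / dt + phi_dx(x, u) + f_dx(x, u).T @ p   # costate error (2.100)
    E_a = phi_du(x, u) + f_du(x, u).T @ p                        # stationarity error (2.101)
    return E_p, E_a

# The squared norms of E_p and E_a would then drive the steepest descent updates
# (2.102)-(2.103) on W_p and W_a, e.g. via finite differences or backpropagation.
```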

Furthermore, depending on the kind of boundary conditions, additional training errors can be obtained. For example, for a fixed final state x_f and a free final time t_f, the boundary conditions demand the Hamiltonian to be zero. Hence, an additional Hamiltonian error (2.108), evaluated at the final state x_f (u_f = u(x_f)), is required to be minimized:

E_H := H(x_f,u_f,p(x_f,u_f)) = \phi(x_f,u_f) + f^T(x_f,u_f)\,p(x_f,u_f; w_p) \qquad (2.108)

This gives the additional update laws (2.109) to (2.112).

\dot{w}_p = -\eta\,\frac{\partial^+ E_H^T}{\partial w_p}\,E_H \Big|_{x_f} \qquad (2.109)

\dot{w}_a = -\eta\,\frac{\partial^+ E_H^T}{\partial w_a}\,E_H \Big|_{x_f} \qquad (2.110)

\frac{\partial^+ E_H^T}{\partial w_p} = \frac{\partial p^T}{\partial w_p}\,\frac{\partial^+ E_H^T}{\partial p}\Big|_{x_f} = \frac{\partial p^T}{\partial w_p}\,f\,\Big|_{x_f} \qquad (2.111)

\frac{\partial^+ E_H^T}{\partial w_a} = \frac{\partial g^T}{\partial w_a}\,\frac{\partial^+ E_H^T}{\partial u}\Big|_{x_f} = \frac{\partial g^T}{\partial w_a}\left[\frac{\partial\phi}{\partial u} + \frac{\partial f^T}{\partial u}\,p + \frac{\partial p^T}{\partial u}\,f\right]_{x_f} \qquad (2.112)


2.5.3 Some Remarks on the Neural Network Approach

Chapter 2 was developed in detail to make the understanding of the Calculus of Variations and the Hamiltonian approach and their interconnections clear. This is a prerequisite and provides the possibility of applying this traditional framework in the context of adaptive critics. The driving idea behind the use of the calculus of variations is to concentrate on extremal policies, rather than arbitrary (but hopefully stable) policies as in the initial phase of adaptive critic training, which are then gradually improved by successive critic and actor training cycles. To avoid computing cost-to-go functions for these non-optimal and uninteresting intermediate actors, some additional training procedure was sought that helps to concentrate on extremal policies.

Whether the calculus of variations or its dual Hamiltonian form really is the proper form to use to improve adaptive critics could be questioned, on the grounds that the variational calculus is an optimization problem over an infinite function space, whereas in an adaptive critic design the critic as well as the actor are practical neural networks with a finite number of parameters. Therefore the problem at hand is a finite-dimensional optimization problem. Another question arises from the need to solve a two-point boundary value problem (TPBVP) associated with the Euler and Hamiltonian approach. Usually, TPBVPs are used to compute optimal control trajectories and associated control sequences, rather than to obtain parameterized controllers. Also, TPBVPs are known to be inherently difficult to solve. Because of this difficulty and the additional task of training neural networks, the focus was shifted in chapter 4 to the finite-dimensional optimization problem. In section 4.4, an 'integral version of backpropagation through time' has been developed, which turns out to be only an initial value problem (IVP) associated with a forward integration. To speed up convergence, Newton's method is still hard to beat, though at the cost of having to calculate second order derivatives. However, it is straightforward to apply the same idea to solve the IVP, just with some additional set of first order differential equations. This is dealt with in detail in section 4.4.2.

While the basic theory for the calculus of variations has been developed in this chapter, actual implementation and tests are done in chapter 5, where the simplest case for an update using the variational calculus is carried out. The problems encountered there actually fostered the development of chapter 4 and led to a better understanding of another problem, namely, having a one-step time difference and calculating total derivatives with respect to some quantity. In this case indirect dependence is lost and only direct influence through partial derivatives is maintained. This can be avoided by using a longer horizon over several steps of the target quantity. As an outlook, the example given by equation (4.71) in section 4.4.1 makes clear what is meant by this. It should be possible to modify the training equations in section 2.5 to use integrated errors over some time δt, rather than using only errors at the current time t. In section 5.3 an example has been applied to 'fix' the training equations of section 3.2, which made them more robust.


Chapter 3

Adaptive Critic Designs

As mentioned before, the goal is to find a critic that yields an approximation J(...; w_c) to the stationary cost-to-go function J[x(t)], where x(t) is the state at time t. The stationarity means that \partial J/\partial t of (1.15) is zero for all times, which implies that it is an infinite horizon problem with t_f \to \infty. The stationary cost-to-go function J is given by (3.1) (see equation (1.10)).

J[x] = \min_{u \in A}\left\{ L(x, u, t)\,\delta t + J[x + \dot{x}\,\delta t] \right\}   (3.1)

However, as the cost-to-go function is not necessarily finite or well-defined in the general case (see (1.6)) for t_f \to \infty, a contraction factor is often used to force it to be finite, or other additional assumptions must be made. There are several possibilities, e.g. enforcing ||x + \dot{x}\,\delta t|| < \alpha||x|| for some \alpha < 1 and for all x and u, or enforcing a stationary solution by discounting the future cost-to-go with a factor \gamma = 1/(1+p) < 1, where p is an interest rate. The latter case is normally used in ACDs. This is a Bellman equation of type II [20], and its solution will be the target function (3.2) to learn in an ACD framework, which in this case is called Heuristic Dynamic Programming (HDP). Using the total derivative dJ/dx as the target (3.3) is called Dual Heuristic Programming (DHP), and combining both achieves Global Dual Heuristic Programming (GDHP). All these names and families are due to Werbos [5]. Normally, the cost-density L(x, u, t) is not explicitly time dependent, and its relation with the discount factor \gamma will be discussed in the following paragraphs. Figure 3.1 is a graphical representation of an ACD framework.

J[x] = \min_{u \in A}\left\{ L(x, u, t)\,\delta t + \gamma J[x + \dot{x}\,\delta t] \right\}   (3.2)

\lambda[x] := \frac{dJ[x]}{dx} = \min_{u \in A}\left\{ \frac{d(L(x, u, t)\,\delta t)}{dx} + \gamma\,\frac{dJ[x + \dot{x}\,\delta t]}{dx} \right\}   (3.3)

= \min_{u \in A}\left\{ \left[\frac{\partial L}{\partial x} + \frac{du^T}{dx}\frac{\partial L}{\partial u}\right]\delta t + \gamma\,\frac{d(x + \dot{x}\,\delta t)^T}{dx}\,\lambda[x + \dot{x}\,\delta t] \right\}   (3.4)

It would be better to call the cost-to-go function just the cost function and interpret the


integral \int_{t_0}^{t} L(x, u, \tau)\,d\tau at a current time t as the cost-so-far, where t_0 is some start time in the past. Of course, the cost-so-far will converge to the cost function for t \to \infty. The cost-to-go would then be better defined as the cost minus the cost-so-far.

If a loss (or immediate cost density) function L(x, u, t) = \gamma^{t-t_0}\phi(x, u) is used in (3.1), this is equivalent to the problem of using a loss function L(x, u, t) = \phi(x, u) in equation (3.2). This is seen by noting that from equation (3.1) together with L the cost-to-go function J is given by J(x(t)) = \int_t^{t_f} \gamma^{\tau-t_0}\phi(x, u)\,d\tau. Calculating the total differential yields (3.5) to (3.7).

dJ(t) = \left[\lim_{\delta t \to 0}\frac{J(t+\delta t) - J(t)}{\delta t}\right]dt = \left[\lim_{\delta t \to 0}\frac{-1}{\delta t}\int_t^{t+\delta t}\gamma^{\tau-t_0}\phi(x, u)\,d\tau\right]dt   (3.5)

= \gamma^{t-t_0}\left[\lim_{\delta t \to 0}\frac{-1}{\delta t}\left[\gamma^{\delta t}\phi(x, u)_{t+\delta t} + \phi(x, u)_t\right]\frac{\delta t}{2}\right]dt   (3.6)

= -\phi(x, u)\,dt   (3.7)

Since the solution is stationary, dJ(t, t_0) \overset{!}{=} dJ(t), the start time t_0 can be chosen freely (as long as it is finite, but the start time is always finite!). Therefore, it is set equal to the fixed time t in (3.5) to (3.6), achieving \gamma^{t-t_0} = 1. Further, it is noticed that \gamma^{\delta t} \doteq 1 + \ln(\gamma)\,\delta t goes faster to 1 than the functions \phi(t+\delta t) are allowed to increase or decrease when \delta t \to 0. This means that the total differentials are related by dJ = -\phi\,dt. On the other hand, from equation (3.2) together with L the cost-to-go function J is given by J(x(t)) = \int_t^{t_f}\phi(x, u)\,d\tau. Calculation of the discounted total differential (3.8) yields the same value (3.9) as for the undiscounted differential (3.5), and hence the total differentials d_\gamma J and dJ are equal, allowing this interpretation.

d_\gamma J(t) = \left[\lim_{\delta t \to 0}\frac{\gamma J(t+\delta t) - J(t)}{\delta t}\right]dt = \left[\lim_{\delta t \to 0}\frac{-1}{\delta t}\int_t^{t+\delta t}\phi(x, u)\,d\tau\right]dt   (3.8)

= -\phi(x, u)\,dt   (3.9)

In other words, by shifting to the discounted, stationary Bellman equation (3.2), the simpler, "undiscounted" loss function L can be used to achieve the same cost-to-go function J¹. Using a plant modelled by \dot{x} = f(x, u), an immediate cost density U = \phi(x, \dot{x}) and an actor u = g(x; w_a), the critic J = J[x(.)] = J(x, \dot{x}; w_c) will be trained such that it models the long-term cost, given by \int_{t_0}^{t_f}\phi(x, \dot{x})\,dt.

A summary of the formulae describing the ACD system and some relations is given by (3.10) to (3.16).

¹To make the integral \int_t^{t_f=\infty}\gamma^{\tau-t_0}\phi(x, u)\,d\tau converge, it might be necessary to include an integration constant \phi_0, so that the stationary Bellman equation is J(x(t)) = \min_{u(t)\in A}\{L(x(t), u(t)) + \gamma\langle J(x(t+dt))\rangle - \phi_0\} and converges also in the case \gamma \to 1. The angle brackets here denote the expectation with respect to any noise in the system that affects the next state.



\dot{x} = f(x, u)   (3.10)

U = \phi(x, \dot{x})   (3.11)

u = g(x; w_a)   (3.12)

J = J(x, \dot{x}; w_c)   (3.13)

dx = \dot{x}\,dt   (3.14)

d\dot{x} = df(x, u) = \left[\frac{\partial f^T}{\partial x}\right]^T dx + \left[\frac{\partial f^T}{\partial u}\right]^T du   (3.15)

du = dg(x; w_a) = \left[\frac{\partial g^T}{\partial x}\right]^T dx   (3.16)
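As a minimal sketch, the quantities (3.10)-(3.13) can be set up as plain Python callables for a small linear plant with a quadratic cost density; the matrices A, B, Q, R and the parameter shapes below are illustrative assumptions, not values used in the thesis.

import numpy as np

A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)

def f(x, u):                  # (3.10)  x_dot = f(x, u)
    return A @ x + B @ u

def phi(x, x_dot):            # (3.11)  immediate cost density U = phi(x, x_dot)
    return float(x @ Q @ x + x_dot @ x_dot)

def g(x, wa):                 # (3.12)  actor u = g(x; w_a), linear feedback
    return wa @ x

def J_critic(x, x_dot, wc):   # (3.13)  critic J(x, x_dot; w_c), quadratic in (x, x_dot)
    z = np.concatenate([x, x_dot])
    return float(z @ wc @ z)

wa = np.zeros((1, 2))
wc = np.eye(4)
x  = np.array([1.0, 0.0])
u  = g(x, wa)
xd = f(x, u)
print(phi(x, xd), J_critic(x, xd, wc))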

3.1 Conventional Training

In the following section adaptation equations are given, based on the continuous formulation of the Bellman equation and its derivative with respect to the state x, for the specific formulation of the approximators J(x, \dot{x}) and \lambda(x, \dot{x}) as functions of the state x and its time derivative \dot{x}. Firstly, a summary is presented of the terminology and the resulting update equations introduced by Werbos [5, 25, 26] (and, for the most clear and extensive analysis, [27]).

3.1.1 Training the Critic via HDP and DHP

Werbos [27] distinguishes between pure and Galerkinized versions of HDP and DHP (and even GDHP), for short HDPG and DHPG. The difference is simply in the kind of partial or total derivatives that are used in the adaptation of the weights². Starting from a scalar target T = \frac{1}{2}e^T(t)e(t) and a learning rate \alpha > 0, weight updates are performed as defined by (3.17) and (3.18), where the equations for HDP are (3.19) to (3.21) and for DHP (3.22) to (3.24).

w(t+1) = w(t) - \alpha\,\frac{\partial T}{\partial w}, for the pure versions   (3.17)

w(t+1) = w(t) - \alpha\,\frac{\partial^+ T}{\partial w}, for the Galerkinized versions   (3.18)

²To distinguish the total from the partial derivative, the notation \frac{\partial^+(.)}{\partial(.)} is used for the total derivative.


e(t) = \gamma J(x(t+1); w) + \phi(t) - J(x(t); w) - \phi_0   (3.19)

\frac{\partial T}{\partial w} = \frac{\partial e(t)}{\partial w}\frac{\partial^+ T}{\partial e(t)} = e(t)\frac{\partial e(t)}{\partial w} = e(t)\frac{\partial(-J(x(t); w))}{\partial w}, pure HDP   (3.20)

\frac{\partial^+ T}{\partial w} = \frac{\partial^+ e(t)}{\partial w}\frac{\partial^+ T}{\partial e(t)} = e(t)\frac{\partial^+ e(t)}{\partial w} = e(t)\left(\gamma\frac{\partial^+ J(x(t+1); w)}{\partial w} - \frac{\partial^+ J(x(t); w)}{\partial w}\right), HDPG   (3.21)

e(t) = \frac{\partial^+ \phi(t)}{\partial x(t)} + \gamma\frac{\partial x^T(t+1)}{\partial x(t)}\lambda(x(t+1); w) - \lambda(x(t); w)   (3.22)

\frac{\partial T}{\partial w} = \frac{\partial e^T(t)}{\partial w}\frac{\partial^+ T}{\partial e(t)} = \frac{\partial e^T(t)}{\partial w}e(t) = \frac{\partial(-\lambda^T(x(t); w))}{\partial w}e(t), pure DHP   (3.23)

\frac{\partial^+ T}{\partial w} = \frac{\partial^+ e^T(t)}{\partial w}\frac{\partial^+ T}{\partial e(t)} = \left(\gamma\frac{\partial^+ \lambda^T(x(t+1); w)}{\partial w} - \frac{\partial^+ \lambda^T(x(t); w)}{\partial w}\right)e(t), DHPG   (3.24)
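A small numerical sketch of one critic update in pure HDP (3.20) versus Galerkinized HDP (3.21), for a critic that is linear in an assumed feature vector psi(x); the sampled transition, the feature map and all constants are illustrative, and the discount factor appearing in the Galerkinized difference follows from differentiating (3.19) totally with respect to the weights.

import numpy as np

gamma, alpha, phi0 = 0.95, 0.1, 0.0

def psi(x):                          # hypothetical feature map
    return np.array([x[0]**2, x[0]*x[1], x[1]**2, 1.0])

def J(x, w):                         # linear critic J(x; w) = w . psi(x)
    return w @ psi(x)

def hdp_updates(x_t, x_tp1, cost_t, w):
    e = gamma * J(x_tp1, w) + cost_t - J(x_t, w) - phi0        # error (3.19)
    dw_pure = -alpha * e * (-psi(x_t))                          # pure HDP (3.20)
    dw_gal  = -alpha * e * (gamma * psi(x_tp1) - psi(x_t))      # HDPG (3.21)
    return dw_pure, dw_gal

w = np.zeros(4)
dw_pure, dw_gal = hdp_updates(np.array([1.0, 0.0]), np.array([0.9, 0.1]), 0.5, w)
print(dw_pure, dw_gal)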

In [27] Werbos gives an excellent stability analysis, for the case of a linear system with additive Gaussian noise and a quadratic short-term cost function (for which the long-term cost-to-go function J is quadratic and can be calculated exactly), of all the pure and Galerkinized variants of HDP, DHP and GDHP. It turns out that all the Galerkinized versions converge to the wrong critic weights in those cases. This fact has also been observed by others [2, 28, 29, 30]. Nevertheless, the Galerkinized versions have the advantage of more robustness [27] and under some circumstances can also converge to the correct equilibrium weights. An example is the two-sample method, where the temporal error and the gradient are calculated on the basis of two independent samples [2, 27]. In some cases, for example if the underlying sampling method does not match the steady state probability distribution in a Markov chain, HDP can also diverge [11]. In the notation stemming from the continuous cost-density function \phi(x, \dot{x}), the equations for the HDP and DHP critics are given by (3.25) to (3.28), respectively. The corresponding discounted differentials are (3.29) and (3.32). Weight updates are done according to (3.36)


to (3.38) by using the ‘Bellman errors’ (3.34) and (3.35).

J(x, \dot{x}; w_c) \overset{!}{=} \int_t^{t_f}\gamma^{\tau-t_0}\phi(x, \dot{x})\,d\tau = \int_{t_0}^{t_f}\gamma^{\tau-t_0}\phi(x, \dot{x})\,d\tau - \int_{t_0}^{t}\gamma^{\tau-t_0}\phi(x, \dot{x})\,d\tau   (3.25)

\overset{\text{optimal control}}{=} \min_{u \in A}\left\{\phi(x, \dot{x})\,dt + \gamma J(x+dx, \dot{x}+d\dot{x}; w_c)\right\}   (3.26)

\lambda(x, \dot{x}; w_c) = \frac{dJ}{dx} \overset{!}{=} \int_t^{t_f}\gamma^{\tau-t_0}\left[\frac{\partial\phi}{\partial x} + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\frac{\partial\phi}{\partial \dot{x}}\right]d\tau   (3.27)

\overset{\text{optimal control}}{=} \min_{u \in A}\left\{\left[\frac{\partial\phi}{\partial x} + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\frac{\partial\phi}{\partial \dot{x}}\right]dt + \gamma\left[1 + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]dt\right]\lambda(x+dx, \dot{x}+d\dot{x}; w_c)\right\}   (3.28)

Discounted total differentials dγ :

d_\gamma J(w_c(t)) = \gamma J(t+dt; w_c(t)) - J(t; w_c(t)) \overset{!}{=} -\int_t^{t+dt}\phi(x, \dot{x})\,d\tau   (3.29)

\doteq -\phi(x, \dot{x})\,dt \doteq -\left[\phi(x+dx, \dot{x}+d\dot{x}) + \phi(x, \dot{x})\right]\frac{dt}{2}   (3.30)

\tilde{\gamma} := \gamma\left[1 + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]dt\right]   (3.31)

d_\gamma\lambda(w_c(t)) = \tilde{\gamma}\,\lambda(t+dt; w_c(t)) - \lambda(t; w_c(t)) \overset{!}{=} -\int_t^{t+dt}\left[\frac{\partial\phi}{\partial x} + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\frac{\partial\phi}{\partial \dot{x}}\right]d\tau   (3.32)

\doteq -\left[\frac{\partial\phi}{\partial x} + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\frac{\partial\phi}{\partial \dot{x}}\right]dt   (3.33)

‘Bellman errors’:

dE_J := d_\gamma J(w_c(t)) + \phi(x, \dot{x})\,dt   (3.34)

dE_\lambda := d_\gamma\lambda(w_c(t)) + \left[\frac{\partial\phi}{\partial x} + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\frac{\partial\phi}{\partial \dot{x}}\right]_t dt   (3.35)

Weight updates:

dw_{cJ} = \eta_J\,dE_J\,\frac{\partial J}{\partial w_c}   (3.36)

dw_{c\lambda} = \eta_\lambda\,\frac{\partial\lambda^T}{\partial w_c}\,dE_\lambda   (3.37)

dw_c = dw_{cJ} + dw_{c\lambda}   (3.38)
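Replacing the differentials d(.) by finite differences over a step dt, the HDP-style critic update (3.34)/(3.36) can be sketched as follows for a critic that is linear in assumed features of (x, \dot{x}); the one-step Euler transition and all constants are hypothetical placeholders.

import numpy as np

gamma, dt, eta_J = 0.99, 0.01, 0.5

def J(x, x_dot, wc):                           # critic J(x, x_dot; w_c), linear in features
    z = np.concatenate([x, x_dot, [1.0]])
    return wc @ z, z                           # value and gradient dJ/dw_c

def critic_step(x_t, xd_t, x_tp1, xd_tp1, phi_t, wc):
    J_t, grad_t = J(x_t, xd_t, wc)
    J_tp1, _    = J(x_tp1, xd_tp1, wc)
    dE_J = (gamma * J_tp1 - J_t) + phi_t * dt  # Bellman error (3.34)
    return wc + eta_J * dE_J * grad_t          # update (3.36): dw_c = eta_J dE_J dJ/dw_c

wc = np.zeros(5)
x_t, xd_t     = np.array([1.0, 0.0]), np.array([0.0, -1.0])
x_tp1, xd_tp1 = x_t + xd_t * dt, xd_t          # one Euler step of a hypothetical trajectory
print(critic_step(x_t, xd_t, x_tp1, xd_tp1, phi_t=1.0, wc=wc))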

So far this is conventional, but with the difference that instead of a direct connection from the actor to the critic, the time derivative of the state is used for the critic estimation. Also, due to the continuous formulation of the plant, the notation d(.) has been used. However,


it is easy to discretize it by replacing d(.) with \Delta(.). \eta_J and \eta_\lambda are just weighting factors in the continuous case and learning rates in the discrete case. For the analysis, they can be set to 1. Often the system model, actor and critic are modelled by one network where adaptation takes place only in one area corresponding to its function as system, actor or critic model. As was shown by Prokhorov [2], it is of advantage to feed the critic with as much information as possible, that is, actor and model reference inputs should be included in the critic input³. Therefore, a generalized MLP (or even a recurrent network), as introduced in chapter 4, is to be preferred. When using a model reference it makes sense to model at least part of the short-term cost density as a function of the deviation from the model reference \Delta r(t), and hence the long-term cost is actually a function of r(t), the actor parameters w_a and all other external inputs, and not only the current state x_0, i.e. J(x(t_0), r(t \in [t_0, \ldots, T]), w_a). Sometimes, only the control output u is used to indicate the dependence of the long-term cost on the control policy applied. However, in the case of an adaptive critic design, clearly the actor weights w_a are the independent variables and not u, because u = g(x; w_a)⁴. To get the most information to the critic network, all the independent variables should be provided for the best possible long-term cost estimate. Here, only the state derivative \dot{x} has been directly provided to the critic, apart from the usual state x, mainly to keep the equations as simple as possible. Effects from the actor have to go through the system part, which could be seen as an input of an amalgamated system-critic network.

3.1.2 Training the Actor

Training the actor is achieved by noting that in equations (3.26) and (3.28) the minimum requires the gradient with respect to the action u to be zero. Hence, the update equations are defined by (3.39) to (3.52).

³Werbos also emphasizes this fact, showing a simple way to think about this by analogy to ordinary linear-quadratic optimal control. Whenever there is a partially observed environment, those methods converge to the right answer only if a "certainty equivalence" approach using Kalman filtering is adopted. Only when there is filtered/estimated state information can the usual LQR methods go to the right answer, see also [31].

⁴Another justification for providing the critic with as much independent information as possible is that normally dim(u) < dim(w_a), hence a compression takes place, and the backpropagated adaptation signals face an ill-posed inverse problem because the direct source cannot be uniquely determined.


J_u := \frac{\partial^+ J}{\partial u} = \frac{\partial f^T}{\partial u}\frac{\partial J}{\partial \dot{x}} \overset{!}{=} 0   (3.39)

\lambda_u := \frac{\partial^+ \lambda^T}{\partial u} = \frac{\partial f^T}{\partial u}\frac{\partial \lambda^T}{\partial \dot{x}} \overset{!}{=} 0   (3.40)

\gamma_u := \gamma\,\frac{\partial u^T(t+dt)}{\partial u(t)} = 1 + \frac{\partial f^T(t)}{\partial u(t)}\frac{\partial g^T(t+dt)}{\partial x(t+dt)}\,dt   (3.41)

d_\gamma J_u(w_a(t)) = \gamma_u J_u(t+dt; w_a(t)) - J_u(t; w_a(t)) \overset{!}{=} -\int_t^{t+dt}\frac{\partial^+\phi(x, \dot{x})}{\partial u}\,d\tau   (3.42)

\doteq -\frac{\partial^+\phi(x, \dot{x})}{\partial u}\,dt = -\frac{\partial f^T}{\partial u}\frac{\partial\phi}{\partial \dot{x}}\,dt   (3.43)

dE_1(w_a(t)) = -J_u(w_a(t)) = -\frac{\partial^+ J}{\partial u} = -\frac{\partial f^T}{\partial u}\frac{\partial J}{\partial \dot{x}}   (3.44)

dE_2(w_a(t)) = -\left[\gamma_u J_u(t+dt; w_a(t)) + \frac{\partial f^T}{\partial u}\frac{\partial\phi}{\partial \dot{x}}\,dt\right]   (3.45)

= -\frac{\partial f^T}{\partial u}\left[\frac{\partial g^T(t+dt)}{\partial x(t+dt)}J_u(t+dt; w_a(t)) + \frac{\partial\phi}{\partial \dot{x}}\right]dt   (3.46)

\overset{!}{=} -J_u(w_a(t))   (3.47)

dE(w_a(t)) = d_\gamma J_u(w_a(t)) + \frac{\partial f^T}{\partial u}\frac{\partial\phi}{\partial \dot{x}}\,dt   (3.48)

= dE_1(w_a(t)) - dE_2(w_a(t)) + 2\,\frac{\partial f^T}{\partial u}\frac{\partial\phi}{\partial \dot{x}}   (3.49)

dw_{a1,J_u} = \eta_{J_u}\,\frac{\partial g^T}{\partial w_a}\,dE_1(w_a(t))   (3.50)

dw_{a2,J_u} = \eta_{J_u}\,\frac{\partial g^T}{\partial w_a}\,dE_2(w_a(t))   (3.51)

dw_{a,J_u} = \eta_{J_u}\,\frac{\partial g^T}{\partial w_a}\,dE(w_a(t))   (3.52)

Equation (3.51) should preferably be used, as it includes the derivative calculation through the system (by \tilde{\gamma}, given by (3.31)) and the derivative calculation of the current utility. With (3.52) there might be a problem, as the error dE is basically the difference of two quantities that both go to zero (see (3.42)).

Since \lambda_u is already an n \times m matrix, where m is the dimension of the vector x, calculation of the derivative with respect to the n-dimensional vector u would yield an n \times m \times n tensor. As the number of operations involved in the multiplication of tensors grows exponentially with the tensor orders, this approach is not taken and only the gradient \partial J/\partial u is used.


3.2 Training with Consideration of the Euler Equations

3.2.1 Actor Training

Starting with the actor training, because it follows naturally from the equations, the ACD framework is given by the equations (3.10), (3.11), (3.12) and (3.13). The goal of the critic J(x, \dot{x}; w_c) is to yield a good approximation of the cost-to-go function J[x(.)], which has to be minimized. As has been seen previously, for J[x(.)] to be minimized (or maximized), x(.) must be an extremal, meaning it has to satisfy the Euler equations (1.2). In the above case with \phi = \phi(x, \dot{x}) they are

\frac{d}{dt}\left[\frac{\partial\phi}{\partial \dot{x}}\right] - \frac{\partial\phi}{\partial x} = \frac{\partial^2\phi}{\partial x\,\partial \dot{x}}\dot{x} + \frac{\partial^2\phi}{\partial \dot{x}^2}\ddot{x} - \frac{\partial\phi}{\partial x} \overset{!}{=} 0.

It is easily seen that left multiplying by \dot{x}^T is equivalent to \frac{d}{dt}\left[\dot{x}^T\frac{\partial\phi}{\partial \dot{x}} - \phi\right] = 0, which has the property (3.53).

C(w_a) := \dot{x}^T\frac{\partial\phi}{\partial \dot{x}} - \phi = \text{constant}.   (3.53)

For a given \phi, the only possibility to satisfy (3.53) is to have appropriate controls u = g(x; w_a). This is emphasized by the notation C(w_a), which states that only appropriate changes in the actor weights can achieve this.

Using \frac{dC}{dt} \overset{!}{=} 0 leads to an update according to (3.54) to (3.57).

C_{w_a} := \frac{\partial C}{\partial w_a} = \frac{\partial g^T}{\partial w_a}\frac{\partial C}{\partial u}   (3.54)

\frac{\partial C}{\partial u} = \frac{\partial f^T}{\partial u}\frac{\partial C}{\partial \dot{x}} = \frac{\partial f^T}{\partial u}\left[\frac{\partial\phi}{\partial \dot{x}} + \left[\dot{x}^T\frac{\partial^2\phi}{\partial \dot{x}^2}\right]^T - \frac{\partial\phi}{\partial \dot{x}}\right] = \frac{\partial f^T}{\partial u}\frac{\partial^2\phi}{\partial \dot{x}^2}\,f   (3.55)

dC = C(t+dt; w_a(t)) - C(t; w_a(t)) \overset{!}{=} d\left(\dot{x}^T\frac{\partial\phi}{\partial \dot{x}}\right) - d\phi = \left[\dot{x}^T\frac{\partial\phi}{\partial \dot{x}} - \phi\right]_{t+dt} - \left[\dot{x}^T\frac{\partial\phi}{\partial \dot{x}} - \phi\right]_t   (3.56)

dw_a = \eta\,dC\,\frac{\partial C}{\partial w_a}   (3.57)
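For a cost density of the form \phi(x, \dot{x}) = A(x) + \dot{x}^T B\dot{x} (the case discussed below), the update (3.54)-(3.57) becomes particularly simple. The following scalar Python sketch, in which the plant, actor and all constants are illustrative assumptions, computes C, the error dC along a simulated trajectory, and the resulting actor weight change.

import numpy as np

# scalar plant x_dot = a*x + b*u, actor u = wa*x, cost phi = q*x**2 + r*x_dot**2
a, b, q, r = -0.5, 1.0, 1.0, 0.5
dt, eta = 0.01, 0.1

def f(x, u):      return a * x + b * u
def g(x, wa):     return wa * x
def C(x, x_dot):  return r * x_dot**2 - q * x**2     # x_dot*dphi/dx_dot - phi, eq. (3.53)

def euler_actor_step(x, wa):
    u       = g(x, wa)
    x_dot   = f(x, u)
    x_next  = x + x_dot * dt                          # one Euler step of the trajectory
    xd_next = f(x_next, g(x_next, wa))
    dC      = C(x_next, xd_next) - C(x, x_dot)        # error (3.56)
    dC_du   = b * (2 * r) * x_dot                     # df/du * d2phi/dx_dot2 * f, eq. (3.55)
    C_wa    = x * dC_du                               # dg/dwa * dC/du, eq. (3.54)
    return wa + eta * dC * C_wa, x_next               # update (3.57)

wa, x = -0.2, 1.0
for _ in range(3):
    wa, x = euler_actor_step(x, wa)
print(wa, x)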

Again, \eta can be assumed to be 1 for the continuous case, or a learning rate in the discrete case with dt \to \Delta t. It should be noted that \frac{\partial C(w_a)}{\partial w_a} \overset{!}{=} 0 could also be used; however, it would involve tensors of order 3. This is analogous to the discussion of the conventional actor training.

As long as the Hessian \frac{\partial^2\phi}{\partial \dot{x}^2} can be calculated easily, the given update rules are not difficult to compute. For many problems the cost density \phi might be given in the form \phi(x, \dot{x}) = A(x) + \dot{x}^T B\dot{x}, where the Hessian with respect to \dot{x} is just the constant matrix B + B^T. While equation (3.53) is considerably simpler than equation (1.2), there is a drawback in that equation (3.53) does not imply equation (1.2), only vice versa. The reason is the inner product between \dot{x} and the Euler equations, and therefore equivalence holds


only for this inner product with the time derivative of equation (3.53). Thus, training an actor according to equation (3.57) might not satisfy the Euler equation and therefore not minimize the cost functional. Nevertheless, an optimal solution also has to satisfy equation (3.53). This problem, due to the inner product, might also be dampened by averaging over many trajectories starting at different points, as is done with adaptive critics, and thus it might not be too severe.

Of course, the gradient \partial J/\partial u of the critic with respect to the action u still has to be zero to minimize the cost-to-go function J. However, using the additional weight update according to (3.54)-(3.57) in conjunction with the conventional adaptive critic training will reduce the set of trajectories x(t) to 'more' extremal ones, which is a drastic reduction of all the possible ones. It is hoped that this will accelerate actor training. This is investigated more closely in chapter 5.

3.2.2 Critic Training

While in the actor training an adaptation dw_a in the direction of \frac{\partial C}{\partial w_a} is applied, a critic training could be tried that enforces the vector \frac{\partial J(w_c)}{\partial w_a} to point in the same direction as dw_a. Of course, a trained actor will have dw_a = 0, which means either dC = 0 or \frac{\partial C}{\partial w_a} = 0. Applying this training should achieve \frac{\partial J}{\partial w_a} = 0 as well. In the case of dC = 0 it is possible to go in any direction, allowing for the achievement of a minimum, and not only an extremum with the choice of \frac{\partial J}{\partial w_a} = 0, as in the conventional training of section 3.1.2.

The critic training equations are based on minimizing the square norm of the 'Euler error' E_a and are defined by (3.58) to (3.63).

\frac{\partial J(w_c(t))}{\partial w_a} \overset{!}{=} \frac{\partial C}{\partial w_a}   (3.58)

E_a(w_c(t)) := \frac{\partial C}{\partial w_a} - \frac{\partial J(w_c(t))}{\partial w_a} = \frac{\partial g^T}{\partial w_a}\left[\frac{\partial C}{\partial u} - \frac{\partial J}{\partial u}\right]   (3.59)

= \frac{\partial g^T}{\partial w_a}\frac{\partial f^T}{\partial u}\left[\left[\frac{\partial\phi}{\partial \dot{x}} + \frac{\partial^2\phi}{\partial \dot{x}^2}f - \frac{\partial\phi}{\partial \dot{x}}\right] - \left[\frac{\partial J}{\partial \dot{x}}\right]\right]   (3.60)

T := \frac{1}{2}E_a(w_c(t))^T E_a(w_c(t))   (3.61)

dw_c = -\eta_c\,\frac{\partial E_a^T}{\partial w_c}\,E_a(w_c(t))   (3.62)

= \eta_c\,\frac{\partial^2 J}{\partial w_c\,\partial w_a}\,E_a(w_c(t))   (3.63)

However, this approach to critic training makes use of second order derivatives and thus does not meet the goal of avoiding the computation of second order derivatives⁵.

⁵At least at this stage. Later, in section 4.4.1, this goal is relaxed and the additional complexity of calculating second order derivatives will be accepted to benefit from the fast convergence of Newton's method.


3.3 J − λ Consistency

As \lambda is defined as \frac{dJ}{dx}, and due to the proposition of Poincaré, which states that applying the exterior derivative twice to a differential form always yields zero, this means d(\lambda^T dx) = 0 and thus implies \frac{\partial^2 J}{\partial x_i\,\partial x_j} = \frac{\partial^2 J}{\partial x_j\,\partial x_i}. Therefore, the J-\lambda consistency can be formulated for the approximating functions as defined by (3.64) to (3.67).

dJ \overset{!}{=} \lambda^T dx   (3.64)

dE_{J-\lambda} = dJ - \lambda^T dx = -\phi(x, \dot{x})\,dt - \lambda^T f\,dt, or   (3.65)

\frac{dE_{J-\lambda}}{dt} = -\phi(x, \dot{x}) - \lambda^T f   (3.66)

\frac{\partial^2 J}{\partial x_i\,\partial x_j} \overset{!}{=} \frac{\partial^2 J}{\partial x_j\,\partial x_i} = \frac{d\lambda^T}{dx} \overset{!}{=} \left[\frac{d\lambda^T}{dx}\right]^T   (3.67)

This is important when combining a derivative (\lambda-)critic with an ordinary one, because the \lambda-critic has to learn an integrable function. However, at this stage it is merely a note.

3.4 J − λ Regularization

Assume two independent approximators J = J(x, \dot{x}; w_J) and \lambda = \lambda(x, \dot{x}; w_\lambda) are given. Then, to achieve the J-\lambda consistency, a regularization network F(x, \dot{x}, J; w_r) could be introduced which has the goal of learning the function y(x). See figure 3.2.

Differentiation of F with respect to x yields (3.68), which can be solved for \frac{dJ}{dx} (3.69), thus giving a goal for \lambda = \frac{dJ}{dx}.

\frac{dF}{dx} = \frac{\partial F}{\partial x} + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\frac{\partial F}{\partial \dot{x}} + \frac{\partial F}{\partial J}\frac{dJ}{dx} = y'(x), or   (3.68)

\frac{dJ}{dx} = \left[y'(x) - \left[\frac{\partial F}{\partial x} + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\frac{\partial F}{\partial \dot{x}}\right]\right]\frac{1}{\partial F/\partial J}   (3.69)

If y(x) := J is chosen, and if y'(x) is used for the approximation \lambda(x, \dot{x}; w_\lambda), then an error E_\lambda is obtained to train the weights w_\lambda. To train the weights w_J, the error E_{J_{reg}} (3.70) is obtained simply by taking the difference of the regularization network and the J approximator, and likewise the error E_{\lambda_{reg}} (3.71) is used for the \lambda-critic.

E_{J_{reg}} := F - J   (3.70)

E_{\lambda_{reg}} := \frac{dJ}{dx} - \lambda = -\left[\frac{\partial F}{\partial x} + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\frac{\partial F}{\partial \dot{x}}\right]\frac{1}{\partial F/\partial J}   (3.71)

So far no "knowledge" of the \lambda approximator has been included in the regularization network F. It is desired to learn the "true" function J[x], but only the J approximation


has been used instead. Therefore the consistency condition dJ = \lambda^T dx has been added, and the error dE_{J-\lambda} of (3.65) is used. Thus, the error for the regularization network is defined by (3.72), which equals (3.73).

E_F := J - F + dE_{J-\lambda} = dE_{J-\lambda} - E_{J_{reg}}   (3.72)

= \left[-\phi(x, \dot{x}) - \lambda^T f\right]dt - E_{J_{reg}}   (3.73)

Note that the other consistency condition has been added implicitly by training the \lambda approximator with the goal \frac{dJ}{dx}, which must be integrable.

The regularization network can be thought of, together with the J and \lambda approximators, as a single network that is capable of learning in a GDHP fashion without having to calculate second order derivatives explicitly. To get reasonable J and \lambda approximations they might be trained concurrently with the traditional training of section 3.1. A summary of the training equations is given by (3.74) to (3.86).

d_\gamma J(w_J(t)) = \gamma J(t+dt; w_J(t)) - J(t; w_J(t))   (3.74)

\tilde{\gamma} := \gamma\left[1 + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]dt\right]   (3.75)

d_\gamma\lambda(w_\lambda(t)) = \tilde{\gamma}\,\lambda(t+dt; w_\lambda(t)) - \lambda(t; w_\lambda(t))   (3.76)

E_{J_{HDP}} = d_\gamma J(w_J(t)) + \phi(x, \dot{x})\,dt   (3.77)

E_{J_{reg}} = F(x, \dot{x}, J; w_r) - J(x, \dot{x}; w_J)   (3.78)

E_{J_{total}} = E_{J_{HDP}} + E_{J_{reg}}   (3.79)

E_F = \left[-\phi(x, \dot{x}) - \lambda^T f\right]dt - E_{J_{reg}}   (3.80)

E_{\lambda_{DHP}} = d_\gamma\lambda(w_c(t)) + \left[\frac{\partial\phi}{\partial x} + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\frac{\partial\phi}{\partial \dot{x}}\right]_t dt   (3.81)

E_{\lambda_{reg}} = \frac{dJ}{dx} - \lambda = -\left[\frac{\partial F}{\partial x} + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\frac{\partial F}{\partial \dot{x}}\right]\frac{1}{\partial F/\partial J}   (3.82)

E_{\lambda_{total}} = E_{\lambda_{DHP}} + E_{\lambda_{reg}}   (3.83)

dw_r = \eta_F\,E_F\,\frac{\partial F}{\partial w_r}   (3.84)

dw_J = \eta_J\,E_{J_{total}}\,\frac{\partial J}{\partial w_J}   (3.85)

dw_\lambda = \eta_\lambda\,\frac{\partial\lambda^T}{\partial w_\lambda}\,E_{\lambda_{total}}   (3.86)

Note that the terms in brackets of (3.81) and (3.82) are equal, and both are 0 if the training is perfect, because then F = \hat{J} = J, dJ = -\phi\,dt and \frac{\partial F}{\partial J} = 1. The fact that they are 0 is seen in equations (3.82) and (3.68), respectively.


As appealing as the idea of using implicit differentiation might be, the training of an additional network has proven far too difficult. This fact was also noticed by Danil Prokhorov, who remarked that people already had trouble training two networks, the critic and the actor. Furthermore, it is much simpler to use a regularization via a cost function than a regularization network. A successful approach to a GDHP-style training via a cost function was taken by Drucker and Le Cun [32].

3.5 Why Continuous Time ACD

The idea of using the Euler equations and the calculus of variations to adapt controller and critics in an ACD framework is based on the same objective of minimizing a long-term cost functional⁶. Analysis of the general system is done more easily with the tools of traditional continuous calculus than in a discrete domain. Therefore, the ideas were investigated in a continuous domain. Traditionally, discretization methods could then be applied to transfer results from the continuous to a discrete domain. Maybe this is a somewhat biased view, but it was the one chosen. Advantages are that the continuous calculus is well developed and many very strong integration routines exist which can be used to make a connection to the discrete world. Especially, variable step-size integrators are well developed and have significant speed advantages because they can adapt the step-size to an appropriate length, whereas in discrete systems the step-size has to be chosen for the worst-case scenario. That is, the sampling frequency has to be higher than the Nyquist frequency, which is twice the maximal allowed signal frequency. In contrast, variable step-size integrators automatically vary the sampling frequency as necessary. This results in considerable speed-ups, as no time is wasted when the system's signals are of low frequency, while accuracy is still achieved when high frequencies are involved, by lowering the step-size. This is a kind of automatically built-in understanding, as the algorithm notices when something happens and when to focus, and does not blindly waste time when the system's states only change slowly. This turned out to be an advantage in section 4.4.

Another reason is that not many people have analyzed continuous-time ACDs. Paul Werbos includes some analysis for continuous ACDs in [27]. Jeff Dalton of the University of Missouri-Rolla also analyzed critic based systems in continuous time in his dissertation [33]. Randal Beard of Brigham Young University proposed incremental approximations of the Hamilton-Jacobi-Bellman equation using polynomial function bases [34], rather than neural networks⁷.

⁶At least for the deterministic case. For the stochastic case the calculus of variations encounters additional difficulties, as disturbances have to be treated specially, unlike with dynamic programming, where Bellman's optimality principle accounts for those automatically.

⁷Thanks to Danil V. Prokhorov for pointing out the last two references.


Figure 3.1: Graphical representation of an ACD. Here, not only the state but also its derivative are inputs to the critic. A single network having dedicated areas for the functional blocks of plant, actor and critic can be used. In [2], Prokhorov showed that it is of advantage to input an extended state vector consisting of all available information of state, control and even model reference inputs to the critic. This is intuitively clear, as when adapting backwards, long-term information from J or \lambda has to be "squeezed" through subspaces of lower dimensions when only partial state, control or reference information is used. However, sometimes it makes sense not to use all the information, e.g. because it might be simply difficult to access, especially in technical systems that might be built of certain encapsulated blocks. Here, only the state derivative was used, and extensions are straightforward. Similar to this graph, Prokhorov has combined critics approximating J and \lambda in [3].


[Figure 3.2 block diagram: the regularization network F(x, \dot{x}, \hat{J}; w_r), the approximators \hat{J}(x, \dot{x}; w_J) and \hat{\lambda}(x, \dot{x}; w_\lambda), the system plant \dot{x} = f(x, u) and the actor/controller u = g(x; w_a), connected through the error signals E_{J_{HDP}}, E_{J_{reg}}, E_{J_{total}}, E_{\lambda_{DHP}}, E_{\lambda_{reg}}, E_{\lambda_{total}} and E_F, with the signal -H\,dt = [\phi + \lambda^T f]\,dt and the gradient \partial H/\partial u.]

Figure 3.2: Graphical representation of the training of the regularization network and the J and \lambda approximators. Together they achieve a GDHP-style training.


Chapter 4

Dynamic Programming, Adaptive Critics, Simultaneous Recurrent Neural Networks, Total Ordered Derivatives

There are some inherent similarities between dynamic programming (DP), adaptive critic designs (ACD) and simultaneous recurrent neural networks (SRN) and their training via the backpropagation algorithm. Until now backpropagation has been one of the most successful algorithms to train a neural network, especially the most widely used Multi Layer Perceptron (MLP). Mathematically the backpropagation algorithm is simply the chain rule for calculating derivatives of a scalar target with respect to any quantity the target depends upon. There are many formulations of this algorithm and many people have discovered it in their own right. However, the most compact and beautiful notation goes back to Werbos, who sees the neural network as a set of ordered functions, which are functions that only depend on function outputs with a lower ordering index. The ordering can be of a structural and temporal nature.

It turns out that formulating the target function of a simultaneous recurrent network is the same as formulating the long-term cost in dynamic programming, though their interpretation is quite different. Nevertheless, their formulae are basically the same, and this gave the inspiration to connect adaptive critics with recurrent networks and backpropagation [2]. A recurrent network and its connection to dynamic programming is discussed, its derivatives interpreted and compared with a forward perturbation method called Real-Time Recurrent Learning (RTRL). The Generalized MLP (GMLP) introduced by Werbos is also reinterpreted to easily accommodate Wan's FIR-MLP [35]. Then a connection is made to the continuous case, a novel integral formulation for 'total ordered derivatives' is made and applied to adaptive critics. Using second order derivatives, Newton's algorithm can be utilized to achieve fast convergence. Also, a suggestion for 'almost concurrent adaptation'


of actor and critic is made. To start with, the backpropagation algorithm to calculate sensitivities of a target with respect to some input quantity is investigated, and also how the training equations can be discretized.

4.1 Discretization of the Training Equations

So far the analysis in chapters 1-3 has concentrated mainly on continuous systems with infinitesimal (in practice sufficiently short) one-step time \delta t. Often it is desired to use the Bellman principle of optimality for larger time steps \Delta t, for which \delta t would have to be replaced by \Delta t in equation (1.10). The approximation of the integral in (1.12) is then obviously not valid anymore. However, it is always possible to integrate numerically. Splitting up the interval \Delta t := \sum_{i=1}^{N}\delta t_i with \delta t_i := t_{i+1} - t_i, sufficiently small \delta t_i can be obtained as indicated by (4.1).

obtained as indicated by (4.1).

∫ t+∆t

tf(τ)dτ =

N∑

i=1

∫ ti+δti

ti

f(τ)dτ.=

N∑

i=1

f(ti)δti (4.1)

If a chain of function blocks z_i is given, where the output of the previous block is the input of the following (i.e. z_i = z_i(z_{i-1})), sensitivities can be calculated easily as defined by (4.2).

\frac{\partial^+ z_i^T}{\partial z_0} = \left[\frac{\partial z_1^T}{\partial z_0}\right]\cdot\left[\frac{\partial z_2^T}{\partial z_1}\right]\cdots\left[\frac{\partial z_i^T}{\partial z_{i-1}}\right] =: \prod_{j=0}^{i-1}\left[\frac{\partial z_{j+1}^T}{\partial z_j}\right]   (4.2)

Here, \prod_{j=0}^{i-1} denotes an (i-1)-times right multiplication and \partial^+ emphasizes the total derivative. The one-step cost U can be written as (4.3), and for a step from t_i to t_i + \delta t_i the relation is x_{i+1} = x_i + f(x_i, u_i)\,\delta t_i, with x_i := x(t_i) and u_i = g(x_i), so the sensitivity can be calculated as defined by (4.4) to (4.6).

U(t) = \int_t^{t+\Delta t}\phi(x(\tau), \dot{x}(\tau))\,d\tau \doteq \sum_{i=0}^{N-1}\phi(x(t_i), \dot{x}(t_i))\,\delta t_i = \sum_{i=0}^{N-1}\phi(z_i)\,\delta t_i = \sum_{i=0}^{N-1}\phi(x_i)\,\delta t_i   (4.3)

\frac{\partial U(t)}{\partial z_0} = \sum_{i=0}^{N-1}\frac{\partial^+ z_i^T}{\partial z_0}\frac{\partial\phi(z_i)}{\partial z_i}\,\delta t_i   (4.4)

\frac{\partial U(t)}{\partial x_0} = \sum_{i=0}^{N-1}\frac{\partial^+ x_i^T}{\partial x_0}\frac{\partial\phi(x_i)}{\partial x_i}\,\delta t_i = \sum_{i=0}^{N-1}\frac{\partial^+ x_i^T}{\partial x_0}\left[\frac{\partial\phi}{\partial x} + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\frac{\partial\phi}{\partial \dot{x}}\right]_{t_i}\delta t_i   (4.5)

\frac{\partial^+ x_i^T}{\partial x_0} = \prod_{j=0}^{i-1}\left[\frac{\partial x_{j+1}^T}{\partial x_j}\right] = \prod_{j=0}^{i-1}\left[1 + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]_{t_j}\delta t_j\right]   (4.6)
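The product (4.6) is easy to evaluate numerically. The following sketch accumulates the one-step Jacobians for an assumed linear plant with linear state feedback (written here without the transposes of (4.6)) and checks one column of the resulting sensitivity of x_N with respect to x_0 against a finite-difference estimate; all matrices and constants are illustrative.

import numpy as np

A  = np.array([[0.0, 1.0], [-2.0, -0.3]])
B  = np.array([[0.0], [1.0]])
K  = np.array([[-1.0, -0.5]])                 # actor u = g(x) = K x
dt = 0.01
N  = 200

def step(x):
    u = K @ x
    return x + (A @ x + B @ u) * dt           # x_{i+1} = x_i + f(x_i, u_i) * dt

def sensitivity(x0):
    S = np.eye(2)                              # d x_i / d x_0, accumulated as in (4.6)
    x = x0.copy()
    J_step = np.eye(2) + (A + B @ K) * dt      # one-step Jacobian (constant for this plant)
    for _ in range(N):
        x = step(x)
        S = J_step @ S
    return x, S

x0 = np.array([1.0, 0.0])
xN, S = sensitivity(x0)
eps = 1e-6
xN_pert, _ = sensitivity(x0 + np.array([eps, 0.0]))
print(S[:, 0], (xN_pert - xN) / eps)           # the two columns should agree closely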

Page 66: Approximate Dynamic Programming with Adaptive …...of adaptive critic designs (ACDs). Dynamic programming can be used to flnd an optimal decision or control policy over a long-term

CHAPTER 4. DP, ACD, SRN, TOTAL ORDERED DERIVATIVES 48

A more elaborate calculation is presented in appendix A, where the result given by (B.32) is the partial derivative of (4.3) with respect to u(t) at time t_0, or (4.4) with z_0 := u(t = t_0), respectively.

4.2 Discrete Dynamic Programming

Discrete dynamic programming can be seen as an N-step process for a finite horizon or, in the limit as N \to \infty, as a process of infinite duration. In the latter case it is necessary to introduce a damping factor \gamma < 1 to ensure that the sum is bounded. Given the interest rate p, it can be interpreted as \gamma = 1/(1+p), meaning that next step costs can be reduced to the current time by investing future payments with an interest rate p from the current time on, such that they will grow to the actual future step costs.

[Figure 4.1 shows an N-step chain x_0 \to x_1 \to \cdots \to x_N with transfer functions T_1, \ldots, T_N and decisions \pi_1, \ldots, \pi_N, where
x_j: state vector at time t_j;
\pi_j: decision vector made at time t_j;
x_j = T_j(x_{j-1}, \pi_j): transfer function at step j (time t_j);
J(x_0, \ldots, x_N; \pi_1, \ldots, \pi_N): total cost for the whole N-step process going from state x_0 to x_N making the decisions \pi_1, \ldots, \pi_N; J shall be minimized, making \pi_1, \ldots, \pi_N optimal.]

Figure 4.1: Discrete Dynamic Programming model. Assuming a stationary policy and one-step costs E, there is a total cost function J_\pi for every policy \pi. An optimal policy \pi_{opt} will have a minimal total cost function J_{\pi_{opt}}.

Figure 4.1 shows the model. The equations for the total costs, following a policy \pi and \pi_{opt}, are (4.7) and (4.9).



J_\pi(x(k)) = \sum_{t=k}^{N}\gamma^{t-k}E(x(t)), applying \pi = \pi(x(t))   (4.7)

\pi_{opt} = \arg\min_{\pi(x(t))} J_\pi(x(k))   (4.8)

J_{\pi_{opt}}(x(k)) = \sum_{t=k}^{N}\gamma^{t-k}E(x(t)), applying \pi_{opt} = \pi_{opt}(x(t)), which minimizes J.   (4.9)

4.3 Simultaneous Recurrent Network

A simultaneous recurrent network can be conveniently composed of a feedforward network with some delay lines from its nodes to some virtual internal inputs, see figure 4.2. However, sometimes people would regard the model in figure 4.2 more as a time-lagged recurrent neural network (TLRNN) rather than as an SRN. This is because the SRN is normally seen as a settling network which has to settle to a stable state for the given input x_i^{ext}(t) until the next input at time t+1 is presented. Nevertheless, without loss of generality the external signals x_i^{ext}(t) could be downsampled and stay constant over the settling period of the SRN. In this case the delay block z^{-1} corresponds to the sampling frequency 1/\delta t of the settling dynamics and is much higher than the sampling frequency 1/\Delta t of the external signal¹.

The most general (ordered) feedforward network can be seen as a multilayer perceptron (MLP) which also has connections among nodes on the same layer, although only from lower indexed nodes to higher ones. Werbos calls this the generalized MLP (GMLP). The mathematical treatment is actually simpler, and a conventional MLP is simply achieved by setting the connection weights among nodes on the same layer to zero. The mathematical description of the SRN is defined by (4.10) to (4.13).

x_i(t) := x_i^{ext}(t), \quad 1 \le i \le m   (4.10)

net_i(t) := \sum_{j=1}^{i-1} w_{ij}(t)\,x_j(t) + \sum_{j=m+1}^{N} w^1_{ij}(t)\,x_j(t-1), \quad m+1 \le i \le N   (4.11)

x_i(t) := f_i(net_i(t)), \quad m+1 \le i \le N   (4.12)

Tar = Tar(k; h) = \sum_{t=k}^{k+h}\gamma^{t-k}E(t), with E(t) = \frac{1}{2}\sum_{i=m+1}^{N}(d_i(t) - x_i(t))^2   (4.13)

Comparing the cost function (4.9) of the optimal N-stage discrete dynamic programming process and the target function (4.13), it is clear that a trained SRN with the objective to minimize a target function of "length" h implements a discrete dynamic programming

¹Also the target function (4.13) should then be downsampled, as the total error is mainly of interest as a sum of errors sampled at a lower frequency, e.g. 1/\Delta t, when the network has settled. This can easily be achieved by choosing the desired signal d_i(t) = x_i(t) during the settling time, which makes E(t) zero for t \ne n\Delta t, n \in \mathbb{Z}.


Figure 4.2: The SRN model. It has a total of N nodes, where the first m nodes are the external inputs x_1, \ldots, x_m and the nodes indexed by m+1 through N are internal nodes. The output nodes can be arbitrarily selected from the internal nodes; they can be regarded simply as the internal nodes without limitation. In addition to the external inputs, the internal nodes have their delayed values as further inputs, as well as the current output of lower indexed nodes. An additional node x_0 \equiv 1 can be added to provide some bias, or, alternatively, x_1(t) := 1 can be chosen.

process, where the transition function and cost function are implicitly represented in thenetwork2.

4.3.1 Backpropagation Through Time and Chain Rule for Derivatives

The recursive networks introduced above can be trained by the traditional backpropagation algorithm, which comes in many varieties but can be best summarized as backpropagation through time (BPTT). BPTT is basically the chain rule applied to the ordered network structure and an unfolding thereof in time, see figure 4.3. Mathematically it can be stated as (4.14), where ld(x_i(t)) denotes the 'later dependencies' of x_i(t). The notation F\,x_i was introduced as a notational simplification by Werbos, and the symbol \frac{\partial^+}{\partial(.)} denotes the total derivative, in contrast to the partial one without the "+" sign.

F\,x_i \equiv \frac{\partial^+ Tar}{\partial x_i(t)} = \frac{\partial Tar}{\partial x_i(t)} + \sum_{ld(x_i(t))}\frac{\partial^+ Tar}{\partial\,ld(x_i(t))}\frac{\partial\,ld(x_i(t))}{\partial x_i(t)}   (4.14)

For the given simultaneous recurrent network the truncated backpropagation through time

²However, as Werbos pointed out, this is only true in the deterministic case.


Figure 4.3: A simple flow diagram that demonstrates the forward influence of a function block evaluated at consecutive time steps onto a target quantity. To work out the sensitivities of certain quantities, state vectors at a certain time, or parameters, backpropagation through time can be used (dashed paths). It is basically applying the chain rule for derivatives.

algorithm BPTT (h) with truncation depth h yields (4.15) to (4.20).

F\,x_i(t) = -\gamma^{t-k}(d_i(t) - x_i(t)) + \sum_{j=i+1}^{N}F\,x_j(t)\,f'_j(net_j(t))\,w_{ji}(t) + \sum_{j=m+1}^{N}F\,x_j(t+1)\,f'_j(net_j(t+1))\,w^1_{ji}(t+1)   (4.15)

F\,x_i(t) \equiv 0 \quad \forall t > k+h   (4.16)

F\,w_{ij}(t) = F\,x_i(t)\,f'_i(net_i(t))\,x_j(t)   (4.17)

F\,w^1_{ij}(t) = F\,x_i(t)\,f'_i(net_i(t))\,x_j(t-1)   (4.18)

\Delta w_{ij}(k) = -\eta\,F\,w_{ij}(k)   (4.19)

\Delta w^1_{ij}(k) = -\eta\,F\,w^1_{ij}(k)   (4.20)
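The following compact sketch implements the forward pass (4.10)-(4.12) and the truncated BPTT(h) recursion (4.15)-(4.20) for a tiny SRN. The sizes, random weights, tanh activations and the external/desired signals are illustrative assumptions; indices are 0-based, with nodes 0..m-1 external and m..N-1 internal.

import numpy as np

rng = np.random.default_rng(0)
m, N, h, k = 2, 5, 3, 1
gamma, eta = 0.9, 0.05

W  = np.tril(rng.normal(0, 0.3, (N, N)), -1)     # w_ij: same-time weights, only j < i
W[:m, :] = 0.0                                    # external nodes have no incoming weights
W1 = np.zeros((N, N))
W1[m:, m:] = rng.normal(0, 0.3, (N - m, N - m))   # w1_ij: delayed internal inputs

T_end = k + h + 1
x_ext = rng.normal(0, 1.0, (T_end + 1, m))        # external inputs over the horizon
d     = rng.normal(0, 1.0, (T_end + 1, N))        # desired values for the internal nodes

def forward(T):
    x, net = np.zeros((T + 1, N)), np.zeros((T + 1, N))
    for t in range(1, T + 1):
        x[t, :m] = x_ext[t]                                              # (4.10)
        for i in range(m, N):
            net[t, i] = W[i, :i] @ x[t, :i] + W1[i, m:] @ x[t - 1, m:]   # (4.11)
            x[t, i]   = np.tanh(net[t, i])                               # (4.12)
    return x, net

def bptt(x):
    T  = k + h
    Fx = np.zeros((T + 2, N))                     # F x_i(t); stays zero for t > k+h (4.16)
    fp = 1.0 - x**2                               # tanh'(net_i(t)) = 1 - x_i(t)**2
    for t in range(T, k - 1, -1):
        for i in range(N - 1, m - 1, -1):         # backwards in time and structure (4.15)
            Fx[t, i]  = -gamma**(t - k) * (d[t, i] - x[t, i])
            Fx[t, i] += Fx[t, i + 1:] @ (fp[t, i + 1:] * W[i + 1:, i])
            Fx[t, i] += Fx[t + 1, m:] @ (fp[t + 1, m:] * W1[m:, i])
    dW, dW1 = np.zeros_like(W), np.zeros_like(W1)
    for i in range(m, N):                         # weight gradients at time k (4.17)-(4.18)
        dW[i, :i]  = Fx[k, i] * fp[k, i] * x[k, :i]
        dW1[i, m:] = Fx[k, i] * fp[k, i] * x[k - 1, m:]
    return -eta * dW, -eta * dW1                  # updates (4.19)-(4.20)

x, net = forward(T_end)
dW, dW1 = bptt(x)
print(dW.round(4))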

The BPTT(h) algorithm calculates the weight updates at time k by calculating Tar(k; h) and then going backwards in time and structure, from t = k+h and i = N, \ldots, 1, using (4.15) and (4.16), and then, finally, doing the updates according to (4.19) and (4.20), with a learning rate \eta > 0. An equivalent, but slightly more efficient formulation of BPTT(h) can be achieved by


defining (4.21).

\gamma^{t-k}\,\tilde{F}\,x_i(t) := F\,x_i(t)   (4.21)

Starting with t = k+h and i = N, \ldots, 1 yields F\,x_i(t) = -\gamma^h e_i(t) + \sum_{j=i+1}^{N}F\,x_j(t)\,f'(net_j(t))\,w_{ji}(t), where all F\,x_j(t) contain a factor \gamma^h = \gamma^{t-k}, such that \tilde{F}\,x_i(t) = -e_i(t) + \sum_{j=i+1}^{N}\tilde{F}\,x_j(t)\,f'(net_j(t))\,w_{ji}(t). Going one time step backwards, i.e. t = k+h-1, it follows from (4.15) that all the summands contain a factor \gamma^{h-1} and \gamma^h for the first and the second sum, respectively. Therefore \tilde{F}\,x_i(t) can be defined by (4.22).

\tilde{F}\,x_i(t) = -(d_i(t) - x_i(t)) + \sum_{j=i+1}^{N}\tilde{F}\,x_j(t)\,f'(net_j(t))\,w_{ji}(t) + \gamma\sum_{j=m+1}^{N}\tilde{F}\,x_j(t+1)\,f'(net_j(t+1))\,w^1_{ji}(t+1)   (4.22)

This is slightly more efficient, because there is only one multiplication by \gamma, of the second sum, but no multiplication by a power of \gamma in the direct error term. At time k, \tilde{F}\,x_i(t) is equal to F\,x_i(t) due to (4.21), and so equation (4.22) could be used in the BPTT(h) algorithm instead of (4.15). However, for all other times t within k < t \le k+h the total derivatives differ according to (4.21).

4.3.1.1 Derivation of BPTT (h)

To make the general formula (4.14) more understandable, the BPTT(h) update formulae from above are derived in detail as an exercise. Equation (4.14) can be written as:

F\,x_i(t) = \frac{\partial Tar}{\partial x_i(t)} + \sum_{t' \ge t}\sum_{j=1}^{N}\frac{\partial^+ Tar}{\partial net_j(t')}\frac{\partial net_j(t')}{\partial x_i(t)}

= -\gamma^{t-k}(d_i(t) - x_i(t)) + \sum_{j=i+1}^{N}\frac{\partial^+ Tar}{\partial net_j(t)}\frac{\partial net_j(t)}{\partial x_i(t)} + \sum_{j=m+1}^{N}\frac{\partial^+ Tar}{\partial net_j(t+1)}\frac{\partial net_j(t+1)}{\partial x_i(t)} + \sum_{t' > t+1}\sum_{j=m+1}^{N}\frac{\partial^+ Tar}{\partial net_j(t')}\frac{\partial net_j(t')}{\partial x_i(t)}


Since \frac{\partial net_j(t')}{\partial x_i(t)} \equiv 0 for all t' > t+1, the last line is always zero. Furthermore, it follows from

equations (4.11) and (4.12) that,

\frac{\partial net_j(t)}{\partial x_i(t)} = w_{ji}(t)

\frac{\partial net_j(t+1)}{\partial x_i(t)} = w^1_{ji}(t+1)

F\,net_i(t) = \frac{\partial Tar}{\partial net_i(t)} + \sum_{t' \ge t}\sum_{j=m+1}^{N}\frac{\partial^+ Tar}{\partial x_j(t')}\frac{\partial x_j(t')}{\partial net_i(t)} = F\,x_i(t)\,f'_i(net_i(t))

F\,net_i(t+1) = \frac{\partial Tar}{\partial net_i(t+1)} + \sum_{t' \ge t+1}\sum_{j=m+1}^{N}\frac{\partial^+ Tar}{\partial x_j(t')}\frac{\partial x_j(t')}{\partial net_i(t+1)} = F\,x_i(t+1)\,f'_i(net_i(t+1))

Putting these expressions together then yields (4.15).

4.3.2 Fixed and Moving Targets

So far only BPTT(h) with a fixed target Tar(k; h) at a certain time k has been considered. Now, 'shifted' and 'moving' targets, Tar(k+q; h) and Tar(t+q; h), are considered for total derivatives with respect to quantities y at time t+q, where q > 0, as defined by (4.23) to (4.25), where equation (4.25) is simply equation (4.13) for k := k+q. The difference between the 'shifted' and the 'moving' target is that with the former k is fixed, whereas in the latter the target calculation always starts at the current time t+q (q is just a constant introduced to make the formulas more similar), see figure 4.4.

F_s\,y(t+q) \equiv \frac{\partial^+ Tar(k+q; h)}{\partial y(t+q)}, \quad shifted   (4.23)

F_m\,y(t+q) \equiv \frac{\partial^+ Tar(t+q; h)}{\partial y(t+q)}, \quad moving   (4.24)

Tar(k+q; h) = \sum_{t=k+q}^{k+q+h}\gamma^{t-(k+q)}E(t)   (4.25)

Using (4.15) again, but this time with a target Tar(t; h), yields (4.26) and (4.27).

F_m\,x_i(t) = -(d_i(t) - x_i(t)) + \sum_{j=i+1}^{N}F_m\,x_j(t)\,f'(net_j(t))\,w_{ji}(t) + \sum_{j=m+1}^{N}F\,x_j(t+1)\,f'(net_j(t+1))\,w^{int}_{ji}(t+1)   (4.26)

F\,x_i(t') \equiv 0 \quad \forall t' > t+h   (4.27)


Figure 4.4: Another interpretation of the formulation of total derivatives can be achieved by looking at the definition of the target. Tar(k; h) is the target calculated at the fixed time k and includes contributions up to the time k+h. This can be seen as a fixed target. On the other hand, if a target Tar(t; h) is used with a starting time t equal to the time at which the derivative has to be calculated, it can be seen as a moving target.

It is obvious that the 'shifted' targets are related by equation (4.28), and for q = 1 this yields equation (4.29).

\gamma^q\,Tar(k+q; h) = \gamma^q\sum_{t=k+q}^{k+q+h}\gamma^{t-(k+q)}E(t) = \sum_{t=k+q}^{k+q+h}\gamma^{t-k}E(t) = Tar(k; h) - \sum_{t=k}^{k+q-1}\gamma^{t-k}E(t) + \sum_{t=k+h+1}^{k+h+q}\gamma^{t-k}E(t)   (4.28)

Tar(k; h) = \gamma\,Tar(k+1; h) + E(k) - \gamma^{h+1}E(k+1+h)   (4.29)
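Relation (4.29) is easy to verify numerically; the sketch below evaluates both sides of it for an arbitrary (random, purely illustrative) error sequence E(t).

import numpy as np

rng = np.random.default_rng(1)
gamma, k, h = 0.9, 3, 5
E = rng.random(k + h + 2)

def Tar(start, h):                            # definition (4.25) with q = 0
    return sum(gamma**(t - start) * E[t] for t in range(start, start + h + 1))

lhs = Tar(k, h)
rhs = gamma * Tar(k + 1, h) + E[k] - gamma**(h + 1) * E[k + 1 + h]
print(lhs, rhs)                               # the two values agree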

Equation (4.29) is used to calculate (4.30) to (4.32).

\frac{\partial^+ Tar(k; h)}{\partial x_j(t+1)} = \frac{\partial^+\left(\gamma\,Tar(k+1; h) + E(k) - \gamma^{h+1}E(k+1+h)\right)}{\partial x_j(t+1)}   (4.30)

= \gamma\,\frac{\partial^+ Tar(k+1; h)}{\partial x_j(t+1)} - \gamma^{h+1}\frac{\partial^+ E(k+1+h)}{\partial x_j(t+1)}   (4.31)

\overset{h \to \infty}{=} \gamma\,\frac{\partial^+ Tar(k+1; h)}{\partial x_j(t+1)} = \gamma\,F_s\,x_j(t+1)   (4.32)

The term \frac{\partial^+ E(k)}{\partial x_j(t+1)} is zero, because t \ge k always, and therefore E(k) does not depend on x_j(t+1). Using equation (4.31) with k := t and substituting \frac{\partial^+ Tar(t; h)}{\partial x_j(t+1)} (= F\,x_j(t+1)) in


(4.26) to have a ‘moving’ target yields (4.33) and (4.34).

F_m\,x_i(t) = -(d_i(t) - x_i(t)) + \sum_{j=i+1}^{N}F_m\,x_j(t)\,f'_j(net_j(t))\,w_{ji}(t) + \gamma\sum_{j=m+1}^{N}\left[\left(F_m\,x_j(t+1) - \gamma^h\frac{\partial^+ E(t+1+h)}{\partial x_j(t+1)}\right)f'_j(net_j(t+1))\,w^1_{ji}(t+1)\right]   (4.33)

F_m\,x_i(t') \equiv 0 \quad \forall t' > t+h   (4.34)

This has the form of (4.22) when h \to \infty and therefore can be interpreted as BPTT(\infty) with moving targets. F_m\,x_i(t) are now the total derivatives of the moving targets, Tar(t; h), with respect to the nodes x_i(t). On the other hand, for h = 0 (note the second sum in (4.33) is always 0 when h = 0) it can be considered as BPTT(0) with a moving target (or fixed, since they are the same for h = 0), which is the instantaneous gradient of the current error E(t) with respect to the node x_i(t), or with respect to the weights w_{ij}(t) when using (4.17), (4.18). The latter target is exactly as in RTRL. Another interesting point is that when \gamma \to 1, F\,x_i(t) of (4.15) must be equal to F_m\,x_i(t) of (4.33) with h \to \infty, assuming E(\infty) \equiv 0. This means that the past errors do not count in weight updates; only future errors contribute to weight updates. This is not surprising, because future errors have an infinite support, whereas the number of past errors from some past starting time up to the current time t is always finite.

4.3.3 Real-Time Recurrent Learning (RTRL)

A very compact formulation of RTRL can be found in [36]. That derivation is followed, only extended slightly to account for ordered dependencies between output nodes (x_j(t) = x_j(x_{m+1}(t), \ldots, x_{j-1}(t))) at time t. Defining a concatenated input vector \xi_j of dimension m + 2(N-m) by (4.35), appropriate weight vectors w_j(t) (m+1 \le j \le N), whose elements connect elements of \xi_j(t) with node x_j(t), and the state vector x(t) = [x_{m+1}(t), \ldots, x_N(t)]^T, the system (4.10)-(4.12) can be written as (4.36).


\xi_j(t) := \left[x_1^{ext}(t), \ldots, x_m^{ext}(t),\; x_{m+1}(t-1), \ldots, x_N(t-1),\; x_{m+1}(t), \ldots, x_{j-1}(t),\; 0, \ldots, 0\right]^T = \begin{bmatrix} x^{ext}(t) \\ x^j(t) \end{bmatrix}   (4.35)

x(t) := \begin{bmatrix} x_{m+1}(t) \\ \vdots \\ x_N(t) \end{bmatrix} = \begin{bmatrix} f(w_{m+1}^T(t)\,\xi_{m+1}(t)) \\ \vdots \\ f(w_N^T(t)\,\xi_N(t)) \end{bmatrix}   (4.36)

Differentiating (4.36) with respect to the weights w_j(t) using the chain rule once yields (4.37) to (4.40), where w^{int}_j(t) is the subvector of w_j(t) without the elements connecting node x_j(t) to the external nodes x_k^{ext}(t) in \xi_j(t), and w_j(t) is equal to w(t) with the elements w_{jl}(t) set to 0 for l \ge j + N - m. Equation (4.37) now describes, recursively, the state dynamics (note that the first N-m elements of x(t) are equal to those of x^j(t+1)).

\frac{\partial x^T(t)}{\partial w_j(t)} = \Phi(t)\left[\frac{\partial x^{jT}(t)}{\partial w_j(t)}\,W^1(t) + U_j^T(t)\right]   (4.37)

\Phi(t) := \mathrm{diag}\left(f'(w_{m+1}^T(t)\,\xi_{m+1}(t)), \ldots, f'(w_N^T(t)\,\xi_N(t))\right)   (4.38)

W(t) := \begin{bmatrix} W^{ext}(t) \\ W^{int}(t) \end{bmatrix} = \begin{bmatrix} w^{ext}_{m+1}(t), \ldots, w^{ext}_N(t) \\ w^{int}_{m+1}(t), \ldots, w^{int}_N(t) \end{bmatrix}   (4.39)

U_j(t) := \begin{bmatrix} 0 \\ \xi_j^T(t) \\ 0 \end{bmatrix} \leftarrow (j-m)\text{th row}   (4.40)

Having an instantaneous error E(t) = \frac{1}{2}e^T(t)e(t) with e(t) = d(t) - x(t), as in equation (4.13), the same target Tar(k; h) could also be used, and equations (4.41) to (4.47) follow.


\frac{\partial E(t)}{\partial w_j(t)} = \frac{\partial e^T(t)}{\partial w_j(t)}\,e(t)   (4.41)

= -\frac{\partial x^T(t)}{\partial w_j(t)}\,e(t)   (4.42)

\frac{\partial^+ Tar(k; h)}{\partial w_j(k)} = \sum_{t=k}^{k+h}\gamma^{t-k}\frac{\partial^+ E(t)}{\partial w_j(k)}   (4.43)

= -\sum_{t=k}^{k+h}\gamma^{t-k}\frac{\partial^+ x^T(t)}{\partial w_j(k)}\,e(t)   (4.44)

\Delta w_j(k) = -\eta\,\frac{\partial^+ Tar(k; h)}{\partial w_j(k)}   (4.45)

= \eta\sum_{t=k}^{k+h}\gamma^{t-k}\frac{\partial^+ x^T(t)}{\partial w_j(k)}\,e(t)   (4.46)

\overset{h=0}{=} \eta\,\frac{\partial x^T(k)}{\partial w_j(k)}\,e(k)   (4.47)

However, linking \frac{\partial x^T(t)}{\partial w_j(t)} to the total derivative \frac{\partial^+ Tar(k; h)}{\partial w_j(t)} of the target with respect to the weights is easily done only when h = 0, meaning the total gradient is approximated by the instantaneous gradient. If one keeps the weights constant during the target window from time k up to k+h, an easy relation can be stated using ordinary derivatives (4.48).

\Delta w_j(k)\Big|_{w_j(t) = w_j(k)} = \eta\sum_{t=k}^{k+h}\gamma^{t-k}\frac{\partial^+ x^T(t)}{\partial w_j(t)}\,e(t) = \eta\sum_{t=k}^{k+h}\gamma^{t-k}\frac{\partial x^T(t)}{\partial w_j(t)}\,e(t)   (4.48)

To make the RTRL algorithm complete, the initial condition for the recursive formula (4.37) needs to be given. Assuming the network is in a constant state, one can set \frac{\partial x^T(0)}{\partial w_j(0)} = 0 for all j. One potential problem with the RTRL algorithm is that when the learning rate \eta is too large, the additional feedback produced by the weight changes can cause instability. Another possibility would be to update the weights only every h-th time step (h > 1). This would be a "forward propagation through time".

4.3.4 Some Notes on BPTT and RTRL

A few more details can be found in [37], and for computational complexity calculations Williams and Zipser discuss various forms of BPTT and RTRL in depth, see [38]. In addition to the approach to BPTT here, which follows Werbos' notation, another efficient


implementation of BPTT(h) for ordered structures, featuring "+=" operations and including recurrent networks, is proposed by Lee Feldkamp et al., see [39].

4.3.5 Recursive Generalized FIR-MLP

Wan has introduced a highly impressive MLP for time series prediction of a chaotic laser [40]. He replaced the single weights from one node to another by a FIR tap-line. This can be generalized using a recurrent generalized MLP, but instead of a single delay a tap-line is used as well, see figure 4.5. In this scenario it is easier to recognize the FIR tap-line, and it may be of advantage because a FIR filter has an intuitive interpretation and is a standard tool in signal processing. Nevertheless, it has to be emphasized that it is structurally equivalent to the recursive generalized MLP, or SRN model. The only difference is in the number of delays, which can be modelled by using more (linear) nodes in the SRN network. Depending on the application, one or the other might be preferred.


Figure 4.5: The recurrent generalized FIR-MLP is distinguished from the SRN by having FIR tap-lines instead of only a single weight. Therefore it uses fewer (linear) nodes than an equivalent SRN.


4.4 Continuous Version of ‘Ordered’ Total Derivatives

In section 4.1, its extension in the appendix, and the section about backpropagation through time, a simple method for the calculation of total derivatives for ordered systems was achieved by discretizing the continuous plant and the utility or short-term cost and treating them as ordered systems, where total derivatives can be easily calculated by the formulae (4.49) or (4.50).

\frac{\partial^+ z_n^T}{\partial z_k} = \frac{\partial z_n^T}{\partial z_k} + \sum_{j=k+1}^{n-1}\frac{\partial z_j^T}{\partial z_k}\frac{\partial^+ z_n^T}{\partial z_j}   (4.49)

= \frac{\partial z_n^T}{\partial z_k} + \sum_{j=k+1}^{n-1}\frac{\partial^+ z_j^T}{\partial z_k}\frac{\partial z_n^T}{\partial z_j}   (4.50)

For continuous systems, where x(t) represents the state of the system and is under the influence of infinitesimal changes during the infinitesimal time step dt, the chain rule can be applied analogously. Given the setup of an adaptive critic design where \dot{x} = f(x, g(x; w)), the goal is to adapt the weights w such that x is an optimal trajectory, in the sense that it has a minimal long-term cost. Clearly, \dot{x} can be seen as a function only of x and w, so \dot{x} = h(x; w).


Figure 4.6: The connection between neighboring trajectories due to a slight change in the weights. Multiplying all the vectors by δt makes clear that the order of derivatives with respect to time and weights can be exchanged, see equation (4.51).

A deviation \delta w in w leads to a deviation in the trajectory x, say x_{\delta w}. Therefore (4.51) holds, and the order of the differentiations can be exchanged as defined by (4.52) to (4.54). See figure 4.6.

\frac{d}{dt}\left(\frac{dx^T}{dw}\,\delta w\right) = h^T(x_{\delta w}, w+\delta w) - h^T(x, w)   (4.51)


\frac{d}{dt}\frac{dx^T}{dw} = \frac{dh^T}{dw} = \frac{dx^T}{dw}\frac{\partial h^T}{\partial x} + \frac{\partial h^T}{\partial w},   (4.52)

= \frac{dx^T}{dw}\frac{\partial f^T}{\partial x} + \frac{dg^T}{dw}\frac{\partial f^T}{\partial u}   (4.53)

= \frac{dx^T}{dw}\left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right] + \frac{\partial g^T}{\partial w}\frac{\partial f^T}{\partial u}   (4.54)

This relation proves to be very useful, as it is just a differential equation for the otherwise hard-to-calculate quantity defined by (4.55) to (4.57).

q := \frac{dx^T}{dw}   (4.55)

\dot{q} = q\left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right] + \frac{\partial g^T}{\partial w}\frac{\partial f^T}{\partial u}, \quad with initial condition   (4.56)

q(t_0) = 0   (4.57)
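The following scalar sketch integrates the sensitivity equation (4.56) with initial condition (4.57) alongside the state, for an assumed linear plant \dot{x} = a x + b u with a linear actor u = w x, and compares q(T) = dx(T)/dw with a finite-difference estimate; all constants are illustrative.

import numpy as np

a, b, dt, T = -1.0, 1.0, 0.001, 2.0

def simulate(w, x0=1.0):
    x, q = x0, 0.0                              # initial condition (4.57): q(t0) = 0
    for _ in range(int(T / dt)):
        u     = w * x
        x_dot = a * x + b * u
        # (4.56): q_dot = q*[df/dx + dg/dx*df/du] + dg/dw*df/du
        q_dot = q * (a + w * b) + x * b
        x, q  = x + x_dot * dt, q + q_dot * dt  # forward Euler for state and sensitivity
    return x, q

w = -0.5
x_T, q_T = simulate(w)
eps = 1e-6
x_T_eps, _ = simulate(w + eps)
print(q_T, (x_T_eps - x_T) / eps)               # both approximate dx(T)/dw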

If this is expressed in integral form, the similarity with the discrete ordered system is easily seen. Whereas in the discrete system a summation is performed over the later dependencies of the quantity whose target sensitivity is calculated, here an integration has to be performed, in which the same total and partial derivatives appear, only at infinitesimal time steps, as defined by (4.58) to (4.61).

x(t_1) = x(t_0) + \int_{t_0}^{t_1} f(x(t), g(x(t); w_a))\,dt   (4.58)

\frac{dx^T(t_1)}{dw_a} = \int_{t_0}^{t_1}\frac{df^T(x(t), g(x(t); w_a))}{dw_a}\,dt   (4.59)

= \int_{t_0}^{t_1}\left[\frac{dx^T}{dw_a}\left(\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right) + \frac{\partial g^T}{\partial w_a}\frac{\partial f^T}{\partial u}\right]dt   (4.60)

= \int_{t_0}^{t_1}\left[\frac{dx^T}{dw_a}\frac{d\dot{x}^T}{dx} + \frac{\partial \dot{x}^T}{\partial w_a}\right]dt   (4.61)

Again, this is the integral formulation of the differential equation (4.56) with initial condition (4.57) and dx^T/dw_a =: q.

Therefore, the summation is replaced by the integration and the partial derivative has to be included in the integral, which is not surprising, because in the discrete case total derivatives of intermediate quantities are calculated recursively by the same formula (4.50). Instead of a discrete ordered system, the continuous case is a distributed (over time) and ordered (structural dependencies) system, where infinitesimal changes are expressed in terms of total time derivatives of the target quantity x and split up into total and partial derivative parts, for indirect and direct influence on the target quantity, just as in the discrete case. This trick of solving for a total derivative by integration is the key to continuous adaptive critics as well as to Euler training, introduced in section 3.2, which will be shown in chapter 5 to suffer from using only partial derivatives.
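As an illustration of this forward integration, the following sketch (not code from the thesis; the plant matrices are borrowed from the LQR example of chapter 5 and the gain W is an arbitrary assumption) integrates the sensitivity q = dx^T/dw of equation (4.56) alongside the state for a linear plant ẋ = Ax + Bu with linear feedback u = g(x;W) = −Wx, for which the partial derivatives in (4.56) can be written in closed form.

    # Minimal sketch: forward integration of q = dx^T/dw, equation (4.56),
    # for a linear plant xdot = A x + B u with linear feedback u = -W x.
    import numpy as np
    from scipy.integrate import solve_ivp

    A = np.array([[-1.0, -2.0], [1.0, -4.0]])
    B = np.array([[0.5, -1.0], [0.5, 2.0]])
    W = np.array([[0.5, -4.0], [0.3, -0.4]])      # actor parameters (assumed gain)
    n, m = A.shape[0], B.shape[1]
    nw = W.size                                    # number of actor weights

    def rhs(t, z):
        x = z[:n]
        q = z[n:].reshape(nw, n)                   # q[a, k] = dx_k / dw_a
        u = -W @ x
        xdot = A @ x + B @ u
        dhT_dx = (A - B @ W).T                     # dh^T/dx for this linear policy
        dgT_dw = np.zeros((nw, m))                 # dg^T/dw, row index a = (i, l) of W
        for i in range(m):
            for l in range(n):
                dgT_dw[i * n + l, i] = -x[l]
        qdot = q @ dhT_dx + dgT_dw @ B.T           # equation (4.56)
        return np.concatenate([xdot, qdot.ravel()])

    x0 = np.array([1.0, -1.0])
    z0 = np.concatenate([x0, np.zeros(nw * n)])    # q(t0) = 0, equation (4.57)
    sol = solve_ivp(rhs, (0.0, 1.0), z0, rtol=1e-8, atol=1e-10)
    q_t1 = sol.y[n:, -1].reshape(nw, n)            # dx^T(t1)/dw as in (4.59)

The matrix q(t1) obtained this way is exactly the total derivative dx^T(t1)/dw of (4.59), computed with the variable step-size control of the integration routine.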


4.4.1 Continuous Adaptive Critics

For continuous adaptive critics the plant and the cost-density function are continuous, and the one-step cost is an integral of a cost-density function over a sufficiently long time interval [t0, t1]. This is different from chapter 3, where the step length is short enough to be linearized³. The short-term cost is given by (4.62).

U(t0, t1) = ∫_{t0}^{t1} φ(x, ẋ) dt    (4.62)

A long-term cost estimator (4.63), called a critic, is given, with some parameters w_c which depend on the policy π(w_a) : x(t) ↦ g(x(t);w_a).

J(x(t1);w_c) := J^{π(w_a)}(x(t1);w_c) = ∫_{t1}^{∞} φ(x(t), ẋ(t)) dt    (4.63)

As seen before, in adaptive critic designs an estimator is sought that is minimal with respect to its control output u, and correspondingly to its parameters w_a. Using Bellman's principle of optimality, (4.64) and (4.65) must hold, and two objectives can be achieved simultaneously.

J(x(t0);w_c) := J^{π(w_a)}(x(t0);w_c) = ∫_{t0}^{t1} φ(x(t), ẋ(t)) dt + ∫_{t1}^{∞} φ(x(t), ẋ(t)) dt    (4.64)

               = U(t0, t1) + J^{π(w_a)}(x(t1);w_c)    (4.65)

Firstly, the critic weights w_c can be adapted using the traditional adaptive critic updates, based on an error (4.66) measuring the temporal difference of the critic estimates.

E(x(t0), t0, t1;w_a,w_c) := ∫_{t0}^{t1} φ(x(t), ẋ(t)) dt + J(x(t1);w_c) − J(x(t0);w_c)    (4.66)

Applying an adaptation law to the critic parameters w_c that forces the temporal error to zero ensures optimality for the given policy g(x;w_a) with fixed parameters w_a; for example, (4.67).

δw_c := −η (∂E(t)/∂w_c) E(t)    (4.67)

Secondly, the policy can be improved by forcing (4.68, 4.70) to be zero.

³If the equations are discretized and the duration of a time step δt is set to 1, as is usually done, the short-term cost becomes the one-step cost.


d( U(t0,t1;w_a) + J^{π(w_a)}(x(t1);w_c) )/dw_a = dU(t0,t1;w_a)/dw_a + (dx^T(t1)/dw_a)(dJ^{π(w_a)}(x(t1);w_c)/dx(t1))    (4.68)

!= 0    (4.69)

= ∫_{t0}^{t1} (dx^T(t)/dw_a)[ ∂φ/∂x + (∂g^T/∂x)(∂f^T/∂u)(∂φ/∂ẋ) ] dt + (dx^T(t1)/dw_a)(dJ^{π(w_a)}(x(t1);w_c)/dx(t1))    (4.70)

The superscript π(w_a) indicates that this equation is only valid for converged critics, given the current policy. Solving (4.59) with initial condition (4.57) yields the result for the total derivative dJ^{π(w_a)}(x(t0);w_c)/dw_a, which can be used to update the actor weights w_a in the usual steepest-gradient manner. This is the continuous counterpart of the traditional adaptive critic designs. As mentioned earlier in section 2.5.3, the problem with the one-step critic is that total derivatives taken over one step only often miss the indirect contributions. An example is a one-step development of the state x(t+1) = x(t) + f(x(t),u(t)) with some control u(t) = g(x(t);w), such that the total derivative of x(t+1) with respect to the weights w is given by (4.71), which is equal to (4.72) because dx^T(t)/dw = 0.

dx^T(t+1)/dw = dx^T(t)/dw + (dx^T(t)/dw)(∂f^T(x(t),u(t))/∂x(t)) + (dg^T(x(t);w)/dw)(∂f^T(x(t),u(t))/∂u(t))    (4.71)

             = (∂g^T(x(t);w)/∂w)(∂f^T(x(t),u(t))/∂u(t))    (4.72)

If this procedure is now repeated at every time step and the starting time t0 is always reset to the current time t, indirect influence through x(t) and all its later dependencies, like f(x(t),u(t)), is always going to be missed. That can amount to a serious problem, as substantial parts, like ∂f^T(x(t),u(t))/∂x(t) or ∂g^T(x(t);w)/∂x(t), are ignored as well. That is the reason why BPTT(h > 0) is so much more powerful than just having the instantaneous gradient as in BPTT(0). The same applies to the continuous formulation adopted here, with the additional benefit of having variable step-size control from the integration routine. One final remark on RTRL and BPTT: the BPTT algorithm is considered more efficient because in its recursive formulation gradients are calculated with respect to a scalar target, while in RTRL the quantity ∂x^T(t)/∂w is a gradient of a vector, resulting in a matrix quantity. The same applies to the continuous calculation as well, where the matrix quantity q has to be integrated. However, as x is the state vector of the system and not the vector of all nodes as in a simultaneous recurrent network (SRN) using the RTRL algorithm, x is most likely of much smaller dimensionality. Therefore, having q as a matrix might not be too severe a drawback.


4.4.2 Second Order Adaptation for Actor Training

As seen before, the short-term cost from time t0 to t1, starting in state x(t0), is given by (4.74).

U(x(t0), t0, t1;w_a) = ∫_{t0}^{t1} φ(x, ẋ) dt = ∫_{t0}^{t1} φ(x, f(x, g(x;w_a))) dt    (4.73)

                     = ∫_{t0}^{t1} φ(x, w_a) dt    (4.74)

Assuming a stationary environment, the long-term cost in state x(t0) when following the policy given by g(x;w_a) satisfies Bellman's optimality condition, as in (4.75) and (4.76).

J(x(t0);wa) = U(x(t0), t0, t1;wa) + J(x(t1);wa) (4.75)

J0 = U + J1, for short. (4.76)

Here J(x(t0);w_a) is the minimal cost in state x(t0) when following the policy π : g(x;w_a). Thus, a better notation would be J^{π(w_a)}(x(t0)), to indicate that J is actually a pure function of the state for a given policy. However, to simplify the notation, neither the superscript π(w_a) nor the argument w_a is used when not necessary. In adaptive critic designs the long-term cost function J(x;w_a) is approximated by J(x;w_c). This means that if Bellman's principle of optimality is satisfied for a certain policy g(x;w_a), then w_c is determined by the cost density φ and the policy parameters w_a. An optimal policy is a policy that minimizes J(x;w_a), and therefore a necessary condition is (4.77, 4.78, 4.79).

dJ(x(t0))/dw_a = dU(x(t0),t0,t1;w_a)/dw_a + dJ(x(t1))/dw_a != 0    (4.77)

dJ0/dw_a = dU/dw_a + dJ1/dw_a != 0, for short    (4.78)

         = dU/dw_a + (dx^T/dw_a)(dJ1/dx) != 0    (4.79)

4.4.2.1 Newton’s method

In traditional adaptive critic designs, (4.77) is used to train the actor parameters via a simple gradient descent method. To speed up the traditional approach, Newton's method could be used, though with the additional cost of computing the Jacobian of the function dJ0/dw_a with respect to w_a. In the context here, Newton's method for zero search is given by (4.80) to (4.85).


F(X) = 0,    (4.80)

find X by iterating X_k → X_{k+1} according to

X_{k+1} := X_k − [∂F/∂X]^{−1} F(X_k),    (4.81)

identifying

F := dJ0/dw_a    (4.82)
X := w_a    (4.83)
∂F/∂X := d²J0/dw_a²,    (4.84)

yields

w_a^{k+1} := w_a^k − [d²J0/dw_a²]^{−1} dJ0/dw_a    (4.85)
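A minimal sketch of this Newton step, assuming routines grad_J0 and hess_J0 that assemble dJ0/dw_a and d²J0/dw_a² from equations (4.90)-(4.132) and averaging them over sampled start states as in (4.89); these routines are placeholders, not part of the thesis:

    import numpy as np

    def newton_actor_step(wa, start_states, grad_J0, hess_J0):
        """One Newton step w_a^{k+1} = w_a^k - [E d2J0/dwa2]^{-1} E dJ0/dwa."""
        g = np.mean([grad_J0(wa, x0) for x0 in start_states], axis=0)
        H = np.mean([hess_J0(wa, x0) for x0 in start_states], axis=0)
        dwa = np.linalg.solve(H, g)     # solve rather than invert the Hessian
        return wa - dwa

A safeguarded variant that rescales overly large steps is described later, in section 5.4.1.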

To calculate the Jacobian, equation (4.79) is differentiated again with respect to w_a, yielding (4.86) to (4.88), where dJ1/dx and d²J1/dx² might be approximated by a backpropagated J-approximator or a λ-critic, and by a backpropagated λ-critic, respectively.

d²J0/dw_a² = d/dw_a ( dU/dw_a + (dx^T/dw_a)(dJ1/dx) ) = d²U/dw_a² + d/dw_a ( (dx^T/dw_a)(dJ1/dx) )    (4.86)

           = d²U/dw_a² + (d²x^T/dw_a²)(dJ1/dx) + d/dw_a (dJ1/dx)(dx/dw_a)    (4.87)

           = d²U/dw_a² + (d²x^T/dw_a²)(dJ1/dx) + (dx^T/dw_a)(d²J1/dx²)(dx/dw_a)    (4.88)

It has to be mentioned that d²x^T/dw_a² is a third-order tensor, but with "the inner-product multiplication over the components of x", the term (d²x^T/dw_a²)(dJ1/dx) gets the correct dimensions. Matrix notation starts to fail here, and one is better advised to resort to tensor notation with upper and lower indices, which is done later for more complicated expressions. An important note has to be made about derivatives of critics and derivative (λ-) critics: they represent not instantaneous derivatives but rather averaged derivatives. Therefore, an averaged version of (4.88) is used, given by (4.89), where the expectation is taken over a set of sampled start states x(t0) according to their probability distribution over the domain of interest.

E[ d²J0/dw_a² ] = E[ d²U/dw_a² + Σ_{k=1}^{dim(x)} (d²x_k/dw_a²)(dJ1/dx_k) + (dx^T/dw_a)(d²J1/dx²)(dx/dw_a) ]    (4.89)

All the necessary terms in (4.89) are fully expanded in equations (4.90) to (4.132).
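For implementation purposes, the "inner-product multiplication over the components of x" is an ordinary tensor contraction; a small sketch with assumed (hypothetical) dimensions:

    import numpy as np

    dim_x, dim_w = 2, 4
    r = np.zeros((dim_x, dim_w, dim_w))        # r[k, i, j] = d2 x_k / dw_i dw_j
    q = np.zeros((dim_w, dim_x))               # dx^T/dwa
    dJ1_dx = np.zeros(dim_x)                   # critic (or lambda-critic) output
    d2J1_dx2 = np.zeros((dim_x, dim_x))        # backpropagated lambda-critic

    term2 = np.einsum('kij,k->ij', r, dJ1_dx)  # sum_k d2x_k/dwa2 * dJ1/dx_k
    term3 = q @ d2J1_dx2 @ q.T                 # dx^T/dwa d2J1/dx2 dx/dwa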


dU/dw_a = ∫_{t0}^{t1} ( (dx^T/dw_a)(∂φ/∂x) + ∂φ/∂w_a ) dt    (4.90)

d²U/dw_a² = ∫_{t0}^{t1} d/dw_a [ Σ_{k=1}^{dim(x)} (dx_k/dw_a)(∂φ/∂x_k) + ∂φ/∂w_a ] dt    (4.91)

or for one element:

d²U/dw_i dw_j = ∫_{t0}^{t1} [ Σ_{k=1}^{dim(x)} ( (d²x_k/dw_i dw_j)(∂φ/∂x_k) + d/dw_i(∂φ/∂x_k)(dx_k/dw_j) ) + d/dw_i(∂φ/∂w_j) ] dt    (4.92)

d/dw_i (∂φ/∂x_k) = Σ_{l=1}^{dim(x)} (dx_l/dw_i)(∂²φ/∂x_l∂x_k) + ∂²φ/∂w_i∂x_k = Σ_{l=1}^{dim(x)} q^l_i (∂²φ/∂x_l∂x_k) + ∂²φ/∂w_i∂x_k    (4.93)

d/dw_i (∂φ/∂w_j) = Σ_{l=1}^{dim(x)} (dx_l/dw_i)(∂²φ/∂x_l∂w_j) + ∂²φ/∂w_i∂w_j = Σ_{l=1}^{dim(x)} q^l_i (∂²φ/∂x_l∂w_j) + ∂²φ/∂w_i∂w_j    (4.94)

d²U/dw_i dw_j = ∫_{t0}^{t1} [ Σ_{k=1}^{dim(x)} ( r^k_{ij} (∂φ/∂x_k) + ( Σ_{l=1}^{dim(x)} q^l_i (∂²φ/∂x_l∂x_k) + ∂²φ/∂w_i∂x_k ) q^k_j ) + Σ_{l=1}^{dim(x)} q^l_i (∂²φ/∂x_l∂w_j) + ∂²φ/∂w_i∂w_j ] dt    (4.95)

with the abbreviations:

w_a := [w_1, ..., w_i, ..., w_{dim(w_a)}]^T    (4.96)

q := dx^T/dw_a    (4.97)

q^k_j := dx_k/dw_j    (4.98)

r := d²x^T/dw_a² = d/dw_a (dx^T/dw_a) = d/dw_a (q)    (4.99)

r^k_{ij} := d/dw_i (q^k_j) = d/dw_i (dx_k/dw_j) = d²x_k/dw_i dw_j    (4.100)


and the relations for their total time derivatives:

ẋ = f(x, g(x;w_a)) = h(x;w_a)    (4.101)

q̇ = dẋ^T/dw_a = (dx^T/dw_a)(∂h^T/∂x) + ∂h^T/∂w_a    (4.102)

q̇^k_j = Σ_{m=1}^{dim(x)} q^m_j (∂h_k/∂x_m) + ∂h_k/∂w_j    (4.103)

ṙ^k_{ij} = d/dw_i (dẋ_k/dw_j) = d/dw_i (q̇^k_j) = d/dw_i [ Σ_{m=1}^{dim(x)} q^m_j (∂h_k/∂x_m) + ∂h_k/∂w_j ]    (4.104)

         = Σ_{m=1}^{dim(x)} ( d/dw_i(q^m_j)(∂h_k/∂x_m) + d/dw_i(∂h_k/∂x_m) q^m_j ) + d/dw_i (∂h_k/∂w_j)    (4.105)

with

d/dw_i (∂h_k/∂x_m) = Σ_{n=1}^{dim(x)} (dx_n/dw_i)(∂²h_k/∂x_n∂x_m) + ∂²h_k/∂w_i∂x_m = Σ_{n=1}^{dim(x)} q^n_i (∂²h_k/∂x_n∂x_m) + ∂²h_k/∂w_i∂x_m    (4.106)

d/dw_i (∂h_k/∂w_j) = Σ_{n=1}^{dim(x)} (dx_n/dw_i)(∂²h_k/∂x_n∂w_j) + ∂²h_k/∂w_i∂w_j = Σ_{n=1}^{dim(x)} q^n_i (∂²h_k/∂x_n∂w_j) + ∂²h_k/∂w_i∂w_j    (4.107)

it is:

ṙ^k_{ij} = Σ_{m=1}^{dim(x)} [ r^m_{ij} (∂h_k/∂x_m) + ( Σ_{n=1}^{dim(x)} q^n_i (∂²h_k/∂x_n∂x_m) + ∂²h_k/∂w_i∂x_m ) q^m_j ] + Σ_{n=1}^{dim(x)} q^n_i (∂²h_k/∂x_n∂w_j) + ∂²h_k/∂w_i∂w_j    (4.108)

All these differential equations can be computed easily in principle, knowing that the initial condition is always zero. The tricky part is only the complexity of the formulae, obtained by using the product and chain rules appropriately and expressing derivatives of h in terms of derivatives of f and g. Depending on the complexity of the system at hand, it might be simpler to evaluate h(x,w) = f(x, g(x;w_a)) directly and use discrete differences to approximate the partial derivatives; a small sketch of this shortcut is given after equation (4.120) below. For the first-order partial derivatives of h with respect to the states or weights it is:

∂h^T/∂x = ∂f^T/∂x + (∂g^T/∂x)(∂f^T/∂u)    (4.109)

∂h_k/∂x_m = ∂f_k/∂x_m + Σ_{p=1}^{dim(u)} (∂g_p/∂x_m)(∂f_k/∂u_p), ∀k = 1,..,dim(x), ∀m = 1,..,dim(x)    (4.110)

∂h^T/∂w_a = (∂g^T/∂w_a)(∂f^T/∂u)    (4.111)

∂h_k/∂w_j = Σ_{p=1}^{dim(u)} (∂g_p/∂w_j)(∂f_k/∂u_p)    (4.112)


and for the second-order partial derivatives of h with respect to the states and/or weights it is:

∂²h_k/∂x_n∂x_m = ∂/∂x_n [ ∂f_k/∂x_m + Σ_{p=1}^{dim(u)} (∂g_p/∂x_m)(∂f_k/∂u_p) ]    (4.113)

= ∂²f_k/∂x_n∂x_m + Σ_{p=1}^{dim(u)} (∂g_p/∂x_n)(∂²f_k/∂u_p∂x_m) + Σ_{p=1}^{dim(u)} [ (∂²g_p/∂x_n∂x_m)(∂f_k/∂u_p) + ( ∂²f_k/∂x_n∂u_p + Σ_{r=1}^{dim(u)} (∂g_r/∂x_n)(∂²f_k/∂u_r∂u_p) )(∂g_p/∂x_m) ]    (4.114)

∂²h_k/∂w_i∂x_m = ∂/∂w_i [ ∂f_k/∂x_m + Σ_{p=1}^{dim(u)} (∂g_p/∂x_m)(∂f_k/∂u_p) ]    (4.115)

= Σ_{p=1}^{dim(u)} (∂g_p/∂w_i)(∂²f_k/∂u_p∂x_m) + Σ_{p=1}^{dim(u)} [ (∂²g_p/∂w_i∂x_m)(∂f_k/∂u_p) + ( Σ_{r=1}^{dim(u)} (∂g_r/∂w_i)(∂²f_k/∂u_r∂u_p) )(∂g_p/∂x_m) ]    (4.116)

= Σ_{p=1}^{dim(u)} [ (∂g_p/∂w_i)(∂²f_k/∂u_p∂x_m) + (∂²g_p/∂w_i∂x_m)(∂f_k/∂u_p) + ( Σ_{r=1}^{dim(u)} (∂g_r/∂w_i)(∂²f_k/∂u_r∂u_p) )(∂g_p/∂x_m) ]    (4.117)

∂²h_k/∂x_n∂w_j = Σ_{p=1}^{dim(u)} [ (∂²g_p/∂x_n∂w_j)(∂f_k/∂u_p) + ( ∂²f_k/∂x_n∂u_p + Σ_{r=1}^{dim(u)} (∂g_r/∂x_n)(∂²f_k/∂u_r∂u_p) )(∂g_p/∂w_j) ]    (4.118)

= Σ_{p=1}^{dim(u)} [ (∂²g_p/∂x_n∂w_j)(∂f_k/∂u_p) + (∂²f_k/∂x_n∂u_p)(∂g_p/∂w_j) + ( Σ_{r=1}^{dim(u)} (∂g_r/∂x_n)(∂²f_k/∂u_r∂u_p) )(∂g_p/∂w_j) ]    (4.119)

∂²h_k/∂w_i∂w_j = Σ_{p=1}^{dim(u)} [ (∂²g_p/∂w_i∂w_j)(∂f_k/∂u_p) + ( Σ_{r=1}^{dim(u)} (∂g_r/∂w_i)(∂²f_k/∂u_r∂u_p) )(∂g_p/∂w_j) ]    (4.120)

The only remaining quantities to calculate are the partial derivatives of φ(x,w_a) with respect to the weights and states. First-order terms are given by:

∂φ/∂x = ∂/∂x ( φ(x, h(x,w_a)) ) = ∂φ/∂x + (∂h^T/∂x)(∂φ/∂ẋ)    (4.121)

∂φ/∂x_k = ∂φ/∂x_k + Σ_{l=1}^{dim(x)} (∂h_l/∂x_k)(∂φ/∂ẋ_l)    (4.122)

∂φ/∂w_a = (∂h^T/∂w_a)(∂φ/∂ẋ)    (4.123)

∂φ/∂w_j = Σ_{l=1}^{dim(x)} (∂h_l/∂w_j)(∂φ/∂ẋ_l)    (4.124)


and second order terms are:

∂²φ/∂x_l∂x_k = ∂/∂x_l [ ∂φ/∂x_k + Σ_{p=1}^{dim(x)} (∂h_p/∂x_k)(∂φ/∂ẋ_p) ]    (4.125)

= ∂²φ/∂x_l∂x_k + Σ_{p=1}^{dim(x)} (∂h_p/∂x_l)(∂²φ/∂ẋ_p∂x_k) + Σ_{p=1}^{dim(x)} [ (∂²h_p/∂x_l∂x_k)(∂φ/∂ẋ_p) + ( ∂²φ/∂x_l∂ẋ_p + Σ_{q=1}^{dim(x)} (∂h_q/∂x_l)(∂²φ/∂ẋ_q∂ẋ_p) )(∂h_p/∂x_k) ]    (4.126)

∂²φ/∂w_i∂x_k = ∂/∂w_i [ ∂φ/∂x_k + Σ_{p=1}^{dim(x)} (∂h_p/∂x_k)(∂φ/∂ẋ_p) ]    (4.127)

= Σ_{p=1}^{dim(x)} (∂h_p/∂w_i)(∂²φ/∂ẋ_p∂x_k) + Σ_{p=1}^{dim(x)} [ (∂²h_p/∂w_i∂x_k)(∂φ/∂ẋ_p) + ( Σ_{q=1}^{dim(x)} (∂h_q/∂w_i)(∂²φ/∂ẋ_q∂ẋ_p) )(∂h_p/∂x_k) ]    (4.128)

∂²φ/∂x_k∂w_j = ∂/∂x_k [ Σ_{p=1}^{dim(x)} (∂h_p/∂w_j)(∂φ/∂ẋ_p) ]    (4.129)

= Σ_{p=1}^{dim(x)} [ (∂²h_p/∂x_k∂w_j)(∂φ/∂ẋ_p) + ( ∂²φ/∂x_k∂ẋ_p + Σ_{q=1}^{dim(x)} (∂h_q/∂x_k)(∂²φ/∂ẋ_q∂ẋ_p) )(∂h_p/∂w_j) ]    (4.130)

∂²φ/∂w_i∂w_j = ∂/∂w_i [ Σ_{l=1}^{dim(x)} (∂h_l/∂w_j)(∂φ/∂ẋ_l) ]    (4.131)

= Σ_{l=1}^{dim(x)} [ (∂²h_l/∂w_i∂w_j)(∂φ/∂ẋ_l) + ( Σ_{p=1}^{dim(x)} (∂h_p/∂w_i)(∂²φ/∂ẋ_p∂ẋ_l) )(∂h_l/∂w_j) ]    (4.132)

For the first actor training, a mid-term interval [t0, t1] could be chosen with a critic output of zero, e.g. w_c ≡ 0. In the next cycle the actor weights w_a are fixed and the critic weights w_c are adapted, by forming the standard Bellman error E_c according to (4.133) and (4.134).

E_c := U(x(t0), t0, t1;w_a) + J^{π(w_a)}(x(t1);w_c) − J^{π(w_a)}(x(t0);w_c)    (4.133)

δw_c := −η_c (∂E_c/∂w_c) E_c    (4.134)
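A sketch of one such critic update, assuming the quadratic-form critic J(x;W_c) = x^T W_c x used for the LQR tests in chapter 5 and a simulation routine (not shown) that supplies the end state x(t1) and the accumulated short-term cost U over [t0, t1]:

    import numpy as np

    def critic_update(Wc, x0, x1, U, eta_c=0.01):
        """Gradient step on the Bellman error Ec = U + J(x1; Wc) - J(x0; Wc)."""
        J0 = x0 @ Wc @ x0
        J1 = x1 @ Wc @ x1
        Ec = U + J1 - J0
        dEc_dWc = np.outer(x1, x1) - np.outer(x0, x0)
        return Wc - eta_c * dEc_dWc * Ec           # equation (4.134)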

After convergence of the critic has been achieved, the error E_c is zero and the critic J^{π(w_a)}(·;w_c) is consistent with the policy π(w_a). A fast training method for the controller has been achieved with Newton's method. However, after one actor training cycle the actor parameters w_a change to w_a + δw_a. To keep Bellman's optimality condition consistent, the critic weights w_c have to be adapted as well. Therefore, for converged critics w_c + δw_c and w_c, corresponding to the policies with parameters w_a + δw_a and w_a, respectively, the following conditions (4.135) and (4.136) must hold.

U(x0; w_a+δw_a) + J^{π(w_a+δw_a)}(x_{δw_a}; w_c+δw_c) − J^{π(w_a+δw_a)}(x0; w_c+δw_c) != 0    (4.135)

U(x0, t0, t1; w_a) + J^{π(w_a)}(x; w_c) − J^{π(w_a)}(x0; w_c) != 0    (4.136)

Here x_{δw_a} means x(t) following the policy given by g(x;w_a + δw_a), starting in state x0 = x(t0). This is used in the following section to find the critic update δw_c due to an actor update δw_a.

4.4.3 Almost Concurrent Actor and Critic Adaptation

Given a consistent actor-critic pair, i.e. Bellman's optimality equation is satisfied with no error, actor training would induce an 'error', or rather a change due to the new policy, ∆E_{w_a}, (4.137), with second-order approximation (4.139). Similarly, starting from a consistent actor-critic pair with a fixed actor, changing the critic weights would introduce an 'error' ∆E_{w_c}, (4.140), with first-order approximation (4.142).

∆E_{w_a} = U(w_a + δw_a) − U(w_a) + J(x_{δw_a};w_c) − J(x;w_c)    (4.137)

≐ δw_a^T [ dU/dw_a + (dx^T/dw_a)(∂J(x;w_c)/∂x) ] + ½ δw_a^T [ d²U/dw_a² + (dx^T/dw_a)(∂²J(x;w_c)/∂x²)(dx/dw_a) ] δw_a    (4.138)

= δw_a^T dJ(x0)/dw_a + ½ δw_a^T (d²J(x0)/dw_a²) δw_a ≈ ½ δw_a^T (d²J(x0)/dw_a²) δw_a    (4.139)

∆E_{w_c} = J(x;w_c + δw_c) − J(x0;w_c + δw_c) − [ J(x;w_c) − J(x0;w_c) ]    (4.140)

=: ∆J(x0;w_c + δw_c) − ∆J(x0;w_c)    (4.141)

≐ δw_c^T [ ∂J(x;w_c)/∂w_c − ∂J(x0;w_c)/∂w_c ] =: δw_c^T ∂∆J(x0;w_c)/∂w_c    (4.142)

To achieve consistency again after a training cycle involving actor and critic training, the change due to the actor, ∆E_{w_a}, has to be matched by an appropriate critic change ∆E_{w_c}, i.e. ∆E_{w_a} != ∆E_{w_c}. For a given actor change δw_a and an approximated expectation operator over a set of a sufficiently large number n_a ≥ dim(w_a) of starting points x_{0,i}, it follows that (4.143) has to hold.

δw_a^T E_{n_a}[ dJ0/dw_a ] + ½ δw_a^T E_{n_a}[ d²J0/dw_a² ] δw_a != δw_c^T ∂∆J(x_{0,i};w_c)/∂w_c    (4.143)

To solve for δw_c there are two possibilities. First, one might gather more points to build up a matrix A given by (4.146) and then calculate the pseudo-inverse. However, due to correlation of the columns in A, the matrix A^T A is ill-conditioned and close to singular. It can easily be seen that the columns are correlated: because the long-term cost J(x_{0,i};w_c) depends on the actor parameters w_a, where w_a is trained by an averaging process over many states x_{0,i}, its derivatives along a single trajectory, ∂J(x_{0,i};w_c)/∂w_c, are dependent. This is because the trajectory is completely determined by the policy defined by w_a. Thus, differences in the derivatives on one trajectory starting at x_{0,i} are very similar to differences in the derivatives on another trajectory starting at another point x_{0,j}, because they follow the same controller law, given by g(x;w_a). Therefore, the subtraction of derivatives along a trajectory makes the columns more independent from the individual starting points x_{0,i} (and thus counteracts the idea of using many different, randomly selected points x_{0,i} to achieve independence) and thereby correlates the columns of A. Further, the subtraction also leads to cancellation and close-to-zero values for short-term evaluation ∆t = t1 − t0 (remember: x_i = x(t1)_i, x_{0,i} = x(t0)_i). The approach is written down for the sake of completeness, but in practice the second approach, discussed below, is much more promising. For the first approach at least as many starting points as parameters w_c are needed: n_c ≥ dim(w_c). Furthermore, δw_a might be computed with a safeguarded Newton algorithm, where the safeguard could be a simple backstepping, taking only a fraction η_a ≤ 1 (η_a is estimated by the algorithm) of the originally computed Newton update to ensure a decrease in the objective function, E[dJ0/dw_a], of Newton's method. Together, this yields the following training cycle:

δw_a := −η_a [ E_{n_a}( d²J0/dw_a² ) ]^{−1} E_{n_a}( dJ0/dw_a )    (4.144)

δw_c := (A^T A)^{−1} A^T s II_{dim(w_c)×1}, with    (4.145)

A := [ ∂∆J(x_{0,1};w_c)/∂w_c ... ∂∆J(x_{0,i};w_c)/∂w_c ... ∂∆J(x_{0,n_c};w_c)/∂w_c ]^T    (4.146)

∂∆J(x_{0,i};w_c)/∂w_c = ∂J(x_{1,i};w_c)/∂w_c − ∂J(x_{0,i};w_c)/∂w_c    (4.147)

s := δw_a^T E_{n_a}[ dJ0/dw_a ] + ½ δw_a^T E_{n_a}[ d²J0/dw_a² ] δw_a    (4.148)

II_{dim(w_c)×1} := [1 ... 1 ... 1]^T    (4.149)

E_{n_a} f(·) := (1/n_a) Σ_{i=1}^{n_a} f(x_{0,i})    (4.150)

Here E_{n_a}[d²J0/dw_a²] and E_{n_a}[dJ0/dw_a] are given by (4.89) and (4.79), respectively.

The second approach expands the difference ∆J(x0;w_c + δw_c) in the long-term cost J(x;w_c) in a first-order Taylor series and selects δw_c := η ∂∆J(x0;w_c)/∂w_c, as follows in (4.151) to (4.153).

∆J(x0;w_c) = J(x;w_c) − J(x0;w_c)    (4.151)

∆J(x0;w_c + δw_c) ≐ ∆J(x0;w_c) + δw_c^T ∂∆J(x0;w_c)/∂w_c    (4.152)

= ∆J(x0;w_c) + η ||∂∆J(x0;w_c)/∂w_c||²    (4.153)

Using J(x;w_c + δw_c) as the new cost-to-go for the new policy π(w_a + δw_a) leads to a critic update according to (4.154) and (4.155).

η = ∆E_{w_a} / ||∂∆J(x;w_c)/∂w_c||²    (4.154)

δw_c = η ∂∆J(x;w_c)/∂w_c = η [ ∂J(x;w_c)/∂w_c − ∂J(x0;w_c)/∂w_c ]    (4.155)

Here ∆E_{w_a} is given by (4.138). With this choice, ∆E_{w_a} is exactly equal to ∆E_{w_c} = δw_c^T ∂∆J(x0;w_c)/∂w_c given by (4.142), as demanded by a first-order approximation. To account for higher accuracy in Bellman's optimality condition, standard HDP training could improve consistency between actor and critic to an arbitrary degree. However, the first-order approximation introduced here might be sufficient to safely improve the current policy again, at least for a few actor-critic training cycles. In any case, standard HDP critic training will be sped up, and HDP critic training might be reduced to being invoked only every n-th actor-critic training cycle.

4.4.4 Some Remarks

A few notes on the convergence of adaptive critics via actor and critic cycles for continuous domains should be made. Landelius has proven convergence for HDP and DHP in the special case of an LQR system [15, 41].

Prokhorov has investigated convergence in training for ACDs with non-linear neural networks based on proofs of stochastic optimization [2], which in turn is based on proofs for stochastic iterative algorithms in general (Kushner and Clark [42], and a reference to a publication by Tsypkin in Russian) and uses the proofs in Bertsekas and Tsitsiklis [11] in particular. There are two kinds of errors: approximation errors by the network, and errors due to the nature of an incremental optimization of nonconvex functions.

Prokhorov also emphasizes that "Strictly speaking, critics lose validity as the weights of the action network are changed. A more rigorous approach would be to resume the critic training as soon as one action weight update is made". The procedure here suggests precisely this. However, as the error is only a linear approximation in terms of weight changes, after a while some error might build up, and a standard critic update can then be performed to achieve an 'error-free' Bellman equation. Nevertheless, this is probably as close as possible to having concurrent actor and critic training without using higher-order terms to model the influence of δw_a on the necessary critic changes δw_c.

A final remark to ‘continuous backpropagation’ and its discrete counterpart BPTT (h).Normally, all the algorithms will be implemented on a discrete clocked computer whichseems to be a plus for the discrete BPTT (h). However, any integration routine does ba-sically a discretization but with variable time steps. This is of a great advantage becauseno time is wasted by cycling through areas with little steps when nothing happens. Thetruncation depth h in BPTT (h) is the same as the time t1, used to indicate the short- tomid-term costs. Another difference is that calculating total derivatives in BPTT (h) is per-

Page 90: Approximate Dynamic Programming with Adaptive …...of adaptive critic designs (ACDs). Dynamic programming can be used to flnd an optimal decision or control policy over a long-term

CHAPTER 4. DP, ACD, SRN, TOTAL ORDERED DERIVATIVES 72

formed backwards, whereas with the ‘continuous backpropagation’ a forward integrationis performed.

In the previous section a complete formulation to train the controller or actor based on second-order derivatives in conjunction with Newton's method has been introduced, as well as an almost concurrent adaptation of the critic weights based on actor changes. However, to use this approach it is necessary to have second-order derivatives for the controller network g(x;w_a) as well as for the critic network J(x;w_c); see, for example, equation (4.88). For the critic approximation network J(x;w_c) this means ∂²J(x;w_c)/∂x² has to be calculated⁴, and this is exactly what has to be done in GDHP, which is the most advanced adaptive critic design. The simplest way to calculate the second-order derivatives was suggested by Werbos [43] and implemented by Prokhorov for a one-layered multilayer perceptron [2]. Basically, for the given network J(x;w_c) a dual network λ(x;w_c) is constructed by applying the backpropagation algorithm to the network J(x;w_c). Together with the original J(x;w_c) network, this can be seen as a combined 'forward' network which still has the same parameters w_c as the original network. Applying backpropagation to this combined network, which outputs λ(x;w_c) = dJ(x;w_c)/dx, calculates precisely the desired second-order derivatives ∂²J(x;w_c)/∂x². This is perhaps the most efficient implementation for calculating second-order derivatives; at least the author of this thesis is not aware of any better solution. In chapter 5, where the formulae introduced here are tested on an LQR system, the critic network is modelled by a quadratic form, on which it is easy to calculate exact second-order derivatives. As the quadratic form is also theoretically correct, it is expected to converge exactly to the theoretical solution given by the solution of the corresponding algebraic Riccati equation.

Another implementation of a second-order training method, called the extended Kalman filter (EKF), has been made and successfully experimented with, see e.g. [2], [44]⁵. The advantage of the EKF algorithm over Newton's method is that it is based on pattern-by-pattern updates, unlike the Newton method presented here. However, while in the method here, particularly (4.89) and (4.85), the expectation operator and the update equation are approximated by a batch update, this is not strictly necessary for further updates: a running average could be used, and thereby a more pattern-by-pattern update version could easily be achieved. Also, the inverse Hessian could be updated using the matrix inversion lemma, also known as Woodbury's equation, again achieving a pattern-by-pattern update, see e.g. [45]. A more troubling aspect for both EKF and Newton's method is that both algorithms are of complexity O(N²), where N is the number of parameters, whereas Werbos' method for GDHP outlined above is still of complexity O(N).

In this chapter the discount factor γ has been left out, but it is straightforward to introduce it into the equations corresponding to Bellman's optimality equation, by modifying terms involving the cost-to-go function J(x(t1);w_c) =: J1 with a multiplicative factor γ. As, under some benign assumptions to be outlined later, the cost integral for the LQR system is finite, no γ-factor has to be introduced anyway, or it would simply have to be set to one.

⁴For fixed parameters w_c, partial and total derivatives are the same, ∂²J(x;w_c)/∂x² ≡ d²J(x;w_c)/dx², and they are used interchangeably here.

⁵Thanks to Prokhorov and Werbos for pointing to some work regarding the EKF algorithm, initiated at Ford.


Chapter 5

Testing of Euler Equations with Adaptive Critics

In this chapter the Euler training developed in section 3.2 is tested on a linear quadratic regulator (LQR). It turns out that this approach has considerable problems due to taking only partial derivatives. In this first simple form of 'Euler training' the controller network is assumed to be a universal function approximator that can find the exact mapping needed, without limitation and therefore without introducing any constraints. Knowing the analytical solution to the LQR problem and using exact quadratic forms for the critic as well as a linear feedback for the controller allows for a complete investigation of the problems incurred. This is done in section 5.2, where the Euler equation can be calculated explicitly, leading to some specific adaptation laws for the LQR system, intended to investigate the problems rather than for practical use. This led to the development and use of total derivatives in section 4.4, which is applied to the simplified Euler equations (3.53) and investigated in section 5.3. Furthermore, the second-order adaptation developed in section 4.4.2 is tested as well, on the same LQR system, in section 5.4, showing robustness and fast convergence.

5.1 Linear system with quadratic cost-to-go function (LQR)

The LQR system equations and cost density are defined by (5.1) and (5.2).

ẋ = Ax + Bu,  A = [−1 −2; 1 −4],  B = [0.5 −1; 0.5 2]    (5.1)

φ(x, ẋ) = x^T Q x + ẋ^T R ẋ,  Q = R = [1 0; 0 1]    (5.2)



The control u = g(x,w) should be of state-feedback form with some parameters w, and the cost-to-go function or performance index is given by (5.3).

J(x, ẋ) = ∫_{t0}^{∞} φ(x, ẋ) dt    (5.3)

5.1.1 Optimal LQR-control

To solve the system above with minimal performance index, an Algebraic Riccati Equation (ARE) has to be solved. Details can be found in an advanced book on modern control theory, e.g. the excellent book by Brogan [21], chapter 14. However, for numerical purposes, MATLAB's lqr function can be used to calculate the optimal feedback gain. To make use of MATLAB's lqr function, the performance index has to be changed to (5.4), where a simple comparison with the original performance index yields Q̄ = Q + A^T R A, R̄ = B^T R B, N = A^T R B and R^T = R. Additional requirements are that the pair (A,B) be stabilizable, R̄ > 0, Q̄ − N R̄^{−1} N^T ≥ 0, and that (Q̄ − N R̄^{−1} N^T, A − B R̄^{−1} N^T) has no unobservable mode on the imaginary axis.

J(x,u) = ∫_{t0}^{∞} φ(x,u) dt = ∫_{t0}^{∞} ( x^T Q̄ x + u^T R̄ u + 2x^T N u ) dt    (5.4)

The optimal control law has the form u(x) = g(x;K) = −Kx with feedback matrix K, which can be expressed as (5.5) and (5.6).

K = R̄^{−1}(B^T S + N^T), where S is the solution to the ARE:    (5.5)

0 = A^T S + SA − (SB + N) R̄^{−1} (B^T S + N^T) + Q̄    (5.6)
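Numerically, the same gain can be obtained with SciPy's ARE solver instead of MATLAB's lqr (an equivalent route, not the one used in the thesis); the values of A, B, Q, R are those of the system (5.1)-(5.2):

    import numpy as np
    from scipy.linalg import solve_continuous_are

    A = np.array([[-1.0, -2.0], [1.0, -4.0]])
    B = np.array([[0.5, -1.0], [0.5, 2.0]])
    Q = R = np.eye(2)

    Qbar = Q + A.T @ R @ A
    Rbar = B.T @ R @ B
    N = A.T @ R @ B

    S = solve_continuous_are(A, B, Qbar, Rbar, s=N)   # solution of the ARE (5.6)
    K = np.linalg.solve(Rbar, B.T @ S + N.T)          # feedback gain, eq. (5.5)
    print(K)   # reproduces K* = [[0.6667, -4.6667], [0.3333, -0.3333]] of section 5.2.1.1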

5.1.2 Pretraining of Actor with Euler Equations

Training of the actor according to equations (3.54) to (3.57) needs a theoretical learning rate of η = 1. But as the constant C of (3.53) is unknown and a random initialization of the actor network is done, a smaller learning rate of η = 0.05 is safer to use. Equation (3.57) is used to update the actor weights.

5.1.2.1 Training-Algorithm

The following steps are done for training:

• Initialization of parameters (dt, dti = dt/N, ηa, µa), which are the step size of an integration step, the step size within an integration step, the learning rate, and a momentum term (normally chosen to be zero), respectively.

• A random initial start point is chosen.

• Using this random initial start point, calculate the trajectory from the start point during the time max spt · dt while adapting the actor weights by steepest descent such that the long-term cost J is minimized, using equations (3.54)-(3.57). max spt should be made relatively small because of dying-out transients; in the experiments it is chosen to be one.

• Get a new random initial start point and repeat the procedure.

While training goes on, the initial state space and/or the learning rates may be increased to account for smaller errors. Note: using the same start point for a few consecutive trajectories allows for checking whether J really decreases or not, e.g. because of a too-large step size.

5.1.2.2 Problems

The biggest problems related to training are that close to the origin the errors are very small (in the order of 10⁻⁹ or even much smaller), and that while learning the control u = −Kx from an arbitrary matrix K, the cost-to-go J(K) does not look like a paraboloid but has a long, flat and narrow valley at several stages during training (see figure 5.4). While training goes on, the gradients become so small and the valley so flat that it requires a huge learning rate in one direction and a very small one in the other. With a larger learning rate divergence is very likely, and with a small learning rate training takes a very long time. In between, some oscillations around a good solution for the cost-to-go, J(K_good), K_good = [0.4872 −4.3895; 0.2718 −0.4166], can be achieved, but the parameters are still quite far away from the optimal ones.
Euler training with partial derivatives is ill-conditioned close to the optimal policy. A closer look at equation (3.53), or (3.56), reveals the problem: it is not possible, from a short-term deviation, to predict a well-conditioned long-term effect if there are slight errors in the gradient calculation. As (3.53) must be valid for the optimal policy for all trajectories at all times, a possible deviation in the "constant" will not be sensitive to any trajectory nor to a particular time. This means there is almost no deviation error which could be used to improve a good policy to an optimal one; that would only be possible with a true total gradient. Equation (3.56) makes this even clearer. Not even an increase of dC(t, dt), defined as the difference of two terms evaluated at times t + dt and t, to a discrete ∆C(t, ∆t) can improve a possible deviation in the constant, because a constant is naturally time-independent. During the training this constant is implicitly learned and becomes less and less sensitive to any quantity. Not using the true gradient but only the partial gradient causes the training to fail before reaching the optimal solution. This is clearly highlighted by figures 5.3 and 5.4 (snapshots of the training progress).
Also, practical tests on improving the actor parameters utilizing equation (3.53) and using the Nelder-Mead simplex or quasi-Newton methods failed to improve the policy in a reliable manner. This is due to the fact that equations (3.53) and (3.56), respectively, only contain "short-term information" represented by the immediate cost density φ, the current state-changes ẋ, and a derivative term involving φ and ẋ.


[Surface plots of cost-to-go functions over the state space x ∈ [−1, 1]²; only the captions are reproduced here.]

Figure 5.1: The optimal cost-to-go function J(x;K_opt) over the state space x.

Figure 5.2: Cost function after some time, which causes the parameters to oscillate around K_good. Even though K_good is not that close to the optimal parameter matrix K_opt, the cost-to-go function J(x;K_good) is very close to the optimal one (see figure 5.3).

Figure 5.3: Difference between J(x;K_good) and the optimal J(x;K_opt). The saddle-point behaviour is difficult for learning with random initialization points: some points would like to increase and others to decrease the current J(x;K_good) while converging to the optimal J.

Figure 5.4: While training from a random initial K, the J(x;K_temp) function becomes at certain stages a narrow but long and flat valley. This causes problems of convergence speed, oscillations, or even divergence.


[Plots of parameter and state/control trajectories; only the captions are reproduced here.]

Figure 5.5: Trajectory of the parameters −K; k11: blue, k12: green, k21: red, k22: cyan. The optimal values are dotted. After 768 iterations the parameter values are those of −K_good. In the next iteration −K = −[0.5109 −4.3997; 0.3506 −0.4627]; then the further adaptation was too large and the system diverged.

Figure 5.6: Trajectory of the state x(t) and control u(t) for parameters −K_good and initial state x0 = [1.7877, −1.0512]^T; x1: blue, x2: green, u1: red, u2: cyan. The dotted lines correspond to the optimal parameters −K_opt.

5.1.3 Conventional ACD Training

Using the conventional ACD training equations ((3.46, 3.51) and (3.34, 3.36) for actor and critic, respectively), a solution close to the optimal one can be achieved as well, with a feedback matrix K = [0.6584 −4.6715; 0.3349 −0.3347]. However, the switching between critic and actor training must be chosen carefully. The critic has to converge for the current actor policy before a switch can occur. If a fixed number of cycles is used for critic training, this number is in the order of 500-4000, depending on the learning rate used. Actor training uses only 30-50 cycles with a comparable learning rate. With fixed learning rates and fixed numbers of critic and actor training cycles, convergence is extremely slow. Similarly to the training with the Euler equations, conventional ACD training also suffers from slow convergence close to the optimal solution, due to the reduced errors. However, it has fewer convergence problems close to the optimal solution than the Euler training, but suffers much more from slow convergence overall.

5.1.4 Comparison Conventional ACD Training versus Euler Training

The slow convergence of the conventional training is also intuitively clear, as the training starts with an arbitrary policy and calculates the cost-to-go function for this policy. Then the policy is changed such that the current cost-to-go function for the prior policy becomes 'independent' of it, and the next cost-to-go function is evaluated. This will finally converge to the optimal policy under the assumption that all states are visited infinitely often. On the other hand, the Euler training does not calculate the cost-to-go function but aims training directly at extremal policies. This yields much faster training in the beginning, but it deteriorates close to the optimal solution. Clearly, having an adaptive optimization routine for conventional training is a must to achieve acceptable training times; but such routines can be applied to the Euler training as well, speeding it up even further. It is obvious that the two methods can be combined, using the Euler training for fast pre-training, while the conventional training can be used close to the optimal solution. Table 5.1 gives an idea of the feedback parameter values for a single run (column K) and averaged over 10 training runs with different initial values (column ⟨K⟩). Figures 5.8 and 5.10 indicate the convergence speed. Figure 5.7 is a ten-fold enlargement of the beginning of figure 5.8. Figure 5.9 uses a combination of Euler and conventional training.

Training Method            K                                   ⟨K⟩                                 # Iterations   Training time/sec
Euler training             [0.4140 −4.3065; 0.2506 −0.4451]    [0.5234 −4.4575; 0.3730 −0.5555]    15000          373
Euler, then conventional   [0.6402 −4.6004; 0.3457 −0.3478]    [0.6402 −4.6004; 0.3457 −0.3478]    90000          2511
Euler, then conventional   [0.6576 −4.6603; 0.3384 −0.3392]    [0.6546 −4.6694; 0.3370 −0.3373]    200000         5604
Conventional               [0.4311 −1.8763; 0.3340 −0.2549]    [0.0524 −1.4762; 0.4168 −0.2175]    15000          424
Conventional               [0.6140 −4.3422; 0.3393 −0.3626]    [0.6011 −4.3760; 0.3400 −0.3598]    90000          2544
Conventional               [0.6528 −4.6664; 0.3367 −0.3374]    [0.6522 −4.6663; 0.3368 −0.3374]    200000         5655
Optimal solution           [0.6667 −4.6667; 0.3333 −0.3333]    [0.6667 −4.6667; 0.3333 −0.3333]    -              -

Table 5.1: Comparison between the Euler and conventional training.

5.2 Improved adaptation laws for LQR systems

As the previous section showed, Euler training initially adapts fast but soon runs into problems. Therefore, other adaptation laws are investigated in the LQR framework.


Summarizing the equations for the LQR system yields (5.7) to (5.17).

ẋ = f(x, u) = Ax + Bu    (5.7)

u = g(x) = −Kx    (5.8)

φ(x, ẋ) = x^T Q x + ẋ^T R ẋ    (5.9)

φ(x, ẋ) = φ(x, f(x,u)) = x^T(Q + A^T R A)x + u^T(B^T R B)u + x^T(A^T R B)u + u^T(B^T R A)x    (5.10)

= x^T(Q + A^T R A)x + u^T(B^T R B)u + 2x^T(A^T R B)u, if R = R^T    (5.11)

= x^T Q̄ x + u^T R̄ u + 2x^T N u    (5.12)

Q̄ = Q + A^T R A    (5.13)

R̄ = B^T R B    (5.14)

N = A^T R B    (5.15)

Q := C^T C ⇒ Q = Q^T    (5.16)

R := D^T D ⇒ R = R^T    (5.17)

The system is (5.18), and the constant c of equation (3.53) is (5.19) and becomes (5.23), where the last two terms of the second-last line cancel out because they are just transposes of each other, scalar, and with opposite sign.

ẋ = f(x, g(x;K)) = f(x;K) = (A − BK)x    (5.18)

c(K) = ẋ^T ∂φ(x,ẋ)/∂ẋ − φ(x,ẋ) != constant    (5.19)

= x^T(A−BK)^T (R + R^T)(A−BK)x − [ x^T Q x + x^T(A−BK)^T R (A−BK)x ]    (5.20)

= x^T[ (A−BK)^T R^T (A−BK) − Q ]x = x^T[ (A−BK)^T D^T D(A−BK) − C^T C ]x    (5.21)

= x^T[ (A−BK)^T D^T ∓ C^T ][ D(A−BK) ± C ]x ± x^T(A−BK)^T D^T C x ∓ x^T C^T D(A−BK)x    (5.22)

= x^T[ D(A−BK) ∓ C ]^T [ D(A−BK) ± C ]x    (5.23)

As c(K) has to be constant for all times and trajectories x, (5.24) must hold.

dc(K)/dt != 0    (5.24)

From the solution of the algebraic Riccati equation it is known that K is a constant feedback matrix for the given problem. Therefore, the closed-loop matrix A_cl := A − BK is constant and (5.25, 5.26) hold.

dc(K)/dt = ẋ^T( [D A_cl ∓ C]^T[D A_cl ± C] + [D A_cl ∓ C][D A_cl ± C]^T ) x    (5.25)

         = x^T A_cl^T( [D A_cl ∓ C]^T[D A_cl ± C] + [D A_cl ∓ C][D A_cl ± C]^T ) x != 0    (5.26)


This implies:

A_cl = 0, which is the trivial solution, or    (5.27)

[ D A_cl ± C ] = 0.    (5.28)

From the interesting second equation there are two solutions for the feedback matrix (5.29) and associated closed-loop matrices (5.30), where the real parts of the eigenvalues of A_cl+ and A_cl− are negative and positive, respectively.

K± = (B^T B)^{−1} B^T (A ± D^{−1}C), with associated closed-loop matrices    (5.29)

A_cl± = A − B K±    (5.30)

Therefore, the stable closed-loop solution will be achieved with K+. These conditions imply (5.31).

c(K) ≡ 0. (5.31)

However, equation (5.28) could also be used for an adaptive law, with an error E to be driven to zero, or equivalently a target T to be minimized using a steepest-gradient method (5.32)-(5.38), where vecr(K) is the row-wise stacking operator on the matrix K, that is, the rows of the matrix K are concatenated into a vector (a column-wise stacking could also be used, but ∂g^T(x,K)/∂vecc(K) would be less compact to write in equation (5.38)).

E := [ D(A − BK) + C ]x != 0    (5.32)

= Dẋ + Cx    (5.33)

= Df(x, u) + Cx = Df(x, g(x;K)) + Cx    (5.34)

T = ½ E^T E    (5.35)

vecr(K̇) = −η ∂T/∂vecr(K)    (5.36)

= −η (∂g^T(x,K)/∂vecr(K)) (∂f^T(x,u)/∂u) (∂E^T/∂ẋ) (∂T/∂E)    (5.37)

= η [ x 0 ... 0 ; 0 x ... 0 ; ... ; 0 ... 0 x ] B^T D^T E    (5.38)

Another adaptive law can be obtained by calculating dc(K)/dt while directly taking an adaptive K into account. First-order expansion of the differential dc in K and x yields (5.39) to (5.41).

dc(K; dK, x; dx) = c(K + dK, x + dx) − c(K, x)    (5.39)

c(K + dK, x + dx) = (x + dx)^T[ (A − B(K + dK))^T R^T (A − B(K + dK)) − Q ](x + dx)    (5.40)

≐ x^T[ (A − B(K + dK))^T R^T (A − B(K + dK)) − Q ]x
  + dx^T[ (A − BK)^T R^T (A − BK) − Q ]x
  + x^T[ (A − BK)^T R^T (A − BK) − Q ]dx

≐ x^T[ (A − BK)^T R^T (A − BK) − Q ]x
  − x^T[ (BdK)^T R^T (A − BK) + (A − BK)^T R^T BdK ]x
  + dx^T[ (A − BK)^T R^T (A − BK) − Q ]x
  + x^T[ (A − BK)^T R^T (A − BK) − Q ]dx    (5.41)

With dx = ẋ dt = (A − BK)x dt and R = R^T, (5.41) becomes (5.42), and division by dt yields (5.43).

dc(K; dK, x; dx) |_{R=R^T} = −x^T[ ((A−BK)^T R^T BdK)^T + (A−BK)^T R^T BdK ]x
  + x^T(A−BK)^T[ (A−BK)^T R^T (A−BK) − Q ]x
  + x^T[ (A−BK)^T R^T (A−BK) − Q ](A−BK)x    (5.42)

dc(K; dK, x; dx)/dt = −x^T{ [ ((A−BK)^T R^T B (dK/dt))^T + (A−BK)^T R^T B (dK/dt) ]
  − (A−BK)^T[ (A−BK)^T R^T (A−BK) − Q ]
  − [ (A−BK)^T R^T (A−BK) − Q ](A−BK) } x    (5.43)

!= 0, ∀x    (5.44)

As this has to be valid for all trajectories x, the curly bracket has to be zero; it has the form (5.45) and (5.46).

dc(K; dK, x; dx)/dt = −x^T{ (EK̇)^T + EK̇ − A_cl^T[ A_cl^T R^T A_cl − Q ] − [ A_cl^T R^T A_cl − Q ]A_cl } x    (5.45)

= x^T{ A_cl^T F + F A_cl − (EK̇)^T − EK̇ } x, with    (5.46)

E = (A − BK)^T R^T B = A_cl^T R^T B    (5.47)

F = A_cl^T R^T A_cl − Q, note: F = F^T    (5.48)

If equation (5.45) has to hold for any x, this implies that (5.49) holds, and therefore K has to change according to (5.50).

F A_cl != E K̇    (5.49)

K̇ = (E^T E)^{−1} E^T F A_cl    (5.50)

5.2.1 Numerical Example

5.2.1.1 rank(K) = dim(x)

Using the following system values (5.51) to (5.53), the optimal feedback is given by (5.55), and the stable and unstable feedbacks are (5.56) and (5.57), respectively.

A = [−1 −2; 1 −4], eig(A) = −2, −3    (5.51)

B = [0.5 −1; 0.5 2]    (5.52)

Q = R = C = D = [1 0; 0 1]    (5.53)

S = [1 0; 0 1], solution to the ARE (5.6)    (5.54)

K* = [0.6667 −4.6667; 0.3333 −0.3333], optimal feedback by (5.5)    (5.55)

K+ = [0.6667 −4.6667; 0.3333 −0.3333], feedback by (5.29) and (5.37)    (5.56)

K− = [−2 −6; 1 −1], unstable feedback by (5.29)    (5.57)

A_cl* = A − BK* = [−1 0; 0 −1], eig(A_cl*) = −1, −1    (5.58)

A_cl+ = A − BK+ = [−1 0; 0 −1], eig(A_cl+) = −1, −1    (5.59)

A_cl− = A − BK− = [1 0; 0 1], eig(A_cl−) = 1, 1    (5.60)

If the matrix B is of full rank, all the (stable) methods achieve the same optimal result for the feedback matrix K*, including equation (5.50), which is not listed here.
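A quick numerical check of this (illustrative only, not thesis code): solving the linear condition (5.28), D(A − BK) ± C = 0, in the least-squares sense for the system above reproduces the optimal gain on the stable branch and the unstable mirror solution on the other branch.

    import numpy as np

    A = np.array([[-1.0, -2.0], [1.0, -4.0]])
    B = np.array([[0.5, -1.0], [0.5, 2.0]])
    C = D = np.eye(2)

    K_plus = np.linalg.lstsq(D @ B, D @ A + C, rcond=None)[0]   # stable branch of (5.28)
    K_minus = np.linalg.lstsq(D @ B, D @ A - C, rcond=None)[0]  # unstable branch

    print(K_plus)                                  # [[0.6667, -4.6667], [0.3333, -0.3333]]
    print(np.linalg.eigvals(A - B @ K_plus))       # both eigenvalues at -1
    print(np.linalg.eigvals(A - B @ K_minus))      # both eigenvalues at +1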

5.2.1.2 rank(K) < dim(x)

Lowering the dimension of the control u, and therefore the rank of the control matrix B and of the feedback matrix K, imposes constraints on the possible mappings g : x ↦ u. With the following system values (5.61) to (5.63), this yields the feedback matrices (5.65) to (5.68).

A = [−1 −2; 1 −4], eig(A) = −2, −3    (5.61)

B = [1; −3]    (5.62)

Q = R = C = D = [1 0; 0 1]    (5.63)

S = [1.0207 −0.1865; −0.1865 2.6788], solution to the ARE (5.6)    (5.64)

K* = [−0.2420 0.1777], optimal feedback by (5.5)    (5.65)

K+ = [−0.3000 0.7000], feedback by (5.29) and (5.37)    (5.66)

K− = [−0.5000 1.3000]    (5.67)

K = [−0.3571 1.2143], by (5.50)    (5.68)

A_cl* = A − BK* = [−0.7580 −2.1777; 0.2741 −3.4669], eig(A_cl*) = −1.000, −3.225    (5.69)

A_cl+ = A − BK+ = [−0.7000 −2.7000; 0.1000 −1.9000], eig(A_cl+) = −1, −1.6    (5.70)

A_cl− = A − BK− = [−0.5000 −3.3000; −0.5000 −0.1000], eig(A_cl−) = −1.6, 1    (5.71)

A_cl = A − BK = [−0.6429 −3.2143; −0.0714 −0.3571], eig(A_cl) = −1, 0    (5.72)

If the matrix B is not of full rank, the adaptive methods all give different solutions, which yield larger eigenvalues and cost integrals (3.2). This is not surprising, because they all violate the independence assumption on the state components, which is crucial to the fundamental lemma of the calculus of variations and thus to the Euler equations. For this case the augmented Euler equations would have to be used, which involves the introduction of Lagrange multipliers. As seen before in chapter 2, this is equivalent to the Hamiltonian formulation, which is mostly used in optimal control. The Hamiltonian formulation has the additional benefit of fitting nicely with Pontryagin's minimum principle, which can also handle constraints on the controls u. However, this chapter is mainly intended to expose the problems involved with the simple adaptation laws used in the beginning of this chapter. The next section improves the laws by using total instead of partial derivatives, and it will show highly improved robustness around the optimal solution with a full-rank feedback matrix K. The method is still based on the Euler equations (3.53). In section 5.4 it is shown that the approach of using traditional adaptive critics will find the correct optimal values for K in both cases, independent of the rank of K. Moreover, when used with the training methods introduced in section 4.4, only a few actor-critic training cycles are needed, which speeds up the traditional adaptive critic training quite considerably.

5.3 Continuous Version of Total Derivatives for Euler-ACDs

As seen before in equation (5.31), the constant c is zero for LQR systems. Therefore, a training target to minimize could be (5.73).

T = ½ ∫_{t0}^{t1} ||c||² dt    (5.73)

dT/dw != 0, necessary condition for an extremum    (5.74)

c = c(x;w) = ẋ^T ∂φ(x,ẋ)/∂ẋ − φ(x, f(x, g(x;w))) =: D(x,w) − P(x,w)    (5.75)

Using the approach developed in section 4.4 for handling total derivatives, applied to the Euler equations (3.53, 5.75), yields the update (5.76) to (5.77).

dc/dw = (dx^T/dw)(∂c/∂x) + ∂c/∂w    (5.76)

= (dx^T/dw)[ (df^T/dx)( ∂φ/∂ẋ + (∂²φ/∂ẋ²)f ) + (∂²φ/∂x∂ẋ)f − ( ∂φ/∂x + (df^T/dx)(∂φ/∂ẋ) ) ] + (∂g^T/∂w)(∂f^T/∂u)(∂²φ/∂ẋ²)f    (5.77)

A more detailed derivation of this result is given in appendix B.4. For dx^T/dw the same trick of integrating the differential equation (4.56) with initial condition (4.57) is used. In the case of the LQR system, equation (5.77) evaluates to (5.78).

dc/dw = (dx^T/dw)[ (A−BK)^T ( 2(R + R^T)ẋ ) − ( (Q + Q^T)x + (A−BK)^T(R + R^T)ẋ ) ] + (∂g^T/∂w) B^T (R + R^T) ẋ    (5.78)

Using (5.78) and starting in the vicinity of the optimal value K*, at K_good = [0.65 −4.7; 0.3 −0.4], with a learning rate η = 0.01, after 100000 iterations this yields almost correct values for the feedback parameters K, as given by (5.79).

and with a learning rate η = 0.01, after 100000 iterations yields almost correct values forthe feedback parameters K as defined by (5.79).

K =

[0.6659 −4.66470.3328 −0.3339

]. (5.79)

Unlike the approach using only partial derivatives, which is quite unstable and even diverges away from the optimal feedback values, this one is robust in the vicinity of the optimal values and shows no signs of divergence. However, with random starting values or too-large learning rates, this approach has problems with convergence, and the rate of convergence is also very low. The reasons are similar to those for the approach using only partial derivatives. The idea of using the Euler equations, especially the simplified ones, to speed up adaptive critic designs has failed so far, but it has led to more insight and to the conclusion that total derivatives carry enormous benefits in terms of robustness. However, to solve the variational problem in the context of optimal control, a straightforward optimization over the finite-dimensional set of control parameters will prove much more successful than 'going on a detour via the infinite-dimensional function space underlying the Euler equations'. Guided by the fact that, if second-order derivatives can be calculated, nothing is going to beat Newton's algorithm, the approach of section 4.4.2 was developed; it is tested in the following section.

5.4 Results for Continuous ACDs with Newton’s Method

In this section the Newton training is tested on the same LQR example as has been used previously.

5.4.1 rank(K) = dim(x)

Figure 5.11 shows the actor or feedback parameters K. The solid lines represent only the periods of actor training, with input and output values to the Newton routine. To improve stability, Newton's method was extended to only allow changes dK which satisfy ||dK||∞ ≤ 10||K||∞; otherwise dK := (||K||∞/||dK||∞) dK. It may occasionally happen that Newton's method diverges from a random set of parameters, e.g. if their values are too large and with that feedback matrix even the short-term integral takes very large values and numerical problems occur, or the proposed clipping might cause oscillations. In these relatively rare cases, another initialization is the fastest way to solve the problem. Figure 5.12 shows the adaptation of the critic parameters w_c. Remarkably, the linearized critic updates proposed in section 4.4.3 work very well and speed up convergence considerably, especially when actor changes are of significant magnitude.
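The step-size safeguard described above amounts to the following clipping rule (a sketch, illustrative only):

    import numpy as np

    def clip_newton_step(K, dK):
        """Rescale a proposed Newton change dK if its infinity norm exceeds 10*||K||_inf."""
        nK = np.max(np.abs(K))
        ndK = np.max(np.abs(dK))
        if ndK > 10.0 * nK:
            dK = (nK / ndK) * dK
        return dK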

5.4.2 rank(K) < dim(x)

Similar observations are made as in the case with a full-rank feedback matrix. However, in contrast to the other methods tested in section 5.2, the optimal values can be achieved. Figures 5.13 and 5.14 show the corresponding actor and critic weight adaptation, respectively.


[Plots of the actor parameters K (or averaged ⟨K⟩) versus the number of iterations; only the captions are reproduced here.]

Figure 5.7: Euler training. The parameters are learned very fast until training breaks down as it comes closer to the optimal values.

Figure 5.8: Euler training, using the same time scale as the conventional training.

Figure 5.9: Euler, then conventional training. Here the optimal parameters can be retrieved, and a reduction in the overall training time can be achieved due to the fast Euler pre-training stage.

Figure 5.10: Conventional training. Here the optimal parameters can be retrieved. However, training is slow.


0 20 40 60 80 100 120 140 160 180−5

−4

−3

−2

−1

0

1

Traing time per second

Fee

dbac

k pa

ram

eter

s K

k11

k21

k12

k22

Figure 5.11: Trajectory of the parameters K for the system given in section 5.2.1.1.The solid lines represent the time actor training via Newton’s method. During the timeindicated by the dashed lines, actor parameters are frozen and critic weights are adapted.After four actor-critic cycles the parameters are learned within an error better than 10−5.

Figure 5.12: Trajectory of the critic parameters Wc (entries w11, w21, w12, w22, plotted against the training time in seconds). The solid lines represent the time during which critic training is performed. After the first actor-critic cycle the actor-critic consistency is achieved and the proposed linear critic updates due to actor changes can be applied. This is shown by the black lines, which represent a jump towards the optimal values, especially for the non-zero w11, w22 at the second actor-critic cycle.


Figure 5.13: Trajectory of the parameters K (entries k11, k12, plotted against the training time in seconds) for the system given in section 5.2.1.2. The solid lines represent the time during which actor training via Newton's method is performed. During the time indicated by the dashed lines, actor parameters are frozen and critic weights are adapted. After four actor-critic cycles the parameters are learned to within an error better than 10⁻⁵.

Figure 5.14: Trajectory of the critic parameters Wc (entries w11, w21, w12, w22, with w12 = w21, plotted against the training time in seconds). The solid lines represent the time during which critic training is performed. After the first actor-critic cycle the actor-critic consistency is achieved and the proposed linear critic updates due to actor changes can be applied. This is shown by the black lines, which represent a jump towards the optimal values, given by (5.64), especially for the non-zero w11, w22 at the second actor-critic cycle.


Chapter 6

Hybrid Dynamical System

6.1 Introduction

A Hybrid Dynamical System (HDS) is a system with mixed continuous and discrete states. In this section a special HDS is considered: a linear noisy plant with unknown state, noisy observations, a quadratic cost density and a quadratic terminal cost. The plant is controlled by switched output feedback control.

In [46] theoretical results are given for this hybrid dynamical system (HDS) in terms of the existence of suitable solutions to a dynamic programming equation and to Riccati differential equations of the H∞ filtering type. This is a robust control problem for a class of HDS consisting of a continuous-time plant with control and disturbance inputs and a discrete event controller. The controller is defined by a collection of given controllers which are called basic controllers. The control strategy is a rule for switching from one basic controller to another. The control goal is to achieve a level of performance defined by an integral performance index similar to the requirement in standard H∞ control theory, which can be seen as minimizing the cost-to-go function in an adaptive critic design.

For more detail about hybrid dynamical systems in this context the reader is referred to the recent books by Matveev [47] and Savkin [48]; an extension of the theory to the infinite horizon can be found in [49]. Another, closely related publication on robust output feedback stabilizability via controller switching is [50].

In the next section the problem for the example of the HDS in [46] is given and transformed into an adaptive critic design (ACD).

6.2 Problem Formulation

The linear plant obeys (6.1), where x ∈ IRn is the state, ξ(t) ∈ IRp and ν(t) ∈ IRl are the disturbance inputs, u(t) ∈ IRh is the control input, z(t) ∈ IRq is the controlled output and y(t) ∈ IRl is the measured output; A(.), B1(.), B2(.), K(.), G(.) and C(.) are piecewise


continuous matrix functions.

ẋ(t) = A(t)x(t) + B1(t)ξ(t) + B2(t)u(t)
z(t) = K(t)x(t) + G(t)u(t)                                            (6.1)
y(t) = C(t)x(t) + ν(t)

However, the plant's state at t0 is not known, and the task is to find a switching controller Ij(.) which selects the control input u(t) as the output of a basic controller among a set of k basic controllers Ul(t, y(t)) with Ul(., 0) = 0, as defined by (6.2) to (6.4).

u_l(t) = U_l(t, y(t)),  l = 1, .., k        output feedback basic controllers   (6.2)
i_j = I_j(y(.)|_{t0}^{tj})                  switching controller                (6.3)
u(t) = U_{ij}(t, y(t)),  ∀t ∈ [tj, tj+1)    output feedback controller          (6.4)

Given a set of output measurements y(.)|_{t0}^{tj}, a switching sequence {i_j}_{j=0}^{N−1} can be constructed.

Definition. Let X0 > 0, Xf > 0 be symmetric positive definite cost matrices. If there exists a function V(x0) ≥ 0 such that V(0) = 0 and, for any vector x0 ∈ IRn, there exists a controller of the form (6.2), (6.3), (6.4) such that (6.5) holds for all solutions of the closed loop system (6.1), (6.4) with any disturbance inputs ξ(.), ν(.) ∈ L2[t0, tN], then the output feedback robust control problem with the cost matrices X0 and Xf is said to have a solution via controlled switching with the output feedback basic controllers (6.2).

x(tN)′ Xf x(tN) + ∫_{t0}^{tN} (||z(t)||² − ||ξ(t)||² − ||ν(t)||²) dt ≤ (x(t0) − x0)′ X0 (x(t0) − x0) + V(x0)   (6.5)

In [50] it is shown that the solution to the H∞ problem (6.5) involves the cost-function (6.6), where x(t) is the solution of the differential equation (6.7) with P(t) being the solution to the Riccati differential equation (6.8), and where the initial conditions are P(t0) = X0^{−1} and x(t0) = x0, respectively.

W(t, x(t), u(t), y(t)) := ||K(t)x(t) + G(t)u(t)||² − ||C(t)x(t) − y(t)||²   (6.6)

ẋ(t) = [A(t) + P(t)(K(t)′K(t) − C(t)′C(t))] x(t) + P(t)C(t)′ y(t) + B2(t)u(t)   (6.7)

Ṗ(t) = A(t)P(t) + P(t)A(t)′ + P(t)(K(t)′K(t) − C(t)′C(t))P(t) + B1(t)B1(t)′   (6.8)
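For reference, P(t) can be obtained numerically by integrating (6.8) forward, e.g. with an explicit Euler scheme. The following hedged sketch (Python/NumPy) assumes time-invariant matrices and illustrative argument names; it is not the implementation used for the experiments later in this chapter:

    import numpy as np

    def riccati_forward(A, B1, C, K, X0_inv, t0, tN, steps=10000):
        """Forward Euler integration of the Riccati equation (6.8),
        P' = A P + P A' + P (K'K - C'C) P + B1 B1', with P(t0) = X0^{-1}."""
        P = X0_inv.copy()
        dt = (tN - t0) / steps
        Q = K.T @ K - C.T @ C
        for _ in range(steps):
            P = P + dt * (A @ P + P @ A.T + P @ Q @ P + B1 @ B1.T)
        return P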

Now, let M(.) be a function from IRn to IR, let x0 ∈ IRn be a given vector, and define (6.9), where the supremum is taken over all solutions to the system (6.7) with y(.) ∈ L2[tj, tj+1),


u(t) ≡ Ui(t, y(t)) and initial condition x(tj) = x0.

F_j^i(x0, M(.)) := sup_{y(.)∈L2[tj,tj+1)} { M(x(tj+1)) + ∫_{tj}^{tj+1} W(t, x(t), U_i(t, y(t)), y(t)) dt }   (6.9)

This leads to Theorem 4 in [50], which states that, for the system (6.1) with output feedback basic controllers (6.4), symmetric positive definite matrices X0 and Xf, K(.)′G(.) ≡ 0, and F_j^i(., .) defined by (6.9), the following statements are equivalent:

(i) The output feedback robust control problem with the cost matrices X0 and Xf has a solution via controlled switching with output feedback basic controllers (6.4).

(ii) The solution P(.) to the Riccati equation (6.8) with initial condition P(t0) = X0^{−1} is defined and positive definite on the interval [t0, tN], P(tN)^{−1} > Xf holds, and the dynamic programming equation¹ (6.10) has a solution such that V0(x0) ≥ 0 for all x0 ∈ IRn and V0(0) = 0.

V_N(x0) = x0′ [Xf + Xf (P(tN)^{−1} − Xf)^{−1} Xf] x0
V_j(x0) = min_{i=1,..,k} F_j^i(x0, V_{j+1}(.))   (6.10)

Furthermore, suppose that condition (ii) holds, let Ij(x0) be an index such that the minimum in (6.10) is achieved for i = Ij(x0), and let x(.) be the solution to equation (6.7) with initial condition x(t0) = x0. Then the controller (6.4), (6.3) associated with the switching sequence {i_j}_{j=0}^{N−1}, where i_j := I_j(x(tj)), solves the output feedback robust control problem (6.5) with V(.) ≡ V0(.).

6.2.1 Linear Output Feedback Basic Controllers

Now, consider the case of linear basic controllers defined by (6.11).

ui(t) = Li(t)y(t), i = 1, .., k, linear output feedback basic controllers (6.11)

The dynamic programming solution is given by (6.12) to (6.14), where Y_{ij}, Z_{ij}, R_{ij} and M_{ij} are the n by n submatrices of the transition matrix function Ψ_{ij}(tj, tj+1) given by (6.15).

V_N(x) = x′ [Xf + Xf (P(tN)^{−1} − Xf)^{−1} Xf] x,   terminal cost   (6.12)

V_j(xj) = min_{i=1,..,k} sup_{xj+1∈IRn} { F_j^i(xj, xj+1) + V_{j+1}(xj+1) },   ∀j = 0, .., N − 1, with   (6.13)

F_j^i(xj, xj+1) := −xj+1′ Y_{ij} R_{ij}^{−1} xj+1 + xj+1′ [R_{ij}^{−H} + Y_{ij} R_{ij}^{−1} M_{ij} − Z_{ij}] xj − xj′ R_{ij}^{−1} M_{ij} xj   (6.14)

¹ Here V_j(x) is used as the cost-to-go from stage j at time tj until the final stage N at time tN, as it is normally used in the mathematical control community, where J_j(x) would denote the cost-so-far, unlike in the ACD designs as pointed out in chapter 3.


Ψ_{ij}(tj, tj+1) =: [ Y_{ij}  Z_{ij} ; R_{ij}  M_{ij} ]   (6.15)

The transition matrix function Ψ_{ij}(tj, t) is given through the Hamiltonian system H_{ij}(t) with state x(t) and costate p(t) as defined by (6.16) to (6.20), where P = P(t) is given by equation (6.8) with initial condition P(t0) = X0^{−1}.

d/dt [ p(t) ; x(t) ] = H_{ij}(t) [ p(t) ; x(t) ],   ∀t ∈ [tj, tj+1)   (6.16)

H_{ij}(t) := [ −(A + PK′K + B2L_iC)′    −2K′K ;
               ½(PC′ + B2L_i)(PC′ + B2L_i)′    A + PK′K + B2L_iC ]   (6.17)

Ψ_{ij}(tj, t) = Ψ_{ij}(tj, tj) + ∫_{tj}^{t} dΨ_{ij}(tj, τ)/dτ dτ,   (6.18)

dΨ_{ij}(tj, t)/dt = H_{ij}(t) Ψ_{ij}(tj, t), and   (6.19)

Ψ_{ij}(tj, tj) = II_{2n×2n}, being the 2n by 2n identity matrix.   (6.20)
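The blocks Y_{ij}, Z_{ij}, R_{ij}, M_{ij} of (6.15) can thus be obtained by integrating (6.19) from the identity initial condition (6.20). A hedged sketch (Python/NumPy, forward Euler, with H_ij assumed to be available as a callable returning the 2n × 2n matrix of (6.17); names and step count are illustrative):

    import numpy as np

    def transition_blocks(H_ij, t_j, t_jp1, n, steps=1000):
        """Integrate Psi' = H_ij(t) Psi from t_j to t_{j+1}, starting from the
        2n x 2n identity, and return the blocks Y, Z, R, M of (6.15)."""
        Psi = np.eye(2 * n)
        dt = (t_jp1 - t_j) / steps
        for s in range(steps):
            t = t_j + s * dt
            Psi = Psi + dt * (H_ij(t) @ Psi)   # forward Euler step of (6.19)
        Y, Z = Psi[:n, :n], Psi[:n, n:]
        R, M = Psi[n:, :n], Psi[n:, n:]
        return Y, Z, R, M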

The index i denotes the basic controller used, while the index j refers to the start time tj. Theorem 6 in [50] states that the following statements are equivalent:

(i) The output feedback robust control problem with the cost matrices X0 and Xf has a solution via controlled switching with linear output feedback basic controllers u_i(t) := L_i(t)y(t).

(ii) The inequality P(tN)^{−1} > Xf holds and the dynamic programming equation (6.13) has a solution such that V0(x0) ≥ 0 for all x0 ∈ IRn and V0(0) = 0.

Here the system is given by (6.1) with the linear output feedback basic controllers u_i(t) := L_i(t)y(t) and symmetric positive definite cost matrices X0 and Xf; it is supposed that G(.) ≡ 0 and that P(.), the solution to equation (6.8) with initial condition P(t0) = X0^{−1}, is positive definite on the interval [t0, tN], which implies det R_{ij} ≠ 0, ∀i = 1, .., k and j = 0, .., N − 1, with F_j^i(., .) defined by (6.14). F_j^i(xj, xj+1) is then equal to x(tj)′p(tj) − x(tj+1)′p(tj+1).
Also, suppose that condition (ii) holds, let i_j(xj) be an index such that the minimum in (6.13) is achieved for i = i_j(xj), and let x(.) be the solution to equation (6.7) with initial condition x(t0) = x0. Then the controller (6.11), (6.3) associated with the switching sequence {i_j}_{j=0}^{N−1}, where i_j := i_j(x(tj)), solves the output feedback robust control problem (6.5) with V(.) ≡ V0(.).

6.3 Solving the Dynamic Programming Equation

As has been seen in the previous sections, there exists an elegant dynamic programming solution, (6.12) to (6.14), to the robust control problem with linear output feedback: a time-dependent cost-function V.(x) over the


“estimated” state space. This is more tricky than the state feedback case, as there is only output feedback and the state is not accessible. Nevertheless, the dynamic programming solution has the same form as when using state feedback with action dependent adaptive critics, where the critic has to implement (6.12) to (6.14), but with the estimated state taking on the role of the state.

Solving (6.12) to (6.14) in practice is not trivial. Discretizing the state space will lead to a solution, but the calculation time will explode due to the curse of dimensionality, and the supremum in equation (6.14) is also nasty, as it has to be taken over the whole state space. The first idea, to use ACDs, is problematic as well: due to the time-dependence of the cost-to-go function, a separate network would have to be built up at every sampled time step tj. To avoid the curse of dimensionality of a densely sampled state space, the cost-to-go function would be represented with an interpolation function, like a neural network. However, dynamic programming is a recursive operation and even small approximation errors can lead to instability of the training process; such instability was observed, similar to that noted by Boyan [22]. Another problem is that, with only k possible trajectories from a point x, some points cannot be reached within the support of the function approximator. This is problematic because no function approximator can extrapolate well, and thus it is not possible to compute some values close to the support boundaries of the function approximator reliably. One way out might be to reduce the support when going backwards in time to solve equation (6.13). Defining a reduction factor r_k := max ||IB_{k+1}(x)|| / max ||IB_k(x)||, the ratio of the maximal diameters of the supports IB_{k+1}(x) and IB_k(x) for which no extrapolation is needed at time step k, makes clear that even for moderate r_k's close to one the support will diminish exponentially with time, since r_k (k = N − 1, .., 1) can be at most 1 if no extrapolation were needed.

Because of these problems, a way was sought to solve equation (6.13) exactly, using the fact that its solution must be a piecewise quadratic function, although with boundaries that are difficult to determine.

6.3.1 The first steps backwards

The goal is to solve equation (6.13) efficiently. For convenience the so-called Q-function, introduced by Watkins, is used. This is nothing other than a special case of HDP, namely action dependent HDP (ADHDP). The recursion starts at stage N with the final cost (6.21), defined as the minimum of the Q-functions (6.22).

V_N(x) = min_{i=1,..,k} Q^i_N(x), with   (6.21)

Q^i_N(x) := x′ [Xf + Xf (P(tN)^{−1} − Xf)^{−1} Xf] x,   ∀i = 1, .., k   (6.22)

At stage N −1 the cost function is defined by (6.23) to (6.25), where the decision function


IN−1(x) (6.26) is defined for convenience.

V_{N−1}(x) = min_{i=1,..,k} Q^i_{N−1}(x), with   (6.23)

Q^i_{N−1}(x) := sup_{xf∈IRn} { F^i_{N−1}(x, xf) + V_N(xf) },   where F^i_{N−1} is given by (6.14)   (6.24)

            = sup_{xf∈IRn} { F^i_{N−1}(x, xf) + min_{l=1,..,k} Q^l_N(xf) }   (6.25)

I_{N−1}(x) := arg min_{i=1,..,k} Q^i_{N−1}(x)   (6.26)

Because both F^i_{N−1} and Q^i_N are quadratic forms, there is a chance to solve for the supremum explicitly. However, the min operation in equation (6.25) must be accounted for. The idea is to solve for a supremum for every Q^l_N, l = 1, .., k, and then to check whether the solution achieved at xf stems from a minimal Q^l_N. A necessary condition is that the gradient of the part being maximized with respect to xf must be zero, or it must be a ‘boundary’ point x, where Q^i_{N−1}(x) = Q^j_{N−1}(x) for i ≠ j. See figures 6.1 and 6.2.

Figure 6.1: Situation with one possible maximum at ^2x^i_f inside the corresponding region and one outside at ^1x^i_f. [Plot of Q^1_N, Q^2_N, V_N and the two branches ^1Q^i_{N−1}, ^2Q^i_{N−1} over xf.]

Figure 6.2: Problematic situation with the maximum at the boundary of the two decision regions. [Same quantities as in figure 6.1.]

To avoid the supremum, the Q^i_{N−1}-function of (6.25) is split up into ^lQ^i_{N−1}(x, xf), defined by (6.27) for some matrix functions E^i_{N−1}, F^i_{N−1}, G^i_{N−1} and H^l_N, which are easily determined by comparison with equations (6.14) and (6.22), and which is of the quadratic form shown in (6.28).

^lQ^i_{N−1}(x, xf) := F^i_{N−1}(x, xf) + Q^l_N(xf)   (6.27)
                   =: xf′ E^i_{N−1} xf + xf′ F^i_{N−1} x + x′ G^i_{N−1} x + xf′ H^l_N xf   (6.28)


A necessary condition for an extremal point xf is given by (6.29) and its solution is (6.30).

∇_{xf} ^lQ^i_{N−1}(x, xf) != 0   (6.29)

^lx^i_f = ^lx^i_f(x) = −(E^i_{N−1} + E^i_{N−1}′ + H^l_N + H^l_N′)^{−1} F^i_{N−1} x   (6.30)

However, the only interesting solutions ^lx^i_f are those where Q^l_N(^lx^i_f) is minimal with respect to l = 1, .., k. They are achieved for l = l∗ = I_N(^lx^i_f). Taking the supremum in (6.25) then amounts to taking the maximum over all remaining k′ ≤ k values ^{l∗}Q^i_{N−1}(x, ^{l∗}x^i_f(x))² at the possible solution points ^{l∗}x^i_f(x). These steps are summarized in (6.31) to (6.34).

f (x). These steps are summarized in (6.31) to (6.34).

Q^i_{N−1}(x) = max_{l∗} { ^{l∗}Q^i_{N−1}(x, ^{l∗}x^i_f(x)) }   (6.31)

l∗∗,i := arg max_{l∗} { ^{l∗}Q^i_{N−1}(x, ^{l∗}x^i_f(x)) }   (6.32)

xf := ^{l∗∗,i}x^i_f(x) := select the l∗∗,i-th column vector of ^{l∗}x^i_f(x)   (6.33)

Q^i_{N−1}(x) = ^{l∗∗,i}Q^i_{N−1}(x, xf) = xf′ (E^i_{N−1} + H^{l∗∗,i}_N) xf + xf′ F^i_{N−1} x + x′ G^i_{N−1} x   (6.34)
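For a fixed controller i, the steps (6.30) to (6.34) amount to: compute the stationary point for every terminal quadratic, keep only the candidates lying in the region where that quadratic is the minimal one, and take the maximum of the resulting values. A minimal sketch of this selection (Python/NumPy; argument names are illustrative, the quadratic-form matrices are assumed to be given, and boundary extrema are deliberately ignored here):

    import numpy as np

    def q_stage_Nm1(x, E, F, G, H_list):
        """Evaluate Q^i_{N-1}(x) for one controller i: stationary point per
        terminal quadratic (6.30), region check, maximum as in (6.31)-(6.34)."""
        candidates = []
        for l, H in enumerate(H_list):
            S = E + E.T + H + H.T                      # (E~ + H~) of (6.30)
            xf = -np.linalg.solve(S, F @ x)            # stationary point (6.30)
            values = [xf @ Hl @ xf for Hl in H_list]   # Q^l'_N(xf) for all l'
            if np.argmin(values) == l:                 # xf must lie in region l
                q = xf @ (E + H) @ xf + xf @ F @ x + x @ G @ x   # value (6.34)
                candidates.append(q)
        return max(candidates) if candidates else None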

This works fine as long as the supremum is not on the boundary between two regions, as in figure 6.1. However, if the supremum is on a boundary, all the possible solutions at boundary points, which might be suprema, are determined and then added to the set ^{l∗,i}x^i_f(x), and the maximum is taken as in (6.31). Possible boundary extrema, as illustrated in figure 6.2, will be calculated in section 6.3.1.1 below, but first the derivation of the recursion formula is continued. Using (6.30) with the maximizing l∗∗,i from (6.31), and using a tilde sign for the sum of a quantity with its conjugate, leads to (6.35), (6.36) and (6.37).

Q^i_{N−1}(x) = x′ F^i_{N−1}′ (Ẽ^i_{N−1} + H̃^{l∗∗,i}_N)^{−H} (E^i_{N−1} + H^{l∗∗,i}_N) (Ẽ^i_{N−1} + H̃^{l∗∗,i}_N)^{−1} F^i_{N−1} x
              − x′ F^i_{N−1}′ (Ẽ^i_{N−1} + H̃^{l∗∗,i}_N)^{−1} F^i_{N−1} x + x′ G^i_{N−1} x   (6.35)

            = x′ [ G^i_{N−1} + F^i_{N−1}′ (Ẽ^i_{N−1} + H̃^{l∗∗,i}_N)^{−1} ( (E^i_{N−1} + H^{l∗∗,i}_N)(Ẽ^i_{N−1} + H̃^{l∗∗,i}_N)^{−1} − II ) F^i_{N−1} ] x   (6.36)

Q^i_{N−1}(x) =: x′ H^i_{N−1} x   (6.37)

Defining H^i_{N−1} to be the part in square brackets in (6.36), and realizing that Q^i_{N−1}(x) has the same quadratic form as Q^i_N(x), this can be used recursively for all prior stages j = N − 1, .., 0. Notice that if there is a unique maximum in the set of values ^{l∗}Q^i_{N−1}(x, ^{l∗}x^i_f(x)), some slight deviation in x still gives the same solution and hence Q^i_{N−1}(x) is continuous in a small ball with radius ε around x. Nevertheless, one has to be aware that l∗∗,i =: l∗∗,i(x) depends on x, even though it is constant in a small ball with radius ε around x, and likely in a much larger area. This is the case, at least,

² ^{l∗}x^i_f(x) can be seen as an n × k′ matrix of concatenated possible solution points, to which the scalar function ^lQ^i_{N−1}(x, xf) is applied with xf as a single possible solution from the set ^{l∗}x^i_f(x). This can be summarized in MATLAB-style notation if the second argument is extended to a matrix argument in the function: ^{l∗}Q^i_{N−1}(x, xf(x)) = diag(xf(x)′(E^i_{N−1} + H^{l∗}_N)xf(x))′ + x′F^i_{N−1}′xf(x) + ones(1, k′)∗x′G^i_{N−1}x. The result is now a 1 × k′ row vector of values, of which the maximum is of interest.


if the critical point ^{l∗∗,i}x^i_f(x) is not a boundary point between two regions.

For a recursive implementation, all that has to be done is to keep track of some relatively easy to determine matrix functions E^i_j, F^i_j, G^i_j and H^i_j, given by (6.38) to (6.46).

E^i_j = −Y_{ij} R_{ij}^{−1}   (6.38)

F^i_j = R_{ij}^{−H} + Y_{ij} R_{ij}^{−1} M_{ij} − Z_{ij}   (6.39)

G^i_j = −R_{ij}^{−1} M_{ij}   (6.40)

I_{j+1}(x) = arg min_{i=1,..,k} Q^i_{j+1}(x) = arg min_{i=1,..,k} x′ H^i_{j+1} x   (6.41)

Ẽ^i_j = E^i_j + E^i_j′   (6.42)

H̃^{l∗∗,i}_{j+1} = H̃^{l∗∗,i}_{j+1}(x) = H^{l∗∗,i}_{j+1} + H^{l∗∗,i}_{j+1}′   (6.43)

H^i_j = H^i_j(x) = G^i_j + F^i_j′ (Ẽ^i_j + H̃^{l∗∗,i}_{j+1})^{−1} ( (E^i_j + H^{l∗∗,i}_{j+1})(Ẽ^i_j + H̃^{l∗∗,i}_{j+1})^{−1} − II ) F^i_j   (6.44)

H^{l∗∗,i}_N = H^{l∗}_N = H^l_N = Xf + Xf (P(tN)^{−1} − Xf)^{−1} Xf   (6.45)

I_N(x) := 1, or select any number from 1, .., k.   (6.46)

Since H^i_j(x0) loses its symmetry, symmetry could simply be enforced by redefining H^i_j(x) := ½ (H^i_j(x) + H^i_j(x)′), because only the function value of the quadratic form Q^i_j(x, x0) = x′ H^i_j(x0) x in x around some point x0 is of interest. However, doing only a few recursions over j leads to instability because of inaccuracy in the approximations, especially close to the switching boundary and the grid boundaries³. As the supremum in equation (6.25) has to be calculated over the whole IRn and not just over a limited support area, it is necessary to extrapolate the function V_{j+1}(xf) = min_{l=1,..,k} Q^l_{j+1}(xf) in (6.25) for some values of xf. This most likely occurs for points x close to support boundaries. The problem is even amplified by the fact that there is only a finite set of k controllers, which imposes a problem of ‘local reachability’: only k trajectories can enter a certain state, and even for very small δt they can only come from k points in the vicinity of x.

Because of these reasons, and the assumption that the cost functions V_j(x) are almost everywhere smooth, which is clear by construction⁴, a smoothing procedure is used at points x by fitting a quadratic form Q^i_j(x) = x′ H^i_j x with H^i_j = H^i_j′ in the neighborhood of x.

³ If a grid for possible points x0 is assumed, or more generally a hull around the set of arbitrary expansion points x0.
⁴ Though the ‘smoothness’ will more and more deteriorate due to the crisp min-function in equation (6.23) when calculating backwards over j. Nevertheless, the crisp boundaries, which are determined by points x where Q^r_j(x) = Q^s_j(x) are of the same minimal value for some r, s ∈ {1, .., k}, r ≠ s, are on an (n−1)-dimensional manifold and not on an n-dimensional one like V_j(x). Therefore, the boundary points are of measure zero when measured in the n-dimensional space IRn.
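A sketch of one backward bookkeeping step, combining (6.44) with the symmetry enforcement just described (Python/NumPy; it relies on the reconstruction of (6.42) to (6.44) above and uses illustrative argument names):

    import numpy as np

    def h_update(E, F, G, H_next):
        """Propagate the quadratic-form matrix of Q^i_j backwards via (6.44),
        given E^i_j, F^i_j, G^i_j and the selected H^{l**,i}_{j+1}."""
        Et = E + E.T                      # tilde quantity of (6.42)
        Ht = H_next + H_next.T            # tilde quantity of (6.43)
        S_inv = np.linalg.inv(Et + Ht)
        I = np.eye(E.shape[0])
        H = G + F.T @ S_inv @ ((E + H_next) @ S_inv - I) @ F
        return 0.5 * (H + H.T)            # enforce symmetry, as described above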


6.3.1.1 Calculation of boundary extrema

Given two quadratic functions f(x) = x′Ax + x′Bx0 + x0′Cx0 and g(x) = x′Ex + x′Fx0 + x0′Gx0, the goal is to maximize f(x) subject to f(x) = g(x), i.e. to find the maximum on the intersection of f and g. So, the Lagrange multiplier method is used: extremize f(x) + λ(f(x) − g(x)) subject to f(x) − g(x) = 0. Hence, the gradient must satisfy (6.47), and consequently (6.48) to (6.54) must hold.

∇(f(x) + λ(f(x) − g(x))) != 0   (6.47)

∇f(x) = (λ/(1 + λ)) ∇g(x) =: µ ∇g(x),   λ ≠ −1,   λ = µ/(1 − µ),   µ ≠ 1
        (the gradients must be (anti-)parallel)   (6.48)

∇f(x) = (A + A′)x + Bx0 = Ãx + Bx0   (6.49)

∇g(x) = (E + E′)x + Fx0 = Ẽx + Fx0   (6.50)

0 = [Ã − µẼ]x + (B − µF)x0   (6.51)

x = −[Ã − µẼ]^{−1}(B − µF)x0   (6.52)

0 = f(x) − g(x) = x′(A − E)x + x′(B − F)x0 + x0′(C − G)x0   (6.53)

0 = ([Ã − µẼ]^{−1}(B − µF)x0)′ (A − E) [Ã − µẼ]^{−1}(B − µF)x0
    − ([Ã − µẼ]^{−1}(B − µF)x0)′ (B − F)x0 + x0′(C − G)x0   (6.54)

This is only a one-dimensional problem in µ that can easily be solved numerically. Identifying the functions f(xb) and g(xb) with the quadratic forms ^lQ^i_{N−1}(x, xb) and ^jQ^i_{N−1}(x, xb), respectively (l ≠ j), solves the problem of determining the possible candidates x_b^{lλ^j_m} on the boundary, one for each of the, say m = 1, .., ^lM^j, Lagrange multipliers ^lλ^j_m = µ_m/(1 − µ_m) found to satisfy equation (6.54).
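One possible numerical treatment is to scan a range of µ values for sign changes of the right-hand side of (6.54) and refine each bracket by bisection. The sketch below (Python/NumPy) is only an illustration: the scan range, the argument names and the renaming of the quadratic coefficients (C0, Fm, to avoid a clash with the plant matrices) are assumptions:

    import numpy as np

    def boundary_candidates(A, B, C0, E, Fm, G, x0, mu_grid=None):
        """Scan equation (6.54) for sign changes in mu and refine by bisection."""
        if mu_grid is None:
            mu_grid = np.linspace(-50.0, 50.0, 20001)
        At, Et = A + A.T, E + E.T                      # tilde quantities (6.49)-(6.50)

        def h(mu):
            try:
                v = np.linalg.solve(At - mu * Et, (B - mu * Fm) @ x0)  # v = -x of (6.52)
            except np.linalg.LinAlgError:
                return np.nan                          # skip singular mu values
            return v @ (A - E) @ v - v @ (B - Fm) @ x0 + x0 @ (C0 - G) @ x0

        vals = np.array([h(m) for m in mu_grid])
        roots = []
        for k in np.where(vals[:-1] * vals[1:] < 0)[0]:   # sign change -> bracket
            lo, hi = mu_grid[k], mu_grid[k + 1]
            for _ in range(60):                           # plain bisection
                mid = 0.5 * (lo + hi)
                if h(lo) * h(mid) <= 0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
        return roots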

6.3.2 Special Remarks

Although part I of this thesis is mainly concerned with applying and improving adaptive critics, which are based on state feedback, this HDS example shows that robust control with only output feedback can be addressed within the ACD framework. This is due to the special treatment of the problem, where theoretical proofs for the existence of a dynamic programming solution are given and a state estimation based on Kalman filtering is used to arrive at equations (6.21) to (6.26) (for references, see [46], [49], [47] and [48]).

6.3.3 Experiment

As a simple experiment to demonstrate the practicality of the approximation via the suggested ACD framework, and also to demonstrate the validity of the theoretical approach,


a two-dimensional system is used, x(t) = [x1(t), x2(t)]′. For simplicity, the graphs 6.3 to 6.14 are shown on the state space with axes x1 and x2, and the hat-symbol is omitted, although the calculations are based on the estimated state x̂(t) = [x̂1(t), x̂2(t)]′.

6.3.3.1 Parameters

Plant parameters are:

A(t) = [ 0  1 ; −1.25  −1 ],   B1(t) = [ 0 ; 1 ],   B2(t) = [ 0 ; 1 ],   C(t) = [ 1  −2 ],
K(t) = [ 0.1  0 ],   G(t) = 0.

Two basic controllers u1(t) = L1 y(t) and u2(t) = L2 y(t) with output feedback gains L1(t) = 3 and L2(t) = −1 are used to control the plant. The disturbance inputs ξ(t) and ν(t) are chosen as

ξ(t) = (1/10) sin(5πt),   ν(t) = (1/20) cos(5πt).

It can easily be seen that the system cannot be controlled by a fixed control input u(.) ≡ u1(.) or u(.) ≡ u2(.), as the resulting closed loop is unstable and ||x(t)|| → ∞.
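This can be checked directly: under a single fixed output feedback gain the closed loop is ẋ = (A + B2 L_i C)x plus disturbance terms, and the disturbances do not affect stability. A quick check in Python/NumPy (the code only restates the plant data given above):

    import numpy as np

    A = np.array([[0.0, 1.0], [-1.25, -1.0]])
    B2 = np.array([[0.0], [1.0]])
    C = np.array([[1.0, -2.0]])

    # Closed-loop matrix A + B2*Li*C for each fixed output feedback gain Li.
    for Li in (3.0, -1.0):
        print(Li, np.linalg.eigvals(A + B2 * Li @ C))

Running this shows an eigenvalue with positive real part for both L1 = 3 and L2 = −1, so each basic controller on its own leaves the loop unstable.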

6.3.4 Results

Applying the algorithm introduced in the previous sections results in a huge improvement in the calculation time. This is mainly due to the enormous reduction in the number of grid points compared with the traditional discrete dynamic programming solution. A further difference is that this method can extrapolate outside the grid boundaries, whereas discrete dynamic programming cannot extend over its grid. However, some accuracy is lost due to the quadratic approximation and extrapolation, which might make the decision boundaries less accurate. On the other hand, dynamic programming suffers from boundary effects, which can only be dealt with by having a huge grid, eventually shrinking between time steps, or by using extrapolation outside the grid as well, which would come closer to the proposed method.

6.3.4.1 Results without smoothing

Here a problem arises from numerical inaccuracy when the supremum is at a boundary point xf belonging to the pair Q^r_j(x, xf) and Q^s_j(x, xf). Another source of error is the problem associated with the extrapolation, which can be dampened by smoothing around the support boundaries and thereby increasing the quality of the extrapolation.


Figure 6.3: Solution without local smoothing at first iteration backwards. [Four surface plots over the state space (x1, x2): Q^1_{100}(x), Q^2_{100}(x), V_{100}(x) and the decision function I_{100}(x) at time step 100.]

Figure 6.4: Solution without local smoothing at second iteration backwards. [Four surface plots over the state space (x1, x2): Q^1_{99}(x), Q^2_{99}(x), V_{99}(x) and I_{99}(x) at time step 99.]


Figure 6.5: Solution without local smoothing at third iteration backwards. [Four surface plots over the state space (x1, x2): Q^1_{98}(x), Q^2_{98}(x), V_{98}(x) and I_{98}(x) at time step 98.]

Figure 6.6: Solution without local smoothing at tenth iteration backwards. [Four surface plots over the state space (x1, x2): Q^1_{91}(x), Q^2_{91}(x), V_{91}(x) and I_{91}(x) at time step 91.] A few iterations further, the situation becomes completely unstable (not shown). It is clearly visible that the corners start bending. This is due to numerical inaccuracy in determining the supremum in equation (6.25), because for the corner grid points x the corresponding xf lie outside the supported grid region for both controllers.


6.3.4.2 Results with local smoothing

Given an interior point, all the surrounding (2r + 1)^{dim(x)} = (2r + 1)² points, lying in the dim(x)-cube around the interior point, are used for a quadratic fit in the LMS sense. Two neighbourhood sizes, r = 3 and r = 8, were used.
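A minimal sketch of such a local LMS fit of a symmetric quadratic form for the two-dimensional state used here (Python/NumPy; the function name and interface are illustrative):

    import numpy as np

    def fit_quadratic_form(points, values):
        """Least-squares fit of v ~ x' H x (H symmetric) to the (2r+1)^2
        neighbourhood samples.  points: (m, 2) states, values: (m,) costs."""
        x1, x2 = points[:, 0], points[:, 1]
        Phi = np.column_stack([x1**2, 2 * x1 * x2, x2**2])   # h11, h12, h22
        h, *_ = np.linalg.lstsq(Phi, values, rcond=None)
        return np.array([[h[0], h[1]], [h[1], h[2]]])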

Figure 6.7: Solution with local smoothing (r = 3) at first iteration backwards. [Four surface plots over the state space (x1, x2): Q^1_{100}(x), Q^2_{100}(x), V_{100}(x) and I_{100}(x) at time step 100.]

Figure 6.8: Solution with local smoothing (r = 3) at second iteration backwards. [Four surface plots over the state space (x1, x2): Q^1_{99}(x), Q^2_{99}(x), V_{99}(x) and I_{99}(x) at time step 99.]


Figure 6.9: Solution with local smoothing (r = 3) at third iteration backwards. [Four surface plots over the state space (x1, x2): Q^1_{98}(x), Q^2_{98}(x), V_{98}(x) and I_{98}(x) at time step 98.]

Figure 6.10: Solution with local smoothing (r = 3) at tenth iteration backwards. [Four surface plots over the state space (x1, x2): Q^1_{91}(x), Q^2_{91}(x), V_{91}(x) and I_{91}(x) at time step 91.]


With r = 8, a slightly smoother function is found, especially at boundary points.

Figure 6.11: Solution with local smoothing (r = 8) at first iteration backwards. [Four surface plots over the state space (x1, x2): Q^1_{100}(x), Q^2_{100}(x), V_{100}(x) and I_{100}(x) at time step 100.]

Figure 6.12: Solution with local smoothing (r = 8) at second iteration backwards. [Four surface plots over the state space (x1, x2): Q^1_{99}(x), Q^2_{99}(x), V_{99}(x) and I_{99}(x) at time step 99.]


Figure 6.13: Solution with local smoothing (r = 8) at third iteration backwards. [Four surface plots over the state space (x1, x2): Q^1_{98}(x), Q^2_{98}(x), V_{98}(x) and I_{98}(x) at time step 98.]

Figure 6.14: Solution with local smoothing (r = 8) at tenth iteration backwards. [Four surface plots over the state space (x1, x2): Q^1_{91}(x), Q^2_{91}(x), V_{91}(x) and I_{91}(x) at time step 91.]


In figures 6.15 and 6.16 the control trajectories of the basic controllers' outputs are given in blue and red for controllers 1 and 2, respectively. The green trajectory denotes the magnitude of the state of the unknown plant. It is easily seen that the plant is driven to the origin over time and kept there.

Figure 6.15: Control trajectories u1(t) and u2(t) in red and blue, and state magnitude |x(t)| in green (r = 3). [Plot of the switching control u1(t), u2(t) and |x(t)| over t ∈ [0, 10].]

Figure 6.16: Control trajectories u1(t) and u2(t) in red and blue, and state magnitude |x(t)| in green (r = 8). [Plot of the switching control u1(t), u2(t) and |x(t)| over t ∈ [0, 10].]


Figures 6.17 and 6.18 show the corresponding trajectories from an initial unknown state x = [−0.4, −0.6]^T (red trajectory) and an initial estimated state x̂ = [0, 0]^T (green trajectory). The time axis is in the z-direction, whereas the xy-plane is the state space. It is easily seen how the uncertain plant is captured and driven to zero.

Figure 6.17: Trajectories of x(t) and x̂(t) in red and green (r = 3). [3-D plot of the trajectories over the state space (x1, x2) with time t on the vertical axis.]

Figure 6.18: Trajectories of x(t) and x̂(t) in red and green (r = 8). [3-D plot of the trajectories over the state space (x1, x2) with time t on the vertical axis.]


Extending the grid in state space does not influence the calculation time, which is of order O(n_grid²) and proportional to the squared number of controllers, k².

Figure 6.19: Solution with local smoothing (r = 8) at tenth iteration backwards with a supported grid in [−5, 5] × [−5, 5]. Even though the number of grid points has not been increased, the solution does not change significantly, whereas in dynamic programming a fine quantization would need to be maintained, increasing the calculation time rapidly.

Figure 6.20: Control trajectories u1(t) and u2(t) in red and blue, and state magnitude |x(t)| in green (r = 8). There is not much difference to the smaller grid; however, it could capture systems starting outside the previously supported grid of [−2, 2] × [−2, 2]. [Plot of the switching control u1(t), u2(t) and |x(t)| over t ∈ [0, 10].]


Part II

A class of fast and cheap neural networks


Chapter 7

Introduction

In this, the second part of the thesis, a class of networks is investigated that can learn fast and accurately. Speed is an overall goal for an intelligent system, and in part one of this thesis it was found that precision is also crucial.

In recent years one branch of “learning from data” became very popular. It does not use the assumption of having an infinite pool of data to draw from, which is a much more feasible assumption for practical implementations of learning systems. However, it has to be said that many people have used other approaches based on limited data in statistics before.

A branch of algorithms that use this limited data and filter out the “very important” samples has also become more popular and is called Support Vector Machines (SVMs). These algorithms can be divided into two categories, namely classification and function approximation. Here, only classification is considered, but in principle a straightforward tree-like quantization can turn any binary classification task into a function approximation, although it may not be as efficient as a more elaborate method.

In this introduction an outline of the theoretical framework of the basic SVM algorithms is given. Good summaries can be found in Haykin's book [51] and in Burges' tutorial [52]. A more detailed source is the book [53] by Vladimir Vapnik, widely recognized as the founder of the SVM framework.

After the introduction of SVM algorithms, non-optimal but fast algorithms are introduced that have some similarities and in fact can be used as a simple way to find “very important” data points. They are based on the Perceptron algorithm, introduced by Rosenblatt [54], but have an extension via a kernel mapping to a high-dimensional feature space, similar to the concept in SVM algorithms.

7.1 Statistical Learning Theory

Following the outstanding book [53] by Vapnik, a short summary of important ideas is presented to give an outline for a better understanding of the general problem at hand. Also the seminar talk and notes by Achim Hoffmann [55], who relied on most of the


material from Vapnik's books as well, were quite helpful for this summary. Statistical Learning Theory tries to answer questions about functional dependencies from a given collection of data; it tries to infer some functional dependence from a limited number of data samples. It covers a wide area of classical statistics, particularly classification and regression analysis, as well as density estimation. This theory was developed according to the so-called learning paradigm. In contrast to classical statistics, it was developed for small data samples and does not rely on a priori knowledge about the problem to be solved. It considers a structure on the set of functions implemented by the learning machine, as a set of nested subsets of functions, where a specific measure of subset capacity is defined. Generalization is controlled by two factors, namely the quality of approximation of the selected function and a capacity measure of the subset of functions from which the approximating function was chosen.

However, the whole framework is based on the asymptotic behavior of uniform convergence when increasing the data set of samples. Nevertheless, it is expected that the concepts still hold and give reasonable solutions to the problem at hand for very small data sets. The goals of a theory of general statistical inference are:

• To describe conditions under which the best approximation to an unknown function can be found in a given set of functions with an increasing number of examples.

• To find the best method of inference for a given number of examples.

Many well-known mathematicians like Fisher, Glivenko, Cantelli and Kolmogorov started to develop a theory of this general statistical inference and set out the mathematical theorems that form the foundation for Vapnik's work¹. It is worthwhile to note that statistical learning theory goes beyond the scope of this thesis, which follows Vapnik's approach. One reason is the appealing mathematical treatment by Vapnik and the successful SVM applications.

7.1.1 The model

Figure 7.1 shows the model of a learning process from examples. A process generates some input vectors x and a target operator or supervisor specifies a desired output value y, which can be a binary, real- or complex-valued scalar or vector. For a classification task it would be a binary (or n-ary) value, whereas for a regression task it would normally be a real-valued scalar or vector. Often the examples are summarized as pairs in the training data T = {(x1, y1), .., (xi, yi), .., (xN, yN)} = {(xi, yi) | i = 1, .., N}. The learning process is a process of choosing the “best” function from a set of given functions implementable by the learning machine. The “best” function will be defined more specifically later on, but it does not necessarily mean that it is the function that minimizes an error measure

¹ Werbos pointed out that Vapnik was clear in his IJCNN03 talk that Fisher's work represents the original “thesis” and Vapnik's work is the purest “antithesis”.


only on the training set because, as will be seen later, this is not sufficient for the “best” generalization.

7.1.2 Imitation versus Identification

In [53] Vapnik considers two approaches to the learning problem:

• Learning as a problem of minimizing an empirical risk functional. This can be seen as imitating the unknown system. The idea is to try to construct an operator which provides the best prediction on some testing data when trained on a set of training data stemming from the same generator as the test data. For this problem, a non-asymptotic theory can be developed.

• Learning as a problem of identification of the desired function, i.e. finding a function using the observed data that is close to the desired one. This is a much harder problem and in general leads to solving so-called ill-posed problems. In this case only an asymptotic theory can be developed.

It turns out that, from a conceptual point of view, imitation can be regarded as a partial estimation of the probability measure (e.g. viewing an object whose back is occluded can be imitated by reconstructing its front, regardless of the occluded shape in the back), whereas identification is estimating the probability measure over the entire set of the σ-algebra on the probability space at hand (e.g. also reconstructing the occluded part, which is a much harder task and needs some extra information or assumptions, like for example ‘the object is a can’, to be able to estimate the complete probability measure).

7.1.3 Minimizing the Risk Functional from Empirical Data

In the following sections short outlines of the three basic statistical problems that can easily be dealt with within the framework of statistical learning theory are given, together with the mathematical formulae, which are almost self-explanatory. Basically, one has to consider the minimization of a risk functional (7.1), where the function L(z, f(z))² is integrable for any f(z) ∈ {f(z)} = set of admissible functions and specifies the loss incurred at any z ∈ Z ⊂ IRn, given the probability distribution FZ(z).

R(f(z)) = ∫ L(z, f(z)) dF(z)   (7.1)

Therefore, the risk is the expected loss over the domain of interest Z. The practical problem is that the probability distribution FZ(z) is unknown and only a finite set of observations z1, .., zN, drawn randomly and independently from the distribution FZ(z), is given.

² Even though z is usually a vector, no bold typeset is used for ‘theoretical’ variables. The same applies to α later on, denoting a set of parameters.


Figure 7.1: A model of the learning process from examples. Some generator process produces vectors x and a target operator or supervisor sets a value y (which may be vectorial) for every vector x. During the learning process the learning machine tries to return an estimated value ŷ that is close to y, given the same input x and the corresponding target. [Block diagram: Generator → x → Supervisor → y, and Learning Machine → ŷ.]

It is practical to specify the set of admissible functions in a parameterized version, {f(z, α), α ∈ Λ}, where Λ is an arbitrary parameter set, and to absorb the function f(z) into a new loss function Q(z, α) = L(z, f(z, α)). Therefore, the risk can be redefined as (7.2).

R(α) = ∫ Q(z, α) dF(z),   α ∈ Λ   (7.2)

7.1.3.1 Pattern Recognition

The problem of pattern recognition goes back to the late 1950s and has emerged as the simplest case of studying learning theory. Rosenblatt's perceptron was the first algorithm that could learn from a set of training data; an extension of this first algorithm is introduced later. It is assumed that the supervisor classifies the given input patterns x according to the conditional probability function F_{Ω|X}(ω|x), where ω ∈ {0, 1, ..., k − 1} denotes one of k classes. Let φ(x, α), α ∈ Λ be the classification rule, taking one of the values 0, 1, .., k − 1. The simplest loss function is given by (7.3) to (7.5).

L(ω, φ) = { 0 if ω = φ;  1 if ω ≠ φ }   (7.3)

and therefore

Q(z, α) = L(ω, φ(x, α)), where   (7.4)

z = [ω, x^T]^T   (7.5)


The loss function Q(z, α) describes a set of binary indicator functions, parameterized by α ∈ Λ, taking on only the two values zero and one. It should be noted that neither the environment distribution FX(x) nor the conditional probability density of the decision rule F_{Ω|X}(ω|x) are known; nevertheless it is certain that they do exist and therefore the joint probability distribution FZ(z) = F_{Ω,X}(ω, x) = F_{Ω|X}(ω|x) FX(x) also exists.

7.1.3.2 Regression Estimation

Regression estimation tries to estimate a functional dependence between two sets of elements X and Y, based on xi ∈ X selected independently according to the distribution FX(x), while the yi ∈ Y are realized according to F_{Y|X}(y|x); this can be done empirically, i.e. given the input xi, the output yi of the supervisor is measured. Based on this, N pairs (xi, yi), i = 1, .., N can be formed to yield the training data. Normally yi is a scalar, but it could be an n-dimensional vector as well; this corresponds to solving n scalar regression estimation problems. It turns out that the knowledge of F_{Y|X}(y|x) is not necessarily required and often it is sufficient to know only some of its characteristics, for example its conditional expectation, which is called the regression (7.6).

r(x) = ∫ y dF_{Y|X}(y|x)   (7.6)

The regression estimation involves finding the estimate r(x) from a set of functions {f(x, α), α ∈ Λ}. Under the conditions (7.7) and (7.8) it can be shown that this is equivalent to minimizing the risk functional defined by (7.9) to (7.11).

∫ y² dF_{Y,X}(y, x) < ∞   (7.7)

∫ r²(x) dF_{Y,X}(y, x) < ∞   (7.8)

R(α) = ∫ Q(z, α) dFZ(z), with   (7.9)

z = [y, x^T]^T and   (7.10)

Q(z, α) = (y − f(x, α))²,   α ∈ Λ   (7.11)

7.1.3.3 Density Estimation

If the desired density is in the set {p(x, α), α ∈ Λ}, then it can be defined by (7.12).

p(x, α0) = dFX(x)/dx,   for some α0 ∈ Λ   (7.12)


It can be shown that estimating the density in L1 is equivalent to minimizing the functional (7.13).

R(α) = −∫ ln p(x, α) dF(x)   (7.13)

Then, the minimum (or infimum if it does not exist) is achieved by the function p(x, α∗), which differs from p(x, α0) only on a set of zero measure. This density estimation setting is called the Fisher-Wald setting and it is restricted to functions Q(z, α) with the form (7.14) and (7.15).

Q(z, α) = − log p(x, α) (7.14)

z = x (7.15)

7.1.3.4 Empirical Risk Minimization

The main question that arises from the three previous problems is: how can the risk functional be minimized? Strictly speaking, it is impossible to minimize the risk functional because the probability distribution F(x) is unknown³. However, on the basis of an empirical distribution F(z) based on the data z1, .., zN, the so-called empirical risk functional (7.16) can be constructed, which can be minimized.

R_emp(α) = (1/N) Σ_{i=1}^{N} Q(zi, α),   α ∈ Λ   (7.16)
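A direct transcription of (7.16) for a finite sample, together with the 0-1 loss of (7.3) to (7.5) for a hypothetical linear classification rule φ(x, α) (the rule itself is only an illustration, not taken from the text):

    import numpy as np

    def empirical_risk(loss, data, alpha):
        """Empirical risk functional (7.16): average loss Q(z_i, alpha)."""
        return np.mean([loss(z, alpha) for z in data])

    # 0-1 loss of (7.3)-(7.5), with z = (omega, x) and a hypothetical rule
    # phi(x, alpha) = [alpha'x > 0].
    def zero_one_loss(z, alpha):
        omega, x = z
        phi = 1 if np.dot(alpha, x) > 0 else 0
        return 0 if omega == phi else 1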

The minimum of the empirical risk functional is attained at a loss function Q(z, αN), which can be regarded as an approximation of the “true” loss function Q(z, α0). This is called the empirical risk minimization (induction) principle.

7.1.4 Identification of Stochastic Objects

If the goal is not only to imitate but rather to identify the supervisor, the probability distributions can be estimated directly by the so-called empirical distribution function defined by (7.17) to (7.18).

F_N(x) = (1/N) Σ_{i=1}^{N} Θ(x − xi), where   (7.17)

Θ(x) = { 1 if x > 0 (element-wise greater operator);  0 otherwise }   (7.18)

³ For convenience the subscript X has been dropped, as the functions can be distinguished by their argument, so F(z) = FZ(z) and F(x) = FX(x) are not the same functions. The subscript N will be used to distinguish the empirical distribution FN(x) from the unknown distribution FX(x).
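A direct transcription of (7.17) and (7.18) for a finite multivariate sample (Python/NumPy; the function name is illustrative):

    import numpy as np

    def empirical_cdf(samples, x):
        """Empirical distribution function (7.17)-(7.18): fraction of the
        samples that are element-wise strictly smaller than x."""
        samples = np.atleast_2d(np.asarray(samples, dtype=float))
        x = np.asarray(x, dtype=float)
        return float(np.mean(np.all(samples < x, axis=1)))

    # e.g. empirical_cdf(np.random.randn(1000, 2), [0.0, 0.0]) is roughly 0.25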


This is an approximation to the probability distribution (7.19), and if the random variable ξ has a density, then (7.19) is given by (7.20), where p(x) is the probability density.

F(x) = P{ξ < x}   (7.19)

F(x) = ∫_{−∞}^{x} p(u) du   (7.20)

Similarly, an integral equation (7.21) for the conditional probability and conditional density, based on approximations of the right-hand side, can be obtained. F(ω, x) is then approximated by (7.22) with the help of (7.23).

∫_{−∞}^{x} P(ω|t) dF(t) = F(ω, x)   (7.21)

F_N(ω, x) = (1/N) Σ_{i=1}^{N} Θ(x − xi) δ(ω, xi)   (7.22)

δ(ω, x) = { 1 if x belongs to the class ω;  0 otherwise }   (7.23)

For the conditional density (7.24) F (y, x) is approximated by (7.25).

∫_{−∞}^{y} ∫_{−∞}^{x} p(t|u) dF(u) dt = F(y, x)   (7.24)

F_N(y, x) = (1/N) Σ_{i=1}^{N} Θ(y − yi) Θ(x − xi)   (7.25)

The trick of using the approximations on the right-hand side of the integral equations is based on one of the most important theorems in statistics, the Glivenko-Cantelli theorem. It states that the empirical distribution function FN(x) converges in probability⁴ to the actual distribution function F(x), according to theorem 1.

Theorem 1. (Glivenko-Cantelli)

sup_x |F(x) − F_N(x)|  →^P  0   as N → ∞   (7.26)
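A small numerical illustration of theorem 1 (hedged: the supremum is only evaluated on a finite grid, and a standard normal distribution is assumed purely as an example):

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(0)
    x_grid = np.linspace(-4.0, 4.0, 401)
    F_true = np.array([0.5 * (1.0 + erf(t / sqrt(2.0))) for t in x_grid])  # N(0,1) CDF

    for N in (10, 100, 1000, 10000):
        sample = rng.standard_normal(N)
        F_N = np.array([np.mean(sample < t) for t in x_grid])   # F_N of (7.17)
        print(N, np.max(np.abs(F_true - F_N)))                  # sup-distance shrinks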

7.1.4.1 Ill-posed Problems

A solution to the operator⁵ equation (7.27) is said to be stable if a small change in the right-hand side F(x) ∈ {F(x, α)} results in a small change in the solution f(t).

Af(t) = F(x) ∈ {F(x, α)}   (7.27)

⁴ It can be shown that convergence takes place almost surely. For a definition of these terms see appendix C.
⁵ Functions are mappings from a set of numbers to another set of numbers (where the numbers can also be vectorial), functionals are mappings from a set of functions to a set of numbers, and operators are mappings between sets of functions.


This means the metric distance relations (7.28) and (7.29) in the spaces E1 and E2⁶ have to hold for any ε > 0.

ρE1(f(t, α1), f(t, α2)) ≤ ε (7.28)

ρE2(F (x, α1), F (x, α2)) ≤ δ(ε) (7.29)

The problem is said to be well posed in Hadamard's sense if the solution of the operator equation (7.27) exists, is unique and is stable. The operator A is said to be continuous if it maps close elements in E1 to close elements in E2, where closeness is measured by the appropriate metrics in those spaces. If there exists a unique solution of such an operator equation, there also exists an inverse operator A^{−1}. However, if A^{−1} is not continuous, the solution will be unstable and therefore the equation is ill-posed.

7.1.4.2 Well-posed Problems in Tikhonov’s Sense

Tikhonov's idea to achieve a well-posed, or correct, problem is to restrict the set of possible functions. Let f ∈ M and F ∈ N be connected by the operator equation Af(t) = F(x) and, as before, let the elements in M and N belong to the metric spaces E1 and E2, respectively. In a collection of operators A the only interesting ones are those that define a one-to-one mapping between the elements of M and N. To make things practicable, it is required that the operator A be continuous. Let M∗ ⊂ M be a compact set; then the following lemma shows Tikhonov's regularization criterion of restricting the set (class) of correctness to M∗:

Lemma. If A is a continuous one-to-one operator defined on a compact set M∗ ⊂ M, then the inverse operator A^{−1} is continuous on the set N∗ = AM∗.

For a proof see appendix A1.2 in [53]. The conditions for a problem of solving the operator equation Af = F to be well-posed, or correct in Tikhonov's sense, can be summarized as follows:

• The solution of Af = F exists for each F ∈ AM∗ = N ∗ and belongs to M∗.

• The solution belonging to M∗ is unique for any F ∈ N .

• The solutions belonging to M∗ are stable with respect to F ∈ N ∗.

If M∗ = M and N∗ = N, correctness in Tikhonov's sense corresponds to correctness in Hadamard's sense.

7.1.4.3 Tikhonov’s Regularization Method

To solve the operator equation Af = F, defined by a continuous one-to-one operator A acting from M into N, and supposing that a solution of the operator equation exists,

⁶ Where f(t) ∈ E1 and F(x) ∈ E2, and the operator A maps functions from E1 into E2.


an “improved” functional (7.30) is defined, with a regularization parameter γ > 0 and a lower semi-continuous functional W(f), called the regularizer.

R_γ(f, F) = ρ²_{E2}(Af, F) + γ W(f)   (7.30)

The regularizer has the following three properties:

1. The solution of the operator equation Af = F belongs to the domain of definition D(W) of the functional W(f).

2. On its domain of definition, the functional W(f) admits real-valued nonnegative values.

3. The sets Mc = {f : W(f) ≤ c}, with c ≥ 0, are all compact.

It can be proven that the problem of minimizing the functional (7.30) is stable, which means that to the close functions F and Fδ, where ρ_{E2}(F, Fδ) ≤ δ, there correspond close elements f^γ and f^γ_δ, which minimize the functionals R_γ(f, F) and R_γ(f, Fδ). The goal of regularization theory is to determine a relationship between γ and δ such that a sequence of solutions f^γ_δ of regularized problems R_γ(f, Fδ) converges to the solution of the operator equation Af = F as δ → 0. These relations are established by the following theorems 2 and 3.

Theorem 2. Let E1 and E2 be metric spaces, and suppose for F ∈ N there exists a solution f ∈ D(W) of Af = F. Instead of an exact right-hand side F, let approximations Fδ ∈ E2 (not necessarily belonging to N) be given such that ρ_{E2}(F, Fδ) ≤ δ. Suppose the values of the parameter γ are chosen in such a manner that (7.31) holds.

γ(δ) → 0 for δ → 0,

limδ→0δ2

γ(δ) ≤ r ≤ ∞.(7.31)

Then the elements fγ(δ)δ minimizing the functional Rγ(δ)(f, Fδ) on D(W ) converge to the

exact solution f as δ → 0.

Theorem 3. Let E1 be a Hilbert space and W(f) = ||f||². Then for γ(δ) satisfying the relations (7.31) with r = 0, the regularized elements f^{γ(δ)}_δ converge to the exact solution f in the metric of the space E1 as δ → 0.

For proofs of theorems 2 and 3 refer to appendix A1.3.2 of [53].
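
To make the coupling between the noise level δ and the regularization parameter γ(δ) concrete, the following minimal sketch (Python, not part of the original thesis) applies the regularized functional (7.30) with W(f) = ||f||², as in Theorem 3, to a discretized linear operator equation; the operator, the true solution and the choice γ(δ) = δ are invented purely for illustration.

import numpy as np

# Minimal sketch of Tikhonov regularization for a discretized operator
# equation A f = F with regularizer W(f) = ||f||^2 (cf. Theorem 3).
# Operator, true solution and noise level are hypothetical examples only.
rng = np.random.default_rng(0)
n = 50
t = np.linspace(0.0, 1.0, n)
A = np.tril(np.ones((n, n))) / n          # ill-conditioned "integration-like" operator
f_true = np.sin(2 * np.pi * t)            # exact solution f
F_exact = A @ f_true                      # exact right-hand side F

for delta in [1e-1, 1e-2, 1e-3]:
    F_delta = F_exact + delta * rng.standard_normal(n)   # rho_E2(F, F_delta) ~ delta
    gamma = delta                                        # gamma(delta) -> 0, delta^2/gamma -> 0
    # Minimize ||A f - F_delta||^2 + gamma ||f||^2, i.e. the functional (7.30).
    f_gamma = np.linalg.solve(A.T @ A + gamma * np.eye(n), A.T @ F_delta)
    print(f"delta={delta:7.0e}  gamma={gamma:7.0e}  "
          f"error ||f_gamma - f|| = {np.linalg.norm(f_gamma - f_true):.3f}")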

7.1.5 Important Ideas of Statistical Learning Theory

Important questions addressed by Statistical Learning Theory are:

• What are the (necessary and sufficient) conditions for consistency of a learning process based on the ERM principle?


• How fast is the rate of convergence of the learning process?

• How to control the rate of convergence (the generalization capability) of the learning process? (This leads to the structural risk minimization (SRM) principle.)

• How to construct learning algorithms which can control their generalization capabilities? (This leads to support vector algorithms.)

7.1.5.1 Consistency of ERM Principle

The empirical risk minimization (ERM) principle is said to be consistent for the set of functions Q(z, α), α ∈ Λ, and for the probability distribution function F(z) if the conditions (7.32) and (7.33) hold^7; l denotes the sample size.

R(α_l) → inf_{α∈Λ} R(α) in probability as l → ∞,   (7.32)

R_emp(α_l) → inf_{α∈Λ} R(α) in probability as l → ∞.   (7.33)

To answer the question of necessary and sufficient conditions, some formal definitions need to be made and are only outlined here. Some of them are given in appendix C in a concentrated form. For more detail refer to Vapnik’s book [53].

7.1.5.1.1 Entropy of the set of indicator functions

Let Q(z, α), α ∈ Λ, be a set of indicator functions and consider a sample z1, .., zl. Let N^Λ(z1, .., zl) be the number of different separation functions, or dichotomies, in Λ for the set of objects z1, .., zl. Then the function H^Λ(z1, .., zl) = ln N^Λ(z1, .., zl) is called the random entropy. The expectation of the random entropy over the joint distribution function F(z1, .., zl) is defined as H^Λ(l) = E[ln N^Λ(z1, .., zl)] and is called the entropy of the set of indicator functions Q(z, α), α ∈ Λ, on samples of size l. For real functions the definitions need to be extended by the concept of an ε-net^8. The indicator functions get replaced by a set of bounded^9 loss functions A ≤ Q(z, α) ≤ B, α ∈ Λ. The function H^Λ(ε; z1, . . . , zl) = ln N^Λ(ε; z1, . . . , zl) is called the ε-entropy^10 of the set of bounded functions A ≤ Q(z, α) ≤ B, α ∈ Λ, on a sample of size l.

^7 See also definition 16 for strict consistency and lemma 1 in appendix C.
^8 See definition 18 in appendix C.
^9 Vapnik also considers unbounded functions, which need a few more extensions.
^10 See definition 20 in appendix C.

7.1.5.1.2 Conditions for uniform convergence


Theorem 4. (Vapnik) For uniform one-sided convergence of means to their mathematical expectations to take place on a set of uniformly bounded functions Q(z, α), α ∈ Λ, it is necessary and sufficient that for any positive ε and δ there exists a set of functions Q∗(z, α∗), α∗ ∈ Λ∗, such that

1. For any function Q(z, α) there exists a function Q∗(z, α∗) satisfying the conditions Q(z, α) ≥ Q∗(z, α∗) and ∫ (Q(z, α) − Q∗(z, α∗)) dF(z) < ε.

2. The ε-entropy of the set of functions Q∗(z, α∗), α∗ ∈ Λ∗, satisfies the inequality

lim_{l→∞} H^{Λ∗}(ε, l) / l < δ.   (7.35)

These necessary and sufficient conditions hold for a fixed probability distribution F(z). In order that uniform convergence takes place for any probability measure F ∈ P, it is necessary that (7.35) holds for every F ∈ P.

The uniform convergence P{ sup_{α∈Λ} | ∫ Q(z, α) dF(z) − (1/l) Σ_{i=1}^{l} Q(z_i, α) | > ε } → 0 as l → ∞ gives the necessary and sufficient condition for the consistency of the ERM principle. This is the key theorem of learning theory about the equivalence of strict consistency of the ERM principle and the existence of one-sided uniform convergence of the means to their mathematical expectations over a given set of functions.

7.1.5.1.3 Key Theorem of Learning Theory

Theorem 5. (Vapnik) Given two constants a and A such that for all functions in the set Q(z, α), α ∈ Λ, and for a given distribution function F(z), the following inequalities hold true:

a ≤ ∫ Q(z, α) dF(z) ≤ A,  α ∈ Λ   (7.36)

Then the following two statements are equivalent:

1. The empirical risk minimization principle is strictly consistent on the set of functions Q(z, α), α ∈ Λ, given a (fixed) distribution function F(z).

2. One-sided uniform convergence of means to their mathematical expectations takes place over the set of functions Q(z, α), α ∈ Λ, given a (fixed) distribution function F(z).

A proof is given in section 3.5 in [53]. This is a very interesting result, as it is a generalization of the main law of statistics, the ‘Law of Large Numbers’, which states that the sequence of means converges to the expectation of a random variable (if it exists) as the number l increases. In the context here this happens when the set of functions Q(z, α), α ∈ Λ, contains only one element; then the sequence of random variables ξ_l of a random process (defined by definition 17 in appendix C) always converges in probability to zero. A generalization of the ‘Law of Large Numbers’ can be done easily when the set of functions Q(z, α), α ∈ Λ, is finite, and is a bit more difficult if it is infinite, in which case it becomes a ‘Law of Large Numbers in a functional space’ (the space of functions Q(z, α), α ∈ Λ). In the latter case, the sequence ξ_l does not necessarily converge to zero if the number of functions Q(z, α), α ∈ Λ, is infinite. The problem is how to describe the properties of an infinite set of functions Q(z, α), α ∈ Λ, and the probability measure F(z) under which the sequence of random variables ξ_l converges to zero. Nevertheless, the equivalence between the ‘Law of Large Numbers in a functional space’ and uniform two-sided convergence of the means to their mathematical expectations still holds and thus can be viewed as a generalization of the classical ‘Law of Large Numbers’. Two concepts, the annealed entropy and the growth function, are very useful to answer questions about the convergence rate and the consistency of the ERM principle for any probability measure.

Definition 1. The annealed entropy H^Λ_ann(l) = ln E[N^Λ(z1, . . . , zl)].

Definition 2. The growth function G^Λ(l) = ln sup_{z1,...,zl} N^Λ(z1, . . . , zl).
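
For a concrete feel for these quantities, the small sketch below (illustrative only, not from the thesis) counts the dichotomies N^Λ(z_1, .., z_l) produced on a random sample by the simple class of one-dimensional threshold indicator functions Q(z, α) = Θ(z − α), for which N^Λ = l + 1, so that the random entropy is ln(l + 1) and G^Λ(l)/l → 0.

import numpy as np

# Count the dichotomies that 1-D threshold indicator functions
# Q(z, alpha) = Theta(z - alpha) induce on a sample z_1, .., z_l.
def dichotomies(z):
    z = np.sort(np.asarray(z, dtype=float))
    # Candidate thresholds: below, between and above the sample points.
    cuts = np.concatenate(([z[0] - 1.0], (z[:-1] + z[1:]) / 2.0, [z[-1] + 1.0]))
    return {tuple((z >= a).astype(int)) for a in cuts}

rng = np.random.default_rng(1)
for l in [2, 5, 10, 20]:
    z = rng.standard_normal(l)
    N = len(dichotomies(z))          # N^Lambda(z_1, .., z_l), here l + 1
    H = np.log(N)                    # random entropy
    G = np.log(l + 1)                # growth function of this simple class
    print(f"l={l:3d}  N={N:3d}  H={H:.3f}  G(l)/l={G / l:.3f}")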

7.1.5.1.4 Three milestones of learning theory ^11

The capacity concept of the entropy or the growth function completely defines the qualitative behavior of the learning process: the consistency of learning. Further, it can be used to determine the nonasymptotic bound on the rate of convergence of the learning process, for both distribution-dependent and distribution-independent cases.

1. The first milestone is given by Theorem 4: any learning machine minimizing the empirical risk is required to satisfy (7.35).

2. Sufficient conditions for a fast^12 rate of convergence of the ERM method are achieved when

lim_{l→∞} H^Λ_ann(l) / l = 0

holds for the particular probability measure F(z).

^11 The following statements are valid for pattern recognition but can be generalized to bounded real-valued functions by the use of an ε-net and the replacement of the annealed entropy with the annealed ε-entropy, and similarly for the growth function.
^12 Fast means with exponential bounds on the asymptotic rate of uniform convergence of the ERM principle. This leads to the question of the existence and values of positive constants b and c such that for sufficiently large l > l(ε, Λ, P), the inequality P{ sup_{α∈Λ} | ∫ Q(z, α) dF(z) − (1/l) Σ_{i=1}^{l} Q(z_i, α) | > ε } < b e^{−c ε² l} will hold for a particular F(z).


3. The necessary and sufficient condition for fast^13 convergence and for consistency of the ERM principle for any probability measure F(z) ∈ P is:

lim_{l→∞} G^Λ(l) / l = 0.

7.1.5.1.5 Bounds on the rate of convergence

As was seen in the previous paragraph, the bounds on the convergence rate depend on the combinatorial structure of the set of functions the learning machine can choose from. For the set of totally bounded non-negative functions 0 ≤ Q(z, α) ≤ B, α ∈ Λ, the inequality (7.37) holds with probability of at least 1 − η simultaneously for all functions Q(z, α).

R(α) ≤ R_emp(α) + (E(l)/2) (1 + √(1 + 4 R_emp(α) / E(l)))   (7.37)

with

E(l) = B ε(l),   ε(l) = 4 (G^{Λ,B}(2l) − ln(η/4)) / l,   and   (7.38)

G^{Λ,B}(l) = ln sup_{z1,...,zl} N^Λ(z1, . . . , zl).   (7.39)

For the difference, Δ(α_l), between the attained risk, R(α_l), and the minimal risk, R(α_0) = inf_{α∈Λ} R(α), inequality (7.40) holds with probability 1 − 2η.

Δ(α_l) := R(α_l) − R(α_0) < B [ √(−ln η / (2l)) + E(l) (1 + √(1 + 4 R_emp(α_l) / (B E(l)))) ]   (7.40)

E(l) = 4 (G^{Λ,B}(2l) − ln(η/4)) / l + 1/l,   for distribution-free nonconstructive bounds,

E(l) = 4 (h (ln(2l/h) + 1) − ln(η/4)) / l,   for distribution-free constructive bounds,

with h = the VC-dimension of the set of real-valued functions Q(z, α), α ∈ Λ. Note that the bounds are also achieved for the pattern recognition task with B = 1. For real-valued functions Q(z, α), α ∈ Λ, a set of indicators for the function Q(z, α∗) is introduced by defining it as Θ(Q(z, α∗) − β), β ∈ (inf_z Q(z, α∗), sup_z Q(z, α∗)), with Θ(u) = 1 for u ≥ 0 and Θ(u) = 0 for u < 0. This set of indicators corresponds to a single indicator function in the pattern recognition case. The complete set of indicators for a set of real-valued functions Q(z, α), α ∈ Λ, is defined as Θ(Q(z, α) − β), α ∈ Λ, β ∈ B = (inf_{z,α} Q(z, α), sup_{z,α} Q(z, α)), and corresponds to the set of indicator functions in the pattern recognition case.

^13 Similarly to before, but now the supremum has to be taken over all probability measures: sup_{F(z)∈P} P{ sup_{α∈Λ} | ∫ Q(z, α) dF(z) − (1/l) Σ_{i=1}^{l} Q(z_i, α) | > ε } < b e^{−c ε² l}.


7.1.5.2 VC-dimension

The VC-dimension is a useful combinatorial parameter of sets of functions. It can be used to estimate the true risk on the basis of the empirical risk and the number of i.i.d. training examples. In the previous paragraphs the concepts of annealed entropy and growth function were used to derive theoretical distribution-dependent and distribution-free bounds. However, these bounds are nonconstructive because they cannot be evaluated in practice. To address this, a new concept, the Vapnik-Chervonenkis (VC) dimension, is introduced, which can be evaluated for various function sets relevant to learning systems. The VC-dimension is a combinatorial parameter on sets of subsets, e.g. on concept classes or hypothesis classes.

Definition 3. A subset S ⊆ X of samples of the domain of samples X is said to be shattered by the concept space C if and only if {S ∩ c | c ∈ C} = 2^S, i.e. the power set of S.

Definition 4. (VC-dimension) The Vapnik-Chervonenkis dimension of the concept space C, VC-dim(C), is the cardinality of the largest set S ⊆ X shattered by C, i.e.

VC-dim(C) = max_{S : S ⊆ X ∧ {S ∩ c | c ∈ C} = 2^S} |S|.   (7.41)

Some examples of the VC-dimension h (the first example is checked numerically below):

• For a set of linear indicator functions Q(z, α) = Θ(Σ_{p=1}^{n} α_p z_p + α_0) in the n-dimensional coordinate space Z = (z_1, . . . , z_n): h = n + 1.

• For a sigmoidal neural network: h ≤ c|W|^4, where c is some constant and |W| is the number of parameters.
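
The first example can be verified numerically for n = 2: three points in general position can be shattered by linear indicator functions, in agreement with h = n + 1 = 3. The sketch below (an illustration, not part of the thesis) runs a plain perceptron on each of the 2³ labellings; since every labelling of three non-collinear points is linearly separable, the perceptron converges for all of them.

import itertools
import numpy as np

def separable(points, labels, max_iter=10000):
    """Find (w, b) with a plain perceptron; the labellings here are all
    separable, so convergence indicates a separating hyperplane exists."""
    X = np.hstack([points, np.ones((len(points), 1))])   # absorb the bias
    w = np.zeros(X.shape[1])
    for _ in range(max_iter):
        updated = False
        for x, y in zip(X, labels):
            if y * (w @ x) <= 0:
                w += y * x
                updated = True
        if not updated:
            return True
    return False

# Three points in general position (not collinear).
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
shattered = all(separable(pts, np.array(lab))
                for lab in itertools.product([-1, 1], repeat=3))
print("all 2^3 labellings separable:", shattered)   # expected: True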

7.1.5.3 Structural Risk Minimization (SRM)

The complexity (or capacity) of a function class from which the learner chooses a function that minimizes the empirical risk determines the convergence rate of the learner to the optimal function. For a given number of i.i.d. training examples, there is a trade-off between the degree to which the empirical risk can be minimized and the degree to which it will deviate from the true risk. In traditional learning this is often addressed as the ‘problem of overtraining’ and was mainly dealt with by using an extra validation set. Statistical Learning Theory, on the other hand, gives theoretical bounds for this trade-off. Structural risk minimization makes use of this concept, and this is the theoretical basis for support vector machine algorithms. The VC-dimension can also be used to determine a sufficient number of training examples to learn probably approximately correct (PAC).

7.1.5.3.1 Principle of Structural Risk Minimization

The idea is to have an ordered set, or structure, S, of sets S_k of functions Q(z, α), α ∈ Λ_k, such that they can be nested:

S_1 ⊂ S_2 ⊂ · · · ⊂ S_k ⊂ · · · , where S_k = {Q(z, α) | α ∈ Λ_k} and S^∗ = ⋃_k S_k   (7.42)

Admissible structures are the structures that satisfy the following properties:

1. Any element S_k of the structure S has a finite VC dimension h_k.

2. Any element S_k of the structure (7.42) contains either

(a) a set of totally bounded functions 0 ≤ Q(z, α) ≤ Bk, α ∈ Λk,

(b) or a set of nonnegative functions Q(z, α), α ∈ Λk, satisfying the inequality

sup_{α∈Λ_k} (E[Q^p(z, α)])^{1/p} / E[Q(z, α)] ≤ τ_k < ∞.   (7.43)

3. The set S^∗ is everywhere dense in the set S in the L1(F) metric, where F = F(z) is the distribution function from which examples are drawn.

The structure (7.42) implies the following: the sequence of values of VC dimensions h_k, the sequence of values for the bounds B_k and the sequence of values for the bounds τ_k of the structure elements S_k are all nondecreasing with increasing k, i.e.

h1 ≤ h2 ≤ · · · ≤ hk ≤ · · · (7.44)

B1 ≤ B2 ≤ · · · ≤ Bk ≤ · · · (7.45)

τ1 ≤ τ2 ≤ · · · ≤ τk ≤ · · · (7.46)

If the function that minimizes the empirical risk in the set of functions S_k is denoted by Q(z, α^k_l), then the actual risk for this function is bounded with probability 1 − η by (7.47).

R(α^k_l) ≤ R_emp(α^k_l) + B_k E_k(l) (1 + √(1 + 4 R_emp(α^k_l) / (B_k E_k(l))))   (7.47)

where

E_k(l) = 4 (h_k (ln(2l/h_k) + 1) − ln(η/4)) / l.   (7.48)

For a given set of observations z_1, . . . , z_l the SRM method chooses the element S_k of the structure for which the smallest bound on the risk (the smallest guaranteed risk) is achieved. Therefore, the idea of the structural risk minimization principle can be summarized as: provide the given set of functions with an admissible structure and then find the function that minimizes the guaranteed risk (7.47) over the given elements of the structure. This is graphically illustrated in figure 7.2.
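
The selection rule follows directly from (7.47) and (7.48). The sketch below is purely illustrative: the VC dimensions h_k of a hypothetical nested structure and the empirical risks on its elements are made-up placeholder values with B_k = 1 (pattern recognition); the code only shows how the guaranteed risk is evaluated and how the element with the smallest bound is chosen.

import numpy as np

def guaranteed_risk(r_emp, h, l, eta=0.05, B=1.0):
    """Bound (7.47) with E_k(l) from (7.48)."""
    E = 4.0 * (h * (np.log(2.0 * l / h) + 1.0) - np.log(eta / 4.0)) / l
    return r_emp + B * E * (1.0 + np.sqrt(1.0 + 4.0 * r_emp / (B * E)))

l = 200                                                    # number of training examples
h_k = np.array([2, 5, 10, 25, 60, 150])                    # hypothetical VC dimensions of S_1 c S_2 c ...
r_emp_k = np.array([0.30, 0.18, 0.11, 0.07, 0.05, 0.04])   # placeholder empirical risks

bounds = [guaranteed_risk(r, h, l) for r, h in zip(r_emp_k, h_k)]
k_star = int(np.argmin(bounds))
for k, (h, r, b) in enumerate(zip(h_k, r_emp_k, bounds), start=1):
    print(f"S_{k}: h={int(h):4d}  R_emp={r:.2f}  guaranteed risk={b:.2f}")
print("SRM choice: S_%d" % (k_star + 1))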


Figure 7.2: The bound on the (guaranteed) risk is the sum of the empirical risk and the confidence interval. The empirical risk decreases with increasing structure index k, because the capacity parameter, the VC dimension h_k, increases, while the confidence interval grows. With a huge capacity of functions available to the learning machine, it is intuitively clear that the training error (the empirical risk) gets smaller, while the danger of overtraining decreases the confidence that the ‘true’ underlying function responsible for generating the observed sample data has been found, and hence increases the confidence interval.


Chapter 8

Support Vector Machines

Support vector machines (SVMs) is the term for a family of algorithms based on statistical learning theory. Their name comes from the fact that they learn their knowledge from the training set and use some of these data points, the so-called support vectors, to describe the solution as a linear combination of mapped support vectors in a usually high-dimensional feature space. In this chapter the description of the basic theory follows Haykin’s approach in [51]. However, deviations are made and some specific matters are elaborated more closely for the sake of a better understanding of the algebraic perceptron in later chapters. There are basically two types of algorithms, one for pattern classification and one for function approximation.

8.1 Geometry of SVMs for Pattern Classification

Let w denote the solution in the form of the normal of the separating hyperplane that separates the training data T = {(x_i, y_i) | i = 1, .., N}, where x_i is an input vector and y_i = ±1 its class label.

8.1.1 The linearly separable case

If the training set T is separable, (8.1) and (8.2) are equivalent conditions for separability.

w^T x_i + b > 0, for all i with y_i = +1   (8.1)

w^T x_i + b < 0, for all i with y_i = −1   (8.2)

A hyperplane separating the data points is defined by (8.3).

g(x) := wTx + b = 0 (8.3)

The geometrical interpretation of g(x) is as an algebraic measure of the distance of the point x from the hyperplane g(x) = 0. This can easily be seen by projecting x onto the hyperplane as in (8.4), whence (8.5) and hence (8.6) hold.

x = x_p + r w/||w||   (8.4)

g(x) = g(x_p + r w/||w||) = g(x_p) + r w^T w/||w|| = r ||w||   (8.5)

r = g(x)/||w||   (8.6)

Hard separability is achieved when (8.7) and (8.8) hold for all points (x_i, y_i), i = 1, .., N.

w^T x_i + b ≥ d^+_min > 0, for y_i = +1   (8.7)

w^T x_i + b ≤ d^-_min < 0, for y_i = −1   (8.8)

Often d_min := d^+_min := −d^-_min := 1.

Soft separability is given when for some data point x_i, d^+_min > w^T x_i + b > 0 or d^-_min < w^T x_i + b < 0, but the problem is still separable. Sometimes the separability condition is omitted.

It is clear from the definition of the separable problem that the distance between the closest planes containing points of different classes is given by (8.9), where ρ is called the margin of separation.

ρ := g(x_i | y_i = +1)/||w|| + |g(x_i | y_i = −1)|/||w|| = |d^+_min|/||w|| + |d^-_min|/||w|| = 2 d_min / ||w||   (8.9)

This margin is closely related to the VC-dimension: when ρ is maximized, the VC-dimension of the learning machine is minimized and generalization is best, in the sense that there is a lowest upper bound for the generalization error, averaged over all equally probable functions, that the learning machine can achieve. Therefore, the goal of the so-called primal objective is to maximize ρ, or equivalently to minimize ||w||, subject to the constraints (8.10); hence the Lagrangian function (8.11).

y_i g(x_i) ≥ d_min   (8.10)

L_P(w, b, α) = (1/2) w^T w − Σ_{i=1}^{N} α_i (g(x_i) y_i − d_min).   (8.11)

Note: g(x_i) y_i − d_min is always ≥ 0. A necessary condition for optimality of L_P is that the gradient with respect to the parameters w and b is zero, as defined by (8.12) and (8.13).

∂L_P(w, b, α)/∂w = w − Σ_{i=1}^{N} α_i y_i x_i != 0,  ⇒  w = Σ_{i=1}^{N} α_i y_i x_i   (8.12)

∂L_P(w, b, α)/∂b = − Σ_{i=1}^{N} α_i y_i != 0,  ⇒  α^T y = 0   (8.13)

At the optimum, each individual product between a Lagrange multiplier and its constraint has to be zero (Kuhn-Tucker conditions), as defined by (8.14).

α_i [y_i g(x_i) − d_min] = 0   (8.14)

Therefore, non-zero Lagrange multipliers imply that the contents of the brackets be zero; hence x_i lies on a closest hyperplane parallel to the optimal separating one and is therefore a support vector.

8.1.2 The linearly non-separable case

In this case one introduces slack variables ξ_i to penalize misclassified points (0 < d_min < ξ_i). It would be nice to find a separating hyperplane that minimizes the number of misclassifications on the training set. This might be done by minimizing the functional defined by (8.15), with the indicator function (8.16), with respect to w and subject to the constraints y_i(w^T x_i + b) ≥ d_min − ξ_i, i = 1, .., N.

J(ξ) = Σ_{i=1}^{N} I(ξ_i − d_min)   (8.15)

I(ξ) := { 0 if ξ ≤ 0;  1 if ξ > 0 }   (8.16)

However, minimization of J(ξ) with respect to w would lead to a nonconvex optimization problem that is NP-complete. Therefore, one simplifies the functional to (8.17).

J(ξ) = Σ_{i=1}^{N} ξ_i   (8.17)

Note that now softly separated points are also penalized, though less (0 < ξ_i ≤ d_min). Moreover, the functional can be combined with the original objective of minimizing the norm of w, defined by (8.18) subject to (8.19). This leads to a formulation of a primal Lagrangian (8.20) for the overall goal that meets the constraints as well.

J_P(w, ξ) = (1/2) w^T w + Σ_{i=1}^{N} C_i ξ_i = (1/2) w^T w + ξ^T C   (8.18)

ξ ≥ 0;  note: often C_i = C, as in (8.17).   (8.19)


L_P(w, ξ, α, β) = J_P(w, ξ) − Σ_{i=1}^{N} α_i (g(x_i) y_i − d_min + ξ_i) − Σ_{i=1}^{N} β_i ξ_i   (8.20)

While the first two necessary conditions (8.12), (8.13) are also the same in the non-separable case, another necessary condition appears, as defined by (8.21).

∂L_P(w, b, ξ, α)/∂ξ_k = C_k − α_k − β_k != 0,  ∀ k = 1, .., N.   (8.21)

8.1.3 The Dual Objective Formulation

To simplify solving the minimization of (8.11), or more generally of the non-separable problem (8.20), using the necessary conditions (8.12), (8.13) and, for the non-separable case, (8.21), the Wolfe Duality Theorem [56] can be used because the primal problem is a convex quadratic objective with respect to its parameters. This allows one to rewrite the primal objective completely in terms of the Lagrange multipliers α. Minimizing (8.22) is equivalent to maximizing (8.23) subject to conditions (8.24) and (8.25).

L_P(w, b, ξ, α, β) = (1/2) w^T w + ξ^T C − Σ_{i=1}^{N} α_i d_i w^T x_i − b Σ_{i=1}^{N} α_i d_i + Σ_{i=1}^{N} α_i   (8.22)

L_D(α) = Σ_{i=1}^{N} α_i − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} α_i α_j d_i d_j x_i^T x_j, subject to   (8.23)

Σ_{i=1}^{N} α_i d_i = 0, and   (8.24)

0 ≤ α_i ≤ C_i, ∀ i = 1, .., N.   (8.25)

Alternatively, in vector form, minimizing (8.22) is equivalent to maximizing (8.26) subject to conditions (8.27) and (8.28), with e being the N-dimensional one-vector and Q defined by (8.29) with the help of the data matrix X_d, given by (8.30).

L_D(α) = α^T e − (1/2) α^T Q α, subject to   (8.26)

α^T d = 0, and   (8.27)

0 ≤ α ≤ C,   (8.28)

Q = X_d^T X_d, and   (8.29)

X_d = [d_1 x_1  d_2 x_2  ...  d_N x_N].   (8.30)


Note that neither the slack variables ξ nor their Lagrange multipliers β occur in the dual formulation. The separable problem can be seen as a non-separable one with C → ∞, so there is no upper limit in the separable case for the Lagrange multipliers α. It can be seen from (8.21) that α + β = C. The Kuhn-Tucker conditions at the optimal solution are now (8.31) and (8.32).

α_i [d_i(w^T x_i + b) − d_min + ξ_i] = 0, ∀ i = 1, .., N, and   (8.31)

β_i ξ_i = 0, ∀ i = 1, .., N.   (8.32)

Therefore, it is seen that if α_i < C_i then ξ_i must be zero. This allows one to determine b from any data point whose Lagrange multiplier satisfies 0 < α_i < C_i, as in this case the bracketed part of equation (8.31) must be zero. However, it is numerically better to take the mean value of b resulting from all such points.
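
The dual problem (8.26)–(8.28) is small enough to be handed to a general-purpose solver. The sketch below (illustrative; scipy is assumed to be available and is not the software used in the thesis) builds Q from the data as in (8.29)–(8.30), maximizes L_D, recovers w via (8.12) and takes b as the mean over the points with 0 < α_i < C_i, as described above.

import numpy as np
from scipy.optimize import minimize

# Tiny 2-D toy problem (made-up data for illustration).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 1.5],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
d = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
N, C = len(d), 10.0

Xd = X * d[:, None]            # rows are d_i x_i (the columns of X_d in (8.30))
Q = Xd @ Xd.T                  # Q_ij = d_i d_j x_i^T x_j, cf. (8.29)

def neg_LD(a):                 # maximize L_D = a^T e - 0.5 a^T Q a
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_LD, np.zeros(N), method="SLSQP",
               bounds=[(0.0, C)] * N,
               constraints={"type": "eq", "fun": lambda a: a @ d})   # (8.27)
alpha = res.x

w = (alpha * d) @ X                                   # (8.12)
sv = (alpha > 1e-6) & (alpha < C - 1e-6)              # points with 0 < alpha_i < C_i
b = np.mean(d[sv] - X[sv] @ w)                        # mean value of b over those points
print("alpha =", np.round(alpha, 3))
print("w =", w, " b =", round(float(b), 3))
print("margins d_i(w^T x_i + b):", np.round(d * (X @ w + b), 3))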

A remark has to be made regarding the preference of the dual objective over the primal one. Apart from the absence of the slack variables ξ and their Lagrange multipliers β, the crucial criterion for preferring the dual objective is the dimensionality of w, which is often much higher than the number of points, which in turn is equal to the dimensionality of the Lagrange multiplier vector α. This is especially true when a transform to a high-dimensional space is made and inner-product kernels are used, as in section 8.2, where the dimension of w can easily be of the order 350, which clearly negates any attempt at solving the primal problem directly.

8.1.4 Lifting the Dimension

One nuisance in the dual optimization problem is the equality constraint (8.27). The optimization would be much simpler if the equality constraint could be avoided altogether. Here, the implications of lifting in the feature space are considered, where the lifting consists of adding an extra dimension that can absorb the affine bias of the hyperplane. To absorb the parameter b, which is the distance from the origin times ||w|| of the optimal hyperplane g(x) = 0, one can lift, or embed, the feature space into a space whose dimension is increased by one, as defined by (8.33) to (8.35), so that the optimal hyperplane changes from an affine to a linear space, i.e. the kernel of the mapping g̃.

w̃ := [w^T, b/λ]^T   (8.33)

x̃ := [x^T, λ]^T, and therefore   (8.34)

g̃(x̃) = w̃^T x̃ = w^T x + b = g(x)   (8.35)

8.1.4.1 Implications of lifting the dimension

There are two obvious choices for the functional J. Either it can be changed to J̃ = J + b²/λ² = (1/2) w̃^T w̃ − Σ_{i=1}^{N} α_i y_i [g̃(x̃_i) − d_min], which has the same form as J but in a space whose dimension is increased by 1. Or, the other approach is to keep the original J and use J := J̃ − b²/λ², which equals J. In the first case the equality constraint α^T y = 0 falls away, yielding a non-optimal solution with respect to the original data, which maximizes 1/||w̃||² = 1/(||w||² + b²/λ²). Of course, the second approach achieves the same solutions as before.

8.1.4.2 Eliminating the equality constraint through lifting and projection

It is straightforward to show that the necessary condition of the gradient of the lifted functional J̃ being zero does not contain the equality constraint α^T y = 0, as defined by (8.36) to (8.38).

J̃(w̃, α) = (1/2) w̃^T w̃ − Σ_{i=1}^{N} α_i (y_i g̃(x̃_i) − d_min)   (8.36)

= (1/2) w̃^T w̃ − Σ_{i=1}^{N} α_i (y_i w̃^T x̃_i − d_min)   (8.37)

∂+J̃(w̃; α)/∂w̃ = w̃ − Σ_{i=1}^{N} α_i y_i x̃_i != 0   (8.38)

It can be shown (see section 8.1.5, [57]) that for ||w||² > 1/4 the solution w̃ in the lifted space is a balanced one. However, the solution achieved in the lifted space cannot be easily linked to the original solution. The idea of projecting the ‘lifted solution’ back to the original space will not work in general. The notation for the lifted solution w̃ and its projection are given by (8.39) and (8.40), respectively.

w̃ = Σ_{i=1}^{N} α_i y_i x̃_i = Σ_{i=1}^{N} α_i y_i [x_i^T, λ]^T   (8.39)

w̃_p = s [w^T, 0]^T = Σ_{i=1}^{N} α_i y_i [x_i^T, 0]^T, for some scalar s ≠ 0.   (8.40)

In [58] a proof was shown that the projected solution obtained from the lifted data cannot be the original solution. Unfortunately, although it is true that in general w_p ≠ w, there is an error in the formula for g(w̃) = cos(θ(w̃)) = w_p^T w_p / √(2 w_p^T w_p + w_λ²) in equation (43) in [58]; it should read w_p^T w_p / (√(w_p^T w_p + w_λ²) √(w_p^T w_p)). The argument in [58] was that the gradient of g(w̃) with respect to w̃ = [w_p^T, w_λ]^T must be zero, because the optimization problems in the lifted and original spaces should satisfy (8.41) if there exists a projected solution.

arg max_{w̃∈W̃} min_{x̃∈X̃} d(x̃; w̃) ?= arg max_{w∈W} min_{x∈X} d(x; w)   (8.41)


or, with some abbreviations,

arg max_{w̃∈W̃} g(w̃) f(w̃) ?= arg max_{w∈W} f(w)   (8.42)

with g(w̃) = cos(θ(w̃)). The implication is that now every w_p ≠ 0 would satisfy (8.45) for w_λ = 0, meaning a projected solution could exist.

∂+g(w̃)/∂w_p = 2 w_p [(w_p^T w_p)² + w_λ² w_p^T w_p]^(−1/2)
  − (1/2) w_p^T w_p [2(w_p^T w_p) 2w_p + w_λ² 2w_p] [(w_p^T w_p)² + w_λ² w_p^T w_p]^(−3/2) != 0   (8.43)

(at w_λ = 0)  = 2 w_p [(w_p^T w_p)^(−1) − (w_p^T w_p)^(−1)] != 0   (8.44)

∂+g(w̃)/∂w_λ = −w_p^T w_p (2 w_p^T w_p + w_λ²)^(−3/2) w_λ != 0   (8.45)

However, the form g(w̃) = cos(θ(w̃)) is wrong as well and has a more complicated form in general. This form was based on [57], equations (24) and (25), which do not account for an asymmetric data distribution (with respect to the origin) and do not reflect the resulting rotation of the lifted solution w̃ by the angle φ. See figure 8.2, which is a geometrical construction flipping the plane that goes through m, m̃ and the origin and is normal to the optimal separating hyperplane, achieved from lifted data, down to IR². The statement “It is easy to see that the normal of the optimal classifying hyperplane w∗ can be obtained in the following way: Select a pair of data points, say x1 and x2, from different clusters such that the distance between them is the minimum among all such pairs. w∗ can then be expressed as w∗ = (x2 − x1)/||x2 − x1||.” in [57] seems not to be correct, as figure 8.1 shows a trivial counterexample. Nevertheless, the results in table 1 in [57] still indicate the correct behavior, as in the case of many data points the suggested solution w∗ gets closer to the true solution w based on the original data. This can be seen as adding more critical data points (support vectors) between x1 and x3 in figure 8.1.

A simplified analysis with only two points x1 and x2 belonging to opposite classes can be done. It can be seen that if these points were to represent data clusters the qualitative argument would still hold, though the general analysis would be more complex. Lifting x1 and x2 results in the points x̃1 = (x1, λ) and x̃2 = (x2, λ) with midpoint m̃ = (1/2)(x̃1 + x̃2). The normalized normal vector w̃ of the plane separating x̃1 and x̃2 with maximum margin and going through the origin leads to the optimization problem (8.46). The (signed) distances from the points x̃i are t_i := w̃^T x̃_i, i = 1, 2; t_1 = −t_2 is equivalent to w̃^T m̃ = 0.

max_{w̃} w̃^T x̃_1 subject to w̃^T m̃ = 0 and w̃^T w̃ = 1.   (8.46)


Figure 8.1: Clearly, w∗ = (x2 − x1)/||x2 − x1|| is not the true normal to the optimal hyperplane. Adding more (support vector) data between x1 and x3 will achieve a w∗ that is closer to w.

Using Lagrange multipliers and differentiating with respect to w̃ yields (8.47) to (8.50).

x̃_1 = λ m̃ + 2µ w̃   (8.47)

w̃^T x̃_1 = 2µ   (8.48)

x̃_1^T m̃ = λ m̃^T m̃   (8.49)

w̃ ∝ x̃_1 − (x̃_1^T m̃ / m̃^T m̃) m̃   (8.50)

Clearly, as long as x_1, x_2 and the origin are not collinear, the projection w_p of w̃ onto the original data plane has a different direction from the normal w = x_2 − x_1 of the optimal separating plane in the original space. However, if m = (1/2)(x_1 + x_2) is located at the origin, the lifted and original solutions are the same, independent of the lifting factor λ. If there are many data points involved, the behavior will be similar; however, the optimization problem becomes the SVM problem and is more difficult to solve. Realizing that only the support vectors determine the optimal hyperplane, the following heuristic can be applied to find an approximate solution to the original optimal hyperplane, given by w and bias b.

8.1.4.2.1 Approximate solution by lifting heuristic

1. To avoid the equality constraint (8.27) in the SVM problem, the data are lifted and solved for a separating solution w̃.

2. The found support vectors of class 1 and 2 are averaged and projected back to yield centers m_1 and m_2 of approximate support vector clusters.

3. Translate the original data x ∈ X such that the average m = (1/2)(m_1 + m_2) is at the origin, resulting in shifted data x̂ ∈ X̂ = {x̂ | x̂ = x − m, x ∈ X}.


4. Lift the shifted data and redo the optimization in the lifted space, yielding a new lifted solution ŵ, whose projection is now certainly closer to the original w.

Variants can be made by identifying the corresponding original data points of the support vectors of the last step and solving a non-lifted SVM only on this set of data points, thus reducing the original problem drastically. As is seen in [57], already moderate lifting factors λ give close approximations to the original solution. The larger the lifting amount λ is, the closer the lifted and the original solutions will be, at least in theory. If λ is too large, numerical problems in the optimization prevail, destroying all the benefit gained by simplifying the optimization process. Translating the original data in the way proposed will reduce the influence of the lifting factor on the optimization problem and thus give a better approximation. Applying the original SVM procedure on the translated data could be done, hoping a small bias b will reduce the impact of the equality constraint. Furthermore, the whole optimization in the lifted space could be avoided by using a fast preprocessor like the algebraic perceptron, introduced in the next chapter, to get a set of important data points, though not necessarily the true support vectors.

8.1.5 Balanced Classifiers

Balanced classifiers are classifiers whose separating hyperplane is equally distant from both classes, i.e. from the closest hyperplanes parallel to the separating one, each containing some points either from the positive or negative class. An unbalanced classifier, f(x; w′), with some parameters w′, is closer to either the positive or negative class. Without loss of generality it is assumed the unbalanced classifier is closer to the negative class:

f(x; w′, b′) = sign(g(x)) = sign(w′^T x + b′), with   (8.51)

w′^T x^+_i + b′ = d^+_min := c, c > d_min := 1, b̃′ := b′/λ   (8.52)

w′^T x^-_i + b′ = d^-_min := −d_min := −1   (8.53)

Following [57], it is easy to construct a balanced classifier, f(w, b), from an unbalanced one, say f(w′, b′), by shifting the unbalanced separating hyperplane until it is exactly in between the two classes, (w = (2 d_min/(1 + c)) w′, b = (2 d_min/(1 + c)) b′ + d_min (1 − c)/(1 + c)). Conversely, given a balanced classifier, an unbalanced one can be constructed:

w′(c) := ((1 + c)/(2 d_min)) w   (8.54)

b′(c) := ((1 + c)/(2 d_min)) b − (1 − c)/(2 d_min) = (b − 1 + (b + 1) c)/(2 d_min)   (8.55)


Optimizing J̃(w′, b′; α) = J̃(w̃′; α) minimizes ||w̃′||² = ||w′||² + b′². Therefore, when

L(c) := ||w̃′(c)||² = ||w′(c)||² + b′(c)²   (8.56)

= ((1 + c)/(2 d_min))² ||w||² + ((b − 1 + (b + 1) c)/(2 d_min))²   (8.57)

takes its minimum value at c = 1, it will be a balanced classifier. Calculating the total derivative of L(c) with respect to c and its Hessian to ensure minimality:

dL(c)/dc = ((1 + c)/(2 d²_min)) ||w||² + ((b − 1 + (b + 1) c)/(2 d²_min)) (b + 1)   (8.58)

(at c := 1)  = ||w||²/d²_min + (b² + b)/d²_min = ||w||²/d²_min + ((b + 1/2)² − 1/4)/d²_min   (8.59)

d²L(c)/dc² = ||w||²/(2 d²_min) + (b + 1)²/(2 d²_min) > 0   (8.60)

⇒ dL(c)/dc > 0 ∀ c ≥ 1, if ||w||² > 1/4   (8.61)

⇒ L(c ≥ 1) is minimal at c = 1 if ||w||² > 1/4. In other words, for ||w||² > 1/4 a balanced classifier f(x̃; w̃) = f(x; w, b) is achieved when solving the ‘lifted’ optimization problem J̃(w̃′ = w̃; α). Hence, for ||w|| > 1/2 it is a balanced classifier whose hyperplane G ⊂ IR^{n+1} : g̃(x̃; w̃) = w̃^T x̃ = 0 goes through the origin and is equally distant from all positive and negative support vectors because it is balanced.

8.2 Transforms to High-dimensional Feature Space

So far the concept of separating linearly separable patterns has been considered, or at least patterns separable in a regularized sense. However, in cases where there is no separation, or the regularization would have to be taken so far that the underlying meaning of the data is lost, transforms to higher dimensional spaces can be used in which the transformed data becomes separable. An easy example is the XOR problem, where in two dimensions the problem is not separable but becomes separable when transformed into IR^6, see figure 8.4. If some separating higher-order manifold can be found, like an ellipse that has the two square samples inside and the other data points outside, a linear representation of this manifold in terms of some basis functions might be found. This is exactly how one can determine that IR^6 can linearly separate the XOR problem, by using (1 + x^T y)² and expanding it into a sum of monomials 1 + 2y_1x_1 + 2y_2x_2 + 2y_1y_2x_1x_2 + y_1²x_1² + y_2²x_2².


8.2.1 Inner-Product Kernels

As was seen before, a mapping from the input space E to another, generally much higher-dimensional, feature space V can be introduced. Let φ(x) ∈ V, with x ∈ E, denote this transformation. Then the same principles can be applied in the feature space as was done before in the input space. However, this expansion becomes very impractical as the number of terms explodes for higher-order manifolds. Nevertheless, there exists a neat trick to overcome this computational problem. If the sum of monomials can be written as a function of an inner product, it is easy to evaluate on which side of the so created hyperplane a transformed data point lies, and therefore to decide whether a problem is separable or not in this newly created, so called feature space. More mathematically, if an inner product in the feature space can be defined as a function of the input points x and y as in (8.62), the optimal hyperplane for separation can be constructed without evaluating the transform φ(.) explicitly and without the direct calculation of the inner product in the feature space.

K(x, y) = φ^T(x) φ(y) = Σ_{i=1}^{m} φ_i(x) φ_i(y), ∀ x, y ∈ E   (8.62)

This kernel expansion is a special case of Mercer’s theorem, which states that a kernel on a closed interval a ≤ x, y ≤ b can be expanded into

K(x, y) = Σ_{i=1}^{∞} λ_i φ_i(x) φ_i(y), λ_i > 0   (8.63)

if the kernel satisfies the following condition of semi-positiveness for any square-integrable function ψ(.):

∫_a^b ∫_a^b K(x, y) ψ(x) ψ(y) dx dy ≥ 0, subject to   (8.64)

∫_a^b ψ²(x) dx < ∞   (8.65)
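
The XOR example of figure 8.4 gives a direct numerical check of (8.62): for the kernel K(x, y) = (1 + x^T y)², an explicit map onto the six monomials of degree ≤ 2 reproduces the kernel value exactly, and the mapped XOR points become linearly separable. The sketch below is illustrative only; the √2 scalings are one conventional choice of coefficients that makes the plain Euclidean inner product in IR^6 equal the kernel.

import numpy as np

def phi(x):
    """Explicit degree-2 map so that phi(x).phi(y) = (1 + x^T y)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     np.sqrt(2) * x1 * x2,
                     x1 ** 2, x2 ** 2])

def K(x, y):
    return (1.0 + np.dot(x, y)) ** 2

# XOR configuration: class labels follow the sign of x1 * x2.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# 1) Kernel value equals the inner product of the explicit features.
for a in X:
    for b in X:
        assert np.isclose(K(a, b), phi(a) @ phi(b))

# 2) In IR^6 the XOR problem is linearly separable, e.g. by the x1*x2 coordinate.
w = np.zeros(6); w[3] = 1.0          # hyperplane w^T phi(x) = 0
print("y_i * w^T phi(x_i):", [float(yi * (w @ phi(xi))) for xi, yi in zip(X, y)])
# all values positive -> this hyperplane separates the mapped data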

8.3 SVMs for Function Approximation

Function approximation, or non-linear regression, can be modelled as (8.66), where f(.) is a deterministic function of its argument vector and ν is a random expectational error representing our “ignorance” about the dependence of D and X [59].

D = f(X) + ν (8.66)

The goal now is to reveal and approximate the function f(.) by an estimator F(.; w), given only some training set T = {(x_i, d_i)}_{i=1}^{N} of realizations (measurements) (x_i, d_i) of corresponding inputs and outputs. The parameters (or weights) are completely determined from the training data T. The assumptions about ν are that it is a random variable with zero mean and independent of any input x; for short, E[ν|x] = 0. In the framework of SVMs the following estimator F is used:

y = F(x; w) = w^T φ(x) = Σ_{j=0}^{m} w_j φ_j(x)   (8.67)

where m + 1 is the dimensionality of the feature space. To accommodate a bias represented by w_0 it is required that φ_0(x) ≡ 1. To determine the parameters w, a measure, say L, has to be defined that tells us how good the estimate y_i = F(x_i; w) is for the desired value d_i from the training data T. This can be done by minimizing the averaged loss, or empirical risk:

R_emp := (1/N) Σ_{i=1}^{N} L_ε(d_i, y_i), subject to   (8.68)

||w||² ≤ c_0, some constant.   (8.69)

This is very much the essence of statistical learning theory, where some empirical error on the training set is minimized, subject to a regularization condition that accounts for the fact that there is only a finite number of training data. A more detailed justification for using the empirical risk instead of the true risk was summarized in section 7.1.5. Here, the ε-insensitive loss function is used, as first suggested by Vapnik. It is defined by (8.70).

L_ε := { |d − y| − ε, for |d − y| ≥ ε;  0, otherwise }   (8.70)
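
Equation (8.70) translates directly into code; the following short sketch (illustrative only) just evaluates the loss elementwise.

import numpy as np

def eps_insensitive_loss(d, y, eps):
    """L_eps(d, y) = max(|d - y| - eps, 0), elementwise (equation (8.70))."""
    return np.maximum(np.abs(np.asarray(d) - np.asarray(y)) - eps, 0.0)

# Errors smaller than eps cost nothing; larger errors are penalized linearly.
print(eps_insensitive_loss([0.0, 0.0, 0.0], [0.05, 0.2, -0.5], eps=0.1))
# -> [0.   0.1  0.4]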

The reason for this choice is that the estimator should be robust to outliers given a specific model estimator. Vice versa, a robust estimator has to be insensitive to small changes in the model. As the given model is uncertain to an amount ν, if ε = ν is chosen there is no cost involved for errors smaller than ε; the loss therefore takes account of the ignorance built into the model. Nevertheless, it should be emphasized that statistical learning theory does not depend critically on the form of the loss function. With an optimal robust estimation procedure in mind, the maximum degradation should be minimized, which leads to some kind of minimax algorithm [60]. The constrained optimization problem given by equations (8.68), (8.69) and (8.70) can be reformulated as:

di −wT φ(xi) ≤ ε + ξi, i = 1, .., N. (8.71)

wT φ(xi)− di ≤ ε + ξ′i, i = 1, .., N. (8.72)

ξi ≥ 0, i = 1, .., N. (8.73)

ξ′i ≥ 0, i = 1, .., N. (8.74)


where the slack variables ξ_i and ξ′_i contain the error that cannot be attributed to the noise in the model. Absorbing the inequality (8.69) into the empirical risk, a single functional can be written as (8.75), where the constant C now absorbs the normalizing factor 1/N as well as the constant c_0.

Φ(w, ξ, ξ′) = C Σ_{i=1}^{N} (ξ_i + ξ′_i) + (1/2) w^T w   (8.75)

Another way to look at equation (8.75) is to view it as a weighted classification problem, namely training data points that match the desired value within an accuracy of ε in one class and the remaining points in the other, weighted according to their linear error. This becomes obvious by inspection of equation (8.18) from the classification problem, which is the same apart from the fact that here an additional set of slack variables ξ′ is given, due to the splitting of the equations into under- and overshoot errors to avoid the absolute value function, see equations (8.71), (8.72). Using Φ(w, ξ, ξ′) as the primal objective, its Lagrangian and dual objective form can be derived, respectively. Minimizing

L_P(w, ξ, ξ′, γ, γ′) = C Σ_{i=1}^{N} (ξ_i + ξ′_i) + (1/2) w^T w
  − Σ_{i=1}^{N} α_i [w^T φ(x_i) − d_i + ε + ξ_i]
  − Σ_{i=1}^{N} α′_i [d_i − w^T φ(x_i) + ε + ξ′_i]
  − Σ_{i=1}^{N} (γ_i ξ_i + γ′_i ξ′_i)   (8.76)

subject to ξ ≥ 0 and ξ′ ≥ 0, is equivalent to maximizing

L_D(α, α′) = Σ_{i=1}^{N} d_i (α_i − α′_i) − ε Σ_{i=1}^{N} (α_i + α′_i) − (1/2) Σ_{i=1}^{N} Σ_{j=1}^{N} (α_i − α′_i)(α_j − α′_j) K(x_i, x_j)   (8.77)

subject to

Σ_{i=1}^{N} (α_i − α′_i) = 0;

0 ≤ α ≤ C,

0 ≤ α′ ≤ C,


with the inner-product kernel K(x_i, x_j) = φ^T(x_i) φ(x_j), and where the necessary conditions are again that the gradient with respect to the parameters has to be zero:

∂L_P/∂w = w − Σ_{i=1}^{N} φ(x_i)(α_i − α′_i) != 0,  ⇒  w = Σ_{i=1}^{N} φ(x_i)(α_i − α′_i)   (8.78)

∂L_P/∂ξ = C − α − γ != 0,  ⇒  γ = C − α   (8.79)

∂L_P/∂ξ′ = C − α′ − γ′ != 0,  ⇒  γ′ = C − α′   (8.80)

Remarks:

• The parameters ε and C = [C_1, .., C_N]^T (often = C e) control the VC dimension of the approximating function F(x, w) = w^T φ(x) = Σ_{i=1}^{N} (α_i − α′_i) K(x, x_i).

• Due to the coupling of the loss and the regularization constant c_0 in equation (8.69), complexity control for regression involves a simultaneous tuning of the parameters ε and C. Therefore, regression is intrinsically more difficult than pattern classification.
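
To see the two knobs ε and C at work, the short sketch below fits an ε-SVR to a noisy one-dimensional toy function using the scikit-learn implementation (assumed available and used purely for illustration; it is not the software of this thesis) and reports how many training points end up as support vectors, i.e. carry a non-zero α_i − α′_i in the expansion above.

import numpy as np
from sklearn.svm import SVR   # assumed available; used purely for illustration

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(80, 1)), axis=0)
d = np.sinc(X).ravel() + 0.1 * rng.standard_normal(80)   # D = f(X) + nu

for eps in [0.05, 0.2]:
    model = SVR(kernel="rbf", C=10.0, epsilon=eps).fit(X, d)
    # model.dual_coef_ holds the nonzero (alpha_i - alpha'_i) of the expansion
    # F(x) = sum_i (alpha_i - alpha'_i) K(x, x_i) + b.
    print(f"epsilon={eps:4.2f}: {len(model.support_):2d} of {len(d)} points are support vectors")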


Figure 8.2: Lifting causes a rotation of the optimal hyperplane by the angle φ. Above, IR² is lifted by λ. Below, a projection construction in IR² is given by flipping the optimal hyperplane, with normal w̃, achieved from the lifted data onto IR². If the two data points x1 and x2 were lying on IR¹, no rotation would occur, i.e. φ = 0, and the projected part w_p of the lifted solution w̃ would point in the same direction as the original solution w. Furthermore, if the original data were distributed symmetrically with respect to the origin, lifting and optimizing in the lifted space would not cause a rotation and the angle θ would be 90°; hence the separating hyperplanes from original and lifted symmetrical data would be parallel. This is easy to see with two data points but is more difficult to understand when many data points are involved, as the meaning of symmetry becomes unclear. Nevertheless, it helps to develop a heuristic to work out the original hyperplane, as described in the text.


Figure 8.3: Lifting IR^n to IR^{n+1}. Projecting the lifted data onto the unit sphere S in IR^{n+1} yields the data for the algebraic perceptron. The spherical mapping always enforces a zero bias, meaning the separating hyperplane goes through the origin.

[Figure 8.4 shows the XOR data in IR² (axes x1, x2) and its image in IR⁶, where a hyperplane (of dimension 5) linearly separates the mapped data.]

Figure 8.4: The XOR problem is not separable in IR² but it is in IR⁶. It is clear that a second degree polynomial, say P(x; y) := (1 + x1y1 + x2y2)² = 0, in IR² can separate the two classes for an appropriate coefficient vector y. The expansion in monomials can be interpreted as a linear vector space with basis {1, x1, x2, x1x2, x1², x2²} and hence has dimension 6.


Chapter 9

Algebraic Perceptron

The algebraic perceptron got its name from the fact that it extends Rosenblatt’s perceptron, which looks for an affine separating hyperplane, to higher degree polynomial separations. When I started this work Lyle Noakes had extended the standard perceptron algorithm to higher degree polynomial separation, using an inner-product homomorphism between two linear vector spaces with a special inner product defined on them. He proved weak convergence and, with some minor modifications, strict convergence, which means that the algorithm for weak convergence would fail to converge if and only if some points were lying exactly in the separating hyperplane. However, the algorithm can be extended to circumvent this problem, and then it converges absolutely also in the case when points lie exactly on the separating hyperplane found by the weak convergence algorithm. Nevertheless, for practical purposes this is very unlikely to happen, and the extension is not needed. At this point I had the feeling that this was somewhat related to SVMs, but neither of us was familiar with SVMs nor with the work done by Friess [61] and independently by Vijakumar [57]. Friess extended an optimal perceptron algorithm invented by Anlauf and Biehl [62], which was used in a different context of Boltzmann machines. However, the approach introduced here combines many features also found in other papers but never combined altogether. As it is a non-optimal algorithm working in a high-dimensional vector space, no difficult optimization takes place to find a separating dichotomy, and this should result in a fast algorithm, though at the price of not having an optimal one. However, this may not even be crucial, especially when there are many data points involved defining two classes, as in a binary image. Optimizing the found solution towards an optimal one will be investigated in a later chapter. Another desire would be to decompose a data set into several simple structures, for example low degree polynomials, when the overall structure would be of much higher degree. This is considered in chapter 11.


9.1 Algorithm

Given a training input set X = {x_1, .., x_i, .., x_N} and a classification mapping d(.) = ±1 which determines the class for a given input (for short d_i := d(x_i)), the training set T = {(x_i, d(x_i)) | i = 1, .., N} can be conveniently defined as the set of input/output pairs. Let us denote the dimension of the input space E by n := dim(E), so dim(x) = n. Before the algorithm is introduced, it helps to mention a few tricks that will be applied:

• Applying a preprocessing step, which was called lifting the input dimension, and is utilized as follows:

x̃ = [x^T λ]^T / ||[x^T λ]^T||, with λ being a constant.   (9.1)

Here, the input space is lifted before the non-linear transform φ(.) is applied. This is not the same as lifting in the feature space as in section 8.1.4. However, it is somewhat related, as the goal here is to achieve a similar absorption of the affine bias and also to have a spherical mapping in the feature space where a linear hyperplane through the origin is sought, avoiding an affine bias as well.

• Confining the mapping φ(.) ∈ V onto the unit sphere S_V of the high-dimensional feature space V has the advantage that the separating element always has magnitude 1 and does not grow as in other algorithms, like the cone algorithm [63], which suffers from decreasingly smaller relative updates.

• Hemisphere mapping: If the spherical mapping φ(.) is multiplied by the class label d(.), it is guaranteed that all data points are mapped onto one hemisphere if and only if the data is linearly separable in V. Then the equator plane can be used as the separating hyperplane and the hemisphere pole as the separating element.

• Use of a homogeneous polynomial kernel, which is less a trick and more an illustrative example of a mapping φ: Define P(x) := Σ_{i=1}^{M} mon_i(x; p), with M = (n + 1)^p being the total number of monomials mon(.; p) of degree p, which have the form x_1 x_1 ··· x_1 (p_1 factors) x_2 x_2 ··· x_2 (p_2 factors) . . . x_{n+1} x_{n+1} ··· x_{n+1} (p_{n+1} factors), or any permutation thereof, e.g. x_2 x_{n+1} x_1 x_2 . . . x_2 (p factors in total), and with Σ_{i=1}^{n+1} p_i = p. Note that the value of a monomial mon_i(.; p) can always be computed as x_1^{p_1} x_2^{p_2} ··· x_{n+1}^{p_{n+1}}. Further, a homogeneous polynomial P^{(y)}(z) := (y^T z)^p can be defined that can be written as

P^{(y)}(z) = Σ_{i=1}^{M} mon_i(y; p) · mon_i(z; p), with M = (n + 1)^p.   (9.2)


and it is a polynomial of degree 2p in 2(n + 1) variables^1. Defining

φ(y) := Coeffs(P^{(y)}(z) | z free) := [mon_1(y; p), . . . , mon_M(y; p)]^T   (9.3)

φ^{(y)}(z) := P^{(y)}(z)   (9.4)

then it follows directly from (9.2) that φ^{(y)}(z) can be viewed as the scalar product <φ(y), φ(z)>_V. Note that when φ(y) is normalized such that ||φ(y)|| = 1, it follows from (9.2) as well that the scalar product (9.4) is also “normalized”, meaning it is in the range [−1, 1]. Note 2: If the arguments y of the monomials mon_i(y; p) are normalized such that ||y|| = 1, then φ(y) is automatically normalized as well. This is the reason for the lifted input vector to be normalized, so that it lies automatically on the unit sphere S_V in the feature space V. Note 3: If a commutative multiplication of monomials in equation (9.2) (e.g. the usual scalar multiplication) and, as before, a commutative multiplication in the individual monomials mon_i(.; p) is used, the dimension of the coefficient vector (which is the dimension of the feature space V) can be reduced by combining monomials that are just a permutation of each other. With lexicographic ordering of the monomials with respect to the powers p_1 p_2 ··· p_{n+1}, all components having just a permutation in the first monomial mon_i(.; p) in equation (9.3) can be combined because they all have the same value. There are exactly (p choose p_1, p_2, ··· , p_{n+1}) = p!/(p_1! p_2! ··· p_{n+1}!) such terms, and they can therefore be replaced by one component whose value is (p!/(p_1! p_2! ··· p_{n+1}!)) mon_i(.; p). Now there are only M′ = C(p + (n+1) − 1, p) < M = (n + 1)^p coefficients. For a proof see [52]. Nevertheless, for practical purposes this is still an immense number even for moderate n and p (a small numerical check of (9.2) and of M′ is given after this list).
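
Both the lifting/normalization step (9.1) and the kernel identity (9.2), together with the reduced coefficient count M′, can be checked numerically for small n and p. The sketch below is an illustration only; the vectors are random and the parameters n = 2, p = 3, λ = 1 are arbitrary choices.

import itertools, math
import numpy as np

n, p, lam = 2, 3, 1.0          # input dimension, polynomial degree, lifting constant
rng = np.random.default_rng(0)

def lift(x, lam):
    """Lifting and normalization as in (9.1)."""
    xt = np.append(x, lam)
    return xt / np.linalg.norm(xt)

y = lift(rng.standard_normal(n), lam)
z = lift(rng.standard_normal(n), lam)
m = n + 1                       # dimension of the lifted input vector

# Verify (9.2): the sum over all M = m^p monomial products equals (y^T z)^p.
s = sum(np.prod(y[list(idx)]) * np.prod(z[list(idx)])
        for idx in itertools.product(range(m), repeat=p))
print("sum of monomial products:", round(float(s), 10))
print("(y^T z)^p              :", round(float(np.dot(y, z) ** p), 10))

M = m ** p
M_reduced = math.comb(p + m - 1, p)     # M' after combining permuted monomials
print(f"M = {M}, M' = {M_reduced}")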

For the remainder it is assumed that the input data has been lifted, and to simplify the notation the lifting-tilde sign is omitted.

9.1.1 Iterative Algorithm

Given the training set T = {(x_i, d(x_i)) | i = 1, .., N}, the feature space points are y_i := φ(x_i) d(x_i), ∀ i = 1, .., N, which are ‘hemisphere’-mapped from the lifted input points x_i. Then the algebraic perceptron algorithm is defined as follows:

Set j := 0 and select a random z_0, e.g. as φ(x_{z_0}).
while ∃ ν = 1, .., N such that <z_j, y_ν>_V < 0, do
    y_i := arg min_{y_ν} <z_j, y_ν>_V
    z_{j+1} := z_j − 2 <z_j, y_i>_V y_i
    j := j + 1
od

^1 n + 1 corresponds to the dimension of the extended, or lifted, input vector.


The first line in the loop determines a most violating point y_i, i.e. a point which forms the largest angle with the current separating element z_j.
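
A direct transcription of the iterative algorithm into Python is sketched below. It uses an explicit (small) homogeneous monomial feature map instead of the kernel form, applies the lifting/normalization (9.1) and the hemisphere mapping y_i = φ(x_i) d(x_i), and runs on invented toy data whose labels follow the sign of x_1 x_2; all of this is illustrative only and not the implementation used in the thesis.

import itertools
import numpy as np

def phi(x_lifted, p=2):
    """Full homogeneous degree-p monomial map; ||phi|| = 1 when ||x_lifted|| = 1 (cf. (9.2))."""
    m = len(x_lifted)
    return np.array([np.prod(x_lifted[list(idx)])
                     for idx in itertools.product(range(m), repeat=p)])

def lift(x, lam=1.0):
    xt = np.append(x, lam)
    return xt / np.linalg.norm(xt)

def algebraic_perceptron(Y, max_iter=10000):
    """Iterative algorithm of section 9.1.1 on hemisphere-mapped points Y (rows y_i)."""
    z = Y[0].copy()                       # z_0, here simply the first mapped point
    for _ in range(max_iter):
        dots = Y @ z
        if np.all(dots >= 0):             # no violating point left -> separating element found
            return z
        i = int(np.argmin(dots))          # most violating point
        z = z - 2.0 * dots[i] * Y[i]      # reflection update z_{j+1}
    raise RuntimeError("no separating element found within max_iter")

# Toy data (invented for illustration): d_i = sign(x1 * x2).
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0],
              [2.0, 1.0], [-1.5, -0.5], [1.5, -2.0], [-0.5, 2.5]])
d = np.sign(X[:, 0] * X[:, 1])

Y = np.array([d_i * phi(lift(x)) for x, d_i in zip(X, d)])   # hemisphere mapping
z = algebraic_perceptron(Y)
print("min_i <z, y_i> =", float((Y @ z).min()))              # >= 0: all points on one hemisphere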

9.1.2 Recursive Algorithm

For convenience, let I(j) be the index of the most violating point at iteration j (if there are several of them, choose one or select all of them). A recursive formulation of the iterative algorithm is easily obtained:

z_j = z_0 − 2 Σ_{i=1}^{j} <z_{i−1}, y_{I(i)}> y_{I(i)}.   (9.5)

z_j != z_0 + Σ_{k∈SVs} α_k y_k   (9.6)

α_{k,i} := −2 <z_{i−1}, y_{I(i)}>, note: α_{k,i} > 0   (9.7)

α_k = Σ_{i=1}^{j} α_{k,i} δ_{k,I(i)}   (9.8)

Noticing that equation (9.5) is a linear combination of some specific training points y_{I(i)}, the algorithm can be reformulated according to equation (9.6), where SVs denotes the set of indices {I(i), i = 1, .., j} with no repetitions. The factors α_k are determined according to the next two equations, where δ_{k,l} is the Kronecker delta function. In analogy to the SVM framework, the set of most violating points selected during the algorithm is also called the set of support vectors. It was found that the number of support vectors roughly matches that achieved by SVMs [64]. Also, the set of support vectors found by the two algorithms is roughly the same.

9.1.3 Algebraic Perceptron Convergence Proof

The proof consists of two parts: first it is shown that every iteration brings the candidate separating element z_j closer to the optimal solution z∗, and second that if a separating solution exists, the algorithm will terminate in a finite number of steps. Assume that the problem data

Y = {y_i = φ(x_i) d(x_i) | i = 1, .., N}

is separable, i.e. it is contained in a half-space, or more restrictively, contained in the cone C(z∗, θ∗), where z∗ is the optimal solution providing the largest possible margin ρ∗. This can be stated as y_i ∈ C(z∗, θ∗), or equivalently, <z∗, y_i> ≥ cos(θ∗) ||z∗|| ||y_i|| = cos(θ∗) ≥ 0. Similarly, by definition of a violating point y_i, y_i is not contained in the cone C(z_j, θ = π/2), which is a half-space; that is, <z_j, y_i> < cos(θ) = 0. Equations (9.9) to (9.13) show that as long as z_j is not a separating solution, z_j will be moved in every iteration j → j + 1 to z_{j+1}, which is closer to the optimal solution z∗, assumed to exist.

z∗ − zj+1 = z∗ − (zj − 2 <zj, yi> yi) = z∗ − zj + 2 <zj, yi> yi   (9.9)

||z∗ − zj+1||² = <z∗ − zj + 2 <zj, yi> yi, z∗ − zj + 2 <zj, yi> yi>   (9.10)

= ||z∗ − zj||² + 4 <zj, yi> <z∗ − zj, yi> + 4 <zj, yi>² ||yi||²   (9.11)

= ||z∗ − zj||² + 4 <zj, yi> <z∗, yi>,  with <zj, yi> < 0 and <z∗, yi> ≥ 0   (9.12)

= ||z∗ − zj||² − γj,  γj := −4 <zj, yi> <z∗, yi> ≥ 0   (9.13)

Note that γj is zero if and only if the optimal separating covering cone has an opening angle θ∗ = π/2, which means that the margin ρ∗ is zero. If the problem is separable with a margin ρ∗ > 0, (9.13) means that in every iteration the current solution zj is moved closer to the optimal solution z∗ until a separating solution is achieved, i.e. no violating point yi exists. That the algorithm terminates in a finite number of steps becomes clear from the recursive formulation (9.5), which directly leads to (9.14).

2 ≥ <z∗, zj − z0> = −2 ∑_{i=1}^{j} <z_{i−1}, y_{I(i)}> <z∗, y_{I(i)}>   (9.14)

where each first inner product <z_{i−1}, y_{I(i)}> =: −α_{k,i} ∈ [−1, 0) and each second inner product <z∗, y_{I(i)}> ≥ ρmin(z∗; T′) > 0.

The first factor of each summand is always negative whereas the second factor is always positive. Together with the factor −2 the summation will always yield a positive number. Because the left-hand side is certainly bounded by 2, the summation cannot contain an infinite number of such positive terms, hence j must be finite, q.e.d. The algorithm can be extended such that it converges in the special case of θ∗ = π/2 as well. However, for practical applications, the overhead for this special case is not justified.
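The distance-decrease identity (9.13) that drives the first part of the proof is easy to check numerically for a single update step. The following is an illustrative sketch (random stand-ins for z∗, zj and a unit-norm violating point yi, as guaranteed by the normalized lifting):

```python
import numpy as np

rng = np.random.default_rng(1)
unit = lambda v: v / np.linalg.norm(v)

z_star = unit(rng.normal(size=8))        # assumed optimal separating element
z_j = unit(rng.normal(size=8))           # current iterate
y_i = unit(rng.normal(size=8))           # candidate point, ||y_i|| = 1
if z_j @ y_i > 0:                        # make y_i a violating point: <z_j, y_i> < 0
    y_i = -y_i
if z_star @ y_i < 0:                     # make it consistent with separability: <z*, y_i> >= 0
    z_star = -z_star

z_next = z_j - 2 * (z_j @ y_i) * y_i                     # update of section 9.1.1
gamma_j = -4 * (z_j @ y_i) * (z_star @ y_i)              # definition in (9.13)
lhs = np.linalg.norm(z_star - z_next) ** 2
rhs = np.linalg.norm(z_star - z_j) ** 2 - gamma_j
print(abs(lhs - rhs) < 1e-12, gamma_j >= 0)              # both print True
```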

9.1.4 Special Initialization

Sometimes, it may be better to use a special choice for the initial element z0, to reduce thenumber of iterations, or to get a ‘more optimal’ solution. This choice could be a sphericalaverage over the mapped training data Y = i = 1, .., N |yi = φ(xi)d(xi). However, oftenthis overhead is not justified, as running the algorithm with another initialization is faster.

9.1.5 Other Theoretical Considerations

9.1.5.1 Other Kernels

From the SVM framework, as well as from the convergence proofs of the algebraic perceptron, it is clear that the homogeneous polynomial kernel <φ(x), φ(y)>V = (<x, y>E)^p used in the algebraic perceptron algorithm can be replaced by any other kernel satisfying Mercer's theorem, or having the inner-product homomorphism between the input space E and the feature space V, respectively. In the latter case of the algebraic perceptron convergence, only finite-dimensional spaces V are covered; otherwise Mercer's theorem has to be used. A good overview of kernels, including the Fisher kernel and string-kernels, is given in the recent book of Herbrich [65]. A class of very general kernels can be constructed from any Mercer kernel, k : X × X → IR, using Reproducing Kernel Hilbert Spaces.

Reproducing Kernel Hilbert Spaces

Given a Mercer kernel, k : X × X → IR, let F be the linear function space on X with functions f(.) generated by the Mercer kernel k(x, .). Let f(.) = ∑_{i=1}^{r} αi k(xi, .) and g(.) = ∑_{j=1}^{s} βj k(xj, .) be two arbitrary functions in F. Then an inner product can be defined as (9.15) and (9.16), respectively.

<f, g> := ∑_{i=1}^{r} ∑_{j=1}^{s} αi βj k(xi, xj)   (9.15)

= ∑_{j=1}^{s} βj f(xj) = ∑_{i=1}^{r} αi g(xi)   (9.16)

It can be easily verified that this definition satisfies all the properties of an inner-product,namely:

• <f, g> = <g, f>, ∀ f, g ∈ F.

• <cf, g> = c <f, g>, for any scalar c.

• <f + g, h> = <f, h> + <g, h>, ∀ f, g, h ∈ F.

• <f, f> = 0 =⇒ f = 0.

The last property can easily be proven from the reproducing property (9.17)

<f, k(x, .)> = f(x)   (9.17)

Then 0 ≤ (f(x))² = (<f, k(x, .)>)² ≤ <f, f> <k(x, .), k(x, .)> = <f, f> k(x, x), and knowing that the Mercer kernel k is positive definite, <f, f> = 0 implies that f(x) = 0.
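A quick numerical illustration of this construction, assuming a Gaussian kernel as the Mercer kernel (an illustrative sketch, not part of the thesis software): it builds two functions f and g from kernel expansions, evaluates <f, g> via (9.15) and (9.16), and checks the reproducing property (9.17).

```python
import numpy as np

def k(x, y, sigma=1.0):
    """Gaussian Mercer kernel on scalars (any Mercer kernel would do)."""
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
xf, alpha = rng.normal(size=4), rng.normal(size=4)   # f(.) = sum_i alpha_i k(xf_i, .)
xg, beta = rng.normal(size=3), rng.normal(size=3)    # g(.) = sum_j beta_j  k(xg_j, .)

f = lambda t: sum(a * k(xi, t) for a, xi in zip(alpha, xf))
g = lambda t: sum(b * k(xj, t) for b, xj in zip(beta, xg))

# (9.15): <f, g> = sum_i sum_j alpha_i beta_j k(xf_i, xg_j)
ip_915 = sum(a * b * k(xi, xj) for a, xi in zip(alpha, xf) for b, xj in zip(beta, xg))
# (9.16): the same value, written as sum_j beta_j f(xg_j) or sum_i alpha_i g(xf_i)
ip_916a = sum(b * f(xj) for b, xj in zip(beta, xg))
ip_916b = sum(a * g(xi) for a, xi in zip(alpha, xf))
# (9.17): reproducing property <f, k(x0, .)> = f(x0) at an arbitrary point x0
x0 = 0.3
repro = sum(a * k(xi, x0) for a, xi in zip(alpha, xf))

print(np.allclose([ip_915, ip_915], [ip_916a, ip_916b]), np.isclose(repro, f(x0)))
```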

9.1.5.2 Large Data Sets

There is a vast amount of literature related to the handling of large data sets. Scholkopf and Smola have dedicated a whole chapter of their outstanding book [66] to this topic, where they outline and classify the algorithms into four generic classes: interior point methods, subset selection, sequential minimization (which is an extreme subset method having only two elements) and iterative methods. To make SVM methods work with large data sets there are many tricks of the trade. Caching techniques are quite important (see for example [67, 68]) and can speed up algorithms significantly. However, processor architectures and operating systems have to be considered when designing cache strategies.


For the work that was done for the IJCNN 2001 contest [69] on predicting a binary sequence with 50000 training points, a simple chunking strategy was applied to the algebraic perceptron and the adatron. The chunking split the training data set into smaller blocks (≈ 1000 points) which were overlapping each other. Training was done for two iterations2 over the whole set of data points. Due to the highly asymmetric training data (only 10% positively labelled points) the algebraic perceptron had problems training accurately. Depending on the problem, splitting up the training data into non-overlapping chunks can be useful, as seen in table 9.1 for the 'scissors' problem.

2 It turned out that the ad hoc implementation of the cache split up the memory so badly that not more than two iterations could be handled. Time constraints for submission and memory constraints on the machine used called for a fast engineering solution. Also, the high-degree (55th-order polynomial) kernel led to overtraining, which was indicated by the winning team's choice of a radial basis function kernel and well-adapted SVM penalty parameters.

9.2 Comparison Algebraic Perceptron and SVM

As an example to show the speed and the surprisingly high accuracy and good generalization of the algebraic perceptron, a pair of scissors was digitized, and training times and generalization were compared with the standard C-SVM algorithm. See figure 9.1. The original image consists of 4800 points and the scissors have one of their blades oriented such that it shows a very rugged staircase effect, which is not ideal for description by continuous functions. The medium to large data set of 4800 points, which causes the kernel matrix to be of dimension 4800 by 4800, cannot be handled anymore by a naive implementation of the SVM algorithm because it is not trivial to compute the inverse of such a large matrix. However, special working sets and caching algorithms can handle this, and one freely available algorithm, SVMlight, was made public by Joachims [67, 70]. SVMlight was also chosen because it is written in C and relatively fast. The results presented here are taken from [64].

9.2.1 Comparison Results and Experimental Observations

The results are very promising for the algebraic perceptron. It beats SVMlight on both data sets and is faster as well. However, training times for the algebraic perceptron can easily increase as the number of data points grows. But as with other methods, by splitting up the training set into chunks of free and fixed training vectors, or by using smaller, resampled training sets, training times can be kept under control. Remarkably, even when the algebraic perceptron was stopped before it achieved a separating solution on the training data, the generalization was better than that of SVMlight. The fact that the algebraic perceptron generalizes better than the optimal SVMlight is a bit puzzling. There may be several reasons for this. Firstly, the input space is low-dimensional and relatively finely sampled in a regular grid. By the continuity of the mapping into the feature space this structure is preserved, and therefore the algebraic perceptron might find a solution more in the 'middle', closer to the optimal hyperplane. A second reason might be that the mapping onto the sphere causes a space distortion, which might have similar effects as the explicitly introduced distortions by Amari [71].

The speed of the algebraic perceptron can be attributed to the fact that the perceptron algorithm is indeed a very simple algorithm, and a non-optimal one. All of the support vector machines try to find an optimal separating plane, which involves selecting an optimal one out of the infinitely many separating hyperplanes (in the case of separable data, of course), whereas the algebraic perceptron terminates when it has found just an arbitrary separating one. However, in the case of successful separation a subsequent optimization to find the optimal hyperplane is assumed to be much easier because no points have to "cross" the hyperplane. Therefore, the optimization should be guaranteed to take place at the global minimum.

In the case of a polynomial kernel, the power m also gives an upper bound on the VC-dimension of the problem at hand. As the training times reduce significantly when increasing m, a simple algorithm can be developed to estimate at least a rough upper bound.

Robustness has not been addressed so far apart from section 9.3. Even though the perceptron algorithm is prone to bad performance in the presence of noise, special measures can be taken, e.g. leave-one-out techniques. In section 9.3 it has been shown that the 'voting algebraic perceptron' is a promising way towards the handling of overlapping data, and is capable of detecting problematic data points based on the fraction of correctly separated data points and the law of large numbers during a run of the algebraic perceptron algorithm.


Figure 9.1: The results for one random training set used in Table 9.1, displayed graphically. The first two rows correspond to SVMlight with default arguments and c=2500, respectively. The last row corresponds to the algebraic perceptron. Columns from left to right are for 1000, 2000 and 4800 training points. Light and dark grey are misclassified back- and foreground pixels, respectively; correctly classified object pixels are black.

Table 9.1: Results for the 'scissors' test set with a polynomial kernel of degree 25, averaged over 10 randomly resampled training sets of sizes 1000 and 2000 points, and the full data set of 4800 points. In the last column the training set for the algebraic perceptron (AP) was split into 6 chunks of 800 training points; on each of the chunks the algorithm was run and the selected support vectors were merged, then a final run of the algebraic perceptron on all the selected support vectors achieved the results presented. 'Best values' are bold.

#Training points          |        1000 points         |        2000 points         |              4800 points
                          | SVMlight SVMlight    AP    | SVMlight SVMlight    AP    | SVMlight SVMlight    AP        AP
Parameters                |  default   c=2500          |  default   c=2500          |  default   c=2500        6x800 chunks
--------------------------+----------------------------+----------------------------+--------------------------------------
#SVs                      |   222.2     85.6    119.2  |   397.4    138.3    174.5  |    403      287      340      295
#Bounded SVs              |   178.8     35.9      -    |   351.6     82.1      -    |    356      232       -        -
Training time/s           |   0.875     60.5   11.498  |   2.442    391.4     22.5  |  72.523    561.2    563.3     46.5
Testing time/s            |   1.007    0.597    0.547  |   1.513    0.757    0.801  |   1.512    1.191    1.572    1.281
true positives (TP)       |  4291.2   4270.3   4272.3  |  4289.8   4296.8   4318.0  |   4283     4334     4354     4342
false negatives (FN)      |    62.8     83.7     81.7  |    64.2     57.2     36.0  |     71       20        0       12
false positives (FP)      |   310.7     79.4     65.7  |   288.6     61.5     35.2  |    292       45        0        3
true negatives (TN)       |   135.3    366.6    380.3  |   157.4    384.5    410.8  |    154      401      446      443
Accuracy = (TP+TN)/T      |  0.9222   0.9660   0.9693  |  0.9265   0.9753   0.9852  |  0.9244   0.9865   1.0000   0.9969
TPR = TP/(TP+FN)          |  0.9856   0.9808   0.9812  |  0.9853   0.9869   0.9917  |  0.9837   1.0000   1.0000   0.9972
TNR = TN/(TN+FP)          |  0.3034   0.8220   0.8527  |  0.3529   0.8621   0.9211  |  0.3453   0.8991   1.0000   0.9933
FPR = FP/(FP+TN)          |  0.6966   0.1780   0.1473  |  0.6471   0.1379   0.0789  |  0.6547   0.1009   0.0000   0.0090
FNR = FN/(FN+TP)          |  0.0144   0.0192   0.0188  |  0.0147   0.0131   0.0083  |  0.0163   0.0046   0.0000   0.0067
pos. Prec. = TP/(TP+FP)   |  0.9325   0.9818   0.9849  |  0.9370   0.9859   0.9919  |  0.9362   0.9897   1.0000   0.9993
neg. Prec. = TN/(TN+FN)   |  0.6882   0.8156   0.8250  |  0.7145   0.8716   0.9200  |  0.6844   0.9525   1.0000   0.9736

Remarks: total points: T = P + N; true positives: P = TP + FN; true negatives: N = TN + FP; true pos. rate: TPR (= recall = sensitivity); true neg. rate: TNR (= specificity); false pos. rate: FPR; false neg. rate: FNR. As in the case with the 'ellipse' training set, the algebraic perceptron outperforms SVMlight. Further, it should be noted that the TPR alone is not sufficient to describe the results; also the TNR (specificity) and the positive as well as the negative precision are needed. Though the training time of the algebraic perceptron increases relatively more than that of SVMlight, it is still a magnitude faster and exhibits clearly better generalization. Using all the training points and a converged algebraic perceptron means that the given data can be separated by the used model polynomial of degree 25. SVMlight needs roughly the same amount of training time but cannot find an existing separating hyperplane. One way to improve the algebraic perceptron is to split up the training data into chunks, extract the support vectors from those, and merge them together afterwards. This is similar to having inactive and active variables (working set) in other optimization algorithms, e.g. SVMlight.


9.3 Voting Algebraic Perceptron

To address the problem of robustness, an extension of the algebraic perceptron algorithm can be made by including a simple counter for each data point. This counter is incremented at each iteration j of the basic algorithm for points that are correctly classified by the current separating element zj; hence the name, as the correctly classified points get a vote. Normalizing the vote accumulators by the total number of iterations gives an estimate of the probability of correct classification.

9.3.1 Algorithm

Initialize accumulators ai := 0, ∀ i = 1, .., N.
Set j := 0 and select a random z0, e.g. as φ(x_{z0}).
while ∃ ν = 1, .., N such that <zj, yν>V < 0, do
    yi := arg min_{yν} <zj, yν>V
    aν := aν + 1 if <zj, yν>V > 0, else aν unchanged, ∀ ν = 1, .., N
    zj+1 := zj − 2 <zj, yi>V yi
    j := j + 1
od
fi := ai/j, ∀ i = 1, .., N.

Points yi with an associated frequency around 0.6 or lower indicate problematic points for convergence. This is illustrated by figures 9.3 to 9.12 for the following channel equalization problem.
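Only the vote accumulator has to be added to the basic listing of section 9.1.1. A minimal sketch under the same assumptions as before (polynomial kernel on lifted, unit-norm inputs; names are illustrative):

```python
import numpy as np

def voting_algebraic_perceptron(X, d, p=3, max_iter=100000, rng=None):
    """Voting algebraic perceptron of section 9.3.1.

    Returns (c, converged, freq), where z = sum_k c_k phi(x_k) and freq[i] estimates
    the probability of point i being classified correctly; freq around 0.6 or lower
    flags a critical point.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    Xl = np.hstack([X, np.ones((len(X), 1))])
    Xl /= np.linalg.norm(Xl, axis=1, keepdims=True)
    K = (Xl @ Xl.T) ** p
    N = len(X)
    c = np.zeros(N)
    c[rng.integers(N)] = 1.0                     # z_0 := phi(x_k) for a random k
    votes = np.zeros(N)
    j, converged = 0, False
    while j < max_iter:
        s = d * (K @ c)                          # s[nu] = <z_j, y_nu>
        i = int(np.argmin(s))
        if s[i] >= 0:                            # no violating point: separation achieved
            converged = True
            break
        votes += (s > 0)                         # correctly classified points get a vote
        c[i] -= 2.0 * s[i] * d[i]                # reflection update, as before
        j += 1
    return c, converged, votes / max(j, 1)
```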

9.3.2 Example

The example is taken from [72] and deals with channel equalization of a binary source stream going through a channel H(z) = 0.3482 z^{-1} + 0.8704 z^{-2} + 0.3482 z^{-3}. Noise-free channel outputs y(t) = [y(t) . . . y(t − m + 1)]^T are obtained by feeding a binary source sequence s(t) = [s(t) . . . s(t − m + 1 − nh)]^T, s(.) = ±1, through the channel H(z) with channel length nh + 1 and an equalizer of order m. See figure 9.2. Here, m = 2 and nh = 2 are used. The number of all possible binary source sequences is 2^{nh+m}, giving the 16 red centered points in figure 9.3. The other blue points are obtained by adding random Gaussian noise with zero mean and variance σ² around the 16 noise-free red points. The optimal Bayesian decision boundary (blue) was calculated using a regression network, while the maximum distance line from any points of opposite label is red. For more details about the channel equalization problem the reader is referred to [72]; here only the visual effect of the voting mechanism shall be demonstrated.
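A short sketch of how such a training set can be generated (the channel taps and dimensions are those stated above; the decision delay τ = 1 and the noise level σ are assumptions made here for illustration):

```python
import numpy as np
from itertools import product

h = np.array([0.3482, 0.8704, 0.3482])     # channel taps at delays 1, 2, 3
m, n_h = 2, 2
sigma = 0.2                                 # noise standard deviation (assumed)

# all 2^(n_h + m) = 16 binary source sequences s = (s(t-1), s(t-2), s(t-3), s(t-4))
states, labels = [], []
for s in product([-1.0, 1.0], repeat=m + n_h):
    s = np.array(s)
    y = np.array([h @ s[0:3],               # y(t)   = h1 s(t-1) + h2 s(t-2) + h3 s(t-3)
                  h @ s[1:4]])              # y(t-1) = h1 s(t-2) + h2 s(t-3) + h3 s(t-4)
    states.append(y)
    labels.append(s[0])                     # desired symbol s(t-1), i.e. delay tau = 1 (assumed)
states, labels = np.array(states), np.array(labels)

# 400 noisy training points scattered around the 16 noise-free states
rng = np.random.default_rng(0)
idx = rng.integers(len(states), size=400)
X = states[idx] + rng.normal(scale=sigma, size=(400, 2))
d = labels[idx]
```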


Figure 9.2: The problem of channel equalization. A binary source symbol s(t) is trans-mitted through a noise-free channel H(z), yielding y(t). Gaussian noise e(t) is added tosimulate distorted and noisy transmission y(t) which has to be equalized to find estimatess(t) as close as possible to the original symbol s(t− τ), with a possible delay τ .

9.3.2.1 Results

The proposed modification to record some statistics about correct classification clearly indicates critical points. However, these points also depend on the initialization point z0 and on the selected kernel. Nevertheless, the qualitative behavior always looks the same and the histogram can be used to detect critical points; using different initializations and different kernels can indicate how critical they are, independently of the kernel and initialization. This example shows that the voting algebraic perceptron can handle noisy data as well. Nevertheless, it has to be pointed out that simply removing data points would change the underlying distribution and therefore some form of penalization has to be applied. In the SVM framework, this is done by the selection of the parameter C. Here, lowering the kernel order could be one form of penalization. However, as the algebraic perceptron is a combinatorial, non-optimal algorithm rather than an optimization algorithm, this might be sufficient for a quick solution, which might be improved later by other means. Also, a hierarchical classification could be thought of, by removing critical points only temporarily to get a lower order separation on the noncritical points, which would then be combined with another separation applied only to the critical points. These two separations could be combined by a weighted superposition, and the weights could form the kind of penalization needed to achieve a form of regularization.


Figure 9.3: 400 random training points (blue) with zero mean Gaussian distribution and variance σ² around the 16 noise-free transmitted states [y(t) y(t−1)] (red). The optimal Bayesian decision boundary (blue) is achieved by placing Gaussian kernels centered at the blue training points with kernel radius σ; diamond and cross shaped points correspond to −1 and +1 labelled points.


Figure 9.4: Algebraic perceptron separation via polynomial kernel of degree m = 15. Convergence is achieved after 80083 iterations and the algebraic perceptron separation is given by the black line. Critical points, which are points classified correctly in less than 60% of all iterations, are magenta colored.


Figure 9.5: Histogram of correct classification during the algebraic perceptron algorithm. It is assumed that critical points have a frequency of around 0.5 times the number of iterations, or even less. Low frequency numbers can also be achieved if the algorithm converges in only a few hundred iterations, preventing more accurate statistics from being gathered.


Figure 9.6: Removing the critical points achieves a much better separation in much less time. The solution shown here is averaged over 10 runs of the algebraic perceptron with different initializations z0. The number of iterations drops significantly, by a factor of a couple of thousand. To achieve better generalization a lower degree kernel could now be applied to the training set with some critical points removed.


Figure 9.7: Algebraic perceptron solution via polynomial kernel of degree m = 10. No convergence is achieved and the algorithm is stopped after 100000 iterations. Nevertheless, the corresponding histogram given in figure 9.8 has a similar form as before. Critical points, which are points classified correctly in less than 60% of all iterations, are magenta colored.


Figure 9.8: Averaged histogram of correct classification during the algebraic perceptron algorithm for m = 10. Even though no convergence of the algebraic perceptron algorithm could be achieved, the histogram of correctly classified points looks similar to before and can be used to determine the critical points in figure 9.7.


Figure 9.9: The algebraic perceptron solution for m = 20 is similar to that for m = 15; however, convergence is achieved four times as fast. Critical points are similar to those with m = 15. Therefore, using a higher kernel order allows the histogram for critical points to be determined faster. After removal of the critical points, a lower order kernel should be used to avoid overfitting and bad generalization; see figure 9.11.


Figure 9.10: Histogram of correctlyclassified points for m = 20.


Figure 9.11: Algebraic perceptron solution for m = 20 with some critical points removed. Overfitting now occurs as the central data points, which were identified as critical points, are no longer present. Therefore, after having identified the critical points it would make sense to lower the kernel degree m.


Figure 9.12: Histogram of correctlyclassified points for m = 20 with somecritical points removed.


Chapter 10

Optimizing an Algebraic Perceptron Solution

Until this point the goal was only to achieve a separating solution, which might already be a good one. Surprisingly, results could be achieved that were better than with an 'optimal' SVM algorithm [64]. Nevertheless, with lots of fine-tuning the standard SVM algorithm should be superior, but often the optimization is difficult to achieve, especially for large data sets, or it is relatively slow.

Here, methods are investigated to decrease the generalization error once a separating algebraic perceptron solution has been achieved. This is the equivalent of 'turning' the achieved hyperplane towards the optimal one. Firstly, this is tried directly on the primal objective and later, in another attempt, on the dual objective. It turns out that for higher order (polynomial) kernels only the second approach is practically feasible, unless the problem is not too hard, e.g. when there is a lot of space for a separating hyperplane between the two classes, and of moderate kernel order.

10.1 Optimization based on the primal objective

Starting with a converged algebraic perceptron, a separating hyperplane H (10.1) is given by its normal zj, which is a linear combination of the support vectors as defined by (9.6).

H = {y ∈ V | <zj, y> = 0}   (10.1)

By geometry the minimal distance ρmin of any point to the hyperplane is given by (10.2, 10.3), where ρ is the margin between a '+' and a '−' point and it has to satisfy (10.4).

ρmin = min_{k∈SVs} ||yk|| cos(θk) = min_{k∈SVs} <yk, zj>V / ||zj||   (10.2)

= min_{k∈SVs} <yk, zj>V,  if ||zj|| = 1   (10.3)

ρ ≥ 2 ρmin   (10.4)

The factor 2 comes from the fact that the mapped '+' and '−' points, φ(x), lie on different sides of the separating hyperplane. Depending on the optimization criterion J, which shall be minimized/maximized, the update of the multipliers αk is done at an iteration t in the usual gradient descent manner according to (10.5).

αk(t + 1) := αk(t) ∓ η ∂+J/∂αk,   (−: minimize, +: maximize)   (10.5)

It is obvious that if the multipliers αk are modified, the norm ||zj|| changes as well.
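In the dual representation (9.6) the quantities in (10.2) and (10.3) only require the kernel matrix over the support vectors. A minimal sketch (illustrative names; the initial element z0 is assumed to be one of the support vectors, so it is covered by the expansion):

```python
import numpy as np

def margin_rho_min(K_sv, d_sv, alpha):
    """rho_min of (10.2) for z_j = sum_k alpha_k y_k with y_k = d_k phi(x_k).

    K_sv : (S, S) kernel matrix over the support vectors,
    d_sv : (S,) their labels, alpha : (S,) expansion coefficients.
    """
    Ky = K_sv * np.outer(d_sv, d_sv)        # <y_k, y_l>
    proj = Ky @ alpha                       # <y_k, z_j> for every support vector
    norm_z = np.sqrt(alpha @ Ky @ alpha)    # ||z_j||
    return proj.min() / norm_z
```

Since the margin between a '+' and a '−' point is at least 2 ρmin by (10.4), this value can be monitored while the multipliers αk are updated according to (10.5).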

10.1.1 Maximizing the Margin ρmin

The goal is to maximize

J := ρmin = min_{k∈SVs} <yk, zj>V / ||zj||   (10.6)

However, in the case of having multiple k's, say the set minSVs, with the same ρmin, one might consider maximizing

J := ∑_{k∈minSVs} ρk,   ρk = <yk, zj>V / ||zj||   (10.7)

Using the following abbreviations1 (10.8) to (10.12), the total derivatives of J with respect to the multipliers αk can be calculated as defined by (10.13) to (10.19).

K̄k = Kk / ||zj|| ≐ αk / (∑_{m∈SVs} αm²)^{1/2}   (10.8)

Kk = <yk, zj> = <yk, z0> + ∑_{l∈SVs} αl <yk, yl>   (10.9)

≐ αk   (10.10)

||zj||² = <zj, zj> = ∑_{m∈SVs} ∑_{n∈SVs} αm αn <ym, yn>   (10.11)

≐ ∑_{m∈SVs} αm²   (10.12)

1 Note that the approximations (denoted ≐) are only valid for <yk, yl> ≐ δ_{k,l}, which is the case e.g. for high-degree polynomial kernels.


∂+J/∂αk = ∑_{l∈minSVs} (∂+K̄l/∂αk)(∂+J/∂K̄l) = ∑_{l∈minSVs} ∂+K̄l/∂αk   (10.13)

∂+K̄l/∂αk = (∂+zj^T/∂αk)(∂+K̄l/∂zj) = yk^T ∂+K̄l/∂zj   (10.14)

= <yk, yl>/||zj|| − <yl, zj> <yk, zj>/||zj||³   (10.15)

≐ δ_{k,l}/||zj|| − αl αk/||zj||³   (10.16)

∂+K̄l/∂zj = (1/||zj||²) (∂+Kl/∂zj ||zj|| − Kl ∂+||zj||/∂zj)   (10.17)

= (1/||zj||²) (yl ||zj|| − <yl, zj> zj/||zj||)   (10.18)

∂||zj||/∂zj = ∂(<zj, zj>)^{1/2}/∂zj = zj/(<zj, zj>)^{1/2} = zj/||zj||   (10.19)

When these equations were used to update the multipliers αk, the learning rates η had to be very small so that another point would not suddenly become a point in minSVs. Also, due to the fact that in the very high-dimensional space all inner products look like Dirac kernels, differentiation gets problematic. Even though improvements up to a factor of a few thousand on the minimal margin could be achieved, generalization did not improve. Also, other algorithms were tried, like the cone algorithm [63], but without noticeably more success. All this was carried out on the scissors problem. See figure 9.1.
However, for problems that have more space between opposite points and use lower order, less Dirac-like kernels, this method can work. An example for channel equalization is given in [73, 74].
Therefore, another much more appealing path was considered, where the focus was set on solving the actual C-SVM problem, based on the dual objective.

10.2 Optimization based on the dual objective

The standard C-SVM optimization problem can easily be extended to account for non-separable points. The non-separable case can be handled by introducing slack variables ξ to relax the hard separability condition of (8.7, 8.8) to yi (w^T xi + b) ≥ dmin − ξi, with b = 0 as the hyperplane goes through the origin, and, as seen before, this does not introduce the equality constraint α^T y = 0.

Primal objective: Minimize (10.20) subject to the constraints (10.21) to (10.24).

JP(w; ξ) := ½ w^T w + ∑_{i=1}^{N} Ci ξi = ½ w^T w + ξ^T C   (10.20)


w = ∑_{i=1}^{N} αi yi φ(xi)   (10.21)

0 ≤ yi w^T xi − dmin + ξi, ∀ i = 1, .., N   (10.22)

0 ≤ ξ   (10.23)

0 ≤ α ≤ C = [C1 C2 . . . CN]^T, often Ci = C.   (10.24)

Or, as an unconstrained minimization of the Lagrangian with multipliers α and β, as stated by (10.25):

LP(w; ξ) := JP(w; ξ) − ∑_{i=1}^{N} αi (yi g(xi) − dmin + ξi) − ∑_{i=1}^{N} βi ξi   (10.25)

Dual objective: Maximize (10.26) subject to the constraints (10.27).

LD(α) := ∑_{i=1}^{N} αi − ½ α^T Q α = α^T e − ½ α^T Q α   (10.26)

0 ≤ α ≤ C   (10.27)

Note: 0 ≤ ξ, 0 ≤ β and α + β = C ⇒ ξi = 0 if αi < Ci.

This problem can be solved by the Adatron algorithm, whose name is a contraction of the words 'adaptive' and 'perceptron', and which was invented by Anlauf and Biehl as an optimal perceptron algorithm [62]. Friess and Vijakumar extended it independently to kernel based algorithms in high-dimensional feature spaces [57, 61]. The Adatron algorithm is basically a gradient descent algorithm for 'interior' variables. When an update would force a variable to leave the 'interior' or feasible region, the update is reduced such that the updated solution is still feasible. Looking at the unconstrained optimization problem, a necessary condition is that the gradient of LD with respect to its argument α be zero (10.28), which implies (10.29).

∂+LD/∂α = e − ½ (Q + Q^T) α = e − Q α = 0   (10.28)

⇒ α∗ = Q⁻¹ e   (10.29)

But as this solution might not be feasible, the update could be made only a fraction γ along the gradient direction, as defined by (10.30). Or, second order directions can be used, pointing from the current point α towards the optimal solution α∗, as defined by (10.31), where γl,q > 0 are learning rates.

δα_uc^l = γl ∂+LD/∂α = γl (e − Q α), for the linear case   (10.30)

δα_uc^q = γq (α∗ − α), for the quadratic case   (10.31)

If the Adatron is sequentially updated (only one pattern, say k, at every iteration, i.e. δαk ≠ 0 and all others are zero), this is the SVMseq algorithm [57]. There it is proven that γl < 2/max_i Qii for convergence. In the case of the following batch algorithm the factor 1/√N should be included to accommodate the changed arc-length of the gradient vector (however this is without proof, but surely can be accepted if δLD ≥ 0):

γl = N^{−1/2} · 2/max_i Qii = 2 N^{−1/2} on the unit sphere   (10.32)

γq = 1   (10.33)

δα_c^{l,q} = min{ max{ δα_uc^{l,q}, −α }, C − α }   (10.34)

γl,q = min{ δα_c^{l,q} / δα_uc^{l,q} }, where '/' is to be taken element-wise.   (10.35)

The change of the objective function by an update δα can be used as a convergencecriterion and is given by (10.36).

δLD = LD(α + δα) − LD(α) = δα^T (e − Q α) − ½ δα^T Q δα   (10.36)

Note: With the quadratic update the optimal solution α∗ can be achieved in one step if γ := γq = 1. However, due to the size of Q an inversion is impossible in practice, but algorithms based on active set strategies that partition the matrix Q into free and fixed parts can allow inversions of a sub-matrix of moderate size, even down to the extreme cases of 2 × 2 matrices or scalars [67, 75, 76].

10.2.1 Summary of Adatron Optimization

The Adatron algorithm to optimize a solution of the algebraic perceptron can be summarized as follows:

1.) Initialize α = αAP, and compute Q = [Q]ij = yi K(xi, xj) yj, where xi = [xi, λ]/||[xi, λ]||, ∀ i, j = 1, .., N.

2.) Calculate δα = γl,q δα_c^{l,q}, according to (10.30 or 10.31), (10.32 or 10.33), (10.34) and (10.35); set α = α + δα.

3.) If training has not converged then go to step 2.

4.) Terminate.
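A compact batch version of this procedure, assuming the kernel matrix fits in memory and using only the linear (gradient) update (10.30) with the clipping (10.34) — the extra element-wise scaling (10.35) and the quadratic update are omitted for brevity. Names are illustrative, not thesis code:

```python
import numpy as np

def kernel_adatron(K, d, C=10.0, alpha0=None, max_iter=5000, tol=1e-6):
    """Batch kernel Adatron for the dual problem (10.26)-(10.27).

    K : (N, N) kernel matrix on lifted, unit-norm inputs, d : (N,) labels in {-1, +1},
    C : box constraint, alpha0 : warm start (e.g. the alpha of a converged AP).
    """
    N = len(d)
    Q = K * np.outer(d, d)                                  # Q_ij = y_i K(x_i, x_j) y_j
    alpha = np.zeros(N) if alpha0 is None else np.clip(alpha0.astype(float), 0.0, C)
    e = np.ones(N)
    gamma_l = 2.0 / (np.sqrt(N) * Q.diagonal().max())       # learning rate (10.32)
    for _ in range(max_iter):
        grad = e - Q @ alpha                                 # gradient of L_D, cf. (10.28)
        delta_uc = gamma_l * grad                            # unclipped update (10.30)
        delta = np.minimum(np.maximum(delta_uc, -alpha), C - alpha)   # clipping (10.34)
        dL = delta @ grad - 0.5 * delta @ Q @ delta          # objective change (10.36)
        alpha += delta
        if dL < tol:                                         # objective barely improves: stop
            break
    return alpha
```

In practice α would be initialized from a converged algebraic perceptron solution (the coefficients αk of (9.6), clipped to [0, C]) rather than from zero, and for larger problems the quadratic update (10.31) with an active-set partitioning of Q would replace the plain gradient step.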


10.3 Conclusions about Optimizing the AP

The algebraic perceptron is a fast but non-optimal algorithm. Even though the idea of improving a non-optimal solution towards an optimal one by a steepest gradient method is appealing, because there are no points that have to cross the separating plane, this is only workable as long as the kernel is of lower order and data points are not critically close to the separating hyperplane, so that only very small learning rates can be used to avoid some other points crossing to the other side. Nevertheless, in some cases this can be applicable without severe problems. However, if the kernel is of high order, the kernel approaches a Dirac kernel and differentiation gets problematic. In this case, the Adatron algorithm, working on the Lagrange multipliers of the dual objective, is more useful, and the algebraic perceptron might be used as an initialization procedure to determine possible candidates for support vectors for other algorithms.

The idea of lifting is also applicable in SVM frameworks other than SVMseq, e.g. Mangasarian's 'Successive overrelaxation' [75]. The mapping onto a sphere has a nice property because then the equality constraint falls away when using the standard SVM algorithm. However, this can be interpreted as a function space distortion compared with the standard method. But whether this is really a disadvantage remains to be investigated. Amari has successfully applied an explicit distortion [71].


Chapter 11

Decomposition Algorithm Based on the Algebraic Perceptron

One of the oldest problems in pattern recognition is the classification of two classes, such asan object and its background. For higher level semantics it is useful to have descriptions ofobjects, for example as a union of primitives chosen from some natural class. In principle,methods like the Hough Transform [77, 78] can be used to decompose an image intoprimitives, but their usefulness is limited in practice, except for primitives described byfew parameters (preferably 2 or 3). All too often one is faced with an explosion in thenumber of points that need to be considered, due to the curse of dimensionality (of theparameter space).

Here the case is investigated when primitives are given by low degree polynomials,computed with the algebraic perceptron. As it has been seen before, the algebraic per-ceptron can handle higher degree polynomials in a direct implementation (calculating theinner product directly in the feature space and not by a kernel-homomorphism) only upto a very moderate degree which is around ≤ 7, depending on the available memory andinput dimension. A direct implementation has practical advantages, as the result can beinterpreted directly by the given solution, represented as the polynomial coefficient vector.However, the moderate degree of directly calculable polynomials does not allow for verycomplicated regions. Nevertheless, a combination in a specified way, like unions and inter-sections, of several low degree polynomials can again describe more complex situations.

11.1 Region Growing Algebraic Perceptron

11.1.1 Notation and Notes

Let T = {(xk, d(xk))} be a training set of input-output pairs, and define yk := ι(xk) d(xk), where ι(.) is a mapping from E = IR^{m+1} to V = IR^n, having the property ι(x)(x′) := <ι(x), ι(x′)>V = K(<x, x′>E). The original input vectors, xk, are mapped onto the unit sphere in E, such that xk := [xk, λ]/||[xk, λ]||, λ ≠ 0. The class label d(xk) maps to ±1, distinguishing background and foreground points. The angle brackets denote inner products in the vector spaces V and E respectively.

• ‘AP’ stands for algebraic perceptron but any other algorithm that achieves a di-chotomy on a training set TS can be used. z = AP (z0, TS,#MaxIterations)denotes the solution z of a basic algorithm that separates the training data set TS,given an initial solution estimate, z0, within a maximum number of iterations, say#MaxIterations. If a separation is achieved within #MaxIterations, it is saidthat the algebraic perceptron has converged.

• The region growing algorithm for the algebraic perceptron for (-1)-labeled points, forshort RGAP−, decomposes all the (-1)-labeled points as a union of regions describedby individual low-degree algebraic perceptrons.

• The algorithm RGAP+ for (+1)-points can be achieved by replacing (-1) with (+1)and ‘TM’ with ‘TP’ in the RGAP−.

• ‘Add’ in the algorithm, adds only if the element to be added isn’t contained in thetarget set, i.e. there are no duplicate elements in a set. However, this is only aperformance issue and the algorithm works with duplicated entries as well.

• Extensions to n-ary images can be done, applying the basic algorithm in a hierar-chical manner. However, they are not considered here.

11.1.2 RGAP− Algorithm

The underlying idea is to select a seeding point and let a region, limited (say) by a low-degree polynomial constraint, grow as far as possible. If growth is stopped, the data pointsin the region are removed from the training set and the process is repeated until no furtherpoints are to be classified.

11.1.2.1 Training

1.) Initialize the counter for the number of (-1)-APs to zero: N− := 0; initialize the set of (low-degree) APs: ZJ− = ∅. Split the set of mapped training points T′ := {ι(xk) d(xk)} into TP = {y ∈ T′ | d(xk) = +1} and TM = {y ∈ T′ | d(xk) = −1}.

2.) Set j = 0. Select an arbitrary (-1)-point from TM, say y0, as zj and remove it from TM; initialize the non-converged training point set as empty, i.e. NCTS = ∅.

3.) Add y0 and all points of TP to TS.

4.) If there is any point of TM misclassified by zj, i.e. <zj, yk> < 0 for yk in TM, train and set zj := AP(zj; TS; ∞). Note: this is going to converge for degree p ≥ 2 polynomials. Proof: There are two cases, j = 0 and j > 0. Case j = 0: There is only one (-1)-point and all others are (+1)-points. This can always be separated by a degree p ≥ 2 polynomial. Case j > 0: By construction in the following steps there are no (-1)-points misclassified. q.e.d.

4b.) An assumed speed-up, at least in the first iteration, but not necessary: add all (-1)-points of T′ which are classified correctly by zj to TS.

5.) Determine k = argmax_k <zj, yk>V with yk ∈ T′\TS (= TM = all (-1)-points not in the current training set TS) and d(xk) ≠ sign(<zj, ι(xk)>V) (or equivalently, <zj, yk>V < 0). Note: yk is a "minimal violating point". If there is no violating point go to 10.)

6.) Remove yk from TM and add it to TS.

7.) Train z = AP(zj; TS; MaxIter) by a (low-degree) algebraic perceptron. (Set SV = SV(z) = support vectors of z; implicitly given through z.)

8.) If the AP converges set j = j + 1, then zj := z. Otherwise, remove yk from TS andadd it to NCTS.

9.) If all (-1)-training points were used, i.e. TM = ∅, go to 10.); otherwise go to 4b.)

10.) If reduction of SV s is desired retrain zj = AP (ι(xrandom);TS;∞).

11.) Increment N− by 1. Add zj to ZJ−.

12.) If NCTS ≠ ∅, set T′ := NCTS ∪ TP and go to 2.).

13.) Output ZJ− and terminate.

If it is desired, the output set ZJ− can be ordered according to a size measure, for examplethe number of training points that are within the region.
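The growth loop can be condensed into code. The sketch below implements one region-growth pass (steps 2.) to 9.)) for a degree-2 polynomial in a direct implementation, using the feature map φ(x) = vec(x x^T), for which <φ(x), φ(x′)> = (<x, x′>)²; the outer restart over NCTS (steps 10.) to 13.)), the speed-up 4b.) and the support-vector reduction of step 10.) are omitted. All function and variable names are illustrative assumptions, not thesis code:

```python
import numpy as np

def ap(Y, z, max_iter):
    """Basic algebraic perceptron on explicit label-signed feature points (rows of Y)."""
    for _ in range(max_iter):
        s = Y @ z
        i = int(np.argmin(s))
        if s[i] >= 0:
            return z, True
        z = z - 2.0 * s[i] * Y[i]
    return z, False

def grow_one_region(X, d, seed, max_iter=2000):
    """Grow one (-1)-region of RGAP- around the (-1)-labelled seed index."""
    Xl = np.hstack([X, np.ones((len(X), 1))])
    Xl /= np.linalg.norm(Xl, axis=1, keepdims=True)
    F = np.stack([np.outer(x, x).ravel() for x in Xl])     # phi(x), degree-2, unit norm
    Y = F * d[:, None]                                     # y_k = d_k phi(x_k)

    TS = [seed] + list(np.where(d == +1)[0])               # step 3: seed plus all (+1)-points
    TM = set(np.where(d == -1)[0]) - {seed}                # remaining (-1)-points
    NCTS = set()
    z, _ = ap(Y[TS], F[seed].copy(), 100 * max_iter)       # step 4: separable for p >= 2

    while TM:
        violating = [k for k in TM if Y[k] @ z < 0]        # misclassified (-1)-points
        if not violating:
            break                                          # step 5: no violator left
        k = max(violating, key=lambda k: Y[k] @ z)         # minimal violating point
        TM.discard(k)
        TS.append(k)                                       # step 6
        z_new, ok = ap(Y[TS], z, max_iter)                 # step 7
        if ok:
            z = z_new                                      # step 8: the region grows
        else:
            TS.pop()
            NCTS.add(k)                                    # retry k in a later region
    return z, NCTS
```

Repeatedly calling grow_one_region on the points collected in NCTS (together with the (+1)-points) then yields the list ZJ− of step 13.).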

11.1.2.2 Adding Intersections

Even though arbitrary shapes could be formed by a union of low-degree shapes (suchas ellipses) at different positions, orientations and scaling, it would be nice to extendthe shapes that can be described more naturally and more closely to the constrainingshapes. Figure 11.4 shows a case in point. To handle the simplest case, which has onlyone foreground object, intersections can be used to cut off parts. In topological terms thiscase has a genus of one, as the object consists only of one piece without holes.

Assuming foreground pixels are (-1)-labelled and then running RGAP+ with the back-ground as interior, it is possible to form an intersection between the union set found byRGAP− and the complement of the union set found by RGAP+. The result is the desireddescription of the object.


11.1.2.3 Intersection RGAP: IRGAP

1.) Run ZJ− := RGAP− and ZJ+ := RGAP+.

2.) Set ZJ := ZJ− ∪ ZJ+ and store from which set the elements came: ZJ [i].d = +1means that the i-th element in the set came from ZJ+. ZJ [i].zj shall denote i-thelement zj found by RGAP .

3.) Sort ZJ by the ‘size of its elements’, which could be the number of (-1)- and (+1)-training points for elements of ZJ− and ZJ+, respectively.

11.1.2.4 Prediction

The following algorithm handles both RGAP and IRGAP prediction.

set i := 0;
set NrOfElems := # elements in ZJ;
p := sign(<ZJ[i].zj, ι(x)>V) · ZJ[i].d;
while (p ≠ ZJ[i].d)
    i := i + 1;
    if (i ≥ NrOfElems)
        d(x) := 'guess'; terminate;
    p := sign(<ZJ[i].zj, ι(x)>V) · ZJ[i].d;
d(x) := p;
terminate;

Note: 'guess' can be determined if a priority convention is made, e.g. (-1)-sets overrule (+1)-sets. Then the complement of the last (-1)-set in the ordered list ZJ will determine d(x) := +1.

11.2 Example of three overlapping ellipses

The original data is given by three overlapping ellipses on a discrete 60 by 80 raster image, as seen in figure 11.1. The training data is obtained by resampling 1000 equally distributed points, as seen in the last image of figure 11.2. In the same figure the individual decomposed ellipses are shown. The order of retrieving individual ellipses can be permuted, of course, as the initial seeds are randomly chosen. In the illustrated case of figure 11.2, four ellipses are found, whereas the last ellipse (row 2, column 2) was due to only one remaining training point, which was not captured by the first ellipse (row 1, column 2).


Figure 11.1: Individual ellipses which compose the original image (row 2, column 2). The last image shows the resampled picture.

11.3 Discussion and Improvements

The previously introduced algorithm can be improved in several ways. A few shall beoutlined and be seen as the docking points to extend it and build a framework for a groupof decomposition algorithms based on a restricted region growing.

11.3.1 Speed up techniques

Instead of testing whether all points belong to a certain region, step 8.) of RGAP can be changed as follows. Because the point yk to be added is a 'minimal violating' point at iteration j, it becomes more and more unlikely that following points will be 'non-violating', i.e. interior points. Therefore, an early stopping condition can be applied. If it fails consecutively a certain number of times, remove yk and all other (-1)-points from the sets TM and TS and add them to the set NCTS, so that TM = ∅.

11.3.2 Improving robustness

To improve robustness, other, more SVM-like algorithms that can regularize noisy data by including slack variables in the optimization problem can be applied instead of the algebraic perceptron.

11.3.3 Generalization to binary objects with genus greater one

This can be done by labelling points according to the separating polynomials and maskingout already recognized objects. The problem can be seen in figure 11.4, where the little(-1)-ellipse denies the growth of the second (+1)-ellipse.

11.3.4 Structural complexity control

Structural complexity control can be applied on the list ZJ of decomposed objects. Asimple scheme would be to minimize the length of the list by allowing different basicshapes, e.g. different (low) degree polynomials, or intersections and not only unions.

11.4 Extension to SVM algorithms

The idea of identifying minimum violating points might also be applicable in the SVMframework, using the values of the regularization constant C and the Lagrange multipliersαk and βk, as well as the slack variables ζk from (8.21), which implies αk + βk = C.Together with the Kuhn-Tucker conditions (8.32) it follows that ζk = 0 if αk < C. Thismeans violating points have an αk = C and the degree of violation is reflected by themagnitude of ζk. So a minimum violating point has an αk = C and a minimal ζk > 0,where the minimum is taken over all data points k = 1, .., N .

11.5 Conclusion about the RGAP-Framework

A framework for a decomposition algorithm based on the algebraic perceptron or other SVM-like algorithms has been introduced. The feasibility and potential of this approach was demonstrated by a simple example. The advantage of this framework is that the basic algorithm is fast and achieves a decomposition of an original object into more basic features, and therefore often yields a structural description. For example, a hand can be decomposed into five elongated ellipses for the fingers and the palm into a bigger, less eccentric one. Nevertheless, for more complicated objects the framework has to be extended, or it can become quite inefficient, for example when an object has more than two components as was outlined in figure 11.4. If such an object should only be approximated by a union of ellipses that classify the training data correctly, the list of these ellipses would become rather large and would not generalize well, because individual ellipses would only be described by a few (even only one) interior points. But this is not a problem, rather a feature that allows complexity control on a higher level, namely a structural one.


Figure 11.2: Decomposition achieved by the RGAP algorithm; it finds four instead of the original three ellipses. The fourth ellipse is shown against the two closest original ellipses (rows 2 and 3, columns 2 and 1, respectively); this is due to the undersampling. Light and dark gray pixels are wrong fore- and background, respectively.


Figure 11.3: With different initialization seeds an almost perfect retrieval of the originalellipses could be achieved. Light and dark gray pixels are wrong fore- and background,respectively.


Figure 11.4: a.) Shark-fin-like shape, modelled from ellipses and intersections. One problem is obvious: the less bent an edge is, the larger the corresponding ellipse. b.) Another foreground object, i.e. a (-1)-object, which would intersect the background ellipses, would prevent the algorithm from correctly identifying the background ellipses and possibly stop their growth.


Chapter 12

Conclusions and Outlook

Two topics that are of core importance to intelligent systems were considered in this thesis: adaptive critic designs and the algebraic perceptron, a fast neural network classifier.

Adaptive critic designs are used to approximate a long-term cost and find a policy to minimize it. This long-term planning is crucial for an intelligent system. However, the speed and resources needed to acquire knowledge are also of great importance. The goal of adaptive critic designs is to overcome the 'curse of dimensionality' suffered by exact dynamic programming (DP), which often makes DP useless in practice, especially for continuous state spaces. Speed was addressed by looking at ways to improve convergence of adaptive critics by concentrating more on the actor or controller, rather than on critic convergence.

Further research steps with adaptive critics based on this thesis would be to extend the quadratic critic with an arbitrary function approximator like a neural network to build a GDHP network and apply the training equations developed in section 4.4.2. Also, the actor would have to be changed to a universal function approximator. To escape local minima during the training process, the interesting idea of coupled local minimizers, recently presented by Suykens [79, 80], might be applied in this context. While the Euler equations proved to be difficult to utilize in the adaptive critic designs, specialized approaches to solve two-point boundary value problems might be worthwhile to investigate.

Many questions regarding stability of adaptive critic designs are still unanswered and difficult to solve. However, in practice there might be a shift from theoretically provable stability, often limited by very restrictive assumptions such that the theoretical model becomes meaningless as soon as the environment changes just a little, towards adaptive 'stability experience'. Higher level control might detect critical areas and switch to another control scheme to assert robust control. A simple model in this direction has been achieved with a robust control for a hybrid dynamical system. Although theory proves the existence of a switching sequence stabilizing the linear plant with uncertainty, approaches to solve this in practice can introduce instability. Therefore, there is a need to investigate stability of approximation methods as introduced in this thesis. Here, action-values, as long-term cost estimators given some action, were used to determine a switching sequence. However, from a stability viewpoint, it would make more sense to concentrate directly on the switching boundaries, which have a much crisper boundary than the different action-values. One direction of work could be to see this as a classification problem and use methods introduced in part II of this thesis.

In part II of this thesis a class of accurate learning algorithms based on statistical learning theory was investigated. These so-called support vector machines use a high-dimensional feature space to do linear separations that possess 'best generalization capabilities' given some limited amount of data. In the last couple of years an enormous number of publications has appeared, going far beyond this thesis. To avoid the optimization, which can be very difficult for large data sets, a combinatorial algorithm, the algebraic perceptron, was introduced. The algebraic perceptron is related to support vector machines (SVM) by using an inner-product kernel similar to SVMs. It has been shown that it can be considerably faster than an SVM; its theoretical generalization capability is not as good, but in practice it surprisingly surpassed an SVM because of numerical difficulties in the SVM optimization process. Within the framework of SVMs and the algebraic perceptron, future research directions seem almost unlimited. One of them has been addressed in chapter 11, where a decomposition algorithm was introduced to segment binary shapes into simpler ones. SVMs could be used instead of the algebraic perceptron, and slack variables and Lagrange multipliers could be used as indicators in the region growing process.

Also, improvements to and investigations of the basic algebraic perceptron could be made. One area that was only slightly covered in this thesis is the problem of inseparable, or noisy, data. While leave-one-out methods could be used to determine outliers, voting methods could be used to build up an estimated probability measure for the separating boundary, based on the law of large numbers. This seems a promising approach, as the simple idea of the voting algebraic perceptron has shown in section 9.3. Theoretical questions of how many iterations are needed for a separable problem, based on the margin, might be addressed as well, or when to stop if the problem is not separable. Also, combinations of the algebraic perceptron as a preprocessor and SVM algorithms applied to large data sets might be of interest for fast processing.


List of Publications

T. Hanselmann, A. Zaknich, and Y. Attikiouzel. Learning functions and their derivativesusing taylor series and neural networks. INNS-IEEE International Joint Conference onNeural Networks IJCNN, 1999.

T. Hanselmann, A. Zaknich, and Y. Attikiouzel. Connection between BPTT and RTRL.3rd IMACS/IEEE International Multiconference on Circuits, Systems, 4-8 July 1999Athens, July 1999. Reprinted in Computational Intelligence and Applications, Ed. Mas-torakis, Nikos, E., World Scientific, ISBN 960-8052-05-X, pp. 97–102, 1999.

T. Hanselmann and L. Noakes. Comparison between support vector algorithm and alge-braic perceptron. INNS-IEEE International Joint Conference on Neural Networks IJCNN,2001.

T. Hanselmann and L. Noakes. Optimizing an algebraic perceptron solution. INNS-IEEEInternational Joint Conference on Neural Networks IJCNN, 2001.

T. Hanselmann and L. Noakes. A decomposition algorithm based on the algebraic perceptron. In The Seventh Australian and New Zealand Intelligent Information Systems Conference (ANZIIS), Perth, Australia, Nov 19–21, 2001.

J. Young, T. Hanselmann, A. Zaknich, and Y. Attikiouzel. Algebraic perceptron in digi-tal channel equalization. INNS-IEEE International Joint Conference on Neural NetworksIJCNN, 2001.

J. Young, T. Hanselmann, A. Zaknich, and Y. Attikiouzel. Fine tuning the algebraicperceptron equaliser to increase the separation margin. Inter-University PostgraduateElectrical Engineering Symposium, Murdoch University Rockingham Campus, WesternAustralia, 2nd October 2002, 2002.


Appendix A

Notation

A.1 Derivatives

Given a scalar $s$ and vectors $a = [a_1, \dots, a_n]^T$ and $b = [b_1, \dots, b_m]^T$, where $s$ is a function of $a$ and $b$, the (partial) derivatives are defined as

\[
\frac{\partial s}{\partial a} :=
\begin{bmatrix}
\frac{\partial s}{\partial a_1} \\ \frac{\partial s}{\partial a_2} \\ \vdots \\ \frac{\partial s}{\partial a_n}
\end{bmatrix}
\tag{A.1}
\]

\[
\frac{\partial^2 s}{\partial a\,\partial b} :=
\frac{\partial \left[\frac{\partial s}{\partial b}\right]^T}{\partial a} =
\begin{bmatrix}
\frac{\partial^2 s}{\partial a_1 \partial b_1} & \frac{\partial^2 s}{\partial a_1 \partial b_2} & \dots & \frac{\partial^2 s}{\partial a_1 \partial b_m} \\
\frac{\partial^2 s}{\partial a_2 \partial b_1} & \frac{\partial^2 s}{\partial a_2 \partial b_2} & \dots & \frac{\partial^2 s}{\partial a_2 \partial b_m} \\
\vdots & & & \vdots \\
\frac{\partial^2 s}{\partial a_n \partial b_1} & \frac{\partial^2 s}{\partial a_n \partial b_2} & \dots & \frac{\partial^2 s}{\partial a_n \partial b_m}
\end{bmatrix}
\tag{A.2}
\]

\[
\frac{\partial a^T}{\partial b} \equiv \left[\frac{\partial a}{\partial b}\right]^T
\tag{A.3}
\]
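As a minimal numerical sketch of these layout conventions (assuming NumPy and a hypothetical test function $s(a,b) = a^T M b$, which is not part of the thesis), the column-vector form of (A.1) and the $n \times m$ matrix form of (A.2) can be checked by central differences:

import numpy as np

# Hypothetical example: s(a, b) = a^T M b, so ds/da = M b and d^2 s/(da db) = M.
n, m = 3, 2
rng = np.random.default_rng(0)
M = rng.standard_normal((n, m))
s = lambda a, b: float(a @ M @ b)

a0, b0 = rng.standard_normal(n), rng.standard_normal(m)
eps = 1e-4

# ds/da laid out as a column vector of length n, as in (A.1).
ds_da = np.array([(s(a0 + eps*e, b0) - s(a0 - eps*e, b0)) / (2*eps)
                  for e in np.eye(n)])

# d^2 s/(da db) laid out as an n x m matrix with entries d^2 s/(da_i db_j), as in (A.2).
d2s_dadb = np.array([[(s(a0 + eps*ei, b0 + eps*ej) - s(a0 + eps*ei, b0 - eps*ej)
                       - s(a0 - eps*ei, b0 + eps*ej) + s(a0 - eps*ei, b0 - eps*ej))
                      / (4*eps**2)
                      for ej in np.eye(m)] for ei in np.eye(n)])

assert np.allclose(ds_da, M @ b0, atol=1e-6)      # (A.1): column vector M b
assert np.allclose(d2s_dadb, M, atol=1e-4)        # (A.2): n x m matrix M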

A.2 Chain Rule

\[
\frac{d s(a(b),b)}{d b}
= \frac{\partial^+ s(a(b),b)}{\partial b}
= \frac{\partial a^T}{\partial b} \frac{\partial s}{\partial a} + \frac{\partial s}{\partial b}
\tag{A.4}
\]

\[
\delta s(a,b)
= \left[\frac{\partial s}{\partial a}\right]^T \delta a + \left[\frac{\partial s}{\partial b}\right]^T \delta b
= \delta a^T \frac{\partial s}{\partial a} + \delta b^T \frac{\partial s}{\partial b}
\tag{A.5}
\]

\[
\delta a^T(b) = \delta b^T \frac{\partial a^T}{\partial b}
= \left[\left[\frac{\partial a^T}{\partial b}\right]^T \delta b\right]^T
= \left[\frac{\partial a}{\partial b}\, \delta b\right]^T
\tag{A.6}
\]
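As a minimal sketch of the chain rule (A.4) (assuming NumPy and hypothetical smooth test functions a(b) and s(a, b), not taken from the thesis), a forward-difference total derivative ds/db can be compared against (∂aᵀ/∂b)(∂s/∂a) + ∂s/∂b:

import numpy as np

# Hypothetical smooth test functions: a(b) and s(a, b).
a_of_b = lambda b: np.array([np.sin(b[0]) + b[1], b[0] * b[1]])
s_of = lambda a, b: float(a[0]**2 + a[1] * b[0] + np.cos(b[1]))

def grad(f, x, eps=1e-6):
    """Column vector of partial derivatives of a scalar function f at x."""
    return np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps) for e in np.eye(len(x))])

b0 = np.array([0.3, -0.7])
a0 = a_of_b(b0)

# Total derivative ds/db: perturb b and re-evaluate a(b) as well.
ds_db_total = grad(lambda b: s_of(a_of_b(b), b), b0)

# Chain rule (A.4): ds/db = (da^T/db)(ds/da) + ds/db with a held fixed.
daT_db = np.array([grad(lambda b: a_of_b(b)[i], b0) for i in range(2)]).T  # rows indexed by b, as in (A.3)
ds_da = grad(lambda a: s_of(a, b0), a0)
ds_db_partial = grad(lambda b: s_of(a0, b), b0)

assert np.allclose(ds_db_total, daT_db @ ds_da + ds_db_partial, atol=1e-5)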


Appendix B

Calculation of some useful derivatives

The total derivative operator with respect to a variable $x$ is denoted as $\frac{\partial^+ \cdot}{\partial x}$, while the partial derivative is denoted as $\frac{\partial \cdot}{\partial x}$. There is an important relationship between those derivatives [25]:

\[
\frac{\partial^+ T}{\partial x}
= \frac{\partial T}{\partial x} + \sum_{y \in Y} \frac{\partial^+ T}{\partial y}\,\frac{\partial y}{\partial x}
= \frac{\partial T}{\partial x} + \sum_{y \in Y} \frac{\partial T}{\partial y}\,\frac{\partial^+ y}{\partial x}
\tag{B.1}
\]

where Y is the set of all intermediate variables used to calculate the target T that depend directly or indirectly on the variable x. For the calculation of ordinary partial derivatives it is necessary to state what the direct variables of a function are. In the notation below, a function depending on t shall just mean that the function is time-variant and to be evaluated at the given time. For the calculation of partial derivatives, the function is taken with the arguments given on the rightmost side, e.g. f(t) = f(x(t),u(t)) is a function f which depends on the two variables x(t) and u(t), for short: u and x. Even though u = g(x) could be eliminated, making f = f(x) dependent on x only, this would also involve a change in the notation, as f(x,u) and f(x) = f(x, g(x)) are different functions.

B.1 System equations

\[
\dot{x}(t) = \frac{dx}{dt} = f(x(t),u(t))
\tag{B.2}
\]
\[
u(t) = g(x(t))
\tag{B.3}
\]
\[
x(t+dt) = x(t) + dx(t) = x(t) + \dot{x}\, dt = x(t) + f(x(t),u(t))\, dt
\tag{B.4}
\]
\[
u(t+dt) = u(t) + du(t) = u(t) + \dot{u}(t)\, dt = u(t) + g(x(t+dt)) - g(x(t))
\tag{B.5}
\]
\[
\dot{x}(t+dt) = \dot{x}(t) + d\dot{x}(t) = \dot{x}(t) + \ddot{x}(t)\, dt = f(x(t+dt),u(t+dt))
\tag{B.6}
\]
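The relations (B.2)-(B.5) amount to one explicit Euler step of the closed loop. As a minimal sketch (assuming NumPy and a hypothetical linear plant f(x,u) = Ax + Bu with feedback g(x) = -Kx, neither of which is taken from the thesis), the discretisation reads:

import numpy as np

def euler_step(x, f, g, dt):
    """One step of (B.2)-(B.5): u = g(x), x_next = x + f(x, u) dt, u_next = g(x_next)."""
    u = g(x)                          # (B.3)
    x_next = x + f(x, u) * dt         # (B.4)
    u_next = u + (g(x_next) - g(x))   # (B.5), identical to g(x_next) here
    return x_next, u_next

# Hypothetical linear example.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.5, 0.8]])
f = lambda x, u: A @ x + B @ u
g = lambda x: -K @ x

x = np.array([1.0, 0.0])
for _ in range(100):
    x, u = euler_step(x, f, g, dt=0.01)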


B.2 Useful one-step derivatives

B.2.1 Total derivatives

\[
\frac{\partial^+ x^T(t)}{\partial x(t)} = \frac{\partial x^T(t)}{\partial x(t)} = 1
\tag{B.7}
\]
\[
\frac{\partial^+ x^T(t+dt)}{\partial x(t)}
= \left[1 + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]_t dt\right]
\tag{B.8}
\]
\[
\frac{\partial^+ x^T(t+dt)}{\partial \dot{x}(t)}
= \frac{\partial x^T(x(t),\dot{x}(t),t,dt)}{\partial \dot{x}(t)} = 1\, dt
\tag{B.9}
\]
\[
\frac{\partial^+ x^T(t+dt)}{\partial u(t)}
= \frac{\partial f^T(x(t),u(t))}{\partial u(t)}\, dt
= \left[\frac{\partial f^T}{\partial u}\right]_t dt
\tag{B.10}
\]
\[
\frac{\partial^+ u^T(t+dt)}{\partial x(t)}
= \frac{\partial^+ x^T(t+dt)}{\partial x(t)}\, \frac{\partial^+ u^T(t+dt)}{\partial x(t+dt)}
= \left[1 + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]_t dt\right]\left[\frac{\partial g^T}{\partial x}\right]_{t+dt}
\tag{B.11}
\]
\[
\frac{\partial^+ u^T(t+dt)}{\partial \dot{x}(t)}
= \frac{\partial^+ x^T(t+dt)}{\partial \dot{x}(t)}\, \frac{\partial^+ u^T(t+dt)}{\partial x(t+dt)}
= \left[\frac{\partial g^T}{\partial x}\right]_{t+dt} dt
\tag{B.12}
\]
\[
\frac{\partial^+ u^T(t+dt)}{\partial u(t)}
= 1 + \frac{\partial^+ \left[g^T(x(t+dt)) - g^T(x(t))\right]}{\partial u(t)}
= \frac{\partial^+ x^T(t+dt)}{\partial u(t)}\, \frac{\partial g(x(t+dt))}{\partial x(t+dt)}
= \left[\frac{\partial f^T}{\partial u}\right]_t \left[\frac{\partial g^T}{\partial x}\right]_{t+dt} dt
\tag{B.13}
\]
\begin{align}
\frac{\partial^+ \dot{x}^T(t+dt)}{\partial x(t)}
&= \frac{\partial^+ x^T(t+dt)}{\partial x(t)}\, \frac{\partial f^T(x(t+dt),u(t+dt))}{\partial x(t+dt)}
+ \frac{\partial^+ u^T(t+dt)}{\partial x(t)}\, \frac{\partial f^T(x(t+dt),u(t+dt))}{\partial u(t+dt)} \nonumber\\
&= \frac{\partial^+ x^T(t+dt)}{\partial x(t)}
\left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]_{t+dt} \nonumber\\
&= \left[1 + \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]_t dt\right]
\left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]_{t+dt}
\tag{B.14}
\end{align}
\begin{align}
\frac{\partial^+ \dot{x}^T(t+dt)}{\partial \dot{x}(t)}
&= \frac{\partial^+ x^T(t+dt)}{\partial \dot{x}(t)}\, \frac{\partial f^T(x(t+dt),u(t+dt))}{\partial x(t+dt)}
+ \frac{\partial^+ u^T(t+dt)}{\partial \dot{x}(t)}\, \frac{\partial f^T(x(t+dt),u(t+dt))}{\partial u(t+dt)} \nonumber\\
&= \frac{\partial^+ x^T(t+dt)}{\partial \dot{x}(t)}
\left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]_{t+dt}
= \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]_{t+dt} dt
\tag{B.15}
\end{align}
\begin{align}
\frac{\partial^+ \dot{x}^T(t+dt)}{\partial u(t)}
&= \frac{\partial^+ x^T(t+dt)}{\partial u(t)}\, \frac{\partial f^T(x(t+dt),u(t+dt))}{\partial x(t+dt)}
+ \frac{\partial^+ u^T(t+dt)}{\partial u(t)}\, \frac{\partial f^T(x(t+dt),u(t+dt))}{\partial u(t+dt)} \nonumber\\
&= \frac{\partial^+ x^T(t+dt)}{\partial u(t)}
\left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]_{t+dt}
= \left[\frac{\partial f^T}{\partial u}\right]_t \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]_{t+dt} dt
\tag{B.16}
\end{align}
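A one-step total derivative such as (B.8) can be checked numerically by pushing a perturbation of x(t) through one Euler step of the closed loop. The following is a minimal sketch under the same hypothetical linear f and g as in the toy example above (an assumption, not the thesis plant), comparing the forward-perturbation Jacobian with 1 + [∂fᵀ/∂x + ∂gᵀ/∂x ∂fᵀ/∂u] dt:

import numpy as np

# Hypothetical linear closed loop: f(x,u) = A x + B u, g(x) = -K x.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.5, 0.8]])
f = lambda x, u: A @ x + B @ u
g = lambda x: -K @ x

dt, eps = 0.01, 1e-6
x0 = np.array([1.0, 0.0])
step = lambda x: x + f(x, g(x)) * dt          # x(t+dt) from (B.4) with u = g(x)

# Forward perturbation: row i holds d x^T(t+dt) / d x_i(t), matching the layout of (B.8).
J_num = np.array([(step(x0 + eps*e) - step(x0 - eps*e)) / (2*eps) for e in np.eye(2)])

# Analytic (B.8): 1 + [df^T/dx + dg^T/dx df^T/du] dt, here A^T + (-K)^T B^T = (A - B K)^T.
J_ana = np.eye(2) + (A.T + (-K).T @ B.T) * dt

assert np.allclose(J_num, J_ana, atol=1e-8)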


B.2.2 Total derivatives involving φ

\[
\phi(t) = \phi(x(t), \dot{x}(t)) = \phi(x(t), u(t))
\tag{B.17}
\]

\[
\frac{\partial^+ \phi(t)}{\partial x(t)}
= \frac{\partial \phi(x(t),\dot{x}(t))}{\partial x(t)}
+ \frac{\partial^+ \dot{x}^T(t)}{\partial x(t)}\, \frac{\partial \phi(x(t),\dot{x}(t))}{\partial \dot{x}(t)}
= \left[\frac{\partial \phi}{\partial x}\right]_t
+ \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]_t \cdot \left[\frac{\partial \phi}{\partial \dot{x}}\right]_t
\tag{B.18}
\]

\[
\frac{\partial^+ \phi(t)}{\partial u(t)}
= \frac{\partial^+ x^T}{\partial u} \frac{\partial \phi}{\partial x}
+ \frac{\partial^+ \dot{x}^T}{\partial u} \frac{\partial \phi}{\partial \dot{x}}
= \frac{\partial^+ x^T}{\partial u} \frac{\partial \phi}{\partial x}
+ \left[\frac{\partial^+ x^T}{\partial u} \frac{\partial f^T}{\partial x}
+ \frac{\partial^+ u^T}{\partial u} \frac{\partial f^T}{\partial u}\right] \frac{\partial \phi}{\partial \dot{x}}
= \frac{\partial f^T}{\partial u} \frac{\partial \phi}{\partial \dot{x}}
\tag{B.19}
\]

B.2.3 Partial derivatives

\[
\frac{\partial x^T(t+dt)}{\partial x(t)}
:= \frac{\partial x^T(x(t),u(t),t,dt)}{\partial x(t)}
= \left[1 + \left[\frac{\partial f^T}{\partial x}\right] dt\right]_t
\tag{B.20}
\]

\[
\frac{\partial x^T(t+dt)}{\partial u(t)}
:= \frac{\partial x^T(x(t),u(t),t,dt)}{\partial u(t)}
= \frac{\partial f^T(x(t),u(t))}{\partial u(t)}\, dt
= \left[\frac{\partial f^T}{\partial u}\right]_t dt
\tag{B.21}
\]

B.3 Calculation of dJ(t)/du(t)

Goal: Calculation of the total derivative dJ(t)/du(t) of the cost-to-go function with respect to the control u(t). The goal is to train the approximation J such that J becomes independent of u(t), which means that dJ(t)/du(t) is required to be zero; respectively, the partial derivative of J with respect to u(t) must be zero. However, the calculation of partial derivatives needs to state exactly on which variables a function depends in order to achieve meaningful results. Note: Numerically, it is easy to use forward perturbation as an approximation for the partial derivatives. Disturbing only one input at a time, the resulting output change divided by the scalar input change yields the partial derivative of the function with respect to the disturbed input.
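A minimal sketch of this forward-perturbation idea (assuming NumPy and a hypothetical quadratic cost used purely for illustration) is the following one-sided finite-difference helper:

import numpy as np

def forward_perturbation(fun, x, eps=1e-6):
    """Approximate the partial derivatives of a scalar function `fun` at `x`:
    disturb one input at a time and divide the output change by the input change."""
    fx = fun(x)
    grad = np.zeros_like(x, dtype=float)
    for i in range(len(x)):
        x_pert = x.copy()
        x_pert[i] += eps
        grad[i] = (fun(x_pert) - fx) / eps
    return grad

# Hypothetical usage: gradient of a quadratic cost at some operating point.
Q = np.diag([1.0, 2.0, 0.5])
cost = lambda x: float(x @ Q @ x)
x0 = np.array([0.3, -1.0, 2.0])
print(forward_perturbation(cost, x0))   # close to 2 Q x0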

Problems:

• To calculate a total derivative of a function with respect to its variable, the function must be expressed solely by that variable. In the case of the variable being a vector variable, the function must be expressed by all its components. In the case of the cost-to-go function J = J(x(t)) it is a function of the state x(t) only, or of the state x(t) and the time t, for an infinite or a finite horizon problem, respectively. However, the approximation J = J(x(t), ẋ(t); w) depends also on ẋ = f(x(t),u(t)), or equivalently on the control u(t).


• As the input variables of J, i.e. x(t) and ẋ(t), depend on the time, different time-scales must be taken into account when calculating partial derivatives. The outer time-scale runs on a time unit ∆t per 'Bellman-Iteration', whereas the integration approximation uses N time-steps δtn, not necessarily of equal duration, within ∆t.

Thus,
\[
\frac{d\, Util(u(t_0); x(t_0), \Delta t)}{du(t_0)}
\;\overset{N>1}{\neq}\;
\frac{\partial\, Util_\Delta(x(t_0),u(t_0); \Delta t)}{\partial u(t_0)},
\qquad \Delta t = t_N - t_0.
\]
The left-hand side can be seen as a total derivative with respect to u(t_0), where x(t_0) is seen as an initial condition such that u(t_0) = g(x(t_0)). Under this assumption the total and the partial derivatives with respect to u(t_0) are the same and can be calculated easily with forward perturbation. Therefore, δUtil(u(t_0); x(t_0), ∆t) = δu^T(t_0) dUtil(u(t_0); x(t_0), ∆t)/du(t_0), where δu(t_0) is an arbitrary change and not dependent on x(t_0). Further, if the time units are chosen equal to ∆t, the sensitivity of u(t_0) with respect to t_0, times ∆t, is simply δu(t_0). On the other, right-hand side, there is only a partial derivative with respect to u(t_0). Therefore, to get the same change δUtil(u(t_0); x(t_0), ∆t) = δUtil_∆(x(t_0),u(t_0),∆t) the chain rule has to be used:

\[
\delta Util_\Delta(x(t_0),u(t_0),\Delta t)
= \delta x^T(t_0)\,\frac{\partial Util_\Delta(x(t_0),u(t_0),\Delta t)}{\partial x(t_0)}
+ \delta u^T(t_0)\,\frac{\partial Util_\Delta(x(t_0),u(t_0),\Delta t)}{\partial u(t_0)}
\tag{B.22}
\]
\[
= \delta x^T(t_0)\,\frac{\partial Util_\Delta(x(t_0),u(t_0),\Delta t)}{\partial x(t_0)}
+ \delta u^T(\delta x(t_0))\,\frac{\partial Util_\Delta(x(t_0),u(t_0),\Delta t)}{\partial u(t_0)}
\tag{B.23}
\]

with δu(t_0) = δu(δx(t = t_0; ∆t)) being a function of x and implicitly of the time t. However, on the right-hand side there is a different time-scale. By definition it is \(\Delta t = \sum_{n=0}^{N-1} \delta t_n\). Thus, t_n = t(n) and ∂t(n)/∂n = δt_n. Further, the perturbations of u must be the same on both the integration and the 'Bellman-Iteration' time-scales. Therefore, the following must hold (note: the left time-scale with indexed time t_n is used during integration; the right time-scale with time t is used for the 'Bellman-Iteration'):

\[
\delta u^T(t_0) = \delta t_0\,\frac{\partial u^T(t_0)}{\partial t_0}
\overset{!}{=} \Delta u^T(\Delta x(t_0;\Delta t))
= \Delta t\,\left.\frac{du^T(t)}{dt}\right|_{t_0}
= \Delta t \left[\dot{x}^T \frac{\partial g^T}{\partial x}\right]_{t_0},
\tag{B.24}
\]
hence
\[
\frac{\partial u^T(t_0)}{\partial t_0} = \frac{\Delta t}{\delta t_0}\,\left.\frac{\partial u^T(t)}{\partial t}\right|_{t_0}
\tag{B.25}
\]

Using $x_n := x(t_n)$, $\dot{x}_n := \dot{x}(t_n)$ and $u_n := u(t_n)$, the partial derivatives are calculated as follows:

\[
Util(t=t_0;\Delta t) = Util(x(t=t_0),u(t=t_0);\Delta t)
= \int_{t_0}^{t_0+\Delta t} \phi(x(t),\dot{x}(t))\, dt
\tag{B.26}
\]
\[
Util_\Delta(t_0;\Delta t) = Util_\Delta(x(t_0),u(t_0);\Delta t)
= \sum_{n=0}^{N-1} \phi(x_n,\dot{x}_n)\, \delta t_n
\tag{B.27}
\]
\[
\frac{\partial Util_\Delta(x(t_0),u(t_0);\Delta t)}{\partial u(t_0)}
= \frac{\partial^+ \left(\sum_{n=0}^{N-1} \phi(x_n,\dot{x}_n)\, \delta t_n\right)}{\partial u_0}
= \sum_{n=0}^{N-1} \frac{\partial^+ \left(\phi(x_n,\dot{x}_n)\, \delta t_n\right)}{\partial u_0}
\tag{B.28}
\]
\[
= \frac{\partial^+ \phi(x_0,\dot{x}_0)\, \delta t_0}{\partial u_0}
+ \sum_{n=1}^{N-1} \frac{\partial^+ \left(\phi(x_n,\dot{x}_n)\, \delta t_n\right)}{\partial u_0}
\tag{B.29}
\]
\[
= \frac{\partial^+ \phi(x_0,\dot{x}_0)\, \delta t_0}{\partial u_0}
+ \frac{\partial x_1^T}{\partial u_0} \cdot
\sum_{n=1}^{N-1} \frac{\partial^+ x_n^T}{\partial x_1}\,
\frac{\partial^+ \phi(x_n,\dot{x}_n)\, \delta t_n}{\partial x_n}
\tag{B.30}
\]
\begin{align}
&= \delta t_0 \left[\frac{\partial f^T}{\partial u} \frac{\partial \phi}{\partial \dot{x}}\right]_{t_0}
+ \delta t_0 \frac{\partial f^T}{\partial u} \cdot
\sum_{n=1}^{N-1} \delta t_n \prod_{j=1}^{n-1} \frac{\partial^+ x_{j+1}^T}{\partial x_j}
\cdot \left[\frac{\partial \phi}{\partial x}
+ \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\cdot\frac{\partial \phi}{\partial \dot{x}}\right]_{t_n} \nonumber\\
&= \delta t_0 \left[\frac{\partial f^T}{\partial u}\right]_{t_0}
\left[\frac{\partial \phi}{\partial \dot{x}}
+ \sum_{n=1}^{N-1} \delta t_n \prod_{j=1}^{n-1}
\left[1 + \delta t_j \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]_{t_j}\right]
\cdot \left[\frac{\partial \phi}{\partial x}
+ \left[\frac{\partial f^T}{\partial x} + \frac{\partial g^T}{\partial x}\frac{\partial f^T}{\partial u}\right]\frac{\partial \phi}{\partial \dot{x}}\right]_{t_n}\right]
\tag{B.31}
\end{align}

The goal is to calculate the sensitivity dUtil(t;∆t)/du(t)|_{t_0} of the utility Util(t;∆t) = Util(x(t),u(t);∆t) with respect to the control input u(t) at time t = t_0. As u = u(t) is a function of t, the partial derivative ∂Util(x(t),u(t);∆t)/∂u(t) does not yield the desired result, because u = u(t) is a function of the time scale t and thus the sensitivity ∂Util(x(t),u(t);∆t)/∂u(t) depends upon the sensitivity ∂u(t)/∂t. Therefore, the following trick is used: to calculate dUtil(x(t),u(t);∆t)/du(t), the 'underlying' sensitivity ∂Util(x(t),u(t);∆t)/∂t is calculated first and then multiplied with ∂t/∂u(t) = 1/(∂u(t)/∂t).

Therefore, on the outer, ‘Bellman-Iteration’ time-scale, it is (see (B.25)):

\[
\left.\frac{d\, Util(t;\Delta t)}{du(t)}\right|_{t=t_0}
= \left.\frac{\partial\, Util(x(t),u(t);\Delta t)}{\partial u(t)}\right|_{t=t_0}
= \left.\frac{\partial t}{\partial u(t)}\right|_{t_0}
\left.\frac{\partial u^T(t_0)}{\partial t_0}\right|_{t_0}
\frac{\partial\, Util_\Delta(x(t_0),u(t_0);\Delta t)}{\partial u(t_0)}
= \frac{\Delta t}{\delta t_0}\,
\frac{\partial\, Util_\Delta(x(t_0),u(t_0);\Delta t)}{\partial u(t_0)}
\tag{B.32}
\]
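A minimal sketch of the discretised utility (B.27) and of the forward-perturbation route to its sensitivity is given below. It assumes the same hypothetical linear closed loop and an illustrative utility density φ (none of these choices come from the thesis); the ∆t/δt0 rescaling of (B.32) is applied at the end.

import numpy as np

# Toy linear closed loop (hypothetical) and an illustrative utility density phi(x, xdot).
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
K = np.array([[0.5, 0.8]])
f = lambda x, u: A @ x + B @ u
g = lambda x: -K @ x
phi = lambda x, xdot: float(x @ x + 0.1 * (xdot @ xdot))

def util_delta(x0, u0, dts):
    """Util_Delta of (B.27): sum of phi(x_n, xdot_n) dt_n along an Euler rollout,
    with u free only at the first step and u_n = g(x_n) afterwards."""
    x, u, total = x0.copy(), u0.copy(), 0.0
    for dt in dts:
        xdot = f(x, u)
        total += phi(x, xdot) * dt
        x = x + xdot * dt
        u = g(x)
    return total

x0 = np.array([1.0, 0.0])
u0 = g(x0)                            # scalar control in this toy example
dts = np.full(10, 0.01)               # N = 10 integration steps, Delta_t = 0.1
Dt, dt0, eps = dts.sum(), dts[0], 1e-6

# Partial derivative w.r.t. u(t0) by forward perturbation, then the Delta_t/dt0 scaling of (B.32).
dUtil_du0 = (util_delta(x0, u0 + eps, dts) - util_delta(x0, u0, dts)) / eps
dJ_sensitivity = (Dt / dt0) * dUtil_du0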


B.4 Derivation of dc/dw

Here, the derivation of equation (5.77) is performed. Given

\[
c(x;w) = \dot{x}^T \frac{\partial \phi(x,\dot{x})}{\partial \dot{x}}
- \phi(x, f(x,g(x;w)))
=: D(x,w) - P(x,w)
\tag{B.33}
\]

the goal is to calculate:

\[
\frac{dc}{dw} = \frac{dx^T}{dw} \frac{\partial c}{\partial x} + \frac{\partial c}{\partial w}
\tag{B.34}
\]

with

\[
\frac{\partial c}{\partial x} = \frac{\partial D}{\partial x} - \frac{\partial P}{\partial x}
\tag{B.35}
\]
\[
\frac{\partial D}{\partial x}
= \frac{df^T}{dx} \frac{\partial \phi}{\partial \dot{x}}
+ \frac{d}{dx}\left(\frac{\partial \phi}{\partial \dot{x}}\right)^T f
= \frac{df^T}{dx} \left[\frac{\partial \phi}{\partial \dot{x}} + \frac{\partial^2 \phi}{\partial \dot{x}^2}\, f\right]
+ \frac{\partial^2 \phi}{\partial x\,\partial \dot{x}}\, f
\tag{B.36}
\]
\[
\frac{\partial P}{\partial x}
= \frac{\partial \phi}{\partial x} + \frac{df^T}{dx} \frac{\partial \phi}{\partial \dot{x}}
\tag{B.37}
\]
\[
\frac{\partial D}{\partial w}
= \frac{df^T}{dw} \frac{\partial \phi}{\partial \dot{x}}
+ \frac{d}{dw}\left(\frac{\partial \phi}{\partial \dot{x}}\right)^T f
\tag{B.38}
\]
\[
\frac{df^T}{dw}
= \frac{\partial x^T}{\partial w} \frac{\partial f^T}{\partial x}
+ \left(\frac{\partial x^T}{\partial w} \frac{\partial g^T}{\partial x} + \frac{\partial g^T}{\partial w}\right) \frac{\partial f^T}{\partial u},
\qquad \text{note: } \frac{\partial x^T}{\partial w} = 0
\tag{B.39}
\]
\[
= \frac{\partial g^T}{\partial w} \frac{\partial f^T}{\partial u}
\tag{B.40}
\]
\[
\frac{d}{dw}\left(\frac{\partial \phi}{\partial \dot{x}}\right)^T
= \frac{\partial x^T}{\partial w} \frac{\partial^2 \phi}{\partial x\,\partial \dot{x}}
+ \frac{df^T}{dw} \frac{\partial^2 \phi}{\partial \dot{x}^2}
\tag{B.41}
\]
\[
\frac{\partial P}{\partial w}
= \frac{\partial x^T}{\partial w} \frac{\partial \phi}{\partial x}
+ \left(\frac{\partial x^T}{\partial w} \frac{df^T}{dx} + \frac{\partial g^T}{\partial w} \frac{\partial f^T}{\partial u}\right) \frac{\partial \phi}{\partial \dot{x}}
= \frac{\partial g^T}{\partial w} \frac{\partial f^T}{\partial u} \frac{\partial \phi}{\partial \dot{x}}
\tag{B.42}
\]

and combining these partial results yields equation (5.77).


Appendix C

Terms and Definitions

C.1 Probability Model of a Random Experiment

Let's start with some definitions first and then look at a probabilistic system which describes a random experiment.

Definition 5. Let S be a collection of subsets of the set S¹. S is then a σ-algebra (or a σ-field) if

1. ∅ ∈ S and S ∈ S;

2. if A ∈ S, then the complement Ā = S \ A also belongs to S;

3. if (A_i) is a sequence of sets in S, then ⋃_{i=1}^∞ A_i ∈ S.

Definition 6. A measurable space is a pair (S, S) consisting of a set S and a σ-algebra of subsets S. Sets in the σ-algebra are called measurable sets. This is the qualitative aspect of a measurable space. To actually measure it quantitatively a measure µ is defined on S, which is a mapping µ : S → IR. µ is a (finite) measure on S if

1. µ(∅) = 0;

2. µ(A) ≥ 0, for all A ∈ S;

3. µ is countably additive, that is, if (A_i) is a disjoint sequence of sets in S (so A_m ∩ A_n = ∅ for m ≠ n) then µ(⋃_{n=1}^∞ A_n) = ∑_{n=1}^∞ µ(A_n).

Definition 7. A Borel algebra B is the σ-algebra on IR generated by the set of all open intervals (a, b). It is also the σ-algebra generated by the set of all closed intervals [a, b]. Sets in B are called Borel sets. This construction can be generalized to IR^n.

Definition 8. The Lebesgue measure, λ, is the unique measure on (IR, B) which assigns a "length" to all the Borel sets and satisfies λ([a, b]) = b − a for all closed intervals [a, b].

¹Sometimes Ω is instead used to emphasize that the set at hand is the set of basic events; see also the definition of a probabilistic system.


Definition 9. A probability measure or a probability distribution² on a measurable space (S, S) is a "normed" measure µ that satisfies µ(S) = 1. The triple (S, S, µ) is a so-called probability space.

Definition 10. A random variable is a measurable function (or mapping) from a sample space Ω to IR^n, ω ↦ ξ(ω), which is a random (vector) variable. For this random variable to be measurable, the relation {ω : ξ(ω) < z} ∈ S, where S is a σ-algebra on Ω, needs to be valid for any z ∈ IR^n. If there exists such a σ-algebra S then there exists the probability distribution function F_ξ(z) = P{ω : ξ(ω) < z} of the random (vector) variable ξ.

The concept of random variables was introduced before the concept of measure theory was developed. At the time, probability distributions could be defined on IR^n, but not on other sets. Problems involving sample spaces other than IR^n were mapped into IR^n using random variables, so that probability distributions could be used.

Definition 11. A probability density function on IR^n is a function p : IR^n → IR which satisfies p(x) ≥ 0 for all x ∈ IR^n, and ∫_{IR^n} p(x) dx = 1.

While it is possible to define measures on the Borel algebra on IR^n, it is easier to talk about probability distributions on IR^n in terms of probability density functions. So µ([a_1, b_1] × · · · × [a_n, b_n]) = ∫_{a_n}^{b_n} · · · ∫_{a_1}^{b_1} p(x) dx_1 . . . dx_n.

Probabilistic System: According to Kolmogorov's axiomatization, to every random experiment there is a set Ω of elementary events, where every ω_i describes one possible outcome of the experiment. Further, define a collection A of subsets of Ω as a set of events A ∈ A. Let A contain the empty set ∅, the event that never occurs, and the whole set Ω as the event that surely occurs. On the set A the set operations union, complement and intersection are defined. So A is a σ-algebra on Ω.

The qualitative aspect of a random experiment can be summarized by the pair (Ω, A), whereas to measure the quantitative aspect a (countably additive) probability measure P(A) has to be defined on A.

Given a probability space (Ω, F, P) to model the random experiment at hand, and letting w_1, . . . , w_N be the N outcomes of i.i.d. trials based on this model, some more definitions about convergence are given:

Definition 12. Convergence in probability of a sequence of random variables w_1, . . . , w_l, . . . to a random variable w_0 takes place when P{|w_l − w_0| > δ} → 0 as l → ∞, for any δ > 0. Convergence in probability (for an approximating measure E_l(·)) takes place when for every ε > 0, P{sup_{A∈F} |P(A) − E_l(A)| > ε} converges to 0 when the number of observations l → ∞³. It is written as sup_{A∈F} |P(A) − E_l(A)| →_P 0 as l → ∞.

²See as well the definition of a random variable.

³When the σ-algebra F is poor (e.g. if the number of elements in F is finite) it can be easily estimated by the empirical measure of the frequency of occurrence ν_l(A) = ν(A; w_1, . . . , w_l) = n_A/l, where n_A is the number of elements w_i belonging to the event A. However, if F is rich the empirical measure ν_l may not converge to the probability measure P(A).


Definition 13. Uniform convergence takes place when the estimator E_l(A) = E(A; w_1, . . . , w_l), A ∈ F, defines a sequence of measure approximations that converges uniformly to the probability measure P, i.e. the relation sup_{A∈F} |P(A) − E_l(A)| →_P 0 as l → ∞ holds true.

Definition 14. Partial uniform convergence of an estimator E_l(A) towards a probability measure P(A) takes place when for a subset F* ⊂ F the following convergence in probability holds: sup_{A∈F*} |P(A) − E_l(A)| →_P 0 as l → ∞.

Definition 15. Almost sure convergence of a sequence of random variables w_1, . . . , w_l, . . . to the random variable w_0 takes place when for any δ > 0 the relation P{sup_{l>n} |w_l − w_0| > δ} → 0 as n → ∞ holds, for short w_l →_{a.s.} w_0 as l → ∞.

Theorem 6 (Lebesgue). Any probability distribution function on the line can uniquely be represented as the sum F(x) = F_D(x) + F_AC(x) + F_S(x) of three nonnegative monotone functions, where F_D is a discrete component, representable as F_D(x) = ∑_{x_i<x} p(x_i), p(x_i) ≥ 0, ∑_i p(x_i) ≤ 1 (a measure concentrated at countably many points), F_AC is an absolutely continuous component, representable as F_AC(x) = ∫_{−∞}^{x} p(y) dy, p(x) ≥ 0 (a measure which possesses a density), and F_S is a singular component, a continuous function whose set of jumps (points x for which F(x + ε) − F(x − ε) > 0 as ε → 0) has Lebesgue measure equal to 0 (a measure concentrated on a subset of the line with measure zero which has no point with positive measure). An example of a singular component is the Cantor function.

Theorem 7 (Chentsov). Let P₀ be the set of all admissible probability measures on the Borel sets B. Then for any estimator E_l(A) of an unknown probability measure defined on the Borel subsets A ⊂ (0, 1) there exists a measure P ∈ P₀ for which E_l(A) does not provide uniform convergence.

Theorem 8 (Chentsov). There exists an estimator E_l(A) which provides uniform convergence to any measure in the set P_{D&AC}.

Definition 16. Strict (nontrivial) consistency for minimizing the empirical risk is achieved⁴ when, for the set of functions Q(z, α), α ∈ Λ, the probability distribution function F(z), and for a non-empty subset Λ(c) = {α : ∫ Q(z, α) dF(z) ≥ c}, the following convergence in probability holds: inf_{α∈Λ(c)} R_emp(α) →_P inf_{α∈Λ(c)} R(α) as l → ∞.

Lemma 1. If the method of empirical risk minimization is strictly consistent, the following is true: R(α_l) →_P inf_{α∈Λ} R(α) as l → ∞.


⁴For the classification and regression problem; for the density estimation problem, the maximum likelihood method is strictly consistent with respect to the densities p(z, α), α ∈ Λ, if for any p(z, α₀), α₀ ∈ Λ, the convergence inf_{α∈Λ} (1/l) ∑_{i=1}^{l} −log p(z_i, α) →_P ∫ p(z, α₀)(−log p(z, α₀)) dz as l → ∞ holds true for the i.i.d. samples z_1, .., z_l drawn from p(z, α₀), α₀ ∈ Λ.


Definition 17. A sequence of random variables ξ_l = sup_{α∈Λ} |∫ Q(z, α) dF(z) − (1/l) ∑_{i=1}^{l} Q(z_i, α)| is a two-sided empirical process. A one-sided empirical process is a sequence of random variables ξ_l⁺ = sup_{α∈Λ} (∫ Q(z, α) dF(z) − (1/l) ∑_{i=1}^{l} Q(z_i, α))₊, where (u)₊ = u if u > 0, and 0 otherwise.

Note: The analysis of consistency of the empirical risk minimization method is essentially connected with the analysis of the convergence of one-sided and two-sided empirical processes. If a two-sided empirical process converges in probability to zero, i.e. (for any ε > 0) P{sup_{α∈Λ} |∫ Q(z, α) dF(z) − (1/l) ∑_{i=1}^{l} Q(z_i, α)| > ε} → 0 as l → ∞ takes place, this relation is called uniform convergence of means to their mathematical expectations over a given set of functions, or, simply, uniform convergence⁵,⁶. If a one-sided empirical process converges in probability to zero, i.e. (for any ε > 0) P{sup_{α∈Λ} (∫ Q(z, α) dF(z) − (1/l) ∑_{i=1}^{l} Q(z_i, α))₊ > ε} → 0 as l → ∞ takes place, this relation is called uniform one-sided convergence of means to their mathematical expectations over a given set of functions, or simply, uniform one-sided convergence.
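For the special case of indicator functions of half-lines this uniform convergence is the Glivenko-Cantelli setting noted in footnote 5. As a minimal numerical sketch (assuming NumPy and a standard normal distribution purely for illustration), the quantity sup_x |F(x) − F_l(x)| can be estimated for growing sample sizes:

import numpy as np
from math import erf

rng = np.random.default_rng(0)
normal_cdf = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / np.sqrt(2.0))))

def ks_distance(sample):
    """sup_x |F(x) - F_l(x)| between the true normal F and the empirical F_l."""
    s = np.sort(sample)
    l = len(s)
    F = normal_cdf(s)
    # The supremum is attained at a sample point, just before or just after it.
    return max(np.max(np.abs(F - np.arange(1, l + 1) / l)),
               np.max(np.abs(F - np.arange(0, l) / l)))

for l in (10, 100, 1000, 10000):
    print(l, round(ks_distance(rng.standard_normal(l)), 4))   # shrinks roughly like 1/sqrt(l)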

Definition 18. The set B of elements b in a metric space M is called an ε-net of the set G if any point g ∈ G is distant from some point b ∈ B by an amount not exceeding ε, i.e. ρ(b, g) < ε. It is said that G admits a covering by a finite ε-net if for each ε there exists an ε-net B_ε consisting of a finite number of elements. B_ε* is a minimal ε-net if it is finite and contains a minimal number of elements.

Definition 19. Consider the number N^Λ(z_1, . . . , z_l) of distinguishable clusters of events A_α = {z : Q(z, α) > 0}, α ∈ Λ (two events are distinguishable if there is a sample point that belongs to one event and does not belong to the other). To handle real-valued functions it is necessary to extend the idea of counting the number of distinguishable events to a metric space with an ε-net defined on it. Define the l-dimensional vector q*(α) = (Q(z_1, α), . . . , Q(z_l, α)), α ∈ Λ, induced by the sample z_1, . . . , z_l. Further let N^Λ(ε; z_1, . . . , z_l) be the number of elements of a minimal ε-net on the set of vectors q*(α), α ∈ Λ. It is supposed that for any l the function ln N^Λ(ε; z_1, . . . , z_l) is measurable.

Definition 20. The random entropy of the set of indicator functions Q(z, α), α ∈ Λ on the sample z_1, . . . , z_l is defined as H^Λ(z_1, . . . , z_l) = ln N^Λ(z_1, . . . , z_l). The entropy of the set of indicator functions Q(z, α), α ∈ Λ on samples of size l is defined as H^Λ(l) = ∫ H^Λ(z_1, . . . , z_l) dF(z_1, . . . , z_l). The random ε-entropy of the set of uniformly bounded functions Q(z, α), α ∈ Λ on the sample z_1, . . . , z_l is given by H^Λ(ε; z_1, . . . , z_l) = ln N^Λ(ε; z_1, . . . , z_l). The ε-entropy of the set of uniformly bounded functions Q(z, α), α ∈ Λ on samples of size l is accordingly given as H^Λ(ε; l) = ∫ H^Λ(ε; z_1, . . . , z_l) dF(z_1, . . . , z_l).

⁵The Glivenko-Cantelli theorem is actually a special case of a two-sided empirical process with Q(z, α), α ∈ Λ being a set of indicator functions.

⁶The Law of Large Numbers in statistics is a special case of a two-sided empirical process with the set of functions Q(z, α), α ∈ Λ containing only one element, which lets us build the empirical mean.


Definition 21. The annealed entropy and annealed ε-entropy are defined as H^Λ_ann(l) = ln E[N^Λ(z_1, . . . , z_l)] and H^Λ_ann(ε; l) = ln E[N^Λ(ε; z_1, . . . , z_l)], respectively⁷.

Definition 22. The growth function G^Λ(l) and its ε counterpart G^Λ(ε; l) are defined as G^Λ(l) = ln sup_{z_1,...,z_l} N^Λ(z_1, . . . , z_l) and G^Λ(ε; l) = ln sup_{z_1,...,z_l} N^Λ(ε; z_1, . . . , z_l), respectively. Note: For any l, H^Λ_ann(l) ≤ G^Λ(l) and H^Λ_ann(ε; l) ≤ G^Λ(ε; l).

Definition 23. A learning machine that has an admissible set of real-valued functions Q(z, α), α ∈ Λ is potentially nonfalsifiable if there exist two functions ψ_1(z) ≥ ψ_0(z) such that there exists a positive constant c for which ∫(ψ_1(z) − ψ_0(z)) dF(z) = c > 0 holds, and, for almost any sample z_1, . . . , z_l, any sequence of binary values δ_1, . . . , δ_l, δ_i ∈ {0, 1} and any ε > 0, there exists a function Q(z, α*) in the set of functions Q(z, α), α ∈ Λ for which the inequalities |ψ_{δ_i}(z_i) − Q(z_i, α*)| < ε hold true.

⁷For indicator functions Q(z, α), α ∈ Λ the minimal ε-net of vectors q(α), α ∈ Λ does not depend on ε if ε < 1, i.e. N^Λ(z_1, . . . , z_l) = N^Λ(ε; z_1, . . . , z_l).


References

[1] G. Tesauro. Neurogammon: a neural-network backgammon program. IJCNN International Joint Conference on Neural Networks, 3:33–39, June 17-21, 1990.

[2] Danil V. Prokhorov. Adaptive Critic Designs and Their Applications. Ph.D. dissertation, Texas Tech University, Lubbock, TX, 1997. Available by request from the author ([email protected]).

[3] Danil V. Prokhorov and Donald C. Wunsch. Adaptive critic designs. IEEE Transactions on Neural Networks, 8(5):997–1007, September 1997.

[4] Michael A. Arbib, editor. The Handbook of Brain Theory and Neural Networks. MIT Press, 1995. ISBN: 0-262-01148-4.

[5] David A. White and Donald A. Sofge, editors. Handbook of Intelligent Control. Van Nostrand Reinhold, New York, 1992.

[6] Paul J. Werbos. Optimal neurocontrol: Practical benefits, new results and biological evidence. WESCON/95 Conference Record (IEEE), pages 580–585, November 7-9, 1995. ISBN: 0-7803-2636-9.

[7] M.L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, Inc, 605 Third Avenue, New York, NY, 1994.

[8] Satinder P. Singh. Learning to Solve Markovian Decision Processes. Ph.D. dissertation, University of Massachusetts, Amherst, USA, 1994.

[9] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, Mass., 1998.

[10] Leslie Pack Kaelbling, Michael L. Littman, and Andrew P. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[11] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA 02178-9998, USA, 1996.

[12] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.


[13] P. Dayan. The convergence of TD(λ) for general λ. Machine Learning, 8:341–362, 1992.

[14] Steven J. Bradtke, B. Erik Ydstie, and Andrew G. Barto. Adaptive linear quadratic control using policy iteration. Proceedings of the American Control Conference, Baltimore, Maryland, pages 3475–3479, June 1994.

[15] T. Landelius. Reinforcement Learning and Distributed Local Model Synthesis. PhD thesis, Linkoping University, SE-581 83 Linkoping, Sweden, 1997. Dissertation No 469, ISBN 91-7871-892-9.

[16] Lyle Noakes. A global algorithm for geodesics. J. Austral. Math. Soc. (Series A), 64:37–50, 1998.

[17] C. Yalcin Kaya and J. Lyle Noakes. The leap-frog algorithm and optimal control: Background and demonstration. Proceedings of the International Conference on Optimization Techniques and Applications (ICOTA '98), Perth, Australia, pages 835–842, 1998.

[18] Thomas L. Vincent and Walter J. Grantham. Nonlinear and Optimal Control Systems. John Wiley and Sons, Inc, 605 Third Avenue, New York, NY, 1997.

[19] I. N. Bronstein and K. A. Semendjajew. Taschenbuch der Mathematik. Verlag Nauka, Moskau; B.G. Teubner Verlagsgesellschaft, Stuttgart/Leipzig; Verlag Harri Deutsch, Thun/Frankfurt, 25th edition, 1991.

[20] I. N. Bronstein and K. A. Semendjajew. Erganzende Kapitel zum Taschenbuch der Mathematik. Verlag Harri Deutsch, Thun/Frankfurt, 6th edition, 1991.

[21] William L. Brogan. Modern Control Theory. Prentice Hall, Upper Saddle River, New Jersey 07458, 3rd edition, 1991.

[22] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7, pages 369–376, Cambridge, MA, 1995. The MIT Press.

[23] Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems, volume 8, pages 1038–1044. The MIT Press, 1996.

[24] Donald E. Kirk. Optimal Control Theory: An Introduction. Prentice-Hall Network Series. Prentice-Hall, Inc, Englewood Cliffs, New Jersey, 1970.


[25] P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Ph.D. dissertation, Harvard Univ., Cambridge, MA, 1974. Reprinted in The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting.

[26] P. Werbos. Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, October 1990.

[27] P. Werbos. Stable adaptive control using new critic designs. http://xxx.lanl.gov/abs/adap-org/9810001, March 1998.

[28] Leemon C. Baird. Residual algorithms. In Justin A. Boyan, Andrew W. Moore, and Richard S. Sutton, editors, Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, July 9 (Technical report CMU-CS-95-206), 1995. Workshop proceedings are at http://www.cs.cmu.edu/∼reinf/ml95/proceedings.html.

[29] Leemon C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Armand Prieditis and Stuart Russell, editors, Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufman Publishers, San Francisco, CA, pages 30–37, 1995. The 22 Nov 95 errata corrects errors in the published version.

[30] Stuart Dreyfus. AHC for stochastic minimum cost path problems using neural net function approximators, 1995. Workshop proceedings are at http://www.cs.cmu.edu/∼reinf/ml95/proceedings.html.

[31] Arthur E. Bryson and Yu-Chi Ho. Applied Optimal Control. John Wiley & Sons, 1975.

[32] H. Drucker and Y. Le Cun. Improving generalization performance using double backpropagation. IEEE Transactions on Neural Networks, 3(6):991–997, June 1992.

[33] J. S. Dalton. A Critic Based System for Neural Guidance and Control. Ph.D. dissertation in EE, University of Missouri-Rolla, 1994.

[34] R. Beard, G. Saridis, and J. Wen. Improving the performance of stabilizing controls for nonlinear systems. IEEE Control Systems Magazine, 16(5):27–35, October 1996.

[35] E.A. Wan. Temporal backpropagation for FIR neural networks. IEEE International Joint Conference on Neural Networks, San Diego, CA, 1:575–580, 1990.

[36] S. Haykin. Neural Networks: A Comprehensive Foundation, chapter 15. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1998.


[37] T. Hanselmann, A. Zaknich, and Y. Attikiouzel. Connection between BPTT and RTRL. 3rd IMACS/IEEE International Multiconference on Circuits, Systems, 4-8 July 1999, Athens, July 1999. Reprinted in Computational Intelligence and Applications, Ed. Mastorakis, Nikos, E., World Scientific, ISBN 960-8052-05-X, 1999.

[38] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks and their computational complexity. In Chauvin and Rumelhart, editors, Backpropagation: Theory, Architectures and Applications, pages 433–486. LEA, 1995.

[39] D.V. Prokhorov, G.V. Puskorius, and L.A. Feldkamp. Dynamical Neural Networks for Control. IEEE Press, 2001.

[40] E.A. Wan. Time series prediction by using a connectionist network with internal delay lines. In A.S. Weigend and N.A. Gershenfeld, editors, Time Series Prediction: Forecasting the Future and Understanding the Past, pages 195–217. Addison Wesley, 1994.

[41] Tomas Landelius and Hans Knutsson. Greedy adaptive critics for LQR problems: Convergence proofs. http://citeseer.nj.nec.com/429530.html.

[42] H. J. Kushner and D. S. Clark. Stochastic Approximation Methods for Constrained and Unconstrained Systems. Springer Verlag, Berlin, 1978.

[43] P. Werbos. How to use the chain rule for ordered derivatives. In David A. White and Donald A. Sofge, editors, Handbook of Intelligent Control, chapter 10.6. Van Nostrand Reinhold, New York, 1992.

[44] P. Eaton, D. Prokhorov, and D. Wunsch. Neurocontroller alternatives for "fuzzy" ball-and-beam systems with nonlinear, nonuniform friction. IEEE Transactions on Neural Networks, 11(2):423–435, March 2000.

[45] S. Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1998.

[46] Andrey V. Savkin, Ian R. Petersen, Efstratios Skafidas, and Robin J. Evans. Hybrid dynamical systems: Robust control synthesis problems. Systems and Control Letters, 29(2):81–90, 1996.

[47] A.S. Matveev and A.V. Savkin. Qualitative Theory of Hybrid Dynamical Systems. Birkhauser, Boston, 2000.

[48] A.V. Savkin and R.J. Evans. Hybrid Dynamical Systems: Controller and Sensor Switching Problems. Birkhauser, Boston, 2002.

[49] Andrey V. Savkin and Robin J. Evans. A new approach to robust control of hybrid systems over infinite times. IEEE Transactions on Automatic Control, 43(9):1292–1296, September 1998.


[50] Andrey V. Savkin, E. Skafidas, and Robin J. Evans. Robust output feedback stabilizability via controller switching. Automatica, 35(1):69–74, January 1999.

[51] S. Haykin. Neural Networks: A Comprehensive Foundation, chapter 6. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1998.

[52] C. Burges. A tutorial on support vector machines for pattern recognition. http://svm.first.gmd.de/papers/Burges98.ps.gz.

[53] V. N. Vapnik. Statistical Learning Theory. Adaptive and Learning Systems for Signal Processing, Communications and Control. John Wiley and Sons, Inc., 1998.

[54] F. Rosenblatt. Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan, New York, 1962.

[55] Achim Hoffmann. Tutorial: VC learning theory and its applications. ICONIP'99 conference held in Perth, Western Australia, 16-20 November 1999, 1999.

[56] P. Wolfe. A duality theorem for nonlinear programming. Quarterly of Applied Mathematics, 19(3):239–244, 1961.

[57] S. Vijayakumar and S. Wu. Sequential support vector classifiers and regression. Int. Conf. Soft Computing, pages 610–619, 1999. citeseer.nj.nec.com/vijayakumar99sequential.html.

[58] T. Hanselmann and L. Noakes. Optimizing an algebraic perceptron solution. INNS-IEEE International Joint Conference on Neural Networks IJCNN, 2001.

[59] S. Haykin. Neural Networks: A Comprehensive Foundation, chapter 2. Prentice Hall, Upper Saddle River, NJ, 2nd edition, 1998.

[60] P. J. Huber. Robust Statistics. John Wiley and Sons, Inc., 1981.

[61] T. Friess, N. Cristianini, and C. Campbell. The kernel adatron algorithm: a fast and simple learning procedure for support vector machines. In Proc. 15th International Conference on Machine Learning, Morgan Kaufman Publishers, 1998.

[62] J. Anlauf and M. Biehl. The adatron: an adaptive perceptron algorithm. Europhysics Letters, 10(7):687–692, 1989.

[63] S.J. Wan. Cone algorithm: An extension of the perceptron algorithm. IEEE Transactions on Systems, Man and Cybernetics, 24(10), October 1994.

[64] T. Hanselmann and L. Noakes. Comparison between support vector algorithm and algebraic perceptron. INNS-IEEE International Joint Conference on Neural Networks IJCNN, 2001.


[65] Ralf Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, 2002.

[66] Bernhard Scholkopf and Alexander J. Smola. Learning with Kernels. MIT Press, 2002.

[67] T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11. MIT Press, 1999.

[68] S.K. Shevade, S.S. Keerthi, C. Bhattacharyya, and K.R.K. Murthy. Improvements to Platt's SMO algorithm for SVM classifier design. Technical Report CD-99-14, 1999.

[69] IJCNN challenge neural network competition (GAC), 2001. http://www.geocities.com/ijcnn/challenge.html.

[70] T. Joachims. Software package SVM light, Version 3.5, 2000. http://ais.gmd.de/∼thorsten/svm light/.

[71] S. Amari and Si Wu. An information-geometrical method for improving the performance of support vector machine classifiers. Ninth International Conference on Artificial Neural Networks, ICANN 99, 1:85–90, 1999.

[72] Sheng Chen, Bernard Mulgrew, and Peter M. Grant. A clustering technique for digital communications channel equalization using radial basis function networks. IEEE Transactions on Neural Networks, 4(4):570–578, 1993.

[73] J. Young, T. Hanselmann, A. Zaknich, and Y. Attikiouzel. Algebraic perceptron in digital channel equalization. INNS-IEEE International Joint Conference on Neural Networks IJCNN, 2001.

[74] J. Young, T. Hanselmann, A. Zaknich, and Y. Attikiouzel. Fine tuning the algebraic perceptron equaliser to increase the separation margin. Inter-University Postgraduate Electrical Engineering Symposium, Murdoch University Rockingham Campus, Western Australia, 2nd October 2002, 2002.

[75] O. L. Mangasarian and D. R. Musicant. Successive overrelaxation for support vector machines. IEEE-NN, 10(5):1032, 1999.

[76] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. Technical Report 98-14, Microsoft Research, Redmond, Washington, April 1998. citeseer.nj.nec.com/platt98sequential.html.

[77] R. Duda and P. Hart. Use of the Hough transform to detect lines and curves in pictures. Commun. ACM, pages 11–15, 1972.

[78] D. Ballard. Generalized Hough transform to detect arbitrary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(2):111–122, 1981.


[79] J.A.K. Suykens. Learning and generalization by coupled local minimizers. Proceedings. IJCNN '01. International Joint Conference on Neural Networks, 1:337–341, July 2001.

[80] J.A.K. Suykens and J. Vandewalle. Coupled local minimizers: alternative formulations and extensions. Proceedings. IJCNN '02. International Joint Conference on Neural Networks, 3:2039–2043, July 2002.