
Neuro-Dynamic Programming

José A. Ramírez, Yan Liao

Advanced Decision Processes

ECECS 841, Spring 2003

University of Cincinnati


Outline

1. Neuro-Dynamic Programming (NDP): motivation.

2. Introduction to Infinite Horizon Problems: Minimization of Total Cost, Discounted Problems, Finite-State Systems, Value Iteration and Error Bounds, Policy Iteration, The Role of Contraction Mappings.

3. Stochastic Control Overview: State Equation (system model), Value Function, Stationary policies and value function, Introductory example: Tetris (game).

4. Control of Complex Systems: Motivation for the use of NDP in complex systems, Examples of complex systems where NDP could be applied.

5. Value Function Approximation: Linear parameterization: parameter vector and basis functions, Continuation of the Tetris example.

6. Temporal-Difference Learning (TD($\lambda$)): Introduction: autonomous systems, general TD($\lambda$) algorithm; Controlled systems, TD($\lambda$) for more general systems: approximate policy iteration, controlled TD, Q-functions, and approximating the Q-function (Q-learning); Comments about the relationship with Approximate Value Iteration.

7. Actors and Critics: Averaged Rewards, Independent actors, Using critic feedback.

1. Neuro-Dynamic Programming (NDP): motivation

Study of Decision-Making

-How decisions are made (psychologists, economists).

-How decisions ought to be made: “rational decision-making” (engineers and management scientists).

“rational” and “irrational” behavior

clear objectives, strategic behavior.

Rational decision problems:

-Development of mathematical theory: understanding of dynamic models, uncertainties, objectives, and characterization of optimal decision strategies.

-If optimal strategies do exist, then computational methods are used as a complement (e.g., for implementation).

-In contrast to rational decision-making, there is no clear-cut mathematical theory about decisions made by participants of natural systems (speculative theories, refining ideas by experimentation).

-One approach: hypothesis that behavior is in some sense rational, then ideas from study of rational decision-making are used to characterize such behavior, e.g., utility and equilibrium theory in financial economics.

-Also, the study of animal behavior is a subject of interest: evolutionary theory and its popular precept, “survival of the fittest”, support the possibility that behavior to some extent concurs with that of a rational agent.

-Contributions from study of natural systems to science of rational decision-making:

-Computational complexity of decision problems and lack of systematic approaches for dealing with it.

-For example: practical problems addressed by the theory of dynamic programming (DP) can rarely be solved using DP algorithms, because the computational time required for the generation of optimal strategies typically grows exponentially in the number of variables involved: the “curse of dimensionality”.

-This calls for an understanding of suboptimal solutions and of decision-making under computational constraints. Problem: no satisfactory theory has been developed to this end.

-Interesting: the fact that biological mechanisms facilitate the efficient synthesis of adequate strategies motivates the possibility that understanding such mechanisms can inspire new and computationally feasible methodologies for strategic decision-making.

-Reinforcement Learning (RL): over two decades, RL algorithms (originally conceived as descriptive models for phenomena observed in animal behavior) have grown out of the field of artificial intelligence and been applied to solving complex sequential decision problems.

-The success of RL in solving large-scale problems has generated special interest among operations researchers and control theorists, with research devoted to understanding those methods and their potential.

-Developments from operations researchers and control theorists have focused on a normative view. Acknowledging the relative disconnect from descriptive models of animal behavior, some operations researchers and control theorists have come to refer to this area of research as Neuro-Dynamic Programming (NDP) instead of RL.

-During these lectures we will present a sample of the recent developments and open issues of research in NDP.

-Specifically, we will focus on two algorithmic ideas of greatest use in NDP, for which there has been significant theoretical progress in recent years:

-Temporal-Difference learning

-Actor-Critic methods

-First, we begin by providing some background and perspective on the methodology and the problems it may address.

Comments about references

2. Introduction to Infinite Horizon Problems

Material taken from “Dynamic Programming and Optimal Control”, vol. I, II; and “Neuro-Dynamic Programming” by Dimitri P. Bertsekas and John Tsitsiklis.

Dynamic programming problems with an infinite horizon are characterized by the following aspects:

a) The number of stages is infinite*.

b) The system is stationary: the system equation, the cost per stage, and the random disturbance statistics do not change from one stage to the next.

Why Infinite Horizon Problems?

-They are interesting because their analysis is elegant and insightful.

-Implementation of optimal policies is often simple. Optimal policies are typically stationary, i.e., the optimal rule used to choose controls does not change from stage to stage.

-NDP: complex systems.

*This assumption is never satisfied in practice, but is a reasonable approximation for problems with a finite but very large number of stages.

-They require more sophisticated analysis than finite horizon problems: the limiting behavior as the horizon tends to infinity must be analyzed.

-We consider four principal classes of infinite horizon problems. The first two classes try to minimize $J_\pi(x_0)$, the total cost over an infinite number of stages:

$$J_\pi(x_0) = \lim_{N\to\infty} E_{w_k,\;k=0,1,\ldots}\left\{ \sum_{k=0}^{N-1} \alpha^k\, g\big(x_k,\mu_k(x_k),w_k\big) \right\},$$

where $\pi = \{\mu_0,\mu_1,\ldots\}$ is a policy and $\alpha \in (0,1]$ is the discount factor.

i) Stochastic shortest path problems: in this case $\alpha = 1$, and we assume that there is an additional state 0, which is a cost-free termination state; once the system reaches the termination state, it remains there at no additional cost. Objective: reach the termination state with minimal cost.

ii) Discounted problems with bounded cost per stage: here $\alpha < 1$, and the absolute one-stage cost $|g(x,u,w)|$ is bounded from above by some constant M. Thus, $J_\pi(x_0)$ is well defined because it is the infinite sum of a sequence of numbers that are bounded in absolute value by the decreasing geometric progression $M\alpha^k$.

iii) Discounted problems with unbounded cost per stage: here the discount factor $\alpha$ may or may not be less than 1, and the cost per stage is unbounded. This problem is difficult to analyze because of the possibility of infinite cost for some policies (more details in chap. 3, Dynamic Programming, vol. II, by Bertsekas).

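iv) Average cost problems: in some problems we have $J_\pi(x_0) = \infty$ for every policy $\pi$ and initial state $x_0$. In many such problems the average cost per stage, given by

$$\lim_{N\to\infty} \frac{1}{N} J_N^\pi(x_0) = \lim_{N\to\infty} \frac{1}{N}\, E\left\{ \sum_{k=0}^{N-1} g\big(x_k,\mu_k(x_k),w_k\big) \right\},$$

where $J_N^\pi(x_0)$ is the N-stage cost-to-go of policy $\pi$ starting at state $x_0$, is well defined as a limit and is finite.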

A Preview of Infinite Horizon Results:

Let $J^*$ be the optimal cost-to-go function of the infinite horizon problem, and consider the case $\alpha = 1$, with $J_N(x)$ the optimal cost of the problem involving N stages, initial state x, cost per stage g(x,u,w), and zero terminal cost. The N-stage cost can be computed after N iterations of the DP algorithm*:

$$J_{k+1}(x) = \min_{u\in U(x)} E_w\big\{ g(x,u,w) + J_k\big(f(x,u,w)\big) \big\},\quad k=0,1,\ldots,\qquad J_0(x)=0 \ \text{ for all } x.$$

Thus, we can speculate the following:

i) The optimal infinite horizon cost is the limit of the corresponding N-stage optimal costs as $N\to\infty$:

$$J^*(x) = \lim_{N\to\infty} J_N(x) \quad \text{for all } x.$$

*Note that the time indexing has been reversed from the original DP algorithm; thus the optimal finite horizon cost functions can be computed with a single DP recursion (more details in chap. 1, “Dynamic Programming”, vol. II, by D.P. Bertsekas).

ii) The following limiting form of the DP algorithm should hold for all states:

$$J^*(x) = \min_{u\in U(x)} E_w\big\{ g(x,u,w) + J^*\big(f(x,u,w)\big) \big\} \quad \text{for all } x.$$

This is not an algorithm, but a system of equations (one equation per state), whose solution is the costs-to-go of all states. It can also be viewed as a functional equation for the cost-to-go function $J^*$, and it is called Bellman’s equation.

iii) If $\mu(x)$ attains the minimum in the right-hand side of Bellman’s equation for each x, then the policy $\pi=\{\mu,\mu,\ldots\}$ should be optimal. This is true for most infinite horizon problems of interest, and it motivates the study of stationary policies.

Most of the analysis of infinite horizon problems focuses on the above three issues and on efficient methods to compute $J^*$ and optimal policies.

Stationary Policy:

A stationary policy is an admissible policy $\pi=\{\mu,\mu,\ldots\}$ with a corresponding cost function $J_\mu(x)$. The stationary policy is optimal if $J_\mu(x)=J^*(x)$ for all states x.

Some Shorthand Notation:

The use of single recursions in the DP algorithm to compute optimal costs over a finite horizon motivates the introduction of two mappings that play an important theoretical role and give us a convenient shorthand notation for expressions that are complicated to write.

For any function $J: S\to\mathbb{R}$, where S is the state space, we consider the function TJ obtained by applying the DP mapping to J as follows:

$$(TJ)(x) = \min_{u\in U(x)} E_w\big\{ g(x,u,w) + \alpha J\big(f(x,u,w)\big) \big\},\quad x\in S,$$

where $\alpha$ is the discount factor ($\alpha = 1$ in the undiscounted case).

T can be viewed as a mapping that transforms the function J on S into the function TJ on S. TJ represents the optimal cost function for the one-stage problem that has stage cost g and terminal cost J.

Similarly, for any control function $\mu: S\to C$, where C is the space of controls, and with discount factor $\alpha$, we have:

$$(T_\mu J)(x) = E_w\big\{ g(x,\mu(x),w) + \alpha J\big(f(x,\mu(x),w)\big) \big\},\quad x\in S.$$

Also, we denote by $T^k$ the composition of the mapping T with itself k times:

$$(T^k J)(x) = \big(T(T^{k-1}J)\big)(x),\quad x\in S.$$

Then, for k=0 we have

$$(T^0 J)(x) = J(x),\quad x\in S.$$

Some Basic Properties

Monotonicity Lemma: For any functions $J: S\to\mathbb{R}$ and $J': S\to\mathbb{R}$ such that

$$J(x) \le J'(x),\quad x\in S,$$

and for any function $\mu: S\to C$ with $\mu(x)\in U(x)$ for all $x\in S$, we have

$$(T^k J)(x) \le (T^k J')(x),\quad x\in S,\ k=1,2,\ldots,$$

$$(T_\mu^k J)(x) \le (T_\mu^k J')(x),\quad x\in S,\ k=1,2,\ldots$$

The Role of Contraction Mappings (Dynamic Programming, vol. II, Bertsekas)

Definition 1.4.1: A mapping $H: B(S)\to B(S)$ is said to be a contraction mapping if there exists a scalar $\rho<1$ such that

$$\|HJ - HJ'\| \le \rho\,\|J - J'\| \quad \text{for all } J, J'\in B(S),$$

where $\|\cdot\|$ is the norm

$$\|J\| = \max_{x\in S} |J(x)|.$$

It is said to be an m-stage contraction mapping if there exist a positive integer m and some $\rho<1$ such that

$$\|H^m J - H^m J'\| \le \rho\,\|J - J'\| \quad \text{for all } J, J'\in B(S),$$

where $H^m$ denotes the composition of H with itself m times.

Note: B(S) is the set of all bounded real-valued functions $J: S\to\mathbb{R}$.

Proposition 1.4.1 (Contraction Mapping Fixed-Point Theorem): If $H: B(S)\to B(S)$ is a contraction mapping or an m-stage contraction mapping, then there exists a unique fixed point of H; i.e., there exists a unique function $J^*\in B(S)$ such that

$$HJ^* = J^*.$$

Furthermore, if J is any function in B(S) and $H^k$ is the composition of H with itself k times, then

$$\lim_{k\to\infty} \|H^k J - J^*\| = 0.$$
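As an illustration of the fixed-point theorem, the following is a minimal sketch of value iteration, i.e. repeated application of the DP mapping T, for a discounted finite-state problem. It assumes the disturbance w has been replaced by explicit transition probabilities `P[u][x][y]` with transition costs `g[u][x][y]`; the two-state, two-control numbers and the function names are hypothetical and chosen only for the demonstration.

```python
def bellman_operator(J, P, g, alpha):
    """(TJ)(x) = min_u sum_y P[u][x][y] * (g[u][x][y] + alpha * J[y])."""
    n_controls, n_states = len(P), len(J)
    TJ = []
    for x in range(n_states):
        costs = [
            sum(P[u][x][y] * (g[u][x][y] + alpha * J[y]) for y in range(n_states))
            for u in range(n_controls)
        ]
        TJ.append(min(costs))
    return TJ


def sup_norm(J1, J2):
    """Maximum norm of J1 - J2 over all states."""
    return max(abs(a - b) for a, b in zip(J1, J2))


def value_iteration(P, g, alpha, tol=1e-10, max_iter=10_000):
    """Iterate J <- TJ from J = 0 until successive iterates differ by less than tol."""
    J = [0.0] * len(P[0])
    for _ in range(max_iter):
        TJ = bellman_operator(J, P, g, alpha)
        if sup_norm(TJ, J) < tol:
            return TJ
        J = TJ
    return J


# Hypothetical problem data (illustrative only): P[u][x][y] and g[u][x][y].
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.6, 0.4]]]
g = [[[1.0, 2.0], [0.5, 1.5]],
     [[2.0, 0.0], [1.0, 1.0]]]
alpha = 0.9

J_star = value_iteration(P, g, alpha)

# Contraction check on two arbitrary functions J_a, J_b.
J_a, J_b = [0.0, 0.0], [5.0, -3.0]
lhs = sup_norm(bellman_operator(J_a, P, g, alpha), bellman_operator(J_b, P, g, alpha))
rhs = alpha * sup_norm(J_a, J_b)
```

With $\alpha<1$ the mapping T is a contraction with modulus $\alpha$ in the maximum norm, so `lhs` should not exceed `rhs`, and `J_star` should be (numerically) a fixed point of `bellman_operator`.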

3. Stochastic Control Overview

State Equation: Let’s consider a discrete-time dynamic system that, at each time t, takes on a state $x_t$ and evolves according to

$$x_{t+1} = f(x_t, a_t, w_t),$$

where $w_t$ is a disturbance (iid) and $a_t$ is a control decision. We restrict attention to finite state, disturbance, and control spaces, denoted by X, W, and A, respectively.

Value Function: Let $r: X\times A\to\mathbb{R}$ associate a reward $r(x_t,a_t)$ with a decision $a_t$ made at state $x_t$. Let $\mu$ be a stationary policy with $\mu: X\to A$. For each policy $\mu$ we define a value function $v(\cdot,\mu): X\to\mathbb{R}$:

$$v(x,\mu) = E\left[ \sum_{t=0}^{\infty} \alpha^t\, r\big(x_t,\mu(x_t)\big) \,\middle|\, x_0 = x \right],$$

where $\alpha\in(0,1)$ is the discount factor.

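Optimal Value Function: we define the optimal value function V as follows:

$$V(x) = \max_{\mu} v(x,\mu).$$

From dynamic programming, we have that any stationary policy $\mu^*$ given by

$$\mu^*(x) \in \arg\max_{a\in A} E_w\big[ r(x,a) + \alpha V\big(f(x,a,w)\big) \big]$$

is optimal in the sense that

$$V(x) = v(x,\mu^*) \quad \text{for all } x\in X.$$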

Example 1: Tetris

The video arcade game of Tetris can be viewed as an instance of stochastic control. In particular, we can view the state $x_t$ as an encoding of the current “wall of bricks” and the shape of the current “falling piece.” The decision $a_t$ identifies an orientation and horizontal position for placement of the falling piece onto the wall. Though the arcade game employs a more complicated scoring system, consider for simplicity a reward $r(x_t,a_t)$ equal to the number of rows eliminated by placing the piece in the position described by $a_t$. Then, a stationary policy $\mu$ that maximizes the value

$$v(x,\mu) = E\left[ \sum_{t=0}^{\infty} \alpha^t\, r\big(x_t,\mu(x_t)\big) \,\middle|\, x_0 = x \right]$$

essentially optimizes a combination of present and future row elimination, with decreasing emphasis placed on rows to be eliminated at times farther into the future.

Example 1: Tetris, cont.

Tetris was first programmed by Alexey Pajitnov, Dmitry Pavlovsky, and Vadim Gerasimov, computer engineers at the Computer Center of the Russian Academy of Sciences, in 1985-86.

[Figure: the standard Tetris shapes, with wall width w = 10, and the number of states expressed in terms of the wall height h, the wall width w, and the number of shapes m.]

Dynamic programming algorithms compute the optimal value function V. The result is stored in a “look-up” table with one entry V(x) per state $x\in X$. When required, the value function is used to generate optimal decisions. For example, given a current state $x_t\in X$, a decision $a_t$ is selected according to

$$a_t \in \arg\max_{a\in A} E_w\big[ r(x_t,a) + \alpha V\big(f(x_t,a,w)\big) \big].$$
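The decision rule above can be sketched in a few lines once V is available as a look-up table. The example below is only an illustration under stated assumptions: the four-state system, the disturbance distribution, and the names `greedy_decision`, `f`, and `r` are hypothetical and are not taken from the Tetris model.

```python
def greedy_decision(x, V, actions, disturbances, f, r, alpha):
    """Return an action maximizing E_w[ r(x, a) + alpha * V[f(x, a, w)] ].

    `disturbances` is a list of (w, probability) pairs.
    """
    def expected_value(a):
        return sum(p * (r(x, a) + alpha * V[f(x, a, w)]) for w, p in disturbances)

    return max(actions, key=expected_value)


# Hypothetical toy system on states 0..3: the action moves left or right,
# and the disturbance w decides whether the move actually happens.
V = {0: 0.0, 1: 1.0, 2: 4.0, 3: 9.0}      # look-up table, one entry per state
actions = [-1, +1]
disturbances = [(0, 0.2), (1, 0.8)]        # w = 1: move succeeds, w = 0: stay


def f(x, a, w):
    """State equation x_{t+1} = f(x_t, a_t, w_t), clipped to the state space."""
    return min(3, max(0, x + a * w))


def r(x, a):
    """Zero one-stage reward, so the decision depends only on V."""
    return 0.0


best = greedy_decision(1, V, actions, disturbances, f, r, alpha=0.9)
```

For state 1 the expected values are 0.9 (0.2 · 1 + 0.8 · 4) = 3.06 for moving right and 0.9 (0.2 · 1 + 0.8 · 0) = 0.18 for moving left, so the rule picks +1. The cost of this rule is one table entry per state, which is what becomes infeasible for large state spaces.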

4. Control of Complex Systems

The main objective is the development of a methodology for the control of “complex systems”.

Two common characteristics of these types of systems are:

i-An intractable state space: intractable state spaces preclude the use of classical DP algorithms, which compute and store one numerical value per state.

ii- Severe nonlinearities: methods of traditional linear control, which are applicable even in large state spaces, are ruled out by severe nonlinearities.

Let’s review some examples of complex systems, where NDP could be and has been applied.

a) Call Admission and Routing

With rising demand for telecommunication network resources, effective management is as important as ever. Admission (deciding which calls to accept or reject) and routing (allocating links in the network to particular calls) are examples of decisions that must be made at any point in time. The objective is to make the “best” use of limited network resources. In principle, such sequential decision problems can be addressed by dynamic programming. Unfortunately, the enormous state spaces involved render dynamic programming algorithms inapplicable, and heuristic control strategies are used instead.

b) Strategic Asset Allocation

Strategic asset allocation is the problem of distributing an investor’s wealth among assets in the market in order to take on a combination of risk and expected return that best suits the investor’s preferences. In general, the optimal strategy involves dynamic rebalancing of wealth among assets over time. If each asset offers a fixed rate of risk and return, and some additional simplifying assumptions are made, the only state variable is wealth, and the problem can be solved efficiently by dynamic programming algorithms. There are even closed form solutions in cases involving certain types of investor preferences. However, in the more realistic setting involving risks and returns that fluctuate with economic conditions, economic indicators must be taken into account as state variables, and this quickly leads to an intractable state space. The design of effective strategies in such situations constitutes an important challenge in the growing field of financial engineering.

c) Supply-Chain Management

With today’s tight vertical integration, increased production complexity, and diversification, the inventory flow within and among corporations can be viewed as a complex network (called a supply chain) consisting of storage, production, and distribution sites. In a supply chain, raw materials and parts from external vendors are processed through several stages to produce finished goods. Finished goods are then transported to distributors, then to wholesalers, and finally to retailers, before reaching customers. The goal in supply-chain management is to achieve a particular level of product availability while minimizing costs. The solution is a policy that decides how much to order or produce at various sites given the present state of the company and the operating environment.

d) Emissions Reductions

The threat of global warming that may result from accumulation of carbon dioxide and other “greenhouse gases” poses a serious dilemma. In particular, cuts in emission levels bear a detrimental short-term impact on economic growth. At the same time, a depleting environment can severely hurt the economy, especially the agricultural sector, in the longer term. To complicate the matter further, scientific evidence on the relationship between emission levels and global warming is inconclusive, leading to uncertainty about the benefits of various cuts. One systematic approach to considering these conflicting goals involves the formulation of a dynamic system model that describes our understanding of economic growth and environmental science. Given such a model, the design of environmental policy amounts to dynamic programming. Unfortunately, classical algorithms are inapplicable due to the size of the state space.

e) Semiconductor Wafer Fabrication

The manufacturing floor at a semiconductor wafer fabrication facility is organized into service stations, each equipped with specialized machinery. There is a single stream of jobs arriving on a production floor. Each job follows a deterministic route that revisits the same station multiple times. This leads to a scheduling problem where, at any time, each station must select a job to service such that (long term) production capacity is maximized. Such a system can be viewed as a special class of queueing networks, which are models suitable for a variety of applications in manufacturing, telecommunications, and computer systems. Optimal control of queueing networks is notoriously difficult, and this reputation is strengthened by formal characterizations of computational complexity.

Other systems: parking lots, football, games strategy, combinatorial optimization – maintenance and repair, dynamic channel allocation, backgammon.

Some papers on applications:

-Tsitsiklis, J. and Van Roy, B. “Neuro-Dynamic Programming Overview and a Case Study in Optimal Stopping.” IEEE Proceedings of the 36th Conference on Decision & Control, San Diego, California, pp. 1181-1186, December, 1997.

-Van Roy, B., Bertsekas, D.P., Lee, Y., and Tsitsiklis, J. “A Neuro-Dynamic Programming Approach to Retailer Inventory Management” IEEE Proceedings of the 36th Conference on Decision & Control, San Diego, California, pp. 4052-4057, December, 1997.

-Marbach, P. and Tsitsiklis, J. “A Neuro-Dynamic Programming Approach to Admission Control in ATM Networks: The Single Link Case” Technical Report LIDS-P-2402, Laboratory for Information and Decision Systems, M.I.T., November 1997.

- Marbach, P, Mihatsch, O, and Tsitsiklis, J. “Call Admission Control and Routing in Integrated Services Networks Using Reinforcement Learning” IEEE Proceedings of the 37th Conference on Decision & Control, Tampa, Florida, pp. 563-568, December, 1998.

-Bertsekas, D.P., Homer, M.L., “Missile Defense and Interceptor Allocation by Neuro-Dynamic Programming” IEEE Transactions on Systems Man and Cybernetics, vol. 30, pp.101-110, 2000.


-For the examples presented, state spaces are intractable as a consequence of the “curse of dimensionality”, that is, state spaces grow exponentially in the number of state variables. It is therefore difficult (if not impossible) to compute and store one value per state, as is required by classical DP.

-Additional shortcoming of classical DP: computations require the use of transition probabilities. For many complex systems, such probabilities are not readily accessible. On the other hand, it is often easier to develop simulation models for the system and generate sample trajectories.

-Objective of NDP: overcoming the curse of dimensionality through the use of parameterized (value) function approximators and of output generated by simulators, rather than explicit transition probabilities.

5. Value Function Approximation

-Intractability of state spaces motivates value function approximation.

-Two important pre-conditions for the development of an effective approximation:

i-Choose a parameterization* that yields a good approximation:

$$\tilde v: X\times\mathbb{R}^K\to\mathbb{R},\qquad \tilde v(x,u)\approx V(x),\qquad u\in\mathbb{R}^K \ \text{(parameter vector)}.$$

ii-Algorithms for computing appropriate parameter values.

*Note: the choice of a suitable parameterization requires some practical experience or theoretical analysis that provides rough information about the shape of the function to be approximated.

Linear parameterization*

-General classes of parameterizations have been used in NDP. To keep the exposition simple, let us focus on linear parameterizations, which take the form

$$\tilde v(x,u) = \sum_{k=1}^{K} u(k)\,\phi_k(x),$$

where $\phi_1,\ldots,\phi_K$ are “basis functions” mapping X to $\mathbb{R}$ and $u = (u(1),\ldots,u(K))'$ is a vector of scalar weights.

Similar to statistical regression, the basis functions $\phi_1,\ldots,\phi_K$ are selected by a human based on intuition or analysis specific to the problem at hand.

Hint: one interpretation that is useful for the construction of basis functions involves viewing each function $\phi_k$ as a “feature”, that is, a numerical value capturing a salient characteristic of the state that may be pertinent to effective decision making.

Example 2: Tetris, continuation.

In our stochastic control formulation of Tetris, the state is an encoding of the current wall configuration and the current falling piece. There are clearly too many states for exact dynamic programming algorithms to be applicable. However, we may believe that most information relevant to game-playing decisions can be captured by a few intuitive features. In particular, one feature, say $\phi_1$, may map states to the height of the wall. Another, say $\phi_2$, could map states to a measure of “jaggedness” of the wall. A third might provide a scalar encoding of the type of the current falling piece (there are seven different shapes in the arcade game). Given a collection of such features, the next task is to select weights u(1), . . . , u(K) such that

$$\sum_{k=1}^{K} u(k)\,\phi_k(x) \approx V(x)$$

for all states x. This approximation could then be used to generate a game-playing strategy.

A similar approach is presented in the book “Neuro-Dynamic Programming” (chapter 8, case studies) by D.P. Bertsekas and J. Tsitsiklis, with the following parameterization, obtained after some experimentation:

a) The height $h_k$ of the kth column of the wall. There are w such features, where w is the wall’s width.

b) The absolute difference $|h_k - h_{k+1}|$ between the heights of the kth and the (k+1)st column, k=1,…, w-1.

c) The maximum wall height $\max_k h_k$.

d) The number of holes L in the wall, that is, the number of empty positions of the wall that are surrounded by full positions.

Thus, there are 2w+1 features, which together with a constant offset require 2w+2 weights in a linear architecture of the form

$$\tilde v(x,u) = u(0) + \sum_{k=1}^{w} u(k)\,h_k + \sum_{k=1}^{w-1} u(k+w)\,|h_k - h_{k+1}| + u(2w)\max_k h_k + u(2w+1)\,L,$$

where u(0) is the constant offset.

Using this parameterization with w=10 (22 weights, counting the constant offset), a strategy generated by NDP eliminates an average of 3554 rows per game, reflecting performance comparable to that of an expert player.
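The feature list a) to d) and the linear architecture can be sketched as follows. This is a hedged illustration only: the wall encoding (list of rows, top row first, 1 = full) and the function names are assumptions, and a "hole" is simplified here to an empty cell lying below the topmost full cell of its column, which may differ from the exact definition used in the book.

```python
def tetris_features(wall):
    """Return [1, h_1..h_w, |h_k - h_{k+1}| for k=1..w-1, max_k h_k, L].

    `wall` is a list of rows (top row first); each cell is 1 (full) or 0 (empty).
    The leading 1 multiplies the constant offset u(0).
    """
    n_rows, w = len(wall), len(wall[0])
    heights = []
    holes = 0
    for col in range(w):
        filled_rows = [row for row in range(n_rows) if wall[row][col]]
        top = filled_rows[0] if filled_rows else n_rows
        heights.append(n_rows - top)
        # Simplifying assumption: a hole is an empty cell below the column's top full cell.
        holes += sum(1 for row in range(top, n_rows) if not wall[row][col])
    diffs = [abs(heights[k] - heights[k + 1]) for k in range(w - 1)]
    return [1.0] + heights + diffs + [max(heights), holes]


def linear_value(features, u):
    """Linear architecture: sum_k u(k) * phi_k(x)."""
    return sum(weight * phi for weight, phi in zip(u, features))


# Small hypothetical wall of width w = 4 (the slides use w = 10).
wall = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 1],
]
phi = tetris_features(wall)          # 2w + 2 = 10 entries
value = linear_value(phi, [1.0] * len(phi))
```

For this wall the column heights are 3, 2, 0, 2, the height differences are 1, 2, 2, the maximum height is 3, and there is one hole, so the feature vector has 2w+2 = 10 entries including the constant. Choosing the weights u is the job of the training algorithm and is not shown here.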

6. Temporal-Difference Learning

Introduction to Temporal-Difference Learning

Material from: Richard Sutton, “Learning to Predict by the Methods of Temporal Differences”, Machine Learning, 3: 9-44, 1988. This paper provides the first formal results in the theory of temporal-difference (TD) methods.

- “Learning to predict”:

-Use of past experience (historical information) with an incompletely known system to predict its future behavior.

-“Learning to predict” is one of the most basic and prevalent kinds of learning.

-In prediction learning, training examples can be taken directly from the temporal sequence of ordinary sensory input; no special supervisor or teacher is required.

-Conventional prediction-learning methods (Widrow-Hoff, LMS, Delta Rule, Backpropagation):

-Driven by the error between predicted and actual outcomes.


-TD methods:

• Driven by the error or difference between temporally successive predictions: learning occurs whenever there is a change in prediction over time.

-Advantages of TD methods over conventional methods:

• They are more incremental, and therefore easier to compute.

• They tend to make more efficient use of their experience: they converge faster and produce better predictions.

-TD Approach:

• Predictions are based on numerical features combined using adjustable parameters or “weights”, similar to connectionist models (neural networks).

-TD and supervised-learning approaches to prediction:

• Historically, the most important learning paradigm has been supervised learning: the learner is asked to associate pairs of items (input, output).

• Supervised learning has been used in pattern classification, concept acquisition, learning from examples, system identification, and associative memory.

[Figure: supervised-learning block diagram. The input drives both the system A (real output) and its estimator (estimated output); the error between the two outputs feeds a learning algorithm that adjusts the estimator parameters.]

-Single-step and multi-step prediction problems:

• Single-step: all information about the correctness of each prediction is revealed at once. This is the setting of supervised-learning methods.

• Multi-step: correctness is not revealed until more than one step after the prediction is made, but partial information relevant to its correctness is revealed at each step. This is the setting of TD learning methods.

-Computational issues:

• Sutton introduces a particular TD procedure by formally relating it to a classical supervised-learning procedure, the Widrow-Hoff rule (also known as the “delta rule”, the ADALINE (Adaptive Linear Element), and the Least Mean Squares (LMS) filter).

• We consider multi-step prediction problems in which experience comes in observation-outcome sequences of the form $x_1, x_2, x_3, \ldots, x_m, z$, where each $x_t$ is a vector of observations available at time t in the sequence, and z is the outcome of the sequence. Also, $x_t\in\mathbb{R}^n$ and $z\in\mathbb{R}$.

-Computational issues (cont.):

• For each observation-outcome sequence, the learner produces a corresponding sequence of predictions $P_1, P_2, P_3, \ldots, P_m$, each of which is an estimate of z.

• Predictions $P_t$ are based on a vector of modifiable parameters w: $P_t = P(x_t, w)$.

• All learning procedures are expressed as rules for updating w. For each observation, an increment to w, denoted $\Delta w_t$, is determined. After a complete sequence has been processed, w is changed by (the sum of) all the sequence’s increments:

$$w \leftarrow w + \sum_{t=1}^{m} \Delta w_t.$$

• The supervised-learning approach treats each sequence of observations and its outcome as a sequence of observation-outcome pairs: $(x_1,z), (x_2,z), \ldots, (x_m,z)$. In this case the increment due to time t depends on the error between $P_t$ and z, and on how changing w will affect $P_t$.

-Computational issues (cont.):

• Then, a prototypical supervised-learning update procedure is

$$\Delta w_t = \alpha\,(z - P_t)\,\nabla_w P_t,$$

where $\alpha$ is a positive parameter affecting the rate of learning, and the gradient $\nabla_w P_t$ is the vector of partial derivatives of $P_t$ with respect to each component of w.

• Special case: consider $P_t$ a linear function of $x_t$ and w, that is, $P_t = w^T x_t = \sum_i w(i)\,x_t(i)$, where w(i) and $x_t(i)$ are the ith components of w and $x_t$. Then $\nabla_w P_t = x_t$. Thus,

$$\Delta w_t = \alpha\,(z - w^T x_t)\,x_t,$$

which corresponds to the Widrow-Hoff rule.

• This equation depends critically on z, and thus cannot be determined until the end of the sequence, when z becomes known. All observations and predictions made during a sequence must be remembered until its end: $\Delta w_t$ cannot be computed incrementally.

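-TD Procedure:

There is a temporal-difference procedure that produces exactly the same result and can be computed incrementally. The key is to represent the error $z - P_t$ as a sum of the changes in predictions, as follows:

$$z - P_t = \sum_{k=t}^{m} (P_{k+1} - P_k), \quad \text{where } P_{m+1} \overset{\text{def}}{=} z.$$

Using this equation and the prototypical supervised-learning equation, we have

$$w \leftarrow w + \sum_{t=1}^{m} \alpha (z - P_t)\nabla_w P_t = w + \sum_{t=1}^{m} \alpha \sum_{k=t}^{m} (P_{k+1}-P_k)\nabla_w P_t = w + \sum_{k=1}^{m} \alpha \sum_{t=1}^{k} (P_{k+1}-P_k)\nabla_w P_t = w + \sum_{t=1}^{m} \alpha (P_{t+1}-P_t) \sum_{k=1}^{t} \nabla_w P_k,$$

that is,

$$\Delta w_t = \alpha\,(P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_w P_k.$$

This equation can be computed incrementally, because it depends only on a pair of successive predictions and on the sum of all past values of the gradient.

We refer to this procedure as TD(1).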

The TD($\lambda$) family of learning procedures:

• The “hallmark” of temporal-difference methods is their sensitivity to changes in successive predictions rather than to the overall error between predictions and the final outcome.

• In response to an increase (decrease) in prediction from $P_t$ to $P_{t+1}$, an increment $\Delta w_t$ is determined that increases (decreases) the predictions for some or all of the preceding observation vectors $x_1, \ldots, x_t$.

• TD(1) is the special case where all the predictions are altered to an equal extent.

• Now, consider the case where greater alterations are made to more recent predictions. We consider an exponential weighting with recency, in which alterations to the predictions of observation vectors occurring k steps in the past are weighted according to $\lambda^k$ for $0\le\lambda\le 1$:

$$\Delta w_t = \alpha\,(P_{t+1} - P_t) \sum_{k=1}^{t} \lambda^{t-k}\,\nabla_w P_k.$$

[Figure: weighting $\lambda^k$ as k increases, for $\lambda=1$ (TD(1)), $0<\lambda<1$, and $\lambda=0$ (TD(0)).]

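• For $\lambda=0$ we have the TD(0) procedure:

$$\Delta w_t = \alpha\,(P_{t+1} - P_t)\,\nabla_w P_t.$$

• For $\lambda=1$ we have the TD(1) procedure, which is equivalent to the Widrow-Hoff rule, except that TD(1) is incremental:

$$\Delta w_t = \alpha\,(P_{t+1} - P_t) \sum_{k=1}^{t} \nabla_w P_k.$$

• Alterations of past predictions can be weighted in ways other than the exponential form given previously, but the exponential form can be computed incrementally. Let

$$e_t = \sum_{k=1}^{t} \lambda^{t-k}\,\nabla_w P_k,$$

so that

$$e_{t+1} = \nabla_w P_{t+1} + \lambda\, e_t, \qquad \Delta w_t = \alpha\,(P_{t+1} - P_t)\, e_t.$$

The vectors $e_t$ are also referred to in the literature as eligibility vectors.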
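The relationship between the Widrow-Hoff rule and the TD($\lambda$) family can be checked numerically for linear predictions $P_t = w^T x_t$, where $\nabla_w P_t = x_t$. The sketch below accumulates the increments over one observation-outcome sequence with w held fixed during the sequence, as in the slides; the function names and the three-step sequence are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def widrow_hoff_increment(xs, z, w, alpha):
    """Sum over t of alpha * (z - w^T x_t) * x_t; every term needs the outcome z."""
    total = [0.0] * len(w)
    for x in xs:
        error = z - dot(w, x)
        total = [s + alpha * error * xi for s, xi in zip(total, x)]
    return total


def td_lambda_increment(xs, z, w, alpha, lam):
    """Sum over t of alpha * (P_{t+1} - P_t) * e_t, with e_t = x_t + lam * e_{t-1}."""
    preds = [dot(w, x) for x in xs] + [z]    # P_1, ..., P_m and P_{m+1} = z
    e = [0.0] * len(w)                       # eligibility vector
    total = [0.0] * len(w)
    for t, x in enumerate(xs):
        e = [xi + lam * ei for xi, ei in zip(x, e)]
        diff = preds[t + 1] - preds[t]       # change between successive predictions
        total = [s + alpha * diff * ei for s, ei in zip(total, e)]
    return total


# Hypothetical sequence of m = 3 observation vectors with outcome z.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
z, w, alpha = 1.0, [0.2, -0.1], 0.5

wh = widrow_hoff_increment(xs, z, w, alpha)
td1 = td_lambda_increment(xs, z, w, alpha, lam=1.0)
td0 = td_lambda_increment(xs, z, w, alpha, lam=0.0)
```

Here `td1` and `wh` should agree up to rounding (both are approximately [0.85, 1.0]), while `td0` (approximately [0.3, 0.55]) differs because each temporal difference only adjusts the most recent prediction. The TD version needs only the previous prediction and the running eligibility vector at each step.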

(Material taken from Neuro-Dynamic Programming, Chapter 5)

Monte Carlo Simulation: brief overview

• Suppose v is a random variable with unknown mean m that we want to estimate.

• Using Monte Carlo simulation to estimate m: generate a number of samples $v_1, v_2, \ldots, v_N$, and then estimate m by forming the sample mean

$$M_N = \frac{1}{N}\sum_{k=1}^{N} v_k.$$

Also, we can compute the sample mean recursively:

$$M_{N+1} = M_N + \frac{1}{N+1}\,(v_{N+1} - M_N),$$

with $M_1 = v_1$.

Monte Carlo simulation: case of iid samples

Suppose the N samples $v_1, v_2, \ldots, v_N$ are independent and identically distributed, with mean m and variance $\sigma^2$. Then we have

$$E[M_N] = \frac{1}{N}\sum_{k=1}^{N} E[v_k] = m,$$

so $M_N$ is said to be an unbiased estimator of m. Also, its variance is given by

$$\mathrm{Var}(M_N) = \frac{1}{N^2}\sum_{k=1}^{N} \mathrm{Var}(v_k) = \frac{\sigma^2}{N}.$$

As $N\to\infty$ the variance of $M_N$ converges to zero, so $M_N$ converges to m.

Also, the strong law of large numbers provides an additional property: the sequence $M_N$ converges to m with probability one. The estimator is consistent.
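A short sketch of the recursive sample mean, compared against the batch mean on simulated iid samples. The Gaussian distribution, its parameters, and the seed are arbitrary assumptions made only for the demonstration.

```python
import random


def recursive_mean(samples):
    """Compute M_N via M_{k+1} = M_k + (v_{k+1} - M_k) / (k + 1), so that M_1 = v_1."""
    mean = 0.0
    for k, v in enumerate(samples):
        mean += (v - mean) / (k + 1)
    return mean


random.seed(0)
samples = [random.gauss(2.0, 1.0) for _ in range(10_000)]   # true mean m = 2

batch_mean = sum(samples) / len(samples)
running_mean = recursive_mean(samples)
```

The two estimates should agree up to rounding, and both should be close to m = 2, since the standard deviation of $M_N$ is $\sigma/\sqrt{N} = 0.01$ here. The recursive form needs no stored samples, which is what makes the incremental policy-evaluation updates possible.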

Policy Evaluation by Monte Carlo simulation

-Consider the stochastic shortest path problem with state space {0, 1, 2, …, n}, where 0 is a cost-free absorbing state. Let g(i, j) be the cost of a transition from i to j (under the control action μ(i), with transition probabilities $p_{ij}(\mu(i))$). Suppose that we have a fixed stationary policy μ (proper) and we want to calculate, using simulation, the corresponding cost-to-go vector

$J^\mu = \big(J^\mu(1), J^\mu(2), \ldots, J^\mu(n)\big)'$.

Approach: generate, starting from each i, many sample state trajectories and average the corresponding costs to obtain an approximation to $J^\mu(i)$. Instead of doing this separately for each state i, let us use each trajectory to obtain cost samples for all states visited by the trajectory, by considering the cost of the trajectory portion that starts at each intermediate state.

Suppose that a number of simulation runs are performed, each ending at the termination state 0. Consider the m-th time a given state i0 is encountered, and let (i0 , i1 ,…, iN) be the remainder of the corresponding trajectory, where iN = 0.

Then, let c( i0 , m ) the cumulative cost up to reaching state 0, then

Some assumptions: different simulated trajectories are statistically independent, and each trajectory is generated according to the Markov process determined by the policy μ. Then we have,

).,(...),(),( 1100 NN iigiigmic

)}.m,i(c{E)i(J

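The estimate of $J^\mu(i)$ is obtained by forming the sample mean

$$J(i) = \frac{1}{K} \sum_{m=1}^{K} c(i, m)$$

subsequent to the K-th encounter with state i. The sample mean can be expressed in iterative form:

$$J(i) := J(i) + \gamma_m \big(c(i, m) - J(i)\big), \quad m = 1, 2, \dots,$$

where

$$\gamma_m = \frac{1}{m}, \quad m = 1, 2, \dots,$$

starting with J(i) = 0.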

Consider the trajectory $(i_0, i_1, \dots, i_N)$, and let k be an integer such that $1 \le k \le N$. The trajectory contains the subtrajectory $(i_k, i_{k+1}, \dots, i_N)$: a sample trajectory with initial state $i_k$ that can be used to update $J(i_k)$ using the iterative equation previously presented.

Algorithm*: run a simulation and generate the state trajectory $(i_0, i_1, \dots, i_N)$; then update the estimates $J(i_k)$, for each k = 0, 1, …, N-1, using the formula

$$J(i_k) := J(i_k) + \gamma(i_k) \big( g(i_k, i_{k+1}) + g(i_{k+1}, i_{k+2}) + \dots + g(i_{N-1}, i_N) - J(i_k) \big).$$

The step size $\gamma(i_k)$ can change from one iteration to the next.

*Additional details: Neuro-Dynamic Programming by Bertsekas and Tsitsiklis, Chapter 5.
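A minimal Python sketch of this every-visit Monte Carlo policy evaluation is given below. The chain used here is purely hypothetical: from state i the process moves to i-1 or stays at i with probability 0.5 each, every transition costs g = 1, and the step size is taken as 1 over the number of visits to the state. Under these assumptions the exact costs-to-go are J(1) = 2 and J(2) = 4.

```python
import random


def simulate(start, rng):
    """Simulate one trajectory of the assumed chain until state 0 is reached."""
    states, costs = [start], []
    i = start
    while i != 0:
        j = i - 1 if rng.random() < 0.5 else i  # move down or stay
        costs.append(1.0)                       # g(i, j) = 1 for every transition
        states.append(j)
        i = j
    return states, costs


def mc_policy_evaluation(n_states, runs, rng):
    """Every-visit Monte Carlo estimate of the cost-to-go J(i)."""
    J = [0.0] * (n_states + 1)      # J[0] = 0 for the termination state
    visits = [0] * (n_states + 1)
    for _ in range(runs):
        states, costs = simulate(n_states, rng)
        tail = 0.0
        # Walk backwards so that `tail` is the cost from i_k to termination.
        for k in range(len(costs) - 1, -1, -1):
            tail += costs[k]
            i = states[k]
            visits[i] += 1
            J[i] += (tail - J[i]) / visits[i]   # step size gamma = 1 / visits
    return J


J = mc_policy_evaluation(n_states=2, runs=5000, rng=random.Random(0))
print(J)
```

Each trajectory contributes one cost sample for every visit to every state along it, which is the reuse of subtrajectories described above.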

Monte Carlo simulation using temporal differences

Here we consider an implementation of the Monte Carlo policy evaluation algorithm that incrementally updates the cost-to-go estimates J(i).

First, assume that for any trajectory $i_0, i_1, \dots, i_N$ with $i_N = 0$, we have $i_k = 0$ for k > N, $g(i_k, i_{k+1}) = 0$ for $k \ge N$, and J(0) = 0. Also, the policy under consideration is proper.

Let us rewrite the previous formula in the following form:

$$
\begin{aligned}
J(i_k) := J(i_k) + \gamma \Big( & \big(g(i_k, i_{k+1}) + J(i_{k+1}) - J(i_k)\big) \\
 & + \big(g(i_{k+1}, i_{k+2}) + J(i_{k+2}) - J(i_{k+1})\big) \\
 & + \dots \\
 & + \big(g(i_{N-1}, i_N) + J(i_N) - J(i_{N-1})\big) \Big).
\end{aligned}
$$

Note that we use the property $J(i_N) = 0$; the intermediate terms $J(i_{k+1}), \dots, J(i_{N-1})$ cancel in pairs.

Equivalently, we can rewrite the previous equation as follows:

$$J(i_k) := J(i_k) + \gamma \big( d_k + d_{k+1} + \dots + d_{N-1} \big),$$

where the quantities

$$d_k = g(i_k, i_{k+1}) + J(i_{k+1}) - J(i_k)$$

are called temporal differences (TD). The temporal difference $d_k$ represents the difference between the estimate

$$g(i_k, i_{k+1}) + J(i_{k+1})$$

of the cost-to-go based on the simulated outcome of the current stage, and the current estimate $J(i_k)$.

Advantage: the estimates can be computed incrementally. For the l-th temporal difference $d_l$ we have

$$J(i_k) := J(i_k) + \gamma\, d_l, \quad l = k, \dots, N-1,$$

as soon as $d_l$ is available.

The temporal difference $d_l$ appears in the update formula for $J(i_k)$ for every $k \le l$. Hence,

$$J(i_k) := J(i_k) + \gamma\, d_l = J(i_k) + \gamma \big( g(i_l, i_{l+1}) + J(i_{l+1}) - J(i_l) \big), \quad k = 0, \dots, l,$$

as soon as the transition to $i_{l+1}$ has been simulated.

Monte Carlo simulation using temporal differences: TD(λ)

Here we introduce the TD(λ) algorithm as a stochastic approximation method for solving a suitably reformulated Bellman equation.

The Monte Carlo evaluation algorithm can be viewed as a Robbins-Monro stochastic approximation method (more details in Chapter 4 of Neuro-Dynamic Programming) for solving the equations

$$J^\mu(i_k) = E\Big\{ \sum_{m=0}^{\infty} g(i_{k+m}, i_{k+m+1}) \Big\}$$

for the unknowns $J^\mu(i_k)$, as $i_k$ ranges over the states in the state space. Other algorithms can be generated in a similar way, e.g., by starting from another system of equations involving $J^\mu$ and then replacing expectations by single-sample estimates. For example, from Bellman's equation

$$J^\mu(i_k) = E\big\{ g(i_k, i_{k+1}) + J^\mu(i_{k+1}) \big\}$$

the stochastic approximation method takes the form

$$J(i_k) := J(i_k) + \gamma \big( g(i_k, i_{k+1}) + J(i_{k+1}) - J(i_k) \big),$$

which is applied each time that state $i_k$ is visited.

Now fix a nonnegative integer l and take into consideration the cost of the first l+1 transitions. The stochastic algorithm could then be based on the (l+1)-step Bellman equation

$$J^\mu(i_k) = E\Big\{ \sum_{m=0}^{l} g(i_{k+m}, i_{k+m+1}) + J^\mu(i_{k+l+1}) \Big\}.$$

Without any special knowledge to select one value of l over another, we consider forming a weighted average of all possible multistep Bellman equations. Specifically, we fix some λ < 1, multiply the (l+1)-step equation by $(1-\lambda)\lambda^l$, and sum over all nonnegative l.

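Then, we have

$$J^\mu(i_k) = (1-\lambda)\, E\Big\{ \sum_{l=0}^{\infty} \lambda^l \Big( \sum_{m=0}^{l} g(i_{k+m}, i_{k+m+1}) + J^\mu(i_{k+l+1}) \Big) \Big\}.$$

Interchanging the order of the two summations, and using the fact that

$$(1-\lambda) \sum_{l=m}^{\infty} \lambda^l = \lambda^m,$$

we have

$$
\begin{aligned}
J^\mu(i_k) &= E\Big\{ (1-\lambda) \sum_{m=0}^{\infty} g(i_{k+m}, i_{k+m+1}) \sum_{l=m}^{\infty} \lambda^l + \sum_{l=0}^{\infty} J^\mu(i_{k+l+1}) (\lambda^l - \lambda^{l+1}) \Big\} \\
&= E\Big\{ \sum_{m=0}^{\infty} \lambda^m \big( g(i_{k+m}, i_{k+m+1}) + J^\mu(i_{k+m+1}) - J^\mu(i_{k+m}) \big) \Big\} + J^\mu(i_k).
\end{aligned}
$$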

The previous equation can be expressed in terms of the temporal differences as follows:

$$J^\mu(i_k) = E\Big\{ \sum_{m=k}^{\infty} \lambda^{m-k} d_m \Big\} + J^\mu(i_k),$$

where

$$d_m = g(i_m, i_{m+1}) + J^\mu(i_{m+1}) - J^\mu(i_m)$$

are the temporal differences, and $E\{d_m\} = 0$ for all m (by Bellman's equation).

The Robbins-Monro stochastic approximation method corresponding to the previous equation is

$$J(i_k) := J(i_k) + \gamma \sum_{m=k}^{\infty} \lambda^{m-k} d_m,$$

where γ is a step size parameter (it can change from iteration to iteration) and the temporal differences $d_m$ are evaluated with the current estimates J in place of $J^\mu$. The above equation provides a family of algorithms, one for each choice of λ, known as TD(λ). Note that for λ = 1 we recover the Monte Carlo policy evaluation method, also called TD(1).

For λ = 0 we have another limiting case. Using the convention $0^0 = 1$, the TD(0) method is

$$J(i_k) := J(i_k) + \gamma \big( g(i_k, i_{k+1}) + J(i_{k+1}) - J(i_k) \big).$$

This update coincides with the stochastic approximation method based on the one-step Bellman equation presented previously.

Off-line and on-line variants

When all of the updates are carried out simultaneously, after the entire trajectory has been simulated, we have the off-line version of TD(λ). Alternatively, when the updates are carried out one term at a time, as soon as each temporal difference becomes available, we have the on-line version of TD(λ).
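The on-line variant of TD(λ) can be sketched in Python with a table of per-state eligibility coefficients. This is only an illustration under stated assumptions: the chain is hypothetical (from state i, move to i-1 or stay with probability 0.5 each, cost 1 per transition, so J(1) = 2 and J(2) = 4), and the diminishing step size schedule is an arbitrary choice, not taken from the slides.

```python
import random


def td_lambda(n_states, runs, lam, rng):
    """On-line tabular TD(lambda) for an undiscounted shortest path problem."""
    J = [0.0] * (n_states + 1)            # J[0] stays 0 (termination state)
    for n in range(runs):
        gamma = 1.0 / (10.0 + 0.05 * n)   # assumed diminishing step size
        z = [0.0] * (n_states + 1)        # z[s] = sum of lam**(m - k) over visits k to s
        i = n_states
        while i != 0:
            j = i - 1 if rng.random() < 0.5 else i
            d = 1.0 + J[j] - J[i]         # temporal difference d_m with g(i, j) = 1
            z = [lam * e for e in z]      # decay the weights of earlier visits
            z[i] += 1.0                   # the current state gets weight lam**0 = 1
            for s in range(1, n_states + 1):
                J[s] += gamma * z[s] * d  # J(i_k) += gamma * lam**(m - k) * d_m
            i = j
    return J


J_td0 = td_lambda(n_states=2, runs=10_000, lam=0.0, rng=random.Random(0))
J_td07 = td_lambda(n_states=2, runs=10_000, lam=0.7, rng=random.Random(0))
print(J_td0, J_td07)
```

With `lam=0.0` only the state just left is updated, which is TD(0); values of `lam` closer to 1 spread each temporal difference over more of the previously visited states.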

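Discounted problem

For a discounted problem with discount factor α, the (l+1)-step Bellman equation is

$$J^\mu(i_k) = E\Big\{ \sum_{m=0}^{l} \alpha^m g(i_{k+m}, i_{k+m+1}) + \alpha^{l+1} J^\mu(i_{k+l+1}) \Big\}.$$

Specifically, we fix some λ < 1, multiply by $(1-\lambda)\lambda^l$, and sum over all nonnegative l:

$$J^\mu(i_k) = (1-\lambda)\, E\Big\{ \sum_{l=0}^{\infty} \lambda^l \Big( \sum_{m=0}^{l} \alpha^m g(i_{k+m}, i_{k+m+1}) + \alpha^{l+1} J^\mu(i_{k+l+1}) \Big) \Big\}.$$

Interchanging the order of the two summations, and using the fact that

$$(1-\lambda) \sum_{l=m}^{\infty} \lambda^l = \lambda^m,$$

we have

$$
\begin{aligned}
J^\mu(i_k) &= E\Big\{ (1-\lambda) \sum_{m=0}^{\infty} \alpha^m g(i_{k+m}, i_{k+m+1}) \sum_{l=m}^{\infty} \lambda^l + \sum_{l=0}^{\infty} \alpha^{l+1} J^\mu(i_{k+l+1}) (\lambda^l - \lambda^{l+1}) \Big\} \\
&= E\Big\{ \sum_{m=0}^{\infty} \lambda^m \big( \alpha^m g(i_{k+m}, i_{k+m+1}) + \alpha^{m+1} J^\mu(i_{k+m+1}) - \alpha^m J^\mu(i_{k+m}) \big) \Big\} + J^\mu(i_k).
\end{aligned}
$$

In terms of the temporal differences defined by

$$d_m = g(i_m, i_{m+1}) + \alpha J^\mu(i_{m+1}) - J^\mu(i_m),$$

we have

$$J^\mu(i_k) = E\Big\{ \sum_{m=k}^{\infty} (\alpha\lambda)^{m-k} d_m \Big\} + J^\mu(i_k).$$

Again we have $E\{d_m\} = 0$ for all m.

From here on, the development is entirely similar to the development for the undiscounted case. The only difference is that α enters in the definition of the temporal difference and that λ is replaced by αλ. In particular, we have

$$J(i_k) := J(i_k) + \gamma \sum_{m=k}^{\infty} (\alpha\lambda)^{m-k} d_m.$$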

Temporal-Difference Learning (TD(λ)): linear approximation

TD(λ) is used to tune the basis function weights of a linear approximation. Topics:

• Value function
   - Autonomous systems
   - Controlled systems: approximate policy iteration, controlled TD
• Q-function
• Relationship with approximate value iteration
• Historical view

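Value function: autonomous systems, problem formulation

Autonomous process:

$$x_{t+1} = f(x_t, w_t)$$

Value function:

$$V(x) = E\Big[ \sum_{t=0}^{\infty} \alpha^t r(x_t) \,\Big|\, x_0 = x \Big],$$

where r(x) is a scalar reward and $\alpha \in [0, 1)$ is a discount factor.

Linear approximation:

$$\tilde{v}(x, u) = \sum_{k=1}^{K} u(k)\, \phi_k(x),$$

where $\phi_1, \phi_2, \dots, \phi_K$ is a collection of basis functions and u is the weight vector.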

Value function: autonomous systems

Suppose that we observe a sequence of states $x_0, x_1, x_2, \dots$, and that at time t the weight vector has been set to some value $u_t$. The temporal difference $d_t$ corresponding to the transition from $x_t$ to $x_{t+1}$ is

$$d_t = r(x_t) + \alpha\, \tilde{v}(x_{t+1}, u_t) - \tilde{v}(x_t, u_t).$$

• $\tilde{v}(x_t, u_t)$ is a prediction of $V(x_t)$ given our current approximation $\tilde{v}(\cdot, u_t)$ to the value function.
• $r(x_t) + \alpha\, \tilde{v}(x_{t+1}, u_t)$ is an "improved prediction" that incorporates knowledge of the reward $r(x_t)$ and the next state $x_{t+1}$.

Value function: autonomous systems

Given an arbitrary initial weight vector $u_0$, the goal is to find the "correct" weight vector. The updating law of the weight vector is

$$u_{t+1} = u_t + \gamma_t\, d_t\, z_t,$$

where $\gamma_t$ is a scalar step size and $z_t$ is called the eligibility vector.

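Value function: autonomous systems

The eligibility vector is defined as

$$z_t = \sum_{\tau=0}^{t} (\alpha\lambda)^{t-\tau}\, \nabla_u \tilde{v}(x_\tau, u_t),$$

where, for the linear approximation,

$$\nabla_u \tilde{v}(x, u) = \nabla_u \sum_{k=1}^{K} u(k)\, \phi_k(x) = \big(\phi_1(x), \dots, \phi_K(x)\big) = \phi(x).$$

The eligibility vector provides a direction for the adjustment of $u_t$ such that $\tilde{v}(x_t, u_t)$ moves towards the improved prediction.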

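Value function: autonomous systems

Note that the eligibility vectors can be recursively updated according to

$$
\begin{aligned}
z_{t+1} &= \sum_{\tau=0}^{t+1} (\alpha\lambda)^{t+1-\tau} \phi(x_\tau) \\
&= \alpha\lambda \sum_{\tau=0}^{t} (\alpha\lambda)^{t-\tau} \phi(x_\tau) + \phi(x_{t+1}) \\
&= \alpha\lambda\, z_t + \phi(x_{t+1}),
\end{aligned}
$$

where $\phi(x) = (\phi_1(x), \dots, \phi_K(x))$ is the vector of basis function values.

Consequently, when λ = 0 we have $z_t = \phi(x_t)$, and the updating law of the weight vector can be rewritten as

$$u_{t+1} = u_t + \gamma_t\, d_t\, \phi(x_t) = u_t + \gamma_t\, \phi(x_t) \big( r(x_t) + \alpha\, \tilde{v}(x_{t+1}, u_t) - \tilde{v}(x_t, u_t) \big).$$

Here $\gamma_t$ is the step size, $\phi(x_t)$ is the direction, and the temporal difference acts as the "trigger" of the update. That means only the last state has an effect on the update. In the more general case of λ > 0, previously visited states also affect the update through $z_t$.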
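The update law with the recursive eligibility vector can be sketched in a few lines of Python. Everything specific below is an assumption chosen so that the answer is known in closed form: the state is binary, the next state is uniform on {0, 1} regardless of the current one (that is, f(x, w) = w), the reward is r(x) = x, the discount factor is 0.5, and the basis functions are φ(x) = (1, x). Under these assumptions V(x) = 0.5 + x, so the weights should approach u = (0.5, 1.0). The step size schedule is also an arbitrary choice.

```python
import random

ALPHA = 0.5  # assumed discount factor


def phi(x):
    """Assumed basis functions: a constant and the state itself."""
    return (1.0, float(x))


def v_tilde(x, u):
    """Linear approximation: sum over k of u(k) * phi_k(x)."""
    return sum(uk * pk for uk, pk in zip(u, phi(x)))


def td_lambda_linear(steps, lam, rng):
    u = [0.0, 0.0]
    x = 0
    z = list(phi(x))                     # z_0 = phi(x_0)
    for t in range(steps):
        x_next = rng.randrange(2)        # x_{t+1} = f(x_t, w_t) = w_t
        # d_t = r(x_t) + alpha * v(x_{t+1}, u_t) - v(x_t, u_t), with r(x) = x
        d = float(x) + ALPHA * v_tilde(x_next, u) - v_tilde(x, u)
        step = 1.0 / (10.0 + 0.01 * t)   # assumed diminishing step size
        u = [uk + step * d * zk for uk, zk in zip(u, z)]
        # z_{t+1} = alpha * lam * z_t + phi(x_{t+1})
        z = [ALPHA * lam * zk + pk for zk, pk in zip(z, phi(x_next))]
        x = x_next
    return u


u = td_lambda_linear(steps=100_000, lam=0.5, rng=random.Random(0))
print(u)
```

Because the basis functions can represent V exactly in this toy problem, the limit does not depend on λ; in general the limit and its approximation error do depend on λ.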

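Convergence: linear approximators

Under appropriate technical conditions:

i) For any $\lambda \in [0, 1]$, there exists a vector $u(\lambda)$ such that the sequence $u_t$ generated by the algorithm converges (with probability one) to $u(\lambda)$.

ii) The limit of convergence $u(\lambda)$ satisfies the error bound established in [59]:

$$\big\| \tilde{v}(\cdot, u(\lambda)) - V \big\|_\pi \le \frac{1 - \lambda\alpha}{1 - \alpha}\, \min_u \big\| \tilde{v}(\cdot, u) - V \big\|_\pi,$$

where the norm is defined by

$$\| v \|_\pi = \Big( \sum_{x} \pi(x)\, v(x)^2 \Big)^{1/2},$$

with π the steady-state distribution of the process.

[59] J. N. Tsitsiklis and B. Van Roy, An Analysis of Temporal-Difference Learning with Function Approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.

[10] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming, Chapter 6.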

Value function: controlled systems

Unlike an autonomous system, a controlled system cannot be passively simulated and observed. Control decisions are required, and they influence the system's dynamics.

The objective here is to approximate the optimal value function of a controlled system.

Approximate policy iteration builds on a classical dynamic programming algorithm, policy iteration.

Given the value function $v(\cdot, \mu)$ corresponding to a stationary policy μ, an improved policy $\bar{\mu}$ can be defined by

$$\bar{\mu}(x) = \arg\max_{a \in A} E_w\big[ r(x, a) + \alpha\, v(f(x, a, w), \mu) \big].$$

In particular, $v(x, \bar{\mu}) \ge v(x, \mu)$ for all $x \in X$. Furthermore, a sequence of policies $\{\mu_m \mid m = 0, 1, 2, \dots\}$ initialized with some arbitrary $\mu_0$ and updated according to

$$\mu_{m+1}(x) = \arg\max_{a \in A} E_w\big[ r(x, a) + \alpha\, v(f(x, a, w), \mu_m) \big]$$

converges to an optimal policy $\mu^*$.
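Exact policy iteration can be illustrated on a very small problem. The Python sketch below uses a hypothetical three-state system invented for this example: action 1 tries to advance the state (it succeeds when the disturbance w equals 1), action 0 stays put, the reward is r(x, a) = x - 0.1a, and the discount factor is 0.9. Policy evaluation is done by repeated substitution rather than by solving the linear system, which is a simplification.

```python
ALPHA = 0.9               # assumed discount factor
STATES = [0, 1, 2]
ACTIONS = [0, 1]          # 0 = stay, 1 = try to advance
DISTURBANCES = [0, 1]     # equally likely values of w


def f(x, a, w):
    """Assumed system function: advance by one if a = 1 and w = 1."""
    return min(x + a * w, 2)


def r(x, a):
    """Assumed reward: higher states pay more, advancing costs 0.1."""
    return float(x) - 0.1 * a


def q_value(x, a, v):
    """E_w[ r(x, a) + alpha * v(f(x, a, w)) ] for equally likely w."""
    total = sum(r(x, a) + ALPHA * v[f(x, a, w)] for w in DISTURBANCES)
    return total / len(DISTURBANCES)


def evaluate(policy, sweeps=2000):
    """Approximate v(., mu) by iterating the fixed-policy Bellman equation."""
    v = {x: 0.0 for x in STATES}
    for _ in range(sweeps):
        v = {x: q_value(x, policy[x], v) for x in STATES}
    return v


def policy_iteration():
    policy = {x: 0 for x in STATES}      # arbitrary initial policy mu_0
    while True:
        v = evaluate(policy)
        improved = {x: max(ACTIONS, key=lambda a: q_value(x, a, v))
                    for x in STATES}
        if improved == policy:           # no change: the policy is greedy for its own value
            return policy, v
        policy = improved


policy, v = policy_iteration()
print(policy, v)
```

In this toy problem the iteration stops after two evaluations, with the policy that advances from states 0 and 1 and stays in state 2.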

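Value function: controlled systems

Approximate policy iteration

Approximate each value function by a linear combination of basis functions,

$$\tilde{v}(x, u) = \sum_{k=1}^{K} u(k)\, \phi_k(x),$$

generating a sequence of weight vectors $u_1, u_2, \dots$

Start with an arbitrary initial stationary policy $\tilde{\mu}_0$. For each m, select $u_{m+1}$ such that

$$\tilde{v}(x, u_{m+1}) \approx v(x, \tilde{\mu}_m),$$

and let

$$\tilde{\mu}_{m+1}(x) = \arg\max_{a \in A} E_w\big[ r(x, a) + \alpha\, \tilde{v}(f(x, a, w), u_{m+1}) \big].$$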

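Value function: two loops

External loop: once a converged weight vector $u_{m+1}$ has been found, update the present stationary policy from $\tilde{\mu}_m$ to $\tilde{\mu}_{m+1}$.

Internal loop: apply temporal-difference learning to generate each iterate weight vector $u_{m+1}$, i.e., the value function approximation for the policy $\tilde{\mu}_m$.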

Initialization: the internal loop for $u_{m+1}$ is started from the previous weight vector $u_m$.

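Value function: error bound

A result from Section 6.2 of [10] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996:

if there exists some ε > 0 such that, for all m,

$$\max_{x \in X} \big| \tilde{v}(x, u_{m+1}) - v(x, \tilde{\mu}_m) \big| \le \epsilon,$$

then

$$\limsup_{m \to \infty}\, \max_{x \in X} \big| v(x, \tilde{\mu}_m) - v^*(x) \big| \le \frac{2\alpha\epsilon}{(1 - \alpha)^2},$$

where $v^*$ denotes the optimal value function.

The external sequence does not always converge.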

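Value function: controlled systems

Controlled TD

1. Arbitrarily initialize $u_0$ and $x_0$.
2. At each time t, generate a decision according to

$$a_t = \arg\max_{a \in A} E_w\big[ r(x_t, a) + \alpha\, \tilde{v}(f(x_t, a, w), u_t) \big].$$

3. Simulate the next state:

$$x_{t+1} = f(x_t, a_t, w_t).$$

4. Compute the temporal difference and update the weight vector:

$$d_t = r(x_t, a_t) + \alpha\, \tilde{v}(x_{t+1}, u_t) - \tilde{v}(x_t, u_t), \qquad u_{t+1} = u_t + \gamma_t\, d_t\, z_t.$$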

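Value function: controlled systems

Big problem: convergence.

A modification that has been found to be useful in practical applications involves adding "exploration noise" to the controls.

One approach to this end involves randomizing decisions by choosing at each time t a decision $a_t = a$, for $a \in A$, with probability

$$\frac{\exp\big( E_w[ r(x_t, a) + \alpha\, \tilde{v}(f(x_t, a, w), u_t) ] / \epsilon \big)}{\sum_{a' \in A} \exp\big( E_w[ r(x_t, a') + \alpha\, \tilde{v}(f(x_t, a', w), u_t) ] / \epsilon \big)},$$

where ε > 0 is a small scalar.

Note:
1) Every decision has probability > 0.
2) As $\epsilon \to 0$, the probability of choosing a decision $a_t$ such that

$$a_t = \arg\max_{a \in A} E_w\big[ r(x_t, a) + \alpha\, \tilde{v}(f(x_t, a, w), u_t) \big]$$

tends to 1.

Simple proof: write $Q_t(a) = E_w[ r(x_t, a) + \alpha\, \tilde{v}(f(x_t, a, w), u_t) ]$ as shorthand and let $a^*$ be the maximizing decision, assumed unique. Dividing numerator and denominator by $\exp(Q_t(a^*)/\epsilon)$,

$$\lim_{\epsilon \to 0} \frac{\exp(Q_t(a^*)/\epsilon)}{\sum_{a \in A} \exp(Q_t(a)/\epsilon)} = \lim_{\epsilon \to 0} \frac{1}{\sum_{a \in A} \exp\big( (Q_t(a) - Q_t(a^*))/\epsilon \big)} = 1,$$

since the term with $a = a^*$ equals 1 and every other term has $Q_t(a) - Q_t(a^*) < 0$, so it tends to 0.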
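The limiting behaviour of this randomized decision rule can be checked numerically. The Python sketch below computes the selection probabilities for a few values of ε; the score values stand in for the bracketed expectation for each decision and are made-up numbers. Subtracting the largest score before exponentiating is the same manipulation used in the proof, and it also avoids overflow for small ε.

```python
import math


def boltzmann_probabilities(scores, epsilon):
    """Probabilities proportional to exp(score / epsilon)."""
    best = max(scores)
    # Dividing numerator and denominator by exp(best / epsilon) leaves the
    # probabilities unchanged and keeps every exponent <= 0.
    weights = [math.exp((s - best) / epsilon) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


scores = [1.0, 2.0, 1.5]   # hypothetical values, one per decision
for eps in (10.0, 1.0, 0.1, 0.01):
    print(eps, boltzmann_probabilities(scores, eps))
```

For large ε the decisions are chosen almost uniformly, while for small ε nearly all of the probability is placed on the maximizing decision.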

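Q-function

Given V, decisions can be generated by

$$a_t = \arg\max_{a \in A} E_w\big[ r(x_t, a) + \alpha\, V(f(x_t, a, w)) \big].$$

Define a Q-function:

$$Q(x, a) = E_w\big[ r(x, a) + \alpha\, V(f(x, a, w)) \big],$$

so that

$$a_t = \arg\max_{a \in A} Q(x_t, a).$$

Q-learning is a variant of temporal-difference learning that approximates Q-functions rather than value functions, using a linear parameterization

$$Q(x, a) \approx \tilde{q}(x, a, u) = \sum_{k=1}^{K} u(k)\, \phi_k(x, a).$$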

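Q-function: Q-learning

1. Arbitrarily initialize $u_0$ and $x_0$.
2. At each time t, generate a decision according to

$$a_t = \arg\max_{a \in A} \tilde{q}(x_t, a, u_t).$$

3. Simulate the next state:

$$x_{t+1} = f(x_t, a_t, w_t).$$

4. Compute the temporal difference and update the weight vector:

$$d_t = r(x_t, a_t) + \alpha\, \tilde{q}(x_{t+1}, a_{t+1}, u_t) - \tilde{q}(x_t, a_t, u_t), \qquad u_{t+1} = u_t + \gamma_t\, d_t\, z_t,$$

where the eligibility vector is

$$z_t = \sum_{\tau=0}^{t} (\alpha\lambda)^{t-\tau}\, \phi(x_\tau, a_\tau).$$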

Q-function: exploration

As in the case of controlled TD, it is often desirable to incorporate exploration. Randomize decisions by choosing at each time t a decision $a_t = a$, for $a \in A$, with probability

$$\frac{\exp\big( \tilde{q}(x_t, a, u_t) / \epsilon \big)}{\sum_{a' \in A} \exp\big( \tilde{q}(x_t, a', u_t) / \epsilon \big)},$$

where ε > 0 is a small scalar.

Note:
1) Every decision has probability > 0.
2) As $\epsilon \to 0$, the probability of choosing a decision $a_t$ such that

$$a_t = \arg\max_{a \in A} \tilde{q}(x_t, a, u_t)$$

tends to 1.

The analysis of Q-learning bears many similarities with that of controlled TD, and results that apply to one can often be generalized in a straightforward way to accommodate the other.
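A small Python sketch of Q-learning with randomized decisions is given below. Several choices are assumptions made for the illustration and are not taken from the slides: the two-state system (the next state equals the chosen action with probability 0.8), the reward r(x, a) = x, the discount factor 0.5, one-hot basis functions over state-action pairs (which makes the linear approximation a look-up table), λ = 0, and the step size schedule. The sketch also uses the standard maximum over next decisions in the temporal difference instead of the sampled next decision. For this toy problem the optimal Q-values work out to (0.5, 0.8, 1.5, 1.8) for the pairs (0,0), (0,1), (1,0), (1,1).

```python
import math
import random

ALPHA = 0.5        # assumed discount factor
ACTIONS = (0, 1)


def phi(x, a):
    """One-hot basis functions over (state, action) pairs (assumed choice)."""
    features = [0.0] * 4
    features[2 * x + a] = 1.0
    return features


def q_tilde(x, a, u):
    """Linear approximation: sum over k of u(k) * phi_k(x, a)."""
    return sum(uk * pk for uk, pk in zip(u, phi(x, a)))


def choose(x, u, epsilon, rng):
    """Pick a decision with probability proportional to exp(q / epsilon)."""
    qs = [q_tilde(x, a, u) for a in ACTIONS]
    best = max(qs)
    weights = [math.exp((q - best) / epsilon) for q in qs]
    return rng.choices(ACTIONS, weights=weights)[0]


def q_learning(steps, epsilon, rng):
    u = [0.0] * 4
    x = 0
    for t in range(steps):
        a = choose(x, u, epsilon, rng)
        x_next = a if rng.random() < 0.8 else 1 - a      # toy dynamics
        target = float(x) + ALPHA * max(q_tilde(x_next, b, u) for b in ACTIONS)
        d = target - q_tilde(x, a, u)                    # temporal difference
        gamma = 1.0 / (10.0 + 0.005 * t)                 # assumed step size
        # lambda = 0, so the eligibility vector is just phi(x_t, a_t).
        u = [uk + gamma * d * pk for uk, pk in zip(u, phi(x, a))]
        x = x_next
    return u


u = q_learning(steps=100_000, epsilon=0.3, rng=random.Random(0))
print(u)
```

The exploration parameter keeps every state-action pair visited often enough for all four weights to be learned; with a purely greedy rule some pairs might never be tried.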

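Relationship with approximate value iteration

The classical value iteration algorithm can be described compactly in terms of the "dynamic programming operator" T:

$$(Tv)(x) = \max_{a \in A} E_w\big[ r(x, a) + \alpha\, v(f(x, a, w)) \big].$$

Approximate value iteration:

$$\tilde{v}(\cdot, u_{k+1}) = \arg\min_{\tilde{v}(\cdot, u),\; u \in \mathbb{R}^K} \big\| \tilde{v}(\cdot, u) - T \tilde{v}(\cdot, u_k) \big\|.$$

The iterates alternate between applying T and projecting back onto the set of representable functions:

$$\tilde{v}(\cdot, u_0) \to T\tilde{v}(\cdot, u_0) \to \tilde{v}(\cdot, u_1) \to T\tilde{v}(\cdot, u_1) \to \tilde{v}(\cdot, u_2) \to \dots$$

Disadvantage: approximate value iteration need not possess fixed points, and therefore should not be expected to converge. In fact, even in cases where a fixed point exists, and even when the system is autonomous, the algorithm can generate a diverging sequence of weight vectors.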

Controlled TD can be thought of as a stochastic approximation algorithm designed to converge on fixed points of approximate value iteration.

Advantage: controlled TD uses simulation to effectively bypass the need to explicitly compute the projections required for approximate value iteration.

• Autonomous systems
• Controlled systems: the possible introduction of exploration

Historical view: a long history and big names

• Sutton: temporal-difference learning, based on earlier work by Barto and Sutton on models for classical conditioning phenomena observed in animal behavior, and by Barto, Sutton, and Anderson on "actor-critic methods".
• Witten: the look-up table algorithm bears similarities with one he proposed a decade earlier.
• Watkins: Q-learning was proposed in his thesis, and the study of temporal-difference learning was integrated with classical ideas from dynamic programming and stochastic approximation theory.
• The work of Werbos and of Barto, Bradtke, and Singh also contributed to the above integration.

Historical view: applications

• Tesauro: a world-class backgammon playing program, which first demonstrated the practical potential of the approach.
• Since then:
   - channel allocation in cellular communication networks
   - elevator dispatching
   - inventory management
   - job-shop scheduling

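Actors and Critics: averaged rewards

Independent actors

An actor is a parameterized class of policies. Performance is measured by the average reward

$$\lim_{N \to \infty} \frac{1}{N}\, E\Big[ \sum_{t=0}^{N-1} r(x_t) \,\Big|\, x_0 = x \Big].$$

Independent actors (cont.)

The actor parameters can be tuned by a stochastic gradient method, such as the one proposed by Marbach and Tsitsiklis.

Using critic feedback

The critic provides an estimate of the gradient given the current actor parameters, using a linearly parameterized approximation of the Q-function:

$$\tilde{q}(x, a, u) = \sum_{k=1}^{K} u(k)\, \phi_k(x, a).$$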

Bibliography

[1] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 1995.
[2] D. P. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[3] R. S. Sutton, Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, Amherst, MA, 1984.
[4] R. S. Sutton, Learning to Predict by the Methods of Temporal Differences. Machine Learning, 3:9-44, 1988.
[5] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[6] J. N. Tsitsiklis and B. Van Roy, An Analysis of Temporal-Difference Learning with Function Approximation. IEEE Transactions on Automatic Control, 42(5):674-690, 1997.
