
Machine Learning 3: 9-44, 1988. © 1988 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Learning to Predict by the Methods of Temporal Differences

RICHARD S. SUTTON (RICH@GTE.COM)
GTE Laboratories Incorporated, 40 Sylvan Road, Waltham, MA 02254, U.S.A.

(Received: April 22, 1987)
(Revised: February 4, 1988)

Keywords: Incremental learning, prediction, connectionism, credit assignment, evaluation functions

Abstract. This article introduces a class of incremental learning procedures specialized for prediction - that is, for using past experience with an incompletely known system to predict its future behavior. Whereas conventional prediction-learning methods assign credit by means of the difference between predicted and actual outcomes, the new methods assign credit by means of the difference between temporally successive predictions. Although such temporal-difference methods have been used in Samuel's checker player, Holland's bucket brigade, and the author's Adaptive Heuristic Critic, they have remained poorly understood. Here we prove their convergence and optimality for special cases and relate them to supervised-learning methods. For most real-world prediction problems, temporal-difference methods require less memory and less peak computation than conventional methods and they produce more accurate predictions. We argue that most problems to which supervised learning is currently applied are really prediction problems of the sort to which temporal-difference methods can be applied to advantage.

1. Introduction

This article concerns the problem of learning to predict, that is, of using past experience with an incompletely known system to predict its future behavior. For example, through experience one might learn to predict for particular chess positions whether they will lead to a win, for particular cloud formations whether there will be rain, or for particular economic conditions how much the stock market will rise or fall. Learning to predict is one of the most basic and prevalent kinds of learning. Most pattern recognition problems, for example, can be treated as prediction problems in which the classifier must predict the correct classifications. Learning-to-predict problems also arise in heuristic search, e.g., in learning an evaluation function that predicts the utility of searching particular parts of the search space, or in learning the underlying model of a problem domain. An important advantage of prediction learning is that its training examples can be taken directly from the temporal sequence of ordinary sensory input; no special supervisor or teacher is required.

In this article, we introduce and provide the first formal results in the theory of temporal-difference (TD) methods, a class of incremental learning procedures specialized for prediction problems. Whereas conventional prediction-learning methods are driven by the error between predicted and actual outcomes, TD methods are similarly driven by the error or difference between temporally successive predictions; with them, learning occurs whenever there is a change in prediction over time. For example, suppose a weatherman attempts to predict on each day of the week whether it will rain on the following Saturday. The conventional approach is to compare each prediction to the actual outcome - whether or not it does rain on Saturday. A TD approach, on the other hand, is to compare each day's prediction with that made on the following day. If a 50% chance of rain is predicted on Monday, and a 75% chance on Tuesday, then a TD method increases predictions for days similar to Monday, whereas a conventional method might either increase or decrease them depending on Saturday's actual outcome.

We will show that TD methods have two kinds of advantages over conventional prediction-learning methods. First, they are more incremental and therefore easier to compute. For example, the TD method for predicting Saturday's weather can update each day's prediction on the following day, whereas the conventional method must wait until Saturday, and then make the changes for all days of the week. The conventional method would have to do more computing at one time than the TD method and would require more storage during the week. The second advantage of TD methods is that they tend to make more efficient use of their experience: they converge faster and produce better predictions. We argue that the predictions of TD methods are both more accurate and easier to compute than those of conventional methods.

The earliest and best-known use of a TD method was in Samuel's (1959) celebrated checker-playing program. For each pair of successive game positions, the program used the difference between the evaluations assigned to the two positions to modify the earlier one's evaluation. Similar methods have been used in Holland's (1986) bucket brigade, in the author's Adaptive Heuristic Critic (Sutton, 1984; Barto, Sutton & Anderson, 1983), and in learning systems studied by Witten (1977), Booker (1982), and Hampson (1983). TD methods have also been proposed as models of classical conditioning (Sutton & Barto, 1981a, 1987; Gelperin, Hopfield & Tank, 1985; Moore et al., 1986; Klopf, 1987).

Nevertheless, TD methods have remained poorly understood. Although they have performed well, there has been no theoretical understanding of how or why they worked. One reason is that they were never studied independently, but only as parts of larger and more complex systems. Within these systems, TD methods were used to improve evaluation functions by better predicting goal-related events such as rewards, penalties, or checker game outcomes. Here we advocate viewing TD methods in a simpler way - as methods for efficiently learning to predict arbitrary events, not just goal-related ones. This simplification allows us to evaluate them in isolation and has enabled us to obtain formal results. In this paper, we prove the convergence and optimality of TD methods for important special cases, and we formally relate them to conventional supervised-learning procedures.

Another simplification we make in this paper is to focus on numerical prediction processes rather than on rule-based or symbolic prediction (e.g., Dietterich & Michalski, 1986). The approach taken here is much like that used in connectionism and in Samuel's original work: our predictions are based on numerical features combined using adjustable parameters or "weights." This and other representational assumptions are detailed in Section 2.

Given the current interest in learning procedures for multi-layer connectionist networks (e.g., Rumelhart, Hinton, & Williams, 1985; Ackley, Hinton, & Sejnowski, 1985; Barto, 1985; Anderson, 1986; Williams, 1986; Hampson & Volper, 1987), we note that here we are concerned with a different set of issues. The work with multi-layer networks focuses on learning input-output mappings of more complex functional forms. Most of that work remains within the supervised-learning paradigm, whereas here we are interested in extending and going beyond it. We consider mostly mappings of very simple functional forms, because the differences between supervised-learning methods and TD methods are clearest in these cases. Nevertheless, the TD methods presented here can be directly extended to multi-layer networks (see Section 6.2).

The next section introduces a specific class of temporal-difference procedures by contrasting them with conventional, supervised-learning approaches, focusing on computational issues. Section 3 develops an extended example that illustrates the potential performance advantages of TD methods. Section 4 contains the convergence and optimality theorems and discusses TD methods as gradient descent. Section 5 discusses how to extend TD procedures, and Section 6 relates them to other research.

2. Temporal-difference and supervised-learning approaches to prediction

Historically, the most important learning paradigm has been that of supervised learning. In this framework the learner is asked to associate pairs of items. When later presented with just the first item of a pair, the learner is supposed to recall the second. This paradigm has been used in pattern classification, concept acquisition, learning from examples, system identification, and associative memory. For example, in pattern classification and concept acquisition, the first item is an instance of some pattern or concept, and the second item is the name of that concept. In system identification, the learner must reproduce the input-output behavior of some unknown system. Here, the first item of each pair is an input and the second is the corresponding output.

Any prediction problem can be cast in the supervised-learning paradigm by taking the first item to be the data based on which a prediction must be made, and the second item to be the actual outcome, what the prediction should have been. For example, to predict Saturday's weather, one can form a pair from the measurements taken on Monday and the actual observed weather on Saturday, another pair from the measurements taken on Tuesday and Saturday's weather, and so on. Although this pairwise approach ignores the sequential structure of the problem, it is easy to understand and analyze and it has been widely used. In this paper, we refer to this as the supervised-learning approach to

prediction learning, and we refer to learning methods that take this approach as supervised-learning methods. We argue that such methods are inadequate, and that TD methods are far preferable.

2.1 Single-step and multi-step prediction

To clarify this claim, we distinguish two kinds of prediction-learning problems. In single-step prediction problems, all information about the correctness of each prediction is revealed at once. In multi-step prediction problems, correctness is not revealed until more than one step after the prediction is made, but partial information relevant to its correctness is revealed at each step. For example, the weather prediction problem mentioned above is a multi-step prediction problem because inconclusive evidence relevant to the correctness of Monday's prediction becomes available in the form of new observations on Tuesday, Wednesday, Thursday, and Friday. On the other hand, if each day's weather were to be predicted on the basis of the previous day's observations - that is, on Monday predict Tuesday's weather, on Tuesday predict Wednesday's weather, etc. - one would have a single-step prediction problem, assuming no further observations were made between the time of each day's prediction and its confirmation or refutation on the following day.

In this paper, we will be concerned only with multi-step prediction problems. In single-step problems, data naturally comes in observation-outcome pairs; these problems are ideally suited to the pairwise supervised-learning approach. Temporal-difference methods cannot be distinguished from supervised-learning methods in this case; thus the former improve over conventional methods only on multi-step problems. However, we argue that these predominate in real-world applications. For example, predictions about next year's economic performance are not confirmed or disconfirmed all at once, but rather bit by bit as the economic situation is observed through the year. The likely outcome of elections is updated with each new poll, and the likely outcome of a chess game is updated with each move. When a baseball batter predicts whether a pitch will be a strike, he updates his prediction continuously during the ball's flight.

In fact, many problems that are classically cast as single-step prediction problems are more naturally viewed as multi-step problems. Perceptual learning problems, such as vision or speech recognition, are classically treated as supervised learning, using a training set of isolated, correctly-classified input patterns. When humans hear or see things, on the other hand, they receive a stream of input over time and constantly update their hypotheses about what they are seeing or hearing. People are faced not with a single-step problem of unrelated pattern-class pairs, but rather with a series of related patterns, all providing information about the same classification. To disregard this structure seems improvident.

2.2 Computational issues

In this subsection, we introduce a particular TD procedure by formally relating it to a classical supervised-learning procedure, the Widrow-Hoff rule.

We show that the two procedures produce exactly the same weight changes, but that the TD procedure can be implemented incrementally and therefore requires far less computational power. In the following subsection, this TD procedure will be used also as a conceptual bridge to a larger family of TD procedures that produce different weight changes than any supervised-learning method. First, we detail the representational assumptions that will be used throughout the paper.

We consider multi-step prediction problems in which experience comes in observation-outcome sequences of the form x_1, x_2, x_3, ..., x_m, z, where each x_t is a vector of observations available at time t in the sequence, and z is the outcome of the sequence. Many such sequences will normally be experienced. The components of each x_t are assumed to be real-valued measurements or features, and z is assumed to be a real-valued scalar. For each observation-outcome sequence, the learner produces a corresponding sequence of predictions P_1, P_2, P_3, ..., P_m, each of which is an estimate of z. In general, each P_t can be a function of all preceding observation vectors up through time t, but, for simplicity, here we assume that it is a function only of x_t.¹ The predictions are also based on a vector of modifiable parameters or weights, w. P_t's functional dependence on x_t and w will sometimes be denoted explicitly by writing it as P(x_t, w).

All learning procedures will be expressed as rules for updating w. For the moment we assume that w is updated only once for each complete observation-outcome sequence and thus does not change during a sequence. For each observation, an increment to w, denoted Δw_t, is determined. After a complete sequence has been processed, w is changed by (the sum of) all the sequence's increments:

    w ← w + Σ_{t=1}^{m} Δw_t .        (1)

Later, we will consider more incremental cases in which w is updated after each observation, and also less incremental cases in which it is updated only after accumulating Δw_t's over a training set consisting of several sequences.

The supervised-learning approach treats each sequence of observations and its outcome as a sequence of observation-outcome pairs; that is, as the pairs (x_1, z), (x_2, z), ..., (x_m, z). The increment due to time t depends on the error between P_t and z, and on how changing w will affect P_t. The prototypical supervised-learning update procedure is

    Δw_t = α (z − P_t) ∇_w P_t ,        (2)

where α is a positive parameter affecting the rate of learning, and the gradient, ∇_w P_t, is the vector of partial derivatives of P_t with respect to each component of w.

For example, consider the special case in which P_t is a linear function of x_t and w, that is, in which P_t = w^T x_t = Σ_i w(i) x_t(i), where w(i) and x_t(i) are

¹ The other cases can be reduced to this one by reorganizing the observations in such a way that each x_t includes some or all of the earlier observations. Cases in which predictions should depend on t can also be reduced to this one by including t as a component of x_t.

the i'th components of w and x_t, respectively.² In this case we have ∇_w P_t = x_t, and (2) reduces to the well-known Widrow-Hoff rule (Widrow & Hoff, 1960):

    Δw_t = α (z − w^T x_t) x_t .

This linear learning method is also known as the "delta rule," the ADALINE, and the LMS filter. It is widely used in connectionism, pattern recognition, signal processing, and adaptive control. The basic idea is that the difference z − w^T x_t represents the scalar error between the prediction, w^T x_t, and what it should have been, z. This is multiplied by the observation vector x_t to determine the weight changes because x_t indicates how changing each weight will affect the error. For example, if the error is positive and x_t(i) is positive, then w(i) will be increased, increasing w^T x_t and reducing the error. The Widrow-Hoff rule is simple, effective, and robust. Its theory is also better developed than that of any other learning method (e.g., see Widrow & Stearns, 1985).
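To make the computational discussion concrete, here is a minimal sketch of the per-sequence supervised-learning (Widrow-Hoff) update of equations (1) and (2) for the linear case. It is written in Python with NumPy, which is of course not part of the paper; the function name and interface are illustrative assumptions only.

```python
import numpy as np

def widrow_hoff_sequence_update(w, xs, z, alpha):
    """Total weight increment for one observation-outcome sequence x_1..x_m, z.

    Every term needs the final outcome z, so nothing can be applied (and all
    observations must be stored) until the sequence ends, as noted above.
    """
    delta_w = np.zeros_like(w)
    for x_t in xs:
        P_t = w @ x_t                       # linear prediction P_t = w^T x_t
        delta_w += alpha * (z - P_t) * x_t  # Delta w_t = alpha (z - P_t) x_t, equation (2)
    return delta_w                          # added to w as in equation (1)
```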

Another instance of the prototypical supervised-learning procedure is the "generalized delta rule," or backpropagation procedure, of Rumelhart et al. (1985). In this case, P_t is computed by a multi-layer connectionist network and is a nonlinear function of x_t and w. Nevertheless, the update rule used is still exactly (2), just as in the Widrow-Hoff rule, the only difference being that a more complicated process is used to compute the gradient ∇_w P_t.

In any case, note that all the Δw_t in (2) depend critically on z, and thus cannot be determined until the end of the sequence when z becomes known. Thus, all observations and predictions made during a sequence must be remembered until its end, when all the Δw_t's are computed. In other words, (2) cannot be computed incrementally.

There is, however, a TD procedure that produces exactly the same result as (2), and yet which can be computed incrementally. The key is to represent the error z − P_t as a sum of changes in predictions, that is, as

    z − P_t = Σ_{k=t}^{m} (P_{k+1} − P_k),

where P_{m+1} is defined to be z. Using this, equations (1) and (2) can be combined as

    w ← w + Σ_{t=1}^{m} α (z − P_t) ∇_w P_t
        = w + Σ_{t=1}^{m} α Σ_{k=t}^{m} (P_{k+1} − P_k) ∇_w P_t
        = w + Σ_{k=1}^{m} α (P_{k+1} − P_k) Σ_{t=1}^{k} ∇_w P_t
        = w + Σ_{t=1}^{m} α (P_{t+1} − P_t) Σ_{k=1}^{t} ∇_w P_k .

² w^T is the transpose of the column vector w. Unless otherwise noted, all vectors are column vectors.

In other words, converting back to a rule to be used with (1):

    Δw_t = α (P_{t+1} − P_t) Σ_{k=1}^{t} ∇_w P_k .        (3)

Unlike (2), this equation can be computed incrementally, because each Δw_t depends only on a pair of successive predictions and on the sum of all past values for ∇_w P_t. This saves substantially on memory, because it is no longer necessary to individually remember all past values of ∇_w P_t. Equation (3) also makes much milder demands on the computational speed of the device that implements it; although it requires slightly more arithmetic operations overall (the additional ones are those needed to accumulate Σ_{k=1}^{t} ∇_w P_k), they can be distributed over time more evenly. Whereas (3) computes one increment to w on each time step, (2) must wait until a sequence is completed and then compute all of the increments due to that sequence. If M is the maximum possible length of a sequence, then under many circumstances (3) will require only 1/M-th of the memory and speed required by (2).³
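A corresponding sketch of the incremental rule (3), again for the linear case in which ∇_w P_t = x_t. It assumes NumPy and the same illustrative interface as the sketch above; as Theorem 1 below states, its per-sequence total matches the Widrow-Hoff total while needing only the running gradient sum and the two most recent predictions.

```python
import numpy as np

def td1_sequence_update(w, xs, z, alpha):
    """Total weight increment for one sequence, computed one transition at a time."""
    delta_w = np.zeros_like(w)
    grad_sum = np.array(xs[0], dtype=float)   # running sum of past gradients x_1..x_t
    P_prev = w @ xs[0]
    for t in range(1, len(xs) + 1):
        P_next = z if t == len(xs) else w @ xs[t]        # the final "prediction" is z itself
        delta_w += alpha * (P_next - P_prev) * grad_sum  # equation (3)
        if t < len(xs):
            grad_sum = grad_sum + xs[t]
            P_prev = P_next
    return delta_w
```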

For reasons that will be made clear shortly, we refer to the procedure given by (3) as the TD(1) procedure. In addition, we will refer to a procedure as linear if its predictions P_t are a linear function of the observation vectors x_t and the vector of memory parameters w, that is, if P_t = w^T x_t. We have just proven:

Theorem 1 On multi-step prediction problems, the linear TD(1) procedure produces the same per-sequence weight changes as the Widrow-Hoff procedure.

Next, we introduce a family of TD procedures that produce weight changes different from those of any supervised-learning procedure.

2.3 The TD(λ) family of learning procedures

The hallmark of temporal-difference methods is their sensitivity to changes in successive predictions rather than to overall error between predictions and the final outcome. In response to an increase (decrease) in prediction from P_t to P_{t+1}, an increment Δw_t is determined that increases (decreases) the predictions for some or all of the preceding observation vectors x_1, ..., x_t. The procedure given by (3) is the special case in which all of those predictions are altered to an equal extent. In this article we also consider a class of TD procedures that make greater alterations to more recent predictions. In particular, we consider an exponential weighting with recency, in which alterations to the predictions of observation vectors occurring k steps in the past are weighted according to λ^k for 0 ≤ λ ≤ 1:

    Δw_t = α (P_{t+1} − P_t) Σ_{k=1}^{t} λ^{t−k} ∇_w P_k .        (4)

³ Strictly speaking, there are other incremental procedures for implementing the combination of (1) and (2), but only the TD rule (3) is appropriate for updating w on a per-observation basis.

Note that for λ = 1 this is equivalent to (3), the TD implementation of the prototypical supervised-learning method. Accordingly, we call this new procedure TD(λ) and we will refer to the procedure given by (3) as TD(1).

Alterations of past predictions can be weighted in ways other than the exponential form given above, and this may be appropriate for particular applications. However, an important advantage of the exponential form is that it can be computed incrementally. Given that e_t is the value of the sum in (4) for t, we can incrementally compute e_{t+1}, using only current information, as

    e_{t+1} = Σ_{k=1}^{t+1} λ^{t+1−k} ∇_w P_k = ∇_w P_{t+1} + λ e_t .

For λ < 1, TD(λ) produces weight changes different from those made by any supervised-learning method. The difference is greatest in the case of TD(0) (where λ = 0), in which the weight increment is determined only by its effect on the prediction associated with the most recent observation:

    Δw_t = α (P_{t+1} − P_t) ∇_w P_t .

Note that this procedure is formally very similar to the prototypical supervised-learning procedure (2). The two equations are identical except that the actual outcome z in (2) is replaced by the next prediction P_{t+1} in the equation above. The two methods use the same learning mechanism, but with different errors. Because of these relationships and TD(0)'s overall simplicity, it is an important focus here.
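The whole TD(λ) family of equation (4) can be sketched the same way, using the incrementally maintained sum e_{t+1} = ∇_w P_{t+1} + λ e_t: λ = 1 reproduces the TD(1) sketch above and λ = 0 gives TD(0). This is again a Python/NumPy sketch with illustrative names, not code from the paper.

```python
import numpy as np

def td_lambda_sequence_update(w, xs, z, alpha, lam):
    """Total weight increment for one sequence under linear TD(lambda), equation (4)."""
    delta_w = np.zeros_like(w)
    e = np.zeros_like(w)                    # running value of sum_k lam^(t-k) x_k
    for t in range(len(xs)):
        e = xs[t] + lam * e                 # incremental update of the sum in (4)
        P_t = w @ xs[t]
        P_next = z if t == len(xs) - 1 else w @ xs[t + 1]
        delta_w += alpha * (P_next - P_t) * e
    return delta_w
```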

3. Examples of faster learning with TD methods

In this section we begin to address the claim that TD methods make more efficient use of their experience than do supervised-learning methods, that they converge more rapidly and make more accurate predictions along the way. TD methods have this advantage whenever the data sequences have a certain statistical structure that is ubiquitous in prediction problems. This structure naturally arises whenever the data sequences are generated by a dynamical system, that is, by a system that has a state which evolves and is partially revealed over time. Almost any real system is a dynamical system, including the weather, national economies, and chess games. In this section, we develop two illustrative examples: a game-playing example to help develop intuitions, and a random-walk example as a simple demonstration with experimental results.

3.1 A game-playing example

It seems counter-intuitive that TD methods might learn more efficiently than supervised-learning methods. In learning to predict an outcome, how can one

Figure 1. A game-playing example showing the inefficiency of supervised-learning methods. Each circle represents a position or class of positions from a two-person board game. The "bad" position is known from long experience to lead 90% of the time to a loss and only 10% of the time to a win. The first game in which the "novel" position occurs evolves as shown by the dashed arrows. What evaluation should the novel position receive as a result of this experience? Whereas TD methods correctly conclude that it should be considered another bad state, supervised-learning methods associate it fully with winning, the only outcome that has followed it.

do better than by knowing and using the actual outcome as a performance standard? How can using a biased and potentially inaccurate subsequent prediction possibly be a better use of the experience? The following example is meant to provide an intuitive understanding of how this is possible.

Suppose there is a game position that you have learned is bad for you, that has resulted most of the time in a loss and only rarely in a win for your side. For example, this position might be a backgammon race in which you are behind, or a disadvantageous configuration of cards in blackjack. Figure 1 represents a simple case of such a position as a single "bad" state that has led 90% of the time to a loss and only 10% of the time to a win. Now suppose you play a game that reaches a novel position (one that you have never seen before), that then progresses to reach the bad state, and that finally ends nevertheless in a victory for you. That is, over several moves it follows the path shown by dashed lines in Figure 1. As a result of this experience, your opinion of the bad state would presumably improve, but what of the novel state? What value would you associate with it as a result of this experience?

A supervised-learning method would form a pair from the novel state and the win that followed it, and would conclude that the novel state is likely to lead to a win. A TD method, on the other hand, would form a pair from the novel state and the bad state that immediately followed it, and would conclude that the novel state is also a bad one, that it is likely to lead to a

Figure 2. A generator of bounded random walks. This Markov process generated the data sequences in the example. All walks begin in state D. From states B, C, D, E, and F, the walk has a 50% chance of moving either to the right or to the left. If either edge state, A or G, is entered, then the walk terminates.

loss. Assuming we have properly classified the "bad" state, the TD method's conclusion is the correct one; the novel state led to a position that you know usually leads to defeat; what happened after that is irrelevant. Although both methods should converge to the same evaluations with infinite experience, the TD method learns a better evaluation from this limited experience.

The TD method's prediction would also be better had the game been lost after reaching the bad state, as is more likely. In this case, a supervised-learning method would tend to associate the novel position fully with losing, whereas a TD method would tend to associate it with the bad position's 90% chance of losing, again a presumably more accurate assessment. In either case, by adjusting its evaluation of the novel state towards the bad state's evaluation, rather than towards the actual outcome, the TD method makes better use of the experience. The bad state's evaluation is a better performance standard because it is uncorrupted by random factors that subsequently influence the final outcome. It is by eliminating this source of noise that TD methods can outperform supervised-learning procedures.

In this example, we have ignored the possibility that the bad state's previously learned evaluation is in error. Such errors will inevitably exist and will affect the efficiency of TD methods in ways that cannot easily be evaluated in an example of this sort. The example does not prove TD methods will be better on balance, but it does demonstrate that a subsequent prediction can easily be a better performance standard than the actual outcome.

This game-playing example can also be used to show how TD methods can fail. Suppose the bad state is usually followed by defeats except when it is preceded by the novel state, in which case it always leads to a victory. In this odd case, TD methods could not perform better and might perform worse than supervised-learning methods. Although there are several techniques for eliminating or minimizing this sort of problem, it remains a greater difficulty for TD methods than it does for supervised-learning methods. TD methods try to take advantage of the information provided by the temporal sequence of states, whereas supervised-learning methods ignore it. It is possible for this information to be misleading, but more often it should be helpful.

Finally, note that although this example involved learning an evaluation function, nothing about it was specific to evaluation functions. The methods can equally well be used to predict outcomes unrelated to the player's goals, such as the number of pieces left at the end of the game. If TD methods are more efficient than supervised-learning methods in learning evaluation functions, then they should also be more efficient in general prediction-learning problems.

3.2 A random-walk example

The game-playing example is too complex to analyze in great detail. Previous experiments with TD methods have also used complex domains (e.g., Samuel, 1959; Sutton, 1984; Barto et al., 1983; Anderson, 1986, 1987). Which aspects of these domains can be simplified or eliminated, and which aspects are essential in order for TD methods to be effective? In this paper, we propose that the only required characteristic is that the system predicted be a dynamical one, that it have a state which can be observed evolving over time. If this is true, then TD methods should learn more efficiently than supervised-learning methods even on very simple prediction problems, and this is what we illustrate in this subsection. Our example is one of the simplest of dynamical systems, that which generates bounded random walks.

A bounded random walk is a state sequence generated by taking random steps to the right or to the left until a boundary is reached. Figure 2 shows a system that generates such state sequences. Every walk begins in the center state D. At each step the walk moves to a neighboring state, either to the right or to the left with equal probability. If either edge state (A or G) is entered, the walk terminates. A typical walk might be DCDEFG. Suppose we wish to estimate the probabilities of a walk ending in the rightmost state, G, given that it is in each of the other states.

We applied linear supervised-learning and TD methods to this problem in a straightforward way. A walk's outcome was defined to be z = 0 for a walk ending on the left at A and z = 1 for a walk ending on the right at G. The learning methods estimated the expected value of z; for this choice of z, its expected value is equal to the probability of a right-side termination. For each nonterminal state i, there was a corresponding observation vector x_i; if the walk was in state i at time t then x_t = x_i. Thus, if the walk DCDEFG occurred, then the learning procedure would be given the sequence x_D, x_C, x_D, x_E, x_F, 1. The vectors {x_i} were the unit basis vectors of length 5, that is, four of their components were 0 and the fifth was 1 (e.g., x_D = (0, 0, 1, 0, 0)^T), with the one appearing at a different component for each state. Thus, if the state the walk was in at time t has its 1 at the i'th component of its observation vector, then the prediction P_t = w^T x_t was simply the value of the i'th component of w. We use this particularly simple case to make this example as clear as possible. The theorems we prove later for a more general class of dynamical systems require only that the set of observation vectors {x_i} be linearly independent.
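For concreteness, here is a small sketch of the data generator implied by Figure 2 and the representation just described (Python/NumPy; the helper names and the seeded random generator are illustrative assumptions, not part of the paper).

```python
import numpy as np

STATES = "ABCDEFG"                                        # A and G are the absorbing edge states
BASIS = {s: np.eye(5)[i] for i, s in enumerate("BCDEF")}  # unit basis vectors for B..F

def generate_walk(rng):
    """Return (observation vectors x_1..x_m, outcome z) for one walk starting at D."""
    i = STATES.index("D")
    xs = []
    while STATES[i] not in "AG":
        xs.append(BASIS[STATES[i]])
        i += rng.choice([-1, 1])            # step left or right with equal probability
    z = 1.0 if STATES[i] == "G" else 0.0    # right-side termination scores 1
    return xs, z

# A training set of 10 sequences, as in the experiments described below:
rng = np.random.default_rng(0)
training_set = [generate_walk(rng) for _ in range(10)]
```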

Two computational experiments were performed using observation-outcome sequences generated as described above. In order to obtain statistically reliable

Figure 3. Average error on the random-walk problem under repeated presentations. All data are from TD(λ) with different values of λ. The dependent measure used is the RMS error between the ideal predictions and those found by the learning procedure after being repeatedly presented with the training set until convergence of the weight vector. This measure was averaged over 100 training sets to produce the data shown. The λ = 1 data point is the performance level attained by the Widrow-Hoff procedure. For each data point, the standard error is approximately σ = 0.01, so the differences between the Widrow-Hoff procedure and the other procedures are highly significant.

results, 100 training sets, each consisting of 10 sequences, were constructed for use by all learning procedures. For all procedures, weight increments were computed according to TD(λ), as given by (4). Seven different values were used for λ. These were λ = 1, resulting in the Widrow-Hoff supervised-learning procedure, λ = 0, resulting in linear TD(0), and also λ = 0.1, 0.3, 0.5, 0.7, and 0.9, resulting in a range of intermediate TD procedures.

In the first experiment, the weight vector was not updated after each sequence as indicated by (1). Instead, the Δw's were accumulated over sequences and only used to update the weight vector after the complete presentation of a training set. Each training set was presented repeatedly to each learning procedure until the procedure no longer produced any significant changes in the weight vector. For small α, the weight vector always converged in this way, and always to the same final value, independent of its initial value. We call this the repeated presentations training paradigm.
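A sketch of this repeated presentations training paradigm, reusing the illustrative helpers sketched earlier (td_lambda_sequence_update and generate_walk); the convergence tolerance and the cap on the number of passes are illustrative choices, not values from the paper.

```python
import numpy as np

def train_repeated_presentations(training_set, alpha, lam, tol=1e-7, max_passes=200_000):
    """Present a fixed training set repeatedly, updating w once per full pass."""
    w = np.zeros(5)
    for _ in range(max_passes):
        delta_w = np.zeros_like(w)
        for xs, z in training_set:          # accumulate increments over all sequences
            delta_w += td_lambda_sequence_update(w, xs, z, alpha, lam)
        w += delta_w                        # one update per presentation of the training set
        if np.max(np.abs(delta_w)) < tol:   # stop when changes become negligible
            break
    return w                                # asymptotic predictions for states B..F
```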

The true probabilities of right-side termination - the ideal predictions - for each of the nonterminal states can be computed as described in Section 4.1. These are 1/6, 1/3, 1/2, 2/3, and 5/6 for states B, C, D, E, and F, respectively. As

Figure 4. Average error on the random-walk problem after experiencing 10 sequences. All data are from TD(λ) with different values of α and λ. The dependent measure is the RMS error between the ideal predictions and those found by the learning procedure after a single presentation of a training set. This measure was averaged over 100 training sets. The λ = 1 data points represent performances of the Widrow-Hoff supervised-learning procedure.

a measure of the performance of a learning procedure on a training set, we used the root mean squared (RMS) error between the procedure's asymptotic predictions using that training set and the ideal predictions. Averaging over training sets, we found that performance improved rapidly as λ was reduced below 1 (the supervised-learning method) and was best at λ = 0 (the extreme TD method), as shown in Figure 3.

This result contradicts conventional wisdom. It is well known that, under repeated presentations, the Widrow-Hoff procedure minimizes the RMS error between its predictions and the actual outcomes in the training set (Widrow & Stearns, 1985). How can it be that this optimal method performed worse than all the TD methods for λ < 1? The answer is that the Widrow-Hoff procedure only minimizes error on the training set; it does not necessarily minimize error for future experience. In the following section, we prove that in fact it is linear TD(0) that converges to what can be considered the optimal estimates for matching future experience - those consistent with the maximum-likelihood estimate of the underlying Markov process.

The second experiment concerns the question of learning rate when the training set is presented just once rather than repeatedly until convergence. Although it is difficult to prove a theorem concerning learning rate, it is easy to perform the relevant computational experiment. We presented the same data to the learning procedures, again for several values of λ, with the following

Figure 5. Average error at the best α value on the random-walk problem. Each data point represents the average over 100 training sets of the error in the estimates found by TD(λ), for particular λ and α values, after a single presentation of a training set. The λ value is given by the horizontal coordinate. The α value was selected from those shown in Figure 4 to yield the lowest error for that λ value.

procedural changes. First, each training set was presented once to each procedure. Second, weight updates were performed after each sequence, as in (1), rather than after each complete training set. Third, each learning procedure was applied with a range of values for the learning-rate parameter α. Fourth, so that there was no bias either toward right-side or left-side terminations, all components of the weight vector were initially set to 0.5.

The results for several representative values of λ are shown in Figure 4. Not surprisingly, the value of α had a significant effect on performance, with best results obtained with intermediate values. For all values, however, the Widrow-Hoff procedure, TD(1), produced the worst estimates. All of the TD methods with λ < 1 performed better both in absolute terms and over a wider range of α values than did the supervised-learning method.

Figure 5 plots the best error level achieved for each λ value, that is, using the α value that was best for that λ value. As in the repeated-presentations experiment, all λ values less than 1 were superior to the λ = 1 case. However, in this experiment the best λ value was not 0, but somewhere near 0.3.

One reason λ = 0 is not optimal for this problem is that TD(0) is relatively slow at propagating prediction levels back along a sequence. For example, suppose states D, E, and F all start with the prediction value 0.5, and the sequence x_D, x_E, x_F, 1 is experienced. TD(0) will change only F's prediction, whereas the other procedures will also change E's and D's to decreasing extents.

If the sequence is repeatedly presented, this is no handicap, as the change works back an additional step with each presentation, but for a single presentation it means slower learning.

This handicap could be avoided by working backwards through the sequences. For example, for the sequence x_D, x_E, x_F, 1, first F's prediction could be updated in light of the 1, then E's prediction could be updated toward F's new level, and so on. In this way the effect of the 1 could be propagated back to the beginning of the sequence with only a single presentation. The drawback to this technique is that it loses the implementation advantages of TD methods. Since it changes the last prediction in a sequence first, it has no incremental implementation. However, when this is not an issue, such as when learning is done offline from an existing database, working backward in this way should produce the best predictions.
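A sketch of this backward-updating alternative (Python/NumPy, illustrative names): the outcome is propagated from the end of a stored sequence back to its beginning in a single pass, at the cost of losing the incremental implementation.

```python
import numpy as np

def backward_sequence_update(w, xs, z, alpha):
    """Update w by sweeping one stored sequence from its last observation to its first."""
    w = np.array(w, dtype=float, copy=True)
    target = z                                # start from the actual outcome
    for x_t in reversed(xs):
        P_t = w @ x_t
        w += alpha * (target - P_t) * x_t     # pull this prediction toward the later value
        target = w @ x_t                      # earlier states are pulled toward the new level
    return w
```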

4. Theory of temporal-difference methods

In this section, we provide a theoretical foundation for temporal-difference methods. Such a foundation is particularly needed for these methods because most of their learning is done on the basis of previously learned quantities. "Bootstrapping" in this way may be what makes TD methods efficient, but it can also make them difficult to analyze and to have confidence in. In fact, hitherto no TD method has ever been proved stable or convergent to the correct predictions.⁴ The theory developed here concerns the linear TD(0) procedure and a class of tasks typified by the random-walk example discussed in the preceding section. Two major results are presented: (1) an asymptotic convergence theorem for linear TD(0) when presented with new data sequences; and (2) a theorem that linear TD(0) converges under repeated presentations to the optimal (maximum-likelihood) estimates. Finally, we discuss how TD methods can be viewed as gradient-descent procedures.

4.1 Convergence of linear TD(0)

The theory presented here is for data sequences generated by absorbing Markov processes such as the random-walk process discussed in the preceding section. Such processes, in which each next state depends only on the current state, are among the formally simplest dynamical systems. They are defined by a set of terminal states T, a set of nonterminal states N, and a set of transition probabilities p_ij (i ∈ N, j ∈ N ∪ T), where each p_ij is the probability of a transition from state i to state j, given that the process is in state i. The "absorbing" property means that indefinite cycles among the nonterminal states are not possible; all sequences (except for a set of zero probability) eventually terminate.

Given an initial state q_1, an absorbing Markov process provides a way of generating a state sequence q_1, q_2, ..., q_{m+1}, where q_{m+1} ∈ T. We will assume the initial state is chosen probabilistically from among the nonterminal states,

⁴ Witten (1977) presented a sketch of a convergence proof for a TD procedure that predicted discounted costs in a Markov decision problem, but many steps were left out, and it now appears that the theorem he proposed is not true.

each with probability u_i. As in the random-walk example, we do not give the learning algorithms direct knowledge of the state sequence, but only of a related observation-outcome sequence x_1, x_2, ..., x_m, z. Each numerical observation vector x_t is chosen dependent only on the corresponding nonterminal state q_t, and the scalar outcome z is chosen dependent only on the terminal state q_{m+1}. In what follows, we assume that there is a specific observation vector x_i corresponding to each nonterminal state i such that if q_t = i, then x_t = x_i. For each terminal state j, we assume outcomes z are selected from an arbitrary probability distribution with expected value z̄_j.

The first step toward a formal understanding of any learning procedure is to prove that it converges asymptotically to the correct behavior with experience. The desired behavior in this case is to map each nonterminal state's observation vector x_i to the true expected value of the outcome z given that the state sequence is starting in i. That is, we want the predictions P(x_i, w) to equal E{z | i}, ∀ i ∈ N. Let us call these the ideal predictions. Given complete knowledge of the Markov process, they can be computed as follows:

    E{z | i} = Σ_{j∈T} p_ij z̄_j + Σ_{j∈N} p_ij ( Σ_{k∈T} p_jk z̄_k + Σ_{k∈N} p_jk ( ⋯ ) ).

For any matrix M, let [M]_ij denote its ij-th component, and, for any vector v, let [v]_i denote its i-th component. Let Q denote the matrix with entries [Q]_ij = p_ij for i, j ∈ N, and let h denote the vector with components [h]_i = Σ_{j∈T} p_ij z̄_j for i ∈ N. Then we can write the above equation as

    E{z | i} = [ lim_{n→∞} Σ_{k=0}^{n} Q^k h ]_i = [ (I − Q)^{-1} h ]_i .        (5)

The second equality and the existence of the limit and the inverse are assured by Theorem A.1.⁵ This theorem can be applied here because the elements of Q^k are the probabilities of going from one nonterminal state to another in k steps; for an absorbing Markov process, these probabilities must all converge to 0 as k → ∞.
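As an illustration of equation (5), here is a short sketch computing the ideal predictions for the random-walk process of Figure 2 by solving (I − Q)v = h (Python/NumPy; the ordering of states B..F is an illustrative choice). It yields the values 1/6, 1/3, 1/2, 2/3, 5/6 quoted in Section 3.2.

```python
import numpy as np

# Transition probabilities among the nonterminal states B, C, D, E, F (Figure 2).
Q = np.array([[0.0, 0.5, 0.0, 0.0, 0.0],
              [0.5, 0.0, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.0, 0.5],
              [0.0, 0.0, 0.0, 0.5, 0.0]])

# h_i = sum over terminal states j of p_ij * zbar_j; only F can reach G, whose outcome is 1.
h = np.array([0.0, 0.0, 0.0, 0.0, 0.5])

ideal = np.linalg.solve(np.eye(5) - Q, h)   # (I - Q)^{-1} h, as in equation (5)
print(ideal)                                # -> [0.1667 0.3333 0.5 0.6667 0.8333]
```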

If the set of observation vectors { x_i | i ∈ N } is linearly independent, and if α is chosen small enough, then it is known that the predictions of the Widrow-Hoff rule converge in expected value to the ideal predictions (e.g., see Widrow & Stearns, 1985). We now prove the same result for linear TD(0):

Theorem 2 For any absorbing Markov chain, for any distribution of starting probabilities u_i, for any outcome distributions with finite expected values z̄_j, and for any linearly independent set of observation vectors { x_i | i ∈ N }, there exists an ε > 0 such that, for all positive α < ε and for any initial weight vector, the predictions of linear TD(0) (with weight updates after each sequence) converge in expected value to the ideal predictions (5). That is, if

⁵ To simplify presentation of the proofs, some of the more straightforward but potentially distracting steps have been placed in the Appendix as separate theorems. These are referred to in the text as Theorems A.1, A.2, and A.3.

w_n denotes the weight vector after n sequences have been experienced, then

    lim_{n→∞} E{ P(x_i, w_n) } = E{z | i},  ∀ i ∈ N.

PROOF: Linear TD(0) updates w_n after each sequence as follows, where m denotes the number of observation vectors in the sequence:

    w_{n+1} = w_n + α Σ_{t=1}^{m} (P_{t+1} − P_t) ∇_{w_n} P_t
            = w_n + α Σ_{t=1}^{m} ( w_n^T x_{q_{t+1}} − w_n^T x_{q_t} ) x_{q_t} ,

where, for the final step, w_n^T x_{q_{m+1}} is taken to be the outcome z, and

x_{q_t} is the observation vector corresponding to the state q_t entered at time t within the sequence. This equation groups the weight increments according to their time of occurrence within the sequence. Each increment corresponds to a particular state transition, and so we can alternatively group them according to the source and destination states of the transitions:

where n_ij denotes the number of times the transition i → j occurs in the sequence. (For j ∈ T, all but one of the n_ij is 0.)

Since the random processes generating state transitions and outcomes are independent of each other, we can take the expected value of each term above, yielding

where d_i is the expected number of times the Markov chain is in state i in one sequence, so that d_i p_ij is the expected value of n_ij. For an absorbing Markov chain (e.g., see Kemeny & Snell, 1976, p. 46):

where [d]_i = d_i and [u]_i = u_i, i ∈ N. Each d_i is strictly positive, because any state for which d_i = 0 has no probability of being visited and can be discarded.

Let w̄_n denote the expected value of w_n. Then, since the dependence of E{w_{n+1} | w_n} on w_n is linear, we can write

an iterative update formula in w̄_n that depends only on initial conditions. Now we rearrange terms and convert to matrix and vector notation, letting D denote the diagonal matrix with diagonal entries [D]_ii = d_i and X denote the matrix with columns x_i.

Assuming for the moment that lim_{n→∞} (I − αX^T X D(I − Q))^n = 0, then, by Theorem A.1, the sequence {X^T w̄_n} converges to

which is the desired result. Note that D^{-1} must exist because D is diagonal with all positive diagonal entries, and (X^T X)^{-1} must exist by Theorem A.2.

It thus remains to show that lim_{n→∞} (I − αX^T X D(I − Q))^n = 0. We do this by first showing that D(I − Q) is positive definite, and then that X^T X D(I − Q) has a full set of eigenvalues all of whose real parts are positive. This will enable us to show that α can be chosen such that all eigenvalues of I − αX^T X D(I − Q) are less than 1 in modulus, which assures us that its powers converge.

We show that D(I − Q) is positive definite⁶ by applying the following lemma (see Varga, 1962, p. 23, for a proof):

Lemma If A is a real, symmetric, and strictly diagonally dominant matrix with positive diagonal entries, then A is positive definite.

We cannot apply this lemma directly to D(I − Q) because it is not symmetric. However, by Theorem A.3, any matrix A is positive definite exactly when the symmetric matrix A + A^T is positive definite, so we can prove that D(I − Q) is positive definite by applying the lemma to S = D(I − Q) + (D(I − Q))^T. S is clearly real and symmetric; it remains to show that it has positive diagonal entries and is strictly diagonally dominant.

First, we note that, because d is the vector of expected numbers of visits to the nonterminal states,

    d^T (I − Q) = u^T,    that is,    Σ_{j∈N} d_j [I − Q]_ji = u_i  for all i ∈ N.

We will use this fact several times in the following. S's diagonal entries are positive, because [S]_ii = [D(I − Q)]_ii + [(D(I − Q))^T]_ii = 2[D(I − Q)]_ii = 2 d_i [I − Q]_ii = 2 d_i (1 − p_ii) > 0, i ∈ N. Furthermore, S's off-diagonal entries are nonpositive, because, for i ≠ j, [S]_ij = [D(I − Q)]_ij + [(D(I − Q))^T]_ij = d_i [I − Q]_ij + d_j [I − Q]_ji = −d_i p_ij − d_j p_ji ≤ 0.

S is strictly diagonally dominant if and only if |[S]_ii| ≥ Σ_{j≠i} |[S]_ij| for all i, with strict inequality holding for at least one i. However, since [S]_ii > 0 and [S]_ij ≤ 0 for i ≠ j, we need only show that [S]_ii ≥ −Σ_{j≠i} [S]_ij, in other words, that Σ_j [S]_ij ≥ 0, which can be directly shown:

    Σ_j [S]_ij = d_i Σ_j [I − Q]_ij + Σ_j d_j [I − Q]_ji = d_i Σ_{j∈T} p_ij + u_i ≥ 0.

Furthermore, strict inequality must hold for at least one i, because u_i must be strictly positive for at least one i. Therefore, S is strictly diagonally dominant and the lemma applies, proving that S and D(I − Q) are both positive definite.

⁶ A matrix A is positive definite if and only if y^T A y > 0 for all real vectors y ≠ 0.

Next we show that X^T X D(I − Q) has a full set of eigenvalues all of whose real parts are positive. First of all, the set of eigenvalues is clearly full, because the matrix is nonsingular, being the product of three matrices, X^T X, D, and I − Q, that we have already established as nonsingular. Let λ and y be any eigenvalue-eigenvector pair. Let y = a + bi and z = (X^T X)^{-1} y ≠ 0 (i.e., y = X^T X z). Then

where "*" denotes the conjugate-transpose. This implies that

Since the left side and (Xz)*Xz must both be strictly positive, so must the real part of λ.

Furthermore, y must also be an eigenvector of I − αX^T X D(I − Q), because (I − αX^T X D(I − Q))y = y − αλy = (1 − αλ)y. Thus, all eigenvalues of I − αX^T X D(I − Q) are of the form 1 − αλ, where λ has positive real part. For each λ = a + bi, a > 0, if α is chosen so that 0 < α < 2a/(a² + b²), then 1 − αλ will have modulus⁷ less than one:

    |1 − αλ|² = (1 − αa)² + (αb)² = 1 − α(2a − α(a² + b²)) < 1.

The criterial value 2a/(a² + b²) will be different for different λ; choose ε to be the smallest such value. Then, for any positive α < ε, all eigenvalues 1 − αλ of I − αX^T X D(I − Q) are less than one in modulus. And this immediately implies (e.g., see Varga, 1962, p. 13) that lim_{n→∞} (I − αX^T X D(I − Q))^n = 0, completing the proof. ∎

We have just shown that the expected values of the predictions found by linear TD(0) converge to the ideal predictions for data sequences generated by absorbing Markov processes. Of course, just as with the Widrow-Hoff procedure, the predictions themselves do not converge; they continue to vary around their expected values according to their most recent experience. In the case of the Widrow-Hoff procedure, it is known that the asymptotic variance of the predictions is finite and can be made arbitrarily small by the choice of the learning-rate parameter α. Furthermore, if α is reduced according to an appropriate schedule, e.g., α_n = 1/n, then the variance converges to zero as

⁷ The modulus of a complex number a + bi is √(a² + b²).

well. We conjecture that these stronger forms of convergence hold for linear TD(0) as well, but this remains an open question. Also open is the question of convergence of linear TD(λ) for 0 < λ < 1. We now know that both TD(0) and TD(1) - the Widrow-Hoff rule - converge in the mean to the ideal predictions; we conjecture that the intermediate TD(λ) procedures do as well.

4.2 Optimality and learning rate

The result obtained in the previous subsection assures us that both TD methods and supervised-learning methods converge asymptotically to the ideal estimates for data sequences generated by absorbing Markov processes. However, if both kinds of procedures converge to the same result, which gets there faster? In other words, which kind of procedure makes the better predictions from a finite rather than an infinite amount of experience? Despite the previously noted empirical results showing faster learning with TD methods, this has not been proved for any general case. In this subsection we present a related formal result that helps explain the empirical result of faster learning with TD methods. We show that the predictions of linear TD(0) are optimal in an important sense for repeatedly presented finite training sets.

In the following, we first define what we mean by optimal predictions for finite training sets. Though optimal, these predictions are extremely expensive to compute, and neither TD nor supervised-learning methods compute them directly. However, TD methods do have a special relationship with them. One common training process is to present a finite amount of data over and over again until the learning process converges (e.g., see Ackley, Hinton, & Sejnowski, 1985; Rumelhart, Hinton, & Williams, 1985). We prove that linear TD(0) converges under this repeated presentations training paradigm to the optimal predictions, while supervised-learning procedures converge to suboptimal predictions. This result also helps explain TD methods' empirically faster learning rates. Since they are stepping toward a better final result, it makes sense that they would also be better after the first step.

The word optimal can be misleading because it suggests a universally agreed upon criterion for the best way of doing something. In fact, there are many kinds of optimality, and choosing among them is often a critical decision. Suppose that one observes a training set consisting of a finite number of observation-outcome sequences, and that one knows the sequences to be generated by an absorbing Markov process as described in the previous section. What might one mean by the "best" predictions given such a training set?

If the a priori distribution of possible Markov processes is known, then the predictions that are optimal in the mean square sense can be calculated through Bayes's rule. Unfortunately, it is very difficult to justify any a priori assumptions about possible Markov processes. In order to avoid making any such assumptions, mathematicians have developed another kind of optimal estimate, known as the maximum-likelihood estimate. This is the kind of optimality with which we will be concerned. For example, suppose one flips a coin ten times and gets seven heads. What is the best estimate of the probability of getting a head on the next toss? In one sense, the best estimate depends entirely on a priori assumptions about how likely one is to run into fair and biased coins, and thus cannot be uniquely determined. On the other hand, the best answer in the maximum-likelihood sense requires no such assumptions; it is simply 0.7. In general, the maximum-likelihood estimate of the process that produced a set of data is that process whose probability of producing the data is the largest.

What is the maximum-likelihood estimate for our prediction problem? If the observation vectors x_i for each nonterminal state i are distinct, then one can enumerate the nonterminal states appearing in the training set and effectively know which state the process is in at each time. Since terminal states do not produce observation vectors, but only outcomes, it is not possible to tell when two sequences end in the same terminal state; thus we will assume that all sequences terminate in different states.⁸

⁸ Alternatively, we may assume that there is only one terminal state and that the distribution of a sequence's outcome depends on its penultimate state. This does not change any of the conclusions of the analysis.

Let T and N denote the sets of terminal and nonterminal states, respectively, as observed in the training set. Let [Q̂]_ij = p̂_ij (i, j ∈ N) be the fraction of the times that state i was entered in which a transition occurred to state j. Let z_j be the outcome of the sequence in which termination occurred at state j ∈ T, and let [ĥ]_i = Σ_{j∈T} p̂_ij z_j, i ∈ N. Q̂ and ĥ are the maximum-likelihood estimates of the true process parameters Q and h. Finally, estimate the expected value of the outcome z, given that the process is in state i ∈ N, as

[(I − Q̂)^{-1} ĥ]_i.    (8)

That is, choose the estimate that would be ideal if in fact the maximum-likelihood estimate of the underlying process were exactly correct. Let us call these estimates the optimal predictions. Note that even though Q̂ is an estimated quantity, it still corresponds to some absorbing Markov chain. Thus, lim_{n→∞} Q̂^n = 0 and Theorem A.1 applies, assuring the existence of the limit and inverse in the above equation.

Although the procedure outlined above serves well as a definition of optimal performance, note that it itself would be impractical to implement. First of all, it relies heavily on the observation vectors x_i being distinct, and on the assumption that they map one-to-one onto states. Second, the procedure involves keeping statistics on each pair of states (e.g., the p̂_ij) rather than on each state or component of the observation vector. If n is the number of states, then this procedure requires O(n²) memory whereas the other learning procedures require only O(n) memory. In addition, the right side of (8) must be re-computed each time additional data become available and new estimates are needed. This procedure may require as much as O(n³) computation per time step as compared to O(n) for the supervised-learning and TD methods.
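As a rough illustration of what this definition involves computationally, the following sketch (hypothetical data and function names, not code from the paper) forms the counts n_ij, the estimates Q̂ and ĥ, and then the optimal predictions (8) by solving the linear system (I − Q̂)v = ĥ:

```python
import numpy as np

# Sketch of the optimal (maximum-likelihood) predictions (8).  Each sequence is
# assumed to be a list of nonterminal state indices followed by a scalar outcome,
# e.g. [0, 1, 0, 1.0]; every state is assumed to appear at least once.
def optimal_predictions(sequences, num_states):
    n = np.zeros((num_states, num_states))   # n[i, j]: counts of i -> j among nonterminals
    h_counts = np.zeros(num_states)          # summed outcomes of transitions to termination
    d = np.zeros(num_states)                 # d[i]: number of times state i was entered
    for seq in sequences:
        states, outcome = seq[:-1], seq[-1]
        for t, i in enumerate(states):
            d[i] += 1
            if t + 1 < len(states):
                n[i, states[t + 1]] += 1     # nonterminal -> nonterminal transition
            else:
                h_counts[i] += outcome       # nonterminal -> terminal transition
    Q_hat = n / d[:, None]                   # [Q_hat]_ij = fraction of i's visits going to j
    h_hat = h_counts / d                     # [h_hat]_i = estimated expected terminal outcome
    return np.linalg.solve(np.eye(num_states) - Q_hat, h_hat)   # (I - Q_hat)^{-1} h_hat

# usage with made-up sequences over states {0, 1}
print(optimal_predictions([[0, 1, 1.0], [0, 0.0], [1, 0, 1, 1.0]], 2))
```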

Consider the case in which the observation vectors are linearly independent, the training set is repeatedly presented, and the weights are updated after each complete presentation of the training set. In this case, the Widrow-Hoff procedure converges so as to minimize the root mean squared error between its predictions and the actual outcomes in the training set (Widrow & Stearns, 1985). As illustrated earlier in the random-walk example, linear TD(0) converges to a different set of predictions. We now show that those predictions are in fact the optimal predictions in the maximum-likelihood sense discussed above. That is, we prove the following theorem:

Theorem 3 For any training set whose observation vectors {x_i | i ∈ N} are linearly independent, there exists an ε > 0 such that, for all positive α < ε and for any initial weight vector, the predictions of linear TD(0) converge, under repeated presentations of the training set with weight updates after each complete presentation, to the optimal predictions (8). That is, if w_n is the value of the weight vector after the training set has been presented n times, then lim_{n→∞} x_i^T w_n = [(I − Q̂)^{-1} ĥ]_i, ∀i ∈ N.

PROOF: The proof of Theorem 3 is almost the same as that of Theorem 2, so here we only highlight the differences. Linear TD(0) updates w_n after each presentation of the training set:

w_{n+1} = w_n + α Σ_s Σ_{t=1}^{m_s} (P^s_{t+1} − P^s_t) x^s_t,

where m_s is the number of observation vectors in the sth sequence in the training set, P^s_t is the tth prediction in the sth sequence, and P^s_{m_s+1} is defined to be the outcome of the sth sequence. Let n_ij be the number of times the transition i → j appears in the training set; then the sums can be regrouped as

w_{n+1} = w_n + α Σ_{i∈N} d_i x_i [ ĥ_i + Σ_{j∈N} p̂_ij x_j^T w_n − x_i^T w_n ],

where d_i is the number of times state i ∈ N appears in the training set. The rest of the proof for Theorem 2, starting at (6), carries through with estimates substituting for actual values throughout. The only step in the proof that requires additional support is to show that (7) still holds, i.e., that d^T = u^T(I − Q̂)^{-1}, where [u]_i is the number of sequences in the training set that begin in state i ∈ N. Note that Σ_{i∈N} n_ij = Σ_{i∈N} d_i p̂_ij is the number of times state j appears in the training set as the destination of a transition. Since all occurrences of state j must be either as the destination of a transition or as the beginning state of a sequence, d_j = [u]_j + Σ_{i∈N} d_i p̂_ij. Converting this to matrix notation, we have d^T = u^T + d^T Q̂, which yields the desired conclusion, d^T = u^T(I − Q̂)^{-1}, after algebraic manipulations. ■

We have just shown that if linear TD(0) is repeatedly presented with a finite training set, then it converges to the optimal estimates. The Widrow-Hoff rule, on the other hand, converges to the estimates that minimize error on the training set; as we saw in the random-walk example, these are in general different from the optimal estimates. That TD(0) converges to a better set of estimates with repeated presentations helps explain how and why it could learn better estimates from a single presentation, but it does not prove that. What is still needed is a characterization of the learning rate of TD methods that can be compared with those already available for supervised-learning methods.
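The following sketch (again with made-up sequences over two states whose observation vectors are the unit basis vectors, so linear independence holds) illustrates the contrast under the repeated-presentations paradigm: batch-updated linear TD(0) settles near the maximum-likelihood predictions (8), while the batch Widrow-Hoff rule settles at the per-state mean of the actual outcomes in the training set:

```python
import numpy as np

# Repeated presentations of a tiny made-up training set; weights are updated
# only after each complete sweep, as in Theorem 3.
train = [[0, 1, 1.0], [0, 0.0], [1, 0, 1, 1.0]]   # state indices, then the outcome
x = np.eye(2)                                     # unit-basis observation vectors
w_td, w_wh = np.zeros(2), np.zeros(2)
alpha = 0.05

for _ in range(5000):
    dw_td, dw_wh = np.zeros(2), np.zeros(2)
    for seq in train:
        states, z = seq[:-1], seq[-1]
        for t, i in enumerate(states):
            p_t = w_td @ x[i]
            p_next = z if t + 1 == len(states) else w_td @ x[states[t + 1]]
            dw_td += alpha * (p_next - p_t) * x[i]      # TD(0) increment
            dw_wh += alpha * (z - w_wh @ x[i]) * x[i]   # Widrow-Hoff (= TD(1)) increment
    w_td += dw_td
    w_wh += dw_wh

print(w_td)   # approaches the maximum-likelihood predictions (I - Q_hat)^{-1} h_hat
print(w_wh)   # approaches the per-state mean of actual outcomes in the training set
```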

4.3 Temporal-difference methods as gradient descent

Like many other statistical learning methods, TD methods can be viewed as gradient descent (hill climbing) in the space of the modifiable parameters (weights). That is, their goal can be viewed as minimizing an overall error measure J(w) over the space of weights by repeatedly incrementing the weight vector in (an approximation to) the direction in which J(w) decreases most steeply. Denoting the approximation to this direction of steepest descent, or gradient, as ∇_w J(w), such methods are typically written as

Δw_t = −α ∇_w J(w),

where α is a positive constant determining step size.

For a multi-step prediction problem in which P_t = P(x_t, w) is meant to approximate E{z | x_t}, a natural error measure is the expected value of the square of the difference between these two quantities:

J(w) = E_x{ (E{z | x} − P(x, w))² },

where E_x{ } denotes the expectation operator over observation vectors x. J(w) measures the error for a weight vector averaged over all observation vectors, but at each time step one usually obtains additional information about only a single observation vector. The usual next step, therefore, is to define a per-observation error measure Q(w, x) with the property that E_x{Q(w, x)} = J(w). For a multi-step prediction problem,

Q(w, x) = (E{z | x} − P(x, w))².

Each time step's weight increments are then determined using ∇_w Q(w, x_t), relying on the fact that E_x{∇_w Q(w, x)} = ∇_w J(w), so that the overall effect of the equation for Δw_t given above can be approximated over many steps using small α by

Δw_t = α (E{z | x_t} − P_t) ∇_w P_t.

The quantity E{z | x_t} is not directly known and must be estimated. Depending on how this is done, one gets either a supervised-learning method or a TD method. If E{z | x_t} is approximated by z, the outcome that actually occurs following x_t, then we get the classical supervised-learning procedure (2). Alternatively, if E{z | x_t} is approximated by P(x_{t+1}, w), the immediately following prediction, then we get the extreme TD method, TD(0). Key to this analysis is the recognition, in the definition of J(w), that our real goal is for each prediction to match the expected value of the subsequent outcome, not the actual outcome occurring in the training set. TD methods can perform better than supervised-learning methods because the actual outcome of a sequence is often not the best estimate of its expected value.
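The two estimates of E{z | x_t} correspond to two easily written per-observation updates. The sketch below (linear case, a toy three-step sequence, all names hypothetical) uses the next prediction as the target at nonterminal steps and the outcome z at the final step, which is the TD(0) choice; using z as the target at every step would give the supervised (Widrow-Hoff) choice instead:

```python
import numpy as np

# Per-observation increment from the gradient view, linear case P(x, w) = w @ x,
# so grad_w P_t = x_t.  The target stands in for the unknown E{z | x_t}.
def increment(w, x_t, target, alpha):
    return alpha * (target - w @ x_t) * x_t

w = np.zeros(3)
alpha = 0.1
xs = [np.array([1., 0., 0.]), np.array([0., 1., 0.]), np.array([0., 0., 1.])]
z = 1.0                                   # outcome observed after the last vector

for t, x_t in enumerate(xs):
    if t + 1 < len(xs):
        target = w @ xs[t + 1]            # TD(0): the immediately following prediction
    else:
        target = z                        # final step: the outcome itself
    w += increment(w, x_t, target, alpha)

# the supervised (Widrow-Hoff) alternative would use target = z at every step
print(w)
```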

5. Generalizations of TD(λ)

In this article, we have chosen to analyze particularly simple cases of temporal-difference methods. This has clarified their operation and made it possible to prove theorems. However, more realistic problems may require more complex TD methods. In this section, we briefly explore some ways in which the simple methods can be extended. Except where explicitly noted, the theorems presented earlier do not strictly apply to these extensions.

5.1 Predicting cumulative outcomes

Temporal-difference methods are not limited to predicting only the final outcome of a sequence; they can also be used to predict a quantity that accumulates over a sequence. That is, each step of a sequence may incur a cost, where we wish to predict the expected total cost over the sequence. A common way for this to arise is for the costs to be elapsed time. For example, in a bounded random walk one might want to predict how many steps will be taken before termination. In a pole-balancing problem one may want to predict time until a failure in balancing, and in a packet-switched telecommunications network one may want to predict the total delay in sending a packet. In game playing, points may be lost or won throughout a game, and we may be interested in predicting the expected net gain or loss. In all of these examples, the quantity predicted is the cumulative sum of a number of parts, where the parts become known as the sequence evolves. For convenience, we will continue to refer to these parts as costs, even though their minimization will not be a goal in all applications.

In such problems, it is natural to use the observation vector received at each step to predict the total cumulative cost after that step, rather than the total cost for the sequence as a whole. Thus, we will want P_t to predict the remaining cumulative cost given the tth observation rather than the overall cost for the sequence. Since the cost for the preceding portion of the sequence is already known, the total sequence cost can always be estimated as the sum of the known cost-so-far and the estimated cost-remaining (cf. the A* algorithm, dynamic programming).

The procedures presented earlier are easily generalized to include the case of predicting cumulative outcomes. Let c_{t+1} denote the actual cost incurred between times t and t + 1, and let c_ij denote the expected value of the cost incurred on transition from state i to state j. We would like P_t to equal the expected value of z_t = Σ_{k=t}^{m} c_{k+1}, where m is the number of observation vectors in the sequence. The prediction error can be represented in terms of temporal differences as z_t − P_t = Σ_{k=t}^{m} c_{k+1} − P_t = Σ_{k=t}^{m} (c_{k+1} + P_{k+1} − P_k), where we define P_{m+1} = 0. Then, following the same steps used to derive the TD(λ) family of procedures defined by (4), one can also derive the cumulative TD(λ) family defined by

Δw_t = α (c_{t+1} + P_{t+1} − P_t) Σ_{k=1}^{t} λ^{t−k} ∇_w P_k.
The three theorems presented earlier in this article carry over to the cumulative outcome case with the obvious modifications. For example, the ideal prediction for each state i ∈ N is the expected value of the cumulative sum of the costs:

E{ Σ_{k=t}^{m} c_{k+1} | i_t = i }.

If we let h be the vector with components [h]_i = Σ_j p_ij c_ij, i ∈ N, then (5) holds for this case as well. Following steps similar to those in the proof of Theorem 2, one can show that, using linear cumulative TD(0), the expected value of the weight vector after n sequences have been experienced is

E{ w_{n+1} | w_n } = w_n + α X D [ h − (I − Q) X^T w_n ],
after which the rest of the proof of Theorem 2 follows unchanged.
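A minimal sketch of the cumulative TD(λ) rule just given, for the linear case with increments accumulated over one sequence and applied afterwards (the sequence, costs, and parameters are made up):

```python
import numpy as np

def cumulative_td_lambda(w, xs, costs, alpha=0.1, lam=0.5):
    # xs: observation vectors x_1..x_m; costs[t] is c_{t+1}, the cost incurred
    # between times t and t+1 (the last one accompanies the move to termination).
    # Increments are accumulated and applied after the complete sequence.
    dw = np.zeros_like(w)
    e = np.zeros_like(w)            # sum_{k<=t} lam^(t-k) grad_w P_k, with grad_w P_k = x_k
    for t in range(len(xs)):
        e = lam * e + xs[t]
        p_t = w @ xs[t]
        p_next = w @ xs[t + 1] if t + 1 < len(xs) else 0.0   # P_{m+1} is defined to be 0
        dw += alpha * (costs[t] + p_next - p_t) * e
    return w + dw

w = cumulative_td_lambda(np.zeros(2),
                         [np.array([1., 0.]), np.array([0., 1.])],
                         costs=[1.0, 3.0])
print(w)
```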

5.2 Intra-sequence weight updating

So far we have concentrated on TD procedures in which the weight vector is updated after the presentation of a complete sequence or training set. Since each observation of a sequence generates an increment to the weight vector, in many respects it would be simpler to update the weight vector immediately after each observation. In fact, all previously studied TD methods have operated in this more fully incremental way.

Extending TD(λ) to allow for intra-sequence updating requires a bit of care. The obvious extension is

w_{t+1} = w_t + α (P_{t+1} − P_t) Σ_{k=1}^{t} λ^{t−k} ∇_w P_k.

However, if w is changed within a sequence, then the temporal changes in prediction during the sequence, as defined by this procedure, will be due to changes in w as well as to changes in x. This is probably an undesirable feature; in extreme cases it may even lead to instability. The following update rule ensures that only changes in prediction due to x are effective in causing weight alterations:

w_{t+1} = w_t + α (w_t^T x_{t+1} − w_t^T x_t) Σ_{k=1}^{t} λ^{t−k} ∇_w P_k.
This refinement is used in Samuel's (1959) checker player and in Sutton's (1984) Adaptive Heuristic Critic, but not in Holland's (1986) bucket brigade or in the system described by Barto et al. (1983).
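A sketch of this refined intra-sequence rule in the linear case (made-up data): both predictions entering each temporal difference are computed with the same weight vector before it is changed, so only changes due to x drive the update:

```python
import numpy as np

def online_linear_td_lambda(w, xs, z, alpha=0.1, lam=0.5):
    # linear predictions P_t = w @ x_t; z is the sequence's final outcome
    w = w.copy()
    e = np.zeros_like(w)                                  # sum_k lam^(t-k) x_k
    for t in range(len(xs)):
        e = lam * e + xs[t]
        p_t = w @ xs[t]                                   # both predictions from the current w
        p_next = z if t + 1 == len(xs) else w @ xs[t + 1]
        w += alpha * (p_next - p_t) * e                   # w changes only after both are formed
    return w

w = online_linear_td_lambda(np.zeros(2),
                            [np.array([1., 0.]), np.array([0., 1.])], z=1.0)
print(w)
```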

5.3 Prediction by a fixed interval

Finally, consider the problem of making a prediction for a particular fixed amount of time later. For example, suppose you are interested in predicting one week in advance whether or not it will rain - on each Monday, you predict whether it will rain on the following Monday, on each Tuesday, you predict whether it will rain on the following Tuesday, and so on for each day of the week. Although this problem involves a sequence of predictions, TD methods cannot be directly applied because each prediction is of a different event and thus there is no clear desired relationship between them.

In order to apply TD methods, this problem must be embedded within a larger family of prediction problems. At each day t, we must form not only P_t^7, our estimate of the probability of rain seven days later, but also P_t^1, P_t^2, ..., P_t^6, where each P_t^δ is an estimate of the probability of rain δ days later. This will provide for overlapping sequences of inter-related predictions, e.g., P_t^7, P_{t+1}^6, P_{t+2}^5, ..., P_{t+6}^1, all of the same event, in this case of whether it will rain on day t + 7. If the predictions are accurate, we will have P_t^δ = P_{t+1}^{δ−1}, ∀t, 1 ≤ δ ≤ 7, where P_t^0 is defined as the actual outcome at time t (e.g., 1 if it rains, 0 if it does not rain). The update rule for the weight vector w^δ used to compute P_t^δ would be

Δw_t^δ = α (P_{t+1}^{δ−1} − P_t^δ) Σ_{k=1}^{t} λ^{t−k} ∇_{w^δ} P_k^δ.
As illustrated here, there are three key steps in constructing a TD method for a particular problem. First, embed the problem of interest in a larger class of problems, if necessary, in order to produce an appropriate sequence of predictions. Second, write down recursive equations expressing the desired relationship between predictions at different times in the sequence. For the simplest cases, with which this article has been mostly concerned, these are just P_t = P_{t+1}, whereas in the cumulative outcome case these are P_t = P_{t+1} + c_{t+1}. Third, construct an update rule that uses the mismatch in the recursive equations to drive weight changes towards a better match. These three steps are very similar to those taken in formulating a dynamic programming problem (e.g., Denardo, 1982).
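As a concrete illustration of the embedding step, here is a sketch of the seven-day rain example with linear predictions, simplified to λ = 0 (no eligibility traces) and driven by random placeholder observations and outcomes rather than real data:

```python
import numpy as np

# Fixed-interval embedding: w[delta] gives P_t^delta = w[delta] @ x_t for delta = 1..7.
rng = np.random.default_rng(0)
num_features, horizon, alpha = 4, 7, 0.1
w = np.zeros((horizon + 1, num_features))       # row 0 unused

x_prev = rng.random(num_features)               # observation on day t
for day in range(200):
    x_t = rng.random(num_features)              # observation on day t+1
    rained = float(rng.random() < 0.3)          # actual outcome on day t+1 (this is P^0)
    # targets[delta-1] is P_{t+1}^{delta-1}, computed before any weights change
    targets = [rained] + [w[d] @ x_t for d in range(1, horizon)]
    for delta in range(1, horizon + 1):
        # move P_t^delta toward P_{t+1}^{delta-1}, i.e. adjust w[delta] at x_prev
        w[delta] += alpha * (targets[delta - 1] - w[delta] @ x_prev) * x_prev
    x_prev = x_t
print(w[horizon])                               # the seven-day-ahead predictor's weights
```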

6. Related research

Although temporal-difference methods have never previously been identified or studied on their own, we can view some previous machine learning research as having used them. In this section we briefly review some of this previous work in light of the ideas developed here.

6.1 Samuel's checker-playing program

The earliest known use of a TD method was in Samuel's (1959) celebrated checker-playing program. This was in his "learning by generalization" procedure, which modified the parameters of the function used to evaluate board positions. The evaluation of a position was thought of as an estimate or prediction of how the game would eventually turn out starting from that position. Thus, the sequence of positions from an actual game or an anticipated continuation naturally gave rise to a sequence of predictions, each about the game's final outcome.

In Samuel's learning procedure, the difference between the evaluations of each pair of successive positions occurring in a game was used as an error; that is, it was used to alter the prediction associated with the first position of the pair to be more like the prediction associated with the second. The predictions for the two positions were computed in different ways. In most versions of the program, the prediction for the first position was simply the result of applying the current evaluation function to that position. The prediction for the second position was the "backed-up" or minimax score from a lookahead search started at that position, using the current evaluation function. Samuel referred to the difference between these two predictions as delta. Although his updating procedure was much more complicated than TD(0), his intent was to use delta much as P_{t+1} − P_t is used in (linear) TD(0).

However, Samuel's learning procedure significantly differed from all the TD methods discussed here in its treatment of the final step of a sequence. We have considered each sequence to end with a definite, externally-supplied outcome (e.g., 1 for a victory and 0 for a defeat). The prediction for the last position in a sequence was altered so as to match this final outcome. In Samuel's procedure, on the other hand, no position had a definite a priori evaluation, and the evaluation for the last position in a sequence was never explicitly altered. Thus, while both procedures constrained the evaluations (predictions) of nonterminal positions to match those that follow them, Samuel's provided no additional constraint on the evaluation of terminal positions. As he himself pointed out, many useless evaluation functions satisfy just the first constraint (e.g., any function that is constant for all positions).

To discourage his learning procedure from finding useless evaluation functions, Samuel included in the evaluation function a non-modifiable term measuring how many more pieces his program had than its opponent. However, although this modification may have decreased the likelihood of finding useless evaluation functions, it did not prohibit them. For example, a constant function could still have been attained by setting the modifiable terms so as to cancel the effect of the non-modifiable one.

If Samuel's learning procedure was not constrained to find useful evaluation functions, then it should have been possible for it to become worse with experience. In fact, Samuel reported observing this during extensive self-play training sessions. He found that a good way to get the program improving again was to set the weight with the largest absolute value back to zero. His interpretation was that this drastic intervention jarred the program out of local optima, but another possibility is that it jarred the program out of evaluation functions that changed little, but that also had little to do with winning or losing the game.

Nevertheless, Samuel's learning procedure was overall very successful; it played an important role in significantly improving the play of his checker-playing program until it rivaled human checker masters. Christensen and Korf have investigated a simplification of Samuel's procedure that also does not constrain the evaluations of terminal positions, and have obtained promising preliminary results (Christensen, 1986; Christensen & Korf, 1986). Thus, although a terminal constraint may be critical to good temporal-difference theory, apparently it is not strictly necessary to obtain good performance.

6.2 Backpropagation in connectionist networks

The backpropagation technique of Rumelhart et al. (1985) is one of the most exciting recent developments in incremental learning methods. This technique extends the Widrow-Hoff rule so that it can be applied to the interior "hidden" units of multi-layer connectionist networks. In a backpropagation network, the input-output functions of all units are deterministic and differentiable. As a result, the partial derivatives of the error measure with respect to each connection weight are well-defined, and one can apply a gradient-descent approach such as that used in the original Widrow-Hoff rule. The term "backpropagation" refers to the way the partial derivatives are efficiently computed in a backward propagating sweep through the network. As presented by Rumelhart et al., backpropagation is explicitly a supervised-learning procedure.

The purpose of both backpropagation and TD methods is accurate credit assignment. Backpropagation decides which part(s) of a network to change so as to influence the network's output and thus to reduce its overall error, whereas TD methods decide how each output of a temporal sequence of outputs should be changed. Backpropagation addresses a structural credit-assignment issue whereas TD methods address a temporal credit-assignment issue.

Although it currently seems that backpropagation and TD methods address different parts of the credit-assignment problem, it is important to note that they are perfectly compatible and easily combined. In this article, we have emphasized the linear case, but the TD methods presented are equally applicable to predictions formed by nonlinear functions, such as backpropagation-style networks. The key requirement is that the gradient ∇_w P_t be computable. In a linear system, this is just x_t. In a network of differentiable nonlinear elements, it can be computed by a backpropagation process. For example, Anderson (1986, 1987) has implemented such a combination of backpropagation and a temporal-difference method (the Adaptive Heuristic Critic, see below), successfully applying it to both a nonlinear broomstick-balancing task and the Towers of Hanoi problem.
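A sketch of such a combination (a hypothetical one-hidden-layer network, not Anderson's system): the prediction is produced by a small differentiable network, ∇_w P_t is obtained by backpropagating through it, and the TD(0) error scales the parameter changes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, alpha = 3, 5, 0.05
W1, b1 = rng.normal(scale=0.1, size=(n_hid, n_in)), np.zeros(n_hid)
W2, b2 = rng.normal(scale=0.1, size=n_hid), 0.0

def forward(x):
    # scalar prediction P(x) from a one-hidden-layer tanh network
    h = np.tanh(W1 @ x + b1)
    return W2 @ h + b2, h

def gradients(x, h):
    # partial derivatives of the scalar prediction with respect to each parameter
    dW2, db2 = h, 1.0
    dpre = W2 * (1.0 - h ** 2)            # back through tanh
    dW1, db1 = np.outer(dpre, x), dpre
    return dW1, db1, dW2, db2

xs = [np.array([1., 0., 0.]), np.array([0., 1., 0.]), np.array([0., 0., 1.])]
z = 1.0                                    # outcome of this toy sequence
for t, x_t in enumerate(xs):
    p_t, h = forward(x_t)
    p_next = z if t + 1 == len(xs) else forward(xs[t + 1])[0]
    err = p_next - p_t                     # TD(0) error
    dW1, db1, dW2, db2 = gradients(x_t, h)
    W1 += alpha * err * dW1; b1 += alpha * err * db1
    W2 += alpha * err * dW2; b2 += alpha * err * db2
print(forward(xs[0])[0])
```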

6.3 Holland's bucket brigade

Holland's (1986) bucket brigade is a technique for learning sequences of rule invocations in a kind of adaptive production system called a classifier system. The production rules in a classifier system compete to become active and have their right-hand sides (called messages) posted to a working-memory data structure (called the message list). Conflict resolution is carried out by a competitive auction. Each rule that matches the current contents of the message list makes a bid that depends on the product of its specificity and its strength, a modifiable numerical parameter. The highest bidders become active and post their messages to a new message list for the next round of the auction.

The bucket brigade is the process that adjusts the strengths of the rules and thereby determines which rules will become active at which times. When a rule becomes active, it loses strength by the amount of its bid, but also gains strength if the message it posts triggers other rules to become active in the next round of the auction. The strength gained is exactly the bids of the other rules. If several rules post the same message, then the bids of all responders are pooled and divided equally among the posting rules. In principle, long chains of rule invocations can be learned in this way, with strength being passed back from rule to rule, thus the name "bucket brigade." For a chain to be stable, its final rule must affect the environment, achieve a goal, and thereby receive new strength in the form of a payoff from the external environment.

Temporal-difference methods and the bucket brigade both borrow the same key idea from Samuel's work - that the steps in a sequence should be evaluated and adjusted according to their immediate or near-immediate successors, rather than according to the final outcome. The similarity between TD methods and the bucket brigade can be seen at a more detailed level by considering the latter's effect on a linear chain of rule invocations. Each rule's strength can be viewed as a prediction of the payoff that will ultimately be obtained from the environment. Assuming equal specificities, the strength of each rule experiences a net change dependent on the difference between that strength and the strength of the succeeding rule. Thus, like TD(0), the bucket brigade updates each strength (prediction) in a sequence of strengths (predictions) according to the immediately following temporal difference in strength (prediction).

There are also numerous differences between the bucket brigade and the TD methods presented here. The most important of these is that the bucket brigade assigns credit based on the rules that caused other rules to become active, whereas TD methods assign credit based solely on temporal succession. The bucket brigade thus performs both temporal and structural credit assignment in a single mechanism. This contrasts with the TD/backpropagation combination discussed in the preceding subsection, which uses separate mechanisms for each kind of credit assignment. The relative advantages of these two approaches are still to be determined.

6.4 Infinite discounted predictions and the Adaptive Heuristic Critic

All the prediction problems we have considered so far have had definite outcomes. That is, after some point in time the actual outcome corresponding to each prediction became known. Supervised-learning methods require this property, because they make no learning changes until the actual outcome is discovered, but in some problems it never becomes completely known. For example, suppose one wishes to predict the total return from investing in the stock of various companies; unless a company goes out of business, total return is never fully determined.

Actually, there is a problem of definition here: if a company never goes out of business and earns income every year, the total return can be infinite. For reasons of this sort, infinite-horizon prediction problems usually include some form of discounting. For example, if some process generates costs c_{t+1} at each transition from t to t + 1, we may want P_t to predict the discounted sum:

z_t = Σ_{k=0}^{∞} γ^k c_{t+k+1},

where the discount-rate parameter γ, 0 ≤ γ < 1, determines the extent to which we are concerned with short-range or long-range prediction.

If P_t should equal the above z_t, then what are the recursive equations defining the desired relationship between temporally successive predictions? If the predictions are accurate, we can write

P_t = c_{t+1} + γ P_{t+1}.

The mismatch or TD error is the difference between the two sides of this equation, (c_{t+1} + γP_{t+1}) − P_t.⁹ Sutton's (1984) Adaptive Heuristic Critic uses this error in a learning rule otherwise identical to TD(λ)'s:

Δw_t = α (c_{t+1} + γ P_{t+1} − P_t) Σ_{k=1}^{t} λ^{t−k} ∇_w P_k,

where P_t is the linear form w^T x_t, so that ∇_w P_t = x_t. Thus, the Adaptive Heuristic Critic is probably best understood as using the linear TD method for predicting discounted cumulative outcomes.

⁹ Witten (1977) first proposed updating predictions of a discounted sum based on a discrepancy of this sort.
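A sketch of this discounted, linear TD(λ) rule for a single stretch of experience (the costs, observation vectors, and parameter values are made up):

```python
import numpy as np

def discounted_td_lambda(w, xs, costs, alpha=0.1, lam=0.5, gamma=0.9):
    # linear predictions P_t = w @ x_t; costs[t] is c_{t+1}
    w = w.copy()
    e = np.zeros_like(w)                                   # sum_k lam^(t-k) x_k
    for t in range(len(xs) - 1):
        e = lam * e + xs[t]
        td_error = costs[t] + gamma * (w @ xs[t + 1]) - w @ xs[t]   # (c_{t+1} + gamma P_{t+1}) - P_t
        w += alpha * td_error * e
    return w

w = discounted_td_lambda(np.zeros(2),
                         [np.array([1., 0.]), np.array([0., 1.]), np.array([1., 0.])],
                         costs=[0.5, 1.0])
print(w)
```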

7. Conclusion

These analyses and experiments suggest that TD methods may be the learning methods of choice for many real-world learning problems. We have argued that many of these problems involve temporal sequences of observations and predictions. Whereas conventional, supervised-learning approaches disregard this temporal structure, TD methods are specially tailored to it. As a result, they can be computed more incrementally and require significantly less memory and peak computation. One TD method makes exactly the same predictions and learning changes as a supervised-learning method, while retaining these computational advantages. Another TD method makes different learning changes, but has been proved to converge asymptotically to the same correct predictions. Empirically, TD methods appear to learn faster than supervised-learning methods, and one TD method has been proved to make optimal predictions for finite training sets that are presented repeatedly. Overall, TD methods appear to be computationally cheaper and to learn faster than conventional approaches to prediction learning.

The progress made in this paper has been due primarily to treating TD methods as general methods for learning to predict rather than as specialized methods for learning evaluation functions, as they were in all previous work. This simplification makes their theory much easier and also greatly broadens their range of applicability. It is now clear that TD methods can be used for any pattern recognition problem in which data are gathered over time - for example, speech recognition, process monitoring, target identification, and market-trend prediction. Potentially, all of these can benefit from the advantages of TD methods vis-a-vis supervised-learning methods. In speech recognition, for example, current learning methods cannot be applied until the correct classification of the word is known. This means that all critical information about the waveform and how it was processed must be stored for later credit assignment. If learning proceeded simultaneously with processing, as in TD methods, this storage would be avoided, making it practical to consider far more features and combinations of features.

As general prediction-learning methods, temporal-difference methods can also be applied to the classic problem of learning an internal model of the world. Much of what we mean by having such a model is the ability to predict the future based on current actions and observations. This prediction problem is a multi-step one, and the external world is well modeled as a causal dynamical system; hence TD methods should be applicable and advantageous. Sutton and Pinette (1985) and Sutton and Barto (1981b) have begun to pursue one approach along these lines, using TD methods and recurrent connectionist networks.

Animals must also face the problem of learning internal models of the world. The learning process that seems to perform this function in animals is called Pavlovian or classical conditioning. For example, if a dog is repeatedly presented with the sound of a bell and then fed, it will learn to predict the meal given just the bell, as evidenced by salivation to the bell alone. Some of the detailed features of this learning process suggest that animals may be using a TD method (Kehoe, Schreurs, & Graham, 1987; Sutton & Barto, 1987).

Acknowledgements

The author acknowledges especially the assistance of Andy Barto, Martha Steenstrup, Chuck Anderson, John Moore, and Harry Klopf. I also thank Oliver Selfridge, Pat Langley, Ron Rivest, Mike Grimaldi, John Aspinall, Gene Cooperman, Bud Frawley, Jonathan Bachrach, Mike Seymour, Steve Epstein, Jim Kehoe, Les Servi, Ron Williams, and Marie Goslin. The early stages of this research were supported by AFOSR contract F33615-83-C-1078.

References

Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 9, 147-169.

Anderson, C. W. (1986). Learning and problem solving with multilayer connectionist systems. Doctoral dissertation, Department of Computer and Information Science, University of Massachusetts, Amherst.

Anderson, C. W. (1987). Strategy learning with multilayer connectionist representations. Proceedings of the Fourth International Workshop on Machine Learning (pp. 103-114). Irvine, CA: Morgan Kaufmann.

Barto, A. G. (1985). Learning by statistical cooperation of self-interested neuron-like computing elements. Human Neurobiology, 4, 229-256.

Barto, A. G., Sutton, R. S., & Anderson, C. W. (1983). Neuronlike elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, 13, 834-846.

Booker, L. B. (1982). Intelligent behavior as an adaptation to the task environment. Doctoral dissertation, Department of Computer and Communication Sciences, University of Michigan, Ann Arbor.

Christensen, J. (1986). Learning static evaluation functions by linear regression. In T. M. Mitchell, J. G. Carbonell, & R. S. Michalski (Eds.), Machine learning: A guide to current research. Boston: Kluwer Academic.

Christensen, J., & Korf, R. E. (1986). A unified theory of heuristic evaluation functions and its application to learning. Proceedings of the Fifth National Conference on Artificial Intelligence (pp. 148-152). Philadelphia, PA: Morgan Kaufmann.

Denardo, E. V. (1982). Dynamic programming: Models and applications. Englewood Cliffs, NJ: Prentice-Hall.

Dietterich, T. G., & Michalski, R. S. (1986). Learning to predict sequences. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. 2). Los Altos, CA: Morgan Kaufmann.

Gelperin, A., Hopfield, J. J., & Tank, D. W. (1985). The logic of Limax learning. In A. Selverston (Ed.), Model neural networks and behavior. New York: Plenum Press.

Hampson, S. E. (1983). A neural model of adaptive behavior. Doctoral dissertation, Department of Information and Computer Science, University of California, Irvine.

Hampson, S. E., & Volper, D. J. (1987). Disjunctive models of boolean category learning. Biological Cybernetics, 56, 121-137.

Holland, J. H. (1986). Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (Vol. 2). Los Altos, CA: Morgan Kaufmann.

Kehoe, E. J., Schreurs, B. G., & Graham, P. (1987). Temporal primacy overrides prior training in serial compound conditioning of the rabbit's nictitating membrane response. Animal Learning and Behavior, 15, 455-464.

Kemeny, J. G., & Snell, J. L. (1976). Finite Markov chains. New York: Springer-Verlag.

Klopf, A. H. (1987). A neuronal model of classical conditioning (Technical Report 87-1139). Wright-Patterson Air Force Base, OH: Wright Aeronautical Laboratories.

Moore, J. W., Desmond, J. E., Berthier, N. E., Blazis, D. E. J., Sutton, R. S., & Barto, A. G. (1986). Simulation of the classically conditioned nictitating membrane response by a neuron-like adaptive element: Response topography, neuronal firing and interstimulus intervals. Behavioral Brain Research, 21, 143-154.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error propagation (Technical Report No. 8506). La Jolla: University of California, San Diego, Institute for Cognitive Science. Also in D. E. Rumelhart & J. L. McClelland (Eds.), Parallel distributed processing: Explorations in the microstructure of cognition (Vol. 1). Cambridge, MA: MIT Press.

Samuel, A. L. (1959). Some studies in machine learning using the game of checkers. IBM Journal of Research and Development, 3, 210-229. Reprinted in E. A. Feigenbaum & J. Feldman (Eds.), Computers and thought. New York: McGraw-Hill.

Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Doctoral dissertation, Department of Computer and Information Science, University of Massachusetts, Amherst.

Sutton, R. S., & Barto, A. G. (1981a). Toward a modern theory of adaptive networks: Expectation and prediction. Psychological Review, 88, 135-171.

Sutton, R. S., & Barto, A. G. (1981b). An adaptive network that constructs and uses an internal model of its environment. Cognition and Brain Theory, 4, 217-246.

Sutton, R. S., & Barto, A. G. (1987). A temporal-difference model of classical conditioning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society (pp. 355-378). Seattle, WA: Lawrence Erlbaum.

Sutton, R. S., & Pinette, B. (1985). The learning of world models by connectionist networks. Proceedings of the Seventh Annual Conference of the Cognitive Science Society (pp. 54-64). Irvine, CA: Lawrence Erlbaum.

Varga, R. S. (1962). Matrix iterative analysis. Englewood Cliffs, NJ: Prentice-Hall.

Widrow, B., & Hoff, M. E. (1960). Adaptive switching circuits. 1960 WESCON Convention Record, Part IV (pp. 96-104).

Widrow, B., & Stearns, S. D. (1985). Adaptive signal processing. Englewood Cliffs, NJ: Prentice-Hall.

Williams, R. J. (1986). Reinforcement learning in connectionist networks: A mathematical analysis (Technical Report No. 8605). La Jolla: University of California, San Diego, Institute for Cognitive Science.

Witten, I. H. (1977). An adaptive optimal controller for discrete-time Markov environments. Information and Control, 34, 286-295.

Appendix: Accessory Theorems

Theorem A.1 If lim_{n→∞} A^n = 0, then I − A has an inverse, and (I − A)^{-1} = Σ_{i=0}^{∞} A^i.

PROOF: See Kemeny and Snell (1976, p. 22).
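A quick numerical check of Theorem A.1, using an arbitrary made-up matrix whose powers vanish (spectral radius less than one):

```python
import numpy as np

A = np.array([[0.2, 0.5],
              [0.1, 0.3]])
assert abs(np.linalg.eigvals(A)).max() < 1                  # so A^n -> 0

series = sum(np.linalg.matrix_power(A, i) for i in range(200))
print(np.allclose(series, np.linalg.inv(np.eye(2) - A)))    # True: sum_i A^i = (I - A)^{-1}
```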

Theorem A.2 For any matrix A with linearly independent columns, A^T A is nonsingular.

PROOF: If A^T A were singular, then there would exist a vector y ≠ 0 such that

A^T A y = 0, and hence y^T A^T A y = (Ay)^T (Ay) = 0,

which would imply that Ay = 0, contradicting the assumptions that y ≠ 0 and that A has linearly independent columns. ■

Theorem A.3 A square matrix A is positive definite if and only if A + A^T is positive definite.

PROOF:

y^T A y = ½ y^T (A + A^T) y + ½ y^T (A − A^T) y.

But the second term is 0, because y^T(A − A^T)y = (y^T(A − A^T)y)^T = y^T(A^T − A)y = −y^T(A − A^T)y, and the only number that equals its own negative is 0. Therefore,

y^T A y = ½ y^T (A + A^T) y,

implying that y^T A y and y^T(A + A^T)y always have the same sign, and thus either A and A + A^T are both positive definite, or neither is positive definite. ■