
2.997 Decision-Making in Large-Scale Systems                         February 4
MIT, Spring 2004                                                     Handout #1

Lecture Note 1

1 Markov Decision Processes

In this class we will study discrete-time stochastic systems. We can describe the evolution (dynamics) of these systems by the following equation, which we call the system equation:

    x_{t+1} = f(x_t, a_t, w_t),    (1)

where $x_t \in S$, $a_t \in A_{x_t}$ and $w_t \in W$ denote the system state, decision and random disturbance at time $t$, respectively. In words, the state of the system at time $t+1$ is a function $f$ of the state, the decision and a random disturbance at time $t$. An important assumption of this class of models is that, conditioned on the current state $x_t$, the distribution of future states $x_{t+1}, x_{t+2}, \ldots$ is independent of the past states $x_{t-1}, x_{t-2}, \ldots$. This is the Markov property, which gives rise to the name Markov decision processes.

An alternative representation of the system dynamics is given through transition probability matrices: for each state-action pair $(x,a)$, we let $P_a(x,y)$ denote the probability that the next state is $y$, given that the current state is $x$ and the current action is $a$.

We are concerned with the problem of how to make decisions over time. In other words, we would like to pick an action $a_t \in A_{x_t}$ at each time $t$. In real-world problems, this is typically done with some objective in mind, such as minimizing costs, maximizing profits or rewards, or reaching a goal. Let $u(x,t)$ take values in $A_x$, for each $x$. Then we can think of $u$ as a decision rule that prescribes an action from the set of available actions $A_x$ based on the current time stage $t$ and current state $x$. We call $u$ a policy.

In this course, we will assess the quality of each policy based on costs that are accumulated additively over time. More specifically, we assume that at each time stage $t$ a cost $g_{a_t}(x_t)$ is incurred. In the next section, we describe some of the optimality criteria that will be used in this class when choosing a policy.

Based on the previous discussion, we characterize a Markov decision process by a tuple $(S, A, P_\cdot(\cdot,\cdot), g_\cdot(\cdot))$, consisting of a state space, a set of actions associated with each state, transition probabilities and costs associated with each state-action pair. For simplicity, we will assume throughout the course that $S$ and $A_x$ are finite. Most results extend to the case of countably or uncountably infinite state and action spaces under certain technical assumptions.
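To make these objects concrete, the following is a minimal sketch of how such a finite MDP might be stored in code; the two-state, two-action numbers are made up purely for illustration and are not from the notes.

```python
import numpy as np

# A tiny finite MDP (S, A, P, g); all numbers are illustrative.
n_states, n_actions = 2, 2

# P[a][x, y] = probability of moving from state x to y under action a.
P = np.array([
    [[0.9, 0.1],
     [0.4, 0.6]],   # action 0
    [[0.2, 0.8],
     [0.5, 0.5]],   # action 1
])

# g[a][x] = one-stage cost of taking action a in state x.
g = np.array([
    [1.0, 2.0],     # action 0
    [4.0, 0.5],     # action 1
])

# Sanity check: each row of each transition matrix sums to one.
assert np.allclose(P.sum(axis=2), 1.0)
```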

2 Optimality Criteria

In the previous section we described Markov decision processes, and introduced the notion that decisions are made over time with some objective in mind. Some of the optimality criteria considered in this class are the following.

1. Finite-horizon cost:

    E\Big[ \sum_{t=0}^{T-1} g_{a_t}(x_t) \,\Big|\, x_0 = x \Big]    (2)


2. Average cost:

    \limsup_{T \to \infty} E\Big[ \frac{1}{T} \sum_{t=0}^{T-1} g_{a_t}(x_t) \,\Big|\, x_0 = x \Big]    (3)

3. Infinite-horizon discounted cost:

    E\Big[ \sum_{t=0}^{\infty} \alpha^t g_{a_t}(x_t) \,\Big|\, x_0 = x \Big],    (4)

where $\alpha \in (0,1)$ is a discount factor expressing temporal preferences. The presence of a discount factor is most intuitive in problems involving cash flows, where the value of the same nominal amount of money at a later time stage is not the same as its value at an earlier time stage, since money at the earlier stage can be invested at a risk-free interest rate and is therefore equivalent to a larger nominal amount at a later stage. However, discounted costs also offer good approximations to the other optimality criteria. In particular, it can be shown that, when the state and action spaces are finite, there is a large enough discount factor $\alpha$ such that optimal policies for the discounted-cost criterion are also optimal for the average-cost criterion.


A common choice for the state of this system is an 8-dimensional vector containing the queue lengths. Since each server serves multiple queues, in each time step it is necessary to decide which queue each of the different servers is going to serve. A decision of this type may be coded as an 8-dimensional vector $a$ indicating which queues are being served, satisfying the constraint that no more than one queue associated with each server is being served, i.e., $a_i \in \{0,1\}$, and $a_1 + a_3 + a_8 \le 1$, $a_2 + a_6 \le 1$, $a_4 + a_5 + a_7 \le 1$. We can impose additional constraints on the choices of $a$ as desired, for instance considering only non-idling policies.

Policies are described by a mapping $u$ returning an allocation of server effort $a$ as a function of the system state $x$. We represent the evolution of the queue lengths in terms of transition probabilities, i.e., the conditional probabilities for the next state $x(t+1)$ given that the current state is $x(t)$ and the current action is $a(t)$. For instance,

    Prob(x_1(t+1) = x_1(t)+1 \,|\, x(t), a(t)) = \lambda_1,
    Prob(x_3(t+1) = x_3(t)+1,\ x_2(t+1) = x_2(t)-1 \,|\, x(t), a(t)) = \mu_2 I(x_2(t) > 0,\ a_2(t) = 1),
    Prob(x_3(t+1) = x_3(t)-1 \,|\, x(t), a(t)) = \mu_3 I(x_3(t) > 0,\ a_3(t) = 1),

corresponding to an arrival to queue 1, a departure from queue 2 and an arrival to queue 3, and a departure from queue 3. $I(\cdot)$ is the indicator function, and $\lambda_i$ and $\mu_i$ denote arrival and service completion probabilities. Transition probabilities related to other events are defined similarly.

We may consider costs of the form $g(x) = \sum_i x_i$, the total number of unfinished units in the system. For instance, this is a reasonably common choice of cost for manufacturing systems, which are often modelled as queueing networks.
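As a sketch, event-driven transitions of this kind might be simulated as follows; the simplified 3-queue topology and the $\lambda_i$, $\mu_i$ values below are placeholder assumptions, not the parameters of the network described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters for a 3-queue line: queue 0 -> queue 1 -> out.
lam = np.array([0.2, 0.0, 0.0])   # exogenous arrival probabilities
mu = np.array([0.3, 0.25, 0.0])   # service completion probabilities
route = {0: 1, 1: None}           # where a completed job goes next

def step(x, a):
    """One transition of the queueing MDP: at most one event occurs."""
    x = x.copy()
    u = rng.random()
    acc = 0.0
    for i in range(len(x)):                   # exogenous arrivals
        acc += lam[i]
        if u < acc:
            x[i] += 1
            return x
    for i in range(len(x)):                   # service completions
        if x[i] > 0 and a[i] == 1:
            acc += mu[i]
            if u < acc:
                x[i] -= 1
                if route.get(i) is not None:
                    x[route[i]] += 1
                return x
    return x                                  # no event this slot

x = step(np.array([3, 1, 0]), a=np.array([1, 1, 0]))
```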

Tetris

Tetris is a computer game whose essential rule is to fit a sequence of geometrically different pieces, which fall from the top of the screen stochastically, together to complete contiguous rows of blocks. Pieces arrive sequentially and the geometric shapes of the pieces are independently distributed. A falling piece can be rotated and moved horizontally into a desired position. Note that the rotation and movement of a falling piece must be scheduled and executed before it reaches the pile of remaining pieces at the bottom of the screen. Once a piece reaches the pile, the piece must rest there and can no longer be rotated or moved.

To put the Tetris game into the framework of Markov decision processes, one could define the state to correspond to the current board configuration and current falling piece. The decision in each time stage is where to place the current falling piece. Transitions to the next board configuration follow deterministically from the current state and action; transitions to the next falling piece are given by its distribution, which could be, for instance, uniform over all piece types. Finally, we associate a reward with each state-action pair, corresponding to the points awarded for the number of rows eliminated.


by

    x_{t+1} = \sum_{i=1}^{n} a_t^i e^{r_i} x_t.

Therefore, transition probabilities can be derived from the distribution of the rate of return of each risky asset. We associate with each state-action pair $(x,a)$ a reward $g_a(x) = x(1 - \sum_{i=1}^n a_i)$, corresponding to the amount of wealth consumed.

4 Solving Finite-Horizon Problems

Finding a policy that minimizes the finite-horizon cost corresponds to solving the following optimization problem:

    \min_{u(\cdot,\cdot)} E\Big[ \sum_{t=0}^{T-1} g_{u(x_t,t)}(x_t) \,\Big|\, x_0 = x \Big]    (5)

A naive approach to solving (5) is to enumerate all possible policies $u(x,t)$, evaluate the corresponding expected cost, and choose the policy that minimizes it. However, note that the number of policies grows exponentially in the number of states and time stages.

A central idea in dynamic programming is that the computation required to find an optimal policy can be greatly reduced by noting that (5) can be rewritten as follows:

    \min_{a \in A_x} \Big\{ g_a(x) + \sum_{y \in S} P_a(x,y) \min_{u(\cdot,\cdot)} E\Big[ \sum_{t=1}^{T-1} g_{u(x_t,t)}(x_t) \,\Big|\, x_1 = y \Big] \Big\}.    (6)

Define $J^*(x, t_0)$ as follows:

    J^*(x, t_0) = \min_{u(\cdot,\cdot)} E\Big[ \sum_{t=t_0}^{T-1} g_{u(x_t,t)}(x_t) \,\Big|\, x_{t_0} = x \Big].

It is clear from (6) that, if we know $J^*(\cdot, t_0+1)$, we can easily find $J^*(x, t_0)$ by solving

    J^*(x, t_0) = \min_{a \in A_x} \Big\{ g_a(x) + \sum_{y \in S} P_a(x,y) J^*(y, t_0+1) \Big\}.    (7)

Moreover, (6) suggests that an optimal action at state $x$ and time $t_0$ is simply one that minimizes the right-hand side in (7). It is easy to verify that this is the case by using backwards induction. We call $J^*(x,t)$ the cost-to-go function. It can be found recursively by noting that

    J^*(x, T-1) = \min_a g_a(x)

and $J^*(x,t)$, $t = 0, \ldots, T-2$, can be computed via (7).

Note that finding $J^*(x,t)$ for all $x \in S$ and $t = 0, \ldots, T-1$ involves a number of operations that
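A direct implementation of this backward recursion, reusing the illustrative P and g arrays from the earlier sketch (assumed shapes: P[a][x,y] and g[a][x]), might look like this:

```python
import numpy as np

def finite_horizon_dp(P, g, T):
    """Backward induction: returns J[t, x] and an optimal policy u[t, x]."""
    n_actions, n_states, _ = P.shape
    J = np.zeros((T, n_states))
    u = np.zeros((T, n_states), dtype=int)
    # Terminal stage: J(x, T-1) = min_a g_a(x).
    J[T - 1] = g.min(axis=0)
    u[T - 1] = g.argmin(axis=0)
    for t in range(T - 2, -1, -1):
        # Q[a, x] = g_a(x) + sum_y P_a(x, y) J(y, t+1), i.e., recursion (7).
        Q = g + P @ J[t + 1]
        J[t] = Q.min(axis=0)
        u[t] = Q.argmin(axis=0)
    return J, u
```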


1. Find (somehow), for every $x$ and $t_0$,

    J^*(x, t_0) = \min_{u(\cdot,\cdot)} E\Big[ \sum_{t=t_0}^{\infty} \alpha^{t-t_0} g_{u(x_t,t)}(x_t) \,\Big|\, x_{t_0} = x \Big]    (8)

2. The optimal action for state $x$ at time $t_0$ is given by

    u^*(x, t_0) = \arg\min_{a \in A_x} \Big\{ g_a(x) + \alpha \sum_{y \in S} P_a(x,y) J^*(y, t_0+1) \Big\}.    (9)

We may also conjecture that, as in the finite-horizon case, $J^*(x,t)$ satisfies a recursive relation of the form

    J^*(x,t) = \min_{a \in A_x} \Big\{ g_a(x) + \alpha \sum_{y \in S} P_a(x,y) J^*(y, t+1) \Big\}.

The first thing to note in the infinite-horizon case is that, based on expression (8), we have $J^*(x,t) = J^*(x,t') = J^*(x)$ for all $t$ and $t'$. Indeed, note that, for every $u$,

    E\Big[ \sum_{t=t_0}^{\infty} \alpha^{t-t_0} g_{u(x_t,t)}(x_t) \,\Big|\, x_{t_0} = x \Big]
      = \sum_{t=t_0}^{\infty} \sum_y \alpha^{t-t_0} \mathrm{Prob}_u(x_t = y \,|\, x_{t_0} = x)\, g_{u(y)}(y)
      = \sum_{t=t_0}^{\infty} \sum_y \alpha^{t-t_0} \mathrm{Prob}_u(x_{t-t_0} = y \,|\, x_0 = x)\, g_{u(y)}(y)
      = \sum_{t=0}^{\infty} \sum_y \alpha^{t} \mathrm{Prob}_u(x_t = y \,|\, x_0 = x)\, g_{u(y)}(y).

Intuitively, since transition probabilities $P_u(x,y)$ do not depend on time, infinite-horizon problems look the same regardless of the value of the initial time stage $t$, as long as the initial state is the same.

Note also that, since $J^*(x,t) = J^*(x)$, we can also infer from (9) that the optimal policy $u^*(x,t)$ does not depend on the current stage $t$, so that $u^*(x,t) = u^*(x)$ for some function $u^*(\cdot)$. We call policies that do not depend on the time stage stationary. Finally, $J^*$ must satisfy the following equation:

    J^*(x) = \min_{a \in A_x} \Big\{ g_a(x) + \alpha \sum_{y \in S} P_a(x,y) J^*(y) \Big\}.

This is called Bellman's equation. We will show in the next lecture that the cost-to-go function is the unique solution of Bellman's equation and the stationary policy $u^*$ is optimal.


Proof First, we have

    J = \bar{J} + J - \bar{J} \le \bar{J} + \|J - \bar{J}\|_\infty e.

We now have

    T J - T \bar{J} \le T(\bar{J} + \|J - \bar{J}\|_\infty e) - T \bar{J}
                    = T \bar{J} + \alpha \|J - \bar{J}\|_\infty e - T \bar{J}
                    = \alpha \|J - \bar{J}\|_\infty e.

The first inequality follows from monotonicity and the second from the offset property of $T$. Since $J$ and $\bar{J}$ are arbitrary, we conclude by the same reasoning that $T \bar{J} - T J \le \alpha \|J - \bar{J}\|_\infty e$. The lemma follows.


2.997 Decision-Making in Large-Scale Systems                         February 9
MIT, Spring 2004                                                     Handout #2

Lecture Note 2

1 Summary: Markov Decision Processes

Markov decision processes can be characterized by $(S, A, g_\cdot(\cdot), P_\cdot(\cdot,\cdot))$, where

    $S$ denotes a finite set of states;
    $A_x$ denotes a finite set of actions for state $x \in S$;
    $g_a(x)$ denotes the finite time-stage cost for action $a \in A_x$ and state $x \in S$;
    $P_a(x,y)$ denotes the transition probability when the action taken is $a \in A_x$, the current state is $x$, and the next state is $y$.

Let $u(x,t)$ denote the policy for state $x$ at time $t$ and, similarly, let $u(x)$ denote the stationary policy for state $x$. Taking the stationary policy $u(x)$ into consideration, we introduce the following notation

    g_u(x) \triangleq g_{u(x)}(x),
    P_u(x,y) \triangleq P_{u(x)}(x,y)

to represent the cost function and transition probabilities under policy $u(x)$.

2 Cost-to-go Function and Bellman's Equation

In the previous lecture, we defined the discounted-cost, infinite-horizon cost-to-go function as

    J^*(x) = \min_u E\Big[ \sum_{t=0}^{\infty} \alpha^t g_u(x_t) \,\Big|\, x_0 = x \Big].

We also conjectured that $J^*$ should satisfy Bellman's equation

    J^*(x) = \min_a \Big\{ g_a(x) + \alpha \sum_{y \in S} P_a(x,y) J^*(y) \Big\},

or, using the operator notation introduced in the previous lecture,

    J^* = T J^*.


3 Value Iteration

The value iteration algorithm goes as follows:

1. $J_0$ arbitrary, $k = 0$
2. $J_{k+1} = T J_k$, $k = k+1$
3. Go back to 2
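A sketch of this loop in code, reusing the illustrative P and g arrays from Lecture Note 1 and an assumed discount factor alpha:

```python
import numpy as np

def value_iteration(P, g, alpha, tol=1e-8, max_iter=10_000):
    """Iterate J_{k+1} = T J_k until the sup-norm change is below tol."""
    n_actions, n_states, _ = P.shape
    J = np.zeros(n_states)                 # J_0 = 0
    for _ in range(max_iter):
        # (T J)(x) = min_a { g_a(x) + alpha * sum_y P_a(x,y) J(y) }
        TJ = (g + alpha * (P @ J)).min(axis=0)
        if np.max(np.abs(TJ - J)) < tol:
            return TJ
        J = TJ
    return J
```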

Theorem 1

    \lim_{k \to \infty} J_k = J^*.

Proof Since $J_0(\cdot)$ and $g_\cdot(\cdot)$ are finite, there exists a real number $M$ satisfying $|J_0(x)| \le M$ and $|g_a(x)| \le M$ for all $a \in A_x$ and $x \in S$. Then we have, for every integer $K \ge 1$ and real number $\alpha \in (0,1)$,

    (T^K J_0)(x) = \min_u E\Big[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) + \alpha^K J_0(x_K) \,\Big|\, x_0 = x \Big]
                 \le \min_u E\Big[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) \,\Big|\, x_0 = x \Big] + \alpha^K M.

From

    J^*(x) = \min_u E\Big[ \sum_{t=0}^{K-1} \alpha^t g_u(x_t) + \sum_{t=K}^{\infty} \alpha^t g_u(x_t) \,\Big|\, x_0 = x \Big],

we have

    |(T^K J_0)(x) - J^*(x)|
      \le \max_u E\Big[ \alpha^K |J_0(x_K)| + \sum_{t=K}^{\infty} \alpha^t |g_u(x_t)| \,\Big|\, x_0 = x \Big]
      \le \alpha^K M + \frac{\alpha^K}{1-\alpha} M \to 0 \ \text{as} \ K \to \infty.


Theorem 2  $J^*$ is the unique solution of Bellman's equation.

Proof We first show that $J^* = T J^*$. By the contraction principle,

    \|T^{k+1} J_0 - T^k J_0\|_\infty = \|T(T^k J_0) - T^k J_0\|_\infty
      \le \alpha \|T^k J_0 - T^{k-1} J_0\|_\infty
      \le \alpha^k \|T J_0 - J_0\|_\infty \to 0 \ \text{as} \ k \to \infty.

Since for all $k$ we have $\|J^* - T J^*\|_\infty \le \|J^* - T^{k+1} J_0\|_\infty + \|T^{k+1} J_0 - T^k J_0\|_\infty + \|T^k J_0 - T J^*\|_\infty$, we conclude that $J^* = T J^*$. We next show that $J^*$ is the unique solution to $J = T J$. Suppose that $J_1 = T J_1$ and $J_2 = T J_2$. Then

    \|J_1 - J_2\|_\infty = \|T J_1 - T J_2\|_\infty \le \alpha \|J_1 - J_2\|_\infty,

so that $J_1 = J_2$.

Value iteration updates all states simultaneously. Alternatively, when states are updated one at a time in the order $x = 1, 2, \ldots, |S|$, newly computed values can be used immediately. We hence define the operator $F$ as follows:

    (F J)(x) = \min_{a \in A_x} \Big\{ g_a(x) + \alpha \sum_{y < x} P_a(x,y)(F J)(y) + \alpha \sum_{y \ge x} P_a(x,y) J(y) \Big\}.    (1)
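A sketch of this Gauss-Seidel variant in code; the in-place update is the whole point, since state x's new value is used immediately for the states that follow it in the sweep:

```python
import numpy as np

def gauss_seidel_sweep(P, g, alpha, J):
    """One in-place sweep of the operator F over states 0, 1, ..., n-1."""
    J = J.copy()
    n_states = J.shape[0]
    for x in range(n_states):
        # States already swept this pass contribute their updated values.
        J[x] = (g[:, x] + alpha * P[:, x, :] @ J).min()
    return J
```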


Lemma 1

    \|F J - F \bar{J}\|_\infty \le \alpha \|J - \bar{J}\|_\infty.

Proof By the definition of $F$, we consider first the case $x = 1$:

    |(F J)(1) - (F \bar{J})(1)| = |(T J)(1) - (T \bar{J})(1)| \le \alpha \|J - \bar{J}\|_\infty.

For the case $x = 2$, by the definition of $F$, we have

    |(F J)(2) - (F \bar{J})(2)| \le \alpha \max\big\{ |(F J)(1) - (F \bar{J})(1)|,\ |J(2) - \bar{J}(2)|,\ \ldots,\ |J(|S|) - \bar{J}(|S|)| \big\}
                               \le \alpha \|J - \bar{J}\|_\infty.

Repeating the same reasoning for $x = 3, \ldots$, we can show by induction that $|(F J)(x) - (F \bar{J})(x)| \le \alpha \|J - \bar{J}\|_\infty$ for all $x \in S$. Hence, we conclude $\|F J - F \bar{J}\|_\infty \le \alpha \|J - \bar{J}\|_\infty$.

Theorem 3  $F$ has the unique fixed point $J^*$.

Proof By the definition of operator $F$ and Bellman's equation $J^* = T J^*$, we have $J^* = F J^*$. The convergence result follows from the previous lemma. By the maximum-norm contraction property, the uniqueness of $J^*$ holds.


2.997 Decision-Making in Large-Scale Systems                        February 11
MIT, Spring 2004                                                     Handout #4

Lecture Note 3

1 Value Iteration

Using value iteration, starting at an arbitrary $J_0$, we generate a sequence $\{J_k\}$ by

    J_{k+1} = T J_k, \quad \text{integer } k \ge 0.

We have shown that the sequence $J_k \to J^*$ as $k \to \infty$, and derived the error bound

    \|J_k - J^*\|_\infty \le \alpha^k \|J_0 - J^*\|_\infty.

Recall that the greedy policy $u_J$ with respect to value $J$ is defined by $T J = T_{u_J} J$. We also denote by $u_k = u_{J_k}$ the greedy policy with respect to value $J_k$. Then, we have the following lemma.

Lemma 1  Given $\alpha \in (0,1)$,

    \|J_{u_k} - J_k\|_\infty \le \frac{1}{1-\alpha} \|T J_k - J_k\|_\infty.

Proof:

    J_{u_k} - J_k = (I - \alpha P_{u_k})^{-1} g_{u_k} - J_k
                  = (I - \alpha P_{u_k})^{-1} (g_{u_k} + \alpha P_{u_k} J_k - J_k)
                  = (I - \alpha P_{u_k})^{-1} (T J_k - J_k)
                  = \sum_{t=0}^{\infty} \alpha^t P_{u_k}^t (T J_k - J_k)
                  \le \sum_{t=0}^{\infty} \alpha^t P_{u_k}^t e \, \|T J_k - J_k\|_\infty
                  = \sum_{t=0}^{\infty} \alpha^t e \, \|T J_k - J_k\|_\infty
                  = \frac{e}{1-\alpha} \|T J_k - J_k\|_\infty,

where $I$ is an identity matrix, and $e$ is a vector of unit elements with appropriate dimension. The third equality comes from $T J_k = g_{u_k} + \alpha P_{u_k} J_k$, i.e., $u_k$ is the greedy policy w.r.t. $J_k$, and the fourth equality holds because $(I - \alpha P_{u_k})^{-1} = \sum_{t=0}^{\infty} \alpha^t P_{u_k}^t$. By switching $J_{u_k}$ and $J_k$, we can obtain $J_k - J_{u_k} \le \frac{e}{1-\alpha} \|T J_k - J_k\|_\infty$, and hence conclude the lemma.


Theorem 1

    \|J_{u_k} - J^*\|_\infty \le \frac{2}{1-\alpha} \|J_k - J^*\|_\infty.

Proof:

    \|J_{u_k} - J^*\|_\infty = \|J_{u_k} - J_k + J_k - J^*\|_\infty
      \le \|J_{u_k} - J_k\|_\infty + \|J_k - J^*\|_\infty
      \le \frac{1}{1-\alpha} \|T J_k - J_k\|_\infty + \|J_k - J^*\|_\infty
      \le \frac{1}{1-\alpha} \big( \|T J_k - J^*\|_\infty + \|J^* - J_k\|_\infty \big) + \|J_k - J^*\|_\infty
      \le \frac{2}{1-\alpha} \|J_k - J^*\|_\infty.

The second inequality comes from Lemma 1, and the last inequality holds by the contraction principle, since $\|T J_k - J^*\|_\infty = \|T J_k - T J^*\|_\infty \le \alpha \|J_k - J^*\|_\infty$. □
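In code, extracting the greedy policy from an iterate and evaluating it exactly looks like this (a sketch; evaluate_policy solves the linear system $(I - \alpha P_u) J_u = g_u$ used in Lemma 1's proof):

```python
import numpy as np

def greedy_policy(P, g, alpha, J):
    """u_J(x) = argmin_a { g_a(x) + alpha * sum_y P_a(x,y) J(y) }."""
    return (g + alpha * (P @ J)).argmin(axis=0)

def evaluate_policy(P, g, alpha, u):
    """Exact J_u via (I - alpha P_u) J_u = g_u."""
    n_states = g.shape[1]
    Pu = P[u, np.arange(n_states), :]      # row x taken from P_{u(x)}
    gu = g[u, np.arange(n_states)]
    return np.linalg.solve(np.eye(n_states) - alpha * Pu, gu)
```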

2 Optimality of Stationary Policies

Before proving the main theorem of this section, we introduce the following useful lemma.

Lemma 2  If $J \le T J$, then $J \le J^*$. If $J \ge T J$, then $J \ge J^*$.

Proof: Suppose that $J \le T J$. Applying the operator $T$ on both sides repeatedly $k-1$ times and using the monotonicity property of $T$, we have

    J \le T J \le T^2 J \le \cdots \le T^k J.

For sufficiently large $k$, $T^k J$ approaches $J^*$. We hence conclude $J \le J^*$. The other statement follows from the same argument. □

We show the optimality of stationary policies via the following theorem.

Theorem 2  Let $\bar{u} = (u_1, u_2, \ldots)$ be any policy and let $u^* = u_{J^*}$ be greedy with respect to $J^*$. Then

    J_{u^*} \le J_{\bar{u}} \quad \text{and} \quad J_{u^*} = J^*.

Moreover, let $u$ be any stationary policy such that $T_u J^* \ne T J^*$. Then $J_u(x) > J^*(x)$ for at least one state $x \in S$.


Then

    \|J^u_k - J_u\|_\infty \le \alpha^k M \Big(1 + \frac{1}{1-\alpha}\Big) \to 0 \ \text{as} \ k \to \infty.

If $u = (u^*, u^*, \ldots)$, then

    \|J^{u^*}_k - J_{u^*}\|_\infty \to 0 \ \text{as} \ k \to \infty.

Thus, we have $J^{u^*}_k = T^k_{u^*} J^* = T^{k-1}_{u^*}(T_{u^*} J^*) = T^{k-1}_{u^*}(T J^*) = T^{k-1}_{u^*} J^* = \cdots = J^*$. Therefore $J_{u^*} = J^*$. For any other policy $\bar{u} = (u_1, u_2, \ldots)$, for all $k$,

    J_{\bar{u}} \ge T_{u_1} \cdots T_{u_k} J^* - \alpha^k M \Big(1 + \frac{1}{1-\alpha}\Big) e
              \ge T_{u_1} \cdots T_{u_{k-1}} T J^* - \alpha^k M \Big(1 + \frac{1}{1-\alpha}\Big) e
              = T_{u_1} \cdots T_{u_{k-1}} J^* - \alpha^k M \Big(1 + \frac{1}{1-\alpha}\Big) e
              \ge \cdots \ge J^* - \alpha^k M \Big(1 + \frac{1}{1-\alpha}\Big) e.

Therefore $J_{\bar{u}} \ge J^*$.

Take a stationary policy $u$ such that $T_u J^* \ne T J^*$, i.e., $T_u J^* \ge T J^*$ and there is at least one state $x \in S$ such that $(T_u J^*)(x) > (T J^*)(x)$. Observe

    J^* = T J^* \le T_u J^*.

Applying $T_u$ on both sides and using the monotonicity property of $T_u$, or applying Lemma 2,

    J^* \le T_u J^* \le T^2_u J^* \le \cdots \le T^k_u J^* \le \cdots \le J_u,

and $J^*(x) < (T_u J^*)(x) \le J_u(x)$.


Proof: If $u_k$ is optimal, then we are done. Now suppose that $u_k$ is not optimal. Then

    T J_{u_k} \le T_{u_k} J_{u_k} = J_{u_k},

with strict inequality for at least one state $x$. Since $T_{u_{k+1}} J_{u_k} = T J_{u_k}$ and $J_{u_k} = T_{u_k} J_{u_k}$, we have

    J_{u_k} = T_{u_k} J_{u_k} \ge T J_{u_k} = T_{u_{k+1}} J_{u_k} \ge \cdots \ge T^n_{u_{k+1}} J_{u_k} \to J_{u_{k+1}} \ \text{as} \ n \to \infty.

Therefore, policy $u_{k+1}$ is an improvement over policy $u_k$. □

In step 2, we solve $J_{u_k} = g_{u_k} + \alpha P_{u_k} J_{u_k}$, which may require a significant amount of computation. We thus introduce another algorithm which requires less computation in step 2.

3.1 Asynchronous Policy Iteration

The algorithm goes as follows.

1. Start with policy $u_0$, cost-to-go function $J_0$, $k = 0$
2. For some subset $S_k \subset S$, do one of the following:
   (i) value update: $J_{k+1}(x) = (T_{u_k} J_k)(x)$ for $x \in S_k$, $J_{k+1}(x) = J_k(x)$ otherwise;
   (ii) policy update: $u_{k+1}(x) = u_{J_k}(x)$ for $x \in S_k$, $u_{k+1}(x) = u_k(x)$ otherwise.
3. $k = k+1$; go back to step 2
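A sketch of one possible implementation; the choice of $S_k$ as a random subset and the strict alternation of update types are assumptions made here for illustration, since the algorithm only requires that both kinds of updates keep occurring at every state:

```python
import numpy as np

def async_policy_iteration(P, g, alpha, n_iter=5_000, seed=0):
    """Asynchronous PI with random subsets S_k, alternating update types."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    u = np.zeros(n_states, dtype=int)
    # Constant J0 = max(g)/(1-alpha) guarantees T_{u0} J0 <= J0.
    J = g.max() / (1 - alpha) * np.ones(n_states)
    for k in range(n_iter):
        Sk = rng.random(n_states) < 0.5             # random subset of states
        Q = g + alpha * (P @ J)                     # Q[a, x]
        if k % 2 == 0:                              # value update on S_k
            J = np.where(Sk, Q[u, np.arange(n_states)], J)
        else:                                       # policy update on S_k
            u = np.where(Sk, Q.argmin(axis=0), u)
    return J, u
```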

Theorem 4  If $T_{u_0} J_0 \le J_0$ and infinitely many value and policy updates are performed on each state, then

    \lim_{k \to \infty} J_k = J^*.

Proof: We prove this theorem in two steps. First, we will show that

    J^* \le J_{k+1} \le J_k, \quad \forall k.

This implies that $J_k$ is a nonincreasing sequence. Since $J_k$ is lower bounded by $J^*$, $J_k$ will converge to some value, i.e., $J_k \to \bar{J}$ as $k \to \infty$. Next, we will show that $J_k$ converges to $J^*$, i.e., $\bar{J} = J^*$.

Lemma 3  If $T_{u_0} J_0 \le J_0$, the sequence $J_k$ generated by asynchronous policy iteration converges.

Proof: We start by showing that, if $T_{u_k} J_k \le J_k$, then $T_{u_{k+1}} J_{k+1} \le J_{k+1} \le J_k$. Suppose we have a value update.


Now suppose that we have a policy update. Then $J_{k+1} = J_k$. Moreover, for $x \in S_k$, we have

    (T_{u_{k+1}} J_{k+1})(x) = (T_{u_{k+1}} J_k)(x)
                             = (T J_k)(x)
                             \le (T_{u_k} J_k)(x)
                             \le J_k(x)
                             = J_{k+1}(x).

The first equality follows from $J_k = J_{k+1}$, the second equality and first inequality follow from the fact that $u_{k+1}(x)$ is greedy with respect to $J_k$ for $x \in S_k$, the second inequality follows from the induction hypothesis, and the third equality follows from $J_k = J_{k+1}$. For $x \notin S_k$, we have

    (T_{u_{k+1}} J_{k+1})(x) = (T_{u_k} J_k)(x)
                             \le J_k(x)
                             = J_{k+1}(x).

The equalities follow from $J_k = J_{k+1}$ and $u_{k+1}(x) = u_k(x)$ for $x \notin S_k$, and the inequality follows from the induction hypothesis.

Since by hypothesis $T_{u_0} J_0 \le J_0$, we conclude that $J_k$ is a nonincreasing sequence. Moreover, we have $T_{u_k} J_k \le J_k$, hence $J_k \ge J_{u_k} \ge J^*$, so that $J_k$ is bounded below. It follows that $J_k$ converges to some limit $\bar{J}$. □

Lemma 4  Suppose that $J_k \to \bar{J}$, where $J_k$ is generated by asynchronous policy iteration, and suppose that there are infinitely many value and policy updates at each state. Then $\bar{J} = J^*$.

Proof: First note that, since $T J_k \le J_k$, by continuity of the operator $T$, we must have $T \bar{J} \le \bar{J}$. Now suppose that $(T \bar{J})(x) < \bar{J}(x)$ for some state $x$. Then, by continuity, there is an iteration index $\bar{k}$ such that $(T J_k)(x) < \bar{J}(x)$ for all $k \ge \bar{k}$. Let $k'' > k' > \bar{k}$ correspond to iterations of the asynchronous policy iteration algorithm such that there is a policy update at state $x$ at iteration $k'$, a value update at state $x$ at iteration $k''$, and no updates at state $x$ in iterations $k' < k < k''$.


We have concluded that $J_{k+1}(x) < \bar{J}(x)$. However, by hypothesis $J_k \ge \bar{J}$ for all $k$, so we have a contradiction, and it must follow that $T \bar{J} = \bar{J}$, so that $\bar{J} = J^*$. □


2.997 Decision-Making in Large-Scale Systems                        February 17
MIT, Spring 2004                                                     Handout #6

Lecture Note 4

1 Average-Cost Problems

In average-cost problems, we aim at finding a policy $u$ which minimizes

    J_u(x) = \limsup_{T \to \infty} \frac{1}{T} E\Big[ \sum_{t=0}^{T-1} g_u(x_t) \,\Big|\, x_0 = x \Big].    (1)

Since the state space is finite, it can be shown that the limsup can actually be replaced with lim for any stationary policy. In the previous lectures, we first find the cost-to-go functions $J^*(x)$ (for discounted problems) or $J^*(x,t)$ (for finite-horizon problems) and then find the optimal policy through the cost-to-go functions. However, in the average-cost problem, $J_u(x)$ does not offer enough information for an optimal policy to be found; in particular, in most cases of interest we will have $J_u(x) = \lambda_u$ for some scalar $\lambda_u$, for all $x$, so that it does not allow us to distinguish the value of being in each state.

    allx,sothatitdoesnotallowustodistinguishthevalueofbeing ineachstate.

    We

    will

    start

    by

    deriving

    some

    intuition

    based

    on

    finite-horizon

    problems.

    Consider

    a

    set

    of

    states

    x1, x2, . . . , x, . . . , xn}. Thestatesarevisitedinasequencewithsomeinitialstatex,sayS={

    x , . . . . . . , x, . . . . . . , x, . . . . . . , x, . . . . . . ,

    h(x) 1 2u u

    LetTi(x), i=1, 2, . . . bethestagescorrespondingtotheithvisittostatex,startingatstatex. Let

    Ti+1(x)1gu(xt)

    u(x)

    =

    E

    t=Ti

    (x)i Ti+1(x) Ti(x)

    Intuitively,wemusthavei u(x)=j

    u(x)is independentofinitialstatexandi

    u(x),sincewehavethesame

    transitionprobabilitieswheneverwestartanewtrajectoryinstatex. Goingbacktoobservethedefinition

    ofthefunction

    T

    J(x, T)=minE gu(xt)

    xo

    =x ,u

    t=0

    we

    conjecture

    that

    the

    function

    can

    be

    approximated

    as

    follows.

    J(x, T) (x)T+h(x)+o(T), asT, (2)

    Notethat,since(x)is independentofthe initialstate,wecanrewritetheapproximationas

    J(x, T) T +h(x)+o(T), asT. (3)


We can now speculate about a version of Bellman's equation for computing $\lambda$ and $h$. Approximating $J(x,T)$ as in (3), we have

    J(x, T+1) = \min_a \Big\{ g_a(x) + \sum_y P_a(x,y) J(y, T) \Big\}
    \lambda (T+1) + h(x) + o(T) = \min_a \Big\{ g_a(x) + \sum_y P_a(x,y) \big[ \lambda T + h(y) + o(T) \big] \Big\}.

Therefore, we have

    \lambda + h(x) = \min_a \Big\{ g_a(x) + \sum_y P_a(x,y) h(y) \Big\}.    (4)

As we did in the cost-to-go context, we set

    T_u h = g_u + P_u h \quad \text{and} \quad T h = \min_u T_u h.

Then, we have

Lemma 1 (Monotonicity)  Let $h \le \bar{h}$ be arbitrary. Then $T h \le T \bar{h}$ (and $T_u h \le T_u \bar{h}$).

Lemma 2 (Offset)  For all $h$ and $k \in \mathbb{R}$, we have $T(h + ke) = T h + ke$.

Notice that the contraction principle does not hold for $T h = \min_u T_u h$.

Bellman's Equation

From the discussion above, we can write Bellman's equation

    \lambda e + h = T h.    (5)

Before examining the existence of solutions to Bellman's equation, we show that a solution of Bellman's equation renders the optimal policy, by the following theorem.

Theorem 1  Suppose that $\lambda$ and $h$ satisfy Bellman's equation. Let $u$ be greedy with respect to $h$, i.e., $T h = T_u h$. Then

    J_u(x) = \lambda, \ \forall x, \quad \text{and} \quad \lambda \le J_{\bar{u}}(x), \ \forall \bar{u}.

Proof: Let $\bar{u} = (u_1, u_2, \ldots)$. Let $N$ be arbitrary.


Then

    T_{u_1} T_{u_2} \cdots T_{u_{N-1}} h \ge (N-1) \lambda e + h.

Thus, we have

    E\Big[ \sum_{t=0}^{N-2} g_{\bar{u}}(x_t) + h(x_{N-1}) \,\Big|\, x_0 = x \Big] \ge (N-1)\lambda + h(x).

Dividing both sides by $N$ and taking the limit as $N$ approaches infinity, we have

    J_{\bar{u}} \ge \lambda e.

Take $\bar{u} = (u, u, u, \ldots)$; then all the inequalities above become equalities. Thus

    \lambda e = J_u.

This theorem says that, if Bellman's equation has a solution, then we can get an optimal policy from it.

Note that, if $(\lambda, h)$ is a solution to Bellman's equation, then $(\lambda, h + ke)$ is also a solution, for all scalars $k$. Hence, if Bellman's equation (5) has a solution, then it has infinitely many solutions. However, unlike the case of discounted-cost and finite-horizon problems, the average-cost Bellman's equation does not necessarily have a solution. In particular, the previous theorem implies that, if a solution exists, then the average cost $J_u(x)$ is the same for all initial states. It is easy to come up with examples where this is not the case. For instance, consider the case where the transition probability matrix is the identity, i.e., each state visits itself every time, and each state incurs a different cost $g(\cdot)$. Then the average cost depends on the initial state, so Bellman's equation cannot have a solution. Hence, Bellman's equation does not always hold. □


2.997 Decision-Making in Large-Scale Systems                        February 18
MIT, Spring 2004                                                     Handout #7

Lecture Note 5

1 Relationship between Discounted and Average-Cost Problems

In this lecture, we will show that optimal policies for discounted-cost problems with large enough discount factor are also optimal for average-cost problems. The analysis will also show that, if the optimal average cost is the same for all initial states, then the average-cost Bellman's equation has a solution. Note that the optimal average cost is independent of the initial state. Recall that

    J_u(x) = \limsup_{N \to \infty} \frac{1}{N} E\Big[ \sum_{t=0}^{N-1} g_u(x_t) \,\Big|\, x_0 = x \Big]

or, equivalently,

    J_u = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P_u^t g_u.

We also let $J_{u,\alpha}$ denote the discounted cost-to-go function associated with policy $u$ when the discount factor is $\alpha$, i.e.,

    J_{u,\alpha} = \sum_{t=0}^{\infty} \alpha^t P_u^t g_u = (I - \alpha P_u)^{-1} g_u.

The following theorem formalizes the relationship between the discounted cost-to-go function and the average cost.

Theorem 1  For every stationary policy $u$, there is $h_u$ such that

    J_{u,\alpha} = \frac{1}{1-\alpha} J_u + h_u + O(|1-\alpha|).    (1)

Theorem 1 follows easily from the following proposition.

Proposition 1  For all stationary policies $u$, we have

    (I - \alpha P_u)^{-1} = \frac{1}{1-\alpha} P_u^* + H_u + O(|1-\alpha|),    (2)

where

    P_u^* = \lim_{N \to \infty} \frac{1}{N} \sum_{t=0}^{N-1} P_u^t.    (3)


Proof: Let $M_\alpha = (1-\alpha)(I - \alpha P_u)^{-1}$. Then, since

    |M_\alpha(x,y)| = (1-\alpha) \Big| \sum_{t=0}^{\infty} \alpha^t P_u^t(x,y) \Big| \le (1-\alpha) \sum_{t=0}^{\infty} \alpha^t = 1,

$M_\alpha(x,y)$ is of the form

    M_\alpha(x,y) = \frac{p(\alpha)}{q(\alpha)},

where $p(\cdot)$ and $q(\cdot)$ are polynomials such that $q(1) \ne 0$. We conclude that the limit $\lim_{\alpha \uparrow 1} M_\alpha$ exists. Let $P_u^* = \lim_{\alpha \uparrow 1} M_\alpha$. We can do a Taylor expansion of $M_\alpha$ around $\alpha = 1$, so that

    M_\alpha = P_u^* + (1-\alpha) H_u + O((1-\alpha)^2),

where $H_u = -\frac{dM_\alpha}{d\alpha}\big|_{\alpha=1}$. Therefore

    (I - \alpha P_u)^{-1} = \frac{1}{1-\alpha} P_u^* + H_u + O(|1-\alpha|)

for some $P_u^*$ and $H_u$.

Next, observe that

    (1-\alpha)(I - \alpha P_u)(I - \alpha P_u)^{-1} = (1-\alpha) I

for all $\alpha$. Taking the limit as $\alpha \uparrow 1$ yields

    (I - P_u) P_u^* = 0,

so that $P_u^* = P_u P_u^*$. We can use the same reasoning to conclude that $P_u^* = P_u^* P_u$. We also have

    (I - \alpha P_u) P_u^* = (1-\alpha) P_u^*,

hence for every $\alpha$ we have

    P_u^* = (1-\alpha)(I - \alpha P_u)^{-1} P_u^*,

and taking the limit as $\alpha \uparrow 1$ yields $P_u^* P_u^* = P_u^*$.

We now show that, for every $t \ge 1$, $P_u^t - P_u^* = (P_u - P_u^*)^t$. For $t = 1$, it is trivial. Suppose that the result holds up to $n-1$, i.e., $P_u^{n-1} - P_u^* = (P_u - P_u^*)^{n-1}$. Then

    (P_u - P_u^*)^n = (P_u - P_u^*)(P_u^{n-1} - P_u^*)
                    = P_u^n - P_u P_u^* - P_u^* P_u^{n-1} + P_u^* P_u^*
                    = P_u^n - P_u^* - P_u^* + P_u^*
                    = P_u^n - P_u^*.

By induction, we have $P_u^t - P_u^* = (P_u - P_u^*)^t$.

Now note that

    H_u = \lim_{\alpha \uparrow 1} \frac{M_\alpha - P_u^*}{1-\alpha}
        = \lim_{\alpha \uparrow 1} \Big[ (I - \alpha P_u)^{-1} - \frac{1}{1-\alpha} P_u^* \Big]
        = \lim_{\alpha \uparrow 1} \sum_{t=0}^{\infty} \alpha^t (P_u^t - P_u^*).


Hence $H_u = (I - P_u + P_u^*)^{-1} - P_u^*$.

We now show $P_u^* H_u = 0$. Observe

    P_u^* H_u = P_u^* (I - P_u + P_u^*)^{-1} - P_u^* P_u^*
              = \sum_{t=0}^{\infty} P_u^* (P_u - P_u^*)^t - P_u^*
              = P_u^* - P_u^* = 0.

Therefore, $P_u^* H_u = 0$.

Observe $(I - P_u + P_u^*) H_u = I - P_u^*$. Since $P_u^* H_u = 0$, we have

    P_u^* + H_u = I + P_u H_u.

By multiplying $P_u^k$ into $P_u^* + H_u = I + P_u H_u$, we have

    P_u^k P_u^* + P_u^k H_u = P_u^k + P_u^{k+1} H_u, \quad \forall k.

Summing from $k = 0$ to $k = N-1$, we have

    N P_u^* + \sum_{k=0}^{N-1} P_u^k H_u = \sum_{k=0}^{N-1} P_u^k + \sum_{k=1}^{N} P_u^k H_u,

or, equivalently,

    N P_u^* = \sum_{k=0}^{N-1} P_u^k + (P_u^N - I) H_u.

Dividing both sides by $N$ and letting $N \to \infty$, we have

    \lim_{N \to \infty} \frac{1}{N} \sum_{k=0}^{N-1} P_u^k = P_u^*.

Since $P_u^* = P_u^* P_u$ and $P_u^*$ itself is a stochastic matrix, the rows of $P_u^*$ have a special meaning. Let $\pi_u$ denote a row of $P_u^*$. Then $\pi_u = \pi_u P_u$ and $\pi_u(x) = \sum_y \pi_u(y) P_u(y,x)$. We can conclude that any row of the matrix $P_u^*$ is a stationary distribution for the Markov chain under the policy $u$. However, does this observation mean that all rows of $P_u^*$ are identical?

Theorem 2

    J_{u,\alpha} = \frac{J_u}{1-\alpha} + h_u + O(|1-\alpha|).

Proof:

    J_{u,\alpha} = (I - \alpha P_u)^{-1} g_u
                 = \Big[ \frac{P_u^*}{1-\alpha} + H_u + O(|1-\alpha|) \Big] g_u
                 = \frac{P_u^* g_u}{1-\alpha} + H_u g_u + O(|1-\alpha|). □
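As a numerical sanity check of Proposition 1 (a sketch; the two-state chain below is an arbitrary illustrative stochastic matrix, not from the notes):

```python
import numpy as np

# Illustrative two-state chain under some fixed stationary policy u.
Pu = np.array([[0.7, 0.3],
               [0.4, 0.6]])
I = np.eye(2)

# P*_u as the Cesaro limit (1/N) sum_k P_u^k, approximated with large N.
N = 10_000
Pk, Pstar = I.copy(), np.zeros_like(Pu)
for _ in range(N):
    Pstar += Pk / N
    Pk = Pk @ Pu

# H_u = (I - P_u + P*_u)^{-1} - P*_u, per the proof above.
Hu = np.linalg.inv(I - Pu + Pstar) - Pstar

# Check the expansion (I - a P_u)^{-1} ~ P*_u/(1-a) + H_u near a = 1.
a = 0.999
lhs = np.linalg.inv(I - a * Pu)
rhs = Pstar / (1 - a) + Hu
print(np.max(np.abs(lhs - rhs)))   # small, of order (1 - a)
```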


2 Blackwell Optimality

In this section, we will show that policies that are optimal for the discounted-cost criterion with large enough discount factors are also optimal for the average-cost criterion. Indeed, we can actually strengthen the notion of average-cost optimality and establish the existence of policies that are optimal for all large enough discount factors.

Definition 1 (Blackwell Optimality)  A stationary policy $u^*$ is called Blackwell optimal if there exists $\bar{\alpha} \in (0,1)$ such that $u^*$ is optimal for all $\alpha \in [\bar{\alpha}, 1)$.

Theorem 3  There exists a stationary Blackwell optimal policy, and it is also optimal for the average-cost problem among all stationary policies.

Proof: Since there are only finitely many policies, we must have for each state $x$ a policy $u_x$ such that $J_{u_x,\alpha}(x) \le J_{u,\alpha}(x)$ for all $u$ and all large enough $\alpha$. If we take the policy $u^*$ to be given by $u^*(x) = u_x(x)$, then $u^*$ must satisfy Bellman's equation

    J_{u^*,\alpha} = \min_u \{ g_u + \alpha P_u J_{u^*,\alpha} \}

for all large enough $\alpha$, and we conclude that $u^*$ is Blackwell optimal.

Now let $u^*$ be Blackwell optimal. Also suppose that $\bar{u}$ is optimal for the average-cost problem. Then

    \frac{J_{u^*}}{1-\alpha} + h_{u^*} + O(|1-\alpha|) \le \frac{J_{\bar{u}}}{1-\alpha} + h_{\bar{u}} + O(|1-\alpha|), \quad \forall \alpha \ge \bar{\alpha}.

Multiplying both sides by $1-\alpha$ and taking the limit as $\alpha \uparrow 1$, we conclude that

    J_{u^*} \le J_{\bar{u}},

and $u^*$ must be optimal for the average-cost problem. □

Remark 1  It is actually possible to establish average-cost optimality of Blackwell optimal policies among the set of all policies, not only stationary ones.

Remark 2  An algorithm for computing Blackwell optimal policies involves lexicographic optimization of $J_u$, $h_u$ and higher-order terms in the Taylor expansion of $J_{u,\alpha}$.

Theorem 3 implies that, if the optimal average cost is the same regardless of the initial state, then the


Proof: We have, for all $\alpha$ large enough,

    J_{u^*,\alpha} = \min_u \{ g_u + \alpha P_u J_{u^*,\alpha} \}.

Substituting the expansion $J_{u^*,\alpha} = \frac{\lambda e}{1-\alpha} + h_{u^*} + O(1-\alpha)$ on both sides, and using $P_u e = e$, so that $\alpha P_u \frac{\lambda e}{1-\alpha} = \frac{\lambda e}{1-\alpha} - \lambda e$, we get

    \frac{\lambda e}{1-\alpha} + h_{u^*} + O(1-\alpha) = \min_u \Big\{ g_u + \frac{\lambda e}{1-\alpha} - \lambda e + \alpha P_u h_{u^*} + O(1-\alpha) \Big\},

i.e.,

    \lambda e + h_{u^*} + O(1-\alpha) = \min_u \{ g_u + \alpha P_u h_{u^*} + O(1-\alpha) \}.

Taking the limit as $\alpha \uparrow 1$, we get

    \lambda e + h_{u^*} = \min_u \{ g_u + P_u h_{u^*} \} = T h_{u^*}. □

In the average-cost setting, existence of a solution to Bellman's equation actually depends on the structure of the transition probabilities in the system. Some sufficient conditions for the optimal average cost to be the same regardless of the initial state are given below.

Definition 2  We say that two states $x, y$ communicate under policy $u$ if there are $k_1, k_2 \in \{1, 2, \ldots\}$ such that

    P_u^{k_1}(x,y) > 0, \quad P_u^{k_2}(y,x) > 0.

Definition 3  We say that a state $x$ is recurrent under policy $u$ if, conditioned on the fact that it is visited at least once, it is visited infinitely many times.

Definition 4  We say that a state $x$ is transient under policy $u$ if it is only visited finitely many times, regardless of the initial condition of the system.

Definition 5  We say that a policy $u$ is unichain if all of its recurrent states communicate.

We state without proof the following theorem.

Theorem 4  Either of the following conditions is sufficient for the optimal average cost to be the same regardless of the initial state:

1. There exists a unichain optimal policy.
2. For every pair of states $x$ and $y$, there is a policy $u$ such that $x$ and $y$ communicate.

3 Value Iteration


One way to obtain this value is to calculate a finite but very large $N$ to approximate the limit and speculate that such a limit is accurate. Hence we consider

    T^k J = \min_u E\Big[ \sum_{t=0}^{k-1} g_u(x_t) + J_0(x_k) \Big].

Recall $J(x,T) \approx \lambda T + h(x)$. Choosing some state $\bar{x}$ and $x$, we have

    J(x,T) - J(\bar{x},T) = h(x) - h(\bar{x}).

Then

    h_k(x) = J(x,k) - \lambda_k, \ \text{for some} \ \lambda_1, \lambda_2, \ldots

Note that, since $(\lambda, h + ke)$ is a solution to Bellman's equation for all $k$ whenever $(\lambda, h)$ is a solution, we can choose the value of a single state arbitrarily. Letting $h(\bar{x}) = 0$, we have the following commonly used version of value iteration:

    h_{k+1}(x) = (T h_k)(x) - (T h_k)(\bar{x}).    (8)

Theorem 5  Let $h_k$ be given by (8). Then if $h_k \to h$, we have

    \lambda = (T h)(\bar{x}) \quad \text{and} \quad \lambda e + h = T h.

Note that there must exist a solution to the average-cost Bellman's equation for value iteration to converge. However, it can be shown that existence of a solution is not a sufficient condition.
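A sketch of this relative value iteration (8) in code, reusing the P, g layout from the earlier sketches (undiscounted):

```python
import numpy as np

def relative_value_iteration(P, g, n_iter=1_000, ref_state=0):
    """h_{k+1} = T h_k - (T h_k)(xbar) e; returns (lambda, h) estimates."""
    n_actions, n_states, _ = P.shape
    h = np.zeros(n_states)
    lam = 0.0
    for _ in range(n_iter):
        Th = (g + P @ h).min(axis=0)   # (T h)(x) = min_a {g_a(x) + sum_y P h}
        lam = Th[ref_state]            # average-cost estimate
        h = Th - lam                   # normalize so h(xbar) = 0
    return lam, h
```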


2.997 Decision-Making in Large-Scale Systems                        February 23
MIT, Spring 2004                                                     Handout #9

Lecture Note 6

1 Application to Queueing Networks

In the first part of this lecture, we will discuss the application of dynamic programming to the queueing network introduced in [1], which illustrates several issues encountered in the application of dynamic programming to practical problems. In particular, we consider the issues that arise when value iteration is applied to problems with a large or infinite state space.

The main points in [1], which we overview today, are the following:

- Naive implementation of value iteration may lead to slow convergence and, in the case of infinite state spaces, policies with infinite average cost in every iteration step, even though the iterates $J_k(x)$ converge pointwise to $J^*(x)$ for every state $x$;

- Under certain conditions, with proper initialization $J_0$, we can have faster convergence and stability guarantees;

- In queueing networks, a proper $J_0$ can be found from well-known heuristics such as fluid model solutions.

We will illustrate these issues with examples involving queueing networks. For the generic results, including a proof of convergence of average-cost value iteration for MDPs with infinite state spaces, refer to [1].

1.1 Multiclass queueing networks

Consider a queueing network as illustrated in Fig. 1.

[Figure 1: A multiclass queueing network with three machines (Machine 1, Machine 2, Machine 3), each serving multiple queues.]


We introduce some notation:

    $N$: the number of queues in the system
    $\lambda_i$: probability of an exogenous arrival at queue $i$
    $\mu_i$: probability that a job at queue $i$ is completed if the job is being served
    $x_i$: state, the length of queue $i$
    $g(x) = \sum_{i=1}^{N} x_i$: cost function, in which state $x = (x_1, \ldots, x_N)$
    $a \in \{0,1\}^N$: $a_i = 1$ if a job from queue $i$ is being served, and $a_i = 0$ otherwise.

The interpretation is as follows. At each time stage, at most one of the following events can happen: a new job arrives at queue $i$ with probability $\lambda_i$; or a job from queue $i$ that is currently being served has its service completed, with probability $\mu_i$, and either moves to another queue or leaves the system, depending on the structure of the network. Note that, at each time stage, a server may choose to process a job from any of the queues associated with it. Therefore the decision $a$ encodes which queue is being processed at each server. We refer to such a queueing network as multiclass because jobs at different queues have different service rates and trajectories through the system.

As seen before, an optimal policy could be derived from the differential cost function $h^*$, which is the solution of Bellman's equation:

    \lambda^* e + h^* = T h^*.

Consider using value iteration for estimating $h^*$. This requires some initial guess $h_0$. A common choice is $h_0 = 0$; however, we will show that this can lead to slow convergence of $h_k$. Indeed, we know that $h^*$ is equivalent to a quadratic, in the sense that there is a constant $\gamma$ and a solution to Bellman's equation such that

    \frac{1}{\gamma} \sum_i x_i^2 \le h^*(x) \le \gamma \sum_i x_i^2.

Now let $h_0 = 0$. Then

    T^k h_0(x) = \min_u E\Big[ \sum_{t=0}^{k-1} \sum_{i=1}^{N} x_i(t) \,\Big|\, x_0 = x \Big].

Since

    E[x_i(t)] = E[x_i(t-1)] + E[A_i(t)] - E[D_i(t)],

where $E[A_i(t)] = \lambda_i$ (arrival) and $E[D_i(t)] \ge 0$ (departure), we have

    E[x_i(t) - x_i(t-1)] \le \lambda_i.    (1)

By (1), we have

    E[x_i(1)] \le E[x_i(0)] + \lambda_i.


Thus,

    T^k h_0(x) \le E\Big[ \sum_{t=0}^{k-1} \sum_{i=1}^{N} (x_i(0) + t\lambda_i) \Big]
              = \sum_{i=1}^{N} \Big( k\, x_i(0) + \frac{k(k-1)}{2} \lambda_i \Big).

This implies that $h_k(x)$ is upper bounded by a linear function of the state $x$. In order for it to approach a quadratic function of $x$, the iteration number $k$ must have the same magnitude as $x$. It follows that, if the state space is very large, convergence is slow. Moreover, if the state space is infinite, which is the case if queues do not have finite buffers, only pointwise convergence of $h_k(x)$ to $h^*(x)$ can be ensured, but for every $k$, there is some state $x$ such that $h_k(x)$ is a poor approximation to $h^*(x)$.

Example 1 (Single queue with controlled service rate)  Consider a single queue with:

- state $x$ defined as the queue length;
- $P_a(x, x+1) = \lambda$ (arrival rate);
- $P_a(x, x-1) = \mu_1 + a\mu_2$, where the action is $a \in \{0,1\}$;
- $P_a(x, x) = 1 - \lambda - \mu_1 - a\mu_2$.

Let the cost function be defined as

    g_a(x) = (1+a)x.

The interpretation is as follows. At each time stage, there is a choice between processing jobs at a lower service rate $\mu_1$ or at a higher service rate $\mu_1 + \mu_2$. Processing at a higher service rate helps to decrease future queue lengths, but an extra cost must be paid for the extra effort.

Suppose that $\lambda > \mu_1$. Then if the policy satisfies $u(x) = 0$ for all $x \ge x_0$, whenever the queue length is at least $x_0$ there are on average more job arrivals than departures, and it can be shown that eventually the queue length converges to infinity, leading to infinite average cost.

Suppose that $h_0(x) = 0$, $\forall x$. Then in every iteration $k$, there exists an $x_k$ such that $h_k(x) = (T^k h_0)(x) = cx + d$ for all $x \ge x_k$. Moreover, when $h_k = cx + d$ in a neighborhood of $x$, the greedy action is $u_k(x) = 0$, which is the case in which the average cost goes to infinity. □
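A quick simulation sketch of this example; the parameter values below are made up, chosen so that $\lambda > \mu_1$ but $\lambda < \mu_1 + \mu_2$, i.e., only the faster rate stabilizes the queue:

```python
import numpy as np

rng = np.random.default_rng(1)
lam, mu1, mu2 = 0.45, 0.3, 0.3    # illustrative: lam > mu1, lam < mu1 + mu2

def simulate(policy, T=200_000, x0=0):
    """Average cost (1+a)x of the controlled single queue under `policy`."""
    x, total = x0, 0.0
    for _ in range(T):
        a = policy(x)
        total += (1 + a) * x
        u = rng.random()
        if u < lam:
            x += 1
        elif u < lam + mu1 + a * mu2 and x > 0:
            x -= 1
    return total / T

print(simulate(lambda x: 0))   # never speeds up: queue drifts upward
print(simulate(lambda x: 1))   # always fast: stable, finite average cost
```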

As shown in [1], using the initial value $h_0(x) = 1 + x^2$ leads to stable policies for every iterate $h_k$, and ensures convergence to the optimal policy. The choice of $h_0$ as a quadratic arises from problem-specific knowledge. Moreover, appropriate choices in the case of queueing networks can be derived from well-known heuristics and analysis specific to the field.

2 Simulation-Based Methods

The dynamic programming algorithms studied so far have the following characteristics:


In realistic scenarios, each of these requirements may pose difficulties. When the state space is large, performing updates infinitely often in every state may be prohibitive, or even if it is feasible, a clever order of visitation may considerably speed up convergence. In many cases, the system parameters are not known, and instead one has only access to observations of the system. Finally, even if the transition probabilities are known, computing expectations of the form (2) may be costly. In the next few lectures, we will study simulation-based methods, which aim at alleviating these issues.

2.1 Asynchronous value iteration

We describe asynchronous value iteration (AVI) as

    J_{k+1}(x_k) = (T J_k)(x_k), \quad x_k \in S_k.

We have seen that, if every state has its value updated infinitely many times, then AVI converges (see arguments in Problem Set 1). The question remains as to whether convergence may be improved by selecting states in a particular order, and whether we can dispense with the requirement of visiting every state infinitely many times.

We will consider a version of AVI where state updates are based on actual or simulated trajectories of the system. It seems reasonable to expect that, if the system is often encountered at certain states, more emphasis should be placed on obtaining accurate estimates and good actions for those states, motivating performing value updates more often at those states. In the limit, it is clear that if a state is never visited, under any policy, then the value of the cost-to-go function at such a state never comes into play in the decision-making process, and no updates need to be performed for such a state at all. Based on the notion that state trajectories contain information about which states are most relevant, we propose the following version of AVI. We call it real-time value iteration (RTVI).

1. Take an arbitrary state $x_0$. Let $k = 0$.
2. Choose action $u_k$ in some fashion.
3. Let $x_{k+1} = f(x_k, u_k, w_k)$ (recall from Lecture 1 that $f$ gives an alternative representation for state transitions).
4. Let $J_{k+1}(x_{k+1}) = (T J_k)(x_{k+1})$.
5. Let $k = k+1$ and return to step 2.
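A sketch of this loop in code; the greedy action choice in step 2 is only one of several possibilities (the notes leave the choice open), and the simulated transition stands in for the system function f:

```python
import numpy as np

def rtvi(P, g, alpha, n_steps=10_000, seed=2):
    """Real-time value iteration along a simulated trajectory."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    J = np.zeros(n_states)
    x = 0                                        # arbitrary initial state
    for _ in range(n_steps):
        Q = g[:, x] + alpha * (P[:, x, :] @ J)   # Q(x, a) for current x
        a = int(Q.argmin())                      # greedy action choice
        x = rng.choice(n_states, p=P[a, x])      # simulate next state
        # value update only at the state just visited
        J[x] = (g[:, x] + alpha * (P[:, x, :] @ J)).min()
    return J
```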

2.2 Exploration vs. Exploitation

Note that there is still an element missing in the description of RTVI, namely, how to choose action $u_k$. It


In general, choosing $u_k$ greedily does not ensure convergence to the optimal policy. One possible failure scenario is illustrated in Figure 2. Suppose that there is a subset of states $B$ which is recurrent under an optimal policy, and a disjoint subset of states $A$ which is recurrent under another policy. If we start with a guess $J_0$ which is high enough at states outside region $A$, and always choose actions greedily, then an action that never leads to states outside region $A$ will be selected. Hence RTVI never has a chance of updating and correcting the initial guess $J_0$ at states in subset $B$, and in particular, the optimal policy is never achieved.

It turns out that, if we choose the initial value $J_0 \le J^*$, then greedy policy selection performs well, as shown in Fig. 2(b). We state this concept formally in the following theorem.

The previous discussion highlights a tradeoff that is fundamental to learning algorithms: the conflict of exploitation versus exploration. In particular, there is usually tension between exploiting information accumulated by previous learning steps and exploring different options, possibly at a certain cost, in order to gather more information.

[Figure 2: Initial value selection. (a) Improper initial value $J_0$ with greedy policy selection; (b) initial value $J_0$ less than or equal to $J^*$.]

Theorem 1  If $J_0 \le J^*$ and all states are reachable from one another, then the real-time value iteration algorithm (RTVI) with greedy policy $u_t$ satisfies the following:

(a) $J_k \to \bar{J}$ for some $\bar{J}$;
(b) $\bar{J} = J^*$ for all states visited infinitely many times;
(c) after some number of iterations, all decisions will be optimal.

Proof: Since $T$ is monotone, if $J_k \le J^*$, then

    (T J_k)(x) \le (T J^*)(x) = J^*(x) \quad \forall x,

so $J_{k+1}(x') \le J^*(x')$ for the updated state $x'$, and $J_{k+1}(x) = J_k(x) \le J^*(x)$ for $x \ne x'$.


Hence one could regard $J$ as a function from the set $A$ to $\mathbb{R}^{|A|}$. So $T_A$ is similar to the DP operator restricted to the subset $A$ of states, and

    \|T_A J - T_A \bar{J}\|_\infty \le \alpha \|J - \bar{J}\|_\infty.

Therefore, RTVI is AVI over $A$, with every state visited infinitely many times. Thus,

    J_k(x) \to \bar{J}(x) = \begin{cases} \tilde{J}(x), & \text{if } x \in A, \\ J_0(x), & \text{if } x \notin A. \end{cases}

Since the states $x \notin A$ are never visited, we must have

    P_a(x,y) = 0, \quad x \in A, \ y \notin A,

where $a$ is greedy with respect to $\bar{J}$. Let $u$ be the greedy policy of $\bar{J}$. Then

    \bar{J}(x) = g_u(x) + \alpha \sum_{y \in A} P_u(x,y) \bar{J}(y) = g_u(x) + \alpha \sum_{y \in S} P_u(x,y) \bar{J}(y), \quad x \in A.

Therefore, we conclude

    \bar{J}(x) = J_u(x) \ge J^*(x), \quad x \in A.

By the hypothesis $J_0 \le J^*$, we know that $\bar{J} \le J^*$, and hence

    \bar{J}(x) = J^*(x), \quad x \in A.

References

[1] R.-R. Chen and S. P. Meyn, Value Iteration and Optimization of Multiclass Queueing Networks, Queueing Systems, 32, pp. 65-97, 1999.


2.997 Decision-Making in Large-Scale Systems                        February 25
MIT, Spring 2004                                                    Handout #10

Lecture Note 7

1 Real-Time Value Iteration

Recall the real-time value iteration (RTVI) algorithm:

    choose  x_{k+1} = f(x_k, u_k, w_k),
    choose  u_k in some fashion,
    update  J_{k+1}(x_k) = (T J_k)(x_k), \quad J_{k+1}(x) = J_k(x), \ x \ne x_k.

We thus have

    (T J_k)(x_k) = \min_a \Big\{ g_a(x_k) + \alpha \sum_y P_a(x_k, y) J_k(y) \Big\}.

We encounter the following two questions in this algorithm:

1. What if we do not know $P_a(x,y)$?
2. Even if we know or can simulate $P_a(x,y)$, computing $\sum_y P_a(x,y) J(y)$ may be expensive.

To overcome these two problems, we consider the Q-learning approach.

2 Q-Learning

2.1 Q-factors

For every state-action pair, we consider

    Q^*(x,a) = g_a(x) + \alpha \sum_y P_a(x,y) J^*(y)    (1)
    J^*(x) = \min_a Q^*(x,a)    (2)

We can interpret these equations as Bellman's equations for an MDP with an expanded state space. We have the original states $x \in S$, with associated sets of feasible actions $A_x$, and extra states $(x,a)$, $x \in S$, $a \in A_x$, corresponding to state-action pairs, for which there is only one action available, and no decision must be made. Note that, whenever we are in a state $x$ where a decision must be made, the system transitions deterministically to state $(x,a)$ based on the state and the action $a$ chosen. Therefore we circumvent the need to perform expectations $\sum_y P_a(x,y) J(y)$ associated with greedy policies.


The corresponding dynamic programming operator $H$, given by $(HQ)(x,a) = g_a(x) + \alpha \sum_y P_a(x,y) \min_{a'} Q(y,a')$, satisfies the following properties.

Monotonicity  For all $Q$ and $\bar{Q}$ such that $Q \le \bar{Q}$, $HQ \le H\bar{Q}$.

Offset  $H(Q + ke) = HQ + \alpha ke$.

Contraction  $\|HQ - H\bar{Q}\|_\infty \le \alpha \|Q - \bar{Q}\|_\infty$, for all $Q, \bar{Q}$.

It follows that $H$ has a unique fixed point, corresponding to the Q-factor $Q^*$.

2.2 Q-Learning

We now develop a real-time value iteration algorithm for computing $Q^*$. An algorithm analogous to RTVI for computing the cost-to-go function is as follows:

    Q_{t+1}(x_t, u_t) = g_{u_t}(x_t) + \alpha \sum_y P_{u_t}(x_t, y) \min_a Q_t(y, a).

However, this algorithm undermines the idea that Q-learning is motivated by situations where we do not know $P_a(x,y)$ or find it expensive to compute expectations $\sum_y P_a(x,y) J(y)$. Alternatively, we consider variants that implicitly estimate this expectation, based on state transitions observed in system trajectories. Based on this idea, one possibility is to utilize a scheme of the form

    Q_{t+1}(x_t, a_t) = g_{a_t}(x_t) + \alpha \min_a Q_t(x_{t+1}, a).

However, note that such an algorithm should not be expected to converge; in particular, $\min_a Q_t(x_{t+1}, a)$ is a noisy estimate of $\sum_y P_{a_t}(x_t, y) \min_a Q_t(y,a)$. We consider a small-step version of this scheme, where the noise is attenuated:

    Q_{t+1}(x_t, a_t) = (1 - \gamma_t) Q_t(x_t, a_t) + \gamma_t \Big( g_{a_t}(x_t) + \alpha \min_a Q_t(x_{t+1}, a) \Big).    (4)
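A sketch of update (4) in code; the epsilon-greedy exploration rule and the constant step size are assumptions made here for illustration, since the lecture leaves the action choice open:

```python
import numpy as np

def q_learning(P, g, alpha, n_steps=100_000, gamma=0.1, eps=0.1, seed=3):
    """Tabular Q-learning: Q <- (1-gamma) Q + gamma (g + alpha min_a' Q)."""
    rng = np.random.default_rng(seed)
    n_actions, n_states, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    x = 0
    for _ in range(n_steps):
        if rng.random() < eps:                      # explore
            a = rng.integers(n_actions)
        else:                                       # exploit
            a = int(Q[x].argmin())
        y = rng.choice(n_states, p=P[a, x])         # observed transition
        target = g[a, x] + alpha * Q[y].min()       # noisy sample of (HQ)(x,a)
        Q[x, a] = (1 - gamma) * Q[x, a] + gamma * target
        x = y
    return Q
```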

We will study the properties of (4) under the more general framework of stochastic approximation, which is at the core of many simulation-based or real-time dynamic programming algorithms.

3 Stochastic Approximation

In the stochastic approximation setting, the goal is to solve a system of equations

    r = Hr,

where $r$ is a vector in $\mathbb{R}^n$ for some $n$ and $H$ is an operator defined on $\mathbb{R}^n$. If we know how to compute $Hr$ for any given $r$, it is common to try to solve this system of equations by value iteration:

    r_{k+1} = H r_k.    (5)


We can also do the summation recursively by setting

    r_t^{(i)} = \frac{1}{i} \sum_{j=1}^{i} (H r_t + w_j),
    r_t^{(i+1)} = \frac{i}{i+1} r_t^{(i)} + \frac{1}{i+1} (H r_t + w_{i+1}).

Therefore, $r_{t+1} = r_t^{(k)}$. Finally, we may consider replacing samples $H r_t + w_i$ with samples $H r_t^{(i-1)} + w_i$, obtaining the final form

    r_{t+1} = (1 - \gamma_t) r_t + \gamma_t (H r_t + w_t).

A simple application of these ideas involves estimating the expected value of a random variable by drawing i.i.d. samples.

Example 1  Let $v_1, v_2, \ldots$ be i.i.d. random variables. Given

    r_{t+1} = \frac{t}{t+1} r_t + \frac{1}{t+1} v_{t+1},

we know that $r_t \to E[v]$ by the strong law of large numbers. We can actually prove the general version:

    r_{t+1} = (1 - \gamma_t) r_t + \gamma_t v_{t+1} \to E[v] \ \text{w.p. 1, if} \ \sum_{t=1}^{\infty} \gamma_t = \infty \ \text{and} \ \sum_{t=1}^{\infty} \gamma_t^2 < \infty.
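A tiny numerical illustration of this general version; step sizes $\gamma_t = 1/t$ satisfy both conditions:

```python
import numpy as np

rng = np.random.default_rng(4)
r = 0.0
for t in range(1, 200_000):
    v = rng.exponential(2.0)        # i.i.d. samples with E[v] = 2
    gamma = 1.0 / t                 # sum gamma = inf, sum gamma^2 < inf
    r = (1 - gamma) * r + gamma * v
print(r)                            # close to 2
```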


3.1 Lyapunov function analysis

The question we try to answer is: does (8) converge? If so, where does it converge to? We will first illustrate the basic ideas of Lyapunov function analysis by considering a deterministic case.

3.1.1 Deterministic Case

In the deterministic case, we have $S(r,w) = S(r)$. Suppose there exists some unique $r^*$ such that

    S(r^*) = H r^* - r^* = 0.

The basic idea is to show that a certain measure of distance between $r_t$ and $r^*$ is decreasing.

Example 2  Suppose that $F$ is a contraction with respect to $\|\cdot\|_2$, with contraction factor $\beta < 1$. Then

    r_{t+1} = r_t + \gamma_t (F r_t - r_t)

converges.

Proof: Since $F$ is a contraction, there exists a unique $r^*$ s.t. $F r^* = r^*$. Let

    V(r) = \|r - r^*\|_2^2.

We will show $V(r_{t+1}) \le V(r_t)$. Observe

    V(r_{t+1}) = \|r_{t+1} - r^*\|^2
               = \|r_t + \gamma_t (F r_t - r_t) - r^*\|^2
               = \|(1-\gamma_t)(r_t - r^*) + \gamma_t (F r_t - r^*)\|^2
               \le (1-\gamma_t) \|r_t - r^*\|^2 + \gamma_t \|F r_t - r^*\|^2
               \le (1-\gamma_t) \|r_t - r^*\|^2 + \gamma_t \beta \|r_t - r^*\|^2
               = \|r_t - r^*\|^2 - (1-\beta)\gamma_t \|r_t - r^*\|^2.

(The first inequality uses convexity of $\|\cdot\|^2$, and the second uses the contraction property, with $\|F r_t - r^*\|^2 \le \beta^2 \|r_t - r^*\|^2 \le \beta \|r_t - r^*\|^2$.) Therefore, $\|r_t - r^*\|^2$ is nonincreasing and bounded below by zero, and thus converges. Suppose it converges to some $c > 0$; then $\|r_t - r^*\|^2 \ge c$ for all $t$, and

    \|r_{t+1} - r^*\|^2 \le \|r_t - r^*\|^2 - (1-\beta)\gamma_t c
                        \le \|r_{t-1} - r^*\|^2 - (1-\beta)(\gamma_t + \gamma_{t-1}) c
                        \le \cdots
                        \le \|r_0 - r^*\|^2 - (1-\beta) c \sum_{l=1}^{t} \gamma_l.

Hence $\|r_0 - r^*\|^2 \ge (1-\beta) c \sum_{l=1}^{t} \gamma_l$,


1. We define a distance $V(r_t) \ge 0$ indicating how far $r_t$ is from a solution $r^*$ satisfying $S(r^*) = 0$;
2. We show that the distance is nonincreasing in $t$;
3. We show that the distance indeed converges to 0.

The argument also involves the basic result that every nonincreasing sequence bounded below converges, to show that the distance converges.

Motivated by these points, we introduce the notion of a Lyapunov function:

Definition 1  We call a function $V$ a Lyapunov function if $V$ satisfies

(a) $V(\cdot) \ge 0$;
(b) $(\nabla_r V(r))^T S(r) \le 0$;
(c) $V(r) = 0 \Leftrightarrow S(r) = 0$.

3.1.2 Stochastic Case

The argument used for convergence in the stochastic case parallels the argument used in the deterministic case. Let $\mathcal{F}_t$ denote all information that is available at stage $t$, and let

    \bar{S}_t(r) = E[S(r, w_t) \,|\, \mathcal{F}_t].

Then we require a Lyapunov function $V$ satisfying

    V(\cdot) \ge 0    (9)
    (\nabla V(r_t))^T \bar{S}_t(r_t) \le -c \|\nabla V(r_t)\|^2    (10)
    \|\nabla V(r) - \nabla V(\bar{r})\| \le L \|r - \bar{r}\|    (11)
    E\big[ \|S(r_t, w_t)\|^2 \,\big|\, \mathcal{F}_t \big] \le K_1 + K_2 V(r_t),    (12)

for some constants $c$, $L$, $K_1$ and $K_2$.

Note that (9) and (10) are direct analogues of requiring the existence of a distance that is nonincreasing in $t$; moreover, (10) ensures that the distance decreases at a certain rate if $r_t$ is far from a desired solution $r^*$ satisfying $V(r^*) = 0$. Condition (11) imposes some regularity on $V$, which is required to show that $V(r_t)$ does indeed converge to 0, and condition (12) imposes some control over the noise.

A last point worth mentioning is that (10) implies that the expected value of $V(r_t)$ is nonincreasing; however, we may have $V(r_{t+1}) > V(r_t)$ occasionally. Therefore we need a stochastic counterpart to the result that every nonincreasing sequence bounded below converges. The stochastic counterpart of interest to our analysis is given below.

Theorem 1 (Supermartingale Convergence Theorem)  Suppose that $X_t$, $Y_t$ and $Z_t$ are nonnegative


random variables adapted to $\mathcal{F}_t$, with $E[X_{t+1} \,|\, \mathcal{F}_t] \le X_t - Y_t + Z_t$ and $\sum_{t=1}^{\infty} Z_t < \infty$ w.p. 1. Then:

1. $X_t$ converges to a limit (which can be a random variable) with probability 1;
2. $\sum_{t=1}^{\infty} Y_t < \infty$.


In order to complete the proof of Theorem 1 from the four lemmas above, we have to consider the probabilities of two forms of failure:

- failure to stop the algorithm with a near-optimal policy


References

[1] M. Kearns and S. Singh, Near-Optimal Reinforcement Learning in Polynomial Time, Machine Learning, Volume 49, Issue 2, pp. 209-232, Nov 2002.


    2.997 Decision-Making in Large-Scale Systems March 8

    MIT, Spring 2004 Handout #13

    Lecture Note 10

    1 Value Function Approximation

DP problems are centered around the cost-to-go function $J^*$ or the Q-factor $Q^*$. In certain problems, such as linear-quadratic-Gaussian (LQG) systems, $J^*$ exhibits some structure which allows for its compact representation:

Example 1 In an LQG system, we have

$$x_{k+1} = A x_k + B u_k + C w_k, \quad x_k \in \mathbb{R}^n,$$
$$g(x, u) = x^T D x + u^T E u,$$

where $x_k$ represents the system state, $u_k$ represents the control action, and $w_k$ is Gaussian noise. It can be shown that the optimal policy is of the form

$$u_k = L_k x_k$$

and the optimal cost-to-go function is of the form

$$J^*(x) = x^T R x + S, \quad R \in \mathbb{R}^{n \times n},\ S \in \mathbb{R},$$

where $R$ is a symmetric matrix. It follows that, if there are $n$ state variables (i.e., $x_k \in \mathbb{R}^n$), storing $J^*$ requires storing $n(n+1)/2 + 1$ real numbers, corresponding to the matrix $R$ and the scalar $S$. The computational time and storage space required is quadratic in the number of state variables.
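The structure in Example 1 can be made computational. The sketch below (our own, under the assumptions of a finite horizon N, noise $w_k \sim N(0, I)$, zero terminal cost, and the convention $u_k = -L_k x_k$) propagates the cost-to-go $J_k(x) = x^T R_k x + S_k$ by the Riccati recursion, touching only an n x n matrix and a scalar:

import numpy as np

def lqg_cost_to_go(A, B, C, D, E, N):
    """Backward Riccati recursion for x_{k+1} = A x_k + B u_k + C w_k,
    stage cost x'D x + u'E u, w_k ~ N(0, I), terminal cost zero."""
    n = A.shape[0]
    R, S = np.zeros((n, n)), 0.0
    for _ in range(N):
        L = np.linalg.solve(E + B.T @ R @ B, B.T @ R @ A)   # gain: u_k = -L x_k
        S = S + np.trace(C.T @ R @ C)                       # noise contribution to the cost
        R = D + A.T @ R @ A - A.T @ R @ B @ L               # n x n, stays symmetric
    return R, S, L                                          # J_0(x) = x'R x + S

n = 3
A, B, C = np.eye(n), np.eye(n), 0.1 * np.eye(n)
D, E = np.eye(n), np.eye(n)
R, S, L = lqg_cost_to_go(A, B, C, D, E, N=50)
print(R.shape, S)    # storage is O(n^2) numbers, independent of the state-space size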

In general, we are not as lucky as in the LQG case, and exact representation of $J^*$ requires that it be stored as a lookup table, with one value per state. Therefore, the space required is proportional to the size of the state space, which grows exponentially with the number of state variables. This problem, known as the curse of dimensionality, makes dynamic programming intractable in the face of most problems of practical scale.

Example 2 Consider the game of Tetris, represented in Fig. 1. As seen in previous lectures, this game may be represented as an MDP, and a possible choice of state is the pair (B, P), in which $B \in \{0,1\}^{n \times m}$ represents the board configuration and P represents the current falling piece. More specifically, we have $B(i,j) = 1$ if position $(i,j)$ of the board is filled, and $B(i,j) = 0$ otherwise.

If there are $p$ different types of pieces, and the board has dimension $n \times m$, the number of states is on the order of $p \cdot 2^{nm}$.

Figure 1: A Tetris game

(1) Approximation in the Policy Space

The problem of selecting a policy can be posed as a deterministic optimization problem, in the following way. Denote by $\lambda(u)$ the average cost of policy $u$. Then our problem corresponds to

$$\min_{u \in U} \lambda(u), \qquad (1)$$

where $U$ is the set of all possible policies. In principle, we could solve (1) by enumerating all policies and choosing the one with the smallest value of $\lambda(u)$; however, note that the number of policies is exponential in the number of states: we have $|U| = |A|^{|S|}$. If there is no special structure to $U$, this problem requires even more computational time than solving Bellman's equation for the cost-to-go function. A possible approach to approximating the solution is to transform problem (1) by considering only a tractable subset of all policies:

$$\min_{u \in F} \lambda(u),$$

where $F$ is a subset of the policy space. If $F$ has some appropriate format, e.g., we consider policies that are parameterized by a continuous variable, we may be able to solve this problem without having to enumerate all policies in the set, but by using some standard optimization method such as gradient descent; a small sketch is given below. Methods based on this idea are called approximations in the policy space, and will be studied later on in this class.
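As an illustration of this idea (ours, not from the notes), the sketch below searches a one-parameter class of threshold policies for a small controlled queue, estimating the average cost of each policy by simulation rather than enumerating all $|A|^{|S|}$ policies; a gradient method could replace the coarse search over the parameter.

import numpy as np

def avg_cost(theta, T=20_000):
    # Estimate the average cost of the threshold policy "serve fast iff x > theta"
    # by simulation; a fixed seed gives common random numbers across thetas.
    rng = np.random.default_rng(0)
    x, total = 0, 0.0
    for _ in range(T):
        mu, q = (0.6, 2.0) if x > theta else (0.2, 0.0)  # service prob, service cost
        total += x + q                                   # stage cost g(x, a) = x + q(a)
        u = rng.random()
        if u < 0.25 and x < 50:
            x += 1                                       # arrival w.p. 0.25
        elif u < 0.25 + mu and x > 0:
            x -= 1                                       # service completion w.p. mu
    return total / T

best = min(range(20), key=avg_cost)                      # coarse search over the parameter
print("best threshold:", best)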

(2) Cost-to-go Function Approximation

Another approach to approximating the dynamic programming solution is to approximate the cost-to-go function. The underlying idea for cost-to-go function approximation is that $J^*$ has some structure that allows for an approximate compact representation

$$J^*(x) \approx \tilde{J}(x, r), \quad \text{for some parameter } r \in \mathbb{R}^p.$$


Example 3

$\tilde{J}(x, r) = \cos(x^T r)$ (nonlinear in $r$)
$\tilde{J}(x, r) = r_0 + r_1^T x$ (linear in $r$)
$\tilde{J}(x, r) = r_0 + r_1^T \phi(x)$ (linear in $r$)

In the next few lectures, we will focus on cost-to-go function approximation. Note that there are two important preconditions to the development of an effective approximation. First, we need to choose a parameterization $\tilde{J}$ that can closely approximate the desired cost-to-go function. In this respect, a suitable choice requires some practical experience or theoretical analysis that provides rough information on the shape of the function to be approximated. Regularities associated with the function, for example, can guide the choice of representation. Designing an approximation architecture is a problem-specific task and is not the main focus of these notes; however, we provide some general guidelines and illustration via case studies involving queueing problems. Second, given a parameterization for the cost-to-go function approximation, we need an efficient algorithm that computes appropriate parameter values.

We will start by describing usual choices for approximation architectures.

    2 Approximation Architectures

2.1 Neural Networks

A common choice of approximation architecture is the neural network. Fig. 2 represents a neural network. The underlying idea is as follows: we first convert the original state $x$ into a vector $\bar{x} \in \mathbb{R}^n$, for some $n$. This vector is used as the input to a linear layer of the neural network, which maps the input to a vector $y \in \mathbb{R}^m$, for some $m$, such that $y_j = \sum_{i=1}^{n} r_{ij}\bar{x}_i$. The vector $y$ is then used as the input to a sigmoidal layer, which outputs a vector $z \in \mathbb{R}^m$ with the property that $z_j = f(y_j)$, where $f(\cdot)$ is a sigmoidal function. A sigmoidal function is any function with the following properties:

1. monotonically increasing
2. differentiable
3. bounded

Fig. 3 represents a typical sigmoidal function.

The combination of a linear and a sigmoidal layer is called a perceptron, and a neural network consists of a chain of one or more perceptrons (i.e., the output of a sigmoidal layer can be redirected to another linear layer, and so on). Finally, the output of the neural network consists of a weighted sum of the output $z$ of the final layer: $\tilde{J}(x, r) = \sum_i w_i z_i$, where the $w_i$ are output weights.


Figure 2: A neural network

    Figure 3: A sigmoidal function

Neural networks can approximate broad classes of functions: for a suitable class of functions on some bounded and closed set, if the functions are uniformly smooth, error $O(1/\sqrt{n})$ can be achieved with $n$ sigmoidal functions (Barron, 1990). Note, however, that in order to obtain a good approximation, an adequate set of weights $r$ must be found. Backpropagation, which is simply a gradient descent algorithm, is able to find a local optimum among all sets of weights, but finding the global optimum may be a difficult problem.
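The architecture just described is easy to state in code. The following sketch (ours; the weights are random placeholders rather than trained values) implements a single perceptron, a linear layer followed by a sigmoidal layer, with a weighted sum of the final layer as output:

import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))      # monotonically increasing, differentiable, bounded

def neural_net(x, R1, w):
    """x: input vector (n,); R1: linear-layer weights (m, n); w: output weights (m,)."""
    y = R1 @ x                           # linear layer: y_j = sum_i r_ij x_i
    z = sigmoid(y)                       # sigmoidal layer: z_j = f(y_j)
    return w @ z                         # output: weighted sum of the final layer

n, m = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=n)
print(neural_net(x, rng.normal(size=(m, n)), rng.normal(size=m)))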

2.2 State Space Partitioning

Another common choice of approximation architecture is based on partitioning of the state space. The underlying idea is that similar states may be grouped together. For instance, in an MDP involving


2.3 Features

A special case of state space partitioning consists of mapping states to features, and considering approximations of the cost-to-go function that are functions of the features. The hope is that the features would capture aspects of the state that are relevant for the decision-making process and discard irrelevant details, thus providing a more compact representation. At the same time, one would also hope that, with an appropriate choice of features, the mapping from features to the (approximate) cost-to-go function would be smoother than that from the original state to the cost-to-go function, thereby allowing for successful approximation with architectures that are suitable for smooth mappings (e.g., polynomials). This process is represented below:

$$\text{state } x \;\to\; \text{features } f(x) \;\to\; \tilde{J}(f(x), r),$$

i.e., we take $J^*(x) \approx \tilde{J}(f(x), r)$, with the mapping from $f(x)$ to $\tilde{J}$ smooth.

Example 4 Consider the Tetris game. Which features should we choose? Letting $h(i)$ denote the height of column $i$, natural candidates, implemented in the sketch below, include:

1. the height differences $|h(i) - h(i+1)|$
2. the number of holes (empty cells covered by a filled cell)
3. the maximum height $\max_i h(i)$
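A possible implementation of these features (our sketch; we take row 0 to be the top of the board) is:

import numpy as np

def tetris_features(b):
    n, m = b.shape
    filled = b.any(axis=0)                               # which columns contain any block
    h = np.where(filled, n - np.argmax(b, axis=0), 0)    # column heights h(j)
    # holes: empty cells lying below the top filled cell of their column
    holes = int(sum(b[n - h[j]:, j].size - b[n - h[j]:, j].sum() for j in range(m)))
    return {
        "height_diffs": np.abs(np.diff(h)),              # |h(i) - h(i+1)|
        "holes": holes,
        "max_height": int(h.max()),
    }

b = np.array([[0, 0, 0],
              [1, 0, 0],
              [1, 0, 1],
              [0, 1, 1]])
print(tetris_features(b))   # heights [3, 1, 2] -> diffs [2, 1], holes = 1, max = 3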


2.997 Decision-Making in Large-Scale Systems March 10
MIT, Spring 2004 Handout #14
Lecture Note 11

1 Complexity and Model Selection

In this lecture, we will consider the problem of supervised learning. The setup is as follows. We have pairs (x, y), distributed according to a joint distribution P(x, y). We would like to describe the relationship between x and y through some function f chosen from a set of available functions C, so that $y \approx f(x)$. Ideally, we would choose f by solving

$$f^* = \arg\min_{f \in C} E_{x,y \sim P}\left[(y - f(x))^2\right] \qquad \text{(test error)}$$

However, we will assume that the distribution P is not known; rather, we only have access to samples $(x_i, y_i)$. Intuitively, we may try to solve

$$\min_{f \in C} \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 \qquad \text{(training error)}$$

instead. It also seems that the richer the class C is, the better the chance to correctly describe the relationship between x and y. In this lecture, we will show that this is not the case: the appropriate complexity of C, and the selection of a model for describing how x and y are related, must be guided by how much data is actually available. This issue is illustrated in the following example.

Example Consider fitting the following data by a polynomial of finite degree:

x | 1   2   3   4
y | 2.5 3.5 4.5 5.5

Among several others, the following polynomials fit the data perfectly:

$$y = x + 1.5$$
$$y = 2x^4 - 20x^3 + 70x^2 - 99x + 49.5$$

Which polynomial should we choose?


Now consider the following (possibly noisy) data:

x | 1   2   3   4
y | 2.3 3.5 4.7 5.5

Fitting the data with a first-degree polynomial yields

$$y = 1.08x + 1.3;$$

fitting it with a fourth-degree polynomial yields (among others)

$$y = 2x^4 - 20.0667x^3 + 70.4x^2 - 99.5333x + 49.5.$$

Which polynomial should we choose?

Figure: Training error vs. test error.
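The example is easy to reproduce numerically. The sketch below (ours) fits the noisy data with a line and with the interpolating cubic, the minimal-degree polynomial that fits the four points exactly, and compares their predictions at a new point x = 5:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.3, 3.5, 4.7, 5.5])

line = np.polyfit(x, y, 1)       # least-squares line, 1.08 x + 1.3
cubic = np.polyfit(x, y, 3)      # interpolates all four points exactly

print(np.polyval(line, 5.0))     # ~6.7, continues the visible trend
print(np.polyval(cubic, 5.0))    # 5.5, already turning downward: the cubic fit the noise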

It seems intuitive in the previous example that a line may be the best description for the relationship between x and y, even though a polynomial of degree 3 describes the data perfectly in both cases and no linear function is able to describe the data perfectly in the second case. Is the intuition correct, and if so, how can we decide on an appropriate representation, if relying solely on the training error does not seem completely reasonable?

The essence of the problem is as follows. Ultimately, what we are interested in is the ability of our fitted curve to predict future data, rather than simply explaining the observed data. In other words, we would like to choose a predictor $\hat{y}$ that minimizes the expected error $E|y(x) - \hat{y}(x)|$ over all possible x. We call this the test error. The average error over the data set is called the training error.

We will show that training error and test error can be related through a measure of the complexity of the class of predictors being considered. Appropriate choice of a predictor will then be shown to require balancing the training error and the complexity of the predictors being considered. This relationship is described in Fig. 1, where we plot test and training errors versus the complexity of the predictor class C when the number of samples is fixed. The main difficulty is that, as indicated in Fig. 1, there exists a tradeoff between the complexity and the errors, i.e., the training error and the test error: while the approximation error over the sampled points goes to zero as we consider richer approximation classes, the same is not true for the test error, which we are ultimately interested in minimizing. This is due to the fact that, with only finitely many samples available, a sufficiently rich class of predictors fits the noise in the data rather than the underlying relationship between x and y.


Figure 1: Error vs. degree of approximation function

3.1 Classification with a finite number of classifiers

Suppose that, given n samples $(x_i, y_i)$, $i = 1, \dots, n$, we need to choose a classifier from a finite set of classifiers $\{f_1, \dots, f_d\}$. Define

$$\epsilon(k) = E\left[|y - f_k(x)|\right]$$
$$\hat{\epsilon}_n(k) = \frac{1}{n}\sum_{i=1}^{n}|y_i - f_k(x_i)|.$$

In words, $\epsilon(k)$ is the test error associated with classifier $f_k$, and $\hat{\epsilon}_n(k)$ is a random variable representing the training error associated with classifier $f_k$ over the samples $(x_i, y_i)$, $i = 1, \dots, n$. As described before, we would like to find $k^* = \arg\min_k \epsilon(k)$, but cannot compute $\epsilon$ directly. Let us consider using instead

$$\hat{k} = \arg\min_k \hat{\epsilon}_n(k).$$

We are interested in the following question: How does the test error $\epsilon(\hat{k})$ compare to the optimal error $\epsilon(k^*)$? Suppose that

$$|\hat{\epsilon}_n(k) - \epsilon(k)| \leq \bar{\epsilon}, \quad \forall k, \qquad (1)$$

for some $\bar{\epsilon} > 0$. Then we have

$$\epsilon(\hat{k}) \leq \hat{\epsilon}_n(\hat{k}) + \bar{\epsilon} \leq \hat{\epsilon}_n(k^*) + \bar{\epsilon} \leq \epsilon(k^*) + 2\bar{\epsilon}.$$

In words, if the training error is close to the test error for all classifiers $f_k$, then using $\hat{k}$ instead of $k^*$ is near-optimal.

But can we expect (1) to hold? Observe that the $|y_i - f_k(x_i)|$ are i.i.d. Bernoulli random variables. From the strong law of large numbers, we must have $\hat{\epsilon}_n(k) \to \epsilon(k)$ w.p. 1. This means that, if there are sufficiently many samples, (1) should be true. Having only finitely many samples, we face two questions:

(1) How many samples are needed before we have high confidence that $\hat{\epsilon}_n(k)$ is close to $\epsilon(k)$?

(2) Can we show that $\hat{\epsilon}_n(k)$ approaches $\epsilon(k)$ equally fast for all $f_k \in C$?

The first question is resolved by the Chernoff bound: for i.i.d. Bernoulli random variables $x_i$, $i = 1, \dots, n$, we have

$$P\left(\left|\frac{1}{n}\sum_{i=1}^{n} x_i - E[x_1]\right| > \epsilon\right) \leq 2\exp(-2n\epsilon^2).$$

Moreover, since there are only finitely many functions in C, uniform convergence of $\hat{\epsilon}_n(k)$ to $\epsilon(k)$ follows immediately from the union bound:

$$P(\exists k : |\hat{\epsilon}_n(k) - \epsilon(k)| > \epsilon) = P\left(\bigcup_k \{|\hat{\epsilon}_n(k) - \epsilon(k)| > \epsilon\}\right) \leq \sum_{k=1}^{d} P(|\hat{\epsilon}_n(k) - \epsilon(k)| > \epsilon) \leq 2d\exp(-2n\epsilon^2).$$

Therefore we have the following theorem.

Theorem 1 With probability at least $1 - \delta$, the training set $(x_i, y_i)$, $i = 1, \dots, n$, will be such that

$$\text{test error} \leq \text{training error} + \epsilon(d, n, \delta),$$

where

$$\epsilon(d, n, \delta) = \sqrt{\frac{1}{2n}\left(\log 2d + \log\frac{1}{\delta}\right)}.$$
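Theorem 1 translates directly into a sample-size calculator. The small utility below (our addition) evaluates the gap $\epsilon(d, n, \delta)$ and inverts the formula to find the number of samples needed to guarantee a target gap:

import math

def eps_gap(d, n, delta):
    # the bound of Theorem 1 on (test error - training error)
    return math.sqrt((math.log(2 * d) + math.log(1 / delta)) / (2 * n))

def samples_needed(d, eps, delta):
    # smallest n making eps_gap(d, n, delta) <= eps
    return math.ceil((math.log(2 * d) + math.log(1 / delta)) / (2 * eps ** 2))

print(eps_gap(d=1000, n=10_000, delta=0.05))          # ~0.023
print(samples_needed(d=1000, eps=0.01, delta=0.05))   # ~53,000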

4 Measures of Complexity

In Theorem 1, the error $\epsilon(d, n, \delta)$ is on the order of $\sqrt{\log d}$. In other words, the more classifiers are under consideration, the larger the bound on the difference between the test and training errors, and the difference grows as a function of $\sqrt{\log d}$. It follows that, for our purposes, $\log d$ captures the complexity of a finite class of classifiers.

4.1 VC dimension


The VC dimension is a property of a class C of functions: for each set C, we have an associated measure of complexity, $d_{VC}(C)$. $d_{VC}$ captures how much variability there is between different functions in C.

The underlying idea is as follows. Take n points $x_1, \dots, x_n$, and consider the binary vectors in $\{-1, +1\}^n$ formed by applying a function $f \in C$ to $(x_1, \dots, x_n)$. How many different vectors can we come up with? In other words, consider the following matrix:

$$\begin{pmatrix} f_1(x_1) & f_1(x_2) & \dots & f_1(x_n) \\ f_2(x_1) & f_2(x_2) & \dots & f_2(x_n) \\ \vdots & \vdots & \ddots & \vdots \end{pmatrix}$$

where $f_i \in C$. How many distinct rows can this matrix have? This discussion leads to the notion of shattering and to the definition of the VC dimension.

Definition 1 (Shattering) A set of points $x_1, \dots, x_n$ is shattered by a class C of classifiers if, for any assignment of labels $y_i \in \{-1, 1\}$, there is $f \in C$ such that $f(x_i) = y_i$ for all $i$.

Definition 2 The VC dimension of C is the cardinality of the largest set it can shatter.

Example 1 Consider $|C| = d$. Suppose $x_1, x_2, \dots, x_n$ is shattered by C. We need $d \geq 2^n$ and thus $n \leq \log_2 d$. This means that $d_{VC}(C) \leq \log_2 d$.

Example 2 Consider $C = \{\text{hyperplanes in } \mathbb{R}^2\}$. Any two points in $\mathbb{R}^2$ can be shattered, hence $d_{VC}(C) \geq 2$. C can also shatter any three points in $\mathbb{R}^2$ that are not collinear, hence $d_{VC}(C) \geq 3$. Since C cannot shatter any four points in $\mathbb{R}^2$, we have $d_{VC}(C) \leq 3$. It follows that $d_{VC}(C) = 3$. Moreover, it can be shown that, if $C = \{\text{hyperplanes in } \mathbb{R}^n\}$, then $d_{VC}(C) = n + 1$.

Example 3 If C is the set of all convex sets in $\mathbb{R}^2$, we can show that $d_{VC}(C) = \infty$.

It turns out that the VC dimension provides a generalization of the results from the previous section, derived for finite sets of classifiers, to general classes of classifiers:

Theorem 2 With probability at least $1 - \delta$ over the choice of sample points $(x_i, y_i)$, $i = 1, \dots, n$, we have

$$\epsilon(f) \leq \hat{\epsilon}_n(f) + \epsilon(n, d_{VC}(C), \delta), \quad \forall f \in C,$$

where

$$\epsilon(n, d_{VC}, \delta) = \sqrt{\frac{d_{VC}\left(\log\frac{2n}{d_{VC}} + 1\right) + \log\frac{4}{\delta}}{n}}.$$

5 Structural Risk Minimization


Based on the previous results, we may consider the following approach to selecting a class of functions C whose complexity is appropriate for the number of samples available. Suppose that we have several classes $C_1 \subset C_2 \subset \dots \subset C_p$; note that complexity increases from $C_1$ to $C_p$. Let $f_1, f_2, \dots, f_p$ be the classifiers that minimize the training error $\hat{\epsilon}_n(f_i)$ within each class. Then, given a confidence level $\delta$, we may find upper bounds on the test error $\epsilon(f_i)$ associated with each classifier:

$$\epsilon(f_i) \leq \hat{\epsilon}_n(f_i) + \epsilon(d_{VC}(C_i), n, \delta),$$

with probability at least $1 - \delta$, and we can choose the classifier $f_i$ that minimizes the above upper bound. This approach is called structural risk minimization; a small sketch is given below.

There are two difficulties associated with structural risk minimization: first, the upper bounds provided by Theorems 1 and 2 may be loose; second, it may be difficult to determine the VC dimension of a given class of classifiers, and rough estimates or upper bounds may have to be used instead. Still, this may be a reasonable approach if we have a limited amount of data. If we have a lot of data, an alternative approach is as follows. We can split the data into three sets: a training set, a validation set and a test set. We use the training set to find the classifier $f_i$ within each class $C_i$ that minimizes the training error; use the validation set to estimate the test error of each selected classifier $f_i$, and choose the classifier $\hat{f}$ from $f_1, \dots, f_p$ with the smallest estimate; and finally, use the test set to generate an estimate of the test error associated with $\hat{f}$.
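As an illustration of structural risk minimization (our sketch, not from the notes), the code below selects a polynomial degree by minimizing training error plus the penalty of Theorem 2. The value $d_{VC} = k + 1$ used for degree-k polynomials is exactly the kind of rough complexity estimate the text mentions, and we reuse the classification penalty purely for illustration on a regression fit:

import math
import numpy as np

def penalty(n, dvc, delta=0.05):
    # the complexity term of Theorem 2
    return math.sqrt((dvc * (math.log(2 * n / dvc) + 1) + math.log(4 / delta)) / n)

rng = np.random.default_rng(0)
x = rng.uniform(0, 4, size=30)
y = 1.1 * x + 1.3 + rng.normal(scale=0.2, size=30)    # the truth is a noisy line

scores = {}
for k in range(1, 8):                                  # nested classes C_1 subset ... C_7
    coeffs = np.polyfit(x, y, k)                       # training-error minimizer in C_k
    train_err = np.mean(np.abs(y - np.polyval(coeffs, x)))
    scores[k] = train_err + penalty(len(x), dvc=k + 1)

print(min(scores, key=scores.get))                     # a small degree should win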

2.997 Decision-Making in Large-Scale Systems March 12
MIT, Spring 2004 Handout #16
Lecture Note 12

1 Value Function Approximation and Policy Performance

Recall that two tasks must be accomplished in order for a value function approximation scheme to be successful:

1. We must pick a good representation $\tilde{J}$, such that $J^*(\cdot) \approx \tilde{J}(\cdot, r)$ for at least some parameter $r$;

2. We must pick a good parameter $\bar{r}$, such that $J^*(x) \approx \tilde{J}(x, \bar{r})$.

Consider approximating $J^*$ with a linear architecture, i.e., let

$$\tilde{J}(x, r) = \sum_{i=1}^{p} \phi_i(x) r_i,$$

for some functions $\phi_i$, $i = 1, \dots, p$. We can define a matrix $\Phi \in \mathbb{R}^{|S| \times p}$ given by

$$\Phi = \begin{bmatrix} | & & | \\ \phi_1 & \dots & \phi_p \\ | & & | \end{bmatrix}.$$

With this notation, we can represent each function $\tilde{J}(\cdot, r)$ as

$$\tilde{J} = \Phi r.$$

Fig. 1 gives a geometric interpretation of value function approximation. We may think of $J^*$ as a vector in $\mathbb{R}^{|S|}$; by considering approximations of the form $\tilde{J} = \Phi r$, we restrict attention to the hyperplane $\{\Phi r\}$ in the same space. Given a norm $\|\cdot\|$ (e.g., the Euclidean norm), an ideal value function approximation algorithm would choose $r$ minimizing $\|J^* - \Phi r\|$; in other words, it would find the projection $\Phi r^*$ of $J^*$ onto the hyperplane. Note that $\|J^* - \Phi r^*\|$ is a natural measure for the quality of the approximation architecture, since it is the best approximation error that can be attained by any algorithm given the choice of $\Phi$.

Algorithms for value function approximation found in the literature do not compute the projection $\Phi r^*$, since this is an intractable problem. Building on the knowledge that $J^*$ satisfies Bellman's equation, value function approximation typically involves adaptation of exact dynamic programming algorithms. For instance, drawing inspiration from value iteration, one might consider an approximate value iteration that, at each step, applies the dynamic programming operator and maps the result back onto the hyperplane $\{\Phi r\}$. The hope is that, if the architecture $\Phi$ is capable of producing a good approximation to $J^*$, then the approximation algorithm should be able to produce a relatively good approximation.
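On a toy problem the ideal projection is computable directly, which makes the geometry of Fig. 1 concrete. A small illustration (ours, with a stand-in vector playing the role of $J^*$):

import numpy as np

num_states, p = 100, 3
Phi = np.vander(np.arange(num_states), p, increasing=True).astype(float)  # columns 1, x, x^2
J_star = np.log(1.0 + np.arange(num_states))        # stand-in for the true cost-to-go vector

r_star, *_ = np.linalg.lstsq(Phi, J_star, rcond=None)   # argmin_r ||J* - Phi r||_2
print(np.linalg.norm(J_star - Phi @ r_star))        # best error attainable for this Phi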

Another important question concerns the choice of the norm $\|\cdot\|$ used to measure approximation errors. Recall that, ultimately, we are not interested in finding an approximation to $J^*$, but rather in finding a good policy for the original decision problem. Therefore we would like to choose $\|\cdot\|$ to reflect the performance associated with approximations to $J^*$.

Figure 1: Value Function Approximation

2 Performance Bounds

We are interested in the following question. Let $u_J$ be the greedy policy associated with an arbitrary function $J$, and $J_{u_J}$ be the cost-to-go function associated with that policy. Can we relate the increase in cost $J_{u_J} - J^*$ to the approximation error $J - J^*$?

Recall the following theorem, from Lecture Note 3:

Theorem 1 Let $J$ be arbitrary and $u_J$ be a greedy policy with respect to $J$. Let $J_{u_J}$ be the cost-to-go function for policy $u_J$. Then

$$\|J_{u_J} - J^*\|_\infty \leq \frac{2\alpha}{1-\alpha}\|J - J^*\|_\infty.$$

In large-scale problems, it is unrealistic to expect that we could approximate $J^*$ uniformly well over all states (which is required by Theorem 1), or that we could find a policy $u_J$ that yields a cost-to-go uniformly close to $J^*$ over all states.

The following example illustrates the notion that having a large error $\|J - J^*\|_\infty$ does not necessarily lead to a bad policy. Moreover, minimizing $\|J - J^*\|_\infty$ may also lead to undesirable results.

Example 1 Consider a single queue with controlled service rate. Let $x$ denote the queue length, $B$ denote the buffer size, and

$$P_a(x, x+1) = \lambda, \quad \forall a,\ x = 0, 1, \dots, B-1,$$
$$P_a(x, x-1) = \mu(a), \quad \forall a,\ x = 1, 2, \dots, B,$$
$$P_a(B, B+1) = 0, \quad \forall a,$$
$$g_a(x) = x + q(a).$$

Suppose that we are interested in minimizing the average cost in this problem. Then we would like to find an approximation to the differential cost function $h^*$. Suppose that we consider only linear approximations: $\tilde{h}(x, r) = r_1 + r_2 x$. At the top of Figure 2, we represent $h^*$ and two possible approximations, $h_1$ and $h_2$; $h_1$ is chosen so as to minimize $\|h^* - \tilde{h}\|_\infty$.

Which one is a better approximation? Note that $h_1$ yields smaller approximation errors than $h_2$ over large states, but yields large approximation errors over the whole state space. In particular, as we increase the buffer size $B$, it should lead to worse and worse approximation errors in almost all states. $h_2$, on the other hand, has an interesting property, which we now describe. At the bottom of Figure 2, we represent the stationary state distribution $\pi(x)$ encountered under the optimal policy. Note that it decays exponentially with $x$, and large states are rarely visited. This suggests that, for practical purposes, $h_2$ may lead to a better policy, since it approximates $h^*$ better than $h_1$ over the set of states that are visited almost all of the time.

What the previous example hints at is that, in the case of a large state space, it may be important to consider error measures $J - J^*$ that differentiate between more and less important states. In the next section, we will introduce the notion of weighted norms and present performance bounds that take state relevance into account.

2.1 Performance Bounds with State-Relevance Weights

We first introduce the following weighted norms:

$$\|J^* - \tilde{J}(\cdot, r)\|_\infty = \max_{x \in S}|J^*(x) - \tilde{J}(x, r)|$$
$$\|J^* - \tilde{J}(\cdot, r)\|_{\infty,\psi} = \max_{x \in S}\psi(x)|J^*(x) - \tilde{J}(x, r)|$$
$$\|J^* - \tilde{J}(\cdot, r)\|_{1,c} = \sum_{x \in S} c(x)|J^*(x) - \tilde{J}(x, r)| \qquad (c > 0)$$


Figure 2: The differential cost $h^*$ and approximations $h_1$, $h_2$ (top); distribution of $x$ under the optimal policy (bottom)

Theorem 2 Let $J$ be such that $J \leq TJ$, and let $u_J$ be a greedy policy with respect to $J$. Then

$$\|J_{u_J} - J^*\|_{1,c} \leq \frac{1}{1-\alpha}\|J^* - J\|_{1,\mu_{c,J}},$$

where

$$\mu_{c,J}^T = (1-\alpha)\,c^T\sum_{t=0}^{\infty}\alpha^t P_{u_J}^t = (1-\alpha)\,c^T(I - \alpha P_{u_J})^{-1},$$

or equivalently

$$\mu_{c,J}(x) = (1-\alpha)\sum_{y} c(y)\sum_{t=0}^{\infty}\alpha^t P_{u_J}^t(y, x), \quad \forall x \in S.$$

Proof: We have $J \leq TJ \implies J \leq J^* \leq J_{u_J}$. Then

$$\|J_{u_J} - J^*\|_{1,c} = c^T(J_{u_J} - J^*)$$
$$\leq c^T(J_{u_J} - J)$$
$$= c^T\left((I - \alpha P_{u_J})^{-1} g_{u_J} - J\right)$$
$$= c^T(I - \alpha P_{u_J})^{-1}\left(g_{u_J} + \alpha P_{u_J} J - J\right)$$
$$= c^T(I - \alpha P_{u_J})^{-1}(TJ - J)$$
$$\leq c^T(I - \alpha P_{u_J})^{-1}(J^* - J)$$
$$= \frac{1}{1-\alpha}\|J^* - J\|_{1,\mu_{c,J}},$$

where the last inequality uses $TJ \leq TJ^* = J^*$ together with the componentwise nonnegativity of $c^T(I - \alpha P_{u_J})^{-1}$, and the final equality uses $J \leq J^*$.

Comparing Theorems 1 and 2, we have

$$\|J_{u_J} - J^*\|_\infty \leq \frac{2\alpha}{1-\alpha}\|J - J^*\|_\infty$$
$$\|J_{u_J} - J^*\|_{1,c} \leq \frac{1}{1-\alpha}\|J - J^*\|_{1,\mu_{c,J}}.$$

There are two important differences between these bounds:

1. The first bound relates performance to the worst possible approximation error over all states, whereas the second involves only a weighted average of errors. Therefore we expect the second bound to exhibit better scaling properties.

2. The first bound presents a worst-case guarantee on performance: the cost-to-go starting from any initial state $x$ cannot be greater than the stated bound. The second bound presents a guarantee on the expected cost-to-go, given that the initial state is distributed according to the distribution $c$. Although this is a weaker guarantee, it represents a more realistic requirement for large-scale systems, and raises the possibility of exploiting information about how important each different state is in the overall decision problem.

A numerical check of Theorem 2 on a small MDP is sketched below.
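In the sketch (ours, not from the notes), we build a random MDP, take $J = 0.9\,J^*$, which satisfies $J \leq TJ$ because costs are nonnegative, and verify the inequality of Theorem 2:

import numpy as np

rng = np.random.default_rng(0)
S_n, A_n, alpha = 5, 3, 0.9
P = rng.dirichlet(np.ones(S_n), size=(A_n, S_n))   # P[a, x, :] = P_a(x, .)
g = rng.uniform(0.0, 1.0, size=(A_n, S_n))         # nonnegative costs g_a(x)

def T(J):                                          # Bellman operator
    return (g + alpha * (P @ J)).min(axis=0)

J_star = np.zeros(S_n)
for _ in range(500):
    J_star = T(J_star)                             # value iteration to (near) J*

J = 0.9 * J_star                                   # J <= TJ since g >= 0
u = (g + alpha * (P @ J)).argmin(axis=0)           # greedy policy u_J
P_u, g_u = P[u, np.arange(S_n), :], g[u, np.arange(S_n)]
J_u = np.linalg.solve(np.eye(S_n) - alpha * P_u, g_u)

c = np.ones(S_n) / S_n                             # initial-state distribution
mu = (1 - alpha) * np.linalg.solve((np.eye(S_n) - alpha * P_u).T, c)
lhs = c @ (J_u - J_star)                           # ||J_uJ - J*||_{1,c}, since J_uJ >= J*
rhs = (mu @ (J_star - J)) / (1 - alpha)            # (1/(1-alpha)) ||J* - J||_{1,mu}
print(lhs <= rhs + 1e-9)                           # True, as Theorem 2 guarantees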

This step, computing greedy actions with respect to $\tilde{J}$, is typically done in real time, as the system is being controlled. If the set of available actions $A_x$ is relatively small and the summation $\sum_y P_a(x, y)\tilde{J}(y, r)$ can be computed relatively fast, then evaluating
