DPA5 1
Dynamic Programming Applications
Lecture 5
Preview
Last time:
Structural properties.
Today:
Optimal stopping & the one-step-lookahead (OLA) rule
(Secretary problem, Asset selling)
Next time:
Infinite horizon.
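The RM problem ($t$ periods to go, $x$ units of capacity left, request of class $i$)
$J_t(x,i) = \max\{J_{t-1}(x),\ R_i + J_{t-1}(x-1)\} = (R_i - OC_{t-1}(x))^+ + J_{t-1}(x)$
Optimal policy: accept class $i$ iff $R_i \ge OC_{t-1}(x) = J_{t-1}(x) - J_{t-1}(x-1)$
Results:
1. $J_t(x)$ increasing in $x$: by induction
2. $OC_t(x)$ decreasing in $x$: single crossing
3. $OC_t(x)$ increasing in $t$: by induction + 2. With $p_i$ the probability of a class-$i$ request,
$J_t(x) = \sum_i p_i (R_i - OC_{t-1}(x))^+ + J_{t-1}(x)$
$J_t(x-1) = \sum_i p_i (R_i - OC_{t-1}(x-1))^+ + J_{t-1}(x-1)$
Subtracting: $OC_t(x) - OC_{t-1}(x) = \sum_i p_i \big[(R_i - OC_{t-1}(x))^+ - (R_i - OC_{t-1}(x-1))^+\big] \ge 0$, since $OC_{t-1}(x) \le OC_{t-1}(x-1)$ by 2.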
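A minimal exact-arithmetic sketch of this recursion, useful for checking Results 1–3 on a small instance. It assumes a hypothetical arrival model that the slide does not spell out: at most one request per period, of class $i$ with probability $p_i$ (no request otherwise), and no salvage value, so $J_0(x) = 0$. Under that model the expected value is $J_t(x) = \sum_i p_i (R_i - OC_{t-1}(x))^+ + J_{t-1}(x)$. The helpers `rm_value` and `oc` and the fares and probabilities in the demo are illustrative only.

```python
from fractions import Fraction


def rm_value(R, p, C, T):
    """Return J with J[t][x] = optimal expected revenue with t periods to go
    and x units of capacity left (t = 0..T, x = 0..C).

    Assumed (hypothetical) model: at most one request per period, of class i
    with probability p[i]; leftover capacity is worthless, so J_0(x) = 0.
    """
    J = [[Fraction(0)] * (C + 1)]
    for _ in range(T):
        prev = J[-1]
        row = [Fraction(0)]  # J_t(0) = 0: nothing left to sell
        for x in range(1, C + 1):
            oc_prev = prev[x] - prev[x - 1]  # OC_{t-1}(x)
            # J_t(x) = sum_i p_i (R_i - OC_{t-1}(x))^+ + J_{t-1}(x)
            gain = sum(pi * max(Ri - oc_prev, 0) for Ri, pi in zip(R, p))
            row.append(gain + prev[x])
        J.append(row)
    return J


def oc(J, t, x):
    """Opportunity cost OC_t(x) = J_t(x) - J_t(x - 1), defined for x >= 1."""
    return J[t][x] - J[t][x - 1]


if __name__ == "__main__":
    R = [100, 60, 30]                                        # hypothetical fares
    p = [Fraction(1, 10), Fraction(3, 10), Fraction(4, 10)]  # hypothetical probabilities
    J = rm_value(R, p, C=5, T=10)
    for t in range(1, 4):
        print(t, [float(oc(J, t, x)) for x in range(1, 6)])
```

Exact `Fraction` arithmetic keeps floating-point round-off from producing spurious violations of the monotonicity properties.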
The RM problem - results
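• The optimal policy is characterized by threshold levels $b_i^t$ as follows:
Accept class $i$ at time $t$ iff $x \ge b_i^t$ (i.e., reject iff $0 \le x < b_i^t$),
where $b_i^t = \min\{x \mid OC_{t-1}(x) \le R_i\}$; this works because $OC_{t-1}(x)$ is decreasing in $x$.
• Moreover, $b_1^t \le \dots \le b_m^t$, where $R_1 \ge \dots \ge R_m$.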
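A sketch that reads the thresholds off the opportunity costs. The optimal policy accepts class $i$ iff $R_i \ge OC_{t-1}(x)$, and $OC_{t-1}(x)$ is decreasing in $x$. The accept region is therefore $\{x \ge b_i^t\}$, with $b_i^t$ the smallest $x \ge 1$ such that $OC_{t-1}(x) \le R_i$. Several details are assumptions made for illustration: the arrival model (at most one request per period, class $i$ with probability $p_i$, $J_0 = 0$), the helper names `value_table` and `thresholds`, and the convention that $b_i^t = C + 1$ means "never accept".

```python
from fractions import Fraction


def value_table(R, p, C, T):
    """J[t][x] via J_t(x) = sum_i p_i (R_i - OC_{t-1}(x))^+ + J_{t-1}(x).

    Hypothetical arrival model: at most one request per period, class i with
    probability p[i]; J_0(x) = 0 and J_t(0) = 0 (no capacity, no sale).
    """
    J = [[Fraction(0)] * (C + 1)]
    for _ in range(T):
        prev = J[-1]
        J.append([Fraction(0)] + [
            prev[x] + sum(pi * max(Ri - (prev[x] - prev[x - 1]), 0) for Ri, pi in zip(R, p))
            for x in range(1, C + 1)
        ])
    return J


def thresholds(J, R, t):
    """b_i^t = min{x >= 1 : OC_{t-1}(x) <= R_i}; the value C + 1 means 'never accept'."""
    C = len(J[0]) - 1
    oc_prev = {x: J[t - 1][x] - J[t - 1][x - 1] for x in range(1, C + 1)}
    return [next((x for x in range(1, C + 1) if oc_prev[x] <= Ri), C + 1) for Ri in R]


if __name__ == "__main__":
    R = [100, 60, 30]  # hypothetical fares, sorted so that R_1 >= R_2 >= R_3
    p = [Fraction(1, 10), Fraction(3, 10), Fraction(4, 10)]
    J = value_table(R, p, C=5, T=10)
    for t in range(1, 11):
        print(t, thresholds(J, R, t))  # accept class i at time t iff x >= b_i^t
```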
Optimal Stopping
At each stage a control is available that stops
the evolution of the system.
At stage k there are 2 options:
1. Stop process (get a certain reward)
2. Continue process, perhaps at a certain cost, and select one of the next available choices.
If the only alternative to stopping is a single continuation choice,
the policy is characterized by the set of stopping states.
Secretary Problems
• Cayley 1875
• Interview $N$ candidates for a job
• Must accept/reject at the end of each interview
• Objectives:
– Maximize expected ‘score’
– Maximize P(get the best)
(you risk hiring nobody!)
Archetype problem
Make an irrevocable choice from a fixed
number of opportunities whose values
are revealed sequentially.
• Asset selling
• Purchasing with a deadline
• Exercising stock options (in your next HW)
Max P(get best)
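• $W_t$ = history of relative ranks of the candidates seen by time $t$ (inclusive)
• $x_t = 1$ if the $t$-th candidate is the best seen so far; $x_t = 0$ otherwise
• Relevant state: $t$ and $x_t$
• Fact: the event $\{x_t = 1\}$ and $W_{t-1}$ are statistically independent: $P(x_t = 1 \mid W_{t-1}) = P(x_t = 1) = 1/t$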
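A brute-force check of the Fact for small $N$. It assumes that all $N!$ arrival orders are equally likely, which is the standard secretary-problem assumption although the slide does not state it. The code groups orders by the relative-rank pattern of the first $t-1$ candidates, which carries the same information as $W_{t-1}$, and computes $P(x_t = 1 \mid W_{t-1})$ for each pattern. Independence means every pattern gives the same value, $1/t$. The helpers `rank_pattern` and `cond_prob_best_so_far` are hypothetical names.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import permutations


def rank_pattern(prefix):
    """Relative ranks within a prefix (0 = best among the prefix); encodes W_{t-1}."""
    order = sorted(prefix, reverse=True)
    return tuple(order.index(v) for v in prefix)


def cond_prob_best_so_far(N, t):
    """Map each history W_{t-1} to P(x_t = 1 | W_{t-1}), enumerating all N! orders."""
    counts = defaultdict(lambda: [0, 0])  # history -> [#(x_t = 1), #orders]
    for perm in permutations(range(N)):  # larger number = better candidate
        hist = rank_pattern(perm[:t - 1])
        counts[hist][0] += perm[t - 1] > max(perm[:t - 1], default=-1)
        counts[hist][1] += 1
    return {h: Fraction(k, n) for h, (k, n) in counts.items()}


if __name__ == "__main__":
    for t in range(1, 6):
        print(t, set(cond_prob_best_so_far(5, t).values()))
```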
Objective
$J_t$ = P(under the optimal policy we select the best candidate, given that we have rejected the first $t-1$)
$J_t(0)$ = P(under the optimal policy we select the best candidate, given that we have seen $t$ so far and the last one was NOT the best so far)
$J_t(1)$ = …
P(best of $N$ | best of first $t$) = ?
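A brute-force sketch of the last question for small $N$. It assumes all $N!$ arrival orders are equally likely and uses a hypothetical helper name. Conditioning on "candidate $t$ is best of the first $t$" and counting how often that candidate is also best of all $N$ gives $t/N$. This is the win probability used for "accept" in the DP equation.

```python
from fractions import Fraction
from itertools import permutations


def prob_best_overall_given_best_so_far(N, t):
    """P(candidate t is best of all N | candidate t is best of the first t), by enumeration."""
    hits = total = 0
    for perm in permutations(range(N)):  # larger number = better; N! equally likely orders
        if perm[t - 1] == max(perm[:t]):  # x_t = 1
            total += 1
            hits += perm[t - 1] == N - 1   # candidate t is also the best of all N
    return Fraction(hits, total)


if __name__ == "__main__":
    N = 6
    print([prob_best_overall_given_best_so_far(N, t) for t in range(1, N + 1)])
```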
DP equation
$J_{N+1} = 0$
$J_t = \frac{t-1}{t} J_t(0) + \frac{1}{t} J_t(1)$
$J_t(0) = J_{t+1}$ (must continue)
$J_t(1) = \max(t/N,\ J_{t+1})$ (accept or continue)
Fact 1: $J_{t-1} \ge J_t$
Fact 2: $J_t$ is decreasing in $t$ and $t/N$ is increasing in $t$ ⇒ single crossing
Define: $t^* = \min\{t \mid J_{t+1} \le t/N\}$
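The DP equation can be run backward exactly for a given $N$. The sketch below uses Python's `fractions.Fraction`, and the function name `secretary_dp` is hypothetical. It computes $J_t$ starting from $J_{N+1} = 0$, then finds $t^* = \min\{t \mid J_{t+1} \le t/N\}$ directly from the definition.

```python
from fractions import Fraction


def secretary_dp(N):
    """Run the DP backward in exact arithmetic.

    Returns (J, t_star), where J[t] is J_t for t = 1..N+1 (J[0] is unused).
    """
    J = [Fraction(0)] * (N + 2)  # J[N + 1] = 0
    for t in range(N, 0, -1):
        J_t0 = J[t + 1]                       # J_t(0): not best so far, must continue
        J_t1 = max(Fraction(t, N), J[t + 1])  # J_t(1): accept (win w.p. t/N) or continue
        J[t] = Fraction(t - 1, t) * J_t0 + Fraction(1, t) * J_t1
    t_star = min(t for t in range(1, N + 1) if J[t + 1] <= Fraction(t, N))
    return J, t_star


if __name__ == "__main__":
    for N in (3, 10, 50):
        J, ts = secretary_dp(N)
        print(N, ts, float(J[1]))
```

For $N = 3$ this gives $t^* = 2$ and $J_1 = 1/2$: skip the first candidate, then take the first candidate who is the best so far.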
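Recursion
$J_t = \begin{cases} J_{t^*}, & t < t^* \\ \frac{t-1}{t} J_{t+1} + \frac{1}{N}, & t \ge t^* \end{cases}$
(for $t \ge t^*$, $J_t(1) = t/N$, so the last term is $\frac{1}{t} \cdot \frac{t}{N} = \frac{1}{N}$)
Dividing by $t-1$: $\frac{J_t}{t-1} = \frac{J_{t+1}}{t} + \frac{1}{N(t-1)}$
Therefore (after telescoping): $J_{t+1} = \frac{t}{N} \sum_{s=t}^{N-1} \frac{1}{s}$
By definition, $t^*$ is the smallest $t$ such that $J_{t^*+1} \le t^*/N$, so
$t^* = \min\{t \mid \sum_{s=t}^{N-1} \frac{1}{s} \le 1\} = ?$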
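Here is one way to spell out the telescoping step. It is a sketch that uses only the divided recursion, valid for indices $u \ge t^*$, and the boundary value $J_{N+1} = 0$. For $t \ge 1$ with $t + 1 \ge t^*$, sum the divided recursion over $u = t+1, \dots, N$:

$$
\frac{J_{t+1}}{t} = \frac{J_{t+1}}{t} - \frac{J_{N+1}}{N} = \sum_{u=t+1}^{N}\Big(\frac{J_u}{u-1} - \frac{J_{u+1}}{u}\Big) = \sum_{u=t+1}^{N}\frac{1}{N(u-1)} = \frac{1}{N}\sum_{s=t}^{N-1}\frac{1}{s}
$$

Multiplying by $t$ gives $J_{t+1} = \frac{t}{N}\sum_{s=t}^{N-1}\frac{1}{s}$. The stopping condition $J_{t^*+1} \le t^*/N$ then becomes $\sum_{s=t^*}^{N-1}\frac{1}{s} \le 1$.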
Policy
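• For large $N$: $\sum_{s=t_0}^{N-1} \frac{1}{s} \approx \ln(N/t_0)$
• Therefore (setting $\ln(N/t_0) = 1$) $t_0 \approx N/e$
• Policy: interview the first $N/e$ candidates and reject them, then select the first candidate who is the best seen so far.
• P(success) $= J(t_0) \approx t_0/N \approx 1/e \approx 0.3679$
• Empirical validation?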
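The following is a quick simulation sketch aimed at the "Empirical validation?" question. `t_star` applies the harmonic-sum criterion $t^* = \min\{t \mid \sum_{s=t}^{N-1} 1/s \le 1\}$. `simulate` plays the policy (reject the first $t^*-1$ candidates, then take the first one better than all previous) on uniformly random orders. Both function names and the seeds are illustrative choices, and the random-order assumption is the standard one for this problem.

```python
import math
import random


def t_star(N):
    """t* = min{t | sum_{s=t}^{N-1} 1/s <= 1}, scanning t downward from N."""
    tail, best = 0.0, N  # tail holds sum_{s=t}^{N-1} 1/s for the current t
    for t in range(N, 0, -1):
        if t < N:
            tail += 1.0 / t
        if tail > 1.0:
            break  # the tail only grows as t decreases, so no smaller t qualifies
        best = t
    return best


def simulate(N, trials, rng):
    """Monte Carlo success rate of: reject the first t*-1, then take the first best-so-far."""
    ts = t_star(N)
    wins = 0
    for _ in range(trials):
        quality = list(range(N))  # N - 1 is the best candidate
        rng.shuffle(quality)      # uniformly random arrival order
        benchmark = max(quality[:ts - 1], default=-1)
        chosen = next((q for q in quality[ts - 1:] if q > benchmark), None)
        wins += chosen == N - 1
    return wins / trials


if __name__ == "__main__":
    for N in (10, 100, 1000):
        print(N, t_star(N), t_star(N) / N, 1 / math.e)
    print(simulate(100, 10_000, random.Random(0)))
```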
The Last Shall be First
“…The last person interviewed for a job gets it 55.8% of the time according to Runzheimer Canada, Inc. Early applicants are hired only 17.6% of the time; the management consulting firm suggests that job-seekers who find they are among the first to be grilled ‘tactfully ask to be rescheduled for a later date’. Mondays are also poor days to be interviewed and any day just before quitting time is also bad.”
(The Globe and Mail, Sept. 12, 1990, pg. A22)
Asset selling
• Like maximizing interview score, but with discounting/investment
• Offers $w_0, w_1, \dots, w_{N-1}$ are i.i.d. with a fixed, known distribution (if not known: inference, learning)
• Stage $k$ choices:
1. Accept, and invest $w_k$ dollars at rate $r$
2. Reject, and wait until stage $k+1$
• Objective: maximize revenue at the end of period $N$
Formulation
State:
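• $x_k \ne T$: asset has not been sold; current offer is $x_k$
• $x_k = T$: asset has been sold
Decision:
• $u_k = u$: sell; $u_k = u'$: don't sell
Plant equation:
$x_{k+1} = \begin{cases} T, & \text{if } x_k = T, \text{ or if } x_k \ne T \text{ and } u_k = u \text{ (sell)} \\ w_k, & \text{otherwise} \end{cases}$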
Costs
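$g_N(x_N) = \begin{cases} x_N, & \text{if } x_N \ne T \\ 0, & \text{otherwise} \end{cases}$
$g_k(x_k, u_k) = \begin{cases} (1+r)^{N-k} x_k, & \text{if } x_k \ne T \text{ and } u_k = u \\ 0, & \text{otherwise} \end{cases}$
$J_N(x_N) = \begin{cases} x_N, & \text{if } x_N \ne T \\ 0, & \text{otherwise} \end{cases}$
$J_k(x_k) = \begin{cases} \max\big((1+r)^{N-k} x_k,\ E_w\{J_{k+1}(w_k)\}\big), & \text{if } x_k \ne T \\ 0, & \text{otherwise} \end{cases}$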
Policy
• Accept offer $x_k$ if $x_k > a_k$
• Reject offer $x_k$ if $x_k < a_k$
• Indifferent if $x_k = a_k$
The optimal policy is determined by the sequence $a_k$:
• $a_k = E_w\{J_{k+1}(w_k)\} / (1+r)^{N-k}$
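The threshold formula can be unrolled into a recursion that does not need the value functions. A short derivation follows. It writes $w$ for a generic offer (the offers are i.i.d.) and uses only the definitions of $J_N$, $J_k$ and $a_k$. The middle line holds for $k \le N-2$, because $E_w\{J_{k+2}(w)\} = (1+r)^{N-k-1} a_{k+1}$ by the definition of $a_{k+1}$.

$$
\begin{aligned}
a_{N-1} &= \frac{E_w\{J_N(w)\}}{1+r} = \frac{E_w\{w\}}{1+r},\\
J_{k+1}(w) &= \max\big((1+r)^{N-k-1} w,\ E_w\{J_{k+2}(w)\}\big) = (1+r)^{N-k-1}\max(w,\ a_{k+1}),\\
a_k &= \frac{E_w\{J_{k+1}(w)\}}{(1+r)^{N-k}} = \frac{E_w\{\max(w,\ a_{k+1})\}}{1+r}.
\end{aligned}
$$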
Structural properties
Fact: $a_k \ge a_{k+1}$ for all $k$
Intuition:
if an offer is good enough to be accepted at time $k$, it should also be acceptable at time $k+1$.
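A small numerical sketch of the threshold sequence, assuming a hypothetical three-point offer distribution. It uses the recursion $a_{N-1} = E[w]/(1+r)$ and $a_k = E[\max(w, a_{k+1})]/(1+r)$. This recursion follows from $a_k = E_w\{J_{k+1}(w_k)\}/(1+r)^{N-k}$ together with the cost definitions, because $J_{k+1}(w) = (1+r)^{N-k-1}\max(w, a_{k+1})$. The function name `asset_thresholds` is illustrative. The sketch lets one check the Fact $a_k \ge a_{k+1}$ numerically.

```python
from fractions import Fraction


def asset_thresholds(offers, probs, r, N):
    """Acceptance thresholds a_0..a_{N-1} for the asset-selling problem.

    Uses a_{N-1} = E[w] / (1 + r) and a_k = E[max(w, a_{k+1})] / (1 + r).
    offers/probs describe a hypothetical discrete offer distribution (i.i.d. over stages).
    """
    disc = 1 + r
    a = [Fraction(0)] * N
    a[N - 1] = sum(p * w for w, p in zip(offers, probs)) / disc
    for k in range(N - 2, -1, -1):
        a[k] = sum(p * max(w, a[k + 1]) for w, p in zip(offers, probs)) / disc
    return a


if __name__ == "__main__":
    offers = [80, 100, 120]     # hypothetical offer values
    probs = [Fraction(1, 3)] * 3
    a = asset_thresholds(offers, probs, r=Fraction(1, 20), N=8)
    print([float(x) for x in a])  # non-increasing in k: accept more readily near the deadline
```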
General stopping & OLA
• Stopping is mandatory at or before stage $N$
• Stationary: the state, control, and disturbance spaces, the disturbance distribution, and the cost per stage do not change over time
• Extra action: go to the termination state at cost $t(x_k)$
DP algorithm:
$J_N(x_N) = t(x_N)$
$J_k(x_k) = \min\Big[t(x_k),\ \min_{u_k \in U(x_k)} E_w\{g(x_k,u_k,w_k) + J_{k+1}(f(x_k,u_k,w_k))\}\Big]$
Stopping set
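It is optimal to stop at time $k$ for states $x$ in the set:
$T_k = \{x \mid t(x) \le \min_u E\{g(x,u,w) + J_{k+1}(f(x,u,w))\}\}$
Fact: $J_{N-1}(x) \le J_N(x)$, so $J_{k-1}(x) \le J_k(x)$ for all $k, x$.
Cor.: $T_0 \subseteq \dots \subseteq T_k \subseteq T_{k+1} \subseteq \dots \subseteq T_{N-1}$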
Question: how to guarantee equality?
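The Fact and the Corollary both follow from a one-line induction. The sketch below uses the DP algorithm and the definition of $T_k$, with $g$ and $f$ abbreviating $g(x,u,w)$ and $f(x,u,w)$. The second line assumes $J_k \le J_{k+1}$ pointwise, and the third line holds for $k \le N-2$.

$$
\begin{aligned}
J_{N-1}(x) &= \min\Big[t(x),\ \min_u E\{g + J_N(f)\}\Big] \le t(x) = J_N(x),\\
J_{k-1}(x) &= \min\Big[t(x),\ \min_u E\{g + J_k(f)\}\Big] \le \min\Big[t(x),\ \min_u E\{g + J_{k+1}(f)\}\Big] = J_k(x),\\
x \in T_k &\Rightarrow t(x) \le \min_u E\{g + J_{k+1}(f)\} \le \min_u E\{g + J_{k+2}(f)\} \Rightarrow x \in T_{k+1}.
\end{aligned}
$$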
Absorbance
Condition: $T_{N-1}$ is absorbing if, whenever $x \in T_{N-1}$ and termination is not selected, the next state is also in $T_{N-1}$.
That is, $f(x,u,w) \in T_{N-1}$ for all $x \in T_{N-1}$, $u \in U(x)$, and $w$.
Intuition: if you reach a state that’s optimal to stop at, but you don’t stop, then you move to a state that’s also optimal to stop at.
Theorem: If $T_{N-1}$ is absorbing, then $T_k = T_{N-1}$ for all $k$.
OLA policy: stop iff $x \in T_{N-1}$ (the one-step stopping set); by the theorem, it is optimal when $T_{N-1}$ is absorbing.
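A sketch of the general stopping DP on a finite state space. It computes every $T_k$ and tests whether $T_{N-1}$ is absorbing, so the Corollary and the Theorem can be checked on a concrete case. Several elements are assumptions for illustration: the data layout (transition lists per control, expected stage costs), the names `stopping_sets` and `is_absorbing`, and the example itself. In the example there are states $0..6$ and a single continuation control that moves $x \to \min(x+1, 6)$ at cost 1, with stopping cost $t(x) = (x-3)^2$.

```python
def stopping_sets(term, trans, stage_cost, N):
    """Backward DP J_N = t, J_k = min(t, min_u E[g + J_{k+1}(f)]) on states 0..n-1.

    term[x]          termination cost t(x)
    trans[u][x]      list of (next_state, probability) pairs when continuing with control u
    stage_cost[u][x] expected stage cost E_w{g(x, u, w)}
    Returns [T_0, ..., T_{N-1}] as frozensets.
    """
    n = len(term)
    J = list(term)  # J_N = t
    T = [None] * N
    for k in range(N - 1, -1, -1):
        # best expected cost of continuing one more stage, then acting optimally
        cont = [min(stage_cost[u][x] + sum(p * J[y] for y, p in trans[u][x])
                    for u in range(len(trans)))
                for x in range(n)]
        T[k] = frozenset(x for x in range(n) if term[x] <= cont[x])
        J = [min(term[x], cont[x]) for x in range(n)]  # J_k
    return T


def is_absorbing(S, trans):
    """True if continuing from any x in S, under any control, can only reach states in S."""
    return all(y in S
               for u in range(len(trans)) for x in S
               for y, p in trans[u][x] if p > 0)


if __name__ == "__main__":
    # Hypothetical example: drift x -> min(x + 1, 6) at cost 1, stopping cost (x - 3)^2.
    M = 6
    term = [(x - 3) ** 2 for x in range(M + 1)]
    trans = [[[(min(x + 1, M), 1)] for x in range(M + 1)]]
    stage_cost = [[1] * (M + 1)]
    T = stopping_sets(term, trans, stage_cost, N=5)
    print(sorted(T[-1]), is_absorbing(T[-1], trans))
    print(all(Tk == T[-1] for Tk in T))  # the OLA rule is optimal at every stage here
```

In this example the one-step stopping set is $\{2, \dots, 6\}$. The upward drift never leaves that set, so it is absorbing, and every $T_k$ coincides with it, as the Theorem predicts.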