E3 Finish-up; Intro to Clustering & Unsupervised Learning
Kearns & Singh, “Near-Optimal Reinforcement Learning in Polynomial Time.” Machine Learning 49, 2002.
Class Text: Sec 3.9; 5.9; 5.10; 6.6
Administrivia
•Office hours next week (Nov 24)
 •Truncated on the early end
 •From “whenever I get in” ’til noon
 •Or by appointment
Oral presentations: Tips #1
•Not too much text on any slide
 •No paragraphs!!!
 •Not even full sentences (usually)
•Be sure text is readable
 •Fonts big enough
 •Beware of serifed fonts
Oral presentations: Tips #1
This is a deliberately bad example of presentation style. Note that the text is very dense, there’s a lot of it, the font is way too small, and the font is somewhat difficult to read (the serifs are very narrow and the kerning is too tight, so the letters tend to smear together when viewed from a distance). It’s essentially impossible for your audience to follow this text while you’re speaking. (Except for a few speedreaders who happen to be sitting close to the screen.) In general, don’t expect your audience to read the text on your presentation -- it’s mostly there as a reminder to keep them on track while you’re talking and remind them what you’re talking about when they fall asleep. Note that these rules of thumb also apply well to posters. Unless you want your poster to completely stand alone (no human there to describe it), it’s best to avoid large blocks of dense text.
Oral presentations: Tips #1
•Also, don’t switch slides too quickly...
Exercise
•Given: MDP M = ⟨S, A, T, R⟩; discount factor γ; max absolute reward Rmax = max_s{|R(s)|}
•Find: a planning horizon H_γ^ε such that, if the agent plans only about events that take place within H_γ^ε steps, then the agent is guaranteed to miss no more than ε in value
•I.e., for any trajectory of length H = H_γ^ε, the value difference between the truncated return V(h_γ^H) and the infinite-horizon return V(h_γ^∞) is less than ε:

 |V(h_γ^∞) − V(h_γ^H)| ≤ Σ_{t=H}^∞ γ^t Rmax = γ^H Rmax / (1 − γ) < ε
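One way to compute such a horizon: solve γ^H · Rmax / (1 − γ) < ε for H. A minimal sketch (the function name and interface are my own):

```python
import math

def epsilon_horizon(gamma, r_max, eps):
    """Smallest H with tail bound gamma**H * r_max / (1 - gamma) < eps.

    Taking logs (log(gamma) < 0 flips the inequality):
      H > log(eps * (1 - gamma) / r_max) / log(gamma)
    """
    ratio = eps * (1 - gamma) / r_max
    if ratio >= 1.0:
        return 0  # even the entire discounted tail is already below eps
    return math.ceil(math.log(ratio) / math.log(gamma))
```

For example, with γ = 0.9, Rmax = 1, ε = 0.1, the horizon comes out to 44 steps: beyond that point the discounted tail can contribute less than 0.1 to any return.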
E3
•Efficient Explore & Exploit algorithm
•Kearns & Singh, Machine Learning 49, 2002
•Explicitly keeps a T matrix and an R table
•Plan (policy iteration) w/ current T & R → current π
•Every state/action entry in T and R:
 •Can be marked known or unknown
 •Has a #visits counter, nv(s,a)
•After every ⟨s,a,r,s’⟩ tuple, update T & R (running average)
•When nv(s,a) > NVthresh, mark cell as known & re-plan
•When all states known, done learning & have π*
The E3 algorithm
Algorithm: E3_learn_sketch  // only an overview
Inputs: S, A, γ (0 ≤ γ < 1), NVthresh, Rmax, Varmax
Outputs: T, R, π*
Initialization:
  R(s) = Rmax                  // for all s
  T(s,a,s’) = 1/|S|            // for all s, a, s’
  known(s,a) = 0; nv(s,a) = 0  // for all s, a
  π = policy_iter(S, A, T, R)
The E3 algorithm
Algorithm: E3_learn_sketch  // con’t
Repeat {
  s = get_current_world_state()
  a = π(s)
  (r, s’) = act_in_world(a)
  T(s,a,s’) = (1 + T(s,a,s’) * nv(s,a)) / (nv(s,a) + 1)
  nv(s,a)++
  if (nv(s,a) > NVthresh) {
    known(s,a) = 1
    π = policy_iter(S, A, T, R)
  }
} Until (all (s,a) known)
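The sketch above can be fleshed out into runnable code. Everything below is an illustration, not Kearns & Singh’s actual construction: the planner is plain value iteration rather than policy iteration, exploration is reduced to “balanced wandering” (in a state with unknown entries, take the least-tried action), the reward average is simplified, and toy_step is a made-up 2-state world.

```python
def plan(T, R, gamma, n_states, n_actions, iters=200):
    """Value iteration on the current model; returns a greedy policy."""
    V = [0.0] * n_states
    for _ in range(iters):
        V = [max(sum(T[s][a][s2] * (R[s] + gamma * V[s2]) for s2 in range(n_states))
                 for a in range(n_actions))
             for s in range(n_states)]
    return [max(range(n_actions),
                key=lambda a: sum(T[s][a][s2] * (R[s] + gamma * V[s2])
                                  for s2 in range(n_states)))
            for s in range(n_states)]

def e3_sketch(step, n_states, n_actions, gamma=0.9, nv_thresh=30, r_max=1.0,
              max_steps=10_000):
    # Optimistic initialization, as on the slide: all rewards Rmax,
    # uniform transitions.
    R = [r_max] * n_states
    T = [[[1.0 / n_states] * n_states for _ in range(n_actions)]
         for _ in range(n_states)]
    nv = [[0] * n_actions for _ in range(n_states)]
    pi = plan(T, R, gamma, n_states, n_actions)
    s = 0
    for _ in range(max_steps):
        if all(nv[x][a] > nv_thresh
               for x in range(n_states) for a in range(n_actions)):
            break  # every (s,a) is known: done learning
        unknown = [a for a in range(n_actions) if nv[s][a] <= nv_thresh]
        # "Balanced wandering" in states with unknown entries;
        # otherwise follow the current plan.
        a = min(unknown, key=lambda a: nv[s][a]) if unknown else pi[s]
        r, s2 = step(s, a)
        n = nv[s][a]
        for t in range(n_states):  # running average of transition counts
            T[s][a][t] = ((1.0 if t == s2 else 0.0) + T[s][a][t] * n) / (n + 1)
        R[s] = (r + R[s] * n) / (n + 1)  # running-average reward (simplified)
        nv[s][a] += 1
        if nv[s][a] == nv_thresh + 1:  # (s,a) just became known: re-plan
            pi = plan(T, R, gamma, n_states, n_actions)
        s = s2
    return T, R, pi

def toy_step(s, a):
    """Hypothetical 2-state world: action 0 stays, action 1 switches;
    reward 1 for acting in state 1, else 0."""
    return (1.0 if s == 1 else 0.0), (s if a == 0 else 1 - s)
```

On the toy world, the learner visits every (state, action) pair past the threshold, recovers the deterministic transitions exactly, and the final plan is “switch to state 1, then stay.”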
Choosing NVthresh
•Critical parameter in E3: NVthresh
•Affects how much experience the agent needs before it is confident in saying a T(s,a,s’) value is known
•How to pick this parameter?
•Want to ensure that the current estimate, T̂(s,a,s’), is close to the true T(s,a,s’) with high probability
•How to do that?
5 minutes of math...
•General problem: given a binomially distributed random variable, X, what is the probability that it deviates very far from its true mean?
•The r.v. could be:
 •the sum of many coin flips
 •the average of many samples from a transition function
5 minutes of math...
•Theorem (Chernoff bound): Given a binomially distributed random variable, X, generated from a sequence of n events, the probability that X/n is very far from its true mean, p, is bounded by:

 Pr[ |X/n − p| > δ ] ≤ 2 e^(−2nδ²)
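An empirical sanity check of the bound, using the Chernoff–Hoeffding form 2·e^(−2nδ²) stated above (the Monte-Carlo setup here is my own illustration):

```python
import math
import random

def deviation_prob(n, p, delta, trials=20_000, seed=0):
    """Monte-Carlo estimate of Pr[|X/n - p| > delta] for X ~ Binomial(n, p)."""
    rng = random.Random(seed)
    bad = sum(
        abs(sum(rng.random() < p for _ in range(n)) / n - p) > delta
        for _ in range(trials)
    )
    return bad / trials

n, p, delta = 100, 0.5, 0.1
empirical = deviation_prob(n, p, delta)      # true probability is ~0.035 here
chernoff = 2 * math.exp(-2 * n * delta ** 2)  # bound: 2*e^(-2) ~ 0.271
```

The bound is loose (0.271 vs. roughly 0.035 for 100 fair coin flips and δ = 0.1), but it decays exponentially in n, which is what the argument on the next slide needs.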
5 minutes of math...
•Consequence of the Chernoff bound (informal):
•With a bit of fiddling, you can show that the probability that the estimated mean of a binomially distributed random variable falls very far from the true mean falls off exponentially quickly with the size of the sample set
Chernoff bound & NVthresh
•Using the Chernoff bound, can show that a transition can be considered “known” when:

 nv(s,a) ≥ ln(2N/ε) / (2δ²)

•Where:
 •N = number of states in M, |S|
 •δ = amount you’re willing to be wrong by
 •ε = probability that you got it wrong by more than δ
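A sketch of the resulting threshold, assuming the Chernoff bound is applied with a union bound over the N states, i.e. requiring 2·N·e^(−2·nv·δ²) ≤ ε (my reading; the exact constants may differ):

```python
import math

def nv_thresh(n_states, delta, eps):
    """Visits after which each T(s,a,.) estimate is within delta of the
    truth except with probability eps, by solving
    2 * N * exp(-2 * nv * delta**2) <= eps for nv."""
    return math.ceil(math.log(2 * n_states / eps) / (2 * delta ** 2))
```

For 10 states, δ = 0.1, and ε = 0.05 this gives 300 visits per (s, a) pair, which illustrates why E3’s sample bounds are polynomial but not small.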
Poly time RL
•A further consequence (once you layer on a bunch of math & assumptions):
•Can learn the complete model in a bounded number of steps
•Notes:
 •Polynomial in N, 1/ε, and 1/δ
 •BIG polynomial, nasty constants
Take-home messages
•Model-based RL is a different way to think of the goals of RL
 •Get a better understanding of the world
 •(Sometimes) provides stronger theoretical leverage
•There exists a provably poly-time algorithm for RL
 •Nasty polynomial, though
 •Doesn’t work well in practice
 •Still, a nice explanation of why some forms of RL work
Unsupervised Learning:
Clustering & Model Fitting
The unsupervised problem
•Given:
 •A set of data points
•Find:
 •A good description of the data
Typical tasks
•Given: many measurements of flowers
 •What different breeds are there?
•Given: many microarray measurements
 •Which genes act the same?
•Given: a bunch of documents
 •What topics are there? How are they related? Which are “good” essays and which are “bad”?
•Given: long sequences of GUI events
 •What tasks was the user working on? Are they “flat” or hierarchical?
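As one concrete instance of “find a good description of the data”: a minimal k-means clustering sketch (k-means is not yet introduced on these slides; the function, its parameters, and the toy data are illustrative):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means: alternate (1) assigning each point to its nearest
    centroid and (2) moving each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
                     else centroids[j] for j, cl in enumerate(clusters)]
    return centroids, clusters

# Two well-separated blobs of four points each:
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]
centers, groups = kmeans(pts, 2)
```

On the toy data the algorithm recovers the two blobs, with centroids at the blob means; real uses (flower breeds, gene expression, document topics) only change the feature vectors and the number of clusters.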