E3 Finish-up; Intro to Clustering & Unsupervised Learning
Kearns & Singh, “Near-Optimal Reinforcement Learning in Polynomial Time.” Machine Learning 49, 2002.
Class Text: Sec 3.9; 5.9; 5.10; 6.6
Administrivia
•Office hours next week (Nov 24)
 •Truncated on the early end
 •From “whenever I get in” ’til noon
 •Or by appointment
Oral presentations: Tips #1
•Not too much text on any slide
 •No paragraphs!!!
 •Not even full sentences (usually)
•Be sure text is readable
 •Fonts big enough
 •Beware of serifed fonts
Oral presentations: Tips #1
This is a deliberately bad example of presentation style. Note that the text is very dense, there’s a lot of it, the font is way too small, and the font is somewhat difficult to read (the serifs are very narrow and the kerning is too tight, so the letters tend to smear together when viewed from a distance). It’s essentially impossible for your audience to follow this text while you’re speaking. (Except for a few speedreaders who happen to be sitting close to the screen.) In general, don’t expect your audience to read the text on your presentation -- it’s mostly there as a reminder to keep them on track while you’re talking and remind them what you’re talking about when they fall asleep. Note that these rules of thumb also apply well to posters. Unless you want your poster to completely stand alone (no human there to describe it), it’s best to avoid large blocks of dense text.
Oral presentations: Tips #1
•Also, don’t switch slides too quickly...
Exercise
•Given: MDP M = ⟨S, A, T, R⟩; discount factor γ; max absolute reward Rmax = max_s{|R(s)|}
•Find: a planning horizon H_γ^ε such that, if the agent plans only about events that take place within H_γ^ε steps, then the agent is guaranteed to miss no more than ε in value
•I.e., for any trajectory of length H = H_γ^ε, the value difference between the truncated return V(h_γ^H) and the infinite-horizon return V(h_γ^∞) is less than ε:

 |V(h_γ^∞) − V(h_γ^H)| ≤ Σ_{t=H}^∞ γ^t Rmax = γ^H Rmax / (1 − γ) < ε
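One way to compute such a horizon: solve γ^H · Rmax / (1 − γ) < ε for H. A minimal sketch (the function name and interface are my own):

```python
import math

def epsilon_horizon(gamma, r_max, eps):
    """Smallest H with tail bound gamma**H * r_max / (1 - gamma) < eps.

    Taking logs (log(gamma) < 0 flips the inequality):
      H > log(eps * (1 - gamma) / r_max) / log(gamma)
    """
    ratio = eps * (1 - gamma) / r_max
    if ratio >= 1.0:
        return 0  # even the entire discounted tail is already below eps
    return math.ceil(math.log(ratio) / math.log(gamma))
```

For example, with γ = 0.9, Rmax = 1, ε = 0.1, the horizon comes out to 44 steps: beyond that point the discounted tail can contribute less than 0.1 to any return.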
E3
•Efficient Explore & Exploit algorithm
•Kearns & Singh, Machine Learning 49, 2002
•Explicitly keeps a T matrix and an R table
•Plan (policy iteration) w/ current T & R → current π
•Every state/action entry in T and R:
 •Can be marked known or unknown
 •Has a #visits counter, nv(s,a)
•After every ⟨s,a,r,s’⟩ tuple, update T & R (running average)
•When nv(s,a) > NVthresh, mark cell as known & re-plan
•When all states known, done learning & have π*
The E3 algorithm
Algorithm: E3_learn_sketch  // only an overview
Inputs: S, A, γ (0 ≤ γ < 1), NVthresh, Rmax, Varmax
Outputs: T, R, π*
Initialization:
  R(s) = Rmax                  // for all s
  T(s,a,s’) = 1/|S|            // for all s, a, s’
  known(s,a) = 0; nv(s,a) = 0  // for all s, a
  π = policy_iter(S, A, T, R)
The E3 algorithm
Algorithm: E3_learn_sketch  // con’t
Repeat {
  s = get_current_world_state()
  a = π(s)
  (r, s’) = act_in_world(a)
  T(s,a,s’) = (1 + T(s,a,s’) * nv(s,a)) / (nv(s,a) + 1)
  nv(s,a)++
  if (nv(s,a) > NVthresh) {
    known(s,a) = 1
    π = policy_iter(S, A, T, R)
  }
} Until (all (s,a) known)
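The sketch above can be fleshed out into runnable code. Everything below is an illustration, not Kearns & Singh’s actual construction: the planner is plain value iteration rather than policy iteration, exploration is reduced to “balanced wandering” (in a state with unknown entries, take the least-tried action), the reward average is simplified, and toy_step is a made-up 2-state world.

```python
def plan(T, R, gamma, n_states, n_actions, iters=200):
    """Value iteration on the current model; returns a greedy policy."""
    V = [0.0] * n_states
    for _ in range(iters):
        V = [max(sum(T[s][a][s2] * (R[s] + gamma * V[s2]) for s2 in range(n_states))
                 for a in range(n_actions))
             for s in range(n_states)]
    return [max(range(n_actions),
                key=lambda a: sum(T[s][a][s2] * (R[s] + gamma * V[s2])
                                  for s2 in range(n_states)))
            for s in range(n_states)]

def e3_sketch(step, n_states, n_actions, gamma=0.9, nv_thresh=30, r_max=1.0,
              max_steps=10_000):
    # Optimistic initialization, as on the slide: all rewards Rmax,
    # uniform transitions.
    R = [r_max] * n_states
    T = [[[1.0 / n_states] * n_states for _ in range(n_actions)]
         for _ in range(n_states)]
    nv = [[0] * n_actions for _ in range(n_states)]
    pi = plan(T, R, gamma, n_states, n_actions)
    s = 0
    for _ in range(max_steps):
        if all(nv[x][a] > nv_thresh
               for x in range(n_states) for a in range(n_actions)):
            break  # every (s,a) is known: done learning
        unknown = [a for a in range(n_actions) if nv[s][a] <= nv_thresh]
        # "Balanced wandering" in states with unknown entries;
        # otherwise follow the current plan.
        a = min(unknown, key=lambda a: nv[s][a]) if unknown else pi[s]
        r, s2 = step(s, a)
        n = nv[s][a]
        for t in range(n_states):  # running average of transition counts
            T[s][a][t] = ((1.0 if t == s2 else 0.0) + T[s][a][t] * n) / (n + 1)
        R[s] = (r + R[s] * n) / (n + 1)  # running-average reward (simplified)
        nv[s][a] += 1
        if nv[s][a] == nv_thresh + 1:  # (s,a) just became known: re-plan
            pi = plan(T, R, gamma, n_states, n_actions)
        s = s2
    return T, R, pi

def toy_step(s, a):
    """Hypothetical 2-state world: action 0 stays, action 1 switches;
    reward 1 for acting in state 1, else 0."""
    return (1.0 if s == 1 else 0.0), (s if a == 0 else 1 - s)
```

On the toy world, the learner visits every (state, action) pair past the threshold, recovers the deterministic transitions exactly, and the final plan is “switch to state 1, then stay.”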
Choosing NVthresh
•Critical parameter in E3: NVthresh
•Affects how much experience the agent needs before it is confident in saying a T(s,a,s’) value is known
•How to pick this parameter?
•Want to ensure that the current estimate, T̂(s,a,s’), is close to the true T(s,a,s’) with high probability
•How to do that?
5 minutes of math...
•General problem: given a binomially distributed random variable, X, what is the probability that it deviates very far from its true mean?
•The r.v. could be:
 •the sum of many coin flips
 •the average of many samples from a transition function
5 minutes of math...
•Theorem (Chernoff bound): Given a binomially distributed random variable, X, generated from a sequence of n events, the probability that X/n is very far from its true mean, p, is bounded by:

 Pr[ |X/n − p| > δ ] ≤ 2 e^(−2nδ²)
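An empirical sanity check of the bound, using the Chernoff–Hoeffding form 2·e^(−2nδ²) stated above (the Monte-Carlo setup here is my own illustration):

```python
import math
import random

def deviation_prob(n, p, delta, trials=20_000, seed=0):
    """Monte-Carlo estimate of Pr[|X/n - p| > delta] for X ~ Binomial(n, p)."""
    rng = random.Random(seed)
    bad = sum(
        abs(sum(rng.random() < p for _ in range(n)) / n - p) > delta
        for _ in range(trials)
    )
    return bad / trials

n, p, delta = 100, 0.5, 0.1
empirical = deviation_prob(n, p, delta)      # true probability is ~0.035 here
chernoff = 2 * math.exp(-2 * n * delta ** 2)  # bound: 2*e^(-2) ~ 0.271
```

The bound is loose (0.271 vs. roughly 0.035 for 100 fair coin flips and δ = 0.1), but it decays exponentially in n, which is what the argument on the next slide needs.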
5 minutes of math...
•Consequence of the Chernoff bound (informal):
•With a bit of fiddling, you can show that the probability that the estimated mean of a binomially distributed random variable falls very far from the true mean falls off exponentially quickly with the size of the sample set
Chernoff bound & NVthresh
•Using the Chernoff bound, can show that a transition can be considered “known” when:

 nv(s,a) ≥ ln(2N/ε) / (2δ²)

•Where:
 •N = number of states in M, |S|
 •δ = amount you’re willing to be wrong by
 •ε = probability that you got it wrong by more than δ
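A sketch of the resulting threshold, assuming the Chernoff bound is applied with a union bound over the N states, i.e. requiring 2·N·e^(−2·nv·δ²) ≤ ε (my reading; the exact constants may differ):

```python
import math

def nv_thresh(n_states, delta, eps):
    """Visits after which each T(s,a,.) estimate is within delta of the
    truth except with probability eps, by solving
    2 * N * exp(-2 * nv * delta**2) <= eps for nv."""
    return math.ceil(math.log(2 * n_states / eps) / (2 * delta ** 2))
```

For 10 states, δ = 0.1, and ε = 0.05 this gives 300 visits per (s, a) pair, which illustrates why E3’s sample bounds are polynomial but not small.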
Poly time RL
•A further consequence (once you layer on a bunch of math & assumptions):
•Can learn the complete model in a bounded number of steps
•Notes:
 •Polynomial in N, 1/ε, and 1/δ
 •BIG polynomial, nasty constants
Take-home messages
•Model-based RL is a different way to think of the goals of RL
 •Get a better understanding of the world
 •(Sometimes) provides stronger theoretical leverage
•There exists a provably poly-time algorithm for RL
 •Nasty polynomial, though
 •Doesn’t work well in practice
 •Still, a nice explanation of why some forms of RL work
Unsupervised Learning:
Clustering & Model Fitting
The unsupervised problem
•Given:
 •A set of data points
•Find:
 •A good description of the data
Typical tasks
•Given: many measurements of flowers
 •What different breeds are there?
•Given: many microarray measurements
 •Which genes act the same?
•Given: a bunch of documents
 •What topics are there? How are they related? Which are “good” essays and which are “bad”?
•Given: long sequences of GUI events
 •What tasks was the user working on? Are they “flat” or hierarchical?
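As one concrete instance of “find a good description of the data”: a minimal k-means clustering sketch (k-means is not yet introduced on these slides; the function, its parameters, and the toy data are illustrative):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal k-means: alternate (1) assigning each point to its nearest
    centroid and (2) moving each centroid to the mean of its cluster."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[j])))
            clusters[j].append(p)
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl
                     else centroids[j] for j, cl in enumerate(clusters)]
    return centroids, clusters

# Two well-separated blobs of four points each:
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (10, 10), (10, 11), (11, 10), (11, 11)]
centers, groups = kmeans(pts, 2)
```

On the toy data the algorithm recovers the two blobs, with centroids at the blob means; real uses (flower breeds, gene expression, document topics) only change the feature vectors and the number of clusters.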