Machine Learning RL 1
Reinforcement Learning in Partially Observable Environments
Michael L. Littman
Machine Learning RL 2
Temporal Difference Learning (1)

Q learning: reduce discrepancy between successive Q estimates

One step time difference:
Q^{(1)}(s_t, a_t) \equiv r_t + \gamma \max_a \hat{Q}(s_{t+1}, a)

Why not two steps?
Q^{(2)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 \max_a \hat{Q}(s_{t+2}, a)

Or n?
Q^{(n)}(s_t, a_t) \equiv r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \gamma^n \max_a \hat{Q}(s_{t+n}, a)

Blend all of these:
Q^{\lambda}(s_t, a_t) \equiv (1 - \lambda)\left[ Q^{(1)}(s_t, a_t) + \lambda Q^{(2)}(s_t, a_t) + \lambda^2 Q^{(3)}(s_t, a_t) + \cdots \right]
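The n-step estimates and their λ-blend can be sketched in code (a minimal sketch: the function names, the list-based trajectory representation, and the truncation at the end of the recorded trajectory are my own assumptions, not from the slides):

```python
def n_step_estimate(rewards, q_next_max, gamma, n):
    """Q^(n): n discounted rewards plus a bootstrap from max_a Q_hat(s_{t+n}, a).

    rewards[i] = r_{t+i}; q_next_max[i] = max_a Q_hat(s_{t+1+i}, a).
    """
    backup = sum(gamma ** i * rewards[i] for i in range(n))
    return backup + gamma ** n * q_next_max[n - 1]

def lambda_blend(rewards, q_next_max, gamma, lam):
    """Q^lambda = (1 - lambda) * sum_n lambda^(n-1) * Q^(n), truncated at the
    end of the recorded trajectory (a simplification for this sketch)."""
    horizon = len(rewards)
    total = sum(lam ** (n - 1) * n_step_estimate(rewards, q_next_max, gamma, n)
                for n in range(1, horizon + 1))
    return (1 - lam) * total
```

With λ = 0 the blend reduces to the one-step Q-learning target; larger λ shifts weight toward longer lookaheads.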
Machine Learning RL 3
Temporal Difference Learning (2)

• TD(λ) algorithm uses above training rule
• Sometimes converges faster than Q learning
• Converges for learning V* for any 0 ≤ λ ≤ 1 [Dayan, 1992]
• Tesauro's TD-Gammon uses this algorithm
• Bias-variance tradeoff [Kearns & Singh 2000]
• Implemented using “eligibility traces” [Sutton 1988]
• Helps overcome non-Markov environments [Loch & Singh, 1998]

Equivalent expression:
Q^{\lambda}(s_t, a_t) \equiv r_t + \gamma \left[ (1 - \lambda) \max_a \hat{Q}(s_{t+1}, a) + \lambda Q^{\lambda}(s_{t+1}, a_{t+1}) \right]
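The eligibility-trace implementation mentioned on this slide can be sketched as a tabular update (a sketch under my own naming; the Watkins-style zeroing of traces after exploratory actions is omitted for brevity):

```python
from collections import defaultdict

def q_lambda_step(Q, e, s, a, r, s_next, actions, alpha, gamma, lam):
    """One TD(lambda)-style Q update: the one-step TD error is applied to every
    recently visited (state, action) pair in proportion to its decaying trace."""
    td_error = r + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)]
    e[(s, a)] += 1.0                         # mark the visited pair as eligible
    for key in list(e):
        Q[key] += alpha * td_error * e[key]  # spread credit along the trajectory
        e[key] *= gamma * lam                # older visits get less credit
```

Rather than waiting n steps to form each Q^(n) explicitly, the traces let one TD error update all recently visited pairs at once.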
Machine Learning RL 5
Markov Decision Processes

Recall MDP:
• finite set of states S
• set of actions A
• at each discrete time agent observes state s_t ∈ S and chooses action a_t ∈ A
• receives reward r_t, and state changes to s_{t+1}
• Markov assumption: s_{t+1} = δ(s_t, a_t) and r_t = r(s_t, a_t)
  – r_t and s_{t+1} depend only on current state and action
  – functions δ and r may be nondeterministic
  – functions δ and r not necessarily known to agent
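The Markov assumption can be illustrated with a tiny deterministic example (the three-state chain and table names are hypothetical):

```python
# s_{t+1} = delta(s_t, a_t) and r_t = r(s_t, a_t): next state and reward
# depend only on the current state and action, here stored as tables.
delta = {("s0", "right"): "s1", ("s1", "right"): "s2", ("s2", "right"): "s2"}
r = {("s0", "right"): 0.0, ("s1", "right"): 0.0, ("s2", "right"): 1.0}

def step(s, a):
    """One transition: everything needed is in the current (s, a) pair."""
    return delta[(s, a)], r[(s, a)]

s, total = "s0", 0.0
for _ in range(3):
    s, reward = step(s, "right")
    total += reward
```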
Machine Learning RL 6
Partially Observable MDPs
Same as MDP, but with an additional observation function ω that maps the state into what the learner can observe: o_t = ω(s_t)
Transitions and rewards still depend on state, but learner only sees a “shadow”.
How can we learn what to do?
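The “shadow” can be made concrete with a toy aliasing example (state and observation names are hypothetical): two distinct world states yield the same observation, so the learner cannot tell them apart.

```python
# Transitions still depend on the true state; the learner sees only o_t.
delta = {("left", "go"): "right", ("right", "go"): "goal", ("goal", "go"): "goal"}
omega = {"left": "corridor", "right": "corridor", "goal": "goal"}  # o = omega(s)

def step(s, a):
    """Advance the hidden state, but return only its observation shadow."""
    s_next = delta[(s, a)]
    return s_next, omega[s_next]
```

From the learner's point of view, "corridor" is ambiguous between two states that require different amounts of progress to reach the goal.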
Machine Learning RL 7
State Approaches to POMDPs

Q learning (dynamic programming) states:
– observations
– short histories
– learn POMDP model: most likely state
– learn POMDP model: information state
– learn predictive model: predictive state
– experience as state
Advantages, disadvantages?
Machine Learning RL 8
Learning a POMDP

Input: history (action-observation sequence).
Output: POMDP that “explains” the data.
EM, an iterative algorithm (Baum et al. 70; Chrisman 92):
• E step: forward-backward → state occupation probabilities
• M step: fractional counting → POMDP model
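The E step can be sketched for a single observation sequence, assuming (my notation) the actions taken are already folded into one transition matrix T, O holds observation probabilities, and b0 is the prior:

```python
import numpy as np

def forward_backward(T, O, obs, b0):
    """Forward-backward smoothing: returns, for each time step, the state
    occupation probabilities P(s_t | whole observation sequence).

    T[i, j] = P(s'=j | s=i), O[j, o] = P(o | s=j), obs = observation indices.
    """
    n, S = len(obs), T.shape[0]
    alpha = np.zeros((n, S))          # forward messages
    beta = np.ones((n, S))            # backward messages
    alpha[0] = b0 * O[:, obs[0]]
    for t in range(1, n):
        alpha[t] = (alpha[t - 1] @ T) * O[:, obs[t]]
    for t in range(n - 2, -1, -1):
        beta[t] = T @ (O[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```

The M step then re-estimates the model by fractional counting: expected transition and observation counts under these occupation probabilities replace the hard counts of the fully observed case.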
Machine Learning RL 9
EM Pitfalls

• Each iteration increases data likelihood.
• Local maxima (Shatkay & Kaelbling 97; Nikovski 99).
• Rarely learns a good model.
• Hidden states are truly unobservable.
Machine Learning RL 10
Information State
Assumes:
• objective reality
• known “map”
Also called belief state: represents location.
Vector of probabilities, one for each state.
Easily updated if model known. (Ex.)
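That update is one step of Bayes' rule (a sketch; T_a and O_ao are my names for the model's transition matrix under the action taken and the observation likelihoods for the observation seen):

```python
import numpy as np

def belief_update(b, T_a, O_ao):
    """b'(s') ∝ O(o | s') * sum_s T(s' | s, a) * b(s)."""
    b_next = O_ao * (b @ T_a)     # predict with the model, weight by evidence
    return b_next / b_next.sum()  # renormalize to a probability vector
```

Starting from a uniform belief, an observation that is more likely in one state immediately shifts probability mass toward that state.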
Machine Learning RL 11
Plan with Information States

Now, learner is 50% here and 50% there instead of in any particular state.
Good news: Markov in these vectors
Bad news: States continuous
Good news: Can be solved
Bad news: ...slowly
More bad news: Model is approximate!
Machine Learning RL 12
Predictions as State

Idea: Key information from the distant past, but never too far in the future. (Littman et al. 02)

[Diagram: starting at blue, the agent maintains predictions such as “up → blue?” and “left → red?”, updating or forgetting them as the history of action-observation pairs (up blue, left red, up not blue, left not red, ...) unfolds.]
Machine Learning RL 13
Experience as State
Nearest sequence memory (McCallum 1995)
Relate current episode to past experience.
k longest matches considered to be the same for purposes of estimating value and updating.
Current work: extend TD(λ), extend the notion of similarity (allow for soft matches, sensors)
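The matching step of nearest sequence memory can be sketched as follows (a simplification: real NSM matches action-percept-reward triples and pools stored Q values over the k winners, details omitted here):

```python
def match_length(history, i, j):
    """Length of the longest common suffix of history[:i] and history[:j]."""
    n = 0
    while n < min(i, j) and history[i - 1 - n] == history[j - 1 - n]:
        n += 1
    return n

def k_nearest_experiences(history, t, k):
    """The k past time steps whose preceding experience best matches the
    experience leading up to time t; their values are pooled for estimating
    value and updating."""
    scored = sorted(((match_length(history, i, t), i) for i in range(1, t)),
                    reverse=True)
    return [i for _, i in scored[:k]]
```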
Machine Learning RL 14
Classification Dialog (Keim & Littman 99)

User to travel to Roma, Torino, or Merino?

States: SR, ST, SM, done. Transitions to done.
Actions:
– QC (What city?)
– QR, QT, QM (Going to X?)
– R, T, M (I think X)
Observations:
– yes, no (more reliable), R, T, M (T/M confusable)
Objective:
– reward for correct class, cost for questions
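The objective can be encoded as a reward table (the numeric values are illustrative assumptions, not figures from Keim & Littman):

```python
states = ["SR", "ST", "SM", "done"]
question_actions = ["QC", "QR", "QT", "QM"]   # asking costs a little
classify = {"R": "SR", "T": "ST", "M": "SM"}  # guessing ends the dialog

QUESTION_COST = -1.0          # hypothetical values
CORRECT, WRONG = 10.0, -10.0

def reward(state, action):
    """Reward for the correct class, cost for questions."""
    if action in classify:
        return CORRECT if classify[action] == state else WRONG
    return QUESTION_COST
```

The ratio of question cost to classification reward determines how many clarifying questions an optimal policy will ask before committing.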
Machine Learning RL 15
Incremental Pruning Output
Optimal plan varies with priors (SR = SM).
[Plot: the belief simplex with corners ST, SR, SM, partitioned into regions where different actions are optimal.]