Markov Decision Processes: Value Iteration
Pieter Abbeel, UC Berkeley EECS
[Drawing from Sutton and Barto, Reinforcement Learning: An Introduction, 1998]
Markov Decision Process
Assumption: agent gets to observe the state
Markov Decision Process (S, A, T, R, H)
Given
S: set of states
A: set of actions
T: S x A x S x {0, 1, …, H} → [0, 1], T_t(s, a, s') = P(s_{t+1} = s' | s_t = s, a_t = a)
R: S x A x S x {0, 1, …, H} → ℝ, R_t(s, a, s') = reward for (s_{t+1} = s', s_t = s, a_t = a)
H: horizon over which the agent will act
Goal:
Find π: S x {0, 1, …, H} → A that maximizes expected sum of rewards, i.e., max_π E[ Σ_{t=0}^{H} R_t(s_t, a_t, s_{t+1}) | π ]
MDP (S, A, T, R, H), goal:
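The tuple (S, A, T, R, H) defined above maps directly onto a small container in code. A minimal sketch (the class and field names are illustrative assumptions, not from the slides):

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class MDP:
    """Finite-horizon MDP (S, A, T, R, H). Illustrative container, not from the slides."""
    states: Sequence    # S: set of states
    actions: Sequence   # A: set of actions
    T: Callable         # T(s, a, s2) -> P(s_{t+1} = s2 | s_t = s, a_t = a)
    R: Callable         # R(s, a, s2) -> reward
    H: int              # horizon over which the agent acts
```

For small problems, T and R can simply be Python functions or nested dictionaries keyed by (s, a, s').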
Examples
Cleaning robot
Walking robot
Pole balancing
Games: Tetris, backgammon
Server management
Shortest path problems
Models for animals, people
Canonical Example: Grid World
The agent lives in a grid; walls block the agent's path.
The agent's actions do not always go as planned: 80% of the time, the action North takes the agent North (if there is no wall there); 10% of the time, North takes the agent West; 10% East.
If there is a wall in the direction the agent would have been taken, the agent stays put.
Big rewards come at the end.
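The noisy transition model described above (80% intended direction, 10% to each side, stay put when blocked) can be sketched as follows. Grid coordinates, the wall set, and all helper names are illustrative assumptions:

```python
# Noise model: for each intended action, the realized directions and probabilities.
NOISE = {'N': [('N', 0.8), ('W', 0.1), ('E', 0.1)],
         'S': [('S', 0.8), ('E', 0.1), ('W', 0.1)],
         'E': [('E', 0.8), ('N', 0.1), ('S', 0.1)],
         'W': [('W', 0.8), ('S', 0.1), ('N', 0.1)]}
STEP = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}  # (row, col) deltas

def transitions(state, action, walls, rows, cols):
    """Return {next_state: probability} for taking `action` in `state`.
    If the realized move hits a wall or leaves the grid, the agent stays put."""
    out = {}
    for direction, p in NOISE[action]:
        dr, dc = STEP[direction]
        nxt = (state[0] + dr, state[1] + dc)
        if nxt in walls or not (0 <= nxt[0] < rows) or not (0 <= nxt[1] < cols):
            nxt = state  # blocked: stay put
        out[nxt] = out.get(nxt, 0.0) + p
    return out
```

Note that probabilities for outcomes mapping to the same cell are accumulated, so the returned distribution always sums to 1.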
Grid Futures
[Figure: search trees of possible futures over actions E, N, S, W — Deterministic Grid World (one outcome per action) vs. Stochastic Grid World (several possible outcomes per action)]
Solving MDPs
In an MDP, we want an optimal policy π*: S x {0, 1, …, H} → A
A policy gives an action for each state for each time
An optimal policy maximizes expected sum of rewards.
Contrast: in deterministic problems, we want an optimal plan, or sequence of actions, from start to a goal.
Value Iteration
Idea: V_i*(s) = the expected sum of rewards accumulated when starting from state s and acting optimally for a horizon of i steps.
Algorithm: Start with V_0*(s) = 0 for all s. For i = 1, …, H:
Given V_{i-1}*, calculate for all states s ∈ S:
V_i*(s) ← max_a Σ_{s'} T(s, a, s') [ R(s, a, s') + V_{i-1}*(s') ]
This is called a value update or Bellman update/back-up.
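The algorithm above can be sketched as a few lines of tabular code. A minimal sketch, assuming T and R are given as functions over (s, a, s') as defined earlier (function and variable names are illustrative):

```python
def value_iteration(states, actions, T, R, H):
    """Finite-horizon value iteration.
    Returns V_H*(s): expected sum of rewards starting from s
    and acting optimally for H steps."""
    V = {s: 0.0 for s in states}  # V_0*(s) = 0 for all s
    for _ in range(H):            # i = 1, ..., H
        V_prev = V                # Bellman back-up reads V_{i-1}*
        V = {s: max(sum(T(s, a, s2) * (R(s, a, s2) + V_prev[s2])
                        for s2 in states)
                    for a in actions)
             for s in states}
    return V
```

Each pass over the states performs one Bellman back-up, so the total cost is O(H · |S|² · |A|) for a dense transition model.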
Example: Value Iteration
Information propagates outward from terminal states and eventually all states have correct value estimates.
[Figure: grid-world value estimates after iterations V_2 and V_3]
Practice: Computing Actions
Which action should we choose from state s, given optimal values V*?
π*(s) = argmax_a Σ_{s'} T(s, a, s') [ R(s, a, s') + V*(s') ]
= the greedy action with respect to V* = action choice with one-step lookahead w.r.t. V*
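This one-step lookahead can be sketched directly, assuming a tabular representation with T(s, a, s') and R(s, a, s') functions and a value table V (names are illustrative):

```python
def greedy_action(s, states, actions, T, R, V):
    """Greedy action w.r.t. V: one-step lookahead from state s,
    picking the action maximizing sum_{s'} T(s,a,s') * (R(s,a,s') + V(s'))."""
    return max(actions,
               key=lambda a: sum(T(s, a, s2) * (R(s, a, s2) + V[s2])
                                 for s2 in states))
```

Storing only V* and recomputing the argmax on demand is often cheaper than storing an explicit policy table, at the cost of one lookahead per decision.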
Today and forthcoming lectures
Optimal control: provides a general computational approach to tackle control problems.
Dynamic programming / value iteration:
  Discrete state spaces (done!)
  Discretization of continuous state spaces
  Linear systems: LQR
  Extensions to nonlinear settings:
    Local linearization
    Differential dynamic programming
Optimal control through nonlinear optimization:
  Open-loop
  Model predictive control
Examples