ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 15: Partially Observable Markov Decision Processes (POMDPs)
Dr. Itamar Arel
College of Engineering, Electrical Engineering and Computer Science Department
The University of Tennessee, Fall 2011
October 27, 2011
ECE 517 – Reinforcement Learning in AI
Outline
Why use POMDPs?
Formal definition
Belief state
Value function
Partially Observable Markov Decision Problems (POMDPs)
To introduce POMDPs, let us consider an example where an agent learns to drive a car in New York City. The agent can look forward, backward, left, or right. It can't change speed, but it can steer into the lane it is looking at.
The different types of observations are: the direction in which the agent's gaze is directed; the closest object in the agent's gaze; whether the object is looming or receding; the color of the object; whether a horn is sounding.
To drive safely, the agent must steer out of its lane to avoid slow cars ahead and fast cars behind.
POMDP Example
The agent is in control of the middle car.
The car behind is fast and will not slow down.
The car ahead is slower.
To avoid a crash, the agent must steer right.
However, when the agent is gazing to the right, there is no immediate observation that tells it about the impending crash.
POMDP Example (cont.)
This is not easy when the agent has no explicit goals beyond "performing well". There are no explicit training patterns such as "if there is a car ahead and left, steer right." However, a scalar reward is provided to the agent as a performance indicator (just like in MDPs).
The agent is penalized for colliding with other cars or the road shoulder.
The only goal hard-wired into the agent is that it must maximize a long-term measure of the reward.
POMDP Example (cont.)
Two significant problems make it difficult to learn under these conditions:
Temporal credit assignment: If our agent hits another car and is consequently penalized, how does the agent reason about which sequence of actions should not be repeated, and in what circumstances? This is generally the same as in MDPs.
Partial observability: If the agent is about to hit the car ahead of it, and there is a car to the left, then circumstances dictate that the agent should steer right. However, when it looks to the right it has no sensory information regarding what goes on elsewhere.
To solve the latter, the agent needs memory, which creates knowledge of the state of the world around it.
Forms of Partial Observability
Partial observability coarsely pertains to either:
Lack of important state information in observations, which must be compensated for using memory
Extraneous information in observations, which the agent needs to learn to ignore
In our example, the color of the car in its gaze is extraneous (unless red cars really drive faster). The agent needs to build a memory-based model of the world in order to accurately predict what will happen. This creates "belief state" information (we'll see later).
If the agent has access to the complete state, such as a chess-playing machine that can view the entire board:
It can choose optimal actions without memory
The Markov property holds, i.e. the future state of the world is simply a function of the current state and action
Modeling the world as a POMDP
Our setting is that of an agent taking actions in a world according to its policy. The agent still receives feedback about its performance through a scalar reward received at each time step.
Formally stated, a POMDP consists of:
|S| states S = {1, 2, ..., |S|} of the world
|U| actions (or controls) U = {1, 2, ..., |U|} available to the policy
|Y| observations Y = {1, 2, ..., |Y|}
a (possibly stochastic) reward r(i) for each state i in S
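As a concrete sketch of this tuple (the class and field names below are my own, not from the lecture), it can be held in a small Python container. Transition and observation probabilities are included as well, since the following slides rely on them:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# A minimal container for the POMDP tuple from the slide:
# states S, actions U, observations Y, a reward r(i) per state,
# plus the probability tables P(s' | s, a) and P(o | s', a).
@dataclass
class POMDP:
    states: List[str]                               # S
    actions: List[str]                              # U (controls)
    observations: List[str]                         # Y
    reward: Dict[str, float]                        # r(i) for each state i in S
    trans: Dict[Tuple[str, str], Dict[str, float]]  # (s, a) -> P(s' | s, a)
    obs: Dict[Tuple[str, str], Dict[str, float]]    # (s', a) -> P(o | s', a)

# Hypothetical two-state, one-action example (not the tiger problem yet):
tiny = POMDP(
    states=["s0", "s1"],
    actions=["stay"],
    observations=["blip"],
    reward={"s0": 0.0, "s1": 1.0},
    trans={("s0", "stay"): {"s0": 1.0}, ("s1", "stay"): {"s1": 1.0}},
    obs={("s0", "stay"): {"blip": 1.0}, ("s1", "stay"): {"blip": 1.0}},
)
```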
Modeling the world as a POMDP (cont.)
MDPs vs. POMDPs
In an MDP there is one observation for each state:
The concepts of observation and state are interchangeable
A memoryless policy that does not make use of internal state suffices
In POMDPs, different states may have similar probability distributions over observations:
Different states may look the same to the agent
For this reason, POMDPs are said to have hidden state
Two hallways may look the same to a robot's sensors: the optimal action for the first is take left, the optimal action for the second is take right. A memoryless policy can't distinguish between the two.
MDPs vs. POMDPs (cont.)
Noise can create ambiguity in state inference. An agent's sensors are always limited in the amount of information they can pick up.
One way of overcoming this is to add sensors:
Specific sensors that help it to "disambiguate" hallways
Only when possible, affordable, or desirable
In general, we're now considering agents that need to be proactive (also called "anticipatory"):
Not only react to environmental stimuli
Self-create context using memory
POMDP problems are harder to solve, but represent realistic scenarios.
POMDP solution techniques: model-based methods
If an exact model of the environment is available, POMDPs can (in theory) be solved, i.e. an optimal policy can be found.
As with model-based MDPs, it's not so much a learning problem:
No real "learning" or trial and error taking place
No exploration/exploitation dilemma
Rather a probabilistic planning problem: find the optimal policy
In POMDPs the above is broken into two elements:
Belief state computation, and
Value function computation based on belief states
The belief state
Instead of maintaining the complete action/observation history, we maintain a belief state b.
The belief state is a probability distribution over the states.
Given an observation, Dim(b) = |S| - 1.
The belief space is the entire probability space.
We'll use a two-state POMDP as a running example: the probability of being in state one is p, and the probability of being in state two is 1 - p.
Therefore, the entire space of belief states can be represented as a line segment.
The belief space
Here is a representation of the belief space when we have two states (s0, s1).
The belief space (cont.)
The belief space is continuous, but we only visit a countable number of belief points.
Assumptions:
Finite action set
Finite observation set
Next belief state b' = f(b, a, o), where b is the current belief state, a is the action, and o is the observation.
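The update f can be written out generically. Below is a sketch (the table layout and the toy two-state model are my own assumptions, not from the lecture); the probability tables are exactly what the POMDP model provides:

```python
def belief_update(b, a, o, P_trans, P_obs):
    """Compute b' = f(b, a, o) for a finite POMDP.

    b: dict state -> probability; a: action taken; o: observation received.
    P_trans[a][s][s2] = P(s2 | s, a); P_obs[a][s2][o] = P(o | s2, a).
    """
    # Weight each candidate next state s2 by how likely o is there,
    # and by how likely we were to arrive there from the prior belief.
    unnorm = {
        s2: P_obs[a][s2][o] * sum(P_trans[a][s][s2] * b[s] for s in b)
        for s2 in b
    }
    norm = sum(unnorm.values())  # = P(o | a, b), the normalizing constant
    return {s2: p / norm for s2, p in unnorm.items()}

# Hypothetical 2-state model: static world, sensor correct 80% of the time.
P_trans = {"stay": {"s0": {"s0": 1.0, "s1": 0.0}, "s1": {"s0": 0.0, "s1": 1.0}}}
P_obs = {"stay": {"s0": {"o0": 0.8, "o1": 0.2}, "s1": {"o0": 0.2, "o1": 0.8}}}
b1 = belief_update({"s0": 0.5, "s1": 0.5}, "stay", "o0", P_trans, P_obs)
print(b1)  # -> {'s0': 0.8, 's1': 0.2}: belief concentrates on s0
```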
The Tiger Problem
Standing in front of two closed doors
World is in one of two states: the tiger is behind the left door or the right door
Three actions: open left door, open right door, listen
Listening is not free, and not accurate (it may give wrong information)
Reward: open the wrong door and get eaten by the tiger (large -r); open the right door and get a prize (small +r)
Tiger Problem: POMDP Formulation
Two states: SL and SR (tiger is really behind the left or the right door)
Three actions: LEFT, RIGHT, LISTEN
Transition probabilities (row: current state, column: next state):
Listening does not change the tiger's position; opening a door ends the episode with a "reset".

Listen   SL    SR
SL       1.0   0.0
SR       0.0   1.0

Left     SL    SR
SL       0.5   0.5
SR       0.5   0.5

Right    SL    SR
SL       0.5   0.5
SR       0.5   0.5
Tiger Problem: POMDP Formulation (cont.)
Observations: TL (tiger left) or TR (tiger right)
Observation probabilities (row: state, column: observation):

Listen   TL     TR
SL       0.85   0.15
SR       0.15   0.85

Left     TL    TR
SL       0.5   0.5
SR       0.5   0.5

Right    TL    TR
SL       0.5   0.5
SR       0.5   0.5

Rewards:
R(SL, Listen) = R(SR, Listen) = -1
R(SL, Left) = R(SR, Right) = -100
R(SL, Right) = R(SR, Left) = +10
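The tables above can be transcribed directly into code. The sketch below (plain dicts; the variable names are my own) encodes the tiger model and applies one Bayes update of the uniform prior after listening and hearing the tiger on the left:

```python
# Tiger problem model, transcribed from the tables above.
T = {  # T[action][current][next] = P(next | current, action)
    "LISTEN": {"SL": {"SL": 1.0, "SR": 0.0}, "SR": {"SL": 0.0, "SR": 1.0}},
    "LEFT":   {"SL": {"SL": 0.5, "SR": 0.5}, "SR": {"SL": 0.5, "SR": 0.5}},
    "RIGHT":  {"SL": {"SL": 0.5, "SR": 0.5}, "SR": {"SL": 0.5, "SR": 0.5}},
}
Z = {  # Z[action][state][obs] = P(obs | state, action), listening only
    "LISTEN": {"SL": {"TL": 0.85, "TR": 0.15}, "SR": {"TL": 0.15, "TR": 0.85}},
}
R = {("SL", "LISTEN"): -1,   ("SR", "LISTEN"): -1,
     ("SL", "LEFT"):  -100,  ("SR", "RIGHT"): -100,
     ("SL", "RIGHT"):  10,   ("SR", "LEFT"):   10}

# One belief update: start uniform, LISTEN, hear the tiger on the left (TL).
b = {"SL": 0.5, "SR": 0.5}
a, o = "LISTEN", "TL"
unnorm = {s2: Z[a][s2][o] * sum(T[a][s][s2] * b[s] for s in b) for s2 in b}
total = sum(unnorm.values())
b_new = {s: p / total for s, p in unnorm.items()}
print(b_new)  # -> {'SL': 0.85, 'SR': 0.15}: belief shifts toward SL
```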
POMDP Policy Tree (Fake Policy)
(Figure: an example policy tree. The starting belief state assigns probability 0.3 to tiger-left, and the root action is Listen. Each observation, "tiger roar left" or "tiger roar right", leads to a new belief state such as 0.6, 0.15, or 0.9, where the tree prescribes the next action: Listen or Open left door.)
POMDP Policy Tree (cont.)
(Figure: a generic policy tree. The root prescribes action A1; each observation o1, o2, ... branches to a node prescribing the next action A2, A3, ..., and so on for each level of the tree.)
How many POMDP policies are possible?
How many policy trees are there, with |A| actions, |O| observations, and horizon T?
Number of nodes in a tree (level i has |O|^i nodes, for i = 0, ..., T-1):
N = sum_{i=0}^{T-1} |O|^i = (|O|^T - 1) / (|O| - 1)
Number of trees: |A|^N
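This count explodes quickly. A short check of the formula, assuming the tiger problem's |A| = 3 and |O| = 2 (the function name is my own):

```python
# Number of distinct T-step policy trees with |A| actions and |O| observations.
# Assumes num_obs > 1 (the geometric-series formula divides by |O| - 1).
def num_policy_trees(num_actions: int, num_obs: int, horizon: int) -> int:
    nodes = (num_obs ** horizon - 1) // (num_obs - 1)  # 1 + |O| + ... + |O|^(T-1)
    return num_actions ** nodes

# Tiger problem: 3 actions, 2 observations.
print(num_policy_trees(3, 2, 1))  # -> 3: a single root action
print(num_policy_trees(3, 2, 2))  # -> 27: 3 choices at each of 3 nodes
print(num_policy_trees(3, 2, 4))  # -> 14348907: 3^15, already in the millions
```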
Belief State
Overall formula:
b'(s') = P(o | s', a) * sum_s P(s' | s, a) * b(s) / P(o | a, b)
The belief state is updated proportionally to the probability of seeing the current observation given state s', and to the probability of arriving at state s' given the action and our previous belief state b; the denominator P(o | a, b) is a normalizing constant. The above are all given by the model.
Belief State (cont.)
Let's look at an example. Consider a robot that is initially completely uncertain about its location. Seeing a door may, as specified by the model, occur in three different locations. Suppose that the robot takes an action and observes a T-junction. It may be that, given the action, only one of the three states could have led to an observation of a T-junction.
The agent now knows with certainty which state it is in.
Not in all cases does the uncertainty disappear like that.
Finding an optimal policy
The policy component of a POMDP agent must map the current belief state into an action.
It turns out that the belief state is a sufficient statistic (i.e. Markovian): we can't do better even if we remembered the entire history of observations and actions.
We have now transformed the POMDP into an MDP.
Good news: we have ways of solving those (GPI algorithms)
Bad news: the belief state space is continuous!
Value function
The belief state is the input to the second component of the method: the value function computation.
The belief state is a point in a continuous space of N-1 dimensions!
The value function must be defined over this infinite space.
Direct application of dynamic programming techniques is infeasible.
Value function (cont.)Value function (cont.)
• Let’s assume only two states: SLet’s assume only two states: S11 and S and S22
• Belief state Belief state [0.25 0.75][0.25 0.75] indicates indicates b(sb(s11) = 0.25) = 0.25, , b(sb(s22) = ) = 0.750.75
• With two states, b(sWith two states, b(s11) is sufficient to indicate belief ) is sufficient to indicate belief state: b(sstate: b(s22) = 1 – b(s) = 1 – b(s11))
S1
[1, 0]S2
[0, 1][0.5, 0.5]
V(b)
b: belief state
ECE 517 – Reinforcement Learning in AI 2727
Piecewise Linear and Convex (PWLC)
It turns out that the value function is, or can be accurately approximated by, a piecewise linear and convex function.
Intuition on convexity: being certain of a state yields high value, whereas uncertainty lowers the value.
(Figure: a piecewise linear, convex V(b) over the belief segment from S1 [1, 0] to S2 [0, 1], dipping near [0.5, 0.5].)
Why does PWLC help?
We can directly work with regions (intervals) of belief space!
The vectors correspond to policies, and indicate the right action to take in each region of the space.
(Figure: V(b) is the upper surface of three linear functions Vp1, Vp2, Vp3, which partition the belief segment into region 1, region 2, and region 3.)
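This region structure is easy to sketch in code. Each linear piece is an "alpha-vector" with an associated action, and V(b) is the maximum of their inner products with the belief; the vectors and actions below are made up for illustration, not computed from the tiger model:

```python
# Hypothetical alpha-vectors over a 2-state belief b = (p, 1 - p),
# where p is the probability the tiger is behind the LEFT door.
alphas = {
    "open-right": (1.0, 0.0),   # high value only when we are sure it's SL
    "listen":     (0.65, 0.65), # safe choice in the uncertain middle region
    "open-left":  (0.0, 1.0),   # high value only when we are sure it's SR
}

def value_and_action(p: float):
    """Evaluate V(b) at b = (p, 1-p): the maximizing vector's action."""
    action, alpha = max(
        alphas.items(), key=lambda kv: kv[1][0] * p + kv[1][1] * (1 - p)
    )
    return alpha[0] * p + alpha[1] * (1 - p), action

print(value_and_action(0.9))  # -> (0.9, 'open-right'): region near b = [1, 0]
print(value_and_action(0.5))  # -> (0.65, 'listen'): middle region
print(value_and_action(0.1))  # -> (0.9, 'open-left'): region near b = [0, 1]
```

Each belief region is exactly the interval where one vector dominates, which is why working with the vectors is equivalent to working with the regions.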
Summary
POMDPs model realistic scenarios more accurately.
They rely on belief states that are derived from observations and actions.
A POMDP can be transformed into an MDP, with PWLC value function approximation.
What if we don't have a model?
Next class: (recurrent) neural networks come to the rescue ...