Hierarchical Reinforcement Learning Mausam [A Survey and Comparison of HRL techniques]
Transcript of Hierarchical Reinforcement Learning Mausam [A Survey and Comparison of HRL techniques]
Page 1
Hierarchical Reinforcement Learning
Mausam
[A Survey and Comparison of HRL techniques]
Page 2

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.
RL: simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
Page 3

Decision Making

[Diagram: an agent perceives the environment and acts on it, asking "What action next?"]

Slide courtesy Dan Weld
Page 4

Personal Printerbot

States (S): {loc, has-robot-printout (h-r-po), user-loc, has-user-printout (h-u-po)}, plus a map.
Actions (A): {moveN, moveS, moveE, moveW, extend-arm, grab-page, release-pages}
Reward (R): +20 if h-u-po, else −1.
Goal (G): all states with h-u-po true.
Start state: a state with h-u-po false.
Page 5

Episodic Markov Decision Process

⟨S, A, P, R, G, s0⟩
S: set of environment states.
A: set of available actions.
P: probability transition model, P(s'|s,a).*
R: reward model, R(s).*
G: absorbing goal states.
s0: start state.
γ: discount factor.**

* Markovian assumption.
** Bounds R for an infinite horizon.

Episodic MDP ≡ MDP with absorbing goals.
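To make the tuple concrete, here is a minimal Python sketch of an episodic MDP as a plain container; the field names and the tabular representation are our own, not from the talk.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

State = str
Action = str

@dataclass
class EpisodicMDP:
    """The tuple <S, A, P, R, G, s0> plus a discount factor (hypothetical names)."""
    states: Set[State]
    actions: Set[Action]
    P: Dict[Tuple[State, Action], Dict[State, float]]  # P[(s, a)][s'] = P(s'|s, a)
    R: Dict[State, float]   # Markovian reward R(s)
    goals: Set[State]       # absorbing goal states G
    s0: State               # start state
    gamma: float = 0.95     # discount factor; bounds total reward over an infinite horizon
```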
Page 6

Goal of an Episodic MDP

Find a policy (π: S → A) that maximises expected discounted reward for a fully observable* episodic MDP, when the agent is allowed to execute for an indefinite horizon.

* Non-noisy, complete-information perceptors.
Page 7

Solution of an Episodic MDP

Define V*(s): the optimal expected reward starting in state s.

Value Iteration: start with an estimate of V*(s) and successively re-estimate it until it converges to a fixed point.
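A sketch of tabular value iteration over the EpisodicMDP container above. Each sweep applies the standard Bellman backup V(s) ← R(s) + γ·max_a Σ_s' P(s'|s,a)·V(s'); we assume goal states are absorbing (value 0) and skip actions undefined at a state.

```python
from typing import Dict

def value_iteration(mdp: EpisodicMDP, tol: float = 1e-6) -> Dict[State, float]:
    """Iterate the Bellman backup until the largest update falls below tol."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        max_change = 0.0
        for s in mdp.states:
            if s in mdp.goals:      # absorbing goals earn no further reward
                continue
            best = max(
                (sum(p * V[s2] for s2, p in mdp.P[(s, a)].items())
                 for a in mdp.actions if (s, a) in mdp.P),
                default=0.0,
            )
            new_v = mdp.R[s] + mdp.gamma * best
            max_change = max(max_change, abs(new_v - V[s]))
            V[s] = new_v
        if max_change < tol:
            return V
```

Each sweep touches every state once, which is where the polynomial-in-|S| cost on the next slide comes from.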
Page 8

Complexity of Value Iteration

Each iteration: polynomial in |S|.
Number of iterations: polynomial in |S|.
Overall: polynomial in |S|.
But |S| is exponential in the number of features in the domain.*

* Bellman’s curse of dimensionality.
Page 9

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.
RL: simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
Page 10

Learning

[Diagram: the agent receives data from the environment.]

• Gain knowledge
• Gain understanding
• Gain skills
• Modification of behavioural tendency
Page 11

Decision Making while Learning*

[Diagram: the agent receives percepts (data) from the environment and must choose the next action.]

• Gain knowledge
• Gain understanding
• Gain skills
• Modification of behavioural tendency

* Known as Reinforcement Learning.
Page 12

Reinforcement Learning

Unknown transition model P and reward model R.
Learning component: estimate P and R from data observed in the environment.
Planning component: decide which actions to take to maximise reward.
Exploration vs. exploitation: GLIE (Greedy in the Limit with Infinite Exploration).
Page 13

Learning

Model-based learning: learn the model, then plan. Requires less data, more computation.
Model-free learning: plan without learning an explicit model. Requires more data, less computation.
Page 14

Q-Learning

Instead of learning P and R, learn Q* directly.
Q*(s,a): the optimal expected reward starting in s, if the first action is a and the optimal policy is followed thereafter.
Q* directly defines the optimal policy: the optimal action in s is the one with maximum Q* value.
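For reference, Q* satisfies the standard fixed-point equation (using the R(s), P(s'|s,a) and γ defined earlier), and the greedy policy reads off the maximising action:

```latex
Q^*(s,a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, \max_{a'} Q^*(s', a'),
\qquad
\pi^*(s) = \arg\max_{a} Q^*(s, a).
```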
Page 15

Q-Learning

Given an experience tuple ⟨s, a, s', r⟩, move the old estimate of the Q value towards the new estimate:

Q(s,a) ← (1 − α)·Q(s,a) + α·[r + γ·max_a' Q(s',a')]

Under suitable assumptions and GLIE exploration, Q-Learning converges to optimal.
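A minimal sketch of that update over a tabular Q (a defaultdict), reusing the State/Action aliases from the earlier sketch; the function name and the learning rate α = 0.1 are our own choices.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def q_update(Q: Dict[Tuple[State, Action], float], actions: List[Action],
             s: State, a: Action, s2: State, r: float,
             alpha: float = 0.1, gamma: float = 0.95) -> None:
    """One Q-Learning step for the experience tuple <s, a, s', r>."""
    old = Q[(s, a)]                                        # old estimate
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * old + alpha * target         # move towards new estimate

Q = defaultdict(float)  # unseen (s, a) pairs start at 0
q_update(Q, ["moveE", "moveW"], s="hall-3", a="moveE", s2="hall-4", r=-1.0)
```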
Page 16

Semi-MDP: When Actions Take Time

The Semi-MDP equation:

Q*(s,a) = E[ r + γ^N · max_a' Q*(s',a') ]

The Semi-MDP Q-Learning equation:

Q(s,a) ← (1 − α)·Q(s,a) + α·[r + γ^N · max_a' Q(s',a')]

where the experience tuple is ⟨s, a, s', r, N⟩, N is the number of time steps a took, and r is the accumulated discounted reward while action a was executing.
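The only change from the one-step update is discounting the successor value by γ^N; a sketch:

```python
def smdp_q_update(Q, actions, s, a, s2, r, N, alpha=0.1, gamma=0.95):
    """Semi-MDP Q-Learning step for <s, a, s', r, N>: action a ran for N time
    steps, and r is the discounted reward accumulated while a was executing."""
    target = r + gamma ** N * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```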
Page 17

Printerbot

The Paul G. Allen Center has 85,000 sq ft of space.
Each floor: ~85,000 / 7 ≈ 12,000 sq ft.
Discretise location on a floor into 12,000 parts.
State space (without the map): 2 × 2 × 12,000 × 12,000 ≈ 5.8 × 10^8 — very large!

How do humans do the decision making?
Page 18

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.
RL: simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
Page 19

1. The Mathematical Perspective

A structure paradigm:
S: Relational MDP
A: Concurrent MDP
P: Dynamic Bayes nets
R: Continuous-state MDP
G: Conjunction of state variables
V: Algebraic decision diagrams
π: Decision list (RMDP)
Page 20
2. Modular Decision Making
Page 21

2. Modular Decision Making

• Go out of the room
• Walk in the hallway
• Go into the room
Page 22

2. Modular Decision Making

Humans plan modularly, at different granularities of understanding.
Going out of one room is similar to going out of another room.
Navigation steps do not depend on whether we have the printout or not.
Page 23

3. Background Knowledge

Classical planners using additional control knowledge can scale up to larger problems (e.g., HTN planning, TLPlan).
What forms of control knowledge can we provide to our Printerbot?
• First pick up the printouts, then deliver them.
• For navigation, consider rooms and the hallway separately, etc.
Page 24

A mechanism that exploits all three avenues: Hierarchies

1. A way to add a special (hierarchical) structure on different parameters of an MDP.
2. Draws from the intuition and reasoning in human decision making.
3. A way to provide additional control knowledge to the system.
Page 25

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.
RL: simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
Page 26

Hierarchy

A hierarchy of: behaviours, skills, modules, subtasks, macro-actions, etc.
Examples: picking up the pages, collision avoidance, the fetch-pages phase, walking in the hallway.

HRL ≡ RL with temporally extended actions.
Page 27

Hierarchical Algorithms ≡ Gating Mechanism

Hierarchical learning:
• Learning the gating function
• Learning the individual behaviours
• Learning both

g is a gate; each bi is a behaviour.*

* Can be a multi-level hierarchy.
Page 28

Option: moveE until the end of the hallway

Start: any state in the hallway.
Execute: the policy as shown.
Terminate: when s is the end of the hallway.
Page 29

Options [Sutton, Precup, Singh ’99]

An option is a well-defined behaviour: o = ⟨I_o, π_o, β_o⟩.
I_o: the set of states (I_o ⊆ S) in which o can be initiated.
π_o: the policy (S → A*) followed while o is executing.
β_o(s): the probability that o terminates in s.

* Can be a policy over lower-level options.
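The triple maps directly onto a small container; a sketch with hypothetical field names, plus the "moveE until end of hallway" option from the previous slide (HALLWAY_STATES and END_OF_HALLWAY are hypothetical state sets):

```python
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    """An option o = <I_o, pi_o, beta_o>."""
    initiation: Set[State]                 # I_o: states where o may be initiated
    policy: Callable[[State], Action]      # pi_o: action chosen while o executes
    termination: Callable[[State], float]  # beta_o(s): prob. that o terminates in s

move_east = Option(
    initiation=HALLWAY_STATES,             # hypothetical set of hallway states
    policy=lambda s: "moveE",
    termination=lambda s: 1.0 if s in END_OF_HALLWAY else 0.0,
)
```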
Page 30

Learning

An option is a temporally extended action with a well-defined policy.
The set of options (O) replaces the set of actions (A).
Learning occurs outside the options.
Learning over options ≡ Semi-MDP Q-Learning.
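A sketch of how one option execution produces the ⟨s, o, s', r, N⟩ tuple consumed by the Semi-MDP update sketched earlier; env.step is a hypothetical one-step simulator returning (next state, reward).

```python
import random

def run_option(env, o: Option, s: State, gamma: float = 0.95):
    """Execute option o from state s until it terminates.
    Returns (s', accumulated discounted reward, number of steps N)."""
    r_total, N = 0.0, 0
    while True:
        a = o.policy(s)
        s, r = env.step(s, a)          # hypothetical simulator interface
        r_total += (gamma ** N) * r
        N += 1
        if random.random() < o.termination(s):
            return s, r_total, N

# Learning over options: pick an option (GLIE), run it, then call
# smdp_q_update(Q, option_ids, s, o_id, s2, r, N) with option
# identifiers in place of actions.
```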
Page 31

Machine: moveE + Collision Avoidance

[Diagram: the top-level machine repeatedly executes moveE and then reaches a choose node. At the end of the hallway it returns; on meeting an obstacle it calls submachine M1 (moveW, moveN, moveN, return) or M2 (moveW, moveS, moveS, return) to detour around it.]
Page 32

Hierarchies of Abstract Machines [Parr, Russell ’97]

A machine is a partial policy represented by a finite state automaton.
A node can:
• Execute a ground action.
• Call a machine as a subroutine.
• Choose the next node.
• Return to the calling machine.
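A sketch of that node vocabulary in Python; all names are ours, and a real HAM would also make transitions depend on the observed world state.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Node:
    kind: str                          # "action" | "call" | "choice" | "return"
    action: Optional[Action] = None    # ground action, for "action" nodes
    callee: Optional[str] = None       # machine name, for "call" nodes
    successors: List[str] = field(default_factory=list)  # >1 only at "choice" nodes

@dataclass
class Machine:
    """A partial policy as a finite state automaton: everything is fixed
    except which successor to take at choice nodes -- that is what gets learned."""
    nodes: Dict[str, Node]
    start: str
```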
Page 34

Learning

Learning occurs within machines, as machines are only partially defined.
Flatten all machines out and consider states [s,m], where s is a world state and m a machine node ≡ MDP.
reduce(S∘M): consider only the states where the machine node is a choice node ≡ Semi-MDP.
Learning ≈ Semi-MDP Q-Learning.
Page 35

Task Hierarchy: MAXQ Decomposition [Dietterich ’00]

Root → Fetch, Deliver
Fetch → Navigate(loc), Take
Deliver → Navigate(loc), Give
Take → Extend-arm, Grab
Give → Extend-arm, Release
Navigate(loc) → moveN, moveS, moveE, moveW

Children of a task are unordered.
Page 36

MAXQ Decomposition

Augment the state s by adding the subtask i: [s,i].
Define C([s,i],j) as the reward received in i after j finishes.

Q([s,Fetch], Navigate(prr)) = V([s,Navigate(prr)]) + C([s,Fetch], Navigate(prr))*

Here V([s,Navigate(prr)]) is the reward received while navigating, and C([s,Fetch], Navigate(prr)) is the reward received after navigation (prr = printer room).

Express V in terms of C; learn C instead of learning Q.

* Observe the context-free nature of the Q value.
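A sketch of the resulting recursion over a task hierarchy: `hierarchy` maps each composite task to its child subtasks (primitive actions are absent from the dict), `C` is the learned completion table (e.g. a defaultdict keyed by (i, s, j)), and `r_prim(s, j)` stands in for the expected one-step reward of a primitive action. All names are ours.

```python
from typing import Dict, List

Task = str

def maxq_v(j: Task, s: State, C, hierarchy: Dict[Task, List[Task]], r_prim) -> float:
    """V([s,j]): for a primitive j, its expected one-step reward;
    for a composite task, the best Q over its (unordered) children."""
    if j not in hierarchy:                       # primitive action
        return r_prim(s, j)
    return max(maxq_q(j, s, a, C, hierarchy, r_prim) for a in hierarchy[j])

def maxq_q(i: Task, s: State, j: Task, C, hierarchy, r_prim) -> float:
    """Q([s,i],j) = V([s,j]) + C([s,i],j): reward earned inside j, plus the
    completion reward earned in i after j finishes. Only C is learned."""
    return maxq_v(j, s, C, hierarchy, r_prim) + C[(i, s, j)]
```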
Page 37

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.
RL: simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
Page 38

1. State Abstraction

Abstract state: a state with fewer state variables; different world states map to the same abstract state.
If we can drop some state variables, we can reduce the learning time considerably!
We may use different abstract states for different macro-actions.
Page 39

State Abstraction in MAXQ

Relevance: only some variables are relevant to the task.
• Fetch: user-loc is irrelevant.
• Navigate(printer-room): h-r-po, h-u-po, user-loc are irrelevant.
Fewer parameters for V at the lower levels.

Funnelling: a subtask maps many states to a smaller set of states.
• Fetch: all states map to h-r-po = true, loc = printer room.
Fewer parameters for C at the higher levels.
Page 40

State Abstraction in Options, HAM

Options: learning is required only in states that are terminal states for some option.
HAM: the original work has no abstraction. An extension* uses a three-way value decomposition:

Q([s,m],n) = V([s,n]) + C([s,m],n) + Cex([s,m])

Similar abstractions are then employed.

* [Andre, Russell ’02]
Page 41

2. Optimality

Hierarchical Optimality vs. Recursive Optimality
Page 42

Optimality

Options: hierarchical. Using (A ∪ O): global.** Interrupt options.
HAM: hierarchical.*
MAXQ: recursive.* Interrupt subtasks, use pseudo-rewards, iterate!

* Can define equations for both optimalities.
** The advantage of using macro-actions may be lost.
Page 43

3. Language Expressiveness

Options:
• Can only input a complete policy.
HAM:
• Can input a complete policy.
• Can input a task hierarchy.
• Can represent “amount of effort”.
• Later extended to partial programs.
MAXQ:
• Cannot input a policy (full or partial).
Page 44

4. Knowledge Requirements

Options: requires a complete specification of the policy (one could learn option policies, given subtasks).
HAM: medium requirements.
MAXQ: minimal requirements.
Page 45

5. Models Advanced

Options: concurrency.
HAM: richer representations, concurrency.
MAXQ: continuous time, states and actions; multi-agent settings; average reward.

In general, more researchers have followed MAXQ:
• Less input knowledge
• Value decomposition
Page 46

6. Structure Paradigm

S: Options, MAXQ
A: All
P: None
R: MAXQ
G: All
V: MAXQ
π: All
Page 47

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.
RL: simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
Page 48

Directions for Future Research

Bidirectional state abstractions
Hierarchies over other RL research:
• Model-based methods
• Function approximators
Probabilistic planning:
• Hierarchical P and hierarchical R
Imitation learning
Page 49

Directions for Future Research

Theory:
• Bounds (goodness of a hierarchy)
• Non-asymptotic analysis
Automated discovery:
• Discovery of hierarchies
• Discovery of state abstractions
Apply…
Page 50

Applications

Toy robots, flight simulators, AGV scheduling, Keepaway soccer.

[Diagram: an AGV scheduling domain with a warehouse, assemblies, parts, pickup stations P1–P4 and drop-off stations D1–D4.]

Images courtesy various sources.
Page 51
Thinking Big…
"... consider maze domains. Reinforcement learning researchers, including this author, have spent countless years of research solving a solved problem! Navigating in grid worlds, even with stochastic dynamics, has been far from rocket science since the advent of search techniques such as A*.” -- David Andre
Use planners, theorem provers, etc. as components in a big hierarchical solver.
Page 52

The Outline of the Talk

MDPs and Bellman’s curse of dimensionality.
RL: simultaneous learning and planning.
Explore avenues to speed up RL.
Illustrate prominent HRL methods.
Compare prominent HRL methods.
Discuss future research.
Summarise.
Page 53

How to Choose an Appropriate Hierarchy

Look at the available domain knowledge:
• If some behaviours are completely specified – Options.
• If some behaviours are partially specified – HAM.
• If little domain knowledge is available – MAXQ.
We can use all three in tandem to specify different behaviours.
Page 54

Main Ideas in the HRL Community

• Hierarchies speed up learning
• Value function decomposition
• State abstractions
• Greedy non-hierarchical execution
• Context-free learning and pseudo-rewards
• Policy improvement by re-estimation and re-learning