Collaborative Reinforcement Learning
Transcript of Collaborative Reinforcement Learning
Collaborative Reinforcement Learning
Presented by Dr. Ying Lu
Credits
Reinforcement Learning: A User’s Guide. Bill Smart at ICAC 2005
Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill, "Collaborative Reinforcement Learning of Autonomic Behaviour", 2nd International Workshop on Self-Adaptive and Autonomic Computing Systems, pages 700-704, 2004. [Winner Best Paper Award].
What is RL?
“a way of programming agents by reward and punishment without needing to specify how the task is to be achieved”
[Kaelbling, Littman, & Moore, 96]
Basic RL Model
1. Observe state, s_t
2. Decide on an action, a_t
3. Perform action
4. Observe new state, s_{t+1}
5. Observe reward, r_{t+1}
6. Learn from experience
7. Repeat
Goal: Find a control policy that will maximize the observed rewards over the lifetime of the agent
(Diagram: the agent performs action A in the World and observes state S and reward R)
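The observe–act–learn loop above can be sketched directly in code. This is a minimal, self-contained sketch; the toy `LineWorld` environment and its rewards are illustrative inventions, not from the slides:

```python
import random

class LineWorld:
    """Toy environment: positions 0..3; +1 for reaching 3, -0.01 per move."""
    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):              # action is -1 (left) or +1 (right)
        self.pos = max(0, min(3, self.pos + action))
        done = self.pos == 3
        reward = 1.0 if done else -0.01
        return self.pos, reward, done

env = LineWorld()
state = env.reset()                      # 1. observe state s_t
for t in range(100):
    action = random.choice([-1, 1])      # 2. decide on an action a_t
                                         #    (a real agent would consult its policy)
    state, reward, done = env.step(action)  # 3-5. act, observe s_{t+1} and r_{t+1}
    # 6. learn from experience -- omitted in this sketch
    if done:                             # 7. repeat until the episode ends
        break
```

A learning agent would replace the random choice with a policy and update that policy from `(state, action, reward)` experience at step 6.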
An Example: Gridworld
Canonical RL domain
• States are grid cells
• 4 actions: N, S, E, W
• +1 reward for entering the top-right cell
• -0.01 for every other move

Maximizing the sum of rewards ⇒ shortest path (in this instance)
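As a rough sketch, this gridworld can be solved with tabular Q-learning (the update rule appears later in these slides); the 4×4 grid size, learning rate, and exploration rate below are illustrative choices:

```python
import random

N = 4                                    # illustrative grid size
ACTIONS = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
GOAL = (N - 1, N - 1)                    # top-right cell

def step(cell, a):
    """Move within the grid; +1 for entering the goal, -0.01 otherwise."""
    dx, dy = ACTIONS[a]
    nxt = (min(N - 1, max(0, cell[0] + dx)), min(N - 1, max(0, cell[1] + dy)))
    return nxt, (1.0 if nxt == GOAL else -0.01)

Q = {((x, y), a): 0.0 for x in range(N) for y in range(N) for a in ACTIONS}
random.seed(0)
for episode in range(500):
    s = (0, 0)
    while s != GOAL:
        if random.random() < 0.1:        # epsilon-greedy exploration
            a = random.choice(list(ACTIONS))
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s2, r = step(s, a)
        best_next = max(Q[(s2, a2)] for a2 in ACTIONS)
        Q[(s, a)] += 0.5 * (r + best_next - Q[(s, a)])   # Q-learning update
        s = s2
```

After training, following the greedy action arg max_a Q(s, a) from (0, 0) traces a shortest path to the goal, exactly as the slide claims.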
The Promise of RL
Specify what to do, but not how to do it
• Through the reward function
• Learning “fills in the details”

Better final solutions
• Based on actual experiences, not programmer assumptions

Less (human) time needed for a good solution
Mathematics of RL
Before we talk about RL, we need to cover some background material
• Some simple decision theory
• Markov Decision Processes
• Value functions
Making Single Decisions
Single decision to be made
• Multiple discrete actions
• Each action has a reward associated with it

Goal is to maximize reward
• Not hard: just pick the action with the largest reward

State 0 has a value of 2
• Sum of rewards from taking the best action from the state

(Diagram: from state 0, action A leads to state 1 with reward 1; action B leads to state 2 with reward 2)
Markov Decision Processes
We can generalize the previous example to multiple sequential decisions
• Each decision affects subsequent decisions
This is formally modeled by a Markov Decision Process (MDP)
(Diagram: a six-state MDP with transitions 0 –A→ 1 (reward 1), 0 –B→ 2 (reward 2), 1 –A→ 3 (reward 1), 1 –B→ 4 (reward 1), 2 –A→ 4 (reward −1000), 3 –A→ 5 (reward 1), 4 –A→ 5 (reward 10))
Markov Decision Processes
Formally, an MDP is
• A set of states, S = {s_1, s_2, ..., s_n}
• A set of actions, A = {a_1, a_2, ..., a_m}
• A reward function, R: S × A × S → ℝ
• A transition function, P^a_{ij} = P(s_{t+1} = j | s_t = i, a_t = a)

We want to learn a policy, π: S → A
• Maximize the sum of rewards we see over our lifetime
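The tuple (S, A, R, P) maps directly onto plain data structures. A sketch with a made-up three-state MDP (every particular here — the states, actions, rewards, and probabilities — is an illustrative choice):

```python
S = [0, 1, 2]                       # states
A = ['stay', 'go']                  # actions

def R(s, a, s2):                    # reward function, R: S x A x S -> reals
    return 1.0 if s2 == 2 else 0.0

def P(s, a):                        # transition function: returns {s': P(s'|s,a)}
    nxt = min(s + 1, 2)
    if a == 'stay' or nxt == s:
        return {s: 1.0}
    return {nxt: 0.8, s: 0.2}

pi = {0: 'go', 1: 'go', 2: 'stay'}  # a policy, pi: S -> A
```

The only structural requirement on `P` is that each returned distribution sums to 1.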
Policies
There are 3 policies for this MDP
1. 0 → 1 → 3 → 5
2. 0 → 1 → 4 → 5
3. 0 → 2 → 4 → 5

Which is the best one?
(Diagram: the same six-state MDP as on the previous slide)
Comparing Policies
Order policies by how much reward they see
1. 0 → 1 → 3 → 5 = 1 + 1 + 1 = 3
2. 0 → 1 → 4 → 5 = 1 + 1 + 10 = 12
3. 0 → 2 → 4 → 5 = 2 − 1000 + 10 = −988
(Diagram: the same six-state MDP as on the previous slide)
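The three sums above can be checked mechanically. A sketch with the diagram's transitions written out as a lookup table:

```python
# (state, action) -> (next_state, reward), read off the slide's MDP diagram.
T = {(0, 'A'): (1, 1), (0, 'B'): (2, 2),
     (1, 'A'): (3, 1), (1, 'B'): (4, 1),
     (2, 'A'): (4, -1000),
     (3, 'A'): (5, 1), (4, 'A'): (5, 10)}

def policy_return(actions, start=0):
    """Total reward from following a fixed sequence of actions."""
    s, total = start, 0
    for a in actions:
        s, r = T[(s, a)]
        total += r
    return total

print(policy_return(['A', 'A', 'A']))   # policy 1: 0->1->3->5, prints 3
print(policy_return(['A', 'B', 'A']))   # policy 2: 0->1->4->5, prints 12
print(policy_return(['B', 'A', 'A']))   # policy 3: 0->2->4->5, prints -988
```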
Value Functions
We can define value without specifying the policy
• Specify the value of taking action a from state s and then performing optimally
• This is the state-action value function, Q
Q(0, A) = 12   Q(0, B) = −988
Q(1, A) = 2    Q(1, B) = 11
Q(2, A) = −990
Q(3, A) = 1
Q(4, A) = 10

(Diagram: the same six-state MDP, annotated with these Q-values)

How do you tell which action to take from each state?
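These Q-values can be reproduced by backing up rewards from the terminal state. A memoised-recursion sketch, using the same illustrative transition table as before:

```python
# (state, action) -> (next_state, reward); state 5 is terminal.
T = {(0, 'A'): (1, 1), (0, 'B'): (2, 2),
     (1, 'A'): (3, 1), (1, 'B'): (4, 1),
     (2, 'A'): (4, -1000),
     (3, 'A'): (5, 1), (4, 'A'): (5, 10)}

def q_values(T):
    """Q(s, a) = r + max_a' Q(s', a'), with value 0 at terminal states."""
    Q = {}
    def q(s, a):
        if (s, a) not in Q:
            s2, r = T[(s, a)]
            successors = [q(s2, a2) for (s3, a2) in T if s3 == s2]
            Q[(s, a)] = r + (max(successors) if successors else 0)
        return Q[(s, a)]
    for (s, a) in T:
        q(s, a)
    return Q

Q = q_values(T)
print(Q[(0, 'A')], Q[(0, 'B')])   # prints: 12 -988
```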
Value Functions
So, we have the value function
• Q(s, a) = R(s, a, s′) + max_{a′} Q(s′, a′), where s′ is the next state

In the form of
• Next reward plus the best I can do from the next state

These extend to probabilistic actions:
• Q(s, a) = Σ_{s′} P^a_{s s′} [ R(s, a, s′) + max_{a′} Q(s′, a′) ]
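One synchronous sweep of the probabilistic backup can be written directly from the formula. The two-state environment at the bottom is a made-up example for exercising it, not from the slides:

```python
def bellman_sweep(Q, states, actions, P, R):
    """One sweep of Q(s, a) = sum_s' P(s'|s, a) * (R(s, a, s') + max_a' Q(s', a'))."""
    def v(s2):
        return max((Q.get((s2, a2), 0.0) for a2 in actions(s2)), default=0.0)
    return {(s, a): sum(p * (R(s, a, s2) + v(s2)) for s2, p in P(s, a).items())
            for s in states for a in actions(s)}

# Illustrative environment: action 'go' from state 0 reaches terminal state 1
# with probability 0.9 (reward 1), otherwise stays in state 0 (reward 0).
states = [0, 1]
actions = lambda s: ['go'] if s == 0 else []
P = lambda s, a: {1: 0.9, 0: 0.1}
R = lambda s, a, s2: 1.0 if s2 == 1 else 0.0

Q = {}
for _ in range(100):
    Q = bellman_sweep(Q, states, actions, P, R)
# Q[(0, 'go')] converges to 1.0: the agent eventually reaches state 1.
```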
Getting the Policy
If we have the value function, then finding the best policy is easy
• π(s) = arg max_a Q(s, a)

We’re looking for the optimal policy, π*(s)
• No policy generates more reward than π*

The optimal policy defines the optimal value function:
• Q*(s, a) = R(s, a, s′) + max_{a′} Q*(s′, a′)

The easiest way to learn the optimal policy is to learn the optimal value function first
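Extracting the greedy policy from Q is a per-state arg max. A sketch, fed with the Q-values from the six-state example:

```python
def greedy_policy(Q):
    """pi(s) = arg max_a Q(s, a), for Q given as a {(state, action): value} dict."""
    pi = {}
    for (s, a), value in Q.items():
        if s not in pi or value > Q[(s, pi[s])]:
            pi[s] = a
    return pi

Q = {(0, 'A'): 12, (0, 'B'): -988, (1, 'A'): 2, (1, 'B'): 11,
     (2, 'A'): -990, (3, 'A'): 1, (4, 'A'): 10}
print(greedy_policy(Q))   # prints: {0: 'A', 1: 'B', 2: 'A', 3: 'A', 4: 'A'}
```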
Collaborative Reinforcement Learning to Adaptively Optimize MANET Routing
Jim Dowling, Eoin Curran, Raymond Cunningham and Vinny Cahill
Overview
Building autonomic distributed systems with self-* properties
• Self-Organizing
• Self-Healing
• Self-Optimizing

Add a collaborative learning mechanism to a self-adaptive component model

Improved ad-hoc routing protocol
Introduction
Autonomous distributed systems will consist of interacting components free from human interference
• Existing top-down management and programming solutions require too much global state
• Bottom-up: a decentralized collection of components that make their own decisions based on local information
• System-wide self-* behavior emerges from their interactions
Self-* Behavior
Self-adaptive components change structure and/or behavior at run-time, adapting to
• discovered faults
• reduced performance

Requires active monitoring of component states and external dependencies
Self-* Distributed Systems using Distributed (collaborative) Reinforcement Learning
For complex systems, programmers cannot be expected to describe all conditions
• Self-adaptive behavior is learnt by components
• Decentralized coordination of components to support system-wide properties
• Distributed Reinforcement Learning (DRL) is an extension to RL that uses neighbor interactions only
Model-Based Reinforcement Learning
Q(s, a) = R(s, a) + Σ_{s′∈S} P(s′ | s, a) · V(s′)

(Diagram: a Component is coupled to its MDP model through an adaptation contract; at each step it performs action a_t, then observes reward r_{t+1} and state s_{t+1}. Labelled parts: 1. Action Reward, 2. State Transition Model, 3. Next State Reward)

Markov Decision Process = ({States}, {Actions}, R(States, Actions), P(States, Actions, States))
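"Model-based" means the agent maintains estimates of R and P from experience and backs Q up through them. A sketch using simple visit counts; the `record`/`P_hat`/`R_hat` names are illustrative, not from the paper:

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': visit count}
rewards = defaultdict(list)                      # (s, a) -> observed rewards

def record(s, a, r, s2):
    """Log one experience tuple (s, a, r, s')."""
    counts[(s, a)][s2] += 1
    rewards[(s, a)].append(r)

def P_hat(s, a):
    """Empirical state transition model, P(s' | s, a)."""
    total = sum(counts[(s, a)].values())
    return {s2: n / total for s2, n in counts[(s, a)].items()}

def R_hat(s, a):
    """Empirical mean action reward, R(s, a)."""
    rs = rewards[(s, a)]
    return sum(rs) / len(rs)

record(0, 'go', 1.0, 1)
record(0, 'go', 1.0, 1)
record(0, 'go', 0.0, 0)
# P_hat(0, 'go') now assigns probability 2/3 to s'=1 and 1/3 to s'=0.
```

The estimated `R_hat` and `P_hat` slot straight into the Q backup above in place of the true R and P.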
Decentralised System Optimisation
Coordinating the solution to a set of Discrete Optimisation Problems (DOPs)
• Components have a partial system view
• Coordination actions
  • Actions = {delegation} ∪ {DOP actions} ∪ {discovery}
• Connection costs

(Diagram: components A, B and C with causally-connected states, and a delegation between components)
Collaborative Reinforcement Learning
Advertisement
• Update partial views of neighbours

Decay
• Negative feedback on state values in the absence of advertisements

Q_i(s, a) = R(s, a) + Σ_{s′∈S} P(s′ | s, a) · ( Decay(V_{s′}) + D(s′ | s, a) )

where R(s, a) is the action reward, P(s′ | s, a) the state transition model, V_{s′} the cached neighbour’s V-value (subject to decay), and D(s′ | s, a) the connection cost.
Adaptation in CRL System
A feedback process responding to
• changes in the optimal policy of any RL agent
• changes in the system environment
• the passage of time
SAMPLE: Ad-hoc Routing using DRL
Probabilistic ad-hoc routing protocol based on DRL
• Adaptation of network traffic around areas of congestion
• Exploitation of stable routes

Routing decisions are based on local information and information obtained from neighbors

Outperforms Ad-hoc On-Demand Distance Vector routing (AODV) and Dynamic Source Routing (DSR)
SAMPLE: A CRL System (I)
SAMPLE: A CRL System (II)
Instead of always choosing the neighbor with the best Q value, i.e., taking the delegation action
a = arg max_a Q_i(B, a),
a neighbor is chosen probabilistically
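One standard way to choose probabilistically as a function of Q-values is a Boltzmann (softmax) distribution. Whether SAMPLE uses exactly this form is not stated here, so treat this as an illustrative sketch:

```python
import math
import random

def boltzmann_choice(q_values, temperature=1.0, rng=random):
    """Pick a key with probability proportional to exp(Q / temperature)."""
    names = list(q_values)
    weights = [math.exp(q_values[n] / temperature) for n in names]
    r = rng.random() * sum(weights)
    for name, w in zip(names, weights):
        r -= w
        if r <= 0:
            return name
    return names[-1]     # guard against floating-point leftovers
```

A low temperature approaches the greedy arg-max choice; a high temperature approaches uniform random choice, trading exploitation of stable routes against exploration of alternatives.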
SAMPLE: A CRL System (III)
P_i(s′ | s, a_j) = E(C_S / C_A)
SAMPLE: A CRL System (IV)
Performance
Metrics:
• Maximize
  • throughput
  • ratio of delivered packets to undelivered packets
• Minimize
  • number of transmissions required per packet sent

Figures 5–10
Questions/Discussions