Hybrid Agent-Based Modeling: Architectures, Analyses and Applications (Stage One). Li, Hailin.
1
Hybrid Agent-Based Modeling: Architectures, Analyses and Applications (Stage One)
Li, Hailin
2
Outline
Introduction
Least-Squares Method for Reinforcement Learning
Evolutionary Algorithms for the RL Problem (in progress)
Technical Analysis Based upon a Hybrid Agent-Based Architecture (in progress)
Conclusion (Stage One)
3
Introduction: Learning From Interaction
Interact with the environment
Learn from the consequences of actions to achieve goals
No explicit teacher, but experience
Examples:
A chess player in a game
Someone preparing food
The actions of a gazelle calf after it is born
4
Introduction: Characteristics
Decision making in an uncertain environment
Actions affect the future situation; their effects cannot be fully predicted
Goals are explicit
Use experience to improve performance
5
Introduction: What Is to Be Learned
A mapping from situations to actions that maximizes a scalar reward or reinforcement signal
Learning: the agent does not need to be told which actions to take; it must discover which actions yield the most reward by trying them
6
Introduction: Challenge
An action may affect not only the immediate reward but also the next situation, and consequently all subsequent rewards
Trial-and-error search
Delayed reward
7
Introduction: Exploration and Exploitation
Exploit what it already knows in order to obtain reward
Explore in order to make better action selections in the future
Neither can be pursued exclusively without failing at the task
Trade-off
8
Introduction: Components of an Agent
Policy: the decision-making function
Reward (total, average, or discounted): the good and bad events for the agent
Value: rewards in the long run
Model of the environment: describes the behavior of the environment
9
Introduction: Markov Property & Markov Decision Processes
“Independence of path”: all that matters is contained in the current state signal
A reinforcement learning task that satisfies the Markov property is called a Markov decision process (MDP)
Finite Markov Decision Process (MDP):
$$P_{ss'}^{a} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}$$
$$R_{ss'}^{a} = E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}$$
10
Introduction: Three Categories of Methods for Solving the Reinforcement Learning Problem
Dynamic programming: requires a complete and accurate model of the environment; performs a full backup operation on each state
Monte Carlo methods: a backup for each state is based on the entire sequence of observed rewards from that state until the end of the episode
Temporal-difference learning: approximate the optimal value function, and view the approximation as an adequate guide
11
LS Method for Reinforcement Learning
For the stochastic dynamic system $x_{t+1} = f(x_t, a_t, w_t)$:
$x_t$: current state
$a_t$: control decision generated by the policy
$w_t$: disturbance independently sampled from some fixed distribution
$x_0, x_1, x_2, \ldots$ is a Markov chain
The MDP can be denoted by a quadruple $(S, A, P, R)$:
$S$: state set
$A$: action set
$P$: state transition probability
$R$: the reward function, $g(x_t, a_t)$
The policy is a mapping $\pi: S \to \Pr(A)$
12
LS Method for Reinforcement Learning
For each policy $\pi$, the value function is defined by the equation:
$$J^{\pi}(x) = E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, g(x_t, a_t) \,\middle|\, x_0 = x\right], \qquad \gamma \in (0, 1)$$
The optimal value function is defined by
$$J^{*}(x) = \max_{\pi} J^{\pi}(x)$$
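As a quick numeric illustration of this definition, here is a minimal Python sketch (the function name and the sample rewards are hypothetical) that computes the discounted sum for one sampled trajectory:

```python
def discounted_return(rewards, gamma=0.95):
    """Sum of gamma^t * g(x_t, a_t) over one sampled trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.0 + 0.95**2 * 2.0
```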
13
LS Method for Reinforcement Learning
The optimal action can be generated through
$$a_t = \arg\max_{a \in A} E_{w}\!\left[\, g(x_t, a) + \gamma\, J^{*}(f(x_t, a, w)) \,\right]$$
Introducing the Q value function
$$Q(x, a) = E_{w}\!\left[\, g(x, a) + \gamma\, J^{*}(f(x, a, w)) \,\right]$$
the optimal action can now be generated through
$$a_t = \arg\max_{a \in A} Q(x_t, a)$$
14
LS Method for Reinforcement Learning
The exact Q-values for all state-action pairs can be obtained by solving the Bellman equations (full backups):
$$Q^{\pi}(x_t, a_t) = \sum_{x_{t+1}} P(x_t, a_t, x_{t+1})\left[\, g(x_t, a_t, x_{t+1}) + \gamma\, Q^{\pi}(x_{t+1}, \pi(x_{t+1})) \,\right]$$
or, in matrix format:
$$Q^{\pi} = R + \gamma\, P\, Q^{\pi}$$
where $P$ is an $(|S| \cdot |A|) \times (|S| \cdot |A|)$ matrix denoting the transition probability from $(x_t, a_t)$ to $(x_{t+1}, \pi(x_{t+1}))$
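For a small finite MDP this matrix equation is an ordinary linear system, so the exact $Q^{\pi}$ can be recovered directly from $Q^{\pi} = (I - \gamma P)^{-1} R$. A minimal sketch, assuming NumPy and a made-up two-pair problem (the values of $P$ and $R$ are illustrative only):

```python
import numpy as np

gamma = 0.9
# Made-up transition matrix between two state-action pairs under a
# fixed policy (rows sum to 1) and their expected one-step rewards.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
R = np.array([1.0, 0.0])

# Q = R + gamma * P @ Q  =>  (I - gamma * P) @ Q = R
Q = np.linalg.solve(np.eye(2) - gamma * P, R)
print(Q)
```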
15
LS Method for Reinforcement Learning
Traditional Q-learning
A popular variant of temporal-difference learning used to approximate Q value functions.
In the absence of a model of the MDP, it uses sample data $(x_t, a_t, r_t, x_{t+1})$.
The temporal difference $d_t$ is defined as:
$$d_t = g(x_t, a_t) + \gamma \max_{a} Q(x_{t+1}, a) - Q(x_t, a_t)$$
For one-step Q-learning, the update equation is:
$$Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha\, d_t, \qquad \alpha \in (0, 1]$$
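A minimal sketch of this one-step update for a tabular Q function (the array layout and parameter values are assumptions for illustration, not the paper's setup):

```python
import numpy as np

def q_learning_update(Q, x, a, r, x_next, alpha=0.1, gamma=0.95):
    """One-step tabular Q-learning; Q is a (n_states, n_actions) array."""
    d = r + gamma * np.max(Q[x_next]) - Q[x, a]  # temporal difference d_t
    Q[x, a] += alpha * d
    return Q
```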
16
LS Method for Reinforcement Learning
The final decision based upon Q-learning:
$$a_t = \arg\max_{a \in A} Q(x_t, a)$$
The reasons for the development of approximation methods:
Size of the state-action space
The overwhelming requirement for computation
The categories of approximation methods for machine learning:
Model approximation
Policy approximation
Value function approximation
17
LS Method for Reinforcement Learning
Model-Free Least-Squares Q-learning
A linear function approximator:
$$\hat{Q}(x, a) = \sum_{k=1}^{K} w_k\, \phi_k(x, a) = \phi(x, a)^{T} W$$
$\phi_1, \ldots, \phi_K$: basis functions
$W = (w_1, \ldots, w_K)'$: a vector of scalar weights
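A minimal sketch of such an approximator, assuming a scalar state, Gaussian RBF basis functions, and a block of features per action (all of these choices are illustrative, not prescribed by the slides):

```python
import numpy as np

def features(x, a, centers, width=1.0, n_actions=2):
    """phi(x, a): Gaussian RBF features of a scalar state x,
    placed in the block belonging to action a (zero elsewhere)."""
    rbf = np.exp(-((x - centers) ** 2) / (2.0 * width ** 2))
    phi = np.zeros(len(centers) * n_actions)
    phi[a * len(centers):(a + 1) * len(centers)] = rbf
    return phi

def q_hat(x, a, W, centers):
    # Q_hat(x, a) = phi(x, a)^T W
    return features(x, a, centers) @ W
```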
18
LS Method for Reinforcement Learning
For a fixed policy $\pi$:
$$\hat{Q}^{\pi} = \Phi W$$
where $\Phi$ is an $(|S| \cdot |A|) \times K$ matrix and $K \ll |S| \cdot |A|$.
If the model of the MDP, $(P, R)$, is available:
$$\Phi = \begin{bmatrix} \phi(x_1, a_1)^{T} \\ \vdots \\ \phi(x, a)^{T} \\ \vdots \\ \phi(x_{|S|}, a_{|A|})^{T} \end{bmatrix}, \qquad R = \begin{bmatrix} \sum_{x_{t+1}} P(x_1, a_1, x_{t+1})\, g(x_1, a_1, x_{t+1}) \\ \vdots \\ \sum_{x_{t+1}} P(x_t, a_t, x_{t+1})\, g(x_t, a_t, x_{t+1}) \\ \vdots \\ \sum_{x_{t+1}} P(x_{|S|}, a_{|A|}, x_{t+1})\, g(x_{|S|}, a_{|A|}, x_{t+1}) \end{bmatrix}$$
19
LS Method for Reinforcement Learning
The policy weights are given by
$$W^{\pi} = A^{-1} B$$
where $A = \Phi^{T}(\Phi - \gamma P \Phi)$ and $B = \Phi^{T} R$.
If the model of the MDP, $(P, R)$, is not available (model-free), then given samples $(x_t^{i}, a_t^{i}, r_t^{i}, x_{t+1}^{i})$, $i = 1, 2, \ldots, L$, the matrices are built up sample by sample:
$$A \leftarrow A + \phi(x_t, a_t)\left[\, \phi(x_t, a_t) - \gamma\, \phi(x_{t+1}, \pi(x_{t+1})) \,\right]^{T}$$
$$B \leftarrow B + \phi(x_t, a_t)\, r_t$$
with $A_0 = 0$, $B_0 = 0$, and $\gamma \in (0, 1)$.
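A minimal sketch of this model-free accumulation (the evaluation step often called LSTDQ in the LSPI literature), assuming NumPy and hypothetical `phi` and `policy` callables:

```python
import numpy as np

def lstdq(samples, policy, phi, k, gamma=0.95):
    """Model-free least-squares weights for a fixed policy.

    samples: iterable of (x, a, r, x_next) tuples
    policy:  callable mapping a state to the policy's action
    phi:     callable mapping (x, a) to a length-k feature vector
    """
    A = np.zeros((k, k))
    B = np.zeros(k)
    for x, a, r, x_next in samples:
        f = phi(x, a)
        f_next = phi(x_next, policy(x_next))
        A += np.outer(f, f - gamma * f_next)
        B += f * r
    # W = A^{-1} B (a small ridge term can be added if A is singular)
    return np.linalg.solve(A, B)
```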
20
LS Method for Reinforcement Learning
The optimal policy can be found through
$$\pi^{t+1}(x) = \arg\max_{a} \hat{Q}^{\pi^{t}}(x, a) = \arg\max_{a}\, \phi(x, a)^{T} W^{\pi^{t}}$$
The greedy policy is represented by the parameter vector $W^{\pi^{t}}$ and can be determined on demand for any given state.
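Building on the `lstdq` sketch above, the greedy policy needs no explicit table: it is evaluated on demand from $W$. A hypothetical policy-iteration loop (iteration count and names are illustrative):

```python
import numpy as np

def greedy_action(x, W, phi, actions):
    # pi(x) = argmax_a phi(x, a)^T W, computed on demand from W
    return max(actions, key=lambda a: phi(x, a) @ W)

def lspi(samples, phi, k, actions, n_iters=20):
    """Alternate policy evaluation (lstdq, defined above) with
    greedy policy improvement until the weights settle."""
    W = np.zeros(k)
    for _ in range(n_iters):
        policy = lambda x, W=W: greedy_action(x, W, phi, actions)
        W = lstdq(samples, policy, phi, k)
    return W
```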
21
LS Method for Reinforcement Learning
Simulation
The system is hard to model but easy to simulate
Simulation implicitly indicates the features of the system in terms of the state visiting frequency
Orthogonal least-squares algorithm for training an RBF network
A systematic learning approach for solving the center selection problem
The newly added center always maximizes the amount of energy of the desired network output
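A minimal sketch of this forward-selection idea, assuming Gaussian RBFs, candidate centers taken from the training inputs, and the standard error-reduction ratio as the selection criterion (details such as the width and tolerance are assumptions):

```python
import numpy as np

def rbf(X, c, width):
    # Gaussian RBF response of all inputs X (N x d) for one center c (d,)
    return np.exp(-np.sum((X - c) ** 2, axis=1) / (2.0 * width ** 2))

def ols_select_centers(X, d, width, n_centers):
    """Forward selection of RBF centers by orthogonal least squares.

    X: (N, dim) candidate inputs (each is also a candidate center)
    d: (N,) desired output vector
    Returns the indices of the selected centers.
    """
    N = X.shape[0]
    # Candidate regressor matrix: column j is the response of center j
    P = np.column_stack([rbf(X, X[j], width) for j in range(N)])
    selected, W = [], []  # W holds the orthogonalized chosen columns
    for _ in range(n_centers):
        best_err, best_j, best_w = -1.0, None, None
        for j in range(N):
            if j in selected:
                continue
            w = P[:, j].copy()
            # Gram-Schmidt: remove components along the selected basis
            for u in W:
                w -= (u @ P[:, j]) / (u @ u) * u
            denom = w @ w
            if denom < 1e-12:
                continue
            # Error-reduction ratio ("energy" explained by this candidate)
            err = (w @ d) ** 2 / (denom * (d @ d))
            if err > best_err:
                best_err, best_j, best_w = err, j, w
        selected.append(best_j)
        W.append(best_w)
    return selected
```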
22
Hybrid Least-Squares Method
LS Method for Reinforcement Learning
Least-Squares Policy Iteration (LSPI) algorithm
Simulation & Orthogonal Least-Squares regression
[Architecture diagram with components: Environment; State; Action; Feature Configuration; Reward; Optimal policy]
23
LS Method for Reinforcement Learning
24
Simulation
Cart-Pole System
Dynamics of the cart-pole system:
$$\ddot{\theta}_t = \frac{g \sin\theta_t + \cos\theta_t \left[ \frac{-F_t - m_p l\, \dot{\theta}_t^{2} \sin\theta_t + \mu_c\, \mathrm{sgn}(\dot{x}_t)}{m_c + m_p} \right] - \frac{\mu_p \dot{\theta}_t}{m_p l}}{l \left[ \frac{4}{3} - \frac{m_p \cos^{2}\theta_t}{m_c + m_p} \right]}$$
$$\ddot{x}_t = \frac{F_t + m_p l \left[ \dot{\theta}_t^{2} \sin\theta_t - \ddot{\theta}_t \cos\theta_t \right] - \mu_c\, \mathrm{sgn}(\dot{x}_t)}{m_c + m_p}$$
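A minimal sketch of these dynamics with Euler integration; the physical constants below are typical values for this benchmark, not necessarily those used in this work:

```python
import numpy as np

# Typical cart-pole benchmark constants (assumed, not from the slides)
G, M_C, M_P, L = 9.8, 1.0, 0.1, 0.5   # gravity, cart/pole mass, half-length
MU_C, MU_P = 0.0005, 0.000002         # cart and pole friction coefficients

def cart_pole_step(theta, theta_dot, x, x_dot, F, dt=0.02):
    """One Euler-integration step of the dynamics above."""
    total = M_C + M_P
    tmp = (-F - M_P * L * theta_dot ** 2 * np.sin(theta)
           + MU_C * np.sign(x_dot)) / total
    theta_acc = (G * np.sin(theta) + np.cos(theta) * tmp
                 - MU_P * theta_dot / (M_P * L)) / (
                 L * (4.0 / 3.0 - M_P * np.cos(theta) ** 2 / total))
    x_acc = (F + M_P * L * (theta_dot ** 2 * np.sin(theta)
             - theta_acc * np.cos(theta))
             - MU_C * np.sign(x_dot)) / total
    return (theta + dt * theta_dot, theta_dot + dt * theta_acc,
            x + dt * x_dot, x_dot + dt * x_acc)
```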
25
Simulation
26
Conclusion (Stage One)
From the reinforcement learning perspective, the intractability of exact solutions to sequential decision problems calls for value function approximation methods.
At present, linear function approximators are the best alternative as an approximation architecture, mainly due to their transparent structure.
The model-free least-squares policy iteration (LSPI) method is a promising algorithm that uses a linear approximation architecture to achieve policy optimization in the spirit of Q-learning. It may converge in surprisingly few steps.
Inspired by the orthogonal least-squares regression method for selecting the centers of an RBF neural network, a new hybrid learning method for LSPI can produce more robust and human-independent solutions.