Hybrid Agent-Based Modeling: Architectures, Analyses and Applications (Stage One). Li, Hailin.
1
Hybrid Agent-Based Modeling: Architectures, Analyses and Applications (Stage One)
Li, Hailin
2
Outline
Introduction
Least-Squares Method for Reinforcement Learning
Evolutionary Algorithms for the RL Problem (in progress)
Technical Analysis Based upon a Hybrid Agent-Based Architecture (in progress)
Conclusion (Stage One)
3
Introduction: Learning From Interaction
Interact with the environment
Learn from the consequences of actions to achieve goals
No explicit teacher, but experience
Examples:
A chess player in a game
Someone preparing food
The actions of a gazelle calf after it is born
4
Introduction: Characteristics
Decision making in an uncertain environment
Actions affect the future situation; their effects cannot be fully predicted
Goals are explicit
Use experience to improve performance
5
Introduction: What Is to Be Learned
A mapping from situations to actions that maximizes a scalar reward or reinforcement signal
Learning: the agent does not need to be told which actions to take; it must discover which actions yield the most reward by trying them
6
Introduction: Challenge
An action may affect not only the immediate reward but also the next situation, and consequently all subsequent rewards
Trial-and-error search
Delayed reward
7
Introduction: Exploration and Exploitation
Exploit what it already knows in order to obtain reward
Explore in order to make better action selections in the future
Neither can be pursued exclusively without failing at the task
Trade-off
8
Introduction: Components of an Agent
Policy: the decision-making function
Reward (total, average, or discounted): the good and bad events for the agent
Value: rewards in the long run
Model of the environment: describes the behavior of the environment
9
Introduction: Markov Property & Markov Decision Processes
“Independence of path”: all that matters is contained in the current state signal
A reinforcement learning task that satisfies the Markov property is called a Markov decision process (MDP)
Finite Markov Decision Process (MDP):
$$P_{ss'}^{a} = \Pr\{\, s_{t+1} = s' \mid s_t = s,\ a_t = a \,\}$$
$$R_{ss'}^{a} = E\{\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,\}$$
10
Introduction: Three Categories of Methods for Solving the Reinforcement Learning Problem
Dynamic programming: requires a complete and accurate model of the environment; performs a full backup operation on each state
Monte Carlo methods: a backup for each state is based on the entire sequence of observed rewards from that state until the end of the episode
Temporal-difference learning: approximate the optimal value function, and view the approximation as an adequate guide
11
LS Method for Reinforcement Learning
For the stochastic dynamic system $x_{t+1} = f(x_t, a_t, w_t)$:
$x_t$: current state
$a_t$: control decision generated by the policy
$w_t$: disturbance independently sampled from some fixed distribution
$x_0, x_1, x_2, \ldots$ is a Markov chain
The MDP can be denoted by a quadruple $(S, A, P, R)$:
$S$: state set
$A$: action set
$P$: state transition probability
$R$: the reward function, $g(x_t, a_t)$
The policy is a mapping $\pi: S \to \Pr(A)$
12
LS Method for Reinforcement Learning
For each policy $\pi$, the value function is defined by the equation:
$$J^{\pi}(x) = E\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, g(x_t, a_t) \,\middle|\, x_0 = x\right], \qquad \gamma \in (0, 1)$$
The optimal value function is defined by
$$J^{*}(x) = \max_{\pi} J^{\pi}(x)$$
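As a quick numeric illustration of this definition, here is a minimal Python sketch (the function name and the sample rewards are hypothetical) that computes the discounted sum for one sampled trajectory:

```python
def discounted_return(rewards, gamma=0.95):
    """Sum of gamma^t * g(x_t, a_t) over one sampled trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.0 + 0.95**2 * 2.0
```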
13
LS Method for Reinforcement Learning
The optimal action can be generated through
$$a_t = \arg\max_{a \in A} E_{w}\!\left[\, g(x_t, a) + \gamma\, J^{*}(f(x_t, a, w)) \,\right]$$
Introducing the Q value function
$$Q(x, a) = E_{w}\!\left[\, g(x, a) + \gamma\, J^{*}(f(x, a, w)) \,\right]$$
the optimal action can now be generated through
$$a_t = \arg\max_{a \in A} Q(x_t, a)$$
14
LS Method for Reinforcement Learning
The exact Q-values for all state-action pairs can be obtained by solving the Bellman equations (full backups):
$$Q^{\pi}(x_t, a_t) = \sum_{x_{t+1}} P(x_t, a_t, x_{t+1})\left[\, g(x_t, a_t, x_{t+1}) + \gamma\, Q^{\pi}(x_{t+1}, \pi(x_{t+1})) \,\right]$$
or, in matrix format:
$$Q^{\pi} = R + \gamma\, P\, Q^{\pi}$$
where $P$ is an $(|S| \cdot |A|) \times (|S| \cdot |A|)$ matrix denoting the transition probability from $(x_t, a_t)$ to $(x_{t+1}, \pi(x_{t+1}))$
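For a small finite MDP this matrix equation is an ordinary linear system, so the exact $Q^{\pi}$ can be recovered directly from $Q^{\pi} = (I - \gamma P)^{-1} R$. A minimal sketch, assuming NumPy and a made-up two-pair problem (the values of $P$ and $R$ are illustrative only):

```python
import numpy as np

gamma = 0.9
# Made-up transition matrix between two state-action pairs under a
# fixed policy (rows sum to 1) and their expected one-step rewards.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
R = np.array([1.0, 0.0])

# Q = R + gamma * P @ Q  =>  (I - gamma * P) @ Q = R
Q = np.linalg.solve(np.eye(2) - gamma * P, R)
print(Q)
```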
15
LS Method for Reinforcement Learning
Traditional Q-learning
A popular variant of temporal-difference learning used to approximate Q value functions.
In the absence of a model of the MDP, it uses sample data $(x_t, a_t, r_t, x_{t+1})$.
The temporal difference $d_t$ is defined as:
$$d_t = g(x_t, a_t) + \gamma \max_{a} Q(x_{t+1}, a) - Q(x_t, a_t)$$
For one-step Q-learning, the update equation is:
$$Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha\, d_t, \qquad \alpha \in (0, 1]$$
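A minimal sketch of this one-step update for a tabular Q function (the array layout and parameter values are assumptions for illustration, not the paper's setup):

```python
import numpy as np

def q_learning_update(Q, x, a, r, x_next, alpha=0.1, gamma=0.95):
    """One-step tabular Q-learning; Q is a (n_states, n_actions) array."""
    d = r + gamma * np.max(Q[x_next]) - Q[x, a]  # temporal difference d_t
    Q[x, a] += alpha * d
    return Q
```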
16
LS Method for Reinforcement Learning
The final decision based upon Q-learning:
$$a_t = \arg\max_{a \in A} Q(x_t, a)$$
The reasons for the development of approximation methods:
Size of the state-action space
The overwhelming requirement for computation
The categories of approximation methods for machine learning:
Model approximation
Policy approximation
Value function approximation
17
LS Method for Reinforcement Learning
Model-Free Least-Squares Q-learning
A linear function approximator:
$$\hat{Q}(x, a) = \sum_{k=1}^{K} w_k\, \phi_k(x, a) = \phi(x, a)^{T} W$$
$\phi_1, \ldots, \phi_K$: basis functions
$W = (w_1, \ldots, w_K)'$: a vector of scalar weights
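A minimal sketch of such an approximator, assuming a scalar state, Gaussian RBF basis functions, and a block of features per action (all of these choices are illustrative, not prescribed by the slides):

```python
import numpy as np

def features(x, a, centers, width=1.0, n_actions=2):
    """phi(x, a): Gaussian RBF features of a scalar state x,
    placed in the block belonging to action a (zero elsewhere)."""
    rbf = np.exp(-((x - centers) ** 2) / (2.0 * width ** 2))
    phi = np.zeros(len(centers) * n_actions)
    phi[a * len(centers):(a + 1) * len(centers)] = rbf
    return phi

def q_hat(x, a, W, centers):
    # Q_hat(x, a) = phi(x, a)^T W
    return features(x, a, centers) @ W
```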
18
LS Method for Reinforcement Learning
For a fixed policy $\pi$:
$$\hat{Q}^{\pi} = \Phi W$$
where $\Phi$ is an $(|S| \cdot |A|) \times K$ matrix and $K \ll |S| \cdot |A|$.
If the model of the MDP, $(P, R)$, is available:
$$\Phi = \begin{bmatrix} \phi(x_1, a_1)^{T} \\ \vdots \\ \phi(x, a)^{T} \\ \vdots \\ \phi(x_{|S|}, a_{|A|})^{T} \end{bmatrix}, \qquad R = \begin{bmatrix} \sum_{x_{t+1}} P(x_1, a_1, x_{t+1})\, g(x_1, a_1, x_{t+1}) \\ \vdots \\ \sum_{x_{t+1}} P(x_t, a_t, x_{t+1})\, g(x_t, a_t, x_{t+1}) \\ \vdots \\ \sum_{x_{t+1}} P(x_{|S|}, a_{|A|}, x_{t+1})\, g(x_{|S|}, a_{|A|}, x_{t+1}) \end{bmatrix}$$
19
LS Method for Reinforcement Learning
The policy weights are given by
$$W^{\pi} = A^{-1} B$$
where $A = \Phi^{T}(\Phi - \gamma P \Phi)$ and $B = \Phi^{T} R$.
If the model of the MDP, $(P, R)$, is not available (model-free), then given samples $(x_t^{i}, a_t^{i}, r_t^{i}, x_{t+1}^{i})$, $i = 1, 2, \ldots, L$, the matrices are built up sample by sample:
$$A \leftarrow A + \phi(x_t, a_t)\left[\, \phi(x_t, a_t) - \gamma\, \phi(x_{t+1}, \pi(x_{t+1})) \,\right]^{T}$$
$$B \leftarrow B + \phi(x_t, a_t)\, r_t$$
with $A_0 = 0$, $B_0 = 0$, and $\gamma \in (0, 1)$.
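A minimal sketch of this model-free accumulation (the evaluation step often called LSTDQ in the LSPI literature), assuming NumPy and hypothetical `phi` and `policy` callables:

```python
import numpy as np

def lstdq(samples, policy, phi, k, gamma=0.95):
    """Model-free least-squares weights for a fixed policy.

    samples: iterable of (x, a, r, x_next) tuples
    policy:  callable mapping a state to the policy's action
    phi:     callable mapping (x, a) to a length-k feature vector
    """
    A = np.zeros((k, k))
    B = np.zeros(k)
    for x, a, r, x_next in samples:
        f = phi(x, a)
        f_next = phi(x_next, policy(x_next))
        A += np.outer(f, f - gamma * f_next)
        B += f * r
    # W = A^{-1} B (a small ridge term can be added if A is singular)
    return np.linalg.solve(A, B)
```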
20
LS Method for Reinforcement Learning
The optimal policy can be found through
$$\pi^{t+1}(x) = \arg\max_{a} \hat{Q}^{\pi^{t}}(x, a) = \arg\max_{a}\, \phi(x, a)^{T} W^{\pi^{t}}$$
The greedy policy is represented by the parameter vector $W^{\pi^{t}}$ and can be determined on demand for any given state.
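Building on the `lstdq` sketch above, the greedy policy needs no explicit table: it is evaluated on demand from $W$. A hypothetical policy-iteration loop (iteration count and names are illustrative):

```python
import numpy as np

def greedy_action(x, W, phi, actions):
    # pi(x) = argmax_a phi(x, a)^T W, computed on demand from W
    return max(actions, key=lambda a: phi(x, a) @ W)

def lspi(samples, phi, k, actions, n_iters=20):
    """Alternate policy evaluation (lstdq, defined above) with
    greedy policy improvement until the weights settle."""
    W = np.zeros(k)
    for _ in range(n_iters):
        policy = lambda x, W=W: greedy_action(x, W, phi, actions)
        W = lstdq(samples, policy, phi, k)
    return W
```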
21
LS Method for Reinforcement Learning
Simulation
The system is hard to model but easy to simulate
Simulation implicitly indicates the features of the system in terms of the state visiting frequency
Orthogonal least-squares algorithm for training an RBF network
A systematic learning approach for solving the center selection problem
The newly added center always maximizes the amount of energy of the desired network output
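A minimal sketch of this forward-selection idea, assuming Gaussian RBFs, candidate centers taken from the training inputs, and the standard error-reduction ratio as the selection criterion (details such as the width and tolerance are assumptions):

```python
import numpy as np

def rbf(X, c, width):
    # Gaussian RBF response of all inputs X (N x d) for one center c (d,)
    return np.exp(-np.sum((X - c) ** 2, axis=1) / (2.0 * width ** 2))

def ols_select_centers(X, d, width, n_centers):
    """Forward selection of RBF centers by orthogonal least squares.

    X: (N, dim) candidate inputs (each is also a candidate center)
    d: (N,) desired output vector
    Returns the indices of the selected centers.
    """
    N = X.shape[0]
    # Candidate regressor matrix: column j is the response of center j
    P = np.column_stack([rbf(X, X[j], width) for j in range(N)])
    selected, W = [], []  # W holds the orthogonalized chosen columns
    for _ in range(n_centers):
        best_err, best_j, best_w = -1.0, None, None
        for j in range(N):
            if j in selected:
                continue
            w = P[:, j].copy()
            # Gram-Schmidt: remove components along the selected basis
            for u in W:
                w -= (u @ P[:, j]) / (u @ u) * u
            denom = w @ w
            if denom < 1e-12:
                continue
            # Error-reduction ratio ("energy" explained by this candidate)
            err = (w @ d) ** 2 / (denom * (d @ d))
            if err > best_err:
                best_err, best_j, best_w = err, j, w
        selected.append(best_j)
        W.append(best_w)
    return selected
```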
22
Hybrid Least-Squares Method
LS Method for Reinforcement Learning
Least-Squares Policy Iteration (LSPI) algorithm
Simulation & Orthogonal Least-Squares regression
[Architecture diagram with components: Environment; State; Action; Feature Configuration; Reward; Optimal policy]
23
LS Method for Reinforcement Learning
24
Simulation
Cart-Pole System
Dynamics of the cart-pole system:
$$\ddot{\theta}_t = \frac{g \sin\theta_t + \cos\theta_t \left[ \frac{-F_t - m_p l\, \dot{\theta}_t^{2} \sin\theta_t + \mu_c\, \mathrm{sgn}(\dot{x}_t)}{m_c + m_p} \right] - \frac{\mu_p \dot{\theta}_t}{m_p l}}{l \left[ \frac{4}{3} - \frac{m_p \cos^{2}\theta_t}{m_c + m_p} \right]}$$
$$\ddot{x}_t = \frac{F_t + m_p l \left[ \dot{\theta}_t^{2} \sin\theta_t - \ddot{\theta}_t \cos\theta_t \right] - \mu_c\, \mathrm{sgn}(\dot{x}_t)}{m_c + m_p}$$
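A minimal sketch of these dynamics with Euler integration; the physical constants below are typical values for this benchmark, not necessarily those used in this work:

```python
import numpy as np

# Typical cart-pole benchmark constants (assumed, not from the slides)
G, M_C, M_P, L = 9.8, 1.0, 0.1, 0.5   # gravity, cart/pole mass, half-length
MU_C, MU_P = 0.0005, 0.000002         # cart and pole friction coefficients

def cart_pole_step(theta, theta_dot, x, x_dot, F, dt=0.02):
    """One Euler-integration step of the dynamics above."""
    total = M_C + M_P
    tmp = (-F - M_P * L * theta_dot ** 2 * np.sin(theta)
           + MU_C * np.sign(x_dot)) / total
    theta_acc = (G * np.sin(theta) + np.cos(theta) * tmp
                 - MU_P * theta_dot / (M_P * L)) / (
                 L * (4.0 / 3.0 - M_P * np.cos(theta) ** 2 / total))
    x_acc = (F + M_P * L * (theta_dot ** 2 * np.sin(theta)
             - theta_acc * np.cos(theta))
             - MU_C * np.sign(x_dot)) / total
    return (theta + dt * theta_dot, theta_dot + dt * theta_acc,
            x + dt * x_dot, x_dot + dt * x_acc)
```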
25
Simulation
26
Conclusion (Stage One)
From the reinforcement learning perspective, the intractability of exact solutions to sequential decision problems calls for value function approximation methods.
At present, linear function approximators are the best alternative as an approximation architecture, mainly due to their transparent structure.
The model-free least-squares policy iteration (LSPI) method is a promising algorithm that uses a linear approximation architecture to achieve policy optimization in the spirit of Q-learning. It may converge in surprisingly few steps.
Inspired by the orthogonal least-squares regression method for selecting the centers of an RBF neural network, a new hybrid learning method for LSPI can produce more robust and human-independent solutions.