Hybrid Agent-Based Modeling: Architectures, Analyses and Applications (Stage One). Li, Hailin.


Page 1:

Hybrid Agent-Based Modeling: Architectures, Analyses and Applications (Stage One)

Li, Hailin

Page 2:

Outline

Introduction

Least-Squares Method for Reinforcement Learning

Evolutionary Algorithms for the RL Problem (in progress)

Technical Analysis based upon a hybrid agent-based architecture (in progress)

Conclusion (Stage One)

Page 3:

Introduction: Learning From Interaction

Interact with the environment; experience the consequences of actions in order to achieve goals; no explicit teacher, only experience

Examples: a chess player during a game, someone preparing a meal, the actions of a gazelle calf after it is born

Page 4:

Introduction: Characteristics

Decision making in an uncertain environment

Actions affect the future situation, and their effects cannot be fully predicted

Goals are explicit

Experience is used to improve performance

Page 5:

Introduction: What Is to Be Learned

A mapping from situations to actions that maximizes a scalar reward or reinforcement signal

The learner is not told which actions to take; it must discover which actions yield the most reward by trying them

Page 6:

Introduction: Challenges

An action may affect not only the immediate reward but also the next situation, and consequently all subsequent rewards

Trial-and-error search

Delayed reward

Page 7:

Introduction: Exploration and Exploitation

Exploit what it already knows in order to obtain reward

Explore in order to make better action selections in the future

Neither can be pursued exclusively without failing at the task

Trade-off

Page 8:

Introduction: Components of an Agent

Policy: the decision-making function

Reward (total, average, or discounted): defines good and bad events for the agent

Value: reward in the long run

Model of the environment: describes the behavior of the environment

Page 9:

Introduction: Markov Property and Markov Decision Processes

"Independence of path": all that matters is contained in the current state signal

A reinforcement learning task that satisfies the Markov property is called a Markov decision process (MDP)

Finite MDP:

$P^{a}_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s,\ a_t = a \}$

$R^{a}_{ss'} = E\{ r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \}$

Page 10:

Introduction: Three Categories of Methods for Solving the Reinforcement Learning Problem

Dynamic programming: requires a complete and accurate model of the environment; performs a full backup operation on each state

Monte Carlo methods: perform a backup for each state based on the entire sequence of observed rewards from that state until the end of the episode

Temporal-difference learning: approximates the optimal value function and treats the approximation as an adequate guide

Page 11:

LS Method for Reinforcement Learning

For a stochastic dynamic system $x_{t+1} = f(x_t, a_t, w_t)$:

$x_t$: current state
$a_t$: control decision generated by the policy
$w_t$: disturbance, independently sampled from some fixed distribution

The state sequence $\{x_0, x_1, x_2, \ldots\}$ is a Markov chain.

The MDP can be denoted by a quadruple $(S, A, P, R)$:

$S$: state set
$A$: action set
$P$: state transition probabilities
$R$: reward function $g(x_t, a_t)$

The policy is a mapping $\pi: S \to \Pr(A)$.

Page 12:

LS Method for Reinforcement Learning

For each policy $\pi$, the value function $J^{\pi}$ is defined by:

$J^{\pi}(x) = E\left[ \sum_{t=0}^{\infty} \gamma^{t}\, g(x_t, \pi(x_t)) \,\middle|\, x_0 = x \right], \quad \gamma \in (0,1)$

The optimal value function $J^{*}$ is defined by:

$J^{*}(x) = \max_{\pi} J^{\pi}(x)$
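As an illustration (not taken from the slides), a minimal Python sketch of the discounted sum inside this definition, computed for a single simulated trajectory of rewards:

```python
# Discounted return of one observed reward sequence; gamma is the discount factor.
def discounted_return(rewards, gamma=0.95):
    total = 0.0
    for t, g_t in enumerate(rewards):
        total += (gamma ** t) * g_t
    return total
```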

Page 13:

LS Method for Reinforcement Learning

The optimal action can be generated through:

$a_t = \arg\max_{a \in A} E_{w}\left[ g(x_t, a) + \gamma\, J^{*}(f(x_t, a, w)) \right]$

Introducing the Q-value function:

$Q(x, a) = E_{w}\left[ g(x, a) + \gamma\, J^{*}(f(x, a, w)) \right]$

the optimal action can now be generated through:

$a_t = \arg\max_{a \in A} Q(x_t, a)$

Page 14:

LS Method for Reinforcement Learning

The exact Q-values for all state-action pairs can be obtained by solving the Bellman equations (full backups):

$Q^{\pi}(x_t, a_t) = \sum_{x_{t+1}} P(x_t, a_t, x_{t+1})\, g(x_t, a_t, x_{t+1}) + \gamma \sum_{x_{t+1}} P(x_t, a_t, x_{t+1})\, Q^{\pi}(x_{t+1}, \pi(x_{t+1}))$

or, in matrix form:

$Q^{\pi} = R + \gamma P^{\pi} Q^{\pi}$

where $P^{\pi}$ is the $(|S||A|) \times (|S||A|)$ matrix of transition probabilities from $(x_t, a_t)$ to $(x_{t+1}, \pi(x_{t+1}))$.
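For intuition, the matrix form can be solved directly for a small MDP with a fixed policy, since $(I - \gamma P^{\pi}) Q^{\pi} = R$ is just a linear system. A minimal sketch with an assumed toy two-pair MDP (the numbers are illustrative only):

```python
# Exact policy evaluation for a tiny, hypothetical MDP under a fixed policy pi.
import numpy as np

gamma = 0.9
# P_pi[i, j]: probability of moving from state-action pair i to pair j under pi.
P_pi = np.array([[0.0, 1.0],
                 [0.5, 0.5]])
R = np.array([1.0, 0.0])            # expected one-step reward for each pair
Q = np.linalg.solve(np.eye(2) - gamma * P_pi, R)   # solves (I - gamma*P_pi) Q = R
```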

Page 15:

LS Method for Reinforcement Learning

Traditional Q-learning

A popular variant of temporal-difference learning for approximating the Q-value function.

In the absence of a model of the MDP, it uses sample data $(x_t, a_t, r_t, x_{t+1})$.

The temporal difference $d_t$ is defined as:

$d_t = g(x_t, a_t) + \gamma \max_{a} Q(x_{t+1}, a) - Q(x_t, a_t)$

For one-step Q-learning, the update equation is:

$Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha\, d_t, \quad \alpha \in (0,1)$
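A minimal Python sketch of this one-step update, assuming a tabular Q stored as a dictionary and a hypothetical helper `actions(x)` returning the actions available in state x:

```python
# One-step tabular Q-learning update from a single sample (x_t, a_t, r_t, x_next).
def q_update(Q, x_t, a_t, r_t, x_next, actions, alpha=0.1, gamma=0.95):
    best_next = max(Q.get((x_next, a), 0.0) for a in actions(x_next))
    d_t = r_t + gamma * best_next - Q.get((x_t, a_t), 0.0)   # temporal difference
    Q[(x_t, a_t)] = Q.get((x_t, a_t), 0.0) + alpha * d_t
    return Q
```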

Page 16:

LS Method for Reinforcement Learning

The final decision based upon Q-learning:

$a_t = \arg\max_{a \in A} Q(x_t, a)$

The reasons for developing approximation methods:

Size of the state-action space

The overwhelming computational requirements

The categories of approximation methods in machine learning:

Model approximation

Policy approximation

Value function approximation

Page 17:

LS Method for Reinforcement Learning

Model-Free Least-Squares Q-learning

Linear function approximator:

$\hat{Q}(x, a; W) = \sum_{k=1}^{K} \phi_k(x, a)\, w_k = \phi(x, a)^{T} W$

$\phi_1, \ldots, \phi_K$: basis functions
$W = (w_1, \ldots, w_K)'$: a vector of scalar weights
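A minimal sketch of such a linear architecture, assuming Gaussian RBF basis functions over the state and a one-hot block encoding of the action (this particular basis choice is an assumption for illustration, not prescribed by the slides):

```python
# Linear Q-value approximation: Q(x, a) ~= phi(x, a)^T W.
import numpy as np

def phi(x, a, centers, n_actions, width=1.0):
    """Feature vector phi(x, a): one block of Gaussian RBF features per action."""
    rbf = np.exp(-np.sum((centers - x) ** 2, axis=1) / (2.0 * width ** 2))
    features = np.zeros(len(centers) * n_actions)
    features[a * len(centers):(a + 1) * len(centers)] = rbf
    return features

def q_hat(x, a, W, centers, n_actions):
    """Approximate Q-value as the inner product of features and weights."""
    return phi(x, a, centers, n_actions) @ W
```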

Page 18:

LS Method for Reinforcement Learning

For a fixed policy $\pi$:

$\hat{Q}^{\pi} = \Phi W^{\pi}$

where $\Phi$ is an $(|S||A|) \times K$ matrix and $K \ll |S||A|$.

If the model of the MDP, $(P, R)$, is available:

$\Phi = \begin{pmatrix} \phi(x_1, a_1)^{T} \\ \vdots \\ \phi(x, a)^{T} \\ \vdots \\ \phi(x_{|S|}, a_{|A|})^{T} \end{pmatrix}$

$R = \begin{pmatrix} \sum_{x_{t+1}} P(x_1, a_1, x_{t+1})\, g(x_1, a_1, x_{t+1}) \\ \vdots \\ \sum_{x_{t+1}} P(x, a, x_{t+1})\, g(x, a, x_{t+1}) \\ \vdots \\ \sum_{x_{t+1}} P(x_{|S|}, a_{|A|}, x_{t+1})\, g(x_{|S|}, a_{|A|}, x_{t+1}) \end{pmatrix}$

Page 19:

LS Method for Reinforcement Learning

The weights of the policy:

$W^{\pi} = A^{-1} B$

where $A = \Phi^{T}(\Phi - \gamma P^{\pi}\Phi)$ and $B = \Phi^{T} R$.

If the model of the MDP, $(P, R)$, is not available (model-free), given samples $(x_t^{i}, a_t^{i}, r_t^{i}, x_{t+1}^{i}),\ i = 1, 2, \ldots, L$:

$A \leftarrow A + \phi(x_t, a_t)\left[ \phi(x_t, a_t) - \gamma\, \phi(x_{t+1}, \pi(x_{t+1})) \right]^{T}$

$B \leftarrow B + \phi(x_t, a_t)\, r_t$

with $A_0 = 0$, $B_0 = 0$, and $\gamma \in (0,1)$.
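A minimal sketch of this model-free accumulation (one LSTDQ-style pass over the sample set), assuming a feature map `phi(x, a)` as on the previous slide and a hypothetical `policy(x)` giving the current policy's action:

```python
# One pass over collected samples, accumulating A and B and solving for W_pi.
import numpy as np

def lstdq(samples, phi, policy, k, gamma=0.95):
    """samples: iterable of (x_t, a_t, r_t, x_next); k: feature dimension."""
    A = np.zeros((k, k))
    B = np.zeros(k)
    for (x_t, a_t, r_t, x_next) in samples:
        f = phi(x_t, a_t)
        f_next = phi(x_next, policy(x_next))
        A += np.outer(f, f - gamma * f_next)
        B += f * r_t
    return np.linalg.solve(A, B)     # W_pi = A^{-1} B
```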

Page 20:

LS Method for Reinforcement Learning

The optimal policy can be found through:

$\pi_{t+1}(x) = \arg\max_{a} Q(x, a) = \arg\max_{a} \phi(x, a)^{T} W^{\pi_t}$

The greedy policy is represented by the parameter vector $W^{\pi_t}$ and can be determined on demand for any given state.
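A minimal sketch of evaluating this greedy policy on demand from the weight vector, with `phi` and `actions` as assumed helpers:

```python
# Greedy action: maximize the linear Q-value estimate over the available actions.
def greedy_action(x, W, phi, actions):
    return max(actions, key=lambda a: phi(x, a) @ W)
```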

Page 21:

LS Method for Reinforcement Learning

Simulation: the system is hard to model but easy to simulate; simulation implicitly indicates the features of the system in terms of state-visiting frequency.

Orthogonal least-squares algorithm for training an RBF network: a systematic learning approach for solving the center-selection problem; the newly added center always maximizes the amount of energy of the desired network output.
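A minimal sketch of OLS-style forward selection of RBF centers in this spirit (a simplified illustration, not the author's implementation): at each step the candidate center whose orthogonalized regressor explains the largest share of the desired output's energy is added to the network.

```python
# Greedy forward selection of Gaussian RBF centers by error-reduction ratio.
import numpy as np

def rbf_column(X, center, width):
    """Response of one Gaussian RBF centered at `center` over all samples X."""
    return np.exp(-np.sum((X - center) ** 2, axis=1) / (2.0 * width ** 2))

def ols_select_centers(X, d, n_centers, width=1.0):
    """Pick `n_centers` rows of X as RBF centers; d is the desired output vector."""
    candidates = [rbf_column(X, X[i], width) for i in range(len(X))]
    selected, basis = [], []          # chosen center indices / orthogonalized columns
    d_energy = d @ d
    for _ in range(n_centers):
        best_i, best_err, best_w = None, -1.0, None
        for i, p in enumerate(candidates):
            if i in selected:
                continue
            w = p.copy()
            for q in basis:           # Gram-Schmidt against already-chosen columns
                w -= (q @ p) / (q @ q) * q
            if w @ w < 1e-12:
                continue
            err = (w @ d) ** 2 / ((w @ w) * d_energy)   # error-reduction ratio
            if err > best_err:
                best_i, best_err, best_w = i, err, w
        if best_i is None:
            break
        selected.append(best_i)
        basis.append(best_w)
    return [X[i] for i in selected]
```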

Page 22:

LS Method for Reinforcement Learning

Hybrid Least-Squares Method

Least-Squares Policy Iteration (LSPI) algorithm combined with simulation and orthogonal least-squares regression

[Block diagram relating Environment, State, Action, Reward, Feature Configuration, and Optimal Policy]

Page 23:

LS Method for Reinforcement Learning

Page 24:

Simulation

Cart-Pole System

$\ddot{\theta}_t = \dfrac{ g \sin\theta_t + \cos\theta_t \left[ \dfrac{ -F_t - m_p l\, \dot{\theta}_t^{2} \sin\theta_t + \mu_c\, \mathrm{sgn}(\dot{x}_t) }{ m_c + m_p } \right] - \dfrac{ \mu_p \dot{\theta}_t }{ m_p l } }{ l \left[ \dfrac{4}{3} - \dfrac{ m_p \cos^{2}\theta_t }{ m_c + m_p } \right] }$

$\ddot{x}_t = \dfrac{ F_t + m_p l \left[ \dot{\theta}_t^{2} \sin\theta_t - \ddot{\theta}_t \cos\theta_t \right] - \mu_c\, \mathrm{sgn}(\dot{x}_t) }{ m_c + m_p }$
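A minimal sketch of one Euler integration step of these dynamics; the constants (masses, pole length, friction coefficients, time step) are the common benchmark values and are assumptions, not taken from the slides:

```python
# One Euler step of the cart-pole equations above.
import math

G, M_C, M_P, L, MU_C, MU_P, DT = 9.8, 1.0, 0.1, 0.5, 0.0005, 0.000002, 0.02

def cart_pole_step(x, x_dot, theta, theta_dot, force):
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    total_mass = M_C + M_P
    temp = (-force - M_P * L * theta_dot ** 2 * sin_t
            + MU_C * math.copysign(1.0, x_dot)) / total_mass
    theta_acc = (G * sin_t + cos_t * temp - MU_P * theta_dot / (M_P * L)) \
                / (L * (4.0 / 3.0 - M_P * cos_t ** 2 / total_mass))
    x_acc = (force + M_P * L * (theta_dot ** 2 * sin_t - theta_acc * cos_t)
             - MU_C * math.copysign(1.0, x_dot)) / total_mass
    # forward Euler integration of cart position/velocity and pole angle/velocity
    return (x + DT * x_dot, x_dot + DT * x_acc,
            theta + DT * theta_dot, theta_dot + DT * theta_acc)
```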

Page 25:

Simulation

Page 26:

Conclusion (Stage One)

From the reinforcement learning perspective, the intractability of exact solutions to sequential decision problems requires value-function approximation methods.

At present, linear function approximators are the best alternative as an approximation architecture, mainly due to their transparent structure.

The model-free least-squares policy iteration (LSPI) method is a promising algorithm that uses a linear approximation architecture to achieve policy optimization in the spirit of Q-learning, and it may converge in surprisingly few steps.

Inspired by the orthogonal least-squares regression method for selecting the centers of an RBF neural network, a new hybrid learning method for LSPI can produce more robust and human-independent solutions.