Continuous-Action Q-Learning


Transcript of Continuous-Action Q-Learning

Page 1: Continuous-Action Q-Learning

(C) 2003, SNU Biointelligence Lab, http://bi.snu.ac.kr/

Continuous-Action Q-Learning

Summarized by Seung-Joon Yi

José del R. Millán et al., Machine Learning 49, 247-265 (2002)

Page 2: Continuous-Action Q-Learning


ITPM (Incremental Topology Preserving Map)

Consists of units and edges between pairs of units.

Maps the current sensory situation x onto an action a.

Units are created incrementally and incorporate bias.

After being created, a unit's sensory component is tuned by self-organizing rules and its action component is updated through reinforcement learning. A data-structure sketch follows.
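To make the structure concrete, here is a minimal Python sketch of an ITPM as these slides describe it. All names (Unit, ITPM, ACTIONS) and the fixed per-unit set of discrete actions are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass, field

import numpy as np

# One fixed set of discrete actions shared by all units (illustrative,
# e.g. steering commands in [-1, 1]).
ACTIONS = np.linspace(-1.0, 1.0, 5)

@dataclass
class Unit:
    w: np.ndarray                                  # sensory prototype (receptive-field center)
    q: np.ndarray = field(default_factory=lambda: np.zeros(len(ACTIONS)))
    edges: set = field(default_factory=set)        # indices of topological neighbors

class ITPM:
    """Incremental Topology Preserving Map: starts with no units."""
    def __init__(self):
        self.units: list[Unit] = []

    def nearest(self, x: np.ndarray) -> int:
        # Index of the unit whose receptive-field center is closest to x.
        return int(np.argmin([np.linalg.norm(u.w - x) for u in self.units]))
```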

Page 3: Continuous-Action Q-Learning


ITPM

Units and bias: initially the ITPM has no units; they are created as the robot uses its built-in reflexes.

Units in the network have overlapping localized receptive fields.

When the neural controller makes incorrect generalizations, the reflexes take control of the robot and a new unit is added to the ITPM (see the sketch below).
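A hedged sketch of that recruitment step, continuing the hypothetical API above. Biasing the new unit toward the reflex action and wiring it to its nearest neighbor is an assumption about the mechanism, not the paper's exact rule.

```python
import numpy as np

# Recruit a new unit when the reflexes had to override the controller.
def maybe_add_unit(itpm: ITPM, x: np.ndarray, reflex_action: float) -> None:
    if not itpm.units:                       # the ITPM starts empty
        itpm.units.append(Unit(w=x.copy()))
        return
    s = itpm.nearest(x)                      # nearest existing unit
    new = Unit(w=x.copy())                   # receptive field centered on x
    # Bias: favor the discrete action closest to the built-in reflex.
    new.q[int(np.argmin(np.abs(ACTIONS - reflex_action)))] = 1.0
    new.edges.add(s)                         # wire the new unit into the map
    itpm.units[s].edges.add(len(itpm.units))
    itpm.units.append(new)
```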

Page 4: Continuous-Action Q-Learning


ITPM

Self-organizing rules
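The equations on this slide were images that did not survive extraction. A plausible Kohonen-style reconstruction, assuming the standard topology-preserving-map rule of pulling the winning unit strongly, and its edge-connected neighbors weakly, toward the input (the rates are illustrative):

```python
import numpy as np

# Move the nearest unit and its topological neighbors toward the
# current sensory situation x (assumed rates, winner >> neighbors).
def self_organize(itpm: ITPM, x: np.ndarray,
                  eps_winner: float = 0.1, eps_neighbor: float = 0.01) -> None:
    s = itpm.nearest(x)
    itpm.units[s].w += eps_winner * (x - itpm.units[s].w)
    for n in itpm.units[s].edges:
        itpm.units[n].w += eps_neighbor * (x - itpm.units[n].w)
```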

Page 5: Continuous-Action Q-Learning


ITPM

Advantages:

Automatically allocates units in the visited parts of the input space.

Dynamically adjusts the necessary resolution in different regions.

Experiments show that on average every unit is connected to 5 others at the end of the learning episodes.

Page 6: Continuous-Action Q-Learning


ITPM

General learning algorithm
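The algorithm itself was shown as a figure that did not survive extraction. The outline below pieces together the steps the other slides describe; the ordering and the environment interface (env.step returning the next situation, the reward, and an optional reflex action) are assumptions. select_action, maybe_add_unit, and self_organize are the sketches on the other pages; q_update is only given as an equation (Page 9).

```python
import numpy as np

# One interaction step: sense, act through the ITPM, learn.
def learning_step(itpm: ITPM, env, x: np.ndarray,
                  alpha: float = 0.1, gamma: float = 0.95) -> np.ndarray:
    s = itpm.nearest(x)
    a, q_a = select_action(itpm, s)          # continuous action (Page 8)
    y, r, reflex = env.step(a)               # act in the world
    if reflex is not None:                   # controller generalized badly
        maybe_add_unit(itpm, x, reflex)      # recruit a unit (Page 3)
    q_update(itpm, x, a, q_a, r, y, alpha, gamma)  # TD update (Page 9)
    self_organize(itpm, x)                   # tune prototypes (Page 5)
    return y
```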

Page 7: Continuous-Action Q-Learning


Discrete-action Q-Learning

Action selection rule: ε-greedy policy

Q-value update rule
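Both rules were equation images that did not survive extraction. The textbook forms, with i and j the units nearest to the current and next situations x and y: with probability 1-ε take the greedy discrete action a = argmax_k Q(i, a_k), otherwise a random one; and update

$$Q(i,a) \leftarrow Q(i,a) + \alpha\big(r + \gamma \max_{a'} Q(j,a') - Q(i,a)\big)$$

These are the standard ε-greedy and one-step Q-learning rules; whether the slide showed exactly this notation is an assumption.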

Page 8: Continuous-Action Q-Learning


Continuous-action Q-Learning

Action selection rule: an average of the discrete actions of the nearest unit, weighted by their Q-values.

The Q-value of the selected continuous action a is interpolated with the same weights (the slide's equation is missing; see the hedged sketch below).
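A minimal sketch of that rule under the hypothetical API above. Normalizing by the sum of Q-values assumes they are non-negative, so the sketch clips them; the paper's exact weighting and tie-breaking may differ.

```python
import numpy as np

# Continuous action = Q-weighted average of the nearest unit's
# discrete actions; its Q-value is interpolated with the same weights.
def select_action(itpm: ITPM, s: int) -> tuple[float, float]:
    q = itpm.units[s].q
    w = np.maximum(q, 1e-8)            # assumption: keep weights positive
    w = w / w.sum()
    a = float(np.dot(w, ACTIONS))      # weighted average action
    q_a = float(np.dot(w, q))          # interpolated Q(x, a)
    return a, q_a
```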

Page 9: Continuous-Action Q-Learning


Continuous-action Q-Learning

Q-value update rule
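The update equation is missing. A plausible reconstruction, assuming the temporal-difference error of the continuous action is shared among the nearest unit's discrete actions in proportion to the weights w_k used to compose it:

$$\delta = r + \gamma\,Q(y, a') - Q(x, a), \qquad Q(i, a_k) \leftarrow Q(i, a_k) + \alpha\,\delta\,w_k$$

where Q(y, a') is the interpolated value of the action selected in the next situation y. This credit-sharing form is an assumption, not recovered from the slide.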

Page 10: Continuous-Action Q-Learning


Average-Reward RL

Q-value update rule
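This equation is also missing. Average-reward RL replaces the discount factor with an estimate ρ of the average reward per step; a standard R-learning-style form (an assumption about what the slide showed):

$$Q(x,a) \leftarrow Q(x,a) + \alpha\big(r - \rho + \max_{a'} Q(y,a') - Q(x,a)\big)$$

$$\rho \leftarrow \rho + \beta\big(r + \max_{a'} Q(y,a') - \max_{a} Q(x,a) - \rho\big)$$

with the ρ update applied only after greedy (non-exploratory) actions.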

Page 11: Continuous-Action Q-Learning


Experiments

Wall-following task

Reward
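The reward definition was shown as an image. Purely for illustration, a wall-following reward typically pays for forward progress while the robot stays within a distance band of the wall; every constant below is a hypothetical stand-in, not the paper's function:

```python
# Illustrative wall-following reward (all thresholds are assumptions).
def wall_following_reward(dist_to_wall: float, speed: float,
                          collided: bool,
                          target: float = 0.15, band: float = 0.05) -> float:
    if collided:
        return -1.0                    # collisions are penalized
    if abs(dist_to_wall - target) <= band:
        return speed                   # reward forward motion in the band
    return -0.1                        # drifting away from the wall
```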

Page 12: Continuous-Action Q-Learning


Experiments

Performance comparison between discrete- and continuous-action discounted-reward RL

Page 13: Continuous-Action Q-Learning


Experiments

Performance comparison between discrete- and continuous-action average-reward RL

Page 14: Continuous-Action Q-Learning


Experiments

Performance comparison between discounted-reward and average-reward RL, discrete-action case

Page 15: Continuous-Action Q-Learning


Conclusion

Presented a simple Q-learning method that works in continuous domains.

The ITPM represents the continuous input space.

Compared discounted-reward RL against average-reward RL.