Continuous-Action Q-Learning


Transcript of Continuous-Action Q-Learning

Page 1: Continuous-Action Q-Learning

(C) 2003, SNU Biointelligence Lab, http://bi.snu.ac.kr/

Continuous-Action Q-Learning

Summarized by Seung-Joon Yi

José del R. Millán et al., Machine Learning 49, 247-265 (2002)

Page 2: Continuous-Action Q-Learning


ITPM (Incremental Topology Preserving Map)

Consists of units and edges between pairs of units.

Maps the current sensory situation x onto an action a.

Units are created incrementally and incorporate bias.

After being created, a unit's sensory component is tuned by self-organizing rules and its action component is updated through reinforcement learning. A data-structure sketch follows.
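To make the structure concrete, here is a minimal Python sketch of an ITPM as these slides describe it. All names (Unit, ITPM, ACTIONS) and the fixed per-unit set of discrete actions are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass, field

import numpy as np

# One fixed set of discrete actions shared by all units (illustrative,
# e.g. steering commands in [-1, 1]).
ACTIONS = np.linspace(-1.0, 1.0, 5)

@dataclass
class Unit:
    w: np.ndarray                                  # sensory prototype (receptive-field center)
    q: np.ndarray = field(default_factory=lambda: np.zeros(len(ACTIONS)))
    edges: set = field(default_factory=set)        # indices of topological neighbors

class ITPM:
    """Incremental Topology Preserving Map: starts with no units."""
    def __init__(self):
        self.units: list[Unit] = []

    def nearest(self, x: np.ndarray) -> int:
        # Index of the unit whose receptive-field center is closest to x.
        return int(np.argmin([np.linalg.norm(u.w - x) for u in self.units]))
```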

Page 3: Continuous-Action Q-Learning


ITPM

Units and bias: initially the ITPM has no units; they are created as the robot uses its built-in reflexes.

Units in the network have overlapping localized receptive fields.

When the neural controller makes incorrect generalizations, the reflexes take control of the robot and a new unit is added to the ITPM (see the sketch below).
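A hedged sketch of that recruitment step, continuing the hypothetical API above. Biasing the new unit toward the reflex action and wiring it to its nearest neighbor is an assumption about the mechanism, not the paper's exact rule.

```python
import numpy as np

# Recruit a new unit when the reflexes had to override the controller.
def maybe_add_unit(itpm: ITPM, x: np.ndarray, reflex_action: float) -> None:
    if not itpm.units:                       # the ITPM starts empty
        itpm.units.append(Unit(w=x.copy()))
        return
    s = itpm.nearest(x)                      # nearest existing unit
    new = Unit(w=x.copy())                   # receptive field centered on x
    # Bias: favor the discrete action closest to the built-in reflex.
    new.q[int(np.argmin(np.abs(ACTIONS - reflex_action)))] = 1.0
    new.edges.add(s)                         # wire the new unit into the map
    itpm.units[s].edges.add(len(itpm.units))
    itpm.units.append(new)
```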

Page 4: Continuous-Action Q-Learning


ITPM

Self-organizing rules
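The equations on this slide were images that did not survive extraction. A plausible Kohonen-style reconstruction, assuming the standard topology-preserving-map rule of pulling the winning unit strongly, and its edge-connected neighbors weakly, toward the input (the rates are illustrative):

```python
import numpy as np

# Move the nearest unit and its topological neighbors toward the
# current sensory situation x (assumed rates, winner >> neighbors).
def self_organize(itpm: ITPM, x: np.ndarray,
                  eps_winner: float = 0.1, eps_neighbor: float = 0.01) -> None:
    s = itpm.nearest(x)
    itpm.units[s].w += eps_winner * (x - itpm.units[s].w)
    for n in itpm.units[s].edges:
        itpm.units[n].w += eps_neighbor * (x - itpm.units[n].w)
```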

Page 5: Continuous-Action Q-Learning


ITPM

Advantages:

Automatically allocates units in the visited parts of the input space.

Dynamically adjusts the necessary resolution in different regions.

Experiments show that on average every unit is connected to 5 others at the end of the learning episodes.

Page 6: Continuous-Action Q-Learning


ITPM

General learning algorithm
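The algorithm itself was shown as a figure that did not survive extraction. The outline below pieces together the steps the other slides describe; the ordering and the environment interface (env.step returning the next situation, the reward, and an optional reflex action) are assumptions. select_action, maybe_add_unit, and self_organize are the sketches on the other pages; q_update is only given as an equation (Page 9).

```python
import numpy as np

# One interaction step: sense, act through the ITPM, learn.
def learning_step(itpm: ITPM, env, x: np.ndarray,
                  alpha: float = 0.1, gamma: float = 0.95) -> np.ndarray:
    s = itpm.nearest(x)
    a, q_a = select_action(itpm, s)          # continuous action (Page 8)
    y, r, reflex = env.step(a)               # act in the world
    if reflex is not None:                   # controller generalized badly
        maybe_add_unit(itpm, x, reflex)      # recruit a unit (Page 3)
    q_update(itpm, x, a, q_a, r, y, alpha, gamma)  # TD update (Page 9)
    self_organize(itpm, x)                   # tune prototypes (Page 5)
    return y
```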

Page 7: Continuous-Action Q-Learning


Discrete-action Q-Learning

Action selection rule: ε-greedy policy

Q-value update rule
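Both rules were equation images that did not survive extraction. The textbook forms, with i and j the units nearest to the current and next situations x and y: with probability 1-ε take the greedy discrete action a = argmax_k Q(i, a_k), otherwise a random one; and update

$$Q(i,a) \leftarrow Q(i,a) + \alpha\big(r + \gamma \max_{a'} Q(j,a') - Q(i,a)\big)$$

These are the standard ε-greedy and one-step Q-learning rules; whether the slide showed exactly this notation is an assumption.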

Page 8: Continuous-Action Q-Learning


Continuous-action Q-Learning

Action selection rule: an average of the discrete actions of the nearest unit, weighted by their Q-values.

The Q-value of the selected continuous action a is interpolated with the same weights (the slide's equation is missing; see the hedged sketch below).
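A minimal sketch of that rule under the hypothetical API above. Normalizing by the sum of Q-values assumes they are non-negative, so the sketch clips them; the paper's exact weighting and tie-breaking may differ.

```python
import numpy as np

# Continuous action = Q-weighted average of the nearest unit's
# discrete actions; its Q-value is interpolated with the same weights.
def select_action(itpm: ITPM, s: int) -> tuple[float, float]:
    q = itpm.units[s].q
    w = np.maximum(q, 1e-8)            # assumption: keep weights positive
    w = w / w.sum()
    a = float(np.dot(w, ACTIONS))      # weighted average action
    q_a = float(np.dot(w, q))          # interpolated Q(x, a)
    return a, q_a
```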

Page 9: Continuous-Action Q-Learning


Continuous-action Q-Learning

Q-value update rule
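The update equation is missing. A plausible reconstruction, assuming the temporal-difference error of the continuous action is shared among the nearest unit's discrete actions in proportion to the weights w_k used to compose it:

$$\delta = r + \gamma\,Q(y, a') - Q(x, a), \qquad Q(i, a_k) \leftarrow Q(i, a_k) + \alpha\,\delta\,w_k$$

where Q(y, a') is the interpolated value of the action selected in the next situation y. This credit-sharing form is an assumption, not recovered from the slide.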

Page 10: Continuous-Action Q-Learning


Average-Reward RL

Q-value update rule
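This equation is also missing. Average-reward RL replaces the discount factor with an estimate ρ of the average reward per step; a standard R-learning-style form (an assumption about what the slide showed):

$$Q(x,a) \leftarrow Q(x,a) + \alpha\big(r - \rho + \max_{a'} Q(y,a') - Q(x,a)\big)$$

$$\rho \leftarrow \rho + \beta\big(r + \max_{a'} Q(y,a') - \max_{a} Q(x,a) - \rho\big)$$

with the ρ update applied only after greedy (non-exploratory) actions.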

Page 11: Continuous-Action Q-Learning


Experiments

Wall-following task

Reward
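The reward definition was shown as an image. Purely for illustration, a wall-following reward typically pays for forward progress while the robot stays within a distance band of the wall; every constant below is a hypothetical stand-in, not the paper's function:

```python
# Illustrative wall-following reward (all thresholds are assumptions).
def wall_following_reward(dist_to_wall: float, speed: float,
                          collided: bool,
                          target: float = 0.15, band: float = 0.05) -> float:
    if collided:
        return -1.0                    # collisions are penalized
    if abs(dist_to_wall - target) <= band:
        return speed                   # reward forward motion in the band
    return -0.1                        # drifting away from the wall
```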

Page 12: Continuous-Action Q-Learning


Experiments

Performance comparison between discrete- and continuous-action discounted-reward RL

Page 13: Continuous-Action Q-Learning


Experiments

Performance comparison between discrete- and continuous-action average-reward RL

Page 14: Continuous-Action Q-Learning


Experiments

Performance comparison between discounted-reward and average-reward RL, discrete-action case

Page 15: Continuous-Action Q-Learning


Conclusion

Presented a simple Q-learning method that works in continuous domains.

The ITPM represents the continuous input space.

Compared discounted-reward RL against average-reward RL.