Lecture 4: Q-learning (table)
Exploit & exploration and discounted future reward
Reinforcement Learning with TensorFlow & OpenAI Gym
Sung Kim <[email protected]>

Transcript of Lecture 4: Q-learning (table) (hunkim.github.io/ml/RL/rl04.pdf, 2017-10-02)

Page 1:

Lecture 4: Q-learning (table): exploit & exploration and discounted future reward

Reinforcement Learning with TensorFlow & OpenAI Gym
Sung Kim <[email protected]>

Page 2:

Dummy Q-learning algorithm
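As a preview of what the algorithm does on each step, here is a minimal sketch of the dummy update rule: take the immediate reward plus the best Q-value of the next state, with no discount factor and no learning rate yet. It assumes Q is a NumPy table of shape [n_states, n_actions]; the helper name dummy_q_update is mine, not from the slides.

import numpy as np

def dummy_q_update(Q, s, a, r, s1):
    # Dummy Q-learning update: Q(s, a) <- r + max_a' Q(s', a')
    # (no discount factor and no learning rate yet)
    Q[s, a] = r + np.max(Q[s1, :])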

Page 3:

Learning Q(s, a)?

Page 4:

Learning Q(s, a) Table: one success!

[Figure: grid world with Q-values of 1 marked along the path taken in the one successful episode]

Page 5:

Exploit VS Exploration

[Figure: the grid world again, with Q-values of 1 along the already-learned path]

http://home.deib.polimi.it/restelli/MyWebSite/pdf/rl5.pdf

Page 6:

Exploit VS Exploration

Q values: 0.0 0.0 0.0 0.0 0.0

Page 7:

Exploit (weekday) VS Exploration (weekend)

Q values: 0.5 0.6 0.3 0.2 0.5

Page 8:

Exploit VS Exploration: E-greedy

e = 0.1
if random.random() < e:
    a = env.action_space.sample()   # explore: pick a random action
else:
    a = np.argmax(Q[s, :])          # exploit: pick the best-known action

Page 9:

Exploit VS Exploration: decaying E-greedy

for i in range(1000):
    e = 0.1 / (i + 1)                    # epsilon shrinks as episode i grows
    if random.random() < e:
        a = env.action_space.sample()    # explore
    else:
        a = np.argmax(Q[s, :])           # exploit

Page 10:

Exploit VS Exploration: add random noise

Q values: 0.5 0.6 0.3 0.2 0.5

Page 11:

Exploit VS Exploration: add random noise

Q values: 0.5 0.6 0.3 0.2 0.5

a = argmax(Q[s, :] + random_values)

a = argmax([0.5, 0.6, 0.3, 0.2, 0.5] + [0.1, 0.2, 0.7, 0.3, 0.1])
  = argmax([0.6, 0.8, 1.0, 0.5, 0.6])   # element-wise sum: the noise lets the third action win

Page 12:

Exploit VS Exploration: add random noise

Q values: 0.5 0.6 0.3 0.2 0.5

for i in range(1000):
    a = argmax(Q[s, :] + random_values / (i + 1))   # the added noise decays over episodes
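As a small helper, this is one way the random_values above can be produced in practice, assuming NumPy; the name noisy_action and the choice of Gaussian noise (np.random.randn) are mine, matching common table-based Gym examples rather than anything spelled out on the slide.

import numpy as np

def noisy_action(Q, s, n_actions, i):
    # Add Gaussian noise to the Q row for state s; dividing by (i + 1)
    # makes the noise, and hence the exploration, fade over episodes.
    noise = np.random.randn(n_actions) / (i + 1)
    return int(np.argmax(Q[s, :] + noise))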

Page 13:

Exploit VS Exploration

[Figure: grid world with Q-values of 1 along the learned path]

Page 14:

Dummy Q-learning algorithm

Page 15:

Discounted future reward

[Figure: grid world showing two different paths to the goal, both marked with Q-values of 1]

Page 16:

Learning Q(s, a) with discounted reward

Page 17:

Discounted future reward

Page 18:

Learning Q(s, a) with discounted reward
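Written out, the discounted update replaces the dummy rule Q(s, a) <- r + max_a' Q(s', a') with the following, where γ is the discount factor (0 ≤ γ < 1):

    \hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')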

Page 19:

Discounted reward (γ = 0.9)

[Figure: grid world path in which the goal reward of 1 is discounted by γ at each step back from the goal]
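A quick worked example of what γ = 0.9 does: a reward of 1 that lies k steps in the future is worth 0.9^k now, so the values propagated back along a path to the goal fall off as

    1, \quad 0.9, \quad 0.9^2 = 0.81, \quad 0.9^3 = 0.729, \quad 0.9^4 = 0.6561, \ \dots

which is why, with discounting, a shorter path to the same reward ends up with a higher value than a longer one.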

Page 20:

Q-learning algorithm
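Putting the pieces together, here is a compact sketch of the table-based Q-learning loop this slide summarizes and that the upcoming lab will implement. It assumes the classic OpenAI Gym API in which env.reset() returns a state and env.step() returns (state, reward, done, info); the environment id, episode count, and decaying-noise exploration are illustrative choices, not a transcription of the slide.

import gym
import numpy as np

# Illustrative sketch; the lab may register its own deterministic
# FrozenLake variant, so the exact environment id can differ.
env = gym.make("FrozenLake-v0")

Q = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.9            # discount factor for future reward
num_episodes = 2000

for i in range(num_episodes):
    s = env.reset()
    done = False
    while not done:
        # Explore with decaying random noise added to the Q row for s
        a = np.argmax(Q[s, :] + np.random.randn(env.action_space.n) / (i + 1))
        s1, r, done, _ = env.step(a)
        # Q(s, a) <- r + gamma * max_a' Q(s', a')
        Q[s, a] = r + gamma * np.max(Q[s1, :])
        s = s1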

Page 21:

Q-Table Policy

Q(s, a)

(1) state, s
(2) action, a
(3) quality (reward) for the given action
    (e.g., LEFT: 0.5, RIGHT: 0.1, UP: 0.0, DOWN: 0.8)

[Figure: grid world illustrating the state s, the available actions a, and the Q(s, a) entries read from the table]
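Reading the policy out of the table is then just a greedy lookup: in state s, take the action with the largest Q(s, a). A tiny sketch, assuming a NumPy table; the action ordering below simply follows the slide's listing (LEFT, RIGHT, UP, DOWN) for illustration, not Gym's actual encoding.

import numpy as np

def greedy_action(Q, s):
    # Q-table policy: pick the action with the highest quality in state s
    return int(np.argmax(Q[s, :]))

# Slide's example row: LEFT 0.5, RIGHT 0.1, UP 0.0, DOWN 0.8 -> DOWN wins
q_row = np.array([[0.5, 0.1, 0.0, 0.8]])
print(greedy_action(q_row, 0))   # prints 3, the index of the DOWN entry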

Page 22:

Convergence

Q-learning converges to the true Q-values (a sketch of the argument follows the citation):

• In deterministic worlds
• With finite state and action spaces

Machine Learning, Tom Mitchell, 1997
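Following Mitchell (1997), the guarantee is roughly: in a deterministic world with finitely many states and actions, bounded rewards, 0 ≤ γ < 1, and every (s, a) pair visited infinitely often, the learned table \hat{Q} converges to the true Q. The key step, paraphrased rather than quoted, is a contraction in the worst-case table error (determinism makes r and s' the same in both terms below):

    \Delta_n = \max_{s,a} \bigl| \hat{Q}_n(s,a) - Q(s,a) \bigr|

    \bigl| \hat{Q}_{n+1}(s,a) - Q(s,a) \bigr|
      = \bigl| \bigl(r + \gamma \max_{a'} \hat{Q}_n(s',a')\bigr) - \bigl(r + \gamma \max_{a'} Q(s',a')\bigr) \bigr|
      \le \gamma \max_{a'} \bigl| \hat{Q}_n(s',a') - Q(s',a') \bigr|
      \le \gamma \Delta_n

so once every entry has been updated at least once, the worst-case error has shrunk by a factor of γ, and repeating this drives it to zero.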

Page 23:

Next

Lab: Q-learning Table