Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor...
Transcript of Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor...
![Page 1: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/1.jpg)
Introduction ofReinforcement Learning
![Page 2: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/2.jpg)
Deep Reinforcement Learning
![Page 3: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/3.jpg)
Reference
โข Textbook: Reinforcement Learning: An Introduction
โข http://incompleteideas.net/sutton/book/the-book.html
โข Lectures of David Silver
โข http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html (10 lectures, around 1:30 each)
โข http://videolectures.net/rldm2015_silver_reinforcement_learning/ (Deep Reinforcement Learning )
โข Lectures of John Schulman
โข https://youtu.be/aUrX-rP_ss4
![Page 4: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/4.jpg)
Scenario of Reinforcement Learning
Agent
Environment
Observation Action
RewardDonโt do that
State Change the environment
![Page 5: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/5.jpg)
Scenario of Reinforcement Learning
Agent
Environment
Observation
RewardThank you.
Agent learns to take actions maximizing expected reward.
https://yoast.com/how-to-clean-site-structure/
State
Action
Change the environment
![Page 6: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/6.jpg)
Machine Learning โ Looking for a Function
Environment
Observation Action
Reward
Function input
Used to pick the best function
Function output
Action = ฯ( Observation )
Actor/Policy
![Page 7: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/7.jpg)
Learning to play Go
Environment
Observation Action
Reward
Next Move
![Page 8: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/8.jpg)
Learning to play Go
Environment
Observation Action
Reward
If win, reward = 1
If loss, reward = -1
reward = 0 in most cases
Agent learns to take actions maximizing expected reward.
![Page 9: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/9.jpg)
Learning to play Go
โข Supervised:
โข Reinforcement Learning
Next move:โ5-5โ
Next move:โ3-3โ
First move โฆโฆ many moves โฆโฆ Win!
Alpha Go is supervised learning + reinforcement learning.
Learning from teacher
Learning from experience
(Two agents play with each other.)
![Page 10: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/10.jpg)
Learning a chat-bot
โข Machine obtains feedback from user
โข Chat-bot learns to maximize the expected reward
https://image.freepik.com/free-vector/variety-of-human-avatars_23-2147506285.jpg
How are you?
Bye bye โบ
Hello
Hi โบ
-10 3
http://www.freepik.com/free-vector/variety-of-human-avatars_766615.htm
![Page 11: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/11.jpg)
Learning a chat-bot
โข Let two agents talk to each other (sometimes generate good dialogue, sometimes bad)
How old are you?
See you.
See you.
See you.
How old are you?
I am 16.
I though you were 12.
What make you think so?
![Page 12: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/12.jpg)
Learning a chat-bot
โข By this approach, we can generate a lot of dialogues.
โข Use some pre-defined rules to evaluate the goodness of a dialogue
Dialogue 1 Dialogue 2 Dialogue 3 Dialogue 4
Dialogue 5 Dialogue 6 Dialogue 7 Dialogue 8
Machine learns from the evaluation
Deep Reinforcement Learning for Dialogue Generation
https://arxiv.org/pdf/1606.01541v3.pdf
![Page 13: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/13.jpg)
Learning a chat-bot
โข Supervised
โข Reinforcement
Helloโบ
Agent
โฆโฆ
Agent
โฆโฆ. โฆโฆ. โฆโฆ
Bad
โHelloโ Say โHiโ
โBye byeโ Say โGood byeโ
![Page 14: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/14.jpg)
More applications
โข Flying Helicopterโข https://www.youtube.com/watch?v=0JL04JJjocc
โข Drivingโข https://www.youtube.com/watch?v=0xo1Ldx3L5Q
โข Robotโข https://www.youtube.com/watch?v=370cT-OAzzM
โข Google Cuts Its Giant Electricity Bill With DeepMind-Powered AIโข http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-
giant-electricity-bill-with-deepmind-powered-ai
โข Text generation โข https://www.youtube.com/watch?v=pbQ4qe8EwLo
![Page 15: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/15.jpg)
Example: Playing Video Game
โข Widely studies: โข Gym: https://gym.openai.com/
โข Universe: https://openai.com/blog/universe/
Machine learns to play video games as human players
โข Machine learns to take proper action itself
โข What machine observes is pixels
![Page 16: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/16.jpg)
Example: Playing Video Game
โข Space invader
fire
Score (reward)
Kill the aliens
Termination: all the aliens are killed, or your spaceship is destroyed.
shield
![Page 17: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/17.jpg)
Example: Playing Video Game
โข Space invader
โข Play yourself: http://www.2600online.com/spaceinvaders.html
โข How about machine: https://gym.openai.com/evaluations/eval_Eduozx4HRyqgTCVk9ltw
![Page 18: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/18.jpg)
Start with observation ๐ 1 Observation ๐ 2 Observation ๐ 3
Example: Playing Video Game
Action ๐1: โrightโ
Obtain reward ๐1 = 0
Action ๐2 : โfireโ
(kill an alien)
Obtain reward ๐2 = 5
Usually there is some randomness in the environment
![Page 19: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/19.jpg)
Start with observation ๐ 1 Observation ๐ 2 Observation ๐ 3
Example: Playing Video Game
After many turns
Action ๐๐
Obtain reward ๐๐
Game Over(spaceship destroyed)
This is an episode.
Learn to maximize the expected cumulative reward per episode
![Page 20: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/20.jpg)
Properties of Reinforcement Learningโข Reward delay
โข In space invader, only โfireโ obtains reward
โข Although the moving before โfireโ is important
โข In Go playing, it may be better to sacrifice immediate reward to gain more long-term reward
โข Agentโs actions affect the subsequent data it receives
โข E.g. Exploration
![Page 21: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/21.jpg)
Outline
Policy-based Value-based
Learning an Actor Learning a CriticActor + Critic
Model-based Approach
Model-free Approach
Alpha Go: policy-based + value-based + model-based
![Page 22: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/22.jpg)
Policy-based ApproachLearning an Actor
![Page 23: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/23.jpg)
Step 1: define a set of function
Step 2: goodness of
function
Step 3: pick the best function
Three Steps for Deep Learning
Deep Learning is so simple โฆโฆ
Neural Network as Actor
![Page 24: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/24.jpg)
Neural network as Actor
โข Input of neural network: the observation of machine represented as a vector or a matrix
โข Output neural network : each action corresponds to a neuron in output layer
โฆโฆ
NN as actor
pixelsfire
right
leftProbability of taking
the action
What is the benefit of using network instead of lookup table?
0.7
0.2
0.1
generalization
![Page 25: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/25.jpg)
Step 1: define a set of function
Step 2: goodness of
function
Step 3: pick the best function
Three Steps for Deep Learning
Deep Learning is so simple โฆโฆ
Neural Network as Actor
![Page 26: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/26.jpg)
Goodness of Actor
โข Review: Supervised learning
1x
2x
โฆโฆ
256x
โฆโฆ
โฆโฆ
โฆโฆ
โฆโฆ
โฆโฆ
y1
y2
y10
Loss ๐
โ1โ
โฆโฆ
1
0
0
โฆโฆ target
Softm
axAs close as
possibleGiven a set of parameters ๐
๐ฟ =
๐=1
๐
๐๐
Total Loss:
Training Example
![Page 27: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/27.jpg)
Goodness of Actor
โข Given an actor ๐๐ ๐ with network parameter ๐
โข Use the actor ๐๐ ๐ to play the video gameโข Start with observation ๐ 1โข Machine decides to take ๐1โข Machine obtains reward ๐1โข Machine sees observation ๐ 2โข Machine decides to take ๐2โข Machine obtains reward ๐2โข Machine sees observation ๐ 3โข โฆโฆ
โข Machine decides to take ๐๐โข Machine obtains reward ๐๐
Total reward: ๐ ๐ = ฯ๐ก=1๐ ๐๐ก
Even with the same actor, ๐ ๐ is different each time
We define เดค๐ ๐ as the expected value of Rฮธ
เดค๐ ๐ evaluates the goodness of an actor ๐๐ ๐
Randomness in the actor and the game
END
![Page 28: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/28.jpg)
Goodness of Actor
โข ๐ = ๐ 1, ๐1, ๐1, ๐ 2, ๐2, ๐2, โฏ , ๐ ๐ , ๐๐ , ๐๐
๐ ๐|๐ =
๐ ๐ 1 ๐ ๐1|๐ 1, ๐ ๐ ๐1, ๐ 2|๐ 1, ๐1 ๐ ๐2|๐ 2, ๐ ๐ ๐2, ๐ 3|๐ 2, ๐2 โฏ
= ๐ ๐ 1 เท
๐ก=1
๐
๐ ๐๐ก|๐ ๐ก, ๐ ๐ ๐๐ก , ๐ ๐ก+1|๐ ๐ก , ๐๐ก
Control by your actor ๐๐
not related to your actor
Actor๐๐
left
right
fire
๐ ๐ก
0.1
0.2
0.7
๐ ๐๐ก = "๐๐๐๐"|๐ ๐ก , ๐= 0.7
We define เดค๐ ๐ as the expected value of Rฮธ
![Page 29: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/29.jpg)
Goodness of Actor
โข An episode is considered as a trajectory ๐
โข ๐ = ๐ 1, ๐1, ๐1, ๐ 2, ๐2, ๐2, โฏ , ๐ ๐ , ๐๐ , ๐๐โข ๐ ๐ = ฯ๐ก=1
๐ ๐๐กโข If you use an actor to play the game, each ๐ has a
probability to be sampled
โข The probability depends on actor parameter ๐: ๐ ๐|๐
เดค๐ ๐ =
๐
๐ ๐ ๐ ๐|๐
Sum over all possible trajectory
Use ๐๐ to play the game N times, obtain ๐1, ๐2, โฏ , ๐๐
Sampling ๐ from ๐ ๐|๐N times
โ1
๐
๐=1
๐
๐ ๐๐
![Page 30: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/30.jpg)
Step 1: define a set of function
Step 2: goodness of
function
Step 3: pick the best function
Three Steps for Deep Learning
Deep Learning is so simple โฆโฆ
Neural Network as Actor
![Page 31: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/31.jpg)
Gradient Ascent
โข Problem statement
โข Gradient ascent
โข Start with ๐0
โข ๐1 โ ๐0 + ๐๐ป เดค๐ ๐0
โข ๐2 โ ๐1 + ๐๐ป เดค๐ ๐1
โข โฆโฆ
๐โ = ๐๐๐max๐
เดค๐ ๐
๐ป เดค๐ ๐
๐ = ๐ค1, ๐ค2, โฏ , ๐1, โฏ
=
ฮค๐ เดค๐ ๐ ๐๐ค1
ฮค๐ เดค๐ ๐ ๐๐ค2
โฎฮค๐ เดค๐ ๐ ๐๐1โฎ
![Page 32: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/32.jpg)
Policy Gradient เดค๐ ๐ =
๐
๐ ๐ ๐ ๐|๐ ๐ป เดค๐ ๐ =?
๐ป เดค๐ ๐ =
๐
๐ ๐ ๐ป๐ ๐|๐ =
๐
๐ ๐ ๐ ๐|๐๐ป๐ ๐|๐
๐ ๐|๐
=
๐
๐ ๐ ๐ ๐|๐ ๐ป๐๐๐๐ ๐|๐
Use ๐๐ to play the game N times, Obtain ๐1, ๐2, โฏ , ๐๐
โ1
๐
๐=1
๐
๐ ๐๐ ๐ป๐๐๐๐ ๐๐|๐
๐ ๐ do not have to be differentiable
It can even be a black box.
๐๐๐๐ ๐ ๐ฅ
๐๐ฅ=
1
๐ ๐ฅ
๐๐ ๐ฅ
๐๐ฅ
![Page 33: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/33.jpg)
Policy Gradient
โข ๐ = ๐ 1, ๐1, ๐1, ๐ 2, ๐2, ๐2, โฏ , ๐ ๐ , ๐๐ , ๐๐
๐ ๐|๐ = ๐ ๐ 1 เท
๐ก=1
๐
๐ ๐๐ก|๐ ๐ก, ๐ ๐ ๐๐ก , ๐ ๐ก+1|๐ ๐ก, ๐๐ก
๐ป๐๐๐๐ ๐|๐ =?
๐๐๐๐ ๐|๐
= ๐๐๐๐ ๐ 1 +
๐ก=1
๐
๐๐๐๐ ๐๐ก|๐ ๐ก , ๐ + ๐๐๐๐ ๐๐ก , ๐ ๐ก+1|๐ ๐ก , ๐๐ก
๐ป๐๐๐๐ ๐|๐ =
๐ก=1
๐
๐ป๐๐๐๐ ๐๐ก|๐ ๐ก , ๐Ignore the terms not related to ๐
![Page 34: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/34.jpg)
Policy Gradient
๐ป เดค๐ ๐ โ1
๐
๐=1
๐
๐ ๐๐ ๐ป๐๐๐๐ ๐๐|๐
๐ป๐๐๐๐ ๐|๐
=
๐ก=1
๐
๐ป๐๐๐๐ ๐๐ก|๐ ๐ก , ๐
=1
๐
๐=1
๐
๐ ๐๐
๐ก=1
๐๐
๐ป๐๐๐๐ ๐๐ก๐|๐ ๐ก
๐, ๐
๐๐๐๐ค โ ๐๐๐๐ + ๐๐ป เดค๐ ๐๐๐๐
=1
๐
๐=1
๐
๐ก=1
๐๐
๐ ๐๐ ๐ป๐๐๐๐ ๐๐ก๐|๐ ๐ก
๐, ๐
If in ๐๐ machine takes ๐๐ก๐ when seeing ๐ ๐ก
๐ in
๐ ๐๐ is positive Tuning ๐ to increase ๐ ๐๐ก๐|๐ ๐ก
๐
๐ ๐๐ is negative Tuning ๐ to decrease ๐ ๐๐ก๐|๐ ๐ก
๐
It is very important to consider the cumulative reward ๐ ๐๐ of the whole trajectory ๐๐ instead of immediate reward ๐๐ก
๐
What if we replace ๐ ๐๐ with ๐๐ก
๐ โฆโฆ
![Page 35: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/35.jpg)
Policy Gradient
๐1: ๐ 11, ๐1
1 ๐ ๐1
๐ 21, ๐2
1
โฆโฆ
๐ ๐1
โฆโฆ
๐2: ๐ 12, ๐1
2 ๐ ๐2
๐ 22, ๐2
2
โฆโฆ
๐ ๐2
โฆโฆ
๐ โ ๐ + ๐๐ป เดค๐ ๐
๐ป เดค๐ ๐ =
1
๐
๐=1
๐
๐ก=1
๐๐
๐ ๐๐ ๐ป๐๐๐๐ ๐๐ก๐|๐ ๐ก
๐, ๐
Given actor parameter ๐
UpdateModel
DataCollection
![Page 36: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/36.jpg)
Policy Gradient
โฆโฆ
fire
right
left
s
1
0
0
โ
๐=1
3
เท๐ฆ๐๐๐๐๐ฆ๐Minimize:
Maximize:
๐๐๐๐ "๐๐๐๐ก"|๐
๐ฆ๐ เท๐ฆ๐
๐๐๐๐ฆ๐ =
๐ โ ๐ + ๐๐ป๐๐๐๐ "๐๐๐๐ก"|๐
Considered as Classification Problem
![Page 37: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/37.jpg)
Policy Gradient๐ โ ๐ + ๐๐ป เดค๐ ๐
๐ป เดค๐ ๐ =
1
๐
๐=1
๐
๐ก=1
๐๐
๐ ๐๐ ๐ป๐๐๐๐ ๐๐ก๐|๐ ๐ก
๐, ๐
๐1: ๐ 11, ๐1
1 ๐ ๐1
๐ 21, ๐2
1
โฆโฆ
๐ ๐1
โฆโฆ
๐2: ๐ 12, ๐1
2 ๐ ๐2
๐ 22, ๐2
2
โฆโฆ
๐ ๐2
โฆโฆ
Given actor parameter ๐
โฆ
๐ 11
NN
fire
right
left
๐11 = ๐๐๐๐ก
10
0
๐ 12
NN
fire
right
left
๐12 = ๐๐๐๐
00
1
![Page 38: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/38.jpg)
Policy Gradient๐ โ ๐ + ๐๐ป เดค๐ ๐
๐ป เดค๐ ๐ =
1
๐
๐=1
๐
๐ก=1
๐๐
๐ ๐๐ ๐ป๐๐๐๐ ๐๐ก๐|๐ ๐ก
๐, ๐
๐1: ๐ 11, ๐1
1 ๐ ๐1
๐ 21, ๐2
1
โฆโฆ
๐ ๐1
โฆโฆ
๐2: ๐ 12, ๐1
2 ๐ ๐2
๐ 22, ๐2
2
โฆโฆ
๐ ๐2
โฆโฆ
Given actor parameter ๐
โฆ
๐ 11
NN ๐11 = ๐๐๐๐ก
2
2
1
1 ๐ 11
NN ๐11 = ๐๐๐๐ก
๐ 12
NN ๐12 = ๐๐๐๐
Each training data is weighted by ๐ ๐๐
![Page 39: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/39.jpg)
Add a Baseline๐๐๐๐ค โ ๐๐๐๐ + ๐๐ป เดค๐ ๐๐๐๐
๐ป เดค๐ ๐ โ1
๐
๐=1
๐
๐ก=1
๐๐
๐ ๐๐ ๐ป๐๐๐๐ ๐๐ก๐|๐ ๐ก
๐ , ๐
It is possible that ๐ ๐๐
is always positive.
Ideal case
Samplingโฆโฆ
a b c a b c
It is probability โฆ
a b c
Not sampled
a b c
The probability of the actions not sampled will decrease.
1
๐
๐=1
๐
๐ก=1
๐๐
๐ ๐๐ โ ๐ ๐ป๐๐๐๐ ๐๐ก๐|๐ ๐ก
๐, ๐
![Page 40: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/40.jpg)
Value-based ApproachLearning a Critic
![Page 41: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/41.jpg)
Critic
โข A critic does not determine the action.
โข Given an actor ฯ, it evaluates the how good the actor is
http://combiboilersleeds.com/picaso/critics/critics-4.html
An actor can be found from a critic.
e.g. Q-learning
![Page 42: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/42.jpg)
Critic
โข State value function ๐๐ ๐
โข When using actor ๐, the cumulated reward expects to be obtained after seeing observation (state) s
๐๐s๐๐ ๐
scalar
๐๐ ๐ is large ๐๐ ๐ is smaller
![Page 43: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/43.jpg)
Critic
๐ไปฅๅ็้ฟๅ ๅคง้ฆฌๆญฅ้ฃ = bad
๐่ฎๅผท็้ฟๅ ๅคง้ฆฌๆญฅ้ฃ = good
![Page 44: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/44.jpg)
How to estimate ๐๐ ๐
โข Monte-Carlo based approachโข The critic watches ๐ playing the game
After seeing ๐ ๐,
Until the end of the episode, the cumulated reward is ๐บ๐
After seeing ๐ ๐,
Until the end of the episode, the cumulated reward is ๐บ๐
๐๐ ๐ ๐๐๐๐ ๐ ๐บ๐
๐๐ ๐ ๐๐๐๐ ๐ ๐บ๐
![Page 45: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/45.jpg)
How to estimate ๐๐ ๐
โข Temporal-difference approach
โฏ๐ ๐ก , ๐๐ก , ๐๐ก , ๐ ๐ก+1โฏ
๐๐ ๐ ๐ก๐๐๐ ๐ก
๐๐ ๐ ๐ก+1๐๐๐ ๐ก+1
๐๐ ๐ ๐ก + ๐๐ก = ๐๐ ๐ ๐ก+1
๐๐ ๐ ๐ก โ ๐๐ ๐ ๐ก+1 ๐๐ก-
Some applications have very long episodes, so that delaying all learning until an episode's end is too slow.
![Page 46: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/46.jpg)
MC v.s. TD
๐๐ ๐ ๐๐๐๐ ๐ ๐บ๐Larger variance
unbiased
๐๐ ๐ ๐ก๐๐๐ ๐ก ๐๐ ๐ ๐ก+1 ๐๐ ๐ ๐ก+1๐ +
Smaller varianceMay be biased
โฆ
![Page 47: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/47.jpg)
MC v.s. TD
โข The critic has the following 8 episodesโข ๐ ๐ , ๐ = 0, ๐ ๐ , ๐ = 0, END
โข ๐ ๐ , ๐ = 1, END
โข ๐ ๐ , ๐ = 1, END
โข ๐ ๐ , ๐ = 1, END
โข ๐ ๐ , ๐ = 1, END
โข ๐ ๐ , ๐ = 1, END
โข ๐ ๐ , ๐ = 1, END
โข ๐ ๐ , ๐ = 0, END
[Sutton, v2, Example 6.4]
(The actions are ignored here.)
๐๐ ๐ ๐ =?
๐๐ ๐ ๐ = 3/4
0? 3/4?
Monte-Carlo:
Temporal-difference:
๐๐ ๐ ๐ = 0
๐๐ ๐ ๐ + ๐ = ๐๐ ๐ ๐3/43/4 0
![Page 48: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/48.jpg)
Another Critic
โข State-action value function ๐๐ ๐ , ๐
โข When using actor ๐, the cumulated reward expects to be obtained after seeing observation s and taking a
๐๐s ๐๐ ๐ , ๐
scalara
๐๐ ๐ , ๐ = ๐๐๐๐ก
๐๐ ๐ , ๐ = ๐๐๐๐
๐๐ ๐ , ๐ = ๐๐๐โ๐ก๐๐
for discrete action only
s
![Page 49: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/49.jpg)
Q-Learning
๐ interacts with the environment
Learning ๐๐ ๐ , ๐Find a new actor ๐โฒ โbetterโ than ๐
TD or MC
?
๐ = ๐โฒ
![Page 50: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/50.jpg)
Q-Learning
โข Given ๐๐ ๐ , ๐ , find a new actor ๐โฒ โbetterโ than ๐
โข โBetterโ: ๐๐โฒ ๐ โฅ ๐๐ ๐ , for all state s
๐โฒ ๐ = ๐๐๐max๐
๐๐ ๐ , ๐
โข๐โฒ does not have extra parameters. It depends on Q
โขNot suitable for continuous action a
![Page 51: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/51.jpg)
Deep Reinforcement LearningActor-Critic
![Page 52: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/52.jpg)
Actor-Critic
๐ interacts with the environment
Learning ๐๐ ๐ , ๐ , ๐๐ ๐
Update actor from ๐ โ ๐โ based on ๐๐ ๐ , ๐ , ๐๐ ๐
TD or MC๐ = ๐โฒ
![Page 53: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/53.jpg)
Actor-Critic
โข Tips
โข The parameters of actor ๐ ๐ and critic ๐๐ ๐ can be shared
Network๐
Network
fire
right
left
Network
๐๐ ๐
![Page 54: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/54.jpg)
Source of image: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2#.68x6na7o9
Asynchronous
๐2
๐1
1. Copy global parameters
2. Sampling some data
3. Compute gradients
4. Update global models
โ๐
โ๐
๐1
+๐โ๐๐1
(other workers also update models)
![Page 55: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/55.jpg)
Demo of A3C
โข Racing Car (DeepMind)
https://www.youtube.com/watch?v=0xo1Ldx3L5Q
![Page 56: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/56.jpg)
Demo of A3C
โข Visual Doom AI Competition @ CIG 2016
โข https://www.youtube.com/watch?v=94EPSjQH38Y
![Page 57: Introduction of Reinforcement Learningtlkagk/courses/ML_2017...Goodness of Actor โขGiven an actor ๐๐ with network parameter โขUse the actor ๐๐ to play the video game](https://reader034.fdocuments.in/reader034/viewer/2022042622/5fa96534220b452cce61566e/html5/thumbnails/57.jpg)
Concluding Remarks
Policy-based Value-based
Learning an Actor Learning a CriticActor + Critic
Model-based Approach
Model-free Approach