Reinforcement Learning

Lisa Torrey, University of Wisconsin – Madison, HAMLET 2009


Transcript of Reinforcement Learning

1. Lisa Torrey
University of Wisconsin – Madison
HAMLET 2009
Reinforcement Learning

2. Reinforcement learning
What is it and why is it important in machine learning?
What machine learning algorithms exist for it?
Q-learning in theory
How does it work?
How can it be improved?
Q-learning in practice
What are the challenges?
What are the applications?
Link with psychology
Do people use similar mechanisms?
Do people use other methods that could inspire algorithms?
Resources for future reference
Outline
3. Reinforcement learning
What is it and why is it important in machine learning?
What machine learning algorithms exist for it?
Q-learning in theory
How does it work?
How can it be improved?
Q-learning in practice
What are the challenges?
What are the applications?
Link with psychology
Do people use similar mechanisms?
Do people use other methods that could inspire algorithms?
Resources for future reference
Outline
4. Machine Learning
Classification: where AI meets statistics
Given
Training data
Learn
A model for making a single prediction or decision
[Diagram: training data (x1, y1), (x2, y2), (x3, y3) → classification algorithm → model; xnew → model → ynew]
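As a concrete illustration of the classification setting on slide 4, here is a minimal nearest-neighbor sketch in Python; the toy data and the predict function are invented for illustration, not from the slides.

    # Toy 1-nearest-neighbor classifier: learn from (x, y) pairs, predict y_new for x_new.
    # The training data below is made up purely to illustrate the slide's diagram.
    training_data = [(1.0, "A"), (2.0, "A"), (8.0, "B")]  # (x_i, y_i) pairs

    def predict(x_new, data=training_data):
        # The "model" here is just the stored training data (1-nearest neighbor).
        _, nearest_y = min(data, key=lambda pair: abs(pair[0] - x_new))
        return nearest_y

    print(predict(7.5))  # closest x is 8.0, so this prints "B"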
5. Animal/Human Learning
[Diagram: kinds of learning drawn as mappings: memorization (x1 → y1), classification (xnew → ynew), procedural (environment → decision), other?]
6. Learning how to act to accomplish goals
Given
Environment that contains rewards
Learn
A policy for acting
Important differences from classification
You don't get examples of correct answers
You have to try things in order to learn
Procedural Learning
7. A Good Policy
8. Do you know your environment?
The effects of actions
The rewards
If yes, you can use Dynamic Programming
More like planning than learning
Value Iteration and Policy Iteration
If no, you can use Reinforcement Learning (RL)
Acting and observing in the environment
What You Know Matters
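To make the dynamic-programming option concrete, here is a minimal value-iteration sketch for the case where the transition function δ and reward function r are known. The tiny two-state MDP is invented for illustration; only the algorithm itself is from the slide.

    # Value iteration on a fully known MDP (transitions delta and rewards r are given).
    # The two-state, two-action MDP below is a made-up toy example.
    states = ["s0", "s1"]
    actions = ["stay", "move"]
    delta = {("s0", "stay"): "s0", ("s0", "move"): "s1",
             ("s1", "stay"): "s1", ("s1", "move"): "s0"}
    r = {("s0", "stay"): 0.0, ("s0", "move"): 1.0,
         ("s1", "stay"): 2.0, ("s1", "move"): 0.0}
    gamma = 0.9  # discount factor

    V = {s: 0.0 for s in states}
    for _ in range(100):  # iterate until the values (approximately) converge
        V = {s: max(r[(s, a)] + gamma * V[delta[(s, a)]] for a in actions)
             for s in states}

    # Extract the greedy policy from the converged values.
    policy = {s: max(actions, key=lambda a: r[(s, a)] + gamma * V[delta[(s, a)]])
              for s in states}
    print(V, policy)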
9. RL shapes behavior using reinforcement
Agent takes actions in an environment (in episodes)
Those actions change the state and trigger rewards
Through experience, an agent learns a policy for acting
Given a state, choose an action
Maximize cumulative reward during an episode
Interesting things about this problem
Requires solving credit assignment
What action(s) are responsible for a reward?
Requires both exploring and exploiting
Do what looks best, or see if something else is really best?
RL as Operant Conditioning
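The explore-versus-exploit tension on slide 9 is commonly handled with an epsilon-greedy rule (not named on the slide, so this is a standard stand-in): usually do what looks best, occasionally try something else. A minimal sketch; the epsilon value and the Q-table lookup are illustrative assumptions.

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        # With probability epsilon, explore: try a random action.
        if random.random() < epsilon:
            return random.choice(actions)
        # Otherwise exploit: do what currently looks best according to Q.
        return max(actions, key=lambda a: Q.get((state, a), 0.0))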
Search-based: evolution directly on a policy
E.g. genetic algorithms
Model-based: build a model of the environment
Then you can use dynamic programming
Memory-intensive learning method
Model-free: learn a policy without any model
Temporal difference methods (TD)
Requires limited episodic memory (though more helps)
Types of Reinforcement Learning
11. Actor-critic learning
The TD version of Policy Iteration
Q-learning
The TD version of Value Iteration
This is the most widely used RL algorithm
Types of Model-Free RL
12. Reinforcement learning
What is it and why is it important in machine learning?
What machine learning algorithms exist for it?
Q-learning in theory
How does it work?
How can it be improved?
Q-learning in practice
What are the challenges?
What are the applications?
Link with psychology
Do people use similar mechanisms?
Do people use other methods that could inspire algorithms?
Resources for future reference
Outline
13. Current state: s
Current action: a
Transition function: δ(s, a) = s′
Reward function: r(s, a) ∈ ℝ
Policy: π(s) = a
Q(s, a): value of taking action a from state s
Q-Learning: Definitions
Markov property: the next state is independent of previous states, given the current state and action
In classification we'd have examples (s, π(s)) to learn from
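One way to picture these definitions in code, as a hypothetical tabular representation (the specific states, actions, and numbers are made up): δ and r as dictionaries keyed by (state, action), and Q as a table that defaults to zero.

    from collections import defaultdict

    delta = {("s1", "a1"): "s2"}   # transition function: delta(s, a) = s'
    r = {("s1", "a1"): 2.0}        # reward function: r(s, a) is a real number
    Q = defaultdict(float)         # Q(s, a), initialized to 0 for every (s, a)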
14. Q(s, a) estimates the discounted cumulative reward
Starting in state s
Taking action a
Following the current policy thereafter
Suppose we have the optimal Q-function
What's the optimal policy in state s?
The action argmax_b Q(s, b)
But we don't have the optimal Q-function at first
Let's act as if we do
And update it after each step so it's closer to optimal
Eventually it will be optimal!
The Q-function
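The "act as if we have the optimal Q-function" step on slide 14 amounts to a greedy argmax over the current Q estimates. A minimal sketch; the function and argument names are illustrative, and Q is assumed to be a complete table (or a defaultdict) over (state, action) pairs.

    def greedy_action(Q, state, actions):
        # pi(s) = argmax_b Q(s, b), using the current (not yet optimal) Q estimates.
        return max(actions, key=lambda b: Q[(state, b)])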
15. Q-Learning: The Procedure
[Diagram: the agent-environment loop. The agent starts with Q(s1, a) = 0 and chooses a1 = π(s1); the environment returns s2 = δ(s1, a1) with reward r2 = r(s1, a1); the agent updates Q(s1, a1) ← Q(s1, a1) + … and chooses a2 = π(s2); the environment returns s3 = δ(s2, a2) with reward r3 = r(s2, a2); and so on.]
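Putting the procedure on slide 15 into code: a minimal sketch of one Q-learning episode, assuming a hypothetical environment object with reset()/step() methods (not from the slides) and using the standard update with learning rate alpha and discount gamma, which the next slides describe.

    import random

    def run_episode(env, actions, Q, alpha=0.1, gamma=0.9, epsilon=0.1):
        # env is a hypothetical environment: reset() -> state, step(a) -> (next_state, reward, done)
        # Q is assumed to default to 0.0 for unseen (state, action) pairs,
        # e.g. collections.defaultdict(float).
        state = env.reset()
        done = False
        while not done:
            # Choose an action: mostly greedy on the current Q, sometimes exploratory.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Update Q(s, a) toward the observed reward plus discounted future value.
            best_next = max(Q[(next_state, b)] for b in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
        return Q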
16. Q-Learning: Updates
The basic update equation
17. With a discount factor to give later rewards less impact
18. With a learning rate for non-deterministic worlds
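The update equations on slides 16-18 do not survive in this transcript (they were presumably figures). As a sketch, the standard tabular Q-learning forms matching the slide descriptions are given below, where s′ = δ(s, a), γ is the discount factor, and α is the learning rate.

    % Slide 16: basic update (deterministic transitions, no discounting)
    Q(s, a) \leftarrow r(s, a) + \max_{b} Q(s', b)

    % Slide 17: with a discount factor \gamma to give later rewards less impact
    Q(s, a) \leftarrow r(s, a) + \gamma \max_{b} Q(s', b)

    % Slide 18: with a learning rate \alpha for non-deterministic worlds
    Q(s, a) \leftarrow Q(s, a) + \alpha \big[ r(s, a) + \gamma \max_{b} Q(s', b) - Q(s, a) \big]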