Reinforcement Learning Chapter 13 What is Reinforcement Learning? Q-Learning Examples 1.

1

Reinforcement Learning

Chapter 13

• What is Reinforcement Learning?

• Q-Learning

• Examples

2

Machine Learning Categories

3

What’s Reinforcement Learning?

• An autonomous agent should learn to choose optimal actions in each state to achieve its goals.

• The agent learns how to achieve that goal by trial-and-error interactions with its environment.

4

Example: Learning to ride a bike

• Suppose: In the first trial, the RL system begins riding the bicycle and performs a series of actions that result in the bicycle being tilted 45 degrees to the right.

• At this point, there are two possible actions:

– turn the handle bars right: crashing to the ground (a negative reinforcement)

– turn the handle bars left: crashing to the ground (a negative reinforcement)

5

Example: Learning to ride a bike

• At this point, the RL system has learned not only that turning the handle bars right or left when tilted 45 degrees to the right is bad, but also that the "state" of being tilted 45 degrees to the right is bad.

• Again, the RL system begins another trial and performs a series of actions that result in the bicycle being tilted 40 degrees to the right. ……

6

Reinforcement Learning: Suitable for state-action problems

• Board games, e.g., backgammon, chess, 8-puzzle, … (Reinforcement learning in board games, Imran Ghory, 2004)

[Figure: tree of states s0 to s8 connected by actions a1 to a7]

7

What’s Reinforcement Learning?

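[Figure: agent-environment loop. The environment sends the agent a state and a reward; the agent sends back an action, producing the sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, …]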

s : state

a : action

r : a reward function

control policy $\pi : S \to A$

8

Example: TD-Gammon

• Tesauro (1995)

• Uses RL to learn to play backgammon at world-champion level

• Immediate reward

– +100 if win

– -100 if lose

– 0 for all other states

• Trained by playing 1.5 million games against itself

• Now approximately equal to the best human player

9

An Example of Reward Function

10

The Goal in Reinforcement Learning

• Goal: learn to choose actions that maximize:

$r_0 + \gamma r_1 + \gamma^2 r_2 + \dots$

• where $0 \le \gamma < 1$ is the discount factor

• The discount factor is used to exponentially decrease the weight of reinforcements received in the future

• It’s called: Discounted Cumulative Reward
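
The discounted sum can be computed directly from a finite list of rewards. The following is a minimal sketch, assuming a finite episode; the function name `discounted_return` and the sample reward sequence are illustrative and not taken from the slides.

```python
def discounted_return(rewards, gamma=0.9):
    """Return r0 + gamma*r1 + gamma^2*r2 + ... for a finite reward sequence."""
    if not 0 <= gamma < 1:
        raise ValueError("gamma must satisfy 0 <= gamma < 1")
    total = 0.0
    # Work backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: no reward for two steps, then a reward of 100.
episode = [0, 0, 100]
print(discounted_return(episode, gamma=0.9))
```

With these numbers the only nonzero term is the third one, weighted by 0.9 squared, so the later a reward arrives the less it contributes.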

11

Discounted Cumulative Reward

$\gamma = 0.9$

12

Other Options

• Finite-horizon model:

• Average-reward model:

• Average discounted reward model:
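
For reference, the first two models are usually written as follows. These are the common textbook forms (as in Mitchell, Chapter 13), assumed here rather than recovered from the slide; $h$ denotes the horizon length.

```latex
\[
\text{Finite horizon: } \sum_{i=0}^{h} r_{t+i}
\qquad
\text{Average reward: } \lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h} r_{t+i}
\]
```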

13

Different Types of Learning Tasks

• Agent’s actions:

– Deterministic, or

– Nondeterministic

• The agent may or may not be able to predict the next state that will result from each action

• Trainer of the agent:

– Expert (who shows it examples of optimal action sequences), or

– the agent itself (trains itself by performing actions of its own choice)

14

Q-Learning for Simple Deterministic Worlds

15

Example

$Q(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} Q(s_2, a')$

$\leftarrow 0 + 0.9 \max\{63, 81, 100\}$

$\leftarrow 90$
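
The same update can be written as a few lines of code. This is a sketch only: the nested-dictionary Q-table and the action labels for s2 are assumptions made for illustration, while the numbers 63, 81, 100, the reward 0, and the discount 0.9 come from the example.

```python
GAMMA = 0.9


def q_update(q, state, action, reward, next_state, gamma=GAMMA):
    """Deterministic Q-learning update: Q(s, a) <- r + gamma * max_a' Q(s', a')."""
    best_next = max(q[next_state].values(), default=0.0)
    q[state][action] = reward + gamma * best_next
    return q[state][action]


# Q-table as nested dicts; the action names under "s2" are hypothetical labels.
q = {
    "s1": {"right": 0.0},
    "s2": {"up": 63.0, "left": 81.0, "right": 100.0},
}
new_value = q_update(q, "s1", "right", reward=0, next_state="s2")
print(new_value)
```

Only the entry for the state-action pair just executed changes; the values stored for s2 are read but not modified.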

16

RL as a function approximation method

• Learning the control policy ($\pi$) is very similar to the function approximation problem, except:

1. Delayed reward

– In RL, the trainer provides only a sequence of immediate reward values, so the agent faces the problem of temporal credit assignment.

2. Exploration or exploitation (next slide)

– Exploration to collect new information, or exploitation of what it has already learned to maximize the cumulative reward.

– In RL, the agent influences the distribution of training examples by the action sequence it chooses.

17

Explore or Exploit?

• Q-learning itself does not specify how to choose among the possible actions. Some options:

– Uniform random selection

– Selection of the action with the highest Q-value

– Selection based on the following probability:

$P(a_i \mid s) = \frac{k^{Q(s, a_i)}}{\sum_j k^{Q(s, a_j)}}$, with $k > 0$

– Small k => exploration, large k => exploitation

– Common choice: small k at the beginning of the learning process, then gradually increasing k
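
A sketch of probabilistic action selection follows. It assumes the form given in Mitchell's Chapter 13, where the probability of action a_i in state s is k^Q(s, a_i) divided by the sum of k^Q(s, a_j) over all actions, with k > 0. The function names and the sample Q-values are hypothetical.

```python
import random


def action_probabilities(q_values, k):
    """P(a_i | s) = k**Q(s, a_i) / sum_j k**Q(s, a_j), for a constant k > 0."""
    weights = {a: k ** q for a, q in q_values.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


def choose_action(q_values, k, rng=random):
    """Sample one action according to the probabilities above."""
    probs = action_probabilities(q_values, k)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions], k=1)[0]


# Hypothetical Q-values for one state.
q_values = {"left": 1.0, "right": 2.0, "up": 3.0}
print(action_probabilities(q_values, k=1.01))  # close to uniform: exploration
print(action_probabilities(q_values, k=10.0))  # concentrated on "up": exploitation
```

With k close to 1 the three probabilities are nearly equal, while a large k puts most of the probability on the action with the highest Q-value, which matches the exploration and exploitation regimes described above.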

18

RL vs. Other Function Approximation (continued)

3. Partially observable states

– In many practical situations, the sensors provide only partial information (like the camera on the front of a robot).

– Solution: consider previous observations together with the current sensor data.

4. Life-long learning

– Unlike the function approximation task, in RL robots often need to learn many tasks simultaneously, and the online learning process continues indefinitely.

19

RL Convergence

• Proved in Mitchell, pp. 377-378.

• Three conditions for convergence:

– Deterministic Markov Decision Process (MDP)

– Bounded immediate rewards ($|r(s, a)| \le c$ for some constant $c$)

– The agent selects every state-action pair infinitely often.
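
To see these conditions at work, the sketch below runs deterministic Q-learning on a small made-up world: a corridor of four states where state 3 is an absorbing goal and entering it pays 100. The world, the reward values, the episode count, and all names are assumptions for illustration. Uniform random action selection stands in for "every state-action pair is visited often".

```python
import random

GAMMA = 0.9
GOAL = 3
ACTIONS = ("left", "right")


def step(state, action):
    """Deterministic transition and reward for the hypothetical corridor."""
    next_state = max(state - 1, 0) if action == "left" else state + 1
    reward = 100 if next_state == GOAL else 0
    return next_state, reward


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    # Q-table initialised to zero for every state-action pair.
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(GOAL + 1)}
    for _ in range(episodes):
        state = rng.randrange(GOAL)  # random non-goal start state
        while state != GOAL:
            action = rng.choice(ACTIONS)  # uniform random exploration
            next_state, reward = step(state, action)
            # Deterministic update: Q(s, a) <- r + gamma * max_a' Q(s', a')
            q[state][action] = reward + GAMMA * max(q[next_state].values())
            state = next_state
    return q


q = q_learning()
for s in range(GOAL):
    print(s, q[s])
```

In this world the fixed point of the update for the "right" action is 100 in state 2, 90 in state 1, and 81 in state 0, so the learned table should settle on those values and the greedy policy should move right everywhere.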

20

Markov Decision Process

• Finite set of states: $S$; set of actions: $A$

– $t$: discrete time step
– $s_t$: the state at time $t$
– $a_t$: the action at time $t$

• At each discrete time step, the agent observes state $s_t \in S$ and chooses action $a_t \in A$.

• It then receives immediate reward $r_t$, and the state changes to $s_{t+1}$.

• Markov assumption: $s_{t+1} = \delta(s_t, a_t)$, $r_t = r(s_t, a_t)$

– i.e., $r_t$ and $s_{t+1}$ depend only on the current state and action

• Functions $\delta$ and $r$ may be nondeterministic

• Functions $\delta$ and $r$ are not necessarily known to the agent

[Figure: timeline of states, actions, and rewards: $s_t, a_t, r_t$; $s_{t+1}, a_{t+1}, r_{t+1}$; $s_{t+2}, a_{t+2}, r_{t+2}$; …]
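
A deterministic MDP of this kind can be sketched as two lookup tables, one for the transition function and one for the reward function. The three states, two actions, and reward values below are invented for illustration, as are the names `DELTA`, `REWARD`, and `rollout`.

```python
# Hypothetical 3-state deterministic MDP, written as lookup tables.
# DELTA maps (state, action) to the next state; REWARD maps it to r(s, a).
DELTA = {
    ("s0", "a0"): "s1", ("s0", "a1"): "s0",
    ("s1", "a0"): "s2", ("s1", "a1"): "s0",
    ("s2", "a0"): "s2", ("s2", "a1"): "s2",
}
REWARD = {
    ("s0", "a0"): 0, ("s0", "a1"): 0,
    ("s1", "a0"): 10, ("s1", "a1"): 0,
    ("s2", "a0"): 0, ("s2", "a1"): 0,
}


def rollout(policy, start, steps):
    """Follow a fixed policy (state -> action) and record (s_t, a_t, r_t) triples."""
    trajectory = []
    state = start
    for _ in range(steps):
        action = policy[state]
        # Markov assumption: reward and next state depend only on (state, action).
        reward = REWARD[(state, action)]
        trajectory.append((state, action, reward))
        state = DELTA[(state, action)]
    return trajectory, state


policy = {"s0": "a0", "s1": "a0", "s2": "a0"}
trajectory, final_state = rollout(policy, "s0", steps=3)
print(trajectory, final_state)
```

The agent in `rollout` never reads the tables in advance; it only sees the reward and next state that each step returns, which mirrors the point that the two functions need not be known to the agent.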

21

Other issues in RL (pp. 381-386)

• Reinforcement learning for nondeterministic rewards and actions

• Temporal Difference Learning

• Generalizing from examples

• Relationship to dynamic programming

• Continuous reinforcement learning (state-of-the-art)

22

Homework

• 13.3 – Tic-Tac-Toe
