Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Page 1: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Reinforcement Learning

Page 2: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Literature

Reinforcement Learning: An Introduction

Richard S. Sutton and Andrew G. Barto

http://www.cs.ualberta.ca/~sutton/book/the-book.html

Page 3: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Contents

➔ Introduction
  The Learning Task
  Q Learning
  Nondeterministic Rewards and Actions
  Summary

Page 4: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Introduction

Situation:
  A robot or agent has a set of sensors to observe the state of its environment and a set of actions it can perform to alter this state.
  The agent knows only the current state and the actions it can choose from.

Learning strategy:
  Reward function: assigns a numerical value to each distinct action the agent may take from each distinct state.
  The task of the robot is to perform sequences of actions, observe their consequences, and learn a control policy that chooses, from any initial state, the actions that maximise the reward accumulated over time.

Problem areas: learning to control a mobile robot, learning to optimise operations in a factory, learning to play board games, or teaching a robot to dock onto its battery charger whenever the battery level runs low.

Page 5: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Contents

  Introduction
➔ The Learning Task
    Classification of the Problem
    The Markov Decision Process (MDP)
    Goal of the Learning
    Example
  Q Learning
  Nondeterministic Rewards and Actions
  Summary

Page 6: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Classification of the Problem

The agent interacts with its environment. The agent exists in an environment described by some set of possible states $S$ and can perform any of a set of possible actions $A$.
Each time it performs an action $a_t$ in state $s_t$, the agent receives a reward $r_t$ that indicates the immediate value of this state-action transition.
This produces a sequence of states $s_i$, actions $a_i$, and immediate rewards $r_i$.
The agent's task is to learn a control policy $\pi : S \rightarrow A$ that maximizes the expected sum of the immediate rewards and the future rewards, exponentially discounted by their delay.

Page 7: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Classification of the Problem 2

Consider specific settings:
  The actions have deterministic or nondeterministic outcomes.
  The agent may or may not have prior knowledge about the effects of its actions on the environment, i.e. the reward function $r_t = r(s_t, a_t)$ and the state-transition function $s_{t+1} = \delta(s_t, a_t)$.
  Is the agent trained by an expert?

Differences to other function approximation tasks:
  Delayed rewards: The trainer provides only a sequence of immediate reward values as the agent executes its sequence of actions => temporal credit assignment determines which of the actions in the sequence are to be credited with producing the eventual rewards.
  Exploration: Which experimentation strategy produces the most effective learning? Should the agent explore unknown states and actions, or exploit states and actions it has already learned to yield high reward?
  Partially observable state: In practical situations, sensors provide only partial information.
  Life-long learning: The robot learns several tasks within the same environment using the same sensors => previously obtained experience can be reused.

Page 8: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


The Markov Decision Process (MDP)

The process is deterministic. The agent can perceive a set of distinct states $S$ in its environment and has a set of actions $A$ it is allowed to perform.
At each discrete time step $t$, the agent senses the current state $s_t$, chooses a current action $a_t$ and performs it.
The environment responds by giving the agent a reward $r_t = r(s_t, a_t)$ and by producing the succeeding state $s_{t+1} = \delta(s_t, a_t)$.
In an MDP, $r(s_t, a_t)$ and $\delta(s_t, a_t)$ depend only on the current state and action, not on earlier ones; they are defined by the environment and are not necessarily known to the agent.
$S$ and $A$ are finite.
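
As a concrete illustration (not part of the original slides), a deterministic MDP of this kind can be sketched as two lookup tables, δ for state transitions and r for rewards. The states, actions and reward values below are hypothetical.

```python
# A minimal sketch of a deterministic MDP: delta and r are plain lookup tables.
# States, actions and reward values here are made up for illustration.
from typing import Dict, Tuple

State, Action = str, str

delta: Dict[Tuple[State, Action], State] = {
    ("s1", "right"): "s2",
    ("s2", "right"): "goal",
    ("s2", "left"):  "s1",
}

reward: Dict[Tuple[State, Action], float] = {
    ("s1", "right"): 0.0,
    ("s2", "right"): 100.0,   # entering the absorbing goal state
    ("s2", "left"):  0.0,
}

def step(s: State, a: Action) -> Tuple[State, float]:
    """Deterministic environment response: s' = delta(s, a), r = r(s, a)."""
    return delta[(s, a)], reward[(s, a)]
```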

Page 9: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Goal of the Learning

Goal: learn a policy $\pi : S \rightarrow A$ for selecting the next action $a_t$ based on the current observed state $s_t$: $\pi(s_t) = a_t$.
Approach: require the policy that produces the greatest possible cumulative reward:
$V^{\pi}(s_t) \equiv r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$, where $0 \le \gamma < 1$.
Precisely, the agent's learning task is to learn a policy $\pi$ that maximizes $V^{\pi}(s)$ for all states $s$. Such a policy is called an optimal policy, denoted $\pi^*$:
$\pi^* \equiv \operatorname{argmax}_{\pi} V^{\pi}(s), \; \forall s$
Simplify the notation: $V^*(s) \equiv V^{\pi^*}(s)$ is the maximum discounted cumulative reward that the agent can obtain starting from $s$.
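
A small sketch (not from the slides) of how the discounted cumulative reward is computed from a sequence of immediate rewards; the reward values used are hypothetical.

```python
# Discounted cumulative reward: V = r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ...
def discounted_return(rewards, gamma=0.9):
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

# Example with made-up rewards: 0 + 0.9*0 + 0.81*100 = 81.0
print(discounted_return([0, 0, 100], gamma=0.9))
```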

Page 10: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Example

[Figure: a grid-world example showing the immediate rewards r(s,a), the values V*(s), the Q(s,a) values, and one optimal policy.]

Page 11: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Contents

  Introduction
  The Learning Task
➔ Q Learning
    The Q Function
    An Algorithm for Learning Q
    An Illustrative Example
    Convergence
    Experimentation Strategies
    Updating Sequence
  Nondeterministic Rewards and Actions
  Summary

Page 12: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


The Q Function

Problem: There is no training data of the form $\langle s, a \rangle$; instead, the only training information available to the learner is the sequence of immediate rewards $r(s_i, a_i)$.
The agent can learn a numerical evaluation function such as $V^*$: the agent prefers state $s_1$ to state $s_2$ whenever $V^*(s_1) > V^*(s_2)$.
How can the agent choose among actions?
Solution: the optimal action in a state $s$ is the action $a$ that maximizes the sum of the immediate reward $r(s,a)$ and the value $V^*$ of the immediate successor state, discounted by $\gamma$:
$\pi^*(s) = \operatorname{argmax}_a \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right]$

Page 13: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


The Q Function

Problem: Perfect knowledge of the immediate reward function $r$ and the state transition function $\delta$ would be necessary.
Solution: define $Q(s,a) \equiv r(s,a) + \gamma V^*(\delta(s,a))$, the sum of the reward received immediately upon executing action $a$ from state $s$, plus the value gained by following the optimal policy thereafter. Then $\pi^*(s) = \operatorname{argmax}_a Q(s,a)$.
Advantage of using $Q$ instead of $V^*$: the agent is able to select optimal actions even when it has no knowledge of the functions $r$ and $\delta$.
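
A minimal sketch (not from the slides) of greedy action selection from a Q-table, π(s) = argmax_a Q(s,a); the table layout, action names and values are assumptions for illustration.

```python
# Greedy policy derived from a Q-table: pick the action with the largest Q(s, a).
from typing import Dict, Tuple

Q: Dict[Tuple[str, str], float] = {
    ("s1", "right"): 90.0,   # hypothetical values
    ("s1", "up"):    72.0,
    ("s1", "left"):  81.0,
}

def greedy_action(Q, state, actions):
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_action(Q, "s1", ["right", "up", "left"]))  # -> "right"
```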

Page 14: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


An Algorithm for Learning Q

How can $Q$ be learned? Noting that $V^*(s) = \max_{a'} Q(s, a')$, a transformation yields a recursive definition of $Q$ (Watkins 1989):
$Q(s,a) = r(s,a) + \gamma \max_{a'} Q(\delta(s,a), a')$
$\hat{Q}$ refers to the learner's estimate, or hypothesis, of the actual $Q$ function. It is represented by a large table with a separate entry for each state-action pair. The table can initially be filled with random values.
The agent works as before and additionally updates the table entry after each step:
$\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
Q learning propagates the $\hat{Q}$ estimates backward from the new state to the old one.
Episode: during each episode, the agent begins at some randomly chosen state and is allowed to execute actions until it reaches the absorbing goal state. When it does, the episode ends and the agent is transported to a new, randomly chosen initial state for the next episode.

Page 15: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


An Algorithm for Learning Q (2)

Algorithm:
  For each $(s, a)$ pair, initialise the table entry $\hat{Q}(s,a)$ to zero.
  Observe the current state $s$.
  Do forever:
    Select an action $a$ and execute it.
    Receive the immediate reward $r$.
    Observe the new state $s'$.
    Update the table entry $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
    $s \leftarrow s'$
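
Below is a minimal, hedged sketch of this tabular Q-learning loop in Python. The environment interface (a `step(s, a)` function returning the next state and reward, and an `is_goal(s)` test) and the action set are assumptions, not part of the original slides.

```python
import random
from collections import defaultdict

def q_learning(step, is_goal, actions, start_states, gamma=0.9, episodes=1000):
    """Tabular Q-learning for a deterministic MDP (sketch).

    step(s, a)  -> (s', r)   assumed environment transition function
    is_goal(s)  -> bool      assumed test for the absorbing goal state
    """
    Q = defaultdict(float)                      # Q-hat(s, a), initialised to zero
    for _ in range(episodes):
        s = random.choice(start_states)         # episode starts in a random state
        while not is_goal(s):
            a = random.choice(actions)          # exploration policy (here: uniform)
            s_next, r = step(s, a)
            # Q-hat(s,a) <- r + gamma * max_a' Q-hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q
```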

Page 16: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


An Algorithm for Learning Q (3)

Two general properties of this Q learning algorithm hold for any deterministic MDP in which the rewards are non-negative, assuming the $\hat{Q}$ values are initialized to zero:

  The $\hat{Q}$ values never decrease during training: $\forall s, a, n: \; \hat{Q}_{n+1}(s,a) \ge \hat{Q}_n(s,a)$

  Every $\hat{Q}$ value remains in the interval between zero and its true $Q$ value: $\forall s, a, n: \; 0 \le \hat{Q}_n(s,a) \le Q(s,a)$

Page 17: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


An Illustrative Example

The diagram on the left shows the initial state $s_1$ of the robot and several relevant $\hat{Q}$ values in its initial hypothesis, among them $\hat{Q}(s_1, a_{right}) = 72$ (with $\gamma = 0.9$).
When the robot executes the action $a_{right}$, it receives the immediate reward $r = 0$ and transitions to state $s_2$.
It then updates its estimate $\hat{Q}(s_1, a_{right})$ based on its $\hat{Q}$ estimates for this new state:
$\hat{Q}(s_1, a_{right}) \leftarrow r + \gamma \max_{a'} \hat{Q}(s_2, a') = 0 + 0.9 \cdot \max\{63, 81, 100\} = 90$
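
The single update from the example can be checked with a few lines of Python; the three Q-hat(s2, ·) values are those quoted on the slide, while the action names are placeholders.

```python
# One Q-learning update for the example: r = 0, gamma = 0.9,
# with Q-hat(s2, .) estimates 63, 81 and 100.
r, gamma = 0, 0.9
q_s2 = {"up": 63, "right": 81, "left": 100}   # action names are placeholders
q_s1_right = r + gamma * max(q_s2.values())
print(q_s1_right)  # 90.0
```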

Page 18: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Convergence

Will the $\hat{Q}$ algorithm converge to $Q$? Yes, but under some conditions:
  The MDP is deterministic.
  The immediate reward values are bounded: $\exists c \; \forall (s,a): \; |r(s,a)| < c$
  The agent selects actions in such a fashion that it visits every possible state-action pair infinitely often.

Theorem: Consider a Q learning agent in a deterministic MDP with bounded rewards $\forall (s,a): |r(s,a)| < c$. The Q learning agent uses the training rule $\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a')$, initializes its table $\hat{Q}(s,a)$ to arbitrary finite values, and uses a discount factor $\gamma$ such that $0 \le \gamma < 1$. Let $\hat{Q}_n(s,a)$ denote the agent's hypothesis following the nth update. If each state-action pair is visited infinitely often, then $\hat{Q}_n(s,a)$ converges to $Q(s,a)$ as $n \rightarrow \infty$.

Page 19: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Experimentation Strategies

Question: How are actions chosen during training?
The obvious strategy: in state $s$ the agent selects the action $a$ that maximizes $\hat{Q}(s,a)$.
Disadvantage: the agent will prefer actions that it found to have high $\hat{Q}$ values early in training and will fail to explore other actions that might have even higher values.
Alternative: use a probabilistic approach to selecting actions. The probability of selecting action $a_i$, given that the agent is in state $s$, is
$P(a_i \mid s) = \dfrac{k^{\hat{Q}(s, a_i)}}{\sum_j k^{\hat{Q}(s, a_j)}}$
where $k > 0$ is a constant that determines how strongly the selection favours actions with high $\hat{Q}$ values.
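
A small sketch of this probabilistic selection rule (action names and values are hypothetical). The exponents are shifted by their maximum before computing k^Q-hat, which leaves the probabilities unchanged but avoids numerical overflow for large Q-hat values.

```python
import random

def select_action(q_values, k=1.5):
    """Pick an action with probability proportional to k ** Q-hat(s, a)."""
    actions = list(q_values)
    q_max = max(q_values.values())
    weights = [k ** (q_values[a] - q_max) for a in actions]  # shifted for stability
    return random.choices(actions, weights=weights, k=1)[0]

# Hypothetical Q-hat(s, .) values; a larger k makes the choice greedier.
print(select_action({"up": 63.0, "right": 81.0, "left": 100.0}, k=1.1))
```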

Page 20: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Updating Sequence

Possibilities to improve the convergence:
  If all $\hat{Q}$ values are initialized to 0, then after the first full episode only one entry in the agent's table will have changed: the entry corresponding to the final transition into the goal state. The updated values propagate backward from the goal over subsequent episodes.
  Train on the same state-action transitions, but in reverse chronological order for each episode: apply the same update rule to each transition, but perform the updates in reverse order. => convergence in fewer iterations, but higher memory usage, since the episode's transitions must be stored. See the sketch below.
  A second strategy is to store past state-action transitions (together with the rewards received) and retrain on them periodically.
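
A sketch of the reverse-order variant, assuming the same hypothetical `step`/`is_goal` environment interface as in the earlier Q-learning sketch: each episode's transitions are recorded first and then replayed backwards.

```python
import random
from collections import defaultdict

def q_learning_reverse(step, is_goal, actions, start_states, gamma=0.9, episodes=1000):
    """Q-learning that replays each episode's transitions in reverse order (sketch)."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = random.choice(start_states)
        transitions = []                        # store (s, a, r, s') for this episode
        while not is_goal(s):
            a = random.choice(actions)
            s_next, r = step(s, a)
            transitions.append((s, a, r, s_next))
            s = s_next
        for s, a, r, s_next in reversed(transitions):   # update backwards
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    return Q
```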

Page 21: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Contents

  Introduction
  The Learning Task
  Q Learning
➔ Nondeterministic Rewards and Actions
  Summary

Page 22: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Nondeterministic Rewards and Actions

In such cases $\delta(s,a)$ and $r(s,a)$ can be viewed as first producing a probability distribution over outcomes based on $s$ and $a$, and then drawing an outcome at random according to this distribution.
The change in the Q learning algorithm: first, $V^{\pi}$ is redefined as the expected value of the discounted cumulative reward:
$V^{\pi}(s_t) \equiv E\left[\sum_{i=0}^{\infty} \gamma^i r_{t+i}\right]$
Generalization of $Q$:
$Q(s,a) \equiv E[r(s,a) + \gamma V^*(\delta(s,a))] = E[r(s,a)] + \gamma E[V^*(\delta(s,a))] = E[r(s,a)] + \gamma \sum_{s'} P(s' \mid s,a)\, V^*(s')$
Recursive $Q$ function:
$Q(s,a) = E[r(s,a)] + \gamma \sum_{s'} P(s' \mid s,a) \max_{a'} Q(s',a')$
Modify the training rule so that it takes a decaying weighted average of the current $\hat{Q}$ value and the revised estimate:
$\hat{Q}_n(s,a) \leftarrow (1 - \alpha_n)\, \hat{Q}_{n-1}(s,a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a') \right]$
where $\alpha_n = \dfrac{1}{1 + visits_n(s,a)}$ and $visits_n(s,a)$ is the total number of times this state-action pair has been visited up to and including the nth iteration.
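
A short sketch of this training rule for the nondeterministic case; the per-pair visit counter and the surrounding setup are the same hypothetical assumptions as in the earlier sketches.

```python
from collections import defaultdict

Q = defaultdict(float)        # Q-hat(s, a)
visits = defaultdict(int)     # visit counts per (s, a) pair

def nondeterministic_update(s, a, r, s_next, actions, gamma=0.9):
    """Q_n(s,a) <- (1 - a_n) Q_{n-1}(s,a) + a_n [r + gamma max_a' Q_{n-1}(s',a')]."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])          # decaying learning rate
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```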

Page 23: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Contents

  Introduction
  The Learning Task
  Q Learning
  Nondeterministic Rewards and Actions
➔ Summary

Page 24: Lehrstuhl für Informatik 2 Gabriella Kókai: Maschine Learning Reinforcement Learning.


Summary

Reinforcement learning addresses the problem of learning control strategies for autonomous agents.
Training information is available in the form of a real-valued reward signal given for each state-action transition.
The goal of the agent is to learn an action policy that maximizes the total reward it receives from any starting state.
Q learning applies to a limited class of problems, namely Markov decision processes.
The Q function is defined as the maximum expected, discounted, cumulative reward the agent can achieve by applying action $a$ in state $s$.
$\hat{Q}$ is represented by a lookup table with a distinct entry for each $\langle s, a \rangle$ pair.
Convergence can be shown in both deterministic and nondeterministic MDPs.