people.cs.pitt.edujlee/teaching/cs1675/cs1675_week14.pdf · (C) Hauskrecht 1 CS 1675 Introduction...

Week14

JeongminLeeComputerScienceDepartment

UniversityofPittsburgh

CS1675IntrotoMachineLearning– Recitation

•RecaponReinforcementLearning

(C)Hauskrecht

CS 1675 Introduction to Machine LearningLecture 26

Milos Hauskrechtmilos@cs.pitt.edu5329 Sennott Square

Reinforcement learning II

Reinforcement learning

Basics:

• Learner interacts with the environment– Receives input with information about the environment (e.g.

from sensors)– Makes actions that (may) effect the environment– Receives a reinforcement signal that provides a feedback on

how well it performed

LearnerInput x Output a

Critic

Reinforcement r

(action)

(C)Hauskrecht

Reinforcement learning

Objective: Learn how to act in the environment in order to maximize the reinforcement signal • The selection of actions should depend on the input• A policy maps inputs to actions• Goal: find the optimal policy that gives the best

expected reinforcements

Example: learn how to play games (AlphaGo)

LearnerInput x Output a

Critic

Reinforcement r

AX o:SAX o:S

Gambling example• Game: 3 biased coins

– The coin to be tossed is selected randomly from the three coin options. The agent always sees which coin is going to be played next. The agent makes a bet on either a head or a tail with a wage of $1. If after the coin toss, the outcome agrees with the bet, the agent wins $1, otherwise it looses $1

• RL model:– Input: X – a coin chosen for the next toss, – Action: A – choice of head or tail the agent bets on, – Reinforcements: {1, -1}

• A policyExample:

AX o:SCoin1 headCoin2 tailCoin3 head

headtail

(C)Hauskrecht

Gambling exampleRL model:• Input: X – a coin chosen for the next toss, • Action: A – choice of head or tail the agent bets on, • Reinforcements: {1, -1}• A policy

State, action reward trajectories

Coin1 headCoin2 tailCoin3 head

Step0 Step1

Step k

.. ..state

action

reward

RL learning: objective functions

• Objective: Find a policy

That maximizes some combination of future reinforcements (rewards) received over time

• Valuation models (quantify how good the mapping is):– Finite horizon models

– Infinite horizon discounted model

– Average reward

)(0¦f

t rE J Discount factor:

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

t rE J Discount factor: 10 �d J

Coin1 ? Coin2 ?Coin3 ?

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Action

Reward

Theagentalwaysseeswhichcoinisgoingtobeplayednext.Theagentmakesabetoneitheraheadoratailwithawageof$1.

Ifafterthecointoss,theoutcomeagreeswiththebet,theagentwins$1,otherwiseitlooses$1

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Action

Reward

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Action

Reward

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Action

Reward

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Action

Reward

Insequenceofthese,ourgoalistomaximizeThetotalreward.

Inotherwords,itisaboutfindingaoptimalpolicy

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Howcanwequantifythegoodnessoftheactions?

Howcanwequantifythegoodnessoftheactions?:Valuationmodels

(C)Hauskrecht

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

T:Numberoffuturesteps

->Morefuture,morediscounted

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Action

Reward

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Let’ssayT=4,

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Action

Reward

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Let’ssayT=4,

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

=-1+1+1+-1+1=-2+3=1

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Action

Reward

Let’ssayT=4and𝛾 = 0.5

=-1*(0.5)^0+1*(0.5)^1+1*(0.5)*2+-1*(0.5)^3+1*(0.5)^4

=-0.3125

Step0 Step1

Step k

.. ..state

action

reward

– Average reward

)(0¦f

)(1lim0¦

0!TTime horizon:

AX o:*S

10 �d J

Twowaysofgettingrewards:

1.RLwithimmediate rewards(thisrecitation)

2.RLwithdelayed rewards

(C)Hauskrecht

RL with immediate rewards

• Expected reward

• Immediate reward case:– Reward depends only on x and the action choice – The action does not affect the environment and hence future

inputs (states) and future rewards– Expected one step reward for input x (coin to play next) and

the choice a :

)(0¦f

t rE J

),( aR x

10 �d J

• Expected reward

• Optimal strategy:

: Expected one step reward for input x (coin to play next) and the choice a

...)()()()( 22

�� ¦f

rErErErEt

tt JJJ

AX o:*S

),(maxarg)(* aRa

),( aR x

• Expected reward

• Immediate reward case:– Reward depends only on x and the action choice – The action does not affect the environment and hence future

inputs (states) and future rewards– Expected one step reward for input x (coin to play next) and

the choice a :

)(0¦f

t rE J

),( aR x

10 �d J

• Expected reward

• Optimal strategy:

: Expected one step reward for input x (coin to play next) and the choice a

...)()()()( 22

�� ¦f

rErErErEt

tt JJJ

AX o:*S

),(maxarg)(* aRa

),( aR x

Action

Reward

Currentstep

Inhere,whatmattersisknowingrewardsaboutactionaatinputx,R(x,a)

RL with immediate rewardsThe optimal choice assumes we know the expected reward

• Then:

Caveats• We do not know the expected reward

– We need to estimate it using from interaction• We cannot determine the optimal policy if the estimate of

the expected reward is not good– We need to try also actions that look suboptimal wrt the

current estimates of

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

Estimating R(x,a)• Solution 1:

– For each input x try different actions a– Estimate using the average of observed rewards

• Solution 2: online approximation• Updates an estimate after performing action a in x and

observing the reward

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Knowingexpectedrewardsmeans,weknowTheprobabilitiesofR(x,a)forallpossiblepairsof(x,a)s

e.g.,P(coin2,head)=0.3P(coin2,tail)=0.7P(coin1,head)=0.9P(coin1,tail)=0.1…

Then,givencurrentinput(coin1,2,or3),maximizingrewardsshouldbemucheasier!

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Knowingexpectedrewardsmeans,weknowTheprobabilitiesofR(x,a)forallpossiblepairsof(x,a)s

So,weneedtoestimate R(x,a):thatisdenotedas

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

(C)Hauskrecht

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

axr ,:Numberofsamples*we’resamplingactionsanditsrewardsandgetaverageofthem

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

(C)Hauskrecht

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

Coin2 …

Currentstep

Heador

TrythisN{x,a}times..

(C)Hauskrecht

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

(C)Hauskrecht

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Withthisapproach,wekeep“reward-memory”ofallpossiblepairs ofinputxandactiona,andkeepitupdateeachtimeweobserveapair.

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

RL with immediate rewards• At any step in time i during the experiment we have estimates of

expected rewards for each (coin, action) pair:

• Assume the next coin to play in step (i+1) is coin 2 and we pick head as our bet. Then we update using the observed reward and one of the update strategy above, and keep the reward estimates for the remaining (coin, action) pairs unchanged, e.g.

)(),1(~ iheadcoinR)(),1(~ itailcoinR

)1(),2(~ �iheadcoinR

)()1( ),2(~),2(~ ii tailcoinRtailcoinR �

Exploration vs. Exploitation

• Uniform exploration: – Uses exploration parameter – Choose the “current” best choice with probability

– All other choices are selected witha uniform probability

Advantages: • Simple, easy to implementDisadvantages: • Exploration more appropriate at the beginning when we do not

have good estimates of • Exploitation more appropriate later when we have good estimates

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

000000

Currentstep

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

000000

Currentstep

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

000000

Currentstep

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

000000

Currentstep

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

=(1-0.5)*0+(0.5)*-1=-0.5

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

000-0.500

Currentstep

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

=(1-0.5)*0+(0.5)*-1=-0.5

Memoryupdate!

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

000-0.500

Currentstep

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

000-0.500

Currentstep

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

=(1-0.5)*0+(0.5)*1=0.5

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

0.500-0.500

Currentstep

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

=(1-0.5)*0+(0.5)*1=0.5Memoryupdate!

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

0.500-0.500

Currentstep

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

=(1-0.5)*0+(0.5)*1=0.5

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

0.500-0.500.5

Currentstep

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

=(1-0.5)*0+(0.5)*1=0.5

Memoryupdate!

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

0.500-0.500.5

Currentstep

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

=(1-0.5)*(0.5)+(0.5)*1=0.75

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

Action

Reward

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

0.7500-0.500.5

Currentstep

• Then:

),( aR x

),(maxarg)(* aRa

),( aR x),(~ aR x

),(~ aR x

1),(~ x

),( aR x

ii riaRiaR ,)1()( )(),(~))(1(),(~ DD ��m �xx

- a learning rate

),(~maxarg)(ˆ aRAa

1|| �AH

10 dd H

),(~ aR x

=(1-0.5)*(0.5)+(0.5)*1=0.75Memoryupdate!

(C)Hauskrecht

Exploration vs. Exploitation in RL

The (learner) actively interacts with the environment via actions:• At the beginning the learner does not know anything about the

environment• It gradually gains the experience and learns how to react to the

environmentDilemma (exploration-exploitation):• After some number of steps, should I select the best current

choice (exploitation) or try to learn more about the environment (exploration)?

• Exploitation may involve the selection of a sub-optimal action and prevent the learning of the optimal choice

• Exploration may spend to much time on trying bad currently suboptimal actions

• In the RL framework– the (learner) actively interacts with the environment and

choses the action to play for the current input x– Also at any point in time it has an estimate of for

any (input,action) pair• Dilemma for choosing the action to play for x:

– Should the learner choose the current best choice of action (exploitation)

– Or choose some other action a which may help to improve its estimate (exploration)

This dilemma is called exploration/exploitation dilemma• Different exploration/exploitation strategies exist