
REINFORCEMENT LEARNING

Group 11

Ashish Meena 04005006

Rohitashwa Bhotica 04005010

Hansraj Choudhary 04d05005

Piyush Kedia 04d05009

Outline
◦ Introduction
◦ Learning Models
◦ Motivation
◦ Reinforcement Learning Framework
◦ Q-Learning Algorithm
◦ Applications
◦ Summary

Introduction
Machine Learning
◦ Construction of programs that automatically improve with experience.
Types of Learning
◦ Supervised Learning
◦ Unsupervised Learning
◦ Reinforcement Learning

Supervised Learning
◦ Training data: (X, Y) (features, label).
◦ Predict Y, minimizing some loss.
◦ Regression, Classification.
Example
◦ Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient. (Logistic Regression)
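As a quick illustration of the supervised setting, here is a minimal sketch using scikit-learn; the feature vectors and labels are invented stand-ins for the patient measurements mentioned above, not real data.

```python
# Minimal supervised-learning sketch: fit a classifier on labelled (X, y)
# pairs and predict the label of a new point. The data below are hypothetical.
from sklearn.linear_model import LogisticRegression

X = [[55, 1], [63, 0], [47, 1], [70, 1], [52, 0], [68, 0]]  # made-up (age, diet score)
y = [1, 0, 0, 1, 0, 1]                                      # 1 = second heart attack

clf = LogisticRegression()
clf.fit(X, y)                    # learn from labelled examples
print(clf.predict([[60, 1]]))    # predicted label for a new patient
```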

Unsupervised Learning
◦ Training data: X (features only).
◦ Find "similar" points in high-dimensional X-space.
◦ Clustering.
Example
◦ From the DNA micro-array data, determine which genes are most "similar" in terms of their expression profiles. (Clustering)
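A corresponding unsupervised sketch, again with scikit-learn; the two-dimensional "expression profiles" below are hypothetical stand-ins for micro-array data.

```python
# Minimal unsupervised-learning sketch: cluster unlabelled points.
from sklearn.cluster import KMeans

X = [[0.10, 0.20], [0.15, 0.22], [0.90, 0.85], [0.88, 0.90], [0.12, 0.18]]

labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)  # no labels are provided
print(labels)   # cluster index assigned to each profile
```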

Reinforcement Learning
◦ Training data: (S, A, R) (State, Action, Reward).
◦ Develop an optimal policy (a sequence of decision rules) for the learner so as to maximize its long-term reward.

Reinforcement Learning
[Figure: the agent-environment interaction loop. The agent observes state s_t and reward r_t from the environment, its policy selects action a_t, and the environment responds with the next state and reward, generating the trajectory s_0, a_0, r_0; s_1, a_1, r_1; s_2, a_2, r_2; ...]
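The loop in the figure can be sketched in a few lines of Python. The Environment class, its reset()/step() methods, and the random policy are assumptions made for illustration; they are not part of the original slides.

```python
import random

class Environment:
    """Toy two-state environment, purely illustrative."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Hypothetical dynamics: action 1 taken in state 0 reaches the goal.
        if self.state == 0 and action == 1:
            self.state, reward = 1, 1.0
        else:
            reward = 0.0
        return self.state, reward

def policy(state):
    return random.choice([0, 1])        # a (random) policy maps state -> action

env = Environment()
state = env.reset()
trajectory = []                          # stores (s_t, a_t, r_t) triples
for t in range(5):
    action = policy(state)
    next_state, reward = env.step(action)
    trajectory.append((state, action, reward))
    state = next_state
print(trajectory)
```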

Motivation
Supervised and unsupervised learning fail in many situations.
Example
◦ Flight control systems: for the set of all sensor readings at a given time, decide how the flight controls should function.
◦ In the supervised learning setting, the labels for many feature combinations are unknown.
◦ By performing trial-and-error interactions with the environment, reinforcement learning is capable of solving such problems.

Learning by Interaction
• Learning to ride a bicycle
• Actions:
  • Turn handle bars RIGHT.
  • Turn handle bars LEFT.
• Rewards:
  • Positive if the cycle is perfectly balanced.
  • Negative if the angle between the cycle and the ground decreases.

Inspiration from Natural Learning
Dopamine
◦ A neurohormone occurring in a wide variety of animals, including both vertebrates and invertebrates.
◦ Regulates and controls behavior by inducing pleasurable effects equivalent to rewards.
◦ Motivates us to proactively perform certain activities.

Markov Decision Processes
◦ A discrete set of environment states S.
◦ A discrete set of agent actions A.
◦ At each discrete time step the agent observes state s_t ∈ S and chooses action a_t ∈ A.
◦ Receives reward r_t.
◦ State changes to s_{t+1}.
Markov Assumptions
◦ s_{t+1} = δ(s_t, a_t)
◦ r_t = r(s_t, a_t)
◦ δ and r are usually unknown to the learner.
◦ δ and r may also be non-deterministic.
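To make the δ and r notation concrete, here is a tiny deterministic MDP written as Python lookup tables; the states, actions, transitions and rewards are hypothetical and are not the grid example from the slides.

```python
# delta(s, a) -> next state, and r(s, a) -> immediate reward, as tables.
delta = {
    ("s0", "right"): "s1",
    ("s0", "left"):  "s0",
    ("s1", "right"): "goal",
    ("s1", "left"):  "s0",
}
r = {
    ("s0", "right"): 0,
    ("s0", "left"):  0,
    ("s1", "right"): 100,   # entering the goal pays 100
    ("s1", "left"):  0,
}

s, a = "s1", "right"
print(delta[(s, a)], r[(s, a)])   # -> goal 100
```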

Simple deterministic example

Model-free v/s Model-based Methods
Model-free Methods
◦ Eliminate the need to know the model.
◦ Learn a policy without estimating the model.
◦ Example: Q-Learning
Model-based Methods
◦ The model consists of the Transition Probability Function and the Reward Function.
◦ Interactively estimate the model and calculate the policy.
◦ Example: Dyna

Precise Goal
◦ Maximization of long-term reward.
◦ An optimal policy has to be learnt in order to do this.
◦ Long-term reward may have different interpretations according to how the future is taken into account, i.e. different models of optimal behaviour.

Models of Optimal Behaviour
Three main models
Finite Horizon Model
◦ Optimize the reward for the next h steps.
◦ r_t represents the reward for the (t+1)th action performed.
◦ h = 1 represents a greedy strategy.

Models of Optimal Behaviour (2)
Infinite Horizon Discounted Model
◦ Models future rewards as being less valuable than rewards in the present.
◦ Has an infinite horizon, i.e. considers infinitely many actions into the future.
◦ Makes use of a discount factor γ between 0 and 1 which represents the importance of the future.
◦ This is the model used in the Q-Learning algorithm.

Models of Optimal Behaviour (3)
Average-Reward Model
◦ Reward per action is considered.
◦ Infinite future is taken into account.
◦ Rewards in the future are equally valuable.
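For reference, the three criteria can be written out explicitly; these are the standard formulations (following the Kaelbling, Littman and Moore survey listed in the references) rather than equations recovered from the original slides:
◦ Finite Horizon: maximize E[ Σ_{t=0}^{h-1} r_t ]
◦ Infinite Horizon Discounted: maximize E[ Σ_{t=0}^{∞} γ^t r_t ], with 0 ≤ γ < 1
◦ Average Reward: maximize lim_{h→∞} (1/h) E[ Σ_{t=0}^{h-1} r_t ]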

The Learning Task
Execute actions, observe results and
◦ Learn a policy π : S → A maximizing
  V^π(s) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
  for all states s ∈ S.
• The infinite horizon discounted model is being used here.

Value Function
For each possible policy π the agent can adopt, the value function is
  V^π(s_t) = r_t + γ r_{t+1} + γ^2 r_{t+2} + ...
In terms of the value function, the learning task can be reformulated as learning the optimal policy π* such that
  π* = argmax_π V^π(s), for all s ∈ S
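As a tiny numerical check of the discounted sum above, with a hypothetical reward sequence and γ = 0.9:

```python
# Discounted return for a made-up reward sequence: 0, 0, 100.
gamma = 0.9
rewards = [0, 0, 100]                           # r_t, r_{t+1}, r_{t+2}
v = sum(gamma**k * r for k, r in enumerate(rewards))
print(v)                                        # 0 + 0 + 0.81 * 100 = 81.0
```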

Learning the Value Function
  π*(s) = argmax_a [ r(s, a) + γ V*(δ(s, a)) ]
Learning V* is only possible when the agent has perfect knowledge of both δ and r.
Thus the need to define the Q function arises.

Example V* Values
[Figure: grid-world example showing the V* value of each state, with γ = 0.9]

The Q Function
The evaluation function Q(s, a) is defined as
  Q(s, a) = r(s, a) + γ V*(δ(s, a))
If the Q function is learnt, the optimal action can be selected without knowing δ and r:
  π*(s) = argmax_a Q(s, a)

Learning the Q function
Notice the relationship between Q and V*:
  V*(s) = max_{a'} Q(s, a')
Rewriting the value of Q:
  Q(s, a) = r(s, a) + γ max_{a'} Q(δ(s, a), a')

Learning the Q-value
FOR each <s, a> DO
◦ Initialize table entry: Q̂(s, a) ← 0
Observe current state s
WHILE (true) DO
◦ Select action a and execute it
◦ Receive immediate reward r
◦ Observe new state s'
◦ Update the table entry for Q̂(s, a) using the training rule:
    Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
◦ Move: record the transition from s to s'
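A minimal tabular Q-learning sketch of the loop above. The toy transition and reward tables are hypothetical (two non-goal states plus an absorbing goal), chosen only so the learned values are easy to verify; γ = 0.9 as in the slides.

```python
import random
from collections import defaultdict

gamma = 0.9
actions = ["left", "right"]
delta = {("s0", "right"): "s1", ("s0", "left"): "s0",
         ("s1", "right"): "goal", ("s1", "left"): "s0"}
reward = {("s1", "right"): 100}                 # 100 for entering the goal, else 0

Q = defaultdict(float)                          # table entries Q̂(s, a), initialised to 0

for episode in range(200):                      # episodes with random start states
    s = random.choice(["s0", "s1"])
    while s != "goal":                          # "goal" is the absorbing state
        a = random.choice(actions)              # exploratory action selection
        s_next = delta[(s, a)]
        r = reward.get((s, a), 0)
        # Training rule: Q̂(s, a) <- r + gamma * max_a' Q̂(s', a')
        future = 0.0 if s_next == "goal" else max(Q[(s_next, b)] for b in actions)
        Q[(s, a)] = r + gamma * future
        s = s_next

print(Q[("s1", "right")])   # -> 100.0
print(Q[("s0", "right")])   # -> 90.0  (0.9 * 100)
```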

Q Learning Illustration
◦ Consider that an optimal strategy is being learnt for the given set of states and rewards for actions.
◦ The reward for every action is 0 unless it is a move into the goal state G, which has reward 100.
[Figure: grid-world diagram with immediate rewards of 0 on all transitions except those into the goal state G, which are labelled 100]

Q Learning Illustration (2)
◦ The actual values of V and Q, taking γ = 0.9, are:
[Figure: grid-world diagram annotated with the resulting V* and Q values]

Learning Step
◦ Since an absorbing state exists, learning will consist of a series of episodes with a random start state.
◦ Example update:
  Q̂(s_1, a_right) ← r + γ max_{a'} Q̂(s_2, a')
                   = 0 + 0.9 × max{63, 81, 100}
                   = 90

Uncertain State Transitions
◦ A state transition probability function T(s, a, s') is used.
◦ It denotes the transition probability from state s to s' when action a is performed.
◦ The Q function is modified to
  Q(s, a) = r(s, a) + γ Σ_{s' ∈ S} T(s, a, s') max_{a'} Q(s', a')
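A short sketch of evaluating the right-hand side of the modified equation for one state-action pair; the transition probabilities and Q values below are invented for illustration.

```python
# Expected-value form of Q(s, a) under stochastic transitions.
gamma = 0.9
actions = ["left", "right"]
successors = ["s1", "s2"]
T = {("s0", "right", "s1"): 0.8,               # T(s, a, s')
     ("s0", "right", "s2"): 0.2}
r = {("s0", "right"): 0}
Q = {("s1", "left"): 72, ("s1", "right"): 90,
     ("s2", "left"): 81, ("s2", "right"): 100}

s, a = "s0", "right"
expected = sum(T.get((s, a, s2), 0.0) * max(Q[(s2, b)] for b in actions)
               for s2 in successors)
q_sa = r[(s, a)] + gamma * expected
print(q_sa)   # 0 + 0.9 * (0.8 * 90 + 0.2 * 100) = 82.8
```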

Applications
◦ Cell Phone Channel Allocation
◦ Cobot: A Social Reinforcement Learning Agent
◦ Car Simulation: Using Reinforcement Learning
◦ Network Packet Routing
◦ Elevator Scheduling
◦ Use of RL to improve the performance of natural language question answering systems
◦ Reinforcement Learning Methods for Military Applications

Cell Phone Channel Allocation
Learns channel allocations for cell phones
◦ Channels are limited
◦ Allocations affect adjacent cells
◦ Want to minimize dropped and blocked calls

Cell Phone Channel Allocation Cont…
States
◦ Occupied and unoccupied channels for each cell
◦ Availability: number of free channels for each cell
Actions
◦ Call arrival
  Evaluate the possible free channels
  Assign the one that has the highest value
◦ Call termination
  Free the channel
  Consider reassigning each ongoing call to the just-released channel
  Perform the reassignment (if any) with the highest value
Rewards and Values
◦ Reward is the number of on-going calls

[Figure: performance comparison of FA, BDCL, and RL. FA = fixed assignment method; BDCL = borrowing with directional channel locking]

Cobot: A Social Reinforcement Learning Agent
Cobot is a software agent
◦ Applies RL in a complex, human online social chat-based environment.
◦ The goal is to interact with other members and to become a real part of the social fabric.
◦ Takes certain actions on its own initiative.
◦ Any user can reward or punish it.
Cobot has an incremental database of "social statistics"
◦ e.g. how frequently and in what ways users interacted with one another; it provided summaries of these statistics as a service.

Cobot Cont…
States
◦ One state space corresponding to each user
◦ The state space contains a number of features holding statistics about that particular user
Actions
◦ Null Action: choose to remain silent for this time period.
◦ Topic Starters: introduce a conversational topic.
◦ Social Commentary: make a comment describing the current social state of the Living Room, such as "It sure is quiet" or "Everyone here is friendly."

Cobot Cont…
The RL Reward Function
◦ Reward verb, e.g. hug
◦ Punish verb, e.g. spank
◦ These verbs give a numerical (positive and negative, respectively) training signal to Cobot.

Car Simulation: Using Reinforcement Learning
◦ The drivers do not know the track information beforehand.
◦ The car takes appropriate actions to avoid bumping into the wall, learning from past experience and the given rewards.
States
◦ The state of the car is represented by two variables:
  the distance of the car to the left wall of the track
  the car's velocity towards the right wall

Car Simulation Cont…
Actions
◦ Turn left or right to go in the right direction.
Rewards
◦ The car is given a negative reward if
  the car bumps into the wall
  the car goes backwards
◦ Positive if
  the car goes in the correct direction

Conclusion
The basic reinforcement learning model consists of:
◦ a set of environment states S,
◦ a set of actions A, and
◦ a set of scalar "rewards".
The learner takes actions in an environment so as to maximize its long-term reward.
Any problem domain that can be cast as a Markov decision process can potentially benefit from this technique.
Unlike supervised learning, reinforcement learning systems do not require explicit input-output pairs for training.

References
◦ Reinforcement Learning: A User's Guide. Bill Smart, Department of Computer Science and Engineering, Washington University in St. Louis, http://www.cse.wustl.edu/~wds/, ICAC 2005.
◦ Machine Learning, Tom Mitchell, McGraw Hill, 1997.
◦ Harmon, M., Harmon, S.: Reinforcement Learning: A Tutorial, Wright State University, 2000.
◦ Rich Sutton: Reinforcement Learning: Tutorial, AT&T Labs, http://www.cs.ualberta.ca/~sutton/Talks/RL-Tutorial/RL-Tutorial.ppt
◦ Kaelbling, L.P., Littman, M.L., and Moore, A.W.: "Reinforcement Learning: A Survey". Journal of Artificial Intelligence Research, 4, 1996.
◦ http://en.wikipedia.org/wiki/Reinforcement_learning
◦ A Social Reinforcement Learning Agent, by Charles Isbell, Christian Shelton, Michael Kearns, Satinder Singh and Peter Stone. In Proceedings of the Fifth International Conference on Autonomous Agents (AGENTS), pages 377-384, 2001. Winner of Best Paper Award.


Questions & Comments