Reinforcement Learning Upside Down - GitHub Pages

Transcript of Reinforcement Learning Upside Down - GitHub Pages

Page 1: Reinforcement Learning Upside Down - GitHub Pages

Reinforcement Learning Upside Down: Don't Predict Rewards - Just Map Them to Actions

Matthia Sabatelli

Montefiore Institute, Department of Electrical Engineering and Computer Science, Université de Liège, Belgium

March 18th 2021

1 / 35

Page 2: Reinforcement Learning Upside Down - GitHub Pages

Presentation outline

1 Reinforcement Learning

2 Upside-Down Reinforcement Learning

3 Personal Thoughts

2 / 35

Page 3: Reinforcement Learning Upside Down - GitHub Pages

Reinforcement Learning

The goal of Reinforcement Learning is to train an agent to interact with an environment which is modeled as a Markov Decision Process (MDP), consisting of (a toy sketch follows this list):

• a set of possible states S

• a set of possible actions A

• a reward signal R(s_t, a_t, s_{t+1})

• a transition probability distribution p(s_{t+1} | s_t, a_t)

• a discount factor γ ∈ [0, 1]
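To make these ingredients concrete, here is a minimal Python sketch (not part of the slides) of a toy three-state chain; all names, transitions and reward values are purely illustrative assumptions.

```python
from typing import Dict, Tuple

# S, A and γ for a tiny illustrative chain environment
states = ["s0", "s1", "s2"]           # set of possible states S
actions = ["left", "right"]           # set of possible actions A
gamma = 0.9                           # discount factor γ ∈ [0, 1]

# transition probability distribution p(s_{t+1} | s_t, a_t), deterministic here
P: Dict[Tuple[str, str], Dict[str, float]] = {
    ("s0", "right"): {"s1": 1.0}, ("s1", "right"): {"s2": 1.0}, ("s2", "right"): {"s2": 1.0},
    ("s0", "left"):  {"s0": 1.0}, ("s1", "left"):  {"s0": 1.0}, ("s2", "left"):  {"s1": 1.0},
}

# reward signal R(s_t, a_t, s_{t+1}): +1 whenever the agent lands in s2
def R(s: str, a: str, s_next: str) -> float:
    return 1.0 if s_next == "s2" else 0.0
```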

3 / 35

Page 4: Reinforcement Learning Upside Down - GitHub Pages

Reinforcement Learning

The agent interacts continuously with the environment in the RL loop

Figure: Image taken from page 48 of Sutton and Barto [2].

4 / 35

Page 5: Reinforcement Learning Upside Down - GitHub Pages

Reinforcement Learning

The goal of the agent is to maximize the expected discounted cumulative reward

G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... = ∑_{k=0}^{∞} γ^k r_{t+k+1}.
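As a quick illustration (not from the slides), a minimal Python sketch of this return, computed over a finite list of rewards; the reward values used are arbitrary:

```python
def discounted_return(rewards, gamma=0.99):
    """G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example: rewards 1, 0, 2 collected from time t onwards, with γ = 0.9
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```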

5 / 35

Page 6: Reinforcement Learning Upside Down - GitHub Pages

Reinforcement Learning

An agent decides how to interact with the environment based on its policy, which maps every state to an action:

π : S → A

• The essence of RL algorithms is to find the best possible policy.

• How to define a good policy?

6 / 35

Page 7: Reinforcement Learning Upside Down - GitHub Pages

Reinforcement Learning

We need the concept of value function

• The state value function V π(s)

• The state-action value function Qπ(s, a)

V^π(s) = E[ ∑_{k=0}^{∞} γ^k r_{t+k} | s_t = s, π ]

Q^π(s, a) = E[ ∑_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a, π ].

Both value functions quantify how desirable it is to be in a specific state (and, for Q^π, to take a specific action in that state).

7 / 35

Page 8: Reinforcement Learning Upside Down - GitHub Pages

Reinforcement Learning

When maximized, both value functions satisfy a consistency condition that allows us to re-express them recursively:

V*(s_t) = max_a ∑_{s_{t+1}} p(s_{t+1} | s_t, a) [ R(s_t, a, s_{t+1}) + γ V*(s_{t+1}) ]

and

Q*(s_t, a_t) = ∑_{s_{t+1}} p(s_{t+1} | s_t, a_t) [ R(s_t, a_t, s_{t+1}) + γ max_a Q*(s_{t+1}, a) ],

which correspond to the Bellman optimality equations.
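As a hedged illustration (not part of the slides), a short value-iteration sketch that repeatedly applies the Bellman optimality backup above on the toy chain used earlier; the environment, rewards and stopping threshold are assumptions:

```python
# Toy deterministic chain: p(s' | s, a) and R(s, a, s') with illustrative values
states = ["s0", "s1", "s2"]
actions = ["left", "right"]
gamma = 0.9
P = {
    ("s0", "right"): {"s1": 1.0}, ("s1", "right"): {"s2": 1.0}, ("s2", "right"): {"s2": 1.0},
    ("s0", "left"):  {"s0": 1.0}, ("s1", "left"):  {"s0": 1.0}, ("s2", "left"):  {"s1": 1.0},
}
R = lambda s, a, s_next: 1.0 if s_next == "s2" else 0.0

V = {s: 0.0 for s in states}
for _ in range(1000):
    # Bellman optimality backup: V*(s) = max_a Σ_{s'} p(s'|s,a) [R(s,a,s') + γ V*(s')]
    V_new = {
        s: max(
            sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in P[(s, a)].items())
            for a in actions
        )
        for s in states
    }
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-8:  # stop once V has converged
        V = V_new
        break
    V = V_new

print(V)  # converges towards V*(s0) = 9, V*(s1) = V*(s2) = 10 for this toy chain
```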

8 / 35

Page 9: Reinforcement Learning Upside Down - GitHub Pages

Reinforcement Learning

Value functions play a key role in the development of RL

• Dynamic Programming and Value Iteration (if p(s_{t+1} | s_t, a_t) or R(s_t, a_t, s_{t+1}) are known!)

• Model-Free RL:
  • Value-based methods
  • Policy-based methods

However ...
... when dealing with large MDPs, learning these value functions can become complicated, since they scale with respect to the state-action space of the environment.

9 / 35

Page 10: Reinforcement Learning Upside Down - GitHub Pages

Reinforcement Learning

Why is RL considered to be so challenging?

• We are dealing with a temporal component

• The environment is unknown

• There is no such thing as a fixed dataset

• The data we learn from is a moving target

These are all differences that make Reinforcement Learning so different from Supervised Learning!

10 / 35

Page 11: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

This idea has been introduced in 2 papers:

• Schmidhuber, Juergen. "Reinforcement Learning Upside Down: Don't Predict Rewards - Just Map Them to Actions." arXiv preprint arXiv:1912.02875 (2019).

• Srivastava, Rupesh Kumar, et al. "Training agents using upside-down reinforcement learning." arXiv preprint arXiv:1912.02877 (2019).

11 / 35

Page 12: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

The paper starts with two strong claims:

• Supervised Learning (SL) techniques are already incorporated within Reinforcement Learning (RL) algorithms

• There is no way of turning an RL problem into an SL problem

Main Idea
Yet the main idea of Upside-Down RL is to turn traditional RL on its head and transform it into a form of SL.

12 / 35

Page 13: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

• Let's take a look at why the gap between SL and RL is actually not that large

• Consider Deep Q-Networks, a DRL technique which trains neural networks for learning an approximation of the state-action value function Q(s, a; θ) ≈ Q*(s, a)

L(θ) = E_{⟨s_t, a_t, r_t, s_{t+1}⟩ ∼ U(D)} [ ( r_t + γ max_{a ∈ A} Q(s_{t+1}, a; θ⁻) − Q(s_t, a_t; θ) )² ].
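For concreteness (not in the slides), a minimal PyTorch-style sketch of this loss; `q_net`, `target_net` and the batch layout are illustrative assumptions, and the `(1 - done)` masking of terminal states goes slightly beyond the formula above:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error: the bootstrapped target plays the role of a supervised label."""
    s, a, r, s_next, done = batch                            # ⟨s_t, a_t, r_t, s_{t+1}⟩ ~ U(D)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s_t, a_t; θ)
    with torch.no_grad():                                    # target parameters θ⁻ are held fixed
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, target)                          # MSE between prediction and TD target
```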

13 / 35

Page 16: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

In order to successfully train DRL algorithms we are already exploiting SL:

• We need to collect a large dataset of RL trajectories ⟨s_t, a_t, r_t, s_{t+1}⟩

• Networks are trained with common SL loss functions (MSE)

• States that have not been visited are dealt with via data augmentation

• Policy and value-function regularization

16 / 35

Page 20: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

DRL algorithms come with a lot of issues that go back to tabular RL:

• It is not easy to learn value functions, i.e., they can be biased

• When combined with function approximation, the algorithms that learn these functions can diverge

• Similar issues hold for policy-gradient methods: extrapolation error

• The RL set-up is distorted: see the role of γ

17 / 35

Page 24: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

We consider the same setting as the one that characterizes classic RL:

• We are dealing with Markovian environments

• We have access to states, actions and rewards (s, a, r)

• The agent is governed by a policy π : S → A

• We deal with RL episodes that are described by trajectories τ in the form of ⟨s_t, a_t, r_t, s_{t+1}⟩

18 / 35

Page 28: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

Even if the setting is the same one as in RL, the core principle of Upside-Down Reinforcement Learning is different!

• Value-based RL algorithms predict rewards

• Policy-based RL algorithms search for a π that maximizes the return

• Upside-Down RL predicts actions

19 / 35

Page 31: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

In order to predict actions we introduce two new concepts that are not present in the classic RL setting:

• A behavior function B

• A set of commands c (such as a desired return d^r and a desired horizon d^h) that will be given as input to B

20 / 35

Page 32: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

21 / 35

Page 33: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

Figure: a small example MDP with states s0, s1, s2, s3 and actions a1, a2, a3 yielding rewards 2, 1 and -1 respectively.

From such experience we can build a table mapping a state and a command (desired return d^r, desired horizon d^h) to the action that achieves it (a sketch of how such rows can be extracted from a trajectory follows below):

State  d^r  d^h  a
s0      2    1   a1
s0      1    1   a2
s0      1    2   a1
s1     -1    1   a3
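A hedged Python sketch (not from the slides) of how such rows might be extracted from a single trajectory; the function name and the trajectory format (state, action, reward) are illustrative assumptions:

```python
def behavior_examples(trajectory):
    """Turn one episode into supervised (input, label) pairs for the behavior function.
    `trajectory` is a list of (state, action, reward) tuples."""
    examples = []
    for t in range(len(trajectory)):
        for h in range(t + 1, len(trajectory) + 1):            # every time span starting at t
            d_r = sum(r for (_, _, r) in trajectory[t:h])      # desired return over the span
            d_h = h - t                                        # desired horizon (span length)
            s_t, a_t, _ = trajectory[t]
            examples.append(((s_t, d_r, d_h), a_t))            # input (s, d^r, d^h), label a
    return examples

# The trajectory s0 --a1 (+2)--> s1 --a3 (-1)--> s3 produces, among others, the
# table rows (s0, 2, 1) -> a1, (s0, 1, 2) -> a1 and (s1, -1, 1) -> a3 shown above.
print(behavior_examples([("s0", "a1", 2.0), ("s1", "a3", -1.0)]))
```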

22 / 35

Page 34: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

While simple and intuitive, learning B is not enough for successfully tackling RL tasks:

• We are learning a mapping f : (s, d^r, d^h) → a

• B can be learned for any possible trajectory

23 / 35

Page 36: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

While simple and intuitive learning B is not enough forsuccessfully tackling RL tasks

• We are learning a mapping f : (s, d r , dh)→ a

• B can be learned for any possible trajectory

We are missing two crucial RL components

• Improvement of π over time

• Exploration of the environment

24 / 35

Page 38: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

Yet there exists one algorithm that is able to deal with these issues and that trains Upside-Down agents (sort of) successfully. Its main components are (a minimal sketch follows the list):

• An experience replay memory buffer E which stores different τ while learning progresses

• A representation of a state s_t and a command tuple c_t = (d^r_t, d^h_t)

• A behavior function B(s_t, c_t; θ) that predicts an action distribution P(a_t | s_t, c_t)
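A hedged PyTorch-style sketch (not the papers' exact architecture or hyperparameters) of such a behavior function and its supervised update; every class name, layer size and tensor layout here is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BehaviorFunction(nn.Module):
    """B(s_t, c_t; θ): maps a state plus a command (d^r_t, d^h_t) to action logits."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 2, hidden), nn.ReLU(),   # +2 inputs for (d^r_t, d^h_t)
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor, command: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, command], dim=-1))

def udrl_update(B, optimizer, states, commands, actions):
    """One supervised step on segments sampled from the replay buffer E:
    predict the action that was actually taken, given (s_t, c_t)."""
    logits = B(states, commands)                 # unnormalized log P(a_t | s_t, c_t)
    loss = F.cross_entropy(logits, actions)      # plain classification loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```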

25 / 35

Page 39: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

26 / 35

Page 40: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

27 / 35

Page 41: Reinforcement Learning Upside Down - GitHub Pages

Upside-Down Reinforcement Learning

28 / 35

Page 42: Reinforcement Learning Upside Down - GitHub Pages

Personal Thoughts

A critical analysis of Upside-Down RL ...

• I started with re-implementing the main ideas of both Upside-Down RL papers

• Is Upside-Down RL a potential breakthrough?

• What are the pros & cons compared to more common RL research?

29 / 35

Page 43: Reinforcement Learning Upside Down - GitHub Pages

Personal Thoughts

Figure: Cartpole: reward per episode (roughly 300 episodes, rewards up to 200) for DQV, DQN, DDQN and Upside-Down RL.

30 / 35

Page 44: Reinforcement Learning Upside Down - GitHub Pages

Personal Thoughts

Figure: Pong: reward per episode (up to roughly 1,500 episodes, rewards between -20 and 20) for DQV, DQN, DDQN and Upside-Down RL.

31 / 35

Page 45: Reinforcement Learning Upside Down - GitHub Pages

Personal Thoughts

Some personal insights about the experiments on the Cartpole environment:

✓ It is possible to train agents with the Upside-Down RL setting

✓ When it works, performance is better than that of traditional DRL methods

✓ The algorithm allows for more efficient implementations than standard DRL methods

✓ For example, the capacity of the memory buffer can be significantly reduced

32 / 35

Page 49: Reinforcement Learning Upside Down - GitHub Pages

Personal Thoughts

Some personal insights about the experiments on the Cartpole environment:

× Requires networks with more parameters

× It seems that it does not scale to more complex environments

× Does not deal well with the exploration-exploitation trade-off

× There are no theoretical guarantees: no value functions and no Bellman optimality equations either

33 / 35

Page 53: Reinforcement Learning Upside Down - GitHub Pages

Personal Thoughts

To conclude ...

• Schmidhuber’s claims are certainly well motivated

• We should rethink the way we are doing DRL

• The entire field does not need to fully start over, yet it is true that some concepts do need revision

• Time will tell :)

34 / 35

Page 57: Reinforcement Learning Upside Down - GitHub Pages

References

• Bellman, Richard. "Dynamic programming." Science 153.3731 (1966): 34-37.

• Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

• Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning." Nature 518.7540 (2015): 529-533.

• Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep reinforcement learning with double Q-learning." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 30. No. 1. 2016.

• Sabatelli, Matthia, et al. "The deep quality-value family of deep reinforcement learning algorithms." 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020.

• Schmidhuber, Juergen. "Reinforcement Learning Upside Down: Don't Predict Rewards - Just Map Them to Actions." arXiv preprint arXiv:1912.02875 (2019).

• Srivastava, Rupesh Kumar, et al. "Training agents using upside-down reinforcement learning." arXiv preprint arXiv:1912.02877 (2019).

35 / 35