Bayesian reinforcement learning - Markov decision processes and approximate Bayesian computation


Bayesian reinforcement learning

Markov decision processes and approximate Bayesian computation

Christos Dimitrakakis

Chalmers

April 16, 2015


Overview

Subjective probability and utility: Subjective probability; Rewards and preferences

Bandit problems: Bernoulli bandits

Markov decision processes and reinforcement learning: Markov processes; Value functions; Examples

Bayesian reinforcement learning: Reinforcement learning; Bounds on the utility; Planning: Heuristics and exact solutions; Belief-augmented MDPs; The expected MDP heuristic; The maximum MDP heuristic; Inference: Approximate Bayesian computation; Properties of ABC


Subjective probability and utility


Subjective probability and utility Subjective probability

Objective Probability


Figure: The double slit experiment


Subjective probability and utility Subjective probability

What about everyday life?


Subjective probability and utility Subjective probability

Subjective probability

Making decisions requires making predictions.

Outcomes of decisions are uncertain.

How can we represent this uncertainty?

Subjective probability

Describe which events we think are more likely.

We quantify this with probability.

Why probability?

Quantifies uncertainty in a “natural” way.

A framework for drawing conclusions from data.

Computationally convenient for decision making.


Subjective probability and utility Rewards and preferences

Rewards

We are going to receive a reward r from a set R of possible rewards.

We prefer some rewards to others.

Example 1 (Possible sets of rewards R)

R is a set of tickets to different musical events.

R is a set of financial commodities.


Subjective probability and utility Rewards and preferences

When we cannot select rewards directly

In most problems, we cannot just choose which reward to receive.

We can only specify a distribution on rewards.

Example 2 (Route selection)

Each reward r ∈ R is the time it takes to travel from A to B.

Route P1 is faster than P2 in heavy traffic and vice-versa.

Which route should be preferred, given a certain probability for heavy traffic?

In order to choose between random rewards, we use the concept of utility.


Subjective probability and utility Rewards and preferences

Definition 3 (Utility)

The utility is a function U : R → R, such that for all a, b ∈ R

a ≿∗ b iff U(a) ≥ U(b), (1.1)

The expected utility of a distribution P on R is:

EP(U) = ∑_{r∈R} U(r) P(r) (1.2)
      = ∫_R U(r) dP(r) (1.3)

Assumption 1 (The expected utility hypothesis)

The utility of P is equal to the expected utility of the reward under P. Consequently,

P ≿∗ Q iff EP(U) ≥ EQ(U). (1.4)

i.e. we prefer P to Q iff the expected utility under P is higher than under Q
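
As a quick illustration of the expected utility hypothesis, the following sketch (plain Python; the utility function and the two reward distributions P and Q are invented for this example, not taken from the slides) compares two distributions by their expected utilities.

def expected_utility(distribution, utility):
    # distribution: dict mapping reward -> probability (assumed to sum to 1)
    return sum(p * utility(r) for r, p in distribution.items())

U = lambda minutes: -minutes          # illustrative utility: shorter travel time is better
P = {20: 0.9, 60: 0.1}                # hypothetical route 1: usually 20 min, sometimes 60
Q = {30: 1.0}                         # hypothetical route 2: always 30 min

print(expected_utility(P, U), expected_utility(Q, U))   # -24.0 -30.0, so P ≿∗ Q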


Subjective probability and utility Rewards and preferences

The St. Petersburg Paradox

A simple game [Bernoulli, 1713]

A fair coin is tossed until a head is obtained.

If the first head is obtained on the n-th toss, our reward will be 2^n currency units.

How much are you willing to pay, to play this game once?

The probability to stop at round n is 2^{−n}.

Thus, the expected monetary gain of the game is

∑_{n=1}^∞ 2^n 2^{−n} = ∞.

If your utility function were linear (U(r) = r), you’d be willing to pay any amount to play.

You might not internalise the setup of the game (is the coin really fair?)
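
The divergence, and one classical way out (a concave utility such as U(r) = log r), can be checked numerically; a minimal sketch in plain Python, with the truncation point chosen arbitrarily:

import math

def expected_utility_of_game(utility, n_terms=200):
    # sum over n of P(first head at toss n) * U(2^n), with P(first head at toss n) = 2^-n
    return sum(2.0 ** -n * utility(2.0 ** n) for n in range(1, n_terms + 1))

print(expected_utility_of_game(lambda r: r, n_terms=50))   # 50.0: grows without bound as n_terms grows
print(expected_utility_of_game(math.log))                  # ≈ 1.386 = 2 log 2: finite under log utility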


Subjective probability and utility Rewards and preferences

Summary

We can subjectively indicate which events we think are more likely.

We can define a subjective probability P for all events.

Similarly, we can subjectively indicate preferences for rewards.

We can determine a utility function for all rewards.

Hypothesis: we prefer the probability distribution with the highest expected utility.

This allows us to create algorithms for decision making.


Subjective probability and utility Rewards and preferences

Experimental design and Markov decision processes

The following problems

Shortest path problems.

Optimal stopping problems.

Reinforcement learning problems.

Experiment design (clinical trial) problems

Advertising.

can all be formalised as Markov decision processes.


Bandit problems


Bandit problems

Bandit problems

Applications

Efficient optimisation.

Online advertising.

Clinical trials.

Robot scientist.


Bandit problems

The stochastic n-armed bandit problem

Actions and rewards

A set of actions A = {1, . . . , n}. Each action gives you a random reward with distribution P(rt | at = i).

The expected reward of the i-th arm is ωi ≜ E(rt | at = i).

Utility

The utility is the sum of the individual rewards r = (r1, . . . , rT):

U(r) ≜ ∑_{t=1}^T rt.

Definition 4 (Policies)

A policy π is an algorithm for taking actions given the observed history.

Pπ(at+1 | a1, r1, . . . , at , rt)

is the probability of the next action at+1.


Bandit problems Bernoulli bandits

Bernoulli bandits

Example 5 (Bernoulli bandits)

Consider n Bernoulli distributions with parameters ωi (i = 1, . . . , n) such that rt | at = i ∼ Bern(ωi). Then,

P(rt = 1 | at = i) = ωi,   P(rt = 0 | at = i) = 1 − ωi. (2.1)

Then the expected reward for the i-th bandit is E(rt | at = i) = ωi .

Exercise 1 (The optimal policy)

If we know ωi for all i , what is the best policy?

What if we don’t?


Bandit problems Bernoulli bandits

A simple heuristic for the unknown reward case

Say you keep a running average of the reward obtained by each arm

ωt,i = Rt,i / nt,i,

where nt,i is the number of times you played arm i and Rt,i is the total reward received from arm i.

Whenever you play at = i :

Rt+1,i = Rt,i + rt , nt+1,i = nt,i + 1.

Greedy policy: at = argmax_i ωt,i.

What should the initial values n0,i ,R0,i be?
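
A minimal simulation of this heuristic (plain Python; the two Bernoulli arm means are invented, and the optimistic initial values n0,i = R0,i = 1 match the plot on the next slide):

import random

def greedy_bandit(true_means, T=1000, R0=1.0, n0=1.0):
    n = [n0] * len(true_means)      # n_{t,i}: play counts (optimistic initial value)
    R = [R0] * len(true_means)      # R_{t,i}: accumulated rewards
    total = 0.0
    for t in range(T):
        # greedy action: arm with the highest running average R_i / n_i
        a = max(range(len(true_means)), key=lambda i: R[i] / n[i])
        r = 1.0 if random.random() < true_means[a] else 0.0   # Bernoulli reward
        R[a] += r
        n[a] += 1
        total += r
    return total / T

print(greedy_bandit([0.6, 0.4]))    # average reward; compare with the best arm's mean 0.6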


Bandit problems Bernoulli bandits

The greedy policy for n0,i = R0,i = 1

[Figure: the arm estimates ω1, ω2 and the running mean reward ∑_{k=1}^t rk / t over 1000 plays.]


Markov decision processes and reinforcement learning


Markov decision processes and reinforcement learning Markov processes

Markov process

s_{t−1} → s_t → s_{t+1}

Definition 6 (Markov Process – or Markov Chain)

The sequence {st | t = 1, . . .} of random variables st : Ω → S is a Markov process if

P(st+1 | st , . . . , s1) = P(st+1 | st). (3.1)

st is the state of the Markov process at time t.

P(st+1 | st) is the transition kernel of the process.

The state of an algorithm

Observe that the R, n vectors of our greedy bandit algorithm form a Markov process. They also summarise our belief about which arm is the best.


Markov decision processes and reinforcement learning Markov processes

Markov decision processes

Markov decision processes (MDP).

At each time step t:

We observe state st ∈ S.
We take action at ∈ A.
We receive a reward rt ∈ R.

[Diagram: st and at jointly determine rt and st+1.]

Markov property of the reward and state distribution

Pµ(st+1 | st , at) (Transition distribution)

Pµ(rt | st , at) (Reward distribution)


Markov decision processes and reinforcement learning Markov processes

The agent

The agent’s policy π

Pπ(at | rt , st , at , . . . , r1, s1, a1) (history-dependent policy)

Pπ(at | st) (Markov policy)

Definition 7 (Utility)

Given a horizon T ≥ 0, and discount factor γ ∈ (0, 1] the utility can be defined as

Ut ≜ ∑_{k=0}^{T−t} γ^k rt+k (3.2)

The agent wants to find π maximising the expected total future reward

E^π_µ Ut = E^π_µ ∑_{k=0}^{T−t} γ^k rt+k. (expected utility)


Markov decision processes and reinforcement learning Value functions

State value function

V^π_{µ,t}(s) ≜ E^π_µ(Ut | st = s) (3.3)

The optimal policy π∗

π∗(µ) : V^{π∗(µ)}_{t,µ}(s) ≥ V^π_{t,µ}(s) ∀π, t, s (3.4)

dominates all other policies π everywhere in S. The optimal value function V∗

V∗_{t,µ}(s) ≜ V^{π∗(µ)}_{t,µ}(s), (3.5)

is the value function of the optimal policy π∗.


Markov decision processes and reinforcement learning Examples

Stochastic shortest path problem with a pit

[Figure: grid maze with a pit O and a goal X.]

Properties

T → ∞.

rt = −1, but rt = 0 at X and −100 at O, and the problem ends.

Pµ(st+1 = X | st = X) = 1.

A = {North, South, East, West}. The agent moves in a random direction with probability ω. Walls block movement.


Markov decision processes and reinforcement learning Examples

(a) ω = 0.1   (b) ω = 0.5   (c) value

Figure: Pit maze solutions for two values of ω.


Markov decision processes and reinforcement learning Examples

How to evaluate a policy (Case: γ = 1)

V^π_{µ,t}(s) ≜ E^π_µ(Ut | st = s) (3.6)
            = ∑_{k=0}^{T−t} E^π_µ(rt+k | st = s) (3.7)
            = E^π_µ(rt | st = s) + E^π_µ(Ut+1 | st = s) (3.8)
            = E^π_µ(rt | st = s) + ∑_{i∈S} V^π_{µ,t+1}(i) P^π_µ(st+1 = i | st = s). (3.9)

This derivation directly gives a number of policy evaluation algorithms.

max_π V^π_{µ,t}(s) = max_a Eµ(rt | st = s, a) + max_{π′} ∑_{i∈S} V^{π′}_{µ,t+1}(i) P^{π′}_µ(st+1 = i | st = s)

gives us the optimal policy value.
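
For instance, equation (3.9) can be turned into a backward-recursion policy evaluator; here is a minimal sketch in plain Python, where the two-state MDP, the uniformly random policy and the horizon are all invented for illustration:

# Hypothetical MDP: P[s][a] = [(next_state, prob), ...], r[s][a] = expected reward.
P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
     1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]}}
r = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}     # π(a | s): uniformly random

def evaluate(policy, T):
    # V_t(s) = E_π(r_t | s) + Σ_i V_{t+1}(i) P_π(s_{t+1} = i | s_t = s), with V_{T+1} = 0
    V = {s: 0.0 for s in P}
    for t in range(T):
        V = {s: sum(policy[s][a] * (r[s][a] + sum(p * V[j] for j, p in P[s][a]))
                    for a in P[s])
             for s in P}
    return V

print(evaluate(policy, T=10))       # value of the random policy over a 10-step horizon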


Markov decision processes and reinforcement learning Examples

Backward induction for discounted infinite horizon problems

We can also apply backwards induction to the infinite case.

The resulting policy is stationary.

So memory does not grow with T .

Value iteration

for n = 1, 2, . . . and s ∈ S do
    vn(s) = max_a [ r(s, a) + γ ∑_{s′∈S} Pµ(s′ | s, a) v_{n−1}(s′) ]
end for
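
A runnable counterpart of this loop, in plain Python; the two-state MDP (same hypothetical format as the policy-evaluation sketch above), γ = 0.9 and the stopping tolerance are illustrative choices:

P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(1, 1.0)]},
     1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]}}
r = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}

def value_iteration(P, r, gamma=0.9, tol=1e-9):
    v = {s: 0.0 for s in P}
    while True:
        v_new = {s: max(r[s][a] + gamma * sum(p * v[j] for j, p in P[s][a]) for a in P[s])
                 for s in P}
        if max(abs(v_new[s] - v[s]) for s in P) < tol:
            break
        v = v_new
    # the greedy policy with respect to the converged values is stationary
    pi = {s: max(P[s], key=lambda a: r[s][a] + gamma * sum(p * v_new[j] for j, p in P[s][a]))
          for s in P}
    return v_new, pi

print(value_iteration(P, r))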


Markov decision processes and reinforcement learning Examples

Summary

Markov decision processes model controllable dynamical systems.

Optimal policies maximise expected utility and can be found with backwards induction / value iteration, or with policy iteration.

The MDP state can be seen as the state of a dynamic controllable process, or as the internal state of an agent.


Bayesian reinforcement learning


Bayesian reinforcement learning Reinforcement learning

The reinforcement learning problem

Learning to act in an unknown world, by interaction and reinforcement.

World µ; policy π; at time t:

µ generates observation xt ∈ X.
We take action at ∈ A using π.
µ gives us reward rt ∈ R.

Definition 8 (Expected utility)

E^π_µ Ut = E^π_µ ∑_{k=t}^T rk

When µ is known, we can calculate max_π E^π_µ U. But knowing µ is contrary to the problem definition.


Bayesian reinforcement learning Reinforcement learning

When µ is not known

Bayesian idea: use a subjective belief ξ(µ) on M

Initial belief ξ(µ).

The probability of observing history h is Pπµ(h).

We can use this to adjust our belief via Bayes’ theorem:

ξ(µ | h, π) ∝ Pπµ(h)ξ(µ)

We can thus conclude which µ is more likely.

The subjective expected utility

U∗_ξ ≜ max_π E^π_ξ U = max_π ∑_µ (E^π_µ U) ξ(µ).

Integrates planning and learning, and the exploration-exploitation trade-off
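
When M is a small finite set, the Bayes update is a few lines; the following sketch (plain Python) uses two invented candidate models, each a table of Bernoulli reward probabilities per action, and a uniform prior. The policy's own action probabilities cancel in the normalisation, so only the reward likelihoods are needed here.

# Hypothetical model set M: each µ maps an action to its Bernoulli reward probability.
models = {"mu1": {"a": 0.7, "b": 0.2},
          "mu2": {"a": 0.3, "b": 0.8}}
prior = {"mu1": 0.5, "mu2": 0.5}                       # initial belief ξ(µ)

def posterior(belief, history):
    # ξ(µ | h, π) ∝ P^π_µ(h) ξ(µ), with h a list of (action, reward) pairs
    w = {}
    for mu, params in models.items():
        like = 1.0
        for a, r in history:
            p = params[a]
            like *= p if r == 1 else (1.0 - p)
        w[mu] = like * belief[mu]
    z = sum(w.values())
    return {mu: wi / z for mu, wi in w.items()}

print(posterior(prior, [("a", 1), ("a", 1), ("b", 0)]))   # belief shifts towards mu1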


Bayesian reinforcement learning Bounds on the utility

Bounds on the ξ-optimal utility U∗_ξ ≜ max_π E^π_ξ U

[Figure: expected utility E U as a function of the belief ξ, showing the values of the fixed policies π1 and π2, the optimal utilities U∗_µ1 (no trap) and U∗_µ2 (trap), the ξ-optimal utility U∗_ξ, and the utility U∗_ξ1 of the policy π∗_ξ1 that is optimal for a particular belief ξ1.]


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Bernoulli bandits

Decision-theoretic approach

Assume rt | at = i ∼ Pωi , with ωi ∈ Ω.

Define prior belief ξ1 on Ω.

For each step t, select action at to maximise

E_{ξt}(Ut | at) = E_{ξt}(∑_{k=1}^{T−t} γ^k rt+k | at)

Obtain reward rt.

Calculate the next belief ξt+1 = ξt(· | at, rt).

How can we implement this?
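
The slides answer this next via conjugate (Beta) beliefs and a belief-state MDP. A much cruder stand-in is the one-step (myopic) version of the rule above, which picks the action with the highest posterior-mean reward; the sketch below (plain Python, with invented arm means) shows only the loop structure and ignores the exploration value that the exact multi-step solution accounts for.

import random

true_means = [0.6, 0.4]                       # hidden ω_i, invented for the demo
alpha = [1.0, 1.0]                            # Beta(α_i, β_i) belief per arm
beta = [1.0, 1.0]

for t in range(1000):
    # myopic choice: maximise the one-step expected reward under the current belief ξ_t
    a = max(range(len(true_means)), key=lambda i: alpha[i] / (alpha[i] + beta[i]))
    r = 1 if random.random() < true_means[a] else 0
    # belief update ξ_{t+1} = ξ_t(· | a_t, r_t)
    alpha[a] += r
    beta[a] += 1 - r

print([a / (a + b) for a, b in zip(alpha, beta)])   # posterior means of the two arms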


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Bayesian inference on Bernoulli bandits

Likelihood: Pω(rt = 1) = ω.

Prior: ξ(ω) ∝ ω^{α−1} (1 − ω)^{β−1} (i.e. Beta(α, β)).


Figure: Prior belief ξ about the mean reward ω.


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Bayesian inference on Bernoulli bandits

For a sequence r = r1, . . . , rn, the likelihood is Pω(r) ∝ ω^{#1(r)} (1 − ω)^{#0(r)}, where #1(r) and #0(r) count the 1s and 0s in r.


Figure: Prior belief ξ about ω and likelihood of ω for 100 plays with 70 1s.


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Bayesian inference on Bernoulli bandits

Posterior: Beta(α+#1(r), β +#0(r)).


Figure: Prior belief ξ(ω) about ω, likelihood of ω for the data r , and posterior belief ξ(ω | r)
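
This conjugate update is a one-liner in code; a minimal sketch (plain Python; the Beta(1, 1) prior and the 70-successes-in-100-plays data mirror the figure but are otherwise illustrative):

def beta_update(alpha, beta, rewards):
    # posterior is Beta(α + #1(r), β + #0(r)) for Bernoulli data r
    ones = sum(rewards)
    return alpha + ones, beta + len(rewards) - ones

alpha, beta = beta_update(1.0, 1.0, [1] * 70 + [0] * 30)
print(alpha, beta, alpha / (alpha + beta))    # Beta(71, 31), posterior mean ≈ 0.696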


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Bernoulli example.

Consider n Bernoulli distributions with unknown parameters ω_i (i = 1, . . . , n) such that

    r_t | a_t = i ∼ Bern(ω_i),   E(r_t | a_t = i) = ω_i.    (4.1)

Our belief for each parameter ω_i is Beta(α_i, β_i), with density f(ω | α_i, β_i), so that

    ξ(ω_1, . . . , ω_n) = ∏_{i=1}^n f(ω_i | α_i, β_i).    (a priori independent)

Define the number of plays of arm i and its empirical mean reward:

    N_{t,i} ≜ Σ_{k=1}^t I{a_k = i},    r̄_{t,i} ≜ (1 / N_{t,i}) Σ_{k=1}^t r_k I{a_k = i}

Then, the posterior distribution for the parameter of arm i is

    ξ_t = Beta(α_i + N_{t,i} r̄_{t,i}, β_i + N_{t,i} (1 − r̄_{t,i})).

Since r_t ∈ {0, 1}, there are O((2n)^T) possible belief states for a T-step bandit problem.
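As a concrete illustration of the conjugate update above, here is a minimal Python sketch (the class and names are ours, not from the slides): each arm keeps Beta parameters, and an observed 0/1 reward simply increments one of them.

class BernoulliBanditBelief:
    """Beta(alpha_i, beta_i) belief for each arm of a Bernoulli bandit."""
    def __init__(self, n_arms, alpha=1.0, beta=1.0):
        self.alpha = [alpha] * n_arms   # alpha_i + N_{t,i} * mean reward so far
        self.beta = [beta] * n_arms     # beta_i  + N_{t,i} * (1 - mean reward)

    def update(self, arm, reward):
        # Beta(a, b) -> Beta(a + r, b + 1 - r) for a reward r in {0, 1}
        self.alpha[arm] += reward
        self.beta[arm] += 1 - reward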


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Belief states

The state of the decision-theoretic bandit problem is the state of our belief.

A sufficient statistic is the number of plays and total rewards.

Our belief state ξ_t is described by the priors α, β and the vectors

    N_t = (N_{t,1}, . . . , N_{t,n})    (4.2)

    r̄_t = (r̄_{t,1}, . . . , r̄_{t,n}).    (4.3)

The next-state probabilities are defined as:

    P(r_t = 1 | a_t = i, ξ_t) = (α_i + N_{t,i} r̄_{t,i}) / (α_i + β_i + N_{t,i}),

while ξ_{t+1} is a deterministic function of ξ_t, r_t and a_t.

So the bandit problem can be formalised as a Markov decision process.
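A minimal sketch of the induced belief-MDP dynamics, assuming the Beta bookkeeping from the previous sketch (with the counts already folded into alpha and beta): the next-reward probability is the posterior mean, and the belief transition is deterministic given action and reward.

def reward_probability(alpha, beta, arm):
    # P(r_t = 1 | a_t = i, xi_t) = (alpha_i + N_{t,i} rbar_{t,i}) / (alpha_i + beta_i + N_{t,i}),
    # i.e. alpha[arm] / (alpha[arm] + beta[arm]) once the counts are folded in.
    return alpha[arm] / (alpha[arm] + beta[arm])

def next_belief(alpha, beta, arm, reward):
    # xi_{t+1} is a deterministic function of xi_t, a_t and r_t.
    alpha, beta = list(alpha), list(beta)
    alpha[arm] += reward
    beta[arm] += 1 - reward
    return alpha, beta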


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Figure: The basic bandit MDP. The decision maker selects a_t, while the parameter ω of the process is hidden. It then obtains reward r_t. The process repeats for t = 1, . . . , T.

Figure: The decision-theoretic bandit MDP. While ω is not known, at each time step t we maintain a belief ξ_t on Ω. The reward distribution is then defined through our belief.


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Backwards induction (Dynamic programming)

for n = 1, 2, . . . and s ∈ S do

    E(U_t | ξ_t) = max_{a_t ∈ A} [ E(r_t | ξ_t, a_t) + γ Σ_{ξ_{t+1}} P(ξ_{t+1} | ξ_t, a_t) E(U_{t+1} | ξ_{t+1}) ]

end for

Exact solution methods: exponential in the horizon

Dynamic programming (backwards induction etc)

Policy search.

Approximations

(Stochastic) branch and bound.

Upper confidence trees.

Approximate dynamic programming.

Local policy search (e.g. gradient based)
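To make the exact approach concrete, here is a small Python sketch (our own illustration, not from the slides) of the backwards-induction recursion above for the Beta belief states of a Bernoulli bandit; the number of distinct belief states it visits grows exponentially with the horizon, which is exactly the blow-up mentioned above.

from functools import lru_cache

def bayes_optimal_value(alpha, beta, horizon, gamma=1.0):
    """E(U | xi) under the Bayes-optimal policy, by backwards induction."""
    @lru_cache(maxsize=None)
    def value(a, b, steps_left):
        if steps_left == 0:
            return 0.0
        best = float("-inf")
        for i in range(len(a)):
            p = a[i] / (a[i] + b[i])                    # P(r = 1 | arm i, belief)
            a_succ = a[:i] + (a[i] + 1,) + a[i + 1:]    # posterior after r = 1
            b_fail = b[:i] + (b[i] + 1,) + b[i + 1:]    # posterior after r = 0
            q = p * (1 + gamma * value(a_succ, b, steps_left - 1)) \
                + (1 - p) * gamma * value(a, b_fail, steps_left - 1)
            best = max(best, q)
        return best
    return value(tuple(alpha), tuple(beta), horizon)

# e.g. bayes_optimal_value((1, 1), (1, 1), horizon=6) for two arms with uniform priors.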


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Bayesian RL for unknown MDPs

The MDP as an environment.

We are in some environment µ, where at each time step t, we:

Observe state s_t ∈ S. Take action a_t ∈ A. Receive reward r_t ∈ R.


Figure: The unknown Markov decision process

How can we find the Bayes optimal policy for unknown MDPs?


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Some heuristics

1. Only change policy at the start of epochs ti .

2. Calculate the belief ξti .

3. Find a “good” policy πi for the current belief.

4. Execute it until the next epoch i + 1.


Bayesian reinforcement learning Planning: Heuristics and exact solutions

One simple heuristic is to simply calculate the expected MDP for a given belief ξ:

µ_ξ ≜ E_ξ µ.

Then, we simply calculate the optimal policy for µ_ξ:

    π*(µ_ξ) ∈ argmax_{π ∈ Π1} V^π_{µ_ξ}

Example 9 (Counterexample)

Figure: (a) µ1, (b) µ2, (c) the expected MDP µ_ξ (MDPs whose transitions carry rewards 0, ϵ and 1).
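For concreteness, here is a minimal sketch of the expected-MDP heuristic under an assumed product-of-Dirichlets belief over the transition kernel (a common choice, though the slides do not fix one): the mean MDP just normalises the Dirichlet counts. The counterexample above is exactly a warning that the optimal policy of this mean MDP need not be Bayes-optimal.

import numpy as np

def expected_mdp(dirichlet_counts):
    """Mean transition kernel E_xi[P_mu] for a product-of-Dirichlets belief.

    dirichlet_counts[s, a, s'] holds prior pseudo-counts plus observed
    transition counts; the Dirichlet mean is the normalised count vector.
    """
    counts = np.asarray(dirichlet_counts, dtype=float)
    return counts / counts.sum(axis=2, keepdims=True)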


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Another heuristic is to get the most probable MDP for a belief ξ:

µ*_ξ ≜ argmax_µ ξ(µ)

Then, we simply calculate the optimal policy for µ*_ξ:

    π*(µ*_ξ) ∈ argmax_{π ∈ Π1} V^π_{µ*_ξ}

Example 10


Figure: The MDP µ_i from |A| + 1 MDPs.


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Posterior (Thompson) sampling

Another heuristic is to simply sample an MDP from the belief ξ:

µ^(k) ∼ ξ(µ)

Then, we simply calculate the optimal policy for µ^(k):

    π*(µ^(k)) ∈ argmax_{π ∈ Π1} V^π_{µ^(k)}

Properties

√T regret. (Direct proof: hard [1]. Easy proof: convert to confidence bound [11])

Generally applicable for many beliefs.

Connections to differential privacy [9].

Generalises to stochastic value function bounds [8].
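A minimal sketch of the sampling step for a discrete MDP, again assuming a product-of-Dirichlets belief over transitions (and, for brevity, known rewards): the sampled MDP µ^(k) is then solved with any planner (e.g. the policy iteration shown later) and its optimal policy is followed until the next update.

import numpy as np

def sample_mdp(dirichlet_counts, rng=None):
    """Draw one transition kernel mu^(k) ~ xi from a product-of-Dirichlets belief."""
    rng = rng or np.random.default_rng()
    counts = np.asarray(dirichlet_counts, dtype=float)
    n_states, n_actions, _ = counts.shape
    P = np.empty_like(counts)
    for s in range(n_states):
        for a in range(n_actions):
            P[s, a] = rng.dirichlet(counts[s, a])   # one Dirichlet draw per (s, a)
    return P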


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Belief-Augmented MDPs

Unknown bandit problems can be converted into MDPs through the belief state.

We can do the same for MDPs. We just create a hyperstate, composed of the current state and the current belief.


Bayesian reinforcement learning Planning: Heuristics and exact solutions

Figure: Belief-augmented MDP. (a) The complete MDP model, with states s_t, beliefs ξ_t and hyperstates ψ_t. (b) Compact form of the model, as a Markov process over the hyperstates ψ_t.

The augmented MDP

P(s_{t+1} ∈ S | ξ_t, s_t, a_t) ≜ ∫ P_µ(s_{t+1} ∈ S | s_t, a_t) dξ_t(µ)    (4.4)

ξ_{t+1}(·) = ξ_t(· | s_{t+1}, s_t, a_t)    (4.5)


Bayesian reinforcement learning Planning: Heuristics and exact solutions

So now we have converted the unknown MDP problem into an MDP.

That means we can use dynamic programming to solve it.

So... are we done?

Unfortunately the exact solution is again exponential in the horizon.


Bayesian reinforcement learning Inference: Approximate Bayesian computation

ABC (Approximate Bayesian Computation) RL¹

How to deal with an arbitrary model space M

The models µ ∈ M may be non-probabilistic simulators.

We may not know how to choose the simulator parameters.

Overview of the approach

Place a prior on the simulator parameters.

Observe some data h on the real system.

Approximate the posterior by statistics on simulated data.

Calculate a near-optimal policy for the posterior.

Results

Soundness depends on properties of the statistics.

In practice, can require much less data than a general model.

¹ Dimitrakakis and Tziortiotis. ABC Reinforcement Learning. ICML 2013.


Bayesian reinforcement learning Inference: Approximate Bayesian computation

A set of trajectories

Figure: trajectories from the real data, a good simulator, and a bad simulator.


Bayesian reinforcement learning Inference: Approximate Bayesian computation

A set of trajectories

Figure: cumulative features of the real data, the good simulator, and the bad simulator.

Trajectories are easy to generate.

How to compare?

Use a statistic.


Bayesian reinforcement learning Inference: Approximate Bayesian computation

ABC (Approximate Bayesian Computation)

When there is no probabilistic model (Pµ is not available): ABC!

A prior ξ on a class of simulators M.    A history h ∈ H from policy π.

A statistic f : H → (W, ∥·∥).    A threshold ϵ > 0.

ABC-RL using Thompson sampling

do
    µ ∼ ξ, h′ ∼ P^π_µ                 // sample a model and a history
until ∥f(h′) − f(h)∥ ≤ ϵ              // until the statistics are close
µ^(k) = µ                             // approximate posterior sample µ^(k) ∼ ξ_ϵ(· | h_t)
π^(k) ≈ argmax_π E^π_{µ^(k)} U_t      // approximate optimal policy for the sample



Example 11 (Cumulative features)

Feature function ϕ : X → R^k.

    f(h) ≜ Σ_t ϕ(x_t)


Example 11 (Utility)

    f(h) ≜ Σ_t r_t
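A minimal sketch of the rejection step above; prior_sample, simulate and statistic are placeholders for the user's simulator prior, rollout routine and the statistic f (they are assumptions of this sketch, not functions from the paper).

import numpy as np

def abc_posterior_sample(prior_sample, simulate, statistic, real_history,
                         epsilon, max_tries=100_000):
    """Return an approximate posterior sample mu^(k) ~ xi_eps(. | h)."""
    f_real = np.asarray(statistic(real_history), dtype=float)
    for _ in range(max_tries):
        mu = prior_sample()                      # mu ~ xi
        h_sim = simulate(mu)                     # h' ~ P^pi_mu
        f_sim = np.asarray(statistic(h_sim), dtype=float)
        if np.linalg.norm(f_sim - f_real) <= epsilon:
            return mu                            # statistics are close: accept
    raise RuntimeError("no sample accepted; increase epsilon or max_tries")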


Bayesian reinforcement learning Properties of ABC

The approximate posterior ξϵ(· | h)

Corollary 11

If f is a sufficient statistic and ϵ = 0, then ξ(· | h) = ξϵ(· | h).

Assumption 2 (A1. Lipschitz log-probabilities)

For the policy π, ∃ L > 0 such that, for all h, h′ ∈ H and all µ ∈ M,

    | ln [ P^π_µ(h) / P^π_µ(h′) ] | ≤ L ∥f(h) − f(h′)∥

Theorem 12 (The approximate posterior ξ_ϵ(· | h) is close to ξ(· | h))

If A1 holds, then for all ϵ > 0:

    D( ξ(· | h) ∥ ξ_ϵ(· | h) ) ≤ 2Lϵ + ln |A_hϵ|,    (4.6)

where A_hϵ ≜ { z ∈ H : ∥f(z) − f(h)∥ ≤ ϵ }.


Bayesian reinforcement learning Properties of ABC

Summary

Unknown MDPs can be handled in a Bayesian framework. This defines a belief-augmented MDP with

A state for the MDP. A state for the agent’s belief.

The Bayes-optimal utility is convex, enabling approximations.

A big problem is specifying the “right” prior.

Questions?


Belief updates

Belief updates

Discounted reward MDPs
Backwards induction


Belief updates

Updating the belief in discrete MDPs

Let D_t = ⟨s^t, a^{t−1}, r^{t−1}⟩ be the observed data up to time t. Then

    ξ(B | D_t, π) = ∫_B P^π_µ(D_t) dξ(µ) / ∫_M P^π_µ(D_t) dξ(µ).    (5.1)

    ξ_{t+1}(B) ≜ ξ(B | D_{t+1}) = ∫_B P^π_µ(D_{t+1}) dξ(µ) / ∫_M P^π_µ(D_{t+1}) dξ(µ)    (5.2)

        = ∫_B P_µ(s_{t+1}, r_t | s_t, a_t) π(a_t | s^t, a^{t−1}, r^{t−1}) dξ(µ | D_t) / ∫_M P_µ(s_{t+1}, r_t | s_t, a_t) π(a_t | s^t, a^{t−1}, r^{t−1}) dξ(µ | D_t)    (5.3)

        = ∫_B P_µ(s_{t+1}, r_t | s_t, a_t) dξ_t(µ) / ∫_M P_µ(s_{t+1}, r_t | s_t, a_t) dξ_t(µ),    (5.4)

since the policy terms do not depend on µ and cancel.
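With a conjugate product-of-Dirichlets belief over the transition kernel (a standard choice which the slides do not fix, and with rewards assumed known for brevity), the update (5.4) reduces to incrementing one count. A minimal sketch:

import numpy as np

class DirichletMDPBelief:
    """xi_t over the transition kernel of a discrete MDP."""
    def __init__(self, n_states, n_actions, prior_count=0.5):
        self.counts = np.full((n_states, n_actions, n_states), prior_count)

    def update(self, s, a, s_next):
        # xi_{t+1}(.) = xi_t(. | s_{t+1}, s_t, a_t): one more observed transition
        self.counts[s, a, s_next] += 1.0

    def marginal_next_state(self, s, a):
        # P(s_{t+1} | s_t, a_t, xi_t), the marginal predictive of eq. (4.4)
        return self.counts[s, a] / self.counts[s, a].sum()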


Belief updates

Backwards induction policy evaluation

for s ∈ S, t = T, . . . , 1 do
    Update values:

    v_t(s) = E^π_µ(r_t | s_t = s) + Σ_{j∈S} P^π_µ(s_{t+1} = j | s_t = s) v_{t+1}(j)    (5.5)

end for

Figure: backwards induction on a small example (columns s_t, a_t, r_t, s_{t+1}; rewards 1 and 0; transition probabilities 0.5/0.5, 0.7/0.3 and 0.4/0.6). Successive overlays fill in the values 0.7, 1.4 and finally 1.05.
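A minimal sketch of (5.5) for a fixed policy, with the policy already marginalised into the transition matrix and reward vector (both hypothetical inputs of this sketch):

import numpy as np

def evaluate_policy_finite_horizon(P_pi, r_pi, T):
    """v_t(s) = E(r_t | s) + sum_j P(j | s) v_{t+1}(j), computed for t = T, ..., 1.

    P_pi[s, j]: next-state probabilities under the policy; r_pi[s]: expected reward.
    Returns the list [v_1, ..., v_T].
    """
    v_next = np.zeros(len(r_pi))
    values = []
    for _ in range(T):               # t = T, T-1, ..., 1
        v_next = r_pi + P_pi @ v_next
        values.append(v_next)
    return values[::-1]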


Discounted reward MDPs

Belief updates

Discounted reward MDPs
Backwards induction


Discounted reward MDPs

Discounted total reward.

U_t = lim_{T→∞} Σ_{k=t}^T γ^k r_k,    γ ∈ (0, 1)

Definition 13

A policy π is stationary if π(at | st) does not depend on t.

Remark 1

We can use the Markov chain kernel Pµ,π to write the expected utility vector as

v^π = Σ_{t=0}^∞ γ^t P^t_{µ,π} r    (6.1)


Discounted reward MDPs

Theorem 14

For any stationary policy π, vπ is the unique solution of

    v = r + γ P_{µ,π} v.    (fixed point)    (6.2)

In addition, the solution is:

    v^π = (I − γ P_{µ,π})^{−1} r.    (6.3)

Example 15

Similar to the geometric series:    Σ_{t=0}^∞ α^t = 1 / (1 − α)
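Equation (6.3) is a single linear solve; a minimal sketch:

import numpy as np

def stationary_policy_value(P_pi, r, gamma):
    """Solve (I - gamma P_pi) v = r for the value of a stationary policy."""
    n = len(r)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r)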


Discounted reward MDPs

Policy iteration

Algorithm 1 Policy iteration

Input: µ, S.
Initialise v_0.
for n = 1, 2, . . . do
    π_{n+1} = argmax_π [ r + γ P_π v_n ]      // policy improvement
    v_{n+1} = (I − γ P_{µ,π_{n+1}})^{−1} r    // policy evaluation
    break if π_{n+1} = π_n.
end for
Return π_n, v_n.
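A minimal sketch of Algorithm 1 for a discrete MDP, assuming for brevity that the reward depends only on the state (r[s]) and that the transition probabilities are given as P[a, s, s'] (these layout choices are ours):

import numpy as np

def policy_iteration(P, r, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        P_pi = P[policy, np.arange(n_states)]                      # (S, S), row s is P[pi(s), s, :]
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r)    # policy evaluation
        q = r[None, :] + gamma * (P @ v)                           # (A, S) action values
        new_policy = q.argmax(axis=0)                              # policy improvement
        if np.array_equal(new_policy, policy):
            return policy, v
        policy = new_policy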


Discounted reward MDPs Backwards induction

Backwards induction policy optimization

for s ∈ S, t = T, . . . , 1 do
    Update values:

    v_t(s) = max_a [ E_µ(r_t | s_t = s, a_t = a) + Σ_{j∈S} P_µ(s_{t+1} = j | s_t = s, a_t = a) v_{t+1}(j) ]    (6.4)

end for

Figure: the same small example, now maximising over actions (columns s_t, a_t, r_t, s_{t+1}); the completed overlay gives the values 1.4, 0.7 and 1.4.
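A minimal sketch of (6.4), with the same hypothetical P[a, s, s'] layout as before and an action-dependent reward r[a, s]:

import numpy as np

def backwards_induction(P, r, T):
    """Finite-horizon optimal values and greedy actions for t = 1, ..., T."""
    n_actions, n_states, _ = P.shape
    v_next = np.zeros(n_states)
    values, policies = [], []
    for _ in range(T):                     # t = T, T-1, ..., 1
        q = r + P @ v_next                 # (A, S) action values
        v_next = q.max(axis=0)             # maximise over actions
        values.append(v_next)
        policies.append(q.argmax(axis=0))
    return values[::-1], policies[::-1]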


References

[1] Shipra Agrawal and Navin Goyal. Analysis of Thompson sampling for the multi-armed bandit problem. In COLT 2012, 2012.

[2] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2/3):235–256, 2002.

[3] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2001.

[4] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.

[5] Herman Chernoff. Sequential design of experiments. Annals of Mathematical Statistics, 30(3):755–770, 1959.

[6] Herman Chernoff. Sequential models for clinical trials. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 4, pages 805–812. Univ. of Calif. Press, 1966.

[7] Morris H. DeGroot. Optimal Statistical Decisions. John Wiley & Sons, 1970.

[8] Christos Dimitrakakis. Monte-Carlo utility estimates for Bayesian reinforcement learning. In IEEE 52nd Annual Conference on Decision and Control (CDC 2013), 2013. arXiv:1303.2506.

[9] Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, and Benjamin Rubinstein. Robust and private Bayesian inference. In Algorithmic Learning Theory, 2014.

[10] Milton Friedman and Leonard J. Savage. The expected-utility hypothesis and the measurability of utility. The Journal of Political Economy, 60(6):463, 1952.


[11] Emilie Kaufmann, Nathaniel Korda, and Rémi Munos. Thompson sampling: An optimal finite time analysis. In ALT 2012, 2012.

[12] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, New Jersey, US, 1994.

[13] Leonard J. Savage. The Foundations of Statistics. Dover Publications, 1972.

[14] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In ICML 2010, 2010.

[15] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
