Sample Efficiency in Online RL


Transcript of Sample Efficiency in Online RL

Page 1: Sample Efficiency in Online RL

Reinforcement Learning China Summer School

RLChina 2021

Sample Efficiency in Online RL

Zhuoran Yang

Princeton → Yale (2022)

August 16, 2021

Page 2: Sample Efficiency in Online RL

Motivation: the sample complexity challenge in deep RL


Page 3: Sample Efficiency in Online RL

Success and challenge of deep RL

DRL = Representation (DL) + Decision Making (RL)

Example domains: board games, computer games, robotic control, policy making (image source: DeepMind, OpenAI, Salesforce Research).

Page 4: Sample Efficiency in Online RL

Success and challenge of deep RL

AlphaGo: 3 × 10^7 games of self-play (data), 40 days of training (computation)

AlphaStar: 2 × 10^2 years of self-play, 44 days of training

Rubik's Cube: 10^4 years of simulation, 10 days of training

[Diagram: insufficient data → lack of convergence → unreliable AI; hence massive data and massive computation are needed.]


Page 6: Sample Efficiency in Online RL

Our goal: provably efficient RL algorithms

Sample efficiency: how many data points needed?

Computational efficiency: how much computation needed?

Function approximation: can we allow an infinite number of observations?

Page 7: Sample Efficiency in Online RL

Background: Episodic Markov decision process

Page 8: Sample Efficiency in Online RL

Episodic Markov decision process

[Agent–environment loop: at step h, the agent observes state s_h ∈ S, takes action a_h = π_h(s_h) ∈ A, receives reward r_h = r_h(s_h, a_h) ∈ [0, 1], and the environment moves to the next state s_{h+1} ∼ P_h(· | s_h, a_h).]

For h ∈ [H], in the h-th step, the agent observes state s_h and takes action a_h

Receives immediate reward r_h with E[r_h | s_h = s, a_h = a] = r_h(s, a)

The environment evolves to a new state s_{h+1} ∼ P_h(· | s_h, a_h)

Policy π_h : S → A: how the agent takes actions at the h-th step

Goal: find the policy π⋆ that maximizes J(π) := E_π[ Σ_{h=1}^H r_h ] (a minimal rollout sketch is given below)
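To make the interaction protocol concrete, here is a minimal Python sketch of the episodic MDP loop. The tabular sizes, the random transition kernel P, the reward table r, and the fixed policy pi are illustrative placeholders (not objects from the lecture); the code only demonstrates a_h = π_h(s_h), s_{h+1} ∼ P(· | s_h, a_h), and a Monte Carlo estimate of J(π).

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 5, 4, 3                                  # toy horizon, #states, #actions
P = rng.dirichlet(np.ones(S), size=(S, A))         # P[s, a] = distribution over next states
r = rng.uniform(size=(S, A))                       # r[s, a] in [0, 1]
pi = rng.integers(A, size=(H, S))                  # deterministic policy: pi[h, s] -> action

def rollout(s1=0):
    """Simulate one episode and return the realized return sum_{h=1}^H r_h."""
    s, total = s1, 0.0
    for h in range(H):
        a = pi[h, s]                               # a_h = pi_h(s_h)
        total += r[s, a]                           # receive r_h(s_h, a_h)
        s = rng.choice(S, p=P[s, a])               # s_{h+1} ~ P(. | s_h, a_h)
    return total

# Monte Carlo estimate of J(pi) = E_pi[ sum_{h=1}^H r_h ]
print(np.mean([rollout() for _ in range(2000)]))
```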


Page 10: Sample Efficiency in Online RL

Episodic Markov decision process

[Timeline figure: starting from s_1 = s⋆, at each step h the policy picks a_h = π_h(s_h), the reward is r(s_h, a_h), and the state transitions via P(· | s_h, a_h); the total return is J(π) = E[ r(s_1, a_1) + r(s_2, a_2) + ⋯ + r(s_H, a_H) ].]

For simplicity, fix s_1 = s⋆ and take r_h = r, P_h = P for all h.

Page 11: Sample Efficiency in Online RL

Contextual bandit is a special case of MDP

Contextual bandit (CB): observe context s ∈ S, take action a ∈ A, observe (random) reward r with mean r⋆(s, a)

Optimal policy: π⋆(s) = arg max_{a∈A} r⋆(s, a)

CB is an MDP with H = 1

Multi-armed bandit (MAB): H = 1, |A| finite, and |S| = 1

MDP is significantly more challenging than CB due to temporal dependency (i.e., state transitions)


Page 13: Sample Efficiency in Online RL

From MDP to RL

Informal definition of RL: solve the MDP when the environment (r, P) is unknown

(i) generative model: able to query the reward r_h and next state s_{h+1} for any (s_h, a_h) ∈ S × A

(ii) offline setting: given a dataset D = {(s_h^t, a_h^t, r_h^t)}_{h∈[H], t∈[T]} (T trajectories), learn π⋆ without any interaction

(iii) online setting: without any prior knowledge, learn π⋆ by interacting with the MDP for T episodes

(i) learning from a model → query complexity
(ii) learning from data → sample complexity
(iii) learning by doing → sample complexity + exploration


Page 18: Sample Efficiency in Online RL

Online RL: Pipeline

[Pipeline figure: the online agent executes π^t in the environment, adds the new trajectory to the dataset D^t ← D^{t−1} ∪ Traj(π^t), updates π^{t+1} ← Alg(D^t, F), and continues for T episodes.]

Initialize π^1 = {π_h^1}_{h∈[H]} arbitrarily

For each t ∈ [T], execute π^t and get a trajectory Traj(π^t)

Store Traj(π^t) into the dataset D^t = {Traj(π^i), i ≤ t}

Update the policy via an RL algorithm (which we need to design; see the loop skeleton below)
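A minimal skeleton of this loop, assuming a hypothetical `env` with `reset`/`step` methods and a placeholder `update_policy` routine standing in for the RL algorithm still to be designed; it only illustrates the data flow D^t ← D^{t−1} ∪ Traj(π^t) and π^{t+1} ← Alg(D^t, F).

```python
def run_online_rl(env, pi, update_policy, T, H):
    """Online RL pipeline skeleton; env, pi, and update_policy are assumed interfaces."""
    data = []                                  # D^0: empty dataset
    for t in range(T):
        traj, s = [], env.reset()
        for h in range(H):                     # execute pi^t for one episode
            a = pi(h, s)                       # a_h = pi_h^t(s_h)
            s_next, reward = env.step(a)
            traj.append((h, s, a, reward))
            s = s_next
        data.append(traj)                      # D^t = D^{t-1} U {Traj(pi^t)}
        pi = update_policy(data)               # pi^{t+1} = Alg(D^t, F)
    return pi
```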

Page 19: Sample Efficiency in Online RL

Sample efficiency in online RL: regret

Measure sample efficiency via regret:

Regret(T) = Σ_{t=1}^T [ J(π⋆) − J(π^t) ],  where each summand is the suboptimality in the t-th episode

Why is regret meaningful?

Define a policy π̄^T sampled uniformly from {π^t, t ∈ [T]}; then J(π⋆) − J(π̄^T) = Regret(T)/T in expectation over the sampling

If Regret(T) = o(T), then for T sufficiently large, π̄^T is arbitrarily close to optimal

Goal: Regret(T) = O( √T · poly(H) · poly(dim) )
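As a sanity check on these definitions, the toy computation below (with made-up values J(π⋆) = 1 and J(π^t) = 1 − 1/√t) produces a regret that grows like √T, so the average suboptimality Regret(T)/T of the uniformly mixed policy vanishes.

```python
import numpy as np

def regret(J_star, J_seq):
    """Regret(T) = sum_{t=1}^T [J(pi*) - J(pi^t)]."""
    return float(np.sum(J_star - np.asarray(J_seq)))

T = 10_000
J_t = 1.0 - 1.0 / np.sqrt(np.arange(1, T + 1))   # illustrative per-episode values
R = regret(1.0, J_t)
print(R, R / T)   # Regret(T) ~ 2*sqrt(T); Regret(T)/T is the mixture policy's gap
```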


Page 22: Sample Efficiency in Online RL

Challenge of Online RL: exploration & uncertainty

[Figure: online RL couples deep exploration (exploration vs. exploitation) with function estimation from adaptively collected data.]

Deep exploration: how to construct exploration incentives tailored to function approximators?

Function estimation: how to assess estimation uncertainty based on adaptively acquired data?


Page 24: Sample Efficiency in Online RL

Warmup example: LinUCB for linear CB

Page 25: Sample Efficiency in Online RL

Linear contextual bandit

Setting of linear CB:

Observation structure: observe (perhaps adversarially chosen) s^t ∈ S, take action a^t ∈ A, receive bounded reward r^t ∈ [0, 1]

E[r^t | s^t = s, a^t = a] = r⋆(s, a) for all (s, a) ∈ S × A

Linear reward function: r⋆(s, a) = φ(s, a)^⊤ θ⋆

φ : S × A → R^d: known feature mapping

Normalization condition: ‖θ⋆‖_2 ≤ √d,  sup_{s,a} ‖φ(s, a)‖_2 ≤ 1

Why this model? It includes MAB as a special case: |S| = 1, d = |A|, φ(s, a) = e_a, θ⋆ = (µ_1, µ_2, …, µ_{|A|}) with µ_a ∈ [0, 1] for all a ∈ A (see the sketch below)

Regret(T) = Σ_{t=1}^T ( max_{a∈A} ⟨φ(s^t, a) − φ(s^t, a^t), θ⋆⟩ )
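A two-line check of the MAB special case (with assumed arm means µ): choosing φ(s, a) = e_a and θ⋆ = (µ_1, …, µ_{|A|}) makes the linear reward ⟨φ(s, a), θ⋆⟩ coincide with the arm means.

```python
import numpy as np

mu = np.array([0.2, 0.5, 0.9])     # illustrative arm means in [0, 1]
A = d = len(mu)                    # |A| arms, d = |A|
phi = np.eye(A)                    # phi(s, a) = e_a (the single context s is implicit)
theta_star = mu                    # theta* = (mu_1, ..., mu_|A|)
print(phi @ theta_star)            # r*(s, a) = <phi(s, a), theta*> recovers mu
```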


Page 29: Sample Efficiency in Online RL

Exploration via optimism in the face of uncertainty (OFU)

For each t ∈ [T], in the t-th round, determine a^t in the following two steps:

Uncertainty quantification: based on the current dataset D^{t−1}, construct a high-probability confidence set Θ^t that contains θ⋆:

P( ∀t ∈ [T], θ⋆ ∈ Θ^t ) ≥ 1 − δ    (1)

Optimistic planning: choose a^t to maximize the most optimistic model within Θ^t:

a^t = arg max_{a∈A} max_{θ∈Θ^t} ⟨φ(s^t, a), θ⟩    (2)


Page 33: Sample Efficiency in Online RL

Intuition of OFU principle

Define two functions

r^{+,t}(s, a) = max_{θ∈Θ^t} ⟨φ(s, a), θ⟩,   r^{−,t}(s, a) = min_{θ∈Θ^t} ⟨φ(s, a), θ⟩

By (1), θ⋆ ∈ Θ^t, so

r^{+,t}(s, a) ≥ r⋆(s, a) ≥ r^{−,t}(s, a),  ∀(s, a) ∈ S × A

|r^{+,t}(s, a) − r^{−,t}(s, a)| reflects the uncertainty about a at context s

By (2), a^t = arg max_a r^{+,t}(s^t, a) — choose the action with either large uncertainty (explore) or large reward (exploit)


Page 37: Sample Efficiency in Online RL

How to construct Θ^t for linear CB?

In the t-th round, our dataset is D^{t−1} = {(s^i, a^i, r^i)}_{i≤t−1}

First construct an estimator of θ⋆ via ridge regression:

θ^t = arg min_{θ∈R^d} L^t(θ) := Σ_{i=1}^{t−1} [ r^i − ⟨φ(s^i, a^i), θ⟩ ]^2 + ‖θ‖_2^2

Hessian of the ridge loss (up to a factor of 2):

Λ^t = I + Σ_{i=1}^{t−1} φ(s^i, a^i) φ(s^i, a^i)^⊤

Closed-form solution: θ^t = (Λ^t)^{−1} Σ_{i=1}^{t−1} r^i · φ(s^i, a^i)

Define Θ^t as the confidence ellipsoid:

Θ^t = { θ ∈ R^d : ‖θ − θ^t‖_{Λ^t} ≤ β },

where β = C · √( d · log(1 + T/δ) ) and ‖x‖_M = √(x^⊤ M x) (see the numerical sketch below)
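A small numerical sketch of this construction on synthetic data (toy dimensions, Gaussian noise, all values assumed): it computes the ridge estimator θ^t, the matrix Λ^t, and the self-normalized error ‖θ^t − θ⋆‖_{Λ^t} that the radius β is designed to dominate.

```python
import numpy as np

def ridge_estimate(Phi, rewards):
    """theta^t = (Lambda^t)^{-1} Phi^T r with Lambda^t = I + Phi^T Phi (ridge weight 1)."""
    d = Phi.shape[1]
    Lam = np.eye(d) + Phi.T @ Phi
    theta_hat = np.linalg.solve(Lam, Phi.T @ rewards)
    return theta_hat, Lam

rng = np.random.default_rng(0)
d, n = 4, 200                                        # toy dimension and t-1 rounds
theta_star = rng.normal(size=d) / np.sqrt(d)
Phi = rng.normal(size=(n, d)) / np.sqrt(d)           # rows are phi(s^i, a^i)
rewards = Phi @ theta_star + 0.1 * rng.normal(size=n)
theta_hat, Lam = ridge_estimate(Phi, rewards)

err = np.sqrt((theta_hat - theta_star) @ Lam @ (theta_hat - theta_star))
width = np.sqrt(Phi[0] @ np.linalg.solve(Lam, Phi[0]))   # ||phi||_{(Lambda^t)^{-1}}
print(err, width)     # beta should upper-bound err; width is the per-action bonus scale
```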


Page 42: Sample Efficiency in Online RL

LinUCB algorithm

Confidence ellipsoid: Θ^t = { θ ∈ R^d : ‖θ − θ^t‖_{Λ^t} ≤ β }

r^{+,t} admits a closed form:

r^{+,t}(s, a) = max_{θ∈Θ^t} ⟨φ(s, a), θ⟩ = ⟨φ(s, a), θ^t⟩ + β · ‖φ(s, a)‖_{(Λ^t)^{−1}},  where the last term is the bonus

LinUCB algorithm (Dani et al. (2008), Abbasi-Yadkori et al. (2011), ...)

For t = 1, …, T, at the beginning of the t-th round:

Observe context s^t

Compute θ^t via ridge regression and the matrix Λ^t

Choose a^t = arg max_{a∈A} r^{+,t}(s^t, a) and observe r^t

(A compact implementation sketch follows below.)
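A compact LinUCB sketch on synthetic data. The environment (random features, θ⋆, Gaussian reward noise) and the choice β = 1 are illustrative assumptions, not part of the lecture; the loop implements exactly "ridge estimator + bonus, then act greedily on the upper confidence bound".

```python
import numpy as np

def linucb(contexts, arm_features, theta_star, beta=1.0, noise=0.1, seed=0):
    """Run LinUCB and return the cumulative (pseudo-)regret on a toy problem."""
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    Lam, b = np.eye(d), np.zeros(d)              # Lambda^t and sum_i r^i phi(s^i, a^i)
    total_regret = 0.0
    for s in contexts:
        Phi_s = arm_features[s]                  # rows: phi(s^t, a), one per action a
        theta_hat = np.linalg.solve(Lam, b)      # ridge estimator theta^t
        Lam_inv = np.linalg.inv(Lam)
        widths = np.sqrt(np.sum((Phi_s @ Lam_inv) * Phi_s, axis=1))
        ucb = Phi_s @ theta_hat + beta * widths  # r^{+,t}(s^t, a) for every a
        a = int(np.argmax(ucb))
        reward = Phi_s[a] @ theta_star + noise * rng.normal()
        total_regret += np.max(Phi_s @ theta_star) - Phi_s[a] @ theta_star
        Lam += np.outer(Phi_s[a], Phi_s[a])      # rank-one update of Lambda^t
        b += reward * Phi_s[a]
    return total_regret

rng = np.random.default_rng(1)
d, A, T = 5, 20, 2000
feats = {0: rng.normal(size=(A, d)) / np.sqrt(d)}    # a single toy context
theta_star = rng.normal(size=d) / np.sqrt(d)
print(linucb([0] * T, feats, theta_star))            # grows roughly like sqrt(T)
```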


Page 46: Sample Efficiency in Online RL

Regret Bound of LinUCB

Theorem. With probability 1 − δ, LinUCB satisfies

Regret(T) ≤ β · √(dT) · √(log T) = O( d · √T )  (up to logarithmic factors)

Independent of the size of the context and action spaces; only depends on the dimension (linear in d)

Depends on T only through √T (sample efficient)

Computational cost is poly(d, |A|, T) and the memory requirement is poly(d, T) (why?)


Page 50: Sample Efficiency in Online RL

Regret Analysis of LinUCB (1/3)

Step 1. Condition on the event E = { ∀t ∈ [T], θ⋆ ∈ Θ^t }; on E,

r^{−,t}(s, a) ≤ r⋆(s, a) ≤ r^{+,t}(s, a),  ∀(s, a) ∈ S × A

Step 2. Regret decomposition:

max_a r⋆(s^t, a) − r⋆(s^t, a^t)
  = ( max_a r⋆(s^t, a) − r^{+,t}(s^t, a^t) ) + ( r^{+,t}(s^t, a^t) − r⋆(s^t, a^t) )
  ≤ r^{+,t}(s^t, a^t) − r^{−,t}(s^t, a^t) ≤ 2β · ‖φ(s^t, a^t)‖_{(Λ^t)^{−1}}

(The first term is ≤ 0 by optimism and the choice of a^t.)

Thus we have shown

max_a r⋆(s^t, a) − r⋆(s^t, a^t) ≤ 2β · min{ 1, ‖φ(s^t, a^t)‖_{(Λ^t)^{−1}} }


Page 53: Sample Efficiency in Online RL

Regret Analysis of LinUCB (2/3)

Step 3. Telescope the bonus: Cauchy–Schwarz + the elliptical potential lemma

Regret(T) ≤ 2β · Σ_{t=1}^T min{ 1, ‖φ(s^t, a^t)‖_{(Λ^t)^{−1}} }

  ≤ 2β · √T · [ Σ_{t=1}^T min{ 1, ‖φ(s^t, a^t)‖²_{(Λ^t)^{−1}} } ]^{1/2}

  ≤ 2β · √T · √( 2 log det(Λ^{T+1}) ) ≲ β · √T · √( d · log T )

Page 54: Sample Efficiency in Online RL

Elliptical potential lemma

Lemma (elliptical potential).

Σ_{t=1}^T min{ 1, ‖φ(s^t, a^t)‖²_{(Λ^t)^{−1}} } ≤ 2 log det(Λ^{T+1}) ≲ d · log T

Bayesian perspective:

Prior distribution: θ⋆ ∼ N(0, I)

Likelihood: r^i = r⋆(s^i, a^i) + ε,  ε ∼ N(0, 1)

Given the dataset D^{t−1}, the posterior distribution of θ⋆ is N(θ^t, (Λ^t)^{−1})

‖φ(s, a)‖²_{(Λ^t)^{−1}} ≈ I(θ⋆; (s, a) | D^{t−1}) = H(θ⋆ | D^{t−1}) − H(θ⋆ | D^{t−1} ∪ {(s, a)}) — the conditional mutual information gain

Elliptical potential lemma ↔ chain rule of mutual information (see the numerical check below)
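A numerical check of the lemma on random unit-norm features (toy data; the inequality itself is deterministic and holds for any feature sequence): the accumulated potential stays below 2·log det(Λ^{T+1}), which is of order d·log T.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 5000
Lam = np.eye(d)                                    # Lambda^1 = I
lhs = 0.0
for _ in range(T):
    phi = rng.normal(size=d)
    phi /= max(1.0, np.linalg.norm(phi))           # enforce ||phi||_2 <= 1
    w2 = phi @ np.linalg.solve(Lam, phi)           # ||phi||^2_{(Lambda^t)^{-1}}
    lhs += min(1.0, w2)
    Lam += np.outer(phi, phi)                      # Lambda^{t+1}
rhs = 2.0 * np.linalg.slogdet(Lam)[1]              # 2 log det(Lambda^{T+1})
print(lhs, rhs, 2 * d * np.log(1 + T / d))         # lhs <= rhs <= 2 d log(1 + T/d)
```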


Page 57: Sample Efficiency in Online RL

Regret Analysis of LinUCB (3/3)

Step 4. Show optimism: θ⋆ ∈ Θ^t = { θ ∈ R^d : ‖θ − θ^t‖_{Λ^t} ≤ β }

θ^t − θ⋆ = (Λ^t)^{−1} [ Σ_{i=1}^{t−1} ( r⋆(s^i, a^i) + ε^i ) · φ(s^i, a^i) − Λ^t · θ⋆ ]    (here r^i = r⋆(s^i, a^i) + ε^i)

        = −(Λ^t)^{−1} θ⋆ + (Λ^t)^{−1} Σ_{i=1}^{t−1} ε^i · φ(s^i, a^i),    where the ε^i form a martingale difference sequence

‖θ^t − θ⋆‖_{Λ^t} ≈ ‖ Σ_{i=1}^{t−1} ε^i · φ(s^i, a^i) ‖_{(Λ^t)^{−1}} ≤ β

Concentration of self-normalized processes (e.g., Abbasi-Yadkori et al., 2011)


Page 59: Sample Efficiency in Online RL

Summary of LinUCB

The algorithm is based on the optimism principle:

upper confidence bound = ridge estimator + bonus

O(d·√T) regret via

a. using optimism to bound the instant regret by the width of the confidence region (the bonus)

b. summing up the instant regret terms via the elliptical potential lemma

c. constructing the confidence region via uncertainty quantification for ridge regression

Extensions: generalized linear models, RKHS, overparameterized networks, abstract function classes with bounded eluder dimension, ...


Page 62: Sample Efficiency in Online RL

From Linear CB to Linear RL


Page 63: Sample Efficiency in Online RL

Value functions and Bellman equation

Our goal is to find π⋆ = arg max_π J(π), where J(π) = E_π[ Σ_{h=1}^H r_h ]

π⋆ is characterized by the optimal Q-functions Q⋆ = {Q⋆_h}_{h∈[H+1]}, with Q⋆_h : S × A → [0, H]

π⋆_h(s) = arg max_a Q⋆_h(s, a) for all s ∈ S;   V⋆_h(s) = max_a Q⋆_h(s, a)

Q⋆_h(s, a): optimal return starting from s_h = s and a_h = a

Q⋆ is characterized by the Bellman equation

[Figure: Q⋆_h(s, a) = E[ r(s, a) + r(s_{h+1}, a_{h+1}) + ⋯ + r(s_H, a_H) ] with a_{h′} = π⋆_{h′}(s_{h′}) for h′ > h; π⋆_h(s) = arg max_a Q⋆_h(s, a), V⋆_h(s) = max_a Q⋆_h(s, a), and Q⋆ is given by the Bellman equation.]


Page 67: Sample Efficiency in Online RL

Value functions and Bellman equation

The MDP terminates after H steps — Q⋆_{H+1} = 0, Q⋆_H = r

Bellman equation (see the backward-induction sketch below):

Q⋆_h(s, a) = (B Q⋆_{h+1})(s, a) := r(s, a) + E_{s_{h+1}∼P}[ V⋆_{h+1}(s_{h+1}) | s_h = s, a_h = a ],  where B is the Bellman operator

V⋆_h(s) = max_a Q⋆_h(s, a)

RL with linear function approximation: use F_lin = { φ(s, a)^⊤ θ : S × A → R } to approximate {Q⋆_h}_{h∈[H]}

φ : S × A → R^d: known feature mapping

Can allow known, step-varying features {φ_h}_{h∈[H]}
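A short backward-induction sketch that computes Q⋆ directly from the Bellman equation on a toy tabular MDP (random P and r, time-homogeneous, all values assumed); this is exactly what LSVI will later mimic with regression when P and r are unknown.

```python
import numpy as np

def backward_induction(P, r, H):
    """Q*_h(s,a) = r(s,a) + E_{s'~P(.|s,a)}[ max_{a'} Q*_{h+1}(s',a') ], with Q*_{H+1} = 0."""
    S, A = r.shape
    Q = np.zeros((H + 2, S, A))                    # indices h = 1..H are filled; H+1 stays 0
    for h in range(H, 0, -1):
        V_next = Q[h + 1].max(axis=1)              # V*_{h+1}(s') = max_a' Q*_{h+1}(s', a')
        Q[h] = r + P @ V_next                      # (B Q*_{h+1})(s, a)
    return Q

rng = np.random.default_rng(0)
S, A, H = 4, 3, 5
P = rng.dirichlet(np.ones(S), size=(S, A))         # toy transition kernel P[s, a, s']
r = rng.uniform(size=(S, A))                       # toy reward table
Q = backward_induction(P, r, H)
print(Q[1].max(axis=1))                            # V*_1(s); the greedy argmax gives pi*_1
```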


Page 70: Sample Efficiency in Online RL

RL is more challenging than bandits

[Pipeline figure as before: execute π^t, add Traj(π^t) to D^t, update π^{t+1} ← Alg(D^t, F); continue for T episodes.]

Regret(T) = Σ_{t=1}^T [ J(π⋆) − J(π^t) ]

J(π⋆) = max_a Q⋆_1(s⋆, a)  (fixed initial state for simplicity)

Can generalize to adversarially chosen s_1

Online RL ≠ a contextual bandit with TH rounds

RL requires deep exploration to achieve o(T) regret.


Page 73: Sample Efficiency in Online RL

Deep Exploration

[Figure: a "combination lock" MDP with A = {b_1, b_2, …}. From the initial state s⋆, action b_1 moves along a chain of good states (this is π⋆); any action in A \ {b_1} leads to a bad state, and from a bad state every action stays among the bad states.]

A contextual bandit algorithm directly goes to the bad state at the initial state — Ω(T) regret

Random sampling requires |A|^H samples (inefficient)

Goal: Regret(T) = poly(H) · √T (needs deep exploration)

O(|S| |A| H) query complexity if we have a generative model


Page 77: Sample Efficiency in Online RL

What assumption should we impose?

In CB, π⋆ is greedy with respect to r⋆, and we assume r⋆ is linear

In RL, π⋆ is greedy with respect to Q⋆

Linear realizability: Q⋆_h ∈ F_lin for all h ∈ [H]

Theorem [Wang-Wang-Kakade-21]. There exists a class of linearly realizable MDPs such that any online RL algorithm requires min{ Ω(2^d), Ω(2^H) } samples to obtain a near-optimal policy.

Fundamentally different from supervised learning and bandits

To achieve efficiency in online RL, we need stronger assumptions


Page 80: Sample Efficiency in Online RL

Assumption: Image(B) ⊆ F_lin

We assume the image set of the Bellman operator lies in F_lin:

For any Q : S × A → [0, H], there exists θ_Q ∈ R^d such that

(BQ)(s, a) = r(s, a) + E_{s′}[ max_{a′} Q(s′, a′) ] = ⟨φ(s, a), θ_Q⟩

Normalization condition: ‖θ_Q‖_2 ≤ 2H√d,  sup_{s,a} ‖φ(s, a)‖_2 ≤ 1

Is there such an MDP? Yes: the linear MDP

r(s, a) = ⟨φ(s, a), ω⟩,   P(s′ | s, a) = ⟨φ(s, a), µ(s′)⟩

Normalization: ‖ω‖_2 ≤ √d,  Σ_{s′} ‖µ(s′)‖_2 ≤ √d

The linear MDP contains the tabular MDP as a special case: φ(s, a) = e_{(s,a)}, d = |S| · |A| (see the sketch below)
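A quick sketch of the tabular special case (toy sizes, random P and r): with φ(s, a) = e_{(s,a)}, taking ω to be the flattened reward table and µ(s′) the corresponding slice of the transition kernel reproduces r and P exactly, so the linear-MDP identities hold.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2
d = S * A
P = rng.dirichlet(np.ones(S), size=(S, A))    # toy P[s, a, s']
r = rng.uniform(size=(S, A))                  # toy r[s, a]

def phi(s, a):
    e = np.zeros(d)
    e[s * A + a] = 1.0                        # phi(s, a) = e_{(s,a)}
    return e

omega = r.reshape(d)                          # r(s, a) = <phi(s, a), omega>
mu = P.reshape(d, S).T                        # mu[s'] is a length-d vector

s, a = 1, 0
print(np.isclose(phi(s, a) @ omega, r[s, a]))      # reward identity
print(np.allclose(mu @ phi(s, a), P[s, a]))        # P(.|s, a) = <phi(s, a), mu(.)>
```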


Page 83: Sample Efficiency in Online RL

Algorithm for linear RL: LSVI-UCB


Page 84: Sample Efficiency in Online RL

Algorithm: LSVI + optimism

At the beginning of the t-th episode, the dataset is D^{t−1} = { (s_h^i, a_h^i, r_h^i), h ∈ [H] }_{i<t}

For h = H, …, 1, solve a ridge regression backward in h:

y_h^i = r_h^i + max_a Q_{h+1}^{+,t}(s_{h+1}^i, a)

θ_h^t = arg min_θ Σ_{i=1}^{t−1} [ y_h^i − ⟨φ(s_h^i, a_h^i), θ⟩ ]^2 + ‖θ‖_2^2

Hessian of the ridge loss: Λ_h^t = I + Σ_{i=1}^{t−1} φ(s_h^i, a_h^i) φ(s_h^i, a_h^i)^⊤

Construct the bonus Γ_h^t(s, a) = β · ‖φ(s, a)‖_{(Λ_h^t)^{−1}}

UCB Q-function:

Q_h^{+,t}(s, a) = Trunc_{[0,H]}{ ⟨φ(s, a), θ_h^t⟩ + Γ_h^t(s, a) }

Execute π_h^t(·) = arg max_a Q_h^{+,t}(·, a)

(An implementation sketch follows below.)
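A sketch of one LSVI-UCB policy update under the stated linear assumption. The data layout (one (s, a, r, s′) tuple per step of each stored trajectory), the feature map `phi`, and the action count `A` are assumed interfaces; it needs at least one stored trajectory and mirrors the backward ridge regressions, bonuses, and truncation above.

```python
import numpy as np

def lsvi_ucb_update(data, phi, H, A, beta):
    """data: list of trajectories, each a list of H tuples (s, a, r, s_next);
    returns the greedy policy and the optimistic Q-function Q^{+,t}."""
    d = phi(0, 0).shape[0]
    theta = [np.zeros(d) for _ in range(H + 2)]
    Lam = [np.eye(d) for _ in range(H + 2)]

    def Q_plus(h, s, a):                                   # Trunc_[0,H]{<phi,theta> + bonus}
        w = phi(s, a)
        bonus = beta * np.sqrt(w @ np.linalg.solve(Lam[h], w))
        return float(np.clip(w @ theta[h] + bonus, 0.0, H))

    for h in range(H, 0, -1):                              # backward in h
        Phi_h, y_h = [], []
        for traj in data:
            s, a, r, s_next = traj[h - 1]
            y = r                                          # Q^{+,t}_{H+1} = 0
            if h < H:
                y += max(Q_plus(h + 1, s_next, ap) for ap in range(A))
            Phi_h.append(phi(s, a))
            y_h.append(y)
        Phi_h, y_h = np.array(Phi_h), np.array(y_h)
        Lam[h] = np.eye(d) + Phi_h.T @ Phi_h               # Lambda_h^t
        theta[h] = np.linalg.solve(Lam[h], Phi_h.T @ y_h)  # ridge solution theta_h^t

    def policy(h, s):                                      # pi_h^t: greedy w.r.t. Q^{+,t}
        return int(np.argmax([Q_plus(h, s, a) for a in range(A)]))

    return policy, Q_plus
```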


Page 89: Sample Efficiency in Online RL

A more abstract version of LSVI-UCB

For h = H, …, 1, solve a least-squares regression backward in h:

y_h^i = r_h^i + max_a Q_{h+1}^{+,t}(s_{h+1}^i, a)

Q_h^t = arg min_{f∈F} Σ_{i=1}^{t−1} [ y_h^i − f(s_h^i, a_h^i) ]^2 + penalty(f)

Uncertainty quantification for Q_h^t: find Γ_h^t such that

P( ∀(t, h, s, a), |Q_h^t(s, a) − (B Q_{h+1}^{+,t})(s, a)| ≤ Γ_h^t(s, a) ) ≥ 1 − δ

Optimism:

Q_h^{+,t}(s, a) = Trunc_{[0,H]}{ Q_h^t(s, a) + Γ_h^t(s, a) }

• Computation & memory cost independent of |S| (even for RKHS and overparameterized NNs)


Page 93: Sample Efficiency in Online RL

A more abstract version of LSVI-UCB

[Figure: the interval [Q_h^+ − 2Γ_h^t, Q_h^+] sandwiches B Q_{h+1}^+, with Q_h^t in the middle.]

| Q_h^t(s, a) − (B Q_{h+1}^{+,t})(s, a) | ≤ Γ_h^t(s, a),  ∀(t, h, s, a)

Q_h^+(s, a) = Q_h^t(s, a)  (LSVI)  + Γ_h^t(s, a)  (bonus)

Optimism gives: B Q_{h+1}^+ ∈ [ Q_h^+ − 2Γ_h^t, Q_h^+ ]

Monotonicity of the Bellman operator: Q_h^+ ≥ Q⋆_h (UCB)

Page 94: Sample Efficiency in Online RL

Sample efficiency of LSVI-UCB


Page 95: Sample Efficiency in Online RL

LSVI-UCB achieves √T-regret

Theorem [Jin-Yang-Wang-Jordan-20]. Choosing β = O(dH), with probability at least 1 − 1/T,

Regret(T) = O( β · H · √(dT) ) = O( H² · √(d³ T) )

Directly implies an |S|^{1.5} |A|^{1.5} H² √T regret for tabular RL

First algorithm with both sample and computational efficiency in the context of RL with function approximation

Only assumption: the image set of B lies in F_lin

The optimal regret d H² √T is achieved by [Zanette et al., 2020] with a relaxed model assumption, but the computation is intractable.


Page 100: Sample Efficiency in Online RL

A general version of regret

Let Q_ucb be the function class containing the Q-functions constructed by LSVI-UCB:

Q_ucb = { Trunc_{[0,H]}{ φ(·, ·)^⊤ θ + β · ‖φ(·, ·)‖_{Λ^{−1}} } }  in the linear case

Theorem [Jin-Yang-Wang-Wang-Jordan-20]. Choosing β = O( H · √( log N_∞(Q_ucb, T^{−2}) ) ), with probability at least 1 − 1/T,

Regret(T) = O( β H · √(d_eff · T) ) = O( H² · √( d_eff · T · log N_∞ ) )

log N_∞(Q_ucb, T^{−2}) ≲ d · log T in the linear case

Includes kernels and overparameterized neural networks

d_eff is the effective dimension of the RKHS or NTK

For an abstract function class F, d_eff can be taken to be the Bellman–Eluder dimension [Jin-Liu-Miryoosefi-21]


Page 104: Sample Efficiency in Online RL

Regret analysis


Page 105: Sample Efficiency in Online RL

Regret analysis: sensitivity analysis + elliptical potential

A general sensitivity analysis:

J(π⋆) − J(π^t)
  = Σ_{h∈[H]} E_{π⋆}[ Q_h^{+,t}(s_h, π⋆_h(s_h)) − Q_h^{+,t}(s_h, π_h^t(s_h)) ]    (i) policy optimization error
  + Σ_{h∈[H]} E_{π^t}[ (Q_h^{+,t} − B Q_{h+1}^{+,t})(s_h, a_h) ]    (ii) Bellman error on π^t
  + Σ_{h∈[H]} E_{π⋆}[ −(Q_h^{+,t} − B Q_{h+1}^{+,t})(s_h, a_h) ]    (iii) Bellman error on π⋆

Term (i) ≤ 0 since π^t is greedy with respect to Q_h^{+,t}

Optimism: Q_h^{+,t} − 2Γ_h^t ≤ B Q_{h+1}^{+,t} ≤ Q_h^{+,t}, so Term (iii) ≤ 0

Page 106: Sample Efficiency in Online RL

Regret analysis: sensitivity analysis + elliptical potential

Therefore, we have

Regret(T) = Σ_{t=1}^T [ J(π⋆) − J(π^t) ] ≤ 2 Σ_{t∈[T]} Σ_{h∈[H]} E_{π^t}[ Γ_h^t(s_h, a_h) ]
  = 2 Σ_{t∈[T]} Σ_{h∈[H]} Γ_h^t(s_h^t, a_h^t) + martingale difference
  = 2β Σ_{h∈[H]} Σ_{t∈[T]} ‖φ(s_h^t, a_h^t)‖_{(Λ_h^t)^{−1}} + H · √T
  = O( β H √(dT) )

The second line holds because {(s_h^t, a_h^t), h ∈ [H]} ∼ π^t

It remains to conduct the uncertainty quantification:

P( ∀(t, h, s, a), |Q_h^t(s, a) − (B Q_{h+1}^{+,t})(s, a)| ≤ Γ_h^t(s, a) ) ≥ 1 − δ

Uniform concentration over Q_ucb (new for RL) + self-normalized concentration (same as in CB)


Page 108: Sample Efficiency in Online RL

Summary & Extensions

Exploration in RL is more challenging in that we need to handle deep exploration

LSVI-UCB explores by performing uncertainty quantification for the LSVI estimator

LSVI propagates the uncertainty at later steps h backward to h = 1

By construction, Q_1^+ contains the uncertainty about all h ≥ 1

LSVI-UCB achieves sample efficiency and computational tractability in online RL with function approximation

A similar principle can be extended to:

proximal policy optimization (use soft-greedy instead of greedy)

zero-sum Markov games (two-player extension)

constrained MDPs (primal-dual optimization)


Page 112: Sample Efficiency in Online RL

References (a very incomplete list)

Linear bandits

• [Dani et al., 2008] Stochastic Linear Optimization under Bandit Feedback

• [Abbasi-Yadkori et al., 2011] Improved Algorithms for Linear Stochastic Bandits

Optimism in tabular RL

• [Auer & Ortner, 2007] Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning

• [Jaksch et al., 2010] Near-Optimal Regret Bounds for Reinforcement Learning

• [Azar et al., 2017] Minimax Regret Bounds for Reinforcement Learning

Page 113: Sample Efficiency in Online RL

References (a very incomplete list)

RL with function approximation (incomplete list)

• [Dann et al., 2018] On Oracle-Efficient PAC RL with Rich Observations

• [Wang & Yang, 2019] Sample-Optimal Parametric Q-Learning Using Linearly Additive Features

• [Jin et al., 2020] Provably Efficient Reinforcement Learning with Linear Function Approximation (this talk)

• [Zanette et al., 2020] Learning Near Optimal Policies with Low Inherent Bellman Error

• [Du et al., 2020] Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?

• [Xie et al., 2020] Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium

• [Yang et al., 2020] On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces (this talk)

• [Jin et al., 2021] Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms