Sample Efficiency in Online RL
Transcript of "Sample Efficiency in Online RL"
Reinforcement Learning China Summer School, RLChina 2021
Zhuoran Yang
Princeton → Yale (2022)
August 16, 2021
Motivation: sample complexity challenge in deep RL
Success and challenge of deep RL
DRL = Representation (DL) + Decision Making (RL)
Applications: board games, computer games, robotic control, policy making
image source: DeepMind, OpenAI, Salesforce Research.
Success and challenge of deep RL
AlphaGo: 3 × 10^7 games of self-play (data), 40 days of training (computation)
AlphaStar: 2 × 10^2 years of self-play, 44 days of training
Rubik's Cube: 10^4 years of simulation, 10 days of training
Massive data and massive computation are needed — with insufficient data, training lacks convergence and the resulting AI is unreliable.
Our goal: provably efficient RL algorithms
Sample efficiency: how many data points are needed?
Computational efficiency: how much computation is needed?
Function approximation: how to allow an infinite number of observations?
Background: Episodic Markov decision process
Episodic Markov decision process
[Diagram: agent–environment loop — at state sh ∈ S, the agent takes action ah = πh(sh) ∈ A, receives reward rh = rh(sh, ah) ∈ [0, 1], and the environment moves to sh+1 ∼ Ph(· | sh, ah).]
For h ∈ [H], in the h-th step, observe state sh, take action ah
Receive immediate reward rh, with E[rh | sh = s, ah = a] = rh(s, a)
Environment evolves to a new state sh+1 ∼ Ph(· | sh, ah)
Policy πh : S → A: how the agent takes actions at the h-th step
Goal: find the policy π⋆ that maximizes J(π) := Eπ[∑_{h=1}^H rh]
Episodic Markov decision process
[Diagram: a trajectory s1, a1, r1, s2, a2, r2, ..., sH, aH, rH generated by actions ah = πh(sh), transitions sh+1 ∼ P(· | sh, ah), and rewards r(sh, ah), with J(π) = E[r(s1, a1) + r(s2, a2) + ⋯ + r(sH, aH)].]
For simplicity, fix s1 = s⋆, rh = r, Ph = P for all h.
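The objective J(π) = Eπ[∑ rh] above can be estimated by simply rolling out the policy; a minimal sketch on a hypothetical two-state, two-action MDP (the reward and transition tables are made up for illustration):

```python
import random

H = 5  # horizon
states, actions = [0, 1], [0, 1]
# Hypothetical reward and transition tables (rh = r, Ph = P for all h).
r = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.9, (1, 1): 0.2}
P = {(0, 0): [0.8, 0.2], (0, 1): [0.3, 0.7],
     (1, 0): [0.5, 0.5], (1, 1): [0.9, 0.1]}

def rollout(policy, s1=0):
    """Run one episode: at step h observe sh, take ah = policy(h, sh)."""
    s, total = s1, 0.0
    for h in range(H):
        a = policy(h, s)
        total += r[(s, a)]
        s = random.choices(states, weights=P[(s, a)])[0]
    return total

def J(policy, n=20000):
    """Monte Carlo estimate of J(pi) = E[sum of the H rewards]."""
    return sum(rollout(policy) for _ in range(n)) / n

random.seed(0)
greedy = lambda h, s: max(actions, key=lambda a: r[(s, a)])
```

Here `greedy` is a hypothetical policy that maximizes the immediate reward; any map (h, s) → a can be plugged in.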
Contextual bandit is a special case of MDP
Contextual bandit (CB): observe context s ∈ S, take action a ∈ A, observe (random) reward r with mean r⋆(s, a)
Optimal policy: π⋆(s) = arg max_{a∈A} r⋆(s, a)
CB is an MDP with H = 1
Multi-armed bandit (MAB): H = 1, |A| finite, and |S| = 1
MDP is significantly more challenging than CB due to temporal dependency (i.e., state transitions)
From MDP to RL
Informal definition of RL: solve the MDP when the environment (r, P) is unknown
(i) generative model: able to query the reward rh and next state sh+1 for any (sh, ah) ∈ S × A
(ii) offline setting: given a dataset D = {(st_h, at_h, rt_h)}_{h∈[H], t∈[T]} (T trajectories), learn π⋆ without any interactions
(iii) online setting: without any prior knowledge, learn π⋆ by interacting with the MDP for T episodes
(i) learning from a model — query complexity; (ii) learning from data — sample complexity; (iii) learning by doing — sample complexity + exploration
Online RL: Pipeline
[Diagram: online agent–environment loop — execute πt, append Traj(πt) to the dataset (Dt ← Dt−1 ∪ Traj(πt)), update πt+1 ← Alg(Dt, F), and continue for T episodes.]
Initialize π1 = {π1_h}_{h∈[H]} arbitrarily
For each t ∈ [T], execute πt and get a trajectory Traj(πt)
Store Traj(πt) into the dataset Dt = {Traj(πi), i ≤ t}
Update the policy via an RL algorithm (which we need to design)
Sample efficiency in online RL: regret
Measure sample efficiency via regret:
Regret(T) = ∑_{t=1}^T [J(π⋆) − J(πt)], where the t-th term is the suboptimality in the t-th episode
Why is regret meaningful?
Define a random policy π̄T uniformly sampled from {πt, t ∈ [T]}. Then J(π⋆) − J(π̄T) = Regret(T)/T in expectation over the uniform draw
If Regret(T) = o(T), then when T is sufficiently large, π̄T is arbitrarily close to optimal
Goal: Regret(T) = O(√T · poly(H) · poly(dim))
Challenge of Online RL: exploration & uncertainty
[Diagram: two intertwined components of online RL — deep exploration (exploration vs. exploitation) and function estimation.]
Deep exploration: how to construct exploration incentives tailored to function approximators?
Function estimation: how to assess estimation uncertainty based on adaptively acquired data?
Warmup example: LinUCB for linear CB
Linear contextual bandit
Setting of linear CB:
Observation structure: observe (perhaps adversarial) st ∈ S, take action at ∈ A, receive bounded reward rt ∈ [0, 1]
E[rt | st = s, at = a] = r⋆(s, a) for all (s, a) ∈ S × A
Linear reward function: r⋆(s, a) = φ(s, a)ᵀθ⋆
φ : S × A → Rd: known feature mapping
Normalization condition: ‖θ⋆‖₂ ≤ √d, sup_{s,a} ‖φ(s, a)‖₂ ≤ 1
Why? It includes MAB as a special case: |S| = 1, d = |A|, φ(s, a) = e_a, θ⋆ = (µ1, µ2, ..., µ_A) with µ_a ∈ [0, 1] for all a ∈ A
Regret(T) = ∑_{t=1}^T max_{a∈A} ⟨φ(st, a) − φ(st, at), θ⋆⟩
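The MAB embedding above is easy to check numerically: with one-hot features φ(s, a) = e_a, the linear model ⟨φ(s, a), θ⋆⟩ recovers the arm means exactly. A quick sketch (the arm means are made up):

```python
import numpy as np

mu = np.array([0.2, 0.7, 0.4])   # hypothetical arm means, mu_a in [0, 1]
d = len(mu)                       # d = |A|, |S| = 1
theta_star = mu                   # theta* = (mu_1, ..., mu_A)

def phi(a):
    return np.eye(d)[a]           # phi(s, a) = e_a (the context is trivial)

# The linear reward model recovers each arm's mean exactly.
for a in range(d):
    assert np.isclose(phi(a) @ theta_star, mu[a])

# The normalization conditions from the slide hold.
assert np.linalg.norm(theta_star) <= np.sqrt(d)
assert all(np.linalg.norm(phi(a)) <= 1 for a in range(d))
```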
Exploration via optimism in the face of uncertainty (OFU)
For each t ∈ [T], in the t-th round, determine at in the following two steps:
Uncertainty quantification: based on the current dataset Dt−1, construct a high-probability confidence set Θt that contains θ⋆:
P(∀t ∈ [T], θ⋆ ∈ Θt) ≥ 1 − δ   (1)
Optimistic planning: choose at to maximize the most optimistic model within Θt:
at = arg max_{a∈A} max_{θ∈Θt} ⟨φ(st, a), θ⟩   (2)
Intuition of the OFU principle
Define two functions:
r+,t(s, a) = max_{θ∈Θt} ⟨φ(s, a), θ⟩,   r−,t(s, a) = min_{θ∈Θt} ⟨φ(s, a), θ⟩
By (1), θ⋆ ∈ Θt, so
r+,t(s, a) ≥ r⋆(s, a) ≥ r−,t(s, a), ∀(s, a) ∈ S × A
|r+,t(s, a) − r−,t(s, a)| reflects the uncertainty about action a at context s
By (2), at = arg max_a r+,t(st, a) — choose the action with either large uncertainty (explore) or large reward (exploit)
How to construct Θt for linear CB?
In the t-th round, our dataset is Dt−1 = {(si, ai, ri)}_{i≤t−1}
First construct an estimator of θ⋆ via ridge regression:
θt = arg min_{θ∈Rd} Lt(θ) := ∑_{i=1}^{t−1} [ri − ⟨φ(si, ai), θ⟩]² + ‖θ‖₂²
Hessian of the ridge loss (up to a factor of 2):
Λt = I + ∑_{i=1}^{t−1} φ(si, ai) φ(si, ai)ᵀ
Closed-form solution: θt = (Λt)−1 ∑_{i=1}^{t−1} ri · φ(si, ai)
Define Θt as the confidence ellipsoid:
Θt = {θ ∈ Rd : ‖θ − θt‖_{Λt} ≤ β},
where β = C · √(d · log(1 + T/δ)) and ‖x‖_M := √(xᵀMx)
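The ridge estimator and its design matrix take only a few lines of NumPy. The sketch below (on synthetic data) checks the closed form θt = (Λt)−1 ∑ ri·φi against a generic least-squares solver applied to the ridge objective, and evaluates the ellipsoid radius ‖θt − θ⋆‖_{Λt}:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 4, 50
theta_star = rng.normal(size=d) / np.sqrt(d)
Phi = rng.normal(size=(n, d)) / np.sqrt(d)        # rows: phi(s^i, a^i)
r = Phi @ theta_star + 0.1 * rng.normal(size=n)   # noisy rewards

# Lambda^t = I + sum_i phi_i phi_i^T ;  theta^t = (Lambda^t)^{-1} sum_i r^i phi_i
Lam = np.eye(d) + Phi.T @ Phi
theta_hat = np.linalg.solve(Lam, Phi.T @ r)

# The same estimator via the ridge loss directly: augmenting the design with
# identity rows makes ordinary least squares solve
# sum_i (r^i - <phi_i, theta>)^2 + ||theta||^2.
Phi_aug = np.vstack([Phi, np.eye(d)])
r_aug = np.concatenate([r, np.zeros(d)])
theta_lsq, *_ = np.linalg.lstsq(Phi_aug, r_aug, rcond=None)
assert np.allclose(theta_hat, theta_lsq)

# Ellipsoid radius ||theta_hat - theta*||_{Lambda^t}.
err = np.sqrt((theta_hat - theta_star) @ Lam @ (theta_hat - theta_star))
```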
LinUCB algorithm
Confidence ellipsoid: Θt = {θ ∈ Rd : ‖θ − θt‖_{Λt} ≤ β}
r+,t admits a closed form:
r+,t(s, a) = max_{θ∈Θt} ⟨φ(s, a), θ⟩ = ⟨φ(s, a), θt⟩ + β · ‖φ(s, a)‖_{(Λt)−1}, where the second term is the bonus
LinUCB algorithm (Dani et al. (2008), Abbasi-Yadkori et al. (2011), ...):
For t = 1, ..., T, at the beginning of the t-th round:
Observe context st
Compute θt via ridge regression and compute the Hessian Λt
Choose at = arg max_{a∈A} r+,t(st, a) and observe rt
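Putting the pieces together, here is a minimal LinUCB loop on a synthetic linear bandit with a fixed arm set (the features and θ⋆ are made up, and β is a hand-picked constant rather than the theoretical value):

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, T, beta = 5, 10, 5000, 1.0
theta_star = rng.normal(size=d); theta_star /= np.linalg.norm(theta_star)
arms = rng.normal(size=(K, d)); arms /= np.linalg.norm(arms, axis=1, keepdims=True)

Lam = np.eye(d)        # Lambda^1 = I
b = np.zeros(d)        # running sum of r^i * phi^i
regret = 0.0
best = (arms @ theta_star).max()
for t in range(T):
    theta_hat = np.linalg.solve(Lam, b)   # ridge estimator
    Lam_inv = np.linalg.inv(Lam)
    # UCB index: <phi, theta_hat> + beta * ||phi||_{(Lambda^t)^{-1}}
    ucb = arms @ theta_hat + beta * np.sqrt(
        np.einsum('kd,de,ke->k', arms, Lam_inv, arms))
    a = int(np.argmax(ucb))
    r = arms[a] @ theta_star + 0.1 * rng.normal()
    Lam += np.outer(arms[a], arms[a])     # rank-one update of the Hessian
    b += r * arms[a]
    regret += best - arms[a] @ theta_star
```

After the loop, `regret / T` is the average per-round regret, which shrinks as T grows (the √T-regret behavior from the theorem below has a vanishing average).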
Regret Bound of LinUCB
Theorem. With probability 1 − δ, LinUCB satisfies
Regret(T) ≤ β · √(dT) · √(log T) = Õ(d√T)
Independent of the sizes of the context and action spaces; depends only on the dimension (linear in d)
Depends on T through √T (sample efficient)
Computational cost is poly(d, A, T) and the memory requirement is poly(d, T) (Why?)
Regret Analysis of LinUCB (1/3)
Step 1. Condition on the event E = {∀t ∈ [T], θ⋆ ∈ Θt}, which gives
r−,t(s, a) ≤ r⋆(s, a) ≤ r+,t(s, a), ∀(s, a) ∈ S × A
Step 2. Regret decomposition:
max_a r⋆(st, a) − r⋆(st, at)
= (max_a r⋆(st, a) − r+,t(st, at)) + (r+,t(st, at) − r⋆(st, at))
≤ r+,t(st, at) − r−,t(st, at) ≤ 2β · ‖φ(st, at)‖_{(Λt)−1},
where the first term is ≤ 0 by optimism: at maximizes r+,t(st, ·) and r+,t ≥ r⋆. Since the instant regret is also at most 1 (rewards lie in [0, 1]), we have shown
max_a r⋆(st, a) − r⋆(st, at) ≤ 2β · min{1, ‖φ(st, at)‖_{(Λt)−1}}
Regret Analysis of LinUCB (2/3)
Step 3. Telescope the bonuses: Cauchy–Schwarz + elliptical potential lemma
Regret(T) ≤ 2β · ∑_{t=1}^T min{1, ‖φ(st, at)‖_{(Λt)−1}}
≤ 2β√T · [∑_{t=1}^T min{1, ‖φ(st, at)‖²_{(Λt)−1}}]^{1/2}
≤ 2β√T · √(2 log det(ΛT)) ≲ β · √T · √(d · log T)
Elliptical potential lemma
Lemma. ∑_{t=1}^T min{1, ‖φ(st, at)‖²_{(Λt)−1}} ≤ 2 log det(ΛT+1) ≲ d · log T
Bayesian perspective:
Prior distribution: θ⋆ ∼ N(0, I)
Likelihood: ri = r⋆(si, ai) + εi, εi ∼ N(0, 1)
Given dataset Dt−1, the posterior distribution is N(θt, (Λt)−1)
‖φ(s, a)‖²_{(Λt)−1} measures the conditional information gain I(θ⋆; (s, a) | Dt−1) = H(θ⋆ | Dt−1) − H(θ⋆ | Dt−1 ∪ {(s, a)})
Elliptical potential lemma ↔ chain rule of mutual information
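The lemma can be sanity-checked numerically for any feature sequence with ‖φ‖ ≤ 1: the summed squared (truncated) bonuses are controlled by 2 log det of the final design matrix. A sketch with random features:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 6, 500
phis = rng.normal(size=(T, d))
# Enforce the normalization ||phi|| <= 1 from the setting.
phis /= np.maximum(np.linalg.norm(phis, axis=1, keepdims=True), 1.0)

Lam = np.eye(d)
lhs = 0.0
for t in range(T):
    # ||phi_t||^2_{(Lambda^t)^{-1}}, computed BEFORE the rank-one update.
    w2 = phis[t] @ np.linalg.solve(Lam, phis[t])
    lhs += min(1.0, w2)
    Lam += np.outer(phis[t], phis[t])

rhs = 2 * np.linalg.slogdet(Lam)[1]   # 2 log det(Lambda^{T+1})
assert lhs <= rhs                      # the elliptical potential inequality
```

The inequality holds deterministically because min{1, x} ≤ 2 log(1 + x) for x ≥ 0 and the log-determinant telescopes over the rank-one updates.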
Regret Analysis of LinUCB (3/3)
Step 4. Show optimism: θ⋆ ∈ Θt = {θ ∈ Rd : ‖θ − θt‖_{Λt} ≤ β}
θt − θ⋆ = (Λt)−1 [∑_{i=1}^{t−1} (r⋆(si, ai) + εi) · φ(si, ai) − Λt · θ⋆]   (using ri = r⋆(si, ai) + εi)
= −(Λt)−1 θ⋆ + (Λt)−1 ∑_{i=1}^{t−1} εi · φ(si, ai),   where the εi form a martingale difference sequence
(the sign follows since ∑_{i} φ(si, ai)φ(si, ai)ᵀ θ⋆ = (Λt − I)θ⋆)
‖θt − θ⋆‖_{Λt} ≈ ‖∑_{i=1}^{t−1} εi · φ(si, ai)‖_{(Λt)−1} ≤ β
by concentration of self-normalized processes (e.g., Abbasi-Yadkori et al., 2011)
Summary of LinUCB
The algorithm is based on the optimism principle:
upper confidence bound = ridge estimator + bonus
Õ(d√T) regret via:
a. using optimism to bound the instant regret by the width of the confidence region (the bonus)
b. summing up the instant regret terms via the elliptical potential lemma
c. constructing the confidence region via uncertainty quantification for ridge regression
Extensions: generalized linear models, RKHS, overparameterized networks, abstract function classes with bounded eluder dimension, ...
From Linear CB to Linear RL
Value functions and Bellman equation
Our goal is to find π⋆ = arg max_π J(π) = Eπ[∑_{h=1}^H rh]
π⋆ is characterized by Q⋆ = {Q⋆_h}_{h∈[H+1]} : S × A → [0, H]
π⋆_h(s) = arg max_a Q⋆_h(s, a), ∀s ∈ S;   V⋆_h(s) = max_a Q⋆_h(s, a)
Q⋆_h(s, a): optimal return starting from sh = s and ah = a
Q⋆ is characterized by the Bellman equation
[Diagram: Q⋆_h(s, a) = E[r(s, a) + r(sh+1, ah+1) + ⋯ + r(sH, aH)], where the trajectory starts from sh = s, ah = a and follows π⋆_{h+1}, ..., π⋆_H afterwards.]
Value functions and Bellman equation
The MDP terminates after H steps — Q⋆_{H+1} = 0, Q⋆_H = r
Bellman equation:
Q⋆_h(s, a) = (BQ⋆_{h+1})(s, a) := r(s, a) + E_{sh+1∼P}[V⋆_{h+1}(sh+1) | sh = s, ah = a], where B is the Bellman operator
V⋆_h(s) = max_a Q⋆_h(s, a)
RL with linear function approximation: use Flin = {φ(s, a)ᵀθ : θ ∈ Rd} (functions S × A → R) to approximate {Q⋆_h}_{h∈[H]}
φ : S × A → Rd: known feature mapping
Can allow known and step-varying features: {φh}_{h∈[H]}
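When (r, P) are known, the Bellman recursion above is directly implementable by backward induction; a sketch on a small hypothetical tabular MDP (sizes and tables are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, H = 4, 3, 6
r = rng.uniform(size=(S, A))                  # r(s, a) in [0, 1]
P = rng.dirichlet(np.ones(S), size=(S, A))    # P(. | s, a), rows sum to 1

# Backward induction: Q_{H+1} = 0, Q_h = r + E_{s' ~ P}[V_{h+1}(s')].
V = np.zeros(S)                               # V_{H+1} = 0
policy = np.zeros((H, S), dtype=int)
for h in reversed(range(H)):
    Q = r + P @ V                             # (S, A): Bellman operator on V
    policy[h] = Q.argmax(axis=1)              # pi*_h(s) = argmax_a Q*_h(s, a)
    V = Q.max(axis=1)                         # V*_h(s) = max_a Q*_h(s, a)

# V now holds V*_1; it lies in [0, H] since rewards lie in [0, 1].
assert np.all(V >= 0) and np.all(V <= H)
```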
RL is more challenging than bandit
[Diagram: the online RL pipeline — execute πt, collect Traj(πt) into Dt, update πt+1 ← Alg(Dt, F), for T episodes.]
Regret(T) = ∑_{t=1}^T [J(π⋆) − J(πt)]
J(π⋆) = max_a Q⋆_1(s⋆, a), with a fixed initial state for simplicity
Can generalize to adversarial s1
Online RL ≠ a contextual bandit with TH rounds
RL requires deep exploration to achieve o(T) regret.
Deep Exploration
[Diagram: a chain MDP over H steps — starting from s⋆, only action b1 ∈ A = {b1, b2, ...} moves to the next "good" state; every action in A \ {b1} drops to an absorbing "bad" state (where all actions stay bad); π⋆ plays b1 at every step.]
A contextual bandit algorithm goes directly to the bad state from the initial state — Ω(T) regret
Random sampling requires |A|^H samples (inefficient)
Goal: Regret(T) = poly(H) · √T (needs deep exploration)
O(SAH) query complexity if we have a generative model
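The chain construction is easy to simulate: under uniformly random actions, the probability of staying on the good chain for H − 1 consecutive steps is (1/|A|)^{H−1}, so random exploration needs on the order of |A|^{H−1} episodes to ever see the rewarding end of the chain. A quick check (chain layout as in the diagram, with |A| and H chosen for illustration):

```python
import random

random.seed(4)
A, H = 4, 6   # |A| actions; only action 0 (playing the role of b1) stays good

def random_episode():
    """True iff uniformly random actions reach the last good state."""
    for _ in range(H - 1):
        if random.randrange(A) != 0:   # any action other than b1 -> bad (absorbing)
            return False
    return True

n = 200000
p_hat = sum(random_episode() for _ in range(n)) / n
p_true = (1 / A) ** (H - 1)            # success probability per episode
assert abs(p_hat - p_true) < 5e-4
```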
What assumption should we impose?
In CB, π⋆ is greedy with respect to r⋆, and we assume r⋆ is linear
In RL, π⋆ is greedy with respect to Q⋆
Linear realizability: Q⋆_h ∈ Flin, ∀h ∈ [H]
Theorem [Wang-Wang-Kakade-21]. There exists a class of linearly realizable MDPs such that any online RL algorithm requires min{Ω(2^d), Ω(2^H)} samples to obtain a near-optimal policy.
Fundamentally different from supervised learning and bandits
To achieve efficiency in online RL, we need stronger assumptions
Assumption: Image(B) ⊆ Flin
We assume the image set of the Bellman operator lies in Flin:
For any Q : S × A → [0, H], there exists θ_Q ∈ Rd such that
(BQ)(s, a) = r(s, a) + E[max_{a′} Q(s′, a′)] = ⟨φ(s, a), θ_Q⟩
Normalization condition: ‖θ_Q‖₂ ≤ 2H√d, sup_{s,a} ‖φ(s, a)‖₂ ≤ 1
Is there such an MDP? Yes: the linear MDP
r(s, a) = ⟨φ(s, a), ω⟩,   P(s′ | s, a) = ⟨φ(s, a), µ(s′)⟩
Normalization: ‖ω‖₂ ≤ √d, ∑_{s′} ‖µ(s′)‖₂ ≤ √d
Linear MDP contains tabular MDP as a special case: φ(s, a) = e_{(s,a)}, d = |S| · |A|
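The last point can be checked numerically: with one-hot features, rewards and transitions of any tabular MDP are exactly linear in φ. The 2-state, 2-action MDP below is a hypothetical toy example (not from the talk) verifying r(s, a) = ⟨φ(s, a), ω⟩ and P(s′|s, a) = ⟨φ(s, a), µ(s′)⟩.

```python
import numpy as np

# Toy tabular MDP (hypothetical): 2 states, 2 actions, one-hot features.
S, A = 2, 2
d = S * A
rng = np.random.default_rng(0)

r = rng.uniform(size=(S, A))                 # arbitrary rewards r(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'

def phi(s, a):
    """One-hot feature e_{(s,a)} in R^{|S||A|}."""
    v = np.zeros(d)
    v[s * A + a] = 1.0
    return v

omega = r.reshape(d)   # omega stacks all rewards r(s, a)
mu = P.reshape(d, S)   # column mu[:, s'] stacks P(s'|s, a)

for s in range(S):
    for a in range(A):
        assert np.isclose(phi(s, a) @ omega, r[s, a])   # r is linear in phi
        assert np.allclose(phi(s, a) @ mu, P[s, a])     # P is linear in phi
print("tabular MDP realized as a linear MDP with d =", d)
```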
Algorithm for linear RL: LSVI-UCB
33
Algorithm: LSVI + optimism

At the beginning of the t-th episode, dataset D_{t−1} = {(s_h^i, a_h^i, r_h^i), h ∈ [H]}_{i<t}

For h = H, . . . , 1, backwardly solve ridge regression:

    y_h^i = r_h^i + max_a Q_{h+1}^{+,t}(s_{h+1}^i, a)
    θ_h^t = argmin_θ ∑_{i=1}^{t−1} [y_h^i − ⟨φ(s_h^i, a_h^i), θ⟩]² + ‖θ‖₂²

Hessian of ridge loss: Λ_h^t = I + ∑_{i=1}^{t−1} φ(s_h^i, a_h^i) φ(s_h^i, a_h^i)^⊤

Construct bonus Γ_h^t(s, a) = β · ‖φ(s, a)‖_{(Λ_h^t)^{−1}}

UCB Q-function:

    Q_h^{+,t}(s, a) = Trunc_{[0,H]}[⟨φ(s, a), θ_h^t⟩ + Γ_h^t(s, a)]

Execute π_h^t(·) = argmax_a Q_h^{+,t}(·, a)

34
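The backward ridge-regression pass and bonus construction above can be sketched in code. This is a minimal sketch, not the authors' implementation: the synthetic 2-state linear MDP, the horizon H = 3, the episode count, and the constant β = 1 are all illustrative assumptions (the theory calls for β = O(dH)).

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H, T = 2, 2, 3, 200
d = S * A
beta = 1.0  # illustrative; theory suggests beta = O(dH)

# Synthetic tabular (hence linear) MDP with one-hot features
r = rng.uniform(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))

def phi(s, a):
    v = np.zeros(d)
    v[s * A + a] = 1.0
    return v

data = [[] for _ in range(H)]   # per-step tuples (s, a, reward, next state)
total_reward = 0.0

for t in range(T):
    # Backward pass: ridge regression + bonus at each step h
    theta = np.zeros((H + 1, d))
    Lambda_inv = [np.eye(d) for _ in range(H + 1)]

    def Q_plus(h, s, a, theta=theta, Lambda_inv=Lambda_inv):
        f = phi(s, a)
        bonus = beta * np.sqrt(f @ Lambda_inv[h] @ f)  # Gamma_h^t(s, a)
        return np.clip(f @ theta[h] + bonus, 0.0, H)   # truncated UCB Q

    for h in range(H - 1, -1, -1):
        Lam = np.eye(d)          # ridge Hessian Lambda_h^t
        b = np.zeros(d)
        for (s, a, rew, s_next) in data[h]:
            f = phi(s, a)
            y = rew
            if h + 1 < H:        # regression target uses next-step UCB value
                y += max(Q_plus(h + 1, s_next, ap) for ap in range(A))
            Lam += np.outer(f, f)
            b += y * f
        Lambda_inv[h] = np.linalg.inv(Lam)
        theta[h] = Lambda_inv[h] @ b   # closed-form ridge solution

    # Roll out the greedy policy w.r.t. the UCB Q-functions
    s = 0
    for h in range(H):
        a = max(range(A), key=lambda a: Q_plus(h, s, a))
        rew = r[s, a]
        s_next = rng.choice(S, p=P[s, a])
        data[h].append((s, a, rew, s_next))
        total_reward += rew
        s = s_next

print("average episodic return:", total_reward / T)
```

The bonus shrinks along well-explored feature directions, so the agent is drawn toward directions where Λ_h^t is still close to the identity, which is exactly the deep-exploration mechanism the slides describe.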
A more abstract version of LSVI-UCB

For h = H, . . . , 1, backwardly solve least-squares regression:

    y_h^i = r_h^i + max_a Q_{h+1}^{+,t}(s_{h+1}^i, a)
    Q_h^t = argmin_{f ∈ F} ∑_{i=1}^{t−1} [y_h^i − f(s_h^i, a_h^i)]² + penalty(f)

Uncertainty quantification for Q_h^t: find Γ_h^t such that

    P(∀(t, h, s, a), |Q_h^t(s, a) − (BQ_{h+1}^{+,t})(s, a)| ≤ Γ_h^t(s, a)) ≥ 1 − δ

Optimism:

    Q_h^{+,t} = Trunc_{[0,H]}[Q_h^t + Γ_h^t(s, a)]

• Computation & memory cost independent of |S| (even for RKHS, overparameterized NN)

35
A more abstract version of LSVI-UCB

[Figure: Q_h^+ = Q_h^t + Γ_h^t (LSVI estimate plus bonus) sandwiches BQ_{h+1}^+ between Q_h^+ − 2Γ_h^t and Q_h^+]

    |Q_h^t(s, a) − (BQ_{h+1}^+)(s, a)| ≤ Γ_h^t(s, a), ∀t, h, s, a
    Q_h^+(s, a) = Q_h^t(s, a) [LSVI] + Γ_h^t(s, a) [bonus]

Optimism gives: BQ_{h+1}^+ ∈ [Q_h^+ − 2Γ_h^t, Q_h^+]
Monotonicity of the Bellman operator: Q_h^+ ≥ Q_h^⋆ (UCB)

36
Sample efficiency of LSVI-UCB
37
LSVI-UCB achieves √T-regret

Theorem [Jin-Yang-Wang-Jordan-20] Choosing β = O(dH), with probability at least 1 − 1/T,

    Regret(T) = O(β · H · √(dT)) = O(H² · √(d³T))

Directly implies a |S|^{1.5}|A|^{1.5}H²√T regret for tabular RL
First algorithm with both sample and computational efficiency in the context of RL with function approximation
Only assumption: the image set of B lies in F_lin
The optimal regret dH²√T is achieved by [Zanette et al. 2020] with a relaxed model assumption, but the computation is intractable.

38
A general version of regret

Let Q_ucb be the function class containing the Q-functions constructed by LSVI-UCB:

    Q_ucb = {Trunc_{[0,H]}[φ(·,·)^⊤θ + β · ‖φ(·,·)‖_{Λ^{−1}}]} for the linear case

Theorem [Jin-Yang-Wang-Wang-Jordan-20] Choosing β = O(H · √(log N_∞(Q_ucb, T^{−2}))), with probability at least 1 − 1/T,

    Regret(T) = O(βH · √(d_eff · T)) = O(H² · √(d_eff · T · log N_∞))

log N_∞(Q_ucb, T^{−2}) ≲ d log T for the linear case
Includes kernels and overparameterized neural networks
d_eff is the effective dimension of the RKHS or NTK
For an abstract function class F, d_eff can be set as the Bellman-Eluder dimension [Jin-Liu-Miryoosefi-21]

39
Regret analysis
40
Regret analysis: sensitivity analysis + elliptical potential

A general sensitivity analysis:

J(π⋆) − J(π^t)
  = ∑_{h∈[H]} E_{π⋆}[ Q_h^{+,t}(s_h, π_h^⋆(s_h)) − Q_h^{+,t}(s_h, π_h^t(s_h)) ]   (i) policy optimization error
  + ∑_{h∈[H]} E_{π^t}[ (Q_h^{+,t} − BQ_{h+1}^{+,t})(s_h, a_h) ]                   (ii) Bellman error on π^t
  + ∑_{h∈[H]} E_{π⋆}[ −(Q_h^{+,t} − BQ_{h+1}^{+,t})(s_h, a_h) ]                   (iii) Bellman error on π⋆

Term (i) ≤ 0 as π^t is greedy with respect to Q_h^{+,t}
Optimism: Q_h^{+,t} − 2Γ_h^t ≤ BQ_{h+1}^{+,t} ≤ Q_h^{+,t}, so Term (iii) ≤ 0

41
Regret analysis: sensitivity analysis + elliptical potential

Therefore, we have

Regret(T) = ∑_{t=1}^T [J(π⋆) − J(π^t)]
  ≤ 2 ∑_{t∈[T]} ∑_{h∈[H]} E_{π^t}[ Γ_h^t(s_h, a_h) ]
  = 2 ∑_{t∈[T]} ∑_{h∈[H]} Γ_h^t(s_h^t, a_h^t) + martingale-difference terms
  = 2β ∑_{h∈[H]} ∑_{t∈[T]} ‖φ(s_h^t, a_h^t)‖_{(Λ_h^t)^{−1}} + H·√T
  = O(βH√(dT))

The second line holds because {(s_h^t, a_h^t), h ∈ [H]} ∼ π^t

It remains to conduct uncertainty quantification:

    P(∀(t, h, s, a), |Q_h^t(s, a) − (BQ_{h+1}^{+,t})(s, a)| ≤ Γ_h^t(s, a)) ≥ 1 − δ

= uniform concentration over Q_ucb (new for RL) + self-normalized concentration (same as in CB)

42
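The final step, bounding ∑_{t∈[T]} ‖φ(s_h^t, a_h^t)‖_{(Λ_h^t)^{−1}} by O(√(dT)), is the elliptical potential lemma: with ‖φ_t‖₂ ≤ 1 and Λ_t = I + ∑_{i<t} φ_i φ_i^⊤, one has ∑_t ‖φ_t‖²_{Λ_t^{−1}} ≤ 2d log(1 + T/d), and Cauchy-Schwarz then gives the √(dT) rate. The short simulation below (random unit features, an illustrative choice) checks both inequalities numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 5, 1000

Lam = np.eye(d)        # Lambda_t, updated as features arrive
sq_potential = 0.0     # sum_t ||phi_t||^2_{Lambda_t^{-1}}
potential = 0.0        # sum_t ||phi_t||_{Lambda_t^{-1}}
for t in range(T):
    f = rng.normal(size=d)
    f /= np.linalg.norm(f)        # ensure ||phi_t||_2 <= 1
    w = np.linalg.solve(Lam, f)   # Lambda_t^{-1} phi_t
    sq = f @ w                    # ||phi_t||^2_{Lambda_t^{-1}}
    sq_potential += sq
    potential += np.sqrt(sq)
    Lam += np.outer(f, f)         # rank-one update to Lambda_{t+1}

bound_sq = 2 * d * np.log(1 + T / d)   # elliptical potential bound
bound = np.sqrt(T * bound_sq)          # Cauchy-Schwarz step: O(sqrt(dT))
print(sq_potential, "<=", bound_sq)
print(potential, "<=", bound)
```

Logarithmic growth of the potential, despite T = 1000 features, is exactly why the regret is √T rather than linear in T.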
Summary & Extensions

Exploration in RL is more challenging in that we need to handle deep exploration
LSVI-UCB explores by doing uncertainty quantification for the LSVI estimator
LSVI propagates the uncertainty at later steps h backward to h = 1
By construction, Q_1^+ contains the uncertainty about all h ≥ 1
LSVI-UCB achieves sample efficiency and computational tractability in online RL with function approximation

Similar principles can be extended to:
  proximal policy optimization (use soft-greedy instead of greedy)
  zero-sum Markov games (two-player extension)
  constrained MDPs (primal-dual optimization)

43
References (a very incomplete list)

Linear bandits
• [Dani et al, 2008] Stochastic Linear Optimization under Bandit Feedback
• [Abbasi-Yadkori et al, 2011] Improved Algorithms for Linear Stochastic Bandits

Optimism in tabular RL
• [Auer & Ortner, 2007] Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning
• [Jaksch et al, 2010] Near-optimal Regret Bounds for Reinforcement Learning
• [Azar et al, 2017] Minimax Regret Bounds for Reinforcement Learning

44
References (a very incomplete list)

RL with function approximation
• [Dann et al, 2018] On Oracle-Efficient PAC RL with Rich Observations
• [Wang & Yang, 2019] Sample-Optimal Parametric Q-Learning Using Linearly Additive Features
• [Jin et al, 2020] Provably Efficient Reinforcement Learning with Linear Function Approximation (this talk)
• [Zanette et al, 2020] Learning Near Optimal Policies with Low Inherent Bellman Error
• [Du et al, 2020] Is a Good Representation Sufficient for Sample Efficient Reinforcement Learning?
• [Xie et al, 2020] Learning Zero-Sum Simultaneous-Move Markov Games Using Function Approximation and Correlated Equilibrium
• [Yang et al, 2020] On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces (this talk)
• [Jin et al, 2021] Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms

45