  • Interactive Imitation Learning

    Sham Kakade and Wen Sun
CS 6789: Foundations of Reinforcement Learning

  • Announcements

    Final report: NeurIPS format

    Maximum 9 pages for main text (not including references and appendix)

  • Recap

    Offline IL and Hybrid Setting:

    Ground-truth reward $r(s, a) \in [0,1]$ is unknown;

    assume the expert $\pi^\star$ is a near-optimal policy

    We have a dataset $\mathcal{D} = \{(s_i^\star, a_i^\star)\}_{i=1}^{M} \sim d^{\pi^\star}$


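    We can rewrite it in the max-min formulation:

    $\max_\theta \min_\pi f(\theta, \pi)$, where $f(\theta, \pi) := \mathbb{E}_{s,a \sim d^{\pi}} \theta^\top \phi(s, a) - \mathbb{E}_{s,a \sim d^{\pi^\star}} \theta^\top \phi(s, a) + \mathbb{E}_{s,a \sim d^{\pi}} \ln \pi(a \mid s)$

    Algorithm: incremental update on the cost function $\theta$, exact update on the policy $\pi$ (soft VI):

    $\theta_{t+1} := \theta_t + \eta \nabla_\theta f(\theta_t, \pi_t), \qquad \pi_{t+1} = \arg\min_\pi f(\theta_{t+1}, \pi)$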
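
    A short worked step for the incremental update, assuming the features $\phi$ do not depend on $\theta$: the objective $f(\theta, \pi)$ is linear in $\theta$ and the log-policy term does not involve $\theta$, so the gradient is a difference of feature expectations. The sample-based estimate on the last line (with $N$ hypothetical learner samples $(s_j, a_j)$) is an illustrative assumption, not something stated on the slide.

```latex
\nabla_\theta f(\theta, \pi)
  = \mathbb{E}_{s,a \sim d^{\pi}} \phi(s, a) - \mathbb{E}_{s,a \sim d^{\pi^\star}} \phi(s, a)
  \approx \frac{1}{N} \sum_{j=1}^{N} \phi(s_j, a_j) - \frac{1}{M} \sum_{i=1}^{M} \phi(s_i^\star, a_i^\star),
  \qquad (s_j, a_j) \sim d^{\pi}.
```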

  • Today:

    Interactive Imitation Learning Setting

    Key assumption: we can query the expert π⋆ at any time and at any state during training

  • Recall the Main Problem from Behavior Cloning:

    [Figure: expert's trajectory vs. the learned policy's trajectory]

    No training data of “recovery” behavior

  • Intuitive solution: Interaction


    Use interaction to collect data where learned policy goes

  • General Idea: Iterative Interactive Approach

    [Diagram: a loop. Collect data through interaction → new data → update policy → updated policy → collect data again]

    All DAgger slides credit: Drew Bagnell, Stephane Ross, Arun Venkatraman

  • DAgger: Dataset Aggregation 0th iteration


    Expert demonstrates task → Dataset → Supervised learning → 1st policy π1


    States from the learned policy

  • Success!


    [Ross AISTATS 2011]

  • Better


    [Ross AISTATS 2011]

    [Plot axis: average falls per lap]

  • More fun than Video Games…

    [Ross ICRA 2013]

  • Forms of the Interactive Experts

    An interactive expert is expensive, especially when the expert is human…

    But the expert does not have to be human…

    Example: high-speed off-road driving [Pan et al, RSS 18, Best System Paper]

    Goal: learn a racing control policy that maps from data on cheap on-board sensors (raw-pixel images) to low-level control (steer and throttle)

    Steering + throttle

    Their Setup:

    At training time, we have expensive sensors for accurate state estimation

    and computational resources for MPC (i.e., high-frequency replanning)

    The MPC is the expert in this case!

  • Analysis of DAgger

    First let’s do a quick introduction of online no-regret learning

  • Online Learning



    convex Decision set 𝒳

    Learner picks a decision x0

    Adversary picks a loss ℓ0 : 𝒳 → ℝ

    Learner picks a new decision x1

    Adversary picks a loss ℓ1 : 𝒳 → ℝ

    Regret $= \sum_{t=0}^{T-1} \ell_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=0}^{T-1} \ell_t(x)$

  • A no-regret algorithm: Follow-the-Leader

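    At time step $t$, the learner has seen $\ell_0, \dots, \ell_{t-1}$: which new decision should she pick?

    FTL: $x_t = \arg\min_{x \in \mathcal{X}} \sum_{i=0}^{t-1} \ell_i(x)$

    Theorem (FTL): if $\mathcal{X}$ is convex and $\ell_t$ is strongly convex for all $t$, then the regret of FTL satisfies:

    $\frac{1}{T} \left[ \sum_{t=0}^{T-1} \ell_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=0}^{T-1} \ell_t(x) \right] = O\left( \frac{\log(T)}{T} \right)$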
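
    As a sanity check on the Follow-the-Leader rule, here is a minimal Python sketch under illustrative assumptions that are not taken from the slides: the decision set is 𝒳 = [0, 1], the losses are ℓt(x) = (x − zt)² with targets zt ∈ [0, 1], and the first decision is an arbitrary x0 = 0.5. For these losses the FTL decision is simply the running mean of the targets seen so far. The function name `ftl_average_regret` is hypothetical.

```python
import math
import random


def ftl_average_regret(targets):
    """Run Follow-the-Leader on losses l_t(x) = (x - z_t)^2 over X = [0, 1].

    `targets` is a non-empty list of z_t values in [0, 1].
    Returns the average regret (1/T) * [sum_t l_t(x_t) - min_x sum_t l_t(x)].
    """
    T = len(targets)
    x = 0.5             # x_0: arbitrary first decision, no losses seen yet
    running_sum = 0.0   # sum of the targets revealed so far
    learner_loss = 0.0
    for t, z in enumerate(targets):
        # The loss l_t is revealed only after x_t has been chosen.
        learner_loss += (x - z) ** 2
        running_sum += z
        # FTL step: the minimizer of sum_{i<=t} (x - z_i)^2 is the running
        # mean, which lies in [0, 1] because every z_i does.
        x = running_sum / (t + 1)
    best_fixed = running_sum / T  # best single decision in hindsight
    best_loss = sum((best_fixed - z) ** 2 for z in targets)
    return (learner_loss - best_loss) / T


if __name__ == "__main__":
    rng = random.Random(0)
    for T in (10, 100, 1000, 10000):
        zs = [rng.random() for _ in range(T)]
        # Compare the measured average regret with log(T)/T.
        print(T, ftl_average_regret(zs), math.log(T) / T)
```

    For this particular loss family the total regret can be shown to be at most 1 + ln T, so the average regret shrinks at the log(T)/T rate the theorem describes.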




  • DAgger Revisit

    At iteration n:

    New data: states visited by the current policy, labeled with steering from the expert

    Aggregate dataset: new data + all previous data

    Supervised learning on the aggregate dataset → new policy πn

    $L_n(\pi) = \sum \| \pi(x) - y \|_2^2$, summed over the pairs $(x, y)$ of state and expert steering collected at iteration $n$

    Data Aggregation = Follow-the-Leader Online Learner
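
    The aggregation loop can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: a 1-D state, toy dynamics, a hypothetical linear expert `expert(x) = -0.5 * x`, and a linear policy class π(x) = w·x fit by least squares. Unlike the slides, the sketch starts from a do-nothing policy rather than from an expert demonstration.

```python
import random


def expert(x):
    """Hypothetical expert: steer back toward x = 0."""
    return -0.5 * x


def rollout(w, horizon, rng):
    """Run the linear policy a = w * x and return the states it visits."""
    x = rng.uniform(-1.0, 1.0)
    states = []
    for _ in range(horizon):
        states.append(x)
        # Toy dynamics (an assumption): the action shifts the state, plus noise.
        x = x + w * x + rng.gauss(0.0, 0.1)
    return states


def fit(dataset):
    """Least squares for pi(x) = w * x: argmin_w sum (w * x - y)^2."""
    sxx = sum(x * x for x, _ in dataset)
    sxy = sum(x * y for x, y in dataset)
    return sxy / sxx if sxx > 0 else 0.0


def dagger(iterations=5, horizon=20, seed=0):
    rng = random.Random(seed)
    dataset = []  # aggregate dataset of (state, expert action) pairs
    w = 0.0       # initial policy: do nothing
    for _ in range(iterations):
        states = rollout(w, horizon, rng)             # states from the learned policy
        dataset += [(x, expert(x)) for x in states]   # labels queried from the expert
        w = fit(dataset)                              # supervised learning on ALL data so far
    return w, dataset
```

    Refitting on the whole aggregate dataset is what makes each step a Follow-the-Leader update: the new policy minimizes the sum of all per-iteration losses seen so far. In this toy the expert is itself linear, so the fit recovers it after the first iteration; the point of the sketch is the loop structure, not the difficulty of the task.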

  • DAgger Analysis: A reduction to no-regret online learning

    Finite-horizon episodic MDP; assume a discrete action space

    Decision set $\Pi := \{\pi : S \mapsto A\}$ (restricted policy class; $\pi^\star$ may not be inside $\Pi$)

    Online learning loss at iteration $t$: $\ell_t(\pi) = \mathbb{E}_{s \sim d^{\pi_t}} [\ell(\pi(s), \pi^\star(s))]$

    (Here $\ell$ could be any convex surrogate loss for classification, e.g., hinge loss)

    DAgger is equivalent to FTL, i.e., $\pi_{t+1} = \arg\min_{\pi \in \Pi} \sum_{i=0}^{t} \ell_i(\pi)$

    If the online learning procedure ensures no-regret, then

    $\frac{1}{T} \left[ \sum_{t=0}^{T-1} \ell_t(\pi_t) - \min_{\pi \in \Pi} \sum_{t=0}^{T-1} \ell_t(\pi) \right] = o(T)/T$

  • DAgger Analysis: A reduction to no-regret online learning

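    Online learning loss at iteration $t$: $\ell_t(\pi) = \mathbb{E}_{s \sim d^{\pi_t}} [\ell(\pi(s), \pi^\star(s))]$

    No-regret: $\sum_{t=0}^{T-1} \ell_t(\pi_t) - \min_{\pi \in \Pi} \sum_{t=0}^{T-1} \ell_t(\pi) = o(T)$

    Dividing by $T$: $\frac{1}{T} \sum_{t=0}^{T-1} \ell_t(\pi_t) = \epsilon_{\text{avg-reg}} + \epsilon_\Pi$, where $\epsilon_{\text{avg-reg}} := o(T)/T$ and $\epsilon_\Pi := \min_{\pi \in \Pi} \frac{1}{T} \sum_{t=0}^{T-1} \ell_t(\pi)$

    Since the minimum is at most the average, $\exists\, \hat{t} \in \{0, \dots, T-1\}$ such that: $\ell_{\hat{t}}(\pi_{\hat{t}}) \le \epsilon_{\text{avg-reg}} + \epsilon_\Pi$

    Under the assumption that the surrogate loss upper bounds the zero-one loss:

    $\mathbb{E}_{s \sim d^{\pi_{\hat{t}}}} [\mathbf{1}\{\pi_{\hat{t}}(s) \neq \pi^\star(s)\}] \le \mathbb{E}_{s \sim d^{\pi_{\hat{t}}}} [\ell(\pi_{\hat{t}}(s), \pi^\star(s))] \le \epsilon_{\text{avg-reg}} + \epsilon_\Pi$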
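
    To make the upper-bound assumption concrete, here is a minimal Python sketch for the binary case. The encoding is an assumption for illustration only: actions are labels in {−1, +1}, the policy outputs a real-valued score, and a score of exactly 0 counts as a mistake. The slides only name hinge loss as one possible surrogate; the function names below are hypothetical.

```python
def zero_one(score, label):
    """0-1 loss for a label in {-1, +1}; a tie (score * label == 0) counts as an error."""
    return 1.0 if score * label <= 0 else 0.0


def hinge(score, label):
    """Hinge loss max(0, 1 - score * label), a convex surrogate for the 0-1 loss."""
    return max(0.0, 1.0 - score * label)


def hinge_dominates(scores, labels=(-1, 1)):
    """Check pointwise that hinge >= zero_one on a grid of scores and labels."""
    return all(hinge(s, y) >= zero_one(s, y) for s in scores for y in labels)
```

    Whenever the prediction is wrong, the product score · label is at most 0, so the hinge loss is at least 1; this is why its expectation upper bounds the classification error.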

  • DAgger Analysis: A reduction to no-regret online learning

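    $\mathbb{E}_{s \sim d^{\pi_{\hat{t}}}} [\mathbf{1}\{\pi_{\hat{t}}(s) \neq \pi^\star(s)\}] \le \mathbb{E}_{s \sim d^{\pi_{\hat{t}}}} [\ell(\pi_{\hat{t}}(s), \pi^\star(s))] \le \epsilon_{\text{avg-reg}} + \epsilon_\Pi$

    $\pi_{\hat{t}}$ can predict $\pi^\star$ well under its own state distribution

    Let’s turn this into the true performance under the cost function $c(s, a)$

    $V^{\pi_{\hat{t}}} - V^{\pi^\star} = \frac{1}{1 - \gamma} \mathbb{E}_{s \sim d^{\pi_{\hat{t}}}} [A^\star(s, \pi_{\hat{t}}(s))]$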
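
    One way to finish the argument, sketched under the assumption that $\pi^\star$ is deterministic (so $A^\star(s, \pi^\star(s)) = 0$) and writing $\pi_{\hat{t}}$ for the returned policy: the advantage can only be nonzero on states where the learner disagrees with the expert, so the performance gap is controlled by the classification error bounded above.

```latex
\begin{aligned}
V^{\pi_{\hat{t}}} - V^{\pi^\star}
  &= \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_{\hat{t}}}}
     \left[ A^\star(s, \pi_{\hat{t}}(s)) \right] \\
  &= \frac{1}{1-\gamma}\, \mathbb{E}_{s \sim d^{\pi_{\hat{t}}}}
     \left[ A^\star(s, \pi_{\hat{t}}(s))\, \mathbf{1}\{\pi_{\hat{t}}(s) \neq \pi^\star(s)\} \right] \\
  &\le \frac{\sup_{s,a} |A^\star(s,a)|}{1-\gamma}\,
     \mathbb{E}_{s \sim d^{\pi_{\hat{t}}}}
     \left[ \mathbf{1}\{\pi_{\hat{t}}(s) \neq \pi^\star(s)\} \right] \\
  &\le \frac{\sup_{s,a} |A^\star(s,a)|}{1-\gamma}\,
     \left( \epsilon_{\text{avg-reg}} + \epsilon_\Pi \right).
\end{aligned}
```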

  • Summary of Imitation Learning

    3. DAgger w/ Interactive Experts:

    Performance gap $\approx \frac{\sup_{s,a} |A^\star(s, a)|}{1 - \gamma} \cdot (\text{classification error})$