Page 1:

Counterfactuals and RL
Emma Brunskill
RLDM 2019 Tutorial
Assistant Professor, Computer Science, Stanford

Thanks to Christoph Dann, Andrea Zanette, Phil Thomas, and Xinkun Nie for some figures

Page 2:

A Brief Tale of 2 Hamburgers

1/4 vs. 1/3

Page 3:

Page 4:

Took > 30s

Took <= 30s

Page 5:

Given ~11k Learners’ Trajectories
With Random Action (Levels)

Goal: Learn a New Policy to Maximize Student Persistence

Page 6:

Parallel Legacy of “RL” to Benefit People

https://web.stanford.edu/group/cslipublications/cslipublications/SuppesCorpus/Professional%20Photos/album/1960s/slides/5.html

Page 7:

● Simulator of domain
● Enormous data to train
● Can always try out a new strategy in domain

vs

Page 8:

● Simulator of domain
● Enormous data to train
● Can always try out a new strategy in domain

● No good simulator of human physiology, behavior & learning

● Gathering real data involves impacting real people

vs

Page 9:

Techniques to Minimize & Understand Data Needed to Learn to Make Good Decisions

And if we can learn to make good decisions faster, we can benefit more people

Page 10:

Background: Markov Decision Process

Page 11:

Background: Markov Decision Process

Page 12:

Background: Markov Decision Process

Page 13:

Background: Markov Decision Process Value Function

Page 14:

Background: Markov Decision Process Value Function
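The value-function equations on these slides did not survive extraction. As a minimal sketch of what a value function is, the snippet below runs iterative policy evaluation on a hypothetical two-state MDP (all numbers are illustrative, not from the tutorial), repeatedly applying the Bellman expectation backup until V reaches its fixed point:

```python
import numpy as np

# Hypothetical 2-state MDP (illustrative numbers).
# P[s, s'] is the transition matrix under the fixed policy being evaluated,
# R[s] the expected immediate reward in state s, gamma the discount factor.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
R = np.array([1.0, 0.0])
gamma = 0.9

# Iterative policy evaluation: apply the Bellman expectation backup
# V <- R + gamma * P V until it converges to V^pi.
V = np.zeros(2)
for _ in range(1000):
    V = R + gamma * P @ V
```

At convergence V satisfies the Bellman equation V = R + γPV, i.e. it matches the closed-form solution (I − γP)⁻¹R.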

Page 15:

Background: Reinforcement Learning

Only observed through samples (experience)

Page 16:

Today: Counterfactual / Batch RL

Page 17:

“What If?” Reasoning Given Past Data

Outcome: 91

Outcome: 92

Outcome: 85

?

Page 18:

Data Is Censored

Outcome: 91

Outcome: 92

Outcome: 85

Page 19:

Need for Generalization

Outcome: 91

Outcome: 92

Outcome: 85

?

Page 20:

Growing Interest in Causal Inference & ML

Page 21:

Batch Policy Optimization: Find a Good Policy That Will Perform Well in the Future

Page 22:

Batch Policy Optimization: Find a Good Policy That Will Perform Well in the Future

Page 23:

Batch Policy Optimization: Find a Good Policy That Will Perform Well in the Future

Page 24:

Batch Policy Optimization: Find a Good Policy That Will Perform Well in the Future

● Today will not be a comprehensive overview, but will instead highlight some of the challenges involved & some approaches with desirable statistical properties: convergence, sample efficiency & bounds

Page 25:

Page 26:


Substantial Literature Focuses on 1 Binary Decision: Treatment Effect Estimation from Old Data

Page 27:

Challenge: Covariate Shift
Different Policies → Different Actions → Different State Distributions

IS allows us to reweight data to make it look as if it came from the original distribution

Gottesman et al. Guidelines for reinforcement learning in healthcare. Nature Medicine 2019. Figure by Debbie Maizels/Springer Nature

Page 28:

Policy Evaluation

1. Model based
2. Model free
3. Importance sampling
4. Doubly robust

Page 29:

Learn Dynamics and Reward Models from Data, Evaluate Policy

● (Mannor, Simester, Sun, Tsitsiklis 2007)

Page 30:

Model Free Value Function Approximation

● Fitted Q iteration, DQN, LSTD, ...

Page 31:

Counterfactual Reasoning for Policy Evaluation*

Parametric Models
of dynamics, rewards or values fit to data

+ Low variance
− Bias (unless realizable)

Page 32:

Importance Sampling Refresher
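The refresher formulas were lost in extraction; the core identity is E_p[f(x)] = E_q[(p(x)/q(x)) f(x)], so samples drawn from one distribution q can be reweighted to estimate an expectation under another distribution p. A minimal sketch with illustrative Gaussian choices (p = N(1, 1), q = N(0, 1); these are not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative densities: target p = N(1, 1), proposal q = N(0, 1).
def p_pdf(x):
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

# Draw from q, then reweight by p/q to estimate E_p[x] (which is 1 here).
x = rng.normal(0.0, 1.0, size=200_000)
w = p_pdf(x) / q_pdf(x)          # importance weights
est = float(np.mean(w * x))      # unbiased IS estimate of E_p[x]
```

Because both densities are normalized, the weights themselves average to 1 in expectation, which is a useful sanity check in practice.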

Page 33:

Importance Sampling for RL Policy Evaluation

Page 34:

Importance Sampling for RL Policy Evaluation

Page 35:

Importance Sampling for RL Policy Evaluation

● First used for RL by Precup, Sutton & Singh 2000. Recent work includes: Thomas, Theocharous, Ghavamzadeh 2015; Thomas and Brunskill 2017; Guo, Thomas, Brunskill 2017; Hanna, Niekum, Stone 2019
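As a hedged sketch of the per-trajectory estimator these works build on: weight each logged trajectory by the product of likelihood ratios π_e(a_t|s_t)/π_b(a_t|s_t) over its steps and average the weighted returns. The one-state environment, policies, and rewards below are illustrative stand-ins, not from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(1)

def trajectory_is(trajectories, pi_e, pi_b, gamma=1.0):
    """Per-trajectory (ordinary) importance sampling estimate of pi_e's value
    from trajectories logged under behavior policy pi_b."""
    estimates = []
    for traj in trajectories:
        weight, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            weight *= pi_e(a, s) / pi_b(a, s)  # product of likelihood ratios
            ret += gamma ** t * r              # discounted return
        estimates.append(weight * ret)
    return float(np.mean(estimates))

# Illustrative setup: one dummy state, two actions, horizon 2.
# Behavior policy is uniform; the evaluation policy always picks action 0,
# which earns reward 1 per step, so its true value is 2.
pi_b = lambda a, s: 0.5
pi_e = lambda a, s: 1.0 if a == 0 else 0.0
trajectories = []
for _ in range(20_000):
    traj = []
    for t in range(2):
        a = int(rng.integers(0, 2))
        traj.append((0, a, 1.0 if a == 0 else 0.0))
    trajectories.append(traj)

v_hat = trajectory_is(trajectories, pi_e, pi_b)
```

Note how most trajectories get weight 0 here (any off-policy action zeroes the product), which is exactly the variance problem the later slides address.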

Page 36:

Stationary Importance Sampling (SIS) for RL Policy Evaluation

● Can be approximated and used as part of Q-learning style update● Hallak & Mannor 2017; Liu, Li, Tang, & Zhou 2018; Gelada & Bellemare 2019

Page 37:

Counterfactual Reasoning for Policy Evaluation

Parametric Models
of dynamics, rewards or values fit to data

Importance Sampling
corrects mismatch of state-action distribution

+ Low variance
− Bias (unless realizable)

+ Unbiased under certain assumptions
− High variance

Page 38:

Doubly Robust (DR) Estimation

• Model + IS-based estimator
• Bandits (Dudik et al. 2011):

V_DR = r̂(s, π(s)) + ρ (r − r̂(s, a)), where r is the reward received, r̂ the model of reward, and ρ the importance weight
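A minimal sketch of the bandit DR estimator in the spirit of Dudik et al. 2011: the reward model supplies a baseline, and the IS term corrects the model's bias using the rewards actually received. The logging policy, reward model, and noise below are all illustrative choices (the model is deliberately wrong, to show the correction at work):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative one-context bandit: two actions with true mean rewards [1, 0],
# uniform logging policy mu, evaluation policy pi that always picks action 0,
# and a deliberately biased reward model r_hat.
true_r = np.array([1.0, 0.0])
mu = np.array([0.5, 0.5])
pi = np.array([1.0, 0.0])
r_hat = np.array([0.7, 0.3])

n = 50_000
a = rng.integers(0, 2, size=n)                     # logged actions from mu
r = true_r[a] + rng.normal(0.0, 0.1, size=n)       # observed noisy rewards

rho = pi[a] / mu[a]                                # importance weights
model_term = float(pi @ r_hat)                     # model's value of pi (biased: 0.7)
correction = float(np.mean(rho * (r - r_hat[a])))  # IS correction of the model's bias
v_dr = model_term + correction                     # doubly robust estimate
```

The model alone would report 0.7, but the IS correction recovers the true value of 1.0; conversely, if the weights were wrong but the model right, the correction term would vanish in expectation, which is the "doubly robust" property.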

Page 39:

Doubly Robust (DR) Estimation

• Model + IS-based estimator
• Bandits (Dudik et al. 2011):

V_DR = r̂(s, π(s)) + ρ (r − r̂(s, a)), where r is the reward received, r̂ the model of reward, and ρ the importance weight

Page 40:

Doubly Robust (DR) Estimation

• Model + IS-based estimator
• Bandits (Dudik et al. 2011):

V_DR = r̂(s, π(s)) + ρ (r − r̂(s, a)), where r is the reward received, r̂ the model of reward, and ρ the importance weight

Page 41:

Doubly Robust Estimation for RL

• Jiang and Li (ICML 2016) extended DR to RL

V_DR^t = V̂(s_t) + ρ_t (r_t + γ V_DR^{t+1} − Q̂(s_t, a_t))

where Q̂ and V̂ are model-based estimates of Q and V, r_t are the actual rewards in the dataset, and ρ_t are the importance weights

Page 42:

Doubly Robust Estimation for RL

• Jiang and Li (ICML 2016) extended DR to RL

• Limitation: estimator is derived to be unbiased (rather than to directly minimize error)

V_DR^t = V̂(s_t) + ρ_t (r_t + γ V_DR^{t+1} − Q̂(s_t, a_t))

where Q̂ and V̂ are model-based estimates of Q and V, r_t are the actual rewards in the dataset, and ρ_t are the importance weights

Page 43:

Instead Prioritize Accuracy & Measure with Mean Squared Error

Thomas and Brunskill, ICML 2016

• Trade bias and variance (MSE = Bias² + Variance)

Model-based estimator + Importance sampling estimator

Page 44:

Two New Off Policy Evaluation Estimators

1. Weighted doubly robust for RL problems
a. Weighted importance sampling often has much lower variance

Page 45:

Two New Off Policy Evaluation Estimators

1. Weighted doubly robust for RL problems
a. Weighted importance sampling often has much lower variance
b. WDR: doubly robust, just use normalized weights!
c. Empirically can give much better estimates
d. Still has good properties (strongly consistent)
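A hedged sketch of the normalized-weights idea in (b): weighted (self-normalized) importance sampling divides by the sum of the weights rather than the number of trajectories, trading a small bias for much lower variance. The logged data below is an illustrative construction, not from the tutorial:

```python
import numpy as np

rng = np.random.default_rng(3)

def ordinary_is(weights, returns):
    # Unbiased: divide the weighted returns by the number of trajectories.
    return float(np.mean(weights * returns))

def weighted_is(weights, returns):
    # Self-normalized: divide by the total weight instead; slightly biased
    # but typically much lower variance.
    return float(np.sum(weights * returns) / np.sum(weights))

# Illustrative logged data: uniform behavior over two actions, evaluation
# policy always picks action 0, whose return is about 1 (true value = 1).
a = rng.integers(0, 2, size=10_000)
w = np.where(a == 0, 2.0, 0.0)   # per-trajectory likelihood ratios
g = np.where(a == 0, 1.0, 0.0) + rng.normal(0.0, 0.1, size=10_000)

v_ois = ordinary_is(w, g)
v_wis = weighted_is(w, g)
```

Both estimates land near the true value of 1, but the self-normalized one concentrates much more tightly, since it is a weighted average of observed returns and so can never leave their range.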

Page 46:

Price of Robustness?

Le, Voloshin, Yue (2019)

Page 47:

Two New Off Policy Evaluation Estimators

1. Weighted doubly robust for RL problems
2. Model And Guided Importance sampling Combining estimator
a. Directly try to minimize mean squared error by balancing between value and importance sampling estimates
b. Mean squared error is a function of bias and variance

Page 48:

Blend IS-Based & Model Based Estimators to Directly Min Mean Squared Error

(Figure: bias/variance trade-off across the 1-step, 2-step, …, N-step estimates, blended with weights x_1, x_2, …, x_N.)

Thomas and Brunskill, ICML 2016

Page 49:

Model and Guided Importance Sampling combining (MAGIC) Estimator

Estimated policy value using a particular weighting of the model estimate and importance sampling estimate

Thomas and Brunskill, ICML 2016

• Solve quadratic program
• Strongly consistent (under a similar set of assumptions as WDR)
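MAGIC itself blends a whole spectrum of partial-horizon estimators via a quadratic program; as a simplified illustration of the underlying idea only, the sketch below blends just two estimators (an unbiased but noisy IS estimate and a low-variance but biased model estimate) by minimizing an estimated MSE. The bias proxy used here is a crude stand-in, not the paper's construction:

```python
import numpy as np

def blend_weight(bias_sq, var_is):
    # MSE of the combination x*IS + (1-x)*model is
    # (1-x)^2 * bias_sq + x^2 * var_is (treating the model estimate as
    # near-zero variance); minimizing over x in closed form gives:
    if bias_sq + var_is == 0.0:
        return 0.0  # the estimators agree exactly; either is fine
    return bias_sq / (bias_sq + var_is)

def blend_two(is_estimates, model_estimate):
    """Blend per-trajectory IS estimates with a scalar model estimate."""
    is_mean = float(np.mean(is_estimates))
    var_is = float(np.var(is_estimates)) / len(is_estimates)
    # Crude bias proxy: squared gap between the two estimates.
    bias_sq = (is_mean - model_estimate) ** 2
    x = blend_weight(bias_sq, var_is)
    return x * is_mean + (1.0 - x) * model_estimate
```

The closed-form weight behaves as expected at the extremes: with zero model bias it puts all weight on the model, and with a noiseless IS estimate it puts all weight on IS.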

Page 50:

Estimating Bias & Covariance

• Estimated covariance: sample covariance matrix

• Estimated bias:
– May be as hard as estimating the true policy value

(Figure: bias estimated from the gap between the importance sampling estimate and the model based estimate.)

Thomas and Brunskill, ICML 2016

Page 51:

Gridworld Simulation: Needed Only 10% of the Data to Learn a Good Estimate of New Policy’s Value

(Plot: estimation error vs. number of histories for the IS-based, Model, DR, MAGIC, and MAGIC-B estimators.) Thomas and Brunskill 2016

Page 52:

Gridworld Simulation: Needed Only 10% of the Data to Learn a Good Estimate of New Policy’s Value

(Plot: estimation error vs. number of histories for the IS-based, Model, DR, MAGIC, and MAGIC-B estimators.) Thomas and Brunskill 2016

Page 53:

Sepsis treatment example

(Gottesman et al. arxiv 2018)

● Actions: IV fluids & vasopressors
● Reward: +100 survival, −100 death
● State space: 750 (discretized)
● 19,275 ICU patients

WDR in Health Example

Page 54:

Sepsis treatment example

(Gottesman et al. arxiv 2018)

● Actions: IV fluids & vasopressors
● Reward: +100 survival, −100 death
● State space: 750 (discretized)
● 19,275 ICU patients

Our weighted DR (WDR) was the only consistent off-policy estimator tried (PDDR, PDIS, WPDIS, WDR) that could find an optimal policy estimated to improve over the prior policy

WDR in Health Example

Page 55:

Sepsis treatment example

(Gottesman et al. arxiv 2018)

● Actions: IV fluids & vasopressors
● Reward: +100 survival, −100 death
● State space: 750 (discretized)
● 19,275 ICU patients

Our weighted DR (WDR) was the only consistent off-policy estimator tried (PDDR, PDIS, WPDIS, WDR) that could find an optimal policy estimated to improve over the prior policy

Under the (common) assumption of no confounding, which is not likely to hold in practice

WDR in Health Example

Page 56:

Policy Evaluation

1. Model based
2. Model free
3. Importance sampling
4. Doubly robust

Page 57:

Policy Optimization: Find Good Policy to Deploy

Page 58:

Learn Dynamics and Reward Models from Data, Plan

Page 59:

Mandel, Liu, Brunskill, Popovic 2014

Better Dynamics/Reward Models for Existing Data, May Not Lead to Better Policies for Future Use

Page 60:

Importance Sampling Estimators Unbiased for Policy Evaluation

Page 61:

Importance Sampling Estimators Unbiased for Policy Evaluation

• But using them for policy selection can lead to poor results

Page 62:

Fairness of Importance Sampling-Based Estimators for Policy Selection

62

• Unfortunately even if IS estimates are unbiased, policy selection using them can be unfair

• Here define unfair as:
– Given two policies π1 and π2
– Where true performance V(π1) > V(π2)
– Choose π2 more than 50% of the time

Doroudi, Thomas and Brunskill, Best Paper, UAI 2017

Page 63:

(Plot: value estimates for Policy 1 and Policy 2, with differing variances.)

Max over Estimates with Differing Variances

Doroudi, Thomas and Brunskill, Best Paper, UAI 2017

Page 64:

Importance Sampling Favors Myopic Policies

Doroudi, Thomas and Brunskill, Best Paper, UAI 2017

Page 65:

Quest for Batch Policy Optimization with Generalization Guarantees→ Structural Risk Minimization (SRM) for Reinforcement Learning

Page 66:

Quest for Batch Policy Optimization with Generalization Guarantees→ SRM for Reinforcement Learning

Page 67:

Challenge: Good Error Bound Analysis

Page 68:

Challenge: Good Error Bound Analysis

● Importance sampling bounds (e.g. Thomas et al, 2015) ignore hypothesis class structure & typically require very large n

Page 69:

Challenge: Good Error Bound Analysis

● Importance sampling bounds (e.g. Thomas et al, 2015) ignore hypothesis class structure & typically require very large n

● Kernel function & averager approaches (e.g. Ormoneit & Sen 2002) can need # samples exponential in input state dimension

Page 70:

Challenge: Good Error Bound Analysis

● Importance sampling bounds (e.g. Thomas et al, 2015) ignore hypothesis class structure & typically require very large n

● Kernel function & averager approaches (e.g. Ormoneit & Sen 2002) can need # samples exponential in input state dimension

● FQI bounds (e.g. Munos 2003; Munos & Szepesvári 2008; Antos et al., 2008; Lazaric et al., 2012; Farahmand et al., 2009; Maillard et al., 2010; Le, Voloshin, Yue 2019; Chen & Jiang 2019)

- Require stronger assumptions (realizability and bounds on the inherent Bellman error)

- If not realizable, FQI bounds depend on unknown quantities

Page 71:

Challenge: Good Error Bound Analysis

● Importance sampling bounds (e.g. Thomas et al, 2015)
● Kernel function (e.g. Ormoneit & Sen 2002)
● FQI bounds (e.g. Munos 2003; Munos & Szepesvári 2008; Antos et al., 2008; Lazaric et al., 2012; Farahmand et al., 2009; Maillard et al., 2010; Le, Voloshin, Yue 2019; Chen & Jiang 2019)
- Require stronger assumptions (realizability and bounds on the inherent Bellman error)
- If not realizable, FQI bounds depend on unknown quantities
● Primal dual approaches (e.g. Dai, Shaw, Li, Xiao, He, Liu, Chen, Song 2018) are promising and have similar dependencies

Page 72:

Aim: Strong Generalization Guarantees on Policy Performance, Alternative: Find Good in Class Policy Given Past Data

Page 73:

Direct Batch Policy Search & Optimization
● Despite popularity, relatively little success in direct policy optimization using offline / batch data
● Correcting for the mismatch in state distributions can yield high variance (“alternative life” from Sutton / White terminology)
● Algorithmically often just correct for 1 step (e.g. Degris, White, & Sutton 2012)

Page 74:

Off-Policy Policy Gradient with State Distribution Correction
● Leverage Markov structure idea of stationary importance sampling for RL

Monday RLDM Poster 114 & Liu, Swaminathan, Agarwal, B UAI 2019

Page 75:

First Result that Can Provably Converge to Local Solution with Off Policy Batch Policy Gradient

Monday RLDM Poster 114 & Liu, Swaminathan, Agarwal, B UAI 2019

Page 76:

Aim: Strong Generalization Guarantees on Policy Performance, Alternative: Guarantee Find Best in Class Policy

Page 77:

1st Guarantees on Performance of Policy Chosen vs. Best in Class for When-to-Treat Policies (w/ Xinkun Nie & Stefan Wager, arxiv)

Page 78:

Starting HIV treatment as soon as CD4 count dips below 200

Example: Linear Thresholding Policies

Source: https://alv.mizoapp.com/cd4count/

(Figure: CD4 count declining over 2-10 years of HIV infection; treatment starts when the count dips below 200, the policy parameter.)

Page 79:

Stopping treatment as soon as health metric above line

Starting HIV treatment as soon as CD4 count dips below 200

Example: Linear Thresholding Policies

Source: https://alv.mizoapp.com/cd4count/

(Figure: CD4 count over 2-10 years of HIV infection; treatment switches on below the 200 threshold and off above the policy line, the policy parameter.)

Page 80:

Selecting a When to Treat Policy

Page 81:

Use an Advantage Decomposition
(advantage relative to never acting)

Page 82:

● Estimate treatment effect with a doubly robust estimator given available dataset D
● Can learn “nuisance” parameters (propensity weights and value function estimates) at a slower rate and still get sqrt(n) regret bounds, under various assumptions
● Insights from orthogonal / double machine learning ideas from econometrics

Use a Doubly Robust Advantage Decomposition
(advantage relative to never acting)

Page 83:

Simulation: keeping a health metric above 0
● Metric evolves with Brownian motion; treatment nudges it up, but at a cost
● Always start with treatment ON
● Optimal stopping time of treatment?
● Unknown propensity
● Linear decision rules (#covariates = 2)
● Observe states + noise

Page 84:

Fitted Q Iteration Policy Less Interpretable

Page 85:

Quest for Batch Policy Optimization with Generalization Guarantees→ SRM for Reinforcement Learning

& many colleagues’ work (Murphy, Jiang, Yue, Munos, Lazaric, Szepesvari…)
→ Much to be done, including relaxing common assumptions

AAMAS 2014, AAAI 2015, AAAI 2016, ICML 2016, IJCAI 2016, AAAI 2017, NeurIPS 2017, NeurIPS 2018, ICML 2019

AAAI 2015, AAAI 2016, L@S 2017, UAI 2017, UAI 2019, (Nie, B, Wager arxiv)

Nie, B, Wager arxiv; Thomas, da Silva, Barto, B, arxiv

Page 86:

Power of Models for Off Policy Evaluation?

● Model based approaches can be provably more efficient than model-free value function approaches for online evaluation or control

Sun, Jiang, Krishnamurthy, Agarwal, Langford COLT 2019

Tu & Recht COLT 2019

Page 87:

Models Fit for Off Policy Evaluation May Benefit from Different Loss Function

Liu, Gottesman, Raghu, Komorowski, Faisal, Doshi-Velez, Brunskill NeurIPS 2018

Page 88:

Given ~11k Learners’ Trajectories
With Random Action (Levels)

Learn a Policy that Increases Student Persistence

(Mandel, Liu, B, Popovic 2014)

Page 89:

Given ~11k Learners’ Trajectories
With Random Action (Levels)

Learned a Policy that Increased Student Persistence by +30%

(Mandel, Liu, B, Popovic 2014)