
Counterfactuals and RL
Emma Brunskill

RLDM 2019 Tutorial
Assistant Professor, Computer Science, Stanford

Thanks to Christoph Dann, Andrea Zanette, Phil Thomas, and Xinkun Nie for some figures

A Brief Tale of 2 Hamburgers

[Slide figure: 1/4 vs 1/3; responses that took > 30s vs <= 30s]

Given ~11k Learners’ Trajectories With Random Action (Levels)

Goal: Learn a New Policy to Maximize Student Persistence

Parallel Legacy of “RL” to Benefit People

https://web.stanford.edu/group/cslipublications/cslipublications/SuppesCorpus/Professional%20Photos/album/1960s/slides/5.html

● Simulator of domain
● Enormous data to train
● Can always try out a new strategy in domain

vs

● No good simulator of human physiology, behavior & learning
● Gathering real data involves impacting real people

Techniques to Minimize & Understand Data Needed to Learn to Make Good Decisions

And if we can learn to make good decisions faster, we can benefit more people

Background: Markov Decision Process

Background: Markov Decision Process Value Function
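
The equations on these background slides were images and did not survive transcription. For reference, a standard statement (not necessarily the slide's exact notation): an MDP is a tuple and the value function is the expected discounted return,

\[
M = (S, A, P, R, \gamma), \qquad
V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, s_0 = s,\ a_t \sim \pi(\cdot \mid s_t)\Big], \qquad
Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s').
\]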

Background: Reinforcement Learning

The dynamics and rewards are only observed through samples (experience)

Today: Counterfactual / Batch RL

“What If?” Reasoning Given Past Data

[Figure: actions tried with observed outcomes 91, 92, and 85; the outcome of an untried action is unknown (“?”)]

Data Is Censored


Need for Generalization


Growing Interest in Causal Inference & ML

Batch Policy Optimization: Find a Good Policy That Will Perform Well in the Future


● Today will not be a comprehensive overview; instead it highlights some of the challenges involved & some approaches with desirable statistical properties: convergence, sample efficiency & bounds


Substantial Literature Focuses on 1 Binary Decision: Treatment Effect Estimation from Old Data

Challenge: Covariate Shift
Different Policies → Different Actions → Different State Distributions

Importance sampling (IS) allows us to reweight data gathered under one policy so that it looks as if it came from the distribution of interest

Gottesman et al. Guidelines for reinforcement learning in healthcare. Nature Medicine 2019. Figure by Debbie Maizels/Springer Nature

Policy Evaluation

1. Model based
2. Model free
3. Importance sampling
4. Doubly robust

Learn Dynamics and Reward Models from Data, Evaluate Policy

● (Mannor, Simester, Sun, Tsitsiklis 2007)
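
As a rough illustration of the model-based route on a tabular problem (my own sketch, not the cited analysis): estimate the transition and reward models by counting, then evaluate the target policy under the fitted model.

import numpy as np

def model_based_ope(transitions, n_states, n_actions, pi, gamma=0.95):
    """Tabular model-based off-policy evaluation (illustrative sketch).

    transitions: list of (s, a, r, s_next) tuples logged by the behavior policy.
    pi: array [n_states, n_actions] of target-policy action probabilities.
    Returns the estimated value function of pi under the fitted model.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    rew_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        rew_sum[s, a] += r

    n_sa = counts.sum(axis=2)
    # Unvisited (s, a) pairs default to zero dynamics/reward in this sketch
    P_hat = counts / np.maximum(n_sa, 1)[:, :, None]   # estimated dynamics
    R_hat = rew_sum / np.maximum(n_sa, 1)              # estimated rewards

    # Evaluate pi in the fitted model: solve (I - gamma * P_pi) V = R_pi
    P_pi = np.einsum('sa,sat->st', pi, P_hat)
    R_pi = (pi * R_hat).sum(axis=1)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)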

Model Free Value Function Approximation

● Fitted Q iteration, DQN, LSTD, ...
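
A minimal sketch of fitted Q iteration on a fixed batch, assuming a generic scikit-learn regressor, integer-coded actions, and array-valued states; the regressor choice, feature encoding, and iteration count are illustrative, and terminal-state handling is omitted.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(states, actions, rewards, next_states, n_actions,
                       n_iters=50, gamma=0.99):
    """Fitted Q iteration on a fixed batch of transitions (sketch)."""
    q_model = None
    for _ in range(n_iters):
        if q_model is None:
            targets = rewards                      # first pass: Q ~ immediate reward
        else:
            # Bootstrapped target: r + gamma * max_a' Q(s', a')
            q_next = np.column_stack([
                q_model.predict(np.column_stack([next_states,
                                                 np.full(len(next_states), a)]))
                for a in range(n_actions)
            ])
            targets = rewards + gamma * q_next.max(axis=1)
        q_model = ExtraTreesRegressor(n_estimators=50)
        q_model.fit(np.column_stack([states, actions]), targets)
    return q_model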

Counterfactual Reasoning for Policy Evaluation*

Parametric models of dynamics, rewards or values fit to data:
+ Low variance
− Bias (unless realizable)

Importance Sampling Refresher
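
The refresher itself was an image; the core identity is E_{x∼p}[f(x)] = E_{x∼q}[(p(x)/q(x)) f(x)]. A tiny numerical illustration (my own example, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

# Target distribution p = N(1, 1); sampling ("behavior") distribution q = N(0, 1).
# Goal: estimate E_p[x^2] = 2 using only samples drawn from q.
def p_pdf(x):
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

x = rng.normal(0.0, 1.0, size=100_000)   # samples from q
weights = p_pdf(x) / q_pdf(x)            # importance weights p(x)/q(x)
print(np.mean(weights * x ** 2))         # close to 2.0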

Importance Sampling for RL Policy Evaluation

● First used for RL by Precup, Sutton & Singh 2000. Recent work includes: Thomas, Theocharous, Ghavamzadeh 2015; Thomas and Brunskill 2017; Guo, Thomas, Brunskill 2017; Hanna, Niekum, Stone 2019
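
A minimal implementation of the (per-trajectory) importance sampling estimator for RL policy evaluation, in the spirit of the references above; the data format and function names are my own:

import numpy as np

def trajectory_is_estimate(trajectories, pi_e, gamma=1.0, weighted=False):
    """trajectories: list of episodes, each a list of (s, a, r, mu_prob) tuples,
    where mu_prob is the behavior policy's probability of the logged action.
    pi_e(a, s) returns the evaluation policy's probability of a in s."""
    returns, weights = [], []
    for tau in trajectories:
        rho, G = 1.0, 0.0
        for t, (s, a, r, mu_prob) in enumerate(tau):
            rho *= pi_e(a, s) / mu_prob   # cumulative importance ratio for the episode
            G += gamma ** t * r           # discounted return actually observed
        returns.append(G)
        weights.append(rho)
    returns, weights = np.array(returns), np.array(weights)
    if weighted:                          # weighted (self-normalized) IS: biased, lower variance
        return np.sum(weights * returns) / np.sum(weights)
    return np.mean(weights * returns)     # ordinary IS: unbiased, can have huge variance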

Stationary Importance Sampling (SIS) for RL Policy Evaluation

● Can be approximated and used as part of a Q-learning style update
● Hallak & Mannor 2017; Liu, Li, Tang, & Zhou 2018; Gelada & Bellemare 2019
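
The SIS estimator itself was shown as an image; in the usual notation (a reconstruction, not necessarily the slide's), it reweights each logged sample by the ratio of stationary state-action distributions rather than by a product of per-step action ratios:

\[
\hat V_{\mathrm{SIS}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{d^{\pi}(s_i, a_i)}{d^{\mu}(s_i, a_i)}\, r_i
\;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{d^{\pi}(s_i)}{d^{\mu}(s_i)}\,\frac{\pi(a_i \mid s_i)}{\mu(a_i \mid s_i)}\, r_i ,
\]

where d^π and d^μ are the stationary (or discounted occupancy) distributions under the evaluation and behavior policies; the references above study how to approximate the ratio d^π/d^μ from data.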

Counterfactual Reasoning for Policy Evaluation

Parametric models of dynamics, rewards or values fit to data:
+ Low variance
− Bias (unless realizable)

Importance sampling to correct the mismatch of state-action distributions:
+ Unbiased under certain assumptions
− High variance

Doubly Robust (DR) Estimation

• Model + IS-based estimator
• Bandits (Dudik et al. 2011)

(Equation annotations on the slide: “reward received”, “model of reward”)
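
In standard notation (a reconstruction, not copied from the slide), the contextual-bandit DR estimator of Dudik et al. 2011 is

\[
\hat V_{\mathrm{DR}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big[\, \sum_{a}\pi(a \mid x_i)\,\hat r(x_i, a) \;+\; \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\,\big(r_i - \hat r(x_i, a_i)\big) \Big],
\]

where r_i is the reward received, \hat r is the model of reward, and the ratio is the importance weight; the estimate is consistent if either the reward model or the behavior probabilities μ are correct.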

Doubly Robust Estimation for RL

• Jiang and Li (ICML 2016) extended DR to RL
• Limitation: the estimator is derived to be unbiased (rather than to directly minimize error)

(Equation annotations on the slide: model-based estimate of Q, actual rewards in the dataset, importance weights, model-based estimate of V)
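
The annotated equation (model-based estimates of Q and V, actual rewards, importance weights) can be written in the recursive form used by Jiang & Li 2016 and Thomas & Brunskill 2016 (standard notation, reconstructed here):

\[
\hat V_{\mathrm{DR}}^{(H)} = 0, \qquad
\hat V_{\mathrm{DR}}^{(t)} \;=\; \hat V(s_t) \;+\; \rho_t\Big( r_t + \gamma\, \hat V_{\mathrm{DR}}^{(t+1)} - \hat Q(s_t, a_t) \Big),
\qquad \rho_t = \frac{\pi_e(a_t \mid s_t)}{\mu(a_t \mid s_t)},
\]

applied backwards from the end of each trajectory; \hat Q and \hat V are the model-based estimates, r_t the actual rewards in the dataset, and ρ_t the per-step importance weights. The DR estimate of the policy's value is the average of \hat V_{\mathrm{DR}}^{(0)} over trajectories.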

Instead Prioritize Accuracy & Measure with Mean Squared Error

Thomas and Brunskill, ICML 2016

• Trade bias and variance

[Diagram: mean squared error combines bias, mainly from the model-based estimator, and variance, mainly from the importance sampling estimator]

Two New Off Policy Evaluation Estimators

1. Weighted doubly robust for RL problems
   a. Weighted importance sampling often has much lower variance
   b. WDR: doubly robust, just use normalized weights!
   c. Empirically can give much better estimates
   d. Still has good properties (strongly consistent)
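
A sketch of one way to compute the WDR estimate (following the per-decision weighted form in Thomas & Brunskill 2016; the data format, episode-length handling, and names are my own simplifications):

import numpy as np

def wdr_estimate(trajectories, pi_e, q_hat, v_hat, gamma=1.0):
    """Weighted doubly robust off-policy value estimate (sketch).

    trajectories: list of episodes of (s, a, r, mu_prob) tuples.
    pi_e(a, s): evaluation policy probability; q_hat(s, a), v_hat(s):
    model-based estimates of Q and V for the evaluation policy."""
    n = len(trajectories)
    H = max(len(tau) for tau in trajectories)

    # Cumulative per-decision importance ratios rho[i, t]
    rho = np.zeros((n, H))
    for i, tau in enumerate(trajectories):
        w = 1.0
        for t, (s, a, r, mu_prob) in enumerate(tau):
            w *= pi_e(a, s) / mu_prob
            rho[i, t] = w
        rho[i, len(tau):] = w                  # carry weight past episode end

    # Normalize across episodes at each step: the "weighted" in WDR
    w_norm = rho / rho.sum(axis=0, keepdims=True)

    estimate = 0.0
    for i, tau in enumerate(trajectories):
        prev_w = 1.0 / n                       # weight before any action is taken
        for t, (s, a, r, mu_prob) in enumerate(tau):
            w_t = w_norm[i, t]
            # weighted reward, minus a model-based control variate
            estimate += gamma ** t * (w_t * r - (w_t * q_hat(s, a) - prev_w * v_hat(s)))
            prev_w = w_t
    return estimate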

Price of Robustness?

Le, Voloshin, Yue (2019)

Two New Off Policy Evaluation Estimators

1. Weighted doubly robust for RL problems
2. Model And Guided Importance sampling Combining (MAGIC) estimator
   a. Directly try to minimize mean squared error by balancing between the value estimate and the importance sampling estimate
   b. Mean squared error is a function of bias and variance

Blend IS-Based & Model Based Estimators to Directly Min Mean Squared Error

[Figure: a spectrum of partial estimators, from a 1-step estimate to 2-step … N-step estimates, blended with weights x_1, x_2, …, x_N to trade off bias and variance]

Thomas and Brunskill, ICML 2016

Model and Guided Importance Sampling combining (MAGIC) Estimator

Estimated policy value uses a particular weighting of the model estimate and the importance sampling estimate

Thomas and Brunskill, ICML 2016

• Solve a quadratic program
• Strongly consistent (under a similar set of assumptions as WDR)
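
Since the slide's equations were images, here is the objective in reconstructed standard notation (following Thomas & Brunskill 2016): letting g be the vector of partial (j-step) estimators, \hat\Omega the estimated covariance matrix, and \hat b the estimated bias vector, MAGIC picks blending weights on the probability simplex Δ by solving a quadratic program and returns the weighted combination,

\[
x^{\star} \;=\; \arg\min_{x \in \Delta}\; x^{\top}\hat\Omega\, x + \big(x^{\top}\hat b\big)^{2},
\qquad
\hat V_{\mathrm{MAGIC}}(\pi) \;=\; (x^{\star})^{\top} g ,
\]

i.e. the weights approximately minimize the estimated mean squared error (variance plus squared bias) of the blended estimator.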

Estimating Bias & Covariance

• Estimated covariance: sample covariance matrix
• Estimated bias: may be as hard as estimating the true policy value

[Figure: bias estimated by comparing the model based estimate against the importance sampling estimate]

Thomas and Brunskill, ICML 2016

Gridworld Simulation: Needed Only 10% of the Data to Learn a Good Estimate of the New Policy’s Value

[Plot: estimation error vs. number of histories for the IS-based, Model, DR, MAGIC, and MAGIC-B estimators]

Thomas and Brunskill 2016

WDR in Health Example

Sepsis treatment example (Gottesman et al. arXiv 2018)

● Actions: IV fluids & vasopressors
● Reward: +100 survival, -100 death
● State space: 750 (discretized)
● 19,275 ICU patients

Our weighted DR (WDR) was the only consistent off-policy estimator tried (PDDR, PDIS, WPDIS, WDR) that could find an optimal policy estimated to improve over the prior

Under the (common) assumption of no confounding, which is not likely to hold in practice

Policy Evaluation

1. Model based
2. Model free
3. Importance sampling
4. Doubly robust

Policy Optimization: Find Good Policy to Deploy

Learn Dynamics and Reward Models from Data, Plan

Mandel, Liu, Brunskill, Popovic 2014

Better Dynamics/Reward Models for Existing Data May Not Lead to Better Policies for Future Use

Importance Sampling Estimators Are Unbiased for Policy Evaluation

• But using them for policy selection can lead to poor results

Fairness of Importance Sampling-Based Estimators for Policy Selection

• Unfortunately, even if IS estimates are unbiased, policy selection using them can be unfair
• Here define unfair as:
  – Given two policies π1 and π2
  – Where true performance V(π1) > V(π2)
  – The estimator chooses π2 more than 50% of the time

Doroudi, Thomas and Brunskill, Best Paper, UAI 2017

[Figure: distributions of estimated value for Policy 1 and Policy 2]

Max over Estimates with Differing Variances

Doroudi, Thomas and Brunskill, Best Paper, UAI 2017
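
A toy simulation (my own construction, not the UAI 2017 example) of why this happens: an unbiased but heavily skewed IS estimate of the better policy is usually an underestimate in any small sample, so maximizing over estimates picks the worse policy most of the time.

import numpy as np

rng = np.random.default_rng(0)

# Policy 1 always takes action A, which yields reward 1, so V(pi1) = 1.0.
# The behavior policy takes A only 10% of the time, so the IS estimate of pi1
# averages importance-weighted rewards of 10 (rarely) or 0 (usually):
# unbiased, but heavily skewed toward underestimation.
# Policy 2 has true value 0.5 and, for contrast, is estimated essentially exactly.
def is_estimate_pi1(n_episodes):
    took_A = rng.random(n_episodes) < 0.1
    return np.mean(np.where(took_A, 10.0 * 1.0, 0.0))

n_trials, n_episodes = 10_000, 5
picked_worse = sum(is_estimate_pi1(n_episodes) < 0.5 for _ in range(n_trials))
print(picked_worse / n_trials)   # ~0.59: pi2 is selected most of the time,
                                 # even though V(pi1) = 1.0 > V(pi2) = 0.5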

Importance Sampling Favors Myopic Policies

Doroudi, Thomas and Brunskill, Best Paper, UAI 2017

Quest for Batch Policy Optimization with Generalization Guarantees → SRM (Structural Risk Minimization) for Reinforcement Learning

Challenge: Good Error Bound Analysis

● Importance sampling bounds (e.g. Thomas et al. 2015) ignore hypothesis class structure & typically require very large n
● Kernel function & averager approaches (e.g. Ormoneit & Sen 2002) can need a number of samples exponential in the input state dimension
● FQI bounds (e.g. Munos 2003; Munos & Szepesvári 2008; Antos et al. 2008; Lazaric et al. 2012; Farahmand et al. 2009; Maillard et al. 2010; Le, Voloshin, Yue 2019; Chen & Jiang 2019)
   - Require stronger assumptions (realizability and bounds on the inherent Bellman error)
   - If not realizable, FQI bounds depend on unknown quantities
● Primal dual approaches (e.g. Dai, Shaw, Li, Xiao, He, Liu, Chen, Song 2018) are promising and have similar dependencies

Aim: Strong Generalization Guarantees on Policy Performance. Alternative: Find a Good In-Class Policy Given Past Data

Direct Batch Policy Search & Optimization
● Despite popularity, relatively little success in direct policy optimization using offline / batch data
● Correcting for the mismatch in state distributions can yield high variance (“alternative life” in Sutton / White terminology)
● Algorithmically, often just correct for 1 step (e.g. Degris, White, & Sutton 2012)

Off-Policy Policy Gradient with State Distribution Correction
● Leverage the Markov structure idea of stationary importance sampling for RL

Monday RLDM Poster 114 & Liu, Swaminathan, Agarwal, Brunskill UAI 2019
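
A sketch of the corrected gradient (the general form of the idea in Liu, Swaminathan, Agarwal & Brunskill UAI 2019; notation reconstructed, not copied from the slide): instead of correcting only the action probabilities, also reweight by an estimate of the ratio of stationary state distributions,

\[
\nabla_\theta J(\pi_\theta) \;\approx\; \mathbb{E}_{(s,a) \sim d^{\mu}}\!\left[ \frac{d^{\pi_\theta}(s)}{d^{\mu}(s)}\,\frac{\pi_\theta(a \mid s)}{\mu(a \mid s)}\; Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right],
\]

with the state ratio d^{π_θ}/d^{μ} estimated from the batch data (e.g. via the stationary importance sampling ideas above), rather than the one-step correction used in earlier off-policy actor-critic methods.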

First Result that Provably Converges to a Local Solution with Off-Policy Batch Policy Gradient

Monday RLDM Poster 114 & Liu, Swaminathan, Agarwal, Brunskill UAI 2019

Aim: Strong Generalization Guarantees on Policy Performance. Alternative: Guarantee Finding the Best In-Class Policy

First Guarantees on the Performance of the Chosen Policy vs. the Best in Class, for When-to-Treat Policies (w/ Xinkun Nie & Stefan Wager, arXiv)

Example: Linear Thresholding Policies

● Starting HIV treatment as soon as the CD4 count dips below 200
● Stopping treatment as soon as a health metric rises above a line

[Figure: CD4 count declining over the course of HIV infection (2-10 years); the 200 line marks the treatment threshold, which is the policy parameter. Source: https://alv.mizoapp.com/cd4count/]
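
A purely illustrative sketch of this policy class (the names and the estimate_policy_value helper are hypothetical, not from the paper): each policy is indexed by a single threshold, so policy selection reduces to a one-dimensional search over that threshold, scored by an off-policy value estimate such as the doubly robust advantage estimator discussed next.

import numpy as np

def when_to_treat_policy(threshold):
    """Start treatment as soon as the health marker (e.g. CD4 count) dips below threshold."""
    def policy(marker_value, already_treating):
        return already_treating or (marker_value < threshold)
    return policy

def select_threshold(candidate_thresholds, estimate_policy_value):
    """Score each candidate threshold with a (hypothetical) off-policy value
    estimate and return the best one."""
    scores = [estimate_policy_value(when_to_treat_policy(th)) for th in candidate_thresholds]
    return candidate_thresholds[int(np.argmax(scores))]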


Selecting a When to Treat Policy

Use a Doubly Robust Advantage Decomposition
(decompose a policy’s value into the value of never acting plus the advantage gained by treating)

● Estimate the treatment effect with a doubly robust estimator given the available dataset D
● Can learn “nuisance” parameters (propensity weights and value function estimates) at a slower rate and still get sqrt(n) regret bounds, under various assumptions
● Insights from orthogonal / double machine learning ideas from econometrics

Simulation: keeping a health metric above 0

● The metric evolves with Brownian motion
● Treatment nudges it up, but at a cost
● Always start with treatment ON
● Optimal stopping time of treatment?
● Unknown propensity
● Linear decision rules, #covariates = 2
● Observe states + noise

[Plot: results as a function of horizon]

Fitted Q Iteration Policy Less Interpretable

Quest for Batch Policy Optimization with Generalization Guarantees → SRM for Reinforcement Learning

& many colleagues’ work (Murphy, Jiang, Yue, Munos, Lazaric, Szepesvari, …) → Much to be done, including relaxing common assumptions

AAMAS 2014, AAAI 2015, AAAI 2016, ICML 2016, IJCAI 2016, AAAI 2017, NeurIPS 2017, NeurIPS 2018, ICML 2019

AAAI 2015, AAAI 2016, L@S 2017, UAI 2017, UAI 2019, (Nie, Brunskill, Wager arXiv)

Nie, Brunskill, Wager arXiv; Thomas, da Silva, Barto, Brunskill, arXiv

Power of Models for Off Policy Evaluation?

● Model based approaches can be provably more efficient than model free value function methods for online evaluation or control

Sun, Jiang, Krishnamurthy, Agarwal, Langford COLT 2019

Tu & Recht COLT 2019

Models Fit for Off Policy Evaluation May Benefit from Different Loss Function

Liu, Gottesman, Raghu, Komorowski, Faisal, Doshi-Velez, Brunskill NeurIPS 2018

Given ~11k Learners’ Trajectories With Random Action (Levels)

Goal: Learn a Policy that Increases Student Persistence
Result: Learned a Policy that Increased Student Persistence by +30%

(Mandel, Liu, Brunskill, Popovic 2014)