
Counterfactuals and RL
Emma Brunskill

RLDM 2019 Tutorial
Assistant Professor, Computer Science, Stanford

Thanks to Christoph Dann, Andrea Zanette, Phil Thomas, and Xinkun Nie for some figures

A Brief Tale of 2 Hamburgers

[Slide figure: 1/4 vs 1/3; responses that took > 30s vs <= 30s]

Given ~11k Learners’ Trajectories With Random Action (Levels)

Goal: Learn a New Policy to Maximize Student Persistence

Parallel Legacy of “RL” to Benefit People

https://web.stanford.edu/group/cslipublications/cslipublications/SuppesCorpus/Professional%20Photos/album/1960s/slides/5.html

● Simulator of domain
● Enormous data to train
● Can always try out a new strategy in domain

vs

● No good simulator of human physiology, behavior & learning
● Gathering real data involves impacting real people

Techniques to Minimize & Understand Data Needed to Learn to Make Good Decisions

And if we can learn to make good decisions faster, we can benefit more people

Background: Markov Decision Process

Background: Markov Decision Process Value Function
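
The equations on these background slides were images and did not survive transcription. For reference, a standard statement (not necessarily the slide's exact notation): an MDP is a tuple and the value function is the expected discounted return,

\[
M = (S, A, P, R, \gamma), \qquad
V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, s_0 = s,\ a_t \sim \pi(\cdot \mid s_t)\Big], \qquad
Q^{\pi}(s,a) = R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s').
\]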

Background: Reinforcement Learning

The dynamics and rewards are only observed through samples (experience)

Today: Counterfactual / Batch RL

“What If?” Reasoning Given Past Data

[Figure: actions tried with observed outcomes 91, 92, and 85; the outcome of an untried action is unknown (“?”)]

Data Is Censored


Need for Generalization


Growing Interest in Causal Inference & ML

Batch Policy Optimization: Find a Good Policy That Will Perform Well in the Future


● Today will not be a comprehensive overview; instead it highlights some of the challenges involved & some approaches with desirable statistical properties: convergence, sample efficiency & bounds


Substantial Literature Focuses on 1 Binary Decision: Treatment Effect Estimation from Old Data

Challenge: Covariate Shift
Different Policies → Different Actions → Different State Distributions

Importance sampling (IS) allows us to reweight data gathered under one policy so that it looks as if it came from the distribution of interest

Gottesman et al. Guidelines for reinforcement learning in healthcare. Nature Medicine 2019. Figure by Debbie Maizels/Springer Nature

Policy Evaluation

1. Model based
2. Model free
3. Importance sampling
4. Doubly robust

Learn Dynamics and Reward Models from Data, Evaluate Policy

● (Mannor, Simester, Sun, Tsitsiklis 2007)
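
As a rough illustration of the model-based route on a tabular problem (my own sketch, not the cited analysis): estimate the transition and reward models by counting, then evaluate the target policy under the fitted model.

import numpy as np

def model_based_ope(transitions, n_states, n_actions, pi, gamma=0.95):
    """Tabular model-based off-policy evaluation (illustrative sketch).

    transitions: list of (s, a, r, s_next) tuples logged by the behavior policy.
    pi: array [n_states, n_actions] of target-policy action probabilities.
    Returns the estimated value function of pi under the fitted model.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    rew_sum = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        rew_sum[s, a] += r

    n_sa = counts.sum(axis=2)
    # Unvisited (s, a) pairs default to zero dynamics/reward in this sketch
    P_hat = counts / np.maximum(n_sa, 1)[:, :, None]   # estimated dynamics
    R_hat = rew_sum / np.maximum(n_sa, 1)              # estimated rewards

    # Evaluate pi in the fitted model: solve (I - gamma * P_pi) V = R_pi
    P_pi = np.einsum('sa,sat->st', pi, P_hat)
    R_pi = (pi * R_hat).sum(axis=1)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)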

Model Free Value Function Approximation

● Fitted Q iteration, DQN, LSTD, ...
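
A minimal sketch of fitted Q iteration on a fixed batch, assuming a generic scikit-learn regressor, integer-coded actions, and array-valued states; the regressor choice, feature encoding, and iteration count are illustrative, and terminal-state handling is omitted.

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

def fitted_q_iteration(states, actions, rewards, next_states, n_actions,
                       n_iters=50, gamma=0.99):
    """Fitted Q iteration on a fixed batch of transitions (sketch)."""
    q_model = None
    for _ in range(n_iters):
        if q_model is None:
            targets = rewards                      # first pass: Q ~ immediate reward
        else:
            # Bootstrapped target: r + gamma * max_a' Q(s', a')
            q_next = np.column_stack([
                q_model.predict(np.column_stack([next_states,
                                                 np.full(len(next_states), a)]))
                for a in range(n_actions)
            ])
            targets = rewards + gamma * q_next.max(axis=1)
        q_model = ExtraTreesRegressor(n_estimators=50)
        q_model.fit(np.column_stack([states, actions]), targets)
    return q_model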

Counterfactual Reasoning for Policy Evaluation*

Parametric models of dynamics, rewards or values fit to data:
+ Low variance
− Bias (unless realizable)

Importance Sampling Refresher
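
The refresher itself was an image; the core identity is E_{x∼p}[f(x)] = E_{x∼q}[(p(x)/q(x)) f(x)]. A tiny numerical illustration (my own example, not from the slides):

import numpy as np

rng = np.random.default_rng(0)

# Target distribution p = N(1, 1); sampling ("behavior") distribution q = N(0, 1).
# Goal: estimate E_p[x^2] = 2 using only samples drawn from q.
def p_pdf(x):
    return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)

def q_pdf(x):
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

x = rng.normal(0.0, 1.0, size=100_000)   # samples from q
weights = p_pdf(x) / q_pdf(x)            # importance weights p(x)/q(x)
print(np.mean(weights * x ** 2))         # close to 2.0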

Importance Sampling for RL Policy Evaluation

● First used for RL by Precup, Sutton & Singh 2000. Recent work includes: Thomas, Theocharous, Ghavamzadeh 2015; Thomas and Brunskill 2017; Guo, Thomas, Brunskill 2017; Hanna, Niekum, Stone 2019
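
A minimal implementation of the (per-trajectory) importance sampling estimator for RL policy evaluation, in the spirit of the references above; the data format and function names are my own:

import numpy as np

def trajectory_is_estimate(trajectories, pi_e, gamma=1.0, weighted=False):
    """trajectories: list of episodes, each a list of (s, a, r, mu_prob) tuples,
    where mu_prob is the behavior policy's probability of the logged action.
    pi_e(a, s) returns the evaluation policy's probability of a in s."""
    returns, weights = [], []
    for tau in trajectories:
        rho, G = 1.0, 0.0
        for t, (s, a, r, mu_prob) in enumerate(tau):
            rho *= pi_e(a, s) / mu_prob   # cumulative importance ratio for the episode
            G += gamma ** t * r           # discounted return actually observed
        returns.append(G)
        weights.append(rho)
    returns, weights = np.array(returns), np.array(weights)
    if weighted:                          # weighted (self-normalized) IS: biased, lower variance
        return np.sum(weights * returns) / np.sum(weights)
    return np.mean(weights * returns)     # ordinary IS: unbiased, can have huge variance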

Stationary Importance Sampling (SIS) for RL Policy Evaluation

● Can be approximated and used as part of a Q-learning style update
● Hallak & Mannor 2017; Liu, Li, Tang, & Zhou 2018; Gelada & Bellemare 2019
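
The SIS estimator itself was shown as an image; in the usual notation (a reconstruction, not necessarily the slide's), it reweights each logged sample by the ratio of stationary state-action distributions rather than by a product of per-step action ratios:

\[
\hat V_{\mathrm{SIS}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{d^{\pi}(s_i, a_i)}{d^{\mu}(s_i, a_i)}\, r_i
\;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{d^{\pi}(s_i)}{d^{\mu}(s_i)}\,\frac{\pi(a_i \mid s_i)}{\mu(a_i \mid s_i)}\, r_i ,
\]

where d^π and d^μ are the stationary (or discounted occupancy) distributions under the evaluation and behavior policies; the references above study how to approximate the ratio d^π/d^μ from data.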

Counterfactual Reasoning for Policy Evaluation

Parametric models of dynamics, rewards or values fit to data:
+ Low variance
− Bias (unless realizable)

Importance sampling to correct the mismatch of state-action distributions:
+ Unbiased under certain assumptions
− High variance

Doubly Robust (DR) Estimation

• Model + IS-based estimator
• Bandits (Dudik et al. 2011)

(Equation annotations on the slide: “reward received”, “model of reward”)
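
In standard notation (a reconstruction, not copied from the slide), the contextual-bandit DR estimator of Dudik et al. 2011 is

\[
\hat V_{\mathrm{DR}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n}\Big[\, \sum_{a}\pi(a \mid x_i)\,\hat r(x_i, a) \;+\; \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\,\big(r_i - \hat r(x_i, a_i)\big) \Big],
\]

where r_i is the reward received, \hat r is the model of reward, and the ratio is the importance weight; the estimate is consistent if either the reward model or the behavior probabilities μ are correct.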

Doubly Robust Estimation for RL

• Jiang and Li (ICML 2016) extended DR to RL
• Limitation: the estimator is derived to be unbiased (rather than to directly minimize error)

(Equation annotations on the slide: model-based estimate of Q, actual rewards in the dataset, importance weights, model-based estimate of V)
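
The annotated equation (model-based estimates of Q and V, actual rewards, importance weights) can be written in the recursive form used by Jiang & Li 2016 and Thomas & Brunskill 2016 (standard notation, reconstructed here):

\[
\hat V_{\mathrm{DR}}^{(H)} = 0, \qquad
\hat V_{\mathrm{DR}}^{(t)} \;=\; \hat V(s_t) \;+\; \rho_t\Big( r_t + \gamma\, \hat V_{\mathrm{DR}}^{(t+1)} - \hat Q(s_t, a_t) \Big),
\qquad \rho_t = \frac{\pi_e(a_t \mid s_t)}{\mu(a_t \mid s_t)},
\]

applied backwards from the end of each trajectory; \hat Q and \hat V are the model-based estimates, r_t the actual rewards in the dataset, and ρ_t the per-step importance weights. The DR estimate of the policy's value is the average of \hat V_{\mathrm{DR}}^{(0)} over trajectories.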

Instead Prioritize Accuracy & Measure with Mean Squared Error

Thomas and Brunskill, ICML 2016

• Trade bias and variance

[Diagram: mean squared error combines bias, mainly from the model-based estimator, and variance, mainly from the importance sampling estimator]

Two New Off Policy Evaluation Estimators

1. Weighted doubly robust for RL problems
   a. Weighted importance sampling often has much lower variance
   b. WDR: doubly robust, just use normalized weights!
   c. Empirically can give much better estimates
   d. Still has good properties (strongly consistent)
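
A sketch of one way to compute the WDR estimate (following the per-decision weighted form in Thomas & Brunskill 2016; the data format, episode-length handling, and names are my own simplifications):

import numpy as np

def wdr_estimate(trajectories, pi_e, q_hat, v_hat, gamma=1.0):
    """Weighted doubly robust off-policy value estimate (sketch).

    trajectories: list of episodes of (s, a, r, mu_prob) tuples.
    pi_e(a, s): evaluation policy probability; q_hat(s, a), v_hat(s):
    model-based estimates of Q and V for the evaluation policy."""
    n = len(trajectories)
    H = max(len(tau) for tau in trajectories)

    # Cumulative per-decision importance ratios rho[i, t]
    rho = np.zeros((n, H))
    for i, tau in enumerate(trajectories):
        w = 1.0
        for t, (s, a, r, mu_prob) in enumerate(tau):
            w *= pi_e(a, s) / mu_prob
            rho[i, t] = w
        rho[i, len(tau):] = w                  # carry weight past episode end

    # Normalize across episodes at each step: the "weighted" in WDR
    w_norm = rho / rho.sum(axis=0, keepdims=True)

    estimate = 0.0
    for i, tau in enumerate(trajectories):
        prev_w = 1.0 / n                       # weight before any action is taken
        for t, (s, a, r, mu_prob) in enumerate(tau):
            w_t = w_norm[i, t]
            # weighted reward, minus a model-based control variate
            estimate += gamma ** t * (w_t * r - (w_t * q_hat(s, a) - prev_w * v_hat(s)))
            prev_w = w_t
    return estimate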

Price of Robustness?

Le, Voloshin, Yue (2019)

Two New Off Policy Evaluation Estimators

1. Weighted doubly robust for RL problems
2. Model And Guided Importance sampling Combining (MAGIC) estimator
   a. Directly try to minimize mean squared error by balancing between the value estimate and the importance sampling estimate
   b. Mean squared error is a function of bias and variance

Blend IS-Based & Model Based Estimators to Directly Min Mean Squared Error

[Figure: a spectrum of partial estimators, from a 1-step estimate to 2-step … N-step estimates, blended with weights x_1, x_2, …, x_N to trade off bias and variance]

Thomas and Brunskill, ICML 2016

Model and Guided Importance Sampling combining (MAGIC) Estimator

Estimated policy value uses a particular weighting of the model estimate and the importance sampling estimate

Thomas and Brunskill, ICML 2016

• Solve a quadratic program
• Strongly consistent (under a similar set of assumptions as WDR)
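
Since the slide's equations were images, here is the objective in reconstructed standard notation (following Thomas & Brunskill 2016): letting g be the vector of partial (j-step) estimators, \hat\Omega the estimated covariance matrix, and \hat b the estimated bias vector, MAGIC picks blending weights on the probability simplex Δ by solving a quadratic program and returns the weighted combination,

\[
x^{\star} \;=\; \arg\min_{x \in \Delta}\; x^{\top}\hat\Omega\, x + \big(x^{\top}\hat b\big)^{2},
\qquad
\hat V_{\mathrm{MAGIC}}(\pi) \;=\; (x^{\star})^{\top} g ,
\]

i.e. the weights approximately minimize the estimated mean squared error (variance plus squared bias) of the blended estimator.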

Estimating Bias & Covariance

• Estimated covariance: sample covariance matrix
• Estimated bias: may be as hard as estimating the true policy value

[Figure: bias estimated by comparing the model based estimate against the importance sampling estimate]

Thomas and Brunskill, ICML 2016

Gridworld Simulation: Needed Only 10% of the Data to Learn a Good Estimate of the New Policy’s Value

[Plot: estimation error vs. number of histories for the IS-based, Model, DR, MAGIC, and MAGIC-B estimators]

Thomas and Brunskill 2016

WDR in Health Example

Sepsis treatment example (Gottesman et al. arXiv 2018)

● Actions: IV fluids & vasopressors
● Reward: +100 survival, -100 death
● State space: 750 (discretized)
● 19,275 ICU patients

Our weighted DR (WDR) was the only consistent off-policy estimator tried (PDDR, PDIS, WPDIS, WDR) that could find an optimal policy estimated to improve over the prior

Under the (common) assumption of no confounding, which is not likely to hold in practice

Policy Evaluation

1. Model based
2. Model free
3. Importance sampling
4. Doubly robust

Policy Optimization: Find Good Policy to Deploy

Learn Dynamics and Reward Models from Data, Plan

Mandel, Liu, Brunskill, Popovic 2014

Better Dynamics/Reward Models for Existing Data May Not Lead to Better Policies for Future Use

Importance Sampling Estimators Are Unbiased for Policy Evaluation

• But using them for policy selection can lead to poor results

Fairness of Importance Sampling-Based Estimators for Policy Selection

• Unfortunately, even if IS estimates are unbiased, policy selection using them can be unfair
• Here define unfair as:
  – Given two policies π1 and π2
  – Where true performance V(π1) > V(π2)
  – The estimator chooses π2 more than 50% of the time

Doroudi, Thomas and Brunskill, Best Paper, UAI 2017

[Figure: distributions of estimated value for Policy 1 and Policy 2]

Max over Estimates with Differing Variances

Doroudi, Thomas and Brunskill, Best Paper, UAI 2017
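
A toy simulation (my own construction, not the UAI 2017 example) of why this happens: an unbiased but heavily skewed IS estimate of the better policy is usually an underestimate in any small sample, so maximizing over estimates picks the worse policy most of the time.

import numpy as np

rng = np.random.default_rng(0)

# Policy 1 always takes action A, which yields reward 1, so V(pi1) = 1.0.
# The behavior policy takes A only 10% of the time, so the IS estimate of pi1
# averages importance-weighted rewards of 10 (rarely) or 0 (usually):
# unbiased, but heavily skewed toward underestimation.
# Policy 2 has true value 0.5 and, for contrast, is estimated essentially exactly.
def is_estimate_pi1(n_episodes):
    took_A = rng.random(n_episodes) < 0.1
    return np.mean(np.where(took_A, 10.0 * 1.0, 0.0))

n_trials, n_episodes = 10_000, 5
picked_worse = sum(is_estimate_pi1(n_episodes) < 0.5 for _ in range(n_trials))
print(picked_worse / n_trials)   # ~0.59: pi2 is selected most of the time,
                                 # even though V(pi1) = 1.0 > V(pi2) = 0.5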

Importance Sampling Favors Myopic Policies

Doroudi, Thomas and Brunskill, Best Paper, UAI 2017

Quest for Batch Policy Optimization with Generalization Guarantees → SRM (Structural Risk Minimization) for Reinforcement Learning

Challenge: Good Error Bound Analysis

● Importance sampling bounds (e.g. Thomas et al. 2015) ignore hypothesis class structure & typically require very large n
● Kernel function & averager approaches (e.g. Ormoneit & Sen 2002) can need a number of samples exponential in the input state dimension
● FQI bounds (e.g. Munos 2003; Munos & Szepesvári 2008; Antos et al. 2008; Lazaric et al. 2012; Farahmand et al. 2009; Maillard et al. 2010; Le, Voloshin, Yue 2019; Chen & Jiang 2019)
   - Require stronger assumptions (realizability and bounds on the inherent Bellman error)
   - If not realizable, FQI bounds depend on unknown quantities
● Primal dual approaches (e.g. Dai, Shaw, Li, Xiao, He, Liu, Chen, Song 2018) are promising and have similar dependencies

Aim: Strong Generalization Guarantees on Policy Performance. Alternative: Find a Good In-Class Policy Given Past Data

Direct Batch Policy Search & Optimization
● Despite popularity, relatively little success in direct policy optimization using offline / batch data
● Correcting for the mismatch in state distributions can yield high variance (“alternative life” in Sutton / White terminology)
● Algorithmically, often just correct for 1 step (e.g. Degris, White, & Sutton 2012)

Off-Policy Policy Gradient with State Distribution Correction
● Leverage the Markov structure idea of stationary importance sampling for RL

Monday RLDM Poster 114 & Liu, Swaminathan, Agarwal, Brunskill UAI 2019
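
A sketch of the corrected gradient (the general form of the idea in Liu, Swaminathan, Agarwal & Brunskill UAI 2019; notation reconstructed, not copied from the slide): instead of correcting only the action probabilities, also reweight by an estimate of the ratio of stationary state distributions,

\[
\nabla_\theta J(\pi_\theta) \;\approx\; \mathbb{E}_{(s,a) \sim d^{\mu}}\!\left[ \frac{d^{\pi_\theta}(s)}{d^{\mu}(s)}\,\frac{\pi_\theta(a \mid s)}{\mu(a \mid s)}\; Q^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a \mid s) \right],
\]

with the state ratio d^{π_θ}/d^{μ} estimated from the batch data (e.g. via the stationary importance sampling ideas above), rather than the one-step correction used in earlier off-policy actor-critic methods.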

First Result that Provably Converges to a Local Solution with Off-Policy Batch Policy Gradient

Monday RLDM Poster 114 & Liu, Swaminathan, Agarwal, Brunskill UAI 2019

Aim: Strong Generalization Guarantees on Policy Performance. Alternative: Guarantee Finding the Best In-Class Policy

First Guarantees on the Performance of the Chosen Policy vs. the Best in Class, for When-to-Treat Policies (w/ Xinkun Nie & Stefan Wager, arXiv)

Example: Linear Thresholding Policies

● Starting HIV treatment as soon as the CD4 count dips below 200
● Stopping treatment as soon as a health metric rises above a line

[Figure: CD4 count declining over the course of HIV infection (2-10 years); the 200 line marks the treatment threshold, which is the policy parameter. Source: https://alv.mizoapp.com/cd4count/]
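
A purely illustrative sketch of this policy class (the names and the estimate_policy_value helper are hypothetical, not from the paper): each policy is indexed by a single threshold, so policy selection reduces to a one-dimensional search over that threshold, scored by an off-policy value estimate such as the doubly robust advantage estimator discussed next.

import numpy as np

def when_to_treat_policy(threshold):
    """Start treatment as soon as the health marker (e.g. CD4 count) dips below threshold."""
    def policy(marker_value, already_treating):
        return already_treating or (marker_value < threshold)
    return policy

def select_threshold(candidate_thresholds, estimate_policy_value):
    """Score each candidate threshold with a (hypothetical) off-policy value
    estimate and return the best one."""
    scores = [estimate_policy_value(when_to_treat_policy(th)) for th in candidate_thresholds]
    return candidate_thresholds[int(np.argmax(scores))]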


Selecting a When to Treat Policy

Use a Doubly Robust Advantage Decomposition
(decompose a policy’s value into the value of never acting plus the advantage gained by treating)

● Estimate the treatment effect with a doubly robust estimator given the available dataset D
● Can learn “nuisance” parameters (propensity weights and value function estimates) at a slower rate and still get sqrt(n) regret bounds, under various assumptions
● Insights from orthogonal / double machine learning ideas from econometrics

Simulation: keeping a health metric above 0

● The metric evolves with Brownian motion
● Treatment nudges it up, but at a cost
● Always start with treatment ON
● Optimal stopping time of treatment?
● Unknown propensity
● Linear decision rules, #covariates = 2
● Observe states + noise

[Plot: results as a function of horizon]

Fitted Q Iteration Policy Less Interpretable

Quest for Batch Policy Optimization with Generalization Guarantees → SRM for Reinforcement Learning

& many colleagues’ work (Murphy, Jiang, Yue, Munos, Lazaric, Szepesvari, …) → Much to be done, including relaxing common assumptions

AAMAS 2014, AAAI 2015, AAAI 2016, ICML 2016, IJCAI 2016, AAAI 2017, NeurIPS 2017, NeurIPS 2018, ICML 2019

AAAI 2015, AAAI 2016, L@S 2017, UAI 2017, UAI 2019, (Nie, Brunskill, Wager arXiv)

Nie, Brunskill, Wager arXiv; Thomas, da Silva, Barto, Brunskill, arXiv

Power of Models for Off Policy Evaluation?

● Model based approaches can be provably more efficient than model free value function methods for online evaluation or control

Sun, Jiang, Krishnamurthy, Agarwal, Langford COLT 2019

Tu & Recht COLT 2019

Models Fit for Off Policy Evaluation May Benefit from Different Loss Function

Liu, Gottesman, Raghu, Komorowski, Faisal, Doshi-Velez, Brunskill NeurIPS 2018

Given ~11k Learners’ Trajectories With Random Action (Levels)

Goal: Learn a Policy that Increases Student Persistence
Result: Learned a Policy that Increased Student Persistence by +30%

(Mandel, Liu, Brunskill, Popovic 2014)