Counterfactuals and RL
Emma Brunskill
RLDM 2019 Tutorial
Assistant Professor, Computer Science, Stanford
Thanks to Christoph Dann, Andrea Zanette, Phil Thomas, and Xinkun Nie for some figures
A Brief Tale of 2 Hamburgers
[Slide figure: 1/4 vs 1/3; “Took > 30s” vs “Took <= 30s”]
Given ~11k Learners’ Trajectories With Random Action (Levels)
Goal: Learn a New Policy to Maximize Student Persistence
Parallel Legacy of “RL” to Benefit People
https://web.stanford.edu/group/cslipublications/cslipublications/SuppesCorpus/Professional%20Photos/album/1960s/slides/5.html
● Simulator of domain
● Enormous data to train
● Can always try out a new strategy in domain

vs

● No good simulator of human physiology, behavior & learning
● Gathering real data involves impacting real people
Techniques to Minimize & Understand Data Needed to Learn to Make Good Decisions
And if we can learn to make good decisions faster, we can benefit more people
Background: Markov Decision Process
Background: Markov Decision Process Value Function
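For reference, the standard value function definition these background slides refer to (my notation, for a finite-horizon discounted MDP):

```latex
V^{\pi}(s) \;=\; \mathbb{E}\left[\; \sum_{t=0}^{H-1} \gamma^{t}\, r(s_t, a_t) \;\middle|\; s_0 = s,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(\cdot \mid s_t, a_t) \right]
```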
Background: Reinforcement Learning
Only observed through samples (experience)
Today: Counterfactual / Batch RL
“What If?” Reasoning Given Past Data
[Slide figure: observed outcomes 91, 92, 85 for the actions taken; “?” for the action not taken]
Data Is Censored
[Slide figure: only the outcomes of the actions actually taken (91, 92, 85) are observed]
Need for Generalization
[Slide figure: observed outcomes 91, 92, 85; generalization is needed to predict the “?” outcome for a new context]
Growing Interest in Causal Inference & ML
Batch Policy Optimization: Find a Good Policy That Will Perform Well in the Future
● Today will not be a comprehensive overview, but will instead highlight some of the challenges involved & some approaches with desirable statistical properties: convergence, sample efficiency & bounds
Substantial Literature Focuses on 1 Binary Decision: Treatment Effect Estimation from Old Data
Challenge: Covariate Shift
Different Policies → Different Actions → Different State Distributions
IS allows us to reweight the data to make it look as if it came from the distribution of interest
Gottesman et al. Guidelines for reinforcement learning in healthcare. Nature Medicine 2019. Figure by Debbie Maizels/Springer Nature
Policy Evaluation
1. Model based
2. Model free
3. Importance sampling
4. Doubly robust
Learn Dynamics and Reward Models from Data, Evaluate Policy
● (Mannor, Simester, Sun, Tsitsiklis 2007)
Model Free Value Function Approximation
● Fitted Q iteration, DQN, LSTD, ...
Counterfactual Reasoning for Policy Evaluation*
Parametric Models of dynamics, rewards or values fit to data
+ Low variance
- Bias (unless realizable)
Importance Sampling Refresher
Importance Sampling for RL Policy Evaluation
● First used for RL by Precup, Sutton & Singh 2000. Recent work includes: Thomas, Theocharous, Ghavamzadeh 2015; Thomas and Brunskill 2017; Guo, Thomas, Brunskill 2017; Hanna, Niekum, Stone 2019
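As a concrete reminder of what these slides describe (standard forms; the notation is mine): importance sampling rewrites an expectation under the evaluation policy as a reweighted expectation under the behavior policy, which gives the per-trajectory estimator

```latex
\mathbb{E}_{p}[f(x)] = \mathbb{E}_{q}\!\left[\tfrac{p(x)}{q(x)} f(x)\right]
\quad\Longrightarrow\quad
\hat{V}_{\mathrm{IS}}(\pi_e) = \frac{1}{n}\sum_{i=1}^{n}\left(\prod_{t=0}^{H-1} \frac{\pi_e\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_b\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}\right)\sum_{t=0}^{H-1} \gamma^{t} r_t^{(i)}
```

The product of per-step ratios is what drives the high variance discussed later.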
Stationary Importance Sampling (SIS) for RL Policy Evaluation
● Can be approximated and used as part of a Q-learning style update
● Hallak & Mannor 2017; Liu, Li, Tang, & Zhou 2018; Gelada & Bellemare 2019
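One common way to write the stationary (state-distribution-corrected) estimator studied in the works cited above (my notation; d^π denotes the discounted/stationary state distribution under π):

```latex
\hat{V}_{\mathrm{SIS}}(\pi_e) \;=\; \frac{1}{n}\sum_{i=1}^{n} w\!\left(s_i, a_i\right) r_i,
\qquad
w(s,a) \;=\; \frac{d^{\pi_e}(s)\, \pi_e(a \mid s)}{d^{\pi_b}(s)\, \pi_b(a \mid s)}
```

Because the correction depends only on (s, a) rather than on the whole trajectory prefix, it avoids the exponential-in-horizon variance of per-trajectory importance weights; the cited papers estimate w from data.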
Counterfactual Reasoning for Policy Evaluation
Parametric Models of dynamics, rewards or values fit to data
+ Low variance
- Bias (unless realizable)

Importance Sampling: corrects mismatch of state-action distribution
+ Unbiased under certain assumptions
- High variance
Doubly Robust (DR) Estimation
• Model + IS-based estimator
• Bandits (Dudik et al. 2011)
(combines the reward received with a model of the reward)
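A minimal Python sketch of the bandit-style DR estimator described here, in the spirit of Dudik et al. 2011 (variable names and the interface are my own): the reward model supplies a low-variance baseline, and the importance-weighted residual corrects whatever bias that model has.

```python
import numpy as np

def dr_bandit_estimate(r, weight, r_hat_logged, r_hat_target):
    """Doubly robust value estimate for a (contextual) bandit policy.

    r             : observed rewards for the logged actions
    weight        : importance weights pi_e(a|x) / pi_b(a|x) for the logged actions
    r_hat_logged  : reward-model predictions for the logged (context, action) pairs
    r_hat_target  : reward-model predictions for the action the target policy would choose
    """
    r = np.asarray(r, dtype=float)
    weight = np.asarray(weight, dtype=float)
    # Model term (low variance) + importance-weighted residual (corrects model bias).
    return float(np.mean(np.asarray(r_hat_target) + weight * (r - np.asarray(r_hat_logged))))
```

If either the reward model or the behavior probabilities are correct, the estimate is consistent, which is where the name "doubly robust" comes from.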
Doubly Robust Estimation for RL
• Jiang and Li (ICML 2016) extended DR to RL
• Combines a model-based estimate of Q, the actual rewards in the dataset, importance weights, and a model-based estimate of V
• Limitation: the estimator is derived to be unbiased (rather than to directly minimize error)
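For reference, the recursive form of the RL doubly robust estimator (as in Jiang & Li 2016 / Thomas & Brunskill 2016; my notation), applied backward from the end of each trajectory:

```latex
\hat{V}_{\mathrm{DR}}^{(t)} \;=\; \hat{V}\!\left(s_t\right) \;+\; \rho_t \left( r_t + \gamma\, \hat{V}_{\mathrm{DR}}^{(t+1)} - \hat{Q}\!\left(s_t, a_t\right) \right),
\qquad
\rho_t = \frac{\pi_e(a_t \mid s_t)}{\pi_b(a_t \mid s_t)},
\qquad
\hat{V}_{\mathrm{DR}}^{(H)} = 0
```

Here \hat{Q} and \hat{V} are the model-based estimates of Q and V named above, r_t the actual rewards in the dataset, and \rho_t the per-step importance weights; \hat{V}_{\mathrm{DR}}^{(0)} is averaged over trajectories.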
Instead Prioritize Accuracy & Measure with Mean Squared Error
Thomas and Brunskill, ICML 2016
• Trade bias and variance
[Slide figure: trading off the bias of the model-based estimator against the variance of the importance sampling estimator]
Two New Off Policy Evaluation Estimators
1. Weighted doubly robust (WDR) for RL problems
a. Weighted importance sampling often has much lower variance
b. WDR: doubly robust, just use normalized weights!
c. Empirically can give much better estimates
d. Still has good properties (strongly consistent)
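Concretely, weighted importance sampling replaces the 1/n average with normalized weights (standard form, my notation); WDR uses this style of normalized per-step weight inside the DR estimator above:

```latex
\hat{V}_{\mathrm{WIS}}(\pi_e) \;=\; \frac{\sum_{i=1}^{n} w_i\, G_i}{\sum_{i=1}^{n} w_i},
\qquad
w_i = \prod_{t=0}^{H-1} \frac{\pi_e\!\left(a_t^{(i)} \mid s_t^{(i)}\right)}{\pi_b\!\left(a_t^{(i)} \mid s_t^{(i)}\right)},
\qquad
G_i = \sum_{t=0}^{H-1} \gamma^{t} r_t^{(i)}
```

Normalization introduces a small bias but typically cuts variance dramatically, which is why WDR is strongly consistent rather than unbiased.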
Price of Robustness?
Le, Voloshin, Yue (2019)
Two New Off Policy Evaluation Estimators
1. Weighted doubly robust for RL problems
2. Model And Guided Importance sampling Combining estimator
a. Directly try to minimize mean squared error by balancing between value and importance sampling estimates
b. Mean squared error is a function of bias and variance
Blend IS-Based & Model Based Estimators to Directly Min Mean Squared Error
[Slide figure: blending j-step estimates (1-step, 2-step, …, N-step) with weights x_1, x_2, …, x_N to trade off bias and variance]
Thomas and Brunskill, ICML 2016
Model And Guided Importance sampling Combining (MAGIC) Estimator
Estimated policy value using a particular weighting of the model estimate and the importance sampling estimate
Thomas and Brunskill, ICML 2016
• Solve a quadratic program
• Strongly consistent (under a similar set of assumptions as WDR)
Estimating Bias & Covariance
• Estimated covariance: sample covariance matrix
• Estimated bias: may be as hard as estimating the true policy value
[Slide figure: the estimate of bias compares the importance sampling estimate with the model-based estimate]
Thomas and Brunskill, ICML 2016
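Putting the last two slides together, one way to summarize the MAGIC construction (my paraphrase of Thomas & Brunskill 2016): let g be the vector of partial (j-step) estimators shown earlier, \hat{\Omega} the sample covariance, and \hat{b} the estimated bias vector; MAGIC returns the weighting in the simplex \Delta that minimizes the estimated mean squared error:

```latex
\hat{V}_{\mathrm{MAGIC}} \;=\; \mathbf{x}^{*\top} \mathbf{g},
\qquad
\mathbf{x}^{*} \;=\; \arg\min_{\mathbf{x} \in \Delta} \; \mathbf{x}^{\top}\!\left(\hat{\Omega} + \hat{\mathbf{b}}\,\hat{\mathbf{b}}^{\top}\right)\mathbf{x}
```

This is the quadratic program referred to above.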
Gridworld Simulation: Needed Only 10% of the Data to Learn a Good Estimate of the New Policy’s Value
[Plot: estimate quality vs. number of histories for the IS-based, Model, DR, MAGIC, and MAGIC-B estimators]
Thomas and Brunskill 2016
WDR in Health Example
Sepsis treatment example (Gottesman et al. arXiv 2018)
● Actions: IV fluids & vasopressors
● Reward: +100 survival, -100 death
● State space: 750 (discretized)
● 19,275 ICU patients
Our weighted DR (WDR) was the only consistent off-policy estimator tried (PDDR, PDIS, WPDIS, WDR) that could find an optimal policy which it estimated would improve over the prior policy
Under the (common) assumption of no confounding, which is not likely to hold in practice
Policy Evaluation
1. Model based
2. Model free
3. Importance sampling
4. Doubly robust
Policy Optimization: Find Good Policy to Deploy
Learn Dynamics and Reward Models from Data, Plan
Mandel, Liu, Brunskill, Popovic 2014
Better Dynamics/Reward Models for Existing Data May Not Lead to Better Policies for Future Use
Importance Sampling Estimators Are Unbiased for Policy Evaluation
• But using them for policy selection can lead to poor results
Fairness of Importance Sampling-Based Estimators for Policy Selection
• Unfortunately, even if IS estimates are unbiased, policy selection using them can be unfair
• Here we define unfair as:
– Given two policies π1 and π2
– Where true performance V(π1) > V(π2)
– Choosing π2 more than 50% of the time
Doroudi, Thomas and Brunskill, Best Paper, UAI 2017
[Slide figure: estimated values of Policy 1 vs Policy 2]
Max over Estimates with Differing Variances
Doroudi, Thomas and Brunskill, Best Paper, UAI 2017
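A toy numerical illustration (the numbers are mine, not from the paper) of why unbiasedness alone does not make selection fair: a heavily skewed but unbiased IS-style estimate for the truly better policy loses the "max" to a low-variance estimate of the worse policy most of the time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

true_v1, true_v2 = 1.0, 0.9   # policy 1 is truly better

# Policy 1: unbiased but skewed estimate, as IS often is when large importance
# weights fire rarely: 100.0 with probability 0.01, else 0.0 (mean = 1.0).
est_v1 = np.where(rng.random(n_trials) < 0.01, 100.0, 0.0)

# Policy 2: unbiased, essentially deterministic estimate of its (lower) value.
est_v2 = np.full(n_trials, true_v2)

frac_pick_worse = np.mean(est_v2 > est_v1)
print(f"Selected the worse policy in {frac_pick_worse:.1%} of trials")  # roughly 99%
```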
Importance Sampling Favors Myopic Policies
Doroudi, Thomas and Brunskill, Best Paper, UAI 2017
Quest for Batch Policy Optimization with Generalization Guarantees → SRM (Structural Risk Minimization) for Reinforcement Learning
Challenge: Good Error Bound Analysis
● Importance sampling bounds (e.g. Thomas et al. 2015) ignore hypothesis class structure & typically require very large n
● Kernel function & averager approaches (e.g. Ormoneit & Sen 2002) can need a number of samples exponential in the input state dimension
● FQI bounds (e.g. Munos 2003; Munos & Szepesvári 2008; Antos et al. 2008; Lazaric et al. 2012; Farahmand et al. 2009; Maillard et al. 2010; Le, Voloshin, Yue 2019; Chen & Jiang 2019)
- Require stronger assumptions (realizability and bounds on the inherent Bellman error)
- If not realizable, FQI bounds depend on unknown quantities
● Primal dual approaches (e.g. Dai, Shaw, Li, Xiao, He, Liu, Chen, Song 2018) are promising and have similar dependencies
Aim: Strong Generalization Guarantees on Policy Performance
Alternative: Find a Good in-Class Policy Given Past Data
Direct Batch Policy Search & Optimization
● Despite popularity, relatively little success in direct policy optimization using offline / batch data
● Correcting for the mismatch in state distributions can yield high variance ("alternative life" in Sutton / White terminology)
● Algorithmically, often just correct for 1 step (e.g. Degris, White, & Sutton 2012)
Off-Policy Policy Gradient with State Distribution Correction
● Leverage the Markov structure idea of stationary importance sampling for RL (a schematic form of the gradient is sketched below)
Monday RLDM Poster 114 & Liu, Swaminathan, Agarwal, Brunskill UAI 2019
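Schematically, the gradient being estimated looks like the on-policy policy gradient with an extra state-distribution correction term (my notation; d^π is the discounted state distribution under π):

```latex
\nabla_{\theta} J(\theta) \;\propto\;
\mathbb{E}_{s \sim d^{\pi_b},\, a \sim \pi_b(\cdot \mid s)}
\!\left[
\frac{d^{\pi_\theta}(s)}{d^{\pi_b}(s)} \,
\frac{\pi_\theta(a \mid s)}{\pi_b(a \mid s)} \,
Q^{\pi_\theta}(s, a)\, \nabla_{\theta} \log \pi_\theta(a \mid s)
\right]
```

Earlier off-policy actor-critic methods typically dropped the state-distribution ratio (the one-step correction mentioned above); the cited work estimates it from batch data.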
First Result that Can Provably Converge to a Local Solution with Off-Policy Batch Policy Gradient
Monday RLDM Poster 114 & Liu, Swaminathan, Agarwal, Brunskill UAI 2019
Aim: Strong Generalization Guarantees on Policy Performance
Alternative: Guarantee We Find the Best in-Class Policy
1st Guarantees on Performance of the Chosen Policy vs. the Best in Class, for When-to-Treat Policies (w/ Xinkun Nie & Stefan Wager, arxiv)
Example: Linear Thresholding Policies
● Starting HIV treatment as soon as CD4 count dips below 200
● Stopping treatment as soon as the health metric is above a line (the line is the policy parameter)
[Slide figure: CD4 count over 2-10 years of HIV infection, with treatment on below the 200 threshold and off above the policy line]
Source: https://alv.mizoapp.com/cd4count/
Selecting a When to Treat Policy
Use a Doubly Robust Advantage Decomposition (comparing against never acting)
● Estimate the treatment effect with a doubly robust estimator given the available dataset D
● Can learn “nuisance” parameters (propensity weights and value function estimates) at a slower rate and still get sqrt(n) regret bounds, under various assumptions
● Insights from orthogonal / double machine learning ideas from econometrics
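Roughly, in my schematic notation (heavily simplified from the Nie, Brunskill & Wager setup), the decomposition writes the value of a when-to-treat policy relative to the never-treat baseline in terms of the advantage of starting treatment at the time the policy starts it:

```latex
V(\pi) \;=\; V(\text{never treat}) \;+\; \mathbb{E}\!\left[\, \mathbf{1}\{\tau_\pi < \infty\}\; \Delta\!\left(\tau_\pi, X_{\tau_\pi}\right) \right]
```

Here \tau_\pi is the (random) time at which \pi initiates treatment, X_{\tau_\pi} the patient state at that time, and \Delta(t, x) the expected gain from starting treatment at time t in state x rather than never treating; the doubly robust estimator above targets this advantage term.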
Keeping a Health Metric Above 0
● Health metric evolves with Brownian motion
● Treatment nudges it up, but at a cost
● Always start with treatment ON
● Question: what is the optimal stopping time for treatment?
● Unknown propensity
● Linear decision rules, #covariates = 2
● Observe states + noise
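A hypothetical, much-simplified sketch of this kind of simulation in Python (all constants below are my own choices, not from the paper): the health metric follows a random walk, treatment adds upward drift at a per-step cost, and we compare a few fixed stopping times.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_episode(stop_time, horizon=50, drift=0.1, cost=0.05, sigma=1.0):
    """Start with treatment ON; switch it off permanently at `stop_time`."""
    health, total_cost = 0.0, 0.0
    for t in range(horizon):
        treated = t < stop_time
        health += (drift if treated else 0.0) + sigma * rng.normal()
        if treated:
            total_cost += cost
    # Reward: keep the health metric above 0 at the end, minus treatment cost.
    return float(health > 0.0) - total_cost

# Compare stopping immediately, stopping early, and never stopping.
for stop in (0, 10, 50):
    returns = [simulate_episode(stop) for _ in range(2000)]
    print(f"stop treatment at t={stop:>2}: mean return {np.mean(returns):.3f}")
```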
Fitted Q Iteration Policy Less Interpretable
Quest for Batch Policy Optimization with Generalization Guarantees → SRM for Reinforcement Learning
& many colleagues’ work (Murphy, Jiang, Yue, Munos, Lazaric, Szepesvári…)
→ Much to be done, including relaxing common assumptions
AAMAS 2014, AAAI 2015, AAAI 2016, ICML 2016, IJCAI 2016, AAAI 2017, NeurIPS 2017, NeurIPS 2018, ICML 2019
AAAI 2015, AAAI 2016, L@S 2017, UAI 2017, UAI 2019, (Nie, Brunskill, Wager arxiv)
Nie, Brunskill, Wager arxiv; Thomas, da Silva, Barto, Brunskill, arxiv
Power of Models for Off Policy Evaluation?
● Model-based approaches can be provably more efficient than model-free value function approaches for online evaluation or control
Sun, Jiang, Krishnamurthy, Agarwal, Langford COLT 2019
Tu & Recht COLT 2019
Models Fit for Off Policy Evaluation May Benefit from Different Loss Function
Liu, Gottesman, Raghu, Komorowski, Faisal, Doshi-Velez, Brunskill NeurIPS 2018
Given ~11k Learners’ Trajectories With Random Action (Levels)
Learn a Policy that Increases Student Persistence
(Mandel, Liu, Brunskill, Popovic 2014)
Given ~11k Learners’ Trajectories With Random Action (Levels)
Learned a Policy that Increased Student Persistence by +30%
(Mandel, Liu, Brunskill, Popovic 2014)