Machine Learning and Econometrics


Transcript of Machine Learning and Econometrics

Page 1: Machine Learning and Econometrics

AEA Continuing Education

Program

Machine Learning

and Econometrics

Susan Athey, Stanford Univ.

Guido Imbens, Stanford Univ.

January 7-9, 2018

Page 2: Machine Learning and Econometrics

Machine Learning for Economics: An Introduction
SUSAN ATHEY (STANFORD GSB)

Page 3: Machine Learning and Econometrics

Two Types of Machine Learning

SUPERVISED

Independent observations

Stable environment

Regression/prediction: E[Y|X=x]

Classification: Pr(Y=y|X=x)

UNSUPERVISED

Collections of units characterized by features
◦ Images
◦ Documents
◦ Individual internet activity history

Find groups of similar items

Page 4: Machine Learning and Econometrics

Classification

Advances in ML dramatically improve the quality of image classification

Page 5: Machine Learning and Econometrics

Classification

Neural nets figure out what features of image are important

Features can be used to classify images

Relies on stability


Page 6: Machine Learning and Econometrics

What’s New About ML?

Flexible, rich, data‐driven models

Increase in personalization and precision

Methods to avoid over‐fitting

Page 7: Machine Learning and Econometrics

Ability to Fit Complex Shapes

Page 8: Machine Learning and Econometrics

Prediction in a Stable Environment

Goal: estimate μ(x) = E[Y | X = x] and minimize MSE in a new dataset where only X is observed
◦ MSE: (1/N) Σᵢ (Yᵢ − μ̂(Xᵢ))²
◦ No matter how complex the model, the output, the prediction, is a single number
◦ Can hold out a test set and evaluate the performance of a model
◦ Ground truth is observed in a test set
◦ Only assumptions required: independent observations, and the joint distribution of (Y, X) the same in the test set as in the training set

Note: minimizing MSE entails a bias-variance tradeoff, and we always accept some bias
◦ Idea: if an estimator is too sensitive to the current dataset, then the procedure will be variable across datasets
◦ Models are very rich, and overfitting is a real concern, so approaches to control overfit are necessary

Idea of ML algorithms
◦ Consider a family of models
◦ Use the data to select among the models or choose tuning parameters
◦ Common approach: cross-validation (see the sketch below)
◦ Break the data into 10 folds
◦ Estimate on 9/10 of the data and estimate MSE on the last tenth, for each of a grid of tuning parameters
◦ Choose the parameters that minimize MSE

ML works well because you can accurately evaluate performance without additional assumptions
◦ Your robotic research assistant then tests many models to see what performs best
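A minimal sketch of the 10-fold recipe just described (my own illustration, not from the slides; it assumes a numeric feature matrix X and outcome vector y, and uses the lasso as the model family):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso

alphas = np.logspace(-3, 1, 20)            # grid of tuning parameters
kf = KFold(n_splits=10, shuffle=True, random_state=0)

cv_mse = []
for alpha in alphas:
    fold_mse = []
    for train_idx, test_idx in kf.split(X):
        model = Lasso(alpha=alpha).fit(X[train_idx], y[train_idx])
        resid = y[test_idx] - model.predict(X[test_idx])
        fold_mse.append(np.mean(resid ** 2))   # MSE on the held-out tenth
    cv_mse.append(np.mean(fold_mse))

best_alpha = alphas[int(np.argmin(cv_mse))]    # parameter minimizing average CV MSE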

Page 9: Machine Learning and Econometrics

What We Say v. What We Do (ML)

What we say
◦ ML = Data Science, statistics
◦ Is there anything else?
◦ Use language of answering questions or solving problems, e.g. advertising allocation, salesperson prioritization
◦ Aesthetic: human analyst does not have to make any choices
◦ All that matters is prediction

What we do
◦ Use predictive models and ignore other considerations
◦ E.g. causality, equilibrium or feedback effects
◦ Wonder/worry about interpretability/reliability/robustness/adaptability, but have little way to ask algos to optimize for it

Page 10: Machine Learning and Econometrics

Contrast with Traditional Econometrics

Economists have focused on the case with substantially more observations than covariates (N >> P)
◦ In-sample MSE is a good approximation to out-of-sample MSE
◦ OLS is BLUE, and if overfitting is not a problem, then there is no need to incur bias
◦ OLS uses all the data and minimizes in-sample MSE

OLS obviously fails due to overfitting when P ~ N and fails entirely when P > N
◦ ML methods generally work when P > N

Economists worry about estimating causal effects and identification
◦ Causal effects
◦ Counterfactual predictions
◦ Separating correlation from causality
◦ Standard errors
◦ Structural models incorporating behavioral assumptions

Identification problems cannot be evaluated using a hold-out set
◦ If the joint distribution of observables is the same in training and test, you will get the same results in both

Causal methods sacrifice goodness-of-fit to focus only on the variation in the data that identifies the parameters of interest

Page 11: Machine Learning and Econometrics

What We Say v. What We Do (Econometrics)

What we say
◦ Causal inference and counterfactuals
◦ God gave us the model
◦ We report estimated causal effects and appropriate standard errors
◦ Plus a few additional specifications for robustness

What we do
◦ Run OLS or IV regressions
◦ Try a lot of functional forms
◦ Report standard errors as if we ran only one model
◦ Have research assistants run hundreds of regressions and pick a few “representative” ones
◦ Use complex structural models
◦ Make a lot of assumptions without a great way to test them

Page 12: Machine Learning and Econometrics

Key Lessons for Econometrics

Many problems can be decomposed into predictive and causal parts
◦ Can use off-the-shelf ML for predictive parts

Data-driven model selection
◦ Tailored to econometric goals

◦ Focus on parameters of interest

◦ Define correct criterion for model

◦ Use data‐driven model selection where performance can be evaluated

◦ While retaining ability to do inference

ML‐Inspired Approaches for Robustness

Validation
◦ ML always has a test set
◦ Econometrics can consider alternatives

◦ Ruiz, Athey and Blei (2017) evaluate on days with unusual prices

◦ Athey, Blei, Donnelly and Ruiz (2017) evaluate change in purchases before and after price changes

◦ Tech firm applications have many A/B tests and algorithm changes

Other computational approaches for structural models
◦ Stochastic gradient descent
◦ Variational inference (Bayesian models)

See Sendhil Mullainathan et al. (JEP, AER) for key lessons about prediction in economics
◦ See also Athey (Science, 2017)

Page 13: Machine Learning and Econometrics

Empirical Economics in Five Years: My Predictions

Regularization/data-driven model selection will be the standard for economic models

Prediction problems better appreciated

Measurement using ML techniques an important subfield

Textual analysis standard (already many examples)

Models will explicitly distinguish causal parts and predictive parts

Reduced emphasis on sampling variation

Model robustness emphasized on equal footing with standard errors

Models with lots of latent variables

Page 14: Machine Learning and Econometrics

An Introduction to Regularized Regression

Machine Learning and Causal Inference

Susan Athey

Thanks to Sendhil Mullainathan for sharing slides

Page 15: Machine Learning and Econometrics

What we do in Econometrics: The Case of Regression

• Specify a model:

• Data set has observations i = 1, .., n
• Use OLS regression on the entire dataset to construct an estimate β̂
• Discuss assumptions under which some components of β̂ have a causal interpretation
• Consider that Sn (the set of observed units, i = 1, .., n) is a random sample from a much larger population
• Construct confidence intervals and test the hypothesis that some components are equal to zero
• Theorem: OLS is BLUE (Best Linear Unbiased Estimator)
– Best = lowest-variance

Page 16: Machine Learning and Econometrics

Goals of Prediction and Estimation

• Goal of estimation: unbiasedness
• Goal of prediction: loss minimization
– E.g. squared-error loss ℓ(y, ŷ) = (y − ŷ)²
– Use the data to pick a function that does well on a new data point

Page 17: Machine Learning and Econometrics

Key assumptions in both cases

• Stationary data generating process

• Estimation: – Interested in a parameter of that process

• Prediction:– Interested in predicting y

Page 18: Machine Learning and Econometrics

High v. Low Dimensional Analysis

• We have discussed prediction as a high dimensional construct
• Practically that is where it is useful
• But to understand how high dimensional prediction works we must unpack an implicit presumption
– Presumption: our known estimation strategies would be great predictors if they were feasible

Page 19: Machine Learning and Econometrics

A Simple OLS example

• Suppose we truly live in a linear world

• Write x = (1,x)

Page 20: Machine Learning and Econometrics

OLS seems like a good predictor

Especially since it is known to be efficient

Page 21: Machine Learning and Econometrics

An Even Simpler Set-up

• Let’s get even lower dimensional

• No variables at all

• Suppose you get the data of the type:

• You would like to estimate the mean

Page 22: Machine Learning and Econometrics

Forming an estimator of the mean

• Minimize bias: the sample mean is an unbiased estimator
– Also what you would get from an OLS regression on a constant

Page 23: Machine Learning and Econometrics

A prediction problem

• In the same setup, you are given n data points

• You would like to guess the value of a new data point from the same distribution

• Goal: minimize quadratic loss of prediction

Page 24: Machine Learning and Econometrics

Best Predictor

Page 25: Machine Learning and Econometrics

The higher alpha the lower the bias

The higher alpha the more variable across samples it is

Page 26: Machine Learning and Econometrics

Key problem

• The unbiased estimator has a nice property:

• But getting that property means large sample to sample variation of estimator

• This sample to sample variation means that in any particular finite sample I’m paying the cost of being off on all my predictions

Page 27: Machine Learning and Econometrics

Intuition

• I see your first test score. What should my prediction of your next test be?
– Your first test score is an unbiased estimator
– But it is very variable
• Note: “Bayesian” intuition
– Even simpler: what was my guess before I saw any information?
– Shrink toward that
– In this example I’m shrinking to zero

Page 28: Machine Learning and Econometrics

But in a way you know this

• As empiricists you already have this intuition

Page 29: Machine Learning and Econometrics

Back to Simple OLS example

• Suppose we truly live in a linear world

• Write x = (1,x)

Page 30: Machine Learning and Econometrics

• You run a one variable regression and get

• Would you use the OLS coefficients to predict

• Or drop the first variable and use this:

A Simple Example

Page 31: Machine Learning and Econometrics

Deciding whether to drop

• Suppose, in the (impossible) case, we got the true world right
– (0, 2) are the right coefficients
• Of course OLS does perfectly (by assumption).
• But how would OLS do on new samples… where (0, 2) are the generating coefficients?
– We’re giving OLS a huge leg up here.

Page 32: Machine Learning and Econometrics

OLS Performance

Page 33: Machine Learning and Econometrics

What if we dropped the variable

Page 34: Machine Learning and Econometrics

Your standard error worry!

Page 35: Machine Learning and Econometrics

Where does your standard error intuition come from?

• You see a standard error

• You think “that variable is not ‘significant’” so you might not want to include it.

• But this is misleading

Page 36: Machine Learning and Econometrics
Page 37: Machine Learning and Econometrics
Page 38: Machine Learning and Econometrics
Page 39: Machine Learning and Econometrics
Page 40: Machine Learning and Econometrics

Your Standard Error Worry

• For hypothesis testing, the standard error tells you whether the coefficient is significant or not

• For prediction it’s telling you how variable an estimator using it really is

Page 41: Machine Learning and Econometrics

Dual purposes of the standard error

• The standard error also tells you that even if you’re right on average:
– Your estimator will produce a lot of variance
– And in those cases you make systematic prediction mistakes
• Bias-variance tradeoff
– Being right on average on the coefficient is not the same as being the best predictor.

Page 42: Machine Learning and Econometrics

The Problem Here

• Prediction quality suffers from:– Biased coefficients– Variability in estimated coefficients

• Even if the true coefficient is 2, in any sample, we will estimate something else

• OLS is lexicographic– First ensure unbiased– Amongst unbiased estimators: seek efficiency

• Good predictions must trade these off

Page 43: Machine Learning and Econometrics

Two Variable Example

• Belaboring the point here…
• Assume now that we have two variables
– As before, both normally distributed with unit variance
• Your estimator produces

Page 44: Machine Learning and Econometrics

What would you do now?

• Logic above suggests you would drop both variables?

• Or keep both variables?

• It really depends on how you feel about the variance (10)?

Page 45: Machine Learning and Econometrics

Calculation

Page 46: Machine Learning and Econometrics

Hidden in Bias-Variance Tradeoff

• Covariance is central

• The standard error on several variables can be large, even though together their effect is highly consistent

• For prediction covariance between x matters

Page 47: Machine Learning and Econometrics

In a way this problem is not important

• The variance term diminishes with sample size
– The prediction-estimation wedge falls off as n grows
• But the variance term increases with the number of “variables”
– The prediction-estimation wedge rises with k
• So this is a problem when…
– The function class is high dimensional relative to the data

Page 48: Machine Learning and Econometrics

What this means practically

• In some cases what you already know (estimation) is perfectly fine for prediction
– This is why ML textbooks teach OLS, etc.
– They are perfectly useful for the kinds of prediction problems ML tries to solve in low dimensional settings
• But in high dimensional settings…
– Note: high dimensional does not ONLY mean lots of variables! It can mean rich interactions.

Page 49: Machine Learning and Econometrics

So far…

• All this gives you a flavor of how the prediction task is not mechanically a consequence of the estimation task
• But it doesn’t really tell you how to predict
– The bias-variance tradeoff is entirely unactionable
– What’s the bias?
– What’s the variance?
– This is not really a tradeoff you can make
• A different look at the same problem produces a practical insight though

Page 50: Machine Learning and Econometrics

Back to OLS

• The real problem here is minimizing the “wrong” thing: in-sample fit vs. out-of-sample fit

AVERAGES NOTATION: Ê_S[·] for the sample average over a sample S

Page 51: Machine Learning and Econometrics

Overfit problem

• OLS looks good with the sample you have– It’s the best you can do on this sample

• Bias-variance improving predictive power is about improving out of sample predictive power

• The problem is that OLS by construction overfits
– We overfit in estimation

Page 52: Machine Learning and Econometrics

This problem is exactly why wide data is troubling

• Similarly think of the wide data case

• Why are we worried about having so many variables?

• We’ll fit very well (perfectly if k > n) in sample

• But arbitrarily badly out of sample

Page 53: Machine Learning and Econometrics

Understanding overfit

• Let’s consider a general class of algorithms

Page 54: Machine Learning and Econometrics

A General Class of Algorithms

• Let L(f) = E_P[ℓ(f(X), Y)] for some loss function ℓ (e.g. squared error)
– Note: L is an unknown function: we don’t know P
• Consider algorithms of the form f̂ = argmin_{f ∈ F} Ê_n[ℓ(f(Xᵢ), Yᵢ)]
– Ê_n is used here as shorthand for the sample mean over observations in a sample of size n
– OLS is an empirical loss minimizer: it minimizes the sample average over the observed data of the loss function
• So empirical loss minimization algorithms are defined by the function class F they choose from
• For estimation what we typically do…
– Show that empirical loss minimizers generate unbiasedness

Page 55: Machine Learning and Econometrics

Empirical Loss minimization

• Leads to unbiasedness/consistency
– Fit the data you have…
– In a frequentist world, “on average” (across all samples Sn) this will produce the right thing
– This is usually how we prove consistency/unbiasedness

• Other variants: – MLE

Page 56: Machine Learning and Econometrics

Some Notation

• Define f* = argmin_f L(f): the best we can do
• Define f*_F = argmin_{f ∈ F} L(f): the best in the subset of functions F that the algorithm looks at
– Recall: L is infeasible because we don’t know the true data-generating process
• Contrast the latter with f̂ = argmin_{f ∈ F} Ê_n[ℓ(f(Xᵢ), Yᵢ)]: what the in-sample loss minimizer actually produces given a sample

Page 57: Machine Learning and Econometrics

Performance of Algorithm

• Performance of a predictor

• Performance of an Algorithm

– Algorithm’s expected loss– (Suppress Sn in some of the notation for estimator)

Page 58: Machine Learning and Econometrics

The performance of A

Understanding estimation error:

“Wrong” function looks good in-sample

The algorithm does not see this

Page 59: Machine Learning and Econometrics

Basic Tradeoff• These two terms go hand in hand:

Page 60: Machine Learning and Econometrics

Approximation – Overfit Tradeoff

• If we reduce the set of functions F to reduce possible over-fit:
• Then we fit fewer “true functions” and drive up approximation error
• The only way to avoid this is if we knew information about f* so we could shrink the set

Page 61: Machine Learning and Econometrics

Unobserved overfit

• So the problem of prediction really is managing unobserved overfit

• We do well in-sample. But some of that “fit” is overfit.

Page 62: Machine Learning and Econometrics

Return to the original example

• We drove down overfit by doing a constrained optimization

Greater Chance To Overfit

Less Chance To Overfit

Page 63: Machine Learning and Econometrics

Basic Tradeoff at the Heart of Machine Learning

• Bigger function classes…– The more likely we are to get to the truth (less

approximation)– The more likely we are to overfit

• So we want to not just minimize in-sample error given a class of functions

• We also want to decide on the class of functions– More expressive means less approximation error– More expressive means more overfit

Page 64: Machine Learning and Econometrics

Let’s do the same thing here

Unconstrained: f̂ = argmin_{f ∈ F} Ê_n[ℓ(f(Xᵢ), Yᵢ)]

But we are worried about overfit.

So why not do this instead?
f̂ = argmin_{f ∈ F} Ê_n[ℓ(f(Xᵢ), Yᵢ)]  s.t.  R(f) ≤ c

where R(f) is a complexity measure: the tendency to overfit.

Page 65: Machine Learning and Econometrics

Return to the original example

• Reduce overfit by approximating worse
• Choose a less expressive function class

More expressive: R(f) higher, greater overfit, better approximation
Less expressive: R(f) lower, less overfit, worse approximation

Page 66: Machine Learning and Econometrics

Constrained minimization

• We could do a constrained minimization
• But notice that this is equivalent to:

f̂ = argmin_{f ∈ F} { Ê_n[ℓ(f(Xᵢ), Yᵢ)] + λ · R(f) }

• The complexity measure should capture the tendency to overfit

Page 67: Machine Learning and Econometrics

Basic insight

• Data has signal and noise

• More expressive function classes-– Allow us to pick up more of the signal– But also pick up more of the noise

• So the problem of prediction becomes the problem of choosing expressiveness

Page 68: Machine Learning and Econometrics

Overall Structure

• Create a regularizer that:– Measures expressiveness

• Penalize algorithm for choosing more expressive functions – Tuning parameter lambda

• Let it weigh this penalty against in-sample fit

Page 69: Machine Learning and Econometrics

Linear Example

• Linear function class

• Regularized linear regression

Page 70: Machine Learning and Econometrics

Regularizers for Linear Functions

• Linear functions more expressive if use more variables

• Can transform coefficients

Page 71: Machine Learning and Econometrics

Computationally More Tractable

• Lasso

• Ridge
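The formulas on this slide did not survive extraction. For reference (standard definitions in my own notation, not necessarily the slide’s), the lasso penalizes the sum of absolute coefficients and ridge penalizes the sum of squared coefficients:

\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - X_i\beta\bigr)^2 + \lambda\sum_{j=1}^{p}|\beta_j|,
\qquad
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\ \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - X_i\beta\bigr)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 .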

Page 72: Machine Learning and Econometrics

What makes a good regularizer?

• You might think…
– Bayesian assumptions
– Example: ridge
• A good regularizer can build in beliefs
• Those are great and useful when available
• But the central force is the tendency to overfit
• Example:
– Even if the true world were not sparse or the priors were not normal, you’d still do this

Page 73: Machine Learning and Econometrics

Summary

• Regularization is one half of the secret sauce

• Gives a single-dimensional way of capturing expressiveness:

f̂ = argmin_{f ∈ F} { Ê_n[ℓ(f(Xᵢ), Yᵢ)] + λ · R(f) }

• The still-missing ingredient is lambda

Page 74: Machine Learning and Econometrics

Choosing lambda

• How much should we penalize expressiveness?

• How do you make the over-fit approximation tradeoff?

• The tuning problem.

• Use cross-validation

Page 75: Machine Learning and Econometrics

How Does Cross Validation Work?

[Diagram: the training data are split into “Train” and “Tune” pieces; each CV-tuning set is 1/5 of the training set, with the remaining 4/5 used as the CV-training set.]

Page 76: Machine Learning and Econometrics

Cross-Validation Mechanics

• Loop over cross-validation samples
– Train a deep tree on the CV-training subset
• Loop over penalty parameters
– Loop over cross-validation samples
◦ Prune that fold’s deep tree according to the penalty
◦ Calculate the MSE of the pruned tree on the CV-tuning subset
– Average (over CV samples) the MSE for this penalty
• Choose the penalty λ* that gives the best average MSE (see the sketch below)
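For the lasso, the next slide’s cross-validation example can be reproduced in a few lines; a hedged sketch (my own, assuming arrays X and y), using scikit-learn’s built-in LassoCV, which runs the loop above internally:

from sklearn.linear_model import LassoCV

# For each candidate penalty, LassoCV fits on the CV-training folds, scores on
# the held-out fold, averages the MSE across folds, and keeps the best penalty.
lasso_cv = LassoCV(cv=5, n_alphas=50).fit(X, y)
print("chosen penalty (lambda):", lasso_cv.alpha_)
print("number of nonzero coefficients:", (lasso_cv.coef_ != 0).sum())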

Page 77: Machine Learning and Econometrics

LASSO c-v Example

Page 78: Machine Learning and Econometrics

Creating Out-of-Sample In Sample

• Major point:
– Not many assumptions
– Don’t need to know the true model
– Don’t need to know much about the algorithm
• Minor but important point
– To get the asymptotics right we need to make some regularity assumptions
• Side point (to which we return)
– We’d like to choose the best algorithm for sample size n
– But this will not do that. Why?

Page 79: Machine Learning and Econometrics

Why does this work?

1. Not just because we can split a sample and call it out of sample

– It’s because the thing we are optimizing is observable (easily estimable)

Page 80: Machine Learning and Econometrics

This is more than a trick

• It illustrates what separates prediction from estimation:
– I can’t ‘observe’ my prior
– I can’t observe whether the world is truly drawn from a linear model
– But prediction quality is observable
• Put simply:
– The validity of predictions is measurable
– The validity of coefficient estimators requires structural knowledge

This is the essential ingredient to prediction: prediction quality is an empirical quantity, not a theoretical guarantee.

Page 81: Machine Learning and Econometrics

Why does this work?

1. It’s because the thing we are optimizing is observable

2. By focusing on prediction quality we have reduced dimensionality

Page 82: Machine Learning and Econometrics

To understand this…

• Suppose you tried to use this to choose coefficients
– Ask which set of coefficients worked well out of sample
• Does this work?
• Problem 1: estimation quality is unobservable
– You need the same assumptions as the algorithm to know whether the coefficients “work” out of sample
– If you just go by fit, you are conceding that you want the best-predicting model
• Problem 2: no dimensionality reduction
– You’ve got as many coefficients as before to search over

Page 83: Machine Learning and Econometrics
Page 84: Machine Learning and Econometrics

Bayesian Interpretation of Ridge

Page 85: Machine Learning and Econometrics

Bayesian Interpretation of Ridge

Page 86: Machine Learning and Econometrics

Bayesian Interpretation of Ridge

Page 87: Machine Learning and Econometrics

POST-Lasso

• Important distinction:– Use LASSO to choose variables– Use OLS on these variables

• How should we think about these?

Page 88: Machine Learning and Econometrics
Page 89: Machine Learning and Econometrics
Page 90: Machine Learning and Econometrics

[Plot: LASSO vs. OLS coefficients in the orthonormal case]

Soft thresholding. Why not hard thresholding?

Page 91: Machine Learning and Econometrics


Page 92: Machine Learning and Econometrics

[Plot: RIDGE vs. OLS coefficients in the orthonormal case]
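The plots on these two slides were not preserved in the transcript. For reference (a standard result, stated here under the assumption of orthonormal covariates and a suitable scaling of the penalty λ), the lasso soft-thresholds the OLS coefficients while ridge shrinks them proportionally:

\hat{\beta}^{\text{lasso}}_j = \operatorname{sign}\bigl(\hat{\beta}^{\text{OLS}}_j\bigr)\,\max\bigl(|\hat{\beta}^{\text{OLS}}_j| - \lambda,\ 0\bigr),
\qquad
\hat{\beta}^{\text{ridge}}_j = \frac{\hat{\beta}^{\text{OLS}}_j}{1+\lambda}.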

Page 93: Machine Learning and Econometrics


Page 94: Machine Learning and Econometrics

Can be very misleading

Page 95: Machine Learning and Econometrics

Coefficient on Number of Bedrooms

Page 96: Machine Learning and Econometrics

Coefficient on Number of Bedrooms

Page 97: Machine Learning and Econometrics

Coefficient on Number of BedroomsWhat is this about?

Page 98: Machine Learning and Econometrics

Coefficient on Number of Bedrooms

What is this about?

Page 99: Machine Learning and Econometrics
Page 100: Machine Learning and Econometrics

Prediction Policy

Susan Athey-Machine Learning and Causal Inference

Thanks to Sendhil Mullainathan for sharing his slides

Page 101: Machine Learning and Econometrics

Three Direct Uses of Prediction

1. Policy

2. Testing Whether Theories are Right

3. Testing Theory Completeness

Page 102: Machine Learning and Econometrics

When is Prediction the Primary Focus?

• Economics: “allocation of scarce resources”
• An allocation is a decision.
– Generally, optimizing decisions requires knowing the counterfactual payoffs from alternative decisions.
• Hence: intense focus on causal inference in applied economics
• Examples where prediction plays the dominant role in a decision
– The decision is obvious given an unknown state
– Many decisions hinge on a prediction of a future state

Page 103: Machine Learning and Econometrics

Prediction and Decision-Making: Predicting a State Variable

Kleinberg, Ludwig, Mullainathan, and Obermeyer (2015)
• Motivating examples:
– Will it rain? (Should I take an umbrella?)
– Which teacher is best? (Hiring, promotion)
– Unemployment spell length? (Savings)
– Risk of violation of a regulation (Health inspections)
– Riskiest youth (Targeting interventions)
– Creditworthiness (Granting loans)
• Empirical applications:
– Will the defendant show up for court? (Should we grant bail?)
– Will the patient die within the year? (Should we replace joints?)

Page 104: Machine Learning and Econometrics

Allocation of Inspections

• Examples:
– Auditors
– Health inspectors
– Fire code inspectors
– Equipment
• Efficient use of resources:
– Inspect the highest-risk units
– (Assuming you can remedy the problem at equal cost for all…)

Page 105: Machine Learning and Econometrics
Page 106: Machine Learning and Econometrics

Prediction Problem

• Over 750,000 joint replacements every year
• Benefits
– Improved mobility and reduced pain
• Costs
– Monetary: $15,000 (roughly)
– Non-monetary: short-run utility costs as people recover from surgery

Page 107: Machine Learning and Econometrics

How well are we doing at avoiding unnecessary surgery? Look at the death rate within a year.

Medicare claims data: 2010 surgeries for joint replacement

Average death rate is 5%

Don’t want the average patient

Want the marginal patient: the predictably highest-risk patients

But is that the right metric for excess joint replacements?

Page 108: Machine Learning and Econometrics

Approach: use ML methods to predict mortality as a function of covariates
• e.g. regularized regression, random forest
• Put individuals into percentiles of mortality risk

A large number of joint replacements go to people who die within the year

Could we just eliminate the ones above a certain risk?

Page 109: Machine Learning and Econometrics

Econometrics of Prediction Policy Problems

1. Problem: Omitted Payoff Bias

Page 110: Machine Learning and Econometrics

This Unobservable is a Problem

Pain

What if those with high mortality also benefit most?

Page 111: Machine Learning and Econometrics

Cov(X, Z) is not a problem; Cov(X, W) is a problem

Y = f(X, Z)

g(X0, W) vs. g(X0)

Omitted Payoff Bias

Page 112: Machine Learning and Econometrics

Econometrics of Prediction Policy Problems

1. Omitted Payoff Bias– Like omitted variable bias but not in y – Can partially assess on the basis of observables

Page 113: Machine Learning and Econometrics

No sign of bias: Highest risk show no signs of greater benefit

Page 114: Machine Learning and Econometrics

Quantifying gain of predicting better

• Allocation problem:– Reallocate joints to other eligible patients

• How to estimate the risk of those who didn’t get surgery?– Look at those who could get surgery but didn’t– Doctors should choose the least risky first– So those who don’t receive should be particularly risky.

• Take a conservative approach– Compare to median risk in this pool

Page 115: Machine Learning and Econometrics
Page 116: Machine Learning and Econometrics

Assessing the Research Agenda

• Follows economic tradition of using data to improve policy

• In an area of economic interest– Similar to a lot of health econ work

• Of course this does not answer all questions of interest– Why not?

Page 117: Machine Learning and Econometrics

Another Prediction Policy Problem

• Each year police make over 12 million arrests

• Many detained in jail before trial

• Release vs. detain is high stakes
– Pre-trial detention spells average 2-3 months (can be up to 9-12 months)
– Nearly 750,000 people in jails in the US
– Consequential for jobs and families as well as crime

Kleinberg, Lakkaraju, Leskovec, Ludwig, and Mullainathan

Page 118: Machine Learning and Econometrics

Judge’s Problem

• Judge must decide whether to release or not (bail)

• Defendant when out on bail can behave badly:– Fail to appear at case– Commit a crime

• The judge is making a prediction

Page 119: Machine Learning and Econometrics

PREDICTION

Page 120: Machine Learning and Econometrics

Omitted Payoff Bias?

• Bail carefully chosen – Unlike other sentencing no other concerns:

• Retributive justice

– Family & other considerations low

• Bad use case: Parole Decision

Page 121: Machine Learning and Econometrics

Evaluating the Prediction

• NOT just AUC or Loss

• Use predictions to create a release rule

• What is the release – crime rate tradeoff?

• Note: There’s a problem

Page 122: Machine Learning and Econometrics

Econometrics of Prediction Policy Problems

1. Omitted Payoff Bias
2. “Selective Labels”
– What do we do with people the algorithm releases but the judge jails?
– (Like the people who get surgery under the new rule but didn’t before)

Page 123: Machine Learning and Econometrics
Page 124: Machine Learning and Econometrics

Selective Labels Revisited

• What is the crime rate we must use?
– For released defendants, the empirical crime rate
– For jailed ones, an imputed crime rate
• But imputation may be biased…
– The judge sees factors we don’t
– Suppose young people have dots on their foreheads
• Perfectly predictive: the judge releases only if there is no dot
– In the released sample: young people have no crime
• We would falsely conclude young people have no risk.
• But this is because the young people with dots are in jail.
– We would then falsely presume that when we release all young people we will do better than the judge
• Key problem: unobserved factors seen by the judge affect the crime rate (and the judge uses these wisely)

• How to fix?

Page 125: Machine Learning and Econometrics

This is not a problem when we look just at the released

Key insight: Contraction

Would judges knowingly release at 55% risk?

Willing to live with very high crime rates? 

Or

Judges mispredicting

55%

Page 126: Machine Learning and Econometrics

Contraction

• Multiple judges with similar caseloads and different leniency
• Strategy: use the most lenient judges.
– Take their released population and ask which of those you would incarcerate to become less lenient
– Compare to less lenient judges

Page 127: Machine Learning and Econometrics

[Diagram: the 90% judge’s caseload split into Released and Jailed. Under contraction, a subset of the 90% judge’s released population is moved to jail; under imputation, outcomes for the jailed are imputed.]

Page 128: Machine Learning and Econometrics

[Diagram: evaluating an 80% judge’s performance. Imputation measures outcomes for the released and imputes them for the jailed; contraction measures outcomes directly within the 90% judge’s released population.]

Page 129: Machine Learning and Econometrics

Contraction

• Requires– Judges have similar cases (random assignment)– Does not require judges having “similar” rankings

• But does give performance of a different rule– “Human constrained” release rule

Page 130: Machine Learning and Econometrics

Contraction and Imputation Compared

Page 131: Machine Learning and Econometrics

Selective Labels

• In this case does not appear to be a problem

• But generically a problem
– Extremely common: occurs whenever prediction -> decision -> treatment
– Data generated by previous decisions

Page 132: Machine Learning and Econometrics
Page 133: Machine Learning and Econometrics
Page 134: Machine Learning and Econometrics
Page 135: Machine Learning and Econometrics

Econometrics of Prediction Policy Problems

1. Omitted Payoff Bias2. Selective Labels3. Restricted Inputs

Page 136: Machine Learning and Econometrics

Restricted Inputs

• Race and gender are not legal to use
– We do not use them
• But is that enough?
– Reconstruction problem
– Optimizing in the presence of this additional reconstruction constraint

• Rethinking disparate impact and disparate treatment

Page 137: Machine Learning and Econometrics

Racist Algorithms?

Page 138: Machine Learning and Econometrics

Econometrics of Prediction Policy Problems

1. Omitted Payoff Bias2. Selective Labels3. Restricted Inputs4. Response to Decision Rule

Page 139: Machine Learning and Econometrics

Comparing Judges to Themselves

Page 140: Machine Learning and Econometrics

Why do we beat judges?

• Judges see more than we do

• Perhaps that is the problem

• Suggests behavioral economics of salience important here– In general, any kind of “noise”

Page 141: Machine Learning and Econometrics

General points here

• Need more ways of comparing human and machine predictions

• Notion of private information called into question

Page 142: Machine Learning and Econometrics

Summary

• Many prediction policy problems

• Raise their own econometric challenges

• Can also provide conceptual insights

Page 143: Machine Learning and Econometrics

Causal Inference for Average Treatment Effects

Professor Susan Athey
Stanford University

Machine Learning and Causal Inference

Spring 2017

Page 144: Machine Learning and Econometrics

The potential outcomes framework

For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple (Xi, Yi, Wi), comprised of

I A feature vector Xi ∈ R^p,

I A response Yi ∈ R, and

I A treatment assignment Wi ∈ {0, 1}.

Following the potential outcomes framework (Holland, 1986; Imbens and Rubin, 2015; Rosenbaum and Rubin, 1983; Rubin, 1974), we posit the existence of quantities Yi(0) and Yi(1).

I These correspond to the response we would have measured given that the i-th subject received treatment (Wi = 1) or no treatment (Wi = 0).

I NB: We only get to see Yi = Yi(Wi).

Page 145: Machine Learning and Econometrics

The potential outcomes framework

For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple (Xi, Yi, Wi), comprised of

I A feature vector Xi ∈ R^p,

I A response Yi ∈ R, and

I A treatment assignment Wi ∈ {0, 1}.

I Define the average treatment effect (ATE) and the average treatment effect on the treated (ATT):

τ = τ_ATE = E[Y(1) − Y(0)];   τ_ATT = E[Y(1) − Y(0) | Wi = 1];

I and the conditional average treatment effect (CATE):

τ(x) = E[Y(1) − Y(0) | X = x].

Page 146: Machine Learning and Econometrics

The potential outcomes framework

Page 147: Machine Learning and Econometrics

The potential outcomes framework

If we make no further assumptions, it is not possible to estimate the ATE, ATT, CATE, and related quantities.

I This is a failure of identification (infinite sample size), not a small sample issue. Unobserved confounders correlated with both the treatment and the outcome make it impossible to separate correlation from causality.

I One way out is to assume that we have measured enough features to achieve unconfoundedness (Rosenbaum and Rubin, 1983):

{Yi(0), Yi(1)} ⊥⊥ Wi | Xi.

I When this assumption plus OVERLAP (e(x) ∈ (0, 1)) holds, causal effects are identified and can be estimated.

Page 148: Machine Learning and Econometrics

Identification

E[Yi(1)] = E_Xi [ E[ Yi(1) | Xi ] ]
         = E_Xi [ E[ Yi(1) · Wi / Pr(Wi = 1 | Xi) | Xi ] ]
         = E_Xi [ E[ Yi · Wi / Pr(Wi = 1 | Xi) | Xi ] ]
         = E[ Yi · Wi / Pr(Wi = 1 | Xi) ]

I The argument is analogous for E[Y(0)], which leads to the ATE; and similar arguments allow you to identify the CATE as well as the counterfactual effect of any policy assigning units to treatments on the basis of covariates.

I This result suggests a natural estimator: propensity score weighting using the sample analog of the last equation (a sketch follows below).
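A minimal sketch of that sample-analog estimator (my own illustration, assuming arrays X, W, Y, with the propensity score estimated by logistic regression):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Estimated propensity score e(x) = Pr(W = 1 | X = x).
e_hat = LogisticRegression(max_iter=1000).fit(X, W).predict_proba(X)[:, 1]
e_hat = np.clip(e_hat, 0.025, 0.975)       # trim extreme scores (overlap)

mu1 = np.mean(Y * W / e_hat)               # sample analog of E[Y * W / e(X)]
mu0 = np.mean(Y * (1 - W) / (1 - e_hat))   # analogous expression for Y(0)
ate_ipw = mu1 - mu0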

Page 149: Machine Learning and Econometrics

The role of overlap

Note that we need e(x) ∈ (0, 1) to be able to calculate treatment effects for all x.

I Intuitively, how could you possibly infer E[Y(0) | Xi = x] if e(x) = 1?

I Note that for discrete x, the variance of the ATE estimator is infinite when e(x) = 0.

I “Moving the goalposts”: Crump, Hotz, Imbens, Miller (2009) analyze trimming, which entails dropping observations where e(x) is too extreme. Typical approaches entail dropping the bottom and top 5% or 10%.

I Approaches that don’t directly require propensity score weighting may seem to avoid the need for this, but it is important to understand the role of extrapolation.

Page 150: Machine Learning and Econometrics

Propensity Score Plots: Assessing Overlap

The causal inference literature has developed a variety of conventions, broadly referred to as “supplementary analysis,” for assessing the credibility of empirical studies. One of the most prevalent conventions is to plot the propensity scores of treated and control groups to assess overlap.

I Idea: for each q ∈ (0, 1), plot the fraction of observations in the treatment group with e(x) = q, and likewise for the control group.

I Even if there is overlap, when there are large imbalances, this is a sign that it may be difficult to get an accurate estimate of the treatment effect.

Page 151: Machine Learning and Econometrics

Propensity Score Plots: Assessing Overlap

Example: Athey, Levin and Seira analysis of timber.
I Assignment to first price or open ascending:

I in ID, randomized for a subset of tracts, with different probabilities in different geographies;

I in CA, small v. large sales (with cutoffs varying by geography).

I So W = 1 if the auction is sealed, and X represents geography, size and year.

Page 152: Machine Learning and Econometrics

Propensity Score Plots: Assessing Overlap in ID

Very few observations with extreme propensity scores

Page 153: Machine Learning and Econometrics

Propensity Score Plots: Assessing Overlap in CA

Untrimmed v. trimmed so that e(x) ∈ [.025, .975]

Page 154: Machine Learning and Econometrics

Variance of Estimator: Discrete Case

I Suppose small number of realizations of Xi .

I Under unconfoundedness, can analyze these as separateexperiments and average up the results.

I How does conditioning on Xi affect variance of estimator?

Page 155: Machine Learning and Econometrics

Variance of Estimator: Discrete Case

Let Ê denote the sample average, V the variance, π(x) the proportion of observations with Xi = x, and e(x) the propensity score (Pr(Wi = 1 | Xi = x)).

V( Ê_{i: Xi=x, Wi=1}(Yi) ) = σ²(x) / ( n · π(x) · e(x) )

V( τ̂(x) ) = σ²(x) / ( n · π(x) · e(x) ) + σ²(x) / ( n · π(x) · (1 − e(x)) )

V( ÂTE ) = Σ_x [ (n(x)/n) ( σ²(x) / ( n(x) · e(x) ) + σ²(x) / ( n(x) · (1 − e(x)) ) ) ]
         = Σ_x ( σ²(x) / n ) [ 1/e(x) + 1/(1 − e(x)) ]

Page 156: Machine Learning and Econometrics

Estimation Methods

The following methods are efficient when the number of covariates is fixed:

I Propensity score weighting

I “Direct” model of the outcome (a model of E[Yi | Xi, Wi]), e.g. using regression

I Propensity-score weighted regression of Y on X, W (doubly robust)

The choice among these methods is widely studied:

I Other popular methods include matching, propensity score matching, and propensity score blocking, which are not efficient but often do better in practice.

I Note: Hirano, Imbens, Ridder (2003) establish that it is more efficient to weight by the estimated propensity score than the actual one.

Page 157: Machine Learning and Econometrics

Regression Case

Suppose that the conditional mean function is given by

μ(w, x) = β(w) · x.

If we estimate using OLS, then we can estimate the ATE as

ÂTE = X̄ · (β̂(1) − β̂(0))

Note that OLS is unbiased and efficient, so the above quantity converges to the true value at rate √n:

X̄ · (β̂(1) − β̂(0)) − μ_x · (β(1) − β(0)) = O_p( 1/√n )

Page 158: Machine Learning and Econometrics

High-Dimensional Analogs??

Obvious possibility: substitute in the lasso (or ridge, or elastic net) for OLS. But bias is a big problem. With the lasso, for each component j:

β̂(w)_j − β(w)_j = O_p( √( log(p) / n ) )

This adds up across all dimensions, so that we can only guarantee for the ATT:

ÂTT − ATT = O_p( √( log(p) / n ) · ‖X̄1 − X̄0‖_∞ · ‖β(0)‖_0 )

Page 159: Machine Learning and Econometrics

Imposing Sparsity: LASSO Crash Course

Assume a linear model, and that there are at most a fixed number k of non-zero coefficients: ‖β‖_0 ≤ k. Suppose X satisfies a “restricted eigenvalue” condition: no small group of variables is nearly collinear.

‖β̂ − β‖_2 = O_p( √( k · log(p) / n ) )

‖β̂ − β‖_1 = O_p( k · √( log(p) / n ) )

With the “de-biased lasso” (post-LASSO OLS) we can even build confidence intervals on β if k << √n / log(p).

Applying to the ATT, where we need to estimate X̄1 · β(0) (the counterfactual outcome for treated observations had they been control instead):

ÂTT − ATT = O_p( k · ‖X̄1 − X̄0‖_∞ · √( log(p) / n ) )

Page 160: Machine Learning and Econometrics

Improving the Properties of ATE Estimation in High Dimensions: A “Double-Selection” Method

Belloni, Chernozhukov, and Hansen (2013) observe that causal inference is not an off-the-shelf prediction problem: confounders might be important if they have a large effect on outcomes OR a large effect on treatment assignment. They propose (see the sketch below):

I Run a LASSO of W on X. Select variables with non-zero coefficients at a selected λ (e.g. by cross-validation).

I Run a LASSO of Y on X. Select variables with non-zero coefficients at a selected λ (may be different from the first λ).

I Run an OLS of Y on W and the union of the selected variables. (Not as good at purely predicting Y as using only the second set.)

Result: under “approximate sparsity” of BOTH the propensity and outcome models, and constant treatment effects, the estimated ATE is asymptotically normal and estimation is efficient.
Intuition: with enough data, we can find the variables relevant for bias. With approximate sparsity and a constant treatment effect, there aren’t too many, and OLS will be unbiased.
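A hedged sketch of the double-selection idea (my own, assuming arrays X, W, Y; for simplicity it uses a linear rather than logistic lasso for the treatment equation):

import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

sel_w = LassoCV(cv=5).fit(X, W).coef_ != 0   # variables predicting treatment
sel_y = LassoCV(cv=5).fit(X, Y).coef_ != 0   # variables predicting outcome
keep = sel_w | sel_y                         # union of the two selected sets

# OLS of Y on W and the union of selected covariates; the coefficient on W
# (second column, after the constant) is the treatment effect estimate.
design = sm.add_constant(np.column_stack([W, X[:, keep]]))
ols = sm.OLS(Y, design).fit()
print("estimated treatment effect:", ols.params[1])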

Page 161: Machine Learning and Econometrics

Single v. Double Selection in BCH Algorithm

Page 162: Machine Learning and Econometrics

More General Results

Belloni, Chernozhukov, Fernandez-Val and Hansen (2016) (http://arxiv.org/abs/1311.2645, forthcoming Econometrica) have a variety of generalizations:

I Applies the general approach to IV

I Allows for a continuum of outcome variables

I Observes that nuisance parameters can be estimated generally using ML methods without affecting the convergence rate, subject to orthogonality conditions

I Shows how to use a framework based on orthogonality in moment conditions

Page 163: Machine Learning and Econometrics

Doubly Robust Methods

With small data, a “doubly robust” estimator (though not the typical one, where typically people use inverse propensity score weighted regression) is (with γi = 1/e(Xi)):

μ̂01 = X̄1 · β̂(0) + Ê_{i: Wi=0} [ γi (Yi − Xi β̂(0)) ]

To see why, note that the term in parentheses goes to 0 if we estimate β(0) well, while to show that we get the right answer if we estimate the propensity score well, we rearrange the expression to be

μ̂01 = ( X̄1 − Ê_{i: Wi=0}(γi Xi) ) β̂(0) + Ê_{i: Wi=0}(γi Yi)

The first term has expectation 0, and the second term gives the relevant counterfactual, if the propensity score is well-estimated.
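For concreteness, here is a hedged sketch of the “typical” doubly robust (AIPW) ATE estimator mentioned above, rather than the slide’s exact μ01 expression (my own illustration, assuming arrays X, W, Y and simple linear/logistic nuisance models):

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

e_hat = LogisticRegression(max_iter=1000).fit(X, W).predict_proba(X)[:, 1]
mu1_hat = LinearRegression().fit(X[W == 1], Y[W == 1]).predict(X)
mu0_hat = LinearRegression().fit(X[W == 0], Y[W == 0]).predict(X)

# Outcome-model prediction plus inverse-propensity-weighted residual correction.
psi1 = mu1_hat + W * (Y - mu1_hat) / e_hat
psi0 = mu0_hat + (1 - W) * (Y - mu0_hat) / (1 - e_hat)
ate_aipw = np.mean(psi1 - psi0)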

Page 164: Machine Learning and Econometrics

Doubly Robust Methods: A High-Dimensional Analog?

μ̂01 = X̄1 · β̂(0) + Ê_{i: Wi=0} [ γi (Yi − Xi β̂(0)) ]

How does this relate to the truth?

μ̂01 − μ01 = X̄1 · (β̂(0) − β(0)) + Ê_{i: Wi=0} [ γi (εi + Xi β(0) − Xi β̂(0)) ]
          = ( X̄1 − γ′X0 ) · (β̂(0) − β(0)) + Ê_{i: Wi=0} (γi εi)

With high dimensions, we could try to estimate β and the propensity score with LASSO or post-LASSO rather than OLS. However, this may not be good enough. It is also not clear how to get good estimates of the inverse propensity score weights γi, in particular if we don’t want to assume that the propensity model is sparse (e.g. if the treatment assignment is a complicated function of confounders).

Page 165: Machine Learning and Econometrics

Residuals on Residuals

I Small data approach (à la Robinson, 1988): analyze a semi-parametric model
I Model: Yi = τ Wi + g(Xi) + εi
I Goal: estimate τ
I Approach: residuals on residuals gives a √n-consistent and asymptotically normal estimator
I Regress Yi − ĝ(Xi) on Wi − Ê[Wi | Xi]

Page 166: Machine Learning and Econometrics

Double Machine Learning

I Chernozhukov et al. (2017):
I Model: Yi = τ Wi + g(Xi) + εi, with E[Wi | Xi] = h(Xi)
I Goal: estimate τ
I Use a modern machine learning method like random forests to estimate the “nuisance parameters”
I Regress Yi − ĝ(Xi) on Wi − ĥ(Xi)
I If the ML method converges at rate n^(1/4), residuals on residuals gives a √n-consistent and asymptotically normal estimator (see the sketch below)
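A hedged sketch of the residual-on-residual recipe with simple two-fold cross-fitting (my own illustration, assuming arrays X, W, Y; it estimates the conditional means E[Y|X] and E[W|X] with random forests, standing in for the nuisance functions g and h above):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

res_y = np.zeros(len(Y))
res_w = np.zeros(len(Y))
for fit_idx, out_idx in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    # Fit nuisance models on one fold, form residuals on the other fold.
    g_hat = RandomForestRegressor(n_estimators=200).fit(X[fit_idx], Y[fit_idx])
    h_hat = RandomForestRegressor(n_estimators=200).fit(X[fit_idx], W[fit_idx])
    res_y[out_idx] = Y[out_idx] - g_hat.predict(X[out_idx])
    res_w[out_idx] = W[out_idx] - h_hat.predict(X[out_idx])

tau_hat = np.sum(res_w * res_y) / np.sum(res_w ** 2)   # residual-on-residual OLS slope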

Page 167: Machine Learning and Econometrics

Comparing Straight Regression to Double ML

I Moments used in estimation:
I Regression: E[ (Yi − Wi τ − g(Xi)) · Wi ] = 0
I Double ML: E[ ( (Yi − g(Xi)) − (Wi − h(Xi)) τ ) · (Wi − h(Xi)) ] = 0

I Double robustness and orthogonality: Robinson’s result implies that if ĝ(Xi) is consistent, then τ is the regression coefficient of the residual-on-residual regression, and even if ĥ is wrong, the orthogonality of the residual of the outcome regression and the residual Wi − h still holds

I Neyman orthogonality: the Double ML moment condition has the property that when evaluated at ĝ = g and ĥ = h, small changes in either of them do not change the moment condition. The moment condition is minimized at the truth.

I You are robust to small mistakes in estimation of the nuisance parameters, unlike the regression approach

Page 168: Machine Learning and Econometrics

Comparing Straight Regression to Double ML

Page 169: Machine Learning and Econometrics

An Efficient Approach with Non-Sparse Propensity

The solution proposed in Athey, Imbens and Wager (2016) for attacking the gap

μ̂01 − μ01 = ( X̄1 − γ′X0 ) · (β̂(0) − β(0)) + Ê_{i: Wi=0} (γi εi)

is to bound the first term by selecting the γi’s using brute force. In particular:

γ = argmin_γ { ζ · ‖X̄1 − γ′X0‖_∞ + (1 − ζ) · ‖γ‖_2² }

The parameter ζ is a tuning parameter; the paper shows that a ζ exists such that the γ’s tightly bound the first term above.

With overlap, we can make ‖X̄1 − γ′X0‖_∞ be O( √( log(p)/n ) ).

Result: if the outcome model is sparse, estimating β using the LASSO yields ‖β̂(0) − β(0)‖_1 = O_p( k · √( log(p)/n ) ), so the bias term is O( k · log(p)/n ); for k small enough, the last term involving γi εi dominates, and the ATE estimator is O( 1/√n ).

Page 170: Machine Learning and Econometrics

Why Approximately Balancing Beats Propensity Weighting

One question is why the balancing weights perform better than the propensity score weights. To gain intuition, suppose the propensity score has the following logistic form,

e(x) = exp(x · θ) / (1 + exp(x · θ)).

After normalization, the inverse propensity score weights satisfy

γi ∝ exp(x · θ).

The efficient estimator for θ is the maximum likelihood estimator,

θ̂_ml = argmax_θ Σ_{i=1}^n { Wi Xi · θ − ln(1 + exp(Xi · θ)) }.

An alternative is the method of moments estimator θ̂_mm that balances the covariates exactly:

X̄1 = Σ_{i: Wi=0} Xi exp(Xi · θ) / Σ_{j: Wj=0} exp(Xj · θ).

Page 171: Machine Learning and Econometrics

Why Approximately Balancing Beats Propensity Weighting

An alternative is the method of moments estimator θ̂_mm that balances the covariates exactly:

X̄1 = Σ_{i: Wi=0} Xi exp(Xi · θ) / Σ_{j: Wj=0} exp(Xj · θ),

with implied weights γi ∝ exp(Xi · θ̂_mm).

I The only difference between the two sets of weights is that the parameter estimates θ̂ differ.

I The estimator θ̂_mm leads to weights that achieve exact balance on the covariates, in contrast to either the true value θ or the maximum likelihood estimator θ̂_ml.

I The goal of balancing (leading to θ̂_mm) is different from the goal of estimating the propensity score (for which θ̂_ml is optimal).

Page 172: Machine Learning and Econometrics

Summarizing the Approximate Residual Balancing Method of Athey, Imbens, Wager (2016)

I Estimate a lasso (or elastic net) of Y on X in the control group.

I Find “approximately balancing” weights that make the control group look like the treatment group in terms of covariates, while attending to the sum of squares of the weights. With many covariates, balance is not exact.

I Adjust the lasso prediction of the counterfactual outcome for the treatment group (if it had been control) using the approximately balancing weights to take a weighted average of the residuals from the lasso model. (A simplified sketch of these three steps follows below.)

Main result: if the model relating outcomes to covariates is sparse, and there is overlap, then this procedure achieves the semi-parametric efficiency bound. No other method is known to do this for non-sparse propensity models.
Simulations show that it performs much better than alternatives when the propensity is not sparse.
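A simplified, hedged sketch of the three steps (my own illustration, not the paper’s exact optimization; assumes arrays X, W, Y, and a hypothetical tuning parameter zeta):

import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import ElasticNetCV

X0, Y0, X1 = X[W == 0], Y[W == 0], X[W == 1]
x1_bar = X1.mean(axis=0)

# Step 1: regularized outcome model in the control group.
enet = ElasticNetCV(cv=5).fit(X0, Y0)

# Step 2: approximately balancing weights on controls (zeta trades off
# covariate imbalance against the sum of squared weights).
zeta, n0 = 0.5, X0.shape[0]
def objective(g):
    imbalance = np.max(np.abs(x1_bar - X0.T @ g))
    return zeta * imbalance + (1 - zeta) * np.sum(g ** 2)
res = minimize(objective, np.full(n0, 1.0 / n0), method="SLSQP",
               bounds=[(0.0, None)] * n0,
               constraints=[{"type": "eq", "fun": lambda g: g.sum() - 1.0}])
gamma = res.x

# Step 3: model prediction plus weighted average of control residuals.
mu0_treated = x1_bar @ enet.coef_ + enet.intercept_ \
              + gamma @ (Y0 - enet.predict(X0))
att_hat = Y[W == 1].mean() - mu0_treated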

Page 173: Machine Learning and Econometrics

Simulation Experiment

[Scatter plot of the simulated “clustered” design X; plot residue omitted.]

The design X is “clustered.” We study the following settings for β:

Dense: β ∝ (1, 1/√2, ..., 1/√p),

Harmonic: β ∝ (1/10, 1/11, ..., 1/(p + 9)),

Moderately sparse: β ∝ (10, ..., 10, 1, ..., 1, 0, ..., 0) (10 entries equal to 10, 90 entries equal to 1, p − 100 zeros),

Very sparse: β ∝ (1, ..., 1, 0, ..., 0) (10 ones, p − 10 zeros).

Page 174: Machine Learning and Econometrics

Simulation Experiment

Beta Model                 dense          harmonic       moderately sparse  very sparse
Overlap (η)                0.1     0.25   0.1     0.25   0.1     0.25       0.1     0.25

Naive                      0.672   0.498  0.688   0.484  0.686   0.484      0.714   0.485
Elastic Net                0.451   0.302  0.423   0.260  0.181   0.114      0.031   0.021
Approximate Balance        0.470   0.317  0.498   0.292  0.489   0.302      0.500   0.302
Approx. Resid. Balance     0.412   0.273  0.399   0.243  0.172   0.111      0.030   0.021
Inverse Prop. Weight       0.491   0.396  0.513   0.376  0.513   0.388      0.533   0.380
Inv. Prop. Resid. Weight   0.463   0.352  0.479   0.326  0.389   0.273      0.363   0.248
Double-Select + OLS        0.679   0.368  0.595   0.329  0.239   0.145      0.047   0.023

Simulation results, with n = 300 and p = 800. Approximate residual balancing estimates β using the elastic net. Inverse propensity residual weighting is like our method, except with γi = 1/e(Xi). We report root-mean-squared error for τ̂.

Observation: weighting regression residuals works better than weighting the original data; balanced weighting works better than inverse-propensity weighting.

Page 175: Machine Learning and Econometrics

Simulation Experiment

                 βj ∝ 1({j ≤ 10})      βj ∝ 1/j²            βj ∝ 1/j
n     p          η = 0.25   η = 0.1    η = 0.25   η = 0.1   η = 0.25   η = 0.1

200   400        0.90       0.84       0.94       0.88      0.84       0.71
200   800        0.86       0.76       0.92       0.85      0.82       0.71
200   1600       0.84       0.74       0.93       0.85      0.85       0.73
400   400        0.94       0.90       0.97       0.93      0.90       0.78
400   800        0.93       0.91       0.95       0.90      0.88       0.76
400   1600       0.93       0.88       0.94       0.90      0.86       0.76
800   400        0.96       0.95       0.98       0.96      0.96       0.90
800   800        0.96       0.94       0.97       0.96      0.94       0.90
800   1600       0.95       0.92       0.97       0.95      0.93       0.86

We report coverage of τ for 95% confidence intervals constructed by approximate residual balancing.

Page 176: Machine Learning and Econometrics

Simulation Experiment

[Figure: treatment effect as a function of X, together with the distributions of X for treated and control units.]

We are in a misspecified linear model; the “main effects” model is 10-sparse and linear.

Page 177: Machine Learning and Econometrics

Simulation Experiment

n                          400                               1000
p                          100   200   400   800   1600     100   200   400   800   1600

Naive                      1.72  1.73  1.73  1.72  1.74     1.71  1.70  1.72  1.70  1.72
Elastic Net                0.44  0.46  0.50  0.51  0.54     0.37  0.39  0.39  0.40  0.42
Approximate Balance        0.48  0.55  0.61  0.63  0.70     0.24  0.30  0.38  0.40  0.45
Approx. Resid. Balance     0.24  0.26  0.28  0.29  0.32     0.16  0.17  0.18  0.19  0.20
Inverse Prop. Weight       1.04  1.07  1.11  1.13  1.18     0.82  0.84  0.88  0.89  0.94
Inv. Prop. Resid. Weight   1.29  1.30  1.31  1.31  1.33     1.25  1.25  1.26  1.25  1.28
Double-Select + OLS        0.28  0.29  0.31  0.31  0.34     0.24  0.25  0.25  0.25  0.26

Approximate residual balancing estimates β using the elastic net. Inverse propensity residual weighting is like our method, except with γi = 1/e(Xi). We report root-mean-squared error for τ1.

Page 178: Machine Learning and Econometrics

Estimating the Effect of a Welfare-to-Work Program

Data from the California GAIN Program, as in Hotz et al. (2006).

I Program separately randomized in: Riverside, Alameda, Los Angeles, San Diego.

I Outcome: mean earnings over the next 3 years.

I We hide county information. Seek to compensate with p = 93 controls.

I Full dataset has n = 19170.

[Figure: coverage as a function of n for Oracle, Approx. Resid. Balance, Double Select + OLS, Lasso Resid. IPW, and No Correction.]

Page 179: Machine Learning and Econometrics

Closing Thoughts

What are the pros and cons of approximate residual balancing vs. inverse-propensity residual weighting?

Pros of balancing:

I Works under weaker assumptions (only overlap).

I Algorithmic transparency.

I ...

Pros of propensity methods:

I Potential for double robustness.

I Potential for efficiency under heteroskedasticity.

I Generalizations beyond linearity.

I ...

Page 180: Machine Learning and Econometrics

An Introduction to Regression Trees (CART)

Susan Athey, Stanford University
Machine Learning and Causal Inference

Page 181: Machine Learning and Econometrics

• What is the goal of prediction?
• Machine learning answer: smallest mean-squared error in a test set

• Formally: let S_te be a test set.
• Think of this as a random draw of individuals from a population.
• Let μ̂ be a candidate (estimated) predictor. The MSE on the test set is:

MSE_te = (1/|S_te|) Σ_{i ∈ S_te} (Yi − μ̂(Xi))²

Page 182: Machine Learning and Econometrics

Regression Trees
• Simple method for prediction
• Partition the data into subsets by covariates
• Predict using the average within each subset

Why are regression trees popular?
• Easy to understand and explain
• Businesses often need “segments”
• Software assigns different algorithms to different segments
• Can completely describe the algorithm and its interpretation

Page 183: Machine Learning and Econometrics

Example: Who survived the Titanic?

Page 184: Machine Learning and Econometrics

Regression Trees for Prediction

Data
• Outcomes Yi, attributes Xi. Support of Xi is X.
• Have a training sample with independent obs.
• Want to predict on a new sample

Build a “tree”:
• Partition X into “leaves” X_j
• Predict Y conditional on the realization of X in each region X_j using the sample mean in that region
• Go through variables and leaves and decide whether and where to split leaves (creating a finer partition) using an in-sample goodness-of-fit criterion
• Select tree complexity using cross-validation based on prediction quality

Page 185: Machine Learning and Econometrics

Regression Trees for Prediction

Outcome: binary (Y in {0, 1})
Two covariates

Goal: predict Y as a function of X
“Classify” units as a function of X according to whether they are more likely to have Y = 0 or Y = 1

Page 186: Machine Learning and Econometrics

Regression Trees for Prediction

(I) Tree-building: use an algorithm to partition the data according to covariates (adaptive: do this based on the difference in mean outcomes in different potential leaves)
(II) Estimation/prediction: calculate mean outcomes in each leaf
(III) Use cross-validation to select the tree complexity penalty

Page 187: Machine Learning and Econometrics

Tree Building Details
• Impossible to search over all possible partitions, so use a greedy algorithm
• Do until all leaves have fewer than 2*minsize obs:
– For each leaf:
– For each observed value x_jk of each covariate j:
◦ Consider splitting the leaf into two children according to whether X_j ≤ x_jk
◦ Make new predictions in each candidate child according to the sample mean
◦ Calculate the improvement in “fit” (MSE)
– Select the covariate j and the cutoff value that lead to the greatest improvement in MSE; split the leaf into two child leaves (a sketch of this split search follows below)

Observations
• In-sample MSE always improves with additional splits
• What is the MSE when each leaf has one observation?
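A minimal sketch of the inner split search (my own illustration, assuming numpy arrays x, the covariates of observations in one leaf, and y, their outcomes):

import numpy as np

def best_split(x, y, minsize=5):
    """Scan every covariate and observed cutoff; return the split with the
    lowest total squared error when each child predicts with its sample mean."""
    best = None  # (sse, covariate index, cutoff)
    for j in range(x.shape[1]):
        for cutoff in np.unique(x[:, j]):
            left = x[:, j] <= cutoff
            if left.sum() < minsize or (~left).sum() < minsize:
                continue  # skip splits that create too-small children
            sse = np.sum((y[left] - y[left].mean()) ** 2) \
                + np.sum((y[~left] - y[~left].mean()) ** 2)
            if best is None or sse < best[0]:
                best = (sse, j, cutoff)
    return best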

Page 188: Machine Learning and Econometrics

Problem: the tree has been “over-fitted”
• Suppose we fit a tree and pick a particular leaf ℓ. Do we expect that if we drew a new sample, we would get the same answer?

• More formally: let S_tr be the training dataset and S_te be an independent test set
• Let μ̂(ℓ) be the sample mean of Yi over training observations with Xi ∈ ℓ:

μ̂(ℓ) = (1 / #{i ∈ S_tr : Xi ∈ ℓ}) Σ_{i ∈ S_tr : Xi ∈ ℓ} Yi

• Is μ̂(ℓ) ≈ E_{S_te}[Yi | Xi ∈ ℓ]?

Page 189: Machine Learning and Econometrics

What are the tradeoffs in tree depth?
• First: note that in-sample MSE doesn’t guide you
– It always improves (decreases) with depth

Tradeoff as you grow the tree deeper
• More personalized predictions
• More biased estimates

Page 190: Machine Learning and Econometrics

Regression Trees for Prediction: Components

1. Model and Estimation
A. Model type: tree structure
B. Estimator: sample mean of Yi within each leaf
C. Set of candidate estimators C: correspond to different specifications of how the tree is split

2. Criterion function (for fixed tuning parameter λ)
A. In-sample goodness-of-fit function: Qis = −MSE = −(1/N) Σi (Yi − μ̂(Xi))²
B. Structure and use of criterion
i. Criterion: Qcrit = Qis − λ × (# leaves)
ii. Select the member of the set of candidate estimators that maximizes Qcrit, given λ

3. Cross-validation approach
A. Approach: cross-validation on a grid of tuning parameters. Select the tuning parameter with the highest out-of-sample goodness-of-fit Qos.
B. Out-of-sample goodness-of-fit function: Qos = −MSE

Page 191: Machine Learning and Econometrics

How Does Cross Validation Work?

[Diagram: the data are split into a training piece and a tuning piece, with the tuning set equal to 1/5 of the training set; within cross-validation the training data are further divided into CV-training and CV-tuning folds.]

Page 192: Machine Learning and Econometrics

Cross-Validation Mechanics
1. Loop over cross-validation samples: train a deep tree on the CV-training subset.
2. Loop over penalty parameters; within that, loop over cross-validation samples: prune the tree according to the penalty and calculate the new MSE of the tree.
3. Average the MSE over the c-v samples for each penalty.
4. Choose the penalty λ* that gives the best average MSE.
(A sketch in R follows below.)
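As referenced above, a hedged R sketch of these mechanics using the rpart package; the data frame df with outcome column y, the penalty grid, and the fold count are all hypothetical.

library(rpart)

cv_choose_cp <- function(df, cp_grid, K = 5) {
  folds  <- sample(rep(1:K, length.out = nrow(df)))
  cv_mse <- matrix(NA, nrow = K, ncol = length(cp_grid))
  for (k in 1:K) {
    train <- df[folds != k, ]
    tune  <- df[folds == k, ]
    # Train one deep tree on the CV-training subset (cp = 0: no penalty yet)
    deep <- rpart(y ~ ., data = train, method = "anova",
                  control = rpart.control(cp = 0, minsplit = 20, xval = 0))
    for (m in seq_along(cp_grid)) {
      pruned <- prune(deep, cp = cp_grid[m])     # prune according to this penalty
      pred   <- predict(pruned, newdata = tune)
      cv_mse[k, m] <- mean((tune$y - pred)^2)    # new MSE of the pruned tree
    }
  }
  avg_mse <- colMeans(cv_mse)      # average over c-v samples for each penalty
  cp_grid[which.min(avg_mse)]      # penalty with the best average MSE
}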

Page 193: Machine Learning and Econometrics

Choosing the penalty parameter

Page 194: Machine Learning and Econometrics

Some example code
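The code from this slide is not preserved in the transcript. As a stand-in, here is a minimal rpart sketch of growing a regression tree on a hypothetical data frame df with outcome log_consumption; printcp and plotcp report the cross-validated error for each complexity value.

library(rpart)
fit <- rpart(log_consumption ~ ., data = df, method = "anova",
             control = rpart.control(cp = 0.001, minsplit = 50, xval = 10))
printcp(fit)   # table of complexity values and their cross-validated error
plotcp(fit)    # plot of cross-validated error against the complexity penalty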

Page 195: Machine Learning and Econometrics
Page 196: Machine Learning and Econometrics

Pruning Code
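The pruning code from this slide is likewise not preserved; a sketch, assuming the fit object from the earlier rpart sketch, that prunes back to the complexity value with the smallest cross-validated error.

cp_table <- fit$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned   <- prune(fit, cp = best_cp)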

Page 197: Machine Learning and Econometrics

A Basic Policy Problem Every transfer program in the world must determine… Who is eligible for the transfer

Typical goal of redistributive programs Transfer to neediest

But identifying the neediest is easier said than done

Thanks to Sendhil Mullainathan for providing this worked out example….

Page 198: Machine Learning and Econometrics

Typical Poverty Scorecard

Page 199: Machine Learning and Econometrics
Page 200: Machine Learning and Econometrics

Can we do better? This component of targeting is a pure prediction problem

We fundamentally care about getting best predictive accuracy

Let’s use this example to illustrate the mechanics of prediction

Page 201: Machine Learning and Econometrics

Brazilian Data. The data: 44,787 data points and 53 variables. Not very wide?

Median annual consumption (in dollars): 3,918; monthly income: 348.85.

6 percent are below the $1.90 (per day) poverty line; 14 percent are below the $3.10 poverty line.

Page 202: Machine Learning and Econometrics

Consumption

Page 203: Machine Learning and Econometrics

log (consumption)

Page 204: Machine Learning and Econometrics

50th Percentile

Page 205: Machine Learning and Econometrics

25th Percentile

Page 206: Machine Learning and Econometrics

10th Percentile

Page 207: Machine Learning and Econometrics
Page 208: Machine Learning and Econometrics
Page 209: Machine Learning and Econometrics
Page 210: Machine Learning and Econometrics
Page 211: Machine Learning and Econometrics

Two Variable Tree

Page 212: Machine Learning and Econometrics
Page 213: Machine Learning and Econometrics
Page 214: Machine Learning and Econometrics

[Diagram: 28,573 data points to fit with; five blocks of 8,061 observations each.]

Fit trees on 4/5 of the data; fit a tree for every level of split size.

Set of Trees

Page 215: Machine Learning and Econometrics

[Diagram: 28,573 data points to fit with; five blocks of 8,061 observations, each fold producing its own set of trees.]

REPEAT, leaving each fold out.

Page 216: Machine Learning and Econometrics

Overfit Dominates

Page 217: Machine Learning and Econometrics

Why are we tuning on number of splits?

Page 218: Machine Learning and Econometrics

Questions and Observations
How do we choose the hold-out set size? How do we choose the number of folds? What do we tune on (the regularizer)?

Page 219: Machine Learning and Econometrics

What are these standard errors?

Page 220: Machine Learning and Econometrics

Questions and Observations
How do we choose the hold-out set size? How do we choose the number of folds? What do we tune on (the regularizer)? Which tuning parameter do we choose from cross-validation?

Page 221: Machine Learning and Econometrics

Tuning Parameter Choice: the minimum?

Or the one-standard-error "rule" (rule of thumb): and in which direction from the minimum? (A small sketch follows below.)
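A sketch of the one-standard-error rule using rpart's cptable (columns "xerror" and "xstd" hold the cross-validated error and its standard error); the direction is toward the simpler tree. Assumes the fit object from the earlier rpart sketch.

ct      <- fit$cptable                      # rows run from simplest to most complex tree
min_row <- which.min(ct[, "xerror"])
thresh  <- ct[min_row, "xerror"] + ct[min_row, "xstd"]
# the simplest tree whose cross-validated error is within one SE of the minimum
cp_1se  <- ct[min(which(ct[, "xerror"] <= thresh)), "CP"]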

Page 222: Machine Learning and Econometrics

Output Which of these many trees do we output?

Even after choosing lambda we have as many trees as folds…

Estimate one tree on full data using chosen cut size

Key point: Cross validation is just for choosing tuning parameter Just for deciding how complex a model to choose

Page 223: Machine Learning and Econometrics

Questions and Observations
How do we choose the hold-out set size? How do we choose the number of folds? What do we tune on (the regularizer)? Which tuning parameter do we choose from cross-validation?

Is there a problem with tuning on subsets and then outputting the fitted value on the full set?

Page 224: Machine Learning and Econometrics

Let's look at the predictions. Notice something?

Page 225: Machine Learning and Econometrics

WHY?

Page 226: Machine Learning and Econometrics

What does the tree look like?

Page 227: Machine Learning and Econometrics
Page 228: Machine Learning and Econometrics

What else can we look at to get a sense of what the predictions are?

Page 229: Machine Learning and Econometrics

Variable Importance

For each covariate: the empirical loss after "noising up" (e.g., permuting) that covariate, minus the empirical loss of the original model. (A small sketch follows below.)
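A minimal R sketch of this permutation ("noising up") importance measure for a fitted predictor fit, evaluated on a hypothetical tuning set tune with outcome column y; larger values mean the covariate matters more for predictive loss.

perm_importance <- function(fit, tune, outcome = "y") {
  base_mse <- mean((tune[[outcome]] - predict(fit, newdata = tune))^2)
  vars <- setdiff(names(tune), outcome)
  sapply(vars, function(v) {
    noised <- tune
    noised[[v]] <- sample(noised[[v]])    # permute ("noise up") one covariate
    mean((tune[[outcome]] - predict(fit, newdata = noised))^2) - base_mse
  })
}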

Page 230: Machine Learning and Econometrics

How to describe model

Large discussion of “interpretability” Will return to this

But one implication is that the prediction function itself becomes a new y variable to analyze.

Is any of this stable? What would a confidence interval look like?

Page 231: Machine Learning and Econometrics

Questions and Observations
How do we choose the hold-out set size? How do we choose the number of folds? What do we tune on (the regularizer)? Which tuning parameter do we choose from cross-validation?

Is there a problem with tuning on subsets and then outputting the fitted value on the full set?

What is stable/robust about the estimated function?

Page 232: Machine Learning and Econometrics

Measuring Performance

Page 233: Machine Learning and Econometrics
Page 234: Machine Learning and Econometrics

Measuring Performance Area Under Curve: Typical measure of performance

What do you think of this measure?

Page 235: Machine Learning and Econometrics

What fraction of the poor do we reach?

Page 236: Machine Learning and Econometrics

Measuring Performance AUC: Typical measure of performance

What do you think of this measure?

Getting a domain-specific, meaningful performance measure: magnitudes matter, and you need a point of comparison. (A sketch of one such measure follows below.)
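One hedged example of such a domain-specific measure in R: among households whose predicted consumption falls in the bottom q share, the fraction of the truly poor (consumption below a chosen poverty line) that the targeting rule reaches. All names are hypothetical.

reach_rate <- function(pred, consumption, poverty_line, q = 0.25) {
  targeted <- pred <= quantile(pred, q)         # bottom-q share by predicted consumption
  poor     <- consumption <= poverty_line       # truly poor households
  sum(targeted & poor) / sum(poor)              # fraction of the poor reached
}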

Page 237: Machine Learning and Econometrics

What fraction of the poor do we reach? Confidence intervals?

Page 238: Machine Learning and Econometrics

This is what we want from econometric theorems:
How do we choose the hold-out set size? How do we choose the number of folds? What do we tune on (the regularizer)? Which tuning parameter do we choose from cross-validation?

Is there a problem with tuning on subsets and then outputting the fitted value on the full set?

What is stable/robust about the estimated function?

How do we form standard errors on performance?

Page 239: Machine Learning and Econometrics

Summary. Regression trees are easy to understand and interpret. There is a tradeoff between personalized and inaccurate predictions; cross-validation is a tool to figure out the best balance in a particular dataset (e.g., if the truth is complex, you may want to go deeper).

CART is ad hoc, but works well in practice. It loses to OLS/logit if the true model is linear, but it is good at finding lots of complex interactions.

Page 240: Machine Learning and Econometrics

Heterogeneous Treatment Effects and Parameter Estimation with Causal Forests and Gradient Forests

Susan Athey, Stanford University

Machine Learning and Econometrics

See Wager and Athey (forthcoming, JASA) and Athey, Tibshirani and Wager, https://arxiv.org/abs/1610.01271

Page 241: Machine Learning and Econometrics

Treatment Effect Heterogeneity

Heterogeneous Treatment Effects

I Insight about mechanisms

I Designing policies, selecting groups for application/eligibility

I Personalized policies

Literature: Many Covariates

I See Wager and Athey (2015) and Athey and Imbens (2016) for ML-based analyses and many references on treatment effect heterogeneity

I Imai and Ratkovic (2013) analyze treatment effect heterogeneity with LASSO

I Targeted ML (van der Laan, 2006) can be used as a semi-parametric approach to estimating treatment effect heterogeneity

Page 242: Machine Learning and Econometrics

ML Methods for Causal Inference: Treatment Effect Heterogeneity

I ML methods perform well in practice, but many do not have well-established statistical properties (see Chen and White (1999) for an early analysis of neural nets)

I Unlike prediction, ground truth for causal parameters is not directly observed

I Need valid confidence intervals for many applications (A/B testing, drug trials); challenges include adaptive model selection and multiple testing

I Different possible questions of interest, e.g.:
I Identifying subgroups (Athey and Imbens, 2016)
I Testing for heterogeneity across all covariates (List, Shaikh, and Xu, 2016)
I Robustness to model specification (Athey and Imbens, 2015)
I Personalized estimates (Wager and Athey, 2015; Taddy et al. 2014; others)

Page 243: Machine Learning and Econometrics

The potential outcomes framework

For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple (Xi, Yi, Wi), comprised of

I A feature vector Xi ∈ Rp,

I A response Yi ∈ R, and

I A treatment assignment Wi ∈ {0, 1}.

Following the potential outcomes framework (Holland, 1986; Imbens and Rubin, 2015; Rosenbaum and Rubin, 1983; Rubin, 1974), we posit the existence of quantities Yi(0) and Yi(1).

I These correspond to the response we would have measured given that the i-th subject received treatment (Wi = 1) or no treatment (Wi = 0).

Page 244: Machine Learning and Econometrics

The potential outcomes framework

For a set of i.i.d. subjects i = 1, ..., n, we observe a tuple (Xi, Yi, Wi), comprised of

I A feature vector Xi ∈ Rp,

I A response Yi ∈ R, and

I A treatment assignment Wi ∈ {0, 1}.

Goal is to estimate the conditional average treatment effect

τ(x) = E[Y(1) − Y(0) | X = x].

NB: In experiments, we only get to see Yi = Yi(Wi).

Page 245: Machine Learning and Econometrics

The potential outcomes framework

If we make no further assumptions, estimating τ(x) is not possible.

I Literature often assumes unconfoundedness (Rosenbaum and Rubin, 1983):

{Yi(0), Yi(1)} ⊥⊥ Wi | Xi.

I When this assumption holds, methods based on matching or propensity score estimation are usually consistent.

Page 246: Machine Learning and Econometrics

Baseline method: k-NN matching

Consider the k-NN matching estimator for τ(x):

τ̂(x) = (1/k) Σ{i ∈ S1(x)} Yi − (1/k) Σ{i ∈ S0(x)} Yi,

where S1(x) / S0(x) is the set of the k nearest cases / controls to x. This is consistent given unconfoundedness and regularity conditions.

I Pro: Transparent asymptotics and good, robust performance when p is small.

I Con: Acute curse of dimensionality, even when p = 20 and n = 20k.

NB: Kernels have similar qualitative issues as k-NN. (A small sketch of the estimator follows below.)
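A small base-R sketch of the k-NN matching estimator at a single point x, where X is the covariate matrix, Y the outcomes, and W the treatment indicator (all hypothetical):

knn_tau <- function(x, X, Y, W, k = 10) {
  d <- sqrt(colSums((t(X) - x)^2))              # Euclidean distances to x
  near_treat   <- order(d[W == 1])[1:k]         # k nearest treated units
  near_control <- order(d[W == 0])[1:k]         # k nearest control units
  mean(Y[W == 1][near_treat]) - mean(Y[W == 0][near_control])
}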

Page 247: Machine Learning and Econometrics

Adaptive nearest neighbor matching

Random forests are a popular heuristic for adaptive nearest-neighbors estimation introduced by Breiman (2001).

I Pro: Excellent empirical track record.

I Con: Often used as a black box, without statistical discussion.

There has been considerable interest in using forest-like methods for treatment effect estimation, but without formal theory.

I Green and Kern (2012) and Hill (2011) have considered using Bayesian forest algorithms (BART, Chipman et al., 2010).

I Several authors have also studied related tree-based methods: Athey and Imbens (2016), Su et al. (2009), Taddy et al. (2014), Wang and Rudin (2015), Zeileis et al. (2008), ...

Wager and Athey (2015) provide the first formal results allowing random forests to be used for provably valid asymptotic inference.

Page 248: Machine Learning and Econometrics

Making k-NN matching adaptive. Athey and Imbens (2016) introduce the causal tree: it defines neighborhoods for matching based on recursive partitioning (Breiman, Friedman, Olshen, and Stone, 1984), and they advocate sample splitting (with a modified splitting rule) to get assumption-free confidence intervals for treatment effects in each leaf.

[Figure: a Euclidean neighborhood, as used for k-NN matching, versus a tree-based neighborhood.]

Page 249: Machine Learning and Econometrics

From trees to random forests (Breiman, 2001)

Suppose we have a training set {(Xi, Yi, Wi)}, i = 1, ..., n, a test point x, and a tree predictor

τ̂(x) = T(x; {(Xi, Yi, Wi)}ni=1).

Random forest idea: build and average many different trees T*:

τ̂(x) = (1/B) ΣBb=1 T*b(x; {(Xi, Yi, Wi)}ni=1).

Page 250: Machine Learning and Econometrics

From trees to random forests (Breiman, 2001)

Repeating the setup from the previous slide: the random forest prediction averages B trees, τ̂(x) = (1/B) ΣBb=1 T*b(x; {(Xi, Yi, Wi)}ni=1).

We turn T into T* by:

I Bagging / subsampling the training set (Breiman, 1996); this helps smooth over discontinuities (Buhlmann and Yu, 2002).

I Selecting the splitting variable at each step from m out of p randomly drawn features (Amit and Geman, 1997).

Page 251: Machine Learning and Econometrics

Statistical inference with regression forests

Honest trees do not use the same data to select the partition (splits) and make predictions. Examples: split-sample trees, propensity trees.

Theorem. (Wager and Athey, 2015) Regression forests are asymptotically Gaussian and centered,

(μ̂n(x) − μ(x)) / σn(x) ⇒ N(0, 1), with σ²n(x) →p 0,

given the following assumptions (plus technical conditions):

1. Honesty. Individual trees are honest.

2. Subsampling. Individual trees are built on random subsamples of size s of order nβ, where βmin < β < 1.

3. Continuous features. The features Xi have a density that is bounded away from 0 and ∞.

4. Lipschitz response. The conditional mean function μ(x) = E[Y | X = x] is Lipschitz continuous.

Page 252: Machine Learning and Econometrics

Valid Confidence Intervals

Athey and Imbens (2016) and Wager and Athey (2015) highlight the perils of adaptive estimation for confidence intervals: there is a tradeoff between MSE and coverage for trees, but not for forests.

[Figure panels: Single Tree versus Forests.]

Page 253: Machine Learning and Econometrics

Proof idea

Use the shorthand Zi = (Xi, Yi) for the training examples.

I The regression forest prediction is μ̂ := μ̂(Z1, ..., Zn).

I The Hajek projection of the regression forest is

μ̊ = E[μ̂] + Σni=1 (E[μ̂ | Zi] − E[μ̂]).

I Classical statistics (Hoeffding, Hajek) tells us that

Var[μ̊] ≤ Var[μ̂], and that limn→∞ Var[μ̊] / Var[μ̂] = 1

implies asymptotic normality.

Page 254: Machine Learning and Econometrics

Proof idea

Now, let μ̂*b(x) denote the estimate for μ(x) given by a single regression tree, and let μ̊*b be its Hajek projection.

I Using the adaptive nearest neighbors framework of Lin and Jeon (2006), we show that

Var[μ̊*b] / Var[μ̂*b] ≳ log⁻ᵖ(s).

I As a consequence of the ANOVA decomposition of Efron and Stein (1981), the full forest gets

Var[μ̊] / Var[μ̂] ≥ 1 − (s/n) · Var[μ̂*b] / Var[μ̊*b],

thus yielding the asymptotic normality result for s of order nβ for any 0 < β < 1.

I For centering, we bound the bias by requiring β > βmin.

Page 255: Machine Learning and Econometrics

Variance estimation for regression forests. We estimate the variance of the regression forest using the infinitesimal jackknife for random forests (Wager, Hastie, and Efron, 2014). For each of the b = 1, ..., B trees comprising the forest, define

I the estimated response as μ̂*b(x), and

I the number of times the i-th observation was used as N*bi.

Then, defining Cov* as the covariance taken with respect to all the trees comprising the forest, we set

σ̂²n(x) = ((n − 1)/n) (n/(n − s))² Σni=1 Cov*[μ̂*b(x), N*bi]².

Theorem. (Wager and Athey, 2015) Given the same conditions as used for asymptotic normality, the infinitesimal jackknife for regression forests is consistent:

σ̂²n(x) / σ²n(x) →p 1.

(A small sketch of this estimator follows below.)
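A hedged R sketch of this variance estimate, assuming a vector mu_b of per-tree predictions at x and a B x n matrix N_bi of subsample inclusion counts (both hypothetical); it simply mirrors the displayed formula.

ij_variance <- function(mu_b, N_bi, n, s) {
  # Cov* between the tree predictions and each observation's inclusion count
  covs <- apply(N_bi, 2, function(Ni) cov(mu_b, Ni))
  (n - 1) / n * (n / (n - s))^2 * sum(covs^2)
}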

Page 256: Machine Learning and Econometrics

Causal forest example. We have n = 20k observations whose features are distributed as X ∼ U([−1, 1]p) with p = 6; treatment assignment is random. All the signal is concentrated along two features.

The plots below depict τ(x) for 10k random test examples, projected into the 2 signal dimensions (panels: true effect τ(x), causal forest, k-NN estimate).

[Three scatter panels over the signal dimensions (x1, x2); plot data omitted from this transcript.]

Software: causalTree for R (Athey, Kong, and Wager, 2015), available at github: susanathey/causalTree

Page 257: Machine Learning and Econometrics

Causal forest example. We have n = 20k observations whose features are distributed as X ∼ U([−1, 1]p) with p = 20; treatment assignment is random. All the signal is concentrated along two features.

The plots below depict τ(x) for 10k random test examples, projected into the 2 signal dimensions (panels: true effect τ(x), causal forest, k-NN estimate).

[Three scatter panels over the signal dimensions (x1, x2); plot data omitted from this transcript.]

Software: causalTree for R (Athey, Kong, and Wager, 2015), available at github: susanathey/causalTree

Page 258: Machine Learning and Econometrics

Causal forest example. The causal forest dominates k-NN for both bias and variance. With p = 20, the relative mean-squared error (MSE) for τ is

MSE for k-NN (tuned on test set) / MSE for forest (heuristically tuned) = 19.2.

(Panels: causal forest and k-NN estimates plotted against the true treatment effect.)

[Two scatter panels of estimated versus true treatment effect, one for the causal forest and one for k-NN; plot data omitted from this transcript.]

For p = 6, the corresponding MSE ratio for τ is 2.2.

Page 259: Machine Learning and Econometrics

Application: General Social Survey. The General Social Survey is an extensive survey, collected since 1972, that seeks to measure demographics, political views, social attitudes, etc. of the U.S. population.

Of particular interest to us is a randomized experiment, for which we have data between 1986 and 2010.

I Question A: Are we spending too much, too little, or about the right amount on welfare?

I Question B: Are we spending too much, too little, or about the right amount on assistance to the poor?

Treatment effect: how much less likely are people to answer "too much" to question B than to question A.

I We want to understand how the treatment effect depends on covariates: political views, income, age, hours worked, ...

NB: This dataset has also been analyzed by Green and Kern (2012) using Bayesian additive regression trees (Chipman, George, and McCulloch, 2010).

Page 260: Machine Learning and Econometrics

Application: General Social Survey

A causal forest analysis uncovers strong treatment heterogeneity (n = 28,686, p = 12).

[Figure: CATE (roughly 0.25 to 0.40) plotted by income (x-axis: less income to more income) and political views (y-axis: more liberal to more conservative).]

Page 261: Machine Learning and Econometrics

ML Methods for Causal Inference: More general models

I Much of the recent literature bringing ML methods to causal inference focuses on a single binary treatment in an environment with unconfoundedness

I Economic models often have more complex estimation approaches

I Athey, Tibshirani, and Wager (2016) tackle the general GMM case:
I Quantile regression
I Instrumental variables
I Panel regression
I Consumer choice
I Euler equations
I Survival analysis

Page 262: Machine Learning and Econometrics

ML Methods for Causal Inference: More general models

Average Treatment Effects with IV or Unconfoundedness

I In a series of influential papers, Belloni, Chernozhukov, Hansen, et al. generalized LASSO methods to average treatment effect estimation through instrumental variables models, unconfoundedness, and also moment-based methods

I See also Athey, Imbens, and Wager (2016), who combine regularized regression and high-dimensional covariate balancing for average treatment effect estimation; see references therein for more recent papers on ATE estimation in high dimensions

Page 263: Machine Learning and Econometrics

Forests for GMM Parameter Heterogeneity

I Local GMM/ML uses kernel weighting to estimate a personalized model for each individual, weighting nearby observations more.

I Problem: curse of dimensionality

I We propose forest methods to determine which dimensions matter for the "nearby" metric, reducing the curse of dimensionality.

I Estimate a model for each point using "forest-based" weights: the fraction of trees in which an observation appears in the same leaf as the target

I We derive splitting rules optimized for the objective
I Computational trick:
I Use an approximation to the gradient to construct pseudo-outcomes
I Then apply a splitting rule inspired by regression trees to these pseudo-outcomes

Page 264: Machine Learning and Econometrics

Related Work

(Semi-parametric) local maximum likelihood/GMM
I Local ML (Hastie and Tibshirani, 1987) weights nearby observations; e.g., local linear regression. See Loader (1999); also Hastie and Tibshirani (1990) on GAM; see also Newey (1994)
I Lewbel (2006): asymptotic properties of kernel-based local GMM
I Other approaches include sieves: Chen (2007) reviews

Score-based test statistics for parameter heterogeneity
I Andrews (1993), Hansen (1992), and many others, e.g., structural breaks, using scores of estimating equations
I Zeileis et al. (2008) apply this literature to split points, when estimating models in the leaves of a single tree.

Splitting rules
I CART: MSE of predictions for regression, Gini impurity for classification, survival (see Bou-Hamad et al. (2011))
I Statistical tests, multiple testing corrections: Su et al. (2009)
I Causal trees/forests: adaptive vs. honest estimation (Athey and Imbens, 2016); propensity forests (Wager and Athey, 2015)

Page 265: Machine Learning and Econometrics

Solving estimating equations with random forests

We have i = 1, ..., n i.i.d. samples, each of which has an observable quantity Oi and a set of auxiliary covariates Xi.

Examples:

I Non-parametric regression: Oi = {Yi}.
I Treatment effect estimation: Oi = {Yi, Wi}.
I Instrumental variables regression: Oi = {Yi, Wi, Zi}.

Our parameter of interest, θ(x), is characterized by an estimating equation:

E[ψθ(x), ν(x)(Oi) | Xi = x] = 0 for all x ∈ X,

where ν(x) is an optional nuisance parameter.

Page 266: Machine Learning and Econometrics

The GMM Setup: Examples

Our parameter of interest, θ(x), is characterized by

E[ψθ(x), ν(x)(Oi) | Xi = x] = 0 for all x ∈ X,

where ν(x) is an optional nuisance parameter.

I Quantile regression, where θ(x) = Fx⁻¹(q) for q ∈ (0, 1):

ψθ(x)(Yi) = q 1({Yi > θ(x)}) − (1 − q) 1({Yi ≤ θ(x)})

I IV regression, with treatment assignment W and instrument Z. We care about the treatment effect τ(x):

ψτ(x), μ(x)(Oi) = ( Zi (Yi − Wi τ(x) − μ(x)),  Yi − Wi τ(x) − μ(x) ).

Page 267: Machine Learning and Econometrics

Solving heterogeneous estimating equations

The classical approach is to rely on local solutions (Fan and Gijbels, 1996; Hastie and Tibshirani, 1990; Loader, 1999):

Σni=1 α(x; Xi) ψθ(x), ν(x)(Oi) = 0,

where the weights α(x; Xi) are obtained from, e.g., a kernel.

We use random forests to get good data-adaptive weights. This has the potential to help mitigate the curse of dimensionality.

I Building many trees with small leaves, then solving the estimating equation in each leaf, and finally averaging the results is a bad idea. Quantile and IV regression are badly biased in very small samples.

I Using the RF as an "adaptive kernel" protects against this effect.

Page 268: Machine Learning and Econometrics

The random forest kernel

[Diagram: many trees combined into a forest-based neighborhood.]

Forests induce a kernel via averaging tree-based neighborhoods. This idea was used by Meinshausen (2006) for quantile regression.

Page 269: Machine Learning and Econometrics

Solving estimating equations with random forests

We want to use an estimator of the form

Σni=1 α(x; Xi) ψθ(x), ν(x)(Oi) = 0,

where the weights α(x; Xi) are from a random forest.

Key Challenges:

I How do we grow trees that yield an expressive yet stable neighborhood function α(·; Xi)?

I We do not have access to a "prediction error" for θ(x), so how should we optimize splitting?

I How should we account for nuisance parameters?

I Split evaluation rules need to be computationally efficient, as they will be run many times for each split in each tree.

Page 270: Machine Learning and Econometrics

Step #1: Conceptual motivation

Following CART (Breiman et al., 1984), we use greedy splits. Each split directly seeks to improve the fit as much as possible.

I For regression trees, in large samples, the best split is the one that increases the heterogeneity of the predictions the most.

I The same fact also holds locally for estimating equations.

We split a parent node P into two children C1 and C2. In large samples and with no computational constraints, we would like to maximize

Δ(C1, C2) = nC1 nC2 (θ̂C1 − θ̂C2)²,

where θ̂C1 and θ̂C2 solve the estimating equation in the children.

Page 271: Machine Learning and Econometrics

Step #2: Practical realization

Computationally, solving the estimating equation in each possible child to get θ̂C1 and θ̂C2 can be prohibitively expensive.

To avoid this problem, we use a gradient-based approximation. The same idea underlies gradient boosting (Friedman, 2001):

θ̂C ≈ θ̃C := θ̂P − (1 / |{i : Xi ∈ C}|) Σ{i: Xi ∈ C} ξ⊤ AP⁻¹ ψθ̂P, ν̂P(Oi),

AP = (1 / |{i : Xi ∈ P}|) Σ{i: Xi ∈ P} ∇ψθ̂P, ν̂P(Oi),

where θ̂P and ν̂P are obtained by solving the estimating equation once in the parent node, and ξ is a vector that picks out the θ-coordinate from the (θ, ν) vector.

Page 272: Machine Learning and Econometrics

Step #2: Practical realization

In practice, this idea leads to a split-relabel algorithm:

1. Relabel step: Start by computing pseudo-outcomes

θ̃i = −ξ⊤ AP⁻¹ ψθ̂P, ν̂P(Oi) ∈ R.

2. Split step: Apply a CART-style regression split to these pseudo-outcomes θ̃i.

This procedure has several advantages, including the following:

I Computationally, the most demanding part of growing a tree is scanning over all possible splits. Here, we reduce to a regression split that can be efficiently implemented.

I Statistically, we only have to solve the estimating equation once. This reduces the risk of hitting a numerically unstable leaf, which can be a risk with methods like IV.

I From an engineering perspective, we can write a single, optimized split-step algorithm, and then use it everywhere.

Page 273: Machine Learning and Econometrics

Step #3: Variance correction

Conceptually, we saw that, in large samples, we want splits that maximize the heterogeneity of the θ̂(Xi). In small samples, we need to account for sampling variance.

We need to penalize for the following two sources of variance.

I Our plug-in estimates for the heterogeneity of θ̂(Xi) will be overly optimistic about the large-sample parameter heterogeneity. We need to correct for this kind of over-fitting.

I We anticipate "honest" estimation, and want to avoid leaves where the estimating equation is unstable. For example, with IV regression, we want to avoid leaves with an unusually weak first-stage coefficient.

This is a generalization of the analysis of Athey and Imbens (2016) for treatment effect estimation.

Page 274: Machine Learning and Econometrics

Gradient forests

Our label-and-regress splitting rules can be used to grow an ensemble of trees that yield a forest kernel. We call the resulting procedure a gradient forest.

I Regression forests are a special case of gradient forests with a squared-error loss.

Available as an R package, gradientForest, built on top of the ranger package for random forests (Wright and Ziegler, 2015).

Page 275: Machine Learning and Econometrics

Asymptotic normality of gradient forests

Theorem. (Athey, Tibshirani and Wager, 2016) Given regularity of both the estimating equation and the data-generating distribution, gradient forests are consistent and asymptotically normal:

(θ̂n(x) − θ(x)) / σn(x) ⇒ N(0, 1),  σ²n → 0.

Proof sketch.

I Influence functions: Hampel (1974); also parallels to their use in Newey (1994).

I The influence function heuristic motivates approximating gradient forests with a class of regression forests.

I Analyze the approximating regression forests using Wager and Athey (2015).

I Use a coupling result to derive conclusions about gradient forests.

Page 276: Machine Learning and Econometrics

Asymptotic normality of gradient forests: Proof details

I The influence function heuristic motivates approximating gradient forests with a class of regression forests. Start as if we knew the true parameter value in calculating the influence function:

I Let θ*i(x) denote the influence function of the i-th observation with respect to the true parameter value θ(x):

θ*i(x) = −ξ⊤ V(x)⁻¹ ψθ(x), ν(x)(Oi)

I Pseudo-forest predictions: θ̃*(x) = θ(x) + Σni=1 αi θ*i(x).

I Apply Wager and Athey (2015) to this. Key points: θ̃*(x) is a linear function, so we can write it as an average of tree predictions, with trees built on subsamples. Thus it is a U-statistic, and we can use the ANOVA decomposition.

I Coupling result: conclusions about gradient forests. Suppose that the gradient forest estimator θ̂(x) is consistent for θ(x). Then θ̂(x) and θ̃*(x) are coupled,

θ̃*(x) − θ̂(x) = oP( ‖ Σni=1 αi(x) ψθ(x), ν(x)(Oi) ‖2 ).  (1)

Page 277: Machine Learning and Econometrics

Simulation example: Quantile regression

In quantile regression, we want to estimate the q-th quantile of the conditional distribution of Y given X, namely θ(x) = Fx⁻¹(q).

I Meinshausen (2006) used the random forest kernel for quantile regression. However, he used standard CART regression splitting instead of a tailored splitting rule.

I In our split-relabel paradigm, quantile splits reduce to classification splits (θ̂P is the q-th quantile of the parent):

Ỹi = 1({Yi > θ̂P}).

I To estimate many quantiles, we do multi-class classification. (A small sketch of the relabel step follows this list.)
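A hedged R sketch of the relabel step for a single quantile split at q = 0.5, on a hypothetical parent-node data frame parent with outcome column Y: relabel by whether the outcome exceeds the parent quantile, then apply one ordinary classification split (a stump) to the indicator.

library(rpart)
q       <- 0.5
theta_P <- quantile(parent$Y, q)                     # q-th quantile in the parent node
parent$relabel <- factor(parent$Y > theta_P)         # pseudo-outcome for the split
split_fit <- rpart(relabel ~ . - Y, data = parent, method = "class",
                   control = rpart.control(maxdepth = 1, cp = 0))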

Page 278: Machine Learning and Econometrics

Simulation example: Quantile regression

Case 1: Mean Shift; Case 2: Scale Shift

[Two panels plotting Y against X with quantile-estimate curves for the truth, quantregForest, and the split-relabel method.]

The above examples show quantile estimates at q = 0.1, 0.5, 0.9, on Gaussian data with n = 2,000 and p = 40. The package quantregForest implements the method of Meinshausen (2006).

Page 279: Machine Learning and Econometrics

Simulation example: Instrumental variables

We want to estimate heterogeneous treatment effects with endogenous treatment assignment: Yi is the outcome, Wi is the treatment assignment, and Zi is an instrument satisfying

{Yi(w)}w∈W ⊥⊥ Zi | Xi.

I Our split-relabel formalism tells us to use pseudo-outcomes

τ̃i = (Zi − Z̄P) ((Yi − ȲP) − τ̂P (Wi − W̄P)),

where τ̂P is the IV solution in the parent, and ȲP, W̄P, Z̄P are averages over the parent.

I This is just the IV regression residual projected onto the instrument. (A small sketch of this relabel step follows below.)

Page 280: Machine Learning and Econometrics

Simulation example: Instrumental variables

Using IV forests is important

[Plot of tau against X, comparing the true treatment effect, the raw correlation, the causal forest, and the IV forest.]

We have spurious correlations:

I OLS of Y on W given X has two jumps, at X1 = −1/3 and at X1 = 1/3.

I The causal effect τ(X) only has a jump at X1 = −1/3.

I n = 10,000, p = 20.

The response function is

Yi = (2Wi − 1) · 1({X1,i > −1/3}) + (3Ai − 1.5) · 1({X1,i > 1/3}) + 2εi,

where Ai is correlated with Wi.

Page 281: Machine Learning and Econometrics

Simulation example: Instrumental variables

Using IV splits is important

[Plot of tau against X, comparing the true treatment effect, the IV forest with a causal split, and the IV forest with an IV split.]

We have useless correlations:

I The joint distribution of (Wi, Yi) is independent of the covariates Xi.

I But: the causal effect τ(X) has a jump at X1 = 0.

I n = 5,000, p = 20.

The response function is

Yi = 2 · 1({X1,i ≤ 0}) Ai + 1({X1,i > 0}) Wi + (1 + 0.73 · 1({X1,i > 0})) εi,

where Ai is correlated with Wi.

Page 282: Machine Learning and Econometrics

Empirical Application: Family Size

Angrist and Evans (1998) study the effect of family size on women's labor market outcomes. Understanding heterogeneity can guide policy.

I Outcomes: participation, female income, hours worked, etc.

I Treatment: more than two kids

I Instrument: first two kids same sex

I First stage effect of same sex on more than two kids: .06

I Reduced form effect of same sex on probability of work,income: .008, $132

I LATE estimates of effect of kids on probability of work,income: .133, $2200

Page 283: Machine Learning and Econometrics

Treatment Effects: Magnitude of Decline

Panels: Effect on Participation; Baseline Probability of Working.
[Both plotted against father's income ($1k/year, from 20 to 100), with separate curves for the 18-, 20-, 22-, and 24-year groups.]

Page 284: Machine Learning and Econometrics

Treatment Effects: Magnitude of Decline

Panels: Effect on Participation; Effect relative to Baseline (CATE divided by the probability of working).
[Both plotted against father's income ($1k/year), with separate curves for the 18-, 20-, 22-, and 24-year groups.]

Page 285: Machine Learning and Econometrics

Treatment Effects: Magnitude of Decline

Panels: Effect on Earnings (CATE, $/year); Mother's Baseline Income ($1,000/year).
[Both plotted against father's income ($1,000/year), with separate curves for the 18-, 20-, 22-, and 24-year groups.]

Page 286: Machine Learning and Econometrics

Robustness
Professor Susan Athey
ML and Causal Inference

Page 287: Machine Learning and Econometrics

Robustness of Causal Estimates
Athey and Imbens (AER P&P, 2015)
• General nonlinear models/estimation methods
• Causal effect is defined as a function of model parameters
• Simple case with binary treatment: the effect is the difference θ1 − θ0 between the two treatment arms
• Consider other variables/features as "attributes"
• Proposed metric for robustness:
• Use a series of "tree" models to partition the sample by attributes
• Simple case: take each attribute one by one
• Re-estimate the model within each partition
• For each tree, calculate the overall sample average effect as a weighted average of effects within each partition
• This yields a set of sample average effects
• Propose the standard deviation of these effects as the robustness measure

Page 288: Machine Learning and Econometrics

Robustness of Causal Estimates
Athey and Imbens (AER P&P, 2015)

• Four Applications:
  • Randomly assigned training program
  • Treated individuals with artificial control group from census data (Lalonde)
  • Lottery data (Imbens, Rubin & Sacerdote (2001))
  • Regression of earnings on education from NLSY
• Findings
  • Robustness measure better for randomized experiments, worse in observational studies

Page 289: Machine Learning and Econometrics

Lalonde data

Page 290: Machine Learning and Econometrics

Comparing Robustness

Lalonde

Page 291: Machine Learning and Econometrics

Robustness Metrics: Desiderata

• Invariant to:
  • Scaling of explanatory variables
  • Transformations of vector of explanatory variables
  • Adding irrelevant variables
• Each member model must be somehow distinct to create variance, yet we want to allow lots of interactions
  • Need to add lots of rich but different models
• Well-grounded way to weight models
  • This paper had equal weighting

Page 292: Machine Learning and Econometrics

Robustness Metrics: Work In Progress
Std Deviation versus Worst-Case

• Desire for a set of alternative models that grows richer
  • New additions are similar to previous ones, lower std dev
• Standard dev metric:
  • Need to weight models to put more weight on distinct alternative models
• "Worst-case" or "bounds":
  • Find the lowest and highest parameter estimates from a set of models
  • OK to add more models that are similar to existing ones.
  • But worst-case is very sensitive to outliers: how do you rule out "bad" models?

Theoretical underpinnings: subjective versus objective uncertainty
• Subjective uncertainty: which model is correct. Objective uncertainty: distribution of model estimates given the correct model.
• What are the preferences of the "decision-maker" who values robustness? "Variational preferences": "worst-case" over a set of possible beliefs, allowing a "cost" of beliefs that captures beliefs that are "less likely" (see Strzalecki, 2011).
• Our approach for the exogenous-covariate case: a convex cost on models that perform poorly out of sample from a predictive perspective.
• Good model: low objective uncertainty (tightly estimated); other models that predict outcomes well also yield similar parameter estimates.

Page 293: Machine Learning and Econometrics

Policy Estimation

Susan Athey, Stanford University

See Athey and Wager (2017)

Page 294: Machine Learning and Econometrics

Efficient Policy Estimation

I Learning optimal policy assignment and estimating treatment effect heterogeneity are closely related

I ML literature proposed a variety of methods (Langford et al; Swaminathan and Joachims; in econometrics, Kitagawa and Tetenov, Hirano and Porter, Manski)

I Estimating the value of a personalized policy is closely related to estimating the average treatment effect (comparing the treat-all policy to the treat-none policy)

I Lots of econometric theory about how to estimate average treatment effects efficiently (achieve the semi-parametric efficiency bound)

I Athey and Wager (2017): prove that bounds on regret (the gap between the optimal policy and the estimated policy) can be tightened using an algorithm consistent with econometric theory

I Theory provides guidance for algorithm choice

Page 295: Machine Learning and Econometrics

Setup and Approach

Setup

I Policy π : X → {±1}

I Given a class of policies Π, the optimal policy π∗ and the regret R(π) of any other policy are respectively defined as

π∗ = argmax_{π∈Π} E[Yi(π(Xi))]   (1)

R(π) = E[Yi(π∗(Xi))] − E[Yi(π(Xi))].   (2)

I Goal: estimate a policy π̂ that minimizes regret R(π̂).

Approach: estimate Q(π), and choose the policy π̂ that maximizes Q̂(π):

Q(π) = E[Yi(π(Xi))] − (1/2) E[Yi(−1) + Yi(+1)]   (3)

π̂ = argmax_{π∈Π} Q̂(π).   (4)

Page 296: Machine Learning and Econometrics

Alternative Approaches

Q(π) = E[Yi(π(Xi))] − (1/2) E[Yi(−1) + Yi(+1)]   (5)

Methods for estimating the ATE or ATT can also be used to estimate the effect of any policy: simply define treatment as following the policy, and control as the reverse.

Kitagawa and Tetenov (forthcoming, Econometrica):

I Estimate Q(π) using inverse propensity weighting

I Suppose that Yi ≤ M is uniformly bounded, overlap is satisfied with η ≤ e(x) ≤ 1 − η, and Π is a Vapnik-Chervonenkis class of dimension VC(Π)

I Roughly, the VC dimension is the maximum integer D such that some data set of cardinality D can be "shattered" by a policy in Π. If a tree is limited to D leaves, its VC dimension is D.

I Then, π̂ satisfies the regret bound

R(π̂) = OP( (M/η) √(VC(Π)/n) ).   (6)

Page 297: Machine Learning and Econometrics

Alternative Approaches

Q(π) = (1/2) ( E[Yi(π(Xi))] − E[Yi(−π(Xi))] )   (7)

Q̂(π) = (1/n) ∑_{i=1}^{n} π(Xi) Γi   (8)

What is Γi? Zhao (2015) assume a randomized controlled trial and use Γi = Wi Yi / P[Wi = 1], while Kitagawa and Tetenov (forthcoming) use inverse-propensity weighting, Γi = Wi Yi / e_{Wi}(Xi). In an attempt to stabilize the weights, Beygelzimer et al (2009) introduce an "offset",

Γi = (Wi / e_{Wi}(Xi)) ( Yi − (max{Yi} + min{Yi}) / 2 ),

while Zhao et al (2015) go further and advocate

Γi = (Wi / e_{Wi}(Xi)) ( Yi − (μ_{+1}(Xi) + μ_{−1}(Xi)) / 2 ).

Page 298: Machine Learning and Econometrics

Alternative Approaches

Class of estimators:

Q̂(π) = (1/n) ∑_{i=1}^{n} π(Xi) Γi   (9)

What is Γi? Athey/Wager:

Γi := μ̂_{+1}^{(−k(i))}(Xi) − μ̂_{−1}^{(−k(i))}(Xi) + Wi ( Yi − μ̂_{Wi}^{(−k(i))}(Xi) ) / ê_{Wi}^{(−k(i))}(Xi),   (10)

where k(i) ∈ {1, ..., K} denotes the fold containing the i-th observation.
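As an illustration, here is a minimal sketch (not the authors' implementation) of how cross-fitted doubly robust scores of the form in (10) could be computed; the random-forest nuisance models, the 5 folds, the propensity clipping, and the function name doubly_robust_scores are all illustrative assumptions.

```python
# Hedged sketch (not the authors' code): cross-fitted doubly robust scores
# Gamma_i as in Eq. (10), with out-of-fold outcome and propensity models.
# Random forests, 5 folds, and the clipping bounds are illustrative choices.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def doubly_robust_scores(X, W, Y, n_folds=5, seed=0):
    """X: (n, p) covariates; W: treatment coded +1/-1; Y: outcomes."""
    gamma = np.zeros(len(Y))
    for train, test in KFold(n_folds, shuffle=True, random_state=seed).split(X):
        # out-of-fold conditional means mu_{+1}(x), mu_{-1}(x)
        mu1 = RandomForestRegressor().fit(X[train][W[train] == 1], Y[train][W[train] == 1])
        mu0 = RandomForestRegressor().fit(X[train][W[train] == -1], Y[train][W[train] == -1])
        # out-of-fold propensity e(x) = Pr(W = +1 | X = x)
        e_model = RandomForestClassifier().fit(X[train], W[train] == 1)
        m1, m0 = mu1.predict(X[test]), mu0.predict(X[test])
        e1 = np.clip(e_model.predict_proba(X[test])[:, 1], 0.05, 0.95)
        e_w = np.where(W[test] == 1, e1, 1 - e1)        # e_{W_i}(X_i)
        mu_w = np.where(W[test] == 1, m1, m0)           # mu_{W_i}(X_i)
        gamma[test] = m1 - m0 + W[test] * (Y[test] - mu_w) / e_w
    return gamma

# Q_hat(pi) for a candidate policy pi is then np.mean(pi(X) * gamma).
```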

Page 299: Machine Learning and Econometrics

Optimizing Q and Estimating Impact

Class of estimators:

Q̂(π) = (1/n) ∑_{i=1}^{n} π(Xi) Γi   (11)

Optimize Q̂(π): use a classifier with labels sign(Γi) and weights |Γi|. This step is treated as a constrained optimization problem, not estimation. Issue: extending to multiple classes.

Evaluating the benefit of the policy: re-estimate the policy using leave-one-out only at the second stage, or leave-one-out for the whole routine. Assign using the re-estimated policy. Do this at all 10 folds and use this overall assignment as a unified policy. Then use ATE methods to estimate the difference between the estimated treatment effect for the group assigned to treatment and the estimated treatment effect for the group assigned to control. That is the overall estimated benefit of the policy.
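A minimal sketch of the classification step just described, assuming the Γi from the previous sketch: labels sign(Γi), weights |Γi|, with a shallow decision tree standing in for the policy class Π (an illustrative choice, not a prescription).

```python
# Hedged sketch: maximize Q_hat(pi) in (11) by weighted classification with
# labels sign(Gamma_i) and weights |Gamma_i|; a depth-2 tree plays the role
# of the policy class Pi here.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_policy(X, gamma, max_depth=2):
    labels = np.sign(gamma)                           # +1: treat, -1: do not treat
    policy = DecisionTreeClassifier(max_depth=max_depth)
    policy.fit(X, labels, sample_weight=np.abs(gamma))
    return policy                                     # policy.predict(x) is pi_hat(x) in {-1, +1}
```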

Page 300: Machine Learning and Econometrics

Athey/Wager Main Result

I Let V(π) denote the semiparametrically efficient variance for estimating Q(π).

I Let V∗ := V(π∗) denote the semiparametrically efficient variance for evaluating π∗.

I Let Vmax denote a sharp bound on the worst-case efficient variance sup_π V(π) over policies π.

Results:

I Given a policy class Π with VC dimension VC(Π), the proposed learning rule yields a policy π̂ with regret bounded by

R(π̂) = OP( √( V∗ log(Vmax / V∗) VC(Π) / n ) ).   (12)

I We also develop regret bounds for non-parametric policy classes Π with a bounded entropy integral, such as finite-depth decision trees.

Page 301: Machine Learning and Econometrics

Illustration


[Two scatter-plot panels over covariates x1 and x2: inverse-propensity weighting (left) and double machine learning (right).]

Figure: Results from two attempts at learning a policy π̂ by counterfactual risk minimization over a depth-2 decision tree, with Q-estimators obtained by inverse-propensity weighting and double machine learning. The dashed blue line denotes the optimal decision rule (treat in the upper-right corner, do not treat elsewhere); solid black lines denote learned policies π̂ (treat in the shaded regions, not elsewhere). The heat map depicts the average over 200 simulations.

Page 302: Machine Learning and Econometrics

Conclusions

Contributions from causal inference and econometrics literature:

I Identification and estimation of causal effects

I Classical theory to yield asymptotically normal and centered confidence intervals

I Semiparametric efficiency theory

Contributions from ML:

I Practical, high performance algorithms for personalized prediction and policy estimation

Putting them together:

I Practical, high performance algorithms

I Causal effects with valid confidence intervals

I Consistent with insights from efficiency theory

Page 303: Machine Learning and Econometrics

Deep Learning and Neural Nets

American Economic Association,

Continuing Education Program 2018

Machine Learning and Econometrics, Lecture 7

Guido Imbens

Page 304: Machine Learning and Econometrics

Outline

1.

1

Page 305: Machine Learning and Econometrics

Scalar outcome yi, p-dimensional vector of covariates/inputs/features

xi, l-th element equal to xil.

We focus on two cases here:

• (Regression) Outcome yi is ordered, possibly continuous,

e.g., expenditure, earnings.

• (Classification) Outcome yi is element of finite unordered

set, e.g., choice of item.

3

Page 306: Machine Learning and Econometrics

I. In the regression problem we are interested in predicting the

value of yi, with a focus on out-of-sample prediction.

Given a data set (xi, yi), i = 1, . . . , N, we split it into two parts: a training sample, say the first Ntrain observations, and a test sample, the last Ntest observations. We are interested in using the training sample (but not the test sample) to choose a function f : X → R that minimizes the average squared error in the test sample:

(1/Ntest) ∑_{i=Ntrain+1}^{N} ( yi − f(xi) )²
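A minimal sketch of this split-sample evaluation; the simulated data and the random forest standing in for f are illustrative assumptions, not part of the slides.

```python
# Hedged sketch: choose f on the training sample, evaluate the average squared
# error on the held-out test sample. The data and the random forest are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
N, p = 1000, 10
X = rng.normal(size=(N, p))
y = X[:, 0] ** 2 + rng.normal(size=N)              # simulated outcome

N_train = N // 2                                   # first N_train obs: training sample
f = RandomForestRegressor().fit(X[:N_train], y[:N_train])

test_mse = np.mean((y[N_train:] - f.predict(X[N_train:])) ** 2)
print(test_mse)
```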

4

Page 307: Machine Learning and Econometrics

• From a conventional econometric/statistical perspective, if

we think that the original sample is a random sample from a

large population, so we can think of the (xi, yi) as realizations

of random variables (Xi, Yi), then the function that minimizes

the objective is the conditional expectation of the outcome

given the covariates/features:

f(x) = E[Yi|Xi = x]

In that case we can think of the problem as just using the train-

ing sample to get the optimal (in terms of expected squared

error) estimator for this conditional expectation.

5

Page 308: Machine Learning and Econometrics

There are two conceptual points to note with regard to that

perspective:

• We do not care separately about bias and variance, just about

expected squared error. In the same vein, we do not necessarily

care about inference.

• We do not need to think about where the sample came from

(e.g., the infinite superpopulation). The problem of finding a

good predictor is well-defined irrespective of that.

6

Page 309: Machine Learning and Econometrics

II. In the classification problem we are interested in classifying

observations on the basis of their covariate/feature values xi

in terms of the outcome yi. Suppose yi ∈ {1, . . . , J}.

Again we wish to come up with a function f : X → {1, . . . , J} that minimizes the misclassification rate in the test sample:

(1/Ntest) ∑_{i=Ntrain+1}^{N} 1_{yi ≠ f(xi)}

7

Page 310: Machine Learning and Econometrics

Suppose again that the sample is a random sample from an

infinite population, and that in the population the fraction of

units with Yi = j in the subpopulation of units with Xi = x is

pj(x) = pr(Yi = j | Xi = x)

In that case the solution is

f(x) = j  iff  pj(x) = max_{k=1,...,J} pk(x)

Note that we are not actually focused here on estimating pj(x),

just the index of the choice with the maximum probability.

• Knowing pj(·) solves the problem, but it is not necessary to

estimate pj(·).

8

Page 311: Machine Learning and Econometrics

Consider the classical example of the classification problem,

the zipcode problem:

Given an image of a digit, with the accompanying label (from

zero to nine), classify it as a 0-9.

• Thinking of this as a problem of estimating a conditional

probability is not necessarily natural: it is not clear what the

conditional probabilities mean. Although, from a Bayesian per-

spective one might improve things by taking into account the

other digits in a handwritten zipcode.

9

Page 312: Machine Learning and Econometrics

Suppose that J = 2, so we are in a setting with a binary

outcome, where the regression and classification problem are

the same in the standard econometric/statistics approach.

Suppose that the sample is a random sample from a large

population, with the conditional probability that Yi = 2 equal

to p2(x) = 1 − p1(x).

Viewing the problem as a regression problem the solution is

f(x) = 1 + p2(x) with f(x) ∈ [1,2]

The solution to the classification problem is slightly different:

f(x) = 1 + 1p2(x)>1/2 with f(x) ∈ {1,2}

10

Page 313: Machine Learning and Econometrics

Back to the Regression Problem

Linear Model for f(·):

f(x) = ω0 + ∑_{j=1}^{p} ωj xj

Minimize the sum of squared deviations over the "weights" (parameters) ωj:

min_{ω0,...,ωp} ∑_{i=1}^{Ntrain} ( Yi − ω0 − ∑_{j=1}^{p} ωj Xij )²

This is not very flexible, but has well-understood properties.

11

Page 314: Machine Learning and Econometrics

Let’s make this more flexible:

Single index model:

f(x) = g( ∑_{j=1}^{p} ωj xj )

Estimate both weights ωj and transformation g(·)

12

Page 315: Machine Learning and Econometrics

Additive model:

f(x) = ∑_{j=1}^{p} gj(xj)

Estimate the p covariate-specific transformations gj(·).

Advantage over nonparametric kernel regression is the improved

rate of convergence (rate of convergence does not depend on

number of covariates, whereas for kernel regression the rate of

convergence slows down the larger the dimension of xi, p, is).

13

Page 316: Machine Learning and Econometrics

Projection Pursuit (hybrid of single index model and additive

models):

f(x) = ∑_{l=1}^{L} gl( ∑_{j=1}^{p} ωlj xj )

Estimate both the L transformations gl(·) and the L×p weights

ωlj.

14

Page 317: Machine Learning and Econometrics

Neural Net with single hidden layer:

f(x) = ω^(2)_0 + ∑_{l=1}^{L} ω^(2)_l g( ∑_{j=1}^{p} ω^(1)_{lj} xj )

Fix the transformation g(·) and estimate only the weights ω^(1)_{lj} and ω^(2)_l (but in both layers).

• Increased flexibility in one place, at the expense of flexibility

in other places.

• Computational advantages especially when generalized to

multiple hidden layers.
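A minimal sketch of the single-hidden-layer prediction f(x) above, with a sigmoid as the fixed transformation g (one of the choices listed later); the weight shapes and random values are assumptions made only for the example.

```python
# Hedged sketch: forward pass of a single-hidden-layer network with fixed g
# (sigmoid here). W1, b1 are the first-layer weights omega^(1); w2, b2 the
# output-layer weights omega^(2). Shapes and values are illustrative.
import numpy as np

def g(a):
    return 1.0 / (1.0 + np.exp(-a))                # sigmoid activation

def f(x, W1, b1, w2, b2):
    hidden = g(W1 @ x + b1)                        # L hidden units
    return b2 + w2 @ hidden                        # scalar prediction

rng = np.random.default_rng(0)
x = rng.normal(size=3)                             # p = 3 inputs, L = 4 hidden units
print(f(x, rng.normal(size=(4, 3)), rng.normal(size=4), rng.normal(size=4), 0.0))
```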

15

Page 318: Machine Learning and Econometrics

General neural net with K hidden layers, one observed input

layer, one observed output layer, K + 2 layers total.

Need additional notation to keep things tractable. Superscripts

in parentheses denote layers, from (1) to (K + 2).

p1 = p observed inputs (dimension of x). Define α^(1)_l = z^(1)_l = xl for l = 1, . . . , p1.

First hidden layer: p2 hidden elements z^(2)_1, . . . , z^(2)_{p2}:

z^(2)_l = ω^(1)_{l0} + ∑_{j=1}^{p1} ω^(1)_{lj} α^(1)_j = ω^(1)_{l0} + ∑_{j=1}^{p1} ω^(1)_{lj} xj,   l = 1, . . . , p2

α^(2)_l = g( z^(2)_l )   (with transformation g(·) known)

16

Page 319: Machine Learning and Econometrics

Hidden layer k, with pk hidden elements z^(k)_1, . . . , z^(k)_{pk}, for k = 2, . . . , K + 1:

z^(k)_l = ω^(k−1)_{l0} + ∑_{j=1}^{p_{k−1}} ω^(k−1)_{lj} α^(k−1)_j,   α^(k)_l = g( z^(k)_l )

Final output layer (layer K + 2), with pK+2 = 1:

z^(K+2)_1 = ω^(K+1)_{10} + ∑_{j=1}^{p_{K+1}} ω^(K+1)_{1j} α^(K+1)_j,

f(x; ω) = α^(K+2)_1 = g^(K+2)( z^(K+2)_1 ) = z^(K+2)_1

17

Page 320: Machine Learning and Econometrics

Choices for g(·) (pre-specified, not chosen by formal data-driven optimization):

1. sigmoid: g(a) = (1 + exp(−a))^{−1}

2. tanh: g(a) = (exp(a) − exp(−a)) / (exp(a) + exp(−a))

3. rectified linear: g(a) = a × 1_{a>0}

4. leaky rectified linear: g(a) = a × 1_{a>0} + γ a × 1_{a<0}

18

Page 321: Machine Learning and Econometrics

Lots of complexity allowed for, but it comes with lots of choices.

Not easy to use out-of-the-box, but very successful in complex settings.

Computationally tricky because of multi-modality.

• can approximate smooth functions accurately (universal ap-

proximator) with many layers and many hidden units.

19

Page 322: Machine Learning and Econometrics

Interpretation

We can think of the layers up to the last one as constructing

regressors: z(K+1) = h(ω, x)

An alternative is to choose functions of the regressors, e.g., polynomials zij = xi1 × xi4 × xi7².

In what sense is this better? Is this a statement about the type

of functions we encounter?

20

Page 323: Machine Learning and Econometrics

Multiple layers versus multiple hidden units

“We observe that shallow models [models with few lay-

ers] in this context overfit at around 20 millions pa-

rameters while deep ones can benefit from having over

60 million. This suggests that using a deep model ex-

presses a useful preference over the space of functions

the model can learn.” Deep Learning, Goodfellow, Ben-

gio, and Courville, 2016, MIT Press, p. 197.

21

Page 324: Machine Learning and Econometrics

Link to standard parametric methods

The sequential set-up implies a (complicated) parametric function f(x; ω).

We could simply take this as a parametric model for E[Yi|Xi = x].

• Then we can estimate ω by minimizing, over all ω^(k)_{lj}, j = 1, . . . , pk, l = 1, . . . , pk+1, k = 1, . . . , K + 1, using nonlinear optimization methods (e.g., Newton-Raphson):

∑_{i=1}^{N} ( yi − f(xi; ω) )²

• Linearize to do standard inference for nonlinear least squares

and get asymptotic normality for ω(k)lj and predicted values

given model specification.

22

Page 325: Machine Learning and Econometrics

That is not what this literature is about:

• Computationally messy, multiple modes, too many parame-

ters to calculate second derivatives.

• Too many parameters, with the model too non-linear, so that asymptotic normality is unlikely to be an accurate approximation.

23

Page 326: Machine Learning and Econometrics

Instead, first add regularization:

Minimize, over all ω^(k)_{lj}, j = 1, . . . , pk, l = 1, . . . , pk+1, k = 1, . . . , K + 1:

∑_{i=1}^{N} ( yi − f(xi; ω) )² + penalty term

where the penalty term is something like a LASSO penalty, or similar:

λ × ∑_{k=1}^{K} ∑_{j=1}^{pk} ∑_{l=1}^{p_{k−1}} ( ω^(k)_{jl} )²

Choose λ through cross-validation.

24

Page 327: Machine Learning and Econometrics

Estimating the Parameters of a Neural Network: Backpropagation Algorithm

Define the objective function (initially without regularization):

Ji(ω, xi, yi) = ( yi − f(xi; ω) )²

J(ω, x, y) = ∑_{i=1}^{N} Ji(ω, xi, yi) = ∑_{i=1}^{N} ( yi − f(xi; ω) )²

We wish to minimize this over ω.

• How do we calculate first derivatives (at least approximately)?

• Forget about second cross derivatives!

25

Page 328: Machine Learning and Econometrics

The backpropagation algorithm calculates the derivatives of

Ji(ω, xi, yi) with respect to all ω(k)li .

You start at the input layer and calculate the hidden elements

z(k)lj , one layer at a time.

Then after calculating all the current values of the hidden el-

ements, we calculate the derivatives, starting with the deriva-

tives for the ω(k) in the last, output layer, working backwards

to the first layer.

Key is that the ω only enter in one layer each.

26

Page 329: Machine Learning and Econometrics

Recall layer k:

z^(k)_l = ω^(k−1)_{l0} + ∑_{j=1}^{p_{k−1}} ω^(k−1)_{lj} g( z^(k−1)_j )

or in vector notation:

z^(k) = h^(k)( ω^(k−1), z^(k−1) )

• This function h^(k)(·) does not depend on ω beyond ω^(k−1).

Hence we can write the function f(·) recursively:

f(x; ω) = h^(K+2)( ω^(K+1), h^(K+1)( ω^(K), . . . h^(2)( ω^(1), x ) . . . ) )

27

Page 330: Machine Learning and Econometrics

Now start at the last layer, where h^(K+2)(z^(K+2)) = z^(K+2).

Define for observation i:

δ^(K+2)_i = ∂/∂z^(K+2)_i ( yi − h^(K+2)( z^(K+2)_i ) )²

= −2 ( yi − h^(K+2)( z^(K+2)_i ) ) h^(K+2)′( z^(K+2)_i )

= −2 ( yi − z^(K+2)_i )

(this is just the scaled residual).

28

Page 331: Machine Learning and Econometrics

Down the layers:

Define δ^(k)_{li} in terms of δ^(k+1) and z^(k), as a linear combination of the δ^(k+1):

δ^(k)_{li} = ( ∑_{j=1}^{pk} ω^(k)_{jl} δ^(k+1)_{ji} ) g′( z^(k)_{li} )

Then the derivatives are

∂/∂ω^(k)_{lj} Ji(ω, xi, yi) = g( z^(k)_{ji} ) δ^(k+1)_{li}

Page 332: Machine Learning and Econometrics

Given the derivatives, iterate over

ω_{m+1} = ω_m − α × ∂/∂ω J(ω, x, y)

α is the "learning rate", often set at 0.01.

This is still computationally (unnecessarily) demanding:

• given that we don't do an optimal line search, we do not need the exact derivative.

29

Page 333: Machine Learning and Econometrics

• often stochastic gradient descent: instead of calculating

∂/∂ω J(ω, x, y)

estimate the derivative by

∑_{i=1}^{N} Ri ∂/∂ω Ji(ω, xi, yi) / ∑_{i=1}^{N} Ri

for a random selection of units (Ri ∈ {0,1}), because it is faster.
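A minimal sketch of the stochastic-gradient update ω ← ω − α ∂J/∂ω on a random subsample; the linear f, the batch size of 32, and the simulated data are simplifying assumptions for illustration only.

```python
# Hedged sketch: stochastic gradient descent on sum_i (y_i - f(x_i; omega))^2,
# with a linear f for transparency. Batch size and data are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, p = 1000, 5
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=N)

omega, alpha = np.zeros(p), 0.01                   # alpha: learning rate
for _ in range(2000):
    batch = rng.choice(N, size=32, replace=False)  # random selection, R_i = 1
    resid = y[batch] - X[batch] @ omega
    omega -= alpha * (-2 * X[batch].T @ resid / len(batch))
print(omega)                                       # approaches the true coefficients
```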

30

Page 334: Machine Learning and Econometrics

Regularization to avoid overfitting:

• add a penalty term, λ ∑_{j,l,k} ( ω^(k)_{jl} )², or λ ∑_{j,l,k} | ω^(k)_{jl} |

• early stopping rule: monitor average error in test sample, and

stop iterating when average test error deteriorates.

31

Page 335: Machine Learning and Econometrics

Classification

American Economic Association,

Continuing Education Program 2018

Machine Learning and Econometrics, Lecture 8

Guido Imbens

Page 336: Machine Learning and Econometrics

Outline

1. Classification

2. Relation to Discrete Choice Models

3. Classification Trees

4. Neural Networks for Classification

5. Support Vector Machines

1

Page 337: Machine Learning and Econometrics

A second canonical problem in supervised learning is classifi-

cation.

The setting is again one where we observe for a large number

of units, say N , a p-vector of covariates xi, and an outcome yi.

The outcome is now a class, an unordered discrete variable,

yi ∈ {1, . . . , M}. M may be 2, or fairly large.

The goal is a function f : X → {1, . . . , M} that will allow us to

assign new units to one of the classes.

• This is like discrete choice problems, but often we are not

interested in estimating the probability of yi = m given xi = x,

just the classification itself.

2

Page 338: Machine Learning and Econometrics

The objective is typically to minimize, for new cases (out-of-sample), the misclassification rate:

(1/Ntest) ∑_{i=Ntrain+1}^{N} 1_{yi ≠ f(xi)}

3

Page 339: Machine Learning and Econometrics

Consider the two-class (binary yi) case.

Suppose that the sample is a random sample from an infinite

population, where:

pr(yi = 1 | xi = x) = exp(x′β) / (1 + exp(x′β))

Then the optimal classification is

f(x) = 1{ exp(x′β) / (1 + exp(x′β)) ≥ 0.5 }

or equivalently

f(x) = 1{ x′β ≥ 0 }

4

Page 340: Machine Learning and Econometrics

The standard thing to do in discrete choice analysis would be to estimate β as

β̂ = argmax_β L(β)

where

L(β) = ∑_{i=1}^{n} { yi x′_iβ − ln( 1 + exp(x′_iβ) ) }

and then classify as

f(x) = 1{ x′β̂ ≥ 0 }

Outliers can be very influential here.

5

Page 341: Machine Learning and Econometrics

The machine learning methods take a different approach. These

differences include

1. the objective function,

2. complexity of the models,

3. regularization

We will look at each of these.

6

Page 342: Machine Learning and Econometrics

β is estimated by focusing on (in- and out-of-sample) classification accuracy. For in-sample classification accuracy, we could consider a maximum score type objective function:

β̂ = argmin_β ∑_{i=1}^{N} | yi − 1{ x′_iβ ≥ 0 } |

This of course is not necessarily an improvement. Given the true value for β, using x′β > 0 as a classifier is optimal, and the maximum likelihood estimator is better than the estimator based on classification error.

• The classification rate does not give the optimal weight to different classification errors if we know the model is logistic. But it is robust because it does not rely on a model.

7

Page 343: Machine Learning and Econometrics

With many covariates we can also regularize this estimator (or the maximum likelihood estimator) using an L1 penalty term:

β̂ = argmin_β ∑_{i=1}^{N} | yi − 1{ x′_iβ ≥ 0 } | + λ ∑_{j=1}^{p} |βj|

For logistic regression this goes back to Tibshirani's original LASSO paper:

β̂ = argmax_β ∑_{i=1}^{n} { yi x′_iβ − ln( 1 + exp(x′_iβ) ) } − λ ∑_{j=1}^{p} |βj|

We could also use other penalties, e.g., ridge, or elastic net.
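A minimal sketch of the L1-penalized logit followed by classification at x′β̂ ≥ 0; the simulated data, the solver, and the penalty value (via sklearn's C, the inverse of λ) are illustrative assumptions.

```python
# Hedged sketch: L1-penalized logistic regression, then classify with
# 1{x'beta_hat >= 0}. Note sklearn's C is the inverse of lambda; the data,
# solver, and C value are illustrative choices.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 500, 50
X = rng.normal(size=(n, p))
y = (X[:, 0] - X[:, 1] + rng.logistic(size=n) > 0).astype(int)

lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
f_hat = lasso_logit.predict(X)                     # equals 1{x'beta_hat >= 0}
print((f_hat != y).mean())                         # in-sample misclassification rate
print(int(np.sum(lasso_logit.coef_ != 0)))         # number of selected covariates
```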

8

Page 344: Machine Learning and Econometrics

Many differences in complexity and type of models. Methods and models are very close to the regression versions. Slight differences in implementation and tuning parameters.

Here, we consider

1. trees

2. neural networks

3. support vector machines

9

Page 345: Machine Learning and Econometrics

Classification Trees

The basic algorithm has the same structure as in the regression

case.

We start with a single leaf. We then decide on a covari-

ate/feature k = 1, . . . , p, to split on, and a threshold c, so that

we split the single leaf into two leaves, depending on whether

xik ≤ c

or not.

The key question is how to compare the different potential splits. In regression problems we use the sum of squared residuals within each potential leaf for this. Here we consider alternatives.

10

Page 346: Machine Learning and Econometrics

Now define an impurity function as a function of the shares of

different classes in a leaf:

φ(π1, . . . , πM)

satisfying

φ(π1, . . . , πM) ≤ φ(1/M, 1/M, . . . , 1/M)

φ(π1, . . . , πM) ≥ φ(1, 0, . . . , 0)

(and permutations thereof)

11

Page 347: Machine Learning and Econometrics

Now consider a tree T with leaves t = 1, . . . , |T|. Let

π_{m|t} and π_t

be the share of class m in leaf t and the share of leaf t, respectively.

Then the impurity of tree T is the average of the impurities of the leaves, weighted by their shares:

I(T) = ∑_{t=1}^{|T|} π_t φ( π_{1|t}, . . . , π_{M|t} )

This replaces the average squared residual as the objective function to grow/compare/prune trees.

12

Page 348: Machine Learning and Econometrics

Gini impurity is a measure of how often a randomly chosen

element from the set would be incorrectly labeled if it was

randomly labeled according to the distribution of labels in the

subset.

φ(p1, . . . , pM) = ∑_{m=1}^{M} pm (1 − pm) = 1 − ∑_{m=1}^{M} pm²

Information criterion

φ(p1, . . . , pM) = − ∑_{m=1}^{M} pm ln(pm)
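A minimal sketch of these two impurity functions evaluated on the class shares within a leaf (shares assumed to sum to one).

```python
# Hedged sketch: Gini impurity and the entropy-based information criterion,
# as functions of the class shares (p_1, ..., p_M) within a leaf.
import numpy as np

def gini(shares):
    shares = np.asarray(shares, dtype=float)
    return 1.0 - np.sum(shares ** 2)

def information(shares):
    shares = np.asarray(shares, dtype=float)
    nz = shares[shares > 0]                        # treat 0 * ln(0) as 0
    return -np.sum(nz * np.log(nz))

print(gini([0.5, 0.5]), gini([1.0, 0.0]))          # impure leaf vs. pure leaf
print(information([0.5, 0.5]), information([1.0, 0.0]))
```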

13

Page 349: Machine Learning and Econometrics

Note that we do not use the within-sample classification error

here as the criterion. That would amount to using

φ(p1, . . . , pM) = 1 − max_{m} pm

14

Page 350: Machine Learning and Econometrics

After the first split, we consider each of the two leaves separately. For each leaf we again determine the optimal covariate to split on, and the optimal threshold.

Then we compare the impurity of the optimal split based on

splitting the first leaf, and the impurity of the optimal split

based on splitting the second leaf, and choose the one that

most improves the impurity.

We keep doing this till we satisfy some stopping criterion.

15

Page 351: Machine Learning and Econometrics

Typically: stop if the impurity plus some penalty (e.g., linear in the number of leaves) is minimized:

I(T) + λ|T|

The value of the penalty term, λ, is determined by out-of-

sample cross-validation.

16

Page 352: Machine Learning and Econometrics

To evaluate trees, and to compare them out of sample, we typically do use the fraction of misclassified units in a test sample:

(1/Ntest) ∑_{i=1}^{Ntest} 1{ yi ≠ f(xi) }

without any weighting.

17

Page 353: Machine Learning and Econometrics

Neural Networks for Classification

Recall the output layer in the regression version, where we had

a single scalar outcome y:

z^(K+2) = ω^(K+1)_0 + ∑_{j=1}^{p_{K+1}} ω^(K+1)_j α^(K+1)_j,

f(x) = g^(K+2)( z^(K+2) ) = z^(K+2) = ω^(K+1)_0 + ∑_{j=1}^{p_{K+1}} ω^(K+1)_j α^(K+1)_j

18

Page 354: Machine Learning and Econometrics

Now modify this by defining the outcome to be the vector y with elements

ym = 1{ y = m }

Then introduce an M-component vector z^(K+2):

z^(K+2)_m = ω^(K+1)_{m0} + ∑_{j=1}^{p_{K+1}} ω^(K+1)_{mj} α^(K+1)_j,

and the transformation (note that this involves all elements of z^(K+2), unlike the transformations we considered before)

g^(K+2)_m( z^(K+2) ) = exp( z^(K+2)_m ) / ∑_{l=1}^{M} exp( z^(K+2)_l )

Now we can use all the machinery from the deep neural networks developed for the regression case.
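A minimal sketch of the softmax output transformation g_m^(K+2) above; subtracting the maximum is a standard numerical-stability device, not something on the slide.

```python
# Hedged sketch: the softmax transformation mapping the M-vector z^(K+2)
# to class probabilities that sum to one.
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())                        # subtract max for numerical stability
    return e / e.sum()

print(softmax([2.0, 1.0, -1.0]))
```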

19

Page 355: Machine Learning and Econometrics

Support Vector Machines

Consider the case with two classes, yi ∈ {−1,1}.

Define a hyperplane by

{x|x′β + β0 = 0}

with ‖β‖ = 1

The hyperplane generates a classification rule:

f(x) = sign(β0 + x′β)

Now if we are very lucky, or have lots of covariates, we may find a hyperplane that leads to a classification rule with no in-sample classification errors, so that for all i,

yi f(xi) > 0, i.e., (yi = 1, f(xi) > 0) or (yi = −1, f(xi) < 0)

20

Page 356: Machine Learning and Econometrics

In that case there are typically many such hyperplanes. We

can look for the one that has the biggest distance between the

two classes:

max_{β0, β, ‖β‖=1} M   subject to   yi(β0 + x′_iβ) ≥ M,  i = 1, . . . , N

This is equivalent to

min_{β0, β} ‖β‖   subject to   yi(β0 + x′_iβ) ≥ 1,  i = 1, . . . , N

This hyperplane depends only on a few units - outside of those

you can move the points a little bit without changing the value

of β.

Note that in this perfect separation case, the logit model is not

estimable (parameters go to infinity).

21

Page 357: Machine Learning and Econometrics

Now it is rare that we can find a hyperplane that completely

separates the classes. More generally we solve

min_{β0, β, ε1,...,εN} ‖β‖   subject to   yi(β0 + x′_iβ) ≥ 1 − εi  and  εi ≥ 0,  i = 1, . . . , N,   ∑_{i=1}^{N} εi ≤ K

Or we can write this as (C is a cost tuning parameter)

min_{β0, β, ε1,...,εN} ‖β‖ + C ∑_{i=1}^{N} εi

subject to   yi(β0 + x′_iβ) ≥ 1 − εi,  εi ≥ 0,  i = 1, . . . , N

22

Page 358: Machine Learning and Econometrics

So far this is very much tied to the linear representation of the

classifier.

We can of course add functions of the original x's to make the model arbitrarily flexible.

It turns out we can do that in a particularly useful way using

kernels of the type we use in kernel regression.

First we rewrite the solution for the classifier in terms of the

dual problem.

23

Page 359: Machine Learning and Econometrics

Specifically, we can characterize the solution for the classifier

as

max_{0 ≤ αi ≤ C}  ∑_{i=1}^{N} αi − (1/2) ∑_{i=1}^{N} ∑_{i′=1}^{N} αi αi′ yi yi′ x⊤_i x_{i′}   s.t.   ∑_{i=1}^{N} αi yi = 0

and

f(x) = ∑_{i=1}^{N} αi yi x⊤ xi + β0

The αi are the Lagrange multipliers for the restrictions

yi(β0 + x′iβ) ≥ 1 − εi

C is the critical tuning parameter, typically chosen through

cross-validation.

24

Page 360: Machine Learning and Econometrics

This can be generalized by replacing x⊤xi by a kernel K(·, ·) evaluated at x and xi, so that

f(x) = ∑_{i=1}^{N} αi yi K(x, xi) + β0

For example, we can use an exponential kernel

K(x, x′) = exp( −γ ‖x − x′‖ )

This leads to very flexible surfaces to separate the classes.
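A minimal sketch of a kernelized soft-margin SVM with C and γ chosen by cross-validation. Note that sklearn's "rbf" kernel is exp(−γ‖x − x′‖²), a squared-exponential relative of the exponential kernel on the slide; the simulated data and the tuning grid are illustrative assumptions.

```python
# Hedged sketch: kernel SVM with C and gamma tuned by cross-validation.
# sklearn's "rbf" kernel uses the squared norm, exp(-gamma * ||x - x'||^2),
# a close relative of the exponential kernel above.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # nonlinear boundary

grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": [0.1, 1, 10]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.score(X, y))
```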

25

Page 361: Machine Learning and Econometrics

Matrix Completion Methods

and Causal Panel Models

American Economic Association,

Continuing Education Program 2018

Machine Learning and Econometrics, Lecture 7-8

Guido Imbens

Page 362: Machine Learning and Econometrics

Motivating Example I

• California’s anti-smoking legislation (Proposition 99) took

effect in 1989.

• What is the causal effect of the legislation on smoking

rates in California in 1989?

• We observe smoking rates in California in 1989 given the

legislation. We need to impute the counterfactual smok-

ing rates in California in 1989 had the legislation not been

enacted.

• We have data in the absence of smoking legislation in Cal-

ifornia prior to 1989, and for other states both before and

after 1989. (and other variables, but not of essence)

1

Page 363: Machine Learning and Econometrics

Motivating Example II

In the US, on any day, 1 in 25 patients suffers at least one Hospital Acquired Infection (HAI).

• Hospital acquired infections cause 75,000 deaths per year,

cost 35 billion dollars per year

• 13 states have adopted a reporting policy (at different times during 2000-2010) that requires hospitals to report HAIs to the state.

What is (average) causal effect of reporting policy on

deaths or costs?

2

Page 364: Machine Learning and Econometrics

Set Up: we observe (in addition to covariates):

Y =

    Y11  Y12  Y13  . . .  Y1T
    Y21  Y22  Y23  . . .  Y2T
    Y31  Y32  Y33  . . .  Y3T
    ...
    YN1  YN2  YN3  . . .  YNT

(realized outcomes).

W =

    1  1  0  . . .  1
    0  0  1  . . .  0
    1  0  1  . . .  0
    ...
    1  0  1  . . .  0

(binary treatment).

• rows of Y and W correspond to physical units, columns

correspond to time periods.

3

Page 365: Machine Learning and Econometrics

In terms of potential outcome matrices Y(0) and Y(1):

Y(0) =

    ?  ?  X  . . .  ?
    X  X  ?  . . .  X
    ?  X  ?  . . .  X
    ...
    ?  X  ?  . . .  X

Y(1) =

    X  X  ?  . . .  X
    ?  ?  X  . . .  ?
    X  ?  X  . . .  ?
    ...
    X  ?  X  . . .  ?

Yit(0) is observed iff Wit = 0, and Yit(1) is observed iff Wit = 1.

In order to estimate the average treatment effect for the treated (or another average, e.g., the overall average effect),

τ = ∑_{i,t} Wit ( Yit(1) − Yit(0) ) / ∑_{i,t} Wit,

we need to impute the missing potential outcomes in Y(0) (and in Y(1) for other estimands).

4

Page 366: Machine Learning and Econometrics

Focus on the problem of imputing the missing entries in the N × T matrix Y = Y(0):

Y_{N×T} =

    ?  ?  X  . . .  ?
    X  X  ?  . . .  X
    ?  X  ?  . . .  X
    ...
    ?  X  ?  . . .  X

O and M are the sets of indices (i, t) with Yit observed and missing, with cardinalities |O| and |M|. Covariates can be time-specific, unit-specific, or time/unit-specific.

• Now the problem is a Matrix Completion Problem.

5

Page 367: Machine Learning and Econometrics

At times we focus on the special case with Wit = 1 iff (i, t) = (N, T), so that |O| = N × T − 1 and |M| = 1:

Y =

    Y11  Y12  Y13  . . .  Y1T
    Y21  Y22  Y23  . . .  Y2T
    Y31  Y32  Y33  . . .  Y3T
    ...
    YN1  YN2  YN3  . . .  ?

(realized outcomes).

W =

    1  1  1  . . .  1
    1  1  1  . . .  1
    1  1  1  . . .  1
    ...
    1  1  1  . . .  0

(binary treatment).

Here single entry YNT needs to be imputed.

6

Page 368: Machine Learning and Econometrics

Also important: Staggered Adoption (e.g., adoption of technology; Athey and Stern, 1998)

− Here each unit is characterized by an adoption date Ti ∈

{1, . . . , T,∞} that is the first date they are exposed to the

treatment.

− Once exposed a unit will subsequently be exposed, Wit+1 ≥

Wit.

Y_{N×T} =

    X  X  X  X  . . .  X   (never adopter)
    X  X  X  X  . . .  ?   (late adopter)
    X  X  X  X  . . .  ?
    X  X  ?  ?  . . .  ?
    X  X  ?  ?  . . .  ?   (medium adopter)
    ...
    X  ?  ?  ?  . . .  ?   (early adopter)

7

Page 369: Machine Learning and Econometrics

Netflix Problem

• N ≈ 500,000 (individuals), raises computational issues

• Large T ≈ 20,000 (movies),

• General missing data pattern,

• Fraction of observed data is close to zero, |O| ≪ N × T

Y_{N×T} =

    ?  ?  ?  ?  ?  X  . . .  ?
    X  ?  ?  ?  X  ?  . . .  X
    ?  X  ?  ?  ?  ?  . . .  ?
    ?  ?  ?  ?  ?  X  . . .  ?
    X  ?  ?  ?  ?  ?  . . .  X
    ?  X  ?  ?  ?  ?  . . .  ?
    ...
    ?  ?  ?  ?  X  ?  . . .  ?

8

Page 370: Machine Learning and Econometrics

How is this problem treated in the various literatures?

1. Program evaluation literature under unconfoundedness /

selection-on-observables (Imbens-Rubin 2015)

2. Synthetic Control literature (Abadie-Diamond-Hainmueller,

JASA 2010)

3. Differences-In-Differences Literature (Bertrand-Duflo-Mullainathan,

2003) and Econometric Panel Data literature (Bai 2003,

2009; Bai and Ng 2002)

4. Machine Learning literature on matrix completion / net-

flix problem (Candes-Recht, 2009; Keshavan, Montanari

& Oh, 2010).

9

Page 371: Machine Learning and Econometrics

1. Program Evaluation Literature under unconfounded-

ness / selection-on-observables (Imbens-Rubin 2015)

Focuses on special case:

− Thin Matrix (N large, T small),

− YiT is missing for some i (Nt ”treated units”), and no

missing entries for other units (Nc = N −Nt “control units”).

− Yit is always observed for t < T .

Y_{N×T} =

    X  X  X
    X  X  X
    X  X  ?
    X  X  ?
    X  X  X
    ...
    X  X  ?

W_{N×T} =

    1  1  1
    1  1  1
    1  1  0
    1  1  0
    1  1  1
    ...
    1  1  0

10

Page 372: Machine Learning and Econometrics

Parametric version: Horizontal Regression

Specification:

YiT = β0 +T−1∑

t=1

βtYit + εi

estimated on Nc control units, with T regressors.

Prediction for Nt treated units:

YiT = β0 +T−1∑

t=1

βtYNt

• Nonparametric version: for each treated unit i find a control

unit j with Yit ≈ Yjt, t < T .

• If N large relative to T0 use regularized regression (lasso,

ridge, elastic net)
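A minimal sketch of the regularized horizontal regression: fit the last-period outcome on the pre-periods using control units only, then impute YiT(0) for the treated units. The N × T outcome array, the boolean treated indicator, and the use of ElasticNetCV are assumptions made for the example.

```python
# Hedged sketch: "horizontal" regression with elastic-net regularization.
# Y is an N x T outcome array, `treated` a boolean vector of length N
# (both assumptions about the data layout, for illustration only).
import numpy as np
from sklearn.linear_model import ElasticNetCV

def impute_horizontal(Y, treated):
    pre, last = Y[:, :-1], Y[:, -1]
    model = ElasticNetCV(cv=5).fit(pre[~treated], last[~treated])   # control units only
    return model.predict(pre[treated])         # imputed Y_iT(0) for treated units
```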

11

Page 373: Machine Learning and Econometrics

2. Synthetic Control Literature (Abadie-Diamond-Hainmueller,

JASA 2010)

Focuses on special case:

− Fat Matrix (N small, T large)

− YNt is missing for t ≥ T0 and no missing entries for other

units.

Y_{N×T} =

    X  X  X  . . .  X
    X  X  X  . . .  X
    X  X  ?  . . .  ?

W_{N×T} =

    1  1  1  . . .  1
    1  1  1  . . .  1
    1  1  0  . . .  0

12

Page 374: Machine Learning and Econometrics

Parametric version: Vertical Regression:

YNt = α0 + ∑_{i=1}^{N−1} αi Yit + ηt

estimated on the T0 pre-treatment periods, with N regressors (in ADH with the restrictions α0 = 0, αi ≥ 0).

Prediction for the T − T0 treated periods:

ŶNt = α̂0 + ∑_{i=1}^{N−1} α̂i Yit

• Nonparametric version: for each post-treatment period t

find a pre-treatment period s with Yis ≈ Yit, i = 1, . . . , N − 1.

• If T large relative to Nc use regularized regression (lasso,

ridge, elastic net), following Doudchenko and Imbens (2016).

13

Page 375: Machine Learning and Econometrics

3. Differences-In-Differences Literature and Econometric Panel Data literature (Bertrand-Duflo-Mullainathan, 2003; Bai 2003, 2009; Bai-Ng 2002)

Note: we model only Y = Y(0) here, because Yit(1) is observed for the relevant units/time-periods.

Models developed here for complete-data matrices:

DID:

Yit = δi + γt + εit

Generalized Fixed Effects (Bai 2003, 2009):

Yit = ∑_{r=1}^{R} δir γrt + εit

Estimate δ and γ by least squares and use to impute missing

values.

14

Page 376: Machine Learning and Econometrics

4. Machine Learning literature on matrix completion /

netflix problem (Candes-Recht, 2009; Keshavan, Montanari

& Oh, 2010)

− Focus on setting with many missing entries: |O|/|M| ≈ 0.

E.g., netflix problem, with units corresponding to individuals,

and time periods corresponding to movie titles, or image re-

covery from limited information, with i and t corresponding

to different dimensions.

− Focus on computational feasibility, with both N and T

large.

− Focus on randomly missing entries: W ⊥⊥ Y.

Like interactive fixed effect, focus on low-rank structure

underlying observations.

15

Page 377: Machine Learning and Econometrics

Set Up

General Case: N >> T , N << T , N ≈ T , possibly stable

patterns over time, possibly stable patterns within units.

Set up without covariates, connecting to the interactive fixed effect literature (Bai, 2003) and the matrix completion literature (Candès and Recht, 2009):

YN×T = LN×T + εN×T

• Key assumption:

WN×T ⊥⊥ εN×T (but W may depend on L)

• Staggered entry, Wit+1 ≥ Wit

• L has low rank relative to N and T .

16

Page 378: Machine Learning and Econometrics

More general case, with unit-specific P -component covariate

Xi, time-specific Q-component covariate Zt, and unit-time-

specific covariate Vit:

Yit = Lit + ∑_{p=1}^{P} ∑_{q=1}^{Q} Xip Hpq Zqt + γi + δt + Vit β + εit

• We do not necessarily need the fixed effects γi and δt, these

can be subsumed into L. It is convenient to include the fixed

effects given that we regularize L.

17

Page 379: Machine Learning and Econometrics

Too many parameters (especially N×T matrix L), so we need

regularization:

We shrink L and H towards zero.

For H we use a Lasso-type element-wise ℓ1 norm, defined as ‖H‖_{1,e} = ∑_{p=1}^{P} ∑_{q=1}^{Q} |Hpq|.

18

Page 380: Machine Learning and Econometrics

How do we regularize L?

L_{N×T} = S_{N×N} Σ_{N×T} R_{T×T}

S, R unitary, Σ rectangular diagonal with entries σj(L) that are the singular values. The rank of L is the number of non-zero σj(L).

‖L‖²_F = ∑_{i,t} |Lit|² = ∑_{j=1}^{min(N,T)} σj²(L)   (Frobenius, like ridge)

⇒ ‖L‖∗ = ∑_{j=1}^{min(N,T)} σj(L)   (nuclear norm, like LASSO)

‖L‖_R = ∑_{j=1}^{min(N,T)} 1_{σj(L)>0}   (rank, like subset selection)

19

Page 381: Machine Learning and Econometrics

Frobenius norm imputes missing values as 0.

Rank norm is computationally not feasible for general missing

data patterns.

The preferred Nuclear norm leads to low-rank matrix but is

computationally feasible.

20

Page 382: Machine Learning and Econometrics

Our proposed method: regularize using the nuclear norm:

min_L (1/|O|) ∑_{(i,t)∈O} (Yit − Lit)² + λ_L ‖L‖∗

For the general case we estimate H, L, δ, γ, and β as

min_{H,L,δ,γ,β} (1/|O|) ∑_{(i,t)∈O} ( Yit − Lit − ∑_{p=1}^{P} ∑_{q=1}^{Q} Xip Hpq Zqt − γi − δt − Vit β )² + λ_L ‖L‖∗ + λ_H ‖H‖_{1,e}

We choose λL and λH through cross-validation.

21

Page 383: Machine Learning and Econometrics

Algorithm (Mazumder, Hastie, & Tibshirani 2010)

Given any N × T matrix A, define the two N × T matrices

PO(A) and P⊥O(A) with typical elements:

P_O(A)_{it} = { Ait if (i, t) ∈ O;  0 if (i, t) ∉ O }

and

P⊥_O(A)_{it} = { 0 if (i, t) ∈ O;  Ait if (i, t) ∉ O }

22

Page 384: Machine Learning and Econometrics

Let A = SΣR⊤ be the singular value decomposition of A, with σ1(A), . . . , σ_{min(N,T)}(A) denoting the singular values.

Then define the matrix shrinkage operator

shrink_λ(A) = S Σ̃ R⊤,

where Σ̃ is equal to Σ with the i-th singular value σi(A) replaced by max(σi(A) − λ, 0).

23

Page 385: Machine Learning and Econometrics

Given these definitions, the algorithm proceeds as follows.

− Start with the initial choice L1(λ) = PO(Y), with zeros for

the missing values.

− Then for k = 1, 2, . . ., define

L_{k+1}(λ) = shrink_λ{ P_O(Y) + P⊥_O(L_k(λ)) },

until the sequence {L_k(λ)}_{k≥1} converges.

− The limiting matrix L∗ is our estimator for the regularization parameter λ, denoted by L̂(λ, O).
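A minimal sketch of this shrink-and-impute iteration using numpy's SVD; representing missing entries by NaN, the fixed iteration count, and the function names are assumptions made for the example.

```python
# Hedged sketch: the shrink-and-impute recursion above. Y is an N x T array
# with np.nan marking missing entries (an assumed data format); lam is lambda.
import numpy as np

def shrink(A, lam):
    S, sigma, Rt = np.linalg.svd(A, full_matrices=False)
    return (S * np.maximum(sigma - lam, 0.0)) @ Rt       # soft-threshold singular values

def soft_impute(Y, lam, n_iter=200):
    observed = ~np.isnan(Y)
    PO_Y = np.where(observed, Y, 0.0)                    # P_O(Y): zeros at missing entries
    L = PO_Y.copy()                                      # L_1(lambda)
    for _ in range(n_iter):
        L = shrink(PO_Y + np.where(observed, 0.0, L), lam)   # P_O(Y) + P_O_perp(L_k)
    return L                                             # approximate L_hat(lambda, O)
```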

24

Page 386: Machine Learning and Econometrics

Results I (for the case without covariates, and just L); O is sufficiently random, and εit = Yit − Lit are iid with variance σ².

‖Y‖_F = √( ∑_{i,t} Y²_{it} ) is the Frobenius norm; ‖Y‖_∞ = max_{i,t} |Yit|.

Let Y∗ be the matrix including all the missing values (e.g., Y(1)).

The estimated matrix L̂ is close to the true L in the following sense:

‖L̂ − L‖_F / ‖L‖_F  ≤  C max( σ, ‖L‖_∞ / ‖L‖_F ) · rank(L)(N + T) ln(N + T) / |O|

In many cases the number of observed entries |O| is of order N × T, so if rank(L) ≪ (N + T) the error goes to 0 as N + T grows.

25

Page 387: Machine Learning and Econometrics

Adaptive Properties of Matrix Regression

Suppose N is large, T = 2, Wi1 = 0, many control units

(treatment effect setting)

− In that case the natural imputation is Li2 = Yi1×ρ, where ρ

is the within unit correlation between periods for the control

units.

− The program-evaluation / horizontal-regression approach

would lead to Li2 = Yi1 × ρ

− The synthetic-control / vertical-regression approach would

lead to Li2 = 0.

− How does the matrix completion estimator impute the

missing values in this case?

26

Page 388: Machine Learning and Econometrics

Suppose for control units

( Yi1, Yi2 ) ∼ N( (0, 0), [ 1 ρ ; ρ 1 ] ),

and suppose the pairs (Yi1, Yi2) are jointly independent.

Then for small λ, for the treated units, the imputed value is

L̂i2 = Yi1 × ρ / ( 1 + √(1 − ρ²) )

The regularization leads to a small amount of shrinkage to-

wards zero relative to optimal imputation.

Matrix Completion method adapts well to shape of matrix

and correlation structure.

27

Page 389: Machine Learning and Econometrics

Illustrations

• We take complete matrices YN×T ,

• We pretend some entries are missing.

• We use different estimators to impute the “missing” entries

and compare them to actual values, and calculate the mean-

squared-error.

28

Page 390: Machine Learning and Econometrics

We compare three estimators:

1. Matrix Completion Nuclear Norm, MC-NNM

2. Horizontal Regr with Elastic Net Regularization, EN-H

(Program evaluation type regression)

3. Vertical Regr with Elastic Net Regularization, EN-V (Syn-

thetic Control type regression)

29

Page 391: Machine Learning and Econometrics

Illustrations I: Stock Market Data

We use daily returns for 2453 stocks over 10 years (3082

days). We create sub-samples by looking at the first T daily

returns of N randomly sampled stocks for pairs of (N, T) such

that N × T = 4900, ranging from fat to thin:

(N, T) = (10,490), . . . , (70,70), . . . , (490,10).

Given the sample, we pretend that half the stocks are treated

at the mid point over time, so that 25% of the entries in the

matrix are missing.

Y_{N×T} =

    X  X  X  X  . . .  X
    X  X  X  X  . . .  X
    X  X  X  X  . . .  X
    X  X  X  ?  . . .  ?
    X  X  X  ?  . . .  ?
    ...
    X  X  X  ?  . . .  ?

30

Page 392: Machine Learning and Econometrics

[Figure: average RMSE plotted against log(N), for N × T = 4900 with fraction missing = 0.25; left panel methods: EN, EN−T; right panel methods: EN, EN−T, MC−NNM]

14

Page 393: Machine Learning and Econometrics

Illustrations II: California Smoking Rate Data

The outcome here is per capita smoking rates by state, for

38 states, 31 years.

We compare both simultaneous adoption and staggered adop-

tion.

31

Page 394: Machine Learning and Econometrics

Illustration I: California Smoking Example (N = 38, T = 31)

[Figure: left panel, simultaneous adoption, Nt = 8; right panel, staggered adoption, Nt = 35]

Page 395: Machine Learning and Econometrics

Generalizations I:

• Allow for propensity score weighting to focus on fit where

it matters:

Model propensity score Eit = pr(Wit = 1|Xi, Zt, Vit), E is N×T

matrix with typical element Eit

Possibly using matrix completion:

E = argminE

1

NT

(i,t)

(Wit − Eit)2 + λL‖E‖∗

and then

minL

1

|O|

(i,t)∈O

Eit

1 − Eit(Yit − Lit)

2 + λL‖L‖∗

32

Page 396: Machine Learning and Econometrics

Generalizations II:

• Take account of of time series correlation in εit = Yit − Lit

Modify objective function from logarithm of Gaussian likeli-

hood based on independence to have autoregressive struc-

ture.

33

Page 397: Machine Learning and Econometrics

References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. "Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program." Journal of the American Statistical Association 105.490 (2010): 493-505.

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. "Comparative politics and the synthetic control method." American Journal of Political Science 59.2 (2015): 495-510.

Abadie, Alberto, and Jeremy L'Hour. "A Penalized Synthetic Control Estimator for Disaggregated Data."

Athey, Susan, Guido W. Imbens, and Stefan Wager. "Efficient inference of average treatment effects in high dimensions via approximate residual balancing." arXiv preprint arXiv:1604.07125v3 (2016).

Bai, Jushan. "Inferential theory for factor models of large dimensions." Econometrica 71.1 (2003): 135-171.

Bai, Jushan. "Panel data models with interactive fixed effects." Econometrica 77.4 (2009): 1229-1279.

Bai, Jushan, and Serena Ng. "Determining the number of factors in approximate factor models." Econometrica 70.1 (2002): 191-221.

Candès, Emmanuel J., and Yaniv Plan. "Matrix completion with noise." Proceedings of the IEEE 98.6 (2010): 925-936.

Candès, Emmanuel J., and Benjamin Recht. "Exact matrix completion via convex optimization." Foundations of Computational Mathematics 9.6 (2009): 717.

Chamberlain, Gary, and Michael Rothschild. "Arbitrage, factor structure, and mean-variance analysis on large asset markets." Econometrica 51 (1983): 1281-1304.

Doudchenko, Nikolay, and Guido W. Imbens. "Balancing, regression, difference-in-differences and synthetic control methods: A synthesis." NBER Working Paper No. w22791 (2016).

Gobillon, Laurent, and Thierry Magnac. "Regional policy evaluation: Interactive fixed effects and synthetic controls." Review of Economics and Statistics 98.3 (2016): 535-551.

Imbens, Guido W., and Donald B. Rubin. Causal Inference. Cambridge University Press.

Keshavan, Raghunandan H., Andrea Montanari, and Sewoong Oh. "Matrix completion from a few entries." IEEE Transactions on Information Theory 56.6 (2010): 2980-2998.

Keshavan, Raghunandan H., Andrea Montanari, and Sewoong Oh. "Matrix completion from noisy entries." Journal of Machine Learning Research 11 (2010): 2057-2078.

Liang, Dawen, et al. "Modeling user exposure in recommendation." Proceedings of the 25th International Conference on World Wide Web (2016).

Mazumder, Rahul, Trevor Hastie, and Robert Tibshirani. "Spectral regularization algorithms for learning large incomplete matrices." Journal of Machine Learning Research 11 (2010): 2287-2322.

Recht, Benjamin. "A simpler approach to matrix completion." Journal of Machine Learning Research 12 (2011): 3413-3430.

Xu, Yiqing. "Generalized synthetic control method: Causal inference with interactive fixed effects models." Political Analysis 25.1 (2017): 57-76.