Causal Inference with Panel Data · 2020-03-04

Causal Inference with Panel Data

Yiqing Xu (Stanford University)

Northwestern-Duke Causal Inference Workshop

19 August 2019

Motivation

Cadre Visits on Loans

[Figure: log outstanding loans by quarter(s) before / after a cadre's visit; 182 treated firms and 542 control firms; visits marked.]


[Figure: The Effect of Cadre Visits on Loans; diff-in-diffs in loans by quarter relative to the visit (−12 to 4), with 95% confidence interval.]

Implicitly Assumed

[Figure: The "Effect" of Cadre Visits on Loans; diff-in-diffs in loans by quarter relative to the visit (−12 to 4), with 95% confidence interval.]

[Figure: The "Effect" of Assault Weapon Ban on Crime Rate; diff-in-diffs in crime rate (%) by year(s) before / after reform, with 95% confidence interval.]

[Figure: The "Effect" of WTO on Trade; diff-in-diffs in log trade volume by year(s) before / after both parties in the WTO, with 95% confidence interval.]

[Figure: The "Effect" of Democratization on Economic Development; diff-in-diffs in log GDP per capita by year(s) before / after democratization, with 95% confidence interval.]

[Figure: The "Effect" of Tim Geithner Connections on Stock Market Return; diff-in-diffs in abnormal return (%) by day(s) before / after the news leak, with 95% confidence interval.]

Motivations

Two-way fixed effects (FE) and difference-in-differences (DiD) methods

are commonly used in the social sciences.

• Abundant observational panel data

• Accounting for unobserved unit and time heterogeneity

• Easy to estimate and interpret

[Figure: count of papers published in APSR/AJPS/JOP per year, 2000-2016, mentioning "difference-in-differences" or "fixed effects" + "panel data".]

We Try to Answer

1. What about time-varying confounders, e.g. when a “pre-trend” exists?

2. What if treatment effects are heterogeneous, as they always are?

3. What are the differences between two-way FE and DiD?

4. Are the assumptions credible? What are the hypothetical experiments behind these models?

5. How can we do better than DiD or synthetic control?

What’s Special about Panel Data?

• The fundamental problem of causal inference (Holland 1986)

$\tau_i = Y_{1i} - Y_{0i}$

• A statistical solution makes use of others’ information

e.g. $ATE = E[Y_1] - E[Y_0]$

• A scientific solution exploits homogeneity or invariance assumptions

e.g. A rock stays a rock.

e.g. The long-run growth rate of the US economy is 2.5%.

• Panel data allow us to construct treated counterfactuals using

information from both the past and the others with the caveat that

treatment assignment mechanism may be complicated

• Panel data is also difficult because of all kinds of interferences

(SUTVA violations)


Causal Inference with Panel Data

• “Scientific” solution: modeling (but all models are wrong...)

• Statistical solution: similar to the Selection-on-Observable (SOO)

approach, e.g., matching/reweighting

• Panel data make both easier

• Pre-trends are observable → more information for modeling

• The additional dimension helps relax the conventional ignorability

assumption

• And we can do more...


Roadmap

• DiD and synthetic control

• FE/DiD assumptions

• New estimators

• Matching and reweighting

• Outcome models

* Diagnostic tools

• Hybrid methods

• Conclusions

Quick Review of DiD and Synth

Difference-in-Differences

$$Y_{it} = \tau_{it} D_{it} + \alpha_i + \xi_t + \varepsilon_{it}$$

or

$$\begin{cases} Y^0_{it} = \alpha_i + \xi_t + \varepsilon_{it} \\ Y^1_{it} = Y^0_{it} + \tau_{it} \end{cases}$$

• $\tau_{it}$ is the treatment effect for unit i at time t

• $Y^0_{it}$ is a combination of two additive fixed effects and idiosyncratic errors

• $E[\varepsilon_{it}] = 0$ and $\varepsilon_{it} \perp\!\!\!\perp D_{is}$ for all i, t, s (strict exogeneity)

• $ATT = E[\tau_{it} \mid D_{it} = 1]$ can be non-parametrically identified if there are only two periods (or two treatment histories)
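A minimal sketch of the two-period, two-group case, on made-up numbers: with no noise, the difference between the treated and control changes removes both the unit effects and the common time shock. The function name `did_2x2` and the data are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def did_2x2(y_pre, y_post, treated):
    """Difference-in-differences with two periods and two groups."""
    y_pre = np.asarray(y_pre, dtype=float)
    y_post = np.asarray(y_post, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    change = y_post - y_pre                    # differences out alpha_i
    return change[treated].mean() - change[~treated].mean()  # differences out xi_t

# Hypothetical data: unit effects alpha_i, common shock 1.5, treatment effect 2.
alpha = np.array([1.0, 3.0, 0.0, 2.0])
treated = np.array([True, True, False, False])
y_pre = alpha
y_post = alpha + 1.5 + 2.0 * treated
att_hat = did_2x2(y_pre, y_post, treated)
print(att_hat)
```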


Difference-in-Differences

$$\begin{pmatrix} Y^0_{T,pre} & ?? \\ Y^0_{C,pre} & Y^0_{C,post} \end{pmatrix}$$


DiD

[Figure: treatment status by unit and time (periods 1-30); cells are Under Control or Under Treatment.]

Semi-parametric DiD

• Abadie (2005) proposes “semi-parametric DiD”

• Assumption: non-parallel outcome dynamics between treated and controls are caused by observed characteristics

• Two-step strategy:

1. Estimate the propensity score based on observed covariates and compute the fitted values

2. Run a weighted DiD model

• The idea of using pre-treatment variables to adjust trends is a

precursor to synthetic control

• Strezhnev (2018) extends this approach to incorporate pre-treatment

outcomes
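The two-step strategy can be sketched as follows. This is a simulation under stated assumptions, not Abadie's code: the data-generating process is invented, the propensity score is a plain logit fitted by Newton's method, and the weighting formula used is one common way of writing the reweighting estimator, ATT = E[ΔY · (D − p(X)) / (1 − p(X))] / P(D = 1).

```python
import numpy as np

def fit_logit(X, d, n_iter=25):
    """Step 1: logistic regression by Newton's method; returns fitted probabilities."""
    Xc = np.column_stack([np.ones(len(d)), X])
    beta = np.zeros(Xc.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xc @ beta))
        grad = Xc.T @ (d - p)
        hess = (Xc * (p * (1.0 - p))[:, None]).T @ Xc
        beta += np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-Xc @ beta))

def semiparametric_did(dy, d, pscore):
    """Step 2: propensity-score-weighted DiD on the outcome change dy."""
    return np.mean(dy * (d - pscore) / (1.0 - pscore)) / np.mean(d)

# Hypothetical simulation: the outcome trend depends on x, and so does treatment.
rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
d = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-0.5 * x))).astype(float)
dy = 1.0 * d + 2.0 * x + rng.normal(size=n)   # true ATT is 1

pscore = fit_logit(x[:, None], d)
att_weighted = semiparametric_did(dy, d, pscore)
att_naive = dy[d == 1].mean() - dy[d == 0].mean()   # plain DiD, ignores x
print(att_weighted, att_naive)
```

In this setup the plain DiD is pushed upward because treated units have higher x and therefore steeper trends; the weighted version corrects for that.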


Synthetic Control: Basic Idea

• J + 1 units in periods 1, 2, . . . ,T ; one treated “1”, J controls

• Region “1” is exposed to the intervention after period T0

• We aim to estimate the effect of the intervention on Region “1”

[Figure: data matrix of units 1, ..., J+1 by periods 1, ..., T0, ..., T, with weights W* on the control units.]

Synthetic Control: Insights

• Athey and Imbens (2017): “[a]rguably the most important

innovation in the policy evaluation literature in the last 15 years.”

• A combo of many innovations

• Take advantage of pre-treatment outcomes

• Use cross-sectional instead of temporal correlations in data

• Use a convex combination of donors to construct a counterfactual

• Reserve some pre-treatment periods for testing


Difference-in-Differences (DiD)

[Figure: outcome by time relative to the treatment (−12 to 4) for treated and controls; treatment begins after T0.]

Synthetic Control (and Many Extensions)

[Figure: outcome by time relative to the treatment (−12 to 4) for treated and controls; treatment begins after T0.]

Theoretical Justification

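$$Y_{it} = \tau_{it} D_{it} + \theta_t' Z_i + \xi_t + \lambda_i' f_t + \varepsilon_{it}$$

or

$$\begin{cases} Y^0_{it} = \theta_t' Z_i + \xi_t + \lambda_i' f_t + \varepsilon_{it} \\ Y^1_{it} = Y^0_{it} + \tau_{it} \end{cases}$$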

• Suppose there are R time-varying signals ft out there

• Each unit (e.g. country, participant) picks up a fixed linear

combination of these signals based on factor loadings λi

• Since these “confounders” are evidenced in the pre-treatment

outcomes for both treated and controls, we can try to use this

information to “balance on” these confounders

• We will discuss the model-based approach later

Theoretical Justification

$$\begin{cases} Y^0_{it} = \theta_t' Z_i + \xi_t + \lambda_i' f_t + \varepsilon_{it} \\ Y^1_{it} = Y^0_{it} + \tau_{it} \end{cases}$$

• Let $W = (w_2, \dots, w_{J+1})'$ with $w_j \ge 0$ and $w_2 + \cdots + w_{J+1} = 1$.

• Let $Y^{K_1}_i, \dots, Y^{K_M}_i$ be $M > R$ linear functions of pre-intervention outcomes

• Suppose that we can choose $W^*$ such that:

$$Z_1 = \sum_{j=2}^{J+1} w^*_j Z_j, \qquad Y^k_1 = \sum_{j=2}^{J+1} w^*_j Y^k_j, \quad k \in \{K_1, \dots, K_M\}$$

• When $T_0$ is large, an approximately unbiased estimator of $\tau_{1t}$ is:

$$\hat\tau_{1t} = Y_{1t} - \sum_{j=2}^{J+1} w^*_j Y_{jt}, \quad t \in \{T_0 + 1, \dots, T\}$$

Implementation

• Let $X_1 = (Z_1, Y^{K_1}_1, \dots, Y^{K_M}_1)'$ be a (k × 1) vector of pre-intervention characteristics for the treated and $X_0$, a (k × J) matrix, for the controls.

• The vector $W^*$ is chosen to minimize $\|X_1 - X_0 W\|$, subject to our weight constraints.

• We consider $\|X_1 - X_0 W\|_V = \sqrt{(X_1 - X_0 W)' V (X_1 - X_0 W)}$, where V is some (k × k) symmetric and positive semidefinite matrix.

• Various ways to choose V (subjective assessment of predictive power of X, regression, minimize MSPE, cross-validation, etc.).
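For a fixed V, the weight problem above is a quadratic program over the simplex. A minimal sketch, assuming V is given (identity by default) and using projected gradient descent in plain NumPy rather than any particular synthetic control package; the function names and the toy matrices are illustrative only.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def synth_weights(X1, X0, V=None, n_iter=5000):
    """Minimize (X1 - X0 w)' V (X1 - X0 w) subject to w >= 0 and sum(w) = 1."""
    k, J = X0.shape
    V = np.eye(k) if V is None else V
    A = X0.T @ V @ X0
    b = X0.T @ V @ X1
    step = 1.0 / np.linalg.eigvalsh(A).max()   # 1 / Lipschitz constant of the gradient
    w = np.full(J, 1.0 / J)
    for _ in range(n_iter):
        w = project_to_simplex(w - step * (A @ w - b))
    return w

# Hypothetical example: the treated unit is an exact convex combination of donors.
X0 = np.array([[1.0, 0.0, 2.0],
               [0.0, 1.0, 1.0],
               [1.0, 1.0, 0.0],
               [2.0, 0.0, 1.0]])            # k = 4 characteristics, J = 3 donors
w_true = np.array([0.2, 0.8, 0.0])
X1 = X0 @ w_true
w_star = synth_weights(X1, X0)
print(np.round(w_star, 4))
```

Choosing V (the outer problem on the slide) is left out of this sketch.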


Example: Proposition 99 on Cigarette Consumption

• In 1988, California first passed comprehensive tobacco control

legislation (cigarette tax, media campaign etc.)

• Using 38 states that had never passed such programs as controls

[Figure: Cigarette Consumption: CA and the Rest of the U.S.; per-capita cigarette sales (in packs) by year, 1970-2000; California vs. rest of the U.S.; passage of Proposition 99 marked.]

[Figure: Cigarette Consumption: CA and Synthetic CA; per-capita cigarette sales (in packs) by year, 1970-2000; California vs. synthetic California.]

Predictor Means: Actual vs. Synthetic California

Variables                          California   California    Average of
                                   (Real)       (Synthetic)   38 control states
Ln(GDP per capita)                 10.08        9.86          9.86
Percent aged 15-24                 17.40        17.40         17.29
Retail price                       89.42        89.41         87.27
Beer consumption per capita        24.28        24.20         23.75
Cigarette sales per capita 1988    90.10        91.62         114.20

Note: All variables except lagged cigarette sales are averaged for the 1980-1988 period (beer consumption is averaged 1984-1988).

Smoking Gap Between CA and Synthetic CA

[Figure: gap in per-capita cigarette sales (in packs) between California and synthetic California, by year, 1970-2000.]

Smoking Gap for CA and 38 Control States

(All States in Donor Pool)

[Figure: gap in per-capita cigarette sales (in packs) by year, 1970-2000; California vs. control states.]

(Pre-Prop. 99 MSPE ≤ 20 Times Pre-Prop. 99 MSPE for CA)

[Figure: gap in per-capita cigarette sales (in packs) by year, 1970-2000.]

(Pre-Treatment MSPE ≤ 5 Times Pre-Treatment MSPE for CA)

[Figure: gap in per-capita cigarette sales (in packs) by year, 1970-2000.]

(Pre-Treatment MSPE ≤ 2 Times Pre-Treatment MSPE for CA)

[Figure: gap in per-capita cigarette sales (in packs) by year, 1970-2000.]

Ratio Post-Treatment MSPE to Pre-Treatment MSPE

(All 38 States in Donor Pool)

[Figure: histogram (frequency) of the post/pre-Proposition 99 mean squared prediction error; California marked.]

Limitations and Potential Solutions

• Deals with only one treated unit at a time

• Multiple treated units, e.g. Acemoglu et al. (2017)

• Inference is hard

• Permutation inference and sensitivity analysis, e.g. Hahn and Shi (2016); Sergio et al. (2017); Chernozhukov (2017)

• Allows too much user discretion, e.g. cherry-picking $Y^k_i$ results in over-rejection (Ferman et al. 2017)

• Slow to implement and sometimes difficult to find a solution

• Other reweighting approaches...

• Model-based methods


Literature

• Difference-in-differences (Ashenfelter & Card 1985; Card and Krueger 1994)

• Synthetic control (Abadie and Gardeazabal 2003; Abadie et al 2010, 2015)

• Best subset or penalized regression models (e.g. Hsiao et al. 2012;

Valero 2015; Doudchenko and Imbens 2016; Chernozhukov et al 2019)

• Matching and reweighting methods (e.g. Abadie 2005; Imai and Kim 2018; Hazlett and Xu 2018; Strezhnev 2018; Imai et al 2019)

• Outcome model: factor augmented and matrix completion methods (e.g. Gobillon and Magnac 2016; Xu 2017; Athey et al 2019)

• Hybrid approaches (e.g. Ben-Michael, Feller and Rothstein 2018; Arkhangelsky et al. 2019)

• Growing ...


Literature

• Difference-in-Differences (DiD): Ashenfelter & Card (1985), Card & Krueger (1994)

• Semi-parametric DiD: Abadie (2005)

• Synthetic Control Method: Abadie & Gardeazabal (2003), Abadie et al. (2010)

• Issues addressed by newer methods: time-varying confounders; multiple treated units, computational and statistical efficiency, robustness, inference

• Matching/Reweighting

  • Regression Methods: Hsiao (2012), Doudchenko & Imbens (2016)

  • Panel Matching: Imai and Kim (2018), Imai, Kim and Wang (2018)

  • Propensity Score Reweighting: Austin (2011), Blackwell & Glynn (2018), Strezhnev (2018)

  • Balancing: Robbins et al. (2017), Hazlett and Xu (2018)

• Outcome Modeling

  • Interactive Fixed Effects: Bai (2009), Gobillon & Magnac (2016), Xu (2017)

  • Matrix Completion: Athey et al. (2018)

• Hybrid Methods

  • Augmented Synth: Ben-Michael et al. (2018)

  • Synthetic DID: Arkhangelsky et al. (2018)



FE/DiD Assumptions

Assumptions for Twoway FEs

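$$Y_{it} = \tau D_{it} + X_{it}'\beta + \alpha_i + \xi_t + \varepsilon_{it}$$

in which $D_{it}$ is dichotomous

1. Functional form

• Additive fixed effect

• Constant and contemporaneous treatment effect

• Linearity in covariates

2. Strict exogeneity

$$\varepsilon_{it} \perp\!\!\!\perp D_{js}, X_{js}, \alpha_j, \xi_s \quad \forall i, j, t, s$$

$$\Rightarrow \{Y_{it}(0), Y_{it}(1)\} \perp\!\!\!\perp D_{js} \mid \mathbf{X}, \boldsymbol{\alpha}, \boldsymbol{\xi} \quad \forall i, j, t, s$$

if only two groups, parallel trends:

$$\Rightarrow E[Y_{it}(0) - Y_{it'}(0) \mid \mathbf{X}] = E[Y_{jt}(0) - Y_{jt'}(0) \mid \mathbf{X}], \quad i \in \mathcal{T},\ j \in \mathcal{C},\ \forall t, t'$$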

Shortcomings of Twoway FEs


Assumptions

1. Functional form


Challenges

1. Treatment effect heterogeneity leads to bias

2. Difficult to evaluate strict exogeneity

3. Strict exogeneity means a lot more than what you think

4. A deeper question: what does the fixed effects approach imply from a design-based inference perspective?

Treatment Effect Heterogeneity Causes Bias

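$$Y_{it} = \tau_{it} D_{it} + \alpha_i + \varepsilon_{it}$$

Within estimator:

$$\hat\tau = \arg\min_\tau \sum_i \sum_t \{(Y_{it} - \bar{Y}_i) - \tau (D_{it} - \bar{D}_i)\}^2$$

Causal estimand (ATT):

$$\tau = E[\tau_{it} \mid C_i = 1, D_{it} = 1], \quad C_i = 1 \text{ if } var(D_{it}) \neq 0$$

Proposition:

$$\hat\tau \to \frac{E\{C_i \sigma_i^2 [Y_i(1) - Y_i(0)]\}}{E[C_i \sigma_i^2]} = \frac{E[C_i \sigma_i^2 \tau_i]}{E[C_i \sigma_i^2]} \neq \tau$$

i.e. a unit fixed effect model gives the average “variance-weighted” treatment effect, cf. Chernozhukov et al. (2013) Theorem 1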
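A small numerical check of the proposition, on an invented noise-free panel. It assumes $\sigma_i^2$ denotes the within-unit variance of $D_{it}$ and that effects are constant within a unit; under those assumptions the within estimator matches the variance-weighted average of the unit effects exactly and differs from the ATT.

```python
import numpy as np

def within_estimator(Y, D):
    """Unit fixed-effects (within) estimate of the coefficient on D."""
    Yd = Y - Y.mean(axis=1, keepdims=True)
    Dd = D - D.mean(axis=1, keepdims=True)
    return (Dd * Yd).sum() / (Dd ** 2).sum()

# Hypothetical panel: 3 units, 4 periods, no noise.
D = np.array([[0, 0, 1, 1],      # treated half the time: largest within-unit variance
              [0, 0, 0, 1],      # treated a quarter of the time: smaller variance
              [0, 0, 0, 0]],     # never treated: C_i = 0, gets zero weight
             dtype=float)
tau_i = np.array([1.0, 5.0, 9.0])        # unit-specific effects
alpha_i = np.array([2.0, -1.0, 0.5])     # unit fixed effects
Y = alpha_i[:, None] + tau_i[:, None] * D

tau_fe = within_estimator(Y, D)
sigma2 = D.var(axis=1)                                   # within-unit variance of D
tau_weighted = (sigma2 * tau_i).sum() / sigma2.sum()     # variance-weighted average
att = (tau_i[:, None] * D).sum() / D.sum()               # average over treated cells
print(tau_fe, tau_weighted, att)
```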


Common Practice: Dynamic Treatment Effect Plots

Adhikari and Alm (2016)

1. parametric assumptions

2. arbitrarily chosen base category

3. unreliable tests


Reinterpreting Strict Exogeneity (Imai and Kim 2019)

1. No unobserved time-varying confounder exists

2. Past outcomes don’t directly affect current outcome (no LDV)

3. Past treatments don’t directly affect current outcome (no “carryover

effect”)

4. Past outcomes don’t directly affect current treatment (no “feedback”)


• To relax 2 or 3, “block”/control for past treatments → but how many?

• To relax 4, need instrumental variables (Arellano and Bond 1991) → hard to justify instruments; bad finite sample properties

• Often end up directly controlling for an arbitrary number of past treatments and LDVs → Nickell bias




Hypothetical Experiment?

Strict exogeneity implies the following data generating processes:

$$\alpha_i, \mathbf{X}_i \to \mathbf{D}_i \to \mathbf{Y}_i$$

Treatment status is assigned randomly or in one shot, not sequentially!

Example: random assignment within units

[Figure: treatment status by unit and time (periods 1-30).]

$$\alpha_i, \mathbf{X}_i \to T_{0i} \to \mathbf{D}_i \to \mathbf{Y}_i$$

Example: staggered adoption (Athey and Imbens 2018)

[Figure: treatment status by unit and time (periods 1-35).]

What We Try to Do


Assumptions

1. Functional form


What we do

? Allow treatment effect heterogeneity

? Develop methods to evaluate strict exogeneity

X Relax the no-time-varying confounder assumption

- Think harder about the hypothetical experiment


Matching and Reweighting

The Matching/Reweighting Approach

Synthetic control

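$$Y_{it}(0) = \lambda_i' f_t + \xi_t + \varepsilon_{it}, \quad E[\varepsilon_{it}] = 0,\ \forall i, t$$

• Choose weights on controls such that $Y_{it} = \sum_{i' \neq i} w^*_{i'} Y_{i't},\ \forall t \le T_0$

• As a result: $\lambda_i = \sum_{i' \neq i} w^*_{i'} \lambda_{i'}$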

New Developments

• Propensity score reweighting – extension to Abadie (2005)

• Panel matching

• Panel regression methods

• Balancing: mean balancing and trajectory balancing


Panel Matching (Imai, Kim & Wang 2019)

Assumptions

• Sequential exogeneity (past info can affect today’s treatment)

→ Parallel trends after conditioning (similar to Abadie 2005)

• SUTVA (no spillover effect)

• Limited carryover effect

Procedure

1. Create a matched set for each treated observation based on

treatment history

2. Refine the matched set via any matching or weighting method

3. Compute causal effect via weighted DiD using the refined set

4. Calculate model-based standard errors
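Step 1 of the procedure can be sketched on a toy treatment matrix. This is an illustration in plain NumPy, not the interface of any panel matching package; `matched_sets` and the example matrix are hypothetical, and L here is the number of lagged periods whose treatment history must agree.

```python
import numpy as np

def matched_sets(D, L):
    """For each (i, t) where unit i switches into treatment at period t, list the
    units that are untreated at t and share i's treatment history over
    periods t-L, ..., t-1."""
    N, T = D.shape
    sets = {}
    for i in range(N):
        for t in range(L, T):
            if D[i, t] == 1 and D[i, t - 1] == 0:
                history = D[i, t - L:t]
                sets[(i, t)] = [
                    j for j in range(N)
                    if j != i and D[j, t] == 0
                    and np.array_equal(D[j, t - L:t], history)
                ]
    return sets

# Hypothetical treatment matrix: rows are units, columns are periods.
D = np.array([[0, 0, 0, 1, 1],
              [0, 0, 0, 0, 0],
              [0, 1, 1, 1, 1],
              [0, 0, 0, 0, 1]])
sets = matched_sets(D, L=2)
print(sets)   # {(0, 3): [1, 3], (3, 4): [1]}
```

Unit 2 switches at period 1, before L lags are available, so it has no matched set in this example. Refinement (step 2) and the weighted DiD (step 3) would operate on these sets.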


Example: Democracy and Economic Growth

[Figure: Democracy and Economic Growth: Treatment Status; country by year, 1960-2008; cells are Missing, Autocracy, or Democracy.]

• Match based on treatment history for the past L periods

• Refine the matched set based on covariates and pre-treatment outcomes

[Figure: the number of matched control units.]

[Figure: estimated treatment effects.]

Panel Matching: Advantages and Limitations

Advantages

• Require sequential exogeneity instead of strict exogeneity

• Allow treatment reversal

• Allow a variety of matching/reweighting methods

Limitations

• Lots of data (w/ info on outcome dynamics) are dropped

• Normally, imbalances remain

• Many choices require user discretion


Panel Regression Methods

• Synthetic control finds a convex combination of controls to mimic

the treated

• Implicitly assumes

• non-negative weights

• no intercept shift

• weights add up to 1

• Other ways to obtain weights

• Constrained regression

• Regression based on the best subset

• Penalized regression

• Balancing


Constrained Regression

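• No intercept shift; weights add up to 1; non-negativity

$$\hat\omega^{constr} = \arg\min_\omega \sum_{t \le T_0} (Y_{1t} - \omega' Y_{Ct})^2 \quad \text{s.t. } \sum_{i \in \mathcal{C}} \omega_i = 1 \text{ and } \omega_i \ge 0,\ \forall i \in \mathcal{C}$$

• Limitation: needs $T_0 > N_{co}$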

Best Subset (Hsiao et al. 2012)

• Choose the number of weights that are allowed to be positive

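• Optimize the weights after taking out the intercepts

$$(\hat\mu^{subset}, \hat\omega^{subset}) = \arg\min_{\mu, \omega} \sum_{t \le T_0} (Y_{1t} - \mu - \omega' Y_{Ct})^2 \quad \text{s.t. } \sum_{i \in \mathcal{C}} 1\{\omega_i \neq 0\} \le k$$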

• Bottom-up approach: search for the best 1, then the best 2, then

the best 3 ... (greedy)

• Weights can be negative and do not need to add up to 1
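The bottom-up search can be sketched as greedy forward selection with an intercept. This is a simplified illustration, not Hsiao et al.'s procedure in full (their choice of k by an information criterion is omitted); the function names and the toy pre-treatment data are assumptions.

```python
import numpy as np

def fit_subset(y1, Yc, cols):
    """OLS of y1 on an intercept and the chosen control columns; returns (coef, SSR)."""
    X = np.column_stack([np.ones(len(y1))] + [Yc[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(X, y1, rcond=None)
    ssr = float(((y1 - X @ coef) ** 2).sum())
    return coef, ssr

def greedy_subset(y1, Yc, k):
    """Add one control at a time, each time picking the one that lowers SSR most."""
    chosen = []
    for _ in range(k):
        rest = [c for c in range(Yc.shape[1]) if c not in chosen]
        best = min(rest, key=lambda c: fit_subset(y1, Yc, chosen + [c])[1])
        chosen.append(best)
    coef, _ = fit_subset(y1, Yc, chosen)
    omega = np.zeros(Yc.shape[1])
    omega[chosen] = coef[1:]
    return coef[0], omega

# Hypothetical pre-treatment outcomes: T0 = 8 periods, 3 controls.
c0 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
c1 = np.array([1, 1, -1, -1, 1, 1, -1, -1], dtype=float)
c2 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
Yc = np.column_stack([c0, c1, c2])
y1 = 2.0 + 0.5 * c0 - 0.3 * c2          # treated unit; note the negative weight
mu, omega = greedy_subset(y1, Yc, k=2)
print(mu, omega)
```

The recovered weights include a negative entry and do not sum to 1, which is exactly what this approach permits and synthetic control does not.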


Penalized Regression (Doudchenko and Imbens 2016)

• Use elastic net (L1 and L2 regularization) to select weights

$$(\hat\mu^{en}, \hat\omega^{en}) = \arg\min_{\mu, \omega} \|Y_{1,pre} - \mu - \omega' Y_{C,pre}\|_2^2 + \lambda \cdot \left( \frac{1 - \alpha}{2} \|\omega\|_2^2 + \alpha \|\omega\|_1 \right)$$

in which λ and α are tuning parameters

• Allow intercept shift; weights can be negative and do not add up to 1

• Randomization inference: (1) treated unit is randomly selected; (2) the timing of the onset of the treatment is randomly selected

Example: Proposition 99 in California

Adapted from Doudchenko and Imbens (2016)


Example: German Reunification

Adapted from Doudchenko and Imbens (2016)


Differences Among Existing Methods

                             synth  cstr  Hsiao  EN   IPW  PMatch  meanbal  tjbal
Methods of getting weights   bal    reg   reg    reg  reg  match   bal      bal

Allow short T0: X X X X
Non-negative weights: X X X X X
Weights add up to 1: X X X X
Intercept shift: X X X X X X X
Convex combination: X X X

Multiple treated units: X X X X
Computational efficiency: X X X X X X
Dimension reduction: X X
(Almost) exact balancing: X X
Higher-order features: X

See Doudchenko & Imbens (2016) and Ben-Michael et al. (2018) for detailed discussions.

Balancing: A Simple, Unified Framework

Almost all commonly used panel data models imply a common function space with $Y^0_{post}$ linear in $Y_{pre}$.

Linearity in Pre-Treatment Outcomes – LPO

$$E[Y^0_{it} \mid Y_{i,pre}] = (1\ \ Y_{i,pre})'\theta_t, \quad T_0 < t \le T.$$

• Diff-in-Diffs

• Twoway FEs

• ARMA

• IFE, Synth

Basic Idea: Equal means on $Y_{i,pre}$ would imply equal means on $Y^0_{it}$, $t > T_0$, regardless of $\theta_t$ (Robbins et al. 2017)

Mean Balancing

• Objective: choose weights on controls to get the same average pre-treatment trajectory for weighted controls as for the treated,

$$\frac{1}{N_{tr}} \sum_{i \in \mathcal{T}} Y_{i,pre} = \sum_{j \in \mathcal{C}} w_j Y_{j,pre}$$

• Conceptually similar to synth: choosing weights to predict counterfactual averages

• In practice: seek approximate balance, working from largest toward smallest principal components of $Y_{pre}(Y_{pre})'$ with a stopping rule of minimizing the upper bound of biases (Hazlett & Xu 2018)

Mean Balancing

Assumption 1 (Linearity in Pre-Treatment Outcomes)

$$E[Y^0_{it} \mid Y_{i,pre}] = (1\ \ Y_{i,pre})'\theta_t, \quad T_0 < t \le T.$$

Assumption 2 (Conditional Ignorability)

$$Y^0_{it} \perp\!\!\!\perp G_i \mid Y_{i,pre}, \quad \forall t > T_0$$

Assumption 3 (Feasibility)

$$\frac{1}{N_{tr}} \sum_{G_i = 1} Y_{it} = \sum_{G_i = 0} w_i Y_{it},\ t \le T_0, \qquad w_i \ge 0,\ \sum_{G_i = 0} w_i = 1$$

Under the above assumptions, we have:

Proposition 4 (Unbiasedness)

$$E[\widehat{ATT}_t \mid Y_{pre}] = ATT_t, \quad \forall t > T_0$$

Limitations of Mean Balance

1. Weights that achieve mean balancing can leave treated and control

different on non-linear functions of Ypre

2. Mean balancing limited by number of pre-treatment points

• Few pre-treatment periods = few constraints unless you make more

• With enough periods, anything that matters to Y 0 will appear in Y 0

and can be balanced on — but with fewer periods, no guarantees


Trajectory Balancing

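• Mean balancing: a simple approach to get same average trajectory for weighted controls as treated:

$$\frac{1}{N_{tr}} \sum_{i \in \mathcal{T}} Y_{i,pre} = \sum_{j \in \mathcal{C}} w_j Y_{j,pre}$$

• Trajectory balancing: feature mapping $Y_{i,pre} \mapsto \phi(Y_{i,pre})$, then balance

$$\frac{1}{N_{tr}} \sum_{i \in \mathcal{T}} \phi(Y_{i,pre}) = \sum_{j \in \mathcal{C}} w_j \phi(Y_{j,pre})$$

• Get mean balance on a feature expansion instead

$$\phi(Y_{i,pre}, X_i), \quad \phi : \mathbb{R}^P \mapsto \mathbb{R}^{P'}$$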

Choice of φ()

A good choice of φ() is one that:

• requires little or no user discretion

• includes all continuous functions (at the limit)

• perhaps, prioritizes low frequency, smoother functions

• allows covariates to play a role

Implementation: Gaussian kernel then approximation via principal

components


Implementation

• Instead of $Y_{pre}$, form the kernel matrix

$$K_{i,j} = k([Y_i, X_i], [Y_j, X_j]) = \exp(-\|[Y_i, X_i] - [Y_j, X_j]\|^2 / h)$$

• This replaces each unit’s $[Y_{i,pre}, X_i]$ with a vector $k_i$ encoding how similar observation i is to observations 1, 2, ...

• Take the SVD of this matrix to obtain components/eigenvectors

• Choose weights to get mean balance on these, starting from largest

• We choose the number of principal components to include by minimizing the

upper bound of bias in the ATT estimates
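The first two implementation steps (kernel matrix, then its leading components) can be sketched as below. The bandwidth h set to the number of columns is an assumption made for illustration, the data are invented, and the weight-solving and bias-bound steps are not shown.

```python
import numpy as np

def gaussian_kernel(Z, h):
    """K[i, j] = exp(-||Z_i - Z_j||^2 / h) for rows Z_i = [Y_i,pre, X_i]."""
    sq_dist = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / h)

def leading_components(K, r):
    """Top-r eigenvalues and eigenvectors of the symmetric kernel matrix."""
    vals, vecs = np.linalg.eigh(K)            # eigh returns ascending order
    order = np.argsort(vals)[::-1][:r]
    return vals[order], vecs[:, order]

# Hypothetical pre-treatment trajectories for four units (three periods each).
Z = np.array([[1.0, 1.1, 1.2],
              [1.0, 1.0, 1.1],
              [3.0, 2.0, 4.0],
              [2.9, 2.1, 3.8]])
K = gaussian_kernel(Z, h=Z.shape[1])          # assumed bandwidth: number of columns
vals, comps = leading_components(K, r=2)
print(np.round(K, 3))
```

Units 0 and 1 have similar trajectories, so their kernel entry is close to 1, while entries linking them to units 2 and 3 are near 0. Mean balancing would then be applied to the columns of `comps`.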


When Averages Fail and φ()’s Thrive

Intuition: mean balancing is okay but may emphasize “wrong” features

of the pre-treatment trend

• Trajectory balancing gets you similarity of whole trajectories rather

than just equal means at each time point → balance on

“higher-order” features such as variance, curvature, etc.

• Approximately, trajectory balancing gets multivariate distribution of

Ypre for the controls equal to that of the treated, whereas mean

balancing only gets marginals equal

• This can matter when non-linear functions of Ypre are confounders, especially when T0 is short

When Mean Balancing Fails: A Severe Example

• N = 200 countries with simulated GDP over years $T \in \{1, 2, \dots, 24\}$

• Two “types” of countries:

Volatile with no growth:

$$GDP_{it} = 5 + a_i \sin(0.2\pi t) + b_i \cos(0.2\pi t) + 0.1\varepsilon_{it}, \quad \varepsilon_{it} \sim N(0, 1),\ a_i, b_i \sim U(-1, 1)$$

Or steadily growing:

$$GDP_{it} = 4 + c_i \cdot 1.03^t + 0.1\varepsilon_{it}, \quad \varepsilon_{it} \sim N(0, 1),\ c_i \sim U(0.9, 1.1)$$

• A randomly selected 25% of the stable type take the treatment.
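The simulated data can be generated roughly as follows. Two things are assumptions here because the slide does not state them: the two types are split evenly, and "stable type" is read as the steadily growing type. The function name and seed are arbitrary.

```python
import numpy as np

def simulate_gdp(n=200, T=24, share_treated=0.25, seed=0):
    """Simulate the two country types; returns (gdp, is_growing, treated)."""
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    n_volatile = n // 2                               # assumed even split
    a = rng.uniform(-1, 1, size=(n_volatile, 1))
    b = rng.uniform(-1, 1, size=(n_volatile, 1))
    volatile = 5 + a * np.sin(0.2 * np.pi * t) + b * np.cos(0.2 * np.pi * t)
    c = rng.uniform(0.9, 1.1, size=(n - n_volatile, 1))
    growing = 4 + c * 1.03 ** t
    gdp = np.vstack([volatile, growing]) + 0.1 * rng.normal(size=(n, T))
    is_growing = np.arange(n) >= n_volatile
    # Assumption: treatment is drawn only from the steadily growing type.
    treated = np.zeros(n, dtype=bool)
    pool = np.flatnonzero(is_growing)
    picked = rng.choice(pool, size=int(share_treated * pool.size), replace=False)
    treated[picked] = True
    return gdp, is_growing, treated

gdp, is_growing, treated = simulate_gdp()
print(gdp.shape, treated.sum())
```

Both types have similar average levels early on, which is why matching on means alone can put weight on volatile controls.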


When Mean Balancing Fails: A Severe Example

[Figure: simulated GDP by year (0-24), shown separately for controls and treated.]

[Figure: heavily weighted control units (pre-treatment), years 1-8; panels "Controls: Mean Balancing" and "Controls: Kernel Balancing", each with the treated average and the weighted control average.]

What Information is Encoded in the Kernel Matrix?

[Figure: first principal component of K against the log variance of pre-treatment outcomes, for treated units, all controls, and weighted controls.]

Two Extensions

• Incorporating covariates

Assumption 5 (Linearity in $\phi(X_i, Y_{i,pre})$)

$$E[Y^0_{it} \mid X_i, Y_{i,pre}] = \phi(X_i, Y_{i,pre})'\theta_t, \quad T_0 < t \le T.$$

• Intercept shift and demeaning

Assumption 6 (Parallel Trends)

$$E[Y^0_{it} - Y^0_{is} \mid Y_{i,pre}] = E[Y^0_{it} - Y^0_{is} \mid Y_{i,pre}, G_i], \quad \forall t, s \in \{1, 2, \cdots, T\}$$

Truex (2014): Return to office in China’s Parliament

• Treatment: CEO taking a seat in the National People’s Congress (NPC)

Outcome: Return on assets (ROA)

• 48 treated firms, 984 controls

Pre-treatment: 2005-2007

Post-treatment: 2008-2010

• Two covariates: state ownership, revenue in 2007

• Balancing on: roa2005, roa2006, roa2007, so_portion, rev2007 (and higher order terms through a kernel transformation)

[Figure: treatment status by firm and year, 2005-2010.]

Balance on Pre-treatment Outcome Trajectories

[Figure: return on assets by year, 2005-2007; panels: Treated; Top 25% Heavily Weighted Controls w/ Mean Balancing; Top 25% Heavily Weighted Controls w/ Kernel Balancing.]

Balance Check

[Figure: balance on rev2007, so_portion, roa2007, roa2006, roa2005; panel "Mean" shows the difference in means, panel "Variance" shows (Var_co − Var_tr) / Var_tr; series: Unweighted, Mean Balancing, Kernel Balancing.]

Truex (2014): Main Results

[Figure: NPC Membership and Return on Assets; difference-in-means by year, 2005-2010; Mean Balancing vs. Kernel Balancing.]

Summary

• Removing time-invariant confounders is costly, i.e., no carryover effect, no feedback from past Y to current D (Kim & Imai 2018)

• Sequential exogeneity may be more desirable than strict exogeneity

• Panel non-parametric and semi-parametric methods have come a long way

• Things quickly get more complex when the number of different

treatment histories grows

• Inference is hard especially when there are only a small number of

treated units


Outcome Models

Basic Idea

• In a panel setting, treat Y(1) as missing data

• Predict Y(0) based on an outcome model

• (Use pre-treatment data for model selection)

• Estimate ATT by averaging differences between Y(1) and Y(0)

[Figure: treatment status by unit and time (periods 1-35).]

Basic Idea

$$ATT_s = E[\tau_{it} \mid D_{i,t-s} = 0, \underbrace{D_{i,t-s+1} = D_{i,t-s+2} = \cdots = D_{it} = 1}_{s \text{ periods}}, \forall i \in \mathcal{T}].$$

[Figure: id (1-10) by time (1-10) grid; each cell is labeled with the period relative to the onset of treatment.]
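The relative-time index behind $ATT_s$ can be computed from the treatment matrix. A minimal sketch assuming staggered adoption with no treatment reversal; `relative_time` is a hypothetical helper, and the convention that the first treated period is labeled 1 (so 0 is the last pre-treatment period) is an assumption chosen to match the "s periods" in the definition.

```python
import numpy as np

def relative_time(D):
    """Period index relative to treatment onset: 1 is the first treated period,
    0 the last pre-treatment period; never-treated units get NaN."""
    N, T = D.shape
    rel = np.full((N, T), np.nan)
    for i in range(N):
        treated_periods = np.flatnonzero(D[i] == 1)
        if treated_periods.size:
            rel[i] = np.arange(T) - treated_periods[0] + 1
    return rel

# Hypothetical treatment matrix: unit 0 adopts in the fourth period, unit 1 never.
D = np.array([[0, 0, 0, 1, 1],
              [0, 0, 0, 0, 0]])
rel = relative_time(D)
print(rel[0])   # [-2. -1.  0.  1.  2.]
```

Given estimated individual effects, $ATT_s$ is then the average over treated cells with `rel == s`.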

A New Plot for “Dynamic Treatment Effects”

[Figure: effect on Y by time relative to the treatment (−30 to 10); panel: No "Pre-trend".]

[Figure: effect on Y by time relative to the treatment; panel: A "Pre-trend".]

Model-based Counterfactual Estimators

A model-based counterfactual estimator proceeds in the following steps:

• Step 1. Train the model using observations under the control condition ($D_{it} = 0$).

• Step 2. Predict the counterfactual outcome $\hat{Y}_{it}(0)$ for each observation under the treatment condition ($D_{it} = 1$) and obtain an estimate of the individual treatment effect: $\hat\tau_{it} = Y_{it} - \hat{Y}_{it}(0)$.

• Step 3. Generate estimates for the causal quantities of interest

$$ATT = E[\tau_{it} \mid D_{it} = 1, \forall i \in \mathcal{T}, \forall t], \text{ or}$$

$$ATT_s = E[\tau_{it} \mid D_{i,t-s} = 0, \underbrace{D_{i,t-s+1} = D_{i,t-s+2} = \cdots = D_{it} = 1}_{s \text{ periods}}, \forall i \in \mathcal{T}].$$
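The three steps can be sketched for the simplest outcome model, additive unit and time fixed effects without covariates. This is an illustration in plain NumPy on an invented noise-free panel, not the implementation of any package; the function name `fe_counterfactual` is hypothetical.

```python
import numpy as np

def fe_counterfactual(Y, D):
    """Two-way fixed-effects counterfactual estimator (no covariates)."""
    N, T = Y.shape
    unit = np.repeat(np.arange(N), T)
    time = np.tile(np.arange(T), N)
    X = np.zeros((N * T, N + T))                 # unit and time dummies
    X[np.arange(N * T), unit] = 1.0
    X[np.arange(N * T), N + time] = 1.0
    y = Y.ravel()
    d = D.ravel().astype(bool)
    # Step 1: fit the model on observations under control only.
    coef, *_ = np.linalg.lstsq(X[~d], y[~d], rcond=None)
    # Step 2: predict Y(0) for treated observations and take differences.
    tau_hat = (y - X @ coef)[d]
    # Step 3: average the individual effects.
    return tau_hat.mean(), tau_hat

# Hypothetical panel: 4 units, 5 periods; units 0 and 1 are treated from period 3 on.
alpha = np.array([1.0, 2.0, 3.0, 4.0])
xi = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
D = np.zeros((4, 5))
D[:2, 3:] = 1
Y = alpha[:, None] + xi[None, :] + 3.0 * D       # true effect is 3 everywhere
att_hat, tau_hat = fe_counterfactual(Y, D)
print(att_hat)
```

Because the dummies are collinear, the fixed effects themselves are not uniquely determined, but the predicted Y(0) for each treated cell is, as long as the unit has pre-treatment observations and the period has control observations.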


Review of Three Estimators

We review three estimation strategies:

• FEct:

$$Y_{it}(0) = X_{it}'\beta + \alpha_i + \xi_t$$

• IFEct (Gobillon & Magnac 2016; Xu 2017):

$$Y_{it}(0) = X_{it}'\beta + \lambda_i' F_t$$

• Matrix Completion (MC) (Athey et al. 2018):

$$Y_{it}(0) = X_{it}'\beta + L_{it},$$

where matrix $\{L_{it}\}_{N \times T}$ is a lower-rank matrix approximation of $\{Y(0)\}_{N \times T}$ with missing values

Remarks:

• DiD is a special case of FEct

• Both IFEct and MC are estimated via iterative algorithms

• Cross-validation to choose the tuning parameter

IFEct

Xu (2017) proposes a three-step approach based on a latent factor model:

Control: $Y_{it}(0) = X_{it}'\beta + \alpha_i + \xi_t + \lambda_i' f_t + \varepsilon_{it}$

Treated: $Y_{it}(0) = X_{it}'\beta + \alpha_i + \xi_t + \lambda_i' f_t + \varepsilon_{it}$ (pre)

$Y_{it}(1) = X_{it}'\beta + \alpha_i + \xi_t + \lambda_i' f_t + \varepsilon_{it} + \tau_{it}$ (post)

1. Expectation-Maximization (Gobillon & Magnac 2016)

[Figure: unit by time (periods 1-30) data matrix; cells: Treated (Pre), Treated (Post), Controls; factors: r × T.]

Election Day Registration (EDR) and Voter Turnout

Causal inference is a missing data problem.

[Figure: EDR Reform; state by year, 1920-2012; cells: Treated (Pre), Treated (Post), Controls.]

Main Results

[Figure: effect on turnout (%); panels: FEct and IFEct.]

Factors and Factor Loadings

[Figure: Factor 1 and Factor 2 by year, 1920-2016 (turnout %); scatter of state loadings for Factor 1 against loadings for Factor 2.]

Example: Cadre Visits on Loans

[Figure: The Effect of Cadre Visits; GSC estimate by quarter relative to the visit (−12 to 4), with 95% confidence interval.]

Example: Property Rights and Land Improvement (Sanford 2019)

• Do property rights lead to improved land quality?

• “Experiment”: giving peasants in Borgou, Benin land titles

• Use satellite (remote sensing) data to measure land improvement,

i.e., switch from annual crops to perennial crops (bushes and trees)

• Use IFEct to construct counterfactuals

87

Original Satellite Images

88

Treated and Counterfactual Averages

Pre-treatment Outcome

89


Post-treatment Outcome (1 Year)

90


Post-treatment Outcome (4 Years)

91

Factors and Loadings

92

Geographic Distribution of Heavily-weighted Controls

A larger number represents greater dissimilarity

93

Matrix Completion Methods

[Figure: EDR Reform; treatment status by state (abb) and year (1920 to 2008), with cells marked Treated States (before EDR), Treated States (after EDR), and Control States]

• Recall that our main goal is to predict treated counterfactuals

• Taking advantage of the matrix structure, matrix completion methods use non-treated data to achieve this goal

• The basic idea is to find a lower-rank representation of the matrix to impute the “missing data”

• Xu (2017) is a special case of this approach

94


• Recall in the baseline DiD setup:

\[ Y = \begin{pmatrix} Y^0_{T,\text{pre}} & \text{??} \\ Y^0_{C,\text{pre}} & Y^0_{C,\text{post}} \end{pmatrix} \]

• Matrix completion (MC) methods attempt to find a lower-rank representation of Y, which we call L, that makes predictions of missing values in Y

• Athey et al. (2018) generalize Xu (2017) with different ways of constructing L

• Plus, missingness can be arbitrary → accommodate reversible treatments (note: strict exogeneity)

95


• Mathematically,

\[ Y_{it} = L_{it} + \alpha_i + \xi_t + X_{it}'\beta + \varepsilon_{it} \]

in which $L_{it}$ is an element of $L$, an $(N \times T)$ matrix

• We need regularization on $L$ because there are too many parameters:

\[ \min_{L} \; \frac{1}{\#\text{Controls}} \sum_{D_{it}=0} (Y_{it} - L_{it})^2 + \lambda_L \|L\|_* \]

• The nuclear norm $\|\cdot\|_*$ generally leads to a low-rank solution for $L$:

\[ \|L\|_* = \sum_{i=1}^{\min(N,T)} \sigma_i(L) \]

in which $\sigma_i(L)$ are the singular values of $L$
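The regularized problem above can be approached by repeatedly filling in the missing cells and soft-thresholding the singular values. The sketch below is an illustration under stated assumptions, not the estimator used in the talk: it omits the fixed effects and covariates, the function name `soft_impute` and its arguments are hypothetical, and `lam` is the per-iteration threshold, which matches $\lambda_L$ only up to the scaling implied by the $1/\#\text{Controls}$ factor.

```python
import numpy as np

def soft_impute(Y, observed, lam, n_iter=500, tol=1e-8):
    """Estimate a low-rank L from the observed (D_it = 0) cells of Y.

    Y        : (N, T) outcome matrix
    observed : (N, T) boolean mask, True where the untreated outcome is seen
    lam      : amount subtracted from every singular value in each iteration
    """
    L = np.zeros(Y.shape, dtype=float)
    for _ in range(n_iter):
        # Keep observed outcomes; fill the "missing" cells with the current fit.
        filled = np.where(observed, Y, L)
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        # Soft-threshold the singular values and rebuild the matrix.
        L_new = (U * np.maximum(s - lam, 0.0)) @ Vt
        converged = np.linalg.norm(L_new - L) <= tol * max(1.0, np.linalg.norm(L))
        L = L_new
        if converged:
            break
    return L

# Demo on simulated data (all values below are illustrative assumptions).
rng = np.random.default_rng(0)
N, T, r = 40, 20, 2
Y0 = rng.normal(size=(N, r)) @ rng.normal(size=(r, T)) + 0.1 * rng.normal(size=(N, T))
observed = np.ones((N, T), dtype=bool)
observed[:5, 15:] = False  # 5 treated units, 5 post-treatment periods
L_hat = soft_impute(Y0, observed, lam=0.5)
rmse_missing = np.sqrt(np.mean((L_hat - Y0)[~observed] ** 2))
print(f"RMSE on the held-out cells: {rmse_missing:.3f}")
```

In practice the threshold would be chosen by cross-validation on held-out untreated cells, as the slides note for the tuning parameter.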

96

IFEct vs. MC

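• Singular value decomposition of L

\[ L_{N \times T} = S_{N \times N} \, \Sigma_{N \times T} \, R_{T \times T} \]

• Difference in how $\Sigma_{N \times T}$ is regularized

IFE (best subset):

\[ \begin{pmatrix}
\sigma_1 & 0 & 0 & \cdots & 0 \\
0 & \sigma_2 & 0 & \cdots & 0 \\
0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & 0
\end{pmatrix} \]

MC (nuclear norm):

\[ \begin{pmatrix}
|\sigma_1 - \lambda_L|_+ & 0 & 0 & \cdots & 0 \\
0 & |\sigma_2 - \lambda_L|_+ & 0 & \cdots & 0 \\
0 & 0 & |\sigma_3 - \lambda_L|_+ & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & |\sigma_T - \lambda_L|_+ \\
0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots \\
0 & 0 & 0 & \cdots & 0
\end{pmatrix} \]

in which $|a|_+ = \max(a, 0)$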
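The two regularization schemes can be compared directly on the singular values of a small matrix. This is a minimal sketch: the function name `regularize_singular_values` and the toy matrix are hypothetical, chosen so that the singular values are known in advance.

```python
import numpy as np

def regularize_singular_values(L, r=None, lam=None):
    """Rebuild L after hard truncation (keep the top r) or soft-thresholding (subtract lam)."""
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    if r is not None:
        # Best subset (IFE): keep the r largest singular values, zero out the rest.
        s_new = np.where(np.arange(s.size) < r, s, 0.0)
    else:
        # Nuclear norm (MC): shrink every singular value by lam, floored at zero.
        s_new = np.maximum(s - lam, 0.0)
    return (U * s_new) @ Vt, s_new

# Toy 4 x 3 matrix built to have singular values 5, 3, and 1.
L = np.zeros((4, 3))
L[:3, :3] = np.diag([5.0, 3.0, 1.0])

_, s_hard = regularize_singular_values(L, r=2)
_, s_soft = regularize_singular_values(L, lam=2.0)
print("best subset, r = 2:    ", s_hard)
print("nuclear norm, lam = 2: ", s_soft)
```

Hard truncation leaves the retained singular values untouched, while soft-thresholding shrinks all of them, which is one way to read the slides' contrast between a few strong factors and many weak ones.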

97

IFEct vs. MC

The pros and cons of IFEct and MC:

• IFEct works better with a small number of strong factors

• MC works better with a large number of weak factors

[Figure: MSPE against the number of factors in the DGP (1 to 9) for IFEct (known r), IFEct (CV-ed r), and MC; also shown: variance of each factor]

98

Over-fitting of IFEct and MC

When the true DGP is a two-factor IFE model:

[Figure: two panels showing sigma (Pre), SD (ATT), Bias (ATT), and RMSE (ATT). IFEct panel: x-axis is the number of factors in the model (0 to 6). MC panel: x-axis is log(tuning parameter), from −4.4 to −9.6]

99

Inferential Methods

• Non-parametric block bootstrap

  • sample with replacement across units

  • valid when $N$ is large and $N_{tr}/N$ is fixed

• A permutation-based test for sharp nulls (Chernozhukov et al. 2019)

  • e.g. $Y_{it}(1) = Y_{it}(0), \ \forall i \in \mathcal{T}, \ t > T_{0i}$

  • randomization over time (by blocks) instead of across units

  • valid if $T$ is large, errors are stationary and weakly dependent, and estimators are consistent or stable

  • exact if errors are i.i.d.
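The block bootstrap resamples whole units, so each unit's full time series moves together. The sketch below is a minimal illustration with a toy difference-in-differences estimator standing in for FEct, IFEct, or MC. The names `did_att` and `block_bootstrap` are hypothetical, and resampling treated and control units separately is an assumption of this sketch that keeps every bootstrap sample estimable.

```python
import numpy as np

def did_att(Y, D):
    """Toy ATT estimator: DiD on group means, assuming a common treatment date."""
    treated, post = D.any(axis=1), D.any(axis=0)
    gap = Y[treated].mean(axis=0) - Y[~treated].mean(axis=0)  # treated-minus-control gap by period
    return gap[post].mean() - gap[~post].mean()

def block_bootstrap(Y, D, estimator, n_boot=500, seed=0):
    """Resample units with replacement; return the bootstrap SE and a 95% percentile CI."""
    rng = np.random.default_rng(seed)
    treated = np.flatnonzero(D.any(axis=1))
    control = np.flatnonzero(~D.any(axis=1))
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = np.concatenate([
            rng.choice(treated, size=treated.size, replace=True),
            rng.choice(control, size=control.size, replace=True),
        ])
        draws[b] = estimator(Y[idx], D[idx])  # rows are whole unit histories
    return draws.std(ddof=1), np.percentile(draws, [2.5, 97.5])

# Demo: 60 units, 10 periods, 20 units treated from period 6 on, true effect 2.
rng = np.random.default_rng(1)
N, T = 60, 10
D = np.zeros((N, T), dtype=bool)
D[:20, 5:] = True
Y = rng.normal(size=(N, 1)) + rng.normal(size=(1, T)) + 2.0 * D + rng.normal(size=(N, T))
se, ci = block_bootstrap(Y, D, did_att)
print(f"ATT = {did_att(Y, D):.2f}, SE = {se:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```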

100

Algorithm for Testing the Sharp Null

Based on Chernozhukov et al (2019):

• Assuming the Sharp Null is true: $H_0: Y_{it}(0) = Y_{it}(1), \ \forall i \in \mathcal{T}, t$.

• Denote $Z_t = (Y_{1t}, Y_{2t}, \cdots, Y_{Nt}, X_{1t}, X_{2t}, \cdots, X_{Nt})'$.

• Denote by $\pi_h$ an i.i.d. block permutation, i.e., a one-to-one mapping $\pi_h: \{b_1, \cdots, b_K\} \mapsto \{b_1, \cdots, b_K\}$, in which $\{b_1, \cdots, b_K\}$ is a partition of $\{1, \cdots, T\}$.

101

Algorithm for Testing the Sharp Null

Step 1. Using a counterfactual estimator with the real data, calculate the test statistic $S(u) = \sqrt{|O|}\,|ATT|$, in which $|O|$ is the number of treated observations.

Step 2. Generate a random i.i.d. block permutation $\pi_h$; reorganize the data (over the time dimension) based on $\pi_h$ while fixing the treatment assignment matrix.

Step 3. Re-estimate the test statistic using the permuted data, obtaining $S(u_{\pi_h})$ with the same method as in Step 1.

Step 4. Repeat Steps 2-3 $H$ times, obtaining $\{S(u_{\pi_1}), \cdots, S(u_{\pi_H})\}$.

Step 5. Calculate the p-value: $p = 1 - F(S(u))$, in which

\[ F(x) = \frac{1}{H} \sum_{h=1}^{H} \mathbf{1}\{S(u_{\pi_h}) < x\} \]
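Steps 1 to 5 can be sketched as follows. This is an illustration under stated assumptions rather than the procedure in Chernozhukov et al. (2019): a toy DiD estimator stands in for the counterfactual estimator, covariates are omitted, the blocks are consecutive runs of `block_len` periods, and the names `did_att`, `block_permutation`, and `sharp_null_pvalue` are hypothetical.

```python
import numpy as np

def did_att(Y, D):
    """Toy ATT estimator: DiD on group means, assuming a common treatment date."""
    treated, post = D.any(axis=1), D.any(axis=0)
    gap = Y[treated].mean(axis=0) - Y[~treated].mean(axis=0)
    return gap[post].mean() - gap[~post].mean()

def block_permutation(T, block_len, rng):
    """Split 0..T-1 into consecutive blocks and return the time indices in shuffled block order."""
    blocks = [np.arange(s, min(s + block_len, T)) for s in range(0, T, block_len)]
    order = rng.permutation(len(blocks))
    return np.concatenate([blocks[k] for k in order])

def sharp_null_pvalue(Y, D, estimator, block_len=1, H=999, seed=0):
    rng = np.random.default_rng(seed)
    n_treated_obs = int(D.sum())  # |O|

    def statistic(y):
        # Steps 1 and 3: S = sqrt(|O|) * |ATT|, with D held fixed.
        return np.sqrt(n_treated_obs) * abs(estimator(y, D))

    s_obs = statistic(Y)
    s_perm = np.empty(H)
    for h in range(H):
        # Step 2: reorder the time dimension of the data only.
        idx = block_permutation(Y.shape[1], block_len, rng)
        s_perm[h] = statistic(Y[:, idx])
    # Step 5: p = 1 - F(S(u)), with F the share of permuted statistics below S(u).
    return 1.0 - np.mean(s_perm < s_obs)

# Demo: 40 units, 20 periods, 10 units treated in the last 5 periods.
rng = np.random.default_rng(2)
D = np.zeros((40, 20), dtype=bool)
D[:10, 15:] = True
Y = rng.normal(size=(40, 20)) + 10.0 * D
print("p-value:", sharp_null_pvalue(Y, D, did_att, H=199, seed=1))
```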

102

Diagnostic Tests

A Simulated Example

Data Generating Process:

• $T = 35$, $N = 200$

• Outcome model: a linear interactive fixed effects model with two factors, one drift process and one white noise:

\[ Y_{it} = \tau_{it} D_{it} + 5 + 1 \cdot X_{it,1} + 3 \cdot X_{it,2} + \lambda_{i1} f_{1t} + \lambda_{i2} f_{2t} + \alpha_i + \xi_t + \varepsilon_{it} \]

• Treatment assignment: staggered adoption with $T_{0i}$ correlated with the additive and interactive fixed effects.

• Treatment effects: $\tau_{i,t>T_{0i}} = 0.2(t - T_{0i}) + e_{it}$
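A data generating process of this form can be simulated in a few lines. The slide fixes only the structure, so every distribution below, the drift specification, and the rule tying $T_{0i}$ to the fixed effects and loadings are assumptions made for illustration; `simulate_panel` is a hypothetical name.

```python
import numpy as np

def simulate_panel(N=200, T=35, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(1, T + 1)
    f1 = np.cumsum(0.1 + rng.normal(scale=0.5, size=T))  # drift process (assumed: random walk with drift)
    f2 = rng.normal(size=T)                              # white noise factor
    lam = rng.normal(size=(N, 2))                        # factor loadings
    alpha = rng.normal(size=N)                           # unit fixed effects
    xi = rng.normal(size=T)                              # time fixed effects
    X = rng.normal(size=(N, T, 2))                       # two covariates
    eps = rng.normal(size=(N, T))

    # Assumed assignment rule: units with a high score are treated, and earlier
    # the higher the score, so T0 correlates with alpha and the loadings.
    score = alpha + lam.sum(axis=1) + rng.normal(size=N)
    ever_treated = score > np.median(score)
    T0 = np.full(N, T)  # T0 = T means never treated within the sample
    T0[ever_treated] = np.clip(np.round(26 - 3 * score[ever_treated]), 18, 32).astype(int)

    D = t[None, :] > T0[:, None]                         # staggered adoption, no reversal
    tau = 0.2 * (t[None, :] - T0[:, None]) + rng.normal(size=(N, T))
    Y = (tau * D + 5 + 1 * X[..., 0] + 3 * X[..., 1]
         + lam[:, [0]] * f1 + lam[:, [1]] * f2
         + alpha[:, None] + xi[None, :] + eps)
    return {"Y": Y, "D": D, "X": X, "T0": T0}

sim = simulate_panel()
print("treated units:", int(sim["D"].any(axis=1).sum()),
      "| treated observations:", int(sim["D"].sum()))
```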

[Figure: Treatment Status by unit and time (periods 1 to 35)]

103

Dynamic Treatment Effects

[Figure: dynamic treatment effects estimated by IFEct; y-axis: Effect on Y, from −5.0 to 7.5]

104

1. Placebo Test

• Drop S periods before the treatment’s onset, and estimate the

average treatment effect in these periods.

• Test whether the average effect is significant

• Robust to model misspecification, but using only limited information
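The placebo test can be sketched with a two-way fixed effects counterfactual model. The code below is a simplified illustration: it uses an additive fixed effects fit in place of IFEct or MC, it reports only the average placebo effect (a p-value would additionally need one of the inferential methods discussed earlier), and the names `fe_counterfactual` and `placebo_effect` are hypothetical.

```python
import numpy as np

def fe_counterfactual(Y, fit_mask):
    """Fit Y_it = a_i + b_t on the cells where fit_mask is True; predict every cell."""
    N, T = Y.shape
    unit, time = np.nonzero(fit_mask)
    design = np.column_stack([np.eye(N)[unit], np.eye(T)[time][:, 1:]])
    coef, *_ = np.linalg.lstsq(design, Y[fit_mask], rcond=None)
    a, b = coef[:N], np.concatenate([[0.0], coef[N:]])
    return a[:, None] + b[None, :]

def placebo_effect(Y, D, S=2):
    """Average 'effect' in the S periods before each treated unit's onset."""
    N, T = Y.shape
    treated = D.any(axis=1)
    # First treated period per unit; never-treated units get an onset beyond the sample.
    onset = np.where(treated, D.argmax(axis=1), T + S)
    t = np.arange(T)
    placebo = (t[None, :] >= onset[:, None] - S) & (t[None, :] < onset[:, None])
    # Drop the placebo periods (and all treated cells) before fitting the model.
    Y0_hat = fe_counterfactual(Y, ~D & ~placebo)
    return (Y - Y0_hat)[placebo].mean()

# Demo: staggered adoption with no anticipation, so the placebo effect should be near zero.
rng = np.random.default_rng(3)
N, T = 30, 12
D = np.zeros((N, T), dtype=bool)
D[:5, 8:] = True
D[5:10, 10:] = True
Y = rng.normal(size=(N, 1)) + rng.normal(size=(1, T)) + 2.0 * D + 0.1 * rng.normal(size=(N, T))
print(f"average placebo effect: {placebo_effect(Y, D, S=2):.3f}")
```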

[Figure: Placebo Test with IFEct; y-axis: Effect on Y, from −5.0 to 7.5; placebo p value: 0.400]

105

2. Equivalence Test

• Extension of Hartman and Hidalgo (2018) to a TSCS setting

• $H_0: |ATT_t| > \tau$ vs. $H_1: |ATT_t| \le \tau$

• We calculate the maximal possible $\tau$ and compare it with a pre-specified threshold: $0.36 \cdot \mathrm{sd}(Y_{it} \mid D_{it} = 0)$

• It has more power when the sample size grows larger, and is more

likely to reject the Null (hence, equivalence holds) when a

confounder is trivial

• Drawback 1: setting the threshold requires user discretion

• Drawback 2: easy to pass when pre-treatment data are used to fit

the model (IFEct and MC)
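One common way to carry out an equivalence test of this form is with two one-sided tests (TOST) per pre-treatment period. The sketch below assumes normal sampling distributions and takes the period-level estimates and standard errors as given; the function name `equivalence_test` and the example numbers are hypothetical, and the returned bound is the smallest symmetric bound implied by each confidence interval, which may not be the same quantity the slide calls the maximal possible $\tau$.

```python
import numpy as np
from statistics import NormalDist

def equivalence_test(att_pre, se_pre, threshold, alpha=0.05):
    """TOST for H0: |ATT_t| > threshold, one test per pre-treatment period."""
    att_pre = np.asarray(att_pre, dtype=float)
    se_pre = np.asarray(se_pre, dtype=float)
    z = NormalDist().inv_cdf(1 - alpha)
    lower = att_pre - z * se_pre
    upper = att_pre + z * se_pre
    # Reject H0 when the whole (1 - 2*alpha) interval sits inside [-threshold, threshold].
    reject = (lower > -threshold) & (upper < threshold)
    # Smallest symmetric bound that would still contain each interval.
    implied_bound = np.maximum(np.abs(lower), np.abs(upper))
    return reject, implied_bound

# Demo with made-up pre-treatment estimates and untreated outcomes.
rng = np.random.default_rng(4)
y_untreated = rng.normal(size=1000)
threshold = 0.36 * y_untreated.std(ddof=1)
reject, bound = equivalence_test([0.02, -0.05, 0.10], [0.04, 0.05, 0.06], threshold)
print("threshold:", round(threshold, 3))
print("equivalence holds in every pre-treatment period:", bool(reject.all()))
```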

106

Equivalence Test

[Figure: dynamic treatment effects estimated by MC; y-axis: Effect on Y, from −5.0 to 7.5; Wald p value: 0.321]

107

Why Not a Wald Test?

• A simpler test: conduct a Wald (F) test on pre-treatment residual averages

• However, when there exists a small confounder that induces a negligible bias compared with the ATT, a Wald test will almost always reject the Null (that equivalence holds) when there are enough data

• The Equivalence Test avoids this problem

108

Why Not a Wald Test?

• There exists a time-varying confounder

• We vary its influence on the bias in the ATT

[Figure: proportion declared balanced by the Wald and Equivalence tests as Bias / SD(Y) varies from 0.0 to 0.4]

109

Why Not a Wald Test?

[Figure: proportion declared balanced by the Wald and Equivalence tests against the number of units (N, 0 to 1000); two panels: Bias = 0.08 SD(Y) and Bias = 0.28 SD(Y)]

110

Hainmueller & Hangartner (2019)

Does indirect democracy benefit immigrant minorities?

• Unit of analysis: 1400 Swiss municipalities from 1991-2009

• Treatment: Indirect (vs. direct) democracy

• Outcome: Naturalization rate

[Figure: treatment status by unit and year (1991 to 2009), with cells marked Direct Democracy and Indirect Democracy]

111

Hainmueller & Hangartner (2019)

[Figure: Placebo Test; y-axis: Effect on Naturalization Rate (%), from −4 to 4]

112

Xu (2017): EDR on Turnout

Does Election Day Registration increase turnout?

• Unit of analysis: 47 US states from 1920 to 2012

• Treatment: EDR reform

• Outcome: Turnout

[Figure: Treatment Status by state and election year (1920 to 2012), with cells marked No EDR and EDR]

113


FEct

[Figure: two panels; y-axis: Effect on Turnout (%), from −15 to 15. Panel 1: Wald p value: 0.129. Panel 2: Placebo Test]

114


IFEct

[Figure: two panels; y-axis: Effect on Turnout (%), from −15 to 15. Panel 1: Wald p value: 0.728. Panel 2: Placebo Test]

115


MC

[Figure: two panels; y-axis: Effect on Turnout (%), from −15 to 15. Panel 1: Wald p value: 0.644. Panel 2: Placebo Test]

116

Acemoglu et al. (2019): Democracy on Growth

• Unit of analysis: 184 countries over 51 years (1960-2010)

• Treatment: a dichotomous measure of democracy and autocracy

• Outcome: Log GDP per capita (2000 dollars)

Entering Democracy

117





[Figure: Entering Democracy; y-axis: Effect on log GDP per capita; x-axis: Time relative to the Treatment. Panels: FEct (Wald p value: 0.111), IFEct (Wald p value: 0.182), MC (Wald p value: 0.079)]

118





[Figure: Exiting Democracy; y-axis: Effect on log GDP per capita; x-axis: Time relative to the Treatment. Panels: FEct (Wald p value: 0.890), IFEct (Wald p value: 0.270), MC (Wald p value: 0.620)]

119

Roadmap




• Outcome models

* Diagnostic tools

• Hybrid methods

• Conclusions

Hybrid Methods

Hybrid Methods

• So far, we’ve surveyed two groups of methods: (1) those constructing

balancing weights; (2) those modeling the conditional outcomes

• Combining the two approaches will likely produce doubly robust

estimators

• Some methods we discussed, including semi-parametric DiD, panel

matching, trajectory balancing, are already doing a simple version of

it (balancing plus regression)

• We review two new methods that formally adopt this idea

• Augmented synthetic control (Ben-Michael et al 2018): modeling first

• Synthetic DiD (Arkhangelsky et al. 2019): weighting first

120

Augmented Synthetic Control

• Assuming one treated unit (unit N)

• Procedure

1. Run an outcome model (FEct, IFEct, MC, etc.) and obtain model fit m(Xi )

2. Balance on the residual averages, obtaining weights γi for the controls

3. Treated average is constructed using:

\[ \hat{Y}^{\text{aug}}_N(0) = m(X_N) + \sum_{i=1}^{N-1} \gamma_i \,(Y_i - m(X_i)) \]

• The balancing weights take care of the remaining biases from the outcome

model; the estimator is thus doubly robust

• Inference via clustered bootstrap
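The three-step procedure can be sketched with a ridge regression as the outcome model. This is an illustration under stated assumptions, not the authors' implementation: $X_i$ is taken to be unit $i$'s pre-treatment outcomes, the weights are plain synthetic control weights that balance those pre-treatment outcomes (swapped in for the slide's balancing on residual averages), they are found by projected gradient descent, and all function names are hypothetical.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1)
    return np.maximum(v - theta, 0.0)

def sc_weights(A, b, n_iter=2000):
    """Minimize ||A w - b||^2 over the simplex by projected gradient descent."""
    w = np.full(A.shape[1], 1.0 / A.shape[1])
    step = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        w = project_simplex(w - step * 2.0 * A.T @ (A @ w - b))
    return w

def augmented_sc(Y_pre, Y_post, ridge=1.0):
    """Counterfactual for the treated unit (last row) in one post-treatment period.

    Y_pre  : (N, T0) pre-treatment outcomes, used as X_i
    Y_post : (N,) post-treatment outcomes; the last entry (treated) is not used
    """
    Xc, Xt, yc = Y_pre[:-1], Y_pre[-1], Y_post[:-1]
    # Step 1: outcome model m(X), a ridge regression fitted on the controls.
    mu_x, mu_y = Xc.mean(axis=0), yc.mean()
    Xd = Xc - mu_x
    beta = np.linalg.solve(Xd.T @ Xd + ridge * np.eye(Xc.shape[1]), Xd.T @ (yc - mu_y))
    m = lambda X: mu_y + (X - mu_x) @ beta
    # Step 2: weights gamma_i for the controls.
    gamma = sc_weights(Xc.T, Xt)
    # Step 3: model prediction plus the weighted control residuals.
    return m(Xt) + gamma @ (yc - m(Xc)), gamma

# Demo on simulated data (illustrative values only).
rng = np.random.default_rng(5)
Y_pre = rng.normal(size=(21, 8)).cumsum(axis=1)
Y_post = Y_pre[:, -1] + rng.normal(size=21)
y0_hat, gamma = augmented_sc(Y_pre, Y_post)
print(f"estimated Y_N(0): {y0_hat:.2f}; weights sum to {gamma.sum():.3f}")
```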

121

Synthetic DiD

• Assuming one treated unit (unit N) and one post-treatment period

(period T ); weights add up to 1

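• Procedure

1. Estimate a “synthetic control weight” for each control unit:

\[ \omega^{sc} = \arg\min_{\omega} \sum_{t=1}^{T-1} \Big( \sum_{i=1}^{N-1} \omega_i Y_{it} - Y_{Nt} \Big)^2 \]

2. Estimate a “synthetic control weight” for each time period:

\[ \lambda^{sc} = \arg\min_{\lambda} \sum_{i=1}^{N-1} \Big( \sum_{t=1}^{T-1} \lambda_t Y_{it} - Y_{iT} \Big)^2 \]

3. Estimate a weighted DiD by minimizing:

\[ \sum_{i=1}^{N} \sum_{t=1}^{T} (Y_{it} - \mu - \alpha_i - \beta_t - X_{it}\gamma - D_{it}\tau)^2 \, \omega_i \lambda_t \]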

• If either the SC weights or the outcome model is correct, the causal effect is identified (doubly robust)

• Inference via jackknife
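Step 3 of the procedure is a weighted two-way fixed effects regression, which can be written as a weighted least squares problem. The sketch below takes the unit and time weights as given and omits the covariates $X_{it}$; uniform weights stand in for the outputs of steps 1 and 2, and setting the treated unit's weight and the post-treatment period's weight to 1 is an assumption of this sketch. The function name `weighted_did` is hypothetical.

```python
import numpy as np

def weighted_did(Y, D, omega, lam):
    """Minimize sum_it (Y_it - a_i - b_t - D_it * tau)^2 * omega_i * lam_t; return tau."""
    N, T = Y.shape
    unit = np.repeat(np.arange(N), T)   # unit index of each stacked observation
    time = np.tile(np.arange(T), N)     # time index of each stacked observation
    design = np.column_stack([
        np.eye(N)[unit],                # unit effects (these absorb the intercept mu)
        np.eye(T)[time][:, 1:],         # time effects, first period as the baseline
        D.reshape(-1).astype(float),    # treatment indicator
    ])
    w = np.sqrt(np.outer(omega, lam).reshape(-1))
    coef, *_ = np.linalg.lstsq(design * w[:, None], Y.reshape(-1) * w, rcond=None)
    return coef[-1]

# Demo: unit N is treated in period T only; the true effect is 3.
rng = np.random.default_rng(6)
N, T = 11, 9
D = np.zeros((N, T), dtype=bool)
D[-1, -1] = True
Y = rng.normal(size=(N, 1)) + rng.normal(size=(1, T)) + 3.0 * D + 0.1 * rng.normal(size=(N, T))
omega = np.append(np.full(N - 1, 1.0 / (N - 1)), 1.0)  # placeholder for step 1
lam = np.append(np.full(T - 1, 1.0 / (T - 1)), 1.0)    # placeholder for step 2
print(f"weighted DiD estimate: {weighted_did(Y, D, omega, lam):.2f}")
```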

122

Conclusions

Concluding Remarks

• Removing unit fixed effects is costly because it requires strict exogeneity; the alternative is sequential exogeneity

• “Parallel trends” can very well be wrong; when T is large, we can assess the assumption by checking the “pre-trend” using diagnostic tests

• Counterfactual estimators can relax the homogeneity assumption with

little cost

• Both the matching/reweighting approach (dealing with D) and the

model-based approach (dealing with Y) can help with time-varying

confounders with sufficient data

• Be aware of the modeling assumptions when the latter is employed

• A hybrid estimator is doubly robust

123

Practical Recommendations

• Plotting raw data helps us see obvious problems

• Start from conventional estimators (e.g., DiD, FEct) and check the “pre-trend”

• If they don’t work, try easy fixes, e.g., trimming the data to make the treated and controls more alike

• If that doesn’t work either, we need more complex methods

• Always ask yourself first: “What’s the hypothetical experiment?”

124

Packages

• panelView: panel data visualization

• fastplm: fast panel linear fixed effects estimation (coming soon)

• gsynth: IFEct/MC approach with non-reversible treatments

• fect: IFEct/MC methods with diagnostic tests (coming soon)

• tjbal: trajectory balancing

Thank you!

[email protected]

http://yiqingxu.org

github.com/xuyiqing

125


References

• Abadie, Alberto, Alexis Diamond, and Jens Hainmueller (2010). “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program.” Journal of the American Statistical Association, Vol. 105, Iss. 490, June 2010, pp. 493–505.

• Imai, Kosuke and In Song Kim (2019). “When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data?” American Journal of Political Science, Vol. 63, Iss. 2, April 2019, pp. 467–490.

• Abadie, Alberto (2005). “Semiparametric Difference-in-Differences Estimators.” Review of Economic Studies, Vol. 72, 2005, pp. 1–19.

• Hsiao, Cheng, H. Steve Ching, and Shui Ki Wan (2012). “A Panel Data Approach for Program Evaluation: Measuring the Benefits of Political and Economic Integration of Hong Kong with Mainland China.” Journal of Applied Econometrics, Vol. 27, Iss. 5, August 2012, pp. 705–740.

• Doudchenko, Nikolay and Guido Imbens (2016). “Balancing, Regression, Difference-In-Differences and Synthetic Control Methods: A Synthesis.” Working Paper, Stanford University.

• Hazlett, Chad and Yiqing Xu (2018). “Trajectory Balancing: A Kernel Method for Causal Inference with Time-Series Cross-Sectional Data.” Working Paper, UCLA.

• Xu, Yiqing (2017). “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed Effects Models.” Political Analysis, Vol. 25, Iss. 1, January 2017, pp. 57–76.

• Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, and Khashayar Khosravi (2017). “Matrix Completion Methods for Causal Panel Data Models.” Working Paper, Stanford University.

• Liu, Licheng, Ye Wang, and Yiqing Xu (2019). “A Practical Guide to Counterfactual Estimators for Causal Inference with Time-Series Cross-Sectional Data.” Working Paper, Stanford University.

• Ben-Michael, Eli, Avi Feller, and Jesse Rothstein (2018). “The Augmented Synthetic Control Method.” Working Paper, UC Berkeley.

• Arkhangelsky, Dmitry, Susan Athey, David A. Hirshberg, Guido W. Imbens, and Stefan Wager (2019). “Synthetic Difference In Differences.” Working Paper, UC Berkeley.

126
