Causal Inference with Panel Data · 2020-03-04 · Two-step strategy: 1.estimate the propensity...
Transcript of Causal Inference with Panel Data · 2020-03-04 · Two-step strategy: 1.estimate the propensity...
Causal Inference with Panel Data
Yiqing Xu (Stanford University)
Northwestern-Duke Causal Inference Workshop
19 August 2019
Motivation
Cadre Visits on Loans
−12 −10 −8 −6 −4 −2 0 2 4
Quarter(s) before / after a cadre's visit
5.0
5.5
6.0
6.5
Log
Out
stan
ding
Loa
ns
182 treated firms
542 control firms
Visits
1
Cadre Visits on Loans
−12 −10 −8 −6 −4 −2 0 2 4
Quarter(s) before / after a cadre's visit
−0.
6−
0.4
−0.
20.
00.
20.
40.
60.
8
Diff
−in
−di
ffs in
Loa
nsThe Effect of Cadre Visits on Loans95% Confidence Interval
Implicitly Assumed
2
Cadre Visits on Loans
−12 −10 −8 −6 −4 −2 0 2 4
Quarter(s) before / after a cadre's visit
−0.
6−
0.4
−0.
20.
00.
20.
40.
60.
8
Diff
−in
−di
ffs in
Loa
nsThe "Effect" of Cadre Visits on Loans95% Confidence Interval
3
−12 −10 −8 −6 −4 −2 0 2 4
Year(s) before / after Reform
−2
0−
10
01
02
03
0
Diff
−in
−d
iffs
in C
rim
e R
ate
%The "Effect" of Assault Weapon Ban on Crime Rate95% Confidence Interval
−20 −16 −12 −8 −4 0 4 8
Year(s) before / after Both Parties in the WTO
−0
.2−
0.1
0.0
0.1
0.2
Diff
−in
−d
iffs
in L
og
Tra
de
Vo
lum
e
The "Effect" of WTO on Trade95% Confidence Interval
−12 −10 −8 −6 −4 −2 0 2 4 6 8
Year(s) before / after Democratization
−0
.15
−0
.10
−0
.05
0.0
00
.05
0.1
00
.15
Diff
−in
−d
iffs
in L
og
GD
P p
er
cap
ita
The "Effect" of Democratization on Economic Development95% Confidence Interval
−28 −24 −20 −16 −12 −8 −4 0 4 8
Day(s) before / after the News Leak
−6
−4
−2
02
46
8
Diff
−in
−d
iffs
in A
bn
orm
al R
etu
rn %
The "Effect" of Tim Geithner Connections on Stock Market Return95% Confidence Interval
Motivations
Two-way fixed effects (FE) and difference-in-differences (DiD) methods
are commonly used in the social sciences.
• Abundant observational panel data
• Accounting for unobserved unit and time heterogeneity
• Easy to estimate and interpret
●● ● ● ●
●
●●
●
●● ●
●
●
●
●
●
●
Cou
nt
05
1015
2025
30
2000 2002 2004 2006 2008 2010 2012 2014 2016
"difference−in−differences"
"fixed effects" + "panel data"
#Papers published in APSR/AJPS/JOP
5
We Try to Answer
1. What about time-varying confounders, e.g. when a “pre-trend”
exists?
2. What if the treatment effect are heterogeneous — as they always
are?
3. What’re the differences between twoway FE and DiD?
4. Are the assumptions credible? What’re the hypothetical experiments
behind these models?
5. How can we do better than DiD or synthetic control?
6
What’s Special about Panel Data?
• The fundamental problem of causal inference (Holland 1986)
τi = Y1i − Yi0
• A statistical solution makes use of others’ information
e.g. ATE = E[Y1]− E[Y0]
• A scientific solution exploits homogeneity or invariance assumptions
e.g. A rock stays a rock.
e.g. The long-run growth rate of the US economy is 2.5%.
• Panel data allow us to construct treated counterfactuals using
information from both the past and the others with the caveat that
treatment assignment mechanism may be complicated
• Panel data is also difficult because of all kinds of interferences
(SUTVA violations)
7
Causal Inference with Panel Data
• “Scientific” solution: modeling (but all models are wrong...)
• Statistical solution: similar to the Selection-on-Observable (SOO)
approach, e.g., matching/reweighting
• Panel data make both easier
• Pre-trends are observable → more information for modeling
• The additional dimension helps relax the conventional ignorability
assumption
• And we can do more...
8
Roadmap
• DiD and synthetic control
• FE/DiD assumptions
• New estimators• Matching and reweighting
• Outcome models
* Diagnostic tools
• Hybrid methods
• Conclusions
Quick Review of DiD and Synth
Difference-in-Differences
Yit = τitDit + αi + ξt + εit
or{Y 0it = αi + ξt + εit
Y 1it = Y 0
it + τit
• τit is the treatment effect for unit i at time t
• Y 0it is a combination of two additive fixed effects and idiosyncratic
errors
• E[εit ] = 0 and εit ⊥⊥ Dis , for all i , t, s (strict exogeneity)
• ATT = E[τit |Dit = 1] can be non-parametrically identified if there
are only two periods (or two treatment histories)
10
Difference-in-Differences
(Y 0T ,pre ??
Y 0C,pre Y 0
C,post
)
• τit is the treatment effect for unit i at time t
• Y 0it is a combination of two additive fixed effects and idiosyncratic
errors
• E[εit ] = 0 and εit ⊥⊥ Dis , for all i , t, s (strict exogeneity)
• ATT = E[τit |Dit = 1] can be non-parametrically identified if there
are only two periods (or two treatment histories)
11
DiD
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Time
Uni
t
Under Control Under Treatment
Treatment Status
12
Semi-parametric DiD
• Abadie (2005) proposes “semi-parametric DiD”
• Assumption: non-parallel outcome dynamics between treated and
controls caused by observed characteristics
• Two-step strategy:
1. estimate the propensity score based on observed covariates; compute
the fitted value
2. run a weighted DiD model
• The idea of using pre-treatment variables to adjust trends is a
precursor to synthetic control
• Strezhnev (2018) extends this approach to incorporate pre-treatment
outcomes
13
Synthetic Control: Basic Idea
• J + 1 units in periods 1, 2, . . . ,T ; one treated “1”, J controls
• Region “1” is exposed to the intervention after period T0
• We aim to estimate the effect of the intervention on Region “1”
1 T0 T
J
1
W*
14
Synthetic Control: Insights
• Athey and Imbens (2017): “[a]rguably the most important
innovation in the policy evaluation literature in the last 15 years.”
• A combo of many innovations
• Take advantage of pre-treatment outcomes
• Use cross-sectional instead of temporal correlations in data
• Construct a convex combination of donors to construct a
counterfactual
• Reserve some pre-treatment periods for testing
15
Difference-in-Differences (DiD)
−12 −10 −8 −6 −4 −2 0 2 4
Time relative to the treatment
5.0
5.5
6.0
6.5
Out
com
e
treated
controls
T0
16
Synthetic Control (and Many Extensions)
−12 −10 −8 −6 −4 −2 0 2 4
Time relative to the treatment
5.0
5.5
6.0
6.5
Out
com
e
treated
controls
T0
17
Theoretical Justification
Yit = τitDit + θ′tZi + ξt + λ′i ft + εit
or{Y 0it = θ′tZi + ξt + λ′i ft + εit
Y 1it = Y 0
it + τit
• Suppose there are R time-varying signals ft out there
• Each unit (e.g. country, participant) picks up a fixed linear
combination of these signals based on factor loadings λi
• Since these “confounders” are evidenced in the pre-treatment
outcomes for both treated and controls, we can try to use this
information to “balance on” these confounders
• We will discuss the model-based approach later18
Theoretical Justification
{Y 0it = θ′tZi + ξt + λ′i ft + εit
Y 1it = Y 0
it + τit
• Let W = (w2, . . . ,wJ+1)′ with wj ≥ 0 and w2 + · · ·+ wJ+1 = 1.
• Let Y K1
i , . . . , Y KM
i be M > R linear functions of pre-intervention
outcomes
• Suppose that we can choose W ∗ such that:
Z1 =∑J+1
j=2 w∗j Zj , Y k1 =
∑J+1j=2 w∗j Y
kj , k ∈ {K1, . . . ,KM}
• When T0 is large, an approximately unbiased estimator of τ1t is:
τ1t = Y1t −∑J+1
j=2 w∗j Yjt , t ∈ {T0 + 1, . . . ,T}
19
Implementation
• Let X1 = (Z1, YK11 , . . . , Y KM
1 )′ be a (k × 1) vector of
pre-intervention characteristics for the treated and X0, a (k × J)
matrix, for the controls.
• The vector W ∗ is chosen to minimize ‖X1 − X0W ‖, subject to our
weight constraints.
• We consider ‖X1 − X0W ‖V =√
(X1 − X0W )′V (X1 − X0W ), where
V is some (k × k) symmetric and positive semidefinite matrix.
• Various ways to choose V (subjective assessment of predictive power
of X , regression, minimize MSPE, cross-validation, etc.).
20
Example: Proposition 99 on Cigarette Consumption
• In 1988, California first passed comprehensive tobacco control
legislation (cigarette tax, media campaign etc.)
• Using 38 states that had never passed such programs as controls
1970 1975 1980 1985 1990 1995 2000
020
4060
8010
012
014
0
year
per−
capi
ta c
igar
ette
sal
es (
in p
acks
)
Californiarest of the U.S.
Passage of Proposition 99
Cigarette Consumption: CA and the Rest of the U.S.
21
Example: Proposition 99 on Cigarette Consumption
• In 1988, California first passed comprehensive tobacco control
legislation (cigarette tax, media campaign etc.)
• Using 38 states that had never passed such programs as controls
1970 1975 1980 1985 1990 1995 2000
020
4060
8010
012
014
0
year
per−
capi
ta c
igar
ette
sal
es (
in p
acks
)
Californiasynthetic California
Passage of Proposition 99
Cigarette Consumption: CA and Synthetic CA
22
Predictor Means: Actual vs. Synthetic California
California Average of
Variables Real Synthetic 38 control states
Ln(GDP per capita) 10.08 9.86 9.86
Percent aged 15-24 17.40 17.40 17.29
Retail price 89.42 89.41 87.27
Beer consumption per capita 24.28 24.20 23.75
Cigarette sales per capita 1988 90.10 91.62 114.20
Cigarette sales per capita 1980 120.20 120.43 136.58
Cigarette sales per capita 1975 127.10 126.99 132.81
Note: All variables except lagged cigarette sales are averaged for the 1980-1988
period (beer consumption is averaged 1984-1988).
23
Smoking Gap Between CA and Synthetic CA
1970 1975 1980 1985 1990 1995 2000
−30
−20
−10
010
2030
year
gap
in p
er−
capi
ta c
igar
ette
sal
es (
in p
acks
)
Passage of Proposition 99
24
Smoking Gap for CA and 38 Control States
(All States in Donor Pool)
1970 1975 1980 1985 1990 1995 2000
−30
−20
−10
010
2030
year
gap
in p
er−
capi
ta c
igar
ette
sal
es (
in p
acks
) Californiacontrol states
Passage of Proposition 99
25
Smoking Gap for CA and 34 Control States
(Pre-Prop. 99 MSPE ≤ 20 Times Pre-Prop. 99 MSPE for CA)
1970 1975 1980 1985 1990 1995 2000
−30
−20
−10
010
2030
year
gap
in p
er−
capi
ta c
igar
ette
sal
es (
in p
acks
) Californiacontrol states
Passage of Proposition 99
26
Smoking Gap for CA and 29 Control States
(Pre-Treatment MSPE ≤ 5 Times Pre-Treatment MSPE for CA)
1970 1975 1980 1985 1990 1995 2000
−30
−20
−10
010
2030
year
gap
in p
er−
capi
ta c
igar
ette
sal
es (
in p
acks
) Californiacontrol states
Passage of Proposition 99
27
Smoking Gap for CA and 19 Control States
(Pre-Treatment MSPE ≤ 2 Times Pre-Treatment MSPE for CA)
1970 1975 1980 1985 1990 1995 2000
−30
−20
−10
010
2030
year
gap
in p
er−
capi
ta c
igar
ette
sal
es (
in p
acks
) Californiacontrol states
Passage of Proposition 99
28
Ratio Post-Treatment MSPE to Pre-Treatment MSPE
(All 38 States in Donor Pool)
0 20 40 60 80 100 120
01
23
45
post/pre−Proposition 99 mean squared prediction error
freq
uenc
y
California
29
Limitations and Potential Solutions
• Deal with only one treated unit at a time
• Multiple treated units, e.g. Acemoglu et al. (2017)
• Inference is hard
• Permutation inference and sensitivity analysis, e.g. Hahn and Shi
(2016); Sergio et al. (2017); Chernochukov (2017)
• Allow too much user discretion, e.g. cherry-picking Y ki results in
over-rejection (Ferman et al. 2017)
• Slow to implement and sometimes difficult to find a solution
• Other reweighting approaches...
• Model-based methods
30
Literature
• Difference-in-differences (Ashenfelter & Card 1985; Card and Krueger 1994)
• Synthetic control (Abadie and Gardeazabal 2003; Abadie et al 2010, 2015)
• Best subset or penalized regression models (e.g. Hsiao et al. 2012;
Valero 2015; Doudchenko and Imbens 2016; Chernozhukov et al 2019)
• Matching and reweighting methods (e.g. Abadie 2005; Imai and Kim
2018; Hazlett and Xu 2018; Strezhnev; 2018; Imai et al 2019)
• Outcome model: factor augmented and matrix completion methods
(e.g. Gobillon and Magnac 2016; Xu 2017; Athey et al 2019)
• Hybrid approaches (e.g. Ben-Michael, Feller and Rothsstien 2018;
Arkhangelsky et al. 2019)
• Growing ...
31
Difference-in-Differences (DiD) Ashenfelter & Card (1985), Card & Kruger (1993)
Semi-parametric DiDAbadie (2003)
Synthetic Control Method Abadie & Gardeazabel (2003)
Abadie et al. (2010)
Time-varying confounders
Matching/Reweighting
Interactive Fixed Effects Bai (2009), Gobillon & Magnac (2016),
Xu (2017)
Matrix Completion Athey et al. (2018)
Balancing Robbins et al. (2017)
Hazlett and Xu (2018)
Multiple treated units, Computational and statistical efficiency, Robustness, Inference
Regression Methods Hsiao (2012)
Doudchenko & Immens (2016)
Panel Matching Imai and Kim (2018)
Imai, Kim and Wang(2018)
Literature
Propensity Score Reweighting
Austin (2011) Blackwell & Glynn (2018)
Strezhnev (2018)
Outcome Modeling Hybrid Methods
Augmented Synth Ben-Michael et al. (2018)
Synthetic DID Arkhangelsky et al. (2018)
Difference-in-Differences (DiD) Ashenfelter & Card (1985), Card & Kruger (1993)
Matching/Reweighting Outcome Modeling
Time-varying confounders
Literature Difference-in-Differences (DiD) Ashenfelter & Card (1985), Card & Kruger (1993)
Semi-parametric DiDAbadie (2003)
Matching/Reweighting Outcome Modeling
Time-varying confounders
Literature Difference-in-Differences (DiD) Ashenfelter & Card (1985), Card & Kruger (1993)
Semi-parametric DiDAbadie (2003)
Synthetic Control Method Abadie & Gardeazabel (2003)
Abadie et al. (2010)
Matching/Reweighting Outcome Modeling
Time-varying confounders
Literature Difference-in-Differences (DiD) Ashenfelter & Card (1985), Card & Kruger (1993)
Semi-parametric DiDAbadie (2003)
Synthetic Control Method Abadie & Gardeazabel (2003)
Abadie et al. (2010)
Matching/Reweighting
Interactive Fixed Effects Bai (2009), Gobillon & Magnac (2016),
Xu (2017)
Outcome Modeling
Time-varying confounders
Literature Difference-in-Differences (DiD) Ashenfelter & Card (1985), Card & Kruger (1993)
Semi-parametric DiDAbadie (2003)
Synthetic Control Method Abadie & Gardeazabel (2003)
Abadie et al. (2010)
Matching/Reweighting
Interactive Fixed Effects Bai (2009), Gobillon & Magnac (2016),
Xu (2017)
Matrix Completion Athey et al. (2018)
Outcome Modeling
Time-varying confounders
Literature Difference-in-Differences (DiD) Ashenfelter & Card (1985), Card & Kruger (1993)
Semi-parametric DiDAbadie (2003)
Synthetic Control Method Abadie & Gardeazabel (2003)
Abadie et al. (2010)
Matching/Reweighting
Interactive Fixed Effects Bai (2009), Gobillon & Magnac (2016),
Xu (2017)
Multiple treated units, Computational and statistical efficiency, Robustness, Inference
Outcome Modeling
Time-varying confounders
Literature
Matrix Completion Athey et al. (2018)
Balancing Robbins et al. (2017)
Hazlett and Xu (2018)
Panel Matching Imai and Kim (2018)
Imai, Kim and Wang(2018)
Propensity Score Reweighting
Austin (2011) Blackwell & Glynn (2018)
Strezhnev (2018)
Regression Methods Hsiao (2012)
Doudchenko & Imbens (2016)
Difference-in-Differences (DiD) Ashenfelter & Card (1985), Card & Kruger (1993)
Semi-parametric DiDAbadie (2003)
Synthetic Control Method Abadie & Gardeazabel (2003)
Abadie et al. (2010)
Time-varying confounders
Matching/Reweighting
Interactive Fixed Effects Bai (2009), Gobillon & Magnac (2016),
Xu (2017)
Matrix Completion Athey et al. (2018)
Multiple treated units, Computational and statistical efficiency, Robustness, Inference
Outcome Modeling Hybrid Methods
Literature
Balancing Robbins et al. (2017)
Hazlett and Xu (2018)
Panel Matching Imai and Kim (2018)
Imai, Kim and Wang(2018)
Propensity Score Reweighting
Austin (2011) Blackwell & Glynn (2018)
Strezhnev (2018)
Regression Methods Hsiao (2012)
Doudchenko & Imbens (2016)
Difference-in-Differences (DiD) Ashenfelter & Card (1985), Card & Kruger (1993)
Semi-parametric DiDAbadie (2003)
Synthetic Control Method Abadie & Gardeazabel (2003)
Abadie et al. (2010)
Time-varying confounders
Matching/Reweighting
Interactive Fixed Effects Bai (2009), Gobillon & Magnac (2016),
Xu (2017)
Matrix Completion Athey et al. (2018)
Multiple treated units, Computational and statistical efficiency, Robustness, Inference
Outcome Modeling Hybrid Methods
Augmented Synth Ben-Michael et al. (2018)
Synthetic DID Arkhangelsky et al. (2018)
Literature
Balancing Robbins et al. (2017)
Hazlett and Xu (2018)
Panel Matching Imai and Kim (2018)
Imai, Kim and Wang(2018)
Propensity Score Reweighting
Austin (2011) Blackwell & Glynn (2018)
Strezhnev (2018)
Regression Methods Hsiao (2012)
Doudchenko & Imbens (2016)
Difference-in-Differences (DiD) Ashenfelter & Card (1985), Card & Kruger (1993)
Semi-parametric DiDAbadie (2003)
Synthetic Control Method Abadie & Gardeazabel (2003)
Abadie et al. (2010)
Time-varying confounders
Matching/Reweighting
Interactive Fixed Effects Bai (2009), Gobillon & Magnac (2016),
Xu (2017)
Matrix Completion Athey et al. (2018)
Balancing Robbins et al. (2017)
Hazlett and Xu (2018)
Multiple treated units, Computational and statistical efficiency, Robustness, Inference
Panel Matching Imai and Kim (2018)
Imai, Kim and Wang(2018)
Propensity Score Reweighting
Austin (2011) Blackwell & Glynn (2018)
Strezhnev (2018)
Outcome Modeling Hybrid Methods
Augmented Synth Ben-Michael et al. (2018)
Synthetic DID Arkhangelsky et al. (2018)
Literature
Regression Methods Hsiao (2012)
Doudchenko & Imbens (2016)
Difference-in-Differences (DiD) Ashenfelter & Card (1985), Card & Kruger (1993)
Semi-parametric DiDAbadie (2003)
Synthetic Control Method Abadie & Gardeazabel (2003)
Abadie et al. (2010)
Time-varying confounders
Matching/Reweighting
Interactive Fixed Effects Bai (2009), Gobillon & Magnac (2016),
Xu (2017)
Matrix Completion Athey et al. (2018)
Balancing Robbins et al. (2017)
Hazlett and Xu (2018)
Multiple treated units, Computational and statistical efficiency, Robustness, Inference
Panel Matching Imai and Kim (2018)
Imai, Kim and Wang(2018)
Propensity Score Reweighting
Austin (2011) Blackwell & Glynn (2018)
Strezhnev (2018)
Outcome Modeling Hybrid Methods
Augmented Synth Ben-Michael et al. (2018)
Synthetic DID Arkhangelsky et al. (2018)
Literature
Regression Methods Hsiao (2012)
Doudchenko & Imbens (2016)
Roadmap
• DiD and synthetic control
• FE/DiD assumptions
• New estimators• Matching and reweighting
• Outcome models
* Diagnostic tools
• Hybrid methods
• Conclusions
FE/DiD Assumptions
Assumptions for Twoway FEs
Yit = τDit + X ′β + αi + ξt + εit
in which Dit is dichotomous
1. Functional form
• Additive fixed effect
• Constant and contemporaneous treatment effect
• Linearity in covariates
2. Strict exogeneity
εit ⊥⊥ Djs ,Xjs , αj , ξs ∀i , j , t, s
⇒ {Yit(0),Yit(1)} ⊥⊥ Djs |XXX ,ααα,ξξξ ∀i , j , t, s
if only two groups, parallel trends:
⇒ E[Yit(0)−Yit′(0)|XXX ] = E[Yjt(0)−Yjt′(0)|XXX ] i ∈ T , j ∈ C,∀t, t ′
33
Shortcomings of Twoway FEs
Yit = τDit + X ′β + αi + ξt + εit
Assumptions
1. Functional form
2. Strict exogeneity
Challenges
1. Treatment effect heterogeneity leads to bias
2. Difficult to evaluate strict exogeneity
3. Strict exogeneity means a lot more than what you think
4. A deeper question: what does fixed effects approach imply from a
design-based inference perspective?
34
Treatment Effect Heterogeneity Causes Bias
Yit = τitDit + αi + εit
Within estimator:
τ = arg minτ
∑i
∑t
{(Yit − Yit)− τ(Dit − Dit)}
Causal estimand (ATT):
τ = E[τit |Ci = 1,Dit = 1], Ci = 1 if var(Dit) 6= 0
Proposition:
τ → E{Ciσ2i [Yi (1)− Yi (0)]}E[Ciσ2
i ]=
E[Ciσ2i τi ]
E[Ciσ2i ]6= τ
i.e. a unit fixed effect model gives the average “variance-weighted”
treatment effect, cf. Chernochukov et al. (2013) Theorem 1
35
Shortcomings of Twoway FEs
Yit = τDit + X ′β + αi + ξt + εit
Assumptions
1. Functional form
2. Strict exogeneity
Challenges
1. Treatment effect heterogeneity leads to bias
2. Difficult to evaluate strict exogeneity
3. Strict exogeneity means a lot more than what you think
4. A deeper question: what does fixed effects approach imply from a
design-based inference perspective?
36
Common Practice: Dynamic Treatment Effect Plots
Adhikari and Alm (2016)
1. parametric assumptions
2. arbitrarily chosen base category
3. unreliable tests
37
Shortcomings of Twoway FEs
Yit = τDit + X ′β + αi + ξt + εit
Assumptions
1. Functional form
2. Strict exogeneity
Challenges
1. Treatment effect heterogeneity leads to bias
2. Difficult to evaluate strict exogeneity
3. Strict exogeneity means a lot more than what you think
4. A deeper question: what does fixed effects approach imply from a
design-based inference perspective?
37
Reinterpreting Strict Exogeneity (Imai and Kim 2019)
1. No unobserved time-varying confounder exists
2. Past outcomes don’t directly affect current outcome (no LDV)
3. Past treatments don’t directly affect current outcome (no “carryover
effect”)
4. Past outcomes don’t directly affect current treatment (no “feedback”)
38
Reinterpreting Strict Exogeneity (Imai and Kim 2019)
1. No unobserved time-varying confounder exists
2. Past outcomes don’t directly affect current outcome (no LDV)
3. Past treatments don’t directly affect current outcome (no “carryover
effect”)
4. Past outcomes don’t directly affect current treatment (no “feedback”)
• To relax 2 or 3, “block”/control for past treatments → but how many?
• To relax 4, need instrumental variables (Arellano and Bond 1991) → hard to
justify instruments; bad finite sample properties
• Often end up directly controlling for arbitrary number of past treatments
and LDVs → Nickel bias
39
Shortcomings of Twoway FEs
Yit = τDit + X ′β + αi + ξt + εit
Assumptions
1. Functional form
2. Strict exogeneity
Challenges
1. Treatment effect heterogeneity leads to bias
2. Difficult to evaluate strict exogeneity
3. Strict exogeneity means a lot more than what you think
4. A deeper question: what does fixed effects approach imply from a
design-based inference perspective?→ hypothetical experiment?
40
Hypothetical Experiment?
Strict exogeneity implies the following data generating processes:
αi ,XiXiXi → DiDiDi → YiYiYi
treatment status are assigned randomly or at one shot, not sequentially!
Examples: random assignment within units
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Time
Uni
t
Under Control Under Treatment
Treatment Status
41
Hypothetical Experiment?
Strict exogeneity implies the following data generating processes:
αi ,XiXiXi → T0i → DiDiDi → YiYiYi
treatment status are assigned randomly or at one shot, not sequentially!
Examples: staggered adoption (Athey and Imbens 2018)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
Time
Uni
t
Under Control Under Treatment
Treatment Status
42
What We Try to Do
Yit = τDit + X ′β + αi + ξt + εit
Assumptions
1. Functional form
2. Strict exogeneity
What we do
? Allow treatment effect heterogeneity
? Develop methods to evaluate strict exogeneity
X Relax the no-time-varying confounder assumption
- Think harder about the hypothetical experiment
43
Roadmap
• DiD and synthetic control
• FE/DiD assumptions
• New estimators• Matching and reweighting
• Outcome models
* Diagnostic tools
• Hybrid methods
• Conclusions
Matching and Reweighting
The Matching/Reweighting Approach
Synthetic control
Yit(0) = λ′i ft + ξt + εit , and E[εit ] = 0, ∀i , t
• Choose weights on controls such that Yit =∑
i ′ 6=i w∗i ′Yi ′t , ∀t ≤ T0
• As a result: λi =∑
i ′ 6=i w∗i ′λi ′
New Developments
• Propensity score reweighting – extension to Abadie (2005)
• Panel matching
• Panel regression methods
• Balancing: mean balancing and trajectory balancing
44
Panel Matching (Imai, Kim & Wang 2019)
Assumptions
• Sequential exogeneity (past info can affect today’s treatment)
→ Parallel trends after conditions (similar to Abadie 2005)
• SUTVA (no spillover effect)
• Limited carryover effect
Procedure
1. Create a matched set for each treated observation based on
treatment history
2. Refine the matched set via any matching or weighting method
3. Compute causal effect via weighted DiD using the refined set
4. Calculate model-based standard errors
45
Example: Democracy and Economic Growth
1960 1963 1966 1969 1972 1975 1978 1981 1984 1987 1990 1993 1996 1999 2002 2005 2008
Year
Cou
ntry
Missing Autocracy Democracy
Democracy and Economic Growth: Treatment Status
46
Example: Democracy and Economic Growth
• Match based on treatment history for the past L periods
47
Example: Democracy and Economic Growth
• Refine the matched set based on covariates and pre-treatment
outcomes
The number of matched control units
48
Example: Democracy and Economic Growth
Estimated treatment effects
49
Panel Matching: Advantages and Limitations
Advantages
• Require sequential exogeneity instead of strict exogeneity
• Allow treatment reversal
• Allow a variety of matching/reweighting methods
Limitations
• Lots of data (w/ info on outcome dynamics) are dropped
• Normally, imbalances remain
• Many choices require user discretion
50
Panel Regression Methods
• Synthetic control finds a convex combination of controls to mimic
the treated
• Implicitly assumes
• non-negative weights
• no intercept shift
• weights add up to 1
• Other ways to obtain weights
• Constrained regression
• Regression based on the best subset
• Penalized regression
• Balancing
51
Constrained Regression
• No intercept shift; weights add up to 1; non-negativity
ωconstr = arg minω
∑t≤T0
(Y1t − ω′YCt)2
s.t.∑i∈C
ωi = 1 and ωi ≥ 0,∀i ∈ C
• Limitation: T0 > Nco
52
Best Subset (Hsiao et al. 2012)
• Choose the number of weights that are allowed to be positive
• Optimize the weights after taking out the intercepts(µsubset , ωsubset
)= arg min
µ,ω
∑t≤T0
(Y1t − µ− ω′YCt)2
s.t.∑i∈C
1ωi 6=0 ≤ k
• Bottom-up approach: search for the best 1, then the best 2, then
the best 3 ... (greedy)
• Weights can be negative and do not need to add up to 1
53
Penalized Regression (Doudchenko and Imbens 2016)
• Use elastic net (L1 and L2 regularization) to select weights
(µen, ωen) = arg minµ,ω
(Y1,pre−µ−ω′YC,pre)2+λ·(
1− α2‖ω‖2
2 + α‖ω‖1
)in which λ and α are tuning parameters
• Allow intercept shift; weights can be negative and do not add up to 1
• Randomization inference: (1) treated unit is randomly selected; (2)
the timing of the onset of the treatment is randomly selected
54
Example: Proposition 99 in California
Adapted from Doudchenko and Imbens (2016)
55
Example: German Reunification
Adapted from Doudchenko and Imbens (2016)
56
Differences Among Existing Methods
synth cstr Hsiao EN IPW PMatch meanbal tjbal
Methods of getting weights bal reg reg reg reg match bal bal
Allow short T0 X X X XNon-negative weights X X X X XWeights add up to 1 X X X XIntercept shift X X X X X X XConvex combination X X X
Multiple treated units X X X XComputational efficiency X X X X X XDimension reduction X X(Almost) exact balancing X XHigher-order features X
See Doudchenko & Imbens (2016) and Ben-Michael et al. (2018) for detailed discussions.
57
Balancing: A Simple, Unified Framework
Almost all commonly used panel data models imply common function
space with Y 0post linear in Ypre .
Linearity in Pre-Treatment Outcomes – LPO
E[Y 0it |Yi,pre ] = (1 Yi,pre)′θt , T0 < t ≤ T .
• Diff-in-Diffs
• Twoway FEs
• ARMA
• IFE, Synth
Basic Idea: Equal means on Yi,pre would imply equal mean on Y 0it,t>T0
regardless of θt (Robbins et al. 2017)
58
Mean Balancing
• Objective: choose weights on controls to get same average
pre-treatment trajectory for weighted controls as treated,
1
Ntr
∑i∈T
Yi,pre =∑j∈C
wjYj,pre
• Conceptually similar to synth: choosing weights to predict for
counterfactual averages
• In practice: seek approximate balance, working from largest toward
smallest principal components of Ypre(Ypre)′ with a stopping rule of
minimizing the upper bound of biases (Hazlett & Xu 2018)
59
Mean Balancing
Assumption 1 (Linearity in Pre-Treatment Outcomes)
E[Y 0it |Yi,pre ] = (1 Yi,pre)′θt , T0 < t ≤ T .
Assumption 2 (Conditional Ignorability)
Y 0it ⊥⊥ Gi |Yi,pre , ∀t > T0
Assumption 3 (Feasibility)1
Ntr
∑Gi=1
Yit =∑Gi=0
wiYit , t ≤ T0 wi ≥ 0,∑Gi=0
wi = 1
Under the above assumptions, we have:
Proposition 4 (Unbiasedness)
E[ATT t |Ypre ] = ATTt , ∀t > T0
60
Limitations of Mean Balance
1. Weights that achieve mean balancing can leave treated and control
different on non-linear functions of Ypre
2. Mean balancing limited by number of pre-treatment points
• Few pre-treatment periods = few constraints unless you make more
• With enough periods, anything that matters to Y 0 will appear in Y 0
and can be balanced on — but with fewer periods, no guarantees
61
Trajectory Balancing
• Mean balancing: a simple approach to get same average trajectory
for weighted controls as treated:
1
Ntr
∑i∈T
Yi,pre =∑j∈C
wjYj,pre
• Trajectory balancing: feature mapping Yi,pre 7→ φ(Yi,pre),
then balance
1
Ntr
∑i∈T
φ(Yi,pre) =∑j∈C
wjφ(Yj,pre)
• Get mean balance on a feature expansion instead
φ(Yi,pre ,Xi ), φ : RP 7→ RP′
62
Choice of φ()
A good choice of φ() is one that:
• requires little or no user discretion
• includes all continuous functions (at the limit)
• perhaps, prioritizes low frequency, smoother functions
• allows covariates to play a role
Implementation: Gaussian kernel then approximation via principal
components
63
Implementation
• Instead of Ypre , form kernel matrixKi,j = k([Yi ,Xi ], [Yj ,Xj ]) = exp(−||[Yi ,Xi ]− [Yj ,Xj ]||2/h)
• Replaces each unit’s [Yi,pre ,Xi ] with a vector ki encoding how similar
observation i is to observation 1, 2, ...
• SVD this matrix to obtain components/ eigenvectors
• Choose weights to get mean balance on these, starting from largest
• We choose the number of principal components to include by minimizing the
upper bound of bias in the ATT estimates
64
When Averages Fail and φ()’s Thrive
Intuition: mean balancing is okay but may emphasize “wrong” features
of the pre-treatment trend
• Trajectory balancing gets you similarity of whole trajectories rather
than just equal means at each time point → balance on
“higher-order” features such as variance, curvature, etc.
• Approximately, trajectory balancing gets multivariate distribution of
Ypre for the controls equal to that of the treated, whereas mean
balancing only gets marginals equal
• This can matter when non-linear functions of Ypre are confounders,
especially when T0 short
65
When Mean Balancing Fail: A Severe Example
• N = 200 countries with simulated GDP over years T ∈ {1, 2, ..., 24}
• Two “types” of countries:
Volatile with no growth:
GDPit = 5 + ai sin(.2πt) + bicos(.2πt) + .1εit
εit ∼ N(0, 1), ai , bi ∼ U(−1, 1)
Or steady growing:
GDPit = 4 + ci1.03t + .1εit
εit ∼ N(0, 1), ci ∼ U(0.9, 1.1)
• A randomly selected 25% of the stable type take the treatment.
66
When Mean Balancing Fails: A Severe Example
Controls
Year
GD
P
34
56
7
0 4 8 12 16 20 24
Treated
Year
GD
P
34
56
7
0 4 8 12 16 20 24
Heavily weighted control units (pre-treatment)
Controls: Mean Balancing
Year
GD
P
34
56
7
1 2 3 4 5 6 7 8
Treated AverageWeighted Control Average
Controls: Kernel Balancing
Year
GD
P
34
56
7
1 2 3 4 5 6 7 8
Treated AverageWeighted Control Average
67
When Mean Balancing Fails: A Severe Example
68
What Information is Encoded in the Kernel Matrix?
●●
●
●
● ●
●
● ●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●●●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●●
●
●
●
●
●
●●●
●
●
●●
● ●
●
●
●
●
● ●●
●
● ●●
●
●
●
●●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●●
●
●●● ●●
●●
●
●
●
● ●
●
●●
●
●
●●
●
●● ●
●
●
●
●
●
●
●
●
Log Variance of Pre−treatment Outcomes
Firs
t Prin
cipa
l Com
pone
nt o
f K
−0.
10−
0.08
−0.
06−
0.04
−0.
020.
00
−7 −6 −5 −4 −3 −2 −1 0
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●● ●
●
●
●
●
●
●
● ●●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
TreatedControls (all)Controls (weighted)
69
Two Extensions
• Incorporating covariates
Assumption 5 (Linearity in φ(Xi ,Yi,pre))
E[Y 0it |Xi ,Yi,pre ] = φ(Xi ,Yi,pre)′θt , T0 < t ≤ T .
• Intercept shift and demeaning
Assumption 6 (Parallel Trends)
E[Y 0it − Y 0
is |Yi,pre ] = E[Y 0it − Y 0
is |Yi,pre ,Gi ], ∀ t, s ∈ {1, 2, · · · ,T}
70
Truex (2014): Return to office in China’s Parliament
• Treatment: CEO taking a seat in the
National People’s Congress (NPC)
Outcome: Return on assets (ROA)
• 48 treated firms, 984 controls
Pre-treatment: 2005-2007
Post-treatment: 2008-2010
• Two covariates: state ownership,
revenue in 2007
• Balancing on: roa2005, roa2006,
roa2007, so portion, rev2007 (and
higher order terms through a kernel
transformation)2005 2006 2007 2008 2009 2010
YearF
irm
71
Balance on Pre-treatment Outcome Trajectories
Year
Ret
urn
On
Ass
ets
Top 25% Heavily Weighted Controlsw/ Mean Balancing
2005 2006 2007
−0.
50−
0.25
0.00
0.25
0.50
Year
Ret
urn
On
Ass
ets
Treated
2005 2006 2007
−0.
50−
0.25
0.00
0.25
0.50
Year
Ret
urn
On
Ass
ets
Top 25% Heavily Weighted Controlsw/ Kernel Balancing
2005 2006 2007
−0.
50−
0.25
0.00
0.25
0.50
72
Balance Check
Mean
●
●
●
●
●rev2007
so_portion
roa2007
roa2006
roa2005
−0.50 −0.25 0.00 0.25 0.50
Difference in Means
● Unweighted Mean Balancing Kernel Balancing
Variance
●
●
●
●
●rev2007
so_portion
roa2007
roa2006
roa2005
−4 0 4
(Varco − Vartr) Vartr
● Unweighted Mean Balancing Kernel Balancing
73
Truex (2014): Main Results
2005 2006 2007 2008 2009 2010
−0.
020.
000.
020.
04
Year
Diff
eren
ce−
in−
Mea
ns
Mean BalancingKernel Balancing
NPC Membership and Return on Assets
74
Summary
• Removing time-invariant confounders is costly, i.e.,no carryover
effect, no feedback from past Y to current D (Kim & Imai 2018)
• Sequential exogeneity may be more desirable than strict exogeneity
• Panel non-parametric and semi-parametric methods have gone a
long way
• Things quickly get more complex when the number of different
treatment histories grows
• Inference is hard especially when there are only a small number of
treated units
75
Roadmap
• DiD and synthetic control
• FE/DiD assumptions
• New estimators• Matching and reweighting
• Outcome models
* Diagnostic tools
• Hybrid methods
• Conclusions
Outcome Models
Basic Idea
• In a panel setting, treat Y (1) as missing data
• Predict Y (0) based on an outcome model
• (Use pre-treatment data for model selection)
• Estimate ATT by averaging differences between Y (1) and Y (0)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
Time
Uni
t
Under Control Under Treatment
Treatment Status
76
Basic Idea
ATT s = E[τit |Di,t−s = 0,Di,t−s+1 = Di,t−s+2 = · · · = Dit = 1︸ ︷︷ ︸s periods
,∀i ∈ T ].
−4 −3 −2 −1 0 1 2 3 4 5
−5 −4 −3 −2 −1 0 1 2 3 4
−6 −5 −4 −3 −2 −1 0 1 2 3
−7 −6 −5 −4 −3 −2 −1 0 1 2
−8 −7 −6 −5 −4 −3 −2 −1 0 1
10
9
8
7
6
5
4
3
2
1
1 2 3 4 5 6 7 8 9 10
time
id
Under Control Under Treatment
77
A New Plot for “Dynamic Treatment Effects”
100
−5.0
−2.5
0.0
2.5
5.0
7.5
−30 −20 −10 0 10Time relative to the Treatment
Effe
ct o
n Y
No "Pre−trend"
78
A New Plot for “Dynamic Treatment Effects”
100
−5.0
−2.5
0.0
2.5
5.0
7.5
−30 −20 −10 0 10Time relative to the Treatment
Effe
ct o
n Y
A "Pre−trend"
79
Model-based Counterfactual Estimators
A model-based counterfactual estimator proceeds in the following steps:
• Step 1. Train the model using observations under the control
condition (Dit = 0).
• Step 2. Predict the counterfactual outcome Yit(0) for each
observation under the treatment condition (Dit = 1) and obtain an
estimate of the individual treatment effect: τit = Yit − Yit(0).
• Step 3. Generate estimates for the causal quantities of interest
ATT = E[τit |Dit = 1,∀i ∈ T ,∀t], or
ATTs = E[τit |Di,t−s = 0,Di,t−s+1 = Di,t−s+2 = · · · = Dit = 1︸ ︷︷ ︸s periods
,∀i ∈ T ].
80
Review of Three Estimators
We review three estimation strategies:
• FEct:
Yit(0) = Xit β + αi + ξt
• IFEct (Gobillon&Magnac 2016; Xu 2017):
Yit(0) = Xit β + λ′i Ft
• Matrix Completion (MC) (Athey et al. 2018):
Yit(0) = Xit β + Lit ,
where matrix {Lit}N×T is a lower-rank matrix approximation of
{Y (0)}N×T with missing values
Remarks:
• DiD is a special case of FEct
• Both IFEct and MC are estimated via iterative algorithms
• Cross-validation to choose the tunning parameter
81
IFEct
Xu (2017) proposes a three-step approach based on a latent factor model:
Control Yit(0) = X ′itβ + αi + ξt + λ′i ft + εit
Treated Yit(0) = X ′itβ + αi + ξt + λ′i ft + εit (pre)
Yit(1) = X ′itβ + αi + ξt + λ′i ft + εit + τit (post)
1. Expectation-Maximization
(Gobillon & Magnac 2016)factors: r × T
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30Time
Treated (Pre) Treated (Post) Controls
82
Election Day Registration (EDR) and Voter Turnout
Causal inference is a missing data problem.
WVWAVTVAUTTXTNSDSCRIPA
OROKOHNYNVNMNJNENCMSMOMI
MDMALAKYKSINIL
GAFLDECOCAAZARALCTMTIA
WYNHIDWI
MNME
1920 1924 1928 1932 1936 1940 1944 1948 1952 1956 1960 1964 1968 1972 1976 1980 1984 1988 1992 1996 2000 2004 2008 2012Year
Sta
te
Treated (Pre) Treated (Post) Controls
EDR Reform
83
Main Results
9
−15
−10
−5
0
5
10
15
−15 −10 −5 0 5Time relative to the Treatment
Effe
ct o
n Tu
rnou
t (%
)
FEct
9
−15
−10
−5
0
5
10
15
−15 −10 −5 0 5Time relative to the Treatment
Effe
ct o
n Tu
rnou
t (%
)
IFEct
84
Factors and Factor Loadings
1920 1936 1952 1968 1984 2000 2016
Year
−15
−10
−5
05
1015
Turn
out %
Factor 1Factor 2
AL
AZ
AR
CA
CO
DE
FL
GAIL
IN
KS
KY
LA
MD
MA
MI
MS
MO
NE
NV
NJ
NM
NY
NC
OH
OK
OR
PARI
SC
SD
TN
TXUTVT
VA
WA
WV
CT
ID
IA
MEMN
MT
NH
WI
WY
−20 −10 0 10 20
Loadings for Factor 1
−4
−2
02
46
Load
ings
for
Fact
or 2
85
Example: Cadre Visits on Loans
−12 −10 −8 −6 −4 −2 0 2 4
Quarter(s) before / after a cadre's visit
−0.
6−
0.4
−0.
20.
00.
20.
40.
60.
8
GS
C E
stim
ate
The Effect of Cadre Visits95% Confidence Interval
86
Example: Property Rights and Land Improvement (Sanford 2019)
• Does property rights lead to improved land quality?
• “Experiment”: giving peasants in Borgou, Benin land titles
• Use satellite (remote sensing) data to measure land improvement,
i.e., switch from annual crops to perennial crops (bushes and trees)
• Use IFEct to construct counterfactuals
87
Original Satellite Images
88
Treated and Counterfactual Averages
Pre-treatment Outcome
89
Treated and Counterfactual Averages
Post-treatment Outcome (1 Year)
90
Treated and Counterfactual Averages
Post-treatment Outcome (4 Years)
91
Factors and Loadings
92
Geographic Distribution of Heavily-weighted Controls
Bigger number represents higher dissimilarity
93
Matrix Completion Methods
AL
CA
DE
IA
IN
LA
ME
MO
NC
NJ
NY
OR
SC
TX
VT
WV
1920 1928 1936 1944 1952 1960 1968 1976 1984 1992 2000 2008year
abb
Treated States (before EDR) Treated States (after EDR) Control States
EDR Reform• Recall that our main goal is to
predict treated counterfactuals
• Taking advantage of the matrix
structure,
matrix completion methods use
non-treated data to achieve this
goal
• The basic idea to find a
lower-rank representation of the
matrix to impute the “missing
data”
• Xu (2017) is a special case of
this approach94
Matrix Completion Methods
• Recall in the baseline DiD setup:
Y =
(Y 0T ,pre ??
Y 0C,pre Y 0
C,post
)• Matrix completion (MC) methods attempt to find a lower-rank
representation of Y, which we call L, that makes predictions of
missing values in Y
• Athey et al. (2018) generalize Xu (2017) with different ways of
constructing L
• Plus, missingness can be arbitrary → accommodate reversible
treatments (note: strict exogeneity)
95
Matrix Completion Methods
• Mathematically,
Yit = Lit + αi + ξt + X ′itβ + εit
in which Lit is an element of L, an (N × T ) matrix
• We need regularization on L because of too many parameters:
minL
1
#Controls
∑Dit=0
(Yit − Lit)2 + λL‖L‖∗
• The nuclear norm ‖.‖∗ generally leads to a low-rank solution for L
‖L‖∗ =
min(N,T )∑i=1
σi (L)
in which σi (L) that the singular values of L
96
IFEct vs. MC
• Singular value decomposition of L
LN×T = SN×NΣN×TRT×T
• Difference in how ΣN×T is regularized
IFE MC
best subset nuclear norm
σ1 0 0 · · · 0
0 σ2 0 · · · 0
0 0 0 . . . 0
.
.
.
.
.
.
.
.
.
...
.
.
.
0 0 0 · · · 0
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
0 0 0 · · · 0
|σ1 − λL|+ 0 0 · · · 0
0 |σ2 − λL|+ 0 · · · 0
0 0 |σ3 − λL|+ . . . 0
.
.
.
.
.
.
.
.
.
... 0
|σT − λL|+0 0 0 · · · 0
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
0 0 0 · · · 0
in which |a|+ = max(a, 0)
97
IFEct vs. MC
The pros and cons of IFEct and MC:
• IFEct works better with a small number of strong factors
• MC works better with a large number of weak factors
IFEct (known r)
IFEct (CV−ed r)
MC
Variance of each factor
0
5
10
15
20
1 2 3 4 5 6 7 8 9Number of Factors in the DGP
MS
PE
98
Over-fitting of IFEct and MC
When the true DGP is a two-factor IFE model:
sigma (Pre)
SD (ATT)
Bias (ATT)
RMSE (ATT)
IFEct
0.0
0.5
1.0
1.5
2.0
2.5
0 1 2 3 4 5 6Number of Factors in the Model
sigma (Pre)
SD (ATT)
Bias (ATT)
RMSE (ATT)
MC
0.0
0.5
1.0
1.5
2.0
2.5
−4.4 −5.2 −6.1 −7 −7.8 −8.7 −9.6log(Tunning Parameter)
99
Inferential Methods
• Non-parametric block bootstrap
• sample with replacement across units
• valid when N is large, NtrN
is fixed
• A permutation-based test for Sharp Nulls (Chernozhukov et al 2019)
• e.g. Yit(1) = Yit(0), ∀i ∈ T , t > T0i
• randomization over time (by blocks) instead of across units
• valid if T is large, errors are stationary weekly dependent, and
estimators are consistent or stable
• exact if errors are i.i.d.
100
Algorithm for Testing the Sharp Null
Based on Chernozhukov et al (2019):
• Assuming the Sharp Null is true: H0 : Y (0)it = Y (1)it , ∀i ∈ T , t.
• Denote Zt = (Y1t ,Y2t , · · ·YNt ,X1t ,X2t , · · ·XNt)′.
• Denote a i.i.d. block permutation πh a one-to-one mapping:
πh : {b1, · · · , bK} 7→ {b1, · · · , bK}, in which {b1, · · · , bK} is a
partition of {1, · · · ,T}.
101
Algorithm for Testing the Sharp Null
Step 1. Using a counterfactual estimator with real data, calculate the
following test statistics S(u) =√|O||ATT |, in which |O| is the number
of treated observations.
Step 2. Generate a random i.i.d block permutation πh; reorganize the
data (over the time dimension) based on πh while fixing the treatment
assignment matrix.
Step 3. Re-estimate the test statistic using the permuted data,
obtaining S(uπh) using the same method in Step 1.
Step 4. Repeat Steps 2-3 H times, obtaining {S(uπ1 ) · · · ,S(uπH)}.
Step 5. Calculate the p-value: p = 1− F (S(u)) in which
F (x) =1
H
H∑1
1{S(uπh) < x}
102
Diagnostic Tests
A Simulated Example
Data Generating Process:
• T = 35, N = 200
• Outcome model: a linear interactive fixed effect model with two factors: one drift
process and one white noise.
Yit = τitDit + 5 + 1 · Xit,1 + 3 · Xit,2 + λi1 · f1t + λi2 · f2t + αi + ξt + εit
• Treatment assignment: staggered adoption with T0i correlated with additive and
interactive fixed effect.
• Treatment effects: τi,t>T0i = 0.2(t − T0i ) + eit
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
Time
Uni
t
Under Control Under Treatment
Treatment Status
103
Dynamic Treatment Effects
100
−5.0
−2.5
0.0
2.5
5.0
7.5
−30 −20 −10 0 10Time relative to the Treatment
Effe
ct o
n Y
IFEct
104
1. Placebo Test
• Drop S periods before the treatment’s onset, and estimate the
average treatment effect in these periods.
• Test whether the average effect is significant
• Robust to model misspecification, but using only limited information
● ●
●●
●● ●
● ●
●
●
●●
●●
●
● ● ● ● ● ●● ●
●●
●● ●
●
●
●
●
●
●
●
●
●
●
●
●
●●
Placebo p value: 0.400
100
−5.0
−2.5
0.0
2.5
5.0
7.5
−30 −20 −10 0 10Time relative to the Treatment
Effe
ct o
n Y
IFEct
105
2. Equivalence Test
• Extension to Hartman and Hidalgo (2018) in a TSCS setting
• H0: |ATTt | > τ vs. H1: |ATTt | ≤ τ
• We calculate the maximal possible τ and compare it with
pre-specified threshold: 0.36 ∗ sd(Yit |Dit = 0)
• It has more power when the sample size grows larger, and is more
likely to reject the Null (hence, equivalence holds) when a
confounder is trivial
• Drawback 1: setting the threshold requires user discretion
• Drawback 2: easy to pass when pre-treatment data are used to fit
the model (IFEct and MC)
106
Equivalence Test
Wald p value: 0.321
100
−5.0
−2.5
0.0
2.5
5.0
7.5
−30 −20 −10 0 10Time relative to the Treatment
Effe
ct o
n Y
MC
107
Why Not a Wald test?
• A simpler test: conduct an Wald (F) test on pre-treatment residual
averages
• However, when there exists a small confounders which induces a
neglectable bias compared with the ATT, a Wald test will almost
always reject the Null (that equivalence holds) when there are
enough data
• The Equivalence Test avoids this problem
108
Why Not a Wald Test?
• There exist a time-varying confounder
• We vary its influence on the bias in the ATT
Wald Equivalence
0.00
0.25
0.50
0.75
1.00
0.0 0.1 0.2 0.3 0.4Bias / SD(Y)
Pro
port
ion
Dec
lare
d B
alan
ce
109
Why Not a Wald Test?
Wald
Equivalence
0.00
0.25
0.50
0.75
1.00
0 250 500 750 1000Number of Units (N)
Pro
port
ion
Dec
lare
d B
alan
ce
Bias = 0.08SD(Y )
Wald
Equivalence
0.00
0.25
0.50
0.75
1.00
0 250 500 750 1000Number of Units (N)
Pro
port
ion
Dec
lare
d B
alan
ce
Bias = 0.28SD(Y )
110
Hainmueller & Hangatner (2019)
Does indirect democracy benefit immigrant minorities?
• Unit of analysis: 1400 Swiss municipalities from 1991-2009
• Treatment: Indirect (vs. direct) democracy
• Outcome: Naturalization rate
1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
Year
Uni
t
Direct Democracy Indirect Democracy
111
Hainmueller & Hangatner (2019)
●
●
●
●
● ● ●
●
●
● ●
●
●
●
●
● ●
●
●
●
●●
Placebo p value: 0.404
470
−4
−2
0
2
4
−15 −10 −5 0 5Time relative to the Treatment
Effe
ct o
n N
atur
aliz
atio
n R
ate
(%)
Placebo Test
112
Xu (2017): EDR on Turnout
Does Election Day Registration increase turnout?
• Unit of analysis: 47 US states from 1920 to 2012
• Treatment: EDR reform
• Outcome: Turnout
1920 1924 1928 1932 1936 1940 1944 1948 1952 1956 1960 1964 1968 1972 1976 1980 1984 1988 1992 1996 2000 2004 2008 2012
Year
Sta
te
No EDR EDR
Treatment Status
113
Xu (2017): EDR on Turnout
FEct
Wald p value: 0.129
9
−15
−10
−5
0
5
10
15
−15 −10 −5 0 5Time relative to the Treatment
Effe
ct o
n Tu
rnou
t (%
)
Dynamic Treatment Effects
●
●
● ● ●
●
●
●
●●
●
●
●●
●
●
●
●
●
●● ●
Placebo p value: 0.088
9
−15
−10
−5
0
5
10
15
−15 −10 −5 0 5Time relative to the Treatment
Effe
ct o
n Tu
rnou
t (%
)
Placebo Test
114
Xu (2017): EDR on Turnout
IFEct
Wald p value: 0.728
9
−15
−10
−5
0
5
10
15
−15 −10 −5 0 5Time relative to the Treatment
Effe
ct o
n Tu
rnou
t (%
)
Dynamic Treatment Effects
● ●● ●
●
●
●
●
● ●
●
●
●●
●
● ●
●
●
●● ●
Placebo p value: 0.316
9
−15
−10
−5
0
5
10
15
−15 −10 −5 0 5Time relative to the Treatment
Effe
ct o
n Tu
rnou
t (%
)
Placebo Test
115
Xu (2017): EDR on Turnout
MC
Wald p value: 0.644
9
−15
−10
−5
0
5
10
15
−15 −10 −5 0 5Time relative to the Treatment
Effe
ct o
n Tu
rnou
t (%
)
Dynamic Treatment Effects
● ●●
●●
●
● ●
●●
●
●
●●
●
●
●
●
●
● ●●
Placebo p value: 0.094
9
−15
−10
−5
0
5
10
15
−15 −10 −5 0 5Time relative to the Treatment
Effe
ct o
n Tu
rnou
t (%
)
Placebo Test
116
Acemoglu et al. (2019): Democracy on Growth
• Unit of analysis: 184 countries over 51 years (1960-2010)
• Treatment: a dichotomous measure of democracy and autocracy
• Outcome: Log GDP per capita (’2000 dollars)
Entering Democracy 117
Acemoglu et al. (2019): Democracy on Growth
• Unit of analysis: 184 countries over 51 years (1960-2010)
• Treatment: a dichotomous measure of democracy and autocracy
• Outcome: Log GDP per capita (’2000 dollars)
Wald p value: 0.111
126
−40
−20
0
20
40
−20 −10 0 10 20Time relative to the Treatment
Effe
ct o
n lo
g G
DP
per
cap
ita
FEct
Wald p value: 0.182
126
−40
−20
0
20
40
−20 −10 0 10 20Time relative to the Treatment
Effe
ct o
n lo
g G
DP
per
cap
ita
IFEct
Wald p value: 0.079
126
−40
−20
0
20
40
−20 −10 0 10 20Time relative to the Treatment
Effe
ct o
n lo
g G
DP
per
cap
ita
MC
Entering Democracy
118
Acemoglu et al. (2019): Democracy on Growth
• Unit of analysis: 184 countries over 51 years (1960-2010)
• Treatment: a dichotomous measure of democracy and autocracy
• Outcome: Log GDP per capita (’2000 dollars)
Wald p value: 0.890
58
−60
−30
0
30
60
−10 0 10 20Time relative to the Treatment
Effe
ct o
n lo
g G
DP
per
cap
ita
FEct
Wald p value: 0.270
58
−60
−30
0
30
60
−10 0 10 20Time relative to the Treatment
Effe
ct o
n lo
g G
DP
per
cap
ita
IFEct
Wald p value: 0.620
58
−60
−30
0
30
60
−10 0 10 20Time relative to the Treatment
Effe
ct o
n lo
g G
DP
per
cap
ita
MC
Exiting Democracy
119
Roadmap
• DiD and synthetic control
• FE/DiD assumptions
• New estimators• Matching and reweighting
• Outcome models
* Diagnostic tools
• Hybrid methods
• Conclusions
Hybrid Methods
Hybrid Methods
• So far, we’ve surveyed two group of methods: (1) those constructing
balancing weights; (2) those modeling the conditional outcomes
• Combining the two approaches will likely produce doubly robust
estimators
• Some methods we discussed, including semi-parametric DiD, panel
matching, trajectory balancing, are already doing a simple version of
it (balancing plus regression)
• We review two new methods that formally adopt this idea
• Augmented synthetic control (Ben-Michael et al 2018): modeling first
• Synthetic DiD (Arkhangelsky et al. 2019): weighting first
120
Augmented Synthetic Control
• Assuming one treated unit (unit N)
• Procedure
1. Run an outcome model (FEct, IFEct, MC, etc.) and obtain model fit m(Xi )
2. Balance on the residual averages, obtaining weights γi for the controls
3. Treated average is constructed using:
Y augN (0) = m(XN) +
N−1∑i=1
γi (Yi − m(Xi ))
• The balancing weights take care of the remaining biases from the outcome
model; the estimator is thus doubly robust
• Inference via clustered bootstrap
121
Synthetic DiD
• Assuming one treated unit (unit N) and one post-treatment period
(period T ); weights add up to 1
• Procedure1. Estimate “synthetic control weight” for each control unit:
ωsc = arg minω∑T−1
t
(∑N−1i=1 ωiYit − YNt
)2. Estimate “synthetic control weight” for each time period:
λsc = arg minλ∑N−1
i
(∑T−1t=1 λtYit − YiT
)3. Estimate a weighted DiD by minimizing:
N∑i=1
T∑t=1
(Yit − µ− αi − βt − Xitγ − Ditτ)2ωi λt
• Either the SC weights or the outcome model is correct, the causal effect
will be identified (doubly robust)
• Inference via jackknife
122
Conclusions
Concluding Remarks
• Removing unit fixed effects is costly, i.e., strict exogeneity; the alternative
is sequential exogeneity
• “Parallel trends” can very well be wrong; when T is large, we can assess
the assumption by checking the “pre-trend” using diagnostic testes
• Counterfactual estimators can relax the homogeneity assumption with
little cost
• Both the matching/reweighting approach (dealing with D) and the
model-based approach (dealing with Y) can help with time-varying
confounders with sufficient data
• Be aware of the modeling assumptions when the latter is employed
• A hybrid estimator is doubly robust
123
Practical Recommendations
• Plotting raw data helps us see obvious problems
• Start from conventional estimators (i.e. DiD, FEct) and check the
“pre-trend”
• If they don’t work, try easy fixes, e.g. trimming the data to make
the treated and controls more alike
• If that doesn’t work, either, we need more complex methods
• Always ask yourself first: “what’s the hypothetical experiment?’
124
Packages
• panelView: panel data visualization
• fastplm: fast panel linear fixed effects estimation (coming soon)
• gsynth: IFEct/MC approach with non-reversible treatments
• fect: IFEct/MC methods with diagnostic tests (coming soon)
• tjbal: trajectory balancing
Thank [email protected]
http://yiqingxu.org
github.com/xuyiqing
125
References
• Abadie, Alberto, Alexis Diamond, and Jens Hainmueller (2010). Journal of the American Statistical
Association. June 1, 2010, 105(490): 493–505.
• Imai, Kosuke and In Song Kim (2019). “When Should We Use Unit Fixed Effects Regression Models
for Causal Inference with Longitudinal Data?” American Journal of Political Science, Vol. 62, Iss. 2,
April 2019, pp. 467–490.
• Abadie, Alberto (2005). “Semiparametric Difference-in-Differences Estimators,” Review of Economic
Studies (2005) 72, 1–19.
• Hsiao, Cheng, H. Steve Ching and Shui Ki Wan (2012). “A Panel Data Approach for Program
Evaluation: Measuring the Benefits of Political and Economic Integration of Hong Kong with
Mainland China,” Journal of Applied Econometrics, Vol. 27, Iss. 5, August 2012, pp. 705–740.
• Doudchenko, Nikolay and Guido Imbens (2016). “Balancing, Regression, Difference-In-Differences
and Synthetic Control Methods: A Synthesis.” Working Paper Stanford University.
• Hazlett, Chad and Yiqing Xu (2018). “Trajectory Balancing: A Kernel Method for Causal Inference
with Time-Series Cross-Sectional Data.” Working Paper, UCLA.
• Xu, Yiqing (2017). “Generalized Synthetic Control Method: Causal Inference with Interactive Fixed
Effects Models” Political Analysis, Vol. 25, Iss. 1, January 2017, pp. 57-76.
• Athey, Susan, Mohsen Bayati, Nikolay Doudchenko, Guido Imbens, Khashayar Khosravi (2017).
“Matrix Completion Methods for Causal Panel Data Models.” Working Paper, Stanford University.
• Liu, Licheng, Ye Wang, Yiqing Xu (2019). ”A Practical Guide to Counterfacutal Estimators for
Causal Inference with Time-Series Cross-Sectional Data.” Working Paper, Stanford University.
• Ben-Michael Eli, Avi Feller, Jesse Rothstein (2018). “The Augmented Synthetic Control Method.”
Working Paper, UC Berkeley.
• Arkhangelsky, Dmitry, Susan Athey, David A. Hirshberg, Guido W. Imbens and Stefan Wager (2019).
“Synthetic Difference In Differences.” Working Paper, UC Berkeley.126