OPPORTUNITIES, COSTS AND BENEFITS: RETHINKING THE
EDUCATION PRODUCTION FUNCTION
A DISSERTATION
SUBMITTED TO THE GRADUATE SCHOOL OF EDUCATION
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
KENNETH SHORES
MARCH 2016
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/fw890kf0299
© 2016 by Kenneth Aaron Shores. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
sean reardon, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Eamonn Callan
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Susanna Loeb
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Debra Satz
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
ABSTRACT
My dissertation incorporates both traditional and non-traditional approaches to the specifi-
cation and estimation of the education production function. I pursue three related questions:
1. I use normative philosophical methods to consider whether implications of equal
opportunity and adequacy distributive principles are compatible with the basic and
intuitive right that all students have claim to at least some educational resources to
develop their abilities. If an incompatibility is found, this suggests that these paradigmatic distributive principles are in need of revision.
2. I propose a method for estimating an achievement scale that is equal-interval with
respect to benefit. I develop and implement survey experiments to estimate individ-
ual preferences for math and reading academic skills. This quantitative description
allows for both between and within attribute comparisons, making it possible to de-
termine, for example, whether a 10-point gain at the low end of the math scale is
preferable to a 20-point gain at the high end of the reading scale. Such a scale can be
used for cost-effectiveness evaluations.
3. I (with Christopher Candelaria) provide new evidence about the effect of court-
ordered finance reform on per-pupil revenues and graduation rates. We account
for cross-sectional dependence and heterogeneity in the treated and counterfactual
groups to estimate the effect of overturning a state’s finance system. Seven years af-
ter reform, the highest poverty quartile in a treated state experienced a 4 to 12 percent
increase in per-pupil spending and a 5 to 8 percentage point increase in graduation
rates.
Contents
1 Introduction to the dissertation 1
I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
II. Paper one: Normative approaches to achievement . . . . . . . . . . . . . . 3
III. Paper two: Welfare adjusted scale score . . . . . . . . . . . . . . . . . . . 5
IV. Paper three: Sensitivity of causal estimates from finance reform (with Christo-
pher Candelaria) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
V. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Achievement is not income 10
I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
II. Establishing a right to educational resources . . . . . . . . . . . . . . . . . 13
III. Testing distributive theories against the right to some educational resources 16
III.A. Expanded fair equality of opportunity . . . . . . . . . . . . . . . . 16
III.B. Restricted fair equality of opportunity . . . . . . . . . . . . . . . . 19
III.C. Objections to the test . . . . . . . . . . . . . . . . . . . . . . . . . 21
III.D. Equality of opportunity for what? . . . . . . . . . . . . . . . . . . 25
III.E. Adequacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
IV. What distribution of resources is entailed by the right? . . . . . . . . . . . . 31
IV.A. A right that is too weak . . . . . . . . . . . . . . . . . . . . . . . . 32
IV.B. A right that is too strong . . . . . . . . . . . . . . . . . . . . . . . 32
IV.C. What right is ‘just right’? . . . . . . . . . . . . . . . . . . . . . . . 33
V. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Welfare adjusted scale score: Method toward the development of an equal-interval welfare scale 36
I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
II. Survey design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
II.A. Math and reading descriptors and scale scores . . . . . . . . . . . . 45
II.B. Linking NAEP descriptors to scale scores . . . . . . . . . . . . . . 45
II.C. Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
III. Econometric framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
IV. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
IV.A. Ordinal ranking exercise . . . . . . . . . . . . . . . . . . . . . . . 51
IV.B. Beta estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
IV.C. Comparing original to welfare-adjusted scale . . . . . . . . . . . . 58
V. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
VI. Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
VII. Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4 The sensitivity of causal estimates from Court-ordered finance reform on spending and graduation rates (with Christopher Candelaria) 82
I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
II. Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
III. Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
IV. Econometric specifications and model sensitivity . . . . . . . . . . . . . . 91
IV.A. Benchmark differences-in-differences model . . . . . . . . . . . . 91
IV.B. Explaining model specifications . . . . . . . . . . . . . . . . . . . 93
IV.C. Alternative model specifications . . . . . . . . . . . . . . . . . . . 96
V. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
V.A. Benchmark differences-in-differences model results . . . . . . . . . 97
V.B. Model sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
V.C. Equalizing effects . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
V.D. Robustness checks . . . . . . . . . . . . . . . . . . . . . . . . . . 105
VI. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
VII. Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
VIII. Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
IX. Data Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
X. Additional Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
XI. Additional Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
XII. Technical Appendix: Standard Errors . . . . . . . . . . . . . . . . . . . . 133
5 Bibliography 136
Chapter 1
Introduction to the dissertation
I. Introduction
The United States currently spends about $700 billion per year on the provision of K-12
public education, and we would like to know if this investment is worthwhile. One method
is to use an education production function, linking inputs to valued outcomes (Hanushek,
1979). Much depends upon what counts as valued. It is common to use academic achieve-
ment scores as the outcome in these production functions, as test scores describe what a
person knows and can do, and one of the purposes of schools is to produce knowledge and
ability. While descriptions of achievement are certainly useful for describing what individ-
uals know and can do, on their own, the information they provide about value is likely to be
limited. Consider, for example, an achievement score that measures how well individuals
speak a fictional language. Even with a good measure of the ability, the benefit correspond-
ing to the measure is likely to be very small. This suggests that in order for an achievement
score to reflect value it needs to be linked to something else.
One approach is to recast an achievement score to reflect its labor market value (Mur-
nane, et al., 2001; Cunha & Heckman, 2008; Cunha, Heckman, & Schennach, 2010;
Chetty, Friedman & Rockoff, 2011; Bond & Lang, 2013). With the assumption that more
earnings are better than less, this approach usefully links an academic achievement score
to something else of value. Still, earnings do not capture the benefits of achievement inclu-
sively. A retiree on a fixed income who values being literate rather than illiterate demon-
strates that one can value academic achievement for non-pecuniary reasons. Moreover, if
this same person would like to remain up to date with current medical research, but cares
little for reading fiction, he values one type of academic skill over another. Irrespective of its impact on earnings, then, academic achievement clearly affects happiness, and different abilities will produce different benefits.
An outcome measure that reflects the value of achievement, broadly construed, would be
useful for determining whether schools are effective or not. Determining what is valuable
and how much something should be valued requires an interdisciplinary approach. Philos-
ophy (specifically normative theory) applies analytical methods to identify the values that
correspond to different choices and, when possible, to suggest which values are more im-
portant (Swift, 1999; McDermott, 2008). Economics is more concerned with the value of
efficiently satisfying preferences (Hausman & McPherson, 2006). Stated-preference methods (also referred to as empirical social choice) are an economic tool that uses survey or laboratory experiments to quantify how much welfare is associated with certain non-priced
goods (Adamowicz, Louviere, & Williams, 1994; Gaertner, 2009; Gaertner & Schokkaert,
2012). Taken together, philosophical and economic tools can provide a broader, more in-
clusive description of the values associated with academic achievement.
My dissertation incorporates both traditional and non-traditional approaches to the spec-
ification and estimation of the education production function. I pursue three related ques-
tions:
1. I use normative philosophical methods to consider whether implications of equal
opportunity and adequacy distributive principles are compatible with the basic and
intuitive right that all students have claim to at least some educational resources to
develop their abilities. If an incompatibility is found, this suggests that these paradigmatic distributive principles are in need of revision.
2. I propose a method for estimating an achievement scale that is equal-interval with
respect to benefit. I develop and implement survey experiments to estimate individ-
ual preferences for math and reading academic skills. This quantitative description
allows for both between and within attribute comparisons, making it possible to de-
termine, for example, whether a 10-point gain at the low end of the math scale is
preferable to a 20-point gain at the high end of the reading scale. Such a scale can be
used for cost-effectiveness evaluations.
3. I (with Christopher Candelaria) provide new evidence about the effect of court-
ordered finance reform on per-pupil revenues and graduation rates. We account
for cross-sectional dependence and heterogeneity in the treated and counterfactual
groups to estimate the effect of overturning a state’s finance system. Seven years af-
ter reform, the highest poverty quartile in a treated state experienced a 4 to 12 percent
increase in per-pupil spending and a 5 to 8 percentage point increase in graduation
rates.
II. Paper one: Normative approaches to achievement
This paper argues for a rights-based principle for the distribution of educational resources.
The right that is proposed is: “All students, no matter who they are, have claim to at least
some educational resources to develop their abilities.” By endorsing this claim, we em-
brace the idea that schools exist for all students, and that no students should be denied edu-
cational opportunities for any characteristic they may have. The phrase “some educational
resources” is intentionally vague. It does not specify how much each student is entitled to,
and allows for the possibility that some may need more than others. The principle only
entails a minimum amount for all students.
Any time a student’s characteristics are used as a basis for exclusion from any educational opportunity, we will recognize the harm that is done to that student. Likewise, if a
distributive principle requires that some groups receive no resources, we must conclude
that the principle is inadequate, as the fulfillment of the principle means that some stu-
dents’ legitimate claims on some educational resources will be violated. A principle of
racial or gender discrimination proves the point, as students, based on their race or gender,
are denied educational opportunities on account of those characteristics. Similarly, a prin-
ciple that said “give only to those who stand to gain the most” would also be in violation,
as there are necessarily some who do not stand to gain the most and will be left without.
Such a right seems like a small thing, and I doubt there would be much objection to it.
Nevertheless, I show that a suite of popular distributive theories—two versions of fair
equality of opportunity and adequacy—problematically conflict with the right in many fea-
sible circumstances. At issue is the simple fact that, in many cases, satisfying equality
and adequacy principles will require the rich or the talented to relinquish their claim on
educational resources. Discrimination on the basis of ability or income violates the right.
Two subtle questions will be in need of sorting out:
1 Is the right for all students to at least some educational resources merely one of many competing principles–akin to parental partiality, “all things considered” priority, etc.–or does the right reveal fundamental limitations to these popular distributive principles when applied to educational achievement?
2 If we concede some basic egalitarian intuitions–that ability differences are important for one’s life prospects, morally arbitrarily assigned, and compensable through educational investments–how do we reconcile these intuitions with the general claim to educational resources that all students share?
I argue in response to the first question that the right does reveal fundamental limitations
to current distributive paradigms applied to education. In short, achievement is not income
and needs to be treated differently. Indeed, when we consider other distributive objects,
such as income and welfare, the rights-based objection to equality and adequacy falls short.
A thoroughgoing response to the second question is, unfortunately, not forthcoming.
Balancing equality against general claims has a kind of Goldilocks problem. A minimal
right to educational resources, such as an offer of the smallest divisible unit of resource
for the rich or talented, is too weak; a maximal right that promises equal resources to all
students violates any notion of fair equality and is too strong. The right that is ‘just right’
is not well defined, but it can be found somewhere between the minimal and maximal
specifications.
III. Paper two: Welfare adjusted scale score
The use of academic scale scores in education production functions is so commonplace that a list of citations is unnecessary. When a scale score is used as the dependent variable
it connotes value or expected benefit. Holding costs constant, a program that raises test
scores 20 points is more effective than a program that raises test scores 10 points. This is the logic of cost-effectiveness analysis (see Levin and Belfield, 2014, for a review). In order for
this evaluation to be made, researchers and policymakers must assume that a scale score
is equal-interval scaled with respect to benefit. That is, for example, they must assume
that a 10-point gain at the bottom of the scale score is equivalent to a 10-point gain at the
top of the scale score. Such an assumption is rarely tested and there are not strong priors
indicating that such a relationship exists.
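To make the stakes of that assumption concrete, here is a minimal sketch. The concave `welfare` function and the program numbers below are invented placeholders, not estimates from this paper; the point is only that two programs that tie on the raw scale can separate once gains are valued on a concave welfare scale.

```python
import numpy as np

# Hypothetical concave welfare mapping over scale scores (a placeholder,
# not an estimated scale): gains at the bottom are worth more than at the top.
def welfare(score):
    return np.log(score)

cost = 1000.0  # assume both programs cost the same per pupil

# Program A: +10 points at the top of the scale (340 -> 350)
# Program B: +10 points at the bottom of the scale (150 -> 160)
gain_a_points = 350 - 340
gain_b_points = 160 - 150

gain_a_welfare = welfare(350) - welfare(340)
gain_b_welfare = welfare(160) - welfare(150)

# On the raw scale the two programs look equally cost-effective...
print(gain_a_points / cost == gain_b_points / cost)  # True
# ...but on the concave welfare scale, B delivers more benefit per dollar.
print(gain_b_welfare / cost > gain_a_welfare / cost)  # True
```

If the true benefit function is concave, any cost-effectiveness ranking computed on the raw scale implicitly assumes it away.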
In this paper I describe and implement a method for constructing a scale score that is
equal-interval with respect to welfare. I employ a choice-based conjoint design (oftentimes
referred to as a discrete choice experiment) to obtain utility values for different math and
reading descriptors obtained from the National Assessment of Educational Progress Long
Term Trend (NAEP-LTT). In the experiment, respondents are provided with a description
of two individuals who are alike in all respects except that they differ in their math and
reading abilities. Respondents are asked to determine which bundle of math and reading abilities indicates which of the two persons will have an “all things considered” better life.
After reviewing the reading and math profiles, the respondent is forced to make a choice between Persons A and B. The response is coded dichotomously: 1 if Person A was chosen and 0 otherwise.
The purpose of this experiment is for the respondent to make interval comparisons be-
tween Persons A and B with respect to welfare. As an example, consider a choice task
where Person A has reading ability equal to 5 and math ability equal to 2, while Person
B has reading and math abilities equal to 3.1 Effectively, the respondent is being asked to
make a trade between 2 units of reading for 1 unit of math. Whether respondents, on aver-
1In the actual choice task, respondents are given performance level descriptors, which are taken from the NAEP-LTT. These descriptors are equidistant textual accounts of an individual’s math and reading ability.
age, choose Person A over B will depend on how much they value reading relative to math,
and, importantly, how much they value math gains at the bottom of the distribution relative to reading losses at the top. How respondents on average weigh these different trades determines the relative concavity of the welfare-adjusted scale score.
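The choice logic above can be sketched as a simple binary logit over part-worth utilities. The utility numbers below are invented for illustration (they are not the survey estimates); the point is that the probability of choosing Person A depends only on the utility difference between the two ability profiles.

```python
import numpy as np

# Hypothetical part-worth utilities for five math and five reading levels
# (invented, concave in the level; indexed 0..4 from lowest to highest).
beta_math = np.array([0.0, 0.9, 1.5, 1.9, 2.1])
beta_read = np.array([0.0, 0.8, 1.4, 1.8, 2.0])

def utility(math_level, read_level):
    """Total utility of a (math, reading) ability profile."""
    return beta_math[math_level] + beta_read[read_level]

def p_choose_a(profile_a, profile_b):
    """Logit probability that a respondent picks Person A over Person B."""
    diff = utility(*profile_a) - utility(*profile_b)
    return 1.0 / (1.0 + np.exp(-diff))

# Person A: math level 1, reading level 4; Person B: both abilities at level 2.
# Choosing A means trading reading at the top for math at the bottom.
print(p_choose_a((1, 4), (2, 2)))  # 0.5: these illustrative profiles tie
```

With concave part-worths like these, a unit gained at the bottom of a scale moves the choice probability more than a unit gained at the top, which is exactly what the estimated scale is meant to capture.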
These performance level descriptors are then mapped onto the original scale score us-
ing the scale anchoring process employed by the NAEP. I now have a data set with three
variables and 10 observations: a vector of performance level descriptors, the corresponding
scale scores (150, 200, 250, 300, 350), and the estimated utilities. I use piece-wise mono-
tone cubic interpolation (MCI) for values not directly estimated from performance level
descriptors. This provides a scale score that is equal-interval with respect to welfare, as
long as we assume that the equal-interval assumptions (with respect to ability) of the orig-
inal scale score hold and that the performance level descriptors are appropriately mapped
to scale score values.
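Under those assumptions, the interpolation step can be sketched as follows. Only the five anchor scores come from the NAEP-LTT scale; the utility values are illustrative placeholders, not the paper’s estimates.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Five NAEP-LTT anchor scale scores and hypothetical estimated utilities
# (the utilities are illustrative placeholders, concave in the score).
scale_scores = np.array([150.0, 200.0, 250.0, 300.0, 350.0])
utilities = np.array([0.00, 0.45, 0.75, 0.92, 1.00])

# Monotone (PCHIP) cubic interpolation fills in utilities between anchors
# without introducing spurious wiggles: the curve stays monotone increasing.
welfare_scale = PchipInterpolator(scale_scores, utilities)

# A 10-point gain at the bottom of the scale vs. the same gain at the top:
low_gain = welfare_scale(160) - welfare_scale(150)
high_gain = welfare_scale(350) - welfare_scale(340)
print(low_gain > high_gain)  # True: the bottom gain is worth more
```

Monotone interpolation matters here: an unconstrained cubic spline through concave utility anchors could overshoot and imply that more ability sometimes lowers welfare.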
As hypothesized, I find that utility values for different achievement states are non-linear
and concave. Gains in reading and math are worth more at the bottom than at the top.
In order to demonstrate how the newly estimated scale can be applied, I compare cohort
trends in the white-black achievement gap between the original NAEP scale and the newly
estimated one. When achievement is re-scaled to reflect value, changes in achievement
gaps are different in both magnitude and direction in many instances. This is due to the fact
that gains/losses for lower achieving groups are worth more than gains/losses for higher
achieving groups.
IV. Paper three: Sensitivity of causal estimates from fi-
nance reform (with Christopher Candelaria)
Whether school spending has an effect on student outcomes has been an open question in
the economics literature, dating back to at least Coleman (1966). The causal relationship
between spending and desirable outcomes is of obvious interest, as the share of GDP that
the United States spends on public elementary and secondary education has remained fairly
large throughout the past five decades, ranging from 3.7 to 4.5 percent.2 Given the large
share of spending on education, it would be useful to know if these resources are well
spent. Despite this interest, we lack stylized facts about the effects of spending on changes
in student academic and adult outcomes. The goal of this paper is to provide a robust
description of the causal relationship between fiscal shocks and student outcomes at the
district level for US states undergoing financial reform for the period 1990-91 to 2009-10.
Using district-level aggregate data from the Common Core of Data (CCD), we estimate the effects of fiscal shocks, where fiscal shocks are defined as a state’s first Supreme Court ruling
that overturns a given state’s finance system, on the natural logarithm of per-pupil spend-
ing and graduation rates. Researchers are presented with a number of modeling strategies
in panel data situations. We have two objectives. The first is to present a theoretically
rich model that is flexible enough to handle two aspects of the identification problem: first,
treatment occurs at the state level and, second, there is treatment effect heterogeneity within
states. Given the variety of reasonable modeling choices that exist, our second objective is
to show how sensitive our results are to some common alternative specifications.
Altogether, we estimate a heterogeneous differences-in-differences model that accounts
for (a) cross-sectional dependence at the state level, (b) a poverty quartile-by-year secular
trend, and (c) state-by-poverty quartile linear time trends. In this preferred specification,
we find that high poverty districts in states that had their finance regimes overturned by court order experienced an increase in log spending by 4 to 12 percent and graduation
rates by 5 to 8 percentage points seven years following reform.
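The benchmark comparison can be illustrated on simulated data. Everything below is invented for the sketch (the state labels, the 1998 treatment year, and the 0.08 log-point effect); it shows only the basic difference-in-differences contrast, not the full model with cross-sectional dependence and trend controls.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated district-by-year panel (invented; not the CCD data).
states = [f"s{i}" for i in range(10)]
treated = set(states[:5])          # states whose finance system is overturned
rows = []
for st in states:
    for quartile in range(1, 5):   # poverty quartile, 4 = highest poverty
        for year in range(1991, 2010):
            post = int(st in treated and year >= 1998)
            # True effect: spending rises only in treated, high-poverty districts.
            effect = 0.08 * post * (quartile == 4)
            log_spend = 8.0 + 0.01 * (year - 1991) + effect + rng.normal(0, 0.01)
            rows.append((st, quartile, year, log_spend))
df = pd.DataFrame(rows, columns=["state", "quartile", "year", "log_spend"])

# Difference-in-differences by hand for the highest poverty quartile:
# (treated post - treated pre) minus (untreated post - untreated pre).
q4 = df[df.quartile == 4]
trt, ctl = q4[q4.state.isin(treated)], q4[~q4.state.isin(treated)]
did = ((trt[trt.year >= 1998].log_spend.mean() - trt[trt.year < 1998].log_spend.mean())
       - (ctl[ctl.year >= 1998].log_spend.mean() - ctl[ctl.year < 1998].log_spend.mean()))
print(round(did, 2))  # close to the simulated 0.08 log-point effect
```

Because the common secular trend differences out, this simple contrast recovers the planted effect; the paper’s regression versions additionally absorb state and quartile-by-year heterogeneity.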
To test the extent to which results are equalizing, we estimate slightly different models,
allowing the effect of reform to be continuous across poverty quantiles. We control for sec-
ular changes in the equalization efforts in non-treated states by including year fixed effects
interacted with continuous poverty. This provides an estimate of the marginal change in
graduation rate for a one-unit increase in poverty percentile rank within a state. Here we
see that the effect of reform was equalizing: for every 10 percentile increase in poverty
2These estimates come from the Digest of Education Statistics, 2013 edition. As of the writing of this draft, the 2013 version is the most recent publication available.
within a treated state, per-pupil log revenues increased by 0.9 to 1.8 percent and graduation
rates increased by 0.5 to 0.85 percentage points in year seven.
We then subject the model to various sensitivity tests by permuting the interactive fixed
effects, secular time trends, and correlated random trends. In total, we estimate 15 complementary models and present these results graphically. Overall, the results are robust across model specifications.
This paper makes substantive and methodological contributions. Substantively, we find
that court cases overturning a state’s financial system for the period 1991-2010 had an effect
on revenues and graduation rates, that these results are robust to a wide variety of modeling
choices, and that this effect was equalizing. Taken together, our two models show that states undergoing court-ordered finance reform both (a) increased revenues and graduation rates
in high poverty districts relative to high poverty districts in other states and (b) allocated
a greater share of these revenues to higher poverty districts within the state, relative to
allocations taking place in non-treated states, resulting in an increase in graduation rates.
Methodologically, we emphasize the variety of modeling strategies available to researchers
using panel data sets, including specification of the secular trend, correlated random trends,
and cross-sectional dependence. While the researcher may argue for a preferred model,
justifiable alternatives are often available. Here we have presented a graphical method
that researchers can use to effectively and efficiently demonstrate the sensitivity of point
estimates to modeling choice.
V. Conclusion
The quality of our inferences in an education production function hinges on three indepen-
dent but interrelated factors:
1 A normative account that justifies which outcome variables should (or should not) be included in the model
2 The specification of an outcome variable that accurately measures our valued commitments
3 An econometric model that properly links inputs to desired outcomes
In educational policy settings, we are presented with choices about which outcomes to in-
clude, how to specify them, and which identification strategy to follow. In this dissertation I
have considered each of these factors in isolation. I have argued that equal opportunity and
adequacy principles are not easily reconciled with legitimate student claims to some educa-
tional resources. I have presented and implemented a method that applies a welfare-based
weighting scheme to different parts of the ability distribution, thus allowing the outcome
variable in cost-effectiveness evaluations to better reflect benefit. Finally, I have presented
compelling causal evidence about the effect of court-ordered finance reform on spending
and graduation rates in high poverty districts.
Chapter 2
Achievement is not income
Abstract
In education, it is common to hear that certain students should receive additional resources in order to increase their achievement. A class of equal opportunity principles classifies students into protected and non-protected groups. Protected groups are to receive educational resources up until the point they catch up to non-protected groups. Adequacy principles specify a threshold, perhaps indexed to some other value, that classifies students into two groups, those who are below and above the threshold. Those below are to receive educational resources up until the point they reach the threshold. The full realization of both equality and adequacy principles will, in many cases, make it so that no resources will be available for either non-protected groups or those above the threshold. If we endorse a rights-based distributive principle that says all students have claim to at least some educational resources to develop their abilities, then equality and adequacy principles are incomplete distributive principles. A re-calibrated distributive principle must reconcile egalitarian moral reasoning with the general claim to some educational resources. The size of each student’s claim is left intentionally vague. It can be neither too little (just a token) nor too large (equal resources for all), but somewhere between the two extremes lies the claim.
I. Introduction
In this paper, I argue for a rights-based principle for the distribution of educational re-
sources. The right that is proposed is: “All students, no matter who they are, have claim
to at least some educational resources to develop their abilities.” The phrase “some edu-
cational resources” is intentionally vague. It does not specify how much each student is
entitled to, and allows for the possibility that some may need more than others. The princi-
ple only entails a minimum amount for all students. Such a right seems like a small thing,
and I doubt there will be much objection to it.
Nevertheless, I show that a suite of popular distributive theories—two versions of fair
equality of opportunity and adequacy—problematically conflict with the right to resources
that all students share. In many cases, satisfying equality and adequacy principles will
require the rich or the talented to relinquish their claim on educational resources. At issue
is the simple fact that both principles divide student populations into protected and non-
protected classes (in the case of equality) or below and above threshold groups (in the case
of adequacy). Non-protected classes and above threshold groups are offered nothing by the
respective principles. Such principles discriminate on the basis of ability or income and
this violates the right.
Two questions will need to be sorted out:
1 Is the right for all students to at least some educational resources merely one of many competing principles–akin to parental partiality, “all things considered” priority, etc.–or does the right reveal fundamental limitations to these popular distributive principles when they are applied to educational achievement?
2 If we concede some basic egalitarian intuitions–that ability differences are important for one’s life prospects, morally arbitrarily assigned, and compensable through educational investments–how do we reconcile these intuitions with the general claim to educational resources that all students share?
In response to the first question, I argue that the right does reveal fundamental limitations
to current distributive paradigms applied to education. In short, achievement is not income
and needs to be treated differently. Indeed, when we consider other distributive objects,
such as income and welfare, the rights-based objection to equality and adequacy falls short.
A thoroughgoing response to the second question is, unfortunately, not forthcoming.
Balancing equality against general claims has a kind of Goldilocks problem. A minimal
right to educational resources, such as an offer of the smallest divisible unit of resource
for the rich or talented, is too weak; a maximal right that promises equal resources to all
students violates any notion of fair equality and is too strong. The right that is ‘just right’
is not well defined, but it can be found somewhere between the minimal and maximal
specifications.
The paper proceeds as follows. I lay out some understanding for what the right to edu-
cational resources entails (and specifically, what it does not entail). I then hold the right to
educational resources against three popular distributive principles in education: expanded
and restricted fair equality of opportunity and adequacy. I show that these principles will
conflict with the right to educational resources in many plausible scenarios. I then consider
whether we should interpret this conflict as just one more example of an important value
that conflicts with equality, or whether the conflict reveals something more fundamental
about the principles. Finally, I provide some (admittedly unsatisfactory) details about what
the right to educational resources must entail.
II. Establishing a right to educational resources
Suppose we observed the following:
In some schools, certain students—we know nothing about their demographic characteristics, their social origins, or their performance on a test—are left to their own devices every day. Teachers give them no attention. On some days, if the guardian is present, students may stay home to play video games or read comic books. In short, they learn nothing.
Here I make a claim about the distribution of opportunities for achievement that I think
most people will find uncontroversial. If we find the above scenario troubling, then we are
led to endorse the view that:
Rights-based distributive principle: All students, no matter who they are, have
claim to at least some educational resources to develop their abilities.
By endorsing this view, we embrace the idea that schools exist for all students, and
that no students should be denied educational opportunities for any characteristic they may
have.1 The phrase “some educational resources” is intentionally vague. It does not specify
how much each student is entitled to, and allows for the possibility that some may need
more than others. The principle only entails a minimum amount for all students.
Any time a student’s characteristics are used as a basis for exclusion from any educational opportunity, we will recognize the harm that is done to that student. Likewise, if a
distributive principle requires that some groups receive no resources, we must conclude
that the principle is inadequate, as the fulfillment of the principle means that some stu-
dents’ legitimate claims on some educational resources will be violated. Thus, if we are to
endorse the idea that no students should be deprived of all opportunities to develop their
abilities in school, we can use it to test other, competing distributive principles. Here is the
test:
1 If the principle identifies some subgroups based on their characteristics and uses those characteristics to deny those students any resources to develop their abilities, then we can conclude that the principle is inadequate.
2 If in order for the distributive principle to be satisfied, it requires, at least in some instances, that certain students receive no resources to develop their abilities, then we can conclude that the principle is inadequate.
This test is not conclusive. It cannot be used, for example, to determine whether some
students should receive more resources than other students. Instead, the purpose of the
1Some might object that the view is too strong, as it allows for the possibility that students who are a threat to other students are also entitled to resources for opportunities to develop their abilities. In other words, the principle excludes policies like out of school suspension. Whether out of school suspension is a legitimate policy choice is outside the scope of this paper. If necessary, we can modify the principle to be “all students, no matter who they are, as long as they are not a threat to other students, have claim to at least some educational resources to develop their abilities.” The change will not affect the results in any way.
test is to determine whether a principle fails in this very fundamental way. What kinds
of principles fail this test? There are obvious candidates. A principle of racial or gender
discrimination obviously fails, as students, based on their race or gender, are denied edu-
cational opportunities on account of those characteristics. Similarly, a principle that said
“give only to those who stand to gain the most” would also be in violation, as there are
necessarily some who do not stand to gain the most and will be left without.
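The test has the form of a simple decision procedure, which can be made concrete with a small sketch (my own illustration, not part of the dissertation's argument; the principle names and allocation figures are invented). A principle is modeled by the allocation it produces, and it fails the second condition of the test whenever satisfying it leaves some student with no resources at all:

```python
def fails_rights_based_test(allocation):
    """Return True if the allocation denies some student all educational resources.

    `allocation` maps each student to the amount of resources a distributive
    principle assigns them. The rights-based principle requires a positive
    minimum for every student, however small.
    """
    return any(amount <= 0 for amount in allocation.values())


# "Give only to those who stand to gain the most": everything to one student.
gain_most = {"low_achiever": 10.0, "high_achiever": 0.0}

# A compensatory rule that still guarantees everyone a minimum share.
compensatory_with_floor = {"low_achiever": 8.0, "high_achiever": 2.0}

print(fails_rights_based_test(gain_most))                # True: inadequate
print(fails_rights_based_test(compensatory_with_floor))  # False: passes
```

The sketch makes vivid that the test is a floor condition only: it says nothing about how resources above the floor should be divided, which is exactly the limitation acknowledged below.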
However, the focus of this essay is not on those principles that are most easy to defeat.
My focus instead is a suite of popular theories of distributive justice in education that, one
way or another, tacitly endorse discrimination of students on the basis of ability. I am not
referring to the kind of ability that we normally think about; on the contrary, most theories
endorse some form of compensation for low-achieving students in the form of educational
resources. The discrimination I have in mind, in most cases, is against the very talented,
those whose abilities surpass the abilities of other students, though through no fault of their
own.
By calling attention to the very talented, the purpose is not to defend a meritocratic
conception of justice, where only those with the “right” abilities have access to certain ad-
vantages. It is possible to recognize the unfairness in the distribution of opportunities for
labor market success, in part resulting from differences in ability, without the need to deny
high achieving students all learning opportunities. Nor do I argue or believe that the very
talented are a particularly disadvantaged group, one with whom we should sympathize. We
can recognize that those with greater ability will have certain advantages without denying
them the opportunity to develop their abilities through schooling. Finally, we need not
claim that higher achieving students deserve better or even equal educational opportunities
compared to their lower achieving peers. I only assert that we cannot divest higher achiev-
ing students of good opportunities to learn simply in order to increase investment in the
education of lower achieving children. This assertion rests on the idea that if we were to
learn that those students left to their own devices in the example above were the highest
achieving students in the school, we would be no less offended.
III. Testing distributive theories against the right to some
educational resources
I now consider three prominent distributive theories in education and show that each fails
the test under plausible conditions. The theories are expanded fair equality of opportu-
nity, restricted fair equality of opportunity and adequacy.2 I first evaluate the two equal
opportunity principles before considering objections to the test itself. The ordering may
appear awkward at first, since an objection to the evaluation may be immediately evident.
Nevertheless, I think it is worthwhile to first go through the arguments against both equal
opportunity principles before considering rebuttals.
III.A. Expanded fair equality of opportunity
It is common to hear that low achieving students are due compensation in the form of edu-
cational resources so that their opportunities for labor market success, freedom of occupa-
tional choice, and political participation (among other things) may be more equal. Such a
notion rests on three premises, to which most egalitarians would agree:
1 Differences in ability are a legitimate, meaningful and long-standing obstacle in the way of opportunities for labor market success, occupational choice, challenging and meaningful labor, and making contributions to the social good, such as through public office.
2 Differences in opportunities for the benefits described above that are the result of differences in ability are no more the responsibility of the individual than differences in opportunities that result from social origin, race or gender. Each of these barriers to equal chances is, to use Rawls’ phrase, arbitrary from the moral point of view.
3 Social institutions, such as schools, exist and are suitable for changing the ability distribution to make it more equitable. These institutions have at their disposal a pool of resources and can use them in a compensatory way such that students with lower ability receive more, thereby improving their achievement so that they catch up to students with greater starting ability.
2I exclude a principle of priority from the list, as it is self-evident that a priority principle will not allocate resources to any students who are not the least advantaged.
Taken together, these three premises often lead to an “expanded” fair equality of oppor-
tunity principle that includes among its set of protected classes ability, social class origin,
gender and race. Expanded fair equality of opportunity is violated when differences in
opportunities for culture and achievement exist between persons of different ability, social
class, race and gender.3
The idea that opportunities for culture and achievement should be equal for those of
different abilities is sometimes thought to be too extreme. Two arguments against expanded
fair equality are commonly raised. The first is that it is impractical and too costly. This
argument is most strongly expressed when it is grounded on the premise that such costs
can harm the worst off. The idea is that if we compare two educational systems, one that
spends its resources trying to improve the achievement prospects of the worst off (in this
case, the untalented) and another that spends its resources efficiently, we will find that the
efficiency-based system actually improves the quality of life for the worst off group, all
things considered. The mechanism for this is that the overall level of achievement will be
higher in the efficiency-based system and, subsequently, the level of material comfort will
be higher for the worst off as well, since they stand to gain from the advancement of others.
In short, trying to help the worst off exclusively through schooling ends up leaving them
worse off in other dimensions, such as income.
The second argument against expanded fair equality is that making sure there is equality
between different classes will inevitably result in overly restrictive infringements on the
3See Brighouse and Swift, 2015 and Jencks, 1998. Brighouse and Swift refer to this as “the Radical Conception” of equal opportunity; Jencks referred to this as “Strong Humane Justice.” Brighouse and Swift (2015) argue for the radical conception; they write, “what motivates [fair equality of opportunity], which we take to be the concern that people not be disadvantaged in competitions by characteristics for which they are not responsible, condemns unequal achievements due to talent (whether natural or endogenously developed) just as much as it condemns those due directly to social class. People are no more responsible for having the talents or defects they were born with than for the class background into which they were born, and no more responsible for the class-based factors that impact on their development. None of these are reasons to welcome social class influencing unequal achievement. They are not, in other words, reasons for unease about educational equality. ... Rather, they are reasons for resisting the idea that inequalities of talent (natural or developed) should influence educational achievement,” (2015, p 15-16).
rights of parents to interact with their children. Parents have different preferences for their
children, and some will invest more resources and time than others; if equality is to persist, these parental interactions will need to be curtailed. Such a curtailment is a violation
of more fundamental liberties.4
These two arguments do not speak against expanded fair equality per se; rather, they
suggest that expanded fair equality will need to be brought into balance against these other
competing values.5 Note that this balancing may not even be necessary, as the presence
of the conflict between the competing principles is empirically circumstantial. Suppose
that two students are very far apart in terms of their abilities. Suppose further that the
low achieving student stands to gain as much as the high achieving student for every unit
of educational input. In this case, compensation coincides with efficiency, and there is
no leveling down of achievement when resources are given to the low achieving student.
Consequently, the low achieving student is no worse off under the expanded fair equality
regime than she would be under an efficiency-based regime. Likewise, parental partiality
need not be troubled by expanded fair equality. Parents of high achieving students may,
either because of preferences or their own moral reasoning, choose not to allocate their time
and resources to ensure their child maintains his or her positional advantage in achievement.
In such a world, parental partiality is unconflicted with expanded fair equality.
We see, then, that agreement with the first three egalitarian premises is compatible with
parental partiality and an “all things considered” concern for the least advantaged. We can
now evaluate the principle to see if it conflicts with the rights-based principle advocated
here.
Recall the case where it was just as efficient to help the lowest achieving student as it was to help the higher achieving student. If the high achieving student continues to be higher achieving, despite the compensation going towards the low achieving student, expanded fair equality demands that the higher achieving student get nothing. Fulfilling the principle requires, in this circumstance, that high achieving students relinquish
4See Brighouse and Swift, 2009; 2015; Brighouse, Ladd, Loeb and Swift, 2015 for discussion.
5Such a balancing can be achieved in any number of ways. Strict lexical priority that protects parental partiality is one way.
their claim on educational resources to develop their abilities. Here we see that expanded
equality of opportunity, even without the threat to priority, threatens the rights of all stu-
dents to some educational resources.
Alternatively, consider a case in which parental partiality is not threatened. Here, a
parent may say to his child, “You are higher achieving than So-and-So. The school system
is going to give her all the resources so that she can catch up with you. I am an egalitarian
minded parent. So that she may catch up with you quickly, and without bankrupting the
school, I will ensure that you do not get any education outside of school as well.” Here,
the parent voluntarily submits in order to fulfill the requirements of expanded fair equality.
Parental partiality is preserved, expanded fair equality is satisfied, and yet justice has been
undermined. A distributive principle that fails to recognize the basic right of opportunity
to develop one’s abilities is deficient.
III.B. Restricted fair equality of opportunity
By including ability differences in the set of protected obstacles, expanded fair equality
of opportunity will, in many instances, require that all educational resources go to the
lowest achieving students in order to equalize achievement. In such instances, the right
for all students to some educational resources to develop their abilities will be violated.
What about a restricted equal opportunity principle, one that compensates on the basis of
differences resulting from social circumstance alone? Such a principle, which Brighouse
and Swift refer to as the “meritocratic conception,” and others will recognize as Rawlsian
fair equality of opportunity, says that opportunities for achievement should be equal for
those of equal ability.
We can now see whether this restricted version of fair equality conforms to our test.
In the first case, restricted fair equality offers nothing to students whose abilities are low
for reasons not having to do with social circumstance. To see this, consider a two-class
society. In this society, there are two distributions of achievement. Restricted fair equality
is satisfied when the two distributions converge, even though there may be considerable
within class variation. Those students at the very bottom of the distributions have nothing
to gain from restricted fair equality.
It is worth noting that restricted fair equality is implicated both by the test proposed here
and the three egalitarian premises described above—that talent differences are meaningful,
arbitrary and compensable. Rawls’ fair equality principle has been criticized for this very
deficiency.6 The test proposed here–“[i]f the principle identifies some subgroups based on
their characteristics and uses those characteristics to deny those students any resources to
develop their abilities, then we can conclude that the principle is inadequate”–
is echoed by Arneson (1999), who asks, “[w]hy is it morally acceptable to single out the
untalented and herd this group into the bottom rung on the social hierarchy? Why is this
not invidious discrimination, invisible to us because it chimes in with the ethos of mod-
ern democratic societies?” (Arneson, 1999, p 10). Restricted fair equality of opportunity
discriminates on the basis of talent and for this reason it is deficient.
Whereas in the first case we saw that low achieving students’ claims on educational
resources were disregarded, in the second case we will see that high achieving students
also stand to lose. The necessary conditions for high achieving students to lose are only
that social circumstances are prohibitively difficult to remedy through schooling. If, in
order to bring the two distributions together, all resources must be spent in order to raise
the achievement of the lower distribution, the entire non-protected class of students will
lose their claim on educational resources in order for restricted equality of opportunity to
be satisfied. Considering this last case raises a more fundamental criticism against both
expanded and restricted versions of fair equality.
The principles differ in who is included among the protected classes. Yet, however the
protected classes are defined, the purpose of the principles in both cases is to bring the levels
of achievement of the protected class up to the levels of achievement of the non-protected
class.7 This, by definition, excludes the non-protected classes from the distributive prin-
ciple. Indeed, consider that restricted fair equality of opportunity is most easily satisfied
6See Arneson, 1999; Pogge, 2004; Clayton, 2004.
7This is assuming that the non-protected classes are not brought down, which is an alternative and not relevant criticism against equality generally.
if the students of high income families are removed from the educational system entirely.
Likewise, expanded fair equality is most easily satisfied when the talented are excluded. It
is a strange feature of egalitarian theory that a great deal of specification is required
in order to lay out clearly just which practices of motivated parents, high income families
or the talented are permissible and which are not.8 We can agree that certain educational
and parental practices (such as private schooling, or test preparation services) are unfair, or
that the very talented should receive fewer resources than the less talented, without relying
on equality-based theories that fail to establish baseline conditions: children of the rich,
children who are intelligent, and all other children, have a right to learn in school. Nothing
about equality establishes that baseline, and for that reason it is deficient.
III.C. Objections to the test
We are now prepared to consider two objections to the test. The first objection is that the
test holds equality principles accountable to standards equality principles are not intended
to satisfy. Equality principles are intended to describe patterns that are fair. Fundamentally,
equality principles indicate whether or not groups or individuals have too much or too
little of the distributive object; they are not principles of efficiency.9 For these reasons, the
principle of “all students, no matter who they are, have claim to at least some educational
resources to develop their abilities,” should be understood as an addendum to these or other
distributive principles. Yes, we care about equality, the objection goes, and we care about
providing everyone chances as well.10 The proposed test does not reveal a deficiency in
equality; at best, it merely suggests that we should include rights for all students as one of
the many competing goals a just society pursues alongside equality.
Whereas the first objection stated that the rights-based distributive principle should be interpreted as an addendum to equality principles, the second objection states that the rights-
8See Brighouse and Swift (2009) for discussion of which parental practices are legitimate and which are not, and Brighouse and Swift (2006; 2015) for discussion of whether private schooling is legitimate.
9This argument is sometimes used against the leveling down objection. Equality is one concern; levels of the distributive object are another.
10Aggregating values is the approach suggested by Brighouse and Swift, 2008; 2015 and Brighouse, Ladd, Loeb and Swift, 2015.
based claim is no weightier than any of the many other competing values to equality, such
as priority and parental partiality. Let us recall the comparison between parental partiality
and the right for all students to develop their abilities in school. Regarding parental par-
tiality, we saw that satisfying equality could, in some cases, require restrictions on parental
inputs. In other cases, it will not, depending on the state of parental preferences at the time.
In this way, the principle of rights is strictly parallel to parental partiality. In some cases,
satisfying equality could require that high ability students relinquish their claim on edu-
cational resources. In other cases, it will not, depending on how efficient it is to raise the
achievement of low ability students and on how far apart low ability students are relative to
high ability students. In such a case, equality will be satisfied (eventually) and rights will
be secured. Faulting the principle of equality because it, in some cases, conflicts with other values makes the same mistake made by those who have faulted equality because
it conflicts with parental partiality. It is not a deficiency of a principle when, in certain
cases, fulfilling the requirements of the principle conflicts with other desired ends.
Let me respond in three ways. First, I note that the objections do not defeat the argument
I make here. The test succeeds, even minimally, if it reveals that there are other values at
stake when we try to implement equality. Any policy will have to grapple with the broad
suite of values implicated thereby, including rights for all students. This is a non-trivial
point insofar as it has not been recognized in the literature.
The second response is that we should note that the rights-based principle I have pro-
posed is a distributive principle in the metric of educational opportunities. For this reason,
equality of opportunity for achievement and the right to resources for achievement fall un-
der the same class of principles. This is in contrast to the values of parental partiality and
the “all things considered” level of welfare for the least advantaged. In that sense, a cohe-
sive principle for the distribution of educational opportunities ought to include both facets
of the good in question. Equality is deemed deficient because it neglects this other element
of the distributive problem. This is what separates a rights-based account from parental par-
tiality and priority and shows that the test is necessary: the test reveals limitations internal
to the distribution of educational opportunities.
Third and finally, we can see whether the objections to the test could be applied to al-
ternative metrics of benefit, such as income or welfare. If the objections apply equally to
all metrics of benefit, then we can conclude that, at best, I have presented another value, at
times in conflict with equality. Alternatively, if we find that the test does not apply to other
metrics of benefit, then we can conclude that I have offered more than a mere addendum
to the already long list of competing values. Instead, it suggests that the distribution of
achievement requires a different distributive paradigm, as the class of distributive princi-
ples used for metrics of benefit like income and wealth cannot be easily applied to other
domains.
Let us consider income. We need to establish three similarities: the moral intuition,
the equal opportunity expression, and the right. We can then test whether the right is
problematically in conflict with the equal opportunity expression.
1 Egalitarian moral reasoning can be applied to income in the same way it was applied to achievement: income is important (it is an all purpose good necessary for the realization of one’s ends), arbitrarily assigned (to the extent that one’s abilities and social circumstances are necessary for earnings and arbitrary from a moral point of view), and compensable (earnings can be taxed and transferred).
2 The analogous equal opportunity expression reads: opportunities for income should be equal for those who put in similar effort. This suggests an equal pay for equal effort system, or one for which a tax and transfer system equalizes income.
3 An analogous income-based right would be: all persons have some claim on resources in order to develop their incomes. Such a right seems odd, but we can give it one of two plausible interpretations. (The difficulty in articulating a right to develop one’s income should already give us pause, as no such difficulty is present in the right to develop one’s abilities. It suggests a problem before the comparison even gets off the ground. We tend to think of welfare and income as cross-sectional goods, meaning that what matters about them is how much one has of the good, and not how much one is able to develop the good.)
3a The first might include something like Elster’s (1986) Marxian notion of self-realization through labor. We could say that an analogous right would be that all persons have some claim on resources through labor (meaning they are guaranteed a wage). This interpretation speaks to the value of meaningful labor.
3b The second might include something like Krouse and McPherson’s (1986) notion of a property-owning democracy. We could say that an alternative right would be that all persons have some claim on capital in order to develop their incomes. This second interpretation speaks to the autonomy one is afforded through ownership of capital.
We can now see whether satisfying [2] the equal opportunity expression will, in certain
instances, require the violation of [3] individual rights to [3a] earn an income through work
or [3b] earn an income through capital. The concern raised against equality above was that
equality offers nothing to members in non-protected classes. In the case of income, the
principle says that incomes will be equal for those who put in equal effort. Here, we can
pick any income distribution that conforms to the principle to make the test. For example,
in a world where the distribution of effort is skewed, incomes will be skewed; in a world of
equal effort, incomes will be equal.
The question is whether any of those income distributions will, in any case, necessarily
result in a loss of opportunity to develop income through work or earn an income through
capital. The most obvious case for which equality denies everyone an income is made
when we invoke the leveling down objection. Equality, say, is so disincentivizing that
nobody puts in effort, production halts, and incomes for everyone converge to zero. In such
a setting, equality necessarily conflicts with either of the two versions of the right. This,
however, is an extreme case, and leveling down was not invoked in the previous examples.
Let us see if other non-leveling down versions can lead to conflict.
A second possibility emerges when we realize that equal income for equal effort leaves
nothing for those who put in no effort. This, too, is an extreme case and leaves open the
possibility that effort should be included as one of the protected classes. If all obstacles
are included in the set of protected classes, equal opportunity converges to equal outcome.
When we consider perfect equality of income, the distinction between the two metrics
of benefit comes into sharp relief. Equal income does not leave anyone without income,
insofar as there is divisible income available. Individuals will then be free to do with that
income what they like, such as find work or purchase capital. Certainly, the opportunities
to find meaningful work or purchase capital may be violated in other ways. The society
may not, perhaps, guarantee an income through all forms of labor. The violation of the
right in that case, however, is not a necessary consequence of equality.
The analysis above reveals the fundamental difference between income and achievement.
When we consider achievement, we recognize that individuals have initial endowments that
cannot be taxed; we do not take one person’s ability points and give them to another.11 It
is the assignment of initial endowments that bestows upon individuals the right to develop
those abilities.12 The difference between the two metrics, and the corresponding right that is
attached to one metric and not the other, is what explains why equal opportunity paradigms
run aground when they are applied to achievement. The proposed test is therefore use-
ful because it reveals this limitation in traditional distributive theories that are applied to
achievement.
III.D. Equality of opportunity for what?
I would like to take a moment here to develop some of the ideas in the previous discussion
and see if those can be used to offer some clarification about a long-standing disagreement among equal opportunity theorists. The disagreement is about whether the set of
protected obstacles included in the equal opportunity expression should be expanded to
include ability differences. Arneson (1999) argues that by failing to protect against ability
differences, Rawlsian fair equality suffers from “meritocratic bias,” (p 86).13 Brighouse
11Admittedly, the inability to tax and transfer ability is both a structural and an ethical problem. It is structural because we lack the technology to do so; it is ethical because it is not clear we would choose to tax abilities even if we could. Indeed, the right to develop one’s abilities suggests we would not take from one and give to another.
12There is a puzzling passage in Rawls that seems to capture this idea. He writes “the difference principle represents an agreement to regard the distribution of native endowments as a common asset and to share in the benefits of this distribution whatever it turns out to be. It is not said that this distribution is a common asset: to say that would presuppose a (normative) principle of ownership that is not available in the fundamental ideas from which we begin the exposition.... Note that what is regarded as a common asset is the distribution of native endowments and not our native endowments per se,” (JF, p 75). It is not clear what exactly Rawls means by this phrase. How is it that the distribution of native assets is to be regarded as a common asset but not actually incorporated as a common asset? By attaching the fundamental guarantee of individual liberties to the ownership of one’s own native assets (thereby giving ownership of one’s assets lexical priority over the distributive principles), Rawls’ view potentially aligns with the right to develop one’s abilities that I have argued for here. Though this is only speculation.
13The full quote is “Fairness to talent trumps fairness to the worst off in Rawls’s system. That no talented and ambitious person should be worse off in prospects than any person who is less talented and ambitious
and Swift give credence to this notion when they describe the view as the “meritocratic
conception,” (2015).
It is important to distinguish the ultimate aims of the equal opportunity expression. They
are at times treated as substitutes and this can lead to confusion. The first version, which I
have used throughout, says that opportunities for culture and achievement should be equal
for a set of protected classes, where the set can be defined to include social class, gender,
race and ability. This version is the one that is often used in educational settings. A second
version says that opportunities for careers and political positions should be equal for a set
of classes.14
The two versions are related but distinct. Equal opportunities for careers and political
positions includes both non-discrimination and educational compensation aspects. Non-
discrimination is usually applied to the labor market and prohibits discrimination on the
basis of race and gender. The compensatory aspect comes into play when we recognize
that one’s social origin can affect one’s developed abilities. When we say that opportuni-
ties for careers and political office should be equal for those from different social origins,
we recognize that individuals from disadvantaged social backgrounds will need compen-
satory educational training if they are to compete on equal footing against those from more
advantaged backgrounds. Equal achievement for those from different social origins is the
instrument through which equal opportunities in the labor market are made possible.
We should note that, while educational compensation is necessary to satisfy the intent
of fair equality of opportunity for careers, it is not strictly necessary for equal chances. A
non-discrimination principle on the basis of social background would still allow for equal
chances, if, for example, individuals were assigned to jobs according to a lottery. Assuming
social background affects developed abilities, the consequence would be that those from
disadvantaged backgrounds would be ill-prepared for the jobs to which they applied. Such
a lottery might technically fulfill the requirements of fair equality of opportunity but would
takes lexical priority over egalitarian (Prioritarian) norms. To my mind this is wrong and reveals a meritocratic bias,” (Arneson, 1999, p 86).
14Rawls, in TJ, for example, describes fair equality as being both for culture and achievement as well as careers. See pages 63-64 and 91-92 (1999) for both descriptions.
certainly fail to satisfy its intent. Whatever benefit there is to be derived from fair equality
requires that individuals be able to perform the job to which they are assigned, and this
means that educational compensation is a necessary component of the principle.
This is why when Arneson says fair equality of opportunity suffers from meritocratic
bias, or when Brighouse and Swift call it the meritocratic conception, they have to specify
the equal opportunity outcome to which they refer. Discrimination on the basis of ability in
the labor market is necessary for the benefit of the principle of fair equality to be realized.
Matching skills to positions is necessary for one to realize the benefit of performance within
that position, making discrimination on the basis of ability in the labor market acceptable.15
However, when we talk about equal opportunity for achievement, the case is different, as
schools are responsible for (or at least capable of) affecting the distribution of ability. This
is categorically distinct from the labor market. For the three egalitarian reasons mentioned
above—talent differences are important, arbitrary and compensable—discriminating on the
basis of ability with respect to the distribution of school resources is fundamentally differ-
ent and difficult to justify. Calling it “meritocratic”, however, confuses the issue, as no
proponent of Rawlsian fair equality, to my knowledge, has argued that abilities are earned,
especially in the case of children.
Because achievement is instrumental for careers, it should be clear that satisfying ex-
panded fair equality of opportunity with respect to culture and achievement will converge
with expanded fair equality of opportunity with respect to careers. Once opportunities for
15Having said this, we might then conclude that equal opportunity for culture and achievement is simply a roundabout way of articulating the two discrete purposes of the broader fair equality principle. What is really meant is that fair equality for careers requires both (a) non-discrimination on the basis of the protected classes (social origin, race and gender) and (b) educational compensation in cases for which the protected class has lower achievement as a result of membership in that protected class, (b.1.) in cases where careers and positions of political office have certain achievement-based requirements.
This rendering is not entirely satisfactory, however, as it treats culture and achievement as mere vehicles for economic and political advancement. We can grant that culture and achievement are important in this instrumental way and still maintain that culture and achievement are valued goods in their own right. For example, one’s abilities allow one to engage in political affairs more effectively and provide access to a wider range of professional and recreational activities. For this reason, treating equal opportunity for culture and achievement as sub-components of a fair equality principle for careers and political office is too restrictive. Nevertheless, we must also recognize that equal opportunity for culture and achievement is not a sufficiently robust description of equal opportunity, as it does not include a prohibition against labor market discrimination.
culture and achievement are equal for all persons, irrespective of ability, socioeconomic
origin, gender and race, then careers will likewise be equal, as differences in ability will
be erased. In this way, Arneson’s objection to Rawlsian fair equality still has force: dis-
crimination on the basis of ability for culture and achievement leads to discrimination for
careers. The question for proponents of Rawlsian fair equality is what justifies this type of
discrimination.
Here I must speculate somewhat. I suggest that proponents of Rawlsian fair equality have
argued for discrimination on the basis of ability in school settings for one of two reasons.
The first is that they may have conflated opportunities for culture and achievement with op-
portunities for careers, thereby conflating a true statement—ability matching is necessary
for the realization of benefit in the labor market—with a false statement—allocating edu-
cational resources to students on the basis of ability is not problematically discriminatory.
The second is that they may be worried that realizing expanded fair equality of opportunity
for achievement will take all educational opportunities away from students who are high
achieving, which would constitute another kind of unfairness, the revocation of student
claims on educational resources to develop their abilities.16
In both cases, however, restricted (or Rawlsian) fair equality of opportunity for achieve-
ment is found deficient. It unjustifiably discriminates on the basis of ability in school
settings, by only offering resources to students whose abilities are low due to social circum-
stance. Conversely, it offers no claim on educational resources to students whose abilities
are high and are members of one of the non-protected classes.
16It is difficult to ascertain what proponents of Rawlsian fair equality believe with respect to Arneson’s criticism because they have never addressed it. This is outside the scope of the paper, but Satz (2012; 2015), Shiffrin (2004) and Taylor (2004) have responded to only the first objection to Rawlsian fair equality–that it unjustifiably gives more weight to fair equality than it does to the prioritarian distribution of income and wealth. Satz, Shiffrin and Taylor have each argued, broadly, that equal opportunity principles provide independent benefit, separable from shares of income and wealth, which are allocated according to the difference principle. Note that critics can concede this point and still ask why Rawlsian fair equality is fair if it discriminates on the basis of ability, with respect to the distribution of educational resources. No proponents of Rawlsian fair equality have defended the specification of the principle vis-à-vis an expanded fair equality principle.
III.E. Adequacy
It might be thought that equality principles are soft targets for a rights-based critique, as
equality principles are only concerned with relative differences. Of course equality disre-
gards rights, the argument goes, as equality is a categorically distinct principle from rights.
As I have already argued, the categorical distinction is overstated, since both rights and
equality claim to be distributive principles in the metric of educational achievement. Nev-
ertheless, equality principles are not the only distributive patterns available. Adequacy
principles have been offered as alternatives to egalitarianism, and they may be immune to
the test proposed here, as relative differences, while not completely outside the scope of
adequacy principles, are de-emphasized.
Satz (2006) and Anderson (2006) offer the two most well known versions of these princi-
ples. The adequacy principles offered by Satz and Anderson are not strictly sufficientarian.
They argue that democratic equality, which, in broad terms, is the requirement that all
citizens be able to interact and be treated as equals in democratic society, only requires
an adequate level of achievement. This complicates the notion somewhat, as the level of
achievement will depend on what democratic equality requires. Suppose for the sake of
simplicity, the level is set at whatever degree of numeracy, literacy and general knowledge
(social scientific, historical and scientific) is required to participate intelligently in political
debate. Adequacy principles require that everyone below the threshold be brought up to the
level, while levels of ability above the threshold are not considered problematic (insofar as
democratic equality is not violated).
While adequacy demands all students be brought up to some threshold level, it says
nothing about students whose abilities are above the threshold. This is seen as a virtue of
the principle. Satz writes:
It is certainly true that if educational resources were improved for poor children, then they could compete for higher education and jobs on fairer terms. But even so, no society has the resources to supply the same opportunities to poor families as are possible for those with more wealth who value the continued development of their children’s talents. As one child’s potentials expand more than another’s, this principle will continually justify devoting more resources to bring the now disadvantaged child up to the levels of her wealthier peers. Yet no society can devote all of its resources to education, and so at some point a line must be drawn as to how much the state is willing to spend. Authorized democratic decision-making bodies will draw lines that reflect the relative value they assign to education as opposed to other social goods. (Satz, 2007, p 632)
The virtue of adequacy principles, as expressed here, is that the level of compensation
is not indexed to an advantaged class’s level of achievement. While Satz’s objection to
equality resembles my own, there is an important distinction. Satz is concerned that public
efforts to compete with parental investments will result in a society re-routing all of its re-
sources to the educational system. Note that this argument is perfectly compatible with the
notion that equality must compete against other values. If educational equality requires all
resources to be invested in the educational system, then that will naturally come into con-
flict with other societal goals, such as health care expenditures. Nothing about balancing
the distribution of health with the distribution of education speaks against equal education
per se. Satz’s objection overlooks the more fundamental point: the problem is not that
equality robs from health to pay for education; the problem is that equality denies students
their rightful claim to educational opportunities.
The question now facing us is whether adequacy also encounters this problem. On the
one hand, it is straightforward to see that a principle of sufficiency, by requiring all students
be brought up to a certain threshold, could easily result in the kind of problem that threatens
equality. If some students are prohibitively expensive to bring to the required threshold,
the educational system will necessarily spend all its resources on those students, leaving
students who are above the threshold with nothing and thus denying all students’ right to
some claim on resources to develop their abilities. Adequacy would appear to fail the test.
Defenders of adequacy could argue that the threshold could be benchmarked to ensure
that it is never so costly that bringing all students to the requisite level would bankrupt
the system. Suppose social planners were considering what threshold level was necessary
for democratic equality and they converged on some level of numeracy, literacy and social
scientific knowledge. Cost evaluations are conducted and it is determined that getting all
students to this level will use all available resources, making it so that students above the
level have nothing. Social planners then revise the standard downward, to such a degree
that there are enough residual resources for students above the threshold to at least have
some. Thus, the adequacy threshold appears to pass the test.
The difficulty for proponents of adequacy is that there is nothing in the principle itself
that suggests the cost evaluations should be conducted in the first place. The principle
only requires that individuals be brought up to some threshold level, and that the level is
commensurate with democratic equality. If all educational resources are required to bring
students to the level, there is nothing in the sufficiency principle with which to protest this result.
One of the supplied arguments in favor of adequacy is that it does not index one’s level
of achievement to the level of some comparison group (high achieving poor to high achiev-
ing non-poor, for example), demanding equality between the classes. To use Anderson’s
phrase “[i]t therefore does not require criteria for equality of resources that depend on the
morally dubious idea that the distribution of resources should be sensitive to considerations
of envy,” (Anderson, 1999, p 321). Those above the relevant threshold are disregarded by
the theory, in the sense that their advantage is not problematic; adequacy is envy-free in
this way. A rights-based account faults adequacy theories for this very disregard. Ade-
quacy may be envy-free, but it comes at a high price: it excludes those above the threshold
from active membership in the school community. Insofar as adequacy does not come at-
tached with a corresponding right for all students to develop their abilities, it fails its own
test of democratic equality, by excluding students from educational membership based on
a self-imposed sufficiency standard.
IV. What distribution of resources is entailed by the right?
Up to this point I have argued that the distribution of opportunities for achievement must
include, at a minimum, some provision that guarantees all students at least some resources
for the development of their abilities. “Some resources” was left intentionally vague, but
now I wish to see if it can be better specified. Unfortunately, the specification of what the
right entails has a kind of Goldilocks problem. We can identify a specification that is too
weak, one that is too strong, but the one that is just right will not be known. Let us evaluate
them in turn.
IV.A. A right that is too weak
The phrase “some resources” can be defined to include only a token amount of resources.
Suppose the right is specified to mean, “all students are entitled to at least the smallest divis-
ible unit of resources available to develop their abilities.” This would leave some students
with, perhaps, an hour of the teacher’s time in the course of a year. Such a specification
reflects a commitment to the guarantee in name only. In some ways, such a minimal pro-
vision is worse than nothing at all, as it signals to students their outsider status. Tokenism
violates the purpose of the right and contradicts democratic equality.
IV.B. A right that is too strong
Conversely, “some resources” can be defined to mean equal resources for all students, so
that the right reads “all students are entitled to equal resources in order to develop their
abilities.” Equal resources for all students is to be resisted for the fundamental reasons
that some students are lower achieving and more expensive to educate than others. Equal
resources for all students denies this reality and treats all students as the same. This fails to
respect their uniqueness.
Moreover, giving the same to all will make fair equality of opportunity impossible. From
the previous discussion, we know that an equal opportunity principle requires four assump-
tions:
1 An equal opportunity principle has benefits, distinct from shares of income and wealth.
2 Whatever benefit there is from an equal opportunity principle requires that individuals have the requisite skills to perform the tasks required by the position.
3 A fair chance in the labor market therefore requires efforts to improve the skills of those whose abilities are lower.
4 Improving the skills of those whose abilities are lower requires giving them additional resources.
If there is agreement about [1] through [3], then equal resources for all students conflicts
with [4], compensatory spending. In that case, this conception of equal rights fails the test
we used to fault equality: realization of the principle violates other distributive principles
in the same metric of opportunities for educational achievement.
Similarly, a right that reads “all students are entitled to enough resources so that opportu-
nities to develop individual abilities are equal” must be rejected as well. This principle is an
improvement over the former, as it does not assume all students are the same, but it fails to
provide fair or reasonable chances for labor market success for students coming from dif-
ferent social backgrounds and ability levels. By neglecting that individuals have different
starting places, inequalities in life outcomes and political participation will be replicated
across generations. Schools are an important agent for remedying those inequalities.
IV.C. What right is ‘just right’?
A specification that is too weak provides only token resources to students; a specification
that is too strong fails to recognize individual differences and the importance of redress-
ing unfairnesses in labor market competitions and opportunities for political participation.
The question facing us now is what specification of the right to some resources will bal-
ance the need for meaningful opportunities to develop individual abilities in conjunction
with a concern for fairness and individual differences. The challenge is striking a balance
between securing enough resources so that the right is meaningful and recognizing that
compensatory spending is necessary for opportunities in the labor market to be fair. This
requires that we answer two questions: how much is enough, and what constitutes a fair
competition?
It may be that a strictly fair competition, understood as equal chances, is not compatible
with the rights-based claim, when the costs of providing equality are too high. When fair
competitions require equality, and equality negates claims to resources, something must
give.
Alternatively, adequacy theorists discount the importance of fair competitions (insofar as
fairness requires equal chances) and argue that what matters is chances reasonable enough
to satisfy democratic equality. Reasonable chances can be characterized by a sufficient
level of achievement. We can incorporate two sufficiency standards without difficulty:
Two-part distributive principle: All students should receive enough resources
for opportunities to develop their abilities; all students should receive enough
resources so that their achievement levels are adequate for reasonable chances
in the labor market. Satisfaction of the first part is lexically prior to the second.
The virtue of this two-pronged approach is that it is internally consistent and guarantees
the claim to resources for all students. It fails to the extent that any normative weight is
given to equality of opportunity. Insofar as we are motivated to reject the strong version
of the right based on the egalitarian reasoning described above, we will be led to demand
a stronger principle than adequacy provides. A final arbitration between equality and ad-
equacy proponents is beyond the scope of this paper, however. Ultimately, we know the
size of the claim to resources is greater than a token amount and smaller than full equality.
Specifying the claim more than that is saved for later.
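Read purely as an allocation rule, the lexical structure of the two-part principle can be sketched in code. Everything in the sketch below (the budget, the per-student minimum, the adequacy threshold, the cost figures) is a hypothetical illustration of the rule's form, not a proposal for actual levels.

```python
def allocate(budget, scores, minimum, threshold, cost_per_point):
    """Two-part lexical allocation: (1) guarantee every student a minimum
    claim on resources; (2) spend what remains bringing the lowest
    achievers toward the adequacy threshold. Part 1 is lexically prior."""
    alloc = {s: minimum for s in scores}           # part 1: the guarantee
    remaining = budget - minimum * len(scores)
    assert remaining >= 0, "the guarantee itself must be affordable"
    for student, score in sorted(scores.items(), key=lambda kv: kv[1]):
        need = max(threshold - score, 0) * cost_per_point
        spend = min(need, remaining)               # part 2: compensation
        alloc[student] += spend
        remaining -= spend
    return alloc

# Hypothetical scale scores, budget, and costs:
scores = {"A": 210, "B": 250, "C": 300}
result = allocate(budget=1000, scores=scores, minimum=100,
                  threshold=260, cost_per_point=5)
print(result)  # every student keeps at least the minimum claim
```

Note how the sketch makes the lexical ordering visible: the guarantee is deducted from the budget before any compensatory spending occurs, so no amount of compensation can reduce a student's allocation below the minimum.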
V. Conclusion
In this essay I have identified a neglected value in educational distributive justice, the guar-
antee for all students to have some resources to develop their abilities. Moreover, I show
that three prominent distributive theories problematically conflict with this goal in many
plausible scenarios. Finally, in contrast to the myriad values that may or may not conflict
with equality, I have argued that the right to educational resources is fundamental to any dis-
tributive principle that takes achievement as its distributive object. A complete distributive
theory must strike a balance between the right to resources, on the one hand, and compen-
sating students who are lower achieving, on the other. Only when we have this balanced
distributive theory can we determine the extent to which other competing values, such as
parental partiality and “all things considered” priority, are threatened by the principle.
Chapter 3
Welfare adjusted scale score: Method
toward the development of an
equal-interval welfare scale
Abstract It is becoming increasingly common to question the equal-interval assumptions of most academic scale scores. Even if interval assumptions hold, it is problematic to assume equal-interval distances with respect to benefits. For example, equivalent scale score increases at the top and bottom of the distribution are unlikely to yield equivalent welfare gains. The best available research to date approaches this problem by employing ad hoc statistical procedures to rescale extant scale score distributions by anchoring them to other, ostensibly interval, scales (such as income). I propose an alternative strategy that estimates the welfare returns to academic achievement directly. This approach makes use of well-established methodologies in health economics to estimate quality-adjusted life years (QALY). Using data from the performance level descriptors of the National Assessment of Educational Progress Long-Term Trends (NAEP-LTT) and a random utility model (RUM), I construct a welfare adjusted equal-interval scale. With this scale, I show that welfare returns to achievement are non-linear, convex and lead to different inferences regarding achievement gap trends.
I. Introduction
The use of academic scale scores in education production functions is commonplace. When
a scale score is used as a dependent variable it connotes value or expected benefit. For in-
stance, holding costs constant, a program that raises test scores 20 points is more effective
than a program that raises test scores 10 points. This is the logic of cost-effectiveness analysis
(see Levin and Belfield, 2014 for review), which is used for policy evaluation and decisions
about resource allocation. In order for this conclusion to hold, researchers and policymak-
ers must assume that a scale score is equal-interval scaled with respect to benefit. That
is, for example, they must assume that a 10-point gain at the bottom of the scale score is
equivalent to a 10-point gain at the top of the scale score. Such an assumption is rarely
tested, and there are no strong priors indicating that such a relationship exists.
As a stylized example, consider Figure I. Along the x-axis I have mapped a traditional
scale score from a standardized test, and along the y-axis I have mapped a welfare-adjusted
scale score that connotes how much utility is attributable to any point along the original
scale score. When a traditional scale score is used as the dependent variable in an educa-
tion production function, it is assumed that X equals Y for any point along the distribution;
that is, for every increase along the ability distribution there is, by assumption, an equidis-
tant gain in welfare. Such a relationship is indicated by the dashed black line labeled
“Achievement.” Conversely, a plausible relationship between ability and welfare can be
described by the dashed gray line labeled “Utility.” 1 As is evident, when the relationship
is concave, utility increases faster at the bottom of the original scale score than at the top.
Thus, any cost-effectiveness analysis that uses the un-adjusted scale score as the dependent
variable will understate gains at the bottom and overstate gains at the top. A scale score
that accurately connotes benefit would therefore be useful.
[Insert Figure I Here]
In this paper I describe and implement a method for constructing a scale score that is
equal-interval with respect to welfare (the y-axis in Figure I). The method I propose esti-
1The welfare adjusted scale score is simply the cumulative beta distribution, with shape parameters α and β equal to 2 and 5, respectively, and X equal to the original scale score divided by 350.
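As a concrete check on the stylized curve, the footnote's transformation can be computed directly. The sketch below is purely illustrative: it uses the closed binomial form of the Beta(2, 5) CDF (valid for integer shape parameters) rather than a statistics library, and the comparison points (a 10-point gain starting at 50 versus at 300) are arbitrary choices.

```python
import math

ALPHA, BETA_SHAPE, MAX_SCORE = 2, 5, 350.0

def welfare_adjusted(score: float) -> float:
    """Map an original scale score (0-350) onto the stylized welfare scale:
    the Beta(2, 5) CDF evaluated at score / 350, as in the footnote.
    For integer shape parameters a and b, the CDF equals
    sum_{j=a}^{a+b-1} C(a+b-1, j) x^j (1-x)^(a+b-1-j)."""
    x = score / MAX_SCORE
    n = ALPHA + BETA_SHAPE - 1  # = 6
    return sum(math.comb(n, j) * x**j * (1 - x) ** (n - j)
               for j in range(ALPHA, n + 1))

# Equal 10-point gains on the original scale translate into unequal
# gains on the welfare scale:
gain_low = welfare_adjusted(60) - welfare_adjusted(50)
gain_high = welfare_adjusted(310) - welfare_adjusted(300)
print(f"10-point gain at the bottom: {gain_low:.4f}")
print(f"10-point gain at the top:    {gain_high:.4f}")
```

Running this shows the low-end gain is many times larger than the high-end gain, which is exactly the mismatch a cost-effectiveness analysis on the unadjusted scale would miss.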
mates utility for a set of 10 “achievement states”, where an achievement state corresponds
to a performance level descriptor for reading and math taken from the National Assessment
of Educational Progress Long-Term Trend (NAEP LTT), and utility corresponds to how
much better, all things considered, survey respondents believe a person’s life will be for
a given achievement state.2 The NAEP uses a scale anchoring process that links a scale
score value (the x-axis of Figure I) to a performance level descriptor. With the estimated
utility values, I have a data set with three variables and 10 observations: a vector of per-
formance level descriptors, the corresponding scale scores, and the estimated utilities. I
use piece-wise monotone cubic interpolation (MCI) to link each individual NAEP score to
a utility value, represented by the gray dashed “Utility” line in Figure I.3 We now have a
scale score that is equal-interval with respect to welfare, as long as the equal-interval as-
sumptions (with respect to ability) of the original scale score hold and the performance
level descriptors are appropriately mapped to scale score values. As a demonstration of the
usefulness of a welfare scale, I take a repeated cross-sectional panel of student test scores
from the NAEP-LTT and re-scale them according to the method outlined above. I show
that inferences about changes in achievement and achievement gaps over time and age are
sensitive to the choice of scale.
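The interpolation step can be sketched in a few lines. The code below is an illustrative pure-Python implementation of the monotone cubic (Fritsch and Carlson, 1980) scheme; the five anchor scores stand in for the NAEP-LTT performance levels, and the utility values attached to them are hypothetical placeholders, not the survey estimates used in the paper.

```python
import bisect
import math

def fritsch_carlson_tangents(xs, ys):
    """Tangents for a monotone cubic interpolant (Fritsch and Carlson, 1980)."""
    n = len(xs)
    d = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(n - 1)]
    m = [d[0]] + [(d[i - 1] + d[i]) / 2 if d[i - 1] * d[i] > 0 else 0.0
                  for i in range(1, n - 1)] + [d[-1]]
    for i in range(n - 1):  # limit tangents so monotonicity is preserved
        if d[i] == 0.0:
            m[i] = m[i + 1] = 0.0
        else:
            a, b = m[i] / d[i], m[i + 1] / d[i]
            s = a * a + b * b
            if s > 9.0:
                t = 3.0 / math.sqrt(s)
                m[i], m[i + 1] = t * a * d[i], t * b * d[i]
    return m

def interpolate(xs, ys, m, x):
    """Evaluate the cubic Hermite interpolant at x (xs must be sorted)."""
    i = max(0, min(bisect.bisect_right(xs, x) - 1, len(xs) - 2))
    h = xs[i + 1] - xs[i]
    t = (x - xs[i]) / h
    return ((1 + 2 * t) * (1 - t) ** 2 * ys[i] + t * (1 - t) ** 2 * h * m[i]
            + t * t * (3 - 2 * t) * ys[i + 1] + t * t * (t - 1) * h * m[i + 1])

# Anchor points: five scale-score anchors with hypothetical utilities.
scores = [150.0, 200.0, 250.0, 300.0, 350.0]
utils = [0.10, 0.35, 0.62, 0.80, 0.90]
tangents = fritsch_carlson_tangents(scores, utils)

# Any individual score between the anchors now maps to a utility value:
print(round(interpolate(scores, utils, tangents, 225.0), 3))
```

The limiter step is what distinguishes this from ordinary cubic splines: because the anchor utilities are increasing, the interpolated curve is guaranteed not to dip or overshoot between anchors, so the re-scaled scores preserve the ordinal ranking of the original scale.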
The method that I described above is similar in concept to methods commonly used in
health care research. In health economics, effect sizes are, in many cases, given in the
metric of a Quality Adjusted Life Year (QALY) (see Drummond, 2005 and Whitehead and
Shehzad, 2010 for review), where the QALY metric is used to make comparisons between
different ‘health states’ (where health states are the analogue to achievement states taken
from performance level descriptors) in health care production functions for purposes of
cost-effectiveness analysis. As an example, consider a medical intervention that improves
mobility by 2-units and another that reduces pain by 3-units. Holding costs constant, re-
searchers, insurance companies and policy makers are interested in determining which of
2Performance level descriptors for reading can be found online here: https://nces.ed.gov/nationsreportcard/ltt/reading-descriptions.aspx; math here: https://nces.ed.gov/nationsreportcard/ltt/math-descriptions.aspx. Five performance level descriptors are available for both reading and math, for a total of 10 descriptors. See Data section for details.
3MCI is implemented according to Fritsch and Carlson (1980).
the two interventions should be pursued. The QALY-metric puts discrete health outcomes
on a common utility scale, making comparisons possible. In addition to being used for
making between health state comparisons (e.g., mobility against pain), QALY-scales can
be used for making within health state comparisons (e.g., completely immobile against able
to walk without assistance).4
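The QALY logic can be made concrete with a toy calculation. The utility weights and durations below are hypothetical, chosen only to illustrate how discrete states land on a common scale; real weights come from valuation studies such as those cited in the footnote.

```python
# Hypothetical utility weights on a 0-1 scale for discrete health states
# (illustrative only; not published EQ-5D values).
UTILITY = {
    "immobile": 0.50, "walks_with_aid": 0.75, "walks_unaided": 0.95,
    "severe_pain": 0.55, "moderate_pain": 0.80, "no_pain": 1.00,
}

def qaly_gain(before: str, after: str, years: float) -> float:
    """QALYs gained by moving between health states for a fixed duration:
    (utility difference) x (years spent in the state)."""
    return (UTILITY[after] - UTILITY[before]) * years

# A mobility intervention and a pain intervention, put on one scale:
mobility_gain = qaly_gain("immobile", "walks_with_aid", years=10)
pain_gain = qaly_gain("severe_pain", "no_pain", years=10)
print(f"mobility intervention: {mobility_gain:.2f} QALYs")
print(f"pain intervention:     {pain_gain:.2f} QALYs")
```

Once both interventions are expressed in QALYs, the "2-unit mobility versus 3-unit pain" comparison in the text becomes a straightforward comparison of two numbers, which is the entire point of the common utility scale.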
Questions about equal-interval assumptions in educational research are well known.
While it is generally assumed that scale scores safely provide ordinal information, the
assumptions required to treat them as equal-interval are more difficult to test.
Establishing the interval properties of a test metric is important because interval distances
are much more informative than ordinal rankings—just think of a rank-ordering of basket-
ball players that includes Lebron James and the members of a high school basketball team.
While an ordinal scale may be equal-interval with respect to ranks, ordinal rankings do
not reflect the fact that the distance in ability between James and the second-best player
is enormous. Extant research suggests that equal-interval assumptions are problematic,
however. Domingue and, in a working paper, Nielsen have developed methods for test-
ing whether the equal-interval assumptions are plausibly met for some common academic
assessments and find that these assumptions are not met (Domingue, 2014; Nielsen, 2015).
Other researchers assume that any given scale is but one among many monotone transfor-
mations of a latent scale. Given this agnosticism, Cunha and Heckman (2008) and Cunha,
Heckman and Schennach (2010) propose a scale transformation that anchors the original
scale to adult earnings, a distribution that is assumed to have equal-interval properties. The
transformed scale score is then used to estimate production functions for cognitive develop-
ment. Relying on similar assumptions about the flexibility of scale transformations, Bond
and Lang (2013a; 2013b) subject a scale score to a variety of monotone transformations
according to an algorithmic objective function that maximizes and minimizes changes in
the white-black achievement gap. The authors find that inferences about gap changes are,
not surprisingly, sensitive to these scale transformations.
4The EQ-5D, for example, is one of the more commonly used metrics and provides three descriptions of mobility states, three descriptions of pain states, as well as three other health states. Utility scores are estimated for each health state, allowing for between and within health state comparisons (Oppe, Devlin and Szende, 2007).
The approach that I take here is to assume an equal-interval scale in the metric of achieve-
ment and estimate a new scale that will be equal-interval in the metric of welfare. Such a
scale will be useful insofar as we wish to understand whether changes in achievement are
important. For instance, current program evaluations leave fundamental questions unan-
swerable. Holding costs constant, if one intervention raises math scores 10-units and an-
other raises reading scores 10-units, we lack an outcome variable that adjudicates between
the two interventions. Likewise, if one intervention raises math scores 10-units at the low
end of the scale and another intervention raises math scores 10-units at the high end of the
scale, current practice fails to distinguish between these two results. The method I describe
and implement is one solution for resolving this uncertainty. It should be interpreted as
a proof of concept and perhaps is most useful for raising more questions than it answers.
Consider:
1 To what outcome should scale scores be indexed? In this paper, I present respondents with questions, asking them to determine which description of math and reading is more important for an “all things considered” better life. Other indices are available, such as income, civic engagement, or health outcomes. Cunha and colleagues (2008; 2010) index a child’s test score to the child’s future earnings, using a factor loading technique to weight the achievement distribution as a function of how well it predicts earnings. Such a technique is not a panacea, however. First, the factor loading method is an ad hoc scaling technique. Second, earnings carry their own scaling assumptions, e.g., should the scale be log transformed? Finally, linking achievement to earnings ignores the academic capabilities captured by the scale score.
2. Whose preferences for achievement should be included in the index? The sample of respondents included in this essay is mostly college educated. In the survey experiment, respondents are asked which state of achievement is more important for a good life. If respondents have no understanding of what a high level of numeracy or literacy feels like or entails, they will struggle to respond to the question. This suggests college educated respondents are appropriate. Nevertheless, it is likely important that an index of benefit captures the preferences of everyone. How to include all respondents in an exercise for which some may lack the cognitive capacity to participate is a difficult question. Note that the question does not apply to achievement alone: whether the poor can predict how much they would prefer being non-poor (and vice versa) seems similarly opaque. Whether a healthy person can predict how averse they would be to pain poses a similar problem.5
3. How should the index balance individual and social benefits? The approach used here measures preferences for individual benefits to achievement states. Such an approach ignores other distributive concerns, such as equity. It is known that survey respondents may be indifferent between a 1-unit change at the bottom and top of a scale when comparing between two persons, but when respondents are asked which of the two persons should receive treatment in a group of persons, they will choose to give treatment to the person whose health is at the bottom of the scale (Nord, 1999). This suggests individuals value relative differences (Otsuka and Voorhoeve, 2009). How such equity concerns, and other social values, should be included in the index is an important question.
4. How should time be modeled in the elicitation and estimation of the utility value? The method used here (and the one that is commonly employed in health economics) is to present respondents with a cross-sectional preference: "Person A has characteristics X and Person B has characteristics X′. Who is better?" In health, these characteristics are fixed by specifying that the health state will persist for t years, whereas achievement states are naturally assumed to change over time as students learn. Moreover, individuals may have different preferences for achievement growth than they do for achievement states. Linking a preference for achievement change to a student's scale score is complicated because our current measures of achievement only provide cross-sectional information about the student's abilities.6
These questions are a current source of debate in health economics and philosophy,7 and
are likely to continue to be debated. Questions like these are currently neglected in most
education policy evaluation, or the assumptions that go into the evaluation are left unstated.
The paper proceeds as follows: I begin by providing an overview of the survey design
and the data used for analysis. I then describe the theoretical model that motivates the
analysis and the econometric model that will be used for empirical estimation. The first set
of results I show describe utility values for the different achievement states. Interpolation
techniques are described that connect NAEP scores to utility values for the full distribution
of NAEP data. With interpolated data, I then estimate white-black achievement gaps using
5 Dolan and Kahneman call this distinction experience versus decision utility.
6 See Lipscomb et al. (2009) for a review of this and other issues related to time in the health landscape.
7 See Daniels (1985) for a philosophical view of justice in the provision of health care, as well as Nord (1999), who offers a mixture of economics and philosophy in his evaluation of the QALY metric.
the original and welfare-adjusted NAEP data and show that inferences about gap trends are
sensitive to scale selection.
II. Survey design
The survey design has two components. The first is a ranking exercise, in which three out
of five reading or three out of five math descriptors are randomly selected and respondents
are asked to rank these descriptors in order of difficulty. Reading and math ability descrip-
tions are taken from the NAEP-LTT performance level descriptors, described below. The
purpose of this ranking exercise is two-fold: to prime respondents so that they recognize
these descriptors are ordinally ranked, and to screen respondents who cannot (or will not)
rank descriptors correctly. Figures II through IV display the ranking and choice tasks as
they appeared in the experiment.
The ranking exercise is followed by a choice-based conjoint design (often referred
to as a discrete choice experiment) to obtain utility values for different math and reading
descriptors. Choice-based conjoint designs are widespread in health and public economics,
marketing research, and have become increasingly common in political science (for ex-
amples in health and public economics, see De Bekker-Grob, et al., 2012 and McFadden,
2001, respectively; in marketing, see Raghavarao, et al., 2011; in political science see
Hainmueller and Hopkins, 2015). In the experiment, respondents are provided with a de-
scription of two individuals (Person A and Person B) who are alike in all respects, except
that they differ in their math and reading abilities. Respondents are asked to determine
which bundle of math and reading abilities between Persons A and B will lead to an “all
things considered” better life. After being presented with the reading and math profiles,
the respondent is forced to make a choice between Persons A and B. The response is coded
dichotomously, 1 if Person A or B was chosen and 0 otherwise. Each respondent is given
only one choice task.8
8 More than one choice task is of course possible, requiring that standard errors be clustered at the respondent level. The decision to offer respondents only one choice task was motivated by a reduction in cognitive load, as performance descriptors are text heavy, as well as the fact that the marginal survey cost
The purpose of the choice task is for respondents to make interval comparisons between
Persons A and B with respect to welfare. As an example, consider a choice task where
Person A has reading ability equal to 5 and math ability equal to 2, while Person B has
reading and math abilities equal to 3.9 Effectively, the respondent is being asked to make
a trade between 2 units of reading for 1 unit of math. Whether respondents, on average,
choose Person A over B will depend on how much they value reading relative to math,
and, importantly, how much they value math gains at the bottom of the distribution relative
to reading losses at the top. To see this, consider an alternative choice task where Person
A has reading ability 4 and math ability 1 and Person B has reading and math abilities 2.
Here, the reading and math abilities of Persons A and B have been shifted down equally, but
respondents may not make the same selections, since a change in reading from 5 to 4 need
not be equivalent to a change in reading from 4 to 3. This exercise formally tests whether
respondents' preferences are, indeed, equal interval with respect to welfare. How respondents, on average, weight these different trades will determine the relative concavity of the welfare-adjusted scale score.
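The identification logic of the two tasks above can be illustrated numerically. Under a hypothetical (weakly concave) utility over ability levels 1 through 5, the same nominal trade of 2 reading units for 1 math unit can flip the preferred profile once both profiles are shifted down the scale — the utility values below are illustrative assumptions, not estimates from the survey:

```python
# Hypothetical concave utility over ability levels 1-5 (diminishing returns);
# these values are assumptions for illustration only.
u = {1: 0.0, 2: 1.4, 3: 1.9, 4: 2.4, 5: 2.6}

def prefers_a(read_a, math_a, read_b, math_b):
    """True if the total utility of profile A exceeds that of profile B."""
    return u[read_a] + u[math_a] > u[read_b] + u[math_b]

# Task 1: A = (reading 5, math 2) vs. B = (reading 3, math 3).
task1 = prefers_a(5, 2, 3, 3)   # 2.6 + 1.4 = 4.0 vs. 1.9 + 1.9 = 3.8
# Task 2: the same profiles shifted down one level each.
task2 = prefers_a(4, 1, 2, 2)   # 2.4 + 0.0 = 2.4 vs. 1.4 + 1.4 = 2.8
```

Here the respondent chooses A in the first task but B in the second, exactly the pattern that reveals preferences are not equal interval with respect to welfare.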
[Insert Figure II Here]
[Insert Figure III Here]
[Insert Figure IV Here]
Finally, note that Figure III explicitly states the age of Persons A and B. Because the
performance level descriptors from the NAEP-LTT pertain to students at the ages of 9,
13 and 17, and because individual scale scores are available for students at those ages, I
randomly assign one of three ages (9, 13, 17) to each choice task. The purpose of this
additional randomization is to test the sensitivity of preferences for achievement bundles to
age. For example, respondents may value gains in reading and math at the low end of the
distribution for persons aged 9 more than they value equivalent gains for persons aged 17.
Randomly assigning age will allow me to test this hypothesis.
using Amazon's MTurk suite are relatively low.
9 Where ability level 5 corresponds to the highest performance level descriptor on the NAEP, and so on. Respondents are not asked to make trades regarding integer values of the NAEP but are instead presented with textual descriptions of reading and math abilities commensurate with integer scores. See Scale Anchoring section below.
II.A. Math and reading descriptors and scale scores
The choice task described above uses performance level descriptors to connote reading and
math ability levels. In order to construct a data set with performance level descriptors,
utility values, and scale scores, it is necessary that these performance level descriptors (and
their estimated utilities) can be plausibly linked to scale scores. The plausibility of this
linking is defended below, but it is natural to wonder why performance level descriptors are
needed at all. Why not ask respondents to make trades using the scale scores themselves?
There are two problems with such an approach. First, I hope it is evident that in
order to estimate an interval scale with respect to welfare it is important not to conflate
the welfare scale with the original scale that describes ability. The purpose of the choice
task is to allow respondents to decide for themselves the interval distances with respect
to value between, say, reading units 1 and 2 and units 3 and 4, and so on. A Rasch or
IRT model might estimate equal-interval distances between these units, but respondents
are being asked to decide whether these distances are equal in another dimension, that
of welfare. The second problem is that a scale score decoupled from a performance level
descriptor connotes no meaningful information to the respondent. Any scale can be linearly
transformed, and determining how much 5 units is worth relative to 4 units, or 500 relative
to 400 is not possible. For these reasons, it is necessary to provide respondents with a richer
descriptor of what performance looks like, and then link the performance-level descriptor
back to a scale score.
II.B. Linking NAEP descriptors to scale scores
I now turn to the question of whether or not performance-level descriptors can be plausibly
mapped onto scale scores. One of the goals of academic measurement, dating back to
at least 1963, is to provide criterion-referenced interpretations of scale scores—in other
words, to be able to provide descriptions of what students know and can do in an academic
domain (Mullis and Jenkins, 1988). The process by which the NAEP links performance
level descriptors to scale scores is called scale anchoring. Scale anchoring attempts to
provide a context for understanding the level of performance defined by the specific test
items that are likely to be answered correctly by students (Lissitz and Bourque, 1995).
Anchor levels are determined by a combination of statistical and judgmental processes.
For the NAEP, an IRT model is used to estimate an ability score, θ, for each student,
bounded between 0 and 500. The equidistant points 150, 200, 250, 300 and 350 are then
selected from the scale.10 Test items from the assessment are then selected and categorized
according to whether or not the item discriminates between students with different scale
scores. For example, an item will be categorized as a “150 level item” if (a) 65 percent of
students scoring at or around 150 answered the item correctly; (b) 30 percent of students
or fewer scoring below 150 answered it correctly; (c) no more than 50 percent of students
scoring below 150 answered it correctly; and (d) a sufficient number of students responded
to the item. With this procedure, a large number of items can be categorized as being “150
level items”, “200 level items”, and so on. This completes the statistical part of the process.
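Because the categorization rule is a conjunction of simple thresholds, it can be expressed compactly. A minimal sketch: the 65/30 thresholds come from criteria (a) and (b) above (criterion (b) is stricter than (c), so the 30 percent bound binds), while the minimum-response count standing in for criterion (d) is a hypothetical choice:

```python
def anchors_at_level(pct_correct_at_level, pct_correct_below, n_responses,
                     min_responses=100):
    """Decide whether an item anchors at a given scale point.

    Thresholds follow the criteria quoted in the text; min_responses is a
    hypothetical stand-in for criterion (d), "a sufficient number of
    students responded to the item."
    """
    return (pct_correct_at_level >= 65       # (a) mastered at the anchor level
            and pct_correct_below <= 30      # (b)/(c) rarely answered below it
            and n_responses >= min_responses)  # (d) enough responses

# An item answered correctly by 70% at the level and 20% below anchors there.
is_150_item = anchors_at_level(70, 20, 500)
```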
The judgmental part of the process occurs when teams of curriculum and content special-
ists from the respective domains (i.e., reading and math) are asked to describe the kinds of
academic competencies reflected in the categorized items. Specialists meet in teams and
form a consensus about what these items signal.
The final result is a set of performance level descriptors that characterize what students
know and can do as defined by test performance on selected items. Scale scores are em-
pirically determined, anchor items are empirically identified, and anchor descriptions are
provided by expert judgment (see Beaton and Allen, 1992; Mullis and Johnson, 1992; Lissitz and Bourque, 1995 for a full description of the scale anchoring process).11
There are problems with this procedure. Lissitz and Bourque describe the anchor item
10 Very few students score in the tails of the scale score distribution, and for this reason the selected points of interest ignore those regions.
11 Performance level descriptors differ from standards setting or achievement level descriptors. Standards setting practices begin with a set of skills that experts believe correspond to proficiency levels. For instance, it might be asserted that a 4th grade student is proficient in reading if that student can read chapter books for comprehension. Experts then work through the test items and determine subjectively what percent of students would answer the item correctly if the student was proficient in reading. This stands in stark contrast to the anchoring procedure described here, as the items are not categorized according to a statistical procedure and given subjective analysis ex post, but instead are categorized exclusively according to a judgmental procedure.
selection process as “low inference” and the descriptive process as “high inference.” The
key issue revolves around whether the descriptors are overly uni-dimensional. Not all items
can be empirically anchored to different ability levels, leaving open the possibility that the
anchored items are too narrow. While experts construct uni-dimensional descriptions of
anchor items, other descriptions cannot be ruled out. Moreover, the performance level descriptors collapse across different sub-scales, glossing over multi-dimensionality that is
present even in the empirical data. Finally, even though equidistant anchor levels are se-
lected, if the equal-interval assumptions of the scale score are not met, then the descriptors
will likewise not be equal-interval scaled.
Despite these concerns, anchoring in this way is the most widely used technique for
providing descriptions of what students know and can do at different points across the scale.
Given how widely these benchmarks are used in classrooms and policy discussions, it is at
least plausible to suggest that the performance descriptors used in this survey experiment
can be mapped to specific scale scores. The performance level descriptors for reading and
math are described in Tables 1 and 2 below. The entire performance level descriptor is used
in the choice-based conjoint experiment.
[Insert Table I Here]
[Insert Table II Here]
II.C. Data collection
Utility values are estimated from survey respondents. Respondents are drawn from the
United States during June and July 2015. Participants were enrolled using
Amazon’s Mechanical Turk software suite and the survey was administered using Qualtrics.
Respondents were offered $0.35 to participate in the survey, equivalent to about $6.00 per
hour, and the study was administered with IRB approval. In total, 2351 respondents partici-
pated. According to self-reports, respondents were primarily college educated (78 percent),
white (73 percent) and balanced by gender (48 percent male, 52 percent female). Mechan-
ical Turk populations, while not representative of the national population on observables,
have been shown to have nationally representative preferences with respect to certain stim-
uli, such as responsiveness to information about income distributions (Kuziemko, Norton,
and Saez, 2015) and risk aversion (Buhrmester, Kwang and Gosling, 2011).12
III. Econometric framework
In this section I describe the modeling approach I use to estimate utility values for each of
the math and reading performance level descriptors. The model uses the logistic likelihood
function to provide point estimates for reading and math performance level descriptors at
levels 150, 200, 250, 300 and 350. Point estimates for reading and math performance level
descriptors can be interpreted as contributions to the log odds that respondent i chose Person A (profile 1) with reading and math characteristics θsl, where s indexes subject (reading or math) and
l indexes performance level (150, 200, 250, 300, 350).
Formally, the data are structured so that there is one row of observation for each survey
respondent i. A response variable is coded 1 if respondent chose Person A (profile 1);
0 otherwise–that is, if the perceived utility of Person A exceeded the perceived utility of
Person B. The pairwise offerings presented to each respondent are coded as indicator vari-
ables. For example, if respondent i compared Person A, who had Reading 150, Math 300 (Reading 1, Math 4), and Person B, who had Reading 200, Math 250 (Reading 2, Math 3),
the indicator variables Read1a, Math4a, Read2b and Math3b would be coded 1; all other
indicator variables (Read2a through Read5a, Math1a-Math3a and Math5a, etc.) are coded
0. These ones and zeroes mark the choice set available to the respondent.
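The indicator coding just described can be sketched directly; the field names (Read1a, Math4a, and so on) mirror the variable names in the text, with integer levels 1–5 standing in for the 150–350 descriptors:

```python
def encode_choice(read_a, math_a, read_b, math_b):
    """One row per respondent: 20 indicator variables marking the offered pair.

    Levels are integers 1-5 standing in for descriptors 150-350.
    """
    row = {}
    for level in range(1, 6):
        row[f"Read{level}a"] = int(read_a == level)
        row[f"Math{level}a"] = int(math_a == level)
        row[f"Read{level}b"] = int(read_b == level)
        row[f"Math{level}b"] = int(math_b == level)
    return row

# The example pair from the text: A = (Reading 1, Math 4), B = (Reading 2, Math 3).
row = encode_choice(read_a=1, math_a=4, read_b=2, math_b=3)
```

Exactly four of the twenty indicators are set to 1 in any row, marking the choice set the respondent actually saw.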
Thus, the probability that respondent i chose Person A is:
Pr(ChooseA) = f(Uia + εia > Uib + εib)    (3.1)
            = f(Uia − Uib + εia − εib > 0)    (3.2)
12 Pilot studies took place over the months of September 2014 to June 2015. Development of the survey design took place in Stanford's Laboratory for the Study of American Values.
This expression says that the probability of choosing Person A is a function of an indi-
vidual’s observed utility for Persons A and B plus a random component εij . Respondents
choose A when they perceive more utility for A than B, or when the difference in utility
between Persons A and B is greater than zero.
If we assume that the errors have a logistic distribution, then we can specify the model
such that:
Pr(ChooseA) = 1 / (1 + e^−(Uia − Uib + εij)),  where εij = εia − εib    (3.3)
            = 1 / (1 + e^(Uib − Uia − εij))    (3.4)
We simplify by taking logs and get:
Ln[Pr(ChooseA) / Pr(ChooseB)] = Uia − Uib + µij    (3.5)
So far we have only shown that the log odds of choosing Person A over B will be a
function of how much the utility attributed to Person A exceeds the utility attributed to
Person B. We also know that Persons A and B have characteristics. Substituting, we get:
Uia = Mathia + Readia;  Uib = Mathib + Readib    (3.6)

Ln[Pr(ChooseA) / Pr(ChooseB)] = (Mathia − Mathib) + (Readia − Readib) + µij    (3.7)
This expression says that the log odds of choosing Person A over B will be a function
of how much Person A’s math and reading abilities (Mathia and Readia, respectively) are
preferred over Person B’s math and reading abilities (Mathib and Readib, respectively).
Let θsl = Mathia − Mathib or Readia − Readib for the full vector of Math and Reading pairwise offerings made available to all respondents. Then, the model can be estimated
with the equation:
Ln[Pr(ChooseA) / Pr(ChooseB)] = α + Σ(s=1 to 2) Σ(l=2 to 5) θsli + µsli    (3.8)
where s indexes subjects (reading and math), l indexes levels (200, 250, 300 and 350; level 150 for both subjects is absorbed into the constant α), and i indexes respondents.
This model estimates a total of 8 parameters plus a constant. Standard errors are clustered
to account for heteroskedasticity.
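As a rough illustration of this estimation strategy (not the dissertation's actual code or data), the sketch below simulates choices from hypothetical concave utilities and recovers them with the difference-in-indicators design and a plain Newton–Raphson logistic fit in numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" concave utilities for levels 150..350 (indexed 0..4);
# these values are assumptions for the simulation, not survey estimates.
u_math = np.array([0.0, 1.1, 1.9, 2.4, 2.6])
u_read = np.array([0.0, 1.3, 2.0, 2.3, 2.4])

n = 20000
math_a, read_a = rng.integers(0, 5, n), rng.integers(0, 5, n)
math_b, read_b = rng.integers(0, 5, n), rng.integers(0, 5, n)

# Latent utility difference and simulated choices under logistic errors.
dU = (u_math[math_a] + u_read[read_a]) - (u_math[math_b] + u_read[read_b])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-dU))).astype(float)

# Difference-in-indicators design: level 150 (index 0) is the omitted baseline,
# so its utility is absorbed into the constant, as in equation (3.8).
cols = [np.ones(n)]
for a, b in ((math_a, math_b), (read_a, read_b)):
    for level in range(1, 5):
        cols.append((a == level).astype(float) - (b == level).astype(float))
X = np.column_stack(cols)  # intercept + 4 math + 4 read columns

# Logistic maximum likelihood via Newton-Raphson.
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    H = (X * (p * (1 - p))[:, None]).T @ X   # observed information
    beta += np.linalg.solve(H, X.T @ (y - p))
```

With a large simulated sample, the recovered coefficients approximate the true utility differences relative to the omitted level-150 baseline, and the constant is near zero.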
Previously, I noted that the ages 9, 13 and 17 were randomly assigned to respondents,
in order to test whether respondent preferences for different parts of the reading and math
distributions varied by the supplied ages of Persons A and B. These age terms can be
introduced in the model as interactions:
Ln[Pr(ChooseA) / Pr(ChooseB)] = α + δa × (Σ(s=1 to 2) Σ(l=2 to 5) θsli) + µsli    (3.9)
where δa is an age fixed effect, thus giving 24 total estimated parameters (8 reading and math terms × 3 ages) and a constant.
IV. Results
I now turn to results. The survey consisted of both a ranking and a choice exercise. The
ranking exercise was included to determine whether respondents could and did understand
that the performance level descriptors provided increasingly sophisticated descriptions of
reading and math abilities. I begin by showing the percentage of respondents who ranked performance level descriptors correctly in terms of difficulty. A majority of respondents are able
to rank these descriptors correctly, suggesting that they understand the descriptors connote
ordinal information in terms of ability.
I then turn to point estimates from logistic regression models. I show point estimates for two sets of models: age-interaction models (for ages 9, 13 and 17) are shown
along with models that estimate the weighted average across age. These allow us to see
whether age-interactions meaningfully change respondent behavior. Three interpolation
schemes are considered and monotonic cubic interpolation (MCI) according to Fritsch and
Carlson (1980) is selected.
With an interpolation scheme in place, I have a full range of data for both the original
scale score and the estimated welfare scale. As a descriptive application, I show trends in
the white-black achievement gap, defined as the difference in mean white and black scores,
for the original and adjusted scales. NAEP scores are fairly stable across time but change
substantially as students age.13 Test scores are available for a random sampling of students
at ages 9 and 17 every 8 years for six cohorts in math and reading, allowing for description
of achievement growth as students age across various cohorts. I conclude by showing gap
trends across age for various cohorts using both scales.
IV.A. Ordinal ranking exercise
Respondents first participated in a ranking exercise in which they were randomly assigned
3 of 5 reading or 3 of 5 math performance level descriptors (an example of the exercise
is shown in Figure II). Only three descriptors were randomly drawn in order to simplify
the ranking task. There are 10 possible reading and math bundles with no ties that can be randomly assigned to respondents, a tie being defined as a respondent receiving two or more identical reading or math performance level descriptors.14 Among
non-ties, the probability of being assigned any one reading or math performance level de-
scriptor is uniformly distributed. There are 10 possible non-tying combinations of perfor-
mance level descriptors: 123, 124, 125, 134, 135, 145, 234, 235, 245, 345 (where 1=150,
2=200, 3=250, etc.). Likewise, the distribution of descriptor combinations is uniformly
distributed.
13 The NAEP-LTT is vertically scaled, meaning students at different ages are exposed to an overlapping subset of test items. See Beaton and Swick (1992) and Haertel (1991) for discussion.
14 Ties are excluded because the exercise is made radically simpler when ranking only two unique sets of descriptors.
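The combinatorics here are easy to verify directly. A quick check (levels 1–5 mapped to 150–350) reproduces the ten non-tying combinations and the distribution of cumulative distances between the three drawn descriptors:

```python
from collections import Counter
from itertools import combinations

# Levels 1-5 correspond to anchor points 150-350.
levels = {1: 150, 2: 200, 3: 250, 4: 300, 5: 350}
combos = list(combinations(sorted(levels), 3))   # all non-tying draws of 3 of 5

def cum_distance(c):
    """Cumulative distance between the three descriptors, e.g. 1-2-3 -> 50 + 50 = 100."""
    return (levels[c[1]] - levels[c[0]]) + (levels[c[2]] - levels[c[1]])

dist_counts = Counter(cum_distance(c) for c in combos)
```

The counts (three combinations at distance 100, four at 150, three at 200) also explain footnote 15's observation that Distance 150 combinations are approximately 1.3 times as prevalent as the others.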
51
Uniformity allows for independent point estimates of each subject-level descriptor. How-
ever, independent estimates of the effect of being assigned a performance level descriptor
on the probability of ranking that descriptor correctly are available only if descriptor com-
binations are equally difficult. For example, some descriptor combinations will, by chance,
assign respondents combinations of descriptors that are further spaced than other combi-
nations (e.g., 135 is further spaced than 234 or 125). If correctly ranking is easier when
descriptors are further apart (e.g., 135 is easier than 234 or 125), and if some performance
level descriptors are more commonly found in these more easily ranked combinations, then
independent estimates of each performance level will be biased. To test for this, I construct
three indicator variables (Distance 100, Distance 150, and Distance 200) indicating the cu-
mulative distance between the three performance descriptors. For example, the indicator
Distance 100 will be coded 1 if the three descriptors were 150, 200, 250 (distance is 50 be-
tween 150 and 200 and 50 between 200 and 250 for a total of 100); 0 otherwise. Distance
150 and 200 are coded similarly.
The data are structured such that there are three observations per respondent. Each row
corresponds to the subject s and level l randomly shown to the respondent i. If the respon-
dent ranked the item correctly, it is coded 1; 0 otherwise. In total, a respondent may rank
0, 1 or 3 descriptors correctly (mis-ranking one descriptor necessarily results in two or more descriptors being mis-ranked). I estimate two regression models:
Ranksli = θsli + µsli    (3.10)

Ranksli = δd × θsli + µsli    (3.11)
Here, s indexes subject, l indexes level and i indexes respondent. Each model is run sepa-
rately for math and reading, for a total of four estimations. The dependent variable Ranksli
is coded as 0 or 1 depending on whether the respondent ranked correctly; indicator variables
θsli indicate the linear probability that respondents ranked performance levels 1 through 5
correctly, and the interaction term δd indicates the proportion of respondents ranking θsli
correctly when they were offered three-descriptor combinations with distances equal to
100, 150 or 200. Point estimates for δd interactions are relative to d = 100. Standard errors
are clustered at the respondent level to account for intra-respondent correlation. Results are
reported in Table III.
[Insert Table III Here]
The main effects coefficients (Levels 150 through 350, indicated by column header
“Mean”) indicate the proportion of respondents ranking that performance level descrip-
tor correctly. Here we see that, for the most part, respondents were successful at ranking
the descriptors. Percents correct range from 63 to 79 percent depending on subject and level. There is not an obvious pattern between subjects and levels with respect to
how effectively respondents ranked.
The interaction terms confirm the hypothesis that additional space between performance
level descriptors improves ranking competence. Relative to when cumulative distance is
100 (the smallest possible distance among non-ties), estimates at distances 150 and 200 are nearly always higher (the exception being math, level 300, distance 200) and generally significant.15 Overall, respondents ranked reading and math descriptors correctly 63 to 79 percent and 64 to 72 percent
of the time, respectively. Whether respondents rank incorrectly on account of negligence
or genuine confusion is unknown.
Correct rank ordering of performance level descriptors is relevant to the utility model be-
cause the model assumes monotonicity of consumer choice preferences. The monotonicity
assumption is simply that respondents should choose higher levels of reading or math, all
else constant. That is, for example, if respondent i faces a choice task k in which Person
A has Reading and Math 250 and Person B has Reading 250 and Math 300, respondent
i must choose Person B. In health economics, where choice-based conjoint designs are
common and assumptions of monotonicity are likewise required, there is no consensus on
best practices for when respondents “choose badly.” I follow current practices and delete
15 The interaction terms do not average to the main mean effect because the Distance 150 terms are approximately 1.3 times more prevalent than either the 100 or 200 terms.
observations for which respondent choices violate monotonicity assumptions.16,17
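The screening rule for "choosing badly" can be made precise as a dominance check. A minimal sketch, with profiles as hypothetical (reading level, math level) pairs:

```python
def dominates(p, q):
    """p weakly exceeds q in both subjects and strictly in at least one."""
    return p[0] >= q[0] and p[1] >= q[1] and p != q

def violates_monotonicity(chosen, rejected):
    """A choice violates monotonicity when the rejected profile dominates
    the chosen one; profiles are (reading_level, math_level) tuples."""
    return dominates(rejected, chosen)

# The example from the text: choosing A = (250, 250) over B = (250, 300)
# is a violation, since B is at least as good in both subjects.
bad_choice = violates_monotonicity(chosen=(250, 250), rejected=(250, 300))
```

Note that a pair only "requires monotonicity" when one profile dominates the other; when neither dominates (e.g., (250, 200) versus (200, 300)), either choice is admissible.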
IV.B. Beta estimates
Estimation of the utility model (Equations (3.8) and (3.9)) is done on a sub-sample of
respondents who (a) responded to the choice task and (b) complied with monotonicity
assumptions. The sample includes 2057 of 2351 respondents given a choice task. I begin
by showing results for Equation (3.9), where randomly assigned age descriptors δa for
ages 9, 13, and 17 are interacted with reading and math performance level descriptors θsl,
providing 24 (3 ages × 4 betas × 2 subjects) point estimates plus a constant. The interaction
terms allow us to see whether respondent preferences for performance levels are sensitive
to profile age. Results are displayed in Figure V.
[Insert Figure V Here]
The common intercept α anchors point estimates for Math and Reading ages 9, 13 and
17 at Level 150. Because each of the subjects are estimated simultaneously, it is possible to
16 See Lancsar and Louviere, 2006; Lancsar and Louviere, 2008; Inza and Amaya-Amaya, 2008 for discussion. These papers discuss both monotonicity violations as well as other violations of rational choice theory. The focus is primarily on repeated observation of respondent choice behavior, when preferences should be transitive and consistent. In cases where transitivity and consistency are violated, deletion of respondent choice data is discouraged. Guidelines for best practices in cases of monotonicity violations are not well specified. Higher quality (and more expensive) data can be obtained in order to determine whether respondents failed to comprehend, did not take the choice task seriously, or had other reasons for preferring less over more achievement. In total, 147 respondents out of 2351 were removed from the sample for making non-monotonic choices, i.e., choosing a profile with performance level descriptors lower than the alternative. 752 respondents were presented with a choice set in which a decision required monotonicity, meaning that about 19 percent of respondents "chose badly" when given the option. An additional 147 respondents were removed because they did not respond to the choice task. The final estimation sample includes 2057 respondents.
17 In pilot surveys that took place between September 2014 and June 2015, I attempted to make the performance level descriptors more concise in order to improve respondent comprehension and to present respondents with additional choice sets. This procedure has the drawback of undoing the scale anchoring process described previously. In particular, complete descriptors have already been criticized for excessive unidimensionality, and any additional concision would bolster those criticisms. In an effort to allow the descriptors to maintain their multidimensionality while increasing concision, respondents were randomly assigned sub-elements within each performance level descriptor. I generated 3 to 5 sub-descriptors for each complete performance level descriptor and randomly assigned those. An average estimate of the sub-descriptors would, in theory, describe the multidimensional aspects of the full descriptor. Nevertheless, I found that respondents were not additionally successful at ranking sub-descriptors relative to the full performance level descriptor; indeed, for many of the sub-descriptors I constructed, respondents were much worse at ranking them. For these reasons, I chose to use the full descriptor.
compare across subject domains, as well as within domain, across performance level. The
solid and dashed lines correspond to fitted quadratic and cubic regression lines, precision
weighted by the inverse of the standard error squared. Analytic weights are likewise applied
to each of the point estimates to indicate precision (i.e., larger circles have smaller standard
errors).
With only five estimated points, much of the range of potential scale scores is missing. The problem of missing data is unique to the educational setting,
where two measures of ability are commonplace: discrete performance level descriptors
and continuous measures. In order to capture the full continuous range of ability using
only discrete descriptors, we will need to fill in the missing data. The two interpolation
and extrapolation schemes presented here (quadratic and cubic interpolation) are seen to
be inadequate. The primary purpose of Figure V is to illustrate two problems with the
schemes. First, by not imposing monotonicity on the interpolated line, we violate utility
axioms. Second, using either extrapolation method for points beyond 150 and 350 leads to
outlandish prediction.
To correct these limitations, I use monotone piecewise cubic interpolation (MCI) as suggested by Fritsch and Carlson (1980). MCI produces the results depicted in Figure VI for the
range 100 to 500. The top panel shows results for Equation (1.8), where the age-interaction
terms are removed. By construction, the curvature is monotonic throughout the entire range
and fits the estimated data perfectly. MCI extrapolates beyond 150 and 350 by linearly
extending the line through the two most proximal known points (i.e., 151 and 150 at the
bottom, 349 and 350 at the top). Linear extrapolation may not be appropriate for points
outside the estimated range. Later, I test how sensitive results are to alternative
extrapolation techniques.
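The MCI step can be sketched with SciPy, whose `PchipInterpolator` implements the Fritsch-Carlson monotone cubic scheme; the utility values below are illustrative placeholders, not the estimated coefficients.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Performance levels and illustrative (not estimated) utility values.
levels = np.array([150.0, 200.0, 250.0, 300.0, 350.0])
betas = np.array([0.0, 0.9, 1.6, 2.1, 2.3])

# PchipInterpolator implements Fritsch-Carlson monotone cubic interpolation:
# it passes through every point and preserves the monotonicity of the data.
pchip = PchipInterpolator(levels, betas)

# Linear extrapolation from the two most proximal points, as in the text.
slope_lo = float(pchip(151.0) - pchip(150.0))
slope_hi = float(pchip(350.0) - pchip(349.0))

def welfare(score):
    """Interpolate inside [150, 350]; extrapolate linearly outside."""
    score = np.asarray(score, dtype=float)
    inside = pchip(np.clip(score, 150.0, 350.0))
    below = betas[0] + slope_lo * (score - 150.0)
    above = betas[-1] + slope_hi * (score - 350.0)
    return np.where(score < 150.0, below, np.where(score > 350.0, above, inside))

grid = np.linspace(100.0, 500.0, 401)
values = welfare(grid)
```

Because the input data are increasing, the resulting curve is monotone over the full 100 to 500 range.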
[Insert Figure VI Here]
We can now examine the results. First, note the concavity of each set of point estimates.
As hypothesized, welfare returns to achievement are non-linear and decrease at the higher
end of the scale. This is true for all ages and subjects. There is variation in the curvature
between subjects and ages. For all ages in reading (bottom right panel of Figure VI), there
is a steep gain in utility at the bottom of the scale, and then utility gains flatten out. Age 17
shows a steep increase at the high end of the scale, but much of this is due to extrapolation
beyond the estimated value of 350. For math (bottom left panel), the largest gains are in the
middle of the distribution, as scores increase from 200 to 350, and this is true for all ages.
Overall, we see confirmation of the initial hypothesis that utility gains for achievement are
non-linear and concave.
In order to estimate changes in achievement across age, by cohort, it will be necessary to
combine age terms and estimate Equation (1.8). Recall the monotonicity assumption implicit in the model: increases in the original scale score must be associated with increases
in benefit. As seen in Figures V and VI, each of the age curves is monotonically increasing, but the model does not impose monotonicity across age. To understand why, consider
Figure VII, which shows a stylized depiction of Figure VI overlaying Ages 9 and 17. Here,
the curvatures for Ages 9 and 17 are respectively monotonic, but as achievement along the
x-axis increases and “jumps” from Age 9 to 17, there is a concomitant decrease in Y , i.e.
utility. This violates the modeling assumptions and implies that as children gain in achievement between ages 9 and 17, they are made worse off, all things considered. This implication holds despite the fact that we observe positive utility returns to achievement within each age.
[Insert Figure VII Here]
We observe this “jump” problem because respondents are not asked to make marginal
welfare preferences for achievement gains but are instead asked to state preferences for
achievement states. The theoretical and practical differences between estimating gains and
states are a recurring theme in health research and were introduced earlier. The problem is
even more pronounced in educational applications, as any vertical scale assumes change
in ability across age. Nevertheless, it is not obvious whether welfare evaluations should
be sensitive to those changes. Moreover, most test scores are presented as cross-sectional
measures of ability. Given that the purpose of this exercise is to convert a commonly used
measure of ability into one that connotes utility, using the cross-sectional achievement score
seems appropriate. Modeling growth may be possible but is left aside for future research.18
18See Weinstein, et al., 2009 and especially Nord, et al., 2009 for important discussions of gains versus levels in health, with emphasis on both policy and normative implications.
Point estimates will therefore be taken from Equation (1.8). By eliminating the age-interaction terms, the model describes average welfare returns to achievement and is monotonic.
Point estimates correspond to the weighted average of the three age terms (9, 13, 17) for
each performance level. This can be seen in the upper left and right panels of Figure VI.
Comparing the lower and upper panels of Figure VI shows that age-specific point estimates do not substantively alter interpretation. Moreover, ignoring the age interactions
helps to mitigate some of the exaggerated extrapolation for ranges beyond 350.
Estimating and converting NAEP scales
Here I describe how estimated utility values for Reading and Math performance level descriptors 150, 200, 250, 300 and 350 are applied to individual-level NAEP data. To do
this, I take individual level data from the NAEP restricted-use files and generate a vector of
reading and math scores for each individual student’s 5 plausible values.19 Each individual
student’s score is estimated according to the MCI projection. This is done for all student
scores in reading and math, ages 9, 13 and 17, for years 1990-2008. As a summary statistic, I take the mean NAEP and mean welfare-adjusted score for each subject, age, year and
subgroup, taking account of the NAEP's complex survey design as well as the five plausible values.20 Finally, in order to compare the original scale, which ranges between 100 and
500, to the welfare-adjusted scale, which ranges from -2.7 to 0.2, I standardize them both
to have mean µ = 100 and standard deviation σ = 10.
19In the NAEP, individuals do not receive the complete battery of tests. For this reason, each individual student is given 5 plausible values, which are randomly drawn from a distribution of possible θ values. The 5 plausible values can be combined to provide summary statistics for sub-populations following Rubin's rules for multiple imputation. See Mislevy, et al., 1992 for a description of this procedure.
20Specifically, to estimate means for each plausible value of the NAEP, I use Stata's –svy– commands, specifying probability and replicate weights, as well as the sampling clusters. I follow Rubin's rules to aggregate across each of the 5 plausible values. The mean score is a simple average of the subject-age-year scores, but the error variance requires that we account for both the between-plausible-value variation and the error variance of each estimate. The formulas are: $\text{within} = \frac{1}{5}\sum_{p=1}^{5}\sigma_p^2$; $\text{between} = \frac{1}{4}\sum_{p=1}^{5}(\bar{X} - \bar{X}_p)^2$; $\text{total} = \sqrt{\text{within} + 1.2 \times \text{between}}$.
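Rubin's combination rule for the five plausible values can be sketched as follows; the means and sampling variances below are invented for illustration, not NAEP estimates.

```python
import numpy as np

# Illustrative means and sampling variances for 5 plausible values.
pv_means = np.array([251.3, 250.8, 251.9, 250.5, 251.1])
pv_vars = np.array([0.42, 0.45, 0.40, 0.44, 0.43])  # squared standard errors

m = len(pv_means)                                    # m = 5 plausible values
point = pv_means.mean()                              # combined point estimate
within = pv_vars.mean()                              # within-imputation variance
between = np.sum((pv_means - point) ** 2) / (m - 1)  # between-imputation variance
total_se = np.sqrt(within + (1 + 1 / m) * between)   # (1 + 1/5) = 1.2
```

The `(1 + 1/m)` correction is the standard multiple-imputation adjustment; with five plausible values it reduces to the 1.2 factor in the footnote.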
IV.C. Comparing original to welfare-adjusted scale
I now show how inferences from the original NAEP scale and from the estimated and interpolated welfare-adjusted scale contrast. I first present a stylized figure to show the kinds of
cases for which inferences between the two scales will diverge. I then compare white-black
achievement gaps (defined as the mean difference between the two groups) across time and
across cohort (that is, as students age).
Achievement gap example
An example of a change in math achievement for a single cohort is shown in Figure VIII.
The x-axis shows the original standardized NAEP scale and the y-axis shows the estimated
and interpolated scale for a cohort of students in years 1982 to 1990. The solid intersecting
lines indicate mean black scores and the dashed intersecting lines indicate mean white
scores; the scores at the lower end of the distribution are for students at age 9, and those
at the higher end are for students at age 17. The difference between dashed
and solid lines for the respective axes provides the achievement gap.
[Insert Figure VIII Here]
It is clear from Figure VIII that white-black differences at age 17 are slightly smaller than
they were at age 9 on both scales, indicating that the gap shrank as children aged. The
change in the gap is much smaller using the welfare-adjusted scale than using the original
scale, because the difference in scores at age 9 is smaller on the welfare scale than on the
original NAEP scale. What is also revealing about this figure is that if all student scores
increased by the same amount (i.e., an equivalent mean increase in achievement), the effect
on the achievement gap in the adjusted scale would be profound. By shifting all scores to
the right, the size of the gap at age 9 in the adjusted scale would be larger, as a result of the
steeply increasing value in achievement, and the size of the gap at age 17 would be smaller,
as a result of the fact that gains at the high end of the scale are diminished. Taken together,
the adjusted scale would show a dramatic decrease in achievement gaps between the ages
of 9 and 17, simply by increasing all scores an equal amount. This contrast between scales
is exactly the consequence we would expect when equal interval assumptions with respect
to welfare are violated.
Two other points from Figure VIII are worth emphasizing. The first is that differences in
inferences between the two scales require change. Cross-sectional comparisons between
the two scales result only in intercept differences (e.g., a score of 80 on the original scale
is equivalent to a score of 83). While achievement gaps have narrowed somewhat over
time, most of the change in NAEP scores takes place as children age. Below I present
achievement gap changes across time, and the relative stability of these gaps will be evident.
After, I show gap changes as children age, across cohort, and the consequence of scale
choice will be pronounced.
The second is to note the two versions of the interpolated lines, indicated by the solid black
curved line and the dashed maroon lines at the ends. The dashed maroon lines reflect variants
in extrapolation strategy. As mentioned earlier, point estimates are only available up to
original NAEP scores of 150 and 350 (indicated at about 80 and 110 in the standardized
metric). The MCI procedure that generated the black solid line extends the line using linear
extrapolation. To test whether gap changes are sensitive to this out-of-sample extrapolation,
I generate four alternative welfare-adjusted scale scores:
1 Shallow/shallow: by assuming no decrease in welfare below 150 and no increase in welfare above 350, the welfare-adjusted scale will be flat below and above the respective regions.
2 Shallow/steep: by assuming no decrease in welfare below 150 and twice as much welfare above 350, the welfare-adjusted scale will be flat below 150 and twice the slope (relative to the slope between 300 and 350) above 350.
3 Steep/steep: by assuming twice as much welfare loss below 150 (relative to the slope between 200 and 150) and twice as much welfare gain above 350, the welfare-adjusted scale will be twice the slope below and above the respective regions.
4 Steep/shallow: by assuming twice as much welfare loss below 150 and no welfare gain above 350, the welfare-adjusted scale will be twice the slope below and flat above the respective regions.
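The four variants above can be expressed compactly; the utility values are illustrative placeholders, and a simple linear interpolant stands in for the MCI fit inside the estimated range.

```python
import numpy as np

# Hypothetical estimated utilities at the five performance levels.
levels = np.array([150.0, 200.0, 250.0, 300.0, 350.0])
betas = np.array([0.0, 0.9, 1.6, 2.1, 2.3])

# Per-unit slopes of the two end segments (200 -> 150 and 300 -> 350).
slope_bottom = (betas[1] - betas[0]) / 50.0
slope_top = (betas[-1] - betas[-2]) / 50.0

def extrapolated(score, bottom, top):
    """bottom/top: 'shallow' (flat beyond the endpoint) or
    'steep' (twice the slope of the adjacent estimated segment)."""
    lo = 0.0 if bottom == "shallow" else 2.0 * slope_bottom
    hi = 0.0 if top == "shallow" else 2.0 * slope_top
    if score < 150.0:
        return betas[0] + lo * (score - 150.0)
    if score > 350.0:
        return betas[-1] + hi * (score - 350.0)
    return np.interp(score, levels, betas)  # placeholder for the MCI fit

variants = {
    (b, t): extrapolated(100.0, b, t)
    for b in ("shallow", "steep") for t in ("shallow", "steep")
}
```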
When comparing gap changes across cohort, I will test whether results differ with any of
the four variants.
Achievement gaps: Across years
I now turn to differences in mean achievement between white and black students across
time. Changes in NAEP achievement across time have been relatively modest, and gap
changes have likewise been modest, so we would not expect dramatic differences in the
inferences drawn from the two scales. Nevertheless, looking at the top panel in Figure IX,
which describes changes in math gaps, we see differences in achievement gap trends for
ages 9 and 17. The trend in gap closure for 9-year-olds is decreasing with the NAEP scale
and increasing (slightly) with the adjusted scale. Looking at the bottom panel, which
describes changes in reading gaps, trends are very similar. This is because reading
gaps have been very stable, which is not the case for math gaps (Reardon, et al., 2012).
[Insert Figure IX Here]
Achievement gaps: Across age
The more salient demonstration of the effects of rescaling can be seen when we look at
achievement gap changes as children age. The NAEP is vertically equated, meaning examinees at ages 9, 13 and 17 are given an overlapping sample of test items at each age level.
While there are some concerns about the nature of the inference one can draw from vertical
equating, such cross-age comparisons are technically allowable with NAEP data (Haertel,
1991). In order to make cross-age comparisons, I use a sub-sample of cohorts for whom a
random sample of students are tested at age 9 in year t and tested again at age 17 in year
t + 8. The achievement growth for students from age 9 to 17 in year t to t + 8 is provided
for six cohorts c in both math and reading. Within each of these cohorts, because samples
of students are randomly drawn in each interval, it is possible to say that the achievement of
any subgroup g in cohort c grew or shrank by some amount, using both the original NAEP
and welfare-adjusted scales. The achievement gap for any cohort c is defined as the mean
white minus mean black score in years t and t + 8.
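As a worked example of this definition, using invented subgroup means on the standardized scale:

```python
# Hypothetical subgroup means on a scale standardized to mean 100, sd 10.
white = {"age9": 103.0, "age17": 104.0}
black = {"age9": 95.0, "age17": 97.5}

# Gap for cohort c: mean white minus mean black, at age 9 (year t)
# and again at age 17 (year t + 8).
gap_age9 = white["age9"] - black["age9"]
gap_age17 = white["age17"] - black["age17"]
gap_change = gap_age17 - gap_age9  # negative value: the gap narrowed with age
```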
[Insert Figure X Here]
[Insert Figure XI Here]
Figures X and XI present results for the possible six cohorts in math and reading,
respectively. Solid lines depict the original NAEP scale and dashed lines depict the welfare
adjusted scale. As hypothesized, in many instances, the inferences we would draw from
the adjusted scale depart in magnitude and sign when compared to the original scale. In
math, the 1982 to 1990 cohort (depicted in green) is as described in Figure VIII, and we
observe here what was described there: a rate of gap closure that is steeper in the original
NAEP metric than in the welfare scale. In the 1978-1986, 1992-2000, and 1996-2004 cohorts,
the signs of the gap changes are reversed. Whether or not trends are reversed between the two scales will
be a function of the size, location and rate of change of the subgroup’s respective mean
achievement.
In reading, the departure between the original and adjusted scales is much more pronounced. Using the adjusted scale, the reading achievement gap is shown to be decreasing
by about 6 to 10 points for every cohort. Conversely, using the original scale, the gap is
decreasing by about 1 to 3 points in four cohorts and increasing by 1 to 2 points in two
cohorts. This can be traced back to Figure VI, where we observed a steep change in slope
beginning at NAEP score 250. Gains below 250 are very steep, while gains above are
much more shallow. If black scores, between ages 9 and 17, move along the back half of
the curve, while white scores move along the front half of the curve, gap decreases will be
much larger.
Finally, in order to test the sensitivity of our inferences to the extrapolation that takes
place for scores less than 150 and greater than 350, I present four “test” averages alongside
the previously shown MCI score. These “test” averages manipulate the extrapolation by
increasing or decreasing the rate of change below 150 and above 350, as described above.
As can be seen in Figures XII and XIII, the method of extrapolation has little bearing on
outcomes.
[Insert Figure XII Here]
[Insert Figure XIII Here]
V. Conclusion
Overall, I have demonstrated that welfare benefits of different achievement states described
by the NAEP are not equal interval. In contrast to existing methods, the technique I propose
provides a direct and explicit description of the welfare gains from different achievement
states. Moreover, instead of linking achievement to earnings, I have suggested that the
benefits of achievement can be described inclusively, meaning that achievement need not
serve merely pecuniary purposes. With the proposed method, the inferences we draw about
changes in achievement and changes in achievement gaps (especially as children age) will
differ depending on which scale we use. Which scale ought to be used is, I have argued,
application sensitive. When descriptions of academic ability are desired, or when we wish
to know how much more or less some subgroups know about math and reading relative to
other subgroups, the original NAEP scale allows for such inferences. When, however, we
wish to derive some additional inference from the scale—for instance, when an achievement score is used as an outcome variable, when a score is used for cost-effectiveness
evaluation, or when we wish to evaluate whether a narrowing of the achievement gap is
“good” or “bad”—the original NAEP scale is inadequate. It fails to accurately describe
benefit in any meaningful way. In this paper, I have described and implemented a method
that allows for such values-based inferences.
In light of the previous discussion, I would like to revisit the four questions that were
raised at the start of this essay.
1 To what outcome should scale scores be indexed?
2 Whose preferences for achievement should be included in the index?
3 How should the index balance individual and social benefits?
4 How should time be modeled in the elicitation and estimation of the utility value?
In this essay, I have supplied answers to each of these questions. Outcomes are indexed
to survey respondents’ understanding of how much welfare is attributable to certain levels
of achievement; college educated respondents are included in the index; equity is given
zero weight in the model; time is modeled cross-sectionally. Whether or not these choices
are the correct ways to link achievement to outcomes is not known, but the choices inherent
to the inference are here made explicit.
Contrast the approach detailed here to cases in which achievement scores are used as outcome
variables. With the use of achievement scores, even if the equal-interval assumption holds,
the implicit assumptions of the model are that benefits are best characterized by ability
differences, that all ability differences are equally beneficial, and that all benefits are
individual (and not societal) and best characterized by a cross-section in time. These
assumptions lack theoretical and, as demonstrated here, empirical warrant; nevertheless,
they form the basis of the great majority of education policy evaluations. Education policy
evaluation will be greatly improved when the implicit assumptions underlying the use of
traditional achievement scores are made explicit.
VI. Figures
Figure I: Stylized Welfare Returns to Achievement
[Figure: y-axis: Benefit/Utility; x-axis: Scale Score (150–350); series: Achievement, Utility]
This figure depicts two stylized representations of the welfare returns to achievement. The black line assumes that welfare returns are equal interval, meaning that a 50-unit increase in achievement corresponds to a 50-unit increase in utility. The gray line presents a hypothetical relationship between achievement and welfare in which a 50-unit gain in achievement at the bottom of the scale equates to a much larger welfare gain than a 50-unit gain at the top of the scale.
Figure II: Survey Example: Ranking Exercise
This is a screen shot (1 of 3) from the online survey experiment administered to 2351 respondents through Amazon's Mechanical Turk software. This task asked respondents to rank 3 reading performance level descriptors in terms of difficulty. Respondents were randomly assigned either the reading or math subject and 3 of 5 performance level descriptors (with replacement).
Figure III: Survey Example: Introduction to Choice Exercise
This is a screen shot (2 of 3) from the online survey experiment administered to 2351 respondents through Amazon's Mechanical Turk software. In this screen shot, the choice task is introduced to respondents. Respondents are informed that the two profiles, Persons A and B, are equal in all respects except that they differ in their reading and math abilities. They are instructed to select which person will be better off between the two. In paragraph 3, Persons A and B are also randomly assigned an age, which can be either 9, 13 or 17.
Figure IV: Survey Example: Choice Exercise
This is a screen shot (3 of 3) from the online survey experiment administered to 2351 respondents through Amazon's Mechanical Turk software. In this screen shot, the choice task is presented to respondents. Respondents are randomly assigned a reading and math performance level descriptor for Persons A and B, with replacement. Performance level descriptors are taken from the NAEP-LTT and can be seen in Tables I and II. At the bottom of the choice task, respondents select which person (A or B) they think would be better off, “all things considered.”
Figure V: Estimated Beta Coefficients for Reading and Math, Age Interactions
[Figure: panels (a) Math and (b) Reading, each with sub-panels for Ages 9, 13 and 17; x-axis: scale scores 150–350; y-axis: estimated coefficients]
This figure depicts point estimates from the logistic regression in Equation (1.9). Point estimates indicate the probability of a respondent selecting a profile with a math (top panel) or reading (bottom panel) performance level descriptor equal to 200, 250, 300 or 350 (relative to the omitted category, 150). The solid line is drawn using precision-weighted cubic regression through the estimates; the dashed line is drawn using precision-weighted quadratic regression. Each point is sized to indicate precision (i.e., larger points have smaller standard errors).
Figure VI: Monotonic Cubic Interpolation of Math and Reading Beta Coefficients
[Figure: top row: Math and Reading (age interactions dropped); lower rows: Ages 9, 13 and 17 for each subject; x-axis: scale scores 100–500; y-axis: interpolated coefficients]
This figure takes point estimates from Equations (1.8) and (1.9) and performs piecewise monotone cubic interpolation (MCI) according to Fritsch and Carlson (1980) for the scale range 100 to 500. Extrapolation for points less than 150 and greater than 350, respectively, is done via linear extrapolation of the two most proximal points, e.g., linear extrapolation based on points 151 and 150 and on points 349 and 350, respectively. The top panel drops age interactions (Equation 1.8) and the bottom panel estimates Equation (1.9).
Figure VII: Changes in Scale and Welfare Scores across Age: “Jump” Problem
[Figure: stylized curves for Ages 9 and 17, with annotations “Increase in X” and “Decrease in Y”]
Stylized depiction showing how monotonicity within age need not lead to monotonicity across age, i.e., the “jump” problem. In this representation, benefits are monotonically increasing for ages 9 and 17, but as achievement increases from age 9 to 17, there is a downward “jump” in welfare. This is because the choice task is cross-sectional, asking respondents about their preferences for achievement states and not achievement growth.
Figure VIII: White and Black Changes in Math Achievement across Age: Example Cohort
[Figure: x-axis: Standardized NAEP Scale (60–130); y-axis: Standardized MCI (Average Age) Scale (80–120); series: Blacks, Whites; annotations: White-Black Gap at Age 9 and Age 17]
This figure depicts standardized original and welfare-adjusted NAEP scores for one cohort of students at ages 9 and 17 in years 1982 and 1990 (for 1 of 5 plausible values). Solid intersecting lines correspond to mean black scores of 9- and 17-year-olds in 1982 and 1990, respectively. Dashed intersecting lines correspond to mean white scores for the same ages and years. Achievement gaps are represented as the difference between dashed and solid lines at ages 9 and 17 along both the x- and y-dimensions of the graph. In order to test the sensitivity of gap estimates across cohorts to extrapolation, I construct artificial point estimates at NAEP scores of 100 and 500 that are equal to (a) estimated scores at 150 and 350; or (b) twice the slope of scores from 200 to 150 and 300 to 350. In other words, I simulate welfare gains at the bottom and top of the distribution that are either (a) worth no less or no more than the next estimated score or (b) worth twice as much/little as the previous estimated change. These alternative extrapolation lines are indicated in maroon.
Figure IX: Mean White Minus Black Scores over Time
[Figure: panels (a) Math and (b) Reading, each for Ages 9, 13 and 17; x-axis: assessment years; y-axis: Standardized White-Black Mean Difference; legend: NAEP, MCI (Average Age)]
This figure depicts mean white minus mean black scores across time for ages 9, 13 and 17 in math and reading, respectively. The dashed line corresponds to the original NAEP scale; the solid line corresponds to the welfare-adjusted scale from Equation (1.8), dropping age interactions (standardized to µ = 100 and σ = 10).
Figure X: Mean White Minus Mean Black Math Scores across Age, by Cohort
[Figure: x-axis: assessment years; y-axis: Standardized White-Black Mean Difference; legend: NAEP, MCI (Average Age)]
This figure depicts mean white minus mean black math achievement for six cohorts of students aged 9 and 17 in years t and t + 8. Solid lines correspond to the original NAEP scale; dashed lines to the welfare-adjusted scale. Each line reflects the change in the white-black achievement gap as one cohort of students changes in achievement between the ages of 9 and 17.
Figure XI: Mean White Minus Mean Black Reading Scores across Age, by Cohort
[Figure: x-axis: assessment years; y-axis: Standardized White-Black Mean Difference; legend: NAEP, MCI (Average Age)]
This figure depicts mean white minus mean black reading achievement for six cohorts of students aged 9 and 17 in years t and t + 8. Solid lines correspond to the original NAEP scale; dashed lines to the welfare-adjusted scale. Each line reflects the change in the white-black achievement gap as one cohort of students changes in achievement between the ages of 9 and 17.
Figure XII: Sensitivity to Extrapolation: White-Black Math Gap across Age, by Cohort
[Figure: x-axis: assessment years; y-axis: Standardized White-Black Mean Difference; legend: MCI, Shallow/Steep, Steep/Shallow, Steep/Steep, Shallow/Shallow]
This figure depicts changes in white-black gaps across cohorts using monotone cubic interpolation and four variants of extrapolation. The four variants are: shallow bottom/shallow top; shallow bottom/steep top; steep bottom/steep top; steep bottom/shallow top. Shallow indicates that utility gains below 150 or above 350 are worth no more beyond that threshold. Steep indicates that utility gains below 150 and above 350 are worth twice as much as they were from 200 to 150 and from 300 to 350.
Figure XIII: Sensitivity to Extrapolation: White-Black Reading Gap across Age, by Cohort
[Figure: x-axis: assessment years; y-axis: Standardized White-Black Mean Difference; legend: MCI, Shallow/Steep, Steep/Shallow, Steep/Steep, Shallow/Shallow]
This figure depicts changes in white-black gaps across cohorts using monotone cubic interpolation and four variants of extrapolation. The four variants are: shallow bottom/shallow top; shallow bottom/steep top; steep bottom/steep top; steep bottom/shallow top. Shallow indicates that utility gains below 150 or above 350 are worth no more beyond that threshold. Steep indicates that utility gains below 150 and above 350 are worth twice as much as they were from 200 to 150 and from 300 to 350.
VII. Tables
Table I: Reading Performance Level Descriptors
Level 150: Carry Out Simple, Discrete Reading Tasks
Readers at this level can follow brief written directions. They can also select words, phrases, or sentences to describe a simple picture and can interpret simple written clues to identify a common object. Performance at this level suggests the ability to carry out simple, discrete reading tasks.
Level 200: Demonstrate Partially Developed Skills and Understanding
Readers at this level can locate and identify facts from simple informational paragraphs, stories, and news articles. In addition, they can combine ideas and make inferences based on short, uncomplicated passages. Performance at this level suggests the ability to understand specific or sequentially related information.
Level 250: Interrelate Ideas and Make Generalizations
Readers at this level use intermediate skills and strategies to search for, locate, and organize the information they find in relatively lengthy passages and can recognize paraphrases of what they have read. They can also make inferences and reach generalizations about main ideas and the author's purpose from passages dealing with literature, science, and social studies. Performance at this level suggests the ability to search for specific information, interrelate ideas, and make generalizations.
Level 300: Understand Complicated Information
Readers at this level can understand complicated literary and informational passages, including material about topics they study at school. They can also analyze and integrate less familiar material about topics they study at school as well as provide reactions to and explanations of the text as a whole. Performance at this level suggests the ability to find, understand, summarize, and explain relatively complicated information.
Level 350: Learn from Specialized Reading Materials
Readers at this level can extend and restructure the ideas presented in specialized and complex texts. Examples include scientific materials, literary essays, and historical documents. Readers are also able to understand the links between ideas, even when those links are not explicitly stated, and to make appropriate generalizations. Performance at this level suggests the ability to synthesize and learn from specialized reading materials.
Reading Performance Level Descriptors for the National Assessment of Educational Progress, Long-Term Trend. Available at: https://nces.ed.gov/nationsreportcard/ltt/reading-descriptions.aspx
Table II: Math Performance Level Descriptors
Level 150: Simple Arithmetic Facts
Students at this level know some basic addition and subtraction facts, and most can add two-digit numbers without regrouping. They recognize simple situations in which addition and subtraction apply. They also are developing rudimentary classification skills.
Level 200: Beginning Skills and Understandings
Students at this level have considerable understanding of two-digit numbers. They can add two-digit numbers but are still developing an ability to regroup in subtraction. They know some basic multiplication and division facts, recognize relations among coins, can read information from charts and graphs, and use simple measurement instruments. They are developing some reasoning skills.
Level 250: Numerical Operations and Beginning Problem Solving
Students at this level have an initial understanding of the four basic operations. They are able to apply whole number addition and subtraction skills to one-step word problems and money situations. In multiplication, they can find the product of a two-digit and a one-digit number. They can also compare information from graphs and charts, and are developing an ability to analyze simple logical relations.
Level 300: Moderately Complex Procedures and Reasoning
Students at this level are developing an understanding of number systems. They can compute with decimals, simple fractions, and commonly encountered percents. They can identify geometric figures, measure lengths and angles, and calculate areas of rectangles. These students are also able to interpret simple inequalities, evaluate formulas, and solve simple linear equations. They can find averages, make decisions based on information drawn from graphs, and use logical reasoning to solve problems. They are developing the skills to operate with signed numbers, exponents, and square roots.
Level 350: Multistep Problem Solving and Algebra
Students at this level can apply a range of reasoning skills to solve multistep problems. They can solve routine problems involving fractions and percents, recognize properties of basic geometric figures, and work with exponents and square roots. They can solve a variety of two-step problems using variables, identify equivalent algebraic expressions, and solve linear equations and inequalities. They are developing an understanding of functions and coordinate systems.
Math Performance Level Descriptors for the National Assessment of Educational Progress, Long-Term Trend. Available here: https://nces.ed.gov/nationsreportcard/ltt/math-descriptions.aspx
Table III: Results from Ranking Exercise

                          Reading                         Math
                   Mean       Mean-by-Distance    Mean       Mean-by-Distance
Level 150          0.751 ***  0.647 ***           0.691 ***  0.531 ***
                   (0.026)    (0.063)             (0.027)    (0.066)
  Distance 150                0.161 *                        0.143
  Distance 200                0.100                          0.228 **
Level 200          0.788 ***  0.733 ***           0.661 ***  0.585 ***
                   (0.027)    (0.044)             (0.027)    (0.047)
  Distance 150                0.059                          0.062
  Distance 200                0.153 *                        0.258 **
Level 250          0.672 ***  0.603 ***           0.721 ***  0.664 ***
                   (0.026)    (0.037)             (0.027)    (0.038)
  Distance 150                0.163 **                       0.062
  Distance 200                0.106                          0.238 **
Level 300          0.697 ***  0.590 ***           0.711 ***  0.713 ***
                   (0.026)    (0.045)             (0.027)    (0.047)
  Distance 150                0.155 **                       0.072
  Distance 200                0.184 *                        -0.223 **
Level 350          0.627 ***  0.478 ***           0.637 ***  0.531 ***
                   (0.027)    (0.066)             (0.027)    (0.066)
  Distance 150                0.140                          0.136
  Distance 200                0.197 **                       0.122
N                  1455                            1461
Respondents        485                             487

Notes: The regression model estimates the linear probability that respondents ranked a performance level descriptor correctly. The sample excludes respondents if (a) they were randomly assigned “ties” or (b) they did not rank all three items. The column “Mean” describes the percent of reading or math level descriptors ∈ {150, 200, 250, 300, 350} ranked correctly. “Mean-by-Distance” disaggregates these percentages into three categories, according to whether the cumulative distance of the three descriptors summed to 100, 150, or 200 (e.g., a random draw of 150, 200, 250 sums to 100). Stars indicate * for p<.05, ** for p<.01, and *** for p<.001. The Mean-by-Distance test is relative to the omitted category, Distance 100.
Chapter 4
The sensitivity of causal estimates from
Court-ordered finance reform on
spending and graduation rates (with
Christopher Candelaria)
Abstract

We provide new evidence about the effect of court-ordered finance reform on per-pupil revenues and graduation rates. We account for cross-sectional dependence and heterogeneity in the treated and counterfactual groups to estimate the effect of overturning a state’s finance system. Seven years after reform, the highest poverty quartile in a treated state experienced a 4 to 12 percent increase in per-pupil spending and a 5 to 8 percentage point increase in graduation rates. We subject the model to various sensitivity tests. In most cases, point estimates for graduation rates are within 2 percentage points of our preferred model.
I. Introduction
Whether school spending has an effect on student outcomes has been an open question in the economics
literature, dating back to at least Coleman (1966). The causal relationship between spending and desirable
outcomes is of obvious interest, as the share of GDP that the United States spends on public elementary
and secondary education has remained fairly large throughout the past five decades, ranging from 3.7 to 4.5
percent.1 Given the large share of spending on education, it would be useful to know if these resources
are well spent. Despite this interest, we lack stylized facts about the effects of spending on changes in
student academic and adult outcomes. The goal of this paper is to provide a robust description of the causal
relationship between fiscal shocks and student outcomes at the district level for US states undergoing financial
reform for the period 1990-91 to 2009-10.
The opportunity for more robust descriptions of causal relationships with panel data emerges from two
sources. The first is that data collection efforts have extended the time dimension of panel data, allowing for
more sophisticated tests of the identifying assumptions of quasi-experimental methods such as differences-
in-differences estimators. Previous research efforts on the effects of school spending were hampered by
data limitations such as this.2 The second is that recent econometric methods have been developed to better
model unobserved treatment heterogeneity, counterfactual trends, pre-treatment trends (correlated random
trends), and cross-sectional dependence (CSD). If any of these unobserved components are correlated with
regressors, then the econometric model will be biased. Fortunately, it is possible to test for a wide variety of
model specifications to determine the sensitivity of causal estimates to modeling choice.
Using district-level aggregate data from the Common Core of Data (CCD), we estimate the effects of fiscal shocks, where fiscal shocks are defined as the first state Supreme Court ruling that overturns a given state’s finance system, on the natural logarithm of per-pupil spending and graduation rates. Our estimation approach is
designed to handle three aspects of the identification problem: first, treatment occurs at the state level; second,
there is treatment effect heterogeneity within states; third, there is heterogeneity in the pre-treatment trends
between treatment and control units.
At the state level, we are interested in identifying a plausibly exogenous shock to the state’s finance system.
Here, we wish to control for the presence of cross-sectional dependence (CSD), which can arise if there is
1 These estimates come from the Digest of Education Statistics, 2013 edition. As of the writing of this draft, the 2013 version is the most recent publication available.
2See for example, Hoxby (2001) and Card and Payne (2002).
cross-sectional heterogeneity in response to common shocks, spatial dependence or interdependence. We
control for CSD by including interactive fixed effects, as suggested by Bai (2009). These interactive fixed
effects are estimated at the state level and control for unobserved correlations between states in the panel.
We account for two forms of treatment heterogeneity. The first is heterogeneity that takes place within
treatment and control groups. It is known that unmodeled treatment heterogeneity can lead to bias if (a)
the probability of treatment assignment varies with a variable X and (b) the treatment effect varies with
a variable X (Angrist, 1995; Angrist, 2004; Elwert, 2010). Here, we decompose treatment and control
groups by constructing a time-invariant poverty quartile indicator variable equal to 1 if a state’s district is
in one of four poverty quartiles for year 1989. Each of these poverty quartile variables is interacted with a
treatment year variable, for a total of 76 (19 years x 4 quartiles) treatment interactions. In order to provide
a counterfactual for each of these poverty quartiles, we estimate a poverty quartile-by-year secular trend
indexed by poverty quartile-by-year fixed effects. Estimating treatment heterogeneity in this way can provide
an unbiased estimate of the treatment effect if we have identified the correct heterogeneous variable.
A second concern is heterogeneity between treatment and control groups, such as would occur if poverty
quartiles in treated states have different pre-treatment trends than poverty quartiles in non-treated states. This
suggests that some units (e.g., high poverty districts) of treated states have different secular trends than their
counterparts in non-treated states. Quartile-by-year fixed effects assume equivalent secular trends between
treated and control units; results will be biased if this assumption is not met. To account for this, we estimate
state-by-poverty quartile linear time trends (referred to as correlated random trends), for a total of 192 (48
states x 4 quartiles x linear time) continuous fixed effects. Effectively, we assume that treatment and control
differences in secular trends can be controlled for with functional form assumptions on the time trend. These
terms provide pre-treatment balance with respect to state-poverty quartile trends in the dependent variable.
All together, we estimate a heterogeneous differences-in-differences model that accounts for (a) cross-
sectional dependence at the state level, (b) a poverty quartile-by-year secular trend, and (c) state-by-poverty
quartile linear time trends. In this preferred specification, we find that high poverty districts in states that had
their finance regimes overturned by court order experienced an increase in log spending of 4 to 12 percent
and an increase in graduation rates of 5 to 8 percentage points seven years following reform.
We then subject the model to various sensitivity tests by permuting the interactive fixed effects, secular
time trends, and correlated random trends. In total we estimate 15 complementary models. Generally, the
results are robust to model fit: relative to the preferred model, interactive fixed effects and the specification
of the secular time trend have modest effects on point estimates. The model is quite sensitive to the presence
of correlated random trends, however. When state-by-poverty quartile time trends are ignored or estimated
at a higher level of aggregation (the state), the effects of reform on graduation rates are zero and precisely
estimated. When we estimate the linear time trend using a lower level of aggregation (the district), point
estimates are similar to those of the preferred model. We conclude that treatment and control sub-state units
have different secular trends, but conditionally exogenous point estimates are available if we are willing to
assume that the sub-unit pre-treatment trends can be approximated with a functional form.
To test the extent to which results are equalizing, we estimate slightly different models, allowing the effect
of reform to be continuous across the poverty quantiles. That is, we interact treatment year variables with
a continuous poverty quantile variable, while controlling for secular changes in this continuous variable for
untreated states. This provides an estimate of the marginal change in graduation rate for a one-unit increase
in poverty percentile rank within a state. Here we see that the effect of reform was equalizing: for every 10
percentile increase in poverty within a treated state, per-pupil log revenues increased by 0.9 to 1.8 percent
and graduation rates increased by 0.5 to 0.85 percentage points in year seven.
Because we have aggregate data, one threat to identification would occur if treatment induced demographic
change and demographic variables correlate with outcomes. For instance, if state spending increased school
quality but kept property taxes down, high income parents (with children who are presumably more likely to
graduate) might relocate to schools housed in historically high poverty districts. To test for this possibility, we
estimate our “equalizing” models, substituting district-level demographic variables that could indicate propensity to graduate (percent poor, percent minority, and percent special education) for the original outcome variables. We find no evidence that the minority composition of high poverty districts changed after reform, but we do find that these districts experienced an increase in poverty and in the percent of students qualifying for special education. For our results to be biased in a meaningful way, we would have to assume that increases in poverty and special education rates positively affect graduation rates.
This paper makes substantive and methodological contributions. Substantively, we find that court cases
overturning a state’s financial system for the period 1991-2010 had an effect on revenues and graduation rates,
that these results are robust to a wide variety of modeling choices, and that this effect was equalizing. Taken
together, our two models show that states undergoing court-ordered finance reform both (a) increased revenues and
graduation rates in high poverty districts relative to high poverty districts in other states and (b) allocated
a greater share of these revenues to higher poverty districts within the state, relative to allocations taking
place in non-treated states, resulting in an increase in graduation rates. Methodologically, we emphasize the
variety of modeling strategies available to researchers using panel data sets, including specification of the
secular trend, correlated random trends, and cross-sectional dependence. While the researcher may argue for
a preferred model, justifiable alternatives are often available. Here we have presented a graphical method that
researchers can use to effectively and efficiently demonstrate the sensitivity of point estimates to modeling
choice.
II. Background
State-level court-ordered finance reform beginning in 1989 has come to be known as the “Adequacy Era.”
These court rulings are often treated as fiscal shocks to state school funding systems. A number of papers
have attempted to link these plausibly exogenous changes in spending to changes in other desired outcomes,
like achievement, graduation and earnings (Hoxby, 2001; Card and Payne, 2002; Jackson et al., 2015). The results of Card and Payne (2002) and Hoxby (2001) were in conflict, but both studies were hampered by data limitations: only a simple pre-post contrast between treatment and control states was available to the researchers, limiting their capacity to verify the identifying assumptions of the differences-in-differences model.
Most recently, Jackson, Johnson and Persico (2015) have constructed a much longer panel (with respect to
time), with restricted-use individual-level data reaching back to children born between 1955 and 1985, to test
the effects of these cases on revenues, graduation rates and adult earnings. Leveraging variation across cohorts
in exposure to fiscal reform, this study finds large effects from Court order on school spending, graduation
and adult outcomes, and these results are especially pronounced for individuals coming from low-income
households and districts.
For this study, the sample of students is taken from the Panel Study of Income Dynamics (PSID), which
is representative at the national level. Using these data, identification is leveraged from variation between
states over time, and results are disaggregated using within state variation around district-level income. One
particular concern is the possibility of spurious correlations resulting from the PSID sampling design. If
sampled individuals in low income districts in treated states are, by chance, more likely to respond to treat-
ment, then results will be biased. The use of population weights can exacerbate this problem, if the spurious
correlations induced by the non-representative sample correlate with the weights. The authors are aware of
this concern and test a model using Common Core graduation rate data, as we do here.3 They find a similar
pattern of results using the alternative data set.
The purpose of this paper is three-fold. First, it is important to show that results by Jackson and colleagues
(2015) can be replicated across other data sets. Here, we use data from the Common Core of Data (CCD), which
provides aggregate spending and graduation rates for the universe of school districts in the United States.
Both the PSID and CCD have a kind of Anna Karenina problem, in which each data set is unsatisfactory in
its own way. The PSID follows individual students over time but has unobserved sampling issues that may
correlate with treatment; the CCD contains the universe of districts but does not follow students over time
and may not reveal sorting within districts. If results are qualitatively similar across different data, we can
feel more confident that estimates do not reveal spurious correlations resulting from the sample generating
process. Second, it is important to show that results are insensitive to similarly compelling modeling choices.
While Jackson and colleagues (2015) find a pattern of results from the CCD that is largely consistent with
those results from the PSID, they do not test to see whether those results are sensitive to model specification.
If we are to believe that results from the CCD sample largely corroborate results from the PSID, we must
show that the identifying assumptions using the CCD sample are met. Our purpose is to present results
that are robust to modeling choices that account for secular trends, correlated random trends, and cross-
sectional dependence. In so doing, we provide upper and lower bounds on effect sizes by permuting these
parameters. Third, and finally, we provide new evidence about the extent to which Court-ordered finance
reforms increased levels of spending and graduation rates, as well as the extent to which these same states
equalized resources and graduation rates across poverty quantiles, relative to equalizing efforts made in states
without reform.
III. Data
Our data set is the compilation of several public-use surveys that are administered by the National Center for
Education Statistics and the U.S. Census Bureau. We construct our analytic sample using the following data
sets: Local Education Agency (School District) Finance Survey (F-33); Local Education Agency (School
3 See Appendix B in the NBER working paper, found here: http://www.nber.org/papers/w20118.pdf. In the final version (Appendix N), the authors test whether their estimates generalize to districts not included in the PSID, which is the same test for a different concern. External and internal validity are threatened if the PSID has spurious correlations or does not generalize.
District) Universe Survey; Local Education Agency Universe Survey Longitudinal Data File: 1986-1998
(13-year); Local Education Agency (School District) Universe Survey Dropout and Completion Data; and
Public Elementary/Secondary School Universe Survey.4
Our sample begins in the 1990-91 school year and ends in the 2009-10 school year. The data set is a
panel of aggregated data, containing United States district and state identifiers, indexed across time. The
panel includes the following variables: counts of free lunch eligible (FLE) students, per pupil log and total
revenues, percents of 8th grade students receiving diplomas 4 years later (graduation rates), total enrollment,
percents of students that are black, Hispanic, minority (an aggregate of all non-white race groups), special
education, and children in poverty. Counts of FLE students are turned into district-specific percentages, from
which within state rankings of districts based on the percents of students qualifying for free lunch are made.
Using FLE data from 1989-90, we divide states into FLE quartiles, where quartile 4 is the top poverty quartile
for the state.5 Total revenues are the sum of federal, local, and state revenues in each district. We divide this
value by the total number of students in the district and deflate by the US CPI, All Urban Consumers Index
to convert the figure to real terms. We then take the natural logarithm of this variable. Our graduation rates
variable is defined as the total number of diploma recipients in year t as a share of the number of 8th graders
in year t − 4, a measure which Heckman (2010) shows is not susceptible to the downward bias caused by
using lagged 9th grade enrollment in the denominator. We top-code graduation rates so that they take a
maximum value of one.6 The demographic race variables come from the school-level file from the Common
Core and are aggregated to the district level; percents are calculated by dividing by total enrollment. Special
education counts come from the district level Common Core. Child poverty is a variable we take from the
Small Area Income and Poverty Estimates (SAIPE).
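As a concrete sketch of this variable construction, consider the following Python fragment. This is our illustration only: the CPI deflators and district records are hypothetical placeholders (the paper uses the CPI-U series and CCD files), and all function names are ours.

```python
import math

# Hypothetical CPI-U deflators keyed by year (placeholders, not BLS values).
CPI = {1991: 0.61, 2010: 0.98}

def real_log_ppr(total_revenue, enrollment, year, cpi=CPI):
    """Natural log of real per-pupil revenue (federal + state + local)."""
    return math.log((total_revenue / enrollment) / cpi[year])

def graduation_rate(diplomas_t, eighth_graders_t_minus_4):
    """Diplomas in year t over 8th-grade enrollment in t-4, top-coded at 1."""
    return min(diplomas_t / eighth_graders_t_minus_4, 1.0)

def fle_quartiles(districts):
    """Within-state FLE poverty quartiles (quartile 4 = highest poverty).

    `districts`: list of (district_id, pct_fle) pairs for a single state.
    """
    ranked = sorted(districts, key=lambda d: d[1])
    n = len(ranked)
    return {d_id: min(4, 1 + (4 * i) // n) for i, (d_id, _) in enumerate(ranked)}
```

Because quartiles are fixed at their 1989-90 values, the quartile assignment runs once per state and is then merged onto every year of the panel.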
To define our analytic sample, we place some restrictions on the data, and we address an issue with New
York City Public Schools (NYCPS). First, we drop Hawaii and the District of Columbia from our sample,
as each place has only one school district. We also dropped Montana from our analysis because they were
missing a substantial amount of graduation rate data. We keep only unified districts to exclude non-traditional
districts and to remove charter-only districts. We define unified districts as those districts that serve students in
either Pre-Kindergarten, Kindergarten, or 1st grade through the 12th grade. For the variables total enrollment,
4 Web links to each of the data sources are listed in the appendix.
5 Missing FLE data for that year were addressed by NCES interpolation methods. Data are found in the Local Education Agency Universe Survey Longitudinal Data File (13-year).
6 In Appendix IX., we describe where the data were gathered and cleaned, including URL information for where data can be found.
graduation rates and FLE, NYCPS reports its data as 33 geographic districts in the nonfiscal surveys; for total
revenues, NYCPS is treated as a consolidated district. For this reason, we needed to combine the non-fiscal
data into a single district. As suggested in the NCES documentation, we use NYCPS’s supervisory union
number to aggregate the geographical districts into a single entity.
We noticed a series of errors for some state-year-dependent variable combinations. In some states, counts
of minority students were mistakenly reported as 0, when in fact they were missing. This occurred in Ten-
nessee, Indiana, and Nevada in years 2000-2005, 2000, and 2005, respectively. The special education variable
had two distinct problems. For three states it was mistakenly coded as 0 when it should have been coded as
missing. This occurred in Missouri, Colorado, and Vermont in years 2004, 2010 and 2010, respectively. We
also observed state-wide 20 percentage point increases in special education enrollment for two states, which
immediately returned to pre-spike levels in the year after. This occurred in Oregon and Mississippi in years
2004 and 2007, respectively. Finally, graduation rate data also spiked dramatically before returning to pre-
spike levels in three state-years. This occurred in Wyoming, Kentucky and Tennessee in years 1992, 1994
and 1998, respectively. In each of these state-year instances where data were either inappropriately coded as
zero or fluctuated due to data error, we coded the value as missing.
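In code, this cleaning step reduces to a lookup of known-bad state-year-variable cells. The sketch below uses our own field names; the cell list mirrors the errors described above.

```python
# (state, year, variable) cells known to be bad: zeros that should be
# missing, or one-year spikes that revert immediately (see text).
BAD_CELLS = {
    ("TN", 2000, "pct_minority"), ("TN", 2001, "pct_minority"),
    ("TN", 2002, "pct_minority"), ("TN", 2003, "pct_minority"),
    ("TN", 2004, "pct_minority"), ("TN", 2005, "pct_minority"),
    ("IN", 2000, "pct_minority"), ("NV", 2005, "pct_minority"),
    ("MO", 2004, "pct_sped"), ("CO", 2010, "pct_sped"),
    ("VT", 2010, "pct_sped"),
    ("OR", 2004, "pct_sped"), ("MS", 2007, "pct_sped"),
    ("WY", 1992, "grad_rate"), ("KY", 1994, "grad_rate"),
    ("TN", 1998, "grad_rate"),
}

def clean_cell(state, year, variable, value):
    """Recode known-bad state-year-variable cells to missing (None)."""
    return None if (state, year, variable) in BAD_CELLS else value
```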
To our analytic sample, we add the first year a state’s funding regime was overturned in the Adequacy
Era. The base set of court cases comes from Corcoran and Evans (2008), and we updated the list using data
from the National Education Access Network.7 Table I lists the court cases we are considering. As shown,
there are a total of twelve states that had their school finance systems overturned during the Adequacy Era.
Kentucky was the first to have its system overturned in 1989 and Alaska was the most recent; its finance
system was overturned in 2009.
[Insert Table I Here]
Table II provides summary statistics of the key variables in our interpolated data set, excluding New York
City Public Schools (NYCPS).8 We have a total of 188,752 district-year observations. The total number of
unified districts in our sample is 9,916. The average graduation rate is about 77 percent and average log per
pupil spending is 8.94 (total real per pupil revenues are about $7,590). When we do not weight our data by
district enrollment, we obtain similar figures, but they are slightly larger.
7 The National Education Access Network provides up-to-date information about school finance reform and litigation. Source: http://schoolfunding.info/.
8 We drop NYCPS because it is an outlier district in our data set. We provide a detailed explanation of why we do this in our results section.
[Insert Table II Here]
IV. Econometric specifications and model sensitivity
In this section, we describe our empirical strategy to estimate the causal effects of school finance reform at the
state level on real log revenues per student and graduation rates at the district level. We begin by positing our
preferred model, which is a differences-in-differences equation that models treatment heterogeneity across
FLE poverty quartiles. We then explain what each of the parameter choices are designed to control for and
why they are selected. Because treatment occurs at the state level and our outcomes are at the district level,
there are several ways to specify the estimating equation. For example, there are choices about whether and
how to model the counterfactual time trend and how to adjust for correlated random trends (i.e., pre-treatment
trends) and unobservable factors such as cross-sectional dependence. We outline these alternative modeling
choices and discuss their implications relative to the benchmark model.
IV.A. Benchmark differences-in-differences model
To identify the causal effects of finance reform, we leverage the plausibly exogenous variation resulting from
state Supreme Court rulings overturning a given state’s fiscal regime. Prior education finance studies have also
relied on the exogenous nature of court rulings to estimate causal effects on fiscal and academic outcomes
(see, for example, Sims, 2011; Jackson, et al., 2015). After a lawsuit is filed against a state’s education
funding system, we assume that the timing of the Court’s ruling is unpredictable. Under this assumption,
the decision to overturn a funding system defines treatment and the date of the decision constitutes a random
shock.
With panel data, the exogenous timing of court decisions generates a natural experiment that can be mod-
eled using a differences-in-differences framework. States were subject to reform in different years and not
all states had reform, which provides treatment and control groups over time. Our benchmark differences-in-
differences model takes the following form:
Ysqdt = θd + δtq + ψsqt + P′st βq + λ′s Ft + εsdt,        (4.1)
where Ysqdt is an outcome of interest—real log revenues per student or graduation rates—in state s, in poverty
quartile q, in district d, in year t; θd is a district-specific fixed effect; δtq is a time by poverty quartile-specific
fixed effect; ψsqt is a state by quartile-specific linear time trend; Pst is a policy variable that takes value 1
in the year state s has its first reform and remains value 1 for all subsequent years following reform; λ′sFt is
a factor structure that accounts for cross-sectional dependence at the state level; and εsdt is assumed to be a
mean zero, random error term. To account for serial correlation, all point estimates have standard errors that
are clustered by state, the level at which treatment occurs (Bertrand, Duflo and Mullainathan, 2004).
Our parameters of interest are the βq , which are the causal estimates of school finance reform in quartile
q on Ysqdt. We define q such that q ∈ {1, 2, 3, 4}, and the highest level of poverty is represented by quartile
4.9 Throughout our paper, we parameterize the policy variable P′st such that each poverty quartile’s effect is
estimated, so we do not have an omitted quartile. Moreover, we allow treatment effects to have a dynamic
treatment response pattern (Wolfers, 2006). Each βq , therefore, is a vector of average effect estimates in the
years after reform in quartile q.
Overall, we estimate 19 treatment effects for each poverty quartile’s vector of effects. Although there are
21 effects we can potentially estimate, we combine treatment effect years 19 through 21 into a single binary
indicator, as there are only two treatment states that contribute to causal identification in these later years.10 In
reporting our results, we only report effect estimates for years 1 through 7 after reform. We do this because
estimating treatment effects several years after treatment occurs results in precision loss and because very
few states were treated early enough to contribute information to treatment effect estimates in later years
(see Table I). All together, this model absorbs approximately 9,800 district fixed effects, 76 year effects, 192
continuous fixed effects (state-by-FLE quartile interacted with linear time), as well as the factor variables
λ′sFt (i.e., the interactive fixed effects). The estimated factors and factor loadings λ′sFt can be decomposed
into the covariates θsFt + δtλs and entered directly into the regression model, thus allowing us to cluster the
standard errors at the state level.11 These non-treatment parameters are eliminated using high dimensional
fixed effects according to the Frisch-Waugh-Lovell theorem (Frisch, 1933; Lovell, 1963).12
9 Using FLE data from 1989-90, we divide states into FLE quartiles, where quartile 4 is the top poverty quartile for the state.
10 While our data sample has 20 years of data, we have up to 21 potential treatment effects, as KY had its reform in 1989 and our panel begins in the 1990-91 academic year. Therefore, KY does not contribute to a treatment effect estimate in the first year of reform, but it does contribute to effect estimates in all subsequent treatment years.
11 Appendix XII. shows how an estimated factor structure can be included in an OLS regression framework to obtain a variety of standard error structures.
12 The model is estimated in Julia using the packages FixedEffects.jl and SparseFactorModels.jl (Gomez, 2015a; 2015b).
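To make the treatment parameterization of Section IV.A concrete, the quartile-specific, dynamic event-time dummies can be sketched in Python. This is our illustration only: the reform-year mapping is truncated to the two states named in the text (Table I has the full list), and all names are ours.

```python
# Hypothetical truncated mapping of first-overturn years (see Table I).
REFORM_YEAR = {"KY": 1989, "AK": 2009}

def event_time_dummies(state, quartile, year, max_event_year=21, bin_start=19):
    """Return {(q, k): 0/1} treatment indicators for one district-year.

    k counts years since the state's first court-ordered reform (the year
    of the ruling is k = 1); event years 19-21 are pooled into one bin,
    and every quartile gets its own set of dummies (no omitted quartile).
    """
    dummies = {}
    reform = REFORM_YEAR.get(state)
    for q in (1, 2, 3, 4):
        for k in list(range(1, bin_start)) + ["19-21"]:
            on = 0
            if reform is not None and year >= reform:
                elapsed = year - reform + 1  # ruling year is event year 1
                if k == "19-21":
                    on = int(bin_start <= elapsed <= max_event_year)
                else:
                    on = int(elapsed == k)
            dummies[(q, k)] = on if q == quartile else 0
    return dummies
```

With 18 single-year indicators plus the pooled 19-21 bin, each quartile carries 19 dummies, matching the 76 (19 years x 4 quartiles) treatment interactions described in the Introduction.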
IV.B. Explaining model specifications
We now wish to articulate what the parameters from Equation (4.1) are controlling for and why they are
included in the model. Researchers are presented with a number of modeling choices, and our goal here is to
articulate the reasons we have for parameterizing the model the way we do. This also opens the possibility for
subjecting the model to sensitivity analysis, in order to determine upper and lower bounds of point estimates,
in relation to our preferred model. In particular, we examine choices related to cross-sectional dependence,
secular trends, and correlated random trends. We also consider the difference between OLS regression and
weighted least squares (WLS) regression, where the weights are measures of time varying district enrollment.
As will be shown in the results section, certain parameterizations of the differences-in-differences model have
substantial impacts on point estimates relative to our benchmark model, while many others do not.
Cross-sectional dependence
In our benchmark model, the terms λ′sFt specify that the error term has a factor structure that affects Ysqdt
and may be correlated with P ′st, the treatment indicators. Following Bai (2009), we define λs as a vector of
factor loadings and Ft as a vector of common factors. Each of these vectors is of size r, which is the number
of factors included in the model; in our model, we set r = 1. In the differences-in-differences framework,
the factor structure has a natural interpretation. Namely, the common factors Ft represent macroeconomic
shocks that affect all the units (e.g., recessions, financial crises, and policy changes), and the factor loadings
λs capture how states are differentially affected by these shocks. Of particular concern is the presence of
interdependence, which can result if one state’s Supreme Court ruling affects the chances of another state’s
ruling. This would violate the identifying assumptions of the differences-in-differences model and result in
bias, unless that interdependence is accounted for (Pesaran, 2007; Bai, 2009).
To estimate the λ′sFt factor structure in Equation (4.1), we use the method of principal components as
described by Bai (2009) and implemented by Moon and Weidner (2014) and Gomez (2015b). The procedure
begins by choosing starting values for the βq vector, which we denote as β̃q . Then, the following steps are
carried out:
[1] Calculate the residuals of the OLS estimator excluding the factor structure using β̃.
[2] Estimate the factor loadings, λs, and the common factors, Ft, on the residual vector obtained in step
[1] using the Levenberg-Marquardt algorithm (Wright, 1985).
[3] After estimating the factor structure, we remove it from the regressors using partitioned regression.
Then, we re-estimate the model using a least squares objective function in order to obtain a new
estimate of β̃.
[4] Steps [1] to [3] are repeated until the following stopping condition is achieved: After comparing each
element of the vector β̃ obtained in step [3] with the previous estimate β̃old, we stop if |β̃k − β̃oldk | <
10−8 for all k. If this condition is not achieved, then we stop if the change in the least squares
objective function calculated in step [3] is less than the Total Sum of Squares multiplied by 10−10.
This secondary stopping condition takes effect when the coefficient vector is not converging.
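The iteration above can be sketched in a few lines of code. The following Python fragment is an illustrative reconstruction, not the authors' implementation: it handles a single scalar regressor, substitutes a plain truncated SVD for the Levenberg-Marquardt routine in step [2], and all variable names are ours.

```python
import numpy as np

def interactive_fe(Y, X, r=1, tol=1e-8, max_iter=1000):
    """Sketch of the Bai (2009) principal-components iteration for a
    single scalar regressor. Y and X are N-by-T arrays (units by time)."""
    beta = 0.0                                  # starting value beta-tilde
    lam_F = np.zeros_like(Y)
    for _ in range(max_iter):
        # [1] residuals of the OLS fit, excluding the factor structure
        W = Y - beta * X
        # [2] loadings and factors by principal components: keep the
        #     leading r terms of the SVD of the residual matrix
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        lam_F = (U[:, :r] * s[:r]) @ Vt[:r, :]  # lambda_s' F_t, N-by-T
        # [3] remove the factor structure, re-estimate beta by least squares
        beta_new = np.sum(X * (Y - lam_F)) / np.sum(X * X)
        # [4] stop once successive estimates agree to within tol
        if abs(beta_new - beta) < tol:
            break
        beta = beta_new
    return beta_new, lam_F

# Simulated check: one common factor plus a true coefficient of 2.0
rng = np.random.default_rng(0)
N, T = 40, 30
X = rng.normal(size=(N, T))
Y = (2.0 * X + rng.normal(size=(N, 1)) @ rng.normal(size=(1, T))
     + 0.01 * rng.normal(size=(N, T)))
beta_hat, _ = interactive_fe(Y, X)
```

On simulated data with a single common factor, the iteration recovers the true coefficient; in practice the factor step and the convergence criteria are more elaborate, as described above.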
Traditional approaches to factor methods specify factor loadings at the lowest unit of analysis, in this case
d. However, we are interested in accounting for interdependence between states, the level at which treatment
occurs. While principal components analysis generally requires one observation per i-by-t, sparse factor
methods are available that allow for multiple i per t, as in our case, where we have multiple districts d within
states s for every year t. Moreover, sparse factor methods allow for missing data, thereby obviating the need
to implement interpolation or multiple imputation to fill in missing data (Wright, 1985; Raiko, 2008; Ilin,
2010).
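A minimal illustration of a factor fit that tolerates missing cells, in the spirit of the missing-data methods cited above, is an EM-style loop that alternates between imputing missing entries from the current low-rank fit and re-extracting the factors. This is a hedged sketch under our own simplifying assumptions, not the estimator used in the paper.

```python
import numpy as np

def factor_fit_missing(W, r=1, n_iter=1000):
    """EM-style rank-r principal-components fit that tolerates missing
    entries (NaNs); an illustrative sketch, not the authors' code."""
    mask = ~np.isnan(W)
    filled = np.where(mask, W, 0.0)            # initialize gaps at zero
    low_rank = filled
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :r] * s[:r]) @ Vt[:r, :]
        # impute missing cells from the current rank-r reconstruction
        filled = np.where(mask, W, low_rank)
    return low_rank

# Check on a noiseless rank-1 matrix with roughly 10 percent of cells missing
rng = np.random.default_rng(1)
M = rng.normal(size=(20, 1)) @ rng.normal(size=(1, 15))
M_obs = np.where(rng.random(M.shape) < 0.1, np.nan, M)
recovered = factor_fit_missing(M_obs, r=1)
```

Because the imputation step uses only the low-rank fit, no interpolation or multiple imputation of the raw data is needed, which is the advantage noted in the text.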
Secular time trends
Secular time trends, often specified non-parametrically using binary indicator variables, such as year fixed
effects, adjust for unobservable factors affecting outcome variables over time. The usual assumption is that
these factors affect all units—in our case, districts—in the same way in a given year. Examples of un-
observable factors include national policy changes and macroeconomic events such as recessions. In the
differences-in-differences specification, controlling for these variables is important because the identifying
assumption of the model requires us to believe that the fixed effects represent the average counterfactual trend
that treated districts would have had in the absence of treatment.
In our benchmark specification, we control for secular time trends by including FLE quartile-by-year fixed
effects, denoted as δtq . We include these δtq fixed effects, instead of standard year fixed effects δt, to establish
a more plausible counterfactual trend for treated districts. Our main concern is that higher poverty districts
in both treated and untreated states were increasing in revenues and graduation rates. If the secular trend is
approximated by δt, and δt < δtq=4, then we will be overstating the effect of P ′stβq=4. By indexing the year
fixed effects with FLE quartiles, we are comparing treated districts in a given quartile to control districts in the
same quartile.
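Concretely, the δtq fixed effects amount to one indicator per year-by-quartile cell. The following pandas sketch, with invented data and our own column names, shows the construction:

```python
import pandas as pd

# Toy district-year panel; 'fle_q' is the (hypothetical) 1990 FLE quartile.
panel = pd.DataFrame({
    "district": [1, 1, 2, 2, 3, 3],
    "year":     [1990, 1991, 1990, 1991, 1990, 1991],
    "fle_q":    [1, 1, 4, 4, 4, 4],
    "log_rev":  [8.9, 9.0, 8.5, 8.7, 8.6, 8.8],
})
# Quartile-by-year fixed effects delta_tq: one dummy per (year, quartile)
# cell, so each treated district is compared to control districts in the
# same FLE quartile and year.
panel["tq"] = panel["year"].astype(str) + "_q" + panel["fle_q"].astype(str)
delta_tq = pd.get_dummies(panel["tq"], prefix="delta")
```

Each observation loads on exactly one year-by-quartile dummy, whereas standard year fixed effects δt would pool all quartiles within a year.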
Correlated random trends
If the timing of a state’s court ruling decision is correlated with unobserved trends at the state, district, or
other group level (e.g., state by FLE quartile), we must control for these trends to obtain unbiased estimates
of causal parameters. In the literature, these trends are formally called correlated random trends (Wooldridge,
2005; 2010), but they are often informally referred to as pre-treatment trends. Correlated random trends
serve a distinct purpose from secular, non-parametric trends. While the secular trends help to establish a
plausible counterfactual trend for the common trends assumption to hold, correlated random trends guard
against omitted variable bias caused by an endogenous policy shock.
In Equation (4.1), the parameter ψsqt is included because we believe the timing of the state ruling is
correlated with pre-treatment trends within the state, approximated by a state-by-quartile-specific slope. The
inclusion of this parameter aligns with the notion that the date on which a court-ordered finance system is
deemed unconstitutional is correlated with the slope of the FLE quartile trend within the state. For example, if
graduation rates are steeply declining among the most impoverished districts within state s, we might expect
a reform decision sooner than if the graduation rate had a mild, decreasing trend. Evidence for variation in
pre-treatment trends within states can be seen in Figure I, which shows weighted mean log spending
and graduation rates for states that experienced a Court ruling over time, where time is centered around the
year of Court ruling. With respect to graduation rates, there is an obvious downward trend prior to a Court’s
ruling.
[Insert Figure I Here]
We can think of our specification of the secular time trends and correlated random trends as addressing two
forms of heterogeneity. Our specification of the secular time trend addresses heterogeneity in the treatment
effect and the need to account for that heterogeneity with an appropriate counterfactual. By indexing the
secular time trend as δtq , we allow high poverty districts to be compared to other high poverty districts.
Our specification of the correlated random trend addresses heterogeneity between treated and control groups.
Including ψsqt addresses the fact that high poverty districts in treated states may have different pre-treatment
trends than high poverty districts in non-treated states.
Weighting
In Equation (4.1), we explicitly model treatment heterogeneity by disaggregating treatment effects into
poverty quartiles, q (Meyer, 1995). These quartiles are derived from the percentages of Free Lunch Eli-
gible (FLE) students reported at the district level in each state in 1990. We fix the year at the start of our
sample because the poverty distribution could be affected by treatment over time. The quartiles are defined
within each state.
While we capture treatment heterogeneity across poverty quartiles, we may fail to capture other sources
of treatment heterogeneity. Weighting a regression model by group size is traditionally used to correct for
heteroskedasticity, but it also provides a way to test for the presence of additional unobserved heterogeneity.
According to asymptotic theory, the probability limits of OLS and weighted least squares should be consis-
tent. Thus, regardless of how you weight the data, the point estimates between the two models should not
dramatically differ. When OLS and WLS estimates do diverge substantially, there is concern that the model
is not correctly specified, and it may be due in part to unobserved heterogeneity associated with the weighting
variable (Solon, 2015).
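The diagnostic can be sketched in a few lines: fit the same model by OLS and by WLS and compare the coefficients. This is an informal illustration with simulated data, not the authors' specification.

```python
import numpy as np

def ols_wls_slopes(y, x, w):
    """Under correct specification, OLS and WLS share a probability limit,
    so their slope estimates should be close; a large gap signals unmodeled
    heterogeneity tied to the weighting variable (Solon, 2015)."""
    X = np.column_stack([np.ones_like(x), x])
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return b_ols[1], b_wls[1]

# Homogeneous-effect data: both estimators should recover the slope of 2
rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 1.0 + 2.0 * x + 0.01 * rng.normal(size=500)
w = rng.uniform(1.0, 10.0, size=500)   # enrollment-style weights, invented
slope_ols, slope_wls = ols_wls_slopes(y, x, w)
```

With a homogeneous effect the two slopes coincide up to sampling noise; if the effect varied systematically with w, the WLS slope would drift away from the OLS slope, which is the divergence discussed in the next section.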
For our differences-in-differences specification, we examine the sensitivity of our point estimates to the
inclusion of district-level, time-varying enrollment weights. We provide a detailed discussion about discrep-
ancies between weighted and unweighted point estimates in the next section.
IV.C. Alternative model specifications
Here we quickly outline alternative model specifications. In the Results section, we explore the sensitivity
of our preferred model to alternative parameterizations. We have presented arguments for our preferred
specification, but we recognize that alternative modeling approaches are common in the panel methods and
applied econometric literature. To test how sensitive our preferred results are to reasonable alternatives, we
estimate 15 models that broadly fall within the bounds of typical modeling choices. In the Results section,
we provide upper and lower bounds for how much point estimates depart from our preferred model. Here,
we outline the alternatives we estimate:
1. Cross-sectional dependence: We estimate models in which we assume λ′sFt = 0, as well as models
in which the number of included factors r ∈ {1, 2, 3}.13
2. Secular time trend: We estimate models in which we set δtq = δt, thereby modeling the counterfactual
time trend as constant across cross-sectional units.
3. Correlated random trend: We estimate models in which we set ψsqt ∈ {0, ψst, ψdt, ψsqt2, ψst2}.
That is, we either do not estimate a pre-treatment trend, we allow the pre-treatment trend to be es-
timated at the state and district levels, or we allow the time element to have a quadratic functional
form.
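The three modeling dimensions above can be enumerated programmatically. In this illustrative sketch the labels are ours, and the 15 models the text describes are a subset of the full Cartesian product rather than all of it.

```python
from itertools import product

# Cross-sectional dependence: number of factors r (0 = no adjustment)
csd = [0, 1, 2, 3]
# Secular time trend: FLE quartile-by-year vs plain year fixed effects
secular = ["delta_tq", "delta_t"]
# Correlated random trend: none, state, district, state-by-quartile,
# and the quadratic-in-time variants
crt = ["none", "psi_st", "psi_dt", "psi_sqt", "psi_sqt2", "psi_st2"]

grid = list(product(secular, crt, csd))  # 48 candidate specifications
```

A sensitivity analysis then loops over a chosen subset of `grid`, estimating the model for each triple and recording the treatment-year point estimates.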
V. Results
We present our results in four parts. We first show and discuss the causal effect estimates of court-ordered
finance reforms using our preferred differences-in-differences model. Second, we examine the extent to
which our benchmark model point estimates are sensitive to assumptions about secular trends, correlated
random effects, and cross-sectional dependence. Third, we assess whether reforms were equalizing across
the FLE poverty distribution; we wish to test formally whether, within treated states, poorer districts benefited
more relative to richer districts in terms of revenues and graduation rates. Finally, we conclude with a series
of robustness checks that allow us to gauge the validity of our causal estimates.
V.A. Benchmark differences-in-differences model results
Revenues
We report our causal effect estimates of court-ordered finance reform on the logarithm of per pupil rev-
enues and graduation rates in Tables IV and V, respectively. We obtain results by estimating our benchmark
differences-in-differences model in Equation (4.1). FLE quartile 1 represents low-poverty districts, and FLE
quartile 4 represents high-poverty districts. We only report treatment effect years 1 through 7 because the
number of states in the treated group changes dramatically over time. Some states were treated very late (or
13Moon and Weidner (2014) show that point estimates stabilize once the number of factors r equals the true number of factors r◦ and that when r > r◦, there is no bias. However, this is only true when the time dimension t in the panel approaches infinity. When t is small, it is possible to increase bias by including too many factors. See Table IV in their paper, as well as Onatski (2010) and Ahn (2013).
very early), and we do not have enough years of data to follow them past 2010. As shown in Table III, 6
years after treatment, Alaska and Kansas no longer contribute treatment information; in years 7 and 8, we
lose North Carolina and New York.14 We display both weighted and unweighted estimates, where the weight
is time-varying district enrollment.
[Insert Table III Here]
Examining the weighted results in Table IV, we find that court-ordered finance reforms increased revenues
in all FLE quartiles in the years after treatment, though not every point estimate is significant at conventional
levels. Because our models include FLE quartile-by-year fixed effects, point estimates for a given quartile
should be interpreted as relative to other FLE quartiles that are in the control group. For example, in year 7
after treatment, districts in FLE quartile 1 had revenues that were 12.7 percent higher than they would have
been in the absence of treatment, with the counterfactual trend established by non-treated districts in FLE
quartile 1 (significant at the 1 percent level). In FLE quartile 4, we find that the revenues were 11.9 percent
higher relative to what they would have been in the absence of reform, relative to non-treated districts in FLE
quartile 4 (significant at the 5 percent level). Comparing point estimates between FLE quartiles 1 and 4 is
problematic because the counterfactual trends in quartiles 1 and 4 may have different trajectories. It would be
wrong to conclude that the 12.7 percent effect is larger than the 11.9 percent effect, because the 12.7 percent
effect for quartile 1 may be relative to only a modest increase in non-treated low poverty districts, whereas
the 11.9 percent effect for quartile 4 may be relative to a steep increase in non-treated high poverty districts.
Later, we present models and results to test whether poorer districts received greater revenues and graduation
rates following reform.
Compared to weighted results, the unweighted results in Table IV suggest that revenues increased, but
many of the point estimates are not significant. In FLE quartile 4, for example, all point estimates are
positive, but none are significant. The magnitude of point estimates is also substantially smaller than those
from the weighted regression. In year 7, point estimates across the quartiles are no more than half the size
of the corresponding point estimates from the weighted model. When weighted and unweighted regression
estimates diverge, there is evidence of unmodeled heterogeneity, which we will discuss momentarily. Overall,
revenues increased across all FLE poverty quartiles in states with Court order, relative to equivalent poverty
quartiles in non-treated states.
[Insert Table IV Here]
14Throughout the rest of the paper we restrict the description of our results to years less than or equal to 7, though estimation occurs for the entire panel of data.
Graduation rates
With respect to graduation rates, the weighted results in Table V show that court-ordered finance reforms
were consistently positive and significant among districts in FLE quartile 4. In the first year after reform,
graduation rates in quartile 4 increased modestly by 1.3 percentage points. By treatment year 7, however,
graduation rates increased by 8.4 percentage points, which is significant at the 0.1 percent level. It is worth
emphasizing that each treatment year effect corresponds to a different cohort of students. Therefore, the
dynamic treatment response pattern across all 7 years is consistent with the notion that graduation rates do
not increase instantaneously; longer exposure to increased revenues catalyzes changes in academic outcomes.
We find modest evidence that FLE quartiles 2 and 3 improved graduation rates following Court order, though
these point estimates are not consistently significant and are smaller in magnitude than those in FLE 4. The
lowest-poverty districts in FLE quartile 1 have no significant effects, and the point estimates show no evidence
of an upward trend over time.
The unweighted graduation results in Table V tell a similar story as the weighted results; one key dif-
ference is that the point estimates tend to be smaller. In FLE quartile 4, for example, graduation rates are
4.7 percentage points higher in year 7 than they would have been in the absence of treatment. This point
estimate is almost half the size of the corresponding estimate when using weights. Although districts in the
lowest-poverty quartile exhibit some marginally significant effects, these effects are small, and do not suggest
a substantial increase from their levels before reform, which corresponds to the weighted regression results.
[Insert Table V Here]
Understanding differences between weighted and unweighted results
Although the discrepancy between weighted and unweighted results is an indication of model mis-specification,
we present some evidence that the mis-specification is driven, in part, by unmodeled treatment effect hetero-
geneity that varies by district size (Solon, 2015). To examine this, we discuss the New York City public
schools district (NYCPS) as a case study. Throughout all our analyses, we have excluded NYCPS, the largest
school district in the United States, because of its strong influence on the point estimates in FLE quartile 4
when weighting by district enrollment.15
15To be clear, NYCPS was removed from our analytic sample before estimating the benchmark regressions above.
In Figure II, we plot the causal effect estimates for treatment years 1 to 7 for both the logarithm of per
pupil revenues (left panel) and graduation rates (right panel) when NYCPS is included in the sample and
when it is not. For the unweighted regressions, it does not matter whether NYCPS is included, as all districts
are weighted equally and removing one district has little effect on overall outcomes. The weighted regression
results, however, show that the inclusion of the NYCPS district produces point estimates for revenues that
are systematically higher than weighted results that exclude NYCPS. A similar story holds for graduation
rates beginning in treatment year 4. After excluding NYCPS, the weighted model results are closer to the
unweighted point estimates, though they still do not perfectly align.
[Insert Figure II Here]
Examining NYCPS provides just one example of how treatment heterogeneity might be related to district
size. As shown in Appendix Figure V, NYCPS graduation rates and teacher salaries increased after 2003,
the year New York had its first court ruling. Enrollment, percent poverty, class size and percent minority
decreased during this period. Each of these are potential mechanisms for improving graduation rates and
likely contribute to the large treatment effect we observe in the weighted results. For all analyses, we drop
NYCPS from our sample because its district enrollment weight is near 1 million throughout our sample
period, which is an outlier in the distribution of district enrollment. Dropping the next set of largest districts
does not have such a dramatic effect on the results as dropping NYCPS does. For this reason, we retain all
other districts.
In line with Solon (2015), we acknowledge that the use of weighted least squares does not necessarily
provide a particular estimand of interest, such as the population average partial effect; instead, our OLS
and WLS results are identifying different weighted averages of complex heterogenous effects that vary ac-
cording to district size. In the presence of these heterogeneous effects neither set of results—weighted or
unweighted—should be preferred. Trying to model the heterogeneity is also quite complex, as illustrated
with the NYCPS case study. Overall, our weighted and unweighted point estimates are generally consistent
in terms of sign; however, the magnitude of the effect size tends to differ. In light of this, we continue to show
weighted and unweighted estimates in our tables. As the unweighted results tend to produce smaller effect
sizes than the weighted results, the unweighted results may be considered lower bound estimates of the
(heterogeneous) treatment effect, and the weighted results may be considered as upper bound estimates.
V.B. Model sensitivity
Our preferred model indicates a meaningful positive and significant effect of Court order on the outcomes of
interest. To examine model sensitivity, we focus attention on districts in the highest poverty quartile (i.e., FLE
quartile 4). The evidence suggests there was an effect for graduation rates and revenues for the FLE quartile 4
districts, but these results assume that our benchmark model makes correct assumptions about secular trends,
correlated random trends, and cross-sectional dependence.16 We now assess the extent to which results are
sensitive to these modeling choices.
Figures III and IV graphically show the variability of causal effect estimates of finance reform on the
logarithm of per pupil revenues and graduation rates, respectively, across a variety of model specifications.
Each marker symbol represents the difference between two point estimates, one from an alternative model
specification and the other from our benchmark model; we calculate this difference for treatment years 1
through 7. In the figures, we normalize our benchmark effect estimates to zero in each year. If a point
estimate is greater than zero, then our model understates the effect from the alternative model; if it is less
than zero, our model overstates the effect. We report more traditional regression tables with point estimates
and standard errors in the Appendix.17
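The normalization used in these figures is simple to state in code: subtract the benchmark estimate from each alternative model's estimate, year by year, so the benchmark series plots at zero. The numbers below are invented for illustration only.

```python
import numpy as np

# Treatment-year point estimates (illustrative, not the paper's values):
# benchmark model vs one alternative specification, years 1 through 7
benchmark   = np.array([0.02, 0.04, 0.05, 0.07, 0.09, 0.11, 0.127])
alternative = np.array([0.03, 0.05, 0.07, 0.09, 0.12, 0.14, 0.160])

# Normalized difference: positive values mean the benchmark understates
# the alternative model's effect; negative values mean it overstates it
diff = alternative - benchmark
```

Plotting `diff` for each alternative specification against treatment year reproduces the layout of Figures III and IV, with the benchmark as the zero line.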
While we do not report all possible combinations of different secular trend, correlated random trend, and
cross-sectional dependence models, the 11 models we do show provide insight into sensitivity of the causal
effect estimates. In both figures, our benchmark differences-in-differences model corresponds to the third
model in the legend, which is denoted by an “x” marker symbol and the following triple:
• ST: FLE by year; CRT: State by FLE; CSD: 1 (delimiter is the semicolon).
ST refers to the type of secular trend, which we model as either FLE by year fixed effects or year fixed effects.
CRT refers to the type of correlated random trend in the model, which we specify as state-specific, state by
FLE quartile-specific, or district-specific trends. Each of these trends is formed by interacting the appropriate
fixed effects with a function of time, whether linear or quadratic. Finally, CSD refers to the number of factors
16Although the unweighted revenue results are not statistically significant, the point estimates show a consistent positive effect.
17For log per pupil revenues, Appendix Tables VII and VIII report weighted and unweighted results, respectively. For graduation rates, Appendix Tables IX and X report weighted and unweighted results, respectively. We do not graphically present correlated random trend estimates that allow the time trend to have a quadratic function. Quadratic correlated trends, at the state and state-FLE quartile levels, are very noisy, with standard errors at times greater than twice the magnitude of point estimates. These results are also presented in the Appendix.
we include to account for cross-sectional dependence. A model with factor number 0 does not account for
cross-sectional dependence, while models labeled 1, 2, or 3 include 1, 2, or 3
factors, respectively.18 Marker fill indicates the secular trend, and marker symbol-by-size is used to indicate
combinations of correlated random trends and CSD.
In Figure III, we find that, on average, our benchmark model understates causal effects on revenues in FLE
quartile 4 relative to all other revenue models. The largest effects in both the weighted and unweighted
regressions are produced by a specification that includes year fixed effects for the secular trend, a correlated
random trend at the state level, and no adjustment for cross-sectional dependence. The point estimates from
this model are large, and they are precisely estimated. Although these results suggest there were large revenue
effects, we worry that this specification ignores omitted variables, such as CSD, as well as mis-specifies the
counterfactual trend, all of which could result in upward bias.
It may be illuminating to compare solid and hollow circles in Figure III, as these reflect models that
ignore correlated random trends and CSD but differ in how they estimate the counterfactual trends. Hollow
circles assume the counterfactual trend is homogeneous, while solid circles assume the counterfactual trend
is heterogeneous, indexed by poverty quartile. Hollow circle point estimates are consistently larger than solid
circle point estimates, for both weighted and unweighted regressions. As hypothesized, this indicates that
when we assume homogeneous trends (δt) we overestimate the treatment effect for high poverty districts
because revenues were increasing faster in high poverty districts, on average, relative to low poverty districts.
[Insert Figure III Here]
We consider the variability of graduation rate estimates in Figure IV. Unlike the revenues results, we have
cases where our benchmark model both over- and under-states effect estimates relative to other models.
Of particular interest is the influence of correlated random trends. In the absence of specifying a correlated
random trend, point estimates are between 2 to 6 percentage points smaller than point estimates from our
benchmark model. By including a state-level time trend (indicated by the larger solid diamond and hollow
square), point estimates are nearer to our preferred model, but are still lower by about 2 percentage points.
However, when we include district-by-year effects (approximately 9,800 linear time trends, indicated by solid
and hollow triangles) the point estimates largely align with the preferred model. This suggests that treatment
and control groups do have different pre-treatment trends, and that these differences largely occur at the sub-
18Bracketed numbers indicate the column location for those model estimates available in Appendix Tables VII, VIII, IX, and X.
state level. Modeling the time trends was motivated by the fact that FLE 4 districts were trending differently
prior to reform than FLE districts 1 through 3, and this is reinforced here. Overall, we find evidence of
omitted variable bias when correlated random trends are excluded from the model.
When our benchmark model understates the causal effects of other models, we find that the primary
difference is whether an adjustment for cross-sectional dependence has been made. Our benchmark model
accounts for cross-sectional dependence using a 1-factor model. The models with the largest point estimates,
relative to our benchmark model, do not account for cross-sectional dependence. It appears that treatment
might be correlated with macroeconomic shocks affecting graduation rates, or that treatment might be induced
by another state’s pattern of graduation rates. After correcting for cross-sectional dependence, point estimates
are not as large. It is important to emphasize that we do not know the true number of factors r to include
in the model. Unfortunately, due to finite sample bias, we cannot include as many factors as is necessary to
make the errors i.i.d.
[Insert Figure IV Here]
Overall, we find that our preferred model tends to understate effect sizes for real log revenues and that
causal effect estimates for graduation rates are variable depending on how we model secular trends, correlated
random trends, and cross-sectional dependence. When we adjust for pre-treatment trends (especially at a sub-
state level) point estimates for graduation rates stabilize and the variation around our benchmark estimates is
less than 2 percentage points. Changing parameters in a model is not trivial because some of these changes
affect the identification strategy while others affect assumptions about omitted variable bias. Here we have
argued for a model that takes account of treatment heterogeneity and various sources of omitted variable bias.
In addition, we have shown that it is relatively straightforward to depict the results from a wide range of
alternative modeling strategies, as depicted in Figures III and IV.
V.C. Equalizing effects
Our preferred specification estimates levels of change in log revenues and graduation rates, comparing
high/low poverty districts in treated states to high/low poverty districts in non-treated states. Comparing point
estimates across poverty quartiles is problematic because the counterfactual trends are allowed to vary. To
test whether revenues and graduation rates increased more in high poverty districts following court order, we
construct a variable that ranks districts within a state based on the percentage of students qualifying for FLE
status in 1989. This ranking is then converted into a percentile by dividing by the total number of districts
in that state. Compared to percentages qualifying for free lunch, these rank-orderings put districts on a common
metric, and are analogous to FLE quartiles, but with a continuous quantile rank-ordering.
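The continuous poverty measure can be sketched in pandas: rank districts within each state by their 1989 FLE share, then divide by the state's district count. Column names here are illustrative, not the authors'.

```python
import pandas as pd

# Toy data: districts with their (hypothetical) 1989 FLE shares
fr = pd.DataFrame({
    "state":    ["A", "A", "A", "A", "B", "B"],
    "fle_1989": [10, 30, 20, 40, 5, 15],
})
# Within-state rank divided by the number of districts in the state
# yields a quantile Q in (0, 1]; Q near 1 marks the poorest districts
fr["Q"] = (
    fr.groupby("state")["fle_1989"].rank(method="first")
    / fr.groupby("state")["fle_1989"].transform("size")
)
```

Dividing by the state's own district count is what puts states with very different numbers of districts on the common (0, 1] metric described above.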
The model that we estimate is analogous to Equation (4.1) with three changes:
1. We replace δtq with δt, so that the secular trend in Ysqdt is modeled as the average across all
districts.
2. We add a parameter δtQ, which is a continuous fixed effect variable that controls for year-specific
linear changes in Ysqdt with respect to Q, where Q is a continuous within-state poverty quantile rank-
ordering variable bounded between 0 and 1.
3. The treatment variable P ′stβq is set to equal P ′stQ.
Item [1] now adjusts for the average annual trend in log revenues and graduation rates among untreated
districts. Item [2] is done to provide a counterfactual secular trend with respect to how much non-treated
states are “equalizing” Ysqdt with respect to Q. The secular trend in these models now adjusts for the rate
that revenues and graduation rates are changing across FLE quantiles among untreated districts as well as the
average annual trend.19
The interpretation of the point estimates on the treatment year indicators, indicated in Item [3], is the
marginal change in our outcome variable Ysqdt given a one-unit change in FLE quantile within a state. For
revenues, a point estimate of 0.0001 is equivalent to a 0.01 percent change in per pupil total revenues for each
one-unit rank-order increase in FLE status within a state. For graduation rates, a point estimate of 0.0001 is
equivalent to a 0.01 percentage point increase for each one-unit rank-order increase. A positive coefficient
indicates that revenues and graduation rates are increasing more in poorer districts within a state.
Here, we highlight the trade-off between including δtq and δt in our preferred model. We included δtq
because we did not want to over-state the treatment effect by failing to account for the fact that high poverty
districts were increasing faster in untreated states as well. However, if we had included δt, then we could have
compared point estimates between P ′stβq=1 and P ′stβq=4, as these effect sizes would have been estimated
relative to a common secular trend. Given our interest in accounting for treatment heterogeneity, the two-
model approach is preferred, especially as the current model directly controls for the equalization efforts in
untreated states.
19Models that include the FLE poverty quartile-by-year fixed effect, not shown, are nearly identical.
We perform OLS and WLS for Equation (4.1), with the modifications described just above. These results
can be seen for both dependent variables in Tables IV and V. The columns of interest are columns 5 and 10,
which are labeled FLE Continuous.
After court-ordered reform, revenues increased across poverty quantiles, as indicated by the positive slope
coefficients in Table IV. Seven years after reform, a 10-unit increase in FLE percentile is associated with a
0.9 percent increase in per-pupil log revenues for the weighted regression. For the unweighted regression, a
10-unit increase is associated with a 1.8 percent increase. As previously discussed in Section V.A., neither the
weighted nor unweighted model results dominate each other, so we can view the slope coefficient as having a
lower bound of 0.9 percent and an upper bound of 1.8 percent. Assuming the treatment is linear, these results
suggest that districts in the 90th percentile would have had per pupil revenues that were between 7.2 and 14.2
percent higher than districts in the 10th percentile.
Table V also shows that court-ordered reform increased graduation rates across the FLE distribution. Seven
years after reform, a 10-unit increase in FLE percentile is associated with a 0.85 percentage point increase in
graduation rates for the weighted regression. For the unweighted model the corresponding point estimate is
0.50 percentage points. Assuming linearity, districts in the 90th percentile would have had graduation rates
that were between 4.0 and 6.8 percentage points higher than districts in the 10th percentile.
We showed in Figure I that high poverty quartiles in states undergoing reform experienced an increase
in both revenues and graduation rates, centered around the timing of reform. It was also evident that this
increase was larger than the increase in the other FLE quartiles. It was unknown whether this difference
reflected macroeconomic distributive patterns (perhaps due to federalization efforts) or whether it was a result
of Court order. The results of these models indicate that states undergoing reform shifted more revenues and
graduation rates to higher poverty districts, relative to shifts taking place in non-treated states.
V.D. Robustness checks
The largest threat to internal validity using aggregated data is selective migration. If treatment induces a
change in population and this change in population affects graduation rates, then the results using aggregate
graduation rates will be biased. Such a source of bias would occur if, for example, parents that value edu-
cation were more likely to move to areas that experienced increases in school spending. To test for selective
migration, we estimate the continuous version of our benchmark model on four dependent variables: log-
arithm of total district enrollment, percent minority (sum of Hispanic and black population shares within a
district), percent of children in poverty from the Census Bureau’s Small Area Income and Poverty Estimates
(SAIPE), and percent of students receiving special education. If there is evidence of population changes
resulting from treatment, and if these population characteristics are correlated with the outcome variable,
there may be bias. We would expect our results to be upwardly biased if treatment decreased the percentages
of students who are minority, poor, or receiving special education, as these populations of students have been
historically less likely to graduate (Stetser and Stillwell, 2014). Of course, we
cannot test for within demographic sorting, which would occur if students more likely to graduate within the
poor, minority and disabled populations we observe move into high poverty districts as a result of reform.
The inability to test for within composition sorting is a limitation of our data. Although we only report the
continuous treatment effect estimates of reform in Table VI, results from our main benchmark model appear
in the Appendix.20
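The logic of this check can be illustrated on synthetic data. The sketch below is not our full estimator (which also includes state-by-FLE trends and a factor structure); it is a minimal two-way fixed-effects regression of a demographic outcome on a continuous treatment, with every name and the data-generating process invented for illustration. Under no selective migration, the estimated coefficient should be near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_districts, n_years = 200, 10
true_effect = 0.0  # no selective migration: treatment does not move demographics

# Balanced synthetic panel: district effects, year effects, continuous treatment.
alpha = rng.normal(0, 1, n_districts)[:, None]        # district fixed effects
gamma = rng.normal(0, 1, n_years)[None, :]            # year fixed effects
treat = rng.uniform(0, 100, (n_districts, n_years))   # e.g., FLE percentile x post
y = alpha + gamma + true_effect * treat + rng.normal(0, 0.1, (n_districts, n_years))

def within_2way(x):
    """Two-way within transformation for a balanced districts-by-years panel."""
    return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()

yt, tt = within_2way(y), within_2way(treat)
beta = (tt * yt).sum() / (tt ** 2).sum()  # OLS slope after double demeaning
print(beta)  # close to 0: the outcome does not respond to treatment
```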
Table VI shows that there is no strong evidence of selective migration to treated districts. None of the point
estimates for percent minority are statistically significant, nor are they large in magnitude. There are some
cases in which we obtain significance in terms of children in poverty (SAIPE) and the percentages of special
education students, but these point estimates are positive and not consistently significant across the treatment
years. If anything, the evidence from these models suggests our point estimates on the effect of reform are
downwardly biased, as the demographic changes indicate an increase, relative to the change across poverty
quantile in non-treated states, in the population of students that have been historically less likely to graduate.
In addition to considering selective migration, we also examine the robustness of our revenues dependent
variable. Prior research suggests that nonlinear transformations of the dependent variable (e.g., taking the
natural logarithm of the dependent variable) might produce treatment effect estimates that are substantially
different from the original variable (Lee, 2011; Solon, 2015). While transformations of the dependent variable
do affect the interpretation of the marginal effect, we should see similar patterns in terms of significance and
sign. In Table VI, we find that weighted results are marginally significant in treatment years 6 and 7. We
also find that the unweighted estimates are all statistically significant at the 5 percent level. This is the same
pattern of significance that appears in Table IV, the table of main revenues results. Moreover, all point
estimates are positive across both tables. Appendix Table XV, which shows estimates for all FLE quartiles,
is qualitatively similar in terms of significance and sign to the results in Table IV as well. Overall, we feel
confident that our logarithmic transformation does not jeopardize the validity of our revenues results.
20 Please see Tables XI, XII, XIII, and XIV in Appendix XI.
[Insert Table VI Here]
VI. Conclusion
In this paper, we make both substantive and methodological contributions. Substantively, we demonstrate
that states undergoing Court-ordered finance reform in the period 1990-2010 experienced a sizable fiscal
shock that primarily affected high poverty districts in the state. This fiscal shock led to a subsequent change
in graduation rates in those states that likewise primarily benefited high poverty districts. The estimation of
these effects is largely robust across a variety of model specifications that vary in how they account for cross-
sectional dependence, pre-treatment trends and a heterogeneous secular trend. These effect sizes are, in turn,
robust to changes in demographic composition, as we observe population composition variables that are fairly
stable after reform. Moreover, changes after Court order have been equalizing, as states have shifted greater
resources, resulting in larger improvements in graduation rates, to higher poverty districts. At the start of
Court order, mean graduation rates for high poverty (Q4) districts were nearly 20 percentage points lower
than mean graduation rates for low poverty (Q1) districts. Our results indicate that Court order narrowed that
gap by 4.9 to 6.4 percentage points.
Methodologically, we subject the differences-in-differences estimator to a wide range of specification
checks, including cross-sectional dependence, correlated random trends, and secular trends. We efficiently
present upper and lower bounds on point estimates for a range of model choices. Assuming a homogenous
counterfactual trend will overstate effect sizes, in some cases, whereas ignoring cross-sectional dependence
does not meaningfully bias results in this application. There is substantial bias when we do not model pre-
treatment trends at the state-by-poverty quartile level. Without accounting for these trends, estimated effects
of reform on graduation rates are statistically insignificant, with coefficients precisely estimated near zero.
The provocative results from Jackson and colleagues (2015) should not be undervalued. They find that
spending shocks resulting from Court order had major effects on a variety of student outcomes, including
adult earnings. The question of whether school spending—a public investment of $700 billion per
year—can be causally linked to desirable outcomes has been a foundational question in public economics
for the past 50 years. We have argued here that it is necessary to replicate these results using other data sets
with better attributes or that have non-overlapping problems. Moreover, given the richness of modern panel
data sets, researchers are presented with a variety of plausibly equivalent modeling choices. In the absence
of strong priors regarding model specification, the challenge for applied microeconomists is to efficiently
convey the upper and lower bounds of estimates resulting from model choice. Here, we have shown that the
effects of Court order are consistent and robust on a data set that contains the universe of school districts
in the United States. Moreover, while results are sensitive to model choice, point estimates for graduation
rates diverge from those of the preferred model by no more than 2 percentage points in most cases.
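Mechanically, presenting such bounds reduces to taking the minimum and maximum of the point estimates over the specification grid. A schematic example with invented estimates (none of these numbers are from our tables):

```python
# Hypothetical point estimates for one treatment year across a grid of model
# choices (secular trend, correlated random trend, number of factors).
# All values below are illustrative placeholders, not results from this paper.
estimates = {
    ("FLE-by-year", "state-by-FLE", 0): 0.068,
    ("FLE-by-year", "state-by-FLE", 1): 0.050,
    ("FLE-by-year", "state", 0): 0.055,
    ("year", "none", 0): 0.015,
}
preferred = estimates[("FLE-by-year", "state-by-FLE", 1)]

lo, hi = min(estimates.values()), max(estimates.values())
print(lo, hi)  # bounds on the effect implied by model choice
print(max(abs(v - preferred) for v in estimates.values()))  # worst-case divergence
```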
VII. Figures
Figure I: Mean Log Per Pupil Revenues & Graduation Rates, Centered around Timing of Reform
[Two-panel plot. Left panel: mean log revenues (y-axis, 8.8 to 9.4) against years relative to reform (x-axis, -10 to 10). Right panel: mean graduation rates (y-axis, .65 to .9) over the same window. Separate series shown for FLE quartiles 1 through 4.]
Population weighted mean log revenues and graduation rates for states undergoing court-ordered finance reform, by FLE quartile. Averages are for years 1990-2010, centered around first year of court order. Note that common trends assumptions are not reflected in this figure, as FLE quartile 4 graduation rates are not estimated relative to FLE quartile 4 graduation rates in non-treated states. NYCPS excluded.
Figure II: Change in Log Per Pupil Revenues & Graduation Rates after Court Ruling, FLE Quartile 4, with and without NYCPS
[Two-panel plot. Left panel: log per pupil revenues; right panel: graduation rates. X-axis: treatment years 1 through 7; y-axis: 0 to .15. Four series per panel: Weighted/Unweighted, with NYC Included/NYC Dropped.]
Notes: Differences-in-differences with treatment effects estimated non-parametrically after reform. Model accounts for district fixed effects (θd), FLE-by-year fixed effects (δtq), state-by-FLE linear time trends (ψsqt), and a state-level factor (λ′sFt). Left panel corresponds to results for log revenues; right panel to results for graduation rates. Black lines are for models that include NYCPS; gray lines are for models that exclude NYCPS. Solid lines are for models that include enrollment as analytic weight; dashed lines are for unweighted models. Unweighted models completely overlap (black and gray dashed lines are not distinguishable). When NYCPS is removed, point estimates for weighted models are closer to unweighted models.
Figure III: Model Sensitivity: Changes in Estimates for Log Per Pupil Revenues across Models
[Two panels (Weighted, Unweighted). X-axis: distance in point estimate from preferred model (-.04 to .1); y-axis: treatment years 1 through 7. Legend entries below identify each comparison model.]
ST: FLE by year; CRT: None; CSD: 0 [1]
ST: FLE by year; CRT: State; CSD: 0 [7]
ST: FLE by year; CRT: State by FLE; CSD: 0 [2]
ST: FLE by year; CRT: State by FLE; CSD: 1 [3]
ST: FLE by year; CRT: State by FLE; CSD: 2 [4]
ST: FLE by year; CRT: State by FLE; CSD: 3 [5]
ST: FLE by year; CRT: District; CSD: 0 [9]
ST: year; CRT: None; CSD: 0 [10]
ST: year; CRT: State; CSD: 0 [13]
ST: year; CRT: State by FLE; CSD: 0 [11]
ST: year; CRT: District; CSD: 0 [15]
Notes: Point estimate in treatment year t is shown as the difference between preferred model and model m, where model m variables are indicated in the legend. Point estimates along the x axis greater than zero indicate our preferred model underestimates the effect; less than zero indicates our preferred model overstates the effect. Legend shows three parameter changes, delimited by ";". ST denotes the type of nonparametric secular trend under consideration: (a) FLE quartile by year fixed effects or (b) year fixed effects. CRT denotes the type of correlated random trends under consideration: (a) none, corresponding to no CRT; (b) state by FLE quartile fixed effects interacted with linear time; (c) state fixed effects interacted with linear time; (d) district fixed effects interacted with linear time. CSD denotes type of cross-sectional dependence adjustment: (a) 0, which implies no factor structure; (b) 1, which is a 1 factor model; (c) 2, which is a 2 factor model; or (d) 3, which is a 3 factor model. Bracketed numbers indicate the column location of point estimates for model m in Tables VII and VIII.
Figure IV: Model Sensitivity: Changes in Estimates for Graduation Rates across Models
[Two panels (Weighted, Unweighted). X-axis: distance in point estimate from preferred model (-.04 to .06); y-axis: treatment years 1 through 7. Legend entries below identify each comparison model.]
ST: FLE by year; CRT: None; CSD: 0 [1]
ST: FLE by year; CRT: State; CSD: 0 [7]
ST: FLE by year; CRT: State by FLE; CSD: 0 [2]
ST: FLE by year; CRT: State by FLE; CSD: 1 [3]
ST: FLE by year; CRT: State by FLE; CSD: 2 [4]
ST: FLE by year; CRT: State by FLE; CSD: 3 [5]
ST: FLE by year; CRT: District; CSD: 0 [9]
ST: year; CRT: None; CSD: 0 [10]
ST: year; CRT: State; CSD: 0 [13]
ST: year; CRT: State by FLE; CSD: 0 [11]
ST: year; CRT: District; CSD: 0 [15]
Notes: Point estimate in treatment year t is shown as the difference between preferred model and model m, where model m variables are indicated in the legend. Point estimates along the x axis greater than zero indicate our preferred model underestimates the effect; less than zero indicates our preferred model overstates the effect. Legend shows three parameter changes, delimited by ";". ST denotes the type of nonparametric secular trend under consideration: (a) FLE quartile by year fixed effects or (b) year fixed effects. CRT denotes the type of correlated random trends under consideration: (a) none, corresponding to no CRT; (b) state by FLE quartile fixed effects interacted with linear time; (c) state fixed effects interacted with linear time; (d) district fixed effects interacted with linear time. CSD denotes type of cross-sectional dependence adjustment: (a) 0, which implies no factor structure; (b) 1, which is a 1 factor model; (c) 2, which is a 2 factor model; or (d) 3, which is a 3 factor model. Bracketed numbers indicate the column location of point estimates for model m in Tables IX and X.
VIII. Tables
Table I: Adequacy Era Court-Ordered Finance Reform Years
State            Year of Overturn   Name of 1st Case
Alaska           2009               Moore v. State
Kansas           2005               Montoy v. State (Montoy II)
Kentucky         1989               Rose v. Council for Better Education
Massachusetts    1993               McDuffy v. Secretary of Executive Office of Education
Montana          2005               Columbia Falls Elementary School District No. 6 v. Montana
North Carolina   2004               Hoke County Board of Education v. North Carolina
New Hampshire    1997               Claremont School District v. Governor
New Jersey       1997               Abbott v. Burke (Abbott IV)
New York         2003               Campaign for Fiscal Equity, Inc. v. New York
Ohio             1997               DeRolph v. Ohio (DeRolph I)
Tennessee        1995               Tennessee Small School Systems v. McWherter (II)
Wyoming          1995               Campbell County School District v. Wyoming (Campbell II)
Notes: The table shows the first year in which a state's education finance system was overturned on adequacy grounds; we also provide the name of the case. The primary source of data for this table is Corcoran and Evans (2008). We have updated their table with information provided by ACCESS, Education Finance Litigation: http://schoolfunding.info/.
Table II: Descriptive Statistics for Outcome Variables
                           Weighted             Unweighted
                           Mean      SD         Mean      SD
Graduation Rates           .77       [.15]      .82       [.14]
Log Revenues               8.94      [.26]      8.97      [.29]
Total Revenues             7950.2    [2352.12]  8224.31   [2782.13]
Percent Minority           .32       [.3]       .16       [.23]
Percent Black              .17       [.22]      .08       [.17]
Percent Hispanic           .15       [.22]      .08       [.16]
Percent Special Education  .12       [.05]      .13       [.05]
Percent Child Poverty      .16       [.1]       .16       [.09]
Log Enrollment             9.43      [1.62]     7.4       [1.25]
Notes: This table provides means and standard deviations for the outcome variables used in this paper. Summary statistics shown here that are weighted use district enrollment as the weight.
Table III: States Exiting from Treatment Status
States                Treatment Year
N/A                   1
AK                    2
AK                    3
AK                    4
AK                    5
AK, KS                6
AK, KS, NC            7
AK, KS, NC, NY        8
AK, AR, KS, NC, NY    9
AK, AR, KS, NC, NY    10
AK, AR, KS, NC, NY    11
AK, AR, KS, NC, NY    12
AK, AR, KS, NC, NY    13
AK, AR, KS, NC, NY    14
Notes: This table lists states that no longer contribute treatment information in year t of the dynamic treatment response period. This occurs for states that were treated relatively late in the panel of data that is available. For example, our data set ends in 2009, and Table I shows that Kansas had their system overturned in 2005. Thus, Kansas contributes to the identification of the treatment response in treatment years 1 through 5, corresponding to 2005 to 2009, inclusive. It does not contribute to treatment years 6 through 14.
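The accounting in these notes can be sketched as a small helper: given a reform year and the last year of the panel, it returns the treatment years for which a state contributes identifying variation. The function name and the year-indexing convention (treatment year 1 is the reform year itself) are our own, chosen to match the Kansas example.

```python
def identified_treatment_years(reform_year, panel_end=2009, max_year=14):
    """Treatment years t for which a state contributes identifying variation;
    treatment year t corresponds to calendar year reform_year + t - 1."""
    return [t for t in range(1, max_year + 1) if reform_year + t - 1 <= panel_end]

print(identified_treatment_years(2005))  # Kansas: [1, 2, 3, 4, 5]
print(identified_treatment_years(2009))  # Alaska: [1]
```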
Table IV: Change in Log Per Pupil Revenues, by Free Lunch Eligible Quartile and Quantile
                Weighted                                             Unweighted
Treatment Year  FLE 1     FLE 2     FLE 3     FLE 4     FLE Cont     FLE 1     FLE 2     FLE 3     FLE 4     FLE Cont
1               .029 **   -.004     .019      .015      .00001       .018      .008      .008      .004      .00054 *
                [.011]    [.034]    [.017]    [.012]    [.00019]     [.011]    [.009]    [.01]     [.011]    [.00023]
2               .054 **   .032      .03       .046 *    .00029       .032 +    .027      .014      .018      .0007 *
                [.017]    [.027]    [.021]    [.022]    [.00025]     [.017]    [.017]    [.016]    [.02]     [.00028]
3               .042 *    .013      .031      .044      .00014       .029 +    .028 +    .022      .041      .00097 **
                [.019]    [.041]    [.026]    [.028]    [.00032]     [.015]    [.015]    [.019]    [.029]    [.0003]
4               .058 *    .034      .059 *    .06 +     .00036       .046 **   .045 **   .045 +    .056      .00146 ***
                [.023]    [.046]    [.03]     [.032]    [.00035]     [.017]    [.017]    [.026]    [.034]    [.0003]
5               .081 **   .057      .063 +    .059      .00033       .036      .044      .035      .034      .00137 **
                [.025]    [.042]    [.035]    [.04]     [.0004]      [.026]    [.026]    [.036]    [.037]    [.00043]
6               .119 ***  .106 **   .089 *    .091 *    .0007 *      .049 +    .053 *    .041      .033      .0016 **
                [.035]    [.04]     [.038]    [.044]    [.00035]     [.027]    [.026]    [.031]    [.032]    [.00054]
7               .127 **   .116 *    .108 *    .119 *    .00092 *     .054      .058 +    .041      .038      .00183 **
                [.041]    [.057]    [.048]    [.055]    [.00046]     [.033]    [.032]    [.038]    [.043]    [.00056]
r-squared .999 .999 .999 .999 .999 .894 .894 .894 .894 .887
Notes: This table shows point estimates and standard errors for non-parametric heterogeneous differences-in-differences estimator. Model accounts for district fixed effects (θd), FLE-by-year fixed effects (δtq), state-by-FLE linear time trends (ψsqt), and a state-level factor (λ′sFt). FLE quartiles are indexed by FLE ∈ {1, 2, 3, 4}. Column "FLE Cont" corresponds to models in which FLE-by-year fixed effects are substituted for year fixed effects and the additional control variable year-by-FLE percentile (δtQ) is included. Point estimates for continuous model are interpreted as change in revenues for 1-unit change in poverty percentile rank within a state, relative to change in percentile rank in states without Court order. All standard errors are clustered at the state level. (Significance indicated + < .10, * < .05, ** < .01, *** < .001)
Table V: Change in Graduation Rates, by Free Lunch Eligible Quartile and Quantile
                Weighted                                             Unweighted
Treatment Year  FLE 1     FLE 2     FLE 3     FLE 4     FLE Cont     FLE 1     FLE 2     FLE 3     FLE 4     FLE Cont
1               .003      .015 ***  .008      .013 *    .00018 **    .006      .013 ***  .014 **   .012 *    .00016 **
                [.006]    [.004]    [.008]    [.006]    [.00006]     [.004]    [.003]    [.005]    [.006]    [.00005]
2               .005      .002      .001      .03 **    .00025 **    .005      .013 +    .009      .019 **   .00018 *
                [.009]    [.006]    [.009]    [.009]    [.00008]     [.005]    [.007]    [.006]    [.007]    [.00007]
3               -.003     .003      .001      .061 **   .00051 **    .003      .016 *    .008      .024 **   .00023 *
                [.008]    [.01]     [.007]    [.019]    [.00017]     [.005]    [.007]    [.01]     [.009]    [.00009]
4               .013      .019 *    .021 *    .061 ***  .00065 ***   .015 *    .032 **   .025 *    .034 **   .00043 **
                [.01]     [.008]    [.01]     [.013]    [.00011]     [.008]    [.011]    [.01]     [.013]    [.00014]
5               .013      .017 *    .02 +     .07 ***   .00072 ***   .012      .031 **   .027 *    .041 **   .00049 ***
                [.014]    [.009]    [.011]    [.016]    [.00013]     [.009]    [.01]     [.012]    [.015]    [.00015]
6               .012      .019      .024 *    .085 ***  .00088 ***   .013 +    .033 **   .032 *    .05 **    .00059 ***
                [.013]    [.012]    [.011]    [.019]    [.00017]     [.007]    [.013]    [.016]    [.018]    [.00017]
7               -.009     .01       .016      .084 ***  .00085 ***   .008      .03       .021      .047 *    .00052 *
                [.015]    [.016]    [.015]    [.023]    [.00021]     [.01]     [.018]    [.021]    [.023]    [.00021]
r-squared .999 .999 .999 .999 .999 .573 .573 .573 .573 .573
Notes: This table shows point estimates and standard errors for non-parametric heterogeneous differences-in-differences estimator. Model accounts for district fixed effects (θd), FLE-by-year fixed effects (δtq), state-by-FLE linear time trends (ψsqt), and a state-level factor (λ′sFt). FLE quartiles are indexed by FLE ∈ {1, 2, 3, 4}. Column "FLE Cont" corresponds to models in which FLE-by-year fixed effects are substituted for year fixed effects and the additional control variable year-by-FLE percentile (δtQ) is included. Point estimates for continuous model are interpreted as change in graduation rates for 1-unit change in poverty percentile rank within a state, relative to change in percentile rank in states without Court order. All standard errors are clustered at the state level. (Significance indicated + < .10, * < .05, ** < .01, *** < .001)
Table VI: Robustness Check: Preferred Model Specification - Change in Demographic Characteristics Resulting from Court Order
                Log Enroll   Percent Minority        Percent SAIPE           Percent Sped            Total Revenues
Treatment Year  Unweighted   Weighted    Unweighted  Weighted    Unweighted  Weighted    Unweighted  Weighted    Unweighted
1               -.00012      .00001      0           .00022 **   .00012 *    .0004       .00038      .18717      6.27757 *
                [.00007]     [.00005]    [.00002]    [.00008]    [.00006]    [.00027]    [.00023]    [2.00267]   [3.00265]
2               0            .00005      .00001      .00006      -.00003     .00046      .00044 +    3.00714     7.1655 *
                [.00007]     [.00006]    [.00003]    [.00011]    [.0001]     [.00029]    [.00025]    [2.5773]    [2.93854]
3               .00008       .00002      .00001      .00009      -.00002     .00031      .00026      1.75417     9.93939 **
                [.00011]     [.00007]    [.00004]    [.00006]    [.00005]    [.00033]    [.00029]    [3.41087]   [3.47845]
4               .00013       .00003      .00002      .00016 *    .00005      .00039      .00027      4.32547     15.0462 ***
                [.00013]     [.00009]    [.00004]    [.00007]    [.00008]    [.00023]    [.00017]    [4.26495]   [3.92201]
5               .00014       .00009      .00006      .00007      .00003      .00041 +    .00028      4.2002      14.5479 **
                [.00017]     [.00008]    [.00005]    [.00008]    [.00005]    [.00023]    [.00017]    [5.00702]   [5.23225]
6               .00014       .0001       .00005      .00012      .00002      .00049 +    .00033      8.91194 +   18.25232 *
                [.00019]     [.00008]    [.00005]    [.00008]    [.00007]    [.00028]    [.00021]    [4.75711]   [7.54563]
7               .0001        .00008      .00004      .00008      .00004      .00046      .00032      11.43805 +  20.3912 **
                [.00021]     [.0001]     [.00006]    [.00014]    [.00012]    [.0003]     [.00022]    [6.48416]   [7.19182]
r2 .992 .999 .971 .999 .914 .999 .6 .999 .864
Notes: This table shows point estimates and standard errors for non-parametric heterogeneous differences-in-differences estimator. Model accounts for district fixed effects (θd), year fixed effects (δt), state-by-FLE linear time trends (ψsqt), a state-level factor (λ′sFt), and a FLE percentile-by-year fixed effect (δtQ). Point estimates are interpreted as change in dependent variable per 1-unit change in poverty percentile rank within a state, relative to change in percentile rank in states without Court order. All standard errors are clustered at the state level. (Significance indicated + < .10, * < .05, ** < .01, *** < .001)
Appendices
IX. Data Appendix
Our data set is the compilation of several public-use surveys that are administered by the National Center
for Education Statistics and the U.S. Census Bureau. We construct our analytic sample using the following
data sets: Local Education Agency (School District) Finance Survey (F-33); Local Education Agency (School
District) Universe Survey; Local Education Agency Universe Survey Longitudinal Data File: 1986-1998 (13-
year); Local Education Agency (School District) Universe Survey Dropout and Completion Data; and Public
Elementary/Secondary School Universe Survey. Web links to each of the data sources are listed below:
Note: All links last accessed June 2015.
Local Education Agency Finance Survey (F-33)
Source: https://nces.ed.gov/ccd/f33agency.asp
Local Education Agency Universe Survey
Source: https://nces.ed.gov/ccd/pubagency.asp
Local Education Agency Universe Survey Longitudinal Data File (13-year)
Source: https://nces.ed.gov/ccd/CCD13YR.ASP
Local Education Agency Universe Survey Dropout and Completion Data
Source: https://nces.ed.gov/ccd/drpagency.asp
Public Elementary/Secondary School Universe Survey Data
Source: https://nces.ed.gov/ccd/pubschuniv.asp
We construct total real revenues per student, our first outcome of interest, using the F-33 survey, where
total revenues is the sum of federal, local, and state revenues in each district.21 We divide this value by the
total number of students in the district and deflate by the US CPI, All Urban Consumers Index to convert
the figure to year 2000 dollars. Because of large outliers in the F-33 survey, we replace with missing values
observations with real total revenues per student that are either 1.5 times larger than the 95th percentile or 0.5
times smaller than the 5th percentile of per-student total revenues within a state. We do this to prevent large
outliers from driving our results.22
21 For years 1990-91, 1992-93, and 1993-94, we obtained district-level data from Kforce Government Solutions, Inc. These data are public-use files.
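The construction just described can be sketched in a few lines of pandas. Column names, the deflator value, and the toy data are placeholders rather than F-33 variable names; the trimming rule follows our reading of the thresholds above (above 1.5 times the within-state 95th percentile, or below 0.5 times the within-state 5th percentile).

```python
import pandas as pd

# Sketch of the revenue construction described above: per-pupil revenues,
# deflated to year-2000 dollars, with within-state outliers set to missing.
# Column names, the deflator, and the data are invented for illustration.
df = pd.DataFrame({
    "state": "KS",
    "year": 2005,
    "total_rev": [700_000 + 10_000 * i for i in range(10)] + [20_000_000],
    "enrollment": 100,
})
cpi_to_2000 = {2005: 0.872}  # illustrative deflator: CPI(2000) / CPI(2005)

df["rev_pp"] = df["total_rev"] / df["enrollment"] * df["year"].map(cpi_to_2000)

# Replace with missing any value above 1.5x the within-state 95th percentile
# or below 0.5x the within-state 5th percentile of per-pupil revenues.
p95 = df.groupby("state")["rev_pp"].transform(lambda s: s.quantile(0.95))
p05 = df.groupby("state")["rev_pp"].transform(lambda s: s.quantile(0.05))
df.loc[(df["rev_pp"] > 1.5 * p95) | (df["rev_pp"] < 0.5 * p05), "rev_pp"] = float("nan")

print(int(df["rev_pp"].isna().sum()))  # 1: only the extreme district is dropped
```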
Our measure of graduation rates is a combination of data from the Local Education Agency (School Dis-
trict) Universe Survey, the Public Elementary/Secondary School Universe Survey, and the Local Education
Agency (School District) Universe Survey Dropout and Completion Data. From the school-level file, we ex-
tract the number of 8th graders in each school, and we aggregate these school-level data to obtain district-level
data in each year. From the Local Education Agency data files we construct a time series of total diploma
recipients. Data on total diploma recipients was not part of the Local Education Agency universe files as
of academic year 1997-98, so beginning with that year, we use the Dropout and Completion public-use
files to obtain the diploma data. We calculate the graduation rate as the total number of diploma recipients
in year t as a share of the number of 8th graders in year t − 4, a measure which Heckman (2010) shows is
not susceptible to the downward bias caused by using lagged 9th grade enrollment in the denominator. We
top-coded the graduation rates, so that they take a maximum value of 1.
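As a worked sketch, the cohort measure above amounts to a lagged merge plus a cap. The toy data and column names below are invented; the second cohort illustrates why top-coding is needed when diplomas exceed the lagged 8th-grade count.

```python
import pandas as pd

# Sketch of the graduation-rate measure: diplomas awarded in year t divided by
# the district's 8th-grade enrollment in year t-4, top-coded at 1.
diplomas = pd.DataFrame({"district": ["A", "A"], "year": [2000, 2001],
                         "diplomas": [95, 130]})
eighth = pd.DataFrame({"district": ["A", "A"], "year": [1996, 1997],
                       "grade8": [100, 120]})

eighth["year"] += 4  # align each 8th-grade cohort with its graduation year
rates = diplomas.merge(eighth, on=["district", "year"])
rates["grad_rate"] = (rates["diplomas"] / rates["grade8"]).clip(upper=1)

print(rates["grad_rate"].tolist())  # [0.95, 1.0]
```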
22 The outlier adjustment described above has been used in other studies; for example, see Murray, Evans and Schwab (1998). Outliers are usually found in districts with very small enrollment.
X. Additional Figures
Figure V: New York City Public Schools Potential Mediators & Outcomes
[Line plot. Y-axis: standardized value within NYCPS (-2 to 3); x-axis: year, 1990 to 2010. Series: Grad Rates, Revenues, Salaries, % Spec. Ed., Class Size, % Minority, % Poverty.]
NYCPS is the largest district in the United States, and it was also demonstrably improving in many ways during this period. Here, we plot standardized beta coefficients for various outcomes of interest. These are standardized to be mean zero with standard deviation one for NYCPS across the entire time period. Graduation rates and teacher salaries increased after 2003, the year New York had its first court ruling. Enrollment, percent poverty, class size and percent minority decreased during this period. All of these potential mechanisms for improving graduation rates are intended to be controlled for in a differences-in-differences framework through the inclusion of year dummies.
XI. Additional Tables
Table VII: Log Per Pupil Revenues Results, Model Sensitivity Specifications, Weighted Least Squares
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 .024 .034 .004 .015 .017 .001 .05 .013 .033 .03 .03 -.003 .056 .019 .03
[.012] [.014] [.011] [.013] [.012] [.012] [.018] [.012] [.015] [.012] [.014] [.013] [.018] [.01] [.014]
2 .043 .047 .018 .038 .043 .018 .068 .032 .047 .051 .044 .013 .077 .041 .044
[.017] [.018] [.02] [.021] [.021] [.022] [.024] [.02] [.018] [.017] [.017] [.022] [.024] [.019] [.018]
3 .071 .076 .041 .067 .074 .04 .101 .055 .075 .08 .072 .033 .11 .065 .071
[.024] [.019] [.029] [.028] [.024] [.03] [.023] [.026] [.02] [.024] [.018] [.032] [.022] [.026] [.019]
4 .098 .107 .056 .087 .092 .055 .135 .069 .105 .109 .101 .047 .145 .08 .101
[.024] [.018] [.034] [.033] [.031] [.037] [.025] [.033] [.019] [.025] [.018] [.04] [.025] [.034] [.018]
5 .083 .093 .034 .072 .082 .031 .125 .045 .091 .095 .087 .02 .138 .057 .086
[.029] [.026] [.037] [.036] [.034] [.041] [.034] [.037] [.027] [.03] [.026] [.044] [.034] [.037] [.027]
6 .102 .103 .033 .075 .09 .027 .144 .043 .101 .119 .1 .017 .161 .06 .099
[.022] [.038] [.032] [.034] [.036] [.041] [.044] [.036] [.04] [.021] [.036] [.043] [.042] [.037] [.037]
7 .123 .121 .038 .087 .118 .027 .168 .043 .118 .139 .115 .013 .184 .059 .114
[.019] [.039] [.043] [.045] [.05] [.058] [.045] [.054] [.04] [.018] [.036] [.06] [.043] [.055] [.038]
Weight
Yes
No X X X X X X X X X X X X X X X
Fixed Effect
FLE-Year X X X X X X X X X
Year X X X X X X
CRT
None X X
State-FLE X X X X X
State-FLE^2 X X
State X X
State^2 X X
District X X
Factor
0 X X X X X X X X X X X X
1 X
2 X
3 X
Treatment Year Log Per Pupil Revenues, Unweighted
Preferred model highlighted. Point estimates and standard errors, indicated by brackets, are shown for various model specifications. Model choice is indicated in bottom of panel. We permute parameters using weights (yes/no), fixed effects (FLE-by-year/year), correlated random trends (none/state-by-FLE/state-by-FLE squared/state/state squared/district), and factors (0/1/2/3). Not all combinations are available.
Table VIII: Log Per Pupil Revenues Results, Model Sensitivity Specifications, OLS
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 .028 .026 .015 .019 .02 .01 .048 .02 .027 .032 .025 .005 .051 .021 .026
[.01] [.013] [.012] [.016] [.016] [.009] [.017] [.011] [.014] [.009] [.013] [.01] [.018] [.01] [.014]
2 .056 .05 .046 .049 .052 .034 .08 .05 .051 .061 .05 .029 .083 .053 .051
[.013] [.019] [.022] [.024] [.023] [.02] [.024] [.019] [.02] [.013] [.022] [.02] [.025] [.019] [.023]
3 .055 .049 .044 .049 .047 .029 .083 .046 .05 .066 .054 .027 .092 .055 .055
[.017] [.022] [.028] [.033] [.031] [.026] [.026] [.025] [.024] [.016] [.022] [.027] [.026] [.024] [.023]
4 .079 .073 .06 .069 .063 .045 .11 .061 .075 .092 .078 .043 .121 .071 .08
[.017] [.024] [.032] [.04] [.039] [.03] [.028] [.031] [.025] [.015] [.021] [.032] [.027] [.029] [.022]
5 .074 .067 .059 .071 .059 .031 .11 .052 .07 .09 .076 .03 .124 .066 .078
[.021] [.03] [.04] [.052] [.047] [.039] [.035] [.038] [.032] [.021] [.029] [.041] [.035] [.037] [.031]
6 .102 .092 .091 .105 .076 .047 .142 .071 .095 .121 .104 .048 .158 .089 .107
[.018] [.037] [.044] [.061] [.054] [.045] [.046] [.045] [.039] [.02] [.035] [.046] [.046] [.044] [.037]
7 .129 .115 .119 .128 .103 .059 .174 .09 .119 .154 .133 .064 .195 .114 .137
[.015] [.036] [.055] [.084] [.076] [.065] [.047] [.063] [.038] [.016] [.034] [.067] [.047] [.063] [.036]
Weight
Yes X X X X X X X X X X X X X X X
No
Fixed Effect
FLE-Year X X X X X X X X X
Year X X X X X X
CRT
None X X
State-FLE X X X X X
State-FLE^2 X X
State X X
State^2 X X
District X X
Factor
0 X X X X X X X X X X X X
1 X
2 X
3 X
Treatment Year Log Per Pupil Revenues, Weighted
Preferred model highlighted. Point estimates and standard errors, indicated by brackets, are shown for various model specifications. Model choice is indicated in bottom of panel. We permute parameters using weights (yes/no), fixed effects (FLE-by-year/year), correlated random trends (none/state-by-FLE/state-by-FLE squared/state/state squared/district), and factors (0/1/2/3). Not all combinations are available.
Table IX: Graduation Rates Results, Model Sensitivity Specifications, Weighted Least Squares
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 -0.001 0.016 0.012 0.014 0.008 0.021 0.012 0.016 0.016 -0.004 0.019 0.022 0.009 0.012 0.019
[.007] [.008] [.006] [.007] [.006] [.01] [.006] [.007] [.009] [.007] [.009] [.011] [.006] [.007] [.009]
2 -0.004 0.022 0.019 0.021 0.01 0.029 0.015 0.021 0.021 -0.009 0.023 0.029 0.01 0.015 0.023
[.011] [.01] [.007] [.008] [.01] [.014] [.007] [.009] [.01] [.012] [.009] [.014] [.006] [.009] [.009]
3 0.003 0.033 0.024 0.026 0.011 0.041 0.025 0.029 0.033 -0.004 0.034 0.04 0.018 0.023 0.034
[.012] [.011] [.009] [.009] [.01] [.018] [.008] [.012] [.011] [.012] [.01] [.019] [.007] [.012] [.01]
4 0.007 0.042 0.034 0.035 0.012 0.053 0.033 0.042 0.042 0 0.044 0.053 0.026 0.035 0.044
[.012] [.017] [.013] [.014] [.016] [.028] [.013] [.018] [.018] [.012] [.017] [.029] [.012] [.019] [.017]
5 0.016 0.055 0.041 0.043 0.017 0.065 0.045 0.053 0.055 0.008 0.058 0.066 0.038 0.046 0.057
[.013] [.019] [.015] [.015] [.017] [.034] [.013] [.022] [.02] [.012] [.018] [.035] [.013] [.022] [.019]
6 0.022 0.069 0.05 0.051 0.016 0.076 0.056 0.063 0.068 0.016 0.075 0.08 0.051 0.058 0.074
[.013] [.022] [.018] [.017] [.016] [.045] [.016] [.03] [.023] [.013] [.021] [.045] [.016] [.03] [.022]
7 0.015 0.068 0.047 0.046 0.008 0.075 0.055 0.062 0.068 0.01 0.076 0.08 0.05 0.057 0.075
[.013] [.027] [.023] [.021] [.018] [.056] [.019] [.037] [.028] [.013] [.026] [.056] [.02] [.037] [.027]
Weight
Yes
No X X X X X X X X X X X X X X X
Fixed Effect
FLE-Year X X X X X X X X X
Year X X X X X X
CRT
None X X
State-FLE X X X X X
State-FLE^2 X X
State X X
State^2 X X
District X X
Factor
0 X X X X X X X X X X X X
1 X
2 X
3 X
Treatment Year Graduation Rates, Weighted
Preferred model highlighted. Point estimates and standard errors, indicated by brackets, are shown for various model specifications. Model choice is indicated in bottom of panel. We permute parameters using weights (yes/no), fixed effects (FLE-by-year/year), correlated random trends (none/state-by-FLE/state-by-FLE squared/state/state squared/district), and factors (0/1/2/3). Not all combinations are available.
Table X: Graduation Rates Results, Model Sensitivity Specifications, OLS
Specification: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Year 1: -0.013 0.02 0.013 0.01 0.013 0.018 0.002 0.001 0.019 -0.01 0.02 0.019 0.005 0.004 0.018
SE:     [.009] [.009] [.006] [.007] [.009] [.012] [.007] [.008] [.009] [.008] [.006] [.01] [.005] [.005] [.006]
Year 2: -0.01 0.037 0.03 0.026 0.029 0.03 0.013 0.012 0.036 -0.008 0.033 0.031 0.014 0.012 0.033
SE:     [.01] [.012] [.009] [.01] [.012] [.019] [.007] [.012] [.012] [.011] [.009] [.016] [.006] [.01] [.009]
Year 3: 0.021 0.077 0.061 0.056 0.052 0.063 0.048 0.045 0.076 0.021 0.071 0.063 0.048 0.044 0.07
SE:     [.02] [.017] [.019] [.021] [.015] [.021] [.014] [.017] [.017] [.022] [.017] [.018] [.016] [.018] [.017]
Year 4: 0.009 0.074 0.061 0.055 0.048 0.058 0.042 0.042 0.074 0.008 0.066 0.057 0.04 0.039 0.065
SE:     [.01] [.018] [.013] [.013] [.018] [.035] [.01] [.023] [.019] [.011] [.014] [.031] [.008] [.02] [.014]
Year 5: 0.017 0.091 0.07 0.065 0.054 0.063 0.056 0.052 0.09 0.017 0.083 0.066 0.055 0.051 0.082
SE:     [.014] [.017] [.016] [.017] [.02] [.04] [.01] [.025] [.018] [.014] [.013] [.035] [.01] [.022] [.014]
Year 6: 0.024 0.11 0.085 0.079 0.068 0.071 0.07 0.065 0.108 0.027 0.103 0.078 0.072 0.066 0.101
SE:     [.02] [.019] [.019] [.02] [.024] [.049] [.013] [.029] [.019] [.021] [.015] [.043] [.013] [.028] [.015]
Year 7: 0.018 0.115 0.084 0.077 0.062 0.062 0.072 0.064 0.113 0.023 0.107 0.073 0.075 0.066 0.106
SE:     [.024] [.02] [.023] [.025] [.026] [.057] [.015] [.03] [.021] [.023] [.018] [.05] [.016] [.032] [.018]
Model choice:
Weight — Yes: all 15 specifications; No: none
Fixed effect — FLE-by-Year: 9 of 15 specifications; Year: the remaining 6
CRT — None: 2 specifications; State-FLE: 5; State-FLE^2: 2; State: 2; State^2: 2; District: 2
Factor — 0 factors: 12 specifications; 1, 2, and 3 factors: 1 specification each
Treatment Year Graduation Rates, Unweighted
Preferred model highlighted. Point estimates and standard errors (in brackets) are shown for various model specifications. Model choice is indicated at the bottom of the panel. We permute parameters using weights (yes/no), fixed effects (FLE-by-year/year), correlated random trends (none/state-by-FLE/state-by-FLE squared/state/state squared/district), and factors (0/1/2/3). Not all combinations are available.
Table XI: Robustness: Log Enrollment, Preferred Model
Weighted / Unweighted; columns: FLE 1, FLE 2, FLE 3, FLE 4, FLE Cont. Standard errors in brackets.
Year 1: -.012 ** [.004]   -.01 [.008]   -.015 + [.009]   -.008 [.005]   -.00012 [.00007]
Year 2: -.011 * [.006]   -.01 [.009]   -.011 [.007]   .001 [.005]   0 [.00007]
Year 3: -.016 + [.008]   -.007 [.016]   -.012 [.013]   .006 [.014]   .00008 [.00011]
Year 4: -.021 [.013]   -.012 [.02]   -.013 [.018]   .009 [.017]   .00013 [.00013]
Year 5: -.025 [.017]   -.015 [.025]   -.017 [.022]   .008 [.02]   .00014 [.00017]
Year 6: -.034 [.021]   -.024 [.028]   -.025 [.025]   .01 [.022]   .00014 [.00019]
Year 7: -.044 + [.024]   -.033 [.032]   -.035 [.028]   .005 [.026]   .0001 [.00021]
r-squared: .992 .992 .992 .992 .992
Notes: This table shows point estimates and standard errors for the non-parametric heterogeneous differences-in-differences estimator. The model accounts for district fixed effects (θ_d), FLE-by-year fixed effects (δ_tq, or δ_t for continuous models), state-by-FLE linear time trends (ψ_sqt), a state-level factor (λ′_s F_t), and, in the case of continuous models, an FLE percentile-by-year fixed effect (δ_tQ). Point estimates are interpreted as the change in the dependent variable per one-unit change in poverty percentile rank within a state, relative to the change in percentile rank in states without a Court order. All standard errors are clustered at the state level. (Significance indicated: + < .10, * < .05, ** < .01, *** < .001)
Table XII: Robustness: Percent Minority, Preferred Model
Columns: FLE 1, FLE 2, FLE 3, FLE 4, FLE Cont. Standard errors in brackets.
Year 1, weighted:   -.004 + [.002]   .003 [.002]   .003 [.003]   -.004 [.005]   .00001 [.00005]
Year 1, unweighted: -.002 * [.001]   .001 [.001]   0 [.001]   -.001 [.001]   0 [.00002]
Year 2, weighted:   -.009 * [.005]   -.001 [.002]   .002 [.003]   .004 [.004]   .00005 [.00006]
Year 2, unweighted: -.002 + [.001]   -.001 [.001]   0 [.001]   .002 [.002]   .00001 [.00003]
Year 3, weighted:   -.013 [.009]   -.004 [.005]   .002 [.005]   .001 [.003]   .00002 [.00007]
Year 3, unweighted: -.002 [.002]   -.001 [.002]   0 [.002]   .002 [.002]   .00001 [.00004]
Year 4, weighted:   -.011 [.01]   -.004 [.006]   .006 [.006]   0 [.004]   .00003 [.00009]
Year 4, unweighted: -.002 [.003]   -.001 [.003]   .002 [.003]   .003 [.002]   .00002 [.00004]
Year 5, weighted:   -.005 [.005]   .004 [.003]   .01 * [.004]   0 [.004]   .00009 [.00008]
Year 5, unweighted: 0 [.002]   .002 [.002]   .005 + [.003]   .005 * [.002]   .00006 [.00005]
Year 6, weighted:   -.008 [.006]   .004 [.004]   .013 * [.005]   0 [.005]   .0001 [.00008]
Year 6, unweighted: -.001 [.002]   .001 [.003]   .004 [.003]   .005 [.003]   .00005 [.00005]
Year 7, weighted:   -.012 [.008]   .003 [.005]   .014 * [.007]   -.004 [.005]   .00008 [.0001]
Year 7, unweighted: -.003 [.003]   .001 [.003]   .004 [.004]   .004 [.003]   .00004 [.00006]
r-squared: weighted 1 1 1 1 1; unweighted .971 .971 .971 .971 .971
Notes: This table shows point estimates and standard errors for the non-parametric heterogeneous differences-in-differences estimator. The model accounts for district fixed effects (θ_d), FLE-by-year fixed effects (δ_tq, or δ_t for continuous models), state-by-FLE linear time trends (ψ_sqt), a state-level factor (λ′_s F_t), and, in the case of continuous models, an FLE percentile-by-year fixed effect (δ_tQ). Point estimates are interpreted as the change in the dependent variable per one-unit change in poverty percentile rank within a state, relative to the change in percentile rank in states without a Court order. All standard errors are clustered at the state level. (Significance indicated: + < .10, * < .05, ** < .01, *** < .001)
Table XIII: Robustness: Percent Child Poverty (SAIPE), Preferred Model
Columns: FLE 1, FLE 2, FLE 3, FLE 4, FLE Cont. Standard errors in brackets.
Year 1, weighted:   .012 *** [.003]   .015 ** [.005]   .016 *** [.004]   .02 * [.009]   .00022 ** [.00008]
Year 1, unweighted: .008 * [.004]   .01 *** [.003]   .015 *** [.003]   .009 [.007]   .00012 * [.00006]
Year 2, weighted:   .005 [.004]   .002 [.004]   -.003 [.009]   .004 [.011]   .00006 [.00011]
Year 2, unweighted: .004 [.005]   .003 [.005]   .004 [.006]   -.004 [.01]   -.00003 [.0001]
Year 3, weighted:   .009 + [.005]   .008 * [.004]   .012 * [.005]   .011 * [.005]   .00009 [.00006]
Year 3, unweighted: .006 [.006]   .007 [.006]   .008 [.006]   -.001 [.005]   -.00002 [.00005]
Year 4, weighted:   .013 * [.006]   .009 [.006]   .015 + [.008]   .028 ** [.008]   .00016 * [.00007]
Year 4, unweighted: .008 [.008]   .01 [.008]   .013 [.009]   .007 [.008]   .00005 [.00008]
Year 5, weighted:   .017 ** [.006]   .017 ** [.005]   .02 ** [.007]   .019 ** [.007]   .00007 [.00008]
Year 5, unweighted: .011 [.007]   .012 [.007]   .014 [.01]   .006 [.004]   .00003 [.00005]
Year 6, weighted:   .024 ** [.007]   .019 ** [.007]   .023 ** [.008]   .032 *** [.009]   .00012 [.00008]
Year 6, unweighted: .01 [.01]   .011 [.01]   .016 [.012]   .009 [.009]   .00002 [.00007]
Year 7, weighted:   .027 ** [.009]   .022 ** [.008]   .027 ** [.01]   .032 + [.017]   .00008 [.00014]
Year 7, unweighted: .013 [.012]   .014 [.012]   .021 [.016]   .011 [.013]   .00004 [.00012]
r-squared: weighted 1 1 1 1 1; unweighted .914 .914 .914 .914 .914
Notes: This table shows point estimates and standard errors for the non-parametric heterogeneous differences-in-differences estimator. The model accounts for district fixed effects (θ_d), FLE-by-year fixed effects (δ_tq, or δ_t for continuous models), state-by-FLE linear time trends (ψ_sqt), a state-level factor (λ′_s F_t), and, in the case of continuous models, an FLE percentile-by-year fixed effect (δ_tQ). Point estimates are interpreted as the change in the dependent variable per one-unit change in poverty percentile rank within a state, relative to the change in percentile rank in states without a Court order. All standard errors are clustered at the state level. (Significance indicated: + < .10, * < .05, ** < .01, *** < .001)
Table XIV: Robustness: Special Education, Preferred Model
Columns: FLE 1, FLE 2, FLE 3, FLE 4, FLE Cont. Standard errors in brackets.
Year 1, weighted:   .024 [.015]   .02 [.014]   .022 [.016]   .033 [.025]   .0004 [.00027]
Year 1, unweighted: .029 + [.015]   .026 [.016]   .027 [.016]   .029 [.019]   .00038 [.00023]
Year 2, weighted:   .027 [.018]   .027 + [.016]   .025 [.018]   .038 [.027]   .00046 [.00029]
Year 2, unweighted: .031 + [.018]   .031 + [.017]   .031 + [.018]   .033 [.02]   .00044 + [.00025]
Year 3, weighted:   .022 [.018]   .015 [.018]   .016 [.021]   .025 [.03]   .00031 [.00033]
Year 3, unweighted: .022 [.019]   .019 [.019]   .019 [.021]   .02 [.022]   .00026 [.00029]
Year 4, weighted:   .027 * [.013]   .02 [.013]   .025 + [.015]   .029 [.023]   .00039 [.00023]
Year 4, unweighted: .023 * [.01]   .021 * [.01]   .023 + [.012]   .02 [.015]   .00027 [.00017]
Year 5, weighted:   .032 ** [.012]   .024 * [.012]   .026 [.016]   .03 [.024]   .00041 + [.00023]
Year 5, unweighted: .024 * [.01]   .021 * [.01]   .022 + [.012]   .022 [.015]   .00028 [.00017]
Year 6, weighted:   .038 ** [.014]   .032 * [.014]   .03 [.018]   .034 [.029]   .00049 + [.00028]
Year 6, unweighted: .029 * [.012]   .026 * [.013]   .026 + [.015]   .023 [.018]   .00033 [.00021]
Year 7, weighted:   .036 * [.015]   .03 * [.015]   .031 [.019]   .031 [.031]   .00046 [.0003]
Year 7, unweighted: .026 * [.012]   .025 + [.014]   .027 [.016]   .022 [.019]   .00032 [.00022]
r-squared: weighted 1 1 1 1 1; unweighted .601 .601 .601 .601 .6
Notes: This table shows point estimates and standard errors for the non-parametric heterogeneous differences-in-differences estimator. The model accounts for district fixed effects (θ_d), FLE-by-year fixed effects (δ_tq, or δ_t for continuous models), state-by-FLE linear time trends (ψ_sqt), a state-level factor (λ′_s F_t), and, in the case of continuous models, an FLE percentile-by-year fixed effect (δ_tQ). Point estimates are interpreted as the change in the dependent variable per one-unit change in poverty percentile rank within a state, relative to the change in percentile rank in states without a Court order. All standard errors are clustered at the state level. (Significance indicated: + < .10, * < .05, ** < .01, *** < .001)
Table XV: Robustness: Total Per Pupil Revenues, Preferred Model
Columns: FLE 1, FLE 2, FLE 3, FLE 4, FLE Cont. Standard errors in brackets.
Year 1, weighted:   452 * [242]   219 [382]   373 [265]   310 * [169]   0.187 [2.00267]
Year 1, unweighted: 211 * [94]   47 [70]   47 [101]   -30 [122]   6.278 * [3.00265]
Year 2, weighted:   631 * [282]   393 [359]   392 [282]   563 * [232]   3.007 [2.5773]
Year 2, unweighted: 355 ** [131]   258 * [119]   100 [148]   123 [177]   7.166 * [2.93854]
Year 3, weighted:   547 [371]   260 [519]   385 [353]   573 * [241]   1.754 [3.41087]
Year 3, unweighted: 354 * [144]   263 * [125]   159 [182]   338 [277]   9.934 ** [3.47845]
Year 4, weighted:   782 * [443]   584 [586]   760 * [402]   853 ** [263]   4.325 [4.26495]
Year 4, unweighted: 538 *** [141]   425 ** [136]   394 * [261]   503 [359]   15.046 *** [3.92201]
Year 5, weighted:   1005 * [485]   770 [582]   803 * [475]   893 * [364]   4.200 [5.00702]
Year 5, unweighted: 445 * [214]   381 * [231]   246 [354]   255 [404]   14.548 ** [5.23225]
Year 6, weighted:   1369 * [630]   1276 * [648]   1061 * [607]   1268 ** [486]   8.91194 + [4.75711]
Year 6, unweighted: 653 ** [201]   531 ** [208]   388 [305]   316 [322]   18.252 * [7.54563]
Year 7, weighted:   1428 * [722]   1309 * [758]   1167 * [614]   1567 *** [429]   11.43805 + [6.48416]
Year 7, unweighted: 688 ** [248]   576 * [297]   358 [409]   373 [474]   20.39 ** [7.19182]
r-squared: weighted 1 1 1 1 1; unweighted 1 1 1 1 0.864
Notes: This table shows point estimates and standard errors for the non-parametric heterogeneous differences-in-differences estimator. The model accounts for district fixed effects (θ_d), FLE-by-year fixed effects (δ_tq, or δ_t for continuous models), state-by-FLE linear time trends (ψ_sqt), a state-level factor (λ′_s F_t), and, in the case of continuous models, an FLE percentile-by-year fixed effect (δ_tQ). Point estimates are interpreted as the change in the dependent variable per one-unit change in poverty percentile rank within a state, relative to the change in percentile rank in states without a Court order. All standard errors are clustered at the state level. (Significance indicated: + < .10, * < .05, ** < .01, *** < .001)
XII. Technical Appendix: Standard Errors
In this section, we show how to obtain a variety of standard error structures using the estimates of the factor loadings and common factors obtained from the iterative procedure of Moon and Weidner (2015). This may be necessary when, for example, the unit of observation is at a lower level of aggregation than the unit at which serial correlation is expected. We thank Martin Weidner for suggesting this procedure and for providing details about it.
First, we note that the interactive fixed effects (IFE) procedure provides us with a vector of estimates for λ_i and F_t. We denote the estimated parameter vectors as λ̂_i and F̂_t.
The goal is to use these estimates to estimate the model via OLS in a statistical package that allows for different error structures, such as clustered standard errors. We want to estimate the following equation:

(4.2)   Y_it = α_i + θ_t + D′_it β + λ′_i F_t + ε_it,
where we have λ̂_i and F̂_t from the IFE procedure. If we perform a first-order Taylor expansion of λ′_i F_t around these estimates, then we can utilize them and obtain the same coefficients on our treatment indicators as in the IFE procedure.
In general, the first-order Taylor expansion for a function of two variables λ and F about the points λ̂ and F̂ is as follows:

(4.3)   f(λ, F) ≈ f(λ̂, F̂) + (∂f/∂λ)|_{λ=λ̂, F=F̂} (λ − λ̂) + (∂f/∂F)|_{λ=λ̂, F=F̂} (F − F̂)
In our case, we have f(λ, F) = λ′_i F_t. After applying the Taylor expansion, we obtain:

(4.4)   λ′_i F_t ≈ λ′_i F̂_t + λ̂′_i F_t − λ̂′_i F̂_t.
Next, we define α_i = λ_i − λ̂_i and relabel F_t as θ_t. Therefore, we obtain:

(4.5)   λ′_i F_t ≈ α′_i F̂_t + λ̂′_i θ_t.
The model we can estimate via OLS is

(4.6)   Y_it = α_i + θ_t + D′_it β + α′_i F̂_t + λ̂′_i θ_t + ε_it,

where the α_i and the θ_t are parameters to be estimated. We can easily apply various standard error corrections to this model in a wide variety of statistics packages.
Proposition XII.1. OLS estimation of Equation (4.6) is identical to OLS estimation of Equation (4.2).
Proof. For this proof, we show that the first-order conditions (FOCs) are identical. Thus, the OLS estimates will also be identical.

To begin, we first simplify notation. We let X_it = α_i + θ_t + D_it. There is no loss of generality in performing this step, as we can think of estimating the unit-specific fixed effects and the time-specific fixed effects using the least-squares dummy variable (LSDV) method.23
Part 1: Obtain FOCs for Equation (4.2)

Step 1: Specify the objective function:

minimize over β̂, λ̂_i, F̂_t:   Ψ = ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)²

Step 2: Obtain the partial derivative for each parameter and set it equal to zero:

∂Ψ/∂β̂ = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)(X_it) ] = 0

∂Ψ/∂λ̂_i = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)(F̂_t) ] = 0

∂Ψ/∂F̂_t = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)(λ̂_i) ] = 0
Part 2: Obtain FOCs for Equation (4.6)

Step 1: Specify the objective function:

minimize over β̂, λ̂_i, F̂_t:   Φ = ∑_i ∑_t (y_it − X′_it β̂ − α′_i F̂_t − λ̂′_i θ_t)²

23 In this paper, we actually remove the fixed effects using the within transformation.
Step 2: Obtain the partial derivative for each parameter and set it equal to zero:

∂Φ/∂β̂ = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − α′_i F̂_t − λ̂′_i θ_t)(X_it) ]
       = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)(X_it) ] = 0

∂Φ/∂λ̂_i = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − α′_i F̂_t − λ̂′_i θ_t)(F̂_t) ]
        = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)(F̂_t) ] = 0

∂Φ/∂F̂_t = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − α′_i F̂_t − λ̂′_i θ_t)(λ̂_i) ]
        = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)(λ̂_i) ] = 0
To simplify the FOCs, we use the fact that we defined α_i = λ_i − λ̂_i and relabeled F_t as θ_t. Given that we evaluate the minimization problem at β̂, λ̂_i, and F̂_t, α_i will be equal to zero, and we can replace λ̂′_i θ_t with λ̂′_i F̂_t. Thus, the FOCs for Equations (4.2) and (4.6) are identical.
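Once Equation (4.6) has been estimated by OLS, any sandwich-type correction can be applied to it. As one reference point, a cluster-robust covariance matrix (clustering by group, analogous to the state-level clustering used in the tables) can be computed by hand. This is a generic sketch on illustrative simulated data, without the finite-sample corrections (e.g., a G/(G−1) factor) that packages such as Stata apply by default:

```python
import numpy as np

# Illustrative data: G clusters of T observations each, k regressors.
rng = np.random.default_rng(1)
G, T, k = 15, 8, 3
n = G * T
X = np.column_stack((np.ones(n), rng.normal(size=(n, k - 1))))
y = X @ np.array([1.0, 0.5, -0.2]) + rng.normal(size=n)
cluster = np.repeat(np.arange(G), T)     # cluster id for each observation

b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
bread = np.linalg.pinv(X.T @ X)          # pinv tolerates rank-deficient designs like (4.6)

# "Meat": sum over clusters g of (X_g' e_g)(X_g' e_g)'.
meat = np.zeros((k, k))
for g in range(G):
    idx = cluster == g
    s = X[idx].T @ resid[idx]
    meat += np.outer(s, s)

V = bread @ meat @ bread                 # cluster-robust sandwich covariance of b
se = np.sqrt(np.diag(V))                 # clustered standard errors
```

The same computation applied to the regressors of Equation (4.6), with `cluster` set to state identifiers, yields the state-clustered standard errors reported in the tables (up to the small-sample adjustments noted above).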
Chapter 5
Bibliography
AHN, S. C. & HORENSTEIN, A. R. (2013). “Eigenvalue ratio test for the number of factors.” Econometrica, 81(3), 1203–1227.
ANDERSON, E. (2007). “Fair Opportunity in Education: A Democratic Equality Perspective.” Ethics, 117(4), 595–622.
ANDERSON, E. S. (1999). “What Is the Point of Equality?” Ethics, 109(2), 287–337.
ANDRICH, D. (1978). “Relationships between the Thurstone and Rasch approaches to item scaling.” Applied Psychological Measurement, 2(3), 451–462.
ANGRIST, J. D. (2004). “Treatment effect heterogeneity in theory and practice.” The Economic Journal, 114(494), C52–C83.
ANGRIST, J. D. & PISCHKE, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
ANGRIST, J. & IMBENS, G. (1995). “Identification and estimation of local average treatment effects.”
ARNESON, R. J. (1999). “Against Rawlsian equality of opportunity.” Philosophical Studies, 93(1), 77–112.
BAI, J. (2009). “Panel data models with interactive fixed effects.” Econometrica, 77(4), 1229–1279.
BAI, J. & NG, S. (2008). Large Dimensional Factor Analysis. Now Publishers Inc.
BANSBACK, N., BRAZIER, J., TSUCHIYA, A., & ANIS, A. (2012). “Using a discrete choice experiment to estimate health state utility values.” Journal of Health Economics, 31(1), 306–318.
BEATON, A. E. & ALLEN, N. L. (1992). “Interpreting scales through scale anchoring.” Journal of Educational and Behavioral Statistics, 17(2), 191–204.
BEATON, A. E. & ZWICK, R. (1992). “Overview of the national assessment of educational progress.” Journal of Educational and Behavioral Statistics, 17(2), 95–109.
DE BEKKER-GROB, E. W., RYAN, M., & GERARD, K. (2012). “Discrete choice experiments in health economics: a review of the literature.” Health Economics, 21(2), 145–172.
BERTRAND, M., DUFLO, E., & MULLAINATHAN, S. (2004). “How Much Should We Trust Differences-In-Differences Estimates?” The Quarterly Journal of Economics, 119(1), 249–275. DOI: 10.1162/003355304772839588.
BETTMAN, J. R., LUCE, M. F., & PAYNE, J. W. (1998). “Constructive consumer choice processes.” Journal of Consumer Research, 25(3), 187–217.
BOARDMAN, A. E. & MURNANE, R. J. (1979). “Using Panel Data to Improve Estimates of the Determinants of Educational Achievement.” Sociology of Education, 52(2), 113–121.
BOND, T. N. & LANG, K. (2013a). “The Black-White education-scaled test-score gap in grades k-7.” Technical report, National Bureau of Economic Research.
(2013b). “The evolution of the Black-White test score gap in Grades K–3: The fragility of results.” Review of Economics and Statistics, 95(5), 1468–1479.
BRIGHOUSE, H., LADD, H. F., LOEB, S., & SWIFT, A. (2015). “Educational goods and values: A framework for decision makers.” Theory and Research in Education, 1477878515620887.
BRIGHOUSE, H. & SWIFT, A. (2006). “Equality, Priority, and Positional Goods.” Ethics, 116(3), 471–497.
(2008). “Putting educational equality in its place.” Education, 3(4), 444–466.
(2009a). “Educational equality versus educational adequacy: A critique of Anderson and Satz.” Journal of Applied Philosophy, 26(2), 117–128.
(2009b). “Legitimate parental partiality.” Philosophy & Public Affairs, 37(1), 43–80.
(2014). “The place of educational equality in educational justice.” Education, Justice and the Human Good. Routledge, New York, 14–33.
BUHRMESTER, M., KWANG, T., & GOSLING, S. D. (2011). “Amazon’s Mechanical Turk: a new source of inexpensive, yet high-quality, data?” Perspectives on Psychological Science, 6(1), 3–5.
CANDELARIA, C. A. (2012). “Placeholder.” My Journal.
CARD, D. & PAYNE, A. A. (2002). “School Finance Reform, the Distribution of School Spending, and the Distribution of Student Test Scores.” Journal of Public Economics, 83(1), 49–82.
CHIZMAR, J. & ZAK, T. (1983). “Modeling multiple outputs in educational production functions.” The American Economic Review, 73(2), 18–22.
CHUDIK, A. & PESARAN, M. (2013). “Large panel data models with cross-sectional dependence: a survey.” CAFE Research Paper (13.15).
CLAYTON, M. (2001). “Rawls and natural aristocracy.” Croatian Journal of Philosophy, 1(3), 239–259.
COLEMAN, J. S., CAMPBELL, E. Q., HOBSON, C. J., MCPARTLAND, J., MOOD, A. M., WEINFELD, F. D., & YORK, R. (1966). “Equality of educational opportunity.” Washington, DC, 1066–5684.
CORCORAN, S. P. & EVANS, W. N. (2008). “Equity, Adequacy, and the Evolving State Role in Education Finance.” In H. F. Ladd & E. B. Fiske (Eds.) Handbook of Research in Education Finance and Policy. Routledge, 149–207.
CUNHA, F. & HECKMAN, J. J. (2008). “Formulating, identifying and estimating the technology of cognitive and noncognitive skill formation.” Journal of Human Resources, 43(4), 738–782.
CUNHA, F., HECKMAN, J. J., & SCHENNACH, S. M. (2010). “Estimating the technology of cognitive and noncognitive skill formation.” Econometrica, 78(3), 883–931.
DANIELS, N. (1983). “Health care needs and distributive justice.” In In Search of Equity. Springer, 1–41.
(1985). Just Health Care. Cambridge University Press.
DOLAN, P. & KAHNEMAN, D. (2008). “Interpretations of Utility and Their Implications for the Valuation of Health.” The Economic Journal, 118(525), 215–234.
DOMINGUE, B. (2014). “Evaluating the equal-interval hypothesis with test score scales.” Psychometrika, 79(1), 1–19.
DRUMMOND, M. F. (2005). Methods for the Economic Evaluation of Health Care Programmes. Oxford University Press.
DUFLO, E. (2001). “Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment.” The American Economic Review, 91(4), 795–813. DOI: http://dx.doi.org/10.1257/aer.91.4.795.
DUMOUCHEL, W. H. & DUNCAN, G. J. (1983). “Using sample survey weights in multiple regression analyses of stratified samples.” Journal of the American Statistical Association, 78(383), 535–543.
ELSTER, J. (1986). “Self-realization in work and politics: The Marxist conception of the good life.” Social Philosophy and Policy, 3(02), 97–126.
ELWERT, F. & WINSHIP, C. (2010). “Effect heterogeneity and bias in main-effects-only regression models.” Heuristics, Probability and Causality: A Tribute to Judea Pearl, 327–36.
FRIEDBERG, L. (1998). “Did Unilateral Divorce Raise Divorce Rates? Evidence from Panel Data.” American Economic Review, 88(3).
FRISCH, R. & WAUGH, F. V. (1933). “Partial time regressions as compared with individual trends.” Econometrica: Journal of the Econometric Society, 387–401.
FRITSCH, F. N. & CARLSON, R. E. (1980). “Monotone piecewise cubic interpolation.” SIAM Journal on Numerical Analysis, 17(2), 238–246.
GOMEZ, M. (2015a). FixedEffectModels: Julia package for linear and IV models with high dimensional categorical variables. sha: 368df32285d72db2220e3a7e02671ebdff54613e edition. https://github.com/matthieugomez/FixedEffectModels.jl.
(2015b). SparseFactorModels: Julia package for unbalanced factor models and interactive fixed effects models. sha: 368df32285d72db2220e3a7e02671ebdff54613e edition. https://github.com/matthieugomez/SparseFactorModels.jl.
GRANGER, C. W. J. (1988). “Some recent development in a concept of causality.” Journal of Econometrics, 39(1), 199–211.
HAERTEL, E. H. (1991). “Report on TRP Analyses of Issues Concerning Within-Age versus Cross-Age Scales for the National Assessment of Educational Progress.”
HAINMUELLER, J., HANGARTNER, D., & YAMAMOTO, T. (2015). “Validating vignette and conjoint survey experiments against real-world behavior.” Proceedings of the National Academy of Sciences, 112(8), 2395–2400.
HAINMUELLER, J. & HOPKINS, D. J. (2014). “The hidden American immigration consensus: A conjoint analysis of attitudes toward immigrants.” American Journal of Political Science.
HAINMUELLER, J., HOPKINS, D. J., & YAMAMOTO, T. (2014). “Causal inference in conjoint analysis: Understanding multidimensional choices via stated preference experiments.” Political Analysis, 22(1), 1–30.
HANUSHEK, E. A. (1979). “Conceptual and Empirical Issues in the Estimation of Educational Production Functions.” The Journal of Human Resources, 14(3), 351–388.
(1986). “The Economics of Schooling: Production and Efficiency in Public Schools.” Journal of Economic Literature, 24(3), 1141–1177.
(1996). “Measuring Investment in Education.” The Journal of Economic Perspectives, 10(4), 9–30.
(1997). “Assessing the Effects of School Resources on Student Performance: An Update.” Educational Evaluation and Policy Analysis, 19(2), 141–164.
HECKMAN, J. J. & LAFONTAINE, P. A. (2010). “The American high school graduation rate: Trends and levels.” The Review of Economics and Statistics, 92(2), 244–262.
HOXBY, C. M. (1996). “How Teachers’ Unions Affect Education Production.” The Quarterly Journal of Economics, 111(3), 671–718. DOI: http://dx.doi.org/10.2307/2946669.
(2001). “All School Finance Equalizations are Not Created Equal.” The Quarterly Journal of Economics, 116(4), 1189–1231. DOI: http://dx.doi.org/10.1162/003355301753265552.
ILIN, A. & RAIKO, T. (2010). “Practical approaches to principal component analysis in the presence of missing values.” The Journal of Machine Learning Research, 11, 1957–2000.
INZA, F. S. M., RYAN, M., & AMAYA-AMAYA, M. (2007). “‘Irrational’ stated preferences.” Using Discrete Choice Experiments to Value Health and Health Care, 11, 195.
JACKSON, C. K., JOHNSON, R. C., & PERSICO, C. (2015). “The Effects of School Spending on Educational and Economic Outcomes: Evidence from School Finance Reforms.” Technical report, National Bureau of Economic Research.
KIM, D. & OKA, T. (2014). “Divorce Law Reforms and Divorce Rates in the USA: An Interactive Fixed-Effects Approach.” Journal of Applied Econometrics, 29(2), 231–245.
KROUSE, R. & MCPHERSON, M. (1986). “A ‘Mixed’-Property Regime: Equality and Liberty in a Market Economy.” Ethics, 119–138.
KUZIEMKO, I., NORTON, M. I., & SAEZ, E. (2015). “How Elastic Are Preferences for Redistribution? Evidence from Randomized Survey Experiments.” American Economic Review, 105(4), 1478–1508.
LANCSAR, E. & LOUVIERE, J. (2006). “Deleting ‘irrational’ responses from discrete choice experiments: a case of investigating or imposing preferences?” Health Economics, 15(8), 797–811.
(2008). “Conducting discrete choice experiments to inform healthcare decision making.” Pharmacoeconomics, 26(8), 661–677.
LEE, J. Y. & SOLON, G. (2011). “The fragility of estimated effects of unilateral divorce laws on divorce rates.” The BE Journal of Economic Analysis & Policy, 11(1).
LEVIN, H. M. & BELFIELD, C. (2014). “Guiding the development and use of cost-effectiveness analysis in education.” Journal of Research on Educational Effectiveness.
LIPSCOMB, J., DRUMMOND, M., FRYBACK, D., GOLD, M., & REVICKI, D. (2009). “Retaining, and enhancing, the QALY.” Value in Health, 12(s1), S18–S26.
LISSITZ, R. W. & BOURQUE, M. L. (1995). “Reporting NAEP results using standards.” Educational Measurement: Issues and Practice, 14(2), 14–23.
LUCE, R. D. (2005). Individual Choice Behavior: A Theoretical Analysis. Courier Corporation.
LUNN, D. J., THOMAS, A., BEST, N., & SPIEGELHALTER, D. (2000). “WinBUGS—A Bayesian Modelling Framework: Concepts, Structure, and Extensibility.” Statistics and Computing, 10, 325–337.
MCFADDEN, D. ET AL. (1973). “Conditional logit analysis of qualitative choice behavior.”
MCFADDEN, D. (1980). “Econometric models for probabilistic choice among products.” Journal of Business, S13–S29.
(1986). “The choice theory approach to market research.” Marketing Science, 5(4), 275–297.
(2001). “Economic choices.” American Economic Review, 351–378.
MCFADDEN, D., TRAIN, K. ET AL. (2000). “Mixed MNL models for discrete response.” Journal of Applied Econometrics, 15(5), 447–470.
MEYER, B. D. (1995). “Natural and Quasi-Experiments in Economics.” Journal of Business & Economic Statistics, 13(2), 151–161.
MOON, H. R. & WEIDNER, M. (2015). “Linear regression for panel with unknown number of factors as interactive fixed effects.” Econometrica (forthcoming).
MULLIS, I. V. & JENKINS, L. B. (1988). The Science Report Card: Elements of Risk and Recovery. Trends and Achievement Based on the 1986 National Assessment. ERIC.
MULLIS, I. & JOHNSON, E. (1992). “The NAEP scale anchoring process for the 1992 mathematics assessment.” The NAEP, 893–907.
MURRAY, S. E., EVANS, W. N., & SCHWAB, R. M. (1998). “Education-finance reform and the distribution of education resources.” American Economic Review, 789–812.
NEWSON, R. (2006). “Confidence intervals for rank statistics: Somers’ D and extensions.” Stata Journal, 6(3), 309.
NIELSEN, E. ET AL. (2014). “The Income-Achievement Gap and Adult Outcome Inequality.” The Federal Reserve Board of Governors.
NIELSEN, E. R. (2015). “Achievement Gap Estimates and Deviations from Cardinal Comparability.” Available at SSRN 2597668.
NORD, E. (1999). Cost-Value Analysis in Health Care: Making Sense out of QALYs. Cambridge University Press.
NORD, E., DANIELS, N., & KAMLET, M. (2009). “QALYs: some challenges.” Value in Health, 12(s1), S10–S15.
ONATSKI, A. (2010). “Determining the number of factors from empirical distribution of eigenvalues.” The Review of Economics and Statistics, 92(4), 1004–1016.
OPPE, M., DEVLIN, N. J., & SZENDE, A. (2007). EQ-5D Value Sets: Inventory, Comparative Review and User Guide. Springer.
PERIE, M. (2008). “A Guide to Understanding and Developing Performance-Level Descriptors.” Educational Measurement: Issues and Practice, 27(4), 15–29.
PESARAN, M. H. (2006). “Estimation and inference in large heterogeneous panels with a multifactor error structure.” Econometrica, 74(4), 967–1012.
PESARAN, M. H. & PICK, A. (2007). “Econometric issues in the analysis of contagion.” Journal of Economic Dynamics and Control, 31(4), 1245–1277.
POGGE, T. W. (1995). “Three problems with contractarian-consequentialist ways of assessing social institutions.” Social Philosophy and Policy, 12(02), 241–266.
(2003). “The Incoherence between Rawls’s Theories of Justice.” Fordham L. Rev., 72, 1739.
RAGHAVARAO, D., WILEY, J. B., & CHITTURI, P. (2011). “Choice-Based Conjoint Analysis.” Models and Designs (1st ed.). Boca Raton: Taylor and Francis Group.
RAIKO, T., ILIN, A., & KARHUNEN, J. (2008). “Principal component analysis for sparse high-dimensional data.” In Neural Information Processing, 566–575. Springer.
RAWLS, J. (1974). “Reply to Alexander and Musgrave.” The Quarterly Journal of Economics, 633–655.
(2001). Justice as Fairness: A Restatement. Harvard University Press.
(2009). A Theory of Justice. Harvard University Press.
REARDON, S. F., VALENTINO, R. A., & SHORES, K. A. (2012). “Patterns of literacy among US students.” The Future of Children, 22(2), 17–37.
ROWEN, D., BRAZIER, J., & VAN HOUT, B. (2014). “A comparison of methods for converting DCE values onto the full health-dead QALY scale.” Medical Decision Making, 0272989X14559542.
SATZ, D. (2007). “Equality, Adequacy, and Education for Citizenship.” Ethics, 117(4), 623–648.
(2012). “Unequal chances: Race, class and schooling.” Theory and Research in Education, 10(2), 155–170.
(2014). “Unequal chances.” Education, Justice and the Human Good: Fairness and Equality in the Education System, 34.
SCOTT, L. A. & INGELS, S. J. (2007). “Interpreting 12th-Graders’ NAEP-Scaled Mathematics Performance Using High School Predictors and Postsecondary Outcomes from the National Education Longitudinal Study of 1988 (NELS:88). Statistical Analysis Report. NCES 2007-328.” National Center for Education Statistics.
SHIFFRIN, S. V. (2003). “Race, labor, and the fair equality of opportunity principle.” Fordham L. Rev., 72, 1643.
SIMS, D. P. (2011). “Lifting All Boats? Finance Litigation, Education Resources, and Student Needs in the Post-Rose Era.” Education Finance & Policy, 6(4), 455–485.
SOLON, G., HAIDER, S. J., & WOOLDRIDGE, J. M. (2015). “What are we weighting for?” Journal of Human Resources, 50(2), 301–316.
SPRINGER, M. G., LIU, K., & GUTHRIE, J. W. (2009). “The Impact of School Finance Litigation on Resource Distribution: A Comparison of Court-Mandated Equity and Adequacy Reforms.” Education Economics, 17(4), 421–444.
TAYLOR, R. S. (2004). “Self-realization and the priority of fair equality of opportunity.” Journal of Moral Philosophy, 1(3), 333–347.
THURSTONE, L. L. (1928). “Attitudes can be measured.” American Journal of Sociology, 529–554.
TORRANCE, G. W., FEENY, D. H., FURLONG, W. J., BARR, R. D., ZHANG, Y., & WANG, Q. (1996). “Multiattribute utility function for a comprehensive health status classification system: Health Utilities Index Mark 2.” Medical Care, 34(7), 702–722.
TRAIN, K. E. (2009). Discrete Choice Methods with Simulation. Cambridge University Press.
WEINSTEIN, M. C., TORRANCE, G., & MCGUIRE, A. (2009). “QALYs: the basics.” Value in Health, 12(s1), S5–S9.
WESTEN, P. (1985). “The concept of equal opportunity.” Ethics, 837–850.
WHITEHEAD, S. J. & ALI, S. (2010). “Health outcomes in economic evaluation: the QALY and utilities.” British Medical Bulletin, 96(1), 5–21.
WOLFERS, J. (2006). “Did Unilateral Divorce Laws Raise Divorce Rates? A Reconciliation and New Results.” American Economic Review, 96(5), 1802–1820.
WOOLDRIDGE, J. M. (2005). “Fixed-effects and related estimators for correlated random-coefficient and treatment-effect panel data models.” Review of Economics and Statistics, 87(2), 385–390.
(2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.
WRIGHT, S. & HOLT, J. N. (1985). “An inexact Levenberg-Marquardt method for large sparse nonlinear least squares.” The Journal of the Australian Mathematical Society, Series B: Applied Mathematics, 26(04), 387–403.