OPPORTUNITIES, COSTS AND BENEFITS: RETHINKING THE
EDUCATION PRODUCTION FUNCTION
A DISSERTATION
SUBMITTED TO THE GRADUATE SCHOOL OF EDUCATION
AND THE COMMITTEE ON GRADUATE STUDIES
OF STANFORD UNIVERSITY
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
KENNETH SHORES
MARCH 2016
http://creativecommons.org/licenses/by-nc/3.0/us/
This dissertation is online at: http://purl.stanford.edu/fw890kf0299
© 2016 by Kenneth Aaron Shores. All Rights Reserved.
Re-distributed by Stanford University under license with the author.
This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
sean reardon, Primary Adviser
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Eamonn Callan
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Susanna Loeb
I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Debra Satz
Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education
This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.
ABSTRACT
My dissertation incorporates both traditional and non-traditional approaches to the specifi-
cation and estimation of the education production function. I pursue three related questions:
1. I use normative philosophical methods to consider whether implications of equal
opportunity and adequacy distributive principles are compatible with the basic and
intuitive right that all students have claim to at least some educational resources to
develop their abilities. If an incompatibility is found, this suggests that these paradigmatic distributive principles are in need of revision.
2. I propose a method for estimating an achievement scale that is equal-interval with
respect to benefit. I develop and implement survey experiments to estimate individ-
ual preferences for math and reading academic skills. This quantitative description
allows for both between and within attribute comparisons, making it possible to de-
termine, for example, whether a 10-point gain at the low end of the math scale is
preferable to a 20-point gain at the high end of the reading scale. Such a scale can be
used for cost-effectiveness evaluations.
3. I (with Christopher Candelaria) provide new evidence about the effect of court-
ordered finance reform on per-pupil revenues and graduation rates. We account
for cross-sectional dependence and heterogeneity in the treated and counterfactual
groups to estimate the effect of overturning a state’s finance system. Seven years af-
ter reform, the highest poverty quartile in a treated state experienced a 4 to 12 percent
increase in per-pupil spending and a 5 to 8 percentage point increase in graduation
rates.
Contents
1 Introduction to the dissertation 1
I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
II. Paper one: Normative approaches to achievement . . . . . . . . . . . . . . 3
III. Paper two: Welfare adjusted scale score . . . . . . . . . . . . . . . . . . . 5
IV. Paper three: Sensitivity of causal estimates from finance reform (with Christo-
pher Candelaria) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
V. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Achievement is not income 10
I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
II. Establishing a right to educational resources . . . . . . . . . . . . . . . . . 13
III. Testing distributive theories against the right to some educational resources 16
III.A. Expanded fair equality of opportunity . . . . . . . . . . . . . . . . 16
III.B. Restricted fair equality of opportunity . . . . . . . . . . . . . . . . 19
III.C. Objections to the test . . . . . . . . . . . . . . . . . . . . . . . . . 21
III.D. Equality of opportunity for what? . . . . . . . . . . . . . . . . . . 25
III.E. Adequacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
IV. What distribution of resources is entailed by the right? . . . . . . . . . . . . 31
IV.A. A right that is too weak . . . . . . . . . . . . . . . . . . . . . . . . 32
IV.B. A right that is too strong . . . . . . . . . . . . . . . . . . . . . . . 32
IV.C. What right is ‘just right’? . . . . . . . . . . . . . . . . . . . . . . . 33
V. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3 Welfare adjusted scale score: Method toward the development of an equal-interval welfare scale 36
I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
II. Survey design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
II.A. Math and reading descriptors and scale scores . . . . . . . . . . . . 45
II.B. Linking NAEP descriptors to scale scores . . . . . . . . . . . . . . 45
II.C. Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
III. Econometric framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
IV. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
IV.A. Ordinal ranking exercise . . . . . . . . . . . . . . . . . . . . . . . 51
IV.B. Beta estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
IV.C. Comparing original to welfare-adjusted scale . . . . . . . . . . . . 58
V. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
VI. Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
VII. Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4 The sensitivity of causal estimates from Court-ordered finance reform on spending and graduation rates (with Christopher Candelaria) 82
I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
II. Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
III. Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
IV. Econometric specifications and model sensitivity . . . . . . . . . . . . . . 91
IV.A. Benchmark differences-in-differences model . . . . . . . . . . . . 91
IV.B. Explaining model specifications . . . . . . . . . . . . . . . . . . . 93
IV.C. Alternative model specifications . . . . . . . . . . . . . . . . . . . 96
V. Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
V.A. Benchmark differences-in-differences model results . . . . . . . . . 97
V.B. Model sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
V.C. Equalizing effects . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
V.D. Robustness checks . . . . . . . . . . . . . . . . . . . . . . . . . . 105
VI. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
VII. Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
VIII. Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
IX. Data Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
X. Additional Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
XI. Additional Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
XII. Technical Appendix: Standard Errors . . . . . . . . . . . . . . . . . . . . 133
5 Bibliography 136
Chapter 1
Introduction to the dissertation
I. Introduction
The United States currently spends about $700 billion per year on the provision of K-12
public education, and we would like to know if this investment is worthwhile. One method
is to use an education production function, linking inputs to valued outcomes (Hanushek,
1979). Much depends upon what counts as valued. It is common to use academic achieve-
ment scores as the outcome in these production functions, as test scores describe what a
person knows and can do, and one of the purposes of schools is to produce knowledge and
ability. While descriptions of achievement are certainly useful for describing what individ-
uals know and can do, on their own, the information they provide about value is likely to be
limited. Consider, for example, an achievement score that measures how well individuals
speak a fictional language. Even with a good measure of the ability, the benefit correspond-
ing to the measure is likely to be very small. This suggests that in order for an achievement
score to reflect value it needs to be linked to something else.
One approach is to recast an achievement score to reflect its labor market value (Mur-
nane, et al., 2001; Cunha & Heckman, 2008; Cunha, Heckman, & Schennach, 2010;
Chetty, Friedman & Rockoff, 2011; Bond & Lang, 2013). With the assumption that more
earnings are better than less, this approach usefully links an academic achievement score
to something else of value. Still, earnings do not capture the benefits of achievement inclu-
sively. A retiree on a fixed income who values being literate rather than illiterate demon-
strates that one can value academic achievement for non-pecuniary reasons. Moreover, if
this same person would like to remain up to date with current medical research, but cares
little for reading fiction, he values one type of academic skill over another. Irrespective of its impact on earnings, then, academic achievement clearly affects happiness, and different abilities will produce different benefits.
An outcome measure that reflects the value of achievement, broadly construed, would be
useful for determining whether schools are effective or not. Determining what is valuable
and how much something should be valued requires an interdisciplinary approach. Philos-
ophy (specifically normative theory) applies analytical methods to identify the values that
correspond to different choices and, when possible, to suggest which values are more im-
portant (Swift, 1999; McDermott, 2008). Economics is more concerned with the value of
efficiently satisfying preferences (Hausman & McPherson, 2006). Stated-preference methods (also referred to as empirical social choice) are an economic tool that uses survey or laboratory experiments to quantify how much welfare is associated with certain non-priced
goods (Adamowicz, Louviere, & Williams, 1994; Gaertner, 2009; Gaertner & Schokkaert,
2012). Taken together, philosophical and economic tools can provide a broader, more in-
clusive description of the values associated with academic achievement.
My dissertation incorporates both traditional and non-traditional approaches to the spec-
ification and estimation of the education production function. I pursue three related ques-
tions:
1. I use normative philosophical methods to consider whether implications of equal
opportunity and adequacy distributive principles are compatible with the basic and
intuitive right that all students have claim to at least some educational resources to
develop their abilities. If an incompatibility is found, this suggests that these paradigmatic distributive principles are in need of revision.
2. I propose a method for estimating an achievement scale that is equal-interval with
respect to benefit. I develop and implement survey experiments to estimate individ-
ual preferences for math and reading academic skills. This quantitative description
allows for both between and within attribute comparisons, making it possible to de-
termine, for example, whether a 10-point gain at the low end of the math scale is
preferable to a 20-point gain at the high end of the reading scale. Such a scale can be
used for cost-effectiveness evaluations.
3. I (with Christopher Candelaria) provide new evidence about the effect of court-
ordered finance reform on per-pupil revenues and graduation rates. We account
for cross-sectional dependence and heterogeneity in the treated and counterfactual
groups to estimate the effect of overturning a state’s finance system. Seven years af-
ter reform, the highest poverty quartile in a treated state experienced a 4 to 12 percent
increase in per-pupil spending and a 5 to 8 percentage point increase in graduation
rates.
II. Paper one: Normative approaches to achievement
This paper argues for a rights-based principle for the distribution of educational resources.
The right that is proposed is: “All students, no matter who they are, have claim to at least
some educational resources to develop their abilities.” By endorsing this claim, we em-
brace the idea that schools exist for all students, and that no students should be denied edu-
cational opportunities for any characteristic they may have. The phrase “some educational
resources” is intentionally vague. It does not specify how much each student is entitled to,
and allows for the possibility that some may need more than others. The principle only
entails a minimum amount for all students.
Any time a student’s characteristics are used as a basis for exclusion from any educational opportunity, we will recognize the harm that is done to that student. Likewise, if a
distributive principle requires that some groups receive no resources, we must conclude
that the principle is inadequate, as the fulfillment of the principle means that some stu-
dents’ legitimate claims on some educational resources will be violated. A principle of
racial or gender discrimination proves the point, as students, based on their race or gender,
are denied educational opportunities on account of those characteristics. Similarly, a prin-
ciple that said “give only to those who stand to gain the most” would also be in violation,
as there are necessarily some who do not stand to gain the most and will be left without.
Such a right seems like a small thing, and I doubt there would be much objection to it.
Nevertheless, I show that a suite of popular distributive theories—two versions of fair
equality of opportunity and adequacy—problematically conflict with the right in many fea-
sible circumstances. At issue is the simple fact that, in many cases, satisfying equality
and adequacy principles will require the rich or the talented to relinquish their claim on
educational resources. Discrimination on the basis of ability or income violates the right.
Two subtle questions will be in need of sorting out:
1 Is the right for all students to at least some educational resources merely one of many competing principles–akin to parental partiality, “all things considered” priority, etc.–or does the right reveal fundamental limitations to these popular distributive principles when applied to educational achievement?
2 If we concede some basic egalitarian intuitions–that ability differences are important for one’s life prospects, morally arbitrarily assigned, and compensable through educational investments–how do we reconcile these intuitions with the general claim to educational resources that all students share?
I argue in response to the first question that the right does reveal fundamental limitations
to current distributive paradigms applied to education. In short, achievement is not income
and needs to be treated differently. Indeed, when we consider other distributive objects,
such as income and welfare, the rights-based objection to equality and adequacy falls short.
A thoroughgoing response to the second question is, unfortunately, not forthcoming.
Balancing equality against general claims has a kind of Goldilocks problem. A minimal
right to educational resources, such as an offer of the smallest divisible unit of resource
for the rich or talented, is too weak; a maximal right that promises equal resources to all
students violates any notion of fair equality and is too strong. The right that is ‘just right’
is not well defined, but it can be found somewhere between the minimal and maximal
specifications.
III. Paper two: Welfare adjusted scale score
The use of academic scale scores in education production functions is so commonplace that a list of citations is unnecessary. When a scale score is used as the dependent variable
it connotes value or expected benefit. Holding costs constant, a program that raises test
scores 20 points is more effective than a program that raises test scores 10 points. This is the logic of cost-effectiveness analysis (see Levin and Belfield, 2014, for a review). In order for
this evaluation to be made, researchers and policymakers must assume that a scale score
is equal-interval scaled with respect to benefit. That is, for example, they must assume
that a 10-point gain at the bottom of the scale score is equivalent to a 10-point gain at the
top of the scale score. Such an assumption is rarely tested and there are not strong priors
indicating that such a relationship exists.
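To make the stakes of that assumption concrete, here is a minimal sketch. The concave `welfare` function and the program numbers below are invented placeholders, not estimates from this paper; the point is only that two programs that tie on the raw scale can separate once gains are valued on a concave welfare scale.

```python
import numpy as np

# Hypothetical concave welfare mapping over scale scores (a placeholder,
# not an estimated scale): gains at the bottom are worth more than at the top.
def welfare(score):
    return np.log(score)

cost = 1000.0  # assume both programs cost the same per pupil

# Program A: +10 points at the top of the scale (340 -> 350)
# Program B: +10 points at the bottom of the scale (150 -> 160)
gain_a_points = 350 - 340
gain_b_points = 160 - 150

gain_a_welfare = welfare(350) - welfare(340)
gain_b_welfare = welfare(160) - welfare(150)

# On the raw scale the two programs look equally cost-effective...
print(gain_a_points / cost == gain_b_points / cost)  # True
# ...but on the concave welfare scale, B delivers more benefit per dollar.
print(gain_b_welfare / cost > gain_a_welfare / cost)  # True
```

If the true benefit function is concave, any cost-effectiveness ranking computed on the raw scale implicitly assumes it away.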
In this paper I describe and implement a method for constructing a scale score that is
equal-interval with respect to welfare. I employ a choice-based conjoint design (oftentimes
referred to as a discrete choice experiment) to obtain utility values for different math and
reading descriptors obtained from the National Assessment of Educational Progress Long
Term Trend (NAEP-LTT). In the experiment, respondents are provided with a description
of two individuals who are alike in all respects except that they differ in their math and
reading abilities. Respondents are asked to determine which bundle of math and reading abilities indicates which of the two persons will have an “all things considered” better life.
After reviewing the reading and math profiles, the respondent is forced to make a choice between Persons A and B. The response is coded dichotomously: 1 if Person A was chosen and 0 otherwise.
The purpose of this experiment is for the respondent to make interval comparisons be-
tween Persons A and B with respect to welfare. As an example, consider a choice task
where Person A has reading ability equal to 5 and math ability equal to 2, while Person
B has reading and math abilities equal to 3.1 Effectively, the respondent is being asked to
make a trade between 2 units of reading for 1 unit of math. Whether respondents, on aver-
1In the actual choice task, respondents are given performance level descriptors, which are taken from the NAEP-LTT. These descriptors are equidistant textual accounts of an individual’s math and reading ability.
age, choose Person A over B will depend on how much they value reading relative to math,
and, importantly, how much they value math gains at the bottom of the distribution relative to reading losses at the top. How respondents on average weigh these different trades determines the relative concavity of the welfare-adjusted scale score.
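The choice logic above can be sketched as a simple binary logit over part-worth utilities. The utility numbers below are invented for illustration (they are not the survey estimates); the point is that the probability of choosing Person A depends only on the utility difference between the two ability profiles.

```python
import numpy as np

# Hypothetical part-worth utilities for five math and five reading levels
# (invented, concave in the level; indexed 0..4 from lowest to highest).
beta_math = np.array([0.0, 0.9, 1.5, 1.9, 2.1])
beta_read = np.array([0.0, 0.8, 1.4, 1.8, 2.0])

def utility(math_level, read_level):
    """Total utility of a (math, reading) ability profile."""
    return beta_math[math_level] + beta_read[read_level]

def p_choose_a(profile_a, profile_b):
    """Logit probability that a respondent picks Person A over Person B."""
    diff = utility(*profile_a) - utility(*profile_b)
    return 1.0 / (1.0 + np.exp(-diff))

# Person A: math level 1, reading level 4; Person B: both abilities at level 2.
# Choosing A means trading reading at the top for math at the bottom.
print(p_choose_a((1, 4), (2, 2)))  # 0.5: these illustrative profiles tie
```

With concave part-worths like these, a unit gained at the bottom of a scale moves the choice probability more than a unit gained at the top, which is exactly what the estimated scale is meant to capture.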
These performance level descriptors are then mapped onto the original scale score us-
ing the scale anchoring process employed by the NAEP. I now have a data set with three
variables and 10 observations: a vector of performance level descriptors, the corresponding
scale scores (150, 200, 250, 300, 350), and the estimated utilities. I use piece-wise mono-
tone cubic interpolation (MCI) for values not directly estimated from performance level
descriptors. This provides a scale score that is equal-interval with respect to welfare, as
long as we assume that the equal-interval assumptions (with respect to ability) of the orig-
inal scale score hold and that the performance level descriptors are appropriately mapped
to scale score values.
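Under those assumptions, the interpolation step can be sketched as follows. Only the five anchor scores come from the NAEP-LTT scale; the utility values are illustrative placeholders, not the paper’s estimates.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Five NAEP-LTT anchor scale scores and hypothetical estimated utilities
# (the utilities are illustrative placeholders, concave in the score).
scale_scores = np.array([150.0, 200.0, 250.0, 300.0, 350.0])
utilities = np.array([0.00, 0.45, 0.75, 0.92, 1.00])

# Monotone (PCHIP) cubic interpolation fills in utilities between anchors
# without introducing spurious wiggles: the curve stays monotone increasing.
welfare_scale = PchipInterpolator(scale_scores, utilities)

# A 10-point gain at the bottom of the scale vs. the same gain at the top:
low_gain = welfare_scale(160) - welfare_scale(150)
high_gain = welfare_scale(350) - welfare_scale(340)
print(low_gain > high_gain)  # True: the bottom gain is worth more
```

Monotone interpolation matters here: an unconstrained cubic spline through concave utility anchors could overshoot and imply that more ability sometimes lowers welfare.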
As hypothesized, I find that utility values for different achievement states are non-linear
and concave. Gains in reading and math are worth more at the bottom than at the top.
In order to demonstrate how the newly estimated scale can be applied, I compare cohort
trends in the white-black achievement gap between the original NAEP scale and the newly
estimated one. When achievement is re-scaled to reflect value, changes in achievement
gaps are different in both magnitude and direction in many instances. This is due to the fact
that gains/losses for lower achieving groups are worth more than gains/losses for higher
achieving groups.
IV. Paper three: Sensitivity of causal estimates from fi-
nance reform (with Christopher Candelaria)
Whether school spending has an effect on student outcomes has been an open question in
the economics literature, dating back to at least Coleman (1966). The causal relationship
between spending and desirable outcomes is of obvious interest, as the share of GDP that
the United States spends on public elementary and secondary education has remained fairly
large throughout the past five decades, ranging from 3.7 to 4.5 percent.2 Given the large
share of spending on education, it would be useful to know if these resources are well
spent. Despite this interest, we lack stylized facts about the effects of spending on changes
in student academic and adult outcomes. The goal of this paper is to provide a robust
description of the causal relationship between fiscal shocks and student outcomes at the
district level for US states undergoing financial reform for the period 1990-91 to 2009-10.
Using district-level aggregate data from the Common Core of Data (CCD), we estimate the effects of fiscal shocks, where fiscal shocks are defined as a state’s first Supreme Court ruling
that overturns a given state’s finance system, on the natural logarithm of per-pupil spend-
ing and graduation rates. Researchers are presented with a number of modeling strategies
in panel data situations. We have two objectives. The first is to present a theoretically
rich model that is flexible enough to handle two aspects of the identification problem: first,
treatment occurs at the state level and, second, there is treatment effect heterogeneity within
states. Given the variety of reasonable modeling choices that exist, our second objective is
to show how sensitive our results are to some common alternative specifications.
Altogether, we estimate a heterogeneous differences-in-differences model that accounts
for (a) cross-sectional dependence at the state level, (b) a poverty quartile-by-year secular
trend, and (c) state-by-poverty quartile linear time trends. In this preferred specification,
we find that high poverty districts in states that had their finance regimes overturned by court order experienced an increase in log spending by 4 to 12 percent and graduation
rates by 5 to 8 percentage points seven years following reform.
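The benchmark comparison can be illustrated on simulated data. Everything below is invented for the sketch (the state labels, the 1998 treatment year, and the 0.08 log-point effect); it shows only the basic difference-in-differences contrast, not the full model with cross-sectional dependence and trend controls.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated district-by-year panel (invented; not the CCD data).
states = [f"s{i}" for i in range(10)]
treated = set(states[:5])          # states whose finance system is overturned
rows = []
for st in states:
    for quartile in range(1, 5):   # poverty quartile, 4 = highest poverty
        for year in range(1991, 2010):
            post = int(st in treated and year >= 1998)
            # True effect: spending rises only in treated, high-poverty districts.
            effect = 0.08 * post * (quartile == 4)
            log_spend = 8.0 + 0.01 * (year - 1991) + effect + rng.normal(0, 0.01)
            rows.append((st, quartile, year, log_spend))
df = pd.DataFrame(rows, columns=["state", "quartile", "year", "log_spend"])

# Difference-in-differences by hand for the highest poverty quartile:
# (treated post - treated pre) minus (untreated post - untreated pre).
q4 = df[df.quartile == 4]
trt, ctl = q4[q4.state.isin(treated)], q4[~q4.state.isin(treated)]
did = ((trt[trt.year >= 1998].log_spend.mean() - trt[trt.year < 1998].log_spend.mean())
       - (ctl[ctl.year >= 1998].log_spend.mean() - ctl[ctl.year < 1998].log_spend.mean()))
print(round(did, 2))  # close to the simulated 0.08 log-point effect
```

Because the common secular trend differences out, this simple contrast recovers the planted effect; the paper’s regression versions additionally absorb state and quartile-by-year heterogeneity.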
To test the extent to which results are equalizing, we estimate slightly different models,
allowing the effect of reform to be continuous across poverty quantiles. We control for sec-
ular changes in the equalization efforts in non-treated states by including year fixed effects
interacted with continuous poverty. This provides an estimate of the marginal change in
graduation rate for a one-unit increase in poverty percentile rank within a state. Here we
see that the effect of reform was equalizing: for every 10 percentile increase in poverty
2These estimates come from the Digest of Education Statistics, 2013 edition. As of the writing of this draft, the 2013 version is the most recent publication available.
within a treated state, per-pupil log revenues increased by 0.9 to 1.8 percent and graduation
rates increased by 0.5 to 0.85 percentage points in year seven.
We then subject the model to various sensitivity tests by permuting the interactive fixed
effects, secular time trends, and correlated random trends. In total, we estimate 15 complementary models and present these results graphically. Overall, the results are robust across model specifications.
This paper makes substantive and methodological contributions. Substantively, we find
that court cases overturning a state’s financial system for the period 1991-2010 had an effect
on revenues and graduation rates, that these results are robust to a wide variety of modeling
choices, and that this effect was equalizing. Taken together, our two models show that states undergoing court-ordered finance reform both (a) increased revenues and graduation rates
in high poverty districts relative to high poverty districts in other states and (b) allocated
a greater share of these revenues to higher poverty districts within the state, relative to
allocations taking place in non-treated states, resulting in an increase in graduation rates.
Methodologically, we emphasize the variety of modeling strategies available to researchers
using panel data sets, including specification of the secular trend, correlated random trends,
and cross-sectional dependence. While the researcher may argue for a preferred model,
justifiable alternatives are often available. Here we have presented a graphical method
that researchers can use to effectively and efficiently demonstrate the sensitivity of point
estimates to modeling choice.
V. Conclusion
The quality of our inferences in an education production function hinges on three indepen-
dent but interrelated factors:
1 A normative account that justifies which outcome variables should (or should not) be included in the model
2 The specification of an outcome variable that accurately measures our valued commitments
3 An econometric model that properly links inputs to desired outcomes
In educational policy settings, we are presented with choices about which outcomes to in-
clude, how to specify them, and which identification strategy to follow. In this dissertation I
have considered each of these factors in isolation. I have argued that equal opportunity and
adequacy principles are not easily reconciled with legitimate student claims to some educa-
tional resources. I have presented and implemented a method that applies a welfare-based
weighting scheme to different parts of the ability distribution, thus allowing the outcome
variable in cost-effectiveness evaluations to better reflect benefit. Finally, I have presented
compelling causal evidence about the effect of court-ordered finance reform on spending
and graduation rates in high poverty districts.
Chapter 2
Achievement is not income
Abstract
In education, it is common to hear that certain students should receive additional resources in order to increase their achievement. A class of equal opportunity principles classifies students into protected and non-protected groups. Protected groups are to receive educational resources up until the point they catch up to non-protected groups. Adequacy principles specify a threshold, perhaps indexed to some other value, that classifies students into two groups, those who are below and above the threshold. Those below are to receive educational resources up until the point they reach the threshold. The full realization of both equality and adequacy principles will, in many cases, make it so that no resources will be available for either non-protected groups or those above the threshold. If we endorse a rights-based distributive principle that says all students have claim to at least some educational resources to develop their abilities, then equality and adequacy principles are incomplete distributive principles. A re-calibrated distributive principle must reconcile egalitarian moral reasoning with the general claim to some educational resources. The size of each student’s claim is left intentionally vague. It can be neither too little (just a token) nor too large (equal resources for all), but somewhere between the two extremes lies the claim.
I. Introduction
In this paper, I argue for a rights-based principle for the distribution of educational re-
sources. The right that is proposed is: “All students, no matter who they are, have claim
to at least some educational resources to develop their abilities.” The phrase “some edu-
cational resources” is intentionally vague. It does not specify how much each student is
entitled to, and allows for the possibility that some may need more than others. The princi-
ple only entails a minimum amount for all students. Such a right seems like a small thing,
and I doubt there will be much objection to it.
Nevertheless, I show that a suite of popular distributive theories—two versions of fair
equality of opportunity and adequacy—problematically conflict with the right to resources
that all students share. In many cases, satisfying equality and adequacy principles will
require the rich or the talented to relinquish their claim on educational resources. At issue
is the simple fact that both principles divide student populations into protected and non-
protected classes (in the case of equality) or below and above threshold groups (in the case
of adequacy). Non-protected classes and above threshold groups are offered nothing by the
respective principles. Such principles discriminate on the basis of ability or income and
this violates the right.
Two questions will need to be sorted out:
1 Is the right for all students to at least some educational resources merely one of many competing principles–akin to parental partiality, “all things considered” priority, etc.–or does the right reveal fundamental limitations to these popular distributive principles when they are applied to educational achievement?
2 If we concede some basic egalitarian intuitions–that ability differences are important for one’s life prospects, morally arbitrarily assigned, and compensable through educational investments–how do we reconcile these intuitions with the general claim to educational resources that all students share?
In response to the first question, I argue that the right does reveal fundamental limitations
to current distributive paradigms applied to education. In short, achievement is not income
and needs to be treated differently. Indeed, when we consider other distributive objects,
such as income and welfare, the rights-based objection to equality and adequacy falls short.
A thoroughgoing response to the second question is, unfortunately, not forthcoming.
Balancing equality against general claims has a kind of Goldilocks problem. A minimal
right to educational resources, such as an offer of the smallest divisible unit of resource
for the rich or talented, is too weak; a maximal right that promises equal resources to all
students violates any notion of fair equality and is too strong. The right that is ‘just right’
is not well defined, but it can be found somewhere between the minimal and maximal
specifications.
The paper proceeds as follows. I lay out some understanding for what the right to edu-
cational resources entails (and specifically, what it does not entail). I then hold the right to
educational resources against three popular distributive principles in education: expanded
and restricted fair equality of opportunity and adequacy. I show that these principles will
conflict with the right to educational resources in many plausible scenarios. I then consider
whether we should interpret this conflict as just one more example of an important value
that conflicts with equality, or whether the conflict reveals something more fundamental
about the principles. Finally, I provide some (admittedly unsatisfactory) details about what
the right to educational resources must entail.
II. Establishing a right to educational resources
Suppose we observed the following:
In some schools, certain students—we know nothing about their demographic characteristics, their social origins, or their performance on a test—are left to their own devices every day. Teachers give them no attention. On some days, if the guardian is present, students may stay home to play video games or read comic books. In short, they learn nothing.
Here I make a claim about the distribution of opportunities for achievement that I think
most people will find uncontroversial. If we find the above scenario troubling, then we are
led to endorse the view that:
Rights-based distributive principle: All students, no matter who they are, have
claim to at least some educational resources to develop their abilities.
By endorsing this view, we embrace the idea that schools exist for all students, and
that no students should be denied educational opportunities for any characteristic they may
have.1 The phrase “some educational resources” is intentionally vague. It does not specify
how much each student is entitled to, and allows for the possibility that some may need
more than others. The principle only entails a minimum amount for all students.
Any time a student’s characteristics are used as a basis for exclusion from any educational opportunity, we will recognize the harm that is done to that student. Likewise, if a
distributive principle requires that some groups receive no resources, we must conclude
that the principle is inadequate, as the fulfillment of the principle means that some stu-
dents’ legitimate claims on some educational resources will be violated. Thus, if we are to
endorse the idea that no students should be deprived of all opportunities to develop their
abilities in school, we can use it to test other, competing distributive principles. Here is the
test:
1 If the principle identifies some subgroups based on their characteristics and uses those characteristics to deny those students any resources to develop their abilities, then we can conclude that the principle is inadequate.
2 If in order for the distributive principle to be satisfied, it requires, at least in some instances, that certain students receive no resources to develop their abilities, then we can conclude that the principle is inadequate.
This test is not conclusive. It cannot be used, for example, to determine whether some
students should receive more resources than other students. Instead, the purpose of the
1Some might object that the view is too strong, as it allows for the possibility that students who are a threat to other students are also entitled to resources for opportunities to develop their abilities. In other words, the principle excludes policies like out of school suspension. Whether out of school suspension is a legitimate policy choice is outside the scope of this paper. If necessary, we can modify the principle to be “all students, no matter who they are, as long as they are not a threat to other students, have claim to at least some educational resources to develop their abilities.” The change will not affect the results in any way.
test is to determine whether a principle fails in this very fundamental way. What kinds
of principles fail this test? There are obvious candidates. A principle of racial or gender
discrimination obviously fails, as students, based on their race or gender, are denied edu-
cational opportunities on account of those characteristics. Similarly, a principle that said
“give only to those who stand to gain the most” would also be in violation, as there are
necessarily some who do not stand to gain the most and will be left without.
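The test has the form of a simple decision procedure, which can be made concrete with a small sketch (my own illustration, not part of the dissertation's argument; the principle names and allocation figures are invented). A principle is modeled by the allocation it produces, and it fails the second condition of the test whenever satisfying it leaves some student with no resources at all:

```python
def fails_rights_based_test(allocation):
    """Return True if the allocation denies some student all educational resources.

    `allocation` maps each student to the amount of resources a distributive
    principle assigns them. The rights-based principle requires a positive
    minimum for every student, however small.
    """
    return any(amount <= 0 for amount in allocation.values())


# "Give only to those who stand to gain the most": everything to one student.
gain_most = {"low_achiever": 10.0, "high_achiever": 0.0}

# A compensatory rule that still guarantees everyone a minimum share.
compensatory_with_floor = {"low_achiever": 8.0, "high_achiever": 2.0}

print(fails_rights_based_test(gain_most))                # True: inadequate
print(fails_rights_based_test(compensatory_with_floor))  # False: passes
```

The sketch makes vivid that the test is a floor condition only: it says nothing about how resources above the floor should be divided, which is exactly the limitation acknowledged below.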
However, the focus of this essay is not on those principles that are most easy to defeat.
My focus instead is a suite of popular theories of distributive justice in education that, one
way or another, tacitly endorse discrimination of students on the basis of ability. I am not
referring to the kind of ability that we normally think about; on the contrary, most theories
endorse some form of compensation for low-achieving students in the form of educational
resources. The discrimination I have in mind, in most cases, is against the very talented,
those whose abilities surpass the abilities of other students, though through no fault of their
own.
By calling attention to the very talented, the purpose is not to defend a meritocratic
conception of justice, where only those with the “right” abilities have access to certain ad-
vantages. It is possible to recognize the unfairness in the distribution of opportunities for
labor market success, in part resulting from differences in ability, without the need to deny
high achieving students all learning opportunities. Nor do I argue or believe that the very
talented are a particularly disadvantaged group, one with whom we should sympathize. We
can recognize that those with greater ability will have certain advantages without denying
them the opportunity to develop their abilities through schooling. Finally, we need not
claim that higher achieving students deserve better or even equal educational opportunities
compared to their lower achieving peers. I only assert that we cannot divest higher achiev-
ing students of good opportunities to learn simply in order to increase investment in the
education of lower achieving children. This assertion rests on the idea that if we were to
learn that those students left to their own devices in the example above were the highest
achieving students in the school, we would be no less offended.
III. Testing distributive theories against the right to some
educational resources
I now consider three prominent distributive theories in education and show that each fails
the test under plausible conditions. The theories are expanded fair equality of opportu-
nity, restricted fair equality of opportunity and adequacy.2 I first evaluate the two equal
opportunity principles before considering objections to the test itself. The ordering may
appear awkward at first, since an objection to the evaluation may be immediately evident.
Nevertheless, I think it is worthwhile to first go through the arguments against both equal
opportunity principles before considering rebuttals.
III.A. Expanded fair equality of opportunity
It is common to hear that low achieving students are due compensation in the form of edu-
cational resources so that their opportunities for labor market success, freedom of occupa-
tional choice, and political participation (among other things) may be more equal. Such a
notion rests on three premises, to which most egalitarians would agree:
1 Differences in ability are a legitimate, meaningful and long-standing obstacle in the way of opportunities for labor market success, occupational choice, challenging and meaningful labor, and making contributions to the social good, such as through public office.
2 Differences in opportunities for the benefits described above that are the result of differences in ability are no more the responsibility of the individual than differences in opportunities that result from social origin, race or gender. Each of these barriers to equal chances is, to use Rawls’ phrase, arbitrary from the moral point of view.
3 Social institutions, such as schools, exist and are suitable for changing the ability distribution to make it more equitable. These institutions have at their disposal a pool of resources and can use them in a compensatory way such that students with lower ability receive more, thereby improving their achievement so that they catch up to students with greater starting ability.
2I exclude a principle of priority from the list, as it is self-evident that a priority principle will not allocate resources to any students who are not the least advantaged.
Taken together, these three premises often lead to an “expanded” fair equality of oppor-
tunity principle that includes among its set of protected classes ability, social class origin,
gender and race. Expanded fair equality of opportunity is violated when differences in
opportunities for culture and achievement exist between persons of different ability, social
class, race and gender.3
The idea that opportunities for culture and achievement should be equal for those of
different abilities is sometimes thought to be too extreme. Two arguments against expanded
fair equality are commonly raised. The first is that it is impractical and too costly. This
argument is most strongly expressed when it is grounded on the premise that such costs
can harm the worst off. The idea is that if we compare two educational systems, one that
spends its resources trying to improve the achievement prospects of the worst off (in this
case, the untalented) and another that spends its resources efficiently, we will find that the
efficiency-based system actually improves the quality of life for the worst off group, all
things considered. The mechanism for this is that the overall level of achievement will be
higher in the efficiency-based system and, subsequently, the level of material comfort will
be higher for the worst off as well, since they stand to gain from the advancement of others.
In short, trying to help the worst off exclusively through schooling ends up leaving them
worse off in other dimensions, such as income.
The second argument against expanded fair equality is that making sure there is equality
between different classes will inevitably result in overly restrictive infringements on the
3See Brighouse and Swift, 2015 and Jencks, 1998. Brighouse and Swift refer to this as “the Radical Conception” of equal opportunity; Jencks referred to this as “Strong Humane Justice.” Brighouse and Swift (2015) argue for the radical conception; they write, “what motivates [fair equality of opportunity], which we take to be the concern that people not be disadvantaged in competitions by characteristics for which they are not responsible, condemns unequal achievements due to talent (whether natural or endogenously developed) just as much as it condemns those due directly to social class. People are no more responsible for having the talents or defects they were born with than for the class background into which they were born, and no more responsible for the class-based factors that impact on their development. None of these are reasons to welcome social class influencing unequal achievement. They are not, in other words, reasons for unease about educational equality. ... Rather, they are reasons for resisting the idea that inequalities of talent (natural or developed) should influence educational achievement,” (2015, p 15-16).
rights of parents to interact with their children. Parents have different preferences for their
children, and some will invest more resources and time than others; if equality is to persist, these parental interactions will need to be curtailed. Such a curtailment is a violation
of more fundamental liberties.4
These two arguments do not speak against expanded fair equality per se; rather, they
suggest that expanded fair equality will need to be brought into balance against these other
competing values.5 Note that this balancing may not even be necessary, as the presence
of the conflict between the competing principles is empirically circumstantial. Suppose
that two students are very far apart in terms of their abilities. Suppose further that the
low achieving student stands to gain as much as the high achieving student for every unit
of educational input. In this case, compensation coincides with efficiency, and there is
no leveling down of achievement when resources are given to the low achieving student.
Consequently, the low achieving student is no worse off under the expanded fair equality
regime than she would be under an efficiency-based regime. Likewise, parental partiality
need not be troubled by expanded fair equality. Parents of high achieving students may,
either because of preferences or their own moral reasoning, choose not to allocate their time
and resources to ensure their child maintains his or her positional advantage in achievement.
In such a world, parental partiality is unconflicted with expanded fair equality.
We see, then, that agreement with the first three egalitarian premises is compatible with
parental partiality and an “all things considered” concern for the least advantaged. We can
now evaluate the principle to see if it conflicts with the rights-based principle advocated
here.
Recall the case where it was just as efficient to help the lowest achieving student as it was to help the higher achieving student. If the high achieving student continues to be higher achieving, despite the compensation going towards the low achieving student, expanded fair equality demands that the higher achieving student get nothing. Fulfilling the principle requires, in this circumstance, that high achieving students relinquish
4See Brighouse and Swift, 2009; 2015; Brighouse, Ladd, Loeb and Swift, 2015 for discussion.
5Such a balancing can be achieved in any number of ways. Strict lexical priority that protects parental partiality is one way.
their claim on educational resources to develop their abilities. Here we see that expanded
equality of opportunity, even without the threat to priority, threatens the rights of all stu-
dents to some educational resources.
Alternatively, consider a case in which parental partiality is not threatened. Here, a
parent may say to his child, “You are higher achieving than So-and-So. The school system
is going to give her all the resources so that she can catch up with you. I am an egalitarian
minded parent. So that she may catch up with you quickly, and without bankrupting the
school, I will ensure that you do not get any education outside of school as well.” Here,
the parent voluntarily submits in order to fulfill the requirements of expanded fair equality.
Parental partiality is preserved, expanded fair equality is satisfied, and yet justice has been
undermined. A distributive principle that fails to recognize the basic right of opportunity
to develop one’s abilities is deficient.
III.B. Restricted fair equality of opportunity
By including ability differences in the set of protected obstacles, expanded fair equality
of opportunity will, in many instances, require that all educational resources go to the
lowest achieving students in order to equalize achievement. In such instances, the right
for all students to some educational resources to develop their abilities will be violated.
What about a restricted equal opportunity principle, one that compensates on the basis of
differences resulting from social circumstance alone? Such a principle, which Brighouse
and Swift refer to as the “meritocratic conception,” and others will recognize as Rawlsian
fair equality of opportunity, says that opportunities for achievement should be equal for
those of equal ability.
We can now see whether this restricted version of fair equality conforms to our test.
In the first case, restricted fair equality offers nothing to students whose abilities are low
for reasons not having to do with social circumstance. To see this, consider a two-class
society. In this society, there are two distributions of achievement. Restricted fair equality
is satisfied when the two distributions converge, even though there may be considerable
within class variation. Those students at the very bottom of the distributions have nothing
to gain from restricted fair equality.
It is worth noting that restricted fair equality is implicated both by the test proposed here
and the three egalitarian premises described above—that talent differences are meaningful,
arbitrary and compensable. Rawls’ fair equality principle has been criticized for this very
deficiency.6 The test proposed here–“[i]f the principle identifies some subgroups based on
their characteristics and uses those characteristics to deny those students any resources to
develop their abilities, then we can conclude that the principle is inadequate”–
is echoed by Arneson (1999), who asks, “[w]hy is it morally acceptable to single out the
untalented and herd this group into the bottom rung on the social hierarchy? Why is this
not invidious discrimination, invisible to us because it chimes in with the ethos of mod-
ern democratic societies?” (Arneson, 1999, p 10). Restricted fair equality of opportunity
discriminates on the basis of talent and for this reason it is deficient.
Whereas in the first case we saw that low achieving students’ claims on educational
resources were disregarded, in the second case we will see that high achieving students
also stand to lose. The necessary conditions for high achieving students to lose are only
that social circumstances are prohibitively difficult to remedy through schooling. If, in
order to bring the two distributions together, all resources must be spent in order to raise
the achievement of the lower distribution, the entire non-protected class of students will
lose their claim on educational resources in order for restricted equality of opportunity to
be satisfied. Considering this last case raises a more fundamental criticism against both
expanded and restricted versions of fair equality.
The principles differ in who is included among the protected classes. Yet, however the
protected classes are defined, the purpose of the principles in both cases is to bring the levels
of achievement of the protected class up to the levels of achievement of the non-protected
class.7 This, by definition, excludes the non-protected classes from the distributive prin-
ciple. Indeed, consider that restricted fair equality of opportunity is most easily satisfied
6See Arneson, 1999; Pogge, 2004; Clayton, 2004.
7This is assuming that the non-protected classes are not brought down, which is an alternative and not relevant criticism against equality generally.
if the students of high income families are removed from the educational system entirely.
Likewise, expanded fair equality is most easily satisfied when the talented are excluded. It
is a strange feature of egalitarian theory that a great deal of specification is required
in order to lay out clearly just which practices of motivated parents, high income families
or the talented are permissible and which are not.8 We can agree that certain educational
and parental practices (such as private schooling, or test preparation services) are unfair, or
that the very talented should receive fewer resources than the less talented, without relying
on equality-based theories that fail to establish baseline conditions: children of the rich,
children who are intelligent, and all other children, have a right to learn in school. Nothing
about equality establishes that baseline, and for that reason it is deficient.
III.C. Objections to the test
We are now prepared to consider two objections to the test. The first objection is that the
test holds equality principles accountable to standards equality principles are not intended
to satisfy. Equality principles are intended to describe patterns that are fair. Fundamentally,
equality principles indicate whether or not groups or individuals have too much or too
little of the distributive object; they are not principles of efficiency.9 For these reasons, the
principle of “all students, no matter who they are, have claim to at least some educational
resources to develop their abilities,” should be understood as an addendum to these or other
distributive principles. Yes, we care about equality, the objection goes, and we care about
providing everyone chances as well.10 The proposed test does not reveal a deficiency in
equality; at best, it merely suggests that we should include rights for all students as one of
the many competing goals a just society pursues alongside equality.
Whereas the first objection stated that the rights-based distributive principle should be interpreted as an addendum to equality principles, the second objection states that the rights-
8See Brighouse and Swift (2009) for discussion of which parental practices are legitimate and which are not, and Brighouse and Swift (2006; 2015) for discussion of whether private schooling is legitimate.
9This argument is sometimes used against the leveling down objection. Equality is one concern; levels of the distributive object are another.
10Aggregating values is the approach suggested by Brighouse and Swift, 2008; 2015 and Brighouse, Ladd, Loeb and Swift, 2015.
based claim is no weightier than any of the many other competing values to equality, such
as priority and parental partiality. Let us recall the comparison between parental partiality
and the right for all students to develop their abilities in school. Regarding parental par-
tiality, we saw that satisfying equality could, in some cases, require restrictions on parental
inputs. In other cases, it will not, depending on the state of parental preferences at the time.
In this way, the principle of rights is strictly parallel to parental partiality. In some cases,
satisfying equality could require that high ability students relinquish their claim on edu-
cational resources. In other cases, it will not, depending on how efficient it is to raise the
achievement of low ability students and on how far apart low ability students are relative to
high ability students. In such a case, equality will be satisfied (eventually) and rights will
be secured. Faulting the principle of equality because it, in some cases, conflicts with other values makes the same mistake made by those who have faulted equality because
it conflicts with parental partiality. It is not a deficiency of a principle when, in certain
cases, fulfilling the requirements of the principle conflicts with other desired ends.
Let me respond in three ways. First, I note that the objections do not defeat the argument
I make here. The test succeeds, even minimally, if it reveals that there are other values at
stake when we try to implement equality. Any policy will have to grapple with the broad
suite of values implicated thereby, including rights for all students. This is a non-trivial
point insofar as it has not been recognized in the literature.
The second response is that we should note that the rights-based principle I have pro-
posed is a distributive principle in the metric of educational opportunities. For this reason,
equality of opportunity for achievement and the right to resources for achievement fall un-
der the same class of principles. This is in contrast to the values of parental partiality and
the “all things considered” level of welfare for the least advantaged. In that sense, a cohe-
sive principle for the distribution of educational opportunities ought to include both facets
of the good in question. Equality is deemed deficient because it neglects this other element
of the distributive problem. This is what separates a rights-based account from parental par-
tiality and priority and shows that the test is necessary: the test reveals limitations internal
to the distribution of educational opportunities.
Third and finally, we can see whether the objections to the test could be applied to al-
ternative metrics of benefit, such as income or welfare. If the objections apply equally to
all metrics of benefit, then we can conclude that, at best, I have presented another value, at
times in conflict with equality. Alternatively, if we find that the test does not apply to other
metrics of benefit, then we can conclude that I have offered more than a mere addendum
to the already long list of competing values. Instead, it suggests that the distribution of
achievement requires a different distributive paradigm, as the class of distributive princi-
ples used for metrics of benefit like income and wealth cannot be easily applied to other
domains.
Let us consider income. We need to establish three similarities: the moral intuition,
the equal opportunity expression, and the right. We can then test whether the right is
problematically in conflict with the equal opportunity expression.
1 Egalitarian moral reasoning can be applied to income in the same way it was applied to achievement: income is important (it is an all purpose good necessary for the realization of one’s ends), arbitrarily assigned (to the extent that one’s abilities and social circumstances are necessary for earnings and arbitrary from a moral point of view), and compensable (earnings can be taxed and transferred).
2 The analogous equal opportunity expression reads: opportunities for income should be equal for those who put in similar effort. This suggests an equal pay for equal effort system, or one for which a tax and transfer system equalizes income.
3 An analogous income-based right would be: all persons have some claim on resources in order to develop their incomes. Such a right seems odd, but we can give it one of two plausible interpretations. (The difficulty in articulating a right to develop one’s income should already give us pause, as no such difficulty is present in the right to develop one’s abilities. It suggests a problem before the comparison even gets off the ground. We tend to think of welfare and income as cross-sectional goods, meaning that what matters about them is how much one has of the good, and not how much one is able to develop the good.)
3a The first might include something like Elster’s (1986) Marxian notion of self-realization through labor. We could say that an analogous right would be that all persons have some claim on resources through labor (meaning they are guaranteed a wage). This interpretation speaks to the value of meaningful labor.
3b The second might include something like Krouse and McPherson’s (1986) notion of a property-owning democracy. We could say that an alternative right would be that all persons have some claim on capital in order to develop their incomes. This second interpretation speaks to the autonomy one is afforded through ownership of capital.
We can now see whether satisfying [2] the equal opportunity expression will, in certain
instances, require the violation of [3] individual rights to [3a] earn an income through work
or [3b] earn an income through capital. The concern raised against equality above was that
equality offers nothing to members in non-protected classes. In the case of income, the
principle says that incomes will be equal for those who put in equal effort. Here, we can
pick any income distribution that conforms to the principle to make the test. For example,
in a world where the distribution of effort is skewed, incomes will be skewed; in a world of
equal effort, incomes will be equal.
The question is whether any of those income distributions will, in any case, necessarily
result in a loss of opportunity to develop income through work or earn an income through
capital. The most obvious case for which equality denies everyone an income is made
when we invoke the leveling down objection. Equality, say, is so disincentivizing that
nobody puts in effort, production halts, and incomes for everyone converge to zero. In such
a setting, equality necessarily conflicts with either of the two versions of the right. This,
however, is an extreme case, and leveling down was not invoked in the previous examples.
Let us see if other non-leveling down versions can lead to conflict.
A second possibility emerges when we realize that equal income for equal effort leaves
nothing for those who put in no effort. This, too, is an extreme case and leaves open the
possibility that effort should be included as one of the protected classes. If all obstacles
are included in the set of protected classes, equal opportunity converges to equal outcome.
When we consider perfect equality of income, the distinction between the two metrics
of benefit comes into sharp relief. Equal income does not leave anyone without income,
insofar as there is divisible income available. Individuals will then be free to do with that
income what they like, such as find work or purchase capital. Certainly, the opportunities
to find meaningful work or purchase capital may be violated in other ways. The society
may not, perhaps, guarantee an income through all forms of labor. The violation of the
right in that case, however, is not a necessary consequence of equality.
The analysis above reveals the fundamental difference between income and achievement.
When we consider achievement, we recognize that individuals have initial endowments that
cannot be taxed; we do not take one person’s ability points and give them to another.11 It
is the assignment of initial endowments that bestows upon individuals the right to develop
those abilities.12 The difference between the two metrics, and the corresponding right that is
attached to one metric and not the other, is what explains why equal opportunity paradigms
run aground when they are applied to achievement. The proposed test is therefore use-
ful because it reveals this limitation in traditional distributive theories that are applied to
achievement.
III.D. Equality of opportunity for what?
I would like to take a moment here to develop some of the ideas in the previous discussion
and see if those can be used to offer some clarification about a long-standing disagreement among equal opportunity theorists. The disagreement is about whether the set of
protected obstacles included in the equal opportunity expression should be expanded to
include ability differences. Arneson (1999) argues that by failing to protect against ability
differences, Rawlsian fair equality suffers from “meritocratic bias,” (p 86).13 Brighouse
11Admittedly, the inability to tax and transfer ability is both a structural and an ethical problem. It is structural because we lack the technology to do so; it is ethical because it is not clear we would choose to tax abilities even if we could. Indeed, the right to develop one’s abilities suggests we would not take from one and give to another.
12There is a puzzling passage in Rawls that seems to capture this idea. He writes “the difference principle represents an agreement to regard the distribution of native endowments as a common asset and to share in the benefits of this distribution whatever it turns out to be. It is not said that this distribution is a common asset: to say that would presuppose a (normative) principle of ownership that is not available in the fundamental ideas from which we begin the exposition.... Note that what is regarded as a common asset is the distribution of native endowments and not our native endowments per se,” (JF, p 75). It is not clear what exactly Rawls means by this phrase. How is it that the distribution of native assets is to be regarded as a common asset but not actually incorporated as a common asset? By attaching the fundamental guarantee of individual liberties to the ownership of one’s own native assets (thereby giving ownership of one’s assets lexical priority over the distributive principles), Rawls’ view potentially aligns with the right to develop one’s abilities that I have argued for here. Though this is only speculation.
13The full quote is “Fairness to talent trumps fairness to the worst off in Rawls’s system. That no talented and ambitious person should be worse off in prospects than any person who is less talented and ambitious
and Swift give credence to this notion when they describe the view as the “meritocratic
conception,” (2015).
It is important to distinguish the ultimate aims of the equal opportunity expression. They
are at times treated as substitutes and this can lead to confusion. The first version, which I
have used throughout, says that opportunities for culture and achievement should be equal
for a set of protected classes, where the set can be defined to include social class, gender,
race and ability. This version is the one that is often used in educational settings. A second
version says that opportunities for careers and political positions should be equal for a set
of classes.14
The two versions are related but distinct. Equal opportunities for careers and political
positions includes both non-discrimination and educational compensation aspects. Non-
discrimination is usually applied to the labor market and prohibits discrimination on the
basis of race and gender. The compensatory aspect comes into play when we recognize
that one’s social origin can affect one’s developed abilities. When we say that opportuni-
ties for careers and political office should be equal for those from different social origins,
we recognize that individuals from disadvantaged social backgrounds will need compen-
satory educational training if they are to compete on equal footing against those from more
advantaged backgrounds. Equal achievement for those from different social origins is the
instrument through which equal opportunities in the labor market are made possible.
We should note that, while educational compensation is necessary to satisfy the intent
of fair equality of opportunity for careers, it is not strictly necessary for equal chances. A
non-discrimination principle on the basis of social background would still allow for equal
chances, if, for example, individuals were assigned to jobs according to a lottery. Assuming
social background affects developed abilities, the consequence would be that those from
disadvantaged backgrounds would be ill-prepared for the jobs to which they applied. Such
a lottery might technically fulfill the requirements of fair equality of opportunity but would
takes lexical priority over egalitarian (Prioritarian) norms. To my mind this is wrong and reveals a meritocratic bias,” (Arneson, 1999, p 86).
14Rawls, in TJ, for example, describes fair equality as being both for culture and achievement as well as careers. See pages 63-64 and 91-92 (1999) for both descriptions.
certainly fail to satisfy its intent. Whatever benefit there is to be derived from fair equality
requires that individuals be able to perform the job to which they are assigned, and this
means that educational compensation is a necessary component of the principle.
This is why when Arneson says fair equality of opportunity suffers from meritocratic
bias, or when Brighouse and Swift call it the meritocratic conception, they have to specify
the equal opportunity outcome to which they refer. Discrimination on the basis of ability in
the labor market is necessary for the benefit of the principle of fair equality to be realized.
Matching skills to positions is necessary for one to realize the benefit of performance within
that position, making discrimination on the basis of ability in the labor market acceptable.15
However, when we talk about equal opportunity for achievement, the case is different, as
schools are responsible for (or at least capable of) affecting the distribution of ability. This
is categorically distinct from the labor market. For the three egalitarian reasons mentioned
above—talent differences are important, arbitrary and compensable—discriminating on the
basis of ability with respect to the distribution of school resources is fundamentally differ-
ent and difficult to justify. Calling it “meritocratic”, however, confuses the issue, as no
proponent of Rawlsian fair equality, to my knowledge, has argued that abilities are earned,
especially in the case of children.
Because achievement is instrumental for careers, it should be clear that satisfying ex-
panded fair equality of opportunity with respect to culture and achievement will converge
with expanded fair equality of opportunity with respect to careers. Once opportunities for
15Having said this, we might then conclude that equal opportunity for culture and achievement is simply a roundabout way of articulating the two discrete purposes of the broader fair equality principle. What is really meant is that fair equality for careers requires both (a) non-discrimination on the basis of the protected classes (social origin, race and gender) and (b) educational compensation in cases for which the protected class has lower achievement as a result of membership in that protected class, (b.1.) in cases where careers and positions of political office have certain achievement-based requirements.
This rendering is not entirely satisfactory, however, as it treats culture and achievement as mere vehicles for economic and political advancement. We can grant that culture and achievement are important in this instrumental way and still maintain that culture and achievement are valued goods in their own right. For example, one’s abilities allow one to engage in political affairs more effectively and provide access to a wider range of professional and recreational activities. For this reason, treating equal opportunity for culture and achievement as sub-components of a fair equality principle for careers and political office is too restrictive. Nevertheless, we must also recognize that equal opportunity for culture and achievement is not a sufficiently robust description of equal opportunity, as it does not include a prohibition against labor market discrimination.
culture and achievement are equal for all persons, irrespective of ability, socioeconomic
origin, gender and race, then careers will likewise be equal, as differences in ability will
be erased. In this way, Arneson’s objection to Rawlsian fair equality still has force: dis-
crimination on the basis of ability for culture and achievement leads to discrimination for
careers. The question for proponents of Rawlsian fair equality is what justifies this type of
discrimination.
Here I must speculate somewhat. I suggest that proponents of Rawlsian fair equality have
argued for discrimination on the basis of ability in school settings for one of two reasons.
The first is that they may have conflated opportunities for culture and achievement with op-
portunities for careers, thereby conflating a true statement—ability matching is necessary
for the realization of benefit in the labor market—with a false statement—allocating edu-
cational resources to students on the basis of ability is not problematically discriminatory.
The second is that they may be worried that realizing expanded fair equality of opportunity
for achievement will take all educational opportunities away from students who are high
achieving, which would constitute another kind of unfairness, the revocation of student
claims on educational resources to develop their abilities.16
In both cases, however, restricted (or Rawlsian) fair equality of opportunity for achieve-
ment is found deficient. It unjustifiably discriminates on the basis of ability in school
settings, by only offering resources to students whose abilities are low due to social circum-
stance. Conversely, it offers no claim on educational resources to students whose abilities
are high and are members of one of the non-protected classes.
16It is difficult to ascertain what proponents of Rawlsian fair equality believe with respect to Arneson’s criticism because they have never addressed it. This is outside the scope of the paper, but Satz (2012; 2015), Shiffrin (2004) and Taylor (2004) have responded to only the first objection to Rawlsian fair equality–that it unjustifiably gives more weight to fair equality than it does to the prioritarian distribution of income and wealth. Satz, Shiffrin and Taylor have each argued, broadly, that equal opportunity principles provide independent benefit, separable from shares of income and wealth, which are allocated according to the difference principle. Note that critics can concede this point and still ask why Rawlsian fair equality is fair if it discriminates on the basis of ability, with respect to the distribution of educational resources. No proponents of Rawlsian fair equality have defended the specification of the principle vis-à-vis an expanded fair equality principle.
III.E. Adequacy
It might be thought that equality principles are soft targets for a rights-based critique, as
equality principles are only concerned with relative differences. Of course equality disre-
gards rights, the argument goes, as equality is a categorically distinct principle from rights.
As I have already argued, the categorical distinction is overstated, since both rights and
equality claim to be distributive principles in the metric of educational achievement. Nev-
ertheless, equality principles are not the only distributive patterns available. Adequacy
principles have been offered as alternatives to egalitarianism, and they may be immune to
the test proposed here, as relative differences, while not completely outside the scope of
adequacy principles, are de-emphasized.
Satz (2006) and Anderson (2006) offer the two most well known versions of these princi-
ples. The adequacy principles offered by Satz and Anderson are not strictly sufficientarian.
They argue that democratic equality, which, in broad terms, is the requirement that all
citizens be able to interact and be treated as equals in democratic society, only requires
an adequate level of achievement. This complicates the notion somewhat, as the level of
achievement will depend on what democratic equality requires. Suppose for the sake of
simplicity, the level is set at whatever degree of numeracy, literacy and general knowledge
(social scientific, historical and scientific) is required to participate intelligently in political
debate. Adequacy principles require that everyone below the threshold be brought up to the
level, while levels of ability above the threshold are not considered problematic (insofar as
democratic equality is not violated).
While adequacy demands all students be brought up to some threshold level, it says
nothing about students whose abilities are above the threshold. This is seen as a virtue of
the principle. Satz writes:
It is certainly true that if educational resources were improved for poor children, then they could compete for higher education and jobs on fairer terms. But even so, no society has the resources to supply the same opportunities to poor families as are possible for those with more wealth who value the continued development of their children’s talents. As one child’s potentials expand more than another’s, this principle will continually justify devoting more resources to bring the now disadvantaged child up to the levels of her wealthier peers. Yet no society can devote all of its resources to education, and so at some point a line must be drawn as to how much the state is willing to spend. Authorized democratic decision-making bodies will draw lines that reflect the relative value they assign to education as opposed to other social goods. (Satz, 2007, p 632)
The virtue of adequacy principles, as expressed here, is that the level of compensation
is not indexed to an advantaged class’s level of achievement. While Satz’s objection to
equality resembles my own, there is an important distinction. Satz is concerned that public
efforts to compete with parental investments will result in a society re-routing all of its re-
sources to the educational system. Note that this argument is perfectly compatible with the
notion that equality must compete against other values. If educational equality requires all
resources to be invested in the educational system, then that will naturally come into con-
flict with other societal goals, such as health care expenditures. Nothing about balancing
the distribution of health with the distribution of education speaks against equal education
per se. Satz’s objection overlooks the more fundamental point: the problem is not that
equality robs from health to pay for education; the problem is that equality denies students
their rightful claim to educational opportunities.
The question now facing us is whether adequacy also encounters this problem. On the
one hand, it is straightforward to see that a principle of sufficiency, by requiring all students
be brought up to a certain threshold, could easily result in the kind of problem that threatens
equality. If some students are prohibitively expensive to bring to the required threshold,
the educational system will necessarily spend all its resources on those students, leaving
students who are above the threshold with nothing and thus denying all students’ right to
some claim on resources to develop their abilities. Adequacy would appear to fail the test.
Defenders of adequacy could argue that the threshold could be benchmarked to ensure
that it is never so costly that bringing all students to the requisite level would bankrupt
the system. Suppose social planners were considering what threshold level was necessary
for democratic equality and they converged on some level of numeracy, literacy and social
scientific knowledge. Cost evaluations are conducted and it is determined that getting all
students to this level will use all available resources, making it so that students above the
level have nothing. Social planners then revise the standard downward, to such a degree
that there are enough residual resources for students above the threshold to at least have
some. Thus, the adequacy threshold appears to pass the test.
The difficulty for proponents of adequacy is that there is nothing in the principle itself
that suggests the cost evaluations should be conducted in the first place. The principle
only requires that individuals be brought up to some threshold level, and that the level is
commensurate with democratic equality. If all educational resources are required to bring
students to the level, there is nothing in the sufficiency principle with which to protest this result.
One of the supplied arguments in favor of adequacy is that it does not index one’s level
of achievement to the level of some comparison group (high achieving poor to high achiev-
ing non-poor, for example), demanding equality between the classes. To use Anderson’s
phrase “[i]t therefore does not require criteria for equality of resources that depend on the
morally dubious idea that the distribution of resources should be sensitive to considerations
of envy,” (Anderson, 1999, p 321). Those above the relevant threshold are disregarded by
the theory, in the sense that their advantage is not problematic; adequacy is envy-free in
this way. A rights-based account faults adequacy theories for this very disregard. Ade-
quacy may be envy-free, but it comes at a high price: it excludes those above the threshold
from active membership in the school community. Insofar as adequacy does not come at-
tached with a corresponding right for all students to develop their abilities, it fails its own
test of democratic equality, by excluding students from educational membership based on
a self-imposed sufficiency standard.
IV. What distribution of resources is entailed by the right?
Up to this point I have argued that the distribution of opportunities for achievement must
include, at a minimum, some provision that guarantees all students at least some resources
for the development of their abilities. “Some resources” was left intentionally vague, but
now I wish to see if it can be better specified. Unfortunately, the specification of what the
right entails has a kind of Goldilocks problem. We can identify a specification that is too
weak, one that is too strong, but the one that is just right will not be known. Let us evaluate
them in turn.
IV.A. A right that is too weak
The phrase “some resources” can be defined to include only a token amount of resources.
Suppose the right is specified to mean, “all students are entitled to at least the smallest divis-
ible unit of resources available to develop their abilities.” This would leave some students
with, perhaps, an hour of the teacher’s time in the course of a year. Such a specification
reflects a commitment to the guarantee in name only. In some ways, such a minimal pro-
vision is worse than nothing at all, as it signals to students their outsider status. Tokenism
violates the purpose of the right and contradicts democratic equality.
IV.B. A right that is too strong
Conversely, “some resources” can be defined to mean equal resources for all students, so
that the right reads “all students are entitled to equal resources in order to develop their
abilities.” Equal resources for all students is to be resisted for the fundamental reasons
that some students are lower achieving and more expensive to educate than others. Equal
resources for all students denies this reality and treats all students as the same. This fails to
respect their uniqueness.
Moreover, giving the same to all will make fair equality of opportunity impossible. From
the previous discussion, we know that an equal opportunity principle requires four assump-
tions:
1 An equal opportunity principle has benefits, distinct from shares of income and wealth.
2 Whatever benefit there is from an equal opportunity principle requires that individuals have the requisite skills to perform the tasks required by the position.
3 A fair chance in the labor market therefore requires efforts to improve the skills of those whose abilities are lower.
4 Improving the skills of those whose abilities are lower requires giving them additional resources.
If there is agreement about [1] through [3], then equal resources for all students conflicts
with [4], compensatory spending. In that case, this conception of equal rights fails the test
we used to fault equality: realization of the principle violates other distributive principles
in the same metric of opportunities for educational achievement.
Similarly, a right that reads “all students are entitled to enough resources so that opportu-
nities to develop individual abilities are equal” must be rejected as well. This principle is an
improvement over the former, as it does not assume all students are the same, but it fails to
provide fair or reasonable chances for labor market success for students coming from dif-
ferent social backgrounds and ability levels. By neglecting that individuals have different
starting places, inequalities in life outcomes and political participation will be replicated
across generations. Schools are an important agent for remedying those inequalities.
IV.C. What right is ‘just right’?
A specification that is too weak provides only token resources to students; a specification
that is too strong fails to recognize individual differences and the importance of redress-
ing unfairnesses in labor market competitions and opportunities for political participation.
The question facing us now is what specification of the right to some resources will bal-
ance the need for meaningful opportunities to develop individual abilities in conjunction
with a concern for fairness and individual differences. The challenge is striking a balance
between securing enough resources so that the right is meaningful and recognizing that
compensatory spending is necessary for opportunities in the labor market to be fair. This
requires that we answer two questions: how much is enough, and what constitutes a fair
competition?
It may be that a strictly fair competition, understood as equal chances, is not compatible
with the rights-based claim, when the costs of providing equality are too high. When fair
competitions require equality, and equality negates claims to resources, something must
give.
Alternatively, adequacy theorists discount the importance of fair competitions (insofar as
fairness requires equal chances) and argue that what matters is chances reasonable enough
to satisfy democratic equality. Reasonable chances can be characterized by a sufficient
level of achievement. We can incorporate two sufficiency standards without difficulty:
Two-part distributive principle: All students should receive enough resources
for opportunities to develop their abilities; all students should receive enough
resources so that their achievement levels are adequate for reasonable chances
in the labor market. Satisfaction of the first part is lexically prior to the second.
The virtue of this two-pronged approach is that it is internally consistent and guarantees
the claim to resources for all students. It fails to the extent that any normative weight is
given to equality of opportunity. Insofar as we are motivated to reject the strong version
of the right based on the egalitarian reasoning described above, we will be led to demand
a stronger principle than adequacy provides. A final arbitration between equality and ad-
equacy proponents is beyond the scope of this paper, however. Ultimately, we know the
size of the claim to resources is greater than a token amount and smaller than full equality.
Specifying the claim more than that is saved for later.
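Read purely as an allocation rule, the lexical structure of the two-part principle can be sketched in code. Everything in the sketch below (the budget, the per-student minimum, the adequacy threshold, the cost figures) is a hypothetical illustration of the rule's form, not a proposal for actual levels.

```python
def allocate(budget, scores, minimum, threshold, cost_per_point):
    """Two-part lexical allocation: (1) guarantee every student a minimum
    claim on resources; (2) spend what remains bringing the lowest
    achievers toward the adequacy threshold. Part 1 is lexically prior."""
    alloc = {s: minimum for s in scores}           # part 1: the guarantee
    remaining = budget - minimum * len(scores)
    assert remaining >= 0, "the guarantee itself must be affordable"
    for student, score in sorted(scores.items(), key=lambda kv: kv[1]):
        need = max(threshold - score, 0) * cost_per_point
        spend = min(need, remaining)               # part 2: compensation
        alloc[student] += spend
        remaining -= spend
    return alloc

# Hypothetical scale scores, budget, and costs:
scores = {"A": 210, "B": 250, "C": 300}
result = allocate(budget=1000, scores=scores, minimum=100,
                  threshold=260, cost_per_point=5)
print(result)  # every student keeps at least the minimum claim
```

Note how the sketch makes the lexical ordering visible: the guarantee is deducted from the budget before any compensatory spending occurs, so no amount of compensation can reduce a student's allocation below the minimum.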
V. Conclusion
In this essay I have identified a neglected value in educational distributive justice, the guar-
antee for all students to have some resources to develop their abilities. Moreover, I show
that three prominent distributive theories problematically conflict with this goal in many
plausible scenarios. Finally, in contrast to the myriad values that may or may not conflict
with equality, I have argued that the right to educational resources is fundamental to any dis-
tributive principle that takes achievement as its distributive object. A complete distributive
theory must strike a balance between the right to resources, on the one hand, and compen-
sating students who are lower achieving, on the other. Only when we have this balanced
distributive theory can we determine the extent to which other competing values, such as
parental partiality and “all things considered” priority, are threatened by the principle.
Chapter 3
Welfare adjusted scale score: Method
toward the development of an
equal-interval welfare scale
Abstract It is becoming increasingly common to question the equal-interval assumptions of most academic scale scores. Even if interval assumptions hold, it is problematic to assume equal-interval distances with respect to benefits. For example, equivalent scale score increases at the top and bottom of the distribution are unlikely to yield equivalent welfare gains. The best available research to date approaches this problem by employing ad hoc statistical procedures to rescale extant scale score distributions by anchoring them to other, ostensibly interval, scales (such as income). I propose an alternative strategy that estimates the welfare returns to academic achievement directly. This approach makes use of well-established methodologies in health economics to estimate quality-adjusted life years (QALY). Using data from the performance level descriptors of the National Assessment of Educational Progress Long-Term Trends (NAEP-LTT) and a random utility model (RUM), I construct a welfare adjusted equal-interval scale. With this scale, I show that welfare returns to achievement are non-linear, convex and lead to different inferences regarding achievement gap trends.
I. Introduction
The use of academic scale scores in education production functions is commonplace. When
a scale score is used as a dependent variable it connotes value or expected benefit. For in-
stance, holding costs constant, a program that raises test scores 20 points is more effective
than a program that raises test scores 10 points. This is the logic of cost-effectiveness analysis
(see Levin and Belfield, 2014 for review), which is used for policy evaluation and decisions
about resource allocation. In order for this conclusion to hold, researchers and policymak-
ers must assume that a scale score is equal-interval scaled with respect to benefit. That
is, for example, they must assume that a 10-point gain at the bottom of the scale score is
equivalent to a 10-point gain at the top of the scale score. Such an assumption is rarely
tested, and there are no strong priors indicating that such a relationship exists.
As a stylized example, consider Figure I. Along the x-axis I have mapped a traditional
scale score from a standardized test, and along the y-axis I have mapped a welfare-adjusted
scale score that connotes how much utility is attributable to any point along the original
scale score. When a traditional scale score is used as the dependent variable in an educa-
tion production function, it is assumed that X equals Y for any point along the distribution;
that is, for every increase along the ability distribution there is, by assumption, an equidis-
tant gain in welfare. Such a relationship is indicated by the dashed black line labeled
“Achievement.” Conversely, a plausible relationship between ability and welfare can be
described by the dashed gray line labeled “Utility.” 1 As is evident, when the relationship
is concave, utility increases faster at the bottom of the original scale score than at the top.
Thus, any cost-effectiveness analysis that uses the un-adjusted scale score as the dependent
variable will understate gains at the bottom and overstate gains at the top. A scale score
that accurately connotes benefit would therefore be useful.
[Insert Figure I Here]
In this paper I describe and implement a method for constructing a scale score that is
equal-interval with respect to welfare (the y-axis in Figure I). The method I propose esti-
1The welfare adjusted scale score is simply the cumulative beta distribution, with shape parameters α and β equal to 2 and 5, respectively, and X equal to the original scale score divided by 350.
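As a concrete check on the stylized curve, the footnote's transformation can be computed directly. The sketch below is purely illustrative: it uses the closed binomial form of the Beta(2, 5) CDF (valid for integer shape parameters) rather than a statistics library, and the comparison points (a 10-point gain starting at 50 versus at 300) are arbitrary choices.

```python
import math

ALPHA, BETA_SHAPE, MAX_SCORE = 2, 5, 350.0

def welfare_adjusted(score: float) -> float:
    """Map an original scale score (0-350) onto the stylized welfare scale:
    the Beta(2, 5) CDF evaluated at score / 350, as in the footnote.
    For integer shape parameters a and b, the CDF equals
    sum_{j=a}^{a+b-1} C(a+b-1, j) x^j (1-x)^(a+b-1-j)."""
    x = score / MAX_SCORE
    n = ALPHA + BETA_SHAPE - 1  # = 6
    return sum(math.comb(n, j) * x**j * (1 - x) ** (n - j)
               for j in range(ALPHA, n + 1))

# Equal 10-point gains on the original scale translate into unequal
# gains on the welfare scale:
gain_low = welfare_adjusted(60) - welfare_adjusted(50)
gain_high = welfare_adjusted(310) - welfare_adjusted(300)
print(f"10-point gain at the bottom: {gain_low:.4f}")
print(f"10-point gain at the top:    {gain_high:.4f}")
```

Running this shows the low-end gain is many times larger than the high-end gain, which is exactly the mismatch a cost-effectiveness analysis on the unadjusted scale would miss.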
mates utility for a set of 10 “achievement states”, where an achievement state corresponds
to a performance level descriptor for reading and math taken from the National Assessment
of Educational Progress Long-Term Trend (NAEP LTT), and utility corresponds to how
much better, all things considered, survey respondents believe a person’s life will be for
a given achievement state.2 The NAEP uses a scale anchoring process that links a scale
score value (the x-axis of Figure I) to a performance level descriptor. With the estimated
utility values, I have a data set with three variables and 10 observations: a vector of per-
formance level descriptors, the corresponding scale scores, and the estimated utilities. I
use piece-wise monotone cubic interpolation (MCI) to link each individual NAEP score to
a utility value, represented by the gray dashed “Utility” line in Figure I.3 We now have a
scale score that is equal-interval with respect to welfare, as long as the equal-interval as-
sumptions (with respect to ability) of the original scale score hold and the performance
level descriptors are appropriately mapped to scale score values. As a demonstration of the
usefulness of a welfare scale, I take a repeated cross-sectional panel of student test scores
from the NAEP-LTT and re-scale them according to the method outlined above. I show
that inferences about changes in achievement and achievement gaps over time and age are
sensitive to the choice of scale.
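The interpolation step can be sketched in a few lines. The code below is an illustrative pure-Python implementation of the monotone cubic (Fritsch and Carlson, 1980) scheme; the five anchor scores stand in for the NAEP-LTT performance levels, and the utility values attached to them are hypothetical placeholders, not the survey estimates used in the paper.

```python
import bisect
import math

def fritsch_carlson_tangents(xs, ys):
    """Tangents for a monotone cubic interpolant (Fritsch and Carlson, 1980)."""
    n = len(xs)
    d = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i]) for i in range(n - 1)]
    m = [d[0]] + [(d[i - 1] + d[i]) / 2 if d[i - 1] * d[i] > 0 else 0.0
                  for i in range(1, n - 1)] + [d[-1]]
    for i in range(n - 1):  # limit tangents so monotonicity is preserved
        if d[i] == 0.0:
            m[i] = m[i + 1] = 0.0
        else:
            a, b = m[i] / d[i], m[i + 1] / d[i]
            s = a * a + b * b
            if s > 9.0:
                t = 3.0 / math.sqrt(s)
                m[i], m[i + 1] = t * a * d[i], t * b * d[i]
    return m

def interpolate(xs, ys, m, x):
    """Evaluate the cubic Hermite interpolant at x (xs must be sorted)."""
    i = max(0, min(bisect.bisect_right(xs, x) - 1, len(xs) - 2))
    h = xs[i + 1] - xs[i]
    t = (x - xs[i]) / h
    return ((1 + 2 * t) * (1 - t) ** 2 * ys[i] + t * (1 - t) ** 2 * h * m[i]
            + t * t * (3 - 2 * t) * ys[i + 1] + t * t * (t - 1) * h * m[i + 1])

# Anchor points: five scale-score anchors with hypothetical utilities.
scores = [150.0, 200.0, 250.0, 300.0, 350.0]
utils = [0.10, 0.35, 0.62, 0.80, 0.90]
tangents = fritsch_carlson_tangents(scores, utils)

# Any individual score between the anchors now maps to a utility value:
print(round(interpolate(scores, utils, tangents, 225.0), 3))
```

The limiter step is what distinguishes this from ordinary cubic splines: because the anchor utilities are increasing, the interpolated curve is guaranteed not to dip or overshoot between anchors, so the re-scaled scores preserve the ordinal ranking of the original scale.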
The method that I described above is similar in concept to methods commonly used in
health care research. In health economics, effect sizes are, in many cases, given in the
metric of a Quality Adjusted Life Year (QALY) (see Drummond, 2005 and Whitehead and
Shehzad, 2010 for review), where the QALY metric is used to make comparisons between
different ‘health states’ (where health states are the analogue to achievement states taken
from performance level descriptors) in health care production functions for purposes of
cost-effectiveness analysis. As an example, consider a medical intervention that improves
mobility by 2-units and another that reduces pain by 3-units. Holding costs constant, re-
searchers, insurance companies and policy makers are interested in determining which of
2Performance level descriptors for reading can be found online here: https://nces.ed.gov/nationsreportcard/ltt/reading-descriptions.aspx; math here: https://nces.ed.gov/nationsreportcard/ltt/math-descriptions.aspx. Five performance level descriptors are available for both reading and math, for a total of 10 descriptors. See Data section for details.
3MCI is implemented according to Fritsch and Carlson (1980).
the two interventions should be pursued. The QALY-metric puts discrete health outcomes
on a common utility scale, making comparisons possible. In addition to being used for
making between health state comparisons (e.g., mobility against pain), QALY-scales can
be used for making within health state comparisons (e.g., completely immobile against able
to walk without assistance).4
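The QALY logic can be made concrete with a toy calculation. The utility weights and durations below are hypothetical, chosen only to illustrate how discrete states land on a common scale; real weights come from valuation studies such as those cited in the footnote.

```python
# Hypothetical utility weights on a 0-1 scale for discrete health states
# (illustrative only; not published EQ-5D values).
UTILITY = {
    "immobile": 0.50, "walks_with_aid": 0.75, "walks_unaided": 0.95,
    "severe_pain": 0.55, "moderate_pain": 0.80, "no_pain": 1.00,
}

def qaly_gain(before: str, after: str, years: float) -> float:
    """QALYs gained by moving between health states for a fixed duration:
    (utility difference) x (years spent in the state)."""
    return (UTILITY[after] - UTILITY[before]) * years

# A mobility intervention and a pain intervention, put on one scale:
mobility_gain = qaly_gain("immobile", "walks_with_aid", years=10)
pain_gain = qaly_gain("severe_pain", "no_pain", years=10)
print(f"mobility intervention: {mobility_gain:.2f} QALYs")
print(f"pain intervention:     {pain_gain:.2f} QALYs")
```

Once both interventions are expressed in QALYs, the "2-unit mobility versus 3-unit pain" comparison in the text becomes a straightforward comparison of two numbers, which is the entire point of the common utility scale.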
Questions about equal-interval assumptions in educational research are well known.
While it is generally assumed that scale scores safely provide ordinal information, the
assumptions required to treat them as equal-interval are more difficult to test.
Establishing the interval properties of a test metric is important because interval distances
are much more informative than ordinal rankings—just think of a rank-ordering of basket-
ball players that includes Lebron James and the members of a high school basketball team.
While an ordinal scale may be equal-interval with respect to ranks, ordinal rankings do
not reflect the fact that the distance in ability between James and the second-best player
is enormous. Extant research suggests that equal-interval assumptions are problematic,
however. Domingue and, in a working paper, Nielsen have developed methods for test-
ing whether the equal-interval assumptions are plausibly met for some common academic
assessments and find that these assumptions are not met (Domingue, 2014; Nielsen, 2015).
Other researchers assume that any given scale is but one among many monotone transfor-
mations of a latent scale. Given this agnosticism, Cunha and Heckman (2008) and Cunha,
Heckman and Schennach (2010) propose a scale transformation that anchors the original
scale to adult earnings, a distribution that is assumed to have equal-interval properties. The
transformed scale score is then used to estimate production functions for cognitive develop-
ment. Relying on similar assumptions about the flexibility of scale transformations, Bond
and Lang (2013a; 2013b) subject a scale score to a variety of monotone transformations
according to an algorithmic objective function that maximizes and minimizes changes in
the white-black achievement gap. The authors find that inferences about gap changes are,
not surprisingly, sensitive to these scale transformations.
4The EQ-5D, for example, is one of the more commonly used metrics and provides three descriptions of mobility states, three descriptions of pain states, as well as three other health states. Utility scores are estimated for each health state, allowing for between and within health state comparisons (Oppe, Devlin and Szende, 2007).
The approach that I take here is to assume an equal-interval scale in the metric of achieve-
ment and estimate a new scale that will be equal-interval in the metric of welfare. Such a
scale will be useful insofar as we wish to understand whether changes in achievement are
important. For instance, current program evaluations leave fundamental questions unan-
swerable. Holding costs constant, if one intervention raises math scores 10-units and an-
other raises reading scores 10-units, we lack an outcome variable that adjudicates between
the two interventions. Likewise, if one intervention raises math scores 10-units at the low
end of the scale and another intervention raises math scores 10-units at the high end of the
scale, current practice fails to distinguish between these two results. The method I describe
and implement is one solution for resolving this uncertainty. It should be interpreted as
a proof of concept and perhaps is most useful for raising more questions than it answers.
Consider:
1 To what outcome should scale scores be indexed? In this paper, I present respondents with questions, asking them to determine which description of math and reading is more important for an “all things considered” better life. Other indices are available, such as income, civic engagement, or health outcomes. Cunha and colleagues (2008; 2010) index a child’s test score to the child’s future earnings, using a factor loading technique to weight the achievement distribution as a function of how well it predicts earnings. Such a technique is not a panacea, however. First, the factor loading method is an ad hoc scaling technique. Second, earnings carry their own scaling assumptions, e.g., should the scale be log transformed? Finally, linking achievement to earnings ignores the academic capabilities captured by the scale score.
2. Whose preferences for achievement should be included in the index? The sample of respondents included in this essay is mostly college educated. In the survey experiment, respondents are asked which state of achievement is more important for a good life. If respondents have no understanding of what a high level of numeracy or literacy feels like or entails, they will struggle to respond to the question. This suggests college educated respondents are appropriate. Nevertheless, it is likely important that an index of benefit captures the preferences of everyone. How to include all respondents in an exercise for which some may lack the cognitive capacity to participate is a difficult question. Note that the question does not apply to achievement alone: whether the poor can predict how much they would prefer being non-poor (and vice versa) seems similarly opaque. Whether a healthy person can predict how averse they would be to pain poses a similar problem.5
3. How should the index balance individual and social benefits? The approach used here measures preferences for individual benefits to achievement states. Such an approach ignores other distributive concerns, such as equity. It is known that survey respondents may be indifferent between a 1-unit change at the bottom and top of a scale when comparing between two persons, but when respondents are asked which of the two persons should receive treatment in a group of persons, they will choose to give treatment to the person whose health is at the bottom of the scale (Nord, 1999). This suggests individuals value relative differences (Otsuka and Voorhoeve, 2009). How such equity concerns, and other social values, should be included in the index is an important question.
4. How should time be modeled in the elicitation and estimation of the utility value? The method used here (and the one that is commonly employed in health economics) is to present respondents with a cross-sectional preference: "Person A has characteristics X and Person B has characteristics X′. Who is better?" In health, these characteristics are fixed by specifying that the health state will persist for t years, whereas achievement states are naturally assumed to change over time as students learn. Moreover, individuals may have different preferences for achievement growth than they do for achievement states. Linking a preference for achievement change to a student's scale score is complicated because our current measures of achievement only provide cross-sectional information about the student's abilities.6
These questions are a current source of debate in health economics and philosophy,7 and
are likely to continue to be debated. Questions like these are currently neglected in most
education policy evaluation, or the assumptions that go into the evaluation are left unstated.
The paper proceeds as follows: I begin by providing an overview of the survey design
and the data used for analysis. I then describe the theoretical model that motivates the
analysis and the econometric model that will be used for empirical estimation. The first set
of results I show describe utility values for the different achievement states. Interpolation
techniques are described that connect NAEP scores to utility values for the full distribution
of NAEP data. With interpolated data, I then estimate white-black achievement gaps using
5 Dolan and Kahneman call this distinction experience versus decision utility.
6 See Lipscomb et al. (2009) for a review of this and other issues related to time in the health landscape.
7 See Daniels (1985) for a philosophical view of justice in the provision of health care, as well as Nord (1999), who offers a mixture of economics and philosophy in his evaluation of the QALY metric.
the original and welfare-adjusted NAEP data and show that inferences about gap trends are
sensitive to scale selection.
II. Survey design
The survey design has two components. The first is a ranking exercise, in which three out
of five reading or three out of five math descriptors are randomly selected and respondents
are asked to rank these descriptors in order of difficulty. Reading and math ability descrip-
tions are taken from the NAEP-LTT performance level descriptors, described below. The
purpose of this ranking exercise is two-fold: to prime respondents so that they recognize
these descriptors are ordinally ranked, and to screen respondents who cannot (or will not)
rank descriptors correctly. Figures II through IV display the ranking and choice tasks as
they appeared in the experiment.
The ranking exercise is followed by a choice-based conjoint design (often referred
to as a discrete choice experiment) to obtain utility values for different math and reading
descriptors. Choice-based conjoint designs are widespread in health and public economics,
marketing research, and have become increasingly common in political science (for ex-
amples in health and public economics, see De Bekker-Grob, et al., 2012 and McFadden,
2001, respectively; in marketing, see Raghavarao, et al., 2011; in political science see
Hainmueller and Hopkins, 2015). In the experiment, respondents are provided with a de-
scription of two individuals (Person A and Person B) who are alike in all respects, except
that they differ in their math and reading abilities. Respondents are asked to determine
which bundle of math and reading abilities between Persons A and B will lead to an “all
things considered” better life. After being presented with the reading and math profiles,
the respondent is forced to make a choice between Persons A and B. The response is coded
dichotomously, 1 if Person A or B was chosen and 0 otherwise. Each respondent is given
only one choice task.8
8 More than one choice task is of course possible, requiring that standard errors be clustered at the respondent level. The decision to offer respondents only one choice task was motivated by a reduction in cognitive load, as performance descriptors are text heavy, as well as the fact that the marginal survey cost
The purpose of the choice task is for respondents to make interval comparisons between
Persons A and B with respect to welfare. As an example, consider a choice task where
Person A has reading ability equal to 5 and math ability equal to 2, while Person B has
reading and math abilities equal to 3.9 Effectively, the respondent is being asked to make
a trade between 2 units of reading for 1 unit of math. Whether respondents, on average,
choose Person A over B will depend on how much they value reading relative to math,
and, importantly, how much they value math gains at the bottom of the distribution relative
to reading losses at the top. To see this, consider an alternative choice task where Person
A has reading ability 4 and math ability 1 and Person B has reading and math abilities 2.
Here, the reading and math abilities of Persons A and B have been shifted down equally, but
respondents may not make the same selections, since a change in reading from 5 to 4 need
not be equivalent to a change in reading from 4 to 3. This exercise formally tests whether
respondents' preferences are, indeed, equal interval with respect to welfare. How respondents, on average, weight these different trades will determine the relative concavity of the welfare-adjusted scale score.
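The identification logic of the two tasks above can be illustrated numerically. Under a hypothetical (weakly concave) utility over ability levels 1 through 5, the same nominal trade of 2 reading units for 1 math unit can flip the preferred profile once both profiles are shifted down the scale — the utility values below are illustrative assumptions, not estimates from the survey:

```python
# Hypothetical concave utility over ability levels 1-5 (diminishing returns);
# these values are assumptions for illustration only.
u = {1: 0.0, 2: 1.4, 3: 1.9, 4: 2.4, 5: 2.6}

def prefers_a(read_a, math_a, read_b, math_b):
    """True if the total utility of profile A exceeds that of profile B."""
    return u[read_a] + u[math_a] > u[read_b] + u[math_b]

# Task 1: A = (reading 5, math 2) vs. B = (reading 3, math 3).
task1 = prefers_a(5, 2, 3, 3)   # 2.6 + 1.4 = 4.0 vs. 1.9 + 1.9 = 3.8
# Task 2: the same profiles shifted down one level each.
task2 = prefers_a(4, 1, 2, 2)   # 2.4 + 0.0 = 2.4 vs. 1.4 + 1.4 = 2.8
```

Here the respondent chooses A in the first task but B in the second, exactly the pattern that reveals preferences are not equal interval with respect to welfare.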
[Insert Figure II Here]
[Insert Figure III Here]
[Insert Figure IV Here]
Finally, note that Figure III explicitly states the age of Persons A and B. Because the
performance level descriptors from the NAEP-LTT pertain to students at the ages of 9,
13 and 17, and because individual scale scores are available for students at those ages, I
randomly assign one of three ages (9, 13, 17) to each choice task. The purpose of this
additional randomization is to test the sensitivity of preferences for achievement bundles to
age. For example, respondents may value gains in reading and math at the low end of the
distribution for persons aged 9 more than they value equivalent gains for persons aged 17.
Randomly assigning age will allow me to test this hypothesis.
using Amazon's MTurk suite are relatively low.
9 Where ability level 5 corresponds to the highest performance level descriptor on the NAEP, and so on. Respondents are not asked to make trades regarding integer values of the NAEP but are instead presented with textual descriptions of reading and math abilities commensurate with integer scores. See Scale Anchoring section below.
II.A. Math and reading descriptors and scale scores
The choice task described above uses performance level descriptors to connote reading and
math ability levels. In order to construct a data set with performance level descriptors,
utility values, and scale scores, it is necessary that these performance level descriptors (and
their estimated utilities) can be plausibly linked to scale scores. The plausibility of this
linking is defended below, but it is natural to wonder why performance level descriptors are
needed at all. Why not ask respondents to make trades using the scale scores themselves?
There are two problems with such an approach. First, I hope it is evident that in
order to estimate an interval scale with respect to welfare it is important not to conflate
the welfare scale with the original scale that describes ability. The purpose of the choice
task is to allow respondents to decide for themselves the interval distances with respect
to value between, say, reading units 1 and 2 and units 3 and 4, and so on. A Rasch or
IRT model might estimate equal-interval distances between these units, but respondents
are being asked to decide whether these distances are equal in another dimension, that
of welfare. The second problem is that a scale score decoupled from a performance level
descriptor connotes no meaningful information to the respondent. Any scale can be linearly
transformed, and determining how much 5 units is worth relative to 4 units, or 500 relative
to 400 is not possible. For these reasons, it is necessary to provide respondents with a richer
descriptor of what performance looks like, and then link the performance-level descriptor
back to a scale score.
II.B. Linking NAEP descriptors to scale scores
I now turn to the question of whether or not performance-level descriptors can be plausibly
mapped onto scale scores. One of the goals of academic measurement, dating back to
at least 1963, is to provide criterion-referenced interpretations of scale scores—in other
words, to be able to provide descriptions of what students know and can do in an academic
domain (Mullis and Jenkins, 1988). The process by which the NAEP links performance
level descriptors to scale scores is called scale anchoring. Scale anchoring attempts to
provide a context for understanding the level of performance defined by the specific test
items that are likely to be answered correctly by students (Lissitz and Bourque, 1995).
Anchor levels are determined by a combination of statistical and judgmental processes.
For the NAEP, an IRT model is used to estimate an ability score, θ, for each student,
bounded between 0 and 500. The equidistant points 150, 200, 250, 300 and 350 are then
selected from the scale.10 Test items from the assessment are then selected and categorized
according to whether or not the item discriminates between students with different scale
scores. For example, an item will be categorized as a “150 level item” if (a) 65 percent of
students scoring at or around 150 answered the item correctly; (b) 30 percent of students
or fewer scoring below 150 answered it correctly; (c) no more than 50 percent of students
scoring below 150 answered it correctly; and (d) a sufficient number of students responded
to the item. With this procedure, a large number of items can be categorized as being “150
level items”, “200 level items”, and so on. This completes the statistical part of the process.
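Because the categorization rule is a conjunction of simple thresholds, it can be expressed compactly. A minimal sketch: the 65/30 thresholds come from criteria (a) and (b) above (criterion (b) is stricter than (c), so the 30 percent bound binds), while the minimum-response count standing in for criterion (d) is a hypothetical choice:

```python
def anchors_at_level(pct_correct_at_level, pct_correct_below, n_responses,
                     min_responses=100):
    """Decide whether an item anchors at a given scale point.

    Thresholds follow the criteria quoted in the text; min_responses is a
    hypothetical stand-in for criterion (d), "a sufficient number of
    students responded to the item."
    """
    return (pct_correct_at_level >= 65       # (a) mastered at the anchor level
            and pct_correct_below <= 30      # (b)/(c) rarely answered below it
            and n_responses >= min_responses)  # (d) enough responses

# An item answered correctly by 70% at the level and 20% below anchors there.
is_150_item = anchors_at_level(70, 20, 500)
```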
The judgmental part of the process occurs when teams of curriculum and content special-
ists from the respective domains (i.e., reading and math) are asked to describe the kinds of
academic competencies reflected in the categorized items. Specialists meet in teams and
form a consensus about what these items signal.
The final result is a set of performance level descriptors that characterize what students
know and can do as defined by test performance on selected items. Scale scores are em-
pirically determined, anchor items are empirically identified, and anchor descriptions are
provided by expert judgment (see Beaton and Allen, 1992; Mullis and Johnson, 1992; Lissitz and Bourque, 1995 for a full description of the scale anchoring process).11
There are problems with this procedure. Lissitz and Bourque describe the anchor item
10 Very few students score in the tails of the scale score distribution, and for this reason the selected points of interest ignore those regions.
11 Performance level descriptors differ from standards setting or achievement level descriptors. Standards setting practices begin with a set of skills that experts believe correspond to proficiency levels. For instance, it might be asserted that a 4th grade student is proficient in reading if that student can read chapter books for comprehension. Experts then work through the test items and determine subjectively what percent of students would answer the item correctly if the student was proficient in reading. This stands in stark contrast to the anchoring procedure described here, as the items are not categorized according to a statistical procedure and given subjective analysis ex post, but instead are categorized exclusively according to a judgmental procedure.
selection process as “low inference” and the descriptive process as “high inference.” The
key issue revolves around whether the descriptors are overly uni-dimensional. Not all items
can be empirically anchored to different ability levels, leaving open the possibility that the
anchored items are too narrow. While experts construct uni-dimensional descriptions of
anchor items, other descriptions cannot be ruled out. Moreover, the performance level descriptors collapse across different sub-scales, glossing over multi-dimensionality that is
present even in the empirical data. Finally, even though equidistant anchor levels are se-
lected, if the equal-interval assumptions of the scale score are not met, then the descriptors
will likewise not be equal-interval scaled.
Despite these concerns, anchoring in this way is the most widely used technique for
providing descriptions of what students know and can do at different points across the scale.
Given how widely these benchmarks are used in classrooms and policy discussions, it is at
least plausible to suggest that the performance descriptors used in this survey experiment
can be mapped to specific scale scores. The performance level descriptors for reading and
math are described in Tables 1 and 2 below. The entire performance level descriptor is used
in the choice-based conjoint experiment.
[Insert Table I Here]
[Insert Table II Here]
II.C. Data collection
Utility values are estimated from survey respondents. Respondents are drawn from the
United States during June and July 2015. Participants were enrolled using
Amazon’s Mechanical Turk software suite and the survey was administered using Qualtrics.
Respondents were offered $0.35 to participate in the survey, equivalent to about $6.00 per
hour, and the study was administered with IRB approval. In total, 2351 respondents partici-
pated. According to self-reports, respondents were primarily college educated (78 percent),
white (73 percent) and balanced by gender (48 percent male, 52 percent female). Mechan-
ical Turk populations, while not representative of the national population on observables,
have been shown to have nationally representative preferences with respect to certain stim-
uli, such as responsiveness to information about income distributions (Kuziemko, Norton,
and Saez, 2015) and risk aversion (Buhrmester, Kwang and Gosling, 2011).12
III. Econometric framework
In this section I describe the modeling approach I use to estimate utility values for each of
the math and reading performance level descriptors. The model uses the logistic likelihood
function to provide point estimates for reading and math performance level descriptors at
levels 150, 200, 250, 300 and 350. Point estimates for reading and math performance level
descriptors can be interpreted as contributions to the log odds that respondent i chose Person A (profile 1) with reading and math characteristics θsl, where s indexes subject (reading or math) and
l indexes performance level (150, 200, 250, 300, 350).
Formally, the data are structured so that there is one row of observation for each survey
respondent i. A response variable is coded 1 if respondent chose Person A (profile 1);
0 otherwise–that is, if the perceived utility of Person A exceeded the perceived utility of
Person B. The pairwise offerings presented to each respondent are coded as indicator vari-
ables. For example, if respondent i compared Person A, who had Reading 150, Math 300 (Reading 1, Math 4), and Person B, who had Reading 200, Math 250 (Reading 2, Math 3),
the indicator variables Read1a, Math4a, Read2b and Math3b would be coded 1; all other
indicator variables (Read2a through Read5a, Math1a-Math3a and Math5a, etc.) are coded
0. These ones and zeroes mark the choice set available to the respondent.
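The indicator coding just described can be sketched directly; the field names (Read1a, Math4a, and so on) mirror the variable names in the text, with integer levels 1–5 standing in for the 150–350 descriptors:

```python
def encode_choice(read_a, math_a, read_b, math_b):
    """One row per respondent: 20 indicator variables marking the offered pair.

    Levels are integers 1-5 standing in for descriptors 150-350.
    """
    row = {}
    for level in range(1, 6):
        row[f"Read{level}a"] = int(read_a == level)
        row[f"Math{level}a"] = int(math_a == level)
        row[f"Read{level}b"] = int(read_b == level)
        row[f"Math{level}b"] = int(math_b == level)
    return row

# The example pair from the text: A = (Reading 1, Math 4), B = (Reading 2, Math 3).
row = encode_choice(read_a=1, math_a=4, read_b=2, math_b=3)
```

Exactly four of the twenty indicators are set to 1 in any row, marking the choice set the respondent actually saw.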
Thus, the probability that respondent i chose Person A is:
Pr(ChooseA) = f(Uia + εia > Uib + εib)    (3.1)
            = f(Uia − Uib + εia − εib > 0)    (3.2)
12 Pilot studies took place over the months of September 2014 to June 2015. Development of the survey design took place in Stanford's Laboratory for the Study of American Values.
This expression says that the probability of choosing Person A is a function of an indi-
vidual’s observed utility for Persons A and B plus a random component εij . Respondents
choose A when they perceive more utility for A than B, or when the difference in utility
between Persons A and B is greater than zero.
If we assume that the errors have a logistic distribution, then we can specify the model
such that:
Pr(ChooseA) = 1 / (1 + e^−(Uia − Uib + εij)),  where εij = εia − εib    (3.3)
            = 1 / (1 + e^(Uib − Uia − εij))    (3.4)
We simplify by taking logs and get:
Ln[Pr(ChooseA) / Pr(ChooseB)] = Uia − Uib + µij    (3.5)
So far we have only shown that the log odds of choosing Person A over B will be a
function of how much the utility attributed to Person A exceeds the utility attributed to
Person B. We also know that Persons A and B have characteristics. Substituting, we get:
Uia = Mathia + Readia;  Uib = Mathib + Readib    (3.6)

Ln[Pr(ChooseA) / Pr(ChooseB)] = (Mathia − Mathib) + (Readia − Readib) + µij    (3.7)
This expression says that the log odds of choosing Person A over B will be a function
of how much Person A’s math and reading abilities (Mathia and Readia, respectively) are
preferred over Person B’s math and reading abilities (Mathib and Readib, respectively).
Let θsl = Mathia − Mathib or Readia − Readib for the full vector of Math and Reading pairwise offerings made available to all respondents. Then, the model can be estimated
with the equation:
Ln[Pr(ChooseA) / Pr(ChooseB)] = α + Σ(s=1 to 2) Σ(l=2 to 5) θsli + µsli    (3.8)
where s indexes subjects (reading and math), l indexes levels (200, 250, 300 and 350; level 150 for both subjects is absorbed into the constant α), and i indexes respondents.
This model estimates a total of 8 parameters plus a constant. Standard errors are clustered
to account for heteroskedasticity.
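As a rough illustration of this estimation strategy (not the dissertation's actual code or data), the sketch below simulates choices from hypothetical concave utilities and recovers them with the difference-in-indicators design and a plain Newton–Raphson logistic fit in numpy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" concave utilities for levels 150..350 (indexed 0..4);
# these values are assumptions for the simulation, not survey estimates.
u_math = np.array([0.0, 1.1, 1.9, 2.4, 2.6])
u_read = np.array([0.0, 1.3, 2.0, 2.3, 2.4])

n = 20000
math_a, read_a = rng.integers(0, 5, n), rng.integers(0, 5, n)
math_b, read_b = rng.integers(0, 5, n), rng.integers(0, 5, n)

# Latent utility difference and simulated choices under logistic errors.
dU = (u_math[math_a] + u_read[read_a]) - (u_math[math_b] + u_read[read_b])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-dU))).astype(float)

# Difference-in-indicators design: level 150 (index 0) is the omitted baseline,
# so its utility is absorbed into the constant, as in equation (3.8).
cols = [np.ones(n)]
for a, b in ((math_a, math_b), (read_a, read_b)):
    for level in range(1, 5):
        cols.append((a == level).astype(float) - (b == level).astype(float))
X = np.column_stack(cols)  # intercept + 4 math + 4 read columns

# Logistic maximum likelihood via Newton-Raphson.
beta = np.zeros(X.shape[1])
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    H = (X * (p * (1 - p))[:, None]).T @ X   # observed information
    beta += np.linalg.solve(H, X.T @ (y - p))
```

With a large simulated sample, the recovered coefficients approximate the true utility differences relative to the omitted level-150 baseline, and the constant is near zero.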
Previously, I noted that the ages 9, 13 and 17 were randomly assigned to respondents,
in order to test whether respondent preferences for different parts of the reading and math
distributions varied by the supplied ages of Persons A and B. These age terms can be
introduced in the model as interactions:
Ln[Pr(ChooseA) / Pr(ChooseB)] = α + δa × (Σ(s=1 to 2) Σ(l=2 to 5) θsli) + µsli    (3.9)
where δa is an age fixed effect, thus giving 24 total estimated parameters (8 reading and math terms × 3 ages) and a constant.
IV. Results
I now turn to results. The survey consisted of both a ranking and a choice exercise. The
ranking exercise was included to determine whether respondents could and did understand
that the performance level descriptors provided increasingly sophisticated descriptions of
reading and math abilities. I begin by showing the percentage of respondents who ranked performance level descriptors correctly in terms of difficulty. A majority of respondents are able
to rank these descriptors correctly, suggesting that they understand the descriptors connote
ordinal information in terms of ability.
I then turn to point estimates from logistic regression models. I show point estimates for two sets of models: age-interaction models (for ages 9, 13 and 17) are shown
along with models that estimate the weighted average across age. These allow us to see
whether age-interactions meaningfully change respondent behavior. Three interpolation
schemes are considered and monotonic cubic interpolation (MCI) according to Fritsch and
Carlson (1980) is selected.
With an interpolation scheme in place, I have a full range of data for both the original
scale score and the estimated welfare scale. As a descriptive application, I show trends in
the white-black achievement gap, defined as the difference in mean white and black scores,
for the original and adjusted scales. NAEP scores are fairly stable across time but change
substantially as students age.13 Test scores are available for a random sampling of students
at ages 9 and 17 every 8 years for six cohorts in math and reading, allowing for description
of achievement growth as students age across various cohorts. I conclude by showing gap
trends across age for various cohorts using both scales.
IV.A. Ordinal ranking exercise
Respondents first participated in a ranking exercise in which they were randomly assigned
3 of 5 reading or 3 of 5 math performance level descriptors (an example of the exercise
is shown in Figure II). Only three descriptors were randomly drawn in order to simplify
the ranking task. There are 10 possible reading and math bundles with no ties that can be randomly assigned to respondents, a tie being defined as a respondent receiving two or more identical reading or math performance level descriptors.14 Among
non-ties, the probability of being assigned any one reading or math performance level de-
scriptor is uniformly distributed. There are 10 possible non-tying combinations of perfor-
mance level descriptors: 123, 124, 125, 134, 135, 145, 234, 235, 245, 345 (where 1=150,
2=200, 3=250, etc.). Likewise, the distribution of descriptor combinations is uniformly
distributed.
13 The NAEP-LTT is vertically scaled, meaning students at different ages are exposed to an overlapping subset of test items. See Beaton and Swick (1992) and Haertel (1991) for discussion.
14 Ties are excluded because the exercise is made radically simpler when ranking only two unique sets of descriptors.
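The combinatorics here are easy to verify directly. A quick check (levels 1–5 mapped to 150–350) reproduces the ten non-tying combinations and the distribution of cumulative distances between the three drawn descriptors:

```python
from collections import Counter
from itertools import combinations

# Levels 1-5 correspond to anchor points 150-350.
levels = {1: 150, 2: 200, 3: 250, 4: 300, 5: 350}
combos = list(combinations(sorted(levels), 3))   # all non-tying draws of 3 of 5

def cum_distance(c):
    """Cumulative distance between the three descriptors, e.g. 1-2-3 -> 50 + 50 = 100."""
    return (levels[c[1]] - levels[c[0]]) + (levels[c[2]] - levels[c[1]])

dist_counts = Counter(cum_distance(c) for c in combos)
```

The counts (three combinations at distance 100, four at 150, three at 200) also explain footnote 15's observation that Distance 150 combinations are approximately 1.3 times as prevalent as the others.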
51
Uniformity allows for independent point estimates of each subject-level descriptor. How-
ever, independent estimates of the effect of being assigned a performance level descriptor
on the probability of ranking that descriptor correctly are available only if descriptor com-
binations are equally difficult. For example, some descriptor combinations will, by chance,
assign respondents combinations of descriptors that are further spaced than other combi-
nations (e.g., 135 is further spaced than 234 or 125). If correctly ranking is easier when
descriptors are further apart (e.g., 135 is easier than 234 or 125), and if some performance
level descriptors are more commonly found in these more easily ranked combinations, then
independent estimates of each performance level will be biased. To test for this, I construct
three indicator variables (Distance 100, Distance 150, and Distance 200) indicating the cu-
mulative distance between the three performance descriptors. For example, the indicator
Distance 100 will be coded 1 if the three descriptors were 150, 200, 250 (distance is 50 be-
tween 150 and 200 and 50 between 200 and 250 for a total of 100); 0 otherwise. Distance
150 and 200 are coded similarly.
The data are structured such that there are three observations per respondent. Each row
corresponds to the subject s and level l randomly shown to the respondent i. If the respon-
dent ranked the item correctly, it is coded 1; 0 otherwise. In total, a respondent may rank
0, 1 or 3 descriptors correctly (mis-ranking one descriptor necessarily results in two or more descriptors being mis-ranked). I estimate two regression models:
Ranksli = θsli + µsli    (3.10)

Ranksli = δd × θsli + µsli    (3.11)
Here, s indexes subject, l indexes level and i indexes respondent. Each model is run sepa-
rately for math and reading, for a total of four estimations. The dependent variable Ranksli
is coded as 0 or 1 depending on whether the respondent ranked correctly; indicator variables
θsli indicate the linear probability that respondents ranked performance levels 1 through 5
correctly, and the interaction term δd indicates the proportion of respondents ranking θsli
correctly when they were offered three-descriptor combinations with distances equal to
100, 150 or 200. Point estimates for δd interactions are relative to d = 100. Standard errors
are clustered at the respondent level to account for intra-respondent correlation. Results are
reported in Table III.
[Insert Table III Here]
The main effects coefficients (Levels 150 through 350, indicated by column header
“Mean”) indicate the proportion of respondents ranking that performance level descrip-
tor correctly. Here we see that, for the most part, respondents were successful at ranking
the descriptors. Percents correct range from 63 to 79 percent depending on subject and level. There is not an obvious pattern between subjects and levels with respect to
how effectively respondents ranked.
The interaction terms confirm the hypothesis that additional space between performance
level descriptors improves ranking competence. Relative to when cumulative distance is
100 (the smallest possible distance among non-ties), estimates at distances 150 and 200 are nearly always higher (the exception being math, level 300, distance 200) and generally significant.15 Overall, respondents ranked reading and math descriptors correctly 63 to 79 percent and 64 to 72 percent
of the time, respectively. Whether respondents rank incorrectly on account of negligence
or genuine confusion is unknown.
Correct rank ordering of performance level descriptors is relevant to the utility model be-
cause the model assumes monotonicity of consumer choice preferences. The monotonicity
assumption is simply that respondents should choose higher levels of reading or math, all
else constant. That is, for example, if respondent i faces a choice task k in which Person
A has Reading and Math 250 and Person B has Reading 250 and Math 300, respondent
i must choose Person B. In health economics, where choice-based conjoint designs are
common and assumptions of monotonicity are likewise required, there is no consensus on
best practices for when respondents “choose badly.” I follow current practices and delete
15 The interaction terms do not average to the main mean effect because the Distance 150 terms are approximately 1.3 times more prevalent than either the 100 or 200 terms.
observations for which respondent choices violate monotonicity assumptions.16,17
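The screening rule for "choosing badly" can be made precise as a dominance check. A minimal sketch, with profiles as hypothetical (reading level, math level) pairs:

```python
def dominates(p, q):
    """p weakly exceeds q in both subjects and strictly in at least one."""
    return p[0] >= q[0] and p[1] >= q[1] and p != q

def violates_monotonicity(chosen, rejected):
    """A choice violates monotonicity when the rejected profile dominates
    the chosen one; profiles are (reading_level, math_level) tuples."""
    return dominates(rejected, chosen)

# The example from the text: choosing A = (250, 250) over B = (250, 300)
# is a violation, since B is at least as good in both subjects.
bad_choice = violates_monotonicity(chosen=(250, 250), rejected=(250, 300))
```

Note that a pair only "requires monotonicity" when one profile dominates the other; when neither dominates (e.g., (250, 200) versus (200, 300)), either choice is admissible.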
IV.B. Beta estimates
Estimation of the utility model (Equations (3.8) and (3.9)) is done on a sub-sample of
respondents who (a) responded to the choice task and (b) complied with monotonicity
assumptions. The sample includes 2057 of 2351 respondents given a choice task. I begin
by showing results for Equation (3.9), where randomly assigned age descriptors δa for
ages 9, 13, and 17 are interacted with reading and math performance level descriptors θsl,
providing 24 (3 ages × 4 betas × 2 subjects) point estimates plus a constant. The interaction
terms allow us to see whether respondent preferences for performance levels are sensitive
to profile age. Results are displayed in Figure V.
[Insert Figure V Here]
The common intercept α anchors point estimates for Math and Reading ages 9, 13 and
17 at Level 150. Because each of the subjects are estimated simultaneously, it is possible to
16 See Lancsar and Louviere, 2006; Lancsar and Louviere, 2008; Inza and Amaya-Amaya, 2008 for discussion. These papers discuss both monotonicity violations as well as other violations of rational choice theory. The focus is primarily on repeated observation of respondent choice behavior, when preferences should be transitive and consistent. In cases where transitivity and consistency are violated, deletion of respondent choice data is discouraged. Guidelines for best practices in cases of monotonicity violations are not well specified. Higher quality (and more expensive) data can be obtained in order to determine whether respondents failed to comprehend, did not take the choice task seriously, or had other reasons for preferring less over more achievement. In total, 147 respondents out of 2351 were removed from the sample for making non-monotonic choices, i.e., choosing a profile with performance level descriptors lower than the alternative. 752 respondents were presented with a choice set in which a decision required monotonicity, meaning that about 19 percent of respondents "chose badly" when given the option. An additional 147 respondents were removed because they did not respond to the choice task. The final estimation sample includes 2057 respondents.
17 In pilot surveys that took place between September 2014 and June 2015, I attempted to make the performance level descriptors more concise in order to improve respondent comprehension and to present respondents with additional choice sets. This procedure has the drawback of undoing the scale anchoring process described previously. In particular, complete descriptors have already been criticized for excessive unidimensionality, and any additional concision would bolster those criticisms. In an effort to allow the descriptors to maintain their multidimensionality while increasing concision, respondents were randomly assigned sub-elements within each performance level descriptor. I generated 3 to 5 sub-descriptors for each complete performance level descriptor and randomly assigned those. An average estimate of the sub-descriptors would, in theory, describe the multidimensional aspects of the full descriptor. Nevertheless, I found that respondents were not additionally successful at ranking sub-descriptors relative to the full performance level descriptor; indeed, for many of the sub-descriptors I constructed, respondents were much worse at ranking them. For these reasons, I chose to use the full descriptor.
compare across subject domains, as well as within domain, across performance level. The
solid and dashed lines correspond to fitted quadratic and cubic regression lines, precision
weighted by the inverse of the standard error squared. Analytic weights are likewise applied
to each of the point estimates to indicate precision (i.e., larger circles have smaller standard
errors).
With only five estimated points, much of the range of potential scale scores is missing. The problem of missing data is unique to the educational setting,
where two measures of ability are commonplace: discrete performance level descriptors
and continuous measures. In order to capture the full continuous range of ability using
only discrete descriptors, we will need to fill in the missing data. The two interpolation
and extrapolation schemes presented here (quadratic and cubic interpolation) are seen to
be inadequate. The primary purpose of Figure V is to illustrate two problems with the
schemes. First, by not imposing monotonicity on the interpolated line, we violate utility
axioms. Second, using either extrapolation method for points beyond 150 and 350 leads to
outlandish prediction.
To correct these limitations, I use monotone piecewise cubic interpolation (MCI) as suggested by Fritsch and Carlson (1980). MCI produces the results depicted in Figure VI for the
range 100 to 500. The top panel shows results for Equation (1.8), where the age-interaction
terms are removed. By construction, the curvature is monotonic throughout the entire range
and fits the estimated data perfectly. MCI extrapolates beyond 150 and 350 by linearly
extending the line through the two most proximal known points (i.e., 151 and 150 at the
bottom, 349 and 350 at the top). Linear extrapolation may not be appropriate for points
outside the estimated range. Later, I test how sensitive results are to alternative
extrapolation techniques.
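The MCI step can be sketched with SciPy, whose `PchipInterpolator` implements the Fritsch-Carlson monotone cubic scheme; the utility values below are illustrative placeholders, not the estimated coefficients.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

# Performance levels and illustrative (not estimated) utility values.
levels = np.array([150.0, 200.0, 250.0, 300.0, 350.0])
betas = np.array([0.0, 0.9, 1.6, 2.1, 2.3])

# PchipInterpolator implements Fritsch-Carlson monotone cubic interpolation:
# it passes through every point and preserves the monotonicity of the data.
pchip = PchipInterpolator(levels, betas)

# Linear extrapolation from the two most proximal points, as in the text.
slope_lo = float(pchip(151.0) - pchip(150.0))
slope_hi = float(pchip(350.0) - pchip(349.0))

def welfare(score):
    """Interpolate inside [150, 350]; extrapolate linearly outside."""
    score = np.asarray(score, dtype=float)
    inside = pchip(np.clip(score, 150.0, 350.0))
    below = betas[0] + slope_lo * (score - 150.0)
    above = betas[-1] + slope_hi * (score - 350.0)
    return np.where(score < 150.0, below, np.where(score > 350.0, above, inside))

grid = np.linspace(100.0, 500.0, 401)
values = welfare(grid)
```

Because the input data are increasing, the resulting curve is monotone over the full 100 to 500 range.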
[Insert Figure VI Here]
We can now examine the results. First, note the concavity of each set of point estimates.
As hypothesized, welfare returns to achievement are non-linear and decrease at the higher
end of the scale. This is true for all ages and subjects. There is variation in the curvature
between subjects and ages. For all ages in reading (bottom right panel of Figure VI), there
is a steep gain in utility at the bottom of the scale, and then utility gains flatten out. Age 17
shows a steep increase at the high end of the scale, but much of this is due to extrapolation
beyond the estimated value of 350. For math (bottom left panel), the largest gains are in the
middle of the distribution, as scores increase from 200 to 350, and this is true for all ages.
Overall, we see confirmation of the initial hypothesis that utility gains for achievement are
non-linear and concave.
In order to estimate changes in achievement across age, by cohort, it will be necessary to
combine age terms and estimate Equation (1.8). Recall the monotonicity assumption implicit in the model: increases in the original scale score must be associated with increases
in benefit. As seen in Figures V and VI, each of the age curves is monotonically increasing, but the model does not impose monotonicity across age. To understand why, consider
Figure VII, which shows a stylized depiction of Figure VI overlaying Ages 9 and 17. Here,
the curvatures for Ages 9 and 17 are respectively monotonic, but as achievement along the
x-axis increases and “jumps” from Age 9 to 17, there is a concomitant decrease in Y , i.e.
utility. This violates the modeling assumptions and implies that as children gain in achievement between ages 9 and 17, they are made worse off, all things considered. This implication holds despite the fact that we observe positive utility returns to achievement within each age.
[Insert Figure VII Here]
We observe this “jump” problem because respondents are not asked to make marginal
welfare preferences for achievement gains but are instead asked to state preferences for
achievement states. The theoretical and practical differences between estimating gains and
states are a recurring theme in health research and were introduced earlier. The problem is
even more pronounced in educational applications, as any vertical scale assumes change
in ability across age. Nevertheless, it is not obvious whether welfare evaluations should
be sensitive to those changes. Moreover, most test scores are presented as cross-sectional
measures of ability. Given that the purpose of this exercise is to convert a commonly used
measure of ability into one that connotes utility, using the cross-sectional achievement score
seems appropriate. Modeling growth may be possible but is left aside for future research.18
18See Weinstein, et al., 2009 and especially Nord, et al., 2009 for important discussions of gains versus levels in health, with emphasis on both policy and normative implications.
Point estimates will therefore be taken from Equation (1.8). By eliminating the age-interaction terms, the model describes average welfare returns to achievement and is monotonic.
Point estimates correspond to the weighted average of the three age terms (9, 13, 17) for
each performance level. This can be seen in the upper left and right panels of Figure VI.
Comparing the lower and upper panels of Figure VI shows that age-specific point estimates do not substantively alter interpretation. Moreover, ignoring the age interactions
helps to mitigate some of the exaggerated extrapolation for ranges beyond 350.
Estimating and converting NAEP scales
Here I describe how estimated utility values for Reading and Math performance level descriptors 150, 200, 250, 300 and 350 are applied to individual-level NAEP data. To do
this, I take individual level data from the NAEP restricted-use files and generate a vector of
reading and math scores for each individual student’s 5 plausible values.19 Each individual
student’s score is estimated according to the MCI projection. This is done for all student
scores in reading and math, ages 9, 13 and 17, for years 1990-2008. As a summary statistic, I take the mean NAEP and mean welfare-adjusted score for each subject, age, year and
subgroup, taking account of the NAEP's complex survey design as well as the five plausible values.20 Finally, in order to compare the original scale, which ranges between 100 and
500, to the welfare-adjusted scale, which ranges from -2.7 to 0.2, I standardize them both
to have mean µ = 100 and standard deviation σ = 10.
19In the NAEP, individuals do not receive the complete battery of tests. For this reason, each individual student is given 5 plausible values, which are randomly drawn from a distribution of possible θ values. The 5 plausible values can be combined to provide summary statistics for sub-populations following Rubin's rules for multiple imputation. See Mislevy, et al., 1992 for a description of this procedure.
20Specifically, to estimate means for each plausible value of the NAEP, I use Stata's –svy– commands, specifying probability and replicate weights, as well as the sampling clusters. I follow Rubin's rules to aggregate across each of the 5 plausible values. The mean score is a simple average of the subject-age-year scores, but the error variance requires that we account for both the between-plausible-value variation and the error variance of each estimate. The formulas are: $\text{within} = \frac{1}{5}\sum_{p=1}^{5}\sigma_p^2$; $\text{between} = \frac{1}{4}\sum_{p=1}^{5}(\bar{X} - \bar{X}_p)^2$; $\text{total} = \sqrt{\text{within} + 1.2 \times \text{between}}$.
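Rubin's combination rule for the five plausible values can be sketched as follows; the means and sampling variances below are invented for illustration, not NAEP estimates.

```python
import numpy as np

# Illustrative means and sampling variances for 5 plausible values.
pv_means = np.array([251.3, 250.8, 251.9, 250.5, 251.1])
pv_vars = np.array([0.42, 0.45, 0.40, 0.44, 0.43])  # squared standard errors

m = len(pv_means)                                    # m = 5 plausible values
point = pv_means.mean()                              # combined point estimate
within = pv_vars.mean()                              # within-imputation variance
between = np.sum((pv_means - point) ** 2) / (m - 1)  # between-imputation variance
total_se = np.sqrt(within + (1 + 1 / m) * between)   # (1 + 1/5) = 1.2
```

The `(1 + 1/m)` correction is the standard multiple-imputation adjustment; with five plausible values it reduces to the 1.2 factor in the footnote.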
IV.C. Comparing original to welfare-adjusted scale
I now show how inferences from the original NAEP scale and from the estimated and interpolated welfare-adjusted scale contrast. I first present a stylized figure to show the kinds of
cases for which inferences between the two scales will diverge. I then compare white-black
achievement gaps (defined as the mean difference between the two groups) across time and
across cohort (that is, as students age).
Achievement gap example
An example of a change in math achievement for a single cohort is shown in Figure VIII.
The x-axis shows the original standardized NAEP scale and the y-axis shows the estimated
and interpolated scale for a cohort of students in years 1982 to 1990. The solid intersecting
lines indicate mean black scores and the dashed intersecting lines indicate mean white
scores; the scores at the lower end of the distribution are for students at age 9, and those
at the higher end are for students at age 17. The difference between dashed
and solid lines for the respective axes provides the achievement gap.
[Insert Figure VIII Here]
It is clear from Figure VIII that white-black differences at age 17 are slightly smaller than
they were at age 9 on both scales, indicating that the gap shrank as children aged. The
change in the gap is much smaller using the welfare-adjusted scale than using the original
scale, because the difference in scores at age 9 is smaller on the welfare scale than on the
original NAEP scale. What is also revealing about this figure is that if all student scores
increased by the same amount (i.e., an equivalent mean increase in achievement), the effect
on the achievement gap in the adjusted scale would be profound. By shifting all scores to
the right, the size of the gap at age 9 in the adjusted scale would be larger, as a result of the
steeply increasing value in achievement, and the size of the gap at age 17 would be smaller,
as a result of the fact that gains at the high end of the scale are diminished. Taken together,
the adjusted scale would show a dramatic decrease in achievement gaps between the ages
of 9 and 17, simply by increasing all scores an equal amount. This contrast between scales
is exactly the consequence we would expect when equal interval assumptions with respect
to welfare are violated.
Two other points from Figure VIII are worth emphasizing. The first is that differences in
inferences between the two scales require change. Cross-sectional comparisons between
the two scales result only in intercept differences (e.g., a score of 80 on the original scale
is equivalent to a score of 83). While achievement gaps have narrowed somewhat over
time, most of the change in NAEP scores takes place as children age. Below I present
achievement gap changes across time, and the relative stability of these gaps will be evident.
After, I show gap changes as children age, across cohort, and the consequence of scale
choice will be pronounced.
The second is to note the two versions of the interpolated lines, indicated by the solid black
curved line and the dashed maroon lines at the ends. The dashed maroon lines reflect variants
in extrapolation strategy. As mentioned earlier, point estimates are only available up to
original NAEP scores of 150 and 350 (indicated at about 80 and 110 in the standardized
metric). The MCI procedure that generated the black solid line extends the line using linear
extrapolation. To test whether gap changes are sensitive to this out-of-sample extrapolation,
I generate four alternative welfare-adjusted scale scores:
1 Shallow/shallow: by assuming no decrease in welfare below 150 and no increase in welfare above 350, the welfare-adjusted scale will be flat below and above the respective regions.
2 Shallow/steep: by assuming no decrease in welfare below 150 and twice as much welfare above 350, the welfare-adjusted scale will be flat below 150 and twice the slope (relative to the slope between 300 and 350) above 350.
3 Steep/steep: by assuming twice as much welfare loss below 150 (relative to the slope between 200 and 150) and twice as much welfare gain above 350, the welfare-adjusted scale will be twice the slope below and above the respective regions.
4 Steep/shallow: by assuming twice as much welfare loss below 150 and no welfare gain above 350, the welfare-adjusted scale will be twice the slope below and flat above the respective regions.
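The four variants above can be expressed compactly; the utility values are illustrative placeholders, and a simple linear interpolant stands in for the MCI fit inside the estimated range.

```python
import numpy as np

# Hypothetical estimated utilities at the five performance levels.
levels = np.array([150.0, 200.0, 250.0, 300.0, 350.0])
betas = np.array([0.0, 0.9, 1.6, 2.1, 2.3])

# Per-unit slopes of the two end segments (200 -> 150 and 300 -> 350).
slope_bottom = (betas[1] - betas[0]) / 50.0
slope_top = (betas[-1] - betas[-2]) / 50.0

def extrapolated(score, bottom, top):
    """bottom/top: 'shallow' (flat beyond the endpoint) or
    'steep' (twice the slope of the adjacent estimated segment)."""
    lo = 0.0 if bottom == "shallow" else 2.0 * slope_bottom
    hi = 0.0 if top == "shallow" else 2.0 * slope_top
    if score < 150.0:
        return betas[0] + lo * (score - 150.0)
    if score > 350.0:
        return betas[-1] + hi * (score - 350.0)
    return np.interp(score, levels, betas)  # placeholder for the MCI fit

variants = {
    (b, t): extrapolated(100.0, b, t)
    for b in ("shallow", "steep") for t in ("shallow", "steep")
}
```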
When comparing gap changes across cohort, I will test whether results differ with any of
the four variants.
Achievement gaps: Across years
I now turn to differences in mean achievement between white and black students across
time. Changes in NAEP achievement across time have been relatively modest, and gap
changes have likewise been modest, so we would not expect dramatic differences in the
inferences drawn from the two scales. Nevertheless, looking at the top panel in Figure IX,
which describes changes in math gaps, we see differences in achievement gap trends for
ages 9 and 17. The trend in gap closure for 9-year-olds is decreasing with the NAEP scale
and increasing (slightly) with the adjusted scale. Looking at the bottom panel, which
describes changes in reading gaps, trends are very similar. This is because reading
gaps have been very stable, which is not the case for math gaps (Reardon, et al., 2012).
[Insert Figure IX Here]
Achievement gaps: Across age
The more salient demonstration of the effects of rescaling can be seen when we look at
achievement gap changes as children age. The NAEP is vertically equated, meaning examinees at ages 9, 13 and 17 are given an overlapping sample of test items at each age level.
While there are some concerns about the nature of the inference one can draw from vertical
equating, such cross-age comparisons are technically allowable with NAEP data (Haertel,
1991). In order to make cross-age comparisons, I use a sub-sample of cohorts for whom a
random sample of students are tested at age 9 in year t and tested again at age 17 in year
t + 8. The achievement growth for students from age 9 to 17 in year t to t + 8 is provided
for six cohorts c in both math and reading. Within each of these cohorts, because samples
of students are randomly drawn in each interval, it is possible to say that the achievement of
any subgroup g in cohort c grew or shrank by some amount, using both the original NAEP
and welfare-adjusted scales. The achievement gap for any cohort c is defined as the mean
white minus mean black score in years t and t + 8.
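As a worked example of this definition, using invented subgroup means on the standardized scale:

```python
# Hypothetical subgroup means on a scale standardized to mean 100, sd 10.
white = {"age9": 103.0, "age17": 104.0}
black = {"age9": 95.0, "age17": 97.5}

# Gap for cohort c: mean white minus mean black, at age 9 (year t)
# and again at age 17 (year t + 8).
gap_age9 = white["age9"] - black["age9"]
gap_age17 = white["age17"] - black["age17"]
gap_change = gap_age17 - gap_age9  # negative value: the gap narrowed with age
```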
[Insert Figure X Here]
[Insert Figure XI Here]
Figures X and XI present results for the possible six cohorts in math and reading,
respectively. Solid lines depict the original NAEP scale and dashed lines depict the welfare
adjusted scale. As hypothesized, in many instances, the inferences we would draw from
the adjusted scale depart in magnitude and sign when compared to the original scale. In
math, the 1982 to 1990 cohort (depicted in green) is as described in Figure VIII, and we
observe here what was described there: a rate of gap closure that is steeper in the original
NAEP metric than in the welfare scale. In the 1978-1986, 1992-2000, and 1996-2004 cohorts,
the signs of the gap changes are reversed. Whether or not trends are reversed between the two scales will
be a function of the size, location and rate of change of the subgroup’s respective mean
achievement.
In reading, the departure between the original and adjusted scales is much more pronounced. Using the adjusted scale, the reading achievement gap is shown to be decreasing
by about 6 to 10 points for every cohort. Conversely, using the original scale, the gap is
decreasing by about 1 to 3 points in four cohorts and increasing by 1 to 2 points in two
cohorts. This can be traced back to Figure VI, where we observed a steep change in slope
beginning at NAEP score 250. Gains below 250 are very steep, while gains above are
much more shallow. If black scores, between ages 9 and 17, move along the back half of
the curve, while white scores move along the front half of the curve, gap decreases will be
much larger.
Finally, in order to test the sensitivity of our inferences to the extrapolation that takes
place for scores less than 150 and greater than 350, I present four “test” averages alongside
the previously shown MCI score. These “test” averages manipulate the extrapolation by
increasing or decreasing the rate of change below 150 and above 350, as described above.
As can be seen in Figures XII and XIII, the method of extrapolation has little bearing on
outcomes.
[Insert Figure XII Here]
[Insert Figure XIII Here]
V. Conclusion
Overall, I have demonstrated that welfare benefits of different achievement states described
by the NAEP are not equal interval. In contrast to existing methods, the technique I propose
provides a direct and explicit description of the welfare gains from different achievement
states. Moreover, instead of linking achievement to earnings, I have suggested that the
benefits of achievement can be described inclusively, meaning that achievement need not
serve merely pecuniary purposes. With the proposed method, the inferences we draw about
changes in achievement and changes in achievement gaps (especially as children age) will
differ depending on which scale we use. Which scale ought to be used is, I have argued,
application sensitive. When descriptions of academic ability are desired, or when we wish
to know how much more or less some subgroups know about math and reading relative to
other subgroups, the original NAEP scale allows for such inferences. When, however, we
wish to derive some additional inference from the scale—for instance, when an achievement score is used as an outcome variable, when a score is used for cost-effectiveness
evaluation, or when we wish to evaluate whether a narrowing of the achievement gap is
“good” or “bad”—the original NAEP scale is inadequate. It fails to accurately describe
benefit in any meaningful way. In this paper, I have described and implemented a method
that allows for such values-based inferences.
In light of the previous discussion, I would like to revisit the four questions that were
raised at the start of this essay.
1 To what outcome should scale scores be indexed?
2 Whose preferences for achievement should be included in the index?
3 How should the index balance individual and social benefits?
4 How should time be modeled in the elicitation and estimation of the utility value?
In this essay, I have supplied answers to each of these questions. Outcomes are indexed
to survey respondents’ understanding of how much welfare is attributable to certain levels
of achievement; college educated respondents are included in the index; equity is given
zero weight in the model; time is modeled cross-sectionally. Whether or not these choices
are the correct ways to link achievement to outcomes is not known, but the choices inherent
to the inference are here made explicit.
Contrast the approach detailed here to cases in which achievement scores are used as outcome
variables. With the use of achievement scores, even if the equal-interval assumption holds,
the implicit assumptions of the model are that benefits are best characterized by ability
differences, that all ability differences are equally beneficial, and that all benefits are
individual (and not societal) and best characterized by a cross-section in time. These
assumptions lack theoretical and, as demonstrated here, empirical warrant; nevertheless,
they form the basis of the great majority of education policy evaluations. Education policy
evaluation will be greatly improved when the implicit assumptions underlying the use of
traditional achievement scores are made explicit.
VI. Figures
Figure I: Stylized Welfare Returns to Achievement
[Figure: y-axis: Benefit/Utility; x-axis: Scale Score (150–350); series: Achievement, Utility]
This figure depicts two stylized representations of the welfare returns to achievement. The black line assumes that welfare returns are equal interval, meaning that a 50-unit increase in achievement corresponds to a 50-unit increase in utility. The gray line presents a hypothetical relationship between achievement and welfare in which a 50-unit gain in achievement at the bottom of the scale equates to a much larger welfare gain than a 50-unit gain at the top of the scale.
Figure II: Survey Example: Ranking Exercise
This is a screen shot (1 of 3) from the online survey experiment administered to 2351 respondents through Amazon's Mechanical Turk software. This task asked respondents to rank 3 reading performance level descriptors in terms of difficulty. Respondents were randomly assigned either the reading or math subject and 3 of 5 performance level descriptors (with replacement).
Figure III: Survey Example: Introduction to Choice Exercise
This is a screen shot (2 of 3) from the online survey experiment administered to 2351 respondents through Amazon's Mechanical Turk software. In this screen shot, the choice task is introduced to respondents. Respondents are informed that the two profiles, Persons A and B, are equal in all respects except that they differ in their reading and math abilities. They are instructed to select which person will be better off between the two. In paragraph 3, Persons A and B are also randomly assigned an age, which can be either 9, 13 or 17.
Figure IV: Survey Example: Choice Exercise
This is a screen shot (3 of 3) from the online survey experiment administered to 2351 respondents through Amazon's Mechanical Turk software. In this screen shot, the choice task is presented to respondents. Respondents are randomly assigned a reading and math performance level descriptor for Persons A and B, with replacement. Performance level descriptors are taken from the NAEP-LTT and can be seen in Tables I and II. At the bottom of the choice task, respondents select which person (A or B) they think would be better off, “all things considered.”
Figure V: Estimated Beta Coefficients for Reading and Math, Age Interactions
[Figure: panels (a) Math and (b) Reading, each with sub-panels for Ages 9, 13 and 17; x-axis: scale scores 150–350; y-axis: estimated coefficients]
This figure depicts point estimates from the logistic regression in Equation (1.9). Point estimates indicate the probability of a respondent selecting a profile with a math (top panel) or reading (bottom panel) performance level descriptor equal to 200, 250, 300 or 350 (relative to the omitted category, 150). The solid line is drawn using precision-weighted cubic regression through the estimates; the dashed line is drawn using precision-weighted quadratic regression. Each point is sized to indicate precision (i.e., larger points have smaller standard errors).
Figure VI: Monotonic Cubic Interpolation of Math and Reading Beta Coefficients
[Figure: top row: Math and Reading (age interactions dropped); lower rows: Ages 9, 13 and 17 for each subject; x-axis: scale scores 100–500; y-axis: interpolated coefficients]
This figure takes point estimates from Equations (1.8) and (1.9) and performs piecewise monotone cubic interpolation (MCI) according to Fritsch and Carlson (1980) for the scale range 100 to 500. Extrapolation for points less than 150 and greater than 350, respectively, is done via linear extrapolation of the two most proximal points, e.g., linear extrapolation based on points 151 and 150 and on points 349 and 350, respectively. The top panel drops age interactions (Equation 1.8) and the bottom panel estimates Equation (1.9).
Figure VII: Changes in Scale and Welfare Scores across Age: “Jump” Problem
[Figure: stylized curves for Ages 9 and 17, with annotations “Increase in X” and “Decrease in Y”]
Stylized depiction showing how monotonicity within age need not lead to monotonicity across age, i.e., the “jump” problem. In this representation, benefits are monotonically increasing for ages 9 and 17, but as achievement increases from age 9 to 17, there is a downward “jump” in welfare. This is because the choice task is cross-sectional, asking respondents about their preferences for achievement states and not achievement growth.
Figure VIII: White and Black Changes in Math Achievement across Age: Example Cohort
[Figure: x-axis: Standardized NAEP Scale (60–130); y-axis: Standardized MCI (Average Age) Scale (80–120); series: Blacks, Whites; annotations: White-Black Gap at Age 9 and Age 17]
This figure depicts standardized original and welfare-adjusted NAEP scores for one cohort of students at ages 9 and 17 in years 1982 and 1990 (for 1 of 5 plausible values). Solid intersecting lines correspond to mean black scores of 9- and 17-year-olds in 1982 and 1990, respectively. Dashed intersecting lines correspond to mean white scores for the same ages and years. Achievement gaps are represented as the difference between dashed and solid lines at ages 9 and 17 along both the x- and y-dimensions of the graph. In order to test the sensitivity of gap estimates across cohorts to extrapolation, I construct artificial point estimates at NAEP scores of 100 and 500 that are equal to (a) estimated scores at 150 and 350; or (b) twice the slope of scores from 200 to 150 and 300 to 350. In other words, I simulate welfare gains at the bottom and top of the distribution that are either (a) worth no less or no more than the next estimated score or (b) worth twice as much/little as the previous estimated change. These alternative extrapolation lines are indicated in maroon.
Figure IX: Mean White Minus Black Scores over Time
[Figure: panels (a) Math and (b) Reading, each for Ages 9, 13 and 17; x-axis: assessment years; y-axis: Standardized White-Black Mean Difference; legend: NAEP, MCI (Average Age)]
This figure depicts mean white minus mean black scores across time for ages 9, 13 and 17 in math and reading, respectively. The dashed line corresponds to the original NAEP scale; the solid line corresponds to the welfare-adjusted scale from Equation (1.8), dropping age interactions (standardized to µ = 100 and σ = 10).
Figure X: Mean White Minus Mean Black Math Scores across Age, by Cohort
[Figure: x-axis: assessment years; y-axis: Standardized White-Black Mean Difference; legend: NAEP, MCI (Average Age)]
This figure depicts mean white minus mean black math achievement for six cohorts of students aged 9 and 17 in years t and t + 8. Solid lines correspond to the original NAEP scale; dashed lines to the welfare-adjusted scale. Each line reflects the change in the white-black achievement gap as one cohort of students changes in achievement between the ages of 9 and 17.
Figure XI: Mean White Minus Mean Black Reading Scores across Age, by Cohort
[Figure: x-axis: assessment years; y-axis: Standardized White-Black Mean Difference; legend: NAEP, MCI (Average Age)]
This figure depicts mean white minus mean black reading achievement for six cohorts of students aged 9 and 17 in years t and t + 8. Solid lines correspond to the original NAEP scale; dashed lines to the welfare-adjusted scale. Each line reflects the change in the white-black achievement gap as one cohort of students changes in achievement between the ages of 9 and 17.
Figure XII: Sensitivity to Extrapolation: White-Black Math Gap across Age, by Cohort
[Figure: x-axis: assessment years; y-axis: Standardized White-Black Mean Difference; legend: MCI, Shallow/Steep, Steep/Shallow, Steep/Steep, Shallow/Shallow]
This figure depicts changes in white-black gaps across cohorts using monotone cubic interpolation and four variants of extrapolation. The four variants are: shallow bottom/shallow top; shallow bottom/steep top; steep bottom/steep top; steep bottom/shallow top. Shallow indicates that utility gains below 150 or above 350 are worth no more beyond that threshold. Steep indicates that utility gains below 150 and above 350 are worth twice as much as they were from 200 to 150 and from 300 to 350.
Figure XIII: Sensitivity to Extrapolation: White-Black Reading Gap across Age, by Cohort
[Figure: x-axis: assessment years; y-axis: Standardized White-Black Mean Difference; legend: MCI, Shallow/Steep, Steep/Shallow, Steep/Steep, Shallow/Shallow]
This figure depicts changes in white-black gaps across cohorts using monotone cubic interpolation and four variants of extrapolation. The four variants are: shallow bottom/shallow top; shallow bottom/steep top; steep bottom/steep top; steep bottom/shallow top. Shallow indicates that utility gains below 150 or above 350 are worth no more beyond that threshold. Steep indicates that utility gains below 150 and above 350 are worth twice as much as they were from 200 to 150 and from 300 to 350.
VII. Tables
Table I: Reading Performance Level Descriptors
Level 150: Carry Out Simple, Discrete Reading Tasks
Readers at this level can follow brief written directions. They can also select words, phrases, or sentences to describe a simple picture and can interpret simple written clues to identify a common object. Performance at this level suggests the ability to carry out simple, discrete reading tasks.
Level 200: Demonstrate Partially Developed Skills and Understanding
Readers at this level can locate and identify facts from simple informational paragraphs, stories, and news articles. In addition, they can combine ideas and make inferences based on short, uncomplicated passages. Performance at this level suggests the ability to understand specific or sequentially related information.
Level 250: Interrelate Ideas and Make Generalizations
Readers at this level use intermediate skills and strategies to search for, locate, and organize the information they find in relatively lengthy passages and can recognize paraphrases of what they have read. They can also make inferences and reach generalizations about main ideas and the author's purpose from passages dealing with literature, science, and social studies. Performance at this level suggests the ability to search for specific information, interrelate ideas, and make generalizations.
Level 300: Understand Complicated Information
Readers at this level can understand complicated literary and informational passages, including material about topics they study at school. They can also analyze and integrate less familiar material about topics they study at school as well as provide reactions to and explanations of the text as a whole. Performance at this level suggests the ability to find, understand, summarize, and explain relatively complicated information.
Level 350: Learn from Specialized Reading Materials
Readers at this level can extend and restructure the ideas presented in specialized and complex texts. Examples include scientific materials, literary essays, and historical documents. Readers are also able to understand the links between ideas, even when those links are not explicitly stated, and to make appropriate generalizations. Performance at this level suggests the ability to synthesize and learn from specialized reading materials.
Reading Performance Level Descriptors for the National Assessment of Educational Progress, Long-Term Trend. Available at: https://nces.ed.gov/nationsreportcard/ltt/reading-descriptions.aspx
Table II: Math Performance Level Descriptors
Level 150: Simple Arithmetic Facts
Students at this level know some basic addition and subtraction facts, and most can add two-digit numbers without regrouping. They recognize simple situations in which addition and subtraction apply. They also are developing rudimentary classification skills.
Level 200: Beginning Skills and Understandings
Students at this level have considerable understanding of two-digit numbers. They can add two-digit numbers but are still developing an ability to regroup in subtraction. They know some basic multiplication and division facts, recognize relations among coins, can read information from charts and graphs, and use simple measurement instruments. They are developing some reasoning skills.
Level 250: Numerical Operations and Beginning Problem Solving
Students at this level have an initial understanding of the four basic operations. They are able to apply whole number addition and subtraction skills to one-step word problems and money situations. In multiplication, they can find the product of a two-digit and a one-digit number. They can also compare information from graphs and charts, and are developing an ability to analyze simple logical relations.
Level 300: Moderately Complex Procedures and Reasoning
Students at this level are developing an understanding of number systems. They can compute with decimals, simple fractions, and commonly encountered percents. They can identify geometric figures, measure lengths and angles, and calculate areas of rectangles. These students are also able to interpret simple inequalities, evaluate formulas, and solve simple linear equations. They can find averages, make decisions based on information drawn from graphs, and use logical reasoning to solve problems. They are developing the skills to operate with signed numbers, exponents, and square roots.
Level 350: Multistep Problem Solving and Algebra
Students at this level can apply a range of reasoning skills to solve multistep problems. They can solve routine problems involving fractions and percents, recognize properties of basic geometric figures, and work with exponents and square roots. They can solve a variety of two-step problems using variables, identify equivalent algebraic expressions, and solve linear equations and inequalities. They are developing an understanding of functions and coordinate systems.
Math Performance Level Descriptors for the National Assessment of Educational Progress, Long-Term Trend. Available here: https://nces.ed.gov/nationsreportcard/ltt/math-descriptions.aspx
Table III: Results from Ranking Exercise

                          Reading                         Math
                   Mean       Mean-by-Distance    Mean       Mean-by-Distance
Level 150          0.751 ***  0.647 ***           0.691 ***  0.531 ***
                   (0.026)    (0.063)             (0.027)    (0.066)
  Distance 150                0.161 *                        0.143
  Distance 200                0.100                          0.228 **
Level 200          0.788 ***  0.733 ***           0.661 ***  0.585 ***
                   (0.027)    (0.044)             (0.027)    (0.047)
  Distance 150                0.059                          0.062
  Distance 200                0.153 *                        0.258 **
Level 250          0.672 ***  0.603 ***           0.721 ***  0.664 ***
                   (0.026)    (0.037)             (0.027)    (0.038)
  Distance 150                0.163 **                       0.062
  Distance 200                0.106                          0.238 **
Level 300          0.697 ***  0.590 ***           0.711 ***  0.713 ***
                   (0.026)    (0.045)             (0.027)    (0.047)
  Distance 150                0.155 **                       0.072
  Distance 200                0.184 *                        -0.223 **
Level 350          0.627 ***  0.478 ***           0.637 ***  0.531 ***
                   (0.027)    (0.066)             (0.027)    (0.066)
  Distance 150                0.140                          0.136
  Distance 200                0.197 **                       0.122
N                  1455                            1461
Respondents        485                             487

Notes: The regression model estimates the linear probability that respondents ranked a performance level descriptor correctly. The sample excludes respondents if (a) they were randomly assigned “ties” or (b) they did not rank all three items. The column “Mean” describes the percent of reading or math level descriptors ∈ {150, 200, 250, 300, 350} ranked correctly. “Mean-by-Distance” disaggregates these percentages into three categories, according to whether the cumulative distance of the three descriptors summed to 100, 150, or 200 (e.g., a random draw of 150, 200, 250 sums to 100). Stars indicate * for p<.05, ** for p<.01, and *** for p<.001. The Mean-by-Distance test is relative to the omitted category, Distance 100.
Chapter 4
The sensitivity of causal estimates from
Court-ordered finance reform on
spending and graduation rates (with
Christopher Candelaria)
Abstract

We provide new evidence about the effect of court-ordered finance reform on per-pupil revenues and graduation rates. We account for cross-sectional dependence and heterogeneity in the treated and counterfactual groups to estimate the effect of overturning a state’s finance system. Seven years after reform, the highest poverty quartile in a treated state experienced a 4 to 12 percent increase in per-pupil spending and a 5 to 8 percentage point increase in graduation rates. We subject the model to various sensitivity tests. In most cases, point estimates for graduation rates are within 2 percentage points of our preferred model.
I. Introduction
Whether school spending has an effect on student outcomes has been an open question in the economics
literature, dating back to at least Coleman (1966). The causal relationship between spending and desirable
outcomes is of obvious interest, as the share of GDP that the United States spends on public elementary
and secondary education has remained fairly large throughout the past five decades, ranging from 3.7 to 4.5
percent.1 Given the large share of spending on education, it would be useful to know if these resources
are well spent. Despite this interest, we lack stylized facts about the effects of spending on changes in
student academic and adult outcomes. The goal of this paper is to provide a robust description of the causal
relationship between fiscal shocks and student outcomes at the district level for US states undergoing financial
reform for the period 1990-91 to 2009-10.
The opportunity for more robust descriptions of causal relationships with panel data emerges from two
sources. The first is that data collection efforts have extended the time dimension of panel data, allowing for
more sophisticated tests of the identifying assumptions of quasi-experimental methods such as differences-
in-differences estimators. Previous research efforts on the effects of school spending were hampered by
data limitations such as this.2 The second is that recent econometric methods have been developed to better
model unobserved treatment heterogeneity, counterfactual trends, pre-treatment trends (correlated random
trends), and cross-sectional dependence (CSD). If any of these unobserved components are correlated with
regressors, then the econometric model will be biased. Fortunately, it is possible to test for a wide variety of
model specifications to determine the sensitivity of causal estimates to modeling choice.
Using district-level aggregate data from the Common Core of Data (CCD), we estimate the effects of fiscal shocks, where fiscal shocks are defined as the first state Supreme Court ruling that overturns a given state’s finance system, on the natural logarithm of per-pupil spending and graduation rates. Our estimation approach is
designed to handle three aspects of the identification problem: first, treatment occurs at the state level; second,
there is treatment effect heterogeneity within states; third, there is heterogeneity in the pre-treatment trends
between treatment and control units.
At the state level, we are interested in identifying a plausibly exogenous shock to the state’s finance system.
Here, we wish to control for the presence of cross-sectional dependence (CSD), which can arise if there is
1 These estimates come from the Digest of Education Statistics, 2013 edition. As of the writing of this draft, the 2013 version is the most recent publication available.
2See for example, Hoxby (2001) and Card and Payne (2002).
cross-sectional heterogeneity in response to common shocks, spatial dependence or interdependence. We
control for CSD by including interactive fixed effects, as suggested by Bai (2009). These interactive fixed
effects are estimated at the state level and control for unobserved correlations between states in the panel.
We account for two forms of treatment heterogeneity. The first is heterogeneity that takes place within
treatment and control groups. It is known that unmodeled treatment heterogeneity can lead to bias if (a)
the probability of treatment assignment varies with a variable X and (b) the treatment effect varies with
a variable X (Angrist, 1995; Angrist, 2004; Elwert, 2010). Here, we decompose treatment and control
groups by constructing a time-invariant poverty quartile indicator variable equal to 1 if a state’s district is
in one of four poverty quartiles for year 1989. Each of these poverty quartile variables is interacted with a
treatment year variable, for a total of 76 (19 years x 4 quartiles) treatment interactions. In order to provide
a counterfactual for each of these poverty quartiles, we estimate a poverty quartile-by-year secular trend
indexed by poverty quartile-by-year fixed effects. Estimating treatment heterogeneity in this way can provide
an unbiased estimate of the treatment effect if we have identified the correct heterogeneous variable.
A second concern is heterogeneity between treatment and control groups, such as would occur if poverty
quartiles in treated states have different pre-treatment trends than poverty quartiles in non-treated states. This
suggests that some units (e.g., high poverty districts) of treated states have different secular trends than their
counterparts in non-treated states. Quartile-by-year fixed effects assume equivalent secular trends between
treated and control units; results will be biased if this assumption is not met. To account for this, we estimate
state-by-poverty quartile linear time trends (referred to as correlated random trends), for a total of 192 (48
states x 4 quartiles x linear time) continuous fixed effects. Effectively, we assume that treatment and control
differences in secular trends can be controlled for with functional form assumptions on the time trend. These
terms provide pre-treatment balance with respect to state-poverty quartile trends in the dependent variable.
All together, we estimate a heterogeneous differences-in-differences model that accounts for (a) cross-
sectional dependence at the state level, (b) a poverty quartile-by-year secular trend, and (c) state-by-poverty
quartile linear time trends. In this preferred specification, we find that high poverty districts in states that had
their finance regimes overturned by court order experienced an increase in log spending of 4 to 12 percent
and an increase in graduation rates of 5 to 8 percentage points seven years following reform.
We then subject the model to various sensitivity tests by permuting the interactive fixed effects, secular
time trends, and correlated random trends. In total we estimate 15 complementary models. Generally, the
results are robust to model fit: relative to the preferred model, interactive fixed effects and the specification
of the secular time trend have modest effects on point estimates. The model is quite sensitive to the presence
of correlated random trends, however. When state-by-poverty quartile time trends are ignored or estimated
at a higher level of aggregation (the state), the effects of reform on graduation rates are zero and precisely
estimated. When we estimate the linear time trend using a lower level of aggregation (the district), point
estimates are similar to those of the preferred model. We conclude that treatment and control sub-state units
have different secular trends, but conditionally exogenous point estimates are available if we are willing to
assume that the sub-unit pre-treatment trends can be approximated with a functional form.
To test the extent to which results are equalizing, we estimate slightly different models, allowing the effect
of reform to be continuous across the poverty quantiles. That is, we interact treatment year variables with
a continuous poverty quantile variable, while controlling for secular changes in this continuous variable for
untreated states. This provides an estimate of the marginal change in graduation rate for a one-unit increase
in poverty percentile rank within a state. Here we see that the effect of reform was equalizing: for every 10
percentile increase in poverty within a treated state, per-pupil log revenues increased by 0.9 to 1.8 percent
and graduation rates increased by 0.5 to 0.85 percentage points in year seven.
Because we have aggregate data, one threat to identification would occur if treatment induced demographic
change and demographic variables correlate with outcomes. For instance, if state spending increased school
quality but kept property taxes down, high income parents (with children who are presumably more likely to
graduate) might relocate to schools housed in historically high poverty districts. To test for this possibility, we
estimate our “equalizing” models, substituting district-level demographic variables that could indicate propensity to graduate (percent poor, percent minority, and percent special education) for the original outcome variables. We find no evidence that the minority composition of high poverty districts changed after reform, but we do find that these districts experienced an increase in poverty and in the percent of students qualifying for special education. For our results to be biased in a meaningful way, we would have to assume that increases in poverty and special education rates positively affect graduation rates.
This paper makes substantive and methodological contributions. Substantively, we find that court cases
overturning a state’s financial system for the period 1991-2010 had an effect on revenues and graduation rates,
that these results are robust to a wide variety of modeling choices, and that this effect was equalizing. Taken
together, our two models show that states undergoing court-ordered finance reform both (a) increased revenues and
graduation rates in high poverty districts relative to high poverty districts in other states and (b) allocated
a greater share of these revenues to higher poverty districts within the state, relative to allocations taking
place in non-treated states, resulting in an increase in graduation rates. Methodologically, we emphasize the
variety of modeling strategies available to researchers using panel data sets, including specification of the
secular trend, correlated random trends, and cross-sectional dependence. While the researcher may argue for
a preferred model, justifiable alternatives are often available. Here we have presented a graphical method that
researchers can use to effectively and efficiently demonstrate the sensitivity of point estimates to modeling
choice.
II. Background
State-level court-ordered finance reform beginning in 1989 has come to be known as the “Adequacy Era.”
These court rulings are often treated as fiscal shocks to state school funding systems. A number of papers
have attempted to link these plausibly exogenous changes in spending to changes in other desired outcomes,
like achievement, graduation and earnings (Hoxby, 2001; Card and Payne, 2002; Jackson et al., 2015). The results of Card and Payne (2002) and Hoxby (2001) were in conflict, but both studies were hampered by data limitations: only a simple pre-post contrast between treatment and control states was available to the researchers, limiting their capacity to verify the identifying assumptions of the differences-in-differences model.
Most recently, Jackson, Johnson and Persico (2015) have constructed a much longer panel (with respect to
time), with restricted-use individual-level data reaching back to children born between 1955 and 1985, to test
the effects of these cases on revenues, graduation rates and adult earnings. Leveraging variation across cohorts
in exposure to fiscal reform, this study finds large effects from Court order on school spending, graduation
and adult outcomes, and these results are especially pronounced for individuals coming from low-income
households and districts.
For this study, the sample of students is taken from the Panel Study of Income Dynamics (PSID), which
is representative at the national level. Using these data, identification is leveraged from variation between
states over time, and results are disaggregated using within state variation around district-level income. One
particular concern is the possibility of spurious correlations resulting from the PSID sampling design. If
sampled individuals in low income districts in treated states are, by chance, more likely to respond to treat-
ment, then results will be biased. The use of population weights can exacerbate this problem, if the spurious
correlations induced by the non-representative sample correlate with the weights. The authors are aware of
this concern and test a model using Common Core graduation rate data, as we do here.3 They find a similar
pattern of results using the alternative data set.
The purpose of this paper is three-fold. First, it is important to show that results by Jackson and colleagues
(2015) can be replicated across other data sets. Here, we use data from the Common Core of Data (CCD), which
provides aggregate spending and graduation rates for the universe of school districts in the United States.
Both the PSID and CCD have a kind of Anna Karenina problem, in which each data set is unsatisfactory in
its own way. The PSID follows individual students over time but has unobserved sampling issues that may
correlate with treatment; the CCD contains the universe of districts but does not follow students over time
and may not reveal sorting within districts. If results are qualitatively similar across different data, we can
feel more confident that estimates do not reveal spurious correlations resulting from the sample generating
process. Second, it is important to show that results are insensitive to similarly compelling modeling choices.
While Jackson and colleagues (2015) find a pattern of results from the CCD that is largely consistent with
those results from the PSID, they do not test to see whether those results are sensitive to model specification.
If we are to believe that results from the CCD sample largely corroborate results from the PSID, we must
show that the identifying assumptions using the CCD sample are met. Our purpose is to present results
that are robust to modeling choices that account for secular trends, correlated random trends, and cross-
sectional dependence. In so doing, we provide upper and lower bounds on effect sizes by permuting these
parameters. Third, and finally, we provide new evidence about the extent to which Court-ordered finance
reforms increased levels of spending and graduation rates, as well as the extent to which these same states
equalized resources and graduation rates across poverty quantiles, relative to equalizing efforts made in states
without reform.
III. Data
Our data set is the compilation of several public-use surveys that are administered by the National Center for
Education Statistics and the U.S. Census Bureau. We construct our analytic sample using the following data
sets: Local Education Agency (School District) Finance Survey (F-33); Local Education Agency (School
3 See Appendix B in the NBER working paper, found here: http://www.nber.org/papers/w20118.pdf. In the final version (Appendix N), the authors test whether their estimates generalize to districts not included in the PSID, which is the same test for a different concern. External and internal validity are threatened if the PSID has spurious correlations or does not generalize.
District) Universe Survey; Local Education Agency Universe Survey Longitudinal Data File: 1986-1998
(13-year); Local Education Agency (School District) Universe Survey Dropout and Completion Data; and
Public Elementary/Secondary School Universe Survey.4
Our sample begins in the 1990-91 school year and ends in the 2009-10 school year. The data set is a
panel of aggregated data, containing United States district and state identifiers, indexed across time. The
panel includes the following variables: counts of free lunch eligible (FLE) students, per pupil log and total
revenues, percents of 8th grade students receiving diplomas 4 years later (graduation rates), total enrollment,
percents of students that are black, Hispanic, minority (an aggregate of all non-white race groups), special
education, and children in poverty. Counts of FLE students are turned into district-specific percentages, from
which within state rankings of districts based on the percents of students qualifying for free lunch are made.
Using FLE data from 1989-90, we divide states into FLE quartiles, where quartile 4 is the top poverty quartile
for the state.5 Total revenues are the sum of federal, local, and state revenues in each district. We divide this
value by the total number of students in the district and deflate by the US CPI, All Urban Consumers Index
to convert the figure to real terms. We then take the natural logarithm of this variable. Our graduation rates
variable is defined as the total number of diploma recipients in year t as a share of the number of 8th graders
in year t − 4, a measure which Heckman (2010) shows is not susceptible to the downward bias caused by
using lagged 9th grade enrollment in the denominator. We top-code graduation rates so that they take a
maximum value of one.6 The demographic race variables come from the school-level file from the Common
Core and are aggregated to the district level; percents are calculated by dividing by total enrollment. Special
education counts come from the district level Common Core. Child poverty is a variable we take from the
Small Area Income and Poverty Estimates (SAIPE).
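As a concrete sketch of this variable construction, consider the following Python fragment. This is our illustration only: the CPI deflators and district records are hypothetical placeholders (the paper uses the CPI-U series and CCD files), and all function names are ours.

```python
import math

# Hypothetical CPI-U deflators keyed by year (placeholders, not BLS values).
CPI = {1991: 0.61, 2010: 0.98}

def real_log_ppr(total_revenue, enrollment, year, cpi=CPI):
    """Natural log of real per-pupil revenue (federal + state + local)."""
    return math.log((total_revenue / enrollment) / cpi[year])

def graduation_rate(diplomas_t, eighth_graders_t_minus_4):
    """Diplomas in year t over 8th-grade enrollment in t-4, top-coded at 1."""
    return min(diplomas_t / eighth_graders_t_minus_4, 1.0)

def fle_quartiles(districts):
    """Within-state FLE poverty quartiles (quartile 4 = highest poverty).

    `districts`: list of (district_id, pct_fle) pairs for a single state.
    """
    ranked = sorted(districts, key=lambda d: d[1])
    n = len(ranked)
    return {d_id: min(4, 1 + (4 * i) // n) for i, (d_id, _) in enumerate(ranked)}
```

Because quartiles are fixed at their 1989-90 values, the quartile assignment runs once per state and is then merged onto every year of the panel.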
To define our analytic sample, we place some restrictions on the data, and we address an issue with New
York City Public Schools (NYCPS). First, we drop Hawaii and the District of Columbia from our sample,
as each place has only one school district. We also dropped Montana from our analysis because they were
missing a substantial amount of graduation rate data. We keep only unified districts to exclude non-traditional
districts and to remove charter-only districts. We define unified districts as those districts that serve students in
either Pre-Kindergarten, Kindergarten, or 1st grade through the 12th grade. For the variables total enrollment,
4 Web links to each of the data sources are listed in the appendix.
5 Missing FLE data for that year were addressed by NCES interpolation methods. Data are found in the Local Education Agency Universe Survey Longitudinal Data File (13-year).
6 In Appendix IX., we describe where the data were gathered and cleaned, including URL information for where data can be found.
graduation rates and FLE, NYCPS reports its data as 33 geographic districts in the nonfiscal surveys; for total
revenues, NYCPS is treated as a consolidated district. For this reason, we needed to combine the non-fiscal
data into a single district. As suggested in the NCES documentation, we use NYCPS’s supervisory union
number to aggregate the geographical districts into a single entity.
We noticed a series of errors for some state-year-dependent variable combinations. In some states, counts
of minority students were mistakenly reported as 0, when in fact they were missing. This occurred in Ten-
nessee, Indiana, and Nevada in years 2000-2005, 2000, and 2005, respectively. The special education variable
had two distinct problems. For three states it was mistakenly coded as 0 when it should have been coded as
missing. This occurred in Missouri, Colorado, and Vermont in years 2004, 2010 and 2010, respectively. We
also observed state-wide 20 percentage point increases in special education enrollment for two states, which
immediately returned to pre-spike levels in the year after. This occurred in Oregon and Mississippi in years
2004 and 2007, respectively. Finally, graduation rate data also spiked dramatically before returning to pre-
spike levels in three state-years. This occurred in Wyoming, Kentucky and Tennessee in years 1992, 1994
and 1998, respectively. In each of these state-year instances where data were either inappropriately coded as
zero or fluctuated due to data error, we coded the value as missing.
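In code, this cleaning step reduces to a lookup of known-bad state-year-variable cells. The sketch below uses our own field names; the cell list mirrors the errors described above.

```python
# (state, year, variable) cells known to be bad: zeros that should be
# missing, or one-year spikes that revert immediately (see text).
BAD_CELLS = {
    ("TN", 2000, "pct_minority"), ("TN", 2001, "pct_minority"),
    ("TN", 2002, "pct_minority"), ("TN", 2003, "pct_minority"),
    ("TN", 2004, "pct_minority"), ("TN", 2005, "pct_minority"),
    ("IN", 2000, "pct_minority"), ("NV", 2005, "pct_minority"),
    ("MO", 2004, "pct_sped"), ("CO", 2010, "pct_sped"),
    ("VT", 2010, "pct_sped"),
    ("OR", 2004, "pct_sped"), ("MS", 2007, "pct_sped"),
    ("WY", 1992, "grad_rate"), ("KY", 1994, "grad_rate"),
    ("TN", 1998, "grad_rate"),
}

def clean_cell(state, year, variable, value):
    """Recode known-bad state-year-variable cells to missing (None)."""
    return None if (state, year, variable) in BAD_CELLS else value
```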
To our analytic sample, we add the first year a state’s funding regime was overturned in the Adequacy
Era. The base set of court cases comes from Corcoran and Evans (2008), and we updated the list using data
from the National Education Access Network.7 Table I lists the court cases we are considering. As shown,
there are a total of twelve states that had their school finance systems overturned during the Adequacy Era.
Kentucky was the first to have its system overturned in 1989 and Alaska was the most recent; its finance
system was overturned in 2009.
[Insert Table I Here]
Table II provides summary statistics of the key variables in our interpolated data set, excluding New York
City Public Schools (NYCPS).8 We have a total of 188,752 district-year observations. The total number of
unified districts in our sample is 9,916. The average graduation rate is about 77 percent and average log per
pupil spending is 8.94 (total real per pupil revenues are about $7,590). When we do not weight our data by
district enrollment, we obtain similar figures, but they are slightly larger.
7 The National Education Access Network provides up-to-date information about school finance reform and litigation. Source: http://schoolfunding.info/.
8 We drop NYCPS because it is an outlier district in our data set. We provide a detailed explanation of why we do this in our results section.
[Insert Table II Here]
IV. Econometric specifications and model sensitivity
In this section, we describe our empirical strategy to estimate the causal effects of school finance reform at the
state level on real log revenues per student and graduation rates at the district level. We begin by positing our
preferred model, which is a differences-in-differences equation that models treatment heterogeneity across
FLE poverty quartiles. We then explain what each of the parameter choices are designed to control for and
why they are selected. Because treatment occurs at the state level and our outcomes are at the district level,
there are several ways to specify the estimating equation. For example, there are choices about whether and
how to model the counterfactual time trend and how to adjust for correlated random trends (i.e., pre-treatment
trends) and unobservable factors such as cross-sectional dependence. We outline these alternative modeling
choices and discuss their implications relative to the benchmark model.
IV.A. Benchmark differences-in-differences model
To identify the causal effects of finance reform, we leverage the plausibly exogenous variation resulting from
state Supreme Court rulings overturning a given state’s fiscal regime. Prior education finance studies have also
relied on the exogenous nature of court rulings to estimate causal effects on fiscal and academic outcomes
(see, for example, Sims, 2011; Jackson, et al., 2015). After a lawsuit is filed against a state’s education
funding system, we assume that the timing of the Court’s ruling is unpredictable. Under this assumption,
the decision to overturn a funding system defines treatment and the date of the decision constitutes a random
shock.
With panel data, the exogenous timing of court decisions generates a natural experiment that can be mod-
eled using a differences-in-differences framework. States were subject to reform in different years and not
all states had reform, which provides treatment and control groups over time. Our benchmark differences-in-
differences model takes the following form:
Ysqdt = θd + δtq + ψsqt + P′st βq + λ′s Ft + εsdt,        (4.1)
where Ysqdt is an outcome of interest—real log revenues per student or graduation rates—in state s, in poverty
quartile q, in district d, in year t; θd is a district-specific fixed effect; δtq is a time by poverty quartile-specific
fixed effect; ψsqt is a state by quartile-specific linear time trend; Pst is a policy variable that takes value 1
in the year state s has its first reform and remains value 1 for all subsequent years following reform; λ′sFt is
a factor structure that accounts for cross-sectional dependence at the state level; and εsdt is assumed to be a
mean zero, random error term. To account for serial correlation, all point estimates have standard errors that
are clustered by state, the level at which treatment occurs (Bertrand, Duflo and Mullainathan, 2004).
Our parameters of interest are the βq , which are the causal estimates of school finance reform in quartile
q on Ysqdt. We define q such that q ∈ {1, 2, 3, 4}, and the highest level of poverty is represented by quartile
4.9 Throughout our paper, we parameterize the policy variable P′st such that each poverty quartile’s effect is
estimated, so we do not have an omitted quartile. Moreover, we allow treatment effects to have a dynamic
treatment response pattern (Wolfers, 2006). Each βq , therefore, is a vector of average effect estimates in the
years after reform in quartile q.
Overall, we estimate 19 treatment effects for each poverty quartile’s vector of effects. Although there are
21 effects we can potentially estimate, we combine treatment effect years 19 through 21 into a single binary
indicator, as there are only two treatment states that contribute to causal identification in these later years.10 In
reporting our results, we only report effect estimates for years 1 through 7 after reform. We do this because
estimating treatment effects several years after treatment occurs results in precision loss and because very
few states were treated early enough to contribute information to treatment effect estimates in later years
(see Table I). All together, this model absorbs approximately 9,800 district fixed effects, 76 year effects, 192
continuous fixed effects (state-by-FLE quartile interacted with linear time), as well as the factor variables
λ′sFt (i.e., the interactive fixed effects). The estimated factors and factor loadings λ′sFt can be decomposed
into the covariates θsFt + δtλs and entered directly into the regression model, thus allowing us to cluster the
standard errors at the state level.11 These non-treatment parameters are eliminated using high dimensional
fixed effects according to the Frisch-Waugh-Lovell theorem (Frisch, 1933; Lovell, 1963).12
9 Using FLE data from 1989-90, we divide states into FLE quartiles, where quartile 4 is the top poverty quartile for the state.
10 While our data sample has 20 years of data, we have up to 21 potential treatment effects, as KY had its reform in 1989 and our panel begins in the 1990-91 academic year. Therefore, KY does not contribute to a treatment effect estimate in the first year of reform, but it does contribute to effect estimates in all subsequent treatment years.
11 Appendix XII. shows how an estimated factor structure can be included in an OLS regression framework to obtain a variety of standard error structures.
12 The model is estimated in Julia using the packages FixedEffects.jl and SparseFactorModels.jl (Gomez, 2015a; 2015b).
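To make the treatment parameterization of Section IV.A concrete, the quartile-specific, dynamic event-time dummies can be sketched in Python. This is our illustration only: the reform-year mapping is truncated to the two states named in the text (Table I has the full list), and all names are ours.

```python
# Hypothetical truncated mapping of first-overturn years (see Table I).
REFORM_YEAR = {"KY": 1989, "AK": 2009}

def event_time_dummies(state, quartile, year, max_event_year=21, bin_start=19):
    """Return {(q, k): 0/1} treatment indicators for one district-year.

    k counts years since the state's first court-ordered reform (the year
    of the ruling is k = 1); event years 19-21 are pooled into one bin,
    and every quartile gets its own set of dummies (no omitted quartile).
    """
    dummies = {}
    reform = REFORM_YEAR.get(state)
    for q in (1, 2, 3, 4):
        for k in list(range(1, bin_start)) + ["19-21"]:
            on = 0
            if reform is not None and year >= reform:
                elapsed = year - reform + 1  # ruling year is event year 1
                if k == "19-21":
                    on = int(bin_start <= elapsed <= max_event_year)
                else:
                    on = int(elapsed == k)
            dummies[(q, k)] = on if q == quartile else 0
    return dummies
```

With 18 single-year indicators plus the pooled 19-21 bin, each quartile carries 19 dummies, matching the 76 (19 years x 4 quartiles) treatment interactions described in the Introduction.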
IV.B. Explaining model specifications
We now wish to articulate what the parameters from Equation (4.1) are controlling for and why they are
included in the model. Researchers are presented with a number of modeling choices, and our goal here is to
articulate the reasons we have for parameterizing the model the way we do. This also opens the possibility for
subjecting the model to sensitivity analysis, in order to determine upper and lower bounds of point estimates,
in relation to our preferred model. In particular, we examine choices related to cross-sectional dependence,
secular trends, and correlated random trends. We also consider the difference between OLS regression and
weighted least squares (WLS) regression, where the weights are measures of time varying district enrollment.
As will be shown in the results section, certain parameterizations of the differences-in-differences model have
substantial impacts on point estimates relative to our benchmark model, while many others do not.
Cross-sectional dependence
In our benchmark model, the terms λ′sFt specify that the error term has a factor structure that affects Ysqdt
and may be correlated with P ′st, the treatment indicators. Following Bai (2009), we define λs as a vector of
factor loadings and Ft as a vector of common factors. Each of these vectors is of size r, which is the number
of factors included in the model; in our model, we set r = 1. In the differences-in-differences framework,
the factor structure has a natural interpretation. Namely, the common factors Ft represent macroeconomic
shocks that affect all the units (e.g., recessions, financial crises, and policy changes), and the factor loadings
λs capture how states are differentially affected by these shocks. Of particular concern is the presence of
interdependence, which can result if one state’s Supreme Court ruling affects the chances of another state’s
ruling. This would violate the identifying assumptions of the differences-in-differences model and result in
bias, unless that interdependence is accounted for (Pesaran, 2007; Bai, 2009).
To estimate the λ′sFt factor structure in Equation (4.1), we use the method of principal components as
described by Bai (2009) and implemented by Moon and Weidner (2014) and Gomez (2015b). The procedure
begins by choosing starting values for the βq vector, which we denote as β̃q . Then, the following steps are
carried out:
[1] Calculate the residuals of the OLS estimator excluding the factor structure using β̃.
[2] Estimate the factor loadings, λs, and the common factors, Ft, on the residual vector obtained in step
[1] using the Levenberg-Marquardt algorithm (Wright, 1985).
[3] After estimating the factor structure, we remove it from the regressors using partitioned regression.
Then, we re-estimate the model using a least squares objective function in order to obtain a new
estimate of β̃.
[4] Steps [1] to [3] are repeated until the following stopping condition is achieved: After comparing each
element of the vector β̃ obtained in step [3] with the previous estimate β̃old, we stop if |β̃k − β̃oldk | <
10−8 for all k. If this condition is not achieved, then we stop if the change in the least squares
objective function calculated in step [3] is less than the Total Sum of Squares multiplied by 10−10.
This secondary stopping condition takes effect when the coefficient vector is not converging.
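The iteration above can be sketched in a few lines of code. The following Python fragment is an illustrative reconstruction, not the authors' implementation: it handles a single scalar regressor, substitutes a plain truncated SVD for the Levenberg-Marquardt routine in step [2], and all variable names are ours.

```python
import numpy as np

def interactive_fe(Y, X, r=1, tol=1e-8, max_iter=1000):
    """Sketch of the Bai (2009) principal-components iteration for a
    single scalar regressor. Y and X are N-by-T arrays (units by time)."""
    beta = 0.0                                  # starting value beta-tilde
    lam_F = np.zeros_like(Y)
    for _ in range(max_iter):
        # [1] residuals of the OLS fit, excluding the factor structure
        W = Y - beta * X
        # [2] loadings and factors by principal components: keep the
        #     leading r terms of the SVD of the residual matrix
        U, s, Vt = np.linalg.svd(W, full_matrices=False)
        lam_F = (U[:, :r] * s[:r]) @ Vt[:r, :]  # lambda_s' F_t, N-by-T
        # [3] remove the factor structure, re-estimate beta by least squares
        beta_new = np.sum(X * (Y - lam_F)) / np.sum(X * X)
        # [4] stop once successive estimates agree to within tol
        if abs(beta_new - beta) < tol:
            break
        beta = beta_new
    return beta_new, lam_F

# Simulated check: one common factor plus a true coefficient of 2.0
rng = np.random.default_rng(0)
N, T = 40, 30
X = rng.normal(size=(N, T))
Y = (2.0 * X + rng.normal(size=(N, 1)) @ rng.normal(size=(1, T))
     + 0.01 * rng.normal(size=(N, T)))
beta_hat, _ = interactive_fe(Y, X)
```

On simulated data with a single common factor, the iteration recovers the true coefficient; in practice the factor step and the convergence criteria are more elaborate, as described above.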
Traditional approaches to factor methods specify factor loadings at the lowest unit of analysis, in this case
d. However, we are interested in accounting for interdependence between states, the level at which treatment
occurs. While principal components analysis generally requires one observation per i-by-t, sparse factor
methods are available that allow for multiple i per t, as in our case, where we have multiple districts d within
states s for every year t. Moreover, sparse factor methods allow for missing data, thereby obviating the need
to implement interpolation or multiple imputation to fill in missing data (Wright, 1985; Raiko, 2008; Ilin,
2010).
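A minimal illustration of a factor fit that tolerates missing cells, in the spirit of the missing-data methods cited above, is an EM-style loop that alternates between imputing missing entries from the current low-rank fit and re-extracting the factors. This is a hedged sketch under our own simplifying assumptions, not the estimator used in the paper.

```python
import numpy as np

def factor_fit_missing(W, r=1, n_iter=1000):
    """EM-style rank-r principal-components fit that tolerates missing
    entries (NaNs); an illustrative sketch, not the authors' code."""
    mask = ~np.isnan(W)
    filled = np.where(mask, W, 0.0)            # initialize gaps at zero
    low_rank = filled
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        low_rank = (U[:, :r] * s[:r]) @ Vt[:r, :]
        # impute missing cells from the current rank-r reconstruction
        filled = np.where(mask, W, low_rank)
    return low_rank

# Check on a noiseless rank-1 matrix with roughly 10 percent of cells missing
rng = np.random.default_rng(1)
M = rng.normal(size=(20, 1)) @ rng.normal(size=(1, 15))
M_obs = np.where(rng.random(M.shape) < 0.1, np.nan, M)
recovered = factor_fit_missing(M_obs, r=1)
```

Because the imputation step uses only the low-rank fit, no interpolation or multiple imputation of the raw data is needed, which is the advantage noted in the text.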
Secular time trends
Secular time trends, often specified non-parametrically using binary indicator variables, such as year fixed
effects, adjust for unobservable factors affecting outcome variables over time. The usual assumption is that
these factors affect all units—in our case, districts—in the same way in a given year. Examples of un-
observable factors include national policy changes and macroeconomic events such as recessions. In the
differences-in-differences specification, controlling for these variables is important because the identifying
assumption of the model requires us to believe that the fixed effects represent the average counterfactual trend
that treated districts would have had in the absence of treatment.
In our benchmark specification, we control for secular time trends by including FLE quartile-by-year fixed
effects, denoted as δtq . We include these δtq fixed effects, instead of standard year fixed effects δt, to establish
a more plausible counterfactual trend for treated districts. Our main concern is that higher poverty districts
in both treated and untreated states were increasing in revenues and graduation rates. If the secular trend is
approximated by δt, and δt < δtq=4, then we will be overstating the effect of P ′stβq=4. By indexing the year
fixed effects with FLE quartiles, we are comparing treated districts in a given quartile to control districts in the
same quartile.
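Concretely, the δtq fixed effects amount to one indicator per year-by-quartile cell. The following pandas sketch, with invented data and our own column names, shows the construction:

```python
import pandas as pd

# Toy district-year panel; 'fle_q' is the (hypothetical) 1990 FLE quartile.
panel = pd.DataFrame({
    "district": [1, 1, 2, 2, 3, 3],
    "year":     [1990, 1991, 1990, 1991, 1990, 1991],
    "fle_q":    [1, 1, 4, 4, 4, 4],
    "log_rev":  [8.9, 9.0, 8.5, 8.7, 8.6, 8.8],
})
# Quartile-by-year fixed effects delta_tq: one dummy per (year, quartile)
# cell, so each treated district is compared to control districts in the
# same FLE quartile and year.
panel["tq"] = panel["year"].astype(str) + "_q" + panel["fle_q"].astype(str)
delta_tq = pd.get_dummies(panel["tq"], prefix="delta")
```

Each observation loads on exactly one year-by-quartile dummy, whereas standard year fixed effects δt would pool all quartiles within a year.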
Correlated random trends
If the timing of a state’s court ruling decision is correlated with unobserved trends at the state, district, or
other group level (e.g., state by FLE quartile), we must control for these trends to obtain unbiased estimates
of causal parameters. In the literature, these trends are formally called correlated random trends (Wooldridge,
2005; 2010), but they are often informally referred to as pre-treatment trends. Correlated random trends
serve a distinct purpose from secular, non-parametric trends. While the secular trends help to establish a
plausible counterfactual trend for the common trends assumption to hold, correlated random trends guard
against omitted variable bias caused by an endogenous policy shock.
In Equation (4.1), the parameter ψsqt is included because we believe the timing of the state ruling is
correlated with pre-treatment trends within the state, approximated by a state-by-quartile-specific slope. The
inclusion of this parameter aligns with the notion that the date on which a court-ordered finance system is
deemed unconstitutional is correlated with the slope of the FLE quartile trend within the state. For example, if
graduation rates are steeply declining among the most impoverished districts within state s, we might expect
a reform decision sooner than if the graduation rate had a mild, decreasing trend. Evidence for variation in
pre-treatment trends within states can be seen in Figure I, which shows weighted mean log spending
and graduation rates for states that experienced a Court ruling over time, where time is centered around the
year of Court ruling. With respect to graduation rates, there is an obvious downward trend prior to a Court’s
ruling.
[Insert Figure I Here]
We can think of our specification of the secular time trends and correlated random trends as addressing two
forms of heterogeneity. Our specification of the secular time trend addresses heterogeneity in the treatment
effect and the need to account for that heterogeneity with an appropriate counterfactual. By indexing the
secular time trend as δtq , we allow high poverty districts to be compared to other high poverty districts.
Our specification of the correlated random trend addresses heterogeneity between treated and control groups.
Including ψsqt addresses the fact that high poverty districts in treated states may have different pre-treatment
trends than high poverty districts in non-treated states.
Weighting
In Equation (4.1), we explicitly model treatment heterogeneity by disaggregating treatment effects into
poverty quartiles, q (Meyer, 1995). These quartiles are derived from the percentages of Free Lunch Eli-
gible (FLE) students reported at the district level in each state in 1990. We fix the year at the start of our
sample because the poverty distribution could be affected by treatment over time. The quartiles are defined
within each state.
While we capture treatment heterogeneity across poverty quartiles, we may fail to capture other sources
of treatment heterogeneity. Weighting a regression model by group size is traditionally used to correct for
heteroskedasticity, but it also provides a way to test for the presence of additional unobserved heterogeneity.
According to asymptotic theory, the probability limits of OLS and weighted least squares should be consis-
tent. Thus, regardless of how you weight the data, the point estimates between the two models should not
dramatically differ. When OLS and WLS estimates do diverge substantially, there is concern that the model
is not correctly specified, and it may be due in part to unobserved heterogeneity associated with the weighting
variable (Solon, 2015).
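The diagnostic can be sketched in a few lines: fit the same model by OLS and by WLS and compare the coefficients. This is an informal illustration with simulated data, not the authors' specification.

```python
import numpy as np

def ols_wls_slopes(y, x, w):
    """Under correct specification, OLS and WLS share a probability limit,
    so their slope estimates should be close; a large gap signals unmodeled
    heterogeneity tied to the weighting variable (Solon, 2015)."""
    X = np.column_stack([np.ones_like(x), x])
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_wls = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return b_ols[1], b_wls[1]

# Homogeneous-effect data: both estimators should recover the slope of 2
rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 1.0 + 2.0 * x + 0.01 * rng.normal(size=500)
w = rng.uniform(1.0, 10.0, size=500)   # enrollment-style weights, invented
slope_ols, slope_wls = ols_wls_slopes(y, x, w)
```

With a homogeneous effect the two slopes coincide up to sampling noise; if the effect varied systematically with w, the WLS slope would drift away from the OLS slope, which is the divergence discussed in the next section.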
For our differences-in-differences specification, we examine the sensitivity of our point estimates to the
inclusion of district-level, time-varying enrollment weights. We provide a detailed discussion about discrep-
ancies between weighted and unweighted point estimates in the next section.
IV.C. Alternative model specifications
Here we quickly outline alternative model specifications. In the Results section, we explore the sensitivity
of our preferred model to alternative parameterizations. We have presented arguments for our preferred
specification, but we recognize that alternative modeling approaches are common in the panel methods and
applied econometric literature. To test how sensitive our preferred results are to reasonable alternatives, we
estimate 15 models that broadly fall within the bounds of typical modeling choices. In the Results section,
we provide upper and lower bounds for how much point estimates depart from our preferred model. Here,
we outline the alternatives we estimate:
1. Cross-sectional dependence: We estimate models in which we assume λ′sFt = 0, as well as models
in which the number of included factors r ∈ {1, 2, 3}.13
2. Secular time trend: We estimate models in which we set δtq = δt, thereby modeling the counterfactual
time trend as constant across cross-sectional units.
3. Correlated random trend: We estimate models in which we set ψsqt ∈ {0, ψst, ψdt, ψsqt2, ψst2}.
That is, we either do not estimate a pre-treatment trend, we allow the pre-treatment trend to be es-
timated at the state and district levels, or we allow the time element to have a quadratic functional
form.
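The three modeling dimensions above can be enumerated programmatically. In this illustrative sketch the labels are ours, and the 15 models the text describes are a subset of the full Cartesian product rather than all of it.

```python
from itertools import product

# Cross-sectional dependence: number of factors r (0 = no adjustment)
csd = [0, 1, 2, 3]
# Secular time trend: FLE quartile-by-year vs plain year fixed effects
secular = ["delta_tq", "delta_t"]
# Correlated random trend: none, state, district, state-by-quartile,
# and the quadratic-in-time variants
crt = ["none", "psi_st", "psi_dt", "psi_sqt", "psi_sqt2", "psi_st2"]

grid = list(product(secular, crt, csd))  # 48 candidate specifications
```

A sensitivity analysis then loops over a chosen subset of `grid`, estimating the model for each triple and recording the treatment-year point estimates.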
V. Results
We present our results in four parts. We first show and discuss the causal effect estimates of court-ordered
finance reforms using our preferred differences-in-differences model. Second, we examine the extent to
which our benchmark model point estimates are sensitive to assumptions about secular trends, correlated
random effects, and cross-sectional dependence. Third, we assess whether reforms were equalizing across
the FLE poverty distribution; we wish to test formally whether, within treated states, poorer districts benefited
more relative to richer districts in terms of revenues and graduation rates. Finally, we conclude with a series
of robustness checks that allow us to gauge the validity of our causal estimates.
V.A. Benchmark differences-in-differences model results
Revenues
We report our causal effect estimates of court-ordered finance reform on the logarithm of per pupil rev-
enues and graduation rates in Tables IV and V, respectively. We obtain results by estimating our benchmark
differences-in-differences model in Equation (4.1). FLE quartile 1 represents low-poverty districts, and FLE
quartile 4 represents high-poverty districts. We only report treatment effect years 1 through 7 because the
number of states in the treated group changes dramatically over time. Some states were treated very late (or
13Moon and Weidner (2014) show that point estimates stabilize once the number of factors r equals the true number of factors r◦ and that when r > r◦, there is no bias. However, this is only true when the time dimension t in the panel approaches infinity. When t is small, it is possible to increase bias by including too many factors. See Table IV in their paper, as well as Onatski (2010) and Ahn (2013).
very early), and we do not have enough years of data to follow them past 2010. As shown in Table III, 6
years after treatment, Alaska and Kansas no longer contribute treatment information; in years 7 and 8, we
lose North Carolina and New York.14 We display both weighted and unweighted estimates, where the weight
is time-varying district enrollment.
[Insert Table III Here]
Examining the weighted results in Table IV, we find that court-ordered finance reforms increased revenues
in all FLE quartiles in the years after treatment, though not every point estimate is significant at conventional
levels. Because our models include FLE quartile-by-year fixed effects, point estimates for a given quartile
should be interpreted as relative to other FLE quartiles that are in the control group. For example, in year 7
after treatment, districts in FLE quartile 1 had revenues that were 12.7 percent higher than they would have
been in the absence of treatment, with the counterfactual trend established by non-treated districts in FLE
quartile 1 (significant at the 1 percent level). In FLE quartile 4, we find that the revenues were 11.9 percent
higher relative to what they would have been in the absence of reform, relative to non-treated districts in FLE
quartile 4 (significant at the 5 percent level). Comparing point estimates between FLE quartiles 1 and 4 is
problematic because the counterfactual trends in quartiles 1 and 4 may have different trajectories. It would be
wrong to conclude that the 12.7 percent effect is larger than the 11.9 percent effect, because the 12.7 percent
effect for quartile 1 may be relative to only a modest increase in non-treated low poverty districts, whereas
the 11.9 percent effect for quartile 4 may be relative to a steep increase in non-treated high poverty districts.
Later, we present models and results to test whether poorer districts received greater revenues and graduation
rates following reform.
Compared to weighted results, the unweighted results in Table IV suggest that revenues increased, but
many of the point estimates are not significant. In FLE quartile 4, for example, all point estimates are
positive, but none are significant. The magnitude of point estimates is also substantially smaller than those
from the weighted regression. In year 7, point estimates across the quartiles are no more than half the size
of the corresponding point estimates from the weighted model. When weighted and unweighted regression
estimates diverge, there is evidence of unmodeled heterogeneity, which we will discuss momentarily. Overall,
revenues increased across all FLE poverty quartiles in states with Court order, relative to equivalent poverty
quartiles in non-treated states.
[Insert Table IV Here]
14Throughout the rest of the paper we restrict the description of our results to years less than or equal to 7, though estimation occurs for the entire panel of data.
Graduation rates
With respect to graduation rates, the weighted results in Table V show that court-ordered finance reforms
were consistently positive and significant among districts in FLE quartile 4. In the first year after reform,
graduation rates in quartile 4 increased modestly by 1.3 percentage points. By treatment year 7, however,
graduation rates increased by 8.4 percentage points, which is significant at the 0.1 percent level. It is worth
emphasizing that each treatment year effect corresponds to a different cohort of students. Therefore, the
dynamic treatment response pattern across all 7 years is consistent with the notion that graduation rates do
not increase instantaneously; longer exposure to increased revenues catalyzes changes in academic outcomes.
We find modest evidence that FLE quartiles 2 and 3 improved graduation rates following Court order, though
these point estimates are not consistently significant and are smaller in magnitude than those in FLE 4. The
lowest-poverty districts in FLE quartile 1 have no significant effects, and the point estimates show no evidence
of an upward trend over time.
The unweighted graduation results in Table V tell a similar story as the weighted results; one key dif-
ference is that the point estimates tend to be smaller. In FLE quartile 4, for example, graduation rates are
4.7 percentage points higher in year 7 than they would have been in the absence of treatment. This point
estimate is almost half the size of the corresponding estimate when using weights. Although districts in the
lowest-poverty quartile exhibit some marginally significant effects, these effects are small, and do not suggest
a substantial increase from their levels before reform, which corresponds to the weighted regression results.
[Insert Table V Here]
Understanding differences between weighted and unweighted results
Although the discrepancy between weighted and unweighted results is an indication of model mis-specification,
we present some evidence that the mis-specification is driven, in part, by unmodeled treatment effect hetero-
geneity that varies by district size (Solon, 2015). To examine this, we discuss the New York City public
schools district (NYCPS) as a case study. Throughout all our analyses, we have excluded NYCPS, the largest
school district in the United States, because of its strong influence on the point estimates in FLE quartile 4
when weighting by district enrollment.15
15To be clear, NYCPS was removed from our analytic sample before estimating the benchmark regressions above.
In Figure II, we plot the causal effect estimates for treatment years 1 to 7 for both the logarithm of per
pupil revenues (left panel) and graduation rates (right panel) when NYCPS is included in the sample and
when it is not. For the unweighted regressions, it does not matter whether NYCPS is included, as all districts
are weighted equally and removing one district has little effect on overall outcomes. The weighted regression
results, however, show that the inclusion of the NYCPS district produces point estimates for revenues that
are systematically higher than weighted results that exclude NYCPS. A similar story holds for graduation
rates beginning in treatment year 4. After excluding NYCPS, the weighted model results are closer to the
unweighted point estimates, though they still do not perfectly align.
[Insert Figure II Here]
Examining NYCPS provides just one example of how treatment heterogeneity might be related to district
size. As shown in Appendix Figure V, NYCPS graduation rates and teacher salaries increased after 2003,
the year New York had its first court ruling. Enrollment, percent poverty, class size and percent minority
decreased during this period. Each of these are potential mechanisms for improving graduation rates and
likely contribute to the large treatment effect we observe in the weighted results. For all analyses, we drop
NYCPS from our sample because its district enrollment weight is near 1 million throughout our sample
period, which is an outlier in the distribution of district enrollment. Dropping the next set of largest districts
does not have such a dramatic effect on the results as dropping NYCPS does. For this reason, we retain all
other districts.
In line with Solon (2015), we acknowledge that the use of weighted least squares does not necessarily
provide a particular estimand of interest, such as the population average partial effect; instead, our OLS
and WLS results are identifying different weighted averages of complex heterogenous effects that vary ac-
cording to district size. In the presence of these heterogeneous effects neither set of results—weighted or
unweighted—should be preferred. Trying to model the heterogeneity is also quite complex, as illustrated
with the NYCPS case study. Overall, our weighted and unweighted point estimates are generally consistent
in terms of sign; however, the magnitude of the effect size tends to differ. In light of this, we continue to show
weighted and unweighted estimates in our tables. As the unweighted results tend to produce smaller effect
sizes than the weighted results, the unweighted results may be considered lower bound estimates of the
(heterogeneous) treatment effect, and the weighted results may be considered as upper bound estimates.
V.B. Model sensitivity
Our preferred model indicates a meaningful positive and significant effect of Court order on the outcomes of
interest. To examine model sensitivity, we focus attention on districts in the highest poverty quartile (i.e., FLE
quartile 4). The evidence suggests there was an effect for graduation rates and revenues for the FLE quartile 4
districts, but these results assume that our benchmark model makes correct assumptions about secular trends,
correlated random trends, and cross-sectional dependence.16 We now assess the extent to which results are
sensitive to these modeling choices.
Figures III and IV graphically show the variability of causal effect estimates of finance reform on the
logarithm of per pupil revenues and graduation rates, respectively, across a variety of model specifications.
Each marker symbol represents the difference between two point estimates, one from an alternative model
specification and the other from our benchmark model; we calculate this difference for treatment years 1
through 7. In the figures, we normalize our benchmark effect estimates to zero in each year. If a point
estimate is greater than zero, then our model understates the effect from the alternative model; if it is less
than zero, our model overstates the effect. We report more traditional regression tables with point estimates
and standard errors in the Appendix.17
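The normalization used in these figures is simple to state in code: subtract the benchmark estimate from each alternative model's estimate, year by year, so the benchmark series plots at zero. The numbers below are invented for illustration only.

```python
import numpy as np

# Treatment-year point estimates (illustrative, not the paper's values):
# benchmark model vs one alternative specification, years 1 through 7
benchmark   = np.array([0.02, 0.04, 0.05, 0.07, 0.09, 0.11, 0.127])
alternative = np.array([0.03, 0.05, 0.07, 0.09, 0.12, 0.14, 0.160])

# Normalized difference: positive values mean the benchmark understates
# the alternative model's effect; negative values mean it overstates it
diff = alternative - benchmark
```

Plotting `diff` for each alternative specification against treatment year reproduces the layout of Figures III and IV, with the benchmark as the zero line.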
While we do not report all possible combinations of different secular trend, correlated random trend, and
cross-sectional dependence models, the 11 models we do show provide insight into sensitivity of the causal
effect estimates. In both figures, our benchmark differences-in-differences model corresponds to the third
model in the legend, which is denoted by an “x” marker symbol and the following triple:
• ST: FLE by year; CRT: State by FLE; CSD: 1 (delimiter is the semicolon).
ST refers to the type of secular trend, which we model as either FLE by year fixed effects or year fixed effects.
CRT refers to the type of correlated random trend in the model, which we specify as state-specific, state by
FLE quartile-specific, or district-specific trends. Each of these trends is formed by interacting the appropriate
fixed effects with a function of time, whether linear or quadratic. Finally, CSD refers to the number of factors
16Although the unweighted revenue results are not statistically significant, the point estimates show a consistent positive effect.
17For log per pupil revenues, Appendix Tables VII and VIII report weighted and unweighted results, respectively. For graduation rates, Appendix Tables IX and X report weighted and unweighted results, respectively. We do not graphically present correlated random trend estimates that allow the time trend to have a quadratic function. Quadratic correlated trends, at the state and state-FLE quartile levels, are very noisy, with standard errors at times greater than twice the magnitude of point estimates. These results are also presented in the Appendix.
we include to account for cross-sectional dependence. A model with factor number 0 does not account for
cross-sectional dependence, while models labeled 1, 2, or 3 include 1, 2, or 3
factors, respectively.18 Marker fill indicates the secular trend, and marker symbol-by-size is used to indicate
combinations of correlated random trends and CSD.
In Figure III, we find that, on average, our benchmark model understates causal effects on revenues in FLE
quartile 4 relative to all other revenue models. The largest effects in both the weighted and unweighted
regressions are produced by a specification that includes year fixed effects for the secular trend, a correlated
random trend at the state level, and no adjustment for cross-sectional dependence. The point estimates from
this model are large, and they are precisely estimated. Although these results suggest there were large revenue
effects, we worry that this specification ignores omitted variables, such as CSD, as well as mis-specifies the
counterfactual trend, all of which could result in upward bias.
It may be illuminating to compare solid and hollow circles in Figure III, as these reflect models that
ignore correlated random trends and CSD but differ in how they estimate the counterfactual trends. Hollow
circles assume the counterfactual trend is homogeneous, while solid circles assume the counterfactual trend
is heterogeneous, indexed by poverty quartile. Hollow circle point estimates are consistently larger than solid
circle point estimates, for both weighted and unweighted regressions. As hypothesized, this indicates that
when we assume homogeneous trends (δt) we overestimate the treatment effect for high poverty districts
because revenues were increasing faster in high poverty districts, on average, relative to low poverty districts.
[Insert Figure III Here]
We consider the variability of graduation rate estimates in Figure IV. Unlike the revenues results, we have
cases where our benchmark model both over- and under-states effect estimates relative to other models.
Of particular interest is the influence of correlated random trends. In the absence of specifying a correlated
random trend, point estimates are between 2 to 6 percentage points smaller than point estimates from our
benchmark model. By including a state-level time trend (indicated by the larger solid diamond and hollow
square), point estimates are nearer to our preferred model, but are still lower by about 2 percentage points.
However, when we include district-by-year effects (approximately 9,800 linear time trends, indicated by solid
and hollow triangles) the point estimates largely align with the preferred model. This suggests that treatment
and control groups do have different pre-treatment trends, and that these differences largely occur at the sub-
18Bracketed numbers indicate the column location for those model estimates available in Appendix Tables VII, VIII, IX, and X.
state level. Modeling the time trends was motivated by the fact that FLE 4 districts were trending differently
prior to reform than FLE districts 1 through 3, and this is reinforced here. Overall, we find evidence of
omitted variable bias when correlated random trends are excluded from the model.
When our benchmark model understates the causal effects of other models, we find that the primary
difference is whether an adjustment for cross-sectional dependence has been made. Our benchmark model
accounts for cross-sectional dependence using a 1-factor model. The models with the largest point estimates,
relative to our benchmark model, do not account for cross-sectional dependence. It appears that treatment
might be correlated with macroeconomic shocks affecting graduation rates, or that treatment might be induced
by another state’s pattern of graduation rates. After correcting for cross-sectional dependence, point estimates
are not as large. It is important to emphasize that we do not know the true number of factors r to include
in the model. Unfortunately, due to finite sample bias, we cannot include as many factors as is necessary to
make the errors i.i.d.
[Insert Figure IV Here]
Overall, we find that our preferred model tends to understate effect sizes for real log revenues and that
causal effect estimates for graduation rates are variable depending on how we model secular trends, correlated
random trends, and cross-sectional dependence. When we adjust for pre-treatment trends (especially at a sub-
state level) point estimates for graduation rates stabilize and the variation around our benchmark estimates is
less than 2 percentage points. Changing parameters in a model is not trivial because some of these changes
affect the identification strategy while others affect assumptions about omitted variable bias. Here we have
argued for a model that takes account of treatment heterogeneity and various sources of omitted variable bias.
In addition, we have shown that it is relatively straightforward to depict the results from a wide range of
alternative modeling strategies, as depicted in Figures III and IV.
V.C. Equalizing effects
Our preferred specification estimates levels of change in log revenues and graduation rates, comparing
high/low poverty districts in treated states to high/low poverty districts in non-treated states. Comparing point
estimates across poverty quartiles is problematic because the counterfactual trends are allowed to vary. To
test whether revenues and graduation rates increased more in high poverty districts following court order, we
construct a variable that ranks districts within a state based on the percentage of students qualifying for FLE
status in 1989. This ranking is then converted into a percentile by dividing by the total number of districts
in that state. Compared to percentages qualifying for free lunch, these rank-orderings put districts on a common
metric, and are analogous to FLE quartiles, but with a continuous quantile rank-ordering.
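The continuous poverty measure can be sketched in pandas: rank districts within each state by their 1989 FLE share, then divide by the state's district count. Column names here are illustrative, not the authors'.

```python
import pandas as pd

# Toy data: districts with their (hypothetical) 1989 FLE shares
fr = pd.DataFrame({
    "state":    ["A", "A", "A", "A", "B", "B"],
    "fle_1989": [10, 30, 20, 40, 5, 15],
})
# Within-state rank divided by the number of districts in the state
# yields a quantile Q in (0, 1]; Q near 1 marks the poorest districts
fr["Q"] = (
    fr.groupby("state")["fle_1989"].rank(method="first")
    / fr.groupby("state")["fle_1989"].transform("size")
)
```

Dividing by the state's own district count is what puts states with very different numbers of districts on the common (0, 1] metric described above.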
The model that we estimate is analogous to Equation (4.1) with three changes:
1. We replace δtq with δt, so that the secular trend in Ysqdt is modeled as the average across all
districts.
2. We add a parameter δtQ, which is a continuous fixed effect variable that controls for year-specific
linear changes in Ysqdt with respect to Q, where Q is a continuous within-state poverty quantile rank-
ordering variable bounded between 0 and 1.
3. The treatment variable P ′stβq is set to equal P ′stQ.
Item [1] now adjusts for the average annual trend in log revenues and graduation rates among untreated
districts. Item [2] is done to provide a counterfactual secular trend with respect to how much non-treated
states are “equalizing” Ysqdt with respect to Q. The secular trend in these models now adjusts for the rate
that revenues and graduation rates are changing across FLE quantiles among untreated districts as well as the
average annual trend.19
The interpretation of the point estimates on the treatment year indicators, indicated in Item [3], is the
marginal change in our outcome variable Ysqdt given a one-unit change in FLE quantile within a state. For
revenues, a point estimate of 0.0001 is equivalent to a 0.01 percent change in per pupil total revenues for each
one-unit rank-order increase in FLE status within a state. For graduation rates, a point estimate of 0.0001 is
equivalent to a 0.01 percentage point increase for each one-unit rank-order increase. A positive coefficient
indicates that revenues and graduation rates are increasing more in poorer districts within a state.
Here, we highlight the trade-off between including δtq and δt in our preferred model. We included δtq
because we did not want to over-state the treatment effect by failing to account for the fact that high poverty
districts were increasing faster in untreated states as well. However, if we had included δt, then we could have
compared point estimates between P ′stβq=1 and P ′stβq=4, as these effect sizes would have been estimated
relative to a common secular trend. Given our interest in accounting for treatment heterogeneity, the two-
model approach is preferred, especially as the current model directly controls for the equalization efforts in
untreated states.
19Models that include the FLE poverty quartile-by-year fixed effect, not shown, are nearly identical.
We perform OLS and WLS for Equation (4.1), with the modifications described just above. These results
can be seen for both dependent variables in Tables IV and V. The columns of interest are columns 5 and 10,
which are labeled FLE Continuous.
After court-ordered reform, revenues increased across poverty quantiles, as indicated by the positive slope
coefficients in Table IV. Seven years after reform, a 10-unit increase in FLE percentile is associated with a
0.9 percent increase in per-pupil log revenues for the weighted regression. For the unweighted regression, a
10-unit increase is associated with a 1.8 percent increase. As previously discussed in Section V.A., neither the
weighted nor unweighted model results dominate each other, so we can view the slope coefficient as having a
lower bound of 0.9 percent and an upper bound of 1.8 percent. Assuming the treatment is linear, these results
suggest that districts in the 90th percentile would have had per pupil revenues that were between 7.2 and 14.2
percent higher than districts in the 10th percentile.
Table V also shows that court-ordered reform increased graduation rates across the FLE distribution. Seven
years after reform, a 10-unit increase in FLE percentile is associated with a 0.85 percentage point increase in
graduation rates for the weighted regression. For the unweighted model the corresponding point estimate is
0.50 percentage points. Assuming linearity, districts in the 90th percentile would have had graduation rates
that were between 4.0 and 6.8 percentage points higher than districts in the 10th percentile.
We showed in Figure I that high poverty quartiles in states undergoing reform experienced an increase
in both revenues and graduation rates, centered around the timing of reform. It was also evident that this
increase was larger than the increase in the other FLE quartiles. It was unknown whether this difference
reflected macroeconomic distributive patterns (perhaps due to federalization efforts) or whether it was a result
of Court order. The results of these models indicate that states undergoing reform shifted more revenues and
graduation rates to higher poverty districts, relative to shifts taking place in non-treated states.
V.D. Robustness checks
The largest threat to internal validity using aggregated data is selective migration. If treatment induces a
change in population and this change in population affects graduation rates, then the results using aggregate
graduation rates will be biased. Such a source of bias would occur if, for example, parents that value edu-
cation were more likely to move to areas that experienced increases in school spending. To test for selective
migration, we estimate the continuous version of our benchmark model on four dependent variables: log-
arithm of total district enrollment, percent minority (sum of Hispanic and black population shares within a
district), percent of children in poverty from the Census Bureau’s Small Area Income and Poverty Estimates
(SAIPE), and percent of students receiving special education. If there is evidence of population changes
resulting from treatment, and if these population characteristics are correlated with the outcome variable,
there may be bias. We would expect our results to be upwardly biased if treatment decreased the percentages
of students who are minority, poor, or receiving special education, as these populations of students have been
historically less likely to graduate (Stetser and Stillwell, 2014). Of course, we
cannot test for within demographic sorting, which would occur if students more likely to graduate within the
poor, minority and disabled populations we observe move into high poverty districts as a result of reform.
The inability to test for within composition sorting is a limitation of our data. Although we only report the
continuous treatment effect estimates of reform in Table VI, results from our main benchmark model appear
in the Appendix.20
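The logic of this check can be illustrated on synthetic data. The sketch below is not our full estimator (which also includes state-by-FLE trends and a factor structure); it is a minimal two-way fixed-effects regression of a demographic outcome on a continuous treatment, with every name and the data-generating process invented for illustration. Under no selective migration, the estimated coefficient should be near zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n_districts, n_years = 200, 10
true_effect = 0.0  # no selective migration: treatment does not move demographics

# Balanced synthetic panel: district effects, year effects, continuous treatment.
alpha = rng.normal(0, 1, n_districts)[:, None]        # district fixed effects
gamma = rng.normal(0, 1, n_years)[None, :]            # year fixed effects
treat = rng.uniform(0, 100, (n_districts, n_years))   # e.g., FLE percentile x post
y = alpha + gamma + true_effect * treat + rng.normal(0, 0.1, (n_districts, n_years))

def within_2way(x):
    """Two-way within transformation for a balanced districts-by-years panel."""
    return x - x.mean(axis=1, keepdims=True) - x.mean(axis=0, keepdims=True) + x.mean()

yt, tt = within_2way(y), within_2way(treat)
beta = (tt * yt).sum() / (tt ** 2).sum()  # OLS slope after double demeaning
print(beta)  # close to 0: the outcome does not respond to treatment
```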
Table VI shows that there is no strong evidence of selective migration to treated districts. None of the point
estimates for percent minority are statistically significant, nor are they large in magnitude. There are some
cases in which we obtain significance in terms of children in poverty (SAIPE) and the percentages of special
education students, but these point estimates are positive and not consistently significant across the treatment
years. If anything, the evidence from these models suggests our point estimates on the effect of reform are
downwardly biased, as the demographic changes indicate an increase, relative to the change across poverty
quantile in non-treated states, in the population of students that have been historically less likely to graduate.
In addition to considering selective migration, we also examine the robustness of our revenues dependent
variable. Prior research suggests that nonlinear transformations of the dependent variable (e.g., taking the
natural logarithm of the dependent variable) might produce treatment effect estimates that are substantially
different from the original variable (Lee, 2011; Solon, 2015). While transformations of the dependent variable
do affect the interpretation of the marginal effect, we should see similar patterns in terms of significance and
sign. In Table VI, we find that weighted results are marginally significant in treatment years 6 and 7. We
also find that the unweighted estimates are all statistically significant at the 5 percent level. This is the same
pattern of significance that appears in Table IV, the table of main revenues results. Moreover, all point
estimates are positive across both tables. Appendix Table XV, which shows estimates for all FLE quartiles,
is qualitatively similar in terms of significance and sign to the results in Table IV as well. Overall, we feel
confident that our logarithmic transformation does not jeopardize the validity of our revenues results.
20 Please see Tables XI, XII, XIII, and XIV in Appendix XI.
[Insert Table VI Here]
VI. Conclusion
In this paper, we make both substantive and methodological contributions. Substantively, we demonstrate
that states undergoing Court-ordered finance reform in the period 1990-2010 experienced a sizable fiscal
shock that primarily affected high poverty districts in the state. This fiscal shock led to a subsequent change
in graduation rates in those states that likewise primarily benefited high poverty districts. The estimation of
these effects is largely robust across a variety of model specifications that vary in how they account for cross-
sectional dependence, pre-treatment trends and a heterogeneous secular trend. These effect sizes are, in turn,
robust to changes in demographic composition, as we observe population composition variables that are fairly
stable after reform. Moreover, changes after Court order have been equalizing, as states have shifted greater
resources, resulting in larger improvements in graduation rates, to higher poverty districts. At the start of
Court order, mean graduation rates for high poverty (Q4) districts were nearly 20 percentage points lower
than mean graduation rates for low poverty (Q1) districts. Our results indicate that Court order narrowed that
gap by 4.9 to 6.4 percentage points.
Methodologically, we subject the differences-in-differences estimator to a wide range of specification
checks, including cross-sectional dependence, correlated random trends, and secular trends. We efficiently
present upper and lower bounds on point estimates for a range of model choices. Assuming a homogenous
counterfactual trend will overstate effect sizes, in some cases, whereas ignoring cross-sectional dependence
does not meaningfully bias results in this application. There is substantial bias when we do not model pre-
treatment trends at the state-by-poverty quartile level. Without accounting for these trends, estimated effects
of reform on graduation rates are statistically insignificant, with coefficients precisely estimated near zero.
The provocative results from Jackson and colleagues (2015) should not be undervalued. They find that
spending shocks resulting from Court order had major effects on a variety of student outcomes, including
adult earnings. The question of whether school spending—a public investment of $700 billion per
year—can be causally linked to desirable outcomes has been a foundational question in public economics
for the past 50 years. We have argued here that it is necessary to replicate these results using other data sets
with better attributes or that have non-overlapping problems. Moreover, given the richness of modern panel
data sets, researchers are presented with a variety of plausibly equivalent modeling choices. In the absence
of strong priors regarding model specification, the challenge for applied microeconomists is to efficiently
convey the upper and lower bounds of estimates resulting from model choice. Here, we have shown that the
effects of Court order are consistent and robust on a data set that contains the universe of school districts
in the United States. Moreover, while results are sensitive to model choice, point estimates for graduation
rates diverge from those of the preferred model by no more than 2 percentage points in most cases.
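Mechanically, presenting such bounds reduces to taking the minimum and maximum of the point estimates over the specification grid. A schematic example with invented estimates (none of these numbers are from our tables):

```python
# Hypothetical point estimates for one treatment year across a grid of model
# choices (secular trend, correlated random trend, number of factors).
# All values below are illustrative placeholders, not results from this paper.
estimates = {
    ("FLE-by-year", "state-by-FLE", 0): 0.068,
    ("FLE-by-year", "state-by-FLE", 1): 0.050,
    ("FLE-by-year", "state", 0): 0.055,
    ("year", "none", 0): 0.015,
}
preferred = estimates[("FLE-by-year", "state-by-FLE", 1)]

lo, hi = min(estimates.values()), max(estimates.values())
print(lo, hi)  # bounds on the effect implied by model choice
print(max(abs(v - preferred) for v in estimates.values()))  # worst-case divergence
```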
VII. Figures
Figure I: Mean Log Per Pupil Revenues & Graduation Rates, Centered around Timing of Reform
[Two-panel plot. Left panel: mean log revenues (y-axis, 8.8 to 9.4) against years relative to reform (x-axis, -10 to 10). Right panel: mean graduation rates (y-axis, .65 to .9) over the same window. Separate series shown for FLE quartiles 1 through 4.]
Population weighted mean log revenues and graduation rates for states undergoing court-ordered finance reform, by FLE quartile. Averages are for years 1990-2010, centered around first year of court order. Note that common trends assumptions are not reflected in this figure, as FLE quartile 4 graduation rates are not estimated relative to FLE quartile 4 graduation rates in non-treated states. NYCPS excluded.
Figure II: Change in Log Per Pupil Revenues & Graduation Rates after Court Ruling, FLE Quartile 4, with and without NYCPS
[Two-panel plot. Left panel: log per pupil revenues; right panel: graduation rates. X-axis: treatment years 1 through 7; y-axis: 0 to .15. Four series per panel: Weighted/Unweighted, with NYC Included/NYC Dropped.]
Notes: Differences-in-differences with treatment effects estimated non-parametrically after reform. Model accounts for district fixed effects (θd), FLE-by-year fixed effects (δtq), state-by-FLE linear time trends (ψsqt), and a state-level factor (λ′sFt). Left panel corresponds to results for log revenues; right panel to results for graduation rates. Black lines are for models that include NYCPS; gray lines are for models that exclude NYCPS. Solid lines are for models that include enrollment as analytic weight; dashed lines are for unweighted models. Unweighted models completely overlap (black and gray dashed lines are not distinguishable). When NYCPS is removed, point estimates for weighted models are closer to unweighted models.
Figure III: Model Sensitivity: Changes in Estimates for Log Per Pupil Revenues across Models
[Two panels (Weighted, Unweighted). X-axis: distance in point estimate from preferred model (-.04 to .1); y-axis: treatment years 1 through 7. Legend entries below identify each comparison model.]
ST: FLE by year; CRT: None; CSD: 0 [1]
ST: FLE by year; CRT: State; CSD: 0 [7]
ST: FLE by year; CRT: State by FLE; CSD: 0 [2]
ST: FLE by year; CRT: State by FLE; CSD: 1 [3]
ST: FLE by year; CRT: State by FLE; CSD: 2 [4]
ST: FLE by year; CRT: State by FLE; CSD: 3 [5]
ST: FLE by year; CRT: District; CSD: 0 [9]
ST: year; CRT: None; CSD: 0 [10]
ST: year; CRT: State; CSD: 0 [13]
ST: year; CRT: State by FLE; CSD: 0 [11]
ST: year; CRT: District; CSD: 0 [15]
Notes: Point estimate in treatment year t is shown as the difference between preferred model and model m, where model m variables are indicated in the legend. Point estimates along the x axis greater than zero indicate our preferred model underestimates the effect; less than zero indicates our preferred model overstates the effect. Legend shows three parameter changes, delimited by ";". ST denotes the type of nonparametric secular trend under consideration: (a) FLE quartile by year fixed effects or (b) year fixed effects. CRT denotes the type of correlated random trends under consideration: (a) none, corresponding to no CRT; (b) state by FLE quartile fixed effects interacted with linear time; (c) state fixed effects interacted with linear time; (d) district fixed effects interacted with linear time. CSD denotes type of cross-sectional dependence adjustment: (a) 0, which implies no factor structure; (b) 1, which is a 1 factor model; (c) 2, which is a 2 factor model; or (d) 3, which is a 3 factor model. Bracketed numbers indicate the column location of point estimates for model m in Tables VII and VIII.
Figure IV: Model Sensitivity: Changes in Estimates for Graduation Rates across Models
[Two panels (Weighted, Unweighted). X-axis: distance in point estimate from preferred model (-.04 to .06); y-axis: treatment years 1 through 7. Legend entries below identify each comparison model.]
ST: FLE by year; CRT: None; CSD: 0 [1]
ST: FLE by year; CRT: State; CSD: 0 [7]
ST: FLE by year; CRT: State by FLE; CSD: 0 [2]
ST: FLE by year; CRT: State by FLE; CSD: 1 [3]
ST: FLE by year; CRT: State by FLE; CSD: 2 [4]
ST: FLE by year; CRT: State by FLE; CSD: 3 [5]
ST: FLE by year; CRT: District; CSD: 0 [9]
ST: year; CRT: None; CSD: 0 [10]
ST: year; CRT: State; CSD: 0 [13]
ST: year; CRT: State by FLE; CSD: 0 [11]
ST: year; CRT: District; CSD: 0 [15]
Notes: Point estimate in treatment year t is shown as the difference between preferred model and model m, where model m variables are indicated in the legend. Point estimates along the x axis greater than zero indicate our preferred model underestimates the effect; less than zero indicates our preferred model overstates the effect. Legend shows three parameter changes, delimited by ";". ST denotes the type of nonparametric secular trend under consideration: (a) FLE quartile by year fixed effects or (b) year fixed effects. CRT denotes the type of correlated random trends under consideration: (a) none, corresponding to no CRT; (b) state by FLE quartile fixed effects interacted with linear time; (c) state fixed effects interacted with linear time; (d) district fixed effects interacted with linear time. CSD denotes type of cross-sectional dependence adjustment: (a) 0, which implies no factor structure; (b) 1, which is a 1 factor model; (c) 2, which is a 2 factor model; or (d) 3, which is a 3 factor model. Bracketed numbers indicate the column location of point estimates for model m in Tables IX and X.
VIII. Tables
Table I: Adequacy Era Court-Ordered Finance Reform Years
State            Year of Overturn   Name of 1st Case
Alaska           2009               Moore v. State
Kansas           2005               Montoy v. State (Montoy II)
Kentucky         1989               Rose v. Council for Better Education
Massachusetts    1993               McDuffy v. Secretary of Executive Office of Education
Montana          2005               Columbia Falls Elementary School District No. 6 v. Montana
North Carolina   2004               Hoke County Board of Education v. North Carolina
New Hampshire    1997               Claremont School District v. Governor
New Jersey       1997               Abbott v. Burke (Abbott IV)
New York         2003               Campaign for Fiscal Equity, Inc. v. New York
Ohio             1997               DeRolph v. Ohio (DeRolph I)
Tennessee        1995               Tennessee Small School Systems v. McWherter (II)
Wyoming          1995               Campbell County School District v. Wyoming (Campbell II)
Notes: The table shows the first year in which a state's education finance system was overturned on adequacy grounds; we also provide the name of the case. The primary source of data for this table is Corcoran and Evans (2008). We have updated their table with information provided by ACCESS, Education Finance Litigation: http://schoolfunding.info/.
Table II: Descriptive Statistics for Outcome Variables
                           Weighted             Unweighted
                           Mean      SD         Mean      SD
Graduation Rates           .77       [.15]      .82       [.14]
Log Revenues               8.94      [.26]      8.97      [.29]
Total Revenues             7950.2    [2352.12]  8224.31   [2782.13]
Percent Minority           .32       [.3]       .16       [.23]
Percent Black              .17       [.22]      .08       [.17]
Percent Hispanic           .15       [.22]      .08       [.16]
Percent Special Education  .12       [.05]      .13       [.05]
Percent Child Poverty      .16       [.1]       .16       [.09]
Log Enrollment             9.43      [1.62]     7.4       [1.25]
Notes: This table provides means and standard deviations for the outcome variables used in this paper. Summary statistics shown here that are weighted use district enrollment as the weight.
Table III: States Exiting from Treatment Status
States                Treatment Year
N/A                   1
AK                    2
AK                    3
AK                    4
AK                    5
AK, KS                6
AK, KS, NC            7
AK, KS, NC, NY        8
AK, AR, KS, NC, NY    9
AK, AR, KS, NC, NY    10
AK, AR, KS, NC, NY    11
AK, AR, KS, NC, NY    12
AK, AR, KS, NC, NY    13
AK, AR, KS, NC, NY    14
Notes: This table lists states that no longer contribute treatment information in year t of the dynamic treatment response period. This occurs for states that were treated relatively late in the panel of data that is available. For example, our data set ends in 2009, and Table I shows that Kansas had their system overturned in 2005. Thus, Kansas contributes to the identification of the treatment response in treatment years 1 through 5, corresponding to 2005 to 2009, inclusive. It does not contribute to treatment years 6 through 14.
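The accounting in these notes can be sketched as a small helper: given a reform year and the last year of the panel, it returns the treatment years for which a state contributes identifying variation. The function name and the year-indexing convention (treatment year 1 is the reform year itself) are our own, chosen to match the Kansas example.

```python
def identified_treatment_years(reform_year, panel_end=2009, max_year=14):
    """Treatment years t for which a state contributes identifying variation;
    treatment year t corresponds to calendar year reform_year + t - 1."""
    return [t for t in range(1, max_year + 1) if reform_year + t - 1 <= panel_end]

print(identified_treatment_years(2005))  # Kansas: [1, 2, 3, 4, 5]
print(identified_treatment_years(2009))  # Alaska: [1]
```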
Table IV: Change in Log Per Pupil Revenues, by Free Lunch Eligible Quartile and Quantile
                Weighted                                             Unweighted
Treatment Year  FLE 1     FLE 2     FLE 3     FLE 4     FLE Cont     FLE 1     FLE 2     FLE 3     FLE 4     FLE Cont
1               .029 **   -.004     .019      .015      .00001       .018      .008      .008      .004      .00054 *
                [.011]    [.034]    [.017]    [.012]    [.00019]     [.011]    [.009]    [.01]     [.011]    [.00023]
2               .054 **   .032      .03       .046 *    .00029       .032 +    .027      .014      .018      .0007 *
                [.017]    [.027]    [.021]    [.022]    [.00025]     [.017]    [.017]    [.016]    [.02]     [.00028]
3               .042 *    .013      .031      .044      .00014       .029 +    .028 +    .022      .041      .00097 **
                [.019]    [.041]    [.026]    [.028]    [.00032]     [.015]    [.015]    [.019]    [.029]    [.0003]
4               .058 *    .034      .059 *    .06 +     .00036       .046 **   .045 **   .045 +    .056      .00146 ***
                [.023]    [.046]    [.03]     [.032]    [.00035]     [.017]    [.017]    [.026]    [.034]    [.0003]
5               .081 **   .057      .063 +    .059      .00033       .036      .044      .035      .034      .00137 **
                [.025]    [.042]    [.035]    [.04]     [.0004]      [.026]    [.026]    [.036]    [.037]    [.00043]
6               .119 ***  .106 **   .089 *    .091 *    .0007 *      .049 +    .053 *    .041      .033      .0016 **
                [.035]    [.04]     [.038]    [.044]    [.00035]     [.027]    [.026]    [.031]    [.032]    [.00054]
7               .127 **   .116 *    .108 *    .119 *    .00092 *     .054      .058 +    .041      .038      .00183 **
                [.041]    [.057]    [.048]    [.055]    [.00046]     [.033]    [.032]    [.038]    [.043]    [.00056]
r-squared .999 .999 .999 .999 .999 .894 .894 .894 .894 .887
Notes: This table shows point estimates and standard errors for non-parametric heterogeneous differences-in-differences estimator. Model accounts for district fixed effects (θd), FLE-by-year fixed effects (δtq), state-by-FLE linear time trends (ψsqt), and a state-level factor (λ′sFt). FLE quartiles are indexed by FLE ∈ {1, 2, 3, 4}. Column "FLE Cont" corresponds to models in which FLE-by-year fixed effects are substituted for year fixed effects and the additional control variable year-by-FLE percentile (δtQ) is included. Point estimates for continuous model are interpreted as change in revenues for 1-unit change in poverty percentile rank within a state, relative to change in percentile rank in states without Court order. All standard errors are clustered at the state level. (Significance indicated + < .10, * < .05, ** < .01, *** < .001)
Table V: Change in Graduation Rates, by Free Lunch Eligible Quartile and Quantile
                Weighted                                             Unweighted
Treatment Year  FLE 1     FLE 2     FLE 3     FLE 4     FLE Cont     FLE 1     FLE 2     FLE 3     FLE 4     FLE Cont
1               .003      .015 ***  .008      .013 *    .00018 **    .006      .013 ***  .014 **   .012 *    .00016 **
                [.006]    [.004]    [.008]    [.006]    [.00006]     [.004]    [.003]    [.005]    [.006]    [.00005]
2               .005      .002      .001      .03 **    .00025 **    .005      .013 +    .009      .019 **   .00018 *
                [.009]    [.006]    [.009]    [.009]    [.00008]     [.005]    [.007]    [.006]    [.007]    [.00007]
3               -.003     .003      .001      .061 **   .00051 **    .003      .016 *    .008      .024 **   .00023 *
                [.008]    [.01]     [.007]    [.019]    [.00017]     [.005]    [.007]    [.01]     [.009]    [.00009]
4               .013      .019 *    .021 *    .061 ***  .00065 ***   .015 *    .032 **   .025 *    .034 **   .00043 **
                [.01]     [.008]    [.01]     [.013]    [.00011]     [.008]    [.011]    [.01]     [.013]    [.00014]
5               .013      .017 *    .02 +     .07 ***   .00072 ***   .012      .031 **   .027 *    .041 **   .00049 ***
                [.014]    [.009]    [.011]    [.016]    [.00013]     [.009]    [.01]     [.012]    [.015]    [.00015]
6               .012      .019      .024 *    .085 ***  .00088 ***   .013 +    .033 **   .032 *    .05 **    .00059 ***
                [.013]    [.012]    [.011]    [.019]    [.00017]     [.007]    [.013]    [.016]    [.018]    [.00017]
7               -.009     .01       .016      .084 ***  .00085 ***   .008      .03       .021      .047 *    .00052 *
                [.015]    [.016]    [.015]    [.023]    [.00021]     [.01]     [.018]    [.021]    [.023]    [.00021]
r-squared .999 .999 .999 .999 .999 .573 .573 .573 .573 .573
Notes: This table shows point estimates and standard errors for non-parametric heterogeneous differences-in-differences estimator. Model accounts for district fixed effects (θd), FLE-by-year fixed effects (δtq), state-by-FLE linear time trends (ψsqt), and a state-level factor (λ′sFt). FLE quartiles are indexed by FLE ∈ {1, 2, 3, 4}. Column "FLE Cont" corresponds to models in which FLE-by-year fixed effects are substituted for year fixed effects and the additional control variable year-by-FLE percentile (δtQ) is included. Point estimates for continuous model are interpreted as change in graduation rates for 1-unit change in poverty percentile rank within a state, relative to change in percentile rank in states without Court order. All standard errors are clustered at the state level. (Significance indicated + < .10, * < .05, ** < .01, *** < .001)
Table VI: Robustness Check: Preferred Model Specification - Change in Demographic Characteristics Resulting from Court Order
                Log Enroll   Percent Minority        Percent SAIPE           Percent Sped            Total Revenues
Treatment Year  Unweighted   Weighted    Unweighted  Weighted    Unweighted  Weighted    Unweighted  Weighted    Unweighted
1               -.00012      .00001      0           .00022 **   .00012 *    .0004       .00038      .18717      6.27757 *
                [.00007]     [.00005]    [.00002]    [.00008]    [.00006]    [.00027]    [.00023]    [2.00267]   [3.00265]
2               0            .00005      .00001      .00006      -.00003     .00046      .00044 +    3.00714     7.1655 *
                [.00007]     [.00006]    [.00003]    [.00011]    [.0001]     [.00029]    [.00025]    [2.5773]    [2.93854]
3               .00008       .00002      .00001      .00009      -.00002     .00031      .00026      1.75417     9.93939 **
                [.00011]     [.00007]    [.00004]    [.00006]    [.00005]    [.00033]    [.00029]    [3.41087]   [3.47845]
4               .00013       .00003      .00002      .00016 *    .00005      .00039      .00027      4.32547     15.0462 ***
                [.00013]     [.00009]    [.00004]    [.00007]    [.00008]    [.00023]    [.00017]    [4.26495]   [3.92201]
5               .00014       .00009      .00006      .00007      .00003      .00041 +    .00028      4.2002      14.5479 **
                [.00017]     [.00008]    [.00005]    [.00008]    [.00005]    [.00023]    [.00017]    [5.00702]   [5.23225]
6               .00014       .0001       .00005      .00012      .00002      .00049 +    .00033      8.91194 +   18.25232 *
                [.00019]     [.00008]    [.00005]    [.00008]    [.00007]    [.00028]    [.00021]    [4.75711]   [7.54563]
7               .0001        .00008      .00004      .00008      .00004      .00046      .00032      11.43805 +  20.3912 **
                [.00021]     [.0001]     [.00006]    [.00014]    [.00012]    [.0003]     [.00022]    [6.48416]   [7.19182]
r2 .992 .999 .971 .999 .914 .999 .6 .999 .864
Notes: This table shows point estimates and standard errors for non-parametric heterogeneous differences-in-differences estimator. Model accounts for district fixed effects (θd), year fixed effects (δt), state-by-FLE linear time trends (ψsqt), a state-level factor (λ′sFt), and a FLE percentile-by-year fixed effect (δtQ). Point estimates are interpreted as change in dependent variable per 1-unit change in poverty percentile rank within a state, relative to change in percentile rank in states without Court order. All standard errors are clustered at the state level. (Significance indicated + < .10, * < .05, ** < .01, *** < .001)
Appendices
IX. Data Appendix
Our data set is the compilation of several public-use surveys that are administered by the National Center
for Education Statistics and the U.S. Census Bureau. We construct our analytic sample using the following
data sets: Local Education Agency (School District) Finance Survey (F-33); Local Education Agency (School
District) Universe Survey; Local Education Agency Universe Survey Longitudinal Data File: 1986-1998 (13-
year); Local Education Agency (School District) Universe Survey Dropout and Completion Data; and Public
Elementary/Secondary School Universe Survey. Web links to each of the data sources are listed below:
Note: All links last accessed June 2015.
Local Education Agency Finance Survey (F-33)
Source: https://nces.ed.gov/ccd/f33agency.asp
Local Education Agency Universe Survey
Source: https://nces.ed.gov/ccd/pubagency.asp
Local Education Agency Universe Survey Longitudinal Data File (13-year)
Source: https://nces.ed.gov/ccd/CCD13YR.ASP
Local Education Agency Universe Survey Dropout and Completion Data
Source: https://nces.ed.gov/ccd/drpagency.asp
Public Elementary/Secondary School Universe Survey Data
Source: https://nces.ed.gov/ccd/pubschuniv.asp
We construct total real revenues per student, our first outcome of interest, using the F-33 survey, where
total revenues is the sum of federal, local, and state revenues in each district.21 We divide this value by the
total number of students in the district and deflate by the US CPI, All Urban Consumers Index to convert
the figure to year 2000 dollars. Because of large outliers in the F-33 survey, we replace with missing values
observations with real total revenues per student that are either 1.5 times larger than the 95th percentile or 0.5
times smaller than the 5th percentile of per-student total revenues within a state. We do this to prevent large
outliers from driving our results.22
21 For years 1990-91, 1992-93, and 1993-94, we obtained district-level data from Kforce Government Solutions, Inc. These data are public-use files.
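The construction just described can be sketched in a few lines of pandas. Column names, the deflator value, and the toy data are placeholders rather than F-33 variable names; the trimming rule follows our reading of the thresholds above (above 1.5 times the within-state 95th percentile, or below 0.5 times the within-state 5th percentile).

```python
import pandas as pd

# Sketch of the revenue construction described above: per-pupil revenues,
# deflated to year-2000 dollars, with within-state outliers set to missing.
# Column names, the deflator, and the data are invented for illustration.
df = pd.DataFrame({
    "state": "KS",
    "year": 2005,
    "total_rev": [700_000 + 10_000 * i for i in range(10)] + [20_000_000],
    "enrollment": 100,
})
cpi_to_2000 = {2005: 0.872}  # illustrative deflator: CPI(2000) / CPI(2005)

df["rev_pp"] = df["total_rev"] / df["enrollment"] * df["year"].map(cpi_to_2000)

# Replace with missing any value above 1.5x the within-state 95th percentile
# or below 0.5x the within-state 5th percentile of per-pupil revenues.
p95 = df.groupby("state")["rev_pp"].transform(lambda s: s.quantile(0.95))
p05 = df.groupby("state")["rev_pp"].transform(lambda s: s.quantile(0.05))
df.loc[(df["rev_pp"] > 1.5 * p95) | (df["rev_pp"] < 0.5 * p05), "rev_pp"] = float("nan")

print(int(df["rev_pp"].isna().sum()))  # 1: only the extreme district is dropped
```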
Our measure of graduation rates is a combination of data from the Local Education Agency (School Dis-
trict) Universe Survey, the Public Elementary/Secondary School Universe Survey, and the Local Education
Agency (School District) Universe Survey Dropout and Completion Data. From the school-level file, we ex-
tract the number of 8th graders in each school, and we aggregate these school-level data to obtain district-level
data in each year. From the Local Education Agency data files we construct a time series of total diploma
recipients. Data on total diploma recipients was not part of the Local Education Agency universe files as
of academic year 1997-98, so beginning with that year, we use the Dropout and Completion public-use
files to obtain the diploma data. We calculate the graduation rate as the total number of diploma recipients
in year t as a share of the number of 8th graders in year t − 4, a measure which Heckman (2010) shows is
not susceptible to the downward bias caused by using lagged 9th grade enrollment in the denominator. We
top-coded the graduation rates, so that they take a maximum value of 1.
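As a worked sketch, the cohort measure above amounts to a lagged merge plus a cap. The toy data and column names below are invented; the second cohort illustrates why top-coding is needed when diplomas exceed the lagged 8th-grade count.

```python
import pandas as pd

# Sketch of the graduation-rate measure: diplomas awarded in year t divided by
# the district's 8th-grade enrollment in year t-4, top-coded at 1.
diplomas = pd.DataFrame({"district": ["A", "A"], "year": [2000, 2001],
                         "diplomas": [95, 130]})
eighth = pd.DataFrame({"district": ["A", "A"], "year": [1996, 1997],
                       "grade8": [100, 120]})

eighth["year"] += 4  # align each 8th-grade cohort with its graduation year
rates = diplomas.merge(eighth, on=["district", "year"])
rates["grad_rate"] = (rates["diplomas"] / rates["grade8"]).clip(upper=1)

print(rates["grad_rate"].tolist())  # [0.95, 1.0]
```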
22 The outlier adjustment described above has been used in other studies; for example, see Murray, Evans and Schwab (1998). Outliers are usually found in districts with very small enrollment.
X. Additional Figures
Figure V: New York City Public Schools Potential Mediators & Outcomes
[Line plot. Y-axis: standardized value within NYCPS (-2 to 3); x-axis: year, 1990 to 2010. Series: Grad Rates, Revenues, Salaries, % Spec. Ed., Class Size, % Minority, % Poverty.]
NYCPS is the largest district in the United States, and it was also demonstrably improving in many ways during this period. Here, we plot standardized beta coefficients for various outcomes of interest. These are standardized to be mean zero with standard deviation one for NYCPS across the entire time period. Graduation rates and teacher salaries increased after 2003, the year New York had its first court ruling. Enrollment, percent poverty, class size and percent minority decreased during this period. All of these potential mechanisms for improving graduation rates are intended to be controlled for in a differences-in-differences framework through the inclusion of year dummies.
XI. Additional Tables
Table VII: Log Per Pupil Revenues Results, Model Sensitivity Specifications, Weighted Least Squares
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 .024 .034 .004 .015 .017 .001 .05 .013 .033 .03 .03 -.003 .056 .019 .03
[.012] [.014] [.011] [.013] [.012] [.012] [.018] [.012] [.015] [.012] [.014] [.013] [.018] [.01] [.014]
2 .043 .047 .018 .038 .043 .018 .068 .032 .047 .051 .044 .013 .077 .041 .044
[.017] [.018] [.02] [.021] [.021] [.022] [.024] [.02] [.018] [.017] [.017] [.022] [.024] [.019] [.018]
3 .071 .076 .041 .067 .074 .04 .101 .055 .075 .08 .072 .033 .11 .065 .071
[.024] [.019] [.029] [.028] [.024] [.03] [.023] [.026] [.02] [.024] [.018] [.032] [.022] [.026] [.019]
4 .098 .107 .056 .087 .092 .055 .135 .069 .105 .109 .101 .047 .145 .08 .101
[.024] [.018] [.034] [.033] [.031] [.037] [.025] [.033] [.019] [.025] [.018] [.04] [.025] [.034] [.018]
5 .083 .093 .034 .072 .082 .031 .125 .045 .091 .095 .087 .02 .138 .057 .086
[.029] [.026] [.037] [.036] [.034] [.041] [.034] [.037] [.027] [.03] [.026] [.044] [.034] [.037] [.027]
6 .102 .103 .033 .075 .09 .027 .144 .043 .101 .119 .1 .017 .161 .06 .099
[.022] [.038] [.032] [.034] [.036] [.041] [.044] [.036] [.04] [.021] [.036] [.043] [.042] [.037] [.037]
7 .123 .121 .038 .087 .118 .027 .168 .043 .118 .139 .115 .013 .184 .059 .114
[.019] [.039] [.043] [.045] [.05] [.058] [.045] [.054] [.04] [.018] [.036] [.06] [.043] [.055] [.038]
Weight
Yes
No X X X X X X X X X X X X X X X
Fixed Effect
FLE-Year X X X X X X X X X
Year X X X X X X
CRT
None X X
State-FLE X X X X X
State-FLE^2 X X
State X X
State^2 X X
District X X
Factor
0 X X X X X X X X X X X X
1 X
2 X
3 X
Treatment Year Log Per Pupil Revenues, Unweighted
Preferred model highlighted. Point estimates and standard errors, indicated by brackets, are shown for various model specifications. Model choice is indicated in bottom of panel. We permute parameters using weights (yes/no), fixed effects (FLE-by-year/year), correlated random trends (none/state-by-FLE/state-by-FLE squared/state/state squared/district), and factors (0/1/2/3). Not all combinations are available.
Table VIII: Log Per Pupil Revenues Results, Model Sensitivity Specifications, OLS
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 .028 .026 .015 .019 .02 .01 .048 .02 .027 .032 .025 .005 .051 .021 .026
[.01] [.013] [.012] [.016] [.016] [.009] [.017] [.011] [.014] [.009] [.013] [.01] [.018] [.01] [.014]
2 .056 .05 .046 .049 .052 .034 .08 .05 .051 .061 .05 .029 .083 .053 .051
[.013] [.019] [.022] [.024] [.023] [.02] [.024] [.019] [.02] [.013] [.022] [.02] [.025] [.019] [.023]
3 .055 .049 .044 .049 .047 .029 .083 .046 .05 .066 .054 .027 .092 .055 .055
[.017] [.022] [.028] [.033] [.031] [.026] [.026] [.025] [.024] [.016] [.022] [.027] [.026] [.024] [.023]
4 .079 .073 .06 .069 .063 .045 .11 .061 .075 .092 .078 .043 .121 .071 .08
[.017] [.024] [.032] [.04] [.039] [.03] [.028] [.031] [.025] [.015] [.021] [.032] [.027] [.029] [.022]
5 .074 .067 .059 .071 .059 .031 .11 .052 .07 .09 .076 .03 .124 .066 .078
[.021] [.03] [.04] [.052] [.047] [.039] [.035] [.038] [.032] [.021] [.029] [.041] [.035] [.037] [.031]
6 .102 .092 .091 .105 .076 .047 .142 .071 .095 .121 .104 .048 .158 .089 .107
[.018] [.037] [.044] [.061] [.054] [.045] [.046] [.045] [.039] [.02] [.035] [.046] [.046] [.044] [.037]
7 .129 .115 .119 .128 .103 .059 .174 .09 .119 .154 .133 .064 .195 .114 .137
[.015] [.036] [.055] [.084] [.076] [.065] [.047] [.063] [.038] [.016] [.034] [.067] [.047] [.063] [.036]
Weight
Yes X X X X X X X X X X X X X X X
No
Fixed Effect
FLE-Year X X X X X X X X X
Year X X X X X X
CRT
None X X
State-FLE X X X X X
State-FLE^2 X X
State X X
State^2 X X
District X X
Factor
0 X X X X X X X X X X X X
1 X
2 X
3 X
Treatment Year Log Per Pupil Revenues, Weighted
Preferred model highlighted. Point estimates and standard errors, indicated by brackets, are shown for various model specifications. Model choice is indicated in bottom of panel. We permute parameters using weights (yes/no), fixed effects (FLE-by-year/year), correlated random trends (none/state-by-FLE/state-by-FLE squared/state/state squared/district), and factors (0/1/2/3). Not all combinations are available.
Table IX: Graduation Rates Results, Model Sensitivity Specifications, Weighted Least Squares
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 -0.001 0.016 0.012 0.014 0.008 0.021 0.012 0.016 0.016 -0.004 0.019 0.022 0.009 0.012 0.019
[.007] [.008] [.006] [.007] [.006] [.01] [.006] [.007] [.009] [.007] [.009] [.011] [.006] [.007] [.009]
2 -0.004 0.022 0.019 0.021 0.01 0.029 0.015 0.021 0.021 -0.009 0.023 0.029 0.01 0.015 0.023
[.011] [.01] [.007] [.008] [.01] [.014] [.007] [.009] [.01] [.012] [.009] [.014] [.006] [.009] [.009]
3 0.003 0.033 0.024 0.026 0.011 0.041 0.025 0.029 0.033 -0.004 0.034 0.04 0.018 0.023 0.034
[.012] [.011] [.009] [.009] [.01] [.018] [.008] [.012] [.011] [.012] [.01] [.019] [.007] [.012] [.01]
4 0.007 0.042 0.034 0.035 0.012 0.053 0.033 0.042 0.042 0 0.044 0.053 0.026 0.035 0.044
[.012] [.017] [.013] [.014] [.016] [.028] [.013] [.018] [.018] [.012] [.017] [.029] [.012] [.019] [.017]
5 0.016 0.055 0.041 0.043 0.017 0.065 0.045 0.053 0.055 0.008 0.058 0.066 0.038 0.046 0.057
[.013] [.019] [.015] [.015] [.017] [.034] [.013] [.022] [.02] [.012] [.018] [.035] [.013] [.022] [.019]
6 0.022 0.069 0.05 0.051 0.016 0.076 0.056 0.063 0.068 0.016 0.075 0.08 0.051 0.058 0.074
[.013] [.022] [.018] [.017] [.016] [.045] [.016] [.03] [.023] [.013] [.021] [.045] [.016] [.03] [.022]
7 0.015 0.068 0.047 0.046 0.008 0.075 0.055 0.062 0.068 0.01 0.076 0.08 0.05 0.057 0.075
[.013] [.027] [.023] [.021] [.018] [.056] [.019] [.037] [.028] [.013] [.026] [.056] [.02] [.037] [.027]
Weight
Yes
No X X X X X X X X X X X X X X X
Fixed Effect
FLE-Year X X X X X X X X X
Year X X X X X X
CRT
None X X
State-FLE X X X X X
State-FLE^2 X X
State X X
State^2 X X
District X X
Factor
0 X X X X X X X X X X X X
1 X
2 X
3 X
Treatment Year Graduation Rates, Weighted
Preferred model highlighted. Point estimates and standard errors, indicated by brackets, are shown for various model specifications. Model choice is indicated in bottom of panel. We permute parameters using weights (yes/no), fixed effects (FLE-by-year/year), correlated random trends (none/state-by-FLE/state-by-FLE squared/state/state squared/district), and factors (0/1/2/3). Not all combinations are available.
Table X: Graduation Rates Results, Model Sensitivity Specifications, OLS
Specification: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Year 1: -0.013 0.02 0.013 0.01 0.013 0.018 0.002 0.001 0.019 -0.01 0.02 0.019 0.005 0.004 0.018
SE:     [.009] [.009] [.006] [.007] [.009] [.012] [.007] [.008] [.009] [.008] [.006] [.01] [.005] [.005] [.006]
Year 2: -0.01 0.037 0.03 0.026 0.029 0.03 0.013 0.012 0.036 -0.008 0.033 0.031 0.014 0.012 0.033
SE:     [.01] [.012] [.009] [.01] [.012] [.019] [.007] [.012] [.012] [.011] [.009] [.016] [.006] [.01] [.009]
Year 3: 0.021 0.077 0.061 0.056 0.052 0.063 0.048 0.045 0.076 0.021 0.071 0.063 0.048 0.044 0.07
SE:     [.02] [.017] [.019] [.021] [.015] [.021] [.014] [.017] [.017] [.022] [.017] [.018] [.016] [.018] [.017]
Year 4: 0.009 0.074 0.061 0.055 0.048 0.058 0.042 0.042 0.074 0.008 0.066 0.057 0.04 0.039 0.065
SE:     [.01] [.018] [.013] [.013] [.018] [.035] [.01] [.023] [.019] [.011] [.014] [.031] [.008] [.02] [.014]
Year 5: 0.017 0.091 0.07 0.065 0.054 0.063 0.056 0.052 0.09 0.017 0.083 0.066 0.055 0.051 0.082
SE:     [.014] [.017] [.016] [.017] [.02] [.04] [.01] [.025] [.018] [.014] [.013] [.035] [.01] [.022] [.014]
Year 6: 0.024 0.11 0.085 0.079 0.068 0.071 0.07 0.065 0.108 0.027 0.103 0.078 0.072 0.066 0.101
SE:     [.02] [.019] [.019] [.02] [.024] [.049] [.013] [.029] [.019] [.021] [.015] [.043] [.013] [.028] [.015]
Year 7: 0.018 0.115 0.084 0.077 0.062 0.062 0.072 0.064 0.113 0.023 0.107 0.073 0.075 0.066 0.106
SE:     [.024] [.02] [.023] [.025] [.026] [.057] [.015] [.03] [.021] [.023] [.018] [.05] [.016] [.032] [.018]
Model choice:
Weight — Yes: all 15 specifications; No: none
Fixed effect — FLE-by-Year: 9 of 15 specifications; Year: the remaining 6
CRT — None: 2 specifications; State-FLE: 5; State-FLE^2: 2; State: 2; State^2: 2; District: 2
Factor — 0 factors: 12 specifications; 1, 2, and 3 factors: 1 specification each
Treatment Year Graduation Rates, Unweighted
Preferred model highlighted. Point estimates and standard errors (in brackets) are shown for various model specifications. Model choice is indicated at the bottom of the panel. We permute parameters using weights (yes/no), fixed effects (FLE-by-year/year), correlated random trends (none/state-by-FLE/state-by-FLE squared/state/state squared/district), and factors (0/1/2/3). Not all combinations are available.
Table XI: Robustness: Log Enrollment, Preferred Model
Weighted / Unweighted; columns: FLE 1, FLE 2, FLE 3, FLE 4, FLE Cont. Standard errors in brackets.
Year 1: -.012 ** [.004]   -.01 [.008]   -.015 + [.009]   -.008 [.005]   -.00012 [.00007]
Year 2: -.011 * [.006]   -.01 [.009]   -.011 [.007]   .001 [.005]   0 [.00007]
Year 3: -.016 + [.008]   -.007 [.016]   -.012 [.013]   .006 [.014]   .00008 [.00011]
Year 4: -.021 [.013]   -.012 [.02]   -.013 [.018]   .009 [.017]   .00013 [.00013]
Year 5: -.025 [.017]   -.015 [.025]   -.017 [.022]   .008 [.02]   .00014 [.00017]
Year 6: -.034 [.021]   -.024 [.028]   -.025 [.025]   .01 [.022]   .00014 [.00019]
Year 7: -.044 + [.024]   -.033 [.032]   -.035 [.028]   .005 [.026]   .0001 [.00021]
r-squared: .992 .992 .992 .992 .992
Notes: This table shows point estimates and standard errors for the non-parametric heterogeneous differences-in-differences estimator. The model accounts for district fixed effects (θ_d), FLE-by-year fixed effects (δ_tq, or δ_t for continuous models), state-by-FLE linear time trends (ψ_sqt), a state-level factor (λ′_s F_t), and, in the case of continuous models, an FLE percentile-by-year fixed effect (δ_tQ). Point estimates are interpreted as the change in the dependent variable per one-unit change in poverty percentile rank within a state, relative to the change in percentile rank in states without a Court order. All standard errors are clustered at the state level. (Significance indicated: + < .10, * < .05, ** < .01, *** < .001)
Table XII: Robustness: Percent Minority, Preferred Model
Columns: FLE 1, FLE 2, FLE 3, FLE 4, FLE Cont. Standard errors in brackets.
Year 1, weighted:   -.004 + [.002]   .003 [.002]   .003 [.003]   -.004 [.005]   .00001 [.00005]
Year 1, unweighted: -.002 * [.001]   .001 [.001]   0 [.001]   -.001 [.001]   0 [.00002]
Year 2, weighted:   -.009 * [.005]   -.001 [.002]   .002 [.003]   .004 [.004]   .00005 [.00006]
Year 2, unweighted: -.002 + [.001]   -.001 [.001]   0 [.001]   .002 [.002]   .00001 [.00003]
Year 3, weighted:   -.013 [.009]   -.004 [.005]   .002 [.005]   .001 [.003]   .00002 [.00007]
Year 3, unweighted: -.002 [.002]   -.001 [.002]   0 [.002]   .002 [.002]   .00001 [.00004]
Year 4, weighted:   -.011 [.01]   -.004 [.006]   .006 [.006]   0 [.004]   .00003 [.00009]
Year 4, unweighted: -.002 [.003]   -.001 [.003]   .002 [.003]   .003 [.002]   .00002 [.00004]
Year 5, weighted:   -.005 [.005]   .004 [.003]   .01 * [.004]   0 [.004]   .00009 [.00008]
Year 5, unweighted: 0 [.002]   .002 [.002]   .005 + [.003]   .005 * [.002]   .00006 [.00005]
Year 6, weighted:   -.008 [.006]   .004 [.004]   .013 * [.005]   0 [.005]   .0001 [.00008]
Year 6, unweighted: -.001 [.002]   .001 [.003]   .004 [.003]   .005 [.003]   .00005 [.00005]
Year 7, weighted:   -.012 [.008]   .003 [.005]   .014 * [.007]   -.004 [.005]   .00008 [.0001]
Year 7, unweighted: -.003 [.003]   .001 [.003]   .004 [.004]   .004 [.003]   .00004 [.00006]
r-squared: weighted 1 1 1 1 1; unweighted .971 .971 .971 .971 .971
Notes: This table shows point estimates and standard errors for the non-parametric heterogeneous differences-in-differences estimator. The model accounts for district fixed effects (θ_d), FLE-by-year fixed effects (δ_tq, or δ_t for continuous models), state-by-FLE linear time trends (ψ_sqt), a state-level factor (λ′_s F_t), and, in the case of continuous models, an FLE percentile-by-year fixed effect (δ_tQ). Point estimates are interpreted as the change in the dependent variable per one-unit change in poverty percentile rank within a state, relative to the change in percentile rank in states without a Court order. All standard errors are clustered at the state level. (Significance indicated: + < .10, * < .05, ** < .01, *** < .001)
Table XIII: Robustness: Percent Child Poverty (SAIPE), Preferred Model
Columns: FLE 1, FLE 2, FLE 3, FLE 4, FLE Cont. Standard errors in brackets.
Year 1, weighted:   .012 *** [.003]   .015 ** [.005]   .016 *** [.004]   .02 * [.009]   .00022 ** [.00008]
Year 1, unweighted: .008 * [.004]   .01 *** [.003]   .015 *** [.003]   .009 [.007]   .00012 * [.00006]
Year 2, weighted:   .005 [.004]   .002 [.004]   -.003 [.009]   .004 [.011]   .00006 [.00011]
Year 2, unweighted: .004 [.005]   .003 [.005]   .004 [.006]   -.004 [.01]   -.00003 [.0001]
Year 3, weighted:   .009 + [.005]   .008 * [.004]   .012 * [.005]   .011 * [.005]   .00009 [.00006]
Year 3, unweighted: .006 [.006]   .007 [.006]   .008 [.006]   -.001 [.005]   -.00002 [.00005]
Year 4, weighted:   .013 * [.006]   .009 [.006]   .015 + [.008]   .028 ** [.008]   .00016 * [.00007]
Year 4, unweighted: .008 [.008]   .01 [.008]   .013 [.009]   .007 [.008]   .00005 [.00008]
Year 5, weighted:   .017 ** [.006]   .017 ** [.005]   .02 ** [.007]   .019 ** [.007]   .00007 [.00008]
Year 5, unweighted: .011 [.007]   .012 [.007]   .014 [.01]   .006 [.004]   .00003 [.00005]
Year 6, weighted:   .024 ** [.007]   .019 ** [.007]   .023 ** [.008]   .032 *** [.009]   .00012 [.00008]
Year 6, unweighted: .01 [.01]   .011 [.01]   .016 [.012]   .009 [.009]   .00002 [.00007]
Year 7, weighted:   .027 ** [.009]   .022 ** [.008]   .027 ** [.01]   .032 + [.017]   .00008 [.00014]
Year 7, unweighted: .013 [.012]   .014 [.012]   .021 [.016]   .011 [.013]   .00004 [.00012]
r-squared: weighted 1 1 1 1 1; unweighted .914 .914 .914 .914 .914
Notes: This table shows point estimates and standard errors for the non-parametric heterogeneous differences-in-differences estimator. The model accounts for district fixed effects (θ_d), FLE-by-year fixed effects (δ_tq, or δ_t for continuous models), state-by-FLE linear time trends (ψ_sqt), a state-level factor (λ′_s F_t), and, in the case of continuous models, an FLE percentile-by-year fixed effect (δ_tQ). Point estimates are interpreted as the change in the dependent variable per one-unit change in poverty percentile rank within a state, relative to the change in percentile rank in states without a Court order. All standard errors are clustered at the state level. (Significance indicated: + < .10, * < .05, ** < .01, *** < .001)
Table XIV: Robustness: Special Education, Preferred Model
Columns: FLE 1, FLE 2, FLE 3, FLE 4, FLE Cont. Standard errors in brackets.
Year 1, weighted:   .024 [.015]   .02 [.014]   .022 [.016]   .033 [.025]   .0004 [.00027]
Year 1, unweighted: .029 + [.015]   .026 [.016]   .027 [.016]   .029 [.019]   .00038 [.00023]
Year 2, weighted:   .027 [.018]   .027 + [.016]   .025 [.018]   .038 [.027]   .00046 [.00029]
Year 2, unweighted: .031 + [.018]   .031 + [.017]   .031 + [.018]   .033 [.02]   .00044 + [.00025]
Year 3, weighted:   .022 [.018]   .015 [.018]   .016 [.021]   .025 [.03]   .00031 [.00033]
Year 3, unweighted: .022 [.019]   .019 [.019]   .019 [.021]   .02 [.022]   .00026 [.00029]
Year 4, weighted:   .027 * [.013]   .02 [.013]   .025 + [.015]   .029 [.023]   .00039 [.00023]
Year 4, unweighted: .023 * [.01]   .021 * [.01]   .023 + [.012]   .02 [.015]   .00027 [.00017]
Year 5, weighted:   .032 ** [.012]   .024 * [.012]   .026 [.016]   .03 [.024]   .00041 + [.00023]
Year 5, unweighted: .024 * [.01]   .021 * [.01]   .022 + [.012]   .022 [.015]   .00028 [.00017]
Year 6, weighted:   .038 ** [.014]   .032 * [.014]   .03 [.018]   .034 [.029]   .00049 + [.00028]
Year 6, unweighted: .029 * [.012]   .026 * [.013]   .026 + [.015]   .023 [.018]   .00033 [.00021]
Year 7, weighted:   .036 * [.015]   .03 * [.015]   .031 [.019]   .031 [.031]   .00046 [.0003]
Year 7, unweighted: .026 * [.012]   .025 + [.014]   .027 [.016]   .022 [.019]   .00032 [.00022]
r-squared: weighted 1 1 1 1 1; unweighted .601 .601 .601 .601 .6
Notes: This table shows point estimates and standard errors for the non-parametric heterogeneous differences-in-differences estimator. The model accounts for district fixed effects (θ_d), FLE-by-year fixed effects (δ_tq, or δ_t for continuous models), state-by-FLE linear time trends (ψ_sqt), a state-level factor (λ′_s F_t), and, in the case of continuous models, an FLE percentile-by-year fixed effect (δ_tQ). Point estimates are interpreted as the change in the dependent variable per one-unit change in poverty percentile rank within a state, relative to the change in percentile rank in states without a Court order. All standard errors are clustered at the state level. (Significance indicated: + < .10, * < .05, ** < .01, *** < .001)
Table XV: Robustness: Total Per Pupil Revenues, Preferred Model
Columns: FLE 1, FLE 2, FLE 3, FLE 4, FLE Cont. Standard errors in brackets.
Year 1, weighted:   452 * [242]   219 [382]   373 [265]   310 * [169]   0.187 [2.00267]
Year 1, unweighted: 211 * [94]   47 [70]   47 [101]   -30 [122]   6.278 * [3.00265]
Year 2, weighted:   631 * [282]   393 [359]   392 [282]   563 * [232]   3.007 [2.5773]
Year 2, unweighted: 355 ** [131]   258 * [119]   100 [148]   123 [177]   7.166 * [2.93854]
Year 3, weighted:   547 [371]   260 [519]   385 [353]   573 * [241]   1.754 [3.41087]
Year 3, unweighted: 354 * [144]   263 * [125]   159 [182]   338 [277]   9.934 ** [3.47845]
Year 4, weighted:   782 * [443]   584 [586]   760 * [402]   853 ** [263]   4.325 [4.26495]
Year 4, unweighted: 538 *** [141]   425 ** [136]   394 * [261]   503 [359]   15.046 *** [3.92201]
Year 5, weighted:   1005 * [485]   770 [582]   803 * [475]   893 * [364]   4.200 [5.00702]
Year 5, unweighted: 445 * [214]   381 * [231]   246 [354]   255 [404]   14.548 ** [5.23225]
Year 6, weighted:   1369 * [630]   1276 * [648]   1061 * [607]   1268 ** [486]   8.91194 + [4.75711]
Year 6, unweighted: 653 ** [201]   531 ** [208]   388 [305]   316 [322]   18.252 * [7.54563]
Year 7, weighted:   1428 * [722]   1309 * [758]   1167 * [614]   1567 *** [429]   11.43805 + [6.48416]
Year 7, unweighted: 688 ** [248]   576 * [297]   358 [409]   373 [474]   20.39 ** [7.19182]
r-squared: weighted 1 1 1 1 1; unweighted 1 1 1 1 0.864
Notes: This table shows point estimates and standard errors for the non-parametric heterogeneous differences-in-differences estimator. The model accounts for district fixed effects (θ_d), FLE-by-year fixed effects (δ_tq, or δ_t for continuous models), state-by-FLE linear time trends (ψ_sqt), a state-level factor (λ′_s F_t), and, in the case of continuous models, an FLE percentile-by-year fixed effect (δ_tQ). Point estimates are interpreted as the change in the dependent variable per one-unit change in poverty percentile rank within a state, relative to the change in percentile rank in states without a Court order. All standard errors are clustered at the state level. (Significance indicated: + < .10, * < .05, ** < .01, *** < .001)
XII. Technical Appendix: Standard Errors
In this section, we show how to obtain a variety of standard error structures using the estimates of the factor loadings and common factors obtained from the iterative procedure of Moon and Weidner (2015). This may be necessary when, for example, the unit of observation is at a lower level of aggregation than the unit at which serial correlation is expected. We thank Martin Weidner for suggesting this procedure and for providing details about it.
First, we note that the interactive fixed effects (IFE) procedure provides us with a vector of estimates for λ_i and F_t. We denote the estimated parameter vectors as λ̂_i and F̂_t.
The goal is to use these estimates to estimate the model via OLS in a statistical package that allows for different error structures, such as clustered standard errors. We want to estimate the following equation:

(4.2)   Y_it = α_i + θ_t + D′_it β + λ′_i F_t + ε_it,
where we have λ̂_i and F̂_t from the IFE procedure. If we perform a first-order Taylor expansion of λ′_i F_t around these estimates, then we can utilize them and obtain the same coefficients on our treatment indicators as in the IFE procedure.
In general, the first-order Taylor expansion for a function of two variables λ and F about the points λ̂ and F̂ is as follows:

(4.3)   f(λ, F) ≈ f(λ̂, F̂) + (∂f/∂λ)|_{λ=λ̂, F=F̂} (λ − λ̂) + (∂f/∂F)|_{λ=λ̂, F=F̂} (F − F̂)
In our case, we have f(λ, F) = λ′_i F_t. After applying the Taylor expansion, we obtain:

(4.4)   λ′_i F_t ≈ λ′_i F̂_t + λ̂′_i F_t − λ̂′_i F̂_t.
Next, we define α_i = λ_i − λ̂_i and relabel F_t as θ_t. Therefore, we obtain:

(4.5)   λ′_i F_t ≈ α′_i F̂_t + λ̂′_i θ_t.
The model we can estimate via OLS is

(4.6)   Y_it = α_i + θ_t + D′_it β + α′_i F̂_t + λ̂′_i θ_t + ε_it,

where the α_i and the θ_t are parameters to be estimated. We can easily apply various standard error corrections to this model in a wide variety of statistics packages.
Proposition XII.1. OLS estimation of Equation (4.6) is identical to OLS estimation of Equation (4.2).
Proof. For this proof, we show that the first-order conditions (FOCs) are identical. Thus, the OLS estimates will also be identical.

To begin, we first simplify notation. We let X_it = α_i + θ_t + D_it. There is no loss of generality in performing this step, as we can think of estimating the unit-specific fixed effects and the time-specific fixed effects using the least-squares dummy variable (LSDV) method.23
Part 1: Obtain FOCs for Equation (4.2)

Step 1: Specify the objective function:

minimize over β̂, λ̂_i, F̂_t:   Ψ = ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)²

Step 2: Obtain the partial derivative for each parameter and set it equal to zero:

∂Ψ/∂β̂ = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)(X_it) ] = 0

∂Ψ/∂λ̂_i = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)(F̂_t) ] = 0

∂Ψ/∂F̂_t = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)(λ̂_i) ] = 0
Part 2: Obtain FOCs for Equation (4.6)

Step 1: Specify the objective function:

minimize over β̂, λ̂_i, F̂_t:   Φ = ∑_i ∑_t (y_it − X′_it β̂ − α′_i F̂_t − λ̂′_i θ_t)²

23 In this paper, we actually remove the fixed effects using the within transformation.
Step 2: Obtain the partial derivative for each parameter and set it equal to zero:

∂Φ/∂β̂ = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − α′_i F̂_t − λ̂′_i θ_t)(X_it) ]
       = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)(X_it) ] = 0

∂Φ/∂λ̂_i = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − α′_i F̂_t − λ̂′_i θ_t)(F̂_t) ]
        = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)(F̂_t) ] = 0

∂Φ/∂F̂_t = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − α′_i F̂_t − λ̂′_i θ_t)(λ̂_i) ]
        = −2 [ ∑_i ∑_t (y_it − X′_it β̂ − λ̂′_i F̂_t)(λ̂_i) ] = 0
To simplify the FOCs, we use the fact that we defined α_i = λ_i − λ̂_i and relabeled F_t as θ_t. Given that we evaluate the minimization problem at β̂, λ̂_i, and F̂_t, α_i will be equal to zero, and we can replace λ̂′_i θ_t with λ̂′_i F̂_t. Thus, the FOCs for Equations (4.2) and (4.6) are identical.
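Once Equation (4.6) has been estimated by OLS, any sandwich-type correction can be applied to it. As one reference point, a cluster-robust covariance matrix (clustering by group, analogous to the state-level clustering used in the tables) can be computed by hand. This is a generic sketch on illustrative simulated data, without the finite-sample corrections (e.g., a G/(G−1) factor) that packages such as Stata apply by default:

```python
import numpy as np

# Illustrative data: G clusters of T observations each, k regressors.
rng = np.random.default_rng(1)
G, T, k = 15, 8, 3
n = G * T
X = np.column_stack((np.ones(n), rng.normal(size=(n, k - 1))))
y = X @ np.array([1.0, 0.5, -0.2]) + rng.normal(size=n)
cluster = np.repeat(np.arange(G), T)     # cluster id for each observation

b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
bread = np.linalg.pinv(X.T @ X)          # pinv tolerates rank-deficient designs like (4.6)

# "Meat": sum over clusters g of (X_g' e_g)(X_g' e_g)'.
meat = np.zeros((k, k))
for g in range(G):
    idx = cluster == g
    s = X[idx].T @ resid[idx]
    meat += np.outer(s, s)

V = bread @ meat @ bread                 # cluster-robust sandwich covariance of b
se = np.sqrt(np.diag(V))                 # clustered standard errors
```

The same computation applied to the regressors of Equation (4.6), with `cluster` set to state identifiers, yields the state-clustered standard errors reported in the tables (up to the small-sample adjustments noted above).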
Chapter 5
Bibliography
AHN, S. C. & HORENSTEIN, A. R. (2013). “Eigenvalue ratio test for the number of factors.” Econometrica, 81(3), 1203–1227.
ANDERSON, E. (2007). “Fair Opportunity in Education: A Democratic Equality Perspective.” Ethics, 117(4), 595–622.
ANDERSON, E. S. (1999). “What Is the Point of Equality?” Ethics, 109(2), 287–337.
ANDRICH, D. (1978). “Relationships between the Thurstone and Rasch approaches to item scaling.” Applied Psychological Measurement, 2(3), 451–462.
ANGRIST, J. D. (2004). “Treatment effect heterogeneity in theory and practice.” The Economic Journal, 114(494), C52–C83.
ANGRIST, J. D. & PISCHKE, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
ANGRIST, J. & IMBENS, G. (1995). “Identification and estimation of local average treatment effects.”
ARNESON, R. J. (1999). “Against Rawlsian equality of opportunity.” Philosophical Studies, 93(1), 77–112.
BAI, J. (2009). “Panel data models with interactive fixed effects.” Econometrica, 77(4), 1229–1279.
BAI, J. & NG, S. (2008). Large Dimensional Factor Analysis. Now Publishers Inc.
BANSBACK, N., BRAZIER, J., TSUCHIYA, A., & ANIS, A. (2012). “Using a discrete choice experiment to estimate health state utility values.” Journal of Health Economics, 31(1), 306–318.
BEATON, A. E. & ALLEN, N. L. (1992). “Interpreting scales through scale anchoring.” Journal of Educational and Behavioral Statistics, 17(2), 191–204.
BEATON, A. E. & ZWICK, R. (1992). “Overview of the national assessment of educational progress.” Journal of Educational and Behavioral Statistics, 17(2), 95–109.
DE BEKKER-GROB, E. W., RYAN, M., & GERARD, K. (2012). “Discrete choice experiments in health economics: a review of the literature.” Health Economics, 21(2), 145–172.
BERTRAND, M., DUFLO, E., & MULLAINATHAN, S. (2004). “How Much Should We Trust Differences-In-Differences Estimates?” The Quarterly Journal of Economics, 119(1), 249–275. DOI: 10.1162/003355304772839588.
BETTMAN, J. R., LUCE, M. F., & PAYNE, J. W. (1998). “Constructive consumer choice processes.” Journal of Consumer Research, 25(3), 187–217.
BOARDMAN, A. E. & MURNANE, R. J. (1979). “Using Panel Data to Improve Estimates of the Determinants of Educational Achievement.” Sociology of Education, 52(2), 113–121.
BOND, T. N. & LANG, K. (2013a). “The Black-White education-scaled test-score gap in grades k-7.” Technical report, National Bureau of Economic Research.
(2013b). “The evolution of the Black-White test score gap in Grades K–3: The fragility of results.” Review of Economics and Statistics, 95(5), 1468–1479.
BRIGHOUSE, H., LADD, H. F., LOEB, S., & SWIFT, A. (2015). “Educational goods and values: A framework for decision makers.” Theory and Research in Education, 1477878515620887.
BRIGHOUSE, H. & SWIFT, A. (2006). “Equality, Priority, and Positional Goods.” Ethics, 116(3), 471–497.
(2008). “Putting educational equality in its place.” Education, 3(4), 444–466.
(2009a). “Educational equality versus educational adequacy: A critique of Anderson and Satz.” Journal of Applied Philosophy, 26(2), 117–128.
(2009b). “Legitimate parental partiality.” Philosophy & Public Affairs, 37(1), 43–80.
(2014). “The place of educational equality in educational justice.” Education, Justice and the Human Good. Routledge, New York, 14–33.
BUHRMESTER, M., KWANG, T., & GOSLING, S. D. (2011). “Amazon’s Mechanical Turk: a new source of inexpensive, yet high-quality, data?” Perspectives on Psychological Science, 6(1), 3–5.
CANDELARIA, C. A. (2012). “Placeholder.” My Journal.
CARD, D. & PAYNE, A. A. (2002). “School Finance Reform, the Distribution of School Spending, and the Distribution of Student Test Scores.” Journal of Public Economics, 83(1), 49–82.
CHIZMAR, J. & ZAK, T. (1983). “Modeling multiple outputs in educational production functions.” The American Economic Review, 73(2), 18–22.
CHUDIK, A. & PESARAN, M. (2013). “Large panel data models with cross-sectional dependence: a survey.” CAFE Research Paper (13.15).
CLAYTON, M. (2001). “Rawls and natural aristocracy.” Croatian Journal of Philosophy, 1(3), 239–259.
COLEMAN, J. S., CAMPBELL, E. Q., HOBSON, C. J., MCPARTLAND, J., MOOD, A. M., WEINFELD, F. D., & YORK, R. (1966). “Equality of educational opportunity.” Washington, DC, 1066–5684.
CORCORAN, S. P. & EVANS, W. N. (2008). “Equity, Adequacy, and the Evolving State Role in Education Finance.” In H. F. Ladd & E. B. Fiske (Eds.) Handbook of Research in Education Finance and Policy. Routledge, 149–207.
CUNHA, F. & HECKMAN, J. J. (2008). “Formulating, identifying and estimating the technology of cognitive and noncognitive skill formation.” Journal of Human Resources, 43(4), 738–782.
CUNHA, F., HECKMAN, J. J., & SCHENNACH, S. M. (2010). “Estimating the technology of cognitive and noncognitive skill formation.” Econometrica, 78(3), 883–931.
DANIELS, N. (1983). “Health care needs and distributive justice.” In In Search of Equity. Springer, 1–41.
(1985). Just Health Care. Cambridge University Press.
DOLAN, P. & KAHNEMAN, D. (2008). “Interpretations of Utility and Their Implications for the Valuation of Health.” The Economic Journal, 118(525), 215–234.
DOMINGUE, B. (2014). “Evaluating the equal-interval hypothesis with test score scales.” Psychometrika, 79(1), 1–19.
DRUMMOND, M. F. (2005). Methods for the Economic Evaluation of Health Care Programmes. Oxford University Press.
DUFLO, E. (2001). “Schooling and Labor Market Consequences of School Construction in Indonesia: Evidence from an Unusual Policy Experiment.” The American Economic Review, 91(4), 795–813. DOI: http://dx.doi.org/10.1257/aer.91.4.795.
DUMOUCHEL, W. H. & DUNCAN, G. J. (1983). “Using sample survey weights in multiple regression analyses of stratified samples.” Journal of the American Statistical Association, 78(383), 535–543.
ELSTER, J. (1986). “Self-realization in work and politics: The Marxist conception of the good life.” Social Philosophy and Policy, 3(02), 97–126.
ELWERT, F. & WINSHIP, C. (2010). “Effect heterogeneity and bias in main-effects-only regression models.” Heuristics, Probability and Causality: A Tribute to Judea Pearl, 327–36.
FRIEDBERG, L. (1998). “Did Unilateral Divorce Raise Divorce Rates? Evidence from Panel Data.” American Economic Review, 88(3).
FRISCH, R. & WAUGH, F. V. (1933). “Partial time regressions as compared with individual trends.” Econometrica: Journal of the Econometric Society, 387–401.
FRITSCH, F. N. & CARLSON, R. E. (1980). “Monotone piecewise cubic interpolation.” SIAM Journal on Numerical Analysis, 17(2), 238–246.
GOMEZ, M. (2015a). FixedEffectModels: Julia package for linear and IV models with high dimensional categorical variables. sha: 368df32285d72db2220e3a7e02671ebdff54613e edition. https://github.com/matthieugomez/FixedEffectModels.jl.
(2015b). SparseFactorModels: Julia package for unbalanced factor models and interactive fixed effects models. sha: 368df32285d72db2220e3a7e02671ebdff54613e edition. https://github.com/matthieugomez/SparseFactorModels.jl.
GRANGER, C. W. J. (1988). “Some recent development in a concept of causality.” Journal of Econometrics, 39(1), 199–211.
HAERTEL, E. H. (1991). “Report on TRP Analyses of Issues Concerning Within-Age versus Cross-Age Scales for the National Assessment of Educational Progress.”
HAINMUELLER, J., HANGARTNER, D., & YAMAMOTO, T. (2015). “Validating vignette and conjoint survey experiments against real-world behavior.” Proceedings of the National Academy of Sciences, 112(8), 2395–2400.
HAINMUELLER, J. & HOPKINS, D. J. (2014). “The hidden American immigration consensus: A conjoint analysis of attitudes toward immigrants.” American Journal of Political Science.
HAINMUELLER, J., HOPKINS, D. J., & YAMAMOTO, T. (2014). “Causal inference in conjoint analysis: Understanding multidimensional choices via stated preference experiments.” Political Analysis, 22(1), 1–30.
HANUSHEK, E. A. (1979). “Conceptual and Empirical Issues in the Estimation of Educational Production Functions.” The Journal of Human Resources, 14(3), 351–388.
(1986). “The Economics of Schooling: Production and Efficiency in Public Schools.” Journal of Economic Literature, 24(3), 1141–1177.
(1996). “Measuring Investment in Education.” The Journal of Economic Perspectives, 10(4), 9–30.
(1997). “Assessing the Effects of School Resources on Student Performance: An Update.” Educational Evaluation and Policy Analysis, 19(2), 141–164.
HECKMAN, J. J. & LAFONTAINE, P. A. (2010). “The American high school graduation rate: Trends and levels.” The Review of Economics and Statistics, 92(2), 244–262.
HOXBY, C. M. (1996). “How Teachers’ Unions Affect Education Production.” The Quarterly Journal of Economics, 111(3), 671–718. DOI: http://dx.doi.org/10.2307/2946669.
(2001). “All School Finance Equalizations are Not Created Equal.” The Quarterly Journal of Economics, 116(4), 1189–1231. DOI: http://dx.doi.org/10.1162/003355301753265552.
ILIN, A. & RAIKO, T. (2010). “Practical approaches to principal component analysis in the presence of missing values.” The Journal of Machine Learning Research, 11, 1957–2000.
INZA, F. S. M., RYAN, M., & AMAYA-AMAYA, M. (2007). “‘Irrational’ stated preferences.” Using Discrete Choice Experiments to Value Health and Health Care, 11, 195.
JACKSON, C. K., JOHNSON, R. C., & PERSICO, C. (2015). “The Effects of School Spending on Educational and Economic Outcomes: Evidence from School Finance Reforms.” Technical report, National Bureau of Economic Research.
KIM, D. & OKA, T. (2014). “Divorce Law Reforms and Divorce Rates in the USA: An Interactive Fixed-Effects Approach.” Journal of Applied Econometrics, 29(2), 231–245.
KROUSE, R. & MCPHERSON, M. (1986). “A ‘Mixed’-Property Regime: Equality and Liberty in a Market Economy.” Ethics, 119–138.
KUZIEMKO, I., NORTON, M. I., & SAEZ, E. (2015). “How Elastic Are Preferences for Redistribution? Evidence from Randomized Survey Experiments.” American Economic Review, 105(4), 1478–1508.
LANCSAR, E. & LOUVIERE, J. (2006). “Deleting ‘irrational’ responses from discrete choice experiments: a case of investigating or imposing preferences?” Health Economics, 15(8), 797–811.
(2008). “Conducting discrete choice experiments to inform healthcare decision making.” Pharmacoeconomics, 26(8), 661–677.
LEE, J. Y. & SOLON, G. (2011). “The fragility of estimated effects of unilateral divorce laws on divorce rates.” The BE Journal of Economic Analysis & Policy, 11(1).
LEVIN, H. M. & BELFIELD, C. (2014). “Guiding the development and use of cost-effectiveness analysis in education.” Journal of Research on Educational Effectiveness.
LIPSCOMB, J., DRUMMOND, M., FRYBACK, D., GOLD, M., & REVICKI, D. (2009). “Retaining, and enhancing, the QALY.” Value in Health, 12(s1), S18–S26.
LISSITZ, R. W. & BOURQUE, M. L. (1995). “Reporting NAEP results using standards.” Educational Measurement: Issues and Practice, 14(2), 14–23.
LUCE, R. D. (2005). Individual Choice Behavior: A Theoretical Analysis. Courier Corporation.
LUNN, D. J., THOMAS, A., BEST, N., & SPIEGELHALTER, D. (2000). “WinBUGS—A Bayesian Modelling Framework: Concepts, Structure, and Extensibility.” Statistics and Computing, 10, 325–337.
MCFADDEN, D. ET AL. (1973). “Conditional logit analysis of qualitative choice behavior.”
MCFADDEN, D. (1980). “Econometric models for probabilistic choice among products.” Journal of Business, S13–S29.
(1986). “The choice theory approach to market research.” Marketing Science, 5(4), 275–297.
(2001). “Economic choices.” American Economic Review, 351–378.
MCFADDEN, D., TRAIN, K. ET AL. (2000). “Mixed MNL models for discrete response.” Journal of Applied Econometrics, 15(5), 447–470.
MEYER, B. D. (1995). “Natural and Quasi-Experiments in Economics.” Journal of Business & Economic Statistics, 13(2), 151–161.
MOON, H. R. & WEIDNER, M. (2015). “Linear regression for panel with unknown number of factors as interactive fixed effects.” Econometrica (forthcoming).
MULLIS, I. V. & JENKINS, L. B. (1988). The Science Report Card: Elements of Risk and Recovery. Trends and Achievement Based on the 1986 National Assessment. ERIC.
MULLIS, I. & JOHNSON, E. (1992). “The NAEP scale anchoring process for the 1992 mathematics assessment.” The NAEP, 893–907.
MURRAY, S. E., EVANS, W. N., & SCHWAB, R. M. (1998). “Education-finance reform and the distribution of education resources.” American Economic Review, 789–812.
NEWSON, R. (2006). “Confidence intervals for rank statistics: Somers’ D and extensions.” Stata Journal, 6(3), 309.
NIELSEN, E. ET AL. (2014). “The Income-Achievement Gap and Adult Outcome Inequality.” The Federal Reserve Board of Governors.
NIELSEN, E. R. (2015). “Achievement Gap Estimates and Deviations from Cardinal Comparability.” Available at SSRN 2597668.
NORD, E. (1999). Cost-Value Analysis in Health Care: Making Sense out of QALYs. Cambridge University Press.
NORD, E., DANIELS, N., & KAMLET, M. (2009). “QALYs: some challenges.” Value in Health, 12(s1), S10–S15.
ONATSKI, A. (2010). “Determining the number of factors from empirical distribution of eigenvalues.” The Review of Economics and Statistics, 92(4), 1004–1016.
OPPE, M., DEVLIN, N. J., & SZENDE, A. (2007). EQ-5D Value Sets: Inventory, Comparative Review and User Guide. Springer.
PERIE, M. (2008). “A Guide to Understanding and Developing Performance-Level Descriptors.” Educational Measurement: Issues and Practice, 27(4), 15–29.
PESARAN, M. H. (2006). “Estimation and inference in large heterogeneous panels with a multifactor error structure.” Econometrica, 74(4), 967–1012.
PESARAN, M. H. & PICK, A. (2007). “Econometric issues in the analysis of contagion.” Journal of Economic Dynamics and Control, 31(4), 1245–1277.
POGGE, T. W. (1995). “Three problems with contractarian-consequentialist ways of assessing social institutions.” Social Philosophy and Policy, 12(02), 241–266.
(2003). “The Incoherence between Rawls’s Theories of Justice.” Fordham L. Rev., 72, 1739.
RAGHAVARAO, D., WILEY, J. B., & CHITTURI, P. (2011). “Choice-Based Conjoint Analysis.” Models and Designs (1st ed.). Boca Raton: Taylor and Francis Group.
RAIKO, T., ILIN, A., & KARHUNEN, J. (2008). “Principal component analysis for sparse high-dimensional data.” In Neural Information Processing, 566–575. Springer.
RAWLS, J. (1974). “Reply to Alexander and Musgrave.” The Quarterly Journal of Economics, 633–655.
(2001). Justice as Fairness: A Restatement. Harvard University Press.
(2009). A Theory of Justice. Harvard University Press.
REARDON, S. F., VALENTINO, R. A., & SHORES, K. A. (2012). “Patterns of literacy among US students.” The Future of Children, 22(2), 17–37.
ROWEN, D., BRAZIER, J., & VAN HOUT, B. (2014). “A comparison of methods for converting DCE values onto the full health-dead QALY scale.” Medical Decision Making, 0272989X14559542.
SATZ, D. (2007). “Equality, Adequacy, and Education for Citizenship.” Ethics, 117(4), 623–648.
(2012). “Unequal chances: Race, class and schooling.” Theory and Research in Education, 10(2), 155–170.
(2014). “Unequal chances.” Education, Justice and the Human Good: Fairness and Equality in the Education System, 34.
SCOTT, L. A. & INGELS, S. J. (2007). “Interpreting 12th-Graders’ NAEP-Scaled Mathematics Performance Using High School Predictors and Postsecondary Outcomes from the National Education Longitudinal Study of 1988 (NELS:88). Statistical Analysis Report. NCES 2007-328.” National Center for Education Statistics.
SHIFFRIN, S. V. (2003). “Race, labor, and the fair equality of opportunity principle.” Fordham L. Rev., 72, 1643.
SIMS, D. P. (2011). “Lifting All Boats? Finance Litigation, Education Resources, and Student Needs in the Post-Rose Era.” Education Finance & Policy, 6(4), 455–485.
SOLON, G., HAIDER, S. J., & WOOLDRIDGE, J. M. (2015). “What are we weighting for?” Journal of Human Resources, 50(2), 301–316.
SPRINGER, M. G., LIU, K., & GUTHRIE, J. W. (2009). “The Impact of School Finance Litigation on Resource Distribution: A Comparison of Court-Mandated Equity and Adequacy Reforms.” Education Economics, 17(4), 421–444.
TAYLOR, R. S. (2004). “Self-realization and the priority of fair equality of opportunity.” Journal of Moral Philosophy, 1(3), 333–347.
THURSTONE, L. L. (1928). “Attitudes can be measured.” American Journal of Sociology, 529–554.
TORRANCE, G. W., FEENY, D. H., FURLONG, W. J., BARR, R. D., ZHANG, Y., & WANG, Q. (1996). “Multiattribute utility function for a comprehensive health status classification system: Health Utilities Index Mark 2.” Medical Care, 34(7), 702–722.
TRAIN, K. E. (2009). Discrete Choice Methods with Simulation. Cambridge University Press.
WEINSTEIN, M. C., TORRANCE, G., & MCGUIRE, A. (2009). “QALYs: the basics.” Value in Health, 12(s1), S5–S9.
WESTEN, P. (1985). “The concept of equal opportunity.” Ethics, 837–850.
WHITEHEAD, S. J. & ALI, S. (2010). “Health outcomes in economic evaluation: the QALY and utilities.” British Medical Bulletin, 96(1), 5–21.
WOLFERS, J. (2006). “Did Unilateral Divorce Laws Raise Divorce Rates? A Reconciliation and New Results.” American Economic Review, 96(5), 1802–1820.
WOOLDRIDGE, J. M. (2005). “Fixed-effects and related estimators for correlated random-coefficient and treatment-effect panel data models.” Review of Economics and Statistics, 87(2), 385–390.
(2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.
WRIGHT, S. & HOLT, J. N. (1985). “An inexact Levenberg-Marquardt method for large sparse nonlinear least squares.” The Journal of the Australian Mathematical Society, Series B: Applied Mathematics, 26(04), 387–403.